Table of Contents
6.3.1 Binary discretization with Binarizer [usually binarized around the mean]
6.3.2 Bucket discretization with pd.cut and quantile bucketing with pd.qcut [pandas' cut() and qcut()]
6.4 Encoding categorical features [feature engineering]
6.4.1 OneHotEncoder in scikit-learn [rarely used]
6.4.2 LabelEncoder in sklearn [takes one column at a time]
6.4.3 get_dummies in pandas [accepts a whole DataFrame]
6.4.4 Imputation of missing values
Official Chinese reference documentation:
https://sklearn.apachecn.org/docs/0.21.3/
https://sklearn.apachecn.org/docs/0.21.3/50.html
1. A Quick Review of Basic Concepts
Supervised vs. unsupervised learning
- The key difference is whether labeled data are available
- Industrial applications mainly use supervised learning
Classification vs. regression tasks
- Prefer a linear model whenever it suffices; nonlinear models overfit more easily and are far more expensive to compute
Model evaluation [five metrics for classification models]
- accuracy: rarely used; misleading when classes are imbalanced
- recall and precision: there is a trade-off between the two
- F1-score: balances recall and precision in a single number
- ROC curve
- AUC: the area under the ROC curve
Feature processing (feature engineering)
- The core determinant of how well a model performs
- Relies heavily on domain knowledge
- Requires familiarity with the relevant tools
2. An Overview of Sklearn's Design
Official documentation:
scikit-learn: machine learning in Python — scikit-learn 1.4.1 documentation
- Classification
- Regression
- Clustering
- Dimensionality reduction
- Model selection
- Preprocessing
3. The Machine Learning Workflow
- Acquire data
    - Web scraping
    - Databases
    - Data files (csv, excel, txt)
- Process data
    - Text processing
    - Unifying feature scales
    - Dimensionality reduction
- Build models
    - Classification
    - Regression
    - Clustering
- Evaluate models
    - Hyperparameter tuning
    - Deciding which model is better
4. The Basic Sklearn API Pattern
fit: learn the model's parameters from the training data
transform: convert data using the fitted transformer (labels are placed after the test set)
predict: return the model's predictions
predict_proba: return predicted class probabilities
score: model score (the default accuracy is rarely used; it is usually set to f1)
get_params: retrieve the estimator's parameters
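A minimal sketch of this pattern on hypothetical toy data (make_classification here is just an illustration, not part of these notes):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
X, y = make_classification(n_samples=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = LogisticRegression()
clf.fit(X_train, y_train)           # fit: learn parameters from the training data
pred = clf.predict(X_test)          # predict: hard class labels
proba = clf.predict_proba(X_test)   # predict_proba: class probabilities
print(f1_score(y_test, pred))       # scoring with f1 instead of the default accuracy
print(clf.get_params())             # get_params: inspect the hyperparameters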
5. Preparing Data
Dataset splits:
- Training data (70%)
- Validation data
- Testing data (30%)
In practice, the split is rarely completely random. Data that have already occurred (earlier in time) are used as the training set to predict future (later) data. Doing the opposite, using future data to predict the past, would leak prior information about events that had not yet happened; this is unsound and easily leads to overfitting. Other splitting schemes also appear in practice, for example splitting by region.
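A minimal sketch of such a time-based split, using a hypothetical DataFrame with a 'date' column (all names here are illustrative only):
import numpy as np
import pandas as pd
dates = pd.date_range('2023-01-01', periods=100, freq='D')
df_t = pd.DataFrame({'date': dates,
                     'feature': np.random.randn(100),
                     'label': np.random.randint(0, 2, 100)})
df_t = df_t.sort_values('date')                 # order by time
cut = int(len(df_t) * 0.7)                      # first 70% of the timeline
train, test = df_t.iloc[:cut], df_t.iloc[cut:]  # past -> train, future -> test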
6. Data Processing (Part 1)
Dataset: ML DATASETS
import numpy as np
import pandas as pd
df = pd.read_csv('forestfires.csv',header='infer',names=None)
df.info()
out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 X 517 non-null int64
1 Y 517 non-null int64
2 month 517 non-null object
3 day 517 non-null object
4 FFMC 517 non-null float64
5 DMC 517 non-null float64
6 DC 517 non-null float64
7 ISI 517 non-null float64
8 temp 517 non-null float64
9 RH 517 non-null int64
10 wind 517 non-null float64
11 rain 517 non-null float64
12 area 517 non-null float64
dtypes: float64(8), int64(3), object(2)
memory usage: 52.6+ KB
print(df.head())
out:
X Y month day FFMC DMC DC ISI temp RH wind rain area
0 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0.0
1 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 0.0
2 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 0.0
3 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0.0
4 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 0.0
6.1 Standardization
df1 = df.loc[:,'FFMC':'rain']
print(df1.head())
out:
FFMC DMC DC ISI temp RH wind rain
0 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0
1 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0
2 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0
3 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2
4 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(df1.astype(float),df['area'],test_size=0.3,random_state=42) # 70% train, 30% test
"""
X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.33, random_state=42)
"""
print(X_train.shape,X_test.shape,Y_train.shape,Y_test.shape)
out:
(361, 8) (156, 8) (361,) (156,)
6.1.1 Min-max scaling with MinMaxScaler
Scaling features to lie between a given minimum and maximum value, often between zero and one, or so that
the maximum absolute value of each feature is scaled to unit size.
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
"""
copy=True,
feature_range=(0, 1) # defaults to the [0, 1] interval
"""
mms.fit(X_train)
X_train_mms = mms.transform(X_train)
X_test_mms = mms.transform(X_test)
print(X_train_mms)
out:
[[0.95096774 0.36078567 0.8797936 ... 0.15294118 0.10588235 0. ]
[0.93548387 0.32115782 0.86372698 ... 0.37647059 0.47058824 0. ]
[0.92258065 0.37835975 0.62096869 ... 0.32941176 0.57647059 0. ]
...
[0.94709677 0.52205376 0.76263633 ... 0.48235294 0.25882353 0. ]
[0.93032258 0.28807719 0.43239123 ... 0.42352941 0.10588235 0. ]
[0.98193548 0.36940041 0.74961886 ... 0.29411765 0.36470588 0. ]]
print(X_train_mms.shape,X_test_mms.shape)
out:
(361, 8) (156, 8)
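A quick sanity check (a sketch continuing from the variables above): every training column should now span exactly [0, 1], and the transform is equivalent to the formula X_scaled = (X - X_min) / (X_max - X_min). Note that X_test_mms can fall outside [0, 1], because the min and max come from the training set only.
print(X_train_mms.min(axis=0))   # all zeros
print(X_train_mms.max(axis=0))   # all ones
manual = (X_train - X_train.min()) / (X_train.max() - X_train.min())
print(np.allclose(manual, X_train_mms))   # True: matches the formula above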
6.1.2 Z-score standardization with StandardScaler
The preprocessing module further provides a utility class StandardScaler that implements the Transformer API
to compute the mean and standard deviation on a training set so as to be able to later reapply the same
transformation on the testing set.
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
"""
copy=True,
with_mean=True, # whether to center the data
with_std=True # whether to scale to unit variance
"""
ss.fit(X_train.astype(float))
X_train_ss = ss.transform(X_train)
X_test_ss = ss.transform(X_test)
print(X_train_ss)
out:
[[ 0.31227419 -0.09559515 0.84776472 ... -1.01258179 -1.21784014
-0.07097384]
[ 0.10927722 -0.26994859 0.79301684 ... 0.13635096 0.52453381
-0.07097384]
[-0.05988692 -0.0182732 -0.0341957 ... -0.10552962 1.03038431
-0.07097384]
...
[ 0.26152495 0.61394751 0.44854476 ... 0.68058227 -0.4871672
-0.07097384]
[ 0.04161157 -0.4154958 -0.676784 ... 0.37823154 -1.21784014
-0.07097384]
[ 0.71826813 -0.05769223 0.40418698 ... -0.28694005 0.01868331
-0.07097384]]
print(X_train_ss.shape,X_test_ss.shape)
out:
(361, 8) (156, 8)
print(X_train_ss.mean(axis=0))
out:
[-7.50400050e-17 -2.82937724e-17 0.00000000e+00 3.93652485e-17
-3.21072808e-16 1.25476730e-16 1.37778370e-16 -1.47619682e-17]
print(X_train_ss.std(axis=0))
out:
[1. 1. 1. 1. 1. 1. 1. 1.]
6.2 Normalization (operates row-wise)
The Normalizer class exposes the usual transformer API methods such as fit and transform, but fit is effectively
a no-op for it: normalization transforms each sample independently, so there is no statistic to learn across the
whole training set. The methods exist only so that the object presents a consistent interface to sklearn APIs
such as Pipeline, which makes it convenient to use inside pipelines.
from sklearn.preprocessing import Normalizer
norm = Normalizer(norm="l2")
norm.fit(X_train)
X_train_norm = norm.transform(X_train) # l2 normalization essentially rescales each vector to unit length
X_test_norm = norm.transform(X_test)
print(X_train_norm)
out:
[[0.1196928 0.13705085 0.98202506 ... 0.03627055 0.00233168 0. ]
[0.12038336 0.12447534 0.9826028 ... 0.06203967 0.00646797 0. ]
[0.16160439 0.19869099 0.96281818 ... 0.07703979 0.01039141 0. ]
...
[0.13451037 0.22286951 0.96128907 ... 0.08178698 0.00452749 0. ]
[0.22668263 0.21145395 0.94018369 ... 0.12732174 0.00449371 0. ]
[0.14263086 0.16294221 0.97359102 ... 0.0601818 0.00601818 0. ]]
print(sum(np.square(X_train_norm[1])))
out:
0.9999999999999997
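As a sketch (continuing from the variables above), the same row-wise L2 normalization can be reproduced by hand, which makes the "unit vector" interpretation explicit:
X_arr = X_train.to_numpy()
row_norms = np.sqrt((X_arr ** 2).sum(axis=1, keepdims=True))   # L2 norm of each row
print(np.allclose(X_arr / row_norms, X_train_norm))            # True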
6.3 Binarization (discretization)
df.head()
X Y month day FFMC DMC DC ISI temp RH wind rain area
0 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0.0
1 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 0.0
2 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 0.0
3 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0.0
4 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 0.0
6.3.1 Binary discretization with Binarizer [usually binarized around the mean]
from sklearn.preprocessing import Binarizer
bi = Binarizer(threshold=548)
DC_bi = bi.fit_transform(df[['DC']]) # the Binarizer API expects 2D input, hence the extra [ ]
print(DC_bi)
out:
[[0.]
[1.]
[1.]
[0.]
[0.]
[0.]
[0.]
[1.]
...
]
df['DC_bi'] = DC_bi
print(df)
out:
X Y month day FFMC DMC ... temp RH wind rain area DC_bi
0 7 5 mar fri 86.2 26.2 ... 8.2 51 6.7 0.0 0.00 0.0
1 7 4 oct tue 90.6 35.4 ... 18.0 33 0.9 0.0 0.00 1.0
2 7 4 oct sat 90.6 43.7 ... 14.6 33 1.3 0.0 0.00 1.0
3 8 6 mar fri 91.7 33.3 ... 8.3 97 4.0 0.2 0.00 0.0
4 8 6 mar sun 89.3 51.3 ... 11.4 99 1.8 0.0 0.00 0.0
.. .. .. ... ... ... ... ... ... .. ... ... ... ...
512 4 3 aug sun 81.6 56.7 ... 27.8 32 2.7 0.0 6.44 1.0
513 2 4 aug sun 81.6 56.7 ... 21.9 71 5.8 0.0 54.29 1.0
514 7 4 aug sun 81.6 56.7 ... 21.2 70 6.7 0.0 11.16 1.0
515 1 4 aug sat 94.4 146.0 ... 25.6 42 4.0 0.0 0.00 1.0
516 6 3 nov tue 79.5 3.0 ... 11.8 31 4.5 0.0 0.00 0.0
[517 rows x 14 columns]
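As the section heading says, the threshold is usually taken from the mean; the hard-coded 548 above appears to be roughly the mean of DC in this dataset. A sketch deriving the threshold from the data itself:
bi_mean = Binarizer(threshold=df['DC'].mean())   # threshold = column mean (~548 here)
DC_bi_mean = bi_mean.fit_transform(df[['DC']])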
6.3.2 Bucket discretization with pd.cut and quantile bucketing with pd.qcut [pandas' cut() and qcut()]
- Bucketing with cut
1. Evenly spaced buckets with cut
--------------------------------------------------------------------------------------------
quartiles = pd.cut(df['DC'],5)
print(quartiles)
out:
0 (7.047, 178.44]
1 (519.52, 690.06]
2 (519.52, 690.06]
3 (7.047, 178.44]
4 (7.047, 178.44]
...
512 (519.52, 690.06]
513 (519.52, 690.06]
514 (519.52, 690.06]
515 (519.52, 690.06]
516 (7.047, 178.44]
Name: DC, Length: 517, dtype: category
Categories (5, interval[float64]): [(7.047, 178.44] < (178.44, 348.98] < (348.98, 519.52] <
(519.52, 690.06] < (690.06, 860.6]]
# A Categorical can be passed straight into groupby
print(df["DC_bi"].groupby(quartiles).count())
print(df.groupby(quartiles).count())
out:
DC
(7.047, 178.44] 88
(178.44, 348.98] 16
(348.98, 519.52] 47
(519.52, 690.06] 176
(690.06, 860.6] 190
Name: DC_bi, dtype: int64
X Y month day FFMC ... RH wind rain area DC_bi
DC ...
(7.047, 178.44] 88 88 88 88 88 ... 88 88 88 88 88
(178.44, 348.98] 16 16 16 16 16 ... 16 16 16 16 16
(348.98, 519.52] 47 47 47 47 47 ... 47 47 47 47 47
(519.52, 690.06] 176 176 176 176 176 ... 176 176 176 176 176
(690.06, 860.6] 190 190 190 190 190 ... 190 190 190 190 190
# Using groupby together with apply
def func(group):
return {'min':group.min(),'max':group.max(),'mean':group.mean()}
result = df["DC_bi"].groupby(quartiles).apply(func)
print(result.unstack())
out:
min max mean
DC
(7.047, 178.44] 0.0 0.0 0.000000
(178.44, 348.98] 0.0 0.0 0.000000
(348.98, 519.52] 0.0 0.0 0.000000
(519.52, 690.06] 0.0 1.0 0.971591
(690.06, 860.6] 1.0 1.0 1.000000
2. Bucketing with cut using explicit bin edges
--------------------------------------------------------------------------------------------
quartiles = pd.cut(df['DC'],[-1000,0,100,400,10000])
print(quartiles)
out:
0 (0, 100]
1 (400, 10000]
2 (400, 10000]
3 (0, 100]
4 (100, 400]
...
512 (400, 10000]
513 (400, 10000]
514 (400, 10000]
515 (400, 10000]
516 (100, 400]
Name: DC, Length: 517, dtype: category
Categories (4, interval[int64]): [(-1000, 0] < (0, 100] < (100, 400] < (400, 10000]]
- Quantile bucketing with qcut
1. Bucketing by quantiles with qcut
quartiles = pd.qcut(df['DC'],[0,0.2,0.5,0.8,1]) # split at the 0.2, 0.5, 0.8 quantiles into 4 buckets
print(quartiles)
out:
0 (7.899, 323.3]
1 (664.2, 728.6]
2 (664.2, 728.6]
3 (7.899, 323.3]
4 (7.899, 323.3]
...
512 (664.2, 728.6]
513 (664.2, 728.6]
514 (664.2, 728.6]
515 (323.3, 664.2]
516 (7.899, 323.3]
Name: DC, Length: 517, dtype: category
Categories (4, interval[float64]): [(7.899, 323.3] < (323.3, 664.2] < (664.2, 728.6] < (728.6, 860.6]]
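A small additional sketch: qcut also accepts an integer bucket count (equal-frequency buckets) and a labels= argument to name the buckets instead of showing intervals (the label names here are illustrative):
buckets = pd.qcut(df['DC'], 4, labels=['low', 'mid', 'high', 'very_high'])
print(buckets.value_counts())   # roughly 517 / 4 rows per bucket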
6.4 Encoding categorical features [feature engineering]
6.4.1 OneHotEncoder in scikit-learn [rarely used]
class sklearn.preprocessing.OneHotEncoder(n_values='auto', categorical_features='all', dtype=<type 'numpy.float64'>,
sparse=True, handle_unknown='error')
Tip: the result is a sparse matrix and must be converted to a dense one; the resulting one-hot matrix is the same as what get_dummies produces [not commonly used]
from sklearn.preprocessing import OneHotEncoder
ont = OneHotEncoder()
month_ont = ont.fit_transform(df[['month']]) # pass 2D input here
print(month_ont.toarray())
out:
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 1. 0.]
[0. 0. 0. ... 0. 1. 0.]
...
[0. 1. 0. ... 0. 0. 0.]
[0. 1. 0. ... 0. 0. 0.]
[0. 0. 0. ... 1. 0. 0.]]
6.4.2 LabelEncoder in sklearn [takes one column at a time]
from sklearn.preprocessing import LabelEncoder
ont = LabelEncoder()
month_ont = ont.fit_transform(df['month']) # pass 1D input here
print(month_ont)
out:
[ 7 10 10 7 7 1 1 1 11 11 11 11 1 11 11 11 7 10 7 0 11 11 6 1
1 1 11 11 11 11 11 11 11 11 11 11 10 10 10 7 5 1 1 11 11 11 11 5
7 7 11 1 1 1 1 11 11 10 3 3 7 7 1 1 1 1 11 11 11 7 7 11
7 1 11 3 3 7 1 1 1 1 1 1 1 11 11 11 11 7 1 7 1 1 1 11
3 7 1 1 1 1 1 11 4 7 7 1 11 11 7 7 11 11 7 7 7 7 7 1
1 1 11 11 11 10 7 11 10 10 3 7 7 11 7 1 11 11 5 11 11 1 1 5
1 1 7 11 1 11 6 5 5 11 11 1 11 1 1 11 7 1 7 11 11 7 1 1
7 1 11 1 1 11 1 1 0 1 11 1 11 10 3 10 1 11 7 11 7 7 7 1
1 11 1 1 0 11 11 11 11 7 3 10 7 11 1 11 11 11 10 1 11 7 7 7
11 11 11 7 1 11 7 5 11 11 10 1 11 1 11 11 11 11 11 1 11 11 11 0
0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 5 5 5
5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 8 11 11 11 11 11 11 11
11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
11 11 11 11 11 11 11 11 11 11 11 5 1 1 11 11 1 1 7 4 5 1 1 1
1 1 11 7 1 1 3 11 11 7 3 3 11 1 1 6 6 11 1 1 11 1 11 3
11 5 3 3 5 1 1 1 5 7 1 1 1 1 5 11 1 1 1 1 1 1 11 1
1 1 1 5 1 1 1 11 11 1 0 5 11 1 1 7 11 1 1 1 1 1 1 5
1 1 1 1 1 1 11 3 3 3 7 7 7 0 0 8 6 6 6 6 5 5 5 5
5 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 9]
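A sketch of decoding the integer codes back to the original labels: classes_ holds the mapping (alphabetical order), and inverse_transform recovers the original strings. The outputs below follow from the data shown above:
print(ont.classes_)
# ['apr' 'aug' 'dec' 'feb' 'jan' 'jul' 'jun' 'mar' 'may' 'nov' 'oct' 'sep']
print(ont.inverse_transform(month_ont[:5]))
# ['mar' 'oct' 'oct' 'mar' 'mar']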
6.4.3 get_dummies in pandas [accepts a whole DataFrame]
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None,
sparse=False, drop_first=False)
"""
data 数据
columns 选择df的那些列用于onehot
"""
ont_hots = pd.get_dummies(data=df,columns=['month','day']) # 原来指定的列将会消失,转化为ont-hot结构,不过pandas会智能选择较小的int用于保存数据,所以可以做一个astype转化
print(ont_hots.info())
out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 X 517 non-null int64
1 Y 517 non-null int64
2 FFMC 517 non-null float64
3 DMC 517 non-null float64
4 DC 517 non-null float64
5 ISI 517 non-null float64
6 temp 517 non-null float64
7 RH 517 non-null int64
8 wind 517 non-null float64
9 rain 517 non-null float64
10 area 517 non-null float64
11 DC_bi 517 non-null float64
12 month_apr 517 non-null uint8
13 month_aug 517 non-null uint8
14 month_dec 517 non-null uint8
15 month_feb 517 non-null uint8
16 month_jan 517 non-null uint8
17 month_jul 517 non-null uint8
18 month_jun 517 non-null uint8
19 month_mar 517 non-null uint8
20 month_may 517 non-null uint8
21 month_nov 517 non-null uint8
22 month_oct 517 non-null uint8
23 month_sep 517 non-null uint8
24 day_fri 517 non-null uint8
25 day_mon 517 non-null uint8
26 day_sat 517 non-null uint8
27 day_sun 517 non-null uint8
28 day_thu 517 non-null uint8
29 day_tue 517 non-null uint8
30 day_wed 517 non-null uint8
dtypes: float64(9), int64(3), uint8(19)
memory usage: 58.2 KB
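As the comment above suggests, the uint8 dummy columns can be cast to a larger integer type if needed; a sketch continuing from ont_hots:
dummy_cols = [c for c in ont_hots.columns if c.startswith(('month_', 'day_'))]
ont_hots[dummy_cols] = ont_hots[dummy_cols].astype('int64')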
6.4.4 Imputation of missing values
1. Using sklearn's SimpleImputer
------------------------------------------------------------------------------------------------
class sklearn.impute.SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0,
copy=True, add_indicator=False)
"""
If “mean”, then replace missing values using the mean along the axis.
If “median”, then replace missing values using the median along the axis.
If “most_frequent”, then replace missing using the most frequent value along the axis.
If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
"""
df # the original data
out:
X Y month day FFMC DMC ... temp RH wind rain area DC_bi
0 7 5 mar fri 86.2 26.2 ... 8.2 51 6.7 0.0 0.00 0.0
1 7 4 oct tue 90.6 35.4 ... 18.0 33 0.9 0.0 0.00 1.0
2 7 4 oct sat 90.6 43.7 ... 14.6 33 1.3 0.0 0.00 1.0
3 8 6 mar fri 91.7 33.3 ... 8.3 97 4.0 0.2 0.00 0.0
4 8 6 mar sun 89.3 51.3 ... 11.4 99 1.8 0.0 0.00 0.0
.. .. .. ... ... ... ... ... ... .. ... ... ... ...
512 4 3 aug sun 81.6 56.7 ... 27.8 32 2.7 0.0 6.44 1.0
513 2 4 aug sun 81.6 56.7 ... 21.9 71 5.8 0.0 54.29 1.0
514 7 4 aug sun 81.6 56.7 ... 21.2 70 6.7 0.0 11.16 1.0
515 1 4 aug sat 94.4 146.0 ... 25.6 42 4.0 0.0 0.00 1.0
516 6 3 nov tue 79.5 3.0 ... 11.8 31 4.5 0.0 0.00 0.0
[517 rows x 14 columns]
df['DC_na'] = np.nan
df.loc[df['DC']>600,'DC_na'] = df["DC"] # boolean indexing requires .loc here; see the pandas notes for details
print(df)
out:
X Y month day FFMC DMC ... RH wind rain area DC_bi DC_na
0 7 5 mar fri 86.2 26.2 ... 51 6.7 0.0 0.00 0.0 NaN
1 7 4 oct tue 90.6 35.4 ... 33 0.9 0.0 0.00 1.0 669.1
2 7 4 oct sat 90.6 43.7 ... 33 1.3 0.0 0.00 1.0 686.9
3 8 6 mar fri 91.7 33.3 ... 97 4.0 0.2 0.00 0.0 NaN
4 8 6 mar sun 89.3 51.3 ... 99 1.8 0.0 0.00 0.0 NaN
.. .. .. ... ... ... ... ... .. ... ... ... ... ...
512 4 3 aug sun 81.6 56.7 ... 32 2.7 0.0 6.44 1.0 665.6
513 2 4 aug sun 81.6 56.7 ... 71 5.8 0.0 54.29 1.0 665.6
514 7 4 aug sun 81.6 56.7 ... 70 6.7 0.0 11.16 1.0 665.6
515 1 4 aug sat 94.4 146.0 ... 42 4.0 0.0 0.00 1.0 614.7
516 6 3 nov tue 79.5 3.0 ... 31 4.5 0.0 0.00 0.0 NaN
[517 rows x 15 columns]
1. Preprocessing with the SimpleImputer API
from sklearn.impute import SimpleImputer
SI = SimpleImputer(strategy='mean')
"""
missing_values=np.nan, # usually left at the default; specifies which values count as missing
strategy="mean", # one of "mean", "median", "most_frequent", "constant"
fill_value=None, # only used when strategy="constant"
"""
DC_na_si = SI.fit_transform(df[["DC_na"]]) # expects 2D input, hence the extra [ ]
print(DC_na_si)
out:
[[703.07754491]
[669.1 ]
[686.9 ]
[703.07754491]
[703.07754491]
[703.07754491]
[703.07754491]
[608.2 ]
[692.6 ]
[698.6 ]
[698.6 ]
[713. ]
[665.3 ]
... ]
2. You can also use the df.fillna() and df.replace() APIs; see the Pandas notes for details, and the short sketch below
[Some practical tips]
- Try not to drop samples just because a few feature values are missing; in practice it is better to impute a reasonable estimate based on domain knowledge, so the samples are not wasted
- If no sensible estimate is available, fill with an obviously meaningless sentinel value such as -999 or -1
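A short sketch of the pandas equivalents mentioned in point 2 above, continuing from df:
df['DC_na'].fillna(df['DC_na'].mean())   # same effect as SimpleImputer strategy='mean'
df['DC_na'].fillna(-999)                 # sentinel fill when no good estimate exists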