Missing value imputation
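- If missing values make up only a very small share of the data, fill them directly with the mean or the mode.
- If the share is moderate, look at how the feature relates to the other features. If the relationship is clear, fill in the value from those features directly; alternatively, fit a simple model such as linear regression or a random forest to predict it.
- If the share is large, treat "missing" as a special case of its own and fill in a separate, dedicated value.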
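The three rules above can be sketched as a single helper. This is only an illustration under stated assumptions: the cut-offs `LOW_RATIO` and `HIGH_RATIO`, the `SENTINEL` value, and the function name `fill_column` are hypothetical (the notes give no exact numbers), and the moderate case assumes the other columns contain no missing values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical cut-offs and fill value; tune them for your own data.
LOW_RATIO = 0.05
HIGH_RATIO = 0.5
SENTINEL = -1.0


def fill_column(X, col):
    """Return a float copy of X with the NaNs in column `col` filled."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X[:, col])
    ratio = missing.mean()
    if ratio == 0:
        return X
    if ratio <= LOW_RATIO:
        # Very few gaps: the column mean is good enough.
        X[missing, col] = np.nanmean(X[:, col])
    elif ratio < HIGH_RATIO:
        # Moderate gaps: predict the column from the other features.
        # Assumes the other columns are complete.
        others = np.delete(X, col, axis=1)
        model = LinearRegression().fit(others[~missing], X[~missing, col])
        X[missing, col] = model.predict(others[missing])
    else:
        # Many gaps: mark "missing" with its own dedicated value.
        X[missing, col] = SENTINEL
    return X


# One of four values is missing (25%), so the regression branch is used.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0]])
filled = fill_column(X, col=1)
```

A random forest could replace `LinearRegression` in the moderate branch when the relationship between features is not linear.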
SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True)
1. missing_values: the placeholder that marks a missing entry; a number, a string, or nan
2. strategy: mean, median, most_frequent, or constant
3. fill_value: the value to fill in when strategy is constant
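As a small sketch of how these parameters combine, the snippet below tries a numeric placeholder for `missing_values` and the `most_frequent` and `constant` strategies. The arrays and variable names are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# missing_values can be a number: here -1.0 marks a missing entry.
X_placeholder = np.array([[1.0, -1.0], [3.0, 4.0], [-1.0, 6.0]])
imp_num = SimpleImputer(missing_values=-1.0, strategy="mean")
X_num = imp_num.fit_transform(X_placeholder)

X_nan = np.array([[1.0, 2.0], [np.nan, 3.0], [1.0, np.nan], [7.0, 3.0]])

# most_frequent fills each column with its mode.
imp_freq = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
X_freq = imp_freq.fit_transform(X_nan)

# constant fills every gap with fill_value.
imp_const = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0.0)
X_const = imp_const.fit_transform(X_nan)

print(X_num)
print(X_freq)
print(X_const)
```

Note that `mean` and `median` only work on numeric data, while `most_frequent` and `constant` also accept strings.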
>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
>>> X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
>>> print(imp_mean.transform(X))
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
import numpy as np
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='median')
imp_mean.fit([[7,