I.impute
1.Introduction:
This module provides transformers for handling missing values
2.Usage:
Univariate imputer:class sklearn.impute.SimpleImputer([missing_values=nan,strategy='mean',fill_value=None,verbose=0,copy=True,add_indicator=False])
#Parameters:
missing_values:placeholder for missing values (which value is treated as missing);num/str/np.nan/None
strategy:imputation strategy;"mean"/"median"/"most_frequent"/"constant"
fill_value:value used to replace missing values;str/num/None(None means 0 for numeric data,"missing_value" for str/object data)
#Only used when strategy="constant"
verbose:verbosity of the log output;int
#Deprecated in scikit-learn 1.1 and removed in later releases
copy:whether to create a copy of the data;bool(if False,imputation is done in place whenever possible)
add_indicator:If True,a MissingIndicator transform is stacked onto the output of the imputer's transform.This allows a predictive estimator to account for missingness despite imputation.If a feature has no missing values at fit/train time,it will not appear in the missing indicator even if it has missing values at transform/test time
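A minimal sketch of the parameters above, using a small made-up array (the data and variable names are illustrative assumptions, not part of the original notes). It shows the default "mean" strategy, the "constant" strategy with fill_value, and the extra columns produced by add_indicator=True.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (illustrative): one missing value in each column.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# strategy="mean": each NaN is replaced by the mean of its column,
# computed from the non-missing entries seen during fit.
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_mean = mean_imputer.fit_transform(X)

# strategy="constant": fill_value is used only with this strategy.
const_imputer = SimpleImputer(strategy="constant", fill_value=-1.0)
X_const = const_imputer.fit_transform(X)

# add_indicator=True: one boolean-like indicator column is appended
# for every feature that had missing values at fit time.
ind_imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_ind = ind_imputer.fit_transform(X)

print(X_mean)
print(X_const)
print(X_ind.shape)
```

The verbose parameter is left out on purpose so that the sketch also runs on recent scikit-learn releases.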
######################################################################################################################
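Multivariate imputer:class sklearn.impute.IterativeImputer([estimator=None,missing_values=nan,sample_posterior=False,max_iter=10,tol=0.001,n_nearest_features=None,initial_strategy='mean',imputation_order='ascending',skip_complete=False,min_value=-inf,max_value=inf,verbose=0,random_state=None,add_indicator=False])
#This estimator is still experimental;it must be enabled explicitly with "from sklearn.experimental import enable_iterative_imputer" before it can be imported from sklearn.impute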
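A minimal sketch of how the experimental imputer is typically enabled and used. The toy array is an illustrative assumption; no expected values are shown because the imputed numbers depend on the estimator and the scikit-learn version.

```python
import numpy as np
# Importing this module is what makes IterativeImputer importable;
# the name itself is not used afterwards.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data (illustrative): the second column is roughly twice the first.
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Each feature with missing values is modelled as a function of the
# other features; the default estimator is used here (estimator=None).
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)

print(X_filled)
```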
######################################################################################################################
Binary indicator for missing values:class sklearn.impute.MissingIndicator([missing_values=nan,features='missing-only',sparse='auto',error_on_new=True])
#Parameters:the remaining parameters are the same as in class sklearn.impute.SimpleImputer()
features:which features the indicator mask should cover;"all"(all features)/"missing-only"(only features that contain missing values during fit)
sparse:format of the indicator mask;"auto"(same format as the input)/True(sparse matrix)/False(np.array)
error_on_new:If True,transform raises an error when a feature has missing values in transform but had none in fit
#Only used when features="missing-only"
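A minimal sketch of the two features modes and of error_on_new, using made-up arrays (X_train and X_new are illustrative names, not from the original notes).

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Toy training data (illustrative): columns 0 and 2 contain NaN, column 1 does not.
X_train = np.array([[np.nan, 1.0, 3.0],
                    [4.0, 0.0, np.nan],
                    [8.0, 1.0, 0.0]])

# features="missing-only": the mask only covers columns that had
# missing values during fit; their indices are stored in features_.
indicator = MissingIndicator(features="missing-only")
mask = indicator.fit_transform(X_train)

# features="all": the mask covers every column.
mask_all = MissingIndicator(features="all").fit_transform(X_train)

# error_on_new=True (the default): a column that was complete during fit
# but has missing values at transform time triggers a ValueError.
X_new = np.array([[1.0, np.nan, 2.0]])
try:
    indicator.transform(X_new)
    raised = False
except ValueError:
    raised = True

print(mask)
print(indicator.features_)
print(mask_all.shape, raised)
```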
######################################################################################################################
KNN-based imputer:class sklearn.impute.KNNImputer([missing_values=nan,n_neighbors=5,weights='uniform',metric='nan_euclidean',copy=True,add_indicator=False])
#Parameters:the remaining parameters are the same as in class sklearn.impute.SimpleImputer()
n_neighbors:number of neighboring samples used for imputation;int
weights:weighting of the neighboring samples used for imputation;"uniform"/"distance"/callable
metric:distance metric used to search for neighbors;"nan_euclidean"/callable
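A minimal sketch of KNN-based imputation on a small made-up array (the data is an illustrative assumption). With weights="uniform", each missing entry is replaced by the plain average of that feature over the n_neighbors closest samples that have a value for it.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (illustrative): two missing entries.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Distances are computed with "nan_euclidean", which ignores
# coordinates that are missing in either sample.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X)

print(X_filled)
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]
```

Passing weights="distance" instead would give closer neighbors more influence on the imputed value.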
II.preprocessing
1.Introduction:
This module is used for data preprocessing,mainly scaling,centering,binarization,discretization,and similar operations
2.Binarization
(1)Classes:
Binarizer:class sklearn<