# scikit-learn study notes: Imputing missing values before building an estimator

http://scikit-learn.org/stable/auto_examples/missing_values.html#example-missing-values-py

------------------------------------------------------------------------------------------------------------------------------

This example shows that imputing the missing values can give better results than discarding the samples containing any missing value.

Imputing does not always improve the predictions, so please check via cross-validation. Sometimes dropping rows or using marker values is more effective.

Missing values can be replaced by the mean, the median or the most frequent value using the strategy hyper-parameter.

The median is a more robust estimator for data with high magnitude variables which could dominate results (otherwise known as a ‘long tail’).
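To make the robustness point concrete, here is a minimal NumPy sketch. The column values are made up for illustration; it only compares the two fill values and does not call scikit-learn.

```python
import numpy as np

# Hypothetical feature column with a long tail: one very large value.
column = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

mean_fill = column.mean()        # pulled upward by the single large value
median_fill = np.median(column)  # depends only on the ordering of the values

print("mean fill:", mean_fill)
print("median fill:", median_fill)
```

With a mean strategy every missing entry in such a column would receive a value larger than four of the five observed ones, which is why the median is often the safer choice here.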

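import numpy as np

# Assumption: the dataset loading step is missing from this copy; the upstream example uses the Boston housing data
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor  # random forest regressor
from sklearn.pipeline import Pipeline  # chains transformers and a final estimator
from sklearn.preprocessing import Imputer  # transformer that fills in missing values
from sklearn.cross_validation import cross_val_score  # cross-validation scoring helper

rng = np.random.RandomState(0)  # seeded random number generator

dataset = load_boston()
X_full, y_full = dataset.data, dataset.target  # X holds the features, y the target values
n_samples = X_full.shape[0]  # number of samples
n_features = X_full.shape[1]  # number of features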

# Estimate the score on the entire dataset, with no missing values
estimator = RandomForestRegressor(random_state=0, n_estimators=100)  # the estimator: a random forest regressor
score = cross_val_score(estimator, X_full, y_full).mean()  # cross-validated baseline score on the untouched data
print("Score with the entire dataset = %.2f" % score)

# Add missing values in 75% of the rows
missing_rate = 0.75  # fraction of samples that get a missing value
n_missing_samples = int(np.floor(n_samples * missing_rate))  # number of affected samples (must be an int)
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples, dtype=bool),
                             np.ones(n_missing_samples, dtype=bool)))  # boolean mask, True = sample has a missing value
rng.shuffle(missing_samples)
missing_features = rng.randint(0, n_features, n_missing_samples)  # one random feature index per affected sample

# Estimate the score without the lines containing missing values
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean()
print("Score without the samples containing missing values = %.2f" % score)

# Estimate the score after imputation of the missing values
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
estimator = Pipeline([("imputer", Imputer(missing_values=0,
                                          strategy="mean",
                                          axis=0)),
                      ("forest", RandomForestRegressor(random_state=0,
                                                       n_estimators=100))])
score = cross_val_score(estimator, X_missing, y_missing).mean()
print("Score after imputation of the missing values = %.2f" % score)
----------------------------------------------------------------------------------------------------------------------------------
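The listing above targets an old scikit-learn release. In newer releases `Imputer` and `sklearn.cross_validation` are no longer available; their replacements are `sklearn.impute.SimpleImputer` and `sklearn.model_selection`. The sketch below shows the same imputation pipeline with those names. It runs on synthetic data from `make_regression` (an assumption made so that it is self-contained, not the dataset used above) and marks missing entries with NaN instead of 0, so genuine zeros are not treated as missing.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)

# Synthetic stand-in data (an assumption; not the dataset used above).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Knock out one random feature in 75% of the rows, marking it with NaN.
n_missing = int(0.75 * X.shape[0])
rows = rng.permutation(X.shape[0])[:n_missing]
cols = rng.randint(0, X.shape[1], n_missing)
X_missing = X.copy()
X_missing[rows, cols] = np.nan

# Impute column means, then fit the forest, all inside cross-validation.
estimator = Pipeline([
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("forest", RandomForestRegressor(n_estimators=20, random_state=0)),
])
scores = cross_val_score(estimator, X_missing, y, cv=3)
print("Score after imputation of the missing values = %.2f" % scores.mean())
```

Because the imputer sits inside the pipeline, its column means are recomputed on the training part of each fold, so no information leaks from the held-out part.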

RandomForestRegressor(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor

A random forest is a meta estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (default).
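The "same size, drawn with replacement" idea can be sketched with NumPy alone. This only illustrates bootstrap sampling under an assumed sample count; it is not the library's internal code.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 1000  # assumed dataset size for the illustration

# Draw n_samples row indices with replacement: the sub-sample has the same
# size as the input, but some rows repeat and others are left out.
bootstrap_idx = rng.randint(0, n_samples, n_samples)
unique_fraction = np.unique(bootstrap_idx).size / n_samples

print("sub-sample size:", bootstrap_idx.size)
print("fraction of distinct rows:", unique_fraction)
```

In expectation about 63% of the rows (1 - 1/e) appear at least once in each sub-sample, so every tree sees a slightly different view of the data.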

n_estimators : integer, optional (default=10)

The number of trees in the forest.

criterion : string, optional (default="mse")

The function to measure the quality of a split. The only supported criterion is “mse” for the mean squared error. Note: this parameter is tree-specific.

max_features : int, float, string or None, optional (default="auto")

The number of features to consider when looking for the best split:

• If int, then consider max_features features at each split.
• If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
• If “auto”, then max_features=n_features.
• If “sqrt”, then max_features=sqrt(n_features).
• If “log2”, then max_features=log2(n_features).
• If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features. Note: this parameter is tree-specific.
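The rules in the list above can be written down as a small function. `resolve_max_features` is a hypothetical helper that follows the documented rules literally; it is not part of scikit-learn, and the feature count of 13 is just an example value.

```python
import math


def resolve_max_features(max_features, n_features):
    """Hypothetical helper: number of features examined per split."""
    if max_features is None or max_features == "auto":
        return n_features
    if max_features == "sqrt":
        return int(math.sqrt(n_features))
    if max_features == "log2":
        return int(math.log2(n_features))
    if isinstance(max_features, int):
        return max_features
    # Otherwise treat the value as a float fraction of the features.
    return int(max_features * n_features)


for setting in (4, 0.5, "auto", "sqrt", "log2", None):
    print(setting, "->", resolve_max_features(setting, 13))
```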

max_depth : integer or None, optional (default=None)

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Ignored if max_leaf_nodes is not None. Note: this parameter is tree-specific.

min_samples_split : integer, optional (default=2)

The minimum number of samples required to split an internal node. Note: this parameter is tree-specific.

min_samples_leaf : integer, optional (default=1)

The minimum number of samples in newly created leaves. A split is discarded if, after the split, one of the leaves would contain fewer than min_samples_leaf samples. Note: this parameter is tree-specific.
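How min_samples_split and min_samples_leaf interact can be sketched as a single check. `split_allowed` is a hypothetical function written for this note; it mirrors the two descriptions above and is not the library's implementation.

```python
def split_allowed(n_node, n_left, min_samples_split=2, min_samples_leaf=1):
    """Hypothetical check: may a node holding n_node samples be split so
    that n_left samples go to the left child and the rest to the right?"""
    n_right = n_node - n_left
    if n_node < min_samples_split:
        return False  # the node is too small to be split at all
    # Both children must keep at least min_samples_leaf samples.
    return n_left >= min_samples_leaf and n_right >= min_samples_leaf


print(split_allowed(10, 3))
print(split_allowed(10, 1, min_samples_leaf=2))
print(split_allowed(1, 1))
```

The first call passes both checks, the second is rejected because the left leaf would hold a single sample, and the third is rejected because the node itself is below min_samples_split.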

min_weight_fraction_leaf : float, optional (default=0.)

The minimum weighted fraction of the input samples required to be at a leaf node. Note: this parameter is tree-specific.

max_leaf_nodes : int or None, optional (default=None)

Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, then the number of leaf nodes is unlimited. If not None, then max_depth will be ignored. Note: this parameter is tree-specific.

bootstrap : boolean, optional (default=True)

Whether bootstrap samples are used when building trees.

oob_score : bool, optional (default=False)

Whether to use out-of-bag samples to estimate the generalization error.
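"Out-of-bag" samples are the rows that a given tree's bootstrap draw happened to leave out, so that tree can be evaluated on them as if they were held-out data. A minimal NumPy sketch of finding them for one tree, with an assumed sample count of 10:

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 10  # assumed dataset size for the illustration

bootstrap_idx = rng.randint(0, n_samples, n_samples)  # rows one tree trains on
in_bag = np.zeros(n_samples, dtype=bool)
in_bag[bootstrap_idx] = True
oob_idx = np.flatnonzero(~in_bag)  # rows this tree never saw

print("in-bag rows:", np.unique(bootstrap_idx))
print("out-of-bag rows:", oob_idx)
```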

n_jobs : integer, optional (default=1)

The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
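The difference between passing an int and passing a RandomState instance can be seen with NumPy directly. This is a small illustration of seeding behaviour, not of how the forest consumes the generator.

```python
import numpy as np

# Two generators created from the same integer seed yield the same stream.
a = np.random.RandomState(42).rand(3)
b = np.random.RandomState(42).rand(3)

# A RandomState instance is stateful: successive draws from it differ.
rng = np.random.RandomState(42)
first, second = rng.rand(3), rng.rand(3)

print("same seed, same numbers:", np.array_equal(a, b))
print("same instance, second draw equal:", np.array_equal(first, second))
```

This is why the example above passes random_state=0 to every estimator: each run then builds the same forest and the scores are comparable.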

verbose : int, optional (default=0)

Controls the verbosity of the tree building process.

warm_start : bool, optional (default=False)

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new forest.
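A short sketch of warm_start in use, assuming a scikit-learn installation and synthetic data from `make_regression` (the data and the tree counts are illustrative choices):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=4, random_state=0)

forest = RandomForestRegressor(n_estimators=10, warm_start=True, random_state=0)
forest.fit(X, y)
n_before = len(forest.estimators_)

# Raise n_estimators and refit: the existing trees are kept and only the
# additional ones are trained.
forest.set_params(n_estimators=15)
forest.fit(X, y)
n_after = len(forest.estimators_)

print(n_before, n_after)
```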

Methods

• apply(X): Apply trees in the forest to X, return leaf indices.
• fit(X, y[, sample_weight]): Build a forest of trees from the training set (X, y).
• fit_transform(X[, y]): Fit to data, then transform it.
• get_params([deep]): Get parameters for this estimator.
• predict(X): Predict regression target for X.
• score(X, y[, sample_weight]): Returns the coefficient of determination R^2 of the prediction.
• set_params(**params): Set the parameters of this estimator.
• transform(*args, **kwargs): DEPRECATED: Support to use estimators as feature selectors will be removed in version 0.19.

Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html#sklearn.preprocessing.Imputer

Imputation transformer for completing missing values.

Parameters:

missing_values : integer or "NaN", optional (default="NaN")

The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value “NaN”.
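One reason NaN needs special handling as a placeholder is that it never compares equal to anything, so an imputer cannot locate it with a plain equality test. A minimal NumPy illustration with a made-up array:

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [np.nan, 3.0],
              [5.0, 0.0]])

# NaN is not equal to anything, itself included, so == finds nothing.
mask_equal = (X == np.nan)
mask_nan = np.isnan(X)

# An integer placeholder such as 0 can be located with plain equality.
mask_zero = (X == 0)

print(mask_equal.sum(), mask_nan.sum(), mask_zero.sum())
```

Note that a placeholder of 0 also matches genuine zeros in the data, which is worth keeping in mind when choosing missing_values=0 as the example above does.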

strategy : string, optional (default="mean")

The imputation strategy:

• If “mean”, then replace missing values using the mean along the axis.
• If “median”, then replace missing values using the median along the axis.
• If “most_frequent”, then replace missing values using the most frequent value along the axis.
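The three strategies can be sketched by hand for a single column. This is a NumPy illustration with made-up values, not the Imputer's own code; in particular, the tie-breaking for the most frequent value is an assumption of this sketch.

```python
import numpy as np

column = np.array([2.0, np.nan, 2.0, 7.0, np.nan, 3.0])
observed = column[~np.isnan(column)]  # statistics use observed values only

values, counts = np.unique(observed, return_counts=True)
fill = {
    "mean": observed.mean(),
    "median": np.median(observed),
    # On ties, argmax picks the smallest value (sketch assumption).
    "most_frequent": values[np.argmax(counts)],
}

# Replace every NaN with the fill value of each strategy.
imputed = {name: np.where(np.isnan(column), value, column)
           for name, value in fill.items()}

for name in fill:
    print(name, fill[name], imputed[name])
```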

axis : integer, optional (default=0)

The axis along which to impute.

• If axis=0, then impute along columns.
• If axis=1, then impute along rows.
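The effect of the axis parameter can be sketched with `np.nanmean` on a small made-up matrix: axis=0 fills a gap from the other values of the same feature, axis=1 from the other values of the same sample.

```python
import numpy as np

X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])

col_means = np.nanmean(X, axis=0)  # one mean per column (feature)
row_means = np.nanmean(X, axis=1)  # one mean per row (sample)

missing = np.isnan(X)
X_axis0 = np.where(missing, col_means[np.newaxis, :], X)  # impute along columns
X_axis1 = np.where(missing, row_means[:, np.newaxis], X)  # impute along rows

print(X_axis0)
print(X_axis1)
```

For feature matrices where columns have different units, axis=0 is usually the meaningful choice, which matches the default.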

verbose : integer, optional (default=0)

Controls the verbosity of the imputer.

copy : boolean, optional (default=True)

If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:

• If X is not an array of floating values;
• If X is sparse and missing_values=0;
• If axis=0 and X is encoded as a CSR matrix;
• If axis=1 and X is encoded as a CSC matrix.

Methods

• fit(X[, y]): Fit the imputer on X.
• fit_transform(X[, y]): Fit to data, then transform it.
• get_params([deep]): Get parameters for this estimator.
• set_params(**params): Set the parameters of this estimator.
• transform(X): Impute all missing values in X.

np.hstack
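In the example, np.hstack is what builds the mask of rows that receive a missing value: it concatenates a block of False flags and a block of True flags into one 1-D array, which is then shuffled. A reduced sketch with assumed sizes (8 rows, 6 of them affected):

```python
import numpy as np

n_samples, n_missing_samples = 8, 6  # assumed sizes for the illustration

# Stack a run of False flags and a run of True flags side by side, then
# shuffle so the True entries land on random rows.
mask = np.hstack((np.zeros(n_samples - n_missing_samples, dtype=bool),
                  np.ones(n_missing_samples, dtype=bool)))
rng = np.random.RandomState(0)
rng.shuffle(mask)

print(mask)
```

Shuffling changes only the positions, so the number of True entries stays exactly n_missing_samples.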


