scikit-learn study notes: Imputing missing values before building an estimator

Translation, November 19, 2015, 15:48:39



This example shows that imputing the missing values can give better results than discarding the samples containing any missing value.


Imputing does not always improve the predictions, so please check via cross-validation. Sometimes dropping rows or using marker values is more effective.


Missing values can be replaced by the mean, the median or the most frequent value using the strategy hyper-parameter.


The median is a more robust estimator for data with high magnitude variables which could dominate results (otherwise known as a ‘long tail’).
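To make the difference between the strategies concrete, here is a minimal sketch that uses only Python's standard library. It is not scikit-learn's implementation; the column values and the use of None as the missing marker are made up for illustration.

```python
from statistics import mean, median, mode

# Hypothetical feature column with a long tail; None marks a missing entry.
column = [1.0, 2.0, 2.0, 3.0, None, 1000.0]

# Only the observed entries contribute to the fill value.
observed = [v for v in column if v is not None]

# One candidate fill value per strategy name.
fill_values = {
    "mean": mean(observed),
    "median": median(observed),
    "most_frequent": mode(observed),
}

# The imputed column under each strategy.
imputed = {
    name: [fill if v is None else v for v in column]
    for name, fill in fill_values.items()
}

for name, fill in fill_values.items():
    print(name, fill)
```

In this toy column the single large value pulls the mean far away from the bulk of the data, while the median and the most frequent value stay close to it.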


import numpy as np

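from sklearn.datasets import load_boston  # Boston house-price regression dataset
from sklearn.ensemble import RandomForestRegressor  # random forest regressor
from sklearn.pipeline import Pipeline  # chains transformers and a final estimator
from sklearn.preprocessing import Imputer  # transformer that fills in missing input values
from sklearn.cross_validation import cross_val_score  # cross-validation scoring function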

rng = np.random.RandomState(0)  # seeded random number generator

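dataset = load_boston()  # load the data
X_full, y_full = dataset.data, dataset.target  # X is the data, y is the target to predict
n_samples = X_full.shape[0]  # number of samples
n_features = X_full.shape[1]  # number of features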

# Estimate the score on the entire dataset, with no missing values
estimator = RandomForestRegressor(random_state=0, n_estimators=100)  # the estimator: a random forest regressor
score = cross_val_score(estimator, X_full, y_full).mean()  # mean cross-validation score on the unmodified data
print("Score with the entire dataset = %.2f" % score)

# Add missing values in 75% of the lines
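missing_rate = 0.75  # fraction of samples that get a missing value
n_missing_samples = int(np.floor(n_samples * missing_rate))  # number of samples with a missing value
# Boolean mask marking which samples get a missing value, in random order
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,
                                      dtype=bool),
                             np.ones(n_missing_samples,
                                     dtype=bool)))
rng.shuffle(missing_samples)
missing_features = rng.randint(0, n_features, n_missing_samples)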

# Estimate the score without the lines containing missing values
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean()
print("Score without the samples containing missing values = %.2f" % score)

# Estimate the score after imputation of the missing values
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
estimator = Pipeline([("imputer", Imputer(missing_values=0,
                                          strategy="mean",
                                          axis=0)),
                      ("forest", RandomForestRegressor(random_state=0,
                                                       n_estimators=100))])
score = cross_val_score(estimator, X_missing, y_missing).mean()
print("Score after imputation of the missing values = %.2f" % score)
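The two alternatives compared above (dropping the affected rows versus imputing them) can be sketched on a tiny data set without scikit-learn. This is only an illustration under stated assumptions: the toy values, the sizes, and the plain-list representation are invented here, and 0 is used as the missing marker as in the example.

```python
import random

rng = random.Random(0)  # seeded so the run is repeatable

n_samples, n_features = 8, 3
missing_rate = 0.75
n_missing = int(n_samples * missing_rate)

# Toy data: every value is at least 1, so 0.0 can safely mark "missing".
X_full = [[float(i + j + 1) for j in range(n_features)] for i in range(n_samples)]

# Boolean mask with exactly n_missing True entries, in random order.
missing_samples = [False] * (n_samples - n_missing) + [True] * n_missing
rng.shuffle(missing_samples)

# Option 1: drop every row flagged by the mask.
X_filtered = [row for row, miss in zip(X_full, missing_samples) if not miss]

# Option 2: blank one random feature per flagged row ...
X_missing = [row[:] for row in X_full]
for i, miss in enumerate(missing_samples):
    if miss:
        X_missing[i][rng.randrange(n_features)] = 0.0

# ... then fill each blank with the mean of the observed values in its column.
X_imputed = [row[:] for row in X_missing]
for j in range(n_features):
    observed = [row[j] for row in X_missing if row[j] != 0.0]
    col_mean = sum(observed) / len(observed)
    for row in X_imputed:
        if row[j] == 0.0:
            row[j] = col_mean

print(len(X_filtered), "rows kept after dropping,", len(X_imputed), "rows kept after imputing")
```

Dropping keeps only the unflagged quarter of the rows, whereas imputing keeps every row at the cost of replacing some values with estimates.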

RandomForestRegressor(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)

A random forest is a meta estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (default).
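The idea of drawing same-size samples with replacement and averaging the results can be sketched as follows. This is a simplified illustration, not how scikit-learn builds its trees: the target values are made up, and each "tree" is replaced by a stand-in that simply predicts the mean of its own bootstrap sample.

```python
import random

rng = random.Random(0)

# Hypothetical training targets; a real forest would also use features.
y_train = [3.0, 5.0, 7.0, 9.0, 11.0]
n_samples = len(y_train)
n_estimators = 4


def draw_bootstrap_indices(n, rng):
    """Draw n indices from range(n) with replacement."""
    return [rng.randrange(n) for _ in range(n)]


# Stand-in for a tree: each estimator predicts the mean of its own sample.
tree_predictions = []
for _ in range(n_estimators):
    indices = draw_bootstrap_indices(n_samples, rng)
    sample = [y_train[i] for i in indices]
    tree_predictions.append(sum(sample) / len(sample))

# The forest prediction is the average over the individual estimators.
forest_prediction = sum(tree_predictions) / n_estimators
print(tree_predictions, forest_prediction)
```

Each estimator sees a slightly different sample, so the individual predictions differ, and averaging them smooths out that variation.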


n_estimators : integer, optional (default=10)


The number of trees in the forest.


criterion : string, optional (default="mse")

The function to measure the quality of a split. The only supported criterion is "mse" for the mean squared error. Note: this parameter is tree-specific.
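As a rough illustration of what an MSE-based split criterion measures, the sketch below scores a candidate split by how much it reduces the mean squared error of the targets. The function names and the toy targets are hypothetical, and this is a simplified picture rather than scikit-learn's internal code.

```python
def mse(values):
    """Mean squared error of a node that predicts the mean of its targets."""
    node_mean = sum(values) / len(values)
    return sum((v - node_mean) ** 2 for v in values) / len(values)


def impurity_decrease(parent, left, right):
    """Weighted reduction in MSE from splitting parent into left and right."""
    n = len(parent)
    weighted_children = (len(left) / n) * mse(left) + (len(right) / n) * mse(right)
    return mse(parent) - weighted_children


parent = [1.0, 1.0, 5.0, 5.0]
# A split that separates the two groups of targets removes all of the error.
good_split = impurity_decrease(parent, [1.0, 1.0], [5.0, 5.0])
# A split that mixes them leaves the error unchanged.
poor_split = impurity_decrease(parent, [1.0, 5.0], [1.0, 5.0])
print(good_split, poor_split)
```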


max_features : int, float, string or None, optional (default="auto")

The number of features to consider when looking for the best split:


  • If int, then consider max_features features at each split.
  • If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
  • If "auto", then max_features=n_features.
  • If "sqrt", then max_features=sqrt(n_features).
  • If "log2", then max_features=log2(n_features).
  • If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features. Note: this parameter is tree-specific.
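The rules listed above can be written as a small helper that turns a max_features setting into a feature count. This is a sketch under assumptions: the helper name is hypothetical, and rounding "sqrt" and "log2" down with a floor of 1 is a choice made here for illustration, not something stated in the text.

```python
import math


def resolve_max_features(max_features, n_features):
    """Turn a max_features setting into a number of features per split."""
    if max_features is None or max_features == "auto":
        return n_features
    if max_features == "sqrt":
        # Assumption: round down, but never below one feature.
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        # Assumption: round down, but never below one feature.
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        return int(max_features * n_features)
    if isinstance(max_features, int):
        return max_features
    raise ValueError("unsupported max_features: %r" % (max_features,))


# The Boston data used in the example has 13 features.
for setting in (None, "auto", "sqrt", "log2", 0.5, 4):
    print(setting, resolve_max_features(setting, 13))
```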

max_depth : integer or None, optional (default=None)

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Ignored if max_leaf_nodes is not None. Note: this parameter is tree-specific.

min_samples_split : integer, optional (default=2)

The minimum number of samples required to split an internal node. Note: this parameter is tree-specific.


min_samples_leaf : integer, optional (default=1)

The minimum number of samples in newly created leaves. A split is discarded if, after the split, one of the leaves would contain fewer than min_samples_leaf samples. Note: this parameter is tree-specific.
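The way min_samples_split and min_samples_leaf interact can be pictured as two checks on a candidate split. The helper below is hypothetical and only mirrors the descriptions given here; it is not scikit-learn's tree-building code.

```python
def split_allowed(n_node_samples, n_left, n_right,
                  min_samples_split=2, min_samples_leaf=1):
    """Return True if a node may be split into children of the given sizes."""
    if n_node_samples < min_samples_split:
        return False  # the node is too small to be split at all
    if n_left < min_samples_leaf or n_right < min_samples_leaf:
        return False  # one of the new leaves would be too small
    return True


print(split_allowed(10, 9, 1))                      # allowed with the defaults
print(split_allowed(10, 9, 1, min_samples_leaf=2))  # right leaf too small
print(split_allowed(3, 2, 1, min_samples_split=4))  # node too small to split
```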


min_weight_fraction_leaf : float, optional (default=0.)

The minimum weighted fraction of the input samples required to be at a leaf node. Note: this parameter is tree-specific.

max_leaf_nodes : int or None, optional (default=None)

Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, then the number of leaf nodes is unlimited. If not None, then max_depth will be ignored. Note: this parameter is tree-specific.

bootstrap : boolean, optional (default=True)

Whether bootstrap samples are used when building trees.

oob_score : bool

Whether to use out-of-bag samples to estimate the generalization error.
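"Out-of-bag" refers to the training samples that a given tree never saw because the bootstrap draw skipped them. A minimal sketch of how those samples arise, using made-up sizes and the standard library only:

```python
import random

rng = random.Random(0)
n_samples = 10

# Indices drawn with replacement for one tree (as with bootstrap=True).
in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]

# Samples that were never drawn are out of bag for this tree, so they can
# act as a held-out set when estimating that tree's error.
out_of_bag = sorted(set(range(n_samples)) - set(in_bag))

print("in bag:", sorted(set(in_bag)))
print("out of bag:", out_of_bag)
```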

n_jobs : integer, optional (default=1)

The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

verbose : int, optional (default=0)

Controls the verbosity of the tree building process.

warm_start : bool, optional (default=False)

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new forest.


apply(X): Apply trees in the forest to X, return leaf indices.
fit(X, y[, sample_weight]): Build a forest of trees from the training set (X, y).
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
predict(X): Predict regression target for X.
score(X, y[, sample_weight]): Returns the coefficient of determination R^2 of the prediction.
set_params(**params): Set the parameters of this estimator.
transform(*args, **kwargs): DEPRECATED: Support to use estimators as feature selectors will be removed in version 0.19.
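The coefficient of determination R^2 reported by score can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The function name below is hypothetical, and the sketch assumes the true targets are not all identical (otherwise the denominator is zero).

```python
def r2_score_sketch(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score_sketch(y_true, y_true)        # predictions match exactly
baseline = r2_score_sketch(y_true, [2.5] * 4)    # always predict the mean
print(perfect, baseline)
```

A perfect prediction scores 1, and always predicting the mean of the targets scores 0, which is why the scores printed in the example above are easiest to read relative to those two reference points.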

Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)

Imputation transformer for completing missing values.



missing_values : integer or "NaN", optional (default="NaN")

The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value "NaN".


strategy : string, optional (default="mean")

The imputation strategy.

  • If "mean", then replace missing values using the mean along the axis.
  • If "median", then replace missing values using the median along the axis.
  • If "most_frequent", then replace missing values using the most frequent value along the axis.

axis : integer, optional (default=0)

The axis along which to impute.

  • If axis=0, then impute along columns.
  • If axis=1, then impute along rows.
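The effect of the axis setting is easiest to see on a small matrix. The helper below is a hypothetical pure-Python sketch of mean imputation, not the Imputer itself. It assumes a numeric missing marker (0.0 here, since NaN never compares equal to itself) and at least one observed value in every row or column being imputed.

```python
def impute_mean(rows, missing=0.0, axis=0):
    """Replace entries equal to `missing` by the mean along the chosen axis."""
    if axis == 1:
        # Impute along rows: each row supplies its own mean.
        lines = [row[:] for row in rows]
    else:
        # Impute along columns: transpose so each column becomes one line.
        lines = [list(col) for col in zip(*rows)]
    for line in lines:
        observed = [v for v in line if v != missing]
        fill = sum(observed) / len(observed)
        line[:] = [fill if v == missing else v for v in line]
    # Undo the transpose for the column case.
    return lines if axis == 1 else [list(row) for row in zip(*lines)]


X = [[1.0, 10.0],
     [0.0, 20.0],
     [3.0, 30.0]]

by_columns = impute_mean(X, axis=0)  # gap filled from the first column
by_rows = impute_mean(X, axis=1)     # gap filled from the second row
print(by_columns)
print(by_rows)
```

With axis=0 the gap is filled from other samples of the same feature; with axis=1 it is filled from other features of the same sample, which only makes sense when the features share a scale.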

verbose : integer, optional (default=0)

Controls the verbosity of the imputer.

copy : boolean, optional (default=True)

If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:

  • If X is not an array of floating values;
  • If X is sparse and missing_values=0;
  • If axis=0 and X is encoded as a CSR matrix;
  • If axis=1 and X is encoded as a CSC matrix.


fit(X[, y]): Fit the imputer on X.
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
set_params(**params): Set the parameters of this estimator.
transform(X): Impute all missing values in X.
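The split between fit and transform matters because the fill values are learned once, on the training data, and then reused on any later data. The toy class below illustrates that pattern in plain Python. The class name is hypothetical and it is not the Imputer's actual code; it assumes 0.0 as the missing marker and at least one observed value per training column.

```python
class MeanImputerSketch:
    """Toy column-wise mean imputer with separate fit and transform steps."""

    def __init__(self, missing_values=0.0):
        self.missing_values = missing_values

    def fit(self, X):
        # Learn one fill value per column from the observed entries.
        self.statistics_ = []
        for col in zip(*X):
            observed = [v for v in col if v != self.missing_values]
            self.statistics_.append(sum(observed) / len(observed))
        return self

    def transform(self, X):
        # Fill with the values learned in fit, not with statistics of X itself.
        return [
            [fill if v == self.missing_values else v
             for v, fill in zip(row, self.statistics_)]
            for row in X
        ]


X_train = [[1.0, 4.0], [3.0, 0.0], [0.0, 8.0]]
X_test = [[0.0, 0.0]]

imputer = MeanImputerSketch().fit(X_train)
X_test_filled = imputer.transform(X_test)
print(imputer.statistics_, X_test_filled)
```

This is also why the example places the imputer inside a Pipeline: during cross-validation the fill values are then learned on each training fold only and applied to the matching test fold.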


