scikit-learn study notes: Imputing missing values before building an estimator

Translated article, posted 2015-11-19 15:48:39



This example shows that imputing the missing values can give better results than discarding the samples containing any missing value.


Imputing does not always improve the predictions, so check via cross-validation. Sometimes dropping rows or using marker values is more effective.


Missing values can be replaced by the mean, the median or the most frequent value using the strategy hyper-parameter.


The median is a more robust estimator for data with high-magnitude variables that could otherwise dominate the results (a distribution with a 'long tail').
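As a rough illustration of the robustness claim, the sketch below compares the mean and the median as fill values for a long-tailed column. The price values are made up for the example and are not taken from the Boston dataset.

```python
import statistics

# Made-up prices with one extreme value forming a long tail
prices = [210, 220, 230, 240, 250, 5000]

mean_fill = statistics.mean(prices)      # pulled upward by the outlier
median_fill = statistics.median(prices)  # stays near the bulk of the data

print("mean fill value   =", mean_fill)
print("median fill value =", median_fill)
```

Five of the six values lie between 210 and 250, so a missing entry filled with the mean would look nothing like a typical row, while the median stays inside that range.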


import numpy as np

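from sklearn.datasets import load_boston  # Boston house-price regression dataset
from sklearn.ensemble import RandomForestRegressor  # random forest regressor
from sklearn.pipeline import Pipeline  # chains transformers and a final estimator
from sklearn.preprocessing import Imputer  # fills in missing values
from sklearn.cross_validation import cross_val_score  # cross-validation scoring

rng = np.random.RandomState(0)  # seeded random number generator

dataset = load_boston()  # load the data
X_full, y_full = dataset.data, dataset.target  # X holds the features, y the target values
n_samples = X_full.shape[0]  # number of samples
n_features = X_full.shape[1]  # number of features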

# Estimate the score on the entire dataset, with no missing values
estimator = RandomForestRegressor(random_state=0, n_estimators=100)  # the estimator is a random forest regressor
score = cross_val_score(estimator, X_full, y_full).mean()  # cross-validated score on the untouched data
print("Score with the entire dataset = %.2f" % score)

# Add missing values in 75% of the lines
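missing_rate = 0.75  # fraction of samples that get a missing value
n_missing_samples = int(np.floor(n_samples * missing_rate))  # number of samples with a missing value
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples, dtype=bool),
                             np.ones(n_missing_samples, dtype=bool)))
rng.shuffle(missing_samples)  # spread the affected rows randomly over the dataset
missing_features = rng.randint(0, n_features, n_missing_samples)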

# Estimate the score without the lines containing missing values
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean()
print("Score without the samples containing missing values = %.2f" % score)

# Estimate the score after imputation of the missing values
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
estimator = Pipeline([("imputer", Imputer(missing_values=0, strategy="mean", axis=0)),
                      ("forest", RandomForestRegressor(random_state=0, n_estimators=100))])
score = cross_val_score(estimator, X_missing, y_missing).mean()
print("Score after imputation of the missing values = %.2f" % score)
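The listing above uses the module layout of the scikit-learn release this post was written against. As a hedged sketch of the same experiment on a more recent release, the version below assumes `sklearn.impute.SimpleImputer` and `sklearn.model_selection.cross_val_score` are available, and it swaps the Boston data for a synthetic dataset from `make_regression`. The dataset size, the NaN marker and the smaller forest are choices made for this example, not part of the original.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)

# Synthetic stand-in for the house-price data (assumption for this sketch)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Knock out one randomly chosen feature in 75% of the rows, marked as NaN
n_missing = int(0.75 * X.shape[0])
rows = rng.choice(X.shape[0], n_missing, replace=False)
cols = rng.randint(0, X.shape[1], n_missing)
X_missing = X.copy()
X_missing[rows, cols] = np.nan

# Impute inside the pipeline so each CV fold learns its own column means
pipeline = make_pipeline(
    SimpleImputer(missing_values=np.nan, strategy="mean"),
    RandomForestRegressor(n_estimators=20, random_state=0),
)
scores = cross_val_score(pipeline, X_missing, y, cv=3)
print("Score after imputation of the missing values = %.2f" % scores.mean())
```

Marking gaps with NaN rather than 0 avoids treating genuine zero entries as missing, which is a risk when 0 is used as the placeholder on real data.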

RandomForestRegressor(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)

A random forest is a meta estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (default).
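To make "same size, drawn with replacement" concrete, here is a minimal sketch of a single bootstrap draw using only the standard library. It is not the library's internal code, and the sample count and seed are arbitrary. The rows that are never drawn are the out-of-bag rows that the oob_score option relies on.

```python
import random

n_samples = 1000
rng = random.Random(0)  # fixed seed so the run is repeatable

# Draw n_samples row indices with replacement: same size as the input
bootstrap_indices = [rng.randrange(n_samples) for _ in range(n_samples)]

in_bag = set(bootstrap_indices)               # distinct rows this tree sees
out_of_bag = set(range(n_samples)) - in_bag   # rows this tree never sees

print("bootstrap sample size:", len(bootstrap_indices))
print("distinct rows drawn:  ", len(in_bag))
print("out-of-bag rows:      ", len(out_of_bag))
```

For large samples the share of distinct rows approaches 1 - 1/e, roughly 63%, so each tree typically leaves a bit more than a third of the rows out of bag.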


n_estimators : integer, optional (default=10)


The number of trees in the forest.


criterion : string, optional (default="mse")

The function to measure the quality of a split. The only supported criterion is "mse" for the mean squared error. Note: this parameter is tree-specific.


max_features : int, float, string or None, optional (default="auto")

The number of features to consider when looking for the best split:


  • If int, then consider max_features features at each split.
  • If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
  • If "auto", then max_features=n_features.
  • If "sqrt", then max_features=sqrt(n_features).
  • If "log2", then max_features=log2(n_features).
  • If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features. Note: this parameter is tree-specific.
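The rules for max_features can be written out as a small function. `resolve_max_features` is a hypothetical helper written for this note, not a scikit-learn function, and truncating the sqrt and log2 results to an integer is an assumption of the sketch.

```python
import math


def resolve_max_features(max_features, n_features):
    """Hypothetical helper: number of features examined per split."""
    if max_features is None or max_features == "auto":
        return n_features
    if max_features == "sqrt":
        return int(math.sqrt(n_features))
    if max_features == "log2":
        return int(math.log2(n_features))
    if isinstance(max_features, float):
        # A float is read as a fraction of the available features
        return int(max_features * n_features)
    if isinstance(max_features, int):
        return max_features
    raise ValueError("unsupported max_features: %r" % (max_features,))


# 13 features, as in the Boston dataset used above
for setting in ("auto", "sqrt", "log2", None, 0.5, 4):
    print(setting, "->", resolve_max_features(setting, 13))
```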

max_depth : integer or None, optional (default=None)

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Ignored if max_leaf_nodes is not None. Note: this parameter is tree-specific.

min_samples_split : integer, optional (default=2)

The minimum number of samples required to split an internal node. Note: this parameter is tree-specific.


min_samples_leaf : integer, optional (default=1)

The minimum number of samples in newly created leaves. A split is discarded if, after the split, one of the leaves would contain fewer than min_samples_leaf samples. Note: this parameter is tree-specific.


min_weight_fraction_leaf : float, optional (default=0.)

The minimum weighted fraction of the input samples required to be at a leaf node. Note: this parameter is tree-specific.

max_leaf_nodes : int or None, optional (default=None)

Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined by their relative reduction in impurity. If None, the number of leaf nodes is unlimited. If not None, max_depth is ignored. Note: this parameter is tree-specific.

bootstrap : boolean, optional (default=True)

Whether bootstrap samples are used when building trees.

oob_score : bool

Whether to use out-of-bag samples to estimate the generalization error.

n_jobs : integer, optional (default=1)

The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
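A quick way to see what a fixed seed buys is to build two generators from the same integer and compare their output. This sketch uses NumPy's RandomState directly rather than an estimator, and the range 0 to 12 simply mirrors the 13 feature indices of the example above.

```python
import numpy as np

# Two generators built from the same integer seed
rng_a = np.random.RandomState(0)
rng_b = np.random.RandomState(0)

draws_a = rng_a.randint(0, 13, size=5)
draws_b = rng_b.randint(0, 13, size=5)

print("seed 0, first generator: ", draws_a)
print("seed 0, second generator:", draws_b)
print("identical:", bool((draws_a == draws_b).all()))
```

This is why the example passes random_state=0 to every RandomForestRegressor: the three scores then differ because of the data, not because of a different random draw.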

verbose : int, optional (default=0)

Controls the verbosity of the tree building process.

warm_start : bool, optional (default=False)

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new forest.
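A small sketch of warm_start in use, on random data generated just for this illustration: the forest is first fitted with 10 trees, then n_estimators is raised and fit is called again, which should only grow the additional trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.randn(100)  # toy target

forest = RandomForestRegressor(n_estimators=10, warm_start=True, random_state=0)
forest.fit(X, y)
n_first = len(forest.estimators_)

# Ask for more trees and refit: the existing trees are kept
forest.set_params(n_estimators=25)
forest.fit(X, y)
n_second = len(forest.estimators_)

print("trees after first fit: ", n_first)
print("trees after second fit:", n_second)
```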


apply(X) Apply trees in the forest to X, return leaf indices.
fit(X, y[, sample_weight]) Build a forest of trees from the training set (X, y).
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
predict(X) Predict regression target for X.
score(X, y[, sample_weight]) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of this estimator.
transform(*args, **kwargs) DEPRECATED: Support to use estimators as feature selectors will be removed in version 0.19.
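The value returned by score is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below computes it by hand; `r2_score_manual` is a name invented for this note and the four target values are arbitrary.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

r2 = r2_score_manual(y_true, y_pred)
print("R^2 = %.3f" % r2)
```

A perfect prediction gives 1.0 and always predicting the mean of y gives 0.0, which is the scale on which the three scores printed by the example should be read.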

Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)

Imputation transformer for completing missing values.



missing_values : integer or "NaN", optional (default="NaN")

The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value "NaN".
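The two kinds of placeholder behave differently when locating the gaps, which the plain-Python sketch below illustrates on a made-up column. An integer placeholder can be found with an equality test, whereas NaN never compares equal to anything, itself included, so it needs a dedicated check.

```python
import math

# Made-up column: two zeros and one NaN
column = [4.0, 0.0, float("nan"), 7.0, 0.0]

# Integer placeholder: a plain equality test finds every occurrence
zero_mask = [value == 0 for value in column]

# NaN placeholder: equality never matches, so an explicit NaN test is needed
naive_nan_mask = [value == float("nan") for value in column]
nan_mask = [math.isnan(value) for value in column]

print("zero_mask:     ", zero_mask)
print("naive_nan_mask:", naive_nan_mask)
print("nan_mask:      ", nan_mask)
```

With missing_values=0, as in the example above, every genuine zero in the data is also treated as missing and overwritten, which is worth keeping in mind for features where zero is a legitimate value.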


strategy : string, optional (default="mean")

The imputation strategy

  • If "mean", then replace missing values using the mean along the axis.
  • If "median", then replace missing values using the median along the axis.
  • If "most_frequent", then replace missing values using the most frequent value along the axis.

axis : integer, optional (default=0)

The axis along which to impute.

  • If axis=0, then impute along columns.
  • If axis=1, then impute along rows.
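What the strategy and axis options amount to can be sketched in a few lines of NumPy. This is an illustration of the idea on a tiny hand-made matrix with NaN as the missing marker, not the Imputer implementation, and it covers only the mean and median strategies.

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 60.0]])
missing = np.isnan(X)

# axis=0: one statistic per column, computed from the observed values only
col_means = np.nanmean(X, axis=0)
col_medians = np.nanmedian(X, axis=0)
X_col_mean = np.where(missing, col_means, X)
X_col_median = np.where(missing, col_medians, X)

# axis=1: one statistic per row, reshaped so it broadcasts down the rows
row_means = np.nanmean(X, axis=1)
X_row_mean = np.where(missing, row_means[:, None], X)

print("column means:  ", col_means)
print("column medians:", col_medians)
print(X_col_mean)
print(X_row_mean)
```

In the second column the observed values are 10, 20 and 60, so the mean fill is 30 while the median fill is 20. Imputing along rows mixes different features, so it only makes sense when the columns are on a comparable scale.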

verbose : integer, optional (default=0)

Controls the verbosity of the imputer.

copy : boolean, optional (default=True)

If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:

  • If X is not an array of floating values;
  • If X is sparse and missing_values=0;
  • If axis=0 and X is encoded as a CSR matrix;
  • If axis=1 and X is encoded as a CSC matrix.


fit(X[, y]) Fit the imputer on X.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Impute all missing values in X.

