scikit-learn study notes — Imputing missing values before building an estimator

Translated · 2015-11-19 15:48:39



This example shows that imputing the missing values can give better results than discarding the samples containing any missing value.


Imputing does not always improve the predictions, so please check via cross-validation. Sometimes dropping rows or using marker values is more effective.


Missing values can be replaced by the mean, the median or the most frequent value using the strategy hyper-parameter.


The median is a more robust estimator for data with high magnitude variables which could dominate results (otherwise known as a ‘long tail’).
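
To see why, here is a minimal sketch (plain NumPy, made-up numbers) comparing the two statistics on a long-tailed column, where a single extreme value drags the mean away from the bulk of the data:

import numpy as np

col = np.array([1.0, 2.0, 2.5, 3.0, 1000.0])  # hypothetical long-tailed feature
print(np.mean(col))    # 201.7 -- dominated by the outlier
print(np.median(col))  # 2.5   -- stays with the typical values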


import numpy as np

from sklearn.datasets import load_boston              # Boston house-price regression data
from sklearn.ensemble import RandomForestRegressor    # random forest regressor
from sklearn.pipeline import Pipeline                 # chains imputation and regression
from sklearn.preprocessing import Imputer             # transformer that fills in missing values
from sklearn.cross_validation import cross_val_score  # cross-validation helper

rng = np.random.RandomState(0)  # random number generator with a fixed seed

dataset = load_boston()                         # load the data
X_full, y_full = dataset.data, dataset.target   # X is the feature matrix, y the target
n_samples = X_full.shape[0]                     # number of samples
n_features = X_full.shape[1]                    # number of features

# Estimate the score on the entire dataset, with no missing values
estimator = RandomForestRegressor(random_state=0, n_estimators=100)  # random forest regressor
score = cross_val_score(estimator, X_full, y_full).mean()  # baseline score on the untouched data
print("Score with the entire dataset = %.2f" % score)

# Add missing values in 75% of the lines
missing_rate = 0.75                                          # fraction of samples to corrupt
n_missing_samples = int(np.floor(n_samples * missing_rate))  # number of corrupted samples
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,
                                      dtype=np.bool),
                             np.ones(n_missing_samples,
                                     dtype=np.bool)))  # boolean mask over the samples
rng.shuffle(missing_samples)                           # spread the missing values randomly
missing_features = rng.randint(0, n_features, n_missing_samples)  # which feature goes missing in each

# Estimate the score without the lines containing missing values
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean()
print("Score without the samples containing missing values = %.2f" % score)

# Estimate the score after imputation of the missing values
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0  # encode the missing entries as 0
y_missing = y_full.copy()
estimator = Pipeline([("imputer", Imputer(missing_values=0,
                                          strategy="mean",
                                          axis=0)),
                      ("forest", RandomForestRegressor(random_state=0,
                                                       n_estimators=100))])
score = cross_val_score(estimator, X_missing, y_missing).mean()
print("Score after imputation of the missing values = %.2f" % score)

RandomForestRegressor(n_estimators=10, criterion='mse', max_depth=None,
                      min_samples_split=2, min_samples_leaf=1,
                      min_weight_fraction_leaf=0.0, max_features='auto',
                      max_leaf_nodes=None, bootstrap=True, oob_score=False,
                      n_jobs=1, random_state=None, verbose=0, warm_start=False)

A random forest is a meta estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (default).
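
As a quick illustration of the basic fit/predict cycle (synthetic data, default settings apart from n_estimators; not part of the original example):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = 2 * X[:, 0] + rng.randn(200) * 0.1   # made-up regression target

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)                         # build the ensemble of trees
print(forest.predict(X[:3]))             # predictions for the first three rows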


n_estimators : integer, optional (default=10)


The number of trees in the forest.


criterion : string, optional (default=”mse”)

The function to measure the quality of a split. The only supported criterion is “mse” for the mean squared error. Note: this parameter is tree-specific.


max_features : int, float, string or None, optional (default=”auto”)

The number of features to consider when looking for the best split:


  • If int, then consider max_features features at each split.
  • If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
  • If “auto”, then max_features=n_features.
  • If “sqrt”, then max_features=sqrt(n_features).
  • If “log2”, then max_features=log2(n_features).
  • If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features. Note: this parameter is tree-specific.
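
For instance, restricting each split to a random subset of sqrt(n_features) candidate features, a common way to decorrelate the trees, is just a constructor argument:

from sklearn.ensemble import RandomForestRegressor

# consider only sqrt(n_features) randomly chosen features at each split
forest = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)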

max_depth : integer or None, optional (default=None)

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. Ignored if max_leaf_nodes is not None. Note: this parameter is tree-specific.

min_samples_split : integer, optional (default=2)

The minimum number of samples required to split an internal node. Note: this parameter is tree-specific.


min_samples_leaf : integer, optional (default=1)

The minimum number of samples in newly created leaves. A split is discarded if, after the split, one of the leaves would contain less than min_samples_leaf samples. Note: this parameter is tree-specific.


min_weight_fraction_leaf : float, optional (default=0.)

The minimum weighted fraction of the input samples required to be at a leaf node. Note: this parameter is tree-specific.

max_leaf_nodes : int or None, optional (default=None)

Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes. If not None then max_depth will be ignored. Note: this parameter is tree-specific.

bootstrap : boolean, optional (default=True)

Whether bootstrap samples are used when building trees.

oob_score : bool

Whether to use out-of-bag samples to estimate the generalization error.

n_jobs : integer, optional (default=1)

The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

verbose : int, optional (default=0)

Controls the verbosity of the tree building process.

warm_start : bool, optional (default=False)

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new forest.
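
A short sketch of how warm_start grows an existing forest instead of refitting from scratch (synthetic data for illustration; not part of the original example):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X, y = rng.rand(100, 3), rng.rand(100)

forest = RandomForestRegressor(n_estimators=50, warm_start=True, random_state=0)
forest.fit(X, y)                 # fits 50 trees
forest.n_estimators = 100
forest.fit(X, y)                 # fits 50 more trees, keeping the first 50
print(len(forest.estimators_))   # 100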


apply(X) Apply trees in the forest to X, return leaf indices.
fit(X, y[, sample_weight]) Build a forest of trees from the training set (X, y).
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
predict(X) Predict regression target for X.
score(X, y[, sample_weight]) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of this estimator.
transform(*args, **kwargs) DEPRECATED: Support to use estimators as feature selectors will be removed in version 0.19.

Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)

Imputation transformer for completing missing values.



missing_values : integer or “NaN”, optional (default=”NaN”)

The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value “NaN”.


strategy : string, optional (default=”mean”)

The imputation strategy:

  • If “mean”, then replace missing values using the mean along the axis.
  • If “median”, then replace missing values using the median along the axis.
  • If “most_frequent”, then replace missing values using the most frequent value along the axis.

axis : integer, optional (default=0)

The axis along which to impute.

  • If axis=0, then impute along columns.
  • If axis=1, then impute along rows.

verbose : integer, optional (default=0)

Controls the verbosity of the imputer.

copy : boolean, optional (default=True)

If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:

  • If X is not an array of floating values;
  • If X is sparse and missing_values=0;
  • If axis=0 and X is encoded as a CSR matrix;
  • If axis=1 and X is encoded as a CSC matrix.


fit(X[, y]) Fit the imputer on X.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Impute all missing values in X.
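
Putting fit and transform together, a minimal sketch with the Imputer API described above (made-up values; with axis=0 each NaN is replaced by its column mean):

import numpy as np
from sklearn.preprocessing import Imputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
print(imp.fit_transform(X))
# [[ 1.  2.]
#  [ 4.  3.]   <- NaN replaced by the column mean (1 + 7) / 2
#  [ 7.  6.]]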

