scikit-learn study notes: Imputing missing values before building an estimator

Translation, 2015-11-19 15:48:39

Data preprocessing: handling missing values


http://scikit-learn.org/stable/auto_examples/missing_values.html#example-missing-values-py

------------------------------------------------------------------------------------------------------------------------------

This example shows that imputing the missing values can give better results than discarding the samples containing any missing value.

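In short: filling in (imputing) the missing values can score better than deleting the samples that contain them.

Imputing does not always improve the predictions, so please check via cross-validation. Sometimes dropping rows or using marker values is more effective.

In short: imputation is not guaranteed to help, so verify its effect with cross-validation.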
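
To make the cross-validation check more concrete, here is a minimal sketch of how k-fold splitting works. The helper k_fold_indices is hypothetical (it is not a scikit-learn function), and the three folds are an assumption chosen for illustration. Each fold serves once as the test set while the remaining samples are used for training.

```python
def k_fold_indices(n_samples, n_folds=3):
    """Split range(n_samples) into (train, test) index lists, one pair per fold.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(10, n_folds=3)
for train, test in folds:
    print("train:", train, "test:", test)
```

A model would be fitted on each train list and scored on the matching test list; the mean of those scores is what the example below compares across the three preprocessing choices.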

Missing values can be replaced by the mean, the median or the most frequent value using the strategy hyper-parameter.

In short: the strategy hyper-parameter selects the replacement value: the mean, the median, or the most frequent value.

The median is a more robust estimator for data with high magnitude variables which could dominate results (otherwise known as a ‘long tail’).

In short: for data with large-magnitude values, the median is the more stable estimate.
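
The difference between the three strategies can be seen on a tiny, made-up column with one very large value. This is a hand-rolled sketch using only the standard library, not the Imputer implementation; the data and the impute helper are assumptions for illustration.

```python
import statistics

# One feature column with a long tail; None marks the missing entries.
column = [1.0, 2.0, 2.0, 3.0, None, 100.0, None]
observed = [v for v in column if v is not None]

# The replacement value each strategy would compute from the observed entries.
fill_values = {
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
    "most_frequent": statistics.mode(observed),
}


def impute(values, fill):
    """Return a copy of values with every None replaced by fill."""
    return [fill if v is None else v for v in values]


for strategy, fill in fill_values.items():
    print(strategy, fill, impute(column, fill))
```

The single value 100.0 pulls the mean far above the typical entries, while the median stays close to them, which is the robustness the text describes.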

import numpy as np

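# Note: these imports match the scikit-learn release current in late 2015;
# later releases moved or removed several of them.
from sklearn.datasets import load_boston  # Boston house-price regression dataset
from sklearn.ensemble import RandomForestRegressor  # random forest regressor
from sklearn.pipeline import Pipeline  # chains transformers and a final estimator
from sklearn.preprocessing import Imputer  # transformer that fills in missing values
from sklearn.cross_validation import cross_val_score  # cross-validated scoring

rng = np.random.RandomState(0)  # seeded random number generator, for reproducibility

dataset = load_boston()  # load the data
X_full, y_full = dataset.data, dataset.target  # X holds the features, y the target values
n_samples = X_full.shape[0]  # number of samples
n_features = X_full.shape[1]  # number of features

# Estimate the score on the entire dataset, with no missing values
estimator = RandomForestRegressor(random_state=0, n_estimators=100)  # random forest regressor
score = cross_val_score(estimator, X_full, y_full).mean()  # mean cross-validated score on the untouched data
print("Score with the entire dataset = %.2f" % score)

# Add missing values in 75% of the lines
missing_rate = 0.75  # fraction of samples that get a missing value
n_missing_samples = int(np.floor(n_samples * missing_rate))  # number of affected samples (cast to int so it can be used as an array size)
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,
                                      dtype=bool),
                             np.ones(n_missing_samples,
                                     dtype=bool)))  # boolean mask: True marks a sample that gets a missing value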
rng.shuffle(missing_samples)
missing_features = rng.randint(0, n_features, n_missing_samples)

# Estimate the score without the lines containing missing values
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean()
print("Score without the samples containing missing values = %.2f" % score)

# Estimate the score after imputation of the missing values
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
estimator = Pipeline([("imputer", Imputer(missing_values=0,
                                          strategy="mean",
                                          axis=0)),
                      ("forest", RandomForestRegressor(random_state=0,
                                                       n_estimators=100))])
score = cross_val_score(estimator, X_missing, y_missing).mean()
print("Score after imputation of the missing values = %.2f" % score)
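
The masking and indexing steps in the example are easier to follow on a small array. The sketch below repeats the same NumPy operations on made-up 4x3 data (an assumption for illustration; it does not use the Boston dataset or any scikit-learn class).

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.arange(1.0, 13.0).reshape(4, 3)  # 4 samples, 3 features, no zeros
n_samples, n_features = X.shape

# Mark 2 of the 4 samples as "will lose one value", then shuffle the mask.
n_missing = 2
mask = np.hstack((np.zeros(n_samples - n_missing, dtype=bool),
                  np.ones(n_missing, dtype=bool)))
rng.shuffle(mask)

# One random feature index per marked sample.
cols = rng.randint(0, n_features, n_missing)
rows = np.where(mask)[0]

X_missing = X.copy()
X_missing[rows, cols] = 0  # pairs rows[i] with cols[i]: one cell per marked row

X_filtered = X[~mask, :]   # the alternative: drop the marked rows entirely
print(X_missing)
print(X_filtered)
```

Indexing with two equal-length index arrays selects one cell per pair, so each marked sample loses exactly one feature value, while ~mask keeps only the untouched rows.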
----------------------------------------------------------------------------------------------------------------------------------

RandomForestRegressor(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
max_features='auto', max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor

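A random forest is a meta estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

In short: the forest trains many decision trees, each on its own sub-sample, and averages their predictions to improve accuracy and limit over-fitting.

The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (default).

In short: each sub-sample is as large as the original input; with bootstrap=True the samples are drawn with replacement, so the same sample may be picked more than once.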
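
Drawing with replacement and averaging can be sketched with the standard library alone. The data, the seed, and the three per-tree predictions below are made up for illustration; this is not how scikit-learn implements the forest internally.

```python
import random

rng = random.Random(0)
data = list(range(10))  # stand-in for 10 training samples

# A bootstrap sample has the same size as the data but is drawn with
# replacement, so some samples repeat and others are left out.
bootstrap_sample = rng.choices(data, k=len(data))

# Samples that were never drawn are "out of bag" for this tree.
out_of_bag = sorted(set(data) - set(bootstrap_sample))

print("bootstrap sample:", bootstrap_sample)
print("out of bag:", out_of_bag)

# The forest's regression output is the average of the per-tree predictions
# (hypothetical values here).
tree_predictions = [21.0, 24.5, 22.5]
forest_prediction = sum(tree_predictions) / len(tree_predictions)
print("forest prediction:", forest_prediction)
```

The out-of-bag samples are the ones the oob_score parameter refers to: they were not seen by that tree, so they can be used to estimate its generalization error.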

n_estimators : integer, optional (default=10)

The number of trees in the forest.

In short: how many trees the forest contains (not how many forests are built).

criterion : string, optional (default="mse")

The function to measure the quality of a split. The only supported criterion is "mse" for the mean squared error. Note: this parameter is tree-specific.

In short: this measures how good a split is; the only supported criterion is "mse", the mean squared error.

max_features : int, float, string or None, optional (default="auto")

The number of features to consider when looking for the best split:

In short: how many features are examined when searching for the best split.

  • If int, then consider max_features features at each split.
  • If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
  • If "auto", then max_features=n_features.
  • If "sqrt", then max_features=sqrt(n_features).
  • If "log2", then max_features=log2(n_features).
  • If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features. Note: this parameter is tree-specific.
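
The rules in the list above can be written as a small function. resolve_max_features is a hypothetical helper, not a scikit-learn API; truncating the square root and logarithm to an integer, and the floor of 1, are assumptions made so the result is always a usable feature count.

```python
import math


def resolve_max_features(max_features, n_features):
    """Turn a max_features setting into a concrete number of features.

    Hypothetical helper that mirrors the documented rules.
    """
    if max_features is None or max_features == "auto":
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))  # truncation is an assumption
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))  # truncation is an assumption
    if isinstance(max_features, bool):
        raise ValueError("max_features must not be a boolean")
    if isinstance(max_features, int):
        return max_features
    if isinstance(max_features, float):
        return int(max_features * n_features)
    raise ValueError("unsupported max_features: %r" % (max_features,))


for setting in (None, "auto", "sqrt", "log2", 5, 0.5):
    print(setting, "->", resolve_max_features(setting, 13))
```

With the 13 features of the Boston data used in the example, "sqrt" and "log2" both come out at 3 features per split under these assumptions, and 0.5 gives 6.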

max_depth : integer or None, optional (default=None)

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Ignored if max_leaf_nodes is not None. Note: this parameter is tree-specific.

In short: with None, nodes keep expanding until every leaf is pure or holds fewer than min_samples_split samples.

min_samples_split : integer, optional (default=2)

The minimum number of samples required to split an internal node. Note: this parameter is tree-specific.

In short: a node must hold at least this many samples before it may be split.

min_samples_leaf : integer, optional (default=1)

The minimum number of samples in newly created leaves. A split is discarded if, after the split, one of the leaves would contain fewer than min_samples_leaf samples. Note: this parameter is tree-specific.

In short: a split is rejected if it would leave either new leaf with fewer than min_samples_leaf samples.
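
The two thresholds act at different points: min_samples_split is checked on the node before splitting, min_samples_leaf on the children a candidate split would create. The function below is a hypothetical sketch of that logic, not scikit-learn's tree-building code.

```python
def split_is_allowed(n_node, n_left, n_right,
                     min_samples_split=2, min_samples_leaf=1):
    """Check a candidate split against both thresholds (hypothetical helper).

    n_node is the number of samples in the node; n_left and n_right are the
    sizes of the two children the split would create.
    """
    if n_node < min_samples_split:
        return False  # node is too small to be split at all
    if n_left < min_samples_leaf or n_right < min_samples_leaf:
        return False  # one child would be too small, so discard the split
    return True


print(split_is_allowed(10, 5, 5))                      # defaults: allowed
print(split_is_allowed(1, 1, 0))                       # node too small
print(split_is_allowed(10, 9, 1, min_samples_leaf=2))  # right child too small
```

Raising either value makes the trees shallower, which is one way to limit over-fitting.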

min_weight_fraction_leaf : float, optional (default=0.)

The minimum weighted fraction of the input samples required to be at a leaf node. Note: this parameter is tree-specific.

max_leaf_nodes : int or None, optional (default=None)

Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes. If not None then max_depth will be ignored. Note: this parameter is tree-specific.

bootstrap : boolean, optional (default=True)

Whether bootstrap samples are used when building trees.

oob_score : bool

Whether to use out-of-bag samples to estimate the generalization error.

n_jobs : integer, optional (default=1)

The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

verbose : int, optional (default=0)

Controls the verbosity of the tree building process.

warm_start : bool, optional (default=False)

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new forest.


Methods

apply(X) Apply trees in the forest to X, return leaf indices.
fit(X, y[, sample_weight]) Build a forest of trees from the training set (X, y).
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
predict(X) Predict regression target for X.
score(X, y[, sample_weight]) Returns the coefficient of determination R^2 of the prediction.
set_params(**params) Set the parameters of this estimator.
transform(*args, **kwargs) DEPRECATED: Support to use estimators as feature selectors will be removed in version 0.19.
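
The coefficient of determination R^2 returned by score can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name r_squared and the sample values are assumptions for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
print(r_squared(y_true, y_true))                # perfect predictions
print(r_squared(y_true, [6.0, 6.0, 6.0, 6.0]))  # always predicting the mean
print(r_squared(y_true, [9.0, 7.0, 5.0, 3.0]))  # worse than predicting the mean
```

A perfect model scores 1, a model that always predicts the mean of the targets scores 0, and a worse model scores below 0. These are the values that cross_val_score averages in the example above.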


Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html#sklearn.preprocessing.Imputer

Imputation transformer for completing missing values.

In short: a transformer that fills in missing values.

Parameters:

missing_values : integer or "NaN", optional (default="NaN")

The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value "NaN".

In short: every entry equal to missing_values is treated as missing and imputed; if the data marks missing entries with np.nan, pass the string "NaN".

strategy : string, optional (default="mean")

The imputation strategy.

  • If "mean", then replace missing values using the mean along the axis.
  • If "median", then replace missing values using the median along the axis.
  • If "most_frequent", then replace missing values using the most frequent value along the axis.

axis : integer, optional (default=0)

The axis along which to impute.

  • If axis=0, then impute along columns.
  • If axis=1, then impute along rows.
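
The effect of the axis parameter can be sketched on nested lists, with None standing in for a missing value. impute_mean is a hypothetical standard-library helper, not the Imputer class, and the small matrix is made up for illustration.

```python
def impute_mean(rows, axis=0):
    """Fill None with a mean: per column if axis=0, per row if axis=1.

    Hypothetical helper; assumes every row and column has an observed value.
    """
    if axis == 1:
        filled = []
        for row in rows:
            observed = [v for v in row if v is not None]
            mean = sum(observed) / len(observed)
            filled.append([mean if v is None else v for v in row])
        return filled
    # axis=0: transpose, impute each column as if it were a row, transpose back.
    columns = [list(col) for col in zip(*rows)]
    filled_columns = impute_mean(columns, axis=1)
    return [list(row) for row in zip(*filled_columns)]


X = [[1.0, 10.0],
     [None, 20.0],
     [3.0, 60.0]]

print(impute_mean(X, axis=0))  # uses the mean of the first column
print(impute_mean(X, axis=1))  # uses the mean of the second row
```

Imputing along columns (axis=0) is the usual choice for a samples-by-features matrix, because each feature is filled from other samples' values of the same feature.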

verbose : integer, optional (default=0)

Controls the verbosity of the imputer.

copy : boolean, optional (default=True)

If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:

  • If X is not an array of floating values;
  • If X is sparse and missing_values=0;
  • If axis=0 and X is encoded as a CSR matrix;
  • If axis=1 and X is encoded as a CSC matrix.

Methods

fit(X[, y]) Fit the imputer on X.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Impute all missing values in X.


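np.hstack
Stacks arrays horizontally (column-wise); for the 1-D arrays in the example above, this joins them end to end.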
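
A short sketch of np.hstack on made-up arrays, covering both the 1-D case used to build the missing_samples mask and the 2-D case where columns are appended:

```python
import numpy as np

# 1-D inputs: the arrays are joined end to end.
zeros = np.zeros(2, dtype=bool)
ones = np.ones(3, dtype=bool)
mask = np.hstack((zeros, ones))
print(mask)

# 2-D inputs: the arrays are placed side by side, so they need the same
# number of rows.
a = np.array([[1, 2], [3, 4]])
b = np.array([[5], [6]])
side_by_side = np.hstack((a, b))
print(side_by_side)
```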
