Chapter 2: Data Preprocessing

Latest recommended article published on 2022-05-01 16:54:01

不吃草的小猪

Category: Hands on ML

Article link: https://blog.csdn.net/Convolution_ZQ/article/details/104096134


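1. Data processing workflow:

1. Load the data with pandas, inspect its structure, plot histograms with hist(), and analyze the distributions.

2. Create the training and test sets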

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
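To make the role of test_size and random_state concrete, here is a minimal sketch of what a purely random split does conceptually: shuffle the row indices with a fixed seed, then slice. The helper name split_train_test is hypothetical, and this is not how scikit-learn implements train_test_split internally.

```python
import numpy as np

def split_train_test(n_rows, test_ratio, seed=42):
    """Return (train_indices, test_indices) for a dataset with n_rows rows."""
    rng = np.random.default_rng(seed)       # fixed seed: the same split on every run
    shuffled = rng.permutation(n_rows)      # random ordering of 0 .. n_rows-1
    test_size = int(n_rows * test_ratio)    # number of rows held out for testing
    return shuffled[test_size:], shuffled[:test_size]

train_idx, test_idx = split_train_test(1000, test_ratio=0.2)
print(len(train_idx), len(test_idx))
```

Because the seed is fixed, repeated runs give the same test set, which keeps the model from gradually seeing the whole dataset across experiments.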

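Why stratify on median_income?
- When the dataset is large, the purely random sampling above is acceptable. When it is small, random sampling risks introducing sampling bias.
- Stratified sampling matters in such cases, but before applying it the relevant continuous feature must first be binned into categories (strata).

import numpy as np

# bin median_income into categories; every category of 5 or above is merged into 5
housing['income_cat'] = np.ceil(housing['median_income'] / 1.5)
housing['income_cat'] = housing['income_cat'].where(housing['income_cat'] < 5, 5.0)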

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing, housing['income_cat']):
    # split() yields positional indices; .loc is fine here assuming housing still has its default RangeIndex
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# remove the temporary income_cat feature from both sets
for set_ in (strat_train_set, strat_test_set):
    set_.drop(['income_cat'], axis=1, inplace=True)
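One way to check that stratification helps is to compare the category proportions of each test set against the full dataset. The sketch below uses a synthetic, income-like column as a stand-in for the real housing data (an assumption, as are the names demo and cat_proportions), so the exact numbers are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in for the housing data: a skewed, income-like column
demo = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=2000)})
demo["income_cat"] = np.ceil(demo["median_income"] / 1.5)
demo["income_cat"] = demo["income_cat"].where(demo["income_cat"] < 5, 5.0)

def cat_proportions(df):
    """Share of rows falling in each income category."""
    return df["income_cat"].value_counts(normalize=True).sort_index()

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(splitter.split(demo, demo["income_cat"]))
strat_test = demo.iloc[test_index]              # positional indices, so use iloc
_, random_test = train_test_split(demo, test_size=0.2, random_state=42)

comparison = pd.DataFrame({
    "overall": cat_proportions(demo),
    "stratified": cat_proportions(strat_test),
    "random": cat_proportions(random_test),
}).fillna(0.0)
comparison["strat_error"] = (comparison["stratified"] - comparison["overall"]).abs()
comparison["random_error"] = (comparison["random"] - comparison["overall"]).abs()
print(comparison)
```

The strat_error column should stay very close to zero, while random_error depends on the luck of the draw and tends to grow as the dataset shrinks.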

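3. Explore and visualize the data to find patterns

housing = strat_train_set.copy()  # deep copy, so that exploration does not affect the training set

You will have identified a few data quirks that should be cleaned up before feeding the data to an algorithm. You will also have found interesting correlations between attributes, in particular with the target attribute. Some attributes have a long-tailed distribution, so you may want to transform them (for example, by computing their logarithm). The details differ from project to project, but the general approach is similar.
This step is mainly done by examining plots with the project's goal in mind.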
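As an illustration of the log transform mentioned above, the following sketch measures the skewness of a synthetic long-tailed attribute before and after taking the logarithm. The data is generated (not the housing dataset), and the skewness helper is a hypothetical name written here for the example.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the mean of the cubed standardized values."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
# Synthetic long-tailed attribute, loosely imitating something like population
population = rng.lognormal(mean=7.0, sigma=1.0, size=5000)

log_population = np.log(population)   # valid because every value is strictly positive
print("skew before:", skewness(population))
print("skew after: ", skewness(log_population))
```

A skewness near zero indicates a roughly symmetric distribution. For attributes that can contain zeros, np.log1p is a common substitute for np.log.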

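4. Data preprocessing

(1). Start again from the full training set and separate the target labels from the rest of the data, in preparation for the preprocessing steps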

housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

(2). Fill in NaN values
In this step, first remove all non-numeric attributes, because the median that SimpleImputer computes is only defined for numeric data

try:
    from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
except ImportError:
    from sklearn.preprocessing import Imputer as SimpleImputer

# create Imputer object
imputer = SimpleImputer(strategy="median")  # using median 
housing_num = housing.drop('ocean_proximity', axis=1)
# alternatively: housing_num = housing.select_dtypes(include=[np.number])
imputer.fit(housing_num)
# transform the training set
X = imputer.transform(housing_num)
# imputer.transform returns a NumPy array, so convert it back into a DataFrame
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing.index)
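The behaviour of the imputer is easier to see on a tiny example. This sketch uses a made-up four-row DataFrame (the values are assumptions, only the column names echo the housing data) and shows that the learned medians are stored in statistics_ and then used to fill the gaps.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one missing value in total_bedrooms
toy = pd.DataFrame({
    "total_rooms": [10.0, 20.0, 30.0, 40.0],
    "total_bedrooms": [2.0, np.nan, 4.0, 9.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(toy)                        # learns one median per column, ignoring NaN
filled = pd.DataFrame(imputer.transform(toy),
                      columns=toy.columns, index=toy.index)

print(imputer.statistics_)              # the per-column medians learned by fit()
print(filled)
```

Because the medians are stored on the fitted imputer, the same object can later fill missing values in the test set with the training-set medians.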

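(3). Handle categorical (text) attributes
We use fit_transform() from OrdinalEncoder to convert the text categories into numbers
We then use OneHotEncoder from sklearn.preprocessing for one-hot encoding, because the ordinal codes can still mislead downstream Machine Learning algorithms: they treat the numeric distance between codes as meaningful, for example assuming that category 4 is farther from category 1 than category 2 is
Note: OneHotEncoder.fit_transform() does not accept 1-D input; a 1-D array must first be converted into a 2-D column with .reshape(-1, 1)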

# extract feature
housing_cat = housing[['ocean_proximity']]  

try:
    from sklearn.preprocessing import OrdinalEncoder
except ImportError:
    from future_encoders import OrdinalEncoder # Scikit-Learn < 0.20

# create OrdinalEncoder object
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

try:
    from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20
    from sklearn.preprocessing import OneHotEncoder
except ImportError:
    from future_encoders import OneHotEncoder # Scikit-Learn < 0.20

cat_encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; call housing_cat_1hot.toarray() to get a dense array
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
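The difference between the two encoders can be seen on a toy column. The four sample values below are chosen for illustration only; the sketch shows that OrdinalEncoder yields one code per row while OneHotEncoder yields one binary column per category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical column (a 2-D DataFrame, as the encoders require)
cats = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})

ordinal = OrdinalEncoder()
codes = ordinal.fit_transform(cats)             # shape (4, 1): one numeric code per row

one_hot = OneHotEncoder()
dense = one_hot.fit_transform(cats).toarray()   # shape (4, 3): one column per category

print(ordinal.categories_)   # categories in sorted order; a code is the position in this list
print(codes.ravel())
print(dense)
```

In the one-hot output every row contains exactly one 1, so no category is numerically "closer" to another.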

(4). Custom transformations
Custom features can be added at this step
Note that we need to set validate=False because the data contains non-float values (validate defaults to False from Scikit-Learn 0.22 onward).

from sklearn.preprocessing import FunctionTransformer

# get the right column indices
rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]
    
def add_extra_features(X, add_bedrooms_per_room=True):
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
    population_per_household = X[:, population_ix] / X[:, household_ix]
    if add_bedrooms_per_room:
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household,
                     bedrooms_per_room]
    else:
        return np.c_[X, rooms_per_household, population_per_household]

attr_adder = FunctionTransformer(add_extra_features, validate=False,
                                 kw_args={"add_bedrooms_per_room": False})
housing_extra_attribs = attr_adder.fit_transform(housing.values)
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"],
    index=housing.index)
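The way kw_args reaches the wrapped function can be checked on a small numeric array. This is a simplified sketch: add_ratio and its parameters are hypothetical names for this example, not part of the code above.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def add_ratio(X, numerator_ix=0, denominator_ix=1):
    """Append the column X[:, numerator_ix] / X[:, denominator_ix] to X."""
    ratio = X[:, numerator_ix] / X[:, denominator_ix]
    return np.c_[X, ratio]

X_toy = np.array([[6.0, 2.0, 3.0],
                  [8.0, 4.0, 1.0]])

# kw_args is forwarded to add_ratio as keyword arguments on every transform call
ratio_adder = FunctionTransformer(add_ratio, validate=False,
                                  kw_args={"numerator_ix": 0, "denominator_ix": 2})
X_extra = ratio_adder.fit_transform(X_toy)
print(X_extra)   # the original three columns plus column 0 divided by column 2
```

Exposing such options as keyword arguments makes them easy to toggle later, for example when searching over preprocessing choices.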

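(5). Transformation pipelines
Run the transformations above as a pipeline, because they must be applied in a fixed order
In sklearn, when a Pipeline is fitted, the first n-1 steps call fit_transform() and the last step calls only fit().

scikit-learn official documentation
The fundamental reason pipelines are used in machine learning is that the parameters learned during fitting are reused on new data (such as the test set).

The pipeline mechanism wraps and manages all of the steps as one streamlined workflow (streaming workflows with pipelines).

Note: pipelines are more an innovation in programming technique than in algorithms.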

Pipeline execution flow

The intermediate steps of a Pipeline are scikit-learn-compatible transformers, and the last step is an estimator. For example, the StandardScaler and PCA transformers form the intermediate steps, with LogisticRegression as the final estimator.

When we execute pipe_lr.fit(X_train, y_train), StandardScaler first runs fit and transform on the training set, and the transformed data is passed to the next step of the Pipeline object, PCA(). Like StandardScaler, PCA runs fit and transform, and finally passes the transformed data to LogisticRegression. The original post illustrated the whole flow with a figure:
[Figure: Pipeline execution flow (image unavailable in the source)]
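The execution flow described above can be checked by fitting a pipeline and then repeating the same steps by hand. The sketch below uses randomly generated data, and pipe_lr here is a reconstruction based on the description (StandardScaler, PCA, LogisticRegression), not the original author's exact code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))                      # synthetic features
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(50, 5))

pipe_lr = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2, random_state=0)),
    ("clf", LogisticRegression()),
])
pipe_lr.fit(X_train, y_train)           # fit_transform on the first two steps, fit on the last
pipe_pred = pipe_lr.predict(X_test)     # transform only on the first two steps, then predict

# The same flow written out by hand
scaler = StandardScaler()
pca = PCA(n_components=2, random_state=0)
clf = LogisticRegression()
X_train_reduced = pca.fit_transform(scaler.fit_transform(X_train))
clf.fit(X_train_reduced, y_train)
manual_pred = clf.predict(pca.transform(scaler.transform(X_test)))

print(np.array_equal(pipe_pred, manual_pred))
```

Note that the test set is only ever passed through transform(), so the scaling and PCA parameters learned on the training set are the ones reused on new data.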

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler  # standardization

# 1. fill in missing numeric values (text attributes need different handling)
# 2. add extra features
# 3. standardize
num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', FunctionTransformer(add_extra_features, validate=False)),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

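The house prices are capped at an upper limit, so decide during preprocessing whether the capped samples should be removed.
Many of the histograms are skewed, with most values on the left and a long tail to the right; later transformations can bring them closer to a normal distribution.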
