Python Project Development: Data Preprocessing Module

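Why preprocessing is necessary

Before data is fed into a model for training, it is usually dirty: it may contain missing values, inconsistent data types, or outliers, and it may need to be standardized.

In general, preprocessing covers imputation, standardization, feature encoding, and discretization. Before any of this work, you need to know which features in your dataset are numerical and which are categorical, and which features have missing values.

Data: https://github.com/yushiyin/handson-ml/tree/master/datasets/housing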

import pandas as pd

# Load the housing data; assumes housing.csv has been downloaded to the working directory
housing = pd.read_csv("./housing.csv")
housing.head()

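Preparation:

housing.info()      # dtype and non-null count of each column
housing["ocean_proximity"].value_counts()  # frequency of each category
# Which samples (rows) contain missing values
housing[housing.isnull().any(axis=1)].head()
# Which cells are missing
housing.isnull()
# Which features (columns) contain missing values
housing.isnull().any()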
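
The checks above can be condensed into a short summary. The sketch below uses a toy DataFrame (the values are made up for illustration and are not read from housing.csv), splits the columns into numerical and categorical lists with `select_dtypes`, and counts the missing values per column. The names `toy`, `num_attribs` and `cat_attribs` are assumptions introduced here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; values are illustrative only
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# Split the columns by dtype: numbers vs. everything else
num_attribs = toy.select_dtypes(include="number").columns.tolist()
cat_attribs = toy.select_dtypes(exclude="number").columns.tolist()

# Number and share of missing values per column
missing_count = toy.isnull().sum()
missing_ratio = toy.isnull().mean()

print(num_attribs, cat_attribs)
print(missing_count[missing_count > 0])
```

The two column lists are the kind of input a column selector needs later on.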

Now for the actual work. Note that all of the processing below is carried out on the training set.
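
The original does not show how the training set is obtained. A minimal sketch, assuming scikit-learn's `train_test_split` and an illustrative 80/20 split on a toy DataFrame (the split ratio, `random_state` and the data are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the housing DataFrame
toy = pd.DataFrame({
    "median_income": range(10),
    "median_house_value": range(100, 110),
})

# Hold out 20% of the rows; the transformers are then fitted on train_set only
train_set, test_set = train_test_split(toy, test_size=0.2, random_state=42)
print(len(train_set), len(test_set))
```

Fitting imputers and scalers on `train_set` alone keeps information from the held-out rows out of the learned statistics.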

Data conversion module (DataFrame to array):

# Pass in the names of the columns (numerical or categorical) to select
from sklearn.base import BaseEstimator, TransformerMixin

# Select numerical or categorical columns from a DataFrame and return them
# as a NumPy array, so that the later transformers can work on plain arrays
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
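
A short usage sketch of the selector, on a toy DataFrame with made-up values. The class is repeated so that the snippet runs on its own; `toy`, `num_array` and `cat_array` are names introduced here for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Same selector as above, repeated so this snippet is self-contained."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# Toy data; values are illustrative only
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0],
    "households": [126.0, 1138.0],
    "ocean_proximity": ["NEAR BAY", "INLAND"],
})

# One selector per group of columns; each returns a 2-D NumPy array
num_array = DataFrameSelector(["total_rooms", "households"]).fit_transform(toy)
cat_array = DataFrameSelector(["ocean_proximity"]).fit_transform(toy)
print(num_array.shape, cat_array.shape)
```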

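Imputation module:

# Fill missing values with the median
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer, removed in scikit-learn 0.22
# The median is only defined for numerical columns, so drop the categorical one
housing_num = housing.drop("ocean_proximity", axis=1)
imputer = SimpleImputer(strategy="median")  # other strategies: "mean", "most_frequent"
imputer.fit(housing_num)
imputer.statistics_  # the median learned for each column
X = imputer.transform(housing_num)  # note that the result is a NumPy array
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)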
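
Because the imputer is meant to learn its statistics from the training set, the same fitted object is then reused on new data. A minimal sketch on toy values, assuming a recent scikit-learn in which the imputer class is `SimpleImputer` from `sklearn.impute`; `train_num` and `test_num` are hypothetical names:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numerical data; values are illustrative only
train_num = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 500.0]})
test_num = pd.DataFrame({"total_bedrooms": [np.nan, 250.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train_num)  # the median is computed from the training rows only

# Apply the training median to both sets and restore the DataFrame layout
train_filled = pd.DataFrame(imputer.transform(train_num),
                            columns=train_num.columns, index=train_num.index)
test_filled = pd.DataFrame(imputer.transform(test_num),
                           columns=test_num.columns, index=test_num.index)
print(imputer.statistics_)
```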

Combining features to form new features:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
# Column indices of total_rooms, total_bedrooms, population, households in the numerical array
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
class CombinedAttributesAdder(BaseEstimator,TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
            
attr_adder = CombinedAttributesAdder()
housing_extra_attribs = attr_adder.transform(X)
housing_extra_attribs.shape
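
The transformer above depends on hard-coded column positions. While exploring, the same three ratios can be computed by column name in pandas, which makes them easy to inspect. This is only an alternative sketch on toy values, not the method used above:

```python
import pandas as pd

# Toy data; values are illustrative only
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0],
    "total_bedrooms": [129.0, 1106.0],
    "population": [322.0, 2401.0],
    "households": [126.0, 1138.0],
})

# Same ratios as CombinedAttributesAdder, addressed by column name
extra = toy.assign(
    rooms_per_household=toy["total_rooms"] / toy["households"],
    population_per_household=toy["population"] / toy["households"],
    bedrooms_per_room=toy["total_bedrooms"] / toy["total_rooms"],
)
print(extra.shape)
```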

Standardization module:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(housing_extra_attribs)
std_housing=scaler.transform(housing_extra_attribs)
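
A quick way to check the result: after standardization each column should have mean 0 and standard deviation 1, and new data is scaled with the statistics learned during `fit`. A minimal sketch on a toy array (the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays; values are illustrative only
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

scaler = StandardScaler()
std_train = scaler.fit_transform(train)  # learn mean/std on train, then scale it
std_test = scaler.transform(test)        # reuse the training mean/std

print(std_train.mean(axis=0), std_train.std(axis=0))
print(scaler.mean_, scaler.scale_)
```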

Feature encoding module
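
The original gives no code for this step. A minimal sketch, assuming that one-hot encoding of the categorical column `ocean_proximity` with scikit-learn's `OneHotEncoder` is what is intended; the toy values and the `handle_unknown="ignore"` setting are assumptions:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column; values are illustrative only
toy_cat = pd.DataFrame(
    {"ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"]}
)

# Categories unseen during fit are encoded as all zeros instead of raising
encoder = OneHotEncoder(handle_unknown="ignore")
onehot = encoder.fit_transform(toy_cat)  # sparse matrix by default
dense = onehot.toarray()                 # dense array, one column per category

print(encoder.categories_)
print(dense)
```

Each row contains a single 1 in the column of its category, so the encoding implies no ordering between categories.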
