Project Development in Python: Data Preprocessing Module

Latest recommended article published 2024-05-10 16:54:35

maomaogo

Category: Project Development Modules. Tags: data standardization, missing-value imputation, one-hot encoding

Article link: https://blog.csdn.net/yushiyin1314/article/details/90735096

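Why preprocessing is necessary

Before data is fed into a model for training, it is usually dirty: it may contain missing values, inconsistent data types, or outliers, and it may need to be standardized.

Preprocessing generally includes imputation, standardization, feature encoding, and discretization. Before starting any of this, you need to know which features in your dataset are numerical and which are categorical, and which features have missing values.

Data: https://github.com/yushiyin/handson-ml/tree/master/datasets/housing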

import pandas as pd

# Load the housing data (housing.csv from the repository above, saved in the working directory)
housing = pd.read_csv("./housing.csv")
housing.head()

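Preliminary checks:

housing.info()      # variable types
housing["ocean_proximity"].value_counts()  # frequency counts of the categorical variable
# Which samples (rows) contain missing values
housing[housing.isnull().any(axis=1)].head()
# Which positions (cells) are missing
housing.isnull()
# Which features (columns) contain missing values
housing.isnull().any()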
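
As a complement to the checks above, the following sketch counts missing values per feature and separates numerical from categorical column names. The DataFrame `toy` and its values are made up for illustration; they only stand in for the housing data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (made-up values, for illustration only)
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# Number of missing values per feature
missing_counts = toy.isnull().sum()

# Split the column names by dtype: numerical columns vs. everything else
num_attribs = list(toy.select_dtypes(include=[np.number]).columns)
cat_attribs = list(toy.select_dtypes(exclude=[np.number]).columns)

print(missing_counts)
print(num_attribs, cat_attribs)
```

Lists such as `num_attribs` and `cat_attribs` can then be handed to whatever column-selection step comes next.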

Now for the actual work. Note that every processing step below is fitted on the training set only.
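
A minimal sketch of what "training set only" means in practice, assuming a random split with scikit-learn's `train_test_split` (the article does not say how its training set was produced) and a made-up single-column DataFrame: the imputer learns its median from the training rows and is then reused, without refitting, on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up single-feature data with two missing entries
toy = pd.DataFrame({
    "total_bedrooms": [100.0, 200.0, np.nan, 400.0, 500.0,
                       600.0, np.nan, 800.0, 900.0, 1000.0],
})

# Hold out 20% of the rows; random_state makes the split reproducible
train_set, test_set = train_test_split(toy, test_size=0.2, random_state=42)

imputer = SimpleImputer(strategy="median")
imputer.fit(train_set)                      # median learned from training rows only
train_filled = imputer.transform(train_set)
test_filled = imputer.transform(test_set)   # reuses the training median, no refit
```

Fitting on the full data instead would let information from the test rows leak into the preprocessing step.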

Data conversion module (DataFrame to array):

# Pass in the names of the desired attributes (numerical or categorical)
from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select numerical or categorical columns
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        # Return the selected columns as a NumPy array
        return X[self.attribute_names].values
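
A short usage sketch of the selector on a made-up DataFrame. The class is repeated so the snippet runs on its own; the column values are illustrative only.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Same selector as in the article, repeated so this snippet is self-contained."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# Made-up rows standing in for the housing data
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "population": [322.0, 2401.0, 496.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],
})

# Select the numerical and the categorical columns separately
num_values = DataFrameSelector(["total_rooms", "population"]).fit_transform(toy)
cat_values = DataFrameSelector(["ocean_proximity"]).fit_transform(toy)

print(num_values.shape, cat_values.shape)
```

Because the attribute names are passed as a list, the result is always two-dimensional, which is the shape scikit-learn transformers expect.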

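Data imputation module:

# Fill missing values with the median
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer, removed in scikit-learn 0.22

# Assumption: housing_num is the numerical part of the data (the median is undefined for text columns)
housing_num = housing.drop("ocean_proximity", axis=1)
imputer = SimpleImputer(strategy="median")  # "mean" and "most_frequent" are also available
imputer.fit(housing_num)
imputer.statistics_  # per-column medians learned by fit
X = imputer.transform(housing_num)  # note: the output is a NumPy array, not a DataFrame
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing.index)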
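
To make the difference between the three strategies concrete, here is a small sketch on a made-up one-column array, using `SimpleImputer` from `sklearn.impute`. The non-missing values are 1, 2, 2 and 10.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One feature with a single missing entry (row 3); values are made up
X_toy = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

fill_values = {}
for strategy in ("median", "mean", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    filled = imputer.fit_transform(X_toy)
    # The value written into the missing cell equals the learned statistic
    fill_values[strategy] = filled[3, 0]

print(fill_values)  # median -> 2.0, mean -> 3.75, most_frequent -> 2.0
```

The outlier 10 pulls the mean up to 3.75 while the median stays at 2, which is one reason the median is a common default for skewed features.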

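Combining existing features into new features:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Column indices of the source attributes in the numerical array
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6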
class CombinedAttributesAdder(BaseEstimator,TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
            
attr_adder = CombinedAttributesAdder()
housing_extra_attribs = attr_adder.transform(X)
housing_extra_attribs.shape
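
The ratio features can be checked by hand on a tiny array. This sketch repeats the same arithmetic outside the class, on one made-up row whose four columns are assumed to be rooms, bedrooms, population and households (so the indices are 0 to 3 here rather than 3 to 6).

```python
import numpy as np

# One made-up district: 100 rooms, 20 bedrooms, 50 people, 25 households
X_toy = np.array([[100.0, 20.0, 50.0, 25.0]])
rooms_ix, bedrooms_ix, population_ix, household_ix = 0, 1, 2, 3

rooms_per_household = X_toy[:, rooms_ix] / X_toy[:, household_ix]            # 100 / 25
population_per_household = X_toy[:, population_ix] / X_toy[:, household_ix]  # 50 / 25
bedrooms_per_room = X_toy[:, bedrooms_ix] / X_toy[:, rooms_ix]               # 20 / 100

# np.c_ appends the three new columns to the right of the original ones
X_extra = np.c_[X_toy, rooms_per_household, population_per_household,
                bedrooms_per_room]
print(X_extra.shape)  # (1, 7)
```

The new columns are appended after the originals, so existing column indices stay valid.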

Data standardization module:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(housing_extra_attribs)
std_housing=scaler.transform(housing_extra_attribs)
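
The imputation and standardization steps can also be chained so they are always applied in the same order. The following is a sketch using scikit-learn's `Pipeline` on a made-up array; it is one possible arrangement, not necessarily the one the author used, and a feature-combining transformer could be slotted in between the two steps.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numerical data with one missing value per column
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0],
                  [np.nan, 40.0]])

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill gaps first
    ("std_scaler", StandardScaler()),               # then scale to zero mean, unit variance
])

X_prepared = num_pipeline.fit_transform(X_toy)

# After standardization each column has mean ~0 and standard deviation ~1
print(X_prepared.mean(axis=0), X_prepared.std(axis=0))
```

Calling `num_pipeline.transform` on new data later reuses the medians, means and standard deviations learned during `fit_transform`.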

Data encoding module
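
The article's tags mention one-hot encoding. A minimal sketch of that step, assuming scikit-learn's `OneHotEncoder` and a made-up `ocean_proximity` column (this is an illustration, not necessarily the author's own code), might look like this:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column standing in for the housing data
toy = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR OCEAN", "INLAND"],
})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it to a dense array to inspect it
onehot = encoder.fit_transform(toy[["ocean_proximity"]]).toarray()

# One output column per category, in sorted order
categories = list(encoder.categories_[0])
print(categories)
print(onehot)
```

Each row contains exactly one 1, in the column of its category, so no artificial ordering is imposed on the categories.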
