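Why preprocessing is necessary
Raw data is usually dirty before it is fed to a model for training: it may contain missing values, inconsistent data types, or outliers, and its features may be on different scales that need standardizing.
Preprocessing generally includes imputation, standardization, feature encoding, and discretization. Before doing any of this, you need to know which features in your dataset are numerical and which are categorical, and which features have missing values.
Data: https://github.com/yushiyin/handson-ml/tree/master/datasets/housing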
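import os
import tarfile
import urllib.request  # Python 3 standard library; replaces six.moves.urllib
import pandas as pd
# Assumes housing.csv has been downloaded into the current directory
housing = pd.read_csv("./housing.csv")
housing.head()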
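Preparation:
housing.info()  # column types and non-null counts
housing["ocean_proximity"].value_counts()  # frequency of each category
# Which samples (rows) contain missing values
housing[housing.isnull().any(axis=1)].head()
# Which positions (cells) are missing
housing.isnull()
# Which features (columns) contain missing values
housing.isnull().any()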
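The checks above can be condensed into a short sketch that lists the numerical columns, the categorical columns, and the columns with missing values. It runs on a tiny hand-made DataFrame whose values are invented for illustration (they are not rows from housing.csv), and it treats every non-numeric column as categorical, which is an assumption that may not hold for every dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; values are illustrative only.
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# Split column names by dtype: numeric vs. everything else (treated as categorical).
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

# Count missing values per column and keep only the columns that have any.
missing_counts = df.isnull().sum()
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()

print("numerical:", num_cols)
print("categorical:", cat_cols)
print("with missing values:", cols_with_missing)
```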
Now for the actual work. Note that every processing step below is fitted on the training set only.
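A minimal sketch of what "training set only" can look like in practice: the transformer learns its statistics from the training split and then reuses them on the test split. The toy array, the 75/25 split, and the random seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy single-feature matrix with one missing value; values are illustrative only.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [6.0], [7.0], [8.0]])

X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                         # the median is learned from the training set only
X_train_filled = imputer.transform(X_train)
X_test_filled = imputer.transform(X_test)    # the test set reuses the training median

print(imputer.statistics_)
```

Fitting on the full dataset instead would let information from the test rows leak into the learned statistics.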
Data conversion module (DataFrame to array):
# Pass in the column names of the attributes (numerical or categorical) to select
from sklearn.base import BaseEstimator, TransformerMixin
# Create a class to select numerical or categorical columns,
# since the Scikit-Learn version used here does not handle DataFrames directly
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        # Select the requested columns and return them as a NumPy array
        return X[self.attribute_names].values
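As an alternative to a hand-written selector, scikit-learn 0.20 and later ship `ColumnTransformer`, which can pick DataFrame columns by name. The sketch below is a swapped-in technique, not the approach used in these notes; the toy DataFrame and the `num_attribs` list are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer

# Toy stand-in for the housing data; values are illustrative only.
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],
})
num_attribs = ["total_rooms", "total_bedrooms"]  # hypothetical list of numerical columns

# "passthrough" copies the selected columns unchanged; all other columns are dropped.
selector = ColumnTransformer(
    [("num", "passthrough", num_attribs)],
    remainder="drop",
)
num_array = selector.fit_transform(df)

print(type(num_array), num_array.shape)
```

Replacing `"passthrough"` with a transformer applies that transformer to the selected columns only.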
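Imputation module:
# Fill missing values with the median
# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
# housing_num is not defined earlier in these notes; it is assumed here to be
# the numerical columns only, since a median cannot be computed for text
housing_num = housing.drop("ocean_proximity", axis=1)
imputer = SimpleImputer(strategy="median")  # other strategies: "mean", "most_frequent"
imputer.fit(housing_num)
imputer.statistics_  # the learned median of each column
X = imputer.transform(housing_num)  # note that the result is a NumPy array
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)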
Combining features to form new features:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
# Column indexes of the source features in the numerical array
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
attr_adder = CombinedAttributesAdder()
# X is the imputed NumPy array produced by the imputation module
housing_extra_attribs = attr_adder.transform(X)
housing_extra_attribs.shape  # three more columns than X
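The hard-coded indexes 3, 4, 5, 6 silently break if the column order changes. One possible safeguard, sketched below, is to look the positions up from the column names. The toy DataFrame holds only the four relevant columns with invented values, so the resulting positions (0 to 3) differ from those in the full dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with only the four source columns; values are illustrative only.
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0],
    "total_bedrooms": [129.0, 1106.0],
    "population": [322.0, 2401.0],
    "households": [126.0, 1138.0],
})

# Look up each column's position by name instead of hard-coding it.
rooms_ix, bedrooms_ix, population_ix, household_ix = [
    df.columns.get_loc(name)
    for name in ("total_rooms", "total_bedrooms", "population", "households")
]

X = df.values
rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]

# Append the two ratio features as new columns on the right.
extended = np.c_[X, rooms_per_household, bedrooms_per_room]
print(extended.shape)
```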
Standardization module:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(housing_extra_attribs)
std_housing=scaler.transform(housing_extra_attribs)
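To make the effect of `StandardScaler` concrete, the sketch below compares its output with the formula z = (x - mean) / std computed by hand on a small made-up array. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales; values are illustrative only.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 600.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # fit and transform in one call

# The same computation by hand: subtract the column mean, divide by the column std.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_std, manual))
print(X_std.mean(axis=0), X_std.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1 (up to floating-point error), so no feature dominates simply because of its units.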
Encoding module