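Transformers
Most machine learning algorithms work more easily with numbers, so we first convert the text attribute to numbers. sklearn provides transformers for this kind of task.
The data to convert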
# Load libraries
import os
import tarfile
import pandas as pd
import urllib.request  # standard library in Python 3; replaces `from six.moves import urllib`
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
# Load the data
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
housing = load_housing_data()
housing.head()  # show the first five rows
housing_cat = housing[['ocean_proximity']]
housing_cat.head(10)
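The text values are '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', and 'NEAR OCEAN', and they need to be converted to numbers. There are several approaches:
- Ordinal encoding transformer: map '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN' directly to the integers 0, 1, 2, 3, 4. The output is a single column.
- One-hot encoding transformer: split '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN' into 5 Boolean columns. For example, row 13769 is 'INLAND', so it becomes 0,1,0,0,0.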
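To make the two approaches concrete, here is a minimal pure-Python sketch of what each encoding produces. It is only an illustration of the idea, not how sklearn implements it; the helper names `ordinal_encode` and `one_hot_encode` and the three-value sample are made up for this example.

```python
# Categories in sorted order, which is also the order sklearn's encoders use by default.
CATEGORIES = sorted(["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"])

def ordinal_encode(values, categories):
    """Map each category to its position in `categories` (one output column)."""
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values]

def one_hot_encode(values, categories):
    """Map each category to a row with a single 1 (one output column per category)."""
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return rows

sample = ["INLAND", "NEAR OCEAN", "<1H OCEAN"]  # hypothetical sample values
ordinal = ordinal_encode(sample, CATEGORIES)
one_hot = one_hot_encode(sample, CATEGORIES)
print(ordinal)     # [1, 4, 0]
print(one_hot[0])  # [0, 1, 0, 0, 0]
```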
Ordinal encoding transformer: OrdinalEncoder
# Convert the ocean_proximity text to numbers
from sklearn.preprocessing import OrdinalEncoder  # ordinal encoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]
# List the categories found in ocean_proximity
print(ordinal_encoder.categories_)
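Drawback: a learning algorithm will assume that two nearby numbers are more similar than two distant ones, which is not true here. Categories 0 and 4 ('<1H OCEAN' and 'NEAR OCEAN') may well be more similar to each other than categories 0 and 1.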
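The drawback can be seen with a little arithmetic. The sketch below assumes the categories are numbered 0 to 4 in sorted order and compares distances under the two encodings; the helper names are made up for this illustration.

```python
import math

CATEGORIES = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]
CODES = {cat: i for i, cat in enumerate(CATEGORIES)}

def ordinal_distance(a, b):
    """Distance between two categories when each is a single integer code."""
    return abs(CODES[a] - CODES[b])

def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors of two categories."""
    vec_a = [1 if i == CODES[a] else 0 for i in range(len(CATEGORIES))]
    vec_b = [1 if i == CODES[b] else 0 for i in range(len(CATEGORIES))]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(vec_a, vec_b)))

# Ordinal codes make '<1H OCEAN' look four times farther from 'NEAR OCEAN' than from 'INLAND'.
far = ordinal_distance("<1H OCEAN", "NEAR OCEAN")  # 4
near = ordinal_distance("<1H OCEAN", "INLAND")     # 1

# With one-hot vectors every pair of distinct categories is equally far apart.
d1 = one_hot_distance("<1H OCEAN", "NEAR OCEAN")
d2 = one_hot_distance("<1H OCEAN", "INLAND")
```

One-hot encoding removes the artificial ordering, which is why it is usually preferred for categories that have no natural order.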
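One-hot encoding transformer: OneHotEncoder
Returning a sparse matrix
from sklearn.preprocessing import OneHotEncoder  # one-hot encoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot
# List the categories found in ocean_proximity
cat_encoder.categories_
# By default the result is a sparse matrix, which stores only the positions of the nonzero elements and so saves memory; convert it to a NumPy array to display it
housing_cat_1hot.toarray()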
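To see why storing only the nonzero positions saves memory, here is a toy sketch of the idea in plain Python. It is not SciPy's actual storage format (which is more elaborate), and the function names are invented for this illustration; it also assumes every nonzero value is 1, as in a one-hot matrix.

```python
def to_coordinates(dense):
    """Keep only the (row, col) positions of the nonzero cells, plus the shape."""
    n_rows = len(dense)
    n_cols = len(dense[0]) if dense else 0
    coords = [(r, c)
              for r, row in enumerate(dense)
              for c, value in enumerate(row)
              if value != 0]
    return n_rows, n_cols, coords

def to_dense(n_rows, n_cols, coords):
    """Rebuild the full matrix; valid here because every nonzero value is 1."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for r, c in coords:
        dense[r][c] = 1.0
    return dense

one_hot = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
n_rows, n_cols, coords = to_coordinates(one_hot)
restored = to_dense(n_rows, n_cols, coords)
```

Each one-hot row holds exactly one nonzero value, so 15 cells shrink to 3 stored positions here, and the saving grows with the number of categories.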
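Returning a dense 2-D array
from sklearn.preprocessing import OneHotEncoder  # one-hot encoder
# Pass sparse_output=False to get a dense 2-D array instead of a sparse matrix
# (on scikit-learn versions before 1.2 the parameter is named sparse)
cat_encoder = OneHotEncoder(sparse_output=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot
# List the categories found in ocean_proximity
cat_encoder.categories_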
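Custom transformers
Use a custom transformer to add the features rooms_per_household and population_per_household
Effect of the custom function: median_house_value is gone; two columns, rooms_per_household and population_per_household, are added
Operates on: a subset of housing.columns
BaseEstimator & TransformerMixin
Hard coding: in software implementation, writing input- or output-related parameters (for example paths, output form, formats) directly into the source code, rather than responding at run time to settings, resources, data, or formats supplied from outside.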
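As a small illustration of why looking indices up by name is safer than hard-coding them, consider the sketch below. The column list is assumed to match the order in housing.csv, and `column_indices` is a made-up helper for this example.

```python
# Column order assumed to match housing.csv
columns = ["longitude", "latitude", "housing_median_age", "total_rooms",
           "total_bedrooms", "population", "households", "median_income",
           "median_house_value", "ocean_proximity"]
wanted = ("total_rooms", "total_bedrooms", "population", "households")

def column_indices(columns, names):
    """Look up each column's position by name instead of writing 3, 4, 5, 6 by hand."""
    cols = list(columns)
    return [cols.index(name) for name in names]

original_ix = column_indices(columns, wanted)

# If the columns are later reordered, hard-coded 3, 4, 5, 6 would silently
# point at the wrong data, while the lookup still finds the right columns.
reordered = columns[2:] + columns[:2]  # move longitude and latitude to the end
reordered_ix = column_indices(reordered, wanted)
```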
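def __init__: defines a class's initializer method
__init__() is called by default when an object is created; you do not call it by hand
In __init__(self), there is one parameter by default, named self
If two arguments are passed when the object is created, __init__ needs two more parameters besides self as its first parameter, that is, __init__(self, x, y)
The self parameter of __init__(self) is not passed by the developer; the Python interpreter automatically passes in a reference to the current object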
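A tiny example of these rules, using a hypothetical `Point` class that is not part of the housing code:

```python
class Point:
    created = 0  # counts how many times __init__ has run

    def __init__(self, x, y):
        # self is supplied by the interpreter; x and y come from the caller
        self.x = x
        self.y = y
        Point.created += 1

# Two arguments are passed; __init__ runs automatically and receives
# the new object as self plus the two arguments.
p = Point(3, 4)
```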
# Transformer 1
import numpy as np  # needed for np.c_ below
from sklearn.base import BaseEstimator, TransformerMixin
# get the right column indices: safer than hard-coding indices 3, 4, 5, 6
# these are the columns used in the calculations
rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):  # adds combined attributes
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            # np.c_ concatenates the arrays column by column
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
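The two base classes are what make a plain class behave like an sklearn transformer: BaseEstimator supplies get_params() and set_params(), and TransformerMixin supplies fit_transform(). The sketch below shows this on a tiny array; `RatioAdder` and its parameters are hypothetical names chosen for this illustration, and it assumes numpy and scikit-learn are installed.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):  # hypothetical example class
    def __init__(self, numerator_ix=0, denominator_ix=1):
        # Store the arguments unchanged so get_params()/set_params() can find them
        self.numerator_ix = numerator_ix
        self.denominator_ix = denominator_ix

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        ratio = X[:, self.numerator_ix] / X[:, self.denominator_ix]
        return np.c_[X, ratio]  # append the ratio as a new last column

X = np.array([[6.0, 2.0],
              [9.0, 3.0]])
adder = RatioAdder()
params = adder.get_params()     # provided by BaseEstimator
X_out = adder.fit_transform(X)  # provided by TransformerMixin: fit, then transform
```

Avoiding *args and **kwargs in __init__ is what lets BaseEstimator discover the hyperparameters from the constructor signature.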
# Transformer 2
from sklearn.preprocessing import FunctionTransformer
def add_extra_features(X, add_bedrooms_per_room=True):  # adds the extra features; add_bedrooms_per_room is a flag
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]  # rooms per household
    population_per_household = X[:, population_ix] / X[:, household_ix]  # people per household
    if add_bedrooms_per_room:
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]  # share of rooms that are bedrooms
        return np.c_[X, rooms_per_household, population_per_household,
                     bedrooms_per_room]
    else:
        return np.c_[X, rooms_per_household, population_per_household]
attr_adder = FunctionTransformer(add_extra_features,  # func: the custom function that adds the extra features
                                 validate=False,  # validate: bool, default=False; input validation is off
                                 kw_args={"add_bedrooms_per_room": False})  # kw_args: dict, default=None; extra keyword arguments passed to func
housing_extra_attribs = attr_adder.fit_transform(housing.values)
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns) + ["rooms_per_household", "population_per_household"],
    index=housing.index)
housing_extra_attribs.head()
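Before the transformation
After the transformation
After the transformation, the row ids are not in sequential order
median_income and all the columns before it are unchanged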
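Source code on GitHub
The source covers the whole chapter; this post pulls out part of it to work through the details.
Thanks to the experts!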