【Introduction to Machine Learning with Python】(Part 4) - CSDN Blog

Article link: https://blog.csdn.net/Algernon98/article/details/125124449

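Data Representation and Feature Engineering

So far, we have assumed that our data comes as a two-dimensional array of floating-point numbers, where each column is a continuous feature describing the data points. For many applications, this is not how the data is collected. A particularly common type of feature is the categorical feature, also known as a discrete feature.

The question of how to find the best representation of the data for a particular application is known as feature engineering.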

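Categorical Variables

One-Hot Encoding (Dummy Variables)

By far the most common way to represent categorical variables is one-hot encoding, also called one-out-of-N encoding or dummy variables.
The idea behind dummy variables is to replace a categorical variable with one or more new features that take the values 0 and 1.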
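As a minimal sketch of this idea, the toy column below (made-up values, not taken from the adult dataset) is replaced by one 0/1 column per category. The `dtype=int` argument is only there so that the result shows 0 and 1 rather than booleans on recent pandas versions.

```python
import pandas as pd

# Toy data (hypothetical, not from the adult dataset)
toy = pd.DataFrame({"workclass": ["Private", "State-gov", "Private", "Self-emp"]})

# One new 0/1 column per category; dtype=int forces integers instead of booleans
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)

# Every row has exactly one column set to 1
print(encoded.sum(axis=1).tolist())
```

Each row has exactly one active column, which is where the name "one-hot" comes from.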

The data comes from the 1994 US census database (download: https://archive.ics.uci.edu/ml/datasets/Adult).
First, we use pandas to load it from a comma-separated values (CSV) file:

import pandas as pd
from IPython.display import display

data=pd.read_csv("data/adult.data",header=None,index_col=False,names=['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship','race','gender','capital-gain','capital-loss','hours-per-week','native-country','income'])

# For illustration purposes, we only select some of the columns
data=data[['age','workclass','education','gender','hours-per-week','occupation','income']]
display(data.head())

Result:

age workclass … occupation income
0 39 State-gov … Adm-clerical <=50K
1 50 Self-emp-not-inc … Exec-managerial <=50K
2 38 Private … Handlers-cleaners <=50K
3 53 Private … Handlers-cleaners <=50K
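4 28 Private … Prof-specialty <=50K
[5 rows x 7 columns]

1. Checking string-encoded categorical data
After reading a dataset, it is a good idea to first check whether each column actually contains meaningful categorical data.

print(data.gender.value_counts())
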
Male 21790
Female 10771
Name: gender, dtype: int64
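The gender column here is clean, with exactly two values. To illustrate why the check matters, here is a small sketch on a made-up column in which the same category appears under several spellings. The values and the cleaning rule (strip whitespace, lowercase) are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical column with inconsistent spellings of the same two categories
raw = pd.Series([" Male", "male", "Female", " female ", "Male", "Male"], name="gender")

# value_counts exposes the problem: five distinct labels for two real categories
print(raw.value_counts())

# One possible cleanup: strip surrounding whitespace and lowercase everything
cleaned = raw.str.strip().str.lower()
print(cleaned.value_counts())
```

Without such a cleanup, one-hot encoding would create a separate column for every spelling.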

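pandas offers a very simple way to encode the data: the get_dummies function.
get_dummies automatically transforms all columns that have object type (such as strings) or that are categorical: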

print("Original features:\n",list(data.columns),"\n")
data_dummies=pd.get_dummies(data)
print("features after get_dummies:\n",list(data_dummies.columns))

Output:

Original features:
 ['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income'] 

features after get_dummies:
 ['age', 'hours-per-week', 'workclass_ ?', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'gender_ Female', 'gender_ Male', 'occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial', 'occupation_ Farming-fishing', 'occupation_ Handlers-cleaners', 'occupation_ Machine-op-inspct', 'occupation_ Other-service', 'occupation_ Priv-house-serv', 'occupation_ Prof-specialty', 'occupation_ Protective-serv', 'occupation_ Sales', 'occupation_ Tech-support', 'occupation_ Transport-moving', 'income_ <=50K', 'income_ >50K']
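The generated names such as 'workclass_ ?' contain a space after the underscore because the values in adult.data are separated by a comma followed by a space, so every string value starts with a space. The sketch below uses two made-up rows in the same layout to show one way to avoid this, the `skipinitialspace` option of `pd.read_csv`. The post itself does not use this option, so the column names in the rest of the text keep the space.

```python
import io
import pandas as pd

# Two made-up rows in the same ", "-separated layout as adult.data
csv_text = "39, State-gov, Male\n50, Private, Female\n"
names = ["age", "workclass", "gender"]

# Default parsing keeps the space after each comma as part of the string value
plain = pd.read_csv(io.StringIO(csv_text), header=None, names=names)

# skipinitialspace=True drops the spaces that follow the delimiter
stripped = pd.read_csv(io.StringIO(csv_text), header=None, names=names,
                       skipinitialspace=True)

print(list(pd.get_dummies(plain).columns))
print(list(pd.get_dummies(stripped).columns))
```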

The continuous features age and hours-per-week are unchanged, while each categorical feature has been expanded into one new feature per possible value:

display(data_dummies.head())

age hours-per-week … income_ <=50K income_ >50K
0 39 40 … 1 0
1 50 13 … 1 0
2 38 40 … 1 0
3 53 40 … 1 0
4 28 40 … 1 0
[5 rows x 46 columns]
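Because get_dummies only expands string and categorical columns, a category that happens to be stored as integers is left as a single numeric column. The following sketch, on a made-up frame with hypothetical column names `zone` and `color`, shows the default behaviour and the `columns` parameter as one way to force the encoding.

```python
import pandas as pd

# Hypothetical frame: "zone" is a category stored as integers
demo = pd.DataFrame({"zone": [0, 1, 2, 1],
                     "color": ["red", "blue", "red", "green"]})

# By default only the string column is expanded; "zone" stays numeric
default = pd.get_dummies(demo)
print(list(default.columns))

# Listing the columns explicitly expands the integer column as well
explicit = pd.get_dummies(demo, columns=["zone", "color"])
print(list(explicit.columns))
```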

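Next, we use the values attribute to convert the data_dummies DataFrame into a NumPy array, and then train a machine learning model on it.
Before training, take care to separate the target variable (now encoded in the two income columns) from the data.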

features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']

# Extract NumPy arrays
X=features.values

y = data_dummies['income_ >50K'].values
print("X.shape:{}  y.shape:{}".format(X.shape,y.shape))

X.shape:(32561, 44) y.shape:(32561,)
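The label-based slice above relies on the income columns coming last. As an alternative sketch, the target columns can be dropped by name prefix, which does not depend on column order. The small frame below is a made-up stand-in for data_dummies, and the names `X_demo` and `y_demo` are hypothetical.

```python
import pandas as pd

# Tiny stand-in for data_dummies (hypothetical values)
dummies = pd.DataFrame({
    "age": [39, 50, 38],
    "gender_ Male": [1, 1, 0],
    "income_ <=50K": [1, 0, 1],
    "income_ >50K": [0, 1, 0],
})

# Every column derived from the target shares the "income_" prefix
target_cols = [c for c in dummies.columns if c.startswith("income_")]

# Features: everything except the target columns; target: one of the two income columns
X_demo = dummies.drop(columns=target_cols).to_numpy()
y_demo = dummies["income_ >50K"].to_numpy()
print(X_demo.shape, y_demo.shape)
```

Leaving either income column among the features would leak the answer into the model.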

The data is now in a form that scikit-learn can work with, so we can proceed as before:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0)
logreg=LogisticRegression()
logreg.fit(X_train,y_train)
print("Test score:{:.2f}".format(logreg.score(X_test,y_test)))

Test score:0.81

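Binning, Discretization, Linear Models, and Trees

The best way to represent data depends not only on the semantics of the data, but also on the kind of model being used.
Linear models and tree-based models (such as decision trees, gradient boosted trees, and random forests) are two large and very commonly used families of models, and they behave very differently with respect to different feature representations.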

import numpy as np
import mglearn
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X,y=mglearn.datasets.make_wave(n_samples=100)

line=np.linspace(-3,3,1000,endpoint=False).reshape(-1,1)
reg=DecisionTreeRegressor(min_samples_split=3).fit(X,y)

plt.plot(line,reg.predict(line),label="decision tree")

reg=LinearRegression().fit(X,y)
plt.plot(line,reg.predict(line),label='linear regression')
plt.plot(X[:,0],y,'o',c='k')

plt.legend(loc="best")
plt.ylabel("Regression output")
plt.xlabel("Input feature")


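One way to make linear models more powerful on continuous data is to use binning (also known as discretization) of the feature, which splits it into multiple features.
Suppose we partition the input range of the feature into a fixed number of bins, say 10. A data point can then be represented by the bin it falls into.
To do this, we first have to define the bins.
In this example, we define 10 equally spaced bins between -3 and 3.
We use np.linspace to create 11 entries, which yields 10 bins, namely the spaces between two consecutive boundaries: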

bins=np.linspace(-3,3,11)
print("bins:{}".format(bins))

bins:[-3. -2.4 -1.8 -1.2 -0.6 0. 0.6 1.2 1.8 2.4 3. ]

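Here, the first bin contains all data points with feature values from -3 to -2.4, the second bin contains all data points with feature values from -2.4 to -1.8, and so on.

Next, we record which bin each data point falls into.
This is easily computed with the np.digitize function: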

which_bin=np.digitize(X,bins=bins)
print("\nData points:\n",X[:5])
print("\nBin membership for data points:\n",which_bin[:5])

Output:

Data points:
 [[-0.75275929]
 [ 2.70428584]
 [ 1.39196365]
 [ 0.59195091]
 [-2.06388816]]

Bin membership for data points:
 [[ 4]
 [10]
 [ 8]
 [ 6]
 [ 2]]
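To make the indexing of np.digitize concrete, here is a small sketch with made-up edges and points. With the default settings, a point x gets index i when edges[i-1] <= x < edges[i], so indices start at 1, values below the first edge get 0, and values at or above the last edge get len(edges).

```python
import numpy as np

edges = np.array([0.0, 1.0, 2.0, 3.0])               # 4 edges -> 3 bins
points = np.array([-1.0, 0.0, 0.5, 1.0, 2.9, 3.0])   # made-up values

idx = np.digitize(points, bins=edges)
print(idx)   # [0 1 1 2 3 4]
```

A point that lies exactly on an edge belongs to the bin that starts there, which is why 1.0 lands in bin 2 and 3.0 falls outside the last bin.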

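from sklearn.preprocessing import OneHotEncoder
# Transform using OneHotEncoder
# (sparse_output requires scikit-learn >= 1.2; older versions use sparse=False)
encoder=OneHotEncoder(sparse_output=False)
# encoder.fit finds the unique values that appear in which_bin
encoder.fit(which_bin)
# transform creates the one-hot encoding
X_binned=encoder.transform(which_bin)
# Column 5 indicates, for each of the 100 data points, membership in the sixth bin
print(X_binned[:,5])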

Output:

[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 1. 0. 0. 0.]

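Next, we build a new linear regression model and a new decision tree model on the one-hot encoded data. The result is shown below, with the bin boundaries marked by vertical lines: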

line_binned=encoder.transform(np.digitize(line,bins=bins))

reg=LinearRegression().fit(X_binned,y)
plt.plot(line,reg.predict(line_binned),label='linear regression binned')

reg=DecisionTreeRegressor(min_samples_split=3).fit(X_binned,y)
plt.plot(line,reg.predict(line_binned),label='decision tree binned')
plt.plot(X[:,0],y,'o',c='k')
plt.vlines(bins,-3,3,linewidth=1,alpha=.2)
plt.legend(loc="best")
plt.ylabel("Regression output")
plt.xlabel("Input feature")

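The two lines lie exactly on top of each other, which means the linear regression model and the decision tree make exactly the same predictions: within each bin, both predict a constant value.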
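One way to see why the predictions coincide is that a least-squares fit on one-hot bin indicators ends up predicting the mean target of each bin. The sketch below checks this with plain NumPy on made-up data. It fits without an intercept, which is a simplification compared with LinearRegression, and it assumes every bin contains at least one point.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=200)                  # made-up 1-D feature
y = np.sin(x) + rng.normal(scale=0.1, size=200)   # made-up noisy target

edges = np.linspace(-3, 3, 5)                     # 5 edges -> 4 bins
idx = np.digitize(x, bins=edges) - 1              # bin index 0..3 for each point
one_hot = np.eye(4)[idx]                          # one indicator column per bin

# Least-squares fit without an intercept: one coefficient per bin
coef, *_ = np.linalg.lstsq(one_hot, y, rcond=None)

# Mean of the target inside each bin
bin_means = np.array([y[idx == k].mean() for k in range(4)])

print(np.allclose(coef, bin_means))
```

A decision tree that can only split on the bin indicators has no finer information either, so the best it can do is also one constant per bin.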

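Interaction Features and Polynomial Features

Another way to enrich a feature representation, particularly for linear models, is to add interaction features and polynomial features of the original data. This kind of feature engineering is common in statistical modeling, but it is also used in many practical machine learning applications.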
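As a brief sketch of what such features look like, scikit-learn's PolynomialFeatures can generate them. This class is not used elsewhere in the post, and the single input row is made up for illustration. For two features a and b with degree 2, the result contains a, b, a², the interaction a·b, and b².

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_small = np.array([[2.0, 3.0]])      # one made-up sample with features a=2, b=3

# include_bias=False leaves out the constant column of ones
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_small)

print(X_poly)   # [[2. 3. 4. 6. 9.]]
```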

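Univariate Nonlinear Transformations

Below we use a synthetic dataset of counts, with properties similar to those of datasets found in the wild.
The features are all integer-valued, while the response is continuous: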

rnd=np.random.RandomState(0)
X_org=rnd.normal(size=(1000,3))
w=rnd.normal(size=3)

X=rnd.poisson(10*np.exp(X_org))
y=np.dot(X_org,w)

print("Number of feature appearances:\n{}".format(np.bincount(X[:,0])))

Output:

Number of feature appearances:
[28 38 68 48 61 59 45 56 37 40 35 34 36 26 23 26 27 21 23 23 18 21 10  9
 17  9  7 14 12  7  3  8  4  5  5  3  4  2  4  1  1  3  2  5  3  8  2  5
  2  1  2  3  3  2  2  3  3  0  1  2  1  0  0  3  1  0  0  0  1  3  0  1
  0  2  0  1  1  0  0  0  0  1  0  0  2  2  0  1  1  0  0  0  0  1  1  0
  0  0  0  0  0  0  1  0  0  0  0  0  1  1  0  0  1  0  0  0  0  0  0  0
  1  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1]

We visualize the counts:

bins=np.bincount(X[:,0])
plt.bar(range(len(bins)),bins,color='r')
plt.ylabel("Number of appearances")
plt.xlabel("value")

We try fitting a ridge regression model:

from sklearn.linear_model import Ridge
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0)
score=Ridge().fit(X_train,y_train).score(X_test,y_test)
print("Test score:{:.3f}".format(score))

Test score:0.622

# Apply log(x+1) to the count features; adding 1 avoids log(0), since zero counts occur
X_train_log = np.log(X_train+1)
X_test_log=np.log(X_test+1)

plt.hist(X_train_log[:,0],bins=25,color='red')
plt.ylabel("number of appearances")
plt.xlabel("value")
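A natural follow-up is to fit the same ridge model on the log-transformed features and compare the two test scores. The sketch below regenerates the data exactly as above so that it runs on its own, and uses np.log1p(x), which computes log(x+1). The names `raw_score` and `log_score` are introduced here for illustration. Since y is linear in X_org and log(X+1) roughly undoes the exponential used to generate X, one would expect the transformed features to suit a linear model better.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Same synthetic count data as in the text
rnd = np.random.RandomState(0)
X_org = rnd.normal(size=(1000, 3))
w = rnd.normal(size=3)
X = rnd.poisson(10 * np.exp(X_org))
y = np.dot(X_org, w)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ridge on the raw counts versus Ridge on log(x + 1) of the counts
raw_score = Ridge().fit(X_train, y_train).score(X_test, y_test)
log_score = Ridge().fit(np.log1p(X_train), y_train).score(np.log1p(X_test), y_test)

print("raw: {:.3f}  log: {:.3f}".format(raw_score, log_score))
```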