Feature Engineering Practice (Part 2): Feature Enhancement

大地之灯

Last modified 2024-08-12 18:05:59

First published 2024-08-10 22:58:41

Article link: https://blog.csdn.net/qq_33489955/article/details/141097950

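This article is compiled from the materials of the 和鲸 (Heywhale) "Python Feature Engineering: Introduction and Practice" training camp, with my own understanding added (by GPT4o).

Original event link

Original author: 云中君, backend R&D engineer at a large tech company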

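0. Summary

This section looks closely at handling missing values in data, with a focus on repairing quantitative columns. Handling missing values well is a key step in data preprocessing and can improve a model's robustness and performance. The main topics and skills are summarized here.

The key steps are: inspect the label distribution and the correlations between columns; treat implausible values by replacing the zeros that were used as placeholders with None, and observe how the summary statistics change; try different imputation and scaling methods; and use grid search to find the best configuration. Note that imputation should happen after the train/test split: the fill values are computed on the training set and then applied to both the training and test sets, so that the reported score better reflects generalization.

Code: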

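# Use the Imputer class from scikit-learn's preprocessing module (older versions only)
from sklearn.preprocessing import Imputer

# Instantiate the imputer
imputer = Imputer(strategy='mean')

Imputer was deprecated in scikit-learn 0.20 and removed in 0.22. Use the SimpleImputer class from the sklearn.impute module instead. The updated code is:

from sklearn.impute import SimpleImputer

# Instantiate the imputer
imputer = SimpleImputer(strategy='mean')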
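To make the behavior of SimpleImputer concrete, here is a minimal sketch on a made-up array (the values and the names X_demo and imputer_demo are illustrative, not from the original notebook). It shows that fit learns one fill value per column and that transform applies it.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix (made-up values) with one missing entry per column
X_demo = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan]])

imputer_demo = SimpleImputer(strategy='mean')
X_filled = imputer_demo.fit_transform(X_demo)

# statistics_ holds the per-column fill values learned by fit()
print(imputer_demo.statistics_)
print(X_filled)
```

Each missing entry is replaced by the mean of the observed values in its own column.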

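Handling missing values:
Covered the common approaches to missing values, including deletion, interpolation, and filling, and how to fill with the mean, median, or another statistic so that the data's statistical properties are preserved.

Scaling:
Scaling puts different features on a similar scale so that no single feature has an outsized influence on the model.

Next steps:
The next section covers how to derive new features from existing ones.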

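1. Introduction

Feature enhancement means further modifying the data by cleaning and augmenting it. Cleaning adjusts the existing rows and columns; augmenting removes columns from the dataset or adds new ones.
The operations involved are:

Identifying missing values in the data
Removing harmful data
Filling in missing values
Normalizing/standardizing the data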

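2. Fundamentals

Dataset
This section uses the Pima Indians Diabetes dataset, which has 9 columns and 768 rows. It contains medical records of Pima people and whether each person had diabetes within the past 5 years; all values are numeric. A classification model is used to predict whether a person has diabetes (1 for yes, 0 for no).

times_pregnant: Number of times pregnant
plasma_glucose_concentration: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
diastolic_blood_pressure: Diastolic blood pressure (mm Hg)
triceps_thickness: Triceps skin fold thickness (mm)
serum_insulin: 2-hour serum insulin (mu U/ml)
bmi: Body mass index (weight in kg/(height in m)^2)
pedigree_function: Diabetes pedigree function
age: Age (years)
onset_diabetes: Class variable (0 or 1), whether the person had diabetes within the past 5 years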

2.1 Identifying missing values

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('fivethirtyeight')

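# Column names
pima_column_names = ['times_pregnant', 'plasma_glucose_concentration', 'diastolic_blood_pressure', 'triceps_thickness',
                    'serum_insulin', 'bmi', 'pedigree_function', 'age', 'onset_diabetes']

# The column names are assigned manually
path ='./data/pima-indians-diabetes.csv'
# header=0 treats the first row of the file as a header and replaces it with `names`;
# if the file has no header row, that drops one record (use header=None in that case)
pima = pd.read_csv(path,header=0,names=pima_column_names)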
pima.head()

	times_pregnant	plasma_glucose_concentration	diastolic_blood_pressure	triceps_thickness	serum_insulin	bmi	pedigree_function	age	onset_diabetes
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

pima.shape

(768, 9)

# Compute the null accuracy (the share of the majority class)
pima['onset_diabetes'].value_counts(normalize=True)

0    0.651042
1    0.348958
Name: onset_diabetes, dtype: float64
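The null accuracy is the accuracy of a model that always predicts the majority class, so any useful model has to beat it. A minimal sketch on made-up labels (the Series labels and its 13/7 split are hypothetical, not the Pima data):

```python
import pandas as pd

# Hypothetical labels: 13 negatives and 7 positives
labels = pd.Series([0] * 13 + [1] * 7, name='onset_diabetes')

class_shares = labels.value_counts(normalize=True)
null_accuracy = class_shares.max()        # share of the majority class
majority_class = class_shares.idxmax()    # label a constant predictor would output

print(majority_class, null_accuracy)
```

For the Pima data the same reading of the output gives a null accuracy of about 0.651, the share of class 0.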

Plot histograms for the diabetic and non-diabetic groups, looking for patterns or clear differences between the two classes

# Histogram of plasma_glucose_concentration for both classes
col = 'plasma_glucose_concentration'
# Without diabetes
plt.hist(pima[pima['onset_diabetes']==0][col],bins=10,alpha=0.5,label='non-diabetes')
# With diabetes
plt.hist(pima[pima['onset_diabetes']==1][col],bins=10,alpha=0.5,label='diabetes')
plt.legend(loc='upper right')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.title('Histogram of {}'.format(col))

Text(0.5, 1.0, 'Histogram of plasma_glucose_concentration')

# Histograms for other columns
cols = ['bmi','diastolic_blood_pressure','plasma_glucose_concentration']

for col in cols:
    # Without diabetes
    plt.hist(pima[pima['onset_diabetes']==0][col],bins=10,alpha=0.5,label='non-diabetes')
    # With diabetes
    plt.hist(pima[pima['onset_diabetes']==1][col],bins=10,alpha=0.5,label='diabetes')
    plt.legend(loc='upper right')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.title('Histogram of {}'.format(col))
    plt.show()

Use the linear correlation matrix to quantify the relationships between variables

# Heatmap of the correlation matrix
sns.heatmap(pima.corr())

<AxesSubplot:>

Among all the features, plasma_glucose_concentration (plasma glucose concentration) is the most strongly correlated with onset_diabetes (diabetes)

pima.corr()['onset_diabetes'].sort_values(ascending=False)

onset_diabetes                  1.000000
plasma_glucose_concentration    0.466581
bmi                             0.292695
age                             0.238356
times_pregnant                  0.221898
pedigree_function               0.173844
serum_insulin                   0.130548
triceps_thickness               0.074752
diastolic_blood_pressure        0.065068
Name: onset_diabetes, dtype: float64

# Check whether the data contains missing values
pima.isnull().sum()

times_pregnant                  0
plasma_glucose_concentration    0
diastolic_blood_pressure        0
triceps_thickness               0
serum_insulin                   0
bmi                             0
pedigree_function               0
age                             0
onset_diabetes                  0
dtype: int64

# Basic descriptive statistics
pima.describe()

	times_pregnant	plasma_glucose_concentration	diastolic_blood_pressure	triceps_thickness	serum_insulin	bmi	pedigree_function	age	onset_diabetes
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	120.894531	69.105469	20.536458	79.799479	31.992578	0.471876	33.240885	0.348958
std	3.369578	31.972618	19.355807	15.952218	115.244002	7.884160	0.331329	11.760232	0.476951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.078000	21.000000	0.000000
25%	1.000000	99.000000	62.000000	0.000000	0.000000	27.300000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.000000	23.000000	30.500000	32.000000	0.372500	29.000000	0.000000
75%	6.000000	140.250000	80.000000	32.000000	127.250000	36.600000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000	1.000000

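The minimum BMI is 0, which is medically implausible, so missing or nonexistent values were probably filled with 0. The following columns all have a minimum of 0:

times_pregnant
plasma_glucose_concentration
diastolic_blood_pressure
triceps_thickness
serum_insulin
bmi
onset_diabetes

In onset_diabetes, 0 means no diabetes, and a person can have been pregnant 0 times, so we conclude that in the other five columns listed here the zeros stand for missing values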

If a dataset has no documentation, common placeholders for missing values include:

0 (numeric)
unknown or Unknown (categorical)
? (categorical)
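One way to spot placeholder zeros is to count them per column and then judge, column by column, whether zero is a plausible value. A small sketch on a made-up frame (df_demo and its values are illustrative only):

```python
import pandas as pd

# Tiny made-up frame; zeros in 'bmi' stand in for missing measurements
df_demo = pd.DataFrame({
    'times_pregnant': [0, 2, 5, 1],
    'bmi': [33.6, 0.0, 23.3, 0.0],
    'age': [50, 31, 32, 21],
})

# Count zeros per column and keep only the columns that contain any
zero_counts = (df_demo == 0).sum()
suspects = zero_counts[zero_counts > 0]
print(suspects)
```

The count only flags candidates. Whether a zero is a real value (zero pregnancies) or a placeholder (zero BMI) still takes domain knowledge.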

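2.2 Handling missing values

The two main approaches:

Drop the rows with missing values
Fill in the missing values

Both approaches clean the dataset so that algorithms can run, but each has its own trade-offs. Before going further, replace the placeholder zeros with Python's None so that pandas' fillna and dropna methods work as intended.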

# Replace the zeros in one column with None by hand
print(pima['serum_insulin'].isnull().sum())

# Map 0 to None and keep every other value
pima['serum_insulin']=pima['serum_insulin'].map(lambda x:x if x!=0 else None)

# Count the missing values again
print(pima['serum_insulin'].isnull().sum())

0
374

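# Apply the same replacement to every affected column
columns = ['serum_insulin', 'bmi', 'plasma_glucose_concentration', 'diastolic_blood_pressure', 'triceps_thickness']
for col in columns:
    # Assign the result back instead of using inplace=True on a column selection
    pima[col] = pima[col].replace([0], [None])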

# Count the missing values per column
pima.isnull().sum()

times_pregnant                    0
plasma_glucose_concentration      5
diastolic_blood_pressure         35
triceps_thickness               227
serum_insulin                   374
bmi                              11
pedigree_function                 0
age                               0
onset_diabetes                    0
dtype: int64
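As an alternative to looping over columns, the zeros can be replaced in a single vectorized step with np.nan, which keeps the columns numeric. This is a sketch on made-up data (df_demo and zero_means_missing are illustrative names), not the notebook's own code:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'serum_insulin': [0, 94, 168, 0],
    'bmi': [33.6, 0.0, 23.3, 28.1],
    'times_pregnant': [6, 0, 8, 1],   # zero is a legitimate value here
})

zero_means_missing = ['serum_insulin', 'bmi']
# Assigning back avoids chained in-place modification of a column
df_demo[zero_means_missing] = df_demo[zero_means_missing].replace(0, np.nan)

print(df_demo.isnull().sum())
```

Only the listed columns are touched, so the legitimate zero in times_pregnant is preserved.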

pima.head()

	times_pregnant	plasma_glucose_concentration	diastolic_blood_pressure	triceps_thickness	serum_insulin	bmi	pedigree_function	age	onset_diabetes
0	6	148	72	35	NaN	33.6	0.627	50	1
1	1	85	66	29	NaN	26.6	0.351	31	0
2	8	183	64	None	NaN	23.3	0.672	32	1
3	1	89	66	23	94.0	28.1	0.167	21	0
4	0	137	40	35	168.0	43.1	2.288	33	1

pima.describe() # the columns that contain missing values are left out of this output

	times_pregnant	pedigree_function	age	onset_diabetes
count	768.000000	768.000000	768.000000	768.000000
mean	3.845052	0.471876	33.240885	0.348958
std	3.369578	0.331329	11.760232	0.476951
min	0.000000	0.078000	21.000000	0.000000
25%	1.000000	0.243750	24.000000	0.000000
50%	3.000000	0.372500	29.000000	0.000000
75%	6.000000	0.626250	41.000000	1.000000
max	17.000000	2.420000	81.000000	1.000000

Dropping rows with missing values

# Drop the rows that contain missing values
pima_dropped = pima.dropna()
# Check what share of the rows was lost
num_rows_lost = round(100*(pima.shape[0] - pima_dropped.shape[0]) / float(pima.shape[0]))
print("dropped {}%".format(num_rows_lost))

dropped 49%

Continue the exploratory analysis by comparing the statistics before and after dropping rows

# Null accuracy before dropping rows
pima['onset_diabetes'].value_counts(normalize=True)

0    0.651042
1    0.348958
Name: onset_diabetes, dtype: float64

# Null accuracy after dropping rows
pima_dropped['onset_diabetes'].value_counts(normalize=True)

0    0.668367
1    0.331633
Name: onset_diabetes, dtype: float64

The class balance barely changes after dropping rows

# Column means before dropping rows
pima.mean()

times_pregnant                    3.845052
plasma_glucose_concentration    121.686763
diastolic_blood_pressure         72.405184
triceps_thickness                29.153420
serum_insulin                   155.548223
bmi                              32.457464
pedigree_function                 0.471876
age                              33.240885
onset_diabetes                    0.348958
dtype: float64

# Column means after dropping rows
pima_dropped.mean()

times_pregnant                    3.301020
plasma_glucose_concentration    122.627551
diastolic_blood_pressure         70.663265
triceps_thickness                29.145408
serum_insulin                   156.056122
bmi                              33.086224
pedigree_function                 0.523046
age                              30.864796
onset_diabetes                    0.331633
dtype: float64

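Plot the percentage change in each column's mean

# Bar chart of the percentage change in the column means (multiplied by 100 to match the '% change' label)
ax = (100 * (pima_dropped.mean() - pima.mean()) / pima.mean()).plot(kind='bar', title='% change in average column values')
ax.set_ylabel('% change')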

Text(0, 0.5, '% change')

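The mean of times_pregnant (number of pregnancies) falls by about 14% after dropping rows, which is a large change, and the mean of pedigree_function (diabetes pedigree function) rises by about 11%. Dropping rows noticeably distorts the data, so we should keep as much of it as possible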

Baseline

Train a model on pima_dropped to get a baseline accuracy

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

# Separate the features from the label
X_dropped = pima_dropped.drop('onset_diabetes', axis=1)  # features
print("learning from {} rows".format(X_dropped.shape[0]))
y_dropped = pima_dropped['onset_diabetes']   # label

# KNN parameter grid
knn_params = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7]}

# KNN model
knn = KNeighborsClassifier()

# Tune with grid search
grid = GridSearchCV(knn, knn_params)
grid.fit(X_dropped, y_dropped)

# Print the result
print(grid.best_score_, grid.best_params_)

learning from 392 rows
0.7348263550795197 {'n_neighbors': 7}

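Filling in missing values

Fill with the mean of the column's remaining values
Fill with the median
Fill with 0

# Count the missing values per column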
pima.isnull().sum()

times_pregnant                    0
plasma_glucose_concentration      5
diastolic_blood_pressure         35
triceps_thickness               227
serum_insulin                   374
bmi                              11
pedigree_function                 0
age                               0
onset_diabetes                    0
dtype: int64

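# Fill one column with fillna (assigning back instead of using inplace=True on a column selection)
pima['plasma_glucose_concentration'] = pima['plasma_glucose_concentration'].fillna(pima['plasma_glucose_concentration'].mean())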
pima.isnull().sum()

times_pregnant                    0
plasma_glucose_concentration      0
diastolic_blood_pressure         35
triceps_thickness               227
serum_insulin                   374
bmi                              11
pedigree_function                 0
age                               0
onset_diabetes                    0
dtype: int64

# Imputer from sklearn.preprocessing was removed in newer scikit-learn versions;
# use SimpleImputer from sklearn.impute instead
from sklearn.impute import SimpleImputer

# Instantiate the imputer
imputer = SimpleImputer(strategy='mean')

# Fit on pima and fill every column
pima_imputed = imputer.fit_transform(pima)
# Convert the resulting ndarray back to a DataFrame
pima_imputed = pd.DataFrame(pima_imputed, columns=pima_column_names)
pima_imputed.head()

	times_pregnant	plasma_glucose_concentration	diastolic_blood_pressure	triceps_thickness	serum_insulin	bmi	pedigree_function	age	onset_diabetes
0	6.0	148.0	72.0	35.00000	155.548223	33.6	0.627	50.0	1.0
1	1.0	85.0	66.0	29.00000	155.548223	26.6	0.351	31.0	0.0
2	8.0	183.0	64.0	29.15342	155.548223	23.3	0.672	32.0	1.0
3	1.0	89.0	66.0	23.00000	94.000000	28.1	0.167	21.0	0.0
4	0.0	137.0	40.0	35.00000	168.000000	43.1	2.288	33.0	1.0

# Confirm that no missing values remain
pima_imputed.isnull().sum()

times_pregnant                  0
plasma_glucose_concentration    0
diastolic_blood_pressure        0
triceps_thickness               0
serum_insulin                   0
bmi                             0
pedigree_function               0
age                             0
onset_diabetes                  0
dtype: int64

Try a few other fill values and see how they affect the KNN model

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

# Fill with 0
pima_zero = pima.fillna(0)
X_zero = pima_zero.drop('onset_diabetes', axis=1)
y_zero = pima_zero['onset_diabetes']
print("learning from {} rows".format(X_zero.shape[0]))

# KNN parameter grid
knn_params = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7]}

# KNN model
knn = KNeighborsClassifier()

# Tune with grid search
grid = GridSearchCV(knn, knn_params)
grid.fit(X_zero, y_zero)

# Print the result
print(grid.best_score_, grid.best_params_)

learning from 768 rows
0.7409387997623291 {'n_neighbors': 7}

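With zero filling, the accuracy (0.741) ends up close to, and in this run slightly above, the 0.735 obtained by dropping the rows with missing values. The goal now is a machine learning pipeline that learns from all 768 rows and clearly beats the result from only 392 rows, that is, scores above 0.735 (73.5%).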

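Imputing inside a machine learning pipeline

A learning algorithm is meant to generalize patterns from the training set and apply them to the test set. If we fill values over the whole dataset before splitting it and fitting the model, information from the test set leaks into training. That is cheating: the test score no longer measures how well the model generalizes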

from sklearn.model_selection import train_test_split

X = pima['serum_insulin'].copy()
y = pima['onset_diabetes'].copy()

X.isnull().sum()

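Improper approach: fill before splitting

# Mean over the entire dataset
entire_data_set_mean = X.mean()
# print(entire_data_set_mean)
# Fill the missing values
X =X.fillna(entire_data_set_mean)
# Fix the random state so the split is the same on every run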
X_train,X_test,y_train,y_test = train_test_split(X.values,y.values,random_state=99)
entire_data_set_mean

155.5482233502538

0      155.548223
1      155.548223
2      155.548223
3       94.000000
4      168.000000
          ...    
763    180.000000
764    155.548223
765    112.000000
766    155.548223
767    155.548223
Name: serum_insulin, Length: 768, dtype: float64

# Fit KNN on the training set and score it on the test set
knn = KNeighborsClassifier()
knn.fit(X_train.reshape(-1,1),y_train)
knn.score(X_test.reshape(-1,1),y_test)

0.6666666666666666

Proper approach: fill after splitting

from sklearn.model_selection import train_test_split

X = pima['serum_insulin'].copy()
y = pima['onset_diabetes'].copy()

# Fix the random state so the split is the same on every run
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=99)
X.isnull().sum()

Instead of the mean of the whole X, use the training-set mean to fill the missing values in both the training and test sets

training_mean = X_train.mean()
X_train = X_train.fillna(training_mean)
X_test = X_test.fillna(training_mean)

print(training_mean)

158.54605263157896
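The same "fit on train, apply to both" rule can be expressed with SimpleImputer: fit learns the mean from the training rows, and transform reuses it on the test rows. The sketch below uses a made-up single-column frame (X_demo, X_tr, X_te and imp are illustrative names) rather than the Pima data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up single feature with gaps
X_demo = pd.DataFrame(
    {'serum_insulin': [94.0, np.nan, 168.0, 88.0, np.nan, 543.0, np.nan, 846.0]}
)
X_tr, X_te = train_test_split(X_demo, random_state=0)

imp = SimpleImputer(strategy='mean')
X_tr_filled = imp.fit_transform(X_tr)   # learns the mean from the training rows only
X_te_filled = imp.transform(X_te)       # reuses that mean; test rows never influence it

print(imp.statistics_[0], X_tr['serum_insulin'].mean())
```

The two printed numbers agree, which confirms that the fill value comes from the training rows alone.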

knn = KNeighborsClassifier()
knn.fit(X_train.values.reshape(-1,1),y_train.values)
knn.score(X_test.values.reshape(-1,1),y_test)

0.6822916666666666

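In this run the score (0.682) is not lower than the 0.667 obtained by filling before the split, but it is the more honest estimate of the model's ability to generalize, that is, to learn from the training features and apply what it learned to unseen data. Combining scikit-learn's Pipeline with SimpleImputer makes such a pipeline easier to build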

from sklearn.pipeline import Pipeline

knn_params = {'classify__n_neighbors':[1,2,3,4,5,6,7]}

knn = KNeighborsClassifier()

mean_impute = Pipeline([
    ('imputer',SimpleImputer(strategy='mean')),
    ('classify',knn)
])

X = pima.drop('onset_diabetes',axis=1)
y = pima['onset_diabetes']

grid = GridSearchCV(mean_impute,knn_params)

grid.fit(X,y)
print(grid.best_score_,grid.best_params_)

0.7318394024276378 {'classify__n_neighbors': 7}

from sklearn.pipeline import Pipeline

# Median imputation
knn_params = {'imputer__strategy':['median'],'classify__n_neighbors':[1,2,3,4,5,6,7]}

knn = KNeighborsClassifier()

mean_impute = Pipeline([
    ('imputer',SimpleImputer()),
    ('classify',knn)
])

X = pima.drop('onset_diabetes',axis=1)
y = pima['onset_diabetes']

grid = GridSearchCV(mean_impute,knn_params)

grid.fit(X,y)
print(grid.best_score_,grid.best_params_)

0.7292589763177999 {'classify__n_neighbors': 7, 'imputer__strategy': 'median'}

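2.3 Standardization and normalization

So far we have seen how to identify data types, identify missing values, and handle missing values. We now look at further transformations that strengthen the machine learning pipeline. The dataset has been processed in 4 different ways so far (dropping rows, zero filling, mean imputation, median imputation), and the best cross-validated KNN accuracy is 0.741.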

impute = SimpleImputer(strategy='mean')
# Fill all the missing values
pima_imputed_mean = pd.DataFrame(impute.fit_transform(pima), columns=pima_column_names)
# Plot histograms
pima_imputed_mean.hist(figsize=(15, 15))

array([[<AxesSubplot:title={'center':'times_pregnant'}>,
        <AxesSubplot:title={'center':'plasma_glucose_concentration'}>,
        <AxesSubplot:title={'center':'diastolic_blood_pressure'}>],
       [<AxesSubplot:title={'center':'triceps_thickness'}>,
        <AxesSubplot:title={'center':'serum_insulin'}>,
        <AxesSubplot:title={'center':'bmi'}>],
       [<AxesSubplot:title={'center':'pedigree_function'}>,
        <AxesSubplot:title={'center':'age'}>,
        <AxesSubplot:title={'center':'onset_diabetes'}>]], dtype=object)

The mean, minimum, maximum, and standard deviation differ widely from column to column

pima_imputed_mean.describe()

	times_pregnant	plasma_glucose_concentration	diastolic_blood_pressure	triceps_thickness	serum_insulin	bmi	pedigree_function	age	onset_diabetes
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	121.686763	72.405184	29.153420	155.548223	32.457464	0.471876	33.240885	0.348958
std	3.369578	30.435949	12.096346	8.790942	85.021108	6.875151	0.331329	11.760232	0.476951
min	0.000000	44.000000	24.000000	7.000000	14.000000	18.200000	0.078000	21.000000	0.000000
25%	1.000000	99.750000	64.000000	25.000000	121.500000	27.500000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.202592	29.153420	155.548223	32.400000	0.372500	29.000000	0.000000
75%	6.000000	140.250000	80.000000	32.000000	155.548223	36.600000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000	1.000000

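Some machine learning models are strongly affected by the scale of the data. If the columns differ too much in scale, the algorithm will not reach its best performance. Passing the optional sharex and sharey arguments to the hist method shows every chart on a common scale

# Share the x-axis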
pima_imputed_mean.hist(figsize=(15,15),sharex=True)

array([[<AxesSubplot:title={'center':'times_pregnant'}>,
        <AxesSubplot:title={'center':'plasma_glucose_concentration'}>,
        <AxesSubplot:title={'center':'diastolic_blood_pressure'}>],
       [<AxesSubplot:title={'center':'triceps_thickness'}>,
        <AxesSubplot:title={'center':'serum_insulin'}>,
        <AxesSubplot:title={'center':'bmi'}>],
       [<AxesSubplot:title={'center':'pedigree_function'}>,
        <AxesSubplot:title={'center':'age'}>,
        <AxesSubplot:title={'center':'onset_diabetes'}>]], dtype=object)

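Clearly, all the columns are on different scales. We can apply a normalization operation that maps every quantitative column into the same fixed range (for example, all values between 0 and 1). Normalization and standardization are both ways of making features dimensionless, which means converting data in different units to a common scale, or data with different distributions to a specified distribution.
During training, dimensionless features can speed up optimization, especially for models that compute gradients and matrices (for example, logistic regression solved by gradient descent). Algorithms that compute distances, such as k-nearest neighbors and clustering, can become more accurate because no feature with a very large range dominates the distance.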

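These transformations can be linear or nonlinear. Nonlinear ones, such as quantile transforms and power transforms, are used less often; the common linear ones in feature engineering are centering and scaling.
Centering subtracts a fixed value from every record, which shifts the data to a new location.
Scaling divides by a fixed value, which confines the data to a certain range.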

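Standardization and normalization are both forms of feature scaling. By Wikipedia's definition of feature scaling methods, standardization is z-score normalization, so standardization is really one kind of normalization. The practical difference is that min-max normalization produces values between 0 and 1, whereas standardization has no bounded range: the data simply ends up with mean 0 and standard deviation 1.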

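Neither normalization nor standardization changes the shape of the data's distribution. Both are linear transformations that shift and rescale the data, so the ordering of the original values is preserved. The scale factor in min-max normalization depends only on the extreme values, while the scale factor in standardization depends on the whole dataset. For data with outliers, min-max normalization is therefore often not a wise choice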
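A small numeric check can make the point about extreme values concrete. The sketch below uses made-up numbers (base, with_outlier and the outlier value 10000 are assumptions for illustration) and compares how much each method's divisor grows when a single extreme point is added.

```python
import numpy as np

base = np.arange(100.0)                    # made-up values 0..99
with_outlier = np.append(base, 10000.0)    # the same values plus one extreme point

# Divisor used by min-max normalization: max - min
range_ratio = (with_outlier.max() - with_outlier.min()) / (base.max() - base.min())
# Divisor used by z-score standardization: the standard deviation
std_ratio = with_outlier.std() / base.std()

print(round(range_ratio, 1), round(std_ratio, 1))
```

In this example both divisors grow, so neither method is immune to outliers, but the min-max range grows considerably more than the standard deviation because it is set entirely by the two extremes.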

We focus on 3 scaling methods:

z-score standardization
min-max scaling
row normalization

3. Practice

Using what we have covered so far, apply these scaling methods to pima-indians-diabetes.csv, and try writing the code on your own first

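3.1 z-score standardization

z-score standardization uses the simple statistical idea of the z-score (standard score). Its output is rescaled to have mean 0 and standard deviation 1. Putting every feature on a common mean and variance (the square of the standard deviation) helps a model like KNN perform well, because no feature dominates simply for having a larger scale. The formula is:
$\frac{x-\mu}{\sigma}$
where $\mu$ is the column mean and $\sigma$ is the column standard deviation.

Standardize plasma_glucose_concentration (plasma glucose concentration) with z-scores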

pima['plasma_glucose_concentration'].head()

0    148.0
1     85.0
2    183.0
3     89.0
4    137.0
Name: plasma_glucose_concentration, dtype: float64

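# Column mean
mu = pima['plasma_glucose_concentration'].mean()
# Column standard deviation (pandas uses the sample standard deviation, ddof=1, by default)
sigma = pima['plasma_glucose_concentration'].std()
# z-score of every value
((pima['plasma_glucose_concentration'] - mu) / sigma).head()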

0    0.864545
1   -1.205376
2    2.014501
3   -1.073952
4    0.503130
Name: plasma_glucose_concentration, dtype: float64

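Every value in the column is replaced, and some become negative. A z-score measures the distance from the mean in units of standard deviations, so a value that was below the column mean gets a negative z-score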
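One detail worth knowing when comparing a manual z-score with scikit-learn: pandas' std defaults to the sample standard deviation (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0). The sketch below checks this on the first five glucose values from the output shown earlier; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

col = pd.Series([148.0, 85.0, 183.0, 89.0, 137.0])   # first five glucose values

z_sample = (col - col.mean()) / col.std()              # pandas default: ddof=1
z_population = (col - col.mean()) / col.std(ddof=0)    # population standard deviation
z_sklearn = StandardScaler().fit_transform(col.to_frame()).ravel()

print(np.allclose(z_sklearn, z_population))
print(np.allclose(z_sklearn, z_sample))
```

On 768 rows the two conventions differ by well under 0.1%, so the choice rarely matters in practice, but it explains small mismatches between hand-computed and StandardScaler values.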

# Use scikit-learn's built-in StandardScaler for z-score standardization
from sklearn.preprocessing import StandardScaler

# Standardize with z-scores
scaler = StandardScaler()
glucose_z_score_standardized = scaler.fit_transform(pima[['plasma_glucose_concentration']])

# Plot the histogram
ax = pd.Series(glucose_z_score_standardized.reshape(-1,)).hist()
ax.set_title('Distribution of plasma_glucose_concentration after Z Score Scaling')

Text(0.5, 1.0, 'Distribution of plasma_glucose_concentration after Z Score Scaling')

# Transform every column
scale = StandardScaler()

pima_imputed_mean_scaled = pd.DataFrame(scale.fit_transform(pima_imputed_mean),columns=pima_column_names)
pima_imputed_mean_scaled.hist(figsize=(15,15),sharex=True)

array([[<AxesSubplot:title={'center':'times_pregnant'}>,
        <AxesSubplot:title={'center':'plasma_glucose_concentration'}>,
        <AxesSubplot:title={'center':'diastolic_blood_pressure'}>],
       [<AxesSubplot:title={'center':'triceps_thickness'}>,
        <AxesSubplot:title={'center':'serum_insulin'}>,
        <AxesSubplot:title={'center':'bmi'}>],
       [<AxesSubplot:title={'center':'pedigree_function'}>,
        <AxesSubplot:title={'center':'age'}>,
        <AxesSubplot:title={'center':'onset_diabetes'}>]], dtype=object)

Pipeline
Check the effect with the KNN algorithm

# Add z-score standardization to the machine learning pipeline
knn_params = {'imputer__strategy': ['mean', 'median'],
           'classify__n_neighbors': [1, 2, 3, 4, 5, 6, 7]
}

mean_impute_standardize = Pipeline([
    ('imputer', SimpleImputer()), 
    ('standardize', StandardScaler()),
    ('classify', knn)
])

X = pima.drop('onset_diabetes', axis=1)
y = pima['onset_diabetes']

grid = GridSearchCV(mean_impute_standardize, knn_params)
grid.fit(X, y)

print(grid.best_score_, grid.best_params_)

0.7539173245055598 {'classify__n_neighbors': 7, 'imputer__strategy': 'mean'}

3.2 Min-max scaling

$\frac{x-x_{min}}{x_{max}-x_{min}}$

# Use scikit-learn's MinMaxScaler for min-max scaling
from sklearn.preprocessing import MinMaxScaler

# Instantiate
min_max = MinMaxScaler()

# Apply min-max scaling
pima_min_maxed = pd.DataFrame(min_max.fit_transform(pima_imputed), columns=pima_column_names)

# Descriptive statistics
pima_min_maxed.describe()

	times_pregnant	plasma_glucose_concentration	diastolic_blood_pressure	triceps_thickness	serum_insulin	bmi	pedigree_function	age	onset_diabetes
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	0.226180	0.501205	0.493930	0.240798	0.170130	0.291564	0.168179	0.204015	0.348958
std	0.198210	0.196361	0.123432	0.095554	0.102189	0.140596	0.141473	0.196004	0.476951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.058824	0.359677	0.408163	0.195652	0.129207	0.190184	0.070773	0.050000	0.000000
50%	0.176471	0.470968	0.491863	0.240798	0.170130	0.290389	0.125747	0.133333	0.000000
75%	0.352941	0.620968	0.571429	0.271739	0.170130	0.376278	0.234095	0.333333	1.000000
max	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000

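Note that every minimum is 0 and every maximum is 1. With this kind of scaling the standard deviations all become very small, which may hurt some models because outliers carry less weight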
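To connect the formula with the scaler, the sketch below applies the min-max formula column by column with NumPy and compares the result with MinMaxScaler. The array X_demo holds made-up values, and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_demo = np.array([[18.2, 21.0],
                   [33.6, 50.0],
                   [67.1, 81.0]])

# Per-column minimum and maximum, then (x - min) / (max - min)
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual = (X_demo - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(X_demo)
print(np.allclose(manual, scaled))
```

Each column is scaled independently, so its own smallest value maps to 0 and its own largest value maps to 1.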

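# Check the effect with the KNN pipeline
# Note: this grid covers n_neighbors from 1 to 6 (the earlier grids also tried 7)
knn_params = {'imputer__strategy': ['mean', 'median'], 'classify__n_neighbors': [1, 2, 3, 4, 5, 6]}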

mean_impute_standardize = Pipeline([('imputer', SimpleImputer()), ('standardize', MinMaxScaler()), ('classify', knn)])

X = pima.drop('onset_diabetes', axis=1)
y = pima['onset_diabetes']

grid = GridSearchCV(mean_impute_standardize, knn_params)
grid.fit(X, y)

print(grid.best_score_, grid.best_params_)

0.7486800780918428 {'classify__n_neighbors': 4, 'imputer__strategy': 'median'}

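3.3 Row normalization

Row normalization does not compute per-column statistics (mean, minimum, maximum, and so on). Instead it gives every row unit norm, which means every row vector has the same length. Think of each row as a vector in an n-dimensional space; each such vector has a norm (length).
L2 norm:
$\|x\|_2=(x_1^2+x_2^2+\cdots+x_n^2)^{1/2}$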

# Average row norm of the matrix
np.sqrt((pima_imputed**2).sum(axis=1)).mean()

223.3622202582376

# Use scikit-learn's Normalizer
# Import row normalization
from sklearn.preprocessing import Normalizer

# Instantiate
normalize = Normalizer()

pima_normalized = pd.DataFrame(normalize.fit_transform(pima_imputed), columns=pima_column_names)

# Average row norm after normalization
np.sqrt((pima_normalized**2).sum(axis=1)).mean()

1.0
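Row normalization can also be written directly from the L2 norm formula. The sketch below does so on a made-up 3x2 array (X_demo, row_norms and manual are illustrative names) and compares the result with Normalizer.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X_demo = np.array([[3.0, 4.0],
                   [6.0, 8.0],
                   [1.0, 0.0]])

# L2 norm of each row, kept as a column so it broadcasts across the row
row_norms = np.sqrt((X_demo ** 2).sum(axis=1, keepdims=True))
manual = X_demo / row_norms

print(row_norms.ravel())
print(manual)
print(np.allclose(manual, Normalizer().fit_transform(X_demo)))
```

The first two rows point in the same direction and become identical after normalization. Row normalization keeps direction but discards magnitude, which may be one reason a distance-based model can lose accuracy on data where magnitude matters.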

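# Check the effect with the pipeline
# Note: this grid covers n_neighbors from 1 to 6 (the earlier grids also tried 7)
knn_params = {'imputer__strategy': ['mean', 'median'], 'classify__n_neighbors': [1, 2, 3, 4, 5, 6]}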

mean_impute_standardize = Pipeline([('imputer', SimpleImputer()), ('normalize', Normalizer()), ('classify', knn)])

X = pima.drop('onset_diabetes', axis=1)
y = pima['onset_diabetes']

grid = GridSearchCV(mean_impute_standardize, knn_params)
grid.fit(X, y)

print(grid.best_score_, grid.best_params_)

0.6914438502673796 {'classify__n_neighbors': 4, 'imputer__strategy': 'median'}

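Many algorithms are affected by scale. Some popular examples:

KNN: it relies on Euclidean distance
K-means clustering: for the same reason as KNN
Logistic regression, support vector machines, neural networks: when the weights are learned by gradient descent
Principal component analysis: the eigenvectors are biased toward the columns with larger values.

Note that tree models generally do not need scaling. The main purpose of scaling is to put different features on a comparable scale and to reduce the effect of very different variances. A tree splits on one feature at a time by comparing it with a threshold, so it depends on the ordering of each feature's values rather than on their magnitude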
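The claim about tree models can be checked with a small experiment. This is a sketch on synthetic data (the integer-valued features, the labeling rule, and the factor 1000 are all assumptions chosen for illustration): a decision tree is trained once on the raw features and once with the first feature stretched, and the predictions are compared.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic integer-valued features so that rescaling is exact in floating point
X_demo = rng.integers(0, 100, size=(200, 2)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 100).astype(int)
X_new = rng.integers(0, 100, size=(50, 2)).astype(float)

scale = np.array([1000.0, 1.0])   # stretch only the first feature

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo * scale, y_demo)

# Compare predictions on new points, scaling them the same way for the second tree
same = np.array_equal(tree_raw.predict(X_new), tree_scaled.predict(X_new * scale))
print(same)
```

Stretching a feature moves the split thresholds by the same factor but leaves the partition of the samples unchanged, so the two trees make the same predictions. A distance-based model such as KNN would generally not behave this way.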

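4. Quiz

Instructions:
Enter your answers in the answer Code Cell under the questions and follow the steps to submit.
Make sure to run the code cells in order, otherwise you will get errors!

Answers are uppercase strings with no separators (for example a1 = 'A' or a1 = 'AB'); for true/false questions, T means true and F means false.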

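STEP1: Answer the questions

Q1. (Single choice) Which of the following is not a data scaling method?
A. z-score standardization
B. min-max scaling
C. Mean imputation
D. Row normalization

Q2. (True/false) Missing values should be filled over the whole dataset before splitting it and applying the algorithm.

Q3. (Multiple choice) Which of the following machine learning algorithms can do without feature scaling?
A. Random forest
B. Logistic regression
C. SVM
D. GBDT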

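Feature scaling is a common step in machine learning, but not every algorithm needs it. For this question:

A. Random forest: does not need feature scaling. A random forest is built from decision trees, which are insensitive to feature scale.
B. Logistic regression: benefits from feature scaling. It is usually fitted with gradient-based optimization, which is sensitive to feature scale, so scaling is normally applied.
C. SVM: needs feature scaling. Support vector machines, especially with kernel functions, are very sensitive to feature scale, so scaling is strongly recommended.
D. GBDT: does not need feature scaling. Gradient boosted decision trees are also tree based and insensitive to feature scale.

So the answer is A (random forest) and D (GBDT).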

# Fill in your answers and run; mind the letter case
a1 = 'C'  # e.g. a1 = 'AB'
a2 = 'F'  # e.g. a2 = 'T' or 'F'
a3 = 'AD'  # e.g. a3 = 'B'
