Series, Part 1: Data Preprocessing for Machine Learning and Deep Learning (Numerical Data)

str_717

Article link: https://blog.csdn.net/str_717/article/details/116073495

1. Introduction

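Whether we work in machine learning or deep learning (deep learning is itself a machine learning method 😅), we need large amounts of data, and that data comes from many kinds of sources: an online database, a website, data provided by someone else, data you scraped yourself with a crawler, and so on.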

1.1 Why preprocessing?

Raw data is dirty and noisy
Machine learning algorithms have certain constraints regarding input data
Transformations can improve the model performance
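Whatever the data is, once we have it we face many problems: missing values, incorrect formats, and so on. Data that we obtain directly but cannot readily use has a vivid name: dirty data.

The way to make dirty data usable is data preprocessing.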

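1.2 What does preprocessing give us? (Preprocessing results)

Only preprocessed data can really be used, and preprocessed data often improves model performance as well. During preprocessing we may even discover relationships within the data.

So if you want to train a model of your own, learning data preprocessing is both necessary and important.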

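2. Types of data

Several types of data are especially common in machine learning and deep learning problems:

Numerical data (structured data, e.g. the housing price data in Andrew Ng's machine learning course)
Text data (unstructured data that needs annotation, e.g. the People's Daily corpus in China, or the GoogleNews corpus used to train word2vec models)
Image data (unstructured data that needs annotation, e.g. ImageNet)

Because each type of data has its own preprocessing methods (images: image augmentation, grayscale conversion, etc.; text: word vectors, stemming, punctuation and case handling, etc.), I will cover them in three separate articles.

This article starts with the most common type: numerical data.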

3. Types of data problems and how to handle them

Note: the examples here use the Pandas library in Python for preprocessing.

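3.1 Duplicates

3.1.1 What is duplicated data? (What)

Duplicates are very common. Your dataset may well contain two identical records:
e.g.:

NO.X: size: 20, price: 100
…
NO.Y: size: 20, price: 100

3.1.2 Why handle duplicates? (Why)

Data leakage!
To protect the model's ability to generalize, it is essential that the algorithm never sees the test set data at any point during training.
If the same record appears in both the training set and the test set, the model's evaluation results may be unreliable.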

3.1.3 How to remove duplicates? (How)

import pandas as pd

df = pd.read_csv("your_path_to_file.csv")
data = df.copy()  # work on a copy so the raw data stays untouched


print(len(data))  # number of rows before removing duplicates
data = data.drop_duplicates()  # remove rows that exactly repeat an earlier row
print(len(data))  # number of rows after removing duplicates
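
To see what duplicate removal does without needing a CSV file, here is a minimal sketch on a made-up table. The column names `size` and `price` echo the example above; the values and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0 exactly
toy = pd.DataFrame({"size": [20, 35, 20, 50],
                    "price": [100, 180, 100, 260]})

dup_mask = toy.duplicated()          # True for every repeat of an earlier row
n_duplicates = int(dup_mask.sum())   # how many rows drop_duplicates() would remove

deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(len(toy), len(deduped))
```

Checking `duplicated()` first is a cheap way to see how many rows are affected before anything is deleted.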

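3.2 Missing data

3.2.1 What is missing data? (What)

This one needs little explanation: some values in the dataset are simply absent.

3.2.2 Why is data missing, and why handle it? (Why)

Common reasons for missing data
- Programming error
- Failure of measurement
- Random events
Common representations of missing values
- NaN (not a number)
- A large negative number
- Infinity

Why handle it? You certainly don't want to be told, once model training has already started, that some values were missing and were silently turned into 0 🙄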
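
Because missing values can be encoded in several ways, it often helps to convert them all to NaN before counting. The sketch below assumes a hypothetical `temperature` column in which -999 was used as the "large negative" sentinel; the sentinel value and the column name are illustrative assumptions, not something from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical column where missing values were recorded in three different ways
raw = pd.DataFrame({"temperature": [21.5, -999.0, np.inf, np.nan, 19.0]})

# Treat the assumed sentinel (-999) and infinities as missing
cleaned = raw.replace([-999.0, np.inf, -np.inf], np.nan)

missing_per_column = cleaned.isnull().sum()
print(missing_per_column)
```

After this step, `isnull()` sees every kind of missing value, so the counts are no longer understated.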

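3.2.3 How to handle missing data (How)

# Find the missing values in the data
data.isnull().sum().sort_values(ascending=False)  # NaN count for each column
data.info()  # also works: shows the non-null count per column

Before handling the missing values, hold on!
There are many ways to handle missing values, and different methods lead to different results. When to use which method is largely a matter of judgment.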

Drop

import numpy as np

data_fewer_cols = data.drop(columns=["BA", "CA"])  # option 1: drop whole columns
data_fewer_rows = data.dropna(subset=["BA"])  # option 2: drop every row where column "BA" is missing

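Fill

data["BA"] = data["BA"].fillna("NoBA")  # replace NaN with "NoBA"

🚨 Missing values sometimes carry meaning! (Missing data does not necessarily mean no information!)
But filling also brings problems: it injects subjective human choices, and relationships between columns can be lost…

We can also use SimpleImputer from sklearn to fill values, choosing the fill strategy we want: "mean", "median", or "most_frequent".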

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="mean")  # Instantiate a SimpleImputer object with the strategy of choice

imputer.fit(data[['BA']])  # Call the "fit" method on the object

data['BA'] = imputer.transform(data[['BA']]).ravel()  # Call "transform"; ravel() flattens the (n, 1) result to 1-D

print(imputer.statistics_)  # The mean is stored in the transformer's memory
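
The choice of strategy matters when a column contains extreme values. As a rough illustration, the sketch below fills a made-up `BA` column with the median instead of the mean; the values are invented so that one large entry would distort the mean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing value and one extreme value
toy = pd.DataFrame({"BA": [1.0, 2.0, np.nan, 4.0, 100.0]})

median_imputer = SimpleImputer(strategy="median")
filled = median_imputer.fit_transform(toy[["BA"]])  # 2-D array of shape (5, 1)

toy["BA_filled"] = filled.ravel()
print(median_imputer.statistics_)  # the learned median of the non-missing values
print(toy)
```

Here the mean of the observed values would be 26.75, while the median is 3.0, which is much closer to the typical entries.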

3.3 Outliers

3.3.1 What is an outlier? (What)

An outlier is an observation that differs substantially from most of the other data.

3.3.2 What do outliers affect? (Why)

Dataset distributions and patterns
Summary statistics, e.g. mean and standard deviation
Machine learning models' performances (which is why outliers need to be handled)

3.3.3 How to find and remove outliers? (How)

Finding outliers: use a boxplot!
```
data[["HAVEOUTLIER"]].boxplot()
```

Removing outliers

# Find the index label of the implausible outlier first
false_observation = data['HAVEOUTLIER'].idxmin()  # Index label of the minimum value

data = data.drop(false_observation).reset_index(drop=True)  # Drop that row

data[['HAVEOUTLIER']].boxplot()  # Visualize the boxplot again
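
A boxplot flags points that fall beyond its whiskers, and a common convention places the whiskers 1.5 × IQR away from the quartiles. The sketch below applies that rule numerically to a made-up series. The 1.5 factor and the values are assumptions, and whether a flagged point is really an error still needs human judgment.

```python
import pandas as pd

# Hypothetical measurements with one suspicious value
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95], name="HAVEOUTLIER")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # anything below this is flagged
upper = q3 + 1.5 * iqr   # anything above this is flagged

outlier_mask = (values < lower) | (values > upper)
cleaned = values[~outlier_mask].reset_index(drop=True)

print(values[outlier_mask])
```

Unlike removing only the minimum, this handles outliers on both sides and any number of them.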

3.4 Scaling

3.4.1 What is scaling? (What)

Scaling transforms continuous data into a smaller range. (Note: it should not be applied to non-continuous variables.)

3.4.2 Why scale? (Why)

Features with large magnitudes can incorrectly outweigh features of small magnitudes
Scaling to smaller magnitudes improves computational efficiency
Scaling increases the interpretability of feature coefficients

3.4.3 How to scale? (How)

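Standardizing
$\huge z = \frac{x - \text{mean}}{\text{std}}$
This is the familiar standardization formula: subtract the feature's mean, then divide by its standard deviation.
```
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
std_data = scaler.fit_transform(data)
```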
Normalizing
$\huge X' = \frac{X - X_{min}}{X_{max} - X_{min}}$
👉 Sklearn MinMaxScaler() documentation
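
For completeness, here is a small sketch of min-max normalization with `MinMaxScaler`, checked against the formula above. The array `X` is made-up data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

mm_scaler = MinMaxScaler()             # default feature_range is (0, 1)
X_scaled = mm_scaler.fit_transform(X)  # scales each column independently

# The same result computed by hand from the formula
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled)
```

Both columns end up in the range [0, 1], so neither dominates merely because of its units.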
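Robust scaling
$\huge \text{Robust Scaled} = \frac{x - \text{median}}{\text{IQR}}$
```
from sklearn.preprocessing import RobustScaler

r_scaler = RobustScaler()  # Instantiate the robust scaler
r_scaler.fit(data)  # Fit the scaler to the features
# transform() returns a NumPy array, so wrap it back into a DataFrame
data = pd.DataFrame(r_scaler.transform(data), columns=data.columns, index=data.index)
data.head()
```
Each of these methods has its own strengths and weaknesses, so decide which one fits your data before using it.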
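
One way to compare the three scalers is to run them on the same feature with a single extreme value. The sketch below uses a made-up column; it illustrates the trade-off and is not a general benchmark.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical feature: four ordinary values and one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {
    "standard": StandardScaler().fit_transform(x).ravel(),
    "minmax": MinMaxScaler().fit_transform(x).ravel(),
    "robust": RobustScaler().fit_transform(x).ravel(),
}

for name, column in scaled.items():
    print(name, np.round(column, 3))
```

With min-max scaling the four ordinary values are squeezed close to 0 because the extreme value defines the range, whereas robust scaling keeps them spread around 0 because the median and IQR are barely affected by it.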

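3.5 Balancing imbalanced data

3.5.1 What is balancing? (What)

In classification data, the number of samples per class often differs enormously. Balancing evens out the number of samples across the classes.

(e.g. in Problem C of the 2021 MCM/ICM contest, the hornet data provided by the organizers was extremely imbalanced)

3.5.2 Why balance? (Why)

Machine learning and deep learning models learn from data. Without balancing, the model becomes heavily biased: the minority class is rarely predicted, even though the overall metrics may look good. In practice such a model is unusable.

3.5.3 How to balance? (How)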

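Oversampling or undersampling
These are the two main approaches: oversampling adds samples to the minority class, while undersampling removes samples from the majority class.
🚨 Run train_test_split first and balance only the training set (to prevent data leakage).

This uses a new library, imblearn.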

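# Use the SMOTE interface from the over-sampling methods in the imblearn library
from imblearn.over_sampling import SMOTE
# Define the SMOTE model; random_state acts as the random seed
smo = SMOTE(random_state=42)
# X, y should be the training split only; fit_resample replaces the older fit_sample
X_smo, y_smo = smo.fit_resample(X, y)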
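
The "split first, then balance only the training set" order can be sketched without imblearn. The example below uses plain random oversampling via `sklearn.utils.resample` rather than SMOTE, so it duplicates minority rows instead of synthesizing new ones. The data, column names, and the 90/10 class ratio are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 90 samples of class 0, 10 of class 1
rng = np.random.default_rng(0)
df = pd.DataFrame({"feature": rng.normal(size=100),
                   "label": [1] * 10 + [0] * 90})

# 1) Split first, so the test set keeps the real class distribution
train, test = train_test_split(df, test_size=0.2,
                               stratify=df["label"], random_state=42)

# 2) Oversample the minority class in the training set only
majority = train[train["label"] == 0]
minority = train[train["label"] == 1]
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)

train_balanced = pd.concat([majority, minority_upsampled])
counts = train_balanced["label"].value_counts()
print(counts)
```

The test set is never resampled, so evaluation still reflects the imbalance the model will face in practice.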

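3.6 Encoding

3.6.1 What is encoding? (What)

For categorical data, encoding converts the categorical fields (target or features) into numbers so the model can be trained on them. (You can't feed "Cat" and "Dog" straight into a model.)

3.6.2 How to encode? (How)

Target Encoding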

👉 Sklearn LabelEncoder() documentation
Feature Encoding

👉 Sklearn OneHotEncoder() documentation
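
A short sketch of both encoders follows. The "Cat"/"Dog" target comes from the example above; the `colors` feature and its values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Target encoding: one integer per class
y = np.array(["Cat", "Dog", "Dog", "Cat"])
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)  # classes are sorted, so Cat -> 0, Dog -> 1

# Feature encoding: one binary column per category (input must be 2-D)
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
onehot_encoder = OneHotEncoder()
colors_encoded = onehot_encoder.fit_transform(colors).toarray()  # sparse -> dense

print(y_encoded)
print(onehot_encoder.categories_[0])
print(colors_encoded)
```

One-hot encoding is used for features because integer codes would suggest an ordering (blue < green < red) that does not exist.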

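3.7 Discretizing

3.7.1 What is discretization? (What)

Discretization turns a continuous feature into a categorical, discrete one by defining bins (a histogram follows the same logic).

3.7.2 How to discretize? (How)

# Note the bins: two intervals split at the mean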
data['SalePriceBinary'] = pd.cut(x = data['SalePrice'],
                       bins=[data['SalePrice'].min()-1,
                             data['SalePrice'].mean(),
                             data['SalePrice'].max()+1], 
                       labels=['cheap', 'expensive'])
data.head()
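
The same idea extends to more than two bins. The sketch below uses made-up prices and hand-picked bin edges, both of which are assumptions; in practice the edges would come from domain knowledge or from the data's distribution.

```python
import pandas as pd

# Hypothetical prices
prices = pd.Series([100, 150, 200, 250, 800], name="SalePrice")

# Intervals are right-inclusive by default: (0, 200], (200, 500], (500, 1000]
binned = pd.cut(prices, bins=[0, 200, 500, 1000], labels=["low", "mid", "high"])

counts = binned.value_counts().sort_index()  # a histogram in table form
print(counts)
```

Counting the values per bin gives exactly the bar heights a histogram with these edges would show.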

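3.8 Feature creation

If you have domain knowledge about the data you are training on, you can create new features and use them for model training 💪

In other words, you construct features yourself. For example:

From weight and height data, derive a body fat indicator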
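
As a minimal sketch of this kind of feature creation, the example below derives the body mass index (BMI = weight / height²) from weight and height. BMI is used here as a simple stand-in for a body fat indicator, and the column names and values are made up.

```python
import pandas as pd

# Hypothetical measurements
people = pd.DataFrame({"weight_kg": [60.0, 80.0],
                       "height_m": [1.60, 2.00]})

# New feature built from two existing ones: BMI = weight (kg) / height (m) squared
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2

print(people)
```

The new column combines two raw measurements into a single quantity that may relate more directly to the target than either one alone.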

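3.9 Feature selection

3.9.1 What is feature selection? (What)

Sometimes the data we obtain contains many feature fields. Do we need to use all of them?

The answer is: no.

Feature selection is the process of eliminating uninformative features. There are two main types of statistical feature selection:

Univariate feature selection
Multivariate feature selection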

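3.9.2 Why select features? (Why)

Garbage in, garbage out
Very high-dimensional data is hard for traditional machine learning to handle, so selection reduces complexity
Only data that is useful for the problem at hand should go into the model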

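3.9.3 How to select features? (How)

Feature correlation
The first method is to measure the correlation between pairs of features. If two features are highly correlated, remove one of them.

High correlation = redundant information

Method:
Pearson Correlation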

import seaborn as sns

# Heatmap of the Pearson correlation matrix (numeric columns only)
corr = data.select_dtypes("number").corr()
sns.heatmap(corr,
        xticklabels=corr.columns,
        yticklabels=corr.columns,
        cmap="YlGnBu")

corr_df = corr.unstack().reset_index()  # Unstack correlation matrix
corr_df.columns = ['feature_1', 'feature_2', 'correlation']  # Rename columns
corr_df = corr_df.sort_values(by="correlation", ascending=False)  # Sort by correlation
corr_df = corr_df[corr_df['feature_1'] != corr_df['feature_2']]  # Remove self correlation
corr_df.head()
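
Once the highly correlated pairs are known, one feature from each pair can be dropped automatically. The sketch below does this on synthetic data; the 0.9 threshold and the column names are assumptions, and which member of a pair to keep is a judgment call (here the later column is dropped).

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a_copy" is almost a linear copy of "a", "b" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

corr_abs = toy.corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off for "highly correlated"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
```

Using the absolute value means strongly negative correlations are treated as redundant too.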

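Feature permutation
Feature permutation is a second feature selection technique. It evaluates how important each feature is for predicting the target.

Steps

Train a baseline model on all features and record its test score
Randomly shuffle (permute) one feature in the test set
Record the new score on the shuffled test set
Compare the new score with the original score
Repeat for each feature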

👉 If the score drops when a feature is shuffled, it is considered important.

import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

log_model = LogisticRegression().fit(X, y)  # Fit model

permutation_score = permutation_importance(log_model, X, y, n_repeats=100)  # Perform permutation

importance_df = pd.DataFrame(np.vstack((X.columns,
                                        permutation_score.importances_mean)).T)  # Stack names and mean importances
importance_df.columns = ['feature', 'score decrease']

importance_df.sort_values(by="score decrease", ascending=False)  # Order by importance
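
To make the steps concrete, here is a hand-written sketch of the permutation loop on synthetic data in which only feature 0 determines the label. It shuffles each feature once, whereas `permutation_importance` repeats the shuffle `n_repeats` times and averages. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: the label depends on feature 0 only; features 1 and 2 are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Step 1: baseline score with all features intact
model = LogisticRegression().fit(X_train, y_train)
baseline = model.score(X_test, y_test)

# Steps 2-5: shuffle one test-set feature at a time and measure the score drop
score_drop = []
for j in range(X_test.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    score_drop.append(baseline - model.score(X_perm, y_test))

print(baseline, score_drop)
```

Shuffling feature 0 breaks its link to the label, so its score drop should be by far the largest, while the noise features barely change the score.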