Data Preprocessing Steps and Templates

囧囧慧君

Last modified 2022-12-13 14:41:37

Category: Machine Learning. Tags: python, data mining, algorithms

First published 2022-12-12 12:02:44

Article link: https://blog.csdn.net/weixin_46347116/article/details/128282996

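This article walks through the steps of data preprocessing. It starts with data cleaning: type conversion and the handling of special values, erroneous values, outliers, missing values, and duplicates. It then covers feature engineering, including data balancing, discretization, encoding, merging, transformation, decomposition, and combination. Finally it turns to feature selection, analyzing feature correlation and importance, and discusses dimensionality-reduction methods such as PCA, SVD, and LDA, along with recursive feature elimination and strategies for removing collinear features.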

Summary generated by CSDN using AI.

Data Cleaning

Type Conversion

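data.info() gives a quick overview of each column's data type and non-null count, which also reveals missing values.
data[col].astype(float) converts a column to float.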
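A plain astype(float) raises an error when a column contains text that cannot be parsed. The sketch below shows one way around this with pd.to_numeric; the toy DataFrame and the column name price are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: 'price' was read in as strings, one entry is malformed
data = pd.DataFrame({"price": ["1.5", "2", "bad", "4.25"]})

# astype(float) would raise ValueError on "bad",
# so coerce unparseable entries to NaN instead
data["price"] = pd.to_numeric(data["price"], errors="coerce")

# Inspect the resulting dtype and non-null count
data.info()
```

The coerced entry then shows up as a missing value and can be handled together with the other missing values later on.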

Special Value Handling

Special values produced by errors, for example ±Inf, NA, and NaN.
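A common way to deal with these values is to convert ±Inf to NaN so that everything can be treated as missing in one pass. A minimal sketch, assuming a toy column named ratio:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data containing +Inf, -Inf and NaN
data = pd.DataFrame({"ratio": [1.0, np.inf, -np.inf, np.nan, 2.5]})

# Count infinite entries (NaN is not counted as infinite)
n_inf = int(np.isinf(data["ratio"]).sum())

# Replace +Inf and -Inf with NaN so they are handled as missing values
data = data.replace([np.inf, -np.inf], np.nan)
n_missing = int(data["ratio"].isna().sum())

print(n_inf, n_missing)
```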

Erroneous Value Handling

For example, a person's age cannot be negative.
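Such a rule can be checked with a simple range test. The sketch below marks impossible ages as missing rather than deleting the rows; the 0 to 120 range is an assumption chosen for illustration and should be adapted to the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two impossible ages
data = pd.DataFrame({"age": [25, -3, 41, 230, 67]})

# Assumed validity range for a human age (inclusive on both ends)
valid = data["age"].between(0, 120)

# Keep valid entries, set invalid ones to NaN
data["age"] = data["age"].where(valid, np.nan)

print(data)
```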

Outlier Handling

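The choice of outlier criterion may deserve more thought; the method used here is the extreme-outlier rule. A value is treated as an outlier if it lies below

$\text{First Quartile} - 3 \times \text{Interquartile Range}$

or above

$\text{Third Quartile} + 3 \times \text{Interquartile Range}$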

# Calculate the first and third quartiles (describe() is called only once)
des = data['col'].describe()
first_quartile = des['25%']
third_quartile = des['75%']

# Interquartile range
iqr = third_quartile - first_quartile

# Keep only rows strictly inside the extreme-outlier bounds
data = data[(data['col'] > (first_quartile - 3 * iqr)) &
            (data['col'] < (third_quartile + 3 * iqr))]
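To see the rule at work on concrete numbers, here is a small worked sketch. The helper name remove_extreme_outliers, its parameter k, and the toy data are all hypothetical; the logic is the same filter as above, wrapped in a function.

```python
import pandas as pd

def remove_extreme_outliers(df, col, k=3.0):
    """Keep rows whose df[col] lies strictly between Q1 - k*IQR and Q3 + k*IQR."""
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[col] > lower) & (df[col] < upper)]

# Hypothetical toy data with one obvious extreme value
toy = pd.DataFrame({"col": [10, 11, 11, 12, 12, 13, 1000]})
cleaned = remove_extreme_outliers(toy, "col")

# Here Q1 = 11, Q3 = 12.5 and IQR = 1.5, so the kept range is (6.5, 17)
print(cleaned)
```

Note that rows with a missing value in the column are also dropped by this filter, because comparisons with NaN evaluate to False.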

Unreasonable distribution, outliers present
[Figure: distribution plots, images not preserved]
Reasonable distribution

Missing Value Handling

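Supplementing with additional data: hard to do in practice
Mean imputation: leaves the column's overall mean unchanged
Regression model prediction: fit a regression model and use its predictions to fill the gaps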
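The last two strategies can be sketched as follows. The toy data (where y is exactly 2 * x) and the choice of scikit-learn's LinearRegression are assumptions made for illustration; any regression model could take its place.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: y = 2 * x, with two y values missing
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "y": [2.0, np.nan, 6.0, np.nan, 10.0]})

# Mean imputation: the column mean is the same before and after filling
mean_filled = df["y"].fillna(df["y"].mean())

# Regression imputation: fit on rows where y is known, predict the rest
known = df["y"].notna()
model = LinearRegression().fit(df.loc[known, ["x"]], df.loc[known, "y"])
reg_filled = df["y"].copy()
reg_filled[~known] = model.predict(df.loc[~known, ["x"]])

print(mean_filled.tolist())
print(reg_filled.tolist())
```

Mean imputation preserves the mean but shrinks the variance, while regression imputation uses the relationship with other columns to produce more plausible values.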

Ranking Columns by Missing Values

import pandas as pd

# Function to calculate missing values by column
def missing_values_table(df):
    # Total missing values per column
    mis_val = df.isnull().sum()

    # Percentage of missing values per column
    mis_val_percent = 100 * mis_val / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Keep only columns with missing values, sorted by percentage descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

Visualizing Missing Values

import missingno

# Show the missing-value matrix
missingno.matrix(dataset_raw, figsize=(30, 5))

missingno.bar(dataset_raw, sort='ascending', figsize = (30,5))

Dropping Missing Values

all_data.dropna(axis=0, inplace=True)

Imputing Missing Values

from sklearn.impute import SimpleImputer
# Create an imputer object with a median filling strategy
imputer = SimpleImputer(strategy='median')

# Train on the training features
imputer.fit(train_features)

# Transform the training data; reuse the same fitted imputer for the test data
X = imputer.transform(train_features)
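The key point of the fit/transform split is that the fill values are learned from the training data only and then applied unchanged to the test data. A small self-contained sketch, with made-up arrays standing in for train_features and a hypothetical test_features:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy feature matrices with missing entries
train_features = np.array([[1.0, 10.0],
                           [3.0, np.nan],
                           [np.nan, 30.0],
                           [5.0, 50.0]])
test_features = np.array([[np.nan, 20.0],
                          [2.0, np.nan]])

imputer = SimpleImputer(strategy="median")
imputer.fit(train_features)  # medians are learned from the training data only

X = imputer.transform(train_features)
X_test = imputer.transform(test_features)  # filled with the training medians

print(imputer.statistics_)  # per-column medians learned during fit
print(X_test)
```

Fitting on the training set alone avoids leaking information from the test set into the preprocessing step.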

Duplicate Handling

data.drop_duplicates(inplace=True)
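By default drop_duplicates only removes rows that match in every column. The sketch below, on made-up data with a hypothetical id column, shows how to count duplicates first and how to deduplicate on a subset of columns instead.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical, rows 2 and 3 share an id
data = pd.DataFrame({"id": [1, 1, 2, 2], "value": ["a", "a", "b", "c"]})

# Number of rows that fully repeat an earlier row
n_dupes = int(data.duplicated().sum())

# Default behaviour: drop fully identical rows, keep the first occurrence
deduped = data.drop_duplicates()

# Stricter: treat rows with the same 'id' as duplicates
deduped_by_id = data.drop_duplicates(subset=["id"], keep="first")

print(n_dupes, len(deduped), len(deduped_by_id))
```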

Feature Engineering

Data Balancing

Data Discretization

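Continuous data → discrete data
We may choose to discretize some of our continuous variables, because some algorithms run faster on discrete inputs. What effect does this have on the results? To find out, compare the modeling results with and without discretization.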
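One way to run such a comparison is to score the same model on the raw feature and on its binned version. The sketch below does this on synthetic data; the data, the logistic-regression model, and the 5-fold cross-validation are all assumptions made for illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the probability of the positive class rises with age
rng = np.random.default_rng(0)
age = rng.uniform(18, 80, size=500)
label = (rng.uniform(size=500) < (age - 18) / 62).astype(int)

# Raw continuous feature as a single column
X_raw = age.reshape(-1, 1)

# Discretized feature: 10 equal-width bins, one-hot encoded
X_bin = pd.get_dummies(pd.cut(age, 10)).to_numpy(dtype=float)

model = LogisticRegression(max_iter=1000)
score_raw = cross_val_score(model, X_raw, label, cv=5).mean()
score_bin = cross_val_score(model, X_bin, label, cv=5).mean()

print(f"raw: {score_raw:.3f}  binned: {score_bin:.3f}")
```

Which variant scores higher depends on the data and the model, so the comparison is worth repeating for each continuous feature that is a candidate for binning.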

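import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# dataset holds the continuous 'age' and the binary 'predclass' label;
# data receives the discretized copy
data['age'] = pd.cut(dataset['age'], 10)  # split the continuous values into 10 equal-width bins

# Left: counts per age bin. Right: age distribution for each income class
sns.set_style('whitegrid')
fig = plt.figure(figsize=(20, 5))
plt.subplot(1, 2, 1)
sns.countplot(y='age', data=data)
plt.subplot(1, 2, 2)
# The density plots need the continuous ages, not the binned ones
sns.histplot(dataset.loc[dataset['predclass'] == 1, 'age'], kde=True, stat='density', label='>$50K')
sns.histplot(dataset.loc[dataset['predclass'] == 0, 'age'], kde=True, stat='density', label='<$50K')
plt.legend()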