Data Cleaning and Preprocessing

WGS.

Article link: https://blog.csdn.net/qq_42363032/article/details/109708232

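Data integration

Once the requirements for developing the scorecard model are settled, the next step is to collect data and integrate it. To describe a borrower's credit profile comprehensively, several dimensions are considered, such as the borrower's basic information, credit data, consumption data, and behavioral data.

This is what is informally called matching against the database: for example, business data and third-party data are joined on a unique identifier.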
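As a rough illustration of joining business data with third-party data on a unique identifier, here is a minimal pandas sketch. The column names (`customer_id`, `credit_card_overdue`, and so on) and the values are made up for this example and do not come from the dataset used later.

```python
import pandas as pd

# Hypothetical business-side records keyed by a unique customer identifier
business = pd.DataFrame({
    "customer_id": [1001, 1002, 1003],
    "age": [34, 45, 29],
    "loan_amount": [5000, 12000, 3000],
})

# Hypothetical third-party records; customer 1003 is absent and 1004 is extra
third_party = pd.DataFrame({
    "customer_id": [1001, 1002, 1004],
    "credit_card_overdue": [0, 2, 1],
})

# A left join keeps every applicant; unmatched third-party fields become NaN.
# validate="one_to_one" raises if the key is duplicated on either side.
merged = business.merge(third_party, on="customer_id", how="left",
                        validate="one_to_one")
print(merged)
```

A left join is used here so that applicants without third-party coverage remain in the sample and show up as missing values rather than silently disappearing.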

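Basic information data

- Basic information data reflects the borrower's qualifications, repayment capacity, and stability,

such as the borrower's age, gender, education, income, years of employment, employer,
employer type, car ownership, and housing type (owned, rented, etc.).

Credit data

- Credit data reflects the borrower's credit history and level of debt,

such as credit bureau data and social security data.

Credit bureau data covers, for example, the borrower's credit card delinquencies, mortgage, car loan, delinquencies on other debit cards, and other liabilities;

social security data covers, for example, the contribution base, the contributing employer, and the number of years contributed.

Consumption data

- Consumption data reflects the borrower's financial strength,

such as the borrower's monthly spending and the types of products purchased.

Behavioral data

- Behavioral data reflects the borrower's actual behavior,

such as the borrower's movement trajectory, network registration location, and internet usage habits.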

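Data cleaning

The goal is to ensure the raw data is correct, so that problems do not surface later in feature engineering or modeling, and so that wrong rules are not learned from wrong data.

Data cleaning includes special-character cleaning, data format conversion, unification of data concepts, data type conversion, and removal of redundant samples.

There are many ways to fill in missing values, such as mean imputation, median imputation, and imputation with a random forest model.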
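The simpler imputation strategies can be sketched with pandas alone. The toy columns below are invented for illustration; mode imputation for the categorical column is an extra option added here, and random forest imputation is not shown.

```python
import numpy as np
import pandas as pd

# Toy data: one numeric column with an extreme value, one categorical column
df_demo = pd.DataFrame({
    "income": [3000.0, np.nan, 5000.0, 100000.0],
    "housing": ["own", "rent", None, "rent"],
})

# Mean imputation: sensitive to the extreme value 100000
income_mean = df_demo["income"].fillna(df_demo["income"].mean())

# Median imputation: more robust when the distribution is skewed
income_median = df_demo["income"].fillna(df_demo["income"].median())

# Mode imputation for a categorical column (an addition, not from the text above)
housing_mode = df_demo["housing"].fillna(df_demo["housing"].mode()[0])

print(income_mean.tolist(), income_median.tolist(), housing_mode.tolist())
```

In this toy case the mean fill is 36000 while the median fill is 5000, which shows why the choice of statistic matters for skewed variables such as income.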

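Exploratory data analysis (EDA)

Exploratory Data Analysis (EDA), described here as descriptive statistical analysis, is a way of quickly understanding the structure and patterns of raw data by computing statistics and visualizing the data. It gives a direct view of each variable's or field's value range, missingness (data completeness), outliers, and distribution, and so provides an overall picture of each variable in preparation for feature engineering.

In short: descriptive statistics plus visualization.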
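The "descriptive statistics" half of EDA can be condensed into a single summary table. The helper below, `quick_profile`, is a hypothetical name introduced for this sketch; it reports dtype, missing rate, number of distinct values, and the range of numeric columns.

```python
import numpy as np
import pandas as pd

def quick_profile(df):
    """Return one row per column: dtype, missing rate, distinct count, min, max."""
    numeric = df.select_dtypes(include="number")
    profile = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_rate": df.isnull().mean(),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })
    # min/max are only defined for numeric columns; other rows stay NaN
    profile["min"] = numeric.min()
    profile["max"] = numeric.max()
    return profile

demo = pd.DataFrame({
    "age": [25, 40, np.nan, 31],
    "housing": ["own", "rent", "rent", None],
})
prof = quick_profile(demo)
print(prof)
```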

Dataset field descriptions

[Figures: two field-description tables for the dataset; images unavailable]

Code implementation

import os
import pandas as pd
import numpy as np
import time
import datetime
import missingno as msno
import matplotlib  # needed for matplotlib.rcParams below
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")  # suppress warnings
matplotlib.rcParams['font.sans-serif'] = ['SimHei']  # use the SimHei font so Chinese text renders
matplotlib.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly

Reading the dataset

def data_read(data_path, file_name):
    # sep=r'\s+' splits on whitespace (replaces the deprecated delim_whitespace=True);
    # header=None means the file has no header row
    df = pd.read_csv(os.path.join(data_path, file_name), sep=r'\s+', header=None, engine='python')
    
    # Assign column names
    columns = ['status_account', 'duration', 'credit_history', 'purpose', 'amount',
               'svaing_account', 'present_emp', 'income_rate', 'personal_status',
               'other_debtors', 'residence_info', 'property', 'age',
               'inst_plans', 'housing', 'num_credits',
               'job', 'dependents', 'telephone', 'foreign_worker', 'target']
    
    df.columns = columns
    
    # Recode the label from 1/2 to 0/1; 0 is a good customer, 1 is a bad customer
    df.target = df.target - 1
    return df
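To see what the whitespace-separated read and the label recoding do without needing the data file, here is a small sketch that reads two made-up rows from an in-memory buffer. The rows and the shortened column list are invented for illustration.

```python
import io
import pandas as pd

# Two invented, whitespace-separated rows with no header line
raw = io.StringIO("A11 6 A34 1\nA12 48 A32 2\n")

# sep=r"\s+" treats any run of whitespace as the delimiter
df_demo = pd.read_csv(raw, sep=r"\s+", header=None)
df_demo.columns = ["status_account", "duration", "credit_history", "target"]

# Recode the label from 1/2 to 0/1 (0 = good, 1 = bad)
df_demo["target"] = df_demo["target"] - 1
print(df_demo)
```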

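Separating discrete and continuous variables

If a column's dtype is int or float, it is treated directly as a numeric (continuous) variable.
Discrete variables may include time-typed columns, so inspect the raw data to tell them apart.

# Separate discrete and continuous variables
def category_continue_separation(df, feature_names):
    categorical_var = []  # discrete features
    numerical_var = []  # continuous features
    
    # Work on a copy so the caller's list is not modified; drop the label
    feature_names = [x for x in feature_names if x != 'target']

    # Check the dtype first: int or float columns are treated as continuous
    numerical_var = list(df[feature_names].select_dtypes(
        include=['int', 'float', 'int32', 'float32', 'int64', 'float64']).columns)

    categorical_var = [x for x in feature_names if x not in numerical_var]

    return categorical_var, numerical_var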
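The dtype-based split can be tried on a tiny frame. The sketch below uses a compact stand-in, `split_feature_types` (a hypothetical name), that follows the same idea with `include="number"`; the sample values are invented.

```python
import pandas as pd

def split_feature_types(df, feature_names):
    """Split features into (categorical, numerical) by dtype, ignoring the label."""
    features = [c for c in feature_names if c != "target"]
    numerical = list(df[features].select_dtypes(include="number").columns)
    categorical = [c for c in features if c not in numerical]
    return categorical, numerical

demo = pd.DataFrame({
    "status_account": ["A11", "A12"],
    "duration": [6, 48],
    "amount": [1169.0, 5951.0],
    "target": [0, 1],
})
all_columns = list(demo.columns)
cat_cols, num_cols = split_feature_types(demo, all_columns)
print(cat_cols, num_cols)
```

A dtype check cannot tell an integer-coded category from a true quantity, so columns with only a handful of distinct integer values are worth reviewing by hand.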

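Because the dataset is fairly clean, dirty data is injected for demonstration purposes.

Randomly append a string to the variable status_account

# Randomly append a special character to status_account
def add_str(x):
    # '/t' is kept as the literal two-character string (possibly intended as a tab)
    str_1 = ['%', ' ', '/t', '$', ';', '@']
    # randint's upper bound is exclusive, so use len(str_1) to make every entry reachable
    str_2 = str_1[np.random.randint(0, high=len(str_1))]
    return x + str_2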

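Adding two columns of time-formatted data

+ time.mktime
Argument: a struct_time or a full 9-element tuple
Return value: a float, for compatibility with time()

+ time.localtime
Converts a timestamp to local time, returned as a time tuple

+ time.strftime
Formats a time according to a format string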
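The three functions chain together as a round trip: tuple to timestamp, timestamp to local time tuple, time tuple to string. A minimal sketch, using an arbitrary date chosen for illustration:

```python
import time

# 9-element tuple: year, month, day, hour, minute, second, weekday, yearday, isdst
# isdst=-1 lets the C library decide whether daylight saving time applies
ts = time.mktime((2012, 6, 15, 12, 0, 0, 0, 0, -1))

# Timestamp (seconds since the epoch) back to a local struct_time
local = time.localtime(ts)

# struct_time to a formatted string
formatted = time.strftime("%Y-%m-%d", local)
print(type(ts).__name__, formatted)
```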

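# Generate a column of random time strings
# num: number of values (rows) to generate; style: output format
def add_time(num, style="%Y-%m-%d"):
    # time.mktime returns the time as a float number of seconds; cast to int for randint
    start_time = int(time.mktime((2010, 1, 1, 0, 0, 0, 0, 0, 0)))
    stop_time = int(time.mktime((2015, 1, 1, 0, 0, 0, 0, 0, 0)))
    re_time = []
    for i in range(num):
        rand_time = np.random.randint(start_time, stop_time)
        # Convert the timestamp to a local time tuple, then format it as a string
        re_time.append(time.strftime(style, time.localtime(rand_time)))
    return re_time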

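Adding redundant data

# Add redundant rows
def add_row(df_temp, num):
    # size: how many random integers to draw
    # np.random.randint(2, size=10)  ----->   array([1, 0, 0, 0, 1, 1, 0, 0, 1, 0])
    
    # The upper bound is exclusive, so shape[0] makes the last row selectable too
    index_1 = np.random.randint(low=0, high=df_temp.shape[0], size=num)
    # iloc selects by position and works regardless of the index labels
    return df_temp.iloc[index_1]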

Special-character cleaning

# Remove the injected special characters
df.status_account = df.status_account.apply(lambda x: x.replace(' ', '').replace('%', '').
                                            replace('/t', '').replace('$', '').replace('@', '').replace(';', ''))
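The same cleanup can also be written with pandas' vectorized string methods. This is an alternative sketch, not the author's code; the sample values mimic the injected characters.

```python
import pandas as pd

# Values with the six injected suffixes: '%', ' ', '/t', '$', ';', '@'
s = pd.Series(["A11%", "A12 ", "A13/t", "A14$", "A11;", "A12@"])

# One regular expression: the literal '/t', or any single character in the class.
# Inside [...] the '$' is a literal character, not an end-of-string anchor.
cleaned = s.str.replace(r"/t|[ %$;@]", "", regex=True)
print(cleaned.tolist())
```

A single pattern keeps the list of characters in one place, which is easier to extend than a chain of `replace` calls.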

Unifying the time format

datetime.datetime.strptime: converts a string to a datetime object

# Unify to the '%Y-%m-%d' format
df['job_time'] = df['job_time'].apply(lambda x: x.split(' ')[0].replace('/', '-'))

# Convert the time strings to datetime objects
df['job_time'] = df['job_time'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))

df['apply_time'] = df['apply_time'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
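As an alternative to row-wise `strptime`, `pd.to_datetime` parses a whole column at once. The sketch below assumes the two formats used in this article; the sample dates are invented.

```python
import pandas as pd

demo = pd.DataFrame({
    "apply_time": ["2012-03-05", "2014-11-30"],
    "job_time": ["2011/07/19 08:15:42", "2013/01/02 23:59:59"],
})

# Parse each column with its own explicit format
demo["apply_time"] = pd.to_datetime(demo["apply_time"], format="%Y-%m-%d")
demo["job_time"] = pd.to_datetime(demo["job_time"], format="%Y/%m/%d %H:%M:%S")

# Drop the time-of-day part so both columns are comparable at day granularity
demo["job_time"] = demo["job_time"].dt.normalize()
print(demo.dtypes)
```

Once both columns are datetime-typed, differences such as `demo["apply_time"] - demo["job_time"]` can be computed directly.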

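Removing redundant samples

df.drop_duplicates(subset=['A','B'],keep='first',inplace=True)
+ subset: the column name(s) used to identify duplicates; the default None uses all columns.
+ keep: one of three values, 'first', 'last', or False; the default is 'first'.
    - 'first': keep the first occurrence of each duplicated row and drop the later ones.
    - 'last': keep the last occurrence and drop the earlier ones.
    - False: drop all duplicated rows.
+ inplace: boolean, default False; if True, duplicates are dropped from the original data in place, otherwise a deduplicated copy is returned.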
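The effect of the three `keep` options is easiest to see on a small frame. The columns `A`, `B`, `C` and their values below are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": ["x", "x", "y", "z"],
    "C": [10, 20, 30, 40],
})

# Rows 0 and 1 share the same (A, B) pair; rows 2 and 3 do not
first = demo.drop_duplicates(subset=["A", "B"], keep="first")
last = demo.drop_duplicates(subset=["A", "B"], keep="last")
none = demo.drop_duplicates(subset=["A", "B"], keep=False)

print(first.index.tolist(), last.index.tolist(), none.index.tolist())
```

Because `inplace` is left at its default here, `demo` itself is unchanged and each call returns a new frame.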

# Remove fully duplicated rows
df.drop_duplicates(subset=None, keep='first', inplace=True)

# Rows can also be deduplicated by order; add an order_id feature for the demonstration
df['order_id'] = np.random.randint(low=0, high=df.shape[0] - 1, size=df.shape[0])
df.drop_duplicates(subset=['order_id'], keep='first', inplace=True)

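To remove duplicates by column name, transpose the data, deduplicate on the index, and transpose back. Note that transposing a mixed-type DataFrame converts the columns to object dtype.

# Drop columns whose names are duplicated
df_1 = df.T
df_1 = df_1[~df_1.index.duplicated()]
df = df_1.T
# Equivalent without transposing (keeps the column dtypes):
# df = df.loc[:, ~df.columns.duplicated()]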
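The sketch below compares the transpose approach with filtering directly on `df.columns.duplicated()`, using a made-up frame that has the column name `x` twice. It is meant only to show the dtype side effect of transposing.

```python
import pandas as pd

# Invented frame with a duplicated column name
demo = pd.DataFrame([[1, "a", 1], [2, "b", 2]], columns=["x", "y", "x"])

# Approach 1: transpose, deduplicate the index, transpose back
via_transpose = demo.T
via_transpose = via_transpose[~via_transpose.index.duplicated()].T

# Approach 2: boolean mask over the columns, no transpose
direct = demo.loc[:, ~demo.columns.duplicated()]

print(via_transpose.dtypes)  # columns become object after the round trip
print(direct.dtypes)         # the integer column keeps its integer dtype
```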

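Exploratory analysis

Viewing summary statistics

df[numerical_var].describe()

Adding missing values

df.reset_index(drop=True, inplace=True)
var_name = categorical_var + numerical_var
# Integer columns cannot hold NaN, so cast the numeric columns to float first
df[numerical_var] = df[numerical_var].astype(float)
for i in var_name:
    num = np.random.randint(low=0, high=df.shape[0] - 1)
    index_1 = np.random.randint(low=0, high=df.shape[0], size=num)
    index_1 = np.unique(index_1)
    # A single .loc call; the chained form df[i].loc[...] may write to a copy
    df.loc[index_1, i] = np.nan

Plotting missing values

msno.bar(df, labels=True, figsize=(10, 6), fontsize=10)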

For continuous data, draw box plots to check for outliers

plt.figure(figsize=(10, 6))  # set the figure size
for j in range(1, len(numerical_var) + 1):
    plt.subplot(2, 4, j)
    df_temp = df[numerical_var[j - 1]][~df[numerical_var[j - 1]].isnull()]
    plt.boxplot(df_temp,
                notch=False,  # no notch at the median
                widths=0.2,  # box width
                medianprops={'color': 'red'},  # median line in red
                boxprops=dict(color="blue"),  # box outline in blue
                labels=[numerical_var[j - 1]],  # tick label
                whiskerprops={'color': "black"},  # whiskers in black
                capprops={'color': "green"},  # caps at the whisker ends in green
                flierprops={'color': 'purple', 'markeredgecolor': "purple"}  # outlier markers (only visible when outliers exist)
                )
plt.show()

[Figure: box plots of the continuous variables; image unavailable]
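A box plot flags outliers visually; the same idea can be checked numerically with the common 1.5 × IQR rule, which matches matplotlib's default whisker setting (`whis=1.5`). The helper name `iqr_outliers` and the sample values are assumptions made for this sketch.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] and the two fences."""
    s = series.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return s[(s < lower) | (s > upper)], (lower, upper)

amounts = pd.Series([1, 2, 3, 4, 5, 100])
outliers, (lower, upper) = iqr_outliers(amounts)
print(outliers.tolist(), lower, upper)
```

Whether a flagged value is an error or a legitimate extreme still needs a judgment based on the business meaning of the variable.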

Viewing the data distribution

# Distribution of each continuous variable by class
# `path` is the output root directory defined in the full source below
for i in numerical_var:
    # e.g. i = 'duration'
    # Keep only the rows where the variable is not missing
    df_temp = df.loc[~df[i].isnull(), [i, 'target']]
    df_good = df_temp[df_temp.target == 0]
    df_bad = df_temp[df_temp.target == 1]
    # Summary statistics
    valid = round(df_temp.shape[0] / df.shape[0] * 100, 2)
    Mean = round(df_temp[i].mean(), 2)
    Std = round(df_temp[i].std(), 2)
    Max = round(df_temp[i].max(), 2)
    Min = round(df_temp[i].min(), 2)
    # Plot
    plt.figure(figsize=(10, 6))
    fontsize_1 = 12
    plt.hist(df_good[i], bins=20, alpha=0.5, label='good samples')
    plt.hist(df_bad[i], bins=20, alpha=0.5, label='bad samples')
    plt.xlabel(i, fontsize=fontsize_1)  # the variable's values are on the x-axis
    plt.title('valid rate=' + str(valid) + '%, Mean=' + str(Mean) + ', Std=' + str(Std) + ', Max=' + str(
        Max) + ', Min=' + str(Min))
    plt.legend()

    # Save the figure, creating the output folder if needed
    os.makedirs(os.path.join(path, 'plot_num'), exist_ok=True)
    file = os.path.join(path, 'plot_num', i + '.png')
    plt.savefig(file)
    plt.close()  # close the current figure

# Distribution of each discrete variable by class
for i in categorical_var:
    # e.g. i = 'status_account'
    # Rows where the variable is not missing
    df_temp = df.loc[~df[i].isnull(), [i, 'target']]
    valid = round(df_temp.shape[0] / df.shape[0] * 100, 2)

    bad_rate = []
    bin_rate = []
    var_name = []
    for j in df[i].unique():

        if pd.isnull(j):
            # Treat missing values as their own category 'NA'
            df_1 = df[df[i].isnull()]
            bad_rate.append(sum(df_1.target) / df_1.shape[0])
            bin_rate.append(df_1.shape[0] / df.shape[0])
            var_name.append('NA')
        else:
            df_1 = df[df[i] == j]
            bad_rate.append(sum(df_1.target) / df_1.shape[0])
            bin_rate.append(df_1.shape[0] / df.shape[0])
            var_name.append(j)
    df_2 = pd.DataFrame({'var_name': var_name, 'bin_rate': bin_rate, 'bad_rate': bad_rate})
    # Plot: bars show each category's share of samples, the line shows its bad rate
    plt.figure(figsize=(10, 6))
    fontsize_1 = 12
    plt.bar(np.arange(1, df_2.shape[0] + 1), df_2.bin_rate, 0.1, color='black', alpha=0.5, label='share of samples')
    plt.xticks(np.arange(1, df_2.shape[0] + 1), df_2.var_name)
    plt.plot(np.arange(1, df_2.shape[0] + 1), df_2.bad_rate, color='green', alpha=0.5, label='bad rate')

    plt.xlabel(i, fontsize=fontsize_1)  # the categories are on the x-axis
    plt.title('valid rate=' + str(valid) + '%')
    plt.legend()
    # Save the figure, creating the output folder if needed
    os.makedirs(os.path.join(path, 'plot_cat'), exist_ok=True)
    file = os.path.join(path, 'plot_cat', i + '.png')
    plt.savefig(file)
    plt.close()  # close the current figure

[Figure: distribution plots by class; image unavailable]
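The per-category share and bad rate computed in the loop above can also be obtained with a single `groupby`. This is a sketch under the same definitions (bad rate = mean of `target`, share = category count divided by total rows); the function name `category_bad_rate` and the sample data are invented.

```python
import pandas as pd

def category_bad_rate(df, col, target="target"):
    """Per-category bad rate and sample share; missing values become 'NA'."""
    filled = df[col].fillna("NA")
    stats = df.groupby(filled)[target].agg(bad_rate="mean", count="size")
    stats["bin_rate"] = stats["count"] / len(df)
    return stats

demo = pd.DataFrame({
    "status_account": ["A11", "A11", "A12", None, "A12", "A11"],
    "target": [1, 0, 0, 1, 1, 0],
})
stats = category_bad_rate(demo, "status_account")
print(stats)
```

The resulting table has one row per category, so it can be passed to the same bar and line plotting calls in place of the hand-built lists.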

Source code

# -*- coding: utf-8 -*-
"""
Data cleaning and preprocessing
"""
import os
import pandas as pd
import numpy as np
import time
import datetime
import missingno as msno
import matplotlib.pyplot as plt
# import myFinance  # local module, not used in this script


# import matplotlib
# import warnings
# warnings.filterwarnings("ignore")  # suppress warnings
# matplotlib.rcParams['font.sans-serif'] = ['SimHei']  # use the SimHei font so Chinese text renders
# matplotlib.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly


# Read the data

def data_read(data_path, file_name):
    # sep=r'\s+' splits on whitespace (replaces the deprecated delim_whitespace=True);
    # header=None means the file has no header row
    df = pd.read_csv(os.path.join(data_path, file_name), sep=r'\s+', header=None, engine='python')
    # Rename the variables
    columns = ['status_account', 'duration', 'credit_history', 'purpose', 'amount',
               'svaing_account', 'present_emp', 'income_rate', 'personal_status',
               'other_debtors', 'residence_info', 'property', 'age',
               'inst_plans', 'housing', 'num_credits',
               'job', 'dependents', 'telephone', 'foreign_worker', 'target']

    df.columns = columns

    # Recode the label from 1/2 to 0/1; 0 is a good customer, 1 is a bad customer
    df.target = df.target - 1
    return df


# Separate discrete and continuous variables
def category_continue_separation(df, feature_names):
    categorical_var = []  # discrete features
    numerical_var = []  # continuous features

    # Work on a copy so the caller's list is not modified; drop the label
    feature_names = [x for x in feature_names if x != 'target']

    # Check the dtype first: int or float columns are treated as continuous
    numerical_var = list(df[feature_names].select_dtypes(
        include=['int', 'float', 'int32', 'float32', 'int64', 'float64']).columns)

    categorical_var = [x for x in feature_names if x not in numerical_var]

    return categorical_var, numerical_var


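# Randomly append a special character to status_account
def add_str(x):
    # '/t' is kept as the literal two-character string (possibly intended as a tab)
    str_1 = ['%', ' ', '/t', '$', ';', '@']
    # randint's upper bound is exclusive, so use len(str_1) to make every entry reachable
    str_2 = str_1[np.random.randint(0, high=len(str_1))]
    return x + str_2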


# Generate a column of random time strings
# num: number of values (rows) to generate; style: output format
def add_time(num, style="%Y-%m-%d"):
    # time.mktime returns the time as a float number of seconds; cast to int for randint
    start_time = int(time.mktime((2010, 1, 1, 0, 0, 0, 0, 0, 0)))
    stop_time = int(time.mktime((2015, 1, 1, 0, 0, 0, 0, 0, 0)))
    re_time = []
    for i in range(num):
        rand_time = np.random.randint(start_time, stop_time)
        # Convert the timestamp to a local time tuple, then format it as a string
        re_time.append(time.strftime(style, time.localtime(rand_time)))
    return re_time


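# Add redundant rows
def add_row(df_temp, num):
    # size: how many random integers to draw
    # np.random.randint(2, size=10)  ----->   array([1, 0, 0, 0, 1, 1, 0, 0, 1, 0])

    # The upper bound is exclusive, so shape[0] makes the last row selectable too
    index_1 = np.random.randint(low=0, high=df_temp.shape[0], size=num)
    # iloc selects by position and works regardless of the index labels
    return df_temp.iloc[index_1]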


if __name__ == '__main__':
    # In a raw string a single backslash is enough
    path = r'G:\code\chapter4'
    data_path = os.path.join(path, 'data')
    file_name = 'german.csv'

    # Read the data
    df = data_read(data_path, file_name)

    # Separate discrete and continuous variables
    feature_names = list(df.columns)
    feature_names.remove('target')
    categorical_var, numerical_var = category_continue_separation(df, feature_names)

    # df.describe()
    ########## Data cleaning ################

    # Inject "dirty" data
    # Randomly append special characters to status_account
    df.status_account = df.status_account.apply(add_str)

    # Add two columns of time-formatted data
    df['apply_time'] = add_time(df.shape[0], "%Y-%m-%d")
    df['job_time'] = add_time(df.shape[0], "%Y/%m/%d %H:%M:%S")

    # Add redundant rows
    df_temp = add_row(df, 10)
    df = pd.concat([df, df_temp], axis=0, ignore_index=True)
    print(df.shape)

    # Data cleaning
    # With the default display settings, wide frames are truncated
    print(df.head())
    # Show more columns, or all of them
    pd.set_option('display.max_columns', 10)
    print(df.head())
    pd.set_option('display.max_columns', None)
    print(df.head())
    # For a discrete variable, look at its set of values first
    print(df.status_account.unique())

    # Remove the injected special characters
    df.status_account = df.status_account.apply(lambda x: x.replace(' ', '').replace('%', '').
                                                replace('/t', '').replace('$', '').replace('@', '').replace(';', ''))

    # unique() returns all distinct values of the column as a numpy.ndarray
    print(df.status_account.unique())

    # Unify the time format
    # Unify to the '%Y-%m-%d' format
    df['job_time'] = df['job_time'].apply(lambda x: x.split(' ')[0].replace('/', '-'))

    # datetime.datetime.strptime: converts a string to a datetime object
    # Convert the time strings to datetime objects
    df['job_time'] = df['job_time'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))
    df['apply_time'] = df['apply_time'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))

    # Remove fully duplicated rows
    df.drop_duplicates(subset=None, keep='first', inplace=True)
    print(df.shape)
    # Rows can also be deduplicated by order; add an order_id feature for the demonstration
    df['order_id'] = np.random.randint(low=0, high=df.shape[0] - 1, size=df.shape[0])
    # subset: the column name(s) used to identify duplicates
    # keep: which occurrence to keep ('first', 'last', or False)
    df.drop_duplicates(subset=['order_id'], keep='first', inplace=True)
    print(df.shape)
    # To drop columns whose names are duplicated
    #    df_1 = df.T
    #    df_1 = df_1[~df_1.index.duplicated()]
    #    df = df_1.T

    # Exploratory analysis
    print(df[numerical_var].describe())
    # Add missing values
    df.reset_index(drop=True, inplace=True)
    var_name = categorical_var + numerical_var
    # Integer columns cannot hold NaN, so cast the numeric columns to float first
    df[numerical_var] = df[numerical_var].astype(float)
    for i in var_name:
        num = np.random.randint(low=0, high=df.shape[0] - 1)
        index_1 = np.random.randint(low=0, high=df.shape[0], size=num)
        index_1 = np.unique(index_1)
        # A single .loc call; the chained form df[i].loc[...] may write to a copy
        df.loc[index_1, i] = np.nan

    # Plot missing values
    msno.bar(df, labels=True, figsize=(10, 6), fontsize=10)

    # For continuous data, draw box plots to check for outliers
    plt.figure(figsize=(10, 6))  # set the figure size
    for j in range(1, len(numerical_var) + 1):
        plt.subplot(2, 4, j)
        df_temp = df[numerical_var[j - 1]][~df[numerical_var[j - 1]].isnull()]
        plt.boxplot(df_temp,
                    notch=False,  # no notch at the median
                    widths=0.2,  # box width
                    medianprops={'color': 'red'},  # median line in red
                    boxprops=dict(color="blue"),  # box outline in blue
                    labels=[numerical_var[j - 1]],  # tick label
                    whiskerprops={'color': "black"},  # whiskers in black
                    capprops={'color': "green"},  # caps at the whisker ends in green
                    flierprops={'color': 'purple', 'markeredgecolor': "purple"}  # outlier markers (only visible when outliers exist)
                    )
    plt.show()

    # View the data distribution
    # Distribution of each continuous variable by class
    for i in numerical_var:
        # e.g. i = 'duration'
        # Keep only the rows where the variable is not missing
        df_temp = df.loc[~df[i].isnull(), [i, 'target']]
        df_good = df_temp[df_temp.target == 0]
        df_bad = df_temp[df_temp.target == 1]
        # Summary statistics
        valid = round(df_temp.shape[0] / df.shape[0] * 100, 2)
        Mean = round(df_temp[i].mean(), 2)
        Std = round(df_temp[i].std(), 2)
        Max = round(df_temp[i].max(), 2)
        Min = round(df_temp[i].min(), 2)
        # Plot
        plt.figure(figsize=(10, 6))
        fontsize_1 = 12
        plt.hist(df_good[i], bins=20, alpha=0.5, label='good samples')
        plt.hist(df_bad[i], bins=20, alpha=0.5, label='bad samples')
        plt.xlabel(i, fontsize=fontsize_1)  # the variable's values are on the x-axis
        plt.title('valid rate=' + str(valid) + '%, Mean=' + str(Mean) + ', Std=' + str(Std) + ', Max=' + str(
            Max) + ', Min=' + str(Min))
        plt.legend()

        # Save the figure, creating the output folder if needed
        os.makedirs(os.path.join(path, 'plot_num'), exist_ok=True)
        file = os.path.join(path, 'plot_num', i + '.png')
        plt.savefig(file)
        plt.close()  # close the current figure

    # Distribution of each discrete variable by class
    for i in categorical_var:
        # e.g. i = 'status_account'
        # Rows where the variable is not missing
        df_temp = df.loc[~df[i].isnull(), [i, 'target']]
        valid = round(df_temp.shape[0] / df.shape[0] * 100, 2)

        bad_rate = []
        bin_rate = []
        var_name = []
        for j in df[i].unique():

            if pd.isnull(j):
                # Treat missing values as their own category 'NA'
                df_1 = df[df[i].isnull()]
                bad_rate.append(sum(df_1.target) / df_1.shape[0])
                bin_rate.append(df_1.shape[0] / df.shape[0])
                var_name.append('NA')
            else:
                df_1 = df[df[i] == j]
                bad_rate.append(sum(df_1.target) / df_1.shape[0])
                bin_rate.append(df_1.shape[0] / df.shape[0])
                var_name.append(j)
        df_2 = pd.DataFrame({'var_name': var_name, 'bin_rate': bin_rate, 'bad_rate': bad_rate})
        # Plot: bars show each category's share of samples, the line shows its bad rate
        plt.figure(figsize=(10, 6))
        fontsize_1 = 12
        plt.bar(np.arange(1, df_2.shape[0] + 1), df_2.bin_rate, 0.1, color='black', alpha=0.5, label='share of samples')
        plt.xticks(np.arange(1, df_2.shape[0] + 1), df_2.var_name)
        plt.plot(np.arange(1, df_2.shape[0] + 1), df_2.bad_rate, color='green', alpha=0.5, label='bad rate')

        plt.xlabel(i, fontsize=fontsize_1)  # the categories are on the x-axis
        plt.title('valid rate=' + str(valid) + '%')
        plt.legend()
        # Save the figure, creating the output folder if needed
        os.makedirs(os.path.join(path, 'plot_cat'), exist_ok=True)
        file = os.path.join(path, 'plot_cat', i + '.png')
        plt.savefig(file)
        plt.close()  # close the current figure