数据科学 5.4 数据清洗(代码)

5.4 数据清洗

import pandas as pd
import os 
import numpy as np
# os.chdir(r"D:\Python_Training\script_Python\5Preprocessing")
camp = pd.read_csv(r'.\teleco_camp_orig.csv')
camp.head()
IDSuc_flagARPUPromCnt12PromCnt36PromCntMsg12PromCntMsg36ClassAgeGenderHomeOwnerAvgARPUAvgHomeValueAvgIncome
012150.061023457.0MH49.8949043340039460
1530NaN5914355.0MH48.5747423760033545
267125.061124157.0FH49.27264610040042091
371180.071024152.0FH47.3349533990039313
4142115.0611241NaNFU47.827404475000

5.4.1、脏数据或数据不正确

import matplotlib.pyplot as plt
plt.hist(camp['AvgIncome'], bins=20, density=True)#查看分布情况
camp['AvgIncome'].describe(include='all')
count      9686.000000
mean      40491.444249
std       28707.494146
min           0.000000
25%       24464.000000
50%       43100.000000
75%       56876.000000
max      200001.000000
Name: AvgIncome, dtype: float64

在这里插入图片描述

plt.hist(camp['AvgHomeValue'], bins=20, density=True)#查看分布情况
camp['AvgHomeValue'].describe(include='all')
count      9686.000000
mean     110986.299814
std       98670.855450
min           0.000000
25%       52300.000000
50%       76900.000000
75%      128175.000000
max      600000.000000
Name: AvgHomeValue, dtype: float64

在这里插入图片描述

#这里的0值应该是缺失值
camp['AvgIncome']=camp['AvgIncome'].replace({0: np.NaN})
#像这种外部获取的数据要比较小心,经常出现意义不清晰或这错误值。AvgHomeValue也有这种情况
plt.hist(camp['AvgIncome'], bins=20, density=True,range=(camp.AvgIncome.min(),camp.AvgIncome.max()))#由于数据中存在缺失值,需要指定绘图的值域
camp['AvgIncome'].describe(include='all')
count      7329.000000
mean      53513.457361
std       19805.168339
min        2499.000000
25%       40389.000000
50%       48699.000000
75%       62385.000000
max      200001.000000
Name: AvgIncome, dtype: float64

在这里插入图片描述

camp['AvgHomeValue']=camp['AvgHomeValue'].replace({0: np.NaN})
plt.hist(camp['AvgHomeValue'], bins=20, density=True,range=(camp.AvgHomeValue.min(),camp.AvgHomeValue.max()))#由于数据中存在缺失值,需要指定绘图的值域
camp['AvgHomeValue'].describe(include='all')
count      9583.000000
mean     112179.202755
std       98522.888583
min        7500.000000
25%       53200.000000
50%       77700.000000
75%      129350.000000
max      600000.000000
Name: AvgHomeValue, dtype: float64

在这里插入图片描述

5.4.2、数据不一致

  • 这个问题需要详细的结合描述统计进行变量说明核对
1、数据重复
camp['dup'] = camp.duplicated() # 生成重复标识变量
camp.dup.head()
0    False
1    False
2    False
3    False
4    False
Name: dup, dtype: bool
#本数据没有重复记录,此处只是示例
camp_dup = camp[camp['dup'] == True] # 把有重复的数据保存出来,以备核查
camp_nodup = camp[camp['dup'] == False] # 注意与camp.drop_duplicates()的区别
camp_nodup.head()
IDSuc_flagARPUPromCnt12PromCnt36PromCntMsg12PromCntMsg36ClassAgeGenderHomeOwnerAvgARPUAvgHomeValueAvgIncomedup
012150.061023457.0MH49.89490433400.039460.0False
1530NaN5914355.0MH48.57474237600.033545.0False
267125.061124157.0FH49.272646100400.042091.0False
371180.071024152.0FH47.33495339900.039313.0False
4142115.0611241NaNFU47.82740447500.0NaNFalse
camp['dup1'] = camp['ID'].duplicated() # 按照主键进行重复记录标识
2、缺失值处理
  • 如果count数量少于样本量,说明存在缺失

  • 缺失最多的两个变量是Age和AvgIncome,缺失了大概20%。

  • 其他有缺失变量请自行填补,找到一个有缺失的分类变量,使用众数进行填补

  • 多重插补:sklearn.preprocessing.Imputer仅可用于填补均值、中位数、众数,多重插补可考虑使用Orange、impute、Theano等包

  • 多重插补的处理有两个要点:1、被解释变量有缺失值的观测不能填补,只能删除;2、只对放入模型的解释变量进行插补。

camp.describe()
IDSuc_flagARPUPromCnt12PromCnt36PromCntMsg12PromCntMsg36ClassAgeAvgARPUAvgHomeValueAvgIncome
count9686.0000009686.0000004843.0000009686.0000009686.0000009686.0000009686.0000009686.0000007279.0000009686.0000009583.0000007329.000000
mean97975.4740860.50000078.1217223.4952517.4669631.0345862.3230442.42453049.56738652.905156112179.20275553513.457361
std56550.1711200.50002662.2256861.2702581.9779090.2441710.9040831.0490476.9913064.99377598522.88858319805.168339
min12.0000000.0000005.0000001.0000001.0000000.0000000.0000001.00000016.00000046.1389687500.0000002499.000000
25%48835.5000000.00000050.0000003.0000006.0000001.0000001.0000002.00000045.00000049.76011653200.00000040389.000000
50%99106.0000000.50000065.0000003.0000008.0000001.0000003.0000002.00000050.00000050.87667277700.00000048699.000000
75%148538.7500001.000000100.0000004.0000008.0000001.0000003.0000003.00000055.00000054.452822129350.00000062385.000000
max191779.0000001.0000001000.00000015.00000020.0000004.0000006.0000004.00000060.00000099.444787600000.000000200001.000000
vmean = camp['Age'].mean(axis=0, skipna=True)
camp['Age_empflag'] = camp['Age'].isnull()    #空值标记
camp['Age']= camp['Age'].fillna(vmean)        #用平均值填补空值
camp['Age'].describe()
count    9686.000000
mean       49.567386
std         6.060585
min        16.000000
25%        47.000000
50%        49.567386
75%        54.000000
max        60.000000
Name: Age, dtype: float64
camp
IDSuc_flagARPUPromCnt12PromCnt36PromCntMsg12PromCntMsg36ClassAgeGenderHomeOwnerAvgARPUAvgHomeValueAvgIncomedupdup1Age_empflag
012150.061023457.000000MH49.89490433400.039460.0FalseFalseFalse
1530NaN5914355.000000MH48.57474237600.033545.0FalseFalseFalse
267125.061124157.000000FH49.272646100400.042091.0FalseFalseFalse
371180.071024152.000000FH47.33495339900.039313.0FalseFalseFalse
4142115.061124149.567386FU47.82740447500.0NaNFalseFalseFalse
......................................................
9681191547175.04923258.000000FU58.40399531800.042358.0FalseFalseFalse
96821916490NaN3813155.000000FU49.16668938200.042373.0FalseFalseFalse
9683191663150.02713254.000000MU49.43463753800.0NaNFalseFalseFalse
96841916720NaN4813159.000000FU48.63490036600.044023.0FalseFalseFalse
96851917791750.03512254.000000MU62.081260143200.079635.0FalseFalseFalse

9686 rows × 17 columns

vmean = camp['AvgHomeValue'].mean(axis=0, skipna=True)
camp['AvgHomeValue_empflag'] = camp['AvgHomeValue'].isnull()
camp['AvgHomeValue']= camp['AvgHomeValue'].fillna(vmean)
camp['AvgHomeValue'].describe()
count      9686.000000
mean     112179.202755
std       97997.592632
min        7500.000000
25%       53500.000000
50%       78450.000000
75%      128175.000000
max      600000.000000
Name: AvgHomeValue, dtype: float64
vmean = camp['AvgIncome'].mean(axis=0, skipna=True)
camp['AvgIncome_empflag'] = camp['AvgIncome'].isnull()
camp['AvgIncome']= camp['AvgIncome'].fillna(vmean)
camp['AvgIncome'].describe()
count      9686.000000
mean      53513.457361
std       17227.468161
min        2499.000000
25%       42775.000000
50%       53513.457361
75%       56876.000000
max      200001.000000
Name: AvgIncome, dtype: float64

5.4.3、噪声值处理

1、盖帽法
def blk(floor, root): # 'blk' will return a function
    def f(x):       
        if x < floor:
            x = floor
        elif x > root:
            x = root
        return x
    return f
q1 = camp['Age'].quantile(0.01) # 计算百分位数
q99 = camp['Age'].quantile(0.99)
blk_tot = blk(floor=q1, root=q99) # 'blk_tot' is a function
camp['Age']= camp['Age'].map(blk_tot)
camp['Age'].describe()
count    9686.000000
mean       49.624685
std         5.835803
min        31.000000
25%        47.000000
50%        49.567386
75%        54.000000
max        60.000000
Name: Age, dtype: float64
2、分箱法
  • 分箱法——等宽分箱
camp['Age_group1'] = pd.qcut( camp['Age'], 4) # 这里以age_oldest_tr字段等宽分为4段
camp.Age_group1
0         (54.0, 60.0]
1         (54.0, 60.0]
2         (54.0, 60.0]
3       (49.567, 54.0]
4       (47.0, 49.567]
             ...      
9681      (54.0, 60.0]
9682      (54.0, 60.0]
9683    (49.567, 54.0]
9684      (54.0, 60.0]
9685    (49.567, 54.0]
Name: Age_group1, Length: 9686, dtype: category
Categories (4, interval[float64]): [(30.999, 47.0] < (47.0, 49.567] < (49.567, 54.0] < (54.0, 60.0]]
  • 分箱法——等深分箱
camp['Age_group2'] = pd.cut( camp['Age'], 4) # 这里以age_oldest_tr字段等宽分为4段
camp.Age_group2
0       (52.75, 60.0]
1       (52.75, 60.0]
2       (52.75, 60.0]
3       (45.5, 52.75]
4       (45.5, 52.75]
            ...      
9681    (52.75, 60.0]
9682    (52.75, 60.0]
9683    (52.75, 60.0]
9684    (52.75, 60.0]
9685    (52.75, 60.0]
Name: Age_group2, Length: 9686, dtype: category
Categories (4, interval[float64]): [(30.971, 38.25] < (38.25, 45.5] < (45.5, 52.75] < (52.75, 60.0]]
camp.to_csv('tele_camp_ok.csv')
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

irober

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值