python数据清洗面试题_利用python进行数据清洗

1.数据清洗知识

2.提出问题

客户违约的可能性是怎样的?

3.数据理解

There are 25 variables:ID: ID of each client

LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit

SEX: Gender (1=male, 2=female) 性别

EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)教育

MARRIAGE: Marital status (1=married, 2=single, 3=others) 婚姻

AGE: Age in years年龄

PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above) 还款状况

PAY_2: Repayment status in August, 2005 (scale same as above)

PAY_3: Repayment status in July, 2005 (scale same as above)

PAY_4: Repayment status in June, 2005 (scale same as above)

PAY_5: Repayment status in May, 2005 (scale same as above)

PAY_6: Repayment status in April, 2005 (scale same as above)

BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar) 账单

BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)

BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)

BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)

BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)

BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)

PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar) 之前的付款金额

PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)

PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)

PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)

PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)

PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)

default.payment.next.month: Default payment (1=yes, 0=no)

数据来源Credit Card Default: a very pedagogical notebook​www.kaggle.com

4.数据清洗

import os

os.getcwd()

os.chdir('/Users/qianxin/downloads')

file=pd.DataFrame(pd.read_csv('UCI_Credit_Card.csv'))

重命名列名

df=file.rename(columns={'default.payment.next.month':'def_pay','PAY_0': 'PAY_1'})

查看数据情况

http://file.info()

RangeIndex: 30000 entries, 0 to 29999 Data columns (total 25 columns): ID 30000 non-null int64 LIMIT_BAL 30000 non-null float64 SEX 30000 non-null int64 EDUCATION 30000 non-null int64 MARRIAGE 30000 non-null int64 AGE 30000 non-null int64 PAY_0 30000 non-null int64 PAY_2 30000 non-null int64 PAY_3 30000 non-null int64 PAY_4 30000 non-null int64 PAY_5 30000 non-null int64 PAY_6 30000 non-null int64 BILL_AMT1 30000 non-null float64 BILL_AMT2 30000 non-null float64 BILL_AMT3 30000 non-null float64 BILL_AMT4 30000 non-null float64 BILL_AMT5 30000 non-null float64 BILL_AMT6 30000 non-null float64 PAY_AMT1 30000 non-null float64 PAY_AMT2 30000 non-null float64 PAY_AMT3 30000 non-null float64 PAY_AMT4 30000 non-null float64 PAY_AMT5 30000 non-null float64 PAY_AMT6 30000 non-null float64 default.payment.next.month 30000 non-null int64 dtypes: float64(13), int64(12) memory usage: 5.7 MB

发现数据并没有缺失值

进一步查看数据情况

file.describe()

其中 pay_0-Pay_6出现了负值,这是不符合常理的,因此将负值全部替换为0

def clean_pay(pay):

list_new=[]

for value in pay:

if value<0:

value=0

else:

value=value

list_new.append(value)

payser=pd.Series(list_new)

return payser

df.loc[:,'PAY_1']=clean_pay(df.loc[:,'PAY_1'])

df.loc[:,'PAY_2']=clean_pay(df.loc[:,'PAY_2'])

df.loc[:,'PAY_3']=clean_pay(df.loc[:,'PAY_3'])

df.loc[:,'PAY_4']=clean_pay(df.loc[:,'PAY_4'])

df.loc[:,'PAY_5']=clean_pay(df.loc[:,'PAY_5'])

df.loc[:,'PAY_6']=clean_pay(df.loc[:,'PAY_6'])

5.数据分析

想通过数据发现违约的概率

#default probability

df.def_pay.sum()/len(df.def_pay)

  • 1
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值