python数据清洗面试题_利用python进行数据清洗

最新推荐文章于 2024-04-20 10:35:46 发布

weixin_39979948

最新推荐文章于 2024-04-20 10:35:46 发布

阅读量497

点赞数 1

文章标签： python数据清洗面试题

1.数据清洗知识

2.提出问题

客户违约的可能性是怎样的？

3.数据理解

There are 25 variables:ID: ID of each client

LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit

SEX: Gender (1=male, 2=female) 性别

EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)教育

MARRIAGE: Marital status (1=married, 2=single, 3=others) 婚姻

AGE: Age in years年龄

PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above) 还款状况

PAY_2: Repayment status in August, 2005 (scale same as above)

PAY_3: Repayment status in July, 2005 (scale same as above)

PAY_4: Repayment status in June, 2005 (scale same as above)

PAY_5: Repayment status in May, 2005 (scale same as above)

PAY_6: Repayment status in April, 2005 (scale same as above)

BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar) 账单

BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)

BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)

BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)

BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)

BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)

PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar) 之前的付款金额

PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)

PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)

PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)

PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)

PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)

default.payment.next.month: Default payment (1=yes, 0=no)

数据来源Credit Card Default: a very pedagogical notebookwww.kaggle.com

4.数据清洗

import os

os.getcwd()

os.chdir('/Users/qianxin/downloads')

file=pd.DataFrame(pd.read_csv('UCI_Credit_Card.csv'))

重命名列名

df=file.rename(columns={'default.payment.next.month':'def_pay','PAY_0': 'PAY_1'})

查看数据情况

http://file.info()

RangeIndex: 30000 entries, 0 to 29999 Data columns (total 25 columns): ID 30000 non-null int64 LIMIT_BAL 30000 non-null float64 SEX 30000 non-null int64 EDUCATION 30000 non-null int64 MARRIAGE 30000 non-null int64 AGE 30000 non-null int64 PAY_0 30000 non-null int64 PAY_2 30000 non-null int64 PAY_3 30000 non-null int64 PAY_4 30000 non-null int64 PAY_5 30000 non-null int64 PAY_6 30000 non-null int64 BILL_AMT1 30000 non-null float64 BILL_AMT2 30000 non-null float64 BILL_AMT3 30000 non-null float64 BILL_AMT4 30000 non-null float64 BILL_AMT5 30000 non-null float64 BILL_AMT6 30000 non-null float64 PAY_AMT1 30000 non-null float64 PAY_AMT2 30000 non-null float64 PAY_AMT3 30000 non-null float64 PAY_AMT4 30000 non-null float64 PAY_AMT5 30000 non-null float64 PAY_AMT6 30000 non-null float64 default.payment.next.month 30000 non-null int64 dtypes: float64(13), int64(12) memory usage: 5.7 MB

发现数据并没有缺失值

进一步查看数据情况

file.describe()

其中 pay_0-Pay_6出现了负值，这是不符合常理的，因此将负值全部替换为0

def clean_pay(pay):

list_new=[]

for value in pay:

if value<0:

value=0

else:

value=value

list_new.append(value)

payser=pd.Series(list_new)

return payser

df.loc[:,'PAY_1']=clean_pay(df.loc[:,'PAY_1'])

df.loc[:,'PAY_2']=clean_pay(df.loc[:,'PAY_2'])

df.loc[:,'PAY_3']=clean_pay(df.loc[:,'PAY_3'])

df.loc[:,'PAY_4']=clean_pay(df.loc[:,'PAY_4'])

df.loc[:,'PAY_5']=clean_pay(df.loc[:,'PAY_5'])

df.loc[:,'PAY_6']=clean_pay(df.loc[:,'PAY_6'])

5.数据分析

想通过数据发现违约的概率

#default probability

df.def_pay.sum()/len(df.def_pay)

weixin_39979948

关注

1
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
python数据清洗面试题_利用python进行数据清洗

1.数据清洗知识2.提出问题客户违约的可能性是怎样的？3.数据理解There are 25 variables:ID: ID of each clientLIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary creditSEX: Gender (1=male, 2=...
复制链接

扫一扫