1.数据清洗知识
2.提出问题
客户违约的可能性是怎样的?
3.数据理解
There are 25 variables:ID: ID of each client
LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit
SEX: Gender (1=male, 2=female) 性别
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)教育
MARRIAGE: Marital status (1=married, 2=single, 3=others) 婚姻
AGE: Age in years年龄
PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above) 还款状况
PAY_2: Repayment status in August, 2005 (scale same as above)
PAY_3: Repayment status in July, 2005 (scale same as above)
PAY_4: Repayment status in June, 2005 (scale same as above)
PAY_5: Repayment status in May, 2005 (scale same as above)
PAY_6: Repayment status in April, 2005 (scale same as above)
BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar) 账单
BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar) 之前的付款金额
PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
default.payment.next.month: Default payment (1=yes, 0=no)
数据来源Credit Card Default: a very pedagogical notebookwww.kaggle.com
4.数据清洗
import os
os.getcwd()
os.chdir('/Users/qianxin/downloads')
file=pd.DataFrame(pd.read_csv('UCI_Credit_Card.csv'))
重命名列名
df=file.rename(columns={'default.payment.next.month':'def_pay','PAY_0': 'PAY_1'})
查看数据情况
http://file.info()
RangeIndex: 30000 entries, 0 to 29999 Data columns (total 25 columns): ID 30000 non-null int64 LIMIT_BAL 30000 non-null float64 SEX 30000 non-null int64 EDUCATION 30000 non-null int64 MARRIAGE 30000 non-null int64 AGE 30000 non-null int64 PAY_0 30000 non-null int64 PAY_2 30000 non-null int64 PAY_3 30000 non-null int64 PAY_4 30000 non-null int64 PAY_5 30000 non-null int64 PAY_6 30000 non-null int64 BILL_AMT1 30000 non-null float64 BILL_AMT2 30000 non-null float64 BILL_AMT3 30000 non-null float64 BILL_AMT4 30000 non-null float64 BILL_AMT5 30000 non-null float64 BILL_AMT6 30000 non-null float64 PAY_AMT1 30000 non-null float64 PAY_AMT2 30000 non-null float64 PAY_AMT3 30000 non-null float64 PAY_AMT4 30000 non-null float64 PAY_AMT5 30000 non-null float64 PAY_AMT6 30000 non-null float64 default.payment.next.month 30000 non-null int64 dtypes: float64(13), int64(12) memory usage: 5.7 MB
发现数据并没有缺失值
进一步查看数据情况
file.describe()
其中 pay_0-Pay_6出现了负值,这是不符合常理的,因此将负值全部替换为0
def clean_pay(pay):
list_new=[]
for value in pay:
if value<0:
value=0
else:
value=value
list_new.append(value)
payser=pd.Series(list_new)
return payser
df.loc[:,'PAY_1']=clean_pay(df.loc[:,'PAY_1'])
df.loc[:,'PAY_2']=clean_pay(df.loc[:,'PAY_2'])
df.loc[:,'PAY_3']=clean_pay(df.loc[:,'PAY_3'])
df.loc[:,'PAY_4']=clean_pay(df.loc[:,'PAY_4'])
df.loc[:,'PAY_5']=clean_pay(df.loc[:,'PAY_5'])
df.loc[:,'PAY_6']=clean_pay(df.loc[:,'PAY_6'])
5.数据分析
想通过数据发现违约的概率
#default probability
df.def_pay.sum()/len(df.def_pay)