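Analyzing a credit lending problem

Main content: cleaning the data from several angles, building classification models, and evaluating classification accuracy and recall

Data cleaning mainly consists of:

1: Inspect the basic information of the data: number of samples and number of features

2: Remove features judged to be irrelevant, columns with more than half of their values missing, rows that are exact duplicates of another row, and rows whose class is ambiguous (in this example, rows where the loan outcome is unclear)

3: Remove features that take only one value, or only one value plus NaN

4: Count the null values and drop the columns that have many of them

5: If some features have only a few nulls, the samples containing those nulls can simply be dropped

6: Remove highly similar features, keeping just one of them

7: Check every feature's dtype for non-numeric features and one-hot encode them (sklearn's machine learning algorithms cannot operate on non-numeric, i.e. object-typed, columns, so everything has to be converted to a numeric type)

8: During modeling, if the metrics look suspiciously high (in this example the true positive rate and the false positive rate are both close to 1), check whether the classes are imbalanced. If they are, adjust the class weights. Method 1: class_weight="balanced". Method 2: custom weights

9: If that still does not work, consider switching to a different model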
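
Steps 2 and 3 above can be sketched on a small toy table. This is only an illustration under assumptions: the column names and values below are made up and are not taken from the Lending Club file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
toy = pd.DataFrame({
    "amount": [1000, 2000, 1500, 3000],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "constant": ["x", "x", np.nan, "x"],
    "status": ["Fully Paid", "Charged Off", "Current", "Fully Paid"],
})

# Step 2: keep only columns in which at least half of the values are present.
min_non_null = int(np.ceil(len(toy) / 2))
toy = toy.dropna(thresh=min_non_null, axis=1)

# Step 2: keep only rows with an unambiguous label, then map the label to 1/0.
toy = toy[toy["status"].isin(["Fully Paid", "Charged Off"])].copy()
toy["status"] = toy["status"].map({"Fully Paid": 1, "Charged Off": 0})

# Step 3: drop columns that hold a single distinct value once NaN is ignored.
single_valued = [c for c in toy.columns if toy[c].dropna().nunique() == 1]
toy = toy.drop(columns=single_valued)

print(list(toy.columns))
print(toy.shape)
```

Ignoring NaN before counting distinct values is what lets the "one value plus NaN" case from step 3 be caught by the same check.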

# # ************** First data-cleaning pass ******************
# import pandas as pd
# # skiprows: number of lines to skip from the start of the file, or a list of 0-based line numbers to skip
# loans_2007 = pd.read_csv('./Lending Club Statistics/LoanStats3a.csv', skiprows=1)
# # print(loans_2007.shape) #(42538, 145)
#
# half_count = len(loans_2007) // 2
# loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)  # drop columns that have fewer than half_count non-null values
# # 'url' is not listed in the drop below: the dropna step above has already removed it, so dropping it again would fail
# loans_2007 = loans_2007.drop(['desc'], axis=1)  # drop a column judged to be useless
# loans_2007.to_csv('loans_2007.csv', index=False)

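## ************** Second data-cleaning pass ******************
# import pandas as pd
# loans_2007 = pd.read_csv('loans_2007.csv')
# # print(loans_2007.shape) #(42538, 53)
# loans_2007.drop_duplicates()  # drops a row only when all of its values match another row; it returns a new DataFrame, so without assignment loans_2007 itself is unchanged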
# # print(loans_2007.shape) #(42538, 53)
# # print(loans_2007.iloc[0])
# # print(loans_2007.iloc[0].shape[0]) #53
#  
# # Drop features that contribute little
# # print(loans_2007.shape) #(42538, 53)
# # print('id' in loans_2007.keys()) #False
# # print('member_id' in loans_2007.keys()) #False
# drop_columns = ["funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d",
#                 "zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp",
#                 "total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d",
#                 "last_pymnt_amnt"]
# loans_2007 = loans_2007.drop(drop_columns, axis=1)
# # print(loans_2007.shape)  #(42538, 35)
# 
# # Drop rows whose loan status is ambiguous
# # print(loans_2007['loan_status'].value_counts()) #34116  5670 1988 761
# loans_2007 = loans_2007[(loans_2007['loan_status'] == 'Fully Paid') | (loans_2007['loan_status'] == 'Charged Off')]
# status_replace = {'loan_status':{'Fully Paid':1,
#                                  'Charged Off':0
#                                  }
#                 }
# loans_2007 = loans_2007.replace(status_replace)
# # print(loans_2007['loan_status'].value_counts()) #34116  5670
# 
# # Drop features that contain only a single value
# print(loans_2007.shape) #(39786, 35)
# orig_columns = loans_2007.columns
# drop_columns = []
# for col in orig_columns:
#     col_series = loans_2007[col].dropna().unique()  # ignore NaN so that a column holding only NaN plus one fixed value is caught too
#     if len(col_series) == 1:
#         drop_columns.append(col)
# loans_2007 = loans_2007.drop(drop_columns, axis=1)
# print(drop_columns)
# print(loans_2007.shape) #(39786, 24)
# loans_2007.to_csv('filtered_loans_2007.csv', index=False)


# ************** Third data-cleaning pass ******************
# import pandas as pd
# loans = pd.read_csv('filtered_loans_2007.csv')
# # Count the null values in each column
# null_counts = loans.isnull().sum()
# # print(null_counts) #pub_rec_bankruptcies 697    revol_util 50    title 10
# # Drop the column with many nulls, then drop the rows that still contain nulls
# loans = loans.drop('pub_rec_bankruptcies', axis=1)
# loans = loans.dropna(axis=0)
#
# print(loans.dtypes.value_counts())  # about half of the features have dtype object
# object_columns_df = loans.select_dtypes(include=['object'])  # select columns by dtype
# print(object_columns_df.iloc[0])  # iloc selects by row position
# '''
#     term                      36 months
#     int_rate                     10.65%
#     emp_length                10+ years
#     home_ownership                 RENT
#     verification_status        Verified
#     purpose                 credit_card
#     title                      Computer
#     addr_state                       AZ
#     earliest_cr_line           Jan-1985
#     revol_util                    83.7%
#     last_credit_pull_d         Jun-2018
#     debt_settlement_flag              N
# '''
# cols = ['home_ownership','verification_status','emp_length','term','addr_state']
# # for c in cols:
# #     print(loans[c].value_counts())
#    
# # purpose and title both describe the reason for the loan and are largely redundant, so one of them can be dropped; title is dropped here
# # print(loans['purpose'].value_counts())
# # print(loans['title'].value_counts())
#    
# mapping_dict = {'emp_length':{'10+ years':10,
#                               '9 years':9,
#                               '8 years':8,
#                               '7 years':7,
#                               '6 years':6,
#                               '5 years':5,
#                               '4 years':4,
#                               '3 years':3,
#                               '2 years':2,
#                               '1 year':1,
#                               '< 1 year':0,
#                               'n/a':0
#                               }}
#
# loans = loans.drop(['last_credit_pull_d','earliest_cr_line','addr_state','title'],axis = 1)
# loans['int_rate'] = loans['int_rate'].str.rstrip('%').astype('float')
# loans['revol_util'] = loans['revol_util'].str.rstrip('%').astype('float')
# loans = loans.replace(mapping_dict)  # emp_length becomes a numeric column
#
# cat_columns = ['home_ownership','verification_status','purpose','term']  # emp_length is already numeric, so it is not one-hot encoded
# dummy_df = pd.get_dummies(loans[cat_columns])  # one-hot encode the categorical columns
# loans = pd.concat([loans,dummy_df],axis = 1)
# loans = loans.drop(cat_columns,axis = 1)
# loans = loans.drop('debt_settlement_flag',axis = 1)
# loans.to_csv('cleaned_loans_2007.csv', index=False)  # index=False keeps the row index from being read back as a feature
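
The two conversions used in this pass, stripping the percent sign and one-hot encoding, can be tried in isolation. The sketch below runs on a made-up three-row table. The values are assumptions for illustration, and only the column names mirror the ones above.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the loan data.
toy = pd.DataFrame({
    "int_rate": ["10.65%", "15.27%", "7.90%"],
    "term": ["36 months", "60 months", "36 months"],
    "home_ownership": ["RENT", "OWN", "RENT"],
})

# Turn a percentage string such as "10.65%" into the float 10.65.
toy["int_rate"] = toy["int_rate"].str.rstrip("%").astype(float)

# One-hot encode the categorical columns; each category becomes its own column
# named "<original column>_<category>".
cat_columns = ["term", "home_ownership"]
dummies = pd.get_dummies(toy[cat_columns])
encoded = pd.concat([toy.drop(columns=cat_columns), dummies], axis=1)

print(encoded.columns.tolist())
print(encoded.dtypes)
```

After this step no object-typed column is left, which is what the sklearn estimators used later require.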

#
import pandas as pd
loans = pd.read_csv('cleaned_loans_2007.csv')
# print(loans.info())

## First model
# from sklearn.linear_model import LogisticRegression
# lr = LogisticRegression()
# cols = loans.columns
# train_cols = cols.drop("loan_status")
# features = loans[train_cols]
# target = loans["loan_status"]
# lr.fit(features, target)
# predictions = lr.predict(features)  # predictions on the training data itself
# # cross_val_predict and KFold live in sklearn.model_selection (scikit-learn 0.18+)
# from sklearn.model_selection import cross_val_predict, KFold
# lr = LogisticRegression()
# kf = KFold(n_splits=3)  # 3 consecutive folds without shuffling
# predictions = cross_val_predict(lr, features, target, cv=kf)  # out-of-fold predictions replace the ones above
# predictions = pd.Series(predictions)
# 
# # False positives.
# fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
# fp = len(predictions[fp_filter])
# # True positives.
# tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
# tp = len(predictions[tp_filter])
# # False negatives.
# fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
# fn = len(predictions[fn_filter])
# # True negatives
# tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
# tn = len(predictions[tn_filter])
# # Rates
# tpr = tp / float((tp + fn))
# fpr = fp / float((fp + tn))
# print(tpr) #0.9994717224782086
# print(fpr) #0.9991152008494072
# print(predictions[:20])

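## Printing the predictions shows that they are all 1 (lend to everyone), so this model is useless. The cause is class imbalance: most borrowers have good credit
## Add class weights: raise the weight of the negative samples and lower the weight of the positive samples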
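
To make the reweighting concrete, the sketch below shows what class_weight="balanced" works out to on a deliberately imbalanced label vector. The 90/10 split is an assumption chosen for easy arithmetic, not the ratio in the loan data. sklearn's documented formula is n_samples / (n_classes * count_of_class).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 1 = repaid, 0 = charged off, with a 9:1 imbalance.
y = np.array([1] * 90 + [0] * 10)

# "balanced" weights: n_samples / (n_classes * number of samples in the class).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
weight_by_class = dict(zip([0, 1], weights))

print(weight_by_class)
```

Here the rare class 0 gets weight 100 / (2 * 10) = 5, and the common class 1 gets 100 / (2 * 90), about 0.56, so each misclassified bad loan costs roughly nine times as much as a misclassified good one.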


## Second model
# from sklearn.linear_model import LogisticRegression
# # cross_val_predict and KFold live in sklearn.model_selection (scikit-learn 0.18+)
# from sklearn.model_selection import cross_val_predict, KFold
#
# cols = loans.columns
# train_cols = cols.drop("loan_status")
# features = loans[train_cols]
# target = loans["loan_status"]
# # Use sklearn's built-in balancing strategy to reweight the positive and negative classes
# lr = LogisticRegression(class_weight="balanced")
# kf = KFold(n_splits=3)  # 3 consecutive folds without shuffling
# predictions = cross_val_predict(lr, features, target, cv=kf)
# predictions = pd.Series(predictions)
# # False positives.
# fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
# fp = len(predictions[fp_filter])
# # True positives.
# tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
# tp = len(predictions[tp_filter])
# # False negatives.
# fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
# fn = len(predictions[fn_filter])
# # True negatives
# tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
# tn = len(predictions[tn_filter])
# # Rates
# tpr = tp / float((tp + fn))
# fpr = fp / float((fp + tn))
# print(tpr) #0.708537551727174
# print(fpr) #0.4429304547867634

## Third model
# from sklearn.linear_model import LogisticRegression
# # cross_val_predict and KFold live in sklearn.model_selection (scikit-learn 0.18+)
# from sklearn.model_selection import cross_val_predict, KFold
# cols = loans.columns
# train_cols = cols.drop("loan_status")
# features = loans[train_cols]
# target = loans["loan_status"]
# # Custom class weights instead of sklearn's built-in "balanced" option
# penalty = {
#     0: 5,
#     1: 1
# }
# lr = LogisticRegression(class_weight=penalty)
# kf = KFold(n_splits=3)  # 3 consecutive folds without shuffling
# predictions = cross_val_predict(lr, features, target, cv=kf)
# predictions = pd.Series(predictions)
# # False positives.
# fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
# fp = len(predictions[fp_filter])
# # True positives.
# tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
# tp = len(predictions[tp_filter])
# # False negatives.
# fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
# fn = len(predictions[fn_filter])
# # True negatives
# tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
# tn = len(predictions[tn_filter])
# # Rates
# tpr = tp / float((tp + fn))
# fpr = fp / float((fp + tn))
# print(tpr) #0.6711178939336131
# print(fpr) #0.4546098035745886
# # print(predictions[:20])

## Fourth model
from sklearn.ensemble import RandomForestClassifier
# cross_val_predict and KFold live in sklearn.model_selection (scikit-learn 0.18+)
from sklearn.model_selection import cross_val_predict, KFold
cols = loans.columns
train_cols = cols.drop("loan_status")
features = loans[train_cols]
target = loans["loan_status"]
rf = RandomForestClassifier(n_estimators=10, class_weight="balanced", random_state=1)
# help(RandomForestClassifier)
kf = KFold(n_splits=3)  # 3 consecutive folds without shuffling
predictions = cross_val_predict(rf, features, target, cv=kf)
predictions = pd.Series(predictions)
# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))
print(tpr)  # 0.9846506031168374 of little use
print(fpr)  # 0.966377632277473 of little use
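
The four boolean filters repeated for every model can also be replaced by a single call to sklearn.metrics.confusion_matrix. The sketch below is one way to do it. The label and prediction arrays are made up for illustration and are not results from the loan data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = repaid, 0 = charged off).
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 1, 0, 1, 0])

# With labels=[0, 1] the flattened matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

tpr = tp / (tp + fn)  # share of good loans that are approved
fpr = fp / (fp + tn)  # share of bad loans that are approved anyway

print(tpr)
print(fpr)
```

A useful model for this problem needs a high tpr together with a low fpr. When both are close to 1, as in the runs above, the model is approving nearly every loan.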

 
