Analyzing a Credit Lending Problem

淮南草

Category: Data Mining

Article link: https://blog.csdn.net/zhuisaozhang1292/article/details/81568906

Main topics: multi-stage data cleaning, building a classification model, and evaluating classification precision and recall.

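The data cleaning mainly consists of the following steps:

1: Inspect the basic information of the data: the number of samples and the number of features.

2: Remove features judged to have no influence, columns with more than half of their values missing, fully duplicated rows (every element identical), and rows whose class label is ambiguous (in this example, rows whose loan status is neither clearly repaid nor clearly charged off).

3: Remove features that take only a single value, or a single value plus NaN.

4: Count the null values and drop the columns that contain many nulls.

5: If some features contain only a few nulls, the samples (rows) containing those nulls can simply be dropped.

6: Among highly similar features, keep only one.

7: Check the dtype of every feature, look for non-numeric features, and one-hot encode them (the machine learning algorithms in sklearn cannot operate on non-numeric, i.e. object, columns, so everything must be converted to numeric types).

8: During modeling, if the true positive rate (recall) and the false positive rate are both very high, check whether the classes are imbalanced. If they are, adjust the class weights: method 1 is class_weight="balanced", method 2 is a custom weight dictionary.

9: If that still does not help, consider switching to a different model.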
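
The missing-value, single-value, and one-hot encoding steps above can be sketched on a toy table. This is only an illustration: the column names and values below are made up and are not taken from the Lending Club data.

```python
import pandas as pd

# Hypothetical toy data standing in for the loan table.
df = pd.DataFrame({
    "amount": [1000, 2000, 1500, 3000],
    "mostly_missing": [None, None, None, 1.0],  # 3 of 4 values missing
    "constant": ["x", "x", None, "x"],          # one value plus NaN
    "home": ["RENT", "OWN", "RENT", "OWN"],     # non-numeric feature
})

# Keep only columns that have at least half of their values present.
df = df.dropna(thresh=len(df) // 2, axis=1)

# Drop columns that hold a single distinct value once NaN is ignored.
single_valued = [c for c in df.columns if df[c].dropna().nunique() == 1]
df = df.drop(columns=single_valued)

# One-hot encode the remaining non-numeric column.
df = pd.get_dummies(df, columns=["home"])

print(df.columns.tolist())
```

Only the numeric column and the two indicator columns derived from "home" survive, which is the shape of table the models below expect.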

# # ************** Data cleaning, pass 1 ******************
# import pandas as pd
# # skiprows: number of lines to skip from the start of the file, or a list of 0-based line numbers to skip
# loans_2007 = pd.read_csv('./Lending Club Statistics/LoanStats3a.csv', skiprows=1)
# # print(loans_2007.shape) #(42538, 145)
# 
# half_count = len(loans_2007) // 2
# loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)  # drop a column when more than half of its values are missing
# # Drop columns judged useless. 'url' cannot be listed here too, because the dropna step above already removed it.
# loans_2007 = loans_2007.drop(['desc'], axis=1)
# loans_2007.to_csv('loans_2007.csv', index=False)

## ************** Data cleaning, pass 2 ******************
# import pandas as pd
# loans_2007 = pd.read_csv('loans_2007.csv')
# # print(loans_2007.shape) #(42538, 53)
# loans_2007 = loans_2007.drop_duplicates()  # removes a row only when all of its elements match another row; the result must be assigned back
# # print(loans_2007.shape) #(42538, 53)
# # print(loans_2007.iloc[0])
# # print(loans_2007.iloc[0].shape[0]) #53
#  
# # Drop features that are of little use
# # print(loans_2007.shape) #(42538, 53)
# # print('id' in loans_2007.keys()) #False
# # print('member_id' in loans_2007.keys()) #False
# drop_columns = ["funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d",
#                 "zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp",
#                 "total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d",
#                 "last_pymnt_amnt"]
# loans_2007 = loans_2007.drop(drop_columns, axis=1)
# # print(loans_2007.shape)  #(42538, 35)
# 
# # Drop rows whose loan status is ambiguous
# # print(loans_2007['loan_status'].value_counts()) #34116  5670 1988 761
# loans_2007 = loans_2007[(loans_2007['loan_status'] == 'Fully Paid') | (loans_2007['loan_status'] == 'Charged Off')]
# status_replace = {'loan_status':{'Fully Paid':1,
#                                  'Charged Off':0
#                                  }
#                 }
# loans_2007 = loans_2007.replace(status_replace)
# # print(loans_2007['loan_status'].value_counts()) #34116  5670
# 
# # Drop features that take only one distinct value
# print(loans_2007.shape) #(39786, 35)
# orig_columns = loans_2007.columns
# drop_columns = []
# for col in orig_columns:
#     col_series = loans_2007[col].dropna().unique()  # dropna() also catches a feature that holds only NaN plus one fixed value
#     if len(col_series) == 1:
#         drop_columns.append(col)
# loans_2007 = loans_2007.drop(drop_columns,axis = 1)
# print(drop_columns)
# print(loans_2007.shape) #(39786, 24)
# loans_2007.to_csv('filtered_loans_2007.csv',index = False)        


# ************** Data cleaning, pass 3 ******************
# import pandas as pd
# loans = pd.read_csv('filtered_loans_2007.csv')
# # Count the null values in each column
# null_counts = loans.isnull().sum()
# # print(null_counts) #pub_rec_bankruptcies 697    revol_util 50    title 10
# # Drop the column with many nulls, then drop the rows that still contain nulls
# loans = loans.drop('pub_rec_bankruptcies',axis = 1)
# loans = loans.dropna(axis=0)
#   
# print(loans.dtypes.value_counts()) # about half of the features have dtype object
# object_columns_df = loans.select_dtypes(include=['object'])  # select columns by dtype
# print(object_columns_df.iloc[0]) # iloc indexes rows by position
# '''
#     term                      36 months
#     int_rate                     10.65%
#     emp_length                10+ years
#     home_ownership                 RENT
#     verification_status        Verified
#     purpose                 credit_card
#     title                      Computer
#     addr_state                       AZ
#     earliest_cr_line           Jan-1985
#     revol_util                    83.7%
#     last_credit_pull_d         Jun-2018
#     debt_settlement_flag              N
# '''
# cols = ['home_ownership','verification_status','emp_length','term','addr_state']
# # for c in cols:
# #     print(loans[c].value_counts())
#    
# # purpose and title both describe the purpose of the loan and are very similar, so one of them can be dropped; title is dropped here
# # print(loans['purpose'].value_counts())
# # print(loans['title'].value_counts())
#    
# mapping_dict = {'emp_length':{'10+ years':10,
#                               '9 years':9,
#                               '8 years':8,
#                               '7 years':7,
#                               '6 years':6,
#                               '5 years':5,
#                               '4 years':4,
#                               '3 years':3,
#                               '2 years':2,
#                               '1 year':1,
#                               '< 1 year':0,
#                               'n/a':0
#                               }}
#    
# loans = loans.drop(['last_credit_pull_d','earliest_cr_line','addr_state','title'],axis = 1)
# loans['int_rate'] = loans['int_rate'].str.rstrip('%').astype('float')
# loans['revol_util'] = loans['revol_util'].str.rstrip('%').astype('float')
# loans = loans.replace(mapping_dict)
#    
# # emp_length is numeric after the mapping above, so it is left out of the one-hot encoding
# cat_columns = ['home_ownership','verification_status','purpose','term']
# dummy_df = pd.get_dummies(loans[cat_columns])  # one-hot encode the categorical columns
# loans = pd.concat([loans,dummy_df],axis = 1)
# loans = loans.drop(cat_columns,axis = 1)
# loans = loans.drop('debt_settlement_flag',axis = 1)
# loans.to_csv('cleaned_loans_2007.csv', index=False)  # index=False keeps the row index from being read back as a feature
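
The string-to-number conversions in the third cleaning pass can be checked on a couple of rows before running them on the full file. The sketch below uses a made-up two-row table whose values mimic the format shown in the sample row above; it is an illustration, not part of the original pipeline.

```python
import pandas as pd

# Hypothetical rows in the same string format as the raw loan data.
toy = pd.DataFrame({
    "int_rate": ["10.65%", "7.90%"],
    "emp_length": ["10+ years", "< 1 year"],
})

# Strip the trailing percent sign and convert to float.
toy["int_rate"] = toy["int_rate"].str.rstrip("%").astype(float)

# Map the employment-length labels to integers.
toy["emp_length"] = toy["emp_length"].map({"10+ years": 10, "< 1 year": 0})

print(toy.dtypes)
```

If a label is missing from the mapping (for example because of a misspelled key), `map` leaves NaN in that row, which makes the mistake easy to spot.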

# Load the cleaned data that the models below are trained on
import pandas as pd
loans = pd.read_csv('cleaned_loans_2007.csv')
# print(loans.info())

## Model 1: logistic regression with default settings
# from sklearn.linear_model import LogisticRegression
# from sklearn.model_selection import cross_val_predict, KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
# cols = loans.columns
# train_cols = cols.drop("loan_status")
# features = loans[train_cols]
# target = loans["loan_status"]
# lr = LogisticRegression()
# kf = KFold(n_splits=3)  # 3 unshuffled folds, the default of the old KFold(n, random_state=1) call
# predictions = cross_val_predict(lr, features, target, cv=kf)
# predictions = pd.Series(predictions)
# 
# # False positives.
# fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
# fp = len(predictions[fp_filter])
# # True positives.
# tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
# tp = len(predictions[tp_filter])
# # False negatives.
# fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
# fn = len(predictions[fn_filter])
# # True negatives
# tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
# tn = len(predictions[tn_filter])
# # Rates
# tpr = tp / float((tp + fn))
# fpr = fp / float((fp + tn))
# print(tpr) #0.9994717224782086
# print(fpr) #0.9991152008494072
# print(predictions[:20])

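## Printing the predictions shows that they are all 1 (lend to everyone), so this model is useless. The cause is class imbalance: most borrowers have good credit.
## Fix: add class weights, increasing the weight of the negative samples and decreasing the weight of the positive samples.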
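
To see why a true positive rate and a false positive rate that are both close to 1 indicate a useless model, here is a small sketch. The helper name `tpr_fpr` and the labels are made up for illustration; the helper computes the same two rates as the filter-based code in this post.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

# Made-up labels: 8 repaid loans (1) and 2 charged-off loans (0).
y_true = [1] * 8 + [0] * 2

# A "model" that approves every loan.
always_lend = [1] * 10

tpr, fpr = tpr_fpr(y_true, always_lend)
print(tpr, fpr)  # 1.0 1.0
```

Approving everyone catches every good loan (TPR of 1) but also approves every bad one (FPR of 1), which is the pattern the first model shows.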


## Model 2: logistic regression with balanced class weights
# from sklearn.linear_model import LogisticRegression
# from sklearn.model_selection import cross_val_predict, KFold
# 
# cols = loans.columns
# train_cols = cols.drop("loan_status")
# features = loans[train_cols]
# target = loans["loan_status"]
# # Use sklearn's built-in balancing strategy to reweight the positive and negative classes
# lr = LogisticRegression(class_weight="balanced")
# kf = KFold(n_splits=3)  # 3 unshuffled folds, matching the old KFold default
# predictions = cross_val_predict(lr, features, target, cv=kf)
# predictions = pd.Series(predictions)
# # False positives.
# fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
# fp = len(predictions[fp_filter])
# # True positives.
# tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
# tp = len(predictions[tp_filter])
# # False negatives.
# fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
# fn = len(predictions[fn_filter])
# # True negatives
# tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
# tn = len(predictions[tn_filter])
# # Rates
# tpr = tp / float((tp + fn))
# fpr = fp / float((fp + tn))
# print(tpr) #0.708537551727174
# print(fpr) #0.4429304547867634
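
For reference, scikit-learn documents the "balanced" mode as weighting each class by n_samples / (n_classes * class_count). The sketch below reproduces that formula in plain Python; the helper name `balanced_class_weights` is made up, and the class counts are the loan_status counts printed in the second cleaning pass, so the weights for the final cleaned data will differ slightly.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# 34116 fully paid loans (1) and 5670 charged-off loans (0).
labels = [1] * 34116 + [0] * 5670

weights = balanced_class_weights(labels)
print(weights)
```

With these counts the negative class ends up weighted roughly six times as heavily as the positive class, which is why the balanced model stops predicting 1 for everything.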

## Model 3: logistic regression with custom class weights
# from sklearn.linear_model import LogisticRegression
# from sklearn.model_selection import cross_val_predict, KFold
# cols = loans.columns
# train_cols = cols.drop("loan_status")
# features = loans[train_cols]
# target = loans["loan_status"]
# # Custom class weights instead of sklearn's "balanced" option
# penalty = {
#     0: 5,
#     1: 1
# }
# lr = LogisticRegression(class_weight=penalty)
# kf = KFold(n_splits=3)  # 3 unshuffled folds, matching the old KFold default
# predictions = cross_val_predict(lr, features, target, cv=kf)
# predictions = pd.Series(predictions)
# # False positives.
# fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
# fp = len(predictions[fp_filter])
# # True positives.
# tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
# tp = len(predictions[tp_filter])
# # False negatives.
# fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
# fn = len(predictions[fn_filter])
# # True negatives
# tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
# tn = len(predictions[tn_filter])
# # Rates
# tpr = tp / float((tp + fn))
# fpr = fp / float((fp + tn))
# print(tpr) #0.6711178939336131
# print(fpr) #0.4546098035745886
# # print(predictions[:20])

## Model 4: random forest with balanced class weights
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
cols = loans.columns
train_cols = cols.drop("loan_status")
features = loans[train_cols]
target = loans["loan_status"]
rf = RandomForestClassifier(n_estimators=10,class_weight="balanced", random_state=1)
# help(RandomForestClassifier)
kf = KFold(n_splits=3)  # 3 unshuffled folds, matching the old KFold default
predictions = cross_val_predict(rf, features, target, cv=kf)
predictions = pd.Series(predictions)
# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))
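print(tpr)  # 0.9846506031168374, not useful: almost everything is still predicted as 1
print(fpr)  # 0.966377632277473, not useful: nearly all bad loans are approved as well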
