Classifying the German Credit Data Set with a Logistic Regression Model

Case: German Credit

In this assignment, we use a logistic regression model to classify the German Credit data set, and we evaluate the model with a confusion matrix and an ROC curve.

If you have questions about this article or would like the data and code, message the author directly or add the author's WeChat: 1178623893.

The German Credit data set contains observations on 30 variables for 1000 past applicants for credit. Each applicant was rated as “good credit” (700 cases) or “bad credit” (300 cases).


Assignment

1. Review the predictor variables and guess from their definitions what their role might be in a credit decision. Are there any surprises in the data?
2. Divide the data randomly into training (60%) and validation (40%) partitions, and develop classification models using the following data mining techniques in XLMiner.
3. Choose one model from each technique and report the confusion matrix and the cost/gain matrix for the validation data. For the logistic regression model use a cutoff “predicted probability of success” (“success” = 1) of 0.5. Which technique gives the most net profit on the validation data?
4. Let’s see if we can improve our performance by changing the cutoff. Rather than accepting the above classification of everyone’s credit status, let’s use the “predicted probability of finding a good applicant” in logistic regression as a basis for selecting the best credit risks first, followed by poorer-risk applicants.
a. Sort the test data on "predicted probability of success."

b. For each test case, calculate the actual cost/gain of extending credit.

c. Add another column for cumulative net profit.

d. How far into the test data do you go to get maximum net profit? (Often this is specified as a percentile or rounded to deciles.)

e. If this logistic regression model is scored to future applicants, what "probability of success" cutoff should be used in extending credit?
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas_profiling  # optional: used by df.profile_report() below
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore") # Ignore warnings

Q1. Review the predictor variables and guess from their definitions what their role might be in a credit decision. Are there any surprises in the data?

# Load Data
df = pd.read_excel(r'GermanCredit.xlsx')
df.head(10)  # Browse data samples
OBS#CHK_ACCTDURATIONHISTORYNEW_CARUSED_CARFURNITURERADIO/TVEDUCATIONRETRAINING...AGEOTHER_INSTALLRENTOWN_RESNUM_CREDITSJOBNUM_DEPENDENTSTELEPHONEFOREIGNRESPONSE
01064000100...67001221101
121482000100...22001121000
233124000010...49001112001
340422001000...45000122001
450243100000...53000222000
563362000010...35000112101
673242001000...53001121001
781362010000...35010131101
893122000100...61001111001
9101304100000...28001231000

10 rows × 32 columns

df.info()  # Check data integrity: every column has 1000 non-null values (no NaNs)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 32 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   OBS#              1000 non-null   int64
 1   CHK_ACCT          1000 non-null   int64
 2   DURATION          1000 non-null   int64
 3   HISTORY           1000 non-null   int64
 4   NEW_CAR           1000 non-null   int64
 5   USED_CAR          1000 non-null   int64
 6   FURNITURE         1000 non-null   int64
 7   RADIO/TV          1000 non-null   int64
 8   EDUCATION         1000 non-null   int64
 9   RETRAINING        1000 non-null   int64
 10  AMOUNT            1000 non-null   int64
 11  SAV_ACCT          1000 non-null   int64
 12  EMPLOYMENT        1000 non-null   int64
 13  INSTALL_RATE      1000 non-null   int64
 14  MALE_DIV          1000 non-null   int64
 15  MALE_SINGLE       1000 non-null   int64
 16  MALE_MAR_or_WID   1000 non-null   int64
 17  CO-APPLICANT      1000 non-null   int64
 18  GUARANTOR         1000 non-null   int64
 19  PRESENT_RESIDENT  1000 non-null   int64
 20  REAL_ESTATE       1000 non-null   int64
 21  PROP_UNKN_NONE    1000 non-null   int64
 22  AGE               1000 non-null   int64
 23  OTHER_INSTALL     1000 non-null   int64
 24  RENT              1000 non-null   int64
 25  OWN_RES           1000 non-null   int64
 26  NUM_CREDITS       1000 non-null   int64
 27  JOB               1000 non-null   int64
 28  NUM_DEPENDENTS    1000 non-null   int64
 29  TELEPHONE         1000 non-null   int64
 30  FOREIGN           1000 non-null   int64
 31  RESPONSE          1000 non-null   int64
dtypes: int64(32)
memory usage: 250.1 KB
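An explicit check (a one-line sketch using the frame loaded above) confirms that there are no missing values anywhere:

print(df.isna().sum().sum())  # total NaN count across all columns: 0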
# df.hist() draws its own grid of axes, so no separate plt.figure() is needed
df.hist()
plt.show()

[Figure: histograms of all variables in the data set]

# sns.pairplot(df)
# df.profile_report()

Preliminary analysis shows that the data are complete, with no missing values or outliers. The exploratory results also yield the following interesting information:

  1. Of all the applicants, nearly 40% have no checking account and about 30% have a balance below zero
  2. Most loans are for cars, furniture, and radio/TV
  3. The credit amounts are mainly distributed between 0 and 5000
  4. Applicants who are already employed are more likely to take out loans
  5. Borrowers mainly range in age from 20 to 50, with the middle-aged forming the largest group
  6. Credit amount and duration of credit are highly linearly correlated (a quick check follows below)
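Point 6 can be verified quickly with a one-line check on the loaded frame:

print(df['AMOUNT'].corr(df['DURATION']))  # Pearson correlation; noticeably positive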

Q2. Divide the data randomly into training (60%) and test (40%) partitions, develop a classification model using the logistic regression technique in Python, and evaluate the model using the confusion matrix and the ROC curve.

# Logistic regression technique
clf = LogisticRegression()
X, y = df.iloc[:, :-1], df['RESPONSE']
# Note: no random_state is set, so the split (and every number below) varies between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
X_train  # inspect the training partition
# Standardize the features (fit the scaler on the training data only)
ss = StandardScaler()
ss.fit(X_train)
x_train_stand = ss.transform(X_train)
x_test_stand = ss.transform(X_test)
# Train the model
clf.fit(x_train_stand, y_train)
print('------------------------------------------')
print("The modelling results:")
print('The coefficients of the logistic regression model:', clf.coef_)
print('The intercept of the logistic regression model:', clf.intercept_)
# Prediction
y_pre = clf.predict(x_test_stand)
print('------------------------------------------')
print('The prediction results:', '\n', y_pre)
------------------------------------------
The modelling results:
The coefficients of the logistic regression model: [[-0.03410161  0.62564931 -0.48414517  0.504973   -0.45165218  0.11433931
  -0.18546591  0.09348364 -0.23355452 -0.07481663 -0.17873389  0.52594587
   0.245582   -0.3468231   0.02593704  0.37926205  0.15389971 -0.06706053
   0.16737355 -0.03974959  0.14224639 -0.25124865  0.04792307 -0.26928597
  -0.51482951 -0.24693031 -0.19668434  0.02740041 -0.02860024  0.20056279
   0.27471923]]
The intercept of the logistic regression model: [1.38831511]
------------------------------------------
The prediction results: 
 [1 1 0 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1
 1 1 0 1 1 1 1 0 1 0 0 0 0 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1
 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1
 1 0 1 1 1 1 0 1 1 0 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0
 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0
 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 0 1 1
 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1
 1 1 0 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0]
# Calculate the confusion matrix and plot it
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pre)  # rows: true label, columns: predicted label
plt.matshow(cm, cmap=plt.cm.Greens)
plt.colorbar()
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        # note xy=(j, i): the column index is the x coordinate in matshow
        plt.annotate(cm[i, j], xy=(j, i), horizontalalignment='center', verticalalignment='center')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

[Figure: confusion matrix of the test predictions]
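For reference, scikit-learn (version 1.0 or later) ships an equivalent built-in plot; a minimal sketch:

from sklearn.metrics import ConfusionMatrixDisplay
# Same annotated confusion-matrix plot via the built-in helper
ConfusionMatrixDisplay.from_predictions(y_test, y_pre, cmap=plt.cm.Greens)
plt.show()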

# ROC curve (plot_roc_curve was removed in scikit-learn 1.2; use metrics.RocCurveDisplay.from_estimator there)
metrics.plot_roc_curve(clf, x_test_stand, y_test)
plt.show()

[Figure: ROC curve on the test set (AUC = 0.733)]

Results analysis

Among the 400 test samples, 62 samples with actual value 0 and 235 samples with actual value 1 were correctly classified, while 41 samples with actual value 1 and 62 samples with actual value 0 were misclassified. The classification accuracy of the classifier is therefore
$$\frac{62 + 235}{400} \times 100\% = 74.25\% \tag{Q2-1}$$

The AUC on the test set is $0.733$.
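As a cross-check, the same numbers can be obtained directly from scikit-learn (a sketch reusing the objects above):

from sklearn.metrics import accuracy_score, roc_auc_score
print('Accuracy:', accuracy_score(y_test, y_pre))  # (62 + 235) / 400 = 0.7425
print('AUC:', roc_auc_score(y_test, clf.predict_proba(x_test_stand)[:, 1]))  # ≈ 0.733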

Q3. Based on the confusion matrix and the payoff matrix, what is the net profit on the data?

Answer to Q3.

We already have

$$\text{Confusion\_Matrix} = \begin{pmatrix} 62 & 62 \\ 41 & 235 \end{pmatrix} \quad \text{(rows: actual 0, 1; columns: predicted 0, 1)}$$

and the payoff matrix

$$\text{Net\_Profit} = \begin{pmatrix} 0 & -500 \\ 0 & 100 \end{pmatrix} \quad \text{(extending credit to a bad applicant costs 500, extending credit to a good applicant earns 100, rejected applicants yield 0)}$$

Hence the net profit on the test data is $235 \times 100 + 62 \times (-500) = -7500$.
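The same figure can be computed from the cm array built in Q2 (a sketch, assuming scikit-learn's [[tn, fp], [fn, tp]] layout):

tn, fp, fn, tp = cm.ravel()
net_profit = tp * 100 + fp * (-500)  # credit is extended only to predicted-good applicants
print('Net profit on the test data:', net_profit)  # 235*100 + 62*(-500) = -7500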

Q4. Let’s see if we can improve our performance by changing the cutoff. Rather than accepting the above classification of everyone’s credit status, let’s use the “predicted probability of finding a good applicant” in logistic regression as a basis for selecting the best credit risks first, followed by poorer risk applicants.

a. Sort the test data on "predicted probability of finding a good applicant."
b. For each test case, calculate the actual cost/gain of extending credit.
c. Add another column for cumulative net profit.
d. How far into the test data do you go to get maximum net profit? (Often this is specified as a percentile or rounded to deciles.)
e. If this logistic regression model is scored to future applicants, what "probability of success" cutoff should be used in extending credit?
Q4.a
Sort the test data on “predicted probability of finding a good applicant.”
def sigmoid(x):
    '''Sigmoid function: maps a linear score x to a probability in (0, 1).'''
    return 1.0 / (1 + np.exp(-x))

# Recompute each test case's predicted probability of success by hand
score = x_test_stand @ clf.coef_.reshape([-1, 1]) + clf.intercept_
s = sigmoid(score)
plt.hist(s)
plt.show()
# Double-check that the score of each case in the test set is consistent with the previous results
a = np.zeros([len(s), 1])
for i in range(len(s)):
    if s[i] > 0.5:   # success probability above 0.5 means predicted success
        a[i] = 1
    else:            # success probability at or below 0.5 means predicted failure
        a[i] = 0
print('------------------------------------------------------')
print('Number of successes from verification results:', sum(a.reshape(-1)))
print('Number of successes from model calculation results:', sum(y_pre))

[Figure: histogram of the predicted probabilities of success]

------------------------------------------------------
Number of successes from verification results: 315.0
Number of successes from model calculation results: 315
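The same probabilities are available directly from the fitted model, which avoids the manual sigmoid (a sketch using the estimator above):

# predict_proba returns [P(class 0), P(class 1)] per row; column 1 is the probability of success
p_success = clf.predict_proba(x_test_stand)[:, 1]
print(np.allclose(p_success, s.reshape(-1)))  # True: matches the manual computation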
# Add a column for the predicted probability of success
X_test.loc[:, 'Score'] = s
# Sort the test data on "predicted probability of success"
X_test.sort_values('Score', inplace=True)
plt.plot(X_test.loc[:, 'Score'].values)
plt.show()
X_test

[Figure: test cases sorted by predicted probability of success]

[Output: X_test sorted ascending by the new Score column (400 rows × 32 columns); the Score values are printed below]

print('The sorted test data on "predicted probability of success" is as follows:')
X_test.loc[:, 'Score']
The sorted test data on "predicted probability of success" is as follows:
972    0.021543
334    0.038262
728    0.049493
59     0.056447
11     0.074500
         ...   
519    0.992815
135    0.992897
567    0.992978
156    0.993214
209    0.996783
Name: Score, Length: 400, dtype: float64
Q4.b
For each test case, calculate the actual cost/gain of extending credit.
# Expected gain per case: +100 with probability Score (good), -500 with probability 1 - Score (bad)
actual_gain = X_test['Score'] * 100 - 500 * (1 - X_test['Score'])
X_test.loc[:, 'Actual_gain'] = actual_gain
plt.plot(actual_gain.values)
plt.xlabel('number of test cases')
plt.ylabel('actual gain')
plt.show()
X_test

[Figure: actual gain per test case, in Score order]

[Output: X_test with the added Actual_gain column (400 rows × 33 columns); the values are printed below]

print('The actual cost/gain of extending credit for each case is as follows:')
X_test.loc[:, 'Actual_gain']
The actual cost/gain of extending credit for each case is as follows:
972   -487.074435
334   -477.042925
728   -470.304182
59    -466.132080
11    -455.300084
          ...    
519     95.688838
135     95.738059
567     95.786722
156     95.928315
209     98.069581
Name: Actual_gain, Length: 400, dtype: float64
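Note that the Actual_gain column above is the model-implied expected gain. If "actual cost/gain" is instead read as the realized outcome from the true label, a minimal sketch (assuming the payoffs of +100 for a good applicant and -500 for a bad one from Q3):

# Realized gain per case: +100 if the applicant is actually good (RESPONSE = 1), -500 if bad
realized_gain = np.where(y_test.loc[X_test.index] == 1, 100, -500)
print('Realized net profit if credit were extended to all test cases:', realized_gain.sum())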

Q4.c

Add another column for cumulative net profit.
cumulate_net_profit = np.cumsum(actual_gain)
X_test.loc[:, 'Cumulate_net_profit'] = cumulate_net_profit
plt.plot(cumulate_net_profit.values)
plt.xlabel('number of test cases')
plt.ylabel('cumulative net profit')
plt.show()
X_test

[Figure: cumulative net profit vs. number of test cases]

[Output: X_test with the added Cumulate_net_profit column (400 rows × 34 columns); the values are printed below]

print('The cumulative net profit column is as follows:')
X_test.loc[:, 'Cumulate_net_profit']
The cumulative net profit column is as follows:
972     -487.074435
334     -964.117360
728    -1434.421542
59     -1900.553622
11     -2355.853706
           ...     
519   -28084.173974
135   -27988.435915
567   -27892.649194
156   -27796.720878
209   -27698.651297
Name: Cumulate_net_profit, Length: 400, dtype: float64
Q4.d
How far into the test data do you go to get maximum net profit? (Often this is specified as a percentile or rounded to deciles.)
plt.bar(np.arange(31),clf.coef_.reshape(-1))
plt.xlabel('Variables')
plt.ylabel('Weight')
plt.show()

[Figure: bar chart of logistic regression coefficient weights by variable index]
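To tie each bar to a variable name, the coefficients can be indexed by the columns of X (a short sketch):

coef = pd.Series(clf.coef_.reshape(-1), index=X.columns).sort_values()
print(coef.head(3))  # strongest negative effects (e.g. RENT, DURATION)
print(coef.tail(3))  # strongest positive effects (e.g. CHK_ACCT, SAV_ACCT)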

To maximize net profit, the "predicted probability of success" assigned to each test case is critical. The coefficient plot shows that checking-account status has a strong positive effect on the probability of success, while the duration of the credit and certain credit purposes (e.g. a new car) have negative effects. A direct answer to the question follows from the gain columns built above, as sketched below.
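A short sketch (assuming the Score and Actual_gain columns constructed in Q4.a and Q4.b): sort the cases best-first and find the depth at which cumulative net profit peaks.

# Sort descending by predicted probability of success and accumulate the expected gains
desc = X_test.sort_values('Score', ascending=False)
cum_profit = desc['Actual_gain'].cumsum()
best_depth = int(cum_profit.values.argmax()) + 1
print(f'Maximum cumulative net profit of {cum_profit.max():.0f} is reached at the top '
      f'{best_depth} cases ({best_depth / len(desc):.0%} of the test data)')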

Q4.e
If this logistic regression model is scored to future applicants, what “probability of success” cutoff should be used in extending credit?
X_test.iloc[int((5 / 6) * 400), -3]  # Score of the case at the 5/6 quantile of the sorted test data
0.9578759345718599

To reduce the cost of extending credit, we must balance risk and benefit. We therefore propose setting a reasonable "probability of success" cutoff such that the number of successful applicants among those accepted is at least five times the number of unsuccessful ones; since a good applicant earns 100 while a bad one costs 500, this 5:1 ratio is exactly the break-even condition $100p - 500(1-p) \ge 0$, i.e. $p \ge 5/6 \approx 0.83$. Applying the 5/6 fraction to the sorted test data, the score at that quantile is about 0.958, so we propose setting the "probability of success" cutoff at 0.95.
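As a sanity check on the 5:1 ratio, the break-even probability can be computed directly (a two-line sketch under the payoffs above):

# Break-even: 100*p - 500*(1-p) = 0  =>  p = 500/600
print(f'Break-even probability of success: {500 / 600:.3f}')  # 0.833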
