Case: German Credit
In this assignment, we use a logistic regression model to classify the German Credit data set, and we evaluate the model with a confusion matrix and an ROC curve.
The German Credit data set contains observations on 30 variables for 1000 past applicants for credit. Each applicant was rated as “good credit” (700 cases) or “bad credit” (300 cases).
Assignment
1. Review the predictor variables and guess from their definition at what their role might be in a credit decision. Are there any surprises in the data?
2. Divide the data randomly into training (60%) and validation (40%) partitions, and develop classification models using the following data mining techniques in XLMiner
3. Choose one model from each technique and report the confusion matrix and the cost/gain matrix for the validation data. For the logistic regression model use a cutoff “predicted probability of success” (“success”=1) of 0.5. Which technique gives the most net profit on the validation data?
4. Let’s see if we can improve our performance by changing the cutoff. Rather than accepting the above classification of everyone’s credit status, let’s use the “predicted probability of finding a good applicant” in logistic regression as a basis for selecting the best credit risks first, followed by poorer risk applicants.
a. Sort the test data on "predicted probability of success."
b. For each test case, calculate the actual cost/gain of extending credit.
c. Add another column for cumulative net profit.
d. How far into the test data do you go to get maximum net profit? (Often this is specified as a percentile or rounded to deciles.)
e. If this logistic regression model is scored to future applicants, what "probability of success" cutoff should be used in extending credit?
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas_profiling
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore") # Ignore warnings
Q1. Review the predictor variables and guess from their definition at what their role might be in a credit decision. Are there any surprises in the data?
# Load Data
df = pd.read_excel(r'GermanCredit.xlsx')
df.head(10) # Browse data samples
| | OBS# | CHK_ACCT | DURATION | HISTORY | NEW_CAR | USED_CAR | FURNITURE | RADIO/TV | EDUCATION | RETRAINING | ... | AGE | OTHER_INSTALL | RENT | OWN_RES | NUM_CREDITS | JOB | NUM_DEPENDENTS | TELEPHONE | FOREIGN | RESPONSE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 6 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 67 | 0 | 0 | 1 | 2 | 2 | 1 | 1 | 0 | 1 |
1 | 2 | 1 | 48 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 22 | 0 | 0 | 1 | 1 | 2 | 1 | 0 | 0 | 0 |
2 | 3 | 3 | 12 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 49 | 0 | 0 | 1 | 1 | 1 | 2 | 0 | 0 | 1 |
3 | 4 | 0 | 42 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 45 | 0 | 0 | 0 | 1 | 2 | 2 | 0 | 0 | 1 |
4 | 5 | 0 | 24 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 53 | 0 | 0 | 0 | 2 | 2 | 2 | 0 | 0 | 0 |
5 | 6 | 3 | 36 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 35 | 0 | 0 | 0 | 1 | 1 | 2 | 1 | 0 | 1 |
6 | 7 | 3 | 24 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 53 | 0 | 0 | 1 | 1 | 2 | 1 | 0 | 0 | 1 |
7 | 8 | 1 | 36 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 35 | 0 | 1 | 0 | 1 | 3 | 1 | 1 | 0 | 1 |
8 | 9 | 3 | 12 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 61 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
9 | 10 | 1 | 30 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 28 | 0 | 0 | 1 | 2 | 3 | 1 | 0 | 0 | 0 |
10 rows × 32 columns
df.info() # Every column reports 1000 non-null values, so there are no NaN entries
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 OBS# 1000 non-null int64
1 CHK_ACCT 1000 non-null int64
2 DURATION 1000 non-null int64
3 HISTORY 1000 non-null int64
4 NEW_CAR 1000 non-null int64
5 USED_CAR 1000 non-null int64
6 FURNITURE 1000 non-null int64
7 RADIO/TV 1000 non-null int64
8 EDUCATION 1000 non-null int64
9 RETRAINING 1000 non-null int64
10 AMOUNT 1000 non-null int64
11 SAV_ACCT 1000 non-null int64
12 EMPLOYMENT 1000 non-null int64
13 INSTALL_RATE 1000 non-null int64
14 MALE_DIV 1000 non-null int64
15 MALE_SINGLE 1000 non-null int64
16 MALE_MAR_or_WID 1000 non-null int64
17 CO-APPLICANT 1000 non-null int64
18 GUARANTOR 1000 non-null int64
19 PRESENT_RESIDENT 1000 non-null int64
20 REAL_ESTATE 1000 non-null int64
21 PROP_UNKN_NONE 1000 non-null int64
22 AGE 1000 non-null int64
23 OTHER_INSTALL 1000 non-null int64
24 RENT 1000 non-null int64
25 OWN_RES 1000 non-null int64
26 NUM_CREDITS 1000 non-null int64
27 JOB 1000 non-null int64
28 NUM_DEPENDENTS 1000 non-null int64
29 TELEPHONE 1000 non-null int64
30 FOREIGN 1000 non-null int64
31 RESPONSE 1000 non-null int64
dtypes: int64(32)
memory usage: 250.1 KB
# Histogram of every column; df.hist creates its own figure
df.hist(figsize=(16, 12))
plt.tight_layout()
plt.show()
# sns.pairplot(df)
# df.profile_report()
A preliminary look at the data shows that it is complete, with no missing values and no obvious outliers. The histograms also suggest the following:
- Of all the applicants, nearly 40% have no checking account and 30% have a balance of less than zero
- Most loans are for cars, furniture, and radio/TV
- The credit amount is mainly distributed between 0 and 5000
- Most applicants are already employed
- Borrowers are mainly aged 20 to 50, with the middle-aged as the largest group
- Credit amount and duration of credit are strongly linearly correlated
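The observations above can be checked numerically rather than read off the histograms. A minimal sketch, assuming a DataFrame with the same column names as the data set; the tiny `demo` frame and the helper `category_shares` are hypothetical stand-ins for illustration, not the real data:

```python
import pandas as pd

def category_shares(frame: pd.DataFrame, column: str) -> pd.Series:
    """Share of rows falling in each category of `column`, ordered by category code."""
    return frame[column].value_counts(normalize=True).sort_index()

# Hypothetical stand-in rows; with the real data, use the df loaded from GermanCredit.xlsx
demo = pd.DataFrame({
    "CHK_ACCT": [0, 1, 3, 3, 0, 2, 3, 1],
    "DURATION": [6, 48, 12, 42, 24, 36, 24, 36],
    "AMOUNT": [1200, 6000, 2100, 7900, 4900, 9000, 2800, 6900],
})

shares = category_shares(demo, "CHK_ACCT")
corr = demo["AMOUNT"].corr(demo["DURATION"])  # Pearson correlation by default
print(shares)
print("Pearson correlation of AMOUNT and DURATION:", round(corr, 3))
```

Running the same two calls on `df` would put a number on the checking-account shares and on the strength of the amount/duration relationship.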
Q2. Divide the data randomly into training(60%) and test(40%) partitions, and develop a classification model using the logistic regression technique in Python and evaluate the model by using the confusion matrix and the ROC curve.
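# Logistic regression model
clf = LogisticRegression()
# Features are every column except RESPONSE (this includes the row identifier OBS#)
X,y = df.iloc[:,:-1],df['RESPONSE']
# Random 60/40 split; no random_state is set, so the partition differs between runs
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.4)
X_train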
# Standardize the training data
ss=StandardScaler()
ss.fit(X_train)
x_train_stand=ss.transform(X_train)
x_test_stand=ss.transform(X_test)
# Train the model
clf.fit(x_train_stand,y_train)
print('------------------------------------------')
print("The modelling results:")
print('The slope of the logistic regression technique:',clf.coef_) # Print out the slope
print('The intercept of the logistic regression technique:',clf.intercept_) # Print out the intercept
# Prediction
y_pre = clf.predict(x_test_stand)
print('------------------------------------------')
print('The prediction results:','\n',y_pre)
------------------------------------------
The modelling results:
The slope of the logistic regression technique: [[-0.03410161 0.62564931 -0.48414517 0.504973 -0.45165218 0.11433931
-0.18546591 0.09348364 -0.23355452 -0.07481663 -0.17873389 0.52594587
0.245582 -0.3468231 0.02593704 0.37926205 0.15389971 -0.06706053
0.16737355 -0.03974959 0.14224639 -0.25124865 0.04792307 -0.26928597
-0.51482951 -0.24693031 -0.19668434 0.02740041 -0.02860024 0.20056279
0.27471923]]
The intercept of the logistic regression technique: [1.38831511]
------------------------------------------
The prediction results:
[1 1 0 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1
1 1 0 1 1 1 1 0 1 0 0 0 0 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1
1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1
1 0 1 1 1 1 0 1 1 0 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0
1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0
0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 0 1 1
1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1
1 1 0 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0]
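Because `train_test_split` is called without a `random_state`, each run produces a different partition and therefore different coefficients and predictions. A minimal sketch of a reproducible variant, assuming synthetic data from `make_classification` in place of the Excel file; the seed value 42 and the use of `stratify` and `make_pipeline` are choices made for this illustration, not part of the original analysis:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 1000 cases, 30 predictors, roughly 70% in class 1
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=30, weights=[0.3, 0.7], random_state=0
)

# A fixed seed gives the same 60/40 partition on every run;
# stratify keeps the class ratio the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training data only, then fits the classifier
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
print("Test accuracy:", model.score(X_te, y_te))
```

Wrapping the scaler and the classifier in one pipeline also removes the need to carry separate standardized copies of the training and test features.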
# Calculate the confusion matrix and plot it
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pre)  # rows: true label, columns: predicted label
plt.matshow(cm, cmap=plt.cm.Greens)
plt.colorbar()
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        # matshow draws row i along the y-axis and column j along the x-axis
        plt.annotate(str(cm[i, j]), xy=(j, i), horizontalalignment='center', verticalalignment='center')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
# ROC curve (plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it)
metrics.RocCurveDisplay.from_estimator(clf, x_test_stand, y_test)
plt.show()
Results analysis
Among the 400 test samples, 62 samples with an actual value of 0 and 235 samples with an actual value of 1 were correctly classified. However, 41 samples with an actual value of 1 and 62 samples with an actual value of 0 were misclassified. The classification accuracy of the classifier is
$$\frac{62+235}{400}\times 100\% = 74.25\% \tag{Q2-1}$$
The AUC on the test set is 0.733.
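The arithmetic above can be reproduced from the four counts, together with two rates that accuracy alone does not show. A minimal sketch using only the counts reported in this section; the variable names are illustrative:

```python
import numpy as np

# Counts reported above; rows are actual class (0, 1), columns are predicted class (0, 1)
cm = np.array([[62, 62],
               [41, 235]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
sensitivity = tp / (tp + fn)   # share of actual good applicants predicted good
specificity = tn / (tn + fp)   # share of actual bad applicants predicted bad

print(f"accuracy    = {accuracy:.4f}")
print(f"sensitivity = {sensitivity:.4f}")
print(f"specificity = {specificity:.4f}")
```

With these counts, only half of the actual bad applicants are identified as bad, which matters once misclassified bad applicants carry the larger cost.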
Q3. Based on the confusion matrix and the payoff matrix, what is the net profit on the data?
Answer to Q3.
We already have the confusion matrix (rows: actual class 0/1, columns: predicted class 0/1)
$$\text{Confusion\_Matrix} = \begin{pmatrix} 62 & 62 \\ 41 & 235 \end{pmatrix}$$
and the payoff matrix $\text{Net\_Profit}$, with a gain of 100 and a loss of 500 per case. Hence, the net profit on the test data is
$$62 \times 100 + 62 \times (-500) = -24800$$
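In general, the net profit is the element-wise product of the confusion matrix and the payoff matrix, summed over all four cells. A minimal sketch; the layout of `payoff` below is an assumption (credit is extended only to applicants predicted good, an accepted good applicant earns 100, an accepted bad applicant costs 500, and rejected applicants contribute 0), so the total depends on that choice:

```python
import numpy as np

# Confusion matrix from the results above:
# rows = actual (0 bad, 1 good), columns = predicted (0, 1)
cm = np.array([[62, 62],
               [41, 235]])

# ASSUMED payoff per case, same layout as cm
payoff = np.array([[0, -500],   # actual bad:  rejected -> 0, accepted -> -500
                   [0,  100]])  # actual good: rejected -> 0, accepted -> +100

net_profit = int((cm * payoff).sum())
print("Net profit under the assumed payoff matrix:", net_profit)
```

Under this assumed layout the 100 gain multiplies the 235 correctly accepted applicants; attaching it to a different cell, as in the calculation above, gives a different total, so the payoff matrix is worth checking against the case description.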
Q4. Let’s see if we can improve our performance by changing the cutoff. Rather than accepting the above classification of everyone’s credit status, let’s use the “predicted probability of finding a good applicant” in logistic regression as a basis for selecting the best credit risks first, followed by poorer risk applicants.
a. Sort the test data on "predicted probability of finding a good applicant."
b. For each test case, calculate the actual cost/gain of extending credit.
c. Add another column for cumulative net profit.
d. How far into the test data do you go to get maximum net profit? (Often this is specified as a percentile or rounded to deciles.)
e. If this logistic regression model is scored to future applicants, what "probability of success" cutoff should be used in extending credit?
Q4.a
Sort the test data on “predicted probability of finding a good applicant.”
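def sigmoid(x):
    '''
    Logistic (sigmoid) function
    :param x: linear score(s), scalar or NumPy array
    :return: values in the interval (0, 1)
    '''
    return 1.0 / (1 + np.exp(-x))
# Linear score w·x + b for every test case, then map it to a probability
score = x_test_stand @ (clf.coef_.reshape([-1,1])) + clf.intercept_
s = sigmoid(score)
plt.hist(s)
plt.show()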
# Double-check that thresholding the scores at 0.5 reproduces the earlier predictions
a = np.zeros([len(s),1])
for i in range(len(s)):
    if s[i] > 0.5:  # Success probability greater than 0.5 means success
        a[i] = 1
    else:  # Success probability of 0.5 or less means failure
        a[i] = 0
print('------------------------------------------------------')
print('Number of successes from verification results:',sum(a.reshape(-1)))
print('Number of successes from model calculation results:',sum(y_pre))
------------------------------------------------------
Number of successes from verification results: 315.0
Number of successes from model calculation results: 315
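The manual sigmoid score should agree with scikit-learn's own probability output, which gives a second way to obtain the same column. A minimal sketch, assuming synthetic data from `make_classification` as a stand-in for the standardized test features; `demo_clf` and the other names are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the standardized features
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_clf = LogisticRegression().fit(X_demo, y_demo)

# Manual score, as in the text: sigmoid(w·x + b)
linear = X_demo @ demo_clf.coef_.ravel() + demo_clf.intercept_[0]
manual = 1.0 / (1.0 + np.exp(-linear))

# Built-in: column 1 of predict_proba is P(class == 1)
builtin = demo_clf.predict_proba(X_demo)[:, 1]

print("Max absolute difference:", np.abs(manual - builtin).max())
```

For a binary `LogisticRegression`, `predict_proba(...)[:, 1]` returns a flat array, which is convenient when assigning the result to a DataFrame column.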
# Add another column for the predicted probability of success
X_test.loc[:,'Score'] = s.ravel()
# Sort the test data on "predicted probability of success" (ascending: lowest scores first)
X_test.sort_values("Score",inplace=True)
plt.plot(X_test.loc[:,'Score'].values)
plt.show()
X_test
| | OBS# | CHK_ACCT | DURATION | HISTORY | NEW_CAR | USED_CAR | FURNITURE | RADIO/TV | EDUCATION | RETRAINING | ... | AGE | OTHER_INSTALL | RENT | OWN_RES | NUM_CREDITS | JOB | NUM_DEPENDENTS | TELEPHONE | FOREIGN | Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
972 | 973 | 0 | 24 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 29 | 0 | 1 | 0 | 2 | 0 | 1 | 0 | 0 | 0.021543 |
334 | 335 | 0 | 24 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 23 | 1 | 1 | 0 | 2 | 2 | 2 | 0 | 0 | 0.038262 |
728 | 729 | 1 | 48 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 59 | 0 | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 0.049493 |
59 | 60 | 0 | 36 | 4 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 23 | 0 | 1 | 0 | 2 | 1 | 1 | 1 | 0 | 0.056447 |
11 | 12 | 0 | 48 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 24 | 0 | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 0.074500 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
519 | 520 | 3 | 6 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 36 | 0 | 0 | 0 | 2 | 2 | 1 | 0 | 0 | 0.992815 |
135 | 136 | 3 | 12 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 38 | 0 | 0 | 1 | 2 | 2 | 1 | 1 | 0 | 0.992897 |
567 | 568 | 3 | 24 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 34 | 0 | 0 | 1 | 1 | 2 | 1 | 0 | 0 | 0.992978 |
156 | 157 | 0 | 9 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 48 | 0 | 0 | 1 | 2 | 2 | 2 | 0 | 1 | 0.993214 |
209 | 210 | 3 | 12 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 55 | 0 | 0 | 1 | 1 | 2 | 1 | 0 | 1 | 0.996783 |
400 rows × 32 columns
print('The sorted test data on "predicted probability of success" as follows:')
X_test.loc[:,'Score']
The sorted test data on "predicted probability of success" as follows:
972 0.021543
334 0.038262
728 0.049493
59 0.056447
11 0.074500
...
519 0.992815
135 0.992897
567 0.992978
156 0.993214
209 0.996783
Name: Score, Length: 400, dtype: float64
Q4.b
For each test case, calculate the actual cost/gain of extending credit.
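# Gain per case computed from the predicted score: +100 weighted by Score, -500 weighted by (1 - Score)
actual_gain = X_test['Score']*100-500*(1-X_test['Score'])
X_test.loc[:,'Actual_gain'] = actual_gain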
plt.plot(actual_gain.values)
plt.xlabel('number of test cases')
plt.ylabel('actual gain')
plt.show()
X_test
| | OBS# | CHK_ACCT | DURATION | HISTORY | NEW_CAR | USED_CAR | FURNITURE | RADIO/TV | EDUCATION | RETRAINING | ... | OTHER_INSTALL | RENT | OWN_RES | NUM_CREDITS | JOB | NUM_DEPENDENTS | TELEPHONE | FOREIGN | Score | Actual_gain |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
972 | 973 | 0 | 24 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 2 | 0 | 1 | 0 | 0 | 0.021543 | -487.074435 |
334 | 335 | 0 | 24 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 2 | 2 | 2 | 0 | 0 | 0.038262 | -477.042925 |
728 | 729 | 1 | 48 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 0.049493 | -470.304182 |
59 | 60 | 0 | 36 | 4 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 2 | 1 | 1 | 1 | 0 | 0.056447 | -466.132080 |
11 | 12 | 0 | 48 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 0.074500 | -455.300084 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
519 | 520 | 3 | 6 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 2 | 2 | 1 | 0 | 0 | 0.992815 | 95.688838 |
135 | 136 | 3 | 12 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 2 | 2 | 1 | 1 | 0 | 0.992897 | 95.738059 |
567 | 568 | 3 | 24 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 2 | 1 | 0 | 0 | 0.992978 | 95.786722 |
156 | 157 | 0 | 9 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 2 | 2 | 2 | 0 | 1 | 0.993214 | 95.928315 |
209 | 210 | 3 | 12 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 2 | 1 | 0 | 1 | 0.996783 | 98.069581 |
400 rows × 33 columns
print('The actual cost/gain of extending credit for each case as follows:')
X_test.loc[:,'Actual_gain']
The actual cost/gain of extending credit for each case as follows:
972 -487.074435
334 -477.042925
728 -470.304182
59 -466.132080
11 -455.300084
...
519 95.688838
135 95.738059
567 95.786722
156 95.928315
209 98.069581
Name: Actual_gain, Length: 400, dtype: float64
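The column above is an expected gain derived from the predicted score. The realized gain of extending credit depends on the observed outcome instead. A minimal sketch, assuming a gain of 100 for an applicant who turned out good and a loss of 500 for one who turned out bad; the values in `demo` are made up for illustration, and with the real data the outcome column would be `y_test` aligned to the same rows:

```python
import numpy as np
import pandas as pd

# Hypothetical predicted probabilities and observed outcomes (1 = good, 0 = bad)
demo = pd.DataFrame({
    "Score":    [0.95, 0.20, 0.70, 0.55, 0.85],
    "RESPONSE": [1,    0,    1,    0,    1],
})

# Realized gain if credit is extended: +100 for a good applicant, -500 for a bad one
demo["Realized_gain"] = np.where(demo["RESPONSE"] == 1, 100, -500)

# Expected gain from the score alone, as computed in the text
demo["Expected_gain"] = demo["Score"] * 100 - 500 * (1 - demo["Score"])
print(demo)
```

The two columns answer different questions: the expected gain ranks applicants before the outcome is known, while the realized gain measures what a lending decision actually returned on the test data.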
Q4.c
Add another column for cumulative net profit.
cumulate_net_profit = np.cumsum(actual_gain)
X_test.loc[:,'Cumulate_net_profit'] = cumulate_net_profit
plt.plot(cumulate_net_profit.values)
plt.xlabel('number of test cases')
plt.ylabel('cumulate net profit')
plt.show()
X_test
| | OBS# | CHK_ACCT | DURATION | HISTORY | NEW_CAR | USED_CAR | FURNITURE | RADIO/TV | EDUCATION | RETRAINING | ... | RENT | OWN_RES | NUM_CREDITS | JOB | NUM_DEPENDENTS | TELEPHONE | FOREIGN | Score | Actual_gain | Cumulate_net_profit |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
972 | 973 | 0 | 24 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 2 | 0 | 1 | 0 | 0 | 0.021543 | -487.074435 | -487.074435 |
334 | 335 | 0 | 24 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 2 | 2 | 2 | 0 | 0 | 0.038262 | -477.042925 | -964.117360 |
728 | 729 | 1 | 48 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 0.049493 | -470.304182 | -1434.421542 |
59 | 60 | 0 | 36 | 4 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 2 | 1 | 1 | 1 | 0 | 0.056447 | -466.132080 | -1900.553622 |
11 | 12 | 0 | 48 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 0.074500 | -455.300084 | -2355.853706 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
519 | 520 | 3 | 6 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 2 | 2 | 1 | 0 | 0 | 0.992815 | 95.688838 | -28084.173974 |
135 | 136 | 3 | 12 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 2 | 2 | 1 | 1 | 0 | 0.992897 | 95.738059 | -27988.435915 |
567 | 568 | 3 | 24 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 1 | 2 | 1 | 0 | 0 | 0.992978 | 95.786722 | -27892.649194 |
156 | 157 | 0 | 9 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 2 | 2 | 2 | 0 | 1 | 0.993214 | 95.928315 | -27796.720878 |
209 | 210 | 3 | 12 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 2 | 1 | 0 | 1 | 0.996783 | 98.069581 | -27698.651297 |
400 rows × 34 columns
print('The column for cumulative net profit is as follows:')
X_test.loc[:,'Cumulate_net_profit']
The column for cumulative net profit is as follows:
972 -487.074435
334 -964.117360
728 -1434.421542
59 -1900.553622
11 -2355.853706
...
519 -28084.173974
135 -27988.435915
567 -27892.649194
156 -27796.720878
209 -27698.651297
Name: Cumulate_net_profit, Length: 400, dtype: float64
Q4.d
How far into the test data do you go to get maximum net profit? (Often this is specified as a percentile or rounded to deciles.)
plt.bar(np.arange(31),clf.coef_.reshape(-1))
plt.xlabel('Variables')
plt.ylabel('Weight')
plt.show()
To maximize net profit, the predicted probability of success of each test case is the key quantity.
The fitted coefficients show that checking account status (CHK_ACCT) has a large positive weight on the probability of success, while the duration of credit and some credit purposes (for example NEW_CAR) have negative weights.
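The depth question can also be answered directly from a cumulative profit column: rank the cases from the highest predicted probability down, accumulate the gain, and locate the maximum. A minimal sketch, assuming payoffs of 100 and -500 applied to observed outcomes; the values in `demo` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical scores and observed outcomes (1 = good, 0 = bad)
demo = pd.DataFrame({
    "Score":    [0.30, 0.95, 0.60, 0.85, 0.40, 0.90, 0.75, 0.20],
    "RESPONSE": [0,    1,    0,    1,    1,    1,    1,    0],
})

# Best credit risks first
ranked = demo.sort_values("Score", ascending=False).reset_index(drop=True)
ranked["Gain"] = np.where(ranked["RESPONSE"] == 1, 100, -500)
ranked["Cumulative_profit"] = ranked["Gain"].cumsum()

best_pos = int(ranked["Cumulative_profit"].idxmax())  # 0-based position of the peak
depth = (best_pos + 1) / len(ranked)                  # fraction of cases accepted
cutoff = ranked.loc[best_pos, "Score"]                # lowest score still accepted
max_profit = int(ranked["Cumulative_profit"].max())

print(ranked)
print(f"Maximum profit {max_profit} after {depth:.0%} of cases; cutoff {cutoff}")
```

The score at the peak position doubles as a candidate cutoff for future applicants, since every case ranked above it was accepted on the way to the maximum.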
Q4.e
If this logistic regression model is scored to future applicants, what “probability of success” cutoff should be used in extending credit?
# Score at the 5/6 position of the ascending-sorted test data (column -3 is Score)
X_test.iloc[int((5/6)*400),-3]
0.9578759345718599
To reduce the cost of extending credit and to balance risk against benefit, we propose a “probability of success” cutoff such that successful applicants outnumber unsuccessful ones by at least five to one. This means that 5/6 of the test cases, sorted by score, fall below the cutoff. The score at that position is 0.958, so I think the “probability of success” cutoff should be set at about 0.95.
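For comparison, the expected-gain formula from Q4.b implies a break-even probability. Writing $p$ for the predicted probability of success, and assuming the gain of 100 and the loss of 500 used there, the expected gain of extending credit is non-negative when:

```latex
\begin{aligned}
100\,p - 500\,(1 - p) &\ge 0 \\
600\,p &\ge 500 \\
p &\ge \tfrac{5}{6} \approx 0.833
\end{aligned}
```

Under that assumption, a case scored above roughly 0.83 has a positive expected gain, so 0.95 is a more conservative choice than break-even.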