Classifying the German Credit Data Set with a Logistic Regression Model

Case: German Credit

In this assignment, we use a logistic regression model to classify the German Credit data set, and we evaluate the model with a confusion matrix and an ROC curve.

If you have questions about this article or would like the data and code, message the author directly or add the author's WeChat: 1178623893.

The German Credit data set contains observations on 30 variables for 1000 past applicants for credit. Each applicant was rated as “good credit” (700 cases) or “bad credit” (300 cases).


Assignment

1. Review the predictor variables and guess from their definitions what their role might be in a credit decision. Are there any surprises in the data?
2. Divide the data randomly into training (60%) and validation (40%) partitions, and develop classification models using the following data mining techniques in XLMiner.
3. Choose one model from each technique and report the confusion matrix and the cost/gain matrix for the validation data. For the logistic regression model use a cutoff “predicted probability of success” (“success” = 1) of 0.5. Which technique gives the most net profit on the validation data?
4. Let’s see if we can improve our performance by changing the cutoff. Rather than accepting the above classification of everyone’s credit status, let’s use the “predicted probability of finding a good applicant” in logistic regression as a basis for selecting the best credit risks first, followed by poorer-risk applicants.
a. Sort the test data on "predicted probability of success."

b. For each test case, calculate the actual cost/gain of extending credit.

c. Add another column for cumulative net profit.

d. How far into the test data do you go to get maximum net profit? (Often this is specified as a percentile or rounded to deciles.)

e. If this logistic regression model is scored to future applicants, what "probability of success" cutoff should be used in extending credit?
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas_profiling  # optional: used by df.profile_report() below
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore") # Ignore warnings

Q1. Review the predictor variables and guess from their definitions what their role might be in a credit decision. Are there any surprises in the data?

# Load Data
df = pd.read_excel(r'GermanCredit.xlsx')
df.head(10)  # Browse data samples
OBS#CHK_ACCTDURATIONHISTORYNEW_CARUSED_CARFURNITURERADIO/TVEDUCATIONRETRAINING...AGEOTHER_INSTALLRENTOWN_RESNUM_CREDITSJOBNUM_DEPENDENTSTELEPHONEFOREIGNRESPONSE
01064000100...67001221101
121482000100...22001121000
233124000010...49001112001
340422001000...45000122001
450243100000...53000222000
563362000010...35000112101
673242001000...53001121001
781362010000...35010131101
893122000100...61001111001
9101304100000...28001231000

10 rows × 32 columns

df.info()  # Check data integrity: every column has 1000 non-null values (no NaNs)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 32 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   OBS#              1000 non-null   int64
 1   CHK_ACCT          1000 non-null   int64
 2   DURATION          1000 non-null   int64
 3   HISTORY           1000 non-null   int64
 4   NEW_CAR           1000 non-null   int64
 5   USED_CAR          1000 non-null   int64
 6   FURNITURE         1000 non-null   int64
 7   RADIO/TV          1000 non-null   int64
 8   EDUCATION         1000 non-null   int64
 9   RETRAINING        1000 non-null   int64
 10  AMOUNT            1000 non-null   int64
 11  SAV_ACCT          1000 non-null   int64
 12  EMPLOYMENT        1000 non-null   int64
 13  INSTALL_RATE      1000 non-null   int64
 14  MALE_DIV          1000 non-null   int64
 15  MALE_SINGLE       1000 non-null   int64
 16  MALE_MAR_or_WID   1000 non-null   int64
 17  CO-APPLICANT      1000 non-null   int64
 18  GUARANTOR         1000 non-null   int64
 19  PRESENT_RESIDENT  1000 non-null   int64
 20  REAL_ESTATE       1000 non-null   int64
 21  PROP_UNKN_NONE    1000 non-null   int64
 22  AGE               1000 non-null   int64
 23  OTHER_INSTALL     1000 non-null   int64
 24  RENT              1000 non-null   int64
 25  OWN_RES           1000 non-null   int64
 26  NUM_CREDITS       1000 non-null   int64
 27  JOB               1000 non-null   int64
 28  NUM_DEPENDENTS    1000 non-null   int64
 29  TELEPHONE         1000 non-null   int64
 30  FOREIGN           1000 non-null   int64
 31  RESPONSE          1000 non-null   int64
dtypes: int64(32)
memory usage: 250.1 KB
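An explicit check (a one-line sketch using the frame loaded above) confirms that there are no missing values anywhere:

print(df.isna().sum().sum())  # total NaN count across all columns: 0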
# df.hist() draws its own grid of axes, so no separate plt.figure() is needed
df.hist()
plt.show()

[Figure: histograms of all variables in the data set]

# sns.pairplot(df)
# df.profile_report()

Preliminary analysis shows that the data are complete, with no missing values or outliers. The exploratory results also yield the following interesting information:

  1. Of all the applicants, nearly 40% have no checking account and about 30% have a balance below zero
  2. Most loans are for cars, furniture, and radio/TV
  3. The credit amounts are mainly distributed between 0 and 5000
  4. Applicants who are already employed are more likely to take out loans
  5. Borrowers mainly range in age from 20 to 50, with the middle-aged forming the largest group
  6. Credit amount and duration of credit are highly linearly correlated (a quick check follows below)
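Point 6 can be verified quickly with a one-line check on the loaded frame:

print(df['AMOUNT'].corr(df['DURATION']))  # Pearson correlation; noticeably positive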

Q2. Divide the data randomly into training (60%) and test (40%) partitions, develop a classification model using the logistic regression technique in Python, and evaluate the model using the confusion matrix and the ROC curve.

# Logistic regression technique
clf = LogisticRegression()
X, y = df.iloc[:, :-1], df['RESPONSE']
# Note: no random_state is set, so the split (and every number below) varies between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
X_train  # inspect the training partition
# Standardize the features (fit the scaler on the training data only)
ss = StandardScaler()
ss.fit(X_train)
x_train_stand = ss.transform(X_train)
x_test_stand = ss.transform(X_test)
# Train the model
clf.fit(x_train_stand, y_train)
print('------------------------------------------')
print("The modelling results:")
print('The coefficients of the logistic regression model:', clf.coef_)
print('The intercept of the logistic regression model:', clf.intercept_)
# Prediction
y_pre = clf.predict(x_test_stand)
print('------------------------------------------')
print('The prediction results:', '\n', y_pre)
------------------------------------------
The modelling results:
The coefficients of the logistic regression model: [[-0.03410161  0.62564931 -0.48414517  0.504973   -0.45165218  0.11433931
  -0.18546591  0.09348364 -0.23355452 -0.07481663 -0.17873389  0.52594587
   0.245582   -0.3468231   0.02593704  0.37926205  0.15389971 -0.06706053
   0.16737355 -0.03974959  0.14224639 -0.25124865  0.04792307 -0.26928597
  -0.51482951 -0.24693031 -0.19668434  0.02740041 -0.02860024  0.20056279
   0.27471923]]
The intercept of the logistic regression model: [1.38831511]
------------------------------------------
The prediction results: 
 [1 1 0 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1
 1 1 0 1 1 1 1 0 1 0 0 0 0 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1
 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1
 1 0 1 1 1 1 0 1 1 0 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0
 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0
 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 0 1 1
 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1
 1 1 0 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0]
# Calculate the confusion matrix and plot it
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pre)  # rows: true label, columns: predicted label
plt.matshow(cm, cmap=plt.cm.Greens)
plt.colorbar()
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        # note xy=(j, i): the column index is the x coordinate in matshow
        plt.annotate(cm[i, j], xy=(j, i), horizontalalignment='center', verticalalignment='center')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

[Figure: confusion matrix of the test predictions]
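For reference, scikit-learn (version 1.0 or later) ships an equivalent built-in plot; a minimal sketch:

from sklearn.metrics import ConfusionMatrixDisplay
# Same annotated confusion-matrix plot via the built-in helper
ConfusionMatrixDisplay.from_predictions(y_test, y_pre, cmap=plt.cm.Greens)
plt.show()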

# ROC curve (plot_roc_curve was removed in scikit-learn 1.2; use metrics.RocCurveDisplay.from_estimator there)
metrics.plot_roc_curve(clf, x_test_stand, y_test)
plt.show()

[Figure: ROC curve on the test set (AUC = 0.733)]

Results analysis

Among the 400 test samples, 62 samples with actual value 0 and 235 samples with actual value 1 were correctly classified, while 41 samples with actual value 1 and 62 samples with actual value 0 were misclassified. The classification accuracy of the classifier is therefore
$$\frac{62 + 235}{400} \times 100\% = 74.25\% \tag{Q2-1}$$

The AUC on the test set is $0.733$.
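As a cross-check, the same numbers can be obtained directly from scikit-learn (a sketch reusing the objects above):

from sklearn.metrics import accuracy_score, roc_auc_score
print('Accuracy:', accuracy_score(y_test, y_pre))  # (62 + 235) / 400 = 0.7425
print('AUC:', roc_auc_score(y_test, clf.predict_proba(x_test_stand)[:, 1]))  # ≈ 0.733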

Q3. Based on the confusion matrix and the payoff matrix, what is the net profit on the data?

Answer to Q3.

We already have

$$\text{Confusion\_Matrix} = \begin{pmatrix} 62 & 62 \\ 41 & 235 \end{pmatrix} \quad \text{(rows: actual 0, 1; columns: predicted 0, 1)}$$

and the payoff matrix

$$\text{Net\_Profit} = \begin{pmatrix} 0 & -500 \\ 0 & 100 \end{pmatrix} \quad \text{(extending credit to a bad applicant costs 500, extending credit to a good applicant earns 100, rejected applicants yield 0)}$$

Hence the net profit on the test data is $235 \times 100 + 62 \times (-500) = -7500$.
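The same figure can be computed from the cm array built in Q2 (a sketch, assuming scikit-learn's [[tn, fp], [fn, tp]] layout):

tn, fp, fn, tp = cm.ravel()
net_profit = tp * 100 + fp * (-500)  # credit is extended only to predicted-good applicants
print('Net profit on the test data:', net_profit)  # 235*100 + 62*(-500) = -7500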

Q4. Let’s see if we can improve our performance by changing the cutoff. Rather than accepting the above classification of everyone’s credit status, let’s use the “predicted probability of finding a good applicant” in logistic regression as a basis for selecting the best credit risks first, followed by poorer risk applicants.

a. Sort the test data on "predicted probability of finding a good applicant."
b. For each test case, calculate the actual cost/gain of extending credit.
c. Add another column for cumulative net profit.
d. How far into the test data do you go to get maximum net profit? (Often this is specified as a percentile or rounded to deciles.)
e. If this logistic regression model is scored to future applicants, what "probability of success" cutoff should be used in extending credit?
Q4.a
Sort the test data on “predicted probability of finding a good applicant.”
def sigmoid(x):
    '''Sigmoid function: maps a linear score x to a probability in (0, 1).'''
    return 1.0 / (1 + np.exp(-x))

# Recompute each test case's predicted probability of success by hand
score = x_test_stand @ clf.coef_.reshape([-1, 1]) + clf.intercept_
s = sigmoid(score)
plt.hist(s)
plt.show()
# Double-check that the score of each case in the test set is consistent with the previous results
a = np.zeros([len(s), 1])
for i in range(len(s)):
    if s[i] > 0.5:   # success probability above 0.5 means predicted success
        a[i] = 1
    else:            # success probability at or below 0.5 means predicted failure
        a[i] = 0
print('------------------------------------------------------')
print('Number of successes from verification results:', sum(a.reshape(-1)))
print('Number of successes from model calculation results:', sum(y_pre))

[Figure: histogram of the predicted probabilities of success]

------------------------------------------------------
Number of successes from verification results: 315.0
Number of successes from model calculation results: 315
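The same probabilities are available directly from the fitted model, which avoids the manual sigmoid (a sketch using the estimator above):

# predict_proba returns [P(class 0), P(class 1)] per row; column 1 is the probability of success
p_success = clf.predict_proba(x_test_stand)[:, 1]
print(np.allclose(p_success, s.reshape(-1)))  # True: matches the manual computation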
# Add a column for the predicted probability of success
X_test.loc[:, 'Score'] = s
# Sort the test data on "predicted probability of success"
X_test.sort_values('Score', inplace=True)
plt.plot(X_test.loc[:, 'Score'].values)
plt.show()
X_test

[Figure: test cases sorted by predicted probability of success]

[Output: X_test sorted ascending by the new Score column (400 rows × 32 columns); the Score values are printed below]

print('The sorted test data on "predicted probability of success" is as follows:')
X_test.loc[:, 'Score']
The sorted test data on "predicted probability of success" is as follows:
972    0.021543
334    0.038262
728    0.049493
59     0.056447
11     0.074500
         ...   
519    0.992815
135    0.992897
567    0.992978
156    0.993214
209    0.996783
Name: Score, Length: 400, dtype: float64
Q4.b
For each test case, calculate the actual cost/gain of extending credit.
# Expected gain per case: +100 with probability Score (good), -500 with probability 1 - Score (bad)
actual_gain = X_test['Score'] * 100 - 500 * (1 - X_test['Score'])
X_test.loc[:, 'Actual_gain'] = actual_gain
plt.plot(actual_gain.values)
plt.xlabel('number of test cases')
plt.ylabel('actual gain')
plt.show()
X_test

[Figure: actual gain per test case, in Score order]

[Output: X_test with the added Actual_gain column (400 rows × 33 columns); the values are printed below]

print('The actual cost/gain of extending credit for each case is as follows:')
X_test.loc[:, 'Actual_gain']
The actual cost/gain of extending credit for each case is as follows:
972   -487.074435
334   -477.042925
728   -470.304182
59    -466.132080
11    -455.300084
          ...    
519     95.688838
135     95.738059
567     95.786722
156     95.928315
209     98.069581
Name: Actual_gain, Length: 400, dtype: float64
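Note that the Actual_gain column above is the model-implied expected gain. If "actual cost/gain" is instead read as the realized outcome from the true label, a minimal sketch (assuming the payoffs of +100 for a good applicant and -500 for a bad one from Q3):

# Realized gain per case: +100 if the applicant is actually good (RESPONSE = 1), -500 if bad
realized_gain = np.where(y_test.loc[X_test.index] == 1, 100, -500)
print('Realized net profit if credit were extended to all test cases:', realized_gain.sum())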

Q4.c

Add another column for cumulative net profit.
cumulate_net_profit = np.cumsum(actual_gain)
X_test.loc[:, 'Cumulate_net_profit'] = cumulate_net_profit
plt.plot(cumulate_net_profit.values)
plt.xlabel('number of test cases')
plt.ylabel('cumulative net profit')
plt.show()
X_test

[Figure: cumulative net profit vs. number of test cases]

[Output: X_test with the added Cumulate_net_profit column (400 rows × 34 columns); the values are printed below]

print('The cumulative net profit column is as follows:')
X_test.loc[:, 'Cumulate_net_profit']
The cumulative net profit column is as follows:
972     -487.074435
334     -964.117360
728    -1434.421542
59     -1900.553622
11     -2355.853706
           ...     
519   -28084.173974
135   -27988.435915
567   -27892.649194
156   -27796.720878
209   -27698.651297
Name: Cumulate_net_profit, Length: 400, dtype: float64
Q4.d
How far into the test data do you go to get maximum net profit? (Often this is specified as a percentile or rounded to deciles.)
plt.bar(np.arange(31),clf.coef_.reshape(-1))
plt.xlabel('Variables')
plt.ylabel('Weight')
plt.show()

[Figure: bar chart of logistic regression coefficient weights by variable index]
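To tie each bar to a variable name, the coefficients can be indexed by the columns of X (a short sketch):

coef = pd.Series(clf.coef_.reshape(-1), index=X.columns).sort_values()
print(coef.head(3))  # strongest negative effects (e.g. RENT, DURATION)
print(coef.tail(3))  # strongest positive effects (e.g. CHK_ACCT, SAV_ACCT)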

To maximize net profit, the "predicted probability of success" assigned to each test case is critical. The coefficient plot shows that checking-account status has a strong positive effect on the probability of success, while the duration of the credit and certain credit purposes (e.g. a new car) have negative effects. A direct answer to the question follows from the gain columns built above, as sketched below.
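A short sketch (assuming the Score and Actual_gain columns constructed in Q4.a and Q4.b): sort the cases best-first and find the depth at which cumulative net profit peaks.

# Sort descending by predicted probability of success and accumulate the expected gains
desc = X_test.sort_values('Score', ascending=False)
cum_profit = desc['Actual_gain'].cumsum()
best_depth = int(cum_profit.values.argmax()) + 1
print(f'Maximum cumulative net profit of {cum_profit.max():.0f} is reached at the top '
      f'{best_depth} cases ({best_depth / len(desc):.0%} of the test data)')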

Q4.e
If this logistic regression model is scored to future applicants, what “probability of success” cutoff should be used in extending credit?
X_test.iloc[int((5 / 6) * 400), -3]  # Score of the case at the 5/6 quantile of the sorted test data
0.9578759345718599

To reduce the cost of extending credit, we must balance risk and benefit. We therefore propose setting a reasonable "probability of success" cutoff such that the number of successful applicants among those accepted is at least five times the number of unsuccessful ones; since a good applicant earns 100 while a bad one costs 500, this 5:1 ratio is exactly the break-even condition $100p - 500(1-p) \ge 0$, i.e. $p \ge 5/6 \approx 0.83$. Applying the 5/6 fraction to the sorted test data, the score at that quantile is about 0.958, so we propose setting the "probability of success" cutoff at 0.95.
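As a sanity check on the 5:1 ratio, the break-even probability can be computed directly (a two-line sketch under the payoffs above):

# Break-even: 100*p - 500*(1-p) = 0  =>  p = 500/600
print(f'Break-even probability of success: {500 / 600:.3f}')  # 0.833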
