Credit Risk Management: Classification Models & Hyperparameter Tuning

weixin_26752765

Original article: https://towardsdatascience.com/credit-risk-management-classification-models-hyperparameter-tuning-d3785edd8371

The final part aims to walk you through the process of applying different classification algorithms on our transformed dataset as well as producing the best-performing model using Hyperparameter Tuning.

As a reminder, this end-to-end project aims to solve a classification problem in Data Science, particularly in the finance industry, and is divided into 3 parts:

Exploratory Data Analysis (EDA) & Feature Engineering
Feature Scaling and Selection (Bonus: Imbalanced Data Handling)
Machine Learning Modelling (Classification)

If you have missed the previous two parts, feel free to check them out before going through this final part, which leverages their output to produce the best classification model.

A. Classification Models

Which algorithms should be used to build a model that addresses and solves a classification problem?

When it comes to classification, unlike regression, we have quite a few different algorithms to choose from. To name some, Logistic Regression, K-Neighbors, SVC, Decision Tree and Random Forest are among the most common and widely used algorithms for such problems.

Here’s a quick recap of what each algorithm does and how it distinguishes itself from the others:

Logistic Regression: this algorithm uses regression to predict the continuous probability of a data sample (from 0 to 1), then classifies that sample into the more probable target (either 0 or 1). However, it assumes a linear relationship between the inputs and the log-odds of the target, so it might not be a good choice when the classes cannot be separated by a roughly linear boundary.
K-Neighbors: this algorithm assumes that data points in close proximity to each other belong to the same class. In particular, it classifies the target (either 0 or 1) of a data sample by a plurality vote of the neighbors closest to it.
SVC: this algorithm makes classifications by defining a decision boundary and then classifies the data sample into the target (either 0 or 1) by seeing which side of the boundary it falls on. Essentially, the algorithm aims to maximize the distance between the decision boundary and the points in each class to decrease the chance of false classification.
Decision Tree: as the name suggests, this algorithm splits the root of the tree (the entire dataset) into decision nodes, and each decision node is split further until no node can be split any more. The algorithm then classifies a data sample by passing it down the tree from the root to a leaf (terminal) node and assigning the target of the leaf it lands in.
Random Forest: this algorithm is an ensemble technique built on the Decision Tree, in which many decision trees work together. In particular, the random forest passes the data sample to each of the decision trees and assigns the most popular classification as the target. This helps reduce the overfitting that may occur with a single Decision Tree, as it aggregates the classifications from multiple trees instead of relying on one.

Let’s see how they work with our dataset compared to one another:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    "LogisticRegression" : LogisticRegression(),
    "KNeighbors" : KNeighborsClassifier(),
    "SVC" : SVC(),
    "DecisionTree" : DecisionTreeClassifier(),
    "RandomForest" : RandomForestClassifier()
}

After importing the algorithms from sklearn, I created a dictionary which collects all of them in one place, so that it’s easier to apply them to the data in a single loop instead of fitting each one individually.

# Compute the training and test scores of each model
train_scores = []
test_scores = []

for key, classifier in classifiers.items():
    classifier.fit(x_a_train_rs_over_pca, y_a_train_over)
    train_score = round(classifier.score(x_a_train_rs_over_pca, y_a_train_over),2)
    train_scores.append(train_score)
    test_score = round(classifier.score(x_a_test_rs_over_pca, y_a_test_over),2)
    test_scores.append(test_score)

print(train_scores)
print(test_scores)

After applying the algorithms on both train and test sets, it seems that Logistic Regression doesn’t work well for the dataset as its scores are relatively low (around 50%, which is roughly what random guessing would achieve on a balanced target). This is quite understandable and suggests that a linear decision boundary does not fit our dataset.

In contrast, Decision Tree and Random Forest produced high accuracy scores on the train set (85%). The test set is a different story, with remarkably low scores (just over 50%). Possible explanations for the large gap are (1) overfitting the train set and (2) target leakage. However, after cross checking, this does not seem to be the case.

Hence, I decided to look into another scoring approach, the Cross Validation Score, to see if there’s any difference. Basically, this technique splits the training set into n folds (default = 5), then fits the model on n-1 folds and scores it on the remaining fold. This process is repeated n times, so that each fold serves as the held-out fold once, and the scores are averaged. The cross validation score gives a more objective view of how a model performs than a single accuracy score.
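To make the fold mechanics concrete, here is a minimal sketch of the same procedure written out by hand. It uses a synthetic dataset from make_classification as a stand-in for the article's training data (an assumption), and relies on the fact that cv=5 resolves to an unshuffled StratifiedKFold for classifiers.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the article's training data (an assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

model = LogisticRegression(max_iter=1000)
folds = StratifiedKFold(n_splits=5)  # what cv=5 resolves to for a classifier

manual_scores = []
for train_idx, val_idx in folds.split(X, y):
    fold_model = clone(model)                     # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])    # fit on the other n-1 folds
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))  # score on the held-out fold

manual_mean = float(np.mean(manual_scores))
library_mean = float(cross_val_score(model, X, y, cv=5).mean())
print(round(manual_mean, 2), round(library_mean, 2))
```

The two means should agree, which shows that cross_val_score is just this loop plus an average.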

from sklearn.model_selection import cross_val_score

train_cross_scores = []
test_cross_scores = []

for key, classifier in classifiers.items():
    classifier.fit(x_a_train_rs_over_pca, y_a_train_over)
    train_score = cross_val_score(classifier, x_a_train_rs_over_pca, y_a_train_over, cv=5)
    train_cross_scores.append(round(train_score.mean(),2))
    test_score = cross_val_score(classifier, x_a_test_rs_over_pca, y_a_test_over, cv=5)
    test_cross_scores.append(round(test_score.mean(),2))
    
print(train_cross_scores)
print(test_cross_scores)

As seen, the gap between the train and test scores was significantly bridged!

Since the Random Forest model produced the highest cross validation score, we will test it against another scoring metric, the ROC AUC Score, and see how it performs on the ROC Curve.

Essentially, the ROC Curve is a plot of the false positive rate (x-axis) versus the true positive rate (y-axis) as the classification threshold varies between 0 and 1, while AUC (the area under that curve) represents the degree of separability (simply, the model’s ability to distinguish between the target classes).

As a quick summary, FPR (which equals 1 − Specificity) is calculated as FP / (FP + TN), and TPR (also known as Sensitivity) as TP / (TP + FN), where TP, FP, TN and FN are the counts of true positives, false positives, true negatives and false negatives.
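As a small worked example, the two rates can be read off a confusion matrix. The labels and predictions below are made up purely for illustration and are not taken from the article's dataset.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions, purely for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]

# For binary 0/1 labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)          # sensitivity (recall)
fpr = fp / (fp + tn)          # 1 - specificity
specificity = tn / (tn + fp)

print(tpr, fpr, specificity)
```

Each point on a ROC curve is one such (FPR, TPR) pair, computed at a different classification threshold.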

import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

rf = RandomForestClassifier()
rf.fit(x_a_train_rs_over_pca, y_a_train_over)
# rf_pred holds hard 0/1 labels predicted out-of-fold on the test set
rf_pred = cross_val_predict(rf, x_a_test_rs_over_pca, y_a_test_over, cv=5)
print(roc_auc_score(y_a_test_over, rf_pred))

# Plot the ROC Curve
fpr, tpr, _ = roc_curve(y_a_test_over, rf_pred)
plt.plot(fpr, tpr)
plt.show()

Since cross validation had given more consistent scores on this dataset, I then applied another cross validation utility called “cross_val_predict”, which splits the data into n folds in the same way and returns, for each sample, the prediction made while that sample was in the held-out fold.
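One possible variation, sketched below on synthetic data (an assumption, not the article's dataset): the snippet above passes hard 0/1 labels to roc_curve, which yields a curve with a single interior point. Asking cross_val_predict for class probabilities through its method="predict_proba" argument gives roc_curve a range of thresholds to sweep.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the article's data (an assumption)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0)

# Out-of-fold class probabilities; column 1 is the probability of class 1
proba = cross_val_predict(rf, X, y, cv=5, method="predict_proba")[:, 1]
# Out-of-fold hard labels, as in the snippet above
labels = cross_val_predict(rf, X, y, cv=5)

fpr_p, tpr_p, _ = roc_curve(y, proba)
fpr_l, tpr_l, _ = roc_curve(y, labels)

print("AUC from probabilities:", round(roc_auc_score(y, proba), 3))
print("AUC from hard labels:  ", round(roc_auc_score(y, labels), 3))
print("curve points:", len(fpr_p), "vs", len(fpr_l))
```

Either fpr/tpr pair can be passed to plt.plot as before; the probability-based curve is the one that traces the trade-off across thresholds.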

B. Hyperparameter Tuning

What is hyperparameter tuning and how does it help to improve the accuracy of the model?

After fitting each algorithm with its default hyperparameters, I wanted to see if further improvement could be made, which comes down to Hyperparameter Tuning. Essentially, this technique chooses, for each algorithm, the set of hyperparameter values that (might) produce the highest accuracy score on the given dataset.

The reason why I put (might) in the definition is that in some cases little to no improvement is seen, depending on the dataset as well as the preparation done initially (plus it can take a very long time to run). Even so, Hyperparameter Tuning is worth considering in the hope of finding the best-performing model.

# Use GridSearchCV to find the best parameters
from sklearn.model_selection import GridSearchCV

# Logistic Regression ('l1' is only supported by the 'liblinear' and 'saga' solvers)
lr = LogisticRegression()
c_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
lr_params = [
    {"penalty": ['l1', 'l2'], "C": c_values, "solver": ['liblinear', 'saga']},
    {"penalty": ['l2'], "C": c_values, "solver": ['newton-cg', 'lbfgs', 'sag']},
]
grid_logistic = GridSearchCV(lr, lr_params)
grid_logistic.fit(x_a_train_rs_over_pca, y_a_train_over)
lr_best = grid_logistic.best_estimator_

# K-Nearest Neighbors
knear = KNeighborsClassifier()
knear_params = {"n_neighbors": list(range(2,7,1)), "algorithm": ['auto', 'ball_tree', 'kd_tree', 'brute']}
grid_knear = GridSearchCV(knear, knear_params)
grid_knear.fit(x_a_train_rs_over_pca, y_a_train_over)
knear_best = grid_knear.best_estimator_

# SVC
svc = SVC()
svc_params = {"C": [0.5, 0.7, 0.9, 1], "kernel":['rbf', 'poly', 'sigmoid', 'linear']}
grid_svc = GridSearchCV(svc, svc_params)
grid_svc.fit(x_a_train_rs_over_pca, y_a_train_over)
svc_best = grid_svc.best_estimator_

# Decision Tree
tree = DecisionTreeClassifier()
tree_params = {"criterion": ['gini', 'entropy'], "max_depth":list(range(2,5,1)), "min_samples_leaf":list(range(5,7,1))}
grid_tree = GridSearchCV(tree, tree_params)
grid_tree.fit(x_a_train_rs_over_pca, y_a_train_over)
tree_best = grid_tree.best_estimator_

GridSearchCV is the key to finding the best set of hyperparameters for each algorithm: it fits the model with every combination of values in the supplied grid, scores each combination with cross validation, then returns the best one.
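As a rough illustration of what the search does, the sketch below runs the Decision Tree grid from above on a small synthetic dataset (an assumption standing in for the article's data) and inspects the result.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the article's training data (an assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4],
    "min_samples_leaf": [5, 6],
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)

# 2 * 3 * 2 = 12 combinations, each scored with 5-fold cross validation
n_candidates = len(grid.cv_results_["params"])
print(n_candidates)
print(grid.best_params_)            # the winning combination
print(round(grid.best_score_, 2))   # its mean cross validation score
```

The number of fits grows multiplicatively with every value added to the grid, which is why larger grids can take a long time to run.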

One thing to note is that we need to know which hyperparameters each algorithm accepts in order to build its grid. For example, Logistic Regression has “penalty”, “C”, and “solver”, a combination that the other algorithms here do not share.
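Rather than memorising them, the hyperparameter names of any scikit-learn estimator can be listed with its get_params() method. A short sketch:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Print the hyperparameter names each estimator accepts
for estimator in (LogisticRegression(), KNeighborsClassifier(), SVC()):
    names = sorted(estimator.get_params().keys())
    print(type(estimator).__name__, names)

lr_names = set(LogisticRegression().get_params())
knn_names = set(KNeighborsClassifier().get_params())
```

Any key of that dictionary is a valid key for the parameter grid passed to GridSearchCV.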

After finding the .best_estimator_ of each algorithm, fit and predict the data using each algorithm with its best set of hyperparameters. We then need to compare the new scores against the original ones to determine whether there is any improvement or whether the grids need further fine-tuning.
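One way this comparison could look is sketched below, using K-Neighbors on a synthetic dataset. The data, the split and the variable names are assumptions for illustration, not the article's own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the article's data (an assumption)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

baseline = KNeighborsClassifier()  # default hyperparameters (n_neighbors=5)
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": list(range(2, 7))}, cv=5)
grid.fit(X_train, y_train)
tuned = grid.best_estimator_       # already refit on the full training set

# Compare on the same 5-fold cross validation of the training set
baseline_cv = cross_val_score(baseline, X_train, y_train, cv=5).mean()
tuned_cv = grid.best_score_

# Then check both on the held-out test set
baseline_test = baseline.fit(X_train, y_train).score(X_test, y_test)
tuned_test = tuned.score(X_test, y_test)

print("cross validation:", round(baseline_cv, 3), "->", round(tuned_cv, 3))
print("test set:        ", round(baseline_test, 3), "->", round(tuned_test, 3))
```

Because the default value is part of the grid here, the tuned cross validation score cannot be worse than the baseline, but the test score may still move in either direction.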

Bonus: XGBoost and LightGBM

What are XGBoost and LightGBM, and how much better do these algorithms perform compared to the traditional ones?

Apart from the common classification algorithms above, I have also come across a couple of more advanced algorithms that are rooted in the traditional ones. In this case, XGBoost and LightGBM can be considered successors of Decision Tree and Random Forest. The timeline below gives a better understanding of how these algorithms were developed:

[Figure: timeline of the development of tree-based algorithms]

I’m not going to go into the details of how these algorithms differ mathematically, but in general, they are able to prune the decision trees better while also handling missing values and avoiding overfitting.

# XGBoost
import xgboost as xgb

xgb_model = xgb.XGBClassifier()
xgb_model.fit(x_a_train_rs_over_pca, y_a_train_over)
xgb_train_score = cross_val_score(xgb_model, x_a_train_rs_over_pca, y_a_train_over, cv=5)
xgb_test_score = cross_val_score(xgb_model, x_a_test_rs_over_pca, y_a_test_over, cv=5)

print(round(xgb_train_score.mean(),2))
print(round(xgb_test_score.mean(),2))

# LightGBM
import lightgbm as lgb

lgb_model = lgb.LGBMClassifier()
lgb_model.fit(x_a_train_rs_over_pca, y_a_train_over)
lgb_train_score = cross_val_score(lgb_model, x_a_train_rs_over_pca, y_a_train_over, cv=5)
lgb_test_score = cross_val_score(lgb_model, x_a_test_rs_over_pca, y_a_test_over, cv=5)

print(round(lgb_train_score.mean(),2))
print(round(lgb_test_score.mean(),2))

After computing, the train and test scores of each model are 72% & 73% (XGBoost) and 69% & 72% (LightGBM), which is roughly the same as the Random Forest model computed above. We could still optimise these advanced models further via Hyperparameter Tuning, but beware that it might take a very long time, since XGBoost and LightGBM have longer runtimes due to the complexity of their algorithms.

Voila! That’s a wrap for this end-to-end Classification project! If you are keen to explore the entire code, feel free to check out my Github below:

Repository: https://github.com/andrewnguyen07/credit-risk-management
LinkedIn: www.linkedin.com/in/andrewnguyen07

Follow my Medium to keep posted on future projects coming up soon!
