A web application for visualizing and predicting loan approvals.

The aim is to develop an end-to-end web application that helps a user determine their chances of getting a loan. Several different statistical learning techniques are implemented to estimate the probability of default for each loan. Machine learning models are used to classify a set of unseen data, and statistical metrics are used to compare the results. The finalized model is then deployed with the Flask library on Heroku, and a website is created for the user to navigate and predict their chances of acquiring a loan. The structure of the post is as follows:


1. Data Overview

2. Data Science and Exploratory Data Analysis

3. Machine Learning and Deployment

4. Tableau Visualization

5. Summary

Data Overview

The Lending Club data consists of 2,195,670 rows and 151 columns. The data comes from the Lending Club website; the latest updated datasets are also available through Kaggle competitions. The target column was identified as loan_status. For this analysis only Fully Paid and Charged Off/Defaulted loans were taken into consideration, which reduced the data set to 1,344,251 rows and 151 columns. Upon eyeballing the data, it was divided into two subsets based on the application type. The reason is that the features associated with a second applicant are null for Individual loans and might otherwise get deleted during the data cleaning process.


The issue date is the feature that records when a loan was issued to the applicant. For both subsets of the data, the latest 15% of issued loans were held out as the test data on which the final model is evaluated.

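A minimal pandas sketch of this time-based hold-out, assuming the raw CSV and the Lending Club column names (the 0.85 quantile cut-off is one way to select the latest 15%):

    import pandas as pd

    df = pd.read_csv("lending_club.csv", parse_dates=["issue_d", "earliest_cr_line"])

    # Keep only matured loans: fully paid or charged off / defaulted.
    df = df[df["loan_status"].isin(["Fully Paid", "Charged Off", "Default"])]

    # Hold out the latest 15% of loans by issue date as the test set.
    cutoff = df["issue_d"].quantile(0.85)
    train, test = df[df["issue_d"] <= cutoff], df[df["issue_d"] > cutoff]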

Fig 1: Loan issuance over the years. (Source: Image by author)

The above graph shows the distribution of loans issued over time for Individual and Joint applicants. Since the majority of loans issued after 2017 would still be in the servicing stage, including them could bias the model. That is why the training data was drawn only from loans that had matured. The machine learning models are trained on 85% of the data and applied to the remaining 15%.


Data Science

1. Data Cleaning

Data leakage refers to a mistake in which the creator of a machine learning model accidentally shares information between the test and training datasets. This is one reason the test/train split was done by loan issue date: to prevent information from the future from contaminating our predictions. In our case, some features were recorded only after the loan commenced. Such features are not available in reality when a user is predicting default. These features were therefore eliminated and not included in the prediction model. With this, the columns were reduced from 151 to 86.

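A sketch of dropping such post-origination (leakage) columns; the list below is illustrative, not the exact set of 65 columns removed in the analysis:

    # Columns populated only after the loan is issued leak future information.
    leakage_cols = ["total_pymnt", "total_rec_prncp", "total_rec_int",
                    "recoveries", "out_prncp", "last_pymnt_d",
                    "last_pymnt_amnt", "collection_recovery_fee"]
    df = df.drop(columns=[c for c in leakage_cols if c in df.columns])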

Fig 2: Missing data in the dataset. (Source: Image by author)

2. Missing Data and Variance Threshold

The threshold for missing data was set to 50%, meaning any feature missing more than 50% of its values was removed from the data set. As the graph shows, the majority of these features had more than 90% missing data, so imputation would not have been useful. This reduced the number of features from 86 to 31.

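A minimal sketch of this 50% missingness filter:

    # Drop any feature with more than 50% missing values.
    missing_frac = df.isnull().mean()
    df = df.drop(columns=missing_frac[missing_frac > 0.5].index)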

For categorical features: any feature with only a single category in the data set provides no information to the model. Hence features like 'policy_code', 'application_type', 'hardship_type', 'deferral_term', and 'hardship_length' were removed. This reduced the number of features from 31 to 26.


For numerical features: any feature with very little variance would not be informative for the model. None of the numerical features were eliminated, as their variance was substantially high.

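A sketch of both checks, assuming a pandas DataFrame df:

    from sklearn.feature_selection import VarianceThreshold

    # Categorical: drop columns with a single unique value.
    constant = [c for c in df.select_dtypes(include="object").columns
                if df[c].nunique(dropna=True) <= 1]
    df = df.drop(columns=constant)

    # Numerical: flag columns with zero variance; none qualified here.
    num_cols = df.select_dtypes(include="number").columns
    selector = VarianceThreshold(threshold=0.0).fit(df[num_cols].fillna(0))
    df = df.drop(columns=num_cols[~selector.get_support()])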

3. Feature Engineering

The remaining columns must not contain missing values, since some machine learning models do not accept nulls. Thus, a few imputations were done and new features were engineered based on domain knowledge, as shown in the sketch after this list:


a. credit_hist: denotes how old the applicant's credit line is. It was created by counting the number of days between the earliest credit line open date and the issue date of the loan. When making predictions, today's date is used as the loan issue date, and the user is asked for the earliest credit line open date. [credit_hist = issue_d - earliest_cr_line]


b. credit_line_ratio: the ratio between the number of open credit lines in the borrower's file and the total number of credit lines ever opened. [credit_line_ratio = open_acc / total_acc]


c. balance_annual_inc: the ratio between the loan amount issued and the applicant's annual income. [balance_annual_inc = loan_amnt / annual_inc]


d. annual_inc was transformed to a logarithmic scale to approximate a normal distribution.


e. fico_avg_score: the dataset provides FICO high and FICO low scores for each applicant. Their average was taken and the field was named fico_avg_score. When making predictions, the FICO score is taken as input.


f. inst_amnt_ratio: the ratio between the installment amount and the loan amount issued to the applicant. Since the installment amount is not available when making predictions (using it directly would leak information), it is instead calculated from the interest rate, term, and loan amount.

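A sketch of these engineered features in pandas; column names follow the Lending Club schema, and the standard amortization formula is assumed for reconstructing the installment:

    import numpy as np

    df["credit_hist"] = (df["issue_d"] - df["earliest_cr_line"]).dt.days
    df["credit_line_ratio"] = df["open_acc"] / df["total_acc"]
    df["balance_annual_inc"] = df["loan_amnt"] / df["annual_inc"]
    df["annual_inc"] = np.log1p(df["annual_inc"])
    df["fico_avg_score"] = (df["fico_range_high"] + df["fico_range_low"]) / 2

    # Rebuild the installment from rate, term, and amount (amortization
    # formula), then take its ratio to the loan amount.
    r = df["int_rate"] / 100 / 12                        # monthly interest rate
    n = df["term"].str.extract(r"(\d+)")[0].astype(int)  # term in months
    installment = df["loan_amnt"] * r * (1 + r) ** n / ((1 + r) ** n - 1)
    df["inst_amnt_ratio"] = installment / df["loan_amnt"]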

4. Feature Scaling

In this data set, the data was normalized using the mean and standard deviation of each feature computed within groups defined by the first two digits of the zip code. The reasoning is that $20,000 has a different value in Austin, TX than in, say, San Francisco, CA. It makes more economic sense to scale observations by metropolitan statistical area (MSA) than across the whole nation.

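A sketch of this group-wise standardization, assuming the masked zip_code column ("123xx") that Lending Club provides, with the first two digits serving as a rough MSA proxy:

    df["zip2"] = df["zip_code"].str[:2]

    num_cols = ["loan_amnt", "annual_inc", "dti"]  # illustrative subset
    grouped = df.groupby("zip2")[num_cols]
    df[num_cols] = (df[num_cols] - grouped.transform("mean")) / grouped.transform("std")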

From the above graphs, there are a few key points to note: the balance-to-annual-income ratio for loans that default is slightly higher than for those that do not. The reason might be that the mean issued loan amount is higher for defaulters than for non-defaulters, or that the mean annual income is lower for defaulters. Similarly, the debt-to-income ratio is higher for defaulters than for non-defaulters.


The interest rate and subgrade show a very interesting trend. Higher-interest loans tend to default more, and loans issued to lower-subgrade applicants default more, which indicates that subgrade and interest rate are correlated. Applicants with a higher subgrade tend to get a lower interest rate and thus default less, whereas applicants with a lower subgrade get a higher interest rate and default more.


It is important to note that correlated variables may create unnecessary noise in a machine learning model. In this analysis, the FICO average score, subgrade, grade, and interest rate show a high correlation. Thus, using K-nearest neighbors, the subgrade and interest rate can be predicted from the FICO average score, taking into account the term provided by the applicant. Interest rate and subgrade are therefore not included as inputs to the ML models.

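A sketch of this KNN lookup; the numeric term column and the neighbor count are illustrative choices, not the tuned settings:

    from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

    X = df[["fico_avg_score", "term_months"]]  # term as a number, e.g. 36 or 60

    # Infer the subgrade and interest rate from FICO score and term.
    knn_grade = KNeighborsClassifier(n_neighbors=15).fit(X, df["sub_grade"])
    knn_rate = KNeighborsRegressor(n_neighbors=15).fit(X, df["int_rate"])

    print(knn_grade.predict([[710, 36]]), knn_rate.predict([[710, 36]]))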

Machine Learning and Deployment

There are 18 predictor variables and 1 response variable. Considering the end user, 18 variables is too much information to input, so before applying machine learning models, some features were removed based on their F-scores. In this process pub_rec, pub_rec_bankruptcies, emp_length, purpose, and revol_bal were removed from the dataset. This left 13 columns: term, sub_grade, balance_annual_inc, fico_avg_score, dti, inst_amnt_ratio, verification_status, mort_acc, revol_util, annual_inc, credit_line_ratio, home_ownership, and credit_hist.

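A sketch of the F-score screening with scikit-learn; using the ANOVA F-test (f_classif) is an assumption about how the scores were computed:

    import pandas as pd
    from sklearn.feature_selection import f_classif

    # Score each candidate feature against the binary target (1 = default).
    F, p = f_classif(X_train, y_train)
    scores = pd.Series(F, index=X_train.columns).sort_values()
    print(scores)  # the lowest-scoring features are dropped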

Machine learning models applied:

1. Logistic Regression
2. Gradient Boosting Classifier
3. Random Forest Classifier
4. XGBoost Classifier
5. Support Vector Classifier
6. K-Nearest Neighbors Classifier


Metrics used to compare the machine learning models:

1. AUC score: the area under the ROC curve, which plots the true positive rate against the false positive rate at different thresholds.
2. Precision-Recall curve: shows the trade-off between precision and recall at different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false-positive rate and high recall relates to a low false-negative rate.

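A sketch of the comparison loop over a few of these models; the hyperparameters are library defaults, not the tuned settings:

    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from xgboost import XGBClassifier

    models = {
        "LogReg": LogisticRegression(max_iter=1000),
        "GBTree": GradientBoostingClassifier(),
        "RandomForest": RandomForestClassifier(),
        "XGBoost": XGBClassifier(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        proba = model.predict_proba(X_test)[:, 1]
        print(f"{name}: AUC = {roc_auc_score(y_test, proba):.3f}")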

Fig 3: Machine learning model comparison. (Source: Image by author)

The above two graphs show the ROC curve and Precision-Recall curve for all the mentioned models. XGBoost, Logistic Regression, and GBTree are the best performers on the data set, giving the highest AUC scores. For this analysis, XGBoost was selected for the individual applicant data set and GBTree for the joint loan data set. Each model is saved into a pickle file (.pkl) that can be loaded anywhere and used for making predictions. Before saving, hyperparameter tuning was performed, which increased the AUC score from 0.69 to 0.71.

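A sketch of the tuning and serialization step; the search grid and the use of RandomizedSearchCV are assumptions, not the exact tuning procedure:

    import pickle
    from sklearn.model_selection import RandomizedSearchCV
    from xgboost import XGBClassifier

    search = RandomizedSearchCV(
        XGBClassifier(),
        param_distributions={"max_depth": [3, 5, 7],
                             "learning_rate": [0.01, 0.05, 0.1],
                             "n_estimators": [200, 400, 600]},
        scoring="roc_auc", n_iter=10, cv=3)
    search.fit(X_train, y_train)

    with open("model_individual.pkl", "wb") as f:
        pickle.dump(search.best_estimator_, f)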

Fig 4: Confusion matrix and report for XGBoost after hyperparameter tuning. (Source: Image by author)

The model does a good job of predicting true defaults, with 66% precision: out of every 100 loans the model predicts will default, 66 actually default, which seems a sound result under a conservative approach. The model also has a good specificity of 83%, meaning it correctly identifies 83% of non-defaulting loans. Overall, the accuracy of the model is 65%, which, considering this is real-world banking data, is quite good. The model was thus finalized and saved as a pickle object.


Deployment Architecture

Fig 5: App architecture. (Source: Image by author)

The first vertical includes data exploration, data cleaning, feature engineering, model training, model validation, model selection, and finally saving the model to a .pkl file. All of these data analysis steps are carried out in a Jupyter notebook using the scikit-learn library. The pickle file contains the final model, which the app(.py) file loads to generate predictions from the input data received from the HTML form. The app file uses the Flask web framework, which provides the tools, libraries, and technologies for building a web application. It was written in Python using Spyder, and all template files (.html, .css, .js) for the webpages were placed in folders alongside the app file. Data is transferred to the loaded model via the GET HTTP method, and the result is posted back to the web page via the POST HTTP method. The application was then deployed on Heroku, which serves it through the Gunicorn WSGI HTTP server. This is the entire structure of the web app created to predict loan defaults.

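A minimal sketch of such a Flask app; the route, template names, and form fields are illustrative, not the exact ones from the repository:

    import pickle
    from flask import Flask, render_template, request

    app = Flask(__name__)
    with open("model_individual.pkl", "rb") as f:
        model = pickle.load(f)

    @app.route("/", methods=["GET", "POST"])
    def predict():
        if request.method == "POST":
            # Collect the user's inputs from the HTML form (names illustrative).
            fields = ("fico_avg_score", "dti", "annual_inc", "credit_hist")
            features = [[float(request.form[k]) for k in fields]]
            prob_default = model.predict_proba(features)[0][1]
            return render_template("result.html", prob=round(prob_default, 2))
        return render_template("index.html")

    if __name__ == "__main__":
        app.run(debug=True)  # on Heroku, Gunicorn serves `app` instead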

Tableau Visualization

Tableau was used for easy visualization of the data set, as it can also be embedded directly in the website. The Tableau dashboard is cohesive and informative about the data set. The dashboard consists of 6 sheets:


  1. A map of the USA, with a color gradient showing the average income for each state.

  2. balance_annual_inc vs. grade, grouped by defaulters and non-defaulters, which clearly shows that defaulters in each grade had a higher loan-to-annual-income ratio.

  3. The distribution of all Lending Club loans issued over the years.

  4. A box plot of the interest rate at each grade, making it evident that lower grades carry higher interest rates.

  5. The distribution of average interest rate versus loan amount issued, grouped by defaulters and non-defaulters.

  6. A representation, by size, of the reasons behind loan issuance.

Because of this interlinked nature of the dashboard, the user can select a state and all sheets in the dashboard will be filtered to that state. This gives the user a clearer insight into the state and how the data is associated with it.


Fig 6: Tableau dashboard. (Source: Image by author)

Summary

In the first part of the article the data set is presented, the variables are explained, and changes are made where necessary to make the variables usable in the model-building stage. Missing observations are removed from the data set and decisions are made about which loans to keep. The data set was trimmed so that only loans that have reached maturity are included. Exploratory data analysis was performed through the sweetviz library in Python.

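A sketch of that sweetviz step; the output file name is illustrative:

    import sweetviz as sv

    report = sv.analyze(train)           # profiles every feature in one pass
    report.show_html("eda_report.html")  # standalone HTML, embeddable in the site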

Then the implementation of the website and the data analysis were discussed. Various machine learning classifiers were used: Logistic Regression, Gradient Boosting Classifier, Random Forest Classifier, XGBoost Classifier, Support Vector Classifier, and K-Nearest Neighbors Classifier. The metric used to determine the best model for the dataset was the AUC score, as it works well for imbalanced classification problems. Since the dataset was divided into two subsets based on application type, viz. Individual and Joint applicants, two different machine learning models were saved to pickle files, and hyperparameter tuning was done to reach the highest possible AUC scores of 0.71 and 0.72, respectively. After the data analysis, the models were saved and a Python Flask application file was connected to the website. The website forms take the applicant's inputs, and the entered data is passed to the model through this Flask file. The Flask application renders the web pages, which were created using HTML, CSS, and JavaScript, ensuring a smooth and minimal user interface and experience.


Apart from the prediction, a user can navigate and visualize the dataset thoroughly from the Tableau dashboard embedded in the website. This dashboard helps the user understand various aspects of the data, as its interlinked sheets are well integrated with the United States map. The website also integrates sweetviz reports that give a detailed analysis of all the features in the dataset.


In the end, the website is deployed on the Heroku servers. This is done through Git integration on the Heroku platform: a Git repository was created and linked with the Heroku servers.


Useful Links

1. Webapp

Lending Club Analysis

2. Github

ikunal95/loan-default-prediction

3. Tableau Dashboard

Tableau Public

Translated from: https://medium.com/@ikunal95/lending-club-data-web-app-ada56ff64cee
