Do Credit Companies Really Need Seven Years of Our Data?

If you grew up in the United States, you probably have a distinct memory of receiving 15,000 different credit card applications the MOMENT you turned 18. You may even remember being accosted by friendly employees at American Eagle, Hollister, Best Buy, and Barnes & Noble trying to sell you on their latest credit card to guarantee you 20% off your purchase! The moment you agreed to sign on with that first card was, for many of you, the moment your journey into the world of credit began! There are also things like student loans and car loans, and, if you're in your mid-20s or early 30s, you may be looking at home loans and mortgages! All of this is wrapped up in the world of credit.

In the world of credit, there are many ways of making mistakes and not too many ways of digging your way out. If/when you do buy more than you can pay for on credit and miss a few payments, you're likely to rack up an adverse credit history, which is reflected in your score. Did you know that some of these mistakes will remain for seven years?

According to credit monitoring companies like Equifax, Experian, and TransUnion, mistakes such as late/missed payments, accounts that head into collections, and Chapter 13 bankruptcy all remain in your credit history (and therefore affect your daily life) for up to seven years. Chapter 7 bankruptcy lasts for ten years! Of course, there are a great many things you can do to build your score back up even with these at-times-heavy dings on your credit. You can begin, for example, by calling collections or the company you owe money to and creating a payment plan. However, I long wondered about the *why* behind it all.

Why seven years? Seven years is an incredibly long time for most mistakes to follow you around. This did not seem like the sort of mistake that falls in the category of “Should Follow You For 7+ Years.”

I wonder: can payment history that reaches years into your past really be a good predictor of your ability to make credit payments in the future?

I did some digging and found a data set upon which to test my questions! The University of California, Irvine's (UCI) Machine Learning Repository had a dataset related to this question! It contained payment history, payment status, and account information for 30,000 people who were lent credit by a bank in Taiwan between April and September of 2005. I wanted to know which features would be the best predictors of whether someone would default on their credit payment in the next month. I created classification models to answer this question.

Classification models are essential to machine learning. They are fed training data full of observations containing a variety of predictor variables and a target variable. If all is done correctly, these models can predict to which sub-group a specific observation belongs.

Think of your email inbox for a second. Do you ever wonder how spam rarely, if ever, seems to reach you? That is thanks to a classification filter built into your inbox! It has been trained on millions and millions of combinations of words to understand which words feature more in a spam email than a regular email. It filters each email you receive and will classify any spam that it comes across. These models aren’t perfect; of course, they are not entirely leak-proof. However, they give us an incredible amount of information about how to classify data relevant to our everyday lives.

I built 15 different models to analyze this data. I will highlight two of the most influential models and then discuss which features I found were most relevant when predicting an individual defaulting on their credit. The two models I would like to highlight are called Decision Trees and Random Forests.

WHAT IS A DECISION TREE?

Imagine you want to decide whether you will play golf based on the weather's various conditions. Certain weather conditions are more important to you than others. The weather outlook is the root node: Sunny, Overcast, or Rainy. Each branch either continues to another question or ends in a YES/NO terminal node. If the weather is rainy, you reach a terminal node of NO. If it is Sunny or Overcast, perhaps you then want to know whether it is windy. If it is not windy and overcast, not windy and sunny, or windy and sunny, the result is YES, but if it is overcast and windy, the answer is NO. This example is how a Decision Tree works!

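To make the flow concrete, here is that toy tree written out as plain Python. This is just a sketch of the logic above, not code from the credit analysis:

```python
def play_golf(outlook: str, windy: bool) -> str:
    """Toy decision tree: outlook is 'Sunny', 'Overcast', or 'Rainy'."""
    if outlook == "Rainy":               # root split: Rainy is a terminal NO
        return "NO"
    if outlook == "Overcast" and windy:  # second split on wind
        return "NO"
    return "YES"                         # all remaining combinations are YES

print(play_golf("Sunny", windy=True))     # YES
print(play_golf("Overcast", windy=True))  # NO
```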

Two of the most important terms related to decision trees are entropy and information gain. Entropy measures the impurity of the input set. Information gain is a decrease in entropy.

When I refer to a data set’s impurity, here is what I mean: If you have a bowl of 100 white grapes, you know that if you pluck a grape at random, you will get a white grape. Your bowl has purity. If, however, I remove 30 of the white grapes and replace them with purple grapes, your likelihood of plucking out a white grape has decreased to 70%. Your bowl has become impure. The entropy has increased.

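To put numbers on the grape bowls, here is Shannon entropy sketched in Python (the standard formula behind the "entropy" criterion):

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)).
    Terms with p == 0 or p == 1 contribute nothing, so we skip them."""
    return sum(-p * math.log2(p) for p in probabilities if 0 < p < 1)

print(entropy([1.0]))       # all-white bowl: 0.0 bits (perfectly pure)
print(entropy([0.7, 0.3]))  # 70/30 white/purple bowl: ~0.881 bits
```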

As each split occurs in your decision tree, entropy is measured. The split that produces the lowest entropy relative to the parent node and to the other candidate splits is chosen; the lower the entropy, the better.

The Decision Tree has a great many hyperparameters that we need to tune. Hyperparameters are parameters whose values are set before the learning process begins. In Decision Tree models, the relevant hyperparameters are the criterion, maximum depth, minimum samples per split, minimum samples per leaf, and maximum features (see the sketch after this list).

  1. Criterion: entropy or Gini, two different measures of impurity. There is not a vast difference between them.

  2. Maximum Depth: limits the depth of the tree to build a more generalized tree. Set this depending on your need.

  3. Minimum Samples per Split: the minimum number of samples a node must contain before it is allowed to split.

  4. Minimum Samples per Leaf: fixes the minimum number of samples that may end up in a terminal node.

  5. Maximum Features: the maximum number of features to consider when splitting a node.
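
For reference, here is how those five hyperparameters map onto scikit-learn's DecisionTreeClassifier. The values below are illustrative placeholders, not the tuned values from my models:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="entropy",   # 1. impurity measure: "entropy" or "gini"
    max_depth=4,           # 2. cap on tree depth
    min_samples_split=50,  # 3. minimum samples a node needs before it can split
    min_samples_leaf=25,   # 4. minimum samples allowed in a terminal node
    max_features=10,       # 5. max features considered at each split
    random_state=42,
)
```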

While I initially wrote code to tune each of these individually, to find the best results I ultimately used a function called GridSearch to find the best parameters for me. GridSearch tries every possible parameter combination that you feed it to find out which set of parameters will give you the best possible score. It does this by combining K-Fold cross-validation with a grid search over the parameters.

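A minimal sketch of that workflow with scikit-learn's GridSearchCV, assuming X_train and y_train already hold the predictors and the default-next-month target; the grid values here are stand-ins, not my actual search space:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 50, 200],
    "min_samples_leaf": [1, 25, 100],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,          # 5-fold cross-validation for every combination
    scoring="f1",  # choose whichever score matters for your problem
)
search.fit(X_train, y_train)  # X_train / y_train assumed to exist
print(search.best_params_)    # the winning combination
```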

This is an example of a tuned decision tree with shallow depth:

[Diagram: a tuned decision tree with shallow depth]

At this point, I wanted to see what this model considered relevant for predicting one class over the other, so I created a chart of feature importances.

[Chart: feature importances from the tuned Decision Tree]
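
A fitted scikit-learn tree exposes its importances through the feature_importances_ attribute, so a chart like this can be sketched in a few lines (assuming the fitted tree from earlier and a DataFrame X_train):

```python
import matplotlib.pyplot as plt
import pandas as pd

# feature_importances_ sums to 1.0 across all predictors
importances = pd.Series(tree.feature_importances_, index=X_train.columns)
importances.sort_values().plot.barh(title="Decision Tree feature importances")
plt.tight_layout()
plt.show()
```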

This chart shows MASSIVE importance for the payment status in September but such a low readout for the rest. There needed to be a way to gain more precise information.

A word on the data for a moment: I mention payment status throughout this essay, and it becomes a central theme in the analysis. The columns that measured payment status used the values -2, -1, 0, and 1 through 9. A value of -2 denoted no use of credit for the month, -1 denoted someone who had paid up their account that month, 0 denoted someone using revolving credit, and 1 through 9 marked someone who was that many months behind in payments (9 stood for nine and above).

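For reference, here is that encoding written out as a small Python lookup table (the wording of the labels is mine, not the dataset's):

```python
STATUS_MEANING = {
    -2: "no credit use that month",
    -1: "account paid up that month",
    0: "revolving credit in use",
    # keys 1 through 9: that many months behind on payments (9 = nine or more)
    **{n: f"{n} month(s) behind on payments" for n in range(1, 10)},
}

print(STATUS_MEANING[0])  # "revolving credit in use"
print(STATUS_MEANING[9])  # "9 month(s) behind on payments"
```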

ENTER RANDOM FORESTS

Imagine creating MANY decision trees!

That’s what a random forest is! It is an ensemble of decision trees. Each decision tree makes choices that maximize information. With a diverse series of trees, I can have a model that gives me even more information than the single tree I created.

Random Forests are also very resilient to overfitting: the diverse decision trees in the forest are trained on different samples of the data and look to varying subsets of features to make their predictions. Any given tree has room for error, but the odds that every tree will make the same error are incredibly small, because they are not all looking at the same predictors!

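A minimal sketch of a random forest in scikit-learn: each tree is trained on a bootstrap sample of the rows and considers a random subset of features at every split. The values here are illustrative, not my tuned model:

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the ensemble
    max_features="sqrt",  # random subset of features tried at each split
    bootstrap=True,       # each tree sees a different bootstrap sample
    random_state=42,
)
forest.fit(X_train, y_train)  # X_train / y_train assumed from earlier
```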

After GridSearching my random forest and fitting it to my data, these were the features it found to be important:

[Chart: feature importances from the GridSearched Random Forest]

The above feature importance chart is incredibly informative.

The status of payment from September is still paramount.

  • However, instead of being above 0.8, its importance is now close to 0.175.

  • The amount one pays each month and the account balance have increased in importance; however, the status of payment in August and September is still the best predictor.

  • Most notable third place: july_status.

I could have stopped here. My models were sufficiently strong, and I had a decent amount of information regarding feature importance. However, I wanted to dig a bit more deeply into the data, and therefore I created a new dataframe that included the top ten predictors.

The top ten features were the status, monthly payment, and account balance from each of the past three months (September, August, and July), plus the credit limit. I decided on these features based on the results of the feature importance charts above and on conversations with people who work in banking, particularly in lending. They explained that recent payment history and information tell a far better story than older data. The two models from this analysis that I will focus on are the GridSearched Decision Tree and the GridSearched Random Forest.

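A sketch of how that reduced dataframe could be built with pandas. The column names here are hypothetical stand-ins following the july_status naming seen above; substitute whatever your dataframe's columns are actually called:

```python
top_ten = [
    "september_status", "august_status", "july_status",
    "september_payment", "august_payment", "july_payment",
    "september_balance", "august_balance", "july_balance",
    "credit_limit",
]
# df is the full dataset; keep the ten predictors plus the target column
df_top10 = df[top_ten + ["default_next_month"]]
```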

The Decision Tree didn't bring many new results; it again ranked September and August's payment statuses as paramount above the rest of the variables. The Random Forest did as well, albeit more cleanly. Here is the chart displaying the feature importances from the GridSearched Random Forest model fit with data from the top-ten dataframe:

[Chart: feature importances from the GridSearched Random Forest, top-ten dataframe]

That's the ballgame!

Well, there we have it! The past three months' payment statuses are the most important features in determining whether someone will default on their next credit payment. This dataset only covered a few months of payment history, and I would love to do more in-depth research into years of payment data to see if my questions are answered clearly. However, this much I understand to be valid from the data analysis: your payment activity from the more recent period of your life is much more predictive of your next financial moves than data from years in your past. I do not believe that credit companies need to keep seven years of your mistakes, holding them over your head for what could be around a tenth of your lifespan. Please let me know if there are reasons I am missing for such a long period of penance.

Source: https://medium.com/@calebelgut/do-credit-companies-really-need-seven-years-of-our-data-dd815c780b88
