Why More Data Is Not Always Better

Over the past few years, there has been a growing consensus that the more data one has, the better the eventual analysis will be.

However, just as humans can become overwhelmed by too much information, so can machine learning models.

Hotel Cancellations as an Example

I was thinking about this issue recently when reflecting on a side project I have been working on for the past year: predicting hotel cancellations with machine learning.

Having written numerous articles on the topic on Medium, it is clear to me that the landscape of the hospitality industry has changed fundamentally in the past year.

A growing emphasis on “staycations”, or local holidays, fundamentally changes the assumptions that any machine learning model should make when predicting hotel cancellations.

The original data from Antonio, Almeida and Nunes (2016) consists of datasets from Portuguese hotels, with a response variable indicating whether the customer cancelled their booking, along with other information on that customer such as country of origin and market segment.

In the two datasets in question, approximately 55-60% of all customers were international customers.

However, let’s consider this scenario for a moment. This time next year, hotel occupancy is back to normal levels, but the vast majority of customers are domestic, in this case from Portugal. For the purposes of this example, let’s assume the extreme scenario in which 100% of customers are domestic.

Such an assumption will radically affect the ability of any previously trained model to accurately forecast cancellations. Let’s take an example.

Classification Using an SVM Model

An SVM model was originally used to predict hotel cancellations: the model was trained on one dataset (H1), and its predictions were then compared against a test set (H2) using the feature data from that test set. The response variable is categorical (1 = booking cancelled by the customer, 0 = booking not cancelled by the customer).
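Below is a minimal sketch of that workflow. The file names, the IsCanceled response column and the three illustrative features are assumptions based on the published hotel booking datasets; the actual project uses a richer feature set.

# A minimal sketch of the train-on-H1, predict-on-H2 workflow.
# Assumed: H1.csv and H2.csv share an "IsCanceled" response column
# (1 = cancelled, 0 = not cancelled) and the same feature columns.
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

h1 = pd.read_csv("H1.csv")  # training set
h2 = pd.read_csv("H2.csv")  # test set

features = ["LeadTime", "ADR", "TotalOfSpecialRequests"]  # illustrative only
X_train, y_train = h1[features], h1["IsCanceled"]
X_test, y_test = h2[features], h2["IsCanceled"]

# SVMs are sensitive to feature scale, so standardise using the training set.
scaler = StandardScaler().fit(X_train)

# Balanced class weights compensate for the uneven cancellation ratio.
model = SVC(class_weight="balanced")
model.fit(scaler.transform(X_train), y_train)

y_pred = model.predict(scaler.transform(X_test))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))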

Here are the results, displayed as a confusion matrix and classification report, across three different scenarios.

Scenario 1: Trained on H1 (full dataset), tested on H2 (full dataset)

[[25217 21011]
 [ 8436 24666]]

              precision    recall  f1-score   support

           0       0.75      0.55      0.63     46228
           1       0.54      0.75      0.63     33102

    accuracy                           0.63     79330
   macro avg       0.64      0.65      0.63     79330
weighted avg       0.66      0.63      0.63     79330

Overall accuracy comes in at 63%, while recall for the positive class (cancellations) comes in at 75%. To clarify, recall in this instance means that, of all the cancellation incidences, the model correctly identifies 75% of them.
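This figure can be read straight off the confusion matrix above: the second row holds the actual cancellations, with 8436 of them misclassified as non-cancellations and 24666 correctly identified.

# Recall for class 1 = true positives / all actual cancellations.
recall_cancellations = 24666 / (8436 + 24666)
print(round(recall_cancellations, 2))  # 0.75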

Now let’s see what happens when we train the SVM model on the full training set, but only include domestic customers from Portugal in our test set.
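A sketch of that filter is below, reusing the objects from the earlier snippet and assuming a Country column holding ISO country codes (“PRT” for Portugal):

# Scenario 2: same model (trained on full H1), domestic-only test set.
h2_domestic = h2[h2["Country"] == "PRT"]
X_test_dom = h2_domestic[features]
y_test_dom = h2_domestic["IsCanceled"]

y_pred_dom = model.predict(scaler.transform(X_test_dom))
print(confusion_matrix(y_test_dom, y_pred_dom))
print(classification_report(y_test_dom, y_pred_dom))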

Scenario 2: Trained on H1 (full dataset), tested on H2 (domestic only)

[[10879     0]
 [20081     0]]

              precision    recall  f1-score   support

           0       0.35      1.00      0.52     10879
           1       0.00      0.00      0.00     20081

    accuracy                           0.35     30960
   macro avg       0.18      0.50      0.26     30960
weighted avg       0.12      0.35      0.18     30960

Accuracy has dropped dramatically to 35%, while recall for the cancellation class has dropped to 0% (meaning the model has not predicted any of the cancellation incidences in the test set). The performance in this instance is clearly very poor.

Scenario 3: Trained on H1 (domestic only), tested on H2 (domestic only)

However, what if the training set were modified to include only customers from Portugal, and the model trained once again?
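The retraining step mirrors the test-set filter above, again under the same assumed column names:

# Scenario 3: filter the training set to domestic customers and refit.
h1_domestic = h1[h1["Country"] == "PRT"]
X_train_dom = h1_domestic[features]
y_train_dom = h1_domestic["IsCanceled"]

scaler_dom = StandardScaler().fit(X_train_dom)
model_dom = SVC(class_weight="balanced")
model_dom.fit(scaler_dom.transform(X_train_dom), y_train_dom)

y_pred_dom2 = model_dom.predict(scaler_dom.transform(X_test_dom))
print(confusion_matrix(y_test_dom, y_pred_dom2))
print(classification_report(y_test_dom, y_pred_dom2))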

[[ 8274  2605]
 [ 6240 13841]]

              precision    recall  f1-score   support

           0       0.57      0.76      0.65     10879
           1       0.84      0.69      0.76     20081

    accuracy                           0.71     30960
   macro avg       0.71      0.72      0.70     30960
weighted avg       0.75      0.71      0.72     30960

Accuracy is back up to 71%, while recall for the cancellation class is at 69%. Using less, but more relevant, data in the training set has allowed the SVM model to predict cancellations across the test set much more accurately.

If The Data Is Wrong, Model Results Will Also Be Wrong

More data is not better if much of that data is irrelevant to what you are trying to predict. Even machine learning models can be misled if the training set is not representative of reality.

This was cited by a Columbia Business School study as an issue in the 2016 U.S. presidential election, where the polls had put Clinton in a firm lead over Trump. However, it turned out that there were many “secret Trump voters” who had not been accounted for in the polls, and this had skewed the results towards a predicted Clinton win.

I’m non-U.S. and neutral on the subject, by the way; I simply use this as an example to illustrate that even data we often think of as “big” can still contain inherent biases and may not be representative of what is actually going on.

Instead, the choice of data needs to be scrutinised as much as model selection, if not more so. Is the data we are including actually relevant to the problem that we are trying to solve?
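One lightweight way to ask that question of the data itself is to compare the distribution of a key segmenting feature between the training set and the population the model will actually serve. A sketch, using the same assumed Country column:

# A large mismatch between the training mix and the deployment mix
# is a warning sign that the training data is unrepresentative.
train_mix = h1["Country"].value_counts(normalize=True)
print(train_mix.head())
# With 55-60% international customers in the original data, 'PRT' accounts
# for well under half of the training set, while the domestic-only
# deployment scenario is 100% 'PRT'.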

Going back to the hotel example, inclusion of international customer data in the training set did not enhance our model when the goal was to predict cancellations across the domestic customer base.

Conclusion

There is an increasing push to gather more data across all domains. While more data in and of itself is not a bad thing, it should not be assumed that blindly introducing more data into a model will improve its accuracy.

Rather, data scientists still need the ability to determine the relevance of such data to the problem at hand. From this point of view, model selection becomes somewhat of an afterthought: if the data is representative of the problem you are trying to solve in the first instance, then even simpler machine learning models will generate strong predictive results.

Many thanks for reading, and feel free to leave any questions or feedback in the comments below.

If you are interested in taking a deeper look at the hotel cancellation example, you can find my GitHub repository here.

Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice in any way.

Translated from: https://towardsdatascience.com/why-more-data-is-not-always-better-de96723d1499
