How to Incorporate Bias in Your Predictive Models

You really do not want a biased model making predictions for you, do you? Despite the abundance of top quality machine learning (ML) practitioners and technological advancements, there is no dearth of real-life ML failures.

Before going over the main sources of bias that can impact your predictive models, let us first revisit a few high-profile ML and AI failures where an underlying biased model was one reason, if not the predominant one, for the failure.

UK GCSE and A-Levels Grades Fiasco

UK students did not sit for their GCSE and A-Levels exams this year due to the COVID-19 lockdowns. Instead, the UK exam regulator, Ofqual, used an algorithm to determine each student’s expected grades based on:

  • the estimated grade determined by the student’s teacher
  • the student’s relative ranking against other students in the same school with similar estimated grades
  • the school’s performance in each subject over the previous three years

However, what transpired on results day, August 13, was a complete meltdown: nearly 40% of A-Levels grades came in lower than the teachers’ assessments. This caused a considerable uproar, resulting in a government U-turn on August 17, whereby it was announced that student grades would be revised to reflect the higher of the original teacher’s assessment and the model output.

Failure of IBM’s ‘Watson for Oncology’

In October 2013, IBM partnered with The University of Texas MD Anderson Cancer Center (MD Anderson) to develop an AI-powered solution to cure cancer based on its Watson supercomputer. However, it failed to live up to expectations.

Forbes reported in February 2017 that MD Anderson had benched its Watson for Oncology project and was actively looking for other partners to replace IBM in its future research. Later, in July 2018, a STAT report revealed, based on an analysis of internal IBM documents, multiple instances of downright erroneous cancer treatment advice from Watson, with several customers reporting “multiple examples of unsafe and incorrect treatment recommendations”.

Amazon’s Misogynist AI for Recruitment

Amazon’s engineers started working on an automated recruitment tool in 2014 to review job aspirants’ resumes. However, Reuters reported in 2018 that by 2015 “the company realized its new system was not rating candidates for software developer jobs and other technical posts in a gender-neutral way”.

Apparently, the model was trained on ten years’ worth of job applications that came predominantly from men, indicative of male dominance in the tech industry at the time. The project was subsequently shelved, and the team disbanded in early 2017.

Sources of Bias in ML Models

Some ML models can be biased in subtle ways; others, as seen above, not so subtly. Some of the potential causes of a biased model include:

Biased Data / Sampling Bias

Biased training data will eventuate in a biased model: the GIGO (garbage in, garbage out) adage in action. This was evident in the Amazon example above, where a shortage of female applicants in the training data led the model to favor male applicants. The model simply did not have sufficiently gender-impartial data to train on and learn from. Data often reflects the status quo and the current biases in society, for better or for worse.

Appropriate data collection and sampling strategies are of paramount importance to ensure the availability of relevant, representative, suitable, and sufficiently diverse data. There is a reason why 80% of a data scientist’s work revolves around data collection and feature engineering.

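As an illustration only, a quick audit of group representation is a cheap first line of defence against sampling bias. The sketch below assumes a hypothetical applicant dataset with `gender` and `hired` columns (neither comes from the article) and uses a stratified split so both partitions keep the same group proportions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical applicant data; in practice this would come from your own records.
df = pd.DataFrame({
    "gender": ["M"] * 80 + ["F"] * 20,        # an 80/20 imbalance, for illustration
    "hired":  [1, 0] * 40 + [1] + [0] * 19,   # made-up outcomes
})

# 1. Audit representation: how balanced is the data across groups?
print(df["gender"].value_counts(normalize=True))

# 2. Compare outcome rates per group to spot historical bias baked into the labels.
print(df.groupby("gender")["hired"].mean())

# 3. Stratify the split so train and test keep the same group proportions.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["gender"], random_state=42
)
print(train_df["gender"].value_counts(normalize=True))
```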

Algorithm Bias

ML algorithms also have their own inherent biases, stemming from their various mathematical and statistical assumptions rather than from the underlying training data. High-bias algorithms are more rigid, assume and require a tightly defined data distribution, and are more resistant to noise in the data; however, they can struggle to learn complex patterns. In contrast, low-bias algorithms can handle more complex data but generally do not generalize as well to unseen data in production.

This conflicting situation is usually referred to as the bias/variance trade-off. It requires a delicate balancing act by the ML practitioner to achieve the optimum bias (and, by definition, variance).

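A minimal sketch of the trade-off on synthetic data (the dataset and model choices here are illustrative, not from the article): a linear model stands in for a high-bias learner and underfits, while an unconstrained decision tree stands in for a low-bias, high-variance learner, fitting the training set almost perfectly but degrading on held-out data.

```python
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic non-linear regression problem (illustrative only).
X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [
    ("high-bias (linear)", LinearRegression()),
    ("low-bias (deep tree)", DecisionTreeRegressor(max_depth=None, random_state=0)),
]:
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    # A large train/test gap signals high variance; high error on both signals high bias.
    print(f"{name}: train MSE={train_mse:.2f}, test MSE={test_mse:.2f}")
```

Regularization, depth limits, or ensembling are the usual levers for moving along this trade-off.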

Exclusion Bias

This occurs when we exclude specific features or variables from the dataset on the incorrect presumption that they are not useful for the prediction problem at hand, without first validating that assumption.

Always test the correlation between predictors and the target variable before discarding any features. Some feature selection strategies for various variable types were covered in one of my previous articles.

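A sketch of that pre-check, using a synthetic stand-in for a real dataset (all column names are made up): score each numeric feature against the target before dropping anything, with both a linear and a non-linear measure.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic stand-in for your real dataset (column names are illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.uniform(18, 70, 500),
    "income": rng.normal(50_000, 15_000, 500),
    "noise_feature": rng.normal(0, 1, 500),
})
# Target depends non-linearly on age and linearly on income.
df["target"] = np.sin(df["age"] / 10) + df["income"] / 50_000 + rng.normal(0, 0.1, 500)

X, y = df.drop(columns=["target"]), df["target"]

# Linear association: absolute Pearson correlation with the target.
print(X.corrwith(y).abs().sort_values(ascending=False))

# Non-linear association: mutual information catches relationships correlation misses.
mi = pd.Series(mutual_info_regression(X, y, random_state=0), index=X.columns)
print(mi.sort_values(ascending=False))
```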

Measurement Bias

Erroneous measurement, e.g., by a faulty device, can result in a systematic distortion of the data, i.e., measurement bias. For example, if a weighing device consistently under-reports weights by the same amount, any model that utilizes the data from this faulty device will not be accurate. Improperly designed surveys with ‘leading’ questions are another common source of measurement bias.

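A toy simulation of the weighing-device example, with all numbers invented: a model is fitted on weights that a faulty scale under-reports by a constant amount, and once inputs are measured correctly its predictions are systematically shifted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# True relationship: cost = 2 * weight + noise (purely illustrative).
true_weight = rng.uniform(1, 20, size=500)
cost = 2 * true_weight + rng.normal(0, 1, size=500)

# The faulty scale under-reports every weight by a constant 3 kg.
measured_weight = true_weight - 3.0

# Model trained on the biased measurements.
model = LinearRegression().fit(measured_weight.reshape(-1, 1), cost)

# At prediction time the scale is fixed, so inputs are the true weights:
# every prediction is now systematically too high by roughly coefficient * 3.
print("mean prediction bias:",
      (model.predict(true_weight.reshape(-1, 1)) - cost).mean())
```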

Observer Bias

Observer bias happens when data collection or feature engineering strategies are influenced by the data analyst’s preconceived, often false, notions. For example, if I were biased against Sydneysiders (which I am not), I might allow my bias to creep into my data in subconscious ways. A classic example of observer bias was The Burt Affair.

It is often challenging to detect observer bias, but it can be prevented through a combination of training and screening strategies together with clearly defined and implemented policies and procedures.

Feedback Loops

Consider an ML model that somehow influences the generation of the very data that is then used for its predictions. As a result, the model will make predictions biased toward the data it had a say in creating. However, such feedback loops can also be a deliberate design feature rather than a cause for concern, as is usually the case in content personalization and recommender systems.

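A deliberately simplified simulation (all parameters invented) of how such a loop narrows the data a model sees: the system only logs outcomes for items it chose to show, so retraining on those logs keeps reinforcing the early favourites while the rest never get enough exposure to be evaluated fairly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 10
true_appeal = rng.uniform(0.1, 0.5, n_items)   # unknown real click probabilities
clicks = np.ones(n_items)                      # click counts (optimistic start)
shows = np.ones(n_items)                       # impression counts

for step in range(5000):
    scores = clicks / shows                    # model: estimated click-through rate
    item = int(np.argmax(scores))              # always show the current best guess
    shows[item] += 1
    clicks[item] += rng.random() < true_appeal[item]

# Impressions concentrate on a few items; the rest never get enough
# exposure for the system to learn their true appeal.
print("impressions per item:", shows.astype(int))
print("best true item:", int(np.argmax(true_appeal)))
```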

System Drift

Drift in this context refers to changes over time in the system or application that generates the data used for modeling: for example, a change to the business definition of delayed payments (in a default prediction problem), or the addition of new modes of user interaction.

Such bias is the easiest to prevent and detect through appropriate change management and error tracking practices, together with regular model updates and retraining.

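One minimal check, assuming you retain a reference sample of each feature from training time (the arrays below are placeholders), is a two-sample Kolmogorov-Smirnov test comparing that sample against a recent production window.

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder arrays; in practice these would be a feature column sampled
# at training time and the same column from recent production traffic.
train_feature = np.random.default_rng(0).normal(0, 1, 5000)
prod_feature = np.random.default_rng(1).normal(0.3, 1, 5000)  # drifted mean

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"Possible drift detected (KS={stat:.3f}, p={p_value:.4f}); "
          "review the upstream system and consider retraining.")
else:
    print("No significant distribution shift detected.")
```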

Conclusion

The above are some of the potential sources of bias that can impact the performance of your predictive model. In my experience, the majority of them can be adequately handled by evaluating, and then re-evaluating, your data collection and sampling strategies, together with thorough testing and validation routines.

Feel free to reach out to me if you would like to discuss anything related to data analytics, machine learning, financial or credit analysis.

Till next time, rock on!

Translated from: https://towardsdatascience.com/how-to-incorporate-bias-in-your-predictive-models-d9fef364ece2
