The Exam Results Crisis: Should We Be Blaming Algorithms?

I’ve been considering writing on the topic of algorithms for a little while, but with the Exam Results Fiasco dominating the headlines in the UK over the past week, I felt that now was the time to look more closely at the subject.

With school-leavers’ exams canceled this summer due to Covid-19, students’ final exam grades were adjusted from teacher assessments by means of an ‘algorithm’ produced by Ofqual. This caused controversy due to the impact it had on the results of those from the most disadvantaged backgrounds. Some schools and colleges had more than half of their grades marked down, and consequently many felt the outcome was unfair. So controversial were these results that, in the face of enormous political pressure, the government eventually scrapped the grades generated by the algorithm in favor of the original grades recommended by teachers.

As we saw this week, the presumed complexity of the algorithm generating A-Level grade results allowed the blame to be placed on the algorithm itself, rather than on the many factors that went into its implementation. This essentially created a digital scapegoat for a flawed system.

Algorithms will play an increasingly important role in our lives, shaping prospects, status, relationships and many other factors that determine both our societal and self-worth. According to the CEO of Ofqual:

‘Algorithms can be supportive of good decision-making, reduce human error and combat existing systemic biases.

‘But issues can arise if, instead, algorithms begin to reinforce problematic biases, for example because of errors in design or because of biases in the underlying data sets.

‘When these algorithms are then used to support important decisions about people’s lives, for example determining whether they are invited to a job interview, they have the potential to cause serious harm.’

I’m going to examine what exactly we mean by an algorithm and look at what went wrong in this particular case, with the hope that lessons can be learned for the future.

So what is an ‘algorithm’?

Wikipedia has a predictably dry definition of an algorithm, which doesn’t contribute much to our understanding and is too broad to match how the word is actually used in everyday situations. You will find this kind of definition all over the web.

An algorithm, in more everyday terms, is a definition of a sequence of operations and rules applied to an input to provide an output. So, under this definition,

input -> “Add 4” -> output

is an algorithm, as is:

input -> if (input > 10, output 10) else (output 0) -> output

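To make this concrete, here is a minimal Python sketch of those two ‘algorithms’ (the function names are my own, purely for illustration):

def add_four(x):
    # input -> "Add 4" -> output
    return x + 4

def threshold_gate(x):
    # input -> if (input > 10, output 10) else (output 0) -> output
    return 10 if x > 10 else 0

print(add_four(3))         # 7
print(threshold_gate(12))  # 10
print(threshold_gate(7))   # 0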

However, both of these examples contradict the way in which the word ‘algorithm’ is commonly used. The above would probably be referred to as ‘code’, ‘processes’, ‘logic’, ‘computation’ or similar. ‘Algorithm’ as a term seems to be reserved for those sequences which are complicated to describe or explain to humans.

Conversely, we still use the word ‘algorithm’ to describe some simple tasks, for example sorting. This is, admittedly, a simple algorithm that can be explained in a sentence (see Bubble sort for one example), but no one is arguing that it isn’t an algorithm. Perhaps the complexity, in this case, comes from the fact that the stop condition is not pre-determined, i.e. we need to know the input before we know exactly how many steps will be taken.

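As a sketch, here is the textbook bubble sort in Python; note that how many passes it makes is not known until we see the input:

def bubble_sort(items):
    # Repeatedly swap adjacent out-of-order pairs until a full pass
    # makes no swaps; the stop condition depends on the input.
    items = list(items)  # work on a copy
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(items) - 1):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
    return items

print(bubble_sort([5, 1, 4, 2, 8]))  # [1, 2, 4, 5, 8]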

As a further counter-example:

input -> (output = all positive integers below input) -> output

Is this an algorithm? It appears simple, yet we don’t know in advance how many steps we will take. If the input is 0, no steps will be taken. Is this complex enough to be branded an algorithm?

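In Python, this could be as little as the following (the function name is mine):

def positives_below(n):
    # output = all positive integers below input; zero steps when n <= 1
    return list(range(1, n))

print(positives_below(5))  # [1, 2, 3, 4]
print(positives_below(0))  # []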

I think it is safe to say that a true definition of ‘algorithm’ is difficult to pin down; however, the following comes close:

A definition of a finite set of sequences or rules which must be implemented through computing to generate the desired output.

Think of it as a blueprint or a set of instructions to a computer to perform a task.

How can that be so dangerous?

As teachers, students and parents across the UK sought answers to why so many A-Level results were downgraded, the stock answer was that an algorithm had decided the outcome. But the blame cannot sit with a computational process.

Let’s take a simple example. Assume that, to the public or even to students, the algorithm looks like this:

[Figure: Black Box Algorithm]

Inside the box, this is what is happening:

Mock Grade -> “50% of the time: same, 40%: subtract one grade, 10%: add one grade” -> A-Level Result

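Here is a minimal sketch of what could be going on inside such a black box, in Python (the grade scale and the function are my own illustration, not Ofqual’s actual model):

import random

GRADES = ["U", "E", "D", "C", "B", "A", "A*"]  # illustrative A-Level scale

def black_box(mock_grade):
    # 50% of the time keep the mock grade, 40% subtract one, 10% add one.
    i = GRADES.index(mock_grade)
    r = random.random()
    if r < 0.5:
        return GRADES[i]                            # unchanged
    elif r < 0.9:
        return GRADES[max(i - 1, 0)]                # marked down
    else:
        return GRADES[min(i + 1, len(GRADES) - 1)]  # marked up

random.seed(0)
print([black_box("C") for _ in range(10)])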

The result? 40% of students have now ended up with a worse grade than their mock exam results, while 10% did better. There would be a similar outcry, and questions would be asked about what made the algorithm choose to lower certain students’ grades, and why it decided that all three students in the small village of Ombersley should get higher grades.

Although my algorithm is very simple, it would allow me to hide behind its assumed complexity, leading people to blame the unknown black box instead of the real culprit: me.

Ofqual: Hiding behind an algorithm

Ofqual, in their defense, have in the interests of transparency released an approximation of their methodology. It’s a 319-page document describing and evaluating the factors contributing to the final grades, but it never outlines the decision-making process in detail. It also contains a particularly interesting sentence. After analyzing distributions across many different factors and cross-referencing against variables such as year and subject, they state:

The analyses conducted show no evidence that this year’s process of awarding grades has introduced bias. Changes in outcomes for students with different protected characteristics and from different socio-economic backgrounds are similar to those seen between 2018 and 2019.

In effect, this says that the distributions across each of the factors analyzed are not significantly different from those seen in previous years. The above quote is also the entirety of the conclusion section of a 319-page document.

Let’s explain quickly what’s wrong with this, using a simplified example.

The analysis compared the results of the algorithm-based marks to previous years, against a large set of factors including gender, past performance, socio-economic grouping and many more. This is impressive to see and shows that some real effort was put into considering potential pitfalls of the resulting grading. However, very few properties of each distribution are actually compared; the analysis focuses on the mean, the variation (standard deviation) and the percentage of students achieving grades above C or A.

[Figure: two distributions of results, each with exactly the same mean and standard deviation]

Above, there are two curves. Let’s say one shows 2019 results and the other 2020. Both have exactly the same mean and standard deviation, and so could be described in these results as “showing no evidence of bias”. However, these plots would produce an additional 3% of U grades (the lowest attainable) and 2% more A* grades (the highest attainable) in 2020. Let’s now assume that I’m doing this analysis to show that boys’ and girls’ results are unbiased:

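To see how this can happen, here is a hedged numerical sketch in Python: the ‘2020’ marks are a mixture distribution tuned so that its mean and standard deviation match the ‘2019’ normal distribution almost exactly, while noticeably more students land in both extreme tails. The grade boundaries and resulting percentages are my own illustration, not the report’s figures.

import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

# "2019": marks drawn from a single normal distribution.
marks_2019 = rng.normal(50, 15, n)

# "2020": a 90/10 mixture of a narrow and a wide normal, tuned so that
# 0.9 * 8.5**2 + 0.1 * 40**2 is roughly 225 = 15**2, matching 2019.
narrow = rng.normal(50, 8.5, n)
wide = rng.normal(50, 40, n)
marks_2020 = np.where(rng.random(n) < 0.9, narrow, wide)

for year, m in [("2019", marks_2019), ("2020", marks_2020)]:
    print(f"{year}: mean={m.mean():.2f} sd={m.std():.2f}  "
          f"U (<10): {100 * (m < 10).mean():.2f}%  "
          f"A* (>90): {100 * (m > 90).mean():.2f}%")

Despite near-identical summary statistics, the mixture puts roughly four times as many students into the lowest and highest grade bands.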

[Figure: again, the above curves all have the same mean and variance]

Just looking at U grades:

Boys: 14% to 21%, an increase of 7 percentage points.

Girls: 23% to 15%, a decrease of 8 percentage points.

Clearly, these results show a bias towards girls receiving fewer U grades.

This is a very simple explanation, playing with just one of a number of factors in one particular type of distribution. My point is that three sentences saying “showing no evidence of bias” in a 300-plus-page report, with no mention of potential pitfalls, may mean that insufficient work was done, or at least presented, to provide that evidence.

If we can’t blame algorithms, then who?

I believe that the bottom line comes down to human processes. After all, the computational process is a product of human design. Firstly, the requirements are gathered by analysts and turned into code by data scientists and developers. If I had developed the ‘algorithm’ I discussed earlier and run any kind of testing, I would have been able to see how the resulting distribution compared to previous years, as well as to other variables. I would have known, at the time of developing it, that the percentages chosen were essentially arbitrary, and I should immediately have questioned their correctness and impact.

This is always good practice for any data scientist/developer to follow when writing code. If you set thresholds or constants in your code, you should always question whether they are correct. Often, when sandboxing, I will write #TODO comments forcing myself to come back and check or justify each of the set values, providing explainability for myself and future users of the code.

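For example, here is a sketch of the kind of comments I mean; the constants and the 50/40/10 split are hypothetical placeholders:

# TODO: justify these values; they are currently arbitrary sandbox choices.
# What evidence supports a 50/40/10 split, and what is its impact on students?
P_SAME = 0.50  # TODO: why should half of all grades stay unchanged?
P_DOWN = 0.40  # TODO: 40% of students marked down; is that acceptable?
P_UP = 0.10    # TODO: why should only 10% be marked up?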

Secondly, live systems need to be tested fully before release. In this case, the exam results would have been ready at least several days before release, and therefore could have been checked for unforeseen bias. Given the importance of this system’s outputs and the impact felt by those affected, I think that the data scientists should have anticipated this feedback and could have pulled the plug and improved the algorithm before releasing results.

Some may argue that only a small number of cases were affected. But a small number of high-impact cases should matter more for understanding the inadequacies and biases of any complex system, and so should have been noticed before release. Outliers are where you should be looking, especially in huge datasets, if you want to understand what is happening in a complicated system. It wouldn’t be difficult to look through what happened in the 100 biggest reductions and improvements and run back through the algorithm to understand why someone would drop from a predicted grade of C (a pass) to a resulting U (the lowest attainable).

Finally, a portion of the blame lies with the non-technical staff involved in the process. It may be that the scientists and developers raised all of the above issues loudly but were not involved in the final decision on whether to release or delay, or on what the acceptable level of ‘correctness’ was. Usually, this group of people does not understand the algorithm well enough to make an informed decision.

What can we learn from this?

Throughout this piece I have mentioned some things we can do to mitigate unintended outcomes.

As a Developer/Data Scientist (someone writing the algorithm):

  1. Think about the assumptions you are making. Try to justify these choices.
  2. Pay close(r) attention to the extremes of your data, and think about the impact of those extremes.
  3. If you see potential problems, speak up, loudly, as soon as possible. Be part of a culture where this is encouraged.
  4. Take responsibility. Don’t hide behind the complexity of your solution.

As a Manager/Director/Salesperson (i.e., someone not writing the algorithm):

  1. Listen to your staff. Develop a culture where they can speak out about problems without fear.
  2. Ask about the impact of any solution. What’s the worst case?
  3. Try to understand any solution, even at a non-technical level. If no one is capable of explaining it sufficiently, that should raise red flags.
  4. Take responsibility. Don’t hide publicly behind the complexity of your solution.

As a consumer (someone affected by an algorithm):

  1. Don’t blame algorithms. In most cases, issues caused by complex systems can and should be prevented by human intervention at various stages in the process. The algorithm does exactly what it was told to do by people.
  2. Remember that people, and anything they create, are not necessarily perfect!

Source: https://medium.com/swlh/the-exam-results-crisis-should-we-be-blaming-algorithms-ffe489461f47
