[Paper Summary: My Machine sucks, or eval sucks? ... or both?] A Survey of 25 Years of Evaluation

A Survey of 25 Years of Evaluation

— Kenneth Ward Church & Joel Hestness, 2019, Natural Language Engineering

Sometimes the numbers are too good to be true, and sometimes the truth is better than the numbers. Sometimes the problem is not with the numbers but with the interpretation.

Wayne’s emphasis on evaluation helped pull the field out of an AI Winter. But does that emphasis now obscure the true capabilities we want to measure?

A single figure of merit such as F-measure makes it easy to create a leaderboard. Leaderboards ought to make it clear which differences are significant and which are not, yet they rarely report error bars.
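To make the "error bars" point concrete, here is a minimal sketch of one common way to attach an interval to an F-measure: a percentile bootstrap over test items. This is not a procedure taken from the paper. The function names (`f1_score`, `bootstrap_f1_interval`), the binary-label setup, and the default of 1000 resamples are all assumptions made for illustration.

```python
import random

def f1_score(gold, pred):
    """Binary F1 for two parallel lists of 0/1 labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly found
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def bootstrap_f1_interval(gold, pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1 (hypothetical helper).

    Test items are resampled with replacement; the interval is read off
    the sorted resampled scores.
    """
    rng = random.Random(seed)
    n = len(gold)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([gold[i] for i in idx], [pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy data, invented for this example only.
gold = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
pred = [1, 0, 0, 0, 1, 1, 1, 0, 1, 0]
print(f1_score(gold, pred), bootstrap_f1_interval(gold, pred))
```

If the intervals of two leaderboard entries overlap heavily, the ranking between them probably should not be read as meaningful. A paired test on the same resamples would be the stricter check.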

Estimates of inter-annotator agreement rates show that judges agree with one another about as often as systems agree with judges. Does this mean that systems are performing as well as people? Can these systems pass the Turing Test? → Error analysis is the key to distinguishing soft mistakes from hard mistakes. But the tacit agreement on a single figure of merit makes this an often forgotten step.
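As a small illustration of what an agreement rate looks like in practice, the sketch below computes raw agreement and Cohen's kappa for two label sequences. Kappa is my choice of statistic here, not something the summary above specifies, and the function names and toy labels are made up for the example. The same functions can compare judge vs. judge or system vs. judge.

```python
from collections import Counter

def observed_agreement(a, b):
    """Fraction of items on which the two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(a)
    p_o = observed_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's own label distribution.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy POS labels, invented for this example only.
judge_1 = ["N", "N", "V", "V"]
judge_2 = ["N", "V", "V", "V"]
print(observed_agreement(judge_1, judge_2))  # 0.75
print(cohens_kappa(judge_1, judge_2))        # 0.5
```

Neither number says which disagreements are soft and which are hard. That still takes a person reading the individual errors.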

Metrics of metrics

There are a number of ways to address the concern about mindless metrics: reproducibility, reliability, validity and insight.

Godfrey’s Gap

Too much of the work in our field is tied to too many specifics, with too many ad hoc choices on the eval side.

Jack Godfrey observed a large gap between performance of systems on standard academic bake-offs and performance on real tasks of interest to our sponsors (typically in government and industry). Funding agencies have attempted to address this gap by encouraging work on domain adaptation, surprise languages, low resources (and even zero resources). Too many of our evaluations are too specific to a specific task and a specific corpus. We can warn the sponsors that their mileage may vary, but that’s a pretty lame excuse.

A big problem is a common but unrealistic assumption that the test set and the training set are drawn from the same population. Language is a moving target. Topics change quickly over time. Tomorrow’s news will not be the same as yesterday’s news. Tomorrow’s kids will invent new ways to use social media that their parents have never anticipated. We ought to do a better job of addressing Godfrey’s Gap. Our evaluations ought to be more insightful than they are in helping sponsors understand how well proposed methods will generalize to real applications that matter.
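One cheap way to see the "moving target" effect is to split a dated corpus chronologically instead of at random, then measure how much of the later text uses words never seen in the earlier text. The sketch below is an illustration under my own assumptions, not a method from the paper. `chronological_split` and `oov_rate` are hypothetical helpers, dates are assumed to be ISO strings, and tokenization is naive whitespace splitting.

```python
def chronological_split(records, cutoff):
    """Split (date, text) records into texts before and on/after the cutoff.

    Dates are assumed to be ISO strings (YYYY-MM-DD), so plain string
    comparison orders them correctly.
    """
    train = [text for date, text in records if date < cutoff]
    test = [text for date, text in records if date >= cutoff]
    return train, test

def oov_rate(train_texts, test_texts):
    """Fraction of test tokens that never occur in the training texts."""
    vocab = {tok for text in train_texts for tok in text.lower().split()}
    test_tokens = [tok for text in test_texts for tok in text.lower().split()]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)

# Toy records, invented for this example only.
records = [
    ("2018-01-01", "the cat sat"),
    ("2018-06-01", "the dog sat"),
    ("2019-02-01", "the meme went viral"),
]
train, test = chronological_split(records, "2019-01-01")
print(oov_rate(train, test))  # 0.75
```

A random split of the same corpus would usually report a lower out-of-vocabulary rate, which is exactly the optimism the same-population assumption builds in.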

(Maybe) the beginning of anti-specificity? Do economies of scale favor more general-purpose solutions? Just as the hardware community came to appreciate the value of general-purpose solutions, there may be advantages to designing general-purpose networks that do not need as much training for each specific task. There might be an intriguing analog in the recent growth and development of multi-task learning and fine-tuning in language domains. ELMo, BERT and GPT have ramped up the use of pre-trained contextualized embeddings and attentional transformers. One’s head might start spinning wondering whether what we care about most is the general-purposeness of these language-understanding techniques.

Numbers, Funding and Deliverables: Declaring success is not a formula for success

Everyone knew that everyone knew, but no one was saying so [shh]

It was scary to get up at an ACL meeting in 1988 and talk about POS Tagging. The other papers at the conference were addressing more difficult tasks. How can we talk about POS Tagging when parsing had been declared solved?

The field had painted itself into a corner. Over the years, the field had gone to the funding agencies and proposed to do more than what was promised in the last proposal. At the end of each round of funding, all the problems in the last proposal would be declared solved, and the field would attack something even more ambitious. This pyramid system worked for a while (during boom times), but eventually led to a bust.