[Paper Summary] oLMpics - On what LM Pre-training Captures [Talmor 2019]

芝麻挞

Article link: https://blog.csdn.net/weixin_43928665/article/details/118538770

Keypoints

We propose a diverse set of probing tasks for types of symbolic reasoning that are potentially difficult to capture using an LM objective. To be honest, some of the tasks are built from only a handful of templates, so they don't feel like they cover a very comprehensive range of language phenomena.
We propose an evaluation protocol for disentangling knowledge retained from pre-training from knowledge gained in fine-tuning (i.e. the No Language control and the Perturbed Language control).
We provide an analysis of the skills that current LMs possess. Their success is context-dependent, closely tied to specific values, and isn't achieved via abstraction and composition as humans perceive it. Fair enough, since the so-called reasoning ability is learned only from the contexts the model has seen. For example, the models fail completely when the numbers in the Age Comparison task do not fall into the normal range.

Probing setup

Zero-shot: cast tasks in the masked LM format
No pre-training control
No language control: Only minimal language tokens are given. We remove all input except for [MASK] and the arguments of the task. If the model succeeds, then the performance can be mostly attributed to fine-tuning rather than to the pre-trained language representations. In such cases the model demonstrates low language sensitivity.
Perturbed language control: Replace words that are central to the reasoning task with nonsense words. The authors use a list of 10 words that carry relatively limited meaning, e.g. ‘blah’.
MC-MLM: for tasks where the answer set is small.
MC-QA: for tasks where the answer set substantially varies between questions.
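
As a rough illustration of the two language controls, the sketch below builds the no-language and perturbed-language versions of one masked statement. The statement is the Multi-Hop Composition example quoted in the task list of this post. The helper names, the choice of which words count as "central", and the use of 'blah' for every replacement are assumptions made for illustration, not the paper's actual code or word list.

```python
import re

MASK = "[MASK]"

def no_language_control(statement, arguments):
    """Keep only the task arguments and [MASK], in their original order."""
    keep = set(arguments) | {MASK}
    # Tokenize into [MASK] or plain word/number tokens; punctuation is dropped.
    tokens = re.findall(r"\[MASK\]|\w+", statement)
    return " ".join(t for t in tokens if t in keep)

def perturbed_language_control(statement, central_words, nonsense="blah"):
    """Replace the words assumed to be central to the reasoning with a nonsense word."""
    pattern = r"\b(" + "|".join(re.escape(w) for w in central_words) + r")\b"
    return re.sub(pattern, nonsense, statement)

statement = "When comparing a 23, a 38, and a 31 year old, the [MASK] is oldest"

# Only the three ages and the mask survive.
print(no_language_control(statement, ["23", "38", "31"]))

# Hypothetical choice of central words for this template.
print(perturbed_language_control(statement, ["comparing", "oldest"]))
```

If a fine-tuned model still does well on the first output, the language of the template was not needed, which is the low language sensitivity case described above.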

Metrics

Learning curves are informative but they are not a single number that can be easily compared. Thus, we summarize learning curves using two aggregate statistics.

Max: Maximum accuracy on the learning curve
WS: A weighted average of the accuracies on the learning curve. Higher weights are given to points where N (the number of fine-tuning examples) is small, highlighting our focus on performance given little fine-tuning data. WS is related to the area under the curve, and to the online code from the minimum description length (MDL) literature.
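
A minimal sketch of the two statistics, assuming the learning curve is given as (N, accuracy) pairs. This post does not state the exact weights used for WS, so the inverse-log weighting below is a placeholder assumption chosen only because it favors small N. The function names and the sample curve are hypothetical as well.

```python
import math

def max_accuracy(curve):
    """Max: the highest accuracy reached anywhere on the learning curve."""
    return max(acc for _, acc in curve)

def weighted_score(curve, weight_fn=lambda n: 1.0 / math.log2(n + 2)):
    """WS: weighted average of accuracies.

    weight_fn is an ASSUMED weighting that decreases with N;
    swap in the paper's weights if you have them.
    """
    weights = [weight_fn(n) for n, _ in curve]
    total = sum(w * acc for w, (_, acc) in zip(weights, curve))
    return total / sum(weights)

# Made-up learning curve: (number of fine-tuning examples, accuracy).
curve = [(64, 0.50), (256, 0.70), (1024, 0.90), (4096, 0.95)]

print(max_accuracy(curve))
print(round(weighted_score(curve), 3))
```

Because accuracy here rises with N while the weights fall, WS comes out below the plain average, so a model that only catches up after a lot of fine-tuning data is penalized relative to one that starts strong.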

Task types

Age Comparison (comparison by age / birth year)
Object Size Comparison. Ha, so this one turns out to be from the same group of people!
Always-Never
The task was mostly tackled at fine-tuning time.
An anecdote: Reporting bias may play a role in the inability to correctly determine that ‘A rhinoceros NEVER has fur’. Interestingly, behavioral research conducted on blind humans shows they exhibit a similar bias.
Negation
e.g. “He was [MASK] fast, he was very slow.”; “He was [MASK] fast, he was rapid”. This tests whether the model distinguishes a negation vs. intensification adverb based on synonymy/antonymy relations.
Property Conjunction (MC-QA setup)
e.g. “What is located in a street and is related to octagon?”
Taxonomy Conjunction (find the mutual hypernym of a pair of concepts)
e.g. “A ferry and a floatplane are both a type of [MASK].”
LMs prefer hypernyms that are closer in terms of edge distance. When distractors are closer to one of the entities in the statement, the model will consistently (80%) choose the distractor, ignoring the second entity in the phrase.
Encyclopedia Composition
What's ridiculous is that it has only three templates: 1) “when did the band where ENT played first form?”, 2) “who is the spouse of the actor that played in ENT?”, 3) “where is the headquarters of the company that ENT established located?”
Multi-Hop Composition
The task was mostly tackled at fine-tuning time.
Predicting the subject of sentences whose predicate is in a superlative form, where the relevant info is contained in a “when” clause e.g. “When comparing a 23, a 38, and a 31 year old, the [MASK] is oldest” A. second B. first C. third
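
For a small answer set like the three options in the example above, the MC-MLM setup amounts to comparing the model's scores at the [MASK] position over the candidate words only. A minimal sketch, assuming the scores have already been obtained from some masked LM. The `mask_scores` values are made up for illustration and the function name is hypothetical.

```python
def pick_answer(mask_scores, candidates):
    """Return the candidate with the highest score at the [MASK] position.

    Words outside the candidate set are ignored, even if they score higher.
    """
    missing = [c for c in candidates if c not in mask_scores]
    if missing:
        raise KeyError(f"no score for candidates: {missing}")
    return max(candidates, key=lambda c: mask_scores[c])

# Made-up scores for "... the [MASK] is oldest" with ages 23, 38, 31.
mask_scores = {"first": 1.2, "second": 2.7, "third": 0.4, "the": 5.0}
candidates = ["second", "first", "third"]

print(pick_answer(mask_scores, candidates))  # second
```

Restricting the comparison to the candidates is what lets a masked LM be evaluated zero-shot as a multiple-choice model: a high-scoring but irrelevant word such as "the" never competes with the answer options.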
We picked the statement for each task through manual experimentation. We tried multiple phrasings and chose the one that achieves the highest average zero-shot accuracy across all tested LMs. So whether knowledge can be probed at all also depends on prompt selection [doge]. Success is evidence that the model has the necessary skill, but failure could be attributed to the model's unfamiliarity with the templates.
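
The selection rule described above can be sketched as picking the phrasing with the highest mean zero-shot accuracy over models. The template keys, model names, and accuracy numbers below are all placeholders invented for illustration, not results from the paper.

```python
def select_template(zero_shot_acc):
    """Pick the phrasing with the highest average zero-shot accuracy.

    zero_shot_acc maps template -> {model_name: accuracy}.
    """
    def average(template):
        scores = zero_shot_acc[template]
        return sum(scores.values()) / len(scores)
    return max(zero_shot_acc, key=average)

# Hypothetical accuracies for two candidate phrasings on two models.
zero_shot_acc = {
    "template_1": {"model_a": 0.40, "model_b": 0.50},
    "template_2": {"model_a": 0.55, "model_b": 0.30},
}

print(select_template(zero_shot_acc))  # template_1
```

Note that template_2 is the best phrasing for model_a alone, which shows how averaging across models can hide per-model sensitivity to the prompt.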