[Paper Summary] oLMpics - On what LM Pre-training Captures [Talmor 2019]

Keypoints
  • We propose a diverse set of probing tasks for types of symbolic reasoning that are potentially difficult to capture with an LM objective. (Honestly, some of the tasks are built from only a handful of templates, so they don't seem to cover the language phenomena very comprehensively.)
  • We propose an evaluation protocol for disentangling knowledge retained from pre-training from knowledge acquired during fine-tuning (via the no-language and perturbed-language controls).
  • We provide an analysis of the skills current LMs possess. Their success is context-dependent, closely tied to specific values, and is not achieved via abstraction and composition as humans perceive it. (Fair enough: the so-called reasoning ability is learned only from the contexts the model can see.) For example, models fail completely on the Age Comparison task when the numbers do not fall into the normal human age range.
Probing setup
  1. Zero-shot: cast tasks in the masked LM format
  2. No pre-training control
  3. No language control: Only minimal language tokens are given. We remove all input except for [MASK] and the arguments of the task. If the model still succeeds, then performance can mostly be attributed to fine-tuning rather than to the pre-trained language representations; in such cases the model demonstrates low language sensitivity.
  4. Perturbed language control: Replace words that are central to the reasoning task with nonsense words, drawn from a list of 10 words that carry relatively little meaning, e.g. 'blah'.
  5. MC-MLM: for tasks where the answer set is small.
  6. MC-QA: for tasks where the answer set substantially varies between questions.
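As a toy illustration of the MC-MLM setup, the sketch below restricts the output vocabulary at the [MASK] position to the candidate answers and picks the argmax. Here `mask_logits` is a hypothetical stand-in for a real masked LM forward pass (the paper probes models like BERT and RoBERTa); the hard-coded age comparison is illustrative, not the authors' code.

```python
def mask_logits(statement):
    """Hypothetical stand-in for an MLM's logits at the [MASK] token.

    A real probe would run a masked LM and read off the logits; this toy
    version 'prefers' the correct comparative based on the two numbers.
    """
    a, b = [int(t) for t in statement.split() if t.isdigit()]
    return {"older": float(a - b), "younger": float(b - a), "blah": -10.0}

def mc_mlm_predict(statement, candidates):
    """MC-MLM: score only the candidate answers, ignore the rest of the vocab."""
    logits = mask_logits(statement)
    return max(candidates, key=lambda c: logits[c])

print(mc_mlm_predict(
    "A 41 year old person is [MASK] than a 42 year old person.",
    ["older", "younger"]))
```

The key point the sketch captures is that MC-MLM never generates freely: the candidate set defines the answer space, which is why it only suits tasks with a small, fixed set of answers.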
Metrics

Learning curves are informative, but they do not reduce to a single number that can be easily compared. Thus, we summarize each learning curve with two aggregate statistics.

  1. Max: Maximum acc on the learning curve
  2. WS: A weighted average of accuracies. Higher weights are given to points where N is small, reflecting our focus on performance with little fine-tuning data. WS is related to the area under the learning curve, and to the online code used in MDL (minimum description length) probing.
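The two statistics can be sketched as below. The specific weight vector is an illustrative assumption (decaying with N); the paper fixes its own weights that emphasize small training-set sizes.

```python
def max_acc(curve):
    """MAX: best accuracy anywhere on the learning curve."""
    return max(acc for _, acc in curve)

def weighted_sum(curve, weights):
    """WS: weighted average of accuracies, higher weight where N is small."""
    assert len(weights) == len(curve) and abs(sum(weights) - 1.0) < 1e-9
    return sum(w * acc for w, (_, acc) in zip(weights, curve))

# Learning curve as (N fine-tuning examples, accuracy) pairs (toy numbers).
curve = [(62, 0.55), (125, 0.61), (250, 0.70), (500, 0.78), (1000, 0.80)]
weights = [0.35, 0.25, 0.20, 0.15, 0.05]  # illustrative, decaying with N

print(max_acc(curve))                        # best point on the curve
print(round(weighted_sum(curve, weights), 3))
```

Note how WS rewards models that are accurate with few examples: two models with the same MAX can have very different WS if one only reaches its peak at large N.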
Task types
  1. Age Comparison (comparison by age / birth year)

  2. Object Size Comparison (fun fact: this is by the same group of authors!)

  3. Always-Never
    The task was mostly tackled at fine-tuning time.
    An anecdote: Reporting bias may play a role in the inability to correctly determine that 'A rhinoceros NEVER has fur'. Interestingly, behavioral research conducted on blind humans shows they exhibit a similar bias.

  4. Negation
    e.g. “He was [MASK] fast, he was very slow.”; “He was [MASK] fast, he was rapid.” This tests whether the model distinguishes a negation adverb from an intensification adverb based on synonymy/antonymy relations.

  5. Property Conjunction (MC-QA setup)
    e.g. “What is located in a street and is related to octagon?”

  6. Taxonomy Conjunction (find the mutual hypernym of a pair of concepts)
    e.g. "A ferry and a floatplane are both a type of [MASK]."
    LMs prefer hypernyms that are closer in terms of edge distance. When distractors are closer to one of the entities in the statement, the model will consistently (80%) choose the distractor, ignoring the second entity in the phrase.

  7. Encyclopedia Composition
    Absurdly, it uses only three templates: 1) “when did the band where ENT played first form?”, 2) “who is the spouse of the actor that played in ENT?”, 3) “where is the headquarters of the company that ENT established located?”

  8. Multi-Hop Composition
    The task was mostly tackled at fine-tuning time.
    Predicting the subject of sentences whose predicate is in superlative form, where the relevant information is contained in a “when” clause, e.g. “When comparing a 23, a 38, and a 31 year old, the [MASK] is oldest.” A. second B. first C. third

  9. We picked the statement phrasing for each task through manual experimentation: we tried multiple phrasings and chose the one that achieves the highest average zero-shot accuracy across all tested LMs. (So whether knowledge can be detected is also entangled with prompt selection.) Success is evidence that the model has the necessary skill, but failure could be attributed to a lack of familiarity with the templates.
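The two language controls from the probing setup can be sketched as simple template edits. This is a toy sketch: the nonsense-word list and the choice of which words count as "central" are illustrative assumptions, not the paper's exact lists.

```python
import random

NONSENSE = ["blah", "da", "ta", "wa"]  # assumption: the paper uses a list of 10 such words

def perturbed_language(statement, central_words, rng=random.Random(0)):
    """Perturbed language control: swap reasoning-central words for nonsense."""
    return " ".join(rng.choice(NONSENSE) if w in central_words else w
                    for w in statement.split())

def no_language(mask_token, arguments):
    """No language control: keep only [MASK] and the task arguments."""
    return " ".join([mask_token] + list(arguments))

s = "A 41 year old person is [MASK] than a 42 year old person ."
print(perturbed_language(s, {"year", "old", "person", "than"}))
print(no_language("[MASK]", ["41", "42"]))  # → [MASK] 41 42
```

If a fine-tuned model still solves the task on these controls, the skill was learned during fine-tuning rather than stored in the pre-trained representations.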
