[Paper Summary] Pareto Probing: Trading Off Accuracy for Complexity [Pimentel 2020]

Keypoints
  • Call for harder probing tasks
    Toy probing tasks, such as POS labeling and dependency arc labeling, are inadequate for evaluating the linguistic features encoded in contextual word representations.
    We advocate using actual NLP tasks as probing tasks, which reveal more about the advantages BERT provides over non-contextual representations.
  • Agree with [Pimentel 2020 Info-theoretic Probing] that purely optimizing for performance does not tell us anything about the representations, but only about the sentence itself.
  • Directly acknowledge the existence of a trade-off between acc and complexity by treating probing as a bi-objective optimization problem (high acc + low complexity)
  • Def <Pareto optimal probes>: probes for which no other probe is both simpler and more accurate. Such probes form the Pareto frontier.
    A pure preference for simple probes forces us to conclude that one-hot encodings almost always encode more linguistic structure than contextual representations. On the other hand, seeking the highest acc is equivalent to performing NLP task-based research in the classic way. Pareto curve-based measurements strike a reasonable balance.
  • Pareto curve
    x-axis: complexity; y-axis: acc. The probes that achieve the highest probing acc at each given complexity form the Pareto frontier.
  • Pareto hypervolume
    The area under the Pareto curve, used as a single metric for a representation that incorporates both acc and complexity (a minimal sketch follows below).
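
Below is a minimal sketch of how the Pareto frontier and hypervolume could be computed from a set of (complexity, accuracy) points, one point per trained probe. This is my own illustration rather than the paper's released code; the reference bound `max_complexity` and the example numbers are assumptions.

```python
def pareto_frontier(points):
    """Keep only Pareto-optimal (complexity, accuracy) points: no other point
    is both simpler (lower complexity) and more accurate."""
    pts = sorted(points, key=lambda p: (p[0], -p[1]))  # by complexity, best acc first
    frontier, best_acc = [], float("-inf")
    for comp, acc in pts:
        if acc > best_acc:          # strictly improves on every simpler probe
            frontier.append((comp, acc))
            best_acc = acc
    return frontier

def pareto_hypervolume(points, max_complexity):
    """Area dominated by the frontier up to a complexity budget, i.e. the area
    under the accuracy-vs-complexity step curve (reference accuracy = 0)."""
    frontier = pareto_frontier(points)
    area = 0.0
    for i, (comp, acc) in enumerate(frontier):
        next_comp = frontier[i + 1][0] if i + 1 < len(frontier) else max_complexity
        area += acc * (next_comp - comp)
    return area

# Hypothetical probes for one representation: (complexity, accuracy) pairs.
probes = [(1.0, 0.62), (5.0, 0.80), (20.0, 0.83), (20.0, 0.79)]
print(pareto_frontier(probes))                        # [(1.0, 0.62), (5.0, 0.80), (20.0, 0.83)]
print(pareto_hypervolume(probes, max_complexity=50))  # 0.62*4 + 0.80*15 + 0.83*30 = 39.38
```
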
Measures of probe complexity

**tl;dr:** If we only care about how friendly a representation is to simple probes, then one-hot encoding is always the best. We introduce both parametric & non-parametric measures.

  1. Parametric measures (essentially they’re adding a regularization term)
  • A natural measure is the rank of the projection matrix. This was used by [Hewitt and Manning 2019, A Structural Probe for Finding Syntax in Word Representations], but they left out a very basic baseline – one-hot encoding. We use the nuclear norm (a convex relaxation of rank) as the complexity measure. One striking result: if we only care about “ease of extraction”, then one-hot encoding is always the best choice (see the sketch after this list).

  • It’s easy to see why one-hot encoding does so well: for most of the toy tasks, word identity is the single most important factor, and it’s natural to expect that simple probes cannot exploit much more than word identity. Given that, one-hot encoding is the best, since it represents word identity trivially.

  • Biasing towards a small nuclear norm hurts generalization because we are already in an underfitting regime and are still optimizing for low complexity: for a linear probe family we usually feed in a small number of features and expect it to fit a large training set.

  • Relation to MDL (the variational approach in [Voita 2020 MDL]): the variational codelength is the negative ELBO, E_q[−log p(y|x, θ)] + KL(q(θ) || p(θ)). The likelihood term tells us how well we have coded the data, and the prior (KL) term tells us the length of the model’s code; the “prior” here refers to the a in [Voita 2020 MDL]’s formula.

  • The paper also points out a fundamental problem with MDL: the model codelength depends on the choice of prior over model parameters, and the fact that we can always “hack” the prior to make it favor certain probes over others may not correspond to our intuition about complexity. Hence we would rather not predefine a prior, which is why the non-parametric measures are proposed.
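
As a concrete illustration of the parametric measure, here is a minimal PyTorch sketch of a linear probe trained with cross-entropy plus a nuclear-norm penalty on its projection matrix. The architecture, the penalty weight `lam`, and the dimensions are my assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearProbe(nn.Module):
    """Linear probe mapping a word representation to label logits."""
    def __init__(self, dim_in, n_labels):
        super().__init__()
        self.proj = nn.Linear(dim_in, n_labels)

    def forward(self, reps):
        return self.proj(reps)

def probe_loss(probe, reps, labels, lam=0.01):
    """Cross-entropy plus a nuclear-norm penalty on the projection matrix.
    The nuclear norm (sum of singular values) is a convex relaxation of matrix
    rank, so minimizing it biases the probe towards low-rank, 'simple' maps."""
    ce = F.cross_entropy(probe(reps), labels)
    nuc = torch.linalg.matrix_norm(probe.proj.weight, ord='nuc')
    return ce + lam * nuc

# Toy usage: 768-dim contextual vectors, 17 POS tags, hypothetical penalty weight.
probe = LinearProbe(768, 17)
reps, labels = torch.randn(32, 768), torch.randint(0, 17, (32,))
probe_loss(probe, reps, labels).backward()
```
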

  2. Non-parametric measures (connect the notion of model complexity with the ease of memorization)
  • Refers to [Zhang 2017: Understanding deep learning requires rethinking generalization], which suggests neural networks find it easier to memorize labels when the inputs are structured. The idea: shuffle labels so that inputs are no longer predictive of them, forcing the model to memorize.

  • Label-shuffled: labels are randomly reassigned, so the inputs are no longer predictive of them.
    Fully-shuffled: why is this needed in addition to label-shuffled? In the label-shuffled case the inputs are still structured; perhaps unstructured inputs are harder to fit because the model cannot rely on syntactic patterns when memorizing the shuffled training data. (A sketch of both constructions follows below.)
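
Here is a sketch of one plausible way to build the two control sets; the data format (lists of (token-id sequence, label) pairs) and the exact fully-shuffled construction are my assumptions and may differ from the paper's recipe.

```python
import random

def label_shuffled(dataset, seed=0):
    """Randomly reassign labels so that inputs are no longer predictive of them,
    while the inputs themselves keep their original (syntactic) structure."""
    rng = random.Random(seed)
    labels = [y for _, y in dataset]
    rng.shuffle(labels)
    return [(x, y) for (x, _), y in zip(dataset, labels)]

def fully_shuffled(dataset, seed=0):
    """Additionally shuffle tokens across the whole training set, so the inputs
    lose their structure and the probe must memorize from unstructured noise."""
    rng = random.Random(seed)
    shuffled = label_shuffled(dataset, seed)
    pool = [tok for x, _ in shuffled for tok in x]   # flatten all tokens
    rng.shuffle(pool)                                # destroy sentence structure
    out, i = [], 0
    for x, y in shuffled:
        out.append((pool[i:i + len(x)], y))          # re-cut into original lengths
        i += len(x)
    return out

# Toy usage: (token-id sequence, label) pairs.
data = [([3, 7, 2], 1), ([5, 1], 0), ([9, 9, 4], 1)]
print(label_shuffled(data))
print(fully_shuffled(data))
```
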

  • An implementation note for one-hot encodings: they are randomly initialized and learned during training (see the sketch below).
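
One way to read this note, as a sketch: a one-hot vector multiplied by a weight matrix is just a row lookup, so the one-hot baseline can be implemented as a randomly initialized, trainable `nn.Embedding` table learned jointly with the probe. The vocabulary size and dimension below are illustrative.

```python
import torch
import torch.nn as nn

class OneHotBaseline(nn.Module):
    """Word-identity baseline: a one-hot input times a weight matrix equals an
    embedding lookup, so we use a randomly initialized, trainable table."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # random init, learned with the probe

    def forward(self, token_ids):
        return self.embed(token_ids)

# Toy usage: hypothetical 30k-word vocabulary, 256-dim "representation".
baseline = OneHotBaseline(30_000, 256)
reps = baseline(torch.tensor([12, 404, 7]))  # shape: (3, 256)
```
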

Results
  1. Simple probes (i.e. those with relatively low memorization capacity) achieve as high an acc as the more complex ones. We take this as support for the need for harder probing tasks, since toy tasks are not very interesting or informative.
  2. A simple dictionary look-up strategy, which relies entirely on word identity, solves the task with up to 86% acc. Thus current probing tasks lack discriminative power and artificially make non-contextual embeddings – e.g. fastText – look as good as contextual ones.
Further Thoughts
  1. The question of how to infer the goodness of a representation from probing performance is practically a three-body problem: there are all kinds of subtle interactions among the original representation, probe complexity, and input structure, as hard to untangle as the relationship between data quality, model quality, and eval metrics. In particular, to prove a given point you can hack one component and nudge the state of this three-body system so that the results favor a particular conclusion.
  2. Following on from the previous point, the conclusion hints at something like: “We need harder probing tasks to confirm that contextual representations indeed provide much more usable syntactic knowledge than non-contextual ones.” It feels a bit like first expecting BERT to be superior, then striving for a method that reveals it; after all, a claim alone doesn’t count, you need experimental tables to back it up. But this isn’t really a flaw, and the paper is overall very insightful.