Information-theoretic probing for linguistic structure [Pimentel 2020]
Teaser
… under our operationalization, the endeavour of finding syntax in contextualized embeddings of sentences is nonsensical. This is because, under Assumption 1, we know the answer a priori: the contextualized word embeddings of a sentence contain exactly the same amount of information about syntax as does the sentence itself.
Keypoints
- Call for complex probes
  One should always select the highest-performing probe one can, without resorting to artificial constraints, even if it is more complex, since it will result in a tighter estimate and thus reveal more of the linguistic information inherent in the representation.
- Call for harder probing tasks, or for a formal definition of ease of extraction, since the current operationalization does not reveal much advantage of contextual embeddings over non-contextual ones.
  Most of the information needed to tag POS is encoded at the lexical level and does not require sentential context. Put simply, words are not very ambiguous with respect to POS. The gain of BERT over a type-level control for dependency labeling is also modest. So the main point in favor of BERT might be "BERT makes information more readily accessible", rather than "BERT provides extra information that wasn't there in other representations".
- Contextual word embeddings contain the same amount of information about the linguistic property of interest as the original sentence. This follows from the data-processing inequality under a mild assumption.
- Probing for linguistic properties in representations may not be a well-grounded enterprise at all.
  Linguistic properties are, in some sense, always there: they are known a priori. It might make more sense to pursue ease of extraction instead. The famous question raised in [Hewitt & Liang 2019], whether the representation encodes linguistic structure or the probe just learns the task, is a false dichotomy, since there is no difference between a probe learning the task and a representation encoding the linguistic structure. Probing provides no insight about whether linguistic features are in the representation, because we know they are there ahead of time. Instead, we want the best probe we can get, so that we obtain the tightest bound on the actual distribution p(t | r), where t is the linguistic-property-valued random variable and r is the representation-valued random variable. We estimate this distribution to learn about the input sentence itself, not about the representation.
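The "tightest bound" point can be made concrete: a probe q(t | r) yields the lower bound I(T;R) ≥ H(T) − H_q(T|R), so the lower the probe's cross-entropy, the tighter the mutual-information estimate. A minimal sketch with toy discrete data and two hypothetical probes (all numbers invented for illustration):

```python
import math
from collections import Counter

# Toy data: pairs of (representation r, tag t). In practice r would be a
# continuous embedding; here it is a discrete symbol so entropies are exact.
pairs = [("r1", "NOUN"), ("r1", "NOUN"), ("r2", "VERB"),
         ("r2", "NOUN"), ("r3", "VERB"), ("r3", "VERB")]

def entropy(labels):
    """Empirical entropy H (in bits) of a list of symbols."""
    counts, n = Counter(labels), len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

H_T = entropy([t for _, t in pairs])  # marginal entropy of the tag

# Two hypothetical probes q(t | r): an uninformative one and a sharper one.
weak_probe = {r: {"NOUN": 0.5, "VERB": 0.5} for r in ("r1", "r2", "r3")}
sharp_probe = {"r1": {"NOUN": 0.9, "VERB": 0.1},
               "r2": {"NOUN": 0.5, "VERB": 0.5},
               "r3": {"NOUN": 0.1, "VERB": 0.9}}

def cross_entropy(probe, pairs):
    """Average probe cross-entropy H_q(T|R) in bits."""
    return -sum(math.log2(probe[r][t]) for r, t in pairs) / len(pairs)

# I(T;R) >= H(T) - H_q(T|R): the better probe gives the tighter (larger) bound.
bounds = {name: H_T - cross_entropy(probe, pairs)
          for name, probe in [("weak", weak_probe), ("sharp", sharp_probe)]}
print(bounds)
```

The weak probe's bound is zero (it extracts nothing), while the sharper probe pushes the bound up, which is exactly why the paper argues for the best probe one can train rather than a deliberately constrained one.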
Assumptions
- Every contextualized embedding is unique.
  Note that this requires that tokens of the same word type occurring in different sentences have different embeddings.
- There exists a function `id()` that maps a contextualized embedding to its word type.
  Note that `id()` is not a bijection, since multiple embeddings will map to the same type.
Any non-contextualized word embedding contains no more information than a contextualized one, because a non-contextualized embedding only tells you the word identity, while a contextualized embedding tells you the word identity plus something else (e.g. its neighbors). More formally, this follows from the data-processing inequality:

$$I(T;R) \geq I(T; id(R)) = I(T; W) \geq I(T; e(W))$$

where $W$ is the word identity and $e()$ is a look-up function that maps a word identity to a non-contextualized (i.e. word type-level) embedding. The data-processing inequality states that applying a function to an input can only compress the information, never create new information.
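The inequality chain can be checked numerically on a toy corpus (symbols stand in for embeddings, chosen only for illustration): since every contextual "embedding" is unique per token, I(T;R) equals H(T), while collapsing to word types via id() can only shrink the mutual information.

```python
import math
from collections import Counter

def mi(xs, ys):
    """Empirical mutual information I(X;Y) in bits from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Toy corpus: each token gets a unique "contextual embedding" r0..r5
# (Assumption 1); id() collapses it to the word type; T is the POS tag.
R = ["r0", "r1", "r2", "r3", "r4", "r5"]               # contextual: all unique
W = ["run", "run", "dog", "dog", "run", "dog"]          # id(R): word types
T = ["VERB", "NOUN", "NOUN", "NOUN", "VERB", "NOUN"]    # tags

# Data-processing inequality: applying id() can only lose information.
print(mi(T, R), mi(T, W))  # I(T;R) >= I(T;id(R)) = I(T;W)
```

Here `mi(T, R)` recovers H(T) exactly (unique representations leave no conditional uncertainty), while `mi(T, W)` is strictly smaller because the ambiguous type "run" carries both tags.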
Gain
How much info did we gain from contextualized embeddings over a type-level control?
- $G(T, R, c) = I(T; R) - I(T; c(W))$, where $c()$ is a control function that maps a word identity to a type-level representation. By the data-processing inequality above, this gain is non-negative.
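The gain can be instantiated on the same kind of toy data (invented samples; in practice I(T;R) must itself be estimated with a probe, since R is continuous):

```python
import math
from collections import Counter

def mi(xs, ys):
    """Empirical mutual information I(X;Y) in bits from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Toy samples: contextual representation R (unique per token, Assumption 1)
# vs. a type-level control c(W) that only sees the word identity.
R  = ["r0", "r1", "r2", "r3", "r4", "r5"]
cW = ["run", "run", "dog", "dog", "run", "dog"]          # control: word type only
T  = ["VERB", "NOUN", "NOUN", "NOUN", "VERB", "NOUN"]

gain = mi(T, R) - mi(T, cW)   # G(T, R, c) = I(T;R) - I(T;c(W))
print(round(gain, 3))          # non-negative by the data-processing inequality
```

A positive gain here comes entirely from the ambiguous type "run": the contextual representation disambiguates its two tags, which the type-level control cannot.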