Information-theoretic probing for linguistic structure [Pimentel 2020]
Teaser
… under our operationalization, the endeavour of finding syntax in contextualized embeddings of sentences is nonsensical. This is because, under Assumption 1, we know the answer a priori: the contextualized word embeddings of a sentence contain exactly the same amount of information about syntax as does the sentence itself.
Keypoints
- Call for complex probes
  One should always select the highest-performing probe one can, without resorting to artificial constraints, even if it is more complex, since it will result in a tighter estimate and thus reveal more of the linguistic information inherent in the representation.
- Call for harder probing tasks, or for a formal definition of ease of extraction, since the current operationalization does not reveal much advantage of contextual embeddings over non-contextual ones.
  Most of the information needed to tag POS is encoded at the lexical level and does not require sentential context. Put simply, words are not very ambiguous with respect to POS. The gain of BERT over a type-level control for dependency labeling is also modest. So the main point in favor of BERT might be "BERT makes information more readily accessible", rather than "BERT provides extra information that wasn't there in other representations".
- Contextual word embeddings contain the same amount of information about the linguistic property of interest as the original sentence. This follows from the data-processing inequality under a mild assumption.
- Probing for linguistic properties in representations may not be a well-grounded enterprise at all.
  Linguistic properties are, in some sense, always there: they are known a priori. It might make more sense to pursue ease of extraction instead. The famous question raised in [Hewitt & Liang 2019], whether the representation encodes linguistic structure or the probe just learns the task, is a false dichotomy, since there is no difference between a probe learning the task and a representation encoding the linguistic structure. Probing provides no insight about whether linguistic features are in the representation, because we know they are there ahead of time. Instead, we want the best probe we can get, so that we obtain the tightest bound on the actual distribution p(t | r), where t is the linguistic-property-valued random variable and r is the representation-valued random variable. We estimate this distribution to learn about the input sentence itself, not about the representation.
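The "tightest bound" point can be made concrete: a probe q(t | r) yields the lower bound I(T;R) ≥ H(T) − H_q(T|R), so the lower the probe's cross-entropy, the tighter the mutual-information estimate. A minimal sketch with toy discrete data and two hypothetical probes (all numbers invented for illustration):

```python
import math
from collections import Counter

# Toy data: pairs of (representation r, tag t). In practice r would be a
# continuous embedding; here it is a discrete symbol so entropies are exact.
pairs = [("r1", "NOUN"), ("r1", "NOUN"), ("r2", "VERB"),
         ("r2", "NOUN"), ("r3", "VERB"), ("r3", "VERB")]

def entropy(labels):
    """Empirical entropy H (in bits) of a list of symbols."""
    counts, n = Counter(labels), len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

H_T = entropy([t for _, t in pairs])  # marginal entropy of the tag

# Two hypothetical probes q(t | r): an uninformative one and a sharper one.
weak_probe = {r: {"NOUN": 0.5, "VERB": 0.5} for r in ("r1", "r2", "r3")}
sharp_probe = {"r1": {"NOUN": 0.9, "VERB": 0.1},
               "r2": {"NOUN": 0.5, "VERB": 0.5},
               "r3": {"NOUN": 0.1, "VERB": 0.9}}

def cross_entropy(probe, pairs):
    """Average probe cross-entropy H_q(T|R) in bits."""
    return -sum(math.log2(probe[r][t]) for r, t in pairs) / len(pairs)

# I(T;R) >= H(T) - H_q(T|R): the better probe gives the tighter (larger) bound.
bounds = {name: H_T - cross_entropy(probe, pairs)
          for name, probe in [("weak", weak_probe), ("sharp", sharp_probe)]}
print(bounds)
```

The weak probe's bound is zero (it extracts nothing), while the sharper probe pushes the bound up, which is exactly why the paper argues for the best probe one can train rather than a deliberately constrained one.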
Assumptions
- Every contextualized embedding is unique.
  Note that this requires that tokens of the same word type occurring in different sentences have different embeddings.
- There exists a function `id()` that maps a contextualized embedding to its word type.
  Note that `id()` is not a bijection, since multiple embeddings will map to the same type.
Any non-contextualized word embedding contains no more information than a contextualized one, because a non-contextualized embedding only tells you the word identity, while a contextualized embedding tells you the word identity plus something else (e.g. its neighbors). More formally, this follows from the data-processing inequality:

$$I(T;R) \geq I(T; id(R)) = I(T; W) \geq I(T; e(W))$$

where $W$ is the word identity and $e()$ is a look-up function that maps a word identity to a non-contextualized (i.e. word type-level) embedding. The data-processing inequality states that applying a function to an input can only compress the information, never create new information.
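The inequality chain can be checked numerically on a toy corpus (symbols stand in for embeddings, chosen only for illustration): since every contextual "embedding" is unique per token, I(T;R) equals H(T), while collapsing to word types via id() can only shrink the mutual information.

```python
import math
from collections import Counter

def mi(xs, ys):
    """Empirical mutual information I(X;Y) in bits from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Toy corpus: each token gets a unique "contextual embedding" r0..r5
# (Assumption 1); id() collapses it to the word type; T is the POS tag.
R = ["r0", "r1", "r2", "r3", "r4", "r5"]               # contextual: all unique
W = ["run", "run", "dog", "dog", "run", "dog"]          # id(R): word types
T = ["VERB", "NOUN", "NOUN", "NOUN", "VERB", "NOUN"]    # tags

# Data-processing inequality: applying id() can only lose information.
print(mi(T, R), mi(T, W))  # I(T;R) >= I(T;id(R)) = I(T;W)
```

Here `mi(T, R)` recovers H(T) exactly (unique representations leave no conditional uncertainty), while `mi(T, W)` is strictly smaller because the ambiguous type "run" carries both tags.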
Gain
How much info did we gain from contextualized embeddings over a type-level control?
- $G(T, R, c) = I(T; R) - I(T; c(W))$, where $c()$ is a control function that maps a word identity to a type-level representation. By the data-processing inequality above, this gain is non-negative.
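The gain can be instantiated on the same kind of toy data (invented samples; in practice I(T;R) must itself be estimated with a probe, since R is continuous):

```python
import math
from collections import Counter

def mi(xs, ys):
    """Empirical mutual information I(X;Y) in bits from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Toy samples: contextual representation R (unique per token, Assumption 1)
# vs. a type-level control c(W) that only sees the word identity.
R  = ["r0", "r1", "r2", "r3", "r4", "r5"]
cW = ["run", "run", "dog", "dog", "run", "dog"]          # control: word type only
T  = ["VERB", "NOUN", "NOUN", "NOUN", "VERB", "NOUN"]

gain = mi(T, R) - mi(T, cW)   # G(T, R, c) = I(T;R) - I(T;c(W))
print(round(gain, 3))          # non-negative by the data-processing inequality
```

A positive gain here comes entirely from the ambiguous type "run": the contextual representation disambiguates its two tags, which the type-level control cannot.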