[Paper Summary] Information-Theoretic Probing with Minimum Description Length [Voita & Titov 2020]

tl;dr
To see a reasonable difference in accuracy between trained representations and random baselines, previous work had to constrain either the amount of probe training data or the probe's model size.
In addition to probe quality, the description length evaluates 'the amount of effort' needed to achieve that quality. This effort is characterized either by a) the size of the probing model (variational coding), or b) the amount of data needed to achieve high quality (online coding). The variational code can be used to inspect the induced sparse architecture, while the online code is easier to implement.


MDL
  • Cast as a transmission problem: Alice wants to transmit the label of each item to Bob, and they agree to do so using a probabilistic model of the data $p(y \mid x)$. Since Bob does not know the precise trained model, some explicit or implicit transmission of the model itself is also required.
  • We are interested in the number of bits needed to transmit the labels given the representations. The overall codelength combines the quality of fit of the model with the cost of transmitting the model itself. The variational code computes the model cost directly, while the online code does so indirectly.
  • We conclude that the ability of a probe to achieve good quality using a small amount of data and its ability to achieve good quality with a small probe architecture reflect the same underlying property: the strength of the regularity in the data.

Analogy to Transmission of the data
  • Data codelength $= -\sum_{i=1}^n \log_2 p(y_i \mid x_i)$: the quality of a learned model $p(y|x)$ is measured by the codelength needed to transmit the data.
  • Compression is usually compared against the uniform encoding $p(y|x) = \frac{1}{K}$ over $K$ classes, which yields codelength $n\log_2 K$.
  • The gain of transmitting labels using a model over transmitting them directly is upper-bounded by the mutual information $I(y; x)$. While high mutual information is necessary for effective compression, a good representation is one which also yields simple models predicting $y$ from $x$.
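A minimal sketch of these two codelengths, assuming a classifier that outputs per-class probabilities (variable and function names are illustrative, not from the paper):

```python
import numpy as np

def data_codelength_bits(probs, labels):
    """Codelength -sum_i log2 p(y_i | x_i) in bits, where probs[i, labels[i]]
    is the probability the model assigns to the correct class of example i."""
    p_correct = probs[np.arange(len(labels)), labels]
    return -np.sum(np.log2(p_correct))

def uniform_codelength_bits(n_examples, n_classes):
    """Baseline n * log2(K) for the uniform encoding p(y|x) = 1/K."""
    return n_examples * np.log2(n_classes)

# Toy example: 3 examples, 4 classes.
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.6, 0.1, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
labels = np.array([0, 1, 3])
print(data_codelength_bits(probs, labels))      # ~1.77 bits
print(uniform_codelength_bits(len(labels), 4))  # 6.0 bits
```

Compression can then be reported as the ratio of the uniform codelength to the model's codelength.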

Variational code
  • favors high accuracy & a small model
  • The cost being optimized decomposes into cross-entropy plus a regularization term; this joint cost is exactly the loss function of a variational learning algorithm.
  • The best architecture is obtained as a byproduct of MDL optimization, and not by manual search.
  • Alice and Bob must have agreed on a prior distribution over the parameters, $\alpha(\theta)$. In the variational approach, weights are treated as random variables, and the description length is given by the expectation:
    $-\mathbb{E}_{\theta \sim \beta}\big[\log_2\alpha(\theta) - \log_2\beta(\theta) + \sum^n_{i=1}\log_2 p_\theta(y_i \mid x_i)\big]$
    $= KL(\beta\,\|\,\alpha) - \mathbb{E}_{\theta \sim \beta} \sum^n_{i=1}\log_2 p_\theta(y_i \mid x_i)$  (1)
    $\alpha$ and $\beta$ are distributions over model parameters; $\beta$ is chosen by minimizing the codelength given in Expression (1).
  • Info-theoretic background (connections I found hard to follow): a) the negated codelength, $-\text{Expression (1)}$, is known as the evidence lower bound (ELBO) and is used as the objective in variational inference; b) the distribution $\beta(\theta)$ approximates the intractable posterior $p(\theta \mid x_{1:n}, y_{1:n})$.
  • We use a network compression method with sparsity-inducing priors on the parameters, so neurons are pruned from the probing classifier as a byproduct of optimizing the ELBO. As a result, probe complexity can be assessed both via its description length $KL(\beta\,\|\,\alpha)$ and by inspecting the discovered architecture. This is why, unlike previous work, no manual architecture search is needed: the optimal architecture is derived as a byproduct.
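Below is a minimal sketch of minimizing the codelength in Expression (1) for a linear probe with a factorized Gaussian posterior over its weights; the paper itself uses a Bayesian network-compression method with sparsity-inducing priors, so this is only an illustration, and all class and function names are made up:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

LOG2 = math.log(2.0)

class VariationalLinearProbe(nn.Module):
    """Linear probe whose weights have a factorized Gaussian posterior beta(theta)."""
    def __init__(self, dim, n_classes, prior_std=1.0):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_classes, dim))
        self.log_sigma = nn.Parameter(torch.full((n_classes, dim), -3.0))
        self.prior_std = prior_std

    def kl_bits(self):
        # KL(beta || alpha) between the Gaussian posterior and a zero-mean
        # Gaussian prior alpha, converted from nats to bits.
        sigma2 = torch.exp(2 * self.log_sigma)
        p2 = self.prior_std ** 2
        kl_nats = 0.5 * torch.sum(
            (sigma2 + self.mu ** 2) / p2 - 1 - 2 * self.log_sigma + math.log(p2)
        )
        return kl_nats / LOG2

    def forward(self, x):
        # Single reparameterized sample theta ~ beta(theta) to estimate the expectation.
        theta = self.mu + torch.exp(self.log_sigma) * torch.randn_like(self.mu)
        return x @ theta.t()

def variational_codelength_bits(probe, x, y):
    """Expression (1): KL(beta || alpha) - E_{theta~beta} sum_i log2 p_theta(y_i | x_i)."""
    nll_nats = F.cross_entropy(probe(x), y, reduction="sum")  # -sum_i log p_theta(y_i|x_i)
    return probe.kl_bits() + nll_nats / LOG2
```

Minimizing this quantity with a standard optimizer trades off data fit (the cross-entropy term) against model cost (the KL term); with sparsity-inducing priors, as in the paper, the pruned neurons additionally expose the induced architecture.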

Online code
  • favors high accuracy & a model that achieves high accuracy with a small amount of data
  • Transmission is not done all at once. Alice and Bob choose timesteps $1 = t_0 < t_1 < \ldots < t_S = n$ and transmit the data in blocks. Alice starts by communicating $y_{1:t_1}$ with a uniform code; then both Alice and Bob learn a model $p_{\theta_1}$ from $\{(x_i, y_i)\}_{i=1}^{t_1}$, and Alice uses that model to communicate the next block $y_{t_1+1:t_2}$. Then both learn a model from the larger block $\{(x_i, y_i)\}_{i=1}^{t_2}$, and so on, until the entire dataset has been transmitted. The resulting online codelength is
    $t_1\log_2 K - \sum_{i=1}^{S-1}\log_2 p_{\theta_i}\big(y_{t_i+1:t_{i+1}} \mid x_{t_i+1:t_{i+1}}\big)$
  • Relation to regularity in the data: if the regularity in the data is strong, it can be revealed from a small subset, i.e. early in the transmission process, and exploited to transmit the rest of the dataset efficiently: the model reaches high accuracy early on, so fewer bits are needed for the remaining blocks.
  • The online code is related to the area under the learning curve.
  • Though the online code does not incorporate the model cost explicitly, we can still evaluate it by interpreting the difference between the online codelength and the cross-entropy of a model trained on all the data as the model cost.
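A minimal sketch of the online codelength, assuming a `train_probe(X, y)` helper that returns a fitted classifier whose `predict_proba` covers all `n_classes` (helper names and block boundaries are illustrative):

```python
import numpy as np

def online_codelength_bits(X, y, n_classes, block_ends, train_probe):
    """Online code: the first block is transmitted with the uniform code
    (t_1 * log2 K); each later block y_{t_i+1 : t_{i+1}} is transmitted with a
    probe trained on all previously transmitted examples."""
    total = block_ends[0] * np.log2(n_classes)
    for t_i, t_next in zip(block_ends[:-1], block_ends[1:]):
        probe = train_probe(X[:t_i], y[:t_i])        # model p_{theta_i}
        probs = probe.predict_proba(X[t_i:t_next])   # shape (block_size, n_classes)
        p_correct = probs[np.arange(t_next - t_i), y[t_i:t_next]]
        total += -np.sum(np.log2(p_correct))
    return total
```

With scikit-learn, `train_probe` could be as simple as `lambda X, y: LogisticRegression(max_iter=1000).fit(X, y)`; the paper's block boundaries grow roughly geometrically as fractions of the training set.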

Control Tasks: assign a random POS label to each word type
  • Following [Hewitt & Liang 2019], selectivity (the accuracy gap between the linguistic task and its control task) reveals how much the probe relies on the regularities encoded in the representations.
  • For the control task, codes become larger as we move up from the embedding layer. This is expected: control labels depend only on the word type, which the embedding layer encodes directly, while contextualized layers mix in context.
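A minimal sketch of constructing such a control task: each word type gets one randomly sampled tag (uniform here for simplicity), and every token is labeled with its type's tag, so the task can only be solved by memorizing word identities. Names and the tag-set size are illustrative.

```python
import random

def make_control_labels(sentences, n_tags, seed=0):
    """Assign each word type a random tag once, then label all its tokens with that tag."""
    rng = random.Random(seed)
    type_to_tag = {}
    control = []
    for sent in sentences:
        labels = []
        for word in sent:
            if word not in type_to_tag:
                type_to_tag[word] = rng.randrange(n_tags)
            labels.append(type_to_tag[word])
        control.append(labels)
    return control

# The same word type always receives the same (random) tag across sentences.
sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]
print(make_control_labels(sents, n_tags=45))
```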

Findings
  • Looking at the architectures induced by the variational approach, probes learned for linguistic tasks are much smaller than those learned for control tasks.
  • In 8 out of 10 hyperparameter settings, accuracy on the control task is higher than on the linguistic task, while MDL consistently assigns a lower cost to the linguistic task, since it is more structured.
  • MDL is also more stable across random seeds and produces well-separated measurements for the two tasks.
  • MDL shows that, compared to the embedding layer (i.e. non-contextualized representations), contextualized layers are better even when the model parameters are randomly initialized; however, the compression gain from trained representations is about twice that from randomly initialized ones.

Overall, this is the first probing paper I have read, and its idea is quite fascinating: it borrows and combines many earlier ideas, and the interpretation built on top is novel and clever. That said, the experiments toward the end felt dull. Is this a common weakness of probing papers, or is the range of experiments simply too narrow? In the end it is only POS and dependency tasks, only comparisons of Layer 0, Layer 1, and Layer 2, and there are only two possible outcomes: either the results agree with previous work, which is unsurprising, or they contradict it, in which case it is hard to tell whether that is an artifact of the metric design or a real property of the representations, so one ends up forcing a narrative onto the plots and the conclusion remains inconclusive. This is not meant to disparage the paper: neural network interpretation is inherently like untangling a ball of yarn, where pulling one thread moves everything else. Even though the work I have seen so far focuses on narrow aspects, it is at least better than nothing.
