[Paper Summary] Information-Theoretic Probing with Minimum Description Length [Voita & Titov 2020]

tl;dr
To see a reasonable difference in accuracy between trained representations and random baselines, previous work had to constrain either the amount of probe training data or the probe's model size.
In addition to probe quality, the description length evaluates 'the amount of effort' needed to achieve that quality. This effort is characterized either by a) the size of the probing model (variational coding), or b) the amount of data needed to achieve high quality (online coding). The variational code can be used to inspect the induced sparse architecture, while the online code is easier to implement.


MDL
  • Cast as a transmission problem: Alice wants to transmit the label of each item to Bob, and they agree to do so using a probabilistic model of the data $p(y \mid x)$. Since Bob does not know the precise trained model, some explicit or implicit transmission of the model itself is also required.
  • We are interested in the number of bits needed to transmit the labels given the representations. The overall codelength combines the quality of fit of the model with the cost of transmitting the model itself. The variational code computes the model cost directly, while the online code does so indirectly.
  • We conclude that the ability of a probe to achieve good quality using a small amount of data and its ability to achieve good quality with a small probe architecture reflect the same underlying property: the strength of the regularity in the data.

Analogy to Transmission of the data
  • Data codelength $= -\sum_{i=1}^n \log_2 p(y_i \mid x_i)$: the quality of a learned model $p(y|x)$ is measured by the codelength needed to transmit the data.
  • Compression is usually compared against the uniform encoding $p(y|x) = \frac{1}{K}$ over $K$ classes, which yields codelength $n\log_2 K$.
  • The gain of transmitting labels using a model over transmitting them directly is upper-bounded by the mutual information $I(y; x)$. While high mutual information is necessary for effective compression, a good representation is one which also yields simple models predicting $y$ from $x$.
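A minimal sketch of these two codelengths, assuming a classifier that outputs per-class probabilities (variable and function names are illustrative, not from the paper):

```python
import numpy as np

def data_codelength_bits(probs, labels):
    """Codelength -sum_i log2 p(y_i | x_i) in bits, where probs[i, labels[i]]
    is the probability the model assigns to the correct class of example i."""
    p_correct = probs[np.arange(len(labels)), labels]
    return -np.sum(np.log2(p_correct))

def uniform_codelength_bits(n_examples, n_classes):
    """Baseline n * log2(K) for the uniform encoding p(y|x) = 1/K."""
    return n_examples * np.log2(n_classes)

# Toy example: 3 examples, 4 classes.
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.6, 0.1, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
labels = np.array([0, 1, 3])
print(data_codelength_bits(probs, labels))      # ~1.77 bits
print(uniform_codelength_bits(len(labels), 4))  # 6.0 bits
```

Compression can then be reported as the ratio of the uniform codelength to the model's codelength.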

Variational code
  • favors high accuracy & a small model
  • The cost being optimized decomposes into cross-entropy plus a regularization term; this joint cost is exactly the loss function of a variational learning algorithm.
  • The best architecture is obtained as a byproduct of MDL optimization, and not by manual search.
  • Alice and Bob must have agreed on a prior distribution over the parameters, $\alpha(\theta)$. In the variational approach, weights are treated as random variables, and the description length is given by the expectation:
    $-\mathbb{E}_{\theta \sim \beta}\big[\log_2\alpha(\theta) - \log_2\beta(\theta) + \sum^n_{i=1}\log_2 p_\theta(y_i \mid x_i)\big]$
    $= KL(\beta\,\|\,\alpha) - \mathbb{E}_{\theta \sim \beta} \sum^n_{i=1}\log_2 p_\theta(y_i \mid x_i)$  (1)
    $\alpha$ and $\beta$ are distributions over model parameters; $\beta$ is chosen by minimizing the codelength given in Expression (1).
  • Info-theoretic background (connections I found hard to follow): a) the negated codelength, $-\text{Expression (1)}$, is known as the evidence lower bound (ELBO) and is used as the objective in variational inference; b) the distribution $\beta(\theta)$ approximates the intractable posterior $p(\theta \mid x_{1:n}, y_{1:n})$.
  • We use a network compression method with sparsity-inducing priors on the parameters, so neurons are pruned from the probing classifier as a byproduct of optimizing the ELBO. As a result, probe complexity can be assessed both via its description length $KL(\beta\,\|\,\alpha)$ and by inspecting the discovered architecture. This is why, unlike previous work, no manual architecture search is needed: the optimal architecture is derived as a byproduct.
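Below is a minimal sketch of minimizing the codelength in Expression (1) for a linear probe with a factorized Gaussian posterior over its weights; the paper itself uses a Bayesian network-compression method with sparsity-inducing priors, so this is only an illustration, and all class and function names are made up:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

LOG2 = math.log(2.0)

class VariationalLinearProbe(nn.Module):
    """Linear probe whose weights have a factorized Gaussian posterior beta(theta)."""
    def __init__(self, dim, n_classes, prior_std=1.0):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_classes, dim))
        self.log_sigma = nn.Parameter(torch.full((n_classes, dim), -3.0))
        self.prior_std = prior_std

    def kl_bits(self):
        # KL(beta || alpha) between the Gaussian posterior and a zero-mean
        # Gaussian prior alpha, converted from nats to bits.
        sigma2 = torch.exp(2 * self.log_sigma)
        p2 = self.prior_std ** 2
        kl_nats = 0.5 * torch.sum(
            (sigma2 + self.mu ** 2) / p2 - 1 - 2 * self.log_sigma + math.log(p2)
        )
        return kl_nats / LOG2

    def forward(self, x):
        # Single reparameterized sample theta ~ beta(theta) to estimate the expectation.
        theta = self.mu + torch.exp(self.log_sigma) * torch.randn_like(self.mu)
        return x @ theta.t()

def variational_codelength_bits(probe, x, y):
    """Expression (1): KL(beta || alpha) - E_{theta~beta} sum_i log2 p_theta(y_i | x_i)."""
    nll_nats = F.cross_entropy(probe(x), y, reduction="sum")  # -sum_i log p_theta(y_i|x_i)
    return probe.kl_bits() + nll_nats / LOG2
```

Minimizing this quantity with a standard optimizer trades off data fit (the cross-entropy term) against model cost (the KL term); with sparsity-inducing priors, as in the paper, the pruned neurons additionally expose the induced architecture.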

Online code
  • favors high accuracy & a model that achieves high accuracy with a small amount of data
  • Transmission is not done all at once. Alice and Bob choose timesteps $1 = t_0 < t_1 < \ldots < t_S = n$ and transmit the data in blocks. Alice starts by communicating $y_{1:t_1}$ with a uniform code; then both Alice and Bob learn a model $p_{\theta_1}$ from $\{(x_i, y_i)\}_{i=1}^{t_1}$, and Alice uses that model to communicate the next block $y_{t_1+1:t_2}$. Then both learn a model from the larger block $\{(x_i, y_i)\}_{i=1}^{t_2}$, and so on, until the entire dataset has been transmitted. The resulting online codelength is
    $t_1\log_2 K - \sum_{i=1}^{S-1}\log_2 p_{\theta_i}\big(y_{t_i+1:t_{i+1}} \mid x_{t_i+1:t_{i+1}}\big)$
  • Relation to regularity in the data: if the regularity in the data is strong, it can be revealed from a small subset, i.e. early in the transmission process, and exploited to transmit the rest of the dataset efficiently: the model reaches high accuracy early on, so fewer bits are needed for the remaining blocks.
  • The online code is related to the area under the learning curve.
  • Though the online code does not incorporate the model cost explicitly, we can still evaluate it by interpreting the difference between the online codelength and the cross-entropy of a model trained on all the data as the model cost.
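A minimal sketch of the online codelength, assuming a `train_probe(X, y)` helper that returns a fitted classifier whose `predict_proba` covers all `n_classes` (helper names and block boundaries are illustrative):

```python
import numpy as np

def online_codelength_bits(X, y, n_classes, block_ends, train_probe):
    """Online code: the first block is transmitted with the uniform code
    (t_1 * log2 K); each later block y_{t_i+1 : t_{i+1}} is transmitted with a
    probe trained on all previously transmitted examples."""
    total = block_ends[0] * np.log2(n_classes)
    for t_i, t_next in zip(block_ends[:-1], block_ends[1:]):
        probe = train_probe(X[:t_i], y[:t_i])        # model p_{theta_i}
        probs = probe.predict_proba(X[t_i:t_next])   # shape (block_size, n_classes)
        p_correct = probs[np.arange(t_next - t_i), y[t_i:t_next]]
        total += -np.sum(np.log2(p_correct))
    return total
```

With scikit-learn, `train_probe` could be as simple as `lambda X, y: LogisticRegression(max_iter=1000).fit(X, y)`; the paper's block boundaries grow roughly geometrically as fractions of the training set.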

Control Tasks: assign a random POS label to each word type
  • Following [Hewitt & Liang 2019], selectivity (the accuracy gap between the linguistic task and its control task) reveals how much the probe relies on the regularities encoded in the representations.
  • For the control task, codes become larger as we move up from the embedding layer. This is expected: control labels depend only on the word type, which the embedding layer encodes directly, while contextualized layers mix in context.
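A minimal sketch of constructing such a control task: each word type gets one randomly sampled tag (uniform here for simplicity), and every token is labeled with its type's tag, so the task can only be solved by memorizing word identities. Names and the tag-set size are illustrative.

```python
import random

def make_control_labels(sentences, n_tags, seed=0):
    """Assign each word type a random tag once, then label all its tokens with that tag."""
    rng = random.Random(seed)
    type_to_tag = {}
    control = []
    for sent in sentences:
        labels = []
        for word in sent:
            if word not in type_to_tag:
                type_to_tag[word] = rng.randrange(n_tags)
            labels.append(type_to_tag[word])
        control.append(labels)
    return control

# The same word type always receives the same (random) tag across sentences.
sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]
print(make_control_labels(sents, n_tags=45))
```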

Findings
  • Looking at the architectures induced by the variational approach, probes learned for linguistic tasks are much smaller than those learned for control tasks.
  • In 8 out of 10 hyperparameter settings, accuracy on the control task is higher than on the linguistic task, while MDL consistently assigns a lower cost to the linguistic task, since it is more structured.
  • MDL is also more stable across random seeds and produces well-separated measurements for the two tasks.
  • MDL shows that, compared to the embedding layer (i.e. non-contextualized representations), contextualized layers are better even when the model parameters are randomly initialized; however, the compression gain from trained representations is about twice that from randomly initialized ones.

Overall, this is the first probing paper I have read, and its idea is quite fascinating: it borrows and combines many earlier ideas, and the interpretation built on top is novel and clever. That said, the experiments toward the end felt dull. Is this a common weakness of probing papers, or is the range of experiments simply too narrow? In the end it is only POS and dependency tasks, only comparisons of Layer 0, Layer 1, and Layer 2, and there are only two possible outcomes: either the results agree with previous work, which is unsurprising, or they contradict it, in which case it is hard to tell whether that is an artifact of the metric design or a real property of the representations, so one ends up forcing a narrative onto the plots and the conclusion remains inconclusive. This is not meant to disparage the paper: neural network interpretation is inherently like untangling a ball of yarn, where pulling one thread moves everything else. Even though the work I have seen so far focuses on narrow aspects, it is at least better than nothing.
