Evaluating representations by the complexity of learning low-loss predictors [Whitney 2020]
tl;dr
The probing thread before this paper treated evaluation as an accuracy-complexity trade-off. This paper argues that as the eval dataset size changes, the dynamics of that trade-off also change, so metrics that handle the accuracy-complexity trade-off, like VA or MDL, may lead you to different conclusions about the best representation depending on the eval dataset size. Also, if the dataset is too small to yield any successful predictor, blindly applying VA or MDL may result in premature evaluation: they produce a judgement even when there is not enough data to meaningfully distinguish one representation from another.
In this paper, they fix accuracy (i.e., define a notion of the task "being solved") and measure the goodness of a representation by how much effort it takes to achieve a loss within tolerance ε. To measure this effort the paper proposes two metrics, SDL and εSC. Both can generally be read as data effort, but SDL is somewhat related to the model-codelength interpretation in MDL (online code version), so it can also be interpreted as model complexity.
If the dataset is not big enough to reach a low-loss predictor, the measurements give an approximate lower bound, which notifies the user that there is insufficient data for a conclusive evaluation. If the dataset is big enough, the evaluation depends only on the data distribution rather than its size: adding more data from then on won't change the result. Premature judgements are therefore avoided.
Keypoints
- Introduce the loss-data framework for evaluating representations, which deals with the sensitivity to dataset size (a sketch of tracing such a curve follows this list). Results in papers like MDL are in fact measured at one fixed dataset size, so they only correspond to a slice of the loss-data curve at x = some fixed value.
- Propose representation eval metrics: the complexity of the model that solves a task to ε loss on top of a learned representation.
- Two measures of model complexity: SDL (surplus description length) and εSC (ε sample complexity)
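A minimal sketch of tracing a loss-data curve, under my own assumptions (synthetic data, a logistic-regression probe, and a log-uniform grid of training sizes; none of these choices come from the paper). The same probing algorithm A is trained on growing subsets of the eval data and the held-out loss is recorded at each size; the SDL and εSC sketches below both read off this curve.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Stand-in for "representation fed to the probe": synthetic features + labels.
X, y = make_classification(n_samples=12_000, n_features=32, random_state=0)
X_eval, y_eval = X[10_000:], y[10_000:]          # held-out evaluation slice

sizes = [30, 100, 300, 1_000, 3_000, 10_000]     # log-uniform partition
losses = []
for n in sizes:
    probe = LogisticRegression(max_iter=1_000).fit(X[:n], y[:n])  # algorithm A
    losses.append(log_loss(y_eval, probe.predict_proba(X_eval)))

# The (size, loss) pairs trace the loss-data curve of this (A, representation).
print(list(zip(sizes, losses)))
```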
Motivation
The best representation is the one that allows for the most efficient learning of a predictor to solve the task. They measure efficiency by how many labels are needed to achieve a pre-defined "low loss" in the deployment phase. The more labels needed, the more expensive and the less widely applicable a representation will be. A good representation should yield improved data efficiency and should be evaluated on that basis.
Surplus description length (SDL)
- Def (extends the 'online code' from the MDL paper): SDL(A, Φ, ε) = Σ_{n=1}^∞ max(L(A, Φ, n) − ε, 0), where L(A, Φ, n) is the expected loss of the predictor that A learns from n samples encoded by Φ.
- This corresponds to the additional description length incurred by encoding the data with the learning algorithm A on representation Φ rather than with a fixed predictor that achieves loss ε.
- MDL with the online code corresponds to the area under the loss-data curve. SDL instead computes the area between the loss-data curve of (A, Φ) and the baseline set by y=ε.
- If we assume that algorithms improve monotonically when given more data, SDL depends on the curve only up to the first point where an ε-loss predictor appears.
- As for computing this sum in practice, it is done similarly to [Voita 2020 MDL], "by taking a log-uniform partition of the eval dataset size and computing the Riemann sum" (sketched below; surely I can't do the math exactly).
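A minimal sketch of that Riemann-sum estimate, assuming we already have a loss-data curve like the one traced above; the function name and the hypothetical sizes/losses values are mine, not from the paper.

```python
import numpy as np

def surplus_description_length(sizes, losses, eps):
    """Riemann-sum estimate of SDL = sum over n of max(L(A, Phi, n) - eps, 0)."""
    sizes = np.asarray(sizes, dtype=float)
    losses = np.asarray(losses, dtype=float)
    surplus = np.maximum(losses - eps, 0.0)        # clipped surplus per sample
    # Each measured size n_i stands in for all n in (n_{i-1}, n_i], so the
    # rectangle width is the gap between consecutive measured sizes.
    widths = np.diff(np.concatenate([[0.0], sizes]))
    sdl = float(np.sum(surplus * widths))
    # If even the largest probe never reaches eps, this is only a lower bound.
    return sdl, bool(losses[-1] > eps)

sizes = [10, 100, 1_000, 10_000]       # log-uniform partition of eval sizes
losses = [2.1, 1.3, 0.52, 0.41]        # hypothetical probe losses (nats)
sdl, only_lower_bound = surplus_description_length(sizes, losses, eps=0.5)
print(sdl, only_lower_bound)
```

How each rectangle is anchored (left vs. right endpoint of the interval) is a discretization choice of mine; the paper's exact convention may differ.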
ε sample complexity (εSC)
- Def: the number of samples it takes to find an ε-loss predictor, i.e. εSC(A, Φ, ε) = min{n : L(A, Φ, n) ≤ ε}.
- This corresponds to taking a horizontal slice of the loss-data curve at y=ε, analogous to VA's vertical slice at x=n.
- In practice, once you have the curve computed for SDL, you get this one for free (see the sketch below).
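A companion sketch under the same assumptions as the SDL code above: εSC just scans the measured loss-data curve for the first size whose loss reaches ε.

```python
def eps_sample_complexity(sizes, losses, eps):
    # First measured training-set size whose probe loss is within tolerance.
    for n, loss in zip(sizes, losses):
        if loss <= eps:
            return n
    return None  # eps never reached: report "more data needed", not a number

# Reusing the hypothetical curve from the SDL sketch above:
print(eps_sample_complexity([10, 100, 1_000, 10_000],
                            [2.1, 1.3, 0.52, 0.41], eps=0.5))  # -> 10000
```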
Choosing ε
- Choices of ε ≥ H(Y | X) represent attainable functions; selecting ε < H(Y | X) leads to unbounded SDL and εSC. (According to [Pimentel 2020 Info-theoretic Probing], though, H(Y | X) should be 0, since the representation contains as much information as the original sentence does, which would make SDL and εSC always bounded. But that is another matter: [Pimentel 2020 Info-theoretic Probing] relies on some mild assumptions, e.g. that "every contextualized representation is unique".)
- ε can be set by training a large model on the raw representation with the full eval dataset and using its validation loss as the ε when evaluating other representations (a sketch follows).
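A hedged sketch of that ε-selection recipe; the synthetic data and the MLP probe are stand-ins I chose, not the paper's setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import log_loss

# Stand-in for the "raw representation" over the full eval dataset.
X_raw, y = make_classification(n_samples=5_000, n_features=64, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_raw, y, test_size=0.2,
                                            random_state=0)

# Large model on the raw representation; its validation loss becomes eps.
big_probe = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=300,
                          random_state=0).fit(X_tr, y_tr)
eps = log_loss(y_val, big_probe.predict_proba(X_val))
print(eps)  # tolerance used when evaluating every other representation
```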
Future work
- In this paper the probing algorithm A is still fixed (so it corresponds to a slice if you consider a loss-data-A framework). Future work might consider a set of algorithms and a method of combining them.
- The probing task is still fixed; the framework does not predict performance for transfer across tasks.