Understanding Self-Supervised Learning Dynamics without Contrastive Pairs (Paper Notes)
Two layer linear model
Points
- Weight decay can balance the predictor and online network weights.
- Stop-gradient can prevent collapse.
- Stop-gradient with no predictor still leads to collapse.
Model
$W_p \in \mathbb{R}^{n_2 \times n_2}$, $W \in \mathbb{R}^{n_2 \times n_1}$, $W_a \in \mathbb{R}^{n_2 \times n_1}$, $x \in \mathbb{R}^{n_1}$. $f_1 = W x_1 \in \mathbb{R}^{n_2}$, $f_{2a} = W_a x_2 \in \mathbb{R}^{n_2}$.
$x_1, x_2$ are two augmented views.
Loss Function
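Combining the quantities defined above, the per-pair objective is the squared distance between the predicted online output $W_p f_1$ and the stop-gradient target $f_{2a}$ (this is my reconstruction of the formula from the paper; weight decay $\eta$ enters the dynamics separately):

$$
L = \frac{1}{2}\,\mathbb{E}\left[\big\| W_p f_1 - \operatorname{StopGrad}(f_{2a}) \big\|_2^2\right]
  = \frac{1}{2}\,\mathbb{E}\left[\big\| W_p W x_1 - \operatorname{StopGrad}(W_a x_2) \big\|_2^2\right]
$$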
Gradient
$\eta$: weight decay. $X = \mathbb{E}_x[\bar x \bar x^T]$, $\bar x(x) := \mathbb{E}_{x' \sim p_{aug}(\cdot|x)}[x']$, $X' = \mathbb{E}_x[\mathbb{V}_{x'|x}[x']]$, where $x'$ is an augmented view of $x$.
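For concreteness, a minimal Monte-Carlo sketch of these two matrices (my own illustration, not from the paper; the `augment` callback and `n_aug` are assumed):

```python
import numpy as np

def estimate_X_Xprime(xs, augment, n_aug=100):
    """Monte-Carlo estimates of X = E_x[x_bar x_bar^T] and X' = E_x[Var_{x'|x}[x']].

    xs:      data samples, shape (N, n1)
    augment: function returning one random augmented view x' ~ p_aug(.|x)
    """
    n1 = xs.shape[1]
    X = np.zeros((n1, n1))
    Xp = np.zeros((n1, n1))
    for x in xs:
        views = np.stack([augment(x) for _ in range(n_aug)])  # samples of x'
        x_bar = views.mean(axis=0)                            # x_bar(x)
        X += np.outer(x_bar, x_bar)
        centered = views - x_bar
        Xp += centered.T @ centered / n_aug                   # Var_{x'|x}[x']
    return X / len(xs), Xp / len(xs)
```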
Proof
- Weight decay can balance the predictor and online network weights (shown by removing $W_a$); see the worked consequence after this list.
  $$\alpha_p^{-1}\left[e^{2\eta_p t}\, W_p^T W_p - W_p^T(0) W_p(0)\right] = e^{2\eta t}\, W W^T - W(0) W^T(0)$$
- Stop-gradient can prevent collapse.
  $$H(t) := X' \otimes (W_p^T W_p + I) + X \otimes \tilde W_p^T \tilde W_p + \eta I_{n_1 n_2}, \quad \tilde W_p := W_p - I_{n_2}$$
  $$H(t) = X' \otimes (W_p^T W_p + I) + X \otimes (W_p^T W_p - 2 W_p + I) + \eta I_{n_1 n_2}$$
- Stop-gradient with no predictor ($W_p = I$) leads to collapse.
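As a worked consequence of the first identity above (assuming for simplicity equal weight decay $\eta_p = \eta > 0$, which is my simplification rather than the paper's statement), dividing both sides by $e^{2\eta t}$ gives

$$
\alpha_p^{-1} W_p^T(t) W_p(t) - W(t) W^T(t)
= e^{-2\eta t}\left[\alpha_p^{-1} W_p^T(0) W_p(0) - W(0) W^T(0)\right] \longrightarrow 0,
$$

so any initial imbalance between $W_p^T W_p$ and $\alpha_p\, W W^T$ decays at rate $2\eta$: this is the sense in which weight decay balances the two sets of weights.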
Multiple factors analysis
Three assumptions are made to decouple the matrix gradient dynamics into scalar dynamics.
Assumptions
Assumption 1 (Proportional EMA): the target (EMA) network stays proportional to the online network, $W_a(t) = \tau(t) W(t)$ for a scalar $\tau(t)$.
Validated by experiments in the paper.
Assumption 2 (Isotropic data and augmentation).
The data distribution $p(x)$ has zero mean and identity covariance.
The augmentation distribution $p_{aug}(\cdot|x)$ has mean $x$ and covariance $\sigma^2 I$.
Hence $X = I$, $X' = \sigma^2 I$.
(This assumption also appears in previous work.)
Assumption 3 (Symmetric predictor $W_p$)
$W_p = W_p^T$
- Motivation
  - The fixed point ($\dot W_p = 0$) is symmetric in some cases (for particular $\eta$ and $W W^T$).
  - Under Assumptions 1 and 2, the asymmetric part $W_p - W_p^T$ vanishes (for particular $\eta$, $\tau$).
- Experimental phenomena
  - BYOL: symmetric $W_p$ is slightly better than a regular one.
  - SimSiam: symmetric $W_p$ fails (why?).
  - $\bar\eta = 0.0004$, $\alpha_p = 1$.
Conclusion
Under the above assumptions, the eigenspaces (eigenvectors) of $F$ and $W_p$ gradually align (validated experimentally, Fig. 9 in the paper).
(When $\eta$ is small or zero and $\tau$ is large, the alignment vanishes.)
$F := W X W^T$. Note that $F$ is the correlation matrix of $W x_1$. By Assumption 2, $\mathbb{E}[x] = 0$, so $F$ is also the covariance matrix. (?)
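One simple way to check this alignment claim numerically (my own sanity-check sketch, not from the paper; `W_p` and `F` would come from a training run): diagonalize $F$ and measure how much of $W_p$'s energy sits on the diagonal in that basis.

```python
import numpy as np

def eigenspace_alignment(W_p, F):
    """Return a score in [0, 1]; 1 means W_p is exactly diagonal in F's eigenbasis."""
    _, U = np.linalg.eigh((F + F.T) / 2)   # eigenvectors of (symmetrized) F
    M = U.T @ W_p @ U                      # W_p expressed in F's eigenbasis
    diag_energy = np.sum(np.diag(M) ** 2)
    return diag_energy / np.sum(M ** 2)
```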
$$W_p = U \Lambda_{W_p} U^T, \quad F = U \Lambda_F U^T, \quad \dot W_p = U G_1 U^T, \quad \dot F = U G_2 U^T \quad (\dot U = 0)$$
$$\Lambda_{W_p} = \mathrm{diag}[p_1, p_2, \ldots, p_d], \quad \Lambda_F = \mathrm{diag}[s_1, s_2, \ldots, s_d]$$
Substituting these into the gradient equations gives the scalar dynamics for $p_j$ and $s_j$ (Eqn. 11 and Eqn. 12 in the paper).
Analysis of $\alpha_p$, $\eta$, $\beta$
(relative predictor learning rate, weight decay, and EMA parameter)
- From Eqn. 11 and Eqn. 12 (removing $\tau$), we have
  $$c_j = s_j(0) - \alpha_p^{-1} p_j^2(0),$$
  which closely resembles the earlier weight-balance identity.
  - When $\eta = 0$ and $\alpha_p$ is too small, $c_j < 0$ and $s_j \to 0$: the system might collapse.
  - When $\eta = 0$ and $\alpha_p$ is too large, $s_j(+\infty) = s_j(0)$.
  - When $\eta > 0$, $s_j$ and $\alpha_p^{-1} p_j^2$ balance: $s_j = \alpha_p^{-1} p_j^2$.
- When $\eta > 0$, substituting $s_j = \alpha_p^{-1} p_j^2$ into Eqn. 11 gives
  $$\dot p_j = p_j \Delta_j, \qquad \Delta_j := p_j\left[\tau - (1+\sigma^2) p_j\right] - \eta$$
Setting $\dot p_j = 0$ gives the fixed points: $p_{j0} = 0$ and the roots of $\Delta_j = 0$.
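Solving the quadratic $\Delta_j = 0$ explicitly (simple algebra on the definition above):

$$
p_{j\pm} = \frac{\tau \pm \sqrt{\tau^2 - 4\eta(1+\sigma^2)}}{2(1+\sigma^2)},
$$

with real solutions only when $\eta \le \frac{\tau^2}{4(1+\sigma^2)}$, which matches the collapse condition below.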
The plot of $\dot p_j$ as a function of $p_j$ is shown below: $p_{j-}$ is unstable, $p_{j0} = 0$ is the collapsed solution, and $p_{j+}$ is stable.
(Figure: curve of $\dot p_j$ versus $p_j$ showing the fixed points $p_{j0}$, $p_{j-}$, $p_{j+}$; original screenshot not available.)
- A larger $\eta$ or a smaller $\tau$ moves $p_{j-}$ to the right, so the basin of collapse expands.
- When $\eta > \frac{\tau^2}{4(1+\sigma^2)}$, collapse is unavoidable.
- To satisfy eigenspace alignment, a condition involving
  $$\Delta_j := p_j\left[\tau - (1+\sigma^2) p_j\right] - \eta$$
  has to hold, so a larger $\eta$ and a larger $\alpha_p$ can loosen the bound.
- Curriculum learning: initially $p_j$ and $s_j$ are small, and since $W$ changes rapidly, $\tau$ is also small. When $p_j$ approaches its stable fixed point $p_j^+$, $p_j$ and $s_j$ stop growing, $\tau$ becomes larger, and this in turn sets a higher $p_j^+$.
- $\alpha_p > 1$, $\eta_p > \eta_s$ can make a symmetric $W_p$ work without EMA (SimSiam)? (Perhaps because the eigenspace-alignment condition is easier to satisfy?)
Motivation
- satisfy eigenspace alignment directly
- initialize W p W_p Wp outside the basin of collapse.
Implementation
Estimate $\hat F$ by a moving average of $\mathbb{E}_B[f f^T]$, where $f = W x$ and $\mathbb{E}_B[\cdot]$ is the expectation over a batch.
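A minimal sketch of this update in NumPy; the moving-average rate `rho`, the eigenvalue floor `eps`, and the exact mapping from $s_j$ to $p_j$ (here $p_j \propto \sqrt{s_j}$, consistent with the balance relation $s_j = \alpha_p^{-1} p_j^2$ above) are my assumptions, not the paper's exact recipe:

```python
import numpy as np

def update_predictor(F_hat, f_batch, rho=0.3, eps=0.1):
    """Recompute the predictor W_p from a running estimate of F = E[f f^T].

    F_hat:   current running estimate of F, shape (n2, n2)
    f_batch: online-network outputs f = W x for one batch, shape (B, n2)
    rho:     moving-average rate (assumed hyperparameter)
    eps:     floor added to the normalized eigenvalues (assumed hyperparameter)
    """
    # moving-average estimate of the correlation matrix F
    F_batch = f_batch.T @ f_batch / len(f_batch)      # E_B[f f^T]
    F_hat = (1 - rho) * F_hat + rho * F_batch

    # eigendecompose F_hat; W_p will share its eigenvectors,
    # so eigenspace alignment holds by construction
    s, U = np.linalg.eigh(F_hat)
    s = np.clip(s, 0.0, None)

    # set predictor eigenvalues from s_j (p_j ~ sqrt(s_j) matches s_j = alpha_p^{-1} p_j^2)
    s_max = s.max()
    p = np.sqrt(s / s_max) + eps if s_max > 0 else np.full_like(s, eps)
    W_p = U @ np.diag(p) @ U.T
    return F_hat, W_p
```

Because $W_p$ is built from $\hat F$'s eigenvectors, the eigenspace-alignment condition holds by construction, and the floor `eps` keeps every $p_j$ away from the collapsed fixed point $p_{j0} = 0$.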
Problem
Most of the analysis assumes a symmetric $W_p$, but in practice a regular (non-symmetric) $W_p$ also works.