【Continual Learning】Encoder Based Lifelong Learning

Encoder Based Lifelong Learning

  • ICCV 2017
  • Amal Rannen Triki, Rahaf Aljundi, Matthew B. Blaschko, Tinne Tuytelaars (KU Leuven, ESAT-PSI, iMinds, Belgium)

Question

Our method aims at preserving the knowledge of the previous tasks while learning a new one by using autoencoders.

The problem is defined as: how can autoencoders be used to alleviate catastrophic forgetting?

Achievement

[figure]

3.1. Joint training

$F$: the feature extractor, whose parameters are shared across tasks;

$T$: the task operator, which maps the shared features toward a specific task;

$T_i$: the task head for task $i$, usually a fully connected layer.

$$\sum_{t=1}^{\mathcal{T}} \frac{1}{N_{t}} \sum_{i=1}^{N_{t}} \ell\left(f_{t}\left(X_{i}^{(t)}\right), Y_{i}^{(t)}\right)$$

Training all tasks jointly with the loss above gives an upper bound on the model's multi-task performance.
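
As a concrete illustration, here is a minimal PyTorch sketch of this joint multi-head objective; the dimensions and module names (backbone, heads) are illustrative assumptions, not the paper's code.

import torch
import torch.nn as nn

# Sketch of joint (multi-task) training: a shared backbone (F and T)
# plus one linear head T_t per task. All names/dims are illustrative.
backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())       # F, T (shared)
heads = nn.ModuleList([nn.Linear(64, 10) for _ in range(3)])  # T_t per task
criterion = nn.CrossEntropyLoss()                             # l, mean over N_t

def joint_loss(batches):
    # batches: one (x, y) minibatch per task t
    total = 0.0
    for t, (x, y) in enumerate(batches):
        total = total + criterion(heads[t](backbone(x)), y)
    return total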

3.2. Shortcomings of Learning without Forgetting

[figure]

The main idea of LwF (Learning without Forgetting) is fine-tuning combined with distillation.

This paper builds on and refines LwF. In LwF's notation, $F$ corresponds to $\theta_s$ and the $T_i$ to $\theta_o$. During training, LwF feeds the new task's data through the old model to obtain $feature_{old}$, the old model's outputs on the new data. A new model is then trained on the new data so that it is accurate on the new task while keeping its outputs $feature_{new}$ close to $feature_{old}$; a single model can then perform well on both the old and the new tasks.

If the old and new data distributions are close, LwF works well, because $feature_{old}$ approximates the distribution of the old data's features. But when the two distributions differ substantially, LwF's performance drops.

$$\mathbb{E}_{\mathcal{X}^{(1)}}\left[\ell\left(T_{1} \circ T \circ F\left(\mathcal{X}^{(1)}\right),\; T_{1}^{*} \circ T^{*} \circ F^{*}\left(\mathcal{X}^{(1)}\right)\right)\right]$$

A superscript $*$ marks a model trained on its task in isolation; no superscript marks the model obtained by lifelong learning. The expression above is the generalization error on task 1 after lifelong learning.
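
For reference, a hedged sketch of one LwF-style training step on new-task data; it assumes old_logits were recorded beforehand by running the batch through the frozen old model's first head, and backbone/new_head/old_head are illustrative module names. The temperature of 2 follows LwF.

import torch.nn.functional as F

def lwf_step(backbone, new_head, old_head, x, y, old_logits, theta=2.0):
    feat = backbone(x)
    ce = F.cross_entropy(new_head(feat), y)             # new-task loss
    soft_target = F.softmax(old_logits / theta, dim=1)  # frozen targets
    log_pred = F.log_softmax(old_head(feat) / theta, dim=1)
    dist = -(soft_target * log_pred).sum(dim=1).mean()  # distillation
    return ce + dist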

3.3. Informative feature preservation

Following LwF, the preserved $feature_{old}$ is the core of lifelong learning. Preserving $feature_{old}$ in full, however, is too conservative; only its most informative part should be constrained, leaving the rest free for flexibility. How to do this? Find that part with an autoencoder.

3.3.1 Learning the informative submanifold with Autoencoders

[figure]

For each task, an additional autoencoder is built on top of the features: the reconstructed feature should stay close to the original feature, and it should also still serve the original task well. Here $r$ denotes the autoencoder's reconstruction mapping.

$$\begin{aligned} \arg \min_{r}\ \mathbb{E}_{\left(\mathcal{X}^{(1)}, \mathcal{Y}^{(1)}\right)}\Big[ &\lambda\left\|r\left(F^{*}\left(\mathcal{X}^{(1)}\right)\right)-F^{*}\left(\mathcal{X}^{(1)}\right)\right\|_{2} \\ &+\ell\left(T_{1}^{*} \circ T^{*}\left(r\left(F^{*}\left(\mathcal{X}^{(1)}\right)\right)\right), \mathcal{Y}^{(1)}\right)\Big] \end{aligned}$$

$\lambda = 10^{-6}$ lets both the autoencoder and the baseline converge.
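
A hedged sketch of this objective: feat = F*(x) and frozen_head = T1* ∘ T* are assumed to come from the frozen task-1 model, and autoencoder is an instance of the AutoEncoder class given in the code section below. All names are illustrative.

import torch
import torch.nn.functional as F

def ae_objective(autoencoder, frozen_head, feat, y, lam=1e-6):
    recon = autoencoder(feat)                         # r(F*(x))
    recon_term = lam * torch.norm(recon - feat, p=2)  # L2 reconstruction
    task_term = F.cross_entropy(frozen_head(recon), y)  # task-1 loss
    return recon_term + task_term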

[figure]

3.3.2 Representation control with separate task operators

The feature code produced by the autoencoder keeps the important components of the feature; constraining the change of $F$ through this code, instead of through the full feature, is more flexible.

3.3.3 Representation control with shared task operator

$$\begin{aligned} \mathcal{R} =\mathbb{E}\Big[ &\ell\left(T_{2} \circ T \circ F\left(\mathcal{X}^{(2)}\right), \mathcal{Y}^{(2)}\right) \\ &+\ell_{\text{dist}}\left(T_{1} \circ T \circ F\left(\mathcal{X}^{(2)}\right), T_{1}^{*} \circ T^{*} \circ F^{*}\left(\mathcal{X}^{(2)}\right)\right) \\ &+\frac{\alpha}{2}\left\|\sigma\left(W_{\text{enc}} F\left(\mathcal{X}^{(2)}\right)\right)-\sigma\left(W_{\text{enc}} F^{*}\left(\mathcal{X}^{(2)}\right)\right)\right\|_{2}^{2}\Big] \end{aligned}$$

In the expression above, the second term is the distillation loss, which limits how much training on the second task can alter the model for the first task.

The third term is the code loss: the code extracted by the encoder also acts as a constraint, but a more flexible one.

One constraint is supervised (distillation) and the other unsupervised (the code loss); together they mitigate catastrophic forgetting.

$$\ell_{\text{dist}}\left(\hat{\mathcal{Y}}, \mathcal{Y}^{*}\right)=-\left\langle\mathcal{Z}^{*}, \log \hat{\mathcal{Z}}\right\rangle$$

$$\mathcal{Z}_{i}^{*}=\frac{\mathcal{Y}_{i}^{*\,1/\theta}}{\sum_{j} \mathcal{Y}_{j}^{*\,1/\theta}} \quad \text{and} \quad \hat{\mathcal{Z}}_{i}=\frac{\hat{\mathcal{Y}}_{i}^{1/\theta}}{\sum_{j} \hat{\mathcal{Y}}_{j}^{1/\theta}}$$

By tuning $\theta$, the distillation loss raises the weight of the smaller outputs and lowers that of the larger ones, which softens the effect of a shifted data distribution.

The hyperparameter $\alpha$ ($\alpha = 10^{-3}$ for ImageNet and $10^{-2}$ for the remaining tasks) balances forgetting against prediction on the new task.
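
A hedged sketch of the code-loss term above: only the frozen encoder $\sigma(W_{\text{enc}}\,\cdot)$ of the task-1 autoencoder is used; feat_new = F(x) is trainable while feat_old = F*(x) comes from the frozen old model. Names are illustrative.

import torch

def code_loss_term(encoder, feat_new, feat_old, alpha):
    code_new = encoder(feat_new)      # sigma(W_enc F(x)), gradients flow to F
    with torch.no_grad():
        code_old = encoder(feat_old)  # sigma(W_enc F*(x)), fixed target
    return 0.5 * alpha * ((code_new - code_old) ** 2).sum(dim=1).mean()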

3.4. Training procedure
$$\begin{aligned} R_{N} =\frac{1}{N} \sum_{i=1}^{N}\Big( &\ell\left(T_{\mathcal{T}} \circ T \circ F\left(X_{i}^{(\mathcal{T})}\right), Y_{i}^{(\mathcal{T})}\right) \\ &+\sum_{t=1}^{\mathcal{T}-1} \ell_{\text{dist}}\left(T_{t} \circ T \circ F\left(X_{i}^{(\mathcal{T})}\right), T_{t}^{*} \circ T^{*} \circ F^{*}\left(X_{i}^{(\mathcal{T})}\right)\right) \\ &+\sum_{t=1}^{\mathcal{T}-1} \frac{\alpha_{t}}{2}\left\|\sigma\left(W_{\text{enc}, t} F\left(X_{i}^{(\mathcal{T})}\right)\right)-\sigma\left(W_{\text{enc}, t} F^{*}\left(X_{i}^{(\mathcal{T})}\right)\right)\right\|_{2}^{2}\Big) \end{aligned}$$

[figure]

Methodology

Combining the distillation loss with the code loss, the paper reports strong experimental results.

[figures: experimental results]

Given a model pre-trained on Flowers, train it on Birds and track the distance between the Flowers model's outputs before and after each update. The class loss and the distillation/code losses compete: the class loss takes effect first, driving the distance up from 18 to 83, after which the constraint losses pull the distance back down.

Harvest

Multi-task learning addresses lifelong learning by adding an output head per task on top of a shared backbone, but the error accumulates as more tasks are added.

EWC (Elastic Weight Consolidation) adds a constraint loss on parameter updates, $\sum_{i} \frac{\lambda}{2} F_{i}\left(W_{i}-W_{1, i}^{*}\right)^{2}$, which keeps the parameters in an equilibrium between the old and the new task. But it has two problems:

  1. The parameter constraints are applied independently, each keeping a new parameter near its original value. Might a non-independent, joint constraint on the parameters achieve the same effect, or even a better one?
  2. A model still has to be stored for every task.

Fisher information reflects how accurately we estimate a parameter: the larger it is, the more accurate the estimate, i.e., the more information the parameter carries. $F_i$ is a diagonal entry of the Fisher information matrix; a large $F_i$ prevents weights that are important for old tasks from changing too much.
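
A hedged sketch of the EWC penalty: fisher and old_params are dicts of fixed tensors saved after training task 1, holding the diagonal Fisher entries $F_i$ and the task-1 optimum $W_{1,i}^{*}$. Names are illustrative.

import torch

def ewc_penalty(model, fisher, old_params, lam):
    loss = 0.0
    for name, p in model.named_parameters():
        # lambda/2 * F_i * (W_i - W*_{1,i})^2, summed over all parameters
        loss = loss + (lam / 2) * (fisher[name] * (p - old_params[name]) ** 2).sum()
    return loss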

In principle, a pre-trained model should:

  1. Preserve accuracy on past tasks;
  2. Use the inductive bias of the new task's data (inductive bias; "The need for biases in learning generalizations") to improve accuracy on past tasks;
  3. Use prior knowledge to improve accuracy on the new task.

code

autoencoder

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, x_dim, h1_dim):
        super(AutoEncoder, self).__init__()
        # fc layer for the encoder: sigma(W_enc x)
        self.encode = nn.Sequential(
            nn.Linear(x_dim, h1_dim),
            nn.Sigmoid())

        # fc layer for the decoder: W_dec h
        self.decode = nn.Sequential(
            nn.Linear(h1_dim, x_dim))

    def forward(self, x):
        h = self.encode(x)            # feature code
        x_recon = self.decode(h)      # reconstruction r(x)
        return x_recon

loss

import torch
import torch.nn as nn
import torch.nn.functional as F

code_loss = nn.MSELoss()               # code loss between encoder codes
classify_loss = nn.CrossEntropyLoss()  # class loss for the new task

def distillation_loss(task, target, T):
    """
    Soft cross-entropy between temperature-scaled softmax outputs:
    returns -<softmax(target / T), log_softmax(task / T)>.
    :param task: logits of the old head under the current model
    :param target: logits of the old head under the frozen source model
    :param T: temperature theta; T > 1 softens both distributions
              (LwF uses T = 2)
    :return: scalar distillation loss
    """
    target = target / T
    target_item = F.softmax(target, dim=1)

    task = task / T
    task_item = F.log_softmax(task, dim=1)

    loss = torch.sum(-target_item * task_item, 1).mean()
    return loss
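
A hedged usage sketch tying the pieces together for one task-2 training step, following the objective of §3.4: backbone_new, backbone_old, head1_new, head1_old, head2, and ae1 (the task-1 AutoEncoder above) are illustrative assumptions, with the old model and ae1 frozen; nn.MSELoss stands in for the squared L2 code distance up to a constant factor.

# One training step on task-2 data (x, y); all module names are
# illustrative assumptions, not the paper's code.
feat_new = backbone_new(x)                   # F(x), trainable
with torch.no_grad():
    feat_old = backbone_old(x)               # F*(x), frozen
    code_old = ae1.encode(feat_old)          # sigma(W_enc F*(x)), fixed
loss = classify_loss(head2(feat_new), y)                     # new-task loss
loss = loss + distillation_loss(head1_new(feat_new),
                                head1_old(feat_old), T=2)    # dist loss
loss = loss + (alpha / 2) * code_loss(ae1.encode(feat_new),
                                      code_old)              # code loss
loss.backward()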

Reference

Image retrieval: Fisher Information Matrix and Fisher Kernel

github
