【Continual Learning】Encoder Based Lifelong Learning

Encoder Based Lifelong Learning

  • ICCV 2017
  • Amal Rannen Triki, Rahaf Aljundi, Matthew B. Blaschko, Tinne Tuytelaars (KU Leuven, ESAT-PSI, iMinds, Belgium)

Question

Our method aims at preserving the knowledge of the previous tasks while learning a new one by using autoencoders.

The problem is defined as: how can autoencoders be used to alleviate catastrophic forgetting?

Achievement

[figure]

3.1. Joint training

$F$: the feature extractor, whose parameters are shared across tasks;

$T$: the task operator, which maps the shared features toward a specific task;

$T_i$: the task head for task $i$, usually a fully connected layer.

$$\sum_{t=1}^{\mathcal{T}} \frac{1}{N_{t}} \sum_{i=1}^{N_{t}} \ell\left(f_{t}\left(X_{i}^{(t)}\right), Y_{i}^{(t)}\right)$$

Training all tasks jointly with the loss above gives an upper bound on the model's multi-task performance.
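
As a concrete illustration, here is a minimal PyTorch sketch of this joint multi-head objective; the dimensions and module names (backbone, heads) are illustrative assumptions, not the paper's code.

import torch
import torch.nn as nn

# Sketch of joint (multi-task) training: a shared backbone (F and T)
# plus one linear head T_t per task. All names/dims are illustrative.
backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())       # F, T (shared)
heads = nn.ModuleList([nn.Linear(64, 10) for _ in range(3)])  # T_t per task
criterion = nn.CrossEntropyLoss()                             # l, mean over N_t

def joint_loss(batches):
    # batches: one (x, y) minibatch per task t
    total = 0.0
    for t, (x, y) in enumerate(batches):
        total = total + criterion(heads[t](backbone(x)), y)
    return total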

3.2. Shortcomings of Learning without Forgetting

[figure]

The main idea of LwF (Learning without Forgetting) is fine-tuning combined with distillation.

This paper builds on and refines LwF. In LwF's notation, $F$ corresponds to $\theta_s$ and the $T_i$ to $\theta_o$. During training, LwF feeds the new task's data through the old model to obtain $feature_{old}$, the old model's outputs on the new data. A new model is then trained on the new data so that it is accurate on the new task while keeping its outputs $feature_{new}$ close to $feature_{old}$; a single model can then perform well on both the old and the new tasks.

If the old and new data distributions are close, LwF works well, because $feature_{old}$ approximates the distribution of the old data's features. But when the two distributions differ substantially, LwF's performance drops.

$$\mathbb{E}_{\mathcal{X}^{(1)}}\left[\ell\left(T_{1} \circ T \circ F\left(\mathcal{X}^{(1)}\right),\; T_{1}^{*} \circ T^{*} \circ F^{*}\left(\mathcal{X}^{(1)}\right)\right)\right]$$

A superscript $*$ marks a model trained on its task in isolation; no superscript marks the model obtained by lifelong learning. The expression above is the generalization error on task 1 after lifelong learning.
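
For reference, a hedged sketch of one LwF-style training step on new-task data; it assumes old_logits were recorded beforehand by running the batch through the frozen old model's first head, and backbone/new_head/old_head are illustrative module names. The temperature of 2 follows LwF.

import torch.nn.functional as F

def lwf_step(backbone, new_head, old_head, x, y, old_logits, theta=2.0):
    feat = backbone(x)
    ce = F.cross_entropy(new_head(feat), y)             # new-task loss
    soft_target = F.softmax(old_logits / theta, dim=1)  # frozen targets
    log_pred = F.log_softmax(old_head(feat) / theta, dim=1)
    dist = -(soft_target * log_pred).sum(dim=1).mean()  # distillation
    return ce + dist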

3.3. Informative feature preservation

Following LwF, the preserved $feature_{old}$ is the core of lifelong learning. Preserving $feature_{old}$ in full, however, is too conservative; only its most informative part should be constrained, leaving the rest free for flexibility. How to do this? Find that part with an autoencoder.

3.3.1 Learning the informative submanifold with Autoencoders

[figure]

For each task, an additional autoencoder is built on top of the features: the reconstructed feature should stay close to the original feature, and it should also still serve the original task well. Here $r$ denotes the autoencoder's reconstruction mapping.

$$\begin{aligned} \arg \min_{r}\ \mathbb{E}_{\left(\mathcal{X}^{(1)}, \mathcal{Y}^{(1)}\right)}\Big[ &\lambda\left\|r\left(F^{*}\left(\mathcal{X}^{(1)}\right)\right)-F^{*}\left(\mathcal{X}^{(1)}\right)\right\|_{2} \\ &+\ell\left(T_{1}^{*} \circ T^{*}\left(r\left(F^{*}\left(\mathcal{X}^{(1)}\right)\right)\right), \mathcal{Y}^{(1)}\right)\Big] \end{aligned}$$

$\lambda = 10^{-6}$ lets both the autoencoder and the baseline converge.
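
A hedged sketch of this objective: feat = F*(x) and frozen_head = T1* ∘ T* are assumed to come from the frozen task-1 model, and autoencoder is an instance of the AutoEncoder class given in the code section below. All names are illustrative.

import torch
import torch.nn.functional as F

def ae_objective(autoencoder, frozen_head, feat, y, lam=1e-6):
    recon = autoencoder(feat)                         # r(F*(x))
    recon_term = lam * torch.norm(recon - feat, p=2)  # L2 reconstruction
    task_term = F.cross_entropy(frozen_head(recon), y)  # task-1 loss
    return recon_term + task_term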

[figure]

3.3.2 Representation control with separate task operators

The feature code produced by the autoencoder keeps the important components of the feature; constraining the change of $F$ through this code, instead of through the full feature, is more flexible.

3.3.3 Representation control with shared task operator

$$\begin{aligned} \mathcal{R} =\mathbb{E}\Big[ &\ell\left(T_{2} \circ T \circ F\left(\mathcal{X}^{(2)}\right), \mathcal{Y}^{(2)}\right) \\ &+\ell_{\text{dist}}\left(T_{1} \circ T \circ F\left(\mathcal{X}^{(2)}\right), T_{1}^{*} \circ T^{*} \circ F^{*}\left(\mathcal{X}^{(2)}\right)\right) \\ &+\frac{\alpha}{2}\left\|\sigma\left(W_{\text{enc}} F\left(\mathcal{X}^{(2)}\right)\right)-\sigma\left(W_{\text{enc}} F^{*}\left(\mathcal{X}^{(2)}\right)\right)\right\|_{2}^{2}\Big] \end{aligned}$$

In the expression above, the second term is the distillation loss, which limits how much training on the second task can alter the model for the first task.

The third term is the code loss: the code extracted by the encoder also acts as a constraint, but a more flexible one.

One constraint is supervised (distillation) and the other unsupervised (the code loss); together they mitigate catastrophic forgetting.

$$\ell_{\text{dist}}\left(\hat{\mathcal{Y}}, \mathcal{Y}^{*}\right)=-\left\langle\mathcal{Z}^{*}, \log \hat{\mathcal{Z}}\right\rangle$$

$$\mathcal{Z}_{i}^{*}=\frac{\mathcal{Y}_{i}^{*\,1/\theta}}{\sum_{j} \mathcal{Y}_{j}^{*\,1/\theta}} \quad \text{and} \quad \hat{\mathcal{Z}}_{i}=\frac{\hat{\mathcal{Y}}_{i}^{1/\theta}}{\sum_{j} \hat{\mathcal{Y}}_{j}^{1/\theta}}$$

By tuning $\theta$, the distillation loss raises the weight of the smaller outputs and lowers that of the larger ones, which softens the effect of a shifted data distribution.

The hyperparameter $\alpha$ ($\alpha = 10^{-3}$ for ImageNet and $10^{-2}$ for the remaining tasks) balances forgetting against prediction on the new task.
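
A hedged sketch of the code-loss term above: only the frozen encoder $\sigma(W_{\text{enc}}\,\cdot)$ of the task-1 autoencoder is used; feat_new = F(x) is trainable while feat_old = F*(x) comes from the frozen old model. Names are illustrative.

import torch

def code_loss_term(encoder, feat_new, feat_old, alpha):
    code_new = encoder(feat_new)      # sigma(W_enc F(x)), gradients flow to F
    with torch.no_grad():
        code_old = encoder(feat_old)  # sigma(W_enc F*(x)), fixed target
    return 0.5 * alpha * ((code_new - code_old) ** 2).sum(dim=1).mean()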

3.4. Training procedure
$$\begin{aligned} R_{N} =\frac{1}{N} \sum_{i=1}^{N}\Big( &\ell\left(T_{\mathcal{T}} \circ T \circ F\left(X_{i}^{(\mathcal{T})}\right), Y_{i}^{(\mathcal{T})}\right) \\ &+\sum_{t=1}^{\mathcal{T}-1} \ell_{\text{dist}}\left(T_{t} \circ T \circ F\left(X_{i}^{(\mathcal{T})}\right), T_{t}^{*} \circ T^{*} \circ F^{*}\left(X_{i}^{(\mathcal{T})}\right)\right) \\ &+\sum_{t=1}^{\mathcal{T}-1} \frac{\alpha_{t}}{2}\left\|\sigma\left(W_{\text{enc}, t} F\left(X_{i}^{(\mathcal{T})}\right)\right)-\sigma\left(W_{\text{enc}, t} F^{*}\left(X_{i}^{(\mathcal{T})}\right)\right)\right\|_{2}^{2}\Big) \end{aligned}$$

[figure]

Methodology

Combining the distillation loss with the code loss, the paper reports strong experimental results.

[figures: experimental results]

Given a model pre-trained on Flowers, train it on Birds and track the distance between the Flowers model's outputs before and after each update. The class loss and the distillation/code losses compete: the class loss takes effect first, driving the distance up from 18 to 83, after which the constraint losses pull the distance back down.

Harvest

Multi-task learning addresses lifelong learning by adding an output head per task on top of a shared backbone, but the error accumulates as more tasks are added.

EWC (Elastic Weight Consolidation) adds a constraint loss on parameter updates, $\sum_{i} \frac{\lambda}{2} F_{i}\left(W_{i}-W_{1, i}^{*}\right)^{2}$, which keeps the parameters in an equilibrium between the old and the new task. But it has two problems:

  1. The parameter constraints are applied independently, each keeping a new parameter near its original value. Might a non-independent, joint constraint on the parameters achieve the same effect, or even a better one?
  2. A model still has to be stored for every task.

Fisher information reflects how accurately we estimate a parameter: the larger it is, the more accurate the estimate, i.e., the more information the parameter carries. $F_i$ is a diagonal entry of the Fisher information matrix; a large $F_i$ prevents weights that are important for old tasks from changing too much.
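
A hedged sketch of the EWC penalty: fisher and old_params are dicts of fixed tensors saved after training task 1, holding the diagonal Fisher entries $F_i$ and the task-1 optimum $W_{1,i}^{*}$. Names are illustrative.

import torch

def ewc_penalty(model, fisher, old_params, lam):
    loss = 0.0
    for name, p in model.named_parameters():
        # lambda/2 * F_i * (W_i - W*_{1,i})^2, summed over all parameters
        loss = loss + (lam / 2) * (fisher[name] * (p - old_params[name]) ** 2).sum()
    return loss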

In principle, a pre-trained model should:

  1. Preserve accuracy on past tasks;
  2. Use the inductive bias of the new task's data (inductive bias; "The need for biases in learning generalizations") to improve accuracy on past tasks;
  3. Use prior knowledge to improve accuracy on the new task.

code

autoencoder

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, x_dim, h1_dim):
        super(AutoEncoder, self).__init__()
        # fc layer for the encoder: sigma(W_enc x)
        self.encode = nn.Sequential(
            nn.Linear(x_dim, h1_dim),
            nn.Sigmoid())

        # fc layer for the decoder: W_dec h
        self.decode = nn.Sequential(
            nn.Linear(h1_dim, x_dim))

    def forward(self, x):
        h = self.encode(x)            # feature code
        x_recon = self.decode(h)      # reconstruction r(x)
        return x_recon

loss

import torch
import torch.nn as nn
import torch.nn.functional as F

code_loss = nn.MSELoss()               # code loss between encoder codes
classify_loss = nn.CrossEntropyLoss()  # class loss for the new task

def distillation_loss(task, target, T):
    """
    Soft cross-entropy between temperature-scaled softmax outputs:
    returns -<softmax(target / T), log_softmax(task / T)>.
    :param task: logits of the old head under the current model
    :param target: logits of the old head under the frozen source model
    :param T: temperature theta; T > 1 softens both distributions
              (LwF uses T = 2)
    :return: scalar distillation loss
    """
    target = target / T
    target_item = F.softmax(target, dim=1)

    task = task / T
    task_item = F.log_softmax(task, dim=1)

    loss = torch.sum(-target_item * task_item, 1).mean()
    return loss
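
A hedged usage sketch tying the pieces together for one task-2 training step, following the objective of §3.4: backbone_new, backbone_old, head1_new, head1_old, head2, and ae1 (the task-1 AutoEncoder above) are illustrative assumptions, with the old model and ae1 frozen; nn.MSELoss stands in for the squared L2 code distance up to a constant factor.

# One training step on task-2 data (x, y); all module names are
# illustrative assumptions, not the paper's code.
feat_new = backbone_new(x)                   # F(x), trainable
with torch.no_grad():
    feat_old = backbone_old(x)               # F*(x), frozen
    code_old = ae1.encode(feat_old)          # sigma(W_enc F*(x)), fixed
loss = classify_loss(head2(feat_new), y)                     # new-task loss
loss = loss + distillation_loss(head1_new(feat_new),
                                head1_old(feat_old), T=2)    # dist loss
loss = loss + (alpha / 2) * code_loss(ae1.encode(feat_new),
                                      code_old)              # code loss
loss.backward()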

Reference

Image retrieval: Fisher Information Matrix and Fisher Kernel

github
