Meta Learning: Learn to learn

Article link: https://blog.csdn.net/weixin_42437114/article/details/120421830

These are notes on Hung-yi Lee's 2021 ML course

Introduction of Meta Learning

What is Meta Learning?

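In meta learning, the input is a set of tasks to learn from (training tasks); what is learned is a learning algorithm $F$ that, given a task's training examples, outputs a trained model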

3 steps: (1) Function with unknown; (2) Define loss function; (3) Optimization

Meta Learning – Step 1

What is learnable in a learning algorithm? - e.g. In DL: Net Architecture, Initial Parameters, Learning Rate… (In meta, we will try to learn some of them. We can categorize meta learning based on what is learnable)

Meta Learning – Step 2

Define loss function $L(\phi)$ for learning algorithm $F_\phi$

First, we have a collection of training tasks (analogous to the training data in ML) used to train $F_\phi$
Across-task training (includes within-task training and testing): for each training task, we apply $F_\phi$ to the task's training examples (support set) to learn a model $f_{\theta^*}$, then compute the loss $l$ on the task's testing examples (query set)

$\theta^{1*}$ : parameters of the classifier learned by $F_\phi$ using the training examples of task 1

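Finally, summing the losses computed on all training tasks gives the overall loss: $L(\phi)=\sum_{n=1}^N l^n$, where $l^n$ is the query-set loss of task $n$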
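The across-task loss can be sketched on a toy problem. In this minimal sketch each "model" is a single scalar fitted by mean squared error, and the learning algorithm $F_\phi$ is assumed to be one gradient step starting from $\phi$; the tasks, the step size `lr`, and all function names are made up for illustration and do not come from the lecture.

```python
def mse(theta, xs):
    """Mean squared error of the constant prediction theta on the points xs."""
    return sum((theta - x) ** 2 for x in xs) / len(xs)

def mse_grad(theta, xs):
    """Derivative of mse with respect to theta."""
    return sum(2 * (theta - x) for x in xs) / len(xs)

def learn(phi, support, lr=0.25, steps=1):
    """Toy learning algorithm F_phi: gradient descent on the support set, starting from phi."""
    theta = phi
    for _ in range(steps):
        theta -= lr * mse_grad(theta, support)
    return theta

def meta_loss(phi, tasks):
    """L(phi): sum over tasks of the query-set loss of the model learned on the support set."""
    total = 0.0
    for support, query in tasks:
        theta_star = learn(phi, support)   # within-task training
        total += mse(theta_star, query)    # within-task testing
    return total

# Each task is a (support set, query set) pair of made-up data points.
tasks = [
    ([1.0, 3.0], [2.0]),
    ([-3.0, -1.0], [-2.0]),
]
print(meta_loss(0.0, tasks), meta_loss(10.0, tasks))
```

Here a smaller `meta_loss` means that $\phi$ is a better starting point for these two tasks, which is exactly what Step 3 optimizes.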

Meta Learning – Step 3

Find $\phi$ that can minimize $L(\phi)$
$\phi^*=\arg\min_\phi L(\phi)$
- If you know how to compute $\frac{\partial L(\phi)}{\partial \phi}$ , Gradient descent is your friend.
- What if $L(\phi)$ is not differentiable? – Reinforcement Learning / Evolutionary Algorithm

Framework

Once a learning algorithm $F_{\phi^*}$ has been trained on the training tasks, it must be evaluated on testing tasks

Meta Learning vs. ML

What you know about ML can usually apply to meta learning
- Overfitting on training tasks $\Rightarrow$ (1) Get more training tasks to improve performance; (2) Task augmentation
- There are also hyperparameters when learning a learning algorithm ... $\Rightarrow$ We also need development tasks (analogous to the validation set in ML, used to select hyperparameters)

Applications

Few-shot Image Classification

Few-shot Image Classification: Each class only has a few images.
- $N$-way $K$-shot classification: In each task, there are $N$ classes, each with $K$ examples.

Omniglot

In meta learning, you need to prepare many $N$-way $K$-shot tasks as training and testing tasks. The most commonly used dataset for this is Omniglot: 1623 characters, each with 20 examples
Split your characters into training and testing characters
- Sample $N$ training characters, sample $K$ examples from each sampled character → one training task
- Sample $N$ testing characters, sample $K$ examples from each sampled character → one testing task
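The sampling procedure can be sketched as follows. The dataset here is a placeholder dictionary of strings standing in for Omniglot images, the 1200-character training split is an assumed size, and the extra `n_query` examples per class (used as the query set) are an assumption added so that each task has both a support and a query set; `sample_task` is a hypothetical helper, not part of any library.

```python
import random

def sample_task(characters, n_way, k_shot, n_query, rng):
    """Build one N-way K-shot task from a dict {character: [examples]}."""
    classes = rng.sample(sorted(characters), n_way)
    support, query = [], []
    for label, char in enumerate(classes):
        examples = rng.sample(characters[char], k_shot + n_query)
        support += [(ex, label) for ex in examples[:k_shot]]   # K examples per class
        query += [(ex, label) for ex in examples[k_shot:]]     # held-out examples per class
    return support, query

# Placeholder stand-in for Omniglot: 1623 characters with 20 examples each.
dataset = {f"char{c}": [f"char{c}_img{i}" for i in range(20)] for c in range(1623)}

rng = random.Random(0)
names = sorted(dataset)
rng.shuffle(names)
train_chars = {c: dataset[c] for c in names[:1200]}   # assumed split size
test_chars = {c: dataset[c] for c in names[1200:]}

# One 5-way 1-shot training task and one testing task.
train_support, train_query = sample_task(train_chars, n_way=5, k_shot=1, n_query=5, rng=rng)
test_support, test_query = sample_task(test_chars, n_way=5, k_shot=1, n_query=5, rng=rng)
```

Because training and testing tasks are drawn from disjoint character sets, a testing task always involves classes never seen during meta-training.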


What is learnable in a learning algorithm?

Learning to initialize

Learn how to initialize the network parameters

Model-Agnostic Meta-Learning (MAML)

MAML is pronounced like "mammal"

paper: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

When training on the training tasks, the parameters are updated only once:
- Why? (1) Fast (2) It is good to truly train a model with one step (3) When the learned initialization is used at test time, the parameters can still be updated many times (4) Few-shot learning has limited data.

Gradient descent

MAML uses gradient descent to update $\phi$, so we need the gradient $\nabla_\phi L(\phi)$:
$\nabla_\phi L(\phi)=\nabla_\phi \sum_{n=1}^N l^n(\hat \theta^n)=\sum_{n=1}^N\nabla_\phi l^n(\hat \theta^n)$
By the chain rule,
$\frac{\partial l(\hat \theta)}{\partial\phi_i}=\sum_j \frac{\partial l(\hat \theta)}{\partial\hat \theta_j}\frac{\partial \hat \theta_j}{\partial\phi_i}$
so it only remains to compute $\frac{\partial \hat \theta_j}{\partial\phi_i}$. Since $\hat\theta_j=\phi_j-\varepsilon\frac{\partial l(\phi)}{\partial\phi_j}$ (one gradient step from $\phi$), dropping the second-derivative term (a first-order approximation) gives:
$\begin{aligned}\frac{\partial\hat \theta_j}{\partial\phi_i}&=\frac{\partial(\phi_j-\varepsilon\frac{\partial l(\phi)}{\partial\phi_j})}{\partial\phi_i}=\frac{\partial\phi_j}{\partial\phi_i}-\varepsilon\frac{\partial^2 l(\phi)}{\partial\phi_i\partial\phi_j} \\&\approx\begin{cases}0&i\neq j\\1&i=j\end{cases}\end{aligned}$
Therefore
$\frac{\partial l(\hat \theta)}{\partial\phi_i}=\sum_j \frac{\partial l(\hat \theta)}{\partial\hat \theta_j}\frac{\partial \hat \theta_j}{\partial\phi_i}\approx\frac{\partial l(\hat \theta)}{\partial\hat \theta_i}$
$\nabla_\phi L(\phi)\approx\sum_{n=1}^N\nabla_{\hat \theta} l^n(\hat \theta^n)$
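As a quick check of what the first-order approximation discards, consider a one-parameter task with loss $l(\phi)=\frac{a}{2}(\phi-c)^2$ (the constants $a$ and $c$ are chosen here purely for illustration and are not part of the lecture):
$\hat\theta=\phi-\varepsilon a(\phi-c),\qquad \frac{\partial\hat\theta}{\partial\phi}=1-\varepsilon a\approx 1$
The exact gradient is $\frac{\partial l(\hat\theta)}{\partial\phi}=(1-\varepsilon a)\frac{\partial l(\hat\theta)}{\partial\hat\theta}$, so in this example the first-order version differs from it only by the factor $1-\varepsilon a$, which is close to 1 whenever $\varepsilon a$ is small.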

MAML – Real Implementation

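Under the first-order approximation, the gradient used to update $\phi$ points in the same direction as the gradient at $\hat\theta$. So on each training task we can compute two gradients: the first moves $\phi$ to $\hat\theta$, and the second, evaluated at $\hat\theta$, updates $\phi^i$ to $\phi^{i+1}$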
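The two gradient computations can be sketched on a toy problem. This is a minimal sketch of first-order MAML under simplifying assumptions: the model is a single scalar fitted by mean squared error, each task is a made-up (support, query) pair, and the step sizes `inner_lr` and `outer_lr` are arbitrary choices rather than values from the lecture or the paper.

```python
def grad(theta, xs):
    """Gradient of the mean squared error between the scalar theta and the points xs."""
    return sum(2 * (theta - x) for x in xs) / len(xs)

def fomaml_step(phi, task, inner_lr=0.1, outer_lr=0.05):
    """One first-order MAML update of the initialization phi on a single task."""
    support, query = task
    # First gradient: adapt phi to the task with one step, giving theta_hat.
    theta_hat = phi - inner_lr * grad(phi, support)
    # Second gradient: evaluated at theta_hat, but applied to phi.
    return phi - outer_lr * grad(theta_hat, query)

# Two made-up tasks whose optima are 2 and -2.
tasks = [([1.0, 3.0], [2.0]), ([-3.0, -1.0], [-2.0])]

phi = 5.0
for step in range(200):
    phi = fomaml_step(phi, tasks[step % len(tasks)])
print(phi)
```

With these symmetric tasks the initialization settles near the midpoint of the two task optima, a point from which either optimum is reachable in a few gradient steps.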

MAML vs. Pre-training


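MAML: we do not care how $\phi$ itself performs on the training tasks; we care how the $\hat\theta^n$ obtained by training from $\phi$ performs (in the accompanying figure, the initial value $\phi$ does not perform particularly well on either of two tasks, yet after training it reaches the optimum on both)
Model pre-training: find the $\phi$ that performs best across all tasks; there is no guarantee that training from $\phi$ yields a good $\hat\theta^n$ (in the accompanying figure, the initial value $\phi$ performs well on both tasks, yet after training it fails to reach the optimum)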

Pre-training: Also known as multi-task learning (baseline of meta)
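In the notation used for the MAML gradient, the contrast between the two can be written as two different loss functions, where $\hat\theta^n$ is the model obtained by training from $\phi$ on task $n$:
$L(\phi)=\sum_{n=1}^N l^n(\hat\theta^n)$ (MAML: the loss is measured after training from $\phi$)
$L(\phi)=\sum_{n=1}^N l^n(\phi)$ (model pre-training: the loss is measured at $\phi$ itself)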

MAML is good because ……

paper: Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML
This paper investigates the following question:
- Is the effectiveness of MAML due to the meta-initialization being primed for rapid learning (large, efficient changes in the representations), i.e. the initialization learned by MAML makes it easy for each task to reach its optimum, or due to feature reuse, with the meta-initialization already containing high-quality features, i.e. the initialization learned by MAML is already close to each task's optimum?
- We find that feature reuse is the dominant factor. This leads to the ANIL (Almost No Inner Loop) algorithm, a simplification of MAML where we remove the inner loop for all but the (task-specific) head of a MAML-trained network.

Reptile

paper: On First-Order Meta-Learning Algorithms

Reptile's idea is simpler than MAML's: Reptile allows the parameters to be updated multiple times on training task $m$, yielding $\hat\theta^m$, and then moves $\phi$ in the direction of $\hat\theta^m$
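A minimal sketch of this update on a toy problem follows. The scalar model, the made-up task data, and the values of `inner_lr`, `inner_steps`, and `meta_lr` are all assumptions for illustration; the meta update `phi + meta_lr * (theta - phi)` is the interpolation toward $\hat\theta^m$ described above.

```python
def grad(theta, xs):
    """Gradient of the mean squared error between the scalar theta and the points xs."""
    return sum(2 * (theta - x) for x in xs) / len(xs)

def reptile_step(phi, task_data, inner_lr=0.1, inner_steps=5, meta_lr=0.1):
    """One Reptile update: train on the task for several steps, then move phi toward the result."""
    theta = phi
    for _ in range(inner_steps):          # multiple within-task updates
        theta -= inner_lr * grad(theta, task_data)
    return phi + meta_lr * (theta - phi)  # step from phi in the direction of theta

# Two made-up tasks whose optima are 2 and -2.
tasks = [[1.0, 3.0], [-3.0, -1.0]]

phi = 5.0
for step in range(200):
    phi = reptile_step(phi, tasks[step % len(tasks)])
print(phi)
```

Note that no second gradient at $\hat\theta^m$ is needed: the difference $\hat\theta^m-\phi$ itself serves as the update direction.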

MAML vs. Reptile vs. Pre-training


MAML++

paper: How to train your MAML (MAML++)

Optimizer

paper: Learning to learn by gradient descent by gradient descent

Is the optimizer learnable?

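The three common optimizers SGD, RMSProp, and Adam can all be viewed as instances of one shared structure, where $g$ is the gradient, $l$ is the learning rate, $\hat v$ is the accumulated sum of squared gradients, and $\hat m$ is the momentum
A structure of the same kind can therefore be used to let the machine learn an optimizer by itself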
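To make the shared structure concrete, the sketch below writes the three optimizers as functions with the same interface: each takes the gradient `g` and a mutable `state` holding running statistics, and returns the parameter change. The scalar setting, the hyperparameter values, and the `minimize` helper are assumptions for illustration; a learned optimizer would replace these hand-written rules with a trainable function of the same inputs.

```python
import math

def sgd(g, state, lr=0.1):
    """Plain gradient step: no running statistics are used."""
    return -lr * g

def rmsprop(g, state, lr=0.01, rho=0.9, eps=1e-8):
    """Scale the step by a running average of squared gradients."""
    state["v"] = rho * state.get("v", 0.0) + (1 - rho) * g * g
    return -lr * g / (math.sqrt(state["v"]) + eps)

def adam(g, state, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Combine momentum (m) with squared-gradient scaling (v), both bias-corrected."""
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g * g
    m_hat = state["m"] / (1 - b1 ** t)
    v_hat = state["v"] / (1 - b2 ** t)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)

def minimize(update, theta=0.0, steps=2000):
    """Minimize (theta - 3)^2 with any update rule of the form update(g, state)."""
    state = {}
    for _ in range(steps):
        g = 2 * (theta - 3.0)
        theta += update(g, state)
    return theta

for rule in (sgd, rmsprop, adam):
    print(rule.__name__, minimize(rule))
```

Because all three rules fit the signature `update(g, state)`, swapping one for another, or for a learned rule, requires no change to the training loop.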

Network Architecture Search (NAS)


Because the network architecture parameters are not differentiable, gradient descent cannot be applied directly
Reinforcement Learning
- Barret Zoph, et al., Neural Architecture Search with Reinforcement Learning, ICLR 2017
- Barret Zoph, et al., Learning Transferable Architectures for Scalable Image Recognition, CVPR, 2018
- Hieu Pham, et al., Efficient Neural Architecture Search via Parameter Sharing, ICML, 2018
Evolutionary Algorithm
- Esteban Real, et al., Large-Scale Evolution of Image Classifiers, ICML 2017
- Esteban Real, et al., Regularized Evolution for Image Classifier Architecture Search, AAAI, 2019
- Hanxiao Liu, et al., Hierarchical Representations for Efficient Architecture Search, ICLR, 2018
DARTS: Differentiable Architecture Search (finds a way to make the loss differentiable, then optimizes with gradient descent)
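The idea behind making the architecture choice differentiable can be sketched as a softmax-weighted mixture of candidate operations. The three scalar operations in `OPS` are hypothetical stand-ins for real layers such as convolution, pooling, or a skip connection, and the architecture weights `alphas` would be trained by gradient descent in practice; this sketch only shows the forward computation and the final discrete choice.

```python
import math

def softmax(alphas):
    """Numerically stable softmax over a list of floats."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate operations on a scalar input.
OPS = [lambda x: x, lambda x: 2.0 * x, lambda x: 0.0]

def mixed_op(x, alphas):
    """Softmax-weighted sum of all candidate operations; smooth in alphas."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, OPS))

def discretize(alphas):
    """After the search, keep only the operation with the largest architecture weight."""
    return max(range(len(alphas)), key=lambda k: alphas[k])

print(mixed_op(3.0, [0.0, 0.0, 0.0]), discretize([0.1, 2.0, -1.0]))
```

Since `mixed_op` is a smooth function of `alphas`, the choice among operations no longer has to be made by reinforcement learning or evolution.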

Data Processing

Data Augmentation

paper:

Sample Reweighting

paper:
- Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting
- Learning to Reweight Examples for Robust Deep Learning

Give different samples different weights
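A minimal sketch of what weighting samples means for the training loss is shown below. The per-sample losses and the weights are made-up numbers set by hand; in the meta-learning setting the weights would instead come from a learned component.

```python
def weighted_loss(losses, weights):
    """Weighted average of per-sample losses."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

# Made-up per-sample losses; the last sample might carry a noisy label.
losses = [0.2, 0.3, 5.0]

uniform = weighted_loss(losses, [1.0, 1.0, 1.0])      # every sample counts equally
reweighted = weighted_loss(losses, [1.0, 1.0, 0.1])   # the suspicious sample is down-weighted
print(uniform, reweighted)
```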

Learning to compare: Metric-based approach

paper: Meta-Learning with Latent Embedding Optimization

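Instead of learning one component of gradient descent, this approach abandons the gradient descent framework altogether: the learning algorithm reads in the training data and the testing data and directly outputs the predictions for the testing data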
- Input: Training data and their labels + Testing data
- Output: Predicted label of testing data
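The input/output contract above can be sketched with a nearest-prototype rule, one simple way to compare testing data against training data. This is an illustrative assumption rather than the method of any specific paper in these notes: the feature vectors are used as-is, whereas a metric-based meta-learner would learn the embedding in which distances are measured, and `classify` is a hypothetical helper.

```python
def classify(support, query_points):
    """Predict a label for each query point from labelled support points.

    support: list of (feature_vector, label); query_points: list of feature vectors.
    """
    # Average the support vectors of each class into one prototype per class.
    sums, counts = {}, {}
    for vec, label in support:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    prototypes = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Assign each query point the label of its closest prototype.
    return [min(prototypes, key=lambda lab: sq_dist(q, prototypes[lab])) for q in query_points]

# Training data with labels, plus testing data, go in; predicted labels come out.
support = [([0.0, 0.0], "a"), ([0.2, 0.0], "a"), ([1.0, 1.0], "b"), ([1.0, 0.8], "b")]
predictions = classify(support, [[0.1, 0.1], [0.9, 1.0]])
print(predictions)
```

No gradient step is taken at prediction time: the function maps the support set and the queries straight to labels.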