[2111] Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers

paper

Abstract

ViTs and ViMLPs

  • similarity: both capture long-term dependencies between spatial locations
  • difference: ViMLPs are simpler than ViTs, replacing self-attention (SA) with fully-connected layers (FCs)
  • problem: large computational complexity $\mathcal{O}(N^2)$ for the token tensor $X\in\Reals^{N\times C}$

existing solutions

  • Swin: replaces global SA with local SA via window partitioning
  • GFNet: proposes a depth-wise global convolution performed in the Fourier domain (a minimal sketch follows this list)
    • main steps
      • 2D fast Fourier transform: transfer input features from the spatial domain to the frequency domain
      • frequency gating: element-wise multiply the frequency features with a learnable global filter
      • 2D inverse fast Fourier transform: transfer the learnt features from the frequency domain back to the spatial domain
    • drawbacks
      • lacks adaptivity and expressiveness at high resolution $\impliedby$ complexity and parameter count grow with the sequence size
      • no channel mixing is involved in the frequency gating operation
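
As a concrete reference for the GFNet-style mixer above, here is a minimal sketch of the three steps (2D FFT, frequency gating, inverse FFT). The function name `gfnet_filter`, the tensor layout, and the use of `rfft2` are illustrative assumptions, not the official GFNet code.

```python
import torch

def gfnet_filter(x: torch.Tensor, global_filter: torch.Tensor) -> torch.Tensor:
    """Depth-wise global convolution in the Fourier domain (GFNet-style sketch).

    x:             real input tokens of shape (B, h, w, d)
    global_filter: complex learnable filter of shape (h, w // 2 + 1, d)
    """
    z = torch.fft.rfft2(x, dim=(1, 2), norm="ortho")    # 2D FFT over the spatial dims
    z = z * global_filter                                # element-wise frequency gating (no channel mixing)
    return torch.fft.irfft2(z, s=x.shape[1:3], dim=(1, 2), norm="ortho")  # back to the spatial domain

# usage with random data
x = torch.randn(2, 14, 14, 64)
filt = torch.randn(14, 14 // 2 + 1, 64, dtype=torch.cfloat)
y = gfnet_filter(x, filt)   # same shape as x
```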

2111_afno_t1
Complexity, parameter count, and interpretation for FNO, AFNO, GFN, and Self-Attention. $N\coloneqq hw$, $d$, and $k$ refer to the sequence size, channel size, and block count, respectively.

contribution

  • propose the adaptive Fourier neural operator (AFNO), an efficient token mixer with quasi-linear complexity in the sequence length
  • improve the expressiveness and generalization of AFNO by imposing block-diagonal structure, adaptive weight sharing, and sparsity

Method

Fourier neural operator

kernel integration

denote $x_{n, m}\in\Reals^d$ as the $(n, m)$-th token in the input tensor $X\in\Reals^{N\times d}$, $N\coloneqq hw$
index the token sequence as $X[s]\coloneqq X[n_s, m_s]$ for $s, t\in[hw]$
definition 1 (self-attention) define the self-attention mixing as $Att: \Reals^{N\times d}\rightarrow\Reals^{N\times d}$
$$Att(X)\coloneqq\mathrm{softmax}\left(\frac{XW_q(XW_k)^T}{\sqrt{d}}\right)XW_v$$

where $W_q, W_k, W_v\in\Reals^{d\times d}$ are the query, key, and value matrices

write self-attention as a kernel integration
define $K\coloneqq\mathrm{softmax}\left(\frac{\langle XW_q, XW_k\rangle}{\sqrt{d}}\right)$ as the attention matrix
treat self-attention as an asymmetric matrix-valued kernel $\kappa: [N]\times[N]\rightarrow\Reals^{d\times d}$ parametrized as $\kappa[s, t]=K[s, t]\circ W_v^T$
view self-attention as a kernel summation
$$Att(X)[s]\coloneqq\sum_{t=1}^N\kappa[s, t]X[t], \quad \forall s\in[N]$$

A brief interpretation of this formula: $Att(X)[s]$ is the $s$-th row of the matrix $Att(X)\in\Reals^{N\times d}$, a $d$-dimensional vector; it is a weighted sum of the $d$-dimensional token vectors $X[t]$, where each $X[t]$ is weighted by the matrix $\kappa[s, t]=K[s, t]\circ W_v^T\in\Reals^{d\times d}$.
ref: zhihu
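
As a numerical sanity check on definition 1 and the kernel-summation view (my own illustration, not from the paper), the snippet below compares the matrix form of self-attention with the explicit sum $\sum_{t}\kappa[s, t]X[t]$, where $\kappa[s, t]=K[s, t]\circ W_v^T$ simply scales $W_v^T$ by the attention weight.

```python
import torch

torch.manual_seed(0)
N, d = 6, 4
X = torch.randn(N, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

# matrix form: Att(X) = softmax(X Wq (X Wk)^T / sqrt(d)) X Wv
K = torch.softmax((X @ Wq) @ (X @ Wk).T / d ** 0.5, dim=-1)
att = K @ X @ Wv

# kernel-summation form: Att(X)[s] = sum_t kappa[s, t] X[t], with kappa[s, t] = K[s, t] * Wv^T
att_sum = torch.stack([
    sum(K[s, t] * (Wv.T @ X[t]) for t in range(N)) for s in range(N)
])

print(torch.allclose(att, att_sum, atol=1e-5))  # True
```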

extend kernel summation into continuous kernel integrals
the input tensor $X$ is treated as a spatial function in the function space $X\in(D, \Reals^d)$, rather than a finite-dimensional vector in the Euclidean space $X\in\Reals^{N\times d}$
definition 2 (kernel integral) define the kernel integral operator $\mathcal{K}: (D, \Reals^d)\rightarrow(D, \Reals^d)$ as
$$\mathcal{K}(X)(s)=\int_D\kappa(s, t)X(t)\,\mathrm{d}t, \quad \forall s\in D$$

with a continuous kernel function $\kappa: D\times D\rightarrow\Reals^{d\times d}$

the integral leads to a global convolution
definition 3 (global convolution) given the special case of a Green's kernel, $\kappa(s, t)=\kappa(s-t)$, the kernel operator admits
$$\mathcal{K}(X)(s)=\int_D\kappa(s-t)X(t)\,\mathrm{d}t, \quad \forall s\in D$$

convolution has a smaller complexity than general kernel integration
the global convolution can be efficiently implemented with the FFT
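
The efficiency claim follows from the convolution theorem: under periodic (circular) boundary conditions, a global convolution can be evaluated with FFTs in $\mathcal{O}(N\log N)$ instead of the $\mathcal{O}(N^2)$ direct summation. A small 1D illustration (my own example):

```python
import torch

torch.manual_seed(0)
N = 8
x = torch.randn(N)
kappa = torch.randn(N)          # discretized Green's kernel kappa(s - t)

# direct circular convolution: O(N^2)
direct = torch.stack([
    sum(kappa[(s - t) % N] * x[t] for t in range(N)) for s in range(N)
])

# FFT implementation: O(N log N)
via_fft = torch.fft.ifft(torch.fft.fft(kappa) * torch.fft.fft(x)).real

print(torch.allclose(direct, via_fft, atol=1e-5))  # True
```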

Fourier neural operator (FNO)

define the Fourier neural operator as
definition 4 (Fourier neural operator) for a continuous input $X\in(D, \Reals^d)$ and kernel $\kappa$, the kernel integral at token $s$ is given by
$$\mathcal{K}(X)(s)=\mathcal{F}^{-1}(\mathcal{F}(\kappa)\cdot\mathcal{F}(X))(s), \quad \forall s\in D$$

where $\mathcal{F}, \mathcal{F}^{-1}$ are the Fourier transform and its inverse

discrete FNO
for images of finite dimension on a discrete grid, tokens are mixed using the discrete Fourier transform (DFT)
given the input token tensor $X\in\Reals^{h\times w\times d}$, apply the DFT per token $(m, n)\in[h]\times[w]$
step 1 token mixing: discrete $\mathcal{F}(X)$
$$z_{m, n}=[\mathrm{DFT}(X)]_{m, n}$$

step 2 channel mixing: discrete $\mathcal{F}(\kappa)$
$$\tilde{z}_{m, n}=W_{m, n}z_{m, n}$$

where $W_{m, n}\in\Complex^{d\times d}$ is the $(m, n)$-th slice of the complex-valued weight tensor $W\coloneqq\mathrm{DFT}(\kappa)\in\Complex^{h\times w\times d\times d}$ that parametrizes the kernel
step 3 token de-mixing: discrete $\mathcal{F}^{-1}(\tilde{Z})$
$$y_{m, n}=[\mathrm{IDFT}(\tilde{Z})]_{m, n}$$

step 4 add a residual term $x_{m, n}$ (parametrized as a convolution) to $y_{m, n}$
to compensate for local features and non-periodic boundaries
$\impliedby$ the DFT assumes a global convolution applied to periodic images, which is typically not true for real-world images
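
A minimal sketch of the four mixing steps above, with a full $d\times d$ matrix multiply per frequency mode (the point of comparison with AFNO later); the residual is kept as an identity here for simplicity, and the names and shapes are illustrative assumptions.

```python
import torch

def fno_mixer(x: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Discrete FNO token mixer (sketch).

    x: real tokens of shape (h, w, d)
    W: complex kernel weights DFT(kappa) of shape (h, w, d, d)
    """
    z = torch.fft.fft2(x, dim=(0, 1))                 # step 1: token mixing, z_{m,n} = [DFT(X)]_{m,n}
    z_tilde = torch.einsum("mnij,mnj->mni", W, z)     # step 2: channel mixing, W_{m,n} z_{m,n} per mode
    y = torch.fft.ifft2(z_tilde, dim=(0, 1)).real     # step 3: token de-mixing
    return y + x                                      # step 4: residual (identity here; the paper parametrizes it as a convolution)

h, w, d = 8, 8, 16
x = torch.randn(h, w, d)
W = torch.randn(h, w, d, d, dtype=torch.cfloat)
out = fno_mixer(x, W)   # shape (h, w, d)
```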

conclusions of FNO

  • merits
    • after training at one resolution, it can be directly evaluated at another resolution
    • encode higher-frequency information in channel dimension
  • demerits
    • static weights $W_{m, n}$: not adaptive to different input resolutions

adaptive Fourier neural operator

model architecture

2111_afno_f2
The multi-layer transformer network with FNO, GFN, and AFNO mixers. GFNet performs element-wise matrix multiplication with separate weights across channels ($k$). FNO performs a full matrix multiplication that mixes all the channels. AFNO performs block-wise channel mixing using an MLP along with soft-thresholding. The symbols $h$, $w$, $d$, and $k$ refer to the height, width, channel size, and block count, respectively.

adaptive Fourier neural operator (AFNO)

AFNO mainly modifies step 2 of FNO
impose a block-diagonal structure on $W$, dividing it into $k$ weight blocks of size $\frac{d}{k}\times\frac{d}{k}$
the kernel operates independently on each block
$$\tilde{z}_{m, n}^{(\ell)}=W_{m, n}^{(\ell)}z_{m, n}^{(\ell)}, \quad \ell=1, \ldots, k$$

note that each block can be interpreted as a head, as in multi-head self-attention
the channel mixing is implemented by a 2-layer perceptron for the $(n, m)$-th token
$$\tilde{z}_{m, n}=\mathrm{MLP}(z_{m, n})=W_2\,\mathrm{ReLU}(W_1z_{m, n})+b$$

where $W_1, W_2, b$ are shared across all tokens
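
A sketch of the block-diagonal channel mixing described above: each frequency token is split into $k$ blocks of size $d/k$, and a shared two-layer MLP acts on every block. Applying the ReLU separately to the real and imaginary parts of the complex features is an assumption of this sketch, as are all names and shapes.

```python
import torch

def block_mlp(z: torch.Tensor, W1: torch.Tensor, W2: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Block-diagonal 2-layer MLP on frequency tokens (sketch).

    z:      complex frequency tokens of shape (h, w, d)
    W1, W2: complex weights of shape (k, d_k, d_k) with d_k = d // k, shared across tokens
    b:      complex bias of shape (k, d_k)
    """
    h, w, d = z.shape
    k, d_k = W1.shape[0], W1.shape[1]
    z = z.reshape(h, w, k, d_k)                                     # split channels into k blocks
    hidden = torch.einsum("lij,hwlj->hwli", W1, z)                  # W1 z, independently per block l
    hidden = torch.complex(torch.relu(hidden.real), torch.relu(hidden.imag))  # ReLU on real/imag parts
    out = torch.einsum("lij,hwlj->hwli", W2, hidden) + b            # W2 ReLU(W1 z) + b
    return out.reshape(h, w, d)

h, w, d, k = 8, 8, 16, 4
z = torch.randn(h, w, d, dtype=torch.cfloat)
W1 = torch.randn(k, d // k, d // k, dtype=torch.cfloat)
W2 = torch.randn(k, d // k, d // k, dtype=torch.cfloat)
b = torch.randn(k, d // k, dtype=torch.cfloat)
z_tilde = block_mlp(z, W1, W2, b)   # shape (h, w, d)
```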

images are inherently sparse in the Fourier domain
$\implies$ adaptively mask tokens according to their importance for the end task
use LASSO-style channel mixing to sparsify the frequency tokens
$$\min_{\tilde{z}_{m, n}}\Vert\tilde{z}_{m, n}-W_{m, n}z_{m, n}\Vert^2+\lambda\Vert\tilde{z}_{m, n}\Vert_1$$

this is implemented by the soft-thresholding (shrinkage) operation
$$\begin{aligned} \tilde{z}_{m, n}&=S_{\lambda}(W_{m, n}z_{m, n}) \\ S_{\lambda}(x)&=\mathrm{sign}(x)\max\{\vert x\vert-\lambda, 0\} \end{aligned}$$

where $\lambda$ is a tuning parameter controlling the sparsity
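
The soft-thresholding operator $S_\lambda$ is the closed-form minimizer of the LASSO objective above; a short sketch, with the shrinkage applied to the real and imaginary parts separately as an assumption for complex frequency tokens:

```python
import torch

def soft_threshold(x: torch.Tensor, lam: float) -> torch.Tensor:
    """Soft-thresholding / shrinkage: S_lambda(x) = sign(x) * max(|x| - lambda, 0)."""
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

# for complex frequency tokens, one option is to shrink real and imaginary parts separately (assumption)
def soft_threshold_complex(z: torch.Tensor, lam: float) -> torch.Tensor:
    return torch.complex(soft_threshold(z.real, lam), soft_threshold(z.imag, lam))

x = torch.tensor([-2.0, -0.3, 0.0, 0.4, 1.5])
print(soft_threshold(x, lam=0.5))   # [-1.5, 0.0, 0.0, 0.0, 1.0]
```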

Experiment

image classification

dataset ImageNet-1K
loss function cross-entropy
optimizer Adam: 300 epochs, weight decay=0.05, init lr=5e-4, linear warm-up for 5 epochs, cosine decay to 1e-5
max gradient norm clipped to 1.0
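
A configuration sketch matching the settings listed above; the stand-in model, training-loop skeleton, and the use of AdamW (as one way to realize Adam with weight decay) are assumptions of this sketch.

```python
import torch

model = torch.nn.Linear(10, 10)          # hypothetical stand-in for the AFNO-based classifier
epochs, warmup_epochs = 300, 5

# init lr 5e-4, weight decay 0.05
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)

# linear warm-up for 5 epochs, then cosine decay towards 1e-5
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs),
        torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs, eta_min=1e-5),
    ],
    milestones=[warmup_epochs],
)

for epoch in range(epochs):
    # ... forward pass, cross-entropy loss, loss.backward() ...
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip gradient norm to 1.0
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```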

2111_afno_t5
ImageNet-1K classification efficiency-accuracy trade-off when the input resolution is $224\times 224$.

image inpainting

dataset ImageNet-1K
optimizer Adam: 100 epochs, weight decay=0.01, init lr=1e-4 for self-attention or 1e-3 for other mixers, cosine decay to 1e-5
max gradient norm clipped to 1.0

2111_afno_t2
Inpainting PSNR and SSIM for ImageNet-1K validation data. AFNO matches the performance of Self-Attention despite using significantly fewer FLOPs.

few-shot segmentation

dataset CelebA-Faces, ADE-Cars, LSUN-Cats
loss function cross-entropy
optimizer 2000 epochs, init lr=1e-4 for self-attention or 1e-3 for other mixers

2111_afno_t3
Few-shot segmentation mIoU for AFNO versus alternative mixers. AFNO surpasses Self-Attention for 2/3 datasets while using fewer FLOPs.

Cityscapes segmentation
  • pre-training
    dataset ImageNet-1K
    optimizer Adam: batch size=1024, 300 epochs, weight decay=0.05, init lr=1e-3, warm-up for 6250 iterations, cosine decay to 1e-5
    max gradient norm clipped to 1.0
  • fine-tuning
    dataset Cityscapes
    optimizer Adam: 450 epochs, weight decay=0.05

2111_afno_t4
mIoU and FLOPs for Cityscapes segmentation at $1024\times 1024$ resolution. Note, both the mixer and total FLOPs are included. For GFN and AFNO, the MLP layers are the bottleneck for the complexity. Also, AFNO-25% only keeps 25% of the low frequency modes, while AFNO-100% keeps all the modes. Results for self-attention cannot be obtained due to the long sequence length in the first few layers.

ablation studies
sparsity threshold

2111_afno_f4
Ablations for the sparsity thresholds and block count measured by inpainting validation PSNR. The results suggest that soft thresholding and blocks are effective.

block count
impact of adaptive weights

2111_afno_t6
Ablations for AFNO versus FNO, AFNO without adaptive weights, and hard thresholding. Results are on inpainting pretraining with 10% of ImageNet along with few-shot segmentation mIoU on CelebAFaces. Hard thresholding only keeps 35% of low frequency modes. AFNO demonstrates superior performance for the same parameter count in both tasks.

comparison to FNO