OpenAI Self-Supervised Learning Notes: Self-Supervised Learning | Tutorial | NeurIPS 2021

Reposted from a WeChat official account.
Original link: https://mp.weixin.qq.com/s?__biz=Mzg4MjgxMjgyMg==&mid=2247486049&idx=1&sn=1d98375dcbb9d0d68e8733f2dd0a2d40&chksm=cf51b898f826318ead24e414144235cfd516af4abb71190aeca42b1082bd606df6973eb963f0#rd

OpenAI Self-Supervised Learning Notes




Video: https://www.youtube.com/watch?v=7l6fttRJzeU
Slides: https://nips.cc/media/neurips-2021/Slides/21895.pdf

Self-Supervised Learning
Self-Prediction and Contrastive Learning

  • Self-Supervised Learning
    • a popular paradigm of representation learning

Outline

  • Introduction: motivation, basic concepts, examples
  • Early Work: a look into the connection with older methods
  • Methods
    • Self-prediction
    • Contrastive Learning
    • (for each subsection, present the framework and categorization)
  • Pretext tasks: a wide range of literature review
  • Techniques: improve training efficiency

Introduction

What is self-supervised learning and why do we need it?

What is self-supervised learning?
  • Self-supervised learning (SSL):
    • a special type of representation learning that enables learning good data representations from an unlabelled dataset
  • Motivation:
    • the idea of constructing supervised learning tasks out of unsupervised datasets

    • Why?

      ✅ Data labeling is expensive, and thus high-quality labelled datasets are limited

      ✅ Learning good representations makes it easier to transfer useful information to a variety of downstream tasks ⇒ e.g., few-shot learning / zero-shot transfer to new tasks

Self-supervised learning tasks are also known as pretext tasks

What’s Possible with Self-Supervised Learning?
  • Video Colorization (Vondrick et al 2018)

    • a self-supervised learning method

    • resulting in a rich representation

    • can be used for video segmentation + unlabelled visual region tracking, without extra fine-tuning

    • just label the first frame

      picture 1

  • Zero-shot CLIP (Radford et al. 2021)

    • Despite not being trained on supervised labels

    • the zero-shot CLIP classifier achieves great performance on challenging image-to-text classification tasks

      picture 2

Early Work

Precursors to recent self-supervised approaches

Early Work: Connecting the Dots

Some ideas:

  • Restricted Boltzmann Machines

  • Autoencoders

  • Word2Vec

  • Autoregressive Modeling

  • Siamese networks

  • Multiple Instance / Metric Learning

Restricted Boltzmann Machines
  • RBM:
    • a special case of Markov random fields

      picture 3

    • consisting of visible units and hidden units

    • has connections between any pair across visible and hidden units, but not within each group

      picture 4

Autoencoder: Self-Supervised Learning for Vision in Early Days
  • Autoencoder: a precursor to the modern self-supervised approaches
    • such as the Denoising Autoencoder
  • Has inspired many self-supervised approaches in later years
    • such as masked language model (e.g. BERT), MAE

picture 5

Word2Vec: Self-Supervised Learning for Language
  • Word Embeddings to map words to vectors
    • extracts the features of words
  • idea:
    • the sum of the neighboring word embeddings is predictive of the word in the middle

picture 6

  • An interesting phenomenon resulting from word2vec:
    • you can observe linear substructures in the embedding space: lines connecting comparable concepts, such as corresponding masculine and feminine words, are roughly parallel

      picture 7

Autoregressive Modeling
  • Autoregressive model:

    • Autoregressive (AR) models are a class of time series models in which the value at a given time step is modeled as a linear function of previous values

    • NADE: Neural Autoregressive Distribution Estimator

      picture 8

  • Autoregressive models have also been a basis for many self-supervised methods such as GPT
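
As a concrete illustration (a standard textbook formulation, not taken from the slides), an AR($p$) model predicts the current value as a linear function of the $p$ previous values plus noise: $x_t = c + \sum_{i=1}^{p} \varphi_i x_{t-i} + \varepsilon_t$.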

Siamese Networks

Many contrastive self-supervised learning methods use a pair of neural networks and learn from their difference
– this idea can be traced back to Siamese networks

  • Self-organizing neural networks
    • where two neural networks take separate but related parts of the input, and learn to maximize the agreement between the two outputs
  • Siamese Networks
    • if you believe that a network f can encode x well and produce a good representation f(x)

    • then, for two different inputs $x_1$ and $x_2$, their distance can be defined as $d(x_1, x_2) = L(f(x_1), f(x_2))$

    • the idea of running two identical CNNs on two different inputs and then comparing their outputs is called a Siamese network

    • Train by:

      ✅ If $x_i$ and $x_j$ are the same person, $||f(x_i) - f(x_j)||$ is small

      ✅ If $x_i$ and $x_j$ are different people, $||f(x_i) - f(x_j)||$ is large

picture 9
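
A minimal PyTorch sketch of the Siamese idea above: one shared encoder applied to two inputs and a Euclidean distance between the outputs (the encoder architecture and input sizes are placeholder assumptions):

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # A single shared encoder f(.) is applied to both inputs (weight sharing).
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x1, x2):
        f1, f2 = self.encoder(x1), self.encoder(x2)
        # d(x1, x2) = ||f(x1) - f(x2)||_2, one simple choice of L(f(x1), f(x2))
        return torch.norm(f1 - f2, dim=1)

net = SiameseNet()
x1, x2 = torch.randn(4, 1, 28, 28), torch.randn(4, 1, 28, 28)
print(net(x1, x2).shape)  # -> torch.Size([4]), one distance per pair
```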

Multiple Instance Learning & Metric Learning

Predecessors of the recent contrastive learning techniques: multiple instance learning and metric learning

  • deviate from the typical framework of empirical risk minimization

    • define the objective function in terms of multiple samples from the dataset ⇒ multiple instance learning
  • early work:

    • around non-linear dimensionality reduction
    • e.g., multi-dimensional scaling and locally linear embedding
    • better than PCA: can preserve the local structure of data samples
  • metric learning:

    • x and y: two samples
    • A: a learnable positive semi-definite matrix
    • the learned metric is then $d_A(x, y) = \sqrt{(x-y)^{\top} A (x-y)}$
  • Contrastive loss:

    • uses a spring-system analogy to decrease the distance between inputs of the same type and increase the distance between inputs of different types
  • Triplet loss

    • another way to obtain a learned metric
    • defined using 3 data points
    • anchor, positive and negative
    • the anchor point is learned to become similar to the positive, and dissimilar to the negative
  • N-pair loss:

    • generalizes the triplet loss
    • recent contrastive learning losses are modeled on the N-pair loss

picture 13

Methods

  • self-prediction
  • Contrastive learning
Methods for Framing Self-Supervised Learning Tasks
  • Self-prediction: Given an individual data sample, the task is to predict one part of the sample given the other part
    • i.e., “intra-sample” prediction

The part to be predicted pretends to be missing

  • Contrastive learning: Given multiple data samples, the task is to predict the relationship among them
    • relationship: can be based on inner logics within data

      ✅ such as different camera views of the same scene

      ✅ or create multiple augmented versions of the same sample

The multiple samples can be selected from the dataset based on some known logic (e.g., the order of words / sentences), or fabricated by altering the original version
i.e., we know the true relationship between samples but pretend not to know it

Self-Prediction
  • Self-prediction constructs prediction tasks within every individual data sample

    • to predict a part of the data from the rest while pretending we don’t know that part

    • The following figure demonstrates how flexible and diverse the options are for constructing self-prediction learning tasks

      ✅ can mask any dimensions

      picture 14

  • Categories:

    • Autoregressive generation
    • Masked generation
    • Innate relationship prediction
    • Hybrid self-prediction
Self-prediction: Autoregressive Generation
  • The autoregressive model predicts future behavior based on past behavior

    • Any data that comes with an innate sequential order can be modeled with regression
  • Examples :

    • Audio (WaveNet, WaveRNN)
    • Autoregressive language modeling (GPT, XLNet)
    • Images in raster scan (PixelCNN, PixelRNN, iGPT)
Self-Prediction: Masked Generation
  • mask a random portion of information and pretend it is missing, irrespective of the natural sequence

    • The model learns to predict the missing portion given other unmasked information
  • e.g.,

    • predicting random words based on other words in the same context around it
  • Examples :

    • Masked language modeling (BERT)
    • Images with masked patch (denoising autoencoder, context autoencoder, colorization)
Self-Prediction: Innate Relationship Prediction
  • Some transformations (e.g., segmentation, rotation) of a data sample should maintain the original information or follow the desired innate logic

  • Examples

    • Order of image patches

      ✅ e.g., shuffle the patches

      ✅ e.g., relative position, jigsaw puzzle

    • Image rotation

    • Counting features across patches

Self-Prediction: Hybrid Self-Prediction Models

Hybrid Self-Prediction Models: combine different types of generative modeling

  • VQ-VAE + AR
    • Jukebox (Dhariwal et al. 2020), DALL-E (Ramesh et al. 2021)
  • VQ-VAE + AR + Adversarial
    • VQGAN (Esser & Rombach et al. 2021)

    • VQ-VAE: learns a discrete codebook of context-rich visual parts

    • A transformer model: trained to autoregressively model compositions of codes from this codebook

      picture 15

Contrastive Learning
  • Goal:

    • To learn such an embedding space in which similar sample pairs stay close to each other while dissimilar ones are far apart

      picture 16

  • Contrastive learning can be applied to both supervised and unsupervised settings

    • when working with unsupervised data, contrastive learning is one of the most powerful approaches in self-supervised learning
  • Category

    • Inter-sample classification

      🚩 the most dominant approach

      ✅ “inter-sample”: emphasizes and distinguishes it from “intra-sample”

    • Feature clustering

    • Multiview coding

Contrastive Learning: Inter-Sample Classification
  • Given both similar (“positive”) and dissimilar (“negative”) candidates, identifying which ones are similar to the anchor data point is a classification task

    • anchor: the original input
  • How to construct a set of data point candidates:

    • The original input and its distorted version
    • Data that captures the same target from different views
  • Common loss functions :

    • Contrastive loss, 2005
    • Triplet loss, 2015
    • Lifted structured loss, 2015
    • Multi-class n-pair loss, 2016
    • Noise contrastive estimation, 2010
    • InfoNCE, 2018
    • Soft-nearest neighbors loss, 2007, 2019
Loss function 1: Contrastive loss
  • 2005

  • Works with labelled dataset

  • Encodes data into an embedding vector

    • such that examples from the same class have similar embeddings and samples from different classes have different ones
  • Given two labeled samples $(x_i, y_i)$ and $(x_j, y_j)$:

    picture 17
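
A minimal PyTorch sketch of this margin-based contrastive loss (the margin value and the Euclidean distance are assumptions; `same_label` indicates whether the pair shares a class):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f_i, f_j, same_label, margin=1.0):
    """f_i, f_j: (B, d) embeddings; same_label: (B,) float, 1.0 if same class else 0.0."""
    d = F.pairwise_distance(f_i, f_j)                    # ||f(x_i) - f(x_j)||
    pos = same_label * d.pow(2)                          # pull same-class pairs together
    neg = (1 - same_label) * F.relu(margin - d).pow(2)   # push different-class pairs past the margin
    return (pos + neg).mean()
```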

Loss function 2: Triplet loss
  • Triplet loss (Schroff et al. 2015)

    • learns to minimize the distance between the anchor $x$ and the positive $x^{+}$, and
    • maximize the distance between the anchor $x$ and the negative $x^{-}$ at the same time
  • Given a triplet input $(x, x^{+}, x^{-})$

    picture 18

It is called a triplet loss because it demands an input triplet containing one anchor, one positive, and one negative
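
A minimal PyTorch sketch of the triplet loss described above (the margin is an assumed hyperparameter):

```python
import torch.nn.functional as F

def triplet_loss(f_anchor, f_pos, f_neg, margin=0.2):
    """f_anchor, f_pos, f_neg: (B, d) embeddings of anchor, positive, negative."""
    d_pos = (f_anchor - f_pos).pow(2).sum(dim=1)   # squared distance to the positive
    d_neg = (f_anchor - f_neg).pow(2).sum(dim=1)   # squared distance to the negative
    return F.relu(d_pos - d_neg + margin).mean()   # hinge: anchor must be closer to the positive by a margin
```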

Loss function 3: N-pair loss
  • N-Pair loss (Sohn 2016)
    • generalizes triplet loss to include comparison with multiple negative samples
  • Given one positive and N-1 negative samples:
    • $\{x, x^{+}, x_{1}^{-}, ..., x_{N-1}^{-}\}$

picture 19
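
For reference, the N-pair loss for one anchor $x$ with embedding $f(x)$ can be written as a softmax-style objective over the one positive and $N-1$ negatives (a standard formulation of Sohn 2016): $\mathcal{L}(x, x^{+}, \{x_i^{-}\}) = \log\Big(1 + \sum_{i=1}^{N-1} \exp\big(f(x)^{\top} f(x_i^{-}) - f(x)^{\top} f(x^{+})\big)\Big)$.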

Loss function 4: Lifted structured loss
  • Lifted structured loss (Song et al. 2015):

    • utilizes all the pairwise edges within one training batch for better computational efficiency

      picture 20

  • For large-scale training, the batch size is often very large

    • means we have many samples within one batch
    • can construct multiple similar or dissimilar pairs
    • Lifted structured loss: utilizes all the pairwise edges within one training batch
    • improves computational efficiency, as it incorporates more information within one batch
Loss function 5: Noise Contrastive Estimation (NCE)
  • Noise contrastive Estimation (NCE): Gutmann & Hyvarinen 2010

    • runs logistic regression to tell apart the target data from noise
  • Given target sample distribution p and noise distribution q:

    picture 21

  • initially proposed to learn word embedding in 2010

Loss function 6: InfoNCE
  • InfoNCE (2018)

    • Uses categorical cross-entropy loss to identify the positive sample amongst a set of unrelated noise samples
  • Given a context vector c, the positive sample should be drawn from the conditional distribution $p(x|c)$

    • while N-1 negative samples are drawn from the proposal distribution p(x), independent from the context c
  • The probability of detecting the positive sample correctly is:

    picture 22
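
A minimal PyTorch sketch of the InfoNCE objective, classifying the positive among noise candidates (cosine similarity stands in for the score function here; that choice and the temperature value are assumptions):

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, negatives, temperature=0.1):
    """query, positive: (B, d); negatives: (B, N-1, d) noise samples."""
    q = F.normalize(query, dim=-1)
    cands = F.normalize(torch.cat([positive.unsqueeze(1), negatives], dim=1), dim=-1)  # (B, N, d)
    logits = torch.einsum("bd,bnd->bn", q, cands) / temperature   # similarity to all candidates
    labels = torch.zeros(q.size(0), dtype=torch.long)             # the positive sits at index 0
    return F.cross_entropy(logits, labels)
```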

Loss function 7: Soft-Nearest Neighbors Loss
  • Soft-Nearest Neighbors Loss (Frosst et al. 2019): extends the loss function to include multiple positive samples given known labels
  • Given a batch of samples $\{x_i, y_i\}_{i=1}^{B}$
    • known labels may come from a supervised dataset or be fabricated with data augmentation

    • temperature term: tunes how concentrated the feature space is

      picture 23
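
A minimal PyTorch sketch of the soft-nearest-neighbors loss over a labeled batch (it assumes every sample has at least one same-label neighbor in the batch; the temperature is a placeholder):

```python
import torch

def soft_nearest_neighbor_loss(z, y, temperature=0.1):
    """z: (B, d) embeddings; y: (B,) integer labels (known or fabricated via augmentation)."""
    B = z.size(0)
    sq_dist = torch.cdist(z, z).pow(2)                              # squared pairwise distances
    not_self = (~torch.eye(B, dtype=torch.bool)).float()            # exclude i == j
    weights = torch.exp(-sq_dist / temperature) * not_self
    same = (y.unsqueeze(0) == y.unsqueeze(1)).float() * not_self    # same-label pairs (multiple positives)
    num = (weights * same).sum(dim=1)                               # mass on same-label neighbors
    den = weights.sum(dim=1)                                        # mass on all other samples
    return -torch.log(num / den).mean()
```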

Contrastive Learning: Feature Clustering
  • Find similar data samples by clustering them with learned features

  • core idea: use clustering algorithms to assign pseudo labels to samples so that we can run inter-sample contrastive learning

  • Examples:

    • DeepCluster (Caron et al. 2018)

    • InterCLR (Xie et al. 2021)

      picture 24

Contrastive Learning: Multiview Coding
  • Apply the InfoNCE objective to two or more different views of input data

    picture 25

  • Became a mainstream contrastive learning method

    • AMDIM (Bachman et al 2019)
    • Contrastive Multiview Coding (CMC, Tian et al. 2019), etc.
Contrastive Learning between Modalities
  • Views can be from paired inputs from two or more modalities
    • CLIP (Radford et al. 2021), ALIGN (Jia et al. 2021): enable zero-shot classification, cross-modal retrieval, and guided image generation

    • CodeSearchNet (Husain et al. 2019): contrastive learning between text and code

      picture 26

Pretext tasks

Recap: Pretext Tasks
  • Step 1: Pre-train a model for a pretext task

  • Step 2: Transfer to applications

    picture 27

Pretext Tasks: Taxonomy
  • Generative
    • VAE
    • GAN
    • Autoregressive
    • Flow-based
    • Diffusion
  • Self-Prediction
    • Masked Prediction (Denoising AE, Context AE)
    • Channel Shuffling (colorization, split-brain)
  • Innate Relationship
    • Patch Positioning
    • Image Rotation
    • Feature Counting
    • Contrastive Predictive Coding
  • Contrastive
    • Instance Discrimination

    • Augmented Views

    • Clustering-based

      picture 28

Image / Vision Pretext Tasks
Image Pretext Tasks: Variational Autoencoders
  • Auto-Encoding Variational Bayes (Kingma et al. 2014)

    picture 29

  • Image generation:

    • itself is an immensely broad field that deserves an entire tutorial or more
    • but can also serve as representation learning
Image Pretext Tasks: Generative Adversarial Networks
  • Jointly train an encoder, in addition to the usual GAN

    • Bidirectional GAN

    • Adversarially Learned Inference

      picture 30

  • GAN Inversion: learning encoder post-hoc and/or optimizing for given image

Vision Pretext Tasks: Autoregressive Image Generation
  • Neural autoregressive density estimation (NADE)
  • Pixel RNN, Pixel CNN
    • Use RNN and CNN to predict values conditioned on the neighboring pixels
  • Image GPT
    • Uses a transformer on discretized pixels and was able to obtain better representations than some supervised approaches

picture 31

Vision Pretext Tasks: Diffusion Model
  • Diffusion Modeling :
    • Follows a Markov chain of diffusion steps to slowly add random noise to data

    • and then learns to reverse the diffusion process to construct the desired data samples from the noise

      picture 32

Vision Pretext Tasks: Masked Prediction
  • Denoising autoencoder (Vincent et al. 2008)

    • Add noise = Randomly mask some pixels

    • Only reconstruction loss

      picture 33

  • Context autoencoder (Pathak et al 2016)

    • Mask a random region in the image

    • Reconstruction loss + adversarial loss

    • adversarial loss: tries to make it difficult to distinguish between the inpainted region produced by the model and the actual image

      picture 34

Vision Pretext Tasks: Colorization and More

Prediction can be made not only on the pixel values themselves, but also on any subset of information from the image

  • Image Colorization

    • Predict the binned CIE Lab color space given a grayscale image
  • Split-brain autoencoder

    • Predict a subset of color channels from the rest of channels
    • Channels: luminosity, color, depth, etc.

    picture 35

In order to get representations that transfer well to downstream tasks

Vision Pretext Tasks: Innate Relationship Prediction
  • Learn the relationship among image patches:
    • Predict relative positions between patches
    • Jigsaw Puzzle using patches

picture 36

  • RotNet: predict which rotation is applied (Gidaris et al. 2018)
    • Rotation does not alter the semantic content of an image
  • Representation Learning by Learning to Count (Noroozi et al. 2017)
    • Counting features across patches without labels, using equivariance of counts
    • i.e., learns a function that counts visual primitives in images

picture 37

Contrastive Predictive Coding and InfoNCE
  • Contrastive Predictive Coding (CPC) (van den Oord et al 2018)
    • Classify the “future” representation amongst a set of unrelated “negative” samples
    • an autoregressive context predictor is used to classify the correct future patches

picture 38

  • minimizing the loss function is equivalent to maximizing a lower bound on the mutual information between the predicted context $c_t$ and the future patch $x_{t+k}$
    • i.e., the latent representation of the predicted data should be as accurate as possible

CPC has been highly influential in contrastive learning

  • showing the effectiveness of casting the problem as an inter-sample classification task
Vision Pretext Tasks: Inter-Sample Classification
  • Exemplar CNN
  • Instance-level discrimination
    • Each instance is a distinct class of its own

      🚩 # classes = # training samples

    • Non-parametric softmax that compares features

    • Memory bank for storing representations of past samples $V = \{v_i\}$

picture 39

The model learns to scatter the feature vectors in the hypersphere while mapping visually similar images into closer regions

Vision Pretext Tasks: Contrastive Learning
  • Common approach:
    • Positive: make multiple views of one image and consider the image and its distorted versions as similar pairs
    • Negative: different images are treated as dissimilar

picture 40

A natural question: are there better ways to create multi-view images? ↓

Vision Pretext Tasks: Data Augmentation and Multiple Views
  • Augmented Multiscale Deep InfoMax
    • AMDIM, Bachman 2019
    • Views from different augmentations
    • create multiple views from one input image
  • Contrastive Multiview Coding
    • CMC
    • Uses different channels or semantic segmentation labels of an image as different views of a single image
  • Pretext-Invariant Representation Learning
    • Jigsaw transformation
    • (as an input transform)
Vision Pretext Tasks: Inter-Sample Classification
MoCo
  • MoCo (Momentum Contrast; He et al. 2019)

    • Memory bank is a FIFO queue now
    • The target features are encoded using a momentum encoder ⇒ many more negative samples can be obtained per batch at very little cost (see the sketch at the end of this subsection)
    • shuffling BN: mitigates the adverse effect of BatchNorm on self-supervised learning
  • MoCo v2:
    • MLP projection head
    • stronger data augmentation (adds blur)
    • Cosine learning rate schedule

    picture 42

  • MoCo v3:

    • Use Vision Transformer to replace ResNet
    • in-batch negatives

    picture 41
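
A minimal sketch of the two mechanics referenced above, the momentum (EMA) update of the key encoder and the FIFO queue of negative keys (the momentum value and the queue handling are simplified assumptions):

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    # EMA update of the key (target) encoder: theta_k <- m * theta_k + (1 - m) * theta_q
    for p_q, p_k in zip(query_encoder.parameters(), key_encoder.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, keys):
    """queue: (K, d) past key features; keys: (B, d) keys from the current batch."""
    # FIFO: the newest keys enter at the front, the oldest fall off the end.
    return torch.cat([keys, queue], dim=0)[: queue.size(0)]
```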

SimCLR
  • SimCLR (a Simple framework for Contrastive Learning of visual Representations)
    • Contrastive learning loss

    • f() – base encoder

    • g() – projection head layer

    • In-batch negative samples

      ✅ Use large batches to have a sufficient number of negative inputs

The framework is fully symmetric between the two augmented views.

  • SimCLR v2
    • Larger ResNet models
    • Deeper g()
    • Memory bank

picture 43
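
A minimal PyTorch sketch of SimCLR's in-batch contrastive loss (often called NT-Xent) over two augmented views: each sample's positive is its other view and the remaining 2B-2 samples act as negatives (the temperature is an assumed hyperparameter):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (B, d) projections g(f(.)) of two augmented views of the same batch."""
    B = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)            # (2B, d)
    sim = z @ z.t() / temperature                                 # cosine similarities
    sim = sim.masked_fill(torch.eye(2 * B, dtype=torch.bool), float("-inf"))  # drop self-pairs
    # The positive for row i is the same image's other view: i + B (or i - B).
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)])
    return F.cross_entropy(sim, targets)
```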

Barlow Twins
  • Barlow Twins (Zbontar et al. 2021)

    • Learn to make the cross-correlation matrix between the output features of two distorted versions of the same sample close to the identity
    • Make it as diagonal as possible
    • because: if the individual features are efficiently encoded, they shouldn't encode information that is redundant between any pair of dimensions ⇒ their correlation should be zero

    picture 44
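
A minimal PyTorch sketch of the Barlow Twins objective described above: standardize each feature over the batch, form the cross-correlation matrix of the two views, and push it toward the identity (the off-diagonal weight is an assumed hyperparameter):

```python
import torch

def barlow_twins_loss(z1, z2, lambd=5e-3):
    """z1, z2: (B, d) features of two distorted versions of the same batch."""
    B = z1.size(0)
    z1 = (z1 - z1.mean(0)) / z1.std(0)      # standardize each feature dimension over the batch
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = z1.t() @ z2 / B                      # (d, d) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # diagonal should be 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # off-diagonal should be 0
    return on_diag + lambd * off_diag
```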

Vision Pretext Tasks: Non-Contrastive Siamese Networks

Learn similarity representations for different augmented views of the same sample, but no contrastive component involving negative samples

  • the objective is just minimizing the L2 distance between features encoded from the same image

  • Bootstrap Your Own Latent (BYOL; Grill et al. 2020)

    • Momentum-encoded features as the target
  • SimSiam (Chen & He 2020)

    • No momentum encoder
    • Large batch size unnecessary
  • BatchNorm seems to be playing an important role

    • might be implicitly providing a contrastive learning signal

picture 1
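
A minimal sketch of the non-contrastive objective described above, in a SimSiam style: a symmetric negative cosine similarity (equivalent, for normalized features, to minimizing L2 distance) with a stop-gradient on the target branch and no negative samples; the projector/predictor architectures are omitted as assumptions:

```python
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    """p = predictor(projector(f(view))), z = projector(f(view)) for two views."""
    def neg_cos(p, z):
        # Stop-gradient on the target branch; no negative samples are used.
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)
```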

Vision Pretext Tasks: Feature Clustering with K-Means

another major approach to self-supervised learning:

  • to learn from clusters of features
  • DeepCluster (Caron et al. 2018)
    • Iteratively clusters features via k-means
    • then, uses cluster assignments as pseudo labels to provide supervised signals (see the sketch at the end of this subsection)
  • Online DeepCluster (Zhan et al. 2020)
    • Performs clustering and network updates simultaneously rather than alternately

picture 2

  • Prototypical Contrastive Learning (PCL, Li et al. 2020)
    • Online EM for clustering
    • combined with InfoNCE for smoothness
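
A minimal sketch of the DeepCluster-style loop referenced above: cluster features with k-means, then use the cluster assignments as pseudo labels for a standard classification loss (the k-means backend, number of clusters, and single-step structure are simplifying assumptions):

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def deepcluster_step(encoder, classifier, images, n_clusters=10):
    # 1) Cluster the current features with k-means (no gradients needed here).
    with torch.no_grad():
        feats = encoder(images)                                  # (N, d)
    pseudo = KMeans(n_clusters=n_clusters).fit_predict(feats.cpu().numpy())
    pseudo = torch.as_tensor(pseudo, dtype=torch.long)
    # 2) Use the cluster assignments as pseudo labels for a supervised-style loss.
    logits = classifier(encoder(images))                         # re-encode with gradients
    return F.cross_entropy(logits, pseudo)
```
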
Vision Pretext Tasks: Feature Clustering with Sinkhorn-Knopp

Sinkhorn-Knopp: a clustering algorithm based on optimal transport (OT)

  • SeLa (Self-Labelling, Asano et al. 2020)
  • SwAV (Swapping Assignments between multiple Views; Caron et al. 2020)
    • Implicit clustering via a learned prototype code (“anchor clusters”)
    • Predict the cluster assignment (code) of one view from the other view

picture 3

Vision Pretext Tasks: Feature Clustering to improve SSL

In this line of work, novel ideas based on clustering are designed to be used in conjunction with other SSL methods

  • InterCLR (Xie et al. 2020)
    • Inter-sample contrastive pairs are constructed according to pseudo labels obtained by clustering
    • i.e., positive samples for contrastive learning can also come from different images (rather than only from multiple views), using pseudo labels from online k-means clustering
  • Divide and Contrast (Tian et al. 2021)
    • Train expert models on the clustered datasets and then distill the experts into a single model

    • to improve the performance of other self-supervised learning models

      picture 4

Vision Pretext Tasks: Nearest-Neighbor
  • NNCLR (Dwibedi et al. 2021)
    • Contrast with the nearest neighbors in the embedding space

      ✅ to serve as the positives and negatives in contrastive learning

    • Allows for lighter data augmentation for views

      picture 5

Vision Pretext Tasks: Combining with Supervised Loss
  • Combine supervised loss + self-supervised learning
    • Self-supervised semi-supervised learning (S4L, Zhai et al 2019)
    • Unsupervised data augmentation (UDA, Xie et al 2019)
  • Use known labels for contrastive learning
    • Supervised Contrastive Loss (SupCon; Khosla et al. 2021)

      ✅ less sensitive to hyperparameter choices

picture 6

Video Pretext Tasks
Video Pretext Tasks: Innate Relationship Prediction
  • Most image pretext tasks can be applied to videos
  • However, with an additional time dimension, much more information about the video shot configuration or the physical world can be extracted from videos
    • Predicting object movements
    • 3D motion of camera
Video Pretext Tasks: Optical Flow

Tracking object movement over time

  • Tracking movement of image patches (Wang & Gupta, 2016)

picture 7

  • Segmentation based on motion (Pathak et al. 2017)
Video Pretext Tasks: Sequence Ordering
  • Temporal order Verification

    • Misra et al. 2016

    • Fernando et al. 2017

    • determine whether the temporal order is correct

      picture 11

  • Predict the arrow of time, forward or backward

    • Wei et al. 2018
    • classify whether the sequence is moving forward or backward in time
    • outperforms the temporal order verification model
Video Pretext Tasks: Colorization
  • Tracking emerges by colorizing videos (Vondrick et al. 2018)

    • Copy colors from a reference frame to another target frame in grayscale

    • by leveraging the natural temporal coherence of colors across video frames

      picture 12

  • Tracking emerges by colorizing videos (Vondrick et al. 2018)

    • Used for video segmentation or human pose estimation without fine-tuning

      ✅ because the model can propagate the colored markings from the labeled input frame directly into its predictions

picture 13

Video Pretext Tasks: Contrastive Multi-View Learning
  • TCN (Sermanet et al. 2017)

    • Use triplet loss

    • Different viewpoints of the same scene at the same timestep should share the same embedding, while embeddings should vary over time, even for the same camera viewpoint

      picture 14

  • Multi-frame TCN (Dwibedi et al. 2019)

    • Use n-pairs loss
    • Multiple frames are aggregated into the embedding
Video Pretext Task: Autoregressive Generation

Because video files are huge, generating coherent, continuous video has been a difficult task

  • Predicting videos with VQ-VAE (Walker et al. 2021)

    • first: learn to discretize the video into latent codes using a VQ-VAE

    • then: learn to autoregressively generate the frames using PixelCNN or Transformers

    • Combining a VQ-VAE with autoregressive models to generate high-dimensional data ⇒ a very powerful generative modeling recipe

      picture 15

  • VideoGPT: Video generation using VQ-VAE and Transformers (Yan et al. 2021)

  • Jukebox (Dhariwal et al. 2020)

    • learns 3 different levels of VQ-VAE using 3 different compression ratios
    • resulting in 3 sequences of discrete codes
    • then uses them to generate new music

picture 16

picture 17

  • CALM (Castellon et al. 2021)
    • Jukebox representation for MIR tasks
  • TagBox (Manilow et al. 2021)
    • Source separation by steering Jukebox’ latent space

      picture 18

Audio Pretext Tasks
Audio Pretext Tasks: Contrastive Learning
  • COLA (Saeed et al. 2021)
    • Assigns high similarity to audio clips extracted from the same recording and low similarity to clips from different recordings
    • predicts whether a pair of encoded features comes from the same recording or not
  • Multi-Format audio contrastive learning
    • assigns high similarity between the raw audio format and the corresponding spectral representation

    • maximizing agreement between features encoded from the raw waveform and the spectrogram formats

      picture 19

Audio Pretext Tasks: Masked Language Modeling for ASR

ASR: Automatic speech recognition

  • Wav2Vec 2.0 (Baevski et al. 2020)

    • applies contrastive learning on the representations of masked portions of the audio

      ✅ to learn discrete tokens from them

    • speech recognition models trained on these tokens show better performance compared to those trained on conventional audio features / raw audio

  • HuBERT (Hsu et al. 2021, FAIR)

    • learned by alternating between an offline clustering step and optimizing for cluster-assignment prediction (similar to DeepCluster)
  • Also employed by SpeechStew (Chan et al. 2021), Big SSL (Zhang et al. 2021)

    picture 20

Multimodal Pretext Tasks

applied to multimodal data, although the definition of self-supervised learning gets somewhat blurry here, depending on whether you consider a multimodal dataset as a single unlabeled dataset or as one modality giving supervision to another

  • MIL-NCE (Miech et al. 2020)

    • Find matching narration with video

    • trained contrastively to match narration with video, which can be used not only for correcting misalignment in videos but also for action recognition, text-to-video retrieval, action localization, and action segmentation

      picture 21

  • CLIP (Radford et al. 2021), ALIGN (Jia et al. 2021)

    • Contrast text and image embeddings from paired data
Language Pretext Tasks
Language Pretext Tasks: Generative Language Modeling
  • Pretrained language models:

    • They all rely on unlabelled text and try to predict part of the text (e.g., the next token or a sentence) from the context
    • they depend only on the natural order of words and sentences
  • Some examples (which have changed the landscape of NLP research quite a lot):

    • GPT

      ✅ Autoregressive;

      ✅ predict the next token based on the previous tokens

    • BERT

      ✅ as a bi-directional transformer model

      ✅ Masked language modeling (MLM)

      ✅ Next sentence prediction (NSP) ⇒ a binary classifier for telling whether one sentence is the next sentence of the other

    • ALBERT

      ✅ Sentence order prediction (SOP) ⇒ Positive sample: a pair of two consecutive segments from the same document; Negative sample: same as above but with the segment order switched

    • ELECTRA

      ✅ Replaced token detection (RTD) ⇒ random tokens are replaced and considered corrupted; in parallel, a binary discriminator is trained together with the generative model to predict whether each token has been replaced

Language Pretext Tasks: Sentence Embedding
  • Skip-thought vectors (Kiros et al. 2015)

    • Predict sentences based on the surrounding sentences
  • Quick-thought vectors (Logeswaran & Lee, 2018)

    • Identify the correct context sentence among other contrastive sentences

    picture 22

  • IS-BERT (“Info-Sentence BERT”; Zhang et al. 2020)

    • mutual information maximization
  • SimCSE (“Simple Contrastive learning of Sentence Embeddings”; Gao et al. 2021)

    • Predict a sentence from itself with only dropout noise
    • One sentence gets two different versions of dropout augmentations

    picture 23

  • Most models for learning sentence embeddings rely on supervised NLI (Natural Language Inference) datasets, such as SBERT (Reimers & Gurevych 2019) and BERT-flow
  • Unsupervised sentence embedding models (e.g., unsupervised SimCSE) still have a performance gap compared with the supervised versions (e.g., supervised SimCSE)

Training Techniques

  • Data augmentation
  • In-batch negative samples
  • Hard negative mining
  • Memory bank
  • Large batch size

contrastive learning can provide good results in terms of transfer performance

Techniques: Data augmentation
  • The data augmentation setup is critical for learning good embeddings

    • and generalizable embedding features
  • Approach:

    • Introduce non-essential variations into examples without modifying their semantic meaning
    • ⇒ thus encouraging the model to learn the essential part of the representation

image augmentation; text augmentation

Techniques: Data augmentation – Image Augmentation
  • Basic Image Augmentation:

    • Random crop
    • color distortion
    • Gaussian blur
    • color jittering
    • random flip / rotation
    • etc.
  • Augmentation Strategies

    • AutoAugment (Cubuk, et al. 2018): Inspired by NAS
    • RandAugment (Cubuk et al. 2019): reduces NAS search space in AutoAugment
    • PBA (Population based augmentation; Ho et al. 2019): evolutionary algorithms
    • UDA (Unsupervised Data Augmentation, Xie et al. 2019): selects an augmentation strategy to minimize the KL divergence between the predicted distribution over an unlabelled example and its unlabelled augmented version
  • Image mixture

    • Mixup (Zhang et al. 2018): weighted pixel-wise combination of two images (see the sketch after this list)

      ✅ to create new samples based on existing ones

    • Cutmix (Yun et al. 2019): mixes a local region of one image into the other

    • MoCHi (Mixing of Contrastive Hard Negatives): mixture of hard negative samples

      ✅ explicitly maintains a queue of negative samples sorted by similarity to the query in descending order ⇒ the first few samples in the queue are the hardest negatives ⇒ new hard negatives can then be created by mixing samples in this queue together, or even with the query
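
A minimal sketch of Mixup as described above (the Beta-distributed mixing coefficient follows the paper; treating labels as one-hot vectors is an assumption about the setup):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """x1, x2: image arrays of the same shape; y1, y2: one-hot label vectors."""
    lam = np.random.beta(alpha, alpha)          # mixing coefficient from Beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2               # weighted pixel-wise combination
    y = lam * y1 + (1 - lam) * y2               # labels are mixed with the same weight
    return x, y
```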

Techniques: Data augmentation – Text Augmentation
  • Lexical Edits

    • (just changing words or tokens)

    • EDA (Easy Data Augmentation; Wei & Zhou 2019): Synonym replacement, random insertion / swap / deletion

    • Contextual Augmentation (Kobayashi 2018): word substitution by BERT prediction

      ✅ try to find the replacement words using a bi-directional language model

  • Back-translation (Sennrich et al. 2015)

    • augments a sentence by first translating it to another language and then translating it back to the original language

      ✅ relies on the translation model ⇒ the meaning should stay largely unchanged

    • CERT (Fang et al. 2020) generates augmented sentences via back-translation

  • Dropout and Cutoff

    • SimCSE uses dropout (Gao et al. 2021)

      ✅ dropout: a universal way to apply transformations to any input

      ✅ SimCSE: uses dropout to create two different copies of the same text ⇒ universal, because it does not need expert knowledge about the attributes of the input modality (the change is at the architecture level)

    • Cutoff augmentation for text (Shen et al. 2020)

      ✅ masking randomly selected tokens, feature columns, or spans

    picture 24

Hard Negative Mining
What is “hard negative mining”
  • Hard negative samples are difficult to learn
    • They should have different labels from the anchor samples
    • But the embedding features may be very close
  • Hard negative mining is important for contrastive learning
  • Challenging negative samples encourages the model to learn better representations that can distinguish hard negatives from true positives
Explicit hard negative mining
  • Extract task-specific hard negative samples from labelled datasets
    • e.g., “contradiction” sentence pairs from NLI datasets.
    • (Most sentence embedding papers)
  • Keyword based retrieval
    • can be found by classic information retrieval models (Such as BM25)
  • Upweight the negative sample probability to be proportional to its similarity to the anchor sample
  • MoCHi: mine hard negative by sorting them according to similarity to the query in descending order
Implicit hard negative mining
  • In-batch negative samples
  • Memory bank (Wu et al. 2018, He et al. 2019)
    • Increase batch size
  • Large batch size via various training parallelism

Needs a large batch size

Theories

Why does contrastive learning work?

Contrastive learning captures shared information between views
  • InfoNCE (van den Oord et al. 2018)

    • is a lower bound to MI (Mutual information) between views:

    picture 25

  • Minimizing InfoNCE leads to maximizing the MI between view1 and view2

    • Therefore, by minimizing the InfoNCE loss ⇒ the encoders optimize the embedding space to retain as much of the information shared between the two views as possible
    • This is the InfoMax principle in contrastive learning

    picture 26

  • Q: How can we design good views?

    • augmentations are crucial for the performance
The InfoMin Principle
  • Optimal views are at the sweet spot where they encode only the information that is useful for transfer
    • Minimal sufficient encoder depends on downstream tasks (Tian et al. 2020)

    • Composite loss for finding the sweet spot (Tsai et al. 2020)

      ✅ helps converge to a minimal sufficient encoder

picture 27

To perform well in transfer learning ⇒ we want our model to capture the mutual information $I(x; y)$ between the data $x$ and the downstream label $y$

  • if the mutual information between the views, $I(v_1; v_2)$, is smaller than $I(x; y)$ ⇒ the model will fail to capture useful information for the downstream tasks
  • Meanwhile, if the mutual information between the views is too large ⇒ it contains excess information unrelated to the downstream tasks ⇒ transfer performance decreases due to the noise
  • ⇒ there is a sweet spot ⇒ the minimal sufficient encoder
  • This shows:
    • The optimal views are dependent on the downstream tasks
Alignment and Uniformity on the Hypersphere
  • Contrastively learned features are more uniform and aligned

    • Uniform: features should be distributed uniformly on the hypersphere $S^d$
    • Aligned: features from two views of the same input should be the same

    picture 28

  • compared with a randomly initialized network or a network trained with supervised learning
  • alignment is also measured, i.e., how small the distance between features from two views of the same input is (see the sketch below)
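
A minimal sketch of alignment and uniformity metrics in the form proposed by Wang & Isola (2020), which this analysis appears to follow (attributing the exact formulas to that paper is an assumption; the exponents follow its defaults):

```python
import torch

def alignment(x, y, alpha=2):
    """x, y: (B, d) L2-normalized features of two views of the same inputs; lower = better aligned."""
    return (x - y).norm(dim=1).pow(alpha).mean()

def uniformity(x, t=2):
    """x: (B, d) L2-normalized features; lower = more uniformly spread on the hypersphere."""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```
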
Dimensional Collapse
  • Contrastive methods sometimes suffer from dimensional collapse (Hua et al. 2021)
    • Learned features span a lower-dimensional subspace instead of using the full dimensionality
  • Two causes demonstrated by Jing et al (2021)
    • 1. Strong augmentation while creating the views
    • 2. Implicit regularization caused by the gradient descent dynamics
Provable Guarantees for Contrastive Learning
  • Sampling complexity decreases when:
    • Adopting contrastive learning objectives (Arora et al. 2019)
    • Predicting the known distribution in the data (Lee et al. 2020)
  • Linear classifier on learned representation is nearly optimal (Tosh et al. 2021)
  • Spectral Contrastive Learning (HaoChen et al. 2021)
    • based on a spectral decomposition of the augmentation graph

In short, the theory of contrastive learning has been very useful, but there is still a long way to go

Future Directions

briefly discuss a few open research questions and areas of work to look into

Future Directions
  • Large batch size ⇒ improved transfer performance

  • High-quality large data corpus ⇒ better performance

    • Learning from synthetic or Web data
    • Measuring dataset quality and filtering / active learning ⇒ better control over data quality
  • Efficient negative sample selection

    • to do hard negative mining
    • (a large batch size is not enough, because the batch size cannot grow indefinitely)
  • Combine multiple pretext tasks

    • How to combine
    • Best strategies

    picture 29

  • Data augmentation tricks have critical impacts but are still quite ad-hoc

    • Modality-dependent: most augmentation methods apply only to a single modality ⇒ most of them are handcrafted by humans

    • Theoretical foundations

      ✅ e.g., on why certain augmentation works better than others

      ✅ to guide us to find more efficient data augmentation

  • Improving training efficiency

    • Self-supervised learning methods are pushing the deep learning arms race

      ❌ increases in model size and training batch size

      ⇒ lead to increased costs, both economic and environmental

    • Direct impacts on the economic and environmental costs

  • Social biases in the embedding space

    • Early work on debiasing word embeddings
    • Biases in Dataset