RL Paper Reading: Unsupervised State Representation Learning in Atari

Unsupervised State Representation Learning in Atari

[Paper] [Code]
Venue: NeurIPS
Year: 2019
Institute: Mila, Université de Montréal
Authors: Ankesh Anand*, Evan Racah*, Sherjil Ozair*
# State Representation Learning # Contrastive Self-Supervised Learning

Abstract

State representation learning without supervision from rewards is a challenging open problem. This paper proposes a new contrastive state representation learning method called Spatiotemporal DeepInfomax (ST-DIM) that leverages recent advances in self-supervision and learns state representations by maximizing the mutual information across spatially and temporally distinct features of a neural encoder of the observations.

0. Mutual Information

  1. What is mutual information?
    Wikipedia: Mutual information
    It quantifies the “amount of information” (in units such as shannons, commonly called bits) obtained about one random variable by observing the other random variable.
    $$I(X;Y) = D_{KL}\left(P_{(X,Y)} \,\|\, P_X \otimes P_Y\right)$$
  2. Why maximize it?
    A representation that retains high mutual information with the observations keeps the features most predictive of the environment's state, without needing reward supervision.
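To make the definition above concrete, for discrete variables the mutual information can be computed directly from a joint probability table (a minimal NumPy sketch; the helper name is ours, not from the paper):

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(X;Y) in bits from a joint probability table.

    I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    """
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal P(X)
    py = joint.sum(axis=0, keepdims=True)   # marginal P(Y)
    nz = joint > 0                          # skip zero cells to avoid log(0)
    return float((joint[nz] * np.log2(joint[nz] / (px * py)[nz])).sum())

# X and Y perfectly correlated: I(X;Y) = H(X) = 1 bit
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))   # -> 1.0

# X and Y independent: I(X;Y) = 0 bits
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # -> 0.0
```

The two extremes illustrate why maximizing I is useful: a representation independent of the input (I = 0) has learned nothing, while one that determines the input retains all its information.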

1. Spatiotemporal Deep Infomax

Prior work in neuroscience has suggested that the brain maximizes predictive information at an abstract level to avoid sensory overload. Thus our representation learning approach relies on maximizing an estimate based on a lower bound on the mutual information over consecutive observations $x_t$ and $x_{t+1}$.
From https://arxiv.org/pdf/1906.08226.pdf
For the mutual information estimator, we use InfoNCE:
$$I_{NCE}(\{x_i, y_i\}_{i=1}^N) = \sum_{i=1}^N \log\frac{\exp f(x_i, y_i)}{\sum_{j=1}^N \exp f(x_i, y_j)}$$
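The estimator above can be written compactly over an $N \times N$ score matrix whose diagonal holds the positive pairs (a NumPy sketch; averaging over the batch instead of summing, and using a plain dot product as the score function $f$, are our simplifications):

```python
import numpy as np

def info_nce(scores):
    """InfoNCE objective from an N x N score matrix.

    scores[i, j] = f(x_i, y_j); the diagonal holds the positive pairs.
    Returns the mean log-softmax of the positives (higher is better);
    negating it gives the loss to minimize.
    """
    scores = np.asarray(scores, dtype=float)
    # log-softmax over each row, numerically stabilized
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(np.mean(np.diag(log_softmax)))

# Toy batch: scores via dot products; positives are noisy copies of anchors
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
y = x + 0.1 * rng.normal(size=(8, 16))
print(info_nce(x @ y.T))   # near 0 when positives clearly outscore negatives
```

When all scores are equal the objective sits at $\log(1/N)$, and it approaches 0 as each positive dominates its row, which is how minimizing the loss tightens the mutual-information bound.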
The global-local objective (the representations of the small image patches are taken to be the hidden activations of the convolutional encoder applied to the full observation):
$$L_{GL} = \sum_{m=1}^{M}\sum_{n=1}^{N} -\log \frac{\exp g_{m,n}(x_t, x_{t+1})}{\sum_{x_{t^*} \in X_{next}} \exp g_{m,n}(x_t, x_{t^*})}, \qquad g_{m,n}(x_t, x_{t+1}) = \phi(x_t)^\top W_g\, \phi_{m,n}(x_{t+1})$$
The local-local objective:
$$L_{LL} = \sum_{m=1}^{M}\sum_{n=1}^{N} -\log \frac{\exp f_{m,n}(x_t, x_{t+1})}{\sum_{x_{t^*} \in X_{next}} \exp f_{m,n}(x_t, x_{t^*})}, \qquad f_{m,n}(x_t, x_{t+1}) = \phi_{m,n}(x_t)^\top W_l\, \phi_{m,n}(x_{t+1})$$
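Both objectives can be computed per patch location with the same InfoNCE machinery, treating the other next-step observations in the batch as the negative set $X_{next}$ (a NumPy sketch under assumed shapes; the function and variable names are ours, not the authors' code):

```python
import numpy as np

def st_dim_losses(global_t, local_t, local_t1, W_g, W_l):
    """Sketch of the ST-DIM objectives L_GL and L_LL on one batch.

    global_t:  (B, D)       global features phi(x_t)
    local_t:   (B, M, N, C) local features phi_{m,n}(x_t)
    local_t1:  (B, M, N, C) local features phi_{m,n}(x_{t+1})
    W_g: (D, C) and W_l: (C, C) are the bilinear score matrices.
    Negatives for anchor b are the other next-step observations in the batch.
    """
    B, M, N, C = local_t1.shape

    def nce(scores):
        # scores: (B, B); scores[b, b'] pairs anchor b with next-obs b'
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_sm = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -np.diag(log_sm).sum()   # -log-softmax of the positives

    loss_gl = 0.0
    loss_ll = 0.0
    for m in range(M):
        for n in range(N):
            # g_{m,n}(x_t, x_{t*}) = phi(x_t)^T W_g phi_{m,n}(x_{t*})
            loss_gl += nce(global_t @ W_g @ local_t1[:, m, n].T)
            # f_{m,n}(x_t, x_{t*}) = phi_{m,n}(x_t)^T W_l phi_{m,n}(x_{t*})
            loss_ll += nce(local_t[:, m, n] @ W_l @ local_t1[:, m, n].T)
    return loss_gl, loss_ll
```

Summing over every $(m, n)$ location is what forces each local patch, not just the global vector, to be predictive of its counterpart at the next timestep.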

2. What the contrastive task is

I asked a question about this paper, and surprisingly the author Ankesh Anand himself answered it. He said:

It’s easier to think of the method in terms of what the contrastive task is than to understand the mutual information stuff. The contrastive task is: correctly classify whether a pair of frames is consecutive or not. To do this task well, an encoder needs to focus only on things that change across time (often high-level features such as the agents, enemies, scores, etc.), and ignore low-level or pixel-level details such as the precise texture of the background. For solving the contrastive task, we use the InfoNCE loss. It turns out that minimizing the InfoNCE loss is equivalent to maximizing a lower bound on the mutual information between consecutive frames.
If you do this classification task between whole frames, the encoder will focus only on high-entropy features such as the clock, since it changes every frame. So, to prevent the encoder from focusing only on a single high-entropy feature, we do the contrastive task across each local patch in the feature map.

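The pair-construction step behind this classification task can be sketched as a toy helper (indices stand in for frames; this is an illustration of the idea, not the paper's code):

```python
import numpy as np

def make_pairs(frames, num_negatives=1, rng=None):
    """Build (anchor, other, label) index pairs for the consecutive-frame task.

    frames: sequence of observations from one episode.
    label 1: (x_t, x_{t+1}) are consecutive;
    label 0: x_t paired with a randomly drawn non-consecutive frame.
    """
    if rng is None:
        rng = np.random.default_rng()
    pairs = []
    T = len(frames)
    for t in range(T - 1):
        pairs.append((t, t + 1, 1))            # positive pair
        for _ in range(num_negatives):
            s = int(rng.integers(T))
            while s == t + 1:                  # exclude the true successor
                s = int(rng.integers(T))
            pairs.append((t, s, 0))            # negative pair
    return pairs

pairs = make_pairs(list(range(6)), rng=np.random.default_rng(0))
# positives are (t, t+1, 1); negatives pair t with some other random index
```

A classifier that scores these pairs well must pick up exactly the features that change between consecutive frames, which is the intuition from the answer above.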
3. Framework
From https://arxiv.org/pdf/1906.08226.pdf
4. Discussion
We also observe that the best generative model (PIXEL-PRED) does not suffer from this problem. It performs worst on high-entropy features such as the clock and player score (where ST-DIM excels), and does slightly better than ST-DIM on low-entropy features that contribute heavily to the pixel space, such as player and enemy locations. This sheds light on a qualitative difference between contrastive and generative methods: contrastive methods prefer capturing high-entropy features irrespective of their contribution to the pixel space, while generative methods prefer capturing large, low-entropy objects. This complementary nature suggests hybrid models as an exciting direction for future work.
