CMU 11-785 Lecture 20: Boltzmann Machines 1

Training Hopfield nets

Geometric approach

  • $\mathbf{W}=\mathbf{Y}\mathbf{Y}^{T}-N_{p}\mathbf{I}$

  • $E(\mathbf{y})=\mathbf{y}^{T}\mathbf{W}\mathbf{y}$

  • Since $\mathbf{y}^{T}\left(\mathbf{Y}\mathbf{Y}^{T}-N_{p}\mathbf{I}\right)\mathbf{y}=\mathbf{y}^{T}\mathbf{Y}\mathbf{Y}^{T}\mathbf{y}-N N_{p}$

  • the behavior with this $\mathbf{W}$ is identical to the behavior with $\mathbf{W}=\mathbf{Y}\mathbf{Y}^{T}$

    • The energy landscapes differ only by an additive constant
    • The two matrices have the same eigenvectors


  • A pattern $\mathbf{y}_p$ is stored if:

    • $\operatorname{sign}\left(\mathbf{W}\mathbf{y}_{p}\right)=\mathbf{y}_{p}$ for all target patterns
  • Training: Design $\mathbf{W}$ such that this holds

  • Simple solution: $\mathbf{y}_p$ is an eigenvector of $\mathbf{W}$
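
As a concrete check of this storage condition, here is a minimal NumPy sketch (the sizes and variable names are my own choices, not from the lecture) that builds the Hebbian weight matrix for a few random ±1 patterns and verifies $\operatorname{sign}(\mathbf{W}\mathbf{y}_p)=\mathbf{y}_p$:

```python
import numpy as np

rng = np.random.default_rng(0)

N, Np = 16, 3                          # pattern width, number of stored patterns
Y = rng.choice([-1, 1], size=(N, Np))  # columns are the target patterns y_p

# Hebbian weights; subtracting Np*I only zeroes the diagonal,
# since (Y Y^T)_{ii} = Np for +/-1 patterns
W = Y @ Y.T - Np * np.eye(N)

# Storage check: sign(W y_p) should reproduce y_p for every target pattern
for p in range(Np):
    y = Y[:, p]
    print(f"pattern {p} stored:", np.array_equal(np.sign(W @ y), y))
```

For a light load of random patterns (small $N_p$ relative to $N$) the check typically passes; as the load grows, crosstalk between patterns starts flipping bits.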

Storing K orthogonal patterns
  • Let $\mathbf{Y}=\left[\mathbf{y}_{1}\ \mathbf{y}_{2}\ \ldots\ \mathbf{y}_{K}\right]$
    • $\mathbf{W}=\mathbf{Y}\Lambda\mathbf{Y}^{T}$
    • $\lambda_1,\ldots,\lambda_K$ are positive
    • For $\lambda_1=\lambda_2=\cdots=\lambda_K=1$ this is exactly the Hebbian rule (a small sketch follows this list)
  • Any pattern $\mathbf{y}$ can be written as
    • $\mathbf{y}=a_{1}\mathbf{y}_{1}+a_{2}\mathbf{y}_{2}+\cdots+a_{N}\mathbf{y}_{N}$
    • $\mathbf{W}\mathbf{y}=a_{1}\mathbf{W}\mathbf{y}_{1}+a_{2}\mathbf{W}\mathbf{y}_{2}+\cdots+a_{N}\mathbf{W}\mathbf{y}_{N}=\mathbf{y}$
  • All patterns are stable
    • Remembers everything
    • Completely useless network
  • Even if we store fewer than $N$ patterns
    • Let $\mathbf{Y}=\left[\mathbf{y}_{1}\ \mathbf{y}_{2}\ \ldots\ \mathbf{y}_{K}\ \mathbf{r}_{K+1}\ \mathbf{r}_{K+2}\ \ldots\ \mathbf{r}_{N}\right]$
    • $\mathbf{W}=\mathbf{Y}\Lambda\mathbf{Y}^{T}$
    • $\mathbf{r}_{K+1}, \mathbf{r}_{K+2}, \ldots, \mathbf{r}_{N}$ are orthogonal to $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_K$
    • $\lambda_1=\lambda_2=\cdots=\lambda_K=1$
    • Problems arise because the eigenvalues are all 1.0
      • This ensures stationarity of all vectors in the subspace
      • All stored patterns are equally important
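
Here is a minimal sketch of the $\mathbf{W}=\mathbf{Y}\Lambda\mathbf{Y}^{T}$ construction, assuming $K$ mutually orthogonal ±1 patterns taken from a Sylvester/Hadamard construction (my own choice for illustration) and $\lambda_1=\cdots=\lambda_K=1$:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction: for n a power of 2, the columns are mutually orthogonal +/-1 vectors."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

N, K = 8, 3
Y = hadamard(N)[:, 1:K + 1].astype(float)  # K mutually orthogonal +/-1 patterns as columns

# W = Y Lambda Y^T with lambda_1 = ... = lambda_K = 1 (the Hebbian choice, up to scaling)
W = Y @ np.eye(K) @ Y.T

for p in range(K):
    y = Y[:, p]
    # Each stored pattern is an eigenvector: W y_p = N y_p, so sign(W y_p) = y_p
    print(np.allclose(W @ y, N * y), np.array_equal(np.sign(W @ y), y))
```

With unnormalized ±1 columns the eigenvalue comes out as $N$ rather than 1; normalizing the columns of $\mathbf{Y}$ gives eigenvalue exactly 1 without changing the sign dynamics.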
General (non-orthogonal) vectors
  • $w_{ji}=\sum_{p\in\{p\}} y_{i}^{p} y_{j}^{p}$
  • The maximum number of stationary patterns is actually exponential in $N$ (McEliece and Posner, 1984)
  • For a specific set of $K$ patterns, we can always build a network for which all $K$ patterns are stable, provided $K \le N$
    • But this may come with many "parasitic" memories

Optimization

  • Energy function
    • $E=-\frac{1}{2}\mathbf{y}^{T}\mathbf{W}\mathbf{y}-\mathbf{b}^{T}\mathbf{y}$
    • This must be maximally low for target patterns
    • Must be maximally high for all other patterns
      • So that they are unstable and evolve into one of the target patterns
  • Estimate $\mathbf{W}$ such that
    • $E$ is minimized for $\mathbf{y}_1,\ldots,\mathbf{y}_P$
    • $E$ is maximized for all other $\mathbf{y}$
  • Minimize total energy of target patterns
    • $E(\mathbf{y})=-\frac{1}{2}\mathbf{y}^{T}\mathbf{W}\mathbf{y} \qquad \widehat{\mathbf{W}}=\underset{\mathbf{W}}{\operatorname{argmin}}\sum_{\mathbf{y}\in\mathbf{Y}_{P}} E(\mathbf{y})$
    • However, this might also pull all the neighborhood states down
  • Maximize the total energy of all non-target patterns
    • $E(\mathbf{y})=-\frac{1}{2}\mathbf{y}^{T}\mathbf{W}\mathbf{y}$
    • $\widehat{\mathbf{W}}=\underset{\mathbf{W}}{\operatorname{argmin}}\sum_{\mathbf{y}\in\mathbf{Y}_{P}} E(\mathbf{y})-\sum_{\mathbf{y}\notin\mathbf{Y}_{P}} E(\mathbf{y})$
  • Simple gradient descent (a sketch follows this list)
    • $\mathbf{W}=\mathbf{W}+\eta\left(\sum_{\mathbf{y}\in\mathbf{Y}_{P}}\mathbf{y}\mathbf{y}^{T}-\sum_{\mathbf{y}\notin\mathbf{Y}_{P}}\mathbf{y}\mathbf{y}^{T}\right)$

    • minimize the energy at target patterns

    • raise all non-target patterns

      • Do we need to raise everything?
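
Here is a minimal sketch of this gradient step, assuming the non-target patterns to be raised are supplied explicitly (in practice they are the valleys found by letting the network evolve, as discussed next); the function names are my own:

```python
import numpy as np

def energy(W, y):
    """E(y) = -1/2 y^T W y (bias term omitted)."""
    return -0.5 * y @ W @ y

def gradient_step(W, targets, non_targets, eta=0.01):
    """One step of W <- W + eta * (sum_{y in Y_P} y y^T - sum_{y not in Y_P} y y^T).
    Lowers the energy at the target patterns and raises it at the chosen non-target patterns."""
    pos = sum(np.outer(y, y) for y in targets)
    neg = sum(np.outer(y, y) for y in non_targets)
    W = W + eta * (pos - neg)
    np.fill_diagonal(W, 0)  # keep the no-self-connection convention
    return W
```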
Raise negative class
  • Focus on raising the valleys

    • If you raise every valley, eventually they’ll all move up above the target patterns, and many will even vanish
  • How do you identify the valleys for the current W W W?


    • Initialize the network randomly and let it evolve

    • It will settle in a valley


  • Should we randomly sample valleys?

    • Are all valleys equally important?
    • Major requirement: memories must be stable
      • They must be broad valleys
  • Solution: initialize the network at valid memories and let it evolve (a sketch of these dynamics follows this list)

    • It will settle in a valley
    • If this is not the target pattern, raise it
  • What if there's another target pattern further down the valley?


    • no need to raise the entire surface, or even every valley

      • Raise the neighborhood of each target memory
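Here is a minimal sketch of the valley-finding step described above, using standard asynchronous Hopfield updates (my own code, not from the lecture):

```python
import numpy as np

def evolve(W, y0, max_sweeps=100, rng=None):
    """Asynchronous Hopfield dynamics: update one unit at a time until nothing changes.
    The fixed point the state settles into is a local minimum of the energy, i.e. a valley."""
    rng = rng or np.random.default_rng()
    y = y0.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(y)):
            s = np.sign(W[i] @ y)
            if s != 0 and s != y[i]:
                y[i] = s
                changed = True
        if not changed:  # no unit flipped in a full sweep: we are in a valley
            break
    return y

# To sample a valley, start either from a random state or from a stored memory, e.g.
#   valley = evolve(W, rng.choice([-1, 1], size=N))  # random initialization
#   valley = evolve(W, Y[:, p])                      # initialize at a valid memory
# If the state reached is not a target pattern, raise its energy in the next gradient step.
```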

Storing more than N patterns


  • Visible neurons
    • The neurons that store the actual patterns of interest
  • Hidden neurons
    • The neurons that only serve to increase the capacity but whose actual values are not important


  • The maximum number of patterns the net can store is bounded by the width $N$ of the patterns…
  • So let's pad the patterns with $K$ "don't care" bits
    • The new width of the patterns is $N+K$
    • Now we can store $N+K$ patterns!
  • Taking advantage of the don't-care bits
    • Simple random setting of the don't-care bits, together with the usual training and recall strategies for Hopfield nets, should work (a minimal sketch follows this list)
    • However, to exploit them properly, it helps to view the Hopfield net differently: as a probabilistic machine
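
A minimal sketch of the padding idea, assuming the simple random setting of the don't-care bits mentioned above (the sizes are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, P = 8, 8, 10                          # visible width, "don't care" bits, number of patterns

visible = rng.choice([-1, 1], size=(P, N))  # the actual patterns of interest
hidden = rng.choice([-1, 1], size=(P, K))   # simple random setting of the don't-care bits
padded = np.hstack([visible, hidden])       # stored patterns now have width N + K

# Usual Hebbian training on the padded patterns (rows of `padded`)
W = padded.T @ padded - P * np.eye(N + K)

# Recall: initialize the first N units from a (possibly noisy) cue, the last K units
# arbitrarily, run the usual Hopfield dynamics, and read out only the first N units.
# Whether all P patterns are actually stable still depends on the training rule and the
# load; the optimization-based training above can be used instead of plain Hebbian learning.
```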

A probabilistic interpretation

  • For binary y the energy of a pattern is the analog of the negative log likelihood of a Boltzmann distribution
    • Minimizing energy maximizes log likelihood
    • $E(\mathbf{y})=-\frac{1}{2}\mathbf{y}^{T}\mathbf{W}\mathbf{y} \qquad P(\mathbf{y})=C\exp(-E(\mathbf{y}))$

Boltzmann Distribution

  • $E(\mathbf{y})=-\frac{1}{2}\mathbf{y}^{T}\mathbf{W}\mathbf{y}-\mathbf{b}^{T}\mathbf{y}$
  • $P(\mathbf{y})=C\exp\left(\frac{-E(\mathbf{y})}{kT}\right)$
  • $C=\frac{1}{\sum_{\mathbf{y}}\exp\left(\frac{-E(\mathbf{y})}{kT}\right)}$
  • $k$ is the Boltzmann constant, $T$ is the temperature of the system
  • Optimizing $\mathbf{W}$
    • $E(\mathbf{y})=-\frac{1}{2}\mathbf{y}^{T}\mathbf{W}\mathbf{y} \qquad \widehat{\mathbf{W}}=\underset{\mathbf{W}}{\operatorname{argmin}}\sum_{\mathbf{y}\in\mathbf{Y}_{P}} E(\mathbf{y})-\sum_{\mathbf{y}\notin\mathbf{Y}_{P}} E(\mathbf{y})$
    • Simple gradient descent:
    • $\mathbf{W}=\mathbf{W}+\eta\left(\sum_{\mathbf{y}\in\mathbf{Y}_{P}}\alpha_{\mathbf{y}}\,\mathbf{y}\mathbf{y}^{T}-\sum_{\mathbf{y}\notin\mathbf{Y}_{P}}\beta(E(\mathbf{y}))\,\mathbf{y}\mathbf{y}^{T}\right)$
    • $\alpha_{\mathbf{y}}$ gives more importance to more frequently presented memories
    • $\beta(E(\mathbf{y}))$ gives more importance to more attractive spurious memories
    • This looks like an expectation (a small numerical sketch follows this list)
    • $\mathbf{W}=\mathbf{W}+\eta\left(E_{\mathbf{y}\sim\mathbf{Y}_{P}}\left[\mathbf{y}\mathbf{y}^{T}\right]-E_{\mathbf{y}\sim Y}\left[\mathbf{y}\mathbf{y}^{T}\right]\right)$
  • The behavior of the Hopfield net is analogous to annealed dynamics of a spin glass characterized by a Boltzmann distribution
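
To make the expectation form concrete, here is a minimal sketch (my own code and function names) that computes the model-side expectation exactly by enumerating all $2^N$ states of a small network; at realistic sizes this expectation has to be estimated by sampling from the network's stochastic dynamics instead:

```python
import numpy as np
from itertools import product

def energy(W, y, b=None):
    """E(y) = -1/2 y^T W y - b^T y."""
    e = -0.5 * y @ W @ y
    return e if b is None else e - b @ y

def boltzmann_distribution(W, T=1.0):
    """Exact Boltzmann distribution P(y) = exp(-E(y)/T) / Z over all 2^N states
    (T plays the role of kT; feasible only for small N)."""
    N = W.shape[0]
    states = np.array(list(product([-1.0, 1.0], repeat=N)))
    logits = np.array([-energy(W, y) / T for y in states])
    p = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return states, p / p.sum()

def expectation_update(W, targets, T=1.0, eta=0.01):
    """W <- W + eta * ( E_{y ~ data}[y y^T] - E_{y ~ model}[y y^T] ), model term computed exactly."""
    data_term = np.mean([np.outer(y, y) for y in targets], axis=0)
    states, p = boltzmann_distribution(W, T)
    model_term = np.einsum('s,si,sj->ij', p, states, states)
    W = W + eta * (data_term - model_term)
    np.fill_diagonal(W, 0)
    return W
```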