Statistical Inference (6): Modeling

1. Modeling problem

  • formulation

    • a set of distributions
$$\mathcal{P}=\left\{p_{\mathrm{y}}(\cdot\,; x) \in \mathcal{P}^{\mathcal{Y}}: x \in \mathcal{X}\right\}$$

    • approximation
$$\min_{q \in \mathcal{P}^{\mathcal{Y}}} \max_{x \in \mathcal{X}} D\left(p_{\mathrm{y}}(\cdot\,; x) \,\|\, q(\cdot)\right)$$

  • solution

Theorem: For any $q \in \mathcal{P}^{\mathcal{Y}}$ there exists a mixture model $q_w(\cdot) = \sum_{x \in \mathcal{X}} w(x)\, p_{\mathrm{y}}(\cdot\,; x)$ such that
$$D\left(p_{\mathrm{y}}(\cdot\,; x) \,\|\, q_{w}(\cdot)\right) \leq D\left(p_{\mathrm{y}}(\cdot\,; x) \,\|\, q(\cdot)\right) \quad \text{for all } x \in \mathcal{X}$$
Proof: apply the Pythagorean theorem of information geometry.

It then follows easily that
$$\max_{x \in \mathcal{X}} \min_{q \in \mathcal{P}^{\mathcal{Y}}} D\left(p_{\mathrm{y}}(\cdot\,; x) \,\|\, q(\cdot)\right) = \max_{x \in \mathcal{X}} 0 = 0$$
and, since the maximum over $w$ of a function linear in $w$ is attained at a vertex of the simplex,
$$\min_{q \in \mathcal{P}^{\mathcal{Y}}} \max_{x \in \mathcal{X}} D\left(p_{\mathrm{y}}(\cdot\,; x) \,\|\, q(\cdot)\right) = \min_{q \in \mathcal{P}^{\mathcal{Y}}} \max_{w \in \mathcal{P}^{\mathcal{X}}} \sum_{x} w(x)\, D\left(p_{\mathrm{y}}(\cdot\,; x) \,\|\, q(\cdot)\right)$$

Theorem (Redundancy-Capacity Theorem): The following equality holds, and the optimal $w, q$ on the two sides coincide:
$$R^{+} \triangleq \min_{q \in \mathcal{P}^{\mathcal{Y}}} \max_{w \in \mathcal{P}^{\mathcal{X}}} \sum_{x} w(x)\, D\left(p_{\mathrm{y}}(\cdot\,; x) \,\|\, q(\cdot)\right) = \max_{w \in \mathcal{P}^{\mathcal{X}}} \min_{q \in \mathcal{P}^{\mathcal{Y}}} \sum_{x} w(x)\, D\left(p_{\mathrm{y}}(\cdot\,; x) \,\|\, q(\cdot)\right) \triangleq R^{-}$$
Proof:

  1. Use the Equidistance property below to show $R^+ \le R^-$.
  2. By the general minimax $\ge$ maximin inequality, $R^+ \ge R^-$.
  3. Hence $R^+ = R^-$.
  4. Show that the two inequalities hold with equality at the same pair $w, q$.
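The Redundancy-Capacity equality can be sanity-checked numerically by brute force. A minimal sketch over a toy family of two Bernoulli models (the parameters 0.1/0.9 and the grid resolutions are illustrative assumptions, not from the notes); $R^+$ uses the fact that the max over $w$ sits at a vertex $x$:

```python
import numpy as np

def kl_bern(a, b):
    """KL divergence D(Bern(a) || Bern(b)) in nats."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

ps = np.array([0.1, 0.9])                # p_y(1; x) for x in {0, 1}
thetas = np.linspace(0.01, 0.99, 999)    # candidate models q = Bern(theta)
ws = np.linspace(0.0, 1.0, 1001)         # prior weight w(x = 0)

# D[i, j] = D(p_y(.; x_i) || Bern(theta_j))
D = np.array([[kl_bern(p, t) for t in thetas] for p in ps])

# R+ = min_q max_x D   (max over w of a linear form is attained at a vertex)
R_plus = D.max(axis=0).min()

# R- = max_w min_q sum_x w(x) D
avg = np.array([(w * D[0] + (1 - w) * D[1]).min() for w in ws])
R_minus = avg.max()

assert abs(R_plus - R_minus) < 1e-2      # equal up to grid resolution
```

For this symmetric family the common value is $C = D(\mathrm{Bern}(0.1)\,\|\,\mathrm{Bern}(0.5)) \approx 0.368$ nats, achieved at $w^* = (1/2, 1/2)$ and $q^* = \mathrm{Bern}(0.5)$.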

2. Model capacity

First compute the inner min of $R^-$:
$$\begin{aligned} &\min_{q \in \mathcal{P}^{\mathcal{Y}}} \sum_{x} w(x)\, D\left(p_{\mathrm{y}}(\cdot\,; x) \,\|\, q(\cdot)\right) \\ =\; &\min_{q \in \mathcal{P}^{\mathcal{Y}}} \sum_{x, y} w(x)\, p_{\mathrm{y}}(y ; x) \log \frac{p_{\mathrm{y}}(y ; x)}{q(y)} \\ =\; &\text{constant} - \max_{q \in \mathcal{P}^{\mathcal{Y}}} \sum_{y} q_{w}(y) \log q(y) \\ =\; &\text{constant} - \max_{q \in \mathcal{P}^{\mathcal{Y}}} \mathbb{E}_{q_{w}}[\log q(\mathrm{y})] \end{aligned}$$
By Gibbs' inequality the maximizer is
$$q^*(\cdot) = q_{w}(\cdot) \triangleq \sum_{x \in \mathcal{X}} w(x)\, p_{\mathrm{y}}(\cdot\,; x)$$
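That the mixture $q_w$ minimizes the weighted KL objective can be spot-checked numerically. A small sketch with a randomly drawn 3-model family over a 4-letter alphabet and a fixed prior $w$ (all concrete values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """KL divergence between two strictly positive pmfs, in nats."""
    return np.sum(p * np.log(p / q))

P = rng.dirichlet(np.ones(4), size=3)   # rows: p_y(.; x) for x = 0, 1, 2
w = np.array([0.2, 0.5, 0.3])           # a fixed prior over models
q_w = w @ P                             # the mixture sum_x w(x) p_y(.; x)

obj = lambda q: sum(w[i] * kl(P[i], q) for i in range(3))

# No randomly drawn candidate q beats the mixture q_w on the weighted objective.
assert all(obj(q_w) <= obj(rng.dirichlet(np.ones(4))) + 1e-12 for _ in range(200))
```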
Now consider the outer max of $R^-$; it can be recast from a Bayesian viewpoint, reading $w$ as a prior $p_{\mathrm{x}}$ on the model index:
$$\begin{aligned} R^{-} &= \max_{w \in \mathcal{P}^{\mathcal{X}}} \sum_{x} w(x)\, D\left(p_{\mathrm{y}}(\cdot\,; x) \,\|\, q_{w}(\cdot)\right) \\ &= \max_{w \in \mathcal{P}^{\mathcal{X}}} \sum_{x, y} w(x)\, p_{\mathrm{y}}(y ; x) \log \frac{p_{\mathrm{y}}(y ; x)}{\sum_{x'} w(x')\, p_{\mathrm{y}}(y ; x')} \\ &\overset{\text{Bayesian}}{=} \max_{p_{\mathrm{x}}} \sum_{x} p_{\mathrm{x}}(x)\, D\left(p_{\mathrm{y}|\mathrm{x}}(\cdot \mid x) \,\|\, p_{\mathrm{y}}(\cdot)\right) \\ &= \max_{p_{\mathrm{x}}} \sum_{x, y} p_{\mathrm{x}}(x)\, p_{\mathrm{y}|\mathrm{x}}(y \mid x) \log \frac{p_{\mathrm{y}|\mathrm{x}}(y \mid x)}{p_{\mathrm{y}}(y)} \\ &= \max_{p_{\mathrm{x}}} \sum_{x, y} p_{\mathrm{x},\mathrm{y}}(x, y) \log \frac{p_{\mathrm{x},\mathrm{y}}(x, y)}{p_{\mathrm{x}}(x)\, p_{\mathrm{y}}(y)} = \max_{p_{\mathrm{x}}} I(\mathrm{x} ; \mathrm{y}) = C \end{aligned}$$

Definition: for a model $p_{\mathrm{y}|\mathrm{x}}$, its capacity is
$$C \triangleq \max_{p_{\mathrm{x}}} I(\mathrm{x} ; \mathrm{y})$$

  • Model capacity: $C$
  • least informative prior: $p_{\mathrm{x}}^*$

Theorem (Equidistance property): the optimal $w^*$ (equivalently $p_{\mathrm{x}}^*$) achieving $C$ and the corresponding $q^*$ satisfy
$$D\left(p_{\mathrm{y}}(\cdot\,; x) \,\|\, q^*(\cdot)\right) \le C \qquad \forall x \in \mathcal{X}$$
with equality for every $x$ such that $w^*(x) > 0$.

Proof:

  1. $I(\mathrm{x};\mathrm{y})$ is concave in $p_{\mathrm{x}}(a)$ for every $a$.
  2. Form the Lagrangian $\mathcal{L} = I(\mathrm{x};\mathrm{y}) - \lambda\left(\sum_x p_{\mathrm{x}}(x) - 1\right)$, which is likewise concave in $p_{\mathrm{x}}(a)$.
  3. The maximizer of $\max_{p_{\mathrm{x}}} I(\mathrm{x};\mathrm{y})$ must satisfy $\left.\frac{\partial I(\mathrm{x};\mathrm{y})}{\partial p_{\mathrm{x}}(a)}\right|_{p_{\mathrm{x}}=p_{\mathrm{x}}^{*}} - \lambda = 0$ for all $a \in \mathcal{X}$ such that $p_{\mathrm{x}}^{*}(a) > 0$, and $\left.\frac{\partial I(\mathrm{x};\mathrm{y})}{\partial p_{\mathrm{x}}(a)}\right|_{p_{\mathrm{x}}=p_{\mathrm{x}}^{*}} - \lambda \le 0$ for all $a \in \mathcal{X}$ such that $p_{\mathrm{x}}^{*}(a) = 0$.
  4. Since $\frac{\partial I(\mathrm{x};\mathrm{y})}{\partial p_{\mathrm{x}}(a)} = D\left(p_{\mathrm{y}|\mathrm{x}}(\cdot \mid a) \,\|\, p_{\mathrm{y}}\right) - \log e$, the equality/inequality pattern of step 3 gives exactly the statement of the theorem.
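In practice, $C$ and the least informative prior can be computed with the standard Blahut-Arimoto iteration (not covered in these notes), and the equidistance property checked on its output. A sketch; the three-row model family below is an assumed toy example whose middle row is itself a mixture of the outer two, so the least informative prior should put (near-)zero weight on it:

```python
import numpy as np

def blahut_arimoto(W, iters=2000):
    """Blahut-Arimoto: compute C = max_w I(x; y) for a finite family W[x, y] = p_y(y; x)."""
    w = np.full(W.shape[0], 1.0 / W.shape[0])
    for _ in range(iters):
        q = w @ W                               # mixture q_w
        d = np.sum(W * np.log(W / q), axis=1)   # D(p_y(.; x) || q_w) for each x
        w = w * np.exp(d)                       # multiplicative reweighting step
        w /= w.sum()
    q = w @ W
    d = np.sum(W * np.log(W / q), axis=1)
    return w, q, float(w @ d)

W = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])
w_star, q_star, C = blahut_arimoto(W)
d = np.sum(W * np.log(W / q_star), axis=1)

# Equidistance: D <= C everywhere, with equality wherever w*(x) > 0.
assert np.all(d <= C + 1e-6)
assert np.allclose(d[w_star > 1e-3], C, atol=1e-4)
```

For this family the iteration converges to $w^* \approx (1/2,\, 0,\, 1/2)$, $q^* = (1/2, 1/2)$, and $C \approx 0.368$ nats; the middle model sits strictly inside the ball, $D(p_{\mathrm{y}}(\cdot\,;1)\,\|\,q^*) = 0 < C$.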

3. Inference with mixture models

  • Formulation: given observations $y_-$, predict $y_+$

  • Solution

    • Use the least informative prior $w^*$ obtained above to model the distribution of $\mathbf{y} = [y_-, y_+]$:
      $$q_{\mathbf{y}}^{*}(\mathbf{y}) = \sum_{x} w^{*}(x)\, p_{\mathbf{y}}(\mathbf{y} ; x)$$

    • The posterior predictive then follows:
      $$\begin{aligned} q_{\mathrm{y}_{+}|\mathrm{y}_{-}}^{*}\left(\cdot \mid y_{-}\right) &\triangleq \frac{q_{\mathbf{y}}^{*}\left(y_{+}, y_{-}\right)}{q_{\mathrm{y}_{-}}^{*}\left(y_{-}\right)} = \frac{\sum_{x} w^{*}(x)\, p_{\mathbf{y}}\left(y_{+}, y_{-} ; x\right)}{\sum_{a} w^{*}(a)\, p_{\mathrm{y}_{-}}\left(y_{-} ; a\right)} \\ &= \sum_{x} w^{*}\left(x \mid y_{-}\right) p_{\mathrm{y}_{+}|\mathrm{y}_{-}}\left(y_{+} \mid y_{-} ; x\right) \end{aligned}$$

    • This amounts to a soft decision, whereas the ML plug-in commits to the single model $p_{\mathrm{y}_{+}|\mathrm{y}_{-}}(\cdot \mid y_-; \hat{x}_{\mathrm{ML}})$
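The soft/hard contrast can be made concrete with two coin models. A small sketch; the biases 0.3/0.7 are assumptions, and the uniform $w^*$ is the least informative prior for this symmetric two-model family:

```python
# Two coin models for i.i.d. flips; w* is uniform by symmetry of the family.
p_heads = {"A": 0.3, "B": 0.7}
w_star = {"A": 0.5, "B": 0.5}

def likelihood(y_minus, p):
    """Probability of the observed flip sequence y_- under heads-probability p."""
    k = sum(y_minus)
    return p**k * (1 - p)**(len(y_minus) - k)

def predictive(y_minus):
    """Mixture posterior predictive q*(y+ = 1 | y-): a soft decision over models."""
    post = {x: w_star[x] * likelihood(y_minus, p) for x, p in p_heads.items()}
    Z = sum(post.values())
    return sum(post[x] / Z * p_heads[x] for x in p_heads)

y_minus = [1, 1, 0]   # two heads, one tail
x_ml = max(p_heads, key=lambda x: likelihood(y_minus, p_heads[x]))

print(predictive(y_minus))   # 0.58: a weighted compromise between 0.3 and 0.7
print(p_heads[x_ml])         # 0.7: the ML plug-in commits fully to model B
```

With only three flips, the mixture hedges (posterior weights 0.3/0.7 give $0.3 \cdot 0.3 + 0.7 \cdot 0.7 = 0.58$), while the plug-in acts as if model B were certain.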

4. Maximum entropy distribution

  • Maximizing entropy over a constraint set is equivalent to the I-projection of the uniform distribution onto that set:
    $$D(p \,\|\, U) = \sum_{y} p(y) \log p(y) + \log |\mathcal{Y}| = \log |\mathcal{Y}| - H(p)$$
    $$p^{*} = \underset{p \in \mathcal{L}_{\mathrm{t}}}{\arg\max}\, H(p) = \underset{p \in \mathcal{L}_{\mathrm{t}}}{\arg\min}\, D(p \,\|\, U)$$
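For a linear constraint set $\mathcal{L}_{\mathrm{t}}$ such as a fixed mean, the I-projection has exponential-family form $p^*(y) \propto e^{\lambda t(y)}$, and the multiplier $\lambda$ can be found by bisection. A sketch with an assumed 6-letter alphabet and mean constraint $\mathbb{E}[y] = 1.5$:

```python
import numpy as np

ys = np.arange(6)      # alphabet Y = {0, ..., 5}
target_mean = 1.5      # constraint set L_t: E[y] = 1.5, with t(y) = y

def gibbs(lam):
    """Exponential-family form exp(lam*y)/Z: the max-entropy shape for a mean constraint."""
    p = np.exp(lam * ys)
    return p / p.sum()

# The mean of gibbs(lam) is monotone in lam, so bisection finds the multiplier.
lo, hi = -10.0, 10.0
for _ in range(200):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if gibbs(mid) @ ys < target_mean else (lo, mid)
p_star = gibbs((lo + hi) / 2)

H = lambda p: -np.sum(p * np.log(p))
p_alt = np.array([0.33, 0.22, 0.19, 0.16, 0.08, 0.02])  # same mean 1.5, not log-linear

assert abs(p_star @ ys - target_mean) < 1e-8
assert H(p_star) > H(p_alt)    # p* maximizes entropy within L_t
```

Since the target mean 1.5 is below the uniform mean 2.5, the solver returns $\lambda < 0$, i.e. a geometrically decaying $p^*$.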

For the rest of the series, see:
Statistical Inference (1): Hypothesis Test
Statistical Inference (2): Estimation Problem
Statistical Inference (3): Exponential Family
Statistical Inference (4): Information Geometry
Statistical Inference (5): EM Algorithm
Statistical Inference (6): Modeling
Statistical Inference (7): Typical Sequence
Statistical Inference (8): Model Selection
Statistical Inference (9): Graphical Models
Statistical Inference (10): Elimination Algorithm
Statistical Inference (11): Sum-product Algorithm
