1. Modeling problem
- Formulation
  - A set of candidate distributions
$$\mathcal{P}=\left\{p_{\mathsf{y}}(\cdot ; x) \in \mathcal{P}^{\mathcal{Y}}: x \in \mathcal{X}\right\}$$
  - Approximate the whole set by a single distribution, in the minimax sense:
$$\min _{q \in \mathcal{P}^{\mathcal{Y}}} \max _{x \in \mathcal{X}} D\left(p_{\mathsf{y}}(\cdot ; x) \,\|\, q(\cdot)\right)$$
- Solution

Theorem: For any $q \in \mathcal{P}^{\mathcal{Y}}$ there exists a mixture model $q_w(\cdot) = \sum_{x \in \mathcal{X}} w(x) p_{\mathsf{y}}(\cdot ; x)$ satisfying
$$D\left(p_{\mathsf{y}}(\cdot ; x) \,\|\, q_{w}(\cdot)\right) \leq D\left(p_{\mathsf{y}}(\cdot ; x) \,\|\, q(\cdot)\right) \quad \text{for all } x \in \mathcal{X}$$
Proof: Applying the Pythagorean theorem, it readily follows that
$$\max _{x \in \mathcal{X}} \min _{q \in \mathcal{P}^{\mathcal{Y}}} D\left(p_{\mathsf{y}}(\cdot ; x) \,\|\, q(\cdot)\right)=\max _{x \in \mathcal{X}} 0=0$$
$$\min _{q \in \mathcal{P}^{\mathcal{Y}}} \max _{x \in \mathcal{X}} D\left(p_{\mathsf{y}}(\cdot ; x) \,\|\, q(\cdot)\right)=\min _{q \in \mathcal{P}^{\mathcal{Y}}} \max _{w \in \mathcal{P}^{\mathcal{X}}} \sum_{x} w(x) D\left(p_{\mathsf{y}}(\cdot ; x) \,\|\, q(\cdot)\right)$$
Theorem (Redundancy-Capacity Theorem): The following equality holds, and the optimal $w, q$ on the two sides coincide:
$$R^{+} \triangleq \min _{q \in \mathcal{P}^{\mathcal{Y}}} \max _{w \in \mathcal{P}^{\mathcal{X}}} \sum_{x} w(x) D\left(p_{\mathsf{y}}(\cdot ; x) \,\|\, q(\cdot)\right)=\max _{w \in \mathcal{P}^{\mathcal{X}}} \min _{q \in \mathcal{P}^{\mathcal{Y}}} \sum_{x} w(x) D\left(p_{\mathsf{y}}(\cdot ; x) \,\|\, q(\cdot)\right) \triangleq R^{-}$$
Proof:
- Use the Equidistance property below to show $R^{+} \le R^{-}$.
- By the general relation between minimax and maximin, $R^{+} \ge R^{-}$ always holds.
- Show that equality in the two inequalities is attained at the same $w, q$.
2. Model capacity
First evaluate the inner min in $R^{-}$:
$$\begin{aligned} & \min _{q \in \mathcal{P}^{\mathcal{Y}}} \sum_{x} w(x) D\left(p_{\mathsf{y}}(\cdot ; x) \,\|\, q(\cdot)\right) \\ =& \min _{q \in \mathcal{P}^{\mathcal{Y}}} \sum_{x, y} w(x) p_{\mathsf{y}}(y ; x) \log \frac{p_{\mathsf{y}}(y ; x)}{q(y)} \\ =& \text{ constant }-\max _{q \in \mathcal{P}^{\mathcal{Y}}} \sum_{y} q_{w}(y) \log q(y) \\ =& \text{ constant }-\max _{q \in \mathcal{P}^{\mathcal{Y}}} \mathbb{E}_{q_{w}}[\log q(\mathsf{y})] \end{aligned}$$
By Gibbs' inequality, the optimum is attained at the mixture
$$q^*(\cdot) = q_{w}(\cdot) \triangleq \sum_{x \in \mathcal{X}} w(x) p_{\mathsf{y}}(\cdot ; x)$$
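As a numerical sanity check (not part of the notes), the sketch below verifies that for fixed weights $w$ the mixture $q_w$ minimizes the weighted divergence $\sum_x w(x) D(p_{\mathsf{y}}(\cdot;x) \,\|\, q)$ over $q$. The alphabet size, weights, and distributions are arbitrary hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """Discrete KL divergence D(p || q) in nats (assumes full support)."""
    return float(np.sum(p * np.log(p / q)))

# Three conditional distributions p_y(.; x) over a 4-symbol alphabet.
P = rng.dirichlet(np.ones(4), size=3)   # rows: p_y(.; x)
w = np.array([0.5, 0.3, 0.2])           # fixed weights w(x)
q_w = w @ P                             # the mixture q_w

def avg_div(q):
    """Weighted divergence sum_x w(x) D(p_y(.; x) || q)."""
    return sum(w[i] * kl(P[i], q) for i in range(3))

# The mixture should beat every other candidate q.
for _ in range(200):
    q = rng.dirichlet(np.ones(4))
    assert avg_div(q_w) <= avg_div(q) + 1e-12
```

The inner loop samples random candidates rather than searching exhaustively; Gibbs' inequality guarantees the comparison holds for every $q$, not just the sampled ones.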
Now consider the outer max in $R^{-}$, which can be recast from a Bayesian viewpoint:
$$\begin{aligned} R^{-} &=\max _{w \in \mathcal{P}^{\mathcal{X}}} \sum_{x} w(x) D\left(p_{\mathsf{y}}(\cdot ; x) \,\|\, q_{w}(\cdot)\right) \\ &=\max _{w \in \mathcal{P}^{\mathcal{X}}} \sum_{x, y} w(x) p_{\mathsf{y}}(y ; x) \log \frac{p_{\mathsf{y}}(y ; x)}{\sum_{x^{\prime}} w\left(x^{\prime}\right) p_{\mathsf{y}}\left(y ; x^{\prime}\right)} \\ &\overset{\text{Bayesian}}{=}\max _{p_{\mathsf{x}}} \sum_{x} p_{\mathsf{x}}(x) D\left(p_{\mathsf{y} \mid \mathsf{x}}(\cdot \mid x) \,\|\, p_{\mathsf{y}}(\cdot)\right) \\ &=\max _{p_{\mathsf{x}}} \sum_{x, y} p_{\mathsf{x}}(x) p_{\mathsf{y} \mid \mathsf{x}}(y \mid x) \log \frac{p_{\mathsf{y} \mid \mathsf{x}}(y \mid x)}{p_{\mathsf{y}}(y)} \\ &=\max _{p_{\mathsf{x}}} \sum_{x, y} p_{\mathsf{x}, \mathsf{y}}(x, y) \log \frac{p_{\mathsf{x}, \mathsf{y}}(x, y)}{p_{\mathsf{x}}(x) p_{\mathsf{y}}(y)}=\max _{p_{\mathsf{x}}} I(\mathsf{x} ; \mathsf{y})=C \end{aligned}$$
Definition: For a model $p_{\mathsf{y} \mid \mathsf{x}}$, the model capacity is
$$C \triangleq \max _{p_{\mathsf{x}}} I(\mathsf{x} ; \mathsf{y})$$
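The capacity $C$ and the optimal prior can be computed with the standard Blahut–Arimoto iteration (a known algorithm, not spelled out in the notes). A minimal sketch, using a binary symmetric channel with crossover 0.1 as a hypothetical model $p_{\mathsf{y}\mid\mathsf{x}}$:

```python
import numpy as np

def blahut_arimoto(Pyx, iters=500):
    """Pyx[i, j] = p(y = j | x = i), all entries > 0.
    Returns (capacity C in nats, capacity-achieving prior w*)."""
    n = Pyx.shape[0]
    w = np.full(n, 1.0 / n)                     # start from the uniform prior
    for _ in range(iters):
        q = w @ Pyx                             # induced output distribution q_w
        # d[x] = D(p_y(.; x) || q_w) for each input x
        d = np.sum(Pyx * np.log(Pyx / q), axis=1)
        w = w * np.exp(d)                       # multiplicative update
        w /= w.sum()                            # re-normalize the prior
    q = w @ Pyx
    d = np.sum(Pyx * np.log(Pyx / q), axis=1)
    return float(w @ d), w

# Binary symmetric channel, crossover 0.1: C = log 2 - H_b(0.1) nats.
bsc = np.array([[0.9, 0.1], [0.1, 0.9]])
C, w_star = blahut_arimoto(bsc)
```

For this symmetric channel the optimal prior is uniform, so the iteration converges immediately; for asymmetric models the multiplicative update shifts mass toward inputs whose divergence from the current output distribution is largest.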
- Model capacity: $C$
- Least informative prior: $p_{\mathsf{x}}^*$
Theorem (Equidistance property): The optimal $p^*$ and $w^*$ achieving $C$ satisfy
$$D\left(p_{\mathsf{y}}(\cdot ; x) \,\|\, q^{*}(\cdot)\right) \le C \qquad \forall x \in \mathcal{X}$$
with equality for every $x$ such that $w^*(x) > 0$.

Proof:
- $I(\mathsf{x} ; \mathsf{y})$ is concave in $p_{\mathsf{x}}(a)$ for every $a$.
- Form the Lagrangian $\mathcal{L}=I(\mathsf{x} ; \mathsf{y}) - \lambda\left(\sum_{x} p_{\mathsf{x}}(x)-1\right)$, which is also concave in $p_{\mathsf{x}}(a)$.
- A maximizer of $\max_{p_{\mathsf{x}}} I(\mathsf{x} ; \mathsf{y})$ must satisfy
$$\left.\frac{\partial I(\mathsf{x} ; \mathsf{y})}{\partial p_{\mathsf{x}}(a)}\right|_{p_{\mathsf{x}}=p_{\mathsf{x}}^{*}}-\lambda=0 \quad \text{for all } a \in \mathcal{X} \text{ such that } p_{\mathsf{x}}^{*}(a)>0$$
or
$$\left.\frac{\partial I(\mathsf{x} ; \mathsf{y})}{\partial p_{\mathsf{x}}(a)}\right|_{p_{\mathsf{x}}=p_{\mathsf{x}}^{*}}-\lambda \le 0 \quad \text{for all } a \in \mathcal{X} \text{ such that } p_{\mathsf{x}}^{*}(a)=0$$
- Since $\frac{\partial I(\mathsf{x} ; \mathsf{y})}{\partial p_{\mathsf{x}}(a)} = D\left(p_{\mathsf{y} \mid \mathsf{x}}(\cdot \mid a) \,\|\, p_{\mathsf{y}}\right)-\log e$, substituting into the optimality conditions above, where equality holds exactly on the support of $p_{\mathsf{x}}^{*}$, yields the statement of the theorem.
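The equidistance property can be checked numerically. The sketch below (a hypothetical example, not from the notes) uses a binary symmetric channel with crossover 0.1, whose capacity-achieving prior is uniform: every $x$ with $w^*(x) > 0$ sits at the same divergence $C$ from the induced output distribution $q^*$.

```python
import numpy as np

P = np.array([[0.9, 0.1],               # rows: p_y(.; x) for x = 0, 1
              [0.1, 0.9]])
w_star = np.array([0.5, 0.5])           # capacity-achieving prior (by symmetry)
q_star = w_star @ P                     # induced output distribution q* = [0.5, 0.5]

# D(p_y(.; x) || q*) for each x, in nats
d = np.sum(P * np.log(P / q_star), axis=1)
C = float(w_star @ d)

# Equidistance: every supported x achieves exactly C.
assert np.allclose(d, C)
```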
3. Inference with mixture models
- Formulation: given observations $y_-$, predict $y_+$.
- Solution
  - Use the optimal prior $w^*$ obtained above to estimate the distribution of $\mathbf{y}=[y_-, y_+]$:
$$q_{\mathbf{y}}^{*}(\mathbf{y})=\sum_{x} w^{*}(x) p_{\mathbf{y}}(\mathbf{y} ; x)$$
  - From this, the posterior predictive distribution follows:
$$\begin{aligned} q_{\mathsf{y}_{+} \mid \mathsf{y}_{-}}^{*}\left(\cdot \mid y_{-}\right) & \triangleq \frac{q_{\mathsf{y}}^{*}\left(y_{+}, y_{-}\right)}{q_{\mathsf{y}_{-}}^{*}\left(y_{-}\right)}=\frac{\sum_{x} w^{*}(x) p_{\mathsf{y}}\left(y_{+}, y_{-} ; x\right)}{\sum_{a} w^{*}(a) p_{\mathsf{y}_{-}}\left(y_{-} ; a\right)} \\ &=\sum_{x} w^{*}\left(x \mid y_{-}\right) p_{\mathsf{y}_{+} \mid \mathsf{y}_{-}}\left(y_{+} \mid y_{-} ; x\right) \end{aligned}$$
  - This amounts to a soft decision: an ML plug-in approach would instead commit to a single model and use only $p_{\mathsf{y}_{+} \mid \mathsf{y}_{-}}\left(\cdot \mid y_{-} ; \hat{x}_{\mathrm{ML}}\right)$.
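The soft-versus-hard distinction can be illustrated on a toy setup (entirely hypothetical: the model family, biases, and data below are not from the notes). Here $x$ indexes a coin bias in $\{0.2, 0.8\}$, $y$ is an i.i.d. flip sequence, and we predict the next flip $y_+$ from observed flips $y_-$:

```python
import numpy as np

biases = np.array([0.2, 0.8])           # p(y = 1; x) for the two models
w_star = np.array([0.5, 0.5])           # prior w*(x) over models

y_minus = [1, 1, 0, 1]                  # observed flips y_-
k, n = sum(y_minus), len(y_minus)

# Likelihood of y_- under each model x, then posterior w*(x | y_-).
lik = biases**k * (1 - biases)**(n - k)
w_post = w_star * lik
w_post /= w_post.sum()

# Soft decision: mixture of the per-model predictive probabilities.
p_soft = float(w_post @ biases)         # q*(y_+ = 1 | y_-)

# Hard decision: ML plug-in commits to the single best model x_hat.
x_hat = int(np.argmax(lik))
p_ml = float(biases[x_hat])             # p(y_+ = 1 | y_-; x_hat)
```

The mixture prediction keeps a small amount of mass on the less likely model, so `p_soft` is pulled slightly below the plug-in value `p_ml`; the two coincide only when the posterior concentrates entirely on one model.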
4. Maximum entropy distribution
- Maximizing entropy over a linear family $\mathcal{L}_{\mathrm{t}}$ is equivalent to the I-projection of the uniform distribution onto that family:
$$D(p \,\|\, U)=\sum_{y} p(y) \log p(y)+\log |\mathcal{Y}|=\log |\mathcal{Y}|-H(p)$$
$$p^{*}=\underset{p \in \mathcal{L}_{\mathrm{t}}}{\arg \max }\, H(p)=\underset{p \in \mathcal{L}_{\mathrm{t}}}{\arg \min }\, D(p \,\|\, U)$$
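A sketch of a concrete max-entropy computation (a standard example, with values chosen here for illustration): on a die alphabet $\{1,\dots,6\}$ with the linear constraint $\mathbb{E}[y] = 4.5$, the solution is an exponential tilt of the uniform distribution, with the natural parameter found by bisection on the mean.

```python
import numpy as np

vals = np.arange(1, 7)                  # die faces 1..6

def tilted(theta):
    """Exponentially tilted uniform distribution on {1,...,6}."""
    p = np.exp(theta * vals)
    return p / p.sum()

# Bisection on theta: the mean of tilted(theta) increases with theta.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if tilted(mid) @ vals < 4.5:
        lo = mid
    else:
        hi = mid
p_star = tilted(0.5 * (lo + hi))        # max-entropy p* with E[y] = 4.5

H = -float(p_star @ np.log(p_star))     # entropy of p* in nats
U = np.full(6, 1 / 6)                   # uniform distribution
D = float(p_star @ np.log(p_star / U))  # D(p* || U) = log|Y| - H(p*)
```

Any other member of the constraint family, e.g. mass $1/2$ on each of $\{4, 5\}$, has strictly smaller entropy than `p_star`, consistent with the I-projection characterization above.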
For the other parts of this series, see:
Statistical Inference (1) Hypothesis Test
Statistical Inference (2) Estimation Problem
Statistical Inference (3) Exponential Family
Statistical Inference (4) Information Geometry
Statistical Inference (5) EM algorithm
Statistical Inference (6) Modeling
Statistical Inference (7) Typical Sequence
Statistical Inference (8) Model Selection
Statistical Inference (9) Graphical models
Statistical Inference (10) Elimination algorithm
Statistical Inference (11) Sum-product algorithm