CS229 (Andrew Ng) Machine Learning Problem Set 04 (PS04) Solutions (Problems 2, 3, 4, 5; feedback welcome): Off-Policy Evaluation and Causal Inference, PCA (Principal Component Analysis), ICA (Independent Component Analysis), MDP (Markov Decision Processes)

2. Off Policy Evaluation And Causal Inference

Note: the code outputs for this section, in particular the separated audio tracks produced by the ICA method, can be found among the resources uploaded to my account.

(a). Importance Sampling

This is straightforward to show: substitute $\hat{\pi}_{0}=\pi_{0}$ directly:
$$
\begin{aligned}
\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{0}(s, a)} \frac{\pi_{1}(s, a)}{\hat{\pi}_{0}(s, a)} R(s, a)
&=\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{0}(s, a)} \frac{\pi_{1}(s, a)}{\pi_{0}(s, a)} R(s, a) \\
&=\sum_{(s, a)} \frac{\pi_{1}(s, a)}{\pi_{0}(s, a)} R(s, a)\, p(s)\, \pi_{0}(s, a) \\
&=\sum_{(s, a)} R(s, a)\, p(s)\, \pi_{1}(s, a) \\
&=\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{1}(s, a)} R(s, a)
\end{aligned}
$$
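As a quick numerical sanity check (a minimal sketch of my own, not part of the problem set: a one-state, two-action bandit with fixed policies), reweighting samples drawn under $\pi_0$ recovers the expectation under $\pi_1$:

import numpy as np

np.random.seed(0)
n = 200000

pi_0 = np.array([0.8, 0.2])   # behavior policy pi_0(a)
pi_1 = np.array([0.3, 0.7])   # target policy pi_1(a)
R = np.array([1.0, 5.0])      # deterministic reward R(a)

a = np.random.choice(2, size=n, p=pi_0)          # actions sampled under pi_0
is_estimate = np.mean(pi_1[a] / pi_0[a] * R[a])  # importance sampling estimator

print(is_estimate)  # close to 3.8
print(pi_1 @ R)     # true E_{a ~ pi_1} R(a) = 0.3*1 + 0.7*5 = 3.8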

(b). Weighted Importance Sampling

This is similar to (a); again substitute $\hat{\pi}_{0}=\pi_{0}$ directly.

For the denominator:
$$
\begin{aligned}
\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{0}(s, a)} \frac{\pi_{1}(s, a)}{\hat{\pi}_{0}(s, a)}
&=\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{0}(s, a)} \frac{\pi_{1}(s, a)}{\pi_{0}(s, a)} \\
&=\sum_{(s, a)} p(s, a) \frac{\pi_{1}(s, a)}{\pi_{0}(s, a)} \\
&=\sum_{(s, a)} p(s) \pi_{0}(s, a) \frac{\pi_{1}(s, a)}{\pi_{0}(s, a)} \\
&=\sum_{(s, a)} p(s) \pi_{1}(s, a) \\
&=1
\end{aligned}
$$
Then:
$$
\frac{\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{0}(s, a)} \frac{\pi_{1}(s, a)}{\hat{\pi}_{0}(s, a)} R(s, a)}{\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{0}(s, a)} \frac{\pi_{1}(s, a)}{\hat{\pi}_{0}(s, a)}}=\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{1}(s, a)} R(s, a)
$$

(c). Bias of Weighted Importance Sampling

Start from the definition:
$$
\frac{\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{0}(s, a)} \frac{\pi_{1}(s, a)}{\hat{\pi}_{0}(s, a)} R(s, a)}{\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{0}(s, a)} \frac{\pi_{1}(s, a)}{\hat{\pi}_{0}(s, a)}}=\frac{\sum_{(s, a)} p(s, a) \frac{\pi_{1}(s, a)}{\hat{\pi}_{0}(s, a)} R(s, a)}{\sum_{(s, a)} p(s, a) \frac{\pi_{1}(s, a)}{\hat{\pi}_{0}(s, a)}}
$$
Consider the case with only one sample:
$$
\frac{\sum_{(s, a)} p(s, a) \frac{\pi_{1}(s, a)}{\hat{\pi}_{0}(s, a)} R(s, a)}{\sum_{(s, a)} p(s, a) \frac{\pi_{1}(s, a)}{\hat{\pi}_{0}(s, a)}}=\frac{p(s, a) \frac{\pi_{1}(s, a)}{\hat{\pi}_{0}(s, a)} R(s, a)}{p(s, a) \frac{\pi_{1}(s, a)}{\hat{\pi}_{0}(s, a)}}=R(s, a)
$$
That is, with a single sample the weighted importance sampling estimator always returns the observed reward $R(s, a)$, no matter what $\pi_1$ is. Since the sample $(s, a)$ is drawn under $\pi_0$, the estimator's expectation is taken with respect to $\pi_0$; but if $\pi_{1} \neq \pi_{0}$, the two expected rewards generally differ:

$$
\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{0}(s, a)} R(s, a) \neq \mathbb{E}_{s \sim p(s) \atop a \sim \pi_{1}(s, a)} R(s, a)
$$

so the estimator is biased.
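A small simulation (same toy bandit as above, my own construction) shows the one-sample weighted estimator averaging to $\mathbb{E}_{\pi_0} R$ rather than $\mathbb{E}_{\pi_1} R$:

import numpy as np

np.random.seed(0)
pi_0 = np.array([0.8, 0.2])
pi_1 = np.array([0.3, 0.7])
R = np.array([1.0, 5.0])

# With one sample, the weighted estimator (w * R) / w collapses to R(a).
a = np.random.choice(2, size=200000, p=pi_0)
w = pi_1[a] / pi_0[a]
one_sample_estimates = (w * R[a]) / w  # elementwise: each entry is one run of the estimator

print(one_sample_estimates.mean())  # close to 1.8 = E_{a ~ pi_0} R(a): biased
print(pi_1 @ R)                     # 3.8 = E_{a ~ pi_1} R(a)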

(d). Doubly Robust

i.

Substitute directly. First, note that the inner expectation integrates out $a$, so the quantity it produces no longer depends on $a$; the outer expectation over $a \sim \pi_{0}(s, a)$ therefore has no effect on it:
$$
\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{0}(s, a)}\left(\mathbb{E}_{a \sim \pi_{1}(s, a)} \hat{R}(s, a)\right)=\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{1}(s, a)} \hat{R}(s, a).
$$
Then, applying the same argument as in (a) to the difference $R-\hat{R}$ (using $\hat{\pi}_{0}=\pi_{0}$):
$$
\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{0}(s, a)}\left(\frac{\pi_{1}(s, a)}{\hat{\pi}_{0}(s, a)}\left(R(s, a)-\hat{R}(s, a)\right)\right)=\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{1}(s, a)}\left(R(s, a)-\hat{R}(s, a)\right)
$$
Summing the two parts:
$$
\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{0}(s, a)}\left(\left(\mathbb{E}_{a \sim \pi_{1}(s, a)} \hat{R}(s, a)\right)+\frac{\pi_{1}(s, a)}{\hat{\pi}_{0}(s, a)}\left(R(s, a)-\hat{R}(s, a)\right)\right)=\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{1}(s, a)} R(s, a)
$$

ii.

Again substitute directly, this time using the assumption $\hat{R}=R$:
$$
\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{0}(s, a)}\left(\mathbb{E}_{a \sim \pi_{1}(s, a)} \hat{R}(s, a)\right)=\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{1}(s, a)} R(s, a)
$$

$$
\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{0}(s, a)}\left(\frac{\pi_{1}(s, a)}{\hat{\pi}_{0}(s, a)}\left(R(s, a)-\hat{R}(s, a)\right)\right)=0
$$

$$
\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{0}(s, a)}\left(\left(\mathbb{E}_{a \sim \pi_{1}(s, a)} \hat{R}(s, a)\right)+\frac{\pi_{1}(s, a)}{\hat{\pi}_{0}(s, a)}\left(R(s, a)-\hat{R}(s, a)\right)\right)=\mathbb{E}_{s \sim p(s) \atop a \sim \pi_{1}(s, a)} R(s, a)
$$
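A compact sketch of the doubly robust estimator on the same toy bandit (the misspecified `pi_0_hat` and `R_hat` values are illustrative assumptions of mine): it stays correct when either the propensity estimate or the reward model is right.

import numpy as np

np.random.seed(0)
pi_0 = np.array([0.8, 0.2]); pi_1 = np.array([0.3, 0.7]); R = np.array([1.0, 5.0])
a = np.random.choice(2, size=200000, p=pi_0)

def doubly_robust(pi_0_hat, R_hat):
    # E_{a~pi_1} R_hat + (pi_1 / pi_0_hat) * (R - R_hat) on the observed actions
    return np.mean(pi_1 @ R_hat + pi_1[a] / pi_0_hat[a] * (R[a] - R_hat[a]))

print(doubly_robust(pi_0, np.zeros(2)))        # pi_0_hat correct, R_hat wrong: ~3.8
print(doubly_robust(np.array([0.5, 0.5]), R))  # R_hat correct, pi_0_hat wrong: 3.8
print(pi_1 @ R)                                # ground truth: 3.8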

(e).

i.

Here the drug is assigned randomly, so $\pi_{0}$ is easy to estimate, i.e. $\hat{\pi}_{0}$ will be accurate. The importance sampling estimator is therefore a good fit.

ii.

Here the assignment policy is complicated, so $\pi_{0}$ is hard to estimate; the regression estimator is the better choice.

The doubly robust method shown above is effectively a compromise between the two: it remains unbiased as long as either $\hat{\pi}_{0}$ or $\hat{R}$ is accurate.

3. PCA

From the geometric meaning of the orthogonal projection onto the span of $u$ (with $u^{T} u=1$), we can write:
$$
f_{u}(x)=\arg \min _{v \in \mathcal{V}}\|x-v\|^{2}=\frac{u u^{T} x}{u^{T} u}=u u^{T} x
$$
Then:
$$
\begin{aligned}
\arg \min _{u: u^{T} u=1} \sum_{i=1}^{m}\left\|x^{(i)}-f_{u}\left(x^{(i)}\right)\right\|_{2}^{2}
&=\arg \min _{u: u^{T} u=1} \sum_{i=1}^{m}\left\|x^{(i)}-u u^{T} x^{(i)}\right\|_{2}^{2} \\
&=\arg \min _{u: u^{T} u=1} \sum_{i=1}^{m}\left(x^{(i)}-u u^{T} x^{(i)}\right)^{T}\left(x^{(i)}-u u^{T} x^{(i)}\right) \\
&=\arg \min _{u: u^{T} u=1} \sum_{i=1}^{m} x^{(i)^{T}} x^{(i)}-x^{(i)^{T}} u u^{T} x^{(i)} \\
&=\arg \max _{u: u^{T} u=1} \sum_{i=1}^{m} x^{(i)^{T}} u u^{T} x^{(i)} \\
&=\arg \max _{u: u^{T} u=1} \sum_{i=1}^{m} u^{T} x^{(i)} x^{(i)^{T}} u \\
&=\arg \max _{u: u^{T} u=1} u^{T}\left(\sum_{i=1}^{m} x^{(i)} x^{(i)^{T}}\right) u
\end{aligned}
$$
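The last line is a constrained Rayleigh quotient, so the maximizer is the principal eigenvector of $\sum_{i=1}^{m} x^{(i)} x^{(i)^{T}}$, i.e., the first principal component. A quick numerical check (a sketch with synthetic data of my own making):

import numpy as np

np.random.seed(0)
# Zero-mean synthetic data, stretched along the first axis.
X = np.random.randn(1000, 2) @ np.diag([3.0, 0.5])

# u maximizing u^T (X^T X) u subject to u^T u = 1 is the top eigenvector of X^T X.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
u = eigvecs[:, -1]  # eigenvector for the largest eigenvalue

# Reconstruction error of projecting onto a unit vector w.
def recon_error(w):
    return np.sum((X - (X @ w)[:, None] * w) ** 2)

v = np.random.randn(2)
v /= np.linalg.norm(v)
print(recon_error(u) <= recon_error(v))  # True: the eigenvector minimizes the error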

4. Independent Component Analysis

(a). Gaussian source

Differentiate the log-likelihood and set the gradient to zero:
$$
\begin{aligned}
\nabla_{W} \ell(W)
&=\nabla_{W} \sum_{i=1}^{n}\left(\log |W|+\sum_{j=1}^{d} \log \frac{1}{\sqrt{2 \pi}} \exp \left(-\frac{1}{2}\left(w_{j}^{T} x^{(i)}\right)^{2}\right)\right) \\
&=n\left(W^{-1}\right)^{T}-\sum_{i=1}^{n} \nabla_{W} \sum_{j=1}^{d} \frac{1}{2}\left(w_{j}^{T} x^{(i)}\right)^{2} \\
&=n\left(W^{-1}\right)^{T}-\sum_{i=1}^{n} W x^{(i)} x^{(i)^{T}} \\
&=n\left(W^{-1}\right)^{T}-W X^{T} X \\
&=0
\end{aligned}
$$
Hence:
$$
W^{T} W=n\left(X^{T} X\right)^{-1}
$$
Now consider why this $W$ carries a rotational ambiguity. Let $R$ be a rotation matrix (an orthogonal matrix with determinant 1) and set $W^{\prime}=R W$:

$$
W^{\prime T} W^{\prime}=(R W)^{T} R W=W^{T} R^{T} R W=W^{T} W
$$

so any such $W^{\prime}$ satisfies the same condition as $W$ and is an equally valid solution: a Gaussian source cannot pin down the rotation.
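A short numerical confirmation (a sketch; the specific rotation is my own choice) that rotating $W$ leaves $W^{T} W$, and hence the Gaussian likelihood, unchanged:

import numpy as np

np.random.seed(0)
W = np.random.randn(3, 3)
theta = 0.7
# A rotation in the first two coordinates: orthogonal with determinant 1.
Rm = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
W_prime = Rm @ W

print(np.allclose(W_prime.T @ W_prime, W.T @ W))  # True: same W^T W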

(b). Laplace source

Since (c) uses stochastic gradient ascent, we compute the gradient for a single sample:
$$
\begin{aligned}
\nabla_{W} \ell(W)
&=\nabla_{W}\left(\log |W|+\sum_{j=1}^{d} \log \frac{1}{2} \exp \left(-\left|w_{j}^{T} x^{(i)}\right|\right)\right) \\
&=\left(W^{-1}\right)^{T}-\nabla_{W} \sum_{j=1}^{d}\left|w_{j}^{T} x^{(i)}\right| \\
&=\left(W^{T}\right)^{-1}-\operatorname{sign}\left(W x^{(i)}\right) x^{(i)^{T}}
\end{aligned}
$$
The update rule is then:
$$
W:=W+\alpha\left(\left(W^{T}\right)^{-1}-\operatorname{sign}\left(W x^{(i)}\right) x^{(i)^{T}}\right)
$$

(c). Cocktail Party Problem

import numpy as np
import scipy.io.wavfile
import os


def update_W(W, x, learning_rate):
    """
    Perform a gradient ascent update on W using data element x and the provided learning rate.

    This function should return the updated W.

    Use the Laplace distribution in this problem.

    Args:
        W: The W matrix for ICA
        x: A single data element
        learning_rate: The learning rate to use

    Returns:
        The updated W
    """
    
    # *** START CODE HERE ***
    x = x.reshape(-1, 1)
    # Per-sample gradient from part (b): (W^T)^{-1} - sign(W x) x^T
    grad = np.linalg.inv(W.T) - np.sign(W @ x) @ x.T
    updated_W = W + learning_rate * grad
    # *** END CODE HERE ***

    return updated_W


def unmix(X, W):
    """
    Unmix an X matrix according to W using ICA.

    Args:
        X: The data matrix
        W: The W for ICA

    Returns:
        A numpy array S containing the split data
    """

    S = np.zeros(X.shape)


    # *** START CODE HERE ***
    # Rows of X are x^(i)T and s^(i) = W x^(i), so the unmixed rows are X W^T.
    S = X @ W.T
    # *** END CODE HERE ***

    return S


Fs = 11025

def normalize(dat):
    return 0.99 * dat / np.max(np.abs(dat))

def load_data():
    mix = np.loadtxt('./data/mix.dat')
    return mix

def save_W(W):
    np.savetxt('output/W.txt',W)

def save_sound(audio, name):
    scipy.io.wavfile.write('output/{}.wav'.format(name), Fs, audio)

def unmixer(X):
    M, N = X.shape
    W = np.eye(N)

    anneal = [0.1 , 0.1, 0.1, 0.05, 0.05, 0.05, 0.02, 0.02, 0.01, 0.01, 0.005, 0.005, 0.002, 0.002, 0.001, 0.001]
    print('Separating tracks ...')
    for lr in anneal:
        print(lr)
        rand = np.random.permutation(range(M))
        for i in rand:
            x = X[i]
            W = update_W(W, x, lr)

    return W

def main():
    # Seed the randomness of the simulation so this outputs the same thing each time
    np.random.seed(0)
    X = normalize(load_data())

    print('data_set shape:', X.shape)
    # Save the original (mixed) recording from each microphone
    for i in range(X.shape[1]):
        save_sound(X[:, i], 'mixed_{}'.format(i))
    # Training: learn the unmixing matrix W
    W = unmixer(X)
    print('W:\n', W)
    save_W(W)
    # Prediction: unmix the recordings with the learned W
    S = normalize(unmix(X, W))   # ICA cannot recover each source's scale, so normalize to fit the wav format
    assert S.shape[1] == 5
    for i in range(S.shape[1]):
        if os.path.exists('output/split_{}.wav'.format(i)):
            os.unlink('output/split_{}.wav'.format(i))
        save_sound(S[:, i], 'split_{}'.format(i))
main()
data_set shape: (53442, 5)
Separating tracks ...
0.1
0.1
0.1
0.05
0.05
0.05
0.02
0.02
0.01
0.01
0.005
0.005
0.002
0.002
0.001
0.001
W:
 [[ 52.83512054  16.79594361  19.94115104 -10.1985153  -20.89752402]
 [ -9.92869812  -0.97683438  -4.67732108   8.04342263   1.78792785]
 [  8.31150961  -7.47676167  19.3154481   15.17437116 -14.3261898 ]
 [-14.66745004 -26.64459003   2.44058135  21.38205443  -8.42074783]
 [ -0.26912908  18.37411935   9.31234188   9.1025964   30.59424603]]

Listening to the output files above, the tracks are indeed clearly separated.

5. Markov Decision Processes

(a).

证明:
$$
\begin{aligned}
\left\|B\left(V_{1}\right)-B\left(V_{2}\right)\right\|_{\infty}
&=\gamma \max _{s \in S}\left|\max _{a \in A} \sum_{s^{\prime} \in S} P_{s a}\left(s^{\prime}\right) V_{1}\left(s^{\prime}\right)-\max _{a \in A} \sum_{s^{\prime} \in S} P_{s a}\left(s^{\prime}\right) V_{2}\left(s^{\prime}\right)\right| \\
&\leq \gamma \max _{s \in S} \max _{a \in A}\left|\sum_{s^{\prime} \in S} P_{s a}\left(s^{\prime}\right)\left[V_{1}\left(s^{\prime}\right)-V_{2}\left(s^{\prime}\right)\right]\right| \\
&\leq \gamma\left\|V_{1}-V_{2}\right\|_{\infty}
\end{aligned}
$$
The first inequality uses $\left|\max _{a} f(a)-\max _{a} g(a)\right| \leq \max _{a}|f(a)-g(a)|$. The last inequality holds because, for $\alpha, x \in \mathbb{R}^{n}$ with $\sum_{i} \alpha_{i}=1$ and $\alpha_{i} \geq 0$, we have $\left|\sum_{i} \alpha_{i} x_{i}\right| \leq \max _{i}\left|x_{i}\right|$.
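A numerical spot check of the contraction property on a random MDP (my own construction; the function `B` below implements the Bellman backup used in this problem):

import numpy as np

np.random.seed(0)
nS, nA, gamma = 10, 3, 0.9
P = np.random.rand(nS, nA, nS)
P /= P.sum(axis=2, keepdims=True)  # P[s, a, s'] is a valid transition distribution
R = np.random.rand(nS)

def B(V):
    # Bellman backup: B(V)(s) = R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')
    return R + gamma * (P @ V).max(axis=1)

V1, V2 = np.random.randn(nS), np.random.randn(nS)
lhs = np.max(np.abs(B(V1) - B(V2)))
rhs = gamma * np.max(np.abs(V1 - V2))
print(lhs <= rhs)  # True: B is a gamma-contraction in the sup norm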

(b).

We argue by contradiction. Suppose $B$ has a fixed point $V_{1}$ and a second, different fixed point $V_{2}$. Then:
$$
\left\|V_{1}-V_{2}\right\|_{\infty}=\left\|B\left(V_{1}\right)-B\left(V_{2}\right)\right\|_{\infty} \leq \gamma\left\|V_{1}-V_{2}\right\|_{\infty}
$$
Here the equality follows from the fixed-point property and the inequality from the result in (a). Since $\gamma<1$, this can only hold if:
$$
\left\|V_{1}-V_{2}\right\|_{\infty}=0
$$
Hence $V_{1}=V_{2}$, contradicting the assumption of a second, distinct fixed point. Therefore $B$ has at most one fixed point.
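Together with the contraction property, this means value iteration converges to the unique fixed point from any starting point. A self-contained sketch (same random-MDP construction as above, my own toy example):

import numpy as np

np.random.seed(0)
nS, nA, gamma = 10, 3, 0.9
P = np.random.rand(nS, nA, nS)
P /= P.sum(axis=2, keepdims=True)
R = np.random.rand(nS)
B = lambda V: R + gamma * (P @ V).max(axis=1)

# Two very different starting points converge to the same fixed point.
Va, Vb = np.zeros(nS), 100 * np.random.randn(nS)
for _ in range(1000):
    Va, Vb = B(Va), B(Vb)

print(np.max(np.abs(Va - Vb)))     # ~0: both runs reach the same limit
print(np.max(np.abs(B(Va) - Va)))  # ~0: the limit is a fixed point of B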
