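Statistical Learning Methods (2nd edition) by Li Hang, Chapter 11, Conditional Random Fields (CRF): mind-map notes and exercise solutions (learning and probability-computation algorithms written in Python 3)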

ML--小小白

Article link: https://blog.csdn.net/qq_26928055/article/details/124357214

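Mind map

[Image: mind map of the chapter]

Exercise solutions

11.1

Write the factorization of the probabilistic graphical model described by the undirected graph in Figure 11.3.
[Image: Figure 11.3]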

$\begin{array}{l}P\left(Y_{1}, Y_{2}, Y_{3}, Y_{4}\right)=\frac{1}{Z} \psi_{c_{1}}\left(Y_{1}, Y_{2}, Y_{3}\right) \psi_{c_{2}}\left(Y_{2}, Y_{3}, Y_{4}\right) \\ Z=\sum_{Y} \psi_{c_{1}}\left(Y_{1}, Y_{2}, Y_{3}\right) \psi_{c_{2}}\left(Y_{2}, Y_{3}, Y_{4}\right)\end{array}$
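As a quick numerical illustration of this factorization, the sketch below assumes binary variables and two hypothetical potential tables `psi_c1` and `psi_c2` (random positive numbers, not taken from the book). It builds the joint distribution by brute-force enumeration and checks that dividing by $Z$ makes it sum to one.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical positive potentials on the two maximal cliques (assumed values).
psi_c1 = rng.uniform(0.5, 2.0, size=(2, 2, 2))  # indexed as psi_c1[y1, y2, y3]
psi_c2 = rng.uniform(0.5, 2.0, size=(2, 2, 2))  # indexed as psi_c2[y2, y3, y4]

def unnormalized(y1, y2, y3, y4):
    """Product of the two clique potentials for one joint assignment."""
    return psi_c1[y1, y2, y3] * psi_c2[y2, y3, y4]

# Enumerate all 2^4 assignments, sum them to get Z, then normalize.
states = list(itertools.product((0, 1), repeat=4))
Z = sum(unnormalized(*y) for y in states)
joint = {y: unnormalized(*y) / Z for y in states}
total = sum(joint.values())
print(f"Z = {Z:.4f}, total probability = {total:.4f}")
```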

11.2

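Prove that $Z(x)=\alpha_{n}^{\mathrm{T}}(x) \cdot 1=1^{\mathrm{T}} \cdot \beta_{1}(x)$, where $1$ is the m-dimensional column vector whose elements are all 1.

Recall formula (11.25) in the book: the normalization factor is the (start, stop) element of the product of the n+1 random matrices. It therefore suffices to show that each side of the identity equals that element.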
$\begin{array}{l}\alpha_{n}^{\mathrm{T}}(x) \cdot 1 \\ =\alpha_{n-1}^{\mathrm{T}}(x) M_{n}(x) \cdot 1 \\ =\alpha_{n-2}^{\mathrm{T}}(x) M_{n-1}(x) M_{n}(x) \cdot 1 \\ =\ldots \\ =\alpha_{0}^{T}(x) M_{1}(x) M_{2}(x) \ldots M_{n}(x) \cdot 1 \\ =\alpha_{0}^{T}(x) M_{1}(x) M_{2}(x) \ldots M_{n}(x) \cdot (M_{n+1})_{\cdot 1} \\ = Z(x)\end{array}$
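The second-to-last equality holds because, in the random matrix at position n+1, only the stop column (written here as the first column, $(M_{n+1})_{\cdot 1}$) is all ones and every other element is 0, so that column equals the vector $1$. The leading $\alpha_{0}$ selects the start row, and taking only the stop column of the last matrix selects the stop column, so the result is the (start, stop) element of the matrix product.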

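$\begin{array}{l}1^{\mathrm{T}} \cdot \beta_{1}(x) \\ =1^{T} \cdot M_{2}(x) \beta_{2}(x) \\ =\ldots \\ =1^{T} \cdot M_{2}(x) M_{3}(x) \ldots M_{n+1}(x) \cdot \beta_{n+1}(x) \\ =1^{T} \cdot M_{1}(x) M_{2}(x) M_{3}(x) \ldots M_{n+1}(x) \cdot \beta_{n+1}(x) \\ =Z(x)\end{array}$
The second-to-last equality relies on the assumption that $M_{1}$, the matrix for the transition from start to the state at the first position, has all ones in the start row and zeros everywhere else, so that $1^{T} \cdot M_{1}(x) = 1^{T}$. Because only the start row of $M_{1}$ (and therefore of the whole product) is non-zero, summing over rows with $1^{T}$ picks out the start row, and $\beta_{n+1}(x)$, the indicator vector of stop, selects the stop column; the result is again the (start, stop) element. For a general $M_{1}(x)$ whose start row is not all ones, this step does not hold as written.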
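The two identities can be checked numerically. The following is a minimal sketch, not part of the original solution: the function name `partition_three_ways` and the matrices are assumptions chosen for illustration, with an $M_1$ whose start row is all ones (the assumption used in the proof above) and 0-based state indices.

```python
from functools import reduce
import numpy as np

def partition_three_ways(mats, start, stop):
    """Return Z as the (start, stop) entry, as alpha_n^T 1, and as 1^T beta_1."""
    m = mats[0].shape[0]
    alpha = np.zeros(m)
    alpha[start] = 1.0                   # alpha_0 is the indicator of start
    for M in mats[:-1]:                  # alpha_i^T = alpha_{i-1}^T M_i, i = 1..n
        alpha = alpha @ M
    beta = np.zeros(m)
    beta[stop] = 1.0                     # beta_{n+1} is the indicator of stop
    for M in reversed(mats[1:]):         # beta_i = M_{i+1} beta_{i+1}, i = n..1
        beta = M @ beta
    z = reduce(np.matmul, mats)[start, stop]
    return z, alpha.sum(), beta.sum()

# Assumed matrices with n = 3 and m = 2; start = stop = 1 is the second state.
mats = [np.array([[0.0, 0.0], [1.0, 1.0]]),   # M_1: start row all ones
        np.array([[0.3, 0.7], [0.7, 0.3]]),
        np.array([[0.5, 0.5], [0.6, 0.4]]),
        np.array([[0.0, 1.0], [0.0, 1.0]])]   # M_4: stop column all ones
z, z_alpha, z_beta = partition_three_ways(mats, start=1, stop=1)
print(z, z_alpha, z_beta)
```

If the start row of `mats[0]` is changed to something other than all ones, `z` and `z_alpha` still agree while `z_beta` no longer matches, which is the caveat about $M_1$ in the backward identity.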

11.3

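Write out the gradient descent method for learning a conditional random field model.

Gradient descent needs the gradient, so first write out the log-likelihood function:
$\log \prod_{x, y} P_{w}(y \mid x)^{N\tilde{P}(x, y)}=N\sum_{x, y} \tilde{P}(x, y) \log P_{w}(y \mid x), \qquad L(w)=\log \prod_{x, y} P_{w}(y \mid x)^{\tilde{P}(x, y)}=\sum_{x, y} \tilde{P}(x, y) \log P_{w}(y \mid x)$
Here N is the total number of samples in the dataset, so the exponent $N\tilde{P}(x, y)$ is the number of times the pair (x, y) occurs in the dataset. The constant factor N does not change where the extremum lies, so it is dropped in $L(w)$.

Take the objective function to be $f(w)=-L(w)$; $w$ is then updated by minimizing $f(w)$: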
$\begin{aligned} f(w) &=-\sum_{x, y} \tilde{P}(x, y) \log P_{w}(y \mid x) \\ &=\sum_{x, y} \tilde{P}(x, y) \log Z_{w}(x)-\sum_{x, y} \tilde{P}(x, y) \sum_{k=1}^{K} w_{k} f_{k}(x, y) \\ &=\sum_{x} \tilde{P}(x) \log Z_{w}(x)-\sum_{x, y} \tilde{P}(x, y) \sum_{k=1}^{K} w_{k} f_{k}(x, y) \\ &=\sum_{x} \tilde{P}(x) \log \sum_{y} \exp \sum_{k=1}^{K} w_{k} f_{k}(x, y)-\sum_{x, y} \tilde{P}(x, y) \sum_{k=1}^{K} w_{k} f_{k}(x, y) \end{aligned}$
Differentiating with respect to $w$ gives:
$g(w)=\frac{\partial f(w)}{\partial w}=\sum_{x, y} \tilde{P}(x) P_{w}(y \mid x) f(x, y)-\sum_{x, y} \tilde{P}(x, y) f(x, y)$
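The only non-trivial step in this derivative is the $\log Z_{w}(x)$ term. Written out for a single component $w_{k}$, using only the definition $Z_{w}(x)=\sum_{y} \exp \sum_{j} w_{j} f_{j}(x, y)$ from the derivation above:

$\begin{aligned} \frac{\partial}{\partial w_{k}} \log Z_{w}(x) &=\frac{1}{Z_{w}(x)} \sum_{y} \exp \left(\sum_{j=1}^{K} w_{j} f_{j}(x, y)\right) f_{k}(x, y) \\ &=\sum_{y} P_{w}(y \mid x) f_{k}(x, y) \end{aligned}$

Multiplying by $\tilde{P}(x)$ and summing over $x$ gives the first term of $g(w)$, the model expectation of the features; the second term is their empirical expectation.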
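Gradient descent algorithm:

(1) Take the initial vector $w^{(0)} = 0$ and set the iteration counter $i=0$;

(2) Compute the gradient $g_{i}=g(w^{(i)})$; if $\Vert g_{i} \Vert < \epsilon$, stop and set $w^{*}=w^{(i)}$;

(3) Update $w^{(i+1)}=w^{(i)} - \lambda g_{i}$, where $\lambda$ is the learning rate;

(4) If $\Vert w^{(i+1)} - w^{(i)}\Vert < \epsilon$ (the required precision), stop and set $w^{*}=w^{(i+1)}$; otherwise set $i = i + 1$ and return to (2).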
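The steps above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the book's algorithm verbatim: the observation $x$ is held fixed so it drops out, the two features in `feature_vector` are hypothetical, all label sequences of length 3 are enumerated by brute force, and only the step-size test of step (4) is used as the stopping rule.

```python
import itertools
import numpy as np

def feature_vector(y):
    """Hypothetical features: number of labels equal to 1, number of equal neighbours."""
    y = np.asarray(y)
    return np.array([np.sum(y == 1), np.sum(y[1:] == y[:-1])], dtype=float)

def model_probs(w, F):
    """P_w(y) proportional to exp(w . f(y)), normalized over all enumerated sequences."""
    scores = F @ w
    p = np.exp(scores - scores.max())    # subtract the max for numerical stability
    return p / p.sum()

def gradient_descent(emp_probs, F, lr=0.5, eps=1e-10, max_iter=20000):
    emp_expect = emp_probs @ F           # empirical expectation of the features
    w = np.zeros(F.shape[1])             # step (1): start from w = 0
    for _ in range(max_iter):
        g = model_probs(w, F) @ F - emp_expect   # step (2): model minus empirical expectation
        w_new = w - lr * g               # step (3): gradient step with learning rate lr
        if np.linalg.norm(w_new - w) < eps:      # step (4): stop when the update is tiny
            return w_new
        w = w_new
    return w

seqs = list(itertools.product((0, 1), repeat=3))
F = np.array([feature_vector(s) for s in seqs])
w_true = np.array([0.5, -0.3])           # assumed "ground truth" weights
emp_probs = model_probs(w_true, F)       # the exact distribution stands in for data
w_hat = gradient_descent(emp_probs, F)
print(np.round(w_hat, 4))
```

Because the empirical distribution here is generated exactly from `w_true` and the two features are not redundant, the minimizer is unique and `w_hat` should land on `w_true` up to the tolerance.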

11.4

Referring to the state path diagram in Figure 11.6, suppose the random matrices $M_{1}(x), M_{2}(x), M_{3}(x), M_{4}(x)$ are:
$\begin{array}{ll}M_{1}(x)=\left[\begin{array}{cc}0 & 0 \\ 0.5 & 0.5\end{array}\right], & M_{2}(x)=\left[\begin{array}{ll}0.3 & 0.7 \\ 0.7 & 0.3\end{array}\right] \\ M_{3}(x)=\left[\begin{array}{ll}0.5 & 0.5 \\ 0.6 & 0.4\end{array}\right], & M_{4}(x)=\left[\begin{array}{ll}0 & 1 \\ 0 & 1\end{array}\right]\end{array}$
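Find the probability of the state sequence y for every path that starts at start=2 and ends at stop=2, and find the state sequence with the largest probability.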

import numpy as np

random_matrix = np.array([[[0., 0.],
                           [0.5, 0.5]],
                          [[0.3, 0.7],
                           [0.7, 0.3]],
                          [[0.5, 0.5],
                           [0.6, 0.4]],
                          [[0., 1.],
                           [0., 1.]]])

# Probability of a given path (y1, y2, y3); states are 1-based
def trajectory_prob(y1, y2, y3, random_matrix, start=2, stop=2):
    # Multiply all random matrices; the (start, stop) entry is the normalization factor
    random_matrix_prod = np.eye(random_matrix[0].shape[0])
    for i in range(random_matrix.shape[0]):
        random_matrix_prod = random_matrix_prod @ random_matrix[i]
    norm_factor = random_matrix_prod[start-1, stop-1]
    # Unnormalized probability of the given path
    no_norm_prob = (random_matrix[0][start-1, y1-1]
                    * random_matrix[1][y1-1, y2-1]
                    * random_matrix[2][y2-1, y3-1]
                    * random_matrix[3][y3-1, stop-1])
    return no_norm_prob / norm_factor

# Compute and print the probability of every path
for y1 in (1, 2):
    for y2 in (1, 2):
        for y3 in (1, 2):
            prob = trajectory_prob(y1, y2, y3, random_matrix)
            print(f'the probability of trajectory "y1={y1}, y2={y2}, y3={y3}" is: {prob:.3f}')

the probability of trajectory "y1=1, y2=1, y3=1" is: 0.075
the probability of trajectory "y1=1, y2=1, y3=2" is: 0.075
the probability of trajectory "y1=1, y2=2, y3=1" is: 0.210
the probability of trajectory "y1=1, y2=2, y3=2" is: 0.140
the probability of trajectory "y1=2, y2=1, y3=1" is: 0.175
the probability of trajectory "y1=2, y2=1, y3=2" is: 0.175
the probability of trajectory "y1=2, y2=2, y3=1" is: 0.090
the probability of trajectory "y1=2, y2=2, y3=2" is: 0.060

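The most probable path is "y1=1, y2=2, y3=1", with probability 0.210.

Next, use the Viterbi algorithm to compute the optimal path and see whether it agrees (pretending we do not already know the answer). Note that the Viterbi algorithm as given in the book computes the unnormalized score of the optimal path, so it does not return 0.21 directly. The difference is not only the normalization factor (computed above and equal to 1.0) but also an exponential: in theory, exponentiating the maximum found by Viterbi and dividing by the normalization factor should give the same result, which is verified below. In addition, the inner product of the weight vector and the feature vector in formulas (11.53) to (11.57) of the book is exactly the exponent of the corresponding random-matrix element, that is, its logarithm, so the computation can work directly with the logs of the matrix elements.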

M = 2  # number of values y can take
N = 3  # number of positions (excluding start and stop)
start, stop = 2, 2  # 1-based, as in the exercise
random_matrix = np.array([[[0., 0.],
                           [0.5, 0.5]],
                          [[0.3, 0.7],
                           [0.7, 0.3]],
                          [[0.5, 0.5],
                           [0.6, 0.4]],
                          [[0., 1.],
                           [0., 1.]]])
# Work in the log domain; the small offset avoids log(0) for the zero entries
log_matrix = np.log(random_matrix + 1e-9)

# delta[i, j]: largest unnormalized log-score of a partial path ending in state j at position i
# psi[i, j]: state at the previous position on that best partial path (the start state for i = 0)
delta = np.zeros((N, M))
psi = np.zeros((N, M), dtype=int)

# Viterbi recursion
delta[0] = log_matrix[0][start-1, :]
psi[0] = start - 1
for i in range(1, N):
    # scores[l, j] = delta[i-1, l] + log M_{i+1}[l, j]
    scores = delta[i-1][:, None] + log_matrix[i]
    delta[i] = scores.max(axis=0)
    psi[i] = scores.argmax(axis=0)

# Add the transition into stop, then take the best final state and its log-score
final_scores = delta[N-1] + log_matrix[N][:, stop-1]
optimal_loop_probability = final_scores.max()
optimal_loop_final_state = int(final_scores.argmax())

# Backtrack the optimal path
optimal_loop_states = [optimal_loop_final_state]
for i in range(N-1, 0, -1):
    optimal_loop_states.append(int(psi[i, optimal_loop_states[-1]]))
optimal_loop_states.reverse()
print(f'optimal state sequence: {optimal_loop_states}')

optimal state sequence: [0, 1, 0]

As expected, the optimal sequence is y = 1, 2, 1, in agreement with the enumeration above (the output lists 0-based state indices, hence the offset of 1). Next, check that the Viterbi score differs from the probability only by an exponential and the normalization factor, which is 1.0 here. The transition into stop contributes nothing, because the stop column of $M_{4}$ is all ones and $\log 1 = 0$.

delta

array([[-0.69314718, -0.69314718],
       [-1.04982212, -1.04982212],
       [-1.56064774, -1.7429693 ]])

print(f'{np.exp(optimal_loop_probability) / 1.0:.6f}')

0.210000

This matches the value 0.210 found by enumeration. Without the rounding, a residual of order 1e-9 would remain; it comes from the small constant added to random_matrix before taking the log, to avoid taking the log of the zero matrix elements.

In hindsight this agreement is not surprising: the enumeration maximizes a product of exponentials with the normalization factor included, while Viterbi maximizes the sum of the exponents without the normalization factor. The two are essentially the same computation.
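The same comparison can be repeated on matrices other than those of the exercise. The sketch below is an illustration under assumptions: `viterbi` and `brute_force` are hypothetical helper names, the matrices are random and strictly positive (so no offset is needed before the log), and states are 0-based.

```python
import itertools
import numpy as np

def viterbi(mats, start, stop):
    """Best path through strictly positive matrices M_1..M_{n+1} (0-based states)."""
    logs = [np.log(M) for M in mats]
    delta = logs[0][start]                  # log-scores at position 1
    back = []
    for L in logs[1:-1]:
        scores = delta[:, None] + L         # scores[l, j]: come from l, land on j
        back.append(scores.argmax(axis=0))  # best predecessor for each state j
        delta = scores.max(axis=0)
    final = delta + logs[-1][:, stop]       # add the transition into stop
    path = [int(final.argmax())]
    for bp in reversed(back):               # follow the back-pointers
        path.append(int(bp[path[-1]]))
    return path[::-1], float(final.max())

def brute_force(mats, start, stop):
    """Enumerate every path and return the one with the largest product."""
    m, n = mats[0].shape[0], len(mats) - 1
    def score(y):
        states = (start,) + y + (stop,)
        return np.prod([mats[i][states[i], states[i + 1]] for i in range(n + 1)])
    best = max(itertools.product(range(m), repeat=n), key=score)
    return list(best), score(best)

rng = np.random.default_rng(1)
mats = [rng.uniform(0.1, 1.0, size=(3, 3)) for _ in range(5)]  # assumed random matrices
v_path, v_log = viterbi(mats, start=0, stop=2)
b_path, b_score = brute_force(mats, start=0, stop=2)
print(v_path, b_path)
```

Exponentiating `v_log` should reproduce `b_score`, the unnormalized product along the best path.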

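Implementation of the learning algorithm from this chapter

Taking `Example 11.1` as the example

**Idea:** First compute the probability of every path from the given conditions, then construct a large dataset that follows this distribution, and finally pretend the weights are unknown and see whether the learning algorithm can recover them from the synthetic data.

One more point to note: all the worked examples, including this one, ignore the observation sequence X (probably for simplicity) and consider a model like the one in the figure below:
[Image: model diagram]

First compute the probability of each path: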

import numpy as np

M = 2  # number of values y can take
N = 3  # number of positions
K = 9  # number of feature functions

# Weight vector and all feature functions from the given conditions;
# feature_func[k, i, prev, cur] = 1 when feature k fires at position i for the transition prev -> cur
weights = np.array([1., 0.6, 1., 1., 0.2, 1., 0.5, 0.8, 0.5]).reshape(-1, 1)
feature_func = np.zeros((K, N, M, M))
feature_func[0, 1:3, 0, 1] = 1
feature_func[1, 1, 0, 0] = 1
feature_func[2, 2, 1, 0] = 1
feature_func[3, 1, 1, 0] = 1
feature_func[4, 2, 1, 1] = 1
feature_func[5, 0, :, 0] = 1
feature_func[6, 0:2, :, 1] = 1
feature_func[7, 1:3, :, 0] = 1
feature_func[8, 2, :, 1] = 1

# Unnormalized score of a given path (the linear score w . f, used here as the unnormalized probability)
def trajectory_prob_no_norm(y1, y2, y3, weights, feature_func, start=1, stop=1):
    y1, y2, y3 = y1-1, y2-1, y3-1  # convert to 0-based indices
    y = [y1, y2, y3]
    feature_vec = np.zeros((len(weights), 1))
    for k in range(len(weights)):
        for i in range(len(y)):
            if i == 0:
                # the first position is reached from start (also converted to 0-based)
                feature_vec[k] += feature_func[k, i, start-1, y[i]]
            else:
                feature_vec[k] += feature_func[k, i, y[i-1], y[i]]
    return (weights.T @ feature_vec).flatten()[0]

# Normalization factor: sum of the scores over all paths
norm_factor = 0.0
for y1 in (1, 2):
    for y2 in (1, 2):
        for y3 in (1, 2):
            norm_factor += trajectory_prob_no_norm(y1, y2, y3, weights, feature_func, start=1, stop=1)
print(f'the norm factor is: {norm_factor:.2f}')

the norm factor is: 26.00

# Compute and print the probability of every path
for y1 in (1, 2):
    for y2 in (1, 2):
        for y3 in (1, 2):
            prob = trajectory_prob_no_norm(y1, y2, y3, weights, feature_func, start=1, stop=1) / norm_factor
            print(f'the probability of trajectory "y1={y1}, y2={y2}, y3={y3}" is: {prob:.3f}')

the probability of trajectory "y1=1, y2=1, y3=1" is: 0.123
the probability of trajectory "y1=1, y2=1, y3=2" is: 0.150
the probability of trajectory "y1=1, y2=2, y3=1" is: 0.165
the probability of trajectory "y1=1, y2=2, y3=2" is: 0.123
the probability of trajectory "y1=2, y2=1, y3=1" is: 0.119
the probability of trajectory "y1=2, y2=1, y3=2" is: 0.146
the probability of trajectory "y1=2, y2=2, y3=1" is: 0.108
the probability of trajectory "y1=2, y2=2, y3=2" is: 0.065
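The listing above normalizes the linear scores $\sum_k w_k f_k$ directly. In the book's definition of the CRF the unnormalized probability is the exponential of that score (for example $\exp(3.2)$ for $y=(1,2,2)$ in Example 11.1). As a hedged sketch of that variant, the block below rebuilds the same weights and `feature_func` layout and applies the exponential before normalizing; the names `score`, `paths` and `probs` are introduced here only for illustration.

```python
import itertools
import numpy as np

K, N, M = 9, 3, 2
weights = np.array([1., 0.6, 1., 1., 0.2, 1., 0.5, 0.8, 0.5])
# feature_func[k, i, prev, cur], filled in exactly as in the listing above
feature_func = np.zeros((K, N, M, M))
feature_func[0, 1:3, 0, 1] = 1
feature_func[1, 1, 0, 0] = 1
feature_func[2, 2, 1, 0] = 1
feature_func[3, 1, 1, 0] = 1
feature_func[4, 2, 1, 1] = 1
feature_func[5, 0, :, 0] = 1
feature_func[6, 0:2, :, 1] = 1
feature_func[7, 1:3, :, 0] = 1
feature_func[8, 2, :, 1] = 1

def score(y):
    """Linear score sum_k w_k f_k(y) for a 1-based label sequence y."""
    total, prev = 0.0, 0   # position-1 features ignore prev, so any valid index works
    for i, label in enumerate(y):
        total += weights @ feature_func[:, i, prev, label - 1]
        prev = label - 1
    return total

paths = list(itertools.product((1, 2), repeat=N))
scores = np.array([score(y) for y in paths])
probs = np.exp(scores) / np.exp(scores).sum()   # exponentiate, then normalize
for y, s, p in zip(paths, scores, probs):
    print(f"y={y}: score={s:.1f}, P={p:.3f}")
```

The ranking of the paths is the same under both normalizations, since the exponential is monotone, but the probabilities themselves differ, which matters when these probabilities are later used to generate training data.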

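Next, construct a synthetic dataset according to this distribution (each sequence is repeated roughly 100 times its probability):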

y111 = np.array([1, 1, 1]).reshape(1, -1).repeat(12, axis=0)
y112 = np.array([1, 1, 2]).reshape(1, -1).repeat(15, axis=0)
y121 = np.array([1, 2, 1]).reshape(1, -1).repeat(17, axis=0)
y122 = np.array([1, 2, 2]).reshape(1, -1).repeat(12, axis=0)
y211 = np.array([2, 1, 1]).reshape(1, -1).repeat(12, axis=0)
y212 = np.array([2, 1, 2]).reshape(1, -1).repeat(15, axis=0)
y221 = np.array([2, 2, 1]).reshape(1, -1).repeat(11, axis=0)
y222 = np.array([2, 2, 2]).reshape(1, -1).repeat(7, axis=0)

data = np.vstack((y111, y112, y121, y122, y211, y212, y221, y222))

np.random.shuffle(data)

Training algorithm:

from collections import Counter

def crf_train(data, feature_func, learning_rate, max_iter=1000):
    m, n = data.shape
    K = feature_func.shape[0]  # number of feature functions
    N = n  # number of positions
    M = len(Counter(data.flatten()))  # number of values y can take
    sequence_dict = {}  # number of occurrences of each sequence

    for i in range(m):
        sequence = tuple(data[i])
        sequence_dict[sequence] = sequence_dict.get(sequence, 0) + 1

    # Gradient descent updates
    train_weight = np.ones(K).reshape(-1, 1) * 0.5
    for epoch in range(max_iter):
        # Gradient accumulated over the samples
        total_grad = 0.0
        for y in data:
            total_grad += norm_conditional_prob(y, M, train_weight, feature_func) * calculate_feature_vec(y, feature_func) - \
            joint_prob(y, sequence_dict) * calculate_feature_vec(y, feature_func)
        # Decaying step size plus a small weight-decay term
        new_weight = train_weight - learning_rate / (epoch * 0.15 + 0.2) * total_grad / m - learning_rate * 0.00001 * train_weight
        # Clamp the weights: non-positive values are reset to 0.1, values >= 1 to 0.9
        new_weight = np.where(new_weight>0, new_weight, 0.1)
        new_weight = np.where(new_weight<1, new_weight, 0.9)
        if np.linalg.norm(new_weight - train_weight) < 1e-5:
            print(f'converge at epoch:{epoch}')
            return new_weight
        else:
            train_weight = new_weight
    return train_weight


# Gradient as a standalone helper (mirrors the loop in crf_train, which does not call it)
def calculate_gradient(data, train_weight, M, feature_func, sequence_dict):
    total_grad = 0.0
    for y in data:
        total_grad += norm_conditional_prob(y, M, train_weight, feature_func) * calculate_feature_vec(y, feature_func) - \
        joint_prob(y, sequence_dict) * calculate_feature_vec(y, feature_func)
    return total_grad

# Unnormalized conditional probability (the linear score w . f)
def no_norm_conditional_prob(y, train_weight, feature_func, start=1, stop=1):
    feature_vec = calculate_feature_vec(y, feature_func, start, stop)
    return (train_weight.T @ feature_vec).flatten()[0]

# Normalized conditional probability (enumerates all sequences of length 3)
def norm_conditional_prob(y, M, train_weight, feature_func, start=1, stop=1):
    total = 0.0
    for y1 in range(M):
        for y2 in range(M):
            for y3 in range(M):
                total += no_norm_conditional_prob(np.array([y1+1, y2+1, y3+1]), train_weight, feature_func, start=1, stop=1)
    return no_norm_conditional_prob(y, train_weight, feature_func, start=1, stop=1) / total

# Empirical probability of a sequence in the dataset
def joint_prob(y, sequence_dict):
    y = tuple(y.flatten())
    total_num = 0
    for k, v in sequence_dict.items():
        total_num += v
        if y==k:
            y_num = v
    return y_num / total_num

# Feature-function vector of a sequence (labels are 1-based)
def calculate_feature_vec(y, feature_func, start=1, stop=1):
    feature_vec = np.zeros((len(feature_func), 1))
    for k in range(len(feature_func)):
        for i in range(len(y)):
            if i == 0:
                feature_vec[k] += feature_func[k, i, start-1, y[i]-1]
            else:
                feature_vec[k] += feature_func[k, i, y[i-1]-1, y[i]-1]
    return feature_vec

Training:

crf_train(data, feature_func, learning_rate=5, max_iter=1500)

array([[0.94056176],
       [0.46584362],
       [0.86350232],
       [0.68291916],
       [0.18839794],
       [0.84303483],
       [0.55404497],
       [0.92399902],
       [0.5699433 ]])

weights # the originally known weights

array([[1. ],
       [0.6],
       [1. ],
       [1. ],
       [0.2],
       [1. ],
       [0.5],
       [0.8],
       [0.5]])

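The learned weights are roughly consistent with the known ones: most differ by less than about 0.16, and the largest gap is the fourth weight (0.68 versus 1.0). As for why it does not converge better, my personal guess is that there are still too few sequences relative to the number of feature functions. If any readers have better ideas, for example a more suitable method, you are welcome to discuss in the comments. (b.t.w. original writing takes effort, so a like is appreciated 😄)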