Deriving Loss Functions and Regularization from MLE and MAP

qq_38955142


Article link: https://blog.csdn.net/qq_38955142/article/details/115593735


Contents

1. Introduction
2. Efficiency and Consistency of MLE
3. Deriving Cost Function
4. Experiment

1. Introduction

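In supervised machine learning we try to find a function that maps data to targets or labels. We often use a neural network as the approximator and learn the weights that best approximate this function. To optimize the weights we therefore need a suitable cost function as the optimization criterion. A cost function measures how wrong the model is in estimating the relationship between X and y, and is usually expressed as a difference or distance between the predicted and the actual values.

Estimation is a statistical term for using measurements to find an estimate of an unknown parameter. Point estimation is the attempt to provide a single "best" prediction of some quantity of interest, which may be a single parameter or a parameter vector of a parametric model, such as the weights in linear regression.

Maximum likelihood estimation (MLE) is a method for estimating parameters (such as a mean or a variance) from sample data so that the probability (likelihood) of obtaining the observed data is maximized. Under suitable regularity conditions the MLE is both consistent and efficient: as the number of training examples approaches infinity, the maximum likelihood estimate converges to the true parameter value, and the Cramér-Rao lower bound shows that, for large sample sizes, no consistent estimator has a lower mean squared error than the maximum likelihood estimator. Because of this consistency and efficiency, MLE is usually regarded as the preferred estimator in machine learning.

A Bayesian approach instead lets a prior influence the choice of the point estimate. Maximum a posteriori (MAP) estimation can be used to obtain a point estimate of an unobserved quantity from empirical data: it selects the point of maximal posterior probability. The advantage of MAP is that it can exploit information carried by the prior that cannot be found in the training data. This extra information helps reduce the variance of the point estimate at the cost of increased bias.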

2. Efficiency and Consistency of MLE

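MLE is a method for estimating the parameters of a statistical model. Given a statistical model $f(y;\theta)$ with unknown parameter $\theta$, MLE estimates $\theta$ by maximizing the probability $f(y;\theta)$ of the observations $y$: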
$\hat \theta(y)=\mathop{\arg\max}_\theta f(y;\theta)$
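
As a minimal sketch of this definition, the snippet below maximizes a log-likelihood over a grid of candidate parameters. The Bernoulli model, the observations, and the grid are assumptions chosen only for illustration; they do not come from the article.

```python
import numpy as np

# Hypothetical observations: 7 successes out of 10 Bernoulli trials.
y = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

def log_likelihood(theta, y):
    """Log of f(y; theta) for i.i.d. Bernoulli(theta) observations."""
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# Evaluate the log-likelihood on a grid of candidate parameters in (0, 1).
grid = np.linspace(0.01, 0.99, 99)
values = np.array([log_likelihood(t, y) for t in grid])

# The MLE is the grid point with the highest log-likelihood.
theta_hat = grid[np.argmax(values)]
print(theta_hat, y.mean())
```

For this model the maximizer coincides with the sample frequency of successes, which is why the two printed numbers agree up to the grid resolution.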

2.1 Cramér-Rao Lower Bound

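The Cramér-Rao lower bound **[1]** (CRLB) gives a lower bound on the variance of an estimator of a deterministic parameter $\theta$:
$Var(\hat \theta(Y))\ge\frac{(\frac \partial {\partial \theta}\mathbb E[\hat\theta(Y)])^2}{I(\theta)}$
where $I(\theta)$ is the Fisher information, which measures how much information the observable random variable $Y$ carries about the unknown parameter $\theta$. For an unbiased estimator $\hat\theta(Y)$ we have $\mathbb E[\hat\theta(Y)]=\theta$, so the numerator equals 1 and
$Var(\hat \theta(Y))\ge \frac{1}{I(\theta)}$
That is, the variance of any unbiased estimator is at least the reciprocal of the Fisher information.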

2.2 Efficiency

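From Section 2.1 we know that the variance of an unbiased estimator $\hat\theta(Y)$ cannot be smaller than the CRLB. Any unbiased estimator whose variance equals the lower bound is therefore called an efficient estimator **[2]**.

Suppose $\boldsymbol Y=\{Y_1,\cdots,Y_n\}$ is a set of independent and identically distributed (i.i.d.) Gaussian random variables $\mathcal N(\theta,\sigma^2)$, and let $\boldsymbol y=\{y_1,\cdots,y_n\}$ be a set of observations. Then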
$\begin{aligned}f(\boldsymbol y;\theta) &= \prod^n_{i=1}f(y_i;\theta) \\ &= \prod^n_{i=1}\frac{1}{\sigma \sqrt{2\pi}}\exp{\left[ {-\frac{(y_i-\theta)^2}{2\sigma^2}} \right]} \\ &= \frac{1}{(2\pi\sigma^2)^{\frac n 2}}\exp{\left[{-\frac{\sum_{i=1}^n(y_i-\theta)^2}{2\sigma^2}}\right]} \end{aligned}$
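Taking the logarithm of both sides and setting the derivative with respect to $\theta$ to zero gives
$\begin{aligned}\log{f(\boldsymbol y;\theta)}&=-\frac n 2\log(2\pi\sigma^2)-\frac{\sum_{i=1}^n(y_i-\theta)^2}{2\sigma^2}\\ \frac{\partial \log{f(\boldsymbol y; \theta)}}{\partial \theta}&=\frac{\sum_{i=1}^n(y_i-\theta)}{\sigma^2}=0\end{aligned}$
Hence the MLE is the sample mean, $\hat \theta_{MLE}=\frac 1 n \sum_{i=1}^n y_i$, and it is unbiased:
$\mathbb E[\hat \theta_{MLE}(Y)]=\frac 1 n\sum^n_{i=1}\mathbb E[Y_i]=\theta$
To determine the CRLB we need the Fisher information:
$I(\theta)=-\mathbb E[\frac{\partial^2}{\partial \theta^2}\log f(\boldsymbol Y;\theta)]=\frac{n}{\sigma^2}$
By Section 2.1, and because the $Y_i$ are independent with variance $\sigma^2$,
$\begin{aligned}Var(\hat \theta_{MLE}(Y))&\ge\frac{1}{I(\theta)}=\frac{\sigma^2}{n}\\ Var(\hat\theta_{MLE}(Y))&=\frac{1}{n^2}\sum_{i=1}^n Var(Y_i)=\frac{\sigma^2}{n}=\frac{1}{I(\theta)}\end{aligned}$
The CRLB is therefore attained with equality, so the MLE is efficient.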
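
The efficiency result can be checked numerically. The sketch below repeats the Gaussian experiment many times and compares the empirical variance of the sample mean with $\sigma^2/n$. The values of `theta`, `sigma`, `n`, and `trials` are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n = 2.0, 3.0, 25   # assumed true mean, noise std, sample size
trials = 20000                   # number of repeated experiments

# Each row is one experiment with n i.i.d. N(theta, sigma^2) observations.
samples = rng.normal(theta, sigma, size=(trials, n))
theta_mle = samples.mean(axis=1)  # the MLE of the mean is the sample mean

empirical_var = theta_mle.var()
crlb = sigma**2 / n               # 1 / I(theta) with I(theta) = n / sigma^2

print(empirical_var, crlb)
```

Up to simulation noise, the two printed values should be close, which is what attaining the bound means in practice.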

2.3 Consistency

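Let $\hat\theta_n$ be an estimate computed from the sequence of observations $\{Y_1,\cdots,Y_n\}$. $\hat \theta_n$ is consistent if it converges in probability to $\theta$, that is, if for every $\epsilon>0$
$\mathbb P(|\hat \theta_n -\theta|>\epsilon)\to 0,\quad as\ n\to \infty$
By Markov's inequality applied to $(\hat\theta_n-\theta)^2$, $\mathbb P(|\hat \theta_n -\theta|>\epsilon)\le \mathbb E[(\hat\theta_n-\theta)^2]/\epsilon^2$, so a sufficient condition for this convergence is
$\mathbb E[(\hat\theta_n-\theta)^2]\to 0,\quad as\ n\to\infty$
Again suppose $\boldsymbol Y=\{Y_1,\cdots,Y_n\}$ is a set of i.i.d. Gaussian random variables $\mathcal N(\theta,\sigma^2)$ and $\boldsymbol y=\{y_1,\cdots,y_n\}$ is a set of observations. From Section 2.2,
$\hat \theta_{MLE}(y)=\frac{1}{n}\sum_{i=1}^n y_i$
Because this estimator is unbiased, its mean squared error equals its variance:
$\mathbb E[(\hat\theta_n-\theta)^2]=Var(\hat\theta_n)=\frac{\sigma^2}{n}\to 0,\quad as\ n\to\infty$
The sufficient condition therefore holds, and $\hat \theta_n$ is consistent.

In summary, for this Gaussian model the MLE is a consistent and efficient estimator: as the number of training examples approaches infinity, the parameter estimate converges to the true parameter value.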
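
A small simulation makes the consistency argument concrete. The sketch below estimates both $\mathbb P(|\hat\theta_n-\theta|>\epsilon)$ and the mean squared error for growing $n$. The sample sizes, `eps`, and the other constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, eps = 2.0, 3.0, 0.5   # assumed true mean, noise std, tolerance
trials = 2000                       # repeated experiments per sample size

prob_far, mse = {}, {}
for n in [10, 100, 1000]:
    samples = rng.normal(theta, sigma, size=(trials, n))
    theta_hat = samples.mean(axis=1)                        # MLE for each experiment
    prob_far[n] = np.mean(np.abs(theta_hat - theta) > eps)  # estimate of P(|err| > eps)
    mse[n] = np.mean((theta_hat - theta) ** 2)              # estimate of E[err^2]
    print(n, prob_far[n], mse[n], sigma**2 / n)
```

Both estimated quantities shrink as $n$ grows, and the estimated mean squared error tracks $\sigma^2/n$.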

3. Deriving Cost Function

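MLE and MAP provide a mechanism for deriving loss functions. In this section we derive commonly used loss functions and regularizers from MLE and MAP.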

3.1 Cost Function

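In machine learning we usually take the error between the predicted output $\hat y$ and the actual target $y$ as the loss function used to optimize the weights. Mean squared error and mean absolute error are used for regression tasks, while cross-entropy is used for classification tasks. All of them can be derived from MLE.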

3.1.1 Mean Squared Error

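Mean squared error (the L2 loss) is widely used in linear regression. Assume the training set $\{(x^{(1)},y^{(1)}),\cdots,(x^{(m)},y^{(m)})\}$ consists of i.i.d. samples whose targets follow a Gaussian distribution centered on the prediction, $y\sim\mathcal N(\mu=\hat y,\sigma^2)$:
$\begin{aligned}\hat y &= f(x;\theta)\\ p(y|x;\theta)&=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y-\hat y)^2}{2\sigma^2}}\end{aligned}$
The likelihood is
$\begin{aligned} p(\boldsymbol y|\boldsymbol x;\theta) &= \prod_{i=1}^m p(y^{(i)}|x^{(i)};\theta) \\ &= \prod_{i=1}^m\frac{1}{\sigma\sqrt{2\pi}}\exp{\left[-\frac{(y^{(i)}-\hat y^{(i)})^2}{2\sigma^2}\right]} \end{aligned}$
Taking the logarithm of both sides,
$\begin{aligned} \log p(\boldsymbol y|\boldsymbol x;\theta) &= \sum_{i=1}^m \log \left(\frac{1}{\sigma\sqrt{2\pi}}\exp{\left[-\frac{(y^{(i)}-\hat y^{(i)})^2}{2\sigma^2}\right]}\right) \\ &= -m\log\sigma - \frac m 2\log{2\pi}-\sum_{i=1}^m\frac{(y^{(i)}-\hat y^{(i)})^2}{2\sigma^2} \end{aligned}$

The first two terms do not depend on $\theta$, and $\frac{1}{2\sigma^2}$ is a positive constant. Defining the mean squared error as
$J_{MSE}(\theta) = \frac{1}{m}\sum_{i=1}^m\|y^{(i)}-\hat y^{(i)}\|^2$
the negative log-likelihood can be written as
$-\log p(\boldsymbol y|\boldsymbol x;\theta) = m\log\sigma + \frac m 2\log{2\pi} + \frac{m}{2\sigma^2}J_{MSE}(\theta)$

where $\hat y^{(i)}$ is the prediction for input $x^{(i)}$ under weights $\theta$. Maximizing the log-likelihood over $\theta$ is therefore the same as minimizing the mean squared error:
$\theta^*=\mathop{\arg\max}_\theta \log p(\boldsymbol y|\boldsymbol x;\theta) = \mathop{\arg\min}_\theta J_{MSE}(\theta)$
This shows that minimizing the mean squared error is equivalent to maximum likelihood estimation under a Gaussian noise assumption.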
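
The equivalence can be verified numerically. The sketch below assumes a hypothetical one-parameter model $\hat y=\theta x$ with synthetic data, and checks that the Gaussian negative log-likelihood and the MSE are minimized by the same $\theta$. The model, data, and grid are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma, theta_true = 200, 0.5, 1.5   # assumed sample count, noise std, true weight
x = rng.uniform(-1, 1, size=m)
y = theta_true * x + rng.normal(0, sigma, size=m)

def gaussian_nll(theta):
    """Negative log-likelihood of y under N(theta * x, sigma^2)."""
    y_hat = theta * x
    return (m * np.log(sigma) + 0.5 * m * np.log(2 * np.pi)
            + np.sum((y - y_hat) ** 2) / (2 * sigma**2))

def mse(theta):
    """Mean squared error of the same model."""
    return np.mean((y - theta * x) ** 2)

grid = np.linspace(0.0, 3.0, 3001)
theta_nll = grid[np.argmin([gaussian_nll(t) for t in grid])]
theta_mse = grid[np.argmin([mse(t) for t in grid])]
theta_ls = np.sum(x * y) / np.sum(x * x)   # closed-form least-squares solution

print(theta_nll, theta_mse, theta_ls)
```

Both grid searches land on the same point, which matches the closed-form least-squares solution up to the grid resolution.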

3.1.2 Mean Absolute Error

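As in Section 3.1.1, but now assume the targets follow a Laplace distribution centered on the prediction, $y\sim Laplace(\mu=\hat y,\lambda)$, so that
$p(y|x;\theta)=\frac{1}{2\lambda}e^{-\frac{|y-\hat y|}{\lambda}}$
The likelihood is
$\begin{aligned} p(\boldsymbol y|\boldsymbol x;\theta) &= \prod_{i=1}^m p(y^{(i)}|x^{(i)};\theta) \\ &= \prod_{i=1}^m\frac{1}{2\lambda}\exp{\left[-\frac{|y^{(i)}-\hat y^{(i)}|}{\lambda}\right]} \end{aligned}$
Taking the logarithm of both sides,
$\log p(\boldsymbol y|\boldsymbol x;\theta) = -m\log{2\lambda} - \sum_{i=1}^m\frac{|y^{(i)}-\hat y^{(i)}|}{\lambda}$

Dropping the constant term and the positive factor $\frac 1 \lambda$, and averaging over the $m$ samples, gives the mean absolute error:
$J_{MAE}(\theta)=\frac{1}{m}\sum_{i=1}^m |y^{(i)}-\hat y^{(i)}|$

Maximizing the log-likelihood over $\theta$ is the same as minimizing the mean absolute error, so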
$\theta^*=\mathop{\arg\max}_\theta \log p(\boldsymbol y|\boldsymbol x;\theta) = \mathop{\arg\min}_\theta J_{MAE}(\theta)$
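
To see what the two noise assumptions imply in practice, the sketch below fits a constant prediction $\hat y=\theta$ to a few hypothetical targets containing one outlier. Under the Laplace likelihood the best constant is the median, while under the Gaussian likelihood it is the mean. The data and the grid are assumptions chosen for illustration.

```python
import numpy as np

y = np.array([1.0, 2.0, 2.5, 3.0, 50.0])   # hypothetical targets with one outlier

def laplace_nll(theta, lam=1.0):
    """Negative log-likelihood under Laplace(theta, lam): constant + MAE-type term."""
    return len(y) * np.log(2 * lam) + np.sum(np.abs(y - theta)) / lam

def gaussian_nll(theta, sigma=1.0):
    """Negative log-likelihood under N(theta, sigma^2): constant + MSE-type term."""
    return (len(y) * np.log(sigma * np.sqrt(2 * np.pi))
            + np.sum((y - theta) ** 2) / (2 * sigma**2))

grid = np.linspace(0.0, 60.0, 6001)
theta_laplace = grid[np.argmin([laplace_nll(t) for t in grid])]
theta_gauss = grid[np.argmin([gaussian_nll(t) for t in grid])]

print(theta_laplace, np.median(y))   # MAE solution vs. median
print(theta_gauss, np.mean(y))       # MSE solution vs. mean
```

The outlier drags the Gaussian (MSE) solution far from the bulk of the data, while the Laplace (MAE) solution stays at the median. This is the usual reason for preferring MAE when the targets contain outliers.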

3.1.3 Cross-Entropy

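MSE and MAE are used to optimize regression tasks. For binary classification we usually use cross-entropy as the loss function, which can be derived from the Bernoulli distribution. Assume the random variable $y$ follows a Bernoulli distribution and $\hat y$ is the probability that $y = 1$:
$\begin{aligned}p(y=1|x;\theta) &= \hat y\\ p(y=0|x;\theta) &= 1-\hat y\end{aligned}$
The probability mass function is
$P(y|x;\theta)=\hat y^y(1-\hat y)^{1-y}$
For $m$ samples $\{(x^{(1)},y^{(1)}),\cdots,(x^{(m)},y^{(m)})\}$ with $y\in\{0, 1\}$, the likelihood is
$\begin{aligned} L(\theta)&=\prod_{i=1}^m {\hat y^{(i)}}^{y^{(i)}}(1-\hat y^{(i)})^{1-y^{(i)}} \\ &=\prod_{i=1}^m p_\theta(y=1|x^{(i)})^{y^{(i)}}(1-p_\theta(y=1|x^{(i)}))^{1-y^{(i)}}\end{aligned}$
and the log-likelihood is
$\log L(\theta)=\sum_{i=1}^m \left[y^{(i)}\log \hat y^{(i)}+(1-y^{(i)})\log(1-\hat y^{(i)})\right]$
Its negative, averaged over the $m$ samples, is the binary cross-entropy, so maximizing the likelihood is equivalent to minimizing the cross-entropy.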
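
As a quick check of this identity, the sketch below computes the Bernoulli likelihood as a product and the binary cross-entropy as an average, and confirms that they describe the same quantity. The labels and predicted probabilities are made-up values for illustration.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])                 # hypothetical binary labels
y_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # hypothetical predicted P(y = 1 | x)

# Likelihood as the product of per-sample Bernoulli probabilities.
likelihood = np.prod(y_hat**y * (1 - y_hat) ** (1 - y))

# Log-likelihood written as a sum, as in the derivation.
log_likelihood = np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Binary cross-entropy is the negative log-likelihood averaged over the samples.
bce = -log_likelihood / len(y)

print(np.log(likelihood), log_likelihood, bce)
```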

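For a random variable with $K$ classes, $y\in\{1,\cdots, K\}$ with $p(y=j)=p_j$, the generalized Bernoulli (categorical) distribution gives the probability mass function **[3]**:
$P(y)=\prod_{j=1}^K p_j^{I(y=j)}$
where $I (y = j)$ is the indicator function: $I (y=j) = 1$ when $y = j$ and 0 otherwise.

For $m$ samples the likelihood is
$L(\theta) = \prod_{i=1}^m\prod_{j=1}^K p_j^{I(y^{(i)}=j)}= \prod_{i=1}^m\prod_{j=1}^K p_\theta(y^{(i)}=j|x^{(i)})^{I(y^{(i)}=j)}$
and the log-likelihood is
$\log L(\theta) = \sum_{i=1}^m\sum_{j=1}^K I(y^{(i)}=j)\log p_\theta(y^{(i)}=j|x^{(i)})$
In practice the labels are usually one-hot encoded, so $I(y^{(i)}=j)=y_j^{(i)}$, the $j$-th component of the label vector.

Writing $\hat y_j^{(i)}=p_\theta(y^{(i)}=j|x^{(i)})$, the multi-class cross-entropy loss is
$J_{CE}(\theta)=-\frac{1}{m}\sum_{i=1}^m\sum_{j=1}^K y_j^{(i)}\log{\hat y_j^{(i)}}$
This shows that MSE, MAE, and cross-entropy can all be derived from MLE under different distributional assumptions.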
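
The multi-class case can be sketched in the same way. The snippet below assumes the class probabilities come from a softmax over hypothetical logits (the softmax is an assumption made here, not part of the derivation) and checks that the one-hot cross-entropy equals the average negative log-probability of the true class.

```python
import numpy as np

# Hypothetical logits for m = 3 samples and K = 3 classes, with integer labels.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.2, 0.3],
                   [-0.5, 0.0, 2.5]])
labels = np.array([0, 1, 2])
K = logits.shape[1]

# Softmax turns logits into class probabilities (subtract the row max for stability).
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# One-hot labels play the role of the indicator I(y = j).
one_hot = np.eye(K)[labels]
j_ce = -np.mean(np.sum(one_hot * np.log(probs), axis=1))

# Equivalent form: average negative log-probability of the true class.
nll = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

print(j_ce, nll)
```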

3.2 Regularization

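The guarantees behind the derivations in Section 3.1 (the consistency and efficiency of MLE from Section 2) assume that the number of samples approaches infinity. In practice we may not be able to collect a large enough training set, which leads to overfitting: the test accuracy is far below the training accuracy, and the test error grows as training continues. To avoid overfitting we usually add a regularization term to the cost function to constrain the weights. $L_1$ and $L_2$ regularization are the common choices, and both can be derived from MAP **[5]**.

Using Bayes' theorem, we can maximize the posterior probability: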
$p(\theta|y)=\frac{p(y|\theta)p(\theta)}{p(y)}$

$\begin{aligned}\theta^* &= \mathop{\arg\max}_\theta p(\theta|y)\\ &= \mathop{\arg\max}_\theta p(y|\theta)p(\theta)\\ &= \mathop{\arg\max}_\theta [\log p(y|\theta)+\log p(\theta)]\end{aligned}$

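where $\mathop{\arg\max}_\theta \log p(y|\theta)$ is the MLE. If we assume an independent Gaussian prior $\theta_k\sim N(0,\frac 1 \lambda)$ on each of the $d$ components of $\theta$, then
$\log p(\theta)=\sum_{k=1}^d\log\left(\frac{1}{\sqrt{\frac{2\pi}{\lambda}}}\exp\left[-\frac \lambda 2\theta_k^2\right]\right) = -\frac d 2\log \frac {2\pi}{\lambda} - \frac \lambda 2\|\theta\|_2^2$
Substituting this into the MAP objective, dropping the constant, and writing $J(\theta)=-\log p(y|\theta)$ for the negative log-likelihood loss from Section 3.1, we get
$\begin{aligned} \theta^* &= \mathop{\arg\max}_\theta (\log p(y|\theta)+\log p(\theta))\\ &= \mathop{\arg\min}_\theta(J(\theta) + \frac \lambda 2\|\theta\|^2_2)\end{aligned}$
The second term is $L_2$ regularization.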
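
As a worked example under stated assumptions, take the Gaussian-mean model of Section 2.2 with a scalar parameter. Setting the derivative of $J(\theta)+\frac\lambda 2\theta^2$ to zero gives $\theta_{MAP}=\frac{\sum_i y_i}{n+\lambda\sigma^2}$, which shrinks the MLE toward zero. The sketch below checks this closed form against a grid search; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, sigma, n, lam = 1.0, 2.0, 10, 0.5   # assumed values for illustration
y = rng.normal(theta_true, sigma, size=n)

def objective(theta):
    """Gaussian negative log-likelihood (constants dropped) plus the L2 penalty."""
    nll = np.sum((y - theta) ** 2) / (2 * sigma**2)
    return nll + 0.5 * lam * theta**2

# Numerical minimizer on a fine grid.
grid = np.linspace(-5.0, 5.0, 10001)
theta_map_grid = grid[np.argmin([objective(t) for t in grid])]

# Closed-form MAP estimate and the unregularized MLE for comparison.
theta_map = y.sum() / (n + lam * sigma**2)
theta_mle = y.mean()

print(theta_mle, theta_map, theta_map_grid)
```

The MAP estimate is the MLE multiplied by $\frac{n}{n+\lambda\sigma^2}<1$, so the prior matters less and less as $n$ grows.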

Similarly, if we assume an independent Laplace prior $\theta_k\sim Laplace(0,\frac 2 \lambda)$ on each of the $d$ components of $\theta$, then
$\log p(\theta) = d\log \frac \lambda 4 - \frac \lambda 2 \|\theta\|_1$

$\begin{aligned} \theta^* &= \mathop{\arg\max}_\theta (\log p(y|\theta)+\log p(\theta))\\ &= \mathop{\arg\min}_\theta(J(\theta) + \frac \lambda 2\|\theta\|_1)\end{aligned}$

The second term is $L_1$ regularization.
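
A characteristic effect of the Laplace prior is that it can drive a parameter exactly to zero. For the scalar Gaussian-mean model, minimizing $J(\theta)+\frac\lambda 2|\theta|$ gives a soft-thresholded sample mean with threshold $\frac{\lambda\sigma^2}{2n}$. The sketch below illustrates this on made-up observations. The data, `sigma`, `lam`, and the helper `soft_threshold` are assumptions introduced for illustration.

```python
import numpy as np

y = np.array([0.3, -0.2, 0.1, 0.4, -0.1])   # hypothetical observations, mean 0.1
sigma, lam = 1.0, 2.0                        # assumed noise std and prior strength
n = len(y)

def soft_threshold(z, t):
    """Shrink z toward zero by t and return exactly zero when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

# Closed-form minimizers of J(theta) + penalty, with J the Gaussian negative log-likelihood.
theta_l1 = soft_threshold(y.mean(), lam * sigma**2 / (2 * n))   # Laplace prior (L1)
theta_l2 = y.sum() / (n + lam * sigma**2)                       # Gaussian prior (L2)

def l1_objective(theta):
    return np.sum((y - theta) ** 2) / (2 * sigma**2) + 0.5 * lam * abs(theta)

# Grid search as an independent check of the L1 solution.
grid = np.linspace(-1.0, 1.0, 2001)
theta_l1_grid = grid[np.argmin([l1_objective(t) for t in grid])]

print(theta_l1, theta_l1_grid, theta_l2)
```

With the same $\lambda$, the $L_2$ penalty only shrinks the estimate, while the $L_1$ penalty sets it to zero, which is why $L_1$ regularization tends to produce sparse weights.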

4. Experiment

In this section we run a regression experiment on the Boston housing dataset and compare the effect of $L_1$ and $L_2$ regularization.

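# Note: this script targets TensorFlow 1.x (tf.placeholder, tf.contrib) and
# scikit-learn < 1.2 (load_boston was removed in scikit-learn 1.2).
from sklearn.datasets import load_boston
import tensorflow as tf
import numpy as np
import random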

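using_reg = True      # add a regularization term to the loss
is_l2 = False         # True: L2 penalty (Gaussian prior); False: L1 penalty (Laplace prior)
is_MSE = False        # True: MSE loss (Gaussian likelihood); False: MAE loss (Laplace likelihood)
weight_decay = 0.5    # regularization strength
epoch = 1000
batch_size = 50
lr = 0.001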


boston = load_boston()
X = boston.data
Y = boston.target
[m, n] = X.shape  # m = 506 samples, n = 13 features
# Min-max scale every feature to [0, 1]
X_max = np.max(X, 0)
X_min = np.min(X, 0)
X_pre = (X - X_min) / (X_max - X_min)
# Hold out 106 random samples for testing; the remaining 400 are used for training
random.seed(0)
test_ind = random.sample(range(m), 106)
train_ind = list(set(range(m)).difference(test_ind))

X_test = X_pre[test_ind]
Y_test = np.reshape(Y[test_ind], (-1, 1))
X_train = X_pre[train_ind]
Y_train = np.reshape(Y[train_ind], (-1, 1))

tf.reset_default_graph()
X_input = tf.placeholder(tf.float32, shape=(None, n), name='X')
Y_input = tf.placeholder(tf.float32, shape=(None, 1), name='Y')
W1 = tf.Variable(tf.random_normal([13, 64]), dtype=tf.float32, name='W1')
b1 = tf.Variable(tf.random_normal([64]), dtype=tf.float32, name='b1')
W2 = tf.Variable(tf.random_normal([64, 16]), dtype=tf.float32, name='W2')
b2 = tf.Variable(tf.random_normal([16]), dtype=tf.float32, name='b2')
W3 = tf.Variable(tf.random_normal([16, 1]), dtype=tf.float32, name='W3')
b3 = tf.Variable(tf.random_normal([1]), dtype=tf.float32, name='b3')
l1 = tf.nn.relu(tf.matmul(X_input, W1) + b1)
l2 = tf.nn.relu(tf.matmul(l1, W2) + b2)
l3 = tf.matmul(l2, W3) + b3

# Penalize the weight matrices W1, W2, W3 of the network defined above
if using_reg:
    if is_l2:
        regularizer = tf.contrib.layers.l2_regularizer(weight_decay)
    else:
        regularizer = tf.contrib.layers.l1_regularizer(weight_decay)
    for W in (W1, W2, W3):
        tf.add_to_collection('regular', regularizer(W))

# Data term: MSE or MAE between the network output and the targets
if is_MSE:
    loss = tf.losses.mean_squared_error(predictions=l3, labels=Y_input)
else:
    loss = tf.losses.absolute_difference(predictions=l3, labels=Y_input)
# Add the regularization term, if enabled
if using_reg:
    loss += tf.add_n(tf.get_collection('regular'))

optimizer = tf.train.GradientDescentOptimizer(lr)
train = optimizer.minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
random.seed()
loss_save = []
num_train = X_train.shape[0]
for i in range(epoch):
    # Visit all training samples in a new random order every epoch
    indexs = np.arange(num_train)
    random.shuffle(indexs)
    for j in range(num_train // batch_size):
        index = indexs[j*batch_size:(j+1)*batch_size]
        batch_x = X_train[index]
        batch_y = Y_train[index]
        sess.run([train], feed_dict={X_input: batch_x, Y_input: batch_y})
    # Record the test loss after each epoch
    temp = sess.run([loss], feed_dict={X_input: X_test, Y_input: Y_test})
    loss_save.append(temp[0])
    
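import matplotlib.pyplot as plt
# Plot the test loss per epoch
plt.plot(range(epoch), loss_save)
plt.xlabel('epoch')
plt.ylabel('test loss')
plt.show()
# Final mean squared error on the test set
pred = sess.run(l3, feed_dict={X_input: X_test})
print(np.mean((pred - Y_test) ** 2))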

[1] Cramér H. Mathematical methods of statistics[M]. Princeton university press, 1999.

[2] https://engineering.purdue.edu/ChanGroup/ECE645Notes/StudentLecture08.pdf.

[3] Murphy K P. Machine learning: a probabilistic perspective[M]. MIT press, 2012.

[4] Deep learning - Information theory & Maximum likelihood. https://jhui.github.io/2017/01/05/Deep-learning-Information-theory/.

[5] Goodfellow I, Bengio Y, Courville A. Deep learning[M]. MIT press, 2016.