[Research Fundamentals] PRML

db_1024

Last modified: 2024-07-04 19:24:28


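Category: Research Fundamentals (self-taught tech stack, course supplements). Tags: information and communication, probability theory, machine learning, artificial intelligence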

First published: 2024-06-24 18:47:15

Article link: https://blog.csdn.net/qq_41100635/article/details/139923774


Supplement

MSE as Maximum Likelihood

MSE as Maximum Likelihood: A Deep Dive into Machine Learning’s Intersection with Statistics
Where does the Mean Squared Error come from?
The beauty of MSE lies in its simplicity and interpretability. Squaring the errors gives more weight to larger discrepancies, which makes the model sensitive to large errors. Squaring also ensures that the error metric is never negative.
In essence, MLE aims to find the model parameters that maximize the likelihood of the observed data.

The connection between MSE and MLE
Assume that the model’s errors are normally distributed, a common assumption in many statistical models. When we model the errors with a normal distribution, MLE, which maximizes the likelihood of the observed data, becomes equivalent to minimizing the MSE.
Consider the log-likelihood for normally distributed errors with a fixed variance $\sigma^2$. It simplifies to a constant minus the MSE scaled by the positive factor $N/(2\sigma^2)$, where $N$ is the number of data points, so maximizing the log-likelihood is the same as minimizing the MSE.
$p_{\mathbf{\hat s}}(\mathbf{\hat s}|\mathbf{s},f)=\mathcal{N}(\mathbf{s}|f(x),\sigma^2)$
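The relationship between the Gaussian log-likelihood and the MSE can be checked numerically. The following is a minimal sketch, assuming a fixed and known noise level `sigma` and a synthetic straight-line data set; the function names, the two candidate predictors, and all numeric values are illustrative assumptions, not taken from the article.

```python
import numpy as np

def gaussian_log_likelihood(targets, predictions, sigma):
    """Sum over all points of log N(t_n | y_n, sigma^2)."""
    n = targets.size
    residuals = targets - predictions
    return (-0.5 * n * np.log(2.0 * np.pi * sigma**2)
            - np.sum(residuals**2) / (2.0 * sigma**2))

def mse(targets, predictions):
    """Mean squared error between targets and predictions."""
    return np.mean((targets - predictions) ** 2)

# Synthetic data (assumption): a straight line plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
sigma = 0.3
targets = 2.0 * x + 1.0 + rng.normal(0.0, sigma, size=x.size)

n = targets.size
constant = -0.5 * n * np.log(2.0 * np.pi * sigma**2)

# Two hypothetical predictors: the generating line and a deliberately poor one.
pred_good = 2.0 * x + 1.0
pred_bad = 1.0 * x + 0.5

for name, pred in (("good", pred_good), ("bad", pred_bad)):
    ll = gaussian_log_likelihood(targets, pred, sigma)
    # Identity: log-likelihood = constant - N / (2 sigma^2) * MSE
    rhs = constant - n / (2.0 * sigma**2) * mse(targets, pred)
    assert np.isclose(ll, rhs)
    print(name, "MSE:", mse(targets, pred), "log-likelihood:", ll)
```

Because the constant and the scale factor do not depend on the predictor, ranking predictors by log-likelihood (higher is better) gives the same order as ranking them by MSE (lower is better).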

1-Introduction

Supervised / unsupervised learning

Overfitting

p6
p9:
For a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases.
Choose the complexity of the model according to the complexity of the problem being solved.
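The least-squares approach to finding the model parameters is a special case of maximum likelihood. (For a more detailed proof, see p141, Section 3.1.1: write down the likelihood function, take the logarithm, take the expectation, and differentiate.)
When the error term is normally distributed, the least-squares estimate and the maximum likelihood estimate coincide, as the following analysis shows: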

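In regression analysis, least squares is used to find the model parameters that minimize the sum of squared errors between the predictions and the observed values. Consider the linear regression model $y=X\beta+\epsilon$, where $y$ is the vector of observations, $X$ is the design matrix, $\beta$ is the parameter vector to be estimated, and $\epsilon$ is the error vector. The least-squares estimate finds $\beta$ by minimizing the sum of squared errors: $\hat{\beta}_{LS}=\arg\min_\beta\|y-X\beta\|^2$
Assuming $X^TX$ is invertible, the solution of this optimization problem is: $\hat{\beta}_{LS}=(X^TX)^{-1}X^Ty$ .
The goal of maximum likelihood estimation is to find the parameter values that make the observed data most probable. In the regression model, if we assume that the error term $\epsilon$ is normally distributed, $\epsilon\sim\mathcal{N}(0,\sigma^2I)$ , then the observations $y$ are also normally distributed: $y\sim\mathcal{N}(X\beta,\sigma^2I)$
The likelihood function of the observed data, with $n$ the number of observations, is: $L(\beta,\sigma^2)=P(y|\beta,\sigma^2)=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)\right)$
The log-likelihood function is: $\log L(\beta,\sigma^2)=-\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)$
To find the maximum likelihood estimate $\hat{\beta}_{MLE}$ , we take the partial derivative of $\log L(\beta,\sigma^2)$ with respect to $\beta$ and set it to zero: $\frac{\partial\log L(\beta,\sigma^2)}{\partial\beta}=\frac{1}{\sigma^2}X^T(y-X\beta)=0$
Solving this equation gives $X^{T}y=X^{T}X\beta$ , and therefore $\hat{\beta}_{MLE}=(X^TX)^{-1}X^Ty$ , which is exactly the least-squares solution.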
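The equivalence derived above can be illustrated numerically. The following is a minimal sketch, assuming synthetic data and a known noise level `sigma`: it solves the normal equations directly, then separately maximizes the Gaussian log-likelihood by plain gradient ascent, using the gradient $\frac{1}{\sigma^2}X^T(y-X\beta)$. The data sizes, `beta_true`, the step size, and the iteration count are illustrative assumptions.

```python
import numpy as np

# Synthetic regression problem (assumption): y = X beta + Gaussian noise.
rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.5, -2.0, 0.5])
sigma = 0.5
y = X @ beta_true + rng.normal(0.0, sigma, size=n)

# Least squares: solve the normal equations X^T X beta = X^T y.
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

def log_likelihood_grad(beta):
    """Gradient of the Gaussian log-likelihood with respect to beta."""
    return X.T @ (y - X @ beta) / sigma**2

# Maximum likelihood by gradient ascent. The step size is sigma^2 divided by
# the largest eigenvalue of X^T X, which keeps the iteration stable.
step = sigma**2 / np.linalg.norm(X.T @ X, 2)
beta_mle = np.zeros(d)
for _ in range(2000):
    beta_mle = beta_mle + step * log_likelihood_grad(beta_mle)

print("least squares     :", beta_ls)
print("maximum likelihood:", beta_mle)
```

The value of `sigma` only rescales the gradient, so it affects how fast the ascent moves but not where it converges. That is why the estimate of $\beta$ does not depend on the noise variance.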

The over-fitting problem can be understood as a general property of maximum likelihood.
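Maximum likelihood estimation seeks the set of parameters that maximizes the likelihood function on the given data set. However, if the model is too complex (that is, it has too many parameters or too many degrees of freedom), it may “memorize” the noise and outliers in the training data, which leads to over-fitting. The essence of over-fitting is that the model fits the training data too closely and fails to generalize to unseen data. Why is maximum likelihood estimation prone to over-fitting?
1. High model complexity: if the model has too many parameters, it can fit the training data exactly, including the noise in the data.
2. Insufficient data: when the data set is small, a complex model over-fits more easily, because it can find all kinds of patterns in the limited data that may not hold in a larger data set.
3. Lack of regularization: maximum likelihood estimation itself contains no penalty on model complexity. Without some form of regularization, the model parameters may become very large in order to reach the maximum likelihood on the training data.
Summary: without additional constraints, over-fitting is a direct consequence of maximum likelihood.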
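The points above can be made concrete with a small polynomial curve-fitting experiment. This is a minimal sketch under stated assumptions: the noisy sine data, the noise level, the polynomial degrees, and the helper names `make_data` and `rmse` are all illustrative choices and do not come from the article. Each polynomial is fitted by unregularized least squares, which is maximum likelihood under Gaussian noise.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(2)

def make_data(n):
    """Noisy samples of sin(2 pi x) on [0, 1] (synthetic, for illustration)."""
    x = np.linspace(0.0, 1.0, n)
    t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, size=n)
    return x, t

def rmse(coeffs, x, t):
    """Root-mean-square error of the polynomial with the given coefficients."""
    return np.sqrt(np.mean((P.polyval(x, coeffs) - t) ** 2))

x_train, t_train = make_data(10)    # small training set
x_test, t_test = make_data(100)     # independent test set

results = {}
for degree in (1, 3, 9):
    # Design matrix with columns x^0 ... x^degree.
    design = P.polyvander(x_train, degree)
    coeffs, *_ = np.linalg.lstsq(design, t_train, rcond=None)
    results[degree] = (rmse(coeffs, x_train, t_train),
                       rmse(coeffs, x_test, t_test))
    print(f"degree {degree}: train RMSE {results[degree][0]:.4f}, "
          f"test RMSE {results[degree][1]:.4f}, "
          f"max |w| {np.max(np.abs(coeffs)):.1f}")
```

With ten training points, the degree-9 polynomial has as many coefficients as data points, so it passes through every training point and its training error is essentially zero. Its test error and coefficient magnitudes are typically much larger than those of the lower-degree fits, which is the over-fitting behavior described above.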

Adopting a Bayesian approach, the over-fitting problem can be avoided.
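1. A prior distribution introduces a constraint on the parameters, which keeps the parameter estimates from taking extreme values. Prior: $p(\mathbf{w})=\mathcal{N}(\mathbf{w}|0,\lambda^{-1}I)$ ; likelihood: $p(\mathbf{y}|X,\mathbf{w})$ ; posterior: $p(\mathbf{w}|\mathbf{y},X)\propto p(\mathbf{y}|X,\mathbf{w})p(\mathbf{w})$ . Maximizing the posterior gives the maximum a posteriori (MAP) estimate of the parameters: $\hat{\mathbf{w}}_{Bayes}=\arg\max_\mathbf{w}p(\mathbf{w}|\mathbf{y},X)$ . For a Gaussian likelihood, this is equivalent to adding a regularization term to the loss function (with $\lambda$ rescaled by the noise variance): $\hat{\mathbf{w}}_{Bayes}=\arg\min_{\mathbf{w}}\left(\sum_{i=1}^{N}(y_{i}-f(x_{i};\mathbf{w}))^{2}+\lambda\|\mathbf{w}\|^{2}\right)$
2. Automatic adjustment of model complexity: when the data set is small, the prior has a strong influence on the posterior, which suppresses over-fitting of the parameters. When the data set is large, the data dominate the posterior, and the parameter estimates become more accurate.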
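Both points can be illustrated for the special case of a linear model $f(x;\mathbf{w})=x^T\mathbf{w}$, where the regularized objective has a closed-form minimizer $(X^TX+\lambda I)^{-1}X^T\mathbf{y}$. The following is a minimal sketch under that linearity assumption; `map_estimate`, `simulate`, `w_true`, the value of `lam`, and the data set sizes are hypothetical names and values chosen for illustration.

```python
import numpy as np

def map_estimate(X, y, lam):
    """Minimizer of sum_i (y_i - x_i^T w)^2 + lam * ||w||^2 (linear model)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(3)
w_true = np.array([2.0, -1.0, 0.5])

def simulate(n):
    """Synthetic linear data with Gaussian noise (assumption)."""
    X = rng.normal(size=(n, w_true.size))
    y = X @ w_true + rng.normal(0.0, 0.5, size=n)
    return X, y

lam = 10.0
summary = {}
for n in (5, 50, 5000):
    X, y = simulate(n)
    w_ml = map_estimate(X, y, 0.0)    # lam = 0 recovers maximum likelihood
    w_map = map_estimate(X, y, lam)   # zero-mean Gaussian prior on w
    summary[n] = (np.linalg.norm(w_ml),
                  np.linalg.norm(w_map),
                  np.linalg.norm(w_map - w_ml))
    print(f"n={n}: ||w_ml||={summary[n][0]:.3f}, "
          f"||w_map||={summary[n][1]:.3f}, gap={summary[n][2]:.4f}")
```

The MAP weights never have a larger norm than the maximum likelihood weights, which is the constraint imposed by the prior. As `n` grows, the data term $X^TX$ dominates $\lambda I$ and the gap between the two estimates shrinks, which matches the second point.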

p10
Regularization
$\widetilde{E}(\mathbf{w})=\frac{1}{2}\sum_{n=1}^{N}\left\{y(x_n,\mathbf{w})-t_n\right\}^2+\frac{\lambda}{2}\|\mathbf{w}\|^2$
where $\|\mathbf{w}\|^2\equiv\mathbf{w}^{\mathrm{T}}\mathbf{w}=w_0^2+w_1^2+\dots+w_M^2$ for a weight vector $\mathbf{w}=(w_0,\dots,w_M)^{\mathrm{T}}$
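The regularized error function can be evaluated and minimized directly when $y(x,\mathbf{w})$ is a polynomial in $x$. The following is a minimal sketch under that assumption: setting the gradient of $\widetilde{E}(\mathbf{w})$ to zero gives the linear system $(\Phi^T\Phi+\lambda I)\mathbf{w}=\Phi^T\mathbf{t}$, where $\Phi$ is the matrix of powers of $x$. The helper names, the synthetic data, the polynomial order, and the value of `lam` are illustrative assumptions.

```python
import numpy as np

def design_matrix(x, order):
    """Columns x^0, x^1, ..., x^order, so y(x, w) = design_matrix(x, order) @ w."""
    return np.vander(x, order + 1, increasing=True)

def regularized_error(w, x, t, lam):
    """E~(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2 + lam/2 * ||w||^2."""
    residuals = design_matrix(x, w.size - 1) @ w - t
    return 0.5 * np.sum(residuals**2) + 0.5 * lam * (w @ w)

def minimize_regularized_error(x, t, order, lam):
    """Solve (Phi^T Phi + lam I) w = Phi^T t, the stationarity condition of E~."""
    phi = design_matrix(x, order)
    return np.linalg.solve(phi.T @ phi + lam * np.eye(order + 1), phi.T @ t)

# Synthetic data (assumption): noisy samples of sin(2 pi x).
rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)

lam = 1e-3
w_star = minimize_regularized_error(x, t, order=9, lam=lam)
print("E~(w*)   :", regularized_error(w_star, x, t, lam))
print("||w*||^2 :", w_star @ w_star)
```

Because $\widetilde{E}$ is a strictly convex quadratic in $\mathbf{w}$ for $\lambda>0$, the solution of the linear system is its unique minimizer, so any perturbation of `w_star` increases the error.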

