Machine Learning: A Probabilistic Perspective, Chapter 2 Study Notes

wangchenchen233

Article link: https://blog.csdn.net/A_little_pang/article/details/84101297


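Study notes on Machine Learning: A Probabilistic Perspective (also serving as general machine learning study notes)

- Preamble
2 Probability

Preamble

1. Why study this book?
I have already studied a fair amount of machine learning and read several books (Machine Learning by Zhou Zhihua; Statistical Learning Methods by Li Hang; Introduction to Machine Learning by Alpaydın). I skimmed the third one in October and came away feeling that machine learning is closely tied to statistics, so working through this book should help me consolidate my foundations.
2. Why write a blog?
Posting once a day keeps me studying; otherwise I would fritter the time away watching live streams and sleeping.
3. My plan
Write about what I am unfamiliar with.
Write about what is important.
Work through the several books together and compare them.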

2 Probability

2.2 A brief review of probability theory

2.2.4 Independence and conditional independence

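Unconditional (marginal) independence
$X$ and $Y$ are unconditionally independent, or marginally independent, if $p (x, y) = p (x) p (y)$ . The figure below is a good way to picture this.

[Figure: image not preserved]
Where does conditional independence come from?
“Unfortunately, unconditional independence is rare, because most variables can influence most other variables. However, usually this influence is mediated via other variables rather than being direct.”
Given $z$ , $x$ and $y$ are conditionally independent (CI) if and only if $p (x, y ∣ z) = p (x ∣ z) p (y ∣ z)$

"Theorem 2.2.1. $X ⊥ Y ∣ Z$ iff there exist function $g$ and $h$ such that $p (x, y ∣ z) = g (x, z) h (y, z)$ , for all $x, y, z$ such that $p (z) > 0$ ."
My understanding: if $p (x, y ∣ z) = g (x, z) h (y, z)$ , summing over $y$ gives $p (x ∣ z) = g (x, z) \sum_y h (y, z)$ , and summing over $x$ gives $p (y ∣ z) = h (y, z) \sum_x g (x, z)$ . Summing over both gives $\sum_x g (x, z) \sum_y h (y, z) = 1$ , so $p (x ∣ z) p (y ∣ z) = g (x, z) h (y, z) = p (x, y ∣ z)$ . Conversely, choosing $g (x, z) = p (x ∣ z)$ and $h (y, z) = p (y ∣ z)$ gives the factorized form directly.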
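The difference between marginal and conditional independence can be checked numerically on a small discrete example. The sketch below is only an illustration under assumed numbers (a hidden variable `z` picks one of two hypothetical coins, and `x`, `y` are two flips of that coin); none of these values come from the book.

```python
from itertools import product

# Hypothetical toy model: z picks a coin, x and y are two flips of that coin.
p_z = {0: 0.5, 1: 0.5}
p_heads = {0: 0.9, 1: 0.2}  # P(flip = 1 | z), illustrative values


def bern(v, theta):
    """Bernoulli pmf for v in {0, 1}."""
    return theta if v == 1 else 1.0 - theta


# Joint p(x, y, z) = p(z) p(x | z) p(y | z)
joint = {
    (x, y, z): p_z[z] * bern(x, p_heads[z]) * bern(y, p_heads[z])
    for x, y, z in product((0, 1), repeat=3)
}


def marginal(keep):
    """Sum the joint over every variable whose index is not in `keep`."""
    out = {}
    for xyz, p in joint.items():
        key = tuple(xyz[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


p_xz, p_yz, p_xy = marginal((0, 2)), marginal((1, 2)), marginal((0, 1))
p_x, p_y = marginal((0,)), marginal((1,))

# Conditional independence: p(x, y | z) == p(x | z) p(y | z) for all x, y, z.
cond_indep = all(
    abs(joint[x, y, z] / p_z[z] - (p_xz[x, z] / p_z[z]) * (p_yz[y, z] / p_z[z])) < 1e-12
    for x, y, z in product((0, 1), repeat=3)
)

# Marginal independence: p(x, y) == p(x) p(y) for all x, y.
marg_indep = all(
    abs(p_xy[x, y] - p_x[(x,)] * p_y[(y,)]) < 1e-12
    for x, y in product((0, 1), repeat=2)
)

print(cond_indep, marg_indep)  # True False
```

Here the two flips are independent once the coin is known, but not marginally, because observing one flip carries information about which coin was picked.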

2.3 Some common discrete distributions

Common ones include the binomial, Bernoulli, multinomial, multinoulli, Poisson, and empirical distributions. Only the first two groups (binomial/Bernoulli and multinomial/multinoulli) are covered here.

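2.3.1 The binomial and Bernoulli distributions

Suppose we toss a coin $n$ times. Let $X \in \{0,\dots,n\}$ be the number of heads and let $\theta$ be the probability of heads. Then
$X \sim \mathrm{Bin}(n, \theta)$ , i.e. $X$ follows a binomial distribution, with pmf
$\mathrm{Bin}(k|n,\theta)=\binom{n}{k}\theta^k(1 - \theta)^{n-k}$
mean = $n \theta$ , var = $n \theta (1 - \theta)$

In the special case $n = 1$ this is the Bernoulli distribution,
$\mathrm{Ber}(x|\theta) = \theta^{I(x=1)}(1 - \theta)^{I(x=0)}$
where $I (x = i)$ is the indicator function; mean = $\theta$ , var = $\theta (1 - \theta)$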
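As a quick sanity check, the binomial pmf can be evaluated directly and its moments compared with $n\theta$ and $n\theta(1-\theta)$. This is a minimal sketch; the values `n = 10` and `theta = 0.3` are illustrative assumptions, not taken from the notes.

```python
from math import comb


def binom_pmf(k, n, theta):
    """Bin(k | n, theta) = C(n, k) * theta^k * (1 - theta)^(n - k)."""
    return comb(n, k) * theta**k * (1.0 - theta) ** (n - k)


n, theta = 10, 0.3  # illustrative values
pmf = [binom_pmf(k, n, theta) for k in range(n + 1)]

total = sum(pmf)                                          # should be close to 1
mean = sum(k * p for k, p in enumerate(pmf))              # should be close to n * theta
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf)) # close to n * theta * (1 - theta)

# n = 1 recovers the Bernoulli distribution: [P(x = 0), P(x = 1)].
bernoulli = [binom_pmf(x, 1, theta) for x in (0, 1)]

print(total, mean, var, bernoulli)
```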

2.3.2 The multinomial and multinoulli distributions

Let $x=(x_1,\dots,x_K)$ be a random vector, where $K$ is the number of possible outcomes and $x_i$ is the number of times outcome $i$ occurs. The probability mass function is
$\mathrm{Mu}(x|n,\theta)=\binom{n}{x_1 \dots x_K}\prod_{i=1}^K\theta_i^{x_i}$ , where $\theta_i$ is the probability of outcome $i$ and $n=\sum_{k=1}^Kx_k$

$\binom{n}{x_1 \dots x_K}=\frac{n!}{x_1!x_2!\cdots x_K!}$

In the special case $n = 1$ this is the multinoulli distribution, where $x$ is a one-hot vector:
$x=[I(x = 1), \dots , I(x = K)]$ , $\mathrm{Mu}(x|1,\theta)=\prod_{i=1}^K\theta_i^{I(x_i=1)}$
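The multinomial pmf is straightforward to compute from the formula. The sketch below uses only the standard library; the three-outcome probabilities and counts are hypothetical values chosen for illustration.

```python
from math import comb, factorial, prod


def multinomial_pmf(x, theta):
    """Mu(x | n, theta) with n = sum(x): n! / (x_1! ... x_K!) * prod theta_i^x_i."""
    n = sum(x)
    coef = factorial(n)
    for x_i in x:
        coef //= factorial(x_i)  # stays an integer at every step
    return coef * prod(t**x_i for t, x_i in zip(theta, x))


theta = (0.2, 0.3, 0.5)  # illustrative outcome probabilities

p_counts = multinomial_pmf((1, 2, 3), theta)   # n = 6
p_one_hot = multinomial_pmf((0, 1, 0), theta)  # n = 1: multinoulli, equals theta[1]

# With K = 2 the multinomial reduces to the binomial.
p_k2 = multinomial_pmf((3, 7), (0.3, 0.7))
p_binom = comb(10, 3) * 0.3**3 * 0.7**7

print(p_counts, p_one_hot, p_k2, p_binom)
```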

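Summary: the Bernoulli distribution can be viewed as a special case of both the binomial and the multinoulli distributions.
[Figure: image not preserved]
Quick notes:
PDF: probability density function, for continuous random variables
PMF: probability mass function, for discrete random variables
CDF: cumulative distribution function, obtained by integrating the PDF or summing the PMF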

2.4 Some common continuous distributions

Common ones include the Gaussian (normal) distribution, the degenerate pdf, the Laplace distribution, the gamma distribution, the beta distribution, and the Pareto distribution.

2.4.1 Gaussian (normal) distribution

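$\mathcal{N}(x|\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right]$
The Gaussian is often parameterized by its precision $\lambda=\frac{1}{\sigma^2}$ ; the larger $\lambda$ is, the more tightly the density concentrates around $\mu$ .
The CDF is usually computed via the error function: $\Phi(x;\mu,\sigma)=\frac{1}{2}[1+\mathrm{erf}(\frac{z}{\sqrt2})]$
where $z = (x - \mu) / \sigma$ and $\mathrm{erf}(x)=\frac{2}{\sqrt\pi}\int_0^x e^{-t^2}dt$ .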
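The error-function form of the Gaussian CDF maps directly onto `math.erf` from the Python standard library. A minimal sketch, with the function name `gaussian_cdf` being my own choice:

```python
from math import erf, sqrt


def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """Phi(x; mu, sigma) = 0.5 * (1 + erf(z / sqrt(2))) with z = (x - mu) / sigma."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


at_mean = gaussian_cdf(0.0)                    # half the mass lies below the mean
upper_tail_point = gaussian_cdf(1.96)          # roughly 0.975
within_one_sigma = gaussian_cdf(3.0, 1.0, 2.0) - gaussian_cdf(-1.0, 1.0, 2.0)  # roughly 0.68

print(at_mean, upper_tail_point, within_one_sigma)
```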

2.4.2 Degenerate pdf

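Impulse function (Dirac delta): in the limit $\sigma^2 \to 0$ the Gaussian becomes an infinitely tall, infinitely thin spike centered at $\mu$ ,
$\lim_{\sigma^2\to0}\mathcal{N}(x|\mu,\sigma^2)=\delta(x-\mu)$ , where $\delta(x)=\infty$ if $x=0$ and $0$ otherwise, with $\int_{-\infty}^{\infty}\delta(x)dx=1$
We have the sifting property
$\int_{-\infty}^{\infty}f(x)\delta(x-\mu)dx=f(\mu)$

[Figure: image not preserved] A figure was kept here to show that the Gaussian is sensitive to outliers.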

2.5 Joint probability distributions

2.5.1 Covariance and correlation

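The covariance of $X$ and $Y$ measures how strongly they are linearly related:
$cov[X,Y]=E[(X-E[X])(Y-E[Y])]=E[XY]-E[X]E[Y]$
Covariance matrix of a random vector $x$ :
$cov[x]=E[(x-E[x])(x-E[x])^T]$

Correlation coefficient, and the correlation matrix $R$ with entries $R_{ij}=corr[X_i,X_j]$ :
$corr[X,Y]=\frac{cov[X,Y]}{\sqrt{var[X]var[Y]}}$

The correlation always lies in $[-1,1]$ .

The diagonal of the correlation matrix is all 1s.
Independence implies zero correlation, but zero correlation does not imply independence.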
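A standard counterexample for "uncorrelated does not imply independent" is $X$ uniform on $\{-1, 0, 1\}$ with $Y = X^2$. The sketch below computes everything exactly with `fractions.Fraction`; the helper names (`expect`, `cov`, `corr`) are my own.

```python
from fractions import Fraction
from math import sqrt

# X is uniform on {-1, 0, 1}; every other variable is a function of X.
pmf_x = {x: Fraction(1, 3) for x in (-1, 0, 1)}


def expect(g, pmf):
    """E[g(X)] for a discrete pmf given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())


def cov(f, g, pmf):
    """cov[f(X), g(X)] = E[f g] - E[f] E[g]."""
    return expect(lambda x: f(x) * g(x), pmf) - expect(f, pmf) * expect(g, pmf)


def corr(f, g, pmf):
    """Correlation coefficient, always in [-1, 1]."""
    return float(cov(f, g, pmf)) / sqrt(float(cov(f, f, pmf)) * float(cov(g, g, pmf)))


def identity(x):
    return x


def square(x):
    return x * x


def affine(x):
    return 2 * x + 1


cov_x_y = cov(identity, square, pmf_x)   # Y = X^2 is uncorrelated with X
corr_affine = corr(identity, affine, pmf_x)  # a linear function of X has correlation 1

# Yet X and Y = X^2 are dependent: P(X=0, Y=0) differs from P(X=0) P(Y=0).
p_joint_00 = pmf_x[0]
p_y0 = sum(p for x, p in pmf_x.items() if square(x) == 0)
dependent = p_joint_00 != pmf_x[0] * p_y0

print(cov_x_y, corr_affine, dependent)
```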

2.5.2 The multivariate Gaussian or multivariate normal (MVN)

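$\mathcal{N}(x|\mu,\Sigma)=\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]$
where $\mu=E[x]\in R^D$ is the mean vector and $\Sigma = cov[x]$ is the $D \times D$ covariance matrix. Because $\Sigma$ is symmetric, it has $D(D+1)/2$ free parameters.

Section 5.4 of Introduction to Machine Learning covers this material well and is worth consulting.
I still need to study this further; I have not fully understood the underlying principles yet!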

2.6 Transformations of random variables

2.6.1 linear transformation

Suppose $f$ is a linear function, $y=f(x)=\textbf{A}x+b$ , and $x$ has mean $\mu$ and covariance $\Sigma$ . Then
$E[y]=\textbf{A}\mu+b$
$cov[y]=\textbf{A}\Sigma \textbf{A}^T$
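The two identities for a linear transformation can be compared against sample moments. This is a rough sketch with assumed values for `A`, `b`, `mu` and a diagonal `Sigma` (so that the components of `x` can be drawn independently with `random.gauss`); it uses plain lists instead of a linear algebra library.

```python
import random


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


# Illustrative values, not from the notes.
mu = [1.0, -2.0]
Sigma = [[4.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0], [0.0, 3.0]]
b = [1.0, -1.0]

mean_y = [m + b_i for m, b_i in zip(matvec(A, mu), b)]  # A mu + b
cov_y = matmul(matmul(A, Sigma), transpose(A))           # A Sigma A^T

# Empirical check: draw x with independent Gaussian components, then map to y.
random.seed(0)
S = 100_000
ys = []
for _ in range(S):
    x = [random.gauss(mu[0], 2.0), random.gauss(mu[1], 1.0)]  # std = sqrt of Sigma's diagonal
    ys.append([y_i + b_i for y_i, b_i in zip(matvec(A, x), b)])

emp_mean = [sum(y[i] for y in ys) / S for i in range(2)]
emp_cov = [[sum((y[i] - emp_mean[i]) * (y[j] - emp_mean[j]) for y in ys) / S
            for j in range(2)] for i in range(2)]

print(mean_y, cov_y)
print(emp_mean, emp_cov)
```

The identities themselves hold for any distribution with the given mean and covariance; the Gaussian is used here only as a convenient way to generate samples.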

2.6.2 general transformation

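Three formulas cover everything:
[Figure: image not preserved]

For an invertible map $R^n\to R^n$ with inverse $x=f^{-1}(y)$ , use the Jacobian matrix $J_{y\to x}=\frac{\partial x}{\partial y}$ :
$p_y(y)=p_x(x)|\det J_{y\to x}|$

In particular, for scalar $x$ and $y$ this becomes
$p_y(y)=p_x(x)\left|\frac{dx}{dy}\right|$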
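The scalar change-of-variables formula $p_y(y)=p_x(x)|dx/dy|$ can be checked on a simple case. The sketch below assumes $X \sim \mathrm{Uniform}(0,1)$ and $Y = X^2$, so that $x=\sqrt{y}$ and $p_y(y)=1/(2\sqrt{y})$; the interval `[a, b]` and sample size are arbitrary choices.

```python
import random
from math import sqrt


def p_y(y):
    """Density of Y = X^2 for X ~ Uniform(0, 1): p_x(x) * |dx/dy| with x = sqrt(y)."""
    return 1.0 / (2.0 * sqrt(y))


a, b = 0.25, 0.64  # illustrative interval

# Midpoint-rule integral of p_y over [a, b]; analytically sqrt(b) - sqrt(a).
M = 10_000
h = (b - a) / M
integral = sum(p_y(a + (i + 0.5) * h) for i in range(M)) * h

# Monte Carlo check: fraction of transformed samples that land in [a, b].
random.seed(0)
S = 200_000
ys = [random.random() ** 2 for _ in range(S)]
mc_fraction = sum(a <= y <= b for y in ys) / S

print(integral, mc_fraction)
```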

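2.6.3 Central limit theorem

Consider $N$ random variables with pdfs $p(x_i)$ , each with mean $\mu$ and variance $\sigma^2$ , and assume they are independent and identically distributed (iid).
Let $S_N=\sum_{i=1}^NX_i$ be the sum of the random variables. As $N$ increases, the distribution of $S_N$ approaches
$p(S_N=s)=\frac{1}{\sqrt{2\pi N\sigma^2}}\exp\left(-\frac{(s-N\mu)^2}{2N\sigma^2}\right)$

Hence the distribution of
$Z_N=\frac{S_N-N\mu}{\sigma\sqrt N}=\frac{\bar X-\mu}{\sigma/\sqrt N}$ , with $\bar X=\frac{1}{N}\sum_{i=1}^NX_i$ ,
converges to the standard normal distribution.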
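The central limit theorem is easy to see in simulation. The sketch below assumes the summands are Uniform(0, 1), so $\mu = 1/2$ and $\sigma^2 = 1/12$; `N = 30` and the number of trials are arbitrary choices.

```python
import random
from math import sqrt
from statistics import fmean, pvariance

random.seed(0)
N, trials = 30, 20_000
mu, sigma = 0.5, sqrt(1.0 / 12.0)  # mean and std of Uniform(0, 1)


def z_n():
    """One draw of Z_N = (S_N - N mu) / (sigma sqrt(N))."""
    s_n = sum(random.random() for _ in range(N))
    return (s_n - N * mu) / (sigma * sqrt(N))


zs = [z_n() for _ in range(trials)]

z_mean = fmean(zs)        # should be near 0
z_var = pvariance(zs)     # should be near 1
frac_within = sum(abs(z) <= 1.96 for z in zs) / trials  # should be near 0.95

print(z_mean, z_var, frac_within)
```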

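2.7 Monte Carlo approximation

Computing the distribution of a function of a random variable with the change-of-variables formula is often difficult, so Monte Carlo approximation can be used instead:
First generate $S$ samples $x_1,x_2,\dots,x_S$ from the distribution (for high-dimensional distributions, Markov chain Monte Carlo, MCMC, can be used); then approximate the distribution of $f (X)$ by the empirical distribution of $\{f(x_s)\}_{s=1}^S$ .
Monte Carlo integration
$E[f(X)]=\int f(x)p(x)dx\approx\frac{1}{S}\sum_{s=1}^Sf(x_s)$ , where $x_s\sim p(X)$
By varying the function $f$ , we can approximate many quantities of interest, for example
$\bar x=\frac{1}{S}\sum_{s=1}^Sx_s\to E[X]$ and $\frac{1}{S}\sum_{s=1}^S(x_s-\bar x)^2\to var[X]$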

2.7.2 Example: estimating π by Monte Carlo integration

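The area of a circle of radius $r$ can be written as the integral $I=\int_{-r}^{r}\int_{-r}^{r}\mathbb{I}(x^2+y^2\le r^2)dxdy$ , where $\mathbb{I}(\cdot)$ is the indicator function.
Since $I=\pi r^2$ , we have $\pi=I/r^2$ . Let $f(x, y) =\mathbb{I}(x^2+y^2\le r^2)$ and let $p (x), p (y)$ be uniform distributions on $[-r,r]$ , so $p (x) = p (y) = 1 / (2 r)$ . Then
$I=(2r)(2r)\int\int f(x,y)p(x)p(y)dxdy\approx4r^2\frac{1}{S}\sum_{s=1}^Sf(x_s,y_s)$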
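The estimator for $\pi$ amounts to sampling points uniformly in the square $[-r, r]^2$, taking the fraction that falls inside the circle, scaling by the square's area $4r^2$ to estimate $I$, and dividing by $r^2$. A minimal sketch, with `r`, the seed and the sample size chosen arbitrarily:

```python
import random
from math import pi

random.seed(0)
r = 2.0
S = 200_000


def f(x, y):
    """Indicator of the disc x^2 + y^2 <= r^2."""
    return 1.0 if x * x + y * y <= r * r else 0.0


# Samples from p(x) p(y), each uniform on [-r, r].
hits = sum(f(random.uniform(-r, r), random.uniform(-r, r)) for _ in range(S))

I_hat = 4.0 * r * r * hits / S  # Monte Carlo estimate of the circle's area
pi_hat = I_hat / (r * r)

print(pi_hat, abs(pi_hat - pi))
```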

2.7.3 Accuracy of Monte Carlo approximation

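Accuracy improves as the number of samples grows. Let $\mu=E[f(X)]$ be the exact mean and $\hat{\mu}$ its MC approximation. If the samples are independent, then
$(\hat\mu-\mu)\to\mathcal{N}(0,\frac{\sigma^2}{S})$ , where $\sigma^2=var[f(X)]$
$\sigma^2$ is unknown, but it can also be estimated by MC:
$\hat\sigma^2=\frac{1}{S}\sum_{s=1}^S(f(x_s)-\hat\mu)^2$

Then we have
$P\{\mu-1.96\frac{\hat\sigma}{\sqrt S}\le\hat\mu\le\mu+1.96\frac{\hat\sigma}{\sqrt S}\}\approx0.95$

where $\sqrt{\frac{\hat{\sigma}^2}{S}}$ is the standard error, a measure of our uncertainty about the estimate of $\mu$ .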
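The standard error can be computed alongside the estimate itself. The sketch below assumes $X \sim \mathcal{N}(0,1)$ and $f(x) = x^2$, for which the exact values are $E[f(X)] = 1$ and $var[f(X)] = 2$; these choices are mine, made so the result can be checked.

```python
import random
from math import sqrt

random.seed(0)
S = 100_000


def f(x):
    return x * x


fs = [f(random.gauss(0.0, 1.0)) for _ in range(S)]

mu_hat = sum(fs) / S                                 # MC estimate of E[f(X)]
sigma2_hat = sum((v - mu_hat) ** 2 for v in fs) / S  # MC estimate of var[f(X)]
std_err = sqrt(sigma2_hat / S)                       # standard error of mu_hat

# Approximate 95% interval for the true mean.
interval = (mu_hat - 1.96 * std_err, mu_hat + 1.96 * std_err)

print(mu_hat, std_err, interval)
```

Because the standard error shrinks like $1/\sqrt{S}$, halving it requires roughly four times as many samples.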

2.8 Information theory

2.8.1 Entropy

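For a random variable $X$ with distribution $p$ , the entropy is written $H (p)$ or $H (X)$ . For a discrete variable with $K$ states it is
$H(X)=-\sum_{k=1}^Kp(X=k)\log_2p(X=k)$
With $\log_2$ the units are bits; with $\log_e$ the units are nats.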
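Entropy of a discrete distribution takes only a few lines to compute. A minimal sketch, using the convention $0 \log 0 = 0$; the `base` parameter switches between bits and nats.

```python
from math import e, log


def entropy(p, base=2.0):
    """H(p) = -sum_k p_k log p_k; terms with p_k = 0 contribute nothing."""
    return -sum(pk * log(pk) for pk in p if pk > 0) / log(base)


h_coin = entropy([0.5, 0.5])            # fair coin, in bits
h_uniform4 = entropy([0.25] * 4)        # uniform over 4 states, in bits
h_skewed = entropy([0.5, 0.25, 0.25])   # in bits
h_certain = entropy([1.0, 0.0])         # a deterministic variable has zero entropy
h_coin_nats = entropy([0.5, 0.5], base=e)  # same coin, in nats

print(h_coin, h_uniform4, h_skewed, h_certain, h_coin_nats)
```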

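2.8.2 KL divergence or relative entropy

A way to measure how dissimilar two distributions $p$ and $q$ are:
$KL(p||q)=\sum_{k=1}^Kp_k\log\frac{p_k}{q_k}$
The sum can be replaced by an integral for pdfs. Expanding gives
$KL(p||q)=\sum_kp_k\log p_k-\sum_kp_k\log q_k=-H(p)+H(p,q)$

where $H(p,q)$ is the cross entropy
$H(p,q)=-\sum_kp_k\log q_k$

It is easy to see that
the relative entropy of $p$ and $q$ equals their cross entropy minus the entropy of $p$ , so it can be read as the extra cost of encoding data from $p$ with a code built for $q$ instead of one built for $p$ itself. This suggests that the relative entropy is $\ge0$ , which is Theorem 2.8.1 (with equality iff $p=q$ ).

Theorem 2.8.1 can be proved with Jensen's inequality: for a convex function $f$ and weights $\lambda_i\ge0$ with $\sum_i\lambda_i=1$ ,
$f(\sum_{i=1}^n\lambda_ix_i)\le\sum_{i=1}^n\lambda_if(x_i)$
Among discrete distributions, the uniform distribution has the maximum entropy.
Let $u(x)=1/|\mathcal{X}|$ ; then
$0\le KL(p||u)=\sum_xp(x)\log p(x)-\sum_xp(x)\log u(x)=-H(X)+\log|\mathcal{X}|$

When we do not know which distribution is appropriate, we use the uniform distribution; this is the principle of insufficient reason.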

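2.8.3 Mutual information

A quantity that measures how $p (x, y)$ relates to $p (x) p (y)$ . If $x$ and $y$ are independent, then $p (x, y) = p (x) p (y)$ and the mutual information is zero; the more strongly the two depend on each other, the further $p (x, y)$ departs from $p (x) p (y)$ and the larger the mutual information.
$I(X;Y)=KL(p(X,Y)||p(X)p(Y))=\sum_x\sum_yp(x,y)\log\frac{p(x,y)}{p(x)p(y)}$

$I(X;Y)=\mathbb{H}(X)-\mathbb{H}(X|Y)=\mathbb{H}(Y)-\mathbb{H}(Y|X)$ , where $\mathbb{H}(Y|X)$ is the conditional entropy.
Pointwise mutual information (PMI), $PMI(x,y)=\log\frac{p(x,y)}{p(x)p(y)}$ , is similar in that it also compares $p (x, y)$ with $p (x) p (y)$ , but for a single pair of values. Mutual information is the expected value of the PMI, i.e. the sum of PMI values weighted by $p (x, y)$ .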
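Mutual information as a $p(x,y)$-weighted sum of PMI values, and its equivalence with $H(Y) - H(Y|X)$, can be checked on a small joint table. The 2x2 joint distribution below is an assumed example; results are in nats.

```python
from math import log

# Hypothetical joint distribution, joint[x][y].
joint = [[0.4, 0.1],
         [0.1, 0.4]]

p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]


def pmi(x, y):
    """Pointwise mutual information log p(x, y) / (p(x) p(y))."""
    return log(joint[x][y] / (p_x[x] * p_y[y]))


# Mutual information: PMI weighted by p(x, y).
mi = sum(joint[x][y] * pmi(x, y)
         for x in range(2) for y in range(2) if joint[x][y] > 0)


def entropy(p):
    return -sum(pk * log(pk) for pk in p if pk > 0)


h_y = entropy(p_y)
h_y_given_x = sum(p_x[x] * entropy([joint[x][y] / p_x[x] for y in range(2)])
                  for x in range(2))

# An independent joint with the same marginals has zero mutual information.
indep = [[p_x[x] * p_y[y] for y in range(2)] for x in range(2)]
mi_indep = sum(indep[x][y] * log(indep[x][y] / (p_x[x] * p_y[y]))
               for x in range(2) for y in range(2))

print(mi, h_y - h_y_given_x, mi_indep)
```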

–2018.11.15–
