Course Notes: Predictive Analytics, Spring 2021
Reference textbook: Murphy, K. P. (2021). Probabilistic Machine Learning: An Introduction. MIT Press.
In this class, we'll cover topics in machine learning from a probabilistic view.
We will also introduce some topics in statistical computing, such as EM, MCMC, variational inference, and some optimization algorithms.
Chapter 3 Probabilistic models (an introduction to some probability models)
Previously, we introduced the Bayesian approach to machine learning.
Basically, there are four steps:
- specify a probability model of the form $p(y|x,\theta)=p(y|f(x;\theta))$ (determine the model form)
- specify a prior distribution $p(\theta)$
- compute the posterior distribution over the unknown parameters, $p(\theta|y)$
- make predictions using $p(y_{new}|x,y)$
How to choose a proper model?
- It depends on our beliefs about the data.
- We could enumerate all possible, reasonable models, then pick the "best" one.
Let us review some distributions.
- Discrete data: Bernoulli, binomial, categorical, multinomial, Poisson, negative binomial, etc.
- Continuous data: Gaussian (univariate and multivariate), Student's t, Cauchy, gamma, beta, etc.
Discrete
Bernoulli: models binary events (a two-sided die rolled once)
$$\operatorname{Ber}(y \mid \theta) \triangleq \theta^{y}(1-\theta)^{1-y}=\left\{\begin{array}{ll} 1-\theta & \text { if } y=0 \\ \theta & \text { if } y=1 \end{array}\right.$$
where $0\le \theta\le1$ is the probability that $y=1$.
- The Bernoulli distribution is a special case of the binomial distribution.
Binomial (a two-sided die rolled N times)
Suppose we observe a set of $N$ Bernoulli trials; let $S=\sum_{n=1}^{N}\mathbb{I}(y_n=1)$.
The distribution of $S$ is given by the binomial distribution, $\operatorname{Bin}(s \mid N, \theta) \triangleq\binom{N}{s} \theta^{s}(1-\theta)^{N-s}$, where $\binom{N}{k} \triangleq \frac{N !}{(N-k) ! k !}$. The Bernoulli is the special case of the binomial with $N=1$.
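As a quick numerical sanity check (not from the lecture; it assumes numpy and scipy are available), we can evaluate the binomial pmf and verify the Bernoulli special case:

```python
# A minimal sketch: checking the binomial pmf with scipy.
import numpy as np
from scipy.stats import binom, bernoulli

theta, N = 0.3, 10

# P(S = s) for s = 0..N under Bin(N, theta)
s = np.arange(N + 1)
pmf = binom.pmf(s, N, theta)
print(pmf.sum())  # sums to 1.0

# Bernoulli is the N = 1 special case: Bin(1, theta) matches Ber(theta)
print(binom.pmf([0, 1], 1, theta))    # [0.7, 0.3]
print(bernoulli.pmf([0, 1], theta))   # [0.7, 0.3]
```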
Sigmoid (logistic) function
When we want to predict a binary variable $y\in \{0,1\}$ given some inputs $\mathbf{x} \in \mathcal{X}$, we need to use a conditional probability distribution of the form:
$$p(y \mid \mathbf{x}, \boldsymbol{\theta})=\operatorname{Ber}(y \mid f(\mathbf{x} ; \boldsymbol{\theta}))$$
Here $f(\mathbf{x};\boldsymbol{\theta})$ is the parameter of the Bernoulli distribution, i.e., the probability of the event $y=1$, which must lie between 0 and 1. So we need to transform $f$ so that this condition is satisfied.
To avoid the requirement that $0 \leq f(\mathbf{x} ; \boldsymbol{\theta}) \leq 1$, we can let $f$ be an unconstrained function, and use the following model:
$$p(y \mid \mathbf{x}, \boldsymbol{\theta})=\operatorname{Ber}(y \mid \sigma(f(\mathbf{x} ; \boldsymbol{\theta})))$$
Here $\sigma(\cdot)$ is the sigmoid or logistic function, defined as follows:
$$\sigma(a) \triangleq \frac{1}{1+e^{-a}}$$
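A minimal numerical sketch (assuming numpy and scipy), comparing the naive definition with scipy's numerically stable `expit`:

```python
# The sigmoid maps any real a into (0, 1).
import numpy as np
from scipy.special import expit

def sigmoid(a):
    """Naive definition: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

a = np.array([-5.0, 0.0, 5.0])
print(sigmoid(a))   # [0.0067, 0.5, 0.9933]
print(expit(a))     # same values, but stable for extreme a
```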
Binary logistic regression
$$p(y \mid \mathbf{x} ; \boldsymbol{\theta})=\operatorname{Ber}\left(y \mid \sigma\left(\mathbf{w}^{\top} \mathbf{x}+b\right)\right)$$
where $f(\mathbf{x} ; \boldsymbol{\theta})=\mathbf{w}^{\top} \mathbf{x}+b$. (Note: why does the original text omit the $+b$? Presumably because the bias can be absorbed into $\mathbf{w}$ by appending a constant 1 to $\mathbf{x}$, a common convention.)
In other words,
$$p(y=1 \mid \mathbf{x} ; \boldsymbol{\theta})=\sigma\left(\mathbf{w}^{\top} \mathbf{x}+b\right)=\frac{1}{1+e^{-\left(\mathbf{w}^{\top} \mathbf{x}+b\right)}}$$
This is called logistic regression.
Logistic regression is "Bernoulli-like", but the Bernoulli parameter $p$ is built from the covariates $X$ and the model parameters $\theta$, so the model is not itself a plain Bernoulli distribution.
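A minimal sketch of the prediction rule (the weights, bias, and data below are made-up illustrative values, not fitted ones):

```python
# Binary logistic regression: p(y=1 | x; theta) = sigma(w^T x + b).
import numpy as np
from scipy.special import expit  # sigmoid

w = np.array([0.5, -1.2])  # hypothetical weights
b = 0.1                    # hypothetical bias

def predict_proba(X):
    """p(y = 1 | x; theta) = sigma(w^T x + b), row-wise over X."""
    return expit(X @ w + b)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
p = predict_proba(X)
print(p)  # P(y=1) for each row

# Bernoulli log-likelihood of labels y under the model:
y = np.array([1, 0])
loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(loglik)
```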
Categorical distributions (a C-sided die rolled once)
The categorical distribution generalizes the Bernoulli to $C>2$ values: $y\in \{1,2,\ldots,C\}$.
That is, it generalizes the binary outcome of the Bernoulli to $C$ classes (the outcome has $C$ possibilities instead of 2).
The categorical distribution is a discrete probability distribution with one parameter per class:
$$\operatorname{Cat}(y \mid \boldsymbol{\theta}) \triangleq \prod_{c=1}^{C} \theta_{c}^{\mathbb{I}(y=c)}$$
In other words, $p(y=c \mid \boldsymbol{\theta})=\theta_{c}$.
Note that the parameters are constrained so that $0 \leq \theta_{c} \leq 1$ and $\sum_{c=1}^{C} \theta_{c}=1$; thus there are only $C-1$ independent parameters.
Alternatively, we can write $y$ in one-hot encoded form: when $C=3$, the three classes are encoded as $(1,0,0),(0,1,0),(0,0,1)$.
The distribution can then be written as:
$$\operatorname{Cat}(\mathbf{y} \mid \boldsymbol{\theta}) \triangleq \prod_{c=1}^{C} \theta_{c}^{y_{c}}$$
The categorical distribution is a special case of the multinomial distribution.
(Nesting again, like matryoshka dolls.)
Multinomial distributions (a C-sided die rolled N times)
Suppose we observe $N$ categorical trials, $y_{n} \sim \operatorname{Cat}(\cdot \mid \boldsymbol{\theta})$ for $n=1: N$. Concretely, think of rolling a $C$-sided die $N$ times.
Let us define $\mathbf{s}$ to be a vector that counts the number of times each face shows up, i.e., $s_{c} \triangleq \sum_{n=1}^{N} \mathbb{I}\left(y_{n}=c\right)$.
The distribution of $\mathbf{s}$ is given by the multinomial distribution:
$$\operatorname{Mu}(\mathbf{s} \mid N, \boldsymbol{\theta}) \triangleq\left(\begin{array}{c} N \\ s_{1} \ldots s_{C} \end{array}\right) \prod_{c=1}^{C} \theta_{c}^{s_{c}}$$
where $\theta_{c}$ is the probability that side $c$ shows up, $\left(\begin{array}{c} N \\ s_{1} \ldots s_{C} \end{array}\right) \triangleq \frac{N !}{s_{1} ! s_{2} ! \cdots s_{C} !}$ is the multinomial coefficient, and $N=\sum_{c=1}^{C} s_{c}$.
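A minimal sketch (assuming numpy and scipy; the face probabilities are illustrative) of rolling a C-sided die N times and evaluating the multinomial pmf of the resulting counts:

```python
# Roll a C-sided die N times, count the faces, and check Mu(s | N, theta).
import numpy as np
from scipy.stats import multinomial

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.3, 0.5])  # C = 3 face probabilities
N = 100

s = rng.multinomial(N, theta)      # counts vector, sums to N
print(s, s.sum())                  # e.g. [21 27 52] 100

# Mu(s | N, theta): probability of this exact count vector
print(multinomial.pmf(s, n=N, p=theta))
```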
Softmax function
A generalization of the sigmoid function.
Consider $p(y \mid \mathbf{x}, \boldsymbol{\theta})=\operatorname{Cat}(y \mid f(\mathbf{x} ; \boldsymbol{\theta}))$. We require that $0 \leq f_{c}(\mathbf{x} ; \boldsymbol{\theta}) \leq 1$ and $\sum_{c=1}^{C} f_{c}(\mathbf{x} ; \boldsymbol{\theta})=1$.
To avoid the requirement that $f$ directly predict a probability vector, it is common to pass the output of $f$ into the softmax function, also called the multinomial logit. This is defined as follows:
$$\mathcal{S}(\mathbf{a}) \triangleq\left[\frac{e^{a_{1}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}, \cdots, \frac{e^{a_{C}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}\right]$$
This maps $\mathbb{R}^{C}$ to $[0,1]^{C}$, and satisfies the constraints that $0 \leq \mathcal{S}(\mathbf{a})_{c} \leq 1$ and $\sum_{c=1}^{C} \mathcal{S}(\mathbf{a})_{c}=1$.
Multiclass logistic regression
With $f(\mathbf{x} ; \boldsymbol{\theta})=\mathbf{W} \mathbf{x}+\mathbf{b}$, the model is
$$p(y \mid \mathbf{x} ; \boldsymbol{\theta})=\operatorname{Cat}(y \mid \mathcal{S}(\mathbf{W} \mathbf{x}+\mathbf{b}))$$
where $\mathcal{S}(\mathbf{W} \mathbf{x}+\mathbf{b})$ is the vector of probabilities, one per class. Writing $\mathbf{a}=\mathbf{W}\mathbf{x}+\mathbf{b}$, the probability that $y=c$ is
$$p(y=c \mid \mathbf{x} ; \boldsymbol{\theta})=\frac{e^{a_{c}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}$$
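A minimal sketch of the prediction step ($\mathbf{W}$, $\mathbf{b}$, and $\mathbf{x}$ are made-up illustrative values; `scipy.special.softmax` stands in for $\mathcal{S}$):

```python
# Multiclass logistic regression: p(y | x; theta) = Cat(y | softmax(Wx + b)).
import numpy as np
from scipy.special import softmax

W = np.array([[ 0.5, -0.2],
              [-0.3,  0.8],
              [ 0.1,  0.1]])   # hypothetical C x D weight matrix (C=3, D=2)
b = np.array([0.0, 0.1, -0.1]) # hypothetical bias vector

x = np.array([1.0, 2.0])
a = W @ x + b                  # logits a_c = w_c^T x + b_c
p = softmax(a)                 # Cat parameters: one probability per class
print(p, p.sum())              # class probabilities summing to 1
print(p.argmax())              # the most probable class
```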
Log-sum-exp trick
Consider $\frac{e^{a_{c}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}$. If we compute the numerator and denominator directly, then when $a_c$ is very large or very small the machine returns Inf or 0 (a floating-point precision problem), so we need to transform the computation into a numerically safe range.
Use the identity $\log \sum_{c=1}^{C} \exp \left(a_{c}\right)=m+\log \sum_{c=1}^{C} \exp \left(a_{c}-m\right)$ and set $m=\max_c a_c$, $c=1,2,\ldots,C$.
Then $p_c=\frac{e^{a_{c}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}=\frac{e^{a_{c}-m}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}}=\exp\left(\log e^{a_{c}-m}-\log \sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}\right)$, and the two terms inside the $\exp$ can be computed separately.
$$\log p_c=\log e^{a_{c}-m}-\log \sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}$$
(Key point.)
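A minimal sketch of the trick in code (assuming numpy): the naive computation overflows for large $a_c$, while the shifted version is stable:

```python
# Log-sum-exp trick: subtracting m = max(a) keeps exp() from over/underflowing.
import numpy as np

def log_softmax(a):
    """log p_c = log e^(a_c - m) - log sum_c' e^(a_c' - m), with m = max(a)."""
    m = np.max(a)
    shifted = a - m
    return shifted - np.log(np.sum(np.exp(shifted)))

a = np.array([1000.0, 1001.0, 1002.0])  # naive exp(a) overflows to inf
print(np.exp(a) / np.sum(np.exp(a)))    # naive: [nan nan nan] with warnings
print(np.exp(log_softmax(a)))           # stable: [0.0900 0.2447 0.6652]
```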
Continuous
Gaussian distribution
The pdf of the Gaussian is given by
$$\mathcal{N}\left(y \mid \mu, \sigma^{2}\right) \triangleq \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{1}{2 \sigma^{2}}(y-\mu)^{2}}$$
(This is so familiar that we keep the introduction brief.)
Why is the Gaussian distribution so widely used?
- It has two parameters which are easy to interpret, and which capture some of the most basic properties of a distribution, namely its mean and variance.
- The central limit theorem tells us that sums of independent random variables have an approximately Gaussian distribution, making it a good choice for modeling residual errors or "noise".
- The Gaussian distribution makes the least number of assumptions (has maximum entropy), subject to the constraint of having a specified mean and variance; this makes it a good default choice in many cases. (When the first moment exists and the second moment is finite, the maximum-entropy family is the Gaussian family.)
- It has a simple mathematical form, which results in methods that are easy to implement but often highly effective.
Beta distribution (often used to model probabilities)
The beta distribution has support over the interval [0,1] and is defined as follows:
$$\operatorname{Beta}(x \mid a, b)=\frac{1}{B(a, b)} x^{a-1}(1-x)^{b-1}$$
where $B(a, b)$ is the beta function, defined by
$$B(a, b) \triangleq \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)}$$
where $\Gamma(a)$ is the Gamma function, defined by
$$\Gamma(a) \triangleq \int_{0}^{\infty} x^{a-1} e^{-x} d x$$
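A small usage sketch (assuming scipy): since the beta has support on $[0,1]$, it is a natural prior for a probability parameter such as the Bernoulli $\theta$:

```python
# Evaluating the Beta(a, b) density and mean with scipy.
import numpy as np
from scipy.stats import beta

a, b = 2.0, 5.0
x = np.linspace(0.01, 0.99, 5)
print(beta.pdf(x, a, b))   # density values on [0, 1]
print(beta.mean(a, b))     # a / (a + b) = 2/7
```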
Gamma distribution (often used to model non-negative data)
The gamma distribution is a flexible distribution for positive real-valued rv's, $x>0$. It is defined in terms of two parameters, called the shape $a>0$ and the rate $b>0$:
$$\mathrm{Ga}(x \mid \text { shape }=a, \text { rate }=b) \triangleq \frac{b^{a}}{\Gamma(a)} x^{a-1} e^{-x b}$$
Note: the gamma distribution has several different parameterizations.
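One concrete pitfall worth noting (assuming scipy): `scipy.stats.gamma` uses the shape/scale parameterization, so to match the shape/rate form above you must pass scale $=1/b$:

```python
# Rate vs. scale: scipy.stats.gamma takes shape a and scale = 1/b.
import numpy as np
from scipy.stats import gamma
from scipy.special import gamma as gamma_fn

a, b = 2.0, 3.0        # shape and rate, as in Ga(x | shape=a, rate=b)
x = 1.5

# Density under the rate parameterization, via scale = 1/b:
print(gamma.pdf(x, a, scale=1.0 / b))

# Direct evaluation of b^a / Gamma(a) * x^(a-1) * exp(-x b):
print(b**a / gamma_fn(a) * x**(a - 1) * np.exp(-x * b))  # same value
```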
Multivariate Gaussian (normal) distribution
The multivariate Gaussian (normal) distribution is defined as:
$$\mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}, \mathbf{\Sigma}) \triangleq \frac{1}{(2 \pi)^{D / 2}|\mathbf{\Sigma}|^{1 / 2}} \exp \left[-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^{\top} \mathbf{\Sigma}^{-1}(\mathbf{y}-\boldsymbol{\mu})\right]$$
where $\boldsymbol{\mu}=\mathbb{E}[\mathbf{y}] \in \mathbb{R}^{D}$ is the mean vector, and $\boldsymbol{\Sigma}=\operatorname{Cov}[\mathbf{y}]$ is the $D \times D$ covariance matrix, defined as follows:
$$\begin{aligned} \operatorname{Cov}[\mathbf{y}] & \triangleq \mathbb{E}\left[(\mathbf{y}-\mathbb{E}[\mathbf{y}])(\mathbf{y}-\mathbb{E}[\mathbf{y}])^{\top}\right] \\ &=\left(\begin{array}{cccc} \mathbb{V}\left[Y_{1}\right] & \operatorname{Cov}\left[Y_{1}, Y_{2}\right] & \cdots & \operatorname{Cov}\left[Y_{1}, Y_{D}\right] \\ \operatorname{Cov}\left[Y_{2}, Y_{1}\right] & \mathbb{V}\left[Y_{2}\right] & \cdots & \operatorname{Cov}\left[Y_{2}, Y_{D}\right] \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{Cov}\left[Y_{D}, Y_{1}\right] & \operatorname{Cov}\left[Y_{D}, Y_{2}\right] & \cdots & \mathbb{V}\left[Y_{D}\right] \end{array}\right) \end{aligned}$$
where
$$\operatorname{Cov}\left[Y_{i}, Y_{j}\right] \triangleq \mathbb{E}\left[\left(Y_{i}-\mathbb{E}\left[Y_{i}\right]\right)\left(Y_{j}-\mathbb{E}\left[Y_{j}\right]\right)\right]=\mathbb{E}\left[Y_{i} Y_{j}\right]-\mathbb{E}\left[Y_{i}\right] \mathbb{E}\left[Y_{j}\right]$$
and $\mathbb{V}\left[Y_{i}\right]=\operatorname{Cov}\left[Y_{i}, Y_{i}\right]$.
- An important property: the marginal and conditional distributions of a multivariate Gaussian are still Gaussian.
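A minimal simulation sketch (assuming numpy; the mean and covariance are illustrative), checking that a single coordinate of an MVN sample behaves like the corresponding Gaussian marginal:

```python
# Sample from a 2-D Gaussian and check the first coordinate's moments.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

Y = rng.multivariate_normal(mu, Sigma, size=100_000)
print(Y[:, 0].mean(), Y[:, 0].var())  # approx mu_1 = 1.0, Sigma_11 = 2.0
```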
Mixture model
We create a mixture model by taking a convex combination of simple distributions. This has the form
$$p(\mathbf{y} \mid \boldsymbol{\theta})=\sum_{k=1}^{K} \pi_{k} p_{k}(\mathbf{y})$$
where $p_{k}$ is the $k$'th mixture component, and $\pi_{k}$ are the mixture weights, which satisfy $0 \leq \pi_{k} \leq 1$ and $\sum_{k=1}^{K} \pi_{k}=1$.
We introduce the discrete latent variable $z \in\{1, \ldots, K\},$ which specifies which distribution to use for generating the output $\mathbf{y}$. (The latent variable $z$ indicates which component a sample belongs to, which eases the interpretation of, and inference in, the model.)
The prior on this latent variable is $p(z=k)=\pi_{k},$ and the conditional is $p(\mathbf{y} \mid z=k)=p_{k}(\mathbf{y})=p\left(\mathbf{y} \mid \boldsymbol{\theta}_{k}\right)$.
That is, we define the following joint model:
$$\begin{aligned} p(z \mid \boldsymbol{\theta}) &=\operatorname{Cat}(z \mid \boldsymbol{\pi}) \\ p(\mathbf{y} \mid z=k, \boldsymbol{\theta}) &=p\left(\mathbf{y} \mid \boldsymbol{\theta}_{k}\right) \end{aligned}$$
The "generative story" for the data is that we first generate $z$ (the label), and then we generate the observations $\mathbf{y}$ using the parameters chosen according to the value of $z$.
Marginalizing over $z$ recovers the mixture:
$$p(\mathbf{y} \mid \boldsymbol{\theta})=\sum_{k=1}^{K} p(z=k \mid \boldsymbol{\theta}) p(\mathbf{y} \mid z=k, \boldsymbol{\theta})=\sum_{k=1}^{K} \pi_{k} p\left(\mathbf{y} \mid \boldsymbol{\theta}_{k}\right)$$
We can create different kinds of mixture models by varying the base distributions $p_{k}$.
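A minimal sketch of the generative story above (assuming numpy; the weights and component parameters are made up): draw $z \sim \operatorname{Cat}(\boldsymbol{\pi})$ first, then draw $y$ from the chosen component:

```python
# Ancestral sampling from a two-component 1-D Gaussian mixture.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])              # mixture weights
mus, sigmas = [0.0, 5.0], [1.0, 0.5]   # component parameters

z = rng.choice(len(pi), size=10_000, p=pi)           # latent labels z
y = rng.normal(np.take(mus, z), np.take(sigmas, z))  # observations y | z
print(np.mean(z == 1))  # approx pi_2 = 0.7
```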
Gaussian mixture model (GMM)
$$p(\mathbf{y} \mid \boldsymbol{\theta})=\sum_{k=1}^{K} \pi_{k} \mathcal{N}(\mathbf{y}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$$
Often used for clustering.
Note: $\mathbf{y}$ here denotes the features, not the label/response variable (it plays the role of the covariates in a regression model).
Data: $\mathbf{y}$ (features).
Objective: infer the parameters $(\pi_k,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$ for $k=1,2,\ldots,K$, i.e., $3K$ parameter blocks. Estimate the parameters, then use them for inference on new data.
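A minimal clustering sketch (assuming scikit-learn is available; the synthetic data below is illustrative): fit a two-component GMM and read off the estimated $(\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$:

```python
# Fitting a GMM for clustering with scikit-learn (EM under the hood).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two illustrative 2-D clusters:
y = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 0.5, size=(200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(y)
print(gmm.weights_)        # estimated pi_k
print(gmm.means_)          # estimated mu_k
print(gmm.covariances_)    # estimated Sigma_k
labels = gmm.predict(y)    # inferred cluster (latent z) for each point
```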