Course Notes: Predictive Analytics, Spring 2021
Reference textbook: Murphy, K. P. (2021). Probabilistic Machine Learning: An Introduction. MIT Press.
In this class, we'll cover topics in machine learning from a probabilistic view.
We will also introduce some topics in statistical computing, such as EM, MCMC, variational inference, and some optimization algorithms.
Chapter 3 Probabilistic models (an introduction to some probability models)
Previously, we introduced the Bayesian approach to machine learning.
Basically, there are four steps:
- specify a probability model of the form $p(y|x,\theta)=p(y|f(x;\theta))$ (determine the model form)
- specify a prior distribution $p(\theta)$
- compute the posterior distribution over the unknown parameters, $p(\theta|y)$
- make predictions using $p(y_{new}|x,y)$
How to choose a proper model?
- It depends on our beliefs about the data.
- We could enumerate all possible, reasonable models, then pick the "best" one.
Let us review some distributions.
- Discrete data: Bernoulli, binomial, categorical, multinomial, Poisson, negative binomial, etc.
- Continuous data: Gaussian (univariate and multivariate), Student's t, Cauchy, gamma, beta, etc.
Discrete
Bernoulli: models binary events (a two-sided die rolled once)
$$\operatorname{Ber}(y \mid \theta) \triangleq \theta^{y}(1-\theta)^{1-y}=\left\{\begin{array}{ll} 1-\theta & \text { if } y=0 \\ \theta & \text { if } y=1 \end{array}\right.$$
where $0\le \theta\le1$ is the probability that $y=1$.
- The Bernoulli distribution is a special case of the binomial distribution.
Binomial (a two-sided die rolled N times)
Suppose we observe a set of $N$ Bernoulli trials; let $S=\sum_{n=1}^{N}\mathbb{I}(y_n=1)$.
The distribution of $S$ is given by the binomial distribution, $\operatorname{Bin}(s \mid N, \theta) \triangleq\binom{N}{s} \theta^{s}(1-\theta)^{N-s}$, where $\binom{N}{k} \triangleq \frac{N !}{(N-k) ! k !}$. The Bernoulli is the special case of the binomial with $N=1$.
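As a quick numerical sanity check (not from the lecture; it assumes numpy and scipy are available), we can evaluate the binomial pmf and verify the Bernoulli special case:

```python
# A minimal sketch: checking the binomial pmf with scipy.
import numpy as np
from scipy.stats import binom, bernoulli

theta, N = 0.3, 10

# P(S = s) for s = 0..N under Bin(N, theta)
s = np.arange(N + 1)
pmf = binom.pmf(s, N, theta)
print(pmf.sum())  # sums to 1.0

# Bernoulli is the N = 1 special case: Bin(1, theta) matches Ber(theta)
print(binom.pmf([0, 1], 1, theta))    # [0.7, 0.3]
print(bernoulli.pmf([0, 1], theta))   # [0.7, 0.3]
```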
Sigmoid (logistic) function
When we want to predict a binary variable $y\in \{0,1\}$ given some inputs $\mathbf{x} \in \mathcal{X}$, we need to use a conditional probability distribution of the form:
$$p(y \mid \mathbf{x}, \boldsymbol{\theta})=\operatorname{Ber}(y \mid f(\mathbf{x} ; \boldsymbol{\theta}))$$
Here $f(\mathbf{x};\boldsymbol{\theta})$ is the parameter of the Bernoulli distribution, i.e., the probability of the event $y=1$, which must lie between 0 and 1. So we need to transform $f$ so that this condition is satisfied.
To avoid the requirement that $0 \leq f(\mathbf{x} ; \boldsymbol{\theta}) \leq 1$, we can let $f$ be an unconstrained function, and use the following model:
$$p(y \mid \mathbf{x}, \boldsymbol{\theta})=\operatorname{Ber}(y \mid \sigma(f(\mathbf{x} ; \boldsymbol{\theta})))$$
Here $\sigma(\cdot)$ is the sigmoid or logistic function, defined as follows:
$$\sigma(a) \triangleq \frac{1}{1+e^{-a}}$$
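A minimal numerical sketch (assuming numpy and scipy), comparing the naive definition with scipy's numerically stable `expit`:

```python
# The sigmoid maps any real a into (0, 1).
import numpy as np
from scipy.special import expit

def sigmoid(a):
    """Naive definition: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

a = np.array([-5.0, 0.0, 5.0])
print(sigmoid(a))   # [0.0067, 0.5, 0.9933]
print(expit(a))     # same values, but stable for extreme a
```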
Binary logistic regression
$$p(y \mid \mathbf{x} ; \boldsymbol{\theta})=\operatorname{Ber}\left(y \mid \sigma\left(\mathbf{w}^{\top} \mathbf{x}+b\right)\right)$$
where $f(\mathbf{x} ; \boldsymbol{\theta})=\mathbf{w}^{\top} \mathbf{x}+b$. (Note: why does the original text omit the $+b$? Presumably because the bias can be absorbed into $\mathbf{w}$ by appending a constant 1 to $\mathbf{x}$, a common convention.)
In other words,
$$p(y=1 \mid \mathbf{x} ; \boldsymbol{\theta})=\sigma\left(\mathbf{w}^{\top} \mathbf{x}+b\right)=\frac{1}{1+e^{-\left(\mathbf{w}^{\top} \mathbf{x}+b\right)}}$$
This is called logistic regression.
Logistic regression is "Bernoulli-like", but the Bernoulli parameter $p$ is built from the covariates $X$ and the model parameters $\theta$, so the model is not itself a plain Bernoulli distribution.
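A minimal sketch of the prediction rule (the weights, bias, and data below are made-up illustrative values, not fitted ones):

```python
# Binary logistic regression: p(y=1 | x; theta) = sigma(w^T x + b).
import numpy as np
from scipy.special import expit  # sigmoid

w = np.array([0.5, -1.2])  # hypothetical weights
b = 0.1                    # hypothetical bias

def predict_proba(X):
    """p(y = 1 | x; theta) = sigma(w^T x + b), row-wise over X."""
    return expit(X @ w + b)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
p = predict_proba(X)
print(p)  # P(y=1) for each row

# Bernoulli log-likelihood of labels y under the model:
y = np.array([1, 0])
loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(loglik)
```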
Categorical distributions (a C-sided die rolled once)
The categorical distribution generalizes the Bernoulli to $C>2$ values: $y\in \{1,2,\ldots,C\}$.
That is, it generalizes the binary outcome of the Bernoulli to $C$ classes (the outcome has $C$ possibilities instead of 2).
The categorical distribution is a discrete probability distribution with one parameter per class:
$$\operatorname{Cat}(y \mid \boldsymbol{\theta}) \triangleq \prod_{c=1}^{C} \theta_{c}^{\mathbb{I}(y=c)}$$
In other words, $p(y=c \mid \boldsymbol{\theta})=\theta_{c}$.
Note that the parameters are constrained so that $0 \leq \theta_{c} \leq 1$ and $\sum_{c=1}^{C} \theta_{c}=1$; thus there are only $C-1$ independent parameters.
Alternatively, we can write $y$ in one-hot encoded form: when $C=3$, the three classes are encoded as $(1,0,0),(0,1,0),(0,0,1)$.
The distribution can then be written as:
$$\operatorname{Cat}(\mathbf{y} \mid \boldsymbol{\theta}) \triangleq \prod_{c=1}^{C} \theta_{c}^{y_{c}}$$
The categorical distribution is a special case of the multinomial distribution.
(Nesting again, like matryoshka dolls.)
Multinomial distributions (a C-sided die rolled N times)
Suppose we observe $N$ categorical trials, $y_{n} \sim \operatorname{Cat}(\cdot \mid \boldsymbol{\theta})$ for $n=1: N$. Concretely, think of rolling a $C$-sided die $N$ times.
Let us define $\mathbf{s}$ to be a vector that counts the number of times each face shows up, i.e., $s_{c} \triangleq \sum_{n=1}^{N} \mathbb{I}\left(y_{n}=c\right)$.
The distribution of $\mathbf{s}$ is given by the multinomial distribution:
$$\operatorname{Mu}(\mathbf{s} \mid N, \boldsymbol{\theta}) \triangleq\left(\begin{array}{c} N \\ s_{1} \ldots s_{C} \end{array}\right) \prod_{c=1}^{C} \theta_{c}^{s_{c}}$$
where $\theta_{c}$ is the probability that side $c$ shows up, $\left(\begin{array}{c} N \\ s_{1} \ldots s_{C} \end{array}\right) \triangleq \frac{N !}{s_{1} ! s_{2} ! \cdots s_{C} !}$ is the multinomial coefficient, and $N=\sum_{c=1}^{C} s_{c}$.
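A minimal sketch (assuming numpy and scipy; the face probabilities are illustrative) of rolling a C-sided die N times and evaluating the multinomial pmf of the resulting counts:

```python
# Roll a C-sided die N times, count the faces, and check Mu(s | N, theta).
import numpy as np
from scipy.stats import multinomial

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.3, 0.5])  # C = 3 face probabilities
N = 100

s = rng.multinomial(N, theta)      # counts vector, sums to N
print(s, s.sum())                  # e.g. [21 27 52] 100

# Mu(s | N, theta): probability of this exact count vector
print(multinomial.pmf(s, n=N, p=theta))
```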
Softmax function
A generalization of the sigmoid function.
Consider $p(y \mid \mathbf{x}, \boldsymbol{\theta})=\operatorname{Cat}(y \mid f(\mathbf{x} ; \boldsymbol{\theta}))$. We require that $0 \leq f_{c}(\mathbf{x} ; \boldsymbol{\theta}) \leq 1$ and $\sum_{c=1}^{C} f_{c}(\mathbf{x} ; \boldsymbol{\theta})=1$.
To avoid the requirement that $f$ directly predict a probability vector, it is common to pass the output of $f$ into the softmax function, also called the multinomial logit. This is defined as follows:
$$\mathcal{S}(\mathbf{a}) \triangleq\left[\frac{e^{a_{1}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}, \cdots, \frac{e^{a_{C}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}\right]$$
This maps $\mathbb{R}^{C}$ to $[0,1]^{C}$, and satisfies the constraints that $0 \leq \mathcal{S}(\mathbf{a})_{c} \leq 1$ and $\sum_{c=1}^{C} \mathcal{S}(\mathbf{a})_{c}=1$.
Multiclass logistic regression
With $f(\mathbf{x} ; \boldsymbol{\theta})=\mathbf{W} \mathbf{x}+\mathbf{b}$, the model is
$$p(y \mid \mathbf{x} ; \boldsymbol{\theta})=\operatorname{Cat}(y \mid \mathcal{S}(\mathbf{W} \mathbf{x}+\mathbf{b}))$$
where $\mathcal{S}(\mathbf{W} \mathbf{x}+\mathbf{b})$ is the vector of probabilities, one per class. Writing $\mathbf{a}=\mathbf{W}\mathbf{x}+\mathbf{b}$, the probability that $y=c$ is
$$p(y=c \mid \mathbf{x} ; \boldsymbol{\theta})=\frac{e^{a_{c}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}$$
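A minimal sketch of the prediction step ($\mathbf{W}$, $\mathbf{b}$, and $\mathbf{x}$ are made-up illustrative values; `scipy.special.softmax` stands in for $\mathcal{S}$):

```python
# Multiclass logistic regression: p(y | x; theta) = Cat(y | softmax(Wx + b)).
import numpy as np
from scipy.special import softmax

W = np.array([[ 0.5, -0.2],
              [-0.3,  0.8],
              [ 0.1,  0.1]])   # hypothetical C x D weight matrix (C=3, D=2)
b = np.array([0.0, 0.1, -0.1]) # hypothetical bias vector

x = np.array([1.0, 2.0])
a = W @ x + b                  # logits a_c = w_c^T x + b_c
p = softmax(a)                 # Cat parameters: one probability per class
print(p, p.sum())              # class probabilities summing to 1
print(p.argmax())              # the most probable class
```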
Log-sum-exp trick
Consider $\frac{e^{a_{c}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}$. If we compute the numerator and denominator directly, then when $a_c$ is very large or very small the machine returns Inf or 0 (a floating-point precision problem), so we need to transform the computation into a numerically safe range.
Use the identity $\log \sum_{c=1}^{C} \exp \left(a_{c}\right)=m+\log \sum_{c=1}^{C} \exp \left(a_{c}-m\right)$ and set $m=\max_c a_c$, $c=1,2,\ldots,C$.
Then $p_c=\frac{e^{a_{c}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}=\frac{e^{a_{c}-m}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}}=\exp\left(\log e^{a_{c}-m}-\log \sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}\right)$, and the two terms inside the $\exp$ can be computed separately.
$$\log p_c=\log e^{a_{c}-m}-\log \sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}$$
(Key point.)
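A minimal sketch of the trick in code (assuming numpy): the naive computation overflows for large $a_c$, while the shifted version is stable:

```python
# Log-sum-exp trick: subtracting m = max(a) keeps exp() from over/underflowing.
import numpy as np

def log_softmax(a):
    """log p_c = log e^(a_c - m) - log sum_c' e^(a_c' - m), with m = max(a)."""
    m = np.max(a)
    shifted = a - m
    return shifted - np.log(np.sum(np.exp(shifted)))

a = np.array([1000.0, 1001.0, 1002.0])  # naive exp(a) overflows to inf
print(np.exp(a) / np.sum(np.exp(a)))    # naive: [nan nan nan] with warnings
print(np.exp(log_softmax(a)))           # stable: [0.0900 0.2447 0.6652]
```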
Continuous
Gaussian distribution
The pdf of the Gaussian is given by
$$\mathcal{N}\left(y \mid \mu, \sigma^{2}\right) \triangleq \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{1}{2 \sigma^{2}}(y-\mu)^{2}}$$
(This is so familiar that we keep the introduction brief.)
Why is the Gaussian distribution so widely used?
- It has two parameters which are easy to interpret, and which capture some of the most basic properties of a distribution, namely its mean and variance.
- The central limit theorem tells us that sums of independent random variables have an approximately Gaussian distribution, making it a good choice for modeling residual errors or "noise".
- The Gaussian distribution makes the least number of assumptions (has maximum entropy), subject to the constraint of having a specified mean and variance; this makes it a good default choice in many cases. (When the first moment exists and the second moment is finite, the maximum-entropy family is the Gaussian family.)
- It has a simple mathematical form, which results in methods that are easy to implement but often highly effective.
Beta distribution (often used to model probabilities)
The beta distribution has support over the interval [0,1] and is defined as follows:
$$\operatorname{Beta}(x \mid a, b)=\frac{1}{B(a, b)} x^{a-1}(1-x)^{b-1}$$
where $B(a, b)$ is the beta function, defined by
$$B(a, b) \triangleq \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)}$$
where $\Gamma(a)$ is the Gamma function, defined by
$$\Gamma(a) \triangleq \int_{0}^{\infty} x^{a-1} e^{-x} d x$$
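A small usage sketch (assuming scipy): since the beta has support on $[0,1]$, it is a natural prior for a probability parameter such as the Bernoulli $\theta$:

```python
# Evaluating the Beta(a, b) density and mean with scipy.
import numpy as np
from scipy.stats import beta

a, b = 2.0, 5.0
x = np.linspace(0.01, 0.99, 5)
print(beta.pdf(x, a, b))   # density values on [0, 1]
print(beta.mean(a, b))     # a / (a + b) = 2/7
```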
Gamma distribution (often used to model non-negative data)
The gamma distribution is a flexible distribution for positive real-valued rv's, $x>0$. It is defined in terms of two parameters, called the shape $a>0$ and the rate $b>0$:
$$\mathrm{Ga}(x \mid \text { shape }=a, \text { rate }=b) \triangleq \frac{b^{a}}{\Gamma(a)} x^{a-1} e^{-x b}$$
Note: the gamma distribution has several different parameterizations.
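One concrete pitfall worth noting (assuming scipy): `scipy.stats.gamma` uses the shape/scale parameterization, so to match the shape/rate form above you must pass scale $=1/b$:

```python
# Rate vs. scale: scipy.stats.gamma takes shape a and scale = 1/b.
import numpy as np
from scipy.stats import gamma
from scipy.special import gamma as gamma_fn

a, b = 2.0, 3.0        # shape and rate, as in Ga(x | shape=a, rate=b)
x = 1.5

# Density under the rate parameterization, via scale = 1/b:
print(gamma.pdf(x, a, scale=1.0 / b))

# Direct evaluation of b^a / Gamma(a) * x^(a-1) * exp(-x b):
print(b**a / gamma_fn(a) * x**(a - 1) * np.exp(-x * b))  # same value
```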
Multivariate Gaussian (normal) distribution
The multivariate Gaussian (normal) distribution is defined as:
$$\mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}, \mathbf{\Sigma}) \triangleq \frac{1}{(2 \pi)^{D / 2}|\mathbf{\Sigma}|^{1 / 2}} \exp \left[-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^{\top} \mathbf{\Sigma}^{-1}(\mathbf{y}-\boldsymbol{\mu})\right]$$
where $\boldsymbol{\mu}=\mathbb{E}[\mathbf{y}] \in \mathbb{R}^{D}$ is the mean vector, and $\boldsymbol{\Sigma}=\operatorname{Cov}[\mathbf{y}]$ is the $D \times D$ covariance matrix, defined as follows:
$$\begin{aligned} \operatorname{Cov}[\mathbf{y}] & \triangleq \mathbb{E}\left[(\mathbf{y}-\mathbb{E}[\mathbf{y}])(\mathbf{y}-\mathbb{E}[\mathbf{y}])^{\top}\right] \\ &=\left(\begin{array}{cccc} \mathbb{V}\left[Y_{1}\right] & \operatorname{Cov}\left[Y_{1}, Y_{2}\right] & \cdots & \operatorname{Cov}\left[Y_{1}, Y_{D}\right] \\ \operatorname{Cov}\left[Y_{2}, Y_{1}\right] & \mathbb{V}\left[Y_{2}\right] & \cdots & \operatorname{Cov}\left[Y_{2}, Y_{D}\right] \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{Cov}\left[Y_{D}, Y_{1}\right] & \operatorname{Cov}\left[Y_{D}, Y_{2}\right] & \cdots & \mathbb{V}\left[Y_{D}\right] \end{array}\right) \end{aligned}$$
where
$$\operatorname{Cov}\left[Y_{i}, Y_{j}\right] \triangleq \mathbb{E}\left[\left(Y_{i}-\mathbb{E}\left[Y_{i}\right]\right)\left(Y_{j}-\mathbb{E}\left[Y_{j}\right]\right)\right]=\mathbb{E}\left[Y_{i} Y_{j}\right]-\mathbb{E}\left[Y_{i}\right] \mathbb{E}\left[Y_{j}\right]$$
and $\mathbb{V}\left[Y_{i}\right]=\operatorname{Cov}\left[Y_{i}, Y_{i}\right]$.
- An important property: the marginal and conditional distributions of a multivariate Gaussian are still Gaussian.
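A minimal simulation sketch (assuming numpy; the mean and covariance are illustrative), checking that a single coordinate of an MVN sample behaves like the corresponding Gaussian marginal:

```python
# Sample from a 2-D Gaussian and check the first coordinate's moments.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

Y = rng.multivariate_normal(mu, Sigma, size=100_000)
print(Y[:, 0].mean(), Y[:, 0].var())  # approx mu_1 = 1.0, Sigma_11 = 2.0
```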
Mixture model
We create a mixture model by taking a convex combination of simple distributions. This has the form
$$p(\mathbf{y} \mid \boldsymbol{\theta})=\sum_{k=1}^{K} \pi_{k} p_{k}(\mathbf{y})$$
where $p_{k}$ is the $k$'th mixture component, and $\pi_{k}$ are the mixture weights, which satisfy $0 \leq \pi_{k} \leq 1$ and $\sum_{k=1}^{K} \pi_{k}=1$.
We introduce the discrete latent variable $z \in\{1, \ldots, K\},$ which specifies which distribution to use for generating the output $\mathbf{y}$. (The latent variable $z$ indicates which component a sample belongs to, which eases the interpretation of, and inference in, the model.)
The prior on this latent variable is $p(z=k)=\pi_{k},$ and the conditional is $p(\mathbf{y} \mid z=k)=p_{k}(\mathbf{y})=p\left(\mathbf{y} \mid \boldsymbol{\theta}_{k}\right)$.
That is, we define the following joint model:
$$\begin{aligned} p(z \mid \boldsymbol{\theta}) &=\operatorname{Cat}(z \mid \boldsymbol{\pi}) \\ p(\mathbf{y} \mid z=k, \boldsymbol{\theta}) &=p\left(\mathbf{y} \mid \boldsymbol{\theta}_{k}\right) \end{aligned}$$
The "generative story" for the data is that we first generate $z$ (the label), and then we generate the observations $\mathbf{y}$ using the parameters chosen according to the value of $z$.
Marginalizing over $z$ recovers the mixture:
$$p(\mathbf{y} \mid \boldsymbol{\theta})=\sum_{k=1}^{K} p(z=k \mid \boldsymbol{\theta}) p(\mathbf{y} \mid z=k, \boldsymbol{\theta})=\sum_{k=1}^{K} \pi_{k} p\left(\mathbf{y} \mid \boldsymbol{\theta}_{k}\right)$$
We can create different kinds of mixture models by varying the base distributions $p_{k}$.
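A minimal sketch of the generative story above (assuming numpy; the weights and component parameters are made up): draw $z \sim \operatorname{Cat}(\boldsymbol{\pi})$ first, then draw $y$ from the chosen component:

```python
# Ancestral sampling from a two-component 1-D Gaussian mixture.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])              # mixture weights
mus, sigmas = [0.0, 5.0], [1.0, 0.5]   # component parameters

z = rng.choice(len(pi), size=10_000, p=pi)           # latent labels z
y = rng.normal(np.take(mus, z), np.take(sigmas, z))  # observations y | z
print(np.mean(z == 1))  # approx pi_2 = 0.7
```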
Gaussian mixture model (GMM)
$$p(\mathbf{y} \mid \boldsymbol{\theta})=\sum_{k=1}^{K} \pi_{k} \mathcal{N}(\mathbf{y}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$$
Often used for clustering.
Note: $\mathbf{y}$ here denotes the features, not the label/response variable (it plays the role of the covariates in a regression model).
Data: $\mathbf{y}$ (features).
Objective: infer the parameters $(\pi_k,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$ for $k=1,2,\ldots,K$, i.e., $3K$ parameter blocks. Estimate the parameters, then use them for inference on new data.
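A minimal clustering sketch (assuming scikit-learn is available; the synthetic data below is illustrative): fit a two-component GMM and read off the estimated $(\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$:

```python
# Fitting a GMM for clustering with scikit-learn (EM under the hood).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two illustrative 2-D clusters:
y = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 0.5, size=(200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(y)
print(gmm.weights_)        # estimated pi_k
print(gmm.means_)          # estimated mu_k
print(gmm.covariances_)    # estimated Sigma_k
labels = gmm.predict(y)    # inferred cluster (latent z) for each point
```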