Predictive Analytics Notes: Chapter 3

Course notes: Predictive Analytics, Spring 2021

Reference textbook: Murphy, K. P. (2021). Probabilistic Machine Learning: An Introduction. MIT Press.

In this class, we'll cover topics in machine learning from a probabilistic perspective.

We will also introduce some topics in statistical computing, such as EM, MCMC, variational inference, and some optimization algorithms.

Chapter 3 Probabilistic models (an introduction to some probability models)

Previously, we introduced the Bayesian approach to machine learning.

Basically, there are four steps:

  • specify a probability model of the form $p(y \mid x, \theta) = p(y \mid f(x; \theta))$ (decide on the model form)
  • specify a prior distribution $p(\theta)$
  • compute the posterior distribution over the unknown parameters, $p(\theta \mid y)$
  • make predictions using $p(y_{\text{new}} \mid x, y)$

How to choose a proper model?

  • it depends on our beliefs about the data.

  • we could enumerate all possible, reasonable models, then pick the "best" one.

Let us review some distributions.

  • Discrete data: Bernoulli, Binomial, Categorical, Multinomial, Poisson, negative Binomial, etc.
  • Continuous data: Gaussian (univariate, multivariate), Student's t, Cauchy, gamma, beta, etc.

Discrete

Bernoulli: models binary events (a two-sided die rolled once)

$$\operatorname{Ber}(y \mid \theta) \triangleq \theta^{y}(1-\theta)^{1-y}=\begin{cases} 1-\theta & \text{if } y=0 \\ \theta & \text{if } y=1 \end{cases}$$

where $0 \le \theta \le 1$ is the probability that $y=1$.

  • The Bernoulli distribution is a special case of the Binomial distribution.

Binomial: a two-sided die rolled N times

Suppose we observe a set of $N$ Bernoulli trials, and let $S = \sum_{n=1}^{N}\mathbb{I}(y_n=1)$ denote the number of successes.

The distribution of $S$ is given by the Binomial distribution,
$$\operatorname{Bin}(s \mid N, \theta) \triangleq \binom{N}{s} \theta^{s}(1-\theta)^{N-s},$$
where $\binom{N}{k} \triangleq \frac{N!}{(N-k)!\,k!}$. The Bernoulli distribution is the special case of the Binomial with $N=1$.
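
As a quick numerical check (a minimal sketch using scipy.stats; the parameter values here are made up for illustration), the Bernoulli pmf is recovered from the Binomial pmf with $N=1$:

```python
import numpy as np
from scipy import stats

theta = 0.3  # hypothetical success probability

# Bernoulli: a single binary trial.
print(stats.bernoulli.pmf([0, 1], theta))      # [0.7 0.3]

# Binomial with N = 1 reproduces the Bernoulli pmf.
print(stats.binom.pmf([0, 1], n=1, p=theta))   # [0.7 0.3]

# Binomial with N = 10: distribution of the number of successes S.
s = np.arange(11)
print(stats.binom.pmf(s, n=10, p=theta).round(3))
```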

Sigmoid (logistic) function

When we want to predict a binary variable $y \in \{0,1\}$ given some inputs $\mathbf{x} \in \mathcal{X}$, we need to use a conditional probability distribution of the form:
$$p(y \mid \mathbf{x}, \boldsymbol{\theta})=\operatorname{Ber}(y \mid f(\mathbf{x}; \boldsymbol{\theta}))$$
Here $f(\mathbf{x}; \boldsymbol{\theta})$ plays the role of the Bernoulli parameter, i.e., the probability that $y=1$, so it must lie between 0 and 1. We therefore need to transform $f$ so that this condition holds.

To avoid the requirement that $0 \leq f(\mathbf{x}; \boldsymbol{\theta}) \leq 1$, we can let $f$ be an unconstrained function, and use the following model:
$$p(y \mid \mathbf{x}, \boldsymbol{\theta})=\operatorname{Ber}(y \mid \sigma(f(\mathbf{x}; \boldsymbol{\theta})))$$
Here $\sigma(\cdot)$ is the sigmoid or logistic function, defined as follows:
$$\sigma(a) \triangleq \frac{1}{1+e^{-a}}$$
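
A minimal numpy sketch of the sigmoid (the inputs are arbitrary example values):

```python
import numpy as np

def sigmoid(a):
    """Map a real number (or array) into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # approx [0.0067, 0.5, 0.9933]
```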

Binary logistic regression

$$p(y \mid \mathbf{x}; \boldsymbol{\theta})=\operatorname{Ber}\left(y \mid \sigma\left(\mathbf{w}^{\top} \mathbf{x}+b\right)\right)$$

where $f(\mathbf{x}; \boldsymbol{\theta})=\mathbf{w}^{\top} \mathbf{x}+b$ (note: why does the original text omit the $+b$ term?)

In other words,
$$p(y=1 \mid \mathbf{x}; \boldsymbol{\theta})=\sigma\left(\mathbf{w}^{\top} \mathbf{x}+b\right)=\frac{1}{1+e^{-(\mathbf{w}^{\top} \mathbf{x}+b)}}$$
This is called logistic regression.

Logistic regression amounts to a Bernoulli model, but the Bernoulli parameter $p$ is built from the covariates $\mathbf{X}$ and the model parameters $\boldsymbol{\theta}$, so the conditional model is not a single fixed Bernoulli distribution.
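
The following sketch computes $p(y=1 \mid \mathbf{x})$ for a binary logistic regression model; the weights `w`, bias `b`, and input `x` are hypothetical values chosen only for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical "fitted" parameters for a 2-dimensional input.
w = np.array([1.5, -0.8])
b = 0.2

def predict_proba(x, w, b):
    """Return p(y = 1 | x, theta) = sigmoid(w^T x + b)."""
    return sigmoid(w @ x + b)

x = np.array([0.4, 1.0])
p1 = predict_proba(x, w, b)
print(p1, 1.0 - p1)  # p(y=1|x) and p(y=0|x)
```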

Categorical distribution: a C-sided die rolled once

The categorical distribution generalizes the Bernoulli to $C>2$ values, with $y \in \{1,2,\ldots,C\}$.

The categorical distribution generalizes the binary outcome $y$ of the Bernoulli distribution to $C$ classes (i.e., the outcome has $C$ possible values rather than 2).

The categorical distribution is a discrete probability distribution with one parameter per class:
$$\operatorname{Cat}(y \mid \boldsymbol{\theta}) \triangleq \prod_{c=1}^{C} \theta_{c}^{\mathbb{I}(y=c)}$$
In other words, $p(y=c \mid \boldsymbol{\theta})=\theta_{c}$.

Note that the parameters are constrained so that $0 \leq \theta_{c} \leq 1$ and $\sum_{c=1}^{C} \theta_{c}=1$; thus there are only $C-1$ independent parameters.

Alternatively, we can use a one-hot encoding: when $C=3$, the three classes are encoded as $(1,0,0),(0,1,0),(0,0,1)$.

The distribution can then be written as:
$$\operatorname{Cat}(\mathbf{y} \mid \boldsymbol{\theta}) \triangleq \prod_{c=1}^{C} \theta_{c}^{y_{c}}$$

The categorical distribution is a special case of the multinomial distribution.

(Nesting dolls: distributions inside distributions.)

Multinomial distribution: a C-sided die rolled N times

Suppose we observe $N$ categorical trials, $y_{n} \sim \operatorname{Cat}(\cdot \mid \boldsymbol{\theta})$, for $n=1{:}N$. Concretely, think of rolling a $C$-sided die $N$ times.

Let us define $\mathbf{s}$ to be a vector that counts the number of times each face shows up, i.e., $s_{c} \triangleq \sum_{n=1}^{N} \mathbb{I}(y_{n}=c)$.

The distribution of $\mathbf{s}$ is given by the multinomial distribution:
$$\operatorname{Mu}(\mathbf{s} \mid N, \boldsymbol{\theta}) \triangleq \binom{N}{s_{1} \ldots s_{C}} \prod_{c=1}^{C} \theta_{c}^{s_{c}}$$
where $\theta_{c}$ is the probability that side $c$ shows up, and
$$\binom{N}{s_{1} \ldots s_{C}} \triangleq \frac{N!}{s_{1}!\, s_{2}! \cdots s_{C}!},$$
with $N=\sum_{c=1}^{C} s_{c}$.
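
A small sketch with scipy.stats.multinomial, using made-up face probabilities for a 3-sided die rolled $N=10$ times:

```python
import numpy as np
from scipy import stats

theta = np.array([0.2, 0.3, 0.5])  # hypothetical face probabilities (sum to 1)
N = 10

# Probability of observing the count vector s = (2, 3, 5).
print(stats.multinomial.pmf([2, 3, 5], n=N, p=theta))

# Draw count vectors s; each row sums to N.
rng = np.random.default_rng(0)
print(rng.multinomial(N, theta, size=3))
```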

Softmax function

A generalization of the sigmoid function.

Consider $p(y \mid \mathbf{x}, \boldsymbol{\theta})=\operatorname{Cat}(y \mid f(\mathbf{x}; \boldsymbol{\theta}))$. We require that $0 \leq f_{c}(\mathbf{x}; \boldsymbol{\theta}) \leq 1$ and $\sum_{c=1}^{C} f_{c}(\mathbf{x}; \boldsymbol{\theta})=1$.

To avoid the requirement that $f$ directly predict a probability vector, it is common to pass the output from $f$ into the softmax function, also called the multinomial logit. This is defined as follows:
$$\mathcal{S}(\mathbf{a}) \triangleq\left[\frac{e^{a_{1}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}, \cdots, \frac{e^{a_{C}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}\right]$$
This maps $\mathbb{R}^{C}$ to $[0,1]^{C}$, and satisfies the constraints that $0 \leq \mathcal{S}(\mathbf{a})_{c} \leq 1$ and $\sum_{c=1}^{C} \mathcal{S}(\mathbf{a})_{c}=1$.

Multiclass logistic regression

Let $f(\mathbf{x}; \boldsymbol{\theta})=\mathbf{W}\mathbf{x}+\mathbf{b}$, so that

$$p(y \mid \mathbf{x}; \boldsymbol{\theta})=\operatorname{Cat}(y \mid \mathcal{S}(\mathbf{W}\mathbf{x}+\mathbf{b})),$$
where $\mathcal{S}(\mathbf{W}\mathbf{x}+\mathbf{b})$ is the vector of class probabilities.

Writing $\mathbf{a}=\mathbf{W}\mathbf{x}+\mathbf{b}$, the probability that $y=c$ is
$$p(y=c \mid \mathbf{x}; \boldsymbol{\theta})=\frac{e^{a_{c}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}.$$
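
A minimal sketch of the multiclass prediction rule; the matrix `W`, bias `b`, and input `x` are hypothetical values, and the softmax here is the naive version (the numerically stable variant appears in the next section):

```python
import numpy as np

def softmax(a):
    """Naive softmax: map a length-C score vector to a probability vector."""
    e = np.exp(a)
    return e / e.sum()

# Hypothetical parameters: C = 3 classes, D = 2 features.
W = np.array([[ 1.0, -0.5],
              [ 0.3,  0.8],
              [-0.7,  0.2]])
b = np.array([0.1, 0.0, -0.1])

x = np.array([0.5, 1.5])
a = W @ x + b              # class scores (logits)
probs = softmax(a)         # Cat(y | S(Wx + b))
print(probs, probs.sum())  # the probabilities sum to 1
print(probs.argmax())      # the most probable class
```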

Log-sum-exp trick

Consider $\frac{e^{a_{c}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}$. If we compute the numerator and denominator directly, then when the $a_c$ are very large or very small the computation overflows to Inf or underflows to 0 (finite floating-point precision), so we need to transform the values into a range the computer can handle.

We use the identity
$$\log \sum_{c=1}^{C} \exp(a_{c})=m+\log \sum_{c=1}^{C} \exp(a_{c}-m),$$
and set $m=\max_{c} a_{c}$, $c=1,2,\ldots,C$.

Then
$$p_c=\frac{e^{a_{c}}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}}}=\frac{e^{a_{c}-m}}{\sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}}=\exp\left(\log e^{a_{c}-m}-\log \sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}\right),$$
and the two terms inside the exponential can each be computed safely.

$$\log p_c=\log e^{a_{c}-m}-\log \sum_{c^{\prime}=1}^{C} e^{a_{c^{\prime}}-m}$$
(This is the key identity.)
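
A sketch of the trick in numpy (scipy also provides scipy.special.logsumexp); the logits are deliberately huge so the naive computation overflows while the stabilized version does not:

```python
import numpy as np
from scipy.special import logsumexp

a = np.array([1000.0, 1001.0, 1002.0])  # large logits: exp(a) overflows to inf

def log_softmax(a):
    """log p_c = (a_c - m) - log sum_c' exp(a_c' - m), with m = max_c a_c."""
    m = a.max()
    return (a - m) - np.log(np.sum(np.exp(a - m)))

print(np.exp(a) / np.exp(a).sum())  # naive: [nan nan nan] (inf / inf)
print(np.exp(log_softmax(a)))       # stable: approx [0.090, 0.245, 0.665]
print(np.exp(a - logsumexp(a)))     # same result using scipy
```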

Continuous

Gaussian distribution

The pdf of the Gaussian is given by
$$\mathcal{N}(y \mid \mu, \sigma^{2}) \triangleq \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{1}{2 \sigma^{2}}(y-\mu)^{2}}$$

(This is very familiar, so we keep the introduction brief.)

  • Why is the Gaussian distribution so widely used?

    • it has two parameters which are easy to interpret, and which capture some of the most basic properties of a distribution, namely its mean and variance.

      Interpretable parameters.

    • the central limit theorem tells us that sums of independent random variables have an approximately Gaussian distribution, making it a good choice for modeling residual errors or “noise”.

      By the central limit theorem, sums of independent random variables are approximately Gaussian, so it models residual errors well.

    • the Gaussian distribution makes the least number of assumptions (has maximum entropy), subject to the constraint of having a specified mean and variance; this makes it a good default choice in many cases.

      Given a specified mean and finite variance, the maximum-entropy family of distributions is the Gaussian family.

    • it has a simple mathematical form, which results in methods that are easy to implement, but often highly effective.

      Easy to implement.

Beta distribution: often used to model probabilities (quantities in [0,1])

The beta distribution has support over the interval $[0,1]$ and is defined as follows:
$$\operatorname{Beta}(x \mid a, b)=\frac{1}{B(a, b)} x^{a-1}(1-x)^{b-1}$$
where $B(a, b)$ is the beta function, defined by
$$B(a, b) \triangleq \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)}$$
where $\Gamma(a)$ is the gamma function, defined by
$$\Gamma(a) \triangleq \int_{0}^{\infty} x^{a-1} e^{-x}\, dx$$

Gamma distribution: often used to model non-negative data

The gamma distribution is a flexible distribution for positive real-valued rv's, $x>0$. It is defined in terms of two parameters, called the shape $a>0$ and the rate $b>0$:
$$\mathrm{Ga}(x \mid \text{shape}=a, \text{rate}=b) \triangleq \frac{b^{a}}{\Gamma(a)} x^{a-1} e^{-xb}$$
Note: the Gamma distribution has several different parameterizations.
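
A small sketch evaluating the beta and gamma densities with scipy.stats; note that scipy parameterizes the gamma by shape and scale, so rate $b$ corresponds to scale $=1/b$ (the parameter values are arbitrary):

```python
import numpy as np
from scipy import stats

# Beta(a, b): support on [0, 1].
a_beta, b_beta = 2.0, 5.0
x = np.array([0.1, 0.5, 0.9])
print(stats.beta.pdf(x, a_beta, b_beta))

# Ga(shape=a, rate=b): scipy uses shape a and scale = 1/b.
a_gam, b_gam = 3.0, 2.0
y = np.array([0.5, 1.0, 2.0])
print(stats.gamma.pdf(y, a_gam, scale=1.0 / b_gam))
```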

Multivariate Gaussian (normal) distribution

The multivariate Gaussian (normal) distribution is defined as:
$$\mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \triangleq \frac{1}{(2 \pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1}(\mathbf{y}-\boldsymbol{\mu})\right]$$
where $\boldsymbol{\mu}=\mathbb{E}[\mathbf{y}] \in \mathbb{R}^{D}$ is the mean vector, and $\boldsymbol{\Sigma}=\operatorname{Cov}[\mathbf{y}]$ is the $D \times D$ covariance matrix, defined as follows:
$$\operatorname{Cov}[\mathbf{y}] \triangleq \mathbb{E}\left[(\mathbf{y}-\mathbb{E}[\mathbf{y}])(\mathbf{y}-\mathbb{E}[\mathbf{y}])^{\top}\right]
=\begin{pmatrix}
\mathbb{V}[Y_{1}] & \operatorname{Cov}[Y_{1}, Y_{2}] & \cdots & \operatorname{Cov}[Y_{1}, Y_{D}] \\
\operatorname{Cov}[Y_{2}, Y_{1}] & \mathbb{V}[Y_{2}] & \cdots & \operatorname{Cov}[Y_{2}, Y_{D}] \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{Cov}[Y_{D}, Y_{1}] & \operatorname{Cov}[Y_{D}, Y_{2}] & \cdots & \mathbb{V}[Y_{D}]
\end{pmatrix}$$
where
$$\operatorname{Cov}[Y_{i}, Y_{j}] \triangleq \mathbb{E}\left[(Y_{i}-\mathbb{E}[Y_{i}])(Y_{j}-\mathbb{E}[Y_{j}])\right]=\mathbb{E}[Y_{i} Y_{j}]-\mathbb{E}[Y_{i}]\, \mathbb{E}[Y_{j}]$$
and $\mathbb{V}[Y_{i}]=\operatorname{Cov}[Y_{i}, Y_{i}]$.

  • Important property: the marginal and conditional distributions of a multivariate Gaussian are still Gaussian.
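
A short numpy sketch that draws samples from a bivariate Gaussian with a hypothetical mean vector and covariance matrix, then checks the sample mean and covariance against the true parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])          # hypothetical mean vector (D = 2)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])      # hypothetical covariance matrix

Y = rng.multivariate_normal(mu, Sigma, size=100_000)  # (N, D) samples

print(Y.mean(axis=0))           # close to mu
print(np.cov(Y, rowvar=False))  # close to Sigma
```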

Mixture model

We create a mixture model by taking a convex combination of simple distributions.

This has the form
$$p(\mathbf{y} \mid \boldsymbol{\theta})=\sum_{k=1}^{K} \pi_{k}\, p_{k}(\mathbf{y})$$
where $p_{k}$ is the $k$'th mixture component, and $\pi_{k}$ are the mixture weights, which satisfy $0 \leq \pi_{k} \leq 1$ and $\sum_{k=1}^{K} \pi_{k}=1$.

We introduce the discrete latent variable $z \in\{1, \ldots, K\}$, which specifies which distribution to use for generating the output $\mathbf{y}$. The latent variable $z$ indicates which component an observation comes from, which makes the model easier to interpret and to do inference on.

The prior on this latent variable is $p(z=k)=\pi_{k}$, and the conditional is $p(\mathbf{y} \mid z=k)=p_{k}(\mathbf{y})=p(\mathbf{y} \mid \boldsymbol{\theta}_{k})$.

That is, we define the following joint model:
$$\begin{aligned} p(z \mid \boldsymbol{\theta}) &=\operatorname{Cat}(z \mid \boldsymbol{\pi}) \\ p(\mathbf{y} \mid z=k, \boldsymbol{\theta}) &=p(\mathbf{y} \mid \boldsymbol{\theta}_{k}) \end{aligned}$$

The “generative story” for the data is that we first generate $z$ (the label), and then we generate the observations $\mathbf{y}$ using the parameters chosen according to the value of $z$. Marginalizing out $z$ gives
$$p(\mathbf{y} \mid \boldsymbol{\theta})=\sum_{k=1}^{K} p(z=k \mid \boldsymbol{\theta})\, p(\mathbf{y} \mid z=k, \boldsymbol{\theta})=\sum_{k=1}^{K} \pi_{k}\, p(\mathbf{y} \mid \boldsymbol{\theta}_{k})$$
We can create different kinds of mixture models by varying the base distributions $p_{k}$.
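
The generative story translates directly into code. This sketch samples from a hypothetical two-component mixture of 1-D Gaussians: first draw $z \sim \operatorname{Cat}(\boldsymbol{\pi})$, then draw $y$ from the chosen component (all parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

pi = np.array([0.3, 0.7])      # mixture weights (hypothetical)
mus = np.array([-2.0, 3.0])    # component means
sigmas = np.array([0.5, 1.0])  # component standard deviations

def sample_mixture(n):
    z = rng.choice(len(pi), size=n, p=pi)  # z ~ Cat(pi)
    y = rng.normal(mus[z], sigmas[z])      # y | z = k ~ N(mu_k, sigma_k^2)
    return z, y

z, y = sample_mixture(5)
print(z)  # component labels
print(y)  # observations generated from the chosen components
```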

Gaussian mixture model (GMM)

$$p(\mathbf{y} \mid \boldsymbol{\theta})=\sum_{k=1}^{K} \pi_{k}\, \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

It is often used for clustering.

Note: $\mathbf{y}$ here is not a label, but a vector of features (the equivalent of the covariates in a regression model).

Data: $\mathbf{y}$ (features)

Objective: infer the parameters $(\pi_k,\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$, $k=1,2,\ldots,K$ (3K groups of parameters); we estimate the parameters and then use them to draw inferences about new data.
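
As a sketch of the inference step, we can fit a GMM to synthetic feature data with scikit-learn's GaussianMixture, which estimates the parameters by EM; the data and the choice $K=2$ are made up for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic 2-D features drawn from two well-separated Gaussian clusters.
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], np.eye(2), size=200),
    rng.multivariate_normal([5.0, 5.0], np.eye(2), size=300),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.weights_)        # estimates of pi_k
print(gmm.means_)          # estimates of mu_k
print(gmm.covariances_)    # estimates of Sigma_k
print(gmm.predict(X[:5]))  # inferred component labels z for some points
```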
