Open Notes: Machine Learning Fundamentals (7: DM vs. GM, EM, HMM)

by Max Z. C. Li (843995168@qq.com)

based on the lecture notes of Prof. Kai-Wei Chang, UCLA 2018 Winter CM 146 Intro. to M.L., with my marks and comments (//, ==>, words, etc.)
all graphs/pictures are from the lecture notes; I disavow the background ownership watermarks auto-added by CSDN.

original acknowledgment: "The instructor gratefully acknowledges Eric Eaton (UPenn), who assembled the original slides, Jessica Wu (Harvey Mudd), David Kauchak (Pomona), Dan Roth (Upenn), Sriram Sankararaman (UCLA), whose slides are also heavily used, and the many others who made their course materials freely available online."

 

Discriminative Model and Generative Model

Discriminative Model

 

Generative Model

===> 1. by MAP estimation, we are actually learning the likelihood and the prior

===> 2. at prediction time, we maximize the posterior by argmax over the labels ==> MAP prediction ==> we use both Bayes' rule and the NB assumption (conditional independence of the features)

for NB, we first learn P(Y) and P(X|Y)

then to predict Y, we use MAP prediction ===> maximize P(Y|X) by:
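in standard notation, the NB MAP decision rule is:

y* = argmax_y P(y | x) = argmax_y P(y) * ∏_j P(x_j | y)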

 

DM vs. GM

notice that only the DM directly models the decision boundary; the GM models the full distribution of the data

 

MLE for UL: Expectation-Maximization (EM)(17)

DM or GM

In unsupervised learning, we only observe the input distribution P(X)

GM is obviously more suitable for UL:

 

EM Algorithm

E.g. Coin Toss

==> linearly inseparable ==> try logistic regression, KSVM, NB

 

Basic Idea

step 1 initialization

===> guess the (label) value of each toss: coin 0 rather than coin 1

===> the guess, as shown, can be very crude

 

step 2 Maximum Conditional Likelihood

==> MCL = P(h | D, guess) = P(y | x, theta) ==> the likelihood of the hypothesis given the data and the guesses

 

step 3 Likelihood Estimation

or

===> again, obviously, we meant to guess coin 0, not coin 1

===> now we have a better guess of the coin 0 tosses ===> proceed to step 2 again:

==> repeat the calculation of the parameters by Maximum Conditional Likelihood.
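Putting steps 1-3 together, here is a minimal two-coin EM sketch in Python (the batch data, initial biases, and iteration count are invented for illustration; the mixing weights are held uniform):

```python
import numpy as np

# each entry: number of heads in a batch of 10 tosses from an unknown coin (invented data)
heads = np.array([5, 9, 8, 4, 7])
n = 10

theta = np.array([0.6, 0.5])  # initial guesses for P(heads) of coin 0 and coin 1

for _ in range(20):
    # E-step: responsibility of each coin for each batch
    # (the binomial coefficient cancels in the ratio, so it is omitted)
    like = np.array([th**heads * (1 - th)**(n - heads) for th in theta])  # 2 x 5
    resp = like / like.sum(axis=0)
    # M-step: MLE of each coin's bias, weighted by the responsibilities
    theta = (resp @ heads) / (resp.sum(axis=1) * n)

print(theta)  # the two biases separate toward the two clusters of batches
```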

 

Compare to GMM

recall GMM marginal distribution of x:
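in standard notation (a weighted sum over the K Gaussian components):

p(x) = Σ_k π_k · N(x | μ_k, Σ_k)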

the algorithm:

====> GMM is probabilistic learning ==> we start by guessing the parameters and compute the labels ===> then we use the labels to update the distribution parameters by MLE.

====> the updating/converging mechanism is similar to EM ===> though for EM we start by guessing the labels and learn by MLE to update the parameters (hypothesis)

====> for SL, GMM makes predictions by comparing the marginal prob. of x under each group and choosing the most likely one ==> for UL, each component is a cluster

====> for SL we can directly estimate the parameters of GMM by MLE (probabilistic learning)

where:

and 

===> use Bayes' theorem and MLE or MAP estimation (pick the most likely label as the prediction) ===> we use MLE for updating the parameters (the distribution model)
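A minimal 1-D GMM EM sketch in Python (the data points and initial parameters are invented for illustration):

```python
import numpy as np

def gauss(x, m, s):
    # 1-D Gaussian density
    return np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

x = np.array([1.2, 0.8, 5.1, 4.9, 1.0, 5.3])  # invented data with two clusters

pi  = np.array([0.5, 0.5])  # mixing weights
mu  = np.array([0.0, 4.0])  # component means
sig = np.array([1.0, 1.0])  # component std devs

for _ in range(50):
    # E-step: Bayes' rule gives the responsibility of component k for point i
    dens = pi[:, None] * gauss(x[None, :], mu[:, None], sig[:, None])  # 2 x N
    resp = dens / dens.sum(axis=0)
    # M-step: responsibility-weighted MLE updates of the parameters
    Nk  = resp.sum(axis=1)
    pi  = Nk / len(x)
    mu  = (resp @ x) / Nk
    sig = np.sqrt((resp * (x[None, :] - mu[:, None])**2).sum(axis=1) / Nk)

print(mu)  # the means converge toward the two cluster centers
```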

 

Hidden Markov Model (17,18)

Sequence Model

Sequence

we can model it by:

we can simply drop the history tail by using the Markov assumption:
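in standard notation, the chain rule and its first-order Markov truncation:

P(x_1, ..., x_n) = ∏_t P(x_t | x_1, ..., x_{t-1})  ===>  P(x_1, ..., x_n) ≈ ∏_t P(x_t | x_{t-1})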

 

Discrete Markov Model

e.g.

===> the parameters correspond to state transition prob.

===> we need O(K**2) parameters in total, but we only use O(K) of them at a time.
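A tiny illustration in Python, with an invented K = 3 transition matrix:

```python
import numpy as np

K = 3
# invented transition matrix: A[i, j] = P(next state j | current state i)
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])  # K*K entries ==> O(K**2) parameters in total

rng = np.random.default_rng(0)
state = 0
for _ in range(5):
    # given the current state, only one row of A (O(K) parameters) is used
    state = rng.choice(K, p=A[state])
    print(state)
```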

e.g.

the Discrete (1st order) Markov Model can generalize to:

Mth Order Markov Model

 

DMM vs. HMM

HMM e.g.

 

Joint Model over States and Observations

===> emission prob: P(emission | state)
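in standard notation, the joint over states s_1..s_n and observations x_1..x_n factorizes into transition and emission terms:

P(s_1..s_n, x_1..x_n) = ∏_t P(s_t | s_{t-1}) · P(x_t | s_t)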

e.g.

You are a climatologist in the year 2799, studying global warming. You can't find any records of the weather in Baltimore, MD for the summer of 2007, but you find a diary which lists how many ice creams J ate every day that summer.
Our job: figure out how hot it was

==> obviously, H/C are states, and #cones are emissions

 

Inference for HMM

 

Most Likely State Sequence

a possible variant:

we can try to get the result with:

Now if we introduce the 1st-order DMM assumption about history:
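in standard notation, decoding and its first-order factorization:

s* = argmax_{s_1..s_n} P(s_1..s_n | x_1..x_n) = argmax_{s_1..s_n} ∏_t P(s_t | s_{t-1}) · P(x_t | s_t)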

==> a recursive pattern, not a recursive implementation ==> perhaps best dealt with by dynamic programming.

 

Viterbi Algorithm

===> it's O(nK**2) runtime and O(nK) space; since we remember the best score for every state at each round, we can guarantee the best result

===> K**2 because we look at all pairs of states to find the max at each step

===> we keep the backpointers for all n rounds and K states to recover the path, hence O(nK) storage
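A minimal Viterbi sketch in Python, in log space (the matrix layout is an assumption: log_trans[i, j] = log P(state j | state i), log_emit[k, t] = log P(x_t | state k)):

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    # log_init: (K,), log_trans: (K, K), log_emit: (K, n) for one observation sequence
    K, n = log_emit.shape
    score = np.zeros((n, K))             # best log-prob of a path ending in state k at time t
    back  = np.zeros((n, K), dtype=int)  # backpointers ==> the O(nK) storage
    score[0] = log_init + log_emit[:, 0]
    for t in range(1, n):
        # O(K**2) work per step: every (previous state, current state) pair
        cand = score[t - 1][:, None] + log_trans + log_emit[:, t][None, :]
        back[t]  = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    # recover the best sequence by following the backpointers
    path = [int(score[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```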

e.g.

==> so the most likely tag sequence is CCC

 

SL by HMM

Learning the HMM Parameters

e.g.

===> resembles GMM SL 

===> the distribution given by the empirical counts is the most likely one to have generated the data, and so is our best guess at the true distribution

MLE counting e.g.
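A minimal sketch of MLE-by-counting in Python (the tiny tagged corpus is invented, echoing the weather/ice-cream example):

```python
from collections import Counter

# invented (state, observation) sequences: H/C weather states, #cones observed
data = [[("H", 3), ("H", 2), ("C", 1)],
        [("C", 1), ("H", 3)]]

trans, emit = Counter(), Counter()
for seq in data:
    for (s, x) in seq:
        emit[(s, x)] += 1          # emission counts, normalized the same way as below
    for (s_prev, _), (s_next, _) in zip(seq, seq[1:]):
        trans[(s_prev, s_next)] += 1

# MLE is just normalized counts: P(s_next | s_prev) = count(s_prev, s_next) / count(s_prev, *)
def p_trans(s_prev, s_next):
    total = sum(c for (p, _), c in trans.items() if p == s_prev)
    return trans[(s_prev, s_next)] / total

print(p_trans("H", "C"))  # 0.5 on the toy data above
```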

 

Prior and Smoothing

e.g. 

recall the spam example 

P(w) = 1/|Vocabulary| ===> a uniform prior over words, used for smoothing
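in standard notation, add-one (Laplace) smoothing of an estimate such as P(w | y) folds in that uniform prior:

P(w | y) = (count(w, y) + 1) / (count(y) + |Vocabulary|)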

 

UL for HMM 

===> use EM (the Baum-Welch algorithm): start by guessing the tags ==> update the parameters ==> update the tags ==> repeat until convergence
