by Max Z. C. Li (843995168@qq.com)
based on the lecture notes of Prof. Kai-Wei Chang, UCLA, Winter 2018, CM146 Intro. to Machine Learning, with my marks and comments (//, ==>, words, etc.)
all graphs/pictures are from the lecture notes; I disavow the background ownership watermarks auto-added by csdn.
original acknowledgment: "The instructor gratefully acknowledges Eric Eaton (UPenn), who assembled the original slides, Jessica Wu (Harvey Mudd), David Kauchak (Pomona), Dan Roth (Upenn), Sriram Sankararaman (UCLA), whose slides are also heavily used, and the many others who made their course materials freely available online."
Discriminative Model and Generative Model
Discriminative Model
Generative Model
===>1. by MAP estimation, we are actually learning the correct likelihood and prior
===>2. at prediction time we maximize the posterior via argmax over labels ==> MAP prediction ==> this uses both Bayes' rule and the NB assumption (conditional independence of the features given the label)
for NB, we first learn P(y) and P(X|Y)
then to predict Y, we use MAP prediction ===> maximize P(Y|X) by:
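The MAP prediction rule for NB can be sketched in a few lines. A minimal sketch with hypothetical numbers (2 classes, 3 binary features; the priors and feature probabilities are made up for illustration):

```python
import numpy as np

# Hypothetical learned parameters: P(y) and P(x_j = 1 | y), one row per class
log_prior = np.log(np.array([0.6, 0.4]))
p_feat = np.array([[0.2, 0.7, 0.5],
                   [0.8, 0.3, 0.5]])

def nb_predict(x):
    """MAP prediction: argmax_y [ log P(y) + sum_j log P(x_j | y) ]."""
    log_lik = (x * np.log(p_feat) + (1 - x) * np.log(1 - p_feat)).sum(axis=1)
    return int(np.argmax(log_prior + log_lik))

print(nb_predict(np.array([1, 0, 1])))  # → 1
```

Working in log space avoids underflow when multiplying many small per-feature probabilities.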
DM vs. GM
notice that only the DM directly models decision boundaries (the GM induces them indirectly through P(Y|X))
MLE for UL: Expectation-Maximization (EM)(17)
DM or GM
In unsupervised learning, we only observe the input distribution P(X)
GM is the natural choice for UL, since it models P(X) directly:
EM Algorithm
E.g. Coin Toss
==> linearly inseparable ==> try logistic regression, kernel SVM, or NB
Basic Idea
step 1 initialization
===> guess the (label) values of the tosses of coin 0 rather than coin 1
===> the initial guess, as shown, can be very crude
step 2 Maximum Conditional Likelihood
==> MCL = P(h | D, guess) = P(y | x, theta) ==> the likelihood of the distribution given the data and the guesses
step 3 Likelihood Estimation
or
===> again, obviously we meant coin 0, not coin 1
===> now we have a better guess of the coin-0 tosses ===> proceed to step 2 again:
==> repeat the calculation of the parameters by Maximum Conditional Likelihood.
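The loop above (guess assignments, re-estimate parameters, repeat) can be sketched for a two-coin problem. A minimal sketch with hypothetical data (5 trials of 10 tosses each; the soft E-step assigns each trial a posterior responsibility for each coin instead of a hard guess):

```python
import numpy as np

# Hypothetical data: heads count in each of 5 trials of n = 10 tosses
heads = np.array([5, 9, 8, 4, 7])
n = 10
theta = np.array([0.4, 0.6])  # crude initial guess of P(heads) for each coin

for _ in range(20):
    # E-step: posterior P(coin | trial) under current theta (uniform prior on coins)
    log_lik = heads[:, None] * np.log(theta) + (n - heads)[:, None] * np.log(1 - theta)
    resp = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: MLE of each coin's bias, weighted by the responsibilities
    theta = (resp * heads[:, None]).sum(axis=0) / (resp.sum(axis=0) * n)

print(theta)  # the two biases separate toward the low- and high-heads trials
```

The hard-assignment variant in the slides replaces `resp` with a 0/1 guess; the soft version converges to the same kind of fixed point.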
Compare to GMM
recall GMM marginal distribution of x:
the algorithm:
====> GMM is probabilistic learning ==> we start by guessing the parameters and compute the (soft) labels ===> then we use the labels to re-estimate the distribution by MLE.
====> the updating/converging mechanism is similar to EM ===> though for the EM above we start by guessing the labels, and learn by MLE to update the parameters (hypothesis)
====> for SL, GMM makes predictions by comparing the marginal prob. of x under each component and choosing the most likely one. ==> for UL, each component is a cluster
====> for SL we can directly estimate the parameters of the GMM by MLE (probabilistic learning), since the labels are observed
where:
and
===> use Bayes' theorem and MLE or MAP estimation (pick the most likely label as the prediction) ===> we use MLE to update the parameters (the distribution model)
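The parameter-guess / label-compute loop for a GMM can be sketched in 1-D. A minimal sketch with hypothetical synthetic data (two components; all initial values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 1-D data drawn from two well-separated groups
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])

# Initial guesses for mixing weights, means, variances
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities P(z = k | x) by Bayes' rule
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: MLE of the parameters, weighted by responsibilities
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(mu)  # means converge near the two true group centers
```

For UL, `resp.argmax(axis=1)` gives the cluster assignment of each point; for SL, the M-step runs once with the observed labels in place of `resp`.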
Hidden Markov Model (17,18)
Sequence Model
Sequence
we can model it by:
we can simply drop the history tail by using the:
Discrete Markov Model
e.g.
===> the parameters correspond to state-transition probabilities
===> we need O(K^2) parameters in total, but we only use O(K) of them at a time
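The parameter-count remark can be made concrete: the model is a K x K transition matrix, but predicting the next state from the current one only reads a single row. A minimal sketch with a hypothetical 3-state chain:

```python
import numpy as np

K = 3  # hypothetical number of states
# T[i, j] = P(next = j | current = i); O(K^2) parameters in total
T = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])

def next_state_dist(current):
    """Predicting from state `current` touches only one row: O(K) parameters."""
    return T[current]

print(next_state_dist(1))  # → [0.3 0.4 0.3]
```

An Mth-order model would instead need a row for every length-M history, i.e. O(K^(M+1)) parameters, which is why low orders are preferred.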
e.g.
the Discrete (1st order) Markov Model can generalize to:
Mth Order Markov Model
DMM vs. HMM
HMM e.g.
Joint Model over States and Observations
===> emission prob: P(emission | state)
e.g.
"You are a climatologist in the year 2799, studying global warming. You can't find any records of the weather in Baltimore, MD for the summer of 2007, but you find a diary which lists how many ice creams J ate every day that summer."
Our job: figure out how hot it was
==> obviously, H/C are states, and #cones are emissions
Inference for HMM
Most Likely State Sequence
a possible variant:
we can try to get the result with:
Now if we introduce the 1st-order DMM assumption about history:
==> a recursive pattern, not a recursive implementation ==> best dealt with by dynamic programming.
Viterbi Algorithm
===> it's O(nK^2) runtime and O(nK) space; since we remember the best score for each state at each round, we can guarantee the optimal sequence
===> K^2 because we look at all pairs of states to find the max at each step
===> we need full access to the scores of all states in the previous round (and the back-pointers of every round for recovery), hence O(nK) storage
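The recurrence and complexity notes above can be sketched directly. A minimal sketch, with a hypothetical 2-state "sticky" HMM as the usage example (all probabilities are made up):

```python
import numpy as np

def viterbi(obs, log_init, log_trans, log_emit):
    """Most likely state sequence: O(n*K^2) time, O(n*K) space."""
    n, K = len(obs), len(log_init)
    score = np.empty((n, K))             # best log-score of any path ending in each state
    back = np.zeros((n, K), dtype=int)   # back-pointers for path recovery
    score[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, n):
        # cand[i, j]: best path into state j at time t via state i (the K*K max)
        cand = score[t - 1][:, None] + log_trans + log_emit[:, obs[t]]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    # Backtrack from the best final state
    path = [int(score[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical HMM: sticky transitions, near-deterministic emissions
log = np.log
init = log(np.array([0.5, 0.5]))
trans = log(np.array([[0.8, 0.2], [0.2, 0.8]]))  # trans[i, j] = P(j | i)
emit = log(np.array([[0.9, 0.1], [0.1, 0.9]]))   # emit[s, o] = P(o | s)
print(viterbi([0, 0, 1, 1], init, trans, emit))  # → [0, 0, 1, 1]
```

Log probabilities turn the product over time steps into a sum, avoiding underflow on long sequences.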
e.g.
==> so the most likely tag sequence is CCC
SL by HMM
Learning the HMM Parameters
e.g.
===> resembles GMM SL
===> the distribution estimated from counts is the one Most Likely to yield the observed data, and hence our best guess at the true distribution
MLE counting e.g.
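MLE-by-counting for supervised HMM learning is just relative frequencies over the tagged data. A minimal sketch with a hypothetical two-sentence tagged corpus (words, tags, and the simplified transition denominator are all illustrative):

```python
from collections import Counter

# Hypothetical tagged corpus: lists of (word, tag) pairs
data = [[("the", "D"), ("dog", "N"), ("runs", "V")],
        [("the", "D"), ("cat", "N"), ("runs", "V")]]

tag_count, emit_count, trans_count = Counter(), Counter(), Counter()
for seq in data:
    for i, (word, tag) in enumerate(seq):
        tag_count[tag] += 1
        emit_count[(tag, word)] += 1
        if i > 0:
            trans_count[(seq[i - 1][1], tag)] += 1

def p_emit(tag, word):
    # MLE: count(tag emits word) / count(tag)
    return emit_count[(tag, word)] / tag_count[tag]

def p_trans(prev, tag):
    # MLE: count(prev -> tag) / count(prev); denominator simplified to total tag count
    return trans_count[(prev, tag)] / tag_count[prev]

print(p_emit("N", "dog"), p_trans("D", "N"))  # → 0.5 1.0
```

This mirrors the GMM supervised case: with labels observed, no EM loop is needed, just counts.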
Prior and Smoothing
e.g.
recall the spam example
P(w) = 1/|Vocabulary| ===> prior used for smoothing
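The uniform prior over the vocabulary amounts to add-one (Laplace) smoothing of the counts. A minimal sketch with hypothetical word counts for one class:

```python
from collections import Counter

# Hypothetical word counts for one class (e.g. spam)
counts = Counter({"free": 3, "money": 2})
vocab = ["free", "money", "hello", "meeting"]  # |Vocabulary| = 4
total = sum(counts.values())

def p_word(w, alpha=1.0):
    # Add-one smoothing: unseen words get nonzero probability instead of 0
    return (counts[w] + alpha) / (total + alpha * len(vocab))

print(p_word("hello"))  # → 1/9, not 0
```

Without smoothing, a single unseen word would zero out the whole product P(X|Y) at prediction time.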
UL for HMM
===> use EM: start by guessing the tags ==> update the parameters ==> update the tags ==> repeat until convergence