by Max Z. C. Li (843995168@qq.com)
based on the lecture notes of Prof. Kai-Wei Chang, UCLA, Winter 2018, CM146 Intro. to Machine Learning, with my marks and comments (//, ==>, words, etc.)
all graphs/pictures are from the lecture notes; I disavow the background ownership watermarks auto-added by csdn.
original acknowledgment: "The instructor gratefully acknowledges Eric Eaton (UPenn), who assembled the original slides, Jessica Wu (Harvey Mudd), David Kauchak (Pomona), Dan Roth (Upenn), Sriram Sankararaman (UCLA), whose slides are also heavily used, and the many others who made their course materials freely available online."
Discriminative Model and Generative Model
Discriminative Model
Generative Model
===>1. by MAP estimation, we are actually learning the correct likelihood and prior
===>2. at prediction time we maximize the posterior via argmax over labels ==> MAP prediction ==> this uses both Bayes' rule and the NB assumption (conditional independence of the features given the label)
for NB, we first learn P(y) and P(X|Y)
then to predict Y, we use MAP prediction ===> maximize P(Y|X) by:
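The MAP prediction rule for NB can be sketched in a few lines. A minimal sketch with hypothetical numbers (2 classes, 3 binary features; the priors and feature probabilities are made up for illustration):

```python
import numpy as np

# Hypothetical learned parameters: P(y) and P(x_j = 1 | y), one row per class
log_prior = np.log(np.array([0.6, 0.4]))
p_feat = np.array([[0.2, 0.7, 0.5],
                   [0.8, 0.3, 0.5]])

def nb_predict(x):
    """MAP prediction: argmax_y [ log P(y) + sum_j log P(x_j | y) ]."""
    log_lik = (x * np.log(p_feat) + (1 - x) * np.log(1 - p_feat)).sum(axis=1)
    return int(np.argmax(log_prior + log_lik))

print(nb_predict(np.array([1, 0, 1])))  # → 1
```

Working in log space avoids underflow when multiplying many small per-feature probabilities.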
DM vs. GM
notice that only the DM directly models decision boundaries (the GM induces them indirectly through P(Y|X))
MLE for UL: Expectation-Maximization (EM)(17)
DM or GM
In unsupervised learning, we only observe the input distribution P(X)
GM is the natural choice for UL, since it models P(X) directly:
EM Algorithm
E.g. Coin Toss
==> linearly inseparable ==> try logistic regression, kernel SVM, or NB
Basic Idea
step 1 initialization
===> guess the (label) values of the tosses of coin 0 rather than coin 1
===> the initial guess, as shown, can be very crude
step 2 Maximum Conditional Likelihood
==> MCL = P(h | D, guess) = P(y | x, theta) ==> the likelihood of the distribution given the data and the guesses
step 3 Likelihood Estimation
or
===> again, obviously we meant coin 0, not coin 1
===> now we have a better guess of the coin-0 tosses ===> proceed to step 2 again:
==> repeat the calculation of the parameters by Maximum Conditional Likelihood.
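The loop above (guess assignments, re-estimate parameters, repeat) can be sketched for a two-coin problem. A minimal sketch with hypothetical data (5 trials of 10 tosses each; the soft E-step assigns each trial a posterior responsibility for each coin instead of a hard guess):

```python
import numpy as np

# Hypothetical data: heads count in each of 5 trials of n = 10 tosses
heads = np.array([5, 9, 8, 4, 7])
n = 10
theta = np.array([0.4, 0.6])  # crude initial guess of P(heads) for each coin

for _ in range(20):
    # E-step: posterior P(coin | trial) under current theta (uniform prior on coins)
    log_lik = heads[:, None] * np.log(theta) + (n - heads)[:, None] * np.log(1 - theta)
    resp = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: MLE of each coin's bias, weighted by the responsibilities
    theta = (resp * heads[:, None]).sum(axis=0) / (resp.sum(axis=0) * n)

print(theta)  # the two biases separate toward the low- and high-heads trials
```

The hard-assignment variant in the slides replaces `resp` with a 0/1 guess; the soft version converges to the same kind of fixed point.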
Compare to GMM
recall GMM marginal distribution of x:
the algorithm:
====> GMM is probabilistic learning ==> we start by guessing the parameters and compute the (soft) labels ===> then we use the labels to re-estimate the distribution by MLE.
====> the updating/converging mechanism is similar to EM ===> though for the EM above we start by guessing the labels, and learn by MLE to update the parameters (hypothesis)
====> for SL, GMM makes predictions by comparing the marginal prob. of x under each component and choosing the most likely one. ==> for UL, each component is a cluster
====> for SL we can directly estimate the parameters of the GMM by MLE (probabilistic learning), since the labels are observed
where:
and
===> use Bayes' theorem and MLE or MAP estimation (pick the most likely label as the prediction) ===> we use MLE to update the parameters (the distribution model)
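The parameter-guess / label-compute loop for a GMM can be sketched in 1-D. A minimal sketch with hypothetical synthetic data (two components; all initial values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 1-D data drawn from two well-separated groups
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])

# Initial guesses for mixing weights, means, variances
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities P(z = k | x) by Bayes' rule
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: MLE of the parameters, weighted by responsibilities
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(mu)  # means converge near the two true group centers
```

For UL, `resp.argmax(axis=1)` gives the cluster assignment of each point; for SL, the M-step runs once with the observed labels in place of `resp`.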
Hidden Markov Model (17,18)
Sequence Model
Sequence
we can model it by:
we can simply drop the history tail by using the:
Discrete Markov Model
e.g.
===> the parameters correspond to state-transition probabilities
===> we need O(K^2) parameters in total, but we only use O(K) of them at a time
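The parameter-count remark can be made concrete: the model is a K x K transition matrix, but predicting the next state from the current one only reads a single row. A minimal sketch with a hypothetical 3-state chain:

```python
import numpy as np

K = 3  # hypothetical number of states
# T[i, j] = P(next = j | current = i); O(K^2) parameters in total
T = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])

def next_state_dist(current):
    """Predicting from state `current` touches only one row: O(K) parameters."""
    return T[current]

print(next_state_dist(1))  # → [0.3 0.4 0.3]
```

An Mth-order model would instead need a row for every length-M history, i.e. O(K^(M+1)) parameters, which is why low orders are preferred.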
e.g.
the Discrete (1st order) Markov Model can generalize to:
Mth Order Markov Model
DMM vs. HMM
HMM e.g.
Joint Model over States and Observations
===> emission prob: P(emission | state)
e.g.
"You are a climatologist in the year 2799, studying global warming. You can't find any records of the weather in Baltimore, MD for the summer of 2007, but you find a diary which lists how many ice creams J ate every day that summer."
Our job: figure out how hot it was
==> obviously, H/C are states, and #cones are emissions
Inference for HMM
Most Likely State Sequence
a possible variant:
we can try to get the result with:
Now if we introduce the 1st-order DMM assumption about history:
==> a recursive pattern, not a recursive implementation ==> best dealt with by dynamic programming.
Viterbi Algorithm
===> it's O(nK^2) runtime and O(nK) space; since we remember the best score for each state at each round, we can guarantee the optimal sequence
===> K^2 because we look at all pairs of states to find the max at each step
===> we need full access to the scores of all states in the previous round (and the back-pointers of every round for recovery), hence O(nK) storage
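The recurrence and complexity notes above can be sketched directly. A minimal sketch, with a hypothetical 2-state "sticky" HMM as the usage example (all probabilities are made up):

```python
import numpy as np

def viterbi(obs, log_init, log_trans, log_emit):
    """Most likely state sequence: O(n*K^2) time, O(n*K) space."""
    n, K = len(obs), len(log_init)
    score = np.empty((n, K))             # best log-score of any path ending in each state
    back = np.zeros((n, K), dtype=int)   # back-pointers for path recovery
    score[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, n):
        # cand[i, j]: best path into state j at time t via state i (the K*K max)
        cand = score[t - 1][:, None] + log_trans + log_emit[:, obs[t]]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    # Backtrack from the best final state
    path = [int(score[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical HMM: sticky transitions, near-deterministic emissions
log = np.log
init = log(np.array([0.5, 0.5]))
trans = log(np.array([[0.8, 0.2], [0.2, 0.8]]))  # trans[i, j] = P(j | i)
emit = log(np.array([[0.9, 0.1], [0.1, 0.9]]))   # emit[s, o] = P(o | s)
print(viterbi([0, 0, 1, 1], init, trans, emit))  # → [0, 0, 1, 1]
```

Log probabilities turn the product over time steps into a sum, avoiding underflow on long sequences.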
e.g.
==> so the most likely tag sequence is CCC
SL by HMM
Learning the HMM Parameters
e.g.
===> resembles GMM SL
===> the distribution estimated from counts is the one Most Likely to yield the observed data, and hence our best guess at the true distribution
MLE counting e.g.
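MLE-by-counting for supervised HMM learning is just relative frequencies over the tagged data. A minimal sketch with a hypothetical two-sentence tagged corpus (words, tags, and the simplified transition denominator are all illustrative):

```python
from collections import Counter

# Hypothetical tagged corpus: lists of (word, tag) pairs
data = [[("the", "D"), ("dog", "N"), ("runs", "V")],
        [("the", "D"), ("cat", "N"), ("runs", "V")]]

tag_count, emit_count, trans_count = Counter(), Counter(), Counter()
for seq in data:
    for i, (word, tag) in enumerate(seq):
        tag_count[tag] += 1
        emit_count[(tag, word)] += 1
        if i > 0:
            trans_count[(seq[i - 1][1], tag)] += 1

def p_emit(tag, word):
    # MLE: count(tag emits word) / count(tag)
    return emit_count[(tag, word)] / tag_count[tag]

def p_trans(prev, tag):
    # MLE: count(prev -> tag) / count(prev); denominator simplified to total tag count
    return trans_count[(prev, tag)] / tag_count[prev]

print(p_emit("N", "dog"), p_trans("D", "N"))  # → 0.5 1.0
```

This mirrors the GMM supervised case: with labels observed, no EM loop is needed, just counts.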
Prior and Smoothing
e.g.
recall the spam example
P(w) = 1/|Vocabulary| ===> prior used for smoothing
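The uniform prior over the vocabulary amounts to add-one (Laplace) smoothing of the counts. A minimal sketch with hypothetical word counts for one class:

```python
from collections import Counter

# Hypothetical word counts for one class (e.g. spam)
counts = Counter({"free": 3, "money": 2})
vocab = ["free", "money", "hello", "meeting"]  # |Vocabulary| = 4
total = sum(counts.values())

def p_word(w, alpha=1.0):
    # Add-one smoothing: unseen words get nonzero probability instead of 0
    return (counts[w] + alpha) / (total + alpha * len(vocab))

print(p_word("hello"))  # → 1/9, not 0
```

Without smoothing, a single unseen word would zero out the whole product P(X|Y) at prediction time.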
UL for HMM
===> use EM: start by guessing the tags ==> update the parameters ==> update the tags ==> repeat until convergence