Predictive Analytics Notes 1 & 2

Course notes: Predictive Analytics, Spring 2021

Reference textbook: Murphy, K. P. (2021). Probabilistic Machine Learning: An Introduction. MIT Press.

In this class, we'll cover topics in machine learning from a probabilistic view.

We will also introduce some topics in statistical computing, such as EM, MCMC, variational inference, and some optimization algorithms.

Chapter 1 Introduction

  • Machine Learning:
    • supervised learning
    • unsupervised learning
    • reinforcement learning

Supervised Learning

  • The most common form

  • Data: $(X, y)$

    • $X$: feature (in machine learning); covariate or predictor (in statistics)

    • $y$: response, label

    • Task: learn a mapping $f$ from input $x\in\mathcal{X}$ to output $y\in\mathcal{Y}$, i.e. $y=f(x)$.

      The mapping $f$ is unknown; we want to learn it from data.

  • Supervised Learning

    (depends on the type of response $y$)

    • classification: discrete, unordered, and mutually exclusive labels.
    • regression: a real-valued quantity $y\in\mathbb{R}$ instead of a class label.

    The ideas are similar, except for the loss function.

In classification

The goal is to automatically come up with a classification model $\hat{f}$ (e.g. a decision tree, neural net, or logistic regression) that reliably predicts the labels $y=\hat{f}(x)$.

  • How to measure the performance of $\hat{f}$ on this data?

    We look at the misclassification rate:

    $L(\theta)=\frac{1}{N} \sum_{n=1}^N \mathbb{I}\{y_n\neq f(x_n,\theta)\}$, where $\mathbb{I}(e)$ is the binary indicator function.

    • We might also want to allow different errors to have unequal costs, so we give a more general definition.

    • Define the empirical risk to be the average loss of the predictor on the training data:

      $L(\theta)=\frac{1}{N} \sum_{n=1}^N \ell(y_n, f(x_n,\theta))$

    • The zero-one loss is a special case, i.e. $\ell_{01}(y,\hat{y})=\mathbb{I}\{y\neq\hat{y}\}$.

    Our goal is to minimize the expected loss on future data that has not yet been seen.

    With a suitably flexible model, we can drive the training loss to zero by simply memorizing the correct output for each input, but such a model will not do well on the test dataset.

  • How can we pick a model of the right complexity?

    If we use the training set to evaluate different models, we will always pick the most complex model, since it has the most degrees of freedom and hence the minimum training loss. Instead, we should pick the model with the minimum test error (see the sketch below).
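As a minimal sketch of these loss computations (not from the course materials), the snippet below evaluates the zero-one empirical risk and a cost-weighted empirical risk on made-up synthetic labels; the 80% accuracy level and the asymmetric cost matrix are arbitrary assumptions for illustration.

```python
import numpy as np

# Minimal sketch (synthetic data): empirical risk under the zero-one loss and
# under an assumed asymmetric cost matrix. All numbers here are made up.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)                          # true labels in {0, 1}
y_pred = np.where(rng.random(100) < 0.8, y_true, 1 - y_true)   # predictions, ~80% correct

# Misclassification rate = average zero-one loss.
zero_one_risk = np.mean(y_true != y_pred)

# More general empirical risk: average an arbitrary loss l(y, y_hat),
# here cost[y, y_hat] with false negatives (y=1, y_hat=0) costing more.
cost = np.array([[0.0, 1.0],
                 [5.0, 0.0]])
weighted_risk = np.mean(cost[y_true, y_pred])

print(f"zero-one risk: {zero_one_risk:.3f}, cost-weighted risk: {weighted_risk:.3f}")
```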

In regression

We want to predict a real-valued quantity $y\in\mathbb{R}$ instead of a class label $y\in\{1,2,...,C\}$. We use a different loss function: $\ell(y,\hat{y})=(y-\hat{y})^2$

  • $MSE(\theta)=\frac{1}{N} \sum_{n=1}^N(y_n- f(x_n,\theta))^2$
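A minimal sketch (again with made-up synthetic data) that computes the MSE of polynomial fits of increasing degree; it also illustrates why test error rather than training error should guide model choice.

```python
import numpy as np

# Minimal sketch (synthetic data): MSE(theta) = (1/N) * sum_n (y_n - f(x_n, theta))^2
# for polynomial models of increasing degree. The data-generating process is made up.
rng = np.random.default_rng(1)
x_train, x_test = rng.uniform(-1, 1, 30), rng.uniform(-1, 1, 30)
f_true = lambda x: np.sin(3 * x)
y_train = f_true(x_train) + rng.normal(0, 0.2, 30)
y_test = f_true(x_test) + rng.normal(0, 0.2, 30)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

for degree in (1, 3, 9):
    theta = np.polyfit(x_train, y_train, degree)       # least-squares polynomial fit
    print(f"degree {degree}: "
          f"train MSE {mse(y_train, np.polyval(theta, x_train)):.3f}, "
          f"test MSE {mse(y_test, np.polyval(theta, x_test)):.3f}")
```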

Unsupervised Learning

  • In supervised learning, we assume that each input example $x$ in the training set has an associated set of output targets $y$, and our goal is to learn the input-output mapping $f$.

  • In unsupervised learning, we just get observed "inputs" $D=\{x_n : n=1{:}N\}$ without any corresponding "outputs".

  • From a probabilistic perspective, we can view the task of unsupervised learning as fitting an unconditional model of the form $p(x)$, which can generate new data $x$, whereas supervised learning fits a conditional model $p(y|x)$, which specifies outputs given inputs.

  • One simple example of unsupervised learning is clustering. The goal is to partition the input into regions that contain “similar” points.

  • How to evaluate unsupervised learning?

    • It is hard to evaluate the quality of the output of an unsupervised learning method, because there is no ground truth to compare to.

    • A common method for evaluating unsupervised models is to measure the probability assigned by the model to unseen test examples.

      $L(\theta;D)=-\frac{1}{|D|} \sum_{x \in D} \log p(x|\theta)$

      e.g. mixture models.
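For instance, here is a minimal sketch (assuming scikit-learn is available, with made-up 1-D data) that fits Gaussian mixture models and reports the negative average log-likelihood on held-out data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Minimal sketch (synthetic data): evaluate an unsupervised model by the negative
# average log-likelihood it assigns to held-out test points.
rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)]).reshape(-1, 1)
rng.shuffle(X)
X_train, X_test = X[:800], X[800:]

for k in (1, 2, 4):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X_train)
    # GaussianMixture.score returns the average log-likelihood per sample,
    # so the loss L(theta; D_test) defined above is just its negative.
    print(f"k={k}: held-out loss {-gm.score(X_test):.3f}")
```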

Chapter 2 Probabilistic Inference

Bayesian vs. Frequentist

Introduction and comparison

  • Frequentist: $\theta$ is deterministic (a fixed, unknown value).

    e.g. MLE, MOM

  • Bayesian: $\theta$ (or $H$) is an unknown quantity of interest, treated as a random variable.

  • Data: $y$

    Given model: $p(y|\theta)$

Frequentist
  • Frequentist:

    $\hat{\theta}=\mathop{argmax}\limits_{\theta}P(y|\theta)$ (the MLE)
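A minimal sketch of the MLE for a Bernoulli (coin-flip) model with made-up data, finding the argmax both numerically on a grid and via the closed form:

```python
import numpy as np

# Minimal sketch (made-up coin flips): theta_hat = argmax_theta p(y | theta)
# for a Bernoulli likelihood.
y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

theta_grid = np.linspace(0.01, 0.99, 999)
log_lik = y.sum() * np.log(theta_grid) + (len(y) - y.sum()) * np.log(1 - theta_grid)

theta_mle_grid = theta_grid[np.argmax(log_lik)]   # numerical argmax on a grid
theta_mle_closed = y.mean()                       # closed-form Bernoulli MLE
print(theta_mle_grid, theta_mle_closed)           # both are approximately 0.7
```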

Bayesian
  • Bayesian: $p(\theta|y)=\frac{p(y|\theta)\, p(\theta)}{p(y)}$ (from the product rule of probability)

    • Basis of Bayesian inference

      • Bayes rule: computing the probability distribution over possible values of an unknown quantity $H$ given some observed data $Y=y$:

        $P(H=h|Y=y)=\frac{P(H=h)P(Y=y|H=h)}{P(Y=y)}$, so we can see that $\text{posterior} \propto \text{prior} \times \text{likelihood}$.

        $P(h|y)P(y)=P(h)P(y|h)=P(h,y)$

        • $P(H)$: prior distribution, what we know about $H$ before we see any data.
        • $P(Y|H=h)$: the distribution over possible outcomes $Y$ that we expect to see if $H=h$ (the likelihood).
        • $P(h,y)$: unnormalized posterior distribution.
        • $P(Y=y)$: marginal likelihood.
      • Posterior:
        $P(\theta |y)\propto P(y|\theta) P(\theta)$, i.e. we ignore the normalizing constant $p(y)$.
      • Posterior predictive:
        $P(y_{new}|y)=\int p(y_{new}|\theta)\, p(\theta|y)\ d\theta$. This is Bayesian model averaging; we will introduce it a bit more later (covered in Chapter 4).
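A minimal sketch (assuming SciPy is available) of Bayes rule in the conjugate Beta-Bernoulli case, where both the posterior and the posterior predictive have closed forms; the prior hyperparameters and data are made up for illustration.

```python
import numpy as np
from scipy.stats import beta

# Minimal sketch: posterior and posterior predictive for a Bernoulli likelihood
# with a conjugate Beta(a, b) prior on theta. All numbers are illustrative.
a, b = 2.0, 2.0                                   # prior: Beta(a, b)
y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])      # observed flips
n_heads, n_tails = y.sum(), len(y) - y.sum()

# Conjugacy: p(theta | y) = Beta(a + heads, b + tails); no normalizing constant needed.
post = beta(a + n_heads, b + n_tails)

# Posterior predictive: p(y_new = 1 | y) = E[theta | y], i.e. the posterior mean.
p_new_head = post.mean()
print(f"posterior mean = p(y_new = 1 | y) = {p_new_head:.3f}")
```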

Comparison (MAP vs. MLE)
  • MAP, MLE and the plug-in approximation

    As the amount of data increases, the posterior will become concentrated around a single point, namely the posterior mode.

    (MLE gives a point estimate. In many cases we also want to quantify the uncertainty, so we derive a standard deviation and a confidence interval.)

    The posterior mode is defined as the hypothesis with maximum posterior probability:

    $h_{map}=\mathop{argmax}\limits_{h} P(h|D)=\mathop{argmax}\limits_{h} [\log P(D|h)+\log P(h)]$

    ($h$: parameter/hypothesis, $D$: data set)

    This is called the maximum a posteriori or MAP estimate.

    • As the data set increases in size, the log likelihood grows in magnitude, but the prior term remains constant (it does not depend on the amount of data). We thus say that the likelihood overwhelms the prior.

    • A reasonable approximation to the MAP estimate is therefore to ignore the prior term: when the data set is large and the prior is weak, the log likelihood dominates, the prior contributes little, and the MLE is approximately equal to the MAP estimate (a minimal numerical sketch follows after this list).

    • Note that MAP is one way to summarize the posterior distribution, as a point estimate. Beyond this point estimate, we can also report a credible region based on the posterior variance and a chosen probability level.
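A minimal numerical sketch (Beta prior, Bernoulli likelihood, made-up sample sizes) comparing $h_{map}=\mathop{argmax}\limits_{h}[\log P(D|h)+\log P(h)]$ with the MLE as the data set grows:

```python
import numpy as np

# Minimal sketch: MAP vs. MLE for a Bernoulli likelihood with a Beta(a, b) prior.
# The prior strength, true theta, and sample sizes are made up for illustration.
a, b = 10.0, 10.0                                # fairly strong prior centered at 0.5
theta_true = 0.8
rng = np.random.default_rng(3)

grid = np.linspace(0.001, 0.999, 999)
log_prior = (a - 1) * np.log(grid) + (b - 1) * np.log(1 - grid)

for N in (10, 100, 10_000):
    y = rng.binomial(1, theta_true, size=N)
    log_lik = y.sum() * np.log(grid) + (N - y.sum()) * np.log(1 - grid)
    theta_map = grid[np.argmax(log_lik + log_prior)]   # argmax of log lik + log prior
    print(f"N={N:6d}: MLE {y.mean():.3f}, MAP {theta_map:.3f}")
# As N grows, the likelihood overwhelms the prior and the MAP estimate approaches the MLE.
```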

Is Bayes relevant in the "big data era"?

  • As the amount of data increases, the posterior $p(\theta|D)$ often shrinks to a point, since the log likelihood term $\log p(D|\theta)$ grows with $N$, whereas the log prior $\log p(\theta)$ is independent of $N$.

  • Consequently, the posterior approaches a delta function at the MLE, $p(\theta|D)\rightarrow \delta(\theta-\hat{\theta})$, where $\delta(x)$ is the Dirac delta function.

    With big data, the Bayesian and frequentist answers (MAP vs. MLE) differ very little, but the Bayesian approach is more complex and computationally expensive (updating the posterior / evaluating the marginal likelihood is not easy). So with big data, one arguably "does not need" the Bayesian approach.

  • Thus one might think that Bayes is irrelevant in the era of "big data".
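A minimal sketch of this concentration effect (flat Beta prior, Bernoulli data, made-up sample sizes): the posterior standard deviation shrinks roughly like $1/\sqrt{N}$, so the posterior collapses toward the MLE.

```python
import numpy as np
from scipy.stats import beta

# Minimal sketch: with a flat Beta(1, 1) prior and Bernoulli data, the Beta posterior
# concentrates around the MLE as N grows. theta_true and the sample sizes are made up.
theta_true, a, b = 0.3, 1.0, 1.0
rng = np.random.default_rng(4)

for N in (10, 1_000, 100_000):
    y = rng.binomial(1, theta_true, size=N)
    post = beta(a + y.sum(), b + N - y.sum())
    print(f"N={N:6d}: posterior mean {post.mean():.4f}, posterior sd {post.std():.5f}")
# The posterior sd shrinks roughly like 1/sqrt(N): p(theta | D) -> delta(theta - theta_hat).
```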
