Stanford Machine Learning Notes (1)

Latest recommended article published on 2018-12-27 21:23:44

zhanghong98

Category: 溯游从之 · Tags: machine learning

Article link: https://blog.csdn.net/zhanghong98/article/details/58602613


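I'm halfway through the course, so I'm going back to organize my thoughts from the beginning; otherwise there is too much material and I'll end up forgetting all of it. I haven't mastered the content well enough to produce anything original, and I have no hands-on experience, so this first draft just lists the key points. I'll keep adding insights over time.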

1. Introduction

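Machine learning is an interdisciplinary field touching neuroscience, computational biology, robotics, natural language processing, and more. It gives computers a new capability: performing tasks without being explicitly programmed, such as predicting trends in data, pattern recognition, and autonomous driving. The course covers the following topics: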

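Supervised Learning Algorithm
The training examples include the correct answers, i.e. $S=\{(x^{(i)},y^{(i)})\}$. Depending on the values $y^{(i)}$ takes, there are two cases:
- continuous case, called regression, e.g. the familiar linear regression;
- discrete case, called classification; with k possible classes, $y^{(i)}\in\{1,...,k\}$
  
  Regression and classification can in fact be converted into one another: continuous y values can be discretized by splitting them into intervals, and classification with a large enough k can approximate regression.
Learning theory
Extremely important. Covers the relationship between sample complexity and generalization error, model/algorithm selection, and so on.
Unsupervised Learning Algorithm
No correct answers are given; the machine tries to discover the latent structure of the data on its own, e.g. clustering.
Reinforcement Learning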

Supervised Learning

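Our goal is, given a training set, to learn a function h mapping from $\cal X$ to $\cal Y$ so that h(x) is a “good” predictor for the corresponding value of y. The process is: the training set is fed to a learning algorithm (LA), which outputs h; h then takes a new x and returns a predicted y.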

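Clearly, how the LA arrives at h is the key. Of the main algorithms I have seen so far, some, such as Linear Regression and SVM, define a cost function that measures the error and then find the parameters that minimize it; others, such as Logistic Regression, Naive Bayes, Gaussian Discriminant Analysis, and Bayesian Logistic Regression, take an approach similar to maximum likelihood estimation. The two are essentially connected, though I can't yet spell out exactly how.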

Conventional Notations:
- m = #training examples
- n = # features
- x = features
- y = target
- $\theta$ = parameters
- h = hypothesis, i.e. $h_\theta(x)$, a function of x parametrized by $\theta$
- J( $\theta$ ) = cost function, which measures the error and has to be chosen by hand

2. Linear Regression

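Linear regression is the simplest supervised learning algorithm, and it is a regression algorithm. The idea is to explicitly take the hypothesis set $\cal H=\{h_\theta (x)=\sum^n_{i=0} \theta_ix_i\}$ (with $x_0:=1$), choose the cost function $J(\theta)=\frac12 \sum_{i=1}^m(h_\theta (x^{(i)})-y^{(i)})^2$, and then solve for the parameters $\theta$ using batch gradient descent, stochastic gradient descent, a closed-form formula, Newton's method, etc. Intuitively, this fits a “line” (a hyperplane) through m points in (n+1)-dimensional space.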
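To make the hypothesis and the cost function concrete, here is a minimal sketch in plain Python. The function names `h` and `cost` and the toy data are my own choices for illustration, not something from the course; each input vector is assumed to start with the intercept term $x_0=1$.

```python
def h(theta, x):
    """Linear hypothesis h_theta(x) = sum_i theta_i * x_i, where x[0] == 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def cost(theta, X, y):
    """Least-squares cost: half the sum of squared errors over all examples."""
    return 0.5 * sum((h(theta, x) - yi) ** 2 for x, yi in zip(X, y))


# Toy data, made up for illustration: points on the line y = 1 + 2 * x_1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # each row is (x_0 = 1, x_1)
y = [1.0, 3.0, 5.0]

perfect = cost([1.0, 2.0], X, y)  # the generating parameters: zero error
worse = cost([0.0, 2.0], X, y)    # every prediction is off by exactly 1
```

With the generating parameters the cost is zero; dropping the intercept makes each of the three predictions miss by 1, so the cost becomes $\frac12\cdot 3$.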

Batch Gradient Descent

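Idea: starting somewhere on a mountain (the surface $J=J(\vec\theta)$), how do we get to the bottom of a valley (a local minimum)? Look around, find the direction of steepest descent, and take a small step. Keep repeating this. Note that this is an optimization algorithm for finding extrema, not a learning algorithm (LA) in itself.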

Goal
$\min_{\vec\theta}J(\vec\theta)$
Algo
Repeat till convergence:
$\vec\theta:=\vec\theta-\alpha\nabla J(\vec\theta)$
$\alpha$ controls the step size (the learning rate). Near an extremum $\nabla J(\vec\theta)$ becomes small, so the steps tend to 0 on their own. Too large a step overshoots the minimum; too small a step makes learning slow.
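The effect of the learning rate is easy to see on a toy problem. The sketch below is only an illustration under my own assumptions: a one-dimensional cost $J(\theta)=\theta^2$ (not from the course), a fixed number of steps instead of a real convergence test, and hypothetical names `gradient_descent` and `grad_J`.

```python
def gradient_descent(grad, theta0, alpha, steps):
    """Repeat theta := theta - alpha * grad(theta) a fixed number of times."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta


def grad_J(theta):
    """Gradient of the toy cost J(theta) = theta^2, minimized at theta = 0."""
    return 2.0 * theta


# Each step multiplies theta by (1 - 2 * alpha).
good = gradient_descent(grad_J, theta0=1.0, alpha=0.1, steps=50)    # factor 0.8
slow = gradient_descent(grad_J, theta0=1.0, alpha=0.001, steps=50)  # factor 0.998
bad = gradient_descent(grad_J, theta0=1.0, alpha=1.1, steps=50)     # factor -1.2
```

After 50 steps `good` is very close to the minimum, `slow` has barely moved from its starting point, and `bad` has jumped back and forth across the minimum with growing magnitude.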

Choice on Cost Function

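In the two-dimensional case, we are drawing a straight line through m points so that it fits as well as possible. A natural way to measure the error is the distance between each sample point and its predicted point, so we can try the following (the $\frac12$ is only there to make the result look nice):

$J(\theta)=\frac12 \sum_{i=1}^m(h_\theta (x^{(i)})-y^{(i)})^2$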

Probabilistic Interpretation

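Another way to understand the choice of $J(\theta)=\frac12 \sum_{i=1}^m(h_\theta (x^{(i)})-y^{(i)})^2$ is to assume that the error follows a Gaussian distribution and then derive $\theta$ by maximum likelihood estimation.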

Let $y^{(i)}=\theta^{T}x^{(i)}+\varepsilon^{(i)}$ , where $\varepsilon^{(i)}$ is the error.

Assume $\varepsilon^{(i)}\sim N(0,\sigma^{2})$, which makes sense to some extent: by the Central Limit Theorem, the sum of many independent r.v.s (here, the unmodeled effects) tends toward a Gaussian distribution.

$p(\varepsilon^{(i)})=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(\varepsilon^{(i)})^{2}}{2\sigma^{2}}\right)$
This implies that

$p(y^{(i)}|x^{(i)};\theta)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^{T}x^{(i)})^{2}}{2\sigma^{2}}\right)$
where “$;\theta$” simply means “parametrized by $\theta$”, since we are taking the frequentist view of probability, which treats $\theta$ as a fixed value that is merely unknown to us for now.
Now we define the likelihood function:

$L(\theta)=p(\vec y|X;\theta)$
where X is the design matrix containing all the training examples.

Our goal of getting the best $h$ from the training set can then be interpreted as maximizing $L(\theta)$ with respect to $\theta$. This is not the only interpretation, nor necessarily the best one, but it is fairly convincing.

We also assume that the $\varepsilon^{(i)}$ are i.i.d., which is again fairly convincing unless you get a bad training set. Then,

$\begin{align*} L(\theta)&=p(\vec y|X;\theta)\\ &=\prod^{m}_{i=1}p(y^{(i)}|x^{(i)};\theta)\\ &=\prod^{m}_{i=1}\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^{T}x^{(i)})^{2}}{2\sigma^{2}}\right) \end{align*}$
For convenience, define the log-likelihood function $l(\theta)=\ln L(\theta)$:

$\begin{align*} l(\theta)&=\ln\prod^{m}_{i=1}\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^{T}x^{(i)})^{2}}{2\sigma^{2}}\right)\\ &=\sum^{m}_{i=1}\ln\left[\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^{T}x^{(i)})^{2}}{2\sigma^{2}}\right)\right]\\ &=m\ln\frac{1}{\sqrt{2\pi}\sigma}-\frac{1}{\sigma^{2}}\cdot\frac12 \sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]^2\\ &=m\ln\frac{1}{\sqrt{2\pi}\sigma}-\frac{1}{\sigma^{2}}J(\theta) \end{align*}$
Therefore maximizing the log-likelihood function is exactly the same as minimizing the chosen cost function $\frac12 \sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]^2$.
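The identity $l(\theta)=m\ln\frac{1}{\sqrt{2\pi}\sigma}-\frac{1}{\sigma^2}J(\theta)$ can be checked numerically. The sketch below uses made-up data, an arbitrary $\sigma$, and hypothetical helper names (`predict`, `J`, `log_likelihood`); it only confirms the algebra on one example and is not a proof.

```python
import math


def predict(theta, x):
    """h_theta(x) = theta^T x."""
    return sum(t * xi for t, xi in zip(theta, x))


def J(theta, X, y):
    """Least-squares cost: half the sum of squared errors."""
    return 0.5 * sum((predict(theta, x) - yi) ** 2 for x, yi in zip(X, y))


def log_likelihood(theta, X, y, sigma):
    """Sum over examples of ln N(y_i; theta^T x_i, sigma^2)."""
    norm = math.log(1.0 / (math.sqrt(2.0 * math.pi) * sigma))
    return sum(
        norm - (yi - predict(theta, x)) ** 2 / (2.0 * sigma ** 2)
        for x, yi in zip(X, y)
    )


# Made-up noisy observations of roughly y = 1 + 2 * x_1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.9, 3.2, 4.8, 7.1]
sigma = 0.5  # arbitrary assumed noise level
const = len(X) * math.log(1.0 / (math.sqrt(2.0 * math.pi) * sigma))

theta_a = [1.0, 2.0]
theta_b = [0.0, 1.0]

# Difference between l(theta) and const - J(theta) / sigma^2; should be ~0.
gap_a = log_likelihood(theta_a, X, y, sigma) - (const - J(theta_a, X, y) / sigma ** 2)
gap_b = log_likelihood(theta_b, X, y, sigma) - (const - J(theta_b, X, y) / sigma ** 2)
```

Both gaps vanish up to floating-point rounding, and the parameter vector with the smaller cost is also the one with the larger log-likelihood.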

Once $J(\theta)$ is fixed, Batch Gradient Descent gives the update

$\theta_i:=\theta_i-\alpha\sum_{j=1}^m [h_\theta(x^{(j)})-y^{(j)}]x_i^{(j)}$
The sum runs over all training examples, so every descent step tries to reduce the total distance. When the training set is large, computing this sum at every step is too expensive, which motivates Stochastic Gradient Descent.
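A minimal sketch of this batch update in plain Python follows. The function name, the toy data, the learning rate, and the fixed iteration count (used in place of a convergence test) are all assumptions made for illustration.

```python
def batch_gradient_descent(X, y, alpha, iterations):
    """Batch GD for least squares: every step sums over all m examples."""
    n = len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # Errors h_theta(x^(j)) - y^(j), all computed with the current theta.
        errors = [
            sum(t * xi for t, xi in zip(theta, x)) - yj
            for x, yj in zip(X, y)
        ]
        # Simultaneous update of every theta_i.
        theta = [
            theta[i] - alpha * sum(e * x[i] for e, x in zip(errors, X))
            for i in range(n)
        ]
    return theta


# Toy data, made up for illustration: exactly y = 1 + 2 * x_1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = batch_gradient_descent(X, y, alpha=0.05, iterations=2000)
```

On this data `theta` converges to approximately `[1.0, 2.0]`. Note that each of the 2000 iterations touches all four examples, which is the cost the text refers to.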

Stochastic Gradient Descent

Instead of using all training examples in every iteration, we use only one each time.

Repeat for j = 1 to m:

$\theta_i:=\theta_i-\alpha\ [h_\theta(x^{(j)})-y^{(j)}]x_i^{(j)}$ (simultaneously for all i)
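The same toy problem can be run with the stochastic update. This is a sketch under my own assumptions: the outer loop over several passes (`epochs`) is something I added, the examples are visited in a fixed order, and the data are noise-free, which is why a constant learning rate converges exactly here. On noisy data the parameters would generally keep hovering around the minimum instead.

```python
def stochastic_gradient_descent(X, y, alpha, epochs):
    """SGD for least squares: update theta after every single example."""
    n = len(X[0])
    theta = [0.0] * n
    for _ in range(epochs):
        for x, yj in zip(X, y):  # j = 1..m, one example per update
            error = sum(t * xi for t, xi in zip(theta, x)) - yj
            # Simultaneous update of every theta_i using only example j.
            theta = [t - alpha * error * xi for t, xi in zip(theta, x)]
    return theta


# Toy data, made up for illustration: exactly y = 1 + 2 * x_1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = stochastic_gradient_descent(X, y, alpha=0.05, epochs=2000)
```

Each update costs only one example's worth of work, regardless of how large m is.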

A comparison of the two GD variants will be added once I have more hands-on experience.

Formula Method

Linear Regression is simple enough that $\theta$ can even be written in closed form.

Introduce the matrix derivative: for $f:R^{m\times n}\mapsto R$, $(\nabla_Af(A))_{ij} := \frac{\partial f}{\partial A_{ij}}$
Properties:

$\nabla_A\mathrm{tr}(AB)=B^T \\\nabla_A\mathrm{tr}(ABA^TC)=CAB+C^TAB^T$
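Both trace properties can be sanity-checked with finite differences. The sketch below uses small made-up matrices and hypothetical helpers (`matmul`, `transpose`, `trace`, `numerical_gradient`) written in plain Python; it tests the identities at a single point and is not a proof.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(M):
    return [list(col) for col in zip(*M)]


def trace(M):
    return sum(M[i][i] for i in range(len(M)))


def numerical_gradient(f, A, eps=1e-5):
    """Central-difference estimate of (grad_A f)_ij = d f / d A_ij."""
    grad = [[0.0] * len(A[0]) for _ in A]
    for i in range(len(A)):
        for j in range(len(A[0])):
            plus = [row[:] for row in A]
            minus = [row[:] for row in A]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad[i][j] = (f(plus) - f(minus)) / (2.0 * eps)
    return grad


def max_abs_diff(P, Q):
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))


# Made-up test matrices.
A = [[1.0, 2.0, 0.5], [-1.0, 0.0, 3.0]]                      # 2 x 3
B = [[0.5, -1.0], [2.0, 0.0], [1.5, 3.0]]                    # 3 x 2
B_sq = [[1.0, 0.0, 2.0], [-1.0, 0.5, 0.0], [0.0, 3.0, 1.0]]  # 3 x 3
C = [[2.0, -1.0], [0.5, 1.0]]                                # 2 x 2

# Property 1: grad_A tr(AB) = B^T
grad1 = numerical_gradient(lambda M: trace(matmul(M, B)), A)
diff1 = max_abs_diff(grad1, transpose(B))

# Property 2: grad_A tr(A B A^T C) = C A B + C^T A B^T
grad2 = numerical_gradient(
    lambda M: trace(matmul(matmul(matmul(M, B_sq), transpose(M)), C)), A)
CAB = matmul(matmul(C, A), B_sq)
CtABt = matmul(matmul(transpose(C), A), transpose(B_sq))
expected2 = [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(CAB, CtABt)]
diff2 = max_abs_diff(grad2, expected2)
```

Both `diff1` and `diff2` come out at rounding-error level, since the first function is linear and the second quadratic in the entries of A, so central differences are exact apart from floating-point noise.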

Then

$J(\theta) = \frac12\sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]^2 \\=\frac12(X\theta-y)^T(X\theta-y)$

Setting $\nabla_\theta J(\theta) = 0$ will give us $\theta$:

$\nabla_\theta\frac12(X\theta-y)^T(X\theta-y) \\=\frac12\nabla_\theta\mathrm{tr}(\theta^TX^TX\theta-\theta^TX^Ty-y^TX\theta+y^Ty) \\=\frac12(\nabla_\theta\mathrm{tr}(\theta I\theta^TX^TX)-\nabla_\theta\mathrm{tr}(\theta^TX^Ty)-\nabla_\theta\mathrm{tr}(y^TX\theta))$
using the properties to eliminate $\theta^{T}$