Andrew Ng Machine Learning Video Study Notes 1

learning theory

Machine Learning definition

Arthur Samuel (1959). Machine learning: the field of study that gives computers the ability to learn without being explicitly programmed.
Tom Mitchell (1998). Well-posed learning problem: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

supervised learning

In a supervised learning problem, we give the algorithm a set of "standard answers" (labels) and let it learn the relationship between the inputs and those answers, so that it can produce good answers for new inputs we provide.

regression

The output $y$ is a continuous variable.
Define the cost function
$$J(\Theta)=\frac{1}{2}\sum_{i=1}^{n}\left(h_\Theta(x^{(i)})-y^{(i)}\right)^2$$
where $h_\Theta(x)=\Theta^T x$ is the hypothesis.
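As an illustration (not part of the original notes), here is a minimal NumPy sketch of this cost function, assuming a design matrix `X` whose rows are the $x^{(i)}$ and a target vector `y`:

```python
import numpy as np

def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2."""
    residuals = X @ theta - y   # h_theta(x_i) - y_i for every example
    return 0.5 * np.sum(residuals ** 2)
```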

  1. LMS (least mean squares) algorithm:

The gradient descent algorithm:
$$\Theta_j := \Theta_j - \alpha \frac{\partial}{\partial \Theta_j} J(\Theta)$$
This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of $J$. $\alpha$ is called the learning rate.
For a single training example, the LMS update rule (also called the Widrow-Hoff learning rule) is
$$\Theta_j := \Theta_j + \alpha\left(y^{(i)}-h_\Theta(x^{(i)})\right)x_j^{(i)}$$
where the magnitude of the update is proportional to the error term.
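A minimal sketch of this single-example update (the function name and array shapes are assumptions, not from the notes):

```python
import numpy as np

def lms_update(theta, x, y, alpha):
    """One LMS / Widrow-Hoff step on a single training example (x, y)."""
    error = y - x @ theta              # the error term y^(i) - h_theta(x^(i))
    return theta + alpha * error * x   # updates every component theta_j at once
```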

There are two ways to modify this method for a training set of more than one example.
The first is to replace it with the following algorithm:
Repeat until convergence {
    $\Theta_j := \Theta_j + \alpha \sum_{i=1}^{n}\left(y^{(i)}-h_\Theta(x^{(i)})\right)x_j^{(i)}$   (for every $j$)
}
This method looks at every example in the entire training set on every step, and is called batch gradient descent. While gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global, and no other local, optima.
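A hedged sketch of batch gradient descent under the same assumptions; the fixed iteration count stands in for a real convergence test:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Each step uses the error summed over the entire training set."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        errors = y - X @ theta                   # all n error terms at once
        theta = theta + alpha * (X.T @ errors)   # sum_i error_i * x^(i), for every j
    return theta
```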
There is an alternative to batch gradient descent that also works very well. Consider the second algorithm:
Loop {
    for $i=1$ to $n$ {
        $\Theta_j := \Theta_j + \alpha\left(y^{(i)}-h_\Theta(x^{(i)})\right)x_j^{(i)}$   (for every $j$)
    }
}
We repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only. This algorithm is called stochastic gradient descent (also incremental gradient descent).
It may never converge to the minimum: the parameters $\Theta$ will keep oscillating around the minimum of $J(\Theta)$. In practice, though, most of the values near the minimum will be reasonably good approximations to the true minimum.
Conclusion: when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.
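A corresponding sketch of stochastic gradient descent; the epoch count is arbitrary, and in practice one would typically also shuffle the examples and decay `alpha`:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=10):
    """Each step uses the error of a single training example only."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):            # repeatedly run through the training set
        for i in range(X.shape[0]):
            error = y[i] - X[i] @ theta
            theta = theta + alpha * error * X[i]
    return theta
```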

  2. The normal equations
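The notes leave this section empty; as a reminder, the closed-form least-squares solution is $\Theta = (X^T X)^{-1} X^T \vec{y}$, and a minimal sketch (assuming $X^T X$ is invertible) is:

```python
import numpy as np

def normal_equations(X, y):
    """Closed-form least-squares fit: solve X^T X theta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)  # avoids forming the inverse explicitly
```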
  3. Probabilistic interpretation

Assume that
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$
in which $\epsilon^{(i)}$ is an error term that captures unmodeled effects or random noise, and is assumed to be distributed IID (independently and identically distributed): $\epsilon^{(i)} \sim \mathcal{N}(0,\sigma^2)$.
The density of $\epsilon^{(i)}$ is given by
$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$$
This implies that
$$p(y^{(i)} \mid x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\right)$$
The notation $p(y^{(i)} \mid x^{(i)};\theta)$ indicates that this is the distribution of $y^{(i)}$ given $x^{(i)}$ and parameterized by $\theta$. Note that we should not condition on $\theta$, as in $p(y^{(i)} \mid x^{(i)},\theta)$, since $\theta$ is not a random variable. We can also write the distribution of $y^{(i)}$ as $y^{(i)} \mid x^{(i)};\theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$.
The likelihood function:
$$L(\theta) = L(\theta; X, \vec{y}) = p(\vec{y} \mid X;\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)};\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\right)$$
According to the principle of maximum likelihood, we maximize the log likelihood:
$$\ell(\theta) = \log L(\theta) = n\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{n}\left(y^{(i)}-\theta^T x^{(i)}\right)^2$$
Hence, maximizing $\ell(\theta)$ gives the same answer as minimizing
$$J(\theta)=\frac{1}{2}\sum_{i=1}^{n}\left(y^{(i)}-\theta^T x^{(i)}\right)^2,$$
our original least-squares cost function.

To summarize: under the previous probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of $\theta$. This is thus one set of assumptions under which least-squares regression can be justified as a very natural method that's just doing maximum likelihood estimation. (Note, however, that the probabilistic assumptions are by no means necessary for least-squares to be a perfectly good and rational procedure, and there may be, and indeed there are, other natural assumptions that can also be used to justify it.)
Note also that, in our previous discussion, our final choice of $\theta$ did not depend on the value of $\sigma^2$; indeed we'd have arrived at the same result even if $\sigma^2$ were unknown.
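A small numeric check of this equivalence (the synthetic data and SciPy are assumptions, not part of the notes): for any fixed $\sigma^2$, minimizing the Gaussian negative log likelihood over $\theta$ recovers the same $\theta$ as least squares.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def neg_log_likelihood(theta, sigma2=4.0):
    # Up to constants that do not depend on theta, -l(theta) is the
    # squared error scaled by 1/(2*sigma^2).
    return np.sum((y - X @ theta) ** 2) / (2.0 * sigma2)

theta_mle = minimize(neg_log_likelihood, np.zeros(3)).x
theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(theta_mle, theta_ls, atol=1e-4))  # True: same minimizer
```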

  4. Locally weighted linear regression

underfitting: the model fails to capture structure that the data clearly shows
overfitting: the model fits idiosyncrasies of the training data (such as noise) rather than the underlying structure

Steps:
a) fit $\theta$ to minimize $\sum_i w^{(i)}\left(y^{(i)}-\theta^T x^{(i)}\right)^2$
b) output $\theta^T x$

The weight is $w^{(i)} = \exp\!\left(-\frac{(x^{(i)}-x)^2}{2\tau^2}\right)$, in which the parameter $\tau$ is called the bandwidth and controls how quickly the weight of a training example falls off with the distance of its $x^{(i)}$ from the query point $x$. Note that while the formula for the weights takes a form that is cosmetically similar to the density of a Gaussian distribution, the $w^{(i)}$'s do not directly have anything to do with Gaussians.

Locally weighted linear regression is the first example we're seeing of a non-parametric algorithm. The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the $\theta_i$'s), which are fit to the data. Once we've fit the $\theta_i$'s and stored them away, we no longer need to keep the training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis $h$ grows linearly with the size of the training set.
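A hedged sketch of these steps for a single query point (the names are hypothetical; step a is carried out via the standard weighted normal equations $X^T W X\,\theta = X^T W \vec{y}$):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted linear regression prediction at one query point."""
    # Gaussian-shaped weights: training points near x_query count the most.
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    # Step a: fit theta by solving the weighted normal equations.
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    # Step b: output theta^T x at the query point.
    return x_query @ theta
```

Note that `theta` must be re-solved for every new query point, which is exactly why the entire training set has to be kept around.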

classification

The values $y$ we want to predict take on only a small number of discrete values.

Binary classification

y can take on only two values, 0 and 1.

  1. logistic regression

Since we know that $y \in \{0, 1\}$, it doesn't make sense for $h_\theta(x)$ to take values larger than 1 or smaller than 0. Choose
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1+e^{-\theta^T x}}$$
in which $g(z)=\frac{1}{1+e^{-z}}$ is called the logistic function or the sigmoid function, and its derivative satisfies $g'(z)=g(z)(1-g(z))$.
Let us assume that
$$P(y=1 \mid x;\theta) = h_\theta(x), \qquad P(y=0 \mid x;\theta) = 1 - h_\theta(x)$$
Note that this can be written more compactly as
$$P(y \mid x;\theta) = \left(h_\theta(x)\right)^{y}\left(1-h_\theta(x)\right)^{1-y}$$
Assuming that the $n$ training examples were generated independently, we can then write down the likelihood of the parameters as
$$L(\theta) = P(\vec{y} \mid X;\theta) = \prod_{i=1}^{n} P(y^{(i)} \mid x^{(i)};\theta) = \prod_{i=1}^{n}\left(h_\theta(x^{(i)})\right)^{y^{(i)}}\left(1-h_\theta(x^{(i)})\right)^{1-y^{(i)}}$$
As before, it will be easier to maximize the log likelihood:
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n}\left[y^{(i)}\log h(x^{(i)}) + (1-y^{(i)})\log\left(1-h(x^{(i)})\right)\right]$$
Use gradient ascent: $\theta := \theta + \alpha \nabla_\theta \ell(\theta)$.
Taking derivatives with respect to a single training example $(x, y)$ gives the stochastic gradient ascent rule:
$$\frac{\partial}{\partial \theta_j}\ell(\theta) = \left(y - h_\theta(x)\right)x_j$$
Therefore,
$$\theta_j := \theta_j + \alpha\left(y^{(i)}-h_\theta(x^{(i)})\right)x_j^{(i)}$$
If we compare this to the LMS update rule, we see that it looks identical; but this is not the same algorithm, because $h_\theta(x^{(i)})$ is now defined as a non-linear function of $\theta^T x^{(i)}$. Nonetheless, it's a little surprising that we end up with the same update rule for a rather different algorithm and learning problem.
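A minimal sketch of this update rule (the epoch count and learning rate are arbitrary assumptions):

```python
import numpy as np

def sigmoid(z):
    """The logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sga(X, y, alpha=0.1, num_epochs=50):
    """Stochastic gradient ascent on the log likelihood l(theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for i in range(X.shape[0]):
            error = y[i] - sigmoid(X[i] @ theta)   # same form as LMS, different h
            theta = theta + alpha * error * X[i]
    return theta
```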

multiclass classification

unsupervised learning

The dataset contains no labels; the goal is to find interesting structure in the data.

reinforcement learning

The algorithm can collect data interactively: try a strategy, collect feedback, and improve the strategy based on that feedback.
The key is how to define "reward" and "punishment".
