Andrew Ng Machine Learning Video Study Notes 1

learning theory

Machine Learning definition

Arthur Samuel (1959). Machine learning: the field of study that gives computers the ability to learn without being explicitly programmed.
Tom Mitchell (1998). Well-posed learning problem: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

supervised learning

In a supervised learning problem, we give the algorithm a set of "standard answers" (labels) and let it learn the relationship between the inputs and those answers, so that it can produce good answers for new inputs we provide.

regression

The output $y$ is a continuous variable.
Define the cost function
$$J(\Theta)=\frac{1}{2}\sum_{i=1}^{n}\left(h_\Theta(x^{(i)})-y^{(i)}\right)^2$$
where $h_\Theta(x)=\Theta^T x$ is the hypothesis.
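As an illustration (not part of the original notes), here is a minimal NumPy sketch of this cost function, assuming a design matrix `X` whose rows are the $x^{(i)}$ and a target vector `y`:

```python
import numpy as np

def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2."""
    residuals = X @ theta - y   # h_theta(x_i) - y_i for every example
    return 0.5 * np.sum(residuals ** 2)
```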

  1. LMS (least mean squares) algorithm:

The gradient descent algorithm:
$$\Theta_j := \Theta_j - \alpha \frac{\partial}{\partial \Theta_j} J(\Theta)$$
This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of $J$. $\alpha$ is called the learning rate.
For a single training example, the LMS update rule (also called the Widrow-Hoff learning rule) is
$$\Theta_j := \Theta_j + \alpha\left(y^{(i)}-h_\Theta(x^{(i)})\right)x_j^{(i)}$$
where the magnitude of the update is proportional to the error term.
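A minimal sketch of this single-example update (the function name and array shapes are assumptions, not from the notes):

```python
import numpy as np

def lms_update(theta, x, y, alpha):
    """One LMS / Widrow-Hoff step on a single training example (x, y)."""
    error = y - x @ theta              # the error term y^(i) - h_theta(x^(i))
    return theta + alpha * error * x   # updates every component theta_j at once
```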

There are two ways to modify this method for a training set of more than one example.
The first is to replace it with the following algorithm:
Repeat until convergence {
    $\Theta_j := \Theta_j + \alpha \sum_{i=1}^{n}\left(y^{(i)}-h_\Theta(x^{(i)})\right)x_j^{(i)}$   (for every $j$)
}
This method looks at every example in the entire training set on every step, and is called batch gradient descent. While gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global, and no other local, optima.
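A hedged sketch of batch gradient descent under the same assumptions; the fixed iteration count stands in for a real convergence test:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Each step uses the error summed over the entire training set."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        errors = y - X @ theta                   # all n error terms at once
        theta = theta + alpha * (X.T @ errors)   # sum_i error_i * x^(i), for every j
    return theta
```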
There is an alternative to batch gradient descent that also works very well. Consider the second algorithm:
Loop {
    for $i=1$ to $n$ {
        $\Theta_j := \Theta_j + \alpha\left(y^{(i)}-h_\Theta(x^{(i)})\right)x_j^{(i)}$   (for every $j$)
    }
}
We repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only. This algorithm is called stochastic gradient descent (also incremental gradient descent).
It may never converge to the minimum: the parameters $\Theta$ will keep oscillating around the minimum of $J(\Theta)$. In practice, though, most of the values near the minimum will be reasonably good approximations to the true minimum.
Conclusion: when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.
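A corresponding sketch of stochastic gradient descent; the epoch count is arbitrary, and in practice one would typically also shuffle the examples and decay `alpha`:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=10):
    """Each step uses the error of a single training example only."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):            # repeatedly run through the training set
        for i in range(X.shape[0]):
            error = y[i] - X[i] @ theta
            theta = theta + alpha * error * X[i]
    return theta
```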

  2. The normal equations
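The notes leave this section empty; as a reminder, the closed-form least-squares solution is $\Theta = (X^T X)^{-1} X^T \vec{y}$, and a minimal sketch (assuming $X^T X$ is invertible) is:

```python
import numpy as np

def normal_equations(X, y):
    """Closed-form least-squares fit: solve X^T X theta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)  # avoids forming the inverse explicitly
```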
  3. Probabilistic interpretation

Assume that
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$
in which $\epsilon^{(i)}$ is an error term that captures unmodeled effects or random noise, and is assumed to be distributed IID (independently and identically distributed): $\epsilon^{(i)} \sim \mathcal{N}(0,\sigma^2)$.
The density of $\epsilon^{(i)}$ is given by
$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$$
This implies that
$$p(y^{(i)} \mid x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\right)$$
The notation $p(y^{(i)} \mid x^{(i)};\theta)$ indicates that this is the distribution of $y^{(i)}$ given $x^{(i)}$ and parameterized by $\theta$. Note that we should not condition on $\theta$, as in $p(y^{(i)} \mid x^{(i)},\theta)$, since $\theta$ is not a random variable. We can also write the distribution of $y^{(i)}$ as $y^{(i)} \mid x^{(i)};\theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$.
The likelihood function:
$$L(\theta) = L(\theta; X, \vec{y}) = p(\vec{y} \mid X;\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)};\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\right)$$
According to the principle of maximum likelihood, we maximize the log likelihood:
$$\ell(\theta) = \log L(\theta) = n\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{n}\left(y^{(i)}-\theta^T x^{(i)}\right)^2$$
Hence, maximizing $\ell(\theta)$ gives the same answer as minimizing
$$J(\theta)=\frac{1}{2}\sum_{i=1}^{n}\left(y^{(i)}-\theta^T x^{(i)}\right)^2,$$
our original least-squares cost function.

To summarize: under the previous probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of $\theta$. This is thus one set of assumptions under which least-squares regression can be justified as a very natural method that's just doing maximum likelihood estimation. (Note, however, that the probabilistic assumptions are by no means necessary for least-squares to be a perfectly good and rational procedure, and there may be, and indeed there are, other natural assumptions that can also be used to justify it.)
Note also that, in our previous discussion, our final choice of $\theta$ did not depend on the value of $\sigma^2$; indeed we'd have arrived at the same result even if $\sigma^2$ were unknown.
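A small numeric check of this equivalence (the synthetic data and SciPy are assumptions, not part of the notes): for any fixed $\sigma^2$, minimizing the Gaussian negative log likelihood over $\theta$ recovers the same $\theta$ as least squares.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def neg_log_likelihood(theta, sigma2=4.0):
    # Up to constants that do not depend on theta, -l(theta) is the
    # squared error scaled by 1/(2*sigma^2).
    return np.sum((y - X @ theta) ** 2) / (2.0 * sigma2)

theta_mle = minimize(neg_log_likelihood, np.zeros(3)).x
theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(theta_mle, theta_ls, atol=1e-4))  # True: same minimizer
```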

  4. Locally weighted linear regression

underfitting: the model fails to capture structure that the data clearly shows
overfitting: the model fits idiosyncrasies of the training data (such as noise) rather than the underlying structure

Steps:
a) fit $\theta$ to minimize $\sum_i w^{(i)}\left(y^{(i)}-\theta^T x^{(i)}\right)^2$
b) output $\theta^T x$

The weight is $w^{(i)} = \exp\!\left(-\frac{(x^{(i)}-x)^2}{2\tau^2}\right)$, in which the parameter $\tau$ is called the bandwidth and controls how quickly the weight of a training example falls off with the distance of its $x^{(i)}$ from the query point $x$. Note that while the formula for the weights takes a form that is cosmetically similar to the density of a Gaussian distribution, the $w^{(i)}$'s do not directly have anything to do with Gaussians.

Locally weighted linear regression is the first example we're seeing of a non-parametric algorithm. The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the $\theta_i$'s), which are fit to the data. Once we've fit the $\theta_i$'s and stored them away, we no longer need to keep the training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis $h$ grows linearly with the size of the training set.
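A hedged sketch of these steps for a single query point (the names are hypothetical; step a is carried out via the standard weighted normal equations $X^T W X\,\theta = X^T W \vec{y}$):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted linear regression prediction at one query point."""
    # Gaussian-shaped weights: training points near x_query count the most.
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    # Step a: fit theta by solving the weighted normal equations.
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    # Step b: output theta^T x at the query point.
    return x_query @ theta
```

Note that `theta` must be re-solved for every new query point, which is exactly why the entire training set has to be kept around.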

classification

The values $y$ we want to predict take on only a small number of discrete values.

Binary classification

y can take on only two values, 0 and 1.

  1. logistic regression

Since we know that $y \in \{0, 1\}$, it doesn't make sense for $h_\theta(x)$ to take values larger than 1 or smaller than 0. Choose
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1+e^{-\theta^T x}}$$
in which $g(z)=\frac{1}{1+e^{-z}}$ is called the logistic function or the sigmoid function, and its derivative satisfies $g'(z)=g(z)(1-g(z))$.
Let us assume that
$$P(y=1 \mid x;\theta) = h_\theta(x), \qquad P(y=0 \mid x;\theta) = 1 - h_\theta(x)$$
Note that this can be written more compactly as
$$P(y \mid x;\theta) = \left(h_\theta(x)\right)^{y}\left(1-h_\theta(x)\right)^{1-y}$$
Assuming that the $n$ training examples were generated independently, we can then write down the likelihood of the parameters as
$$L(\theta) = P(\vec{y} \mid X;\theta) = \prod_{i=1}^{n} P(y^{(i)} \mid x^{(i)};\theta) = \prod_{i=1}^{n}\left(h_\theta(x^{(i)})\right)^{y^{(i)}}\left(1-h_\theta(x^{(i)})\right)^{1-y^{(i)}}$$
As before, it will be easier to maximize the log likelihood:
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n}\left[y^{(i)}\log h(x^{(i)}) + (1-y^{(i)})\log\left(1-h(x^{(i)})\right)\right]$$
Use gradient ascent: $\theta := \theta + \alpha \nabla_\theta \ell(\theta)$.
Taking derivatives with respect to a single training example $(x, y)$ gives the stochastic gradient ascent rule:
$$\frac{\partial}{\partial \theta_j}\ell(\theta) = \left(y - h_\theta(x)\right)x_j$$
Therefore,
$$\theta_j := \theta_j + \alpha\left(y^{(i)}-h_\theta(x^{(i)})\right)x_j^{(i)}$$
If we compare this to the LMS update rule, we see that it looks identical; but this is not the same algorithm, because $h_\theta(x^{(i)})$ is now defined as a non-linear function of $\theta^T x^{(i)}$. Nonetheless, it's a little surprising that we end up with the same update rule for a rather different algorithm and learning problem.
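A minimal sketch of this update rule (the epoch count and learning rate are arbitrary assumptions):

```python
import numpy as np

def sigmoid(z):
    """The logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sga(X, y, alpha=0.1, num_epochs=50):
    """Stochastic gradient ascent on the log likelihood l(theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for i in range(X.shape[0]):
            error = y[i] - sigmoid(X[i] @ theta)   # same form as LMS, different h
            theta = theta + alpha * error * X[i]
    return theta
```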

multiclass classification

unsupervised learning

The dataset contains no labels; the goal is to find interesting structure in the data.

reinforcement learning

The algorithm can collect data interactively: try a strategy, collect feedback, and improve the strategy based on that feedback.
The key is how to define "reward" and "punishment".
