Course 1: Supervised Machine Learning: Regression and Classification
Week 1: Introduction to Machine Learning
supervised learning vs. unsupervised learning
supervised learning:
algorithms that learn mappings from x to y: you give your learning algorithm examples to learn from, together with the “right answers” (output labels).
e.g.
input(X) | output(Y) | application |
---|---|---|
email | spam? (0/1) | spam filtering |
audio | text transcript | speech recognition |
English | Spanish | machine translation |
ad, user info | click? (0/1) | online advertising |
image, radar info | position of other cars | self-driving car |
image of phone | defect? (0/1) | visual inspection |
Regression: predict a number from infinitely many possible outputs
Classification: predict categories from a small number of possible outputs
unsupervised learning:
given data that isn’t associated with any output label y, find some structure, pattern, or something interesting in the unlabeled data
Clustering: group similar data points together. e.g. Google news, DNA microarray, grouping customers (a small sketch follows this list)
Anomaly Detection: find unusual data points. e.g. fraud detection
Dimensionality Reduction: compress data using fewer numbers
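As a rough illustration of clustering (my own sketch, not from the course; it assumes NumPy and scikit-learn are available, and the variable names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points: two loose groups, but no output label y is given.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

# K-means groups similar points together purely from the structure of the data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1] -- two groups found; index order may vary
```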
Regression model
Linear Regression with one variable
Notation:
x = “input” variable, feature
y = “output” variable, “target” variable
m = number of training examples
(x, y) = single training example
(x^{(i)}, y^{(i)}) = i-th training example
Univariate linear regression: linear regression with one variable, f_{w,b}(x) = wx + b
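A minimal NumPy sketch of this model (the function and variable names are my own, not the course’s lab code):

```python
import numpy as np

def predict(x, w, b):
    """Univariate linear model: f_{w,b}(x) = w*x + b."""
    return w * x + b

# With w = 2 and b = 1, inputs [1, 2, 3] predict [3, 5, 7].
print(predict(np.array([1.0, 2.0, 3.0]), w=2.0, b=1.0))
```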
Cost Function:
squared-error cost function
J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2
where \hat{y}^{(i)} = f_{w,b}(x^{(i)})
The cost surface is bowl-shaped for the squared-error cost function.
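A minimal NumPy sketch of computing this cost (illustrative names, not the course’s lab code):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared-error cost J(w,b) = (1/(2m)) * sum_i (f_{w,b}(x^(i)) - y^(i))^2."""
    m = x.shape[0]            # number of training examples
    y_hat = w * x + b         # predictions f_{w,b}(x^(i)) for all examples at once
    return np.sum((y_hat - y) ** 2) / (2 * m)

# On data that lies exactly on y = 2x + 1, the cost at (w=2, b=1) is 0.
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([3.0, 5.0, 7.0])
print(compute_cost(x_train, y_train, w=2.0, b=1.0))  # 0.0
```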
Train the model with gradient descent
Gradient Descent:
repeat until convergence:
w = w - \alpha \frac{\partial}{\partial w} J(w,b)
b = b - \alpha \frac{\partial}{\partial b} J(w,b)
where \alpha is the learning rate
Note: simultaneously update w and b. Simultaneous means that you calculate the partial derivatives for all the parameters before updating any of the parameters.
Choosing a different starting point (even one just a few steps away from the original) may lead gradient descent to a different local minimum.
Learning Rate:
if \alpha is too small, gradient descent will work but may be slow.
if \alpha is too large, gradient descent may overshoot and never reach the minimum; it may fail to converge, or even diverge.
If already at a local minimum, gradient descent leaves w unchanged (since the slope = 0).
Gradient descent can reach a local minimum with a fixed learning rate because, as we get nearer a local minimum, the derivative automatically gets smaller, so gradient descent automatically takes smaller steps.
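A tiny illustration of this on a one-parameter cost J(w) = w^2 (my own example, not from the course): with a fixed learning rate, the step \alpha * dJ/dw shrinks on its own as w approaches the minimum, because the derivative does.

```python
# Gradient descent on J(w) = w^2, whose derivative is dJ/dw = 2w.
w, alpha = 4.0, 0.1
for i in range(5):
    grad = 2 * w                # derivative at the current w
    step = alpha * grad         # step size shrinks as the derivative shrinks
    print(f"iter {i}: w = {w:.4f}, step = {step:.4f}")
    w = w - step
```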
Gradient Descent for Linear Regression:
w = w - \alpha \frac{\partial}{\partial w} J(w,b)
b = b - \alpha \frac{\partial}{\partial b} J(w,b)
where
\frac{\partial}{\partial w} J(w,b) = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)}) x^{(i)}
\frac{\partial}{\partial b} J(w,b) = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})
The squared-error cost function is convex (bowl-shaped), so it has a single global minimum and no other local minima. As long as the learning rate is chosen appropriately, gradient descent will always converge to that global minimum.
“Batch” gradient descent: each step of gradient descent uses all the training examples.
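Putting the update rules and derivatives together, here is a sketch of batch gradient descent for univariate linear regression (function and variable names are my own; a minimal implementation under these assumptions, not the course’s lab code):

```python
import numpy as np

def compute_gradients(x, y, w, b):
    """Partial derivatives of J(w,b), averaged over all m training examples."""
    m = x.shape[0]
    err = (w * x + b) - y                 # f_{w,b}(x^(i)) - y^(i) for every example
    dj_dw = np.sum(err * x) / m
    dj_db = np.sum(err) / m
    return dj_dw, dj_db

def gradient_descent(x, y, w, b, alpha=0.01, num_iters=1000):
    """Batch gradient descent: each step uses all the training examples."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradients(x, y, w, b)
        # Simultaneous update: both derivatives are computed from the current
        # (w, b) before either parameter is overwritten.
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Tiny example: data generated from y = 2x + 1 should recover w ≈ 2, b ≈ 1.
x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([3.0, 5.0, 7.0, 9.0])
w, b = gradient_descent(x_train, y_train, w=0.0, b=0.0, alpha=0.05, num_iters=5000)
print(w, b)
```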