Course 1: Supervised Machine Learning: Regression and Classification
Week 1: Introduction to Machine Learning
supervised learning vs. unsupervised learning
supervised learning:
algorithms that learn mappings from x to y: you give your learning algorithm examples to learn from, together with the “right answers” (output labels).
e.g.
input(X) | output(Y) | application |
---|---|---|
email | spam? (0/1) | spam filtering |
audio | text transcript | speech recognition |
English | Spanish | machine translation |
ad, user info | click? (0/1) | online advertising |
image, radar info | position of other cars | self-driving car |
image of phone | defect? (0/1) | visual inspection |
Regression: predict a number from infinitely many possible outputs
Classification: predict categories from a small number of possible outputs
unsupervised learning:
given data that isn’t associated with any output label y, find some structure, pattern, or something interesting in the unlabeled data
Clustering: group similar data points together. e.g. Google news, DNA microarray, grouping customers (a small sketch follows this list)
Anomaly Detection: find unusual data points. e.g. fraud detection
Dimensionality Reduction: compress data using fewer numbers
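As a rough illustration of clustering (my own sketch, not from the course; it assumes NumPy and scikit-learn are available, and the variable names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points: two loose groups, but no output label y is given.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

# K-means groups similar points together purely from the structure of the data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1] -- two groups found; index order may vary
```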
Regression model
Linear Regression with one variable
Notation:
x = “input” variable, feature
y = “output” variable, “target” variable
m = number of training examples
(x, y) = single training example
(x^{(i)}, y^{(i)}) = i-th training example
Univariate linear regression: linear regression with one variable, f_{w,b}(x) = wx + b
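A minimal NumPy sketch of this model (the function and variable names are my own, not the course’s lab code):

```python
import numpy as np

def predict(x, w, b):
    """Univariate linear model: f_{w,b}(x) = w*x + b."""
    return w * x + b

# With w = 2 and b = 1, inputs [1, 2, 3] predict [3, 5, 7].
print(predict(np.array([1.0, 2.0, 3.0]), w=2.0, b=1.0))
```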
Cost Function:
squared-error cost function
J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2
where \hat{y}^{(i)} = f_{w,b}(x^{(i)})
The cost surface is bowl-shaped for the squared-error cost function.
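A minimal NumPy sketch of computing this cost (illustrative names, not the course’s lab code):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared-error cost J(w,b) = (1/(2m)) * sum_i (f_{w,b}(x^(i)) - y^(i))^2."""
    m = x.shape[0]            # number of training examples
    y_hat = w * x + b         # predictions f_{w,b}(x^(i)) for all examples at once
    return np.sum((y_hat - y) ** 2) / (2 * m)

# On data that lies exactly on y = 2x + 1, the cost at (w=2, b=1) is 0.
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([3.0, 5.0, 7.0])
print(compute_cost(x_train, y_train, w=2.0, b=1.0))  # 0.0
```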
Train the model with gradient descent
Gradient Descent:
repeat until convergence:
w = w - \alpha \frac{\partial}{\partial w} J(w,b)
b = b - \alpha \frac{\partial}{\partial b} J(w,b)
where \alpha is the learning rate
Note: simultaneously update w and b. Simultaneous means that you calculate the partial derivatives for all the parameters before updating any of the parameters.
Choosing a different starting point (even one just a few steps away from the original) may lead gradient descent to a different local minimum.
Learning Rate:
if \alpha is too small, gradient descent will work but may be slow.
if \alpha is too large, gradient descent may overshoot and never reach the minimum; it may fail to converge, or even diverge.
If already at a local minimum, gradient descent leaves w unchanged (since the slope = 0).
Gradient descent can reach a local minimum with a fixed learning rate because, as we get nearer a local minimum, the derivative automatically gets smaller, so gradient descent automatically takes smaller steps.
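A tiny illustration of this on a one-parameter cost J(w) = w^2 (my own example, not from the course): with a fixed learning rate, the step \alpha * dJ/dw shrinks on its own as w approaches the minimum, because the derivative does.

```python
# Gradient descent on J(w) = w^2, whose derivative is dJ/dw = 2w.
w, alpha = 4.0, 0.1
for i in range(5):
    grad = 2 * w                # derivative at the current w
    step = alpha * grad         # step size shrinks as the derivative shrinks
    print(f"iter {i}: w = {w:.4f}, step = {step:.4f}")
    w = w - step
```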
Gradient Descent for Linear Regression:
w = w - \alpha \frac{\partial}{\partial w} J(w,b)
b = b - \alpha \frac{\partial}{\partial b} J(w,b)
where
\frac{\partial}{\partial w} J(w,b) = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)}) x^{(i)}
\frac{\partial}{\partial b} J(w,b) = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})
The squared-error cost function is convex (bowl-shaped), so it has a single global minimum and no other local minima. As long as the learning rate is chosen appropriately, gradient descent will always converge to that global minimum.
“Batch” gradient descent: each step of gradient descent uses all the training examples.
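Putting the update rules and derivatives together, here is a sketch of batch gradient descent for univariate linear regression (function and variable names are my own; a minimal implementation under these assumptions, not the course’s lab code):

```python
import numpy as np

def compute_gradients(x, y, w, b):
    """Partial derivatives of J(w,b), averaged over all m training examples."""
    m = x.shape[0]
    err = (w * x + b) - y                 # f_{w,b}(x^(i)) - y^(i) for every example
    dj_dw = np.sum(err * x) / m
    dj_db = np.sum(err) / m
    return dj_dw, dj_db

def gradient_descent(x, y, w, b, alpha=0.01, num_iters=1000):
    """Batch gradient descent: each step uses all the training examples."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradients(x, y, w, b)
        # Simultaneous update: both derivatives are computed from the current
        # (w, b) before either parameter is overwritten.
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Tiny example: data generated from y = 2x + 1 should recover w ≈ 2, b ≈ 1.
x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([3.0, 5.0, 7.0, 9.0])
w, b = gradient_descent(x_train, y_train, w=0.0, b=0.0, alpha=0.05, num_iters=5000)
print(w, b)
```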