HEU Machine Learning Fundamentals 0.02
Models
Preface
Preview
Based on Andrew Ng's course.
Terminology
- Training set: the data used to train the model
- x = input variable (feature)
- y = output variable (target variable)
- m = number of training examples
- (x, y) = a single training example
- $(x^{(i)}, y^{(i)})$ = the $i$-th training example (the superscript is an index, not an exponent)
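To make the notation concrete, here is a tiny hypothetical training set in NumPy (all values invented for illustration):

```python
import numpy as np

# Hypothetical training set: inputs x and targets y
x_train = np.array([1.0, 2.0, 3.0])          # x^(i): input features
y_train = np.array([300.0, 500.0, 700.0])    # y^(i): targets

m = x_train.shape[0]  # m = number of training examples
for i in range(m):
    # (x^(i), y^(i)) is the i-th training example (0-based in code)
    print(f"(x^({i}), y^({i})) = ({x_train[i]}, {y_train[i]})")
```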
Linear Regression Model
Fit a straight line to the experimental data.
Widely used in supervised learning: both the inputs and the corresponding outputs (the x- and y-axis values) are given, forming a functional relationship.
Regression Model
Predicts numbers: a linear regression model whose output is a number is called a regression model.
There are infinitely many possible outputs.
The regression function $f$ is written as $f_{w,b}(x) = wx + b = \hat{y}$.
Linear regression with one variable (a single feature $x$)
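A minimal sketch of this model in code (the parameter values below are placeholders, not fitted values):

```python
def predict(x, w, b):
    """Linear regression model f_{w,b}(x) = w*x + b; returns y-hat."""
    return w * x + b

# With placeholder parameters w = 200, b = 100:
print(predict(1.5, w=200.0, b=100.0))  # 400.0
```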
Classification Model
Predicts categories; only a small number of possible outputs.
Cost Function
The cost function measures how well the model fits the data.
For the linear regression function
$$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{Linear regression function}$$
we need to find $w$ and $b$ such that, at each training example $(x^{(i)}, y^{(i)})$, the prediction $\hat{y}^{(i)}$ is close to $y^{(i)}$.
Hence we define the cost function:
$$J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2 = \frac{1}{2m}\sum\limits_{i = 0}^{m-1}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$
(both sums run over the same $m$ examples; the second form uses the 0-based indexing convention of code)
where $\hat{y}^{(i)} - y^{(i)}$ is the error and $m$ is the number of training examples.
The smaller $J(w,b)$ (i.e., the lower the cost), the better the regression fits the data.
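A minimal NumPy sketch of this cost computation (the name compute_cost and the sample values are my own, following the 0-based code convention):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """J(w,b) = (1/2m) * sum_i (f_wb(x^(i)) - y^(i))^2."""
    m = x.shape[0]
    errors = (w * x + b) - y          # y-hat^(i) - y^(i), the error terms
    return np.sum(errors ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([300.0, 500.0, 700.0])
print(compute_cost(x, y, w=200.0, b=100.0))  # 0.0: perfect fit, minimal cost
print(compute_cost(x, y, w=150.0, b=100.0))  # ~5833.3: worse fit, higher cost
```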
Gradient Descent
Gradient descent can be used to minimize a function; here it is used to minimize the cost function.
For the linear regression function, the cost surface is bowl-shaped (convex): there is only one local minimum, which is also the global minimum.
For non-linear models the surface is more complex and may be hammock-shaped, with multiple local minima.
By taking repeated gradient descent steps, we find the location of a local minimum of the cost function:
$$w = w - \alpha \frac{\partial J(w,b)}{\partial w}\tag{Gradient descent for w}$$
$$b = b - \alpha \frac{\partial J(w,b)}{\partial b}\tag{Gradient descent for b}$$
Steps for gradient descent calculation:
$$temp\_w = w - \alpha \frac{\partial J(w,b)}{\partial w}\tag{1}$$
$$temp\_b = b - \alpha \frac{\partial J(w,b)}{\partial b}\tag{2}$$
(the derivative in (2) is evaluated at the original $(w,b)$, not at $temp\_w$, so that $w$ and $b$ are updated simultaneously)
$$w = temp\_w\tag{3}$$
$$b = temp\_b\tag{4}$$
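The four steps translate directly into code; a minimal sketch (dj_dw and dj_db stand for the two partial derivatives, derived in the next part):

```python
def gradient_descent_step(w, b, dj_dw, dj_db, alpha):
    """One simultaneous update: both derivatives use the old (w, b)."""
    temp_w = w - alpha * dj_dw   # (1)
    temp_b = b - alpha * dj_db   # (2): uses the old w, not temp_w
    w = temp_w                   # (3)
    b = temp_b                   # (4)
    return w, b
```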
The partial-derivative terms evaluate to:
$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$$
$$\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$
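Both sums vectorize directly in NumPy; a minimal sketch (the name compute_gradient is my own), followed by a usage loop built on the gradient_descent_step sketch above (alpha = 0.1 is a placeholder choice):

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Return dJ/dw and dJ/db of the squared-error cost over m examples."""
    m = x.shape[0]
    errors = (w * x + b) - y            # f_wb(x^(i)) - y^(i)
    dj_dw = np.sum(errors * x) / m      # (1/m) * sum(error * x^(i))
    dj_db = np.sum(errors) / m          # (1/m) * sum(error)
    return dj_dw, dj_db

# Usage: repeated simultaneous updates on the toy data from above
x = np.array([1.0, 2.0, 3.0])
y = np.array([300.0, 500.0, 700.0])
w, b = 0.0, 0.0
for _ in range(1000):
    dj_dw, dj_db = compute_gradient(x, y, w, b)
    w, b = gradient_descent_step(w, b, dj_dw, dj_db, alpha=0.1)
print(w, b)  # approaches w = 200, b = 100 for this toy data
```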
$\alpha$ denotes the learning rate.
In the gradient descent formulas, the learning rate acts as the weight with which the cost gradient (i.e., how poorly the regression function currently fits) adjusts the regression function's parameters:
- Learning rate too small: gradient descent is slow, making the search for a good regression function inefficient
- Learning rate too large: the steps are too big, causing overshoot, so it is hard to settle at a local minimum; gradient descent may fail to converge or even diverge (a toy demonstration follows below)
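A toy demonstration of these failure modes, minimizing $J(w) = w^2$ (so $\frac{dJ}{dw} = 2w$) with a few arbitrary, illustrative values of $\alpha$:

```python
def descend(alpha, steps=10, w=10.0):
    """Run gradient descent on J(w) = w^2, whose derivative is 2w."""
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

print(descend(alpha=0.01))  # too small: w ~ 8.17, barely moved toward 0
print(descend(alpha=0.1))   # reasonable: w ~ 1.07, approaching 0
print(descend(alpha=1.1))   # too large: w ~ 61.9, overshoot -> divergence
```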
To improve efficiency, a variable learning rate can be used, i.e. $\alpha$ itself varies during gradient descent.
Near a local minimum, $\frac{\partial J(w,b)}{\partial w}$ becomes small and its rate of change also decreases, so the update steps naturally shrink even with a fixed $\alpha$.
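A quick numeric illustration of this effect, again on the toy cost $J(w) = w^2$ with a fixed $\alpha$: the gradient, and hence the step size, shrinks as $w$ approaches the minimum.

```python
w, alpha = 10.0, 0.1
for step in range(5):
    grad = 2 * w                 # dJ/dw for J(w) = w^2
    print(f"step {step}: w = {w:.3f}, gradient = {grad:.3f}")
    w = w - alpha * grad         # fixed alpha, yet steps keep shrinking
```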