Regression
- A set of functions: $y = b + w \cdot x_{cp}$
- Goodness of function: $L(f)=\sum_n\left(\hat{y}^n-f(x_{cp}^n)\right)^2$
- Pick the "best" function: $f^*=\arg\min_{f} L(f)$
- Gradient Descent (a minimal code sketch follows this list): $w^1 \leftarrow w^0- \eta\frac{\partial L}{\partial w}\big|_{w=w^0,b=b^0}$, $b^1 \leftarrow b^0- \eta\frac{\partial L}{\partial b}\big|_{w=w^0,b=b^0}$
Writing all the partial derivatives as a vector gives the gradient.
- Training data: $(x^1,\hat{y}^1),\dots,(x^n,\hat{y}^n)$
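A minimal runnable sketch of the steps above, assuming a tiny made-up dataset and an illustrative learning rate (none of these numbers come from the notes):

```python
import numpy as np

# Made-up (x_cp, y_hat) pairs; values are illustrative only.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y_hat = np.array([25.0, 45.0, 70.0, 88.0, 112.0])

b, w = 0.0, 0.0   # initial parameters b^0, w^0
eta = 1e-4        # learning rate (assumed value)

for step in range(100000):
    residual = y_hat - (b + w * x)          # y_hat^n - f(x_cp^n) under y = b + w * x_cp
    # Partial derivatives of L(f) = sum_n (y_hat^n - f(x_cp^n))^2
    grad_b = -2.0 * residual.sum()
    grad_w = -2.0 * (residual * x).sum()
    # Gradient descent update: move each parameter against its partial derivative.
    b -= eta * grad_b
    w -= eta * grad_w

# Many iterations are needed here because x is not scaled (see Feature Scaling below).
print(b, w)
```

For this toy data the closed-form least-squares solution is roughly $w \approx 2.17$, $b \approx 2.9$, which the loop should approach.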
Choosing a higher-order model (a lower-order model is a subset of the higher-order one) reduces the average error on the training data, but watch out for overfitting.
Regularization: add $\lambda \sum_i (w_i)^2$ to the loss.
A smoother function is more likely to be correct.
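As a check on how the penalty changes the update (a standard derivation, not verbatim from the notes): differentiating the regularized loss with respect to $w_i$ adds a $2\lambda w_i$ term, so each gradient step also shrinks the weights,

$$L_{\text{reg}}(f)=\sum_n\left(\hat{y}^n-f(x_{cp}^n)\right)^2+\lambda\sum_i (w_i)^2
\quad\Rightarrow\quad
w_i \leftarrow w_i-\eta\left(\frac{\partial L}{\partial w_i}+2\lambda w_i\right)=(1-2\eta\lambda)\,w_i-\eta\frac{\partial L}{\partial w_i}$$

The bias $b$ is usually left out of the penalty, since shifting the function up or down does not affect its smoothness.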
Bias and Variance
A simple model is less affected by the sampled data;
a complex model has higher variance but lower bias.
Large bias (underfitting): add more features; use a more complex model.
Large variance (overfitting): collect more data; use regularization.
Cross Validation
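A minimal sketch of N-fold cross-validation for choosing the model order; the synthetic data, fold count, and the use of `np.polyfit` are illustrative assumptions, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data standing in for the examples; purely illustrative.
x = rng.uniform(0, 10, size=60)
y = 2.0 * x + 5.0 + rng.normal(0.0, 1.0, size=60)

n_folds = 3
folds = np.array_split(rng.permutation(len(x)), n_folds)

def cv_error(degree):
    """Average held-out squared error of a degree-`degree` polynomial fit."""
    errors = []
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        coeffs = np.polyfit(x[train], y[train], degree)  # fit on the training folds
        pred = np.polyval(coeffs, x[val])                # predict on the held-out fold
        errors.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(errors))

# Choose the model by validation error, not by the (always decreasing) training error.
best_degree = min(range(1, 6), key=cv_error)
print("selected degree:", best_degree)
```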
Gradient Descent
- Tune the learning rate; it helps to plot the loss against the number of parameter updates.
- Adaptive learning rates (Adagrad): $w^{t+1} \leftarrow w^{t}-\frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^{i})^{2}}}\,g^{t}$. The numerator $g^t$ says a larger gradient gives a larger update, while the denominator says a larger accumulated gradient gives a smaller update, which looks like a contradiction (a code sketch follows this list).
The best step size is the magnitude of the first derivative divided by the second derivative; Adagrad's denominator acts as a proxy for the second derivative using only first-order information.
- Stochastic gradient descent: update the parameters after seeing just one example.
- Feature Scaling: $x^r_i \leftarrow \frac{x^r_i-m_i}{\sigma_i}$, where $x^r_i$ is the $i$-th feature of the $r$-th example, and $m_i$, $\sigma_i$ are that feature's mean and standard deviation.
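A minimal sketch combining these three points (feature scaling, per-example updates, and the Adagrad denominator); the toy data and the learning rate are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-feature dataset; values are illustrative only.
x = rng.uniform(0, 10, size=50)
y_hat = 0.5 * x + 1.0 + rng.normal(0.0, 0.1, size=50)

# Feature scaling: x_i^r <- (x_i^r - m_i) / sigma_i
m, sigma = x.mean(), x.std()
x_scaled = (x - m) / sigma

b, w = 0.0, 0.0
eta = 0.5                    # base learning rate (assumed value)
acc_b, acc_w = 0.0, 0.0      # Adagrad accumulators: sums of squared past gradients
eps = 1e-8                   # avoids division by zero on the first update

for epoch in range(20):
    # Stochastic gradient descent: update after every single example.
    for n in rng.permutation(len(x_scaled)):
        error = y_hat[n] - (b + w * x_scaled[n])
        g_b = -2.0 * error                  # gradient of this example's squared error
        g_w = -2.0 * error * x_scaled[n]
        acc_b += g_b ** 2
        acc_w += g_w ** 2
        # Adagrad: each parameter's step is eta / sqrt(accumulated g^2) * g
        b -= eta / (np.sqrt(acc_b) + eps) * g_b
        w -= eta / (np.sqrt(acc_w) + eps) * g_w

print(b, w)   # note that w is expressed in scaled-feature units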
Theoretical Foundation
Taylor expansion:
$$h(x)=h(x_0)+h'(x_0)(x-x_0)+\frac{h''(x_0)}{2!}(x-x_0)^2+\dots$$
When $x$ is very close to $x_0$: $h(x)\approx h(x_0)+h'(x_0)(x-x_0)$.
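A quick numeric sanity check of the first-order approximation (the function $h$ and the point $x_0$ here are arbitrary choices, not from the notes):

```python
import numpy as np

# Arbitrary smooth function and expansion point, chosen only for illustration.
h = np.sin
x0 = 1.0
h_prime = np.cos(x0)          # h'(x0) for h = sin

for dx in (0.5, 0.1, 0.01):
    x = x0 + dx
    linear = h(x0) + h_prime * (x - x0)     # first-order Taylor approximation
    print(dx, abs(h(x) - linear))           # error shrinks roughly like dx**2
```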
Multivariate Taylor expansion:
$$h(x, y) \approx h\left(x_{0}, y_{0}\right)+\frac{\partial h\left(x_{0}, y_{0}\right)}{\partial x}\left(x-x_{0}\right)+\frac{\partial h\left(x_{0}, y_{0}\right)}{\partial y}\left(y-y_{0}\right)$$
So the loss function (with two parameters) can be Taylor-expanded in the same way.
The Taylor approximation only holds when the radius of the circle around the current point is small enough, and that radius is proportional to the learning rate.
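To spell out the step the notes compress here, a standard derivation (not verbatim from the notes), with $\theta_1,\theta_2$ as the two parameters and $(a,b)$ the current point: inside a small circle around $(a,b)$ the loss is approximately linear,

$$L(\theta)\approx L(a,b)+u\,(\theta_1-a)+v\,(\theta_2-b),\qquad u=\frac{\partial L(a,b)}{\partial \theta_1},\;\; v=\frac{\partial L(a,b)}{\partial \theta_2}$$

Minimizing this linear function subject to $(\theta_1-a)^2+(\theta_2-b)^2\le d^2$ means moving from $(a,b)$ exactly opposite to $(u,v)$:

$$\begin{bmatrix}\theta_1\\ \theta_2\end{bmatrix}=\begin{bmatrix}a\\ b\end{bmatrix}-\eta\begin{bmatrix}u\\ v\end{bmatrix}$$

This is the gradient descent update; the learning rate $\eta$ plays the role of $d/\lVert(u,v)\rVert$, so it is proportional to the radius $d$ and must be small for the approximation, and hence the update rule, to be valid.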