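Basic Concepts of Machine Learning
2 Choosing a Model
Criteria for judging a good model: omitted
Goal: minimize the error as much as possible. The error comes from two sources, bias and variance.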
2.1 Bias & Variance
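- bias: the gap between the expected output of the models fitted on the samples and the true value
- variance: how much the output of a model trained on the samples changes from one training set to another; a high-variance model fits its training samples closely but performs unstably on the test set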
- Definition of variance:
  $$
  \begin{aligned}
  \operatorname{Var}[X] &= E\left[(X-\mu)^{2}\right] = E\left[X^{2}-2 X \mu+\mu^{2}\right] = E\left[X^{2}\right]-2 \mu^{2}+\mu^{2} = E\left[X^{2}\right]-\mu^{2} \\
  E\left[X^{2}\right] &= \operatorname{Var}[X]+(E[X])^{2}
  \end{aligned}
  $$
- Expectation of a test sample y, where f is the true (deterministic) target value and ε is zero-mean noise with variance σ²:
  $$
  \begin{array}{c}
  E[f]=f \\
  y=f+\varepsilon \\
  E[\varepsilon]=0 \\
  \operatorname{Var}[\varepsilon]=\sigma^{2} \\
  E[y]=E[f+\varepsilon]=f
  \end{array}
  $$
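The error from Part 02 of this series can be split into bias and variance. For a simple model (left), the error comes mainly from large bias; this case is called Underfitting. For a complex model (right), the error comes mainly from large variance; this case is called Overfitting.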
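The split into bias and variance can be checked numerically. The sketch below is only an illustration under assumptions that are not in the original notes: the true function `true_f`, the noise level `sigma`, the fixed training grid, and the polynomial degrees 1 and 9 are all made-up choices. It refits a simple and a complex polynomial on many noisy training sets and estimates the squared bias and the variance of each.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical ground-truth function f, chosen only for this illustration
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_trials=200, n_train=20, sigma=0.3):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    x_train = np.linspace(0, 1, n_train)   # fixed inputs; only the noise is redrawn
    x_test = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_trials, x_test.size))
    for k in range(n_trials):
        # y = f + eps with E[eps] = 0 and var[eps] = sigma^2
        y_train = true_f(x_train) + rng.normal(0, sigma, n_train)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[k] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)          # estimate of the expected model output
    bias_sq = np.mean((mean_pred - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))   # spread of the fits across training sets
    return bias_sq, variance

simple_bias, simple_var = bias_variance(degree=1)
complex_bias, complex_var = bias_variance(degree=9)
print(f"degree 1: bias^2={simple_bias:.4f}, variance={simple_var:.4f}")
print(f"degree 9: bias^2={complex_bias:.4f}, variance={complex_var:.4f}")
```

Under these assumptions the degree-1 model should show the larger squared bias (underfitting) and the degree-9 model the larger variance (overfitting); the exact numbers depend on the chosen function and noise level.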
2.2 Model Selection
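- Should NOT do: choose the Model directly by its results on the Testing Set. The Testing Set has its own bias, so a Model picked this way is tuned to that particular set and usually performs worse on truly unseen data.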
Holdout Method
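- Split the data set D into two mutually exclusive parts, a training set S and a test set T; train the model on S and evaluate it on T.
- Keep the data distributions of S and T as consistent as possible, so that the split itself does not introduce extra bias into the final result.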
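The two points above can be sketched in a few lines of NumPy. The helper name `holdout_split`, the `test_ratio` of 0.3, and the toy data are assumptions made for this example, not part of the original notes; random shuffling before the cut is one simple way to keep S and T similarly distributed.

```python
import numpy as np

def holdout_split(X, y, test_ratio=0.3, seed=0):
    """Shuffle the indices once, then cut them into disjoint train / test parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data set D with 10 samples (illustrative only)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_S, X_T, y_S, y_T = holdout_split(X, y)
print(X_S.shape, X_T.shape)  # 7 training samples, 3 test samples
```

For classification data, scikit-learn's `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts, which may be a better fit than plain shuffling.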
N-fold Cross Validation
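Used to reduce the bias of a single Validation Set: the training data is split into N folds, each fold serves as the Validation Set once while the rest is used for training, and the N validation results are averaged.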
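A minimal sketch of N-fold cross validation used for model selection, written with NumPy only. The function name `k_fold_mse`, the synthetic quadratic data, and the candidate polynomial degrees are assumptions for illustration; the notes do not prescribe a particular model family.

```python
import numpy as np

def k_fold_mse(X, y, degree, n_folds=5, seed=0):
    """Average validation MSE of a polynomial of the given degree over n_folds folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for k in range(n_folds):
        val_idx = folds[k]  # fold k is held out for validation
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        coeffs = np.polyfit(X[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, X[val_idx])
        scores.append(np.mean((pred - y[val_idx]) ** 2))
    return float(np.mean(scores))

# Illustrative 1-D data: a quadratic trend plus Gaussian noise
rng = np.random.default_rng(1)
X = np.linspace(-1, 1, 60)
y = 1.0 + 2.0 * X - 3.0 * X ** 2 + rng.normal(0, 0.2, X.size)

# Compare candidate models by their averaged validation error
cv_scores = {d: k_fold_mse(X, y, d) for d in (1, 2, 8)}
best_degree = min(cv_scores, key=cv_scores.get)
print(cv_scores, best_degree)
```

The model with the lowest averaged validation error is selected and can then be retrained on the full training data; the Testing Set is touched only once at the end.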
3 Optimization Methods
- Vanilla Gradient descent
  $$
  w^{t+1} \leftarrow w^{t}-\eta^{t} g^{t}, \qquad \eta^{t}=\frac{\eta}{\sqrt{t+1}}, \qquad g^{t}=\frac{\partial C\left(\theta^{t}\right)}{\partial w}
  $$
- Adagrad
  $$
  \left.\begin{array}{l}
  w^{t+1} \leftarrow w^{t}-\frac{\eta^{t}}{\sigma^{t}} g^{t} \\
  \sigma^{t}=\sqrt{\frac{1}{t+1} \sum_{i=0}^{t}\left(g^{i}\right)^{2}}
  \end{array}\right\} \Rightarrow w^{t+1} \leftarrow w^{t}-\frac{\eta}{\sqrt{\sum_{i=0}^{t}\left(g^{i}\right)^{2}}} g^{t}
  $$
  $\sigma^{t}$: root mean square of the previous derivatives of parameter w. With $\eta^{t}=\eta / \sqrt{t+1}$, the factor $\sqrt{t+1}$ cancels, which gives the right-hand form.
- Gradient Descent
  $$
  \theta^{i}=\theta^{i-1}-\eta \nabla L\left(\theta^{i-1}\right)
  $$
- Stochastic Gradient Descent
  $$
  L^{n}=\left(\hat{y}^{n}-\left(b+\sum w_{i} x_{i}^{n}\right)\right)^{2}, \qquad \theta^{i}=\theta^{i-1}-\eta \nabla L^{n}\left(\theta^{i-1}\right)
  $$
  $L^{n}$ is the loss of a single example n; the parameters are updated after each example instead of after a pass over all the data.
- Gradient descent: Feature Scaling
  Scale the features to roughly the same range.
- Steepest Gradient descent
  $$
  \begin{array}{c}
  \text{Choose } t_{k} \text{ to minimize } g(t):=f\left(\mathbf{x}^{(k)}+t \mathbf{d}^{(k)}\right) \text{ over } t \geq 0 \\
  \text{Set } \mathbf{x}^{(k+1)}=\mathbf{x}^{(k)}+t_{k} \mathbf{d}^{(k)}
  \end{array}
  $$
  Here $\mathbf{d}^{(k)}$ is the search direction at step k (the negative gradient $-\nabla f(\mathbf{x}^{(k)})$ in steepest descent).
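The Vanilla Gradient descent and Adagrad updates listed above can be compared on a small example. This is a sketch under assumptions: the quadratic cost, the starting point, the learning rates, and the small `eps` added to the denominator for numerical safety are all illustrative choices that do not come from the notes. The cost is deliberately much steeper in one direction than in the other, which is the situation that feature scaling also tries to avoid.

```python
import numpy as np

def grad(w):
    # Gradient of the illustrative cost C(w) = 0.5 * (w[0]**2 + 100 * w[1]**2)
    return np.array([1.0, 100.0]) * w

def vanilla_gd(w0, eta, n_steps):
    w = np.array(w0, dtype=float)
    for t in range(n_steps):
        g = grad(w)
        w -= eta / np.sqrt(t + 1) * g      # eta^t = eta / sqrt(t + 1)
    return w

def adagrad(w0, eta, n_steps, eps=1e-12):
    w = np.array(w0, dtype=float)
    sum_sq = np.zeros_like(w)              # running sum of (g^i)^2, one per parameter
    for t in range(n_steps):
        g = grad(w)
        sum_sq += g ** 2
        w -= eta / (np.sqrt(sum_sq) + eps) * g
    return w

# One shared learning rate has to stay small enough for the steep direction w[1]
w_vanilla = vanilla_gd([1.0, 1.0], eta=0.01, n_steps=200)
# Adagrad rescales each parameter by its own gradient history
w_adagrad = adagrad([1.0, 1.0], eta=0.5, n_steps=200)
print("vanilla:", w_vanilla)
print("adagrad:", w_adagrad)
```

In this setup the vanilla update makes little progress along the flat direction `w[0]`, while Adagrad moves both parameters toward the minimum at the same pace because each step is divided by that parameter's own accumulated gradient magnitude.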
SGD and MBGD code
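    # Stochastic Gradient Descent
    import numpy as np

    # Assumed example data (not in the original notes): y = 4 + 3x + Gaussian noise
    m = 100
    X = 2 * np.random.rand(m, 1)
    y = 4 + 3 * X + np.random.randn(m, 1)
    X_b = np.c_[np.ones((m, 1)), X]  # prepend the bias column x0 = 1

    n_epochs = 50
    t0, t1 = 5, 50  # learning-schedule hyperparameters

    def learning_schedule(t):
        # Learning rate that decays as the step count t grows
        return t0 / (t + t1)

    theta = np.random.randn(2, 1)  # random initialization

    for epoch in range(n_epochs):
        for i in range(m):
            # Pick one random example and take a gradient step on it alone
            random_index = np.random.randint(m)
            xi = X_b[random_index:random_index + 1]
            yi = y[random_index:random_index + 1]
            gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
            eta = learning_schedule(epoch * m + i)
            theta = theta - eta * gradients

    # Stochastic Gradient Descent with Scikit-Learn
    from sklearn.linear_model import SGDRegressor

    sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1)
    sgd_reg.fit(X, y.ravel())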
    # MBGD: mini-batch gradient descent
    import random

    import numpy as np


    def gen_line_data(sample_num=100):
        """Generate noise-free samples of y = 3*x1 + 4*x2."""
        x1 = np.linspace(0, 9, sample_num)
        x2 = np.linspace(4, 13, sample_num)
        x = np.stack([x1, x2], axis=1)  # shape (sample_num, 2)
        y = x @ np.array([3, 4])
        return x, y


    def mbgd(samples, y, step_size=0.01, max_iter_count=10000, batch_size=0.2):
        """Fit y = samples @ w; batch_size is the fraction of samples used per step."""
        sample_num, dim = samples.shape
        y = y.flatten()
        w = np.ones(dim)
        batch_n = int(np.ceil(sample_num * batch_size))
        loss = 10
        iter_count = 0
        while loss > 0.001 and iter_count < max_iter_count:
            # Draw a random mini-batch without replacement
            index = random.sample(range(sample_num), batch_n)
            batch_samples, batch_y = samples[index], y[index]
            # Descent direction X^T (y - Xw), summed over the mini-batch
            direction = batch_samples.T @ (batch_y - batch_samples @ w)
            # Scaled by the full sample count (not the batch size), as in the original
            w += step_size * direction / sample_num
            # Loss on the full data set: mean squared error divided by dim
            loss = np.mean((samples @ w - y) ** 2) / dim
            iter_count += 1
        return w


    if __name__ == '__main__':
        samples, y = gen_line_data()
        w = mbgd(samples, y)
        print(w)
Evaluation metrics for regression models
Omitted