3.过拟合欠拟合、梯度消失梯度爆炸与循环神经网络进阶

过拟合、欠拟合及其解决方案

名词定义

1.训练误差(training error)、泛化误差(generalization error)
2.过拟合(overfitting)、欠拟合(underfitting)
3.验证集(validation set)
4.K折交叉验证(K-fold cross-validation)

多项式函数拟合实验

1.构造输入数据 (x1,x2,x3)

features = torch.randn((n_train + n_test, 1))
poly_features = torch.cat((features, torch.pow(features, 2), torch.pow(features, 3)), 1)
labels = (true_w[0] * poly_features[:, 0] + true_w[1] * poly_features[:, 1]
          + true_w[2] * poly_features[:, 2] + true_b)
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float)

2.读取数据集

    dataset = torch.utils.data.TensorDataset(train_features, train_labels)      # 设置数据集
    train_iter = torch.utils.data.DataLoader(dataset, batch_size, shuffle=True) # 设置获取数据方式

权重衰减

1.L2 范数正则化(regularization)

ℓ ( w 1 , w 2 , b ) = 1 n ∑ i = 1 n 1 2 ( x 1 ( i ) w 1 + x 2 ( i ) w 2 + b − y ( i ) ) 2 \ell(w_1, w_2, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(x_1^{(i)} w_1 + x_2^{(i)} w_2 + b - y^{(i)}\right)^2 (w1,w2,b)=n1i=1n21(x1(i)w1+x2(i)w2+by(i))2
ℓ ( w 1 , w 2 , b ) + λ 2 n ∣ w ∣ 2 , \ell(w_1, w_2, b) + \frac{\lambda}{2n} |\boldsymbol{w}|^2, (w1,w2,b)+2nλw2,
w 1 ← ( 1 − η λ ∣ B ∣ ) w 1 − η ∣ B ∣ ∑ i ∈ B x 1 ( i ) ( x 1 ( i ) w 1 + x 2 ( i ) w 2 + b − y ( i ) ) , w 2 ← ( 1 − η λ ∣ B ∣ ) w 2 − η ∣ B ∣ ∑ i ∈ B x 2 ( i ) ( x 1 ( i ) w 1 + x 2 ( i ) w 2 + b − y ( i ) ) . \begin{aligned} w_1 &\leftarrow \left(1- \frac{\eta\lambda}{|\mathcal{B}|} \right)w_1 - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}}x_1^{(i)} \left(x_1^{(i)} w_1 + x_2^{(i)} w_2 + b - y^{(i)}\right),\\ w_2 &\leftarrow \left(1- \frac{\eta\lambda}{|\mathcal{B}|} \right)w_2 - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}}x_2^{(i)} \left(x_1^{(i)} w_1 + x_2^{(i)} w_2 + b - y^{(i)}\right). \end{aligned} w1w2(1Bηλ)w1BηiBx1(i)(x1(i)w1+x2(i)w2+by(i)),(1Bηλ)w2BηiBx2(i)(x1(i)w1+x2(i)w2+by(i)).

2.L2范数惩罚函数从零实现

def l2_penalty(w):
    return (w**2).sum() / 2

3.简洁实现

    net = nn.Linear(num_inputs, 1)
    nn.init.normal_(net.weight, mean=0, std=1)
    nn.init.normal_(net.bias, mean=0, std=1)
    optimizer_w = torch.optim.SGD(params=[net.weight], lr=lr, weight_decay=wd) # 对权重参数衰减
    optimizer_b = torch.optim.SGD(params=[net.bias], lr=lr)  # 不对偏差参数衰减

丢弃法

1.概念

丢弃法中, h i h_i hi p p p的概率会被清零,有 1 − p 1-p 1p的概率会除以 1 − p 1-p 1p做拉伸。其中丢弃概率是丢弃法的超参数。
h i = ϕ ( x 1 w 1 i + x 2 w 2 i + x 3 w 3 i + x 4 w 4 i + b i ) h_i = \phi\left(x_1 w_{1i} + x_2 w_{2i} + x_3 w_{3i} + x_4 w_{4i} + b_i\right) hi=ϕ(x1w1i+x2w2i+x3w3i+x4w4i+bi)
h i ′ = ξ i 1 − p h i h_i' = \frac{\xi_i}{1-p} h_i hi=1pξihi
E ( ξ i ) = 1 − p E(\xi_i) = 1-p E(ξi)=1p
E ( h i ′ ) = E ( ξ i ) 1 − p h i = h i E(h_i') = \frac{E(\xi_i)}{1-p}h_i = h_i E(hi)=1pE(ξi)hi=hi

2.从零实现

丢弃法函数

def dropout(X, drop_prob):
    X = X.float()
    assert 0 <= drop_prob <= 1
    keep_prob = 1 - drop_prob
    # 这种情况下把全部元素都丢弃
    if keep_prob == 0:
        return torch.zeros_like(X)
    mask = (torch.rand(X.shape) < keep_prob).float()
    
    return mask * X / keep_prob
X = torch.arange(16).view(2, 8)
dropout(X, 0)

训练网络

def net(X, is_training=True):
    X = X.view(-1, num_inputs)
    H1 = (torch.matmul(X, W1) + b1).relu()
    if is_training:  # 只在训练模型时使用丢弃法
        H1 = dropout(H1, drop_prob1)  # 在第一层全连接后添加丢弃层
    H2 = (torch.matmul(H1, W2) + b2).relu()
    if is_training:
        H2 = dropout(H2, drop_prob2)  # 在第二层全连接后添加丢弃层
    return torch.matmul(H2, W3) + b3

验证网络

def evaluate_accuracy(data_iter, net):
    acc_sum, n = 0.0, 0
    for X, y in data_iter:
        if isinstance(net, torch.nn.Module):  #判断是class网络 还是自定义网络
            net.eval() # 开启评估模式来关闭dropout
            acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
            net.train() # 改回训练模式
        else: # 自定义的模型
            if('is_training' in net.__code__.co_varnames): # 如果有is_training这个参数
                # 将is_training设置成False
                acc_sum += (net(X, is_training=False).argmax(dim=1) == y).float().sum().item() 
            else:
                acc_sum += (net(X).argmax(dim=1) == y).float().sum().item() 
        n += y.shape[0]
    return acc_sum / n

3.简洁实现

net = nn.Sequential(  # 利用了nn.sequential,这不是自定义模型了
        d2l.FlattenLayer(),
        nn.Linear(num_inputs, num_hiddens1),
        nn.ReLU(),
        nn.Dropout(drop_prob1),   # 直接调用nn.Dropout函数
        nn.Linear(num_hiddens1, num_hiddens2), 
        nn.ReLU(),
        nn.Dropout(drop_prob2),
        nn.Linear(num_hiddens2, 10)
        )
for param in net.parameters():
    nn.init.normal_(param, mean=0, std=0.01)
optimizer = torch.optim.SGD(net.parameters(), lr=0.5)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None, None, optimizer)

梯度消失、梯度爆炸以及Kaggle房价预测

梯度消失(vanishing)和梯度爆炸(explosion)

H ( l ) = X W ( 1 ) W ( 2 ) … W ( l ) \boldsymbol{H}^{(l)} = \boldsymbol{X} \boldsymbol{W}^{(1)} \boldsymbol{W}^{(2)} \ldots \boldsymbol{W}^{(l)} H(l)=XW(1)W(2)W(l)

随机初始化模型参数

1.PyTorch的默认随机初始化

在线性回归的简洁实现中,我们使用torch.nn.init.normal_()使模型net的权重参数采用正态分布的随机初始化方式。不过,PyTorch中nn.Module的模块参数都采取了较为合理的初始化策略(不同类型的layer具体采样的哪一种初始化方法的可参考源代码),因此一般不用我们考虑。

2.Xavier随机初始化

还有一种比较常用的随机初始化方法叫作Xavier随机初始化。 假设某全连接层的输入个数为 a a a,输出个数为 b b b,Xavier随机初始化将使该层中权重参数的每个元素都随机采样于均匀分布
U ( − 6 a + b , 6 a + b ) . U\left(-\sqrt{\frac{6}{a+b}}, \sqrt{\frac{6}{a+b}}\right). U(a+b6 ,a+b6 ).
它的设计主要考虑到,模型参数初始化后,每层输出的方差不该受该层输入个数影响,且每层梯度的方差也不该受该层输出个数影响。

考虑环境因素

1.协变量偏移

P(x)改变了,但P(y∣x)保持不变。我们认为x导致y。

2.标签偏移

P(y)改变了,但P(x∣y)保持不变。我们认为y导致x。

3.概念偏移

P(y∣x)改变了,但通常只是逐渐变化。

Kaggle 房价预测实战

1.pandas处理

train_data = pd.read_csv("/home/kesci/input/houseprices2807/house-prices-advanced-regression-techniques/train.csv") # 读取
train_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]]  # 查看
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))
# 标准化后,每个数值特征的均值变为0,所以可以直接用0来替换缺失值
all_features[numeric_features] = all_features[numeric_features].fillna(0)
all_features = pd.get_dummies(all_features, dummy_na=True)  # 将离散数值转成指示特征
train_features = torch.tensor(all_features[:n_train].values, dtype=torch.float)
train_labels = torch.tensor(train_data.SalePrice.values, dtype=torch.float).view(-1, 1)

2.K折交叉验证

def get_k_fold_data(k, i, X, y):
    # 返回第i折交叉验证时所需要的训练和验证数据
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat((X_train, X_part), dim=0)
            y_train = torch.cat((y_train, y_part), dim=0)
    return X_train, y_train, X_valid, y_valid

3.预测

    test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    submission.to_csv('./submission.csv', index=False)

循环神经网络进阶

RNN

在这里插入图片描述
H t = ϕ ( X t W x h + H t − 1 W h h + b h ) H_{t} = ϕ(X_{t}W_{xh} + H_{t-1}W_{hh} + b_{h}) Ht=ϕ(XtWxh+Ht1Whh+bh)

GRU

在这里插入图片描述
R t = σ ( X t W x r + H t − 1 W h r + b r ) Z t = σ ( X t W x z + H t − 1 W h z + b z ) H ~ t = t a n h ( X t W x h + ( R t ⊙ H t − 1 ) W h h + b h ) H t = Z t ⊙ H t − 1 + ( 1 − Z t ) ⊙ H ~ t R_{t} = σ(X_tW_{xr} + H_{t−1}W_{hr} + b_r)\\ Z_{t} = σ(X_tW_{xz} + H_{t−1}W_{hz} + b_z)\\ \widetilde{H}_t = tanh(X_tW_{xh} + (R_t ⊙H_{t−1})W_{hh} + b_h)\\ H_t = Z_t⊙H_{t−1} + (1−Z_t)⊙\widetilde{H}_t Rt=σ(XtWxr+Ht1Whr+br)Zt=σ(XtWxz+Ht1Whz+bz)H t=tanh(XtWxh+(RtHt1)Whh+bh)Ht=ZtHt1+(1Zt)H t
• 重置⻔有助于捕捉时间序列⾥短期的依赖关系;
• 更新⻔有助于捕捉时间序列⾥⻓期的依赖关系。

1.参数初始化

def get_params():  
    def _one(shape):
        ts = torch.tensor(np.random.normal(0, 0.01, size=shape), device=device, dtype=torch.float32) #正态分布
        return torch.nn.Parameter(ts, requires_grad=True)
    def _three():
        return (_one((num_inputs, num_hiddens)),
                _one((num_hiddens, num_hiddens)),
                torch.nn.Parameter(torch.zeros(num_hiddens, device=device, dtype=torch.float32), requires_grad=True))
     
    W_xz, W_hz, b_z = _three()  # 更新门参数
    W_xr, W_hr, b_r = _three()  # 重置门参数
    W_xh, W_hh, b_h = _three()  # 候选隐藏状态参数
    
    # 输出层参数
    W_hq = _one((num_hiddens, num_outputs))
    b_q = torch.nn.Parameter(torch.zeros(num_outputs, device=device, dtype=torch.float32), requires_grad=True)
    return nn.ParameterList([W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q])

def init_gru_state(batch_size, num_hiddens, device):   #隐藏状态初始化
    return (torch.zeros((batch_size, num_hiddens), device=device), )

2.GRU模型

1.从零实现
def gru(inputs, state, params):
    W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        Z = torch.sigmoid(torch.matmul(X, W_xz) + torch.matmul(H, W_hz) + b_z)
        R = torch.sigmoid(torch.matmul(X, W_xr) + torch.matmul(H, W_hr) + b_r)
        H_tilda = torch.tanh(torch.matmul(X, W_xh) + R * torch.matmul(H, W_hh) + b_h)
        H = Z * H + (1 - Z) * H_tilda
        Y = torch.matmul(H, W_hq) + b_q
        outputs.append(Y)
    return outputs, (H,)
2.简洁实现
gru_layer = nn.GRU(input_size=vocab_size, hidden_size=num_hiddens)
model = d2l.RNNModel(gru_layer, vocab_size).to(device)

LSTM

长短期记忆(long short-term memory):
遗忘门:控制上一时间步的记忆细胞 输入门:控制当前时间步的输入
输出门:控制从记忆细胞到隐藏状态
记忆细胞:⼀种特殊的隐藏状态的信息的流动
在这里插入图片描述
I t = σ ( X t W x i + H t − 1 W h i + b i ) F t = σ ( X t W x f + H t − 1 W h f + b f ) O t = σ ( X t W x o + H t − 1 W h o + b o ) C ~ t = t a n h ( X t W x c + H t − 1 W h c + b c ) C t = F t ⊙ C t − 1 + I t ⊙ C ~ t H t = O t ⊙ t a n h ( C t ) I_t = σ(X_tW_{xi} + H_{t−1}W_{hi} + b_i) \\ F_t = σ(X_tW_{xf} + H_{t−1}W_{hf} + b_f)\\ O_t = σ(X_tW_{xo} + H_{t−1}W_{ho} + b_o)\\ \widetilde{C}_t = tanh(X_tW_{xc} + H_{t−1}W_{hc} + b_c)\\ C_t = F_t ⊙C_{t−1} + I_t ⊙\widetilde{C}_t\\ H_t = O_t⊙tanh(C_t) It=σ(XtWxi+Ht1Whi+bi)Ft=σ(XtWxf+Ht1Whf+bf)Ot=σ(XtWxo+Ht1Who+bo)C t=tanh(XtWxc+Ht1Whc+bc)Ct=FtCt1+ItC tHt=Ottanh(Ct)

1.从零实现
def lstm(inputs, state, params):
    [W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o, W_xc, W_hc, b_c, W_hq, b_q] = params
    (H, C) = state
    outputs = []
    for X in inputs:
        I = torch.sigmoid(torch.matmul(X, W_xi) + torch.matmul(H, W_hi) + b_i)
        F = torch.sigmoid(torch.matmul(X, W_xf) + torch.matmul(H, W_hf) + b_f)
        O = torch.sigmoid(torch.matmul(X, W_xo) + torch.matmul(H, W_ho) + b_o)
        C_tilda = torch.tanh(torch.matmul(X, W_xc) + torch.matmul(H, W_hc) + b_c)
        C = F * C + I * C_tilda
        H = O * C.tanh()
        Y = torch.matmul(H, W_hq) + b_q
        outputs.append(Y)
    return outputs, (H, C)
2.简洁实现
lstm_layer = nn.LSTM(input_size=vocab_size, hidden_size=num_hiddens)
model = d2l.RNNModel(lstm_layer, vocab_size)

深度循环神经网络

在这里插入图片描述
H t ( 1 ) = ϕ ( X t W x h ( 1 ) + H t − 1 ( 1 ) W h h ( 1 ) + b h ( 1 ) ) H t ( ℓ ) = ϕ ( H t ( ℓ − 1 ) W x h ( ℓ ) + H t − 1 ( ℓ ) W h h ( ℓ ) + b h ( ℓ ) ) O t = H t ( L ) W h q + b q \boldsymbol{H}_t^{(1)} = \phi(\boldsymbol{X}_t \boldsymbol{W}_{xh}^{(1)} + \boldsymbol{H}_{t-1}^{(1)} \boldsymbol{W}_{hh}^{(1)} + \boldsymbol{b}_h^{(1)})\\ \boldsymbol{H}_t^{(\ell)} = \phi(\boldsymbol{H}_t^{(\ell-1)} \boldsymbol{W}_{xh}^{(\ell)} + \boldsymbol{H}_{t-1}^{(\ell)} \boldsymbol{W}_{hh}^{(\ell)} + \boldsymbol{b}_h^{(\ell)})\\ \boldsymbol{O}_t = \boldsymbol{H}_t^{(L)} \boldsymbol{W}_{hq} + \boldsymbol{b}_q Ht(1)=ϕ(XtWxh(1)+Ht1(1)Whh(1)+bh(1))Ht()=ϕ(Ht(1)Wxh()+Ht1()Whh()+bh())Ot=Ht(L)Whq+bq

1.简洁实现
gru_layer = nn.LSTM(input_size=vocab_size, hidden_size=num_hiddens,num_layers=2)
model = d2l.RNNModel(gru_layer, vocab_size).to(device)

双向循环神经网络

1.简洁实现

在这里插入图片描述

gru_layer = nn.GRU(input_size=vocab_size, hidden_size=num_hiddens,bidirectional=True)
model = d2l.RNNModel(gru_layer, vocab_size).to(device)
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值