Minibatch Stochastic Gradient Descent

Reference:
https://d2l.ai/chapter_linear-networks/linear-regression.html
https://d2l.ai/chapter_linear-networks/linear-regression-scratch.html

Gradient descent

$$\mathbf x \leftarrow \mathbf x - \eta\, \partial_{\mathbf x}\mathcal L(\mathbf x)$$

where the loss function $\mathcal L(\mathbf x)$ is usually defined as an average of the losses computed on every single example in the dataset. For example, consider a linear regression problem:
$$\mathcal L(\mathbf w,b)=\frac{1}{N}\sum_{i=1}^N l^{(i)}(\mathbf w,b)=\frac{1}{N}\sum_{i=1}^N\frac{1}{2}\left(\mathbf w^T \mathbf x^{(i)}+b-y^{(i)}\right)^2$$

$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \eta\, \partial_{(\mathbf{w},b)} \mathcal L(\mathbf{w},b) \tag{GD}$$

In practice, this can be extremely slow: we must pass over the entire dataset before making a single update. Thus, we will often settle for sampling a random minibatch of examples every time we need to compute the update, a variant called minibatch stochastic gradient descent.
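
To make the cost concrete, here is a minimal sketch (not part of the reference implementation below) of full-batch gradient descent for this linear regression model, assuming `X` is an `N × d` feature matrix, `y` an `N × 1` label vector, and `w`, `b` the current parameters; the gradients follow directly from the squared loss above, and the name `full_batch_gd` is just illustrative:

import torch


def full_batch_gd(X, y, w, b, lr, num_steps):
    """Full-batch gradient descent for linear regression (illustrative sketch)."""
    for _ in range(num_steps):
        # Every single update requires a pass over the entire dataset
        residual = torch.matmul(X, w) + b - y               # shape (N, 1)
        w = w - lr * torch.matmul(X.T, residual) / len(X)   # gradient of the average loss w.r.t. w
        b = b - lr * residual.mean()                        # gradient of the average loss w.r.t. b
    return w, b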

Minibatch Stochastic Gradient Descent

In each iteration, we first randomly sample a minibatch $\mathcal{B}$ consisting of a fixed number of training examples. We then compute the derivative (gradient) of the average loss on the minibatch with respect to the model parameters. Finally, we multiply the gradient by a predetermined positive value $\eta$ (the learning rate) and subtract the resulting term from the current parameter values:
$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b) \tag{MSGD}$$
To summarize, the steps of the algorithm are as follows:

  • initialize the values of the model parameters, typically at random
  • iteratively sample random minibatches from the data
  • update the parameters in the direction of the negative gradient

For the linear regression problem, we can write this out explicitly as follows:
$$
\begin{aligned}
\mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b - y^{(i)}\right),\\
b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b - y^{(i)}\right).
\end{aligned}
$$
If $|\mathcal B|=1$, the algorithm is called Stochastic Gradient Descent.
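
These explicit updates can also be coded directly, without automatic differentiation. The following is a minimal sketch under the assumption that `X_b` and `y_b` hold one randomly sampled minibatch of shapes `(|B|, d)` and `(|B|, 1)`; the name `msgd_step` is just illustrative, and the full autograd-based implementation is given below:

import torch


def msgd_step(X_b, y_b, w, b, lr):
    """One minibatch SGD step for linear regression using the explicit gradients."""
    residual = torch.matmul(X_b, w) + b - y_b              # shape (|B|, 1)
    w = w - lr / len(X_b) * torch.matmul(X_b.T, residual)  # update for w
    b = b - lr / len(X_b) * residual.sum()                 # update for b
    return w, b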

A from-scratch implementation of the minibatch stochastic gradient descent optimizer for linear regression in PyTorch:

import random
import torch


def synthetic_data(w, b, num_examples): 
    """Generate y = Xw + b + noise."""
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))


def data_iter(batch_size, features, labels):
    """Generate Minibatches"""
    num_examples = len(features)
    indices = list(range(num_examples))
    # The examples are read at random, in no particular order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
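        # The last minibatch may contain fewer than batch_size examples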
        batch_indices = torch.tensor(
            indices[i:min(i + batch_size, num_examples)])
        # About yield: https://www.runoob.com/w3cnote/python-yield-used-analysis.html
        yield features[batch_indices], labels[batch_indices]


def linreg(X, w, b):
    """The linear regression model."""
    return torch.matmul(X, w) + b


def squared_loss(y_hat, y):
    """Squared loss."""
    return (y_hat - y.reshape(y_hat.shape))**2 / 2


def sgd(params, lr, batch_size):
    """Minibatch stochastic gradient descent."""
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()


# Generating data
true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)

# Parameter assignment
batch_size = 10
lr = 0.03
num_epochs = 3
net = linreg
loss = squared_loss

# Initialization
w = torch.normal(0, 0.01, size=(2, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Training
for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)  # Minibatch loss in `X` and `y`
        # Compute gradient on `l` with respect to [`w`, `b`]
        l.sum().backward()
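        # `w.grad` and `b.grad` now hold the gradients summed over the minibatch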
        sgd([w, b], lr, batch_size)  # Update parameters using their gradient
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')

# Error
print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
print(f'error in estimating b: {true_b - b}')
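
Since the synthetic labels are generated from true_w and true_b with only a small amount of noise, the learned parameters should end up close to the true ones, so both printed errors should be small.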