References:
- https://d2l.ai/chapter_linear-networks/linear-regression.html
- https://d2l.ai/chapter_linear-networks/linear-regression-scratch.html
Gradient Descent
$$\mathbf x \leftarrow \mathbf x - \eta\, \partial_{\mathbf x}\mathcal L(\mathbf x)$$
where the loss function $\mathcal L(\mathbf x)$ is usually defined as an average of the losses computed on every single example in the dataset. For example, consider a linear regression problem:
$$\mathcal L(\mathbf w,b)=\frac{1}{N}\sum_{i=1}^N l^{(i)}(\mathbf w,b)=\frac{1}{N}\sum_{i=1}^N\frac{1}{2}\left(\mathbf w^T \mathbf x^{(i)}+b-y^{(i)}\right)^2$$
$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \eta\, \partial_{(\mathbf{w},b)} \mathcal L(\mathbf{w},b) \tag{GD}$$
In practice, this can be extremely slow: we must pass over the entire dataset before making a single update. Thus, we will often settle for sampling a random minibatch of examples every time we need to compute the update, a variant called minibatch stochastic gradient descent.
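To make the cost concrete, here is a minimal sketch of a single full-batch (GD) step in PyTorch. All names here (`N`, `d`, `eta`, `X`, `y`, `w`, `b`) are illustrative placeholders rather than definitions from the text; the point is that one update requires the loss, and hence the gradient, over all `N` examples.

```python
import torch

# Illustrative setup: N examples with d features (hypothetical values)
N, d, eta = 100, 2, 0.03
X = torch.normal(0, 1, (N, d))
y = torch.normal(0, 1, (N, 1))
w = torch.zeros((d, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Average squared loss over the *entire* dataset
loss = ((torch.matmul(X, w) + b - y) ** 2 / 2).mean()
loss.backward()

# One (GD) update: every single step touches all N examples
with torch.no_grad():
    w -= eta * w.grad
    b -= eta * b.grad
    w.grad.zero_()
    b.grad.zero_()
```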
Minibatch Stochastic Gradient Descent
In each iteration, we first randomly sample a minibatch $\mathcal{B}$ consisting of a fixed number of training examples. We then compute the derivative (gradient) of the average loss on the minibatch with respect to the model parameters. Finally, we multiply the gradient by a predetermined positive value $\eta$ (the learning rate) and subtract the resulting term from the current parameter values:
$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b) \tag{MSGD}$$
To summarize, the steps of the algorithm are as follows:
- initialize the values of the model parameters, typically at random
- iteratively sample random minibatches from the data
- update the parameters in the direction of the negative gradient
For the linear regression problem, we can write this out explicitly as follows:
$$\begin{aligned} \mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b - y^{(i)}\right),\\ b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned}$$
If $|\mathcal B|=1$, the algorithm is called Stochastic Gradient Descent (SGD).
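As a minimal sketch, the explicit updates above can also be coded directly, with the gradients written out by hand instead of obtained via automatic differentiation. The names `X_batch`, `y_batch`, `w`, `b`, and `lr` below are illustrative, not taken from the text; the from-scratch implementation that follows uses autograd instead.

```python
import torch

def msgd_step_manual(X_batch, y_batch, w, b, lr):
    """One minibatch update using the hand-derived gradients.

    X_batch: (|B|, d) features, y_batch: (|B|, 1) labels,
    w: (d, 1) weights, b: scalar bias (plain tensors, no autograd).
    """
    residual = torch.matmul(X_batch, w) + b - y_batch           # (|B|, 1)
    grad_w = torch.matmul(X_batch.T, residual) / len(X_batch)   # (1/|B|) sum x^(i) (w^T x^(i) + b - y^(i))
    grad_b = residual.mean()                                    # (1/|B|) sum (w^T x^(i) + b - y^(i))
    return w - lr * grad_w, b - lr * grad_b
```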
Below is a from-scratch implementation of the minibatch stochastic gradient descent optimizer for linear regression in PyTorch:
```python
import random
import torch


def synthetic_data(w, b, num_examples):
    """Generate y = Xw + b + noise."""
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))


def data_iter(batch_size, features, labels):
    """Generate minibatches."""
    num_examples = len(features)
    indices = list(range(num_examples))
    # The examples are read at random, in no particular order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i:min(i + batch_size, num_examples)])
        # About yield: https://www.runoob.com/w3cnote/python-yield-used-analysis.html
        yield features[batch_indices], labels[batch_indices]


def linreg(X, w, b):
    """The linear regression model."""
    return torch.matmul(X, w) + b


def squared_loss(y_hat, y):
    """Squared loss."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2


def sgd(params, lr, batch_size):
    """Minibatch stochastic gradient descent."""
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()


# Generating data
true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)

# Parameter assignment
batch_size = 10
lr = 0.03
num_epochs = 3
net = linreg
loss = squared_loss

# Initialization
w = torch.normal(0, 0.01, size=(2, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Training
for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)  # Minibatch loss in `X` and `y`
        # Compute gradient on `l` with respect to [`w`, `b`]
        l.sum().backward()
        sgd([w, b], lr, batch_size)  # Update parameters using their gradient
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')

# Error
print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
print(f'error in estimating b: {true_b - b}')
```
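For comparison, the same model can also be trained with PyTorch's built-in layers and optimizer. The sketch below reuses `data_iter`, `features`, `labels`, `batch_size`, and `num_epochs` from the code above and replaces the hand-written `linreg`, `squared_loss`, and `sgd` with `nn.Linear`, `nn.MSELoss`, and `torch.optim.SGD`:

```python
from torch import nn

# Built-in counterparts of linreg, squared_loss, and sgd above
net_concise = nn.Linear(2, 1)
loss_fn = nn.MSELoss()
trainer = torch.optim.SGD(net_concise.parameters(), lr=0.03)

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss_fn(net_concise(X), y)
        trainer.zero_grad()   # clear accumulated gradients
        l.backward()          # gradients of the minibatch loss
        trainer.step()        # one (MSGD) update
    with torch.no_grad():
        train_l = loss_fn(net_concise(features), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l):f}')
```

Note one small difference: `nn.MSELoss` averages $(\hat y - y)^2$ without the $\frac{1}{2}$ factor used in `squared_loss`, so with the same learning rate the effective step size differs by a factor of two.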