References:
- https://d2l.ai/chapter_linear-networks/linear-regression.html
- https://d2l.ai/chapter_linear-networks/linear-regression-scratch.html
Gradient Descent
$$\mathbf x \leftarrow \mathbf x - \eta\, \partial_{\mathbf x}\mathcal L(\mathbf x)$$
where the loss function $\mathcal L(\mathbf x)$ is usually defined as an average of the losses computed on every single example in the dataset. For example, consider a linear regression problem:
$$\mathcal L(\mathbf w,b)=\frac{1}{N}\sum_{i=1}^N l^{(i)}(\mathbf w,b)=\frac{1}{N}\sum_{i=1}^N\frac{1}{2}\left(\mathbf w^T \mathbf x^{(i)}+b-y^{(i)}\right)^2$$
$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \eta\, \partial_{(\mathbf{w},b)} \mathcal L(\mathbf{w},b) \tag{GD}$$
In practice, this can be extremely slow: we must pass over the entire dataset before making a single update. Thus, we will often settle for sampling a random minibatch of examples every time we need to compute the update, a variant called minibatch stochastic gradient descent.
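To make the cost concrete, here is a minimal sketch of a single full-batch (GD) step in PyTorch. All names here (`N`, `d`, `eta`, `X`, `y`, `w`, `b`) are illustrative placeholders rather than definitions from the text; the point is that one update requires the loss, and hence the gradient, over all `N` examples.

```python
import torch

# Illustrative setup: N examples with d features (hypothetical values)
N, d, eta = 100, 2, 0.03
X = torch.normal(0, 1, (N, d))
y = torch.normal(0, 1, (N, 1))
w = torch.zeros((d, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Average squared loss over the *entire* dataset
loss = ((torch.matmul(X, w) + b - y) ** 2 / 2).mean()
loss.backward()

# One (GD) update: every single step touches all N examples
with torch.no_grad():
    w -= eta * w.grad
    b -= eta * b.grad
    w.grad.zero_()
    b.grad.zero_()
```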
Minibatch Stochastic Gradient Descent
In each iteration, we first randomly sample a minibatch $\mathcal{B}$ consisting of a fixed number of training examples. We then compute the derivative (gradient) of the average loss on the minibatch with respect to the model parameters. Finally, we multiply the gradient by a predetermined positive value $\eta$ (the learning rate) and subtract the resulting term from the current parameter values:
$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b) \tag{MSGD}$$
To summarize, the steps of the algorithm are as follows:
- initialize the values of the model parameters, typically at random
- iteratively sample random minibatches from the data
- update the parameters in the direction of the negative gradient
For the linear regression problem, we can write this out explicitly as follows:
$$\begin{aligned} \mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b - y^{(i)}\right),\\ b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^T \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned}$$
If $|\mathcal B|=1$, the algorithm is called Stochastic Gradient Descent (SGD).
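As a minimal sketch, the explicit updates above can also be coded directly, with the gradients written out by hand instead of obtained via automatic differentiation. The names `X_batch`, `y_batch`, `w`, `b`, and `lr` below are illustrative, not taken from the text; the from-scratch implementation that follows uses autograd instead.

```python
import torch

def msgd_step_manual(X_batch, y_batch, w, b, lr):
    """One minibatch update using the hand-derived gradients.

    X_batch: (|B|, d) features, y_batch: (|B|, 1) labels,
    w: (d, 1) weights, b: scalar bias (plain tensors, no autograd).
    """
    residual = torch.matmul(X_batch, w) + b - y_batch           # (|B|, 1)
    grad_w = torch.matmul(X_batch.T, residual) / len(X_batch)   # (1/|B|) sum x^(i) (w^T x^(i) + b - y^(i))
    grad_b = residual.mean()                                    # (1/|B|) sum (w^T x^(i) + b - y^(i))
    return w - lr * grad_w, b - lr * grad_b
```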
Below is a from-scratch implementation of the minibatch stochastic gradient descent optimizer for linear regression in PyTorch:
```python
import random
import torch


def synthetic_data(w, b, num_examples):
    """Generate y = Xw + b + noise."""
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))


def data_iter(batch_size, features, labels):
    """Generate minibatches."""
    num_examples = len(features)
    indices = list(range(num_examples))
    # The examples are read at random, in no particular order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i:min(i + batch_size, num_examples)])
        # About yield: https://www.runoob.com/w3cnote/python-yield-used-analysis.html
        yield features[batch_indices], labels[batch_indices]


def linreg(X, w, b):
    """The linear regression model."""
    return torch.matmul(X, w) + b


def squared_loss(y_hat, y):
    """Squared loss."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2


def sgd(params, lr, batch_size):
    """Minibatch stochastic gradient descent."""
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()


# Generating data
true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)

# Parameter assignment
batch_size = 10
lr = 0.03
num_epochs = 3
net = linreg
loss = squared_loss

# Initialization
w = torch.normal(0, 0.01, size=(2, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Training
for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)  # Minibatch loss in `X` and `y`
        # Compute gradient on `l` with respect to [`w`, `b`]
        l.sum().backward()
        sgd([w, b], lr, batch_size)  # Update parameters using their gradient
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')

# Error
print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
print(f'error in estimating b: {true_b - b}')
```
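For comparison, the same model can also be trained with PyTorch's built-in layers and optimizer. The sketch below reuses `data_iter`, `features`, `labels`, `batch_size`, and `num_epochs` from the code above and replaces the hand-written `linreg`, `squared_loss`, and `sgd` with `nn.Linear`, `nn.MSELoss`, and `torch.optim.SGD`:

```python
from torch import nn

# Built-in counterparts of linreg, squared_loss, and sgd above
net_concise = nn.Linear(2, 1)
loss_fn = nn.MSELoss()
trainer = torch.optim.SGD(net_concise.parameters(), lr=0.03)

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss_fn(net_concise(X), y)
        trainer.zero_grad()   # clear accumulated gradients
        l.backward()          # gradients of the minibatch loss
        trainer.step()        # one (MSGD) update
    with torch.no_grad():
        train_l = loss_fn(net_concise(features), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l):f}')
```

Note one small difference: `nn.MSELoss` averages $(\hat y - y)^2$ without the $\frac{1}{2}$ factor used in `squared_loss`, so with the same learning rate the effective step size differs by a factor of two.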