一: References
李宏毅 Regression
Dive-into-DL-Pytorch
二: Mathematical Formulation of Linear Regression
Difference between regression and classification: regression outputs a continuous value, while classification outputs discrete values.
\hat{y}=Xw+b
\hat{y}=\begin{pmatrix} \hat{y}^{(1)}\\ \hat{y}^{(2)} \\ \vdots \\ \hat{y}^{(n)} \end{pmatrix} ,\qquad y=\begin{pmatrix} y^{(1)}\\ y^{(2)} \\ \vdots \\ y^{(n)} \end{pmatrix}
X=\begin{pmatrix} x^{(1)}_{1} & x^{(1)}_{2} & \dots & x^{(1)}_{m} \\ x^{(2)}_{1} & x^{(2)}_{2} & \dots & x^{(2)}_{m} \\ \vdots & \vdots & \ddots & \vdots \\ x^{(n)}_{1} & x^{(n)}_{2} & \dots & x^{(n)}_{m} \end{pmatrix} ,\qquad w=\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_{m} \end{pmatrix}
\textcolor{blue}{\fbox{Explanation}} We have data $(x^{(i)}_{1},x^{(i)}_{2},\dots,x^{(i)}_{m},y^{(i)})$, which denotes the $i$-th sample. There are $n$ samples in total, $m$ is the number of features, $w$ is called the weight, and $b$ is the bias (a scalar is usually sufficient; it is added to $Xw$ via broadcasting), as sketched below.
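A minimal sketch of the bias broadcast mentioned above (the tensors here are arbitrary placeholders, not values from the text):

import torch

Xw = torch.tensor([[1.0], [2.0], [3.0]])  # X*w for 3 samples, shape (3, 1)
b = torch.tensor([0.1])                   # scalar bias, shape (1,)
print(Xw + b)                             # b is broadcast to every row: [[1.1], [2.1], [3.1]]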
$\hat{y}$ is the result we compute:
\hat{y}^{(i)}=\sum^{m}_{j=1}x^{(i)}_{j}w_{j}+b
\textcolor{blue}{\fbox{Loss function}} Suppose we use the squared loss; then we have
l^{(i)}(\hat{y}^{(i)},y^{(i)})=\dfrac{1}{2}\big(\hat{y}^{(i)}-y^{(i)}\big)^2=\dfrac{1}{2}\Big(\sum^{m}_{j=1}x^{(i)}_{j}w_{j}+b-y^{(i)}\Big)^2
\dfrac{\partial l^{(i)}(\hat{y}^{(i)},y^{(i)})}{\partial w_j}=\Big(\sum^{m}_{k=1}x^{(i)}_{k}w_{k}+b-y^{(i)}\Big)x^{(i)}_{j} ,
\dfrac{\partial l^{(i)}(\hat{y}^{(i)},y^{(i)})}{\partial b}=\sum^{m}_{k=1}x^{(i)}_{k}w_{k}+b-y^{(i)}
Summing over all $n$ samples, with $\theta=(w_1,\dots,w_m,b)^T$ collecting all the parameters, the total loss is
l(\hat{y},y;\theta)=\dfrac{1}{2}\sum^{n}_{i=1}\big(\hat{y}^{(i)}-y^{(i)}\big)^2=\dfrac{1}{2}(\hat{y}-y)^T(\hat{y}-y)
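A quick numerical check that the vectorized form equals the per-sample sum (a minimal sketch with random placeholder data):

import torch

n, m = 5, 3
X = torch.randn(n, m)
w = torch.randn(m, 1)
b = torch.tensor([0.7])
y = torch.randn(n, 1)

y_hat = torch.mm(X, w) + b
loss_sum = 0.5 * ((y_hat - y) ** 2).sum()              # (1/2) * sum_i (y_hat^(i) - y^(i))^2
loss_vec = 0.5 * torch.mm((y_hat - y).t(), y_hat - y)  # (1/2) * (y_hat - y)^T (y_hat - y)
print(loss_sum.item(), loss_vec.item())                # the two numbers agree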
\textcolor{blue}{\fbox{Mini-batch stochastic gradient descent}} We optimize with mini-batch stochastic gradient descent (mini-batch SGD): in every step a mini-batch $\mathcal{B}$ of samples is drawn and the parameters are updated with learning rate $\eta$:
w_j=w_j-\dfrac{\eta}{\vert\mathcal{B}\vert}\sum_{i\in\mathcal{B}}\dfrac{\partial l^{(i)}(\hat{y}^{(i)},y^{(i)})}{\partial w_j}=w_j-\dfrac{\eta}{\vert\mathcal{B}\vert}\sum_{i\in\mathcal{B}}\Big(\sum^{m}_{k=1}x^{(i)}_{k}w_{k}+b-y^{(i)}\Big)x^{(i)}_{j}
b=b-\dfrac{\eta}{\vert\mathcal{B}\vert}\sum_{i\in\mathcal{B}}\dfrac{\partial l^{(i)}(\hat{y}^{(i)},y^{(i)})}{\partial b}=b-\dfrac{\eta}{\vert\mathcal{B}\vert}\sum_{i\in\mathcal{B}}\Big(\sum^{m}_{k=1}x^{(i)}_{k}w_{k}+b-y^{(i)}\Big)
Equivalently, in vector form:
\theta=\theta-\dfrac{\eta}{\vert\mathcal{B}\vert}\sum_{i\in\mathcal{B}}\nabla_{\theta}\,l^{(i)}(\theta)
where
\begin{aligned} \nabla_{\theta}\,l^{(i)}(\theta)&=\begin{pmatrix} \big(\sum^{m}_{k=1}x^{(i)}_{k}w_{k}+b-y^{(i)}\big)x^{(i)}_{1} \\ \big(\sum^{m}_{k=1}x^{(i)}_{k}w_{k}+b-y^{(i)}\big)x^{(i)}_{2} \\ \vdots \\ \big(\sum^{m}_{k=1}x^{(i)}_{k}w_{k}+b-y^{(i)}\big)x^{(i)}_{m} \\ \sum^{m}_{k=1}x^{(i)}_{k}w_{k}+b-y^{(i)} \end{pmatrix} =\begin{pmatrix} x^{(i)}_{1}\\ x^{(i)}_{2} \\ \vdots \\ x^{(i)}_{m} \\ 1 \end{pmatrix}\Big(\sum^{m}_{k=1}x^{(i)}_{k}w_{k}+b-y^{(i)}\Big) \\ &=\begin{pmatrix} x^{(i)}_{1} \\ x^{(i)}_{2} \\ \vdots \\ x^{(i)}_{m} \\ 1 \end{pmatrix}\big(\hat{y}^{(i)}-y^{(i)}\big) \end{aligned}
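To double-check the derivation, a minimal sketch comparing the hand-derived single-sample gradient $(\hat{y}^{(i)}-y^{(i)})\,(x^{(i)}_{1},\dots,x^{(i)}_{m},1)^T$ against what autograd computes (all tensors here are random placeholders):

import torch

m = 3
x = torch.randn(m)                        # one sample x^(i)
y = torch.randn(1)                        # its label y^(i)
w = torch.randn(m, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

y_hat = torch.dot(x, w) + b               # prediction for this sample
l = 0.5 * (y_hat - y) ** 2                # per-sample squared loss
l.backward()

manual_dw = (y_hat - y).detach() * x      # (y_hat - y) * x_j for every j
manual_db = (y_hat - y).detach()
print(torch.allclose(w.grad, manual_dw), torch.allclose(b.grad, manual_db))  # True True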
三: Implementing Linear Regression in PyTorch
%matplotlib inline
import torch
from torch import nn,autograd
from matplotlib import pyplot as plt
import numpy as np
import random
from IPython import display
Generate the dataset
num_inputs = 2
num_examples = 1000
true_w = torch.tensor([[2],[-3.4]])  # the default floating-point dtype is torch.float32
true_b = torch.tensor([4.2])
features = torch.randn(num_examples, num_inputs)  # torch.Size([1000, 2])
labels = torch.mm(features, true_w) + true_b
labels += torch.normal(0, 0.01, size=labels.size())  # add Gaussian noise; torch.Size([1000, 1])
Inspect the dataset
print(features[0], labels[0])
# tensor([ 0.6882, -1.0295]) tensor([9.0837])
def use_svg_display():
    # display plots as vector graphics (svg)
    display.set_matplotlib_formats('svg')

def set_figure(figsize=(3.5, 2.5)):
    use_svg_display()
    # set the figure size
    plt.rcParams['figure.figsize'] = figsize

set_figure()
plt.scatter(features[:, 1].numpy(), labels.numpy(), 1)
Read the data
def data_iter(batch_size, features, labels):
    # yield one (features, labels) mini-batch of size batch_size at a time
    num_examples = len(labels)
    indices = list(range(num_examples))
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        j = torch.LongTensor(indices[i:min(i + batch_size, num_examples)])  # index_select needs a LongTensor (torch.int64) index
        yield features.index_select(0, j), labels.index_select(0, j)
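For example, the first mini-batch can be inspected like this (assuming batch_size = 10, the value used in the training section below):

X_batch, y_batch = next(data_iter(10, features, labels))
print(X_batch.shape, y_batch.shape)  # torch.Size([10, 2]) torch.Size([10, 1])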
Initialize the model parameters
w = torch.normal(0,0.01,size=(num_inputs,1)) #torch.float32
b = torch.zeros(1)
w.requires_grad = True
b.requires_grad = True
Define the model
def linreq(X, w, b):
    return torch.mm(X, w) + b
Define the loss function
def squared_loss(y, y_hat):
    return (y - y_hat.view(y.size()))**2 / 2
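Note: y_hat.view(y.size()) reshapes the prediction to match the shape of y before subtracting. If y were a 1-D tensor of length n while y_hat stayed a column vector of shape (n, 1), a direct subtraction would broadcast to an (n, n) matrix and silently produce a wrong loss.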
Define the optimization algorithm
def SGD(params, lr, batch_size):
    for param in params:
        param.data -= lr * param.grad / batch_size
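Two design points here: the update writes to param.data so that the in-place change itself is not recorded by autograd, and because the loss below is summed (not averaged) over the mini-batch before backward(), the accumulated gradient is divided by batch_size.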
Train the model
# hyperparameters
lr = 0.03
num_epochs = 10
batch_size = 10  # mini-batch size (assumed to be 10 here)
net = linreq
loss = squared_loss

# start training
for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        y_hat = net(X, w, b)
        l = loss(y, y_hat).sum()
        l.backward()
        SGD([w, b], lr, batch_size)
        # don't forget to zero the gradients
        w.grad.data.zero_()
        b.grad.data.zero_()
    train_l = loss(labels, net(features, w, b)).sum().item()
    print("epochs:{} loss:{}".format(epoch + 1, train_l))
print("w:{} b:{}".format(w, b))
Out
epochs:1 loss:33.648372650146484
epochs:2 loss:0.12297464162111282
epochs:3 loss:0.051339346915483475
epochs:4 loss:0.05117468535900116
epochs:5 loss:0.051132265478372574
epochs:6 loss:0.05133380368351936
epochs:7 loss:0.0512143149971962
epochs:8 loss:0.05118824541568756
epochs:9 loss:0.05112023279070854
epochs:10 loss:0.05111289396882057
w:tensor([[ 2.0003],
[-3.3996]], requires_grad=True) b:tensor([4.2001], requires_grad=True)