Deep Learning
Task01. Linear Regression
1.1. Basic Elements of Linear Regression
1.1.1. Model
Linear regression assumes a linear relationship between the output and each input: ŷ = wᵀx + b
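For example, with the parameters used to generate the dataset in section 1.2.2 below, w = (2, -3.4)ᵀ and b = 4.2, an input x = (1, 1)ᵀ gives ŷ = 2·1 - 3.4·1 + 4.2 = 2.8.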
1.1.2. Dataset
In machine learning terminology, this dataset is called the training data set or training set; each house is a sample, its actual selling price is the label, and the two factors used to predict the label are the features. Features characterize a sample.
1.1.3. Loss function
During training we measure the error between the predicted price and the true price. The error is usually a non-negative number, with smaller values indicating smaller error. A common choice is the squared loss. For the i-th sample:
ℓ⁽ⁱ⁾(w, b) = (1/2) (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²
1.1.4. Optimization function - mini-batch stochastic gradient descent
The optimization procedure has two steps:
• (i) initialize the model parameters, typically at random;
• (ii) iterate over the data many times, updating each parameter by moving it in the direction of the negative gradient. For a mini-batch B with learning rate η:
(w, b) ← (w, b) − (η / |B|) Σ_{i∈B} ∂_{(w,b)} ℓ⁽ⁱ⁾(w, b)
1.2. Implementing Linear Regression from Scratch
1.2.1. Import packages
# import packages and modules
%matplotlib inline
import torch
from IPython import display
from matplotlib import pyplot as plt
import numpy as np
import random
print(torch.__version__)
1.2.2. Generate the dataset and visualize it
# set input feature number
num_inputs = 2
# set example number
num_examples = 1000
# set true weight and bias in order to generate corresponded label
true_w = [2, -3.4]
true_b = 4.2
features = torch.randn(num_examples, num_inputs,
dtype=torch.float32)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()),
dtype=torch.float32)
plt.scatter(features[:, 1].numpy(), labels.numpy(), 1);
1.2.3. Read the dataset
def data_iter(batch_size, features, labels):
num_examples = len(features)
indices = list(range(num_examples))
    random.shuffle(indices)  # shuffle the indices so samples are read in random order
for i in range(0, num_examples, batch_size):
        j = torch.LongTensor(indices[i: min(i + batch_size, num_examples)])  # the last batch may be smaller than batch_size
yield features.index_select(0, j), labels.index_select(0, j)
batch_size = 10
for X, y in data_iter(batch_size, features, labels):
print(X, '\n', y)
break
1.2.4. Initialize model parameters
w = torch.tensor(np.random.normal(0, 0.01, (num_inputs, 1)), dtype=torch.float32)
b = torch.zeros(1, dtype=torch.float32)
w.requires_grad_(requires_grad=True)
b.requires_grad_(requires_grad=True)
1.2.5. Define the model, loss function, and optimizer
Define the model
def linreg(X, w, b):
return torch.mm(X, w) + b
Define the loss function
def squared_loss(y_hat, y):
    return (y_hat - y.view(y_hat.size())) ** 2 / 2  # reshape y to match y_hat's shape before computing the squared error
Define the optimization function
def sgd(params, lr, batch_size):
for param in params:
        param.data -= lr * param.grad / batch_size  # use .data to update the parameter without gradient tracking
1.2.6. Train the model
# super parameters init
lr = 0.03
num_epochs = 5
net = linreg
loss = squared_loss
# training
for epoch in range(num_epochs): # training repeats num_epochs times
# in each epoch, all the samples in dataset will be used once
# X is the feature and y is the label of a batch sample
for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y).sum()  # sum the loss over the batch to get a scalar
# calculate the gradient of batch sample loss
l.backward()
        # use mini-batch stochastic gradient descent to update the model parameters
sgd([w, b], lr, batch_size)
# reset parameter gradient
w.grad.data.zero_()
b.grad.data.zero_()
train_l = loss(net(features, w, b), labels)
print('epoch %d, loss %f' % (epoch + 1, train_l.mean().item()))
1.3. Concise Implementation of Linear Regression with PyTorch
1.3.1. Import packages
import torch
from torch import nn
import numpy as np
torch.manual_seed(1)
print(torch.__version__)
torch.set_default_tensor_type('torch.FloatTensor')
1.3.2. Generate the dataset and visualize it
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
features = torch.tensor(np.random.normal(0, 1, (num_examples, num_inputs)), dtype=torch.float)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float)
1.3.3. Read the dataset
import torch.utils.data as Data
batch_size = 10
# combine features and labels of the dataset
dataset = Data.TensorDataset(features, labels)
# put dataset into DataLoader
data_iter = Data.DataLoader(
dataset=dataset, # torch TensorDataset format
batch_size=batch_size, # mini batch size
    shuffle=True,               # whether to shuffle the data
    num_workers=2,              # number of worker processes for reading data
)
for X, y in data_iter:
print(X, '\n', y)
break
1.3.4. Initialize model parameters
Note: net is constructed in section 1.3.5 below, so run this cell after defining the model. Since net there is a custom LinearNet (not an nn.Sequential container), access the layer as net.linear rather than net[0].
from torch.nn import init
init.normal_(net.linear.weight, mean=0.0, std=0.01)
init.constant_(net.linear.bias, val=0.0)  # or modify it directly with `net.linear.bias.data.fill_(0)`
1.3.5. Define the model, loss function, and optimizer
Define the model
class LinearNet(nn.Module):
def __init__(self, n_feature):
        super(LinearNet, self).__init__()  # call the parent constructor to initialize
self.linear = nn.Linear(n_feature, 1) # function prototype: `torch.nn.Linear(in_features, out_features, bias=True)`
def forward(self, x):
y = self.linear(x)
return y
net = LinearNet(num_inputs)
print(net)
Define the loss function
loss = nn.MSELoss() # nn built-in squared loss function
# function prototype: `torch.nn.MSELoss(size_average=None, reduce=None, reduction='mean')`
Define the optimization function
import torch.optim as optim
optimizer = optim.SGD(net.parameters(), lr=0.03)  # built-in stochastic gradient descent
print(optimizer) # function prototype: `torch.optim.SGD(params, lr=, momentum=0, dampening=0, weight_decay=0, nesterov=False)`
1.3.6. Train the model
num_epochs = 3
for epoch in range(1, num_epochs + 1):
for X, y in data_iter:
output = net(X)
l = loss(output, y.view(-1, 1))
optimizer.zero_grad() # reset gradient, equal to net.zero_grad()
l.backward()
optimizer.step()
print('epoch %d, loss: %f' % (epoch, l.item()))
Task02 - Softmax and Classification Models
2.1. Basic Concepts of Softmax Regression
2.1.1. Solving multi-class classification
Categories are usually represented by discrete values, e.g. y1 = 1, y2 = 2, y3 = 3
2.1.2. Neural network diagram
The softmax operator solves two problems with raw output-layer values: their range is unbounded, so they are hard to interpret as confidences, and the error between discrete labels and outputs of uncertain range is hard to measure. Softmax turns the outputs into a valid probability distribution:
softmax(o)ⱼ = exp(oⱼ) / Σₖ exp(oₖ)
2.1.3. Cross-entropy loss function
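The notes leave this section empty; for reference, the standard definition, consistent with the cross_entropy function implemented in section 2.3.3 below: for a one-hot true label y and predicted distribution ŷ over q classes,
H(y, ŷ) = − Σⱼ yⱼ log ŷⱼ = − log ŷ_y
Only the predicted probability of the true class enters the loss, which is exactly what gather() extracts in the code.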
2.1.4. Model training and prediction
After training a softmax regression model, given any sample's features we can predict the probability of each output class. We usually take the class with the highest predicted probability as the output. If it matches the true class (the label), the prediction is correct.
2.2. Obtaining and Reading the Fashion-MNIST Dataset
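This section has no code in the notes; a minimal sketch using torchvision's built-in dataset (the root path is an assumption; the d2l helper load_data_fashion_mnist used below wraps essentially the same logic):
import torch
import torchvision
import torchvision.transforms as transforms
# download Fashion-MNIST and convert each image to a (1, 28, 28) float tensor in [0, 1]
mnist_train = torchvision.datasets.FashionMNIST(root='~/Datasets/FashionMNIST', train=True, download=True, transform=transforms.ToTensor())
mnist_test = torchvision.datasets.FashionMNIST(root='~/Datasets/FashionMNIST', train=False, download=True, transform=transforms.ToTensor())
batch_size = 256
train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True)
test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False)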
2.3. Implementing Softmax Regression from Scratch (classifying the Fashion-MNIST image data)
import torch
import torchvision
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l
print(torch.__version__)
print(torchvision.__version__)
2.3.1. Load the training and test data
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, root='/home/kesci/input/FashionMNIST2065')
2.3.2. Initialize model parameters
num_inputs = 784
print(28*28)
num_outputs = 10
W = torch.tensor(np.random.normal(0, 0.01, (num_inputs, num_outputs)), dtype=torch.float)
b = torch.zeros(num_outputs, dtype=torch.float)
W.requires_grad_(requires_grad=True)
b.requires_grad_(requires_grad=True)
2.3.3. Define the functions and the model
Define the softmax function
def softmax(X):
X_exp = X.exp()
partition = X_exp.sum(dim=1, keepdim=True)
# print("X size is ", X_exp.size())
# print("partition size is ", partition, partition.size())
    return X_exp / partition  # broadcasting is applied here
Define the softmax regression model
def net(X):
return softmax(torch.mm(X.view((-1, num_inputs)), W) + b)
Define the loss function
def cross_entropy(y_hat, y):
    # y.view(-1, 1) holds each sample's true-class index; gather picks the predicted probability of that class
    return - torch.log(y_hat.gather(1, y.view(-1, 1)))
Define accuracy
• The accuracy defined here is used when evaluating the model after training
def accuracy(y_hat, y):
return (y_hat.argmax(dim=1) == y).float().mean().item()
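The print below assumes y_hat and y already exist; a small illustrative pair (hypothetical values: two samples over three classes, chosen so that only the second prediction is correct, giving an accuracy of 0.5):
y_hat = torch.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y = torch.tensor([0, 2])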
print(accuracy(y_hat, y))
# This function is saved in the d2lzh_pytorch package for later use. It will be improved gradually; its full implementation is described in the "Image Augmentation" section.
def evaluate_accuracy(data_iter, net):
acc_sum, n = 0.0, 0
for X, y in data_iter:
acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
n += y.shape[0]
return acc_sum / n
print(evaluate_accuracy(test_iter, net))
2.3.4. Train the model
num_epochs, lr = 5, 0.1
# This function is saved in the d2lzh_pytorch package for later use
def train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
params=None, lr=None, optimizer=None):
for epoch in range(num_epochs):
train_l_sum, train_acc_sum, n = 0.0, 0.0, 0
for X, y in train_iter:
y_hat = net(X)
l = loss(y_hat, y).sum()
            # zero the gradients
if optimizer is not None:
optimizer.zero_grad()
elif params is not None and params[0].grad is not None:
for param in params:
param.grad.data.zero_()
l.backward()
if optimizer is None:
d2l.sgd(params, lr, batch_size)
else:
optimizer.step()
train_l_sum += l.item()
train_acc_sum += (y_hat.argmax(dim=1) == y).sum().item()
n += y.shape[0]
test_acc = evaluate_accuracy(test_iter, net)
print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f'
% (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc))
train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, batch_size, [W, b], lr)
2.3.5. Predict
X, y = next(iter(test_iter))  # note: iterator.next() was Python 2 syntax; use the built-in next() in Python 3
true_labels = d2l.get_fashion_mnist_labels(y.numpy())
pred_labels = d2l.get_fashion_mnist_labels(net(X).argmax(dim=1).numpy())
titles = [true + '\n' + pred for true, pred in zip(true_labels, pred_labels)]
d2l.show_fashion_mnist(X[0:9], titles[0:9])
2.4. Reimplementing Softmax Regression with PyTorch
# load packages and modules
import torch
from torch import nn
from torch.nn import init
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l
print(torch.__version__)
2.4.1. Load the data
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, root='/home/kesci/input/FashionMNIST2065')
2.4.2. Initialize model parameters
Note: net is constructed in section 2.4.3 below, so run this initialization after building the model; the named nn.Sequential there exposes the linear layer as net.linear.
init.normal_(net.linear.weight, mean=0, std=0.01)
init.constant_(net.linear.bias, val=0)
2.4.3. Define the functions and the model
Define the network model
num_inputs = 784
num_outputs = 10
class LinearNet(nn.Module):
def __init__(self, num_inputs, num_outputs):
super(LinearNet, self).__init__()
self.linear = nn.Linear(num_inputs, num_outputs)
    def forward(self, x):  # x shape: (batch, 1, 28, 28)
y = self.linear(x.view(x.shape[0], -1))
return y
# net = LinearNet(num_inputs, num_outputs)
class FlattenLayer(nn.Module):
def __init__(self):
super(FlattenLayer, self).__init__()
    def forward(self, x):  # x shape: (batch, *, *, ...)
return x.view(x.shape[0], -1)
from collections import OrderedDict
net = nn.Sequential(
# FlattenLayer(),
# LinearNet(num_inputs, num_outputs)
OrderedDict([
('flatten', FlattenLayer()),
        ('linear', nn.Linear(num_inputs, num_outputs))])  # or equivalently our own LinearNet(num_inputs, num_outputs)
)
Define the loss function
loss = nn.CrossEntropyLoss()  # built-in cross-entropy loss; function prototype below
# class torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')
Define the optimizer
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)  # built-in stochastic gradient descent; function prototype below
# class torch.optim.SGD(params, lr=, momentum=0, dampening=0, weight_decay=0, nesterov=False)
2.4.4. Train the model
num_epochs = 5
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None, None, optimizer)
Task03 - Multilayer Perceptron
3.1. Basics of the Multilayer Perceptron
3.1.1. Hidden layers
3.1.2. Formulation
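The formulas this subsection refers to are missing from the notes; the standard formulation, consistent with the net() defined in section 3.2.4 below: for a minibatch X, a single hidden layer computes
H = φ(X W_h + b_h)
O = H W_o + b_o
where φ is the activation function. Without φ, substituting H into O collapses the two layers into one affine map, O = X (W_h W_o) + (b_h W_o + b_o), which motivates the next subsection.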
3.1.3. Activation functions (non-linear transformations)
They address the problem that a fully connected layer only applies an affine transformation to the data, and stacking several affine transformations still yields an affine transformation (multiple hidden layers = a single hidden layer).
Common activation functions (3):
• ReLU
• ReLU(x) = max(0, x)
• ReLU keeps only the positive elements and sets negative elements to zero
• Sigmoid
• maps element values into the interval (0, 1)
• tanh
• tanh (hyperbolic tangent) maps element values into the interval (-1, 1)
Choosing an activation function (see the short sketch after this list):
• ReLU is a general-purpose activation and is used in most cases today; note it is normally used only in hidden layers.
• For classifiers, sigmoid and its combinations often work well at the output; because of the vanishing gradient problem, sigmoid and tanh are sometimes best avoided in deep hidden layers.
• With many layers, prefer ReLU: it is simple and cheap to compute, while sigmoid and tanh cost considerably more.
• When in doubt, start with ReLU and try other activations if the results are unsatisfactory.
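A quick sketch comparing the three activations on a few sample values (plain PyTorch; nothing assumed beyond the functions discussed above):
import torch
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(torch.relu(x))     # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000]) - negatives clipped to zero
print(torch.sigmoid(x))  # values in (0, 1); sigmoid(0) = 0.5
print(torch.tanh(x))     # values in (-1, 1); odd function, tanh(0) = 0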
3.1.4. Multilayer perceptron
3.2. Implementing MLP Image Classification from Scratch
import torch
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l
print(torch.__version__)
# x = torch.randn(4, 4)
# y = x.view(16)  # view() reshapes x into a tensor of the given shape
# print(x.size(), y.size())
# y = x.view(2, 8)
# print(x.size(), y.size())
3.2.1. Load the training set
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size,root='/home/kesci/input/FashionMNIST2065')
print(train_iter)
3.2.2. Define model parameters
num_inputs, num_outputs, num_hiddens = 784, 10, 256
W1 = torch.tensor(np.random.normal(0, 0.01, (num_inputs, num_hiddens)), dtype=torch.float)
b1 = torch.zeros(num_hiddens, dtype=torch.float)
W2 = torch.tensor(np.random.normal(0, 0.01, (num_hiddens, num_outputs)), dtype=torch.float)
b2 = torch.zeros(num_outputs, dtype=torch.float)
params = [W1, b1, W2, b2]
for param in params:
param.requires_grad_(requires_grad=True)
3.2.3. Define the activation function
def relu(X):
return torch.max(input=X, other=torch.tensor(0.0))
3.2.4. Define the network
def net(X):
X = X.view((-1, num_inputs))
    H = relu(torch.matmul(X, W1) + b1)  # matmul() performs matrix multiplication (supports high-dimensional operands)
return torch.matmul(H, W2) + b2
3.2.5. Define the loss function
loss = torch.nn.CrossEntropyLoss()
3.2.6. Train
num_epochs, lr = 5, 100.0  # lr looks large, but d2l.sgd divides gradients by batch_size while CrossEntropyLoss already averages, so the effective step is lr / batch_size ≈ 0.4
# train_ch3 is the same training function defined in section 2.3.4 (saved in the d2l package)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, params, lr)
3.3. Concise Implementation with PyTorch
import torch
from torch import nn
from torch.nn import init
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l
print(torch.__version__)
3.3.1. Define the model and initialize parameters
num_inputs, num_outputs, num_hiddens = 784, 10, 256
net = nn.Sequential(
d2l.FlattenLayer(),
nn.Linear(num_inputs, num_hiddens),
nn.ReLU(),
nn.Linear(num_hiddens, num_outputs),
)
for params in net.parameters():
init.normal_(params, mean=0, std=0.01)
3.3.2. Train
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size,root='/home/kesci/input/FashionMNIST2065')
loss = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.5)
num_epochs = 5
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None, None, optimizer)
Task04 - Text Preprocessing
4.1. Text is a kind of sequence data: an article can be viewed as a sequence of characters or words
4.2. Preprocessing usually consists of four steps: read the text, tokenize, build a vocabulary, and convert token sequences into index sequences
4.2.1. Read the text
import collections
import re  # regular expressions
def read_time_machine():
    with open('/home/kesci/input/timemachine7163/timemachine.txt', 'r') as f:  # open the text file
        lines = [re.sub('[^a-z]+', ' ', line.strip().lower()) for line in f]
        # line.strip() removes leading/trailing whitespace; lower() lowercases; re.sub() does regex replacement
return lines
lines = read_time_machine()
print('# sentences %d' % len(lines))
4.2.2. Tokenization
Tokenize each sentence, i.e. split it into tokens (words or characters), turning it into a sequence of tokens.
def tokenize(sentences, token='word'):
"""(分词)Split sentences into word or char tokens"""
if token == 'word':
        return [sentence.split(' ') for sentence in sentences]  # returns a 2-D list: a token list per sentence
elif token == 'char':
return [list(sentence) for sentence in sentences]
else:
        print('ERROR: unknown token type ' + token)
tokens = tokenize(lines)
tokens[0:2]
• Problems: this tokenization is very crude and has at least the following drawbacks:
• Punctuation usually carries semantic information, but this approach simply discards it
• Words like "shouldn't" and "doesn't" are handled incorrectly
• Words like "Mr." and "Dr." are handled incorrectly
Supplement: tokenization with existing tools (which address the problems above)
• spaCy:
import spacy
text = "Mr. Chen doesn't agree with my suggestion."
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print([token.text for token in doc])
• NLTK:
from nltk.tokenize import word_tokenize
print(word_tokenize(text))
4.2.3. Build a vocabulary mapping each token to a unique index
To convert strings into numbers (indices), we first build a vocabulary that maps each token to a unique index.
class Vocab(object):
"""
每个词映射到一个唯一的索引编号
"""
def __init__(self, tokens, min_freq=0, use_special_tokens=False):
        # tokens: all tokens; min_freq: frequency threshold; use_special_tokens: whether to add special tokens
        counter = count_corpus(tokens)  # frequency dict, <key: value> = <token: count>
self.token_freqs = list(counter.items())
self.idx_to_token = []
if use_special_tokens:
# padding, begin of sentence, end of sentence, unknown
self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
            # <pad>: pads sentences to the maximum length; <bos>: marks the beginning of a sentence; <eos>: marks the end;
            # <unk>: tokens never seen in the corpus, i.e. out-of-vocabulary words
self.idx_to_token += ['<pad>', '<bos>', '<eos>', '<unk>']
else:
self.unk = 0
self.idx_to_token += ['<unk>']
self.idx_to_token += [token for token, freq in self.token_freqs
if freq >= min_freq and token not in self.idx_to_token]
self.token_to_idx = dict()
        # enumerate() pairs each element of an iterable (list, tuple, string, ...) with its index, typically in a for loop
for idx, token in enumerate(self.idx_to_token):
self.token_to_idx[token] = idx
def __len__(self):
return len(self.idx_to_token)
def __getitem__(self, tokens):
"""
从词到索引的映射
"""
if not isinstance(tokens, (list, tuple)):
return self.token_to_idx.get(tokens, self.unk)
return [self.__getitem__(token) for token in tokens]
def to_tokens(self, indices):
"""
从索引到词的映射
"""
if not isinstance(indices, (list, tuple)):
return self.idx_to_token[indices]
return [self.idx_to_token[index] for index in indices]
def count_corpus(sentences):
    tokens = [tk for st in sentences for tk in st]  # flatten the 2-D list sentences into a 1-D token list
    return collections.Counter(tokens)  # a Counter (dict-like) recording each token's frequency
Let's look at an example, building a vocabulary with Time Machine as the corpus.
vocab = Vocab(tokens)
print(list(vocab.token_to_idx.items())[0:10])  # print the first 10 entries
4.2.4. Convert the text from sequences of tokens to sequences of indices, for input to a model
Using the vocabulary, we can convert sentences in the original text from token sequences to index sequences
for i in range(8, 10):
    print('words:', tokens[i])  # the token list of line i
    print('indices:', vocab[tokens[i]])  # the corresponding index list for line i
Task05 - Language Models and Dataset Sampling
5.1. Language Models
5.1.1. Concepts
Definition
• A passage of natural-language text can be viewed as a discrete time series. Given a word sequence w1, w2, …, wT of length T, the goal of a language model is to assess whether the sequence is plausible, i.e. to compute its probability: P(w1, w2, …, wT)
Parameters
• the probabilities of words, and the conditional probabilities of a word given the preceding words
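A step the notes leave implicit: by the chain rule of probability, the sequence probability factorizes into exactly these parameters:
P(w1, w2, …, wT) = P(w1) · P(w2 | w1) · … · P(wT | w1, …, w(T−1))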
5.1.2. Types
Statistical language models, mainly n-gram models (see the formula after this list)
• Drawbacks of n-grams:
• the parameter space is too large
• data sparsity
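For reference (the standard definition, not spelled out in the notes): an n-gram model assumes each word depends only on the previous n−1 words, so
P(w1, …, wT) ≈ ∏ₜ P(wt | w(t−n+1), …, w(t−1))
e.g. a bigram model (n = 2) uses P(w1) P(w2|w1) P(w3|w2) ⋯ P(wT|w(T−1)). The estimated conditional probabilities are the parameters: their number explodes with n (parameter space), and many word combinations are never observed (sparsity).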
Neural-network-based language models (introduced later): RNNs
5.2. The Language-Model Dataset
5.2.1. Read the dataset
with open('/home/kesci/input/jaychou_lyrics4703/jaychou_lyrics.txt') as f:
corpus_chars = f.read()
print(len(corpus_chars))
print(corpus_chars[: 40])
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')  # replace newlines and carriage returns with spaces
corpus_chars = corpus_chars[: 10000]  # keep only the first 10000 characters
print(corpus_chars[: 50])
5.2.2. Build the character index
# set() creates an unordered collection of unique elements; it supports membership tests, deduplication, and set operations such as intersection, difference, and union
idx_to_char = list(set(corpus_chars))  # deduplicate with set(); the resulting list is the index-to-character mapping
char_to_idx = {char: i for i, char in enumerate(idx_to_char)}  # a dict gives the character-to-index mapping
vocab_size = len(char_to_idx)  # vocabulary size
print(vocab_size)
corpus_indices = [char_to_idx[char] for char in corpus_chars]  # convert each character to its index, giving an index sequence
sample = corpus_indices[: 20]
print('chars:', ''.join([idx_to_char[idx] for idx in sample]))
print('indices:', sample)
5.2.3. Define the function load_data_jay_lyrics
def load_data_jay_lyrics():
with open('/home/kesci/input/jaychou_lyrics4703/jaychou_lyrics.txt') as f:
corpus_chars = f.read()
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
corpus_chars = corpus_chars[0:10000]
idx_to_char = list(set(corpus_chars))
char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])
vocab_size = len(char_to_idx)
corpus_indices = [char_to_idx[char] for char in corpus_chars]
return corpus_indices, char_to_idx, idx_to_char, vocab_size
# returns: the index sequence of the whole corpus; the char-to-index mapping; the index-to-char mapping; the vocabulary size
5.3. Sampling Time-Series Data (2 methods)
5.3.1. The problem: for a sequence of length T and time-step count n, there are T - n legal samples, but they overlap heavily. We therefore usually adopt more efficient sampling schemes: random sampling and consecutive sampling.
Each time we sample one mini-batch from the data: batch_size is the number of samples per mini-batch, and num_steps is the number of time steps in each sample.
5.3.2. Random sampling
Each sample is a segment cut from an arbitrary position in the original sequence; two adjacent random mini-batches are generally not adjacent on the original sequence
import torch
import random
def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # subtract 1 because for a sequence of length n, X contains at most its first n - 1 characters
    num_examples = (len(corpus_indices) - 1) // num_steps  # floor division: the number of non-overlapping samples
    example_indices = [i * num_steps for i in range(num_examples)]  # the index in corpus_indices of each sample's first character
random.shuffle(example_indices)
    def _data(i):
        # return the length-num_steps sequence starting at position i
return corpus_indices[i: i + num_steps]
if device is None:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
for i in range(0, num_examples, batch_size):
        # pick batch_size random samples each time
        batch_indices = example_indices[i: i + batch_size]  # indices of the first characters of the samples in this batch
X = [_data(j) for j in batch_indices]
Y = [_data(j + 1) for j in batch_indices]
yield torch.tensor(X, device=device), torch.tensor(Y, device=device)
Let's test this function: we use the consecutive integers 0 to 29 as an artificial sequence, set the batch size to 2 and the number of time steps to 6, and print the inputs X and labels Y of each mini-batch read by random sampling.
my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
print('X: ', X, '\nY:', Y, '\n')
5.3.3. Consecutive sampling
Two adjacent mini-batches are adjacent on the original sequence
def data_iter_consecutive(corpus_indices, batch_size, num_steps, device=None):
if device is None:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    corpus_len = len(corpus_indices) // batch_size * batch_size  # length of the retained sequence
    corpus_indices = corpus_indices[: corpus_len]  # keep only the first corpus_len characters
    indices = torch.tensor(corpus_indices, device=device)
    indices = indices.view(batch_size, -1)  # reshape into a 2-D tensor of shape (batch_size, corpus_len // batch_size)
    batch_num = (indices.shape[1] - 1) // num_steps  # (size of the second dimension - 1) // num_steps
for i in range(batch_num):
i = i * num_steps
X = indices[:, i: i + num_steps]
Y = indices[:, i + 1: i + num_steps + 1]
yield X, Y
With the same settings, print the inputs X and labels Y of each mini-batch read by consecutive sampling. Two adjacent mini-batches are indeed adjacent on the original sequence.
for X, Y in data_iter_consecutive(my_seq, batch_size=2, num_steps=6):
print('X: ', X, '\nY:', Y, '\n')
Task06 - Recurrent Neural Networks
6.1. RNN-Based Language Models
6.1.1. Concept
6.1.2. Construction
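These two subsections have no body in the notes; the construction, matching the rnn() function in section 6.2.3 below: at time step t, given input Xt and the previous hidden state H(t−1),
Ht = tanh(Xt W_xh + H(t−1) W_hh + b_h)
Ot = Ht W_hq + b_q
The hidden state carries information across time steps; each output Ot is scored against the vocabulary to predict the next character.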
6.2. Implementation from Scratch
import torch
import torch.nn as nn
import time
import math
import sys
sys.path.append("/home/kesci/input")
import d2l_jay9460 as d2l
(corpus_indices, char_to_idx, idx_to_char, vocab_size) = d2l.load_data_jay_lyrics()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
6.2.1. One-hot vectors
We represent characters as vectors, here using one-hot vectors.
Suppose the vocabulary size is N and each character corresponds to a unique index from 0 to N-1. The character's vector then has length N: if the character's index is i, position i of the vector is 1 and all other positions are 0.
def one_hot(x, n_class, dtype=torch.float32):
    # x: a 1-D vector whose elements are character indices; n_class: the vocabulary size
    result = torch.zeros(x.shape[0], n_class, dtype=dtype, device=x.device)  # result shape: (n, n_class)
    result.scatter_(1, x.long().view(-1, 1), 1)  # result[i, x[i]] = 1
    return result  # an n x n_class matrix
x = torch.tensor([0, 2])
x_one_hot = one_hot(x, vocab_size)
print(x_one_hot)
print(x_one_hot.shape)
print(x_one_hot.sum(axis=1))
• Each sampled mini-batch has shape (batch_size, num_steps). This function turns such a mini-batch into num_steps matrices of shape (batch_size, vocab_size), one per time step.
def to_onehot(X, n_class):
    return [one_hot(X[:, i], n_class) for i in range(X.shape[1])]  # one-hot encode one column (time step) of X at a time
X = torch.arange(10).view(2, 5)
inputs = to_onehot(X, vocab_size)
print(len(inputs), inputs[0].shape)
6.2.2. Initialize model parameters
num_inputs, num_hiddens, num_outputs = vocab_size, 256, vocab_size
# num_inputs: d
# num_hiddens: h, the number of hidden units (a hyperparameter)
# num_outputs: q
def get_params():
def _one(shape):
        # construct a parameter matrix
param = torch.zeros(shape, device=device, dtype=torch.float32)
nn.init.normal_(param, 0, 0.01)
return torch.nn.Parameter(param)
    # initialization
    # hidden layer parameters
W_xh = _one((num_inputs, num_hiddens))
W_hh = _one((num_hiddens, num_hiddens))
b_h = torch.nn.Parameter(torch.zeros(num_hiddens, device=device))
    # output layer parameters
W_hq = _one((num_hiddens, num_outputs))
b_q = torch.nn.Parameter(torch.zeros(num_outputs, device=device))
    return (W_xh, W_hh, b_h, W_hq, b_q)  # return a tuple
6.2.3. Define the model
def rnn(inputs, state, params):
    # inputs and outputs are both num_steps matrices of shape (batch_size, vocab_size)
W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state  # unpack the hidden state
outputs = []
for X in inputs:
H = torch.tanh(torch.matmul(X, W_xh) + torch.matmul(H, W_hh) + b_h)
Y = torch.matmul(H, W_hq) + b_q
outputs.append(Y)
return outputs, (H,)
6.2.4. Define the function init_rnn_state
Initialize the hidden state; the return value is a tuple
def init_rnn_state(batch_size, num_hiddens, device):  # initialize the hidden state and assign it to state
return (torch.zeros((batch_size, num_hiddens), device=device), )
A quick test to check the number of outputs (time steps), the output shape at the first time step, and the shape of the hidden state.
print(X.shape)
print(num_hiddens)
print(vocab_size)  # vocabulary size
state = init_rnn_state(X.shape[0], num_hiddens, device)
inputs = to_onehot(X.to(device), vocab_size)
params = get_params()
outputs, state_new = rnn(inputs, state, params)
print(len(inputs), inputs[0].shape)
print(len(outputs), outputs[0].shape)
print(len(state), state[0].shape)
print(len(state_new), state_new[0].shape)
6.2.5. Gradient clipping
Problem addressed: recurrent neural networks are prone to vanishing or exploding gradients, which can make the network nearly impossible to train.
Gradient clipping (clip gradient) is a technique for coping with exploding gradients.
def grad_clipping(params, theta, device):
norm = torch.tensor([0.0], device=device)
    # compute the L2 norm of the gradient g
for param in params:
norm += (param.grad.data ** 2).sum()
    norm = norm.sqrt().item()  # ||g||
    # if ||g|| > theta, scale every gradient by theta / ||g||
if norm > theta:
for param in params:
param.grad.data *= (theta / norm)
6.2.6. Define the prediction function
Predict the next num_chars characters from the prefix (a string of several characters)
def predict_rnn(prefix, num_chars, rnn, params, init_rnn_state,
num_hiddens, vocab_size, device, idx_to_char, char_to_idx):
state = init_rnn_state(1, num_hiddens, device)
    output = [char_to_idx[prefix[0]]]  # output records prefix plus the num_chars predicted characters
for t in range(num_chars + len(prefix) - 1):
        # take the output of the previous time step as the input of the current time step
X = to_onehot(torch.tensor([[output[-1]]], device=device), vocab_size)
        # compute the output and update the hidden state
(Y, state) = rnn(X, state, params)
        # the next time step's input is the next character of prefix, or the current best prediction
if t < len(prefix) - 1:
output.append(char_to_idx[prefix[t + 1]])
else:
output.append(Y[0].argmax(dim=1).item())
return ''.join([idx_to_char[i] for i in output])
Let's first test predict_rnn. We generate a 10-character continuation (not counting the prefix) from the prefix '分开'. Because the model parameters are random values, the prediction is random too.
predict_rnn('分开', 10, rnn, params, init_rnn_state, num_hiddens, vocab_size,
device, idx_to_char, char_to_idx)
6.2.7. Perplexity
We usually evaluate a language model with perplexity: the exponential of the cross-entropy loss, i.e. the reciprocal of the geometric mean of the probabilities assigned to the true tokens.
• Any useful model must have perplexity below the number of classes; here it must be below the vocabulary size vocab_size, which is the perplexity of predicting uniformly at random.
6.2.8. Define the training function
Differences from previous training functions:
• the model is evaluated with perplexity;
• gradients are clipped before the parameters are updated;
• different sampling schemes for time-series data lead to different hidden-state initializations.
def train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
vocab_size, device, corpus_indices, idx_to_char,
char_to_idx, is_random_iter, num_epochs, num_steps,
lr, clipping_theta, batch_size, pred_period,
pred_len, prefixes):
if is_random_iter:
data_iter_fn = d2l.data_iter_random
else:
data_iter_fn = d2l.data_iter_consecutive
params = get_params()
loss = nn.CrossEntropyLoss()
for epoch in range(num_epochs):
        if not is_random_iter:  # with consecutive sampling, initialize the hidden state at the start of each epoch
            state = init_rnn_state(batch_size, num_hiddens, device)
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = data_iter_fn(corpus_indices, batch_size, num_steps, device)
        for X, Y in data_iter:
            if is_random_iter:  # with random sampling, initialize the hidden state before each mini-batch
                state = init_rnn_state(batch_size, num_hiddens, device)
            else:  # otherwise detach the hidden state from the computation graph
                for s in state:
                    s.detach_()
            # inputs is num_steps matrices of shape (batch_size, vocab_size)
            inputs = to_onehot(X, vocab_size)
            # outputs is num_steps matrices of shape (batch_size, vocab_size)
            (outputs, state) = rnn(inputs, state, params)
            # after concatenation, the shape is (num_steps * batch_size, vocab_size)
            outputs = torch.cat(outputs, dim=0)
            # Y has shape (batch_size, num_steps); transpose and flatten it into a
            # vector of shape (num_steps * batch_size,) so it lines up with the output rows
            y = torch.flatten(Y.T)
            # compute the average classification error with the cross-entropy loss
            l = loss(outputs, y.long())
            # zero the gradients
            if params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()
            l.backward()
            grad_clipping(params, clipping_theta, device)  # clip gradients
            d2l.sgd(params, lr, 1)  # the loss is already averaged, so don't divide the gradients again
            l_sum += l.item() * y.shape[0]
            n += y.shape[0]
        # periodically print progress and sample lyrics
if (epoch + 1) % pred_period == 0:
print('epoch %d, perplexity %f, time %.2f sec' % (
epoch + 1, math.exp(l_sum / n), time.time() - start))
for prefix in prefixes:
print(' -', predict_rnn(prefix, pred_len, rnn, params, init_rnn_state,
num_hiddens, vocab_size, device, idx_to_char, char_to_idx))
6.2.9. Train the model and compose lyrics
First, set the hyperparameters. We will compose two 50-character segments of lyrics (not counting the prefix) from the prefixes '分开' and '不分开', sampling from the current model every 50 epochs.
num_epochs, num_steps, batch_size, lr, clipping_theta = 250, 35, 32, 1e2, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['分开', '不分开']
Now train the model with random sampling and compose lyrics.
train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
vocab_size, device, corpus_indices, idx_to_char,
char_to_idx, True, num_epochs, num_steps, lr,
clipping_theta, batch_size, pred_period, pred_len,
prefixes)
6.3. Concise Implementation of the RNN
6.3.1. Define the model
Use PyTorch's nn.RNN to construct the recurrent neural network
class RNNModel(nn.Module):
def __init__(self, rnn_layer, vocab_size):
super(RNNModel, self).__init__()
self.rnn = rnn_layer
        self.hidden_size = rnn_layer.hidden_size * (2 if rnn_layer.bidirectional else 1)  # times 2 for a bidirectional RNN
self.vocab_size = vocab_size
        self.dense = nn.Linear(self.hidden_size, vocab_size)  # a linear layer as the output layer
def forward(self, inputs, state):
# inputs.shape: (batch_size, num_steps)
        X = to_onehot(inputs, self.vocab_size)
        X = torch.stack(X)  # stack the list into a tensor -> X.shape: (num_steps, batch_size, vocab_size)
hiddens, state = self.rnn(X, state)
        hiddens = hiddens.view(-1, hiddens.shape[-1])  # hiddens.shape: (num_steps * batch_size, hidden_size)
output = self.dense(hiddens)
return output, state
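predict_rnn_pytorch is called below but never defined in these notes; a minimal sketch following the same logic as predict_rnn in section 6.2.6 (feed the previous output back in as input and let nn.RNN manage the hidden state):
def predict_rnn_pytorch(prefix, num_chars, model, vocab_size, device, idx_to_char, char_to_idx):
    state = None  # nn.RNN treats state=None as an all-zero initial state
    output = [char_to_idx[prefix[0]]]  # records prefix plus the predicted characters
    for t in range(num_chars + len(prefix) - 1):
        X = torch.tensor([output[-1]], device=device).view(1, 1)  # shape (batch=1, num_steps=1)
        (Y, state) = model(X, state)  # the forward pass also returns the updated hidden state
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])  # still consuming the prefix
        else:
            output.append(Y.argmax(dim=1).item())  # greedy: take the most likely next character
    return ''.join([idx_to_char[i] for i in output])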
Use a model with randomly initialized weights to make one prediction. Note that rnn_layer is never constructed in these notes; the standard construction matching RNNModel is:
rnn_layer = nn.RNN(input_size=vocab_size, hidden_size=num_hiddens)
model = RNNModel(rnn_layer, vocab_size).to(device)
predict_rnn_pytorch('分开', 10, model, vocab_size, device, idx_to_char, char_to_idx)
6.3.2. Define the training function
def train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
corpus_indices, idx_to_char, char_to_idx,
num_epochs, num_steps, lr, clipping_theta,
batch_size, pred_period, pred_len, prefixes):
loss = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
model.to(device)
for epoch in range(num_epochs):
l_sum, n, start = 0.0, 0, time.time()
        data_iter = d2l.data_iter_consecutive(corpus_indices, batch_size, num_steps, device)  # consecutive sampling
        state = None  # PyTorch defaults the initial hidden state to zeros when state is None
for X, Y in data_iter:
if state is not None:
                # detach the hidden state from the computation graph
                if isinstance(state, tuple):  # for an LSTM, state is (h, c)
state[0].detach_()
state[1].detach_()
else:
state.detach_()
(output, state) = model(X, state) # output.shape: (num_steps * batch_size, vocab_size)
y = torch.flatten(Y.T)
l = loss(output, y.long())
optimizer.zero_grad()
l.backward()
grad_clipping(model.parameters(), clipping_theta, device)
optimizer.step()
l_sum += l.item() * y.shape[0]
n += y.shape[0]
if (epoch + 1) % pred_period == 0:
print('epoch %d, perplexity %f, time %.2f sec' % (
epoch + 1, math.exp(l_sum / n), time.time() - start))
for prefix in prefixes:
print(' -', predict_rnn_pytorch(
prefix, pred_len, model, vocab_size, device, idx_to_char,
char_to_idx))
6.3.3. Train the model
num_epochs, batch_size, lr, clipping_theta = 250, 32, 1e-3, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['分开', '不分开']
train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
corpus_indices, idx_to_char, char_to_idx,
num_epochs, num_steps, lr, clipping_theta,
batch_size, pred_period, pred_len, prefixes)