Week 11: Machine Learning

Abstract

Over the previous two weeks we studied the RNN computation graph, the main types of RNN, and how an RNN is trained. Following that theoretical discussion, this week puts the theory into code through two basic RNN examples from the official PyTorch documentation. Task 1 and Task 2 run in opposite directions (from a name to its language, and from a language to a name); Task 1 carries the main code analysis and Task 2 focuses on summarizing. Finally, we study batch normalization (BN) from both the training and the testing side, and discuss the relationship between BN and internal covariate shift (ICS) on the basis of existing experimental results.

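I. Classifying Names with a Character-Level RNN

Goal: feed a word into the network as a sequence of characters. At each step the input is transformed into an "output prediction" and a "hidden state", and the hidden state of the previous step becomes part of the input to the next step. The output of the last step is taken as the prediction, that is, the category the word belongs to.

Concrete task: train the model on a dataset of several thousand names from 18 languages; after training, predict which language a name comes from based on its spelling.

1. Preparing the data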

from __future__ import unicode_literals, print_function, division
from io import open  # open files with an explicit encoding
import glob  # find the files in a directory that match a pattern
import os

def findFiles(path):       # return all files that match the given pattern
    return glob.glob(path)
print(findFiles('E:/pytorch学习/task_RNN/data/names/*.txt'))  # prints the list of matching file paths

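The function findFiles() returns the files in the given directory that match the given pattern; here the output is the list of .txt files under data/names, one per language.

Unifying the alphabet: represent all 18 languages with English letters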

import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
# string.ascii_letters holds all lowercase and uppercase letters; all_letters adds a few common punctuation marks
print(len(all_letters))
n_letters = len(all_letters)

def unicodeToAscii(s):    # convert a Unicode string to plain ASCII
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )
# After the conversion to ASCII, each character can be described in one-hot form.
# In other words, names written with accented letters are mapped onto the plain English alphabet.
print(unicodeToAscii('Ślusàrski'))

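Here string.ascii_letters holds all lowercase and uppercase letters, and all_letters adds a few common punctuation marks; the first print shows the total number of characters, 57.

The function unicodeToAscii() converts a Unicode string to plain ASCII: for example, 'Ślusàrski' becomes 'Slusarski'.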
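To make the conversion less of a black box, the following minimal sketch shows what the NFD normalization does to a single accented letter before the combining marks (Unicode category 'Mn') are filtered out. The snake_case name unicode_to_ascii is only a local stand-in for the function above, chosen so that the snippet runs on its own.

```python
import string
import unicodedata

all_letters = string.ascii_letters + " .,;'"

def unicode_to_ascii(s):
    """Decompose accented letters (NFD), then drop the combining marks."""
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn' and c in all_letters
    )

# NFD splits the accented letter into a base letter and a combining accent.
decomposed = unicodedata.normalize('NFD', 'Ś')
parts = [(c, unicodedata.category(c)) for c in decomposed]
print(parts)

cleaned = unicode_to_ascii('Ślusàrski')
print(cleaned)
```

The base letter has category 'Lu' and survives the filter, while the accent has category 'Mn' and is dropped; any remaining character that is not in all_letters is discarded as well.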

Building a dictionary that maps each language to its list of names: {language: [name1, name2, ...]}

category_lines = {}  # dictionary holding the list of names of each language: {language: [name1, name2, ...]}
all_categories = []  # list of all language names


def readLines(filename):  # read a file and split it into lines
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('E:/pytorch学习/task_RNN/data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]  # the language name, i.e. the file name without ".txt"
    # splitext returns the pair ('xxx', '.txt'); taking [0] keeps only the language name 'xxx'
    all_categories.append(category)   # collect the language names in order
    lines = readLines(filename)       # read every line (one name per line) of the file
    category_lines[category] = lines  # store the list of names under its language in category_lines
# print(readLines('./data/names/Chinese.txt'))
n_categories = len(all_categories)

print(all_categories)
print(category_lines['Chinese'])  # same output as print(readLines('data/names/Chinese.txt')), but looked up by language

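The function readLines() reads a file and passes each line (one name per line) through unicodeToAscii().

category is one language, all_categories is the list of all languages, lines holds the names of one language, and category_lines is the dictionary that maps each language to its list of names. The names can therefore be looked up directly by language, as in category_lines['Chinese'].

After this step the names from all files are loaded.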
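The loading step can be tried without the real dataset. The sketch below is an illustration only: it writes a tiny, made-up two-language dataset into a temporary directory and then builds the same two structures. The sample names, the temporary directory and the use of sorted() are assumptions added here; glob itself does not guarantee any particular file order.

```python
import glob
import os
import tempfile

# Hypothetical mini-dataset: two languages with a few names each.
samples = {'English': ['Smith', 'Jones'], 'French': ['Dubois', 'Martin', 'Petit']}

category_lines = {}
all_categories = []

with tempfile.TemporaryDirectory() as data_dir:
    for language, names in samples.items():
        path = os.path.join(data_dir, language + '.txt')
        with open(path, 'w', encoding='utf-8') as f:
            f.write('\n'.join(names) + '\n')

    # sorted() makes the category order reproducible across runs.
    for filename in sorted(glob.glob(os.path.join(data_dir, '*.txt'))):
        category = os.path.splitext(os.path.basename(filename))[0]
        with open(filename, encoding='utf-8') as f:
            lines = f.read().strip().split('\n')
        all_categories.append(category)
        category_lines[category] = lines

print(all_categories)
print(category_lines['French'])
```

Because the index of a language in all_categories later serves as its class label, a stable order matters if a trained model is to be reused in another session.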

Converting words to tensors

import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def letterToIndex(letter):   # index of the letter in all_letters, e.g. 'a' maps to 0
    return all_letters.find(letter)

def letterToTensor(letter):   # encode one letter as a one-hot tensor
    tensor = torch.zeros(1, n_letters)  # shape (1, n_letters): the 1 acts as the batch dimension, n_letters = len(all_letters) = 57
    tensor[0][letterToIndex(letter)] = 1    # set the position of this letter to 1; all others stay 0
    return tensor.to(device)


def lineToTensor(line):  # encode a whole word as a tensor
    tensor = torch.zeros(len(line), 1, n_letters)  # shape (len(line), 1, n_letters): one one-hot row per character
    for li, letter in enumerate(line):   # for every letter of the word, set its index to 1; all others stay 0
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor.to(device)

print(letterToIndex('J'))
print(letterToTensor('J'))
print(lineToTensor('Jones').size())

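letterToIndex() maps a character to its index in all_letters, which is the position set to 1 in the one-hot vector. letterToTensor() turns a character into a one-hot tensor of 57 dimensions (uppercase and lowercase letters plus the punctuation marks, 57 characters in total). lineToTensor() extends letterToTensor() to a whole word and produces one one-hot row per character.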
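The layout of the encoding can be checked without PyTorch. The following sketch uses nested Python lists in place of tensors, purely to show the index of a letter and the resulting shape; letter_to_index and line_to_one_hot are illustrative names, not part of the tutorial code.

```python
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)  # 52 letters + 5 punctuation characters

def letter_to_index(letter):
    return all_letters.find(letter)

def line_to_one_hot(line):
    """Nested lists with the layout (len(line), 1, n_letters)."""
    rows = []
    for letter in line:
        one_hot = [0] * n_letters
        one_hot[letter_to_index(letter)] = 1
        rows.append([one_hot])  # the inner list plays the role of the batch dimension
    return rows

encoded = line_to_one_hot('Jones')
shape = (len(encoded), len(encoded[0]), len(encoded[0][0]))
print(letter_to_index('J'), shape)
```

Lowercase letters occupy indices 0 to 25 and uppercase letters 26 to 51, which is why 'J' lands at index 35.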

2. Building the neural network

import torch.nn as nn
# Task: given a name, predict its language. The network combines the features of the current input character with the information carried over in the hidden state.
class RNN(nn.Module):
    # define the input and output size of every layer
    def __init__(self, input_size, hidden_size, output_size):  # input_size is n_letters = 57
        super(RNN, self).__init__()

        self.hidden_size = hidden_size  # the constructor takes input_size, hidden_size and output_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)   # nn.Linear(in_features, out_features)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)  # log-softmax over the category dimension

    # forward pass
    def forward(self, input, hidden):
        # on the first step, hidden is the zero-initialized h0 returned by initHidden()
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden   # output is used to compute the loss; hidden carries the state to the next step

    # initial hidden state h0 (all zeros)
    def initHidden(self):
        return torch.zeros(1, self.hidden_size).to(device)


n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)  # n_categories = 18 is the number of languages; RNN(input, hidden, output)
rnn = rnn.to(device)

# quick test: feed one letter together with a zero-initialized hidden tensor
input = letterToTensor('A')
hidden = torch.zeros(1, n_hidden).to(device)  # must live on the same device as the model and the input
output, next_hidden = rnn(input, hidden)
print(output)  # log-probabilities over the 18 languages, shape (1, 18)
print(next_hidden)  # hidden state passed on to the next step, shape (1, 128)

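This RNN consists of three basic layers: two Linear layers plus one LogSoftmax layer.

The input is the tensor of the current letter together with the hidden state (initially all zeros); the output is the log-probability of each language together with the new hidden state, which is passed on to the next step.

The printed result shows that output has 18 dimensions, because there are 18 languages to classify, while hidden has 128 dimensions (the value chosen for n_hidden).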
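For reference, the forward pass can be written as the equations below. They are a transcription of forward() as defined above, not a general RNN definition. The symbols $W_{ih}, b_{ih}$ and $W_{io}, b_{io}$ are names chosen here for the weights and biases of i2h and i2o, and $[x_t ; h_{t-1}]$ denotes the concatenation produced by torch.cat.

```latex
\begin{aligned}
c_t &= [\,x_t \,;\, h_{t-1}\,] \in \mathbb{R}^{57+128}, \qquad h_0 = \mathbf{0} \\
h_t &= W_{ih}\, c_t + b_{ih} \in \mathbb{R}^{128} \\
o_t &= \log \operatorname{softmax}\!\left(W_{io}\, c_t + b_{io}\right) \in \mathbb{R}^{18}
\end{aligned}
```

Written this way, it becomes visible that the hidden update applies no nonlinearity such as tanh; the only nonlinear step is the log-softmax on the output.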

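3. Training

Preparing for training

This helper interprets the network output: it compares the values of all categories and picks the best-matching one.

def categoryFromOutput(output):  # pick the best prediction from output
    top_n, top_i = output.topk(1)  # topk() returns two tensors: the values and the indices (positions of the largest values)
    category_i = top_i[0].item()   # item() converts the one-element index tensor to a Python int
    print('top_n=%s, top_i=%d' % (top_n, category_i))
    return all_categories[category_i], category_i
print(output)
print(output.topk(1))
categoryFromOutput(output)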

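tensor.topk() returns the largest values together with their indices; categoryFromOutput() finds the most likely prediction in output and returns the predicted language together with its index (0 to 17).

The output has 18 dimensions, one per language, and the index with the largest value is the prediction.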
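What topk computes can be mimicked in plain Python. The sketch below is an illustration with made-up log-probabilities for four categories, not real model output, and top_k is a hypothetical helper built on heapq.nlargest rather than the PyTorch function.

```python
import heapq

# Hypothetical log-probabilities for four categories (not real model output).
categories = ['Chinese', 'English', 'French', 'Russian']
log_probs = [-2.3, -0.4, -1.9, -1.2]

def top_k(values, k):
    """Return (value, index) pairs of the k largest entries, largest first."""
    return heapq.nlargest(k, ((v, i) for i, v in enumerate(values)))

best_value, best_index = top_k(log_probs, 1)[0]
print(categories[best_index], best_value)

top_three = [categories[i] for _, i in top_k(log_probs, 3)]
print(top_three)
```

Since log-probabilities are negative, the "largest" value is the one closest to zero, which corresponds to the highest probability.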

Getting training examples quickly

Before training, this step draws a random training example (a random name from a random language).

import random
def randomChoice(l):    # pick a random element of the list l
    return l[random.randint(0, len(l) - 1)]


def randomTrainingExample():
    category = randomChoice(all_categories)  # pick a random language
    line = randomChoice(category_lines[category])  # pick a random name of that language (one name per line)
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long).to(device)  # the category index as a tensor
    line_tensor = lineToTensor(line)  # one-hot tensor of the name
    return category, line, category_tensor, line_tensor


for i in range(10):   # draw 10 random examples (languages may repeat)
    category, line, category_tensor, line_tensor = randomTrainingExample()
    print('category = ', category, '/ line = ', line)

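randomTrainingExample() draws a name at random in two steps (first a language, then a name of that language) and converts both the language and the name to tensors.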
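One consequence of sampling in two steps is that every language is drawn equally often, however many names it has. The sketch below illustrates this with a deliberately unbalanced, made-up dataset; the data, the seed and the use of random.Random are assumptions for the demonstration, and rng.choice plays the role of randomChoice().

```python
import random
from collections import Counter

# Hypothetical, deliberately unbalanced dataset.
category_lines = {
    'Russian': ['name%d' % i for i in range(900)],
    'Korean': ['name%d' % i for i in range(100)],
}
all_categories = list(category_lines)

def random_training_example(rng):
    category = rng.choice(all_categories)        # step 1: uniform over languages
    line = rng.choice(category_lines[category])  # step 2: uniform over that language's names
    return category, line

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(random_training_example(rng)[0] for _ in range(10000))
print(counts)
```

Both languages end up with roughly half of the draws, so a small language is seen about as often during training as a large one, at the cost of repeating its names more frequently.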

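Defining the training function

criterion = nn.NLLLoss()  # loss function: negative log-likelihood, which matches the LogSoftmax output
learning_rate = 0.005

def train(category_tensor, line_tensor):
    hidden = rnn.initHidden()   # start from the zero hidden state
    rnn.zero_grad()   # clear the gradients so that they do not accumulate across calls

    # the RNN loop
    for i in range(line_tensor.size()[0]):  # the first dimension of line_tensor indexes the letters of the word
        output, hidden = rnn(line_tensor[i], hidden)  # the first letter uses the zero hidden state, later letters use the hidden state of the previous step

    loss = criterion(output, category_tensor)  # compare the final output with the true category
    loss.backward()
    # update the parameters
    for p in rnn.parameters():   # plain SGD step: p = p - learning_rate * grad
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item()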

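step1 Define the loss function and the learning rate

step2 Initialize the hidden state and zero the gradients

step3 Run the RNN over every letter of the word

step4 Backpropagate and update the parameters

step5 Return the final output and the loss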
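The loss and the update in the steps above can be worked through by hand on a tiny example. The following is a minimal pure-Python sketch with made-up scores for three categories; log_softmax and nll_loss are simplified stand-ins written for this illustration, not the PyTorch implementations, and the single parameter and gradient at the end are hypothetical numbers.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def nll_loss(log_probs, target_index):
    """Negative log-likelihood of the true class for a single example."""
    return -log_probs[target_index]

# Hypothetical raw scores for three categories; the true category is index 0.
scores = [2.0, 1.0, 0.1]
log_probs = log_softmax(scores)
loss = nll_loss(log_probs, target_index=0)
print(round(loss, 4))

# One plain SGD step on a single hypothetical parameter: p = p - lr * grad.
p, grad, learning_rate = 0.5, 2.0, 0.005
p_new = p - learning_rate * grad
print(p_new)
```

The loss is simply the negative log-probability assigned to the correct language, so it approaches zero as the model becomes confident and correct.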

Starting the training

While training on the dataset, we can plot the loss to watch how it changes with the number of iterations.

import time
import math

n_iters = 100000
print_every = 5000
plot_every = 1000

# bookkeeping for the loss plot
current_loss = 0
all_losses = []

def timeSince(since):  # elapsed time in minutes and seconds
    now = time.time()
    s = now-since
    return '%dm %ds'%(s//60,s%60)  # formatted as "Xm Ys"

start = time.time()

for iter in range(1, n_iters + 1):  # n_iters training iterations, one random example each
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '√' if guess==category else '×(%s)'%category
        print('%d %d%% (%s) %.4f %s / %s %s' %
          (iter, iter/n_iters*100,timeSince(start),loss,line,guess,correct))
        # prints: iteration number, progress in percent, elapsed time, loss, name, guess, and whether the guess is correct

    if iter % plot_every == 0:   # store the average loss of the last plot_every iterations
        all_losses.append(current_loss/plot_every)
        current_loss = 0

# plot the result
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure()
plt.plot(all_losses)

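time.time() returns the current timestamp. print_every means that a sample prediction is printed every print_every iterations, and plot_every means that the loss averaged over the last plot_every iterations is added as one point of the loss curve.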
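The bookkeeping behind the loss curve is easy to verify on a short list. The sketch below uses nine made-up loss values and a window of three; only the accumulate, average and reset pattern is taken from the training loop.

```python
# Hypothetical per-iteration losses (not real training output).
losses = [2.9, 2.7, 2.8, 2.5, 2.4, 2.6, 2.1, 2.0, 2.2]
plot_every = 3

current_loss = 0.0
all_losses = []
for iteration, loss in enumerate(losses, start=1):
    current_loss += loss
    if iteration % plot_every == 0:
        all_losses.append(current_loss / plot_every)  # one plotted point per window
        current_loss = 0.0

print([round(v, 2) for v in all_losses])
```

With n_iters = 100000 and plot_every = 1000, the same pattern yields 100 points, each the mean of 1000 single-example losses, which is why the plotted curve is much smoother than the raw loss.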

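[Figure: console output of the training run]

[Figure: loss curve]

The loss curve shows the overall trend of the loss over time and how stable it is, from which we can judge how well the training went.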

4. Evaluating the results

We build a confusion matrix, confusion, which shows for the 18 languages how the true language (rows) compares with the predicted language (columns).

confusion = torch.zeros(n_categories, n_categories)  # confusion matrix that counts the guesses
n_confusion = 10000  # number of samples

def evaluate(line_tensor):  # same forward pass as train(), but without backpropagation
    hidden = rnn.initHidden()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i],hidden)

    return output

for i in range(n_confusion):
    category, line, category_tensor,line_tensor = randomTrainingExample()
    output = evaluate(line_tensor)
    guess, guess_i = categoryFromOutput(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1  # row category_i is the true language, column guess_i the predicted one

for i in range(n_categories):   # normalize: divide every entry by the sum of its row
    confusion[i] = confusion[i] / confusion[i].sum()

# set up the plot
fig = plt.figure()
ax = fig.add_subplot(111)  # a single subplot (1 row, 1 column, first position)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)  # color scale for the matrix values

# label the axes
ax.set_xticklabels(['']+all_categories,rotation=90)
ax.set_yticklabels(['']+all_categories)

ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

plt.show()

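evaluate() is train() without backpropagation: initialize the hidden state, run the RNN over every letter of the word, and return the final output.

The lighter a cell, the larger the share of names of the row's language that were predicted as the column's language. Ideally the diagonal is the lightest and all other cells are dark. Light cells off the diagonal therefore point to languages that the model tends to mix up, typically because the two are similar.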
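Reading the matrix can also be done numerically. The following sketch uses a made-up 3x3 count matrix in plain Python lists; it normalizes each row and then looks for the largest off-diagonal entry. The guard for rows that sum to zero is an addition for this illustration: in the tensor code above, a language that was never sampled would produce NaN in its row.

```python
# Hypothetical counts: rows are true languages, columns are predicted languages.
categories = ['English', 'French', 'Korean']
counts = [
    [70, 25, 5],
    [30, 60, 10],
    [0, 0, 0],   # a language that was never sampled
]

def normalize_rows(matrix):
    """Divide each row by its sum; rows that sum to zero are left as zeros."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([v / total if total else 0.0 for v in row])
    return result

confusion = normalize_rows(counts)

# Largest off-diagonal entry: the most frequent kind of mistake.
rate, true_i, pred_i = max(
    (confusion[i][j], i, j)
    for i in range(len(categories))
    for j in range(len(categories))
    if i != j
)
print(categories[true_i], 'predicted as', categories[pred_i], round(rate, 2))
```

After normalization each sampled row sums to 1, so the diagonal entry of a row is the per-language accuracy and the off-diagonal entries are the error rates toward each other language.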

5. Prediction

This part handles user input: for a given name it predicts the three most likely languages.

def predict(input_line, n_predictions=3):  # given a name, print the top three predictions
    print('\n> %s'%input_line)
    with torch.no_grad():   # forward pass only, no gradient tracking
        output = evaluate(lineToTensor(input_line))

        topv, topi = output.topk(n_predictions,1,True)
        predictions = []

        for i in range(n_predictions):   # collect the top n categories
            value = topv[0][i].item()
            category_index = topi[0][i].item()
            print('(%.2f) %s' % (value, all_categories[category_index]))
            predictions.append([value, all_categories[category_index]])

predict('Dovesky')
predict('Jackson')
predict('Satoshi')

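predict() has two parts: a forward pass like the one in training (without gradient computation), and a call to tensor.topk() that finds the indices of the largest values. For each name it prints the three most likely languages together with their output values.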
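Because the network ends in LogSoftmax, the values printed by predict() are log-probabilities, which is why they are negative. A small sketch of how to turn them back into probabilities follows; the three (value, language) pairs are made up for illustration and are not the output of the trained model.

```python
import math

# Hypothetical top-3 output of predict(); the numbers are made up for illustration.
predictions = [(-0.47, 'Russian'), (-1.30, 'Czech'), (-2.95, 'English')]

as_probabilities = [(math.exp(value), language) for value, language in predictions]
for prob, language in as_probabilities:
    print('%.1f%% %s' % (prob * 100, language))
```

The exponentials of all 18 outputs sum to 1, so the top three together can never exceed 100%.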

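Reference: Classifying Names with a Character-Level RNN, PyTorch official tutorial (Chinese edition)

II. Generating Names with a Character-Level RNN

Goal: take the name of a language as the input category; the final output is a name generated for that category.

Concrete task: given the name of a language, output names whose initial letters are the first three letters of that language name.

1. Preparing the data

This is similar to Task 1:

step1 Define findFiles(): find the matching files in the given directory

step2 Define unicodeToAscii(): convert Unicode characters to plain ASCII

step3 Define readLines(): read the names, one per line, from the selected file

step4 Build the dictionary: collect the names of all languages

step5 Encode the characters: convert characters or whole words to tensors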

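def targetTensor(line):  # indices of the letters to predict: the second letter through the last one, followed by EOS
    letter_indexes = [all_letters.find(line[li])for li in range(1,len(line))]
    letter_indexes.append(n_letters - 1)  # index of EOS (assumes that in this task n_letters = len(all_letters) + 1, so the last index is reserved for EOS)
    return torch.LongTensor(letter_indexes).to(device)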

Compared with Task 1, Task 2 adds one more function, targetTensor(), which builds the indices of the next letters to be predicted.
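The alignment between input letters and targets is easiest to see on a short name. The sketch below is a plain-Python illustration under one stated assumption: n_letters is taken to be len(all_letters) + 1, so that the last index is free to serve as the end-of-sequence (EOS) marker. The names target_indexes and EOS_INDEX are chosen here and do not appear in the code above.

```python
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters) + 1   # assumption: one extra slot reserved for EOS
EOS_INDEX = n_letters - 1

def target_indexes(line):
    """Index of the letter that should follow each input position, ending with EOS."""
    indexes = [all_letters.find(line[i]) for i in range(1, len(line))]
    indexes.append(EOS_INDEX)
    return indexes

line = 'Kim'
inputs = list(line)
targets = target_indexes(line)
for ch, t in zip(inputs, targets):
    label = 'EOS' if t == EOS_INDEX else all_letters[t]
    print(ch, 'is followed by', label)
```

Each input letter is paired with the letter that follows it, and the last letter is paired with EOS, which is what lets the trained network decide when a generated name should stop.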