NLP From Scratch: Classifying Names with a Character-Level RNN

白酒永远的神

First published 2024-04-07 08:52:48; last modified 2024-06-01 22:11:05

Tags: pytorch, natural language processing

Author: Sean Robertson

We will build and train a basic character-level Recurrent Neural Network (RNN) to classify words. This tutorial, along with two other Natural Language Processing (NLP) “from scratch” tutorials, “NLP From Scratch: Generating Names with a Character-Level RNN” and “NLP From Scratch: Translation with a Sequence to Sequence Network and Attention”, shows how to preprocess data for NLP modeling. In particular, these tutorials do not use many of the convenience functions of torchtext, so you can see how preprocessing for NLP modeling works at a low level.

A character-level RNN reads words as a series of characters - outputting a prediction and “hidden state” at each step, feeding its previous hidden state into each next step. We take the final prediction to be the output, i.e. which class the word belongs to.

Specifically, we’ll train on a few thousand surnames from 18 languages of origin, and predict which language a name is from based on the spelling:

$ python predict.py Hinton
(-0.47) Scottish
(-1.52) English
(-3.57) Irish

$ python predict.py Schmidhuber
(-0.19) German
(-2.48) Czech
(-2.68) Dutch

Recommended Preparation

Before starting this tutorial it is recommended that you have installed PyTorch, and have a basic understanding of Python programming language and Tensors:

https://pytorch.org/ For installation instructions
Deep Learning with PyTorch: A 60 Minute Blitz to get started with PyTorch in general and learn the basics of Tensors
Learning PyTorch with Examples for a wide and deep overview
PyTorch for Former Torch Users if you are former Lua Torch user

It would also be useful to know about RNNs and how they work:

The Unreasonable Effectiveness of Recurrent Neural Networks shows a bunch of real life examples
Understanding LSTM Networks is about LSTMs specifically but also informative about RNNs in general

Preparing the Data

NOTE

Download the data from here and extract it to the current directory.

Included in the data/names directory are 18 text files named as [Language].txt. Each file contains a bunch of names, one name per line, mostly romanized (but we still need to convert from Unicode to ASCII).

We’ll end up with a dictionary of lists of names per language, {language: [names ...]}. The generic variables “category” and “line” (for language and name in our case) are used for later extensibility.
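
As a minimal sketch of the target structure, the snippet below builds the same kind of {category: [lines ...]} dictionary from two in-memory strings instead of files. The `raw_files` mapping, the sample names, and the `toy_` variable names are made up for illustration and are not part of the dataset or the tutorial code.

```python
# Hypothetical stand-in for the contents of two [Language].txt files
raw_files = {
    "Italian": "Abate\nAbatangelo\n",
    "German": "Schmidt\nBauer\n",
}

toy_category_lines = {}   # {category: [line, ...]}
toy_all_categories = []   # category names, in insertion order

for category, text in raw_files.items():
    toy_all_categories.append(category)
    # One name per line, so strip the trailing newline and split
    toy_category_lines[category] = text.strip().split("\n")

print(toy_all_categories)
print(toy_category_lines["German"])
```

Keeping the keys generic ("category", "line") rather than "language" and "name" means the same structure works for any line-to-category dataset.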

from io import open
import glob
import os

def findFiles(path): return glob.glob(path)

print(findFiles('data/names/*.txt'))

import unicodedata
import string

# "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .,;'"
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

# Turn a Unicode string into plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('Ślusàrski'))

# Build the category_lines dictionary, a list of names per language
category_lines = {}
all_categories = []

# Read a file and split into lines
def readLines(filename):
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)

['data/names/Arabic.txt', 'data/names/Chinese.txt', 'data/names/Czech.txt', 'data/names/Dutch.txt', 'data/names/English.txt', 'data/names/French.txt', 'data/names/German.txt', 'data/names/Greek.txt', 'data/names/Irish.txt', 'data/names/Italian.txt', 'data/names/Japanese.txt', 'data/names/Korean.txt', 'data/names/Polish.txt', 'data/names/Portuguese.txt', 'data/names/Russian.txt', 'data/names/Scottish.txt', 'data/names/Spanish.txt', 'data/names/Vietnamese.txt']
Slusarski
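
To see why the NFD step matters, here is a small sketch of what normalization does to the example name. It uses only the standard `unicodedata` module; the variable names are illustrative, and the accented characters are written as escapes so the input is guaranteed to be in precomposed form.

```python
import unicodedata

s = "\u015alus\u00e0rski"   # 'Ślusàrski' with precomposed accented letters
decomposed = unicodedata.normalize("NFD", s)

# NFD splits each accented letter into a base letter plus a combining mark,
# so the decomposed string is longer than the original
print(len(s), len(decomposed))

# Combining marks have Unicode category 'Mn' (Mark, nonspacing)
marks = [unicodedata.name(c) for c in decomposed if unicodedata.category(c) == "Mn"]
print(marks)

# Dropping the marks leaves only the base letters
stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
print(stripped)
```

The extra `c in all_letters` check in unicodeToAscii then discards any remaining character that has no ASCII base letter.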

Now we have category_lines, a dictionary mapping each category (language) to a list of lines (names). We also kept track of all_categories (just a list of languages) and n_categories for later reference.

print(category_lines['Italian'][:5])

['Abandonato', 'Abatangelo', 'Abatantuono', 'Abate', 'Abategiovanni']

Turning Names into Tensors

Now that we have all the names organized, we need to turn them into Tensors to make any use of them.

To represent a single letter, we use a “one-hot vector” of size <1 x n_letters>. A one-hot vector is filled with 0s except for a 1 at the index of the current letter, e.g. "b" = <0 1 0 0 0 ...>.

To make a word we join a bunch of those into a 3D tensor of size <line_length x 1 x n_letters>. The extra dimension of size 1 is the batch dimension: PyTorch layers expect batched input, and here the batch size is 1.

import torch

# Find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    tensor[0][letterToIndex(letter)] = 1
    return tensor

# Turn a line into a <line_length x 1 x n_letters> Tensor,
# i.e. an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

print(letterToTensor('J')) # 1 x 57

print(lineToTensor('Jones').size()) # 5 x 1 x 57

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0.]])
torch.Size([5, 1, 57])
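
The per-letter loop can also be written without explicit indexing. The following is a sketch of one possible alternative using `torch.nn.functional.one_hot`; the function name `line_to_tensor_vectorized` is hypothetical, and the sketch assumes every character of the input is present in all_letters (which holds after unicodeToAscii).

```python
import string
import torch
import torch.nn.functional as F

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

def line_to_tensor_vectorized(line):
    # Look up every letter's index; assumes each character is in all_letters
    indices = torch.tensor([all_letters.find(ch) for ch in line])
    # one_hot gives <line_length x n_letters>; unsqueeze adds the batch dimension
    return F.one_hot(indices, num_classes=n_letters).float().unsqueeze(1)

t = line_to_tensor_vectorized("Jones")
print(t.size())
print(t[0, 0].argmax().item())   # position of the 1 in the first letter's vector
```

The result has the same <line_length x 1 x n_letters> layout as lineToTensor, so it could be passed to the network in the same way.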

Creating the Network

Before autograd, creating a recurrent neural network in Torch involved cloning the parameters of a layer over several timesteps. The layers held hidden state and gradients, which are now entirely handled by the graph itself. This means you can implement an RNN in a very “pure” way, as regular feed-forward layers.

This RNN module (mostly copied from the PyTorch for Torch users tutorial) is just 2 linear layers which operate on an input and hidden state, with a LogSoftmax layer after the output.
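
Written as equations, one step of this module can be summarized as follows. The symbols are notation introduced here for illustration: $x_t$ is the one-hot input for the current letter, $h_{t-1}$ the previous hidden state, $[\,\cdot\,;\,\cdot\,]$ concatenation, and the weights and biases are named after the i2h and h2o layers in the code.

```latex
\begin{aligned}
h_t &= W_{i2h}\,[x_t ; h_{t-1}] + b_{i2h} \\
o_t &= \log \operatorname{softmax}\!\left(W_{h2o}\, h_t + b_{h2o}\right)
\end{aligned}
```

Note that, as written in this version of the module, no nonlinearity is applied to the hidden state between steps.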

import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size

        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.h2o(hidden)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)

To run a step of this network we need to pass an input (in our case, the Tensor for the current letter) and a previous hidden state (which we initialize as zeros at first). We’ll get back the output (the log-probability of each language) and a next hidden state (which we keep for the next step).

input = letterToTensor('A')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input, hidden)

For the sake of efficiency we don’t want to be creating a new Tensor for every step, so we will use lineToTensor instead of letterToTensor and use slices. This could be further optimized by precomputing batches of Tensors.

input = lineToTensor('Albert')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input[0], hidden)
print(output)

tensor([[-2.9083, -2.9270, -2.9167, -2.9590, -2.9108, -2.8332, -2.8906, -2.8325,
         -2.8521, -2.9279, -2.8452, -2.8754, -2.8565, -2.9733, -2.9201, -2.8233,
         -2.9298, -2.8624]], grad_fn=<LogSoftmaxBackward0>)

As you can see, the output is a <1 x n_categories> Tensor, where every item is the log-likelihood of that category (higher is more likely).
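
Because the last layer is LogSoftmax, the values are negative log-probabilities rather than probabilities. A short sketch with made-up scores shows how to read them; the `logits` values are hypothetical and not produced by the network above.

```python
import torch

# Hypothetical raw scores (logits) for three categories, batch size 1
logits = torch.tensor([[2.0, 0.5, -1.0]])
log_probs = torch.log_softmax(logits, dim=1)

# Log-probabilities are always <= 0; exponentiating recovers probabilities
probs = log_probs.exp()
print(log_probs)
print(probs, probs.sum().item())
```

The ranking of categories is the same in both views, which is why the tutorial can work directly with the log values.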

Training

Preparing for Training

Before going into training we should make a few helper functions. The first is to interpret the output of the network, which we know to be a log-likelihood for each category. We can use Tensor.topk to get the index of the greatest value:

def categoryFromOutput(output):
    top_n, top_i = output.topk(1)
    category_i = top_i[0].item()
    return all_categories[category_i], category_i

print(categoryFromOutput(output))

('Scottish', 15)
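
For readers unfamiliar with Tensor.topk, here is a small sketch of what it returns. The `output` values below are made up for illustration and stand in for a network output over four categories.

```python
import torch

# Hypothetical log-probabilities for four categories (batch size 1)
output = torch.tensor([[-2.3, -0.4, -1.9, -3.0]])

# topk(1) returns the largest value and its index along the last dimension
top_values, top_indices = output.topk(1)   # both have shape <1 x 1>
print(top_values, top_indices)

# For k=1 the index matches argmax along the category dimension
print(top_indices[0].item() == output.argmax(dim=1).item())
```

Using topk rather than argmax makes it easy to ask for the best few guesses later by raising k.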

We will also want a quick way to get a training example (a name and its language):

import random

def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randomTrainingExample():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    line_tensor = lineToTensor(line)
    return category, line, category_tensor, line_tensor

for i in range(10):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    print('category =', category, '/ line =', line)

category = Chinese / line = Hou
category = Scottish / line = Mckay
category = Arabic / line = Cham
category = Russian / line = V'Yurkov
category = Irish / line = O'Keeffe
category = French / line = Belrose
category = Spanish / line = Silva
category = Japanese / line = Fuchida
category = Greek / line = Tsahalis
category = Korean / line = Chang

Training the Network

Now all it takes to train this network is to show it a bunch of examples, have it make guesses, and tell it if it’s wrong.

For the loss function nn.NLLLoss is appropriate, since the last layer of the RNN is nn.LogSoftmax.

criterion = nn.NLLLoss()
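
The pairing of LogSoftmax with NLLLoss can be checked on a toy input. The sketch below uses random scores and an arbitrary target index, both hypothetical, and compares the result with nn.CrossEntropyLoss applied to the raw scores.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(1, 18)    # hypothetical raw scores for 18 categories
target = torch.tensor([3])     # hypothetical correct category index

# LogSoftmax followed by NLLLoss ...
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)
# ... gives the same value as CrossEntropyLoss on the raw scores
ce = nn.CrossEntropyLoss()(logits, target)

# NLLLoss simply picks out -log p(target) from the log-probabilities
manual = -torch.log_softmax(logits, dim=1)[0, 3]
print(nll.item(), ce.item(), manual.item())
```

So a network that ended in a plain linear layer could be trained with nn.CrossEntropyLoss instead; the two setups compute the same loss.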

Each loop of training will:

Create input and target tensors
Create a zeroed initial hidden state
Read each letter in and
- Keep hidden state for next letter
Compare final output to target
Back-propagate
Return the output and loss

learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn

def train(category_tensor, line_tensor):
    hidden = rnn.initHidden()

    rnn.zero_grad()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    loss = criterion(output, category_tensor)
    loss.backward()

    # Update each parameter in place: subtract its gradient scaled by the learning rate
    with torch.no_grad():
        for p in rnn.parameters():
            p.add_(p.grad, alpha=-learning_rate)

    return output, loss.item()
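
The manual parameter update in train() is plain stochastic gradient descent. As a sketch under that assumption, the snippet below compares the manual update with torch.optim.SGD (no momentum, no weight decay) on a tiny stand-in model; the model, data, and helper `loss_of` are hypothetical and only serve the comparison.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
learning_rate = 0.005

# Tiny stand-in model and data, only to compare the two update styles
model_a = nn.Linear(4, 3)
model_b = copy.deepcopy(model_a)
x = torch.randn(1, 4)
target = torch.tensor([2])
criterion = nn.NLLLoss()

def loss_of(model):
    return criterion(torch.log_softmax(model(x), dim=1), target)

# Manual update, in the style of train()
model_a.zero_grad()
loss_of(model_a).backward()
with torch.no_grad():
    for p in model_a.parameters():
        p.add_(p.grad, alpha=-learning_rate)

# The same step expressed with an optimizer
optimizer = torch.optim.SGD(model_b.parameters(), lr=learning_rate)
optimizer.zero_grad()
loss_of(model_b).backward()
optimizer.step()

same = all(torch.allclose(pa, pb)
           for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print(same)
```

Switching to an optimizer object makes it straightforward to try other update rules later without touching the training loop.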

Now we just have to run that with a bunch of examples. Since the train function returns both the output and loss we can print its guesses and also keep track of loss for plotting. Since there are 1000s of examples we print only every print_every examples, and take an average of the loss.

import time
import math

n_iters = 100000
print_every = 5000
plot_every = 1000



# Keep track of losses for plotting
current_loss = 0
all_losses = []

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    # Print ``iter`` number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add current loss average to the list of losses
    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0

5000 5% (0m 4s) 2.6379 Horigome / Japanese ✓
10000 10% (0m 8s) 2.0172 Miazga / Japanese ✗ (Polish)
15000 15% (0m 13s) 0.2680 Yukhvidov / Russian ✓
20000 20% (0m 17s) 1.8239 Mclaughlin / Irish ✗ (Scottish)
25000 25% (0m 22s) 0.6978 Banh / Vietnamese ✓
30000 30% (0m 26s) 1.7433 Machado / Japanese ✗ (Portuguese)
35000 35% (0m 31s) 0.0340 Fotopoulos / Greek ✓
40000 40% (0m 35s) 1.4637 Quirke / Irish ✓
45000 45% (0m 40s) 1.9018 Reier / French ✗ (German)
50000 50% (0m 44s) 0.9174 Hou / Chinese ✓
55000 55% (0m 48s) 1.0506 Duan / Vietnamese ✗ (Chinese)
60000 60% (0m 53s) 0.9617 Giang / Vietnamese ✓
65000 65% (0m 57s) 2.4557 Cober / German ✗ (Czech)
70000 70% (1m 2s) 0.8502 Mateus / Portuguese ✓
75000 75% (1m 6s) 0.2750 Hamilton / Scottish ✓
80000 80% (1m 11s) 0.7515 Maessen / Dutch ✓
85000 85% (1m 15s) 0.0912 Gan / Chinese ✓
90000 90% (1m 20s) 0.1190 Bellomi / Italian ✓
95000 95% (1m 24s) 0.0137 Vozgov / Russian ✓
100000 100% (1m 28s) 0.7810 Tong / Vietnamese ✓

Plotting the Results

Plotting the historical loss from all_losses shows the network learning:

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure()
plt.plot(all_losses)

[Figure: plot of all_losses, the average training loss per plot_every iterations]

[<matplotlib.lines.Line2D object at 0x7fb380e5b5b0>]

Evaluating the Results

To see how well the network performs on different categories, we will create a confusion matrix, indicating for every actual language (rows) which language the network guesses (columns). To calculate the confusion matrix a bunch of samples are run through the network with evaluate(), which is the same as train() minus the backprop.

# Keep track of correct guesses in a confusion matrix
confusion = torch.zeros(n_categories, n_categories)
n_confusion = 10000

# Just return an output given a line
def evaluate(line_tensor):
    hidden = rnn.initHidden()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    return output

# Go through a bunch of examples and record which are correctly guessed
for i in range(n_confusion):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output = evaluate(line_tensor)
    guess, guess_i = categoryFromOutput(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1

# Normalize by dividing every row by its sum
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()

# Set up plot
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)

# Set up axes: fix one tick per category first, then label the ticks.
# Calling set_ticklabels() without fixed ticks triggers a matplotlib UserWarning.
ax.set_xticks(range(n_categories))
ax.set_xticklabels(all_categories, rotation=90)
ax.set_yticks(range(n_categories))
ax.set_yticklabels(all_categories)

# sphinx_gallery_thumbnail_number = 2
plt.show()

[Figure: confusion matrix, actual language (rows) vs. guessed language (columns)]

You can pick out bright spots off the main axis that show which languages it guesses incorrectly, e.g. Chinese for Korean, and Spanish for Italian. It seems to do very well with Greek, and very poorly with English (perhaps because of overlap with other languages).
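
To put numbers on what the plot shows, the diagonal of the row-normalized matrix gives the fraction of each actual category that was guessed correctly. The sketch below uses a made-up 3 x 3 count matrix, not results from the trained network, and also shows the row normalization written without a loop.

```python
import torch

# Hypothetical raw counts for three categories: rows = actual, columns = guessed
counts = torch.tensor([[8., 1., 1.],
                       [2., 6., 2.],
                       [0., 5., 5.]])

# Row-normalize in one step instead of looping over rows
normalized = counts / counts.sum(dim=1, keepdim=True)

# The diagonal holds the fraction of each actual category guessed correctly
per_category_accuracy = normalized.diag()
print(per_category_accuracy)

# Overall accuracy weights every sample equally
overall = (counts.diag().sum() / counts.sum()).item()
print(overall)
```

A row whose sum is zero (a category that was never sampled) would produce NaN values in either version of the normalization, so it is worth checking that every category appears at least once.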

Running on User Input

def predict(input_line, n_predictions=3):
    print('\n> %s' % input_line)
    with torch.no_grad():
        output = evaluate(lineToTensor(input_line))

        # Get the top N categories
        topv, topi = output.topk(n_predictions, 1, True)
        predictions = []

        for i in range(n_predictions):
            value = topv[0][i].item()
            category_index = topi[0][i].item()
            print('(%.2f) %s' % (value, all_categories[category_index]))
            predictions.append([value, all_categories[category_index]])

    # Return the collected [log-probability, category] pairs to the caller
    return predictions

predict('Dovesky')
predict('Jackson')
predict('Satoshi')

> Dovesky
(-0.57) Czech
(-0.97) Russian
(-3.43) English

> Jackson
(-1.02) Scottish
(-1.49) Russian
(-1.96) English

> Satoshi
(-0.42) Japanese
(-1.70) Polish
(-2.74) Italian

The final versions of the scripts in the Practical PyTorch repo split the above code into a few files:

data.py (loads files)
model.py (defines the RNN)
train.py (runs training)
predict.py (runs predict() with command line arguments)
server.py (serve prediction as a JSON API with bottle.py)

Run train.py to train and save the network.
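
How train.py saves the network is not shown in this text. One common approach, sketched here as an assumption rather than as the repo's actual code, is to save the model's state_dict and load it back into a freshly constructed model. The file name and the nn.Linear stand-in model are hypothetical.

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Linear(57, 18)   # stand-in for the trained RNN

# Hypothetical file name; the repo's scripts may use a different one
path = os.path.join(tempfile.mkdtemp(), "char-rnn-classification.pt")
torch.save(model.state_dict(), path)

# Later, e.g. in a prediction script: rebuild the model, then load the weights
restored = nn.Linear(57, 18)
restored.load_state_dict(torch.load(path))
restored.eval()

print(torch.equal(model.weight, restored.weight))
```

Saving only the state_dict keeps the file independent of the class definition's location, at the cost of having to construct the model with the same sizes before loading.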

Run predict.py with a name to view predictions:

$ python predict.py Hazaki
(-0.42) Japanese
(-1.39) Polish
(-3.51) Czech

Run server.py and visit http://localhost:5533/Yourname to get JSON output of predictions.

Exercises

Try with a different dataset of line -> category, for example:
- Any word -> language
- First name -> gender
- Character name -> writer
- Page title -> blog or subreddit
Get better results with a bigger and/or better shaped network
- Add more linear layers
- Try the nn.LSTM and nn.GRU layers
- Combine multiple of these RNNs as a higher level network
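
As a starting point for the nn.GRU exercise, here is a minimal sketch of a classifier that consumes a whole <line_length x 1 x n_letters> tensor at once. The class name `GRUClassifier` is hypothetical, the sizes 57, 128 and 18 mirror n_letters, n_hidden and n_categories from this tutorial, and the model is untrained.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # nn.GRU expects input of shape <seq_len x batch x input_size> by default
        self.gru = nn.GRU(input_size, hidden_size)
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, line_tensor):
        # hidden is the final hidden state, shape <1 x batch x hidden_size>
        _, hidden = self.gru(line_tensor)
        return self.softmax(self.h2o(hidden[0]))

model = GRUClassifier(57, 128, 18)
line_tensor = torch.zeros(5, 1, 57)   # same layout as lineToTensor('Jones')
output = model(line_tensor)
print(output.size())
```

Because the GRU layer loops over the letters internally, the training function would call the model once per name instead of once per letter.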