Reading notes on the English edition of Deep Learning with Python, Second Edition: Chapter 11, Deep Learning for Text (NLP, Transformer, Seq2Seq)

A note up front: these are my reading notes on Deep Learning with Python, Second Edition, intended for personal study only. As of March 18, 2022, nearly five months after the English edition was released, no Chinese translation has been published. The takeaway: learn English well and go to the primary sources!

Chapter 11: Deep learning for text

The main contents of this chapter:

This chapter covers

- Preprocessing text data for machine learning applications
- Bag-of-words approaches and sequence-modeling approaches for text processing
- The Transformer architecture
- Sequence-to-sequence learning

11.1 Natural language processing: The bird’s eye view

Programming languages and natural languages are different.

In computer science, we refer to human languages, like English or Mandarin, as
“natural” languages, to distinguish them from languages that were designed for
machines, like Assembly, LISP, or XML. Every machine language was designed: its
starting point was a human engineer writing down a set of formal rules to describe
what statements you could make in that language and what they meant. Rules came
first, and people only started using the language once the rule set was complete.
With human language, it’s the reverse: usage comes first, rules arise later. Natural
language was shaped by an evolution process, much like biological organisms—
that’s what makes it “natural.” Its “rules,” like the grammar of English, were formalized
after the fact and are often ignored or broken by its users. As a result, while machine-readable language is highly structured and rigorous, using precise syntactic
rules to weave together exactly defined concepts from a fixed vocabulary, natural language
is messy—ambiguous, chaotic, sprawling, and constantly in flux.

Handcrafting complex sets of rules to perform machine translation doesn't work.

Creating algorithms that can make sense of natural language is a big deal: language,
and in particular text, underpins most of our communications and our cultural
production. The internet is mostly text. Language is how we store almost all of
our knowledge. Our very thoughts are largely built upon language. However, the ability
to understand natural language has long eluded machines. Some people once
naively thought that you could simply write down the “rule set of English,” much like
one can write down the rule set of LISP. Early attempts to build natural language processing
(NLP) systems were thus made through the lens of “applied linguistics.” Engineers
and linguists would handcraft complex sets of rules to perform basic machine
translation or create simple chatbots—like the famous ELIZA program from the
1960s, which used pattern matching to sustain very basic conversation. But language is
a rebellious thing: it’s not easily pliable to formalization. After several decades of
effort, the capabilities of these systems remained disappointing.

Statistical NLP methods

Handcrafted rules held out as the dominant approach well into the 1990s. But
starting in the late 1980s, faster computers and greater data availability started making
a better alternative viable. When you find yourself building systems that are big piles
of ad hoc rules, as a clever engineer, you’re likely to start asking: “Could I use a corpus
of data to automate the process of finding these rules? Could I search for the rules
within some kind of rule space, instead of having to come up with them myself?” And
just like that, you’ve graduated to doing machine learning. And so, in the late 1980s,
we started seeing machine learning approaches to natural language processing. The
earliest ones were based on decision trees—the intent was literally to automate the
development of the kind of if/then/else rules of previous systems. Then statistical
approaches started gaining speed, starting with logistic regression. Over time, learned
parametric models fully took over, and linguistics came to be seen as more of a hindrance
than a useful tool. Frederick Jelinek, an early speech recognition researcher,
joked in the 1990s: “Every time I fire a linguist, the performance of the speech recognizer
goes up.”

Here are some of the tasks of modern NLP:

That’s what modern NLP is about: using machine learning and large datasets to
give computers the ability not to understand language, which is a more lofty goal, but
to ingest a piece of language as input and return something useful, like predicting the
following:

- “What’s the topic of this text?” (text classification)
- “Does this text contain abuse?” (content filtering)
- “Does this text sound positive or negative?” (sentiment analysis)
- “What should be the next word in this incomplete sentence?” (language modeling)
- “How would you say this in German?” (translation)
- “How would you summarize this article in one paragraph?” (summarization)
- etc.

A simple characterization of NLP: it is pattern recognition applied to text.

Of course, keep in mind throughout this chapter that the text-processing models you
will train won’t possess a human-like understanding of language; rather, they simply
look for statistical regularities in their input data, which turns out to be sufficient to
perform well on many simple tasks. In much the same way that computer vision is pattern
recognition applied to pixels, NLP is pattern recognition applied to words, sentences,
and paragraphs.

A quick overview of how NLP models have developed in recent years, from recurrent neural networks to the Transformer, which is what's most widely used today.

The toolset of NLP—decision trees, logistic regression—only saw slow evolution
from the 1990s to the early 2010s. Most of the research focus was on feature engineering.
When I won my first NLP competition on Kaggle in 2013, my model was, you
guessed it, based on decision trees and logistic regression. However, around 2014–
2015, things started changing at last. Multiple researchers began to investigate the
language-understanding capabilities of recurrent neural networks, in particular LSTM—
a sequence-processing algorithm from the late 1990s that had stayed under the radar
until then.

In early 2015, Keras made available the first open source, easy-to-use implementation
of LSTM, just at the start of a massive wave of renewed interest in recurrent neural
networks—until then, there had only been “research code” that couldn’t be readily
reused. Then from 2015 to 2017, recurrent neural networks dominated the booming
NLP scene. Bidirectional LSTM models, in particular, set the state of the art on many important tasks, from summarization to question-answering to machine translation.

Finally, around 2017–2018, a new architecture rose to replace RNNs: the Transformer,
which you will learn about in the second half of this chapter. Transformers
unlocked considerable progress across the field in a short period of time, and today
most NLP systems are based on them.

Let’s dive into the details. This chapter will take you from the very basics to doing
machine translation with a Transformer.

11.2 Preparing text data

Some important concepts introduced below: standardization, tokens, and indexing all tokens.

Deep learning models, being differentiable functions, can only process numeric tensors:
they can’t take raw text as input. Vectorizing text is the process of transforming
text into numeric tensors. Text vectorization processes come in many shapes and
forms, but they all follow the same template (see figure 11.1):

- First, you standardize the text to make it easier to process, such as by converting it to lowercase or removing punctuation.
- You split the text into units (called tokens), such as characters, words, or groups of words. This is called tokenization.
- You convert each such token into a numerical vector. This will usually involve first indexing all tokens present in the data.

Let’s review each of these steps.

The figure below shows the whole process of preparing text:
[Figure 11.1: standardization, tokenization, and indexing of text]

11.2.1 Text standardization

Consider these two sentences:

- “sunset came. i was staring at the Mexico sky. Isnt nature splendid??”
- “Sunset came; I stared at the México sky. Isn’t nature splendid?”

They’re very similar—in fact, they’re almost identical. Yet, if you were to convert them
to byte strings, they would end up with very different representations, because “i” and
“I” are two different characters, “Mexico” and “México” are two different words, “isnt”
isn’t “isn’t,” and so on. A machine learning model doesn’t know a priori that “i” and
“I” are the same letter, that “é” is an “e” with an accent, or that “staring” and “stared”
are two forms of the same verb.

Text standardization is a basic form of feature engineering that aims to erase
encoding differences that you don’t want your model to have to deal with. It’s not
exclusive to machine learning, either—you’d have to do the same thing if you were
building a search engine.

A simple treatment:

One of the simplest and most widespread standardization schemes is “convert to
lowercase and remove punctuation characters.” Our two sentences would become

- “sunset came i was staring at the mexico sky isnt nature splendid”
- “sunset came i stared at the méxico sky isnt nature splendid”

There is also the technique of stemming: encoding the different forms of a word as a single shared representation.

Much closer already. Another common transformation is to convert special characters
to a standard form, such as replacing “é” with “e,” “æ” with “ae,” and so on. Our token
“méxico” would then become “mexico”.

Lastly, a much more advanced standardization pattern that is more rarely used in a
machine learning context is stemming: converting variations of a term (such as different
conjugated forms of a verb) into a single shared representation, like turning
“caught” and “been catching” into “[catch]” or “cats” into “[cat]”. With stemming,
“was staring” and “stared” would become something like “[stare]”, and our two similar
sentences would finally end up with an identical encoding:
“sunset came i [stare] at the mexico sky isnt nature splendid”

The benefits of standardization

With these standardization techniques, your model will require less training data and
will generalize better—it won’t need abundant examples of both “Sunset” and “sunset”
to learn that they mean the same thing, and it will be able to make sense of “México”
even if it has only seen “mexico” in its training set. Of course, standardization may
also erase some amount of information, so always keep the context in mind: for
instance, if you’re writing a model that extracts questions from interview articles, it
should definitely treat “?” as a separate token instead of dropping it, because it’s a useful
signal for this specific task.
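
As a quick, hand-rolled illustration of these ideas (my own sketch, not the book's code; it only assumes the standard-library string and unicodedata modules), here is a standardizer that lowercases, strips accents such as "é", and removes punctuation:

import string
import unicodedata

def simple_standardize(text):
    text = text.lower()  # "Sunset" and "sunset" become identical
    text = unicodedata.normalize("NFKD", text)  # decompose accented characters, e.g. "é" -> "e" + combining accent
    text = "".join(c for c in text if not unicodedata.combining(c))  # drop the combining accent marks
    return "".join(c for c in text if c not in string.punctuation)  # drop punctuation characters

print(simple_standardize("Sunset came; I stared at the Mexico sky. Isn't nature splendid?"))
# sunset came i stared at the mexico sky isnt nature splendid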

11.2.2 Text splitting (tokenization)

Once your text is standardized, you need to break it up into units to be vectorized
(tokens), a step called tokenization. You could do this in three different ways:

- Word-level tokenization—Where tokens are space-separated (or punctuation-separated) substrings. A variant of this is to further split words into subwords when applicable—for instance, treating “staring” as “star+ing” or “called” as “call+ed.”
- N-gram tokenization—Where tokens are groups of N consecutive words. For instance, “the cat” or “he was” would be 2-gram tokens (also called bigrams).
- Character-level tokenization—Where each character is its own token. In practice, this scheme is rarely used, and you only really see it in specialized contexts, like text generation or speech recognition.

Two kinds of text-processing models:

  • sequence models: care about words order, using word-level tokenization
  • bag-of-words models: discard order of words, such as N-grams models

In general, you’ll always use either word-level or N-gram tokenization. There are two
kinds of text-processing models: those that care about word order, called sequence models,
and those that treat input words as a set, discarding their original order, called
bag-of-words models. If you’re building a sequence model, you’ll use word-level tokenization,
and if you’re building a bag-of-words model, you’ll use N-gram tokenization.
N-grams are a way to artificially inject a small amount of local word order information
into the model. Throughout this chapter, you’ll learn more about each type of model
and when to use them.

Below is an explanation of N-grams and bag-of-words.

Understanding N-grams and bag-of-words

Word N-grams are groups of N (or fewer) consecutive words that you can extract from
a sentence. The same concept may also be applied to characters instead of words.
Here’s a simple example. Consider the sentence “the cat sat on the mat.” It may be
decomposed into the following set of 2-grams:

{“the”, “the cat”, “cat”, “cat sat”, “sat”,
“sat on”, “on”, “on the”, “the mat”, “mat”}

It may also be decomposed into the following set of 3-grams:

{“the”, “the cat”, “cat”, “cat sat”, “the cat sat”,
“sat”, “sat on”, “on”, “cat sat on”, “on the”,
“sat on the”, “the mat”, “mat”, “on the mat”}

Such a set is called a bag-of-2-grams or bag-of-3-grams, respectively. The term “bag”
here refers to the fact that you’re dealing with a set of tokens rather than a list or
sequence: the tokens have no specific order. This family of tokenization methods is
called bag-of-words (or bag-of-N-grams).
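
To make the construction concrete, here is a small sketch of my own (the function name bag_of_ngrams is made up for illustration) that builds such a bag from a whitespace-tokenized sentence:

def bag_of_ngrams(text, n):
    # Return the set of all 1-gram to n-gram tokens of a whitespace-split sentence.
    tokens = text.split()
    grams = set()
    for size in range(1, n + 1):  # unigrams, bigrams, ... up to n-grams
        for i in range(len(tokens) - size + 1):
            grams.add(" ".join(tokens[i:i + size]))  # join consecutive words into a single token
    return grams

print(bag_of_ngrams("the cat sat on the mat", 2))
# {'the', 'the cat', 'cat', 'cat sat', 'sat', 'sat on', 'on', 'on the', 'the mat', 'mat'}  (set order may vary)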

Because bag-of-words isn’t an order-preserving tokenization method (the tokens generated
are understood as a set, not a sequence, and the general structure of the sentences
is lost), it tends to be used in shallow language-processing models rather than
in deep learning models. Extracting N-grams is a form of feature engineering, and
deep learning sequence models do away with this manual approach, replacing it with
hierarchical feature learning. One-dimensional convnets, recurrent neural networks,
and Transformers are capable of learning representations for groups of words and
characters without being explicitly told about the existence of such groups, by looking
at continuous word or character sequences.

11.2.3 Vocabulary indexing

Once your text is split into tokens, you need to encode each token into a numerical
representation. You could potentially do this in a stateless way, such as by hashing each
token into a fixed binary vector, but in practice, the way you’d go about it is to build
an index of all terms found in the training data (the “vocabulary”), and assign a
unique integer to each entry in the vocabulary.

Rarely used words can be ignored.

Note that at this step it’s common to restrict the vocabulary to only the top 20,000 or
30,000 most common words found in the training data. Any text dataset tends to feature
an extremely large number of unique terms, most of which only show up once or
twice—indexing those rare terms would result in an excessively large feature space,
where most features would have almost no information content.

Remember to use an OOV index for the out-of-vocabulary words mentioned above (the rarely used ones).

Remember when you were training your first deep learning models on the IMDB
dataset in chapters 4 and 5? The data you were using from keras.datasets.imdb was
already preprocessed into sequences of integers, where each integer stood for a given
word. Back then, we used the setting num_words=10000, in order to restrict our vocabulary
to the top 10,000 most common words found in the training data.

Now, there’s an important detail here that we shouldn’t overlook: when we look
up a new token in our vocabulary index, it may not necessarily exist. Your training
data may not have contained any instance of the word “cherimoya” (or maybe you
excluded it from your index because it was too rare), so doing token_index =
vocabulary[“cherimoya”] may result in a KeyError. To handle this, you should use
an “out of vocabulary” index (abbreviated as OOV index)—a catch-all for any token
that wasn’t in the index. It’s usually index 1: you’re actually doing token_index =
vocabulary.get(token, 1). When decoding a sequence of integers back into words,
you’ll replace 1 with something like “[UNK]” (which you’d call an “OOV token”).

Index 1 represents OOV.

Index 0 represents "not a word" (the mask token).

Shorter sequences need to be padded to the length of the longest sequence.

“Why use 1 and not 0?” you may ask. That’s because 0 is already taken. There are
two special tokens that you will commonly use: the OOV token (index 1), and the
mask token (index 0). While the OOV token means “here was a word we did not recognize,”
the mask token tells us “ignore me, I’m not a word.” You’d use it in particular to
pad sequence data: because data batches need to be contiguous, all sequences in a
batch of sequence data must have the same length, so shorter sequences should be
padded to the length of the longest sequence. If you want to make a batch of data with
the sequences [5, 7, 124, 4, 89] and [8, 34, 21], it would have to look like this:

[[5, 7, 124, 4, 89]
[8, 34, 21, 0, 0]]

The batches of integer sequences for the IMDB dataset that you worked with in chapters
4 and 5 were padded with zeros in this way.
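
Here is a quick sketch of that padding step using the Keras utility pad_sequences (the utility isn't shown in the book excerpt above; it is just one convenient way to do the padding):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad at the end ("post") with the mask token 0, up to the length of the longest sequence.
batch = pad_sequences([[5, 7, 124, 4, 89], [8, 34, 21]], padding="post", value=0)
print(batch)
# [[  5   7 124   4  89]
#  [  8  34  21   0   0]]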

11.2.4 Using the TextVectorization layer

Every step I’ve introduced so far would be very easy to implement in pure Python.
Maybe you could write something like this:

In Python, self plays the role of this in Java: it refers to the current instance, through which you can access the class's methods and attributes.

import string

class Vectorizer:
    # Standardization: convert to lowercase and strip punctuation
    def standardize(self, text):
        text = text.lower()  # lowercase
        return "".join(char for char in text if char not in string.punctuation)  # drop punctuation characters

    # Tokenization: split into words
    def tokenize(self, text):
        text = self.standardize(text)
        return text.split()  # returns a list of words split on whitespace

    # Build the word-to-index hash table
    def make_vocabulary(self, dataset):
        self.vocabulary = {"": 0, "[UNK]": 1}
        for text in dataset:
            text = self.standardize(text)
            tokens = self.tokenize(text)
            for token in tokens:
                if token not in self.vocabulary:
                    self.vocabulary[token] = len(self.vocabulary)
        # self.vocabulary now looks like this:
        # {'': 0, '[UNK]': 1, 'i': 2, 'write': 3, 'erase': 4, 'rewrite': 5, 'again': 6, 'and': 7, 'then': 8, 'a': 9, 'poppy': 10, 'blooms': 11}

        self.inverse_vocabulary = dict(
            (v, k) for k, v in self.vocabulary.items())
        # self.inverse_vocabulary now looks like this:
        # {0: '', 1: '[UNK]', 2: 'i', 3: 'write', 4: 'erase', 5: 'rewrite', 6: 'again', 7: 'and', 8: 'then', 9: 'a', 10: 'poppy', 11: 'blooms'}

    def encode(self, text):
        text = self.standardize(text)
        tokens = self.tokenize(text)  # e.g. ['i', 'write', 'rewrite', 'and', 'still', 'rewrite', 'again']
        return [self.vocabulary.get(token, 1) for token in tokens]  # tokens missing from the vocabulary map to 1 (the OOV index)

    def decode(self, int_sequence):
        return " ".join(
            self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)

vectorizer = Vectorizer()
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
vectorizer.make_vocabulary(dataset)

# Encoding: from tokens to indices
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = vectorizer.encode(test_sentence)
print(encoded_sentence)  # [2, 3, 5, 7, 1, 5, 6]

# Decoding: from indices back to tokens
decoded_sentence = vectorizer.decode(encoded_sentence)
print(decoded_sentence)  # i write rewrite and [UNK] rewrite again

https://www.runoob.com/python/att-dictionary-get.html : an explanation of Python's dict.get() method

get(token, 1) means: if the token isn't found, return the supplied default 1; with plain get(token), a missing key returns the default None.
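
A tiny illustration of dict.get() with a default value, mirroring how the OOV index is used (the vocabulary below is made up):

vocabulary = {"": 0, "[UNK]": 1, "cat": 2}
print(vocabulary.get("cat", 1))        # 2: the token is in the vocabulary
print(vocabulary.get("cherimoya", 1))  # 1: a missing token falls back to the OOV index
print(vocabulary.get("cherimoya"))     # None: without a default, a missing key returns None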

That was the hand-rolled approach; next, let's use the Keras TextVectorization layer.

However, using something like this wouldn’t be very performant. In practice, you’ll
work with the Keras TextVectorization layer, which is fast and efficient and can be
dropped directly into a tf.data pipeline or a Keras model.

This is what the TextVectorization layer looks like:

from tensorflow.keras.layers import TextVectorization
text_vectorization = TextVectorization(
    output_mode="int", # 映射为整数
)


By default, the TextVectorization layer standardizes and tokenizes automatically.

By default, the TextVectorization layer will use the setting “convert to lowercase and
remove punctuation” for text standardization, and “split on whitespace” for tokenization.
But importantly, you can provide custom functions for standardization and tokenization,
which means the layer is flexible enough to handle any use case.

Note that such custom functions should operate on tf.string tensors, not regular Python
strings! For instance, the default layer behavior is equivalent to the following:


Custom standardization and tokenization functions:

import re
import string
import tensorflow as tf

def custom_standardization_fn(string_tensor):
    lowercase_string = tf.strings.lower(string_tensor)
    return tf.strings.regex_replace(
        lowercase_string, f"[{re.escape(string.punctuation)}]", "")  # replace punctuation characters with the empty string

def custom_split_fn(string_tensor):
    return tf.strings.split(string_tensor)

text_vectorization = TextVectorization(
    output_mode="int",
    standardize=custom_standardization_fn,
    split=custom_split_fn,
)

Use the adapt() method to index the vocabulary.

To index the vocabulary of a text corpus, just call the adapt() method of the layer
with a Dataset object that yields strings, or just with a list of Python strings:

dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
text_vectorization.adapt(dataset)

An explanation of adapt():

During adapt(), the layer will build a vocabulary of all string tokens seen in the dataset, sorted by occurrence count, with ties broken by sort order of the tokens (high to low).

You can retrieve the vocabulary with the get_vocabulary() method.

Note that you can retrieve the computed vocabulary via get_vocabulary()—this can
be useful if you need to convert text encoded as integer sequences back into words.
The first two entries in the vocabulary are the mask token (index 0) and the OOV
token (index 1). Entries in the vocabulary list are sorted by frequency, so with a real-world
dataset, very common words like “the” or “a” would come first.

Display the vocabulary:

text_vectorization.get_vocabulary()

output

['',
 '[UNK]',
 'erase',
 'write',
 'then',
 'rewrite',
 'poppy',
 'i',
 'blooms',
 'and',
 'again',
 'a']

Let's test it:

vocabulary = text_vectorization.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again hello a an great"
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)

output

tf.Tensor([ 7  3  5  9  1  5 10  1 11  1  1], shape=(11,), dtype=int64)

Decode the integer sequence back into words:

inverse_vocab = dict(enumerate(vocabulary))
decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(decoded_sentence)

output

i write rewrite and [UNK] rewrite again [UNK] a [UNK] [UNK]

TextVectorization runs on the CPU.

Using the TextVectorization layer in a tf.data pipeline or as part of a model
Importantly, because TextVectorization is mostly a dictionary lookup operation, it
can’t be executed on a GPU (or TPU)—only on a CPU. So if you’re training your model
on a GPU, your TextVectorization layer will run on the CPU before sending its output
to the GPU. This has important performance implications.

There are two ways to use the TextVectorization layer, sketched below:

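The original code screenshots are not reproduced here; the sketch below shows the general shape of both options, assuming text_vectorization is an already-adapted layer with output_mode="int", and where string_dataset is a hypothetical name for any tf.data dataset of raw strings. The Embedding/pooling/Dense head in option 2 is a placeholder for whatever model you actually want to train:

from tensorflow import keras
from tensorflow.keras import layers

# Option 1: apply the layer inside the tf.data pipeline (asynchronous preprocessing on the CPU).
int_sequence_dataset = string_dataset.map(
    text_vectorization,
    num_parallel_calls=4)

# Option 2: make the layer part of the model, so the model accepts raw strings directly.
text_input = keras.Input(shape=(), dtype="string")
vectorized = text_vectorization(text_input)  # runs synchronously, on the CPU, at each training step
x = layers.Embedding(input_dim=20000, output_dim=16)(vectorized)
x = layers.GlobalAveragePooling1D()(x)  # placeholder downstream layers, for illustration only
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(text_input, outputs)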

The two options differ significantly.

There’s an important difference between the two: if the vectorization step is part of
the model, it will happen synchronously with the rest of the model. This means that
at each training step, the rest of the model (placed on the GPU) will have to wait for
the output of the TextVectorization layer (placed on the CPU) to be ready in order
to get to work. Meanwhile, putting the layer in the tf.data pipeline enables you to do asynchronous preprocessing of your data on CPU: while the GPU runs the model
on one batch of vectorized data, the CPU stays busy by vectorizing the next batch of
raw strings.

If you're training on a GPU, prefer the first option: putting the TextVectorization layer in the tf.data pipeline.

So if you’re training the model on GPU or TPU, you’ll probably want to go with the first
option to get the best performance. This is what we will do in all practical examples
throughout this chapter. When training on a CPU, though, synchronous processing is
fine: you will get 100% utilization of your cores regardless of which option you go with.

Exporting to a production environment

Now, if you were to export our model to a production environment, you would want to
ship a model that accepts raw strings as input, like in the code snippet for the second
option above—otherwise you would have to reimplement text standardization and
tokenization in your production environment (maybe in JavaScript?), and you would
face the risk of introducing small preprocessing discrepancies that would hurt the
model’s accuracy. Thankfully, the TextVectorization layer enables you to include
text preprocessing right into your model, making it easier to deploy—even if you were
originally using the layer as part of a tf.data pipeline. In the sidebar “Exporting a
model that processes raw strings,” you’ll learn how to export an inference-only
trained model that incorporates text preprocessing.
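
Here is a minimal sketch of what that export looks like, assuming text_vectorization is the adapted preprocessing layer and model is a trained Keras model that expects its vectorized output (the book's sidebar covers this in detail; the snippet below is only the general pattern):

import tensorflow as tf
from tensorflow import keras

# Chain the preprocessing layer and the trained model into one end-to-end inference model.
inputs = keras.Input(shape=(1,), dtype="string")  # one raw string per sample
processed_inputs = text_vectorization(inputs)     # standardization, tokenization, indexing happen inside the model
outputs = model(processed_inputs)                 # reuse the already-trained weights
inference_model = keras.Model(inputs, outputs)

# The exported model can now be called directly on raw text.
raw_text_data = tf.convert_to_tensor([["That was an excellent movie, I loved it."]])
predictions = inference_model(raw_text_data)
print(f"{float(predictions[0] * 100):.2f} percent positive")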

11.3 Two approaches for representing groups of words: Sets and sequences

word order

How a machine learning model should represent individual words is a relatively uncontroversial
question: they’re categorical features (values from a predefined set), and we
know how to handle those. They should be encoded as dimensions in a feature space,
or as category vectors (word vectors in this case). A much more problematic question,
however, is how to encode the way words are woven into sentences: word order.

Word order matters, but its relationship to the meaning of the whole sentence is not that direct.

The problem of order in natural language is an interesting one: unlike the steps of
a timeseries, words in a sentence don’t have a natural, canonical order. Different languages
order similar words in very different ways. For instance, the sentence structure
of English is quite different from that of Japanese. Even within a given language, you
can typically say the same thing in different ways by reshuffling the words a bit. Even
further, if you fully randomize the words in a short sentence, you can still largely figure
out what it was saying—though in many cases significant ambiguity seems to arise.
Order is clearly important, but its relationship to meaning isn’t straightforward.

Below are the main kinds of models this leads to:

How to represent word order is the pivotal question from which different kinds of
NLP architectures spring. The simplest thing you could do is just discard order and
treat text as an unordered set of words—this gives you bag-of-words models. You could
also decide that words should be processed strictly in the order in which they appear,
one at a time, like steps in a timeseries—you could then leverage the recurrent models
from the last chapter. Finally, a hybrid approach is also possible: the Transformer architecture is technically order-agnostic, yet it injects word-position information into
the representations it processes, which enables it to simultaneously look at different
parts of a sentence (unlike RNNs) while still being order-aware. Because they take into
account word order, both RNNs and Transformers are called sequence models.

Earlier NLP work mostly used bag-of-words models; attention only gradually shifted to sequence models from 2015 on.

Historically, most early applications of machine learning to NLP just involved
bag-of-words models. Interest in sequence models only started rising in 2015, with the
rebirth of recurrent neural networks. Today, both approaches remain relevant. Let’s
see how they work, and when to leverage which.

We’ll demonstrate each approach on a well-known text classification benchmark:
the IMDB movie review sentiment-classification dataset. In chapters 4 and 5, you
worked with a prevectorized version of the IMDB dataset; now, let’s process the raw
IMDB text data, just like you would do when approaching a new text-classification
problem in the real world.

11.3.1 Preparing the IMDB movie reviews data

Let’s start by downloading the dataset from the Stanford page of Andrew Maas and
uncompressing it:

!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

output

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  48.6M      0  0:00:01  0:00:01 --:--:-- 48.6M

There’s also a train/unsup subdirectory in there, which we don’t need. Let’s
delete it:

!rm -r aclImdb/train/unsup

This is the directory structure:

aclImdb/
...train/
......pos/
......neg/
...test/
......pos/
......neg/

For instance, the train/pos/ directory contains a set of 12,500 text files, each of which
contains the text body of a positive-sentiment movie review to be used as training data.
The negative-sentiment reviews live in the “neg” directories. In total, there are 25,000
text files for training and another 25,000 for testing.

Take a look at the content of a few of these text files. Whether you’re working with
text data or image data, remember to always inspect what your data looks like before
you dive into modeling it. It will ground your intuition about what your model is actually
doing:

!cat aclImdb/train/pos/4077_10.txt

output

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

prepare a validation set

Next, let’s prepare a validation set by setting apart 20% of the training text files in a
new directory, aclImdb/val:

The following code is worth studying: it shows how to set aside 20% of a dataset as a validation set.

import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

If you run the code as given in the book, it can only be run once; running it a second time raises an error:

---------------------------------------------------------------------------
FileExistsError                           Traceback (most recent call last)
<ipython-input-32-c888da3ad7e7> in <module>()
      5 train_dir = base_dir / "train"
      6 for category in ("neg", "pos"):
----> 7     os.makedirs(val_dir / category)
      8     files = os.listdir(train_dir / category)
      9     random.Random(1337).shuffle(files)

/usr/lib/python3.7/os.py in makedirs(name, mode, exist_ok)
    221             return
    222     try:
--> 223         mkdir(name, mode)
    224     except OSError:
    225         # Cannot rely on checking for EEXIST, since the operating system

FileExistsError: [Errno 17] File exists: 'aclImdb/val/neg'

An explanation of the code: for each category ("neg" and "pos"), it creates the directory aclImdb/val/<category>, shuffles the training filenames with a fixed seed (1337) so the split is reproducible, and moves the last 20% of the files from aclImdb/train to aclImdb/val.

Remember how, in chapter 8, we used the image_dataset_from_directory utility to
create a batched Dataset of images and their labels for a directory structure? You can
do the exact same thing for text files using the text_dataset_from_directory utility.
Let’s create three Dataset objects for training, validation, and testing:

from tensorflow import keras
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

output

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.

These datasets yield inputs that are TensorFlow tf.string tensors and targets that are
int32 tensors encoding the value “0” or “1.”

Listing 11.2 Displaying the shapes and dtypes of the first batch

for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

output

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'Despite being quite far removed from my expectations, I was thoroughly impressed by Dog Bite Dog. I rented it not knowing much about it, but I essentially expected it to be a martial arts/action film in the standard Hong Kong action tradition, of which I am a devoted fan. I ended up getting something entirely different, which is not at all a bad thing. While the film could be classified as such, and there is definitely some good action and hand to hand combat scenes in the film, it is definitely not the primary focus. Its characters are infinitely more important to the film than its fights, a rather uncommon thing in many Hong Kong action movies.<br /><br />I was really quite surprised by the intricacy of the characters and character relationships in the film. The lead character, played by Edison Chen (who is really very good), becomes infinitely more complex by the end of the film than I ever thought he would be after watching the first thirty minutes. The police characters also defied my expectations thoroughly. In fact, the stark and honest portrayal of the seldom seen dark side of the police force was quite possible my favorite aspect of the film. I don\'t know that I would say Dog Bite Dog entirely subverts typical notions of bad criminal, good cop, but it certainly distorts them in ways not often seen in film (unfortunately). So many films, especially Hong Kong action films I find, portray police in what is frankly a VERY ignorantly idealized light. This is one of my least favorite things about the genre. I was pleasantly surprised to see that Dog Bite Dog actually had some very unique, and really quite courageous, ideas to present about the police force. There are negotiation scenes in this film that I have never seen the likes of before, and doubt I will ever see again, and am sure I will remember for quite a while. Also, the criminal characters are shown from an interesting perspective as well, there is some documentary footage in the film of Cambodian boys no older than ten being made to fight each other to the death with their bare hands, which I thought was one of the film\'s most powerful and moving moments. It says a lot about the reason these guys are the way they are, rather than simply condemning them. Also, the relationship between Chen\'s character and the girl he meets in the junk yard reveals a lot about his character. It wasn\'t until this element entered the film that I really started to see the film as an emotional experience rather than only a visceral one. There is something about most on screen relationships that doesn\'t quite get through to me, but for some reason this one really did. The actress does an incredible job with this role which I imagine was not easy to play.<br /><br />Dog Bite Dog also features some really breathtaking cinematography, all though it is unfortunately rather uneven. There were some moments that I found really striking, particularly in the last segment of the film, but there was also a good deal of camera work that was just OK. Another slight problem I had was with the pacing, which I also felt was uneven. I found a lot of the "looking for a boat" scenes to be a little alienating, all though it quickly picks up after that. The action scenes are short and not too plentiful, but are truly powerful and effecting, particularly towards the end. 
The fight choreography is honestly not all that impressive for the most part, all though to its credit it is solid and fairly realistic, but the true strength is the emotional content behind the fights. The final scene, while not a marvel of martial artistry or fight choreography, is one of the most powerful final fights I have ever seen, and I\'ve seen quite a few martial arts films.<br /><br />I suppose the biggest determining factor of whether or not one will get much out of Dog Bite Dog is whether or not you can connect with the characters. All of them are certainly some of the more flawed characters one is likely to see in a film of any kind, but there was something very human about all of them that I couldn\'t help but be drawn to and really feel for them, particularly Chen\'s girlfriend. I should say that I doubt most people will like the film as much as I did simply because I imagine that most people will not like or care about the characters in the same way, but I still recommend it highly all the same. It is truly a deeply moving and effecting film if you give it a chance.', shape=(), dtype=string)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)

Below we print the last sample (index 31) of the first batch:

for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[31]:", inputs[31])
    print("targets[31]:", targets[31])
    break

output

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[31]: tf.Tensor(b'"The Brain Machine" will at least put your own brain into overdrive trying to figure out what it\'s all about. Four subjects of varying backgrounds and intelligence level have been selected for an experiment described by one of the researchers as a scientific study of man and environment. Since the only common denominator among them is the fact that they each have no known family should have been a tip off - none of them will be missed.<br /><br />The whole affair is supervised by a mysterious creep known only as The General, but it seems he\'s taking his direction from a Senator who wishes to remain anonymous. Good call there on the Senator\'s part. There\'s also a shadowy guard that the camera constantly zooms in on, who later claims he doesn\'t take his direction from the General or \'The Project\'. Too bad he wasn\'t more effective, he was overpowered rather easily before the whole thing went kablooey.<br /><br />If nothing else, the film is a veritable treasure trove of 1970\'s technology featuring repeated shots of dial phones, room size computers and a teletype machine that won\'t quit. Perhaps that was the basis of the film\'s alternate title - "Time Warp"; nothing else would make any sense. As for myself, I\'d like to consider a title suggested by the murdered Dr. Krisner\'s experiment titled \'Group Stress Project\'. It applies to the film\'s actors and viewers alike.<br /><br />Keep an eye out just above The General\'s head at poolside when he asks an agent for his weapon, a boom mic is visible above his head for a number of seconds.<br /><br />You may want to catch this flick if you\'re a die hard Gerald McRaney fan, could he have ever been that young? James Best also appears in a somewhat uncharacteristic role as a cryptic reverend, but don\'t call him Father. For something a little more up his alley, try to get your hands on 1959\'s "The Killer Shrews". That one at least doesn\'t pretend to take itself so seriously.', shape=(), dtype=string)
targets[31]: tf.Tensor(0, shape=(), dtype=int32)

All set. Now let’s try learning something from this data.

11.3.2 Processing words as a set: The bag-of-words approach

Word order is discarded.

The simplest way to encode a piece of text for processing by a machine learning
model is to discard order and treat it as a set (a “bag”) of tokens. You could either look
at individual words (unigrams), or try to recover some local order information by
looking at groups of consecutive tokens (N-grams).

SINGLE WORDS (UNIGRAMS) WITH BINARY ENCODING

If you use a bag of single words, the sentence “the cat sat on the mat” becomes

{“cat”, “mat”, “on”, “sat”, “the”}

The main advantage of this encoding is that you can represent an entire text as a single
vector, where each entry is a presence indicator for a given word. For instance,
using binary encoding (multi-hot), you’d encode a text as a vector with as many
dimensions as there are words in your vocabulary—with 0s almost everywhere and
some 1s for dimensions that encode words present in the text. This is what we did
when we worked with text data in chapters 4 and 5. Let’s try this on our task.

The concrete procedure:

First, let’s process our raw text datasets with a TextVectorization layer so that
they yield multi-hot encoded binary word vectors. Our layer will only look at single
words (that is to say, unigrams).

Supplementary background on multi-hot encoding:

Multi-hot encode your lists to turn them into vectors of 0s and 1s. This would
mean, for instance, turning the sequence [8, 5] into a 10,000-dimensional vector
that would be all 0s except for indices 8 and 5, which would be 1s. Then you
could use a Dense layer, capable of handling floating-point vector data, as the
first layer in your model.
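
A quick sketch of what multi-hot encoding means in code (my own illustration using NumPy, not one of the book's listings):

import numpy as np

def multi_hot(sequence, dimension=10000):
    vector = np.zeros((dimension,))
    vector[sequence] = 1.0  # set the positions of the word indices to 1
    return vector

print(multi_hot([8, 5])[:12])
# [0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0.]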

Supplementary background on lambda expressions:

lambda [arg1 [, arg2, ..., argn]] : expression

A lambda can take many arguments and returns the value of the expression.
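
A tiny example of the kind of lambda used in the listings below; it maps a (text, label) pair to just the text (the values are made up):

keep_text_only = lambda x, y: x  # equivalent to: def keep_text_only(x, y): return x
print(keep_text_only("a movie review", 1))  # a movie review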

Listing 11.3 Preprocessing our datasets with a TextVectorization layer

text_vectorization = TextVectorization(
    max_tokens=20000,
    output_mode="multi_hot",
)
text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)

binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)  # the lambda returns the pair after the colon: (vectorized text, label)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)


You can try to inspect the output of one of these datasets.

Listing 11.4 Inspecting the output of our binary unigram dataset

for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

output

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


Write reusable code:

Next, let’s write a reusable model-building function that we’ll use in all of our experiments
in this section.

Listing 11.5 Our model-building utility

from tensorflow import keras
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

Understanding keras.Input:

tf.keras.Input(
    shape=None, batch_size=None, name=None, dtype=None, sparse=False, tensor=None,
    ragged=False, **kwargs
)

https://devdocs.io/tensorflow~2.4/keras/input

Arguments:

- shape: A shape tuple (integers), not including the batch size. For instance, shape=(32,) indicates that the expected input will be batches of 32-dimensional vectors. Elements of this tuple can be None; 'None' elements represent dimensions where the shape is not known.

Why do some Keras calls have two pairs of parentheses, and how can that be explained in terms of functions? https://blog.csdn.net/ding_programmer/article/details/100178016

The meaning of the Python syntax ()() and ()(X): https://blog.csdn.net/u013166171/article/details/81292132
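
The short answer is that a Keras layer is a callable object: the first pair of parentheses constructs the layer, the second applies it to a tensor. A minimal sketch:

from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(20000,))

dense_layer = layers.Dense(16, activation="relu")  # first (): construct the layer object
outputs = dense_layer(inputs)                      # second (): call the layer on a tensor

# The two steps are usually collapsed into a single line, which is where ()() comes from:
outputs = layers.Dense(16, activation="relu")(inputs)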

Understanding keras.layers.Dropout:

https://devdocs.io/tensorflow~2.4/keras/layers/dropout

tf.keras.layers.Dropout(
    rate, noise_shape=None, seed=None, **kwargs
)

Explanation: to prevent overfitting, Dropout randomly zeroes entries according to rate, and the entries that are not zeroed are scaled up to 1/(1 - rate) times their original value.

The Dropout layer randomly sets input units to 0 with a frequency of rate at each step during training time, which helps prevent overfitting. Inputs not set to 0 are scaled up by 1/(1 - rate) such that the sum over all inputs is unchanged.
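
A small sketch of that scaling behavior (which entries get zeroed is random on each call):

import tensorflow as tf

layer = tf.keras.layers.Dropout(0.5)
data = tf.ones((1, 6))
print(layer(data, training=True))   # roughly half the entries are zeroed; the survivors are scaled up to 2.0
print(layer(data, training=False))  # at inference time the layer passes its input through unchanged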

Understanding keras.layers.Dense:

https://devdocs.io/tensorflow~2.4/keras/layers/dense

tf.keras.layers.Dense(
    units, activation=None, use_bias=True,
    kernel_initializer='glorot_uniform',
    bias_initializer='zeros', kernel_regularizer=None,
    bias_regularizer=None, activity_regularizer=None, kernel_constraint=None,
    bias_constraint=None, **kwargs
)

Meaning: Dense implements the operation output = activation(dot(input, kernel) + bias), where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).

Example

# Create a `Sequential` model and add a Dense layer as the first layer.
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(16,)))
model.add(tf.keras.layers.Dense(32, activation='relu'))
# Now the model will take as input arrays of shape (None, 16)
# and output arrays of shape (None, 32).
# Note that after the first layer, you don't need to specify
# the size of the input anymore:
model.add(tf.keras.layers.Dense(32))
model.output_shape  # (None, 32)

Arguments:

- units: Positive integer, dimensionality of the output space.
- activation: Activation function to use. If you don't specify anything, no activation is applied (i.e. "linear" activation: a(x) = x).
- use_bias: Boolean, whether the layer uses a bias vector.
- kernel_initializer: Initializer for the kernel weights matrix.
- bias_initializer: Initializer for the bias vector.
- kernel_regularizer: Regularizer function applied to the kernel weights matrix.
- bias_regularizer: Regularizer function applied to the bias vector.
- activity_regularizer: Regularizer function applied to the output of the layer (its "activation").
- kernel_constraint: Constraint function applied to the kernel weights matrix.
- bias_constraint: Constraint function applied to the bias vector.

Input and output shapes

Input shape:

N-D tensor with shape: (batch_size, ..., input_dim). The most common situation would be a 2D input with shape (batch_size, input_dim).

Output shape:

N-D tensor with shape: (batch_size, ..., units). For instance, for a 2D input with shape (batch_size, input_dim), the output would have shape (batch_size, units).

The map() used on the datasets below is tf.data.Dataset.map, which applies a transformation function to every element of the dataset.

Listing 11.6 Training and testing the binary unigram model

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                    save_best_only=True)
]
model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Note that calling .cache() on the datasets keeps the vectorized batches in memory after the first epoch, so the text preprocessing only runs once (this only works if the data fits in memory).

output

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_2 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_2 (Dense)             (None, 16)                320016    
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
=================================================================
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
625/625 [==============================] - 14s 17ms/step - loss: 0.4085 - accuracy: 0.8281 - val_loss: 0.2989 - val_accuracy: 0.8834
Epoch 2/10
625/625 [==============================] - 3s 5ms/step - loss: 0.2791 - accuracy: 0.8990 - val_loss: 0.2873 - val_accuracy: 0.8860
Epoch 3/10
625/625 [==============================] - 3s 5ms/step - loss: 0.2442 - accuracy: 0.9126 - val_loss: 0.2984 - val_accuracy: 0.8836
Epoch 4/10
625/625 [==============================] - 3s 5ms/step - loss: 0.2307 - accuracy: 0.9214 - val_loss: 0.3136 - val_accuracy: 0.8802
Epoch 5/10
625/625 [==============================] - 3s 5ms/step - loss: 0.2351 - accuracy: 0.9251 - val_loss: 0.3264 - val_accuracy: 0.8794
Epoch 6/10
625/625 [==============================] - 3s 5ms/step - loss: 0.2187 - accuracy: 0.9293 - val_loss: 0.3394 - val_accuracy: 0.8774
Epoch 7/10
625/625 [==============================] - 3s 5ms/step - loss: 0.2137 - accuracy: 0.9329 - val_loss: 0.3454 - val_accuracy: 0.8796
Epoch 8/10
625/625 [==============================] - 3s 5ms/step - loss: 0.2109 - accuracy: 0.9341 - val_loss: 0.3513 - val_accuracy: 0.8750
Epoch 9/10
625/625 [==============================] - 3s 5ms/step - loss: 0.2147 - accuracy: 0.9356 - val_loss: 0.3586 - val_accuracy: 0.8780
Epoch 10/10
625/625 [==============================] - 3s 5ms/step - loss: 0.2004 - accuracy: 0.9395 - val_loss: 0.3637 - val_accuracy: 0.8784
782/782 [==============================] - 10s 12ms/step - loss: 0.2913 - accuracy: 0.8860
Test acc: 0.886

The book says:

This gets us to a test accuracy of 89.2%: not bad! Note that in this case, since the dataset
is a balanced two-class classification dataset (there are as many positive samples as
negative samples), the “naive baseline” we could reach without training an actual model
would only be 50%. Meanwhile, the best score that can be achieved on this dataset
without leveraging external data is around 95% test accuracy.

In our run, the test accuracy was actually 88.6%.

BIGRAMS WITH BINARY ENCODING

Sometimes we need to keep word groups together; here the groups are two-word units. For example, "United States" should be treated as a whole rather than as two separate words.

Of course, discarding word order is very reductive, because even atomic concepts can
be expressed via multiple words: the term “United States” conveys a concept that is
quite distinct from the meaning of the words “states” and “united” taken separately.
For this reason, you will usually end up re-injecting local order information into your
bag-of-words representation by looking at N-grams rather than single words (most
commonly, bigrams).

With bigrams, our sentence becomes

{“the”, “the cat”, “cat”, “cat sat”, “sat”,

“sat on”, “on”, “on the”, “the mat”, “mat”}

In code, this is achieved by passing an argument:

The TextVectorization layer can be configured to return arbitrary N-grams: bigrams,
trigrams, etc. Just pass an ngrams=N argument as in the following listing.

Listing 11.7 Configuring the TextVectorization layer to return bigrams

text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="multi_hot",
)

Let’s test how our model performs when trained on such binary-encoded bags of
bigrams.

The training code follows.

Listing 11.8 Training and testing the binary bigram model

text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_2gram.keras",
                                    save_best_only=True)
]
model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

output: my first training run reached 89.7% accuracy; the second run reached 90.1%.

First, the summary(), which shows the layers this model uses.

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_3 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_4 (Dense)             (None, 16)                320016    
                                                                 
 dropout_2 (Dropout)         (None, 16)                0         
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
=================================================================
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
625/625 [==============================] - 11s 16ms/step - loss: 0.3807 - accuracy: 0.8443 - val_loss: 0.2662 - val_accuracy: 0.8982
Epoch 2/10
625/625 [==============================] - 3s 5ms/step - loss: 0.2374 - accuracy: 0.9154 - val_loss: 0.2651 - val_accuracy: 0.9004
Epoch 3/10
625/625 [==============================] - 3s 5ms/step - loss: 0.2108 - accuracy: 0.9334 - val_loss: 0.2789 - val_accuracy: 0.9004
Epoch 4/10
625/625 [==============================] - 3s 5ms/step - loss: 0.1854 - accuracy: 0.9423 - val_loss: 0.2988 - val_accuracy: 0.8990
Epoch 5/10
625/625 [==============================] - 3s 5ms/step - loss: 0.1869 - accuracy: 0.9450 - val_loss: 0.3096 - val_accuracy: 0.8994
Epoch 6/10
625/625 [==============================] - 3s 5ms/step - loss: 0.1847 - accuracy: 0.9486 - val_loss: 0.3221 - val_accuracy: 0.8988
Epoch 7/10
625/625 [==============================] - 3s 5ms/step - loss: 0.1762 - accuracy: 0.9520 - val_loss: 0.3286 - val_accuracy: 0.8996
Epoch 8/10
625/625 [==============================] - 3s 5ms/step - loss: 0.1809 - accuracy: 0.9513 - val_loss: 0.3343 - val_accuracy: 0.8976
Epoch 9/10
625/625 [==============================] - 3s 5ms/step - loss: 0.1817 - accuracy: 0.9541 - val_loss: 0.3415 - val_accuracy: 0.8970
Epoch 10/10
625/625 [==============================] - 3s 5ms/step - loss: 0.1754 - accuracy: 0.9542 - val_loss: 0.3498 - val_accuracy: 0.8968
782/782 [==============================] - 11s 14ms/step - loss: 0.2710 - accuracy: 0.9011
Test acc: 0.901

BIGRAMS WITH TF-IDF ENCODING

You can count how many times each gram occurs.

You can also add a bit more information to this representation by counting how many
times each word or N-gram occurs, that is to say, by taking the histogram of the words
over the text:

{"the": 2, "the cat": 1, "cat": 1, "cat sat": 1, "sat": 1, 

"sat on": 1, "on": 1, "on the": 1, "the mat: 1", "mat": 1}

In sentiment classification, one occurrence of "happy" doesn't tell us whether a movie review is positive or negative, but if "happy" appears many times, there is good reason to believe the review is positive.

If you’re doing text classification, knowing how many times a word occurs in a sample
is critical: any sufficiently long movie review may contain the word “terrible” regardless
of sentiment, but a review that contains many instances of the word “terrible” is
likely a negative one.

Here’s how you’d count bigram occurrences with the TextVectorization layer.

Listing 11.9 Configuring the TextVectorization layer to return token counts

text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="count"
)

A problem: some words (articles and the like) appear very frequently but are of no help at all for our task. How do we handle this?

Now, of course, some words are bound to occur more often than others no matter
what the text is about. The words “the,” “a,” “is,” and “are” will always dominate your
word count histograms, drowning out other words—despite being pretty much useless
features in a classification context. How could we address this?

Use normalization: subtract the mean and divide by the variance. But the vectors here are very sparse, so we don't want to subtract the mean; division alone is enough.

For the denominator, use TF-IDF normalization. TF-IDF stands for "term frequency, inverse document frequency."

You already guessed it: via normalization. We could just normalize word counts by
subtracting the mean and dividing by the variance (computed across the entire training
dataset). That would make sense. Except most vectorized sentences consist almost
entirely of zeros (our previous example features 12 non-zero entries and 19,988 zero
entries), a property called “sparsity.” That’s a great property to have, as it dramatically
reduces compute load and reduces the risk of overfitting. If we subtracted the mean
from each feature, we’d wreck sparsity. Thus, whatever normalization scheme we use
should be divide-only. What, then, should we use as the denominator? The best practice
is to go with something called TF-IDF normalization—TF-IDF stands for “term frequency,
inverse document frequency.”

TF-IDF is so common that it’s built into the TextVectorization layer. All you need
to do to start using it is to switch the output_mode argument to “tf_idf”.

Listing 11.10 Configuring TextVectorization to return TF-IDF-weighted outputs

text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf",
)

Some supplementary background:

Understanding TF-IDF normalization

The more a given term appears in a document, the more important that term is for
understanding what the document is about. At the same time, the frequency at which
the term appears across all documents in your dataset matters too: terms that
appear in almost every document (like “the” or “a”) aren’t particularly informative,
while terms that appear only in a small subset of all texts (like “Herzog”) are very distinctive,
and thus important. TF-IDF is a metric that fuses these two ideas. It weights
a given term by taking “term frequency,” how many times the term appears in the
current document, and dividing it by a measure of “document frequency,” which estimates
how often the term comes up across the dataset. You’d compute it as follows:

import math

def tfidf(term, document, dataset):
    term_freq = document.count(term)
    doc_freq = math.log(sum(doc.count(term) for doc in dataset) + 1)
    return term_freq / doc_freq
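
A quick usage sketch of the function above, on a toy tokenized dataset that I made up: terms that appear everywhere, like "the", get dampened, while terms that are rare across the dataset, like "mat", keep a relatively higher score.

dataset = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ate", "my", "homework"],
    ["the", "cat", "chased", "the", "dog"],
]
document = dataset[0]
print(tfidf("the", document, dataset))  # high term frequency, but also high document frequency -> dampened
print(tfidf("mat", document, dataset))  # appears in only one document -> relatively boosted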

Let’s train a new model with this scheme.

Listing 11.11 Training and testing the TF-IDF bigram model

The adapt() call will learn the TF-IDF weights in addition to the vocabulary. (Vocabulary note: "in addition to" = "as well as" = "besides".)

text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                    save_best_only=True)
]
model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

I ran into an error:

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-56-09c5afe5a363> in <module>()
----> 1 text_vectorization.adapt(text_only_train_ds)
      2 
      3 tfidf_2gram_train_ds = train_ds.map(
      4     lambda x, y: (text_vectorization(x), y),
      5     num_parallel_calls=4)

3 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     53     ctx.ensure_initialized()
     54     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 55                                         inputs, attrs, num_outputs)
     56   except core._NotOkStatusException as e:
     57     if name is not None:

InvalidArgumentError: Graph execution error:

2 root error(s) found.
  (0) INVALID_ARGUMENT:  During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
	 [[{{node map/TensorArrayUnstack/TensorListFromTensor/_42}}]]
	 [[Func/map/while/body/_1/input/_48/_72]]
  (1) INVALID_ARGUMENT:  During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
	 [[{{node map/TensorArrayUnstack/TensorListFromTensor/_42}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_adapt_step_194806]

How did I fix it? Switching the Colab runtime from GPU to CPU was enough.

Output: 87.5% test accuracy.

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_2 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_2 (Dense)             (None, 16)                320016    
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
=================================================================
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
625/625 [==============================] - 17s 25ms/step - loss: 0.5273 - accuracy: 0.7180 - val_loss: 0.3055 - val_accuracy: 0.8740
Epoch 2/10
625/625 [==============================] - 6s 10ms/step - loss: 0.3804 - accuracy: 0.8231 - val_loss: 0.3086 - val_accuracy: 0.8932
Epoch 3/10
625/625 [==============================] - 6s 10ms/step - loss: 0.3474 - accuracy: 0.8382 - val_loss: 0.3347 - val_accuracy: 0.8782
Epoch 4/10
625/625 [==============================] - 7s 11ms/step - loss: 0.3230 - accuracy: 0.8551 - val_loss: 0.3330 - val_accuracy: 0.8800
Epoch 5/10
625/625 [==============================] - 7s 11ms/step - loss: 0.3059 - accuracy: 0.8622 - val_loss: 0.3723 - val_accuracy: 0.8734
Epoch 6/10
625/625 [==============================] - 7s 12ms/step - loss: 0.2969 - accuracy: 0.8651 - val_loss: 0.3604 - val_accuracy: 0.8716
Epoch 7/10
625/625 [==============================] - 7s 11ms/step - loss: 0.2834 - accuracy: 0.8734 - val_loss: 0.3652 - val_accuracy: 0.8722
Epoch 8/10
625/625 [==============================] - 7s 11ms/step - loss: 0.2767 - accuracy: 0.8781 - val_loss: 0.3433 - val_accuracy: 0.8754
Epoch 9/10
625/625 [==============================] - 8s 13ms/step - loss: 0.2583 - accuracy: 0.8876 - val_loss: 0.3501 - val_accuracy: 0.8698
Epoch 10/10
625/625 [==============================] - 8s 12ms/step - loss: 0.2494 - accuracy: 0.8970 - val_loss: 0.3469 - val_accuracy: 0.8866
782/782 [==============================] - 16s 20ms/step - loss: 0.3106 - accuracy: 0.8751
Test acc: 0.875

The book's result isn't much higher either; in general, for most text-classification datasets, TF-IDF brings roughly a one-percentage-point improvement over plain binary encoding.

This gets us an 89.8% test accuracy on the IMDB classification task: it doesn’t seem to
be particularly helpful in this case. However, for many text-classification datasets, it
would be typical to see a one-percentage-point increase when using TF-IDF compared
to plain binary encoding.

Exporting a model that processes raw strings

In the preceding examples, we did our text standardization, splitting, and indexing as
part of the tf.data pipeline. But if we want to export a standalone model independent
of this pipeline, we should make sure that it incorporates its own text preprocessing
(otherwise, you’d have to reimplement it in the production environment, which
can be challenging or can lead to subtle discrepancies between the training data and
the production data). Thankfully, this is easy.

Just create a new model that reuses your TextVectorization layer and adds to it
the model you just trained:

Create a new model that reuses our TextVectorization layer and chains it with the model we just trained:

inputs = keras.Input(shape=(1,), dtype="string")
processed_inputs = text_vectorization(inputs)
outputs = model(processed_inputs)
inference_model = keras.Model(inputs, outputs)


The model can now handle raw strings directly.

The resulting model can process batches of raw strings:

import tensorflow as tf
raw_text_data = tf.convert_to_tensor([
    ["That was an excellent movie, I loved it."],
])
predictions = inference_model(raw_text_data)
print(f"{float(predictions[0] * 100):.2f} percent positive")

Inference output (screenshot omitted).


Then I added a second raw-text sample and ran inference again:

import tensorflow as tf
raw_text_data = tf.convert_to_tensor([
    ["That was an excellent movie, I loved it."],
    ["terrible movie, did not like it."],
])
predictions = inference_model(raw_text_data)
print(predictions)
print(f"{float(predictions[0] * 100):.2f} percent positive")
print(f"{float(predictions[1] * 100):.2f} percent positive")

Output:

tf.Tensor(
[[0.8222418 ]
 [0.38764578]], shape=(2, 1), dtype=float32)
82.22 percent positive
38.76 percent positive

11.3.3 Processing words as a sequence: The sequence model approach

Sequence models: let the deep learning model learn order-based features on its own.

These past few examples clearly show that word order matters: manual engineering of
order-based features, such as bigrams, yields a nice accuracy boost. Now remember: the
history of deep learning is that of a move away from manual feature engineering, toward
letting models learn their own features from exposure to data alone. What if, instead of
manually crafting order-based features, we exposed the model to raw word sequences
and let it figure out such features on its own? This is what sequence models are about.

The concrete steps:

To implement a sequence model, you’d start by representing your input samples as
sequences of integer indices (one integer standing for one word). Then, you’d map
each integer to a vector to obtain vector sequences. Finally, you’d feed these
sequences of vectors into a stack of layers that could cross-correlate features from adjacent
vectors, such as a 1D convnet, an RNN, or a Transformer.

Bidirectional RNNs used to dominate sequence modeling; nowadays it is mostly Transformers.

The author also notes that a residual stack of depthwise-separable 1D convolutions can match a bidirectional LSTM at a much lower computational cost, yet 1D convnets never really caught on in NLP.

For some time around 2016–2017, bidirectional RNNs (in particular, bidirectional
LSTMs) were considered to be the state of the art for sequence modeling. Since you’re already familiar with this architecture, this is what we’ll use in our first sequence model
examples. However, nowadays sequence modeling is almost universally done with Transformers,
which we will cover shortly. Oddly, one-dimensional convnets were never
very popular in NLP, even though, in my own experience, a residual stack of depthwise-
separable 1D convolutions can often achieve comparable performance to a bidirectional
LSTM, at a greatly reduced computational cost.

First, download the dataset:

!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup

Data preparation:

import os, pathlib, shutil, random
from tensorflow import keras
batch_size = 32
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)

If Colab throws a makedirs error because the val folders already exist, one fix is to comment out the whole for loop (or pass exist_ok=True to os.makedirs).

A FIRST PRACTICAL EXAMPLE

Let’s try out a first sequence model in practice. First, let’s prepare datasets that return
integer sequences.

Listing 11.12 Preparing integer sequence datasets

We cap each review at 600 words: anything longer is truncated to 600 tokens. Only about 5% of reviews exceed that length.

from tensorflow.keras import layers

max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

Next, build a model: map the integer sequences to one-hot vectors, then add a simple bidirectional LSTM on top.

Next, let’s make a model. The simplest way to convert our integer sequences to vector
sequences is to one-hot encode the integers (each dimension would represent one
possible term in the vocabulary). On top of these one-hot vectors, we’ll add a simple
bidirectional LSTM.

Listing 11.13 A sequence model built on one-hot encoded vector sequences

import tensorflow as tf
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = tf.one_hot(inputs, depth=max_tokens)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

Code notes (screenshot omitted): tf.one_hot expands each integer index into a 20,000-dimensional one-hot vector before the sequence reaches the LSTM.

Output:

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 tf.one_hot (TFOpLambda)     (None, None, 20000)       0         
                                                                 
 bidirectional (Bidirectiona  (None, 64)               5128448   
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
=================================================================
Total params: 5,128,513
Trainable params: 5,128,513
Non-trainable params: 0
_________________________________________________________________

Now, let’s train our model.

Listing 11.14 Training a first basic sequence model

callbacks = [
    keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("one_hot_bidir_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

One-hot encoding doesn't work well here: each review is encoded as a (600, 20000) matrix, and with so many reviews the compute load is huge, so training takes a very long time.

A first observation: this model trains very slowly, especially compared to the lightweight
model of the previous section. This is because our inputs are quite large: each
input sample is encoded as a matrix of size (600, 20000) (600 words per sample,
20,000 possible words). That’s 12,000,000 floats for a single movie review. Our bidirectional
LSTM has a lot of work to do. Second, the model only gets to 87% test accuracy—
it doesn’t perform nearly as well as our (very fast) binary unigram model.

We need a better approach: word embeddings.

Clearly, using one-hot encoding to turn words into vectors, which was the simplest
thing we could do, wasn’t a great idea. There’s a better way: word embeddings.

UNDERSTANDING WORD EMBEDDINGS

One-hot vectors are all orthogonal to one another, which isn't what we want here: many of the words in this task are semantically close.

Crucially, when you encode something via one-hot encoding, you’re making a feature-engineering
decision. You’re injecting into your model a fundamental assumption
about the structure of your feature space. That assumption is that the different tokens
you’re encoding are all independent from each other: indeed, one-hot vectors are all orthogonal
to one another. And in the case of words, that assumption is clearly wrong. Words
form a structured space: they share information with each other. The words “movie”
and “film” are interchangeable in most sentences, so the vector that represents
“movie” should not be orthogonal to the vector that represents “film”—they should be
the same vector, or close enough.

The geometric relationships between word vectors should mirror the semantic relationships between words: similar words should be geometrically close, and words with different meanings should be far apart.

To get a bit more abstract, the geometric relationship between two word vectors
should reflect the semantic relationship between these words. For instance, in a reasonable
word vector space, you would expect synonyms to be embedded into similar word
vectors, and in general, you would expect the geometric distance (such as the cosine
distance or L2 distance) between any two word vectors to relate to the “semantic distance”
between the associated words. Words that mean different things should lie far
away from each other, whereas related words should be closer.
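
As a small sketch of what "geometric distance" means here (my own example; the 4-dimensional vectors are invented for illustration, not real learned embeddings):

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

movie = np.array([0.8, 0.1, 0.6, 0.2])     # hypothetical embedding of "movie"
film = np.array([0.7, 0.2, 0.5, 0.1])      # hypothetical embedding of "film"
carrot = np.array([-0.3, 0.9, -0.1, 0.4])  # hypothetical embedding of "carrot"

print(cosine_similarity(movie, film))    # close to 1: semantically similar words
print(cosine_similarity(movie, carrot))  # much lower: unrelated words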

What word embeddings do:

Word embeddings are vector representations of words that achieve exactly this: they
map human language into a structured geometric space.

Word embeddings need far fewer dimensions to encode words.

Whereas the vectors obtained through one-hot encoding are binary, sparse (mostly
made of zeros), and very high-dimensional (the same dimensionality as the number of
words in the vocabulary), word embeddings are low-dimensional floating-point vectors
(that is, dense vectors, as opposed to sparse vectors); see figure 11.2. It’s common to see
word embeddings that are 256-dimensional, 512-dimensional, or 1,024-dimensional when dealing with very large vocabularies. On the other hand, one-hot encoding words
generally leads to vectors that are 20,000-dimensional or greater (capturing a vocabulary
of 20,000 tokens, in this case). So, word embeddings pack more information into
far fewer dimensions.

[Figure 11.2: One-hot word vectors vs. word embeddings]

Besides being dense representations, word embeddings are also structured representations,
and their structure is learned from data. Similar words get embedded in close
locations, and further, specific directions in the embedding space are meaningful. To
make this clearer, let’s look at a concrete example.

An example of a word-embedding space:

[Figure 11.3: A toy 2D word-embedding space]

In figure 11.3, four words are embedded on a 2D plane: cat, dog, wolf, and tiger.
With the vector representations we chose here, some semantic relationships between
these words can be encoded as geometric transformations. For instance, the same
vector allows us to go from cat to tiger and from dog to wolf: this vector could be interpreted
as the “from pet to wild animal” vector. Similarly, another vector lets us go
from dog to cat and from wolf to tiger, which could be interpreted as a “from canine
to feline” vector.

In real-world embedding spaces, common examples of meaningful geometric transformations are "gender" vectors and "plural" vectors; such spaces typically contain thousands of these interpretable and potentially useful vectors.

In real-world word-embedding spaces, common examples of meaningful geometric
transformations are “gender” vectors and “plural” vectors. For instance, by adding a
“female” vector to the vector “king,” we obtain the vector “queen.” By adding a “plural”
vector, we obtain “kings.” Word-embedding spaces typically feature thousands of
such interpretable and potentially useful vectors.
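
A toy illustration of this kind of vector arithmetic (the 2D vectors below are invented for the example; real embedding spaces have hundreds of dimensions):

import numpy as np

king = np.array([5.0, 8.0])    # hypothetical 2D "embeddings", chosen so that the
queen = np.array([5.0, 2.0])   # gender offset is the same for both word pairs
man = np.array([1.0, 8.0])
woman = np.array([1.0, 2.0])

female_vector = woman - man    # the "gender" direction
print(king + female_vector)    # [5. 2.], i.e., the vector for "queen"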

Two ways to obtain word embeddings in practice:

Let’s look at how to use such an embedding space in practice. There are two ways
to obtain word embeddings:

 Learn word embeddings jointly with the main task you care about (such as document
classification or sentiment prediction). In this setup, you start with random
word vectors and then learn word vectors in the same way you learn the
weights of a neural network.

 Load into your model word embeddings that were precomputed using a different
machine learning task than the one you’re trying to solve. These are called
pretrained word embeddings.

Let’s review each of these approaches.

Both approaches are covered below.

LEARNING WORD EMBEDDINGS WITH THE EMBEDDING LAYER

Every word-embedding space depends on the specific task at hand.

Is there some ideal word-embedding space that would perfectly map human language
and could be used for any natural language processing task? Possibly, but we have yet
to compute anything of the sort. Also, there is no such thing as human language—
there are many different languages, and they aren’t isomorphic to one another,
because a language is the reflection of a specific culture and a specific context. But
more pragmatically, what makes a good word-embedding space depends heavily on
your task: the perfect word-embedding space for an English-language movie-review
sentiment-analysis model may look different from the perfect embedding space for an
English-language legal-document classification model, because the importance of certain
semantic relationships varies from task to task.

Keras makes this easy.

It’s thus reasonable to learn a new embedding space with every new task. Fortunately,
backpropagation makes this easy, and Keras makes it even easier. It’s about
learning the weights of a layer: the Embedding layer.

Listing 11.15 Instantiating an Embedding layer

The Embedding layer takes at least two arguments: the number of possible tokens and the dimensionality of the embeddings (here, 256).

embedding_layer = layers.Embedding(input_dim=max_tokens, output_dim=256)

How it works:

The Embedding layer is best understood as a dictionary that maps integer indices
(which stand for specific words) to dense vectors. It takes integers as input, looks up
these integers in an internal dictionary, and returns the associated vectors. It’s effectively
a dictionary lookup (see figure 11.4).

[Figure 11.4: The Embedding layer as a dictionary lookup]

Inputs and outputs:

The Embedding layer takes as input a rank-2 tensor of integers, of shape (batch_size,
sequence_length), where each entry is a sequence of integers. The layer then returns
a 3D floating-point tensor of shape (batch_size, sequence_length, embedding_
dimensionality).
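
A tiny shape check (my own, not a book listing) makes this input/output contract concrete:

import numpy as np
from tensorflow.keras import layers

embedding_layer = layers.Embedding(input_dim=10, output_dim=4)
dummy_batch = np.array([[3, 5, 2, 0],
                        [1, 4, 4, 0]])   # (batch_size=2, sequence_length=4), integer indices
vectors = embedding_layer(dummy_batch)
print(vectors.shape)                     # (2, 4, 4): batch, sequence, embedding dimensionality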

Train the model:

Let’s build a model that includes an Embedding layer and benchmark it on our task.

Listing 11.16 Model that uses an Embedding layer trained from scratch

inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Output: training took about 35 minutes; test accuracy 86.3%.

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_5 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_3 (Embedding)     (None, None, 256)         5120000   
                                                                 
 bidirectional_3 (Bidirectio  (None, 64)               73984     
 nal)                                                            
                                                                 
 dropout_2 (Dropout)         (None, 64)                0         
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 5,194,049
Trainable params: 5,194,049
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
782/782 [==============================] - 168s 210ms/step - loss: 0.4519 - accuracy: 0.8035 - val_loss: 0.2636 - val_accuracy: 0.8980
Epoch 2/10
782/782 [==============================] - 162s 207ms/step - loss: 0.2903 - accuracy: 0.8942 - val_loss: 0.1787 - val_accuracy: 0.9380
Epoch 3/10
782/782 [==============================] - 163s 208ms/step - loss: 0.2328 - accuracy: 0.9180 - val_loss: 0.1440 - val_accuracy: 0.9500
Epoch 4/10
782/782 [==============================] - 163s 209ms/step - loss: 0.1977 - accuracy: 0.9334 - val_loss: 0.1271 - val_accuracy: 0.9588
Epoch 5/10
782/782 [==============================] - 161s 206ms/step - loss: 0.1692 - accuracy: 0.9438 - val_loss: 0.1210 - val_accuracy: 0.9568
Epoch 6/10
782/782 [==============================] - 158s 202ms/step - loss: 0.1445 - accuracy: 0.9527 - val_loss: 0.2025 - val_accuracy: 0.9388
Epoch 7/10
782/782 [==============================] - 158s 201ms/step - loss: 0.1186 - accuracy: 0.9607 - val_loss: 0.0647 - val_accuracy: 0.9812
Epoch 8/10
782/782 [==============================] - 158s 202ms/step - loss: 0.0976 - accuracy: 0.9688 - val_loss: 0.0533 - val_accuracy: 0.9838
Epoch 9/10
782/782 [==============================] - 158s 202ms/step - loss: 0.0846 - accuracy: 0.9726 - val_loss: 0.0426 - val_accuracy: 0.9866
Epoch 10/10
782/782 [==============================] - 159s 204ms/step - loss: 0.0710 - accuracy: 0.9772 - val_loss: 0.0415 - val_accuracy: 0.9856
782/782 [==============================] - 57s 72ms/step - loss: 0.5607 - accuracy: 0.8630
Test acc: 0.863

The book's result:

It trains much faster than the one-hot model (since the LSTM only has to process
256-dimensional vectors instead of 20,000-dimensional), and its test accuracy is comparable
(87%). However, we’re still some way off from the results of our basic bigram
model. Part of the reason why is simply that the model is looking at slightly less data:
the bigram model processed full reviews, while our sequence model truncates sequences
after 600 words.

UNDERSTANDING PADDING AND MASKING

Above, reviews longer than 600 words were truncated to 600 tokens, and reviews shorter than 600 words were padded with zeros up to 600.

One thing that’s slightly hurting model performance here is that our input sequences
are full of zeros. This comes from our use of the output_sequence_length=max_
length option in TextVectorization (with max_length equal to 600): sentences longer
than 600 tokens are truncated to a length of 600 tokens, and sentences shorter
than 600 tokens are padded with zeros at the end so that they can be concatenated
together with other sequences to form contiguous batches.
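
You can see this padding behavior with a tiny throwaway vectorizer (the sentences here are mine, not the IMDB data):

import numpy as np
from tensorflow.keras import layers

demo_vectorizer = layers.TextVectorization(
    output_mode="int",
    output_sequence_length=8,
)
demo_vectorizer.adapt(np.array(["the cat sat on the mat"]))
print(demo_vectorizer(np.array(["the cat sat"])))
# Shape (1, 8): the three token indices are followed by five zeros of padding.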

With a bidirectional RNN:

We’re using a bidirectional RNN: two RNN layers running in parallel, with one
processing the tokens in their natural order, and the other processing the same
tokens in reverse. The RNN that looks at the tokens in their natural order will spend
its last iterations seeing only vectors that encode padding—possibly for several hundreds
of iterations if the original sentence was short. The information stored in the
internal state of the RNN will gradually fade out as it gets exposed to these meaningless
inputs.

We need some way to tell the RNN that it should skip these iterations. There’s an
API for that: masking.

The padded positions can take many iterations to process; the masking API lets the RNN skip those iterations.

The Embedding layer is capable of generating a “mask” that corresponds to its
input data. This mask is a tensor of ones and zeros (or True/False booleans), of shape
(batch_size, sequence_length), where the entry mask[i, t] indicates whether timestep
t of sample i should be skipped or not (the timestep will be skipped if mask[i, t]
is 0 or False, and processed otherwise).

To use it, you pass an argument that turns masking on:

By default, this option isn’t active—you can turn it on by passing mask_zero=True
to your Embedding layer. You can retrieve the mask with the compute_mask() method:

>>> embedding_layer = layers.Embedding(input_dim=10, output_dim=256, mask_zero=True)
>>> some_input = [
...     [4, 3, 2, 1, 0, 0, 0],
...     [5, 4, 3, 2, 1, 0, 0],
...     [2, 1, 0, 0, 0, 0, 0]]
>>> mask = embedding_layer.compute_mask(some_input)
>>> mask
<tf.Tensor: shape=(3, 7), dtype=bool, numpy=
array([[ True,  True,  True,  True, False, False, False],
       [ True,  True,  True,  True,  True, False, False],
       [ True,  True, False, False, False, False, False]])>

In practice, Keras passes the mask on automatically.

In practice, you will almost never have to manage masks by hand. Instead, Keras will
automatically pass on the mask to every layer that is able to process it (as a piece of
metadata attached to the sequence it represents). This mask will be used by RNN layers
to skip masked steps. If your model returns an entire sequence, the mask will also
be used by the loss function to skip masked steps in the output sequence.
Let’s try retraining our model with masking enabled.

Listing 11.17 Using an Embedding layer with masking enabled

inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(
    input_dim=max_tokens, output_dim=256, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru_with_masking.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru_with_masking.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Output: test accuracy 86.7%; training took about 33 minutes.

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_6 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_4 (Embedding)     (None, None, 256)         5120000   
                                                                 
 bidirectional_4 (Bidirectio  (None, 64)               73984     
 nal)                                                            
                                                                 
 dropout_3 (Dropout)         (None, 64)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 5,194,049
Trainable params: 5,194,049
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
782/782 [==============================] - 178s 216ms/step - loss: 0.3784 - accuracy: 0.8350 - val_loss: 0.2136 - val_accuracy: 0.9150
Epoch 2/10
782/782 [==============================] - 168s 215ms/step - loss: 0.2305 - accuracy: 0.9109 - val_loss: 0.1679 - val_accuracy: 0.9350
Epoch 3/10
782/782 [==============================] - 168s 214ms/step - loss: 0.1709 - accuracy: 0.9359 - val_loss: 0.1160 - val_accuracy: 0.9592
Epoch 4/10
782/782 [==============================] - 167s 213ms/step - loss: 0.1329 - accuracy: 0.9522 - val_loss: 0.0751 - val_accuracy: 0.9724
Epoch 5/10
782/782 [==============================] - 167s 214ms/step - loss: 0.1028 - accuracy: 0.9638 - val_loss: 0.0823 - val_accuracy: 0.9704
Epoch 6/10
782/782 [==============================] - 167s 213ms/step - loss: 0.0790 - accuracy: 0.9734 - val_loss: 0.0385 - val_accuracy: 0.9874
Epoch 7/10
782/782 [==============================] - 167s 213ms/step - loss: 0.0601 - accuracy: 0.9798 - val_loss: 0.0302 - val_accuracy: 0.9914
Epoch 8/10
782/782 [==============================] - 167s 213ms/step - loss: 0.0444 - accuracy: 0.9856 - val_loss: 0.0226 - val_accuracy: 0.9930
Epoch 9/10
782/782 [==============================] - 167s 213ms/step - loss: 0.0330 - accuracy: 0.9893 - val_loss: 0.0180 - val_accuracy: 0.9948
Epoch 10/10
782/782 [==============================] - 167s 214ms/step - loss: 0.0253 - accuracy: 0.9926 - val_loss: 0.0133 - val_accuracy: 0.9962
782/782 [==============================] - 65s 79ms/step - loss: 0.6487 - accuracy: 0.8666
Test acc: 0.867

The book's result:

This time we get to 88% test accuracy—a small but noticeable improvement.

USING PRETRAINED WORD EMBEDDINGS

You can reuse word embeddings pretrained on a different task; the rationale is that natural-language text shares many generic semantic features across tasks.

Sometimes you have so little training data available that you can’t use your data alone
to learn an appropriate task-specific embedding of your vocabulary. In such cases,
instead of learning word embeddings jointly with the problem you want to solve, you
can load embedding vectors from a precomputed embedding space that you know is
highly structured and exhibits useful properties—one that captures generic aspects of
language structure. The rationale behind using pretrained word embeddings in natural
language processing is much the same as for using pretrained convnets in image
classification: you don’t have enough data available to learn truly powerful features on
your own, but you expect that the features you need are fairly generic—that is, common
visual features or semantic features. In this case, it makes sense to reuse features
learned on a different problem.

Such pretrained embeddings are mostly computed from word co-occurrence statistics: which words tend to appear together.

They took off in research and industry after the 2013 release of the Word2Vec algorithm, whose dimensions can capture specific semantic properties such as gender.

Such word embeddings are generally computed using word-occurrence statistics
(observations about what words co-occur in sentences or documents), using a variety
of techniques, some involving neural networks, others not. The idea of a dense, lowdimensional
embedding space for words, computed in an unsupervised way, was initially
explored by Bengio et al. in the early 2000s,1 but it only started to take off in
research and industry applications after the release of one of the most famous and
successful word-embedding schemes: the Word2Vec algorithm (https://code.google
.com/archive/p/word2vec), developed by Tomas Mikolov at Google in 2013. Word2Vec
dimensions capture specific semantic properties, such as gender.

Extra background:

How does Word2Vec work?

It takes a text corpus as input, learns vector representations of the vocabulary, and outputs them; the learned word vectors can then be used in many NLP or AI projects, for example to find a word's nearest neighbors.

Source: https://code.google.com/archive/p/word2vec/

How does it work

The word2vec tool takes a text corpus as input and produces the word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representation of words. The resulting word vector file can be used as features in many natural language processing and machine learning applications.

A simple way to investigate the learned representations is to find the closest words for a user-specified word. The distance tool serves that purpose. For example, if you enter ‘france’, distance will display the most similar words and their distances to ‘france’, which should look like:

```

Word Cosine distance

           spain              0.678515
         belgium              0.665923
     netherlands              0.652428
           italy              0.633130
     switzerland              0.622323
      luxembourg              0.610033
        portugal              0.577154
          russia              0.571507
         germany              0.563291
       catalonia              0.534176

```

End of extra background.

There are other databases of pretrained word embeddings, such as GloVe (Global Vectors for Word Representation).

There are various precomputed databases of word embeddings that you can download
and use in a Keras Embedding layer. Word2vec is one of them. Another popular
one is called Global Vectors for Word Representation (GloVe, https://nlp.stanford.edu/projects/glove), which was developed by Stanford researchers in 2014. This
embedding technique is based on factorizing a matrix of word co-occurrence statistics.
Its developers have made available precomputed embeddings for millions of
English tokens, obtained from Wikipedia data and Common Crawl data.

Extra background:

Part of the documentation from the project site, https://nlp.stanford.edu/projects/glove/:

Introduction
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word cooccurrence
statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

The GitHub repository: https://github.com/stanfordnlp/GloVe

End of extra background.

Back to the book: using GloVe for the word embeddings.

Let’s look at how you can get started using GloVe embeddings in a Keras model.
The same method is valid for Word2Vec embeddings or any other word-embedding
database. We’ll start by downloading the GloVe files and parsing them. We’ll then load
the word vectors into a Keras Embedding layer, which we’ll use to build a new model.

First, download the data:

First, let’s download the GloVe word embeddings precomputed on the 2014
English Wikipedia dataset. It’s an 822 MB zip file containing 100-dimensional embedding
vectors for 400,000 words (or non-word tokens).

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

Download complete:

--2022-03-14 12:09:11--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2022-03-14 12:09:11--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-03-14 12:09:11--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’

glove.6B.zip        100%[===================>] 822.24M  5.03MB/s    in 2m 41s  

2022-03-14 12:11:53 (5.09 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

Let’s parse the unzipped file (a .txt file) to build an index that maps words (as strings)
to their vector representation.

Listing 11.18 Parsing the GloVe word-embeddings file

import numpy as np
path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print(f"Found {len(embeddings_index)} word vectors.")

Output:

Found 400000 word vectors.

Build the embedding matrix:

Next, let’s build an embedding matrix that you can load into an Embedding layer. It
must be a matrix of shape (max_words, embedding_dim), where each entry i contains
the embedding_dim-dimensional vector for the word of index i in the reference word
index (built during tokenization).

Listing 11.19 Preparing the GloVe word-embeddings matrix

embedding_dim = 100

vocabulary = text_vectorization.get_vocabulary()
word_index = dict(zip(vocabulary, range(len(vocabulary))))

embedding_matrix = np.zeros((max_tokens, embedding_dim))
for word, i in word_index.items():
    if i < max_tokens:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

Code notes (screenshot omitted): words missing from the GloVe index keep an all-zeros embedding vector.

Use a Constant initializer to load the pretrained embeddings:

Finally, we use a Constant initializer to load the pretrained embeddings in an Embedding
layer. So as not to disrupt the pretrained representations during training, we freeze
the layer via trainable=False:

embedding_layer = layers.Embedding(
    max_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
    mask_zero=True,
)

Now we tackle the same problem with the pretrained 100-dimensional embeddings.

We’re now ready to train a new model—identical to our previous model, but leveraging
the 100-dimensional pretrained GloVe embeddings instead of the 256-dimensional
learned embeddings.

Listing 11.20 Model that uses a pretrained Embedding layer

inputs = keras.Input(shape=(None,), dtype="int64")
embedded = embedding_layer(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("glove_embeddings_sequence_model.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("glove_embeddings_sequence_model.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Output: test accuracy 87.9%; training took about 34 minutes.

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_7 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_5 (Embedding)     (None, None, 100)         2000000   
                                                                 
 bidirectional_5 (Bidirectio  (None, 64)               34048     
 nal)                                                            
                                                                 
 dropout_4 (Dropout)         (None, 64)                0         
                                                                 
 dense_4 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 2,034,113
Trainable params: 34,113
Non-trainable params: 2,000,000
_________________________________________________________________
Epoch 1/10
782/782 [==============================] - 176s 214ms/step - loss: 0.5485 - accuracy: 0.7160 - val_loss: 0.5630 - val_accuracy: 0.7102
Epoch 2/10
782/782 [==============================] - 164s 210ms/step - loss: 0.4238 - accuracy: 0.8095 - val_loss: 0.3699 - val_accuracy: 0.8454
Epoch 3/10
782/782 [==============================] - 163s 208ms/step - loss: 0.3787 - accuracy: 0.8374 - val_loss: 0.3765 - val_accuracy: 0.8268
Epoch 4/10
782/782 [==============================] - 165s 211ms/step - loss: 0.3468 - accuracy: 0.8522 - val_loss: 0.3120 - val_accuracy: 0.8658
Epoch 5/10
782/782 [==============================] - 163s 209ms/step - loss: 0.3244 - accuracy: 0.8646 - val_loss: 0.3470 - val_accuracy: 0.8404
Epoch 6/10
782/782 [==============================] - 163s 208ms/step - loss: 0.3044 - accuracy: 0.8744 - val_loss: 0.3494 - val_accuracy: 0.8540
Epoch 7/10
782/782 [==============================] - 164s 209ms/step - loss: 0.2877 - accuracy: 0.8827 - val_loss: 0.2501 - val_accuracy: 0.8996
Epoch 8/10
782/782 [==============================] - 163s 209ms/step - loss: 0.2731 - accuracy: 0.8895 - val_loss: 0.2345 - val_accuracy: 0.9058
Epoch 9/10
782/782 [==============================] - 161s 206ms/step - loss: 0.2599 - accuracy: 0.8961 - val_loss: 0.3041 - val_accuracy: 0.8726
Epoch 10/10
782/782 [==============================] - 166s 212ms/step - loss: 0.2470 - accuracy: 0.9022 - val_loss: 0.2107 - val_accuracy: 0.9130
782/782 [==============================] - 68s 83ms/step - loss: 0.3137 - accuracy: 0.8785
Test acc: 0.879

Pretrained embeddings don't improve performance here, but they can help a lot when the dataset is small.

You’ll find that on this particular task, pretrained embeddings aren’t very helpful,
because the dataset contains enough samples that it is possible to learn a specialized
enough embedding space from scratch. However, leveraging pretrained embeddings
can be very helpful when you’re working with a smaller dataset.

11.4 The Transformer architecture

The attention mechanism

Starting in 2017, a new model architecture started overtaking recurrent neural networks
across most natural language processing tasks: the Transformer.
Transformers were introduced in the seminal paper “Attention is all you need” by
Vaswani et al.2 The gist of the paper is right there in the title: as it turned out, a simple
mechanism called “neural attention” could be used to build powerful sequence models
that didn’t feature any recurrent layers or convolution layers.

This finding unleashed nothing short of a revolution in natural language processing—
and beyond. Neural attention has fast become one of the most influential ideas
in deep learning. In this section, you’ll get an in-depth explanation of how it works
and why it has proven so effective for sequence data. We’ll then leverage self-attention to create a Transformer encoder, one of the basic components of the Transformer
architecture, and we’ll apply it to the IMDB movie review classification task.

11.4.1 Understanding self-attention

Pay more attention to important features and less attention to the rest.

As you’re going through this book, you may be skimming some parts and attentively
reading others, depending on what your goals or interests are. What if your models
did the same? It’s a simple yet powerful idea: not all input information seen by a
model is equally important to the task at hand, so models should “pay more attention”
to some features and “pay less attention” to other features.

Two forms of attention we've already encountered:

Does that sound familiar? You’ve already encountered a similar concept twice in
this book:

 Max pooling in convnets looks at a pool of features in a spatial region and
selects just one feature to keep. That’s an “all or nothing” form of attention:
keep the most important feature and discard the rest.

 TF-IDF normalization assigns importance scores to tokens based on how much
information different tokens are likely to carry. Important tokens get boosted
while irrelevant tokens get faded out. That’s a continuous form of attention.

All of these approaches compute importance (attention) scores for features; how the scores are computed, and what is done with them, differs from approach to approach.

There are many different forms of attention you could imagine, but they all start by computing
importance scores for a set of features, with higher scores for more relevant features
and lower scores for less relevant ones (see figure 11.5). How these scores should be
computed, and what you should do with them, will vary from approach to approach.

[Figure 11.5: Assigning attention scores to input features]

Crucially, attention can also make features context-aware: the same word can mean different things in different contexts.

Crucially, this kind of attention mechanism can be used for more than just highlighting
or erasing certain features. It can be used to make features context-aware. You’ve just
learned about word embeddings—vector spaces that capture the “shape” of the semantic
relationships between different words. In an embedding space, a single word has a fixed
position—a fixed set of relationships with every other word in the space. But that’s not
quite how language works: the meaning of a word is usually context-specific. When you
mark the date, you’re not talking about the same “date” as when you go on a date, nor is
it the kind of date you’d buy at the market. When you say, “I’ll see you soon,” the meaning
of the word “see” is subtly different from the “see” in “I’ll see this project to its end” or
“I see what you mean.” And, of course, the meaning of pronouns like “he,” “it,” “in,” etc.,
is entirely sentence-specific and can even change multiple times within a single sentence.

The self-attention mechanism produces context-aware token representations.

Clearly, a smart embedding space would provide a different vector representation
for a word depending on the other words surrounding it. That’s where self-attention
comes in. The purpose of self-attention is to modulate the representation of a token
by using the representations of related tokens in the sequence. This produces contextaware
token representations. Consider an example sentence: “The train left the station
on time.” Now, consider one word in the sentence: station. What kind of station
are we talking about? Could it be a radio station? Maybe the International Space Station?
Let’s figure it out algorithmically via self-attention (see figure 11.6).

[Figure 11.6: Self-attention producing a context-aware representation of “station”]

Step 1: compute relevancy scores.

Step 1 is to compute relevancy scores between the vector for “station” and every other
word in the sentence. These are our “attention scores.” We’re simply going to use the
dot product between two word vectors as a measure of the strength of their relationship.
It’s a very computationally efficient distance function, and it was already the standard
way to relate two word embeddings to each other long before Transformers. In
practice, these scores will also go through a scaling function and a softmax, but for
now, that’s just an implementation detail.

Step 2: weight every word vector in the sentence by its relevancy score and sum them; the result is the new, context-aware representation of the target word.

Step 2 is to compute the sum of all word vectors in the sentence, weighted by our
relevancy scores. Words closely related to “station” will contribute more to the sum
(including the word “station” itself), while irrelevant words will contribute almost
nothing. The resulting vector is our new representation for “station”: a representation
that incorporates the surrounding context. In particular, it includes part of the “train”
vector, clarifying that it is, in fact, a “train station.”

Repeat this process for every word in the sentence.

You’d repeat this process for every word in the sentence, producing a new
sequence of vectors encoding the sentence. Let’s see it in NumPy-like pseudocode:

The process, written out by hand in pseudocode:
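
The screenshot of the book's NumPy-like pseudocode didn't survive the copy, so here is a rough sketch of the idea in plain NumPy (the softmax helper and the variable names are mine):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def self_attention(input_sequence):
    # input_sequence: array of shape (sequence_length, embedding_dim)
    output = np.zeros_like(input_sequence)
    for i, pivot_vector in enumerate(input_sequence):
        # Step 1: attention scores between this token and every token in the sequence.
        scores = np.array([np.dot(pivot_vector, vector) for vector in input_sequence])
        scores = softmax(scores / np.sqrt(input_sequence.shape[-1]))
        # Step 2: new representation = sum of all token vectors, weighted by their scores.
        output[i] = np.sum(scores[:, None] * input_sequence, axis=0)
    return output

tokens = np.random.rand(6, 4)          # 6 tokens with 4-dimensional embeddings
print(self_attention(tokens).shape)    # (6, 4): one context-aware vector per token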

Of course, you can just use Keras's built-in layer for this: MultiHeadAttention.

Of course, in practice you’d use a vectorized implementation. Keras has a built-in
layer to handle it: the MultiHeadAttention layer. Here’s how you would use it:

from tensorflow.keras.layers import MultiHeadAttention

num_heads = 4
embed_dim = 256
mha_layer = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
outputs = mha_layer(inputs, inputs, inputs)  # inputs: a (batch, sequence, embed_dim) tensor

Reading these few lines of code, you may wonder:

Reading this, you’re probably wondering

 Why are we passing the inputs to the layer three times? That seems redundant.

 What are these “multiple heads” we’re referring to? That sounds intimidating—
do they also grow back if you cut them?

Both of these questions have simple answers. Let’s take a look.

GENERALIZED SELF-ATTENTION: THE QUERY-KEY-VALUE MODEL

The Transformer architecture was originally developed for machine translation.

So far, we have only considered one input sequence. However, the Transformer architecture
was originally developed for machine translation, where you have to deal with
two input sequences: the source sequence you’re currently translating (such as “How’s
the weather today?”), and the target sequence you’re converting it to (such as “¿Qué
tiempo hace hoy?”). A Transformer is a sequence-to-sequence model: it was designed to
convert one sequence into another. You’ll learn about sequence-to-sequence models
in depth later in this chapter.

Now let’s take a step back. The self-attention mechanism as we’ve introduced it
performs the following, schematically:

outputs = sum(inputs * pairwise_scores(inputs, inputs))

The query-key-value model

This means “for each token in inputs (A), compute how much the token is related to
every token in inputs (B), and use these scores to weight a sum of tokens from
inputs (C).” Crucially, there’s nothing that requires A, B, and C to refer to the same
input sequence. In the general case, you could be doing this with three different
sequences. We’ll call them “query,” “keys,” and “values.” The operation becomes “for
each element in the query, compute how much the element is related to every key,
and use these scores to weight a sum of values”:

outputs = sum(values * pairwise_scores(query, keys))

This terminology comes from search engines and recommender systems: retrieve the top N most relevant items.

This terminology comes from search engines and recommender systems (see figure
11.7). Imagine that you’re typing up a query to retrieve a photo from your collection—
“dogs on the beach.” Internally, each of your pictures in the database is described by a
set of keywords—“cat,” “dog,” “party,” etc. We’ll call those “keys.” The search engine will
start by comparing your query to the keys in the database. “Dog” yields a match of 1, and
“cat” yields a match of 0. It will then rank those keys by strength of match—relevance—
and it will return the pictures associated with the top N matches, in order of relevance.

[Figure 11.7: Retrieving images from a database: the query is compared to a set of keys, and the match scores are used to rank the values (images)]

How Transformer-style attention works: match the query against the keys, then return a weighted sum of the values.

Conceptually, this is what Transformer-style attention is doing. You’ve got a reference
sequence that describes something you’re looking for: the query. You’ve got a
body of knowledge that you’re trying to extract information from: the values. Each
value is assigned a key that describes the value in a format that can be readily compared
to a query. You simply match the query to the keys. Then you return a weighted
sum of values.

In practice, keys and values are often the same sequence, and in sequence classification, the query, keys, and values are all the same.

In practice, the keys and the values are often the same sequence. In machine translation,
for instance, the query would be the target sequence, and the source sequence
would play the roles of both keys and values: for each element of the target (like “tiempo”), you want to go back to the source (“How’s the weather today?”) and identify
the different bits that are related to it (“tiempo” and “weather” should have a
strong match). And naturally, if you’re just doing sequence classification, then query,
keys, and values are all the same: you’re comparing a sequence to itself, to enrich each
token with context from the whole sequence.
That explains why we needed to pass inputs three times to our MultiHeadAttention
layer. But why “multi-head” attention?

11.4.2 Multi-head attention

Where each "head" comes from: the "multi-head" moniker means the output space of the self-attention layer is factored into a set of independent subspaces learned separately. The initial query, key, and value are sent through three independent sets of dense projections, yielding three separate vectors; each vector is processed via neural attention, and the outputs are concatenated back into a single output sequence.

“Multi-head attention” is an extra tweak to the self-attention mechanism, introduced
in “Attention is all you need.” The “multi-head” moniker refers to the fact that the
output space of the self-attention layer gets factored into a set of independent subspaces,
learned separately: the initial query, key, and value are sent through three
independent sets of dense projections, resulting in three separate vectors. Each vector
is processed via neural attention, and the three outputs are concatenated back
together into a single output sequence. Each such subspace is called a “head.” The full
picture is shown in figure 11.8.

[Figure 11.8: The MultiHeadAttention layer]
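
A small shape check (my own sketch, not a book listing): key_dim here is the per-head projection size, and embed_dim // num_heads is one common choice (the book's earlier snippet simply passes key_dim=embed_dim). Whatever the head configuration, self-attention preserves the input shape:

import tensorflow as tf
from tensorflow.keras import layers

num_heads, embed_dim = 4, 256
mha = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim // num_heads)
x = tf.random.normal((2, 10, embed_dim))   # (batch, sequence_length, embed_dim)
outputs = mha(query=x, value=x, key=x)     # self-attention: query, key, and value are the same
print(outputs.shape)                       # (2, 10, 256): same shape as the input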

The role of the learnable dense projections:

Having independent heads lets a single layer learn several different groups of features per token, with the groups largely independent of one another.

The presence of the learnable dense projections enables the layer to actually learn
something, as opposed to being a purely stateless transformation that would require
additional layers before or after it to be useful. In addition, having independent heads
helps the layer learn different groups of features for each token, where features within
one group are correlated with each other but are mostly independent from features
in a different group.

The idea is similar to depthwise separable convolutions:

This is similar in principle to what makes depthwise separable convolutions work: in a
depthwise separable convolution, the output space of the convolution is factored into
many subspaces (one per input channel) that get learned independently. The “Attention
is all you need” paper was written at a time when the idea of factoring feature
spaces into independent subspaces had been shown to provide great benefits for computer
vision models—both in the case of depthwise separable convolutions, and in the
case of a closely related approach, grouped convolutions. Multi-head attention is simply
the application of the same idea to self-attention.

11.4.3 The Transformer encoder

Add dense projections to the output of the attention mechanism.

If adding extra dense projections is so useful, why don’t we also apply one or two to
the output of the attention mechanism? Actually, that’s a great idea—let’s do that.
And our model is starting to do a lot, so we might want to add residual connections to
make sure we don’t destroy any valuable information along the way—you learned in
chapter 9 that they’re a must for any sufficiently deep architecture. And there’s
another thing you learned in chapter 9: normalization layers are supposed to help
gradients flow better during backpropagation. Let’s add those too.

Transformer encoder: factor the outputs into multiple independent subspaces, add residual connections and normalization layers, and you have the Transformer encoder.

That’s roughly the thought process that I imagine unfolded in the minds of the
inventors of the Transformer architecture at the time. Factoring outputs into multiple
independent spaces, adding residual connections, adding normalization layers—all of
these are standard architecture patterns that one would be wise to leverage in any
complex model. Together, these bells and whistles form the Transformer encoder—one
of two critical parts that make up the Transformer architecture (see figure 11.9).

[Figure 11.9: The TransformerEncoder]

The original Transformer consists of an encoder and a decoder and was built for machine translation.

The original Transformer architecture consists of two parts: a Transformer encoder that
processes the source sequence, and a Transformer decoder that uses the source sequence
to generate a translated version. You’ll learn about the decoder part in a minute.

Crucially, the encoder part can be used for text classification—it’s a very generic
module that ingests a sequence and learns to turn it into a more useful representation.
Let’s implement a Transformer encoder and try it on the movie review sentiment
classification task.

Listing 11.21 Transformer encoder implemented as a subclassed Layer

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim # size of the input token vectors
        self.dense_dim = dense_dim # size of the inner dense layer
        self.num_heads = num_heads # number of attention heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :] # The mask generated by the Embedding layer is 2D, but the attention layer expects it to have at least 3 dimensions, so we expand its rank.
        attention_output = self.attention(
            inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

A note on saving custom layers

Saving custom layers

When you write custom layers, make sure to implement the get_config method: this
enables the layer to be reinstantiated from its config dict, which is useful during
model saving and loading. The method should return a Python dict that contains the
values of the constructor arguments used to create the layer.
All Keras layers can be serialized and deserialized as follows:

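A minimal sketch of that round trip (my own example, using a Dense layer as a stand-in for any layer; note that the config holds constructor arguments, not weight values):

from tensorflow.keras import layers

layer = layers.Dense(8, activation="relu")    # any layer instance works here
config = layer.get_config()                   # a plain Python dict of constructor arguments
new_layer = layers.Dense.from_config(config)  # recreates an identical, freshly initialized layer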

When saving a model that contains custom layers, the savefile will contain these config
dicts. When loading the model from the file, you should provide the custom layer
classes to the loading process, so that it can make sense of the config objects:

model = keras.models.load_model(
	filename, custom_objects={"PositionalEmbedding": PositionalEmbedding})

Normalizing with the LayerNormalization layer

You’ll note that the normalization layers we’re using here aren’t BatchNormalization
layers like those we’ve used before in image models. That’s because BatchNormalization
doesn’t work well for sequence data. Instead, we’re using the LayerNormalization layer,
which normalizes each sequence independently from other sequences in the batch.
Like this, in NumPy-like pseudocode:

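The following is my own rendering of that pseudocode (with a small epsilon added for numerical stability):

import numpy as np

def layer_normalization(batch_of_sequences):  # input shape: (batch_size, sequence_length, embedding_dim)
    mean = np.mean(batch_of_sequences, keepdims=True, axis=-1)      # statistics computed per token vector
    variance = np.var(batch_of_sequences, keepdims=True, axis=-1)
    return (batch_of_sequences - mean) / np.sqrt(variance + 1e-6)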

The difference between BatchNormalization and LayerNormalization

While BatchNormalization collects information from many samples to obtain accurate
statistics for the feature means and variances, LayerNormalization pools data
within each sequence separately, which is more appropriate for sequence data.
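
For contrast, a BatchNormalization-style computation (again my own rough sketch) pools its statistics across all samples in the batch as well:

import numpy as np

def batch_normalization(batch_of_images):  # input shape: (batch_size, height, width, channels)
    mean = np.mean(batch_of_images, keepdims=True, axis=(0, 1, 2))  # statistics pooled across the batch and spatial axes
    variance = np.var(batch_of_images, keepdims=True, axis=(0, 1, 2))
    return (batch_of_images - mean) / np.sqrt(variance + 1e-6)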

Text classification with the Transformer encoder

Now that we’ve implemented our TransformerEncoder, we can use it to assemble a
text-classification model similar to the GRU-based one you’ve seen previously.

Listing 11.22 Using the Transformer encoder for text classification

vocab_size = 20000
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)  # Since TransformerEncoder returns full sequences, we need to reduce each sequence to a single vector for classification via a global pooling layer.

x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

Output: the model's architecture

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 256)         5120000   
                                                                 
 transformer_encoder (Transf  (None, None, 256)        543776    
 ormerEncoder)                                                   
                                                                 
 global_max_pooling1d (Globa  (None, 256)              0         
 lMaxPooling1D)                                                  
                                                                 
 dropout (Dropout)           (None, 256)               0         
                                                                 
 dense_2 (Dense)             (None, 1)                 257       
                                                                 
=================================================================
Total params: 5,664,033
Trainable params: 5,664,033
Non-trainable params: 0
_________________________________________________________________

Next, train and evaluate the TransformerEncoder-based model.

Listing 11.23 Training and evaluating the Transformer encoder based model

callbacks = [
    keras.callbacks.ModelCheckpoint("transformer_encoder.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=callbacks)
model = keras.models.load_model(
    "transformer_encoder.keras",
    custom_objects={"TransformerEncoder": TransformerEncoder}) 
	# Provide the custom TransformerEncoder class to the model-loading process.
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Training result: 88.2% test accuracy

Epoch 1/20
625/625 [==============================] - 36s 48ms/step - loss: 0.4777 - accuracy: 0.7785 - val_loss: 0.3435 - val_accuracy: 0.8486
Epoch 2/20
625/625 [==============================] - 29s 47ms/step - loss: 0.3007 - accuracy: 0.8747 - val_loss: 0.2886 - val_accuracy: 0.8742
Epoch 3/20
625/625 [==============================] - 29s 47ms/step - loss: 0.2392 - accuracy: 0.9050 - val_loss: 0.2964 - val_accuracy: 0.8736
Epoch 4/20
625/625 [==============================] - 29s 47ms/step - loss: 0.1981 - accuracy: 0.9212 - val_loss: 0.2898 - val_accuracy: 0.8856
Epoch 5/20
625/625 [==============================] - 32s 51ms/step - loss: 0.1576 - accuracy: 0.9412 - val_loss: 0.3181 - val_accuracy: 0.8864
Epoch 6/20
625/625 [==============================] - 29s 47ms/step - loss: 0.1284 - accuracy: 0.9522 - val_loss: 0.3281 - val_accuracy: 0.8892
Epoch 7/20
625/625 [==============================] - 29s 47ms/step - loss: 0.1099 - accuracy: 0.9600 - val_loss: 0.4795 - val_accuracy: 0.8672
Epoch 8/20
625/625 [==============================] - 29s 47ms/step - loss: 0.0931 - accuracy: 0.9651 - val_loss: 0.4229 - val_accuracy: 0.8722
Epoch 9/20
625/625 [==============================] - 29s 47ms/step - loss: 0.0754 - accuracy: 0.9733 - val_loss: 0.5293 - val_accuracy: 0.8702
Epoch 10/20
625/625 [==============================] - 30s 47ms/step - loss: 0.0648 - accuracy: 0.9761 - val_loss: 0.6447 - val_accuracy: 0.8582
Epoch 11/20
625/625 [==============================] - 29s 47ms/step - loss: 0.0547 - accuracy: 0.9798 - val_loss: 0.5479 - val_accuracy: 0.8734
Epoch 12/20
625/625 [==============================] - 29s 47ms/step - loss: 0.0419 - accuracy: 0.9860 - val_loss: 0.6914 - val_accuracy: 0.8692
Epoch 13/20
625/625 [==============================] - 29s 47ms/step - loss: 0.0356 - accuracy: 0.9878 - val_loss: 0.7549 - val_accuracy: 0.8584
Epoch 14/20
625/625 [==============================] - 29s 47ms/step - loss: 0.0266 - accuracy: 0.9912 - val_loss: 0.7526 - val_accuracy: 0.8630
Epoch 15/20
625/625 [==============================] - 29s 47ms/step - loss: 0.0203 - accuracy: 0.9934 - val_loss: 1.2271 - val_accuracy: 0.8468
Epoch 16/20
625/625 [==============================] - 29s 47ms/step - loss: 0.0202 - accuracy: 0.9933 - val_loss: 0.8784 - val_accuracy: 0.8594
Epoch 17/20
625/625 [==============================] - 29s 47ms/step - loss: 0.0151 - accuracy: 0.9949 - val_loss: 1.0260 - val_accuracy: 0.8642
Epoch 18/20
625/625 [==============================] - 29s 47ms/step - loss: 0.0160 - accuracy: 0.9955 - val_loss: 1.1992 - val_accuracy: 0.8578
Epoch 19/20
625/625 [==============================] - 29s 47ms/step - loss: 0.0113 - accuracy: 0.9966 - val_loss: 1.0857 - val_accuracy: 0.8674
Epoch 20/20
625/625 [==============================] - 29s 47ms/step - loss: 0.0126 - accuracy: 0.9969 - val_loss: 1.1278 - val_accuracy: 0.8652
782/782 [==============================] - 14s 17ms/step - loss: 0.2830 - accuracy: 0.8817
Test acc: 0.882

However, the model above doesn't take token order into account—it isn't actually a sequence model.

At this point, you should start to feel a bit uneasy. Something’s off here. Can you tell
what it is?

This section is ostensibly about “sequence models.” I started off by highlighting
the importance of word order. I said that Transformer was a sequence-processing
architecture, originally developed for machine translation. And yet . . . the Transformer
encoder you just saw in action wasn’t a sequence model at all. Did you
notice? It’s composed of dense layers that process sequence tokens independently
from each other, and an attention layer that looks at the tokens as a set. You could
change the order of the tokens in a sequence, and you’d get the exact same pairwise
attention scores and the exact same context-aware representations. If you were to
completely scramble the words in every movie review, the model wouldn’t notice,
and you’d still get the exact same accuracy. Self-attention is a set-processing mechanism,
focused on the relationships between pairs of sequence elements (see figure
11.10)—it’s blind to whether these elements occur at the beginning, at the end, or
in the middle of a sequence. So why do we say that Transformer is a sequence
model? And how could it possibly be good for machine translation if it doesn’t look
at word order?
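
Before getting to the answer, here is a quick check (my own sketch, not from the book) of the order-agnostic claim: shuffling the tokens merely reorders the self-attention outputs without changing their values.

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 5, 8))             # 1 sequence, 5 tokens, 8 features
perm = tf.constant([3, 0, 4, 1, 2])
x_shuffled = tf.gather(x, perm, axis=1)     # same tokens, different order

mha = layers.MultiHeadAttention(num_heads=2, key_dim=4)
out = mha(x, x)                             # first call builds the layer's weights
out_shuffled = mha(x_shuffled, x_shuffled)  # second call reuses the same weights

# Per-token outputs match up to floating-point noise, just reordered along the sequence axis.
print(tf.reduce_max(tf.abs(tf.gather(out, perm, axis=1) - out_shuffled)).numpy())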

(Figure omitted: the characteristics of the different types of NLP models.)

The Transformer is a hybrid approach: technically order-agnostic, but it manually injects order information into the representations it processes.

I hinted at the solution earlier in the chapter: I mentioned in passing that Transformer
was a hybrid approach that is technically order-agnostic, but that manually
injects order information in the representations it processes. This is the missing ingredient!
It’s called positional encoding. Let’s take a look.

USING POSITIONAL ENCODING TO RE-INJECT ORDER INFORMATION

Input word embeddings will have two components: a word vector and a position vector

The idea behind positional encoding is very simple: to give the model access to word-order
information, we’re going to add the word’s position in the sentence to each word
embedding.

Our input word embeddings will have two components: the usual word vector, which represents the word independently of any specific context, and a position vector, which represents the position of the word in the current sentence. Hopefully,
the model will then figure out how to best leverage this additional information.

A less-than-ideal option: concatenate the word's position to its embedding vector

The reason: positions can be very large integers, and neural networks don't like very large input values or discrete input distributions.

The simplest scheme you could come up with would be to concatenate the word’s
position to its embedding vector. You’d add a “position” axis to the vector and fill it
with 0 for the first word in the sequence, 1 for the second, and so on.
That may not be ideal, however, because the positions can potentially be very large
integers, which will disrupt the range of values in the embedding vector. As you know,
neural networks don’t like very large input values, or discrete input distributions.

In the original Transformer paper, the trick was to add to the word embeddings a vector with values in the range [-1, 1] that vary cyclically with position.

Here, though, we'll use a simpler and more effective approach: positional embedding.

The original “Attention is all you need” paper used an interesting trick to encode
word positions: it added to the word embeddings a vector containing values in the
range [-1, 1] that varied cyclically depending on the position (it used cosine functions
to achieve this). This trick offers a way to uniquely characterize any integer in a
large range via a vector of small values. It’s clever, but it’s not what we’re going to use
in our case. We’ll do something simpler and more effective: we’ll learn position-embedding
vectors the same way we learn to embed word indices. We’ll then proceed
to add our position embeddings to the corresponding word embeddings, to obtain a
position-aware word embedding. This technique is called “positional embedding.”
Let’s implement it.

Listing 11.24 Implementing positional embedding as a subclassed layer

Note that we need to know the sequence length in advance.

class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim) # An Embedding layer for the token indices
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim) # And another one for the token positions
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions # Add both embedding vectors together.

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0) # Like the Embedding layer, this layer should generate a mask so we can ignore padding 0s in the inputs; compute_mask() is called automatically by the framework and the mask gets propagated to the next layer.

    def get_config(self):
        config = super().get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config


You would use this PositionalEmbedding layer just like a regular Embedding layer. Let’s
see it in action!

Next, let's put the PositionalEmbedding layer to use.

PUTTING IT ALL TOGETHER: A TEXT-CLASSIFICATION TRANSFORMER

An aside on the tf.keras.layers.GlobalMaxPool1D layer

https://devdocs.io/tensorflow~2.4/keras/layers/globalmaxpool1d

Usage: signature and arguments

tf.keras.layers.GlobalMaxPool1D(
    data_format='channels_last', **kwargs
)

Purpose: downsamples the input representation by taking the maximum value over the time dimension.

Arguments

data_format: A string, one of "channels_last" (default) or "channels_first"—the ordering of the dimensions in the inputs. "channels_last" corresponds to inputs with shape (batch, steps, features), while "channels_first" corresponds to inputs with shape (batch, features, steps).

Example

x = tf.constant([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])
x = tf.reshape(x, [3, 3, 1])
x
<tf.Tensor: shape=(3, 3, 1), dtype=float32, numpy=
array([[[1.], [2.], [3.]],
       [[4.], [5.], [6.]],
       [[7.], [8.], [9.]]], dtype=float32)>
max_pool_1d = tf.keras.layers.GlobalMaxPooling1D()
max_pool_1d(x)
<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[3.],
       [6.],
       [9.]], dtype=float32)>

Listing 11.25 Combining the Transformer encoder with positional embedding

vocab_size = 20000
sequence_length = 600
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("full_transformer_encoder.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=callbacks)
model = keras.models.load_model(
    "full_transformer_encoder.keras",
    custom_objects={"TransformerEncoder": TransformerEncoder,
                    "PositionalEmbedding": PositionalEmbedding})
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Training result: 86.8% test accuracy

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_3 (InputLayer)        [(None, None)]            0         
                                                                 
 positional_embedding (Posit  (None, None, 256)        5273600   
 ionalEmbedding)                                                 
                                                                 
 transformer_encoder_1 (Tran  (None, None, 256)        543776    
 sformerEncoder)                                                 
                                                                 
 global_max_pooling1d_1 (Glo  (None, 256)              0         
 balMaxPooling1D)                                                
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense_7 (Dense)             (None, 1)                 257       
                                                                 
=================================================================
Total params: 5,817,633
Trainable params: 5,817,633
Non-trainable params: 0
_________________________________________________________________
Epoch 15/20
625/625 [==============================] - 32s 51ms/step - loss: 0.0254 - accuracy: 0.9916 - val_loss: 0.7190 - val_accuracy: 0.8654
Epoch 16/20
625/625 [==============================] - 32s 50ms/step - loss: 0.0195 - accuracy: 0.9934 - val_loss: 0.8543 - val_accuracy: 0.8660
Epoch 17/20
625/625 [==============================] - 32s 50ms/step - loss: 0.0178 - accuracy: 0.9941 - val_loss: 1.7047 - val_accuracy: 0.7986
Epoch 18/20
625/625 [==============================] - 32s 51ms/step - loss: 0.0179 - accuracy: 0.9941 - val_loss: 0.9727 - val_accuracy: 0.8646
Epoch 19/20
625/625 [==============================] - 32s 51ms/step - loss: 0.0139 - accuracy: 0.9954 - val_loss: 1.1850 - val_accuracy: 0.8590
Epoch 20/20
625/625 [==============================] - 33s 52ms/step - loss: 0.0141 - accuracy: 0.9956 - val_loss: 1.0120 - val_accuracy: 0.8658
782/782 [==============================] - 27s 34ms/step - loss: 0.3119 - accuracy: 0.8681
Test acc: 0.868

The book reports 88.3% test accuracy, which differs from my result—probably because my training run was too short.

We get to 88.3% test accuracy, a solid improvement that clearly demonstrates the
value of word order information for text classification. This is our best sequence
model so far! However, it’s still one notch below the bag-of-words approach.

11.4.4 When to use sequence models over bag-of-words models

Bag-of-words models are not obsolete; they still perform very well in many settings.

You may sometimes hear that bag-of-words methods are outdated, and that Transformer-based
sequence models are the way to go, no matter what task or dataset you’re looking
at. This is definitely not the case: a small stack of Dense layers on top of a bag-of-bigrams
remains a perfectly valid and relevant approach in many cases. In fact, among
the various techniques that we’ve tried on the IMDB dataset throughout this chapter,
the best performing so far was the bag-of-bigrams!

A rule of thumb for choosing between bag-of-words and sequence models

So, when should you prefer one approach over the other?
In 2017, my team and I ran a systematic analysis of the performance of various text-classification
techniques across many different types of text datasets, and we discovered
a remarkable and surprising rule of thumb for deciding whether to go with a
bag-of-words model or a sequence model (http://mng.bz/AOzK)—a golden constant
of sorts.

The rule: compute the ratio of the number of training samples to the mean sample length in words; if it is below 1,500, use a bag-of-words model, otherwise use a sequence model.

It turns out that when approaching a new text-classification task, you should pay
close attention to the ratio between the number of samples in your training data and
the mean number of words per sample (see figure 11.11). If that ratio is small—less
than 1,500—then the bag-of-bigrams model will perform better (and as a bonus, it will
be much faster to train and to iterate on too). If that ratio is higher than 1,500, then
you should go with a sequence model. In other words, sequence models work best
when lots of training data is available and when each sample is relatively short.


Concrete example: the IMDB movie review task has 20,000 training samples with an average of 233 words per sample, so the ratio is well below 1,500 and a bag-of-words model is the right choice.

So if you’re classifying 1,000-word long documents, and you have 100,000 of them (a
ratio of 100), you should go with a bigram model. If you’re classifying tweets that are
40 words long on average, and you have 50,000 of them (a ratio of 1,250), you should
also go with a bigram model. But if you increase your dataset size to 500,000 tweets (a
ratio of 12,500), go with a Transformer encoder. What about the IMDB movie review
classification task? We had 20,000 training samples and an average word count of 233,
so our rule of thumb points toward a bigram model, which confirms what we found
in practice.
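
These back-of-the-envelope numbers are easy to reproduce; here's a tiny helper (my own, with a hypothetical recommend() function) that applies the rule of thumb:

def recommend(num_samples, mean_words_per_sample, threshold=1500):
    ratio = num_samples / mean_words_per_sample
    return ratio, "bag-of-bigrams" if ratio < threshold else "sequence model"

print(recommend(100_000, 1_000))  # (100.0, 'bag-of-bigrams')
print(recommend(50_000, 40))      # (1250.0, 'bag-of-bigrams')
print(recommend(500_000, 40))     # (12500.0, 'sequence model')
print(recommend(20_000, 233))     # (~85.8, 'bag-of-bigrams') -- the IMDB case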

Here is the intuition behind it.

This intuitively makes sense: the input of a sequence model represents a richer
and more complex space, and thus it takes more data to map out that space; meanwhile,
a plain set of terms is a space so simple that you can train a logistic regression
on top using just a few hundreds or thousands of samples. In addition, the shorter a
sample is, the less the model can afford to discard any of the information it contains—
in particular, word order becomes more important, and discarding it can create ambiguity.
The sentences “this movie is the bomb” and “this movie was a bomb” have very close unigram representations, which could confuse a bag-of-words model, but a
sequence model could tell which one is negative and which one is positive. With a longer
sample, word statistics would become more reliable and the topic or sentiment
would be more apparent from the word histogram alone.

Keep in mind that this 1,500-ratio heuristic applies specifically to text classification and may not carry over to other NLP tasks.

Now, keep in mind that this heuristic rule was developed specifically for text classification.
It may not necessarily hold for other NLP tasks—when it comes to machine
translation, for instance, Transformer shines especially for very long sequences, compared
to RNNs. Our heuristic is also just a rule of thumb, rather than a scientific law,
so expect it to work most of the time, but not necessarily every time.

11.5 Beyond text classification: Sequence-to-sequence learning

You now possess all of the tools you will need to tackle most natural language processing
tasks. However, you’ve only seen these tools in action on a single problem: text
classification. This is an extremely popular use case, but there’s a lot more to NLP
than classification. In this section, you’ll deepen your expertise by learning about
sequence-to-sequence models.

A sequence-to-sequence model takes a sequence as input (often a sentence or
paragraph) and translates it into a different sequence. This is the task at the heart of
many of the most successful applications of NLP:

 Machine translation—Convert a paragraph in a source language to its equivalent
in a target language.

 Text summarization—Convert a long document to a shorter version that retains
the most important information.

 Question answering—Convert an input question into its answer.

 Chatbots—Convert a dialogue prompt into a reply to this prompt, or convert the
history of a conversation into the next reply in the conversation.

 Text generation—Convert a text prompt into a paragraph that completes the prompt.

 Etc.

What do the encoder and the decoder do?

The general template behind sequence-to-sequence models is described in figure 11.12.
During training,

 An encoder model turns the source sequence into an intermediate representation.

 A decoder is trained to predict the next token i in the target sequence by looking
at both previous tokens (0 to i - 1) and the encoded source sequence.
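
To make the one-step offset concrete, here is a tiny illustration (my own, with a made-up target sentence):

target = ["[start]", "el", "clima", "es", "bueno", "hoy", "[end]"]
decoder_inputs = target[:-1]  # what the decoder reads:  [start] el clima es bueno hoy
labels = target[1:]           # what it must predict:    el clima es bueno hoy [end]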

The inference process

During inference, we don’t have access to the target sequence—we’re trying to predict
it from scratch. We’ll have to generate it one token at a time:

1 We obtain the encoded source sequence from the encoder.

2 The decoder starts by looking at the encoded source sequence as well as an initial
“seed” token (such as the string “[start]”), and uses them to predict the
first real token in the sequence.

3 The predicted sequence so far is fed back into the decoder, which generates the
next token, and so on, until it generates a stop token (such as the string
“[end]”).
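
In code, that inference loop looks roughly like the sketch below, where encode() and decode_step() are hypothetical helpers standing in for the trained encoder and decoder:

# encode() and decode_step() are placeholders for the actual trained models.
def generate(source_sequence, max_length=20):
    encoded = encode(source_sequence)              # 1: encode the source sequence once
    tokens = ["[start]"]                           # 2: start from the seed token
    for _ in range(max_length):
        next_token = decode_step(encoded, tokens)  # predict the next token from the source and the tokens so far
        tokens.append(next_token)
        if next_token == "[end]":                  # 3: stop once the end token is generated
            break
    return tokens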


Everything you’ve learned so far can be repurposed to build this new kind of model.
Let’s dive in.

11.5.1 A machine translation example

We'll use a sequence-to-sequence model for a machine translation task.

We’ll demonstrate sequence-to-sequence modeling on a machine translation task.
Machine translation is precisely what Transformer was developed for! We’ll start with a
recurrent sequence model, and we’ll follow up with the full Transformer architecture.

We’ll be working with an English-to-Spanish translation dataset available at
www.manythings.org/anki/. Let’s download it:

!wget http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
!unzip -q spa-eng.zip

Contents of the dataset

The text file contains one example per line: an English sentence, followed by a tab
character, followed by the corresponding Spanish sentence. Let’s parse this file.

Parsing the dataset: each line holds an English sentence and its Spanish translation.

text_file = "spa-eng/spa.txt"
with open(text_file) as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    english, spanish = line.split("\t")
    spanish = "[start] " + spanish + " [end]" # We prepend "[start]" and append "[end]" to the Spanish sentence, to match the template from figure 11.12.
    text_pairs.append((english, spanish))

A few processed examples: an English sentence followed by the corresponding Spanish one (wrapped in [start] and [end]).

import random
print(random.choice(text_pairs))

output

("Tom said he couldn't clean the pool tomorrow afternoon.", '[start] Tom dijo que no podría limpiar la piscina mañana por la tarde. [end]')

('They know you.', '[start] Os conocen. [end]')

('Do you have brothers and sisters?', '[start] ¿Tenéis hermanos o hermanas? [end]')

Shuffle the dataset and split it into training, validation, and test sets.

import random
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]

Use a separate TextVectorization layer for each language; here, punctuation is stripped.

Next, let’s prepare two separate TextVectorization layers: one for English and one
for Spanish. We’re going to need to customize the way strings are preprocessed:

 We need to preserve the “[start]” and “[end]” tokens that we’ve inserted. By
default, the characters [ and ] would be stripped, but we want to keep them
around so we can tell apart the word “start” and the start token “[start]”.

 Punctuation is different from language to language! In the Spanish Text-
Vectorization layer, if we’re going to strip punctuation characters, we need to
also strip the character ¿.

For a real translation model, however, punctuation should be kept.

Note that for a non-toy translation model, we would treat punctuation characters as separate
tokens rather than stripping them, since we would want to be able to generate correctly
punctuated sentences. In our case, for simplicity, we’ll get rid of all punctuation.

Listing 11.26 Vectorizing the English and Spanish text pairs

import tensorflow as tf
import string
import re
from tensorflow.keras import layers

# Characters to strip from the Spanish text (punctuation plus the inverted question mark)
strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(
        lowercase, f"[{re.escape(strip_chars)}]", "")

#To keep things simple, we’ll only look at the top 15,000 words in each language, and we’ll restrict sentences to 20 words.
vocab_size = 15000
sequence_length = 20

# English layer
source_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)
# Spanish layer
target_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1, # Generate Spanish sentences that have one extra token, since we’ll need to offset the sentence by one step during training.
    standardize=custom_standardization,
)
train_english_texts = [pair[0] for pair in train_pairs]
train_spanish_texts = [pair[1] for pair in train_pairs]
# Learn the vocabulary of each language.
source_vectorization.adapt(train_english_texts)
target_vectorization.adapt(train_spanish_texts)

Turning the data into a tf.data pipeline.

Finally, we can turn our data into a tf.data pipeline. We want it to return a tuple
(inputs, target) where inputs is a dict with two keys, "english" (the English
sentence) and "spanish" (the Spanish sentence), and target is the Spanish
sentence offset by one step ahead.

Listing 11.27 Preparing datasets for the translation task

batch_size = 64

def format_dataset(eng, spa):
    eng = source_vectorization(eng)
    spa = target_vectorization(spa)
    return ({
        "english": eng,
        "spanish": spa[:, :-1], # The input Spanish sentence doesn't include the last token, so that inputs and targets have the same length.
    }, spa[:, 1:]) # The target Spanish sentence is one step ahead; both are still 20 tokens long.

def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset, num_parallel_calls=4)
    return dataset.shuffle(2048).prefetch(16).cache()

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)


Let's take a quick look at the resulting dataset:

for inputs, targets in train_ds.take(1):
    print(f"inputs['english'].shape: {inputs['english'].shape}")
    print(f"inputs['spanish'].shape: {inputs['spanish'].shape}")
    print(f"targets.shape: {targets.shape}")

Output

inputs['english'].shape: (64, 20)
inputs['spanish'].shape: (64, 20)
targets.shape: (64, 20)

The data is now ready—time to build some models. We’ll start with a recurrent
sequence-to-sequence model before moving on to a Transformer.

11.5.2 Sequence-to-sequence learning with RNNs

RNNs long dominated machine translation.

Recurrent neural networks dominated sequence-to-sequence learning from 2015–
2017 before being overtaken by Transformer. They were the basis for many real-world
machine-translation systems—as mentioned in chapter 10, Google Translate
circa 2017 was powered by a stack of seven large LSTM layers. It’s still worth learning
about this approach today, as it provides an easy entry point to understanding
sequence-to-sequence models.

The simplest, naive way to use RNNs to turn a sequence into another sequence is
to keep the output of the RNN at each time step. In Keras, it would look like this:

inputs = keras.Input(shape=(sequence_length,), dtype="int64")
x = layers.Embedding(input_dim=vocab_size, output_dim=128)(inputs)
x = layers.LSTM(32, return_sequences=True)(x)
outputs = layers.Dense(vocab_size, activation="softmax")(x)
model = keras.Model(inputs, outputs)

This approach has two major problems:

First, the target sequence must always be the same length as the source sequence.

Second, because RNNs process their input step by step, only the first N source tokens are used to predict the Nth target token, which doesn't work well for translation.

However, there are two major issues with this approach:

 The target sequence must always be the same length as the source sequence.
In practice, this is rarely the case. Technically, this isn’t critical, as you could
always pad either the source sequence or the target sequence to make their
lengths match.

 Due to the step-by-step nature of RNNs, the model will only be looking at
tokens 0…N in the source sequence in order to predict token N in the target
sequence. This constraint makes this setup unsuitable for most tasks, and
particularly translation. Consider translating “The weather is nice today” to
French—that would be “Il fait beau aujourd’hui.” You’d need to be able
to predict “Il” from just “The,” “Il fait” from just “The weather,” etc., which is
simply impossible.

A human translator reads the entire source sentence before translating it, rather than translating word by word.

If you’re a human translator, you’d start by reading the entire source sentence before
starting to translate it. This is especially important if you’re dealing with languages
that have wildly different word ordering, like English and Japanese. And that’s exactly
what standard sequence-to-sequence models do.

First use an RNN (the encoder) to turn the entire source sequence into a single vector, then use another RNN (the decoder) to generate the target sequence element by element.

In a proper sequence-to-sequence setup (see figure 11.13), you would first use an
RNN (the encoder) to turn the entire source sequence into a single vector (or set of
vectors). This could be the last output of the RNN, or alternatively, its final internal
state vectors. Then you would use this vector (or vectors) as the initial state of another RNN (the decoder), which would look at elements 0…N in the target sequence, and
try to predict step N+1 in the target sequence.


Now let's implement a GRU-based encoder.

Let’s implement this in Keras with GRU-based encoders and decoders. The choice
of GRU rather than LSTM makes things a bit simpler, since GRU only has a single
state vector, whereas LSTM has multiple. Let’s start with the encoder.

Listing 11.28 GRU-based encoder

from tensorflow import keras
from tensorflow.keras import layers

embed_dim = 256
latent_dim = 1024

source = keras.Input(shape=(None,), dtype="int64", name="english") # The English source sentence goes here; naming the input lets us fit() the model with a dict of inputs.
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source) # Don't forget masking: it's critical in this setup.
encoded_source = layers.Bidirectional(
    layers.GRU(latent_dim), merge_mode="sum")(x) # The encoded source sentence is the last output of a bidirectional GRU.


Now for the decoder.

Next, let’s add the decoder—a simple GRU layer that takes as its initial state the
encoded source sentence. On top of it, we add a Dense layer that produces for each
output step a probability distribution over the Spanish vocabulary.

Listing 11.29 GRU-based decoder and the end-to-end model

past_target = keras.Input(shape=(None,), dtype="int64", name="spanish") # The Spanish target sentence so far goes here.
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target) # Don't forget masking.
decoder_gru = layers.GRU(latent_dim, return_sequences=True)
x = decoder_gru(x, initial_state=encoded_source) # The encoded source sentence serves as the initial state of the decoder GRU.
x = layers.Dropout(0.5)(x)
target_next_step = layers.Dense(vocab_size, activation="softmax")(x) # Predict the next token.
seq2seq_rnn = keras.Model([source, past_target], target_next_step) # End-to-end model: maps the source sentence and the target sentence to the target sentence one step in the future.


During training, the decoder takes as input the entire target sequence, but thanks to
the step-by-step nature of RNNs, it only looks at tokens 0…N in the input to predict token N in the output (which corresponds to the next token in the sequence, since
the output is intended to be offset by one step). This means we only use information
from the past to predict the future, as we should; otherwise we’d be cheating, and our
model would not work at inference time.

Listing 11.30 Training our recurrent sequence-to-sequence model

seq2seq_rnn.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
seq2seq_rnn.fit(train_ds, epochs=15, validation_data=val_ds)

Next-token accuracy reaches 64%, but that isn't the metric used in practice; real-world translation systems are evaluated with BLEU scores, which look at entire generated sentences and correlate well with human judgments of translation quality.

We picked accuracy as a crude way to monitor validation-set performance during
training. We get to 64% accuracy: on average, the model predicts the next word in the
Spanish sentence correctly 64% of the time. However, in practice, next-token accuracy
isn’t a great metric for machine translation models, in particular because it makes the
assumption that the correct target tokens from 0 to N are already known when predicting
token N+1. In reality, during inference, you’re generating the target sentence
from scratch, and you can’t rely on previously generated tokens being 100% correct.
If you work on a real-world machine translation system, you will likely use “BLEU
scores” to evaluate your models—a metric that looks at entire generated sequences
and that seems to correlate well with human perception of translation quality.
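
As a rough illustration (my own, not from the book), NLTK ships a sentence-level BLEU implementation that compares a tokenized hypothesis against one or more tokenized reference translations:

from nltk.translate.bleu_score import sentence_bleu

reference = ["il", "fait", "beau", "aujourd", "hui"]           # a tokenized human reference translation
hypothesis = ["il", "fait", "tres", "beau", "aujourd", "hui"]  # the model's tokenized output
# Only unigram and bigram precision are used here, since the toy sentences are too short for 4-grams.
print(sentence_bleu([reference], hypothesis, weights=(0.5, 0.5)))  # closer to 1.0 means closer to the reference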

Let's see how it does on sentences from the test set.

At last, let’s use our model for inference. We’ll pick a few sentences in the test set
and check how our model translates them. We’ll start from the seed token, “[start]”,
and feed it into the decoder model, together with the encoded English source sentence.
We’ll retrieve a next-token prediction, and we’ll re-inject it into the decoder
repeatedly, sampling one new target token at each iteration, until we get to “[end]”
or reach the maximum sentence length.

Listing 11.31 Translating new sentences with our RNN encoder and decoder

import numpy as np
spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization([decoded_sentence])
        next_token_predictions = seq2seq_rnn.predict(
            [tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index = np.argmax(next_token_predictions[0, i, :])
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence

test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(20):
    input_sentence = random.choice(test_eng_texts)
    print("-")
    print(input_sentence)
    print(decode_sequence(input_sentence))

Note that this setup reprocesses the entire source sentence and the entire generated target sentence at every sampling step.

Note that this inference setup, while very simple, is rather inefficient, since we reprocess
the entire source sentence and the entire generated target sentence every time
we sample a new word. In a practical application, you’d factor the encoder and the
decoder as two separate models, and your decoder would only run a single step at
each token-sampling iteration, reusing its previous internal state.

Here are our translation results. Our model works decently well for a toy model,
though it still makes many basic mistakes.

Listing 11.32 Some sample results from the recurrent translation model

-
Get out of my bed.
[start] ya me de la cama [end]
-
You don't fool me, you know.
[start] no me [UNK] no lo [UNK] [end]
-
Everyone likes me.
[start] todos me gusta [end]
-
There are no oranges on the table.
[start] no hay las flores en la mesa [end]
-
And now who's going to clean everything?
[start] y ahora se puede todo a esto [end]
-
Do you like Boston?
[start] te gusta boston [end]
-
I'm studying art history.
[start] estoy estudiar la historia [end]
-
The room didn't look very good after Tom painted it, because he didn't even bother to wash the walls or ceiling first.
[start] la habitación no se ha sido muy bien cuando mary los [UNK] de las no [end]
-
Are you sure it'll work?
[start] estás seguro de que trabajo [end]
-
My father is as busy as ever.
[start] mi padre está tan ocupado como siempre [end]
-
How long are you staying in Japan?
[start] cuánto tiempo vas a japón en japón [end]
-
Choose any one from among these.
[start] [UNK] uno de uno de estos dos [end]
-
Tom suggested to Mary that she offer John a cup of coffee.
[start] tom le pidió que mary a una fiesta es un café de navidad [end]
-
Check around.
[start] [UNK] a tom [end]
-
What's your favorite kind of music?
[start] cuál es tu historia de música [end]
-
Tom is a very kind and generous man.
[start] tom es un hombre y muy bien [end]
-
She wrapped the present in paper.
[start] ella tiene el no en el que hay un no [end]
-
Tom is alone in his office.
[start] tom está solo en su oficina [end]
-
He did not buy it after all.
[start] Él no lo compró al fin y al [end]
-
She went to the hairdresser's.
[start] ella fue a los estados unidos [end]

There are other possible improvements, such as stacking more recurrent layers or using an LSTM instead of a GRU.

That said, the RNN approach to sequence-to-sequence learning has some fundamental limitations.

There are many ways this toy model could be improved: We could use a deep stack of
recurrent layers for both the encoder and the decoder (note that for the decoder, this
makes state management a bit more involved). We could use an LSTM instead of a GRU.
And so on. Beyond such tweaks, however, the RNN approach to sequence-to-sequence
learning has a few fundamental limitations:

First, the source sequence has to be held entirely in the encoder state vector(s), which limits the length and complexity of the sentences that can be translated.

 The source sequence representation has to be held entirely in the encoder state
vector(s), which puts significant limitations on the size and complexity of the
sentences you can translate. It’s a bit as if a human were translating a sentence
entirely from memory, without looking twice at the source sentence while producing
the translation.

Second, RNNs progressively forget earlier information, so they are ill-suited to translating long sentences and long documents.

 RNNs have trouble dealing with very long sequences, since they tend to progressively
forget about the past—by the time you’ve reached the 100th token in
either sequence, little information remains about the start of the sequence. That means RNN-based models can’t hold onto long-term context, which can
be essential for translating long documents.

These limitations are what pushed the machine learning community toward the Transformer architecture for sequence-to-sequence problems.

These limitations are what has led the machine learning community to embrace the
Transformer architecture for sequence-to-sequence problems. Let’s take a look.

11.5.3 Sequence-to-sequence learning with Transformer

Sequence-to-sequence learning is the task where Transformer really shines. Neural
attention enables Transformer models to successfully process sequences that are considerably
longer and more complex than those RNNs can handle.

When we translate, we go back and forth between the source text and the translation in progress to make sure it is right.

As a human translating English to Spanish, you’re not going to read the English
sentence one word at a time, keep its meaning in memory, and then generate the
Spanish sentence one word at a time. That may work for a five-word sentence, but
it’s unlikely to work for an entire paragraph. Instead, you’ll probably want to go
back and forth between the source sentence and your translation in progress, and
pay attention to different words in the source as you’re writing down different parts
of your translation.

That's what the Transformer does: it produces a context-aware representation for each token, and the encoded representation stays in sequence form, as a sequence of context-aware embedding vectors.

That’s exactly what you can achieve with neural attention and Transformers.
You’re already familiar with the Transformer encoder, which uses self-attention to produce
context-aware representations of each token in an input sequence. In a
sequence-to-sequence Transformer, the Transformer encoder would naturally play the
role of the encoder, which reads the source sequence and produces an encoded representation
of it. Unlike our previous RNN encoder, though, the Transformer
encoder keeps the encoded representation in a sequence format: it’s a sequence of
context-aware embedding vectors.

The Transformer decoder's job: use attention to find the parts of the source sentence most closely related to the target token it is about to predict.

The second half of the model is the Transformer decoder. Just like the RNN decoder,
it reads tokens 0…N in the target sequence and tries to predict token N+1. Crucially,
while doing this, it uses neural attention to identify which tokens in the encoded source
sentence are most closely related to the target token it’s currently trying to predict—
perhaps not unlike what a human translator would do. Recall the query-key-value
model: in a Transformer decoder, the target sequence serves as an attention “query”
that is used to pay closer attention to different parts of the source sequence (the
source sequence plays the roles of both keys and values).

THE TRANSFORMER DECODER

The decoder's internals look much like the encoder's, except for an extra attention block.

Figure 11.14 shows the full sequence-to-sequence Transformer. Look at the decoder
internals: you’ll recognize that it looks very similar to the Transformer encoder, except
that an extra attention block is inserted between the self-attention block applied to
the target sequence and the dense layers of the exit block.

Let’s implement it. Like for the TransformerEncoder, we’ll use a Layer subclass.
Before we focus on the call() method, where the action happens, let’s start by defining
the class constructor, containing the layers we’re going to need.


Listing 11.33 The TransformerDecoder

class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

The call() method is almost a straightforward rendering of the connectivity diagram
from figure 11.14. But there’s an additional detail we need to take into
account: causal padding. Causal padding is absolutely critical to successfully training
a sequence-to-sequence Transformer. Unlike an RNN, which looks at its input one
step at a time, and thus will only have access to steps 0…N to generate output step N
(which is token N+1 in the target sequence), the TransformerDecoder is order-agnostic:
it looks at the entire target sequence at once. If it were allowed to use its entire
input, it would simply learn to copy input step N+1 to location N in the output. The
model would thus achieve perfect training accuracy, but of course, when running
inference, it would be completely useless, since input steps beyond N aren’t available.

The fix is simple: mask the upper half of the pairwise attention matrix so the model cannot attend to information from the future—only information from tokens 0…N of the target sequence should be used when predicting token N+1.

The fix is simple: we’ll mask the upper half of the pairwise attention matrix to prevent
the model from paying any attention to information from the future—only information
from tokens 0…N in the target sequence should be used when generating
target token N+1. To do this, we’ll add a get_causal_attention_mask(self, inputs)
method to our TransformerDecoder to retrieve an attention mask that we can pass to
our MultiHeadAttention layers.

Listing 11.34 TransformerDecoder method that generates a causal mask

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32") # A matrix of shape (sequence_length, sequence_length) with 1s on and below the diagonal, 0s above it.
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1),
             tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult) # Replicate it along the batch axis to get shape (batch_size, sequence_length, sequence_length).
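
A quick standalone check (my own) of what this mask looks like for a toy sequence length of 4:

import tensorflow as tf

seq_len = 4
i = tf.range(seq_len)[:, tf.newaxis]
j = tf.range(seq_len)
print(tf.cast(i >= j, dtype="int32").numpy())
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]  -> position N may only attend to positions 0...N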

Now we can write down the full call() method implementing the forward pass of the
decoder.

Listing 11.35 The forward pass of the TransformerDecoder

The mask describes the padding positions in the target sequence.

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs) # Retrieve the causal mask.
        if mask is not None:
            padding_mask = tf.cast(
                mask[:, tf.newaxis, :], dtype="int32") # Prepare the mask that describes the padding locations in the target sequence.
            padding_mask = tf.minimum(padding_mask, causal_mask) # Merge the two masks together.
        else:
            padding_mask = causal_mask # If no padding mask was passed, fall back to the causal mask alone (avoids an undefined variable below).
        attention_output_1 = self.attention_1(
            query=inputs,
            value=inputs,
            key=inputs,
            attention_mask=causal_mask) # Self-attention over the target sequence, restricted by the causal mask.
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(
            query=attention_output_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        ) # Attention over the encoded source sequence, restricted by the combined mask.
        attention_output_2 = self.layernorm_2(
            attention_output_1 + attention_output_2)
        proj_output = self.dense_proj(attention_output_2)
        return self.layernorm_3(attention_output_2 + proj_output)


PUTTING IT ALL TOGETHER: A TRANSFORMER FOR MACHINE TRANSLATION

Because the encoder and decoder are shape-invariant, you could stack several of them; here we use just one of each.

The end-to-end Transformer is the model we’ll be training. It maps the source
sequence and the target sequence to the target sequence one step in the future. It
straightforwardly combines the pieces we’ve built so far: PositionalEmbedding layers,
the TransformerEncoder, and the TransformerDecoder. Note that both the TransformerEncoder
and the TransformerDecoder are shape-invariant, so you could be
stacking many of them to create a more powerful encoder or decoder. In our example,
we’ll stick to a single instance of each.

Listing 11.36 End-to-end Transformer

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

embed_dim = 256
dense_dim = 2048
num_heads = 8

encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="english")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x) # Encode the source sentence.

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="spanish")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs) # Encode the target sentence and combine it with the encoded source sentence.
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x) # Predict a word for each output position.
transformer = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

We’re now ready to train our model—we get to 67% accuracy, a good deal above the
GRU-based model.

Listing 11.37 Training the sequence-to-sequence Transformer

transformer.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
transformer.fit(train_ds, epochs=30, validation_data=val_ds)

Finally, let’s try using our model to translate never-seen-before English sentences from
the test set. The setup is identical to what we used for the sequence-to-sequence RNN
model.

Listing 11.38 Translating new sentences with our Transformer model

import numpy as np
spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization(
            [decoded_sentence])[:, :-1]
        predictions = transformer(
            [tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence

test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(20):
    input_sentence = random.choice(test_eng_texts)
    print("-")
    print(input_sentence)
    print(decode_sequence(input_sentence))

Subjectively, the Transformer seems to perform significantly better than the GRU-based
translation model. It’s still a toy model, but it’s a better toy model.

Listing 11.39 Some sample results from the Transformer translation model

(Figure omitted: sample translations produced by the Transformer model.)

That concludes this chapter on natural language processing—you just went from the
very basics to a fully fledged Transformer that can translate from English to Spanish.
Teaching machines to make sense of language is the latest superpower you can add to
your collection.

Summary

Here are the key takeaways from this chapter.

 There are two kinds of NLP models: bag-of-words models that process sets of words
or N-grams without taking into account their order, and sequence models that process
word order. A bag-of-words model is made of Dense layers, while a sequence
model could be an RNN, a 1D convnet, or a Transformer.

 When it comes to text classification, the ratio between the number of samples
in your training data and the mean number of words per sample can help you
determine whether you should use a bag-of-words model or a sequence model.

 Word embeddings are vector spaces where semantic relationships between words are
modeled as distance relationships between vectors that represent those words.

 Sequence-to-sequence learning is a generic, powerful learning framework that can be
applied to solve many NLP problems, including machine translation. A sequence-to-
sequence model is made of an encoder, which processes a source sequence,
and a decoder, which tries to predict future tokens in the target sequence by looking
at past tokens, with the help of the encoder-processed source sequence.

 Neural attention is a way to create context-aware word representations. It’s the
basis for the Transformer architecture.

 The Transformer architecture, which consists of a TransformerEncoder and a
TransformerDecoder, yields excellent results on sequence-to-sequence tasks.
The first half, the TransformerEncoder, can also be used for text classification
or any sort of single-input NLP task.

English vocabulary notes

rendering: a depiction or presentation of something

heuristic: a rule of thumb (pronounced hyo͞oˈristik)

intimidating: frightening

moniker: a nickname (pronounced ˈmänəkər)

Postscript

CSDN's article layout just isn't as good-looking as GitHub's!
