Learning NLP from the Classics, from Novice to Expert: 1. Word Tokenization

Article link: https://blog.csdn.net/Tongcheng_98/article/details/136407002


Contents

- 1 Word Tokenization
  - 1.1 Top-down/rule-based tokenization
  - 1.2 Byte-pair Encoding: A Bottom-up tokenization algorithm

1 Word Tokenization

Source: JM3 (Jurafsky and Martin, Speech and Language Processing, 3rd ed. draft), Chapter 2.5, pp. 19-23

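Tokenization is the task of segmenting running text into words.

There are two common approaches:

- top-down/rule-based tokenization: split the text according to predefined standards and rules;
- bottom-up tokenization: use statistics over letter sequences to decide where to split.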

1.1 Top-down/rule-based tokenization

The first thing to watch for is clitic contraction. A clitic is defined as follows: “A clitic is a part of a word that can’t stand on its own, and can only occur when it is attached to another word.”

Under the Penn Treebank tokenization standard:

- doesn’t is split into does + n’t, where n’t is the clitic;
- hyphenated words are kept together as a single token;
- all punctuation is separated out into its own tokens.

A concrete example:
![[Pasted image 20240229223529.png]]

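A tokenizer can therefore expand clitic contractions, recovering does from doesn’t. Tokenization is also closely tied to another NLP task, named entity recognition: in JM3 the link arises because a tokenizer may need to treat a multiword expression such as New York as a single token.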

Because tokenization runs before any other text processing, it has to be fast. It is usually built on regular expressions, which are typically compiled into very efficient finite-state automata.

A well-designed top-down tokenizer follows a deterministic algorithm that can resolve ambiguities, such as the different uses of the apostrophe:

- genitive marker: the book’s cover
- quotative: ‘The other class’, she said
- clitic: they’re, doesn’t
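The rule-based approach and the three apostrophe cases above can be sketched with one regular expression. This is a minimal sketch, not the actual Penn Treebank tokenizer: the pattern, the name `TOKEN_RE` and the function `tokenize` are my own, and real rule sets also handle numbers, abbreviations, URLs and much more.

```python
import re

# Alternatives are tried left to right at each position.
TOKEN_RE = re.compile(
    r"""
      \w+(?=n['’]t\b)     # word stem that is followed by the clitic n't
    | n['’]t\b            # the clitic n't itself
    | (?<=\w)['’]\w+      # apostrophe glued to a word: genitive or clitic
    | \w+(?:-\w+)*        # plain word, or hyphenated word kept together
    | [^\w\s]             # any other punctuation mark, one token each
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Return the list of tokens found in text (simplified PTB-style rules)."""
    return TOKEN_RE.findall(text)

print(tokenize("She doesn't like state-of-the-art books, they're dull."))
print(tokenize("the book's cover"))
print(tokenize("'The other class', she said"))
```

The lookbehind `(?<=\w)` is what separates the cases: an apostrophe that follows a word character and is followed by more word characters is treated as a genitive or clitic, while an opening or closing quotation mark falls through to the punctuation rule.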

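Languages differ in how hard they are to tokenize. English is naturally split into words by spaces. Chinese does not use spaces to separate words; instead, each character (hanzi) can be taken as a token. Because a single Chinese character already carries a good deal of meaning, research cited in JM3 suggests that for most Chinese NLP tasks, taking characters as input works better than taking words.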

For languages such as Japanese and Thai, however, a single character is too small a unit to carry meaning, so a word segmentation algorithm is still needed.
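To make the contrast concrete, here is a tiny sketch of whitespace tokenization for English next to character tokenization for Chinese. The example sentence is the one I recall from JM3 ("Yao Ming reaches the finals"); treat it as an illustration only.

```python
english = "Yao Ming reaches the finals"
chinese = "姚明进入总决赛"

# English: spaces already mark the word boundaries.
english_tokens = english.split()

# Chinese: no spaces, so take every character as one token.
chinese_tokens = list(chinese)

print(english_tokens)  # ['Yao', 'Ming', 'reaches', 'the', 'finals']
print(chinese_tokens)  # ['姚', '明', '进', '入', '总', '决', '赛']
```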

1.2 Byte-pair Encoding: A Bottom-up tokenization algorithm

An important requirement for tokenization is the ability to handle unknown words, that is, words that never appeared in the training corpus.

To do this we first introduce the notion of subwords: sets of tokens that include tokens smaller than words. Breaking words into smaller units (subwords) makes it easier to handle rare words and word variation.

A subword can be an arbitrary substring, or it can be a meaning-bearing morpheme such as -est or -er.

A morpheme is the smallest meaning-bearing unit of a language; for example the word unlikeliest has the morphemes un-, likely, and -est.

In modern tokenization schemes most tokens are words, but some are frequently occurring morphemes or other subwords such as -er.

With subwords, any unknown word can be represented as a sequence of known subword units. For example, lower can be built from the two subwords low and -er or, if necessary, from the individual letters l, o, w, e, r.

A tokenization scheme usually has two parts: a token learner and a token segmenter. The learner takes a training corpus and induces a vocabulary, i.e. a set of tokens. The segmenter takes raw text and splits it into tokens from that vocabulary.

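Three names come up most often:

- byte-pair encoding (BPE)
- unigram language modeling
- SentencePiece, a library that implements both of the above, although the name is often used to mean the latter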

The BPE token learner starts with a vocabulary that is just the set of all individual characters. It then looks through the words of the training corpus for the pair of adjacent symbols that occurs most often. A symbol can consist of several characters once merges have taken place.

The most frequent pair is merged into a new symbol, which is added to the vocabulary and replaces every occurrence of the pair in the corpus. This greedy step is repeated until $k$ merges have been performed, where $k$ is a parameter of the BPE algorithm. The final vocabulary is the original set of individual characters plus the $k$ merged symbols.

When several pairs are tied for the highest frequency, the algorithm does not prescribe which one is merged first; the choice is arbitrary.

The core idea of BPE:
iteratively merge the most frequent pair of adjacent symbols.
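The learner described above can be sketched in a few lines of Python. This is a minimal sketch under my own assumptions: the function names `merge_pair` and `learn_bpe` are made up, ties are broken by whichever pair was seen first (the algorithm itself leaves this arbitrary), and the toy word counts are the ones I recall from the JM3 p. 21 example.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_counts, k):
    """Return (vocabulary, merges) after at most k greedy merges."""
    # Each word becomes a tuple of characters plus the end-of-word symbol "_".
    corpus = {tuple(word) + ("_",): n for word, n in word_counts.items()}
    vocab = sorted({s for symbols in corpus for s in symbols})
    merges = []
    for step in range(k):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, n in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += n
        if not pair_counts:
            break  # every word is already a single symbol
        # Assumed tie-break: max() keeps the first pair that reached the top count.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab.append(best[0] + best[1])
        corpus = {merge_pair(symbols, best): n for symbols, n in corpus.items()}
    return vocab, merges

# Toy corpus (assumed to match the JM3 example).
counts = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
vocab, merges = learn_bpe(counts, k=8)
print(merges)
print(vocab)
```

With `k=8` the vocabulary holds the 11 starting symbols plus 8 merged ones. Pairs are only counted inside words, never across word boundaries, which is why the corpus is stored as word types with counts.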

Some advantages:

- data-informed tokenization: the vocabulary is derived from the corpus alone, so BPE works out the units from the data itself;
- language independent: it works for different languages, with no need to design rules for each one;
- better handling of unknown words: in the worst case, an unknown word is split into individual characters.
In the final vocabulary, most entries are full words and only a small share are subwords, so in practice only rare or unknown words need to be represented by their parts.

For a worked example see JM3, p. 21. In brief:

![[Pasted image 20240301174028.png]]
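Note the end-of-word symbol _.
In this example the early merges happen at the end of the word. That reflects which pairs are most frequent in this corpus; it is not a rule of the algorithm.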

Given the final vocabulary:

- n e w e r _ is segmented into a single token, newer_;
- the unknown word l o w e r _ is segmented into two tokens, low and er_.
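The segmenter side can be sketched the same way: replay the learned merges on new text, in the order they were learned. The `MERGES` list below is an assumption on my part; it is the eight merges I would expect from the JM3 p. 21 corpus, and the function name `segment` is made up.

```python
# Assumed merge list, in learned order (reconstructed from the JM3 example).
MERGES = [
    ("e", "r"), ("er", "_"), ("n", "e"), ("ne", "w"),
    ("l", "o"), ("lo", "w"), ("new", "er_"), ("low", "_"),
]

def segment(word, merges):
    """Split word into characters plus "_", then apply each merge in order."""
    symbols = list(word) + ["_"]
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(segment("newer", MERGES))  # ['newer_']
print(segment("lower", MERGES))  # ['low', 'er_']
print(segment("test", MERGES))   # ['t', 'e', 's', 't', '_']
```

The last call shows the worst case: none of the learned merges applies, so the word falls back to individual characters.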