BERT tokenization, WordPiece, BPE, jieba, pkuseg

DecafTea

Categories: NLP # Tokenization

Article link: https://blog.csdn.net/DecafTea/article/details/114526213

Note: BERT tokenizes Chinese at the character level, whereas jieba and other segmentation tools work at the word level.
Main reference: https://medium.com/@makcedward/how-subword-helps-on-your-nlp-model-83dd1b836f46

BERT tokenization

Excerpted from:
https://blog.csdn.net/u010099080/article/details/102587954

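In the BERT source code, tokenization.py is the preprocessing program that performs tokenization. It has two main tokenizers, BasicTokenizer and WordpieceTokenizer, plus FullTokenizer, which combines the two: it first runs BasicTokenizer to get a coarse list of tokens, then runs WordpieceTokenizer on each of those tokens to produce the final result.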

For Chinese, in one sentence: BERT splits by character, i.e. every Chinese character becomes its own token.

BasicTokenizer
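BasicTokenizer (BT below) is a preliminary tokenizer. For an input string, the flow is roughly: convert to unicode -> remove odd characters -> handle Chinese -> whitespace tokenization -> strip extra characters and split on punctuation -> whitespace tokenization again.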
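The flow above can be sketched in a few lines of Python. This is a simplified illustration, not the code from tokenization.py: the names `is_cjk` and `basic_tokenize` are made up for this example, the CJK ranges are only a subset, and punctuation is detected purely by Unicode category (an assumption; the real tokenizer may treat more ASCII symbols as punctuation).

```python
import unicodedata


def is_cjk(ch):
    """Rough CJK check (hypothetical helper; covers only a few common blocks)."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF
            or 0xF900 <= cp <= 0xFAFF or 0x20000 <= cp <= 0x2A6DF)


def basic_tokenize(text, lower=True):
    # Step 1: remove odd characters and normalize whitespace to plain spaces.
    cleaned = []
    for ch in text:
        if ord(ch) in (0, 0xFFFD):
            continue
        cat = unicodedata.category(ch)
        if ch in "\t\n\r" or cat == "Zs":
            cleaned.append(" ")
        elif cat.startswith("C"):  # other control/format characters
            continue
        else:
            cleaned.append(ch)

    # Step 2: handle Chinese by padding every CJK character with spaces,
    # so the whitespace split below isolates each character.
    padded = "".join(f" {ch} " if is_cjk(ch) else ch for ch in cleaned)

    tokens = []
    for word in padded.split():  # Step 3: whitespace tokenization
        if lower:
            word = word.lower()
            # Step 4a: strip accents (combining marks) after NFD decomposition.
            word = "".join(c for c in unicodedata.normalize("NFD", word)
                           if unicodedata.category(c) != "Mn")
        # Step 4b: split on punctuation, keeping each mark as its own token.
        current = ""
        for ch in word:
            if unicodedata.category(ch).startswith("P"):
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens


print(basic_tokenize("我爱NLP, Héllo!"))
# ['我', '爱', 'nlp', ',', 'hello', '!']
```

Note how the two Chinese characters come out as separate tokens, which is the character-level behavior described above.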

WordpieceTokenizer
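Working from left to right, it splits a word into multiple subwords, making each subword as long as possible. This is the greedy longest-match-first algorithm.
Each token produced by BasicTokenizer is then passed through WordpieceTokenizer, yielding word pieces that exist in the vocabulary. Any piece that does not start the word is written in the form "##piece", e.g. ##able. For the input "unaffable", output_tokens = ["un", "##aff", "##able"].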
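A minimal sketch of greedy longest-match-first matching follows. The function name `wordpiece_tokenize`, the toy vocabulary, and the `max_chars` cutoff are assumptions made for illustration; the logic mirrors the description above rather than reproducing the BERT source line by line.

```python
def wordpiece_tokenize(token, vocab, unk_token="[UNK]", max_chars=200):
    """Split one token into word pieces, always taking the longest match."""
    if len(token) > max_chars:
        return [unk_token]

    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece  # non-initial pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No prefix of the remainder is known: the whole token is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical), chosen to reproduce the example above.
vocab = {"un", "u", "##a", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))
# ['un', '##aff', '##able']
```

Because the match is greedy, "un" wins over "u" and "##aff" wins over "##a" even though the shorter pieces are also in the vocabulary.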

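WordPiece is a variant of BPE (byte pair encoding). The difference is that during training WordPiece forms a new subword based on likelihood rather than taking the next most frequent pair. Training continues until the subword vocabulary reaches the specified size or the likelihood increase falls below a certain threshold.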
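To make the difference concrete, here is a small sketch of one merge-selection step under each criterion. The likelihood criterion is approximated by the score count(ab) / (count(a) * count(b)), which is an assumption for illustration (a commonly described simplification), not a formula given in this article; the function names and the toy corpus are hypothetical as well.

```python
from collections import Counter


def pair_stats(words):
    """words maps a tuple of symbols to that word's corpus frequency."""
    unit_counts, pair_counts = Counter(), Counter()
    for symbols, freq in words.items():
        for s in symbols:
            unit_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return unit_counts, pair_counts


def best_pair_bpe(words):
    # BPE: merge the most frequent adjacent pair.
    _, pairs = pair_stats(words)
    return max(pairs, key=pairs.get)


def best_pair_likelihood(words):
    # WordPiece-style (assumed score): prefer pairs whose parts rarely
    # occur apart, i.e. count(ab) / (count(a) * count(b)) is largest.
    units, pairs = pair_stats(words)
    return max(pairs, key=lambda p: pairs[p] / (units[p[0]] * units[p[1]]))


# Toy corpus: "in" x5, "it" x4, "qu" x2, already split into characters.
corpus = {("i", "n"): 5, ("i", "t"): 4, ("q", "u"): 2}
print(best_pair_bpe(corpus))         # ('i', 'n')
print(best_pair_likelihood(corpus))  # ('q', 'u')
```

Here BPE picks ("i", "n") because it is the most frequent pair, while the likelihood-style score picks ("q", "u") because "q" and "u" never appear without each other in this corpus.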

WordPiece
WordPiece is another word segmentation algorithm, similar to BPE. Schuster and Nakajima introduced WordPiece in 2012 while working on a Japanese and Korean voice search problem. The main difference from BPE is that WordPiece forms a new subword based on likelihood rather than on the next most frequent pair.

Algorithm
1. Prepare a large enough training data set (i.e. a corpus).
2. Define the desired subword vocabulary size.
3. Split each word into a sequence of characters.
4. Build a language model based on the data from step 3.
5. Choose, out of all the possible new word units, the one that increases the likelihood on the training data the most when added to the model.
6. Repeat step 5 until reaching the subword vocabulary size defined in step 2 or the likelihood increase falls below a certa