When using a trained model to make predictions on text, the text first has to be tokenized, the tokens then converted into a sequence of indices, and finally fed into the model.
tokenization.py
load_vocab(vocab_file)
import collections

def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = collections.OrderedDict()
    index = 0
    with open(vocab_file, "r", encoding="utf-8") as reader:
        while True:
            token = reader.readline()
            if not token:
                break
            # One token per line; its (0-based) line number becomes its index
            token = token.strip()
            vocab[token] = index
            index += 1
    return vocab
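A minimal usage sketch (the file name vocab.txt is only an illustration; any vocabulary file with one token per line works):

vocab = load_vocab("vocab.txt")     # e.g. OrderedDict([('[PAD]', 0), ('[UNK]', 1), ...])
print(len(vocab))                   # vocabulary size
print(vocab.get("[CLS]"))           # index of the [CLS] token, if present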
A plain dict() stores entries by hash and only records the key-value pairs themselves, so the iteration order of the pairs is not guaranteed (note that since CPython 3.7 the built-in dict does preserve insertion order, but this code does not rely on that).
OrderedDict() additionally records the order in which key-value pairs were inserted and iterates over them in that order, so the dictionary stays ordered.
Understanding collections.OrderedDict()
An OrderedDict is a dictionary that remembers the order in which keys were first inserted. If a new entry overwrites an existing key, the original insertion position is kept unchanged.
Reference: https://zhuanlan.zhihu.com/p/110407087
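A minimal standard-library sketch of that behavior (not part of tokenization.py):

import collections

d = collections.OrderedDict()
d["a"] = 1
d["b"] = 2
d["a"] = 3                 # overwriting "a" keeps its original position
print(list(d.items()))     # [('a', 3), ('b', 2)]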
whitespace_tokenize(text)
Runs basic whitespace cleaning and splitting on a piece of text.
def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    # Strip leading and trailing whitespace
    text = text.strip()
    if not text:
        return []
    # str.split() with no argument splits on any run of whitespace
    tokens = text.split()
    return tokens
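For example (an illustrative call, not part of the original file):

print(whitespace_tokenize("  hello   world \n"))   # ['hello', 'world']
print(whitespace_tokenize("   "))                  # []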
class BertTokenizer(object)
class BertTokenizer(object):
    """Runs end-to-end tokenization: punctuation splitting + wordpiece"""

    def __init__(self, vocab_file, do_lower_case=True, max_len=None, do_basic_tokenize=True,
                 never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")):
        if not os.path.isfile(vocab_file):
            raise ValueError(
                "Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained "
                "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file))
        # Load the vocabulary file vocab.txt and get back an ordered dict (token -> index)
        self.vocab = load_vocab(vocab_file)
        # Reverse mapping: look up the token that corresponds to a given index
        self.ids_to_tokens = collections.OrderedDict(
            [(ids, tok) for tok, ids in self.vocab.items()])
        self.do_basic_tokenize = do_basic_tokenize
        if do_basic_tokenize:
            self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case,
                                                  never_split=never_split)
        # Helper object that performs the actual wordpiece tokenization
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
        self.max_len = max_len if max_len is not None else int(1e12)
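A usage sketch, assuming a local vocab.txt (the path is illustrative; as the error message above notes, the tokenizer can also be built with BertTokenizer.from_pretrained for a Google pretrained model):

tokenizer = BertTokenizer(vocab_file="vocab.txt", do_lower_case=True)
print(len(tokenizer.vocab))          # vocabulary size
print(tokenizer.ids_to_tokens[0])    # token stored at index 0, typically "[PAD]"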
class WordpieceTokenizer(object)
class WordpieceTokenizer(object):
    """Runs WordPiece tokenization."""

    def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word
tokenize(self, text)
Tokenizes a piece of text into its word pieces.
It uses a greedy longest-match-first algorithm: suppose the longest entry in vocab.txt contains n characters. Take a span of characters from the text being processed as the match candidate and look it up in vocab.txt. If vocab.txt contains such an n-character entry, the match succeeds and the candidate is split off as a complete piece; if no such entry exists, the match fails, the last character is dropped from the candidate, and the shortened candidate is looked up again, looping until a match succeeds. That completes one round and splits off one piece; matching then continues in the same way until all pieces have been split off. (A toy walk-through is sketched after the code below.)
Tokenization is performed using the given vocabulary.
    def tokenize(self, text):
        """Tokenizes a piece of text into its word pieces.
        This uses a greedy longest-match-first algorithm to perform tokenization
        using the given vocabulary.
        For example:
          input = "unaffable"
          output = ["un", "##aff", "##able"]
        Args:
          text: A single token or whitespace separated tokens. This should have
            already been passed through `BasicTokenizer`.
        Returns:
          A list of wordpiece tokens.
        """
        output_tokens = []
        # See the whitespace_tokenize(text) section above
        for token in whitespace_tokenize(text):
            chars = list(token)
            if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token)
                continue
            is_bad = False
            start = 0
            sub_tokens = []
            while start < len(chars):
                end = len(chars)
                cur_substr = None
                # Shrink the window from the right until a vocabulary entry matches
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        # Non-initial pieces are looked up with the "##" continuation prefix
                        substr = "##" + substr
                    if substr in self.vocab:
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    # No prefix of the remaining characters is in the vocabulary
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end
            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens
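A toy walk-through of the greedy matching (the vocabulary below is made up purely for illustration):

toy_vocab = {"un": 0, "##aff": 1, "##able": 2, "[UNK]": 3}
wp = WordpieceTokenizer(vocab=toy_vocab, unk_token="[UNK]")
print(wp.tokenize("unaffable"))    # ['un', '##aff', '##able']
print(wp.tokenize("xyz"))          # ['[UNK]'] -- no prefix of "xyz" is in the vocabulary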
Testing