hugging-face Transformer tokenization_bert.py

最新推荐文章于 2024-06-23 11:55:35 发布

桃汽宝

最新推荐文章于 2024-06-23 11:55:35 发布

阅读量3k

点赞数 2

分类专栏： MRC

本文链接：https://blog.csdn.net/weixin_44317740/article/details/109362437

版权

MRC 专栏收录该内容

4 篇文章 0 订阅

订阅专栏

函数

load_vocab

把词汇表加载为一个有序字典

def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    """把词汇表加载为一个有序字典"""
    vocab = collections.OrderedDict()   # 有序字典
    with open(vocab_file, "r", encoding="utf-8") as reader:	# vocab_file是一个txt文件一个单词一行
        tokens = reader.readlines()     # 依次读取每行
    for index, token in enumerate(tokens):      # index：行号，从0开始
        token = token.rstrip("\n")      # rstrip：删除字符串末尾的指定字符
        vocab[token] = index        # token：键；index：值。 
       						 # eg. OrderedDict([('d1={}', 0), ("d1['a']='A'", 1), ("d1['c']='C'", 2)])
    return vocab		# vocab：有序字典

whitespace_tokenize

在一段文本上，删除空格并拆分

def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    """在一段文本上，删除空格并拆分"""
    text = text.strip()     # strip：删除字符串首尾的指定字符
    if not text:		# text没有内容就返回空列表
        return []
    tokens = text.split()       # split()函数按照空格进行分割，并返回分割后的字符串列表。
    return tokens       # tokens：列表

类

BasicTokenizer类(继承自Object)

基本的分词（标点符号拆分、小写）
参数：
do_lower_case :bool, optional, 默认为True 当分词时是否对输入小写。
never_split ：Iterable, 字符串的列表, optional. 分词期间不被切分的tokens的集合，只有当do_basic_tokenize=True的时候才有效。
tokenize_chinese_chars ：bool, optional, 默认为True. 是否对中文字符进行分词。
strip_accents：bool, optional. 用于去除变音符号。

附加符号或称变音符号（accents），是指添加在字母上面的符号，以更改字母的发音或者以区分拼写相似词语。例如汉语拼音字母“ü”上面的两个小点，或“á”、“à”字母上面的标调符。

_tokenize_chinese_chars函数

在任何中日韩字符前后添加空格

    def _tokenize_chinese_chars(self, text):
        """Adds whitespace around any CJK character."""
        # 在任何中日韩字符前后添加空格
        output = []
        for char in text:
            cp = ord(char)      # ord()以一个字符（长度为1的字符串）作为参数，返回对应的 ASCII 数值，或者 Unicode 数值
            if self._is_chinese_char(cp):
                output.append(" ")
                output.append(char)
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)      # join() 方法用于将序列中的元素以指定的字符连接生成一个新的字符串。

ord()以一个字符（长度为1的字符串）作为参数，返回对应的 ASCII 数值，或者 Unicode 数值

join() 方法用于将序列中的元素以指定的字符连接生成一个新的字符串。

_is_chinese_char函数

用于判断是否是中日韩字符

Unicode 是一个符号集，它只规定了符号的二进制代码，却没有规定这个二进制代码应该如何存储。
https://home.unicode.org
中日韩汉字编码表：http://www.chi2ko.com/tool/CJK.html

UTF-8 就是在互联网上使用最广的一种 Unicode 的实现方式。
UTF-8 的编码规则很简单，只有二条：
1）对于单字节的符号，字节的第一位设为0，后面7位为这个符号的 Unicode 码。因此对于英语字母，UTF-8 编码和 ASCII码是相同的。
2）对于n字节的符号（n > 1），第一个字节的前n位都设为1，第n +1位设为0，后面字节的前两位一律设为10。剩下的没有提及的二进制位，全部为这个符号的 Unicode 码。

转载自：http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html

_run_strip_accents函数

在一段文本中去除变音符号

       def _run_strip_accents(self, text):
        """Strips accents from a piece of text."""
        # 在一段文本中去除变音符号
        # 先将文本标准化；进入循环，判断文本中字符的类型（如果是变音符号则跳过当前循环，否则将字符加入到空列表中）；返回拼接后的字符串
        text = unicodedata.normalize("NFD", text)       # unicodedata.normalize：对Unicode文本进行标准化。
        output = []
        for char in text:
            cat = unicodedata.category(char)        # unicodedata.category：返回一个字符在UNICODE里分类的类型
                                # Category一共分为 Letter, Mark, Number, Punctuation, Symbol, Seperator, Other 七大类
            if cat == "Mn":     # Mark Nonspacing 应该是变音符号
                continue
            output.append(char)
        return "".join(output)

unicodedata.normalize(form, unistr)
把一串UNICODE字符串转换为普通格式的字符串，具体格式支持NFC、NFKC、NFD和NFKD格式。
Normalization Form D (NFD)，Normalization Form KD (NFKD)，Normalization Form C (NFC)，和Normalization Form KC (NFKC)。
大约来说，NFD和NFKD将可能的字符进行分解，而NFC和NFKC将可能的字符进行组合。

unicodedata.category(chr)
返回一个字符在UNICODE里分类的类型。
查看unicode的相关属性 https://www.compart.com/en/unicode/category

在这里插入图片描述

_is_punctuation函数

def _is_punctuation(char):
    """Checks whether `char` is a punctuation character."""
    # 检查字符'char'是不是一个标点符号
    cp = ord(char)
    # We treat all non-letter/number ASCII as punctuation.
    # 将所有非字母/数字的ASCII视为标点
    # Characters such as "^", "$", and "`" are not in the Unicode
    # Punctuation class but we treat them as punctuation anyways, for
    # consistency.
    # "^", "$", 和 "`"不在标点的类别中，但是为了保持一致性将它们看作为标点
    # 判断ASCII
    if (cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126):
        return True
    # 判断unicode类别
    cat = unicodedata.category(char)
    if cat.startswith("P"):     # 如果以P开头
        return True
    return False

_run_split_on_punc函数

以标点符号来切分句子，返回列表。

    def _run_split_on_punc(self, text, never_split=None):
        """Splits punctuation on a piece of text."""
        # 以标点符号来切分句子，返回列表
        # 先判断此段文本能否分割；如果可以，则：（1）
        if never_split is not None and text in never_split:
            return [text]                   # text = "我是一个粉刷匠，粉刷本领强"
        chars = list(text)             # ['我', '是', '一', '个', '粉', '刷', '匠', '，', '粉', '刷', '本', '领', '强']
        i = 0
        start_new_word = True
        output = []
        while i < len(chars):
            char = chars[i]
            if _is_punctuation(char):
                output.append([char])       # 列表嵌套列表
                                            # [['我'], ['是'], ['一'], ['个'], ['粉'], ['刷'], ['匠']]
                                            # "".join的时候[子列表，子列表，……]
                start_new_word = True
            else:
                if start_new_word:
                    output.append([])       # 传入空列表
                start_new_word = False
                output[-1].append(char)     # 令output = [['我']]
                                  # [['我', '我', '是', '一', '个', '粉', '刷', '匠'],['粉', '刷', '本', '领', '强']]
                                            #  "".join的时候[子列表，子列表，……]，子列表中的字符直接拼接
            i += 1

        return ["".join(x) for x in output]     # if为True的输出：['我', '是', '一', '个', '粉', '刷', '匠']
                                                # False的输出：['我我是一个粉刷匠', '粉刷本领强']

jupyter尝试：

在这里插入图片描述

_is_control函数

判断’char’是否是一个控制字符

def _is_control(char):
    """Checks whether `char` is a control character."""
    # 判断'char'是否是一个控制字符
    # These are technically control characters but we count them as whitespace
    # characters.从技术上讲，它们是控制字符，但我们将其视为空格字符。
    # \t水平制表(HT) （跳到下一个TAB位置）
    # \r回车(CR) ，将当前位置移到本行开头
    # \n换行(LF) ，将当前位置移到下一行开头        https://baike.baidu.com/item/转义字符#2
    if char == "\t" or char == "\n" or char == "\r":
        return False
    cat = unicodedata.category(char)
    if cat.startswith("C"):
        return True
    return False

_is_whitespace函数

判断char是否是一个空格

def _is_whitespace(char):
    """Checks whether `char` is a whitespace character."""
    # 判断`char`是否是一个空格
    # \t, \n, and \r are technically contorl characters but we treat them
    # as whitespace since they are generally considered as such.
    # 从技术上讲，\t, \n,  \r是控制字符，但我们将其视为空格字符。
    if char == " " or char == "\t" or char == "\n" or char == "\r":
        return True
    cat = unicodedata.category(char)
    if cat == "Zs": # Zs separator,space
        return True
    return False

_clean_text函数

删除无效字符，清除空格

    def _clean_text(self, text):
        """Performs invalid character removal and whitespace cleanup on text."""
        # 删除无效字符，清除空格
        output = []
        for char in text:
            cp = ord(char)
            if cp == 0 or cp == 0xFFFD or _is_control(char):
                continue
            if _is_whitespace(char):
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)

tokenize函数

私有变量：小写和一个前导下划线

_private_value

python中不存在私有变量一说，若是遇到需要保护的变量，使用小写和一个前导下划线。
但这只是程序员之间的一个约定，用于警告说明这是一个私有变量，外部类不要去访问它。
但实际上，外部类还是可以访问到这个变量。

    def tokenize(self, text, never_split=None):
        """Basic Tokenization of a piece of text.
            Split on "white spaces" only, for sub-word tokenization, see WordPieceTokenizer.
        对一段文本的基本分词。仅在空格的地方进行分词，对于子词的话参阅WordPieceTokenizer

        Args:
            **never_split**: (`optional`) list of str
                Kept for backward compatibility purposes.保持向后兼容目的
                Now implemented directly at the base class level (see :func:`PreTrainedTokenizer.tokenize`)
                List of token not to split.     不进行split的token列表
        """
        # union() returns a new set by concatenating the two sets.
        # union() 方法返回两个集合的并集，即包含了所有集合的元素，重复的元素只会出现一次。
        never_split = self.never_split.union(set(never_split)) if never_split else self.never_split

        # This was added on November 1st, 2018 for the multilingual and Chinese
        # models. This is also applied to the English models now, but it doesn't
        # matter since the English models were not trained on any Chinese data
        # and generally don't have any Chinese data in them (there are Chinese
        # characters in the vocabulary because Wikipedia does have some Chinese
        # words in the English Wikipedia.).
        if self.tokenize_chinese_chars:
            text = self._tokenize_chinese_chars(text)
        orig_tokens = whitespace_tokenize(text)
        split_tokens = []
        for token in orig_tokens:
            if token not in never_split:
                if self.do_lower_case:
                    token = token.lower()
                    if self.strip_accents is not False:
                        token = self._run_strip_accents(token)
                elif self.strip_accents:
                    token = self._run_strip_accents(token)
            split_tokens.extend(self._run_split_on_punc(token, never_split))
            # extend() 函数用于在列表末尾一次性追加另一个序列中的多个值（用新列表扩展原来的列表）。

        output_tokens = whitespace_tokenize(" ".join(split_tokens))
        return output_tokens

extend() 函数用于在列表末尾一次性追加另一个序列中的多个值（用新列表扩展原来的列表）。

union() 方法返回两个集合的并集，即包含了所有集合的元素，重复的元素只会出现一次。

控制字符是出现于特定的信息文本中，表示某一控制功能的字符。

在这里插入图片描述

WordpieceTokenizer类(继承自object)

WordpieceTokenizer是将BasicTokenizer的结果进一步做更细粒度的切分。做这一步的目的主要是为了去除未登录词对模型效果的影响。这一过程对中文没有影响，因为在前面BasicTokenizer里面已经切分成以字为单位的了。

class WordpieceTokenizer(object):
    """Runs WordPiece tokenization."""
    # 把word拆成piece一片一片。
    # WordPiece的一种主要的实现方式叫做BPE（Byte-Pair Encoding）双字节编码
    # WordpieceTokenizer是将BasicTokenizer的结果进一步做更细粒度的切分。
    # 做这一步的目的主要是为了去除未登录词对模型效果的影响。
    # 这一过程对中文没有影响，因为在前面BasicTokenizer里面已经切分成以字为单位的了。

    def __init__(self, vocab, unk_token, max_input_chars_per_word=100):
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word    # 每个单词最多的输入字符

tokenize函数

    def tokenize(self, text):
        """Tokenizes a piece of text into its word pieces.
        # 将一段文本标记为词片段形式

        This uses a greedy longest-match-first algorithm to perform tokenization
        using the given vocabulary. 贪心的最大正向匹配算法

        For example:
          input = "unaffable"
          output = ["un", "##aff", "##able"]

        Args:
          text: A single token or whitespace separated tokens. This should have
            already been passed through `BasicTokenizer`.
          text：BasicTokenizer的输出，即单个的token或空格分隔的标记

        Returns:
          A list of wordpiece tokens.
          返回词片段tokens
        """

        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)     # token = 'xdcfvgbhjkn';
                                    # chars = ['x', 'd', 'c', 'f', 'v', 'g', 'b', 'h', 'j', 'k', 'n']
            if len(chars) > self.max_input_chars_per_word:  # 如果单词的字符长度大于阈值，这个单词本身就不要了，替换为unk_token
                output_tokens.append(self.unk_token)        # output_tokens列表中加入unk_token
                continue

            is_bad = False      # 是否是坏词
            start = 0
            sub_tokens = []     # 切分后的词片段
            while start < len(chars):     # 不满足循环的条件就跳到下一段代码
                end = len(chars)
                cur_substr = None
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        substr = "##" + substr      # 表示这个词是接着前面的，这样使得WordPiece切分是可逆的（可以恢复出“真正”的词）
                    if substr in self.vocab:        
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    # 上面循环直到结束都没有找到在给定词表中的词片段，则认为是坏词
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end

            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens

BertTokenizer类(继承自PreTrainedTokenizer)

基于WordPiece

参数：
vocab_files: str. 包括词汇表的文件。
do_lower_case : bool,optional：默认为True. 在分词的时候输入是否小写。
do_basic_tokenize ：bool, optional，默认为True. 是否在wordpiece之前做基本的分词。
never_split ：Iterable, 字符串的列表, optional. 分词期间不被切分的tokens的集合，只有当do_basic_tokenize=True的时候才有效。
unk_token ：str, optional, 默认为"[UNK]". 词汇表当中没有的未知的token，不能被转化为一个ID，会被设置为unk_token的ID。
sep_token：str, optional, 默认为"[SEP]". 分隔器token，在由多个序列去构建一个序列的时候会使用它；也会被用作一个特殊标记构建的序列的最后一个token。
pad_token ：str, optional, 默认为"[PAD]". 用于padding，当进行序列长度不一致的批处理的时候会用到。
cls_token ：str, optional, 默认为"[CLS]". 序列分类的时候被使用作为分类器token。当序列由特殊tokens构建而成的时候，会作为第一个token。
mask_token ：str, optional, 默认为"[MASK]". 用于对值进行掩盖，在训练MLM的时候会被使用。模型会尝试去预测这个token。
tokenize_chinese_chars ：bool, optional, 默认为True. 是否对中文字符进行分词。
strip_accents：bool, optional. 用于去除变音符号。

初始化

self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])
                                                # 由词汇表构造一个有序字典，id-tokens

vocab_size函数

    def vocab_size(self):       # 返回词汇表的大小
        return len(self.vocab)

get_vocab函数

    def get_vocab(self):
        # dict() 函数用于创建一个字典
        return dict(self.vocab, **self.added_tokens_encoder)        
        # self.added_tokens_encoder是一个字典 Dict[str, int]

_tokenize函数

    def _tokenize(self, text):
        split_tokens = []
        # 判断是否做基础分词。
        # T：基础分词+wordpiece；F：wordpiece
        if self.do_basic_tokenize:
            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):  # 所有特殊的token不被分割

                # If the token is part of the never_split set
                # 判断是否在never_split列表中。
                if token in self.basic_tokenizer.never_split:
                    split_tokens.append(token)
                else:
                    split_tokens += self.wordpiece_tokenizer.tokenize(token)
        else:
            split_tokens = self.wordpiece_tokenizer.tokenize(text)
        return split_tokens     # 返回（基础分词）+ wordpiece分词后的列表

_convert_token_to_id函数

    def _convert_token_to_id(self, token):
        """ Converts a token (str) in an id using the vocab. """
        # 通过词汇表将一个token（字符串）转换为一个ID
        return self.vocab.get(token, self.vocab.get(self.unk_token))

python dict.get(key, default=None)
key – 字典中要查找的键。
default – 如果指定的键不存在时，返回该默认值。
返回指定键的值，如果键不在字典中返回默认值 None 或者指定的默认值。

_convert_id_to_token函数

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        # 通过词汇表将一个整数的索引值转化为一个token（字符串）
        return self.ids_to_tokens.get(index, self.unk_token)

convert_tokens_to_string函数

    def convert_tokens_to_string(self, tokens):
        """ Converts a sequence of tokens (string) in a single string. """
        # 将多个字符串合成一个单一的字符串
        # wordpiece后的结果：un,##aff,##able   ->  un ##aff ##able    ->  unaffable
        out_string = " ".join(tokens).replace(" ##", "").strip()        # strip()去除头尾的空格
        return out_string

build_inputs_with_special_tokens函数

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        为模型构建输入，从一个序列或一对序列，对于序列分类任务，通过concat和add特殊字符
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks
        by concatenating and adding special tokens.
        A BERT sequence has the following format:

        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs to which the special tokens will be added.
                特殊token将被加入到的ID列表
            token_ids_1 (:obj:`List[int]`, `optional`):
                Optional second list of IDs for sequence pairs.

        Returns:
            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
        """
        if token_ids_1 is None:
            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]      # [CLS] X [SEP]
        cls = [self.cls_token_id]
        sep = [self.sep_token_id]
        return cls + token_ids_0 + sep + token_ids_1 + sep      # [CLS] A [SEP] B [SEP]

get_special_tokens_mask函数

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer ``prepare_for_model`` method.
        从一个没有特殊token添加的列表 恢复出来 序列的ID

        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs.IDs的列表
            token_ids_1 (:obj:`List[int]`, `optional`):
                Optional second list of IDs for sequence pairs.
                序列对IDs的列表
            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
                Whether or not the token list is already formatted with special tokens for the model.
                token列表是否已经用特殊token格式化

        Returns:
            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
            返回一个整数构成的列表，1代表特殊token，0代表序列token
        """

        if already_has_special_tokens:  # 有特殊的token
            if token_ids_1 is not None: # 有第二个列表
                raise ValueError(
                    "You should not supply a second sequence if the provided sequence of "
                    "ids is already formated with special tokens for the model."
                )
            # 当是sep和cls的时候为1，其他为0
            # lamda冒号左边的x是token_ids_0中的每个元素，冒号右边是当前x的返回值
            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))

        if token_ids_1 is not None: # 没有特殊的token，有第二个列表
            # cls + 第一个列表的长度 + sep + 第二个列表的长度 + sep
            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
        # cls + 第一个列表的长度 + sep
        return [1] + ([0] * len(token_ids_0)) + [1]         # 没有特殊的token，没有第二个列表

create_token_type_ids_from_sequences函数

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Create a mask from the two sequences passed to be used in a sequence-pair classification task.
        # 从传递给序列对分类任务的两个序列中创建一个mask。
        A BERT sequence pair mask has the following format:

        ::

            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
            | first sequence    | second sequence |

        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
        如果只有一个序列，则只返回0的那部分。

        Args:
            token_ids_0 (:obj:`List[int]`):
                List of IDs.
            token_ids_1 (:obj:`List[int]`, `optional`):
                Optional second list of IDs for sequence pairs.

        Returns:
            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
            sequence(s).
        """
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
        if token_ids_1 is None:
            return len(cls + token_ids_0 + sep) * [0]
        # 返回只有0和1的列表。0表示第一个序列；1表示第二个序列。
        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]

save_vocabulary函数

    def save_vocabulary(self, vocab_path):
        """
        Save the vocabulary (copy original file) and special tokens file to a directory目录.
        保存词汇表和特殊token文件到一个目录。

        Args:
            vocab_path (:obj:`str`):
                The directory in which to save the vocabulary.
                保存词汇表的目录

        Returns:
            :obj:`Tuple(str)`: Paths to the files saved.文件被存储的路径
        """
        index = 0
        if os.path.isdir(vocab_path):       # os.path.isdir：判断是不是路径
            vocab_file = os.path.join(vocab_path, VOCAB_FILES_NAMES["vocab_file"])
        else:
            vocab_file = vocab_path
        with open(vocab_file, "w", encoding="utf-8") as writer:
            # self.vocab：OrderedDict([('d1={}', 0), ("d1['a']='A'", 1)]）
            # sorted函数中，key主要是用来进行比较的元素
            # 当列表较为复杂时，以列表中的 每个元素的 第二个数据进行排序，即以token_index来排序。
            for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
                if index != token_index:
                    # logger.warning：某些没有预料到的事件的提示，或者在将来可能会出现的问题提示。例如：磁盘空间不足。但是软件还是会照常运行。
                    logger.warning(
                        "Saving vocabulary to {}: vocabulary indices are not consecutive."
                        " Please check that the vocabulary is not corrupted破坏!".format(vocab_file)
                    )
                    index = token_index
                writer.write(token + "\n")
                index += 1
        return (vocab_file,)