data_collator in transformers

Preface

After loading a dataset with huggingface's Dataset and encoding the text with a tokenizer, the resulting features are still not tensors; they have to be converted into the tensor type required by the deep learning framework. That is the job of the data_collator: it converts the features into batches of tensors.

This post documents two commonly used data_collators in huggingface transformers: default_data_collator and DataCollatorWithPadding. BertTokenizer is used as the base tokenizer throughout, as shown below:

from transformers import BertTokenizer
from transformers import default_data_collator, DataCollatorWithPadding
from datasets import Dataset

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm-ext")

def func(exam):
    # encode one example; the tokenizer returns Python lists, not tensors
    return tokenizer(exam["text"])
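
For a single sentence, the tokenizer output consists of plain Python lists rather than tensors; a quick check (the exact token ids depend on the vocabulary, so the values shown are illustrative):

print(func({"text": "我爱中国。"}))
# e.g. {'input_ids': [101, ..., 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}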

default_data_collator

When using the pytorch framework, default_data_collator essentially executes torch_default_data_collator. Note that the input is expected in List[Any] format and the output satisfies Dict[str, Any].

def default_data_collator(features: List[InputDataClass], return_tensors="pt") -> Dict[str, Any]:
    """
    Very simple data collator that simply collates batches of dict-like objects and performs special handling for
    potential keys named:

        - `label`: handles a single value (int or float) per object
        - `label_ids`: handles a list of values per object

    Does not do any additional preprocessing: property names of the input object will be used as corresponding inputs
    to the model. See glue and ner for example of how it's useful.
    """

    # In this function we'll make the assumption that all `features` in the batch
    # have the same attributes.
    # So we will look at the first element as a proxy for what attributes exist
    # on the whole batch.

    if return_tensors == "pt":
        return torch_default_data_collator(features)
    elif return_tensors == "tf":
        return tf_default_data_collator(features)
    elif return_tensors == "np":
        return numpy_default_data_collator(features)

The source of torch_default_data_collator is shown below. It assumes that all features in a batch share the same attributes, so it uses the first example as a proxy when deciding how to collate. It also gives special treatment to the label and label_ids attributes, which correspond to single-label and multi-label classification respectively, and renames the attribute to "labels", because the forward method of most pretrained models defines its label keyword argument as labels.

def torch_default_data_collator(features: List[InputDataClass]) -> Dict[str, Any]:
    import torch

    if not isinstance(features[0], Mapping):
        features = [vars(f) for f in features]
    first = features[0]
    batch = {}

    # Special handling for labels.
    # Ensure that tensor is created with the correct type
    # (it should be automatically the case, but let's make sure of it.)
    if "label" in first and first["label"] is not None:
        label = first["label"].item() if isinstance(first["label"], torch.Tensor) else first["label"]
        dtype = torch.long if isinstance(label, int) else torch.float
        batch["labels"] = torch.tensor([f["label"] for f in features], dtype=dtype)
    elif "label_ids" in first and first["label_ids"] is not None:
        if isinstance(first["label_ids"], torch.Tensor):
            batch["labels"] = torch.stack([f["label_ids"] for f in features])
        else:
            dtype = torch.long if type(first["label_ids"][0]) is int else torch.float
            batch["labels"] = torch.tensor([f["label_ids"] for f in features], dtype=dtype)

    # Handling of all other possible keys.
    # Again, we will use the first element to figure out which key/values are not None for this model.
    for k, v in first.items():
        if k not in ("label", "label_ids") and v is not None and not isinstance(v, str):
            if isinstance(v, torch.Tensor):
                batch[k] = torch.stack([f[k] for f in features])
            elif isinstance(v, np.ndarray):
                batch[k] = torch.tensor(np.stack([f[k] for f in features]))
            else:
                batch[k] = torch.tensor([f[k] for f in features])

    return batch
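
The example below exercises the label branch; to see the label_ids branch as well, here is a small standalone sketch reusing the default_data_collator imported above (the feature values are invented for illustration):

feats = [{"input_ids": [101, 102], "label_ids": [1, 0, 1]},
         {"input_ids": [101, 102], "label_ids": [0, 1, 0]}]
batch = default_data_collator(feats)
print(batch["labels"])     # tensor([[1, 0, 1], [0, 1, 0]]), dtype torch.long since the labels are ints
print(batch["input_ids"])  # tensor([[101, 102], [101, 102]])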

Example:

x = [{"text": "我爱中国。", "label": 1}, {"text": "我爱中国。", "label": 1}]
ds = Dataset.from_list(x)
features = ds.map(func, batched=False, remove_columns=["text"])
dataset = default_data_collator(features)
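
Both sentences tokenize to the same length here, so torch.tensor can stack them directly; with sequences of unequal lengths, default_data_collator would raise an error, which is where DataCollatorWithPadding comes in. A quick shape check, assuming "我爱中国。" encodes to 7 tokens including [CLS] and [SEP]:

print({k: v.shape for k, v in dataset.items()})
# e.g. {'labels': torch.Size([2]), 'input_ids': torch.Size([2, 7]),
#       'token_type_ids': torch.Size([2, 7]), 'attention_mask': torch.Size([2, 7])}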

DataCollatorWithPadding

Note that DataCollatorWithPadding is a class: it must be instantiated first, and the instance is then called to convert the features into a dataset. Compared with default_data_collator, DataCollatorWithPadding additionally pads the received features, filling each dimension up to a common size. Its source is as follows:

@dataclass
class DataCollatorWithPadding:
    """
    Data collator that will dynamically pad the inputs received.

    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:

            - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.

            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
        return_tensors (`str`):
            The type of Tensor to return. Allowable values are "np", "pt" and "tf".
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        batch = self.tokenizer.pad(
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        if "label" in batch:
            batch["labels"] = batch["label"]
            del batch["label"]
        if "label_ids" in batch:
            batch["labels"] = batch["label_ids"]
            del batch["label_ids"]
        return batch

When instantiating, note that pad_to_multiple_of means rounding max_length up to a multiple of the given value. For example, with max_length=510 and pad_to_multiple_of=8, max_length is set to 512. See the source of transformers.tokenization_utils_base.PreTrainedTokenizerBase._pad:

    def _pad(
        self,
        encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
        max_length: Optional[int] = None,
        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
    ) -> dict:
        """
        Pad encoded inputs (on left/right and up to predefined length or max length in the batch)

        Args:
            encoded_inputs:
                Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
            max_length: maximum length of the returned list and optionally padding length (see below).
                Will truncate by taking into account the special tokens.
            padding_strategy: PaddingStrategy to use for padding.

                - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
                - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
                - PaddingStrategy.DO_NOT_PAD: Do not pad
                The tokenizer padding sides are defined in self.padding_side:

                    - 'left': pads on the left of the sequences
                    - 'right': pads on the right of the sequences
            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
                `>= 7.5` (Volta).
            return_attention_mask:
                (optional) Set to False to avoid returning attention mask (default: set to model specifics)
        """
...
...
        if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
            max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
...
...
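
Plugging the numbers from the example above into that rounding line confirms the result; a standalone sketch of the same arithmetic:

max_length, pad_to_multiple_of = 510, 8
if max_length % pad_to_multiple_of != 0:
    max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
print(max_length)  # (510 // 8 + 1) * 8 = 64 * 8 = 512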

In the __call__ method of DataCollatorWithPadding, label or label_ids is likewise renamed to labels; the actual padding is implemented by transformers.tokenization_utils_base.PreTrainedTokenizerBase.pad.

    def pad(
        self,
        encoded_inputs: Union[
            BatchEncoding,
            List[BatchEncoding],
            Dict[str, EncodedInput],
            Dict[str, List[EncodedInput]],
            List[Dict[str, EncodedInput]],
        ],
        padding: Union[bool, str, PaddingStrategy] = True,
        max_length: Optional[int] = None,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        verbose: bool = True,
    ) -> BatchEncoding:
        """
        Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length
        in the batch.

        Padding side (left/right) padding token ids are defined at the tokenizer level (with `self.padding_side`,
        `self.pad_token_id` and `self.pad_token_type_id`).

        Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the
        text followed by a call to the `pad` method to get a padded encoding.

        <Tip>

        If the `encoded_inputs` passed are dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the
        result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
        PyTorch tensors, you will lose the specific device of your tensors however.

        </Tip>

        Args:
            encoded_inputs ([`BatchEncoding`], list of [`BatchEncoding`], `Dict[str, List[int]]`, `Dict[str, List[List[int]]` or `List[Dict[str, List[int]]]`):
                Tokenized inputs. Can represent one input ([`BatchEncoding`] or `Dict[str, List[int]]`) or a batch of
                tokenized inputs (list of [`BatchEncoding`], *Dict[str, List[List[int]]]* or *List[Dict[str,
                List[int]]]*) so you can use this method during preprocessing as well as in a PyTorch Dataloader
                collate function.

                Instead of `List[int]` you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors), see
                the note above for the return type.
            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
                 Select a strategy to pad the returned sequences (according to the model's padding side and padding
                 index) among:

                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
                  sequence is provided).
                - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
                  acceptable input length for the model if that argument is not provided.
                - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
                  lengths).
            max_length (`int`, *optional*):
                Maximum length of the returned list and optionally padding length (see above).
            pad_to_multiple_of (`int`, *optional*):
                If set will pad the sequence to a multiple of the provided value.

                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
                `>= 7.5` (Volta).
            return_attention_mask (`bool`, *optional*):
                Whether to return the attention mask. If left to the default, will return the attention mask according
                to the specific tokenizer's default, defined by the `return_outputs` attribute.

                [What are attention masks?](../glossary#attention-mask)
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors instead of list of python integers. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return Numpy `np.ndarray` objects.
            verbose (`bool`, *optional*, defaults to `True`):
                Whether or not to print more information and warnings.
        """

......
        # If we have a list of dicts, let's convert it in a dict of lists
        # We do this to allow using this method as a collate_fn function in PyTorch Dataloader
        if isinstance(encoded_inputs, (list, tuple)) and isinstance(encoded_inputs[0], Mapping):
            encoded_inputs = {key: [example[key] for example in encoded_inputs] for key in encoded_inputs[0].keys()}
......
  • First note the requirements pad places on its input: EncodedInput is an alias of List[int], and BatchEncoding can be regarded as a dict object in Dict[str, Any] format whose data is stored in its data attribute. When a BatchEncoding is instantiated, its convert_to_tensors method is called, which converts the data in the data attribute into tensors.
  • If the input features are in List[Dict[str, Any]] format, they are converted to Dict[str, List] (a toy illustration of this conversion follows this list) so that the method can be used as a collate_fn in a PyTorch DataLoader; a DataLoader sketch follows the example below. Also note that passing a datasets.Dataset instance directly into pad raises an error, since datasets.Dataset has no keys attribute.
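
A minimal toy illustration of that list-of-dicts to dict-of-lists conversion (the ids are invented for the example):

encoded_inputs = [{"input_ids": [101, 102], "attention_mask": [1, 1]},
                  {"input_ids": [101, 2769, 102], "attention_mask": [1, 1, 1]}]
encoded_inputs = {key: [example[key] for example in encoded_inputs] for key in encoded_inputs[0].keys()}
# {'input_ids': [[101, 102], [101, 2769, 102]], 'attention_mask': [[1, 1], [1, 1, 1]]}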

Example:

x += [{"text": "中国是一个伟大国家。", "label": 1}]  # a longer sentence, so padding is actually needed
ds = Dataset.from_list(x)
features = ds.map(func, batched=False, remove_columns=["text"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
dataset = data_collator(features=features.to_list())  # convert the Dataset into a List[dict]
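
In real training loops the collator is usually not called by hand but passed as the collate_fn of a DataLoader; a minimal sketch reusing the objects defined above (batch_size=2 is an arbitrary choice for illustration):

from torch.utils.data import DataLoader

loader = DataLoader(features.to_list(), batch_size=2, collate_fn=data_collator)
for batch in loader:
    # each batch is a BatchEncoding of tensors, padded to the longest sequence in that batch
    print({k: v.shape for k, v in batch.items()})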