[Hugging Face] The tokenizer.batch_encode_plus() method: batch-encoding text (returns input IDs, attention mask, token type IDs)

彬彬侠

Published 2025-03-13 11:56:45

Article link: https://blog.csdn.net/u013172930/article/details/146227130

The Hugging Face `tokenizer.batch_encode_plus` method

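batch_encode_plus encodes a list of texts in a single call. It is better suited to large-scale data processing than encode and encode_plus, and fits NLP tasks such as text classification, translation, and question answering. Note that the transformers documentation marks batch_encode_plus as deprecated in favor of calling the tokenizer directly (tokenizer(texts, ...)), which accepts the same arguments.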

1. Basic usage of `batch_encode_plus`

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

texts = ["Hugging Face is great!", "Transformers are amazing!"]
tokens = tokenizer.batch_encode_plus(texts)

print(tokens)

Example output (the IDs are illustrative; exact values depend on the tokenizer's vocabulary)

{
  'input_ids': [
      [101, 17662, 18781, 2003, 2307, 999, 102],
      [101, 10938, 2024, 6429, 999, 102]
  ],
  'token_type_ids': [
      [0, 0, 0, 0, 0, 0, 0],
      [0, 0, 0, 0, 0, 0]
  ],
  'attention_mask': [
      [1, 1, 1, 1, 1, 1, 1],
      [1, 1, 1, 1, 1, 1]
  ]
}

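Explanation

input_ids: the token ID sequence for each text, with [CLS] (101) at the start and [SEP] (102) at the end
token_type_ids: the segment each token belongs to (0 for the first sentence, 1 for the second); all 0 here because each input is a single sentence
attention_mask: marks real tokens (1) versus padding tokens (0); all 1 here because no padding was requested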
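
To make the three fields concrete, here is a minimal pure-Python sketch of how they line up for a single sentence. The vocabulary `TOY_VOCAB` and the helper `toy_encode` are hypothetical and exist only for illustration; a real BERT tokenizer uses a WordPiece vocabulary and more elaborate splitting rules.

```python
# Hypothetical toy vocabulary for illustration only; real BERT IDs differ.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "hello": 4, "world": 5}

def toy_encode(text):
    """Lay out one sentence the way a BERT-style tokenizer structures its output."""
    words = text.lower().split()
    ids = [TOY_VOCAB.get(w, TOY_VOCAB["[UNK]"]) for w in words]
    input_ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single sentence: all segment 0
        "attention_mask": [1] * len(input_ids),  # no padding: every token is real
    }

encoded = toy_encode("Hello world")
print(encoded)
```

All three lists always have the same length, one entry per token.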

2. Common parameters of `batch_encode_plus`

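Parameter	Purpose	Default
`batch_text_or_text_pairs`	List of input texts (or text pairs)	required
`add_special_tokens`	Whether to add `[CLS]` and `[SEP]`	`True`
`max_length`	Maximum sequence length	`None`
`padding`	Whether and how to pad	`False`
`truncation`	Whether to truncate	`False`
`return_tensors`	Tensor type to return: `"pt"` (PyTorch), `"tf"` (TensorFlow), or `"np"` (NumPy)	`None` (plain Python lists)
`return_attention_mask`	Whether to return `attention_mask`	`None` (model default; returned for BERT)
`return_token_type_ids`	Whether to return `token_type_ids`	`None` (model default; returned for BERT)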

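3. Detailed usage of `batch_encode_plus`

3.1. Handling over-long text (truncation)

If a text exceeds BERT's maximum length (512 tokens), truncate it explicitly. The example below uses max_length=5 so that the effect is visible on short inputs: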

tokens = tokenizer.batch_encode_plus(texts, max_length=5, truncation=True)
print(tokens["input_ids"])

Output

[
  [101, 17662, 18781, 2003, 102],
  [101, 10938, 2024, 6429, 102]
]

Each text is truncated to max_length=5 tokens in total, counting [CLS] and [SEP].
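
The counting rule can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: the helper `truncate_with_specials` and the content IDs are hypothetical, and only the values 101 and 102 (the [CLS] and [SEP] IDs in bert-base-uncased) come from the outputs above.

```python
CLS, SEP = 101, 102  # [CLS] and [SEP] IDs in bert-base-uncased

def truncate_with_specials(content_ids, max_length):
    """Keep at most max_length tokens in total, including [CLS] and [SEP]."""
    budget = max(0, max_length - 2)  # two slots are reserved for the special tokens
    return [CLS] + content_ids[:budget] + [SEP]

content = [11, 12, 13, 14, 15]  # hypothetical IDs for five content tokens
result = truncate_with_specials(content, max_length=5)
print(result)  # [101, 11, 12, 13, 102]
```

With max_length=5, only three content tokens survive, which matches the output above: three word IDs between 101 and 102.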

3.2. Padding (`padding`)

If the texts have different lengths, padding aligns them to a common length:

tokens = tokenizer.batch_encode_plus(texts, max_length=10, padding="max_length", truncation=True)
print(tokens["input_ids"])

Output

[
  [101, 17662, 18781, 2003, 2307, 999, 102, 0, 0, 0],
  [101, 10938, 2024, 6429, 999, 102, 0, 0, 0, 0]
]

0 is the ID of [PAD].
attention_mask is updated to match:

[
  [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
  [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
]

1 marks a real token; 0 marks a padding token.
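
The relationship between padded IDs and the attention mask can be sketched as follows. The helper `pad_to_length` is hypothetical and assumes no sequence is longer than `max_length` (in the real call, truncation=True takes care of that case); the content IDs are made up.

```python
PAD_ID = 0  # ID of [PAD] in bert-base-uncased

def pad_to_length(batch_ids, max_length):
    """Right-pad every sequence to max_length and build the matching mask."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_length - len(ids)  # assumed non-negative in this sketch
        input_ids.append(ids + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 7, 8, 9, 102], [101, 7, 102]]  # hypothetical encoded texts
padded = pad_to_length(batch, max_length=6)
print(padded["input_ids"])
print(padded["attention_mask"])
```

The mask has a 0 exactly where a [PAD] was appended, so the model can ignore those positions.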

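3.3. Handling sentence pairs (text matching, question answering)

When each input is a pair of sentences, batch_encode_plus inserts [SEP] between them and sets token_type_ids accordingly: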

pairs = [("Hugging Face is great!", "It provides NLP tools."),
         ("Transformers are powerful.", "They are used in AI.")]
tokens = tokenizer.batch_encode_plus(pairs)

print(tokens["input_ids"])

Output

[
  [101, 17662, 18781, 2003, 2307, 999, 102, 2009, 3641, 10336, 26642, 2476, 1012, 102],
  [101, 10938, 2024, 6429, 1012, 102, 2027, 2024, 2109, 1999, 9932, 1012, 102]
]

[SEP] (102) separates the two sentences and also closes the sequence.
token_type_ids:

[
  [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
  [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
]

0 marks the first sentence (including [CLS] and the first [SEP])
1 marks the second sentence (including the final [SEP])
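
The layout of a BERT-style sentence pair can be sketched like this. The helper `encode_pair` and the content IDs are hypothetical; the sketch only reproduces the pattern visible in the output above, where segment 0 covers [CLS], the first sentence, and the first [SEP], and segment 1 covers the rest.

```python
CLS, SEP = 101, 102  # [CLS] and [SEP] IDs in bert-base-uncased

def encode_pair(first_ids, second_ids):
    """Join two already-tokenized sentences into one BERT-style pair encoding."""
    input_ids = [CLS] + first_ids + [SEP] + second_ids + [SEP]
    # Segment 0: [CLS] + first sentence + [SEP]; segment 1: second sentence + [SEP]
    token_type_ids = [0] * (len(first_ids) + 2) + [1] * (len(second_ids) + 1)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}

pair = encode_pair([21, 22, 23], [31, 32])  # hypothetical content IDs
print(pair["input_ids"])       # [101, 21, 22, 23, 102, 31, 32, 102]
print(pair["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1]
```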

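3.4. Returning tensors (PyTorch/TensorFlow)

To use the output in PyTorch or TensorFlow, set return_tensors. The two texts have different lengths (7 and 6 tokens), so padding must be enabled; otherwise the tokenizer cannot build a rectangular tensor and raises an error:

tokens = tokenizer.batch_encode_plus(texts, padding=True, return_tensors="pt")  # PyTorch
print(tokens["input_ids"].shape)  # torch.Size([2, 7])

tokens = tokenizer.batch_encode_plus(texts, padding=True, return_tensors="tf")  # TensorFlow
print(tokens["input_ids"].shape)  # (2, 7)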
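
A tensor must be rectangular, so sequences of different lengths have to be padded before they can be stacked. The sketch below shows, in plain Python, what padding to the longest sequence in the batch (the behavior of padding=True) amounts to. The helpers `pad_to_longest` and `shape_of` and the content IDs are hypothetical.

```python
PAD_ID = 0  # ID of [PAD] in bert-base-uncased

def pad_to_longest(batch_ids):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(ids) for ids in batch_ids)
    return [ids + [PAD_ID] * (longest - len(ids)) for ids in batch_ids]

def shape_of(rows):
    """Return (n_rows, n_cols) if all rows have equal length, else None."""
    lengths = {len(r) for r in rows}
    return (len(rows), lengths.pop()) if len(lengths) == 1 else None

ragged = [[101, 1, 2, 3, 4, 5, 102], [101, 1, 2, 3, 4, 102]]  # lengths 7 and 6
print(shape_of(ragged))                  # None: cannot be stacked into a tensor
print(shape_of(pad_to_longest(ragged)))  # (2, 7)
```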

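4. `encode_plus` vs `batch_encode_plus`

Method	Purpose
`encode_plus(text)`	Encodes a single text
`batch_encode_plus([text1, text2])`	Encodes several texts in one call

When your task involves several texts, prefer batch_encode_plus. Both methods accept the same padding, truncation, and return_tensors options, but the batch version applies them across all texts at once, so padding can align the whole batch.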
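
The two methods also differ in the layout of what they return: one dict per text versus a single dict whose values hold one row per text. The following sketch converts the first layout into the second. The helper `to_batch` and the sample encodings are hypothetical and only illustrate the data shapes.

```python
def to_batch(encodings):
    """Turn a list of per-text dicts into one dict of lists (batch layout)."""
    keys = encodings[0].keys()
    return {k: [enc[k] for enc in encodings] for k in keys}

# Hypothetical per-text results, shaped like what encode_plus returns
single_a = {"input_ids": [101, 5, 102], "attention_mask": [1, 1, 1]}
single_b = {"input_ids": [101, 6, 7, 102], "attention_mask": [1, 1, 1, 1]}

batch = to_batch([single_a, single_b])
print(batch["input_ids"])  # [[101, 5, 102], [101, 6, 7, 102]]
```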

5. Using `batch_encode_plus` when training with `Trainer`

When fine-tuning BERT, batch_encode_plus can serve as the data preprocessing step:

from datasets import load_dataset

dataset = load_dataset("imdb")

def preprocess_function(examples):
    return tokenizer.batch_encode_plus(examples["text"], truncation=True, padding="max_length")

encoded_dataset = dataset.map(preprocess_function, batched=True)

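Example output (schematic: rows are shortened for display. With padding="max_length" and no max_length, each row is actually padded to the model maximum of 512 tokens, and token_type_ids is returned as well)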

{
  "input_ids": [[101, 17662, 18781, 2003, 2307, 999, 102, 0, 0, 0], ...],
  "attention_mask": [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0], ...]
}
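
With batched=True, the preprocessing function receives a dict that maps each column name to a list of values, and must return a dict of lists with one entry per input row. The toy stand-in below illustrates that contract in plain Python. `map_batched` and `toy_preprocess` are hypothetical, not part of the datasets library, and the sketch assumes the function only adds new columns.

```python
def map_batched(columns, fn, batch_size=2):
    """Toy stand-in for a batched map: call fn on column slices and merge the results."""
    n_rows = len(next(iter(columns.values())))
    out = {name: list(values) for name, values in columns.items()}
    for start in range(0, n_rows, batch_size):
        chunk = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        # Assumption: fn returns only new column names, never existing ones
        for name, values in fn(chunk).items():
            out.setdefault(name, []).extend(values)
    return out

def toy_preprocess(examples):
    # Hypothetical "tokenizer": the text length stands in for real input_ids
    return {"n_chars": [len(t) for t in examples["text"]]}

data = {"text": ["good film", "bad", "fine"], "label": [1, 0, 1]}
result = map_batched(data, toy_preprocess)
print(result["n_chars"])  # [9, 3, 4]
```

The original columns are kept and the new ones are added alongside them, which is why the encoded dataset still carries its text and label fields.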

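6. Summary

batch_encode_plus encodes many texts in one call, where encode_plus handles one text at a time. It fits tasks such as text classification, machine translation, and question answering.

Common usage

tokenizer.batch_encode_plus(texts) → batch encoding
tokenizer.batch_encode_plus(texts, max_length=10, truncation=True) → truncate over-long text
tokenizer.batch_encode_plus(texts, padding="max_length") → pad to a fixed length (the model maximum unless max_length is set)
tokenizer.batch_encode_plus(texts, padding=True, return_tensors="pt") → return PyTorch tensors
tokenizer.batch_encode_plus([(text1, text2)]) → sentence-pair tasks

If your task processes several texts and needs padding, truncation, and an attention_mask handled for you, batch_encode_plus is the recommended choice.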