[Hugging Face] The tokenizer.encode_plus() method: text encoding (returns input IDs, attention mask, token type IDs)

Latest recommended article published on 2025-03-13 11:56:45

彬彬侠


Article link: https://blog.csdn.net/u013172930/article/details/146226921


The Hugging Face `tokenizer.encode_plus` method

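tokenizer.encode_plus is a text-encoding method provided by the Hugging Face transformers library. Unlike tokenizer.encode, which returns only a list of token IDs, it returns a dictionary that also contains the attention mask and token type IDs, which makes it suitable for NLP tasks such as text classification, question answering, and NER. Note that recent transformers documentation marks encode_plus as deprecated in favor of calling the tokenizer directly, as in tokenizer(text), which accepts the same arguments used in this article.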

1. Basic usage of `encode_plus`

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hugging Face is great!"
tokens = tokenizer.encode_plus(text)

print(tokens)

Example output (the exact IDs depend on the tokenizer's vocabulary)

{
  'input_ids': [101, 17662, 18781, 2003, 2307, 999, 102],
  'token_type_ids': [0, 0, 0, 0, 0, 0, 0],
  'attention_mask': [1, 1, 1, 1, 1, 1, 1]
}

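Explanation

input_ids → the sequence of token IDs (including [CLS] and [SEP])
token_type_ids → segment markers that tell the two sentences apart (used in sentence-pair tasks)
attention_mask → marks real tokens (1) and padding tokens (0)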
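
To make the relationship between the three fields concrete, here is a small pure-Python sketch that inspects a dictionary shaped like the example output above. It does not call transformers at all. The helper name `describe_encoding` is hypothetical, and the IDs 101 and 102 are assumed to be [CLS] and [SEP], as in the bert-base-uncased example.

```python
# Illustrative only: inspects a dict shaped like the encode_plus output above.
# `describe_encoding` is a hypothetical helper, not part of transformers.

CLS_ID, SEP_ID = 101, 102  # assumed [CLS]/[SEP] IDs, as in the example output


def describe_encoding(encoding):
    """Summarize how many positions are real tokens, special tokens, or padding."""
    ids = encoding["input_ids"]
    mask = encoding["attention_mask"]
    types = encoding["token_type_ids"]
    # The three lists are aligned position by position.
    assert len(ids) == len(mask) == len(types)
    return {
        "length": len(ids),
        "attended": sum(mask),              # positions with attention_mask == 1
        "padding": len(mask) - sum(mask),   # positions with attention_mask == 0
        "special": sum(1 for i in ids if i in (CLS_ID, SEP_ID)),
        "segments": sorted(set(types)),     # [0] for a single sentence
    }


example = {
    "input_ids": [101, 17662, 18781, 2003, 2307, 999, 102],
    "token_type_ids": [0, 0, 0, 0, 0, 0, 0],
    "attention_mask": [1, 1, 1, 1, 1, 1, 1],
}
summary = describe_encoding(example)
print(summary)
```

For the single unpadded sentence used here, every position is attended to, two positions are special tokens, and only segment 0 appears.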

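2. Common parameters of `encode_plus`

Parameter	Description	Default
`text`	Input text	Required
`add_special_tokens`	Whether to add special tokens such as `[CLS]` and `[SEP]`	`True`
`max_length`	Maximum sequence length (used by truncation and by `padding="max_length"`)	`None`
`padding`	Padding strategy (`True`/`"longest"`, `"max_length"`, or `False`)	`False`
`truncation`	Whether to truncate to `max_length`	`False`
`return_tensors`	Return tensors instead of lists: `"pt"` (PyTorch), `"tf"` (TensorFlow), or `"np"` (NumPy)	`None`
`return_attention_mask`	Whether to return `attention_mask`	`None` (decided by the tokenizer; returned for BERT)
`return_token_type_ids`	Whether to return `token_type_ids`	`None` (decided by the tokenizer; returned for BERT)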

3. `encode_plus` in detail

3.1. Disabling special tokens

By default, encode_plus adds [CLS] and [SEP]. If you do not need these tokens:

tokens = tokenizer.encode_plus(text, add_special_tokens=False)
print(tokens)

Output

{
  'input_ids': [17662, 18781, 2003, 2307, 999],
  'token_type_ids': [0, 0, 0, 0, 0],
  'attention_mask': [1, 1, 1, 1, 1]
}
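
The two outputs shown so far differ only by the leading [CLS] and trailing [SEP]. The sketch below reproduces that relationship in plain Python. It is not how transformers implements the option; `wrap_with_special_tokens` is a hypothetical helper, and the IDs 101 and 102 are assumed from the example output.

```python
# Illustrative only: mimics the effect of add_special_tokens for one sentence.
# `wrap_with_special_tokens` is a hypothetical helper, not a transformers function.

CLS_ID, SEP_ID = 101, 102  # assumed [CLS]/[SEP] IDs


def wrap_with_special_tokens(ids, add_special_tokens=True):
    """Return the IDs unchanged, or wrapped as [CLS] ... [SEP]."""
    if not add_special_tokens:
        return list(ids)
    return [CLS_ID] + list(ids) + [SEP_ID]


plain = [17662, 18781, 2003, 2307, 999]  # IDs from the add_special_tokens=False output
wrapped = wrap_with_special_tokens(plain)
unwrapped = wrap_with_special_tokens(plain, add_special_tokens=False)
print(wrapped)  # [101, 17662, 18781, 2003, 2307, 999, 102]
```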

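3.2. Handling overlong text (truncation)

If the text exceeds the maximum length BERT allows (512 tokens), truncate it by passing max_length together with truncation=True. The example uses max_length=5 only to make the effect visible: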

tokens = tokenizer.encode_plus(text, max_length=5, truncation=True)
print(tokens)

Output

{
  'input_ids': [101, 17662, 18781, 2003, 102],  # truncated to 5 tokens, including [CLS] and [SEP]
  'token_type_ids': [0, 0, 0, 0, 0],
  'attention_mask': [1, 1, 1, 1, 1]
}
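
Note that max_length counts the special tokens, so only max_length - 2 ordinary tokens survive for a single sentence. The following sketch shows that arithmetic under the assumption of right-side truncation (tokens are dropped from the end). `truncate_single` is a hypothetical helper, not a library function.

```python
# Illustrative only: single-sentence truncation with room reserved for
# [CLS] and [SEP]. `truncate_single` is a hypothetical helper.

def truncate_single(ids, max_length, cls_id=101, sep_id=102):
    """Keep leading tokens so the result, with [CLS] and [SEP], fits max_length."""
    budget = max_length - 2  # two slots are reserved for the special tokens
    if budget < 0:
        raise ValueError("max_length must be at least 2 to hold [CLS] and [SEP]")
    return [cls_id] + list(ids[:budget]) + [sep_id]


plain = [17662, 18781, 2003, 2307, 999]  # token IDs without special tokens
short = truncate_single(plain, max_length=5)
full = truncate_single(plain, max_length=10)  # long enough, nothing is dropped
print(short)  # [101, 17662, 18781, 2003, 102]
```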

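3.3. Sentence-pair tasks

For two input sentences (for example, text matching or question answering), encode_plus automatically inserts [SEP] after each sentence and generates token_type_ids: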

sentence1 = "Hugging Face is amazing."
sentence2 = "It provides NLP tools."

tokens = tokenizer.encode_plus(sentence1, sentence2)
print(tokens)

Output

{
  'input_ids': [101, 17662, 18781, 2003, 6429, 1012, 102, 2009, 3641, 10336, 26642, 2476, 1012, 102],
  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}

token_type_ids:
- 0 marks the first sentence (including [CLS] and the first [SEP])
- 1 marks the second sentence (including the final [SEP])
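
The layout of a BERT-style sentence pair, [CLS] A [SEP] B [SEP], can be sketched in a few lines of plain Python. This is a simplified illustration rather than the library's code; `build_pair` is a hypothetical helper and the special-token IDs are assumed from the output above.

```python
# Illustrative only: assembles a BERT-style sentence pair from two ID lists.
# `build_pair` is a hypothetical helper, not part of transformers.

def build_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Build [CLS] A [SEP] B [SEP] with matching segment IDs and mask."""
    input_ids = [cls_id] + list(ids_a) + [sep_id] + list(ids_b) + [sep_id]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)  # no padding, so every position is attended
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


ids_a = [17662, 18781, 2003, 6429, 1012]        # first sentence, from the output above
ids_b = [2009, 3641, 10336, 26642, 2476, 1012]  # second sentence, from the output above
pair = build_pair(ids_a, ids_b)
print(pair["token_type_ids"])  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```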

3.4. Padding (`padding`)

To bring every input_ids sequence to the same length:

tokens = tokenizer.encode_plus(text, max_length=10, padding="max_length", truncation=True)
print(tokens)

Output

{
  'input_ids': [101, 17662, 18781, 2003, 2307, 999, 102, 0, 0, 0],  # padded with [PAD] (ID 0)
  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # attention_mask is 0 at padded positions
}
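
What padding="max_length" does to the three fields can be mirrored in plain Python. The sketch below assumes right-side padding and a [PAD] ID of 0, as in the output above; `pad_to_max_length` is a hypothetical helper, not a transformers function.

```python
# Illustrative only: right-pads an encoding to a fixed length.
# `pad_to_max_length` is a hypothetical helper; pad_id=0 is assumed from
# the example output.

def pad_to_max_length(encoding, max_length, pad_id=0):
    """Append padding so every field has exactly max_length entries."""
    n_pad = max(0, max_length - len(encoding["input_ids"]))
    return {
        "input_ids": encoding["input_ids"] + [pad_id] * n_pad,
        "token_type_ids": encoding["token_type_ids"] + [0] * n_pad,
        # Padded positions get 0 so the model ignores them.
        "attention_mask": encoding["attention_mask"] + [0] * n_pad,
    }


unpadded = {
    "input_ids": [101, 17662, 18781, 2003, 2307, 999, 102],
    "token_type_ids": [0] * 7,
    "attention_mask": [1] * 7,
}
padded = pad_to_max_length(unpadded, max_length=10)
print(padded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```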

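3.5. Returning tensors (PyTorch/TensorFlow)

To feed the result to PyTorch or TensorFlow, request tensors instead of Python lists. The returned tensors carry a leading batch dimension of size 1: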

tokens = tokenizer.encode_plus(text, return_tensors="pt")  # PyTorch
print(tokens["input_ids"].shape)  # torch.Size([1, 7])

tokens = tokenizer.encode_plus(text, return_tensors="tf")  # TensorFlow
print(tokens["input_ids"].shape)  # (1, 7)

4. `encode` vs `encode_plus` vs `batch_encode_plus`

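Method	Purpose
`encode(text)`	Encodes a single text and returns only the list of token IDs
`encode_plus(text)`	Returns a dictionary with `input_ids`, `attention_mask`, and `token_type_ids`
`batch_encode_plus([text1, text2])`	Encodes multiple texts in one call

4.1. `batch_encode_plus` (batch processing)

To process several sentences at once, use batch_encode_plus: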

texts = ["Hugging Face is great!", "Transformers are amazing!"]
tokens = tokenizer.batch_encode_plus(texts, padding=True, truncation=True)
print(tokens)

Output (`token_type_ids`, which BERT's tokenizer also returns, is omitted here)

{
  'input_ids': [[101, 17662, 18781, 2003, 2307, 999, 102],
                [101, 10938, 2024, 6429, 999, 102, 0]],
  'attention_mask': [[1, 1, 1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 1, 1, 0]]
}
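
With padding=True, shorter sequences are padded only up to the longest sequence in the batch rather than to a fixed max_length. The sketch below illustrates that behavior in plain Python, assuming right-side padding and a [PAD] ID of 0. `pad_batch` is a hypothetical helper, not the library's implementation.

```python
# Illustrative only: pads a batch of ID lists to the longest sequence.
# `pad_batch` is a hypothetical helper; pad_id=0 is assumed from the output above.

def pad_batch(batch_ids, pad_id=0):
    """Right-pad each sequence to the batch maximum and build the masks."""
    longest = max((len(ids) for ids in batch_ids), default=0)
    input_ids = [list(ids) + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [101, 17662, 18781, 2003, 2307, 999, 102],  # 7 tokens
    [101, 10938, 2024, 6429, 999, 102],         # 6 tokens, needs one [PAD]
]
padded_batch = pad_batch(batch)
print(padded_batch["attention_mask"][1])  # [1, 1, 1, 1, 1, 1, 0]
```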

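5. Summary

encode_plus returns more information than encode (the attention mask and token type IDs in addition to the token IDs), which suits tasks such as text classification and question answering.

Common usage

tokenizer.encode_plus(text) → encode a text
tokenizer.encode_plus(text, add_special_tokens=False) → omit [CLS] and [SEP]
tokenizer.encode_plus(text, max_length=10, truncation=True) → truncate overlong text
tokenizer.encode_plus(sentence1, sentence2) → encode a sentence pair
tokenizer.encode_plus(text, return_tensors="pt") → return PyTorch tensors
tokenizer.batch_encode_plus([text1, text2]) → encode a batch of texts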