BPE (Byte-Pair Encoding) Code Implementation

BPE is one of the most widely used subword tokenization algorithms. Although it is greedy, it performs well and is one of the preferred tokenization methods for mainstream NLP tasks such as machine translation.

For the theory behind the BPE algorithm, see the linked post on its principles.
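Before the library-based implementation, here is a minimal, self-contained sketch of the greedy merge loop at the core of BPE training. The toy corpus and number of merges are made up for illustration only; real trainers, such as the one from the tokenizers library used below, work on byte-level pre-tokenized input and are far more optimized:

from collections import Counter

# Toy corpus: word (as a tuple of symbols) -> frequency; purely illustrative
corpus = {("l", "o", "w", "</w>"): 5,
          ("l", "o", "w", "e", "r", "</w>"): 2,
          ("n", "e", "w", "e", "s", "t", "</w>"): 6}

def get_pair_counts(corpus):
    # Count how often each adjacent symbol pair occurs, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    # Replace every occurrence of the chosen pair with a single merged symbol
    merged = {}
    new_sym = "".join(pair)
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(new_sym)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

num_merges = 5  # in practice this is driven by the target vocabulary size
for _ in range(num_merges):
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)   # greedy step: always merge the most frequent pair
    corpus = merge_pair(corpus, best)
    print(best)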

1. Byte-Pair Encoding Tokenizer Training


import pandas as pd

# Import gc, a library for controlling the garbage collector
import gc

# Import various classes and functions from the tokenizers library, which is used for creating and using custom tokenizers 
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

# Import PreTrainedTokenizerFast, a class for using fast tokenizers from the transformers library
from transformers import PreTrainedTokenizerFast

# Import TfidfVectorizer, a class for transforming text into TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer

# Import tqdm, a library for displaying progress bars 
from tqdm.auto import tqdm

# Import Dataset, a class for working with datasets in a standardized way 
from datasets import Dataset

# Set the LOWERCASE flag to False
LOWERCASE = False 

# Set VOCAB_SIZE to 10,000,000.
# This caps the BPE vocabulary at (up to) 10 million tokens; the trainer stops adding merges once the limit is reached.
VOCAB_SIZE = 10000000
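# Load the test texts and keep only the first 6666 rows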
test = pd.read_csv('data/test_text.csv').iloc[:6666]

# Create a tokenizer object using the Byte Pair Encoding (BPE) algorithm
# Define an unknown token as "[UNK]"
raw_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Normalize the text by applying Unicode Normalization Form C (NFC) and optionally lowercasing it
raw_tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFC()] + ([normalizers.Lowercase()] if LOWERCASE else [])
)

# Pre-tokenize the text at the byte level (leading spaces are encoded as the Ġ symbol seen in the output below)
raw_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

# Define the special tokens that will be used for the downstream task
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

# Create a trainer object that will train the tokenizer on the given vocabulary size and special tokens
trainer = trainers.BpeTrainer(vocab_size=VOCAB_SIZE, special_tokens=special_tokens)

# Load the test dataset from a pandas dataframe and select only the text column
dataset = Dataset.from_pandas(test[['text']])

# Define a generator function that will yield batches of text from the dataset
def train_corp_iter(): 
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

# Train the tokenizer on the batches of text using the trainer object
raw_tokenizer.train_from_iterator(train_corp_iter(), trainer=trainer)

# Wrap the raw tokenizer object into a PreTrainedTokenizerFast object that is compatible with the HuggingFace library
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)


# Initialize an empty list to store the tokenized texts for the test set
tokenized_texts_test = []

# Loop over the texts in the test set and tokenize them using the tokenizer object
for text in tqdm(test['text'].tolist()):
    tokenized_texts_test.append(tokenizer.tokenize(text))
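As a quick sanity check, the wrapped tokenizer can be applied to a single string, and the trained tokenizer can be saved for later reuse. The sample sentence and the directory name bpe_tokenizer below are illustrative choices, not part of the original pipeline:

# Quick sanity check on an illustrative sentence (hypothetical, not taken from the dataset)
sample = "Modern humans today are always on their phones."

# Sub-word pieces; the Ġ prefix marks tokens that begin with a space
print(tokenizer.tokenize(sample))

# The corresponding vocabulary ids
print(tokenizer(sample)["input_ids"])

# Persist the trained tokenizer so it can be reloaded later without retraining
tokenizer.save_pretrained("bpe_tokenizer")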

2. TF-IDF Vectorization

In general, after the text has been tokenized (compressed) with the BPE algorithm, TF-IDF vectorization is applied to turn the token sequences into a form that is convenient for modeling.
For the theory behind TF-IDF, see the linked post on its principles.
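As a reminder of what the vectorizer computes, here is a minimal by-hand sketch of the weighting scheme: smoothed idf (scikit-learn's default) combined with the sublinear tf used below and L2 normalization. The two-document toy corpus is made up for illustration:

import math

# Made-up toy corpus of already-tokenized documents
docs = [["Ġlow", "Ġlow", "Ġnewest"],
        ["Ġlow", "Ġwider"]]

n_docs = len(docs)
vocab = sorted({tok for d in docs for tok in d})

# Document frequency: in how many documents each token appears
df = {t: sum(t in d for d in docs) for t in vocab}

# Smoothed idf as used by scikit-learn by default: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

for d in docs:
    # Sublinear term frequency: 1 + ln(count), mirroring sublinear_tf=True
    tf = {t: 1 + math.log(d.count(t)) for t in set(d)}
    weights = {t: tf[t] * idf[t] for t in tf}
    # L2-normalize the document vector, as TfidfVectorizer does by default
    norm = math.sqrt(sum(w * w for w in weights.values()))
    print({t: round(w / norm, 3) for t, w in weights.items()})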

# Define a dummy function that returns the input text as it is
def dummy(text):
    return text

# Create a TfidfVectorizer that builds word-level 3- to 5-gram features from the already tokenized texts,
# using the dummy function as tokenizer and preprocessor so that sklearn does not re-tokenize the input
vectorizer = TfidfVectorizer(ngram_range=(3, 5), lowercase=False, sublinear_tf=True, 
                            analyzer = 'word',
                            tokenizer = dummy,
                            preprocessor = dummy,
                            token_pattern = None, strip_accents='unicode'
                            )

# Fit and transform the vectorizer on the tokenized texts of the test set, and get the sparse matrix of tf-idf values
tf_test = vectorizer.fit_transform(tokenized_texts_test)

# Get the vocabulary of the vectorizer, which is a dictionary of n-grams and their indices
vocab = vectorizer.vocabulary_

# Print the first 10 entries of the vocabulary
print(list(vocab.items())[:10])

# Delete the vectorizer object to free up memory
del vectorizer

# Invoke the garbage collector to reclaim unused memory
gc.collect()

[('ĠPhones Ċ Ċ', 1023935), ('Ċ Ċ Modern', 716662), ('Ċ Modern Ġhumans', 679728), ('Modern Ġhumans Ġtoday', 534040), ('Ġhumans Ġtoday Ġare', 3237252), ('Ġtoday Ġare Ġalways', 5977665), ('Ġare Ġalways Ġon', 1675978), ('Ġalways Ġon Ġtheir', 1455005), ('Ġon Ġtheir Ġphone', 4210093), ('Ġtheir Ġphone .', 5562309)]
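To make "a form convenient for modeling" concrete: tf_test is a standard scipy sparse matrix, so it can be inspected or passed directly to downstream estimators. The snippet below is only an illustrative sketch; the cosine-similarity comparison is a hypothetical downstream use, not part of the original pipeline:

# tf_test is a scipy sparse matrix: one row per document, one column per n-gram in the vocabulary
print(tf_test.shape)   # (number of test documents, vocabulary size)
print(tf_test.nnz)     # number of non-zero tf-idf entries

# Hypothetical downstream use: cosine similarity between the first two documents
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(tf_test[0], tf_test[1]))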




