BPE (Byte-Pair Encoding) Code Implementation

BPE is one of the most widely used sub-word tokenization algorithms. Although it builds its vocabulary greedily, it performs well in practice and is a preferred tokenization method for mainstream NLP tasks such as machine translation.

See the companion post for the principles behind the BPE algorithm.
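To make the greedy merge loop concrete before turning to the library-based implementation, here is a minimal sketch of BPE training in plain Python. The function name bpe_train and the toy corpus are invented for illustration; the actual training below relies on the Hugging Face tokenizers library.

from collections import Counter

def bpe_train(words, num_merges):
    # Represent each word as a tuple of symbols (initially single characters).
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Greedy step: merge the most frequent pair into one new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# On a toy corpus the first merges capture the shared prefix "low":
print(bpe_train(["low", "lower", "lowest"], 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]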

1. Byte-Pair Encoding Tokenizer Training


import pandas as pd

# Import gc, a library for controlling the garbage collector
import gc

# Import various classes and functions from the tokenizers library, which is used for creating and using custom tokenizers 
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

# Import PreTrainedTokenizerFast, a class for using fast tokenizers from the transformers library
from transformers import PreTrainedTokenizerFast

# Import TfidfVectorizer, a class for transforming text into TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer

# Import tqdm, a library for displaying progress bars 
from tqdm.auto import tqdm

# Import Dataset, a class for working with datasets in a standardized way 
from datasets import Dataset

# Set the LOWERCASE flag to False
LOWERCASE = False 

# Set VOCAB_SIZE to 10000000: the vocabulary will contain at most
# 10 million tokens (sub-words), which in practice is an effectively
# unbounded cap for most training corpora.
VOCAB_SIZE = 10000000
# Load the test-set texts from CSV.
test = pd.read_csv('data/test_text.csv')
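The snippet above stops after loading the CSV, before any training happens. A plausible continuation that wires together the classes imported above is sketched below; it is not the author's original code, and it assumes the CSV contains a text column. Names such as raw_tokenizer, train_corpus, and tokenized_texts are illustrative.

# Build a raw BPE tokenizer: byte-level pre-tokenization with an [UNK] fallback.
raw_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
raw_tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFC()] + ([normalizers.Lowercase()] if LOWERCASE else [])
)
raw_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
raw_tokenizer.decoder = decoders.ByteLevel()

# Train on the test texts, streaming them in batches to keep memory flat.
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.BpeTrainer(vocab_size=VOCAB_SIZE, special_tokens=special_tokens)

dataset = Dataset.from_pandas(test[["text"]])  # assumes a 'text' column

def train_corpus():
    for i in tqdm(range(0, len(dataset), 1000)):
        yield dataset[i : i + 1000]["text"]

raw_tokenizer.train_from_iterator(train_corpus(), trainer=trainer)

# Wrap the trained tokenizer so it exposes the familiar transformers interface.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Example: tokenize the corpus into sub-word lists.
tokenized_texts = [tokenizer.tokenize(t) for t in tqdm(test["text"].tolist())]

Setting decoders.ByteLevel() to mirror the byte-level pre-tokenizer lets decoded text round-trip cleanly. The resulting tokenized_texts list is the natural input for the TfidfVectorizer imported at the top, for example via its analyzer argument.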