Plain text
The entire corpus is a single line with no newlines; words are separated by spaces.
Method 1: torchtext
For plain-text data we typically build the dataset with LanguageModelingDataset and then create an iterator with BPTTIterator.
Note: if the text is too small, i.e. the batch_size * bptt_len configured on BPTTIterator exceeds the total text length, the seq_len of the generated batches will fall short of bptt_len.
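As a rough illustration of that caveat (toy numbers, not the real text8 sizes): BPTTIterator first splits the corpus into batch_size parallel token streams, so each batch's seq_len is capped by the length of one stream.

```python
# Hypothetical corpus size, chosen small on purpose to trigger the caveat.
total_len = 1200          # total number of tokens in the corpus
batch_size = 32
bptt_len = 50

# BPTTIterator arranges the corpus into batch_size parallel streams,
# each holding total_len // batch_size tokens.
stream_len = total_len // batch_size          # 1200 // 32 = 37

# A batch can read at most one full stream, so the effective seq_len
# is the smaller of bptt_len and stream_len.
seq_len = min(bptt_len, stream_len)
print(seq_len)  # 37, i.e. less than the requested bptt_len of 50
```

Here batch_size * bptt_len = 1600 > 1200, so batches come out with seq_len 37 instead of 50.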
For Chinese text, the tokenize function can use jieba for word segmentation:
tokenize = lambda x: jieba.lcut(x)
import torchtext
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 32
MAX_VOCAB_SIZE = 50000
tokenize = lambda x: x.split()
"""
Define the TEXT field, which specifies how the text is processed.
sequential: Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.
use_vocab: Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
tokenize: The function used to tokenize strings using this field into sequential examples. Default: string.split.
"""
TEXT = torchtext.data.Field(sequential=True, use_vocab=True, tokenize=tokenize, lower=True,
batch_first=True, init_token=None, eos_token=None)
"""
LanguageModelingDataset.splits() handles plain-text data; tokenization here simply uses str.split().
"""
train, val, test = torchtext.datasets.LanguageModelingDataset.splits(path="data",
train="text8.train.txt",
validation="text8.dev.txt",
test="text8.test.txt",
text_field=TEXT)
# the whole corpus is a single Example, so the result is 1; each Example is backed by a dict
print('total example row = ', len(train))
# print the first Example's keys, result: dict_keys(['text'])
print(train[0].__dict__.keys())
# print the first Example's values (the full token list; very long, so left commented out)
# print(train[0].__dict__.values())
# create vocabulary
TEXT.build_vocab(train, max_size=MAX_VOCAB_SIZE)
VOCAB_SIZE = len(TEXT.vocab)
print("vocabulary size: ", VOCAB_SIZE)
print(TEXT.vocab.itos[:10])
print(TEXT.vocab.stoi['apple'])
print('<unk> index is ', TEXT.vocab.stoi['<unk>'])
print('<pad> index is ', TEXT.vocab.stoi['<pad>'])
UNK_STR = TEXT.unk_token
PAD_STR = TEXT.pad_token
UNK_IDX = TEXT.vocab.stoi[UNK_STR]
PAD_IDX = TEXT.vocab.stoi[PAD_STR]
print(f'{UNK_STR} index is {UNK_IDX}')
print(f'{PAD_STR} index is {PAD_IDX}')
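As a rough sketch of what build_vocab does internally (simplified, my own code, not torchtext's): count token frequencies, keep at most max_size of the most frequent words, and reserve the special tokens at the front of the index.

```python
from collections import Counter

# Toy corpus standing in for the real token stream.
tokens = "the cat sat on the mat the cat".split()
counter = Counter(tokens)
max_size = 3

# itos: index -> string, with <unk>/<pad> reserved at indices 0 and 1,
# mirroring torchtext's default layout.
itos = ['<unk>', '<pad>'] + [w for w, _ in counter.most_common(max_size)]
# stoi: string -> index, the inverse mapping
stoi = {w: i for i, w in enumerate(itos)}
print(itos)          # ['<unk>', '<pad>', 'the', 'cat', 'sat']
print(stoi['the'])   # 2
```

Words outside the kept vocabulary would map to the `<unk>` index, just as `TEXT.vocab.stoi` does for out-of-vocabulary lookups.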
"""
Defines an iterator for language modeling tasks that use BPTT.
bptt_len: Length of sequences for backpropagation through time
repeat: Whether to repeat the iterator for multiple epochs. Default: False.
"""
train_iter, val_iter, test_iter = torchtext.data.BPTTIterator.splits((train, val, test), batch_size=BATCH_SIZE,
device=device, bptt_len=50,
repeat=False, shuffle=True)
for batch in train_iter:
    print(batch.text.shape)    # (batch=32, seq_len=50)
    print(batch.target.shape)  # (batch=32, seq_len=50)
    # decode the last example of the batch back into words
    print(" ".join(TEXT.vocab.itos[i] for i in batch.text[-1, :].data.cpu()))
    print(" ".join(TEXT.vocab.itos[i] for i in batch.target[-1, :].data.cpu()))
    break
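To make the text/target relationship concrete, here is a tiny standalone sketch (toy token ids, no torchtext needed): the target is simply the input shifted right by one position, so the model predicts the next word at every timestep.

```python
# Toy id stream standing in for the numericalized corpus.
stream = [10, 11, 12, 13, 14, 15, 16]
bptt_len = 3

text = stream[0:bptt_len]        # first bptt_len tokens
target = stream[1:bptt_len + 1]  # same window, shifted right by one
print(text)    # [10, 11, 12]
print(target)  # [11, 12, 13]
```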
Method 2: torch.utils.data
The data can also be prepared with torch's Dataset and DataLoader.
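A minimal sketch of that approach (the class name `LMDataset` and the chunking scheme are my own, not from the source): slice a numericalized token stream into fixed-length input/target pairs, then batch them with DataLoader.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LMDataset(Dataset):
    """Chunk one long stream of token ids into (input, target) pairs,
    where target is the input shifted right by one token."""
    def __init__(self, token_ids, seq_len=50):
        self.data = torch.tensor(token_ids, dtype=torch.long)
        self.seq_len = seq_len

    def __len__(self):
        # each example needs seq_len inputs plus one extra token for the target
        return (len(self.data) - 1) // self.seq_len

    def __getitem__(self, idx):
        start = idx * self.seq_len
        x = self.data[start:start + self.seq_len]
        y = self.data[start + 1:start + self.seq_len + 1]
        return x, y

# toy stream of fake token ids standing in for the numericalized text
ids = list(range(1000))
loader = DataLoader(LMDataset(ids, seq_len=50), batch_size=4, shuffle=True)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([4, 50]) torch.Size([4, 50])
```

Unlike BPTTIterator, this chunking is non-overlapping and shuffles chunks independently, which is fine for many setups but loses hidden-state continuity across batches.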