Quick Notes: Preface
Building a Vocabulary and Loading Data with TorchText 0.14.0
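I. Why write this note?
Since the upgrade to version 0.12.0, many TorchText methods have changed completely. This note walks through building a vocabulary and loading data in version 0.14.0; the same approach applies to all versions from 0.12.0 onward.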
II. Contents
1. Demo data
The demo data is a CSV file with two columns, text and label:
text,label
films adapted from comic books have had plenty of success whether they about superheroes batman superman spawn or geared toward kids casper or the arthouse crowd ghost world but there never really been a comic book like from hell before,1
for starters it was created by alan moore and eddie campbell who brought the medium to a whole new level in the mid with a part series called the watchmen,1
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd,1
the book or graphic novel if you will is over pages long and includes nearly more that consist of nothing but footnotes,1
in other words do dismiss this film because of its source,1
if you can get past the whole comic book thing you might find another stumbling block in from hell directors albert and allen hughes,1
getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in well anything but riddle me this who better to direct a film that set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society,1
the ghetto in question is of course whitechapel in london east end,1
it a filthy sooty place where the whores called unfortunates are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision,1
when the first stiff turns up copper peter godley robbie coltrane the world is not enough calls in inspector frederick abberline johnny depp blow to crack the case,1
abberline a widower has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium,1
upon arriving in whitechapel he befriends an unfortunate named mary kelly heather graham say it is so and proceeds to investigate the horribly gruesome crimes that even the police surgeon ca stomach,1
i do think anyone needs to be briefed on jack the ripper so i wo go into the particulars here other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay,1
in the comic they do bother cloaking the identity of the ripper but screenwriters terry hayes vertical limit and rafael yglesias les mis rables do a good job of keeping him hidden from viewers until the very end,1
it funny to watch the locals blindly point the finger of blame at jews and indians because after all an englishman could never be capable of committing such ghastly acts,1
and from hell ending had me whistling the stonecutters song from the simpsons for days who holds back the electric carwho made steve guttenberg a star,1
do worry it all make sense when you see it,1
now onto from hell appearance it certainly dark and bleak enough and it surprising to see how much more it looks like a tim burton film than planet of the apes did at times it seems like sleepy hollow,1
2. Code
The code is as follows (example):
# Load the data; columns: 'text', 'label'
import torch
import pandas as pd
import pkuseg
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import Dataset, DataLoader
from torchtext.transforms import Truncate, PadTransform

seg = pkuseg.pkuseg()  # pkuseg segmenter with its default model


def tokenizer(text):
    # Split a raw string into a list of tokens
    return seg.cut(text)


def yield_tokens(data_iter):
    # Yield one token list per row of the DataFrame
    for _, text in data_iter.iterrows():
        txt = text['text']
        # pandas reads empty cells as NaN (a float); skip those rows
        if isinstance(txt, float):
            continue
        yield tokenizer(txt)


train_iter = pd.read_csv('D:/Pycharmprojects/PytorchStudy/data/Movie-reviews/demo_train.csv')

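# 1. Build the vocabulary
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>', '<pad>'])
# Tokens that are not in the vocabulary map to the index of '<unk>'
vocab.set_default_index(vocab['<unk>'])
# Raw string -> list of token ids
text_pipeline = lambda x: vocab(tokenizer(x))
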
# 2. Load the data
class TextCNNDataSet(Dataset):
    def __init__(self, data, data_targets):
        self.content = data      # raw text strings
        self.pos = data_targets  # labels

    def __getitem__(self, index):
        return self.content[index], self.pos[index]

    def __len__(self):
        return len(self.pos)


device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')


def collate_batch(batch):
    label_list, text_list = [], []
    truncate = Truncate(max_seq_len=20)  # truncate to at most 20 tokens
    pad = PadTransform(max_length=20, pad_value=vocab['<pad>'])  # pad to exactly 20 tokens
    for (_text, _label) in batch:
        label_list.append(_label)
        text = text_pipeline(_text)  # string -> list of token ids
        text = truncate(text)
        text = torch.tensor(text, dtype=torch.int64)
        text = pad(text)
        text_list.append(text)
    label_list = torch.tensor(label_list, dtype=torch.int64)
    # Stack the length-20 rows into a (batch_size, 20) tensor
    text_list = torch.vstack(text_list)
    return label_list.to(device), text_list.to(device)


# Wrap the text and label columns in a Dataset (this rebinds train_iter from the DataFrame)
train_iter = TextCNNDataSet(list(train_iter['text']), list(train_iter['label']))
train_loader = DataLoader(train_iter, batch_size=8, shuffle=True, collate_fn=collate_batch)

for i, batch in enumerate(train_loader):
    # collate_batch returns (labels, token ids)
    pos, content = batch[0], batch[1]
    print(pos)
    print(content)
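
To make the vocabulary step more concrete, here is a minimal pure-Python sketch of what `build_vocab_from_iterator` with `specials` plus `set_default_index` roughly amounts to: the special tokens take the first ids, the remaining tokens are ordered by frequency, and unseen tokens fall back to the `<unk>` id. The helper names `build_vocab` and `encode` and the toy corpus are hypothetical, and the alphabetical tie-breaking is an assumption. This illustrates the idea only; it is not TorchText's implementation.

```python
from collections import Counter


def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Map tokens to ids: specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Assumption: ties in frequency are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered if tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}


def encode(tokens, stoi, unk="<unk>"):
    """Look up token ids, falling back to the <unk> id for unseen tokens."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]


# Hypothetical toy corpus, already tokenized.
corpus = [["from", "hell", "is", "dark"], ["from", "hell", "is", "bleak"]]
stoi = build_vocab(corpus)
ids = encode(["from", "hell", "is", "funny"], stoi)  # "funny" is unseen
print(stoi)
print(ids)
```

Because the two specials are listed first, `<pad>` receives id 1 here, which corresponds to the value passed to `PadTransform` as `pad_value` in the code above.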
3. Results
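Each sample in a batch ends up as a row of exactly 20 token ids, so a full batch of 8 stacks into an 8 x 20 tensor, alongside a label tensor of length 8. As a rough illustration of the truncate-then-pad step in `collate_batch`, the sketch below does the same thing on plain Python lists. The function name `truncate_and_pad` and the sample ids are hypothetical; `pad_id=1` assumes `<pad>` is the second special token, as in the vocabulary built above.

```python
def truncate_and_pad(ids, max_len=20, pad_id=1):
    """Cut a list of token ids to max_len, then right-pad with pad_id up to max_len."""
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


# Hypothetical samples: one shorter and one longer than max_len.
short = truncate_and_pad([5, 6, 7])
long = truncate_and_pad(list(range(100, 130)))

batch = [short, long]
print([len(row) for row in batch])
print(short)
```

The short sample keeps its three ids and is filled with the pad id, while the long sample simply loses everything past position 20.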
Summary
That covers the process of building a vocabulary and loading data with TorchText 0.14.0.