Building a Vocabulary and Loading Data with TorchText 0.14.0

Quick note: preface


I. Why write this note?

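Since the upgrade to version 0.12.0, many TorchText methods have changed completely. This note shows how to build a vocabulary and load data with version 0.14.0, and the same approach should apply to every release after 0.12.0.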

II. Note contents

1. Demo data

First, the demo data: a CSV file with a text column and a label column.

text,label
films adapted from comic books have had plenty of success whether they about superheroes batman superman spawn or geared toward kids casper or the arthouse crowd ghost world but there never really been a comic book like from hell before,1
for starters it was created by alan moore and eddie campbell who brought the medium to a whole new level in the mid with a part series called the watchmen,1
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd,1
the book or graphic novel if you will is over pages long and includes nearly more that consist of nothing but footnotes,1
in other words do dismiss this film because of its source,1
if you can get past the whole comic book thing you might find another stumbling block in from hell directors albert and allen hughes,1
getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in well anything but riddle me this who better to direct a film that set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society,1
the ghetto in question is of course whitechapel in london east end,1
it a filthy sooty place where the whores called unfortunates are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision,1
when the first stiff turns up copper peter godley robbie coltrane the world is not enough calls in inspector frederick abberline johnny depp blow to crack the case,1
abberline a widower has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium,1
upon arriving in whitechapel he befriends an unfortunate named mary kelly heather graham say it is so and proceeds to investigate the horribly gruesome crimes that even the police surgeon ca stomach,1
i do think anyone needs to be briefed on jack the ripper so i wo go into the particulars here other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay,1
in the comic they do bother cloaking the identity of the ripper but screenwriters terry hayes vertical limit and rafael yglesias les mis rables do a good job of keeping him hidden from viewers until the very end,1
it funny to watch the locals blindly point the finger of blame at jews and indians because after all an englishman could never be capable of committing such ghastly acts,1
and from hell ending had me whistling the stonecutters song from the simpsons for days who holds back the electric carwho made steve guttenberg a star,1
do worry it all make sense when you see it,1
now onto from hell appearance it certainly dark and bleak enough and it surprising to see how much more it looks like a tim burton film than planet of the apes did at times it seems like sleepy hollow,1

2. Code

The code is as follows (example):

# Load the data; the CSV has two columns: 'text' and 'label'
import torch
import pandas as pd
import pkuseg
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import Dataset, DataLoader
from torchtext.transforms import Truncate, PadTransform


seg = pkuseg.pkuseg()


def tokenizer(text):
    return seg.cut(text)


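def yield_tokens(data_iter):
    # Yield one token list per row of the DataFrame
    for _, row in data_iter.iterrows():
        txt = row['text']
        # pandas reads empty cells as float NaN; skip those rows
        if isinstance(txt, float):
            continue
        yield tokenizer(txt)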


train_iter = pd.read_csv('D:/Pycharmprojects/PytorchStudy/data/Movie-reviews/demo_train.csv')

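# 1. Build the vocabulary
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>', '<pad>'])
vocab.set_default_index(vocab['<unk>'])  # out-of-vocabulary tokens map to <unk>

# Convert raw text into a list of token indices
text_pipeline = lambda x: vocab(tokenizer(x))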


# 2. Data loading
class TextCNNDataSet(Dataset):
    def __init__(self, data, data_targets):
        self.content = data
        self.pos = data_targets

    def __getitem__(self, index):
        return self.content[index], self.pos[index]

    def __len__(self):
        return len(self.pos)


device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')


def collate_batch(batch):
    label_list, text_list = [], []
    truncate = Truncate(max_seq_len=20)  # keep at most 20 tokens
    pad = PadTransform(max_length=20, pad_value=vocab['<pad>'])  # pad up to 20 tokens
    for (_text, _label) in batch:
        label_list.append(_label)
        text = text_pipeline(_text)
        text = truncate(text)
        text = torch.tensor(text, dtype=torch.int64)
        text = pad(text)
        text_list.append(text)

    label_list = torch.tensor(label_list, dtype=torch.int64)

    text_list = torch.vstack(text_list)
    return label_list.to(device), text_list.to(device)


train_iter = TextCNNDataSet(list(train_iter['text']), list(train_iter['label']))
train_loader = DataLoader(train_iter, batch_size=8, shuffle=True, collate_fn=collate_batch)


for i, batch in enumerate(train_loader):
    pos, content = batch[0], batch[1]
    print(pos)
    print(content)
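
To see what the vocabulary step in the listing above produces, the following torch-free sketch mimics `build_vocab_from_iterator` with `specials` and `set_default_index` using only the standard library. The helper names `build_vocab` and `lookup` are hypothetical, `str.split` stands in for `pkuseg`, and the ordering (specials first, then descending frequency with ties broken alphabetically) is an assumption of this sketch, not a statement about torchtext internals.

```python
from collections import Counter


def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Map each token to an integer index (hypothetical helper, not torchtext)."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # Specials get their own fixed slots, so drop them from the counts.
    for special in specials:
        counter.pop(special, None)
    # Assumption of this sketch: descending frequency, ties alphabetical.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    return {tok: idx for idx, tok in enumerate(itos)}


def lookup(stoi, tokens, default_token="<unk>"):
    """Convert tokens to indices; unseen tokens fall back to the default index."""
    default_index = stoi[default_token]
    return [stoi.get(tok, default_index) for tok in tokens]


# Two short rows taken from the demo data; str.split stands in for pkuseg.
rows = [
    "do worry it all make sense when you see it",
    "in other words do dismiss this film because of its source",
]
stoi = build_vocab(row.split() for row in rows)
ids = lookup(stoi, "do see batman".split())
print(ids)
```

Here `batman` never appears in the two rows, so it receives the index of `<unk>`, which is the role `vocab.set_default_index(vocab['<unk>'])` plays in the real code.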

3. Results

[Screenshot of the printed batches; image not preserved]
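
Every row of the printed `content` tensor has exactly 20 entries, because `collate_batch` truncates each sequence and then pads it. Below is a torch-free sketch of those two steps on plain lists. The helper name `truncate_and_pad` is hypothetical, and the pad index of 1 assumes `<pad>` keeps its second position in `specials`.

```python
def truncate_and_pad(ids, max_len=20, pad_index=1):
    """Cut a list of token indices to max_len, then right-pad it to max_len."""
    # Same role as Truncate(max_seq_len=20) in the listing
    clipped = ids[:max_len]
    # Same role as PadTransform(max_length=20, pad_value=...) in the listing
    padding = [pad_index] * (max_len - len(clipped))
    return clipped + padding


short = truncate_and_pad([5, 9, 4])           # shorter than 20: gets padded
long = truncate_and_pad(list(range(2, 40)))   # longer than 20: gets cut

print(len(short), short[:5])
print(len(long), long[-1])
```

Because both cases end up with length 20, the per-sample tensors can be stacked with `torch.vstack` into a single batch tensor.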


Summary

That covers how to build a vocabulary and load data with TorchText 0.14.0.
