PyTorch Notes 6

Text Classification with the torchtext library

In this tutorial, we show how to use the torchtext library to build the dataset for text classification analysis. Users will have the flexibility to:

  • Access the raw data as an iterator
  • Build a data processing pipeline to convert the raw text strings into torch.Tensor objects that can be used to train the model
  • Shuffle and iterate over the data with torch.utils.data.DataLoader
1. Access to the raw dataset iterators

The torchtext library provides a few raw dataset iterators that yield the raw text strings. For example, the AG_NEWS dataset iterators yield the raw data as a tuple of label and text.

# Running the code from the original tutorial raises a ConnectionError
import torch
from torchtext.datasets import AG_NEWS
train_iter = AG_NEWS(split='train')

The code above fails with the error ConnectionError: ('Connection aborted.', OSError(22, 'Invalid argument'))

I therefore downloaded the AG_NEWS dataset manually, from:

https://download.csdn.net/download/hanfeixue2001/16261579?spm=1001.2014.3001.5501

from torchtext.utils import unicode_csv_reader
import io

def read_iter(path):
    # each row of the AG_NEWS CSV is (label, title, description):
    # yield the label as an int and the remaining columns joined into one text string
    with io.open(path, encoding='utf-8') as f:
        reader = unicode_csv_reader(f)
        for row in reader:
            yield int(row[0]), ' '.join(row[1:])

            
train_path = './AG_NEWS/train.csv'
test_path = './AG_NEWS/test.csv'
train_iter = read_iter(train_path)
next(train_iter)
(3,
 "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")
next(train_iter)
(3,
 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.')
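As a quick sanity check on the local files, here is a minimal sketch (AG_NEWS labels its four classes 1-4, so we expect four keys in the counter):

from collections import Counter

# count how many training examples carry each of the four labels
label_counts = Counter(label for label, _ in read_iter(train_path))
print(label_counts)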
2. Prepare data processing pipelines

We will use the most basic building blocks of the torchtext library, including the vocab, word vectors, and tokenizer, to perform basic preprocessing on the raw text strings.

Here is an example of NLP data preprocessing with the tokenizer and vocabulary. The first step is to build the vocabulary from the raw training dataset. Users can customize the vocab by passing arguments to the constructor of the Vocab class; for example, min_freq sets the minimum frequency a token must have to be included.

from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab
# get_tokenizer creates a tokenizer that splits text according to the named scheme
# supported tokenizers include 'basic_english', 'spacy', 'moses', 'toktok', 'revtok', and 'subword'
tokenizer = get_tokenizer('basic_english')
train_iter = read_iter(train_path)
counter = Counter()
for label, line in train_iter:
    # feed each line of text to the tokenizer and count the resulting tokens
    counter.update(tokenizer(line))
# Create a Vocab object from a collections.Counter
vocab = Vocab(counter, min_freq=1)
# vocab has three attributes: freqs, stoi, and itos; one of them is shown below
vocab.itos[:5]
['<unk>', '<pad>', '.', 'the', ',']
# convert tokens into integers (each token is mapped to a unique index)
[vocab[token] for token in ['here', 'is', 'an', 'example']]

[476, 22, 31, 5298]
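To see the effect of min_freq, here is a small sketch reusing the counter built above (the threshold of 10 is an arbitrary choice for illustration):

# tokens appearing fewer than 10 times are excluded from the vocabulary
vocab_min10 = Vocab(counter, min_freq=10)
print(len(vocab))        # size of the full vocabulary (min_freq=1)
print(len(vocab_min10))  # strictly smaller vocabulary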

Prepare the text processing pipeline with the tokenizer and the vocabulary.

The text and label pipelines will be used to process the raw data strings:

text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
label_pipeline = lambda x: int(x) - 1
text_pipeline('here is an example')
[476, 22, 31, 5298]
label_pipeline('3')
2
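The third bullet point from the introduction, shuffling and iterating with torch.utils.data.DataLoader, then plugs these pipelines into a collate_fn. The official tutorial concatenates the texts and tracks offsets for an EmbeddingBag model; the sketch below instead pads each batch with pad_sequence for simplicity, and batch_size=8 is an arbitrary choice:

from torch.utils.data import DataLoader

def collate_batch(batch):
    # apply the label and text pipelines to every raw (label, text) pair
    label_list, text_list = [], []
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        text_list.append(torch.tensor(text_pipeline(_text), dtype=torch.int64))
    # pad the variable-length token sequences into one rectangular tensor
    text_tensor = torch.nn.utils.rnn.pad_sequence(text_list, batch_first=True)
    return torch.tensor(label_list, dtype=torch.int64), text_tensor

# read_iter returns a generator, so materialize it before shuffling
dataloader = DataLoader(list(read_iter(train_path)), batch_size=8,
                        shuffle=True, collate_fn=collate_batch)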