PyTorch Notes 6

Text Classification with the torchtext library

In this tutorial, we show how to use the torchtext library to build the dataset for text classification analysis. Users will have the flexibility to:

  • Access the raw data as an iterator
  • Build a data processing pipeline to convert the raw text strings into torch.Tensor objects that can be used to train the model
  • Shuffle and iterate over the data with torch.utils.data.DataLoader
1. Access to the raw dataset iterators

The torchtext library provides a few raw dataset iterators that yield the raw text strings. For example, the AG_NEWS dataset iterators yield the raw data as a tuple of label and text.

# Running the code from the original tutorial raises a ConnectionError
import torch
from torchtext.datasets import AG_NEWS
train_iter = AG_NEWS(split='train')

The code above fails with the error ConnectionError: ('Connection aborted.', OSError(22, 'Invalid argument'))

I therefore downloaded the AG_NEWS dataset manually, from:

https://download.csdn.net/download/hanfeixue2001/16261579?spm=1001.2014.3001.5501

from torchtext.utils import unicode_csv_reader
import io

def read_iter(path):
    # each row of the AG_NEWS CSV is (label, title, description):
    # yield the label as an int and the remaining columns joined into one text string
    with io.open(path, encoding='utf-8') as f:
        reader = unicode_csv_reader(f)
        for row in reader:
            yield int(row[0]), ' '.join(row[1:])

            
train_path = './AG_NEWS/train.csv'
test_path = './AG_NEWS/test.csv'
train_iter = read_iter(train_path)
next(train_iter)
(3,
 "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")
next(train_iter)
(3,
 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.')
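As a quick sanity check on the local files, here is a minimal sketch (AG_NEWS labels its four classes 1-4, so we expect four keys in the counter):

from collections import Counter

# count how many training examples carry each of the four labels
label_counts = Counter(label for label, _ in read_iter(train_path))
print(label_counts)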
2. Prepare data processing pipelines

We will use the most basic building blocks of the torchtext library, including the vocab, word vectors, and tokenizer, to perform basic preprocessing on the raw text strings.

Here is an example of NLP data preprocessing with the tokenizer and vocabulary. The first step is to build the vocabulary from the raw training dataset. Users can customize the vocab by passing arguments to the constructor of the Vocab class; for example, min_freq sets the minimum frequency a token must have to be included.

from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab
# get_tokenizer creates a tokenizer that splits text according to the named scheme
# supported tokenizers include 'basic_english', 'spacy', 'moses', 'toktok', 'revtok', and 'subword'
tokenizer = get_tokenizer('basic_english')
train_iter = read_iter(train_path)
counter = Counter()
for label, line in train_iter:
    # feed each line of text to the tokenizer and count the resulting tokens
    counter.update(tokenizer(line))
# Create a Vocab object from a collections.Counter
vocab = Vocab(counter, min_freq=1)
# vocab has three attributes: freqs, stoi, and itos; one of them is shown below
vocab.itos[:5]
['<unk>', '<pad>', '.', 'the', ',']
# convert tokens into integers (each token is mapped to a unique index)
[vocab[token] for token in ['here', 'is', 'an', 'example']]

[476, 22, 31, 5298]
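To see the effect of min_freq, here is a small sketch reusing the counter built above (the threshold of 10 is an arbitrary choice for illustration):

# tokens appearing fewer than 10 times are excluded from the vocabulary
vocab_min10 = Vocab(counter, min_freq=10)
print(len(vocab))        # size of the full vocabulary (min_freq=1)
print(len(vocab_min10))  # strictly smaller vocabulary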

Prepare the text processing pipeline with the tokenizer and the vocabulary.

The text and label pipelines will be used to process the raw data strings:

text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
label_pipeline = lambda x: int(x) - 1
text_pipeline('here is an example')
[476, 22, 31, 5298]
label_pipeline('3')
2
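The third bullet point from the introduction, shuffling and iterating with torch.utils.data.DataLoader, then plugs these pipelines into a collate_fn. The official tutorial concatenates the texts and tracks offsets for an EmbeddingBag model; the sketch below instead pads each batch with pad_sequence for simplicity, and batch_size=8 is an arbitrary choice:

from torch.utils.data import DataLoader

def collate_batch(batch):
    # apply the label and text pipelines to every raw (label, text) pair
    label_list, text_list = [], []
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        text_list.append(torch.tensor(text_pipeline(_text), dtype=torch.int64))
    # pad the variable-length token sequences into one rectangular tensor
    text_tensor = torch.nn.utils.rnn.pad_sequence(text_list, batch_first=True)
    return torch.tensor(label_list, dtype=torch.int64), text_tensor

# read_iter returns a generator, so materialize it before shuffling
dataloader = DataLoader(list(read_iter(train_path)), batch_size=8,
                        shuffle=True, collate_fn=collate_batch)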