Classifying the AG_NEWS dataset with torchtext (code from the official documentation)

Original link: Text classification with the torchtext library (PyTorch Tutorials 1.11.0+cu102 documentation)

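 (1) Import the dataset. The download often fails, so here is a copy someone shared on Baidu Netdisk: https://pan.baidu.com/s/1Rz_XoaTZWSRiHGOwkACosQ (extraction code: j0no)

After downloading, put it in the directory where Jupyter Notebook is currently open. The path only needs to point to the AG_NEWS.data folder.

(Recent versions seem to require root='path'; otherwise an error is raised.)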

import torch
from torchtext.datasets import AG_NEWS
path = r'E:\Notebook\自然语言处\Text_classification_with_the_torchtext_library\AG_NEWS.data'
train_iter = iter(AG_NEWS(root=path, split='train'))
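A note on why the training split is loaded again in the next step: a plain Python iterator can only be traversed once. The sketch below shows this with a stand-in generator. The `fake_split` function and its two samples are made up for illustration, and whether the real AG_NEWS object behaves exactly this way depends on the torchtext version.

```python
def fake_split():
    """Hypothetical stand-in for a dataset split: yields (label, text) pairs."""
    yield 3, "markets close higher after earnings"
    yield 4, "new chip promises faster phones"

train_iter = iter(fake_split())
first_pass = list(train_iter)    # consumes every sample
second_pass = list(train_iter)   # nothing left: the iterator is exhausted

print(len(first_pass), len(second_pass))

# Creating the iterator again gives a fresh pass over the data.
train_iter = iter(fake_split())
label, text = next(train_iter)
print(label, text)
```

So any step that needs a full pass over the data (such as building the vocabulary) is safest with a freshly created iterator.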

 (2) Build the vocabulary

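from torchtext.data.utils import get_tokenizer  # tokenizer factory
from torchtext.vocab import build_vocab_from_iterator  # builds a vocabulary from an iterator

tokenizer = get_tokenizer('basic_english')  # create a tokenizer for English text
train_iter = AG_NEWS(root=path, split='train')  # load the training split again as a fresh iterator

def yield_tokens(data_iter):
    for _, text in data_iter:  # each sample is a (label, text) pair; the label is ignored here
        yield tokenizer(text)  # yield makes this function a generator that produces one token list per sample

# Register <unk> as the special token for unknown words
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])

# Use the index of <unk> as the default index, so any out-of-vocabulary word maps to it
vocab.set_default_index(vocab['<unk>'])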
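To make the idea behind the vocabulary and its default index concrete, here is a minimal pure-Python sketch. It is not torchtext's implementation: `build_toy_vocab`, `lookup`, the whitespace tokenizer, and the three-line corpus are all assumptions made for illustration, and the exact ordering torchtext uses may differ.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>",)):
    """Map each token to an integer index: specials first,
    then the remaining tokens from most to least frequent."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials)
    # Sort by descending frequency, breaking ties alphabetically.
    for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if tok not in specials:
            itos.append(tok)
    return {tok: idx for idx, tok in enumerate(itos)}

def lookup(stoi, tokens, default_token="<unk>"):
    """Convert tokens to indices, falling back to the default index."""
    default_index = stoi[default_token]
    return [stoi.get(tok, default_index) for tok in tokens]

corpus = ["the cat sat", "the dog sat", "the end"]
token_lists = (line.lower().split() for line in corpus)  # toy tokenizer
stoi = build_toy_vocab(token_lists)

print(stoi)
print(lookup(stoi, ["the", "cat", "flew"]))  # "flew" is unseen, so it maps to <unk>
```

The fallback in `lookup` plays the role that `set_default_index` plays above: without it, an unseen word would raise an error instead of mapping to `<unk>`.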

 (3) Get the label and text of each sample

text_pipeline = lambda x: vocab(tokenizer(x))  # convert a text into its list of token indices
label_pipeline = lambda x: int(x) - 1  # shift the 1-based label to a 0-based class index

# Demo
text_pipeline('here is the an example')
>>> [475, 21, 2, 30, 5297]
label_pipeline('10')
>>> 9
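The subtraction in label_pipeline exists because the raw AG_NEWS labels run from 1 to 4, while classification losses such as `nn.CrossEntropyLoss` expect class indices starting at 0. A small sketch of the round trip follows. The `AG_NEWS_CLASSES` dictionary and the `class_name` helper are illustrative names, and the class names are assumed to match the dataset's classes.txt.

```python
# Class names for the four AG_NEWS labels (1-based in the raw data).
AG_NEWS_CLASSES = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

def label_pipeline(raw_label):
    """Shift a 1-based label to the 0-based index a classifier expects."""
    return int(raw_label) - 1

def class_name(index):
    """Map a 0-based class index back to its human-readable name."""
    return AG_NEWS_CLASSES[index + 1]

for raw in (1, 2, 3, 4):
    idx = label_pipeline(raw)
    print(raw, "->", idx, "->", class_name(idx))
```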
AG's News Topic Classification Dataset
Version 3, Updated 09/09/2015

ORIGIN

AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .

The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

DESCRIPTION

The AG's news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.

The file classes.txt contains a list of classes corresponding to each label.

The files train.csv and test.csv contain all the training samples as comma-separated values. There are 3 columns in them, corresponding to class index (1 to 4), title and description. The title and description are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed with an "n" character, that is "\n".
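The CSV conventions described in the README can be handled with Python's standard csv module. The sketch below uses two made-up rows rather than real dataset content, and `read_rows` is a hypothetical helper name. It shows that doubled quotes are undone by the CSV reader itself, while the backslash-n sequence has to be decoded separately.

```python
import csv
import io

# Two made-up rows in the described format: class index, title, description.
sample = (
    '"3","Stocks ""rally"" on earnings","Shares rose.\\nAnalysts were surprised."\n'
    '"2","Cup final tonight","Kickoff is at 8 pm."\n'
)

def read_rows(lines):
    """Yield (label, title, description) tuples with the escapes undone."""
    for label, title, description in csv.reader(lines):
        # csv.reader already collapses doubled quotes into one;
        # the literal backslash-n still needs to become a real newline.
        yield (
            int(label),
            title.replace("\\n", "\n"),
            description.replace("\\n", "\n"),
        )

rows = list(read_rows(io.StringIO(sample)))
for row in rows:
    print(row)
```

The same function should work on an opened train.csv file object, since csv.reader accepts any iterable of lines.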
