[PyTorch] [torchtext (Part 1)] Overview and Basic Operations

I. Overview

1. Main components of torchtext

The main components of torchtext are Field, Dataset, and Iterator.

1.1 Field

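A Field is an object that describes how a column of data is processed. The processing steps are specified through its parameters, and Fields are also what turn raw records into Example objects. Here is an example of defining a Field object: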

TEXT = data.Field(sequential=True, tokenize=tokenize, lower=True, fix_length=200)
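To make these parameters more concrete, here is a minimal pure-Python sketch of roughly what a Field configured with sequential=True, lower=True and a fix_length does to a single raw string. It is not torchtext's implementation; the helper names preprocess and pad_to_length are made up for illustration, and right-padding with "<pad>" is assumed to mirror the default settings.

```python
# Simplified sketch of the per-string processing described by a Field.
# NOT torchtext's code; helper names are hypothetical.

PAD_TOKEN = "<pad>"  # assumed to match torchtext's default padding token


def preprocess(text, tokenize=str.split, lower=True):
    """Tokenize the raw string, then optionally lowercase each token."""
    tokens = tokenize(text)
    if lower:
        tokens = [tok.lower() for tok in tokens]
    return tokens


def pad_to_length(tokens, fix_length, pad_token=PAD_TOKEN):
    """Truncate or right-pad a token list to exactly fix_length items."""
    tokens = tokens[:fix_length]
    return tokens + [pad_token] * (fix_length - len(tokens))


tokens = preprocess("You, sir, are my hero.")
padded = pad_to_length(tokens, fix_length=8)
print(padded)
# ['you,', 'sir,', 'are', 'my', 'hero.', '<pad>', '<pad>', '<pad>']
```

Because the tokenizer here is a plain whitespace split, punctuation stays attached to the words, which is also what the real output later in this post shows.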

1.2 Dataset

Inherits from PyTorch's Dataset and represents a dataset. A Dataset can be viewed as a collection of Example instances.

1.3 Iterator

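An Iterator is torchtext's output to the model. It provides the usual ways of handling data, such as shuffling and sorting, and it can change the batch size dynamically.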
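As a rough illustration of the shuffling and batching part, the following sketch yields batches from a list of examples. It is only a toy stand-in for an Iterator, assuming a fixed batch size and a seeded shuffle; the function name iterate_batches is hypothetical.

```python
import random


def iterate_batches(examples, batch_size, shuffle=True, seed=0):
    """Yield lists of examples, batch_size at a time (last batch may be smaller)."""
    order = list(range(len(examples)))
    if shuffle:
        # A seeded generator keeps the sketch reproducible.
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]


examples = [f"example-{i}" for i in range(10)]
batches = list(iterate_batches(examples, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Ten examples with a batch size of 4 leave a final batch of 2, which is why the last batch an iterator yields is often smaller than the rest.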

II. Quick Start

import pandas as pd
import torch
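# Note: Field, TabularDataset, BucketIterator, etc. belong to torchtext's legacy API
# (moved to torchtext.legacy in 0.9 and removed in 0.12), so an older torchtext is needed.
from torchtext import data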
from torchtext.vocab import Vectors
from torchtext.data import TabularDataset,Dataset,BucketIterator,Iterator
from torch.nn import init
from tqdm import tqdm

2.1 Inspecting the data format

df1 = pd.read_csv('./data/train_one_label.csv').head()
df2 = pd.read_csv('./data/test.csv').head()
display(df1)
display(df2)
   id                comment_text                                       toxic
0  0000997932d777bf  Explanation\nWhy the edits made under my usern...  0
1  000103f0d9cfb60f  D'aww! He matches this background colour I'm s...  0
2  000113f07ec002fd  Hey man, I'm really not trying to edit war. It...  0
3  0001b41b1c6bb37e  "\nMore\nI can't make any real suggestions on ...  0
4  0001d958c54c6e35  You, sir, are my hero. Any chance you remember...  0

   id                comment_text
0  00001cee341fdb12  Yo bitch Ja Rule is more succesful then you'll...
1  0000247867823ef7  == From RfC == \n\n The title is fine as it is...
2  00013b17ad220c46  " \n\n == Sources == \n\n * Zawe Ashton on Lap...
3  00017563c3f7919a  :If you have a look back at the source, the in...
4  00017695ad8997eb  I don't anonymously edit articles at all.

2.2 Defining the Fields

tokenize = lambda x: x.split() # tokenize specifies how a sentence is split into tokens
# Define two Fields: one for processing the text and one for the labels
TEXT = data.Field(sequential=True, tokenize=tokenize, lower=True, fix_length=200)
LABEL = data.Field(sequential=False, use_vocab=False)

2.3 Building the Dataset

fields = [("id", None),("comment_text",TEXT),("toxic",LABEL)] # column names and their Field objects (None skips the column)
# TabularDataset: reads data from csv, tsv, or json files and builds a dataset
train, valid = TabularDataset.splits(
    path='data',
    train='train_one_label.csv',
    validation='valid_one_label.csv',
    format='csv',
    skip_header=True,
    fields=fields)

test_datafields = [('id', None),('comment_text', TEXT)]

test = TabularDataset(
    path='data/test.csv',
    format='csv',
    skip_header=True,
    fields=test_datafields
)
print(type(train))
<class 'torchtext.data.dataset.TabularDataset'>
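To show what the fields list is doing, here is a small stdlib-only sketch that reads CSV rows and keeps only the columns that have a processor attached, dropping those mapped to None. It imitates the idea behind TabularDataset rather than its actual code: the in-memory CSV, the read_examples helper, and the use of plain callables in place of Field objects are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical two-row CSV with the same columns as train_one_label.csv.
raw_csv = io.StringIO(
    "id,comment_text,toxic\n"
    "a1,Hello there,0\n"
    "b2,Go away,1\n"
)

# (column name, processor) pairs in column order; None means "skip this column".
fields = [("id", None), ("comment_text", str.split), ("toxic", str)]


def read_examples(handle, fields, skip_header=True):
    """Turn each CSV row into a dict holding only the columns with a processor."""
    reader = csv.reader(handle)
    if skip_header:
        next(reader)  # drop the header row, like skip_header=True
    examples = []
    for row in reader:
        example = {
            name: process(value)
            for (name, process), value in zip(fields, row)
            if process is not None
        }
        examples.append(example)
    return examples


examples = read_examples(raw_csv, fields)
print(examples[0])  # {'comment_text': ['Hello', 'there'], 'toxic': '0'}
```

The order of the pairs matters, since they are matched to the CSV columns by position; this is also why the test set needs its own fields list without the toxic column.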

Building the vocabulary

TEXT.build_vocab(train,valid,test,vectors='glove.6B.100d')
print(TEXT.vocab.stoi['<pad>'])
print(TEXT.vocab.stoi['<unk>'])
print(TEXT.vocab.itos[0])
print(TEXT.vocab.freqs.most_common(5))
print(vars(train.examples[0]))
1
0
<unk>
[('the', 226), ('to', 137), ('a', 90), ('is', 84), ('you', 82)]
{'comment_text': ['explanation', 'why', 'the', 'edits', 'made', 'under', 'my', 'username', 'hardcore', 'metallica', 'fan', 'were', 'reverted?', 'they', "weren't", 'vandalisms,', 'just', 'closure', 'on', 'some', 'gas', 'after', 'i', 'voted', 'at', 'new', 'york', 'dolls', 'fac.', 'and', 'please', "don't", 'remove', 'the', 'template', 'from', 'the', 'talk', 'page', 'since', "i'm", 'retired', 'now.89.205.38.27'], 'toxic': '0'}
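The printed indices (0 for <unk>, 1 for <pad>) follow from the special tokens being placed at the front of the vocabulary. The sketch below reproduces that layout with a Counter; it is an approximation, not the library's Vocab class, and both the build_vocab helper and the alphabetical tie-breaking are assumptions.

```python
from collections import Counter


def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Return (itos, stoi, freqs): specials first, then tokens by frequency."""
    freqs = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending count; ties are broken alphabetically (an assumption).
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi, freqs


corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "on", "the", "mat"]]
itos, stoi, freqs = build_vocab(corpus)
print(stoi["<unk>"], stoi["<pad>"], itos[2])  # 0 1 the

# Words that are not in the vocabulary fall back to the <unk> index.
ids = [stoi.get(tok, stoi["<unk>"]) for tok in ["the", "bird", "sat"]]
print(ids)  # [2, 0, 3]
```

itos maps an index to its token and stoi maps a token to its index, so the two stay consistent by construction.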

2.4 Creating the iterators

train_iter, valid_iter = BucketIterator.splits(
    (train, valid),
    batch_sizes=(8, 8),
    device="cpu",
    sort_key=lambda x: len(x.comment_text),
    sort_within_batch=False,
    repeat=False
)
test_iter = Iterator(test, batch_size=8, device="cpu", sort=False, sort_within_batch=False, repeat=False)
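The idea behind bucketing with a sort_key is to put examples of similar length into the same batch so that less padding is needed. The sketch below shows that effect on hypothetical token lists; bucket_batches and padding_needed are illustrative helpers, and the real BucketIterator is more elaborate than this. Note also that with fix_length=200 on TEXT, every sequence in this post is padded to 200 anyway, so the saving applies mainly when no fixed length is set.

```python
import random


def bucket_batches(examples, batch_size, sort_key, seed=0):
    """Group examples of similar length, then shuffle the order of the batches."""
    ordered = sorted(examples, key=sort_key)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches


def padding_needed(batches):
    """Count pad tokens needed if each batch is padded to its longest item."""
    return sum(max(map(len, b)) * len(b) - sum(map(len, b)) for b in batches)


# Hypothetical token lists of varying length.
examples = [["w"] * n for n in (2, 9, 3, 10, 1, 8)]
naive = [examples[i:i + 2] for i in range(0, len(examples), 2)]
bucketed = bucket_batches(examples, batch_size=2, sort_key=len)
print(padding_needed(naive), padding_needed(bucketed))  # 21 7
```

Shuffling whole batches rather than individual examples keeps the length grouping intact while still varying the order in which batches are seen.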

Using the iterators

for idx, batch in enumerate(train_iter):
    print(batch)
    print(batch.__dict__.keys())
    text, label = batch.comment_text, batch.toxic
    print(text.shape, label.shape)
[torchtext.data.batch.Batch of size 1]
	[.comment_text]:[torch.LongTensor of size 200x1]
	[.toxic]:[torch.LongTensor of size 1]
dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'comment_text', 'toxic'])
torch.Size([200, 1]) torch.Size([1])

[torchtext.data.batch.Batch of size 8]
	[.comment_text]:[torch.LongTensor of size 200x8]
	[.toxic]:[torch.LongTensor of size 8]
dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'comment_text', 'toxic'])
torch.Size([200, 8]) torch.Size([8])

[torchtext.data.batch.Batch of size 8]
	[.comment_text]:[torch.LongTensor of size 200x8]
	[.toxic]:[torch.LongTensor of size 8]
dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'comment_text', 'toxic'])
torch.Size([200, 8]) torch.Size([8])

[torchtext.data.batch.Batch of size 8]
	[.comment_text]:[torch.LongTensor of size 200x8]
	[.toxic]:[torch.LongTensor of size 8]
dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'comment_text', 'toxic'])
torch.Size([200, 8]) torch.Size([8])
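The text tensor has shape [200, 8], that is, sequence length first and batch size second, which is the layout a Field produces when batch_first is left at its default of False. Here is a minimal sketch of how padded id sequences end up in that layout, using nested lists instead of tensors; the token ids and the make_batch helper are hypothetical.

```python
# Sketch of how padded, numericalized sequences become a
# [fix_length, batch_size] batch. Token ids and sizes are hypothetical.

PAD_ID = 1  # same index as printed for '<pad>' in the vocabulary above


def make_batch(id_sequences, fix_length):
    """Pad each sequence to fix_length, then transpose to sequence-first layout."""
    padded = [
        # A negative multiplier yields an empty list, so long inputs are only truncated.
        seq[:fix_length] + [PAD_ID] * (fix_length - len(seq))
        for seq in id_sequences
    ]
    # zip(*padded) turns rows of [batch, length] into rows of [length, batch].
    return [list(column) for column in zip(*padded)]


batch = make_batch([[5, 6, 7], [8, 9], [4]], fix_length=4)
print(len(batch), len(batch[0]))  # 4 3
print(batch)  # [[5, 8, 4], [6, 9, 1], [7, 1, 1], [1, 1, 1]]
```

Each inner list holds the tokens at one time step across the whole batch, so with fix_length=200 and a batch size of 8 the result would be 200 rows of 8 ids.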