torchtext.data.utils

get_tokenizer(tokenizer, language='en')

Purpose: tokenize a sentence with the specified tokenizer.

Parameters:

  • tokenizer: the name of the tokenizer.

    If None is passed, the result is equivalent to a plain whitespace split (str.split()); punctuation is not separated.

    If "basic_english" is passed, the text is lower-cased and split on whitespace, and punctuation is separated into its own tokens.

    If a callable is passed, it is returned as-is and used as the tokenizer.

    If the name of a tokenizer library is passed, the corresponding library tokenizer is returned; supported libraries include spacy, moses, toktok, revtok, and subword.

  • language: the language, "en" by default. This only matters for library-backed tokenizers; see the sketch below.
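
A minimal sketch of the language argument with the spacy backend, assuming the spacy package and a German pipeline are installed (the pipeline name de_core_news_sm is an illustrative assumption, not part of the original):

from torchtext.data.utils import get_tokenizer

# Assumes: pip install spacy && python -m spacy download de_core_news_sm
# (the German pipeline name is an illustrative assumption)
de_tokenizer = get_tokenizer("spacy", language="de_core_news_sm")
print(de_tokenizer("Das Tokenisieren ist einfach!"))
# Roughly expected: ['Das', 'Tokenisieren', 'ist', 'einfach', '!']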

Example:

import torchtext
from torchtext.data.utils import get_tokenizer

# Passing None: plain whitespace split
tokenizer = get_tokenizer(None)
tokens = tokenizer("You can,\t\tnow\n\ninstall TorchText using pip!!!")
print(tokens)

# Passing "basic_english": lower-cases and separates punctuation
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer("You can,\t\tnow\n\ninstall TorchText using pip!!!")
print(tokens)

# Passing a callable: it is used directly as the tokenizer
def mySplit(text: str):
    return text.split(" ")

tokenizer = get_tokenizer(mySplit)
tokens = tokenizer("You can,\t\tnow\n\ninstall TorchText using pip!!!")
print(tokens)

# Passing a tokenizer library name
tokenizer = get_tokenizer("moses")
tokens = tokenizer("You can,\t\tnow\n\ninstall TorchText using pip!!!")
print(tokens)

Output:

['You', 'can,', 'now', 'install', 'TorchText', 'using', 'pip!!!']
['you', 'can', ',', 'now', 'install', 'torchtext', 'using', 'pip', '!', '!', '!']
['You', 'can,\t\tnow\n\ninstall', 'TorchText', 'using', 'pip!!!']
['You', 'can', ',', 'now', 'install', 'TorchText', 'using', 'pip', '!', '!', '!']
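
Note: the "moses" option is backed by the sacremoses package, so the last snippet assumes it has been installed (pip install sacremoses); likewise, "spacy" requires the spacy package and a downloaded pipeline.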

ngrams_iterator(token_list, ngrams)

Purpose: generate a bag of n-gram tokens from a list of tokens.

Parameters:

  • token_list: a list of tokens, e.g. the output of a tokenizer.
  • ngrams: the maximum n-gram size.

Example:

import torchtext
from torchtext.data.utils import get_tokenizer, ngrams_iterator

tokenizer = get_tokenizer("basic_english")
tokens = tokenizer("You can\t\tnow")

print(list(ngrams_iterator(tokens, 3)))

Output:

['you', 'can', 'now', 'you can', 'can now', 'you can now']
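
For intuition, the output order (all unigrams first, then the 2-grams, then the 3-grams) matches the following rough re-implementation. This is an illustrative sketch of the observable behavior, not the library's actual source:

def my_ngrams_iterator(token_list, ngrams):
    # Hypothetical helper for illustration only.
    # Yield every original token, then all space-joined
    # k-grams for k = 2 .. ngrams, in increasing k order.
    for token in token_list:
        yield token
    for k in range(2, ngrams + 1):
        for i in range(len(token_list) - k + 1):
            yield " ".join(token_list[i:i + k])

print(list(my_ngrams_iterator(["you", "can", "now"], 3)))
# ['you', 'can', 'now', 'you can', 'can now', 'you can now']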