Processing NLP data in PyTorch with Dataset and DataLoader

The data used here is LCQMC, a Chinese short-text matching dataset.

The task is to decide whether two texts are similar: the label is 1 if they are similar and 0 if they are not.
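Each line holds two sentences and a label, roughly in this tab-separated form (illustrative rows, not necessarily verbatim LCQMC lines):

text_a                              text_b                              label
喜欢打篮球的男生喜欢什么样的女生    爱打篮球的男生喜欢什么样的女生    1
今天天气怎么样                      明天天气好吗                        0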

I won't go into the preprocessing here; it is just tokenization, vocabulary building, and the like. This post only covers how to use Dataset and DataLoader.
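For context only, here is a minimal sketch of what such preprocessing might look like. The character-level tokenization and the helper names build_vocab / to_ids are my own illustrative assumptions, not code from the original post; they just show the kind of token-id lists the Dataset below expects.

def build_vocab(sentences, pad_idx=0, unk_idx=1):
    # hypothetical helper: map every character to an integer id
    vocab = {'<pad>': pad_idx, '<unk>': unk_idx}
    for sent in sentences:
        for ch in sent:                  # character-level "tokenization"
            vocab.setdefault(ch, len(vocab))
    return vocab

def to_ids(sentences, vocab):
    # hypothetical helper: convert each sentence into a list of token ids
    return [[vocab.get(ch, vocab['<unk>']) for ch in sent] for sent in sentences]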

First, we define a class that inherits from Dataset:

import torch
from torch.utils.data import Dataset, DataLoader

class DatasetIterater(Dataset):
    def __init__(self, texta, textb, label):
        self.texta = texta
        self.textb = textb
        self.label = label

    def __getitem__(self, item):
        # returns one sample (texta, textb, label), not a batch
        return self.texta[item], self.textb[item], self.label[item]

    def __len__(self):
        # total number of samples
        return len(self.texta)

Since we inherit from Dataset, we have to implement its methods.

The first needs no explanation: it is the initializer.

The second is the indexing method __getitem__: each call returns a single sample, not a batch (I initially assumed it returned a whole batch); the quick check below illustrates this.

The third, __len__, returns the number of samples.
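As a quick check (the toy token ids and labels below are made up for illustration):

toy_data = DatasetIterater([[1, 2, 3], [4, 5]], [[6, 7], [8, 9, 10]], [1, 0])
print(len(toy_data))  # 2
print(toy_data[0])    # ([1, 2, 3], [6, 7], 1)  <- a single sample, not a batch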

For NLP data, however, you usually need to pad the sequences in a batch to the same length by appending 0s to the shorter ones, so we implement our own collate_fn to do the padding:

def collate_fn(batch_data, pad=0):
    # batch_data looks like [(texta_1, textb_1, label_1), (texta_2, textb_2, label_2), ...],
    # so zip(*batch_data) unpacks it into separate texta, textb and label tuples
    texta, textb, label = list(zip(*batch_data))
    max_len_a = max([len(seq_a) for seq_a in texta])
    max_len_b = max([len(seq_b) for seq_b in textb])
    # pad to the longest text_a or text_b in this batch; a fixed length would also work
    max_len = max(max_len_a, max_len_b)
    texta = [seq + [pad] * (max_len - len(seq)) for seq in texta]
    textb = [seq + [pad] * (max_len - len(seq)) for seq in textb]
    texta = torch.LongTensor(texta)
    textb = torch.LongTensor(textb)
    label = torch.FloatTensor(label)
    return (texta, textb, label)
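To see what collate_fn does, here is a toy batch (made-up token ids) run through it:

batch = [([1, 2, 3], [4, 5], 1), ([6], [7, 8, 9, 10], 0)]
texta, textb, label = collate_fn(batch)
print(texta)  # tensor([[1, 2, 3, 0], [6, 0, 0, 0]])
print(textb)  # tensor([[4, 5, 0, 0], [7, 8, 9, 10]])
print(label)  # tensor([1., 0.])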

Now we can pass our data into the DatasetIterater class we just implemented:

train_data = DatasetIterater(train_texta, train_textb, train_label)

Then we wrap it with a DataLoader, which returns one batch of data at a time. Here, shuffle controls whether the data is reshuffled: training data is usually shuffled, the validation set may or may not be, and the test set must not be shuffled (otherwise the predictions no longer line up with the original order). The collate_fn argument is the function we defined above.

DataLoader also has a num_workers argument, which sets how many worker processes are used to load the data (0 means loading in the main process).

train_loader = DataLoader(dataset=train_data, batch_size=args.batch_size, shuffle=True, collate_fn=collate_fn)

Finally, we can iterate over the loader and feed each batch into the model:

for batch_data in tqdm(train_loader):
    texta, textb, tag = map(lambda x: x.to(device), batch_data)
    output = model(texta, textb)
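For completeness, here is a minimal sketch of how the rest of the training step could look. The model itself, the learning rate, and the choice of BCEWithLogitsLoss are assumptions for illustration (the original post does not show them); BCEWithLogitsLoss is used here simply because collate_fn returns the labels as a FloatTensor.

import torch.nn as nn
from tqdm import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)                    # `model` is assumed to be defined elsewhere
criterion = nn.BCEWithLogitsLoss()          # matches the FloatTensor labels from collate_fn
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
for batch_data in tqdm(train_loader):
    texta, textb, tag = map(lambda x: x.to(device), batch_data)
    output = model(texta, textb)            # assumed to return one similarity logit per pair
    loss = criterion(output, tag)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()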

 

