Chinese News Sentiment Classification with BERT (PyTorch + transformers)

This post uses the PyTorch framework, the transformers package, and BERT's Chinese pre-trained model.

Text classification asks the model to extract the semantics of a sequence and learn what distinguishes texts of different categories; it is one of the more approachable entry-level tasks in natural language processing.

1. Data Preprocessing

Machine learning usually starts with data preprocessing, for example Chinese word segmentation, stop-word removal, and manual denoising.
Taking the data used in this post as an example, the dataset is far from perfect: sentences contain many meaningless words that have nothing to do with the sentence itself, and there are also many empty records. All of these should be handled.
In addition, Chinese NLP usually requires word segmentation and building a vocabulary; the model in this post uses BERT's built-in tokenizer and vocabulary, so that step is skipped here.
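
To make the cleaning step concrete, here is a minimal sketch (a hypothetical helper, not the exact pipeline used later in NewsData.py); it assumes, like NewsData.py, that the last two comma-separated fields of each row are the title and the content, and it simply drops empty records:

# clean_sketch.py -- hypothetical illustration only
import csv

def clean_rows(path):
    rows = []
    with open(path, encoding='UTF-8') as f:
        for i, fields in enumerate(csv.reader(f)):
            if i == 0 or len(fields) < 2:       # skip the header and malformed rows
                continue
            title, content = fields[-2].strip(), fields[-1].strip()
            if not title and not content:       # drop completely empty records
                continue
            rows.append((title, content))
    return rows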

2. Word Representation

In natural language processing the word is usually treated as the smallest unit, and the model's input is a vector representation of each word, commonly obtained with an Embedding layer; this post again relies on the vocabulary of the pre-trained BERT model.
Because text classification learns the semantics of the whole sentence, common machine-learning techniques such as dimensionality reduction and feature analysis can also be applied on top of it.
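
To see what BERT's built-in tokenizer and vocabulary produce (Chinese text is tokenized character by character), a quick sketch with an arbitrary example sentence:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
print(tokenizer.tokenize("今天的新闻令人振奋"))                          # ['今', '天', '的', ...]
print(tokenizer.encode("今天的新闻令人振奋", add_special_tokens=True))   # [101, ..., 102]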

3. Building the Dataset

In PyTorch we can wrap the data in a Dataset so that a DataLoader can later feed it straight into the model; in image processing, torchvision transform operations can also be plugged directly into this mechanism.
I also put the data preprocessing steps directly inside the NewsData class.
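
The pattern NewsData follows is the standard Dataset/DataLoader combination; a stripped-down sketch with a hypothetical ToyTextSet and hard-coded toy tensors:

import torch
from torch.utils.data import DataLoader, Dataset

class ToyTextSet(Dataset):
    # hypothetical toy dataset: pre-encoded token id tensors plus labels
    def __init__(self, encoded, labels):
        self.encoded, self.labels = encoded, labels
    def __getitem__(self, index):
        return self.encoded[index], self.labels[index]
    def __len__(self):
        return len(self.labels)

encoded = [torch.tensor([101, 791, 1921, 102]), torch.tensor([101, 1920, 1957, 102])]
labels = [torch.tensor(1), torch.tensor(0)]
loader = DataLoader(ToyTextSet(encoded, labels), batch_size=2, shuffle=True)
for x, y in loader:
    print(x.shape, y.shape)   # torch.Size([2, 4]) torch.Size([2])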

4. Model

I will not go over BERT itself here; all we need to know is that the pre-trained BERT model outputs a representation vector for every token in the sentence, and that sequences must be padded to a fixed length before being fed to the model.
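
Since the dataset class later pads sequences by hand, here is a short sketch of that fixed-length scheme (100 content tokens plus [CLS] and [SEP], the same 102-token layout used in NewsData.py; 101, 102 and 0 are the [CLS], [SEP] and [PAD] ids of bert-base-chinese):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
ids = tokenizer.encode("一条很短的新闻标题", add_special_tokens=False)
max_len = 100
ids = ids[:max_len] + [0] * (max_len - min(len(ids), max_len))   # truncate or pad to 100
ids = [101] + ids + [102]                                        # add [CLS] / [SEP]
print(len(ids))   # 102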

# model.py
# The original model from the transformers library, shown here for reference:
# https://huggingface.co/transformers/_modules/transformers/modeling_bert.html#BertForSequenceClassification

import torch.nn as nn
from torch.nn import CrossEntropyLoss, MSELoss
from transformers import BertModel, BertPreTrainedModel

class BertForSequenceClassification(BertPreTrainedModel):
    r"""
        **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
            Labels for computing the sequence classification/regression loss.
            Indices should be in ``[0, ..., config.num_labels - 1]``.
            If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
            If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).

    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
            Classification (or regression if config.num_labels==1) loss.
        **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``
            Classification (or regression if config.num_labels==1) scores (before SoftMax).
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

    Examples::

        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
        outputs = model(input_ids, labels=labels)
        loss, logits = outputs[:2]

    """
    def __init__(self, config):
        super(BertForSequenceClassification, self).__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)

        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
                position_ids=None, head_mask=None, inputs_embeds=None, labels=None):

        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            position_ids=position_ids,
                            head_mask=head_mask,
                            inputs_embeds=inputs_embeds)

        pooled_output = outputs[1]

        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

        if labels is not None:
            if self.num_labels == 1:
                #  We are doing regression
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)

BERT's pooled output is passed through a Dropout layer and a Linear classification layer; when labels are provided, the model also computes the loss for us, which we can use directly.
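
As a hedged usage sketch matching the settings used later in main.py (the Chinese pre-trained weights and three sentiment classes), loading the model and reading the loss and logits out of the returned tuple looks roughly like this:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=3)

input_ids = torch.tensor(tokenizer.encode("这条新闻很正面", add_special_tokens=True)).unsqueeze(0)  # batch size 1
labels = torch.tensor([2]).unsqueeze(0)
loss, logits = model(input_ids, labels=labels)[:2]
print(loss.item(), logits.shape)   # scalar loss, torch.Size([1, 3])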

5. Results

Because the model is pre-trained, it reaches decent results with very little fine-tuning; for the same reason, results can quickly become worse if the fine-tuning is not done properly.

GitHub:

https://github.com/Toyhom/Hei_Dong/tree/master/Project/中文文本分类

Source code

Directory layout:

data/
    Train_DataSet.csv
    Train_DataSet_Label.csv
main.py
NewsData.py

#main.py
from transformers import BertTokenizer
from transformers import BertForSequenceClassification
from transformers import BertConfig
from transformers import BertPreTrainedModel
import torch
import torch.nn as nn
from transformers import BertModel
import time
import argparse
from NewsData import NewsData
import os

def get_train_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--batch_size', type=int, default=10, help='batch size')
    parser.add_argument('--nepoch', type=int, default=3, help='number of training epochs')
    parser.add_argument('--lr', type=float, default=0.001, help='learning rate')
    # note: argparse's type=bool treats any non-empty string as True, so leave the flag unset to keep the default
    parser.add_argument('--gpu', type=bool, default=True, help='whether to use the GPU')
    parser.add_argument('--num_workers', type=int, default=2, help='number of DataLoader worker processes')
    parser.add_argument('--num_labels', type=int, default=3, help='number of classes')
    parser.add_argument('--data_path', type=str, default='./data', help='path to the data directory')
    opt = parser.parse_args()
    print(opt)
    return opt

def get_model(opt):
    # the class method .from_pretrained() loads the pre-trained weights; num_labels is the number of classes
    model = BertForSequenceClassification.from_pretrained('bert-base-chinese',num_labels=opt.num_labels)
    return model

def get_data(opt):
    # NewsData subclasses PyTorch's Dataset (see NewsData.py)
    trainset = NewsData(opt.data_path,is_train = 1)
    trainloader=torch.utils.data.DataLoader(trainset,batch_size=opt.batch_size,shuffle=True,num_workers=opt.num_workers)
    testset = NewsData(opt.data_path,is_train = 0)
    testloader=torch.utils.data.DataLoader(testset,batch_size=opt.batch_size,shuffle=False,num_workers=opt.num_workers)
    return trainloader,testloader



    
def train(epoch,model,trainloader,testloader,optimizer,opt):
    print('\ntrain-Epoch: %d' % (epoch+1))
    model.train()
    start_time = time.time()
    print_step = int(len(trainloader)/10)
    for batch_idx,(sue,label,posi) in enumerate(trainloader):
        if opt.gpu:
            sue = sue.cuda()
            posi = posi.cuda()
            label = label.unsqueeze(1).cuda()
        
        
        optimizer.zero_grad()
        # inputs: token id tensor, position id tensor, and labels
        outputs = model(sue, position_ids=posi,labels = label) 

        loss, logits = outputs[0],outputs[1]
        loss.backward()
        optimizer.step()
        
        if batch_idx % print_step == 0:
            print("Epoch:%d [%d|%d] loss:%f" %(epoch+1,batch_idx,len(trainloader),loss.mean()))
    print("time:%.3f" % (time.time() - start_time))


def test(epoch,model,trainloader,testloader,opt):
    print('\ntest-Epoch: %d' % (epoch+1))
    model.eval()
    total=0
    correct=0
    with torch.no_grad():
        for batch_idx,(sue,label,posi) in enumerate(testloader):
            if opt.gpu:
                sue = sue.cuda()
                posi = posi.cuda()
                labels = label.unsqueeze(1).cuda()
                label = label.cuda()
            else:
                labels = label.unsqueeze(1)
            
            outputs = model(sue, labels=labels)
            loss, logits = outputs[:2]
            _,predicted=torch.max(logits.data,1)


            total+=sue.size(0)
            correct+=predicted.data.eq(label.data).cpu().sum()
    
    s = ("Acc:%.3f" %((1.0*correct.numpy())/total))
    print(s)


if __name__=='__main__':
        opt = get_train_args()
        model = get_model(opt)
        trainloader,testloader = get_data(opt)
        
        if opt.gpu:
            model.cuda()
        
        optimizer=torch.optim.SGD(model.parameters(),lr=opt.lr,momentum=0.9)
        
        if not os.path.exists('./model.pth'):
            for epoch in range(opt.nepoch):
                train(epoch,model,trainloader,testloader,optimizer,opt)
                test(epoch,model,trainloader,testloader,opt)
            torch.save(model.state_dict(),'./model.pth')
        else:
            model.load_state_dict(torch.load('model.pth'))
            print('Model file found; running test only')
            test(0,model,trainloader,testloader,opt)


    
#NewsData.py

from transformers import BertTokenizer
from transformers import BertForSequenceClassification
from transformers import BertConfig
from transformers import BertPreTrainedModel
import torch
import torch.nn as nn
from transformers import BertModel
import time
import argparse

class NewsData(torch.utils.data.Dataset):
    def __init__(self,root,is_train = 1):
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
        self.data_num = 7346
        self.x_list = []
        self.y_list = []
        self.posi = []
        
        with open(root + '/Train_DataSet.csv',encoding='UTF-8') as f:
            for i in range(self.data_num+1):
                # strip the trailing newline and append a neutral filler phrase,
                # so rows with an empty content field still yield some text
                line = f.readline()[:-1] + '这是一个中性的数据'

                
                # take the last two comma-separated fields of the row (title and content)
                parts = line.split(',')
                data_one_str = parts[-2]
                data_two_str = parts[-1]

                # if the title is very short, append up to 200 characters of the content
                if len(data_one_str) < 6:
                    data_one_str = data_one_str + ',' + data_two_str[0:min(200, len(data_two_str))]

                # skip the header row
                if i == 0:
                    continue
                

                # encode without special tokens, then truncate / zero-pad to exactly 100 ids
                word_l = self.tokenizer.encode(data_one_str, add_special_tokens=False)
                if len(word_l) < 100:
                    while len(word_l) != 100:
                        word_l.append(0)               # 0 is the [PAD] id
                else:
                    word_l = word_l[0:100]

                # wrap the sequence with the special tokens: 101 = [CLS], 102 = [SEP]
                word_l.append(102)
                word_l = [101] + word_l

                self.x_list.append(torch.tensor(word_l))
                # position ids 0..101 for the final length-102 sequence
                self.posi.append(torch.tensor(list(range(102))))
                
        with open(root + '/Train_DataSet_Label.csv', encoding='UTF-8') as f:
            for i in range(self.data_num + 1):
                # the label is the single character right before the trailing newline
                label_one = f.readline()[-2]
                if i == 0:
                    continue                           # skip the header row
                self.y_list.append(torch.tensor(int(label_one)))
                
        
        # split: the first 6000 samples form the training set, the rest the test set
        if is_train == 1:  
            self.x_list = self.x_list[0:6000]
            self.y_list = self.y_list[0:6000]
            self.posi = self.posi[0:6000]
        else:
            self.x_list = self.x_list[6000:]
            self.y_list = self.y_list[6000:]
            self.posi = self.posi[6000:]
        
        self.len = len(self.x_list)
        
        
        
        
    def __getitem__(self, index):
        return self.x_list[index], self.y_list[index],self.posi[index]
             
    def __len__(self):
        return self.len
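
A quick sanity check of the dataset class (assuming the two CSV files are present under ./data, as in the repository):

from NewsData import NewsData

trainset = NewsData('./data', is_train=1)
x, y, posi = trainset[0]
print(len(trainset), x.shape, y, posi.shape)   # 6000, torch.Size([102]), a label tensor, torch.Size([102])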

