Chinese Text Classification with BERT


1. Using the pre-trained BERT model

Install transformers

pip install transformers

BertTokenizer

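The BERT tokenizer ships with BERT's vocabulary. To vectorize text, each token must first be mapped to its index in the vocabulary, and that index is then used to look up the token's embedding.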

from transformers import BertTokenizer
import torch

tokenizers = BertTokenizer.from_pretrained('bert-base-uncased')  # load the tokenizer that matches the base model
print(tokenizers)

token = tokenizers.tokenize("I love music")
print(token)

indexes = tokenizers.convert_tokens_to_ids(token)
print(indexes)

id2token = tokenizers.convert_ids_to_tokens(indexes)
print(id2token)

encoder = tokenizers.encode('I love music')
print(encoder)

encoder_tensor = torch.tensor(encoder)
print(f"encoder_tensor: {encoder_tensor}, the size of encoder_tensor is: {encoder_tensor.size()}")

cls = tokenizers.convert_tokens_to_ids('[CLS]')  # public API instead of the private _convert_token_to_id
sep = tokenizers.convert_tokens_to_ids("[SEP]")

print(cls, sep)

'''
PreTrainedTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
['i', 'love', 'music']
[1045, 2293, 2189]
['i', 'love', 'music']
[101, 1045, 2293, 2189, 102]
encoder_tensor: tensor([ 101, 1045, 2293, 2189,  102]), the size of encoder_tensor is: torch.Size([5])
101 102
'''
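To make the vocabulary lookup concrete, here is a minimal pure-Python sketch of what the tokenizer does for this sentence. The tiny `toy_vocab` and the helper names `toy_tokenize`, `toy_convert_tokens_to_ids` and `toy_encode` are hypothetical and exist only for illustration. The ids are copied from the output above, and the real tokenizer additionally performs WordPiece sub-word splitting, which this sketch leaves out.

```python
# Toy vocabulary: token -> id. The ids are copied from the printed output above;
# the real bert-base-uncased vocabulary has 30522 entries.
toy_vocab = {
    "[UNK]": 100,
    "[CLS]": 101,
    "[SEP]": 102,
    "i": 1045,
    "love": 2293,
    "music": 2189,
}


def toy_tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real WordPiece tokenization)."""
    return text.lower().split()


def toy_convert_tokens_to_ids(tokens):
    """Look up each token in the vocabulary; unknown tokens map to [UNK]."""
    return [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in tokens]


def toy_encode(text):
    """Tokenize, convert to ids, and wrap the result in [CLS] ... [SEP]."""
    ids = toy_convert_tokens_to_ids(toy_tokenize(text))
    return [toy_vocab["[CLS]"]] + ids + [toy_vocab["[SEP]"]]


tokens = toy_tokenize("I love music")
ids = toy_convert_tokens_to_ids(tokens)
encoded = toy_encode("I love music")

print(tokens)
print(ids)
print(encoded)
```

The sketch shows why `encode` returns two more ids than `convert_tokens_to_ids`: the extra ids are the [CLS] and [SEP] markers.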

BertModel

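Sentences fed to BertModel need [CLS] at the start and [SEP] at the end for BERT to work correctly. You can add them yourself, or call BertTokenizer.encode, which adds them automatically.

Once the text has been converted to ids, it can be passed to BERT to obtain the sentence encoding and the per-token vectors.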

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained('bert-base-uncased')

music = torch.tensor(tokenizer.encode('I love music')).unsqueeze(0)  # batch_size=1
bert = torch.tensor(tokenizer.encode("I am using the bert model")).unsqueeze(0)  # batch_size=1

output_music = model(music)

music_word_embedding = output_music[0]  # per-token vectors (last hidden state)
music_sentence_embedding = output_music[1]  # sentence vector (pooled [CLS] output)

print(f"word embedding: {music_word_embedding.shape}")
print(f"sentence embedding: {music_sentence_embedding.shape}")

'''
word embedding: torch.Size([1, 5, 768])
sentence embedding: torch.Size([1, 768])
'''

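[CLS] and [SEP] are also converted to vectors in the output, so index 0 of the sequence dimension is [CLS] and index 1 is the first real token.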

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

morning = torch.tensor(tokenizer.encode('good morning')).unsqueeze(0)
night = torch.tensor(tokenizer.encode('good night')).unsqueeze(0)

morning_embedding = model(morning)
night_embedding = model(night)

print(f"morning good: {morning_embedding[0][0][1]}")
print(f"night good: {night_embedding[0][0][1]}")

'''
morning good: tensor([-5.8899e-02,  4.4347e-02,  8.4845e-01, -1.1691e+00, -4.5506e-01,
         6.7695e-02, -6.9360e-02,  1.6030e+00, -1.2251e-01, -1.6857e+00,
         1.2733e-01, -4.0276e-01,  1.2708e-01,  1.2586e-01, -1.8597e-01,

night good: tensor([ 1.0142e+00,  1.2357e-01,  1.2800e+00, -1.2907e+00, -4.6667e-01,
        -1.9426e-01,  5.0342e-01,  1.7515e+00,  6.8127e-02, -1.3032e+00,
        -9.8331e-02,  2.1038e-01,  6.1536e-02,  1.5393e-02, -3.9809e-01,
'''

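As the output shows, the same word gets a different vector in different sentences. BERT's token vectors depend on context, which is what lets it handle words with more than one meaning.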
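One way to quantify how far apart the two "good" vectors are is cosine similarity. The sketch below is an illustration only: it uses just the first 15 of the 768 components, copied from the truncated output above, so the number it prints is not the similarity of the full vectors. The helper name `cosine_similarity` is defined here and is not part of transformers.

```python
import math

# First 15 of 768 components of the "good" vectors, copied from the output above.
good_in_morning = [
    -5.8899e-02, 4.4347e-02, 8.4845e-01, -1.1691e+00, -4.5506e-01,
    6.7695e-02, -6.9360e-02, 1.6030e+00, -1.2251e-01, -1.6857e+00,
    1.2733e-01, -4.0276e-01, 1.2708e-01, 1.2586e-01, -1.8597e-01,
]
good_in_night = [
    1.0142e+00, 1.2357e-01, 1.2800e+00, -1.2907e+00, -4.6667e-01,
    -1.9426e-01, 5.0342e-01, 1.7515e+00, 6.8127e-02, -1.3032e+00,
    -9.8331e-02, 2.1038e-01, 6.1536e-02, 1.5393e-02, -3.9809e-01,
]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


similarity = cosine_similarity(good_in_morning, good_in_night)
print(f"cosine similarity over the first 15 components: {similarity:.4f}")
```

On the full tensors the same comparison could be made with torch.nn.functional.cosine_similarity(morning_embedding[0][0][1], night_embedding[0][0][1], dim=0).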

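2. Chinese text classification with BERT

BertForSequenceClassification adds a classification head on top of BertModel: a dropout layer followed by a linear layer applied to the pooled output. We load the model with model = BertForSequenceClassification.from_pretrained("hfl/chinese-bert-wwm", config=config), so the BERT weights come from the checkpoint while the weights of the linear layer are randomly initialized. The goal of fine-tuning is to learn the weights of this head. Because the optimizer in the script receives model.parameters(), the pre-trained BERT weights are updated as well.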
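As a rough illustration of what a linear classification head computes, here is a pure-Python sketch under stated assumptions: a toy hidden size of 8 instead of 768, a random vector standing in for BERT's pooled output, small random initial weights, and no dropout. All names (`linear_head`, `softmax`, `cross_entropy`) are defined here for illustration and are not the library's implementation.

```python
import math
import random

HIDDEN_SIZE = 8  # toy value (assumption); BERT-base uses 768
NUM_LABELS = 6   # six hat classes, as in this post

rng = random.Random(0)

# Randomly initialized head parameters: this is the part fine-tuning has to learn.
weight = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS


def linear_head(pooled):
    """logits[k] = sum_j weight[k][j] * pooled[j] + bias[k]"""
    return [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weight, bias)]


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability that the softmax assigns to the true label."""
    return -math.log(softmax(logits)[label])


# Stand-in for the pooled sentence vector produced by BERT (assumption).
pooled = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)]

logits = linear_head(pooled)
probs = softmax(logits)
predicted = max(range(NUM_LABELS), key=lambda k: logits[k])
loss = cross_entropy(logits, label=2)

print(f"predicted class: {predicted}, loss: {loss:.4f}")
```

With weights this close to zero the probabilities are nearly uniform, so the loss starts near log(6). Training should push it down as the head learns to separate the classes.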

# -*- coding: utf-8 -*-
# @Time    : 2022/12/4 10:53
# @Author  : 楚楚
# @File    : 04文本分类.py
# @Software: PyCharm

import pandas as pd
import torch
import torch.nn as nn
from torch.optim import AdamW  # the AdamW class in transformers is deprecated; use the PyTorch one
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, BertConfig
from datetime import datetime

# hyper parameters
HIDDEN_DROPOUT_PROB = 0.3
NUM_LABELS = 6
LR = 1e-5
WEIGHT_DECAY = 1e-2
EPOCH = 100
BATCH_SIZE = 16
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# file path
data_path = "./dataset/"
vocab_file = data_path + "vocab.txt"
train = data_path + "train.xlsx"
test = data_path + "test.xlsx"

print("loading BERT tokenizer...")
tokenizer = BertTokenizer(vocab_file=vocab_file)

'''
BertTokenizer(vocab_file=vocab_file) should behave like BertTokenizer.from_pretrained("hfl/chinese-bert-wwm"),
provided vocab.txt is the vocabulary file shipped with that model
'''

config = BertConfig.from_pretrained("hfl/chinese-bert-wwm", num_labels=NUM_LABELS,
                                    hidden_dropout_prob=HIDDEN_DROPOUT_PROB)
model = BertForSequenceClassification.from_pretrained("hfl/chinese-bert-wwm", config=config)
model.to(device)

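'''
Label ids for the six hat colors
    0: blue hat (蓝帽), 1: white hat (白帽), 2: red hat (红帽), 3: yellow hat (黄帽), 4: black hat (黑帽), 5: green hat (绿帽)
'''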


class SixHatDataset(Dataset):
    def __init__(self, path_to_file):
        super(SixHatDataset, self).__init__()

        self.label = ['蓝帽', '白帽', '红帽', '黄帽', '黑帽', '绿帽']
        self.label2id = {'蓝帽': 0, '白帽': 1, '红帽': 2, '黄帽': 3, '黑帽': 4, '绿帽': 5}

        self.dataset = pd.read_excel(path_to_file, keep_default_na=False)  # read the data from train.xlsx / test.xlsx

        self.label_text = []  # one {label_id: text} dict per sample

        for label in self.label:
            self.hat_dataset = self.dataset.loc[:, label]
            self.read_each_hat()

    # collect the texts in the Excel column that belongs to one hat
    def read_each_hat(self):
        label = self.hat_dataset.name  # the column name is the label
        label = self.label2id.get(label)

        for text in self.hat_dataset:
            if text != '':
                self.label_text.append({label: text})

    def __len__(self):
        return len(self.label_text)

    def __getitem__(self, idx):
        label_text = self.label_text[idx]
        label = list(label_text.keys())[0]
        text = label_text.get(label)

        label = torch.tensor(label, dtype=torch.long)

        return text, label


train_dataset = SixHatDataset(train)
train_dataloader = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)

test_dataset = SixHatDataset(test)
test_dataloader = DataLoader(dataset=test_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)

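# define the optimizer and the loss function
optimizer = AdamW(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)
criterion = nn.CrossEntropyLoss().to(device)  # not used below: the model returns the loss itself when labels are passed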


# training function
def train(epoch, dataloader):
    model.train()
    epoch_acc = 0

    for idx, data in enumerate(dataloader):
        text, label = data
        label = label.to(device)

        tokenize_text = tokenizer(text, max_length=100, add_special_tokens=True, truncation=True, padding=True,
                                  return_tensors='pt')
        tokenize_text = tokenize_text.to(device)

        optimizer.zero_grad()

        '''
        SequenceClassifierOutput(loss=tensor(2.0327, grad_fn=<NllLossBackward0>), logits=tensor([[-0.1659,  0.7432,  0.9424, -0.3815,  0.1794,  0.1559],
        [-1.1916,  0.2135,  1.0156, -0.5150,  0.7795, -0.0261],
        [-0.3528,  0.1796,  1.1230, -0.8721,  0.4448,  0.8882],
        [-0.9661,  0.0696,  1.0002, -0.3308,  0.8832,  0.1922],
        [-0.4554,  0.4486,  0.9846, -0.3371,  0.8539, -0.3214],
        [-1.3695, -0.2882,  0.5169,  0.4508,  1.1330,  0.2997],
        [-0.5405,  0.0763,  1.2337, -0.2260,  0.6922, -0.2044],
        [-0.9427, -0.0595,  1.7682, -1.0026,  0.4901, -0.1369],
        [-1.1734,  0.1412,  2.0086, -0.5898,  0.8525,  0.0528],
        [-0.7478, -0.3635,  1.2168, -0.5125,  1.2169,  0.3979],
        [-1.2102,  0.2823,  0.9883, -0.5061,  0.5131, -0.0209],
        [-1.0257, -0.0059,  1.0093, -0.8454,  1.1518,  0.1737],
        [-0.6118,  0.2500,  1.3389, -0.7910,  0.0835,  0.4923],
        [-0.5885, -0.0195,  1.0697,  0.0891,  0.9630, -0.0917],
        [-0.7473,  0.1327,  1.0242, -0.4896,  0.2457,  0.2772],
        [-0.6821, -0.1901,  0.6271, -0.2386,  0.1395, -0.0949]],
        grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
        '''
        output = model(**tokenize_text, labels=label)

        # y_pred_prob = logits : [batch_size, num_labels]
        y_pred_prob = output[1]
        y_pred_label = y_pred_prob.argmax(dim=1)

        # loss = loss
        loss = output[0]

        acc = ((y_pred_label == label.view(-1)).sum()).item()

        loss.backward()
        optimizer.step()

        if idx % 10 == 0:
            print(f'train epoch:{epoch}, loss: {loss.item():.4f}')

            now = datetime.now()
            now = now.strftime("%Y-%m-%d %H:%M:%S")

            content = f"{now}\tloss: {loss.item():.4f}\n"

            with open('information.txt', 'a+', encoding='utf-8') as file:
                file.write(content)

        epoch_acc += acc

    # fraction of correctly classified training samples, in [0, 1]
    accuracy = epoch_acc / len(dataloader.dataset)

    print(f"accuracy on the training set: {accuracy:.2%}")

    now = datetime.now()
    now = now.strftime("%Y-%m-%d %H:%M:%S")

    content = f"{now}\taccuracy on the training set: {accuracy:.2%}\n"

    with open('information.txt', 'a+', encoding='utf-8') as file:
        file.write(content)


def validate(epoch, dataloader):
    model.eval()

    total_loss = 0
    epoch_acc = 0

    with torch.no_grad():
        for _, data in enumerate(dataloader):
            text, label = data

            '''
            result:
                {'input_ids': tensor([[ 101, 1266,  776, 2356, 3308, 7345, 1277, 1266, 1724, 4384,  704, 6662,
                          102,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}

                token_type_ids: segment encoding
                attention_mask: keeps the padding positions out of the self-attention computation

            tokenizer():
                return_tensors='pt': return PyTorch tensors
            '''
            tokenize_text = tokenizer(text, max_length=100, padding=True, truncation=True, add_special_tokens=True,
                                      return_tensors='pt')

            tokenize_text = tokenize_text.to(device)
            label = label.to(device)

            output = model(**tokenize_text, labels=label)

            y_pred_label = output[1].argmax(dim=1)
            loss = output[0]

            total_loss += loss.item()  # accumulate a Python float rather than a tensor

            acc = (y_pred_label == label.view(-1)).sum().item()
            epoch_acc += acc

    print(f"loss on the test set: {total_loss}")

    # fraction of correctly classified test samples, in [0, 1]
    accuracy = epoch_acc / len(dataloader.dataset)
    print(f"accuracy on the test set: {accuracy:.2%}")

    # the ./model/ directory must already exist
    torch.save(model.state_dict(), f'./model/classification_{epoch}.pth')
    print("model saved")

    now = datetime.now()
    now = now.strftime("%Y-%m-%d %H:%M:%S")

    content = f"{now}\tloss on the test set: {total_loss}, accuracy on the test set: {accuracy:.2%}\n"
    with open('information.txt', 'a+', encoding='utf-8') as file:
        file.write(content)


for i in range(EPOCH):
    content = f"{'-' * 20}epoch{i + 1}{'-' * 20}\n"

    content = content + f"{'*' * 10}training started{'*' * 10}\n"

    with open('information.txt', 'a+', encoding='utf-8') as file:
        file.write(content)

    train(i, train_dataloader)

    content = f"{'*' * 10}testing started{'*' * 10}\n"

    with open('information.txt', 'a+', encoding='utf-8') as file:
        file.write(content)

    validate(i, test_dataloader)
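The tokenizer call in the script uses padding=True and truncation=True with max_length=100, which pads every sequence to the longest one in the batch and cuts off anything beyond 100 tokens. The following pure-Python sketch shows how input_ids and attention_mask relate under those settings. The helper `pad_batch` and the second sequence in the batch are hypothetical; the first sequence is the one from the sample output in validate().

```python
PAD_ID = 0  # id of [PAD]; matches the trailing zeros in the sample output above


def pad_batch(batch_ids, max_length=100, pad_id=PAD_ID):
    """Truncate each sequence to max_length, then pad to the longest one in the batch."""
    # Naive truncation: the real tokenizer keeps [SEP] as the last token when it truncates.
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that self-attention should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask


# First sequence: the 13 ids from the sample output above.
# Second sequence: hypothetical ids, only there to make the batch 15 tokens long.
batch = [
    [101, 1266, 776, 2356, 3308, 7345, 1277, 1266, 1724, 4384, 704, 6662, 102],
    [101, 1266, 776, 2356, 3308, 7345, 1277, 1266, 1724, 4384, 704, 6662, 1266, 776, 102],
]

input_ids, attention_mask = pad_batch(batch)
print(input_ids[0])
print(attention_mask[0])
```

Padding to the longest sequence in each batch, rather than always to max_length, keeps short batches small and saves computation.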

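The vocabulary file vocab.txt comes from "BERT-wwm, Chinese", the Chinese pre-trained language model released by the Harbin Institute of Technology (HIT). Source: Chinese BERT-wwm (中文BERT-wwm)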

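References

1. A detailed guide to calling the pre-trained BERT models in the transformers package (transformer包中的bert预训练模型的调用详解)

2. How does max_length, padding and truncation arguments work in HuggingFace' BertTokenizerFast.from_pretrained('bert-base-uncased')?

3. pip install transformers succeeded, but calling the model still raises an error (明明pip install transformers了,但调用模型的时候还会报错)

4. Chinese BERT-wwm (中文BERT-wwm)

5. NLP (28): Text classification with BertForSequenceClassification, based on transformers (NLP(二十八):BertForSequenceClassification进行文本分类,基于transformers)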

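BERT is a Transformer-based pre-trained model that can be used for text classification. Implementing text classification with BERT involves the following steps:

1. Data preprocessing: convert the text into a format the model accepts. Each text is first split into words or subwords, and each of them is mapped to its index in the vocabulary to form the input sequence. Special tokens are also added: [CLS] marks the start of the sequence and [SEP] marks the end of the sentence.
2. Model construction: start from the pre-trained BERT model and fine-tune its parameters for the classification task. A classifier is usually placed on top of the output at the [CLS] position, at the start of the input sequence, to predict the class of the text. Fine-tuning improves the model's performance on the specific task.
3. Model training: train the model on labeled data. During training, backpropagation repeatedly updates the parameters so that the classification loss gradually decreases. An optimizer such as Adam is typically used to drive the updates.
4. Model evaluation: after training, evaluate the model on held-out labeled test data. Common metrics include accuracy, precision, recall and F1.
5. Model application: the trained model can then classify new, unseen text. The text is fed to the model, and the predicted class is read from the model's output.

In short, the workflow consists of data preprocessing, model construction, training, evaluation and application. With fine-tuning, BERT can reach good performance on a wide range of text classification tasks.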
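The metrics named in step 4 can be computed directly from the true and predicted label ids. Below is a minimal pure-Python sketch; the function name `simple_report` and the example labels are hypothetical and are not taken from the six-hat dataset. The training script in this post only tracks accuracy, so this is an optional extension, not the author's method.

```python
def simple_report(y_true, y_pred, num_labels):
    """Return accuracy, per-class (precision, recall, f1) and macro-averaged F1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    per_class = []
    for k in range(num_labels):
        tp = sum(t == k and p == k for t, p in pairs)  # correctly predicted as k
        fp = sum(t != k and p == k for t, p in pairs)  # wrongly predicted as k
        fn = sum(t == k and p != k for t, p in pairs)  # class k that was missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class.append((precision, recall, f1))

    macro_f1 = sum(f1 for _, _, f1 in per_class) / num_labels
    return accuracy, per_class, macro_f1


# Hypothetical labels for a three-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

accuracy, per_class, macro_f1 = simple_report(y_true, y_pred, num_labels=3)
print(f"accuracy: {accuracy:.4f}, macro F1: {macro_f1:.4f}")
for k, (precision, recall, f1) in enumerate(per_class):
    print(f"class {k}: precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Per-class precision and recall are more informative than accuracy alone when the classes are imbalanced, which is why macro-averaged F1 is often reported next to accuracy.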