[NLP in Practice: BERT Text Classification] A Complete Guide to Text Classification with BERT and Custom Evaluation Metrics

Article link: https://blog.csdn.net/lov1993/article/details/145498581


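🎓 About the author: AI算法驯化师 has worked in search, recommendation, advertising, data analysis, and data mining roles at several large tech companies, has filed 40+ patents, and is well versed in the principles and practical application of machine learning and deep learning algorithms.

🔧 Technical expertise: extensive hands-on project experience in algorithm-related fields including machine learning, search, advertising, recommendation, CV, NLP, multimodal learning, and data analysis. Has provided nearly a thousand paid or free customized services for job hunting, research, and study needs, helping readers avoid detours and work more efficiently, with a 100% positive rating over the past year.

📝 Blog focus: practical content on machine learning, deep learning, data analysis, NLP, PyTorch, Python, Linux, and work and project retrospectives.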


BERT Text Classification

🎯 1. Introduction

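The goal of a text classification task is to assign text to predefined categories. In this project we use a dataset of user reviews, each annotated with one review dimension (such as "service attitude" or "product quality"). Our goal is to train a model that predicts the review dimension automatically from the review text.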

🎯 2. Data Processing

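Data processing is an important part of a machine learning project, and careful preprocessing can noticeably improve model performance. Our data processing steps are as follows.
Load the data. We use the pandas library to load the dataset and take a first look at it. The dataset is stored in data.xlsx, and we use only the first 10,000 rows for this experiment.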

import pandas as pd
# Load the spreadsheet and keep only the first 10,000 rows
df_train = pd.read_excel('data.xlsx').head(10000)
print(df_train.head())

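Label encoding. The model cannot work with text labels directly, so we convert them to numeric labels. Here we use LabelEncoder to encode the "评价维度" (review dimension) column.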

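from sklearn.preprocessing import LabelEncoder
# Map each distinct value of the '评价维度' column to an integer id
le = LabelEncoder()
df_train['label'] = le.fit_transform(df_train['评价维度'].tolist())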
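
To make the encoding concrete, here is a minimal sketch on a handful of made-up labels (the English label strings are illustrative assumptions, not values from the dataset). LabelEncoder assigns ids in sorted order of the distinct labels, and inverse_transform maps ids back to text.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the '评价维度' column
toy_labels = ["service", "quality", "price", "quality"]

toy_le = LabelEncoder()
toy_ids = toy_le.fit_transform(toy_labels)

# classes_ holds the distinct labels in sorted order; each id is an index into it
classes = list(toy_le.classes_)
ids = [int(i) for i in toy_ids]
decoded = list(toy_le.inverse_transform(toy_ids))

print(classes)
print(ids)
print(decoded)
```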

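Split into training and test sets. We use train_test_split to divide the data into a training set and a test set, with the test set taking 20% of the samples.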

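from sklearn.model_selection import train_test_split
# Hold out 20% of the samples for testing; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(df_train['评论内容'].tolist(), df_train['label'].tolist(), 
                                                    test_size=0.2, random_state=42)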
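
If some review dimensions are rare, a purely random split can leave them under-represented in the test set. One option (an addition here, not part of the original pipeline) is the stratify argument of train_test_split, sketched below on made-up data.

```python
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 16 samples of class 0 and 4 samples of class 1
texts = [f"review {i}" for i in range(20)]
labels = [0] * 16 + [1] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.25,
    random_state=42,
    stratify=labels,  # keep the 4:1 class ratio in both splits
)

print(len(X_tr), len(X_te))
print(y_tr.count(1), y_te.count(1))
```

With stratification the minority class appears in both splits in the same proportion as in the full data.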

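Save the label dictionary. To convert numeric labels back to the original text labels at prediction time, we keep a label dictionary. The dictionary maps each text label to its numeric id, so decoding a prediction requires the inverse mapping (or le.inverse_transform).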


label_dict = dict(zip(le.classes_, le.transform(le.classes_)))
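
Because label_dict maps text to id, decoding a prediction needs the reverse direction. The sketch below builds both mappings and writes one of them to JSON. The class names and the file name label_dict.json are placeholders, and the ids are cast to plain int because NumPy integers are not JSON serializable.

```python
import json
import os
import tempfile

# Placeholder for le.classes_ (LabelEncoder keeps them sorted)
classes = ["price", "quality", "service"]

label2id = {name: int(idx) for idx, name in enumerate(classes)}
id2label = {idx: name for name, idx in label2id.items()}

# Persist the mapping so inference code can reload it later
path = os.path.join(tempfile.gettempdir(), "label_dict.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(label2id, f, ensure_ascii=False)  # keeps Chinese labels readable

with open(path, encoding="utf-8") as f:
    reloaded = json.load(f)

predicted_id = 1
print(id2label[predicted_id])
```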

🎯 3. Model Preparation

Load the BERT tokenizer. We use the transformers library to load the pretrained BERT tokenizer, which is then used to encode the text.

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

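Text encoding. We define a function encode_texts that converts a list of texts into the input format BERT expects: sequences are truncated to max_length and padded to the longest sequence in the list.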

def encode_texts(tokenizer, texts, max_length=128):
    encodings = tokenizer(texts, truncation=True, padding=True, max_length=max_length, return_tensors="pt")
    return encodings
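
The effect of truncation=True, padding=True, and max_length can be illustrated without downloading a model. The helper below is a simplified stand-in, not the tokenizer's implementation: it works on already-tokenized id lists, assumes a padding id of 0, and ignores special tokens.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Toy version of truncation plus pad-to-longest on lists of token ids."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_and_truncate([[11, 12, 13, 14, 15, 16], [21, 22]], max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The long sequence is cut to four ids and the short one is padded up to the same length, which is why every row of the encoded batch has the same shape.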

Create the dataset class. We define a TextDataset class that inherits from torch.utils.data.Dataset and wraps the encoded texts and labels as a PyTorch dataset.

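import torch

class TextDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings  # tokenizer output (input_ids, attention_mask, ...)
        self.labels = labels        # integer class ids, one per text
    
    def __getitem__(self, idx):
        # The encodings already hold tensors (return_tensors="pt"), so index them
        # directly; wrapping them in torch.tensor() again triggers a copy warning
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item
    
    def __len__(self):
        return len(self.labels)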
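
As a quick check of how the dataset feeds a model, the sketch below wraps hand-written tensors (stand-ins for real tokenizer output) and pulls one batch through a DataLoader. The class is repeated so the snippet runs on its own, and the token ids are made up.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class TextDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)

# Fake encodings: 4 sequences of length 3
encodings = {
    "input_ids": torch.tensor([[101, 5, 102], [101, 6, 102], [101, 7, 102], [101, 8, 102]]),
    "attention_mask": torch.ones(4, 3, dtype=torch.long),
}
dataset = TextDataset(encodings, labels=[0, 1, 0, 1])

# The default collate function stacks each dict entry into a batch tensor
loader = DataLoader(dataset, batch_size=2, shuffle=False)
batch = next(iter(loader))
print(batch["input_ids"].shape)
print(batch["labels"].tolist())
```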

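Load the pretrained BERT model. We load a pretrained BERT model and size its classification head to the number of classes in our task; num_labels is the number of distinct labels found by the LabelEncoder. Note that the path below starts with '/', so it is resolved as a local directory rather than as the model name 'bert-base-chinese' used for the tokenizer.

from transformers import BertForSequenceClassification
num_labels = len(le.classes_)  # number of distinct review dimensions
model = BertForSequenceClassification.from_pretrained('/bert-base-chinese', 
                                                       num_labels=num_labels, problem_type="single_label_classification")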

🎯 4. Model Training and Evaluation

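Define the evaluation metrics. We define a compute_metrics function that computes the model's evaluation metrics: AUC (macro-averaged, one-vs-rest for more than two classes) plus support-weighted precision, recall, and F1.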

import torch
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    
    # Convert the raw logits into class probabilities
    probs = torch.nn.functional.softmax(torch.tensor(pred.predictions), dim=1).numpy()
    
    try:
        # Check whether this is a binary problem
        if probs.shape[1] == 2:
            # Binary case: only the positive-class probability is needed
            macro_auc = roc_auc_score(
                y_true=labels,
                y_score=probs[:, 1],  # positive-class probability
                average='macro'
            )
        else:
            # Multi-class case: one-vs-rest strategy
            macro_auc = roc_auc_score(
                y_true=labels,
                y_score=probs,
                multi_class='ovr',  # one-vs-rest
                average='macro'     # macro average over classes
            )
    except ValueError as e:
        print(f"Warning: AUC calculation failed - {str(e)}")
        try:
            # Fall back to micro averaging (the result is still stored in macro_auc)
            if probs.shape[1] == 2:
                macro_auc = roc_auc_score(
                    y_true=labels,
                    y_score=probs[:, 1],
                    average='micro'
                )
            else:
                macro_auc = roc_auc_score(
                    y_true=labels,
                    y_score=probs,
                    multi_class='ovr',
                    average='micro'
                )
        except ValueError as e:
            print(f"Warning: Micro-average AUC calculation also failed - {str(e)}")
            macro_auc = 0.0
    
    # Compute the remaining metrics, weighted by class support
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, 
        preds, 
        average="weighted",
        zero_division=0
    )
    
    metrics = {
        "auc": macro_auc,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }
    
    # Log some classification details
    print(f"\nNumber of classes: {probs.shape[1]}")
    print(f"AUC Score: {macro_auc:.4f}")
    
    return metrics
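
To see what these calls return, here is a small sketch on hand-made three-class probabilities (all numbers are invented for illustration). It uses the same roc_auc_score and precision_recall_fscore_support calls as compute_metrics, and also shows that support-weighted recall equals plain accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score

labels = np.array([0, 0, 1, 1, 2, 2])
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.4, 0.35, 0.25],  # true class 1, but argmax picks class 0
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
])
preds = probs.argmax(axis=1)

# Ranking-based metric: uses the full probability matrix
auc = roc_auc_score(labels, probs, multi_class="ovr", average="macro")

# Decision-based metrics: use only the argmax predictions
precision, recall, f1, _ = precision_recall_fscore_support(
    labels, preds, average="weighted", zero_division=0
)
accuracy = accuracy_score(labels, preds)

print(f"auc={auc:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"accuracy={accuracy:.4f}")
```

In this toy case every class is ranked perfectly against the rest, so the AUC is 1.0 even though one argmax prediction is wrong. That is a reason to track AUC and F1 side by side rather than rely on either alone.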

Custom Trainer class. To keep a better record of the metrics during training, we define a CustomTrainer class that inherits from transformers.Trainer.

import torch
from transformers import Trainer

# Module-level list that evaluate() appends one record to per epoch
training_history = []

class CustomTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.train_metrics = []
        self.current_train_loss = 0.0
        self.steps_in_epoch = 0
    
    def training_step(self, model, inputs, *args, **kwargs):
        """Accumulate the loss of every training step."""
        # Some transformers releases pass extra arguments (e.g. num_items_in_batch),
        # so forward whatever is received to the parent implementation
        loss = super().training_step(model, inputs, *args, **kwargs)
        self.current_train_loss += loss.item()
        self.steps_in_epoch += 1
        return loss
    
    def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix="eval"):
        """Evaluate at the end of each epoch and record the metrics."""
        metrics = super().evaluate(eval_dataset, ignore_keys, metric_key_prefix)
        
        # Record only if training steps ran since the last evaluation, so a
        # standalone evaluate() call after training does not add a zero-loss row
        if self.steps_in_epoch > 0:
            # Average training loss over the steps since the last evaluation
            avg_train_loss = self.current_train_loss / self.steps_in_epoch
            
            # Append the metrics of the current epoch to the training history
            training_history.append({
                'epoch': self.state.epoch,
                'loss': avg_train_loss,
                'auc': metrics.get('eval_auc', 0),
                'precision': metrics.get('eval_precision', 0),
                'recall': metrics.get('eval_recall', 0),
                'f1': metrics.get('eval_f1', 0)
            })
        
        # Reset the loss accumulators
        self.current_train_loss = 0.0
        self.steps_in_epoch = 0
        
        return metrics
    
    def _save(self, output_dir: str, state_dict=None):
        """Override saving so that all tensors are contiguous."""
        if state_dict is None:
            state_dict = self.model.state_dict()
        
        # Make every tensor contiguous in memory
        state_dict = {k: v.contiguous() if torch.is_tensor(v) else v 
                      for k, v in state_dict.items()}
        
        # Save the model
        self.model.save_pretrained(
            output_dir,
            state_dict=state_dict,
            save_function=torch.save
        )
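
The bookkeeping inside CustomTrainer boils down to three steps: add up the step losses, average them at evaluation time, then reset. A framework-free sketch of that pattern follows; EpochLossTracker is a hypothetical name used only for this illustration, and the loss and F1 values are made up.

```python
class EpochLossTracker:
    """Accumulates per-step losses and reports their mean once per epoch."""

    def __init__(self):
        self.total = 0.0
        self.steps = 0
        self.history = []

    def on_step(self, loss):
        self.total += float(loss)
        self.steps += 1

    def on_evaluate(self, epoch, eval_metrics):
        # Guard against division by zero when no step ran since the last reset
        avg = self.total / self.steps if self.steps > 0 else 0.0
        self.history.append({"epoch": epoch, "loss": avg, **eval_metrics})
        self.total, self.steps = 0.0, 0  # start the next epoch from scratch
        return avg

tracker = EpochLossTracker()
for loss in [0.9, 0.7, 0.5]:
    tracker.on_step(loss)
avg1 = tracker.on_evaluate(epoch=1, eval_metrics={"f1": 0.80})

for loss in [0.4, 0.2]:
    tracker.on_step(loss)
avg2 = tracker.on_evaluate(epoch=2, eval_metrics={"f1": 0.85})

print(tracker.history)
```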

🎯 5. Model Training

The full training procedure is shown below:


from transformers import TrainingArguments

# Set the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=6,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=32,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",     
    save_total_limit=2,        
    metric_for_best_model="f1",
    save_safetensors=False,
    overwrite_output_dir=True  # overwrite the output directory if it already exists
)
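
It is worth checking warmup_steps against the total number of optimizer steps. The arithmetic below assumes a single device, no gradient accumulation, and the 8,000 training samples produced by the 80/20 split of 10,000 rows; under those assumptions the schedule never leaves warmup.

```python
import math

num_train_samples = 8000  # 80% of the 10,000 rows kept earlier (assumption)
batch_size = 128          # per_device_train_batch_size, single device assumed
num_epochs = 6
warmup_steps = 500

# The last partial batch still counts as a step, hence the ceiling
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
print(warmup_steps > total_steps)
```

With 378 total steps, a 500-step warmup means the learning rate is still ramping up when training ends, so a smaller warmup_steps value is one option to consider.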

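# Build the datasets from the encoded texts and their integer labels
train_dataset = TextDataset(encode_texts(tokenizer, X_train), y_train)
test_dataset = TextDataset(encode_texts(tokenizer, X_test), y_test)

# Create the custom Trainer
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()

# Evaluate the model on the test set
results = trainer.evaluate()
print(f"Test AUC: {results['eval_auc']:.4f}")
print(f"Test Precision: {results['eval_precision']:.4f}")
print(f"Test Recall: {results['eval_recall']:.4f}")
print(f"Test F1: {results['eval_f1']:.4f}")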

# Convert the training history to a DataFrame and save it
import os
import pandas as pd
history_df = pd.DataFrame(training_history)
# Add a column identifying the model
history_df['model_type'] = 'bert_aspect'  # marks these rows as BERT results
# Check whether the file exists, then save the results
filename = 'training_history.csv'
if os.path.exists(filename):
    # The file exists: append the new rows without repeating the header
    history_df.to_csv(filename, mode='a', header=False, index=False)
    print(f"\nAppended new training history to {filename}")
else:
    # The file does not exist: create it
    history_df.to_csv(filename, index=False)
    print(f"\nSaved new training history to {filename}")

# Print the training history
print("\nTraining History:")
print(history_df)
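
The append-or-create logic can be condensed by deriving the header flag from whether the file exists. The sketch below uses made-up rows and a temporary file so it does not touch training_history.csv; append_history is a hypothetical helper name.

```python
import os
import tempfile

import pandas as pd

def append_history(df, filename):
    """Append df to filename, writing the header only when the file is created."""
    write_header = not os.path.exists(filename)
    df.to_csv(filename, mode="a", header=write_header, index=False)

tmp_path = os.path.join(tempfile.mkdtemp(), "history_demo.csv")
run1 = pd.DataFrame([{"epoch": 1.0, "f1": 0.80}])
run2 = pd.DataFrame([{"epoch": 1.0, "f1": 0.82}])

append_history(run1, tmp_path)  # creates the file with a header row
append_history(run2, tmp_path)  # appends data only

combined = pd.read_csv(tmp_path)
print(combined)
```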

import matplotlib.pyplot as plt

# Visualization: plots AUC (in place of accuracy) alongside loss and the other metrics
def plot_training_history(history_df, model_type='BERT'):
    # Create a figure with 3 rows and 1 column of subplots
    plt.figure(figsize=(12, 12))
    
    # Plot the loss curve
    plt.subplot(3, 1, 1)
    plt.plot(history_df['epoch'], history_df['loss'], label=f'{model_type} Training Loss')
    plt.title(f'{model_type} bert - Training Loss over Epochs')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    
    # Plot the AUC curve on its own
    plt.subplot(3, 1, 2)
    plt.plot(history_df['epoch'], history_df['auc'], 
            label=f'{model_type} AUC', color='red')
    plt.title(f'{model_type} bert Model - AUC over Epochs')
    plt.xlabel('Epoch')
    plt.ylabel('AUC Score')
    plt.legend()
    
    # Plot the remaining evaluation metrics
    plt.subplot(3, 1, 3)
    colors = ['blue', 'green', 'orange']
    for metric, color in zip(['precision', 'recall', 'f1'], colors):
        plt.plot(history_df['epoch'], history_df[metric], 
                label=f'{model_type} {metric.capitalize()}',
                color=color)
    plt.title(f'{model_type} bert Model - Other Metrics over Epochs')
    plt.xlabel('Epoch')
    plt.ylabel('Score')
    plt.legend()
    
    plt.tight_layout()
    plt.savefig(f'training_history_{model_type.lower()}.png')
    plt.close()

# Call this after saving the CSV; the file name is derived from model_type
plot_training_history(history_df, 'aspect')
print("Training plots saved to training_history_aspect.png")