Natural Language Processing (NLP): BERT in Practice - Simple Sentiment Analysis

AI大模型_学习君

Published 2024-09-13 14:14:34

Article link: https://blog.csdn.net/python12345678_/article/details/142209707

In this article we perform sentiment analysis on text using the pretrained model BERT-base-chinese, with PyTorch as the deep learning framework.

What is BERT-base-chinese?

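BERT-base-chinese is a Chinese pretrained language model released by Google in 2018 alongside the original BERT models. It is based on the Transformer architecture and was pretrained on a large Chinese corpus. Its main characteristics are: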

Training corpus: BERT-base-chinese was pretrained on Chinese Wikipedia text, covering both Simplified and Traditional Chinese.

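Multi-layer structure: BERT-base-chinese stacks 12 bidirectional Transformer encoder layers (hidden size 768, 12 attention heads), which lets it capture long-range dependencies in text.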

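Masked language model: BERT-base-chinese is trained with a masked language modeling objective. A fraction of the input tokens (for Chinese, mostly single characters) is hidden, and the model learns to predict them from the surrounding context, which teaches it the semantic relationships between tokens.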

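Next sentence prediction: BERT-base-chinese is also trained on a next sentence prediction task, in which the model predicts whether the second of two sentences actually follows the first in the source text.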
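The masking step behind the masked language model objective can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `mask_tokens` and the toy vocabulary are hypothetical, each token is selected independently with probability 15% (a simplification of choosing a fixed share of positions), and the 80% / 10% / 10% split between `[MASK]`, a random token, and the unchanged token follows the original BERT recipe.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Apply BERT-style masking to a list of tokens.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    positions chosen for prediction and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # This position is a prediction target.
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                masked.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the original token
        else:
            # Not selected: input unchanged, nothing to predict here.
            masked.append(tok)
            labels.append(None)
    return masked, labels

# Chinese text is split into single characters, as bert-base-chinese mostly does.
tokens = list("今天天气很好")
masked, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

During pretraining the loss is computed only at the positions where `labels` is not `None`, so the model must use both left and right context to recover the hidden characters.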

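At the time of its release, BERT achieved state-of-the-art results on many natural language processing benchmarks, and BERT-base-chinese has since been widely used as a baseline for Chinese text classification, sentiment analysis, question answering, and similar tasks.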

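Example applications of BERT-base-chinese:

Text classification: BERT-base-chinese can be used to classify text, for example deciding whether a news article is positive or negative, or whether a Weibo post is upbeat or downbeat.

Sentiment analysis: BERT-base-chinese can be used to analyze the sentiment of text, for example deciding whether a review is positive or negative, or whether the author's mood is happy or sad.

Question answering: BERT-base-chinese can be used to build question answering systems that answer users' questions.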

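Machine translation: BERT-base-chinese is an encoder-only model, so it cannot generate translations by itself. It can, however, serve as the encoder, or as the initialization for part of a machine translation system that translates one language into another.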

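With the bert-base-chinese model we can tackle a range of Chinese natural language processing tasks, such as text classification, sentiment analysis, and named entity recognition. For these tasks we can either use the pretrained model directly as a feature extractor, or fine-tune it on our own task for better performance.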
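The difference between feature extraction and fine-tuning comes down to which parameters are allowed to change during training. The sketch below illustrates this with a deliberately tiny, randomly initialized BERT so that it runs without downloading any weights. The configuration sizes and the names `tiny_model` and `count_trainable` are assumptions made for illustration; in practice the model would come from `BertForSequenceClassification.from_pretrained('bert-base-chinese')`.

```python
from transformers import BertConfig, BertForSequenceClassification

# Tiny stand-in configuration (arbitrary sizes, NOT those of bert-base-chinese).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=2,
)
tiny_model = BertForSequenceClassification(config)

def count_trainable(model):
    """Number of parameters that an optimizer would update."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Fine-tuning: every parameter, encoder and classification head, is trainable.
full_finetune_params = count_trainable(tiny_model)

# Feature extraction: freeze the encoder so only the classification head trains.
for param in tiny_model.bert.parameters():
    param.requires_grad = False

head_only_params = count_trainable(tiny_model)
print(full_finetune_params, head_only_params)
```

Freezing the encoder is cheaper and less prone to overfitting on small datasets, while full fine-tuning usually gives better accuracy when enough labeled data is available.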

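Below we demonstrate the pretrained BERT-base-chinese model with the simplest possible code:

We use the default label names: LABEL_0 stands for negative sentiment and LABEL_1 for positive sentiment. Note that the classification head on top of bert-base-chinese is newly initialized when the model is loaded, so this mapping only becomes meaningful after the model has been fine-tuned on labeled sentiment data.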
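To make the label convention concrete, here is a minimal sketch of how a pair of classifier logits could be turned into a label and a probability. The helper names `softmax` and `interpret` and the example logits are hypothetical, not real model output; the negative/positive mapping is the convention assumed in this article.

```python
import math

# Assumed convention from this article: LABEL_0 = negative, LABEL_1 = positive.
ID2SENTIMENT = {0: "negative", 1: "positive"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interpret(logits):
    """Return (label name, sentiment, probability) for the highest-scoring class."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return f"LABEL_{idx}", ID2SENTIMENT[idx], probs[idx]

# Hypothetical logits for one sentence (not produced by a real model).
print(interpret([-1.2, 2.3]))
```

With a fine-tuned model, the logits would come from the model's output for each input sentence, and the same mapping would apply.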

1. Load the required libraries

import torch
from transformers import BertTokenizer, BertForSequenceClassification

2. Download and load the model

model = BertForSequenceClassification.from_pretrained('bert-base-chinese')
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

# Output
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.  
  warnings.warn(
config.json: 100%
   624/624 [00:00<00:00, 16.4kB/s]
model.safetensors: 100%
 412M/412M [00:04<00:00, 85.1MB/s]
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-chinese and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
tokenizer_config.json: 100%
 49.0/49.0 [00:00<00:00, 843B/s]
vocab.txt: 100%
 110k/110k [00:00<00:00, 1.55MB/s]
tokenizer.json: 100%
 269k/269k [00:00<00:00, 4.28MB/s]

3. Define the data for inference
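The data used in the original post is not available here, so the following is only a sketch of what this step might look like. The example sentences and the helper `clean_texts` are hypothetical; the idea is simply to collect the texts in a Python list and discard empty entries before tokenization.

```python
# Hypothetical example sentences; the original post's data is not available.
raw_texts = [
    "这家餐厅的菜很好吃，服务也很周到。",
    "  物流太慢了，包装还破损了。 ",
    "",
]

def clean_texts(texts):
    """Strip surrounding whitespace and drop empty strings."""
    return [t.strip() for t in texts if t and t.strip()]

texts = clean_texts(raw_texts)
print(texts)
```

The resulting `texts` list can then be passed to the tokenizer loaded in step 2, for example `tokenizer(texts, padding=True, truncation=True, return_tensors="pt")`, to obtain the input tensors for the model.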