The SST-2 (Stanford Sentiment Treebank, binary version) Dataset: A Binary Classification Dataset for Sentiment Analysis

Latest recommended article published on 2025-04-14 11:14:42

彬彬侠

Category: Large Models. Tags: SST-2, sentiment analysis, binary classification, GLUE Benchmark, Hugging Face, python

Article link: https://blog.csdn.net/u013172930/article/details/146130503

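Introduction to the SST-2 Dataset

SST-2 (the binary version of the Stanford Sentiment Treebank) is a binary classification dataset widely used for sentiment analysis. It is derived from Stanford University's Stanford Sentiment Treebank and is used to train and evaluate natural language processing (NLP) models.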

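1. Characteristics of SST-2

Task type: binary classification (positive / negative)
Data source: movie reviews from Rotten Tomatoes
Labels:
- 1 (positive sentiment)
- 0 (negative sentiment)
Total size: about 70,000 examples
Splits:
- Training set (Train): about 67,000 examples
- Development set (Dev): 872 examples
- Test set (Test): 1,821 examples (labels withheld)
Evaluation metric: accuracy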

SST-2 is one of the tasks in the GLUE Benchmark and is an important dataset for evaluating pretrained NLP models such as BERT, RoBERTa, and GPT.

2. SST-2 Data Format

Example data (illustrative; sentences in the released files are lowercased)

sentence	label
This film is amazing and full of life.	1
I didn't like the movie at all.	0
One of the worst films I have ever watched.	0
A beautiful and touching story.	1

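Data files

train.tsv (training set)
dev.tsv (validation set)
test.tsv (test set, no labels)

train.tsv and dev.tsv each contain two columns:

sentence: the movie review text
label: the sentiment label (1 = positive, 0 = negative)

test.tsv has no label column; each sentence is paired with an index instead.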
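
To make the labeled file layout concrete, here is a minimal sketch that parses tab-separated rows with Python's standard library and counts the labels. The helper name `read_sst2_rows` and the in-memory `SAMPLE_TSV` string are assumptions made for illustration; the rows are the example rows shown earlier in this section, not actual dataset content.

```python
import csv
import io
from collections import Counter

# Illustrative rows copied from the example table in this section. A real
# train.tsv would be opened with open(path, newline="", encoding="utf-8").
SAMPLE_TSV = (
    "sentence\tlabel\n"
    "This film is amazing and full of life.\t1\n"
    "I didn't like the movie at all.\t0\n"
    "One of the worst films I have ever watched.\t0\n"
    "A beautiful and touching story.\t1\n"
)


def read_sst2_rows(handle):
    """Yield (sentence, label) pairs from a tab-separated file object."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row["sentence"], int(row["label"])


rows = list(read_sst2_rows(io.StringIO(SAMPLE_TSV)))
label_counts = Counter(label for _, label in rows)
print(rows[0])
print(label_counts)
```

Reading through `csv.DictReader` keeps the column names explicit, so the same helper fails loudly with a `KeyError` if it is pointed at a file that lacks a `label` column.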

3. Loading the SST-2 Dataset

3.1 Using Hugging Face `datasets`

from datasets import load_dataset

# Load the SST-2 dataset (splits: train, validation, test)
dataset = load_dataset("glue", "sst2")

# Inspect the first training example
print(dataset["train"][0])

Output

{'sentence': 'hide new secretions from the parental units ', 'label': 0, 'idx': 0}

3.2 Reading with `pandas`

import pandas as pd

# Read the training data (assumes the TSV files were unpacked into ./sst2)
train_data = pd.read_csv("sst2/train.tsv", sep="\t")

# Show the first 5 rows
print(train_data.head())

4. Training a Sentiment Analysis Model (BERT)

4.1 Fine-tuning with Hugging Face `transformers`

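from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Load SST-2 as in section 3.1
dataset = load_dataset("glue", "sst2")

# Load the pretrained BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Load the model with a 2-class classification head
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tokenize the text; padding is left to the data collator so that each batch
# is padded to its own longest sentence rather than to the model maximum
def preprocess_function(examples):
    return tokenizer(examples["sentence"], truncation=True)

# Tokenize every split
encoded_dataset = dataset.map(preprocess_function, batched=True)

# Training configuration (older transformers releases name this argument evaluation_strategy)
training_args = TrainingArguments(output_dir="./results", eval_strategy="epoch")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
)

# Start training
trainer.train()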
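
With no metric function supplied, the `Trainer` reports only the evaluation loss at the end of each epoch, while the metric listed for SST-2 is accuracy. The following is a minimal sketch of a metric function that could be passed as `compute_metrics=compute_metrics` when constructing the `Trainer`. It is written in plain Python so it runs on its own; the toy logits and labels are invented for illustration, and in practice `numpy.argmax` would be the more common way to pick the predicted class.

```python
def compute_metrics(eval_pred):
    """Return accuracy for a (logits, labels) pair such as the one Trainer passes in."""
    logits, labels = eval_pred
    # For each row, take the index of the largest logit (0 = negative, 1 = positive).
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p) == int(y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Toy check with hand-written logits (not real model output)
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
toy_labels = [1, 0, 0, 0]
print(compute_metrics((toy_logits, toy_labels)))  # {'accuracy': 0.75}
```

The function should only be evaluated on the training and validation splits, since the test split carries no usable labels.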