This article originates from an AIStudio community featured project.
2022 People's Daily Online Algorithm Competition: Weibo Topic Recognition
Competition page: http://data.sklccc.com/2022
Task Description
As an emerging social media platform, Sina Weibo has accumulated massive amounts of data across many domains. Mining latent features from this data and identifying topics in a timely manner can create considerable social value. This competition provides a Weibo topic recognition dataset in which each record contains the text of a post and its topic label(s); a sample may carry one or more topic labels. Participants build a model on the training set and predict the topic labels of the test set.
Data Description
- The training set contains a batch of text samples with labels in train.csv, tab-separated, with the fields:
  - Text: the content of the Weibo post
  - Label: the topic label(s), comma-separated when there are several

Text | Label
---|---
原来是这样啊,又是一个新的道理,简直让我长知识了,七星连珠是很难得的一种现象,这个解析也太到位了 | label_878402
老师好厉害,咱就是说,老师,能不能,就是,研究一下能不能穿越🙏 代入感很强,我已经想穿越去… | label_878402
这个七星连珠的现象居然真的存在,我只在小说上面看过,太神奇了吧,至于其他的我也不太懂,科学现… | label_878402

- The test set contains a batch of unlabeled samples in test.csv with the fields:
  - ID: the sample ID
  - Text: the content of the Weibo post
Evaluation Metric
In this task, to better reflect model capability, samples are scored with a partial-credit scheme: the samples-averaged F1 score.
from sklearn import metrics
import numpy as np
y_true = np.array([[0, 1, 0, 1], [1, 0, 1, 0]])
y_pred = np.array([[0, 1, 1, 0], [1, 0, 1, 0]])
F1_score = metrics.f1_score(y_true, y_pred, average="samples")
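To make the partial-credit scoring concrete, here is the samples-averaged F1 worked out by hand on a small two-sample case: the first sample recovers one of its two true labels and adds one false positive (precision = recall = 1/2, so F1 = 0.5), while the second is an exact match (F1 = 1.0), giving an average of 0.75:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([[0, 1, 0, 1], [1, 0, 1, 0]])
y_pred = np.array([[0, 1, 1, 0], [1, 0, 1, 0]])

# sample 1: tp=1, fp=1, fn=1 -> precision = recall = 0.5 -> F1 = 0.5
# sample 2: exact match -> F1 = 1.0
score = metrics.f1_score(y_true, y_pred, average="samples")
print(score)  # 0.75
```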
Data Loading
!unzip data/data193239/weibo_topic_recognition_01.zip
!pip install wordcloud
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting wordcloud
  Downloading wordcloud-1.8.2.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (435 kB)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.8.2.2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud, ImageColorGenerator
from collections import Counter
from tqdm import tqdm
import paddle
train_data = pd.read_csv('train.csv', sep='\t')
test_data = pd.read_csv('test.csv', sep='\t')
train_data.head(3)
 | ID | Text | Label
---|---|---|---
0 | 0 | 原来是这样啊,又是一个新的道理,简直让我长知识了,七星连珠是很难得的一种现象,这个解析也太到位了 | label_878402
1 | 1 | 老师好厉害,咱就是说,老师,能不能,就是,研究一下能不能穿越🙏 代入感很强,我已经想穿越去... | label_878402
2 | 2 | 这个七星连珠的现象居然真的存在,我只在小说上面看过,太神奇了吧,至于其他的我也不太懂,科学现... | label_878402
test_data.head(3)
 | ID | Text
---|---|---
0 | 5077 | @河南城建学院2022年第二学士学位招生简章已发布,招生对象为当年普通高校本科毕业并获得学士...
1 | 5435 | 左航ZH💙 💙 一个人的能力和才华可以成为吸引人的资本,但这世上最能打动人的,永远是内心那些...
2 | 7668 | 韩国人连炸鸡都吃不起了?韩国龙头企业上调炸鸡价格!网友:失去炸鸡自由了、 搞机数码哥的微博视...
Data Analysis
Sentence Length Analysis
plt.figure(figsize=(6, 3))
plt.subplot(121)
train_data['Text'].apply(len).plot(kind='box')
plt.title('Train')
plt.subplot(122)
test_data['Text'].apply(len).plot(kind='box')
plt.title('Test')
Topic Distribution Analysis
plt.figure(figsize=(6, 3))
# count how many topics each post carries: most posts have only 1 topic, with a maximum of 7
train_data['Label'].apply(lambda x: len(x.split(','))).value_counts().plot(kind='barh')
# topic label_1191241 has the most samples, and the classes are severely imbalanced
pd.Series(','.join(train_data['Label']).split(',')).value_counts().head(20)
label_1191241 10235
label_1008181 10023
label_1281707 4680
label_472394 4671
label_1515062 4671
label_742793 2783
label_1411524 1906
label_1265038 1881
label_896157 1730
label_1064693 1719
label_467023 1641
label_1474127 1421
label_1227838 1066
label_1166118 1066
label_19479 1060
label_287908 828
label_753343 719
label_1056127 716
label_529001 711
label_512340 704
dtype: int64
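The counting one-liner above flattens multi-label rows by joining every row into one comma-separated string and splitting it back into individual labels. A toy sketch of the same pattern (with made-up labels):

```python
import pandas as pd

# each row may hold several comma-separated labels, as in the Label column
labels = pd.Series(['a,b', 'a', 'b,c'])

# join all rows into one string, split into individual labels, then count
counts = pd.Series(','.join(labels).split(',')).value_counts()
print(counts.to_dict())  # {'a': 2, 'b': 2, 'c': 1}
```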
# word cloud for topic label_1191241
content = ''.join(train_data[train_data['Label'].str.contains('label_1191241')]['Text'])
wordcloud = WordCloud(background_color='white', max_words=1000, font_path='STHeiti-Light.ttc')
# join the segmented words with spaces so WordCloud can split on whitespace
wordcloud.generate(' '.join(jieba.lcut(content)))
plt.imshow(wordcloud)
plt.xticks([]); plt.yticks([])
Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.805 seconds.
Prefix dict has been built successfully.
# word cloud for topic label_1515062
content = ''.join(train_data[train_data['Label'].str.contains('label_1515062')]['Text'])
wordcloud = WordCloud(background_color='white', max_words=1000, font_path='STHeiti-Light.ttc')
wordcloud.generate(' '.join(jieba.lcut(content)))
plt.imshow(wordcloud)
plt.xticks([]); plt.yticks([])
# word cloud for topic label_878402
content = ''.join(train_data[train_data['Label'].str.contains('label_878402')]['Text'])
wordcloud = WordCloud(background_color='white', max_words=1000, font_path='STHeiti-Light.ttc')
wordcloud.generate(' '.join(jieba.lcut(content)))
plt.imshow(wordcloud)
plt.xticks([]); plt.yticks([])
Approach
This is a typical text competition, but it differs from ordinary text classification: keywords in Weibo posts tend to be especially salient, and a post may carry multiple topics.
We will try several approaches step by step (all framed as single-label classification):
- Approach 1: count the keywords of each topic's posts and classify by keyword matching
- Approach 2: extract TF-IDF or bag-of-words features and classify with a machine learning model
- Approach 3: fine-tune an ERNIE 3.0 model from PaddleNLP
Validation Split
train_data = train_data.sample(frac=1.0)
# .copy() avoids pandas' SettingWithCopyWarning when adding the 'word' column below
X, X_valid = train_data.iloc[:-15000].copy(), train_data.iloc[-15000:].copy()
X['word'] = X['Text'].apply(jieba.lcut)
X_valid['word'] = X_valid['Text'].apply(jieba.lcut)
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.684 seconds.
Prefix dict has been built successfully.
Approach 1: Topic Keywords
train_label = pd.Series(','.join(X['Label']).split(',')).value_counts().index
# for the top-10 topics
for label in train_label[:10]:
    # the 5 most frequent words in the first 100 posts of this topic
    words = sum(X[X['Label'].str.contains(label)]['word'].iloc[:100], [])
    words = [x for x in words if len(x.strip()) >= 2]
    print(label, Counter(words).most_common(5))
label_1191241 [('左航', 208), ('ZH', 96), ('TF', 96), ('家族', 96), ('超话', 79)]
label_1008181 [('左航', 208), ('ZH', 96), ('TF', 96), ('家族', 96), ('超话', 79)]
label_1281707 [('张极', 333), ('TF', 93), ('家族', 93), ('舞台', 68), ('微博', 59)]
label_1515062 [('张极', 333), ('TF', 93), ('家族', 93), ('舞台', 68), ('微博', 59)]
label_472394 [('张极', 333), ('TF', 93), ('家族', 93), ('舞台', 68), ('微博', 59)]
label_742793 [('浙江', 54), ('高考', 17), ('考生', 16), ('真的', 14), ('分数线', 14)]
label_1265038 [('怎么办', 53), ('我们', 47), ('施暴', 31), ('为什么', 25), ('自己', 21)]
label_1411524 [('会员', 131), ('成长', 130), ('PK', 130), ('厉害', 100), ('击败', 100)]
label_896157 [('浙江', 69), ('最后', 28), ('高考', 20), ('一下', 18), ('一年', 16)]
label_1064693 [('炸鸡', 37), ('韩国', 36), ('我们', 20), ('什么', 16), ('自己', 15)]
train_label = pd.Series(','.join(X['Label']).split(',')).value_counts().index
top_word_rule = {}
for label in train_label[:]:
    # the 5 most frequent words in the first 100 posts of this topic
    words = sum(X[X['Label'].str.contains(label)]['word'].iloc[:100], [])
    words = [x for x in words if len(x.strip()) >= 2]
    # keep only the first top word that is not already claimed by another topic
    top_words = [x[0] for x in Counter(words).most_common(5)]
    for word in top_words:
        if word not in top_word_rule:
            top_word_rule[word] = label
            break
X_valid_pred = []
# greedy matching: assign the topic of the first rule word found in the text
for text in X_valid['Text']:
    label = -1
    for word in top_word_rule.keys():
        if word in text:
            label = top_word_rule[word]
            break
    X_valid_pred.append(label)
from sklearn.metrics import accuracy_score
# accuracy of the keyword rules
accuracy_score(X_valid['Label'], X_valid_pred)
0.14406666666666668
Approach 2: TF-IDF + Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
- Bag-of-words (BOW) features
vector = CountVectorizer(max_features=2000).fit([' '.join(x) for x in X['word']])
X_data = vector.transform([' '.join(x) for x in X['word']])
X_valid_data = vector.transform([' '.join(x) for x in X_valid['word']])
model = MultinomialNB().fit(X_data, X['Label'])
# BOW + NB accuracy
model.score(X_valid_data, X_valid['Label'])
0.679
- TF-IDF features
vector = TfidfVectorizer(max_features=2000).fit([' '.join(x) for x in X['word']])
X_data = vector.transform([' '.join(x) for x in X['word']])
X_valid_data = vector.transform([' '.join(x) for x in X_valid['word']])
model = MultinomialNB().fit(X_data, X['Label'])
# TF-IDF + NB accuracy
model.score(X_valid_data, X_valid['Label'])
0.589
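LogisticRegression is imported above but never fitted; as a rough sketch (on a made-up toy corpus, not the competition data), it can be swapped in for MultinomialNB over the same TF-IDF features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy corpus standing in for the space-joined segmented Weibo text
texts = ['炸鸡 韩国 价格', '高考 浙江 考生', '炸鸡 涨价 韩国', '浙江 高考 分数线']
labels = ['food', 'exam', 'food', 'exam']

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(['韩国 炸鸡']))  # ['food']
```

Linear models over TF-IDF are a standard baseline and worth comparing against naive Bayes on the validation split.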
Approach 3: ERNIE Classification
import paddle
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer
# load the model (1399 distinct topic labels)
model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-mini-zh", num_classes=1399)
# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-mini-zh")
tokenizer.encode('你好,我是阿水。')
[2023-02-24 21:10:49,988] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification'> to load 'ernie-3.0-mini-zh'.
[2023-02-24 21:10:49,991] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-3.0-mini-zh/ernie_3.0_mini_zh.pdparams
W0224 21:10:49.996181 2852 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0224 21:10:50.000505 2852 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-02-24 21:10:52,243] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'ernie-3.0-mini-zh'.
[2023-02-24 21:10:52,247] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-3.0-mini-zh/ernie_3.0_mini_zh_vocab.txt
[2023-02-24 21:10:52,274] [ INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/ernie-3.0-mini-zh/tokenizer_config.json
[2023-02-24 21:10:52,277] [ INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/ernie-3.0-mini-zh/special_tokens_map.json
{'input_ids': [1, 226, 170, 4, 75, 10, 816, 101, 12043, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
from sklearn.preprocessing import MultiLabelBinarizer, LabelEncoder
# single-label encoding
label_encoder = LabelEncoder()
label_encoder.fit(train_data['Label'])
X_muti_label = label_encoder.transform(X['Label'])
X_valid_muti_label = label_encoder.transform(X_valid['Label'])
# multi-label encoding (alternative, commented out)
# label_encoder = MultiLabelBinarizer()
# label_encoder.fit(train_data['Label'].str.split(','))
# X_muti_label = label_encoder.transform(X['Label'].str.split(','))
# X_valid_muti_label = label_encoder.transform(X_valid['Label'].str.split(','))
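The commented-out multi-label path turns each label string into a multi-hot vector. A minimal illustration of what MultiLabelBinarizer produces, using two rows in the same comma-separated format as the Label column:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# rows in the same comma-separated format as the Label column
rows = ['label_878402', 'label_878402,label_1191241']

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform([r.split(',') for r in rows])

print(list(mlb.classes_))  # ['label_1191241', 'label_878402']
print(Y.tolist())          # [[0, 1], [1, 1]]
```

With this encoding, the training loss would also change from CrossEntropyLoss to a per-label sigmoid loss such as paddle.nn.BCEWithLogitsLoss.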
from paddle.io import Dataset, DataLoader

# custom dataset wrapping the raw text and the encoded labels
class MyDataset(Dataset):
    def __init__(self, data, label):
        self.data = data
        self.label = label

    def __getitem__(self, idx):
        return self.data[idx], self.label[idx]

    def __len__(self):
        return len(self.data)
train_loader = DataLoader(MyDataset(X['Text'].values[:], X_muti_label[:]), batch_size=50, shuffle=True)
valid_loader = DataLoader(MyDataset(X_valid['Text'].values[:], X_valid_muti_label[:]), batch_size=50)
optimizer = paddle.optimizer.AdamW(0.00005, parameters=model.parameters())
loss_fn = paddle.nn.loss.CrossEntropyLoss(reduction='mean')
for epoch in range(10):
    # training loop
    model.train()
    for batch_x, batch_y in tqdm(train_loader):
        batch_x = tokenizer(batch_x, max_length=70, padding=True)
        batch_x = {key: paddle.to_tensor(value) for key, value in batch_x.items()}
        pred = model(batch_x['input_ids'], batch_x['token_type_ids'])
        loss = loss_fn(pred, paddle.to_tensor(batch_y, dtype="int32"))
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()

    # validation loop
    model.eval()
    val_loss = []
    with paddle.no_grad():
        for batch_x, batch_y in tqdm(valid_loader):
            batch_x = tokenizer(batch_x, max_length=70, padding=True)
            batch_x = {key: paddle.to_tensor(value) for key, value in batch_x.items()}
            batch_y = paddle.to_tensor(batch_y, dtype="int32")
            pred = model(batch_x['input_ids'], batch_x['token_type_ids'])
            loss = loss_fn(pred, batch_y)
            val_loss.append(loss.item())

    # note: the accuracy below is computed on the last validation batch only
    print('Epoch {0}, Val loss {1:3f}, Val Accuracy {2:3f}'.format(
        epoch,
        np.mean(val_loss),
        (pred.argmax(1) == batch_y).astype('float').mean().item()
    ))
Epoch 0, Val loss 2.689926, Val Accuracy 0.500000
Epoch 1, Val loss 1.945902, Val Accuracy 0.660000
Epoch 2, Val loss 1.601855, Val Accuracy 0.660000
Epoch 3, Val loss 1.377380, Val Accuracy 0.660000
Epoch 4, Val loss 1.250767, Val Accuracy 0.800000
Epoch 5, Val loss 1.161806, Val Accuracy 0.720000
Epoch 6, Val loss 1.091673, Val Accuracy 0.800000
Epoch 7, Val loss 1.039342, Val Accuracy 0.840000
Epoch 8, Val loss 1.012694, Val Accuracy 0.860000
Summary and Outlook
In this project we tackled Weibo topic recognition. We analyzed text lengths and topic distributions and uncovered some regularities in the data, then tried several approaches spanning classical machine learning and deep learning. Among all the models, the ERNIE model performed best and is well worth studying.
For the ERNIE model we built the dataset, the model, and the training loop, which is the standard pipeline for a text classification task. Possible follow-up improvements:
- use a larger ERNIE variant (this project uses the mini version)
- switch from single-label to multi-label classification
- try cross-validation, hyperparameter tuning, and adversarial training
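For the multi-label direction, per-label sigmoid probabilities would need to be thresholded back into comma-separated label strings. A minimal sketch with hypothetical probabilities and label names; the 0.5 cutoff is an assumption that should be tuned on the validation split:

```python
import numpy as np

# hypothetical sigmoid outputs for 2 samples over 3 labels
probs = np.array([[0.9, 0.1, 0.7],
                  [0.2, 0.8, 0.3]])
classes = np.array(['label_878402', 'label_1191241', 'label_1515062'])

THRESHOLD = 0.5  # assumption: tune this cutoff on the validation set
preds = [','.join(classes[row > THRESHOLD]) for row in probs]
print(preds)  # ['label_878402,label_1515062', 'label_1191241']
```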