2022 People's Daily Online Algorithm Competition: Weibo Topic Recognition (ERNIE Text Classification)



Competition page: http://data.sklccc.com/2022

Task Introduction

As a new form of social media, Sina Weibo has accumulated massive data across many domains. Mining latent features from it and identifying topics in time can bring considerable social value. This competition provides a Weibo dataset in which each record contains the post text and its topic label(s); a sample may carry one or more topic labels. Participants build a model on the training set and predict the topic labels of the test set.

Data Description

  • The training set contains text samples and their labels in train.csv, with fields separated by tabs:
    • Text: the Weibo post content
    • Label: the topic label(s)
Text | Label
原来是这样啊,又是一个新的道理,简直让我长知识了,七星连珠是很难得的一种现象,这个解析也太到位了 | label_878402
老师好厉害,咱就是说,老师,能不能,就是,研究一下能不能穿越🙏 代入感很强,我已经想穿越去… | label_878402
这个七星连珠的现象居然真的存在,我只在小说上面看过,太神奇了吧,至于其他的我也不太懂,科学现… | label_878402
  • The test set contains unlabeled samples in test.csv, with the following fields:
    • ID: the sample ID
    • Text: the Weibo post content
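
For reference, a minimal sketch (assuming the files described above) of how the comma-separated multi-label field parses into per-sample label lists; the full loading code appears in the Data Loading section below:

import pandas as pd

# Posts with several topics carry comma-separated labels, e.g. 'label_a,label_b'.
train_data = pd.read_csv('train.csv', sep='\t')
train_data['Label'].str.split(',').head()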

Evaluation Metric

To better reflect model capability, this task scores each sample with a partial-credit method: sample-averaged F1, where F1 is computed per sample and then averaged.

from sklearn import metrics
import numpy as np

# Each row is one sample's multi-hot label vector.
y_true = np.array([[0, 1, 0, 1], [1, 0, 1, 0]])
y_pred = np.array([[0, 1, 1, 0], [1, 0, 1, 0]])

# Sample-averaged F1: F1 is computed per sample and then averaged.
# Here sample 1 scores F1 = 0.5 and sample 2 scores F1 = 1.0, giving 0.75.
F1_score = metrics.f1_score(y_true, y_pred, average="samples")

Data Loading

!unzip data/data193239/weibo_topic_recognition_01.zip
!pip install wordcloud
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting wordcloud
  Downloading wordcloud-1.8.2.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (435 kB)
Requirement already satisfied: pillow, numpy>=1.6.1, matplotlib (and their dependencies)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.8.2.2

[notice] A new release of pip available: 22.1.2 -> 23.0.1
[notice] To update, run: pip install --upgrade pip
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import jieba  
from wordcloud import WordCloud, ImageColorGenerator  
from collections import Counter
from tqdm import tqdm
import paddle
train_data = pd.read_csv('train.csv', sep='\t')
test_data = pd.read_csv('test.csv', sep='\t')
train_data.head(3)
ID | Text | Label
0 | 原来是这样啊,又是一个新的道理,简直让我长知识了,七星连珠是很难得的一种现象,这个解析也太到位了 | label_878402
1 | 老师好厉害,咱就是说,老师,能不能,就是,研究一下能不能穿越🙏 代入感很强,我已经想穿越去... | label_878402
2 | 这个七星连珠的现象居然真的存在,我只在小说上面看过,太神奇了吧,至于其他的我也不太懂,科学现... | label_878402
test_data.head(3)
ID | Text
5077 | @河南城建学院2022年第二学士学位招生简章已发布,招生对象为当年普通高校本科毕业并获得学士...
5435 | 左航ZH💙 💙 一个人的能力和才华可以成为吸引人的资本,但这世上最能打动人的,永远是内心那些...
7668 | 韩国人连炸鸡都吃不起了?韩国龙头企业上调炸鸡价格!网友:失去炸鸡自由了、 搞机数码哥的微博视...

Data Analysis

Text Length Analysis

plt.figure(figsize=(6, 3))

plt.subplot(121)
train_data['Text'].apply(len).plot(kind='box')
plt.title('Train')

plt.subplot(122)
test_data['Text'].apply(len).plot(kind='box')
plt.title('Test')
Text(0.5,1,'Test')

[Figure: box plots of text lengths in the train and test sets]

Topic Distribution Analysis

plt.figure(figsize=(6, 3))
# Count how many topics each post carries: most posts have a single topic, at most 7
train_data['Label'].apply(lambda x: len(x.split(','))).value_counts().plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x7ff1003f9c10>

[Figure: horizontal bar chart of the number of topics per post]

# Topic label_1191241 has the most samples; the class distribution is heavily imbalanced
pd.Series(','.join(train_data['Label']).split(',')).value_counts().head(20)
label_1191241    10235
label_1008181    10023
label_1281707     4680
label_472394      4671
label_1515062     4671
label_742793      2783
label_1411524     1906
label_1265038     1881
label_896157      1730
label_1064693     1719
label_467023      1641
label_1474127     1421
label_1227838     1066
label_1166118     1066
label_19479       1060
label_287908       828
label_753343       719
label_1056127      716
label_529001       711
label_512340       704
dtype: int64
# Word cloud for topic label_1191241
content = ''.join(train_data[train_data['Label'].str.contains('label_1191241')]['Text'])
wordcloud = WordCloud(background_color = 'white', max_words = 1000, font_path = 'STHeiti-Light.ttc')
# Join the jieba tokens with spaces so WordCloud can separate Chinese words
wordcloud.generate(' '.join(jieba.lcut(content)))
plt.imshow(wordcloud)
plt.xticks([]); plt.yticks([])
Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.805 seconds.
Prefix dict has been built successfully.

([], <a list of 0 Text yticklabel objects>)

[Figure: word cloud for topic label_1191241]

# Word cloud for topic label_1515062
content = ''.join(train_data[train_data['Label'].str.contains('label_1515062')]['Text'])
wordcloud = WordCloud(background_color = 'white', max_words = 1000, font_path = 'STHeiti-Light.ttc')
wordcloud.generate(' '.join(jieba.lcut(content)))
plt.imshow(wordcloud)
plt.xticks([]); plt.yticks([])
([], <a list of 0 Text yticklabel objects>)

[Figure: word cloud for topic label_1515062]

# Word cloud for topic label_878402
content = ''.join(train_data[train_data['Label'].str.contains('label_878402')]['Text'])
wordcloud = WordCloud(background_color = 'white', max_words = 1000, font_path = 'STHeiti-Light.ttc')
wordcloud.generate(' '.join(jieba.lcut(content)))
plt.imshow(wordcloud)
plt.xticks([]); plt.yticks([])
([], <a list of 0 Text yticklabel objects>)

[Figure: word cloud for topic label_878402]

Approach

This is a typical text competition, yet it differs from ordinary text classification: keywords in Weibo posts tend to be quite salient, and a post may carry multiple topics.

We will try several approaches step by step (all treated as single-label classification; a sample-F1 helper sketch follows this list):

  • Approach 1: count keywords within each topic's posts and classify by keyword matching
  • Approach 2: extract TF-IDF or bag-of-words features and classify with machine-learning models
  • Approach 3: fine-tune the ERNIE 3.0 model from PaddleNLP for classification
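
Since every approach below predicts a single label while the official metric is sample-averaged F1, a small helper (a sketch added here, not part of the original notebook) can score single-label predictions against the comma-separated ground truth:

def sample_f1(y_true, y_pred):
    # y_true: comma-separated label strings; y_pred: one predicted label per sample
    scores = []
    for t, p in zip(y_true, y_pred):
        t_set, p_set = set(t.split(',')), {p}
        tp = len(t_set & p_set)
        if tp == 0:
            scores.append(0.0)
        else:
            precision, recall = tp / len(p_set), tp / len(t_set)
            scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores)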

Validation Split

train_data = train_data.sample(frac=1.0)
X, X_valid = train_data.iloc[:-15000].copy(), train_data.iloc[-15000:].copy()

# Pre-segment the texts with jieba for the keyword and TF-IDF approaches;
# .copy() above avoids pandas' SettingWithCopyWarning when adding this column
X['word'] = X['Text'].apply(jieba.lcut)
X_valid['word'] = X_valid['Text'].apply(jieba.lcut)
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.684 seconds.
Prefix dict has been built successfully.

Approach 1: Topic Keywords

train_label = pd.Series(','.join(X['Label']).split(',')).value_counts().index

# For the 10 most frequent topics
for label in train_label[:10]:
    # The 5 most frequent words (length >= 2) in the first 100 posts of the topic
    words = sum(X[X['Label'].str.contains(label)]['word'].iloc[:100], [])
    words = [x for x in words if len(x.strip()) >= 2]
    print(label, Counter(words).most_common(5))
label_1191241 [('左航', 208), ('ZH', 96), ('TF', 96), ('家族', 96), ('超话', 79)]
label_1008181 [('左航', 208), ('ZH', 96), ('TF', 96), ('家族', 96), ('超话', 79)]
label_1281707 [('张极', 333), ('TF', 93), ('家族', 93), ('舞台', 68), ('微博', 59)]
label_1515062 [('张极', 333), ('TF', 93), ('家族', 93), ('舞台', 68), ('微博', 59)]
label_472394 [('张极', 333), ('TF', 93), ('家族', 93), ('舞台', 68), ('微博', 59)]
label_742793 [('浙江', 54), ('高考', 17), ('考生', 16), ('真的', 14), ('分数线', 14)]
label_1265038 [('怎么办', 53), ('我们', 47), ('施暴', 31), ('为什么', 25), ('自己', 21)]
label_1411524 [('会员', 131), ('成长', 130), ('PK', 130), ('厉害', 100), ('击败', 100)]
label_896157 [('浙江', 69), ('最后', 28), ('高考', 20), ('一下', 18), ('一年', 16)]
label_1064693 [('炸鸡', 37), ('韩国', 36), ('我们', 20), ('什么', 16), ('自己', 15)]
train_label = pd.Series(','.join(X['Label']).split(',')).value_counts().index

top_word_rule = {}
for label in train_label[:]:
    # The 5 most frequent words in the first 100 posts of the topic
    words = sum(X[X['Label'].str.contains(label)]['word'].iloc[:100], [])
    words = [x for x in words if len(x.strip()) >= 2]
    
    # Posts carry multiple labels, so co-occurring topics share top words (as seen
    # above); assign each word to the first topic that claims it, one word per topic
    top_words = [x[0] for x in Counter(words).most_common(5)]
    for word in top_words:
        if word not in top_word_rule:
            top_word_rule[word] = label
            break
X_valid_pred = []

# Greedy matching: predict the first topic whose keyword appears in the text
for text in X_valid['Text']:
    label = -1
    for word in top_word_rule.keys():
        if word in text:
            label = top_word_rule[word]
            break

    X_valid_pred.append(label)
from sklearn.metrics import accuracy_score

# Keyword-rule accuracy (exact match against the full label string)
accuracy_score(X_valid['Label'], X_valid_pred)
0.14406666666666668

Approach 2: TF-IDF + Machine Learning

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
  • Bag-of-words (BOW) features
vector = CountVectorizer(max_features=2000).fit([' '.join(x) for x in X['word']])
X_data = vector.transform([' '.join(x) for x in X['word']])
X_valid_data = vector.transform([' '.join(x) for x in X_valid['word']])

model = MultinomialNB().fit(X_data, X['Label'])

# Accuracy of BOW + naive Bayes
model.score(X_valid_data, X_valid['Label'])
0.679
  • TF-IDF features
vector = TfidfVectorizer(max_features=2000).fit([' '.join(x) for x in X['word']])
X_data = vector.transform([' '.join(x) for x in X['word']])
X_valid_data = vector.transform([' '.join(x) for x in X_valid['word']])

model = MultinomialNB().fit(X_data, X['Label'])

# Accuracy of TF-IDF + naive Bayes
model.score(X_valid_data, X_valid['Label'])
0.589
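
LogisticRegression is imported above but never used; as a further hedged baseline (not run in the original notebook), a linear model can be fit on the same TF-IDF features:

# Illustrative sketch: linear classifier on the TF-IDF features from the cell above
model = LogisticRegression(max_iter=1000)
model.fit(X_data, X['Label'])
model.score(X_valid_data, X_valid['Label'])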

Approach 3: ERNIE Classification

import paddle
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the pretrained model with a 1399-way classification head
model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-mini-zh", num_classes=1399)

# Load the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-mini-zh")

tokenizer.encode('你好,我是阿水。')   
[2023-02-24 21:10:49,988] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification'> to load 'ernie-3.0-mini-zh'.
[2023-02-24 21:10:49,991] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-3.0-mini-zh/ernie_3.0_mini_zh.pdparams
W0224 21:10:49.996181  2852 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0224 21:10:50.000505  2852 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-02-24 21:10:52,243] [    INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'ernie-3.0-mini-zh'.
[2023-02-24 21:10:52,247] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-3.0-mini-zh/ernie_3.0_mini_zh_vocab.txt
[2023-02-24 21:10:52,274] [    INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/ernie-3.0-mini-zh/tokenizer_config.json
[2023-02-24 21:10:52,277] [    INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/ernie-3.0-mini-zh/special_tokens_map.json

{'input_ids': [1, 226, 170, 4, 75, 10, 816, 101, 12043, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
from sklearn.preprocessing import MultiLabelBinarizer, LabelEncoder

# Single-label encoding: each full label string becomes one class index
label_encoder = LabelEncoder()
label_encoder.fit(train_data['Label'])
X_muti_label = label_encoder.transform(X['Label'])
X_valid_muti_label = label_encoder.transform(X_valid['Label'])

# Multi-label encoding (kept for reference; see the sketch in the conclusion)
# label_encoder = MultiLabelBinarizer()
# label_encoder.fit(train_data['Label'].str.split(','))
# X_muti_label = label_encoder.transform(X['Label'].str.split(','))
# X_valid_muti_label = label_encoder.transform(X_valid['Label'].str.split(','))
from paddle.io import Dataset, DataLoader

# Custom dataset wrapping the texts and encoded labels
class MyDataset(Dataset):
    def __init__(self, data, label):
        self.data = data
        self.label = label

    def __getitem__(self, idx):
        return self.data[idx], self.label[idx]

    def __len__(self):
        return len(self.data)

train_loader = DataLoader(MyDataset(X['Text'].values[:], X_muti_label[:]), batch_size=50, shuffle=True)
valid_loader = DataLoader(MyDataset(X_valid['Text'].values[:], X_valid_muti_label[:]), batch_size=50)
optimizer = paddle.optimizer.AdamW(0.00005, parameters=model.parameters())
loss_fn = paddle.nn.CrossEntropyLoss(reduction='mean')

for epoch in range(10):
    # Training loop
    model.train()
    for batch_x, batch_y in tqdm(train_loader):
        # Tokenize the raw strings, padding each batch to its longest sequence
        batch_x = tokenizer(batch_x, max_length=70, padding=True)
        batch_x = {key: paddle.to_tensor(value) for key, value in batch_x.items()}
        
        pred = model(batch_x['input_ids'], batch_x['token_type_ids'])
        loss = loss_fn(pred, paddle.to_tensor(batch_y, dtype="int32"))

        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
    
    # Validation loop
    model.eval()
    val_loss = []
    with paddle.no_grad():
        for batch_x, batch_y in tqdm(valid_loader):
            batch_x = tokenizer(batch_x, max_length=70, padding=True)
            batch_x = {key: paddle.to_tensor(value) for key, value in batch_x.items()}
            batch_y = paddle.to_tensor(batch_y, dtype="int32")

            pred = model(batch_x['input_ids'], batch_x['token_type_ids'])
            loss = loss_fn(pred, batch_y)
            val_loss.append(loss.item())
    
    # Note: the accuracy below is computed on the last validation batch only,
    # not on the full validation set
    print('Epoch {0}, Val loss {1:3f}, Val Accuracy {2:3f}'.format(
        epoch,
        np.mean(val_loss), 
        (pred.argmax(1) == batch_y).astype('float').mean().item()
    ))
Epoch 0, Val loss 2.689926, Val Accuracy 0.500000
Epoch 1, Val loss 1.945902, Val Accuracy 0.660000
Epoch 2, Val loss 1.601855, Val Accuracy 0.660000
Epoch 3, Val loss 1.377380, Val Accuracy 0.660000
Epoch 4, Val loss 1.250767, Val Accuracy 0.800000
Epoch 5, Val loss 1.161806, Val Accuracy 0.720000
Epoch 6, Val loss 1.091673, Val Accuracy 0.800000
Epoch 7, Val loss 1.039342, Val Accuracy 0.840000
Epoch 8, Val loss 1.012694, Val Accuracy 0.860000
 36%|███▌      | 447/1247 [00:26<00:48, 16.46it/s]
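
The original run stops during training. A minimal inference sketch for the test set (an addition, not from the original notebook): encode the test texts in batches, take the argmax class, and map indices back to label strings with the fitted LabelEncoder.

model.eval()
test_pred = []
with paddle.no_grad():
    for i in range(0, len(test_data), 50):
        # Tokenize a batch of raw test texts
        batch = tokenizer(list(test_data['Text'].iloc[i:i + 50]), max_length=70, padding=True)
        batch = {key: paddle.to_tensor(value) for key, value in batch.items()}
        logits = model(batch['input_ids'], batch['token_type_ids'])
        test_pred += logits.argmax(1).numpy().tolist()

# Map class indices back to label strings such as 'label_878402'
test_data['Label'] = label_encoder.inverse_transform(test_pred)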

Summary and Outlook

In this project we tackled Weibo topic recognition. We analyzed text lengths and topic distributions and uncovered some regularities in the data, then tried several approaches spanning machine learning and deep learning. Among all models, ERNIE performed best and is well worth studying.

In applying ERNIE we built the dataset, the model, and the training loop, which is the standard workflow for text classification tasks. Possible follow-ups include:

  • Use a larger ERNIE variant; this project uses the mini version (ernie-3.0-mini-zh)
  • Switch from single-label to multi-label classification (see the sketch after this list)
  • Try cross-validation, hyperparameter tuning, and adversarial training
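
A hedged sketch of the multi-label variant mentioned above (assumptions, not from the original notebook): encode labels as multi-hot vectors with the MultiLabelBinarizer commented out earlier, train with a sigmoid-based loss, and threshold per-topic probabilities at inference.

from sklearn.preprocessing import MultiLabelBinarizer

# Multi-hot encoding: one 0/1 column per topic (labels split on ',')
mlb = MultiLabelBinarizer()
y_multi = mlb.fit_transform(train_data['Label'].str.split(','))

# BCEWithLogitsLoss applies the sigmoid internally; labels must be float32
loss_fn = paddle.nn.BCEWithLogitsLoss()
# In the training loop: loss = loss_fn(pred, paddle.to_tensor(batch_y, dtype="float32"))

# At inference, threshold each topic's probability instead of taking the argmax:
# pred_labels = mlb.inverse_transform((paddle.nn.functional.sigmoid(logits) > 0.5).numpy().astype(int))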
