This article originates from an AIStudio community featured project.
2022 People's Daily Online Algorithm Competition: Weibo Topic Recognition
Competition page: http://data.sklccc.com/2022
Task Description
As an emerging social media platform, Sina Weibo has accumulated massive amounts of data across many domains. Mining latent features from this data and identifying topics in a timely manner can create considerable social value. This competition provides a Weibo topic recognition dataset in which each record contains the text of a post and its topic label(s); a sample may carry one or more topic labels. Participants build a model on the training set and predict the topic labels of the test set.
Data Description
- The training set contains a batch of text samples with labels in train.csv, tab-separated, with the fields:
  - Text: the content of the Weibo post
  - Label: the topic label(s), comma-separated when there are several

Text | Label
---|---
原来是这样啊,又是一个新的道理,简直让我长知识了,七星连珠是很难得的一种现象,这个解析也太到位了 | label_878402
老师好厉害,咱就是说,老师,能不能,就是,研究一下能不能穿越🙏 代入感很强,我已经想穿越去… | label_878402
这个七星连珠的现象居然真的存在,我只在小说上面看过,太神奇了吧,至于其他的我也不太懂,科学现… | label_878402

- The test set contains a batch of unlabeled samples in test.csv with the fields:
  - ID: the sample ID
  - Text: the content of the Weibo post
Evaluation Metric
In this task, to better reflect model capability, samples are scored with a partial-credit scheme: the samples-averaged F1 score.
from sklearn import metrics
import numpy as np
y_true = np.array([[0, 1, 0, 1], [1, 0, 1, 0]])
y_pred = np.array([[0, 1, 1, 0], [1, 0, 1, 0]])
F1_score = metrics.f1_score(y_true, y_pred, average="samples")
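To make the partial-credit scoring concrete, here is the samples-averaged F1 worked out by hand on a small two-sample case: the first sample recovers one of its two true labels and adds one false positive (precision = recall = 1/2, so F1 = 0.5), while the second is an exact match (F1 = 1.0), giving an average of 0.75:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([[0, 1, 0, 1], [1, 0, 1, 0]])
y_pred = np.array([[0, 1, 1, 0], [1, 0, 1, 0]])

# sample 1: tp=1, fp=1, fn=1 -> precision = recall = 0.5 -> F1 = 0.5
# sample 2: exact match -> F1 = 1.0
score = metrics.f1_score(y_true, y_pred, average="samples")
print(score)  # 0.75
```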
Data Loading
!unzip data/data193239/weibo_topic_recognition_01.zip
!pip install wordcloud
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting wordcloud
  Downloading wordcloud-1.8.2.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (435 kB)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.8.2.2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud, ImageColorGenerator
from collections import Counter
from tqdm import tqdm
import paddle
train_data = pd.read_csv('train.csv', sep='\t')
test_data = pd.read_csv('test.csv', sep='\t')
train_data.head(3)
 | ID | Text | Label
---|---|---|---
0 | 0 | 原来是这样啊,又是一个新的道理,简直让我长知识了,七星连珠是很难得的一种现象,这个解析也太到位了 | label_878402
1 | 1 | 老师好厉害,咱就是说,老师,能不能,就是,研究一下能不能穿越🙏 代入感很强,我已经想穿越去... | label_878402
2 | 2 | 这个七星连珠的现象居然真的存在,我只在小说上面看过,太神奇了吧,至于其他的我也不太懂,科学现... | label_878402
test_data.head(3)
 | ID | Text
---|---|---
0 | 5077 | @河南城建学院2022年第二学士学位招生简章已发布,招生对象为当年普通高校本科毕业并获得学士...
1 | 5435 | 左航ZH💙 💙 一个人的能力和才华可以成为吸引人的资本,但这世上最能打动人的,永远是内心那些...
2 | 7668 | 韩国人连炸鸡都吃不起了?韩国龙头企业上调炸鸡价格!网友:失去炸鸡自由了、 搞机数码哥的微博视...
Data Analysis
Sentence Length Analysis
plt.figure(figsize=(6, 3))
plt.subplot(121)
train_data['Text'].apply(len).plot(kind='box')
plt.title('Train')
plt.subplot(122)
test_data['Text'].apply(len).plot(kind='box')
plt.title('Test')
Topic Distribution Analysis
plt.figure(figsize=(6, 3))
# count how many topics each post carries: most posts have only 1 topic, with a maximum of 7
train_data['Label'].apply(lambda x: len(x.split(','))).value_counts().plot(kind='barh')
# topic label_1191241 has the most samples, and the classes are severely imbalanced
pd.Series(','.join(train_data['Label']).split(',')).value_counts().head(20)
label_1191241 10235
label_1008181 10023
label_1281707 4680
label_472394 4671
label_1515062 4671
label_742793 2783
label_1411524 1906
label_1265038 1881
label_896157 1730
label_1064693 1719
label_467023 1641
label_1474127 1421
label_1227838 1066
label_1166118 1066
label_19479 1060
label_287908 828
label_753343 719
label_1056127 716
label_529001 711
label_512340 704
dtype: int64
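The counting one-liner above flattens multi-label rows by joining every row into one comma-separated string and splitting it back into individual labels. A toy sketch of the same pattern (with made-up labels):

```python
import pandas as pd

# each row may hold several comma-separated labels, as in the Label column
labels = pd.Series(['a,b', 'a', 'b,c'])

# join all rows into one string, split into individual labels, then count
counts = pd.Series(','.join(labels).split(',')).value_counts()
print(counts.to_dict())  # {'a': 2, 'b': 2, 'c': 1}
```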
# word cloud for topic label_1191241
content = ''.join(train_data[train_data['Label'].str.contains('label_1191241')]['Text'])
wordcloud = WordCloud(background_color='white', max_words=1000, font_path='STHeiti-Light.ttc')
# join the segmented words with spaces so WordCloud can split on whitespace
wordcloud.generate(' '.join(jieba.lcut(content)))
plt.imshow(wordcloud)
plt.xticks([]); plt.yticks([])
Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.805 seconds.
Prefix dict has been built successfully.
# word cloud for topic label_1515062
content = ''.join(train_data[train_data['Label'].str.contains('label_1515062')]['Text'])
wordcloud = WordCloud(background_color='white', max_words=1000, font_path='STHeiti-Light.ttc')
wordcloud.generate(' '.join(jieba.lcut(content)))
plt.imshow(wordcloud)
plt.xticks([]); plt.yticks([])
# word cloud for topic label_878402
content = ''.join(train_data[train_data['Label'].str.contains('label_878402')]['Text'])
wordcloud = WordCloud(background_color='white', max_words=1000, font_path='STHeiti-Light.ttc')
wordcloud.generate(' '.join(jieba.lcut(content)))
plt.imshow(wordcloud)
plt.xticks([]); plt.yticks([])
Approach
This is a typical text competition, but it differs from ordinary text classification: keywords in Weibo posts tend to be especially salient, and a post may carry multiple topics.
We will try several approaches step by step (all framed as single-label classification):
- Approach 1: count the keywords of each topic's posts and classify by keyword matching
- Approach 2: extract TF-IDF or bag-of-words features and classify with a machine learning model
- Approach 3: fine-tune an ERNIE 3.0 model from PaddleNLP
Validation Split
train_data = train_data.sample(frac=1.0)
# .copy() avoids pandas' SettingWithCopyWarning when adding the 'word' column below
X, X_valid = train_data.iloc[:-15000].copy(), train_data.iloc[-15000:].copy()
X['word'] = X['Text'].apply(jieba.lcut)
X_valid['word'] = X_valid['Text'].apply(jieba.lcut)
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.684 seconds.
Prefix dict has been built successfully.
Approach 1: Topic Keywords
train_label = pd.Series(','.join(X['Label']).split(',')).value_counts().index
# for the top-10 topics
for label in train_label[:10]:
    # the 5 most frequent words in the first 100 posts of this topic
    words = sum(X[X['Label'].str.contains(label)]['word'].iloc[:100], [])
    words = [x for x in words if len(x.strip()) >= 2]
    print(label, Counter(words).most_common(5))
label_1191241 [('左航', 208), ('ZH', 96), ('TF', 96), ('家族', 96), ('超话', 79)]
label_1008181 [('左航', 208), ('ZH', 96), ('TF', 96), ('家族', 96), ('超话', 79)]
label_1281707 [('张极', 333), ('TF', 93), ('家族', 93), ('舞台', 68), ('微博', 59)]
label_1515062 [('张极', 333), ('TF', 93), ('家族', 93), ('舞台', 68), ('微博', 59)]
label_472394 [('张极', 333), ('TF', 93), ('家族', 93), ('舞台', 68), ('微博', 59)]
label_742793 [('浙江', 54), ('高考', 17), ('考生', 16), ('真的', 14), ('分数线', 14)]
label_1265038 [('怎么办', 53), ('我们', 47), ('施暴', 31), ('为什么', 25), ('自己', 21)]
label_1411524 [('会员', 131), ('成长', 130), ('PK', 130), ('厉害', 100), ('击败', 100)]
label_896157 [('浙江', 69), ('最后', 28), ('高考', 20), ('一下', 18), ('一年', 16)]
label_1064693 [('炸鸡', 37), ('韩国', 36), ('我们', 20), ('什么', 16), ('自己', 15)]
train_label = pd.Series(','.join(X['Label']).split(',')).value_counts().index
top_word_rule = {}
for label in train_label[:]:
    # the 5 most frequent words in the first 100 posts of this topic
    words = sum(X[X['Label'].str.contains(label)]['word'].iloc[:100], [])
    words = [x for x in words if len(x.strip()) >= 2]
    # keep only the first top word that is not already claimed by another topic
    top_words = [x[0] for x in Counter(words).most_common(5)]
    for word in top_words:
        if word not in top_word_rule:
            top_word_rule[word] = label
            break
X_valid_pred = []
# greedy matching: assign the topic of the first rule word found in the text
for text in X_valid['Text']:
    label = -1
    for word in top_word_rule.keys():
        if word in text:
            label = top_word_rule[word]
            break
    X_valid_pred.append(label)
from sklearn.metrics import accuracy_score
# accuracy of the keyword rules
accuracy_score(X_valid['Label'], X_valid_pred)
0.14406666666666668
Approach 2: TF-IDF + Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
- Bag-of-words (BOW) features
vector = CountVectorizer(max_features=2000).fit([' '.join(x) for x in X['word']])
X_data = vector.transform([' '.join(x) for x in X['word']])
X_valid_data = vector.transform([' '.join(x) for x in X_valid['word']])
model = MultinomialNB().fit(X_data, X['Label'])
# BOW + NB accuracy
model.score(X_valid_data, X_valid['Label'])
0.679
- TF-IDF features
vector = TfidfVectorizer(max_features=2000).fit([' '.join(x) for x in X['word']])
X_data = vector.transform([' '.join(x) for x in X['word']])
X_valid_data = vector.transform([' '.join(x) for x in X_valid['word']])
model = MultinomialNB().fit(X_data, X['Label'])
# TF-IDF + NB accuracy
model.score(X_valid_data, X_valid['Label'])
0.589
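LogisticRegression is imported above but never fitted; as a rough sketch (on a made-up toy corpus, not the competition data), it can be swapped in for MultinomialNB over the same TF-IDF features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy corpus standing in for the space-joined segmented Weibo text
texts = ['炸鸡 韩国 价格', '高考 浙江 考生', '炸鸡 涨价 韩国', '浙江 高考 分数线']
labels = ['food', 'exam', 'food', 'exam']

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(['韩国 炸鸡']))  # ['food']
```

Linear models over TF-IDF are a standard baseline and worth comparing against naive Bayes on the validation split.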
Approach 3: ERNIE Classification
import paddle
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer
# load the model (1399 distinct topic labels)
model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-mini-zh", num_classes=1399)
# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-mini-zh")
tokenizer.encode('你好,我是阿水。')
[2023-02-24 21:10:49,988] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.modeling.ErnieForSequenceClassification'> to load 'ernie-3.0-mini-zh'.
[2023-02-24 21:10:49,991] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-3.0-mini-zh/ernie_3.0_mini_zh.pdparams
W0224 21:10:49.996181 2852 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0224 21:10:50.000505 2852 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2023-02-24 21:10:52,243] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'ernie-3.0-mini-zh'.
[2023-02-24 21:10:52,247] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-3.0-mini-zh/ernie_3.0_mini_zh_vocab.txt
[2023-02-24 21:10:52,274] [ INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/ernie-3.0-mini-zh/tokenizer_config.json
[2023-02-24 21:10:52,277] [ INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/ernie-3.0-mini-zh/special_tokens_map.json
{'input_ids': [1, 226, 170, 4, 75, 10, 816, 101, 12043, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
from sklearn.preprocessing import MultiLabelBinarizer, LabelEncoder
# single-label encoding
label_encoder = LabelEncoder()
label_encoder.fit(train_data['Label'])
X_muti_label = label_encoder.transform(X['Label'])
X_valid_muti_label = label_encoder.transform(X_valid['Label'])
# multi-label encoding (alternative, commented out)
# label_encoder = MultiLabelBinarizer()
# label_encoder.fit(train_data['Label'].str.split(','))
# X_muti_label = label_encoder.transform(X['Label'].str.split(','))
# X_valid_muti_label = label_encoder.transform(X_valid['Label'].str.split(','))
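The commented-out multi-label path turns each label string into a multi-hot vector. A minimal illustration of what MultiLabelBinarizer produces, using two rows in the same comma-separated format as the Label column:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# rows in the same comma-separated format as the Label column
rows = ['label_878402', 'label_878402,label_1191241']

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform([r.split(',') for r in rows])

print(list(mlb.classes_))  # ['label_1191241', 'label_878402']
print(Y.tolist())          # [[0, 1], [1, 1]]
```

With this encoding, the training loss would also change from CrossEntropyLoss to a per-label sigmoid loss such as paddle.nn.BCEWithLogitsLoss.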
from paddle.io import Dataset, DataLoader

# custom dataset wrapping the raw text and the encoded labels
class MyDataset(Dataset):
    def __init__(self, data, label):
        self.data = data
        self.label = label

    def __getitem__(self, idx):
        return self.data[idx], self.label[idx]

    def __len__(self):
        return len(self.data)
train_loader = DataLoader(MyDataset(X['Text'].values[:], X_muti_label[:]), batch_size=50, shuffle=True)
valid_loader = DataLoader(MyDataset(X_valid['Text'].values[:], X_valid_muti_label[:]), batch_size=50)
optimizer = paddle.optimizer.AdamW(0.00005, parameters=model.parameters())
loss_fn = paddle.nn.loss.CrossEntropyLoss(reduction='mean')
for epoch in range(10):
    # training loop
    model.train()
    for batch_x, batch_y in tqdm(train_loader):
        batch_x = tokenizer(batch_x, max_length=70, padding=True)
        batch_x = {key: paddle.to_tensor(value) for key, value in batch_x.items()}
        pred = model(batch_x['input_ids'], batch_x['token_type_ids'])
        loss = loss_fn(pred, paddle.to_tensor(batch_y, dtype="int32"))
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()

    # validation loop
    model.eval()
    val_loss = []
    with paddle.no_grad():
        for batch_x, batch_y in tqdm(valid_loader):
            batch_x = tokenizer(batch_x, max_length=70, padding=True)
            batch_x = {key: paddle.to_tensor(value) for key, value in batch_x.items()}
            batch_y = paddle.to_tensor(batch_y, dtype="int32")
            pred = model(batch_x['input_ids'], batch_x['token_type_ids'])
            loss = loss_fn(pred, batch_y)
            val_loss.append(loss.item())

    # note: the accuracy below is computed on the last validation batch only
    print('Epoch {0}, Val loss {1:3f}, Val Accuracy {2:3f}'.format(
        epoch,
        np.mean(val_loss),
        (pred.argmax(1) == batch_y).astype('float').mean().item()
    ))
Epoch 0, Val loss 2.689926, Val Accuracy 0.500000
Epoch 1, Val loss 1.945902, Val Accuracy 0.660000
Epoch 2, Val loss 1.601855, Val Accuracy 0.660000
Epoch 3, Val loss 1.377380, Val Accuracy 0.660000
Epoch 4, Val loss 1.250767, Val Accuracy 0.800000
Epoch 5, Val loss 1.161806, Val Accuracy 0.720000
Epoch 6, Val loss 1.091673, Val Accuracy 0.800000
Epoch 7, Val loss 1.039342, Val Accuracy 0.840000
Epoch 8, Val loss 1.012694, Val Accuracy 0.860000
Summary and Outlook
In this project we tackled Weibo topic recognition. We analyzed text lengths and topic distributions and uncovered some regularities in the data, then tried several approaches spanning classical machine learning and deep learning. Among all the models, the ERNIE model performed best and is well worth studying.
For the ERNIE model we built the dataset, the model, and the training loop, which is the standard pipeline for a text classification task. Possible follow-up improvements:
- use a larger ERNIE variant (this project uses the mini version)
- switch from single-label to multi-label classification
- try cross-validation, hyperparameter tuning, and adversarial training
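For the multi-label direction, per-label sigmoid probabilities would need to be thresholded back into comma-separated label strings. A minimal sketch with hypothetical probabilities and label names; the 0.5 cutoff is an assumption that should be tuned on the validation split:

```python
import numpy as np

# hypothetical sigmoid outputs for 2 samples over 3 labels
probs = np.array([[0.9, 0.1, 0.7],
                  [0.2, 0.8, 0.3]])
classes = np.array(['label_878402', 'label_1191241', 'label_1515062'])

THRESHOLD = 0.5  # assumption: tune this cutoff on the validation set
preds = [','.join(classes[row > THRESHOLD]) for row in probs]
print(preds)  # ['label_878402,label_1515062', 'label_1191241']
```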