[AI Creation Camp] "Wangyiyun" (Sad NetEase Cloud) Player Level Identifier

飞桨PaddlePaddle

Published on 2021-10-28 15:22:57

Original link: https://blog.csdn.net/rehersjnhrtsj/article/details/115056803

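Wangyiyun Player Identification Handbook:

[AI Creation Camp] is a practice competition for newcomers in which PaddlePaddle invites developers to build creative AI projects on top of PaddleHub. For the competition we built a NetEase Cloud Music player identifier based on PaddleHub.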

#Demo:

#Level identification results for a sample of comments:

##Bilibili walkthrough:
https://www.bilibili.com/video/BV1Up4y1b7kT

##CSDN technical write-up:
https://blog.csdn.net/rehersjnhrtsj/article/details/115056803#comments_15702282

##Open-source code on GitHub:
https://github.com/chestnutly/-AI-

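#Project introduction:

Late at night is when people tend to open NetEase Cloud Music. Using a dataset of popular NetEase Cloud Music comments together with sentiment analysis, we built a Wangyiyun player identifier that rates what level of Wangyiyun player you are from the words and reflections you type in. It is simple to use.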

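#How to use:

1. In the file tree on the left, find test.txt, open it, and write a couple of lines about whatever you feel or think (the file already contains some examples; reading them will likely stir something).

2. When you are done writing, click Run > Run All in the upper-right corner to execute all the code, then scroll to the very bottom to see the results.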

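#Approach:

The most fun thing to do late at night is to open NetEase Cloud Music and read the comments. Many good comments resonate with readers, and many of them carry a heavy emotional tone, so we can use sentiment analysis from NLP to grade NetEase Cloud Music comments: each comment's sentiment is classified into two classes, and the closer it is to the sad class, the higher the level.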

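#Highlights:

Based on a training set of high-quality NetEase Cloud Music comments, we use an LSTM neural network for sentiment classification and a newly devised mathematical mapping that converts class probabilities into a Wangyiyun level.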

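Environment

PaddlePaddle framework: the AI Studio platform already ships with version 2.0, the latest at the time of writing.
PaddleNLP: deeply compatible with framework 2.0, it is the best practice of PaddlePaddle 2.0 in the NLP domain.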

# Install PaddleNLP (the version specifier is quoted so the shell does not treat ">" as a redirect)
!pip install --upgrade "paddlenlp>=2.0.0b0" -i https://pypi.org/simple

import paddle
import paddlenlp

# Import the required libraries
import numpy as np
from functools import partial

import paddle.nn as nn
import paddle.nn.functional as F
import paddlenlp as ppnlp
from paddlenlp.data import Pad, Stack, Tuple
from paddlenlp.datasets import MapDatasetWrapper

from utils import load_vocab, convert_example

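Dataset and data processing

Custom dataset

A map-style dataset must inherit from paddle.io.Dataset and implement:

__getitem__: returns the sample at a given index; paddle.io.DataLoader uses it to fetch samples by index.
__len__: returns the number of samples in the dataset; paddle.io.BatchSampler needs this count to generate the index sequence.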

class SelfDefinedDataset(paddle.io.Dataset):
    def __init__(self, data):
        super(SelfDefinedDataset, self).__init__()
        self.data = data

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)
        
    def get_labels(self):
        return ["0", "1"]

def txt_to_list(file_name):
    # Each line holds one sample: text and label separated by a tab.
    res_list = []
    with open(file_name, encoding='utf-8') as f:
        for line in f:
            res_list.append(line.strip().split('\t'))
    return res_list

trainlst = txt_to_list('train.txt')
devlst = txt_to_list('dev.txt')
testlst = txt_to_list('test.txt')

train_ds, dev_ds, test_ds = SelfDefinedDataset.get_datasets([trainlst, devlst, testlst])

Take a look at the data

label_list = train_ds.get_labels()
print(label_list)

for i in range(10):
    print(train_ds[i])

['0', '1']
['你眼里有春有秋 ,胜过我见过爱过的山川河流', '1']
['你曾是我的他却不再是我的他，谢谢你赠与我空欢喜', '0']
['我们都是，苦尽甘来的人，但愿殊途同归，你能与我讲讲来时的路', '0']
['在我的世界里，你的出现让我明白了什么是陪伴。我还记得你说过的这句话“从前车马很慢，一生只够爱一人。”我想做那个人，和你一起携手走完这一生。我知道以后会经历各种困难，但我想跟你并肩同行，和你有着耳鬢厮磨的爱情', '1']
['春意渐浓，想你、念你、陪你、爱你。', '1']
['我想陪你走过春夏秋冬  陪你感受爱恨情长', '1']
['我想陪着你，从执拗到素淡，从青丝到白发，从一场秋到另一场秋，从不谙世事到步履阑珊，我想陪着你，在有限的生命里', '1']
['你曾是我的他却不再是我的他，谢谢你赠与我空欢喜', '0']
['有时关不上冰箱的门， 脚趾撞到了桌腿， 临出门找不到想要的东西， 突然忍不住掉泪， 你觉得小题大作， 只有我自己知道为什么；', '0']
['人总是贪婪的，就像最开始我只想知道你的名字', '0']

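Data processing

To turn the raw data into a format the model can read, this project processes the data as follows:

First, segment the text into words with jieba, then map each segmented word to its word id in the vocabulary.

Use the paddle.io.DataLoader interface to load data asynchronously with multiple workers.

This uses the data-processing APIs in PaddleNLP. PaddleNLP provides many commonly used APIs for building efficient data pipelines for NLP tasks.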

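API	Description
`paddlenlp.data.Stack`	Stacks N inputs with the same shape to build a batch. The inputs must all have the same shape; the output is the batch formed by stacking them.
`paddlenlp.data.Pad`	Stacks N inputs to build a batch, padding each input to the maximum length among the N inputs.
`paddlenlp.data.Tuple`	Wraps several batching functions together.

For more data-processing operations, see: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/data.md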
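
To make the three helpers in the table concrete, here is a minimal plain-Python sketch of the behavior described there. The functions `stack`, `pad`, and `tuple_batchify` are hypothetical stand-ins written for illustration; they are not the PaddleNLP implementations, which return arrays rather than lists.

```python
# Illustrative stand-ins for the Stack / Pad / Tuple behavior described above.
# These are NOT the PaddleNLP classes; they only mimic the idea with lists.

def stack(items):
    """Collect same-shaped items (here: scalars) into one batch."""
    return list(items)

def pad(sequences, pad_val=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]

def tuple_batchify(*fns):
    """Apply the i-th function to the i-th field of every sample."""
    def batchify(samples):
        fields = list(zip(*samples))  # transpose samples into per-field columns
        return [fn(field) for fn, field in zip(fns, fields)]
    return batchify

# Each toy sample is (input_ids, seq_len, label).
samples = [([5, 8, 2], 3, 1), ([7, 4], 2, 0)]
batchify = tuple_batchify(pad, stack, stack)
input_ids, seq_lens, labels = batchify(samples)

print(input_ids)  # [[5, 8, 2], [7, 4, 0]]
print(seq_lens)   # [3, 2]
print(labels)     # [1, 0]
```

The shorter sequence is filled with the pad value so that all rows have equal length, while the scalar fields are simply collected.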

# Download the vocabulary file senta_word_dict.txt, used to build the word-to-id mapping.
!wget https://paddlenlp.bj.bcebos.com/data/senta_word_dict.txt

# Load the vocabulary
vocab = load_vocab('./senta_word_dict.txt')

for k, v in vocab.items():
    print(k, v)
    break

--2021-03-29 15:34:41--  https://paddlenlp.bj.bcebos.com/data/senta_word_dict.txt
Resolving paddlenlp.bj.bcebos.com (paddlenlp.bj.bcebos.com)... 182.61.200.229, 182.61.200.195, 2409:8c00:6c21:10ad:0:ff:b00e:67d
Connecting to paddlenlp.bj.bcebos.com (paddlenlp.bj.bcebos.com)|182.61.200.229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14600150 (14M) [text/plain]
Saving to: ‘senta_word_dict.txt.1’

senta_word_dict.txt 100%[===================>]  13.92M  33.1MB/s    in 0.4s    

2021-03-29 15:34:41 (33.1 MB/s) - ‘senta_word_dict.txt.1’ saved [14600150/14600150]

[PAD] 0
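
With a vocabulary like this loaded, turning a segmented sentence into ids is a dictionary lookup with a fallback for unknown words. The sketch below assumes the sentence has already been segmented (the project uses jieba for that step) and uses a tiny made-up vocabulary; `tokens_to_ids` and `toy_vocab` are hypothetical names, and the `convert_example` helper imported from `utils` may differ in its details.

```python
def tokens_to_ids(tokens, vocab, unk_token_id=1):
    """Map each token to its vocabulary id, falling back to the unknown-word id."""
    return [vocab.get(token, unk_token_id) for token in tokens]

# Toy vocabulary for illustration only; the real one comes from senta_word_dict.txt.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "我": 2, "想": 3, "你": 4}

# Pretend this is the output of word segmentation.
tokens = ["我", "想", "你", "了"]
ids = tokens_to_ids(tokens, toy_vocab, unk_token_id=toy_vocab["[UNK]"])

print(ids)       # [2, 3, 4, 1]  ("了" is not in the toy vocabulary)
print(len(ids))  # 4, the sequence length before any padding
```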

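Building the DataLoader

The create_dataloader function below creates the DataLoader objects needed for training and prediction.

paddle.io.DataLoader returns an iterator that yields data from the dataset in the order given by batch_sampler, loading the data asynchronously.
batch_sampler: DataLoader uses the mini-batch index lists produced by batch_sampler to index samples in the dataset and assemble them into mini-batches.
collate_fn: specifies how to combine a list of samples into a mini-batch. It must be a callable that implements the batch-assembly logic and returns the data for each batch. Here we pass batchify_fn, which pads the data and also returns the actual lengths.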
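
The interplay between the index lists and collate_fn can be sketched in plain Python. This is a simplified, synchronous illustration under our own assumptions, not how paddle.io.DataLoader is implemented; `batch_indices`, `iterate_batches`, and `collate` are hypothetical names.

```python
import random

def batch_indices(num_samples, batch_size, shuffle=False, seed=None):
    """Yield one list of sample indices per mini-batch."""
    indices = list(range(num_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]

def iterate_batches(dataset, batch_size, collate_fn, shuffle=False, seed=None):
    """Fetch samples by index (via __getitem__), then collate them into a batch."""
    for idx_list in batch_indices(len(dataset), batch_size, shuffle, seed):
        samples = [dataset[i] for i in idx_list]
        yield collate_fn(samples)

def collate(samples):
    """Toy collate function: split (text, label) pairs into two lists."""
    texts, labels = zip(*samples)
    return list(texts), list(labels)

toy_dataset = [("a", 1), ("b", 0), ("c", 1), ("d", 0), ("e", 1)]
batches = list(iterate_batches(toy_dataset, batch_size=2, collate_fn=collate))

print(len(batches))  # 3
print(batches[0])    # (['a', 'b'], [1, 0])
print(batches[-1])   # (['e'], [1])
```

The dataset only needs `__len__` and `__getitem__` for this loop to work, which is why a map-style dataset has to define both.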

# Reads data and generates mini-batches.
def create_dataloader(dataset,
                      trans_function=None,
                      mode='train',
                      batch_size=1,
                      pad_token_id=0,
                      batchify_fn=None):
    if trans_function:
        dataset = dataset.apply(trans_function, lazy=True)

    # return_list: whether to return the data as a list
    # collate_fn: how to combine a list of samples into a mini-batch; it must be a callable that returns each batch's data. Here it is `batchify_fn`, which pads the inputs and also returns the actual lengths.
    dataloader = paddle.io.DataLoader(
        dataset,
        return_list=True,
        batch_size=batch_size,
        collate_fn=batchify_fn)
        
    return dataloader

# functools.partial fixes some of a function's arguments (i.e. sets default values) and returns a new function that is simpler to call.
trans_function = partial(
    convert_example,
    vocab=vocab,
    unk_token_id=vocab.get('[UNK]', 1),
    is_test=False)

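# Batch the loaded data so that the model can process it batch by batch.
# Every sentence in a batch is padded to batch_max_seq_len, the length of the longest text in that batch.
# Texts shorter than batch_max_seq_len are padded up to it; since it is the batch maximum, Pad itself does not truncate anything.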
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=vocab['[PAD]']),  # input_ids
    Stack(dtype="int64"),  # seq len
    Stack(dtype="int64")  # label
): [data for data in fn(samples)]


train_loader = create_dataloader(
    train_ds,
    trans_function=trans_function,
    batch_size=128,
    mode='train',
    batchify_fn=batchify_fn)
dev_loader = create_dataloader(
    dev_ds,
    trans_function=trans_function,
    batch_size=128,
    mode='validation',
    batchify_fn=batchify_fn)
test_loader = create_dataloader(
    test_ds,
    trans_function=trans_function,
    batch_size=128,
    mode='test',
    batchify_fn=batchify_fn)

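Building the model

Use LSTMEncoder to build a BiLSTM model for sentence modeling, producing a vector representation of each sentence.

Then attach a linear layer to perform binary classification.

paddle.nn.Embedding builds the word-embedding layer
ppnlp.seq2vec.LSTMEncoder builds the sentence-modeling layer
paddle.nn.Linear builds the binary classifier

Figure 1: seq2vec diagram

Besides LSTM, seq2vec provides many other semantic representation methods; for details see the seq2vec introduction.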

# Build the LSTM model
class LSTMModel(nn.Layer):
    def __init__(self,
                 vocab_size,
                 num_classes,
                 emb_dim=128,
                 padding_idx=0,
                 lstm_hidden_size=198,
                 direction='forward',
                 lstm_layers=1,
                 dropout_rate=0,
                 pooling_type=None,
                 fc_hidden_size=96):
        super().__init__()

        # First look up the input word ids and map them to word embeddings
        self.embedder = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=emb_dim,
            padding_idx=padding_idx)

        # Pass the word embeddings through LSTMEncoder to project them into the text semantic space
        self.lstm_encoder = ppnlp.seq2vec.LSTMEncoder(
            emb_dim,
            lstm_hidden_size,
            num_layers=lstm_layers,
            direction=direction,
            dropout=dropout_rate,
            pooling_type=pooling_type)

        # LSTMEncoder.get_output_dim() returns the hidden size of the text representation produced by the encoder
        self.fc = nn.Linear(self.lstm_encoder.get_output_dim(), fc_hidden_size)

        # The final classifier
        self.output_layer = nn.Linear(fc_hidden_size, num_classes)

    def forward(self, text, seq_len):
        # text shape: (batch_size, num_tokens)
        # print('input :', text.shape)
        
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)
        # print('after word-embeding:', embedded_text.shape)

        # Shape: (batch_size, num_directions*lstm_hidden_size)
        # num_directions = 2 if direction == 'bidirectional' else 1
        text_repr = self.lstm_encoder(embedded_text, sequence_length=seq_len)
        # print('after lstm:', text_repr.shape)


        # Shape: (batch_size, fc_hidden_size)
        fc_out = paddle.tanh(self.fc(text_repr))
        # print('after Linear classifier:', fc_out.shape)

        # Shape: (batch_size, num_classes)
        logits = self.output_layer(fc_out)
        # print('output:', logits.shape)
        
        # probs: class probabilities
        probs = F.softmax(logits, axis=-1)
        # print('output probability:', probs.shape)
        return probs

model = LSTMModel(
        len(vocab),
        len(label_list),
        direction='bidirectional',
        padding_idx=vocab['[PAD]'])
model = paddle.Model(model)
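
As a quick sanity check of the layer sizes, the sketch below traces tensor shapes through the model instantiated above using plain arithmetic, without running Paddle. It assumes the encoder returns one vector per sentence whose size is `num_directions * lstm_hidden_size`; the function name `lstm_model_shapes` and the example value `num_tokens=40` are made up for illustration.

```python
def lstm_model_shapes(batch_size, num_tokens, emb_dim=128, lstm_hidden_size=198,
                      direction="bidirectional", fc_hidden_size=96, num_classes=2):
    """Return the expected shape after each stage of the forward pass."""
    num_directions = 2 if direction == "bidirectional" else 1
    return {
        "text": (batch_size, num_tokens),                    # word ids
        "embedded_text": (batch_size, num_tokens, emb_dim),  # after nn.Embedding
        "text_repr": (batch_size, num_directions * lstm_hidden_size),  # after the encoder
        "fc_out": (batch_size, fc_hidden_size),              # after the tanh-activated linear layer
        "probs": (batch_size, num_classes),                  # after the output layer and softmax
    }

# batch_size=128 matches the loaders above; num_tokens=40 is an arbitrary example.
shapes = lstm_model_shapes(batch_size=128, num_tokens=40)
for name, shape in shapes.items():
    print(name, shape)
```

With the bidirectional setting the sentence vector has 2 * 198 = 396 features, which is the input size of the first linear layer.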

Model configuration

optimizer = paddle.optimizer.Adam(
        parameters=model.parameters(), learning_rate=5e-5)

loss = paddle.nn.CrossEntropyLoss()
metric = paddle.metric.Accuracy()

model.prepare(optimizer, loss, metric)

# Set the VisualDL log directory
log_dir = './visualdl'
callback = paddle.callbacks.VisualDL(log_dir=log_dir)

model.fit(train_loader, dev_loader, epochs=10, save_dir='./checkpoints', save_freq=5, callbacks=callback)

The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/10


Building prefix dict from the default dictionary ...
2021-03-29 15:34:47,109 - DEBUG - Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
2021-03-29 15:34:47,844 - DEBUG - Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.788 seconds.
2021-03-29 15:34:47,899 - DEBUG - Loading model cost 0.788 seconds.
Prefix dict has been built successfully.
2021-03-29 15:34:47,900 - DEBUG - Prefix dict has been built successfully.
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  return (isinstance(seq, collections.Sequence) and


step 4/4 - loss: 0.6870 - acc: 0.5533 - 301ms/step
save checkpoint at /home/aistudio/checkpoints/0
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 1/1 - loss: 0.6966 - acc: 0.4444 - 27ms/step
Eval samples: 81
Epoch 2/10
step 4/4 - loss: 0.6852 - acc: 0.5533 - 43ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 1/1 - loss: 0.6977 - acc: 0.4444 - 23ms/step
Eval samples: 81
Epoch 3/10
step 4/4 - loss: 0.6834 - acc: 0.5533 - 42ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 1/1 - loss: 0.6988 - acc: 0.4444 - 22ms/step
Eval samples: 81
Epoch 4/10
step 4/4 - loss: 0.6817 - acc: 0.5533 - 43ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 1/1 - loss: 0.7000 - acc: 0.4444 - 22ms/step
Eval samples: 81
Epoch 5/10
step 4/4 - loss: 0.6800 - acc: 0.5533 - 42ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 1/1 - loss: 0.7012 - acc: 0.4444 - 22ms/step
Eval samples: 81
Epoch 6/10
step 4/4 - loss: 0.6785 - acc: 0.5533 - 43ms/step
save checkpoint at /home/aistudio/checkpoints/5
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 1/1 - loss: 0.7024 - acc: 0.4444 - 24ms/step
Eval samples: 81
Epoch 7/10
step 4/4 - loss: 0.6769 - acc: 0.5533 - 44ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 1/1 - loss: 0.7036 - acc: 0.4444 - 21ms/step
Eval samples: 81
Epoch 8/10
step 4/4 - loss: 0.6755 - acc: 0.5533 - 43ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 1/1 - loss: 0.7048 - acc: 0.4444 - 21ms/step
Eval samples: 81
Epoch 9/10
step 4/4 - loss: 0.6740 - acc: 0.5533 - 43ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 1/1 - loss: 0.7060 - acc: 0.4444 - 21ms/step
Eval samples: 81
Epoch 10/10
step 4/4 - loss: 0.6726 - acc: 0.5533 - 42ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 1/1 - loss: 0.7072 - acc: 0.4444 - 20ms/step
Eval samples: 81
save checkpoint at /home/aistudio/checkpoints/final

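Launch VisualDL to view the visualized training process

Steps:

1. Switch to the "Visualization" tab on the left side of this page
2. Select 'visualdl' as the log file path
3. Click "Launch VisualDL", then click "Open VisualDL" to view the visualization:
The real-time trends of Accuracy and Loss are as follows: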

results = model.evaluate(dev_loader)
print("Finally test acc: %.5f" % results['acc'])

Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 1/1 - loss: 0.7072 - acc: 0.4444 - 24ms/step
Eval samples: 81
Finally test acc: 0.44444
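
The loss values in the log above stay close to 0.69, which is ln 2, the cross-entropy of a 50/50 guess between two classes. One possible contributing factor, stated here as an assumption rather than a diagnosis: the model's forward pass already returns softmax probabilities, and `paddle.nn.CrossEntropyLoss`, as we understand its default behavior, applies softmax to its input again. The plain-Python sketch below shows what that double softmax does to the achievable loss; `softmax` and `cross_entropy_with_softmax` are illustrative helpers, not Paddle functions.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_with_softmax(inputs, label):
    """Cross-entropy that first applies softmax to its inputs."""
    return -math.log(softmax(inputs)[label])

# A 50/50 probability vector gives ln(2).
chance_loss = cross_entropy_with_softmax([0.5, 0.5], label=0)

# Even a fully confident and correct probability vector [1, 0] is squashed
# by the second softmax, so the loss cannot approach zero.
best_loss = cross_entropy_with_softmax([1.0, 0.0], label=0)

print(round(chance_loss, 4))  # 0.6931
print(round(best_loss, 4))    # 0.3133
```

If this assumption holds, returning logits during training and applying softmax only at prediction time would be one thing to try.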

Prediction

import random
label_map = {0: '悲伤', 1: '积极'}  # 0: sad, 1: positive
results = model.predict(test_loader, batch_size=128)[0]
predictions = []

for batch_probs in results:
    # Map each predicted class index to its label
    idx = np.argmax(batch_probs, axis=-1)
    idx = idx.tolist()
    labels = [label_map[i] for i in idx]
    predictions.extend(labels)

# Show the classification results for the first 10 prediction samples
#for idx, data in enumerate(test_ds.data[:10]):
#    print('Data: {} \t Label: {}'.format(data[0], predictions[idx]))

# Convert the class probabilities of the first batch (the only one here) into a level
for i in range(len(results[0])):
    weight = random.uniform(0.1, 0.4)  # random offset in [0.1, 0.4]
    sad_level = results[0][i][0]       # probability of class 0 (sad)
    happy_level = results[0][i][1]     # probability of class 1 (positive)
    people_level = int(-((-sad_level + happy_level) / 2 - 0.5 + weight) * 10)
    print('Data: {} \t 网抑云等级: {}'.format(test_ds.data[i][0], people_level))
    

Predict begin...
step 1/1 [==============================] - 14ms/step
Predict samples: 24
Data: 我跟你说星星很好看，你跟别人说星星很好看，我以后都不会再跟你说星星很好看了。别人跟你说星星很好看，你跟我说星星很好看，星星就没那么好看了 	 网抑云等级: 2
Data: 我没有取悦你的天分 但我比谁都认真。 	 网抑云等级: 4
Data: 我爱你可能只需要天份，你爱我却需要天意 	 网抑云等级: 4
Data: 我们都在爱情里少一点天份 所以才跌跌撞撞满身伤痕 	 网抑云等级: 3
Data: 喜欢是每个人都有的天分，而爱你是我独有的天份 	 网抑云等级: 4
Data: “ 你是我的满目山河，也是我的爱而不得。” 	 网抑云等级: 3
Data: 事实上，机场比婚礼现场见证了更多真挚的接吻，医院的墙壁比教堂聆听了更多的祷告 	 网抑云等级: 4
Data: 因为慢半拍，一开始动心的人是你，结果越陷越深的是我。 	 网抑云等级: 3
Data: 所以，如果你终究要离开我，请不要来爱我。 	 网抑云等级: 4
Data: 心之所向是你 人间理想是你，星途耀眼是你 未来可期也是你 	 网抑云等级: 4
Data: 未来你的歌声依旧会伴我入眠，未来你行走的路上也会星星闪烁 	 网抑云等级: 2
Data: 太阳和月球难得一次亲密接触，很快便又分开。而且这亲密也只是我们三点一线的目光所限，实际上它们之间仍然十万八千里。旁人眼里的亲密无间代表不了什么。 	 网抑云等级: 3
Data: “其他人都已经不爱了, 而在爱情里慢半拍的人, 却才刚爱上.” 	 网抑云等级: 4
Data: 我很自负，总以为我们合拍，直到你的失望溢满，才知道，是我和你的心跳慢半拍，想挽回，发现联系已删，为时已晚。 	 网抑云等级: 3
Data: “我总是慢半拍,追不上你的节奏,跟不上你的步伐.” 	 网抑云等级: 3
Data: 你不要嫌弃我衣上烟味，寂寞时谁也不会皱着眉。 	 网抑云等级: 3
Data: 用最真诚的心，谱最纯粹的曲，唱最动人的情， 	 网抑云等级: 4
Data: 在感情快速消费的时代，所谓的暧昧，换不来真心的情感。 	 网抑云等级: 2
Data: 反正现在的感情都暧昧，付出过的人排队谈体会，感情像牛奶一杯 越甜越让人生畏，弃之可惜 食而无味” 	 网抑云等级: 3
Data: 我猜暧昧的意思是，在阳光温和的日子里，爱未曾来。 	 网抑云等级: 4
Data: “暧昧是什么？”“所有人都以为你们在一起了，只有你清楚的知道你们的距离。 	 网抑云等级: 3
Data: 开始慢慢懂爱情，却开始害怕听情歌。 	 网抑云等级: 2
Data: 男生就像洋葱，因为你想探知，于是就剥，边剥边哭，后来发现没心。 可其实，你一直剥的，就是他的心。他一开始把心给你了，你却不相信。 	 网抑云等级: 4
Data: 后来，我再也没有遇到对我那么好的人。 	 网抑云等级: 3
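
The level formula in the prediction loop can be simplified by hand. Because the two class probabilities sum to 1, `(-sad + happy) / 2` equals `0.5 - sad`, so the whole expression reduces to `int(10 * (sad - weight))`. The sketch below checks this on a few values chosen to be exact in binary floating point; the function names are ours, and for arbitrary inputs the two forms can differ by rounding right at an integer boundary.

```python
def level_original(sad_prob, happy_prob, weight):
    """Same arithmetic as the prediction loop, written as a function."""
    return int(-((-sad_prob + happy_prob) / 2 - 0.5 + weight) * 10)

def level_simplified(sad_prob, weight):
    """Equivalent form when sad_prob + happy_prob == 1."""
    return int(10 * (sad_prob - weight))

# (sad_prob, weight) pairs that are exactly representable as floats.
cases = [(0.5, 0.25), (0.75, 0.25), (0.5, 0.125)]
for sad, weight in cases:
    a = level_original(sad, 1.0 - sad, weight)
    b = level_simplified(sad, weight)
    print(sad, weight, a, b)
# 0.5 0.25 2 2
# 0.75 0.25 5 5
# 0.5 0.125 3 3
```

Two consequences follow from the simplified form: a higher sad probability gives a higher level, and since `weight` is drawn at random on every run, the same comment can receive a different level from one run to the next.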

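Summary and directions for improvement:
The current model still has many shortcomings. High-quality training data is scarce, so the sentiment classification is not yet very discriminative, and an offset has to be added to get reasonable level results. Later we will add more training data and try a pretrained BERT model for sentiment analysis.
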
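This model is adapted from the project "NLP Classic Project Collection 03: Choosing a New Year's Eve Dinner with Sentiment Analysis" (『NLP经典项目集』03：利用情感分析选择年夜饭), and part of the training set also comes from that project's New Year's Eve dinner dataset.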