Sentiment Classification with an RNN in MindSpore | Part 1: Preparing the IMDB Movie Review Dataset

柏常青

Last modified: 2024-08-09 16:13:37


First published: 2024-07-21 16:11:42

Article link: https://blog.csdn.net/beauthy/article/details/140588495



Tip: install the required libraries first.

pip install -i https://pypi.mirrors.ustc.edu.cn/simple mindspore==2.2.14
pip install tqdm requests

1. The IMDB Movie Review Dataset

Reviews and their labels:

Review	Label
“Quitting” may be as much about exiting a pre-ordained identity as about drug withdrawal. As a rural guy coming to Beijing, class and success must have struck this young artist face on as an appeal to separate from his roots and far surpass his peasant parents’ acting success. Troubles arise, however, when the new man is too new, when it demands too big a departure from family, history, nature, and personal identity. The ensuing splits, and confusion between the imaginary and the real and the dissonance between the ordinary and the heroic are the stuff of a gut check on the one hand or a complete escape from self on the other.	Negative
This movie is amazing because the fact that the real people portray themselves and their real life experience and do such a good job it’s like they’re almost living the past over again. Jia Hongsheng plays himself an actor who quit everything except music and drugs struggling with depression and searching for the meaning of life while being angry at everyone especially the people who care for him most.	Positive

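Sentiment classification is a classic task in natural language processing and a typical classification problem. This project uses the MindSpore framework to implement an RNN-based sentiment classification model that behaves as follows: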

Input: This film is terrible
True label: Negative
Predicted label: Negative

Input: This film is great
True label: Positive
Predicted label: Positive

1.1 Downloading the Data

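First, we design a download module that shows download progress and saves the file to a specified path.
The module uses the requests library to issue HTTP requests and the tqdm library to display download progress. For safety, the data is first written to a temporary file through an IO object and then copied to the target path, which the function returns.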

import os
import shutil
import requests
import tempfile
from tqdm import tqdm
from typing import IO
from pathlib import Path

# Save downloads under `<home>/.mindspore_examples`
cache_dir = Path.home() / '.mindspore_examples'

def http_get(url: str, temp_file: IO):
    """Download data with requests and show progress with tqdm."""
    req = requests.get(url, stream=True)
    req.raise_for_status()  # fail early on HTTP errors instead of caching an error page
    content_length = req.headers.get('Content-Length')
    total = int(content_length) if content_length is not None else None
    progress = tqdm(unit='B', total=total)
    for chunk in req.iter_content(chunk_size=1024):
        if chunk:
            progress.update(len(chunk))
            temp_file.write(chunk)
    progress.close()

def download(file_name: str, url: str):
    """Download the data and save it under the given file name."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, file_name)
    # Skip the download if the file is already cached
    if not os.path.exists(cache_path):
        with tempfile.NamedTemporaryFile() as temp_file:
            http_get(url, temp_file)
            temp_file.flush()
            temp_file.seek(0)
            with open(cache_path, 'wb') as cache_file:
                shutil.copyfileobj(temp_file, cache_file)
    return cache_path

Download and save the dataset:

imdb_path = download('aclImdb_v1.tar.gz', 'https://mindspore-website.obs.myhuaweicloud.com/notebook/datasets/aclImdb_v1.tar.gz')
imdb_path

[Figure: directory layout of the extracted original IMDB dataset]

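The dataset is already split into train and test parts, and each part contains a neg folder and a pos folder, one per class. The train and test parts therefore have to be read separately, processing both the review text and the labels.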
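To make this layout concrete, the sketch below builds a tiny in-memory tar archive that follows the same `aclImdb/<split>/<label>/` convention and counts the review files per split and label. The file names and contents in `TOY_FILES` and the helper names are made up for illustration; only the directory pattern comes from the dataset described above.

```python
import io
import re
import tarfile

# Hypothetical toy files that mimic the aclImdb directory convention.
TOY_FILES = {
    "aclImdb/imdb.vocab": "the\nand\n",
    "aclImdb/train/pos/0_9.txt": "A great film.",
    "aclImdb/train/neg/0_3.txt": "A terrible film.",
    "aclImdb/train/neg/1_1.txt": "Awful.",
    "aclImdb/test/pos/0_10.txt": "Loved it.",
}

def make_toy_archive(files):
    """Pack {name: text} into an in-memory .tar.gz and return the buffer."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf

def count_reviews(fileobj):
    """Count review files per (split, label) by matching member names."""
    pattern = re.compile(r"aclImdb/(train|test)/(pos|neg)/.*\.txt$")
    counts = {}
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            match = pattern.match(member.name)
            if match:
                key = (match.group(1), match.group(2))
                counts[key] = counts.get(key, 0) + 1
    return counts

counts = count_reviews(make_toy_archive(TOY_FILES))
print(counts)
```

Files outside the four split/label folders, such as imdb.vocab, are ignored because their names do not match the pattern.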

1.2 IMDB Dataset Loader

import re
import string
import tarfile

class IMDBData():
    """IMDB dataset loader.

    Loads the IMDB dataset from the tar archive and exposes it
    as an indexable Python object.
    """
    label_map = {
        "pos": 1,
        "neg": 0
    }
    def __init__(self, path, mode="train"):
        self.mode = mode
        self.path = path
        self.docs, self.labels = [], []  # review-label

        self._load("pos")
        self._load("neg")

    def _load(self, label):
        pattern = re.compile(r"aclImdb/{}/{}/.*\.txt$".format(self.mode, label))
        # Load the data into memory; the tarfile module reads and writes tar archives
        with tarfile.open(self.path) as tarf:
            for tf in tarf:
                if pattern.match(tf.name):
                    # Strip trailing newlines, remove punctuation and lowercase the text
                    raw = tarf.extractfile(tf).read()
                    cleaned = (raw.rstrip(b"\n\r")
                                  .translate(None, string.punctuation.encode())
                                  .lower())
                    # Decode bytes to str before splitting; calling str() on bytes
                    # would yield the repr "b'...'" and corrupt the first and last tokens
                    self.docs.append(cleaned.decode("utf-8", errors="ignore").split())
                    self.labels.append([self.label_map[label]])

    def __getitem__(self, idx):
        return self.docs[idx], self.labels[idx]

    def __len__(self):
        return len(self.docs)

With the IMDB loader in place, load the training set as a test and print the number of samples:

imdb_train = IMDBData(imdb_path, 'train')
len(imdb_train)

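Once the IMDB dataset has been loaded into memory and wrapped as an iterable object, the GeneratorDataset interface provided by mindspore.dataset can load it for the next processing steps. The function below loads train and test with GeneratorDataset and sets the column_names of the text and the label to text and label respectively: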

import mindspore.dataset as ds

def load_imdb(imdb_path):
    imdb_train = ds.GeneratorDataset(IMDBData(imdb_path, "train"), column_names=["text", "label"], shuffle=True, num_samples=10000)  # load the training set as a generator dataset
    imdb_test = ds.GeneratorDataset(IMDBData(imdb_path, "test"), column_names=["text", "label"], shuffle=False)  # load the test set
    return imdb_train, imdb_test


imdb_train, imdb_test = load_imdb(imdb_path)
# Printing imdb_train gives: <mindspore.dataset.engine.datasets_user_defined.GeneratorDataset at 0xfffece2a5be0>

1.3 Loading Pretrained Word Vectors

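Pretrained word vectors are numerical representations of input words.
Take the sentence “I like the movie!”
How can it be turned into numbers?
First we need a dictionary of fixed size that covers the words appearing in the dataset, ordered by how often each word occurs, from most to least frequent. For the IMDB dataset the dictionary is the imdb.vocab file, which contains 89527 entries.
In imdb.vocab, 'the' occurs most often in the reviews and so takes the first position; 'and' is the second most frequent and takes the second position, and so on.
With the dictionary, any given word can be mapped to its position. For example, if the word a appears in a review and its position in the dictionary is 3, it is encoded as 3; a word that is not in the dictionary is encoded as 0. At this stage each word is represented by its index in the dictionary, so every review becomes an index sequence as long as the review itself.
So “I like the movie!” maps to the index sequence [9 37 10 16 28].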
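The frequency-ordered dictionary and the word-to-index mapping described above can be sketched in plain Python. This is a minimal illustration on a made-up three-review corpus, not the procedure used to produce imdb.vocab; the function names are hypothetical, and ties in frequency are broken alphabetically only to keep the result deterministic.

```python
from collections import Counter

def build_vocab(docs):
    """Map each word to its 1-based rank by descending frequency."""
    counts = Counter(tok for doc in docs for tok in doc)
    # Most frequent first; ties broken alphabetically for determinism
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}

def encode(tokens, vocab):
    """Replace each token by its index; unknown words become 0."""
    return [vocab.get(tok, 0) for tok in tokens]

# Hypothetical toy corpus of already tokenized reviews
toy_docs = [
    ["the", "movie", "is", "great"],
    ["the", "movie", "is", "bad"],
    ["the", "end"],
]
toy_vocab = build_vocab(toy_docs)
encoded = encode(["i", "like", "the", "movie"], toy_vocab)
print(toy_vocab)
print(encoded)
```

Here 'the' is the most frequent word and receives index 1, while 'i' and 'like' never occur in the toy corpus and are encoded as 0.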

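However, every word in a review must ultimately be converted into a word vector, just as each position in an image is represented by pixel values: words have to become numerical features. Common methods are the statistical models word2vec and GloVe, which generate vectors for all words from a corpus. How word vectors are generated and used is covered in Part 2. Once the word vectors are available, they can be used directly.
The nn.Embedding layer works as a lookup table: given a word's index in the vocabulary, it returns the corresponding vector.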
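The lookup idea can be shown without any framework. The sketch below is not the nn.Embedding implementation; it only illustrates that an embedding table is a matrix whose row i is the vector of the word with index i. The table values and the `embed` helper are hypothetical.

```python
# Hypothetical 3-dimensional embedding table; row i holds the vector for index i.
EMBEDDING_TABLE = [
    [0.0, 0.0, 0.0],  # index 0
    [0.1, 0.2, 0.3],  # index 1
    [0.4, 0.5, 0.6],  # index 2
]

def embed(indices, table):
    """Return one vector per index by plain row lookup."""
    return [table[i] for i in indices]

vectors = embed([2, 0, 1], EMBEDDING_TABLE)
print(vectors)
```

An index sequence of length n therefore becomes n vectors, one table row per word.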
Here we use GloVe (Global Vectors for Word Representation), a classic set of pretrained word vectors. After normalization, the data has the following format:

Word	Vector
the	0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 …
,	0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 …

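The words in the first column are used directly as the vocabulary and loaded in order with dataset.text.Vocab. The Vector on each line is read and converted to a numpy.array, which is later used to load the weights of nn.Embedding.
A word vector has the form vector = 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 ...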

import zipfile
import numpy as np

def load_glove(glove_path):
    glove_100d_path = os.path.join(cache_dir, 'glove.6B.100d.txt')
    if not os.path.exists(glove_100d_path):
        glove_zip = zipfile.ZipFile(glove_path)
        glove_zip.extractall(cache_dir)

    embeddings = []
    tokens = []
    with open(glove_100d_path, encoding='utf-8') as gf:
        for glove in gf:
            word, embedding = glove.split(maxsplit=1)
            tokens.append(word)
            embeddings.append(np.fromstring(embedding, dtype=np.float32, sep=' '))
    # Append the embeddings for the two special placeholders <unk> and <pad>
    embeddings.append(np.random.rand(100))
    embeddings.append(np.zeros((100,), np.float32))

    vocab = ds.text.Vocab.from_list(tokens, special_tokens=["<unk>", "<pad>"], special_first=False)
    embeddings = np.array(embeddings).astype(np.float32)
    return vocab, embeddings

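The dataset may contain words that the vocabulary does not cover, so an <unk> token is added. Because input lengths differ, shorter texts must be padded when samples are packed into a batch, so a <pad> token is added as well. The resulting vocabulary has the length of the original vocabulary plus 2.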
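As a minimal sketch of this bookkeeping, the plain-Python code below appends the two special tokens after the regular tokens, which matches the order in which load_glove appends the two extra embedding rows. It does not use the MindSpore Vocab class; the function names and the three-word token list are hypothetical.

```python
def build_vocab_with_specials(tokens, specials=("<unk>", "<pad>")):
    """Map tokens to ids in order, with special tokens appended at the end."""
    token_to_id = {tok: i for i, tok in enumerate(tokens)}
    for special in specials:
        token_to_id[special] = len(token_to_id)
    return token_to_id

def lookup(words, token_to_id, unknown_token="<unk>"):
    """Convert words to ids, falling back to the id of the unknown token."""
    unk_id = token_to_id[unknown_token]
    return [token_to_id.get(word, unk_id) for word in words]

# Hypothetical three-word vocabulary standing in for the GloVe word list
toy_tokens = ["the", ",", "movie"]
vocab_map = build_vocab_with_specials(toy_tokens)
ids = lookup(["the", "zzz", "movie"], vocab_map)
print(len(vocab_map), ids)
```

The vocabulary grows from 3 to 5 entries, and the uncovered word 'zzz' is mapped to the id of <unk>.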

Next, download the GloVe word vectors and load them to build the vocabulary and the embedding weight matrix.

glove_path = download('glove.6B.zip', 'https://mindspore-website.obs.myhuaweicloud.com/notebook/datasets/glove.6B.zip')
vocab, embeddings = load_glove(glove_path)
len(vocab.vocab())

Use the vocabulary to convert the word the into its index id, then look up the corresponding vector in the embedding matrix:

idx = vocab.tokens_to_ids('the')
embedding = embeddings[idx]
idx, embedding


1.4 Dataset Preprocessing

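The IMDB dataset returned by the loader is already tokenized, but that is not enough for building training data, so additional preprocessing is required. It consists of the following steps:

Convert every token to its index id with Vocab.
Bring all text sequences to the same length: shorter sequences are padded with <pad>, longer ones are truncated.

These steps use the interfaces provided by mindspore.dataset. All of them are part of MindSpore's high-performance data engine, and each operation is treated as one stage of the data pipeline; for details, see the MindSpore data engine documentation.
For the token-to-index lookup, we use the text.Lookup interface, loading the vocabulary built earlier and specifying unknown_token. For the length unification, we use the PadEnd interface, which takes a maximum length and a padding value (pad_value); here the maximum length is 500 and the padding value is the index id of <pad> in the vocabulary.

Besides preprocessing the text column, the label data must be converted to float32, as required by the model training that follows.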
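The effect of the length unification can be sketched in plain Python. This is an illustration of the pad-or-truncate behavior described above, not the PadEnd implementation; the helper name, the short maximum length of 5 and the pad id of 9 are assumptions chosen to keep the example readable.

```python
def pad_end(ids, max_len, pad_value):
    """Truncate to max_len, or right-pad with pad_value up to max_len."""
    return ids[:max_len] + [pad_value] * max(0, max_len - len(ids))

PAD_ID = 9  # hypothetical index id of <pad>

short_review = pad_end([4, 7, 1], max_len=5, pad_value=PAD_ID)
long_review = pad_end([4, 7, 1, 3, 8, 2, 6], max_len=5, pad_value=PAD_ID)
print(short_review)
print(long_review)
```

Every sequence comes out with exactly max_len entries, which is what allows samples of different original lengths to be stacked into one batch.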

import mindspore as ms

lookup_op = ds.text.Lookup(vocab, unknown_token='<unk>')
pad_op = ds.transforms.PadEnd([500], pad_value=vocab.tokens_to_ids('<pad>'))
type_cast_op = ds.transforms.TypeCast(ms.float32)

imdb_train = imdb_train.map(operations=[lookup_op, pad_op], input_columns=['text'])
imdb_train = imdb_train.map(operations=[type_cast_op], input_columns=['label'])

imdb_test = imdb_test.map(operations=[lookup_op, pad_op], input_columns=['text'])
imdb_test = imdb_test.map(operations=[type_cast_op], input_columns=['label'])
imdb_train, imdb_valid = imdb_train.split([0.7, 0.3])
imdb_train = imdb_train.batch(64, drop_remainder=True)
imdb_valid = imdb_valid.batch(64, drop_remainder=True)
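As a rough check of the resulting dataset sizes, the arithmetic can be worked through in plain Python. The sketch assumes that the training set really yields the 10000 samples requested via num_samples and that the 0.7/0.3 split produces exactly 7000 and 3000 samples; the helper `num_batches` is hypothetical and not part of MindSpore.

```python
def num_batches(num_samples, batch_size, drop_remainder=True):
    """Number of batches produced from num_samples samples."""
    full, rest = divmod(num_samples, batch_size)
    return full if drop_remainder or rest == 0 else full + 1

total = 10000                 # assumed number of sampled training reviews
train_size = total * 7 // 10  # 70% for training
valid_size = total - train_size

train_batches = num_batches(train_size, 64)
valid_batches = num_batches(valid_size, 64)
print(train_size, valid_size, train_batches, valid_batches)
```

Under these assumptions, drop_remainder=True discards the last incomplete batch of each split, so 24 training samples and 56 validation samples are not used in a given pass.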