Dataset Download
This tutorial uses a text classification dataset from the PaddlePaddle AI Studio dataset hub:
https://aistudio.baidu.com/aistudio/datasetdetail/183563
Download ChnSentiCorp情感分析酒店评论.zip (the ChnSentiCorp hotel-review sentiment dataset).
Extracting it yields the folder structure shown below.
To avoid trouble later, rename the 正面 (positive) and 负面 (negative) folders to pos and neg, and rename the ChnSentiCorp情感分析酒店评论 folder to comment.
For convenience, the dataset has been manually split into a training set of 1700 reviews and a test set of 300 reviews, available here:
Link: https://pan.baidu.com/s/1IfgS_f0xIvYCHwAzVXFCxA?pwd=6ss3
Extraction code: 6ss3
Dataset Import
Read the data from the folders into a list.
# Library imports
# encoding=utf8
import re
import random
import requests
import numpy as np
import paddle
import paddle.nn.functional as F
from paddle.nn import LSTM, Embedding, Dropout, Linear
import os

# Load the dataset
def load_comment(is_training):
    data_set = []
    # The training and test data have already been split; the training data lives in
    # ./train/pos/ and ./train/neg/, which hold positive and negative reviews respectively.
    # Read the files one by one and append them to data_set.
    # Each element of data_set is a tuple (sentence, label), where label=0 means
    # negative sentiment and label=1 means positive sentiment.
    # Walk the positive/negative review directories and read each text file
    for label in ["pos", "neg"]:
        # Pick the folder for the current split and label
        if is_training:
            folder_path = "D:/study_software/vscode/Microsoft VS Code/paddle/homework_sentiment_analysis/dataset/comment/train/" + label
        else:
            folder_path = "D:/study_software/vscode/Microsoft VS Code/paddle/homework_sentiment_analysis/dataset/comment/test/" + label
        # Iterate over the files in the folder
        for filename in os.listdir(folder_path):
            # Read the text content and add it to the dataset
            file_path = os.path.join(folder_path, filename)
            with open(file_path, "r", encoding="utf-8") as f:
                sentence = f.read()
            sentence_label = 0 if label == 'neg' else 1
            data_set.append((sentence, sentence_label))
    return data_set

train_corpus = load_comment(True)
test_corpus = load_comment(False)

for i in range(5):
    print("sentence %d, %s" % (i, train_corpus[i][0]))
    print("sentence %d, label %d" % (i, train_corpus[i][1]))
Running the code above produces the output below.
As you can see, the data is read in successfully.
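One caveat: because load_comment reads all pos files before all neg files, the corpus comes out ordered by label. The tutorial imports random, and shuffling before training is a common extra step (not shown in the original); a minimal sketch on a toy corpus in the same (sentence, label) format:

```python
import random

# Toy corpus in the (sentence, label) format that load_comment produces;
# the real corpus is ordered positives-first, so shuffling mixes the labels
corpus = [("good hotel", 1), ("great view", 1), ("dirty room", 0), ("rude staff", 0)]

random.seed(0)          # fixed seed so the shuffle is reproducible
random.shuffle(corpus)  # in-place shuffle

print(corpus)
```

The same two lines applied to train_corpus would randomize the label order before batching.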
Corpus Segmentation and Dictionary Construction
Installing jieba
After a quick search, jieba turned out to be the mainstream Chinese word-segmentation library; it can be installed in the target environment with
pip install jieba
In practice I ran into timeouts and failed downloads, so I decided to point pip's default index at a domestic mirror. Running the command below in the paddle environment creates an ini config file, after which all downloads default to the Tsinghua mirror:
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
With that in place, the installation succeeded.
Corpus Segmentation
For the text obtained above, strip punctuation first so it does not pollute the dictionary, then segment with jieba.
# Remove punctuation with a regular expression
text = re.sub(r'[^\w\s]', '', sentence)
text = re.sub(r'\s+', '', text)
# Segment the text into words
word_list = jieba.lcut(text)
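The later scripts import this logic as a module, data_preprocess, and call data_preprocess.data_preprocess(corpus) on the whole corpus. That file is not shown in the original, so the sketch below reconstructs it from how it is called: it is an assumption, including the fallback to character-level tokens when jieba is not installed.

```python
# data_preprocess.py -- a minimal sketch; the exact signature is inferred from
# the call data_preprocess.data_preprocess(corpus) -> [(word_list, label), ...]
import re

try:
    import jieba

    def _cut(text):
        return jieba.lcut(text)  # word-level segmentation, as in the tutorial
except ImportError:
    # Fallback (my addition, not in the original): character-level tokens
    def _cut(text):
        return list(text)

def data_preprocess(corpus):
    processed = []
    for sentence, label in corpus:
        # Remove punctuation and whitespace so they do not pollute the dictionary
        text = re.sub(r'[^\w\s]', '', sentence)
        text = re.sub(r'\s+', '', text)
        processed.append((_cut(text), label))
    return processed

print(data_preprocess([("房间很干净,服务也不错!", 1)]))
```

Whatever the original module looks like, the key point is that each (sentence, label) pair becomes a (word_list, label) pair, which is what build_dict below iterates over.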
Building the Dictionary
Import the function from the earlier load_comment.py into build_dict.py, build a dictionary over the training set, and print the 20 most frequent words.
# Library imports
# encoding=utf8
import re
import random
import requests
import numpy as np
import paddle
import paddle.nn.functional as F
from paddle.nn import LSTM, Embedding, Dropout, Linear
import os
import jieba
# Import the dataset loader and the preprocessing module
import load_comment
import data_preprocess
# Build the dictionary: count each word's frequency, then map each word to an integer id by frequency
def build_dict(corpus):
    word_freq_dict = dict()
    for sentence, _ in corpus:
        # The corpus has already been segmented by data_preprocess,
        # so each sentence is a list of words
        for word in sentence:
            if word not in word_freq_dict:
                word_freq_dict[word] = 0
            word_freq_dict[word] += 1

    # Sort words by frequency, most frequent first
    word_freq_dict = sorted(word_freq_dict.items(), key=lambda x: x[1], reverse=True)

    word2id_dict = dict()
    word2id_freq = dict()
    # By convention, [oov] and [pad] go at the front of the dictionary with small
    # ids; this is easy to remember and makes extending the vocabulary later simpler
    word2id_dict['[oov]'] = 0
    word2id_freq[0] = 1e10
    word2id_dict['[pad]'] = 1
    word2id_freq[1] = 1e10
    for word, freq in word_freq_dict:
        word2id_dict[word] = len(word2id_dict)
        word2id_freq[word2id_dict[word]] = freq
    return word2id_freq, word2id_dict

train_corpus = load_comment.load_comment(True)
train_corpus = data_preprocess.data_preprocess(train_corpus)
word2id_freq, word2id_dict = build_dict(train_corpus)
vocab_size = len(word2id_freq)
print("there are totally %d different words in the corpus" % vocab_size)
for _, (word, word_id) in zip(range(20), word2id_dict.items()):
    print("word %s, its id %d, its word freq %d" % (word, word_id, word2id_freq[word_id]))
Running it produces the output below.
As you can see, the dictionary was built successfully.
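With word2id_dict in hand, each segmented review can be converted to a fixed-length sequence of integer ids: unseen words map to [oov] (id 0) and short sentences are padded with [pad] (id 1). The tiny dictionary and the convert_words_to_ids helper below are illustrative stand-ins of mine, not code from the tutorial:

```python
# A toy dictionary in the same layout build_dict produces:
# [oov]=0 and [pad]=1 up front, then words by descending frequency
word2id_dict = {'[oov]': 0, '[pad]': 1, '酒店': 2, '干净': 3, '不错': 4}

def convert_words_to_ids(word_list, word2id_dict, max_len=6):
    # Unknown words fall back to the [oov] id (0)
    ids = [word2id_dict.get(word, word2id_dict['[oov]']) for word in word_list]
    # Truncate long sentences and pad short ones with the [pad] id (1)
    ids = ids[:max_len]
    ids += [word2id_dict['[pad]']] * (max_len - len(ids))
    return ids

print(convert_words_to_ids(['酒店', '干净', '位置'], word2id_dict))  # → [2, 3, 0, 1, 1, 1]
```

This id conversion is the step that connects the dictionary built here to the Embedding layer imported at the top of the script.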
This completes the dataset preparation.