Dataset Download
This tutorial uses a text classification dataset from the PaddlePaddle AI Studio dataset hub:
https://aistudio.baidu.com/aistudio/datasetdetail/183563
Download ChnSentiCorp情感分析酒店评论.zip (the ChnSentiCorp hotel-review sentiment dataset).
Extracting it yields the folder structure shown below.
To avoid trouble later, rename the 正面 (positive) and 负面 (negative) folders to pos and neg, and rename the ChnSentiCorp情感分析酒店评论 folder to comment.
For convenience, the dataset has been manually split into a training set of 1700 reviews and a test set of 300 reviews, available here:
Link: https://pan.baidu.com/s/1IfgS_f0xIvYCHwAzVXFCxA?pwd=6ss3
Extraction code: 6ss3
Dataset Import
Read the data from the folders into a list.
# Library imports
# encoding=utf8
import re
import random
import requests
import numpy as np
import paddle
import paddle.nn.functional as F
from paddle.nn import LSTM, Embedding, Dropout, Linear
import os

# Load the dataset
def load_comment(is_training):
    data_set = []
    # The training and test data have already been split; the training data lives in
    # ./train/pos/ and ./train/neg/, which hold positive and negative reviews respectively.
    # Read the files one by one and append them to data_set.
    # Each element of data_set is a tuple (sentence, label), where label=0 means
    # negative sentiment and label=1 means positive sentiment.
    # Walk the positive/negative review directories and read each text file
    for label in ["pos", "neg"]:
        # Pick the folder for the current split and label
        if is_training:
            folder_path = "D:/study_software/vscode/Microsoft VS Code/paddle/homework_sentiment_analysis/dataset/comment/train/" + label
        else:
            folder_path = "D:/study_software/vscode/Microsoft VS Code/paddle/homework_sentiment_analysis/dataset/comment/test/" + label
        # Iterate over the files in the folder
        for filename in os.listdir(folder_path):
            # Read the text content and add it to the dataset
            file_path = os.path.join(folder_path, filename)
            with open(file_path, "r", encoding="utf-8") as f:
                sentence = f.read()
            sentence_label = 0 if label == 'neg' else 1
            data_set.append((sentence, sentence_label))
    return data_set

train_corpus = load_comment(True)
test_corpus = load_comment(False)

for i in range(5):
    print("sentence %d, %s" % (i, train_corpus[i][0]))
    print("sentence %d, label %d" % (i, train_corpus[i][1]))
Running the code above produces the output below.
As you can see, the data is read in successfully.
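One caveat: because load_comment reads all pos files before all neg files, the corpus comes out ordered by label. The tutorial imports random, and shuffling before training is a common extra step (not shown in the original); a minimal sketch on a toy corpus in the same (sentence, label) format:

```python
import random

# Toy corpus in the (sentence, label) format that load_comment produces;
# the real corpus is ordered positives-first, so shuffling mixes the labels
corpus = [("good hotel", 1), ("great view", 1), ("dirty room", 0), ("rude staff", 0)]

random.seed(0)          # fixed seed so the shuffle is reproducible
random.shuffle(corpus)  # in-place shuffle

print(corpus)
```

The same two lines applied to train_corpus would randomize the label order before batching.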
Corpus Segmentation and Dictionary Construction
Installing jieba
After a quick search, jieba turned out to be the mainstream Chinese word-segmentation library; it can be installed in the target environment with
pip install jieba
In practice I ran into timeouts and failed downloads, so I decided to point pip's default index at a domestic mirror. Running the command below in the paddle environment creates an ini config file, after which all downloads default to the Tsinghua mirror:
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
With that in place, the installation succeeded.
Corpus Segmentation
For the text obtained above, strip punctuation first so it does not pollute the dictionary, then segment with jieba.
# Remove punctuation with a regular expression
text = re.sub(r'[^\w\s]', '', sentence)
text = re.sub(r'\s+', '', text)
# Segment the text into words
word_list = jieba.lcut(text)
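The later scripts import this logic as a module, data_preprocess, and call data_preprocess.data_preprocess(corpus) on the whole corpus. That file is not shown in the original, so the sketch below reconstructs it from how it is called: it is an assumption, including the fallback to character-level tokens when jieba is not installed.

```python
# data_preprocess.py -- a minimal sketch; the exact signature is inferred from
# the call data_preprocess.data_preprocess(corpus) -> [(word_list, label), ...]
import re

try:
    import jieba

    def _cut(text):
        return jieba.lcut(text)  # word-level segmentation, as in the tutorial
except ImportError:
    # Fallback (my addition, not in the original): character-level tokens
    def _cut(text):
        return list(text)

def data_preprocess(corpus):
    processed = []
    for sentence, label in corpus:
        # Remove punctuation and whitespace so they do not pollute the dictionary
        text = re.sub(r'[^\w\s]', '', sentence)
        text = re.sub(r'\s+', '', text)
        processed.append((_cut(text), label))
    return processed

print(data_preprocess([("房间很干净,服务也不错!", 1)]))
```

Whatever the original module looks like, the key point is that each (sentence, label) pair becomes a (word_list, label) pair, which is what build_dict below iterates over.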
Building the Dictionary
Import the function from the earlier load_comment.py into build_dict.py, build a dictionary over the training set, and print the 20 most frequent words.
# Library imports
# encoding=utf8
import re
import random
import requests
import numpy as np
import paddle
import paddle.nn.functional as F
from paddle.nn import LSTM, Embedding, Dropout, Linear
import os
import jieba
# Import the dataset loader and the preprocessing module
import load_comment
import data_preprocess
# Build the dictionary: count each word's frequency, then map each word to an integer id by frequency
def build_dict(corpus):
    word_freq_dict = dict()
    for sentence, _ in corpus:
        # The corpus has already been segmented by data_preprocess,
        # so each sentence is a list of words
        for word in sentence:
            if word not in word_freq_dict:
                word_freq_dict[word] = 0
            word_freq_dict[word] += 1

    # Sort words by frequency, most frequent first
    word_freq_dict = sorted(word_freq_dict.items(), key=lambda x: x[1], reverse=True)

    word2id_dict = dict()
    word2id_freq = dict()
    # By convention, [oov] and [pad] go at the front of the dictionary with small
    # ids; this is easy to remember and makes extending the vocabulary later simpler
    word2id_dict['[oov]'] = 0
    word2id_freq[0] = 1e10
    word2id_dict['[pad]'] = 1
    word2id_freq[1] = 1e10
    for word, freq in word_freq_dict:
        word2id_dict[word] = len(word2id_dict)
        word2id_freq[word2id_dict[word]] = freq
    return word2id_freq, word2id_dict

train_corpus = load_comment.load_comment(True)
train_corpus = data_preprocess.data_preprocess(train_corpus)
word2id_freq, word2id_dict = build_dict(train_corpus)
vocab_size = len(word2id_freq)
print("there are totally %d different words in the corpus" % vocab_size)
for _, (word, word_id) in zip(range(20), word2id_dict.items()):
    print("word %s, its id %d, its word freq %d" % (word, word_id, word2id_freq[word_id]))
Running it produces the output below.
As you can see, the dictionary was built successfully.
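With word2id_dict in hand, each segmented review can be converted to a fixed-length sequence of integer ids: unseen words map to [oov] (id 0) and short sentences are padded with [pad] (id 1). The tiny dictionary and the convert_words_to_ids helper below are illustrative stand-ins of mine, not code from the tutorial:

```python
# A toy dictionary in the same layout build_dict produces:
# [oov]=0 and [pad]=1 up front, then words by descending frequency
word2id_dict = {'[oov]': 0, '[pad]': 1, '酒店': 2, '干净': 3, '不错': 4}

def convert_words_to_ids(word_list, word2id_dict, max_len=6):
    # Unknown words fall back to the [oov] id (0)
    ids = [word2id_dict.get(word, word2id_dict['[oov]']) for word in word_list]
    # Truncate long sentences and pad short ones with the [pad] id (1)
    ids = ids[:max_len]
    ids += [word2id_dict['[pad]']] * (max_len - len(ids))
    return ids

print(convert_words_to_ids(['酒店', '干净', '位置'], word2id_dict))  # → [2, 3, 0, 1, 1, 1]
```

This id conversion is the step that connects the dictionary built here to the Embedding layer imported at the top of the script.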
This completes the dataset preparation.