Implementing an RNN for Text (Sentiment) Classification in PyTorch

Frierice

Tags: nlp python pytorch

Article link: https://blog.csdn.net/Frierice/article/details/104286545

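This article describes how to use PyTorch to implement an RNN for text sentiment classification, covering dataset processing, GloVe pretrained word embeddings, model training, and prediction. In the experiments the model's accuracy rose to 0.59720, an improvement over the one-hot and softmax approach.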

Introduction

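As a beginner learning NLP, I ran into this entry-level task: implement text classification with an RNN.
It involves the following topics:
1. CNN/RNN
2. PyTorch
3. Word embeddings
4. Dropout
I won't go into detail on RNNs here, since I'm not that familiar with them myself, haha. Here is a blog post that explains them fairly well:
Understanding RNNs and how RNNs are trained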

1. Dataset and preprocessing

The dataset used here is: Classify the sentiment of sentences from the Rotten Tomatoes dataset

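This is a corpus of movie reviews. The goal is to train on the reviews in the corpus and predict the viewer's sentiment (much like rating a food-delivery courier from 1 to 5 stars: when you give 1 star, you are probably complaining that the delivery was slow).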

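After obtaining the corpus I started preprocessing. At first I used one-hot encoding with softmax for prediction, but that ignores how words influence one another. For example, in "Lack of good taste, good service, good ..." the word "good" appears many times; if the effect of "lack" is not taken into account, the prediction comes out positive, when the sentence is actually negative. So here I use GloVe pretrained embeddings for initialization: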

# -*- coding: utf-8 -*-
"""
Created on Sun Feb  9 19:52:26 2020

@author: Frierice
Load the model
"""

import gensim
import shutil
from sys import platform

# Count the lines in the file, i.e. the number of words
def getFileLineNums(filename):
    count = 0
    # A context manager makes sure the file handle is closed
    with open(filename, 'r', encoding='UTF-8') as f:
        for _ in f:
            count += 1
    return count

# Open the word-vector file and prepend one line (bulk copy, used on Linux)
def prepend_line(infile, outfile, line):
    with open(infile, 'r', encoding='UTF-8') as old:
        with open(outfile, 'w', encoding='UTF-8') as new:
            new.write(str(line) + "\n")
            shutil.copyfileobj(old, new)

# Same result, copied line by line (used on Windows and other platforms)
def prepend_slow(infile, outfile, line):
    with open(infile, 'r', encoding='UTF-8') as fin:
        with open(outfile, 'w', encoding='UTF-8') as fout:
            fout.write(line + "\n")
            for row in fin:
                fout.write(row)

def load(filename):
    num_lines = getFileLineNums(filename)
    gensim_file = 'D:/spyderProject/Task2/gloveModel/glove_model.txt'
    # word2vec text header: "<number of words> <vector dimension>"
    gensim_first_line = "{} {}".format(num_lines, 300)
    # Prepend the header line.
    if platform.startswith("linux"):
        prepend_line(filename, gensim_file, gensim_first_line)
    else:
        prepend_slow(filename, gensim_file, gensim_first_line)

    model = gensim.models.KeyedVectors.load_word2vec_format(gensim_file)
#    model.save('D:/spyderProject/Task2/gloveModel/glove.model')
#    print(model['unk'])
    # Return the loaded vectors so the caller can actually use them
    return model

model = load('D:/spyderProject/Task2/gloveModel/glove.6B.300d.txt')

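The code above adds the line "400000 300" (number of words, vector dimension) at the top of the data so that the file can be loaded directly in word2vec format.
The next step is to take the intersection of the model's vocabulary with the vocabulary of the training data. (This step is optional: I originally assumed the embedding lookup would take a while, but it turned out that the time goes into loading the model, so it makes little difference either way.)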
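To make the header format concrete, here is a minimal sketch of the same idea on a toy file. The function name `add_word2vec_header` and the two-word toy vectors are made up for illustration and are not part of the original script. Unlike the script, this sketch reads the vector dimension from the first line instead of hard-coding 300, and it uses only the standard library.

```python
import os
import tempfile

def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file to out_path, prefixed with a '<rows> <dim>' header."""
    rows = 0
    dim = None
    with open(glove_path, "r", encoding="utf-8") as fin:
        for line in fin:
            if dim is None:
                # Each line is "<word> v1 v2 ... vd", so dim = number of fields - 1.
                dim = len(line.rstrip().split(" ")) - 1
            rows += 1
    header = "{} {}".format(rows, dim)
    with open(glove_path, "r", encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        fout.write(header + "\n")
        for line in fin:
            fout.write(line)
    return header

# Toy two-word, three-dimensional "GloVe" file (hypothetical values).
tmp_dir = tempfile.mkdtemp()
toy_in = os.path.join(tmp_dir, "toy_glove.txt")
toy_out = os.path.join(tmp_dir, "toy_word2vec.txt")
with open(toy_in, "w", encoding="utf-8") as f:
    f.write("good 0.1 0.2 0.3\nlack -0.1 0.0 0.4\n")

header = add_word2vec_header(toy_in, toy_out)
print(header)  # 2 3
```

As far as I know, gensim 4.x also accepts `no_header=True` in `KeyedVectors.load_word2vec_format`, which should make the extra header file unnecessary; the script in this post predates that option.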

def getDicWordEmbedding():
    gensim_file = 'D:/spyderProject/Task2/gloveModel/glove_model.txt'
    model = gensim.models
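
The original listing is cut off at this point. As a rough idea of what a vocabulary intersection can look like, here is a minimal sketch; it is not the author's code. The helper `build_embedding_table`, the `<pad>` and `<unk>` entries, and the toy vectors are all assumptions for illustration. A plain dict stands in for the loaded GloVe model; gensim's `KeyedVectors` supports the same `word in model` and `model[word]` operations.

```python
def build_embedding_table(train_vocab, pretrained, dim):
    """Keep only training words that have a pretrained vector.

    Returns (word_to_index, matrix), where matrix[i] is the vector for index i.
    Index 0 is padding and index 1 is the shared unknown-word slot (assumed layout).
    """
    word_to_index = {"<pad>": 0, "<unk>": 1}
    matrix = [[0.0] * dim, [0.0] * dim]
    for word in sorted(train_vocab):  # sorted for a reproducible index order
        if word in pretrained:
            word_to_index[word] = len(matrix)
            matrix.append(list(pretrained[word]))
    return word_to_index, matrix

# Toy stand-in for the GloVe vectors (hypothetical values).
pretrained = {
    "good": [0.1, 0.2, 0.3],
    "lack": [-0.1, 0.0, 0.4],
    "zebra": [0.5, 0.5, 0.5],
}
train_vocab = {"lack", "of", "good", "taste"}

word_to_index, matrix = build_embedding_table(train_vocab, pretrained, dim=3)
print(word_to_index)  # {'<pad>': 0, '<unk>': 1, 'good': 2, 'lack': 3}

# Words without a pretrained vector fall back to the <unk> index at lookup time.
ids = [word_to_index.get(w, word_to_index["<unk>"]) for w in ["lack", "of", "good", "taste"]]
print(ids)  # [3, 1, 2, 1]
```

In PyTorch, a matrix like this could then be turned into a tensor and passed to `torch.nn.Embedding.from_pretrained` to initialize the embedding layer.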
