python机器学习——文本情感分析（英文文本情感分析）

本文链接：https://blog.csdn.net/leilei7407/article/details/103858402

这篇博客介绍了使用LSTM+RNN模型进行英文文本情感分析的机器学习项目。内容涵盖数据集简介，数据处理，模型训练及新数据预测，预测结果保存在result.txt中。提供了数据集和代码下载链接。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

本人机器学习课程的小作业，记录一下，希望可以帮到一些小伙伴。
项目介绍，给一段英文文本（英文影评评论）来预测情感是正向还是负向
模型使用的是LSTM+RNN。
代码包括数据处理，模型训练，对新数据做出预测，并将预测结果（正向情感）保存到result.txt中
软件：anaconda3

代码下载链接：下载链接
word文档（文本情感色彩分类技术报告）下载链接：下载链接

一.数据集介绍

数据集链接: https://pan.baidu.com/s/1oIXkaL_SL9GSN3S56ZwvWQ
提取码: qgtg

训练集labeledTrainData.tsv(24500条带标签的训练数据)
id sentiment review 分别表示：每段文本的唯一ID，情感色彩类别标签，待分析的文本数据。
在这里插入图片描述
测试集（testData.tsv：22000条无标签测试数据）

二.代码详解

import numpy as np
import tensorflow as tf
wordsList=np.load('C:/NLP/wordsList.npy') #包含40万个词的python列表
wordsList=np.load('C:/NLP/wordsList.npy') #包含40万个词的python列表
wordsList=wordsList.tolist()
wordsList=[word.decode('UTF-8') for word in wordsList]
wordVectors=np.load('C:/NLP/wordVectors.npy')
baseballIndex=wordsList.index('baseball')
print(wordVectors[baseballIndex])
import pandas as pd
#读入数据
df=pd.read_csv('C:/NLP/labeledTrainData.tsv',sep='\t',escapechar='\\')

numDimensions=300
print("down")

maxSeqLength=250
import re
strip_special_chars = re.compile("[^A-Za-z0-9 ]+")
#清洗数据，HTML字符使用空格替代
def cleanSentences(string):
    string = string.lower().replace("<br />", " ")
    return re.sub(strip_special_chars, "", string.lower())

# #生成索引矩阵，得到24500*250的索引矩阵
# ids=np.zeros((24500,maxSeqLength),dtype='int32')
# #print(ids.shape) #输出结果为（24500,250）

# fileCounter=0
# for pf in range(0,len(df)): 
#     #print(pf)
#     indexCounter=0
#     cleanedLine=cleanSentences(df['review'][pf])
#     split=cleanedLine.split()
#     for word in split:
#         try:
#             #print('111')
#             ids[fileCounter][indexCounter]=wordsList.index(word)
#         except ValueError:
#             ids[fileCounter][indexCounter]=399999
#         indexCounter=indexCounter+1
#         if indexCounter>=maxSeqLength:
#             break        
#     fileCount