Text Classification with an LSTM

Article link: https://blog.csdn.net/qq_29678299/article/details/103109636

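This article looks at how to use a long short-term memory (LSTM) network for text classification. It introduces the basic idea behind LSTMs and uses a worked example to show how to preprocess text data, build an LSTM model, and train and evaluate it. It also explains why LSTMs are good at capturing sequential information in text and how model parameters can be tuned to improve performance on a real task.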


"""
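RNN model
Next we replace the model with a recurrent neural network (RNN). RNNs are commonly used to encode a sequence:
h_t = RNN(x_t, h_{t-1})

We use the final hidden state h_T to represent the whole sentence,
then pass h_T through a linear transformation f to predict the sentiment of the sentence.

Alternatively, the hidden states of all time steps can be averaged and the mean used as the sentence vector for classification.
In practice the two approaches give similar results.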
"""
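The two sentence representations described above (the final hidden state h_T versus the mean of all hidden states) can be sketched as follows. This is a minimal illustration, not the article's own model: the class name `LSTMClassifier`, the `pooling` argument, and all sizes are assumptions chosen for the example, and padding is not handled.

```python
import torch
import torch.nn as nn


class LSTMClassifier(nn.Module):
    """Hypothetical sentence classifier; names and sizes are illustrative only."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, pooling="last"):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # nn.LSTM expects input of shape [seq_len, batch, embed_dim] by default
        self.rnn = nn.LSTM(embed_dim, hidden_dim)
        # the linear transformation f that maps the sentence vector to one logit
        self.fc = nn.Linear(hidden_dim, 1)
        self.pooling = pooling

    def forward(self, text):
        # text: [seq_len, batch] tensor of token indices
        embedded = self.embedding(text)
        outputs, (h_n, c_n) = self.rnn(embedded)
        if self.pooling == "last":
            sentence = h_n[-1]              # final hidden state h_T, [batch, hidden_dim]
        else:
            sentence = outputs.mean(dim=0)  # average over time steps, [batch, hidden_dim]
        return self.fc(sentence).squeeze(1)  # [batch] logits


torch.manual_seed(0)
tokens = torch.randint(0, 100, (12, 4))  # assumed toy batch: seq_len=12, batch=4
last_model = LSTMClassifier(100, 16, 32, pooling="last")
mean_model = LSTMClassifier(100, 16, 32, pooling="mean")
logits_last = last_model(tokens)
logits_mean = mean_model(tokens)
print(logits_last.shape, logits_mean.shape)
```

Both variants return one logit per sentence, so they can be trained with the same loss (for example `nn.BCEWithLogitsLoss`); only the pooling step differs.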
import torch
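# Note: Field and LabelField belong to the legacy torchtext API
# (torchtext < 0.9; moved to torchtext.legacy in 0.9-0.11; removed in 0.12).
from torchtext import data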

SEED = 1234

torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)
    

"""
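cudnn.benchmark vs. cudnn.deterministic: with torch.backends.cudnn.benchmark = True, cuDNN tries the
convolution algorithms it offers for each convolution layer and keeps the fastest one. This costs a little
extra time at startup but can shorten training, mainly when input shapes stay the same between iterations.
The flag on the commented-out line below is a different one: torch.backends.cudnn.deterministic = True
restricts cuDNN to deterministic algorithms, trading some speed for reproducible results.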


"""
#torch.backends.cudnn.deterministic = True
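To make the seeding and cuDNN flags above concrete, here is a small sketch of a helper that sets them together. The function name `seed_everything` and its `deterministic` parameter are hypothetical and not part of the original script; the flags themselves are the standard `torch.backends.cudnn` settings.

```python
import random

import torch


def seed_everything(seed, deterministic=True):
    """Hypothetical helper: seed the RNGs and choose between the two cuDNN modes."""
    random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # deterministic=True favours reproducibility; benchmark=True favours speed
    torch.backends.cudnn.deterministic = deterministic
    torch.backends.cudnn.benchmark = not deterministic


seed_everything(1234)
a = torch.rand(3)
seed_everything(1234)
b = torch.rand(3)
print(torch.equal(a, b))  # same seed, same random numbers
```

Re-seeding before each draw yields identical tensors, which is the behaviour the `SEED` constant in the script relies on.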


"""
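First decide how the data will be processed: a Field defines how one column of data is processed.
By default, tokenization splits on whitespace.
With tokenize='spacy', the spaCy tokenizer is used instead, which also separates punctuation from words.
spaCy is an industrial-strength natural language processing library that emphasizes speed and supports many basic NLP functions.
Official site: https://spacy.io/
Its main features include tokenization, part-of-speech tagging, lemmatization, named entity recognition, noun phrase extraction, and more.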


"""
TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(dtype=torch.float)
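The difference between whitespace splitting and punctuation-aware tokenization can be seen in a small sketch. The regex tokenizer below is only a rough stand-in written for this example; it is not spaCy's actual algorithm, and both function names are hypothetical.

```python
import re


def whitespace_tokenize(text):
    # what a plain str.split() tokenizer produces
    return text.split()


def punctuation_aware_tokenize(text):
    # rough approximation: runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Brilliant adaptation of the novel, truly."
ws_tokens = whitespace_tokenize(sentence)
pa_tokens = punctuation_aware_tokenize(sentence)
print(ws_tokens)
print(pa_tokens)
```

With whitespace splitting, `novel,` and `truly.` stay glued to their punctuation and become separate vocabulary entries from `novel` and `truly`, which is one reason a tokenizer such as spaCy's is preferred here.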

"""
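TorchText supports many common natural language processing datasets.
The code below automatically downloads the IMDb dataset and splits it into train and test torchtext dataset objects.
The data is processed by the Fields defined above. The IMDb dataset contains 50,000 movie reviews in total, each labeled as positive or negative.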
"""
from torchtext import datasets
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
"""
Check how many examples each data split contains.
"""
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')
"""
Inspect one example:

{'text': ['Brilliant', 'adaptation', 'of', 'the', 'novel', 'that', 'made', 'famous', 'the', 'relatives', 'of', 'Chilean', 'President', 'Salvador', 'Allende', '
killed', '.', 'In', 'the', 'environment', 'of', 'a', 'large', 'estate', 'that', 'arises', 'from', 'the', 'ruins', ',', 'becoming', 'a', 'force', 'to', 'abuse', 'and',
'exploitation', 'of', 'outrage', ',', 'a', 'luxury', 'estate', 'for', 'the', 'benefit', 'of', 'the', 'upstart', 'Esteban', 'Trueba', 'and', 'his', 'undeserved',
'family', ',', 'the', 'brilliant', 'Danish', 'director', 'Bille', 'August', 'recreates', ',', 'in', 'micro', ',', 'which', 'at', 'the', 'time', 'would', 'be', 'the', 
'process', 'leading', 'to', 'the', 'greatest', 'infamy', 'of', &#