PyTorch: Recurrent Neural Networks - LSTM

宅家的小魏

Last modified 2022-02-13 17:28:13

Category: PyTorch. Tags: pytorch, lstm, rnn, recurrent neural network, deep learning

First published 2022-02-05 00:05:58

Article link: https://blog.csdn.net/weixin_44979150/article/details/122779345

PyTorch: Recurrent Neural Networks: News Classification with LSTM

Copyright: Jingmin Wei, Pattern Recognition and Intelligent System, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology

Contents

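Merging the texts into three files: train, test, and val

Reading and preprocessing the Chinese data

Importing and exploring the training data

Building the LSTM network

Training the LSTM network

Predicting with the LSTM network

Visualizing the distribution of the word vectors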

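This tutorial is not for commercial use; it is intended only for study, reference, and discussion. Please contact the author before reposting.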

For a detailed description of the LSTM architecture, see the previous article in this tutorial series.

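This article uses a long short-term memory (LSTM) network to classify news articles by category. You can also try swapping the model for the GRU covered in the next article and compare how the two networks perform.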

Classification uses the THUCNews dataset: $10$ classes of text in total, with $6500$ texts per class, split into a training set ( $5000\times10$ ), a validation set ( $500\times10$ ), and a test set ( $1000\times10$ ).

Dataset download link: http://thuctc.thunlp.org/
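As a quick sanity check on the split sizes described above, the arithmetic can be sketched in a few lines. The constant and variable names here are illustrative assumptions, not part of the tutorial's code.

```python
# Split sizes as described in the text (names are illustrative assumptions)
NUM_CLASSES = 10
PER_CLASS = {"train": 5000, "val": 500, "test": 1000}

# Texts per class across all three splits
per_class_total = sum(PER_CLASS.values())

# Total number of texts in each split over all classes
totals = {name: n * NUM_CLASSES for name, n in PER_CLASS.items()}

print(per_class_total)  # 6500
print(totals)           # {'train': 50000, 'val': 5000, 'test': 10000}
```

These totals are the line counts one would expect in the three merged files built in the next step.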

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import re
import string
import copy
import time
import os
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib
import csv

import torch
import torch.nn as nn
import torch.nn.functional as F 
import torch.optim as optim
import torch.utils.data as Data 
import jieba
from torchtext import data
from torchtext.vocab import Vectors

# Display Chinese text in output figures
from matplotlib.font_manager import FontProperties
fonts = FontProperties(fname = 'C:/windows/Fonts/STXIHEI.TTF')

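# Load the model on the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
print(torch.cuda.device_count())
if torch.cuda.is_available():
    # get_device_name raises an error on machines without CUDA
    print(torch.cuda.get_device_name(0))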

cuda
1
GeForce MX250

Merging the texts into three files: train, test, and val

The dataset-splitting script is adapted from: https://github.com/gaussic/text-classification-cnn-rnn
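Before the full script, here is a minimal sketch of the two ideas it relies on: files in each category are assigned to a split by their position (the first 5000 to train, the next 1000 to test, the rest to val), and each article is flattened to one `category<TAB>content` line. The helper names `assign_split`, `format_line`, and `parse_line` are hypothetical and exist only for illustration.

```python
def assign_split(index, n_train=5000, n_test=1000):
    """Return the split for the index-th file of a category (hypothetical helper)."""
    if index < n_train:
        return "train"
    if index < n_train + n_test:
        return "test"
    return "val"


def format_line(category, content):
    """Flatten one article into a single 'category<TAB>content' line."""
    flat = content.replace("\n", "").replace("\t", "").replace("\u3000", "")
    return category + "\t" + flat + "\n"


def parse_line(line):
    """Split a merged line back into (category, content)."""
    label, text = line.rstrip("\n").split("\t", 1)
    return label, text


print(assign_split(4999), assign_split(5000), assign_split(6000))
print(parse_line(format_line("sports", "first line\nsecond\tline")))
```

Because tabs and newlines are stripped from the article body, the first tab on each line reliably separates the label from the text, which is why `parse_line` splits only once.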

def _read_file(filename):
    """Read one file and flatten it into a single line."""
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read().replace('\n', '').replace('\t', '').replace('\u3000', '')

def save_file(dirname):
    """
    Merge the individual files and save them into three files.
    """
    # Output paths match the files read back after save_file() runs
    f_train = open('data/cnews/cnews.train.txt', 'w', encoding='utf-8')
    f_test = open('data/cnews/cnews.test.txt', 'w', encoding='utf-8')
    f_val = open('data/cnews/cnews.val.txt', 'w', encoding='utf-8')
    for category in os.listdir(dirname):   # one directory per category
        cat_dir = os.path.join(dirname, category)
        if not os.path.isdir(cat_dir):
            continue
        files = os.listdir(cat_dir)
        count = 0
        for cur_file in files:
            filename = os.path.join(cat_dir, cur_file)
            content = _read_file(filename)
            if count < 5000:
                f_train.write(category + '\t' + content + '\n')
            elif count < 6000:
                f_test.write(category + '\t' + content + '\n')
            else:
                f_val.write(category + '\t' + content + '\n')
            count += 1

        print('Finished:', category)

    f_train.close()
    f_test.close()
    f_val.close()

save_file('data/thucnews')
print(len(open('data/cnews/cnews.train.txt', 'r', encoding='utf-8').readlines()))
print(len(open('data/cnews/cnews.test.txt', 'r', encoding='utf-8').readlines()))
print(len(open('data/cnews/cnews.val.txt', 'r', encoding