[PyTorch] Chinese News Classification with PyTorch and an LSTM (Source Code Included)


1. Data preprocessing

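        First, the files are read and the chinese_pre() function preprocesses the Chinese text: it lowercases letters, removes digits, segments the text into words, and removes stop words. The preprocessed result is then saved to new files. Next, map() (called on the pandas label column) converts Chinese labels such as "体育" (sports) and "娱乐" (entertainment) into numeric labels, which are also written to file. Documentation for re.sub(), jieba.cut(), and map() is linked below.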

A detailed introduction to re.sub() usage - jack's blog - CSDN Blog

jieba source code analysis (2): jieba.cut - AloisWei - cnblogs

Python map() function | Runoob tutorial
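The preprocessing steps described above can be sketched without any third-party dependency. This is only an illustration under stated assumptions: the tokenizer defaults to whitespace splitting as a stand-in for jieba.cut(), and the names preprocess, STOP_WORDS and LABEL_MAP, as well as the tiny stop-word set and sample sentence, are hypothetical and not taken from the article's data.

```python
import re

# Hypothetical, tiny stop-word set; the article loads a full list from a file.
STOP_WORDS = {"的", "了"}
# Subset of the article's label mapping, for illustration only.
LABEL_MAP = {"体育": 0, "娱乐": 1}


def preprocess(text, tokenize=str.split, stop_words=STOP_WORDS):
    """Lowercase, drop digits, tokenize, remove stop words, re-join with spaces."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)  # remove every run of digits
    tokens = [tok.strip() for tok in tokenize(text)]
    tokens = [tok for tok in tokens if tok and tok not in stop_words]
    return " ".join(tokens)


cleaned = preprocess("NBA 2020 赛季 的 比赛")
print(cleaned)  # nba 赛季 比赛

# map() applies the dictionary lookup to every label in turn.
labels = ["体育", "娱乐", "体育"]
label_codes = list(map(LABEL_MAP.get, labels))
print(label_codes)  # [0, 1, 0]
```

To reproduce the article's behaviour, one could pass a jieba-based tokenizer as the tokenize argument; the rest of the function would stay the same.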

2. Building the dataset

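        First, a tokenizer that splits on whitespace is defined, and the dataset is divided by feature and label into the "TEXT" and "LABEL" fields, each with its own processing. The CSV column names are then mapped to the corresponding field ("TEXT" or "LABEL"), and the data is read file by file. Finally, build_vocab() and BucketIterator() turn the data into a vocabulary and into batches. (See: [Torchtext] Torchtext.Vocab, Torchtext.data.BucketIterator, the build_vocab function, and Torchtext.vocab.Vectors - m0_58810879's blog - CSDN Blog)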
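To make the vocabulary and fixed-length steps concrete, here is a minimal pure-Python sketch of the idea. It is not torchtext's implementation: the helper names build_vocab and numericalize, the tie-breaking rule, and the toy documents are assumptions made for illustration. It counts tokens, reserves indices 0 and 1 for the unknown and padding tokens, and pads or truncates each text to a fixed length while also returning the unpadded length.

```python
from collections import Counter


def build_vocab(token_lists, max_size=None):
    """Return (itos, stoi) with the most frequent tokens after <unk> and <pad>."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    words = [word for word, _ in ordered[:max_size]]
    itos = ["<unk>", "<pad>"] + words
    stoi = {word: index for index, word in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi, fix_length):
    """Map tokens to indices, then truncate or pad to exactly fix_length."""
    ids = [stoi.get(tok, stoi["<unk>"]) for tok in tokens[:fix_length]]
    length = len(ids)  # length before padding
    ids += [stoi["<pad>"]] * (fix_length - length)
    return ids, length


docs = ["a b a", "b c"]
tokenized = [doc.split() for doc in docs]  # whitespace tokenizer
itos, stoi = build_vocab(tokenized, max_size=2)
print(itos)  # ['<unk>', '<pad>', 'a', 'b']
print(numericalize(tokenized[1], stoi, fix_length=4))  # ([3, 0, 1, 1], 2)
```

In the second document, "c" falls outside the two-word vocabulary and is mapped to the unknown index 0, and the two trailing positions are filled with the padding index 1.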

3. Building the LSTM network

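        First, Embedding() converts the token indices into word vectors (see my other article on Embedding()). The forward pass is then defined: it takes the top LSTM layer's output at the last time step and passes it through a fully connected layer, while h_n and h_c are the final hidden state and cell state. The network is then instantiated with its hyperparameters and trained. The rest of the process is the same as classifying MNIST with an RNN, so it is not repeated here.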
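The tensor shapes in this forward pass can be checked with a small standalone sketch. The sizes below (a vocabulary of 50, a batch of 4 sequences of 12 tokens) and the random token indices are assumptions chosen only to keep the example fast; they are not the article's settings. For a single-direction LSTM, the output at the last time step is the same tensor as the final hidden state of the top layer, which the last line verifies.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative sizes only, much smaller than the article's configuration.
vocab_size, embedding_dim, hidden_dim, num_classes = 50, 8, 16, 10

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=1, batch_first=True)
fc = nn.Linear(hidden_dim, num_classes)

# Random token indices with shape (batch, time_step).
token_ids = torch.randint(0, vocab_size, (4, 12))

with torch.no_grad():
    embeds = embedding(token_ids)      # (4, 12, 8)
    r_out, (h_n, c_n) = lstm(embeds)   # r_out: (4, 12, 16); h_n, c_n: (1, 4, 16)
    logits = fc(r_out[:, -1, :])       # (4, 10), one score per class

print(embeds.shape, r_out.shape, h_n.shape, logits.shape)

# Last time step of the output equals the final hidden state of the top layer.
same = torch.allclose(r_out[:, -1, :], h_n[-1])
print(same)
```

One design point worth keeping in mind: when texts are padded at the end to a fixed length, the last time step is computed after the padding tokens have been consumed, so the classifier sees the state that follows the padding.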

## Import the required modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
## Font used so that figures can display Chinese text
from matplotlib.font_manager import FontProperties
fonts = FontProperties(fname = "/Library/Fonts/华文细黑.ttf")
import re
import string
import copy
import time
from sklearn.metrics import accuracy_score,confusion_matrix


import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as Data
import jieba
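## Note: Field, TabularDataset and BucketIterator belong to the legacy torchtext API
## (available as torchtext.legacy.data in torchtext 0.9-0.11, removed in 0.12)
from torchtext import data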
from torchtext.vocab import Vectors
## Read the training, validation and test sets
train_df = pd.read_csv("data/chap7/cnews/cnews.train.txt",sep="\t",
                       header=None,names = ["label","text"])
val_df = pd.read_csv("data/chap7/cnews/cnews.val.txt",sep="\t",
                       header=None,names = ["label","text"])
test_df = pd.read_csv("data/chap7/cnews/cnews.test.txt",sep="\t",
                       header=None,names = ["label","text"])
train_df.head(5)
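stop_words = pd.read_csv("data/chap7/cnews/中文停用词库.txt",
                         header=None,names = ["text"])
## Keep the stop words in a set so that each membership test is O(1)
stop_word_set = set(stop_words.text.values)
def chinese_pre(text_data):
    ## Lowercase letters and remove digits
    text_data = text_data.lower()
    text_data = re.sub(r"\d+", "", text_data)
    ## Segment into words with jieba's accurate mode
    text_data = list(jieba.cut(text_data,cut_all=False))
    ## Remove stop words and surrounding whitespace
    text_data = [word.strip() for word in text_data if word not in stop_word_set]
    ## Join the remaining words into one space-separated string
    text_data = " ".join(text_data)
    return text_data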
## Segment and clean each dataset
train_df["cutword"] = train_df.text.apply(chinese_pre)
val_df["cutword"] = val_df.text.apply(chinese_pre)
test_df["cutword"] = test_df.text.apply(chinese_pre)
## Save the preprocessed results to new files
train_df[["label","cutword"]].to_csv("data/chap7/cnews_train.csv",index=False)
val_df[["label","cutword"]].to_csv("data/chap7/cnews_val.csv",index=False)
test_df[["label","cutword"]].to_csv("data/chap7/cnews_test.csv",index=False)
train_df.cutword.head()
train_df = pd.read_csv("data/chap7/cnews_train.csv")
val_df = pd.read_csv("data/chap7/cnews_val.csv")
test_df = pd.read_csv("data/chap7/cnews_test.csv")
labelMap = {"体育": 0,"娱乐": 1,"家居": 2,"房产": 3,"教育": 4,
            "时尚": 5,"时政": 6,"游戏": 7,"科技": 8,"财经": 9}
train_df["labelcode"] =train_df["label"].map(labelMap)
val_df["labelcode"] =val_df["label"].map(labelMap)
test_df["labelcode"] =test_df["label"].map(labelMap)
print(train_df.head())
train_df[["labelcode","cutword"]].to_csv("data/chap7/cnews_train2.csv",index=False)
val_df[["labelcode","cutword"]].to_csv("data/chap7/cnews_val2.csv",index=False)
test_df[["labelcode","cutword"]].to_csv("data/chap7/cnews_test2.csv",index=False)
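## Prepare the data with the torchtext library
# Define the operations applied to the text and label columns of the files
"""
sequential=True: the input is sequential text rather than a single numeric value
tokenize="spacy": split words with spaCy (this article passes its own whitespace tokenizer instead)
use_vocab=True: build a vocabulary
batch_first=True: put the batch dimension first
fix_length=400: pad or truncate every text to a fixed length of 400
"""
## Define the tokenizer; the text was already segmented above, so splitting on whitespace is enough
mytokenize = lambda x: x.split()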
TEXT = data.Field(sequential=True, tokenize=mytokenize, 
                  include_lengths=True, use_vocab=True,
                  batch_first=True, fix_length=400)
LABEL = data.Field(sequential=False, use_vocab=False, 
                   pad_token=None, unk_token=None)
## Map each column to be read to its field
text_data_fields = [
    ("labelcode", LABEL), # processing for the label
    ("cutword", TEXT) # processing for the text
]
## Read the data
traindata,valdata,testdata = data.TabularDataset.splits(
    path="data/chap7", format="csv", 
    train="cnews_train2.csv", fields=text_data_fields, 
    validation="cnews_val2.csv",
    test = "cnews_test2.csv", skip_header=True
)
print(len(traindata),len(valdata),len(testdata))
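## Inspect the label and text of one sample
em = traindata.examples[0]
print(em.labelcode)
print(em.cutword)
## Build the vocabulary from the training set, without pretrained word vectors
TEXT.build_vocab(traindata,max_size=20000,vectors = None)
LABEL.build_vocab(traindata)
## Plot the 50 most frequent words in the training set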
word_fre = TEXT.vocab.freqs.most_common(n=50)
word_fre = pd.DataFrame(data=word_fre,columns=["word","fre"])
word_fre.plot(x="word", y="fre", kind="bar",legend=False,figsize=(12,7))
plt.xticks(rotation = 90,fontproperties = fonts,size = 10)
plt.show()

print("Vocabulary size:",len(TEXT.vocab.itos))
print("First 10 tokens:\n",TEXT.vocab.itos[0:10])
## Number of samples per class label
print("Class label counts:",LABEL.vocab.freqs)
## Define iterators that batch examples of similar length together
BATCH_SIZE = 64
train_iter = data.BucketIterator(traindata,batch_size = BATCH_SIZE)
val_iter = data.BucketIterator(valdata,batch_size = BATCH_SIZE)
test_iter = data.BucketIterator(testdata,batch_size = BATCH_SIZE)
## Fetch one batch and inspect its contents
for step, batch in enumerate(train_iter):  
    if step > 0:
        break
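## batch.labelcode holds the class labels of the batch
print("Class labels:\n",batch.labelcode)
## batch.cutword[0] holds the token indices of the texts, shape (batch, fix_length)
print("Data shape:",batch.cutword[0].shape)
## batch.cutword[1] holds the unpadded length of each text, returned because include_lengths=True
print("Number of samples:",len(batch.cutword[1]))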
class LSTMNet(nn.Module):
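    def __init__(self, vocab_size,embedding_dim, hidden_dim, layer_dim, output_dim):
        """
        vocab_size: size of the vocabulary
        embedding_dim: dimension of the word vectors
        hidden_dim: number of hidden units in the LSTM
        layer_dim: number of LSTM layers
        output_dim: dimension of the output (number of classes)
        """
        super(LSTMNet, self).__init__()
        self.hidden_dim = hidden_dim ## number of hidden units in the LSTM
        self.layer_dim = layer_dim ## number of LSTM layers
        ## Map token indices to word vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # LSTM + fully connected layer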
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, layer_dim,
                            batch_first=True)
        self.fc1 = nn.Linear(hidden_dim, output_dim)
    def forward(self, x):
        embeds = self.embedding(x)
        # r_out shape (batch, time_step, hidden_size)
        # h_n shape (n_layers, batch, hidden_size)   an LSTM has two states: h_n is the hidden state, h_c is the cell state
        # h_c shape (n_layers, batch, hidden_size)
        r_out, (h_n, h_c) = self.lstm(embeds, None)   # None means the initial states are all zeros
        # Use the output at the last time step
        out = self.fc1(r_out[:, -1, :]) 
        return out
    
vocab_size = len(TEXT.vocab)
embedding_dim = 100
hidden_dim = 128
layer_dim = 1
output_dim = 10
lstmmodel = LSTMNet(vocab_size, embedding_dim, hidden_dim, layer_dim, output_dim)
print(lstmmodel)
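## Define the training procedure
def train_model2(model,traindataloader, valdataloader,criterion, 
                 optimizer,num_epochs=25,):
    """
    model: the network; traindataloader: training data iterator;
    valdataloader: validation data iterator; criterion: loss function; optimizer: optimization method;
    num_epochs: number of training epochs
    """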
    train_loss_all = []
    train_acc_all = []
    val_loss_all = []
    val_acc_all = []
    since = time.time()
    for epoch in range(num_epochs):
        print('-' * 10)
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        # Each epoch has two phases: training and validation
        train_loss = 0.0
        train_corrects = 0
        train_num = 0
        val_loss = 0.0
        val_corrects = 0
        val_num = 0
        model.train() ## put the model in training mode
        for step,batch in enumerate(traindataloader):
            textdata,target = batch.cutword[0],batch.labelcode.view(-1)
            out = model(textdata)
            pre_lab = torch.argmax(out,1) # predicted labels
            loss = criterion(out, target) # compute the loss
            optimizer.zero_grad()        
            loss.backward()       
            optimizer.step()  
            train_loss += loss.item() * len(target)
            train_corrects += torch.sum(pre_lab == target.data)
            train_num += len(target)
        ## Loss and accuracy on the training set for this epoch
        train_loss_all.append(train_loss / train_num)
        train_acc_all.append(train_corrects.double().item()/train_num)
        print('{} Train Loss: {:.4f}  Train Acc: {:.4f}'.format(
            epoch, train_loss_all[-1], train_acc_all[-1]))
        
        ## Loss and accuracy on the validation set after this epoch of training
        model.eval() ## put the model in evaluation mode
        with torch.no_grad(): ## no gradients are needed for validation
            for step,batch in enumerate(valdataloader):
                textdata,target = batch.cutword[0],batch.labelcode.view(-1)
                out = model(textdata)
                pre_lab = torch.argmax(out,1)
                loss = criterion(out, target)
                val_loss += loss.item() * len(target)
                val_corrects += torch.sum(pre_lab == target.data)
                val_num += len(target)
        ## Loss and accuracy on the validation set for this epoch
        val_loss_all.append(val_loss / val_num)
        val_acc_all.append(val_corrects.double().item()/val_num)
        print('{} Val Loss: {:.4f}  Val Acc: {:.4f}'.format(
            epoch, val_loss_all[-1], val_acc_all[-1]))
    train_process = pd.DataFrame(
        data={"epoch":range(num_epochs),
              "train_loss_all":train_loss_all,
              "train_acc_all":train_acc_all,
              "val_loss_all":val_loss_all,
              "val_acc_all":val_acc_all})  
    return model,train_process
# Define the optimizer
optimizer = torch.optim.Adam(lstmmodel.parameters(), lr=0.0003)
loss_func = nn.CrossEntropyLoss()   # loss function
## Train the model for num_epochs passes over the training data
lstmmodel,train_process = train_model2(
    lstmmodel,train_iter,val_iter,loss_func,optimizer,num_epochs=20)
## Save the trained model
torch.save(lstmmodel,"data/chap7/lstmmodel.pkl")
## Load the saved model
lstmmodel = torch.load("data/chap7/lstmmodel.pkl")
lstmmodel
## Save the training history
train_process.to_csv("data/chap7/lstmmodel_process.csv",index=False)
train_process
## Plot the loss and accuracy over the training process
plt.figure(figsize=(18,6))
plt.subplot(1,2,1)
plt.plot(train_process.epoch,train_process.train_loss_all,
         "r.-",label = "Train loss")
plt.plot(train_process.epoch,train_process.val_loss_all,
         "bs-",label = "Val loss")
plt.legend()
plt.xlabel("Epoch number",size = 13)
plt.ylabel("Loss value",size = 13)
plt.subplot(1,2,2)
plt.plot(train_process.epoch,train_process.train_acc_all,
         "r.-",label = "Train acc")
plt.plot(train_process.epoch,train_process.val_acc_all,
         "bs-",label = "Val acc")
plt.xlabel("Epoch number",size = 13)
plt.ylabel("Acc",size = 13)
plt.legend()
plt.show()
## Predict on the test set and compute the accuracy
lstmmodel.eval() ## put the model in evaluation mode
test_y_all = torch.LongTensor()
pre_lab_all = torch.LongTensor()
with torch.no_grad(): ## no gradients are needed for prediction
    for step,batch in enumerate(test_iter):
        textdata,target = batch.cutword[0],batch.labelcode.view(-1)
        out = lstmmodel(textdata)
        pre_lab = torch.argmax(out,1)
        test_y_all = torch.cat((test_y_all,target)) ## true labels of the test set
        pre_lab_all = torch.cat((pre_lab_all,pre_lab)) ## predicted labels of the test set

acc = accuracy_score(test_y_all,pre_lab_all)
print("Accuracy on the test set:",acc)
## Compute and plot the confusion matrix
class_label = ["体育","娱乐","家居","房产","教育",
               "时尚","时政","游戏","科技","财经"]
conf_mat = confusion_matrix(test_y_all,pre_lab_all)
df_cm = pd.DataFrame(conf_mat, index=class_label, columns=class_label)
plt.figure(figsize=(10,7))
heatmap = sns.heatmap(df_cm, annot=True, fmt="d",cmap="YlGnBu")
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0,
                             ha='right',fontproperties = fonts)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45,
                             ha='right',fontproperties = fonts)
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
