import torchtext
from torchtext.vocab import Vectors
import torch
from torch import nn
import numpy as np
import random
import jieba
# Fix all RNG seeds for reproducibility across random, NumPy, and PyTorch.
SEED = 53113
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# Select GPU when available; seed its RNG too so CUDA ops are reproducible.
use_cuda = torch.cuda.is_available()
if use_cuda:
    torch.cuda.manual_seed(SEED)
device = torch.device('cuda' if use_cuda else 'cpu')
# 1. 读入原始文档和停用词txt文件
#    (Read the raw document and the stop-word txt files.)
# Load the raw novel text, one stripped line per element.
# Use an explicit UTF-8 encoding (the file is Chinese text) so reading does
# not depend on the platform default; the `with` block closes the file, so
# no manual close() is needed.
with open('./mdzs.txt', encoding='utf-8') as f:
    text = [line.strip() for line in f]

# Load the stop-word list, one stripped word per line.
with open('./stop_words.txt', encoding='utf-8') as f:
    stop_words = [line.strip() for line in f]

# Corpus-specific noise tokens not covered by the generic stop-word file.
stop_word = [' ', 'PS', '1V1', 'HE', '┃', 'O', '∩', '☆']
stop_words.extend(stop_word)
# Notebook output preview (kept as comments so the file stays valid Python):
# >>> text[:10]
# ['',
#  '《魔道祖师[重生]》作者:墨香铜臭',
#  '',
#  '文案:',
#  '前世的魏无羡万人唾骂,声名狼藉。',
#  '被护持一生的师弟带人端了老巢,',
#  '纵横一世,死无全尸。',
#  '',
#  '曾掀起腥风血雨的一代魔道祖师,重生成了一个……',
#  '脑残。']
# >>> stop_words[:10]
# ['', '为止', '纵然', 'all', '例如', '[④e]', 'when', '亦', '来讲', '谁料']
# 2. 分词处理 (Tokenization.)
# Tokenize every line with jieba and drop stop words.
# Membership tests run once per token over a ~thousand-entry stop list, so
# build a set first: O(1) lookup instead of an O(n) list scan per token.
_stop_set = set(stop_words)
text_token = [word
              for sentence in text
              for word in jieba.lcut(sentence)
              if word not in _stop_set]

# Space-joined corpus string, written out for torchtext to consume.
a = ' '.join(text_token)
with open('cql.txt', 'w', encoding='utf-8') as f:
    f.write(a)
# 3. 建立字典和迭代器 (Build the vocabulary and iterators.)
# Field with default settings (whitespace tokenization, adds <unk>/<pad>).
# NOTE(review): this is the legacy torchtext API (pre-0.9);
# torchtext.data.Field was removed in later releases — confirm the pinned
# torchtext version before upgrading.
field = torchtext.data.Field()
# splits() returns a tuple even when only a train file is given; take [0].
train = torchtext.datasets.LanguageModelingDataset.splits(path='./',train="cql.txt",text_field=field)[0]
field.build_vocab(train, max_size