Text Generation with LSTM (Based on word2vec)

Latest recommended article published on 2024-05-27 20:44:31

勤奋的郑先生

Tags: LSTM, WORD2VEC

Article link: https://blog.csdn.net/weixin_41370083/article/details/82847705

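This post shows how to use the Keras framework together with a word2vec model for LSTM text generation. The author uses a biography of Churchill as the corpus, and the goal is to predict the next word from a given sequence of words. During preprocessing, the word2vec numeric representation is reshaped into the input format an LSTM expects, [samples, time steps, features]; the output is a 128-dimensional vector.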

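To make the [samples, time steps, features] layout concrete, here is a minimal sketch in plain Python. It is not the author's code: the helper names `make_toy_embeddings` and `build_windows`, the toy sentence, the window length of 3, and the 4-dimensional random vectors are all assumptions for illustration. A trained word2vec model would supply 128-dimensional vectors instead.

```python
import random

def make_toy_embeddings(vocab, dim, seed=0):
    """Hypothetical stand-in for a trained word2vec model: word -> random vector."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in sorted(vocab)}

def build_windows(tokens, embeddings, seq_length):
    """Slide a window over the text.

    x[i] holds seq_length consecutive word vectors,
    y[i] holds the vector of the word that follows them.
    """
    # Words missing from the embedding table are skipped
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    x, y = [], []
    for i in range(len(vectors) - seq_length):
        x.append(vectors[i:i + seq_length])
        y.append(vectors[i + seq_length])
    return x, y

tokens = "the quick brown fox jumps over the lazy dog".split()
embeddings = make_toy_embeddings(set(tokens), dim=4)  # dim=4 is an assumption
x, y = build_windows(tokens, embeddings, seq_length=3)

# Nested lists in the layout an LSTM layer expects
x_shape = (len(x), len(x[0]), len(x[0][0]))  # (samples, time steps, features)
y_shape = (len(y), len(y[0]))                # (samples, features)
print(x_shape, y_shape)
```

With real data the nested lists would typically be converted to NumPy arrays before being passed to Keras, and the last dimension would match the word2vec vector size.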

Data: a biography of Churchill serves as the training corpus

Framework: Keras

import os
import numpy as np
import nltk
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from gensim.models.word2vec import Word2Vec

# Read every .txt file in the input directory into one string
raw_text=""
for file in os.listdir("../input/"):
    if file.endswith(".txt"):
        raw_text+=open("../input/"+file,errors="ignore").read()+"\n\n"
# Alternative: read a single file
# raw_text=open("../input/Winston_Churchil.txt").read()
raw_text=raw_text.lower()
# Split into sentences with the pretrained Punkt tokenizer, then split each sentence into words
sentensor=nltk.data.load("tokenizers/punkt/english.pickle")
sents=sentensor.tokenize(raw_text)
corpus=[]
for sen in sents:
    corpus.append(nltk.word_tokenize(sen))

print(len(corpus))
print(corpus[:3])
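The first token in the printed result for this code is '\ufeffthe', which suggests that at least one input file starts with a UTF-8 byte order mark (BOM) that survives reading. One possible way to avoid this, sketched here under the assumption that the files are UTF-8 encoded, is to open them with Python's `utf-8-sig` codec. The helper `read_text` and the temporary sample file are hypothetical and not part of the original post.

```python
import os
import tempfile

def read_text(path, encoding="utf-8-sig"):
    """Read a text file; the utf-8-sig codec drops a leading BOM if present."""
    with open(path, encoding=encoding, errors="ignore") as f:
        return f.read()

# Demonstration on a temporary file that begins with a UTF-8 BOM
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(b"\xef\xbb\xbfThe Project Gutenberg EBook")
    with_bom = read_text(path, encoding="utf-8")  # BOM kept as '\ufeff'
    without_bom = read_text(path)                 # BOM stripped

print(repr(with_bom[:4]))
print(repr(without_bom[:4]))
```

Applying this in the reading loop would change the first token of the corpus, so the printed result would no longer match character for character.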


# Output
91007
[['\ufeffthe', 'project', 'gutenberg', 'ebook', 'of', 'great', 'expectations', ',', 'by', 'charles', 'dickens', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.'], ['you', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'www.gutenberg.org', 'title', ':', 'great', 'expectations', 'author', ':', 'charles', 'dickens', 'posting', 'date', ':', 'august', '20', ',', '2008', '[', 'ebook', '#'