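python-pytorch: using word2vec with an LSTM model to predict Chinese text output, 0.1.00
Preface
This version uses pretrained word2vec word embeddings in place of nn.Embedding. The approach still has problems; the most obvious is that prediction keeps looping over the same sentence.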
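As a rough illustration of what "in place of nn.Embedding" can look like, the sketch below loads a pretrained matrix into an embedding layer with `nn.Embedding.from_pretrained` and feeds the result to an LSTM. It is a minimal sketch under stated assumptions, not necessarily how this article wires things up: the random `pretrained` tensor stands in for gensim's `model.wv.vectors`, and the sizes (5 words, 8 dimensions, hidden size 16) and token ids are made up for the example.

```python
import torch
import torch.nn as nn

# Stand-in for model.wv.vectors: 5 words, 8 dimensions (hypothetical sizes).
torch.manual_seed(0)
pretrained = torch.randn(5, 8)

# Build the embedding layer from the pretrained matrix.
# freeze=True keeps the vectors fixed during training.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Hypothetical word indices; with gensim 4 they would be looked up
# in model.wv.key_to_index.
token_ids = torch.tensor([[0, 3, 4]])   # shape: (batch=1, seq_len=3)
embedded = embedding(token_ids)         # shape: (1, 3, 8)

# Feed the embedded sequence to an LSTM (hidden size 16 is arbitrary).
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
output, (h_n, c_n) = lstm(embedded)
```

Passing `freeze=False` instead would let the word2vec vectors be fine-tuned together with the LSTM; which choice suits this task is not something the sketch settles.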
- Swapping in word2vec comes down to two core lines of code:
sentences = LineSentence(dataset_path)
model = word2vec.Word2Vec(sentences, sg=1, window=5, min_count=1, workers=4,epochs=2000)
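- This relies on LineSentence, which places requirements on the text format:
First, the text must already be segmented, with tokens separated by spaces, for example: ZooKeeper 定义 的 存储 目录 不 正确 或 ZooKeeper 的 存储 规划 变化 时
Second, each line must contain exactly one sentence.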
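The two format requirements can be sketched with the standard library alone. The snippet writes token lists to a file, one sentence per line with space-separated tokens, then reads them back by splitting each line on whitespace. The file name `corpus.txt` and the split of the sample sentence into two lines are assumptions made for illustration, and the read-back loop only mimics whitespace splitting; it is not gensim's implementation.

```python
import os
import tempfile

# Already-segmented sentences as token lists (illustrative sample data).
segmented = [
    ["ZooKeeper", "定义", "的", "存储", "目录", "不", "正确"],
    ["ZooKeeper", "的", "存储", "规划", "变化", "时"],
]

# Write one sentence per line, tokens joined by a single space.
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")  # hypothetical file name
with open(path, "w", encoding="utf-8") as f:
    for tokens in segmented:
        f.write(" ".join(tokens) + "\n")

# Read it back line by line: each non-empty line becomes one token list.
with open(path, "r", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]
```

A file produced this way is the kind of input that `LineSentence(path)` expects to iterate over.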
Source data
A single news article: https://news.sina.com.cn/c/2024-04-12/doc-inarqiev0222543.shtml
Imports
import logging
import sys
import gensim.models as word2vec
from gensim.models.word2vec import LineSentence, logger
import jieba
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data
from torch.autograd import Variable
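Load the data, segment it, and write it to a new file
Each line of the source file has to be segmented with jieba and its tokens joined with spaces before word2vec's LineSentence can read it.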
with open("./howtousercbow/data/news.txt","r",encoding="utf-8") as f:
    lines=f.readlines()
    for line in lines:
        # Strip punctuation and the newline, then segment in accurate mode (cut_all=False)
        jiebacutresult=list(jieba.cut(line.replace(",","").replace("。","").replace("\n","").replace(",","").replace("、","").replace("?","").replace(":",""),False))
        sttr=""
        for jb in jieba