Training a character-level language model with LSTM and GRU

import torch
import torch.nn as nn
import torch.utils.data as Data
import torch.autograd as autograd
import torch.optim as optim
import numpy as np
from torch.autograd import Variable

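# Read the file: poems are separated by blank lines
poetrys = []
poetry = ''
with open("poetryFromTang.txt", encoding='utf-8') as f:
    next(f)  # skip the first line of the file
    for line in f:
        if len(line) != 1:
            # Non-blank line: append it to the current poem
            poetry += line.strip('\n')
        else:
            # Blank line: the current poem is complete
            poetrys.append(poetry)
            poetry = ''

# Keep the final poem when the file does not end with a blank line
if poetry:
    poetrys.append(poetry)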

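# Build the corpus: concatenate all poems into one string
all_word = ''.join(poetrys)

# Strip punctuation (ASCII and full-width commas, and the Chinese full stop)
all_word = all_word.replace(',', '').replace(',', '').replace('。', '')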

# Count character frequencies
word_dict = {}

for word in all_word:
    if word not in word_dict:
        word_dict[word] = 1
    else:
        word_dict[word] += 1

# Sort characters by frequency, most frequent first
word_sort = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)
words, _ = zip(*word_sort)

# Build the lookup tables: character -> id and id -> character
word_to_token = {word: idx for idx, word in enumerate(words)}
token_to_word = dict(enumerate(words))

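# Convert a character sequence into a sequence of ids
def transword(char_list):
    # Characters missing from the vocabulary fall back to the last (least frequent) id
    ids = [word_to_token.get(char, len(words) - 1) for char in char_list]
    return ids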
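The listing stops before the model itself, so here is a minimal sketch of what a character-level LSTM/GRU language model could look like in PyTorch. This is not the author's code: the class name `CharRNN`, the helper `train_step`, and the default layer sizes are all assumptions chosen for illustration. The idea is the usual one: embed each character id, run the sequence through an LSTM or GRU, and predict the next character at every position.

```python
import torch
import torch.nn as nn


class CharRNN(nn.Module):
    """Hypothetical character-level LM: embedding -> LSTM or GRU -> linear."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, rnn_type="lstm"):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Pick the recurrent layer; both share the same constructor arguments
        rnn_cls = nn.LSTM if rnn_type == "lstm" else nn.GRU
        self.rnn = rnn_cls(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, hidden=None):
        # token_ids: (batch, seq_len) tensor of integer character ids
        embedded = self.embedding(token_ids)
        output, hidden = self.rnn(embedded, hidden)
        logits = self.fc(output)  # (batch, seq_len, vocab_size)
        return logits, hidden


def train_step(model, optimizer, batch):
    """Run one next-character prediction step on a (batch, seq_len) id tensor."""
    # Inputs are all characters but the last; targets are shifted by one
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits, _ = model(inputs)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With the vocabulary built above, one would presumably pass `vocab_size=len(words)` and feed batches of ids produced by `transword`. Switching `rnn_type` between `"lstm"` and `"gru"` leaves the rest of the training code unchanged, which makes it easy to compare the two.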