[Deep Learning] Recurrent Neural Networks (RNN): Derivation and Implementation

Based on wildml's blog, this article walks through the RNN language model in detail: data preprocessing, network structure, parameter initialization, forward propagation, the loss function, backpropagation through time (BPTT), the vanishing-gradient problem, and the LSTM solution. It provides a pure-Python RNN implementation, including gradient checking and training with stochastic gradient descent (SGD).

This article mainly follows wildml's blog. All of the code is plain Python, with no deep learning framework, which makes it a good aid for understanding how RNNs work.

1. Language Model

If a sentence consists of m words w_1, ..., w_m, the probability of generating the sentence is, by the chain rule:

P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})

That is, each word is generated conditioned only on the words that precede it in the sentence. For example, the probability of generating "How are you" factorizes as:

P(How are you) = P(How) * P(are | How) * P(you | How, are)
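To make the factorization concrete, here is a minimal sketch that scores a sentence by multiplying per-word conditional probabilities. The probability values are made up purely for illustration; they are not the output of any trained model.

# Chain-rule sentence scoring with made-up conditional probabilities.
# Each entry is keyed by the prefix ending at the word being scored.
cond_probs = {
    ("How",): 0.05,                # P(How)
    ("How", "are"): 0.30,          # P(are | How)
    ("How", "are", "you"): 0.60,   # P(you | How, are)
}

def sentence_probability(words):
    p = 1.0
    for i in range(len(words)):
        # Multiply in P(w_i | w_1, ..., w_{i-1})
        p *= cond_probs[tuple(words[:i + 1])]
    return p

print(sentence_probability(["How", "are", "you"]))  # 0.05 * 0.30 * 0.60 = 0.009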

2. Data Preprocessing

Preprocessing the corpus drops low-frequency words to keep the vocabulary size under control. Here we keep the 8000 most frequent words and replace all other words with a single placeholder token (UNKNOWN_TOKEN); after preprocessing, every word is mapped to an integer index. To let the model learn which words tend to start and end sentences, two special tokens, SENTENCE_START and SENTENCE_END, are prepended and appended to each sentence. The code is as follows:

import csv
import itertools
import nltk

vocabulary_size = 8000
unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

# Read the data and append SENTENCE_START and SENTENCE_END tokens
print("Reading CSV file...")
with open('data/reddit-comments-2015-08.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f, skipinitialspace=True)
    next(reader)  # skip the CSV header row
    # Split full comments into sentences
    sentences = itertools.chain(*[nltk.sent_tokenize(x[0].lower()) for x in reader])
    # Append SENTENCE_START and SENTENCE_END
    sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]
print("Parsed %d sentences." % len(sentences))

# Tokenize the sentences into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print("Found %d unique word tokens." % len(word_freq.items()))

# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size - 1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w, i) for i, w in enumerate(index_to_word)])

print("Using vocabulary size %d." % vocabulary_size)
print("The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1]))

# Replace all words not in our vocabulary with the unknown token
for i, sent in enumerate(tokenized_sentences):
    tokenized_sentences[i] = [w if w in word_to_index else unknown_token for w in sent]

print("\nExample sentence: '%s'" % sentences[0])
print("\nExample sentence after pre-processing: '%s'" % tokenized_sentences[0])
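With the vocabulary in place, each token can be mapped to its index, and training pairs are formed by shifting each sentence by one position: the input x is the sentence without its last token, and the target y is the same sentence without its first token, so that at every position the model learns to predict the next word. The sketch below follows the corresponding step in the wildml post; dtype=object is used because sentences have different lengths.

import numpy as np

# Build the training data: x drops the final token, y drops the first,
# so y[t] is always the word that should follow x[t].
X_train = np.asarray([[word_to_index[w] for w in sent[:-1]]
                      for sent in tokenized_sentences], dtype=object)
y_train = np.asarray([[word_to_index[w] for w in sent[1:]]
                      for sent in tokenized_sentences], dtype=object)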