Unstructured data in NLP can be a real headache: after you've collected a pile of text, how exactly does it turn into the vectors that finally get fed to the network? And what do time_step and batch_size actually look like?
Let's set aside the vectorized batch_size and time_step for a moment. Suppose we have a handful of sentences like these:
sentences = ["i love you", "he loves me", "she likes baseball", "i hate you", "sorry for that"]
There are five sentences in total. Say I want to do sentiment classification. Since the dataset is tiny, I'll treat all five sentences as a single batch, and since every sentence has exactly three words, time_step is 3.
The input then looks like this:
batch_size = 5
time_step1: i he she i sorry
time_step2: love love likes hate for
time_step3: you me baseball you that
Then, when the RNN computes the loss, at each time step it averages the loss over the batch_size samples (though sometimes the loss is computed only at the last time step).
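To make the two reduction schemes concrete, here is a small sketch with NumPy. The per-sample loss values are made up purely for illustration:

```python
import numpy as np

# Hypothetical per-sample losses for batch_size=5 over 3 time steps.
# Row t holds the loss of each of the 5 samples at time step t.
losses = np.array([[0.9, 1.2, 0.8, 1.1, 1.0],   # time_step1
                   [0.7, 0.9, 0.6, 0.8, 0.7],   # time_step2
                   [0.5, 0.6, 0.4, 0.5, 0.6]])  # time_step3

# Scheme 1: average over the batch at every time step
per_step_mean = losses.mean(axis=1)

# Scheme 2: only look at the final time step
final_step_only = losses[-1].mean()
```

In practice a sequence-labeling task would use something like `per_step_mean` (often further averaged over time), while a whole-sequence classifier typically uses only the final step.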
Now let's turn this into something the computer can understand. First we need a vocabulary:
word_list = " ".join(sentences).split()
word_list = sorted(set(word_list))   # sorted, so the indices are reproducible
word_dict = {w: i for i, w in enumerate(word_list)}
vocab_size = len(word_dict)
>>> word_list
['baseball', 'for', 'hate', 'he', 'i', 'likes', 'love', 'loves', 'me', 'she', 'sorry', 'that', 'you']
>>> word_dict
{'baseball': 0, 'for': 1, 'hate': 2, 'he': 3, 'i': 4, 'likes': 5, 'love': 6, 'loves': 7, 'me': 8, 'she': 9, 'sorry': 10, 'that': 11, 'you': 12}
>>> vocab_size
13
Then we turn every word into a one-hot vector of dimension 13.
For example, time_step1 looks like this:
array([[0,0,0,0,1,0,0,0,0,0,0,0,0],  # i
       [0,0,0,1,0,0,0,0,0,0,0,0,0],  # he
       [0,0,0,0,0,0,0,0,0,1,0,0,0],  # she
       [0,0,0,0,1,0,0,0,0,0,0,0,0],  # i
       [0,0,0,0,0,0,0,0,0,0,1,0,0]]) # sorry
This is exactly what gets fed to the network at time_step1 with batch_size = 5.
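Rather than writing that matrix out by hand, the whole input tensor can be built programmatically. A self-contained sketch (using `sorted` on the vocabulary so the word indices are stable across runs):

```python
import numpy as np

sentences = ["i love you", "he loves me", "she likes baseball",
             "i hate you", "sorry for that"]

word_list = sorted(set(" ".join(sentences).split()))
word_dict = {w: i for i, w in enumerate(word_list)}
vocab_size = len(word_dict)

tokens = [s.split() for s in sentences]
batch_size, time_step = len(tokens), len(tokens[0])

# inputs[t] is the (batch_size, vocab_size) one-hot matrix
# that the RNN sees at time step t
inputs = np.zeros((time_step, batch_size, vocab_size))
for b, words in enumerate(tokens):
    for t, w in enumerate(words):
        inputs[t, b, word_dict[w]] = 1

print(inputs.shape)  # (3, 5, 13): (time_step, batch_size, vocab_size)
print(inputs[0])     # the one-hot matrix for time_step1
```

The `(time_step, batch_size, vocab_size)` layout matches the time-major view used above; many frameworks also accept the batch-major layout `(batch_size, time_step, vocab_size)`, which is just a transpose of the first two axes.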