Table of contents:
Deep Learning Language Models (1) - The development of word2vec
Deep Learning Language Models (2) - Word vectors and the neural probabilistic language model (Keras version)
Deep Learning Language Models (3) - The word2vec Negative Sampling model (Keras version)
1. The neural probabilistic language model (2003); its steps are as follows:
(1) Input layer: each word is initially represented by a random 100-dimensional vector.
(2) Projection layer: the context words are concatenated; for example, with a window of 3 on each side there are 6 context words, giving a tensor of shape (batch_size, 6, 100).
(3) Hidden layer: an ordinary fully connected layer, e.g. one with a (100, 1024) weight matrix mapping each 100-dimensional vector to 1024 hidden units.
(4) Output layer: a softmax classifier whose classes are all word ids; with a vocabulary of 50,000 words, the output layer weight matrix is (1024, 50000).
During backpropagation, not only the output-layer and hidden-layer weights are updated, but also the x in the projection layer, i.e. the 100-dimensional vector of each word.
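To make the four steps concrete, here is a minimal Keras shape sketch using the example sizes above (100-dimensional vectors, 6 context words, 1024 hidden units, a 50,000-word vocabulary). It is an illustration only, with names and sizes chosen by me: it concatenates the context vectors before the hidden layer, as in the classic formulation, so its hidden weight matrix is (600, 1024) rather than (100, 1024); the runnable implementation used in this post appears further below.

# Minimal shape sketch of the 2003 model (illustrative sizes, not the code used below)
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten

VOCAB, DIM, CONTEXT, HIDDEN = 50000, 100, 6, 1024
sketch = Sequential()
# (1)+(2) input + projection: 6 context word ids -> (batch, 6, 100)
sketch.add(Embedding(VOCAB, DIM, input_length=CONTEXT))
# concatenate the 6 context vectors -> (batch, 600)
sketch.add(Flatten())
# (3) hidden layer: weight matrix (600, 1024)
sketch.add(Dense(HIDDEN, activation='tanh'))
# (4) output layer: softmax over all word ids, weight matrix (1024, 50000)
sketch.add(Dense(VOCAB, activation='softmax'))
sketch.summary()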
Advantages:
(1) Even if two sentences occur with very different frequencies, similar sentences are treated almost equally, e.g. "我今天打篮球" (I played basketball today) and "我今天打羽毛球" (I played badminton today). A small difference remains, and it comes from 篮球 (basketball) and 羽毛球 (badminton) occurring different numbers of times: when training on 篮球, the label for 羽毛球 is 0, and when training on 羽毛球, the label for 篮球 is 0, so the vectors are still affected by word frequency. To avoid this frequency effect, you could set the label of 羽毛球 to 1 when training 篮球 and the label of 篮球 to 1 when training 羽毛球, or use the word2vec (Hierarchical Softmax) or word2vec (Negative Sampling) model; with sufficient training the gap shrinks.
Disadvantages:
(1) If the vocabulary is very large, the output layer has too many weight parameters, which can exhaust memory.
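A rough back-of-the-envelope calculation shows the scale of the problem; the sizes here simply reuse the example dimensions above and assume float32 weights.

# Output-layer weights alone, with 1024 hidden units and a 50,000-word vocabulary
hidden, vocab = 1024, 50000
params = hidden * vocab                        # 51,200,000 weights
print("%.0f MB" % (params * 4 / 1024 / 1024))  # float32 -> about 195 MB
# With a 500,000-word vocabulary this grows to roughly 2 GB, before gradients
# and optimizer state are counted.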
# coding=utf-8
'''
Created on 2018-09-15
@author: admin
'''
from gensim import corpora, models, similarities
import numpy as np

if __name__ == '__main__':
    text = [["我","今天","打","篮球"],
            ["我","今天","打","足球"],
            ["我","今天","打","羽毛球"],
            ["我","今天","打","网球"],
            ["我","今天","打","排球"],
            ["我","今天","打","气球"],
            ["我","今天","打","游戏"],
            ["我","今天","打","冰球"],
            ["我","今天","打","人"],
            ["我","今天","打","台球"],
            ["我","今天","打","桌球"],
            ["我","今天","打","水"],
            ["我","今天","打","篮球"],
            ["我","今天","打","足球"],
            ["我","今天","打","羽毛球"],
            ["我","今天","打","网球"],
            ["我","今天","打","排球"],
            ["我","今天","打","气球"],
            ]
    # Build the dictionary with gensim
    dictionary = corpora.Dictionary(text, prune_at=2000000)
    # Print every word with its id and document frequency
    for key in dictionary.keys():
        print(key, dictionary.get(key), dictionary.dfs[key])
    # Save the dictionary
    dictionary.save_as_text('word_dict.dict', sort_by_word=True)
    # Load the dictionary
    dictionary = corpora.Dictionary.load_from_text('word_dict.dict')
    # Vocabulary size
    word_num = len(dictionary.keys())
    # Number of sentences used to build each batch
    sentence_batch_size = 1
    # Sliding-window size: number of context words on each side of the target
    window = 3

    # Generate CBOW training data
    def data_generator():  # training-data generator
        while True:
            x, y = [], []
            _ = 0
            for sentence in text:
                # Pad both ends with a placeholder id; word_num is safe because real ids run from 0 to word_num-1
                sentence = [word_num]*window + [dictionary.token2id[w] for w in sentence if w in dictionary.token2id] + [word_num]*window
                for i in range(window, len(sentence)-window):
                    x.append(sentence[i-window:i] + sentence[i+1:i+1+window])
                    # The loss is sparse_categorical_crossentropy, so labels stay as ids (no one-hot needed)
                    y.append([sentence[i]])
                _ += 1
                if _ == sentence_batch_size:
                    x, y = np.array(x), np.array(y)
                    print("input batch shape:", x.shape)
                    print("label batch shape:", y.shape)
                    yield x, y
                    x, y = [], []
                    _ = 0
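    # What one yield looks like (illustrative; the actual ids depend on the order
    # gensim assigns them): for the first sentence 我/今天/打/篮球, padded to
    # [P, P, P, 我, 今天, 打, 篮球, P, P, P] with P = word_num, the first sample is
    # x = [P, P, P, 今天, 打, 篮球] with label y = [我]. With sentence_batch_size = 1
    # each batch holds the 4 samples of one sentence, so x has shape (4, 6) and
    # y has shape (4, 1).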
    from keras.models import Sequential
    from keras.layers import Dense, Embedding, Flatten

    model = Sequential()
    # Projection layer: 6 context ids -> (batch, 6, 200); word_num+1 rows so the padding id has a vector too
    model.add(Embedding(word_num+1, 200, input_length=6))
    # Hidden layer, applied to each of the 6 position vectors
    model.add(Dense(1024, activation='relu'))
    model.add(Flatten())
    # Output layer: softmax over all word ids plus the padding id
    model.add(Dense(word_num+1, activation='softmax'))
    model.compile(optimizer='sgd',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.summary()
    model.fit_generator(data_generator(), steps_per_epoch=int(np.ceil(dictionary.num_docs/sentence_batch_size)), epochs=1000, max_queue_size=1, workers=1)
    # Save the model weights
    model.save_weights("DNNword-vec.h5")
    # Load the model weights
    model.load_weights("DNNword-vec.h5", by_name=True)
    # The Embedding weights are the word vectors
    embeddings = model.get_weights()[0]
    # Normalize each word vector to unit L2 norm
    normalized_embeddings = embeddings / (embeddings**2).sum(axis=1).reshape((-1,1))**0.5
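    # Sanity check: every row now has unit length, so the dot product of two rows
    # equals their cosine similarity.
    assert np.allclose((normalized_embeddings**2).sum(axis=1), 1.0)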
    dictionary.id2token = {j: i for i, j in dictionary.token2id.items()}

    # Return the 10 words most similar to w
    def most_similar(w, dictionary):
        v = normalized_embeddings[dictionary.token2id[w]]
        # The vectors are already normalized, so the dot product directly gives cosine similarity
        sims = np.dot(normalized_embeddings, v)
        sort = sims.argsort()[::-1]
        # The padding id (word_num) has no entry in id2token, so it is filtered out here
        return [(dictionary.id2token[i], sims[i]) for i in sort[:10] if i in dictionary.id2token]

    for sim in most_similar(u'网球', dictionary):
        print(sim[0], sim[1])
    # Sample output (exact values vary from run to run):
# 网球 1.0000001
# 羽毛球 0.11263792
# 桌球 0.07463527
# 篮球 0.066648
# 足球 0.06379064
# 台球 0.046809338
# 排球 0.04252596
# 我 0.04014937
# 人 0.028555304
# 水 0.007580313