About word vectors
There are three storage formats:
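- txt
Plain-text format; each line looks like: word 0.001233 0.34219 …
- bin
Google's serialized format (binary mode).
- mmap
Memory-mapped (shared-memory) mode. In a word, it is fast: the file is mapped into memory rather than parsed, so loading is quick.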
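To make the difference between the txt and mmap formats concrete, here is a minimal sketch that uses only the Python standard library. It is not how gensim stores its files; the helper names (parse_txt_vectors, write_flat_binary, read_row_mmap), the flat float32 layout, and the sample values are all assumptions made for illustration. It also assumes the text file has no header line. The point is that a memory-mapped file lets you decode a single row without reading or parsing the rest.

```python
import mmap
import os
import struct
import tempfile


def parse_txt_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a {word: [floats]} dict.

    Assumption: no 'vocab_size dim' header line is present.
    """
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def write_flat_binary(vectors, path):
    """Write every vector as a consecutive little-endian float32 row.

    Returns the word order, which doubles as the row index.
    """
    words = list(vectors)
    with open(path, "wb") as f:
        for w in words:
            vec = vectors[w]
            f.write(struct.pack(f"<{len(vec)}f", *vec))
    return words


def read_row_mmap(path, row, dim):
    """Map the file read-only and decode just one row of `dim` floats."""
    row_bytes = 4 * dim  # float32 is 4 bytes
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            chunk = mm[row * row_bytes:(row + 1) * row_bytes]
    return list(struct.unpack(f"<{dim}f", chunk))


# Hypothetical sample data in the txt format described above.
sample = ["机器学习 0.5 0.25 -1.0", "词向量 1.5 2.0 0.125"]
vectors = parse_txt_vectors(sample)

path = os.path.join(tempfile.mkdtemp(), "vectors.f32")
words = write_flat_binary(vectors, path)

# Only the requested row is decoded; the rest of the file is never parsed.
row = read_row_mmap(path, words.index("词向量"), dim=3)
print(row)
```

The text format has to be split and converted to floats line by line, while the mapped binary file is addressed by offset, which is the main reason mapped loading feels instant on large vocabularies.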
Loading methods
Convert the bin format to mmap, or convert txt to mmap (load with binary=False).
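import time

from gensim.models import KeyedVectors

word = '机器学习'

def bin2mmap():
    # Load the vectors from Google's binary word2vec format.
    word2vec_model = KeyedVectors.load_word2vec_format(word2vec_model_path_bin, binary=True, unicode_errors='ignore')
    print(word2vec_model.get_vector(word))
    print(word2vec_model.similar_by_word(word, 10))
    # Time 99 similarity queries before normalization.
    t0 = time.time()
    for i in range(1, 100):
        word2vec_model.similar_by_word(word)
    t1 = time.time()
    print("timeusage-word2vec:", (t1 - t0))
    # Normalize the vectors in place (init_sims is deprecated in gensim 4.x).
    word2vec_model.init_sims(replace=True)
    print(word2vec_model.get_vector(word))
    print(word2vec_model.similar_by_word(word, 10))
    # Time the same queries again after normalization.
    t0 = time.time()
    for i in range(1, 100):
        word2vec_model.sim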