1. How to handle large (>2GB) embedding lookup tables in TensorFlow
import tensorflow as tf

embedding_weights = tf.Variable(tf.constant(0.0, shape=[embedding_vocab_size, EMBEDDING_DIM]), trainable=False, name='embedding_weights')
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Fails: assign() bakes embedding_matrix into the graph as a constant tensor proto
sess.run(embedding_weights.assign(embedding_matrix))
Error: Cannot create a tensor proto whose content is larger than 2GB.
At the time I was using fastText word vectors as the weights of the embedding layer, and the fastText word-vector model was over 4 GB.
Here is the workaround: feed the matrix in through a placeholder instead of assigning it directly, so it is never serialized into the graph as a constant:
import tensorflow as tf

embedding_weights = tf.Variable(tf.constant(0.0, shape=[embedding_vocab_size, EMBEDDING_DIM]), trainable=False, name='embedding_weights')
embedding_placeholder = tf.placeholder(tf.float32, [embedding_vocab_size, EMBEDDING_DIM])
embedding_init = embedding_weights.assign(embedding_placeholder)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# The large matrix goes in through feed_dict, so no oversized constant ends up in the graph proto
sess.run(embedding_init, feed_dict={embedding_placeholder: embedding_matrix})
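The snippets above assume embedding_matrix is already a NumPy array of shape [embedding_vocab_size, EMBEDDING_DIM]. As a minimal sketch of how it might be built from fastText's text-format .vec vectors (the file name and the plain-text word2vec layout are assumptions, not part of the original post):

import numpy as np

def load_fasttext_vec(path):
    # Assumed .vec layout: a 'num_words dim' header line, then one word and its vector per line
    with open(path, encoding='utf-8') as f:
        num_words, dim = map(int, f.readline().split())
        words = []
        vectors = np.zeros((num_words, dim), dtype=np.float32)
        for i, line in enumerate(f):
            parts = line.rstrip().split(' ')
            words.append(parts[0])
            vectors[i] = np.asarray(parts[1:], dtype=np.float32)
    return words, vectors

words, embedding_matrix = load_fasttext_vec('cc.zh.300.vec')   # hypothetical file name
word_to_index = {w: i for i, w in enumerate(words)}            # word -> row index, used for lookups below
embedding_vocab_size, EMBEDDING_DIM = embedding_matrix.shape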
The embedding_weights variable can then be used to perform lookups (assuming a word-to-index mapping is kept so that tokens can be turned into row indices).
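For example, a minimal lookup sketch continuing the code above (the word_ids placeholder, the sample tokens, and the word_to_index mapping from the loading sketch are illustrative assumptions):

word_ids = tf.placeholder(tf.int32, shape=[None])                # row indices into the embedding table
embedded = tf.nn.embedding_lookup(embedding_weights, word_ids)   # shape [num_tokens, EMBEDDING_DIM]
ids = [word_to_index.get(w, 0) for w in ['这', '是', '测试']]
print(sess.run(embedded, feed_dict={word_ids: ids}).shape)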
https://qa.1r1g.com/sf/ask/3269905411/
2. Building a vocabulary with TensorFlow's VocabularyProcessor
from jieba import cut
from tensorflow.contrib import learn
import numpy as np

DOCUMENTS = [
    '这是一条测试1',
    '这是一条测试2',
    '这是一条测试3',
    '这是其他测试',
]

def chinese_tokenizer(docs):
    # jieba segments each Chinese document into a list of tokens
    for doc in docs:
        yield list(cut(doc))

# max_document_length=10, min_frequency=0
vocab = learn.preprocessing.VocabularyProcessor(10, 0, tokenizer_fn=chinese_tokenizer)
x = list(vocab.fit_transform(DOCUMENTS))
print(np.array(x))
Issue: when calling vocab.transform(new_documents), the size of the vocab.vocabulary_ vocabulary keeps changing because unseen words keep getting added; call vocab.vocabulary_.freeze() to freeze the vocabulary so that no new words are added.
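A minimal sketch of freezing before transforming new documents (the new document string is an illustrative assumption):

# After freezing, transform() maps unseen tokens to index 0 instead of adding them
vocab.vocabulary_.freeze()
NEW_DOCUMENTS = ['这是一条新的测试']
x_new = np.array(list(vocab.transform(NEW_DOCUMENTS)))
print(x_new)
print(len(vocab.vocabulary_))   # vocabulary size stays constant now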