1. How to handle large (>2GB) embedding lookup tables in TensorFlow
import tensorflow as tf

embedding_weights = tf.Variable(tf.constant(0.0, shape=[embedding_vocab_size, EMBEDDING_DIM]), trainable=False, name='embedding_weights')
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Fails: assign() bakes embedding_matrix into the graph as a constant tensor proto
sess.run(embedding_weights.assign(embedding_matrix))
Error: Cannot create a tensor proto whose content is larger than 2GB.
At the time I was using fastText word vectors as the weights of the embedding layer, and the fastText word-vector model was over 4 GB.
Here is the workaround: feed the matrix in through a placeholder instead of assigning it directly, so it is never serialized into the graph as a constant:
import tensorflow as tf

embedding_weights = tf.Variable(tf.constant(0.0, shape=[embedding_vocab_size, EMBEDDING_DIM]), trainable=False, name='embedding_weights')
embedding_placeholder = tf.placeholder(tf.float32, [embedding_vocab_size, EMBEDDING_DIM])
embedding_init = embedding_weights.assign(embedding_placeholder)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# The large matrix goes in through feed_dict, so no oversized constant ends up in the graph proto
sess.run(embedding_init, feed_dict={embedding_placeholder: embedding_matrix})
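The snippets above assume embedding_matrix is already a NumPy array of shape [embedding_vocab_size, EMBEDDING_DIM]. As a minimal sketch of how it might be built from fastText's text-format .vec vectors (the file name and the plain-text word2vec layout are assumptions, not part of the original post):

import numpy as np

def load_fasttext_vec(path):
    # Assumed .vec layout: a 'num_words dim' header line, then one word and its vector per line
    with open(path, encoding='utf-8') as f:
        num_words, dim = map(int, f.readline().split())
        words = []
        vectors = np.zeros((num_words, dim), dtype=np.float32)
        for i, line in enumerate(f):
            parts = line.rstrip().split(' ')
            words.append(parts[0])
            vectors[i] = np.asarray(parts[1:], dtype=np.float32)
    return words, vectors

words, embedding_matrix = load_fasttext_vec('cc.zh.300.vec')   # hypothetical file name
word_to_index = {w: i for i, w in enumerate(words)}            # word -> row index, used for lookups below
embedding_vocab_size, EMBEDDING_DIM = embedding_matrix.shape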
The embedding_weights variable can then be used to perform lookups (assuming a word-to-index mapping is kept so that tokens can be turned into row indices).
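For example, a minimal lookup sketch continuing the code above (the word_ids placeholder, the sample tokens, and the word_to_index mapping from the loading sketch are illustrative assumptions):

word_ids = tf.placeholder(tf.int32, shape=[None])                # row indices into the embedding table
embedded = tf.nn.embedding_lookup(embedding_weights, word_ids)   # shape [num_tokens, EMBEDDING_DIM]
ids = [word_to_index.get(w, 0) for w in ['这', '是', '测试']]
print(sess.run(embedded, feed_dict={word_ids: ids}).shape)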
https://qa.1r1g.com/sf/ask/3269905411/
2. Building a vocabulary with TensorFlow's VocabularyProcessor
from jieba import cut
from tensorflow.contrib import learn
import numpy as np

DOCUMENTS = [
    '这是一条测试1',
    '这是一条测试2',
    '这是一条测试3',
    '这是其他测试',
]

def chinese_tokenizer(docs):
    # jieba segments each Chinese document into a list of tokens
    for doc in docs:
        yield list(cut(doc))

# max_document_length=10, min_frequency=0
vocab = learn.preprocessing.VocabularyProcessor(10, 0, tokenizer_fn=chinese_tokenizer)
x = list(vocab.fit_transform(DOCUMENTS))
print(np.array(x))
Issue: when calling vocab.transform(new_documents), the size of the vocab.vocabulary_ vocabulary keeps changing because unseen words keep getting added; call vocab.vocabulary_.freeze() to freeze the vocabulary so that no new words are added.
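A minimal sketch of freezing before transforming new documents (the new document string is an illustrative assumption):

# After freezing, transform() maps unseen tokens to index 0 instead of adding them
vocab.vocabulary_.freeze()
NEW_DOCUMENTS = ['这是一条新的测试']
x_new = np.array(list(vocab.transform(NEW_DOCUMENTS)))
print(x_new)
print(len(vocab.vocabulary_))   # vocabulary size stays constant now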