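TensorFlow's built-in class tf.contrib.learn.preprocessing.VocabularyProcessor(max_document_length, min_frequency=0, vocabulary=None, tokenizer_fn=None) returns an object that converts the words of each document into integer indices. Here max_document_length is the length of every document after conversion: longer documents are truncated and shorter ones are padded with 0. min_frequency (default 0) is the minimum number of times a word must occur in the corpus to be kept in the vocabulary.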
import numpy as np
from tensorflow.contrib import learn
texts = ['go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat',
'ok lar joking wif u oni',
'free entry in a wkly comp to win fa cup final tkts st may text fa to to receive entry questionstd txt ratetcs apply overs',
'u dun say so early hor u c already then say',
'nah i dont think he goes to usf he lives around here though']
texts2 = texts[0:5]
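# Every document is padded or truncated to 20 word ids.
vocab_processor = learn.preprocessing.VocabularyProcessor(20, min_frequency=1)
# Convert each text to a row of word ids and stack the rows into a 2-D array.
transformed_texts = np.array([x for x in vocab_processor.transform(texts)])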
print(transformed_texts)
## Output:
[[ 1 2 3 ... 18 19 20]
[ 21 22 23 ... 0 0 0]
[ 27 28 8 ... 32 41 28]
...
[7687 302 8 ... 0 0 0]
[ 128 3066 205 ... 166 68 54]
[3173 64 1156 ... 0 0 0]]
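
To make the word-to-index mapping concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: the helper names fit_vocabulary and transform are hypothetical, tokenization is assumed to be a plain whitespace split, ids are assumed to be handed out in order of first appearance, and min_frequency filtering is left out. It may still be useful on TensorFlow 2.x, where tf.contrib is no longer available.

```python
def fit_vocabulary(texts):
    """Give each distinct word an id starting at 1, in order of first appearance.

    Hypothetical helper: id 0 is reserved for padding and unknown words.
    """
    vocab = {}
    for text in texts:
        for word in text.split():  # assumption: whitespace tokenization
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def transform(texts, vocab, max_document_length):
    """Turn each text into a fixed-length list of ids."""
    rows = []
    for text in texts:
        ids = [vocab.get(word, 0) for word in text.split()]  # unknown word -> 0
        ids = ids[:max_document_length]                      # truncate long texts
        ids += [0] * (max_document_length - len(ids))        # pad short texts with 0
        rows.append(ids)
    return rows


sample = ['ok lar joking wif u oni',
          'u dun say so early hor u c already then say']
vocab = fit_vocabulary(sample)
matrix = transform(sample, vocab, max_document_length=8)
print(matrix)
# [[1, 2, 3, 4, 5, 6, 0, 0], [5, 7, 8, 9, 10, 11, 5, 12]]
```

The first text has only six words, so its row ends in two zeros. The second has eleven words, so it is cut off after the eighth.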