Coursera course notes: Natural Language Processing in TensorFlow
Word encoding (tokenization) represents each word in a sentence as a number, for example:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
'i love my dog',
'I love my cat',
]
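# num_words caps the vocabulary used when texts are encoded; word_index still lists every word seen
tokenizer = Tokenizer(num_words=100)
# Build the vocabulary from the sentences
tokenizer.fit_on_texts(sentences)
# word_index maps each word to its integer index
word_index = tokenizer.word_index
print(word_index)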
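Output:
{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
Note that the sentences contain both a lowercase i and an uppercase I. The Tokenizer lowercases text by default, so both are mapped to the single entry i.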
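The effect of lowercasing on the vocabulary can be sketched in plain Python. This is only an illustration under stated assumptions, not the Keras implementation: `build_vocab` and its `lower` flag are hypothetical names, and indices are assigned in first-seen order, which happens to match the output above for these two sentences.

```python
def build_vocab(sentences, lower=True):
    """Hypothetical helper: map each distinct word to an index in first-seen order."""
    vocab = {}
    for sentence in sentences:
        text = sentence.lower() if lower else sentence
        for word in text.split():
            # Assign the next free index the first time a word appears.
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

sentences = ['i love my dog', 'I love my cat']

folded = build_vocab(sentences)                 # 'i' and 'I' share one entry
unfolded = build_vocab(sentences, lower=False)  # 'i' and 'I' are separate entries

print(folded)
print(len(folded), len(unfolded))
```

With case folding the vocabulary has five entries; without it, i and I are counted as two different words and the vocabulary grows to six.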
Now add one more sentence, 'You love my dog'. The words love, my, and dog already exist, so only one new word is added: You.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
'I love my dog',
'I love my cat',
'You love my dog'
]
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
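Output:
{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
The indices differ from the first run because the Tokenizer numbers words by descending frequency: love and my now appear three times each, i and dog twice, and cat and you once.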
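A minimal sketch, in plain Python rather than Keras itself, of how an index ranked by word frequency could be built. `build_word_index` is a hypothetical helper introduced for illustration; it only lowercases and splits on whitespace, and it relies on `Counter.most_common()` keeping first-seen order for words with equal counts.

```python
from collections import Counter

def build_word_index(sentences):
    """Hypothetical helper: rank words by descending frequency, starting at 1."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and split on whitespace, then count each word.
        counts.update(sentence.lower().split())
    # most_common() sorts by count, highest first; ties keep first-seen order.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

two = build_word_index(['i love my dog', 'I love my cat'])
three = build_word_index(['I love my dog', 'I love my cat', 'You love my dog'])

print(two)
print(three)
```

For these inputs the sketch produces the same two dictionaries as the outputs shown in these notes, which makes it easy to see that adding a sentence can reorder existing indices.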