3.1.1 Word-based encodings
Question: how do we get a machine to recognize text?
1. Letter-based
If we encode the letters with ASCII codes, a problem appears:
different words can share the same letters (and thus the same codes) yet mean completely different things (e.g. LISTEN vs. SILENT).
2. Word-based
Instead, give every word in two similar sentences its own number.
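The word-based idea can be sketched in pure Python (this is an illustration, not the Keras API): give each distinct word a number, so similar sentences get similar sequences.

```python
def encode(sentences):
    # Assign each distinct word a number in order of first appearance,
    # then replace every word by its number.
    ids = {}
    sequences = []
    for s in sentences:
        seq = []
        for w in s.lower().split():
            if w not in ids:
                ids[w] = len(ids) + 1
            seq.append(ids[w])
        sequences.append(seq)
    return ids, sequences

ids, seqs = encode(['I love my dog', 'I love my cat'])
print(seqs)  # [[1, 2, 3, 4], [1, 2, 3, 5]] -- only the last token differs
```

Because both sentences share "I love my", their sequences agree in the first three positions, which is exactly what letter codes could not capture.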
3.1.2 Using APIs
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!'
]
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
Use the Tokenizer to build the dictionary, encode the words, and create vectors from the sentences.
num_words=100: a hyperparameter giving the number of unique (distinct) words to keep from the text. The Tokenizer ranks the words by frequency (volume) and encodes them in that order. Keeping fewer words can sometimes still train accurately, but training may take a long time.
fit_on_texts(): takes the texts and encodes them.
tokenizer.word_index: the Tokenizer exposes a word_index attribute, which returns a dictionary of key-value pairs where the "key" is the word and the "value" is that word's token.
Output:
The Tokenizer strips case and punctuation, e.g. the '!' in the last sentence; it does not affect the preceding word 'dog', so the index contains only 'dog', never 'dog!'.
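What `fit_on_texts` builds can be approximated in pure Python (a sketch of the behaviour, not the Keras source): lowercase, drop punctuation, count frequencies, then number words by descending frequency starting at 1.

```python
from collections import Counter

def build_word_index(sentences):
    # Approximation of Tokenizer.fit_on_texts: lowercase, strip punctuation,
    # then rank words by descending frequency (ties keep first-seen order,
    # which Python's stable sort preserves).
    counts = Counter()
    for s in sentences:
        for w in s.lower().split():
            counts[w.strip('!?.,')] += 1
    ranked = sorted(counts, key=counts.get, reverse=True)
    return {w: i + 1 for i, w in enumerate(ranked)}

word_index = build_word_index(['I love my dog', 'I love my cat', 'You love my dog!'])
print(word_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

Note that 'love' and 'my' (3 occurrences each) outrank 'i' and 'dog' (2 each), and that 'dog!' is counted together with 'dog'.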
3.1.3 Text to sequences
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
# Returns a list of sequences, one per sentence
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
Output:
Text => word tokens, so the dictionary must stay fixed!
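Why the dictionary must stay fixed is easy to see in a sketch: encoding is just a dictionary lookup, so the same mapping has to be reused for every new text (the `word_index` values here are illustrative, not Keras output).

```python
def to_sequences(sentences, word_index):
    # Look each word up in a fixed dictionary; words the dictionary
    # doesn't contain are simply skipped (the Tokenizer's default).
    return [[word_index[w] for w in s.lower().split() if w in word_index]
            for s in sentences]

word_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4}
print(to_sequences(['I love my dog'], word_index))  # [[4, 2, 1, 3]]
```

If the dictionary changed between training and inference, the same text would map to different tokens and the model's inputs would no longer mean the same thing.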
3.1.4 More on the Tokenizer
In image recognition, the input dimensions must match, i.e. every input image has the same pixel size; otherwise the images must be converted first.
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'I love my dog',
]
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
test_data = [
    'I really love my dog',
    'my dog loves my manatee'
]
test_seq = tokenizer.texts_to_sequences(test_data)
print(word_index)
print('sequences:', sequences)
print('test sequences:', test_seq)
Output:
"I love my dog" => [2, 3, 4, 5]
"I really love my dog" => [2, 1, 3, 4, 5] (the unseen word 'really' maps to the <OOV> token)
oov: out of vocabulary, i.e. a word that is not in the dictionary; the OOV token always takes index 1.
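The OOV behaviour can also be sketched in plain Python (assuming, as above, that the OOV token holds index 1): unknown words fall back to that index instead of being dropped, so the sequence keeps its length.

```python
def to_sequences_oov(sentences, word_index, oov_index=1):
    # Unknown words map to the reserved OOV index instead of disappearing.
    return [[word_index.get(w, oov_index) for w in s.lower().split()]
            for s in sentences]

word_index = {'<OOV>': 1, 'i': 2, 'love': 3, 'my': 4, 'dog': 5}
print(to_sequences_oov(['I really love my dog'], word_index))
# [[2, 1, 3, 4, 5]] -- 'really' becomes the OOV index 1
```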
3.1.5 Padding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
sequences = tokenizer.texts_to_sequences(sentences)
print('sequences:', sequences)
# Padding
padded = pad_sequences(sequences)
print(padded)
Output:
The result is a matrix in which every row has the same length; the empty positions are filled with 0 (prepended by default).
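The default (pre-)padding can be sketched without Keras: every sequence is left-padded with zeros up to the length of the longest one.

```python
def pad_pre(sequences):
    # Mimic pad_sequences' default: prepend zeros up to the longest row,
    # so all rows end up with the same length.
    maxlen = max(len(s) for s in sequences)
    return [[0] * (maxlen - len(s)) + s for s in sequences]

print(pad_pre([[1, 2, 3], [1, 2, 3, 4, 5]]))
# [[0, 0, 1, 2, 3], [1, 2, 3, 4, 5]]
```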
3.1.6 Putting it all together: a fuller example
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print('word index:', word_index)
sequences = tokenizer.texts_to_sequences(sentences)
print('\nsequences:', sequences)
padded = pad_sequences(sequences, maxlen=5)
print('\npadded sequences:\n', padded)
test_data = [
    'I really love my dog',
    'my dog loves my manatee'
]
test_seq = tokenizer.texts_to_sequences(test_data)
print('\ntest sequences:', test_seq)
padded = pad_sequences(test_seq, maxlen=10)
print('\npadded test sequences:\n', padded)
Output:
With maxlen set to 5, each sentence keeps only 5 tokens.
'Do you think my dog is amazing?'
This sentence is too long, so its first 2 words are truncated (Keras truncates from the front by default).
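The `maxlen` behaviour can be sketched in the same plain-Python style (the Keras defaults are `truncating='pre'` and `padding='pre'`):

```python
def pad_and_truncate(sequences, maxlen):
    # Mimic pad_sequences(maxlen=...): drop leading tokens when a row is
    # too long, prepend zeros when it is too short.
    out = []
    for s in sequences:
        s = s[-maxlen:]                          # truncating='pre'
        out.append([0] * (maxlen - len(s)) + s)  # padding='pre'
    return out

# A 7-token sentence loses its first 2 tokens when maxlen=5.
print(pad_and_truncate([[1, 2, 3, 4, 5, 6, 7]], maxlen=5))  # [[3, 4, 5, 6, 7]]
```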
3.1.7 Sarcasm dataset
The Sarcasm dataset comes from: https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json
The JSON file contains 26,709 web articles, each with three keys: the headline ['headline'], the article link ['article_link'], and whether it is sarcastic ['is_sarcastic'].
import json
with open('./tmp/sarcasm.json', 'r') as f:
    data = json.load(f)
sentences = []
labels = []
urls = []
for item in data:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(oov_token='<oov>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print('\nlength of the word index: ', len(word_index))
print('\nexamples of the word index:\n', word_index)
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded[0])
print(padded.shape)
Output:
……
- The word index contains 29,657 words in total; after padding, every encoded sequence has length 40.
-
padded = pad_sequences(sequences, padding='post')
The padding setting padding='post' means the zeros are appended after the sentence; the longest headline is 40 words, so every row is padded to length 40.
-
The final matrix has shape (26709, 40).
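The padding='post' behaviour and the resulting shape can be sketched in plain Python, appending zeros instead of prepending them:

```python
def pad_post(sequences):
    # Mimic pad_sequences(padding='post'): append zeros up to the longest row.
    maxlen = max(len(s) for s in sequences)
    return [s + [0] * (maxlen - len(s)) for s in sequences]

padded = pad_post([[1, 2], [1, 2, 3, 4]])
print(padded)                       # [[1, 2, 0, 0], [1, 2, 3, 4]]
print(len(padded), len(padded[0]))  # shape: 2 rows of length 4
```

For the Sarcasm data, the same logic yields 26,709 rows of length 40, i.e. the (26709, 40) matrix above.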