[Andrew Ng's TensorFlow 2.0 in Practice] 3.1 Natural Language Processing

3.1.1 Word-based encodings

Thinking point: how can a machine recognize text?

1. Letter-based encoding

If we encode the letters with their ASCII codes, the following problem appears:

two different words can consist of exactly the same letters (and therefore the same codes) while meaning completely different things, e.g. 'LISTEN' and 'SILENT'.

2. Word-based encoding

Instead, give every word in a pair of similar sentences its own number; similar sentences then turn into similar sequences of numbers.
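For example, with i=1, love=2, my=3, dog=4, cat=5: 'I love my dog' becomes [1, 2, 3, 4] and 'I love my cat' becomes [1, 2, 3, 5], and the similarity between the two sentences is now visible in the numbers.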

 

3.1.2 Using APIs

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

Using the Tokenizer, we generate a dictionary, encode the words, and create vectors from the sentences.

num_words=100: a hyperparameter, the maximum number of unique, distinct words to keep. The Tokenizer sorts the words by frequency (volume) and encodes the most common ones. Keeping fewer words often has little impact on training accuracy but a big impact on training time.
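One subtlety worth knowing (a minimal sketch of how the Keras Tokenizer behaves): num_words does not shrink word_index itself; the cap is only applied later, when texts_to_sequences encodes text.

from tensorflow.keras.preprocessing.text import Tokenizer

t = Tokenizer(num_words=3)  # only indices 1 and 2 will survive encoding
t.fit_on_texts(['I love my dog', 'I love my cat'])
print(t.word_index)   # full index: {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
print(t.texts_to_sequences(['I love my dog']))   # [[1, 2]] - 'my' and 'dog' are cut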

fit_on_texts(): takes the texts and builds the encoding from them.

tokenizer.word_index: the Tokenizer exposes a word_index property that returns a dictionary of key/value pairs, where the "key" is the word and the "value" is the token (code) assigned to that word.

The result:
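Expected output (assuming the sentences are in the list order above; ties in frequency keep first-appearance order):

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}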

The Tokenizer strips case and punctuation. For example, the '!' in the last sentence does not affect the preceding word 'dog', so the index contains only 'dog' and no 'dog!'.

 

3.1.3 Text to sequence

 

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# returns a list of sequences, one per sentence
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

Output:
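Expected output (assuming the list order above, which yields the word index {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}):

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]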

Text => word tokens, so the dictionary must stay fixed: new text has to be encoded with the exact word index the model was trained on.
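What happens otherwise (a hedged sketch, using the tokenizer fitted above without an oov_token): words the Tokenizer has never seen are silently dropped from the sequence.

# 'really' was never fitted, so it simply disappears from the encoding
print(tokenizer.texts_to_sequences(['i really love my dog']))   # [[4, 2, 1, 3]]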

 

3.1.4 More on the Tokenizer

In image recognition, all inputs had to have the same dimensions, i.e. all input pictures the same pixel size, otherwise they were converted first. The same applies to text: sentences of different lengths need a common representation.

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
]

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

test_data = [
    'I really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)

print(word_index)
print('sequences:', sequences)
print('test sequences:', test_seq)

Output:

"I love my dog" => [2, 3, 4, 5]

"I really love my dog" => [2, 1, 3, 4, 5], where 1 is the token for '<OOV>'

"my dog loves my manatee" => [4, 5, 1, 4, 1], with both 'loves' and 'manatee' mapped to '<OOV>'

oov: out of vocabulary, i.e. a word not in the dictionary. Passing oov_token="<OOV>" makes the Tokenizer reserve index 1 for it and substitute it for every unseen word instead of silently dropping the word.

 

3.1.5 Padding

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

sequences = tokenizer.texts_to_sequences(sentences)
print('sequences:', sequences)

# Padding
padded = pad_sequences(sequences)
print(padded)

Output:

The result is a matrix in which every row has the same length; the empty slots are filled with 0. By default, pad_sequences pads at the front (padding='pre') up to the length of the longest sequence.
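Expected output (assuming the list order above; the longest sentence has 7 words, so every row is padded to length 7):

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
sequences: [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]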

 

3.1.6 Putting it all together: a more complex example

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print('word index:', word_index)

sequences = tokenizer.texts_to_sequences(sentences)
print('\nsequences:', sequences)

padded = pad_sequences(sequences, maxlen=5)
print('\npadded sequences:\n', padded)

test_data = [
    'I really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print('\ntest sequences:', test_seq)

padded = pad_sequences(test_seq, maxlen=10)
print('\npadded test sequences:\n', padded)

Output:

With maxlen=5, only 5 words of each sentence are kept.

'Do you think my dog is amazing?'

This sentence is too long (7 words), so its first 2 words are cut off: by default pad_sequences truncates at the front (truncating='pre').
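To lose words from the end instead, pass truncating='post', which is a standard pad_sequences parameter:

padded = pad_sequences(sequences, maxlen=5, truncating='post')   # keeps the first 5 words

Expected output of the code above (assuming the same list order as before):

padded sequences:
[[ 0  5  3  2  4]
 [ 0  5  3  2  7]
 [ 0  6  3  2  4]
 [ 9  2  4 10 11]]

padded test sequences:
[[0 0 0 0 0 5 1 3 2 4]
 [0 0 0 0 0 2 4 1 2 1]]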

 

3.1.7 Sarcasm dataset

The Sarcasm dataset comes from: https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json

The JSON file contains 26,709 news records, each with three keys: the headline ['headline'], the article link ['article_link'], and whether it is sarcastic ['is_sarcastic'].
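A single record looks roughly like this (an illustrative sketch; the values are invented, only the three keys come from the dataset):

{"article_link": "https://...", "headline": "some headline text", "is_sarcastic": 0}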

import json

with open('./tmp/sarcasm.json', 'r') as f:
    data = json.load(f)

sentences = []
labels = []
urls = []
for item in data:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(oov_token='<oov>')
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print('\nlength of the word index: ', len(word_index))
print('\nexamples of the word index:\n', word_index)

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')

print(padded[0])
print(padded.shape)

Output:

……

  • There are 29,657 words in total; after padding, every encoded headline has length 40.
  • padded = pad_sequences(sequences, padding='post')

    padding='post' means the zeros are appended after the sentence; the longest headline is 40 words, so each row is padded out to 40.

  • The final result is a matrix of shape (26709, 40).

 

 
