Natural Language Processing (NLP): Tokenizer, Padding, and Embedding

grt要一直一直努力呀

Tags: python, nlp

Article link: https://blog.csdn.net/qq_44870115/article/details/111386470

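Natural language understanding (NLU) and natural language generation (NLG) are the two core areas of natural language processing.
Encoding
Computer vision is comparatively straightforward in this respect: pixel values are already numbers with a physical meaning, so they can be fed directly into a neural network. In natural language processing (NLP), the text must first be encoded as numbers.
1) Encode each letter with its ASCII code. The problem is that two words can consist of exactly the same letters yet mean completely different things (for example, "listen" and "silent").
2) Encode whole words instead, giving each word its own number.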
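
The difference between the two options can be seen in a small sketch. The helper name encode_letters and the two-word toy_vocab are made up for this illustration; they are not part of any library.

```python
def encode_letters(word):
    """Encode a word letter by letter using ASCII code points."""
    return [ord(ch) for ch in word]

listen_codes = encode_letters("listen")
silent_codes = encode_letters("silent")

# Same set of codes in a different order: the numbers alone say little about meaning.
same_letters = sorted(listen_codes) == sorted(silent_codes)

# Word-level encoding: each distinct word gets its own integer (hypothetical toy vocabulary).
toy_vocab = {"listen": 1, "silent": 2}
word_codes = [toy_vocab[w] for w in ["listen", "silent"]]

print(listen_codes)
print(silent_codes)
print(same_letters)
print(word_codes)
```

With letter-level codes the two words are just permutations of each other, while word-level codes treat them as two unrelated tokens, which is the approach the rest of this post follows.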

Tokenizer

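Tokenizer is a high-level TensorFlow (Keras) API that builds a vocabulary from a corpus, encodes each word as an integer, and turns sentences into sequences of those integers.
It takes one hyperparameter, num_words, which caps the vocabulary by word frequency. With num_words = 100, only the most frequent words are used when encoding sentences (Keras keeps the num_words - 1 most frequent ones); word_index itself still lists every word seen.
tokenizer.fit_on_texts(sentences): takes in the data and builds the vocabulary from it.
tokenizer.word_index: returns a dictionary of key-value pairs, where the key is a word and the value is the token for that word.
By default, the tokenizer strips punctuation, splits on whitespace, and lowercases the text, so "dog!" and "Dog" both become "dog".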

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)  # encode each sentence as a list of integers
print(word_index)
print(sequences)

output:

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
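
To make the numbering above less of a black box, here is a rough pure-Python sketch of the idea: count the words, sort by frequency, and hand out indices starting at 1. It is not the Keras implementation. The names simple_tokenize and build_word_index are invented for this example, and the punctuation handling is a simplification of the Keras default filter.

```python
import string
from collections import Counter

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

def simple_tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return cleaned.split()

def build_word_index(texts):
    """Assign 1-based indices by descending frequency (ties keep first-seen order)."""
    counts = Counter(word for text in texts for word in simple_tokenize(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sketch_index = build_word_index(sentences)
sketch_sequences = [[sketch_index[w] for w in simple_tokenize(s)] for s in sentences]
print(sketch_index)
print(sketch_sequences)
```

For these four sentences the sketch reproduces the dictionary and sequences shown above; 'my' gets index 1 because it is the most frequent word.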

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

test_data = [
    'I really love my dog',
    'My dog loves my manatee'
]
sequences = tokenizer.texts_to_sequences(sentences)  # encode each sentence as a list of integers
print(word_index)
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

output:

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [1, 3, 1]]

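The encoded test sentences are missing words: really, loves, and manatee are not in the dictionary, so the tokenizer silently drops them.
Passing the oov_token parameter to Tokenizer makes it replace every word that is not in the vocabulary with an OOV token instead.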

tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")  # OOV = out of vocabulary; stands in for words not in the word index

output:

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
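
The effect of the OOV token can be mimicked in a few lines of plain Python. This is only a sketch of the lookup logic, assuming the word index printed above; the function name encode is hypothetical and not a Keras call.

```python
import string

# Word index copied from the output above (with "<OOV>" at index 1).
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

def encode(text, index, oov_token=None):
    """Map words to integers; unknown words become oov_token, or are dropped if it is None."""
    words = text.lower().translate(str.maketrans('', '', string.punctuation)).split()
    encoded = []
    for word in words:
        if word in index:
            encoded.append(index[word])
        elif oov_token is not None:
            encoded.append(index[oov_token])
        # Otherwise the unknown word is silently skipped.
    return encoded

test_data = ['I really love my dog', 'My dog loves my manatee']
with_oov = [encode(s, word_index, oov_token='<OOV>') for s in test_data]
without_oov = [encode(s, word_index) for s in test_data]
print(with_oov)
print(without_oov)
```

With the OOV token each encoded sentence keeps its original length, which preserves word positions even when some words are unknown.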

padding

Before training, the sequences need to be brought to a uniform length, which is what padding does.
To use padding, first import pad_sequences:

from tensorflow.keras.preprocessing.sequence import pad_sequences

Then add the following code:

padded = pad_sequences(sequences)
print(padded)

output: the padded array is printed first; the unpadded sequences are shown below it for comparison

[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

Note that the zeros are added at the front, which is the default (padding='pre'). To pad at the end instead, pass padding='post':

padded = pad_sequences(sequences,padding='post')
[[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]

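To cap the number of words per sentence, set the maxlen parameter. Longer sentences are then truncated and lose part of their content; by default the words are cut from the front (truncating='pre'), as the last row shows.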

padded = pad_sequences(sequences,padding='post',maxlen=5)
[[ 5  3  2  4  0]
 [ 5  3  2  7  0]
 [ 6  3  2  4  0]
 [ 9  2  4 10 11]]
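
The padding and truncation rules can be summarized in a small pure-Python sketch. The function pad below is a hypothetical stand-in written for this post; it only imitates the padding, truncating, and maxlen options of pad_sequences and returns plain lists rather than a NumPy array.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Pad or truncate lists of integers to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        # Truncate first: keep the last maxlen items ('pre') or the first maxlen ('post').
        if len(seq) > maxlen:
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        result.append(fill + list(seq) if padding == 'pre' else list(seq) + fill)
    return result

sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
front_padded = pad(sequences)
back_padded_cut = pad(sequences, padding='post', maxlen=5)
back_padded_cut_end = pad(sequences, padding='post', maxlen=5, truncating='post')
print(front_padded)
print(back_padded_cut)
print(back_padded_cut_end)
```

The last call shows the alternative to the default: with truncating='post' the long sentence keeps its first five tokens instead of its last five.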

Embeddings

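Embeddings take these numbers and start to establish sentiment from them, so that you can begin to classify texts and later predict them.
A word embedding does more than assign each word a number: it maps each word to a vector, and words with similar meanings end up close to one another in the vector space.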
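
As a toy illustration of "close to one another", the sketch below treats an embedding as a lookup table from token id to vector and compares vectors with cosine similarity. Everything in it is assumed for the example: the three words, their ids, and the 2-dimensional vectors are hand-picked, whereas a real embedding is learned during training and usually has many more dimensions.

```python
import math

# Hypothetical 2-dimensional embedding table: row i is the vector for token id i.
# Values are hand-picked for illustration, not learned.
embedding_table = [
    [0.0, 0.0],    # 0: padding
    [0.9, 0.8],    # 1: "good"
    [0.8, 0.9],    # 2: "great"
    [-0.9, -0.7],  # 3: "bad"
]

def embed(token_ids):
    """Look up the vector for each token id."""
    return [embedding_table[i] for i in token_ids]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

good, great, bad = embed([1, 2, 3])
sim_good_great = cosine(good, great)
sim_good_bad = cosine(good, bad)
print(sim_good_great)
print(sim_good_bad)
```

In this made-up table "good" and "great" point in almost the same direction while "good" and "bad" point in opposite directions, which is the kind of structure a sentiment classifier can exploit.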