vocab text_Continuously updated: handy Keras APIs for text preprocessing


SIB驴


Tags: vocab, text

Article link: https://blog.csdn.net/weixin_34031238/article/details/112487783


import tensorflow as tf
from tensorflow import keras
import numpy as np

Tokenizer: mapping text to sequences (1)

fit_on_sequences
fit_on_texts
get_config
sequences_to_texts ....

from tensorflow.keras.preprocessing.text import Tokenizer

# Load the text data
with open("shakespeare.txt", 'r', encoding='utf-8') as f:
    text = f.read()

print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You

# Initialize the tokenizer. oov_token stands in for words absent from the corpus;
# here every such word is mapped to <unk>
tokenizer = Tokenizer(char_level=False, oov_token='<unk>')

# Fit on the given corpus; afterwards the tokenizer can map any given text to a sequence
tokenizer.fit_on_texts([text])

sequences = tokenizer.texts_to_sequences(["Before we proceed any further, hear me speak."])
print(sequences)

sequences = tokenizer.texts_to_sequences(["First Citizen:"])
print(sequences)

sequences = tokenizer.texts_to_sequences(["Hello world and hi"])
print(sequences)

[[140, 36, 970, 144, 669, 128, 16, 103]]
[[89, 270]]
[[1, 187, 3, 1]]

# tokenizer.word_index shows the integer index assigned to each token
tokenizer.word_index

The oov_token is encoded at the first position (index 1); the remaining words are indexed in descending order of frequency.
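
To make the indexing rule concrete, here is a minimal pure-Python sketch of how such a frequency-ordered index could be built. It is not the Keras implementation: the helper names build_word_index and to_sequence, the toy corpus, and the whitespace-only splitting are all assumptions made for illustration.

```python
from collections import Counter

def build_word_index(texts, oov_token="<unk>"):
    """Hypothetical helper: index 1 for the OOV token, then words by descending frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    # most_common() sorts by count (descending); ties keep first-seen order
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def to_sequence(text, word_index, oov_token="<unk>"):
    """Map each word to its index, falling back to the OOV index for unseen words."""
    oov_index = word_index[oov_token]
    return [word_index.get(word, oov_index) for word in text.lower().split()]

word_index = build_word_index(["the cat sat on the mat", "the dog sat"])
print(word_index["the"], word_index["sat"])      # 2 3
print(to_sequence("the bird sat", word_index))   # [2, 1, 3]
```

The unseen word "bird" falls back to index 1, which mirrors the 1s in the "Hello world and hi" output above.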

one_hot: mapping text to sequences (2)

from tensorflow.keras.preprocessing.text import one_hot

# n is the vocab_size (dictionary size); make it large enough, otherwise two different words may map to the same integer
one_hot("Before we proceed any further ha",n=128) 
[4, 106, 102, 87, 62, 96]

# Here we and any map to the same integer (9); proceed and further also collide (2)
one_hot("Before we proceed any further ha",n=10) 
[8, 9, 2, 9, 2, 8]
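
The collisions arise because one_hot hashes each word into a fixed range instead of building a vocabulary. A rough sketch of that idea follows, using MD5 as a stable hash. The function name hash_words and the choice of MD5 are assumptions for illustration, so the integers will not match the Keras output shown above.

```python
import hashlib

def hash_words(text, n):
    """Hypothetical helper: map each word to an integer in [1, n - 1] via a stable hash."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # 0 is left unused, so only n - 1 distinct values are available
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

sentence = "Before we proceed any further ha"
large = hash_words(sentence, n=128)
small = hash_words(sentence, n=4)

print(large)
print(small)
# With n=4 only 3 values exist for 6 distinct words, so collisions are unavoidable
print(len(set(small)) < len(small))  # True
```

Because the mapping is a hash, a larger n only makes collisions less likely; it does not rule them out.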

text_to_word_sequence: tokenization utility

from tensorflow.keras.preprocessing.text import text_to_word_sequence

print(text_to_word_sequence("Before we proceed any further, hear me speak.First Citizen:You"))

['before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak', 'first', 'citizen', 'you']
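
The output shows three steps at work: lowercasing, replacing punctuation with the separator, and splitting. The sketch below reproduces them in plain Python. simple_word_sequence is a hypothetical name, and the FILTERS string is what I understand the default filter set to be, so treat both as assumptions rather than the library's code.

```python
# Assumed default filter characters (punctuation, tab, newline)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    """Hypothetical re-implementation: lowercase, replace filtered chars, split."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character
    text = text.translate({ord(ch): split for ch in filters})
    # Drop the empty strings produced by consecutive separators
    return [token for token in text.split(split) if token]

tokens = simple_word_sequence("Before we proceed any further, hear me speak.First Citizen:You")
print(tokens)
```

Since "." and ":" are replaced by a space rather than deleted, "speak.First" and "Citizen:You" each split into two tokens, as in the output above.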

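pad_sequences: padding utility

This API is a workhorse in NLP. Most models require fixed-length sequences, yet the sentences in a text vary in length, so we need this API to pad or truncate sequences to a common length.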

'''
tf.keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=None, dtype='int32', padding='pre',
    truncating='pre', value=0.0
)
'''
t1 = tf.constant([[1,2,3,7,8],[1,2,5,7,8]])
t1
<tf.Tensor: id=0, shape=(2, 5), dtype=int32, numpy=
array([[1, 2, 3, 7, 8],
       [1, 2, 5, 7, 8]])>

# Default padding='pre' pads at the front; default value=0.0 fills with zeros
keras.preprocessing.sequence.pad_sequences(t1,maxlen=7) # pad t1 to shape (2,7)
array([[0, 0, 1, 2, 3, 7, 8],
       [0, 0, 1, 2, 5, 7, 8]])

# With padding='post', pad at the end
keras.preprocessing.sequence.pad_sequences(t1,maxlen=7,padding='post')
array([[1, 2, 3, 7, 8, 0, 0],
       [1, 2, 5, 7, 8, 0, 0]])

# Pad with a specified value
keras.preprocessing.sequence.pad_sequences(t1,maxlen=7,padding='post',value=1)
array([[1, 2, 3, 7, 8, 1, 1],
       [1, 2, 5, 7, 8, 1, 1]])

# truncating='pre' (the default) cuts from the front when a sequence is longer than maxlen; 'post' cuts from the end
keras.preprocessing.sequence.pad_sequences(t1,maxlen=3,truncating='post')
array([[1, 2, 3],
       [1, 2, 5]])
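
The calls above use sequences that already share one length. For the variable-length case that motivates this API, the pre/post logic can be sketched in plain Python. pad_or_truncate is a hypothetical helper written for illustration, not the Keras implementation, and it assumes maxlen is a positive integer.

```python
def pad_or_truncate(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Hypothetical helper: bring every sequence to exactly maxlen items."""
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the front, 'post' drops them from the back
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the fill values in front, 'post' appends them
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result

ragged = [[1, 2], [3, 4, 5, 6, 7], [8]]
padded_pre = pad_or_truncate(ragged, maxlen=4)
padded_post = pad_or_truncate(ragged, maxlen=4, padding="post", truncating="post")
print(padded_pre)   # [[0, 0, 1, 2], [4, 5, 6, 7], [0, 0, 0, 8]]
print(padded_post)  # [[1, 2, 0, 0], [3, 4, 5, 6], [8, 0, 0, 0]]
```

Short sequences are padded and long ones are truncated in the same pass, so a single maxlen yields a rectangular batch.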

to_categorical: one-hot encoding class labels in one call

In practice we often need to one-hot encode the class labels y; the to_categorical function handles this neatly.

from tensorflow.keras.utils import to_categorical

y = [0,0,3,5,1,4,2,2,1]
to_categorical(y)
array([[1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.]], dtype=float32)

# Specify the number of classes
y = [0,0,3,5,1,4,2,2,1]
to_categorical(y,num_classes=8)
array([[1., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0.]], dtype=float32)
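
For intuition, the encoding itself is small enough to sketch without any library. The helper name one_hot_labels is hypothetical, and the sketch returns nested Python lists rather than a float32 array.

```python
def one_hot_labels(labels, num_classes=None):
    """Hypothetical helper: one row per label, with a single 1.0 at the label's index."""
    if num_classes is None:
        # Like the first call above, infer the width from the largest label
        num_classes = max(labels) + 1
    return [[1.0 if i == label else 0.0 for i in range(num_classes)] for label in labels]

y = [0, 0, 3, 5, 1, 4, 2, 2, 1]
encoded = one_hot_labels(y)
print(len(encoded), len(encoded[0]))   # 9 6
print(encoded[2])                      # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

# The inverse mapping recovers the original integer labels
decoded = [row.index(1.0) for row in encoded]
print(decoded == y)                    # True
```

Passing num_classes explicitly matters when a batch happens to lack the highest label, since the inferred width would otherwise come out too small.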

KeyValueTensorInitializer & StaticVocabularyTable: mapping categories to indices

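vocab = ["Up",'Down','Left','Right'] # vocabulary: the list of all possible categories
indices = tf.range(len(vocab),dtype=tf.int64) # tensor of vocabulary indices
table_init = tf.lookup.KeyValueTensorInitializer(vocab,indices)
num_oov = 3 # number of out-of-vocabulary buckets
table = tf.lookup.StaticVocabularyTable(table_init,num_oov) # lookup table
# Once the lookup table is built, any category can be encoded as an index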
categories = tf.constant(['Down','Down','Right','Up','Up','Unknown'])
table.lookup(categories)

<tf.Tensor: id=167, shape=(6,), dtype=int64, numpy=array([1, 1, 3, 0, 0, 6], dtype=int64)>
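
In the result above, 'Unknown' lands on 6, which lies in the out-of-vocabulary range 4 to 6 (vocabulary size 4 plus 3 buckets). The sketch below imitates that behaviour in plain Python. build_lookup is a hypothetical helper and CRC32 is a stand-in hash, so the exact bucket chosen for an unknown key is not expected to match TensorFlow.

```python
import zlib

def build_lookup(vocab, num_oov_buckets):
    """Hypothetical helper: known keys get their list position, others a hashed OOV bucket."""
    known = {word: i for i, word in enumerate(vocab)}

    def lookup(categories):
        ids = []
        for cat in categories:
            if cat in known:
                ids.append(known[cat])
            else:
                # CRC32 is a stand-in; the real table uses its own hash function
                bucket = zlib.crc32(cat.encode("utf-8")) % num_oov_buckets
                ids.append(len(known) + bucket)
        return ids

    return lookup

lookup = build_lookup(["Up", "Down", "Left", "Right"], num_oov_buckets=3)
ids = lookup(["Down", "Down", "Right", "Up", "Up", "Unknown"])
print(ids[:5])           # [1, 1, 3, 0, 0]
print(4 <= ids[5] <= 6)  # True
```

Using several OOV buckets instead of one lets different unknown categories land on different indices, at the cost of occasional collisions between them.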
