TensorFlow.Keras common functions
tensorflow.keras.preprocessing
Tokenizer
tf.keras.preprocessing.text.Tokenizer(
num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True,
split=' ', char_level=False, oov_token=None, document_count=0, **kwargs
)
Key arguments:
num_words: the maximum number of words to keep, based on word frequency; only the most common words are retained.
filters: a string of characters to filter out, such as punctuation, tabs, and newlines.
split: the separator used to split each string into words.
char_level: if True, every character (rather than every word) is treated as a token.
oov_token: if given, it is added to word_index and used to replace out-of-vocabulary words during text-to-sequence operations.
Once the tokenizer is configured, it is trained with methods such as fit_on_texts or fit_on_sequences:
- Updates internal vocabulary based on a list of sequences:
fit_on_sequences(
sequences
)
Required before using sequences_to_matrix (if fit_on_texts was never called).
- Updates internal vocabulary based on a list of texts:
fit_on_texts(
texts
)
Required before using texts_to_sequences or texts_to_matrix.
- Returns the tokenizer configuration as Python dictionary. The word count dictionaries used by the tokenizer get serialized into plain JSON, so that the configuration can be read by other projects.
get_config()
- Converts a list of sequences into a Numpy matrix:
sequences_to_matrix(
sequences, mode='binary'
)
- Transforms each sequence into a list of texts:
sequences_to_texts(
sequences
)
…and other methods such as texts_to_sequences and texts_to_matrix.
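The workflow described above (construct, fit, convert) can be sketched as follows; the two sample sentences are made up for illustration:

```python
# A minimal sketch of the Tokenizer workflow, using a made-up two-sentence corpus.
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["the cat sat on the mat", "the dog ate my homework"]

# oov_token stands in for any word not seen during fit_on_texts
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

print(tokenizer.word_index)  # word -> integer index; "<OOV>" gets index 1

# Words absent from the training texts map to the OOV index
sequences = tokenizer.texts_to_sequences(["the cat ate a snack"])
print(sequences)  # "a" and "snack" both become the <OOV> index

# Each sequence can also be turned into a fixed-size vector
matrix = tokenizer.sequences_to_matrix(sequences, mode="binary")
print(matrix.shape)  # one row per sequence, num_words columns
```

Note that indices are assigned by descending word frequency, so frequent words keep their indices even when num_words truncates the vocabulary.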
pad_sequences
tf.keras.preprocessing.sequence.pad_sequences(
sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre',
value=0.0
)
This function converts a list (of length num_samples) of sequences (lists of integers) into a 2D array of shape (num_samples, num_timesteps). If the maxlen argument is given, num_timesteps = maxlen; otherwise num_timesteps is the length of the longest sequence. Sequences shorter than num_timesteps are padded with value.
padding has two options, 'pre' and 'post', indicating whether each sequence is padded at the front or at the end; truncating works the same way for sequences longer than num_timesteps.
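The padding and truncating behavior can be seen in a short sketch (the input sequences are arbitrary examples):

```python
# Demonstrating pre/post padding and truncating with pad_sequences.
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[1, 2, 3], [4, 5], [6]]

# Default: maxlen is the longest sequence (3), zeros padded at the front
print(pad_sequences(sequences))
# [[1 2 3]
#  [0 4 5]
#  [0 0 6]]

# maxlen=2 truncates the long sequence; padding='post' appends zeros instead
print(pad_sequences(sequences, maxlen=2, padding="post", truncating="post"))
# [[1 2]
#  [4 5]
#  [6 0]]
```

The result is a regular NumPy array, so it can be fed directly to a model's input layer.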
In summary: the Tokenizer class handles text preprocessing in TensorFlow.Keras, covering parameter configuration, fitting, and sequence conversion, while pad_sequences adjusts sequence lengths by padding or truncating so that the data fed to a model has a consistent shape. Both tools are essential in natural language processing and text classification tasks.