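This file handles text preprocessing for word vectors in NLP.
keras\preprocessing\text.py
It converts input text into a format that can be processed further (for example, integer sequences or a matrix), typically before the text is fed to a word-embedding layer.
Annotated code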
# -*- coding: utf-8 -*-
"""Utilities for text input preprocessing.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import string
import sys
import warnings
from collections import OrderedDict
from hashlib import md5

import numpy as np
from six.moves import range
from six.moves import zip

# `maketrans` lives in different places on Python 2 and Python 3.
if sys.version_info < (3,):
    maketrans = string.maketrans
else:
    maketrans = str.maketrans


def text_to_word_sequence(text,
                          filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                          lower=True, split=" "):
    """Converts a text to a sequence of words (or tokens).

    # Arguments
        text: Input text (string).
        filters: Sequence of characters to filter out.
        lower: Whether to convert the input to lowercase.
        split: Sentence split marker (string).

    # Returns
        A list of words (or tokens).
    """
    if lower:
        text = text.lower()

    # Replace every filtered character with the split marker.
    if sys.version_info < (3,) and isinstance(text, unicode):
        translate_map = dict((ord(c), unicode(split)) for c in filters)
    else:
        translate_map = maketrans(filters, split * len(filters))

    text = text.translate(translate_map)
    seq = text.split(split)
    # Drop the empty strings left behind by consecutive split markers.
    return [i for i in seq if i]


def one_hot(text, n,
            filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
            lower=True,
            split=' '):
    """One-hot encodes a text into a list of word indexes of size n.

    This is a wrapper to the `hashing_trick` function using `hash` as the
    hashing function; uniqueness of the word-to-index mapping is not
    guaranteed.

    # Arguments
        text: Input text (string).
        n: Dimension of the hashing space.
        filters: Sequence of characters to filter out.
        lower: Whether to convert the input to lowercase.
        split: Sentence split marker (string).

    # Returns
        A list of integer word indices (uniqueness not guaranteed).
    """
    return hashing_trick(text, n,
                         hash_function=hash,
                         filters=filters,
                         lower=lower,
                         split=split)


def hashing_trick(text, n,
                  hash_function=None,
                  filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                  lower=True,
                  split=' '):
    """Converts a text to a sequence of indexes in a fixed-size hashing space.

    # Arguments
        text: Input text (string).
        n: Dimension of the hashing space.
        hash_function: if `None`, uses the Python `hash` function; can be
            'md5' or any function that takes a string as input and returns
            an int. Note that `hash` is not a stable hashing function, so
            it is not consistent across different runs, while 'md5'
            is a stable hashing function.
        filters: Sequence of characters to filter out.
        lower: Whether to convert the input to lowercase.
        split: Sentence split marker (string).

    # Returns
        A list of integer word indices (uniqueness not guaranteed).

    `0` is a reserved index that won't be assigned to any word.

    Two or more words may be assigned to the same index, due to possible
    collisions by the hashing function.
    The [probability](https://en.wikipedia.org/wiki/Birthday_problem#Probability_table)
    of a collision depends on the dimension of the hashing space and
    the number of distinct objects.
    """
    if hash_function is None:
        hash_function = hash
    elif hash_function == 'md5':
        hash_function = lambda w: int(md5(w.encode()).hexdigest(), 16)

    seq = text_to_word_sequence(text,
                                filters=filters,
                                lower=lower,
                                split=split)
    # Map each hash into 1..n-1 so that index 0 stays reserved.
    return [(hash_function(w) % (n - 1) + 1) for w in seq]


class Tokenizer(object):
    """Text tokenization utility class.

    This class allows vectorizing a text corpus, by turning each
    text into either a sequence of integers (each integer being the index
    of a token in a dictionary) or into a vector where the coefficient
    for each token could be binary, based on word count, based on tf-idf...

    TF-IDF (term frequency-inverse document frequency) is a weighting
    technique commonly used in information retrieval and data mining.
    TF stands for term frequency; IDF stands for inverse document
    frequency.

    # Arguments
        num_words: the maximum number of words to keep, based
            on word frequency. Only the most common `num_words` words will
            be kept.
        filters: a string where each element is a character that will be
            filtered from the texts. The default is all punctuation, plus
            tabs and line breaks, minus the `'` character.
        lower: boolean. Whether to convert the texts to lowercase.
        split: character or string to use for token splitting.
        char_level: if True, every character will be treated as a token.
        oov_token: if given, it will be added to word_index and used to
            replace out-of-vocabulary words during `texts_to_sequences`
            calls.

    By default, all punctuation is removed, turning the texts into
    space-separated sequences of words
    (words may include the `'` character). These sequences are then
    split into lists of tokens. They will then be indexed or vectorized.

    `0` is a reserved index that won't be assigned to any word.
    """

    def __init__(self, num_words=None,
                 filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                 lower=True,
                 split=' ',
                 char_level=False,
                 oov_token=None,
                 **kwargs):
        # Legacy support
        if 'nb_words' in kwargs:
            warnings.warn('The `nb_words` argument in `Tokenizer` '
                          'has been renamed `num_words`.')
            num_words = kwargs.pop('nb_words')
        if kwargs:
            raise TypeError('Unrecognized keyword arguments: ' + str(kwargs))

        self.word_counts = OrderedDict()  # word -> total occurrences
        self.word_docs = {}               # word -> number of documents containing it
        self.filters = filters
        self.split = split
        self.lower = lower
        self.num_words = num_words
        self.document_count = 0
        self.char_level = char_level
        self.oov_token = oov_token
        self.index_docs = {}              # word index -> number of documents containing it

    def fit_on_texts(self, texts):
        """Updates internal vocabulary based on a list of texts.

        In the case where texts contains lists, we assume each entry of the
        lists to be a token.

        Required before using `texts_to_sequences` or `texts_to_matrix`.

        # Arguments
            texts: can be a list of strings,
                a generator of strings (for memory-efficiency),
                or a list of lists of strings.
        """
        for text in texts:
            self.document_count += 1
            if self.char_level or isinstance(text, list):
                seq = text
            else:
                seq = text_to_word_sequence(text,
                                            self.filters,
                                            self.lower,
                                            self.split)
            for w in seq:
                if w in self.word_counts:
                    self.word_counts[w] += 1
                else:
                    self.word_counts[w] = 1
            for w in set(seq):
                if w in self.word_docs:
                    self.word_docs[w] += 1
                else:
                    self.word_docs[w] = 1

        # Rank words by descending total frequency.
        wcounts = list(self.word_counts.items())
        wcounts.sort(key=lambda x: x[1], reverse=True)
        sorted_voc = [wc[0] for wc in wcounts]
        # note that index 0 is reserved, never assigned to an existing word
        self.word_index = dict(list(zip(sorted_voc, list(range(1, len(sorted_voc) + 1)))))

        if self.oov_token is not None:
            i = self.word_index.get(self.oov_token)
            if i is None:
                self.word_index[self.oov_token] = len(self.word_index) + 1

        for w, c in list(self.word_docs.items()):
            self.index_docs[self.word_index[w]] = c

    def fit_on_sequences(self, sequences):
        """Updates internal vocabulary based on a list of sequences.

        Required before using `sequences_to_matrix`
        (if `fit_on_texts` was never called).

        # Arguments
            sequences: A list of sequences.
                A "sequence" is a list of integer word indices.
        """
        self.document_count += len(sequences)
        for seq in sequences:
            # Count each index at most once per sequence (document frequency).
            seq = set(seq)
            for i in seq:
                if i not in self.index_docs:
                    self.index_docs[i] = 1
                else:
                    self.index_docs[i] += 1

    def texts_to_sequences(self, texts):
        """Transforms each text in texts into a sequence of integers.

        Only the top "num_words" most frequent words will be taken into
        account. Only words known by the tokenizer will be taken into account.

        # Arguments
            texts: A list of texts (strings).

        # Returns
            A list of sequences.
        """
        res = []
        for vect in self.texts_to_sequences_generator(texts):
            res.append(vect)
        return res

    def texts_to_sequences_generator(self, texts):
        """Transforms each text in `texts` into a sequence of integers.

        Each item in texts can also be a list, in which case we assume each
        item of that list to be a token.

        Only the top "num_words" most frequent words will be taken into
        account. Only words known by the tokenizer will be taken into account.

        # Arguments
            texts: A list of texts (strings).

        # Yields
            Yields individual sequences.
        """
        num_words = self.num_words
        for text in texts:
            if self.char_level or isinstance(text, list):
                seq = text
            else:
                seq = text_to_word_sequence(text,
                                            self.filters,
                                            self.lower,
                                            self.split)
            vect = []
            for w in seq:
                i = self.word_index.get(w)
                if i is not None:
                    # Skip words ranked outside the `num_words` limit.
                    if num_words and i >= num_words:
                        continue
                    else:
                        vect.append(i)
                elif self.oov_token is not None:
                    # Unknown word: substitute the out-of-vocabulary token.
                    i = self.word_index.get(self.oov_token)
                    if i is not None:
                        vect.append(i)
            yield vect

    def texts_to_matrix(self, texts, mode='binary'):
        """Convert a list of texts to a Numpy matrix.

        # Arguments
            texts: list of strings.
            mode: one of "binary", "count", "tfidf", "freq".

        # Returns
            A Numpy matrix.
        """
        sequences = self.texts_to_sequences(texts)
        return self.sequences_to_matrix(sequences, mode=mode)

    def sequences_to_matrix(self, sequences, mode='binary'):
        """Converts a list of sequences into a Numpy matrix.

        # Arguments
            sequences: list of sequences
                (a sequence is a list of integer word indices).
            mode: one of "binary", "count", "tfidf", "freq"

        # Returns
            A Numpy matrix.

        # Raises
            ValueError: In case of invalid `mode` argument,
                or if the Tokenizer requires to be fit to sample data.
        """
        if not self.num_words:
            if self.word_index:
                num_words = len(self.word_index) + 1
            else:
                raise ValueError('Specify a dimension (num_words argument), '
                                 'or fit on some text data first.')
        else:
            num_words = self.num_words

        if mode == 'tfidf' and not self.document_count:
            raise ValueError('Fit the Tokenizer on some data '
                             'before using tfidf mode.')

        x = np.zeros((len(sequences), num_words))
        for i, seq in enumerate(sequences):
            if not seq:
                continue
            # Count occurrences of each in-range index in this sequence.
            counts = {}
            for j in seq:
                if j >= num_words:
                    continue
                if j not in counts:
                    counts[j] = 1.
                else:
                    counts[j] += 1
            for j, c in list(counts.items()):
                if mode == 'count':
                    x[i][j] = c
                elif mode == 'freq':
                    x[i][j] = c / len(seq)
                elif mode == 'binary':
                    x[i][j] = 1
                elif mode == 'tfidf':
                    # Use weighting scheme 2 in
                    # https://en.wikipedia.org/wiki/Tf%E2%80%93idf
                    tf = 1 + np.log(c)
                    idf = np.log(1 + self.document_count /
                                 (1 + self.index_docs.get(j, 0)))
                    x[i][j] = tf * idf
                else:
                    raise ValueError('Unknown vectorization mode:', mode)
        return x
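
The `tfidf` branch of `sequences_to_matrix` above can be checked by hand. The following is a minimal sketch, assuming Python 3 (true division) and using only the standard library; the function name `tfidf_weight` and the numbers passed to it are made up for illustration and are not part of Keras.

```python
import math


def tfidf_weight(count, document_count, docs_with_token):
    """Recompute the weight from the `tfidf` branch of the listing.

    count: occurrences of the token in the current sequence (`c`).
    document_count: number of documents seen while fitting.
    docs_with_token: number of fitted documents that contain the token
        (the value stored in `index_docs`, 0 if the token was never seen).
    """
    tf = 1 + math.log(count)
    idf = math.log(1 + document_count / (1 + docs_with_token))
    return tf * idf


# Hypothetical values: the token occurs 3 times in this sequence,
# 10 documents were fitted, and 4 of them contain the token.
weight = tfidf_weight(3, 10, 4)
print(weight)
```

A token that occurs once has `tf = 1`, so its weight reduces to the `idf` term alone; rarer tokens (smaller `docs_with_token`) get a larger weight.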
Running the code
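
The pipeline in the listing can be tried without installing Keras. What follows is a rough, Python 3-only sketch that restates the main steps (splitting text into words, ranking words by frequency, mapping texts to index sequences, and stable MD5 hashing) with the standard library. The helper names `to_words`, `build_word_index`, `to_sequences`, and `md5_index`, and the two-sentence corpus, are assumptions made for this illustration; they are not Keras APIs, and options such as `num_words`, `char_level`, and `oov_token` are left out.

```python
from collections import Counter
from hashlib import md5

FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def to_words(text, filters=FILTERS, lower=True, split=' '):
    """Python 3-only restatement of `text_to_word_sequence`."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split marker, then split.
    text = text.translate(str.maketrans(filters, split * len(filters)))
    return [w for w in text.split(split) if w]


def build_word_index(texts):
    """Rank words by total frequency; index 0 stays reserved."""
    counts = Counter()
    for text in texts:
        counts.update(to_words(text))
    # most_common() sorts by descending count; ties keep first-seen order.
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index):
    """Map each text to a list of word indices, skipping unknown words."""
    return [[word_index[w] for w in to_words(t) if w in word_index]
            for t in texts]


def md5_index(word, n):
    """Stable index in 1..n-1, as `hashing_trick` does with 'md5'."""
    return int(md5(word.encode()).hexdigest(), 16) % (n - 1) + 1


# Made-up corpus for illustration only.
texts = ['The cat sat on the mat.', 'The dog ate my homework!']
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index)

print(word_index)
print(sequences)
print([md5_index(w, 50) for w in to_words(texts[0])])
```

Because "the" is the most frequent word in this corpus it receives index 1, and the MD5-based indices stay the same from run to run, unlike those produced with the built-in `hash`.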
Detailed introduction to Keras
Chinese: http://keras-cn.readthedocs.io/en/latest/
Example downloads
https://github.com/keras-team/keras
https://github.com/keras-team/keras/tree/master/examples
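Full project download
For readers without download points, add QQ 452205574 to access the shared folder.
It includes the code, datasets (images), trained models, library installation files, and more.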