keras\preprocessing directory files explained 5.3 (text.py) - Keras study notes 5

wyx100
Article link: https://blog.csdn.net/wyx100/article/details/81083056
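This file handles text preprocessing for NLP, the step that comes before word-vector (embedding) processing.
keras\preprocessing\text.py
It tokenizes input text and converts it into a format that can be processed further (for example, integer sequences or a matrix), which can then be fed to a word-embedding layer.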
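As a rough illustration of that conversion, here is a minimal pure-Python sketch of the text -> tokens -> integer indices -> binary matrix flow. It mirrors the logic of the listing below but is not the Keras API: the helper names `tokenize`, `build_word_index`, `to_sequences`, and `to_binary_matrix` are hypothetical and exist only for this example.

```python
import string

# Same character set as the default `filters` in the listing:
# all punctuation except the apostrophe, plus tab and newline.
FILTERS = string.punctuation.replace("'", "") + "\t\n"


def tokenize(text, filters=FILTERS, lower=True, split=" "):
    """Lowercase, replace filtered characters with the separator, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans(filters, split * len(filters))
    return [tok for tok in text.translate(table).split(split) if tok]


def build_word_index(texts):
    """Rank words by total frequency; indices start at 1 (0 is reserved)."""
    counts = {}
    for text in texts:
        for tok in tokenize(text):
            counts[tok] = counts.get(tok, 0) + 1
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts, key=lambda w: counts[w], reverse=True)
    return {word: i for i, word in enumerate(ranked, start=1)}


def to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are dropped."""
    return [[word_index[t] for t in tokenize(text) if t in word_index]
            for text in texts]


def to_binary_matrix(sequences, num_columns):
    """One row per text; column j is 1 if index j occurs in that text."""
    matrix = [[0] * num_columns for _ in sequences]
    for row, seq in zip(matrix, sequences):
        for j in seq:
            row[j] = 1
    return matrix


texts = ["The cat sat.", "The dog sat, the dog barked!"]
word_index = build_word_index(texts)
# word_index: {'the': 1, 'sat': 2, 'dog': 3, 'cat': 4, 'barked': 5}
sequences = to_sequences(texts, word_index)
# sequences: [[1, 4, 2], [1, 3, 2, 1, 3, 5]]
matrix = to_binary_matrix(sequences, len(word_index) + 1)
print(matrix)
```

Column 0 of the matrix is never set, because index 0 is reserved and no word is mapped to it.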
Keras package file directory
Keras examples file directory
Code annotations
# -*- coding: utf-8 -*-
"""Utilities for text input preprocessing.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import string
import sys
import warnings
from collections import OrderedDict
from hashlib import md5

import numpy as np
from six.moves import range
from six.moves import zip

if sys.version_info < (3,):
    maketrans = string.maketrans
else:
    maketrans = str.maketrans


def text_to_word_sequence(text,
                          filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                          lower=True, split=" "):
    """Converts a text to a sequence of words (or tokens).

    # Arguments
        text: Input text (string).
        filters: Sequence of characters to filter out.
        lower: Whether to convert the input to lowercase.
        split: Sentence split marker (string).

    # Returns
        A list of words (or tokens).
    """
    if lower:
        text = text.lower()

    if sys.version_info < (3,) and isinstance(text, unicode):
        translate_map = dict((ord(c), unicode(split)) for c in filters)
    else:
        translate_map = maketrans(filters, split * len(filters))

    text = text.translate(translate_map)
    seq = text.split(split)
    return [i for i in seq if i]


def one_hot(text, n,
            filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
            lower=True,
            split=' '):
    """One-hot encodes a text into a list of word indexes of size n.

    This is a wrapper to the `hashing_trick` function using `hash` as the
    hashing function; uniqueness of the word-to-index mapping is not guaranteed.

    # Arguments
        text: Input text (string).
        n: Dimension of the hashing space.
        filters: Sequence of characters to filter out.
        lower: Whether to convert the input to lowercase.
        split: Sentence split marker (string).

    # Returns
        A list of integer word indices (uniqueness not guaranteed).

    """
    return hashing_trick(text, n,
                         hash_function=hash,
                         filters=filters,
                         lower=lower,
                         split=split)


def hashing_trick(text, n,
                  hash_function=None,
                  filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                  lower=True,
                  split=' '):
    """Converts a text to a sequence of indexes in a fixed-size hashing space.

    # Arguments
        text: Input text (string).
        n: Dimension of the hashing space.
        hash_function: if `None`, uses the Python `hash` function; can be 'md5' or
            any function that takes a string as input and returns an int.
            Note that `hash` is not a stable hashing function, so
            it is not consistent across different runs, while 'md5'
            is a stable hashing function.
        filters: Sequence of characters to filter out.
        lower: Whether to convert the input to lowercase.
        split: Sentence split marker (string).

    # Returns
        A list of integer word indices (uniqueness not guaranteed).

    `0` is a reserved index that won't be assigned to any word.

    Two or more words may be assigned to the same index, due to possible
    collisions by the hashing function.

    The [probability](https://en.wikipedia.org/wiki/Birthday_problem#Probability_table)
    of a collision is in relation to the dimension of the hashing space and
    the number of distinct objects.
    """
    if hash_function is None:
        hash_function = hash
    elif hash_function == 'md5':
        hash_function = lambda w: int(md5(w.encode()).hexdigest(), 16)

    seq = text_to_word_sequence(text,
                                filters=filters,
                                lower=lower,
                                split=split)
    return [(hash_function(w) % (n - 1) + 1) for w in seq]


class Tokenizer(object):
    """Text tokenization utility class.

    This class allows vectorizing a text corpus, by turning each
    text into either a sequence of integers (each integer being the index
    of a token in a dictionary) or into a vector where the coefficient
    for each token could be binary, based on word count, based on tf-idf...

    TF-IDF (term frequency-inverse document frequency) is a weighting
    technique commonly used in information retrieval and data mining.
    TF stands for term frequency; IDF stands for inverse document frequency.

    # Arguments
        num_words: the maximum number of words to keep, based
            on word frequency. Only the most common `num_words - 1` words
            will be kept (indices `num_words` and above are dropped).
        filters: a string where each element is a character that will be
            filtered from the texts. The default is all punctuation, plus
            tabs and line breaks, minus the `'` character.
        lower: boolean. Whether to convert the texts to lowercase.
        split: character or string to use for token splitting.
        char_level: if True, every character will be treated as a token.
        oov_token: if given, it will be added to word_index and used to
            replace out-of-vocabulary words during `texts_to_sequences` calls.

    By default, all punctuation is removed, turning the texts into
    space-separated sequences of words
    (words may include the `'` character). These sequences are then
    split into lists of tokens. They will then be indexed or vectorized.

    `0` is a reserved index that won't be assigned to any word.
    """

    def __init__(self, num_words=None,
                 filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                 lower=True,
                 split=' ',
                 char_level=False,
                 oov_token=None,
                 **kwargs):
        # Legacy support
        if 'nb_words' in kwargs:
            warnings.warn('The `nb_words` argument in `Tokenizer` '
                          'has been renamed `num_words`.')
            num_words = kwargs.pop('nb_words')
        if kwargs:
            raise TypeError('Unrecognized keyword arguments: ' + str(kwargs))

        self.word_counts = OrderedDict()
        self.word_docs = {}
        self.filters = filters
        self.split = split
        self.lower = lower
        self.num_words = num_words
        self.document_count = 0
        self.char_level = char_level
        self.oov_token = oov_token
        self.index_docs = {}

    def fit_on_texts(self, texts):
        """Updates internal vocabulary based on a list of texts.

        In the case where texts contains lists, we assume each entry of the lists
        to be a token.

        Required before using `texts_to_sequences` or `texts_to_matrix`.

        # Arguments
            texts: can be a list of strings,
                a generator of strings (for memory-efficiency),
                or a list of lists of strings.
        """
        for text in texts:
            self.document_count += 1
            if self.char_level or isinstance(text, list):
                seq = text
            else:
                seq = text_to_word_sequence(text,
                                            self.filters,
                                            self.lower,
                                            self.split)
            for w in seq:
                if w in self.word_counts:
                    self.word_counts[w] += 1
                else:
                    self.word_counts[w] = 1
            for w in set(seq):
                if w in self.word_docs:
                    self.word_docs[w] += 1
                else:
                    self.word_docs[w] = 1

        wcounts = list(self.word_counts.items())
        wcounts.sort(key=lambda x: x[1], reverse=True)
        sorted_voc = [wc[0] for wc in wcounts]
        # note that index 0 is reserved, never assigned to an existing word
        self.word_index = dict(list(zip(sorted_voc, list(range(1, len(sorted_voc) + 1)))))

        if self.oov_token is not None:
            i = self.word_index.get(self.oov_token)
            if i is None:
                self.word_index[self.oov_token] = len(self.word_index) + 1

        for w, c in list(self.word_docs.items()):
            self.index_docs[self.word_index[w]] = c

    def fit_on_sequences(self, sequences):
        """Updates internal vocabulary based on a list of sequences.

        Required before using `sequences_to_matrix`
        (if `fit_on_texts` was never called).

        # Arguments
            sequences: A list of sequences.
                A "sequence" is a list of integer word indices.
        """
        self.document_count += len(sequences)
        for seq in sequences:
            seq = set(seq)
            for i in seq:
                if i not in self.index_docs:
                    self.index_docs[i] = 1
                else:
                    self.index_docs[i] += 1

    def texts_to_sequences(self, texts):
        """Transforms each text in texts in a sequence of integers.

        Only top "num_words" most frequent words will be taken into account.
        Only words known by the tokenizer will be taken into account.

        # Arguments
            texts: A list of texts (strings).

        # Returns
            A list of sequences.
        """
        res = []
        for vect in self.texts_to_sequences_generator(texts):
            res.append(vect)
        return res

    def texts_to_sequences_generator(self, texts):
        """Transforms each text in `texts` in a sequence of integers.

        Each item in texts can also be a list, in which case we assume each item of that list
        to be a token.

        Only top "num_words" most frequent words will be taken into account.
        Only words known by the tokenizer will be taken into account.

        # Arguments
            texts: A list of texts (strings).

        # Yields
            Yields individual sequences.
        """
        num_words = self.num_words
        for text in texts:
            if self.char_level or isinstance(text, list):
                seq = text
            else:
                seq = text_to_word_sequence(text,
                                            self.filters,
                                            self.lower,
                                            self.split)
            vect = []
            for w in seq:
                i = self.word_index.get(w)
                if i is not None:
                    if num_words and i >= num_words:
                        continue
                    else:
                        vect.append(i)
                elif self.oov_token is not None:
                    i = self.word_index.get(self.oov_token)
                    if i is not None:
                        vect.append(i)
            yield vect

    def texts_to_matrix(self, texts, mode='binary'):
        """Convert a list of texts to a Numpy matrix.

        # Arguments
            texts: list of strings.
            mode: one of "binary", "count", "tfidf", "freq".

        # Returns
            A Numpy matrix.
        """
        sequences = self.texts_to_sequences(texts)
        return self.sequences_to_matrix(sequences, mode=mode)

    def sequences_to_matrix(self, sequences, mode='binary'):
        """Converts a list of sequences into a Numpy matrix.

        # Arguments
            sequences: list of sequences
                (a sequence is a list of integer word indices).
            mode: one of "binary", "count", "tfidf", "freq"

        # Returns
            A Numpy matrix.

        # Raises
            ValueError: In case of invalid `mode` argument,
                or if the Tokenizer requires to be fit to sample data.
        """
        if not self.num_words:
            if self.word_index:
                num_words = len(self.word_index) + 1
            else:
                raise ValueError('Specify a dimension (num_words argument), '
                                 'or fit on some text data first.')
        else:
            num_words = self.num_words

        if mode == 'tfidf' and not self.document_count:
            raise ValueError('Fit the Tokenizer on some data '
                             'before using tfidf mode.')

        x = np.zeros((len(sequences), num_words))
        for i, seq in enumerate(sequences):
            if not seq:
                continue
            counts = {}
            for j in seq:
                if j >= num_words:
                    continue
                if j not in counts:
                    counts[j] = 1.
                else:
                    counts[j] += 1
            for j, c in list(counts.items()):
                if mode == 'count':
                    x[i][j] = c
                elif mode == 'freq':
                    x[i][j] = c / len(seq)
                elif mode == 'binary':
                    x[i][j] = 1
                elif mode == 'tfidf':
                    # Use weighting scheme 2 in
                    # https://en.wikipedia.org/wiki/Tf%E2%80%93idf
                    tf = 1 + np.log(c)
                    idf = np.log(1 + self.document_count /
                                 (1 + self.index_docs.get(j, 0)))
                    x[i][j] = tf * idf
                else:
                    raise ValueError('Unknown vectorization mode:', mode)
        return x
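
The index computation in `hashing_trick` above can be sketched on its own with only the standard library. This is a minimal illustration, assuming Python 3; `md5_index` is a hypothetical helper name, not part of Keras. It shows why MD5 gives a repeatable mapping and why index 0 is never produced. By contrast, Python 3's built-in `hash` for strings is randomized per process unless `PYTHONHASHSEED` is fixed, which is why the docstring calls it unstable.

```python
from hashlib import md5


def md5_index(word, n):
    """Map a word to an index in 1..n-1 using a stable MD5 hash."""
    digest = int(md5(word.encode("utf-8")).hexdigest(), 16)
    # `% (n - 1)` yields 0..n-2; adding 1 shifts the range to 1..n-1,
    # so index 0 stays reserved.
    return digest % (n - 1) + 1


words = ["the", "cat", "sat", "the"]
n = 10
indices = [md5_index(w, n) for w in words]
print(indices)

# With a tiny hashing space every word collides on the same index.
print([md5_index(w, 2) for w in words])
```

The same word always maps to the same index, but distinct words can share one, and collisions become more likely as `n` shrinks relative to the vocabulary size.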
Code execution
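As a small runnable check of the `tfidf` branch of `sequences_to_matrix` in the listing, the sketch below evaluates the same two formulas, tf = 1 + ln(c) and idf = ln(1 + N / (1 + df)), for a few cases. The function name `tfidf_weight` and the corpus statistics are hypothetical values chosen for illustration, and `math.log` stands in for `np.log`.

```python
import math


def tfidf_weight(count_in_text, document_count, docs_with_word):
    """Weight computed by the 'tfidf' mode for one word in one text."""
    tf = 1 + math.log(count_in_text)
    idf = math.log(1 + document_count / (1 + docs_with_word))
    return tf * idf


# Hypothetical statistics for a tokenizer fitted on 4 documents.
rare = tfidf_weight(count_in_text=1, document_count=4, docs_with_word=1)
common = tfidf_weight(count_in_text=1, document_count=4, docs_with_word=4)
repeated = tfidf_weight(count_in_text=3, document_count=4, docs_with_word=1)

print(round(rare, 4), round(common, 4), round(repeated, 4))
```

A word that appears in few documents gets a larger weight than one that appears in all of them, and repeating a word within a text raises its weight only logarithmically.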
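Keras documentation
English: https://keras.io/
Chinese: http://keras-cn.readthedocs.io/en/latest/
Example downloads
https://github.com/keras-team/keras
https://github.com/keras-team/keras/tree/master/examples
Full project download
For readers without download points: contact QQ 452205574 for access to the shared folder.
It includes the code, datasets (images), trained models, library installation files, and more.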