gensim.models.word2vec parameters

amiliaaaa


def __init__(self, sentences=None, 
                   corpus_file=None, 
                   size=100, 
                   alpha=0.025, 
                   window=5, 
                   min_count=5,
                   max_vocab_size=None, 
                   sample=1e-3, 
                   seed=1, 
                   workers=3, 
                   min_alpha=0.0001,
                   sg=0, 
                   hs=0, 
                   negative=5, 
                   ns_exponent=0.75, 
                   cbow_mean=1, 
                   hashfxn=hash, 
                   iter=5, 
                   null_word=0,
                   trim_rule=None, 
                   sorted_vocab=1, 
                   batch_words=MAX_WORDS_IN_BATCH, 
                   compute_loss=False, 
                   callbacks=(),
                   max_final_vocab=None):

    Parameters (all optional)
    ----------
    sentences : iterable of iterables
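        The training corpus.
            For small corpora: a list of lists of tokens.
            For large corpora: consider an iterable that streams the sentences directly from disk/network.
        If `sentences` is not provided, the model is left uninitialized; use this when you plan to initialize it in some other way.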
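
The streaming option can be sketched as a small restartable iterable. This is a minimal illustration using only the standard library; the class name `StreamedSentences` and the one-sentence-per-line, whitespace-tokenized file layout are assumptions of this sketch, not part of gensim.

```python
import os
import tempfile


class StreamedSentences:
    """Restartable iterable yielding one whitespace-tokenized sentence per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated
        # more than once (a vocabulary scan, then each training epoch).
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


# A temporary file stands in for a large on-disk corpus.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("cat say meow\n\ndog say woof\n")
    corpus_path = tmp.name

sentences = StreamedSentences(corpus_path)
first_pass = list(sentences)
second_pass = list(sentences)  # a plain generator would be empty here
os.remove(corpus_path)
```

An object like this, rather than a one-shot generator, is what the `sentences` argument needs, because training reads the corpus several times.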
       
    corpus_file : str
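        Path to a corpus file in LineSentence format.
        Can be passed instead of `sentences` for a performance boost. Pass only one of the two; if neither is given, the model is left uninitialized.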

    size : int
        Dimensionality of the word vectors (renamed `vector_size` in gensim 4.0).

    window : int
        Maximum distance between the current word and the predicted word within a sentence.

    min_count : int
        Ignores all words with a total frequency lower than this.

    workers : int
        Number of worker threads used to train the model (faster training on multicore machines).

    sg : {0, 1}
        Training algorithm: 1 for skip-gram; 0 for CBOW.

    hs : {0, 1}
        1: hierarchical softmax is used for training.
        0: negative sampling is used, provided `negative` is non-zero.

    negative : int
        > 0: negative sampling is used; the value specifies how many "noise words" are drawn (usually between 5 and 20).
        = 0: no negative sampling is used.

    ns_exponent : float
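        The exponent used to shape the negative sampling distribution.
        = 1.0: samples exactly in proportion to the frequencies.
        = 0.0: samples all words equally.
        < 0.0: samples low-frequency words more than high-frequency words.
        The default of 0.75 comes from the original Word2Vec paper. More recently, Caselles-Dupré, Lesaint, & Royo-Letelier suggested that other values may perform better; see https://arxiv.org/abs/1804.04212.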
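
The effect of the exponent is easiest to see on a toy vocabulary. The following is a minimal sketch, assuming the noise distribution is simply each raw count raised to `ns_exponent` and then normalized; the helper name `noise_distribution` and the counts are made up for illustration.

```python
def noise_distribution(counts, ns_exponent=0.75):
    """Return sampling probabilities proportional to count ** ns_exponent."""
    weights = {word: count ** ns_exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Hypothetical word counts spanning three orders of magnitude.
counts = {"the": 1000, "cat": 100, "meow": 10}

for exponent in (1.0, 0.75, 0.0, -0.5):
    probs = noise_distribution(counts, exponent)
    print(exponent, {word: round(p, 3) for word, p in probs.items()})
```

With 1.0 the frequent word dominates, with 0.0 every word is equally likely, and with a negative exponent the rare word is drawn most often.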

    cbow_mean : {0, 1}
        0: use the sum of the context word vectors.
        1: use the mean. Only applies when CBOW is used.
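
The two settings differ only in how the context vectors are combined before prediction. Here is a minimal sketch with plain Python lists; the function name `combine_context` and the toy vectors are illustrative assumptions, and gensim itself does this in optimized NumPy/Cython code.

```python
def combine_context(vectors, cbow_mean=1):
    """Combine context word vectors by summing (0) or averaging (1)."""
    summed = [sum(column) for column in zip(*vectors)]
    if cbow_mean:
        return [value / len(vectors) for value in summed]
    return summed


# Three hypothetical 2-dimensional context vectors.
context = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

summed = combine_context(context, cbow_mean=0)
averaged = combine_context(context, cbow_mean=1)
```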

    alpha : float
        The initial learning rate.

    min_alpha : float
        The minimum value the learning rate drops to during training.
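
How `alpha` and `min_alpha` interact can be sketched as a simple schedule. The gensim docstring describes the drop as linear; the function below is a minimal illustration of that idea, where `progress` (the fraction of training completed) and the function name are assumptions of this sketch rather than gensim API.

```python
def learning_rate(progress, alpha=0.025, min_alpha=0.0001):
    """Linearly interpolate from alpha (progress=0) to min_alpha (progress=1)."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return alpha - (alpha - min_alpha) * progress


for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(fraction, round(learning_rate(fraction), 6))
```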

    seed : int
        Seed for the random number generator. The initial vector for each word is seeded with a hash of the concatenation of word + `str(seed)`.
        Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (`workers=1`), to eliminate ordering jitter from OS thread scheduling.
        (In Python 3, reproducibility between interpreter launches also requires the `PYTHONHASHSEED` environment variable to control hash randomization.)
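
The "hash of word + `str(seed)`" idea can be sketched as follows: each word gets its own small random generator, seeded from a hash of the word and the global seed, so the same word always starts from the same vector. This is an illustration only. It uses the standard library's `random` module and `zlib.crc32` as a stand-in hash (gensim uses NumPy and, by default, Python's built-in `hash`), and the scaling of the values is an assumption of this sketch.

```python
import random
import zlib


def stable_hash(text):
    """Deterministic stand-in for hashfxn; unaffected by PYTHONHASHSEED."""
    return zlib.crc32(text.encode("utf-8"))


def seeded_vector(word, seed=1, size=4, hashfxn=stable_hash):
    """Build a reproducible initial vector for one word."""
    # Seed a private generator from hash(word + str(seed)).
    rng = random.Random(hashfxn(word + str(seed)) & 0xFFFFFFFF)
    # Small values centered on zero; the exact scaling is illustrative.
    return [(rng.random() - 0.5) / size for _ in range(size)]


print(seeded_vector("cat"))
print(seeded_vector("cat", seed=2))
```

Because the built-in `hash` of a string changes between interpreter launches in Python 3 unless `PYTHONHASHSEED` is fixed, a default `hashfxn=hash` would not give the same vectors from one run to the next, which is what the note above is about.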

    max_vocab_size : int
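        Limits RAM during vocabulary building: if there are more unique words than this, the infrequent ones are pruned. Every 10 million word types need about 1GB of RAM.
        None: no limit.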

    max_final_vocab : int
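        Limits the vocab to a target size by automatically picking a matching min_count. If the specified min_count is higher than the calculated min_count, the specified min_count is used.
        None: no target size.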
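
The interplay between `max_final_vocab` and `min_count` can be sketched with a toy vocabulary. This is a minimal illustration under the assumption that the calculated threshold is one more than the count of the first word that falls outside the target size; the function name and the counts are hypothetical.

```python
def effective_min_count(raw_counts, min_count=5, max_final_vocab=None):
    """Pick the min_count that actually applies for a target vocab size."""
    if max_final_vocab is None or len(raw_counts) <= max_final_vocab:
        return min_count
    ranked = sorted(raw_counts.values(), reverse=True)
    # Smallest threshold that excludes the first word past the target size.
    calculated = ranked[max_final_vocab] + 1
    # The user-specified min_count wins if it is stricter.
    return max(calculated, min_count)


# Hypothetical raw word counts.
raw_counts = {"the": 50, "cat": 20, "dog": 20, "meow": 8, "woof": 3}

threshold = effective_min_count(raw_counts, min_count=5, max_final_vocab=3)
kept = [word for word, count in raw_counts.items() if count >= threshold]
```

Note that ties at the cut-off can leave the vocabulary smaller than the target: with `max_final_vocab=2` here, both words with a count of 20 are dropped.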

    sample : float
        Threshold for configuring which higher-frequency words are randomly downsampled. Useful range: (0, 1e-5).
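
What the threshold does to individual words can be sketched with a keep-probability function. The formula below is an assumption modeled on the original word2vec C code and covers only the case 0 < `sample` < 1; the function name and the example counts are hypothetical.

```python
import math


def keep_probability(word_count, total_words, sample=1e-3):
    """Probability of keeping one occurrence of a word during training."""
    if sample <= 0:
        return 1.0  # downsampling disabled
    threshold_count = sample * total_words
    probability = (math.sqrt(word_count / threshold_count) + 1) * (threshold_count / word_count)
    return min(probability, 1.0)


total = 1_000_000
for count in (10, 1_000, 100_000):
    print(count, round(keep_probability(count, total), 4))
```

Words whose count is below roughly `sample * total_words` are always kept, while very frequent words are dropped most of the time.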

    hashfxn : function
        Hash function used to randomly initialize the weights, for better training reproducibility.

    iter : int
        Number of iterations (epochs) over the corpus (renamed `epochs` in gensim 4.0).

    trim_rule : function
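        Vocabulary trimming rule. Specifies whether certain words should remain in the vocabulary, be trimmed away, or be handled using the default (discard if word count < min_count).
        Can be None (min_count is used, see :func:`~gensim.utils.keep_vocab_item`), or a callable that accepts the parameters (word, count, min_count) and returns either :attr:`gensim.utils.RULE_DISCARD`, :attr:`gensim.utils.RULE_KEEP` or :attr:`gensim.utils.RULE_DEFAULT`.
        If given, the rule is only used to prune the vocabulary during build_vocab() and is not stored as part of the model.

        The input parameters are of the following types:
            * `word` (str) - the word being examined
            * `count` (int) - the word's frequency count in the corpus
            * `min_count` (int) - the minimum count threshold.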
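
A trimming rule is just a callable with the (word, count, min_count) signature described above. Below is a minimal sketch; the rule itself, the name `my_trim_rule` and the `PROTECTED_WORDS` set are hypothetical, and the fallback constants exist only so the snippet runs where gensim is not installed.

```python
try:
    from gensim.utils import RULE_DEFAULT, RULE_DISCARD, RULE_KEEP
except ImportError:
    # Stand-in values so the sketch runs without gensim; not the real constants.
    RULE_DEFAULT, RULE_DISCARD, RULE_KEEP = "default", "discard", "keep"

# Hypothetical domain terms that must survive regardless of frequency.
PROTECTED_WORDS = {"meow", "woof"}


def my_trim_rule(word, count, min_count):
    """Keep protected words, drop pure numbers, defer to min_count otherwise."""
    if word in PROTECTED_WORDS:
        return RULE_KEEP
    if word.isdigit():
        return RULE_DISCARD
    return RULE_DEFAULT


print(my_trim_rule("meow", 1, 5), my_trim_rule("2024", 99, 5), my_trim_rule("cat", 3, 5))
```

Such a function would be passed as `trim_rule=my_trim_rule` when constructing the model.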

    sorted_vocab : {0, 1}
        1: sort the vocabulary by descending frequency before assigning word indexes.
        See :meth:`~gensim.models.word2vec.Word2VecVocab.sort_vocab()`.

    batch_words : int
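        Target size (in words) for batches of examples passed to worker threads (and thus to the cython routines). (Larger batches are passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)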

    compute_loss: bool
        If True, computes and stores the loss value, which can be retrieved using
        :meth:`~gensim.models.word2vec.Word2Vec.get_latest_training_loss`.

    callbacks : iterable of :class:`~gensim.models.callbacks.CallbackAny2Vec`
        Sequence of callbacks to be executed at specific stages during training.

    Examples (from the gensim documentation)
    --------
    Initialize and train a :class:`~gensim.models.word2vec.Word2Vec` model

    .. sourcecode:: pycon

        >>> from gensim.models import Word2Vec
        >>> sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
        >>> model = Word2Vec(sentences, min_count=1)
