gensim LDA parameters

Reposted from: [nlp] How to use gensim LDA

Model persistency is achieved through the load() and save() methods.
Parameters

corpus ({iterable of list of (int, float), scipy.sparse.csc}, optional) – Stream of document vectors or sparse matrix of shape (num_terms, num_documents). If not given, the model is left untrained (presumably because you want to call update() manually).

num_topics (int, optional) – The number of requested latent topics to be extracted from the training corpus.

id2word ({dict of (int, str), gensim.corpora.dictionary.Dictionary}) – Mapping from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing.

distributed (bool, optional) – Whether distributed computing should be used to accelerate training.

chunksize (int, optional) – Number of documents to be used in each training chunk.

passes (int, optional) – Number of passes through the corpus during training.

update_every (int, optional) – Number of documents to be iterated through for each update. Set to 0 for batch learning, > 1 for online iterative learning.

alpha ({numpy.ndarray, str}, optional) –

Can be set to a 1D array of length equal to the number of expected topics that expresses our a-priori belief for each topic's probability. Alternatively, default prior selecting strategies can be employed by supplying a string:

'asymmetric': Uses a fixed normalized asymmetric prior of 1.0 / topicno.

'auto': Learns an asymmetric prior from the corpus (not available if distributed==True).

eta ({float, np.array, str}, optional) –

A-priori belief on word probability, this can be:

scalar for a symmetric prior over topic/word probability,

vector of length num_words to denote an asymmetric user defined probability for each word,

matrix of shape (num_topics, num_words) to assign a probability for each word-topic combination,

the string 'auto' to learn the asymmetric prior from the data.

decay (float, optional) – A number between (0.5, 1] to weight what percentage of the previous lambda value is forgotten when each new document is examined. Corresponds to Kappa from Matthew D. Hoffman, David M. Blei, Francis Bach: “Online Learning for Latent Dirichlet Allocation NIPS’10”.

offset (float, optional) –

Hyper-parameter that controls how much we will slow down the first steps of the first few iterations. Corresponds to Tau_0 from Matthew D. Hoffman, David M. Blei, Francis Bach: “Online Learning for Latent Dirichlet Allocation NIPS’10”.

eval_every (int, optional) – Log perplexity is estimated every that many updates. Setting this to one slows down training by ~2x.

iterations (int, optional) – Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.

gamma_threshold (float, optional) – Minimum change in the value of the gamma parameters to continue iterating.

minimum_probability (float, optional) – Topics with a probability lower than this threshold will be filtered out.

random_state ({np.random.RandomState, int}, optional) – Either a RandomState object or a seed to generate one. Useful for reproducibility.

ns_conf (dict of (str, object), optional) – Keyword parameters propagated to gensim.utils.getNS() to get a Pyro4 nameserver. Only used if distributed is set to True.

minimum_phi_value (float, optional) – If per_word_topics is True, this represents a lower bound on the term probabilities.

per_word_topics (bool) – If True, the model also computes a list of topics, sorted in descending order of most likely topics for each word, along with their phi values multiplied by the feature length (i.e. word count).

callbacks (list of Callback) – Metric callbacks to log and visualize evaluation metrics of the model during training.

dtype ({numpy.float16, numpy.float32, numpy.float64}, optional) – Data-type to use during calculations inside model. All inputs are also converted.

bound(corpus, gamma=None, subsample_ratio=1.0)

Estimate the variational bound of documents from the corpus as E_q[log p(corpus)] - E_q[log q(corpus)].

Parameters

corpus ({iterable of list of (int, float), scipy.sparse.csc}, optional) – Stream of document vectors or sparse matrix of shape (num_terms, num_documents) used to estimate the variational bounds.

gamma (numpy.ndarray, optional) – Topic weight variational parameters for each document. If not supplied, it will be inferred from the model.

subsample_ratio (float, optional) – Percentage of the whole corpus represented by the passed corpus argument (in case this was a sample). Set to 1.0 if the whole corpus was passed. This is used as a multiplicative factor to scale the likelihood appropriately.
