Artificial Intelligence: Python Implementation, Chapter 10, NLP Day 7: Topic Models

Generative topic models for documents

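A topic model is a statistical model used to discover the abstract topics in a collection of documents. If a text contains several topics, this technique can identify and separate them. Doing so uncovers the hidden thematic structure of a given set of texts.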

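Topic modeling helps us organize documents in a form that lends itself to analysis. Notably, topic modeling algorithms need no labeled data: they are a form of unsupervised learning and identify patterns on their own. Given the enormous volume of text produced on the web, topic modeling matters because it lets us summarize amounts of data that no person could read through by hand.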

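LDA (Latent Dirichlet Allocation) is a generative topic model for documents, also described as a three-level Bayesian probabilistic model with word, topic, and document levels. "Generative" means we assume that every word in a document is produced by a process of "choosing a topic with some probability, then choosing a word from that topic with some probability". The document-to-topic distribution is multinomial, and so is the topic-to-word distribution; LDA places a Dirichlet prior on each of them, which is where the name comes from.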
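The generative process described above can be sketched in a few lines of standard-library Python. This is only an illustration of the sampling story, not how gensim fits a model: the vocabulary, the topic-word probabilities, and the prior `alpha` are hypothetical toy values, and the helper names are made up for this example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Return an index drawn according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    # Per-document topic mixture, drawn from a Dirichlet prior
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        z = sample_categorical(theta, rng)          # choose a topic
        w = sample_categorical(topic_word[z], rng)  # choose a word from that topic
        words.append(vocab[w])
    return theta, words

# Hypothetical toy setup: two topics over a six-word vocabulary
vocab = ['empire', 'war', 'europe', 'set', 'number', 'axiom']
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "history"-like topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "mathematics"-like topic
]
rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
print('topic mixture:', [round(t, 3) for t in theta])
print('document:', ' '.join(words))
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions that most plausibly produced them.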

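LDA is an unsupervised machine learning technique that can identify the latent topics in a large document collection or corpus. It uses the bag-of-words approach, which treats each document as a vector of word counts and so turns text into numbers that are easy to model. Bag of words ignores the order of the words, which simplifies the problem and also leaves room for improving the model. Each document is represented as a probability distribution over topics, and each topic is represented as a probability distribution over words.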
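To make the bag-of-words idea concrete, here is a minimal sketch using only the standard library. It mimics the general shape of an (id, count) representation but is not gensim's implementation; `build_vocabulary` and `to_bag_of_words` are hypothetical helper names, and the token lists are toy data.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id

def to_bag_of_words(tokens, token_to_id):
    """Return sorted (token_id, count) pairs; word order is discarded."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())

docs = [
    ['empire', 'expand', 'empire'],
    ['expand', 'empire', 'empire'],  # same words, different order
    ['set', 'axiom'],
]
token_to_id = build_vocabulary(docs)
bows = [to_bag_of_words(d, token_to_id) for d in docs]
print(bows)
```

The first two documents map to the same vector even though their word order differs, which is exactly the information that bag of words throws away.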

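In this section we use a library called gensim, which we already installed in the first section. The code also relies on NLTK's English stop word list, which has to be downloaded once with nltk.download('stopwords').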

The code is as follows:

from nltk.tokenize import RegexpTokenizer  
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora


# Load input data
def load_data(input_file):
    data = []
    with open(input_file, 'r') as f:
        for line in f:
            # Strip the newline and surrounding whitespace; skip blank lines
            line = line.strip()
            if line:
                data.append(line)

    return data


# Processor function for tokenizing, removing stop 
# words, and stemming
def process(input_text):
    # Create a regular expression tokenizer
    tokenizer = RegexpTokenizer(r'\w+')

    # Create a Snowball stemmer
    stemmer = SnowballStemmer('english')

    # Get the stop words as a set for fast membership tests
    stop_words = set(stopwords.words('english'))

    # Tokenize the input string
    tokens = tokenizer.tokenize(input_text.lower())

    # Remove the stop words
    tokens = [x for x in tokens if x not in stop_words]

    # Perform stemming on the tokenized words
    tokens_stemmed = [stemmer.stem(x) for x in tokens]

    return tokens_stemmed
    
if __name__ == '__main__':
    # Load input data
    data = load_data('data.txt')

    # Create a list of token lists, one per sentence
    tokens = [process(x) for x in data]

    # Create a dictionary based on the sentence tokens
    dict_tokens = corpora.Dictionary(tokens)

    # Create a document-term matrix (one bag-of-words vector per sentence)
    doc_term_mat = [dict_tokens.doc2bow(doc_tokens) for doc_tokens in tokens]

    # Define the number of topics for the LDA model
    num_topics = 2

    # Generate the LDA model
    ldamodel = models.ldamodel.LdaModel(doc_term_mat,
            num_topics=num_topics, id2word=dict_tokens, passes=25)

    # Number of top words to show for each topic
    num_words = 5
    print('\nTop ' + str(num_words) + ' contributing words to each topic:')
    for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
        print('\nTopic', item[0])

        # Print the contributing words along with their relative contributions
        list_of_strings = item[1].split(' + ')
        for text in list_of_strings:
            weight, word = text.split('*')
            print(word, '==>', str(round(float(weight) * 100, 2)) + '%')
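The printing loop above depends on each topic being formatted as a string of `weight*word` terms joined by ` + `. As a hedged illustration of that parsing step on its own, the sketch below turns such a string into (word, weight) pairs. The `sample` string is made up for the example and is not output from the model above, the helper name `parse_topic_string` is hypothetical, and the assumption that words may be wrapped in double quotes is based on how recent gensim versions format topics.

```python
def parse_topic_string(topic_string):
    """Split a formatted topic string into (word, weight) pairs."""
    pairs = []
    for term in topic_string.split(' + '):
        # Each term looks like 0.045*"empir"; the quotes may be absent
        weight, _, word = term.partition('*')
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs

# Made-up example string, not real model output
sample = '0.045*"empir" + 0.030*"histori" + 0.012*"europ"'
for word, weight in parse_topic_string(sample):
    print(word, '==>', str(round(weight * 100, 2)) + '%')
```

Stripping the quotes keeps the printed words clean, and returning numeric weights makes it easy to sort or filter the terms before printing.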

data.txt

The Roman empire expanded very rapidly and it was the biggest empire in the world for a long time.
An algebraic structure is a set with one or more finitary operations defined on it that satisfies a list of axioms.
Renaissance started as a cultural movement in Italy in the Late Medieval period and later spread to the rest of Europe.
The line of demarcation between prehistoric and historical times is crossed when people cease to live only in the present.
Mathematicians seek out patterns and use them to formulate new conjectures.  
A notational symbol that represents a number is called a numeral in mathematics. 
The process of extracting the underlying essence of a mathematical concept is called abstraction.
Historically, people have frequently waged wars against each other in order to expand their empires.
Ancient history indicates that various outside influences have helped formulate the culture and traditions of Eastern Europe.
Mappings between sets which preserve structures are of special interest in many fields of mathematics.

Output