Semantic Analysis in Python: Discovering Hidden Topics in Documents with Latent Semantic Analysis


Topic discovery is useful for clustering documents, organizing online content for information retrieval, and making recommendations. Content providers and news agencies use topic models to recommend articles to readers. Similarly, recruiting firms use them to extract information from job descriptions and map it to candidates' skill sets. A data scientist's job is largely about extracting 'knowledge' from large amounts of collected data, and most of that data is unstructured. Analyzing and understanding it requires powerful tools and techniques.

Topic modeling is a text mining technique that identifies co-occurring keywords in order to summarize large collections of text. It helps you discover hidden topics in documents, annotate the documents with those topics, and organize large amounts of unstructured data.

In this tutorial, you will cover the following topics:

  1. What is Topic Modelling?
  2. Comparison between text classification and topic modeling
  3. Latent Semantic Analysis
  4. Implementing LSA in Python using Gensim
  5. Determine the optimum number of topics in a document
  6. Pros and cons of LSA
  7. Use cases of Topic Modelling
  8. Conclusion

For more such tutorials, projects, and courses, visit DataCamp.

Topic Modelling

Topic modelling automatically discovers the hidden themes in a given set of documents. It is an unsupervised text analytics technique that finds groups of words in the documents, and each group of words represents a topic. A single document can be associated with multiple topics. For example, a group of words such as 'patient', 'doctor', 'disease', 'cancer', and 'health' represents the topic 'healthcare'. Topic modelling differs from rule-based text searching with regular expressions, where the keywords to look for must be specified in advance.
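To make the contrast with rule-based searching concrete, here is a minimal sketch of the regular-expression approach. The pattern `HEALTHCARE_PATTERN` and the helper `healthcare_keywords` are hypothetical names introduced only for this illustration; the keyword list is the one from the example above.

```python
import re

# Hypothetical hand-written rule: every keyword has to be chosen in advance.
HEALTHCARE_PATTERN = re.compile(
    r"\b(patient|doctor|disease|cancer|health)\b", re.IGNORECASE
)

def healthcare_keywords(text):
    """Return the distinct healthcare keywords found in text, sorted."""
    return sorted({match.lower() for match in HEALTHCARE_PATTERN.findall(text)})

print(healthcare_keywords("The doctor discussed the disease with the patient."))
# ['disease', 'doctor', 'patient']

print(healthcare_keywords("The physician reviewed the tumour biopsy."))
# []
```

The second sentence is clearly about healthcare, but the rule misses it because none of its words are in the hand-picked list. A topic model instead learns which words belong together from how they co-occur across the documents.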

Comparison Between Text Classification and Topic Modeling

Text classification is a supervised machine learning problem in which a text document or article is assigned to one of a pre-defined set of classes. Topic modeling is the process of discovering groups of co-occurring words in text documents; these groups of related, co-occurring words make up "topics". It is a form of unsupervised learning, so the set of possible topics is not known in advance. Topic modeling can also be used to help solve the text classification problem. Topic modeling identifies the topics present in a document, while text classification assigns the text to a single class.


Latent Semantic Analysis

LSA (Latent Semantic Analysis), also known as LSI (Latent Semantic Indexing), uses a bag-of-words (BoW) model, which results in a term-document matrix recording the occurrence of terms in each document. Rows represent terms and columns represent documents. LSA learns latent topics by performing a matrix decomposition on the term-document matrix using singular value decomposition (SVD). LSA is typically used as a dimension reduction or noise-reducing technique.
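The steps described above can be sketched in a few lines. This is only an illustration, assuming NumPy is available; it uses `numpy.linalg.svd` directly rather than Gensim, and the toy corpus and the helper names `build_term_document_matrix`, `lsa`, and `cosine` are made up for the example.

```python
import numpy as np

# Toy corpus, invented for illustration: two healthcare and two finance documents.
documents = [
    "doctor patient disease",
    "patient cancer health doctor",
    "stock market trading",
    "market investor stock price trading",
]

def build_term_document_matrix(docs):
    """Return (vocabulary, count matrix) with terms as rows and documents as columns."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    row_of = {term: i for i, term in enumerate(vocabulary)}
    matrix = np.zeros((len(vocabulary), len(docs)))
    for col, tokens in enumerate(tokenized):
        for term in tokens:
            matrix[row_of[term], col] += 1
    return vocabulary, matrix

def lsa(matrix, num_topics):
    """Truncated SVD: keep only the num_topics largest singular values."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :num_topics], s[:num_topics], Vt[:num_topics, :]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

vocabulary, A = build_term_document_matrix(documents)
term_topics, strengths, topic_docs = lsa(A, num_topics=2)

# One row per document: its coordinates in the 2-dimensional topic space.
doc_vectors = (np.diag(strengths) @ topic_docs).T

print(A.shape)                                  # (terms, documents)
print(cosine(doc_vectors[0], doc_vectors[1]))   # two healthcare documents
print(cosine(doc_vectors[0], doc_vectors[2]))   # healthcare vs. finance
```

In the reduced space, the two healthcare documents point in the same direction, while a healthcare document and a finance document are orthogonal, because they share no terms. Keeping only the largest singular values is what gives LSA its dimension-reduction and noise-reducing effect.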
