Data Science and Big Data Analytics Study Notes 9: Text Analysis

Text Analysis

– Refers to the representation, processing, and modeling of textual data to derive useful insights.
– Suffers from the curse of dimensionality.
– In most cases, text is unstructured.

Corpus

– A large collection of texts (documents) used for various purposes in natural language processing.

Text Analysis Steps

• Parsing
– Takes unstructured text and imposes a structure for further analysis.

• Search and retrieval
– Identification of the documents in a corpus that contain search items (key terms).

• Text mining
– Use the results of the prior steps to discover meaningful insights.
– Clustering and classification techniques can be adapted to text mining. For example:

• Cluster documents into groups.
• Classify texts for sentiment analysis.
– Utilises various methods and techniques:
• Statistical analysis.
• Information retrieval.
• Data mining and Natural Language Processing.

A Text Analysis Example
[Figure: a text analysis example]

Challenges: Syntax, Semantics, Pragmatics

Syntax concerns the sentence structure and the rules of grammar.
Semantics is the study of the meaning of sentences.
Pragmatics concerns the meaning of sentences in a certain context.
Homonyms are words that have the same spelling but different meanings.
Acronyms are abbreviations formed from the initial components of a word or phrase (e.g., NLP for natural language processing).

Collecting Raw Text
For text analysis, data must be collected before anything else can happen.
This corresponds to Phases 1 and 2 of the Data Analytics Lifecycle.

Representing Text
• Raw text needs to be transformed with text normalization techniques.

Tokenization
The task of separating words from the body of text.
Tokenizing based on spaces alone leaves punctuation attached to words (e.g., “day” vs. “day.”).
Tokenizing based on punctuation marks and spaces avoids this.
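A minimal sketch of both approaches, using only the Python standard library (the sample sentence and the regular expression are illustrative choices):

```python
import re

text = "Every day is a new day. A good day!"

# Tokenizing based on spaces alone: punctuation stays attached to words,
# so "day." and "day!" are different tokens from "day".
print(text.split(" "))
# ['Every', 'day', 'is', 'a', 'new', 'day.', 'A', 'good', 'day!']

# Tokenizing based on punctuation marks and spaces: keep only runs of
# word characters, so "day", "day." and "day!" all yield the token "day".
print(re.findall(r"\w+", text))
# ['Every', 'day', 'is', 'a', 'new', 'day', 'A', 'good', 'day']
```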

Case folding
Reduces all letters to lowercase (or uppercase).
May need a lookup table of words not to be case folded. Examples: WHO (the organization) vs. who; US vs. us.

Stop words
Not all the words of a given language need to be considered. Examples: the, a, of, and, to.
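A minimal filtering sketch, assuming the nltk package and its stop word data (nltk.download('stopwords')) are installed:

```python
from nltk.corpus import stopwords  # assumes nltk.download('stopwords')

stop = set(stopwords.words("english"))  # 'the', 'a', 'of', 'and', 'to', ...
tokens = ["the", "dog", "bites", "a", "man"]

# Keep only the content-bearing tokens.
print([t for t in tokens if t not in stop])  # ['dog', 'bites', 'man']
```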

Lemmatization and stemming.
Stemming strips affixes by rule; lemmatization maps a word to its dictionary form (lemma) using a vocabulary.
Walk: walking, walk, walks, walked… (stemmed)
Goose: geese, goose, gander, ganders (lemmatized)
Good: good, better (lemmatized)
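A sketch of the difference using NLTK (the library choice is an assumption; any stemmer/lemmatizer would do). It assumes nltk is installed along with the WordNet data (nltk.download('wordnet')):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming strips affixes by rule: walking/walks/walked all reduce to "walk".
print([stemmer.stem(w) for w in ["walking", "walk", "walks", "walked"]])
# ['walk', 'walk', 'walk', 'walk']

# Lemmatization maps a word to its dictionary form using a vocabulary.
print(lemmatizer.lemmatize("geese"))            # 'goose'
print(lemmatizer.lemmatize("better", pos="a"))  # adjective -> 'good'
```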

Bag-of-words representation
A document becomes a high-dimensional vector, indicating the presence/absence/frequency of various words in this document.

For a given document, bag-of-words reduces the document to a set of terms, ignoring information such as order, context, inference, and discourse. Each word is treated as a term or token (usually the smallest unit of analysis). In many cases, bag-of-words additionally assumes that each term in the document is independent. The document then becomes a vector, with one dimension per distinct term in the space, and the terms are unordered.
“a dog bites a man” same as “a man bites a dog”
博主在写文章 (“the blogger writes the article”) same as 文章在写博主 (“the article writes the blogger”)
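A minimal sketch with scikit-learn's CountVectorizer (the library choice is an assumption; any term-counting routine illustrates the same point):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a dog bites a man", "a man bites a dog"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# ['bites' 'dog' 'man']  (single-letter tokens like 'a' are dropped by default)
print(bow.toarray())
# [[1 1 1]
#  [1 1 1]]  <- both sentences map to the same vector: word order is lost
```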
Corpus
A corpus is a collection of documents, usually focused on specific domains.
Some corpora include the information content of every word in its metadata.

Information content (IC)
– A metric that denotes the importance of a term in a corpus.
– Terms with higher IC values are more important.
However, IC cannot satisfy the need to analyse dynamically changing, unstructured data.
• Two problems
– Neither traditional corpora nor IC metadata change over time.
– A traditional corpus limits the knowledge available to a text analysis algorithm.
We need a metric that, unlike IC, adapts to the context and the nature of the text.

Term Frequency – Inverse Document Frequency (TFIDF)
TFIDF is based entirely on all the fetched documents.
TFIDF can be easily updated once the fetched documents change.
TFIDF is a measure widely used in text analysis.

Term Frequency (TF) of a term in a document: the number of times the term appears in that document.
Zipf’s Law: the i-th most common word occurs with approximately 1/i the frequency of the most frequent word.
[Figure: term frequency versus rank, illustrating Zipf’s Law]
Document Frequency
• Document Frequency (DF) of a term: the number of documents in a corpus that contain the term.
Inverse Document Frequency
The IDF of a term t is commonly defined as IDF(t) = log(N / DF(t)), where N is the total number of documents in the corpus.
Inverse Document Frequency of a term:
• The IDF of a rare term would be high.
• The IDF of a frequent term would be low.
• IDF depends solely on the DF.

Term Frequency – Inverse Document Frequency (TFIDF)
• A measure that considers:
– The prevalence of a term within a document (TF).
– The scarcity of the term over the corpus (IDF).
• The TFIDF of a term t in a document d is

TFIDF(t, d) = TF(t, d) × IDF(t)
• TFIDF scores a term higher the more often it appears within a document and the less often it appears across the corpus.
Note that TFIDF applies to a term in a specific document, so the same term is likely to receive different TFIDF values in different documents.
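A small worked sketch of the formula above in pure Python (the three-document corpus and whitespace tokenization are illustrative assumptions):

```python
import math

corpus = ["the dog bites the man",
          "the man walks the dog",
          "a cat sleeps"]
docs = [doc.split() for doc in corpus]
N = len(docs)

def tf(t, d):
    return d.count(t)                    # occurrences of term t in document d

def idf(t):
    df = sum(1 for d in docs if t in d)  # document frequency of t
    return math.log(N / df)

def tfidf(t, d):
    return tf(t, d) * idf(t)

# "the" is frequent across the corpus, so its IDF (and TFIDF) is low;
# "cat" appears in only one document, so it scores higher.
print(round(tfidf("the", docs[0]), 3))  # 2 * log(3/2) ≈ 0.811
print(round(tfidf("cat", docs[2]), 3))  # 1 * log(3/1) ≈ 1.099
```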

Categorizing Documents by Topics

• Topic modelling:
– Provides short descriptions for documents.
– Helps to organize, search, understand, and summarize text.
• Topic models are statistical models that:
– examine words from a set of documents,
– determine the themes over the text, and
– discover how the themes are associated or change over time.

The process of topic modeling

  1. Uncover the hidden topical patterns within a corpus.
  2. Annotate documents according to these topics.
  3. Use annotations to organize, search, and summarize texts.

• A topic is formally defined as a distribution over a fixed vocabulary of words.
– Different topics have different distributions over the same vocabulary.
• A topic can be viewed as a cluster of words with related meanings.
– A word from the vocabulary can reside in multiple topics with different weights.
Topic models do not necessarily require prior knowledge of the texts: topics can be formed entirely from analyzing the text itself.

The simplest topic model is Latent Dirichlet Allocation (LDA)

A generative probabilistic model of a corpus.
In generative probabilistic modeling, data is treated as the result of a generative process that includes hidden variables.

In LDA, documents are treated as the result of a generative process
LDA assumes
– There is a fixed vocabulary of words.
– The number of the latent topics is predefined.
– Each latent topic is characterised by a distribution over words in a vocabulary.
– Each document is represented as a random mixture over latent topics.

How a document is generated via LDA
– Choose the length N of the document.
– Choose a distribution over the topics.
– For each of the N words of this document:
• Choose a topic based on the above distribution.
• Choose a word from the corresponding topic.
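A minimal fitting sketch with scikit-learn's LatentDirichletAllocation (the library and the toy documents are assumptions; the notes are not tied to any particular implementation):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the neural network learns the weights",
        "training a neural network with data",
        "the policy addresses the economic problem",
        "a new policy for the economic report"]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)         # bag-of-words counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)  # topic count predefined
doc_topics = lda.fit_transform(counts)          # per-document topic mixture

words = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):   # per-topic word weights
    top = weights.argsort()[::-1][:3]
    print(f"topic {k}:", [words[i] for i in top])
print(doc_topics.round(2))                      # each row sums to 1
```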

[Figure: the LDA generative process, showing topics, per-document topic proportions, and per-word topic assignments]
The left of the figure shows four topics built from a corpus, each containing the most important keywords from the vocabulary; the four topics relate to problem, policy, neural, and report respectively. As the histograms on the right show, each document has its own distribution over the topics. Each word of a document is then assigned a topic, and the word is drawn from the corresponding (color-coded) topic. In reality, only the documents (shown in the middle of the figure) are observable. The goal of LDA is to infer, for each document, the underlying topics, the topic proportions, and the topic assignments.

Determining Sentiments

Sentiment analysis
– Uses statistics and NLP to mine opinions, identifying and extracting subjective information from texts.

• Applications
– Detect the polarity of product or movie reviews.

• Analysis level
– Document, sentence, phrase, and short-text.

• Classification methods are often used to extract corpus statistics for sentiment analysis
– Naïve Bayes classifier, Maximum Entropy, Support Vector Machines, ….

• Movie review corpus
– Consists of 2000 movie reviews.
– Manually tagged into 1000 positive and 1000 negative reviews.

Using a Naïve Bayes classifier in Python:
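A sketch in the same spirit, using NLTK's movie_reviews corpus (not necessarily the exact code from the slides; assumes nltk is installed and nltk.download('movie_reviews') has been run):

```python
import random
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier, accuracy

def features(words):
    # Bag-of-words features: presence of each word.
    return {w: True for w in words}

labeled = [(features(movie_reviews.words(fid)), cat)
           for cat in movie_reviews.categories()   # 'neg', 'pos'
           for fid in movie_reviews.fileids(cat)]
random.seed(0)
random.shuffle(labeled)

train, test = labeled[:1600], labeled[1600:]       # 2000 reviews in total
classifier = NaiveBayesClassifier.train(train)
print(accuracy(classifier, test))
classifier.show_most_informative_features(5)
```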
Precision and recall can be computed from the confusion matrix output by the code.

Ideally, a good classifier has both precision and recall close to 1.

In information retrieval, a perfect precision of 1 means that every result retrieved by a search is relevant.
A perfect recall of 1 means that all the relevant documents were retrieved.
In reality, it is hard for a classifier to achieve high precision and high recall at the same time.
The data science team needs to check the cleanliness of the data, optimize the classifier, and find ways to improve precision while still keeping recall high.
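A tiny sketch of the computation from a 2×2 confusion matrix (the counts are made-up illustrative numbers, not results from the movie corpus):

```python
# Confusion-matrix counts (illustrative values).
TP, FP = 80, 20   # true positives, false positives
FN, TN = 10, 90   # false negatives, true negatives

precision = TP / (TP + FP)  # of all items flagged positive, how many are right
recall    = TP / (TP + FN)  # of all actual positives, how many were found
print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # 0.80, 0.89
```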

Gaining Insights

Word cloud (tag cloud)
The tags are usually single words, and the importance of each is conveyed by font size or color. More frequent words appear in relatively larger fonts.
[Figure: word cloud]
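A minimal sketch using the third-party wordcloud package together with matplotlib (both library choices are assumptions; reviews.txt is a hypothetical file of review text):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud  # pip install wordcloud

# "reviews.txt" is a hypothetical file containing the raw review text.
text = open("reviews.txt", encoding="utf-8").read()

# More frequent words are rendered in larger fonts, as described above.
cloud = WordCloud(background_color="white", max_words=100).generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```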

TFIDF can be used to highlight the informative words in a text.
Each word with a larger font size corresponds to a higher TFIDF value; each review is treated as a document.

Circular graph of topics obtained from LDA.
– Each topic focuses on describing a different aspect of the reviews.
– The disc size represents the weight of a word.

References

  1. Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, EMC Education Services, John Wiley & Sons, 2015.

  2. Data Mining: The Textbook, Charu C. Aggarwal, Springer, 2015.

  3. C.D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

  4. Computer Vision: A Modern Approach (2nd Edition), David A. Forsyth and Jean Ponce, Pearson, 2011.

Images are from the course slides and my own notes; the Chinese-language images are from the Internet.
