Data Science and Big Data Analytics Study Notes 9: Text Analysis

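Text Analysis

– Refers to the representation, processing, and modeling of textual data to derive useful insights.
– Suffers from the curse of dimensionality.
– In most cases, text is unstructured.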

Corpus

– A large collection of texts (documents) used for various purposes in natural language processing.

Text Analysis Steps

• Parsing
– Takes unstructured text and imposes a structure for further analysis.

• Search and retrieval
– Identification of the documents in a corpus that contain search items (key terms).

• Text mining
– Use the results of the prior steps to discover meaningful insights.
– Clustering and classification techniques can be adapted to text mining. For example:
    • Cluster documents into groups.
    • Classify texts for sentiment analysis.
– Utilises various methods and techniques:
    • Statistical analysis.
    • Information retrieval.
    • Data mining and Natural Language Processing.

A Text Analysis Example
[Figure: a text analysis example]

Challenges: syntax, semantics, and pragmatics

Syntax concerns the sentence structure and the rules of grammar.
Semantics is the study of the meaning of sentences.
Pragmatics concerns the meaning of sentences in a certain context.
Homonyms are words that have the same spelling but different meanings.
Acronyms are abbreviations formed from the initial letters of the words in a phrase.

Collecting Raw Text
For text analysis, data must be collected before anything can happen.
This corresponds to Phases 1 and 2 of the Data Analytics Lifecycle.

Representing Text
• Raw text needs to be transformed with text normalization techniques.

Tokenization
The task of separating words from the body of text.
Tokenizing based on spaces alone leaves punctuation attached, so “day” and “day.” become different tokens.
Tokenizing based on punctuation marks and spaces avoids this.
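A minimal sketch of the two tokenization strategies above, using only Python's standard library. The sample sentence is made up for illustration, and the regular expression is one simple choice among many, not a recommended tokenizer.

```python
import re

# Made-up sample text for illustration.
text = "It was a long day. What a day!"

# Splitting on whitespace alone keeps punctuation glued to the words.
space_tokens = text.split()

# Splitting on runs of non-word characters drops the punctuation.
# Note: this simple pattern would also split contractions such as "don't".
punct_tokens = [tok for tok in re.split(r"\W+", text) if tok]

print(space_tokens)
print(punct_tokens)
```

With whitespace splitting, "day." and "day!" are two distinct tokens; with the second approach both become "day".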

Case folding
Reduces all letters to lowercase (or uppercase).
May need to create a lookup table of words that should not be case folded. Examples: WHO, US.
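A small sketch of case folding with an exception table. The table `KEEP_CASE` and the function name `fold_case` are hypothetical names chosen for this example.

```python
# Hypothetical lookup table of tokens whose case carries meaning.
KEEP_CASE = {"WHO", "US"}

def fold_case(tokens, keep=KEEP_CASE):
    """Lowercase every token except those listed in `keep`."""
    return [tok if tok in keep else tok.lower() for tok in tokens]

print(fold_case(["The", "WHO", "advised", "US", "to", "help", "us"]))
```

Here "US" (the country) stays distinct from the pronoun "us", which plain lowercasing would merge.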

Stop words
Not every word in a language needs to be considered; very common words are often removed. Examples: the, a, of, and, to.
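Stop-word removal can be sketched as a simple filter. The stop list below contains only the five example words from these notes; real stop lists are much longer, and the function name is an assumption.

```python
# Tiny stop list taken from the examples in these notes.
STOP_WORDS = {"the", "a", "of", "and", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

print(remove_stop_words(["The", "price", "of", "a", "phone"]))
```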

Lemmatization and stemming.
Walk: walking, walk, walks, walked… (stemmed)
Goose: geese, goose, gander, ganders (lemmatized)
Good: good, better (lemmatized)
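To make the difference concrete, here is a toy sketch. The suffix list and the lemma dictionary are assumptions built from the examples above; they are not a real stemmer (such as Porter's) or a real lemmatizer, which rely on far richer rules and dictionaries.

```python
# Toy suffix list for stemming; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "s")

# Toy lemma dictionary built from the examples in these notes.
LEMMAS = {"geese": "goose", "gander": "goose", "ganders": "goose", "better": "good"}

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

print([stem(w) for w in ["walking", "walk", "walks", "walked"]])
print(stem("geese"), lemmatize("geese"), lemmatize("better"))
```

Suffix stripping handles the regular forms of "walk", but it cannot map "geese" to "goose" or "better" to "good"; that requires dictionary knowledge, which is what lemmatization adds.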

Bag-of-words representation
A document becomes a high-dimensional vector, indicating the presence/absence/frequency of various words in this document.

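Given a document, the bag-of-words approach represents it as a set of terms while ignoring order, context, inference, and discourse. Each word is treated as a term or token (usually the smallest unit of analysis). In many cases, bag-of-words additionally assumes that the terms in a document are independent. The document then becomes a vector with one dimension for each distinct term, and the terms are unordered.
“a dog bites a man” is the same as “a man bites a dog”.
“The blogger is writing an article” is the same as “the article is writing the blogger”.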
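The loss of word order in bag-of-words can be checked directly. This is a minimal sketch using `collections.Counter` from the standard library; the function name `bag_of_words` is chosen for this example.

```python
from collections import Counter

def bag_of_words(text):
    """Map each lowercase, whitespace-separated token to its count."""
    return Counter(text.lower().split())

a = bag_of_words("a dog bites a man")
b = bag_of_words("a man bites a dog")

# Fix an order for the vocabulary to turn the bag into a vector.
vocab = sorted(a)
vector = [a[word] for word in vocab]

print(a == b)
print(vocab, vector)
```

The two sentences produce identical bags, so any model built on this representation cannot tell them apart.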
Corpus
A corpus is a collection of documents, usually focused on specific domains.
Some corpora include the information content of every word in their metadata.

Information content (IC)
– A metric that denotes the importance of a term in a corpus.
– Terms with higher IC values are more important.
However, information content (IC) cannot satisfy the need to analyse dynamically changing, unstructured data.
• Two problems
– Both traditional corpora and IC metadata do not change over time.
– Traditional corpora limit the entire knowledge available to a text analysis algorithm.
We need a metric that, unlike IC, adapts to the context and the nature of the text.

Term Frequency – Inverse Document Frequency (TFIDF)
TFIDF is based entirely on all the fetched documents.
TFIDF can be easily updated once the fetched documents change.
TFIDF is a measure widely used in text analysis.

Zipf’s Law: the i-th most common word occurs approximately 1/i as often as the most frequent word.
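As a quick numeric illustration of Zipf's law, the sketch below predicts counts by rank. The count of 6000 for the top word is a made-up number, and the function name is an assumption.

```python
def zipf_expected_count(top_count, rank):
    """Expected count of the word at 1-based `rank` under Zipf's law."""
    return top_count / rank

# Hypothetical: the most frequent word occurs 6000 times.
for rank in range(1, 6):
    print(rank, zipf_expected_count(6000, rank))
```

Counts fall off quickly with rank: a few words dominate a corpus, while most words are rare.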
Document Frequency
• The document frequency (DF) of a term is the number of documents in a corpus that contain the term.
Inverse Document Frequency
• A common definition is $\mathrm{IDF}(t) = \log\frac{N}{\mathrm{DF}(t)}$, where $N$ is the number of documents in the corpus.
• The IDF of a rare term would be high.
• The IDF of a frequent term would be low.
• IDF solely depends on the DF.
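A minimal sketch of DF and IDF on a made-up three-document corpus. It assumes the common definition IDF(t) = log(N / DF(t)) with the natural logarithm; other variants (for example with smoothing) exist, and the function names are chosen for this example.

```python
import math

# Hypothetical tokenized corpus.
docs = [
    ["cheap", "phone", "good", "screen"],
    ["good", "battery", "good", "price"],
    ["phone", "arrived", "late"],
]

def document_frequency(term, docs):
    """Number of documents that contain `term` at least once."""
    return sum(1 for doc in docs if term in doc)

def inverse_document_frequency(term, docs):
    """log(N / DF); undefined for terms that appear in no document."""
    df = document_frequency(term, docs)
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(docs) / df)

for term in ["good", "late"]:
    print(term, document_frequency(term, docs), inverse_document_frequency(term, docs))
```

"good" appears twice in the second document but still counts once toward DF. The rarer term "late" gets the higher IDF.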

Term Frequency – Inverse Document Frequency (TFIDF)
• A measure that considers:
– The prevalence of a term within a document (TF).
– The scarcity of the term over the corpus (IDF).
• The TFIDF of a term t in a document d is

$$\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$
• TFIDF scores a term higher if it appears more often in a document but less often across the corpus.
• Note that TFIDF applies to a term within a specific document, so the same term is likely to receive different TFIDF values in different documents.
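The sketch below computes TFIDF as TF times IDF on a made-up corpus. It assumes raw counts for TF and log(N / DF) for IDF; both are common choices, but other weightings exist. The corpus and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical corpus: each review is one tokenized document.
docs = {
    "d1": ["cheap", "phone", "good", "screen"],
    "d2": ["good", "battery", "good", "price"],
    "d3": ["phone", "arrived", "late"],
}

def tf(term, tokens):
    """Raw count of `term` in one document."""
    return Counter(tokens)[term]

def idf(term, docs):
    """log(N / DF); assumes the term occurs in at least one document."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df)

def tfidf(term, doc_id, docs):
    """TFIDF of `term` in the document identified by `doc_id`."""
    return tf(term, docs[doc_id]) * idf(term, docs)

for doc_id in docs:
    print(doc_id, round(tfidf("good", doc_id, docs), 3))
```

The same term "good" scores differently per document: twice as high in d2 as in d1 because it occurs twice there, and zero in d3 where it does not occur.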

Categorizing Documents by Topics

• Topic modelling:
– Provides short descriptions for documents.
– Helps to organize, search, understand, and summarize text.
• Topic models are statistical models that:
– examine words from a set of documents,
– determine the themes over the text, and
– discover how the themes are associated or change over time.

The process of topic modeling

  1. Uncover the hidden topical patterns within a corpus.
  2. Annotate documents according to these topics.
  3. Use annotations to organize, search, and summarize texts.

• A topic is formally defined as a distribution over a fixed vocabulary of words.
– Different topics have different distributions over the same vocabulary.
• A topic can be viewed as a cluster of words with related meanings.
– A word from the vocabulary can reside in multiple topics with different weights.
Each word in a topic carries a corresponding weight. Topic models do not necessarily require prior knowledge of the texts; topics can emerge purely from analysing the text.

The simplest topic model is Latent Dirichlet Allocation (LDA)

A generative probabilistic model of a corpus.
In generative probabilistic modeling, data is treated as the result of a generative process that includes hidden variables.

In LDA, documents are treated as the result of a generative process.
LDA assumes:
– There is a fixed vocabulary of words.
– The number of latent topics is predefined.
– Each latent topic is characterised by a distribution over the words in the vocabulary.
– Each document is represented as a random mixture over latent topics.

How LDA assumes a document is generated
– Choose the length N of the document.
– Choose a distribution over the topics.
– For each of the N words of this document:
    • Choose a topic based on the above distribution.
    • Choose a word from the corresponding topic.
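The generative steps above can be simulated with the standard library. This is only a sketch of the generative story, not LDA inference: the vocabulary, the two topics, and the Dirichlet parameters are made up, the document length is passed in as a fixed number, and the topic proportions are drawn from a Dirichlet distribution by normalising gamma samples.

```python
import random

# Hypothetical fixed vocabulary and two hand-written topics.
VOCAB = ["game", "team", "vote", "law", "score", "policy"]
TOPICS = [
    [0.4, 0.3, 0.0, 0.0, 0.3, 0.0],    # a sports-like topic
    [0.0, 0.0, 0.35, 0.35, 0.0, 0.3],  # a politics-like topic
]

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha, rng):
    """Generate one document following the steps listed above."""
    theta = sample_dirichlet(alpha, rng)  # distribution over topics
    assignments, words = [], []
    for _ in range(n_words):
        topic = rng.choices(range(len(TOPICS)), weights=theta)[0]
        word = rng.choices(VOCAB, weights=TOPICS[topic])[0]
        assignments.append(topic)
        words.append(word)
    return theta, assignments, words

rng = random.Random(0)
theta, assignments, words = generate_document(8, [1.0, 1.0], rng)
print(theta)
print(list(zip(assignments, words)))
```

Fitting LDA runs this story in reverse: only `words` is observed, and the topics, `theta`, and `assignments` have to be inferred.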

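[Figure: LDA illustration]
In the figure, the left side shows four topics built from a corpus, each containing the most important keywords from the vocabulary. The four topics relate to problem, policy, neural, and report. As the histogram on the right shows, each document has a distribution over the topics. Next, each word in the document is assigned a topic, and the word is drawn from the corresponding topic (shown as coloured discs). In reality, only the documents (shown in the middle of the figure) are observed. The goal of LDA is to infer the latent topics, the topic proportions, and the topic assignments for every document.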

Determining Sentiments

Sentiment analysis
– Uses statistics and NLP to mine opinions, identifying and extracting subjective information from texts.

• Applications
– Detect the polarity of product or movie reviews.

• Analysis level
– Document, sentence, phrase, and short-text.

• Classification methods are often used to extract corpus statistics for sentiment analysis
– Naïve Bayes classifier, Maximum Entropy, Support Vector Machines, ….

• Movie review corpus
– Consists of 2000 movie reviews.
– Manually tagged into 1000 positive and 1000 negative reviews.

A Naïve Bayes classifier can be trained on this corpus in Python.
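The code from the original notes is not available here, so the following is a from-scratch sketch of a multinomial Naïve Bayes classifier with add-one (Laplace) smoothing. It is not the original code, and it uses four made-up toy reviews rather than the 2000-review movie corpus. All function names are chosen for this example.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(tokens, model):
    """Return the class with the highest log posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy training data.
train_data = [
    ("a great and moving film".split(), "pos"),
    ("great acting and a great story".split(), "pos"),
    ("a dull and boring film".split(), "neg"),
    ("boring plot and terrible acting".split(), "neg"),
]
model = train(train_data)
print(classify("a great story".split(), model))
print(classify("terrible and dull".split(), model))
```

On real data, the corpus would be split into training and test sets, and the predictions on the test set would be tabulated in a confusion matrix.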
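Precision and recall are computed from the confusion matrix output by the code: precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP, and FN are the numbers of true positives, false positives, and false negatives.

Ideally, a good classifier has precision and recall both close to 1.

In information retrieval, a perfect precision of 1 means that every retrieved result is relevant.
A perfect recall of 1 means that all relevant documents were retrieved.
In practice, it is difficult for a classifier to achieve both high precision and high recall.
The data science team needs to check the cleanliness of the data, tune the classifier, and find ways to improve precision while maintaining high recall.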
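Precision and recall can be computed from confusion-matrix counts as follows. The counts in the example call are made up, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts for the positive class.
precision, recall = precision_recall(tp=80, fp=20, fn=40)
print(precision, recall)
```

In this example, 80 of the 100 reviews predicted positive really are positive (precision 0.8), but only 80 of the 120 truly positive reviews were found (recall about 0.67).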

Gaining Insights

Word cloud (tag cloud)
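Tags are usually single words, and the importance of each word is shown by its font size or colour. More frequent words are displayed in a relatively larger font.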

TFIDF can be used to highlight the informative words in text.
Words shown in a larger font have a larger TFIDF value. Each review is treated as a document.

Circular graph of topics obtained from LDA.
– Each topic focuses on a different aspect of the reviews.
– The disc size represents the weight of a word.

References

  1. EMC Education Services. Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. John Wiley & Sons, 2015.

  2. C. C. Aggarwal. Data Mining: The Textbook. Springer, 2015.

  3. C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

  4. D. A. Forsyth and J. Ponce. Computer Vision: A Modern Approach (2nd Edition). Pearson, 2011.

Figures are from the course slides and my own notes.
Figures with Chinese text are from the internet.
