UoG Text as Data Lecture 1

JYY_JYY_

Category: Text As Data

Article link: https://blog.csdn.net/qq_41157876/article/details/105128732

1. Text Processing

First, the concept of a token.

1) Segmenting/tokenizing

e.g. “House of Tartan sells blankets” --------> “House”, “of”, “Tartan”, “sells”, “blankets”

2) Normalize

e.g. “House”, “of”, “Tartan”, “sells”, “blankets” --------> “house”, “of”, “tartan”, “sell”, “blanket”
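A minimal sketch of the two steps above, assuming a simple regex tokenizer and case folding as the only normalization. The function names `tokenize` and `normalize` are made up for illustration. Reducing “sells” to “sell” needs stemming or lemmatization, which this sketch leaves out.

```python
import re


def tokenize(text):
    """Split text into tokens on runs of word characters (punctuation is dropped)."""
    return re.findall(r"\w+", text)


def normalize(tokens):
    """Case-fold every token; this is only one possible normalization step."""
    return [t.lower() for t in tokens]


tokens = tokenize("House of Tartan sells blankets")
normalized = normalize(tokens)
print(tokens)
print(normalized)
```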

3) Stemming vs. lemmatization

Lemmatization reduces a word in any inflected form to its general form, which is itself a word that carries complete meaning.

Definition: the process of grouping together the different inflected forms of a word to a base form

It usually relies on a dictionary lookup.

Stemming extracts the stem or root of a word, which does not necessarily carry complete meaning on its own.

Definition: the process of reducing inflected words to their stem or root form

stemming: “computer”, “computers”, “computing”, “compute” --------> “comput”

Difference: stemming: wolves --------> wolv

lemmatization: wolves --------> wolf
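To make the contrast concrete, here is a toy sketch. It is not the Porter stemmer or any real library's lemmatizer: `toy_stem` strips a hand-picked list of suffixes, and `toy_lemmatize` looks words up in a tiny hand-written dictionary. The suffix list and the dictionary entries are assumptions chosen only to reproduce the examples above.

```python
# Suffixes are tried in this order, so longer ones must come first.
SUFFIXES = ["ing", "ers", "er", "es", "s", "e"]

# Tiny hand-written lemma dictionary (illustrative only).
LEMMA_DICT = {"wolves": "wolf", "computers": "computer", "computing": "compute"}


def toy_stem(word):
    """Rule-based: chop off the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Dictionary-based: return the listed base form, or the word itself if unlisted."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)


for w in ["computer", "computers", "computing", "compute", "wolves"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer's output (“comput”, “wolv”) need not be a real word, while the lemmatizer always returns a dictionary form.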

2. Text collections

N = number of all token occurrences (word count), i.e. how many tokens the text contains in total
V = vocabulary = set of types (unique normalized tokens), i.e. the dictionary of the text

Example:
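A small sketch of how N and V could be computed for one already-normalized text. The sample sentence and the helper name `collection_stats` are made up for illustration.

```python
def collection_stats(tokens):
    """Return (N, V): the number of token occurrences and the set of unique types."""
    n = len(tokens)
    vocab = set(tokens)
    return n, vocab


tokens = "the cat sat on the mat".split()
n, vocab = collection_stats(tokens)
print(n, len(vocab))
```

Here “the” occurs twice, so it adds two to N but only one type to V.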

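1) Representing the text (once we have the vocabulary, how do we represent a document?)

one-hot encoding

Each document is represented by a vector of dimension |V|; an entry is 1 if the corresponding word occurs in the document and 0 if it does not.

Implementation: store V in a dictionary, and keep adding new words to it as new documents come in.

This way every document can be represented as a vector. These vectors are almost always sparse, so they compress very well (a sparse representation saves storage space).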
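A minimal sketch of this representation, assuming documents are already tokenized and normalized. The helper names (`build_vocab`, `encode_dense`, `encode_sparse`) are hypothetical. The sparse form simply stores the indices of the non-zero entries.

```python
def build_vocab(documents):
    """Map each new word to the next free index as documents arrive."""
    vocab = {}
    for doc in documents:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode_dense(doc, vocab):
    """Vector of length |V|: 1 if the word occurs in the document, else 0."""
    vec = [0] * len(vocab)
    for token in doc:
        if token in vocab:
            vec[vocab[token]] = 1
    return vec


def encode_sparse(doc, vocab):
    """Same information, stored as the sorted indices of the 1-entries only."""
    return sorted({vocab[t] for t in doc if t in vocab})


docs = [["house", "of", "tartan"], ["tartan", "blanket", "blanket"]]
vocab = build_vocab(docs)
print(encode_dense(docs[1], vocab))
print(encode_sparse(docs[1], vocab))
```

Note that “blanket” appears twice in the second document but still contributes a single 1, because occurrence counts are ignored.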

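Note: We may already have “frozen” our dictionary; new words are then “OOV” (out-of-vocabulary), also known as “UNK” for unknown terms. The dictionary therefore needs an extra key, <UNK>, to represent unknown terms. Once the dictionary has been built, new documents will contain some unusual (uncommon) words, and these can be counted as <UNK>. Rare terms can all be mapped to <UNK> as well, otherwise the dictionary grows too large. Alternatively, hash every word so that one key corresponds to several words.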
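Both options can be sketched as follows. The `min_count` threshold, the bucket count, and the use of `zlib.crc32` as the hash function are assumptions made for illustration; the notes do not specify them.

```python
import zlib
from collections import Counter

UNK = "<UNK>"


def build_frozen_vocab(documents, min_count=2):
    """Keep only words seen at least min_count times; everything else maps to <UNK>."""
    counts = Counter(t for doc in documents for t in doc)
    vocab = {UNK: 0}
    for token, count in counts.items():
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


def lookup(token, vocab):
    """Index of a token in the frozen vocabulary, falling back to the <UNK> index."""
    return vocab.get(token, vocab[UNK])


def hashed_index(token, num_buckets=8):
    """Hashing alternative: no stored dictionary, and several words may share one bucket."""
    return zlib.crc32(token.encode("utf-8")) % num_buckets


docs = [["house", "of", "tartan"], ["tartan", "blanket"]]
vocab = build_frozen_vocab(docs)
print(lookup("tartan", vocab), lookup("zebra", vocab))
print(hashed_index("tartan"), hashed_index("zebra"))
```

A stable hash such as CRC32 is used rather than Python's built-in `hash`, whose values for strings change between interpreter runs by default.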

3. Text Similarity

Text similarity matters in many tasks, for example:

E.g. grouping together tweets or news articles about the same event (clustering tweets).

E.g. identifying documents similar to a user's query (information retrieval: matching a query against documents).

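The similarity measures that follow are all set-based (set-based similarity): documents are represented with the one-hot encoding above, and we do not count how many times a term occurs in each document. The reasons are:
1) They work well for short pieces of text.
2) They are simple (trivial) to compute with basic data structures.
3) They are a fundamental building block of more complex (learned) functions, so they prepare the ground for the more sophisticated algorithms that come later.
4) There are fast and efficient approximations, which matters later when working with large-scale data.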
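A minimal sketch of what set-based means in practice: each document becomes the set of its terms, so a repeated term counts once. Jaccard similarity is used here only as one well-known set-based measure; choosing it is an assumption of this sketch, as is the convention of returning 1.0 for two empty documents.

```python
def jaccard(tokens_a, tokens_b):
    """Size of the intersection of the two term sets divided by the size of their union."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


doc_a = ["house", "of", "tartan", "tartan"]
doc_b = ["tartan", "blanket"]
print(jaccard(doc_a, doc_b))
```

Only Python's built-in `set` is needed, which illustrates the "basic data structures" point above.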