Stanford Speech and Language Processing - 2.4 Text normalization

周某1111

Category: Self-study. Tags: natural language processing

Article link: https://blog.csdn.net/weixin_48760912/article/details/114837882

2.4.1 Simple Tokenization in UNIX

Given a text file, output the word tokens and their frequencies

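# 1. Turn every run of non-letters into a single newline: one word per line
tr -sc 'A-Za-z' '\n' < sh.txt
# 2. Sort the words and count each distinct word
tr -sc 'A-Za-z' '\n' < sh.txt | sort | uniq -c
# 3. Map uppercase to lowercase before counting
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c
# 4. Sort the counts numerically, most frequent first
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r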
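
For comparison, the same counting pipeline can be sketched in Python. This is only a rough equivalent of the commands above; the function name `word_frequencies` is hypothetical, and the input is passed as a string rather than read from sh.txt.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercased alphabetic tokens, like tr | tr A-Z a-z | sort | uniq -c."""
    # Keep only runs of ASCII letters, as tr -sc 'A-Za-z' '\n' does.
    words = re.findall(r"[A-Za-z]+", text)
    # Fold case so that "The" and "the" are counted together.
    return Counter(word.lower() for word in words)

freqs = word_frequencies("The cat sat. The dog sat, too!")
# most_common() lists the highest counts first, like sort -n -r.
for word, count in freqs.most_common():
    print(count, word)
```

As with the shell version, an apostrophe splits a word, so "cat's" would be counted as "cat" and "s".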

2.4.2 Word Tokenization

Tokenization in NLTK
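
A regular-expression tokenizer of the kind NLTK offers can be sketched with the standard library alone. The pattern below is an assumption modeled on common textbook examples, and `tokenize` is a hypothetical helper; NLTK's `regexp_tokenize` accepts a similar verbose pattern.

```python
import re

# Verbose pattern: alternatives are tried left to right at each position.
PATTERN = re.compile(r"""(?x)
      (?:[A-Z]\.)+           # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*           # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?     # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                 # ellipsis
    | [\]\[.,;"'?():_`-]     # punctuation marks as separate tokens
""")

def tokenize(text):
    # All groups are non-capturing, so findall returns the full matches.
    return PATTERN.findall(text)

tokens = tokenize("That U.S.A. poster-print costs $12.40...")
print(tokens)
```

Because the word alternative comes before the number alternative, a bare number such as 12.40 without the dollar sign would be split at the period; reordering the alternatives is one way to change that.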
Chinese text usually does not need word tokenization; individual characters are commonly used as the tokens instead.

2.4.3 Byte Pair Encoding Tokenization

Subword tokenization (because tokens are often parts of words)
Can include common morphemes like -est or -er.
(A morpheme is the smallest meaning-bearing unit of a language: unlikeliest has the morphemes un-, likely, and -est.)
Three common algorithms:

Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
Unigram language modeling tokenization (Kudo, 2018)
WordPiece (Schuster and Nakajima, 2012)
All have 2 parts:

A token learner that takes a raw training corpus and induces a vocabulary (a set of tokens).
A token segmenter that takes a raw test sentence and tokenizes it according to that vocabulary

BPE token learner algorithm
Because BPE is usually run inside words (it does not merge across word boundaries), first add a special end-of-word symbol '__' before each whitespace in the training corpus.
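
The learner and segmenter described above can be sketched roughly as follows. This is a minimal illustration rather than a reference implementation: the names `learn_bpe`, `merge_pair` and `segment` are hypothetical, the toy corpus is an assumption, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter

END = "__"  # end-of-word symbol, as in the text above

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(corpus, num_merges):
    """Token learner: return the ordered list of merges induced from `corpus`."""
    # Each word starts as a sequence of characters plus the end-of-word symbol.
    word_freqs = Counter(tuple(word) + (END,) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged.
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            new_freqs[merge_pair(symbols, best)] += freq
        word_freqs = new_freqs
    return merges

def segment(word, merges):
    """Token segmenter: apply the learned merges, in order, to a new word."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

corpus = "low " * 5 + "lowest " * 2 + "newer " * 6 + "wider " * 3 + "new " * 2
merges = learn_bpe(corpus, num_merges=3)
print(merges)
print(segment("lower", merges))
```

The vocabulary is the set of initial characters plus one new symbol per merge, so `num_merges` controls the vocabulary size. The segmenter replays the merges in the order they were learned, without looking at frequencies in the test data.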

2.4.4 Word Normalization, Lemmatization and Stemming

Word Normalization

Putting words/tokens in a standard format
For example:
U.S.A. or USA
uhhuh or uh-huh
Fed or fed
am, is, be, are

Case folding (mapping uppercase to lowercase)

Some applications call for this and others do not:

For speech recognition and information retrieval, everything is mapped to lower case
For sentiment analysis, text classification, information extraction (IE) and machine translation (MT), case is helpful (e.g. US versus us), so it is usually kept
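
The trade-off can be seen in a small sketch. The helper `count_tokens` and the token list are hypothetical, chosen only to show what folding merges.

```python
from collections import Counter

def count_tokens(tokens, fold_case):
    """Count tokens, optionally mapping everything to lowercase first."""
    if fold_case:
        tokens = [token.lower() for token in tokens]
    return Counter(tokens)

tokens = ["The", "Fed", "fed", "the", "rates", "US", "us"]
folded = count_tokens(tokens, fold_case=True)   # "Fed"/"fed" and "US"/"us" collapse
cased = count_tokens(tokens, fold_case=False)   # the distinctions are preserved
print(folded)
print(cased)
```

Folding helps a search query for "fed" match "Fed", but it also erases the difference between the country "US" and the pronoun "us".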

Lemmatization

Represent all words as their shared root, i.e. the dictionary headword form:

am, are, is -> be
car, cars, car’s, cars’ -> car
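
As a toy illustration, lemmatization can be sketched as a dictionary lookup. The table below is a hypothetical hand-built one covering only the examples above; real lemmatizers rely on morphological analysis and a full lexicon.

```python
# Hypothetical lookup table: surface form -> dictionary headword.
LEMMA_TABLE = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}

def lemmatize(token):
    """Return the headword for a known form, or the token unchanged."""
    # Normalize case and the typographic apostrophe before the lookup.
    key = token.lower().replace("\u2019", "'")
    return LEMMA_TABLE.get(key, token)

print([lemmatize(t) for t in ["am", "are", "is", "car", "cars", "car's", "cars'"]])
```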

Stemming

Reduce terms to stems, chopping off affixes crudely

Porter Stemmer

Based on a series of rewrite rules run in series
A cascade, in which the output of each pass is fed to the next pass
Some sample rules:
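
As a rough illustration of such a cascade, the sketch below applies three rules of the kind the Porter stemmer uses: SSES → SS, ING → ε when the stem contains a vowel, and ATIONAL → ATE. It is a toy with hypothetical function names, not the full Porter algorithm, which has many more rules and conditions.

```python
import re

VOWEL = re.compile(r"[aeiou]")

def strip_sses(word):
    # SSES -> SS, e.g. grasses -> grass
    return word[:-2] if word.endswith("sses") else word

def strip_ing(word):
    # ING -> (empty) only if the remaining stem contains a vowel,
    # e.g. motoring -> motor, while sing is left alone
    if word.endswith("ing") and VOWEL.search(word[:-3]):
        return word[:-3]
    return word

def rewrite_ational(word):
    # ATIONAL -> ATE, e.g. relational -> relate
    return word[:-7] + "ate" if word.endswith("ational") else word

# The cascade: each pass receives the output of the previous one.
PASSES = [strip_sses, strip_ing, rewrite_ational]

def stem(word):
    for rule in PASSES:
        word = rule(word)
    return word

for example in ["grasses", "motoring", "sing", "relational"]:
    print(example, "->", stem(example))
```

Because the rules only inspect suffixes, stems are not guaranteed to be dictionary words, which is the crudeness mentioned above.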
