Stanford Speech and Language Processing - 2.4 Text normalization


Every NLP task requires text normalization:

  1. Tokenizing (segmenting) words
  2. Normalizing word formats
  3. Segmenting sentences

2.4.1 Simple Tokenization in UNIX

Given a text file, output the word tokens and their frequencies

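# One word per line: turn every run of non-letters into a single newline
tr -sc 'A-Za-z' '\n' < sh.txt
# Sort alphabetically and count each distinct word
tr -sc 'A-Za-z' '\n' < sh.txt | sort | uniq -c
# Map upper case to lower case before counting
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c
# Sort by count, most frequent first
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r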
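
For comparison, the same pipeline can be sketched in Python. This is only a rough equivalent, assuming plain ASCII letters as in the tr commands; the function name word_frequencies is made up for illustration.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercase alphabetic tokens, most frequent first."""
    # Like tr -sc 'A-Za-z': every run of non-letters acts as a separator.
    tokens = re.findall(r"[A-Za-z]+", text)
    # Like tr A-Z a-z: fold to lower case before counting.
    counts = Counter(token.lower() for token in tokens)
    # Like sort | uniq -c | sort -n -r: (word, count) pairs by descending count.
    return counts.most_common()
```

To run it on the file used above, read sh.txt into a string with open("sh.txt").read() and pass that string to word_frequencies.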

2.4.2 Word Tokenization

Tokenization in NLTK
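
A regular-expression tokenizer in this spirit can be sketched with Python's standard re module. The pattern below is an assumption modeled on the kind of verbose pattern typically passed to NLTK's regexp_tokenize; it is not NLTK's own code, and the name tokenize is made up for illustration.

```python
import re

TOKEN_PATTERN = re.compile(r"""
    (?:[A-Z]\.)+            # abbreviations such as U.S.A.
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \.\.\.                  # ellipsis
  | [\]\[.,;"'?():_`-]      # punctuation marks as separate tokens
""", re.VERBOSE)

def tokenize(text):
    """Return the tokens matched by TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("That U.S.A. poster-print costs $12.40..."))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

The groups are non-capturing, (?:...), so that findall returns whole tokens rather than group contents.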
Chinese generally does not need word tokenization; characters are usually taken as the tokens instead.

2.4.3 Byte Pair Encoding Tokenization

  • Subword tokenization (because tokens are often parts of words)
    Can include common morphemes like -est or -er.
    (A morpheme is the smallest meaning-bearing unit of a language: unlikeliest has the morphemes un-, likely, and -est.)

  • Three common algorithms:

  1. Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
  2. Unigram language modeling tokenization (Kudo, 2018)
  3. WordPiece (Schuster and Nakajima, 2012)

All have 2 parts:

  • A token learner that takes a raw training corpus and induces a vocabulary (a set of tokens).
  • A token segmenter that takes a raw test sentence and tokenizes it according to that vocabulary.

BPE token learner algorithm
BPE is usually run inside words (it does not merge across word boundaries), so first add a special end-of-word symbol ‘__’ before each whitespace in the training corpus.
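
A minimal sketch of a BPE token learner and token segmenter follows. It assumes a whitespace-separated corpus and breaks ties between equally frequent pairs by first occurrence; the names learn_bpe, segment, and merge_pair are made up for illustration and do not come from any library.

```python
from collections import Counter

END = "__"  # end-of-word symbol used in these notes

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(corpus, num_merges):
    """Token learner: return (vocabulary, ordered merges) for a raw corpus."""
    # Split on whitespace; each word is its characters plus the end-of-word symbol.
    words = Counter(tuple(w) + (END,) for w in corpus.split())
    vocab = sorted({s for word in words for s in word})
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # on ties, the pair seen first wins
        merges.append(best)
        vocab.append(best[0] + best[1])
        merged = Counter()
        for word, freq in words.items():
            merged[merge_pair(word, best)] += freq
        words = merged
    return vocab, merges

def segment(word, merges):
    """Token segmenter: replay the learned merges, in order, on one word."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

On a toy corpus with low (5 times), lowest (2), newer (6), wider (3), and new (2), the first two merges learned are (e, r) and (er, __). After 8 merges, segment("newer", merges) gives ['newer__'], while the unseen word lower comes out as ['low', 'er__'].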

2.4.4 Word Normalization, Lemmatization and Stemming

Word Normalization

Putting words/tokens in a standard format
For example:
U.S.A. or USA
uhhuh or uh-huh
Fed or fed
am, is, be, are

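Case folding (mapping upper case to lower case)

Helpful for some tasks, harmful for others:

  • For speech recognition and information retrieval, everything is mapped to lower case
  • For sentiment analysis and other text classification tasks, information extraction (IE), and machine translation (MT), case is helpful, so case folding is generally not done (e.g., US the country vs. us the pronoun)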
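
As a small illustration of the trade-off, case folding can be made a per-task switch. This is only a sketch; the function name normalize_case and its fold_case parameter are made up for illustration.

```python
def normalize_case(tokens, fold_case=True):
    """Lowercase tokens when fold_case is True; otherwise return them unchanged."""
    return [t.lower() for t in tokens] if fold_case else list(tokens)

tokens = ["US", "troops", "told", "us"]
print(normalize_case(tokens))                   # ['us', 'troops', 'told', 'us']
print(normalize_case(tokens, fold_case=False))  # ['US', 'troops', 'told', 'us']
```

With folding on, US and us collapse into one token, which helps retrieval generalize but loses a distinction that classification or translation may need.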

Lemmatization

Represent all words as their lemma, i.e. their shared root (the dictionary headword form):

  • am, are, is -> be
  • car, cars, car’s, cars’ -> car
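
The mappings above can be sketched as a simple lookup. The table is a toy that covers only these examples (with straight apostrophes), and the names LEMMAS and lemmatize are made up for illustration; a real lemmatizer relies on a full dictionary plus morphological analysis.

```python
# Toy lookup table covering only the examples in these notes.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}

def lemmatize(token):
    """Return the headword for a known form, else the token itself."""
    return LEMMAS.get(token.lower(), token)

print([lemmatize(t) for t in ["am", "are", "is", "cars'", "car"]])
# ['be', 'be', 'be', 'car', 'car']
```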

Stemming

Reduce terms to stems, chopping off affixes crudely

Porter Stemmer

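  • Based on a set of rewrite rules applied in series
    A cascade, in which the output of each pass is fed to the next pass
    Some sample rules:
    ATIONAL -> ATE (e.g., relational -> relate)
    ING -> ε if the stem contains a vowel (e.g., motoring -> motor)
    SSES -> SS (e.g., grasses -> grass)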
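
The cascade idea can be sketched in a few lines of Python. This is not the Porter stemmer itself: it applies only three simplified rules (SSES -> SS, ATIONAL -> ATE, and ING removal when the remaining stem contains a vowel), and the names RULES and stem are made up for illustration.

```python
import re

# Each rule: (suffix pattern, replacement, condition the remaining stem must satisfy).
RULES = [
    (r"sses$", "ss", None),        # grasses -> grass
    (r"ational$", "ate", None),    # relational -> relate
    (r"ing$", "", r"[aeiou]"),     # motoring -> motor, but sing stays sing
]

def stem(word):
    """Run the rules in order; each rule sees the output of the previous one."""
    for suffix, replacement, condition in RULES:
        match = re.search(suffix, word)
        if match:
            base = word[:match.start()]
            if condition is None or re.search(condition, base):
                word = base + replacement
    return word

print([stem(w) for w in ["grasses", "relational", "motoring", "sing"]])
# ['grass', 'relate', 'motor', 'sing']
```

The full Porter stemmer has many more rules and extra conditions on the stem, so its output for these words can differ.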