NLP - Lexical Analysis


Common lexical analysis tasks:

  • Tokenization
  • Part-of-Speech (POS) Tagging
  • Lemmatization
  • Identifying Stop-Words
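
As a quick, minimal sketch of the first two tasks with NLTK (the sample sentence is made up, and the punkt / perceptron-tagger models must be downloaded once):

import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # one-time model downloads

text = "The cats are eating the food."   # made-up sample sentence
tokens = nltk.word_tokenize(text)        # tokenization
tokens
# ['The', 'cats', 'are', 'eating', 'the', 'food', '.']

nltk.pos_tag(tokens)                     # part-of-speech tagging
# e.g. [('The', 'DT'), ('cats', 'NNS'), ('are', 'VBP'), ('eating', 'VBG'), ...]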

Stemming & Lemmatization

Stemming

Purpose: map inflected forms such as eating, eaten, ate, eats —> eat.
A basic stemmer that simply strips suffixes like -s/-es, -ing, -ed can reach roughly 70% accuracy (a minimal sketch follows the list below);

  • The Porter stemmer applies a larger set of rules and is more accurate;
  • The Snowball stemmers are a whole family of stemmers that handle many different languages.
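
The simple suffix-removal stemmer mentioned above can be sketched with NLTK's RegexpStemmer (the suffix pattern and min length here are illustrative choices, not a standard configuration):

from nltk.stem import RegexpStemmer

# strip a few common suffixes; min=4 leaves very short words untouched
rst = RegexpStemmer('ing$|ed$|es$|s$', min=4)

rst.stem('eating')
# 'eat'
rst.stem('cooked')
# 'cook'
rst.stem('cats')
# 'cat'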

Note:
If the output will be fed into POS tagging, NER, or a dependency parser, avoid stemming; stemming rewrites the tokens themselves, which can change the results of those downstream components.

from nltk.stem import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

pst = PorterStemmer() 
lst = LancasterStemmer()

lst.stem('eating')
# 'eat'

pst.stem('eating')
# 'eat'
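
SnowballStemmer is imported above but never used; it has to be instantiated for a specific language. A small sketch of the usual usage:

sst = SnowballStemmer('english')   # Snowball stemmers are created per language

sst.stem('eating')
# 'eat'

SnowballStemmer.languages  # tuple of the languages supported by the family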

Writing your own Porter stemmer

http://snowball.tartarus.org/algorithms/english/stemmer.html


Lemmatization

Compared with stemming, lemmatization is more robust and more systematic; it uses the surrounding context to work out a word's inflected form and applies different normalization rules, obtaining the lemma (root) according to the part of speech.

from nltk.stem import WordNetLemmatizer
wlem = WordNetLemmatizer()

wlem.lemmatize('eating')           # 'eating' -- lemmatize() assumes pos='n' (noun) by default
wlem.lemmatize('eating', pos='v')  # 'eat'

wlem.lemmatize('ate')              # 'ate'
wlem.lemmatize('ate', pos='v')     # 'eat'

WordNet is a semantic dictionary;
WordNetLemmatizer looks the word up in WordNet; in addition, it uses morphological analysis to cut the word down to its lemma and to find the particular word form (i.e. the word's inflectional variants).
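
The morphological analysis mentioned above is exposed directly as wordnet.morphy; a minimal sketch (assuming the wordnet corpus has been downloaded):

from nltk.corpus import wordnet

wordnet.morphy('ate', wordnet.VERB)
# 'eat'
wordnet.morphy('eating', wordnet.VERB)
# 'eat'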


Stemming vs. Lemmatization

Stemming is, for the most part, a set of rules for obtaining a general stem form;
lemmatization takes the current context and the word's POS into account, and then applies the rules to the specific grammatical inflection.
In general, stemming is simpler to implement and faster to run.
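
A small side-by-side sketch of the difference (the example words are arbitrary): the stemmer may chop words down to non-words, while the lemmatizer returns dictionary forms.

from nltk.stem import PorterStemmer, WordNetLemmatizer

pst = PorterStemmer()
wlem = WordNetLemmatizer()

for w in ['flies', 'studies', 'meeting']:
    print(w, pst.stem(w), wlem.lemmatize(w))
# flies fli fly
# studies studi study
# meeting meet meeting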



Stop-Word and Rare-Word Filtering


Stop Words (stopwords)

Stop words: text that is irrelevant to the actual topic being processed and carries essentially no meaning in NLP tasks such as information retrieval and classification; articles and pronouns are usually treated as stop words; they are rarely ambiguous, so removing them has little effect.

In most cases, the stop-word list for a given language is hand-crafted: a cross-corpus list of the most common words. A stop-word list can be taken from an existing one found online, or it can be generated automatically from a given corpus.
One simple way to generate a stop-word list is to rank words by how frequently they appear in the documents.
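
A minimal sketch of the frequency-based approach, using nltk.FreqDist to treat the most frequent tokens of a corpus as candidate stop words (the corpus and the cutoff here are made up for illustration):

import nltk

# a made-up tokenized corpus; in practice use your own documents
docs = ["the cat sat on the mat".split(), "the dog ate the food".split()]

freq = nltk.FreqDist(w.lower() for doc in docs for w in doc)
candidate_stopwords = [w for w, _ in freq.most_common(5)]  # top-N frequent words; N chosen by inspection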

The NLTK library ships stop-word lists for more than 20 languages (23 appear in the listing below).


Relevant module:

  • nltk.corpus.stopwords

1. Inspecting the stop words

from nltk.corpus import stopwords  # load the stop-word corpus
stopwords.readme().replace('\n', ' ')  # the corpus README; replacing the many \n characters makes it easier to read
 
'''
    'Stopwords Corpus  This corpus contains lists of stop words for several languages.  These are high-frequency grammatical words which are usually ignored in text retrieval applications.  They were obtained from: http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/  The stop words for the Romanian language were obtained from: http://arlc.ro/resources/  The English list has been augmented https://github.com/nltk/nltk_data/issues/22  The German list has been corrected https://github.com/nltk/nltk_data/pull/49  A Kazakh list has been added https://github.com/nltk/nltk_data/pull/52  A Nepali list has been added https://github.com/nltk/nltk_data/pull/83  An Azerbaijani list has been added https://github.com/nltk/nltk_data/pull/100  A Greek list has been added https://github.com/nltk/nltk_data/pull/103  An Indonesian list has been added https://github.com/nltk/nltk_data/pull/112 '

'''

# list the available languages; note there is no Chinese stop-word list
stopwords.fileids() 
 

'''
    ['arabic',
     'azerbaijani',
     'danish',
     'dutch',
     'english',
     'finnish',
     'french',
     'german',
     'greek',
     'hungarian',
     'indonesian',
     'italian',
     'kazakh',
     'nepali',
     'norwegian',
     'portuguese',
     'romanian',
     'russian',
     'slovene',
     'spanish',
     'swedish',
     'tajik',
     'turkish']
'''
 
# view the raw English stop-word list

stopwords.raw('english').replace('\n', ' ')
 
'''
    "i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't "
'''
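
For filtering, the list form stopwords.words('english') is usually more convenient than the raw text, and converting it to a set makes membership tests cheap:

stops = set(stopwords.words('english'))  # the English list as a set

'the' in stops
# True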


2. Filtering stop words

test_words = [word.lower() for word in tokens]  # tokens: the word tokens produced earlier (e.g. by nltk.word_tokenize)

# convert to a set so that taking the intersection with the stop-word list is easy
test_words_set = set(test_words)

test_words_set
'''
    {',',
     '.',
     'and',
     'api',
     'articles',
     'browse',
     'code',
     ...}
'''
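
The original filtering step is cut off above; a sketch of how it is typically completed, using the stops set built earlier (variable names follow the code above):

# the tokens that are also stop words (intersection with the stop-word set)
test_words_set & stops

# keep only the tokens that are not stop words
filtered = test_words_set - stops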