Stemming and lemmatization

子燕若水

Article link: https://blog.csdn.net/u010087338/article/details/119893565

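This article discusses the difference between stemming and lemmatization in natural language processing. Stemming reduces a word toward its root by stripping affixes, as the Porter and Snowball algorithms do, and may return strings that are not dictionary words. Lemmatization goes a step further and reduces a word to its base form as listed in the dictionary, so the result is always a real word. spaCy relies on lemmatization.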

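The difference between stemming and lemmatization

Both techniques aim to reduce the different forms of a word to a common base form. Stemming usually refers to a crude heuristic process that chops off the ends of words and hopes to get this right most of the time. It strips inflectional morphemes and often removes derivational morphemes as well.

Lemmatization is also a trimming process, but it usually removes only inflectional morphemes and returns the base or dictionary form of a word.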
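
To make the contrast concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the suffix lists, the lookup table, and the names `toy_stem` and `toy_lemmatize` are hypothetical and far cruder than any real stemmer or lemmatizer.

```python
# Toy illustration only: this is NOT the Porter or Snowball algorithm.
INFLECTIONAL_SUFFIXES = ("ing", "ed", "s")   # grammatical endings
DERIVATIONAL_SUFFIXES = ("er", "ation")      # endings that form new words

def toy_stem(word):
    """Crudely chop one suffix; the result may not be a dictionary word."""
    for suffix in INFLECTIONAL_SUFFIXES + DERIVATIONAL_SUFFIXES:
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Drop a trailing 'e' so that 'compute' and 'computing' collapse together.
    return word[:-1] if word.endswith("e") else word

# Hypothetical lookup table: inflected form -> dictionary form (lemma).
TOY_LEMMAS = {
    "computes": "compute",
    "computed": "compute",
    "computing": "compute",
    "computers": "computer",
}

def toy_lemmatize(word):
    """Undo inflection only; derived words such as 'computer' stay intact."""
    return TOY_LEMMAS.get(word, word)

for w in ["compute", "computer", "computed", "computing"]:
    print(f"{w:10} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

In this sketch the stemmer maps all four words to "comput", while the lemmatizer keeps "computer" separate from "compute" because "-er" is a derivational ending.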

Stemming

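Stemming is the process of reducing a word to its root form. In natural language processing tasks you will often encounter different words that share the same root, for example compute, computer, computing, and computed. For the sake of uniformity, you may want to reduce such words to their common root form. This is where stemming comes in.

It may surprise you, but spaCy does not include any stemming functionality, because it relies solely on lemmatization. In this section we therefore use NLTK for stemming.

NLTK provides several stemmers. This section covers two of them, the Porter stemmer and the Snowball stemmer, which implement different algorithms.

Code: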

from nltk.stem.porter import PorterStemmer

# Stem each token with the Porter algorithm and print the mapping
stemmer = PorterStemmer()
tokens = ['compute', 'computer', 'computed', 'computing']
for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

The output is as follows:

compute --> comput
computer --> comput
computed --> comput
computing --> comput
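
Because all four words share one stem, the stem can serve as a grouping key, which is how stemming achieves the uniformity described earlier. The following is a minimal sketch under that assumption; the helper name `group_by_stem` and the extra words "cats" and "cat" are introduced here for illustration.

```python
from collections import defaultdict

from nltk.stem.porter import PorterStemmer

def group_by_stem(words):
    """Collect words under their Porter stem (hypothetical helper)."""
    stemmer = PorterStemmer()
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)

groups = group_by_stem(
    ["compute", "computer", "computed", "computing", "cats", "cat"]
)
for stem, members in groups.items():
    print(stem, "<--", members)
```

A search index built on such keys would match "computing" when the query is "computed", at the cost of also merging words like "computer" that a reader may want to keep apart.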

Snowball Stemmer

The Snowball stemmer is a slightly improved version of the Porter stemmer and is usually preferred over it. Let's see the Snowball stemmer in action:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')

tokens = ['compute', 'computer', 'computed', 'computing']

for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

In the script above, we used the Snowball stemmer to find the stems of the same four words that we used with the Porter stemmer. The output looks like this:

compute --> comput
computer --> comput
computed --> comput
computing --> comput
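
The two stemmers do not agree on every input. As a hedged aside, the sketch below runs both on "generously", a word that the NLTK documentation uses to show the difference; the word choice is an addition to this article, not part of the original scripts.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language="english")

# 'computing' stems identically; 'generously' is where the algorithms diverge.
for word in ["computing", "generously"]:
    print(f"{word}: porter={porter.stem(word)}, snowball={snowball.stem(word)}")
```

For "generously", Porter is expected to return "gener" and Snowball "generous", so Snowball happens to land on a dictionary word here. That is not guaranteed in general, as the "comput" results show.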

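Lemmatization

For these four words, the results are the same as with the Porter stemmer: we still get "comput" as the stem. Again, "comput" is not actually a dictionary word. If a user enters "computes", stemming yields "comput", which cannot be looked up in a dictionary.

This is where lemmatization comes in handy. Lemmatization reduces a word to its base form as it appears in the dictionary (its lemma). In other words, lemmatization also trims the word, but it stops trimming once it reaches a dictionary word.

Code: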

import spacy

# English pipelines include a rule-based lemmatizer
nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)  # 'rule'

doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])
# ['I', 'be', 'read', 'the', 'paper', '.']
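
To give a rough idea of how a rule-based lemmatizer can arrive at such lemmas, here is a minimal sketch in plain Python. It assumes a common design: an exception table for irregular forms, suffix rules per part of speech, and a vocabulary check. The tables and the name `rule_lemmatize` are invented for illustration and are not spaCy's API or data.

```python
# Hypothetical miniature tables; a real lemmatizer ships far larger ones.
EXCEPTIONS = {
    ("was", "AUX"): "be",
    ("were", "AUX"): "be",
    ("mice", "NOUN"): "mouse",
}
SUFFIX_RULES = {
    # (suffix to remove, replacement), tried in order
    "VERB": [("ing", ""), ("ing", "e"), ("ed", ""), ("ed", "e"), ("s", "")],
    "NOUN": [("s", "")],
}
VOCABULARY = {"read", "compute", "computer", "paper"}

def rule_lemmatize(word, pos):
    """Return a lemma for (word, pos) using exceptions, then suffix rules."""
    word = word.lower()
    if (word, pos) in EXCEPTIONS:           # irregular forms first
        return EXCEPTIONS[(word, pos)]
    for old, new in SUFFIX_RULES.get(pos, []):
        if word.endswith(old):
            candidate = word[: len(word) - len(old)] + new
            if candidate in VOCABULARY:     # accept only dictionary words
                return candidate
    return word                             # fall back to the word itself

tagged = [("was", "AUX"), ("reading", "VERB"), ("computing", "VERB"),
          ("computers", "NOUN")]
print([rule_lemmatize(w, p) for w, p in tagged])
```

The vocabulary check is what keeps the output inside the dictionary: for "computing", the candidate "comput" is rejected and the next rule produces "compute".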

That's it!

References:

https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

https://stackabuse.com/python-for-nlp-tokenization-stemming-and-lemmatization-with-spacy-library/