Text Preprocessing: Stemming and Lemmatization

In any natural language, words can be written or spoken in many different forms depending on context. That variety is what makes language such an exciting part of our lives, but machines do not share the excitement: they treat each variant as a different word. We therefore need to normalize words to their root forms. Text normalization is the process of converting words into a single canonical form. It can be done through two processes: stemming and lemmatization. Let's look at each in detail.

Stemming and lemmatization are simply forms of word normalization, i.e., reducing words to their root form.

In most natural languages, a root word can take many variant forms. For example, the word "play" appears as "playing", "played", "plays", and so on. You can think of similar examples (and there are many).

Stemming

  • Stemming is a text normalization technique that truncates the end or beginning of a word, based on a list of common prefixes or suffixes that may occur in it
  • It is a basic rule-based process that strips suffixes ("ing", "ly", "es", "s", etc.) from a word
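The rule-based idea behind stemming can be sketched in a few lines of plain Python. This toy stemmer is purely illustrative (real stemmers such as the Porter algorithm apply many more rules and conditions):

```python
# A deliberately naive rule-based stemmer: strip the first matching
# suffix, but only if enough of the word remains to form a plausible stem.
SUFFIXES = ("ing", "ly", "es", "ed", "s")

def naive_stem(word: str) -> str:
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["playing", "played", "plays", "quickly"]])
# → ['play', 'play', 'play', 'quick']
```

Note how crude this is: a word like "caress" would lose its "es" and become "care". That is exactly the kind of over-chopping that motivates lemmatization below.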

Lemmatization

Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It makes use of vocabulary (the dictionary importance of words) and morphological analysis (word structure and grammatical relations).

Stemming algorithms work by cutting suffixes or prefixes off a word. Lemmatization is a more powerful operation because it takes the word's morphological analysis into account.

Lemmatization returns the lemma, which is the root form shared by all of a word's inflected variants.

We can say that stemming is a quick-and-dirty method of chopping words down to their root form, whereas lemmatization is an intelligent operation that uses dictionaries built with in-depth linguistic knowledge. Lemmatization therefore helps in forming better features.

Methods for Performing Text Normalization

1. Text Normalization Using NLTK

The NLTK library has many excellent methods for performing the different steps of data preprocessing. It provides methods such as PorterStemmer() and WordNetLemmatizer() to perform stemming and lemmatization, respectively.

Stemming

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

# Tokenize the text and filter out English stop words
stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(text)

filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

# Stem each remaining token with the Porter stemmer
stem_words = []
ps = PorterStemmer()
for w in filtered_sentence:
    root_word = ps.stem(w)
    stem_words.append(root_word)
print(filtered_sentence)
print(stem_words)

 

output:

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He ready becuase rights become much less valuable, indeed vaguest idea wood river question.

He determin drop litig monastri, relinguish claim wood-cut fisheri rihgt. He readi becuas right become much less valuabl, inde vaguest idea wood river question.

Lemmatization

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

# Tokenize the text and filter out English stop words
stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(text)

filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
print(filtered_sentence)

# Lemmatize each token, trying noun, verb, and adjective POS in turn
lemma_word = []
wordnet_lemmatizer = WordNetLemmatizer()
for w in filtered_sentence:
    word1 = wordnet_lemmatizer.lemmatize(w, pos="n")
    word2 = wordnet_lemmatizer.lemmatize(word1, pos="v")
    word3 = wordnet_lemmatizer.lemmatize(word2, pos="a")
    lemma_word.append(word3)
print(lemma_word)

output:

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He ready becuase rights become much less valuable, indeed vaguest idea wood river question.

He determined drop litigation monastry, relinguish claim wood-cuting fishery rihgts. He ready becuase right become much le valuable, indeed vaguest idea wood river question.

Here, v stands for verb, a for adjective, and n for noun. The lemmatizer only lemmatizes those words that match the pos parameter of the lemmatize method.

2. Text Normalization Using spaCy

As we saw earlier, spaCy is a great NLP library. It provides many industry-grade methods to perform lemmatization. Unfortunately, spaCy has no module for stemming. To perform lemmatization, check out the code below:

import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp(u"""He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were.""")

lemma_word1 = [] 
for token in doc:
    lemma_word1.append(token.lemma_)

output:

-PRON- determine to drop -PRON- litigation with the monastry, and relinguish -PRON- claim to the wood-cuting and \n fishery rihgts at once. -PRON- be the more ready to do this becuase the right have become much less valuable, and -PRON- have \n indeed the vague idea where the wood and river in question be.

Here -PRON- is the placeholder for a pronoun and can easily be removed using regular expressions. The good thing about spaCy is that we don't have to pass any pos parameter to perform lemmatization.
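For example, the -PRON- placeholders (emitted by spaCy 2.x; spaCy 3.x returns the pronoun text itself instead) can be stripped with a short regular expression:

```python
import re

lemmatized = "-PRON- determine to drop -PRON- litigation"

# Remove every -PRON- token, then collapse the leftover whitespace
cleaned = re.sub(r"-PRON-", "", lemmatized)
cleaned = re.sub(r"\s+", " ", cleaned).strip()
print(cleaned)  # → determine to drop litigation
```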

3. Text Normalization Using TextBlob

TextBlob is a Python library built specifically for preprocessing text data. It is based on the NLTK library. We can use TextBlob to perform lemmatization. However, there is no module for stemming in TextBlob.

from textblob import Word 

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

lem = []
for i in text.split():
    word1 = Word(i).lemmatize("n")
    word2 = Word(word1).lemmatize("v")
    word3 = Word(word2).lemmatize("a")
    lem.append(Word(word3).lemmatize())
print(lem)

output:

He determine to drop his litigation with the monastry, and relinguish his claim to the 
wood-cuting and fishery rihgts at once. He wa the more ready to do this becuase the right have become much le valuable, and he have indeed the vague idea where the wood and river in question were.

As we saw in the NLTK section above, TextBlob also uses POS tags to perform lemmatization.
