自然语言处理(NLP)之英文单词词性还原

最新推荐文章于 2024-10-31 00:42:07 发布

IT之一小佬

最新推荐文章于 2024-10-31 00:42:07 发布

阅读量3.8k

点赞数 5

分类专栏：自然语言处理文章标签： nlp python 自然语言处理机器学习深度学习

原文链接：https://blog.csdn.net/jclian91/article/details/83661714

版权

自然语言处理专栏收录该内容

51 篇文章

订阅专栏

词形还原（Lemmatization）是文本预处理中的重要部分，与词干提取（stemming）很相似。

简单说来，词形还原就是去掉单词的词缀，提取单词的主干部分，通常提取后的单词会是字典中的单词，不同于词干提取（stemming），提取后的单词不一定会出现在单词中。比如，单词“cars”词形还原后的单词为“car”，单词“ate”词形还原后的单词为“eat”。

在Python的nltk模块中，使用WordNet为我们提供了稳健的词形还原的函数。如以下示例Python代码：

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

#  lemmatize nouns
print(wnl.lemmatize('cars', 'n'))
print(wnl.lemmatize('men', 'n'))

#  lemmatize verbs
print(wnl.lemmatize('running', 'v'))
print(wnl.lemmatize('ate', 'v'))

#  lemmatize adjectives
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('fancier', 'a'))

运行结果：

car
men
run
eat
sad
fancy

在以上代码中，wnl.lemmatize()函数可以进行词形还原，第一个参数为单词，第二个参数为该单词的词性，如名词，动词，形容词等，返回的结果为输入单词的词形还原后的结果。

词形还原一般是简单的，但具体我们在使用时，指定单词的词性很重要，不然词形还原可能效果不好，如以下代码：

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize('ate', 'n'))
print(wnl.lemmatize('fancier', 'v'))

输出结果如下：

那么，如何获取单词的词性呢？在NLP中，使用Parts of speech（POS）技术实现。在nltk中，可以使用nltk.pos_tag()获取单词在句子中的词性，如以下Python代码：

from nltk import word_tokenize
from nltk import pos_tag

sentence = 'The brown fox is quick and he is jumping over the lazy dog'

tokens = word_tokenize(sentence)
tagged_sent = pos_tag(tokens)

print(tokens)
print(tagged_sent)

输出结果如下：

['The', 'brown', 'fox', 'is', 'quick', 'and', 'he', 'is', 'jumping', 'over', 'the', 'lazy', 'dog']
[('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

OK，知道了获取单词在句子中的词性，再结合词形还原，就能很好地完成词形还原功能。示例的Python代码如下：

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer


#  获取单词的词性
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None


sentence = 'football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.'
print(sentence)
tokens = word_tokenize(sentence)  # 分词
tagged_sent = pos_tag(tokens)  # 获取单词的词性
print(tagged_sent)

wnl = WordNetLemmatizer()
lemmas_sent = []

for tag in tagged_sent:
    wordnet_pos = get_wordnet_pos(tag[1]) or wordnet.NOUN
    lemmas_sent.append(wnl.lemmatize(tag[0], pos=wordnet_pos))  # 词性还原

print(lemmas_sent)

输出结果如下：

football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.
[('football', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('family', 'NN'), ('of', 'IN'), ('team', 'NN'), ('sports', 'NNS'), ('that', 'WDT'), ('involve', 'VBP'), (',', ','), ('to', 'TO'), ('varying', 'VBG'), ('degrees', 'NNS'), (',', ','), ('kicking', 'VBG'), ('a', 'DT'), ('ball', 'NN'), ('to', 'TO'), ('score', 'VB'), ('a', 'DT'), ('goal', 'NN'), ('.', '.')]
['football', 'be', 'a', 'family', 'of', 'team', 'sport', 'that', 'involve', ',', 'to', 'vary', 'degree', ',', 'kick', 'a', 'ball', 'to', 'score', 'a', 'goal', '.']

输出的结果就是对句子中的单词进行词形还原后的结果。