Python Series: NLP Series 1: Bag-of-Words Model and Sentence Similarity; Exploring the Principle of TF-IDF; Lemmatization




I. Introduction to NLP (1): Bag-of-Words Model and Sentence Similarity

This is the first article in my NLP introduction series; from here on we step into the world of NLP.
  This article introduces the bag-of-words model (Bag of Words), which is common in NLP, and shows how to use it to compute the similarity between sentences (cosine similarity).
  First, let us see what the bag-of-words model is. We take the following two simple sentences as examples:

sent1 = "I love sky, I love sea."
sent2 = "I like running, I love reading."

Usually, NLP cannot process a whole paragraph or sentence at once, so the first step is typically sentence segmentation and word tokenization. Since we only have sentences here, tokenization is enough. For English sentences we can use NLTK's word_tokenize function; for Chinese sentences we can use the jieba module (a quick sketch follows the English example below). So the first step is tokenization; the code is as follows:

from nltk import word_tokenize
sents = [sent1, sent2]
texts = [[word for word in word_tokenize(sent)] for sent in sents]

The output is as follows:

[['I', 'love', 'sky', ',', 'I', 'love', 'sea', '.'], ['I', 'like', 'running', ',', 'I', 'love', 'reading', '.']]
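
As an aside, the corresponding step for Chinese text with jieba might look like this (a minimal sketch; jieba is not used in the rest of this article, and the exact segmentation depends on its dictionary):

import jieba

# jieba.lcut returns the segmentation result directly as a list
print(jieba.lcut("我爱自然语言处理"))
# e.g. ['我', '爱', '自然语言', '处理']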

Tokenization is done. The next step is to build the corpus, i.e., all words and punctuation marks that appear in the sentences. The code is as follows:

all_list = []
for text in texts:
    all_list += text
corpus = set(all_list)
print(corpus)

The output is as follows:

{'love', 'running', 'reading', 'sky', '.', 'I', 'like', 'sea', ','}

As we can see, the corpus contains 9 distinct words and punctuation marks. Next we assign a number to each of them, which will make the vector representation of the sentences easier. The code is as follows:

corpus_dict = dict(zip(corpus, range(len(corpus))))
print(corpus_dict)

The output is as follows:

{'running': 1, 'reading': 2, 'love': 0, 'sky': 3, '.': 4, 'I': 5, 'like': 6, 'sea': 7, ',': 8}

Although the words and punctuation marks are not numbered in the order in which they appear, this affects neither the vector representation of the sentences nor the similarities between them.
  The next step, the key step of the bag-of-words model, is to build the vector representation of each sentence. Rather than simply using 0 or 1 to mark whether a word or punctuation mark occurs, this representation uses its occurrence count as the corresponding value. Combined with the corpus dictionary we just built, the code for the vector representation is as follows:

# build the vector representation of a sentence
def vector_rep(text, corpus_dict):
    vec = []
    for key in corpus_dict.keys():
        if key in text:
            vec.append((corpus_dict[key], text.count(key)))
        else:
            vec.append((corpus_dict[key], 0))

    vec = sorted(vec, key=lambda x: x[0])

    return vec

vec1 = vector_rep(texts[0], corpus_dict)
vec2 = vector_rep(texts[1], corpus_dict)
print(vec1)
print(vec2)

The output is as follows:

[(0, 2), (1, 0), (2, 0), (3, 1), (4, 1), (5, 2), (6, 0), (7, 1), (8, 1)]
[(0, 1), (1, 1), (2, 1), (3, 0), (4, 1), (5, 2), (6, 1), (7, 0), (8, 1)]

Let us pause for a moment and look at these vectors. In the first sentence, 'I' appears twice, and in the corpus dictionary 'I' is mapped to the number 5; therefore the tuple (5, 2) in the list means that the word 'I' occurs twice in the first sentence. The output above may not look very intuitive; the true vectors representing the two sentences are:

[2, 0, 0, 1, 1, 2, 0, 1, 1]
[1, 1, 1, 0, 1, 2, 1, 0, 1]
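
If you prefer these plain frequency vectors to the (index, count) pairs, flattening is a one-liner, since vec1 and vec2 are already sorted by index:

plain_vec1 = [count for _, count in vec1]
plain_vec2 = [count for _, count in vec2]
print(plain_vec1)   # [2, 0, 0, 1, 1, 2, 0, 1, 1]
print(plain_vec2)   # [1, 1, 1, 0, 1, 2, 1, 0, 1]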

OK, that concludes the bag-of-words model. Next we will use the vector representations just obtained to compute the similarity of the two sentences.
  In NLP, once we have the vector representations of two sentences, we usually take the cosine similarity as their similarity, i.e., the cosine of the angle between the two vectors.
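Concretely, for two vectors $\vec{a}$ and $\vec{b}$,

$\cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\,\|\vec{b}\|}.$

The Python code for the computation is as follows: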

from math import sqrt
def similarity_with_2_sents(vec1, vec2):
    inner_product = 0
    square_length_vec1 = 0
    square_length_vec2 = 0
    for tup1, tup2 in zip(vec1, vec2):
        inner_product += tup1[1]*tup2[1]
        square_length_vec1 += tup1[1]**2
        square_length_vec2 += tup2[1]**2

    return (inner_product/sqrt(square_length_vec1*square_length_vec2))


cosine_sim = similarity_with_2_sents(vec1, vec2)
print('Cosine similarity of the two sentences: %.4f' % cosine_sim)

The output is as follows:

Cosine similarity of the two sentences: 0.7303
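
As a quick check by hand: the inner product of the two vectors is 2·1 + 1·1 + 2·2 + 1·1 = 8 (all other products are 0), their squared lengths are 12 and 10, and 8/√(12·10) ≈ 0.7303, matching the output above.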

Thus, from the bag-of-words representations of the sentences, we obtained their similarity.
  Of course, in a real NLP project, if we need the similarity of two sentences we can simply call the gensim module. It is a powerful NLP toolkit that helps with many NLP tasks. Below is the code for computing the similarity of the two sentences with gensim:

sent1 = "I love sky, I love sea."
sent2 = "I like running, I love reading."

from nltk import word_tokenize
sents = [sent1, sent2]
texts = [[word for word in word_tokenize(sent)] for sent in sents]
print(texts)

from gensim import corpora
from gensim.similarities import Similarity

# build the dictionary (vocabulary)
dictionary = corpora.Dictionary(texts)

# use doc2bow to get the bag-of-words representation
corpus = [dictionary.doc2bow(text) for text in texts]
similarity = Similarity('-Similarity-index', corpus, num_features=len(dictionary))
print(similarity)
# query the similarity of a sentence
new_sentence = sent1
test_corpus_1 = dictionary.doc2bow(word_tokenize(new_sentence))

cosine_sim = similarity[test_corpus_1][1]
print("利用gensim计算得到两个句子的相似度: %.4f。"%cosine_sim)

The output is as follows:

[['I', 'love', 'sky', ',', 'I', 'love', 'sea', '.'], ['I', 'like', 'running', ',', 'I', 'love', 'reading', '.']]
Similarity index with 2 documents in 0 shards (stored under -Similarity-index)
Similarity of the two sentences computed by gensim: 0.7303

Note: if the following warnings appear when you run the code:

gensim\utils.py:1209: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

gensim\matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int32 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):

you can suppress them by adding the following lines before the gensim import:

import warnings
warnings.filterwarnings(action='ignore',category=UserWarning,module='gensim')
warnings.filterwarnings(action='ignore',category=FutureWarning,module='gensim')

This article ends here. Thanks for reading! If anything is wrong, please contact me; discussion is always welcome. Good luck!

II. Introduction to NLP (2): Exploring the Principle of TF-IDF

Introduction to TF-IDF

TF-IDF is a common statistical method in NLP for evaluating how important a word is to one document in a collection or corpus; it is often used to extract textual features, i.e., keywords. The importance of a word increases proportionally with the number of times it appears in the document, but decreases inversely with its frequency across the corpus.
  In NLP, TF-IDF is computed as follows:
$tfidf = tf \times idf.$

Here tf stands for term frequency and idf for inverse document frequency.
  tf is the frequency of a word in a document: if the word occurs i times in a document containing N words in total, then tf = i/N.
  idf is the inverse document frequency: if the corpus consists of n documents and the word appears in k of them, then the idf value is
$idf = \log_{2}\left(\frac{n}{k}\right).$

Of course, the idf formula differs slightly from place to place. For example, some add 1 to the denominator k to prevent division by zero, and others add 1 to both the numerator and the denominator, which is a smoothing trick. In this article we stick with the original idf formula, since it matches the one used in gensim; the sketch below illustrates the variants.
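
For illustration, these variants can be written as follows (a minimal sketch; the function names are my own):

import math

def idf_plain(n, k):
    # the formula used in this article (log base 2)
    return math.log2(n / k)

def idf_smoothed_denominator(n, k):
    # add 1 to the denominator to avoid division by zero
    return math.log2(n / (k + 1))

def idf_add_one(n, k):
    # the smoothing trick: add 1 to both numerator and denominator
    return math.log2((n + 1) / (k + 1))
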
  Suppose the corpus contains D documents in total; then the tfidf value of word i in document j is

$tfidf_{i,j} = tf_{i,j} \times idf_{i} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log_{2}\left(\frac{D}{|\{d : w_{i} \in d\}|}\right),$

where $n_{i,j}$ is the number of times word $i$ occurs in document $j$.

That is how TF-IDF is computed.
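
For example, if a corpus contains 3 documents and a word occurs 4 times in a 100-word document while appearing in only 1 of the 3 documents, then tf = 4/100 = 0.04, idf = log2(3/1) ≈ 1.585, and tfidf ≈ 0.04 × 1.585 ≈ 0.063.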

The Texts and Their Preprocessing

We will use the following three sample texts:

text1 ="""
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. 
Unqualified, the word football is understood to refer to whichever form of football is the most popular 
in the regional context in which the word appears. Sports commonly called football in certain places 
include association football (known as soccer in some countries); gridiron football (specifically American 
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); 
and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, 
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) 
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard 
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is 
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops 
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with 
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period 
of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a 
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before 
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches 
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across 
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""

These three passages introduce football, basketball, and volleyball respectively; together they form our document collection.
  Next comes text preprocessing.
  We first strip newline characters from the text, then split it into sentences and tokenize them, and finally remove punctuation. The complete Python code is below; the input parameter is the article text:

import nltk
import string

# text preprocessing
# function: split text into sentences and words, and remove punctuation
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence segmentation
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # word tokenization
            if word not in string.punctuation: # remove punctuation
                tokens.append(word)
    return tokens
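
As a quick sanity check (the exact tokens depend on your installed NLTK data), we can look at the first few tokens of text1:

print(get_tokens(text1)[:8])
# ['Football', 'is', 'a', 'family', 'of', 'team', 'sports', 'that']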

Next we remove the stop words from each article and count the occurrences of each remaining word. The complete Python code is below; the input parameter is the article text:

from nltk.corpus import stopwords    # stop words
from collections import Counter      # word counting

# remove stop words from the raw text
# and build a count dictionary, i.e., the number of occurrences of each word
def make_count(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]    # remove stop words
    count = Counter(filtered)
    return count

Taking text3 as an example, the resulting count dictionary is:

Counter({'ball': 4, 'net': 4, 'teammate': 3, 'returned': 2, 'bat': 2,
'court': 2, 'team': 2, 'across': 2, 'touches': 2, 'back': 2,
'players': 2, 'touch': 1, 'must': 1, 'usually': 1, 'side': 1,
'player': 1, 'area': 1, 'Volleyball': 1, 'hands': 1, 'may': 1,
'toward': 1, 'A': 1, 'third': 1, 'two': 1, 'six': 1, 'opposing': 1,
'within': 1, 'prevent': 1, 'allowed': 1, '’': 1, 'playing': 1,
'played': 1, 'volley': 1, 'surface—that': 1, 'volleys': 1,
'opponents': 1, 'use': 1, 'high': 1, 'teams': 1, 'bats': 1, 'To': 1,
'game': 1, 'make': 1, 'forth': 1, 'three': 1, 'trying': 1})

TF-IDF in Gensim

After preprocessing, each of the three sample texts yields a count dictionary of word occurrences. Below we use the TF-IDF model already implemented in gensim to output, for each article, the three words with the highest TF-IDF scores together with their tfidf values. The complete code is as follows:

from nltk.corpus import stopwords    # stop words
from gensim import corpora, models, matutils

# training with gensim's TfidfModel
def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]
    return filtered

# get text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]
# training by TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v:k for k,v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d"%(i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)    #type=list
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(new_dict[num], round(score, 5)))

The output is as follows:

Training by gensim Tfidf Model.......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: cm, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: across, TF-IDF: 0.22888

The results match our expectations quite well: the football article yields the keywords football and rugby, the basketball article yields play and cm, and the volleyball article yields net and teammate.

Implementing TF-IDF by Hand

With the understanding of TF-IDF gained above, we can also implement it ourselves, which is the best way to learn an algorithm!
  Here is my implementation of TF-IDF (it continues from the preprocessing code above):

import math

# compute tf
def tf(word, count):
    return count[word] / sum(count.values())

# count how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# compute idf
def idf(word, count_list):
    return math.log2(len(count_list) / n_containing(word, count_list))    # log base 2

# compute tf-idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d"%(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)    #type=list
    # sorted_words = matutils.unitvec(sorted_words)
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(word, round(score, 5)))

The output is as follows:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.30677
    Word: rugby, TF-IDF: 0.07669
    Word: known, TF-IDF: 0.05113
Top words in document 2
    Word: play, TF-IDF: 0.05283
    Word: inches, TF-IDF: 0.03522
    Word: worth, TF-IDF: 0.03522
Top words in document 3
    Word: net, TF-IDF: 0.10226
    Word: teammate, TF-IDF: 0.07669
    Word: across, TF-IDF: 0.05113

As we can see, the keywords extracted by my hand-rolled TF-IDF model agree with gensim's. (The last two words for the basketball article differ only because those words have exactly the same tfidf value, so the two implementations break the tie in different orders.) There is one problem, though: the computed tfidf values are different. Why is that?
  Checking the source code where gensim computes tf-idf values (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/tfidfmodel.py) reveals the answer.

It turns out that gensim normalizes the resulting tf-idf vector, converting it into a unit vector. Therefore we need to add this normalization step to our earlier code, as follows:

import numpy as np

# normalize the vector to unit length
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst)*np.array(lst)))
    unit_vector = [(item[0], item[1]/L2Norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d"%(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)    #type=list
    sorted_words = unitvec(sorted_words)   # normalize
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(word, round(score, 5)))

The output is as follows:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: shooting, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: back, TF-IDF: 0.22888

Now the output agrees with the result obtained from gensim!
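
Incidentally, gensim itself ships a helper for exactly this normalization, gensim.matutils.unitvec, which is what the commented-out matutils line in the earlier snippet hinted at. A usage sketch on a plain numpy array:

import numpy as np
from gensim import matutils

vec = np.array([3.0, 4.0])
print(matutils.unitvec(vec))   # [0.6 0.8]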

Summary

Gensim is a famous Python module for NLP; when you have time, it is well worth reading its source code! Later we will cover further applications of TF-IDF. Discussion is welcome!


The complete code for this article is as follows:

import nltk
import math
import string
from nltk.corpus import stopwords     # stop words
from collections import Counter       # word counting
from gensim import corpora, models, matutils

text1 ="""
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. 
Unqualified, the word football is understood to refer to whichever form of football is the most popular 
in the regional context in which the word appears. Sports commonly called football in certain places 
include association football (known as soccer in some countries); gridiron football (specifically American 
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); 
and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, 
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) 
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard 
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is 
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops 
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with 
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period 
of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a 
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before 
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches 
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across 
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""

# text preprocessing
# function: split text into sentences and words, and remove punctuation
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence segmentation
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # word tokenization
            if word not in string.punctuation: # remove punctuation
                tokens.append(word)
    return tokens

# remove stop words from the raw text
# and build a count dictionary, i.e., the number of occurrences of each word
def make_count(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]    # remove stop words
    count = Counter(filtered)
    return count

# compute tf
def tf(word, count):
    return count[word] / sum(count.values())

# count how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# compute idf
def idf(word, count_list):
    return math.log2(len(count_list) / n_containing(word, count_list))    # log base 2

# compute tf-idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

import numpy as np

# normalize the vector to unit length
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst)*np.array(lst)))
    unit_vector = [(item[0], item[1]/L2Norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d"%(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)    #type=list
    sorted_words = unitvec(sorted_words)   # normalize
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(word, round(score, 5)))

# training with gensim's TfidfModel
def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]
    return filtered

# get text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]
# training by TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v:k for k,v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d"%(i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)    #type=list
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(new_dict[num], round(score, 5)))
        
"""

Output:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: word, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: inches, TF-IDF: 0.19915
    Word: points, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: bat, TF-IDF: 0.22888

Training by gensim Tfidf Model.......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: cm, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: across, TF-IDF: 0.22888
"""

III. Introduction to NLP (3): Lemmatization

Lemmatization is an important part of text preprocessing and is quite similar to stemming.
  Simply put, lemmatization removes a word's affixes and extracts its base form, which is usually an actual dictionary word. This differs from stemming, whose output is not necessarily a dictionary word. For example, the lemma of "cars" is "car", and the lemma of "ate" is "eat".
  In Python's nltk module, WordNet provides us with a robust lemmatization function, as in the following example Python code:

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
# lemmatize nouns
print(wnl.lemmatize('cars', 'n'))
print(wnl.lemmatize('men', 'n'))

# lemmatize verbs
print(wnl.lemmatize('running', 'v'))
print(wnl.lemmatize('ate', 'v'))

# lemmatize adjectives
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('fancier', 'a'))

The output is as follows:

car
men
run
eat
sad
fancy

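For contrast, here is what stemming does (a minimal sketch with NLTK's PorterStemmer; note that a stem is not necessarily a dictionary word, and irregular forms are not handled):

from nltk.stem import PorterStemmer

ps = PorterStemmer()
print(ps.stem('ponies'))    # 'poni' -- not a dictionary word
print(ps.stem('ate'))       # 'ate'  -- the stemmer cannot map this to 'eat'
print(ps.stem('running'))   # 'run'
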
In the lemmatization code above, the wnl.lemmatize() function performs the lemmatization: the first argument is the word and the second is its part of speech (noun, verb, adjective, etc.); the return value is the lemmatized word.
  Lemmatization itself is generally simple, but when using it, specifying the word's part of speech is important; otherwise the result may be poor, as in the following code:

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize('ate', 'n'))
print(wnl.lemmatize('fancier', 'v'))

The output is as follows:

ate
fancier

So how do we obtain a word's part of speech? In NLP this is done with part-of-speech (POS) tagging. In nltk, we can use nltk.pos_tag() to obtain the part of speech of each word in a sentence, as in the following Python code:

sentence = 'The brown fox is quick and he is jumping over the lazy dog'
import nltk
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens)
print(tagged_sent)

The output is as follows:

[('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'),
('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'),
('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
('dog', 'NN')]

The Penn Treebank POS tags that appear above have the following meanings:

DT: determiner
JJ: adjective
NN: noun, singular
VBZ: verb, third person singular present
CC: coordinating conjunction
PRP: personal pronoun
VBG: verb, gerund or present participle
IN: preposition or subordinating conjunction
  OK, now that we know how to obtain the part of speech of each word in a sentence, we can combine POS tagging with lemmatization to do the job properly. Example Python code:

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# map a Penn Treebank POS tag to a WordNet POS constant
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

sentence = 'football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.'
tokens = word_tokenize(sentence)  # tokenization
tagged_sent = pos_tag(tokens)     # POS tagging

wnl = WordNetLemmatizer()
lemmas_sent = []
for tag in tagged_sent:
    wordnet_pos = get_wordnet_pos(tag[1]) or wordnet.NOUN
    lemmas_sent.append(wnl.lemmatize(tag[0], pos=wordnet_pos)) # lemmatization

print(lemmas_sent)

The output is as follows:

['football', 'be', 'a', 'family', 'of', 'team', 'sport', 'that',
'involve', ',', 'to', 'vary', 'degree', ',', 'kick', 'a', 'ball',
'to', 'score', 'a', 'goal', '.']

This output is the sentence with every word lemmatized.
  That is all for this installment; discussion is welcome!






