1. Preparation
① Install the nltk module, then manually download the required models from the link and place them in the corresponding folder.
Not sure which folder? Just run the code first: the error message will print the paths where nltk expects to find the data.
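If manual downloading is inconvenient, the same data can usually be fetched with nltk's built-in downloader instead; a minimal sketch (these are the packages the code in step two needs, though exact package names can vary across nltk versions):

import nltk

# Tokenizer model used by word_tokenize
nltk.download("punkt")
# Tagger model used by pos_tag
nltk.download("averaged_perceptron_tagger")
# WordNet data used by WordNetLemmatizer
nltk.download("wordnet")
# Multilingual WordNet data; some nltk versions need it for lemmatization
nltk.download("omw-1.4")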
② Prepare a pos_map.json file and place it in the current folder. It maps the Penn Treebank tags returned by pos_tag to the WordNet POS letters the lemmatizer accepts (a prefix-based alternative is sketched right after the JSON):
{
    "NN": "n",
    "NNS": "n",
    "NNP": "n",
    "NNPS": "n",
    "PRP": "n",
    "PRP$": "n",
    "VB": "v",
    "VBD": "v",
    "VBG": "v",
    "VBN": "v",
    "VBP": "v",
    "VBZ": "v",
    "MD": "v",
    "JJ": "a",
    "JJR": "s",
    "JJS": "s",
    "RB": "r",
    "RBR": "r",
    "RBS": "r",
    "IN": "r",
    "TO": "r",
    "CD": "n",
    "DT": "a",
    "WDT": "a",
    "CC": "r",
    "UH": "r"
}
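If you would rather not maintain the JSON file, a rough equivalent can be derived from the Penn tag prefixes alone; a minimal sketch (penn_to_wordnet is a hypothetical helper name, and these coarse prefix rules are an assumption that loses some of the finer distinctions in the table above, e.g. JJR/JJS mapping to "s"):

def penn_to_wordnet(tag: str) -> str:
    # Hypothetical helper: coarse prefix rules instead of pos_map.json
    if tag.startswith("J"):                 # JJ, JJR, JJS -> adjective
        return "a"
    if tag.startswith("V") or tag == "MD":  # verbs and modals -> verb
        return "v"
    if tag.startswith("R"):                 # RB, RBR, RBS -> adverb
        return "r"
    return "n"                              # everything else: treat as noun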
2. Run the following code
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
import json

def tokenize_and_tag(sentence):
    # Tokenize the sentence into words
    tokens = word_tokenize(sentence)
    # Tag each token with its Penn Treebank POS
    tagged = pos_tag(tokens)
    # Split the (word, tag) pairs into two parallel lists
    words = [item[0] for item in tagged]
    pos_tags = [item[1] for item in tagged]
    return words, pos_tags

# Example usage
wnl = WordNetLemmatizer()
sentence = "The quick brown fox jumps over the lazy dog."
words, pos_tags = tokenize_and_tag(sentence)
print("Tokens:", words)
print("POS tags:", pos_tags)

# Map Penn Treebank tags to WordNet POS letters, defaulting to noun
with open("pos_map.json", "r", encoding="utf-8") as f:
    pos_map: dict = json.load(f)
pos_tags = [pos_map.get(tag, "n") for tag in pos_tags]

for i in range(len(words)):
    print(words[i] + '--' + pos_tags[i] + '-->' + wnl.lemmatize(words[i], pos_tags[i]))
Sample output:
Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
POS tags: ['DT', 'JJ', 'NN', 'NN', 'VBZ', 'IN', 'DT', 'JJ', 'NN', '.']
The--a-->The
quick--a-->quick
brown--n-->brown
fox--n-->fox
jumps--v-->jump
over--r-->over
the--a-->the
lazy--a-->lazy
dog--n-->dog
.--n-->.
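For reuse, the steps above can be folded into one function; a minimal sketch reusing the imports from the block above (lemmatize_sentence is a hypothetical name, and it assumes pos_map.json sits in the working directory as in step one):

def lemmatize_sentence(sentence, pos_map_path="pos_map.json"):
    # End-to-end: tokenize, tag, map tags to WordNet POS, lemmatize
    with open(pos_map_path, "r", encoding="utf-8") as f:
        pos_map = json.load(f)
    wnl = WordNetLemmatizer()
    return [wnl.lemmatize(word, pos_map.get(tag, "n"))
            for word, tag in pos_tag(word_tokenize(sentence))]

print(lemmatize_sentence("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.']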