NLTK: English Sentence Tokenization + Lemmatization

1. Preparation

① Install the nltk module, then manually download the required model data from the

nltk/nltk_data: NLTK Data

repository and place it in the matching folder.

To find out which folder that is, just run the code first: the error message lists the paths where nltk_data is expected.
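If your machine can reach the NLTK download servers, you can also fetch the same resources programmatically instead of copying files by hand. A minimal sketch; the resource names below are the ones this tutorial's code needs on common NLTK versions, but newer releases may ask for variants such as punkt_tab, so trust whatever name the error message prints:

import nltk

# punkt backs word_tokenize, averaged_perceptron_tagger backs pos_tag,
# and wordnet plus omw-1.4 back WordNetLemmatizer.
for resource in ["punkt", "averaged_perceptron_tagger", "wordnet", "omw-1.4"]:
    nltk.download(resource)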

② Prepare the pos_map.json file below and place it in the current folder. It maps the Penn Treebank POS tags returned by pos_tag to the single-letter WordNet POS codes that WordNetLemmatizer expects:

{
    "NN": "n",
    "NNS": "n",
    "NNP": "n",
    "NNPS": "n",
    "PRP": "n",
    "PRP$": "n",
    "VB": "v",
    "VBD": "v",
    "VBG": "v",
    "VBN": "v",
    "VBP": "v",
    "VBZ": "v",
    "MD": "v",
    "JJ": "a",
    "JJR": "s",
    "JJS": "s",
    "RB": "r",
    "RBR": "r",
    "RBS": "r",
    "IN": "r",
    "TO": "r",
    "CD": "n",
    "DT": "a",
    "WDT": "a",
    "CC": "r",
    "UH": "r"
}
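As an aside, if you would rather not maintain a JSON file, a common alternative is to derive the WordNet code from the first letter of the Penn Treebank tag. A minimal sketch (the helper name penn_to_wordnet is my own, and the mapping is coarser than the table above; for example JJR/JJS become "a" rather than "s"):

from nltk.corpus import wordnet

def penn_to_wordnet(tag: str) -> str:
    # Penn adjective, verb, and adverb tags start with J, V, and R;
    # everything else falls back to noun, which WordNetLemmatizer accepts.
    if tag.startswith("J"):
        return wordnet.ADJ   # "a"
    if tag.startswith("V"):
        return wordnet.VERB  # "v"
    if tag.startswith("R"):
        return wordnet.ADV   # "r"
    return wordnet.NOUN      # "n"

Since wordnet.ADJ/VERB/ADV/NOUN are just the strings "a"/"v"/"r"/"n", the result can be passed straight to wnl.lemmatize.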

2. Run the following code

from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
import json


def tokenize_and_tag(sentence):
    # Split the sentence into tokens
    tokens = word_tokenize(sentence)
    # Tag each token with its Penn Treebank POS tag
    tagged = pos_tag(tokens)
    # Separate the words from their tags
    words = [item[0] for item in tagged]
    pos_tags = [item[1] for item in tagged]
    return words, pos_tags

# Example usage
wnl = WordNetLemmatizer()
sentence = "The quick brown fox jumps over the lazy dog."
words, pos_tags = tokenize_and_tag(sentence)

print("Tokens:", words)
print("POS tags:", pos_tags)

# Map each Penn Treebank tag to a WordNet POS code, defaulting to noun
with open("pos_map.json", "r", encoding="utf-8") as f:
    pos_map: dict = json.load(f)

pos_tags = [pos_map.get(tag, "n") for tag in pos_tags]

for word, pos in zip(words, pos_tags):
    print(word + '--' + pos + '-->' + wnl.lemmatize(word, pos))

Sample output:

Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
POS tags: ['DT', 'JJ', 'NN', 'NN', 'VBZ', 'IN', 'DT', 'JJ', 'NN', '.']
The--a-->The
quick--a-->quick
brown--n-->brown
fox--n-->fox
jumps--v-->jump
over--r-->over
the--a-->the
lazy--a-->lazy
dog--n-->dog
.--n-->.
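For reuse, the whole pipeline can be folded into a single function. A minimal sketch built from the same pieces as above (the function name lemmatize_sentence is my own, not an NLTK API):

from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
import json

wnl = WordNetLemmatizer()
with open("pos_map.json", "r", encoding="utf-8") as f:
    POS_MAP = json.load(f)

def lemmatize_sentence(sentence):
    # Tokenize, tag, translate each Penn tag to a WordNet POS, then lemmatize
    tagged = pos_tag(word_tokenize(sentence))
    return [wnl.lemmatize(word, POS_MAP.get(tag, "n")) for word, tag in tagged]

print(lemmatize_sentence("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.']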
