Chinese Person-Name Recognition in Python (using hanlp, LTP, and LAC)

Latest recommended article published on 2025-04-15 10:21:16

呆萌的代Ma

This is an original article by CSDN blogger "呆萌的代Ma". Please include the blog link when reposting: https://blog.csdn.net/weixin_35757704/

Article link: https://blog.csdn.net/weixin_35757704/article/details/121097143

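Chinese person-name recognition is a sub-task of named entity recognition (NER). There are many ways to approach it, but in practice the available libraries vary widely in quality. Below are usage examples and results for three open-source libraries: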

First, hanlp
hanlp GitHub page: https://github.com/hankcs/pyhanlp

Next, LTP
ltp GitHub page: https://github.com/HIT-SCIR/ltp
ltp documentation: https://ltp.readthedocs.io/zh_CN/latest/

Finally, LAC
LAC GitHub page: https://github.com/baidu/lac

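In my project, the task is to recognize person names in sentences that have almost no logical structure, so named entity recognition performs poorly. Instead, I use lexical analysis (part-of-speech tagging) to extract candidate names and then prune them according to business rules. The more a method depends on context, the more trouble it has with this kind of input. The overall ranking on my data was: Baidu AI Cloud lexical analysis > LTP > LAC > Alibaba Cloud NLP lexical analysis > hanlp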
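The "extract candidates, then prune with business rules" step could look roughly like the sketch below. The rules here (two to four Chinese characters, a small blocklist) and the names `NAME_PATTERN`, `BLOCKLIST`, and `filter_candidates` are hypothetical and only for illustration; they are not the rules used in the original project.

```python
import re

# Hypothetical rule: a candidate must consist of 2 to 4 Chinese characters.
NAME_PATTERN = re.compile(r"[\u4e00-\u9fff]{2,4}")
# Hypothetical blocklist of words a tagger might mislabel as person names.
BLOCKLIST = {"浙江", "笔名"}


def filter_candidates(candidates):
    """Keep candidates that pass the rules, preserving order and dropping duplicates."""
    seen = set()
    kept = []
    for name in candidates:
        if name in seen or name in BLOCKLIST:
            continue
        if NAME_PATTERN.fullmatch(name):
            seen.add(name)
            kept.append(name)
    return kept


print(filter_candidates(["鲁迅", "浙江", "鲁迅", "周", "Tom", "周树人"]))
# prints ['鲁迅', '周树人']
```

Because the rules are applied after tagging, they work the same way regardless of which library produced the candidate list.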

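In particular:

LTP is middle-of-the-road.
LAC extracts the fewest names, but almost everything it extracts is correct.
hanlp extracts the most candidates, but more than half of them are not person names.
(Oddly, this is the opposite of what the example below shows.)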
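The trade-off described above (few but correct versus many but noisy) is the usual precision/recall trade-off. A minimal sketch of how to measure it follows. The three output lists are copied from the example run further down in this article; the `gold` list is my own assumed labelling of that one sentence, not something the original project provides, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted names against hand-labelled gold names."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Assumed gold labels for the example sentence (courtesy names are not counted here).
gold = ["周树人", "周樟寿", "鲁迅"]
outputs = {
    "hanlp": ["周樟寿", "鲁迅"],
    "LAC": ["周树人", "周樟寿", "鲁迅"],
    "LTP": ["周树人", "周樟寿", "豫才", "鲁迅"],
}

for tool, names in outputs.items():
    p, r = precision_recall(names, gold)
    print(f"{tool}: precision={p:.2f} recall={r:.2f}")
```

A single sentence says little; the numbers only become meaningful over a labelled sample drawn from the real data.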

Other Chinese NLP toolkits:

NLPIR: https://github.com/NLPIR-team/NLPIR
FoolNLTK: https://github.com/rockyzhengwu/FoolNLTK
THULAC: https://github.com/thunlp/THULAC-Python

Example

from LAC import LAC


def hanlp_username(sentences: str) -> list:
    from pyhanlp import HanLP
    segment = HanLP.newSegment().enableNameRecognize(True)
    seg_words = segment.seg(sentences)
    user_list = []
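    for term in seg_words:
        # Read the word and its part-of-speech tag from the Term fields instead of
        # splitting str(term) on '/', which breaks when the token itself contains '/'.
        word, tag = term.word, str(term.nature)
        if tag == 'nr':  # 'nr' is HanLP's person-name tag
            user_list.append(word)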
    return user_list


def ltp_username(sentences: str) -> list:
    from ltp import LTP

    ltp = LTP()  # Loads the Small model by default; it is downloaded to ~/.cache/torch/ltp
    seg, hidden = ltp.seg([sentences])  # Word segmentation (seg/pos API of the LTP release used here)
    nh_user_list = []
    pos_index_values = ltp.pos(hidden)
    # seg is a list of lists: one token list per input sentence
    for index, seg_i in enumerate(seg):
        pos_values = pos_index_values[index]
        for _index, _pos in enumerate(pos_values):
            if _pos == "nh":
                nh_user_list.append(seg_i[_index])
    return nh_user_list


def lac_username(sentences: str) -> list:
    # Load the LAC model (mode="lac": joint segmentation and tagging)
    user_name_list = []
    lac = LAC(mode="lac")
    lac_result = lac.run(sentences)
    for index, lac_label in enumerate(lac_result[1]):
        if lac_label == "PER":
            user_name_list.append(lac_result[0][index])
    return user_name_list


if __name__ == '__main__':
    text = "周树人（1881年9月25日－1936年10月19日），原名周樟寿，字豫山、豫亭，后改字豫才，以笔名鲁迅聞名於世，浙江紹興人"
    hanlp_user = hanlp_username(text)
    lac_user = lac_username(text)
    ltp_user = ltp_username(text)
    print("hanlp:", hanlp_user)
    print("LAC:", lac_user)
    print("LTP:", ltp_user)

Output:

hanlp: ['周樟寿', '鲁迅']
LAC: ['周树人', '周樟寿', '鲁迅']
LTP: ['周树人', '周樟寿', '豫才', '鲁迅']
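Since the three tools disagree, one simple way to combine them is a majority vote: keep a name only if enough tools propose it. The sketch below applies that idea to the three output lists above. The function `vote` and its `min_votes` parameter are hypothetical, and this is not something the original project is stated to do.

```python
from collections import Counter


def vote(results, min_votes=2):
    """Return names proposed by at least `min_votes` tools, in first-seen order."""
    counts = Counter()
    order = []
    for names in results.values():
        for name in set(names):  # each tool votes at most once per name
            counts[name] += 1
        for name in names:
            if name not in order:
                order.append(name)
    return [name for name in order if counts[name] >= min_votes]


results = {
    "hanlp": ["周樟寿", "鲁迅"],
    "LAC": ["周树人", "周樟寿", "鲁迅"],
    "LTP": ["周树人", "周樟寿", "豫才", "鲁迅"],
}

print(vote(results))               # prints ['周樟寿', '鲁迅', '周树人']
print(vote(results, min_votes=3))  # prints ['周樟寿', '鲁迅']
```

Raising `min_votes` favors precision, while lowering it to 1 gives the union of all tools and favors recall.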