Information Extraction: Street Name Extraction

Latest recommended article published 2024-10-12 18:23:57

Peter_ch_26

Article link: https://blog.csdn.net/c654528593/article/details/114048304


How to extract road information from text

Problem

Given a corpus of address strings, extract the road names they contain.

Data

向塘北大道西50米
天龙路与龙华路交叉口北50米
观澜大道490号附近
成都市锦江区海椒市街13号附7号
玉兰西路
团结北路23号
湖塘镇火炬北路12号
昆明市晋宁区庄跷西路28
金水路合作路28-1号
长公大道浙江显家门业阆中总代理旁
安阳街道岭下东路4号楼
万顷沙珠江街珠江东路169号
中央大街万达广场a座一层a17
梅亭路18号民生银行旁
北京市四川西路

Output

向塘北大道西50米 -> 塘北大道
北京市四川西路 -> 四川西路

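Approach

1. Run an existing toolkit for part-of-speech tagging, then inspect the tagged data and summarize rules (general-purpose method + rules).
2. Building on step 1, accumulate domain vocabulary, use that vocabulary to label more data, and then train a domain-specific NER model.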
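
The second approach can be sketched as follows. This is only a minimal illustration, assuming that road names already collected by the rule-based step serve as the domain lexicon. The function `bio_label`, the `ROAD` tag name, and the sample lexicon are hypothetical and do not come from the original script. Each character receives a B-ROAD / I-ROAD / O tag, a common input format for training a sequence-labeling NER model.

```python
def bio_label(text, lexicon):
    """Tag each character of `text` as B-ROAD / I-ROAD / O via lexicon lookup."""
    labels = ["O"] * len(text)
    # Try longer entries first so a full road name wins over a shorter overlap.
    for name in sorted(lexicon, key=len, reverse=True):
        if not name:
            continue  # ignore empty lexicon entries
        start = text.find(name)
        while start != -1:
            end = start + len(name)
            # Only label spans that are still untagged, to avoid overlaps.
            if all(tag == "O" for tag in labels[start:end]):
                labels[start] = "B-ROAD"
                for i in range(start + 1, end):
                    labels[i] = "I-ROAD"
            start = text.find(name, start + 1)
    return list(zip(text, labels))


# Hypothetical lexicon built from road names found by the rule-based step.
road_lexicon = {"四川西路", "玉兰西路", "团结北路"}
for char, tag in bio_label("北京市四川西路", road_lexicon):
    print(char, tag)
```

Lexicon matching of this kind produces noisy labels (a road name missing from the lexicon is tagged O), so the generated data would normally be spot-checked before training.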

Implementation

Implementation of the first approach:

"""
@desc:
    a simple road name extraction script
@author:peter
@mail:peter_chen_jaon@foxmail.com
@date:2021/2/24
@note:
    This script may have problems, but it was written against the small amount
    of data currently available, so treat it as a reference only.
"""

import pkuseg
import jieba
from jieba import posseg


def tokenize_pku(text):
    # With postag=True, pkuseg returns a list of (word, tag) tuples.
    return tokenizer.cut(text)

def tokenize_jieba(text):
    # Convert jieba's pair objects into the same (word, tag) tuple format.
    return [(pair.word, pair.flag) for pair in posseg.cut(text)]

USE_PKU_TOKENIZER = True

tokenize = None
if USE_PKU_TOKENIZER:
    tokenizer = pkuseg.pkuseg(postag=True)
    tokenize = tokenize_pku
else:
    # jieba's accuracy is limited.
    tokenize = tokenize_jieba


data="""向塘北大道西50米
天龙路与龙华路交叉口北50米
观澜大道490号附近
成都市锦江区海椒市街13号附7号
玉兰西路
团结北路23号
湖塘镇火炬北路12号
昆明市晋宁区庄跷西路28
金水路合作路28-1号
长公大道浙江显家门业阆中总代理旁
安阳街道岭下东路4号楼
万顷沙珠江街珠江东路169号
中央大街万达广场a座一层a17
梅亭路18号民生银行旁
北京市四川西路""".split("\n")
data = [line.strip() for line in data]

# Data: POS-tagged tokens for each address line
pos_cands = [tokenize(line) for line in data]

road_keywords = ["街", "大道", "路"]

def check(word):
    # Check whether the word contains a road keyword (street, avenue, road).
    return any(w in word for w in road_keywords)

def check_city(word):
    # Check whether the word is an administrative region (province, city, district, ...).
    keywords = ["省", "市", "区", "街道", "县", "村", "镇"]
    return any(word.endswith(key) for key in keywords)

def find_road(pos_cands,verbose=False):
    """
    POS combinations that can form a road name:
            n+n
            v+ns
            ns+n
            ns+ns
            ns
            n
            j+n

        Tokens tagged n or ns must contain a road keyword.
    Args:
        pos_cands: list of (word, pos) tuples, e.g. [("北京","ns")]
    """
    res = []
    pre_idx = -1
    pre_pos = ""
    text = ""
    if verbose:
        print(pos_cands)
    for idx,(word,pos) in enumerate(pos_cands):
        # Filter out administrative-region words.
        if pos=="ns" and check_city(word):
            continue

        # Rules summarized from observing the tagged data.
        if pre_pos in ["v","j","n","a","ns"]and pos in ["ns","n"]:
            if check(word):
                text+=word  
                res.append(text)
                
                text = ""
                pre_pos=""
            else:
                text=word
                pre_pos=pos
            pre_idx=idx
                
        elif check(word) and pos in ["ns","n"]:
            res.append(word)
            pre_idx=idx 

        elif pos in ["v","j","n","a"]:
            # print(word)
            text+=word  
            pre_idx=idx
            pre_pos=pos

        elif pos in ["ns","n"]:
            # print(word)
            text+=word  
            pre_idx=idx
            pre_pos=pos
        else:
            pre_idx= idx
            pre_pos=""

    if text:
        res.append(text)
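    # Keep only candidates that contain a road keyword. check() is evaluated once
    # per candidate, so a word matching several keywords is not added twice.
    real_res = [word for word in res if check(word)]
    return real_res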

for cand in pos_cands:
    print(find_road(cand))

            
# print(find_road(pos_cands[-4],True))
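
The core of the combination rule can also be tried without installing a tokenizer. The sketch below is a simplified restatement, not the script above: it merges a run of modifier tokens with the next noun that contains a road keyword. The helper `merge_road_tokens` and the hand-written POS tags are hypothetical; a real tagger may segment and tag these addresses differently.

```python
ROAD_KEYWORDS = ("大道", "路", "街")
REGION_SUFFIXES = ("省", "市", "区", "街道", "县", "村", "镇")
MODIFIER_TAGS = {"v", "j", "n", "a", "ns"}
HEAD_TAGS = {"n", "ns"}


def merge_road_tokens(tagged_tokens):
    """Join modifier tokens with the next n/ns token that contains a road keyword."""
    roads, buffer = [], ""
    for word, tag in tagged_tokens:
        # Drop administrative regions such as "北京市" so they never prefix a road.
        if tag == "ns" and word.endswith(REGION_SUFFIXES):
            buffer = ""
            continue
        if tag in HEAD_TAGS and any(key in word for key in ROAD_KEYWORDS):
            roads.append(buffer + word)  # modifier(s) + road head
            buffer = ""
        elif tag in MODIFIER_TAGS:
            buffer += word  # possible prefix of a road name
        else:
            buffer = ""  # numbers, units, etc. break the run
    return roads


# Hand-tagged tokens (assumed tags, for illustration only).
print(merge_road_tokens([("北京市", "ns"), ("四川", "ns"), ("西路", "n")]))
print(merge_road_tokens([("观澜", "ns"), ("大道", "n"), ("490", "m"), ("号", "q")]))
```

Feeding hand-tagged tokens like this makes it easier to see which rule fires for a given address before debugging the tagger's own output.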