Recognizing Provinces, Cities, and Districts in Text with Python (cpca)

Article link: https://blog.csdn.net/xun527/article/details/142335481

1. Installation

pip install cpca

Note: the cpca module currently supports only Python 3 and later.

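2. Basic usage

The most basic province, city, and district extraction takes just two lines of code, an import and a call to cpca.transform: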

import cpca

location_str = [
    "新疆古阿贾克斯就打开房间啊开始",
    "河北省石家庄市动物园",
    "安全生产目标为“五无”：无死亡、无重伤、无倒（坍）塌、无中毒、无火灾。争创天津市市级文明工地。",
    "武清区广贤路与广聚路交叉口北200米",
    "共和人民政府"
]
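# transform() takes a list of address strings and returns a pandas DataFrame,
# one row per input string
df = cpca.transform(location_str)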
print(df)

Output:

          省     市     区               地址  adcode
0  新疆维吾尔自治区  None  None    古阿贾克斯就打开房间啊开始  650000
1       河北省  石家庄市  None              动物园  130100
2       天津市  None  None          市级文明工地。  120000
3       天津市   市辖区   武清区  广贤路与广聚路交叉口北200米  120114
4      None  None  None             None    None
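The adcode column in the output above is a six-digit administrative division code. As a rough sketch, assuming the usual layout of two digits per level (province, city, district), the trailing zeros show how specific a match is. The helper name `adcode_level` is hypothetical and not part of cpca:

```python
def adcode_level(adcode):
    """Classify a six-digit adcode by its trailing zeros.

    Assumes the two-digits-per-level layout (province, city, district).
    Returns None when no adcode is available.
    """
    if adcode is None:
        return None
    code = str(adcode)
    if len(code) != 6 or not code.isdigit():
        raise ValueError("expected a six-digit code, got %r" % (adcode,))
    if code.endswith("0000"):
        return "province"
    if code.endswith("00"):
        return "city"
    return "district"


# adcode values copied from the output above
for code in ("650000", "130100", "120114", None):
    print(code, adcode_level(code))
```

For the rows above, 650000 is a province-level result, 130100 a city-level result, and 120114 a district-level result.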

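To see where in each string the province, city, and district names were found, add the pos_sensitive=True argument. In the extra columns, -1 means the name does not appear in the text itself (it was either inferred or not extracted at all):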

import cpca

location_str = [
    "新疆古阿贾克斯就打开房间啊开始",
    "河北省石家庄市动物园",
    "安全生产目标为“五无”：无死亡、无重伤、无倒（坍）塌、无中毒、无火灾。争创天津市市级文明工地。",
    "武清区广贤路与广聚路交叉口北200米",
    "共和人民政府"
]
# pos_sensitive=True adds the 省_pos, 市_pos and 区_pos columns
df = cpca.transform(location_str, pos_sensitive=True)
print(df)

Output:

          省     市     区               地址  adcode  省_pos  市_pos  区_pos
0  新疆维吾尔自治区  None  None    古阿贾克斯就打开房间啊开始  650000      0     -1     -1
1       河北省  石家庄市  None              动物园  130100      0      3     -1
2       天津市  None  None          市级文明工地。  120000     37     -1     -1
3       天津市   市辖区   武清区  广贤路与广聚路交叉口北200米  120114     -1     -1      0
4      None  None  None             None    None     -1     -1     -1
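One way to read the position columns is sketched below. It works on plain dictionaries copied from two rows of the output above rather than on the DataFrame itself. The function name `explain_positions` and its labels are illustrative and not part of cpca, and the sketch assumes missing values appear as None:

```python
LEVELS = ("省", "市", "区")  # column names as they appear in the output


def explain_positions(row):
    """Label each level as matched in the text, inferred, or missing."""
    result = {}
    for level in LEVELS:
        name = row[level]
        pos = row[level + "_pos"]
        if name is None:
            result[level] = "missing"  # nothing was extracted
        elif pos >= 0:
            result[level] = "matched at index %d" % pos
        else:
            result[level] = "inferred"  # a value is present but has no position
    return result


# Rows 0 and 3 of the output above, written out by hand
row_xinjiang = {"省": "新疆维吾尔自治区", "市": None, "区": None,
                "省_pos": 0, "市_pos": -1, "区_pos": -1}
row_wuqing = {"省": "天津市", "市": "市辖区", "区": "武清区",
              "省_pos": -1, "市_pos": -1, "区_pos": 0}

print(explain_positions(row_xinjiang))
print(explain_positions(row_wuqing))
```

With a real result, `df.to_dict("records")` yields one such dictionary per row, though missing values may then need a pandas-aware check such as `pandas.isna` instead of a comparison with None.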

3. Advanced usage

To recognize multiple regions in a long passage of text in one pass:

import cpca

location_str = "太原是一座具有2500年建城史的历史文化名城。"\
                "行走锦绣太原城，每一条沧桑厚重的街巷都充满了历史气息，"\
                "每一块古色古香的砖瓦都承载了文化符号，"\
                "每一段鲜为人知的背后都记载了人文故事。"
# transform_text_with_addrs() takes a single string rather than a list
df = cpca.transform_text_with_addrs(location_str, pos_sensitive=True)
print(df)

Output:

     省    市     区 地址  adcode  省_pos  市_pos  区_pos
0  山西省  太原市  None     140100     -1     27     -1
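When working with long passages, it can help to look at the text surrounding a reported position. This is a minimal sketch using only string slicing. `context_around` and its `width` parameter are hypothetical names, not part of cpca:

```python
def context_around(text, pos, width=4):
    """Return up to `width` characters on either side of index `pos`.

    Returns None for a negative position (no match in the text).
    """
    if pos < 0:
        return None
    start = max(0, pos - width)
    return text[start:pos + width]


text = ("太原是一座具有2500年建城史的历史文化名城。"
        "行走锦绣太原城，每一条沧桑厚重的街巷都充满了历史气息，"
        "每一块古色古香的砖瓦都承载了文化符号，"
        "每一段鲜为人知的背后都记载了人文故事。")

# 27 is the 市_pos value from the output above
print(context_around(text, 27))
```

Index 27 falls on the 太原 in 行走锦绣太原城, so the printed snippet shows that phrase.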

4. Further reading

For more details, see the project's GitHub page. The README is written entirely in Chinese and is easy to follow:

https://github.com/DQinYuan/chinese_province_city_area_mapper