Address Matching Based on Distance and String Similarity

风暴之零

Last modified: 2025-03-11 10:54:15


First published: 2022-12-04 18:59:57

Article link: https://blog.csdn.net/A41915460/article/details/128175436


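This post describes a method for matching addresses based on distance and string similarity: a coarse filter on the distance computed from longitude and latitude, address parsing with the cpca library, and string similarity computed with edit distance, the Levenshtein ratio, and the Jaro-Winkler algorithm. It also discusses how to set different similarity thresholds to balance false positives against missed matches.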


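Address matching based on distance and string similarity breaks down into three parts.

Data:
Business source data: df_s, containing the business addresses and their longitude/latitude
Data to look up: df_find, containing the addresses of the locations to look up and their longitude/latitude

1. Compute the distance between the coordinates of each source address and each address to look up, and use it as a coarse filter.
2. Preprocess the strings in df_s and df_find to prepare them for the similarity comparison.
2.1. Parse each address with the cpca library and drop the province and city information, keeping only the most specific part of the address.
2.2. Use a regular expression to strip special characters, keeping only letters, digits, and Chinese characters.
3. Compute the similarity between df_s and df_find with string similarity algorithms.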

I. Computing the distance

See the following article:
https://blog.csdn.net/A41915460/article/details/128065351
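
For readers who do not want to follow the link, the great-circle distance behind the coarse filter can be sketched with the haversine formula. This is a minimal plain-Python sketch, not the linked article's code: the function name `haversine_m`, the constant name, and the sample coordinates are assumptions made for illustration. The Earth radius matches the 6371 km used in the full listing at the end of this post.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres (6371 km)


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    d_lon = lon2 - lon1
    d_lat = lat2 - lat1
    a = sin(d_lat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(d_lon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Illustrative points one degree of latitude apart on the same meridian
print(round(haversine_m(112.5, 37.0, 112.5, 38.0)))
```

Pairs whose distance exceeds the chosen radius can be discarded before any string comparison, which keeps the expensive similarity step small.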

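II. String preprocessing

Address parsing relies mainly on the cpca library. Note that the library has a bug: when the province, city, and district are all missing from an address, it returns None instead of the address.
For example, parsing 王家庄村25号 with cpca returns None, while 山西省王家庄村25号 returns 王家庄村25号. When the administrative region is repeated, as in 山西山西省王家庄村25号, it still returns 王家庄村25号. We exploit this by always prepending the province name to the address string so that None is never returned: if the original string has no administrative region, the prepended province lets the library parse the address; if it already has one, the province is duplicated but the library still parses the address.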
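
The preprocessing can be sketched without installing cpca. In the sketch below, `clean_text` implements step 2.2 of the outline, and `detail_address` shows the province-prefix workaround with the parser passed in as a callable. The names `clean_text`, `detail_address`, and `fake_parse` are all hypothetical. `fake_parse` is only a stand-in that imitates the behaviour described above and is not cpca's implementation. Falling back to the raw text when the parser still returns None is an extra safety assumption, not something the post does.

```python
import re

# Everything except CJK ideographs, ASCII letters, and digits counts as noise
_NOISE = re.compile(r"[^\u4e00-\u9fa5a-zA-Z0-9]+")


def clean_text(text):
    """Drop punctuation, whitespace, and other special characters."""
    return _NOISE.sub("", text)


def detail_address(raw, parse, province="山西省"):
    """Prefix the province, parse, and fall back to the raw text on None.

    `parse` is any callable that maps an address string to its detail part
    or to None; in the post this role is played by cpca.
    """
    detail = parse(province + raw)
    return clean_text(detail if detail is not None else raw)


def fake_parse(address):
    """Stand-in parser for this demo only (NOT cpca): strips leading 山西省 / 山西."""
    stripped = re.sub(r"^(山西省?)+", "", address)
    return stripped if stripped != address else None


print(detail_address("王家庄村25号(东门)", fake_parse))
```

Passing the parser as an argument keeps the workaround testable on its own; in real use the callable would wrap the cpca call shown in the full listing.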

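III. String similarity

String similarity is computed from the edit distance, the Levenshtein ratio, the Jaro-Winkler similarity, and substring containment. Edit distance and the Levenshtein ratio define similarity more strictly, so they tend to miss true matches; Jaro-Winkler is more lenient, so it rarely misses matches but produces more false positives. The Jaro-Winkler threshold should therefore be set somewhat higher than the Levenshtein-ratio threshold: for example, with a Levenshtein-ratio threshold of 30%, the Jaro-Winkler threshold can be set to 50%.
The decision rule is:

    # r: Levenshtein ratio, d: edit distance, l: containment flag ("无" means no containment),
    # j_r: Jaro-Winkler similarity, dist_m: distance in metres between the source and query addresses
    matched = (
            (r > r_threshold and d < d_threshold)
            or (l != "无")
            or j_r > j_threshold
            or dist_m <= 50)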
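
To make the rule concrete, here is a minimal pure-Python sketch of the edit distance, the Levenshtein ratio (with substitutions costing 2, as described in the full listing), the containment check, and the combined decision. The function names are assumptions for illustration, and the post itself uses the Levenshtein library instead. The Jaro-Winkler score `j_r` is passed in rather than reimplemented, and the default thresholds are the ones used in the script's final call.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance: insert, delete, and substitute all cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def lev_ratio(a, b):
    """(sum - ldist) / sum, where a substitution costs 2 inside ldist."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    # With substitution = 2, ldist equals total - 2 * (longest common subsequence).
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    ldist = total - 2 * prev[-1]
    return (total - ldist) / total


def containment(a, b):
    """Return "B in A", "A in B", or "无" when neither contains the other."""
    if a and b:
        if b in a:
            return "B in A"
        if a in b:
            return "A in B"
    return "无"


def is_match(a, b, dist_m, j_r, r_threshold=0.3, d_threshold=21, j_threshold=0.5):
    """Apply the decision rule; j_r is a precomputed Jaro-Winkler similarity."""
    d = edit_distance(a, b)
    r = lev_ratio(a, b)
    l = containment(a, b)
    return ((r > r_threshold and d < d_threshold)
            or l != "无"
            or j_r > j_threshold
            or dist_m <= 50)


print(edit_distance("kitten", "sitting"), round(lev_ratio("kitten", "sitting"), 3))
```

Because the four conditions are joined with `or`, any single one is enough to accept a pair, which is why the more lenient Jaro-Winkler score gets the higher threshold.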

IV. Complete code

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import time
from numba import njit
from numpy import radians, sin, cos, arcsin, sqrt
from rtree import index
from functools import lru_cache
import re
import Levenshtein
import cpca
from Levenshtein import jaro_winkler

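############ String similarity helpers ##############
def baoliuhanzhi(sr):
    # Keep only Chinese characters, digits, and ASCII letters.
    # Inside a character class "^" negates only in the first position, so it must not be
    # repeated before each range; otherwise "^" itself survives the cleanup.
    re_str = "[^\u4e00-\u9fa5a-zA-Z0-9]+"
    return re.sub(re_str, "", sr)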


# Edit (Levenshtein) distance
def L_distance(str1, str2):
    d = Levenshtein.distance(str1, str2)
    return d

# Jaro-Winkler similarity
def J_radio(str1, str2):
    j = jaro_winkler(str1, str2)
    return j


# Levenshtein ratio: r = (sum - ldist) / sum, where sum is the combined length of str1 and str2
# and ldist is a weighted edit distance in which deletions and insertions still cost 1
# but a substitution costs 2.
def L_ratio(str1, str2):
    r = Levenshtein.ratio(str1, str2)
    return r


# Substring containment; "无" means neither string contains the other
def L_in(str1, str2):
    st = "无"
    n = len(str1) * len(str2)
    if n > 0:
        if str1 in str2:
            st = "A in B"
        if str2 in str1:
            st = "B in A"
    return st


# Compute the edit distance, Levenshtein ratio, Jaro-Winkler similarity, and containment for every
# candidate pair and apply the match rule; r_threshold, d_threshold, and j_threshold are the
# thresholds on r, d, and j_r
def str_matche(np_str, r_threshold, d_threshold, j_threshold):
    # Number of rows, used to size the result matrix
    m1 = np_str.shape[0]
    # Number of columns, used to size the result matrix and to iterate over candidates
    m2 = np_str.shape[1]
    # Result matrix with m1 rows and 5 * m2 columns: each candidate occupies 3 input columns
    # and 8 output columns, plus 2 summary columns, so the width must exceed 8 / 3 * m2
    name_value = np.full((m1, 5 * m2), "", dtype=object, order='C')

    i = 0
    cout = 0  # number of pairs skipped because one side was empty after cleaning
    for row in np_str:

        # row[n] is the candidate's address string, row[n + 1] its unique ID, row[n + 2] its distance;
        # the candidates start at column 4 and the last one starts at column m2 - 3

        ts = [(row[n], row[n + 1], row[n + 2]) for n in range(4, m2 - 2, 3) if isinstance(row[n], str)]

        n = len(ts)
        k = 2
        # Unique IDs of matched candidates (complaint ticket numbers here), used for de-duplication
        L_id = []

        for j in range(n):
            # Property address; prepend "山西省" so cpca can parse it even without an administrative region
            wyd = adr_str("山西省" + row[0])
            wyd = baoliuhanzhi(wyd)
            # Complaint address; prepend "山西省" for the same reason
            tsd = adr_str("山西省" + ts[j][0])
            tsd = baoliuhanzhi(tsd)
            # Count the pairs in which either string is empty

            if wyd == "" or tsd == "":
                cout += 1

            if wyd != "" and tsd != "":

                d = L_distance(wyd, tsd)
                r = L_ratio(wyd, tsd)
                l = L_in(wyd, tsd)
                j_r = J_radio(wyd, tsd)

                name_value[i, k] = ts[j][0]
                name_value[i, k + 1] = ts[j][1]
                name_value[i, k + 2] = ts[j][2]
                name_value[i, k + 3] = tsd
                name_value[i, k + 4] = d
                name_value[i, k + 5] = r
                name_value[i, k + 6] = l
                name_value[i, k + 7] = j_r
                k = k + 8

                if (
                        (r > r_threshold and d < d_threshold)
                        or (l != "无")
                        or j_r > j_threshold
                        or ts[j][2]<=50):
                    L_id.append(ts[j][1])

                mgd = len(set(L_id))
                name_value[i, 0] = mgd
                name_value[i, 1] = wyd
        i = i + 1

    print('None:' + str(cout))

    return name_value


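# Use cpca to strip the province/city/district from an address string and return the remaining detail
def adr_str(str1):
    # cpca only accepts an iterable of strings
    a_list = [str1]
    # Parse into a DataFrame of address components
    adr = cpca.transform(a_list)
    # adr has the columns [省, 市, 区, 地址, adcode]; take 地址 (the detailed address)
    adr_p = adr.loc[0, "地址"]
    return adr_p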


# Distance function; takes three arguments to match its caller
@njit()
def disN_3(o, s_lon, s_lat):
    # Convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [o[0], o[1], s_lon, s_lat])

    # Haversine formula
    d_lon = lon2 - lon1
    d_lat = lat2 - lat1
    aa = sin(d_lat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(d_lon / 2) ** 2
    c = 2 * arcsin(sqrt(aa))
    r = 6371  # Earth radius in kilometres
    return c * r * 1000  # distance in metres


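# Clean a data table
def data_table_clear(df, col_name):
    # Convert longitude and latitude to float
    try:
        df = df.astype({col_name[2]: 'float', col_name[3]: 'float'})
    except Exception:
        print("The longitude or latitude column contains values that cannot be converted to numbers, "
              "or the column names could not be read (for example, because of merged cells)")
    # Drop incomplete rows, then build the selected-column array and the longitude/latitude array
    df.dropna(inplace=True)
    df.reset_index(drop=True, inplace=True)
    data = df[col_name]
    np_data = data.values
    np_lon_lat = data[[col_name[2], col_name[3]]].values
    # df is shown in the final merged result (df_find only); np_data is used to look up source
    # information (df_s only); np_lon_lat is used to build the tree and compute distances (both)
    return df, np_data, np_lon_lat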


# Build the R-tree index
def Rtree_mode(np_lon_lat):
    n = np_lon_lat.shape[0]
    idx = index.Index(interleaved=True)
    # Insert each point with valid coordinates as a degenerate (lat, lon, lat, lon) box
    for i in range(n):
        lon = np_lon_lat[i, 0]
        lat = np_lon_lat[i, 1]
        if lon > 0 and lat > 0:
            idx.insert(i, (lat, lon, lat, lon))
    return idx


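# Cache results for speed; maxsize is the maximum number of cached results (default 128).
# Do not set it too large.
@lru_cache(maxsize=8, typed=False)
def query_nearest(model, poly, k):
    nearest_tree_idx = model.nearest(poly, k)
    return nearest_tree_idx


# Build the search window
@lru_cache(maxsize=8, typed=False)
def get_wind(find_point, dist):
    # One degree of latitude is about 111.199 km of arc
    lat_offset = dist / 111199  # latitude offset in degrees for dist metres
    # One degree of longitude is about 111 * cos(latitude) km and so varies with latitude;
    # the value at the southern end of the province, about 90 km, is used
    lon_offset = dist / 90000  # longitude offset in degrees for dist metres
    # find_point is (lon, lat); the window is (lat_min, lon_min, lat_max, lon_max)
    poly = (find_point[1] - lat_offset, find_point[0] - lon_offset,
            find_point[1] + lat_offset, find_point[0] + lon_offset)
    return poly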


# Return the indices of all sites inside the window around a single point
@lru_cache(maxsize=8, typed=False)
def query_nearest_dis(model, find_point, dist, max_point):
    poly = get_wind(find_point, dist)
    hits = model.intersection(poly)
    L = list(hits)
    n = len(L)
    if n > max_point:
        L = L[0:max_point]
    return L


"""
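model: the tree that has already been built
dist: keep points whose spatial distance is less than dist
max_point: the number of points that can be held in memory per query. Set it generously: because several points can share the same coordinates, it must be larger than k. Too small a value drops points; too large a value wastes memory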

"""


def nearest_point_dist(model, np_s, np_find, dist, max_point):
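    # Result arrays for the points found around each query point; unused slots are filled with -1
    i = 0  # index into the query set, and row index of the result arrays
    m = np_find.shape[0]
    idx_res = np.full((m, 3 * max_point), -1, dtype=int, order='C')
    disc_res = np.full((m, 3 * max_point), 1008610, dtype=float, order='C')

    for find_point in np_find:

        j = 0  # column in the result arrays

        # The window must be in (lat, lon) order; with rtree a point has to be turned into a box
        # before it can be queried. The tuple also makes find_point hashable for lru_cache

        find_point = tuple(find_point)

        # poly = get_wind(find_point, dist)

        # Query the tree for the indices of the points inside the window
        nearest_tree_idx = query_nearest_dis(model, find_point, dist, max_point)

        for nearest_idx in nearest_tree_idx:
            idx_res[i, j] = nearest_idx
            # Look up the coordinates stored under this index; the R-tree itself cannot return them
            d = disN_3(find_point, np_s[nearest_idx][0], np_s[nearest_idx][1])
            disc_res[i, j] = d
            # Candidates are laid out left to right, so advance j
            j += 1

        i = i + 1
    # Clear the caches
    query_nearest_dis.cache_clear()
    get_wind.cache_clear()
    # Sort each row by distance and reorder the indices the same way so both arrays stay aligned,
    # then set entries farther than dist to -1; this keeps the array shape unchanged

    order = np.argsort(disc_res, axis=-1)
    disc_res = np.take_along_axis(disc_res, order, axis=-1)
    idx_res = np.take_along_axis(idx_res, order, axis=-1)
    p = np.where(disc_res > dist)

    idx_res[p] = -1
    disc_res[p] = -1

    # Find the last column that still holds a value other than -1 and trim both arrays to it,
    # which cuts redundant data and speeds up the next step
    p1 = np.where(disc_res > -1)
    q = p1[1].max()

    id_res = idx_res[:, 0:q + 1]
    dis_res = disc_res[:, 0:q + 1]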

    return id_res, dis_res


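# Look up the CGI and the cell's Chinese name in the table the R-tree was built from,
# and write them into one table together with the distance
def con_name_dis(id_r, dis_r, np_data_s):
    n = id_r.shape[0]
    # m = id_r.shape[1]
    w = np_data_s.shape[1]
    # Append a row of -1 to np_data_s: id_r is used below as a fancy index into np_data_s,
    # and a -1 in id_r selects the last row, so the appended row guarantees that the
    # placeholder -1 entries never pick up real data
    A = np.repeat(-1, w)
    A = A.reshape(1, -1)
    np_data_s = np.r_[np_data_s, A]
    # Fetch the values by fancy indexing
    temp = np_data_s[id_r]
    # Overwrite the third column (longitude) with the distance and keep (name, ID, distance)
    temp[:, :, 2] = dis_r
    temp = temp[:, :, 0:3]
    name_dis = temp.reshape(n, -1)
    return name_dis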


# Convert a Unix timestamp to a local-time string (Beijing time on a machine set to that zone)
def shijian(timeStamp):
    timeArray = time.localtime(timeStamp)
    otherStyleTime = time.strftime("%Y--%m--%d %H:%M:%S", timeArray)
    return otherStyleTime


if __name__ == "__main__":
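    ######## Parameters ###############
    # Columns to keep. Longitude and latitude are required; the name and ID columns are optional,
    # but dropping them means changing the corresponding parts of the functions.
    # col_name_s is for the source data: fault location, complaint serial number, longitude, latitude.
    # col_name_find is for the data to look up: property name, building ID, building longitude, building latitude.
    col_name_s = ["故障发生具体地点", "全量投诉流水号", "经度", "纬度"]
    col_name_find = ["物业点名称", "楼宇唯一标识", "楼宇经度", "楼宇纬度"]
    # Search radius in metres; sites within dist of a query point are kept
    dist = 300
    # Maximum number of points kept per query point
    max_point = 80

    t1 = time.time()
    print("Start", shijian(t1))
    ######## Data sources ############
    # To count how many points of table B lie around each point of table A, table A is df_find
    # and table B is df_s; they are read from path_find and path_s respectively
    path_s = r"D:\data\1124投诉名字匹配\22年投诉跟踪表1111.xlsx"
    path_find = r"D:\data\1124投诉名字匹配\10月&11月扫楼数据汇总-1124.xlsx"

    # path_target is where the final result is written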
    path_target = r"D:\data\1124投诉名字匹配\123.csv"

    df_s = pd.read_excel(path_s, usecols=col_name_s)
    df_find = pd.read_excel(path_find)
    df_data_find, np_data_find, np_find = data_table_clear(df_find, col_name_find)

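    # If the result only needs the columns in col_name_find, read just those columns instead
    # df_find = pd.read_excel(path_find, usecols=col_name_find)

    ####### Build the index ############

    df_data_s, np_data_s, np_s = data_table_clear(df_s, col_name_s)
    model = Rtree_mode(np_s)
    t2 = time.time()
    print("Index built; next step: search", shijian(t2))
    print("Elapsed (s)", t2 - t1)

    ###### Search ################

    # Find the nearby points: id_r holds the df_s row indices of the points near each query point,
    # dis_r holds the corresponding distances
    id_r, dis_r = nearest_point_dist(model, np_s, np_find, dist, max_point)

    # Look up the name and ID in np_data_s and write them into one array together with the distance
    np_t = con_name_dis(id_r, dis_r, np_data_s)
    # Concatenate the query table and the lookup results column-wise
    np_str = np.concatenate((np_data_find, np_t), axis=1)
    t3 = time.time()
    print("Search finished; next step: merge tables", shijian(t3))
    print("Elapsed (s)", t3 - t2)
    ##### String matching to produce the final result ######
    str_m = str_matche(np_str, 0.3, 21, 0.5)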
    np_str_m = np.concatenate((np_data_find, str_m), axis=1)
    df_str_m = pd.DataFrame(np_str_m)

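    # Output column names: the four df_find columns, the number of matched complaints (summary),
    # the cleaned property address, then for each candidate: complaint location, complaint ticket,
    # distance, cleaned complaint address, edit distance, Levenshtein ratio, containment flag,
    # and Jaro-Winkler similarity. From column 14 on, the per-candidate block (spreadsheet
    # columns H to O) repeats.
    df_str_m.rename(columns={0: col_name_find[0], 1: col_name_find[1], 2: col_name_find[2],
                             3: col_name_find[3], 4: "匹配投诉数量(汇总列)", 5: "物业点地址", 6: "投诉具体位置",
                             7: "投诉工单", 8: "距离",9: "投诉地址", 10: "编辑距离", 11: "莱文斯坦",
                             12: "是否包含关系", 13: "J_radio比",14:"后续列名重复H到O列"}, inplace=True)

    t4 = time.time()
    print("Merge finished; next step: write CSV", shijian(t4))
    print("Elapsed (s)", t4 - t3)
    # Write the result
    df_str_m.to_csv(path_target)

    t5 = time.time()
    print("Done", shijian(t5))
    print("Total elapsed (s)", t5 - t1)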
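
As a side note on the search window built by `get_wind`: it turns a radius in metres into degree offsets using fixed constants. A latitude-aware variant could look like the sketch below. The function name `degree_window` and the constant `M_PER_DEG_LAT` are assumptions for illustration and are not part of the script above. The window is only approximate, because the longitude scale is evaluated at the centre latitude.

```python
from math import cos, radians

M_PER_DEG_LAT = 111_195  # assumed metres per degree of latitude on a 6371 km sphere


def degree_window(lon, lat, dist_m):
    """Return (lat_min, lon_min, lat_max, lon_max) roughly enclosing a dist_m radius."""
    lat_off = dist_m / M_PER_DEG_LAT
    # A degree of longitude shrinks with cos(latitude), so the offset grows accordingly.
    lon_off = dist_m / (M_PER_DEG_LAT * cos(radians(lat)))
    return (lat - lat_off, lon - lon_off, lat + lat_off, lon + lon_off)


# Illustrative query point with a 300 m radius
print(degree_window(112.5, 37.8, 300))
```

Since every candidate inside the window is re-checked with the haversine distance afterwards, the window only needs to be large enough not to miss points; a slightly oversized window costs a little time but no accuracy.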