Python string matching with a numeric anchor: a step up with re

Latest recommended article published on 2024-08-22 02:05:38

猿二出没

Tags: python, programming languages, regular expressions, big data

Article link: https://blog.csdn.net/m0_62350221/article/details/132663834

I. Use case

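When we need to check whether two columns of data share a common substring, we usually reach for Python's string matching. Plain string matching alone, however, is often not enough for real problems. When the addresses to be matched are stored in different formats, re can help. In this example, one column holds addresses provided by Tianyancha and the other holds addresses provided by a water utility (randomly generated), and we need to find which Tianyancha addresses appear among the water utility's addresses. The left column (Tianyancha) is precise down to city, district, town, street and house number. The right column (water utility) has an irregular format in which only the street and house number are reliably present. Matching from the left column against the right one therefore saves computing resources.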

II. Main idea

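The right column (water utility addresses) varies in form: some entries start at the city level, some at the town level, but the street and house number are always present. So we extract the street and house number from the left column (Tianyancha addresses). We use the first digit of the house number as an anchor: re.search() locates the digit in the address string, and we cut out a few characters around it (the code below takes 3 characters before the anchor and 5 characters starting at it). That substring is then searched for in the right column (water utility addresses).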
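The anchor idea can be sketched on a single made-up address. The helper name `anchor_window` and its `before`/`after` parameters are illustrative and do not appear in the original code. The start of the slice is clamped at 0, an assumption added here so that a digit near the beginning of the string does not produce a negative slice index.

```python
import re

def anchor_window(address, before=3, after=5):
    """Return the characters around the first digit, or None if there is no digit."""
    found = re.search(r"\d", address)
    if found is None:
        return None
    pos = found.start()              # index of the first digit: the numeric anchor
    start = max(pos - before, 0)     # clamp so the slice never starts at a negative index
    return address[start:pos + after]

# Hypothetical address, for illustration only
sample = "江苏省苏州市吴中区木渎镇金山路88号"
print(anchor_window(sample))
```

For this sample the window covers the street name and the house number, which is the part both address formats are expected to share.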

III. Code

1. First, import the required libraries

import pandas as pd
import re

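2. Use pandas to read the data from Excel and convert the columns we need into lists for the later steps.

ps: pandas reads the first sheet by default. To read another sheet, such as the second one, pass its name via sheet_name, for example sheet_name='sheet2' (the name has to match the sheet's actual name in the workbook).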

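task = pd.read_excel('洗浴.xlsx')  # Tianyancha table; the first sheet is read by default
task_one = pd.DataFrame(task['注册地址'], columns=['注册地址'])  # keep only the registered-address column
task_one_l = task_one.values.tolist()  # nested list: one single-element list per row

obj = pd.read_excel('副本明细汇总表7.24.xlsx', sheet_name='工商业')  # water utility table, named sheet
obj_one = pd.DataFrame(obj['地址'], columns=['地址'])  # keep only the address column
obj_one_l = obj_one.values.tolist()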

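3. Next, clean up the strings. We strip every symbol that came with the raw data or was generated along the way (brackets, quotes and commas from the list representation), so that stray symbols do not cause false mismatches during string matching.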

temp_one = str(task_one_l)
temp_two = str(obj_one_l)
# re.sub removes unwanted characters: brackets, single quotes and commas.
# Raw strings avoid the invalid-escape warning that '\[' triggers.
temp_one = re.sub(r"[\[\]',]", '', temp_one)
temp_two = re.sub(r"[\[\]',]", '', temp_two)

task_one_end = temp_one.split(' ')  # split the cleaned string on spaces into a list
obj_one_end = temp_two.split(' ')
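To see what this cleaning step does, here is a small sketch on made-up rows shaped like the output of `values.tolist()`. The function name `flatten_via_str` is hypothetical. It also shows a limitation worth keeping in mind: because the result is split on spaces, an address that itself contains a space is split into two entries, whereas flattening the rows directly keeps it whole.

```python
import re

def flatten_via_str(rows):
    """Reproduce the str() + re.sub + split(' ') approach from the article."""
    text = str(rows)                        # e.g. "[['a'], ['b']]"
    text = re.sub(r"[\[\]',]", "", text)    # drop brackets, quotes and commas
    return text.split(" ")

# Made-up rows: one inner list per spreadsheet row
rows = [["金山路88号"], ["人民路12号"]]
print(flatten_via_str(rows))

# Direct flattening gives the same result here without the string round trip
flat = [str(row[0]) for row in rows]
print(flat)

# An address containing a space is split by the string approach
print(flatten_via_str([["金山路 88号"]]))
```

If the real data may contain spaces or commas inside addresses, the direct list comprehension is the safer of the two variants.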

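4. Finally, match the substring extracted around the numeric anchor against the water utility addresses. Inside the loop we first check whether the string contains a digit, that is, a precise house number. Entries without a house number are skipped, which speeds up processing.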

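ps: re.search(pattern, string, flags) scans a string and reports whether the pattern occurs anywhere in it. re.search(r"\d", str(each)) checks whether the string each contains a digit. Calling .start() on the returned match gives the position of the first digit found in the string, which serves as the numeric anchor.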
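A short sketch of these calls on made-up strings. One detail that matters for Chinese address data: for str patterns, `\d` also matches full-width digits such as "８" unless the `re.ASCII` flag is passed.

```python
import re

m = re.search(r"\d", "金山路88号")
print(m.start())   # position of the first digit
print(m.group())   # the matched digit itself

# No digit: re.search returns None, so check the result before calling .start()
missing = re.search(r"\d", "金山路")
print(missing)

# Full-width digits count as digits for \d by default, but not with re.ASCII
wide = re.search(r"\d", "金山路８８号")
ascii_only = re.search(r"\d", "金山路８８号", re.ASCII)
print(wide.start(), ascii_only)
```

A full-width house number would still be found as an anchor, but the extracted substring would not match the same address written with ordinary digits, so normalizing digits beforehand may be worthwhile.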

def match(task_e, obj_e):
    for index, each in enumerate(task_e):
        one = re.search(r"\d", str(each))  # look for the first digit in each
        if one:
            num_index = one.start()  # position of the first digit: the numeric anchor
            start = max(num_index - 3, 0)  # clamp so a digit near the start does not give a negative index
            temp_char = str(each)[start:num_index + 5]  # characters around the numeric anchor
            print(index, temp_char)
            for o in obj_e:
                if temp_char in str(o):
                    print(str(o))
            print('\n')
        else:
            print(index, each)  # no digit means no house number: skip matching
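The function above prints its results. If the matches are needed for further processing, a variant that collects them may be more convenient. This is a sketch under assumptions: the name `collect_matches`, the returned dictionary layout and the sample addresses are all made up for illustration.

```python
import re

def collect_matches(task_e, obj_e, before=3, after=5):
    """Map each task index to (anchor window, list of obj addresses containing it)."""
    results = {}
    for index, each in enumerate(task_e):
        text = str(each)
        found = re.search(r"\d", text)
        if found is None:
            continue                      # no house number: nothing to match on
        pos = found.start()
        window = text[max(pos - before, 0):pos + after]
        hits = [str(o) for o in obj_e if window in str(o)]
        results[index] = (window, hits)
    return results

# Hypothetical data
task = ["苏州市吴中区金山路88号", "苏州市吴中区某大厦"]
obj = ["木渎镇金山路88号1幢", "人民路12号"]
print(collect_matches(task, obj))
```

Entries without a digit are left out of the result, mirroring the way the original loop skips them.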

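5. The complete code is at the end of the page

IV. Summary

This project applied a new approach during an audit of a water utility. It addresses the difficulty that inconsistent data formats cause for string matching: from strings of different types but with similar attributes, it extracts the meaningful substring around the first digit and then uses that substring for matching. The project code targets only the irregular customer addresses provided by the water utility, but the numeric-anchor idea carries over to many other work scenarios.

V. Complete code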

import pandas as pd
import re

def main():
    task_e, obj_e = read()
    match(task_e, obj_e)

def read():
    task = pd.read_excel('洗浴.xlsx')
    task_one = pd.DataFrame(task['注册地址'], columns=['注册地址'])
    task_one_l = task_one.values.tolist()

    obj = pd.read_excel('副本明细汇总表7.24.xlsx', sheet_name='工商业')
    obj_one = pd.DataFrame(obj['地址'], columns=['地址'])
    obj_one_l = obj_one.values.tolist()

    print(task_one_l)
    print(obj_one_l)

    temp_one = str(task_one_l)
    temp_two = str(obj_one_l)
    # Remove brackets, single quotes and commas (raw strings avoid invalid-escape warnings)
    temp_one = re.sub(r"[\[\]',]", '', temp_one)
    temp_two = re.sub(r"[\[\]',]", '', temp_two)

    task_one_end = temp_one.split(' ')
    obj_one_end = temp_two.split(' ')
    print(task_one_end, obj_one_end)
    return task_one_end, obj_one_end

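# Match the substring around the first digit of each address in the 洗浴 table
def match(task_e, obj_e):
    for index, each in enumerate(task_e):
        one = re.search(r"\d", str(each))  # look for the first digit in each
        if one:
            num_index = one.start()  # position of the first digit: the numeric anchor
            start = max(num_index - 3, 0)  # clamp so a digit near the start does not give a negative index
            temp_char = str(each)[start:num_index + 5]
            print(index, temp_char)
            for o in obj_e:
                if temp_char in str(o):
                    print(str(o))
            print('\n')
        else:
            print(index, each)  # no digit means no house number: skip matching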


if __name__ == "__main__":
    main()