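Method 1:
Use the address normalization module GeocodingCHN
- The module can restructure an address into a standard form and fill in missing province and city names (a district or county that is absent from the input cannot be filled in).
- For similarity comparison, both addresses are first normalized and the normalized results are then compared; the score itself is computed with cosine similarity.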
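To make the cosine similarity step concrete, here is a minimal sketch in plain Python. It is not GeocodingCHN's internal implementation: the function name `cosine_similarity` and the choice of single characters as tokens are assumptions made only for illustration, and the library's own tokenization and weighting may differ.

```python
import math
from collections import Counter


def cosine_similarity(text1: str, text2: str) -> float:
    """Cosine similarity of two strings using character-count vectors.

    Assumption: characters are used as tokens purely for illustration;
    GeocodingCHN's own tokenization and weighting may differ.
    """
    vec1, vec2 = Counter(text1), Counter(text2)
    # Dot product over the characters the two strings share.
    dot = sum(vec1[ch] * vec2[ch] for ch in vec1.keys() & vec2.keys())
    # Euclidean length of each count vector.
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty string has no direction to compare
    return dot / (norm1 * norm2)
```

Identical strings score approximately 1 and strings with no characters in common score 0, which is why normalizing both addresses first matters: it removes formatting differences that would otherwise lower the score.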
The implementation is as follows:
import pandas as pd
import numpy as np
from GeocodingCHN import Geocoding

# Load the data into df

# Normalization function
def addr_format(addr):
    address_nor = Geocoding.normalizing(addr)
    return address_nor

# Similarity function
def addr_similar(text1, text2):
    Address_1 = Geocoding.normalizing(text1)
    Address_2 = Geocoding.normalizing(text2)
    # Compare only when both addresses were normalized successfully
    if type(Address_1) == Geocoding.Address and type(Address_2) == Geocoding.Address:
        similar = Geocoding.similarityWithResult(Address_1, Address_2)
        return similar
    else:
        return 0

# df here is expected to expose 'addr1' and 'addr2',
# e.g. a single row when applied row by row
def ex_similar(df):
    sim = addr_similar(df['addr1'], df['addr2'])
    return sim

def ex_format1(df):
    ex = ad