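Today a colleague was told to build a system for looking up product similarity. I assumed it would be something fancy along the lines of a recommender system, and I was quietly relieved not to have been handed the job, because learning and building that under a tight deadline would have cost me a lot of sleep. After my colleague explained it, I realized it was just a small system for finding similar product names. Put bluntly, it's string similarity!
Python has plenty of libraries for string similarity, such as difflib in the standard library and the third-party Levenshtein library.
I found a blog post online about the theory behind string similarity, but it was too long, too theory-heavy, and too specialized. I'm ashamed to say I gave up after two lines and just went with difflib. The reason is simple: nothing extra to install.
Straight to the code (never mind why the examples are sanitary pads, I'm not a pervert):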
# -*- coding: utf-8 -*-
# Python 2 code: string literals are byte strings, hence the .decode("utf-8") calls.
import difflib

query_str = "美妆 日本 Laurier花王乐而雅F系列日用护翼卫生巾(25cm*17片)瞬吸干爽超薄棉柔".decode("utf-8")
str_2 = "Lauríer 乐而雅 F系列敏感肌超长夜用护翼型卫生巾 40厘米 7片".decode("utf-8")  # 8
str_3 = "Lauríer 乐而雅 花王S系列超薄日用护翼卫生巾 25厘米 19片".decode("utf-8")  # 10
str_4 = "Lauríer 乐而雅 S系列 超薄瞬吸量多夜用卫生巾 30厘米 15片/包".decode("utf-8")  # 8
str_5 = "Lauríer 乐而雅 花王F系列敏感肌超量日用护翼型卫生巾 25厘米 18片".decode("utf-8")  # 10

print(difflib.SequenceMatcher(None, query_str, str_2).quick_ratio())
print(difflib.SequenceMatcher(None, query_str, str_3).quick_ratio())
print(difflib.SequenceMatcher(None, query_str, str_4).quick_ratio())
print(difflib.SequenceMatcher(None, query_str, str_5).quick_ratio())
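quick_ratio() is faster than ratio(), it seems; according to the difflib documentation it returns an upper bound on ratio() rather than the exact value.
The library and its functions are convenient to use, but on closer inspection the results were: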
0.560975609756
0.691358024691
0.571428571429
0.658823529412
The results say the most similar string is "Lauríer 乐而雅 花王S系列超薄日用护翼卫生巾 25厘米 19片", which is clearly wrong: the query is F series and this one is S series.
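One thing that may help explain the ranking above: in CPython's difflib, quick_ratio() counts the characters two strings have in common without regard to their order, which is why it is only an upper bound on ratio(). The minimal sketch below illustrates the gap. It is written for Python 3 (unlike the Python 2 listings in this post), and the helper name compare_ratios and the short ASCII strings are made up for illustration.

```python
import difflib


def compare_ratios(a, b):
    """Return (ratio, quick_ratio) for two strings."""
    matcher = difflib.SequenceMatcher(None, a, b)
    return matcher.ratio(), matcher.quick_ratio()


# Same characters in reversed order: the order-insensitive upper bound
# is perfect, while the order-aware ratio is much lower.
exact, upper = compare_ratios("abcd", "dcba")
print(exact, upper)
```

So for ranking product titles that share most of their characters, ratio() may be the more discriminating of the two, at the cost of some speed.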
So I tried the Levenshtein library:
import Levenshtein

print(Levenshtein.ratio(query_str, str_2))
print(Levenshtein.ratio(query_str, str_3))
print(Levenshtein.ratio(query_str, str_4))
print(Levenshtein.ratio(query_str, str_5))
Result:
0.487804878049
0.543209876543
0.404761904762
0.541176470588
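The second candidate still scores highest. What the heck, does the program know sanitary pads better than I do? Fine, my knowledge of sanitary pads is zero, so that wouldn't be surprising. But purely as strings, the last candidate should be the most similar no matter how I look at it. So I started thinking about how to improve the accuracy. I don't understand fancy things like machine learning, so I had to come up with something myself.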
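For reference, the quantity behind Levenshtein-style scores is the edit distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. Below is a minimal pure-Python sketch of the classic dynamic-programming version (Python 3). It is not the Levenshtein library's code, and normalized_similarity is a hypothetical normalization chosen for illustration that need not match what Levenshtein.ratio() returns.

```python
def levenshtein_distance(a, b):
    """Classic edit distance; insertion, deletion, substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Hypothetical normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))
```

Because every character counts equally, a one-letter difference such as "F" versus "S" moves the score very little in a long product title.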
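My first idea: tokenize the query, then check how many of the resulting tokens appear in each candidate string. The more tokens appear, the more similar I consider the candidate.
This is where the jieba (结巴) tokenizer comes in: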
# -*- coding: utf-8 -*-
# Python 2 code.
import jieba

# jieba.load_userdict("dict.txt")  # a custom dictionary can be loaded here

query_str = "美妆 日本 Laurier花王乐而雅F系列日用护翼卫生巾(25cm*17片)瞬吸干爽超薄棉柔".decode("utf-8")
str_2 = "Lauríer 乐而雅 F系列敏感肌超长夜用护翼型卫生巾 40厘米 7片".decode("utf-8")  # 8
str_3 = "Lauríer 乐而雅 花王S系列超薄日用护翼卫生巾 25厘米 19片".decode("utf-8")  # 10
str_4 = "Lauríer 乐而雅 S系列 超薄瞬吸量多夜用卫生巾 30厘米 15片/包".decode("utf-8")  # 8
str_5 = "Lauríer 乐而雅 花王F系列敏感肌超量日用护翼型卫生巾 25厘米 18片".decode("utf-8")  # 10
str_list = [str_2, str_3, str_4, str_5]

seg_list = list(jieba.cut(query_str.strip()))  # default: accurate mode

# Count how many query tokens occur in each candidate string.
dict_data = {}
sarticiple = list(seg_list)
for strs in str_list:
    num = 0
    for sart in sarticiple:
        if sart in strs:
            num = num + 1
    dict_data[strs] = num

for k, v in dict_data.items():
    print(k)
    print(v)
    print("\n")
Result:
Lauríer 乐而雅 花王S系列超薄日用护翼卫生巾 25厘米 19片
10
Lauríer 乐而雅 花王F系列敏感肌超量日用护翼型卫生巾 25厘米 18片
10
Lauríer 乐而雅 S系列 超薄瞬吸量多夜用卫生巾 30厘米 15片/包
8
Lauríer 乐而雅 F系列敏感肌超长夜用护翼型卫生巾 40厘米 7片
8
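It turns out the S-series string, which scored highest on similarity, also has the highest match count, 10 (tied with the F-series string). From this I'd guess that similarity algorithms work at least partly like my approach: tokenize and count how many tokens match. Could this be one component of string similarity algorithms?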
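A note on that guess: difflib and Levenshtein compare character sequences rather than tokens, so token counting is a separate signal from the library scores. The counting step itself can be sketched as below (Python 3). The token list is hypothetical and stands in for jieba's output, whose actual segmentation may differ, and count_token_matches is a name made up for this sketch.

```python
def count_token_matches(tokens, candidates):
    """Map each candidate to the number of non-blank query tokens it contains."""
    counts = {}
    for candidate in candidates:
        counts[candidate] = sum(
            1 for token in tokens if token.strip() and token in candidate
        )
    return counts


# Hypothetical tokens; real jieba output may be segmented differently.
tokens = ["乐而雅", "F", "系列", "日用", "护翼", "卫生巾"]
candidates = [
    "Lauríer 乐而雅 花王S系列超薄日用护翼卫生巾 25厘米 19片",
    "Lauríer 乐而雅 花王F系列敏感肌超量日用护翼型卫生巾 25厘米 18片",
]
for candidate, count in count_token_matches(tokens, candidates).items():
    print(count, candidate)
```

With this particular token list, the series letter is the only token that separates the two candidates, which suggests how much the outcome depends on how the tokenizer splits the title.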
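That still didn't improve the accuracy, so what next? I wondered whether the lengths of the query string and the candidate string might be related: two identical strings necessarily have the same length, so maybe similar strings also have closer lengths. So I take the query length minus the median of the query length and the candidate length, take the absolute value, and use that as an adjustment to correct the bias in the similarity score. Because the difflib scores here differ from one another by only a few hundredths, I multiply the adjustment by 0.017 so its influence stays small and doesn't get out of hand. For equal lengths the adjustment is simply 0. Don't ask why, I made it up.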
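The adjustment described above can be worked through in a few lines. This is a minimal sketch in Python 3 that uses statistics.median from the standard library instead of numpy; the function names are made up, and subtracting the adjustment from the raw score is an assumption about how it is applied.

```python
from statistics import median

SCALE = 0.017  # scaling factor from the text


def length_adjustment(query, candidate, scale=SCALE):
    """|len(query) - median(len(query), len(candidate))| * scale."""
    mid = median([len(query), len(candidate)])
    return abs(len(query) - mid) * scale


def adjusted_score(raw_score, query, candidate):
    """Assumption: the adjustment is subtracted from the library score."""
    return raw_score - length_adjustment(query, candidate)


# Lengths 10 and 20: the median is 15, so the adjustment is 5 * 0.017.
print(length_adjustment("a" * 10, "b" * 20))
```

Since the median of two numbers is their mean, the adjustment amounts to half the length difference times 0.017, so a 10-character gap costs about 0.085 points of similarity.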
Here's the code:
# -*- coding: utf-8 -*-
# Python 2 code.
import jieba
# import Levenshtein
import difflib
import numpy as np

# jieba.load_userdict("dict.txt")


class StrSimilarity():

    def __init__(self, word):
        self.word = word

    # Compared: str_list is the list of candidate strings.
    # Returns a dict mapping each candidate to the number of query tokens it contains.
    def Compared(self, str_list):
        dict_data = {}
        sarticiple = list(jieba.cut(self.word.strip()))
        for strs in str_list:
            num = 0
            for sart in sarticiple:
                if sart in strs:
                    num = num + 1
            dict_data[strs] = num
        return dict_data

    # NumChecks: dict_data is the return value of Compared.
    # Returns a dict with the two candidates that have the highest match counts.
    def NumChecks(self, dict_data):
        list_data = sorted(dict_data.iteritems(), key=lambda asd: asd[1], reverse=True)
        json_data = {}
        # Slicing also covers the case of fewer than two candidates.
        for data in list_data[:2]:
            json_data[data[0]] = data[1]
        return json_data

    # MMedian: dict_data is the return value of NumChecks.
    # Returns a dict mapping each candidate to its length-based adjustment.
    def MMedian(self, dict_data):
        median_list = {}
        length = len(self.word)
        for k, v in dict_data.items():
            num = np.median([len(k), length])
            if abs(length - num) != 0:
                # xx = (1.0/(abs(length-num)))*0.1
                xx = (abs(length - num)) * 0.017
            else:
                xx = 0
            median_list[k] = xx
        return median_list

    # Appear: dict_data is the return value of MMedian.
    # Returns the most similar string.
    def Appear(self, dict_data):
        json_data = {}
        for k, v in dict_data.items():
            fraction = difflib.SequenceMatcher(None, self.word, k).quick_ratio() - v
            # fraction = difflib.SequenceMatcher(None, self.word, k).quick_ratio() + v
            json_data[k] = fraction
        tulp_data = sorted(json_data.iteritems(), key=lambda asd: asd[1], reverse=True)
        return tulp_data[0][0]


def main():
    query_str = "美妆 日本 Laurier花王乐而雅F系列日用护翼卫生巾(25cm*17片)瞬吸干爽超薄棉柔".decode("utf-8")
    str_2 = "Lauríer 乐而雅 F系列敏感肌超长夜用护翼型卫生巾 40厘米 7片".decode("utf-8")  # 8
    str_3 = "Lauríer 乐而雅 花王S系列超薄日用护翼卫生巾 25厘米 19片".decode("utf-8")  # 10
    str_4 = "Lauríer 乐而雅 S系列 超薄瞬吸量多夜用卫生巾 30厘米 15片/包".decode("utf-8")  # 8
    str_5 = "Lauríer 乐而雅 花王F系列敏感肌超量日用护翼型卫生巾 25厘米 18片".decode("utf-8")  # 10
    str_list = [str_2, str_3, str_4, str_5]

    # query_str = "美妆 韩国 SNP燕窝面膜 深层补水保湿提亮美白水库 25ml*10 包邮".decode("utf-8")
    # str_6 = "LEADERS 丽得姿 领先润美强化补水面膜 25毫升/片 10片/盒".decode("utf-8")
    # str_7 = "SNP 海洋燕窝高倍补水美白面膜 25毫升/片10片装".decode("utf-8")
    # str_8 = "SNP 钻石美白提亮安瓶精华面膜 25毫升/片 10片装".decode("utf-8")
    # str_list = [str_6, str_7, str_8]

    # query_str = "美妆 丽得姿(Leaders)水库面膜超强补水(25ml*10片)".decode("utf-8")
    # str_2 = "LEADERS 丽得姿 领先润美强化补水面膜 25毫升/片 10片/盒".decode("utf-8")
    # str_3 = "LEADERS 丽得姿 美帝优超强补水面膜 25毫升/片 10片装".decode("utf-8")
    # str_4 = "2件装 | 丽得姿 美帝优蜗牛原液面膜 10片装+领先润美强化补水面膜 10片装".decode("utf-8")
    # str_5 = "LEADERS 丽得姿 领先润美祛痘净肤修复面膜 25毫升/片 10片装".decode("utf-8")
    # str_list = [str_2,str_3,str_4,str_5]

    ss = StrSimilarity(query_str)
    list_data = ss.Compared(str_list)
    num = ss.NumChecks(list_data)
    mmedian = ss.MMedian(num)
    print(ss.Appear(mmedian))


if __name__ == "__main__":
    main()
Result:
Lauríer 乐而雅 花王F系列敏感肌超量日用护翼型卫生巾 25厘米 18片
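The code may well have bugs: I had no time to test it and there's no exception handling at all (a bad habit, which is why I've always been a scrub). From idea to finished blog post took me a bit over two hours, it's almost 3 a.m., and I don't want to test, don't want to test, don't want to test.
If you hit a bug using the code above, please tell me so we can both improve. And if you think my made-up similarity method is flawed or unreasonable, definitely tell me, because I really don't know the theory behind similarity algorithms and need to learn from others.
Today I wrote another version with a different approach and switched to the Levenshtein library, though the results are about the same as above.
Code: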
# -*- coding: utf-8 -*-
# Python 2 code.
import jieba
import Levenshtein
import re
import difflib
import numpy as np

jieba.load_userdict("dict.txt")


class StrSimilarity():

    def __init__(self, word):
        # word = word.replace("/", " ").replace("*".decode("utf-8"), " ")
        # word = word.replace("(".decode("utf-8"), " ").replace(")".decode("utf-8"), " ").replace("、".decode("utf-8")," ")
        # word = word.replace("cm", " ").replace("ml", " ").replace("毫升".decode("utf-8"), " ").replace("片".decode("utf-8"), " ")
        # word = word.replace("克".decode("utf-8"), "").replace("杯".decode("utf-8"), "").replace("袋".decode("utf-8"), "").replace("g", "")
        self.word = word

    # Compared: str_list is the list of candidate strings.
    # Returns a dict mapping each candidate to the number of query tokens it contains.
    def Compared(self, str_list):
        dict_data = {}
        sarticiple = list(jieba.cut(self.word.strip()))
        for strs in str_list:
            num = 0
            for sart in sarticiple:
                # Skip tokens that are only whitespace.
                if sart.strip():
                    if sart in strs:
                        if re.match('^[0-9a-zA-Z]+.*', sart):
                            num = num + 1
                        else:
                            # The original also had a bare `num+num` expression here,
                            # which has no effect, so both branches add 1.
                            num = num + 1
            dict_data[strs] = num
        return dict_data

    # NumChecks: dict_data is the return value of Compared.
    # Returns a dict with the two candidates that have the highest match counts, each with
    # an adjustment value: the higher the match count, the lower the value
    # (so that it can be subtracted from the similarity later).
    def NumChecks(self, dict_data):
        list_data = sorted(dict_data.iteritems(), key=lambda asd: asd[1], reverse=True)
        json_data = {}
        # Slicing also covers the case of fewer than two candidates.
        for data in list_data[:2]:
            if data[1] != 0:
                json_data[data[0]] = (1.0 / (data[1])) * 0.01
            else:
                json_data[data[0]] = 0
        return json_data

    # MMedian: dict_data is the return value of NumChecks.
    # Returns a dict mapping each candidate to its adjustment: the query length minus
    # the median of the query length and the candidate length, as an absolute value.
    # Note: v (the adjustment from NumChecks) is not used here.
    def MMedian(self, dict_data):
        median_list = {}
        length = len(self.word)
        for k, v in dict_data.items():
            num = np.median([len(k), length])
            if abs(length - num) != 0:
                # xx = (1.0/(abs(length-num)))*0.1
                xx = (abs(length - num)) * 0.017
            else:
                xx = 0
            median_list[k] = xx
        return median_list

    # Appear: dict_data is the return value of MMedian.
    # Returns the most similar string.
    def Appear(self, dict_data):
        json_data = {}
        for k, v in dict_data.items():
            # fraction = difflib.SequenceMatcher(None, self.word, k).quick_ratio()-v
            fraction = Levenshtein.ratio(self.word, k) - v
            json_data[k] = fraction
        tulp_data = sorted(json_data.iteritems(), key=lambda asd: asd[1], reverse=True)
        return tulp_data[0][0]


def main():
    query_str = "美妆 日本 Laurier花王乐而雅F系列日用护翼卫生巾(25cm*17片)瞬吸干爽超薄棉柔".decode("utf-8")
    str_2 = "Lauríer 乐而雅 F系列敏感肌超长夜用护翼型卫生巾 40厘米 7片".decode("utf-8")  # 8
    str_3 = "Lauríer 乐而雅 花王S系列超薄日用护翼卫生巾 25厘米 19片".decode("utf-8")  # 10
    str_4 = "Lauríer 乐而雅 S系列 超薄瞬吸量多夜用卫生巾 30厘米 15片/包".decode("utf-8")  # 8
    str_5 = "Lauríer 乐而雅 花王F系列敏感肌超量日用护翼型卫生巾 25厘米 18片".decode("utf-8")  # 10
    str_list = [str_2, str_3, str_4, str_5]

    # query_str = "美妆 韩国 SNP燕窝面膜 深层补水保湿提亮美白水库 25ml*10 包邮".decode("utf-8")
    # str_6 = "LEADERS 丽得姿 领先润美强化补水面膜 25毫升/片 10片/盒".decode("utf-8")
    # str_7 = "SNP 海洋燕窝高倍补水美白面膜 25毫升/片10片装".decode("utf-8")
    # str_8 = "SNP 钻石美白提亮安瓶精华面膜 25毫升/片 10片装".decode("utf-8")
    # str_list = [str_6, str_7, str_8]

    ss = StrSimilarity(query_str)
    list_data = ss.Compared(str_list)
    num = ss.NumChecks(list_data)
    mmedian = ss.MMedian(num)
    print(ss.Appear(mmedian))


if __name__ == "__main__":
    main()
Result:
Lauríer 乐而雅 花王F系列敏感肌超量日用护翼型卫生巾 25厘米 18片
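No matter how I tweak it, the accuracy only improves a little; most of the weight still lies in the scores from the two string similarity libraries.
To guard against wrong matches, some companies also use manual review. In that case the program can't return just the one string it considers most similar; it has to return a list sorted by descending similarity, which makes manual review easier.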
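Stripped of the tokenizing and length adjustment, the idea of returning a ranked list for a human reviewer can be sketched in a few lines. This is a minimal Python 3 illustration using only difflib; rank_candidates and the sample strings are made up, and it uses ratio() rather than quick_ratio() as a matter of choice.

```python
import difflib


def rank_candidates(query, candidates):
    """Return (candidate, score) pairs sorted by descending similarity."""
    scored = [
        (candidate, difflib.SequenceMatcher(None, query, candidate).ratio())
        for candidate in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for candidate, score in rank_candidates(
    "apple pie", ["banana", "apple tart", "apple pie"]
):
    print(round(score, 3), candidate)
```

Returning the scores along with the strings lets the reviewer see how close the top candidates are to each other, which is often where the mistakes hide.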
# -*- coding: utf-8 -*-
# Python 2 code.
import jieba
# import Levenshtein
import difflib
import numpy as np

# jieba.load_userdict("dict.txt")


class StrSimilarity():

    def __init__(self, word):
        self.word = word

    # Compared: str_list is the list of candidate strings.
    # Returns a dict mapping each candidate to the number of query tokens it contains.
    def Compared(self, str_list):
        dict_data = {}
        sarticiple = list(jieba.cut(self.word.strip()))
        for strs in str_list:
            num = 0
            for sart in sarticiple:
                if sart in strs:
                    num = num + 1
            dict_data[strs] = num
        return dict_data

    # NumChecks: dict_data is the return value of Compared.
    # Returns a dict of all candidates and their match counts
    # (the top-two cut from the earlier version is commented out).
    def NumChecks(self, dict_data):
        list_data = sorted(dict_data.iteritems(), key=lambda asd: asd[1], reverse=True)
        # length = len(list_data)
        json_data = {}
        # if length>=2:
        #     datas = list_data[:2]
        # else:
        #     datas = list_data[:length]
        for data in list_data:
            json_data[data[0]] = data[1]
        return json_data

    # MMedian: dict_data is the return value of NumChecks.
    # Returns a dict mapping each candidate to its length-based adjustment.
    def MMedian(self, dict_data):
        median_list = {}
        length = len(self.word)
        for k, v in dict_data.items():
            num = np.median([len(k), length])
            if abs(length - num) != 0:
                # xx = (1.0/(abs(length-num)))*0.1
                xx = (abs(length - num)) * 0.017
            else:
                xx = 0
            median_list[k] = xx
        return median_list

    # Appear: dict_data is the return value of MMedian.
    # Returns a list of (candidate, score) pairs sorted by descending similarity.
    def Appear(self, dict_data):
        json_data = {}
        for k, v in dict_data.items():
            fraction = difflib.SequenceMatcher(None, self.word, k).quick_ratio() - v
            # fraction = difflib.SequenceMatcher(None, self.word, k).quick_ratio() + v
            json_data[k] = fraction
        tulp_data = sorted(json_data.iteritems(), key=lambda asd: asd[1], reverse=True)
        return tulp_data


def main():
    # query_str = "美妆 日本 Laurier花王乐而雅F系列日用护翼卫生巾(25cm*17片)瞬吸干爽超薄棉柔".decode("utf-8")
    # str_2 = "Lauríer 乐而雅 F系列敏感肌超长夜用护翼型卫生巾 40厘米 7片".decode("utf-8")#8
    # str_3 = "Lauríer 乐而雅 花王S系列超薄日用护翼卫生巾 25厘米 19片".decode("utf-8")#10
    # str_4 = "Lauríer 乐而雅 S系列 超薄瞬吸量多夜用卫生巾 30厘米 15片/包".decode("utf-8")#8
    # str_5 = "Lauríer 乐而雅 花王F系列敏感肌超量日用护翼型卫生巾 25厘米 18片".decode("utf-8")#10
    # str_list=[str_2,str_3,str_4,str_5]

    query_str = "美妆 韩国 SNP燕窝面膜 深层补水保湿提亮美白水库 25ml*10".decode("utf-8")
    str_6 = "LEADERS 丽得姿 领先润美强化补水面膜 25毫升/片 10片/盒".decode("utf-8")
    str_7 = "SNP 海洋燕窝高倍补水美白面膜 25毫升/片10片装".decode("utf-8")
    str_8 = "SNP 钻石美白提亮安瓶精华面膜 25毫升/片 10片装".decode("utf-8")
    str_list = [str_6, str_7, str_8]

    # query_str = "美妆 丽得姿(Leaders)水库面膜超强补水 25ml*10片".decode("utf-8")
    # str_2 = "LEADERS 丽得姿 领先润美强化补水面膜 25毫升/片 10片/盒".decode("utf-8")
    # str_3 = "LEADERS 丽得姿 美帝优超强补水面膜 25毫升/片 10片装".decode("utf-8")
    # str_4 = "2件装 | 丽得姿 美帝优蜗牛原液面膜 10片装+领先润美强化补水面膜 10片装".decode("utf-8")
    # str_5 = "LEADERS 丽得姿 领先润美祛痘净肤修复面膜 25毫升/片 10片装".decode("utf-8")
    # str_list = [str_2,str_3,str_4,str_5]

    # query_str = "科沃斯(Ecovacs) CEN540 扫地机器人 地宝魔镜S".decode("utf-8")
    # str2 = "(Ecovacs)科沃斯 CEN540 扫地机器人 地宝魔镜S".decode("utf-8")
    # str_list = [str2]

    # query_str = "母婴 韩国本土好奇(HUGGIES)纸尿裤NATURE MADE 3段M,男,52片/包 *3".decode("utf-8")
    # str1 = "韩国金装好奇huggies超柔透气纸尿裤XXL码6段28片拉拉裤16魔术版".decode("utf-8")
    # str2 = "HUGGIES好奇金装纸尿裤宝宝尿不湿婴儿彩箱L104片比L129L72L40L36".decode("utf-8")
    # str3 = "Huggies好奇金装超柔贴身婴儿纸尿裤 宝宝尿不湿 小号S120片箱装".decode("utf-8")
    # str4 = "特!好奇Huggies 银装 婴儿纸尿裤 中号M160片7 - 11kg【京东送货】".decode("utf-8")
    # str5 = "现货德国进口欧版Huggies好奇婴儿宝宝防水游泳纸尿裤中码单片M号".decode("utf-8")
    # str6 = "韩国本土huggies好奇纯净版纸尿裤L码4段42片男宝原装进口尿不湿".decode("utf-8")
    # str7 = "好奇(Huggies) 银装 加大号XL104片(12 - 16kg) 干爽舒适 婴儿纸尿裤".decode("utf-8")
    # str8 = "现货 美国Huggies好奇婴儿宝宝游泳裤 防水纸尿裤 L码 大号 整包".decode("utf-8")
    # str9 = "HUGGIES / 好奇铂金装倍柔亲肤纸尿裤初生号NB66 + 10片 / NB76新生儿".decode("utf-8")
    # str10 = "【进口正品】好奇铂金装婴儿纸尿裤大号L58片 * 2包 透气干爽尿不湿".decode("utf-8")
    # str11 = "【进口正品】好奇铂金装婴儿纸尿裤XL44片 * 2包 透气干爽尿不湿".decode("utf-8")
    # str12 = "HUGGIES / 好奇金装超柔贴身纸尿裤中号M50 + 4片 好奇M号M54".decode("utf-8")
    # str13 = "Huggies好奇铂金装倍柔亲肤纸尿裤大号尿不湿L36 + 16 片 / L52片".decode("utf-8")
    # str14 = "现货欧洲Huggies好奇婴儿宝宝游泳裤防水纸尿裤L大码单片12 - 18kg".decode("utf-8")
    # str15 = "现货美国Huggies好奇婴儿宝宝游泳裤防水纸尿裤大L码单片14kg + ".decode("utf-8")
    # str16 = "好奇(Huggies) 铂金装倍柔亲肤纸尿裤中号M72原装进口".decode("utf-8")
    # str17 = "好奇S号 Huggies金装超柔纸尿裤新生儿小号S60 + 12片 / S72片 尿不湿".decode("utf-8")
    # str18 = "Huggies / 好奇纸尿裤L好奇金装超柔透气纸尿裤L72片 大码尿不湿".decode("utf-8")
    # str19 = "huggies好奇l铂金装纸尿裤l尿不湿l号铂金宝宝婴儿韩国原装进口l".decode("utf-8")
    # str20 = "Huggies好奇铂金装纸尿裤S58 + 12 片婴儿宝宝纸尿裤新生儿尿不湿".decode("utf-8")
    # str21 = "韩国本土Huggies好奇l纸尿裤l码尿不湿l50包邮清仓批发超薄韩版".decode("utf-8")
    # str22 = "美国Huggies好奇婴儿宝宝游泳裤防水纸尿裤尿不湿M码单片11 - 15kg".decode("utf-8")
    # str_list = [str1, str2, str3, str4, str5, str6, str7, str8, str9, str10, str11, str12, str13, str14, str15, str16,
    #             str17, str18, str19, str20, str21, str22]

    ss = StrSimilarity(query_str)
    list_data = ss.Compared(str_list)
    num = ss.NumChecks(list_data)
    mmedian = ss.MMedian(num)
    print(ss.Appear(mmedian))


if __name__ == "__main__":
    main()
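Result:
The algorithm above still has plenty of problems. For one, product titles contain too much noise: search for a product on Taobao and you'll see that sellers pad their titles with lots of extra words to show up in more searches, words that have nothing to do with the product itself. For another, units appear in both Chinese and English: some titles say "毫升", some "ml", some "ML", and so on. To match the most similar string more accurately, these words need to be filtered out. TF-IDF has the concept of stop words, so I borrow it here and add stop-word filtering.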
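The two clean-up steps just described, dropping noise words and reconciling unit spellings, can be sketched on their own as follows (Python 3). This is an illustrative variant rather than the approach in the listing: the stop-word list is a small sample, the UNIT_ALIASES mapping and the name normalize_title are hypothetical, and lowercasing the whole title is an assumption that also changes brand names.

```python
import re

STOP_WORDS = ["包邮", "现货", "正品", "代购"]   # small sample of noise words
UNIT_ALIASES = {"毫升": "ml", "克": "g"}        # hypothetical unit mapping


def normalize_title(title):
    """Remove noise words and rewrite unit words to one lowercase spelling."""
    text = title
    for word in STOP_WORDS:
        text = text.replace(word, " ")
    text = text.lower()  # "ML" and "ml" become the same token
    for alias, unit in UNIT_ALIASES.items():
        # Only rewrite a unit word that directly follows a digit,
        # so "120克" becomes "120g" but "克星" is left alone.
        text = re.sub(r"(\d)\s*" + re.escape(alias), r"\g<1>" + unit, text)
    return re.sub(r"\s+", " ", text).strip()


print(normalize_title("SNP 面膜 25毫升 包邮"))
```

Applying the same normalization to both the query and every candidate before scoring keeps the comparison symmetric.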
# -*- coding: utf-8 -*-
# Python 2 code.
import jieba
# import Levenshtein
import difflib
import numpy as np

# jieba.load_userdict("dict.txt")

# Filter words. These only cover titles from a few sites; a large list could be kept in a file.
stopwords = ["【","】","[","]","直邮","包邮","保税","全球购","包税","(",")","原装","京东超市","马尾保税仓","发货","进口","+","/","-","亏本","特价","现货","天天特价","专柜","代购","预定","正品","旗舰店","#",
             "转卖","国内","柜台","无盒","保税仓","官方","店铺","爆款","、"]

# Stop words (unit words) added just for these examples; a large list could be kept in a file.
conversions = ["克","毫升"]


class StrSimilarity():

    def __init__(self, word):
        # word is now a list of query fields (byte strings), not a single string.
        self.word = word

    # Compared: str_list is the list of candidate strings.
    # Returns a dict mapping each candidate to its match count; a candidate is kept
    # only if it contains every query field that is checked.
    def Compared(self, str_list):
        dict_data = {}
        # sarticiple =list(jieba.cut(self.word.strip()))
        # Strip unit words from the size field.
        for cons in conversions:
            self.word[4] = self.word[4].replace(cons, "")
        sarticiple = [self.word[0], self.word[2], self.word[3], self.word[4], self.word[5], self.word[6], self.word[7]]
        for strs in str_list:
            # Fix: accumulate the replacements. The original restarted from strs on
            # every iteration, so only the last filter word was actually removed.
            strs_1 = strs
            for sws in stopwords:
                strs_1 = strs_1.replace(sws.decode("utf-8"), " ")
            num = 0
            for sart in sarticiple:
                sart = sart.decode("utf-8")
                counts = strs_1.count(sart)
                if counts != 0:
                    num = num + 1
                else:
                    # A missing field pushes the count below zero, which drops the candidate.
                    num = num - 100
            if num > 0:
                dict_data[strs] = num
        return dict_data

    # NumChecks: dict_data is the return value of Compared.
    # Returns a dict of all candidates and their match counts
    # (the top-two cut from the earlier version is commented out).
    def NumChecks(self, dict_data):
        list_data = sorted(dict_data.iteritems(), key=lambda asd: asd[1], reverse=True)
        # length = len(list_data)
        json_data = {}
        # if length>=2:
        #     datas = list_data[:2]
        # else:
        #     datas = list_data[:length]
        for data in list_data:
            json_data[data[0]] = data[1]
        return json_data

    # MMedian: dict_data is the return value of NumChecks.
    # Returns a dict mapping each candidate to its adjustment.
    # Note: self.word is a list here, so length is the number of query fields,
    # and Appear below no longer uses the adjustment.
    def MMedian(self, dict_data):
        median_list = {}
        length = len(self.word)
        for k, v in dict_data.items():
            num = np.median([len(k), length])
            if abs(length - num) != 0:
                # xx = (1.0/(abs(length-num)))*0.1
                xx = (abs(length - num)) * 0.017
            else:
                xx = 0
            median_list[k] = xx
        return median_list

    # Appear: dict_data is the return value of MMedian.
    # Returns a list of (candidate, score) pairs sorted by descending similarity.
    def Appear(self, dict_data):
        json_data = {}
        # Fix: decode the joined query so both sides of the comparison are unicode
        # (assumes the query fields are UTF-8 byte strings, as in main below).
        word = " ".join(self.word).decode("utf-8")
        for k, v in dict_data.items():
            fraction = difflib.SequenceMatcher(None, word, k).quick_ratio()
            # fraction = difflib.SequenceMatcher(None, self.word, k).quick_ratio() + v
            json_data[k] = fraction
        tulp_data = sorted(json_data.iteritems(), key=lambda asd: asd[1], reverse=True)
        return tulp_data


def main():
    # query_str = "美妆 日本 Laurier花王乐而雅F系列日用护翼卫生巾(25cm*17片)瞬吸干爽超薄棉柔".decode("utf-8")
    # str_2 = "Lauríer 乐而雅 F系列敏感肌超长夜用护翼型卫生巾 40厘米 7片".decode("utf-8")#8
    # str_3 = "Lauríer 乐而雅 花王S系列超薄日用护翼卫生巾 25厘米 19片".decode("utf-8")#10
    # str_4 = "Lauríer 乐而雅 S系列 超薄瞬吸量多夜用卫生巾 30厘米 15片/包".decode("utf-8")#8
    # str_5 = "Lauríer 乐而雅 花王F系列敏感肌超量日用护翼型卫生巾 25厘米 18片".decode("utf-8")#10
    # str_list=[str_2,str_3,str_4,str_5]

    # query_str = "美妆 韩国 SNP燕窝面膜 深层补水保湿提亮美白水库 25ml*10".decode("utf-8")
    # str_6 = "LEADERS 丽得姿 领先润美强化补水面膜 25毫升/片 10片/盒".decode("utf-8")
    # str_7 = "SNP 海洋燕窝高倍补水美白面膜 25毫升/片10片装".decode("utf-8")
    # str_8 = "SNP 钻石美白提亮安瓶精华面膜 25毫升/片 10片装".decode("utf-8")
    # str_list = [str_6, str_7, str_8]

    # query_str = "美妆 丽得姿(Leaders)水库面膜超强补水 25ml*10片".decode("utf-8")
    # str_2 = "LEADERS 丽得姿 领先润美强化补水面膜 25毫升/片 10片/盒".decode("utf-8")
    # str_3 = "LEADERS 丽得姿 美帝优超强补水面膜 25毫升/片 10片装".decode("utf-8")
    # str_4 = "2件装 | 丽得姿 美帝优蜗牛原液面膜 10片装+领先润美强化补水面膜 10片装".decode("utf-8")
    # str_5 = "LEADERS 丽得姿 领先润美祛痘净肤修复面膜 25毫升/片 10片装".decode("utf-8")
    # str_list = [str_2,str_3,str_4,str_5]

    # query_str = ["科沃斯","Ecovacs", "CEN540", "扫地机器人", "魔镜","S"]#.decode("utf-8")
    # str2 = "(Ecovacs)科沃斯 CEN540 扫地机器人 地宝魔镜N".decode("utf-8")
    # str3 = "(Ecovacs)科沃斯 CEN540 扫地机器人 地宝 S".decode("utf-8")
    # str4 = "科沃斯地宝灵犀扫地机器人家用智能吸尘器魔镜S升级版CEN540/546".decode("utf-8")#1299
    # str5 = "科沃斯地宝魔镜S扫地机器人吸尘器智能家用超薄全自动擦地机拖地".decode("utf-8")#1099
    # str6 = "科沃斯/Ecovacs 地宝魔戒 CEN550、魔镜S CEN540-LG扫地机器人".decode("utf-8")#868
    # str7 = "ECOVACS科沃斯/魔镜S 地宝灵犀 智能湿拖扫地机器人CEN540吸尘器".decode("utf-8")#849
    # str8 = "科沃斯/Ecovacs 地宝魔戒 CEN550 扫地机器人 魔镜S(CEN540-LG)".decode("utf-8")#888
    # str9 = "科沃斯(Ecovacs) CEN540 扫地机器人 地宝魔镜S 全自动充电".decode("utf-8")#959
    # str10 = "科沃斯(Ecovacs)地宝魔镜S(CEN540-LG)扫地机器人智能拖地".decode("utf-8")#1000
    # str_list = [str2,str3,str4,str5,str6,str7,str8,str9,str10]

    # query_str = ["花王","Merries", "纸尿裤", "妙而舒","L", "54片","",""]#.decode("utf-8")
    # str2 = "[店铺爆款]Kao/Merries 日本花王妙而舒新版纸尿裤大号(L)54 官方".decode("utf-8")
    # str3 = "【红孩子母婴】花王(Merries)妙而舒纸尿裤大号尿不湿L54片纸尿裤".decode("utf-8")
    # str4 = "日本 花王KAO 妙而舒 merries纸尿裤尿不湿 L码54片 保税发".decode("utf-8")#1299
    # str5 = "正品行货日本花王 Merries 妙而舒 纸尿裤 (L)54片".decode("utf-8")#1099
    # str6 = "日本本土正品花王 Merries 妙而舒 尿不湿纸尿裤 大号(L)54片".decode("utf-8")#868
    # str7 = "日本原装进口花王纸尿裤Merries妙而舒宝宝尿不湿 L码54片".decode("utf-8")#849
    # str8 = "香港代购 正品日本花王 Merries 妙而舒 纸尿裤 大号(L)54片".decode("utf-8")#888
    # str9 = "2包包邮 Merries 妙而舒 日本本土花王 尿不湿/纸尿裤 L码 54片".decode("utf-8")#959
    # str10 = "日本花王 Merries 妙而舒纸尿裤 尿布尿片 大号L 54 片".decode("utf-8")#1000
    # str_list = [str2,str3,str4,str5,str6,str7,str8,str9,str10]

    # query_str = ["伊思","It's skin", "韩国", "晶钻红参蜗牛面膜","25ml", "5片","",""]#.decode("utf-8")
    # str2 = "韩国its skin伊思晶钻蜗牛原液面膜贴 胶原蛋白祛痘净化修复面膜".decode("utf-8")
    # str3 = "韩国its skin伊思晶钻蜗牛原液面膜贴 胶原蛋白祛痘净化修复面膜".decode("utf-8")
    # str4 = "格润丝韩国蜗牛面膜原液多补水去皱嫩肤滋润保湿20片正品包邮".decode("utf-8")#1299
    # str5 = "韩国伊思(It's skin)晶钻红参蜗牛面膜(25ml*5片)".decode("utf-8")#1099
    # str6 = "韩国专柜正品 伊思it's skin晶钻红参蜗牛精华面膜贴25ml 5片/盒".decode("utf-8")#868

    query_str = ["诗留美屋", "Rosette", "海泥洁面乳洗面奶", "粉色白泥", "120克", "", "", ""]  # .decode("utf-8")
    str2 = "日本代购Rosette 露姬婷海泥洗面奶去角质黑头控油 祛痘 保湿洁净".decode("utf-8")
    str3 = "日本Rosette蓝色海泥洁面膏洗面奶 清洁毛孔痘痘肌克星祛痘控油".decode("utf-8")
    str4 = "日本rosette|露姬婷 海泥洁面膏洗面奶女深层清洁绿色120g 去角质".decode("utf-8")  # 1299
    str5 = "日本诗留美屋Rosette 无添加海泥洁面乳洗面奶(粉色白泥)120克".decode("utf-8")  # 1099
    str6 = "诗留美屋(Rosette)无添加海泥洁面乳洗面奶(粉色白泥)120克".decode("utf-8")  # 868
    str7 = "日本诗留美屋Rosette海泥洁面乳洗面奶 粉色白泥 120g".decode("utf-8")  # 868
    str_list = [str2, str3, str4, str5, str6, str7]
    # http://m.xiaohongshu.com/search/keyword?q=诗留美屋 Rosette 海泥洁面乳洗面奶 粉色白泥 120克

    # query_str = ["自然晨露", "DEWYTREE", "蛇毒嫩肤竹炭黑面膜", "10片盒装", "", "", "", ""] # .decode("utf-8")
    # str2 = "日本代购Rosette 露姬婷海泥洗面奶去角质黑头控油 祛痘 保湿洁净".decode("utf-8")
    # str3 = "日本Rosette蓝色海泥洁面膏洗面奶 清洁毛孔痘痘肌克星祛痘控油".decode("utf-8")
    # str4 = "日本rosette|露姬婷 海泥洁面膏洗面奶女深层清洁绿色120g 去角质".decode("utf-8") # 1299
    # str5 = "日本诗留美屋Rosette 无添加海泥洁面乳洗面奶(粉色白泥)120克".decode("utf-8") # 1099
    # str6 = "诗留美屋(Rosette)无添加海泥洁面乳洗面奶(粉色白泥)120克".decode("utf-8") # 868
    # str7 = "日本诗留美屋Rosette海泥洁面乳洗面奶 粉色白泥 120g".decode("utf-8") # 868
    # str_list = [str2, str3, str4, str5, str6, str7]
    # "自然晨露 DEWYTREE 蛇毒嫩肤竹炭黑面膜 10片盒装"

    ss = StrSimilarity(query_str)
    list_data = ss.Compared(str_list)
    # print(list_data)
    num = ss.NumChecks(list_data)
    mmedian = ss.MMedian(num)
    print(ss.Appear(mmedian))


if __name__ == "__main__":
    main()
Result: