Python: String clustering with scikit-learn's DBSCAN, using Levenshtein distance as the metric

I have been trying to cluster multiple datasets of URLs (around 1 million each), to find, for each URL, the original and its typos. I decided to use the Levenshtein distance as a similarity metric, along with DBSCAN as the clustering algorithm, since k-means won't work because I do not know the number of clusters.

I am facing some problems using Scikit-learn's implementation of dbscan.

The snippet below works on small datasets in the format I am using, but because it precomputes the entire distance matrix, it takes O(n^2) space and time, which is far too much for my large datasets. I have run it for many hours, but it just ends up consuming all the memory on my PC.

import numpy as np
import sklearn.cluster
import distance  # the 'distance' package provides distance.levenshtein

# Negated pairwise Levenshtein distances used as a similarity matrix (O(n^2) in time and memory).
lev_similarity = -1 * np.array([[distance.levenshtein(w1[0], w2[0]) for w1 in words] for w2 in words])

dbscan = sklearn.cluster.DBSCAN(eps=7, min_samples=1)
dbscan.fit(lev_similarity)
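(For comparison, a minimal self-contained sketch of this precomputed approach is below; the words list and eps value are placeholders, and the 'distance' package is assumed for distance.levenshtein. When the matrix is precomputed, DBSCAN expects nonnegative distances together with metric='precomputed', but this variant still builds the full O(n^2) matrix, so it does not solve the scaling problem.)

import numpy as np
import distance  # assumed package providing distance.levenshtein
from sklearn.cluster import DBSCAN

words = ["example.com", "exmaple.com", "other.org"]  # placeholder URLs

# Plain (nonnegative) pairwise Levenshtein distances; still O(n^2) in time and memory.
lev_dist = np.array([[distance.levenshtein(w1, w2) for w1 in words] for w2 in words])

db = DBSCAN(eps=2, min_samples=1, metric="precomputed").fit(lev_dist)
print(db.labels_)  # cluster label for each entry in words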

So I figured I needed some way to compute the similarity on the fly, and hence tried this method:

dbscan = sklearn.cluster.DBSCAN(eps=7, min_samples=1, metric=distance.levenshtein)
dbscan.fit(words)

But this method ends up giving me an error:

ValueError: could not convert string to float: URL

I realize this means that it's trying to convert the inputs to the similarity function to floats, but I don't want it to do that.

As far as I understand, it just needs a function that can take two arguments and return a float value that it can then compare to eps, which the Levenshtein distance should do.

I am stuck at this point: I do not know the implementation details of sklearn's DBSCAN well enough to find out why it tries to convert the input to float, nor do I have a better idea of how to avoid the O(n^2) matrix computation.

Please let me know if there is a better or faster way to cluster this many strings that I may have overlooked.

Solution

From the scikit-learn FAQ, you can do this by making a custom metric that works on indices:

from leven import levenshtein
import numpy as np
from sklearn.cluster import dbscan

data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices
    return levenshtein(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)
dbscan(X, metric=lev_metric, eps=5, min_samples=2)
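Applied to the URL problem, the same index trick carries over directly; only the list of strings and the eps value change. In the sketch below, the urls list and eps=2 are placeholders, and the leven package is assumed to be installed. The function form of dbscan returns a tuple of core-sample indices and per-point labels:

from leven import levenshtein
import numpy as np
from sklearn.cluster import dbscan

# Placeholder URLs standing in for the real million-entry dataset.
urls = ["example.com/home", "exmaple.com/home", "example.com/hom", "other.org/index"]

def lev_metric(x, y):
    # x and y are one-element index vectors; look the actual strings up by index.
    i, j = int(x[0]), int(y[0])
    return levenshtein(urls[i], urls[j])

X = np.arange(len(urls)).reshape(-1, 1)
core_samples, labels = dbscan(X, metric=lev_metric, eps=2, min_samples=1)
print(labels)  # labels[i] is the cluster id assigned to urls[i]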
