Python: String clustering with scikit-learn's DBSCAN, using Levenshtein distance as the metric

I have been trying to cluster multiple datasets of URLs (around 1 million each), to find, for each URL, the original and its typos. I decided to use the Levenshtein distance as a similarity metric, along with DBSCAN as the clustering algorithm, since k-means won't work because I do not know the number of clusters.

I am facing some problems using Scikit-learn's implementation of dbscan.

The snippet below works on small datasets in the format I am using, but because it precomputes the entire distance matrix, it takes O(n^2) space and time, which is far too much for my large datasets. I have run it for many hours, but it just ends up consuming all the memory on my PC.

import numpy as np
import sklearn.cluster
import distance  # the 'distance' package provides distance.levenshtein

# Negated pairwise Levenshtein distances used as a similarity matrix (O(n^2) in time and memory).
lev_similarity = -1 * np.array([[distance.levenshtein(w1[0], w2[0]) for w1 in words] for w2 in words])

dbscan = sklearn.cluster.DBSCAN(eps=7, min_samples=1)
dbscan.fit(lev_similarity)
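(For comparison, a minimal self-contained sketch of this precomputed approach is below; the words list and eps value are placeholders, and the 'distance' package is assumed for distance.levenshtein. When the matrix is precomputed, DBSCAN expects nonnegative distances together with metric='precomputed', but this variant still builds the full O(n^2) matrix, so it does not solve the scaling problem.)

import numpy as np
import distance  # assumed package providing distance.levenshtein
from sklearn.cluster import DBSCAN

words = ["example.com", "exmaple.com", "other.org"]  # placeholder URLs

# Plain (nonnegative) pairwise Levenshtein distances; still O(n^2) in time and memory.
lev_dist = np.array([[distance.levenshtein(w1, w2) for w1 in words] for w2 in words])

db = DBSCAN(eps=2, min_samples=1, metric="precomputed").fit(lev_dist)
print(db.labels_)  # cluster label for each entry in words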

So I figured I needed some way to compute the similarity on the fly, and hence tried this method:

dbscan = sklearn.cluster.DBSCAN(eps=7, min_samples=1, metric=distance.levenshtein)
dbscan.fit(words)

But this method ends up giving me an error:

ValueError: could not convert string to float: URL

I realize this means that it's trying to convert the inputs to the similarity function to floats, but I don't want it to do that.

As far as I understand, it just needs a function that can take two arguments and return a float value that it can then compare to eps, which the Levenshtein distance should do.

I am stuck at this point: I do not know the implementation details of sklearn's DBSCAN well enough to find out why it tries to convert the input to float, nor do I have a better idea of how to avoid the O(n^2) matrix computation.

Please let me know if there is a better or faster way to cluster this many strings that I may have overlooked.

Solution

From the scikit-learn FAQ, you can do this by making a custom metric that works on indices:

from leven import levenshtein
import numpy as np
from sklearn.cluster import dbscan

data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])  # extract indices
    return levenshtein(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)
dbscan(X, metric=lev_metric, eps=5, min_samples=2)
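Applied to the URL problem, the same index trick carries over directly; only the list of strings and the eps value change. In the sketch below, the urls list and eps=2 are placeholders, and the leven package is assumed to be installed. The function form of dbscan returns a tuple of core-sample indices and per-point labels:

from leven import levenshtein
import numpy as np
from sklearn.cluster import dbscan

# Placeholder URLs standing in for the real million-entry dataset.
urls = ["example.com/home", "exmaple.com/home", "example.com/hom", "other.org/index"]

def lev_metric(x, y):
    # x and y are one-element index vectors; look the actual strings up by index.
    i, j = int(x[0]), int(y[0])
    return levenshtein(urls[i], urls[j])

X = np.arange(len(urls)).reshape(-1, 1)
core_samples, labels = dbscan(X, metric=lev_metric, eps=2, min_samples=1)
print(labels)  # labels[i] is the cluster id assigned to urls[i]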
