DENCLUE: A Clustering Algorithm Based on Density Distribution Functions

Core Idea

Each data point exerts an influence over the data space through an influence function. These influences are additive, so together they form a surface over the space. Each local maximum of that surface is a density attractor, and the basin of attraction of each attractor forms one cluster.
- Influence function: here, the kernel used in kernel density estimation (KDE)
- Kernel density estimation (KDE): the density at a point is estimated by summing a kernel centered on every data point; see the formula below
- Attractor: plays the same role as the centroid in the K-means algorithm

Tip: while going through a lot of reference material, I found that the kernel density formula itself is widely misread. In particular, the bandwidth h is not necessarily a single fixed value, nor a handful of fixed values: in the general multivariate case there can be one bandwidth per feature (a bandwidth vector, i.e. a diagonal bandwidth matrix), and how the bandwidths are chosen naturally depends on the dataset itself. With a Gaussian kernel and a single bandwidth $h$, the kernel density estimate used throughout this post is

$$\hat{f}(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right), \qquad K(u) = \frac{1}{(2\pi)^{d/2}} \, e^{-\lVert u \rVert^{2}/2},$$

where $n$ is the number of points and $d$ the number of features.
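As a concrete illustration of the per-feature-bandwidth point above, here is a minimal KDE sketch; the function name and the sample data are purely illustrative:

```python
# Minimal Gaussian KDE sketch with a per-feature bandwidth vector h.
import numpy as np

def gaussian_kde(x, X, h):
    """Density estimate at point x, given data X of shape (n, d)
    and per-feature bandwidths h of shape (d,)."""
    n, d = X.shape
    u = (x - X) / h                       # scale each feature by its own bandwidth
    k = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return k.sum() / (n * np.prod(h))     # product-kernel normalization

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
print(gaussian_kde(np.zeros(2), X, h=np.array([0.5, 0.8])))
```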
Algorithm Principle

The influence of each data point can be modeled formally by a mathematical function that describes the point's influence within its neighborhood; this is called the influence function.

The overall density of the data space (the global density function) is modeled as the sum of the influence functions of all data points.

Clusters are then obtained by identifying density attractors, which are the local maxima of the global density function.

A point $x$ is density-attracted to a density attractor $x^*$ if there exists a sequence of points $x_0, x_1, \dots, x_k$ with $x_0 = x$ and $x_k = x^*$ such that for $0 < i \le k$ the gradient of the density function at $x_{i-1}$ points in the direction of $x_i$; in other words, $x_{i-1}$ changes fastest in the direction of $x_i$.
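For reference, here is how these definitions are usually written in the DENCLUE literature with a Gaussian influence function; $d(x,y)$ is the Euclidean distance, and the gradient is stated up to a constant factor $1/h^2$, which does not change the ascent direction:

$$f_{\mathrm{Gauss}}(x,y) = e^{-\frac{d(x,y)^2}{2h^2}}, \qquad f^D(x) = \sum_{i=1}^{n} f_{\mathrm{Gauss}}(x, x_i), \qquad \nabla f^D(x) \propto \sum_{i=1}^{n} (x_i - x)\, f_{\mathrm{Gauss}}(x, x_i)$$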
DENCLUE Algorithm Steps

1. First, compute the density at every point via kernel density estimation: for each point, the KDE formula above is evaluated against all of the remaining points (feature vectors).
2. Use hill climbing (gradient ascent) to find the local density maxima. Rather than a raw gradient step, the update used here (implemented in `_step` below) is the fixed-point iteration from DENCLUE 2.0, which moves the current point to the kernel-weighted mean of the data:

$$x^{(t+1)} = \frac{\sum_{i=1}^{n} K\!\left(\frac{x^{(t)} - x_i}{h}\right) x_i}{\sum_{i=1}^{n} K\!\left(\frac{x^{(t)} - x_i}{h}\right)}$$

The gradient formula itself was given above. The intuition: if two points belong to the same cluster, hill climbing brings them closer together. Picture a hill: whether the two points start on the same slope or on opposite slopes, gradient ascent shrinks the Euclidean distance between them; if they belong to different clusters, they climb away from each other instead. So after a given number of gradient-ascent iterations, if the distance between the two climbed points is below the convergence tolerance, we treat them as belonging to the same cluster. Note that the points produced by hill climbing do not actually exist in the dataset; what gets grouped into a cluster is still the original data points, so do not confuse the two. And since step 1 computed the density of every point, the highest-density point within each cluster is that cluster's centroid.
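A toy 1-D illustration of this behavior (the dataset of two Gaussian bumps and the bandwidth are made up): two starting points on opposite slopes of the same bump climb to the same attractor, so the distance between their climbed positions collapses to near zero, while a point from the other bump climbs away.

```python
# Toy illustration of hill climbing with the fixed-point update above.
import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 0.3, 50),   # bump around 0
                    rng.normal(5.0, 0.3, 50)])  # bump around 5
h = 0.3

def climb(x, X, h, steps=50):
    for _ in range(steps):
        w = np.exp(-0.5 * ((x - X) / h) ** 2)   # Gaussian kernel weights
        x = np.sum(w * X) / np.sum(w)           # kernel-weighted mean
    return x

a, b = climb(-0.4, X, h), climb(0.4, X, h)
print(abs(a - b))                 # ~0: same attractor, same cluster
print(abs(climb(5.2, X, h) - a))  # ~5: different attractor, different cluster
```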
3. The clusters produced by hill climbing alone are very numerous; in the two-concentric-circles example, hill climbing alone produced 27 clusters in total. So stopping after hill climbing is far from enough, and this step, merging clusters, is the one that makes or breaks the quality of the final clustering. We need to look at the points lying between two clusters, that is, at each cluster's lowest-density points. The feet of two peaks are not necessarily connected, and if they were, we could not decide which cluster such a point belongs to. So we look for a pair of points, one from each cluster, that are sufficiently close to each other and whose densities both exceed a given threshold; when such a pair exists, the two clusters are merged.

In the example figure, the points between clusters D and E had densities above the threshold (in other words, there were still plenty of points between the two clusters), so D and E can be merged into one cluster; the points between A and B were below the threshold, which means there are very few points between them, so A and B stay separate.
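A minimal sketch of this merge rule, assuming we already have the attractors, a callable `density_fn`, and a density threshold `xi` (all names hypothetical). Checking only the midpoint between two attractors is a simplification; DENCLUE proper requires a whole path between the attractors along which the density stays above the threshold.

```python
# Merge-rule sketch: link two attractors when the "valley" between them
# is still above the density threshold xi, then take connected components.
import numpy as np
import networkx as nx

def merge_attractors(attractors, density_fn, xi):
    g = nx.Graph()
    g.add_nodes_from(range(len(attractors)))
    for i in range(len(attractors)):
        for j in range(i + 1, len(attractors)):
            midpoint = (attractors[i] + attractors[j]) / 2.0
            if density_fn(midpoint) >= xi:    # still dense between the peaks
                g.add_edge(i, j)
    return list(nx.connected_components(g))  # each set is one merged cluster
```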
4. The last step is labelling and plotting. Since the densities were computed at the very beginning, any point that never merged with other points into a cluster during hill climbing must have a density below the threshold; we only need to compare each density against the threshold and then label and plot the points accordingly (see the sketch below). Noise points generally end up as single-point clusters, so the check in the previous step filters them out easily.
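A small plotting sketch for this step, assuming `labels` comes from a fitted model with noise marked as -1 (as in the DENCLUE class below); the helper name is made up:

```python
# Plot clustered points, drawing noise (label -1) in gray.
import matplotlib.pyplot as plt
import numpy as np

def plot_clusters(X, labels):
    labels = np.asarray(labels)
    noise = labels == -1
    plt.scatter(X[~noise, 0], X[~noise, 1], c=labels[~noise], s=10)
    plt.scatter(X[noise, 0], X[noise, 1], c='lightgray', s=10, label='noise')
    plt.legend()
    plt.show()
```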
Example Implementation
"""
denclue.py
@author: mgarrett
"""
import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin
import networkx as nx
def _hill_climb(x_t, X, W=None, h=0.1, eps=1e-7):
    """
    Climb the 'hill' of the kernel density function and find the 'peak',
    which represents the density attractor.
    """
    prob = 0.
    x_l1 = np.copy(x_t)

    # The sum of the last few step sizes is used to establish the radius
    # of the neighborhood around the attractor. The authors suggested two
    # steps work well, but three turned out to be more robust on noisy
    # datasets.
    radius_new = 0.
    radius_old = 0.
    radius_twiceold = 0.
    iters = 0
    while True:
        radius_thriceold = radius_twiceold
        radius_twiceold = radius_old
        radius_old = radius_new
        x_l0 = np.copy(x_l1)
        x_l1, density = _step(x_l0, X, W=W, h=h)
        error = density - prob          # density gain of this step
        prob = density
        radius_new = np.linalg.norm(x_l1 - x_l0)
        radius = radius_thriceold + radius_twiceold + radius_old + radius_new
        iters += 1
        if iters > 3 and error < eps:   # converged once the density stops rising
            break
    return [x_l1, prob, radius]
def _step(x_l0, X, W=None, h=0.1):
    n = X.shape[0]
    d = X.shape[1]
    superweight = 0.          # total kernel * weight mass over all items
    x_l1 = np.zeros((1, d))
    if W is None:
        W = np.ones((n, 1))
    for j in range(n):
        kernel = kernelize(x_l0, X[j], h, d)
        kernel = kernel * W[j] / (h ** d)
        superweight = superweight + kernel
        x_l1 = x_l1 + (kernel * X[j])
    # Fixed-point update: move to the kernel-weighted mean of the data.
    x_l1 = x_l1 / superweight
    density = superweight / np.sum(W)
    return [x_l1, density]


def kernelize(x, y, h, degree):
    # Gaussian kernel; the 1/h**degree normalization is applied by callers.
    kernel = np.exp(-(np.linalg.norm(x - y) / h) ** 2. / 2.) / ((2. * np.pi) ** (degree / 2))
    return kernel
class DENCLUE(BaseEstimator, ClusterMixin):
    """Perform DENCLUE clustering from a vector array.

    Parameters
    ----------
    h : float, optional
        The smoothing parameter for the Gaussian kernel. This is a
        hyper-parameter, and the optimal value depends on the data.
        Default is np.std(X)/5.
    eps : float, optional
        Convergence threshold parameter for density attractors.
    min_density : float, optional
        The minimum kernel density required for a cluster attractor to be
        considered a cluster rather than noise. Cluster info will still be
        kept, but the corresponding instances are labelled -1 (noise).
        Since what constitutes a high enough kernel density depends on the
        nature of the data, it is often best to fit the model first and
        explore the results before deciding on min_density, which can be
        set later with the 'set_minimum_density' method. Default is 0.
    metric : string, or callable
        The metric to use when calculating distance between instances in
        a feature array. In this version, only 'euclidean' has been
        tested.

    Attributes
    ----------
    clust_info_ : dictionary [n_clusters]
        Contains relevant information about all clusters (i.e. density
        attractors). Information is retained even if an attractor's
        density is lower than the minimum required to be labelled a
        cluster.
    labels_ : array [n_samples]
        Cluster labels for each point. Noisy samples are given the
        label -1.

    References
    ----------
    Hinneburg A., Gabriel HH. "DENCLUE 2.0: Fast Clustering Based on
    Kernel Density Estimation". In: Berthold M.R., Shawe-Taylor J.,
    Lavrač N. (eds) Advances in Intelligent Data Analysis VII. IDA 2007.
    """
    def __init__(self, h=None, eps=1e-8, min_density=0., metric='euclidean'):
        self.h = h
        self.eps = eps
        self.min_density = min_density
        self.metric = metric
    def fit(self, X, y=None, sample_weight=None):
        if not self.eps > 0.0:
            raise ValueError("eps must be positive.")
        self.n_samples = X.shape[0]
        self.n_features = X.shape[1]
        density_attractors = np.zeros((self.n_samples, self.n_features))
        radii = np.zeros((self.n_samples, 1))
        density = np.zeros((self.n_samples, 1))

        # create default values
        if self.h is None:
            self.h = np.std(X) / 5
        if sample_weight is None:
            sample_weight = np.ones((self.n_samples, 1))

        # initialize all labels to noise
        labels = -np.ones(X.shape[0])

        # climb each hill
        for i in range(self.n_samples):
            density_attractors[i], density[i], radii[i] = _hill_climb(
                X[i], X, W=sample_weight, h=self.h, eps=self.eps)

        # Initialize the cluster graph to finalize clusters. A networkx
        # graph is used to verify clusters, which are the connected
        # components of the graph. An edge is added whenever two density
        # attractors lie within each other's neighborhood, as defined by
        # the radius found for each attractor.
        cluster_info = {}
        g_clusters = nx.Graph()
        for j1 in range(self.n_samples):
            g_clusters.add_node(j1, attractor=density_attractors[j1],
                                radius=radii[j1], density=density[j1])

        # populate the cluster graph
        for j1 in range(self.n_samples):
            for j2 in (x for x in range(self.n_samples) if x != j1):
                if g_clusters.has_edge(j1, j2):
                    continue
                diff = np.linalg.norm(g_clusters.nodes[j1]['attractor']
                                      - g_clusters.nodes[j2]['attractor'])
                if diff <= (g_clusters.nodes[j1]['radius']
                            + g_clusters.nodes[j2]['radius']):
                    g_clusters.add_edge(j1, j2)

        # connected components represent the clusters
        clusters = [g_clusters.subgraph(c).copy()
                    for c in nx.connected_components(g_clusters)]
        num_clusters = 0

        # loop through all connected components
        for clust in clusters:
            # get the attractor with maximum density and its location
            max_instance = max(clust, key=lambda x: clust.nodes[x]['density'])
            max_density = clust.nodes[max_instance]['density']
            max_centroid = clust.nodes[max_instance]['attractor']

            # In Hinneburg, Gabriel (2007), for attractors in a component
            # that are not fully connected (i.e. not all attractors are
            # within each other's neighborhood), they recommend re-running
            # the hill climb with a lower eps. From testing, this seems
            # unnecessary for all but special edge cases. Therefore,
            # completeness info is put into the cluster info dict, but not
            # used to re-run the hill climb.
            complete = False
            c_size = len(clust.nodes())
            if clust.number_of_edges() == (c_size * (c_size - 1)) / 2.:
                complete = True

            # populate the cluster_info dict
            cluster_info[num_clusters] = {'instances': list(clust.nodes()),
                                          'size': c_size,
                                          'centroid': max_centroid,
                                          'density': max_density,
                                          'complete': complete}

            # if the cluster density is not higher than the minimum,
            # instances stay classified as noise
            if max_density >= self.min_density:
                labels[list(clust.nodes())] = num_clusters
            num_clusters += 1
        self.clust_info_ = cluster_info
        self.labels_ = labels
        return self
    def get_density(self, x, X, y=None, sample_weight=None):
        superweight = 0.
        n_samples = X.shape[0]
        n_features = X.shape[1]
        if sample_weight is None:
            sample_weight = np.ones((n_samples, 1))
        for j in range(n_samples):
            kernel = kernelize(x, X[j], h=self.h, degree=n_features)
            kernel = kernel * sample_weight[j] / (self.h ** n_features)
            superweight = superweight + kernel
        density = superweight / np.sum(sample_weight)
        return density

    def set_minimum_density(self, min_density):
        self.min_density = min_density
        labels_copy = np.copy(self.labels_)
        for k in self.clust_info_.keys():
            if self.clust_info_[k]['density'] < min_density:
                labels_copy[self.clust_info_[k]['instances']] = -1
            else:
                labels_copy[self.clust_info_[k]['instances']] = k
        self.labels_ = labels_copy
        return self
```
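A quick usage sketch. The dataset, the hyper-parameter values, and the assumption that the class above is saved as `denclue.py` are all illustrative:

```python
# Hypothetical usage of the DENCLUE class above.
import numpy as np
from sklearn.datasets import make_blobs
from denclue import DENCLUE   # assumes the code above is saved as denclue.py

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=0)
model = DENCLUE(h=0.5, eps=1e-8).fit(X)
print(np.unique(model.labels_))           # cluster ids; -1 would mean noise
print(model.clust_info_[0]['centroid'])   # density attractor of cluster 0

# Raise the density floor after inspecting the fitted densities:
model.set_minimum_density(1e-4)
```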