k-prototypes in Python: implementation and parameter reference

idiotic_bird

Original source: https://github.com/nicodv/kmodes/blob/master/examples/benchmark_kprototypes.py

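k-prototypes is a classic clustering algorithm for data that mixes numerical and categorical attributes. To make mixed-data clustering in Python easier, the key parameters and usage of the kmodes package are reproduced below.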

The following material is reproduced from the package author's GitHub repository:
https://github.com/nicodv/kmodes/blob/master/kmodes/kprototypes.py

The kmodes package provides a Python implementation of the k-prototypes algorithm. It is used in much the same way as KMeans in sklearn.

Training example:

kp = KPrototypes(n_clusters=i, max_iter=80, n_init=8, n_jobs=5, verbose=2).fit(x_train2, categorical=[3,4,5,7,8,9])

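The parameters are described below (they correspond to the arguments inside the first pair of parentheses in the example, i.e. the KPrototypes constructor):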

Parameters
-----------
n_clusters : int, optional, default: 8
    The number of clusters to form as well as the number of
    centroids to generate.
max_iter : int, default: 100
    Maximum number of iterations of the k-modes algorithm for a
    single run.
num_dissim : func, default: euclidean_dissim
    Dissimilarity function used by the algorithm for numerical variables.
    Defaults to the Euclidean dissimilarity function.
cat_dissim : func, default: matching_dissim
    Dissimilarity function used by the k-modes algorithm for categorical variables.
    Defaults to the matching dissimilarity function.
n_init : int, default: 10
    Number of times the k-modes algorithm will be run with different
    centroid seeds. The final results will be the best output of
    n_init consecutive runs in terms of cost.
init : {'Huang', 'Cao', 'random' or a list of ndarrays}, default: 'Cao'
    Method for initialization:
    'Huang': Method in Huang [1997, 1998]
    'Cao': Method in Cao et al. [2009]
    'random': choose 'n_clusters' observations (rows) at random from
    data for the initial centroids.
    If a list of ndarrays is passed, it should be of length 2, with
    shapes (n_clusters, n_features) for numerical and categorical
    data respectively. These are the initial centroids.
gamma : float, default: None
    Weighing factor that determines relative importance of numerical vs.
    categorical attributes (see discussion in Huang [1997]). By default,
    automatically calculated from data.
verbose : integer, optional
    Verbosity mode.
random_state : int, RandomState instance or None, optional, default: None
    If int, random_state is the seed used by the random number generator;
    If RandomState instance, random_state is the random number generator;
    If None, the random number generator is the RandomState instance used by np.random.
n_jobs : int, default: 1
    The number of jobs to use for the computation. This works by computing
    each of the n_init runs in parallel.
    If -1 all CPUs are used. If 1 is given, no parallel computing code is
    used at all, which is useful for debugging. For n_jobs below -1,
    (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one
    are used.
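To make the roles of `num_dissim`, `cat_dissim` and `gamma` more concrete, here is a plain-Python sketch of how a mixed dissimilarity can be assembled, in the spirit of Huang's formulation: a numerical part plus `gamma` times a categorical mismatch count. The helper names are hypothetical, and using the squared Euclidean distance for the numerical part is an assumption of this sketch, not a statement about the library's internals.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two numeric sequences (assumed form)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def matching(a, b):
    """Simple matching dissimilarity: number of positions whose values differ."""
    return sum(x != y for x, y in zip(a, b))


def mixed_dissim(point, prototype, categorical, gamma):
    """Numerical part plus gamma times categorical part (hypothetical helper)."""
    num_idx = [i for i in range(len(point)) if i not in categorical]
    num_part = squared_euclidean(
        [point[i] for i in num_idx], [prototype[i] for i in num_idx]
    )
    cat_part = matching(
        [point[i] for i in categorical], [prototype[i] for i in categorical]
    )
    return num_part + gamma * cat_part


point = (1.0, 2.0, "red", "small")
prototype = (2.0, 4.0, "red", "large")

# Numerical part: 1 + 4 = 5; categorical part: one mismatch, weighted by 0.5.
d = mixed_dissim(point, prototype, categorical=[2, 3], gamma=0.5)
print(d)  # 5.5
```

A larger `gamma` makes categorical mismatches count more relative to numerical differences, which is why its scale should be judged against the spread of the numerical columns.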

The arguments passed to fit during training are as follows:
Parameters
----------
X : array-like, shape=[n_samples, n_features]
categorical : Index of columns that contain categorical data
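Since `categorical` expects column positions rather than column names, it can help to derive the list from the data instead of typing it by hand. The sketch below uses a hypothetical helper, `categorical_indices`, that treats any column containing a non-numeric value as categorical; the sample rows are made up.

```python
def categorical_indices(rows):
    """Return positions of columns that contain at least one non-numeric value.

    Hypothetical helper: booleans are deliberately not counted as numeric.
    """
    def is_number(value):
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    n_cols = len(rows[0])
    return [
        j for j in range(n_cols)
        if not all(is_number(row[j]) for row in rows)
    ]


rows = [
    (1.5, 20, "red", "yes"),
    (2.0, 35, "blue", "no"),
    (0.7, 41, "red", "no"),
]

categorical = categorical_indices(rows)
print(categorical)  # [2, 3]
```

Categorical columns that are already integer-coded cannot be detected this way and still have to be listed explicitly.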

Example code for reading the training result:

label = kp.labels_
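A quick sanity check on the labels is to count how many points landed in each cluster. The sketch below uses a hard-coded list as a stand-in for `kp.labels_`; the values are invented for illustration.

```python
from collections import Counter

# Stand-in for kp.labels_ (made-up values).
labels = [0, 2, 1, 0, 0, 2]

# Convert to plain ints so NumPy integer labels would print cleanly too.
sizes = Counter(int(label) for label in labels)

for cluster_id, size in sorted(sizes.items()):
    print(f"cluster {cluster_id}: {size} points")
```

Very small or empty clusters are often a hint that `n_clusters` is too large for the data.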

Other attributes available after fitting are as follows:
Attributes
----------
cluster_centroids_ : array, [n_clusters, n_features]
    Categories of cluster centroids
labels_ :
    Labels of each point
cost_ : float
    Clustering cost, defined as the sum of the distances of all points to
    their respective cluster centroids.
n_iter_ : int
    The number of iterations the algorithm ran for.
epoch_costs_ :
    The cost of the algorithm at each epoch from start to completion.
gamma : float
    The (potentially calculated) weighing factor.

Notes
-----
See:
Huang, Z.: Extensions to the k-modes algorithm for clustering large
data sets with categorical values, Data Mining and Knowledge
Discovery 2(3), 1998.
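The description of `cost_` as a sum of point-to-centroid distances can be illustrated with a small hand calculation. The function below is a hypothetical sketch, not the library's code: it assumes squared differences for numerical columns and a `gamma`-weighted mismatch for categorical columns, and all data are made up.

```python
def clustering_cost(points, labels, prototypes, categorical, gamma):
    """Sum of each point's mixed dissimilarity to its assigned prototype."""
    total = 0.0
    for point, label in zip(points, labels):
        prototype = prototypes[label]
        for j, (value, center) in enumerate(zip(point, prototype)):
            if j in categorical:
                # Categorical column: pay gamma for a mismatch, nothing otherwise.
                total += gamma * (value != center)
            else:
                # Numerical column: squared difference (assumed form).
                total += (value - center) ** 2
    return total


points = [(1.0, "a"), (3.0, "a"), (10.0, "b"), (12.0, "a")]
labels = [0, 0, 1, 1]
prototypes = [(2.0, "a"), (11.0, "b")]

cost = clustering_cost(points, labels, prototypes, categorical=[1], gamma=2.0)
print(cost)  # 6.0
```

Comparing this kind of total across runs is what "best output in terms of cost" refers to in the `n_init` description: the run with the lowest cost is kept.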

The original author also provides the following official example:

#!/usr/bin/env python

import timeit

import numpy as np

from kmodes.kprototypes import KPrototypes

# number of clusters
K = 20
# no. of points
N = int(1e5)
# no. of dimensions
M = 10
# no. of numerical dimensions
MN = 5
# no. of times test is repeated
T = 3

data = np.random.randint(1, 1000, (N, M))


def huang():
    KPrototypes(n_clusters=K, init='Huang', n_init=1, verbose=2)\
        .fit_predict(data, categorical=list(range(M - MN, M)))


def cao():
    KPrototypes(n_clusters=K, init='Cao', verbose=2)\
        .fit_predict(data, categorical=list(range(M - MN, M)))


if __name__ == '__main__':

    for cm in ('huang', 'cao'):
        print(cm.capitalize() + ': {:.2f} seconds'.format(
            timeit.timeit(cm + '()',
                          setup='from __main__ import ' + cm,
                          number=T)))