The k-prototypes algorithm in Python: implementation and parameter reference

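k-prototypes is a classic clustering algorithm for mixed-type data (numerical and categorical). To make it easier for researchers to run mixed-data clustering analyses in Python, the key parameters and usage of the Python kmodes package are reproduced below.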

The following content is taken from the package author's GitHub repository:
https://github.com/nicodv/kmodes/blob/master/kmodes/kprototypes.py

The kmodes package provides a Python implementation of the k-prototypes algorithm; it is used in much the same way as the KMeans algorithm in sklearn.

Training example:

kp = KPrototypes(n_clusters=i, max_iter=80, n_init=8, n_jobs=5, verbose=2).fit(x_train2, categorical=[3,4,5,7,8,9])

The parameters are as follows (Parameters corresponds to the arguments inside the first pair of parentheses in the example):

Parameters
-----------
n_clusters : int, optional, default: 8
    The number of clusters to form, which is also the number of
    centroids to generate.
max_iter : int, default: 100
    Maximum number of iterations of the k-modes algorithm for a
    single run.
num_dissim : func, default: euclidean_dissim
    Dissimilarity function used by the algorithm for numerical variables.
    Defaults to the Euclidean dissimilarity function.
cat_dissim : func, default: matching_dissim
    Dissimilarity function used by the k-modes algorithm for categorical
    variables. Defaults to the matching dissimilarity function.
n_init : int, default: 10
    Number of times the k-modes algorithm will be run with different
    centroid seeds. The final result will be the best output of
    n_init consecutive runs in terms of cost.
init : {'Huang', 'Cao', 'random' or a list of ndarrays}, default: 'Cao'
    Method for initialization:
    'Huang': Method in Huang [1997, 1998]
    'Cao': Method in Cao et al. [2009]
    'random': choose 'n_clusters' observations (rows) at random from
    data for the initial centroids.
    If a list of ndarrays is passed, it should be of length 2, with
    shapes (n_clusters, n_features) for numerical and categorical
    data respectively. These are the initial centroids.
gamma : float, default: None
    Weighting factor that determines relative importance of numerical vs.
    categorical attributes (see discussion in Huang [1997]). By default,
    automatically calculated from data.
verbose : integer, optional
    Verbosity mode.
random_state : int, RandomState instance or None, optional, default: None
    If int, random_state is the seed used by the random number generator;
    If RandomState instance, random_state is the random number generator;
    If None, the random number generator is the RandomState instance used
    by np.random.
n_jobs : int, default: 1
    The number of jobs to use for the computation. This works by computing
    each of the n_init runs in parallel.
    If -1 all CPUs are used. If 1 is given, no parallel computing code is
    used at all, which is useful for debugging. For n_jobs below -1,
    (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one
    are used.
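The n_jobs rule is easy to misread, so here is a small sketch of the arithmetic described above. The helper name `effective_workers` is hypothetical and is not part of kmodes; it only restates the documented formula, plus the observation that parallelism is spread over the n_init runs, so more workers than runs cannot help.

```python
import os


def effective_workers(n_jobs, n_init, n_cpus=None):
    """Hypothetical helper: number of runs that can execute at once."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs must be a non-zero integer")
    if n_jobs < 0:
        # Documented rule: n_cpus + 1 + n_jobs (-1 means all CPUs).
        workers = max(1, n_cpus + 1 + n_jobs)
    else:
        workers = n_jobs
    # Each job handles whole runs, so at most n_init run concurrently.
    return min(workers, n_init)


print(effective_workers(-1, n_init=10, n_cpus=8))  # 8
print(effective_workers(-2, n_init=10, n_cpus=8))  # 7
print(effective_workers(5, n_init=8, n_cpus=8))    # 5
```

For the training example earlier in this post (n_init=8, n_jobs=5), this means at most five of the eight initializations run at the same time.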

The arguments passed to fit during training are as follows:
Parameters
----------
X : array-like, shape=[n_samples, n_features]
categorical : Index of columns that contain categorical data
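To make the role of `categorical` concrete, the sketch below splits a small mixed array into its numerical and categorical parts by column index. The helper `split_num_cat` is a hypothetical illustration written for this post, not the library's own code, and the data is made up.

```python
import numpy as np


def split_num_cat(X, categorical):
    """Hypothetical helper: split X by the given categorical column indices."""
    X = np.asarray(X, dtype=object)
    cat_idx = list(categorical)
    # Every column that is not listed as categorical is treated as numerical.
    num_idx = [j for j in range(X.shape[1]) if j not in cat_idx]
    Xnum = X[:, num_idx].astype(float)
    Xcat = X[:, cat_idx]
    return Xnum, Xcat


X = [[1.5, "a", 3.0],
     [2.5, "b", 4.0],
     [0.5, "a", 5.0]]
Xnum, Xcat = split_num_cat(X, categorical=[1])
print(Xnum)  # columns 0 and 2 as floats
print(Xcat)  # column 1, left as labels
```

The practical point is that any column not listed in `categorical` must be convertible to a number, so an off-by-one index typically shows up as a conversion error.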

Example code for reading the training result:

label = kp.labels_

Other available result attributes are as follows:
Attributes
----------
cluster_centroids_ : array, [n_clusters, n_features]
    Categories of cluster centroids
labels_ :
    Labels of each point
cost_ : float
    Clustering cost, defined as the sum of the distances of all points to
    their respective cluster centroids.
n_iter_ : int
    The number of iterations the algorithm ran for.
epoch_costs_ :
    The cost of the algorithm at each epoch from start to completion.
gamma : float
    The (potentially calculated) weighting factor.

Notes
-----
See:
Huang, Z.: Extensions to the k-modes algorithm for clustering large
data sets with categorical values, Data Mining and Knowledge
Discovery 2(3), 1998.
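To show how gamma and the cost fit together, here is a minimal NumPy sketch of the mixed dissimilarity in the spirit of Huang's formulation: the numerical part plus gamma times the number of categorical mismatches. It assumes squared Euclidean distance for the numerical part and simple matching for the categorical part. The function name `mixed_dissim`, the data, and the centroids are all made up for illustration; this is not the library's implementation.

```python
import numpy as np


def mixed_dissim(xnum, xcat, cnum, ccat, gamma):
    """Dissimilarity of one point to every centroid (illustrative only)."""
    num_part = np.sum((cnum - xnum) ** 2, axis=1)  # squared Euclidean
    cat_part = np.sum(ccat != xcat, axis=1)        # count of mismatches
    return num_part + gamma * cat_part


Xnum = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
Xcat = np.array([["a"], ["a"], ["b"]])
cnum = np.array([[0.5, 0.0], [10.0, 10.0]])  # numerical part of centroids
ccat = np.array([["a"], ["b"]])              # categorical part of centroids
gamma = 2.0

dists = np.array([mixed_dissim(xn, xc, cnum, ccat, gamma)
                  for xn, xc in zip(Xnum, Xcat)])
labels = dists.argmin(axis=1)  # nearest centroid per point
cost = dists.min(axis=1).sum() # sum of distances to assigned centroids

print(labels)  # [0 0 1]
print(cost)    # 0.5
```

A larger gamma makes a categorical mismatch more expensive relative to numerical distance, so the categorical columns have more influence on the assignment.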

The original author also provides an official example:

#!/usr/bin/env python

import timeit

import numpy as np

from kmodes.kprototypes import KPrototypes

# number of clusters
K = 20
# no. of points
N = int(1e5)
# no. of dimensions
M = 10
# no. of numerical dimensions
MN = 5
# no. of times test is repeated
T = 3

data = np.random.randint(1, 1000, (N, M))


def huang():
    KPrototypes(n_clusters=K, init='Huang', n_init=1, verbose=2)\
        .fit_predict(data, categorical=list(range(M - MN, M)))


def cao():
    KPrototypes(n_clusters=K, init='Cao', verbose=2)\
        .fit_predict(data, categorical=list(range(M - MN, M)))


if __name__ == '__main__':

    for cm in ('huang', 'cao'):
        print(cm.capitalize() + ': {:.2} seconds'.format(
            timeit.timeit(cm + '()',
                          setup='from __main__ import ' + cm,
                          number=T)))