Hierarchical clustering and k-means clustering with scipy

Latest recommended article published on 2024-08-08 16:43:27

Yan456jie


Category: Machine Learning


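Introduction to the scipy cluster library

scipy.cluster is the clustering package within scipy. It contains two families of clustering methods:
1. Vector quantization (scipy.cluster.vq): supports vector quantization and k-means clustering
2. Hierarchical clustering (scipy.cluster.hierarchy): supports hierarchical clustering and agglomerative clustering

Implementing the clustering methods: k-means and hierarchical clustering.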

```python
### cluster.py
# Import the required packages
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist
from scipy.cluster.vq import vq, kmeans, whiten

# Generate the data points to cluster: 20 points, 4 dimensions each.
# (scipy.randn is no longer available in current SciPy; use NumPy directly.)
points = np.random.randn(20, 4)

# 1. Hierarchical clustering
# Compute the pairwise distances between points (condensed form), using Euclidean distance:
disMat = pdist(points, 'euclidean')
# Run the hierarchical clustering:
Z = sch.linkage(disMat, method='average')
# Draw the hierarchy as a dendrogram and save it as plot_dendrogram.png
P = sch.dendrogram(Z)
plt.savefig('plot_dendrogram.png')
# Derive the flat clusters from the linkage matrix Z
# (criterion must be passed by keyword once t is passed by keyword):
cluster = sch.fcluster(Z, t=1, criterion='inconsistent')

print("Original cluster by hierarchy clustering:\n", cluster)

# 2. k-means clustering
# Normalize the raw data (whiten rescales each feature by its standard deviation)
data = whiten(points)

# Run kmeans: the first argument is the data, the second is the number of clusters k.
# When the right number of clusters is unknown, one option is to take it from the
# hierarchical clustering result; a fixed number can of course be passed instead.
# kmeans returns two values, the centroids and the distortion; only the first
# is needed here, hence the trailing [0].
centroid = kmeans(data, int(max(cluster)))[0]

# Use vq to assign every data point to its nearest centroid. vq also returns
# two values; [0] holds the label of each data point.
label = vq(data, centroid)[0]

print("Final clustering by k-means:\n", label)
```

In a terminal, run: python cluster.py
Output:
Original cluster by hierarchy clustering:
[4 3 3 1 3 3 2 3 2 3 2 3 3 2 3 1 3 3 2 2]
Final clustering by k-means:
[1 2 1 3 1 2 0 2 0 0 0 2 1 0 1 3 2 2 0 0]
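The label numbers are assigned arbitrarily and mean nothing on their own; what matters is which points end up in the same cluster. Because the points are random, the output varies from run to run. As this run shows, the hierarchical clustering result and the k-means result do differ.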
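Since the label numbers are arbitrary, one way to compare the two results is a contingency table that counts how many points fall into each pair of (hierarchical cluster, k-means cluster). The sketch below uses the two label arrays printed above; the helper name `contingency` is made up for this illustration and is not part of scipy.

```python
import numpy as np

# Label arrays copied from the sample output above
hier = np.array([4, 3, 3, 1, 3, 3, 2, 3, 2, 3, 2, 3, 3, 2, 3, 1, 3, 3, 2, 2])
km = np.array([1, 2, 1, 3, 1, 2, 0, 2, 0, 0, 0, 2, 1, 0, 1, 3, 2, 2, 0, 0])

def contingency(a, b):
    """Count points for every pair (label in a, label in b). Hypothetical helper."""
    a_vals, a_idx = np.unique(a, return_inverse=True)
    b_vals, b_idx = np.unique(b, return_inverse=True)
    table = np.zeros((a_vals.size, b_vals.size), dtype=int)
    # Add 1 at (row, column) for each data point
    np.add.at(table, (a_idx, b_idx), 1)
    return a_vals, b_vals, table

rows, cols, table = contingency(hier, km)
print("rows = hierarchical labels", rows)
print("cols = k-means labels", cols)
print(table)
```

A row with a single non-zero entry means that hierarchical cluster sits entirely inside one k-means cluster; a row spread over several columns means k-means split that cluster.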

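Supplement: usage of some functions

1. linkage(y, method='single', metric='euclidean')
It takes three parameters:
y is the distance matrix, as returned by pdist; metric only comes into play when y is given as raw observations instead of distances; method is the way the distance between two clusters is computed. Three methods are commonly used:
(1) single: nearest neighbor; the inter-cluster distance is the smallest distance between a point of one cluster and a point of the other
(2) complete: farthest neighbor; the inter-cluster distance is the largest distance between a point of one cluster and a point of the other
(3) average: average distance; the mean of the distances over all pairs of points drawn from the two clusters

Other methods include weighted, centroid, and so on; for details see: http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage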
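To make the difference between the three methods concrete, here is a minimal sketch on four hand-picked one-dimensional points (0, 1, 5, 6; the data is an assumption chosen so the arithmetic is easy to check by hand). The two tight pairs merge first, and the height of the final merge is the distance between the clusters {0, 1} and {5, 6} under each method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Four 1-D points forming two tight pairs: {0, 1} and {5, 6}
points = np.array([[0.0], [1.0], [5.0], [6.0]])
dists = pdist(points, 'euclidean')

final_heights = {}
for method in ('single', 'complete', 'average'):
    Z = linkage(dists, method=method)
    # Column 2 of the last row is the distance at which the last two clusters merge
    final_heights[method] = Z[-1, 2]
    print(method, final_heights[method])
```

By hand: single gives |1 - 5| = 4, complete gives |0 - 6| = 6, and average gives (5 + 6 + 4 + 5) / 4 = 5.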

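2. fcluster(Z, t, criterion='inconsistent', depth=2, R=None, monocrit=None)
The first argument Z is the matrix returned by linkage, which records the merge hierarchy; t is the clustering threshold ("The threshold to apply when forming flat clusters"). In practice the choice of this threshold has a large effect on the result. scipy also offers several ways of applying the threshold (criterion):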

inconsistent : If a cluster node and all its descendants have an inconsistent value less than or equal to t then all its leaf descendants belong to the same flat cluster. When no non-singleton cluster meets this criterion, every node is assigned to its own cluster. (Default)

distance : Forms flat clusters so that the original observations in each flat cluster have no greater a cophenetic distance than t.

……

I left the other parameters at their defaults; for details see:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster
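As a rough illustration of how t interacts with the criterion, the sketch below cuts the same tree in two ways: with criterion='distance' (t is a cophenetic distance) and with criterion='maxclust' (t is the maximum number of flat clusters, another criterion listed in the scipy documentation). The four 1-D points and the threshold values are assumptions picked for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Two tight pairs, {0, 1} and {5, 6}; with average linkage they merge at height 5
points = np.array([[0.0], [1.0], [5.0], [6.0]])
Z = linkage(pdist(points, 'euclidean'), method='average')

# Cut at distance 2: merges above height 2 are undone, leaving the two pairs
by_distance = fcluster(Z, t=2.0, criterion='distance')
# Cut at distance 5: the final merge (height 5) is kept, so everything is one cluster
one_cluster = fcluster(Z, t=5.0, criterion='distance')
# Ask for at most 2 flat clusters, without choosing a distance at all
by_count = fcluster(Z, t=2, criterion='maxclust')

print(by_distance, one_cluster, by_count)
```

When a sensible distance scale is hard to guess, 'maxclust' can be the easier knob to turn, since t is then simply the desired upper bound on the number of clusters.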

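3. kmeans(obs, k_or_guess, iter=20, thresh=1e-05, check_finite=True)
The input obs is the data matrix: each row is one observation and each column is one feature dimension. k_or_guess is the number of clusters (it can also be an array of initial centroids). iter is the number of times k-means is run; the centroids from the run with the lowest distortion are returned.
There are two outputs: the first is the set of cluster centroids (the codebook), and the second is the distortion, which is the mean (not the sum) of the Euclidean distances from each data point to its cluster centroid.

Reference page: http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans.html#scipy.cluster.vq.kmeans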
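The two return values of kmeans can be checked on a tiny data set. In the sketch below, k_or_guess is passed as an explicit array of starting centroids rather than a number, so the run is deterministic; the data and the starting guess are assumptions made up for this example, and whiten is skipped to keep the numbers readable.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Two well-separated groups of two points each
obs = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
# Initial guess for the two centroids (passed in place of k)
guess = np.array([[0.0, 0.0], [10.0, 0.0]])

code_book, distortion = kmeans(obs, guess)
print("centroids:\n", code_book)
print("distortion:", distortion)

# Recompute the per-point distances to the final centroids with vq
labels, dist = vq(obs, code_book)
print("mean distance:", dist.mean(), " sum of distances:", dist.sum())
```

Each point ends up 0.5 away from its centroid, so the distortion matches the mean of those distances (0.5) rather than their sum (2.0).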

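4. vq(obs, code_book, check_finite=True)
Assigns every data point to a cluster according to the centroids. obs is the data, and code_book is the set of centroids produced by kmeans.
There are again two outputs: the first is the label saying which cluster each data point belongs to; the second is an array holding the distance from each data point to its nearest centroid. This is not the same object as the second output of kmeans: the kmeans distortion is a single number, the mean of these per-point distances.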
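A small sketch of vq on its own, using a hand-written codebook instead of one produced by kmeans (the centroids and the points are assumptions for illustration). It shows that both outputs have one entry per data point.

```python
import numpy as np
from scipy.cluster.vq import vq

# A hand-written codebook with two centroids
code_book = np.array([[0.0, 0.0], [10.0, 0.0]])
# Three points to classify
new_obs = np.array([[1.0, 0.0], [9.0, 0.0], [4.0, 0.0]])

labels, dist = vq(new_obs, code_book)
print("labels:", labels)     # index of the nearest centroid for each point
print("distances:", dist)    # distance from each point to that centroid
```

The third point, at x = 4, is 4 away from the first centroid and 6 away from the second, so it is assigned label 0 with distance 4.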