集群的计算与疾病防御

最新推荐文章于 2022-12-20 14:00:21 发布

栉风沐雨1314

最新推荐文章于 2022-12-20 14:00:21 发布

阅读量867

点赞数

分类专栏： python 文章标签： python cluster 分组集群迭代

本文链接：https://blog.csdn.net/gt11799/article/details/40212019

版权

python 专栏收录该内容

35 篇文章 0 订阅

订阅专栏

这是一篇早就该写的文章，一直拖到了现在。

拿出一张国家地图，每出现一个患者，就在地图上按一个图钉，可以得到一个散点图。你可以想象成一个一个小洞戳满了地图，要分析的时候，医生建议把这些患者跟最近的患者结合到一起，分成52组，怎么解决这个问题呢？

这就是集群的计算，其实也是一个迭代的过程，首先是离得最近的两点结合到一起，把这两个点看成一个点，中心是二者位置的中心。然后继续计算，知道集群的数量符合要求为止。一听就知道计算量巨大，自然要绞尽脑汁的divide-merge了。在那之前，我们要整理出我们要面对的难题：

1.当3个点的集群和10个点的集群合并的时候，中心不是二者的中心的平均值，而应该是根据集群的大小给予一定的权重。

2.这些点如何跟地图结合在一起？是不是对象本身要自带这些信息？

于是老师说，来来来，我们来建个类吧，作为处理这个问题的数据类型。于是老师为我们建立了一个Cluster类，其中不仅具备了多种性质，更是有了很多方便计算的method。老师建立的类如下：（具体参见源代码）

class Cluster:
    """
    Class for creating and merging clusters of counties
    """
    
    def __init__(self, fips_codes, horiz_pos, vert_pos, population, risk):
              
    def __repr__(self):
        
    def fips_codes(self):
    
    def horiz_center(self):
    
    def vert_center(self):
    
    def total_population(self):
    
    def averaged_risk(self):  
        
    def copy(self):

    def distance(self, other_cluster):
        
    def merge_clusters(self, other_cluster):
   
    def cluster_error(self, data_table):

fips_codes是一个set，里面包含着集群里所有地区的地区编码（应该跟我们的邮政编码挺像的）。然后是两个坐标，人口，以及患病风险。值得一提的是，这个格式是根据美国疾病预防网站上的数据格式而制定的。这门课上我们主要使用的癌症风险数据。

这些method也是极其有用的，避免了反复造轮子。method基本一目了然，merge_clusters就是两个集合的融合，cluster_error则是计算融合后的误差。

我们以下的计算都是基于这种数据格式的。

首先就是计算两个cluster的距离：

def pair_distance(cluster_list, idx1, idx2):
    return (cluster_list[idx1].distance(cluster_list[idx2]), min(idx1, idx2), max(idx1, idx2))

因为会反复用到，因此建立这样一个辅助函数。

在我的源代码中，有两种方法计算两个距离最近的点，一种就是普通的迭代，一个一个去计算，这种方法就不提了。另外一种就是divide-merge了。

我们可以在脑海中构思这样一幅画面，一群点扔在坐标轴第一象限，我们把有点的区域拿出来，从中间切开，分成两个区域；然后再继续切，直到每个区域只剩下不超过三个点为止，然后返回最小距离，再向上合并，每次也都是这么计算，返回最小的距离。

但是有个很明显的问题，如果最小距离的两个点恰好位于两个区域的界限的两侧怎么办？

于是我们在每次合并的时候，取得最小距离为d，然后把界线两侧各划出为d的区域来，把这些点按照y轴的大小排序，距离比d还小的两个点不会超出4个索引。（至于为什么我还没有想清楚。）

就这样，我们可以一层一层合并上来，最后就可以得到最小距离的两个点了。

一开始我也是这么实现的，代码的复杂程度并不高，但是却有致命的问题：需要在迭代内排序。排序是个复杂度极高的运算，因此算法速度的提升并不明显。

老师说：在迭代外排序，然后把索引传进去就好了嘛。最终的程序就成了这个样子：

def fast_closest_pair(cluster_list):

    def fast_helper(cluster_list, horiz_order, vert_order):

        length = len(horiz_order)
        if length <= 3:
            Q = [cluster_list[idx].copy() for idx in horiz_order]
            (d,i,j) = list(slow_closest_pairs(Q))[0]
            return (d, horiz_order[i], horiz_order[j])
            
        else:
            m = int(length / 2)
            mid = (cluster_list[horiz_order[m-1]].horiz_center() + cluster_list[horiz_order[m]].horiz_center()) / 2
            
            #divide
            Hl = horiz_order[:m]
            Hr = horiz_order[m:]
            Vl = [idx for idx in vert_order if idx in Hl]
            Vr = [idx for idx in vert_order if idx in Hr]
            
            (dl, il, jl) = fast_helper(cluster_list, Hl, Vl)
            (dr, ir, jr) = fast_helper(cluster_list, Hr, Vr)
            (d, i, j) = min((dl, il, jl), (dr, ir, jr))
            
            #merge
            S = [idx for idx in vert_order if abs(cluster_list[idx].horiz_center() - mid) < d]
            length_S = len(S)
            for u in range(length_S-1):
                for v in range(u+1, min(u+4, length_S)):
                    (d,i,j) = min((d,i,j), pair_distance(cluster_list, min(S[u],S[v]), max(S[u],S[v])))
       
        return (d, i, j)
            
    # compute list of indices for the clusters ordered in the horizontal direction
    hcoord_and_index = [(cluster_list[idx].horiz_center(), idx) 
                        for idx in range(len(cluster_list))]    
    hcoord_and_index.sort()
    horiz_order = [hcoord_and_index[idx][1] for idx in range(len(hcoord_and_index))]
     
    # compute list of indices for the clusters ordered in vertical direction
    vcoord_and_index = [(cluster_list[idx].vert_center(), idx) 
                        for idx in range(len(cluster_list))]    
    vcoord_and_index.sort()
    vert_order = [vcoord_and_index[idx][1] for idx in range(len(vcoord_and_index))]

    # compute answer recursively
    answer = fast_helper(cluster_list, horiz_order, vert_order) 
    return (answer[0], min(answer[1:]), max(answer[1:]))

fast_helper就是实际迭代的部分，后面的部分都是分别在x方向和y方向上排序。这里遵守的规则是，始终传输Index，在迭代过程内亦是。在按照x方向分成两半后，向双方传入的也是排好序的y方向上的排序后的Index。

有点绕？比如我要排序a, b, c, d四个点，假设它们组成的list：

[a, b, c, d]

按照x方向上排序后的序列为：

[b, c, a, d]

按照y方向上排序后的序列为：

[d, b, c, a]

但是我只是传入原来序列的Index，比如x方向上的顺序为：

[1, 2, 0, 3]

y方向为：

[3, 1, 2, 0]

当我要按照x方向分成两半的时候，y轴的顺序也要分成两半，传入各自的迭代：

left:[1, 2] [1, 2]
right: [0,3] [3, 0]

在整个迭代过程中，一定不能改变原来的序列，也一定不要改变Index。我一开始的程序总是出错，就是在点少于3个的时候，返回来的Index是错的。

有了这个函数，进行集群的分组就简单多了：

def hierarchical_clustering(cluster_list, num_clusters):

    length = len(cluster_list)
    temp_list = list(cluster_list)
    while length > num_clusters:
        #(d,i,j) = list(slow_closest_pairs(temp_list))[0]
        (d,i,j) = fast_closest_pair(temp_list)
        temp_list[i] = temp_list[i].merge_clusters(temp_list[j])
        temp_list.pop(j)
        
        length = len(temp_list)
    return temp_list

就是不停的融合，直到集群数量满足需求为止。

计算集群还有一种更巧妙的办法。比如我们的要求是最终形成7个集群：

1. 我们可以随意找7个中心，

2. 对于每个cluster，找到最近的中心，融合在一起。

3. 融合后的集群的中心取出来当作初始的中心，进行2

4. 如此重复多次，得到想要的结果。

这种方法起名叫k-means：

def kmeans_clustering(cluster_list, num_clusters, num_iterations):

    length = len(cluster_list)
    temp_list = [(cluster_list[idx].total_population(), idx) for idx in range(length)]
    U = []
    
    #find the cluster with the biggest population
    for _ in range(num_clusters):
        temp = max(temp_list)
        U.append(cluster_list[temp[1]].copy())
        temp_list.remove(temp)
        
    for _ in range(num_iterations):
        answer = [alg_cluster.Cluster(set([]), 0., 0., 0, 0.0) for idx in range(num_clusters)]
        for idx in range(length):
            item = min([(cluster_list[idx].distance(U[idx2]), idx2) for idx2 in range(len(U))])
            answer[item[1]] = answer[item[1]].copy().merge_clusters(cluster_list[idx])
            
        U = list(answer)
        
    return answer

效率比hierarchical方法要快很多，初始值的取值很重要，我的方法是选取人口最多的几个cluster（有待商榷），基本重复5次就可以得到满意的结果。

有了这两个函数，我们就可以去计算疾病分布图了。由于篇幅较长，就拆成两篇。

源代码：Github