python K-Means 实例二则

I had to illustrate a k-means algorithm for my thesis, but I could not find any existing examples that were both simple and looked good on paper. See below for Python code that does just what I wanted.

#!/usr/bin/python
 
# Adapted from http://hackmap.blogspot.com/2007/09/k-means-clustering-in-scipy.html
 
import numpy
import matplotlib
matplotlib.use('Agg')
from scipy.cluster.vq import *
import pylab
pylab.close()
 
# generate 3 sets of normally distributed points around
# different means with different variances
pt1 = numpy.random.normal(1, 0.2, (100,2))
pt2 = numpy.random.normal(2, 0.5, (300,2))
pt3 = numpy.random.normal(3, 0.3, (100,2))
 
# slightly move sets 2 and 3 (for a prettier output)
pt2[:,0] += 1
pt3[:,0] -= 0.5
 
xy = numpy.concatenate((pt1, pt2, pt3))
 
# kmeans for 3 clusters
res, idx = kmeans2(numpy.array(zip(xy[:,0],xy[:,1])),3)
 
colors = ([([0.4,1,0.4],[1,0.4,0.4],[0.1,0.8,1])[i] for i in idx])
 
# plot colored points
pylab.scatter(xy[:,0],xy[:,1], c=colors)
 
# mark centroids as (X)
pylab.scatter(res[:,0],res[:,1], marker='o', s = 500, linewidths=2, c='none')
pylab.scatter(res[:,0],res[:,1], marker='x', s = 500, linewidths=2)
 
pylab.savefig('/tmp/kmeans.png')


The output looks like this (also available in vector format here):

The X’s mark cluster centers. Feel free to use any of these files for whatever purposes. An attribution would be nice, but is not required :-) .




Using python and k-means to find the dominant colors in images

October 23, 2012 17:23 /  0 comments algorithms python

I'm working on a little photography website for my Dad and thought it would be neat to extract color information from photographs. I tried a couple of different approaches before finding one that works pretty well. This approach uses k-means clustering to cluster the pixels in groups based on their color. The center of those resulting clusters are then the "dominant" colors. k-means is a great fit for this problem because it is (usually) fast. It has the caveat of requiring you to specify up-front how many clusters you want -- I found that it works well when I specified around 3.

A warning

I'm no expert on data-mining -- almost all my experience comes from reading Toby Segaran's excellent book Programming Collective Intelligence . In one of the first chapters Toby covers clustering algorithms, including a nice treatment of k-means, so if you want to really learn from an expert I'd suggest picking up a copy. You won't be disappointed.

How it works

The way I understand it to work is you start with a bunch of data points. For simplicity let's say they're numbers on a number-line. You want to group the numbers into "k" clusters, so pick "k" points randomly from the data to use as your "clusters".

Now loop over every point in the data and calculate its distance to each of the "k" clusters. Find the nearest cluster and associate that point with the cluster. When you've looped over all the points they should all be assigned to one of the "k" clusters. Now, for each cluster recalculate its center by averaging the distances of all the associated points and start over.

When the centers stop moving very much you can stop looping. You will end up with something like this -- the points are colored based on what "cluster" they are in and the dark-black circles indicate the centers of each cluster.

K-Means

Applying it to photographs

The neat thing about this algorithm is, since it relies only on a simple distance calculation, you can extend it out to multi-dimensional data. Color is often represented using 3 channels, Red, Green, and Blue. So what I did was treat all the pixels in the image like points on a 3-dimensional space. That's all there was to it!

I made a few optimizations along the way:

  1. resize the image down to 200x200 or so using PIL
  2. instead of storing "duplicate" points, store a count with each -- saves on calculations

Looking at some results

Akira motorcycles

The results:                                 

Akira motorcycles 2

The results:                                 

Akira 3

The results:                                 

Akira 4

The results:                                 

The source code

Below is the source code. It requires PIL to resize the image down to 200x200 and to extract the colors/counts. The "colorz" function is the one that returns the actual color codes for a filename.

from collections import namedtuple
from math import sqrt
import random
try:
    import Image
except ImportError:
    from PIL import Image

Point = namedtuple('Point', ('coords', 'n', 'ct'))
Cluster = namedtuple('Cluster', ('points', 'center', 'n'))

def get_points(img):
    points = []
    w, h = img.size
    for count, color in img.getcolors(w * h):
        points.append(Point(color, 3, count))
    return points

rtoh = lambda rgb: '#%s' % ''.join(('%02x' % p for p in rgb))

def colorz(filename, n=3):
    img = Image.open(filename)
    img.thumbnail((200, 200))
    w, h = img.size

    points = get_points(img)
    clusters = kmeans(points, n, 1)
    rgbs = [map(int, c.center.coords) for c in clusters]
    return map(rtoh, rgbs)

def euclidean(p1, p2):
    return sqrt(sum([
        (p1.coords[i] - p2.coords[i]) ** 2 for i in range(p1.n)
    ]))

def calculate_center(points, n):
    vals = [0.0 for i in range(n)]
    plen = 0
    for p in points:
        plen += p.ct
        for i in range(n):
            vals[i] += (p.coords[i] * p.ct)
    return Point([(v / plen) for v in vals], n, 1)

def kmeans(points, k, min_diff):
    clusters = [Cluster([p], p, p.n) for p in random.sample(points, k)]

    while 1:
        plists = [[] for i in range(k)]

        for p in points:
            smallest_distance = float('Inf')
            for i in range(k):
                distance = euclidean(p, clusters[i].center)
                if distance < smallest_distance:
                    smallest_distance = distance
                    idx = i
            plists[idx].append(p)

        diff = 0
        for i in range(k):
            old = clusters[i]
            center = calculate_center(plists[i], old.n)
            new = Cluster(plists[i], center, old.n)
            clusters[i] = new
            diff = max(diff, euclidean(old.center, new.center))

        if diff < min_diff:
            break

    return clusters

Playing with it in the browser

I ported the code over to JavaScript -- let me tell you, its pretty rough, but it works and is fast. If you'd like to take a look at a live example, check out:

http://charlesleifer.com/static/colors/ -- you can view the source to see the js version, but basically it is just using the HTML5 canvas and its getImageDatamethod.

Thanks for reading

Thanks for reading, I hope you found this post interesting. I am sure this is not the only approach so if you have other ideas please feel free to leave a comment orcontact me directly.

  • 0
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
以下是Python实现K-Means聚类的代码示例: ```python import numpy as np from matplotlib import pyplot as plt # 生成数据点 X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]]) # 设定初始聚类中心 centroids = np.array([[1, 0.6], [1.5, 1.8], [5, 8]]) # 定义距离函数 def euclidean_distance(x1, x2): return np.sqrt(np.sum((x1 - x2) ** 2)) # 定义K-Means函数 def k_means(X, k=3, max_iters=100): # 随机初始化聚类中心 idx = np.random.choice(len(X), k, replace=False) centroids = X[idx] # 迭代更新聚类中心 for i in range(max_iters): # 分配数据点到最近的聚类中心 clusters = [[] for _ in range(k)] for x in X: distances = [euclidean_distance(x, c) for c in centroids] cluster_idx = np.argmin(distances) clusters[cluster_idx].append(x) # 计算新的聚类中心 new_centroids = [] for c in clusters: mean = np.mean(c, axis=0) new_centroids.append(mean) new_centroids = np.array(new_centroids) # 判断是否收敛 if np.all(centroids == new_centroids): break centroids = new_centroids return centroids, clusters # 调用K-Means函数 centroids, clusters = k_means(X, k=3) # 可视化聚类结果 colors = ['r', 'g', 'b'] for i, c in enumerate(clusters): for x in c: plt.scatter(x[0], x[1], color=colors[i]) plt.scatter(centroids[:,0], centroids[:,1], marker='*', s=200, color='black') plt.show() ``` 运行上述代码,可以得到类似于以下图像的聚类结果: ![K-Means聚类结果](https://img-blog.csdn.net/20180721090548206?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JqMjM4NzA1MjYwNjg=//font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/q/80)

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值