Clustering - K-means algorithm

Abstract: This article is the video transcript of Lesson 109, "K-means Algorithm," from Chapter 14, "Unsupervised Learning," of Andrew Ng's Machine Learning course. I transcribed it while studying the videos and edited it for concision and readability, so it can be consulted later. I'm sharing it here for everyone; if you find any errors, corrections are welcome, and I sincerely thank you! I hope it helps with your studies as well.
————————————————

In the clustering problem, we are given an unlabeled data set and we would like to have an algorithm automatically group the data into coherent subsets or into coherent clusters for us. The K-means algorithm is by far the most popular and most widely used clustering algorithm, and in this video I would like to tell you what the K-means algorithm is and how it works.

The K-means clustering algorithm is best illustrated in pictures. Let's say I want to take an unlabeled data set like the one shown here, and I want to group the data into two clusters. If I run the K-means clustering algorithm, here's what I'm going to do. The first step is to randomly initialize two points, called the cluster centroids. So, these two crosses here, these are called the cluster centroids. And I have two of them because I want to group my data into two clusters. K-means is an iterative algorithm and it does two things. First is a cluster assignment step, and second is a move centroid step. So, let me tell you what those things mean. The first of the two steps in the inner loop of K-means is the cluster assignment step. What that means is that it's going to go through each of the examples, each of the green dots shown here, and depending on whether it's closer to the red cluster centroid or the blue cluster centroid, it's going to assign each of the data points to one of the cluster centroids.

Specifically, what I mean by that is it's going to go through your data set and color each of the points either red or blue, depending on whether it's closer to the red cluster centroid or the blue cluster centroid. Now I've done that in this diagram here. The other part of K-means, the inner loop of K-means, is the move centroid step. And what we are going to do is take the two cluster centroids and move them to the average of the points colored the same color. So what we're going to do is look at all the red points and compute the average, really the mean of the location of all the red points, and we are going to move the red cluster centroid there. And the same thing for the blue cluster centroid. So let me do that now.

We're going to move the cluster centroids as follows. And I've moved them to their new means. And then we go back to another cluster assignment step. So, we are again going to look at all my unlabeled examples and, depending on whether it's closer to the red or the blue cluster centroid, I'm going to color them either red or blue. So I'm going to assign each point to one of the two cluster centroids. So let me do that now.

And so the colors of some of the points just changed. And then I'm going to do another move centroid step. So I'm going to compute the average of all the blue points and red points.

And move my cluster centroids like this.

And so, let me do one more cluster assignment step and color each point red or blue based on which cluster centroid it's closer to. And then do another move centroid step. And we're done.

And in fact, if you keep running additional iterations of K-means from here, the cluster centroids will not change any further. And the colors of the points will not change any further. At this point, K-means has converged. And it's done a pretty good job finding the two clusters in this data. Let's write out the K-means algorithm more formally.

The K-means algorithm takes two inputs. One is a parameter K, which is the number of clusters you want to find in the data. I'll later say how we might go about trying to choose K. But for now, let's just say that we've decided we want a certain number of clusters, and we're going to tell the algorithm how many clusters we think there are in the data set. And then K-means also takes as input this sort of unlabeled training set of just the x's, and because this is unsupervised learning, we don't have the labels y anymore. And for unsupervised learning with K-means, I'm going to use the convention that x^{(i)} is an n-dimensional vector (x^{(i)}\in \mathbb{R}^{n}). And that's why my training examples are now n-dimensional rather than n+1-dimensional vectors.
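To make the setup concrete, here is a minimal sketch of how these two inputs might be represented in NumPy (the variable names and numbers are mine, invented purely for illustration):

```python
import numpy as np

# Unlabeled training set: m examples x^(1), ..., x^(m), each in R^n.
# There are no labels y, and no extra x_0 = 1 feature, so each
# example is an n-dimensional (not n+1-dimensional) vector.
X = np.array([[1.0, 1.1],
              [1.2, 0.9],
              [5.0, 5.2],
              [4.8, 5.1]])  # m = 4 examples, n = 2 features
K = 2                       # number of clusters we ask K-means to find
```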

This is what the K-means algorithm does. The first step is that it randomly initializes K cluster centroids, which we'll call \mu _{1}, \mu _{2} up to \mu _{K}. And so in the earlier diagram, the cluster centroids corresponded to the location of the red cross and the location of the blue cross. So there we had two cluster centroids, so maybe the red cross was \mu _{1}, and the blue cross was \mu _{2}. And more generally we would have K cluster centroids rather than just 2. Then the inner loop of K-means repeatedly does the following. First, for each of my training examples, I'm going to set this variable c^{(i)} to be the index, 1 through K, of the cluster centroid closest to x^{(i)}. So this was my cluster assignment step, where we took each of my examples and colored it either red or blue, depending on which cluster centroid it was closest to. So c^{(i)} is going to be a number from 1 to K that tells us whether it's closer to the red cross or to the blue cross. Another way of writing this is: to compute c^{(i)}, I'm going to take my i^{th} example x^{(i)} and measure its distance to each of my cluster centroids, so that c^{(i)} = \underset{k}{\arg\min}\left \| x^{(i)}-\mu _{k} \right \|^{2}. So we think of c^{(i)} as picking the cluster centroid with the smallest squared distance to my training example x^{(i)}. Of course, minimizing the squared distance and minimizing the distance itself give you the same value of c^{(i)}; we usually put the square there just as the convention people use for K-means. So that was the cluster assignment step.
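Continuing the NumPy sketch from above, the cluster assignment step might look like this (the function name and the vectorized form are my own, not from the lecture):

```python
def cluster_assignment(X, centroids):
    """For each example x^(i), return c^(i): the index of the
    cluster centroid minimizing ||x^(i) - mu_k||^2."""
    # sq_dists[i, k] = ||x^(i) - mu_k||^2, computed by broadcasting
    # X (m, n) against centroids (K, n) to get shape (m, K).
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)  # c^(i) for each i (0-indexed here)
```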

The other step of the inner loop of K-means is the move centroid step. What that does is, for each of my cluster centroids, so for lower case k equals 1 through K, it sets \mu _{k} equal to the average of the points assigned to cluster k. As a concrete example, let's say that one of my cluster centroids, say cluster centroid 2, has the training examples x^{(1)}, x^{(5)}, x^{(6)}, x^{(10)} assigned to it. What this means is that c^{(1)}=2, c^{(5)}=2, c^{(6)}=2, c^{(10)}=2. If we got that from the cluster assignment step, then that means examples 1, 5, 6, 10 were assigned to cluster centroid 2. Then in this move centroid step, what I'm going to do is just compute the average of these four things: \mu _{2} = \frac{1}{4}[x^{(1)}+x^{(5)}+x^{(6)}+x^{(10)}]. And note \mu _{2}\in \mathbb{R}^{n}. This has the effect of moving \mu _{2} to the average of the four points listed here. One thing I've sometimes been asked is: we set \mu _{k} to be the average of the points assigned to the cluster, but what if there is a cluster centroid with no points assigned to it? In that case, the more common thing to do is just eliminate that cluster centroid, and if you do that, you end up with K-1 instead of K clusters. But sometimes if you really need K clusters, then the other thing you can do if you have a cluster centroid with no points assigned to it is to just randomly reinitialize that cluster centroid. Still, it's more common to just eliminate a cluster if at some point during K-means it ends up with no points assigned to it. That can happen, although in practice it happens not that often. So that's the K-means algorithm.
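Putting the two steps together, here is a minimal sketch of the move centroid step and the full iterative loop, reusing the cluster_assignment function above (again, the naming, the fixed iteration count, and the initialization scheme are my own assumptions; it drops empty clusters, the common choice just described):

```python
def move_centroids(X, c, K):
    """Set mu_k to the mean of the points assigned to cluster k,
    e.g. mu_2 = (x^(1) + x^(5) + x^(6) + x^(10)) / 4 if exactly those
    four examples have c^(i) = 2. Clusters that received no points
    are eliminated, so fewer than K centroids may come back."""
    return np.array([X[c == k].mean(axis=0)
                     for k in range(K) if np.any(c == k)])

def k_means(X, K, n_iters=100):
    """Alternate the cluster assignment and move centroid steps."""
    rng = np.random.default_rng(0)
    # Random initialization: here, K distinct training examples.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        c = cluster_assignment(X, centroids)              # assignment step
        centroids = move_centroids(X, c, len(centroids))  # move step
    return centroids, cluster_assignment(X, centroids)
```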

Before wrapping up this video, I'll tell you one other common application of K-means: to problems with non-well-separated clusters. Here's what I mean. So far we've been picturing K-means applied to data sets like the one shown here, where we have three pretty well separated clusters, and we'd like an algorithm to find the 3 clusters for us. But it turns out that very often K-means is also applied to data sets that look like this, where there may not be several very well separated clusters. Here's an example application to T-shirt sizing. Let's say you're a T-shirt manufacturer, and what you've done is you've gone to the population that you want to sell T-shirts to, and you've collected a number of examples of the height and weight of the people in your population. As you might guess, height and weight tend to be positively correlated, so maybe you end up with a data set like this, with a sample or set of examples of different people's heights and weights. Let's say you want to size your t-shirts, and I want to design and sell t-shirts of three sizes, small, medium and large. So how big should I make my small one? How big should I make my medium? And how big should I make my large t-shirts? One way to do this would be to run my K-means clustering algorithm on the data set that I have shown on the right. And maybe what K-means will do is group all of these points into one cluster, all of these points into a 2nd cluster, and all of these points into a 3rd cluster. So even though beforehand it doesn't seem like we have three well separated clusters, K-means will kind of separate out the data into multiple clusters for you. And what you can do is then look at this first population of people, look at their heights and weights, and try to design a small t-shirt so that it kind of fits this first population of people well, and then likewise design a medium t-shirt and a large t-shirt. This is in fact an example of market segmentation, where you're using K-means to separate your market into three different segments so you can design a product separately, that is the small, medium and large t-shirts, that tries to suit the needs of each of your three separate sub-populations well.
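As a rough sketch of the t-shirt example using the functions defined above (all the height/weight numbers here are invented for illustration; in practice an off-the-shelf implementation such as scikit-learn's KMeans would do the same job):

```python
# Hypothetical (height cm, weight kg) measurements of the population.
people = np.array([[152., 48.], [156., 52.], [160., 55.],
                   [165., 62.], [168., 66.], [171., 70.],
                   [178., 80.], [182., 86.], [185., 90.]])
centroids, sizes = k_means(people, K=3)
for k, mu in enumerate(centroids):
    # Each centroid suggests the body shape one t-shirt size should target.
    print(f"size {k}: height ~{mu[0]:.0f} cm, weight ~{mu[1]:.0f} kg")
```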

So, that's the K-means algorithm. By now, you should know how to implement the K-means algorithm and get it to work for some problems. But in the next few videos, what I want to do is really get more deeply into the nuts and bolts of K-means and talk a bit about how to actually get it to work really well.

<end>
