Stanford Machine Learning: (7). Clustering

Unsupervised learning - introduction
  • Talk about clustering
    • Learning from unlabeled data
  • Unsupervised learning
    • Useful to contrast with supervised learning
  • Compare and contrast
    • Supervised learning
      • Given a set of labels, fit a hypothesis to it
    • Unsupervised learning
      • Try to determine structure in the data
      • Clustering algorithm groups data together based on data features
  • What is clustering good for
    • Market segmentation - group customers into different market segments
    • Social network analysis - Facebook "smartlists"
    • Organizing computer clusters and data centers for network layout and location
    • Astronomical data analysis - Understanding galaxy formation
K-means algorithm
  • Want an algorithm to automatically group the data into coherent clusters
  • K-means is by far the most widely used clustering algorithm
Overview
  • Take unlabeled data and group into two clusters
  • Algorithm overview
    • 1) Randomly allocate two points as the cluster centroids
      • Have as many cluster centroids as clusters you want to find (K cluster centroids, in fact)
      • In our example we just have two clusters
    • 2) Cluster assignment step
      • Go through each example and assign each point to one of the two clusters, depending on whether it's closer to the red or the blue centroid
      • To demonstrate this, we've gone through the data and coloured each point red or blue
    • 3) Move centroid step
      • Take each centroid and move to the average of the correspondingly assigned data-points 
      • Repeat 2) and 3) until convergence
  • More formal definition
    • Input: 
      • K (number of clusters in the data)
      • Training set {x^(1), x^(2), x^(3), ..., x^(m)}
    • Algorithm:
      • Randomly initialize K cluster centroids {μ_1, μ_2, μ_3, ..., μ_K}
      • Repeat until convergence:
          • Loop 1 (cluster assignment)
            • This inner loop repeatedly sets the c^(i) variable to be the index of the cluster centroid closest to x^(i)
            • i.e. take the ith example, measure the squared distance to each cluster centroid, and assign c^(i) to the closest cluster
          • Loop 2 (move centroid)
            • Loops over each centroid and sets μ_k to the mean of all the points assigned to centroid k via the c^(i) values
      • (a runnable sketch of both loops follows at the end of this section)
      • What if there's a centroid with no points assigned to it?
        • Remove that centroid, so you end up with K-1 clusters
        • Or, randomly reinitialize it
          • Reinitialize if you really need exactly K clusters; otherwise removing it is the more common choice
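  • A minimal Python/NumPy sketch of the algorithm above; the function name k_means and the empty-cluster choice of random reinitialization are illustrative assumptions, not from the lecture:

      import numpy as np

      def k_means(X, initial_centroids, max_iters=100):
          """Run K-means from the given initial centroids.
          X: (m, n) array of examples; initial_centroids: (K, n) array.
          Returns assignments c (m,) and centroids mu (K, n)."""
          mu = initial_centroids.copy()
          K = mu.shape[0]
          for _ in range(max_iters):
              # Loop 1 (cluster assignment): c[i] = index of centroid closest to X[i]
              dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (m, K) squared distances
              c = dists.argmin(axis=1)
              # Loop 2 (move centroid): mu[k] = mean of the points assigned to cluster k
              new_mu = mu.copy()
              for k in range(K):
                  members = X[c == k]
                  if len(members) > 0:
                      new_mu[k] = members.mean(axis=0)
                  else:
                      # Empty cluster: reinitialize to a random example
                      # (alternatively drop it, leaving K-1 clusters)
                      new_mu[k] = X[np.random.randint(len(X))]
              if np.allclose(new_mu, mu):  # centroids stopped moving: converged
                  return c, new_mu
              mu = new_mu
          return c, mu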
K-means for non-separated clusters
  • So far looking at K-means where we have well defined clusters
  • But often K-means is applied to datasets where there aren't well defined clusters
    • e.g. T-shirt sizing
  • Not obvious discrete groups
  • Say you want to have three sizes (S,M,L) how big do you make these?
    • One way would be to run K-means on this data
    • K-means may partition the data as follows
    • So it creates three clusters, even though they aren't really there
    • Look at the first population of people
      • Try to design a small T-shirt which fits the 1st population
      • And so on for the other two
    • This is an example of market segmentation 
      • Build products which suit the needs of your subpopulations

K-means optimization objective
  • Supervised learning algorithms have an optimization objective (cost function)
    • K-means does too
  • K-means has an optimization objective like the supervised learning algorithms we've seen
    • Why is this good?
    • Knowing this is useful because it helps for debugging
    • Helps find better clusters
  • While K-means is running we keep track of two sets of variables
    • c^(i) is the index of the cluster {1, 2, ..., K} to which example x^(i) is currently assigned
      • i.e. there are m c^(i) values, as each example has a c^(i) value, and that value is one of the clusters (i.e. can only be one of K different values)
    • μ_k is cluster centroid k
      • The location of cluster centroid k
      • So there are K of them
      • These are points which exist in the same space as the training data
    • μ_c^(i) is the cluster centroid of the cluster to which example x^(i) has been assigned
      • This is more for convenience than anything else
        • You could look up that example i is indexed to cluster j (using the c vector), where j is between 1 and K
        • Then look up the value associated with cluster j in the μ vector (i.e. what are the features associated with μj)
        • But instead, for easy description, we have this variable which gets exactly the same value 
      • Let's say x^(i) has been assigned to cluster 5
        • Means that
          • c^(i) = 5
          • μ_c^(i) = μ_5
  • Using this notation we can write the optimization objective:

      J(c^(1), ..., c^(m), μ_1, ..., μ_K) = (1/m) Σ_{i=1}^{m} ||x^(i) - μ_c^(i)||²

    • i.e. the average squared distance between each training example x^(i) and the cluster centroid μ_c^(i) to which x^(i) has been assigned
      • This is just what we've been doing; in the course's figure, a red line shows the distance between each example x^(i) and the centroid of the cluster it has been assigned to
        • Means that when the example is very close to the centroid, this value is small
        • When the centroid is very far away from the example, the value is large
    • This is sometimes called the distortion (or distortion cost function)
    • So we are finding the values which minimize this function:

      min_{c^(1), ..., c^(m), μ_1, ..., μ_K} J(c^(1), ..., c^(m), μ_1, ..., μ_K)
  • If we consider the k-means algorithm
    • The cluster assignment step is minimizing J(...) with respect to c^(1), c^(2), ..., c^(m)
      • i.e. find the centroid closest to each example
      • Doesn't change the centroids themselves
    • The move centroid step
      • We can show this step is choosing the values of μ which minimizes J(...) with respect to μ
    • So, we're partitioning the algorithm into two parts
      • First part minimizes J with respect to the c variables
      • Second part minimizes J with respect to the μ variables
  • We can use this knowledge to help debug our K-means algorithm
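  • The distortion is cheap to compute, which is what makes it useful for debugging: neither step should ever increase J. A minimal sketch, reusing the array names from the earlier k_means sketch:

      def distortion(X, c, mu):
          """J(c, mu): average squared distance from each example to its assigned centroid."""
          return ((X - mu[c]) ** 2).sum(axis=1).mean()  # mu[c] looks up mu_c(i) for each example

  • Track this value across iterations: if it ever increases, there is a bug in the assignment or move step.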
Random initialization
  • How we initialize K-means
    • And how avoid local optimum
  • Consider clustering algorithm
    • Never spoke about how we initialize the centroids
      • A few ways - one method is most recommended
  • Have the number of centroids set to less than the number of examples (K < m) (if K > m we have a problem)
    • Randomly pick K training examples
    • Set μ_1 up to μ_K to these examples' values
  • K means can converge to different solutions depending on the initialization setup
    • Risk of local optimum
    • A local optimum is a valid convergence point, but it may not be the global optimum
  • If this is a concern
    • We can do multiple random initializations
      • See if we get the same result - many same results are likely to indicate a global optimum
  • Algorithmically we can do this as follows (a runnable version is sketched at the end of this section):

      For i = 1 to 100 {
        Randomly initialize K-means
        Run K-means; get c^(1), ..., c^(m), μ_1, ..., μ_K
        Compute the cost function (distortion) J(c^(1), ..., c^(m), μ_1, ..., μ_K)
      }
      Pick the clustering that gave the lowest distortion

    • A typical number of times to initialize K-means is 50-1000
    • Randomly initialize K-means
      • For each of the 100 random initializations, run K-means
      • Then compute the distortion on the set of cluster assignments and centroids at convergence
      • End with 100 ways of clustering the data
      • Pick the clustering which gave the lowest distortion
  • If you're running K-means with 2-10 clusters, multiple random initializations can help find a better global optimum
    • If K is larger than 10, then multiple random initializations are less likely to be necessary
    • The first solution is probably good enough (with more clusters the granularity is finer, so local optima matter less)
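  • A sketch of the recommended initialization plus multiple restarts, assuming the k_means and distortion functions from the sketches above (100 runs follows the lecture's example):

      import numpy as np

      def random_init(X, K):
          # The recommended method: pick K distinct training examples as the initial centroids
          idx = np.random.choice(len(X), size=K, replace=False)
          return X[idx].copy()

      def best_of_n_runs(X, K, n_runs=100):
          best = None
          for _ in range(n_runs):
              c, mu = k_means(X, random_init(X, K))
              J = distortion(X, c, mu)
              if best is None or J < best[0]:
                  best = (J, c, mu)  # keep the clustering with the lowest distortion
          return best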
How do we choose the number of clusters?
  • Choosing K?
    • Not a great way to do this automatically
    • Normally use visualizations to do it manually
  • What are the intuitions regarding the data?
  • Why is this hard
    • Sometimes very ambiguous
      • e.g. two clusters or four clusters
      • Not necessarily a correct answer
    • This is why doing it automatically is hard
Elbow method
  • Vary K and compute cost function at a range of K values
  • As K increases, the minimum value of J(...) should decrease (i.e. the granularity gets finer, so centroids can fit the data more closely)
    • Plot this (K vs J())
  • Look for the "elbow" on the graph
  • Chose the "elbow" number of clusters
  • If you get a nice plot this is a reasonable way of choosing K
  • Risks
    • Normally you don't get a nice curve -> no clear elbow
    • Not really that helpful
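  • A sketch of the elbow plot using the helpers above; X is assumed to be the dataset, and the matplotlib calls are one common way to draw it:

      import matplotlib.pyplot as plt

      Ks = range(1, 11)
      Js = [best_of_n_runs(X, K, n_runs=20)[0] for K in Ks]  # best distortion found at each K

      plt.plot(list(Ks), Js, marker="o")
      plt.xlabel("K (number of clusters)")
      plt.ylabel("J (distortion)")
      plt.show()  # choose K at the 'elbow' where the curve flattens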
Another method for choosing K
  • Using K-means for market segmentation
  • Running K-means for a later/downstream purpose
    • See how well different numbers of clusters serve your later needs
  • e.g.
    • T-shirt size example
      • If you have three sizes (S,M,L)
      • Or five sizes (XS, S, M, L, XL)
      • Run K means where K = 3 and K = 5
    • How does this look
    • This gives a way to choose the number of clusters
      • Could consider the cost of making extra sizes vs. how well distributed the products are
      • How important are those sizes though? (e.g. more sizes might make the customers happier)
      • So applied problem may help guide the number of clusters
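  • As a sketch of this comparison, reusing the helpers above on a hypothetical (height, weight) array X; the cost/benefit judgment itself stays with the applied problem:

      # Compare a 3-size and a 5-size segmentation of the same customer data
      J3, c3, mu3 = best_of_n_runs(X, K=3)   # sizes S, M, L
      J5, c5, mu5 = best_of_n_runs(X, K=5)   # sizes XS, S, M, L, XL
      print("K=3 distortion:", J3)
      print("K=5 distortion:", J5)
      # The rows of mu3 / mu5 are the representative measurements for each size;
      # weigh the distortion gain of 5 sizes against the cost of making extra sizes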