Example of Single Pass Clustering Technique

Suppose that we have the following set of documents and terms, and that
we are interested in clustering the terms using the single pass method
(note that the same method can beused to cluster the documents, but in
that case, we would be using the document vectors (rows) rather than the
term vector (columns).

	T1	T2	T3	T4	T5
Doc1	1	2	0	0	1
Doc2	3	1	2	3	0
Doc3	3	0	0	0	1
Doc4	2	1	0	3	0
Doc5	2	2	1	5	1

Start with T1 in a cluster by itself, say C1. At this point, C1 contains
only one item, T1, so the centroid of C1 is simply the vector for T1:

C1 = <1, 3, 3, 2, 2>.

Now compare (i.e., measure similarities) of the next item (T2) to centroids
of all existing clusters. At this point we have only one cluster, C1 (we
will use dot product for simplicity):

SIM(T2, C1) = 1*2 + 1*3 + 0*3 + 1*2 + 2*2 = 11

Now we need a pre-specified similarity threshold. Let's say that our
threshold is 10. This means that if the similarity of T2 to the cluster
centroid is >= 10, then we add T2 to the cluster, otherwise we use T2 to
start a new cluster.

In this case. SIM(T2, C1) = 11 > 10. Therefore we add T2 to cluster C1.
We now need to compute the new centroid for C1 (which now contains T1 and
T2). The centroid (which is the average vector for T1 and T2 is:

C1 = <3/2, 4/2, 3/2, 3/2, 4/2>

Now, we move to the next item, T3. Again, there is only one cluster, C1,
so we only need to compare T3 with C1 centroid. The dot product of T3 and
the above centroid is:

SIM(T3, C1) = 0 + 8/2 + 0 + 0 + 4/2 = 6

This time, T3 does not pass the threshold test (the similarity is less
than 10). Therefore, we use T3 to start a new cluster, C2. Now we have two
clusters

C1 = {T1, T2}
C2 = {T3}

We move to the next unclustered item, T4. Since we now have two clusters,
we need to compute the MAX similarity of T4 to the 2 cluster centroids
(note that the centroid of cluster C2 right now is just the vector for T3):

SIM(T4, C1) = <0, 3, 0, 3, 5> . <3/2, 4/2, 3/2, 3/2, 4/2>
            = 0 + 12/2 + 0 + 9/2 + 20/2 = 20.5

SIM(T4, C2) = <0, 3, 0, 3, 5> . <0, 2, 0, 0, 1>
            = 0 + 6 + 0 + 0 + 5 = 11

Note that both similarity scores pass the threshold (10), however, we pick
the MAX, and therefore, T4 will be added to cluster C1. Now we have the
following:

C1 = {T1, T2, T4}
C2 = {T3}

The centroid for C2 is still just the vector for T3:

C2 = <0, 2, 0, 0, 1>

and the new centroid for C1 is now:

C1 = <3/3, 7/3, 3/3, 6/3, 9/3>

The only item left unclustered is T5. We compute its similarity to the
centroids of existing clusters:

SIM(T5, C1) = <1, 0, 1, 0, 1> . <3/3, 7/3, 3/3, 6/3, 9/3>
            = 3/3 + 0 + 3/3 + 0 + 9/3 = 5

SIM(T5, C2) = <1, 0, 1, 0, 1> . <0, 2, 0, 0, 1>
            = 0 + 0 + 0 + 0 +1 = 1

Neither of these similarity values pass the threshold. Therefore, T5 will
have to go into a new cluster C3. There are no more unclustered items, so
we are done (after making a single pass through the items). The final
clusters are:

C1 = {T1, T2, T4}
C2 = {T3}
C3 = {T5}

Note: Obviously, the results for this method are highly dependent on the
similarity threshold that is used. You should use your judgment in setting
this threshold so that you are left with a reasonable number of clusters.
来源: <http://facweb.cs.depaul.edu/mobasher/classes/csc575/assignments/single-pass.html>

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值