Mahout: An overview of clustering techniques

Different kinds of clustering problems

  • EXCLUSIVE CLUSTERING In exclusive clustering, an item belongs exclusively to one cluster, not several.
  • OVERLAPPING CLUSTERING What if we wanted to do non-exclusive clustering; that is, put Harry Potter not only in fiction but also in a young adult cluster as well as under fantasy. An overlapping clustering algorithm like fuzzy k-means achieves this easily. Moreover, fuzzy k-means also indicates the degree with which an object is associated with a cluster.



 

  • HIERARCHICAL CLUSTERING  Now, assume a situation where we have two clusters of books, one for fantasy and the other for space travel. Harry Potter is in the cluster of fantasy books, but these two clusters, space travel and fantasy, could be visualized as subclusters of fiction. Hence, we can construct a fiction cluster by merging these and other similar clusters.



 

  • PROBABILISTIC CLUSTERING A probabilistic model is usually a characteristic shape or a type of distribution of a set of points in an n-dimensional plane.


 

Different clustering approaches

  • FIXED NUMBER OF CENTERS These clustering methods fix the number of clusters ahead of time.
  • BOTTOM-UP APPROACH: FROM POINTS TO CLUSTERS VIA GROUPING



 

  • TOP-DOWN APPROACH: SPLITTING THE GIANT CLUSTER




  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
Mahout是一个基于Hadoop的机器学习库,其中包含了很多常用的机器学习算法,包括K-Means聚类算法。下面是基于Mahout实现K-Means聚类的步骤: 1. 准备数据 将需要聚类的数据准备好,以适合Mahout输入格式的方式存储,例如HDFS上的文本文件。 2. 配置Mahout 在Hadoop集群上安装Mahout,并配置好Hadoop和Mahout的环境变量。 3. 运行K-Means聚类 使用Mahout中的kmeans命令来运行K-Means聚类算法,命令格式如下: ``` mahout kmeans -i <input> -c <centroids> -o <output> -dm <distanceMeasure> -k <k> ``` 其中,参数含义如下: - input:输入数据路径 - centroids:初始质心路径 - output:输出结果路径 - distanceMeasure:距离度量方法,例如EuclideanDistanceMeasure - k:聚类数量 4. 分析结果 分析K-Means聚类的结果,可以使用Mahout中的clusterdump命令来输出聚类结果,例如: ``` mahout clusterdump -i <input> -o <output> -p <points> -d <dictionary> -dt <distanceMeasure> ``` 其中,参数含义如下: - input:聚类结果路径 - output:输出结果路径 - points:数据点路径 - dictionary:词典路径 - distanceMeasure:距离度量方法 以上是基于Mahout实现K-Means聚类的步骤,需要注意的是,Mahout的输入格式和输出格式都需要按照Mahout要求的格式进行,否则会导致运行失败。同时,在运行过程中,需要根据实际情况调整参数,以达到最佳的聚类效果。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值