Interpreting Mahout K-Means Output

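I will write a separate post on how to run clustering with Mahout when I have time; this post is mainly about the results Mahout produces.
The Mahout version is 0.9. The data was neither normalized nor standardized, since this is only a test.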

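The output directory contains several folders: clusteredPoints, clusters-x, clusters-(x+1)-final, and so on, where x is the iteration number. The result of each iteration is saved in clusters-x, and the result of the last iteration, x+1, is saved in clusters-(x+1)-final. clusteredPoints also holds the final clustering result, but the two store different things: one stores clusters, the other stores points. Details follow.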
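The naming scheme described above can be used to locate the final result programmatically. The following is a minimal Python sketch, not part of Mahout; the function names and the sample listing (clusters-0, clusters-1) are assumptions made for illustration, and only clusters-2-final and clusteredPoints actually appear in this post.

```python
import re

def final_clusters_dir(names):
    """Return the name matching clusters-N-final, or None if there is none."""
    for name in names:
        if re.fullmatch(r"clusters-\d+-final", name):
            return name
    return None

def iteration_dirs(names):
    """Return all clusters-N directories (final one included), ordered by N."""
    found = []
    for name in names:
        m = re.fullmatch(r"clusters-(\d+)(-final)?", name)
        if m:
            found.append((int(m.group(1)), name))
    return [name for _, name in sorted(found)]

# Hypothetical listing of the output directory (assumed, for illustration only).
listing = ["clusteredPoints", "clusters-0", "clusters-1", "clusters-2-final"]

print(final_clusters_dir(listing))
print(iteration_dirs(listing))
```

Sorting on the parsed integer rather than on the string keeps clusters-10 after clusters-2 when a run needs many iterations.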

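mahout clusterdump parses the ClusterWritable records and converts them into a readable file. The -of option selects the output format (TEXT, CSV, and so on); the full option list is at the end of this post.
# Final clustering result (cluster name VL-x, center c, radius r, number of points in the cluster n)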
[root@drguo home]# mahout clusterdump -i file:///home/guo/Desktop/output/clusters-2-final -o /home/guo/Desktop/result
VL-0{n=7 c=[1.714, 2.286, 4.429, 0.857, 7.571] r=[2.185, 2.711, 6.884, 2.100, 5.233]}
VL-1{n=3 c=[0.667, 8.667, 11.333, 5.333, 0.667, 4.333, 1.667, 3.333, 21.667] r=[0.943, 5.437, 5.185, 7.542, 0.943, 6.128, 2.357, 4.714, 9.428]}
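To work with these cluster lines in a script, they can be split into name, n, c and r. This is a rough Python sketch that assumes the TEXT format exactly as printed above (plain comma-separated numbers inside the brackets); `parse_cluster` and `parse_floats` are hypothetical helper names, not Mahout APIs.

```python
import re

# Matches lines such as: VL-0{n=7 c=[...] r=[...]}
CLUSTER_RE = re.compile(
    r"^(?P<name>[A-Z]+-\d+)\{n=(?P<n>\d+) c=\[(?P<c>[^\]]*)\] r=\[(?P<r>[^\]]*)\]\}$"
)

def parse_floats(text):
    """Turn '1.714, 2.286' into [1.714, 2.286]."""
    return [float(x) for x in text.split(",") if x.strip()]

def parse_cluster(line):
    """Parse one clusterdump TEXT line into a dict with name, n, c, r."""
    m = CLUSTER_RE.match(line.strip())
    if m is None:
        raise ValueError("not a cluster line: " + line)
    return {
        "name": m.group("name"),
        "n": int(m.group("n")),
        "c": parse_floats(m.group("c")),
        "r": parse_floats(m.group("r")),
    }

line = "VL-0{n=7 c=[1.714, 2.286, 4.429, 0.857, 7.571] r=[2.185, 2.711, 6.884, 2.100, 5.233]}"
cluster = parse_cluster(line)
print(cluster["name"], cluster["n"], cluster["c"])
```

The center and the radius come back as lists of equal length, one entry per dimension, which matches what the dump shows.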

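# Final clustering result per point (key: the cluster the point belongs to; value: weight wt, distance, and the vector. These are NamedVectors, not plain vectors; I will write separately about how to generate them.)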
[root@drguo clusteredPoints]# mahout seqdumper -i file:///home/guo/Desktop/output/clusteredPoints -o /home/guo/Desktop/points
Input Path: file:/home/guo/Desktop/output/clusteredPoints/part-m-0
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.classify.WeightedPropertyVectorWritable
Key: 0: Value: wt: 0.7140480784137244 distance: 6.885358615591935  vec: 001461E4-86C64780-A0B495C4-D19BA86F__201601 = [5.000, 6.000, 6.000]
Key: 1: Value: wt: 0.6106543697821432 distance: 11.445523142259598  vec: 001461E4-86C64780-A0B495C4-D19BA86F__201602 = [12.000, 15.000, 15.000]
Key: 1: Value: wt: 0.6113140078611051 distance: 11.775681155103799  vec: 001461E4-86C64780-A0B495C4-D19BA86F__201603 = [13.000, 15.000, 15.000]
Key: 0: Value: wt: 0.7140480784137244 distance: 6.885358615591935  vec: 001461E4-86C64780-A0B495C4-D19BA86F__201604 = [5.000, 6.000, 6.000]
Key: 0: Value: wt: 0.7643111018595771 distance: 6.010195419417895  vec: 001461E4-86C64780-A0B495C4-D19BA86F__201605 = [2.000, 4.000, 4.000]
Key: 0: Value: wt: 0.7408819961153278 distance: 7.529533687488249  vec: 001641C0-75CC4BC2-9E31CF60-C15627D2__201603 = [6.000, 6.000]
Key: 0: Value: wt: 0.7511412095733683 distance: 7.989789402348321  vec: 001641C0-75CC4BC2-9E31CF60-C15627D2__201604 = [1.000, 1.000]
Key: 0: Value: wt: 0.6648742191066574 distance: 9.264811638337692  vec: 001641C0-75CC4BC2-9E31CF60-C15627D2__201605 = [12.000, 12.000]
Key: 0: Value: wt: 0.53656917576395 distance: 17.373449130609547  vec: 001641C0-75CC4BC2-9E31CF60-C15627D2__201606 = [18.000, 18.000]
Key: 1: Value: wt: 0.5948320024451352 distance: 23.202011407059803  vec: 001641C0-75CC4BC2-9E31CF60-C15627D2__201608 = [2.000, 1.000, 4.000, 16.000, 2.000, 13.000, 5.000, 10.000, 35.000]
Count: 10
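Each "Key: ... Value: ..." line above carries the cluster id, the weight, the distance and the named vector. As a sketch of how such lines could be processed, the Python below parses a few of them and counts points per cluster. It assumes the exact text layout shown in this dump; `parse_point` is a hypothetical helper, not part of Mahout.

```python
import re
from collections import Counter

# Matches: Key: 0: Value: wt: <w> distance: <d>  vec: <name> = [v1, v2, ...]
POINT_RE = re.compile(
    r"^Key: (?P<key>\d+): Value: wt: (?P<wt>\S+) distance: (?P<dist>\S+)\s+"
    r"vec: (?P<name>\S+) = \[(?P<vec>[^\]]*)\]$"
)

def parse_point(line):
    """Parse one seqdumper line; return None for non-point lines."""
    m = POINT_RE.match(line.strip())
    if m is None:
        return None
    return {
        "cluster": int(m.group("key")),
        "wt": float(m.group("wt")),
        "distance": float(m.group("dist")),
        "name": m.group("name"),
        "vec": [float(x) for x in m.group("vec").split(",") if x.strip()],
    }

# Three lines copied from the dump above, plus the trailing Count line.
lines = [
    "Key: 0: Value: wt: 0.7140480784137244 distance: 6.885358615591935  vec: 001461E4-86C64780-A0B495C4-D19BA86F__201601 = [5.000, 6.000, 6.000]",
    "Key: 1: Value: wt: 0.6106543697821432 distance: 11.445523142259598  vec: 001461E4-86C64780-A0B495C4-D19BA86F__201602 = [12.000, 15.000, 15.000]",
    "Key: 0: Value: wt: 0.7408819961153278 distance: 7.529533687488249  vec: 001641C0-75CC4BC2-9E31CF60-C15627D2__201603 = [6.000, 6.000]",
    "Count: 10",
]

points = [p for p in (parse_point(l) for l in lines) if p is not None]
per_cluster = Counter(p["cluster"] for p in points)
print(per_cluster)
```

Run over the full dump, the counts should agree with the n values reported by clusterdump (7 and 3 here).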

# Output the clusters together with their points
[root@drguo home]# mahout clusterdump -i file:///home/guo/Desktop/output/clusters-2-final -p file:///home/guo/Desktop/output/clusteredPoints -o /home/guo/Desktop/cluster-point
VL-0{n=7 c=[1.714, 2.286, 4.429, 0.857, 7.571] r=[2.185, 2.711, 6.884, 2.100, 5.233]}
    Weight : [props - optional]:  Point:
    0.7140480784137244 : [distance=6.885358615591935]: 001461E4-86C64780-A0B495C4-D19BA86F__201601 = [5.000, 6.000, 6.000]
    0.7140480784137244 : [distance=6.885358615591935]: 001461E4-86C64780-A0B495C4-D19BA86F__201604 = [5.000, 6.000, 6.000]
    0.7643111018595771 : [distance=6.010195419417895]: 001461E4-86C64780-A0B495C4-D19BA86F__201605 = [2.000, 4.000, 4.000]
    0.7408819961153278 : [distance=7.529533687488249]: 001641C0-75CC4BC2-9E31CF60-C15627D2__201603 = [6.000, 6.000]
    0.7511412095733683 : [distance=7.989789402348321]: 001641C0-75CC4BC2-9E31CF60-C15627D2__201604 = [1.000, 1.000]
    0.6648742191066574 : [distance=9.264811638337692]: 001641C0-75CC4BC2-9E31CF60-C15627D2__201605 = [12.000, 12.000]
    0.53656917576395 : [distance=17.373449130609547]: 001641C0-75CC4BC2-9E31CF60-C15627D2__201606 = [18.000, 18.000]
VL-1{n=3 c=[0.667, 8.667, 11.333, 5.333, 0.667, 4.333, 1.667, 3.333, 21.667] r=[0.943, 5.437, 5.185, 7.542, 0.943, 6.128, 2.357, 4.714, 9.428]}
    Weight : [props - optional]:  Point:
    0.6106543697821432 : [distance=11.445523142259598]: 001461E4-86C64780-A0B495C4-D19BA86F__201602 = [12.000, 15.000, 15.000]
    0.6113140078611051 : [distance=11.775681155103799]: 001461E4-86C64780-A0B495C4-D19BA86F__201603 = [13.000, 15.000, 15.000]
    0.5948320024451352 : [distance=23.202011407059803]: 001641C0-75CC4BC2-9E31CF60-C15627D2__201608 = [2.000, 1.000, 4.000, 16.000, 2.000, 13.000, 5.000, 10.000, 35.000]
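The combined output lists each cluster header followed by its indented points. A small Python sketch, assuming exactly this layout, can group point names under their cluster and, for example, pick the point farthest from each center. `group_points` is a hypothetical helper, and the sample is a shortened copy of the output above.

```python
import re

HEADER_RE = re.compile(r"^(?P<name>[A-Z]+-\d+)\{")
POINT_RE = re.compile(
    r"^\s+(?P<wt>[0-9.eE+-]+) : \[distance=(?P<dist>[^\]]+)\]: (?P<name>\S+) = \["
)

def group_points(lines):
    """Map cluster name -> list of (point name, distance) in file order."""
    groups = {}
    current = None
    for line in lines:
        header = HEADER_RE.match(line)
        if header:
            current = header.group("name")
            groups[current] = []
            continue
        point = POINT_RE.match(line)
        # The "Weight : [props - optional]:  Point:" caption does not match and is skipped.
        if point and current is not None:
            groups[current].append((point.group("name"), float(point.group("dist"))))
    return groups

sample = [
    "VL-0{n=7 c=[1.714, 2.286, 4.429, 0.857, 7.571] r=[2.185, 2.711, 6.884, 2.100, 5.233]}",
    "    Weight : [props - optional]:  Point:",
    "    0.7140480784137244 : [distance=6.885358615591935]: 001461E4-86C64780-A0B495C4-D19BA86F__201601 = [5.000, 6.000, 6.000]",
    "    0.53656917576395 : [distance=17.373449130609547]: 001641C0-75CC4BC2-9E31CF60-C15627D2__201606 = [18.000, 18.000]",
    "VL-1{n=3 c=[0.667, 8.667, 11.333, 5.333, 0.667, 4.333, 1.667, 3.333, 21.667] r=[0.943, 5.437, 5.185, 7.542, 0.943, 6.128, 2.357, 4.714, 9.428]}",
    "    Weight : [props - optional]:  Point:",
    "    0.6106543697821432 : [distance=11.445523142259598]: 001461E4-86C64780-A0B495C4-D19BA86F__201602 = [12.000, 15.000, 15.000]",
]

groups = group_points(sample)
# Point with the largest distance in each cluster.
farthest = {name: max(pts, key=lambda p: p[1]) for name, pts in groups.items() if pts}
for name, (point, dist) in farthest.items():
    print(name, point, dist)
```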

Finally, here are the command-line options for both tools.

seqdumper

Job-Specific Options:                                                           
  --input (-i) input            Path to job input directory.                    
  --output (-o) output          The directory pathname for output.              
  --substring (-b) substring    The number of chars to print out per value      
  --count (-c)                  Report the count only                           
  --numItems (-n) numItems      Output at most <n> key value pairs              
  --facets (-fa)                Output the counts per key.  Note, if there are  
                                a lot of unique keys, this can take up a fair   
                                amount of memory                                
  --quiet (-q)                  Print only file contents.                       
  --help (-h)                   Print out help                                  
  --tempDir tempDir             Intermediate output directory                   
  --startPhase startPhase       First phase to run                              
  --endPhase endPhase           Last phase to run   

clusterdump

Job-Specific Options:                                                           
  --input (-i) input                         Path to job input directory.       
  --output (-o) output                       The directory pathname for output. 
  --outputFormat (-of) outputFormat          The optional output format for the 
                                             results.  Options: TEXT, CSV, JSON 
                                             or GRAPH_ML                        
  --substring (-b) substring                 The number of chars of the         
                                             asFormatString() to print          
  --numWords (-n) numWords                   The number of top terms to print   
  --pointsDir (-p) pointsDir                 The directory containing points    
                                             sequence files mapping input       
                                             vectors to their cluster.  If      
                                             specified, then the program will   
                                             output the points associated with  
                                             a cluster                          
  --samplePoints (-sp) samplePoints          Specifies the maximum number of    
                                             points to include _per_ cluster.   
                                             The default is to include all      
                                             points                             
  --dictionary (-d) dictionary               The dictionary file                
  --dictionaryType (-dt) dictionaryType      The dictionary file type           
                                             (text|sequencefile)                
  --evaluate (-e)                            Run ClusterEvaluator and           
                                             CDbwEvaluator over the input.  The 
                                             output will be appended to the     
                                             rest of the output at the end.     
  --distanceMeasure (-dm) distanceMeasure    The classname of the               
                                             DistanceMeasure. Default is        
                                             SquaredEuclidean                   
  --help (-h)                                Print out help                     
  --tempDir tempDir                          Intermediate output directory      
  --startPhase startPhase                    First phase to run                 
  --endPhase endPhase                        Last phase to run     