Installing and configuring mahout-distribution-0.8.tar.gz, and running Mahout's built-in algorithms

No fluff, straight to the useful stuff!

First off, don't bother setting this up on Windows or installing Cygwin. Do it directly on Linux: for companies I recommend CentOS 6.5; for learning at school, Ubuntu is fine.

Software required for the Mahout installation:

Software            Version                    Notes
Operating system    CentOS 6.5                 64-bit
JDK                 jdk1.7.0_79
Hadoop              2.6.0
Mahout              mahout-distribution-0.8

Why this version instead of 0.9 or later? Because the differences are fairly large (for example, the fpg association-rule algorithm), and there is less reference material online for the newer versions.

A note before we begin:

There are two ways to install and configure Mahout. One is to download the source (directly or via svn) and build it with Maven; the other is to download the binary distribution and unpack it. Here I use the binary distribution.
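For completeness, a rough sketch of the source-build route (the svn tag path below is an assumption based on Apache's standard repository layout, and the Maven flags are the usual ones; I have not verified them against this exact release):

# check out the 0.8 tag (path assumed from Apache's svn conventions)
svn co http://svn.apache.org/repos/asf/mahout/tags/mahout-0.8 mahout-0.8-src
cd mahout-0.8-src
# build all modules, skipping the long-running test suite
mvn clean install -DskipTests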

I. Downloading mahout-distribution-0.8.tar.gz

http://archive.apache.org/dist/mahout/0.8/


Here I take the stable release mahout-0.8 as the example.


Of course, you can also download it directly with wget; it's straightforward.
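For example, using the same archive mirror as the link above:

wget http://archive.apache.org/dist/mahout/0.8/mahout-distribution-0.8.tar.gz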

II. Installing mahout-distribution-0.8.tar.gz

1. Create the directory

I usually like to create it under /usr/local/.


[root@djt002 local]# pwd
/usr/local
[root@djt002 local]# ll
total 72
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 bin
drwxr-xr-x. 2 hadoop hadoop 4096 Mar 14 06:19 data
drwxr-xr-x. 3 hadoop hadoop 4096 Feb 21 23:10 elasticsearch
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 etc
drwxr-xr-x. 3 hadoop hadoop 4096 Jan 17 17:14 flume
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 games
drwxr-xr-x. 3 hadoop hadoop 4096 Jan 16 23:33 hadoop
drwxr-xr-x. 3 hadoop hadoop 4096 Mar 16 18:26 hbase
drwxr-xr-x. 4 hadoop hadoop 4096 Mar 14 17:48 hive
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 include
drwxr-xr-x. 3 hadoop hadoop 4096 Jan 16 23:25 jdk
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 lib
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 lib64
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 libexec
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 sbin
drwxr-xr-x. 5 root   root   4096 Jan 16 20:09 share
drwxr-xr-x. 4 hadoop hadoop 4096 Mar 17 23:33 sqoop
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 src
[root@djt002 local]# mkdir mahout
[root@djt002 local]# ll
total 76
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 bin
drwxr-xr-x. 2 hadoop hadoop 4096 Mar 14 06:19 data
drwxr-xr-x. 3 hadoop hadoop 4096 Feb 21 23:10 elasticsearch
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 etc
drwxr-xr-x. 3 hadoop hadoop 4096 Jan 17 17:14 flume
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 games
drwxr-xr-x. 3 hadoop hadoop 4096 Jan 16 23:33 hadoop
drwxr-xr-x. 3 hadoop hadoop 4096 Mar 16 18:26 hbase
drwxr-xr-x. 4 hadoop hadoop 4096 Mar 14 17:48 hive
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 include
drwxr-xr-x. 3 hadoop hadoop 4096 Jan 16 23:25 jdk
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 lib
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 lib64
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 libexec
drwxr-xr-x  2 root   root   4096 Apr  7 00:21 mahout
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 sbin
drwxr-xr-x. 5 root   root   4096 Jan 16 20:09 share
drwxr-xr-x. 4 hadoop hadoop 4096 Mar 17 23:33 sqoop
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 src
[root@djt002 local]# chown -R hadoop:hadoop mahout
[root@djt002 local]# ll
total 76
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 bin
drwxr-xr-x. 2 hadoop hadoop 4096 Mar 14 06:19 data
drwxr-xr-x. 3 hadoop hadoop 4096 Feb 21 23:10 elasticsearch
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 etc
drwxr-xr-x. 3 hadoop hadoop 4096 Jan 17 17:14 flume
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 games
drwxr-xr-x. 3 hadoop hadoop 4096 Jan 16 23:33 hadoop
drwxr-xr-x. 3 hadoop hadoop 4096 Mar 16 18:26 hbase
drwxr-xr-x. 4 hadoop hadoop 4096 Mar 14 17:48 hive
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 include
drwxr-xr-x. 3 hadoop hadoop 4096 Jan 16 23:25 jdk
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 lib
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 lib64
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 libexec
drwxr-xr-x  2 hadoop hadoop 4096 Apr  7 00:21 mahout
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 sbin
drwxr-xr-x. 5 root   root   4096 Jan 16 20:09 share
drwxr-xr-x. 4 hadoop hadoop 4096 Mar 17 23:33 sqoop
drwxr-xr-x. 2 root   root   4096 Sep 23  2011 src
[root@djt002 local]#

2. Upload the Mahout tarball


[root@djt002 local]# su hadoop
[hadoop@djt002 local]$ cd mahout/
[hadoop@djt002 mahout]$ pwd
/usr/local/mahout
[hadoop@djt002 mahout]$ ll
total 0
[hadoop@djt002 mahout]$ rz
[hadoop@djt002 mahout]$ ll
total 67628
-rw-r--r-- 1 hadoop hadoop 69248331 Apr  6 16:09 mahout-distribution-0.8.tar.gz
[hadoop@djt002 mahout]$

3. Unpack

[hadoop@djt002 mahout]$ pwd
/usr/local/mahout
[hadoop@djt002 mahout]$ ll
total 67628
-rw-r--r-- 1 hadoop hadoop 69248331 Apr  6 16:09 mahout-distribution-0.8.tar.gz
[hadoop@djt002 mahout]$ tar -zxvf mahout-distribution-0.8.tar.gz

4. Remove the tarball (ownership was already set above)

[hadoop@djt002 mahout]$ pwd
/usr/local/mahout
[hadoop@djt002 mahout]$ ll
total 67632
drwxrwxr-x 7 hadoop hadoop     4096 Apr  7 00:25 mahout-distribution-0.8
-rw-r--r-- 1 hadoop hadoop 69248331 Apr  6 16:09 mahout-distribution-0.8.tar.gz
[hadoop@djt002 mahout]$ rm mahout-distribution-0.8.tar.gz
[hadoop@djt002 mahout]$ ll
total 4
drwxrwxr-x 7 hadoop hadoop 4096 Apr  7 00:25 mahout-distribution-0.8
[hadoop@djt002 mahout]$

5. Configure Mahout

[root@djt002 mahout-distribution-0.8]# pwd
/usr/local/mahout/mahout-distribution-0.8
[root@djt002 mahout-distribution-0.8]# vim /etc/profile


#mahout
export MAHOUT_HOME=/usr/local/mahout/mahout-distribution-0.8
export MAHOUT_CONF_DIR=/usr/local/mahout/mahout-distribution-0.8/conf
export PATH=$PATH:$MAHOUT_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib:$MAHOUT_HOME/lib:$JRE_HOME/lib:$CLASSPATH


[root@djt002 mahout-distribution-0.8]# source /etc/profile
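A quick way to confirm the variables took effect (a minimal check, nothing Mahout-specific):

echo $MAHOUT_HOME      # should print /usr/local/mahout/mahout-distribution-0.8
which mahout           # should resolve to $MAHOUT_HOME/bin/mahout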

Take a look at Mahout's directory structure:


[hadoop@djt002 mahout-distribution-0.8]$ pwd
/usr/local/mahout/mahout-distribution-0.8
[hadoop@djt002 mahout-distribution-0.8]$ ll
total 64924
drwxrwxr-x 2 hadoop hadoop     4096 Apr 28 22:06 bin
drwxrwxr-x 3 hadoop hadoop     4096 Apr 28 22:06 buildtools
drwxr-xr-x 2 hadoop hadoop     4096 Jul  8  2013 conf
drwxrwxr-x 3 hadoop hadoop     4096 Apr 28 22:06 core
drwxrwxr-x 3 hadoop hadoop     4096 Apr 28 22:06 distribution
drwxrwxr-x 6 hadoop hadoop     4096 Apr 28 22:06 docs
drwxrwxr-x 5 hadoop hadoop     4096 Apr 28 22:06 examples
drwxrwxr-x 3 hadoop hadoop     4096 Apr 28 22:06 integration
drwxrwxr-x 3 hadoop hadoop     4096 Apr 28 22:06 lib
-rw-r--r-- 1 hadoop hadoop    39588 Jul  8  2013 LICENSE.txt
-rw-r--r-- 1 hadoop hadoop  1643245 Jul  8  2013 mahout-core-0.8.jar
-rw-r--r-- 1 hadoop hadoop 19929354 Jul  8  2013 mahout-core-0.8-job.jar
-rw-r--r-- 1 hadoop hadoop   273767 Jul  8  2013 mahout-examples-0.8.jar
-rw-r--r-- 1 hadoop hadoop 42503144 Jul  8  2013 mahout-examples-0.8-job.jar
-rw-r--r-- 1 hadoop hadoop   439078 Jul  8  2013 mahout-integration-0.8.jar
-rw-r--r-- 1 hadoop hadoop  1590913 Jul  8  2013 mahout-math-0.8.jar
drwxrwxr-x 3 hadoop hadoop     4096 Apr 28 22:06 math
-rw-r--r-- 1 hadoop hadoop     1888 Jul  8  2013 NOTICE.txt
-rw-r--r-- 1 hadoop hadoop     1212 Jul  8  2013 README.txt
[hadoop@djt002 mahout-distribution-0.8]$

III. Verifying that Mahout installed successfully

[hadoop@djt002 mahout-distribution-0.8]$ bin/mahout --help

Running on hadoop, using /usr/local/hadoop/hadoop-2.6.0/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar

Unknown program '--help' chosen.

Valid program names are:

arff.vector: : Generate Vectors from an ARFF file or directory

baumwelch: : Baum-Welch algorithm for unsupervised HMM training

canopy: : Canopy clustering

cat: : Print a file or resource as the logistic regression models would see it

cleansvd: : Cleanup and verification of SVD output

clusterdump: : Dump cluster output to text

clusterpp: : Groups Clustering Output In Clusters

cmdump: : Dump confusion matrix in HTML or text formats

concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix

cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)

cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.

evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes

fkmeans: : Fuzzy K-means clustering

hmmpredict: : Generate random sequence of observations by given HMM

itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering

kmeans: : K-means clustering

lucene.vector: : Generate Vectors from a Lucene index

lucene2seq: : Generate Text SequenceFiles from a Lucene index

matrixdump: : Dump matrix in CSV format

matrixmult: : Take the product of two matrices

parallelALS: : ALS-WR factorization of a rating matrix

qualcluster: : Runs clustering experiments and summarizes results in a CSV

recommendfactorized: : Compute recommendations using the factorization of a rating matrix

recommenditembased: : Compute recommendations using item-based collaborative filtering

regexconverter: : Convert text files on a per line basis based on regular expressions

resplit: : Splits a set of SequenceFiles into a number of equal splits

rowid: : Map SequenceFile to {SequenceFile, SequenceFile}

rowsimilarity: : Compute the pairwise similarities of the rows of a matrix

runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model

runlogistic: : Run a logistic regression model against CSV data

seq2encoded: : Encoded Sparse Vector generation from Text sequence files

seq2sparse: : Sparse Vector generation from Text sequence files

seqdirectory: : Generate sequence files (of Text) from a directory

seqdumper: : Generic Sequence File dumper

seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives

seqwiki: : Wikipedia xml dump to sequence file

spectralkmeans: : Spectral k-means clustering

split: : Split Input data into test and train sets

splitDataset: : split a rating dataset into training and probe parts

ssvd: : Stochastic SVD

streamingkmeans: : Streaming k-means clustering

svd: : Lanczos Singular Value Decomposition

testnb: : Test the Vector-based Bayes classifier

trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model

trainlogistic: : Train a logistic regression using stochastic gradient descent

trainnb: : Train the Vector-based Bayes classifier

transpose: : Take the transpose of a matrix

validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set

vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors

vectordump: : Dump vectors from a sequence file to text

viterbi: : Viterbi decoding of hidden states from given output states sequence

[hadoop@djt002 mahout-distribution-0.8]$

If you see the listing above, Mahout is installed correctly: the driver automatically lists every program Mahout ships with.

Running Mahout's built-in examples (make sure the Hadoop cluster is started)
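Before launching any jobs, a quick sanity check that HDFS and YARN are up (standard Hadoop 2.x commands; the daemon names in the comment are what a typical small cluster shows):

jps
# expect NameNode, DataNode, ResourceManager, NodeManager (and usually SecondaryNameNode)
$HADOOP_HOME/bin/hdfs dfsadmin -report
# should report at least one live datanode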


Mahout's algorithms fall roughly into three groups: clustering, collaborative filtering, and classification.

Common clustering algorithms: canopy, k-means, fuzzy k-means, hierarchical clustering, LDA, and so on.

Common classification algorithms: naive Bayes, logistic regression, support vector machines, the perceptron, neural networks, and so on.

Since my version is mahout-0.8, the examples jar is mahout-examples-0.8-job.jar.

The following runs Mahout's built-in k-means example:

$HADOOP_HOME/bin/hadoop jar /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

Or

the following runs the built-in canopy example:

$HADOOP_HOME/bin/hadoop jar /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar org.apache.mahout.clustering.syntheticcontrol.canopy.Job


[hadoop@djt002 mahout-distribution-0.8]$ $HADOOP_HOME/bin/hadoop jar /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar org.apache.mahout.clustering.syntheticcontrol.canopy.Job
17/04/28 06:42:49 INFO canopy.Job: Running with default arguments
17/04/28 06:42:54 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/04/28 06:42:55 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/04/28 06:42:58 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1493332712225_0001

Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://djt002:9000/user/hadoop/testdata

at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:321)

at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)

at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)

at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:597)

at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:614)

at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)

at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)

at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:415)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)

at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)

at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1314)

at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)

at org.apache.mahout.clustering.syntheticcontrol.canopy.Job.run(Job.java:85)

at org.apache.mahout.clustering.syntheticcontrol.canopy.Job.main(Job.java:55)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.hadoop.util.RunJar.run(RunJar.java:221)

at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

[hadoop@djt002 mahout-distribution-0.8]$

Preparing test data

Download link for the sample data:

http://download.csdn.net/detail/qq1010885678/8582941

The sample data above is for exercising the k-means clustering algorithm.

Upload the sample data (data.txt) to the corresponding HDFS directory hdfs://djt002:9000/user/hadoop/testdata. (This is a sample dataset that works with the various algorithms.)
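If the link above is unavailable: this is the classic synthetic control chart dataset that Mahout's clustering examples were written around, and, assuming UCI's long-standing path still works, it can be fetched directly:

# the URL is an assumption based on the dataset's historical UCI location
wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
mv synthetic_control.data data.txt    # rename to match the steps below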


Here I first upload the test data to a local Linux path of my own choosing.


[hadoop@djt002 mahout]$ pwd
/usr/local/mahout
[hadoop@djt002 mahout]$ ll
total 4
drwxrwxr-x 7 hadoop hadoop 4096 Apr  7 00:25 mahout-distribution-0.8
[hadoop@djt002 mahout]$ mkdir mahoutData
[hadoop@djt002 mahout]$ ll
total 8
drwxrwxr-x 2 hadoop hadoop 4096 Apr 28 06:59 mahoutData
drwxrwxr-x 7 hadoop hadoop 4096 Apr  7 00:25 mahout-distribution-0.8
[hadoop@djt002 mahout]$ cd mahoutData/
[hadoop@djt002 mahoutData]$ pwd
/usr/local/mahout/mahoutData
[hadoop@djt002 mahoutData]$ ll
total 0
[hadoop@djt002 mahoutData]$ rz
[hadoop@djt002 mahoutData]$ ll
total 284
-rw-r--r-- 1 hadoop hadoop 288972 Apr 27 22:48 data.txt
[hadoop@djt002 mahoutData]$

Then upload the test data from /usr/local/mahout/mahoutData/ to hdfs://djt002:9000/user/hadoop/testdata:


[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop fs -put /usr/local/mahout/mahoutData/data.txt hdfs://djt002:9000/user/hadoop/testdata

Or

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop fs -copyFromLocal  /usr/local/mahout/mahoutData/data.txt  hdfs://djt002:9000/user/hadoop/testdata/

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop fs -ls hdfs://djt002:9000/user/hadoop/testdata/

-rw-r--r-- 1 hadoop supergroup 288972 2017-04-28 07:02 hdfs://djt002:9000/user/hadoop/testdata

You may find at this point that the dataset cannot be uploaded (the target already exists as a file). The fix:


[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop fs -put /usr/local/mahout/mahoutData/data.txt hdfs://djt002:9000/user/hadoop/testdata/

put: `hdfs://djt002:9000/user/hadoop/testdata': File exists

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop fs -rm hdfs://djt002:9000/user/hadoop/testdata/

17/04/28 07:16:58 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.

Deleted hdfs://djt002:9000/user/hadoop/testdata

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop fs -mkdir hdfs://djt002:9000/user/hadoop/testdata/

[hadoop@djt002 mahoutData]$


[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop fs -put /usr/local/mahout/mahoutData/data.txt hdfs://djt002:9000/user/hadoop/testdata/

[hadoop@djt002 mahoutData]$


Running the k-means algorithm

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop jar /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

Note: no input or output path is needed here; the defaults are hard-coded in the bundled jar.

(If you use the bundled k-means example this way, the input path is fixed at testdata, i.e. hdfs://djt002:9000/user/hadoop/testdata/.)

Also, you must delete the previous output directory before every run:

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop fs -rm -r hdfs://djt002:9000/user/hadoop/output/*
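To avoid forgetting that cleanup, you can wrap both steps in a small script (a sketch using this article's paths; run-kmeans.sh is just a name I made up):

#!/bin/bash
# run-kmeans.sh: remove stale output, then run the bundled k-means example.
# -f keeps the rm from failing when no output directory exists yet.
$HADOOP_HOME/bin/hadoop fs -rm -r -f /user/hadoop/output
$HADOOP_HOME/bin/hadoop jar /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar \
  org.apache.mahout.clustering.syntheticcontrol.kmeans.Job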


....

Because clustering is an iterative process (more on this later), the job keeps re-running MapReduce passes until the convergence criterion is met (this can take a while...).


The k-means run output looks like this:


(points listing abbreviated; each record is a 60-dimensional vector)
1.0 : [distance=29.095866076790845]: 60 = [30.817, 28.079, 24.628, ...]
1.0 : [distance=24.674726284993667]: 60 = [31.347, 28.245, 34.275, ...]
1.0 : [distance=31.366016794511612]: 60 = [35.439, 24.104, 27.345, ...]
...
17/04/28 07:35:13 INFO clustering.ClusterDumper: Wrote 6 clusters

[hadoop@djt002 mahoutData]$

Mahout ran without errors!

Note: the files produced by the k-means job cannot be viewed the ordinary way; read raw, they look like gibberish (they are Hadoop SequenceFiles).

Viewing the clustering results:

Use Mahout's seqdumper command to dump them to local Linux in a readable form.



[hadoop@djt002 ~]$ $MAHOUT_HOME/bin/mahout seqdumper -i /user/hadoop/output/data/part-m-00000 -o ~/res.txt

Running on hadoop, using /usr/local/hadoop/hadoop-2.6.0/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar
17/04/28 18:31:20 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/user/hadoop/output/data/part-m-00000], --output=[/home/hadoop/res.txt], --startPhase=[0], --tempDir=[temp]}
17/04/28 18:31:29 INFO driver.MahoutDriver: Program took 8750 ms (Minutes: 0.14583333333333334)

[hadoop@djt002 ~]$ ll
total 444
-rw-r--r--. 1 hadoop hadoop   4176 Feb 21 09:01 anagram.jar
drwxrwxr-x. 3 hadoop hadoop   4096 Mar 19 04:34 app
drwxr-xr-x. 2 hadoop hadoop   4096 Jan 17 18:23 Desktop
drwxrwxr-x. 2 hadoop hadoop   4096 Feb 21 17:03 djt
drwxr-xr-x. 2 hadoop hadoop   4096 Jan 17 18:23 Documents
drwxr-xr-x. 2 hadoop hadoop   4096 Jan 17 18:23 Downloads
drwxrwxr-x. 4 hadoop hadoop   4096 Jan 17 18:54 flume
drwxr-xr-x. 2 hadoop hadoop   4096 Jan 17 18:23 Music
drwxr-xr-x. 2 hadoop hadoop   4096 Jan 17 18:23 Pictures
drwxr-xr-x. 2 hadoop hadoop   4096 Jan 17 18:23 Public
-rw-rw-r--  1 hadoop hadoop 397021 Apr 28 18:31 res.txt
drwxr-xr-x. 2 hadoop hadoop   4096 Jan 17 18:23 Templates
drwxrwxr-x. 3 hadoop hadoop   4096 Mar 23 08:06 tvdata
drwxr-xr-x. 2 hadoop hadoop   4096 Jan 17 18:23 Videos
[hadoop@djt002 ~]$ sz res.txt


Input Path: /user/hadoop/output/data/part-m-00000
Key class: class org.apache.hadoop.io.Text
Value Class: class org.apache.mahout.math.VectorWritable

Key:60: Value: {0:28.7812,31:26.6311,34:29.1495,4:28.9207,32:35.6541,5:33.7596,8:35.2479,6:25.3969,30:25.0293,24:33.0292,29:34.9424,17:26.5235,51:24.5556,36:26.1927,12:36.0253,23:29.5054,58:25.4652,21:29.27,11:29.2171,10:32.8717,15:32.8717,7:27.7849,28:26.1203,46:28.0721,33:28.4353,55:34.9879,54:34.9318,25:25.04,3:31.2834,49:29.747,41:26.2353,1:34.4632,26:28.9167,44:31.0558,37:33.3182,56:32.4721,42:28.9964,27:24.3437,50:31.4333,16:34.1173,40:35.5344,48:35.4973,39:27.0443,9:27.1159,52:33.7431,13:32.337,43:32.0036,19:26.3693,59:25.8717,2:31.3381,20:25.7744,18:27.6623,22:30.7326,35:28.1584,57:33.3759,45:34.2553,38:30.9772,47:28.9402,14:34.5249,53:25.0466}

Key:60: Value: {0:24.8923,31:32.5981,34:26.9414,4:27.8789,32:28.3038,5:31.5926,8:27.9516,6:31.4861,30:34.0765,24:31.9874,29:25.0701,17:35.6273,51:31.0205,36:33.1089,12:27.4867,23:30.4719,58:32.1005,21:24.1311,11:31.1887,10:27.5415,15:24.488,7:35.5469,28:33.6472,46:26.3458,33:26.1471,55:26.4244,54:33.6564,25:33.6615,3:32.8217,49:29.4047,41:26.5301,1:25.741,26:25.5511,44:32.8357,37:24.1491,56:28.4661,42:24.8578,27:30.4686,50:32.5577,16:27.5918,40:35.9519,48:28.9861,39:25.7906,9:31.6595,52:26.6418,13:31.391,43:25.9562,19:31.4167,59:26.691,2:27.5532,20:30.7447,18:35.4102,22:35.1422,35:31.5203,57:34.2484,45:28.5322,38:28.5157,47:30.6213,14:27.811,53:28.4331}

Key:60: Value: {0:31.3987,31:24.246,34:31.6114,4:27.8613,32:26.9631,5:28.5491,8:25.2239,6:24.9717,30:27.3086,24:24.3323,29:28.8778,17:32.5614,51:26.5966,36:27.4809,12:28.2572,23:32.3851,58:29.5446,21:31.4781,11:27.2587,10:31.8387,15:35.0625,7:32.4358,28:31.5137,46:29.6082,33:25.2919,55:29.9897,54:25.5772,25:30.2001,3:24.2905,49:27.1717,41:31.0561,1:30.6316,26:31.2452,44:31.4391,37:24.2075,56:31.351,42:26.3583,27:26.6814,50:33.6318,16:31.5717,40:32.6293,48:34.1444,39:35.1253,9:27.3068,52:25.5387,13:26.5819,43:28.0861,19:34.1202,59:29.343,2:26.3983,20:26.9337,18:31.0308,22:35.0173,35:24.7131,57:33.9002,45:27.3057,38:26.8059,47:35.9725,14:24.0455,53:32.5434}

Key:60: Value: {0:25.774,31:28.3714,34:35.9346,4:27.97,32:32.3667,5:25.2702,8:31.4549,6:28.132,30:27.5587,24:29.2806,29:24.824,17:35.0966,51:28.7261,36:24.3749,12:29.9578,23:31.6264,58:27.3659,21:25.0102,11:28.9916,10:28.9564,15:24.3037,7:29.4268,28:25.5265,46:35.769,33:26.9752,55:32.5492,54:34.6156,25:34.2021,3:25.6033,49:31.156,41:26.8908,1:30.5262,26:26.5077,44:34.3336,37:27.6083,56:30.9827,42:31.3209,27:32.2279,50:34.6292,16:24.314,40:32.4185,48:34.2054,39:29.8557,9:27.32,52:28.2979,13:30.2773,43:29.3849,19:32.0968,59:25.3069,2:35.4209,20:33.3303,18:25.3679,22:35.3155,35:35.1146,57:24.8938,45:24.7381,38:27.8433,47:31.8725,14:30.4447,53:31.5787}

Key:60: Value: {0:27.1798,31:33.4129,34:29.6526,4:24.6555,32:26.9245,5:28.9446,8:24.5596,6:35.798,30:33.1247,24:24.6081,29:28.0295,17:31.1274,51:27.9601,36:24.5119,12:35.4154,23:33.0321,58:31.1057,21:31.6565,11:25.3216,10:27.9634,15:29.4686,7:34.9446,28:35.8773,46:29.1348,33:30.2123,55:29.9993,54:35.3375,25:33.2025,3:25.6264,49:34.9244,41:27.9072,1:29.2498,26:27.4335,44:33.833,37:33.9931,56:34.2149,42:35.111,27:32.6355,50:27.7218,16:33.1739,40:31.2651,48:32.3223,39:33.204,9:34.2366,52:35.7198,13:34.862,43:35.0757,19:26.5173,59:31.0179,2:33.6928,20:28.6486,18:31.3701,22:35.9497,35:30.8644,57:33.1276,45:25.9481,38:33.3094,47:24.2875,14:25.1472,53:27.576}

....

....


Of course, you can also inspect the other things under the output directory /user/hadoop/output, such as clusters-0, clusters-1, and so on; here I only looked at /user/hadoop/output/data/.
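To read the cluster centroids themselves, Mahout's clusterdump tool is the usual route. A sketch (clusters-10-final is a placeholder; the name of the final-iteration directory depends on when your run converged, so list the output first):

$HADOOP_HOME/bin/hadoop fs -ls /user/hadoop/output
# -i: converged clusters; -p: per-point assignments the example writes; -o: local text output
$MAHOUT_HOME/bin/mahout clusterdump \
  -i /user/hadoop/output/clusters-10-final \
  -p /user/hadoop/output/clusteredPoints \
  -o ~/clusters.txt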

Running the canopy algorithm

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop jar /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar org.apache.mahout.clustering.syntheticcontrol.canopy.Job

I won't elaborate further here.

Running the dirichlet algorithm

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop jar /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job

I won't elaborate further here.

Running the meanshift algorithm

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop jar /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar org.apache.mahout.clustering.syntheticcontrol.meanshift.Job

I won't elaborate further here.

Summary

The Mahout examples jar gives us a default input path of /user/hadoop/testdata and a default output path of /user/hadoop/output.

In fact, you can also supply your own input and output paths, as the second command below shows:

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop jar /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

[hadoop@djt002 mahoutData]$ $HADOOP_HOME/bin/hadoop jar /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -i /user/hadoop/mahoutData/data.txt -o /user/hadoop/output
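Beyond -i and -o, the synthetic-control Job also accepts the usual k-means knobs. A sketch (the flag names follow Mahout 0.8's option conventions; confirm them with bin/mahout kmeans --help before relying on this):

$HADOOP_HOME/bin/hadoop jar /usr/local/mahout/mahout-distribution-0.8/mahout-examples-0.8-job.jar \
  org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
  -i /user/hadoop/testdata \
  -o /user/hadoop/output \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -k 6 -cd 0.5 -x 10
# -dm: distance measure class; -k: number of clusters; -cd: convergence delta; -x: max iterations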
