0 Prerequisites
The machine where Mahout will be installed needs a JDK, a working Hadoop environment, and the Maven build tool.
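As a quick sanity check (a minimal sketch; it only verifies that each tool is on the PATH, not versions or configuration), you can confirm the prerequisites like this:

```shell
# Report whether each required tool is on the PATH.
# Presence only -- this does not check versions or configuration.
for tool in java hadoop mvn; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found at $(command -v "$tool")"
  else
    echo "$tool: NOT FOUND -- install it before continuing"
  fi
done
```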
1 Building and Installing Mahout 1.0
Mahout 0.9 does not support Hadoop 2 out of the box; to use it, its dependencies would have to be modified. Here we build Mahout 1.0 (trunk) against Hadoop 2 instead.
Download the source code:
git clone https://github.com/apache/mahout.git
(The SSH URL git@github.com:apache/mahout.git requires registered SSH keys; the HTTPS URL works for anonymous clones.)
Change into the mahout root directory and compile:
mvn -Dhadoop.version=2.5.1 clean compile
Package:
mvn -Dhadoop.version=2.5.1 -DskipTests=true clean package
Install into the local Maven repository (optional):
mvn -Dhadoop.version=2.5.1 clean install
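A successful package step produces a self-contained "job" jar under examples/target. A quick way to confirm it is there (assuming the source tree lives at /usr/mahout, as in the environment variables below; adjust the path to your checkout):

```shell
# Look for the packaged examples job jar; the exact file name carries
# the version (e.g. mahout-examples-1.0-SNAPSHOT-job.jar).
JOB_JAR=$(ls /usr/mahout/examples/target/mahout-examples-*-job.jar 2>/dev/null | head -n 1)
if [ -n "$JOB_JAR" ]; then
  echo "job jar: $JOB_JAR"
else
  echo "job jar not found -- rerun 'mvn -DskipTests=true clean package'"
fi
```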
Add the environment variables (adjust the paths to match your installation):
export JAVA_HOME=/usr/java/jdk1.8.0_25
export M2_HOME=/usr/apache-maven
export M2=$M2_HOME/bin
export HADOOP_HOME=/usr/hadoop
export MAHOUT_HOME=/usr/mahout
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export MAHOUT_CONF_DIR=$MAHOUT_HOME/conf
export PATH=$MAHOUT_HOME/bin:$M2_HOME/bin:/usr/hadoop/bin:/usr/hadoop/sbin:$HIVE_HOME/bin:$JAVA_HOME/bin:$PATH
export CLASSPATH=.:/usr/hadoop/share/hadoop/common/*:/usr/hadoop/share/hadoop/common/lib/*:/usr/hive/lib/*:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
(Note the /* wildcards: a bare directory on the Java classpath only loads .class files; the wildcard is needed to pick up the jars inside it.)
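The exports above last only for the current shell. One way to make them permanent (a sketch; trim the list to the variables you actually need, and keep the paths consistent with your layout) is to append them to your shell profile:

```shell
# Append the Mahout-related exports to ~/.bashrc so new login shells
# inherit them; the paths match the example layout above.
cat >> ~/.bashrc <<'EOF'
export JAVA_HOME=/usr/java/jdk1.8.0_25
export HADOOP_HOME=/usr/hadoop
export MAHOUT_HOME=/usr/mahout
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export MAHOUT_CONF_DIR=$MAHOUT_HOME/conf
export PATH=$MAHOUT_HOME/bin:$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH
EOF
```

Open a new shell (or run `. ~/.bashrc`) for the changes to take effect.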
Test:
[root@hive ~]# mahout
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/hadoop/etc/hadoop
MAHOUT-JOB: /usr/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
An example program must be given as the first argument.
Valid program names are:
arff.vector: : Generate Vectors from an ARFF file or directory
baumwelch: : Baum-Welch algorithm for unsupervised HMM training
buildforest: : Build the random forest classifier
canopy: : Canopy clustering
cat: : Print a file or resource as the logistic regression models would see it
cleansvd: : Cleanup and verification of SVD output
clusterdump: : Dump cluster output to text
clusterpp: : Groups Clustering Output In Clusters…
2 Running the k-means Clustering Example
Download the synthetic control sample dataset and upload it to HDFS:
wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
hdfs dfs -mkdir testdata
hdfs dfs -put synthetic_control.data testdata/
Run the bundled synthetic-control k-means example (it reads testdata/ in your HDFS home directory and writes results under output/):
hadoop jar /usr/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
View the results (the name of the final cluster directory, e.g. clusters-10-final, depends on how many iterations k-means ran; check with hdfs dfs -ls output):
mahout clusterdump -i output/clusters-10-final -p output/clusteredPoints -o ~/test
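clusterdump writes a human-readable summary of the clusters to the local file given with -o (~/test above). A quick look at it, guarded so it degrades gracefully if the job has not been run yet:

```shell
# Peek at the clusterdump output written to ~/test above.
if [ -f ~/test ]; then
  head -n 20 ~/test
else
  echo "~/test not found -- run the clusterdump step first"
fi
```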