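This post mainly follows https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation. When I followed the steps there as written, however, I ran into a few errors, so below I briefly go through the problems and how I got around them:
(Mahout version: 0.7)
I. Data:
Download the data from http://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data. The training data is 19.1M and the test data is 3.4M. The first three lines of the training data are: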
0,tcp,ftp_data,SF,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,150,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal,20
0,udp,other,SF,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,13,1,0.00,0.00,0.00,0.00,0.08,0.15,0.00,255,1,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00,normal,15
0,tcp,private,S0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,123,6,1.00,1.00,0.00,0.00,0.05,0.07,0.00,255,26,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00,neptune,19
II. Running the example:
(1) Upload the data: $hadoop_home/bin/hadoop fs -put kddTrain.txt input/kddTrain.txt ; $hadoop_home/bin/hadoop fs -put kddTest.txt input/kddTest.txt ;
(2) Generate the descriptor file for the raw data:
fansy@fansyPC:~/hadoop-1.0.2$ bin/hadoop jar ../mahout-0.7-pure/core/target/mahout-core-0.7-job.jar org.apache.mahout.classifier.df.tools.Describe -p input/kddTrain.txt -f out/forest/info/kdd1.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L N
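The command above differs from the one in the original article in the descriptor passed to -d: it ends with one extra N. The attribute descriptions can be found at
http://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/kddcup.names, but the downloaded data has one more column at the end that I could not find in that file, so I added one more Number (N) after the Label (L).
The log output is: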
12/12/26 20:02:27 INFO tools.Describe: Generating the descriptor...
12/12/26 20:02:28 INFO tools.Describe: generating the dataset...
12/12/26 20:02:31 INFO tools.Describe: storing the dataset description
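To make the -d descriptor easier to read, here is a small sketch that expands the shorthand into one type per column and checks it against the first training row shown earlier. It assumes that a number in the descriptor repeats the type token that follows it, which is how the column counts work out here. The helper name expand_descriptor is my own and not part of Mahout.

```python
def expand_descriptor(descriptor):
    """Expand e.g. 'N 3 C L' into ['N', 'C', 'C', 'C', 'L'].

    Assumption: a number applies to the type token right after it.
    """
    types = []
    repeat = 1
    for token in descriptor.split():
        if token.isdigit():
            repeat = int(token)
        else:
            types.extend([token] * repeat)
            repeat = 1
    return types


descriptor = "N 3 C 2 N C 4 N C 8 N 2 C 19 N L N"

# First line of kddTrain.txt, split into pieces so the fields are easy to count.
first_row = (
    "0,tcp,ftp_data,SF,491,"
    "0,0,0,0,0," "0,0,0,0,0," "0,0,0,0,0," "0,0,"
    "2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,"
    "150,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,"
    "normal,20"
)

types = expand_descriptor(descriptor)
fields = first_row.split(",")
label_index = types.index("L")

print(len(types), len(fields))           # the two counts should agree
print(label_index, fields[label_index])  # position and value of the label
```

If the two counts disagree for your copy of the data, the descriptor needs adjusting before running Describe.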
(3) Build the forest:
fansy@fansyPC:~/hadoop-1.0.2$ bin/hadoop jar ../mahout-0.7-pure/examples/target/mahout-examples-0.7-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -ood -d input/kddTrain.txt -ds out/forest/info/kdd1.info -sl 5 -p -t 100 --output out/forest1
Following the original article, as in the command above, produces this error:
12/12/26 20:04:34 ERROR mapreduce.BuildForest: Exception
org.apache.commons.cli2.OptionException: Unexpected out/forest1 while processing Options
at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
at org.apache.mahout.classifier.df.mapreduce.BuildForest.run(BuildForest.java:139)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.classifier.df.mapreduce.BuildForest.main(BuildForest.java:253)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
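Going by the error message, I removed the -o (--output) option (removing -ood and keeping --output also works), and the job then ran. The forest was written to od/forest.seq, which suggests that -ood was parsed as -o with the value od and that the second --output was rejected for that reason.
The options accepted by BuildForest are: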
Usage:
[--data <path> --dataset <dataset> --selection <m> --no-complete --minsplit
<minsplit> --minprop <minprop> --seed <seed> --partial --nbtrees <nbtrees>
--output <path> --help]
Options
  --data (-d) path             Data path
  --dataset (-ds) dataset      Dataset path
  --selection (-sl) m          Optional, Number of variables to select randomly
                               at each tree-node.
                               For classification problem, the default is
                               square root of the number of explanatory
                               variables.
                               For regression problem, the default is 1/3 of
                               the number of explanatory variables.
  --no-complete (-nc)          Optional, The tree is not complemented
  --minsplit (-ms) minsplit    Optional, The tree-node is not divided, if the
                               branching data size is smaller than this value.
                               The default is 2.
  --minprop (-mp) minprop      Optional, The tree-node is not divided, if the
                               proportion of the variance of branching data is
                               smaller than this value.
                               In the case of a regression problem, this value
                               is used. The default is 1/1000(0.001).
  --seed (-sd) seed            Optional, seed value used to initialise the
                               Random number generator
  --partial (-p)               Optional, use the Partial Data implementation
  --nbtrees (-t) nbtrees       Number of trees to grow
  --output (-o) path           Output path, will contain the Decision Forest
  --help (-h)                  Print out help
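As a rough illustration of the --selection default described in the help text, the sketch below works out the square-root rule and the one-third rule for this dataset and sets them next to the -sl 5 used in the build command. It assumes 42 explanatory variables (43 columns minus the label). The help text does not say how Mahout rounds the result, so the values are left unrounded.

```python
import math

n_columns = 43                 # columns in kddTrain.txt, per the -d descriptor
n_explanatory = n_columns - 1  # assumption: every column except the label counts

default_m_classification = math.sqrt(n_explanatory)  # "square root" rule
default_m_regression = n_explanatory / 3             # "1/3" rule
m_used = 5                                           # value passed via -sl

print(round(default_m_classification, 2))
print(default_m_regression)
print(m_used)
```

So -sl 5 is a little below what the square-root rule would give for this data.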
The log output once the forest has been built:
12/12/26 20:06:21 INFO mapreduce.BuildForest: Build Time: 0h 1m 32s 618
12/12/26 20:06:21 INFO mapreduce.BuildForest: Forest num Nodes: 47353
12/12/26 20:06:21 INFO mapreduce.BuildForest: Forest mean num Nodes: 473
12/12/26 20:06:21 INFO mapreduce.BuildForest: Forest mean max Depth: 12
12/12/26 20:06:21 INFO mapreduce.BuildForest: Storing the forest in: od/forest.seq
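The numbers in this log can be cross-checked with a little arithmetic. The sketch below reproduces the mean node count and estimates how many input splits -Dmapred.max.split.size=1874231 gives for the training file. It assumes that "19.1M" means roughly 19.1 * 1024 * 1024 bytes, since the exact file size is not stated here, so treat the split count as an estimate.

```python
import math

total_nodes = 47353  # "Forest num Nodes" from the log
n_trees = 100        # value passed via -t

# Integer division, which matches the truncated value shown in the log.
mean_nodes = total_nodes // n_trees
print(mean_nodes)

max_split_size = 1874231              # bytes, from -Dmapred.max.split.size
train_size = int(19.1 * 1024 * 1024)  # assumption: "19.1M" read as MiB
n_splits = math.ceil(train_size / max_split_size)
print(n_splits)
```

With -p, the number of splits is, as far as I understand the partial implementation, roughly the number of data partitions the trees are grown on, so changing the split size changes how much data each tree sees.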
(4) Test the forest:
Command:
fansy@fansyPC:~/hadoop-1.0.2$ bin/hadoop jar ../mahout-0.7-pure/examples/target/mahout-examples-0.7-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i input/kddTest.txt -ds out/forest/info/kdd1.info -m od/forest.seq -a -mr -o predictions
The test output is:
Summary
-------------------------------------------------------
Correctly Classified Instances : 16285 72.2365%
Incorrectly Classified Instances : 6259 27.7635%
Total Classified Instances : 22544
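The percentages in the summary follow directly from the counts. A minimal check, using only the numbers printed above:

```python
correct = 16285
incorrect = 6259
total = correct + incorrect

accuracy = f"{100 * correct / total:.4f}%"
error_rate = f"{100 * incorrect / total:.4f}%"

print(total)       # 22544
print(accuracy)    # 72.2365%
print(error_rate)  # 27.7635%
```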
TestForest usage:
Usage:
[--input <input> --dataset <dataset> --model <path> --output <output>
--analyze --mapreduce --help]
Options
  --input (-i) input         Path to job input directory.
  --dataset (-ds) dataset    Dataset path
  --model (-m) path          Path to the Decision Forest
  --output (-o) output       The directory pathname for output.
  --analyze (-a)
  --mapreduce (-mr)
  --help (-h)                Print out help
Share, grow, be happy.