Mahout Decision Trees: Partial Implementation in Practice


iteye_4515


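This post mainly follows https://cwiki.apache.org/confluence/display/MAHOUT/Partial+Implementation. When I followed the instructions there, however, I ran into a few errors. Below I briefly go through the problems and how I worked around them.

(Mahout version: 0.7)

1. Data:

The data can be downloaded from http://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data. The training data is 19.1M and the test data is 3.4M. The first three lines of the training data are: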

0,tcp,ftp_data,SF,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,150,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal,20
0,udp,other,SF,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,13,1,0.00,0.00,0.00,0.00,0.08,0.15,0.00,255,1,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00,normal,15
0,tcp,private,S0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,123,6,1.00,1.00,0.00,0.00,0.05,0.07,0.00,255,26,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00,neptune,19

2. Running the example:

(1) Upload the data: $hadoop_home/bin/hadoop fs -put kddTrain.txt input/kddTrain.txt ; $hadoop_home/bin/hadoop fs -put kddTest.txt input/kddTest.txt

(2) Generate the descriptor file for the raw data:

fansy@fansyPC:~/hadoop-1.0.2$ bin/hadoop jar ../mahout-0.7-pure/core/target/mahout-core-0.7-job.jar org.apache.mahout.classifier.df.tools.Describe -p input/kddTrain.txt -f out/forest/info/kdd1.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L N

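The command above differs from the one in the original article: the descriptor ends with an extra N after the L. The attribute descriptions can be found at http://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/kddcup.names, but the downloaded data has one more column at the end that I could not find in that description, so I added an extra Number (N) after the Label.
The command prints the following: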

12/12/26 20:02:27 INFO tools.Describe: Generating the descriptor...
12/12/26 20:02:28 INFO tools.Describe: generating the dataset...
12/12/26 20:02:31 INFO tools.Describe: storing the dataset description
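A quick way to sanity-check a descriptor before running Describe is to expand it and compare its length with the number of fields in a data row. The sketch below assumes the shorthand works the way the command above suggests (a number repeats the type token that follows it); the helper names expand_descriptor and matches_row are made up for this illustration and are not part of Mahout.

```python
def expand_descriptor(descriptor: str) -> list[str]:
    """Expand a descriptor such as "N 3 C L" into one type token per column.

    Assumption: a number applies only to the token that follows it.
    """
    expanded = []
    repeat = 1
    for token in descriptor.split():
        if token.isdigit():
            repeat = int(token)      # remember the count for the next token
        else:
            expanded.extend([token] * repeat)
            repeat = 1               # counts do not carry over
    return expanded


def matches_row(descriptor: str, row: str) -> bool:
    """True if the descriptor describes as many columns as the CSV row has."""
    return len(expand_descriptor(descriptor)) == len(row.split(","))


descriptor = "N 3 C 2 N C 4 N C 8 N 2 C 19 N L N"
columns = expand_descriptor(descriptor)

# First training row quoted earlier in this post.
row = ("0,tcp,ftp_data,SF,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,"
       "0.00,0.00,0.00,0.00,1.00,0.00,0.00,150,25,"
       "0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal,20")

print("columns described:", len(columns))
print("label position (0-based):", columns.index("L"))
print("fields in row:", len(row.split(",")))
print("descriptor matches row:", matches_row(descriptor, row))
```

If the two counts disagree, the descriptor is missing a column or has one too many, which is exactly the situation the trailing N fixes here.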

(3) Build the forest:

fansy@fansyPC:~/hadoop-1.0.2$ bin/hadoop jar ../mahout-0.7-pure/examples/target/mahout-examples-0.7-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -ood -d input/kddTrain.txt -ds out/forest/info/kdd1.info -sl 5 -p -t 100 --output out/forest1

Running it as given in the original article produces the following error:

12/12/26 20:04:34 ERROR mapreduce.BuildForest: Exception
org.apache.commons.cli2.OptionException: Unexpected out/forest1 while processing Options
	at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:99)
	at org.apache.mahout.classifier.df.mapreduce.BuildForest.run(BuildForest.java:139)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.mahout.classifier.df.mapreduce.BuildForest.main(BuildForest.java:253)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

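Based on the error above, I removed the -o (--output) option (removing -ood and keeping --output also works), and the job then ran. With --output removed, the forest is written to od/forest.seq. The od directory suggests that the parser treats -ood as -o with the value od, which would explain why a second output path is rejected.

The usage text for BuildForest is: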

Usage:                                                                          
 [--data <path> --dataset <dataset> --selection <m> --no-complete --minsplit    
<minsplit> --minprop <minprop> --seed <seed> --partial --nbtrees <nbtrees>      
--output <path> --help]                                                         
Options                                                                         
  --data (-d) path             Data path                                        
  --dataset (-ds) dataset      Dataset path                                     
  --selection (-sl) m          Optional, Number of variables to select randomly 
                               at each tree-node.                               
                               For classification problem, the default is       
                               square root of the number of explanatory         
                               variables.                                       
                               For regression problem, the default is 1/3 of    
                               the number of explanatory variables.             
  --no-complete (-nc)          Optional, The tree is not complemented           
  --minsplit (-ms) minsplit    Optional, The tree-node is not divided, if the   
                               branching data size is smaller than this value.  
                               The default is 2.                                
  --minprop (-mp) minprop      Optional, The tree-node is not divided, if the   
                               proportion of the variance of branching data is  
                               smaller than this value.                         
                               In the case of a regression problem, this value  
                               is used. The default is 1/1000(0.001).           
  --seed (-sd) seed            Optional, seed value used to initialise the      
                               Random number generator                          
  --partial (-p)               Optional, use the Partial Data implementation    
  --nbtrees (-t) nbtrees       Number of trees to grow                          
  --output (-o) path           Output path, will contain the Decision Forest    
  --help (-h)                  Print out help
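To put the -sl 5 setting used above in context, the defaults described in the usage text can be worked out by hand. The sketch below only applies the formulas quoted there (square root for classification, one third for regression). How Mahout rounds the result is not stated in the usage text, so the value is left unrounded, and default_selection is a hypothetical helper, not a Mahout function.

```python
import math


def default_selection(n_explanatory: int, regression: bool = False) -> float:
    """Default number of variables per node as described in the usage text.

    Returned unrounded, because the rounding rule is not documented there.
    """
    if regression:
        return n_explanatory / 3
    return math.sqrt(n_explanatory)


# The descriptor "N 3 C 2 N C 4 N C 8 N 2 C 19 N L N" describes 43 columns;
# one of them is the label, so 42 are treated as explanatory variables here.
n_explanatory = 43 - 1

print(f"classification default: {default_selection(n_explanatory):.2f}")
print(f"regression default:     {default_selection(n_explanatory, True):.2f}")
```

For this data set the classification default works out to between 6 and 7, so -sl 5 selects slightly fewer variables per node than the default would.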

Once the forest is built, the following is printed:

12/12/26 20:06:21 INFO mapreduce.BuildForest: Build Time: 0h 1m 32s 618
12/12/26 20:06:21 INFO mapreduce.BuildForest: Forest num Nodes: 47353
12/12/26 20:06:21 INFO mapreduce.BuildForest: Forest mean num Nodes: 473
12/12/26 20:06:21 INFO mapreduce.BuildForest: Forest mean max Depth: 12
12/12/26 20:06:21 INFO mapreduce.BuildForest: Storing the forest in: od/forest.seq
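The numbers in this log are consistent with each other: with -t 100, the mean node count is simply the total divided by the number of trees. The small check below assumes the log reports the integer part of that mean, which matches the values shown but is not confirmed by the source.

```python
total_nodes = 47353   # "Forest num Nodes" from the log above
nb_trees = 100        # value passed with -t

exact_mean = total_nodes / nb_trees      # 473.53
logged_mean = total_nodes // nb_trees    # integer part, as the log appears to show

print(f"exact mean nodes per tree: {exact_mean}")
print(f"integer part: {logged_mean}")
```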

(4) Test the forest:

Command:

fansy@fansyPC:~/hadoop-1.0.2$ bin/hadoop jar ../mahout-0.7-pure/examples/target/mahout-examples-0.7-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i input/kddTest.txt -ds out/forest/info/kdd1.info -m od/forest.seq -a -mr -o predictions

The test output is:

Summary
-------------------------------------------------------
Correctly Classified Instances          :      16285	   72.2365%
Incorrectly Classified Instances        :       6259	   27.7635%
Total Classified Instances              :      22544
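The percentages in the summary follow directly from the two counts. As a minimal check, the snippet below recomputes them; summarize is a helper written for this post, not part of TestForest.

```python
def summarize(correct: int, incorrect: int) -> tuple[int, float, float]:
    """Return (total, percent correct, percent incorrect)."""
    total = correct + incorrect
    return total, 100 * correct / total, 100 * incorrect / total


total, pct_correct, pct_incorrect = summarize(16285, 6259)

print(f"Total classified: {total}")
print(f"Correct:   {pct_correct:.4f}%")
print(f"Incorrect: {pct_incorrect:.4f}%")
```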

TestForest usage:

Usage:                                                                          
 [--input <input> --dataset <dataset> --model <path> --output <output>          
--analyze --mapreduce --help]                                                   
Options                                                                         
  --input (-i) input         Path to job input directory.                       
  --dataset (-ds) dataset    Dataset path                                       
  --model (-m) path          Path to the Decision Forest                        
  --output (-o) output       The directory pathname for output.                 
  --analyze (-a)                                                                
  --mapreduce (-mr)                                                             
  --help (-h)                Print out help
