Classifying with random forests

Introduction

This quick start page shows how to build a decision forest using the partial implementation, and how to use that forest to classify new data. The partial implementation is a MapReduce job in which each mapper builds a subset of the forest using only the data available in its partition. This makes it possible to build forests over large datasets, as long as each partition can be loaded in memory.

Steps

Download the data

  • The current implementation is compatible with the UCI repository file format. In this example we’ll use the NSL-KDD dataset because it’s large enough to show the performance of the partial implementation. You can download the dataset from http://nsl.cs.unb.ca/NSL-KDD/. You can either download the full training set “KDDTrain+.ARFF” or a 20% subset “KDDTrain+_20Percent.ARFF” (we’ll use the full dataset in this tutorial), plus the test set “KDDTest+.ARFF”.
  • Open the train and test files and remove all the lines that begin with ‘@’. All those lines are at the top of the files. Keep those lines somewhere, though, because they’ll help us describe the dataset to Mahout (see the sketch after this list).
  • Put the data in HDFS:

    $HADOOP_HOME/bin/hadoop fs -mkdir testdata
    $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
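For the header-stripping step above, a minimal sketch (assuming the downloaded files sit in the current local directory and GNU sed is available):

    # Save the '@' header lines; they describe the attributes and are useful
    # when writing the descriptor string in a later step.
    grep '^@' KDDTrain+.arff > KDDTrain+.header
    # Remove the header lines in place from both files.
    sed -i '/^@/d' KDDTrain+.arff
    sed -i '/^@/d' KDDTest+.arff

    # After the -put, verify the upload:
    $HADOOP_HOME/bin/hadoop fs -ls testdata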

Build the Job files

  • In $MAHOUT_HOME/ run:

    mvn clean install -DskipTests
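If the build succeeds, the job files used in the following steps should now exist; a quick sanity check (exact file names depend on the version you built):

    ls $MAHOUT_HOME/core/target/mahout-core-*-job.jar
    ls $MAHOUT_HOME/examples/target/mahout-examples-*-job.jar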

Generate a file descriptor for the dataset

Run the following command:

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<VERSION>-job.jar org.apache.mahout.classifier.df.tools.Describe -p testdata/KDDTrain+.arff -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L

The “N 3 C 2 N C 4 N C 8 N 2 C 19 N L” string describes all the attributes of the data. In this case, it means 1 numerical (N) attribute, followed by 3 categorical (C) attributes, then 2 numerical, and so on; L indicates the label. You can also use ‘I’ to ignore some attributes.
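As an illustration of ‘I’, a hypothetical variant of the Describe call that ignores the first attribute instead of treating it as numerical (everything else unchanged):

    $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<VERSION>-job.jar org.apache.mahout.classifier.df.tools.Describe -p testdata/KDDTrain+.arff -f testdata/KDDTrain+.info -d I 3 C 2 N C 4 N C 8 N 2 C 19 N L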

Run the example

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d testdata/KDDTrain+.arff -ds testdata/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest

which builds 100 trees (-t argument) using the partial implementation (-p). Each tree is built using 5 randomly selected attributes per node (-sl argument), and the example stores the resulting forest in the “nsl-forest” directory (-o). The number of partitions is controlled by the -Dmapred.max.split.size argument, which tells Hadoop the maximum size of each partition; here it is set to roughly 1/10 of the size of the dataset, so 10 partitions will be used. IMPORTANT: using fewer partitions should give better classification results but needs a lot of memory, so if the jobs are failing, try increasing the number of partitions.
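The value 1874231 above is just the training file’s size in bytes divided by 10. A sketch for deriving it yourself (assuming `hadoop fs -du -s`, whose output format varies slightly across Hadoop versions):

    # Size of the dataset in bytes, divided by the desired number of partitions
    SIZE=$($HADOOP_HOME/bin/hadoop fs -du -s testdata/KDDTrain+.arff | awk '{print $1}')
    echo $((SIZE / 10))   # use this as -Dmapred.max.split.size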

  • The example outputs the Build Time and the out-of-bag (oob) error estimate

    10/03/13 17:57:29 INFO mapreduce.BuildForest: Build Time: 0h 7m 43s 582
    10/03/13 17:57:33 INFO mapreduce.BuildForest: oob error estimate : 0.002325895231517865
    10/03/13 17:57:33 INFO mapreduce.BuildForest: Storing the forest in: nsl-forest/forest.seq

Using the Decision Forest to Classify new data

Run the following command:

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i testdata/KDDTest+.arff -ds testdata/KDDTrain+.info -m nsl-forest -a -mr -o predictions

This computes predictions for the “KDDTest+.arff” dataset (-i argument) using the data descriptor generated for the training dataset (-ds) and the decision forest built previously (-m). Optionally, if the test dataset contains the labels of the tuples, run the analyzer to compute the confusion matrix (-a); you can also store the predictions in a text file or a directory of text files (-o). Passing the -mr parameter makes Hadoop distribute the classification.
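If the test dataset had no labels, you would drop the analyzer; a sketch of the same call run without -a and without distributing the classification (-mr omitted):

    $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i testdata/KDDTest+.arff -ds testdata/KDDTrain+.info -m nsl-forest -o predictions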

  • The example should output the classification time and the confusion matrix

    10/03/13 18:08:56 INFO mapreduce.TestForest: Classification Time: 0h 0m 6s 355
    10/03/13 18:08:56 INFO mapreduce.TestForest:
    =======================================================
    Summary
    -------------------------------------------------------
    Correctly Classified Instances   : 17657     78.3224%
    Incorrectly Classified Instances :  4887     21.6776%
    Total Classified Instances       : 22544

    =======================================================
    Confusion Matrix
    -------------------------------------------------------
    a       b       <--Classified as
    9459    252     |  9711   a = normal
    4635    8198    | 12833   b = anomaly
    Default Category: unknown: 2

If the input is a single file, then the output will be a single text file; in the above example ‘predictions’ would be one single file. If the input is a directory containing, for example, two files ‘a.data’ and ‘b.data’, then the output will be a directory ‘predictions’ containing two files ‘a.data.out’ and ‘b.data.out’.
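To peek at the output (a minimal sketch for the single-file case above; run -ls first if you are unsure whether ‘predictions’ came out as a file or a directory):

    $HADOOP_HOME/bin/hadoop fs -ls predictions
    $HADOOP_HOME/bin/hadoop fs -cat predictions | head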

Known Issues and limitations

The “Decision Forest” code is still a work in progress and many features are still missing. Here is a list of some known issues:

  • For now, the training does not support multiple input files; the input dataset must be a single file (this support will be available in an upcoming release). Classifying new data does support multiple input files. A possible workaround is sketched after this list.
  • The tree building is done when each mapper.close() method is called. Because the mappers don’t refresh their state, the job can fail when the dataset is big and you try to build a large number of trees.
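For the single-file limitation, one possible workaround (a sketch, using hypothetical shard names, assuming the shards share the same attribute layout and have already had their ‘@’ header lines removed) is to concatenate them locally before uploading:

    # Merge hypothetical shards into one training file, then upload it
    cat shard-1.data shard-2.data > KDDTrain+.all.data
    $HADOOP_HOME/bin/hadoop fs -put KDDTrain+.all.data testdata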

Copyright © 2014-2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
