How Mahout builds its parallel random forest

jianjian1992

Category: hadoop. Tags: mapreduce, mahout, random forest, big data, parallel algorithms

Article link: https://blog.csdn.net/jianjian1992/article/details/48092307

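I was curious how Mahout builds a random forest in a distributed way, so I read the BuildForest source to see how its MapReduce job is implemented.

I was also curious about another question: how is the random forest saved?

I am reading Mahout 0.9.

First, where does the parallelism in a random forest come from? If I had to parallelize it myself, how would I do it?

A random forest is made up of many decision trees, and the trees do not affect one another while they are being built, so tree construction can be parallelized.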
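To make the independence argument concrete, here is a minimal sketch in plain Java, not Mahout code. The class ParallelTreesSketch and the method buildTree are hypothetical stand-ins: a "tree" is just a number derived from its own seed. The point is only that each task depends on nothing but its own input, so the thread count does not change the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelTreesSketch {

  // Hypothetical stand-in for training: the "tree" is just a number
  // derived from the tree's own seed, so it depends on nothing else.
  static long buildTree(long seed) {
    return new Random(seed).nextLong();
  }

  // Submit one task per tree; no task reads another task's result.
  static List<Long> buildForest(int nbTrees, long baseSeed, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Long>> futures = new ArrayList<>();
      for (int i = 0; i < nbTrees; i++) {
        final long seed = baseSeed + i;
        Callable<Long> task = () -> buildTree(seed);
        futures.add(pool.submit(task));
      }
      // Collect the results in submission order.
      List<Long> forest = new ArrayList<>();
      for (Future<Long> future : futures) {
        forest.add(future.get());
      }
      return forest;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException("Tree building failed", e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(buildForest(4, 42L, 2));
  }
}
```

Running buildForest with one thread or with four gives the same list, which is the property that lets the work be spread over mappers.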

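Next comes the data. I once used Weka on a dataset of a bit over 1 GB, trained on all of it, and got an out-of-memory error. How should that be handled with big data?

Large data is usually split into many blocks stored on different DataNodes. Should all of it be gathered for training, or is there another way?

Let's see how Mahout handles this.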

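The BuildForest class

Start with the BuildForest class. It is in the package org.apache.mahout.classifier.df.mapreduce,

under the folder mahout-distribution-0.9\examples\src\main\java\org\apache\mahout\classifier\df\mapreduce.

The class defines the following parameters, in order:

dataPath: path to the data
datasetPath: despite the name, this is the file that describes the dataset, i.e. the attributes of the data and which one is the label
outputPath: where the generated random forest is saved
m: number of attributes randomly selected at each tree node when growing a tree
complemented: whether the tree is "complemented"; I guess this means whether it is a complete tree?
minSplitNum: used by classification trees to decide whether a node should keep splitting; if a node holds fewer than minSplitNum instances it is not split and becomes a leaf
minVarianceProportion: same purpose, but used by regression trees
nbTrees: number of decision trees in the forest
seed: random seed
isPartial: I find this one interesting. It uses partial data, but how is the part chosen, and what happens to the rest?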

  private Path dataPath;
  
  private Path datasetPath;

  private Path outputPath;

  private Integer m; // Number of variables to select at each tree-node

  private boolean complemented; // tree is complemented
  
  private Integer minSplitNum; // minimum number for split

  private Double minVarianceProportion; // minimum proportion of the total variance for split

  private int nbTrees; // Number of trees to grow
  
  private Long seed; // Random seed
  
  private boolean isPartial; // use partial data implementation

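This class extends Configured and implements the Tool interface, so it has to implement the run method.

run first parses the input arguments. That part is not what I care about, so I will skip it.

After reading the arguments, it calls buildForest.

The buildForest method

It first checks whether the output directory exists. This comes up all the time when running Hadoop programs: if the output path already exists, the job fails.

    // make sure the output path does not exist
    FileSystem ofs = outputPath.getFileSystem(getConf());
    if (ofs.exists(outputPath)) {
      log.error("Output path already exists");
      return;
    }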

Then it creates the decision tree builder and sets its parameters from the input.

    DecisionTreeBuilder treeBuilder = new DecisionTreeBuilder();
    if (m != null) {
      treeBuilder.setM(m);
    }
    treeBuilder.setComplemented(complemented);
    if (minSplitNum != null) {
      treeBuilder.setMinSplitNum(minSplitNum);
    }
    if (minVarianceProportion != null) {
      treeBuilder.setMinVarianceProportion(minVarianceProportion);
    }

Next it creates the forest builder, again configured from the input.

    Builder forestBuilder;

    if (isPartial) {
      log.info("Partial Mapred implementation");
      forestBuilder = new PartialBuilder(treeBuilder, dataPath, datasetPath, seed, getConf());
    } else {
      log.info("InMem Mapred implementation");
      forestBuilder = new InMemBuilder(treeBuilder, dataPath, datasetPath, seed, getConf());
    }

    forestBuilder.setOutputDirName(outputPath.getName());

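Finally it builds the forest and saves it. This also answers my question about storage: the forest is written with DFUtils.storeWritable to a file named forest.seq under the output path.

    DecisionForest forest = forestBuilder.build(nbTrees);

    // store the decision forest in the output path
    Path forestPath = new Path(outputPath, "forest.seq");
    log.info("Storing the forest in: {}", forestPath);
    DFUtils.storeWritable(getConf(), forestPath, forest);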

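forestBuilder

There are two ways to build the forest, depending on the data size, which is why there are the InMem and partial implementations. In short:

If the data is not large, load all of it into memory and train every decision tree on the full data.
If the data is very large, each Mapper trains only on the part of the data in its own InputSplit and ignores the rest.

Let's look at the Builder base class first.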

Builder

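The functions in Builder worth looking at:

setTreeBuilder and getTreeBuilder

Configuration stores a value with conf.set("name", value)

and reads it back with conf.get("name"), much like a session in Java web development.

StringUtils is really neat: it can store a TreeBuilder as a string, and getting it back only takes a call to fromString.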

  public static TreeBuilder getTreeBuilder(Configuration conf) {
    String string = conf.get("mahout.rf.treebuilder");
    if (string == null) {
      return null;
    }
    
    return StringUtils.fromString(string);
  }
  
  private static void setTreeBuilder(Configuration conf, TreeBuilder treeBuilder) {
    conf.set("mahout.rf.treebuilder", StringUtils.toString(treeBuilder));
  }
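The post does not show what StringUtils does internally, so the following is only a sketch of the general idea of carrying an object through a string-valued configuration. It swaps in standard Java serialization plus Base64, which is an assumption and not necessarily Mahout's mechanism. ToyTreeBuilder, encode and decode are hypothetical names, and a HashMap plays the role of Configuration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

class BuilderAsStringSketch {

  // Hypothetical stand-in for a tree builder with a couple of settings.
  static class ToyTreeBuilder implements Serializable {
    private static final long serialVersionUID = 1L;
    int m = 3;
    boolean complemented = true;
  }

  // Turn a Serializable object into plain text (Base64 of its serialized bytes).
  static String encode(Serializable obj) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(obj);
      }
      return Base64.getEncoder().encodeToString(bytes.toByteArray());
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }

  // Rebuild the object from the text produced by encode.
  static Object decode(String text) {
    byte[] raw = Base64.getDecoder().decode(text);
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
      return in.readObject();
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>(); // plays the role of Configuration
    ToyTreeBuilder builder = new ToyTreeBuilder();
    builder.m = 5;
    conf.put("mahout.rf.treebuilder", encode(builder));

    ToyTreeBuilder restored = (ToyTreeBuilder) decode(conf.get("mahout.rf.treebuilder"));
    System.out.println("m = " + restored.m + ", complemented = " + restored.complemented);
  }
}
```

The useful property is that the driver can configure the builder once and every mapper can rebuild an identical copy from the job configuration.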

getOutputPath

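Getting the file system

FileSystem.get(conf) returns fs, which here is effectively HDFS, so HDFS operations are available, for example fs.mkdirs(new Path("dir")) to create a directory.

fs.getWorkingDirectory() returns a Path. I usually write new Path(String), but here a Path and a String are joined: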

new Path(fs.getWorkingDirectory(), outputDirName) = WORKING_DIRECTORY/OUTPUT_DIR_NAME

  protected Path getOutputPath(Configuration conf) throws IOException {
    // the output directory is accessed only by this class, so use the default
    // file system
    FileSystem fs = FileSystem.get(conf);
    return new Path(fs.getWorkingDirectory(), outputDirName);
  }

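build: the key part

public DecisionForest build(int nbTrees) is the method that builds the trees, so this is the heart of it.

It first checks whether the output path already exists.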

    Path outputPath = getOutputPath(conf);
    FileSystem fs = outputPath.getFileSystem(conf);
    
    // check the output
    if (fs.exists(outputPath)) {
      throw new IOException("Output path already exists : " + outputPath);
    }

Then it sets the parameters: the random seed, the number of trees, and the tree builder.

    if (seed != null) {
      setRandomSeed(conf, seed);
    }
    setNbTrees(conf, nbTrees);
    setTreeBuilder(conf, treeBuilder);

Next it adds the dataset description file to the distributed cache by its URI.

    // put the dataset into the DistributedCache
    DistributedCache.addCacheFile(datasetPath.toUri(), conf);

Then it creates the job and runs it, which submits the job to the JobTracker.

configureJob(job) is an abstract method; the subclasses do the actual configuration.

    Job job = new Job(conf, "decision forest builder");

    log.debug("Configuring the job...");
    configureJob(job);

    log.debug("Running the job...");
    if (!runJob(job)) {
      log.error("Job failed!");
      return null;
    }

runJob is just job.waitForCompletion(true), the same as in any ordinary job.

  protected boolean runJob(Job job) throws ClassNotFoundException, IOException, InterruptedException {
    return job.waitForCompletion(true);
  }

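Finally the output files are converted into a forest, which is returned. parseOutput is also an abstract method implemented by the subclasses.

    HadoopUtil.delete(conf, outputPath);

This line deletes the output files. I tested it: placed after job.waitForCompletion(true), it removed the output files that the job had produced.

Since the output has already been converted into a forest, deleting it is fine.

    if (isOutput(conf)) {
      log.debug("Parsing the output...");
      DecisionForest forest = parseOutput(job);
      HadoopUtil.delete(conf, outputPath);
      return forest;
    }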

DistributedCache

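Distributed cache: putting a file in the distributed cache means the file lives on HDFS and is then copied from HDFS to the local disk of each node that runs a task needing it.

Builder provides a helper that returns the i-th file in the distributed cache.

As shown below, HadoopUtil.getCachedFiles(conf) returns the paths of all cached files.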

  /**
   * Helper method. Get a path from the DistributedCache
   * 
   * @param conf
   *          configuration
   * @param index
   *          index of the path in the DistributedCache files
   * @return path from the DistributedCache
   * @throws IOException
   *           if no path is found
   */
  public static Path getDistributedCacheFile(Configuration conf, int index) throws IOException {
    Path[] files = HadoopUtil.getCachedFiles(conf);
    
    if (files.length <= index) {
      throw new IOException("path not found in the DistributedCache");
    }
    
    return files[index];
  }
  
  /**
   * Helper method. Load a Dataset stored in the DistributedCache
   * 
   * @param conf
   *          configuration
   * @return loaded Dataset
   * @throws IOException
   *           if we cannot retrieve the Dataset path from the DistributedCache, or the Dataset could not be
   *           loaded
   */
  public static Dataset loadDataset(Configuration conf) throws IOException {
    Path datasetPath = getDistributedCacheFile(conf, 0);
    
    return Dataset.load(conf, datasetPath);
  }

In general you add the cache files in main and then read them in the Mapper's setup or cleanup method.

I ran a small test, adding the following code to main:

String inputStr1 = "hdfs://127.0.0.1:9000/user/HTTP.dat";
String inputStr2 = "hdfs://127.0.0.1:9000/user/HTTP2.dat";
String inputStr3 = "hdfs://127.0.0.1:9000/user/HTTP3.dat";
String inputStr4 = "hdfs://127.0.0.1:9000/user/HTTP4.dat";
        
Configuration conf = new Configuration();
System.out.println("URI is " + (new Path(inputStr1)).toUri().toString());
DistributedCache.addCacheFile((new Path(inputStr2)).toUri(), conf);
DistributedCache.addCacheFile((new Path(inputStr3)).toUri(), conf);
DistributedCache.addCacheFile((new Path(inputStr4)).toUri(), conf);
DistributedCache.addCacheFile((new Path(inputStr1)).toUri(), conf);

Then the Mapper's setup looks like the code below. Here I

use context.getConfiguration() to get conf,

use DistributedCache.getLocalCacheFiles(conf) to get the file paths,

use DistributedCache.getCacheFiles(conf) to get the file URIs,

which is a little different from the helper above.

  @Override
  protected void setup(Mapper<LongWritable, Text, Text, MyData>.Context context)
      throws IOException, InterruptedException {
    super.setup(context);
    Configuration conf = context.getConfiguration();
    // paths of the cached files as seen by this task
    Path[] paths = DistributedCache.getLocalCacheFiles(conf);
    for (Path path : paths) {
      System.out.println("path is : " + path.toString());
    }
    // URIs of the cached files as they were registered in main
    URI[] uris = DistributedCache.getCacheFiles(conf);
    for (URI uri : uris) {
      System.out.println("uri is : " + uri.toString());
    }
  }

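The result of the run is as follows.

The URI printed in main is still

hdfs://127.0.0.1:9000/user/HTTP.dat

so nothing changes there.

In setup, though, things look quite different.

The URIs are the HDFS locations, but the paths are locations in the local file system.

That is because these files are copied from HDFS to the node that runs the Mapper task.

So where exactly are these /tmp/... files?

As the output below shows, they are in the local file system of the machine.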

path is : /tmp/hadoop-user/mapred/local/archive/5212817132532961590_760395248_212172121/127.0.0.1/user/HTTP2.dat
path is : /tmp/hadoop-user/mapred/local/archive/-73117579183240740_254836623_1592684623/127.0.0.1/user/HTTP3.dat
path is : /tmp/hadoop-user/mapred/local/archive/-630335170877855814_-250722002_212204501/127.0.0.1/user/HTTP4.dat
path is : /tmp/hadoop-user/mapred/local/archive/-1453350350290660245_1383593160_212210867/127.0.0.1/user/HTTP.dat
uri is : hdfs://127.0.0.1:9000/user/HTTP2.dat
uri is : hdfs://127.0.0.1:9000/user/HTTP3.dat
uri is : hdfs://127.0.0.1:9000/user/HTTP4.dat
uri is : hdfs://127.0.0.1:9000/user/HTTP.dat

Next, the focus is on the abstract methods implemented by the two subclasses.

PartialBuilder

Partial building? How does that work?

The class is defined in the package org.apache.mahout.classifier.df.mapreduce.partial.

Its description is below: each mapper builds trees using only the data in its own InputSplit.

/**
 * Builds a random forest using partial data. Each mapper uses only the data given by its InputSplit
 */

The job has already been created by Builder.build; PartialBuilder then configures it as follows.

The configuration is basically the same as for any ordinary MapReduce job.

  @Override
  protected void configureJob(Job job) throws IOException {
    Configuration conf = job.getConfiguration();
    
    job.setJarByClass(PartialBuilder.class);
    
    FileInputFormat.setInputPaths(job, getDataPath());
    FileOutputFormat.setOutputPath(job, getOutputPath(conf));
    
    job.setOutputKeyClass(TreeID.class);
    job.setOutputValueClass(MapredOutput.class);
    
    job.setMapperClass(Step1Mapper.class);
    job.setNumReduceTasks(0); // no reducers
    
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
  }

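The output is handled as follows: get the total number of trees, create the arrays (TreeID keys and Node trees) that will hold the forest, then list all output files in the output directory,

read the trees from them one by one into keys and trees, and finally build a DecisionForest from the trees.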

  @Override
  protected DecisionForest parseOutput(Job job) throws IOException {
    Configuration conf = job.getConfiguration();
    
    int numTrees = Builder.getNbTrees(conf);
    
    Path outputPath = getOutputPath(conf);
    
    TreeID[] keys = new TreeID[numTrees];
    Node[] trees = new Node[numTrees];
        
    processOutput(job, outputPath, keys, trees);
    
    return new DecisionForest(Arrays.asList(trees));
  }

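In MapReduce, if K reducer tasks are configured, each reducer writes its own file in the output directory, so all output files have to be processed.

There is no reduce phase here, so each mapper task writes its own output file.

Trees are written as key/value pairs, so each record is a Pair<TreeID, MapredOutput>. The output files are SequenceFiles, so they are read with SequenceFileIterable.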

  protected static void processOutput(JobContext job,
                                      Path outputPath,
                                      TreeID[] keys,
                                      Node[] trees) throws IOException {
    Preconditions.checkArgument(keys == null && trees == null || keys != null && trees != null,
        "if keys is null, trees should also be null");
    Preconditions.checkArgument(keys == null || keys.length == trees.length, "keys.length != trees.length");

    Configuration conf = job.getConfiguration();

    FileSystem fs = outputPath.getFileSystem(conf);

    Path[] outfiles = DFUtils.listOutputFiles(fs, outputPath);

    // read all the outputs
    int index = 0;
    for (Path path : outfiles) {
      for (Pair<TreeID,MapredOutput> record : new SequenceFileIterable<TreeID, MapredOutput>(path, conf)) {
        TreeID key = record.getFirst();
        MapredOutput value = record.getSecond();
        if (keys != null) {
          keys[index] = key;
        }
        if (trees != null) {
          trees[index] = value.getTree();
        }
        index++;
      }
    }

    // make sure we got all the keys/values
    if (keys != null && index != keys.length) {
      throw new IllegalStateException("Some key/values are missing from the output");
    }
  }
}
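The collection logic of processOutput can be sketched without Hadoop. In the sketch below, which is an illustration and not Mahout code, each inner list plays the role of one mapper's output file and each string plays the role of one tree; CollectTreesSketch and collect are hypothetical names. Unlike the original, it also checks explicitly for more trees than expected instead of relying on an array index error.

```java
import java.util.Arrays;
import java.util.List;

class CollectTreesSketch {

  // Walk every "file", copy every "tree" into one array, and verify
  // that exactly numTrees trees were found.
  static String[] collect(List<List<String>> outputFiles, int numTrees) {
    String[] trees = new String[numTrees];
    int index = 0;
    for (List<String> file : outputFiles) {
      for (String tree : file) {
        if (index >= numTrees) {
          throw new IllegalStateException("More trees than expected in the output");
        }
        trees[index++] = tree;
      }
    }
    // make sure we got all the trees
    if (index != numTrees) {
      throw new IllegalStateException("Some trees are missing from the output");
    }
    return trees;
  }

  public static void main(String[] args) {
    List<List<String>> files = Arrays.asList(
        Arrays.asList("tree-a", "tree-b"), // output of mapper 0
        Arrays.asList("tree-c"));          // output of mapper 1
    System.out.println(Arrays.toString(collect(files, 3)));
  }
}
```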

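How does its mapper work?

Step1Mapper

The questions here are:

How is the sample set obtained?

How many decision trees should be built?

How is training done?

Building the sample set

The Mapper keeps the sample set in the following instances field.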

private final List<Instance> instances = Lists.newArrayList();

The map function reads the data and adds it to instances, one training instance per input line.

  @Override
  protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    instances.add(converter.convert(value.toString()));
  }

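Training

Once the data is collected, training happens in cleanup.

Bagging splits the sample set in two for each tree: the instances drawn by bootstrap sampling (with replacement) form the training set, and the instances never drawn (out-of-bag) can serve as a test set.

The loop then runs nbTrees times over this data, building one decision tree per iteration and writing it to the output.

The key of a tree holds the mapper's partition number (the InputSplit index) and the tree's ID.

The value is a MapredOutput that wraps the tree's root Node.

How a single decision tree is grown is outside the scope of this post.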

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // prepare the data
    log.debug("partition: {} numInstances: {}", partition, instances.size());
    
    Data data = new Data(getDataset(), instances);
    Bagging bagging = new Bagging(getTreeBuilder(), data);
    
    TreeID key = new TreeID();
    
    log.debug("Building {} trees", nbTrees);
    for (int treeId = 0; treeId < nbTrees; treeId++) {
      log.debug("Building tree number : {}", treeId);
      
      Node tree = bagging.build(rng);
      
      key.set(partition, firstTreeId + treeId);
      
      if (isOutput()) {
        MapredOutput emOut = new MapredOutput(tree);
        context.write(key, emOut);
      }
    }
  }
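Bootstrap sampling itself is easy to sketch in plain Java. The code below is a generic illustration under my own assumptions, not Mahout's Bagging class; BootstrapSketch and bootstrap are hypothetical names. It draws n indices with replacement and records which indices were drawn at least once, so the rest are the out-of-bag instances.

```java
import java.util.Arrays;
import java.util.Random;

class BootstrapSketch {

  // Draw n indices with replacement from 0..n-1.
  // sampled[i] is set to true for every index that was drawn at least once.
  static int[] bootstrap(int n, Random rng, boolean[] sampled) {
    int[] bag = new int[n];
    for (int i = 0; i < n; i++) {
      int pick = rng.nextInt(n);
      bag[i] = pick;
      sampled[pick] = true;
    }
    return bag;
  }

  public static void main(String[] args) {
    int n = 10;
    boolean[] sampled = new boolean[n];
    int[] bag = bootstrap(n, new Random(1L), sampled);
    System.out.println("bag (training indices): " + Arrays.toString(bag));

    // Indices that were never drawn are the out-of-bag instances.
    StringBuilder oob = new StringBuilder();
    for (int i = 0; i < n; i++) {
      if (!sampled[i]) {
        oob.append(i).append(' ');
      }
    }
    System.out.println("out-of-bag indices: " + oob.toString().trim());
  }
}
```

Because each tree draws its own bag, the trees built by one mapper differ even though they all see the same InputSplit.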

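Number of trees: nbTrees

How many trees should one mapper build?

See the function below. In general, with K mapper tasks the trees are simply divided evenly among them,

but numTrees/numMaps may leave a remainder. Who gets the extra trees?

Here they go to the first Mapper (partition 0), because the first partition is assumed to hold more data than the others. Is that really so?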

  /**
   * Compute the number of trees for a given partition. The first partition (0) may be longer than the rest of
   * partition because of the remainder.
   * 
   * @param numMaps
   *          total number of maps (partitions)
   * @param numTrees
   *          total number of trees to build
   * @param partition
   *          partition to compute the number of trees for
   */
  public static int nbTrees(int numMaps, int numTrees, int partition) {
    int nbTrees = numTrees / numMaps;
    if (partition == 0) {
      nbTrees += numTrees - nbTrees * numMaps;
    }
    
    return nbTrees;
  }
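A quick worked example makes the remainder rule visible. The sketch copies the nbTrees function shown above into a small class (TreeShareSketch is a hypothetical name) and evaluates it for 10 trees over 3 mappers: 10 / 3 = 3 trees each, and the remainder 10 - 3 * 3 = 1 goes to partition 0.

```java
class TreeShareSketch {

  // Same arithmetic as the nbTrees function shown in the post.
  static int nbTrees(int numMaps, int numTrees, int partition) {
    int nbTrees = numTrees / numMaps;
    if (partition == 0) {
      // partition 0 also takes whatever the integer division left over
      nbTrees += numTrees - nbTrees * numMaps;
    }
    return nbTrees;
  }

  public static void main(String[] args) {
    int numMaps = 3;
    int numTrees = 10;
    int total = 0;
    for (int partition = 0; partition < numMaps; partition++) {
      int share = nbTrees(numMaps, numTrees, partition);
      total += share;
      System.out.println("partition " + partition + " builds " + share + " trees");
    }
    System.out.println("total = " + total);
  }
}
```

Partition 0 builds 4 trees and partitions 1 and 2 build 3 each, so the shares always add up to numTrees.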

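It turns out Builder has the following function, which sorts all the InputSplits. Arrays.sort orders elements ascending according to the comparator, and in compare a shorter split is treated as the greater one, so the result is that

splits with more data come first.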

  /**
   * sort the splits into order based on size, so that the biggest go first.<br>
   * This is the same code used by Hadoop's JobClient.
   * 
   * @param splits
   *          input splits
   */
  public static void sortSplits(InputSplit[] splits) {
    Arrays.sort(splits, new Comparator<InputSplit>() {
      @Override
      public int compare(InputSplit a, InputSplit b) {
        try {
          long left = a.getLength();
          long right = b.getLength();
          if (left == right) {
            return 0;
          } else if (left < right) {
            return 1;
          } else {
            return -1;
          }
        } catch (IOException ie) {
          throw new IllegalStateException("Problem getting input split size", ie);
        } catch (InterruptedException ie) {
          throw new IllegalStateException("Problem getting input split size", ie);
        }
      }
    });
  }
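The effect of that comparator can be checked on plain numbers. The sketch below applies the same comparison to an array of lengths instead of InputSplit objects; SortSplitsSketch and sortBiggestFirst are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

class SortSplitsSketch {

  // Sort lengths with the same rule as sortSplits: a smaller length
  // compares as "greater", so it ends up later in the array.
  static Long[] sortBiggestFirst(long[] lengths) {
    Long[] boxed = new Long[lengths.length];
    for (int i = 0; i < lengths.length; i++) {
      boxed[i] = lengths[i];
    }
    Arrays.sort(boxed, new Comparator<Long>() {
      @Override
      public int compare(Long a, Long b) {
        long left = a;
        long right = b;
        if (left == right) {
          return 0;
        } else if (left < right) {
          return 1;
        } else {
          return -1;
        }
      }
    });
    return boxed;
  }

  public static void main(String[] args) {
    long[] lengths = {64L, 128L, 17L};
    System.out.println(Arrays.toString(sortBiggestFirst(lengths)));
  }
}
```

For the lengths 64, 128 and 17 this prints [128, 64, 17], so index 0 is the largest split, which is why giving the extra trees to partition 0 is reasonable.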

InMemBuilder

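The mapper for this one is a bit odd.

First, a few questions. What about the dataset?

How many trees does each mapper train?

For the training set, configureJob simply adds the data to the DistributedCache. Since it has to fit in memory, it cannot be very large.

And how many trees?

Look at its map function.

It is pretty wild: every single map call builds one tree. I was speechless.

  @Override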
  protected void map(IntWritable key,
                     NullWritable value,
                     Context context) throws IOException, InterruptedException {
    map(key, context);
  }
  
  void map(IntWritable key, Context context) throws IOException, InterruptedException {
    
    initRandom((InMemInputSplit) context.getInputSplit());
    
    log.debug("Building...");
    Node tree = bagging.build(rng);
    
    if (isOutput()) {
      log.debug("Outputing...");
      MapredOutput mrOut = new MapredOutput(tree);
      
      context.write(key, mrOut);
    }
  }
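The post does not show the input format behind this mapper, so the following is only a guess at the shape of the idea: if the input consists of tree ids rather than data rows, then "one map call per record" naturally means "one tree per map call". Everything in the sketch (OneTreePerRecordSketch, splitTreeIds, the way the remainder is assigned) is hypothetical and not taken from Mahout.

```java
import java.util.ArrayList;
import java.util.List;

class OneTreePerRecordSketch {

  // Hypothetical split: {firstId, count} ranges covering tree ids 0..nbTrees-1.
  // The last range absorbs the remainder (an arbitrary choice for this sketch).
  static List<int[]> splitTreeIds(int nbTrees, int numSplits) {
    List<int[]> splits = new ArrayList<>();
    int base = nbTrees / numSplits;
    int firstId = 0;
    for (int s = 0; s < numSplits; s++) {
      int count = (s == numSplits - 1) ? nbTrees - firstId : base;
      splits.add(new int[] {firstId, count});
      firstId += count;
    }
    return splits;
  }

  // Stand-in for map(): one call handles exactly one tree id.
  static String map(int treeId) {
    return "tree-" + treeId;
  }

  // Each split feeds its tree ids to map() one record at a time.
  static List<String> runAll(int nbTrees, int numSplits) {
    List<String> output = new ArrayList<>();
    for (int[] split : splitTreeIds(nbTrees, numSplits)) {
      for (int i = 0; i < split[1]; i++) {
        output.add(map(split[0] + i));
      }
    }
    return output;
  }

  public static void main(String[] args) {
    System.out.println(runAll(5, 2));
  }
}
```

Seen this way, the records only say which tree to build, and the training data comes from the copy every mapper loads into memory.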
