A walkthrough of implementing a custom Giraph benchmark (KMeansBenchmark)

The KMeans example used here comes from https://github.com/tmalaska/Giraph.KMeans.Example/tree/master/src; part of its source code needs to be modified.


1) The CommandLine class (commons-cli-1.2.jar) keeps two lists, one for recognized options and one for unrecognized arguments, distinguished by the leading "-".

   2) In giraph-env, replace GiraphRunner with your own benchmark class, which extends GiraphBenchmark (and thereby implements Tool).
   Every custom option (such as -vif) must be added to getBenchmarkOptions() before parsing, together with all other option names that will be used (such as -w and -vof); otherwise, when the arguments are parsed later with
   CommandLineParser parser = new PosixParser();
   CommandLine cmd = parser.parse(options, args);
   they are treated as undefined options and an error is thrown (see the small sketch below).
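
To make that parsing behavior concrete, here is a small standalone sketch using commons-cli 1.2 directly (the CliDemo class name and the option set are made up for illustration): an option that was not registered in Options makes PosixParser reject the arguments, while registering it first makes the parse succeed.

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.commons.cli.PosixParser;

public class CliDemo {
  public static void main(String[] args) throws ParseException {
    String[] argv = {"-clu", "5", "-w", "1"};

    Options options = new Options();
    options.addOption("w", "workers", true, "Number of workers");
    // "-clu" is intentionally NOT registered here.

    try {
      new PosixParser().parse(options, argv);
    } catch (ParseException e) {
      // UnrecognizedOptionException: Unrecognized option: -clu
      System.out.println("Parse failed: " + e.getMessage());
    }

    // Registering the custom option (which is what getBenchmarkOptions() does
    // for the benchmark) makes the same parse succeed.
    options.addOption("clu", "clusterNumber", true, "number of clusters");
    CommandLine cmd = new PosixParser().parse(options, argv);
    System.out.println("clu = " + cmd.getOptionValue("clu"));
  }
}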


  Run command:

  giraph ../giraph-core-1.1.0.jar  org.apache.giraph.benchmark.kmeans.KMeansComputation -clu 5 -dim 2 -mIr 5  -vip /test/youTube.txt -op /output  -w 1

  Note that ../giraph-core-1.1.0.jar must not be omitted; otherwise the run reports:

  17/01/20 17:17:42 WARN fs.FSInputChecker: Problem opening checksum file: file:/opt/hadoop-1.2.1/logs/history/job_201611032251_0013_1478242352396_liuqiang2_Giraph%3A+org.apache.giraph.benchmark.PageRankComput.  Ignoring exception: java.io.EOFException

  The exact cause is unknown; my guess is that the executed jar is not shipped to the cluster, so the checksum of the local file differs from the copy in the distributed file system and the check fails.
  
  giraph ../giraph-core-1.1.0.jar  org.apache.giraph.benchmark.kmeans.KMeansComputation -clu 5 -dim 2 -mIr 5  -vip /test/dataset.txt -op /output  -w 1 -v true

  Note: the -v option stands for verbose, i.e. whether to print detailed progress output.

   In addition, the KMeansBenchmark class is defined as follows:

package org.apache.giraph.benchmark.kmeans;

import java.io.IOException;
import java.util.Set;

import org.apache.commons.cli.CommandLine;
import org.apache.giraph.benchmark.BenchmarkOption;
import org.apache.giraph.benchmark.GiraphBenchmark;
import org.apache.giraph.conf.GiraphConfiguration;
import org.apache.giraph.io.formats.GiraphFileInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.ToolRunner;

import com.google.common.collect.Sets;

public class KMeansBenchmark extends GiraphBenchmark {

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new KMeansBenchmark(), args));
  }

  /**
   * Declare every option this benchmark accepts (in addition to -v, -h and -w,
   * which GiraphBenchmark adds automatically).
   */
  @Override
  public Set<BenchmarkOption> getBenchmarkOptions() {
    return Sets.newHashSet(
        new BenchmarkOption("vip", "vertexInputPath", true, "Vertex input path"),
        new BenchmarkOption("op", "outputPath", true, "Output path"),
        new BenchmarkOption("clu", "clusterNumber", true, "number of clusters"),
        new BenchmarkOption("dim", "dimension", true, "number of dimensions"),
        new BenchmarkOption("mIr", "maxIterations", true, "max iterations"));
  }

  @Override
  protected void prepareConfiguration(GiraphConfiguration conf,
      CommandLine cmd) {
    try {
      conf.setComputationClass(KMeansComputation.class);
      conf.setVertexInputFormatClass(KMeansTextVertexInputFormat.class);
      conf.setVertexOutputFormatClass(KMeansTextVertexOutputFormat.class);
      conf.setWorkerContextClass(KMeansNodeWorkerContext.class);
      conf.setMasterComputeClass(KMeansVertixMasterCompute.class);
      // Mirrors what GiraphBenchmark.run() already sets from the -w option.
      int workers = Integer.parseInt(BenchmarkOption.WORKERS.getOptionValue(cmd));
      conf.setWorkerConfiguration(workers, workers, 100.0f);
      conf.set(Const.NUMBER_OF_CLUSTERS, cmd.getOptionValue("clu"));
      conf.set(Const.NUMBER_OF_DIMENSIONS, cmd.getOptionValue("dim"));
      conf.set(Const.MAX_ITERATIONS, cmd.getOptionValue("mIr"));

      if (cmd.hasOption("vip")) {
        if (FileSystem.get(new Configuration()).listStatus(
            new Path(cmd.getOptionValue("vip"))) == null) {
          throw new IllegalArgumentException(
              "Invalid vertex input path (-vip): "
                  + cmd.getOptionValue("vip"));
        }
        GiraphFileInputFormat.addVertexInputPath(conf,
            new Path(cmd.getOptionValue("vip")));
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
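
The Const class referenced above is not shown here; it only holds the configuration key names read by the computation classes. A minimal placeholder version might look like the following (the key strings are illustrative guesses, not necessarily the values used in the original repository):

package org.apache.giraph.benchmark.kmeans;

/**
 * Placeholder for the configuration keys used by the k-means classes.
 * The constant values here are illustrative only; use whatever key names
 * the computation classes actually read.
 */
public final class Const {
  public static final String NUMBER_OF_CLUSTERS = "kmeans.cluster.number";
  public static final String NUMBER_OF_DIMENSIONS = "kmeans.dimension.number";
  public static final String MAX_ITERATIONS = "kmeans.max.iterations";

  private Const() { }
}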


The GiraphBenchmark base class also needs a small modification: the prepareHadoopMRJob step below, which applies the -op output path to the underlying Hadoop job.

/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.giraph.benchmark;

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.PosixParser;
import org.apache.giraph.conf.GiraphConfiguration;
import org.apache.giraph.job.GiraphJob;
import org.apache.giraph.utils.LogVersions;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.log4j.Logger;

import java.io.IOException;
import java.util.Set;

/**
 * Abstract class which benchmarks should extend.
 */
public abstract class GiraphBenchmark implements Tool {
  /** Class logger */
  public static final Logger LOG = Logger.getLogger(GiraphBenchmark.class);
  /** Configuration */
  private Configuration conf;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public int run(String[] args) throws Exception {
    Set<BenchmarkOption> giraphOptions = getBenchmarkOptions();
    giraphOptions.add(BenchmarkOption.HELP);
    giraphOptions.add(BenchmarkOption.VERBOSE);
    giraphOptions.add(BenchmarkOption.WORKERS);
    Options options = new Options();
    for (BenchmarkOption giraphOption : giraphOptions) {
      giraphOption.addToOptions(options);
    }

    HelpFormatter formatter = new HelpFormatter();
    if (args.length == 0) {
      formatter.printHelp(getClass().getName(), options, true);
      return 0;
    }
    CommandLineParser parser = new PosixParser();
    CommandLine cmd = parser.parse(options, args);
    for (BenchmarkOption giraphOption : giraphOptions) {
      if (!giraphOption.checkOption(cmd, LOG)) {
        return -1;
      }
    }
    if (BenchmarkOption.HELP.optionTurnedOn(cmd)) {
      formatter.printHelp(getClass().getName(), options, true);
      return 0;
    }

    GiraphJob job = new GiraphJob(getConf(), getClass().getName());
    int workers = Integer.parseInt(BenchmarkOption.WORKERS.getOptionValue(cmd));

    prepareHadoopMRJob(job, cmd);
    GiraphConfiguration giraphConf = job.getConfiguration();
    giraphConf.addWorkerObserverClass(LogVersions.class);
    giraphConf.addMasterObserverClass(LogVersions.class);

    giraphConf.setWorkerConfiguration(workers, workers, 100.0f);
    prepareConfiguration(giraphConf, cmd);

    boolean isVerbose = false;
    if (BenchmarkOption.VERBOSE.optionTurnedOn(cmd)) {
      isVerbose = true;
    }
    if (job.run(isVerbose)) {
      return 0;
    } else {
      return -1;
    }
  }

  /**
   * Configure the underlying Hadoop MR job before it is submitted.
   * Currently this only sets the output path from the -op option.
   *
   * @param job Giraph job wrapping the Hadoop job
   * @param cmd Command line
   */
  private void prepareHadoopMRJob(GiraphJob job, CommandLine cmd) {
    if (cmd.hasOption("op")) {
      FileOutputFormat.setOutputPath(job.getInternalJob(),
          new Path(cmd.getOptionValue("op")));
    }
  }

  /**
   * Get the options to use in this benchmark.
   * BenchmarkOption.VERBOSE, BenchmarkOption.HELP and BenchmarkOption.WORKERS
   * will be added automatically, so you don't have to specify those.
   *
   * @return Options to use in this benchmark
   */
  public abstract Set<BenchmarkOption> getBenchmarkOptions();

  /**
   * Process options from CommandLine and prepare configuration for running
   * the job.
   * BenchmarkOption.VERBOSE, BenchmarkOption.HELP and BenchmarkOption.WORKERS
   * will be processed automatically so you don't have to process them.
   *
   * @param conf Configuration
   * @param cmd Command line
   */
  protected abstract void prepareConfiguration(GiraphConfiguration conf,
      CommandLine cmd);
}


Result:

17/01/20 21:18:09 INFO mapred.JobClient: Running job: job_201701201708_0025
17/01/20 21:18:10 INFO mapred.JobClient:  map 100% reduce 0%
17/01/20 21:18:17 INFO mapred.JobClient: Job complete: job_201701201708_0025
17/01/20 21:18:17 INFO mapred.JobClient: Counters: 48
17/01/20 21:18:17 INFO mapred.JobClient:   Zookeeper halt node
17/01/20 21:18:17 INFO mapred.JobClient:     /_hadoopBsp/job_201701201708_0025/_haltComputation=0
17/01/20 21:18:17 INFO mapred.JobClient:   Zookeeper base path
17/01/20 21:18:17 INFO mapred.JobClient:     /_hadoopBsp/job_201701201708_0025=0
17/01/20 21:18:17 INFO mapred.JobClient:   Job Counters 
17/01/20 21:18:17 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=45288
17/01/20 21:18:17 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
17/01/20 21:18:17 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
17/01/20 21:18:17 INFO mapred.JobClient:     Launched map tasks=2
17/01/20 21:18:17 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
17/01/20 21:18:17 INFO mapred.JobClient:   Giraph Timers
17/01/20 21:18:17 INFO mapred.JobClient:     Superstep 8 KMeansComputation (ms)=73
17/01/20 21:18:17 INFO mapred.JobClient:     Superstep 9 KMeansComputation (ms)=75
17/01/20 21:18:17 INFO mapred.JobClient:     Superstep 7 KMeansComputation (ms)=77
17/01/20 21:18:17 INFO mapred.JobClient:     Input superstep (ms)=423
17/01/20 21:18:17 INFO mapred.JobClient:     Superstep 2 KMeansComputation (ms)=78
17/01/20 21:18:17 INFO mapred.JobClient:     Shutdown (ms)=8953
17/01/20 21:18:17 INFO mapred.JobClient:     Initialize (ms)=11614
17/01/20 21:18:17 INFO mapred.JobClient:     Superstep 0 KMeansComputation (ms)=91
17/01/20 21:18:17 INFO mapred.JobClient:     Setup (ms)=111
17/01/20 21:18:17 INFO mapred.JobClient:     Superstep 3 KMeansComputation (ms)=83
17/01/20 21:18:17 INFO mapred.JobClient:     Superstep 10 KMeansComputation (ms)=71
17/01/20 21:18:17 INFO mapred.JobClient:     Superstep 1 KMeansComputation (ms)=112
17/01/20 21:18:17 INFO mapred.JobClient:     Total (ms)=10379
17/01/20 21:18:17 INFO mapred.JobClient:     Superstep 4 KMeansComputation (ms)=82
17/01/20 21:18:17 INFO mapred.JobClient:     Superstep 6 KMeansComputation (ms)=74
17/01/20 21:18:17 INFO mapred.JobClient:     Superstep 5 KMeansComputation (ms)=73
17/01/20 21:18:17 INFO mapred.JobClient:   Zookeeper server:port
17/01/20 21:18:17 INFO mapred.JobClient:     c02b03:22181=0
17/01/20 21:18:17 INFO mapred.JobClient:   Giraph Stats
17/01/20 21:18:17 INFO mapred.JobClient:     Aggregate edges=0
17/01/20 21:18:17 INFO mapred.JobClient:     Sent message bytes=0
17/01/20 21:18:17 INFO mapred.JobClient:     Superstep=11
17/01/20 21:18:17 INFO mapred.JobClient:     Last checkpointed superstep=0
17/01/20 21:18:17 INFO mapred.JobClient:     Current workers=1
17/01/20 21:18:17 INFO mapred.JobClient:     Aggregate sent messages=0
17/01/20 21:18:17 INFO mapred.JobClient:     Current master task partition=0
17/01/20 21:18:17 INFO mapred.JobClient:     Sent messages=0
17/01/20 21:18:17 INFO mapred.JobClient:     Aggregate finished vertices=24
17/01/20 21:18:17 INFO mapred.JobClient:     Aggregate sent message message bytes=0
17/01/20 21:18:17 INFO mapred.JobClient:     Aggregate vertices=24
17/01/20 21:18:17 INFO mapred.JobClient:   File Output Format Counters 
17/01/20 21:18:17 INFO mapred.JobClient:     Bytes Written=0
17/01/20 21:18:17 INFO mapred.JobClient:   FileSystemCounters
17/01/20 21:18:17 INFO mapred.JobClient:     HDFS_BYTES_READ=311
17/01/20 21:18:17 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=268140
17/01/20 21:18:17 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=296
17/01/20 21:18:17 INFO mapred.JobClient:   File Input Format Counters 
17/01/20 21:18:17 INFO mapred.JobClient:     Bytes Read=0
17/01/20 21:18:17 INFO mapred.JobClient:   Map-Reduce Framework
17/01/20 21:18:17 INFO mapred.JobClient:     Map input records=2
17/01/20 21:18:17 INFO mapred.JobClient:     Physical memory (bytes) snapshot=791097344
17/01/20 21:18:17 INFO mapred.JobClient:     Spilled Records=0
17/01/20 21:18:17 INFO mapred.JobClient:     CPU time spent (ms)=9390
17/01/20 21:18:17 INFO mapred.JobClient:     Total committed heap usage (bytes)=1331167232
17/01/20 21:18:17 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3453415424
17/01/20 21:18:17 INFO mapred.JobClient:     Map output records=0
17/01/20 21:18:17 INFO mapred.JobClient:     SPLIT_RAW_BYTES=88


Note that after the ten or so supersteps, Aggregate sent message message bytes is 0, which confirms that k-means here is a compute-intensive rather than communication-intensive workload.


Suppose there are 7 tasks. In each round, every task first assigns the points in its own partition to clusters according to the current array of cluster centers; once the assignment is done, each task computes new partial center data, adds it to an aggregator, and the partial results are collected at the master. The master then builds the new center array and sends it to the workers for the next round. The only messages that need to travel are the updated center data, so the communication volume is very small.
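
A minimal illustrative sketch of that master-side step is given below. It is not the code from the repository above; the aggregator names, the use of DefaultMasterCompute, and the fixed CLUSTERS/DIMENSIONS constants are assumptions made for illustration. Workers add per-cluster coordinate sums and point counts to sum aggregators during the superstep, and the master turns them into new centers and broadcasts them back through persistent overwrite aggregators.

package org.apache.giraph.benchmark.kmeans;

import org.apache.giraph.aggregators.DoubleOverwriteAggregator;
import org.apache.giraph.aggregators.DoubleSumAggregator;
import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;

/**
 * Illustrative sketch of the master-side centroid update (not the repository's
 * code). A real implementation would read the cluster/dimension counts from
 * the -clu/-dim options and also seed the initial centers.
 */
public class CentroidMasterComputeSketch extends DefaultMasterCompute {
  private static final int CLUSTERS = 5;
  private static final int DIMENSIONS = 2;

  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    for (int c = 0; c < CLUSTERS; c++) {
      // Number of points assigned to cluster c during the current superstep.
      registerAggregator("count/" + c, LongSumAggregator.class);
      for (int d = 0; d < DIMENSIONS; d++) {
        // Sum of coordinate d over all points assigned to cluster c.
        registerAggregator("sum/" + c + "/" + d, DoubleSumAggregator.class);
        // Center coordinate broadcast from the master to the workers; made
        // persistent so it keeps its value when a cluster receives no points.
        registerPersistentAggregator("center/" + c + "/" + d,
            DoubleOverwriteAggregator.class);
      }
    }
  }

  @Override
  public void compute() {
    if (getSuperstep() == 0) {
      return; // nothing has been aggregated by the workers yet
    }
    for (int c = 0; c < CLUSTERS; c++) {
      long count = this.<LongWritable>getAggregatedValue("count/" + c).get();
      if (count == 0) {
        continue; // keep the previous center for an empty cluster
      }
      for (int d = 0; d < DIMENSIONS; d++) {
        double sum =
            this.<DoubleWritable>getAggregatedValue("sum/" + c + "/" + d).get();
        // New center coordinate = mean of the assigned points. In their
        // compute(), workers read it back with getAggregatedValue and add each
        // point to the sum/count aggregators of its nearest center.
        setAggregatedValue("center/" + c + "/" + d,
            new DoubleWritable(sum / count));
      }
    }
  }
}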


