On Giraph Data Partitioning (Version 1.2)

giraph ../giraph-core-1.2.0.jar  org.apache.giraph.benchmark.PageRankComputation -vif  org.apache.giraph.io.formats.IntFloatNullTextInputFormat -vip /test/youTube.txt  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /output  -w 2


The worker number after -w is the number of compute map tasks; at run time three more tasks are added on top of it:

that is, task ID 0 is the master coordination task, IDs 1 and 2 are the compute tasks, ID 3 is the job setup task, and ID 4 is the job cleanup task.



Suppose the HDFS block size is 16 MB and the actual data is 30 MB; the number of splits is then ceil(30/16) = 2, i.e. two map tasks.
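A minimal sketch of that split arithmetic (the variable names are mine, not Hadoop's):

// Illustrative only: the split count is roughly ceil(dataSize / blockSize).
long blockSize = 16L * 1024 * 1024;                    // 16 MB HDFS block size
long dataSize  = 30L * 1024 * 1024;                    // 30 MB of input data
long splits = (dataSize + blockSize - 1) / blockSize;  // = 2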

When the relationship between splits and partitions is actually computed, the following function is used:

 PartitionUtils.computePartitionCount

 public static int computePartitionCount(int availableWorkerCount,
      ImmutableClassesGiraphConfiguration conf) {
    if (availableWorkerCount == 0) {
      throw new IllegalArgumentException(
          "computePartitionCount: No available workers");
    }

    int userPartitionCount = USER_PARTITION_COUNT.get(conf);
    int partitionCount;
    if (userPartitionCount == USER_PARTITION_COUNT.getDefaultValue()) {
      // No explicit user setting: default to
      // multiplier * workers^2 partitions (at least 1).
      float multiplier = GiraphConstants.PARTITION_COUNT_MULTIPLIER.get(conf);
      partitionCount = Math.max(
          (int) (multiplier * availableWorkerCount * availableWorkerCount), 1);
      // Also guarantee each compute thread at least the configured
      // minimum number of partitions.
      int minPartitionsPerComputeThread =
          MIN_PARTITIONS_PER_COMPUTE_THREAD.get(conf);
      int totalComputeThreads =
          NUM_COMPUTE_THREADS.get(conf) * availableWorkerCount;
      partitionCount = Math.max(partitionCount,
          minPartitionsPerComputeThread * totalComputeThreads);
    } else {
      // The user set the partition count explicitly; honor it as-is.
      partitionCount = userPartitionCount;
    }
    if (LOG.isInfoEnabled()) {
      LOG.info("computePartitionCount: Creating " +
          partitionCount + " partitions.");
    }
    return partitionCount;
  }

multiplier defaults to 1, and availableWorkerCount is the initial worker number, here 2, so the key line is

 partitionCount = Math.max(
          (int) (multiplier * availableWorkerCount * availableWorkerCount), 1);

which in effect cuts each split evenly into worker-number pieces. Since the test started with only 2 workers, each worker actually ran two partitions of data, and each partition held 1/4 of the data (the arithmetic is sketched below).
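A quick worked check of that formula with the default multiplier (the variable names below are mine, and the compute-thread minimum is ignored for simplicity):

// Illustrative arithmetic only, mirroring the formula above.
float multiplier = 1.0f;                        // default PARTITION_COUNT_MULTIPLIER
int partitionsFor2Workers =
    Math.max((int) (multiplier * 2 * 2), 1);    // = 4 -> 2 partitions/worker, 1/4 of the data each
int partitionsFor3Workers =
    Math.max((int) (multiplier * 3 * 3), 1);    // = 9 -> 3 partitions/worker, 1/9 of the data each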

This caused Netty to run out of memory while sending partitions:

2016-12-09 22:52:47,649 ERROR org.apache.giraph.comm.netty.NettyClient: Request failed
java.lang.OutOfMemoryError: Java heap space
	at io.netty.buffer.UnpooledHeapByteBuf.<init>(UnpooledHeapByteBuf.java:45)
	at io.netty.buffer.UnpooledByteBufAllocator.newHeapBuffer(UnpooledByteBufAllocator.java:43)
	at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:136)
	at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:127)
	at io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:85)
	at org.apache.giraph.comm.netty.handler.RequestEncoder.write(RequestEncoder.java:81)
	at io.netty.channel.DefaultChannelHandlerContext.invokeWrite(DefaultChannelHandlerContext.java:645)
	at io.netty.channel.DefaultChannelHandlerContext.access$2000(DefaultChannelHandlerContext.java:29)
	at io.netty.channel.DefaultChannelHandlerContext$WriteTask.run(DefaultChannelHandlerContext.java:906)
	at io.netty.util.concurrent.DefaultEventExecutor.run(DefaultEventExecutor.java:36)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
	at java.lang.Thread.run(Thread.java:745)


Later the worker number was changed to 3, giving 1 * 3 * 3 = 9 partitions in total, i.e. on average 3 partitions per worker with each partition holding 1/9 of the data, and the job ran fine.



On DiskBackedMessageStore

In V1.2, DiskBackedMessageStoreFactory was removed. Previously, whether the graph and whether the messages were spilled to disk were configured separately; now, as long as GiraphConstants.USE_OUT_OF_CORE_GRAPH (giraph.useOutOfCoreGraph) is set to true, the oocEngine in the ServerData class is enabled, and then both the graph and the messages are stored on disk by default.
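A minimal sketch of turning this on (assuming the stock GiraphConfiguration/GiraphRunner usage; nothing below is from the original post):

// Sketch: enable the out-of-core engine. In V1.2 this single switch
// makes both the graph and the message store disk-backed.
GiraphConfiguration conf = new GiraphConfiguration();
GiraphConstants.USE_OUT_OF_CORE_GRAPH.set(conf, true);

// Equivalently, as a custom argument to GiraphRunner:
//   -ca giraph.useOutOfCoreGraph=true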

In my view this change is not well done at all; the original design was better!
