giraph ../giraph-core-1.2.0.jar org.apache.giraph.benchmark.PageRankComputation -vif org.apache.giraph.io.formats.IntFloatNullTextInputFormat -vip /test/youTube.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /output -w 2
The worker number given with -w is the number of map tasks used as compute workers; at run time three more tasks are added on top of it:
task ID 0 is the master coordination task, IDs 1 and 2 are the compute tasks, ID 3 is job setup, and ID 4 is job cleanup.
Suppose the HDFS block size is 16 MB and the actual data is 30 MB; then the number of splits is ceil(30/16) = 2, i.e. 2 map tasks.
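The split arithmetic above can be sketched as plain ceiling division (a simplification: Hadoop's real FileInputFormat also applies a split-slop factor before cutting the last split, which does not change this example):

```java
// Minimal sketch: one input split per HDFS block, with the short
// trailing block counting as its own split (ceiling division).
public class SplitCount {
    static long countSplits(long dataBytes, long blockBytes) {
        return (dataBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        // 30 MB of data on 16 MB blocks -> 2 splits, hence 2 map tasks
        System.out.println(countSplits(30L << 20, 16L << 20));
    }
}
```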
The actual mapping from splits to partitions is computed by the following function:
PartitionUtils.computePartitionCount
public static int computePartitionCount(int availableWorkerCount,
    ImmutableClassesGiraphConfiguration conf) {
  if (availableWorkerCount == 0) {
    throw new IllegalArgumentException(
        "computePartitionCount: No available workers");
  }

  int userPartitionCount = USER_PARTITION_COUNT.get(conf);
  int partitionCount;
  if (userPartitionCount == USER_PARTITION_COUNT.getDefaultValue()) {
    float multiplier = GiraphConstants.PARTITION_COUNT_MULTIPLIER.get(conf);
    partitionCount = Math.max(
        (int) (multiplier * availableWorkerCount * availableWorkerCount), 1);
    int minPartitionsPerComputeThread =
        MIN_PARTITIONS_PER_COMPUTE_THREAD.get(conf);
    int totalComputeThreads =
        NUM_COMPUTE_THREADS.get(conf) * availableWorkerCount;
    partitionCount = Math.max(partitionCount,
        minPartitionsPerComputeThread * totalComputeThreads);
  } else {
    partitionCount = userPartitionCount;
  }
  if (LOG.isInfoEnabled()) {
    LOG.info("computePartitionCount: Creating " +
        partitionCount + " partitions.");
  }
  return partitionCount;
}
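With no user-set partition count and the default multiplier, the default branch above reduces to workers squared. A standalone sketch of just that branch (assuming one compute thread per worker, so the min-partitions floor does not kick in):

```java
// Standalone sketch of the default branch of computePartitionCount,
// assuming giraph.userPartitionCount is unset and the min-partitions
// floor is not reached (one compute thread per worker).
public class PartitionMath {
    static int computePartitionCount(int availableWorkerCount, float multiplier) {
        return Math.max(
            (int) (multiplier * availableWorkerCount * availableWorkerCount), 1);
    }

    public static void main(String[] args) {
        // 2 workers -> 4 partitions (2 per worker, each holding 1/4 of the data)
        System.out.println(computePartitionCount(2, 1.0f));
        // 3 workers -> 9 partitions (3 per worker, each holding 1/9 of the data)
        System.out.println(computePartitionCount(3, 1.0f));
    }
}
```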
multiplier defaults to 1, and availableWorkerCount is 2, the initial worker number, so
partitionCount = Math.max(
(int) (multiplier * availableWorkerCount * availableWorkerCount), 1);
yields 4 partitions: each split is effectively cut evenly into worker-number pieces. Because the test started with only 2 workers, each task ran two partitions, with each partition holding 1/4 of the data,
which made netty run out of heap while sending a partition:
2016-12-09 22:52:47,649 ERROR org.apache.giraph.comm.netty.NettyClient: Request failed
java.lang.OutOfMemoryError: Java heap space
at io.netty.buffer.UnpooledHeapByteBuf.<init>(UnpooledHeapByteBuf.java:45)
at io.netty.buffer.UnpooledByteBufAllocator.newHeapBuffer(UnpooledByteBufAllocator.java:43)
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:136)
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:127)
at io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:85)
at org.apache.giraph.comm.netty.handler.RequestEncoder.write(RequestEncoder.java:81)
at io.netty.channel.DefaultChannelHandlerContext.invokeWrite(DefaultChannelHandlerContext.java:645)
at io.netty.channel.DefaultChannelHandlerContext.access$2000(DefaultChannelHandlerContext.java:29)
at io.netty.channel.DefaultChannelHandlerContext$WriteTask.run(DefaultChannelHandlerContext.java:906)
at io.netty.util.concurrent.DefaultEventExecutor.run(DefaultEventExecutor.java:36)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
at java.lang.Thread.run(Thread.java:745)
After changing the worker number to 3, there were 9 partitions in total, 3 partitions per task on average, with each partition holding 1/9 of the data, and everything was OK~
About DiskBackedMessageStore
In v1.2, DiskBackedMessageStoreFactory was removed. Previously, spilling the graph and spilling the messages to disk were configured independently; now, in the ServerData class, you only set GiraphConstants.USE_OUT_OF_CORE_GRAPH (giraph.useOutOfCoreGraph) to true to enable the oocEngine, and then both the graph and the messages are stored on disk by default.
This change is not that great, honestly! The original design was better!
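Enabling the out-of-core mode described above can be sketched as follows. This is a hedged sketch: it sets the giraph.useOutOfCoreGraph key quoted in the text through Hadoop's generic Configuration API, so it needs the Giraph 1.2 jars on the classpath; verify the key against your Giraph version.

```java
import org.apache.giraph.conf.GiraphConfiguration;

public class OutOfCoreConfigSketch {
    public static void main(String[] args) {
        GiraphConfiguration conf = new GiraphConfiguration();
        // In 1.2 this single switch enables the oocEngine, so BOTH the
        // graph and the messages spill to disk by default.
        conf.setBoolean("giraph.useOutOfCoreGraph", true);
    }
}
```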