GraphX vs. GraphFrame connectedComponents performance comparison

hhtop112408

Article link: https://blog.csdn.net/hhtop112408/article/details/77869845

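This test compares the connectedComponents performance of GraphX and GraphFrame on a graph with 494587 vertices and 1997743 edges. GraphX and GraphFrame with its default algorithm had similar run times, about 2.6 to 2.7 minutes. GraphFrame with algorithm set to "graphx" was the fastest at 1.3 minutes. After changing the GraphX code to map the data to BoxedUnit and to use StorageLevel MEMORY_AND_DISK, the GraphX run time dropped to 1.1 minutes, comparable to GraphFrame with algorithm "graphx". In addition, GraphFrame was found to require a checkpoint and to use more blocks.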

Test data: generated with Graph rmatGraph 1000000 2000000; after deduplication it has 494587 vertices and 1997743 edges.
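The post does not say how the deduplication was done. As a minimal sketch of one plausible reading, the plain-Java helper below drops exact duplicate (src, dst) pairs and counts only the vertices that appear as an endpoint of some edge, which is one way the requested sizes can shrink. The class name `EdgeDedupSketch` and the method `distinctCounts` are hypothetical and do not come from the original test code.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: counts distinct vertices and edges in a raw edge list. */
final class EdgeDedupSketch {

    /** Returns {distinctVertexCount, distinctEdgeCount} for directed (src, dst) pairs. */
    static long[] distinctCounts(long[][] rawEdges) {
        Set<String> edges = new HashSet<>();
        Set<Long> vertices = new HashSet<>();
        for (long[] e : rawEdges) {
            // An edge is identified by its (src, dst) pair; the set drops repeats.
            edges.add(e[0] + "," + e[1]);
            // Only vertices that occur as an endpoint are counted.
            vertices.add(e[0]);
            vertices.add(e[1]);
        }
        return new long[] { vertices.size(), edges.size() };
    }

    public static void main(String[] args) {
        // Tiny made-up edge list with two repeated edges and one self-loop.
        long[][] raw = { {1, 2}, {1, 2}, {2, 3}, {5, 5}, {2, 3} };
        long[] counts = distinctCounts(raw);
        System.out.println("vertices=" + counts[0] + ", edges=" + counts[1]);
    }
}
```

On real data the same counting would normally be done with Spark's `distinct()` on the edge RDD rather than in driver memory; the in-memory version is only meant to make the counting rule explicit.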

Environment: 246 GB of memory, 71 cores.

Three cases were tested:
1: Graph connectedComponents
2: GraphFrame connectedComponents
3: GraphFrame connectedComponents setAlgorithm( "graphx" )
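For readers who want a concrete picture of what all three cases compute: GraphX's connectedComponents labels every vertex with the lowest vertex id in its component. The single-machine sketch below reproduces that result by repeatedly propagating the smaller label across each edge until nothing changes. It is only an illustration of the output, not the distributed implementation used by GraphX or by GraphFrame's default algorithm, and the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical single-machine sketch of min-label propagation. */
final class MinLabelPropagationSketch {

    /**
     * Vertices are 0..n-1 and edges are undirected pairs.
     * Returns label[v] = smallest vertex id in v's connected component.
     */
    static long[] componentLabels(int n, int[][] edges) {
        long[] label = new long[n];
        for (int v = 0; v < n; v++) {
            label[v] = v; // every vertex starts in its own component
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int[] e : edges) {
                // Both endpoints adopt the smaller of their two labels.
                long min = Math.min(label[e[0]], label[e[1]]);
                if (label[e[0]] != min) { label[e[0]] = min; changed = true; }
                if (label[e[1]] != min) { label[e[1]] = min; changed = true; }
            }
        }
        return label;
    }

    public static void main(String[] args) {
        // Made-up graph: components {0,1,2}, {3} and {4,5}.
        int[][] edges = { {0, 1}, {1, 2}, {4, 5} };
        System.out.println(Arrays.toString(componentLabels(6, edges)));
    }
}
```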

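Results

Cases 1 and 2 were about the same, both around 2.6 to 2.7 minutes.

Case 3 was the fastest, at 1.3 minutes.

Cases 1 and 3 ran the same number of jobs, 10 each; case 2 ran 28 jobs.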

The command used to submit to YARN is

spark-submit --class blost.ConnectedComponentTest --master yarn --deploy-mode cluster --num-executors 10 --executor-cores 2 --driver-memory 2g --executor-memory 2g a.jar

Append 1, 2, or 3 to the command to launch the corresponding case.
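The post does not show how `blost.ConnectedComponentTest` interprets that trailing argument. A minimal sketch of one way to map it onto the three cases is given below; the enum, the class name, and the error handling are all assumptions made for illustration.

```java
/** Hypothetical sketch: maps the trailing command-line argument to a test case. */
final class ModeDispatchSketch {

    enum Mode { GRAPHX, GRAPHFRAME_DEFAULT, GRAPHFRAME_GRAPHX }

    static Mode parseMode(String[] args) {
        if (args.length < 1) {
            throw new IllegalArgumentException("expected 1, 2 or 3 as the first argument");
        }
        switch (args[0].trim()) {
            case "1": return Mode.GRAPHX;             // Graph connectedComponents
            case "2": return Mode.GRAPHFRAME_DEFAULT; // GraphFrame connectedComponents
            case "3": return Mode.GRAPHFRAME_GRAPHX;  // GraphFrame with setAlgorithm("graphx")
            default:
                throw new IllegalArgumentException("unknown case: " + args[0]);
        }
    }

    public static void main(String[] args) {
        // Falls back to case 3 when no argument is given, purely for demonstration.
        String[] effective = args.length > 0 ? args : new String[] { "3" };
        System.out.println("Selected case: " + parseMode(effective));
    }
}
```

Rejecting unknown values up front avoids spending a YARN submission on a run that would not execute any of the three cases.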

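After these tests, the open question was why cases 1 and 3 took different amounts of time, since in theory they should take the same. After looking at the runtime monitoring, I modified case 1:

Map the data to BoxedUnit so that less data is carried, and change the StorageLevel to StorageLevel.MEMORY_AND_DISK.

The run time dropped to 1.1 minutes, about the same as case 3.

The code is as follows: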

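private static void graphTest(JavaSparkContext ctx, String vPath, String ePath)
{
    // Vertices: keep only the id; BoxedUnit.UNIT serves as a payload-free attribute.
    // `pattern` is a field defined elsewhere in the class (not shown here).
    JavaRDD<Tuple2<Object, BoxedUnit>> verticeRDD = ctx.textFile( vPath ).map( s -> {
        String[] strs = pattern.split( s );
        return new Tuple2<Object, BoxedUnit>( Long.parseLong( strs[0] ), BoxedUnit.UNIT );
    });

    // Edges: source id and destination id, again with no attribute data.
    JavaRDD<Edge<BoxedUnit>> edgeRDD = ctx.textFile( ePath ).map( s -> {
        String[] strs = pattern.split( s );
        return new Edge<BoxedUnit>( Long.parseLong( strs[0] ), Long.parseLong( strs[1] ), BoxedUnit.UNIT );
    });

    // Build the graph; the argument list continues on the next line.
    Graph<BoxedUnit, BoxedUnit> g = Graph.apply( verticeRDD.rdd( ), edgeRDD.rdd( ), BoxedUnit.UNIT,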
StorageLevel.MEMORY_AND_DIS