Spark slower with MySQL: converting a MySQL table to a Spark Dataset is very slow compared to a CSV file

I have a CSV file in Amazon S3 which is 62 MB in size (114,000 rows). I am converting it into a Spark Dataset and taking the first 500 rows from it. The code is as follows:

DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);

Dataset set = df.load("s3n://"+this.accessId.replace("\"", "")+":"+this.accessToken.replace("\"", "")+"@"+this.bucketName.replace("\"", "")+"/"+this.filePath.replace("\"", ""));

set.take(500);

The whole operation takes 20 to 30 seconds.

Now I am trying the same thing, but instead of a CSV file I am using a MySQL table with 119,000 rows. The MySQL server is on Amazon EC2. The code is as follows:

String url = "jdbc:mysql://"+this.hostName+":3306/"+this.dataBaseName+"?user="+this.userName+"&password="+this.password;

SparkSession spark = StartSpark.getSparkSession();
SQLContext sc = spark.sqlContext();

Dataset set = sc
    .read()
    .format("jdbc")
    .option("url", url)
    .option("dbtable", this.tableName)
    .option("driver", "com.mysql.jdbc.Driver")
    .load();

set.take(500);

This is taking 5 to 10 minutes.

I am running Spark inside the JVM and using the same configuration in both cases.

I could use partitionColumn, numPartitions, etc., but I don't have any numeric column, and a further issue is that the schema of the table is unknown to me.

My issue is not how to decrease the required time (I know that in the ideal case Spark would run on a cluster); what I cannot understand is why there is such a big time difference between the two cases above.

Solution

This problem has been covered multiple times on Stack Overflow and in external sources, so just to reiterate: by default, DataFrameReader.jdbc doesn't distribute data or reads. It uses a single thread on a single executor.

To distribute reads:

use ranges with lowerBound / upperBound:

Dataset set = sc
    .read()
    .format("jdbc")
    .option("url", url)
    .option("dbtable", this.tableName)
    .option("driver", "com.mysql.jdbc.Driver")
    .option("partitionColumn", "foo")
    .option("numPartitions", "3")
    .option("lowerBound", 0)
    .option("upperBound", 30)
    .load();
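Under the hood, Spark turns lowerBound / upperBound / numPartitions into one WHERE clause per partition, so each task reads only its slice of the table. The following is a minimal sketch of that splitting logic in plain Java (an illustration of the idea, not Spark's actual implementation; the column name foo and the bounds 0/30/3 are taken from the example above):

```java
import java.util.ArrayList;
import java.util.List;

public class JdbcPartitionSketch {
    // Split the half-open range [lower, upper) over `column` into
    // `numPartitions` WHERE clauses, one per JDBC partition.
    static List<String> whereClauses(String column, long lower, long upper, int numPartitions) {
        List<String> clauses = new ArrayList<>();
        long stride = (upper - lower) / numPartitions;
        long start = lower;
        for (int i = 0; i < numPartitions; i++) {
            long end = start + stride;
            if (i == 0) {
                // First partition also picks up NULLs and anything below lowerBound.
                clauses.add(column + " < " + end + " OR " + column + " IS NULL");
            } else if (i == numPartitions - 1) {
                // Last partition is open-ended above upperBound.
                clauses.add(column + " >= " + start);
            } else {
                clauses.add(column + " >= " + start + " AND " + column + " < " + end);
            }
            start = end;
        }
        return clauses;
    }

    public static void main(String[] args) {
        for (String w : whereClauses("foo", 0L, 30L, 3)) {
            System.out.println(w);
        }
    }
}
```

Note that the bounds do not filter rows; they only decide where the splits fall, which is why rows outside [lowerBound, upperBound] still land in the first or last partition.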

use predicates:

Properties properties = new Properties();
Dataset set = sc
    .read()
    .jdbc(
        url, this.tableName,
        new String[]{"foo < 10", "foo BETWEEN 10 AND 20", "foo > 20"},
        properties
    );
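Since the question notes there is no numeric column, it helps that predicates can be any SQL expressions the database can evaluate, so you can bucket rows by hashing a non-numeric column. A hedged sketch of generating such predicates (the column name id_str and the use of MySQL's CRC32 function are assumptions for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class PredicateSketch {
    // Build one non-overlapping predicate per bucket by taking an
    // expression modulo the bucket count; together they cover every row.
    static String[] modPredicates(String expr, int buckets) {
        List<String> preds = new ArrayList<>();
        for (int i = 0; i < buckets; i++) {
            preds.add("MOD(" + expr + ", " + buckets + ") = " + i);
        }
        return preds.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Hypothetical: hash a string column so rows spread across 4 partitions.
        for (String p : modPredicates("CRC32(id_str)", 4)) {
            System.out.println(p);
        }
    }
}
```

The resulting array can be passed as the predicates argument of DataFrameReader.jdbc, giving one partition per bucket without requiring a numeric column.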
