Spark slower than MySQL: converting a MySQL table to a Spark Dataset is very slow compared with a CSV file

Latest recommended article published on 2024-01-03 10:20:05

宝源冷气工程



Article tags: Spark slower than MySQL

I have a CSV file in Amazon S3 that is 62 MB in size (114,000 rows). I am converting it into a Spark Dataset and taking the first 500 rows from it. The code is as follows:

DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);

Dataset set=df.load("s3n://"+this.accessId.replace("\"", "")+":"+this.accessToken.replace("\"", "")+"@"+this.bucketName.replace("\"", "")+"/"+this.filePath.replace("\"", "")+"");

set.take(500);

The whole operation takes 20 to 30 sec.

Now I am trying the same thing, but instead of a CSV file I am using a MySQL table with 119,000 rows. The MySQL server is on Amazon EC2. The code is as follows:

String url ="jdbc:mysql://"+this.hostName+":3306/"+this.dataBaseName+"?user="+this.userName+"&password="+this.password;

SparkSession spark=StartSpark.getSparkSession();

SQLContext sc = spark.sqlContext();

DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);

Dataset set = sc
    .read()
    .option("url", url)
    .option("dbtable", this.tableName)
    .option("driver","com.mysql.jdbc.Driver")
    .format("jdbc")
    .load();

set.take(500);

This is taking 5 to 10 minutes.

I am running Spark inside a JVM, using the same configuration in both cases.

I could use partitionColumn, numPartitions, etc., but I don't have any numeric column, and a further issue is that the schema of the table is unknown to me.

My question is not how to reduce the time required, since I know that in the ideal case Spark will run on a cluster. What I cannot understand is why there is such a big time difference between the two cases above.

Solution

This problem has been covered multiple times on StackOverflow:

and in external sources:

So, just to reiterate: by default, DataFrameReader.jdbc doesn't distribute data or reads. It uses a single thread on a single executor.
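As a rough illustration of the difference between the default single-query read and a read split by ranges on a numeric column, the sketch below builds the SQL that each partition would issue. It is a simplified model written for this note, not Spark's actual code. The helper partitionQueries, the table name my_table, and the bounds 0 and 30 are assumptions; only the column name foo and the partition count 3 are taken from the snippet in this answer. Edge cases, such as a range narrower than the partition count, are ignored.

```java
import java.util.ArrayList;
import java.util.List;

class JdbcRangePartitions {

    // Hypothetical helper (not a Spark API): returns one SQL query per partition.
    static List<String> partitionQueries(String table, String column,
                                         long lowerBound, long upperBound,
                                         int numPartitions) {
        List<String> queries = new ArrayList<>();
        if (numPartitions <= 1) {
            // Default case: one unfiltered query, so a single task reads everything.
            queries.add("SELECT * FROM " + table);
            return queries;
        }
        // Width of each range; the bounds only shape the ranges, they do not filter rows.
        long stride = upperBound / numPartitions - lowerBound / numPartitions;
        long current = lowerBound;
        for (int i = 0; i < numPartitions; i++) {
            // The first partition has no lower limit, the last has no upper limit.
            String lower = (i != 0) ? column + " >= " + current : null;
            current += stride;
            String upper = (i != numPartitions - 1) ? column + " < " + current : null;

            String where;
            if (upper == null) {
                where = lower;
            } else if (lower == null) {
                // NULL values are assigned to the first partition so no row is lost.
                where = upper + " OR " + column + " IS NULL";
            } else {
                where = lower + " AND " + upper;
            }
            queries.add("SELECT * FROM " + table + " WHERE " + where);
        }
        return queries;
    }

    public static void main(String[] args) {
        // Assumed inputs: table "my_table", bounds 0 and 30.
        for (String q : partitionQueries("my_table", "foo", 0, 30, 3)) {
            System.out.println(q);
        }
    }
}
```

With these assumed inputs the sketch produces three queries: foo < 10 OR foo IS NULL, foo >= 10 AND foo < 20, and foo >= 20. Each one can run over its own connection in its own task, whereas a read with no partitioning options corresponds to the single unfiltered query.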

To distribute reads:

use ranges with lowerBound / upperBound:

Properties properties;

Lower

Dataset set = sc

.read()

.option("partitionColumn", "foo")

.option("numPartitions", "3")