Spark JDBC (1) MySQL Database RDD
www.myexceptions.net, shared by a user on 2015-06-16
This post tries to understand how JdbcRDD works on Spark.
First of all, the master does not connect to the database; only the driver and the workers do.
First step: the client driver class connects to MySQL and fetches the minId and maxId. The MySQL log shows:
150612 17:21:55 58 Connect cluster@192.168.56.1 on lmm
select coalesce(min(d.id), 0) from device d where d.last_updated >= '2014-06-12 00:00:00.0000' and d.last_updated < '2014-06-13 00:00:00.0000'
select coalesce(max(d.id), 0) from device d
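To make the two bound queries concrete, here is a minimal sketch that runs the same SQL against a tiny in-memory SQLite table. SQLite is only a stand-in for the MySQL `lmm` database, the four rows are made up, and `id_bounds` is a hypothetical helper name; none of this comes from the original code. It shows that `coalesce` turns an empty date window into 0 instead of NULL, and that the max query, as in the log, has no date filter.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL database in the log.
conn = sqlite3.connect(":memory:")
conn.execute("create table device (id integer primary key, last_updated text)")
conn.executemany(
    "insert into device (id, last_updated) values (?, ?)",
    [
        (3, "2014-06-11 23:59:59"),   # before the window
        (7, "2014-06-12 08:30:00"),   # inside the window
        (12, "2014-06-12 21:15:00"),  # inside the window
        (20, "2014-06-13 00:00:00"),  # excluded: the upper bound is exclusive
    ],
)


def id_bounds(conn, start, end):
    """Return (min_id, max_id) using the same two queries as the driver."""
    min_id = conn.execute(
        "select coalesce(min(d.id), 0) from device d "
        "where d.last_updated >= ? and d.last_updated < ?",
        (start, end),
    ).fetchone()[0]
    # As in the logged query, max(d.id) is taken over the whole table.
    max_id = conn.execute(
        "select coalesce(max(d.id), 0) from device d"
    ).fetchone()[0]
    return min_id, max_id


bounds = id_bounds(conn, "2014-06-12 00:00:00", "2014-06-13 00:00:00")
empty_bounds = id_bounds(conn, "2015-01-01 00:00:00", "2015-01-02 00:00:00")
print(bounds)        # (7, 20)
print(empty_bounds)  # (0, 20)
```

Because the max query ignores the date window, the id range handed to the partitions can extend past the rows that actually fall inside the window.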
Second step: all the workers fetch the data, each one querying the id ranges of its partitions.
150612 17:22:13 59 Connect cluster@ubuntu-dev2 on lmm
select id, tenant_id, date_created, last_updated, device_id, os_type, os_version,
search_radius, sdk_major_version, last_time_zone, sendable
from
device d
where
375001 <= d.id and
d.id <= 750001
select id, tenant_id, date_created, last_updated, device_id, os_type, os_version,
search_radius, sdk_major_version, last_time_zone, sendable
from
device d
where
750002 <= d.id and
d.id <= 1125002
62 Connect cluster@ubuntu-dev1 on lmm
62 Query select id, tenant_id, date_created, last_updated, device_id, os_type, os_version,
search_radius, sdk_major_version, last_time_zone, sendable
from
device d
where
0 <= d.id and
d.id <= 375000
63 Query select id, tenant_id, date_created, last_updated, device_id, os_type, os_version,
search_radius, sdk_major_version, last_time_zone, sendable
from
device d
where
1500004 <= d.id and
d.id <= 1875004
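The id ranges in the worker queries are consistent with an even split of one inclusive id range. The sketch below is plain Python, not Spark code, and reproduces them under three assumptions inferred from the log rather than stated in it: a lower bound of 0, an upper bound of 1875004, and 5 partitions. `jdbc_partitions` is a hypothetical helper name.

```python
def jdbc_partitions(lower_bound, upper_bound, num_partitions):
    """Split the inclusive range [lower_bound, upper_bound] into
    num_partitions contiguous inclusive (start, end) sub-ranges."""
    length = 1 + upper_bound - lower_bound  # both bounds are inclusive
    parts = []
    for i in range(num_partitions):
        start = lower_bound + (i * length) // num_partitions
        end = lower_bound + ((i + 1) * length) // num_partitions - 1
        parts.append((start, end))
    return parts


# Assumed inputs, chosen to match the ranges that appear in the log.
partitions = jdbc_partitions(0, 1875004, 5)
for start, end in partitions:
    print(f"{start} <= d.id and d.id <= {end}")
```

Under these assumptions the split yields 0..375000, 375001..750001, 750002..1125002, 1125003..1500003 and 1500004..1875004. Four of these appear in the log excerpt; the range 1125003..1500003 is not shown there. The ranges do not overlap and leave no gaps, so each id is read by exactly one partition query. The log lists them in connection order rather than partition order.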
The sample JdbcRDD code is here:
https://github.com/luohuazju/sillycat-spark/tree/streaming