Spark jobs inevitably pull data from more than one source. Besides HDFS, we often read from MySQL and HBase as well, so this post walks through how to query MySQL and HBase from Spark.
1. Querying MySQL from Spark
First, Spark needs the MySQL JDBC driver jar in order to connect to MySQL. One option is to pass it on the command line when launching spark-shell:
bin/spark-shell --driver-class-path /home/hadoop/jars/mysql-connector-java-5.1.34.jar --jars /home/hadoop/jars/mysql-connector-java-5.1.34.jar
A more convenient alternative is to drop the jar straight into the directory where Spark keeps its jars; on my machine that is /data/install/spark-2.0.0-bin-hadoop2.7/jars. Spark will then find the MySQL driver on its own.
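If you are not sure whether the driver jar is actually visible to Spark, a quick sanity check in spark-shell (a minimal sketch, not part of the original post) is to load the driver class by name; it fails fast if the jar is missing:

// Throws ClassNotFoundException if the MySQL Connector/J jar is not on the classpath
Class.forName("com.mysql.jdbc.Driver")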
Here is the standard code for reading from MySQL in Spark:
val imeis = spark.read.format("jdbc").options(
  Map("url" -> DbUtil.IMEI_DB_URL,
    // "dbtable" -> "(SELECT id,imei,imeiid FROM t_imei_all) a",
    "dbtable" -> DbUtil.IMEI_ALL_TABLE,
    "user" -> DbUtil.IMEI_DB_USERNAME,
    "password" -> DbUtil.IMEI_DB_PASSWORD,
    "driver" -> "com.mysql.jdbc.Driver",
    // "fetchSize" -> "1000",
    "partitionColumn" -> "id",
    "lowerBound" -> "1",