Spark
Today I set out to load MySQL data into an RDD. I had long known that Spark ships a JdbcRDD and wanted to give it a try, only to find that it is something of a letdown: usable, but barely.
First, let's look at the definition of JdbcRDD:
/**
 * An RDD that executes an SQL query on a JDBC connection and reads results.
 * For usage example, see test case JdbcRDDSuite.
 *
 * @param getConnection a function that returns an open Connection.
 *   The RDD takes care of closing the connection.
 * @param sql the text of the query.
 *   The query must contain two ? placeholders for parameters used to partition the results.
 *   E.g. "select title, author from books where ? <= id and id <= ?"
 * @param lowerBound the minimum value of the first placeholder
 * @param upperBound the maximum value of the second placeholder
 *   The lower and upper bounds are inclusive.
 * @param numPartitions the number of partitions.
 *   Given a lowerBound of 1, an upperBound of 20, and a numPartitions of 2,
 *   the query would be executed twice, once with (1, 10) and once with (11, 20)
 * @param mapRow a function from a ResultSet to a single row of the desired result type(s).
 *   This should only call getInt, getString, etc; the RDD takes care of calling next.
 *   The default maps a ResultSet to an array of Object.
 */
class JdbcRDD[T: ClassTag](
    sc: SparkContext,
    getConnection: () => Connection,
    sql: String,
    lowerBound: Long,
    upperBound: Long,
    numPartitions: Int,
    mapRow: (ResultSet) => T = JdbcRDD.resultSetToObjectArray _)
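To make the partitioning contract concrete, here is a minimal sketch (my own illustration, not the actual Spark source) of how an inclusive [lowerBound, upperBound] range can be split into numPartitions sub-ranges that fill the two ? placeholders; with lowerBound = 1, upperBound = 20 and numPartitions = 2 it yields (1, 10) and (11, 20), matching the scaladoc example above.

// Illustrative only: splits an inclusive id range into per-partition (start, end) pairs.
object PartitionRanges {
  def ranges(lowerBound: Long, upperBound: Long, numPartitions: Int): Seq[(Long, Long)] = {
    val length = BigInt(1) + upperBound - lowerBound   // number of ids, bounds inclusive
    (0 until numPartitions).map { i =>
      val start = lowerBound + ((i * length) / numPartitions).toLong
      val end   = lowerBound + (((i + 1) * length) / numPartitions).toLong - 1
      (start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    println(ranges(1L, 20L, 2))   // Vector((1,10), (11,20))
  }
}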
Here is an example:
package test

import java.sql.{Connection, DriverManager, ResultSet}

import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

object spark_mysql {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "spark_mysql")

    // Returns a new, open connection; JdbcRDD takes care of closing it.
    def createConnection() = {
      Class.forName("com.mysql.jdbc.Driver").newInstance()
      DriverManager.getConnection("jdbc:mysql://192.168.0.15:3306/wsmall", "root", "passwd")
    }

    // Maps one row of the ResultSet to a tuple; JdbcRDD calls next() for us.
    def extractValues(r: ResultSet) = {
      (r.getString(1), r.getString(2))
    }

    val data = new JdbcRDD(sc, createConnection, "SELECT id, aa FROM bbb WHERE ? <= id AND id <= ?",
      lowerBound = 3, upperBound = 5, numPartitions = 1, mapRow = extractValues)

    println(data.collect().toList)
    sc.stop()
  }
}
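To actually run the example, the MySQL JDBC driver has to be on the classpath alongside Spark; with sbt that is roughly the following (versions here are just examples, adjust to your environment):

// build.sbt (illustrative; versions are assumptions, not taken from the original post)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.0",
  "mysql" % "mysql-connector-java" % "5.1.34"
)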
The data in the MySQL table used is as follows:
The output of the run is as follows:
As you can see, the sql parameter of JdbcRDD must contain two ? placeholders, and those placeholders are filled by the lowerBound and upperBound parameters to bound the WHERE clause. If that were all, it would be tolerable; the sad part is that both lowerBound and upperBound are of type Long. How many columns used as keys or query conditions are actually Long these days? Still, by referring to JdbcRDD's source code, users can write a JdbcRDD variant that fits their own needs, which is the silver lining here.
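For example, here is a hypothetical workaround (my own sketch, not anything shipped with Spark) for a table whose partitioning column is a string rather than a Long: pick the key ranges yourself, parallelize them, and pull each slice over JDBC inside mapPartitions. The column name `name` and the ranges below are made up for illustration.

import java.sql.DriverManager

import scala.collection.mutable.ListBuffer

import org.apache.spark.SparkContext

object manual_jdbc {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "manual_jdbc")

    // Hand-picked boundaries for a hypothetical VARCHAR column `name`; one query per range.
    val keyRanges = Seq(("a", "m"), ("n", "z"))

    val data = sc.parallelize(keyRanges, keyRanges.size).mapPartitions { iter =>
      iter.flatMap { case (low, high) =>
        // Each partition opens (and closes) its own connection on the executor.
        Class.forName("com.mysql.jdbc.Driver").newInstance()
        val conn = DriverManager.getConnection("jdbc:mysql://192.168.0.15:3306/wsmall", "root", "passwd")
        try {
          val stmt = conn.prepareStatement("SELECT id, aa FROM bbb WHERE ? <= name AND name <= ?")
          stmt.setString(1, low)
          stmt.setString(2, high)
          val rs = stmt.executeQuery()
          val rows = ListBuffer[(String, String)]()
          while (rs.next()) {
            rows += ((rs.getString(1), rs.getString(2)))
          }
          rows.toList   // materialize before the connection is closed
        } finally {
          conn.close()
        }
      }
    }

    println(data.collect().toList)
    sc.stop()
  }
}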
Lately I have been busy with the 炼数成金 Spark course and have had little time to tidy up this blog. For friends who want a deeper understanding of Spark, I recommend a friend's blog, http://www.cnblogs.com/cenyuhai/, which has quite a few source-code posts that help with understanding Spark's internals.