The Underwhelming JdbcRDD

      Today I set out to move data from MySQL into an RDD. I had long known that there is a JdbcRDD and figured I would give it a try, only to find that it is rather underwhelming.
      First, let's look at the definition of JdbcRDD:
/**
 * An RDD that executes an SQL query on a JDBC connection and reads results.
 * For usage example, see test case JdbcRDDSuite.
 *
 * @param getConnection a function that returns an open Connection.
 *   The RDD takes care of closing the connection.
 * @param sql the text of the query.
 *   The query must contain two ? placeholders for parameters used to partition the results.
 *   E.g. "select title, author from books where ? <= id and id <= ?"
 * @param lowerBound the minimum value of the first placeholder
 * @param upperBound the maximum value of the second placeholder
 *   The lower and upper bounds are inclusive.
 * @param numPartitions the number of partitions.
 *   Given a lowerBound of 1, an upperBound of 20, and a numPartitions of 2,
 *   the query would be executed twice, once with (1, 10) and once with (11, 20)
 * @param mapRow a function from a ResultSet to a single row of the desired result type(s).
 *   This should only call getInt, getString, etc; the RDD takes care of calling next.
 *   The default maps a ResultSet to an array of Object.
 */
class JdbcRDD[T: ClassTag](
    sc: SparkContext,
    getConnection: () => Connection,
    sql: String,
    lowerBound: Long,
    upperBound: Long,
    numPartitions: Int,
    mapRow: (ResultSet) => T = JdbcRDD.resultSetToObjectArray _)
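
The scaladoc above spells out how the (lowerBound, upperBound) range is divided among the partitions. Below is a minimal sketch of that splitting arithmetic, written here purely for illustration; splitRange is my own helper name, not a Spark API, and it reconstructs the behavior described in the scaladoc rather than quoting the actual Spark source:

// Sketch: split the inclusive range [lowerBound, upperBound] into
// numPartitions contiguous, inclusive sub-ranges, as the scaladoc describes.
def splitRange(lowerBound: Long, upperBound: Long, numPartitions: Int): Seq[(Long, Long)] = {
  val length = BigInt(upperBound) - lowerBound + 1
  (0 until numPartitions).map { i =>
    val start = lowerBound + (BigInt(i) * length / numPartitions).toLong
    val end   = lowerBound + (BigInt(i + 1) * length / numPartitions).toLong - 1
    (start, end)
  }
}

// e.g. splitRange(1, 20, 2) yields (1,10) and (11,20), matching the scaladoc example.

Each resulting (start, end) pair is then bound to the two ? placeholders of the query, so every partition runs the same SQL over its own slice of the key range.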

Here is an example:
package test

import java.sql.{Connection, DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.{SparkConf, SparkContext}

object spark_mysql {
  def main(args: Array[String]) {
    //val conf = new SparkConf().setAppName("spark_mysql").setMaster("local")
    val sc = new SparkContext("local","spark_mysql")

    // Returns an open JDBC connection; JdbcRDD takes care of closing it.
    def createConnection() = {
      Class.forName("com.mysql.jdbc.Driver").newInstance()
      DriverManager.getConnection("jdbc:mysql://192.168.0.15:3306/wsmall", "root", "passwd")
    }

    // Maps the current ResultSet row to the desired record type.
    def extractValues(r: ResultSet) = {
      (r.getString(1), r.getString(2))
    }

    // The two ? placeholders are filled in from lowerBound and upperBound.
    val data = new JdbcRDD(sc, createConnection, "SELECT id,aa FROM bbb where ? <= ID AND ID <= ?",
      lowerBound = 3, upperBound = 5, numPartitions = 1, mapRow = extractValues)

    println(data.collect().toList)

    sc.stop()
  }
}

The data in the MySQL table used is as follows (screenshot omitted):

The run result is as follows (screenshot omitted):

    As you can see, the sql parameter of JdbcRDD must contain two ? placeholders, and those placeholders exist so that lowerBound and upperBound can define the boundaries of the WHERE clause. If that were all, it would still be acceptable; the sad part is that both lowerBound and upperBound are of type Long. How many columns used as keys or query predicates are actually Long these days? Still, by following the JdbcRDD source code, users can write a JdbcRDD variant that fits their own needs, which is at least some consolation.
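
If your partition key is not a Long, one workaround is to do what JdbcRDD does by hand: build the per-partition key ranges yourself, parallelize them, and open a JDBC connection inside each partition. The sketch below assumes the same hypothetical table bbb from the example above but treats id as a string column, reuses that example's sc, imports, and connection string, and uses made-up key ranges purely for illustration:

// DIY alternative to JdbcRDD for a non-Long partition key (illustrative sketch).
val keyRanges = Seq(("a", "f"), ("g", "m"), ("n", "z"))   // string bounds instead of Long

val rows = sc.parallelize(keyRanges, keyRanges.size).mapPartitions { iter =>
  Class.forName("com.mysql.jdbc.Driver").newInstance()
  val conn = DriverManager.getConnection("jdbc:mysql://192.168.0.15:3306/wsmall", "root", "passwd")
  // Run one bounded query per key range handled by this partition.
  val results = iter.flatMap { case (low, high) =>
    val stmt = conn.prepareStatement("SELECT id, aa FROM bbb WHERE ? <= id AND id <= ?")
    stmt.setString(1, low)
    stmt.setString(2, high)
    val rs = stmt.executeQuery()
    val buf = scala.collection.mutable.ArrayBuffer[(String, String)]()
    while (rs.next()) buf += ((rs.getString(1), rs.getString(2)))
    rs.close(); stmt.close()
    buf
  }.toList          // materialize before the connection is closed
  conn.close()
  results.iterator
}

println(rows.collect().toList)

Opening the connection inside mapPartitions keeps one connection per partition on the executor side, which is also how JdbcRDD itself structures the work, but here the key type is entirely up to you.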

    Lately I have been busy with the 炼数成金 (Dataguru) Spark course and have had little time to tidy up this blog. For readers who want a deeper understanding of Spark, I recommend a friend's blog, http://www.cnblogs.com/cenyuhai/, which has quite a few source-code posts that help in understanding Spark's internals.


