Spark
Today I set out to load MySQL data into an RDD. I had long known that Spark ships a JdbcRDD and wanted to give it a try, only to find that it is something of a letdown: usable, but barely.
First, let's look at the definition of JdbcRDD:
/**
 * An RDD that executes an SQL query on a JDBC connection and reads results.
 * For usage example, see test case JdbcRDDSuite.
 *
 * @param getConnection a function that returns an open Connection.
 *   The RDD takes care of closing the connection.
 * @param sql the text of the query.
 *   The query must contain two ? placeholders for parameters used to partition the results.
 *   E.g. "select title, author from books where ? <= id and id <= ?"
 * @param lowerBound the minimum value of the first placeholder
 * @param upperBound the maximum value of the second placeholder
 *   The lower and upper bounds are inclusive.
 * @param numPartitions the number of partitions.
 *   Given a lowerBound of 1, an upperBound of 20, and a numPartitions of 2,
 *   the query would be executed twice, once with (1, 10) and once with (11, 20)
 * @param mapRow a function from a ResultSet to a single row of the desired result type(s).
 *   This should only call getInt, getString, etc; the RDD takes care of calling next.
 *   The default maps a ResultSet to an array of Object.
 */
class JdbcRDD[T: ClassTag](
    sc: SparkContext,
    getConnection: () => Connection,
    sql: String,
    lowerBound: Long,
    upperBound: Long,
    numPartitions: Int,
    mapRow: (ResultSet) => T = JdbcRDD.resultSetToObjectArray _)
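To make the partitioning contract concrete, here is a minimal sketch (my own illustration, not the actual Spark source) of how an inclusive [lowerBound, upperBound] range can be split into numPartitions sub-ranges that fill the two ? placeholders; with lowerBound = 1, upperBound = 20 and numPartitions = 2 it yields (1, 10) and (11, 20), matching the scaladoc example above.

// Illustrative only: splits an inclusive id range into per-partition (start, end) pairs.
object PartitionRanges {
  def ranges(lowerBound: Long, upperBound: Long, numPartitions: Int): Seq[(Long, Long)] = {
    val length = BigInt(1) + upperBound - lowerBound   // number of ids, bounds inclusive
    (0 until numPartitions).map { i =>
      val start = lowerBound + ((i * length) / numPartitions).toLong
      val end   = lowerBound + (((i + 1) * length) / numPartitions).toLong - 1
      (start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    println(ranges(1L, 20L, 2))   // Vector((1,10), (11,20))
  }
}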
Here is an example:
package test

import java.sql.{Connection, DriverManager, ResultSet}

import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

object spark_mysql {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "spark_mysql")

    // Returns a new, open connection; JdbcRDD takes care of closing it.
    def createConnection() = {
      Class.forName("com.mysql.jdbc.Driver").newInstance()
      DriverManager.getConnection("jdbc:mysql://192.168.0.15:3306/wsmall", "root", "passwd")
    }

    // Maps one row of the ResultSet to a tuple; JdbcRDD calls next() for us.
    def extractValues(r: ResultSet) = {
      (r.getString(1), r.getString(2))
    }

    val data = new JdbcRDD(sc, createConnection, "SELECT id, aa FROM bbb WHERE ? <= id AND id <= ?",
      lowerBound = 3, upperBound = 5, numPartitions = 1, mapRow = extractValues)

    println(data.collect().toList)
    sc.stop()
  }
}
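To actually run the example, the MySQL JDBC driver has to be on the classpath alongside Spark; with sbt that is roughly the following (versions here are just examples, adjust to your environment):

// build.sbt (illustrative; versions are assumptions, not taken from the original post)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.0",
  "mysql" % "mysql-connector-java" % "5.1.34"
)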
The data in the MySQL table used is as follows:
The output of the run is as follows:
As you can see, the sql parameter of JdbcRDD must contain two ? placeholders, and those placeholders are filled by the lowerBound and upperBound parameters to bound the WHERE clause. If that were all, it would be tolerable; the sad part is that both lowerBound and upperBound are of type Long. How many columns used as keys or query conditions are actually Long these days? Still, by referring to JdbcRDD's source code, users can write a JdbcRDD variant that fits their own needs, which is the silver lining here.
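For example, here is a hypothetical workaround (my own sketch, not anything shipped with Spark) for a table whose partitioning column is a string rather than a Long: pick the key ranges yourself, parallelize them, and pull each slice over JDBC inside mapPartitions. The column name `name` and the ranges below are made up for illustration.

import java.sql.DriverManager

import scala.collection.mutable.ListBuffer

import org.apache.spark.SparkContext

object manual_jdbc {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "manual_jdbc")

    // Hand-picked boundaries for a hypothetical VARCHAR column `name`; one query per range.
    val keyRanges = Seq(("a", "m"), ("n", "z"))

    val data = sc.parallelize(keyRanges, keyRanges.size).mapPartitions { iter =>
      iter.flatMap { case (low, high) =>
        // Each partition opens (and closes) its own connection on the executor.
        Class.forName("com.mysql.jdbc.Driver").newInstance()
        val conn = DriverManager.getConnection("jdbc:mysql://192.168.0.15:3306/wsmall", "root", "passwd")
        try {
          val stmt = conn.prepareStatement("SELECT id, aa FROM bbb WHERE ? <= name AND name <= ?")
          stmt.setString(1, low)
          stmt.setString(2, high)
          val rs = stmt.executeQuery()
          val rows = ListBuffer[(String, String)]()
          while (rs.next()) {
            rows += ((rs.getString(1), rs.getString(2)))
          }
          rows.toList   // materialize before the connection is closed
        } finally {
          conn.close()
        }
      }
    }

    println(data.collect().toList)
    sc.stop()
  }
}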
Lately I have been busy with the 炼数成金 Spark course and have had little time to tidy up this blog. For friends who want a deeper understanding of Spark, I recommend a friend's blog, http://www.cnblogs.com/cenyuhai/, which has quite a few source-code posts that help with understanding Spark's internals.