Spark JdbcRDD Explained

zhou12314456

Article link: https://blog.csdn.net/zhou12314/article/details/89056051

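This article walks through Spark JdbcRDD's parameters, partitioning strategy, and compute process, and discusses how to write a custom JDBC RDD for specific business needs. Although Spark SQL is the usual choice, JdbcRDD still offers a flexible way to fetch data.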

Parameters

When constructing a JdbcRDD, we pass seven parameters in total:

    val data = new JdbcRDD(
      sc,
      getConnection,
      "select * from table where id >= ? and id <= ?",
      1,
      10,
      2,
      flatValue
    )

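sc: the SparkContext

getConnection: a function that creates the JDBC connection

"select * from table where id >= ? and id <= ?": the SQL statement; the two ? placeholders are filled with each partition's lower and upper id bounds

1: the lower bound of the id range to fetch (inclusive)

10: the upper bound of the id range to fetch (inclusive)

2: the number of partitions

flatValue: a function that converts a ResultSet row into the desired type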
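
To make the bounds and partition count concrete, here is a minimal sketch that computes which id range each partition would bind to the two ? placeholders for the values used above (1, 10, 2). It assumes the inclusive-bounds arithmetic of JdbcRDD's getPartitions; the object PartitionRangesDemo and the helper partitionRanges are hypothetical names used only for illustration and are not part of Spark.

```scala
// Hypothetical demo object; not part of Spark.
object PartitionRangesDemo {

  // Returns the inclusive (start, end) id range for every partition,
  // assuming the same arithmetic as JdbcRDD.getPartitions.
  def partitionRanges(lowerBound: Long,
                      upperBound: Long,
                      numPartitions: Int): Seq[(Long, Long)] = {
    val lo = BigInt(lowerBound)
    val n = BigInt(numPartitions)
    // Both bounds are inclusive, so the id count is upper - lower + 1.
    val length = BigInt(upperBound) - lo + BigInt(1)
    (0 until numPartitions).map { i =>
      val start = lo + (BigInt(i) * length) / n
      // The - 1 keeps the end inclusive without overlapping the next start.
      val end = lo + (BigInt(i + 1) * length) / n - BigInt(1)
      (start.toLong, end.toLong)
    }
  }

  def main(args: Array[String]): Unit = {
    // Same values as the constructor call above: ids 1..10 in 2 partitions.
    partitionRanges(1L, 10L, 2).foreach { case (start, end) =>
      println(s"select * from table where id >= $start and id <= $end")
    }
  }
}
```

With these inputs the first partition reads ids 1 to 5 and the second reads ids 6 to 10, so each task runs the same statement over its own slice of the id range.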

How JdbcRDD partitions the data

override def getPartitions: Array[Partition] = {
    // bounds are inclusive, hence the + 1 here and - 1 on end
    // number of ids between the lower and upper bounds (inclusive)
    val length = BigInt(1) + upperBound - lowerBound
    (0 until numPartitions).map { i =>
      val start = lowerBound + ((i * length) / numPartitions)
      // start (above) is the first id covered by partition i
      val end = lowerBound + (((i + 1) * leng