Partitioned JDBC reads from a database in Spark

When Spark reads from a database over JDBC, large tables must be partitioned, or the job will run slowly. The number of partitions equals the number of tasks and can be checked in the web UI. If, after partitioning, some tasks finish quickly while others lag, the data is skewed; run SQL against the source database to find which ranges hold the most rows, then create more partitions over those ranges to even out the work.
Enough talk; here is the code:
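To locate the skew, one option is to push an aggregation down to the database by passing a subquery in place of the table name. The sketch below is an illustration only, assuming the same GBase table, URL, and `INOUTTIME` column used in the code further down; the `month()`/`cast()` syntax is an assumption about the GBase SQL dialect.

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

object SkewCheck {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder().appName("skewCheck").getOrCreate()
    val prop = new Properties()
    prop.setProperty("user", "root")
    prop.setProperty("password", "root")
    prop.setProperty("driver", "com.gbase.jdbc.Driver")
    val url = "jdbc:gbase://172.18.66.70/lishuaitest"

    // A parenthesized subquery used as the "table" argument is executed
    // server-side, so only the per-month counts travel over the wire.
    val skewQuery =
      """(select month(cast(INOUTTIME as date)) as m, count(*) as cnt
        |   from menjin_1 group by month(cast(INOUTTIME as date))) as t""".stripMargin
    session.read.jdbc(url, skewQuery, prop).orderBy("m").show()
    session.stop()
  }
}
```

Months whose `cnt` is disproportionately large are the ones that deserve extra partitions.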

import java.util.Properties

import org.apache.spark.sql.{SaveMode, SparkSession}

object MppTopN {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder().appName("testMPP").getOrCreate()
    val prop = new Properties()
    prop.setProperty("user","root")
    prop.setProperty("password","root")
    prop.setProperty("driver","com.gbase.jdbc.Driver")
    prop.setProperty("fetchsize","500000")
    val url="jdbc:gbase://172.18.66.70/lishuaitest"
    val tableName="menjin_1"
    // Custom partition predicates. Spark can also partition automatically, but it
    // splits the range uniformly and does not address data skew. September through
    // December carry the most data here, so those months get more partitions.
    // Each predicate is a date range on INOUTTIME.
    val predicates = Array(
      "2015/8/1" -> "2015/8/10",
      "2015/8/11" -> "2015/8/20",
      "2015/8/21" -> "2015/8/30",
      "2015/9/1" -> "2015/9/5",
      "2015/9/6" -> "2015/9/10",
      "2015/9/11" -> "2015/9/15",
      "2015/9/16" -> "2015/9/20",
      "2015/9/21" -> "2015/9/25",
      "2015/9/26" -> "2015/9/30",
      "2015/10/1" -> "2015/10/5",
      "2015/10/6" -> "2015/10/10",
      "2015/10/11" -> "2015/10/13",
      "2015/10/14" -> "2015/10/15",
      "2015/10/16" -> "2015/10/20",
      "2015/10/21" -> "2015/10/23",
      "2015/10/24" -> "2015/10/25",
      "2015/10/26" -> "2015/10/28",
      "2015/10/29" -> "2015/10/31",
      "2015/11/1" -> "2015/11/5",
      "2015/11/6" -> "2015/11/10",
      "2015/11/11" -> "2015/11/15",
      "2015/11/16" -> "2015/11/20",
      "2015/11/21" -> "2015/11/25",
      "2015/11/26" -> "2015/11/30",
      "2015/12/1" -> "2015/12/5",
      "2015/12/6" -> "2015/12/10",
      "2015/12/11" -> "2015/12/15",
      "2015/12/16" -> "2015/12/20",
      "2015/12/21" -> "2015/12/23",
      "2015/12/24" -> "2015/12/25",
      "2015/12/26" -> "2015/12/28",
      "2015/12/29" -> "2015/12/31"
    ).map {
      case (start, end) =>
        s"cast(INOUTTIME as date) >= date '$start' " +
          s"and cast(INOUTTIME as date) <= date '$end'"
    }
    // One task per predicate: each task fetches only the rows
    // matching its date range.
    val dataFrame = session.read.jdbc(url, tableName, predicates, prop)
    dataFrame.createOrReplaceTempView("test")
    val frame = session.sql("select BUILDING from test order by BUILDING limit 50")
    frame.coalesce(1).write.mode(SaveMode.Append).jdbc(url, "top_menjin", prop)
    session.stop()
  }
}
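For comparison, Spark's built-in column-based partitioning works well when the data is roughly uniform. This is a minimal sketch of the `jdbc` overload that takes `columnName`/`lowerBound`/`upperBound`/`numPartitions`; the numeric column `ID` and its bounds are hypothetical, not columns from the original table. Spark splits the bound range into equal strides, which is exactly why it cannot fix skew on its own.

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

object UniformJdbcRead {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder().appName("uniformRead").getOrCreate()
    val prop = new Properties()
    prop.setProperty("user", "root")
    prop.setProperty("password", "root")
    prop.setProperty("driver", "com.gbase.jdbc.Driver")

    // Built-in partitioning: Spark generates numPartitions WHERE clauses
    // with equal strides over [lowerBound, upperBound] on the given column.
    // "ID" and the bounds below are assumptions for illustration.
    val df = session.read.jdbc(
      "jdbc:gbase://172.18.66.70/lishuaitest",
      "menjin_1",
      "ID",       // must be a numeric, date, or timestamp column
      1L,         // lowerBound
      1000000L,   // upperBound
      32,         // numPartitions
      prop)
    println(df.rdd.getNumPartitions)  // one task per partition
    session.stop()
  }
}
```

Rows outside the bounds are still read (the first and last partitions are unbounded on one side); the bounds only control the stride, not a filter.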