spark通过jdbc从数据库中读取数据,如果数据过大,一定要分区,否则运行慢,分区数目可以从webui上看到,分区数就是task数目。如果分区后,有的task很快完成有的task完成比较慢,这时候需要用sql去查找数据库中看看到底哪里发生数据倾斜了,在数据倾斜的地方多建立几个分区,这样会快些。
好了,不多说,上代码
import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}
object MppTopN {
def main(args: Array[String]): Unit = {
val session = SparkSession.builder().appName("testMPP").getOrCreate()
val prop = new Properties()
prop.setProperty("user","root")
prop.setProperty("password","root")
prop.setProperty("driver","com.gbase.jdbc.Driver")
prop.setProperty("fetchsize","500000")
val url="jdbc:gbase://172.18.66.70/lishuaitest"
val tableName="menjin_1"
//这个是自定义分区,也可以让程序自己分区,但是程序自己分区是均匀分区,不会解决数据倾斜的问题,在九十十一十二这四个月数据量比较大,所以多设立几个分区。这里是根据日期去分区
val predicates = Array(
"2015/8/1" -> "2015/8/10",
"2015/8/11" -> "2015/8/20",
"2015/8/21" -> "2015/8/30",
"2015/9/1" -> "2015/9/5",
"2015/9/6" -> "2015/9/10",
"2015/9/11" -> "2015/9/15",
"2015/9/16" -> "2015/9/20",
"2015/9/21" -> "2015/9/25",
"2015/9/26" -> "2015/9/30",
"2015/10/1" -> "2015/10/5",
"2015/10/6" -> "2015/10/10",
"2015/10/11" -> "2015/10/13",
"2015/10/14" -> "2015/10/15",
"2015/10/16" -> "2015/10/20",
"2015/10/21" -> "2015/10/23",
"2015/10/24" -> "2015/10/25",
"2015/10/26" -> "2015/10/28",
"2015/10/29" -> "2015/10/31",
"2015/11/1" -> "2015/11/5",
"2015/11/6" -> "2015/11/10",
"2015/11/11" -> "2015/11/15",
"2015/11/16" -> "2015/11/20",
"2015/11/21" -> "2015/11/25",
"2015/11/26" -> "2015/11/30",
"2015/12/1" -> "2015/12/5",
"2015/12/6" -> "2015/12/10",
"2015/12/11" -> "2015/12/15",
"2015/12/16" -> "2015/12/20",
"2015/12/21" -> "2015/12/23",
"2015/12/24" -> "2015/12/25",
"2015/12/26" -> "2015/12/28",
"2015/12/29" -> "2015/12/31"
).map{
case (start,end) =>
s"cast(INOUTTIME as date) >= date '$start'" + s"and cast(INOUTTIME as date) <= date '$end'"
}
//读的时候对某一字段字段进行分区,然后按照分区读取数据
val dataFrame = session.sqlContext.read.jdbc(url,tableName,predicates,prop)
dataFrame.registerTempTable("test")
val frame = dataFrame.sqlContext.sql("select BUILDING from test order by BUILDING limit 50 ")
frame.coalesce(1).write.mode(SaveMode.Append).jdbc(url,"top_menjin",prop)
session.stop()
}
}