Spark SQL
(1) By default the output is written as a pile of small files, so repartition the data first. One option is to specify a fixed number of partitions:
spark.sql("select *,row_number() over(partition by depId order by salary) rownum from EMP ").repartition(2).write.parquet("hdfs:///user/cuixiaojie/employeeRepartition")
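A related option, as a minimal sketch: coalesce also reduces the partition count but avoids the full shuffle that repartition triggers, which is cheaper when you are only merging partitions. (The sample data and app name here are my own illustration, not from the original notes.)

```scala
import org.apache.spark.sql.SparkSession

object RepartitionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartition-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = (1 to 1000).toDF("id")

    // repartition(2) performs a full shuffle and yields exactly 2 partitions,
    // so the parquet write above produces exactly 2 output files.
    val shuffled = df.repartition(2)

    // coalesce(2) merges existing partitions without a shuffle; prefer it
    // when only *reducing* the partition count.
    val merged = df.coalesce(2)

    println(shuffled.rdd.getNumPartitions) // 2
    println(merged.rdd.getNumPartitions)   // 2
    spark.stop()
  }
}
```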
(2) Alternatively, partition the output by the values of a specific column. Note that DataFrameWriter.partitionBy takes column names, not a partition count:
spark.sql("select *,row_number() over(partition by depId order by salary) rownum from EMP ").write.partitionBy("depId").parquet("hdfs:///user/cuixiaojie/employee")
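partitionBy writes one subdirectory per distinct value of the column (e.g. depId=1/, depId=2/). A minimal sketch of reading the result back, reusing the path from the write above; the app name is my own illustration:

```scala
import org.apache.spark.sql.SparkSession

object ReadPartitioned {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-partitioned")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Layout on HDFS after partitionBy("depId"):
    //   employee/depId=1/part-....parquet
    //   employee/depId=2/part-....parquet
    val df = spark.read.parquet("hdfs:///user/cuixiaojie/employee")

    // Filtering on the partition column lets Spark skip whole directories
    // (partition pruning) instead of scanning every file.
    df.filter($"depId" === 1).show()
  }
}
```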
Some study notes on UDFs
val data = sc.textFile("/home/shiyanlou/uber") // the data file is at the end (of the original notes)
data.first
val hd = data.first()
val datafiltered = data.filter(line => line != hd) // drop the header row
datafiltered.count
case class uber(dispatching_base_number: String, date: String, active_vehicles: Int, trips: Int)
// the original line was cut off after "split("; splitting the CSV on "," and mapping the four fields into the uber case class is the usual pattern here
val df = datafiltered.map(x => x.split(",")).map(x => uber(x(0), x(1), x(2).toInt, x(3).toInt)).toDF()
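The notes break off before the UDF itself. A minimal sketch of defining and registering a UDF over a DataFrame with this schema; the sample rows and the name trips_per_vehicle are my own illustration, not from the original notes:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample rows matching the uber case class schema (illustrative values).
    val df = Seq(
      ("B02512", "1/1/2015", 190, 1132),
      ("B02765", "1/1/2015", 225, 1765)
    ).toDF("dispatching_base_number", "date", "active_vehicles", "trips")

    // A UDF computing trips per active vehicle, usable in the DataFrame API.
    val tripsPerVehicle = udf((trips: Int, vehicles: Int) => trips.toDouble / vehicles)

    df.withColumn("trips_per_vehicle",
        tripsPerVehicle($"trips", $"active_vehicles"))
      .show()

    // Registering the same function makes it callable inside spark.sql(...):
    spark.udf.register("trips_per_vehicle",
      (trips: Int, vehicles: Int) => trips.toDouble / vehicles)
    df.createOrReplaceTempView("uber")
    spark.sql("select *, trips_per_vehicle(trips, active_vehicles) tpv from uber").show()
  }
}
```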