Spark SQL
(1) By default the output is written as many small files; repartition first, specifying a fixed number of partitions:
spark.sql("select *,row_number() over(partition by depId order by salary) rownum from EMP ").repartition(2).write.parquet("hdfs:///user/cuixiaojie/employeeRepartition")
(2) By default the output is written as many small files; alternatively, partition the output by the values of a column:
spark.sql("select *,row_number() over(partition by depId order by salary) rownum from EMP ").write.partitionBy("depId").parquet("hdfs:///user/cuixiaojie/employee")
(note: partitionBy takes column names, not a partition count)
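The difference between the two approaches can be summarized in one sketch (assuming a running SparkSession `spark` with the EMP view registered as above; the paths are the same illustrative ones used in the notes):

```scala
// Sketch: two ways to control output file layout.
val ranked = spark.sql(
  "select *, row_number() over(partition by depId order by salary) rownum from EMP")

// (1) Exactly 2 output files via a full shuffle. When only *reducing* the
//     partition count, coalesce(2) avoids the shuffle and is usually cheaper.
ranked.repartition(2).write.parquet("hdfs:///user/cuixiaojie/employeeRepartition")

// (2) Hive-style layout: one subdirectory per distinct depId value,
//     e.g. .../employee/depId=1/, .../employee/depId=2/, ...
ranked.write.partitionBy("depId").parquet("hdfs:///user/cuixiaojie/employee")
```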
Study notes on UDFs
val data = sc.textFile("/home/shiyanlou/uber") // the data file's contents are listed at the end
data.first
val hd = data.first()
val datafiltered = data.filter(line => line != hd)
datafiltered.count
case class uber(dispatching_base_number:String ,date:String,active_vehicles:Int,trips:Int)
val df = datafiltered.map(x=>x.split(",")).map(x => uber(x(0).toString,x(1),x(2).toInt,x(3).toInt)).toDF
df.registerTempTable("uber")
val pas = (s: String) => {
  val format = new java.text.SimpleDateFormat("MM/dd/yyyy")
  val days = Array("Sun","Mon","Tue","Wed","Thu","Fri","Sat")
  days(format.parse(s).getDay) // getDay: 0 = Sunday
}
sqlContext.udf.register("ppa",pas)
val rep = sqlContext.sql("select dispatching_base_number as dis, ppa(date) as dt ,sum(trips) as cnt from uber group by dispatching_base_number,ppa(date) order by cnt desc")
rep.collect
val rep = sqlContext.sql("select dispatching_base_number as dis, ppa(date) as dt ,sum(active_vehicles) as cnt from uber group by dispatching_base_number,ppa(date) order by cnt desc")
rep.collect
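The `pas` helper above relies on `SimpleDateFormat`, which is not thread-safe. A sketch of the same day-of-week mapping with `java.time` (a suggested alternative, not part of the original notes):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Same mapping as the pas UDF above, but using the thread-safe java.time API.
// Pattern "M/d/yyyy" also accepts single-digit months/days such as "1/1/2015".
val fmt = DateTimeFormatter.ofPattern("M/d/yyyy")

val dayOfWeek = (s: String) =>
  LocalDate.parse(s, fmt).getDayOfWeek.toString.take(3).toLowerCase.capitalize

// dayOfWeek("1/1/2015") returns "Thu"
```

It can be registered exactly like the original: sqlContext.udf.register("ppa", dayOfWeek)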
Use a window function to partition by department, order by salary ascending, and sum each employee's salary with the salaries of the two preceding rows.
employee.json is as follows:
{"name": "zhangsan", "age": 26, "depId": 1, "gender": "male", "salary": 20000}
{"name": "lisi", "age": 36, "depId": 2, "gender": "female", "salary": 8500}
{"name": "wangwu", "age": 23, "depId": 1, "gender": "male", "salary": 5000}
{"name": "zhaoliu", "age": 25, "depId": 3, "gender": "male", "salary": 7000}
{"name": "marry", "age": 19, "depId": 2, "gender": "female", "salary": 6600}
{"name": "Tom", "age": 36, "depId": 1, "gender": "female", "salary": 5000}
{"name": "kitty", "age": 43, "depId": 2, "gender": "female", "salary": 6000}
{"name": "lili", "age": 18, "depId": 3, "gender": "male", "salary": 6900}
{"name": "xiaoming", "age": 37, "depId": 1, "gender": "female", "salary": 3000}
{"name": "laomaotao", "age": 48, "depId": 2, "gender": "male", "salary": 13000}
spark.read.json("hdfs:///user/cuixiaojie/employee.json").createOrReplaceTempView("EMP")
spark.sql("select *,lag(salary,1) over(partition by depId order by salary) as lag1,lag(salary,2) over(partition by depId order by salary) as lag2 from EMP ").createOrReplaceTempView("employeelag1lag2")
spark.sql("select *,nvl(salary,0)+nvl(lag1,0)+nvl(lag2,0) sumsalary from employeelag1lag2").show
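The lag/nvl combination above can also be written as a single windowed sum with an explicit ROWS frame, which Spark SQL supports and which handles the partition-edge NULLs automatically (same result, one pass, no intermediate view):

```sql
select *,
       sum(salary) over(partition by depId order by salary
                        rows between 2 preceding and current row) as sumsalary
from EMP
```

Note that with duplicate salaries (e.g. the two 5000s in depId 1) the tie order within the frame is nondeterministic, exactly as it is for the lag-based version.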
Understanding the spark-shell parameters
spark-shell --master yarn-client --executor-memory 1G --num-executors 1 --executor-cores 1
In yarn-client mode, with en = --num-executors, em = --executor-memory (GB), ec = --executor-cores: Containers Running = en + 1; Memory Used = en * (em + 1) + 1 (GB); VCores Used = en * ec + 1. The trailing +1 terms come from the ApplicationMaster's container and vcore; the per-executor +1 GB appears to be container memory overhead.
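Checking this bookkeeping against the command's values (en = 1, em = 1 GB, ec = 1); the function names are illustrative, not part of any Spark API:

```scala
// Observed YARN accounting from the notes (en = --num-executors,
// em = --executor-memory in GB, ec = --executor-cores).
// The trailing +1 terms are the ApplicationMaster's container/vcore;
// the per-executor +1 GB is container memory overhead.
def containersRunning(en: Int): Int = en + 1
def memoryUsedGB(en: Int, em: Int): Int = en * (em + 1) + 1
def vcoresUsed(en: Int, ec: Int): Int = en * ec + 1

// For the spark-shell command above (en = 1, em = 1, ec = 1):
// containersRunning(1) == 2, memoryUsedGB(1, 1) == 3, vcoresUsed(1, 1) == 2
```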