Spark Study Notes

These notes cover data partitioning with Spark SQL, including repartitioning the output of a row_number() window query and column-based partitioning with partitionBy(). They also walk through defining and registering a UDF, using window functions for a rolling salary sum, and understanding the executor resource parameters passed to spark-shell.

Spark SQL

(1) By default the query result is written out as a pile of small files. To avoid this, repartition the DataFrame to an explicit number of partitions before writing:

spark.sql("select *,row_number() over(partition by depId order by salary) rownum from EMP ").repartition(2).write.parquet("hdfs:///user/cuixiaojie/employeeRepartition")
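
A quick sanity check (a sketch, reusing the output path above): read the Parquet files back and confirm they round-trip; the output directory should contain two part-*.parquet files, one per partition.

val reread = spark.read.parquet("hdfs:///user/cuixiaojie/employeeRepartition")
reread.printSchema   // original columns plus the rownum column
reread.count         // same row count as the EMP table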

(2) Alternatively, partition the output on disk by the values of a column, so each distinct value gets its own subdirectory:

spark.sql("select *,row_number() over(partition by depId order by salary) rownum from EMP ").write.partitionBy("depId").parquet("hdfs:///user/cuixiaojie/employee")
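
Note that partitionBy takes column names, not a partition count. Written this way the output directory holds one subdirectory per department (depId=1, depId=2, ...), and a filter on that column lets Spark prune partitions on read. A minimal sketch against the same path:

val emp = spark.read.parquet("hdfs:///user/cuixiaojie/employee")
emp.where("depId = 1").show()   // only the depId=1 subdirectory is scanned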

 

Notes on UDFs

val data = sc.textFile("/home/shiyanlou/uber")  // the uber data file is given at the end of the original post
data.first
val hd = data.first()
val datafiltered = data.filter(line => line != hd)
datafiltered.count 
case class uber(dispatching_base_number:String ,date:String,active_vehicles:Int,trips:Int)
val df = datafiltered.map(x=>x.split(",")).map(x => uber(x(0).toString,x(1),x(2).toInt,x(3).toInt)).toDF
df.registerTempTable("uber")
// define the day-of-week extractor as a function value
val pas = (s: String) => {
  val format = new java.text.SimpleDateFormat("MM/dd/yyyy")
  val days = Array("Sun","Mon","Tue","Wed","Thu","Fri","Sat")
  days(format.parse(s).getDay)   // Date.getDay: 0 = Sunday ... 6 = Saturday
}
sqlContext.udf.register("ppa",pas)
val rep = sqlContext.sql("select dispatching_base_number as dis, ppa(date) as dt ,sum(trips) as cnt from uber group by dispatching_base_number,ppa(date) order by cnt desc")
rep.collect 
val rep = sqlContext.sql("select dispatching_base_number as dis, ppa(date) as dt ,sum(active_vehicles) as cnt from uber group by dispatching_base_number,ppa(date) order by cnt desc")

rep.collect 
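
The snippet above uses the Spark 1.x API (registerTempTable, sqlContext.udf.register). On Spark 2.x the same UDF can be registered through the SparkSession instead; a minimal sketch, assuming a SparkSession named spark and the same uber temp view:

import java.text.SimpleDateFormat

val dayOfWeek = (s: String) => {
  val format = new SimpleDateFormat("MM/dd/yyyy")
  val days = Array("Sun","Mon","Tue","Wed","Thu","Fri","Sat")
  days(format.parse(s).getDay)
}
spark.udf.register("ppa", dayOfWeek)
spark.sql("select dispatching_base_number as dis, ppa(date) as dt, sum(trips) as cnt from uber group by dispatching_base_number, ppa(date) order by cnt desc").show()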

Window functions: partition by department, order salaries ascending, then for each employee sum their salary with the salaries of the two employees ranked immediately before them.

employee.json is as follows:
{"name": "zhangsan", "age": 26, "depId": 1, "gender": "male", "salary": 20000}
{"name": "lisi", "age": 36, "depId": 2, "gender": "female", "salary": 8500}
{"name": "wangwu", "age": 23, "depId": 1, "gender": "male", "salary": 5000}
{"name": "zhaoliu", "age": 25, "depId": 3, "gender": "male", "salary": 7000}
{"name": "marry", "age": 19, "depId": 2, "gender": "female", "salary": 6600}
{"name": "Tom", "age": 36, "depId": 1, "gender": "female", "salary": 5000}
{"name": "kitty", "age": 43, "depId": 2, "gender": "female", "salary": 6000}
{"name": "lili", "age": 18, "depId": 3, "gender": "male", "salary": 6900}
{"name": "xiaoming", "age": 37, "depId": 1, "gender": "female", "salary": 3000}
{"name": "laomaotao", "age": 48, "depId": 2, "gender": "male", "salary": 13000}

spark.read.json("hdfs:///user/cuixiaojie/employee.json").createOrReplaceTempView("EMP")
spark.sql("select *,lag(salary,1) over(partition by depId order by salary) as lag1,lag(salary,2) over(partition by depId order by salary) as lag2 from EMP ").createOrReplaceTempView("employeelag1lag2")
spark.sql("select *,nvl(salary,0)+nvl(lag1,0)+nvl(lag2,0) sumsalary from employeelag1lag2").show
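
The same rolling three-row sum can also be expressed directly with a window frame, which avoids the intermediate lag columns and the nvl handling of missing leading rows; a sketch against the same EMP view:

spark.sql("select *, sum(salary) over(partition by depId order by salary rows between 2 preceding and current row) as sumsalary from EMP").show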

Understanding the spark-shell resource parameters

spark-shell --master yarn-client --executor-memory 1G --num-executors 1   --executor-cores 1
In yarn-client mode, with en = --num-executors, em = --executor-memory (in GB) and ec = --executor-cores, the YARN UI reports: Containers Running = en + 1; Memory Used = en * (em + 1) + 1 GB; VCores Used = en * ec + 1.
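
For the command above (en = 1, em = 1 GB, ec = 1) that works out to: Containers Running = 1 + 1 = 2, Memory Used = 1 * (1 + 1) + 1 = 3 GB, VCores Used = 1 * 1 + 1 = 2. The extra container, core and GB come from the YARN ApplicationMaster used in client mode; the +1 GB per executor reflects the executor memory overhead rounded up to YARN's allocation granularity. The exact figures depend on cluster settings such as yarn.scheduler.minimum-allocation-mb, so treat the formula as an observed rule of thumb rather than a guarantee.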


