Commonly Used Spark SQL Functions

Latest recommended article published on 2024-05-18 12:43:51

Ashley_JIANG


Article link: https://blog.csdn.net/Jacqueline_JIANG/article/details/115677885


1. if(condition, value_if_true, value_if_false)
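A minimal sketch of if() on a column. The inline table and the column name score are made up for illustration.

```sql
-- if() returns the second argument when the condition is true, else the third
SELECT score, if(score >= 60, 'pass', 'fail') AS result
FROM VALUES (75), (42) AS t(score);
-- 75 -> pass
-- 42 -> fail
```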

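2. parse_url: parse a URL string

parse_url(url, part, key)

part: the URL component to extract, e.g. HOST or QUERY; key is optional and selects a single parameter when part is QUERY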
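The three common call shapes can be sketched as follows. The URL is an arbitrary sample value.

```sql
-- Extract the host
SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST');
-- spark.apache.org

-- Extract the whole query string
SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY');
-- query=1

-- Extract one parameter from the query string
SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY', 'query');
-- 1
```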

3. Data that looks like [uid -> 119024341, currPage -> indexpage, bannerType -> yueke, timestamp -> 1619440226820] is of type map

Look up a value by key with props['presaleId']
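A small sketch of key lookup on a map column. The map is built inline with map() so the query runs on its own; in practice props would be a column of an existing table.

```sql
-- Build a map literal and read two of its values by key
SELECT props['uid'] AS uid, props['currPage'] AS curr_page
FROM (
  SELECT map('uid', '119024341', 'currPage', 'indexpage') AS props
) t;
-- uid = 119024341, curr_page = indexpage
```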

4. nvl(a, b): NULL handling; returns b when a is NULL, otherwise a
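For illustration, with literal values standing in for columns:

```sql
-- The first argument is NULL, so the fallback is returned
SELECT nvl(CAST(NULL AS INT), 0);
-- 0

-- The first argument is not NULL, so it is returned unchanged
SELECT nvl(5, 0);
-- 5
```

coalesce(a, b) behaves the same way for two arguments and also accepts more than two.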

5. get_json_object(context, '$.field'): extract a field from a JSON string
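A short sketch with a made-up JSON string, including a nested path and a path that does not exist:

```sql
-- Top-level and nested fields are addressed with a JSON path
SELECT get_json_object('{"user":{"id":42,"name":"li"}}', '$.user.name');
-- li

-- A path that is not present yields NULL rather than an error
SELECT get_json_object('{"user":{"id":42,"name":"li"}}', '$.user.age');
-- NULL
```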

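6. Truncate a string at a delimiter
substring_index(str, delim, count)
Usage: substring_index(string to truncate, delimiter, number of delimiter occurrences)
Example: select substring_index('blog.jb51.net', '.', 2) as abstract from my_content_t
Result: blog.jb51
(Note: if count is negative, e.g. -2, occurrences are counted from the right, and everything after that delimiter up to the end of the string is returned)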
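The positive and negative cases side by side, using the same sample string:

```sql
-- Positive count: everything before the 2nd '.' counted from the left
SELECT substring_index('blog.jb51.net', '.', 2);
-- blog.jb51

-- Negative count: everything after the 2nd '.' counted from the right
SELECT substring_index('blog.jb51.net', '.', -2);
-- jb51.net

-- A count of -1 is a handy way to take the last segment
SELECT substring_index('blog.jb51.net', '.', -1);
-- net
```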

7. CACHE TABLE table_name

8. Take the maximum (greatest) or minimum (least) across several columns of the same row

https://wenku.baidu.com/view/7f5f0b282f60ddccda38a01f.html
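A minimal sketch on an inline row; the column names a, b, c are placeholders. Note that Spark's greatest and least skip NULL arguments, which differs from some other engines.

```sql
-- Row-wise max and min across three columns
SELECT greatest(a, b, c) AS max_val, least(a, b, c) AS min_val
FROM VALUES (3, 9, 5) AS t(a, b, c);
-- max_val = 9, min_val = 3

-- NULL arguments are skipped
SELECT greatest(1, CAST(NULL AS INT), 3);
-- 3
```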

9. explode drops rows whose array or map is NULL or empty
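The behavior can be sketched with a throwaway view (demo_tags is a hypothetical name). LATERAL VIEW OUTER keeps the row that plain explode would drop.

```sql
CREATE OR REPLACE TEMPORARY VIEW demo_tags AS
SELECT * FROM VALUES
  (1, array('x', 'y')),
  (2, CAST(NULL AS ARRAY<STRING>))
AS t(id, tags);

-- id = 2 disappears because its array is NULL
SELECT id, tag FROM demo_tags LATERAL VIEW explode(tags) e AS tag;
-- (1, x), (1, y)

-- OUTER keeps id = 2 and emits NULL for the exploded column
SELECT id, tag FROM demo_tags LATERAL VIEW OUTER explode(tags) e AS tag;
-- (1, x), (1, y), (2, NULL)
```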

10. UDF

Official Spark SQL function documentation: https://spark.apache.org/docs/latest/api/sql/index.html

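11. Important: NULL values

Table A needs to be filtered to the rows where a is not equal to 'aaa' (column a contains NULLs)

Wrong: select * from A where a != 'aaa' (rows where a is NULL are filtered out as well)

Correct: select * from A where (a != 'aaa' or a is null)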

a	b
a a a	1
	22222
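The difference can be reproduced with a small temporary view (demo_a and its rows are made up for this sketch). Comparing NULL with != yields NULL rather than true, so the row fails the WHERE clause.

```sql
CREATE OR REPLACE TEMPORARY VIEW demo_a AS
SELECT * FROM VALUES
  ('aaa', 1),
  ('bbb', 2),
  (CAST(NULL AS STRING), 3)
AS t(a, b);

-- The NULL row is lost: NULL != 'aaa' evaluates to NULL, not true
SELECT * FROM demo_a WHERE a != 'aaa';
-- (bbb, 2)

-- Explicitly keep NULLs
SELECT * FROM demo_a WHERE a != 'aaa' OR a IS NULL;
-- (bbb, 2), (NULL, 3)

-- Equivalent form using the null-safe equality operator
SELECT * FROM demo_a WHERE NOT (a <=> 'aaa');
-- (bbb, 2), (NULL, 3)
```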

12. ARRAY operations

Build: collect_set(struct(a.lesson_id, b.lesson_title, b.lesson_type_id))

Query: where array_contains(column, 17)  -- 17 is the target value
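A simplified sketch of building an array per group and then filtering on it. The view demo_lessons and its columns are hypothetical, and the array holds plain integers instead of structs to keep the example short.

```sql
CREATE OR REPLACE TEMPORARY VIEW demo_lessons AS
SELECT * FROM VALUES (1, 17), (1, 18), (2, 19) AS t(user_id, lesson_id);

-- Collect the distinct lesson ids per user, then keep users who have lesson 17
SELECT user_id
FROM (
  SELECT user_id, collect_set(lesson_id) AS lesson_ids
  FROM demo_lessons
  GROUP BY user_id
) t
WHERE array_contains(lesson_ids, 17);
-- user_id = 1
```

The element order inside a collect_set result is not guaranteed, so it is safer not to rely on positions within the array.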

13. Rename a table

ALTER TABLE old_table RENAME TO new_table

14. first_value(), last_value()
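A sketch of both functions as window functions, on made-up data. One thing worth knowing: with an ORDER BY and no explicit frame, the default frame ends at the current row, so last_value returns the current row's value. The explicit frame below makes it look at the whole partition.

```sql
SELECT id, ts,
  first_value(ts) OVER (PARTITION BY id ORDER BY ts) AS first_ts,
  last_value(ts) OVER (
    PARTITION BY id ORDER BY ts
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS last_ts
FROM VALUES (1, 10), (1, 20), (1, 30) AS t(id, ts);
-- every row: first_ts = 10, last_ts = 30
```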

15. Get the day of the week

date_format(column, 'u')  -- column is a timestamp
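As far as I know, Spark 3.0 and later reject week-based pattern letters such as 'u' in date_format unless spark.sql.legacy.timeParserPolicy is set to LEGACY, so treat the pattern above as version dependent. The built-in functions below are an alternative sketch that does not depend on pattern letters; note that their numbering differs from each other.

```sql
-- dayofweek: 1 = Sunday ... 7 = Saturday (2009-07-30 is a Thursday)
SELECT dayofweek(DATE '2009-07-30');
-- 5

-- weekday: 0 = Monday ... 6 = Sunday
SELECT weekday(DATE '2009-07-30');
-- 3

-- Monday = 1 ... Sunday = 7
SELECT weekday(DATE '2009-07-30') + 1;
-- 4
```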

16. The struct column type
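A minimal sketch of creating a struct and reading its fields with dot notation. The field names are borrowed from the collect_set example in this article; the values are invented.

```sql
-- named_struct takes alternating field names and values
SELECT s.lesson_id, s.lesson_title
FROM (
  SELECT named_struct('lesson_id', 17, 'lesson_title', 'math') AS s
) t;
-- lesson_id = 17, lesson_title = math
```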

17. ==

select 1 == '1'  -- true

select 1 == 1  -- true

select 1 == '2'  -- false

select 1 == 'jiang'  -- NULL (the string cannot be cast to a number)

18. case when a = 'xx' then 1

when a = 'yy' then 2

else 3 end as column_name

19. Pitfall: row_number() over(partition by trade_order_no order by campus_name desc)

Within a partition, rows that tie on the order by column are numbered in an arbitrary order, so the result can differ from one run to the next
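One way to make the numbering repeatable is to append a unique column to the ORDER BY as a tie-breaker. In this sketch the column id is hypothetical; any column that is unique within the partition would do.

```sql
-- Both rows tie on campus_name; id breaks the tie deterministically
SELECT trade_order_no, campus_name, id,
  row_number() OVER (
    PARTITION BY trade_order_no
    ORDER BY campus_name DESC, id
  ) AS rn
FROM VALUES ('T1', 'north', 1), ('T1', 'north', 2)
  AS t(trade_order_no, campus_name, id);
-- id = 1 -> rn = 1
-- id = 2 -> rn = 2
```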

20. not in

Note: not in excludes rows where the column is NULL
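Both NULL-related surprises of NOT IN can be sketched on an inline table: a NULL on the left side drops the row, and a NULL inside the list makes the predicate never true.

```sql
-- The NULL row is excluded along with x = 1
SELECT x FROM VALUES (1), (2), (CAST(NULL AS INT)) AS t(x)
WHERE x NOT IN (1);
-- 2

-- A NULL in the list means no row can satisfy NOT IN
SELECT x FROM VALUES (1), (2), (CAST(NULL AS INT)) AS t(x)
WHERE x NOT IN (1, NULL);
-- (no rows)

-- Keep NULLs explicitly when they should survive the filter
SELECT x FROM VALUES (1), (2), (CAST(NULL AS INT)) AS t(x)
WHERE x NOT IN (1) OR x IS NULL;
-- 2, NULL
```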

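21. cache does more than speed up computation; skipping it can sometimes produce wrong data

table1:

user_id	course	dt	order_no
001	math	20210701	20210701002
001	math	20210701	20210701003

-- rank the rows of each (user_id, course) group by dt
create or replace temporary view table1_order as
select *
,row_number() over(partition by user_id, course order by dt) px
from table1;

-- keep the first-ranked row of each group
create or replace temporary view table1_part1 as
select user_id, course, dt, order_no
from table1_order
where px = 1;

-- every row that is not in table1_part1
create or replace temporary view table1_part2 as
select a.*
from table1 a
left anti join table1_part1 b on a.order_no = b.order_no; -- first use of table1_part1

create or replace temporary view result as
select * from table1_part1 -- second use of table1_part1
union
select * from table1_part2;

result may end up with only one row.

Cause: if table1_part1 is not cached, it is computed twice. Because both rows have the same dt, the ordering is nondeterministic. The first computation may assign px = 1 to 20210701002, so table1_part2 contains 20210701003; the second computation may assign px = 1 to 20210701003. After union removes duplicates, only 20210701003 is left. The fix is to cache table1_part1 right after it is computed, so that the result is pinned and not computed again.

Background reading: https://www.pianshen.com/article/6444369153/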
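A sketch of the fix, under the assumption that the data looks like the two sample rows above (the column names course and dt are English stand-ins for the original headers). CACHE TABLE ... AS materializes the first-ranked rows once, so both later uses read the same rows. Since cached data can in principle be evicted and recomputed, the sketch also adds order_no as a tie-breaker, which makes the ranking deterministic on its own.

```sql
CREATE OR REPLACE TEMPORARY VIEW table1 AS
SELECT * FROM VALUES
  ('001', 'math', '20210701', '20210701002'),
  ('001', 'math', '20210701', '20210701003')
AS t(user_id, course, dt, order_no);

-- Materialize the first-ranked row per group exactly once
CACHE TABLE table1_part1 AS
SELECT user_id, course, dt, order_no
FROM (
  SELECT *,
    row_number() OVER (PARTITION BY user_id, course ORDER BY dt, order_no) AS px
  FROM table1
) ranked
WHERE px = 1;

-- Rows not picked above, combined back with the picked rows
SELECT * FROM table1_part1
UNION
SELECT a.*
FROM table1 a
LEFT ANTI JOIN table1_part1 b ON a.order_no = b.order_no;
-- two rows: 20210701002 and 20210701003

UNCACHE TABLE table1_part1;
```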

22. Random sampling

order by rand() limit 200 -- pick 200 rows at random

23. The row order of a union all result is not guaranteed

union all

The result may come back in the order b, c, a

https://blog.csdn.net/iteye_423/article/details/82441786
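When the order matters, one common approach is to tag each branch with an explicit sort key and order by it. The column name ord is made up for this sketch.

```sql
-- Each branch carries its own position; the outer ORDER BY enforces it
SELECT v
FROM (
  SELECT 1 AS ord, 'a' AS v
  UNION ALL
  SELECT 2, 'b'
  UNION ALL
  SELECT 3, 'c'
) t
ORDER BY ord;
-- a, b, c
```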

24. 2 - null = null

Fill NULLs before doing arithmetic
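For illustration, with a literal NULL standing in for a nullable column:

```sql
-- Any arithmetic with NULL yields NULL
SELECT 2 - CAST(NULL AS INT);
-- NULL

-- Replace NULL with a default before computing
SELECT 2 - coalesce(CAST(NULL AS INT), 0);
-- 2
```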

Alibaba Cloud data warehouse construction guide: https://help.aliyun.com/document_detail/117437.html?spm=a2c4g.11186623.6.1102.7b7e47b3KLbxVW

How Hive joins work

https://www.cnblogs.com/suanec/p/7560399.html