1 parse_url_tuple解析URL
+----------------------------------------------------+--+
| tab_name |
+----------------------------------------------------+--+
| parse_url_tuple(url, partname1, partname2, ..., partnameN) - extracts N (N>=1) parts from a URL. |
| It takes a URL and one or multiple partnames, and returns a tuple. All the input parameters and output column types are string. |
| Partname: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY_NAME> |
| Note: Partnames are case-sensitive, and should not contain unnecessary white spaces. |
| Example: |
| > SELECT b.* FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT 1; |
| > SELECT parse_url_tuple(a.fullurl, 'HOST', 'PATH', 'QUERY', 'REF', 'PROTOCOL', 'FILE', 'AUTHORITY', 'USERINFO', 'QUERY:k1') as (ho, pa, qu, re, pr, fi, au, us, qk1) from src a; |
它常与LATERAL VIEW搭配使用
2 LATERAL VIEW一行进,多行出
3 regexp_replace替换正则表达式
+----------------------------------------------------+--+
| tab_name |
+----------------------------------------------------+--+
| regexp_replace(str, regexp, rep) - replace all substrings of str that match regexp with rep |
| Example: |
| > SELECT regexp_replace('100-200', '(\d+)', 'num') FROM src LIMIT 1; |
| 'num-num' |
+----------------------------------------------------+--+
4 开窗函数
用于计算基于组的某种聚合值,它和聚合函数的不同之处是:对于每个组返回多行,而聚合函数对于每个组只返回一行。开窗函数指定了分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变化而变化。
常用于分组求topN的问题,例如求取每个用户获取到金额最大的前三位数值
rank over:如果有相同的数据,标号一样,标号根据数据的条数,往上涨
dense_rank() over如果有相同的数,标号一样,标号顺序往上涨
row_number() over如果有相同的数据,标号顺序往上涨
- over函数的写法:
over(partition by column.a order by column.b) 按照column.b排序进行累计,按照column.a分区。
- 开窗的窗口范围:
over(order by sroce range between 5 preceding and 5 following):窗口范围为当前行数据幅度减5加5后的范围内的。
over(order by sroce rows between 5 preceding and 5 following):窗口范围为当前行前后各移动5行
常见搭配使用:
sum()over():分组排序后求总和
first_value() over(),last_value() over()求第一和最后一名
count() over(partition by ... order by ...):求分组后的总数。
max() over(partition by ... order by ...):求分组后的最大值。
min() over(partition by ... order by ...):求分组后的最小值。
avg() over(partition by ... order by ...):求分组后的平均值。