P.S. : all the sql codes in this note are based on hive sql , so may be a little different with 61a sql part, mysql,oracle,etc. most of the content in this note is based on my work experiences.JUST A RECORD
Executing order of SQL
Bacis syntax
CREATE
CREATE (TEMP or TEMPORARY) TABLE table_name (data) ;
CREATE (TEMP or TEMPORARY) TABLE table_name AS SELECT … ;
INSERT
INSERT INTO table_name VALUES() ;
INSERT OVERWRITE table_name VALUES() ;
For partitioned tables, the insert overwrite table operation overwrites the data in the specified partition without affecting the data in other partitions.
For a normal table (that is, a non-partitioned table), the insert overwrite table operation completely overwrites all the data in the table, regardless of any partition. All data will be deleted and replaced with newly inserted data.
JOIN
Separator ‘,’
SELECT a.,b, from a,b ; If a has n rows and b has m rows , the result will be n*m rows.
Do not need the same columns to connect two table.
When a and b do not have the similar key colomns ,we use ‘,’ to connect two tables.
a FULL JOIN b ON a.col_name = b.col_name (and …)
It is different with ‘,’ . If a has n rows and b has m rows , the result will be max(n,m) rows.
ON can be regarded as a conditon statment
a LEFT JOIN b ON a.col_name = b.col_name (and …)
If a has n rows and b has m rows , the result will be n rows.
We can use LEFT JOIN and WHERE to replace INNER JOIN
a RIGHT JOIN b ON a.col_name = b.col_name (and …)
If a has n rows and b has m rows , the result will be m rows.
a INNER JOIN b ON a.col_name = b.col_name (and …)
UNION
UNION SELECT … UNION SELECT …
Automatically remove duplicate data
The data format for each column must be consistent
Certainly the column’s number must be the same.
*UNION ALL * SELECT … UNION ALL SELECT …
Keep the duplicate data
Others are the same as UNION
Conditional Statement
WHERE
SELECT … FROM table_name WHERE …(AND … OR…)
HAVING
SELECT col1,col2… FROM table_name GROUP BY col1,col2… HAVING SUM(col1)>1 OR COUNT(col2)>0 …
**CASE WHEN **
SELECT CASE WHEN col1 IS NOT NULL THEN col1 ELSE ’ ’ END AS new_col1 FROM table_name
SET hivevar:NOW_DATE=TO_DATE('2024-02-18');--sundaySET hivevar:END_DATE=TO_DATE(ADD_MONTHS(${NOW_DATE},+1));--2024-3-18-- eg calculate age DATEDIFF(BIG_DATE,SMALL_DATE) = DAYSSELECT floor(datediff(${NOW_DATE},to_date(birthday))/365)as age ;-- calculate the weekend -2023-01-01 sundaySELECTCASEWHEN pmod(datediff(${NOW_DATE},TO_DATE('2023-01-01')),7)=0THEN'WEEKEND'WHEN pmod(datediff(${NOW_DATE},TO_DATE('2023-01-01')),7)=6THEN'WEEKEND'ELSE'WEEKDAY'ENDAS WEEK_TAG;-- the last day of the monthSELECT LAST_DAY(${NOW_DATE});-- '2024-02-29'-- extract year/month/dateSELECT DATE_FORMAT(${NOW_DATE},'yyyy-MM');-- 2024-02SELECT DATE_FORMAT(${NOW_DATE},'MM');-- 02SELECT DATE_FORMAT(${NOW_DATE},'dd');-- 18-- add days OR subtract days DATE_SUB()SELECT DATE_ADD(LAST_DAY(${NOW_DATE}),1);--2024-03-01-- next monday - sundaySELECT NEXT_DAY(${NOW_DATE},'MO');-- 2024-02-19SELECT NEXT_DAY(${NOW_DATE},'we');--2024-02-21
if we do not give the value of cut_len, it will return all the character after the start index .
SUBSTR(targer_string , start_index, cut_len)
same as SUBSTRING note: some version of sql substring() is different with substr(). The last parameter of substring maybe the end_index. But i have tested in my sql version ,i got the same results using the same parameter’s value in the two function. JUST A NOTICE, maybe it depends on different situations.
Concat & Concat_ws
CONCAT(str1, str2…) or we can use || instead
Make sure that there are no empty strs in concat , or you will got NULL
EXPLODE(MAP(target_row)) : Set each pair of elements in the map as a row, with a column for key and a column for value
SELECT EXPLODE(ARRAY('1','E'));-- will get a new column:-- new_col-- 1-- ESELECT EXPLODE(MAP('A','1','B','2');-- key value-- A 1-- B 2
Lateral View lateral view udtf(expression) tableAlias as columnAlias (,columnAlias)*
udtf(expression) : eg: explore() , spilt()
tableAlias :The name of the virtual table that represents the transformation of the UDTF function
columnAlias:Indicates the virtual field name of the virtual table. If there is one column after splitting, write one column. If there are multiple columns after the split, declare all virtual column names in parentheses in the order of the columns, separated by commas
LOCATION: after FROM before WHERE
DROPTABLEIFEXISTS test_list;CREATETEMPORARYTABLE test_list
(u_id STRING,
a_list STRING)row format delimited
fieldsterminatedby"\t"
collection items terminatedby",";INSERTINTOTABLE test_list VALUES(('A,B'),('L1,L2,L3'));INSERTINTOTABLE test_list VALUES(('C,B,D'),('L4,L5,L6'));select uid_split, new_list
from(select u_id,new_list
from test_list
lateral view explode(split(a_list,",")) tmpTable as new_list
) a
lateral view explode(split(u_id,",")) m as uid_split;-- ans: 15rows * 2 cols
uid_split
new_list
A
L1
B
L1
…
…
C
L6
B
L6
D
L6
POSEXPLODE() change two colums at the same time
DROPTABLEIFEXISTS test_list;CREATETEMPORARYTABLE test_list
(u_id STRING,
a_list STRING)row format delimited
fieldsterminatedby"\t"
collection items terminatedby",";INSERTINTOTABLE test_list VALUES(('A,B'),('L1,L2'));INSERTINTOTABLE test_list VALUES(('C,E,D'),('L4,L5,L6'));select uid_split,new_list
from(select index_l, index_u ,new_list,uid_split
from test_list
lateral view posexplode(split(a_list,",")) ln_1 as index_l,new_list
lateral view posexplode(split(u_id,",")) u_1 as index_u,uid_split) a
where index_l = index_u;-- ans : 5rows * 2cols
* : matching the former character 0 times or infinite times
. : matching arbitrary characters
? : matching the former character 0 times or 1 time
index : indicate that return which part of the results
if index = 0 , return the whole expression
if index in [1 + ] , we will give the specific string as the position before the parentheses and the index will accord to the parentheses’ order and number.
regexp_extract('{"c_id":"D","order_time":"20240506"}','"order_time":(.*?)(")',1)as order_time
-- expected to get '20240505'
if it is a nesting structure , then using $.key1.key2[0]
get_json_object('{"c_id":"D","order_time":"20240506"}',$.order_time)as order_time
-- expected to get '20240505'
Analysis function
Ranking
ROW_NUMBER() OVER( PARTITION BY … ORDER BY …) : 1 2 3 4 …
RANK()OVER(PARTITION BY … ORDER BY …) : 1 2 2 4
DENSE_RANK(PARTITION BY … ORDER BY …) : 1 2 2 3
NTILE(group_num) OVER(PARTITION BY … ORDER BY …) : make sure every group at least one row,and If the slices are uneven, the distribution of the first slice is increased by default.
CUME_DIST()OVER(PARTITION BY … ORDER BY …) : Returns the number of rows less than or equal to the current value/the total number of rows in the group
Lag&Lead
LAG(col_name, n , lag_val):
col_name : target modified column
n : The second argument is the nth line up (optional, default is 1)
lag_val : The third parameter is the default value (if the nth action is NULL, take the default value, if not specified, NULL).
LEAD(col_name, n , lead_val):
lag line up lead line down
First/Last value
FIRST_VALUE( target_col) OVER(PARTITION BY … ORDER BY …) : Take the first value after sorting within the group, up to the current row
LAST_VALUE( target_col) OVER(PARTITION BY … ORDER BY …): After sorting within the group, up to the current row, the last value
DROPTABLEIFEXISTS TEMP_TEST;CREATETEMPORARYTABLE TEMP_TEST ASSELECT EXPLODE(ARRAY('1','2','3','2','3','4'))AS TEST
;SELECT A.*,ROW_NUMBER()OVER(ORDERBY TEST )AS ROW_RK
,RANK()OVER(ORDERBY TEST )AS RK_RK
,DENSE_RANK()OVER(ORDERBY TEST )AS DEN_RK
,round(CUME_DIST()OVER(ORDERBY TEST),2)AS less_or_equ_rt
,LAG(TEST,3,'LAG')OVER(ORDERBY TEST)AS lag_test
,LAG(TEST)OVER(ORDERBY TEST DESC)AS lag_2
,LEAD(TEST,2,'LEAD')OVER(ORDERBY TEST)AS lead_test
,LEAD(TEST)OVER(ORDERBY TEST DESC)AS lead_2
,FIRST_VALUE(TEST)OVER(ORDERBY TEST)AS first_test
,LAST_VALUE(TEST)OVER(ORDERBY TEST )AS last_test
FROM TEMP_TEST A ;
test
row_rk
rk_rk
den_rk
group_2
group_4
less_or_equ_rt
lag_test
lag_2
lead_test
lead_2
first_test
last_test
1
1
1
1
1
1
0.17
LAG
2
2
[NULL]
1
1
2
2
2
2
1
1
0.5
LAG
2
3
1
1
2
2
3
2
2
1
2
0.5
LAG
3
3
2
1
2
3
4
4
3
2
2
0.83
1
3
4
2
1
3
3
5
4
3
2
3
0.83
2
4
LEAD
3
1
3
4
6
6
4
2
4
1
2
[NULL]
LEAD
3
1
4
Handling NULL
NVL( expr1, default_expr): return default_expr if expr1 is null , or return expr1
Using group by to remove duplicate data ,which is faster than using DISTINCT
All dimesions like * c_id, order_year, product* should be occured after the group by statement
grouping__id sorting order
grouping__id : two underlines. grouping__id must be used with grouping set or cube or rollup , which can not be used alone. If we group by 3 columns , the order of grouping__id is as follows: (maybe different version of sql will give different 0 : (col1, col2, col3) 1 : (col1, col2) 2 : (col1, col3) 3 : (col1) 4 : ( col2, col3) 5 : (col2) 6 : (col3) 7: ()
CUBE
They are aggregated according to different dimensions and have no priority
Basic syntax: SELECT col1, col2, col3 FROM temp_table GROUP BY CUBE(col1, col2, col3) or SELECT col1, col2, col3 FROM temp_table GROUP BY col1, col2, col3 WITH CUBE
-- to find the sales of the every year and every productSELECT ORDER_YEAR,PRODUCT,SUM(REVENUE)AS SALES, grouping__id FROM TEST_GROUP GROUPBY ORDER_YEAR,PRODUCT
with cube;SELECT ORDER_YEAR,PRODUCT,SUM(REVENUE)AS SALES, grouping__id FROM TEST_GROUP GROUPBY CUBE(ORDER_YEAR,PRODUCT);
order_year
product
sales
grouping__id
2019
pro_2
40
0
2020
pro_1
70
0
2020
pro_2
140
0
2021
pro_2
50
0
2019
[NULL]
40
1
2020
[NULL]
210
1
2021
[NULL]
50
1
[NULL]
pro_1
70
2
[NULL]
pro_2
230
2
[NULL]
[NULL]
300
3
GROUPING SETS
When we just need some particular dimensions , we can use GROUPING SETS to determine which dimensions to aggregate
GROUPING SET is after GROUP BY
Basic syntax : SELECT col1, col2, col3 FROM temp_table GROUP BY col1, col2, col3 GROUPING SETS ((col1,col2,col3),())
() means taking all the data of the table as an integrity, while it can not be used alone . We can not write GROUPING SETS (()) .
The ROLLUP function is a subset of the CUBE, dominated by the leftmost dimension, aggregated in order.
Basic syntax :SELECT col1, col2, col3 FROM temp_table GROUP BY ROLLUP(col1, col2, col3) or SELECT col1, col2, col3 FROM temp_table GROUP BY col1, col2, col3 WITH ROLLUP
Is equivalent to SELECT col1, col2, col3 FROM temp_table GROUP BY col1, col2, col3 GROUPING SETS((),col1,(col1,col2),(col1,col2,col3))
Exist the priority ,and the left side has the highest priority, which means priority col1>col2>col3
SELECT ORDER_YEAR,PRODUCT,SUM(REVENUE)AS SALES, grouping__id FROM TEST_GROUP
GROUPBY ORDER_YEAR,PRODUCT
WITH ROLLUP;-- is equivalent to SELECT ORDER_YEAR,PRODUCT,SUM(REVENUE)AS SALES, grouping__id FROM TEST_GROUP
GROUPBY ROLLUP(ORDER_YEAR,PRODUCT);-- is equivalent to SELECT ORDER_YEAR,PRODUCT,SUM(REVENUE)AS SALES, grouping__id FROM TEST_GROUP
GROUPBY ORDER_YEAR,PRODUCT
GROUPING SETS((),ORDER_YEAR,(ORDER_YEAR,PRODUCT));
order_year
product
sales
grouping__id
2019
pro_2
40
0
2020
pro_1
70
0
2020
pro_2
140
0
2021
pro_2
50
0
2019
[NULL]
40
1
2020
[NULL]
210
1
2021
[NULL]
50
1
[NULL]
[NULL]
300
3
Example
Based on HIVE SQL
-- Creat a new detail table which contains the customer and the goods they broughtDROPTABLEIFEXISTS ORDER_DETAIL;CREATETEMPORARYTABLE ORDER_DETAIL(
C_ID STRING,
ORDER_DATE DATE,
PRODUCT_NAME STRING,
PRODUCT_UNITS INT,
STAFF STRING
)ROW FORMAT DELIMITED FIELDSTERMINATEDBY',' STORED AS textfile;INSERTINTO ORDER_DETAIL VALUES('A','2024-06-07','DRESS','2','S1');INSERTINTO ORDER_DETAIL VALUES('A','2024-05-08','SHOES','1','S3');INSERTINTO ORDER_DETAIL VALUES('B','2024-06-06','SHIRT','2','S1');INSERTINTO ORDER_DETAIL VALUES('B','2024-05-09','PANTS','1','S2');INSERTINTO ORDER_DETAIL VALUES('C','2024-06-09','PANTS','2','S3');INSERTINTO ORDER_DETAIL VALUES('C','2024-05-08','SHIRT','3','S4');DROPTABLEIFEXISTS PRODUCT_INFO;CREATETEMPORARYTABLE PRODUCT_INFO ASSELECT CAST(''AS STRING)AS PRODUCT_NAME, CAST(''ASFLOAT)AS PRODUCT_PRICE;-- We create a null line in table PRODUCT_INFO , so we can use insert overwrite to delete this useless line.INSERT OVERWRITE TABLE PRODUCT_INFO VALUES('DRESS',200);INSERTINTO PRODUCT_INFO VALUES('SHOES',300);INSERTINTO PRODUCT_INFO VALUES('SHIRT',500);INSERTINTO PRODUCT_INFO VALUES('PANTS',100);-- We found that we lost the data of customer D and the data have not been parsed yet. DROPTABLEIFEXISTS MISSING_RAWDATA;CREATETEMPORARYTABLE MISSING_RAWDATA ASSELECT'{"c_id":"D","order_time":"20240506","buy_detail":[{"product_name":"DRESS","units":"4","client":"S4"}]}'AS json_data
UNIONALLSELECT'{"c_id":"D","order_time":"20240607","buy_detail":[{"product_name":"SHOES","units":"2","client":"S1"}]}'AS json_data
UNIONALLSELECT'{"c_id":"B","order_time":"20240606","buy_detail":[{"product_name":"SHIRT","units":"2","client":"S1"}]}'AS json_data;DROPTABLEIFEXISTS PARSE_JSON;CREATETEMPORARYTABLE PARSE_JSON ASSELECT
regexp_extract(json_data,'"c_id":"(.*?)(")',1)AS C_ID
,get_json_object(json_data,'$.order_time')AS ORDER_DATE
,get_json_object(json_data,'$.buy_detail.product_name[0]')AS PRODUCT_NAME
,regexp_extract(json_data,'"units":"(.*?)(")',1)AS PRODUCT_UNITS
,get_json_object(json_data,'$.buy_detail.client[0]')AS STAFF
FROM MISSING_RAWDATA;-- change the format of order_date from yyyyMMdd to yyyy-MM-ddDROPTABLEIFEXISTS OTHERS_DETAIL;CREATETEMPORARYTABLE OTHERS_DETAIL ASSELECT C_ID
,TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(ORDER_DATE,'yyyyMMdd'),'yyyy-MM-dd'))AS ORDER_DATE
,PRODUCT_NAME
,CAST(PRODUCT_UNITS ASINT)AS PRODUCT_UNITS
,STAFF
FROM PARSE_JSON;DROPTABLEIFEXISTS FINAL_DETAIL;CREATETEMPORARYTABLE FINAL_DETAIL ASSELECT A.*,B.PRODUCT_PRICE
FROM ORDER_DETAIL A
LEFTJOIN PRODUCT_INFO B ON A.PRODUCT_NAME = B.PRODUCT_NAME
UNIONSELECT A.*,B.PRODUCT_PRICE
FROM OTHERS_DETAIL A
LEFTJOIN PRODUCT_INFO B ON A.PRODUCT_NAME = B.PRODUCT_NAME ;SELECT*FROM FINAL_DETAIL;
c_id
order_date
product_name
product_units
staff
product_price
A
2024-06-07
DRESS
2
S1
200
B
2024-05-09
PANTS
1
S2
100
C
2024-05-08
SHIRT
3
S4
500
C
2024-06-09
PANTS
2
S3
100
D
2024-06-06
DRESS
4
S4
200
A
2024-05-08
SHOES
1
S3
300
B
2024-06-06
SHIRT
2
S1
500
D
2024-06-07
SHOES
2
S1
300
-- product assortmentDROPTABLEIFEXISTS PRODUCT_NAME_UNITS_UNION;CREATETEMPORARYTABLE PRODUCT_NAME_UNITS_UNION ASSELECT A.*, PRODUCT_NAME||'×'||PRODUCT_UNITS AS NEW_PRODUCT
,CONCAT(PRODUCT_NAME,'×',PRODUCT_UNITS)AS NEW_PRODUCT1
FROM FINAL_DETAIL A;-- Now the new_product1 keeps the same as new_product DROPTABLEIFEXISTS PRODUCT_ASSORTMENT;CREATETEMPORARYTABLE PRODUCT_ASSORTMENT ASSELECT C_ID,SUM(PRODUCT_PRICE*PRODUCT_UNITS)AS REVENUE,CONCAT_WS(",", SORT_ARRAY(COLLECT_LIST(NEW_PRODUCT)))AS ALL_BUY
FROM PRODUCT_NAME_UNITS_UNION
GROUPBY C_ID;SELECT*from PRODUCT_ASSORTMENT;
c_id
revenue
all_buy
A
700
DRESS×2,SHOES×1
B
1100
PANTS×1,SHIRT×2
C
1700
PANTS×2,SHIRT×3
D
1400
DRESS×2,SHOES×1
-- if we do not want to analyze SHIRT and empty the SHIRTDROPTABLEIFEXISTS NO_SHIRT;CREATETEMPORARYTABLE NO_SHIRT ASSELECT C_ID
,CASEWHEN PRODUCT_NAME IN('DRESS','PANTS','SHOES')THEN PRODUCT_NAME ENDAS NEW_NAME
,CASEWHEN PRODUCT_NAME ='SHIRT'THEN CAST('0'ASINT)ELSE PRODUCT_UNITS ENDAS NEW_UNITS
FROM FINAL_DETAIL;-- Now processing the product assortment using CONCAT AND || may get different results.DROPTABLEIFEXISTS NO_SHIRT_ASSORT;CREATETEMPORARYTABLE NO_SHIRT_ASSORT ASSELECT C_ID,NEW_NAME||'×'||NEW_UNITS AS NEW_PRODUCT
,CONCAT(NEW_NAME,'×',NEW_UNITS)AS NEW_PRODUCT1
FROM NO_SHIRT ;-- so when we use the concat to combine two columns ,we need to make sure there is no empty value in that two columnsSELECT*FROM NO_SHIRT_ASSORT WHERE c_id ='C';
c_id
new_product
new_product_1
C
[NULL]
[NULL]
C
PANTS×2
PANTS×2
-- FIND THE MONTHLY sales champion DROPTABLEIFEXISTS MONTHLY_SALES_BY_STAFF;CREATETEMPORARYTABLE MONTHLY_SALES_BY_STAFF ASSELECT-- substr(ORDER_DATE,6,2)
SUBSTRING(ORDER_DATE,6,2)AS ORDER_MON, STAFF,SUM(PRODUCT_PRICE*PRODUCT_UNITS)AS SALES
FROM FINAL_DETAIL
GROUPBY SUBSTRING(ORDER_DATE,6,2), STAFF;DROPTABLEIFEXISTS MONTHLY_SALES_RANK;CREATETEMPORARYTABLE MONTHLY_SALES_RANK ASSELECT ORDER_MON, STAFF, SALES, DENSE_RANK()OVER(PARTITIONBY ORDER_MON ORDERBY SALES DESC)AS MON_RANK
FROM MONTHLY_SALES_BY_STAFF;SELECT*FROM MONTHLY_SALES_RANK WHERE mon_rank='1';
order_mon
staff
sales
mon_rank
06
S1
2000
1
05
S4
2300
1
-- USE NTILE TO CLUSTERING DROPTABLEIFEXISTS NTILE_TEST;CREATETEMPORARYTABLE NTILE_TEST ASSELECT A.*,NTILE(3)OVER(PARTITIONBY STAFF ORDERBY PRODUCT_PRICE)AS GROUP_1
FROM FINAL_DETAIL A;SELECT c_id,staff,product_price,group_1 FROM NTILE_TEST;