【CS 61a study notes 9】HIVE SQL

最新推荐文章于 2024-07-25 21:03:55 发布

present1012

最新推荐文章于 2024-07-25 21:03:55 发布

阅读量876

点赞数 26

文章标签： hive sql

本文链接：https://blog.csdn.net/present1012/article/details/138537589

版权

P.S. : all the sql codes in this note are based on hive sql , so may be a little different with 61a sql part, mysql,oracle,etc.
most of the content in this note is based on my work experiences.JUST A RECORD

Executing order of SQL

Bacis syntax

CREATE

CREATE (TEMP or TEMPORARY) TABLE table_name (data) ;
CREATE (TEMP or TEMPORARY) TABLE table_name AS SELECT … ;

INSERT

INSERT INTO table_name VALUES() ;
INSERT OVERWRITE table_name VALUES() ;
- For partitioned tables, the insert overwrite table operation overwrites the data in the specified partition without affecting the data in other partitions.
- **INSERT OVERWRITE TABLE table_name PARTITION(target_partition) SELECT … **
- For a normal table (that is, a non-partitioned table), the insert overwrite table operation completely overwrites all the data in the table, regardless of any partition. All data will be deleted and replaced with newly inserted data.

JOIN

Separator ‘,’
- SELECT a.,b, from a,b ; If a has n rows and b has m rows , the result will be n*m rows.
- Do not need the same columns to connect two table.
- When a and b do not have the similar key colomns ,we use ‘,’ to connect two tables.
a FULL JOIN b ON a.col_name = b.col_name (and …)
- It is different with ‘,’ . If a has n rows and b has m rows , the result will be max(n,m) rows.
- ON can be regarded as a conditon statment
a LEFT JOIN b ON a.col_name = b.col_name (and …)
- If a has n rows and b has m rows , the result will be n rows.
- We can use LEFT JOIN and WHERE to replace INNER JOIN
a RIGHT JOIN b ON a.col_name = b.col_name (and …)
- If a has n rows and b has m rows , the result will be m rows.
a INNER JOIN b ON a.col_name = b.col_name (and …)

UNION

UNION SELECT … UNION SELECT …
- Automatically remove duplicate data
- The data format for each column must be consistent
- Certainly the column’s number must be the same.
*UNION ALL * SELECT … UNION ALL SELECT …
- Keep the duplicate data
- Others are the same as UNION

Conditional Statement

WHERE
- SELECT … FROM table_name WHERE …(AND … OR…)
HAVING
- SELECT col1,col2… FROM table_name GROUP BY col1,col2… HAVING SUM(col1)>1 OR COUNT(col2)>0 …
**CASE WHEN **
- SELECT CASE WHEN col1 IS NOT NULL THEN col1 ELSE ’ ’ END AS new_col1 FROM table_name
- Remove duplicates upward
IF(condition, true_return, false_return)
- select if(1<>1,2,3); => 3

Function

About DATE

Change time’s format

yyyyMMdd⇒ yyyy-MM-dd or yyyy-MM-dd => yyyyMMdd

# create
TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(target_date,'yyyyMMdd'),'yyyy-MM-dd')) as new_date
FROM_UNIXTIME(UNIX_TIMESTAMP(target_date,'yyyy-MM-dd'),'yyyyMMdd') as new_string

Change STRING to DATE
- CAST(‘string_time’ AS DATE)

Counting dates

SET hivevar:NOW_DATE=TO_DATE('2024-02-18'); --sunday
SET hivevar:END_DATE=TO_DATE(ADD_MONTHS(${NOW_DATE},+1)); --2024-3-18
-- eg calculate age DATEDIFF(BIG_DATE,SMALL_DATE) = DAYS
SELECT floor(datediff(${NOW_DATE},to_date(birthday))/365) as age ;
-- calculate the weekend -2023-01-01 sunday
SELECT CASE WHEN pmod(datediff(${NOW_DATE},TO_DATE('2023-01-01')),7) = 0 THEN 'WEEKEND'
WHEN pmod(datediff(${NOW_DATE},TO_DATE('2023-01-01')),7) = 6 THEN 'WEEKEND'
ELSE 'WEEKDAY' END AS WEEK_TAG;

-- the last day of the month
SELECT LAST_DAY(${NOW_DATE}) ; -- '2024-02-29'
-- extract year/month/date
SELECT DATE_FORMAT(${NOW_DATE},'yyyy-MM'); -- 2024-02
SELECT DATE_FORMAT(${NOW_DATE},'MM'); -- 02
SELECT DATE_FORMAT(${NOW_DATE},'dd'); -- 18
-- add days OR subtract days DATE_SUB()
SELECT DATE_ADD(LAST_DAY(${NOW_DATE}),1); --2024-03-01
-- next monday - sunday
SELECT NEXT_DAY(${NOW_DATE},'MO'); -- 2024-02-19
SELECT NEXT_DAY(${NOW_DATE},'we'); --2024-02-21

About STRING

Lpad & Rpad

To figure out the problem like ‘2>10’ in string.

Lpad/Rpad(target_str, repeat_times[int], repeat_character)
- select Rpad(‘2’,3,‘0’); => ‘200’
- select Lpad(‘2’,3,‘0’); => ‘002’

Substring & Substr

SUBSTRING(targer_string , start_index, cut_len)
- The start_index is 1 not 0
- if we do not give the value of cut_len, it will return all the character after the start index .
SUBSTR(targer_string , start_index, cut_len)
- same as SUBSTRING
  note: some version of sql substring() is different with substr(). The last parameter of substring maybe the end_index. But i have tested in my sql version ,i got the same results using the same parameter’s value in the two function. JUST A NOTICE, maybe it depends on different situations.

Concat & Concat_ws

CONCAT(str1, str2…) or we can use || instead
- Make sure that there are no empty strs in concat , or you will got NULL
- || is equivalent to CONCAT
CONCAT_WS(separator, str1, str2…)
- separator can not be NULL while str has no limit.

Column to row

CONCAT_WS(separator , COLLECT_LIST/COLLECT_SET(col_name))

collect_list(col_name)
- KEEP the duplicate data
collect_set(col_name)
- REMOVE the duplicate data

Row to column

Simplest but fussy way : Using case when.

EXPLODE()
Just convert one row
- EXPLODE(MAP(target_row)) : Set each pair of elements in the map as a row, with a column for key and a column for value

SELECT EXPLODE(ARRAY('1','E'));
-- will get a new column:
-- new_col
-- 1
-- E
SELECT EXPLODE(MAP('A','1','B','2');
-- key value
-- A     1
-- B     2

Lateral View
lateral view udtf(expression) tableAlias as columnAlias (,columnAlias)*
- udtf(expression) : eg： explore() , spilt()
- tableAlias :The name of the virtual table that represents the transformation of the UDTF function
- columnAlias：Indicates the virtual field name of the virtual table. If there is one column after splitting, write one column. If there are multiple columns after the split, declare all virtual column names in parentheses in the order of the columns, separated by commas
- LOCATION: after FROM before WHERE

DROP TABLE IF EXISTS test_list;
CREATE TEMPORARY TABLE test_list 
(u_id STRING,
a_list STRING)
row format delimited 
fields terminated by "\t"
collection items terminated by ",";

INSERT INTO TABLE test_list VALUES (('A,B'),('L1,L2,L3'));
INSERT INTO TABLE test_list VALUES (('C,B,D'),('L4,L5,L6'));

select uid_split, new_list 
from ( 
	select u_id,new_list
	from test_list
	lateral view explode(split(a_list,",")) tmpTable as new_list
) a
lateral view explode(split(u_id, ",")) m as uid_split;
-- ans: 15rows * 2 cols

uid_split	new_list
A	L1
B	L1
…	…
C	L6
B	L6
D	L6

POSEXPLODE()
change two colums at the same time

DROP TABLE IF EXISTS test_list;
CREATE TEMPORARY TABLE test_list 
(u_id STRING,
a_list STRING)
row format delimited 
fields terminated by "\t"
collection items terminated by ",";

INSERT INTO TABLE test_list VALUES (('A,B'),('L1,L2'));
INSERT INTO TABLE test_list VALUES (('C,E,D'),('L4,L5,L6'));

select uid_split,new_list
from
(select index_l, index_u ,new_list,uid_split
from  test_list
lateral view posexplode(split(a_list,",")) ln_1 as index_l,new_list
lateral view posexplode(split(u_id,",")) u_1 as index_u,uid_split) a
where index_l = index_u;
-- ans : 5rows * 2cols

uid_split	new_list
A	L1
B	L2
C	L4
E	L5
D	L6

Extracting data

Using regular expression

Func regexp_extract : regexp_extract ( str , regexp, index)
- str: target string or target column’s name
- regexp : regular expression
  - * : matching the former character 0 times or infinite times
  - . : matching arbitrary characters
  - ? : matching the former character 0 times or 1 time
- index : indicate that return which part of the results
  - if index = 0 , return the whole expression
  - if index in [1 + ] , we will give the specific string as the position before the parentheses and the index will accord to the parentheses’ order and number.

regexp_extract('{"c_id":"D","order_time":"20240506"}','"order_time":(.*?)(")',1) as order_time
-- expected to get '20240505'

Using json data

Func get_json_object : get_json_object(str,json_path)
- str: target string or target column’s name
- json_path : using $.key
- if it is a nesting structure , then using $.key1.key2[0]

get_json_object('{"c_id":"D","order_time":"20240506"}',$.order_time) as order_time
-- expected to get '20240505'

Analysis function

Ranking

ROW_NUMBER() OVER( PARTITION BY … ORDER BY …) : 1 2 3 4 …
RANK()OVER(PARTITION BY … ORDER BY …) : 1 2 2 4
DENSE_RANK(PARTITION BY … ORDER BY …) : 1 2 2 3
NTILE(group_num) OVER(PARTITION BY … ORDER BY …) : make sure every group at least one row,and If the slices are uneven, the distribution of the first slice is increased by default.
CUME_DIST()OVER(PARTITION BY … ORDER BY …) : Returns the number of rows less than or equal to the current value/the total number of rows in the group

Lag&Lead

LAG(col_name, n , lag_val):
- col_name : target modified column
- n : The second argument is the nth line up (optional, default is 1)
- lag_val : The third parameter is the default value (if the nth action is NULL, take the default value, if not specified, NULL).
LEAD(col_name, n , lead_val):
- lag line up lead line down

First/Last value

FIRST_VALUE( target_col) OVER(PARTITION BY … ORDER BY …) : Take the first value after sorting within the group, up to the current row
LAST_VALUE( target_col) OVER(PARTITION BY … ORDER BY …): After sorting within the group, up to the current row, the last value

DROP TABLE IF EXISTS TEMP_TEST;
CREATE TEMPORARY TABLE TEMP_TEST AS
SELECT EXPLODE(ARRAY('1','2','3','2','3','4')) AS TEST
;

SELECT A.* ,ROW_NUMBER()OVER(ORDER BY TEST ) AS ROW_RK
,RANK()OVER(ORDER BY TEST ) AS RK_RK
,DENSE_RANK()OVER(ORDER BY TEST ) AS DEN_RK
,round(CUME_DIST()OVER(ORDER BY TEST),2) AS less_or_equ_rt 
,LAG(TEST,3,'LAG')OVER(ORDER BY TEST) AS lag_test
,LAG(TEST)OVER(ORDER BY TEST DESC) AS lag_2
,LEAD(TEST,2,'LEAD') OVER(ORDER BY TEST) AS lead_test
,LEAD(TEST)OVER(ORDER BY TEST DESC) AS lead_2
,FIRST_VALUE(TEST)OVER(ORDER BY TEST) AS first_test
,LAST_VALUE(TEST)OVER(ORDER BY TEST ) AS last_test
FROM TEMP_TEST A ;

test	row_rk	rk_rk	den_rk	group_2	group_4	less_or_equ_rt	lag_test	lag_2	lead_test	lead_2	first_test	last_test
1	1	1	1	1	1	0.17	LAG	2	2	[NULL]	1	1
2	2	2	2	1	1	0.5	LAG	2	3	1	1	2
2	3	2	2	1	2	0.5	LAG	3	3	2	1	2
3	4	4	3	2	2	0.83	1	3	4	2	1	3
3	5	4	3	2	3	0.83	2	4	LEAD	3	1	3
4	6	6	4	2	4	1	2	[NULL]	LEAD	3	1	4

Handling NULL

NVL( expr1, default_expr): return default_expr if expr1 is null , or return expr1
- parameters can be number string date etc…

SELECT NVL(NULL,'TEST'); -- 'TEST'
SELECT NVL('',NULL); -- ''
SELECT NVL(NULL,NULL); -- NULL

NULLIF(target_num , compare_num) : return null if target_num = compare_num , else return target_num
- is a shorthand for : CASE WHEN target_num = compare_num THEN NULL else target_num

SELECT 12/NULLIF(0,0) ; -- NULL

COALESCE(exp1,expr2,expr3…) : check the expression by order and return the first non-null value . if all the expressions are null ,then return null

SELECT COALESCE(NULL,NULL,NULL); --NULL
SELECT COALESCE(0,NULL,1); --0
SELECT COALESCE(NULL,NULL,NULL,'first','second'); --'first'
SELECT COALESCE(NULL,12/NULLIF(0,0),'third'); --'third'

GROUP BY

basic syntax:

-- example data
CREATE TEMPORARY TABLE TEST_GROUP(
C_ID   string  comment 'CUSTOMER ID',
ORDER_YEAR  string  ,
PRODUCT string,
REVENUE  int comment 'Amount of consumption'); 

INSERT OVERWRITE TABLE  TEST_GROUP VALUES
('A','2020','pro_1','50'),
('A','2020','pro_2','40'),
('A','2019','pro_2','40'),
('B','2020','pro_1','20'),
('B','2021','pro_2','50'),
('C','2020','pro_2','100');

select c_id, order_year, product-- dimensions
,sum(revenue) 
from TEST_GROUP
group by c_id, order_year, product ;

Using group by to remove duplicate data ,which is faster than using DISTINCT
All dimesions like * c_id, order_year, product* should be occured after the group by statement

grouping__id sorting order

grouping__id : two underlines. grouping__id must be used with grouping set or cube or rollup , which can not be used alone.
If we group by 3 columns , the order of grouping__id is as follows: (maybe different version of sql will give different
0 : (col1, col2, col3)
1 : (col1, col2)
2 : (col1, col3)
3 : (col1)
4 : ( col2, col3)
5 : (col2)
6 : (col3)
7: ()

CUBE

They are aggregated according to different dimensions and have no priority
Basic syntax: SELECT col1, col2, col3 FROM temp_table GROUP BY CUBE(col1, col2, col3) or SELECT col1, col2, col3 FROM temp_table GROUP BY col1, col2, col3 WITH CUBE

-- to find the sales of  the every year and every product
SELECT ORDER_YEAR,PRODUCT,SUM(REVENUE) AS SALES, grouping__id FROM TEST_GROUP GROUP BY ORDER_YEAR,PRODUCT
with cube;
SELECT ORDER_YEAR,PRODUCT,SUM(REVENUE) AS SALES, grouping__id  FROM TEST_GROUP GROUP BY CUBE(ORDER_YEAR,PRODUCT);

order_year	product	sales	grouping__id
2019	pro_2	40	0
2020	pro_1	70	0
2020	pro_2	140	0
2021	pro_2	50	0
2019	[NULL]	40	1
2020	[NULL]	210	1
2021	[NULL]	50	1
[NULL]	pro_1	70	2
[NULL]	pro_2	230	2
[NULL]	[NULL]	300	3

GROUPING SETS

When we just need some particular dimensions , we can use GROUPING SETS to determine which dimensions to aggregate
GROUPING SET is after GROUP BY
Basic syntax : SELECT col1, col2, col3 FROM temp_table GROUP BY col1, col2, col3 GROUPING SETS ((col1,col2,col3),())
() means taking all the data of the table as an integrity, while it can not be used alone . We can not write GROUPING SETS (()) .

SELECT c_id, order_year, product, sum(revenue) as sales, grouping__id 
FROM test_group
GROUP BY c_id, order_year, product
GROUPING SETS((),(c_id,order_year),product)
ORDER BY grouping__id ;

c_id	order_year	product	sales	grouping__id
A	2019	[NULL]	40	1
B	2020	[NULL]	20	1
C	2020	[NULL]	100	1
A	2020	[NULL]	90	1
B	2021	[NULL]	50	1
[NULL]	[NULL]	pro_1	70	6
[NULL]	[NULL]	pro_2	230	6
[NULL]	[NULL]	[NULL]	300	7

ROLLUP

The ROLLUP function is a subset of the CUBE, dominated by the leftmost dimension, aggregated in order.
Basic syntax :SELECT col1, col2, col3 FROM temp_table GROUP BY ROLLUP(col1, col2, col3) or SELECT col1, col2, col3 FROM temp_table GROUP BY col1, col2, col3 WITH ROLLUP
Is equivalent to SELECT col1, col2, col3 FROM temp_table GROUP BY col1, col2, col3 GROUPING SETS((),col1,(col1,col2),(col1,col2,col3))
Exist the priority ,and the left side has the highest priority, which means priority col1>col2>col3

SELECT ORDER_YEAR,PRODUCT,SUM(REVENUE) AS SALES, grouping__id FROM TEST_GROUP 
GROUP BY ORDER_YEAR,PRODUCT
WITH ROLLUP;
-- is equivalent to 
SELECT ORDER_YEAR,PRODUCT,SUM(REVENUE) AS SALES, grouping__id  FROM TEST_GROUP 
GROUP BY ROLLUP(ORDER_YEAR,PRODUCT);
-- is equivalent to 
SELECT  ORDER_YEAR,PRODUCT,SUM(REVENUE) AS SALES, grouping__id  FROM TEST_GROUP 
GROUP BY ORDER_YEAR,PRODUCT
GROUPING SETS((),ORDER_YEAR,(ORDER_YEAR,PRODUCT));

order_year	product	sales	grouping__id
2019	pro_2	40	0
2020	pro_1	70	0
2020	pro_2	140	0
2021	pro_2	50	0
2019	[NULL]	40	1
2020	[NULL]	210	1
2021	[NULL]	50	1
[NULL]	[NULL]	300	3

Example

Based on HIVE SQL

-- Creat a new detail table which contains the customer and the goods they brought
DROP TABLE IF EXISTS ORDER_DETAIL;
CREATE TEMPORARY TABLE ORDER_DETAIL(
C_ID STRING,
ORDER_DATE DATE,
PRODUCT_NAME STRING,
PRODUCT_UNITS INT,
STAFF STRING
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS textfile;


INSERT INTO ORDER_DETAIL VALUES ('A', '2024-06-07' ,'DRESS' ,'2', 'S1');
INSERT INTO ORDER_DETAIL VALUES ('A', '2024-05-08' ,'SHOES' ,'1', 'S3');
INSERT INTO ORDER_DETAIL VALUES ('B', '2024-06-06' ,'SHIRT' ,'2', 'S1');
INSERT INTO ORDER_DETAIL VALUES ('B', '2024-05-09' ,'PANTS' ,'1','S2');
INSERT INTO ORDER_DETAIL VALUES ('C', '2024-06-09' ,'PANTS' ,'2','S3');
INSERT INTO ORDER_DETAIL VALUES ('C', '2024-05-08' ,'SHIRT' ,'3','S4');

DROP TABLE IF EXISTS PRODUCT_INFO;
CREATE TEMPORARY TABLE PRODUCT_INFO AS 
SELECT CAST('' AS STRING) AS PRODUCT_NAME, CAST('' AS FLOAT) AS PRODUCT_PRICE;

-- We create a null line in table PRODUCT_INFO , so we can use insert overwrite to delete this useless line.
INSERT OVERWRITE TABLE PRODUCT_INFO VALUES('DRESS',200);
INSERT INTO PRODUCT_INFO VALUES('SHOES',300);
INSERT INTO PRODUCT_INFO VALUES('SHIRT',500);
INSERT INTO PRODUCT_INFO VALUES('PANTS',100);

-- We found that we lost the data of customer D and the data have not been parsed yet. 

DROP TABLE IF EXISTS MISSING_RAWDATA;
CREATE TEMPORARY TABLE MISSING_RAWDATA AS 
SELECT '{"c_id":"D","order_time":"20240506","buy_detail":[{"product_name":"DRESS","units":"4","client":"S4"}]}' AS json_data
UNION ALL
SELECT '{"c_id":"D","order_time":"20240607","buy_detail":[{"product_name":"SHOES","units":"2","client":"S1"}]}' AS json_data
UNION ALL
SELECT '{"c_id":"B","order_time":"20240606","buy_detail":[{"product_name":"SHIRT","units":"2","client":"S1"}]}' AS json_data;

DROP TABLE IF EXISTS PARSE_JSON;
CREATE TEMPORARY TABLE PARSE_JSON AS 
SELECT 
regexp_extract(json_data, '"c_id":"(.*?)(")', 1) AS C_ID
,get_json_object(json_data, '$.order_time') AS ORDER_DATE
,get_json_object(json_data, '$.buy_detail.product_name[0]') AS PRODUCT_NAME
,regexp_extract(json_data, '"units":"(.*?)(")',1) AS PRODUCT_UNITS
,get_json_object(json_data, '$.buy_detail.client[0]') AS STAFF
FROM MISSING_RAWDATA;


-- change the format of order_date from yyyyMMdd to yyyy-MM-dd

DROP TABLE IF EXISTS OTHERS_DETAIL;
CREATE TEMPORARY TABLE OTHERS_DETAIL AS 
SELECT C_ID
,TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(ORDER_DATE,'yyyyMMdd'),'yyyy-MM-dd')) AS ORDER_DATE
,PRODUCT_NAME
,CAST(PRODUCT_UNITS AS INT) AS PRODUCT_UNITS
,STAFF
FROM PARSE_JSON;


DROP TABLE IF EXISTS FINAL_DETAIL;
CREATE TEMPORARY TABLE FINAL_DETAIL AS 
SELECT A.* ,B.PRODUCT_PRICE
FROM ORDER_DETAIL A 
LEFT JOIN PRODUCT_INFO B ON A.PRODUCT_NAME = B.PRODUCT_NAME 
UNION 
SELECT A.*,B.PRODUCT_PRICE
FROM OTHERS_DETAIL A 
LEFT JOIN PRODUCT_INFO B ON A.PRODUCT_NAME = B.PRODUCT_NAME ;


SELECT * FROM FINAL_DETAIL;

c_id	order_date	product_name	product_units	staff	product_price
A	2024-06-07	DRESS	2	S1	200
B	2024-05-09	PANTS	1	S2	100
C	2024-05-08	SHIRT	3	S4	500
C	2024-06-09	PANTS	2	S3	100
D	2024-06-06	DRESS	4	S4	200
A	2024-05-08	SHOES	1	S3	300
B	2024-06-06	SHIRT	2	S1	500
D	2024-06-07	SHOES	2	S1	300

-- product assortment

DROP TABLE IF EXISTS PRODUCT_NAME_UNITS_UNION;
CREATE TEMPORARY TABLE PRODUCT_NAME_UNITS_UNION AS 
SELECT A.*, PRODUCT_NAME||'×'||PRODUCT_UNITS AS NEW_PRODUCT
,CONCAT(PRODUCT_NAME, '×' ,PRODUCT_UNITS) AS NEW_PRODUCT1
FROM FINAL_DETAIL A;

-- Now the new_product1 keeps the same as new_product 

DROP TABLE IF EXISTS PRODUCT_ASSORTMENT;
CREATE TEMPORARY TABLE PRODUCT_ASSORTMENT AS  
SELECT C_ID, SUM(PRODUCT_PRICE*PRODUCT_UNITS) AS REVENUE,CONCAT_WS("," , SORT_ARRAY(COLLECT_LIST(NEW_PRODUCT))) AS ALL_BUY
FROM PRODUCT_NAME_UNITS_UNION
GROUP BY C_ID;

SELECT * from PRODUCT_ASSORTMENT;

c_id	revenue	all_buy
A	700	DRESS×2,SHOES×1
B	1100	PANTS×1,SHIRT×2
C	1700	PANTS×2,SHIRT×3
D	1400	DRESS×2,SHOES×1

-- if we do not want to analyze SHIRT and empty the SHIRT

DROP TABLE IF EXISTS NO_SHIRT;
CREATE TEMPORARY TABLE NO_SHIRT AS 
SELECT C_ID
,CASE WHEN PRODUCT_NAME IN ('DRESS','PANTS','SHOES') THEN PRODUCT_NAME END AS NEW_NAME
,CASE WHEN PRODUCT_NAME = 'SHIRT' THEN CAST('0' AS INT) ELSE PRODUCT_UNITS END AS NEW_UNITS
FROM FINAL_DETAIL;

-- Now processing the product assortment using CONCAT AND || may get different results.
DROP TABLE IF EXISTS NO_SHIRT_ASSORT;
CREATE TEMPORARY TABLE NO_SHIRT_ASSORT AS 
SELECT C_ID,NEW_NAME||'×'||NEW_UNITS AS NEW_PRODUCT
,CONCAT(NEW_NAME, '×' ,NEW_UNITS) AS NEW_PRODUCT1
FROM NO_SHIRT ;
-- so when we use the concat to combine two columns ,we need to make sure there is no empty value in that two columns
SELECT * FROM NO_SHIRT_ASSORT WHERE c_id = 'C';

c_id	new_product	new_product_1
C	[NULL]	[NULL]
C	PANTS×2	PANTS×2

-- FIND THE MONTHLY sales champion 
DROP TABLE IF EXISTS MONTHLY_SALES_BY_STAFF;
CREATE TEMPORARY TABLE MONTHLY_SALES_BY_STAFF AS 
SELECT 
-- substr(ORDER_DATE,6,2)
SUBSTRING(ORDER_DATE,6,2) AS ORDER_MON, STAFF, SUM(PRODUCT_PRICE*PRODUCT_UNITS) AS SALES
FROM FINAL_DETAIL
GROUP BY SUBSTRING(ORDER_DATE,6,2), STAFF;

DROP TABLE IF EXISTS MONTHLY_SALES_RANK;
CREATE TEMPORARY TABLE MONTHLY_SALES_RANK AS 
SELECT ORDER_MON, STAFF, SALES, DENSE_RANK()OVER(PARTITION BY ORDER_MON ORDER BY SALES DESC) AS MON_RANK
FROM MONTHLY_SALES_BY_STAFF;

SELECT * FROM MONTHLY_SALES_RANK WHERE mon_rank='1';

order_mon	staff	sales	mon_rank
06	S1	2000	1
05	S4	2300	1

-- USE NTILE TO CLUSTERING 
DROP TABLE IF EXISTS NTILE_TEST;
CREATE TEMPORARY TABLE NTILE_TEST AS 
SELECT A.* ,NTILE(3)OVER(PARTITION BY STAFF ORDER BY PRODUCT_PRICE) AS GROUP_1
FROM FINAL_DETAIL A;

SELECT c_id,staff,product_price,group_1 FROM NTILE_TEST;

c_id	staff	product_price	group_1
A	S1	200	1
D	S1	300	2
B	S1	500	3
B	S2	100	1
C	S3	100	1
A	S3	300	2
D	S4	200	1
C	S4	500	2


DROP TABLE IF EXISTS PRODUCT_LIST;
CREATE TEMPORARY TABLE PRODUCT_LIST 
(PRODUCT_LINE STRING,
ALL_DETAIL array<STRING>)
row format delimited 
fields terminated by "\t"
collection items terminated by ",";

INSERT INTO TABLE PRODUCT_LIST VALUES ('LINE1',ARRAY('SHOES','PANTS'));
INSERT INTO TABLE PRODUCT_LIST VALUES ('LINE2',ARRAY('SHIRT','DRESS','SKIRT'));

SELECT * FROM PRODUCT_LIST;
--SELECT EXPLODE(ALL_DETAIL) FROM PRODUCT_LIST;

product_line	all_detail
LINE1	[“SHOES”,“PANTS”]
LINE2	[“SHIRT”,“DRESS”,“SKIRT”]

SELECT PRODUCT_LINE,DETAIL_NAME 
FROM PRODUCT_LIST
LATERAL VIEW EXPLODE(ALL_DETAIL) TMPTABLE AS DETAIL_NAME;

product_line	detail_name
LINE1	SHOES
LINE1	PANTS
LINE2	SHIRT
LINE2	DRESS
LINE2	SKIRT