【CS 61a study notes 9】HIVE SQL

P.S.: all the SQL code in this note is written in Hive SQL, so it may differ slightly from the 61A SQL material, MySQL, Oracle, etc.
Most of the content in this note is based on my work experience. JUST A RECORD.

Executing order of SQL

From
Join
On
Where
Group by
Avg/Sum
Having
Select
Distinct
Order by
Limit
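
The written order of a query differs from this execution order. A minimal sketch below maps each clause to the step in which it runs; the table and column names (orders_o, regions_r, amount, region_name) are hypothetical, just for illustration:

-- sketch only: orders_o and regions_r with their columns are made-up names
SELECT r.region_name, SUM(o.amount) AS total   -- 7) SELECT (DISTINCT would run after this)
FROM orders_o o                                -- 1) FROM
JOIN regions_r r                               -- 2) JOIN
  ON o.region_id = r.region_id                 -- 3) ON
WHERE o.amount > 0                             -- 4) WHERE
GROUP BY r.region_name                         -- 5) GROUP BY (aggregates such as SUM/AVG run here)
HAVING SUM(o.amount) > 100                     -- 6) HAVING
ORDER BY total DESC                            -- 9) ORDER BY
LIMIT 10;                                      -- 10) LIMIT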

Basic syntax

CREATE

  • CREATE (TEMP or TEMPORARY) TABLE table_name (col1 type1, col2 type2, …) ;
  • CREATE (TEMP or TEMPORARY) TABLE table_name AS SELECT … ;

INSERT

  • INSERT INTO table_name VALUES(…) ;
  • INSERT OVERWRITE TABLE table_name VALUES(…) ;
    • For partitioned tables, INSERT OVERWRITE replaces only the data in the specified partition; data in other partitions is untouched (see the sketch below).
    • **INSERT OVERWRITE TABLE table_name PARTITION(target_partition) SELECT …**
    • For a normal (non-partitioned) table, INSERT OVERWRITE replaces all the data in the table: every existing row is deleted and replaced with the newly inserted data.
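
A minimal sketch of the partition case; the table names (sales_detail, sales_staging) and the dt partition column are made up for illustration:

-- only the dt='2024-02-18' partition is replaced; other partitions keep their data
INSERT OVERWRITE TABLE sales_detail PARTITION (dt='2024-02-18')
SELECT c_id, product_name, revenue
FROM sales_staging
WHERE dt = '2024-02-18';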

JOIN

  • Separator ','
    • SELECT a.*, b.* FROM a, b ; if a has n rows and b has m rows, the result will be n*m rows (a cross join).
    • No join key is needed to connect the two tables.
    • When a and b share no comparable key columns, we can use ',' to connect the two tables.
  • a FULL JOIN b ON a.col_name = b.col_name (AND …)
    • Different from ',' : matched rows are combined, and unmatched rows from either side appear with NULLs on the other side, so the result has at least max(n,m) rows.
    • ON can be regarded as the join condition.
  • a LEFT JOIN b ON a.col_name = b.col_name (AND …)
    • All n rows of a are kept; if the join key is unique in b, the result has exactly n rows.
    • We can use LEFT JOIN plus a WHERE filter on b's columns to get the same effect as INNER JOIN.
  • a RIGHT JOIN b ON a.col_name = b.col_name (AND …)
    • All m rows of b are kept; if the join key is unique in a, the result has exactly m rows.
  • a INNER JOIN b ON a.col_name = b.col_name (AND …) : only the matched rows are kept. A comparison of the join types is sketched below.
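
A small comparison of the join types on two throwaway tables (a sketch; the temporary tables a and b are made up for illustration):

-- a has 3 rows (1,2,3), b has 3 rows (2,3,9); ids 2 and 3 match
CREATE TEMPORARY TABLE a AS SELECT EXPLODE(ARRAY(1,2,3)) AS id;
CREATE TEMPORARY TABLE b AS SELECT EXPLODE(ARRAY(2,3,9)) AS id;

SELECT COUNT(*) FROM a, b;                               -- 9 (3*3 cross join)
SELECT COUNT(*) FROM a INNER JOIN b ON a.id = b.id;      -- 2 (ids 2 and 3)
SELECT COUNT(*) FROM a LEFT JOIN b ON a.id = b.id;       -- 3 (all rows of a)
SELECT COUNT(*) FROM a RIGHT JOIN b ON a.id = b.id;      -- 3 (all rows of b)
SELECT COUNT(*) FROM a FULL JOIN b ON a.id = b.id;       -- 4 (2,3 matched; 1 and 9 unmatched)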

UNION

  • UNION : SELECT … UNION SELECT …
    • Automatically removes duplicate rows
    • The data type of each column must be consistent across the SELECTs
    • The number of columns must also be the same.
  • UNION ALL : SELECT … UNION ALL SELECT …
    • Keeps duplicate rows (see the comparison sketched below)
    • Otherwise the same as UNION
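
A quick sketch of the difference using constant selects:

-- both sides have the same columns and types, as required
SELECT 1 AS id UNION     SELECT 1 AS id;   -- 1 row  : duplicates removed
SELECT 1 AS id UNION ALL SELECT 1 AS id;   -- 2 rows : duplicates kept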

Conditional Statement

  • WHERE
    • SELECT … FROM table_name WHERE …(AND … OR…)
  • HAVING
    • SELECT col1,col2… FROM table_name GROUP BY col1,col2… HAVING SUM(col1)>1 OR COUNT(col2)>0 …
  • **CASE WHEN**
    • SELECT CASE WHEN col1 IS NOT NULL THEN col1 ELSE ' ' END AS new_col1 FROM table_name
    • WHEN branches are evaluated from top to bottom; if several conditions overlap, only the first matching branch takes effect (see the sketch after this list)
  • IF(condition, true_return, false_return)
    • select if(1<>1,2,3); => 3
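
A small sketch of the top-to-bottom evaluation mentioned above (the values are constants, just for illustration):

-- 85 satisfies both >= 80 and >= 60, but only the first matching branch is returned
SELECT CASE WHEN 85 >= 80 THEN 'high'
            WHEN 85 >= 60 THEN 'pass'
            ELSE 'fail' END AS score_tag;   -- 'high'
-- swap the branch order and the overlapping value falls into the earlier branch instead
SELECT CASE WHEN 85 >= 60 THEN 'pass'
            WHEN 85 >= 80 THEN 'high'
            ELSE 'fail' END AS score_tag;   -- 'pass'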

Function

About DATE

Change time’s format

  • yyyyMMdd => yyyy-MM-dd or yyyy-MM-dd => yyyyMMdd
-- yyyyMMdd => yyyy-MM-dd
TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(target_date,'yyyyMMdd'),'yyyy-MM-dd')) as new_date
-- yyyy-MM-dd => yyyyMMdd
FROM_UNIXTIME(UNIX_TIMESTAMP(target_date,'yyyy-MM-dd'),'yyyyMMdd') as new_string
  • Change STRING to DATE
    • CAST('string_time' AS DATE)

Counting dates

SET hivevar:NOW_DATE=TO_DATE('2024-02-18'); --sunday
SET hivevar:END_DATE=TO_DATE(ADD_MONTHS(${NOW_DATE},+1)); --2024-3-18
-- eg calculate age DATEDIFF(BIG_DATE,SMALL_DATE) = DAYS
SELECT floor(datediff(${NOW_DATE},to_date(birthday))/365) as age ;
-- determine weekend vs weekday (2023-01-01 was a Sunday)
SELECT CASE WHEN pmod(datediff(${NOW_DATE},TO_DATE('2023-01-01')),7) = 0 THEN 'WEEKEND'
WHEN pmod(datediff(${NOW_DATE},TO_DATE('2023-01-01')),7) = 6 THEN 'WEEKEND'
ELSE 'WEEKDAY' END AS WEEK_TAG;

-- the last day of the month
SELECT LAST_DAY(${NOW_DATE}) ; -- '2024-02-29'
-- extract year/month/date
SELECT DATE_FORMAT(${NOW_DATE},'yyyy-MM'); -- 2024-02
SELECT DATE_FORMAT(${NOW_DATE},'MM'); -- 02
SELECT DATE_FORMAT(${NOW_DATE},'dd'); -- 18
-- add days OR subtract days DATE_SUB()
SELECT DATE_ADD(LAST_DAY(${NOW_DATE}),1); --2024-03-01
-- next monday - sunday
SELECT NEXT_DAY(${NOW_DATE},'MO'); -- 2024-02-19
SELECT NEXT_DAY(${NOW_DATE},'we'); --2024-02-21

About STRING

Lpad & Rpad

To handle string comparison problems like '2' > '10' (strings compare character by character), pad the numbers to a fixed width first.

  • LPAD/RPAD(target_str, total_len[int], pad_str) : pad target_str with pad_str until it is total_len characters long
    • select RPAD('2',3,'0'); => '200'
    • select LPAD('2',3,'0'); => '002'

Substring & Substr

  • SUBSTRING(target_string, start_index, cut_len)
    • The start_index is 1-based, not 0-based
    • If cut_len is omitted, it returns all characters after the start index.
  • SUBSTR(target_string, start_index, cut_len)
    • Same as SUBSTRING
      Note: in some SQL dialects SUBSTRING() and SUBSTR() differ, and the last parameter of SUBSTRING may be an end index instead of a length. In the version I tested, both functions return the same result for the same parameters, so it depends on your environment. A couple of examples are sketched below.

Concat & Concat_ws

  • CONCAT(str1, str2, …) , or we can use || instead
    • If any argument is NULL, the whole result is NULL, so make sure there are no NULL values in the concatenated columns
    • || is equivalent to CONCAT
  • CONCAT_WS(separator, str1, str2, …)
    • The separator can not be NULL, while NULL string arguments are simply skipped (see the sketch below).
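
A minimal sketch of the NULL behaviour described above (as I have observed it in Hive):

SELECT CONCAT('a', NULL, 'b');          -- NULL  : one NULL argument makes the whole result NULL
SELECT 'a' || NULL || 'b';              -- NULL  : same behaviour as CONCAT
SELECT CONCAT_WS('-', 'a', NULL, 'b');  -- 'a-b' : CONCAT_WS simply skips the NULL argument
SELECT CONCAT_WS(NULL, 'a', 'b');       -- NULL  : but the separator itself can not be NULL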

Column to row

CONCAT_WS(separator, COLLECT_LIST/COLLECT_SET(col_name)) collapses multiple rows into one delimited string (example sketched below)

  • collect_list(col_name)
    • KEEP the duplicate data
  • collect_set(col_name)
    • REMOVE the duplicate data
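
A small sketch of collapsing rows into one string; the one-column temporary table tag_rows is made up for illustration:

-- three rows, with one duplicate value
CREATE TEMPORARY TABLE tag_rows AS SELECT EXPLODE(ARRAY('X','Y','X')) AS tag;

SELECT CONCAT_WS(',', COLLECT_LIST(tag)) AS all_tags_keep_dup   -- 'X,Y,X'
      ,CONCAT_WS(',', COLLECT_SET(tag))  AS all_tags_no_dup     -- e.g. 'X,Y' (order not guaranteed)
FROM tag_rows;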

Row to column

Simplest but most tedious way: using CASE WHEN (a sketch follows).
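
A minimal sketch of the CASE WHEN approach, assuming a small made-up score table that we pivot so each subject value becomes its own column:

-- made-up data: one row per (name, subject, score)
CREATE TEMPORARY TABLE score_rows AS
SELECT 'A' AS name, 'math' AS subject, 90 AS score UNION ALL
SELECT 'A', 'english', 80 UNION ALL
SELECT 'B', 'math', 70 UNION ALL
SELECT 'B', 'english', 60;

-- one output row per name, one column per subject
SELECT name
      ,MAX(CASE WHEN subject = 'math'    THEN score END) AS math_score
      ,MAX(CASE WHEN subject = 'english' THEN score END) AS english_score
FROM score_rows
GROUP BY name;
-- A  90  80
-- B  70  60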

  • EXPLODE()
    On its own it only converts one array/map column into rows; other columns can not appear in the same SELECT (use LATERAL VIEW for that)
    • EXPLODE(MAP(…)) : each key/value pair of the map becomes one row, with one column for the key and one column for the value
SELECT EXPLODE(ARRAY('1','E'));
-- will get a new column:
-- new_col
-- 1
-- E
SELECT EXPLODE(MAP('A','1','B','2'));
-- key value
-- A     1
-- B     2
  • Lateral View
    lateral view udtf(expression) tableAlias as columnAlias (,columnAlias)*
    • udtf(expression) : e.g. explode(), posexplode(), typically applied to split() results
    • tableAlias : the alias of the virtual table produced by the UDTF
    • columnAlias : the virtual column name(s) of that virtual table. If the UDTF produces one column, declare one alias; if it produces several columns, declare all of them in parentheses in column order, separated by commas
    • Location : the LATERAL VIEW clause goes after the table in FROM and before WHERE
DROP TABLE IF EXISTS test_list;
CREATE TEMPORARY TABLE test_list 
(u_id STRING,
a_list STRING)
row format delimited 
fields terminated by "\t"
collection items terminated by ",";

INSERT INTO TABLE test_list VALUES ('A,B','L1,L2,L3');
INSERT INTO TABLE test_list VALUES ('C,B,D','L4,L5,L6');

select uid_split, new_list 
from ( 
	select u_id,new_list
	from test_list
	lateral view explode(split(a_list,",")) tmpTable as new_list
) a
lateral view explode(split(u_id, ",")) m as uid_split;
-- ans: 15 rows * 2 cols
uid_split  new_list
A          L1
B          L1
...
C          L6
B          L6
D          L6
  • POSEXPLODE()
    explodes an array into position + value, so two arrays can be exploded in parallel and aligned on position
DROP TABLE IF EXISTS test_list;
CREATE TEMPORARY TABLE test_list 
(u_id STRING,
a_list STRING)
row format delimited 
fields terminated by "\t"
collection items terminated by ",";

INSERT INTO TABLE test_list VALUES ('A,B','L1,L2');
INSERT INTO TABLE test_list VALUES ('C,E,D','L4,L5,L6');

select uid_split,new_list
from
(select index_l, index_u ,new_list,uid_split
from  test_list
lateral view posexplode(split(a_list,",")) ln_1 as index_l,new_list
lateral view posexplode(split(u_id,",")) u_1 as index_u,uid_split) a
where index_l = index_u;
-- ans: 5 rows * 2 cols
uid_split  new_list
A          L1
B          L2
C          L4
E          L5
D          L6

Extracting data

Using regular expression

  • Func regexp_extract : regexp_extract(str, regexp, index)
    • str : target string or target column's name
    • regexp : regular expression
      • * : matches the preceding character 0 or more times
      • . : matches any single character
      • ? : matches the preceding character 0 or 1 time
    • index : indicates which part of the match to return
      • if index = 0, the whole matched expression is returned
      • if index >= 1, the content of the index-th capture group (pair of parentheses, counted left to right) is returned.
regexp_extract('{"c_id":"D","order_time":"20240506"}','"order_time":"(.*?)(")',1) as order_time
-- expected to get '20240506'

Using json data

  • Func get_json_object : get_json_object(str,json_path)
    • str: target string or target column’s name
    • json_path : using $.key
    • if it is a nesting structure , then using $.key1.key2[0]
get_json_object('{"c_id":"D","order_time":"20240506"}','$.order_time') as order_time
-- expected to get '20240506'

Analysis function

Ranking

  • ROW_NUMBER() OVER( PARTITION BY … ORDER BY …) : 1 2 3 4 …
  • RANK()OVER(PARTITION BY … ORDER BY …) : 1 2 2 4
  • DENSE_RANK() OVER(PARTITION BY … ORDER BY …) : 1 2 2 3
  • NTILE(group_num) OVER(PARTITION BY … ORDER BY …) : splits the partition into group_num buckets as evenly as possible; when the rows do not divide evenly, the earlier buckets get one extra row by default.
  • CUME_DIST() OVER(PARTITION BY … ORDER BY …) : returns (number of rows less than or equal to the current value) / (total number of rows in the partition)

Lag&Lead

  • LAG(col_name, n, default_val) :
    • col_name : the target column
    • n : look n rows back (optional, default is 1)
    • default_val : the value returned when the nth previous row does not exist (optional; NULL if not specified).
  • LEAD(col_name, n, default_val) :
    • LAG looks backwards (previous rows); LEAD looks forwards (following rows)

First/Last value

  • FIRST_VALUE(target_col) OVER(PARTITION BY … ORDER BY …) : after sorting within the group, the first value in the window up to the current row
  • LAST_VALUE(target_col) OVER(PARTITION BY … ORDER BY …) : after sorting within the group, the last value in the window up to the current row
DROP TABLE IF EXISTS TEMP_TEST;
CREATE TEMPORARY TABLE TEMP_TEST AS
SELECT EXPLODE(ARRAY('1','2','3','2','3','4')) AS TEST
;

SELECT A.* ,ROW_NUMBER()OVER(ORDER BY TEST ) AS ROW_RK
,RANK()OVER(ORDER BY TEST ) AS RK_RK
,DENSE_RANK()OVER(ORDER BY TEST ) AS DEN_RK
,NTILE(2)OVER(ORDER BY TEST) AS GROUP_2
,NTILE(4)OVER(ORDER BY TEST) AS GROUP_4
,round(CUME_DIST()OVER(ORDER BY TEST),2) AS less_or_equ_rt 
,LAG(TEST,3,'LAG')OVER(ORDER BY TEST) AS lag_test
,LAG(TEST)OVER(ORDER BY TEST DESC) AS lag_2
,LEAD(TEST,2,'LEAD') OVER(ORDER BY TEST) AS lead_test
,LEAD(TEST)OVER(ORDER BY TEST DESC) AS lead_2
,FIRST_VALUE(TEST)OVER(ORDER BY TEST) AS first_test
,LAST_VALUE(TEST)OVER(ORDER BY TEST ) AS last_test
FROM TEMP_TEST A ;
test  row_rk  rk_rk  den_rk  group_2  group_4  less_or_equ_rt  lag_test  lag_2   lead_test  lead_2  first_test  last_test
1     1       1      1       1        1        0.17            LAG       2       2          [NULL]  1           1
2     2       2      2       1        1        0.5             LAG       2       3          1       1           2
2     3       2      2       1        2        0.5             LAG       3       3          2       1           2
3     4       4      3       2        2        0.83            1         3       4          2       1           3
3     5       4      3       2        3        0.83            2         4       LEAD       3       1           3
4     6       6      4       2        4        1               2         [NULL]  LEAD       3       1           4

Handling NULL

  • NVL(expr1, default_expr) : returns default_expr if expr1 is NULL, otherwise returns expr1
    • The parameters can be numbers, strings, dates, etc.
SELECT NVL(NULL,'TEST'); -- 'TEST'
SELECT NVL('',NULL); -- ''
SELECT NVL(NULL,NULL); -- NULL
  • NULLIF(target_num, compare_num) : returns NULL if target_num = compare_num, otherwise returns target_num
    • shorthand for : CASE WHEN target_num = compare_num THEN NULL ELSE target_num END
SELECT 12/NULLIF(0,0) ; -- NULL
  • COALESCE(expr1, expr2, expr3, …) : checks the expressions in order and returns the first non-NULL value; if all expressions are NULL, returns NULL
SELECT COALESCE(NULL,NULL,NULL); --NULL
SELECT COALESCE(0,NULL,1); --0
SELECT COALESCE(NULL,NULL,NULL,'first','second'); --'first'
SELECT COALESCE(NULL,12/NULLIF(0,0),'third'); --'third'

GROUP BY

  • basic syntax:
-- example data
CREATE TEMPORARY TABLE TEST_GROUP(
C_ID   string  comment 'CUSTOMER ID',
ORDER_YEAR  string  ,
PRODUCT string,
REVENUE  int comment 'Amount of consumption'); 

INSERT OVERWRITE TABLE  TEST_GROUP VALUES
('A','2020','pro_1','50'),
('A','2020','pro_2','40'),
('A','2019','pro_2','40'),
('B','2020','pro_1','20'),
('B','2021','pro_2','50'),
('C','2020','pro_2','100');

select c_id, order_year, product -- dimensions
,sum(revenue) 
from TEST_GROUP
group by c_id, order_year, product ;
  • GROUP BY can be used to remove duplicate data, which is usually faster than DISTINCT on large tables
  • All dimension columns (here c_id, order_year, product) that appear in the SELECT without an aggregate must also appear after GROUP BY

grouping__id sorting order

grouping__id : note the two underscores. grouping__id must be used together with GROUPING SETS, CUBE or ROLLUP; it can not be used alone.
It is a bit mask over the GROUP BY columns: the leftmost column is the highest bit, and a bit is 1 when that column is aggregated away (shows NULL). If we group by 3 columns, the mapping is as follows (different SQL/Hive versions may number this differently):
0 : (col1, col2, col3)
1 : (col1, col2)
2 : (col1, col3)
3 : (col1)
4 : (col2, col3)
5 : (col2)
6 : (col3)
7 : ()

CUBE

  • Aggregates over every combination of the listed dimensions; no dimension takes priority
  • Basic syntax: SELECT col1, col2, col3 FROM temp_table GROUP BY CUBE(col1, col2, col3) or SELECT col1, col2, col3 FROM temp_table GROUP BY col1, col2, col3 WITH CUBE
-- to find the sales of  the every year and every product
SELECT ORDER_YEAR,PRODUCT,SUM(REVENUE) AS SALES, grouping__id FROM TEST_GROUP GROUP BY ORDER_YEAR,PRODUCT
with cube;
SELECT ORDER_YEAR,PRODUCT,SUM(REVENUE) AS SALES, grouping__id  FROM TEST_GROUP GROUP BY CUBE(ORDER_YEAR,PRODUCT);
order_year  product  sales  grouping__id
2019        pro_2    40     0
2020        pro_1    70     0
2020        pro_2    140    0
2021        pro_2    50     0
2019        [NULL]   40     1
2020        [NULL]   210    1
2021        [NULL]   50     1
[NULL]      pro_1    70     2
[NULL]      pro_2    230    2
[NULL]      [NULL]   300    3
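
To keep only one aggregation level, the grouping__id described above can be filtered in an outer query. A minimal sketch against the CUBE result above:

-- keep only the per-year subtotals (grouping__id = 1 : PRODUCT is aggregated away)
SELECT ORDER_YEAR, SALES
FROM (
    SELECT ORDER_YEAR, PRODUCT, SUM(REVENUE) AS SALES, grouping__id AS GID
    FROM TEST_GROUP
    GROUP BY ORDER_YEAR, PRODUCT
    WITH CUBE
) T
WHERE GID = 1;
-- 2019  40
-- 2020  210
-- 2021  50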

GROUPING SETS

  • When we only need some particular dimension combinations, we can use GROUPING SETS to specify exactly which ones to aggregate
  • GROUPING SETS comes after GROUP BY
  • Basic syntax : SELECT col1, col2, col3 FROM temp_table GROUP BY col1, col2, col3 GROUPING SETS ((col1,col2,col3),())
  • () means aggregating the whole table as one group, but it can not be used alone: we can not write GROUPING SETS (()) by itself.
SELECT c_id, order_year, product, sum(revenue) as sales, grouping__id 
FROM test_group
GROUP BY c_id, order_year, product
GROUPING SETS((),(c_id,order_year),product)
ORDER BY grouping__id ;
c_id    order_year  product  sales  grouping__id
A       2019        [NULL]   40     1
B       2020        [NULL]   20     1
C       2020        [NULL]   100    1
A       2020        [NULL]   90     1
B       2021        [NULL]   50     1
[NULL]  [NULL]      pro_1    70     6
[NULL]  [NULL]      pro_2    230    6
[NULL]  [NULL]      [NULL]   300    7

ROLLUP

  • ROLLUP is a subset of CUBE: it is dominated by the leftmost dimension and aggregates hierarchically, dropping dimensions from right to left.
  • Basic syntax : SELECT col1, col2, col3 FROM temp_table GROUP BY ROLLUP(col1, col2, col3) or SELECT col1, col2, col3 FROM temp_table GROUP BY col1, col2, col3 WITH ROLLUP
  • It is equivalent to SELECT col1, col2, col3 FROM temp_table GROUP BY col1, col2, col3 GROUPING SETS((),col1,(col1,col2),(col1,col2,col3))
  • There is a hierarchy, and the leftmost column has the highest level, i.e. col1 > col2 > col3
SELECT ORDER_YEAR,PRODUCT,SUM(REVENUE) AS SALES, grouping__id FROM TEST_GROUP 
GROUP BY ORDER_YEAR,PRODUCT
WITH ROLLUP;
-- is equivalent to 
SELECT ORDER_YEAR,PRODUCT,SUM(REVENUE) AS SALES, grouping__id  FROM TEST_GROUP 
GROUP BY ROLLUP(ORDER_YEAR,PRODUCT);
-- is equivalent to 
SELECT  ORDER_YEAR,PRODUCT,SUM(REVENUE) AS SALES, grouping__id  FROM TEST_GROUP 
GROUP BY ORDER_YEAR,PRODUCT
GROUPING SETS((),ORDER_YEAR,(ORDER_YEAR,PRODUCT));
order_year  product  sales  grouping__id
2019        pro_2    40     0
2020        pro_1    70     0
2020        pro_2    140    0
2021        pro_2    50     0
2019        [NULL]   40     1
2020        [NULL]   210    1
2021        [NULL]   50     1
[NULL]      [NULL]   300    3

Example

Based on HIVE SQL

-- Create a new detail table containing the customers and the goods they bought
DROP TABLE IF EXISTS ORDER_DETAIL;
CREATE TEMPORARY TABLE ORDER_DETAIL(
C_ID STRING,
ORDER_DATE DATE,
PRODUCT_NAME STRING,
PRODUCT_UNITS INT,
STAFF STRING
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS textfile;


INSERT INTO ORDER_DETAIL VALUES ('A', '2024-06-07' ,'DRESS' ,'2', 'S1');
INSERT INTO ORDER_DETAIL VALUES ('A', '2024-05-08' ,'SHOES' ,'1', 'S3');
INSERT INTO ORDER_DETAIL VALUES ('B', '2024-06-06' ,'SHIRT' ,'2', 'S1');
INSERT INTO ORDER_DETAIL VALUES ('B', '2024-05-09' ,'PANTS' ,'1','S2');
INSERT INTO ORDER_DETAIL VALUES ('C', '2024-06-09' ,'PANTS' ,'2','S3');
INSERT INTO ORDER_DETAIL VALUES ('C', '2024-05-08' ,'SHIRT' ,'3','S4');

DROP TABLE IF EXISTS PRODUCT_INFO;
CREATE TEMPORARY TABLE PRODUCT_INFO AS 
SELECT CAST('' AS STRING) AS PRODUCT_NAME, CAST('' AS FLOAT) AS PRODUCT_PRICE;

-- We created a placeholder row in PRODUCT_INFO, so we use INSERT OVERWRITE here to replace that useless row.
INSERT OVERWRITE TABLE PRODUCT_INFO VALUES('DRESS',200);
INSERT INTO PRODUCT_INFO VALUES('SHOES',300);
INSERT INTO PRODUCT_INFO VALUES('SHIRT',500);
INSERT INTO PRODUCT_INFO VALUES('PANTS',100);

-- We found that customer D's data is missing, and the raw data has not been parsed yet.

DROP TABLE IF EXISTS MISSING_RAWDATA;
CREATE TEMPORARY TABLE MISSING_RAWDATA AS 
SELECT '{"c_id":"D","order_time":"20240506","buy_detail":[{"product_name":"DRESS","units":"4","client":"S4"}]}' AS json_data
UNION ALL
SELECT '{"c_id":"D","order_time":"20240607","buy_detail":[{"product_name":"SHOES","units":"2","client":"S1"}]}' AS json_data
UNION ALL
SELECT '{"c_id":"B","order_time":"20240606","buy_detail":[{"product_name":"SHIRT","units":"2","client":"S1"}]}' AS json_data;

DROP TABLE IF EXISTS PARSE_JSON;
CREATE TEMPORARY TABLE PARSE_JSON AS 
SELECT 
regexp_extract(json_data, '"c_id":"(.*?)(")', 1) AS C_ID
,get_json_object(json_data, '$.order_time') AS ORDER_DATE
,get_json_object(json_data, '$.buy_detail.product_name[0]') AS PRODUCT_NAME
,regexp_extract(json_data, '"units":"(.*?)(")',1) AS PRODUCT_UNITS
,get_json_object(json_data, '$.buy_detail.client[0]') AS STAFF
FROM MISSING_RAWDATA;


-- change the format of order_date from yyyyMMdd to yyyy-MM-dd

DROP TABLE IF EXISTS OTHERS_DETAIL;
CREATE TEMPORARY TABLE OTHERS_DETAIL AS 
SELECT C_ID
,TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(ORDER_DATE,'yyyyMMdd'),'yyyy-MM-dd')) AS ORDER_DATE
,PRODUCT_NAME
,CAST(PRODUCT_UNITS AS INT) AS PRODUCT_UNITS
,STAFF
FROM PARSE_JSON;


DROP TABLE IF EXISTS FINAL_DETAIL;
CREATE TEMPORARY TABLE FINAL_DETAIL AS 
SELECT A.* ,B.PRODUCT_PRICE
FROM ORDER_DETAIL A 
LEFT JOIN PRODUCT_INFO B ON A.PRODUCT_NAME = B.PRODUCT_NAME 
UNION 
SELECT A.*,B.PRODUCT_PRICE
FROM OTHERS_DETAIL A 
LEFT JOIN PRODUCT_INFO B ON A.PRODUCT_NAME = B.PRODUCT_NAME ;


SELECT * FROM FINAL_DETAIL;

c_id  order_date  product_name  product_units  staff  product_price
A     2024-06-07  DRESS         2              S1     200
B     2024-05-09  PANTS         1              S2     100
C     2024-05-08  SHIRT         3              S4     500
C     2024-06-09  PANTS         2              S3     100
D     2024-05-06  DRESS         4              S4     200
A     2024-05-08  SHOES         1              S3     300
B     2024-06-06  SHIRT         2              S1     500
D     2024-06-07  SHOES         2              S1     300
-- product assortment

DROP TABLE IF EXISTS PRODUCT_NAME_UNITS_UNION;
CREATE TEMPORARY TABLE PRODUCT_NAME_UNITS_UNION AS 
SELECT A.*, PRODUCT_NAME||'×'||PRODUCT_UNITS AS NEW_PRODUCT
,CONCAT(PRODUCT_NAME, '×' ,PRODUCT_UNITS) AS NEW_PRODUCT1
FROM FINAL_DETAIL A;

-- Here new_product1 is identical to new_product 

DROP TABLE IF EXISTS PRODUCT_ASSORTMENT;
CREATE TEMPORARY TABLE PRODUCT_ASSORTMENT AS  
SELECT C_ID, SUM(PRODUCT_PRICE*PRODUCT_UNITS) AS REVENUE,CONCAT_WS("," , SORT_ARRAY(COLLECT_LIST(NEW_PRODUCT))) AS ALL_BUY
FROM PRODUCT_NAME_UNITS_UNION
GROUP BY C_ID;

SELECT * from PRODUCT_ASSORTMENT;

c_id  revenue  all_buy
A     700      DRESS×2,SHOES×1
B     1100     PANTS×1,SHIRT×2
C     1700     PANTS×2,SHIRT×3
D     1400     DRESS×4,SHOES×2
-- if we do not want to analyze SHIRT, we blank out the SHIRT values

DROP TABLE IF EXISTS NO_SHIRT;
CREATE TEMPORARY TABLE NO_SHIRT AS 
SELECT C_ID
,CASE WHEN PRODUCT_NAME IN ('DRESS','PANTS','SHOES') THEN PRODUCT_NAME END AS NEW_NAME
,CASE WHEN PRODUCT_NAME = 'SHIRT' THEN CAST('0' AS INT) ELSE PRODUCT_UNITS END AS NEW_UNITS
FROM FINAL_DETAIL;

-- Now building the product assortment with CONCAT and || gives different results from before: rows with NULL values produce NULL.
DROP TABLE IF EXISTS NO_SHIRT_ASSORT;
CREATE TEMPORARY TABLE NO_SHIRT_ASSORT AS 
SELECT C_ID,NEW_NAME||'×'||NEW_UNITS AS NEW_PRODUCT
,CONCAT(NEW_NAME, '×' ,NEW_UNITS) AS NEW_PRODUCT1
FROM NO_SHIRT ;
-- so when we use CONCAT (or ||) to combine two columns, we need to make sure there are no NULL values in those columns
SELECT * FROM NO_SHIRT_ASSORT WHERE c_id = 'C';
c_id  new_product  new_product1
C     [NULL]       [NULL]
C     PANTS×2      PANTS×2
-- FIND THE MONTHLY sales champion 
DROP TABLE IF EXISTS MONTHLY_SALES_BY_STAFF;
CREATE TEMPORARY TABLE MONTHLY_SALES_BY_STAFF AS 
SELECT 
-- substr(ORDER_DATE,6,2)
SUBSTRING(ORDER_DATE,6,2) AS ORDER_MON, STAFF, SUM(PRODUCT_PRICE*PRODUCT_UNITS) AS SALES
FROM FINAL_DETAIL
GROUP BY SUBSTRING(ORDER_DATE,6,2), STAFF;

DROP TABLE IF EXISTS MONTHLY_SALES_RANK;
CREATE TEMPORARY TABLE MONTHLY_SALES_RANK AS 
SELECT ORDER_MON, STAFF, SALES, DENSE_RANK()OVER(PARTITION BY ORDER_MON ORDER BY SALES DESC) AS MON_RANK
FROM MONTHLY_SALES_BY_STAFF;

SELECT * FROM MONTHLY_SALES_RANK WHERE mon_rank='1';
order_mon  staff  sales  mon_rank
06         S1     2000   1
05         S4     2300   1
-- use NTILE to bucket the rows
DROP TABLE IF EXISTS NTILE_TEST;
CREATE TEMPORARY TABLE NTILE_TEST AS 
SELECT A.* ,NTILE(3)OVER(PARTITION BY STAFF ORDER BY PRODUCT_PRICE) AS GROUP_1
FROM FINAL_DETAIL A;

SELECT c_id,staff,product_price,group_1 FROM NTILE_TEST;
c_id  staff  product_price  group_1
A     S1     200            1
D     S1     300            2
B     S1     500            3
B     S2     100            1
C     S3     100            1
A     S3     300            2
D     S4     200            1
C     S4     500            2

DROP TABLE IF EXISTS PRODUCT_LIST;
CREATE TEMPORARY TABLE PRODUCT_LIST 
(PRODUCT_LINE STRING,
ALL_DETAIL array<STRING>)
row format delimited 
fields terminated by "\t"
collection items terminated by ",";

INSERT INTO TABLE PRODUCT_LIST VALUES ('LINE1',ARRAY('SHOES','PANTS'));
INSERT INTO TABLE PRODUCT_LIST VALUES ('LINE2',ARRAY('SHIRT','DRESS','SKIRT'));

SELECT * FROM PRODUCT_LIST;
--SELECT EXPLODE(ALL_DETAIL) FROM PRODUCT_LIST;
product_line  all_detail
LINE1         ["SHOES","PANTS"]
LINE2         ["SHIRT","DRESS","SKIRT"]
SELECT PRODUCT_LINE,DETAIL_NAME 
FROM PRODUCT_LIST
LATERAL VIEW EXPLODE(ALL_DETAIL) TMPTABLE AS DETAIL_NAME;
product_line  detail_name
LINE1         SHOES
LINE1         PANTS
LINE2         SHIRT
LINE2         DRESS
LINE2         SKIRT