Hive实战和调优
1.数据一致性问题
对与多次修改以及只追加记录的数据的表,如何保证一致性:
id | name | sex | times_tamp |
---|---|---|---|
1 | root | female | 2022-04-04 13:24:00 |
2 | hadoop | female | 2022-04-04 13:24:30 |
1 | flink | male | 2022-04-04 13:25:00 |
1 | hive | male | 2022-04-04 13:26:00 |
通过开窗,排名,取最新的时间戳的数据,得到
id | name | sex | timestamp |
---|---|---|---|
1 | hive | male | 2022-04-04 13:26:00 |
2 | hadoop | female | 2022-04-04 13:24:30 |
多条记录去重,SQL如下
SELECT
id,
name,
sex,
times_stamp
FROM
( SELECT id, name, sex, times_tamp, row_number () over ( PARTITION BY id ORDER BY times_tamp ) num FROM tb1 ) t1
WHERE
num =1
2.Jion扫描全表问题
由于业务问题,Hive使用了分区表,但是无法使用分区字段指定分区数据,因为jion的条件不固定在那个分区内。
原SQL(小表,join全量大表):
SELECT
id,
NAME,
sex,
age,
money
FROM
tabe1 A
LEFT JOIN table2 B ON A.id = B.id
AND A.userid = B.userid
AND A.pid = B.pid
优化(先where在Join)where userid IN ( SELECT userid FROM tabe1 GROUP BY userid )
:
SELECT
id,
NAME,
sex,
age,
money
FROM
tabe1 A
LEFT JOIN (
SELECT
*
FROM
table2
WHERE
userid IN ( SELECT userid FROM tabe1 GROUP BY userid )) AS B ON A.id = B.id
AND A.userid = B.userid
AND A.pid = B.pid
3.分区和分桶,设置文件格式
SQL,如下
SELECT
id,
NAME,
sex,
age,
money
FROM
( SELECT * FROM tabe1 whrere statdate = '2022-04-06' ) A
LEFT JOIN ( SELECT * FROM tabe2 whrere statdate = '2022-04-06' ) B ON A.userid = B.userid
优化按日期分区,按userid分桶
CREATE TABLE tb1 (
id string,
userid string,
money DOUBLE )
PARTITIONED BY ( `statdate` string )
CLUSTERED BY ( `userid` ) INTO 10 BUCKETS
STORED AS ORC
其它
FlinkSql on Hive 向Hive插入批量数据出现小文件问题
跟FlinkSql程序的提交并行度有关,多少个并行度,生成多少个文件。根据数据量大小确定并行度