【impala建表】kudu的表必须有主键,作为分区的字段需排在其他字段前面。
【range分区】(不推荐)
CREATE TABLE KUDU_WATER_HISTORY (
id STRING,
year INT,
device STRING,
reading INT,
time STRING,
PRIMARY KEY (id,year)
) PARTITION BY RANGE (year)
(
PARTITION VALUES < 2017,
PARTITION 2017 <= VALUES < 2018,
PARTITION 2018 <= VALUES
)
STORED AS KUDU
TBLPROPERTIES('kudu.master_addresses' = '10.10.30.200:7051');
【hash分区】(不推荐)
CREATE TABLE KUDU_WATER_HISTORY_PARTITION_BY_ID (
id STRING,
year INT,
device STRING,
reading INT,
time STRING,
PRIMARY KEY (id)
) PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU
TBLPROPERTIES('kudu.master_addresses' = '10.10.30.200:7051');
【hash range混合分区】推荐是用混合分区方式
CREATE TABLE KUDU_WATER_HISTORY (
id STRING,
year INT,
device STRING,
reading INT,
time STRING,
PRIMARY KEY (id,device,year)
) PARTITION BY HASH (device) PARTITIONS 3,
RANGE (year)
(
PARTITION VALUE = 2016,
PARTITION VALUE = 2017,
PARTITION VALUE = 2018,
PARTITION VALUE = 2019
)
STORED AS KUDU
TBLPROPERTIES('kudu.master_addresses' = '10.10.30.200:7051');
CREATE TABLE DEVICE_KUDU (
id STRING,
device STRING,
name STRING,
orgId INT,
PRIMARY KEY (id)
) PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU
TBLPROPERTIES('kudu.master_addresses' = '10.10.30.200:7051');
【增加分区】
ALTER TABLE KUDU_WATER_HISTORY ADD RANGE PARTITION VALUE = 2020;
【查询语句】
select
T_3C75F1.`device`,
year(T_3C75F1.`time`),
month(T_3C75F1.`time`),
sum(T_3C75F1.`reading`),
count(1)
from (select DEVICE_KUDU.device,reading,to_timestamp(time,'yyyy-MM-dd HH:mm:ss') as time from KUDU_WATER_HISTORY,DEVICE_KUDU where KUDU_WATER_HISTORY.device=DEVICE_KUDU.device) as `T_3C75F1`
group by
T_3C75F1.`device`,
year(T_3C75F1.`time`),
month(T_3C75F1.`time`);
耗时:DEVICE_KUDU表50条记录,KUDU_WATER_HISTORY表1亿条记录,执行上面的查询语句耗时12秒。同样的数据量和查询方式,比impala+hdfs+parquet快10倍。
【跨数据源的查询】
device表的数据存储在hdfs中,抄表读数的数据存储在kudu中。分别在impala中创建对应的表
explain select
T_3C75F1.`deviceid`,
year(T_3C75F1.`time`),
month(T_3C75F1.`time`),
sum(T_3C75F1.`reading`),
count(1)
from (select device_parquet.deviceid,reading,to_timestamp(time,'yyyy-MM-dd HH:mm:ss') as time from KUDU_WATER_HISTORY,device_parquet where KUDU_WATER_HISTORY.device=device_parquet.deviceid) as `T_3C75F1`
group by
T_3C75F1.`deviceid`,
year(T_3C75F1.`time`),
month(T_3C75F1.`time`);
执行上面的语句,查看执行计划
+-----------------------------------------------------------------------------------------------------------------------------------------+ | Explain String | +-----------------------------------------------------------------------------------------------------------------------------------------+ | Per-Host Resource Reservation: Memory=400.00MB | | Per-Host Resource Estimates: Memory=2.25GB | | WARNING: The following tables are missing relevant table and/or column statistics. | | default.device_parquet, default.kudu_water_history | | | | PLAN-ROOT SINK | | | | | 07:EXCHANGE [UNPARTITIONED] | | | | | 06:AGGREGATE [FINALIZE] | | | output: sum:merge(T_3C75F1.reading), count:merge(1) | | | group by: T_3C75F1.deviceid, year(T_3C75F1.time), month(T_3C75F1.time) | | | | | 05:EXCHANGE [HASH(T_3C75F1.deviceid,year(T_3C75F1.time),month(T_3C75F1.time))] | | | | | 03:AGGREGATE [STREAMING] | | | output: sum(reading), count(1) | | | group by: device_parquet.deviceid, year(to_timestamp(time, 'yyyy-MM-dd HH:mm:ss')), month(to_timestamp(time, 'yyyy-MM-dd HH:mm:ss')) | | | | | 02:HASH JOIN [INNER JOIN, BROADCAST] | | | hash predicates: KUDU_WATER_HISTORY.device = device_parquet.deviceid | | | | | |--04:EXCHANGE [BROADCAST] | | | | | | | 01:SCAN HDFS [default.device_parquet] | | | partitions=1/1 files=1 size=1.00KB | | | | | 00:SCAN KUDU [default.kudu_water_history] | +-----------------------------------------------------------------------------------------------------------------------------------------+
能看到00:SCAN KUDU和01:SCAN HDFS,分别从kudu和hdfs中查询数据。