The ORC file format can reduce the amount of data read from HDFS.
Table: catalog_sales; file format: ORC; size: 151644639 bytes.
hive> SHOW CREATE TABLE tpcds_bin_partitioned_orc_2.catalog_sales;
OK
CREATE TABLE `tpcds_bin_partitioned_orc_2.catalog_sales`(
`cs_sold_time_sk` bigint,
`cs_ship_date_sk` bigint,
`cs_bill_customer_sk` bigint,
`cs_bill_cdemo_sk` bigint,
`cs_bill_hdemo_sk` bigint,
`cs_bill_addr_sk` bigint,
`cs_ship_customer_sk` bigint,
`cs_ship_cdemo_sk` bigint,
`cs_ship_hdemo_sk` bigint,
`cs_ship_addr_sk` bigint,
`cs_call_center_sk` bigint,
`cs_catalog_page_sk` bigint,
`cs_ship_mode_sk` bigint,
`cs_warehouse_sk` bigint,
`cs_item_sk` bigint,
`cs_promo_sk` bigint,
`cs_order_number` bigint,
`cs_quantity` int,
`cs_wholesale_cost` decimal(7,2),
`cs_list_price` decimal(7,2),
`cs_sales_price` decimal(7,2),
`cs_ext_discount_amt` decimal(7,2),
`cs_ext_sales_price` decimal(7,2),
`cs_ext_wholesale_cost` decimal(7,2),
`cs_ext_list_price` decimal(7,2),
`cs_ext_tax` decimal(7,2),
`cs_coupon_amt` decimal(7,2),
`cs_ext_ship_cost` decimal(7,2),
`cs_net_paid` decimal(7,2),
`cs_net_paid_inc_tax` decimal(7,2),
`cs_net_paid_inc_ship` decimal(7,2),
`cs_net_paid_inc_ship_tax` decimal(7,2),
`cs_net_profit` decimal(7,2))
PARTITIONED BY (
`cs_sold_date_sk` bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://localhost:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_sales'
TBLPROPERTIES (
'bucketing_version'='2',
'transient_lastDdlTime'='1628754485')
Time taken: 0.051 seconds, Fetched: 47 row(s)
hive> dfs -du -s hdfs://localhost:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_sales;
151644639 151644639 hdfs://localhost:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_sales
Running a select count(1), the amount of data read from HDFS is much smaller than the table's size on HDFS:
set hive.compute.query.using.stats=false;
select count(1) from tpcds_bin_partitioned_orc_2.catalog_sales;
Task Execution Summary
----------------------------------------------------------------------------------------------
VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
Map 1 10082.00 38,630 868 2,880,058 4
Reducer 2 0.00 460 0 4 0
----------------------------------------------------------------------------------------------
TEZ Counters: HDFS_BYTES_READ 30863152
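Hive only scans the data here because statistics-based answering was disabled above: with hive.compute.query.using.stats at its default of true (and up-to-date statistics), count(1) is served straight from table/partition stats and no table data is read at all. A minimal sketch, assuming the stats are current:
set hive.compute.query.using.stats=true;
-- answered from metadata; no table scan is launched
select count(1) from tpcds_bin_partitioned_orc_2.catalog_sales;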
Parquet file
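The parquet database must exist before the CTAS below; a setup step not shown in the original session:
create database if not exists parquet;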
create table parquet.catalog_sales stored as parquet as select * from tpcds_bin_partitioned_orc_2.catalog_sales;
select count(1) from parquet.catalog_sales;
Task Execution Summary
----------------------------------------------------------------------------------------------
VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
Map 1 4020.00 12,620 601 2,880,058 12
Reducer 2 87.00 680 0 12 0
----------------------------------------------------------------------------------------------
hive> dfs -du -s hdfs://localhost:9000/user/hive/warehouse/parquet.db/catalog_sales;
243755174 243755174 hdfs://localhost:9000/user/hive/warehouse/parquet.db/catalog_sales
TEZ Counters: HDFS_BYTES_READ 243795493
The amount of data read from HDFS is larger than the table's size on HDFS (the extra bytes include downloading the execution plan, the conf files for each execution stage, and so on), which shows that all of the table's data was read. The execution time, however, is shorter than with the ORC format.
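For context on why the ORC scan reads so much less: ORC file footers carry per-stripe row counts and column statistics, so a count(1) needs mostly metadata and row indexes rather than full column data. One way to inspect that metadata is Hive's ORC file dump utility (a sketch; the partition file path below is hypothetical):
hive --orcfiledump hdfs://localhost:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_sales/cs_sold_date_sk=2450815/000000_0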
select max
- ORC
select max(cs_sold_time_sk) from tpcds_bin_partitioned_orc_2.catalog_sales;
Task Execution Summary
----------------------------------------------------------------------------------------------
VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
Map 1 19159.00 44,600 837 2,880,058 4
Reducer 2 224.00 440 4 4 0
----------------------------------------------------------------------------------------------
HDFS_BYTES_READ: 32258574
- Parquet
select max(cs_sold_time_sk) from parquet.catalog_sales;
Task Execution Summary
----------------------------------------------------------------------------------------------
VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
Map 1 3526.00 12,840 432 2,880,058 12
Reducer 2 0.00 820 5 12 0
----------------------------------------------------------------------------------------------
HDFS_BYTES_READ: 4613211
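For max(cs_sold_time_sk) the picture reverses: the query projects a single column, and the Parquet reader skips every other column chunk, which is why only about 4.6 MB of a 243 MB table is read. Per-column chunk sizes and min/max statistics can be inspected with a tool such as parquet-tools, if it is installed (a sketch; the data file name below is hypothetical):
parquet-tools meta hdfs://localhost:9000/user/hive/warehouse/parquet.db/catalog_sales/000000_0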