Comparing ORC Files and Parquet Files

The ORC file format can reduce the amount of data read from HDFS.

Table: catalog_sales, file format: ORC, size on HDFS: 151,644,639 bytes (~145 MB).

hive> SHOW CREATE TABLE tpcds_bin_partitioned_orc_2.catalog_sales;
OK
CREATE TABLE `tpcds_bin_partitioned_orc_2.catalog_sales`(
  `cs_sold_time_sk` bigint, 
  `cs_ship_date_sk` bigint, 
  `cs_bill_customer_sk` bigint, 
  `cs_bill_cdemo_sk` bigint, 
  `cs_bill_hdemo_sk` bigint, 
  `cs_bill_addr_sk` bigint, 
  `cs_ship_customer_sk` bigint, 
  `cs_ship_cdemo_sk` bigint, 
  `cs_ship_hdemo_sk` bigint, 
  `cs_ship_addr_sk` bigint, 
  `cs_call_center_sk` bigint, 
  `cs_catalog_page_sk` bigint, 
  `cs_ship_mode_sk` bigint, 
  `cs_warehouse_sk` bigint, 
  `cs_item_sk` bigint, 
  `cs_promo_sk` bigint, 
  `cs_order_number` bigint, 
  `cs_quantity` int, 
  `cs_wholesale_cost` decimal(7,2), 
  `cs_list_price` decimal(7,2), 
  `cs_sales_price` decimal(7,2), 
  `cs_ext_discount_amt` decimal(7,2), 
  `cs_ext_sales_price` decimal(7,2), 
  `cs_ext_wholesale_cost` decimal(7,2), 
  `cs_ext_list_price` decimal(7,2), 
  `cs_ext_tax` decimal(7,2), 
  `cs_coupon_amt` decimal(7,2), 
  `cs_ext_ship_cost` decimal(7,2), 
  `cs_net_paid` decimal(7,2), 
  `cs_net_paid_inc_tax` decimal(7,2), 
  `cs_net_paid_inc_ship` decimal(7,2), 
  `cs_net_paid_inc_ship_tax` decimal(7,2), 
  `cs_net_profit` decimal(7,2))
PARTITIONED BY ( 
  `cs_sold_date_sk` bigint)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://localhost:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_sales'
TBLPROPERTIES (
  'bucketing_version'='2', 
  'transient_lastDdlTime'='1628754485')
Time taken: 0.051 seconds, Fetched: 47 row(s)

hive> dfs -du -s hdfs://localhost:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_sales;
151644639  151644639  hdfs://localhost:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_sales

Running select count(1), the amount of data read from HDFS is much smaller than the table's size on HDFS. (hive.compute.query.using.stats is set to false so that Hive actually scans the table instead of answering the count directly from pre-computed table statistics.)

set hive.compute.query.using.stats=false;
select count(1) from tpcds_bin_partitioned_orc_2.catalog_sales;
Task Execution Summary
----------------------------------------------------------------------------------------------
  VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
     Map 1          10082.00         38,630            868       2,880,058                4
 Reducer 2              0.00            460              0               4                0

TEZ Counters: HDFS_BYTES_READ 30863152
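
The gap (roughly 30 MB read versus a 151 MB table) is expected: ORC keeps row counts and per-column statistics in its file and stripe footers, so when a query projects no columns the reader only needs enough stream data to know how many rows each stripe holds, and most of the column bytes are never fetched. A minimal way to look at this metadata is Hive's built-in ORC file dump; the partition value and file name below are hypothetical examples, any file under the table's directory works:

# Print ORC footer metadata: "Rows:" per file plus per-stripe column statistics.
# (The partition value 2450815 and file name 000000_0 are illustrative.)
hive --orcfiledump hdfs://localhost:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_sales/cs_sold_date_sk=2450815/000000_0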

Parquet file

create table parquet.catalog_sales stored as parquet as select * from tpcds_bin_partitioned_orc_2.catalog_sales;
select count(1) from parquet.catalog_sales;
Task Execution Summary
----------------------------------------------------------------------------------------------
  VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
     Map 1           4020.00         12,620            601       2,880,058               12
 Reducer 2             87.00            680              0              12                0

hive> dfs -du -s hdfs://localhost:9000/user/hive/warehouse/parquet.db/catalog_sales;
243755174  243755174  hdfs://localhost:9000/user/hive/warehouse/parquet.db/catalog_sales

TEZ Counters: HDFS_BYTES_READ 243795493

Here the amount of data read from HDFS is larger than the table itself (the counter also includes downloading the execution plan, the per-stage configuration files, and so on). This indicates that the entire table was read from HDFS.
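
Parquet also records the row count of each row group in its file footer; that metadata can be inspected with the parquet-tools utility. This is a hedged sketch: the jar name/version and the data file name are assumptions, while the table directory is the one shown above.

# Print Parquet footer metadata: per-row-group row counts and column statistics.
# (The parquet-tools jar name and the file name 000000_0 are illustrative.)
hadoop jar parquet-tools.jar meta hdfs://localhost:9000/user/hive/warehouse/parquet.db/catalog_sales/000000_0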

However, the execution time is shorter than with the ORC format.

select max

  • ORC
select max(cs_sold_time_sk) from tpcds_bin_partitioned_orc_2.catalog_sales;
Task Execution Summary
----------------------------------------------------------------------------------------------
  VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
     Map 1          19159.00         44,600            837       2,880,058                4
 Reducer 2            224.00            440              4               4                0
----------------------------------------------------------------------------------------------

HDFS_BYTES_READ: 32258574

  • Parquet
select max(cs_sold_time_sk) from parquet.catalog_sales;
Task Execution Summary
----------------------------------------------------------------------------------------------
  VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
     Map 1           3526.00         12,840            432       2,880,058               12
 Reducer 2              0.00            820              5              12                0
----------------------------------------------------------------------------------------------

HDFS_BYTES_READ: 4613211
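
For a single-column max, both readers only need the cs_sold_time_sk column, and the Parquet counter (about 4.6 MB read out of a 243 MB table) shows column pruning at work. A quick way to check that only one column is projected is to look at the query plan; the exact EXPLAIN output varies by Hive version, so treat this as a sketch:

explain select max(cs_sold_time_sk) from parquet.catalog_sales;
-- In the plan, the TableScan / Select Operator should list only cs_sold_time_sk,
-- meaning the reader fetches just that column's chunks instead of whole rows.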
