The ORC File Storage Format and Creating ORC Tables in Hive

Originally published at: https://www.jwldata.com/archives/134.html

Advantages of ORC

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

Compared with RCFile format, for example, ORC file format has many advantages such as:

  • a single file as the output of each task, which reduces the NameNode’s load
  • Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
  • light-weight indexes stored within the file
    • skip row groups that don’t pass predicate filtering
    • seek to a given row
  • block-mode compression based on data type
    • run-length encoding for integer columns
    • dictionary encoding for string columns
  • concurrent reads of the same file using separate RecordReaders
  • ability to split files without scanning for markers
  • bound the amount of memory needed for reading or writing
  • metadata stored using Protocol Buffers, which allows addition and removal of fields
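Two of the type-aware encodings listed above are easy to picture. The sketch below is a deliberately simplified model of the idea, not ORC's actual RLE v1/v2 or dictionary stream formats (which add bit-packing, delta runs, and separate length streams):

```python
from itertools import groupby

def rle_encode(values):
    """Collapse runs of repeated integers into (value, run_length) pairs -
    a toy version of the idea behind ORC's integer run-length encoding."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def dict_encode(values):
    """Replace each string with an index into a dictionary of distinct
    values - the idea behind ORC's string dictionary encoding."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]

print(rle_encode([7, 7, 7, 7, 9, 9, 1]))           # [(7, 4), (9, 2), (1, 1)]
print(dict_encode(["Tom", "Jelly", "Tom", "Tom"]))  # (['Jelly', 'Tom'], [1, 0, 1, 1])
```

Both transforms shrink repetitive columnar data before the general-purpose block compression (ZLIB/SNAPPY) is applied on top.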

ORC File Structure

An ORC file is made up of stripes, a file footer, and a postscript.

  • file footer
    Contains the list of stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates: count, min, max, and sum.
  • postscript
    Holds compression parameters and the size of the compressed footer.
  • stripe
    Each stripe in an ORC file holds index data, row data, and a stripe footer.
    The index data includes min and max values for each column and the row positions within each column.
    The row data is what gets read during table scans.
    The stripe footer contains a directory of stream locations.
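The per-stripe min/max index data is what enables predicate pushdown: a reader can prove a stripe holds no matching rows without reading its row data. A toy model of that decision (the stats mirror what `hive --orcfiledump` prints; the names are illustrative, not a real ORC reader API):

```python
# Toy model of stripe skipping via per-stripe column statistics.
stripes = [
    {"cash_min": -25, "cash_max": 50},   # e.g. stripe 1
    {"cash_min": 100, "cash_max": 200},  # e.g. stripe 2
]

def may_contain(stats, lo=None, hi=None):
    """Return False only when the stats prove no row has cash in [lo, hi]."""
    if lo is not None and stats["cash_max"] < lo:
        return False
    if hi is not None and stats["cash_min"] > hi:
        return False
    return True

# For WHERE cash >= 100, stripe 1 (max 50) is skipped without being read.
print([may_contain(s, lo=100) for s in stripes])  # [False, True]
```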

Hive Table Properties

These parameters are set in the TBLPROPERTIES clause when creating a table.

Key                       Default      Notes
orc.compress              ZLIB         high-level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size         262,144      number of bytes in each compression chunk (0.25 MB)
orc.stripe.size           67,108,864   number of bytes in the memory buffer used for writing a stripe (64 MB)
orc.row.index.stride      10,000       number of rows between index entries
orc.create.index          true         whether to create row indexes
orc.bloom.filter.columns  ""           comma-separated list of column names to build bloom filters for
orc.bloom.filter.fpp      0.05         bloom filter false positive probability (must be > 0.0 and < 1.0)

Creating an ORC Table

CREATE TABLE IF NOT EXISTS bank.account_orc (
  `id_card` int,
  `tran_time` string,
  `name` string,
  `cash` int
)
PARTITIONED BY (ds string)
STORED AS ORC;

If no compression property is set, the table defaults to ZLIB compression. The property can also be set to NONE or SNAPPY.
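The ZLIB default pays off because a columnar layout puts many similar values side by side. As a rough, Hive-independent illustration, the snippet below compresses a repetitive "name" stream with Python's standard-library zlib (the same DEFLATE family as ORC's ZLIB codec):

```python
import zlib

# A columnar "name" stream: many similar values stored contiguously,
# roughly as they would sit inside an ORC stripe.
column = ("Tom\n" * 1000).encode()
compressed = zlib.compress(column)

print(f"raw: {len(column)} bytes, zlib: {len(compressed)} bytes")
assert zlib.decompress(compressed) == column  # lossless round trip
```

The compressed stream is a small fraction of the raw size; real ORC data compresses less dramatically, but the same principle applies.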

Viewing the ORC Table DDL

SHOW CREATE TABLE bank.account_orc;

CREATE TABLE `bank.account_orc`(
  `id_card` int, 
  `tran_time` string, 
  `name` string, 
  `cash` int)
PARTITIONED BY ( 
  `ds` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://nameservice1/user/hive/warehouse/bank.db/account_orc'
TBLPROPERTIES (
  'transient_lastDdlTime'='1627435885')

Inserting Sample Data into the ORC Table

INSERT INTO bank.account_orc partition(ds='2020-09-21') values (1000, '2020-09-21 14:30:00', 'Tom', 100);
INSERT INTO bank.account_orc partition(ds='2020-09-20') values (1000, '2020-09-20 14:30:05', 'Tom', 50);
INSERT INTO bank.account_orc partition(ds='2020-09-20') values (1000, '2020-09-20 14:30:10', 'Tom', -25);
INSERT INTO bank.account_orc partition(ds='2020-09-21') values (1001, '2020-09-21 15:30:00', 'Jelly', 200);
INSERT INTO bank.account_orc partition(ds='2020-09-21') values (1001, '2020-09-21 15:30:05', 'Jelly', -50);

Inspecting the Underlying Files

[root@jwldata.com ~]# hadoop fs -ls /user/hive/warehouse/bank.db/account_orc/ds=2020-09-20                     
Found 2 items
-rwxrwx--x+  3 hive hive        519 2021-07-28 09:33 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
-rwxrwx--x+  3 hive hive        519 2021-07-28 09:34 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0_copy_1
[root@jwldata.com ~]# 
[root@jwldata.com ~]# hadoop fs -ls /user/hive/warehouse/bank.db/account_orc/ds=2020-09-21
Found 3 items
-rwxrwx--x+  3 hive hive        516 2021-07-28 09:32 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-21/000000_0
-rwxrwx--x+  3 hive hive        528 2021-07-28 09:34 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-21/000000_0_copy_1
-rwxrwx--x+  3 hive hive        528 2021-07-28 09:35 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-21/000000_0_copy_2

Dumping ORC File Metadata

hive --orcfiledump hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0

[root@jwldata.com ~]# hive --orcfiledump hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
Structure for hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
File Version: 0.12 with HIVE_8732
21/07/28 09:52:12 INFO orc.ReaderImpl: Reading ORC rows from hdfs:/user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:int,_col1:string,_col2:string,_col3:int>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false min: 1000 max: 1000 sum: 1000
    Column 2: count: 1 hasNull: false min: 2020-09-20 14:30:05 max: 2020-09-20 14:30:05 sum: 19
    Column 3: count: 1 hasNull: false min: Tom max: Tom sum: 3
    Column 4: count: 1 hasNull: false min: 50 max: 50 sum: 50

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false min: 1000 max: 1000 sum: 1000
  Column 2: count: 1 hasNull: false min: 2020-09-20 14:30:05 max: 2020-09-20 14:30:05 sum: 19
  Column 3: count: 1 hasNull: false min: Tom max: Tom sum: 3
  Column 4: count: 1 hasNull: false min: 50 max: 50 sum: 50

Stripes:
  Stripe: offset: 3 data: 53 rows: 1 tail: 65 index: 136
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 27
    Stream: column 2 section ROW_INDEX start: 41 length 45
    Stream: column 3 section ROW_INDEX start: 86 length 29
    Stream: column 4 section ROW_INDEX start: 115 length 24
    Stream: column 1 section DATA start: 139 length 7
    Stream: column 2 section DATA start: 146 length 22
    Stream: column 2 section LENGTH start: 168 length 6
    Stream: column 3 section DATA start: 174 length 6
    Stream: column 3 section LENGTH start: 180 length 6
    Stream: column 4 section DATA start: 186 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
    Encoding column 3: DIRECT_V2
    Encoding column 4: DIRECT_V2

File length: 519 bytes
Padding length: 0 bytes
Padding ratio: 0%
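One detail of the statistics above that is easy to misread: for string columns, `sum` reports the total number of characters, not a numeric sum, while integer columns report the actual numeric sum (1000 and 50 for columns 1 and 4). The string lengths check out:

```python
# ORC string-column statistics report `sum` as total character length.
# These match the orcfiledump output above for column 2 and column 3.
print(len("2020-09-20 14:30:05"))  # 19
print(len("Tom"))                  # 3
```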

Dumping ORC File Data

hive --orcfiledump -d hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0

[root@jwldata.com ~]# hive --orcfiledump -d hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
21/07/28 09:53:08 INFO orc.ReaderImpl: Reading ORC rows from hdfs:/user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
{"_col0":1000,"_col1":"2020-09-20 14:30:05","_col2":"Tom","_col3":50}
[root@jwldata.com ~]# 
[root@jwldata.com ~]# hive --orcfiledump -d hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0_copy_1
21/07/28 09:53:50 INFO orc.ReaderImpl: Reading ORC rows from hdfs:/user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0_copy_1 with {include: null, offset: 0, length: 9223372036854775807}
{"_col0":1000,"_col1":"2020-09-20 14:30:10","_col2":"Tom","_col3":-25}

Creating an ORC Table with Snappy Compression

CREATE TABLE IF NOT EXISTS bank.account_orc_snappy (
  `id_card` int,
  `tran_time` string,
  `name` string,
  `cash` int
)
PARTITIONED BY (ds string)
STORED AS ORC
TBLPROPERTIES ("orc.compression"="SNAPPY");

Viewing the DDL of the Snappy-Compressed ORC Table

SHOW CREATE TABLE bank.account_orc_snappy;

CREATE TABLE `bank.account_orc_snappy`(
  `id_card` int, 
  `tran_time` string, 
  `name` string, 
  `cash` int)
PARTITIONED BY ( 
  `ds` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://nameservice1/user/hive/warehouse/bank.db/account_orc_snappy'
TBLPROPERTIES (
  'orc.compression'='SNAPPY', 
  'transient_lastDdlTime'='1627542655')

References

  • https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
  • https://orc.apache.org/specification/ORCv1/
  • https://blog.csdn.net/dabokele/article/details/51542327
  • https://blog.csdn.net/dabokele/article/details/51813322

Feel free to follow my WeChat official account "九万里大数据", where original technical articles are pushed as soon as they are published.
You are also welcome to visit the original blog at jwldata.com, where the typesetting is cleaner and easier to read.
