The ORC File Storage Format and Creating ORC Tables in Hive

Originally published at: https://www.jwldata.com/archives/134.html

Advantages of ORC

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

Compared with RCFile format, for example, ORC file format has many advantages such as:

  • a single file as the output of each task, which reduces the NameNode’s load
  • Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
  • light-weight indexes stored within the file
    • skip row groups that don’t pass predicate filtering
    • seek to a given row
  • block-mode compression based on data type
    • run-length encoding for integer columns
    • dictionary encoding for string columns
  • concurrent reads of the same file using separate RecordReaders
  • ability to split files without scanning for markers
  • bound the amount of memory needed for reading or writing
  • metadata stored using Protocol Buffers, which allows addition and removal of fields
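Two of the type-aware encodings listed above are easy to picture. The sketch below is a deliberately simplified model of the idea, not ORC's actual RLE v1/v2 or dictionary stream formats (which add bit-packing, delta runs, and separate length streams):

```python
from itertools import groupby

def rle_encode(values):
    """Collapse runs of repeated integers into (value, run_length) pairs -
    a toy version of the idea behind ORC's integer run-length encoding."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def dict_encode(values):
    """Replace each string with an index into a dictionary of distinct
    values - the idea behind ORC's string dictionary encoding."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]

print(rle_encode([7, 7, 7, 7, 9, 9, 1]))           # [(7, 4), (9, 2), (1, 1)]
print(dict_encode(["Tom", "Jelly", "Tom", "Tom"]))  # (['Jelly', 'Tom'], [1, 0, 1, 1])
```

Both transforms shrink repetitive columnar data before the general-purpose block compression (ZLIB/SNAPPY) is applied on top.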

ORC File Structure

An ORC file is made up of stripes, a file footer, and a postscript.

  • file footer
    Contains the list of stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates: count, min, max, and sum.
  • postscript
    Holds compression parameters and the size of the compressed footer.
  • stripe
    Each stripe in an ORC file holds index data, row data, and a stripe footer.
    The index data includes min and max values for each column and the row positions within each column.
    The row data is what gets read during table scans.
    The stripe footer contains a directory of stream locations.
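The per-stripe min/max index data is what enables predicate pushdown: a reader can prove a stripe holds no matching rows without reading its row data. A toy model of that decision (the stats mirror what `hive --orcfiledump` prints; the names are illustrative, not a real ORC reader API):

```python
# Toy model of stripe skipping via per-stripe column statistics.
stripes = [
    {"cash_min": -25, "cash_max": 50},   # e.g. stripe 1
    {"cash_min": 100, "cash_max": 200},  # e.g. stripe 2
]

def may_contain(stats, lo=None, hi=None):
    """Return False only when the stats prove no row has cash in [lo, hi]."""
    if lo is not None and stats["cash_max"] < lo:
        return False
    if hi is not None and stats["cash_min"] > hi:
        return False
    return True

# For WHERE cash >= 100, stripe 1 (max 50) is skipped without being read.
print([may_contain(s, lo=100) for s in stripes])  # [False, True]
```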

Hive Table Properties

These parameters are set in the TBLPROPERTIES clause when creating a table.

Key                       Default      Notes
orc.compress              ZLIB         high-level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size         262,144      number of bytes in each compression chunk (0.25 MB)
orc.stripe.size           67,108,864   number of bytes in the memory buffer used for writing a stripe (64 MB)
orc.row.index.stride      10,000       number of rows between index entries
orc.create.index          true         whether to create row indexes
orc.bloom.filter.columns  ""           comma-separated list of column names to build bloom filters for
orc.bloom.filter.fpp      0.05         bloom filter false positive probability (must be > 0.0 and < 1.0)

Creating an ORC Table

CREATE TABLE IF NOT EXISTS bank.account_orc (
  `id_card` int,
  `tran_time` string,
  `name` string,
  `cash` int
)
PARTITIONED BY (ds string)
STORED AS ORC;

If no compression property is set, the table defaults to ZLIB compression. The property can also be set to NONE or SNAPPY.
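The ZLIB default pays off because a columnar layout puts many similar values side by side. As a rough, Hive-independent illustration, the snippet below compresses a repetitive "name" stream with Python's standard-library zlib (the same DEFLATE family as ORC's ZLIB codec):

```python
import zlib

# A columnar "name" stream: many similar values stored contiguously,
# roughly as they would sit inside an ORC stripe.
column = ("Tom\n" * 1000).encode()
compressed = zlib.compress(column)

print(f"raw: {len(column)} bytes, zlib: {len(compressed)} bytes")
assert zlib.decompress(compressed) == column  # lossless round trip
```

The compressed stream is a small fraction of the raw size; real ORC data compresses less dramatically, but the same principle applies.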

Viewing the ORC Table DDL

SHOW CREATE TABLE bank.account_orc;

CREATE TABLE `bank.account_orc`(
  `id_card` int, 
  `tran_time` string, 
  `name` string, 
  `cash` int)
PARTITIONED BY ( 
  `ds` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://nameservice1/user/hive/warehouse/bank.db/account_orc'
TBLPROPERTIES (
  'transient_lastDdlTime'='1627435885')

Inserting Sample Data into the ORC Table

INSERT INTO bank.account_orc partition(ds='2020-09-21') values (1000, '2020-09-21 14:30:00', 'Tom', 100);
INSERT INTO bank.account_orc partition(ds='2020-09-20') values (1000, '2020-09-20 14:30:05', 'Tom', 50);
INSERT INTO bank.account_orc partition(ds='2020-09-20') values (1000, '2020-09-20 14:30:10', 'Tom', -25);
INSERT INTO bank.account_orc partition(ds='2020-09-21') values (1001, '2020-09-21 15:30:00', 'Jelly', 200);
INSERT INTO bank.account_orc partition(ds='2020-09-21') values (1001, '2020-09-21 15:30:05', 'Jelly', -50);

Inspecting the Underlying Files

[root@jwldata.com ~]# hadoop fs -ls /user/hive/warehouse/bank.db/account_orc/ds=2020-09-20                     
Found 2 items
-rwxrwx--x+  3 hive hive        519 2021-07-28 09:33 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
-rwxrwx--x+  3 hive hive        519 2021-07-28 09:34 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0_copy_1
[root@jwldata.com ~]# 
[root@jwldata.com ~]# hadoop fs -ls /user/hive/warehouse/bank.db/account_orc/ds=2020-09-21
Found 3 items
-rwxrwx--x+  3 hive hive        516 2021-07-28 09:32 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-21/000000_0
-rwxrwx--x+  3 hive hive        528 2021-07-28 09:34 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-21/000000_0_copy_1
-rwxrwx--x+  3 hive hive        528 2021-07-28 09:35 /user/hive/warehouse/bank.db/account_orc/ds=2020-09-21/000000_0_copy_2

Dumping ORC File Metadata

hive --orcfiledump hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0

[root@jwldata.com ~]# hive --orcfiledump hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
Structure for hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
File Version: 0.12 with HIVE_8732
21/07/28 09:52:12 INFO orc.ReaderImpl: Reading ORC rows from hdfs:/user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:int,_col1:string,_col2:string,_col3:int>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false min: 1000 max: 1000 sum: 1000
    Column 2: count: 1 hasNull: false min: 2020-09-20 14:30:05 max: 2020-09-20 14:30:05 sum: 19
    Column 3: count: 1 hasNull: false min: Tom max: Tom sum: 3
    Column 4: count: 1 hasNull: false min: 50 max: 50 sum: 50

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false min: 1000 max: 1000 sum: 1000
  Column 2: count: 1 hasNull: false min: 2020-09-20 14:30:05 max: 2020-09-20 14:30:05 sum: 19
  Column 3: count: 1 hasNull: false min: Tom max: Tom sum: 3
  Column 4: count: 1 hasNull: false min: 50 max: 50 sum: 50

Stripes:
  Stripe: offset: 3 data: 53 rows: 1 tail: 65 index: 136
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 27
    Stream: column 2 section ROW_INDEX start: 41 length 45
    Stream: column 3 section ROW_INDEX start: 86 length 29
    Stream: column 4 section ROW_INDEX start: 115 length 24
    Stream: column 1 section DATA start: 139 length 7
    Stream: column 2 section DATA start: 146 length 22
    Stream: column 2 section LENGTH start: 168 length 6
    Stream: column 3 section DATA start: 174 length 6
    Stream: column 3 section LENGTH start: 180 length 6
    Stream: column 4 section DATA start: 186 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
    Encoding column 3: DIRECT_V2
    Encoding column 4: DIRECT_V2

File length: 519 bytes
Padding length: 0 bytes
Padding ratio: 0%
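One detail of the statistics above that is easy to misread: for string columns, `sum` reports the total number of characters, not a numeric sum, while integer columns report the actual numeric sum (1000 and 50 for columns 1 and 4). The string lengths check out:

```python
# ORC string-column statistics report `sum` as total character length.
# These match the orcfiledump output above for column 2 and column 3.
print(len("2020-09-20 14:30:05"))  # 19
print(len("Tom"))                  # 3
```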

Dumping ORC File Data

hive --orcfiledump -d hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0

[root@jwldata.com ~]# hive --orcfiledump -d hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0
21/07/28 09:53:08 INFO orc.ReaderImpl: Reading ORC rows from hdfs:/user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
{"_col0":1000,"_col1":"2020-09-20 14:30:05","_col2":"Tom","_col3":50}
[root@jwldata.com ~]# 
[root@jwldata.com ~]# hive --orcfiledump -d hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0_copy_1
21/07/28 09:53:50 INFO orc.ReaderImpl: Reading ORC rows from hdfs:/user/hive/warehouse/bank.db/account_orc/ds=2020-09-20/000000_0_copy_1 with {include: null, offset: 0, length: 9223372036854775807}
{"_col0":1000,"_col1":"2020-09-20 14:30:10","_col2":"Tom","_col3":-25}

Creating an ORC Table with Snappy Compression

CREATE TABLE IF NOT EXISTS bank.account_orc_snappy (
  `id_card` int,
  `tran_time` string,
  `name` string,
  `cash` int
)
PARTITIONED BY (ds string)
STORED AS ORC
TBLPROPERTIES ("orc.compression"="SNAPPY");

Viewing the DDL of the Snappy-Compressed ORC Table

SHOW CREATE TABLE bank.account_orc_snappy;

CREATE TABLE `bank.account_orc_snappy`(
  `id_card` int, 
  `tran_time` string, 
  `name` string, 
  `cash` int)
PARTITIONED BY ( 
  `ds` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://nameservice1/user/hive/warehouse/bank.db/account_orc_snappy'
TBLPROPERTIES (
  'orc.compression'='SNAPPY', 
  'transient_lastDdlTime'='1627542655')

References

  • https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
  • https://orc.apache.org/specification/ORCv1/
  • https://blog.csdn.net/dabokele/article/details/51542327
  • https://blog.csdn.net/dabokele/article/details/51813322

Feel free to follow my WeChat official account "九万里大数据", where original technical articles are pushed as soon as they are published.
You are also welcome to visit the original blog at jwldata.com, where the typesetting is cleaner and easier to read.
