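This post is my first-pass summary of the Hive metastore tables. Please go easy on any inaccuracies; I will add more later.
1. Overview of the Hive 0.11 metastore tables
Our production Hive 0.11 metastore contains the 39 tables listed below, which fall mainly into the following categories:
Database
Table
Data storage (SDS)
COLUMN
SERDE (serialization)
Partition
SKEW (data skew)
BUCKET (bucketing)
PRIVS (privilege management)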
mysql> show tables;
+---------------------------+
| Tables_in_hive_yz_test    |
+---------------------------+
| BUCKETING_COLS            |
| CDS                       |
| COLUMNS_V2                |
| DATABASE_PARAMS           |
| DBS                       |
| DB_PRIVS                  |
| GLOBAL_PRIVS              |
| IDXS                      |
| INDEX_PARAMS              |
| NUCLEUS_TABLES            |
| PARTITIONS                |
| PARTITION_EVENTS          |
| PARTITION_KEYS            |
| PARTITION_KEY_VALS        |
| PARTITION_PARAMS          |
| PART_COL_PRIVS            |
| PART_COL_STATS            |
| PART_PRIVS                |
| ROLES                     |
| ROLE_MAP                  |
| SDS                       |
| SD_PARAMS                 |
| SEQUENCE_TABLE            |
| SERDES                    |
| SERDE_PARAMS              |
| SKEWED_COL_NAMES          |
| SKEWED_COL_VALUE_LOC_MAP  |
| SKEWED_STRING_LIST        |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES             |
| SORT_COLS                 |
| TABLE_PARAMS              |
| TAB_COL_STATS             |
| TBLS                      |
| TBL_COL_PRIVS             |
| TBL_PRIVS                 |
| TYPES                     |
| TYPE_FIELDS               |
| VERSION                   |
+---------------------------+
39 rows in set (0.00 sec)
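2. What each table means
2.1 Database table: DBS
Description: this table stores the metadata of Hive databases. DB_ID is the database ID, NAME is the database name, DB_LOCATION_URI is the database's location in HDFS, and DESC is the database's description.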
mysql> desc DBS;
+-----------------+---------------+------+-----+---------+-------+
| Field           | Type          | Null | Key | Default | Extra |
+-----------------+---------------+------+-----+---------+-------+
| DB_ID           | bigint(20)    | NO   | PRI | NULL    |       |
| DESC            | varchar(4000) | YES  |     | NULL    |       |
| DB_LOCATION_URI | varchar(4000) | NO   |     | NULL    |       |
| NAME            | varchar(128)  | YES  | UNI | NULL    |       |
+-----------------+---------------+------+-----+---------+-------+
mysql> select * from DBS where NAME = 'acorn_3g';
+-------+------+-------------------------------------------------------+----------+
| DB_ID | DESC | DB_LOCATION_URI                                       | NAME     |
+-------+------+-------------------------------------------------------+----------+
+-------+------+-------------------------------------------------------+----------+
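2.2 Table table (TBLS)
Description:
TBLS stores the metadata of Hive tables; each table has a unique TBL_ID.
DB_ID is a foreign key to the owning database, and SD_ID references the primary key of the SDS table, which in turn holds the column reference (CD_ID) and related information. The chain is: TBLS.SD_ID joins SDS.SD_ID, SDS.CD_ID joins CDS.CD_ID, and CDS.CD_ID joins COLUMNS_V2.CD_ID.
Example: for the acorn_3g.user_act table, TBL_ID is 41231, TBL_TYPE is MANAGED_TABLE (an ordinary managed table; EXTERNAL_TABLE would indicate an external table), and DB_ID is 81, meaning the table belongs to the database with DB_ID=81.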
mysql> select * from TBLS where TBL_NAME='user_act' and DB_ID=81 \G
*************************** 1. row ***************************
            TBL_ID: 41231
       CREATE_TIME: 1366188055
             DB_ID: 81
  LAST_ACCESS_TIME: 0
             OWNER: xianbing.liu
         RETENTION: 0
             SD_ID: 263311
          TBL_NAME: user_act
          TBL_TYPE: MANAGED_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
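To make the DB_ID link concrete, here is a minimal sketch that resolves a table from its database name and table name. It uses an in-memory SQLite database as a stand-in for the MySQL metastore (an assumption made only so the snippet runs on its own), creates just the columns the query needs, and loads the acorn_3g.user_act values quoted above. The helper name find_table is hypothetical.

```python
import sqlite3

# In-memory stand-in for the metastore (assumption: the real one is MySQL).
# Only the columns used by the query are created.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DBS  (DB_ID INTEGER PRIMARY KEY, NAME TEXT UNIQUE);
    CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER, SD_ID INTEGER,
                       TBL_NAME TEXT, TBL_TYPE TEXT);
    INSERT INTO DBS  VALUES (81, 'acorn_3g');
    INSERT INTO TBLS VALUES (41231, 81, 263311, 'user_act', 'MANAGED_TABLE');
""")


def find_table(conn, db_name, tbl_name):
    """Return (TBL_ID, SD_ID, TBL_TYPE) for db_name.tbl_name, or None."""
    return conn.execute(
        """
        SELECT t.TBL_ID, t.SD_ID, t.TBL_TYPE
        FROM TBLS t
        JOIN DBS d ON d.DB_ID = t.DB_ID      -- TBLS.DB_ID points at DBS.DB_ID
        WHERE d.NAME = ? AND t.TBL_NAME = ?
        """,
        (db_name, tbl_name),
    ).fetchone()


table_row = find_table(conn, "acorn_3g", "user_act")
print(table_row)
```

Joining on the database name avoids having to look up DB_ID=81 by hand first; the same SELECT should carry over to MySQL with its own parameter placeholder style.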
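2.3 SDS table (data storage)
Description:
The SDS table holds the storage information for all HDFS data files in the Hive warehouse; each SD_ID uniquely identifies one storage record.
CD_ID joins COLUMNS_V2.CD_ID and identifies the columns of that data.
SERDE_ID joins SERDES.SERDE_ID and identifies the serialization settings of that data (for example whether it is a serialized table, the DELIMITED fields, and so on).
Example:
From SDS we find that the acorn_3g.user_act table has CD_ID 263311 and SERDE_ID 263301; the default storage location is omitted.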
mysql> select * from SDS where SD_ID=263311 \G
*************************** 1. row ***************************
                    SD_ID: 263311
                    CD_ID: 263311
             INPUT_FORMAT: org.apache.hadoop.mapred.TextInputFormat
            IS_COMPRESSED:
              NUM_BUCKETS: -1
            OUTPUT_FORMAT: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                 SERDE_ID: 263301
IS_STOREDASSUBDIRECTORIES:
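2.4 CDS and COLUMNS_V2 (column information)
CDS table
Description:
This table has a single column, CD_ID, and records every CD_ID in the Hive warehouse.
Example:
The CD_ID of the acorn_3g.user_act table is recorded in CDS: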
mysql> desc CDS;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| CD_ID | bigint(20) | NO   | PRI | NULL    |       |
+-------+------------+------+-----+---------+-------+
1 row in set (0.00 sec)
mysql> select * FROM CDS where CD_ID=263311;
+--------+
| CD_ID  |
+--------+
| 263311 |
+--------+
1 row in set (0.00 sec)
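COLUMNS_V2 table
Description:
This table stores all the column information for a given CD_ID.
Example:
Looking up the columns of acorn_3g.user_act shows that the table has 14 columns. COLUMN_NAME is the column name, TYPE_NAME is the column type, and INTEGER_IDX is the column's ordinal position.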
mysql> select * from COLUMNS_V2 where CD_ID=263311 order by integer_idx;
+--------+---------+---------------+-----------+-------------+
| CD_ID  | COMMENT | COLUMN_NAME   | TYPE_NAME | INTEGER_IDX |
+--------+---------+---------------+-----------+-------------+
| 263311 | NULL    | id            | bigint    |           0 |
| 263311 | NULL    | action_id     | int       |           1 |
| 263311 | NULL    | user_id       | bigint    |           2 |
| 263311 | NULL    | request       | string    |           3 |
| 263311 | NULL    | visit_time    | string    |           4 |
| 263311 | NULL    | source_id     | int       |           5 |
| 263311 | NULL    | sess_id       | string    |           6 |
| 263311 | NULL    | mobile_number | string    |           7 |
| 263311 | NULL    | from_id       | string    |           8 |
| 263311 | NULL    | app_id        | string    |           9 |
| 263311 | NULL    | version       | string    |          10 |
| 263311 | NULL    | reg_type      | int       |          11 |
| 263311 | NULL    | uniqid        | string    |          12 |
| 263311 | NULL    | failure       | int       |          13 |
+--------+---------+---------------+-----------+-------------+
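The lookups so far were done one table at a time (TBLS, then SDS, then COLUMNS_V2). As a rough sketch, the same chain can be followed in a single join. The snippet below again uses an in-memory SQLite database as a stand-in for the MySQL metastore (an assumption, so that it runs on its own), with only the columns the join needs and the user_act rows shown above. The variable names are mine.

```python
import sqlite3

# In-memory stand-in for the metastore; only the joined columns are created.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, SD_ID INTEGER, TBL_NAME TEXT);
    CREATE TABLE SDS  (SD_ID INTEGER PRIMARY KEY, CD_ID INTEGER);
    CREATE TABLE COLUMNS_V2 (CD_ID INTEGER, COLUMN_NAME TEXT, TYPE_NAME TEXT,
                             INTEGER_IDX INTEGER);
    INSERT INTO TBLS VALUES (41231, 263311, 'user_act');
    INSERT INTO SDS  VALUES (263311, 263311);
""")

# The 14 columns of acorn_3g.user_act, in INTEGER_IDX order.
user_act_columns = [
    ("id", "bigint"), ("action_id", "int"), ("user_id", "bigint"),
    ("request", "string"), ("visit_time", "string"), ("source_id", "int"),
    ("sess_id", "string"), ("mobile_number", "string"), ("from_id", "string"),
    ("app_id", "string"), ("version", "string"), ("reg_type", "int"),
    ("uniqid", "string"), ("failure", "int"),
]
conn.executemany(
    "INSERT INTO COLUMNS_V2 VALUES (263311, ?, ?, ?)",
    [(name, type_name, idx) for idx, (name, type_name) in enumerate(user_act_columns)],
)

# TBLS.SD_ID -> SDS.SD_ID, then SDS.CD_ID -> COLUMNS_V2.CD_ID.
columns = conn.execute(
    """
    SELECT c.COLUMN_NAME, c.TYPE_NAME
    FROM TBLS t
    JOIN SDS s        ON s.SD_ID = t.SD_ID
    JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID
    WHERE t.TBL_ID = ?
    ORDER BY c.INTEGER_IDX
    """,
    (41231,),
).fetchall()

for name, type_name in columns:
    print(f"{name} {type_name}")
```

CDS is skipped in the join because it only carries the CD_ID itself; going from SDS.CD_ID straight to COLUMNS_V2.CD_ID gives the same rows.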
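2.5 SERDES and SERDE_PARAMS (serialization)
Description:
SERDES stores all serialization information (SERDE_ID, SLIB); SLIB is the Java class used for serialization.
SERDE_PARAMS stores the specific serialization parameters and their values.
Example: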
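For acorn_3g.user_act, SERDE_ID=263301 means the table uses Hive's default serialization class, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, with '\t' as the field delimiter (that is, the ... DELIMITED FIELDS TERMINATED BY '\t' ... clause given when the table was created).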
mysql> select * FROM SERDES where SERDE_ID=263301;
+----------+------+----------------------------------------------------+
| SERDE_ID | NAME | SLIB                                               |
+----------+------+----------------------------------------------------+
|   263301 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
+----------+------+----------------------------------------------------+
mysql> select SERDE_ID,PARAM_KEY,REPLACE(PARAM_VALUE,'\t','\\t') from SERDE_PARAMS where SERDE_ID=263301;
+----------+----------------------+---------------------------------+
| SERDE_ID | PARAM_KEY            | REPLACE(PARAM_VALUE,'\t','\\t') |
+----------+----------------------+---------------------------------+
|   263301 | field.delim          | \t                              |
|   263301 | serialization.format | \t                              |
+----------+----------------------+---------------------------------+
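The two lookups above can also be combined, starting from the SD_ID that TBLS gives us. The following is a sketch only: it runs against an in-memory SQLite stand-in for the MySQL metastore (an assumption), with just the columns it needs and the values shown above, and the helper name serde_for_sd is hypothetical.

```python
import sqlite3

# In-memory stand-in for the metastore; only the needed columns are created.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SDS (SD_ID INTEGER PRIMARY KEY, SERDE_ID INTEGER);
    CREATE TABLE SERDES (SERDE_ID INTEGER PRIMARY KEY, NAME TEXT, SLIB TEXT);
    CREATE TABLE SERDE_PARAMS (SERDE_ID INTEGER, PARAM_KEY TEXT, PARAM_VALUE TEXT);
    INSERT INTO SDS VALUES (263311, 263301);
    INSERT INTO SERDES VALUES
        (263301, NULL, 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe');
""")
# The parameter values are real tab characters, so bind them as parameters.
conn.executemany(
    "INSERT INTO SERDE_PARAMS VALUES (263301, ?, ?)",
    [("field.delim", "\t"), ("serialization.format", "\t")],
)


def serde_for_sd(conn, sd_id):
    """Return (SLIB, {PARAM_KEY: PARAM_VALUE}) for one storage descriptor."""
    rows = conn.execute(
        """
        SELECT sr.SLIB, p.PARAM_KEY, p.PARAM_VALUE
        FROM SDS s
        JOIN SERDES sr           ON sr.SERDE_ID = s.SERDE_ID
        LEFT JOIN SERDE_PARAMS p ON p.SERDE_ID = sr.SERDE_ID
        WHERE s.SD_ID = ?
        """,
        (sd_id,),
    ).fetchall()
    if not rows:
        return None, {}
    # The LEFT JOIN yields a NULL key when a SerDe has no parameters at all.
    params = {key: value for _, key, value in rows if key is not None}
    return rows[0][0], params


slib, params = serde_for_sd(conn, 263311)
print(slib)
print({key: repr(value) for key, value in params.items()})
```

Printing with repr plays the same role as the REPLACE(...) call in the MySQL query: it makes the otherwise invisible tab delimiter readable.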
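2.6 PARTITIONS, PARTITION_KEYS and PARTITION_KEY_VALS (partitions)
PARTITION_KEYS
Description:
PARTITION_KEYS stores the columns that each partitioned table is partitioned by.
Example:
PARTITION_KEYS shows that acorn_3g.user_act is a partitioned table whose partition column is log_date. INTEGER_IDX is the ordinal of the partition column, with one value per partition column.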
mysql> desc PARTITION_KEYS;
+--------------+---------------+------+-----+---------+-------+
| Field        | Type          | Null | Key | Default | Extra |
+--------------+---------------+------+-----+---------+-------+
| TBL_ID       | bigint(20)    | NO   | PRI | NULL    |       |
| PKEY_COMMENT | varchar(4000) | YES  |     | NULL    |       |
| PKEY_NAME    | varchar(128)  | NO   | PRI | NULL    |       |
| PKEY_TYPE    | varchar(767)  | NO   |     | NULL    |       |
| INTEGER_IDX  | int(11)       | NO   |     | NULL    |       |
+--------------+---------------+------+-----+---------+-------+
mysql> select * FROM PARTITION_KEYS WHERE TBL_ID=41231;
+--------+--------------+-----------+-----------+-------------+
| TBL_ID | PKEY_COMMENT | PKEY_NAME | PKEY_TYPE | INTEGER_IDX |
+--------+--------------+-----------+-----------+-------------+
|  41231 | NULL         | log_date  | string    |           0 |
+--------+--------------+-----------+-----------+-------------+
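PARTITIONS
Description:
PARTITIONS stores every partition in the Hive warehouse. Each partition is identified by PART_ID; TBL_ID is the table it belongs to, and SD_ID is its SDS record (see 2.3).
Example:
PARTITIONS lists the partitions of acorn_3g.user_act; for instance, the partition with PART_ID 168301 is named log_date=2013-03-01 and has SD_ID 231621.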
mysql> desc PARTITIONS;
+------------------+--------------+------+-----+---------+-------+
| Field            | Type         | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| PART_ID          | bigint(20)   | NO   | PRI | NULL    |       |
| CREATE_TIME      | int(11)      | NO   |     | NULL    |       |
| LAST_ACCESS_TIME | int(11)      | NO   |     | NULL    |       |
| PART_NAME        | varchar(767) | YES  | MUL | NULL    |       |
| SD_ID            | bigint(20)   | YES  | MUL | NULL    |       |
| TBL_ID           | bigint(20)   | YES  | MUL | NULL    |       |
+------------------+--------------+------+-----+---------+-------+
mysql> select * FROM PARTITIONS WHERE TBL_ID=41231 order by PART_NAME limit 5;
+---------+-------------+------------------+---------------------+--------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME           | SD_ID  | TBL_ID |
+---------+-------------+------------------+---------------------+--------+--------+
|  168301 |  1366259946 |                0 | log_date=2013-03-01 | 231621 |  41231 |
|  168321 |  1366260063 |                0 | log_date=2013-03-02 | 231641 |  41231 |
|  168331 |  1366260176 |                0 | log_date=2013-03-03 | 231651 |  41231 |
|  168346 |  1366260298 |                0 | log_date=2013-03-04 | 231666 |  41231 |
|  168361 |  1366260398 |                0 | log_date=2013-03-05 | 231681 |  41231 |
+---------+-------------+------------------+---------------------+--------+--------+
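One detail worth noticing in this output is that each partition carries its own SD_ID, separate from the table-level SD_ID (263311) in TBLS. A small sketch that puts the two side by side follows. It uses an in-memory SQLite stand-in for the MySQL metastore (an assumption), only the columns needed, and the first three partition rows from above.

```python
import sqlite3

# In-memory stand-in for the metastore; only the needed columns are created.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, SD_ID INTEGER, TBL_NAME TEXT);
    CREATE TABLE PARTITIONS (PART_ID INTEGER PRIMARY KEY, PART_NAME TEXT,
                             SD_ID INTEGER, TBL_ID INTEGER);
    INSERT INTO TBLS VALUES (41231, 263311, 'user_act');
    INSERT INTO PARTITIONS VALUES
        (168301, 'log_date=2013-03-01', 231621, 41231),
        (168321, 'log_date=2013-03-02', 231641, 41231),
        (168331, 'log_date=2013-03-03', 231651, 41231);
""")

# PARTITIONS.TBL_ID points back at TBLS; both sides have their own SD_ID.
# Filtering on TBL_NAME alone is enough here because only one table is loaded;
# on a real metastore the DB_ID would also have to be fixed.
partitions = conn.execute(
    """
    SELECT p.PART_NAME, p.SD_ID AS PART_SD_ID, t.SD_ID AS TABLE_SD_ID
    FROM PARTITIONS p
    JOIN TBLS t ON t.TBL_ID = p.TBL_ID
    WHERE t.TBL_NAME = ?
    ORDER BY p.PART_NAME
    """,
    ("user_act",),
).fetchall()

for part_name, part_sd_id, table_sd_id in partitions:
    print(part_name, part_sd_id, table_sd_id)
```

To inspect the storage settings of one partition rather than of the table, the partition's SD_ID is the one to look up in SDS.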
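PARTITION_KEY_VALS
Description:
PARTITION_KEY_VALS stores the values of the partition columns described in PARTITION_KEYS, and is normally used together with PARTITIONS and PARTITION_KEYS.
Example:
Look up the partition column values for PART_ID=168301. PARTITION_KEY_VALS maps each partition column ordinal (INTEGER_IDX) to its value (PART_KEY_VAL). In this example, the log_date column of partition PART_ID=168301 of acorn_3g.user_act has the value '2013-03-01'.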
mysql> select pk.PKEY_NAME,pk.PKEY_TYPE,pk.INTEGER_IDX,pkv.PART_KEY_VAL from PARTITION_KEYS pk,PARTITION_KEY_VALS pkv where pk.INTEGER_IDX=pkv.INTEGER_IDX and pk.TBL_ID=41231 and pkv.PART_ID=168301;
+-----------+-----------+-------------+--------------+
| PKEY_NAME | PKEY_TYPE | INTEGER_IDX | PART_KEY_VAL |
+-----------+-----------+-------------+--------------+
| log_date  | string    |           0 | 2013-03-01   |
+-----------+-----------+-------------+--------------+
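The query above needs both the TBL_ID and the PART_ID to be supplied by hand. A possible variation, sketched here under assumptions, is to join through PARTITIONS so that the PART_ID alone is enough. It runs against an in-memory SQLite stand-in for the MySQL metastore with only the columns and rows shown in this section; the helper name partition_key_values is hypothetical, and rebuilding the name as key=value pairs joined with "/" is my assumption about how PART_NAME is laid out for multi-column partitions.

```python
import sqlite3

# In-memory stand-in for the metastore; only the needed columns are created.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PARTITIONS (PART_ID INTEGER PRIMARY KEY, PART_NAME TEXT,
                             SD_ID INTEGER, TBL_ID INTEGER);
    CREATE TABLE PARTITION_KEYS (TBL_ID INTEGER, PKEY_NAME TEXT, PKEY_TYPE TEXT,
                                 INTEGER_IDX INTEGER);
    CREATE TABLE PARTITION_KEY_VALS (PART_ID INTEGER, PART_KEY_VAL TEXT,
                                     INTEGER_IDX INTEGER);
    INSERT INTO PARTITIONS VALUES (168301, 'log_date=2013-03-01', 231621, 41231);
    INSERT INTO PARTITION_KEYS VALUES (41231, 'log_date', 'string', 0);
    INSERT INTO PARTITION_KEY_VALS VALUES (168301, '2013-03-01', 0);
""")


def partition_key_values(conn, part_id):
    """Return [(PKEY_NAME, PART_KEY_VAL), ...] in partition-column order."""
    return conn.execute(
        """
        SELECT pk.PKEY_NAME, pkv.PART_KEY_VAL
        FROM PARTITIONS p
        JOIN PARTITION_KEYS pk      ON pk.TBL_ID = p.TBL_ID
        JOIN PARTITION_KEY_VALS pkv ON pkv.PART_ID = p.PART_ID
                                   AND pkv.INTEGER_IDX = pk.INTEGER_IDX
        WHERE p.PART_ID = ?
        ORDER BY pk.INTEGER_IDX
        """,
        (part_id,),
    ).fetchall()


key_values = partition_key_values(conn, 168301)
rebuilt_name = "/".join(f"{name}={value}" for name, value in key_values)
stored_name = conn.execute(
    "SELECT PART_NAME FROM PARTITIONS WHERE PART_ID = ?", (168301,)
).fetchone()[0]
print(key_values, rebuilt_name, rebuilt_name == stored_name)
```

Because PARTITIONS already knows which table a partition belongs to, the join cannot pair a partition with the keys of some other table, which the hand-written TBL_ID and PART_ID pair would allow. Values containing special characters may be stored escaped in PART_NAME, so the rebuilt name is only expected to match for simple values like this one.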
2.7 BUCKET tables
Description:
BUCKETING_COLS describes every SDS that uses bucketing. Our company does not currently use buckets. //TODO
mysql> desc BUCKETING_COLS;
+-----------------+--------------+------+-----+---------+-------+
| Field           | Type         | Null | Key | Default | Extra |
+-----------------+--------------+------+-----+---------+-------+
| SD_ID           | bigint(20)   | NO   | PRI | NULL    |       |
| BUCKET_COL_NAME | varchar(256) | YES  |     | NULL    |       |
| INTEGER_IDX     | int(11)      | NO   | PRI | NULL    |       |
+-----------------+--------------+------+-----+---------+-------+
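2.8 PRIVS (privilege management) tables
TBL_PRIVS, DB_PRIVS, PART_PRIVS and so on. Hive's privilege management currently falls far short of a relational database's, and our company does not manage privileges centrally either.
2.9 SKEW (data skew) tables
Compared with version 0.8, the 0.11 metastore adds tables for data skew: SKEWED_COL_NAMES, SKEWED_COL_VALUE_LOC_MAP, SKEWED_STRING_LIST, SKEWED_STRING_LIST_VALUES and SKEWED_VALUES. These advanced features are still being tested, and our company does not use them yet.
2.10 Other
Tables such as VERSION, which records version information. Developers do not need to pay much attention to these.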