CC00049.pbpositions | Hadoop & PB-Scale Data Warehouse V07 | PB Data Warehouse v07 | Zipper Table Implementation | Table Creation and Loading | Test Cases

yanqi_vip

Last modified: 2022-04-12 18:00:23


Category: bigdatav014 (PB Offline Data Warehouse). Tags: big data, hive, python, java, linux

First published: 2022-04-10 14:51:00

Reprinting is not permitted

Article link: https://blog.csdn.net/yanqi_vip/article/details/124090052


I. Dimension zipper table use case: case description

II. Creating the dimension zipper tables and loading data

### --- Create the user information table
~~~     User information

DROP TABLE IF EXISTS test.userinfo;

CREATE TABLE test.userinfo(
userid STRING COMMENT 'user ID',
mobile STRING COMMENT 'mobile number',
regdate STRING COMMENT 'registration date')
COMMENT 'user information'
PARTITIONED BY (dt string)
row format delimited fields terminated by ',';

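### --- Create the zipper table
~~~     Zipper table (stores each user's historical information)
~~~     The zipper table is not partitioned; it has two extra fields, start_date and end_date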

DROP TABLE IF EXISTS test.userhis;

CREATE TABLE test.userhis(
userid STRING COMMENT 'user ID',
mobile STRING COMMENT 'mobile number',
regdate STRING COMMENT 'registration date',
start_date STRING,
end_date STRING)
COMMENT 'user information zipper table'
row format delimited fields terminated by ',';
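
To make the role of start_date and end_date concrete, here is a small plain-Python sketch (not Hive) of reading a zipper table as validity intervals. It assumes a common convention that this article does not state: a record that is still current carries the sentinel end_date '9999-12-31'. The helper `snapshot_as_of` and the two history rows are hypothetical; the mobile numbers for user 002 are taken from the sample data in this article.

```python
from typing import NamedTuple

OPEN_END = "9999-12-31"  # assumed sentinel meaning "still current"; not from the article


class HisRow(NamedTuple):
    userid: str
    mobile: str
    regdate: str
    start_date: str
    end_date: str


def snapshot_as_of(rows, day):
    """Return the rows that were valid on `day`.

    ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    """
    return [r for r in rows if r.start_date <= day <= r.end_date]


# Hypothetical history for user 002, whose mobile number changes on 2020-06-21.
history = [
    HisRow("002", "13561111111", "2020-04-01", "2020-06-20", "2020-06-20"),
    HisRow("002", "13562222222", "2020-04-01", "2020-06-21", OPEN_END),
]

print(snapshot_as_of(history, "2020-06-20")[0].mobile)
print(snapshot_as_of(history, "2020-06-22")[0].mobile)
```

Each user contributes one row per version instead of one row per day, which is why the table does not need a dt partition.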

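### --- Prepare the data
~~~     Data file (/data/yanqidw/data/userinfo.dat)
~~~     Each row is userid,mobile,regdate,dt; the last field later becomes the partition value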

[root@hadoop02 ~]# vim /data/yanqidw/data/userinfo.dat

001,13551111111,2020-03-01,2020-06-20
002,13561111111,2020-04-01,2020-06-20
003,13571111111,2020-05-01,2020-06-20
004,13581111111,2020-06-01,2020-06-20
002,13562222222,2020-04-01,2020-06-21
004,13582222222,2020-06-01,2020-06-21
005,13552222222,2020-06-21,2020-06-21
004,13333333333,2020-06-01,2020-06-22
005,13533333333,2020-06-21,2020-06-22
006,13733333333,2020-06-22,2020-06-22
001,13554444444,2020-03-01,2020-06-23
003,13574444444,2020-05-01,2020-06-23
005,13555554444,2020-06-21,2020-06-23
007,18600744444,2020-06-23,2020-06-23
008,18600844444,2020-06-23,2020-06-23

III. Loading data with static partitions

### --- Prepare the data file
~~~     Static partition loading (shown briefly)

[root@hadoop02 ~]# vim /data/yanqidw/data/userinfo0620.dat

001,13551111111,2020-03-01
002,13561111111,2020-04-01
003,13571111111,2020-05-01
004,13581111111,2020-06-01

### --- Load data into a static partition

hive (default)> load data local inpath
'/data/yanqidw/data/userinfo0620.dat'
into table test.userinfo
partition(dt='2020-06-20');

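### --- Check that the data was loaded
~~~     With 4 partitions, the data has to be loaded 4 times,
~~~     and more partitions mean even more load statements.
~~~     This approach is too tedious.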

hive (default)> select * from test.userinfo;
userinfo.userid userinfo.mobile userinfo.regdate    userinfo.dt
001 13551111111 2020-03-01  2020-06-20
002 13561111111 2020-04-01  2020-06-20
003 13571111111 2020-05-01  2020-06-20
004 13581111111 2020-06-01  2020-06-20

hive (default)> show partitions test.userinfo;
partition
dt=2020-06-20
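
The repetition described above can be sketched in plain Python by generating one load statement per partition. This is only an illustration: the helper `static_load_statements` is hypothetical, and the file names for 0621 to 0623 are assumed to follow the userinfo0620.dat pattern; the article only creates the 0620 file.

```python
def static_load_statements(dates, data_dir="/data/yanqidw/data", table="test.userinfo"):
    """Build one LOAD DATA statement per partition date (YYYY-MM-DD)."""
    stmts = []
    for dt in dates:
        # File suffix is month + day, mirroring userinfo0620.dat (assumed pattern).
        suffix = dt[5:7] + dt[8:10]
        stmts.append(
            f"load data local inpath '{data_dir}/userinfo{suffix}.dat' "
            f"into table {table} partition(dt='{dt}');"
        )
    return stmts


dates = ["2020-06-20", "2020-06-21", "2020-06-22", "2020-06-23"]
for stmt in static_load_statements(dates):
    print(stmt)
```

Four dates yield four statements and four separate input files, which is the overhead that dynamic partition loading avoids.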

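### --- Clean up the table environment
~~~     Drop and recreate the user information table so that it is empty before the dynamic partition load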

DROP TABLE IF EXISTS test.userinfo;

CREATE TABLE test.userinfo(
userid STRING COMMENT 'user ID',
mobile STRING COMMENT 'mobile number',
regdate STRING COMMENT 'registration date')
COMMENT 'user information'
PARTITIONED BY (dt string)
row format delimited fields terminated by ',';

IV. Loading data with dynamic partitions

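### --- Create the staging table
~~~     Dynamic partition loading: partition values are not fixed; they are determined by the input data
~~~     Create a staging table (non-partitioned)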

hive (default)> drop table if exists test.tmp1;

hive (default)> create table test.tmp1 as
select * from test.userinfo;

~~~     # tmp1 is a non-partitioned table that uses the system default field delimiter '\001', so change it to ',' to match the data file

hive (default)> alter table test.tmp1 set serdeproperties('field.delim'=',');
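
Why the delimiter has to change can be shown with a rough plain-Python analogy (this is not how Hive's SerDe is implemented): a comma-separated line contains no '\001' character, so splitting on the default delimiter leaves the whole line in a single field.

```python
line = "001,13551111111,2020-03-01,2020-06-20"

# Default delimiter Ctrl-A ('\001'): the line contains none, so nothing is split.
default_fields = line.split("\x01")

# After switching the delimiter to ',', the four fields line up with the columns.
comma_fields = line.split(",")

print(len(default_fields), len(comma_fields))
print(comma_fields)
```

With the default delimiter the load would still succeed, but the columns would not line up with the file, so the delimiter is changed before loading.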

### --- Load data into the staging table
~~~     # Load the data file into the staging table

hive (default)> load data local inpath 
'/data/yanqidw/data/userinfo.dat' 
into table test.tmp1;

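### --- Insert data from the staging table into the partitioned table
~~~     Load data from the staging table into the partitioned table
~~~     Switch the current mode to nonstrict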

hive (default)> set hive.exec.dynamic.partition.mode=nonstrict;

hive (default)> insert into table test.userinfo
partition(dt)
select  * from test.tmp1;
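
In a dynamic partition insert, Hive takes the partition value from the trailing column of the SELECT by position, so the fourth field of each tmp1 row decides which dt partition the row lands in. The plain-Python sketch below simulates that routing on three rows from the sample file; `route_to_partitions` is a hypothetical helper, not a Hive API.

```python
from collections import defaultdict


def route_to_partitions(lines):
    """Group comma-separated rows by their last field, as partition(dt) would."""
    partitions = defaultdict(list)
    for line in lines:
        userid, mobile, regdate, dt = line.split(",")
        # The partition column is stored in the directory name, not in the row.
        partitions[dt].append((userid, mobile, regdate))
    return dict(partitions)


sample = [
    "001,13551111111,2020-03-01,2020-06-20",
    "002,13562222222,2020-04-01,2020-06-21",
    "005,13552222222,2020-06-21,2020-06-21",
]

for dt, rows in sorted(route_to_partitions(sample).items()):
    print(f"dt={dt}: {len(rows)} row(s)")
```

Running `show partitions test.userinfo;` after the insert is a quick way to confirm that one partition exists per distinct dt value in the file.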

V. Parameter notes for the dimension zipper table

### --- Parameters related to dynamic partitioning
~~~     # Parameters related to dynamic partitioning

hive.exec.dynamic.partition
Default Value: false prior to Hive 0.9.0; true in Hive 0.9.0 and later
Added In: Hive 0.6.0
Whether or not to allow dynamic partitions in DML/DDL.
# Enables the dynamic partition feature

~~~     strict: at least one partition must be static
~~~     nonstrict: all partitions may be dynamic

hive.exec.dynamic.partition.mode
Default Value: strict
Added In: Hive 0.6.0
In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions. In nonstrict mode all partitions are allowed to be dynamic.
Set to nonstrict to support INSERT ... VALUES, UPDATE, and DELETE transactions (Hive 0.14.0 and later).
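
The strict/nonstrict rule can be modeled in a few lines of plain Python. This is a sketch of the rule as described, not Hive's own validation code; `check_partition_spec` and the `year` column are hypothetical.

```python
def check_partition_spec(spec, mode="strict"):
    """Return True if the partition spec is allowed under `mode`.

    `spec` maps each partition column to a static value, or to None
    when the column is dynamic.
    """
    if mode not in ("strict", "nonstrict"):
        raise ValueError(f"unknown mode: {mode}")
    has_static = any(value is not None for value in spec.values())
    return mode == "nonstrict" or has_static


# partition(dt) with dt fully dynamic, as in the insert used in this article.
print(check_partition_spec({"dt": None}, "strict"))
print(check_partition_spec({"dt": None}, "nonstrict"))
# A hypothetical two-level spec: partition(year='2020', dt).
print(check_partition_spec({"year": "2020", "dt": None}, "strict"))
```

Because test.userinfo has dt as its only partition column and that column is dynamic, the insert in this article needs nonstrict mode.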

~~~     The maximum number of dynamic partitions a single dynamic partition statement may create; exceeding it raises an error

hive.exec.max.dynamic.partitions
Default Value: 1000
Added In: Hive 0.6.0
Maximum number of dynamic partitions allowed to be created in total.

~~~     The maximum number of dynamic partitions each mapper / reducer may create; the default is 100, and exceeding it raises an error.

hive.exec.max.dynamic.partitions.pernode
Default Value: 100
Added In: Hive 0.6.0
Maximum number of dynamic partitions allowed to be created in each mapper/reducer node.

~~~     The maximum number of files a single MR job may create; exceeding it raises an error.

hive.exec.max.created.files
Default Value: 100000
Added In: Hive 0.7.0
Maximum number of HDFS files created by all mappers/reducers in a MapReduce job.
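
As a rough pre-flight check, the two partition limits above can be compared against the partition values a load is expected to produce. The plain-Python sketch below is a simplified model, not Hive's enforcement logic: it assumes the caller already knows which partition values each mapper/reducer will write, and `exceeded_limits` is a hypothetical helper.

```python
DEFAULT_LIMITS = {
    "hive.exec.max.dynamic.partitions": 1000,
    "hive.exec.max.dynamic.partitions.pernode": 100,
}


def exceeded_limits(partition_values_per_task, limits=None):
    """Return a list of messages for every limit the planned load would exceed.

    `partition_values_per_task` holds one iterable of partition values
    per mapper/reducer (a simplifying assumption of this sketch).
    """
    limits = DEFAULT_LIMITS if limits is None else limits
    per_node = limits["hive.exec.max.dynamic.partitions.pernode"]
    total_max = limits["hive.exec.max.dynamic.partitions"]

    problems = []
    all_values = set()
    for i, values in enumerate(partition_values_per_task):
        distinct = set(values)
        all_values |= distinct
        if len(distinct) > per_node:
            problems.append(f"task {i}: {len(distinct)} partitions > pernode limit {per_node}")
    if len(all_values) > total_max:
        problems.append(f"total: {len(all_values)} partitions > limit {total_max}")
    return problems


# The sample file has four distinct dt values, all written by one task here.
dts = ["2020-06-20", "2020-06-21", "2020-06-22", "2020-06-23"]
print(exceeded_limits([dts]))
```

The four partitions in this article are far below the defaults; the limits matter mainly when a single statement backfills a long date range.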
