[Tech Share] Building a Data Warehouse

1. Introduction and Preparation

The previous article covered the basic concepts of data warehousing. This article follows the sales-order example in 《Hadoop构建数据仓库实践》 to build a simple data warehouse: the source database is created in MySQL, and the multidimensional warehouse is built in Hive. The source database contains only three tables: product, customer, and sales order, as shown below:

[Figure: source system schema]

The product table contains: product code (primary key), product name, and product category.

The customer table contains: customer number (primary key), customer name, street address, zip code, city, and state.

The sales order table contains: order number (primary key), customer number (foreign key), product code (foreign key), order date, entry date, and order amount.

The data warehouse is modeled around the product, customer, date, and order dimensions, as shown below:

[Figure: data warehouse schema]

Apart from the date dimension, every dimension adds a surrogate key, version number, effective date, and expiry date on top of the source columns, so the history of dimension changes can be tracked.

This article uses Hive as the data warehouse, running on a containerized hadoop + hive + mysql deployment; see here for the setup. A few notes on starting and stopping the environment:

Startup:

1. Start the VM and check that Docker is running with systemctl status docker;

2. Open Portainer's container management page at http://<VM IP>:9000/ and start the four containers Master, Slave1, Slave2, and mysql_hive;

3. Open a shell in the Master container, run source /etc/profile to set the environment variables, then go to /usr/local/hadoop/sbin and run ./start-all.sh to start the Hadoop cluster. After Hadoop starts, run jps and confirm that datanode, namenode, nodemanager, secondarynamenode, and resourcemanager are all running;

4. Run hive to open the Hive CLI and verify that Hive starts normally;

5. Run nohup hiveserver2 >/dev/null 2>/dev/null & to start HiveServer2;

6. Run netstat -ntulp | grep 10000 to check that port 10000 is being listened on (this may take about two minutes); once it is, you can connect to Hive with SQL Developer.

Note that steps 4 and 5 may require running source /etc/profile again before the environment variables take effect. The whole sequence is collapsed into a shell sketch below.
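This is a minimal sketch of the startup sequence, assuming the container names, paths, and port described above, and assuming the containers are started with docker directly rather than through the Portainer UI; adjust it to your own deployment.

# on the VM: confirm docker is up, then start the four containers
systemctl status docker
docker start Master Slave1 Slave2 mysql_hive

# inside the Master container
docker exec -it Master bash
source /etc/profile                     # may need to be repeated before hive/hiveserver2
/usr/local/hadoop/sbin/start-all.sh     # start the hadoop cluster
jps                                     # expect namenode, datanode, nodemanager, secondarynamenode, resourcemanager
hive -e "show databases;"               # quick check that hive works
nohup hiveserver2 >/dev/null 2>/dev/null &
netstat -ntulp | grep 10000             # wait until port 10000 is listening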

Shutdown:

1. In Master, run ./stop-all.sh under /usr/local/hadoop/sbin;

2. In Portainer's container management page, stop the four containers Master, Slave1, Slave2, and mysql_hive;

3. On the VM, run shutdown -h now. The same steps as a shell sketch:
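Again a sketch under the same assumptions as the startup script above:

docker exec Master bash -c "source /etc/profile && /usr/local/hadoop/sbin/stop-all.sh"
docker stop Master Slave1 Slave2 mysql_hive
shutdown -h now   # on the VM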

2. Building the Source ODS Layer

The ODS layer uses MySQL and shares the same MySQL instance as Hive's metastore database.

drop database if exists source;

create database source;

use source;

create table customer (
  customer_number int not null auto_increment primary key comment 'customer number, primary key',
  customer_name varchar(50) comment 'customer name',
  customer_street_address varchar(50) comment 'street address',
  customer_zip_code int comment 'zip code',
  customer_city varchar(30) comment 'city',
  customer_state varchar(2) comment 'state'
);

create table product (
  product_code int not null auto_increment primary key comment 'product code, primary key',
  product_name varchar(30) comment 'product name',
  product_category varchar(30) comment 'product category'
);

create table sales_order (
  order_number int not null auto_increment primary key comment 'order number, primary key',
  customer_number int comment 'customer number',
  product_code int comment 'product code',
  order_date datetime comment 'order date',
  entry_date datetime comment 'entry date',
  order_amount decimal(10,2) comment 'order amount',
  foreign key (customer_number) references customer (customer_number) on delete cascade on update cascade,
  foreign key (product_code) references product (product_code) on delete cascade on update cascade
);

insert into customer (customer_name, customer_street_address, customer_zip_code, customer_city, customer_state) values
  ('really large customers', '7500 louise dr.', 17050, 'mechanicsburg', 'pa'),
  ('small stores', '2500 woodland st.', 17055, 'pittsburgh', 'pa'),
  ('medium retailers', '1111 ritter rd.', 17055, 'pittsburgh', 'pa'),
  ('good companies', '9500 scott st.', 17050, 'mechanicsburg', 'pa'),
  ('wonderful shops', '3333 rossmoyne rd.', 17050, 'mechanicsburg', 'pa'),
  ('loyal clients', '7070 ritter rd.', 17055, 'pittsburgh', 'pa'),
  ('distinguished partners', '9999 scott st.', 17050, 'mechanicsburg', 'pa');

insert into product (product_name,product_category) values ('hard disk drive', 'storage'), ('floppy drive', 'storage'), ('lcd panel', 'monitor');

drop procedure if exists generate_sales_order_data;

delimiter //
create procedure generate_sales_order_data() 
begin
drop table if exists temp_sales_order_data; 
create table temp_sales_order_data as select * from sales_order where 1=0;
set @start_date := unix_timestamp('2016-03-01');
set @end_date := unix_timestamp('2016-07-01');
set @i := 1;
while @i<=100 do 
set @customer_number := floor (1 + rand() * 6);
set @product_code := floor (1 + rand() * 2);
set @order_date := from_unixtime(@start_date + rand() * (@end_date - @start_date));
set @amount := floor (1000 + rand() * 9000);
insert into temp_sales_order_data values (@i,@customer_number,@product_code,@order_date,@order_date,@amount); 
set @i:=@i+1;
end while; 
truncate table sales_order; 
insert into sales_order select null ,customer_number,product_code,order_date,entry_date,order_amount from temp_sales_order_data order by order_date; 
commit;
end //

delimiter ;

call generate_sales_order_data();

select count(*) from sales_order;

The order data is generated by a stored procedure that builds 100 rows of mock data; see here for background on MySQL stored procedures.
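Beyond the row count, a couple of quick queries (a sketch against the tables defined above) help confirm that the generated orders look reasonable:

-- date range and amount range of the generated orders
select min(order_date), max(order_date), min(order_amount), max(order_amount) from sales_order;

-- number of orders per customer
select customer_number, count(*) as order_count
from sales_order
group by customer_number
order by customer_number;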

3. Building the Staging Layer (RDS)

The RDS layer is a Hive database whose table structures match the ODS tables; it serves as the staging (near-source) layer.

drop database if exists rds cascade ;

create database rds; 

use rds;

create table customer (
  customer_number int comment 'number',
  customer_name varchar(30) comment 'name',
  customer_street_address varchar(30) comment 'address',
  customer_zip_code int comment 'zipcode',
  customer_city varchar(30) comment 'city',
  customer_state varchar(2) comment 'state'
);

create table product (
  product_code int comment 'code',
  product_name varchar(30) comment 'name',
  product_category varchar(30) comment 'category'
);

create table sales_order (
  order_number int comment 'order number',
  customer_number int comment 'customer number',
  product_code int comment 'product code',
  order_date timestamp comment 'order date',
  entry_date timestamp comment 'entry date',
  order_amount decimal(10,2) comment 'order amount'
);

4. Building the Analysis Layer (TDS)

drop database if exists dw cascade ;

create database dw;

use dw;

create table date_dim (
  date_sk int comment 'surrogate key',
  `date` date comment 'date, yyyy-mm-dd',
  month tinyint comment 'month',
  month_name varchar(9) comment 'month name',
  quarter tinyint comment 'quarter',
  year smallint comment 'year'
)
comment 'date dimension table'
row format delimited fields terminated by ','
stored as textfile;

create table customer_dim (
  customer_sk int comment 'surrogate key',
  customer_number int comment 'number',
  customer_name varchar(50) comment 'name',
  customer_street_address varchar(50) comment 'address',
  customer_zip_code int comment 'zipcode',
  customer_city varchar(30) comment 'city',
  customer_state varchar(2) comment 'state',
  version int comment 'version',
  effective_date date comment 'effective date',
  expiry_date date comment 'expiry date'
)
clustered by (customer_sk) into 8 buckets
stored as orc tblproperties ('transactional'='true');

create table product_dim (
  product_sk int comment 'surrogate key',
  product_code int comment 'code',
  product_name varchar(30) comment 'name',
  product_category varchar(30) comment 'category',
  version int comment 'version',
  effective_date date comment 'effective date',
  expiry_date date comment 'expiry date'
)
clustered by (product_sk) into 8 buckets
stored as orc tblproperties ('transactional'='true');

create table order_dim (
  order_sk int comment 'surrogate key',
  order_number int comment 'number',
  version int comment 'version',
  effective_date date comment 'effective date',
  expiry_date date comment 'expiry date'
)
clustered by (order_sk) into 8 buckets
stored as orc tblproperties ('transactional'='true');

create table sales_order_fact (
  order_sk int comment 'order surrogate key',
  customer_sk int comment 'customer surrogate key',
  product_sk int comment 'product surrogate key',
  order_date_sk int comment 'date surrogate key',
  order_amount decimal(10,2) comment 'order amount'
)
clustered by (order_sk) into 8 buckets
stored as orc tblproperties ('transactional'='true');

Note that when creating the date dimension table, the column name date clashes with a Hive reserved word, so it has to be wrapped in backticks (the ` character below the Esc key); see here. Populating the date dimension also requires a stored procedure, as sketched below.
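Hive has no stored procedures of its own, so one workable approach (a sketch only, not the book's exact script; the helper table and procedure names are illustrative) is to generate one row per day in the source MySQL instance, much like generate_sales_order_data above, export the rows as a comma-delimited file, and load that file into dw.date_dim:

-- mysql: helper table matching the layout of dw.date_dim
drop table if exists date_dim_data;
create table date_dim_data (date_sk int, `date` date, month tinyint, month_name varchar(9), quarter tinyint, year smallint);

drop procedure if exists generate_date_dim_data;
delimiter //
create procedure generate_date_dim_data(in p_start date, in p_end date)
begin
  set @sk := 1;
  set @d := p_start;
  while @d <= p_end do
    insert into date_dim_data values (@sk, @d, month(@d), monthname(@d), quarter(@d), year(@d));
    set @sk := @sk + 1;
    set @d := date_add(@d, interval 1 day);
  end while;
end //
delimiter ;

call generate_date_dim_data('2016-01-01', '2016-12-31');

-- export date_dim_data as a comma-delimited csv, then load it in hive, for example:
-- load data local inpath '/path/to/date_dim.csv' into table dw.date_dim;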

5. Next Steps

With all the warehouse tables in place, the next step is to use an ETL tool (such as Sqoop) to load the source-system data into RDS, and then transform and load it into the dimension and fact tables with HiveQL, completing the data flow through the warehouse; for details see 《Hadoop构建数据仓库实践》. A hedged Sqoop example is sketched below.
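As a rough preview, a full load of one source table into RDS with Sqoop might look like the following; the MySQL host, credentials, and mapper count are placeholders for whatever your environment uses, not commands taken from the book:

sqoop import \
  --connect jdbc:mysql://<mysql-host>:3306/source \
  --username <user> --password <password> \
  --table customer \
  --hive-import --hive-table rds.customer \
  --hive-overwrite \
  --num-mappers 1

The same pattern applies to product and sales_order; the HiveQL that then moves the data from RDS into the dw dimension and fact tables is the transformation step mentioned above.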
