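3 Project Environment Initialization
3.1 Hive Layering
- Each layer is stored in its own database:
  - ods layer
  - dw layer
  - ads layer
- Naming rules:
  - ods-layer tables keep the same names as the tables in the source database
  - dw-layer tables:
    - the fact_ prefix marks a fact table
    - the dim_ prefix marks a dimension table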
Create one database per layer:
-- Run in the Hive CLI
create database itcast_ods;
create database itcast_dw;
create database itcast_ads;
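If the initialization may be run more than once, an idempotent variant can help. The following is a minimal sketch that reuses the three database names above; the `if not exists` clause and the verification query are additions for illustration, not part of the original script.

```sql
-- Idempotent variant: safe to re-run if the databases already exist
create database if not exists itcast_ods;
create database if not exists itcast_dw;
create database if not exists itcast_ads;

-- List the matching databases to confirm all three layers were created
show databases like 'itcast_*';
```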
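3.2 Creating the ods-Layer Tables
- Hive tables are either external or managed (internal). For ease of management, this section uses managed tables throughout. The difference between the two is whether the underlying data is actually deleted when the table is dropped. In practice the ods layer usually uses external tables, because those tables are shared by every department and their data must not be deleted casually.
Run "ods层建表语句业务数据.sql" (the ods-layer DDL script for the business data).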
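To make the managed/external distinction concrete, here is a rough sketch of what such a table definition might look like. It is not taken from the DDL script: the table names `demo_orders_managed` and `demo_orders_external`, the two columns, and the HDFS path are all hypothetical. The `dt` partition column, Parquet storage, and Snappy compression follow what the later extraction steps in this chapter rely on.

```sql
-- Hypothetical sketch only: the real column lists live in the DDL script named above.
-- Managed (internal) table: DROP TABLE deletes both the metadata and the data files.
create table if not exists itcast_ods.demo_orders_managed (
    orderid     bigint,
    createtime  string
)
partitioned by (dt string)
stored as parquet
tblproperties ('parquet.compression' = 'SNAPPY');

-- External table: DROP TABLE removes only the metadata; the files stay on HDFS.
-- The location below is a made-up path for illustration.
create external table if not exists itcast_ods.demo_orders_external (
    orderid     bigint,
    createtime  string
)
partitioned by (dt string)
stored as parquet
location '/tmp/demo/itcast_ods/demo_orders_external'
tblproperties ('parquet.compression' = 'SNAPPY');
```

Running `describe formatted` on either table shows its Table Type as MANAGED_TABLE or EXTERNAL_TABLE, which is a quick way to check how an existing table was created.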
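3.3 Full Data Extraction into the ods Layer
Steps:
1. Drag components onto the canvas to build the Kettle job
2. In the transformation, configure the named parameters
3. Configure the Hive SQL script
-- Add the following statement when re-inserting data
-- set hive.msck.path.validation=ignore;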
msck repair table itcast_ods.itcast_orders;
msck repair table itcast_ods.itcast_goods;
msck repair table itcast_ods.itcast_order_goods;
msck repair table itcast_ods.itcast_shops;
msck repair table itcast_ods.itcast_goods_cats;
msck repair table itcast_ods.itcast_org;
msck repair table itcast_ods.itcast_order_refunds;
msck repair table itcast_ods.itcast_users;
msck repair table itcast_ods.itcast_user_address;
msck repair table itcast_ods.itcast_payments;
4. Configure the Table Input step
SELECT
*
FROM itcast_orders
WHERE DATE_FORMAT(createtime, '%Y%m%d') <= '${dt}';
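5. In the Select Values step, specify the date format; then configure Parquet output with Snappy compression
Configure the output file location
Configure the output file content format
Check that all the data was loaded correctly: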
select * from itcast_ods.itcast_orders limit 2;
select * from itcast_ods.itcast_goods limit 2;
select * from itcast_ods.itcast_order_goods limit 2;
select * from itcast_ods.itcast_shops limit 2;
select * from itcast_ods.itcast_goods_cats limit 2;
select * from itcast_ods.itcast_org limit 2;
select * from itcast_ods.itcast_order_refunds limit 2;
select * from itcast_ods.itcast_users limit 2;
select * from itcast_ods.itcast_user_address limit 2;
select * from itcast_ods.itcast_payments limit 2;
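Sampling rows confirms that data is readable, but it does not show which partitions Hive knows about. The `msck repair table` statements in step 3 are there because Kettle writes the Parquet files straight to HDFS, so the new partition directories have to be registered in the metastore before queries can see them. A small sketch for checking this follows; the partition value `20190909` is a hypothetical run date, not one given in the text.

```sql
-- List the partitions currently registered in the metastore for one table
show partitions itcast_ods.itcast_orders;

-- Manual alternative to msck repair for a single partition:
-- registers the default directory .../itcast_orders/dt=20190909 if it is not yet known.
-- '20190909' is an assumed example date.
alter table itcast_ods.itcast_orders add if not exists partition (dt = '20190909');
```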
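Notes:
- 1: itcast_orders, itcast_order_goods, and itcast_order_refunds are extracted by time; all other tables are extracted in full.
- 2: When using the Hadoop File Output step, change the format of date fields to UTF8; in the Parquet output's Fields settings, change date-typed fields to UTF8 as well.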
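The two extraction modes in note 1 differ only in the Table Input query. The sketch below contrasts them. It assumes that itcast_order_goods has a `createtime` column like itcast_orders does, which the text does not confirm; `${dt}` is the named parameter configured in step 2.

```sql
-- Time-based extraction (itcast_orders, itcast_order_goods, itcast_order_refunds):
-- keep rows created on or before the run date passed in as ${dt}.
-- The createtime column on this table is an assumption.
SELECT *
FROM itcast_order_goods
WHERE DATE_FORMAT(createtime, '%Y%m%d') <= '${dt}';

-- Full extraction (all remaining tables): no date filter at all
SELECT *
FROM itcast_goods;
```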
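3.4 Incremental Data Extraction into the ods Layer
Incremental extraction works like full extraction, except that each run extracts only the previous day's data.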
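One plausible way to express this in the Table Input step is to change the comparison from `<=` to `=`, so that only rows from the single day in `${dt}` are selected. This is a sketch under that assumption, not the query from the course material; it also assumes `${dt}` is set to the previous day in `yyyyMMdd` form.

```sql
-- Incremental variant of the Table Input query: only rows created on the day in ${dt}
SELECT *
FROM itcast_orders
WHERE DATE_FORMAT(createtime, '%Y%m%d') = '${dt}';

-- How the previous day's value could be derived in MySQL, in the same yyyyMMdd format
SELECT DATE_FORMAT(DATE_SUB(CURDATE(), INTERVAL 1 DAY), '%Y%m%d') AS dt;
```

Filtering on `createtime` alone picks up new rows only. Rows that were modified after creation would need an additional condition on a modification timestamp, if the source table has one.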
Test SQL statements:
-- Query orders and the other ods tables for the 20190910 partition
select * from itcast_ods.itcast_orders where dt='20190910' limit 2;
select * from itcast_ods.itcast_goods where dt='20190910' limit 2;
select * from itcast_ods.itcast_order_goods where dt='20190910' limit 2;
select * from itcast_ods.itcast_shops where dt='20190910' limit 2;
select * from itcast_ods.itcast_goods_cats where dt='20190910' limit 2;
select * from itcast_ods.itcast_org where dt='20190910' limit 2;
select * from itcast_ods.itcast_order_refunds where dt='20190910' limit 2;
select * from itcast_ods.itcast_users where dt='20190910' limit 2;
select * from itcast_ods.itcast_user_address where dt='20190910' limit 2;
select * from itcast_ods.itcast_payments where dt='20190910' limit 2;
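A per-partition row count is a useful complement to the samples above, since it shows at a glance whether the incremental run added a new `dt` partition and how large it is relative to earlier ones. A minimal sketch for one table follows; the same query shape applies to the others.

```sql
-- Row count per extraction date; a successful incremental run adds a new dt row
select dt, count(*) as row_cnt
from itcast_ods.itcast_orders
group by dt;
```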