7.1.1智慧物流【数据采集（sqoop）、ETL、数据导出】

最新推荐文章于 2023-07-03 07:42:48 发布

weixin_47134119

最新推荐文章于 2023-07-03 07:42:48 发布

阅读量2k

点赞数 7

文章标签： etl

本文链接：https://blog.csdn.net/weixin_47134119/article/details/118280302

版权

智慧物流

一、项目整体架构

在这里插入图片描述
整个项目主要分为两部分

基于数仓历史数据仓储预测和车辆智能调度
车辆运行数据实时监控分析

技术框架选择：

CDH5.14.0
数据采集：Sqoop(1.4.6)
数仓：Hive(Hive on Spark):1.1.0-cdh5.14.0 ,Hadoop(HDFS,YARN):2.6.0-cdh5.14.0
机器学习库：Spark Mlib :默认1.6，升级到2.4
语言：Java，Scala
消息系统：kafka：1.0.1
缓存数据库：Redis：3.2
大数据数据库:Hbase：1.2.0-cdh5.14.0
实时处理引擎：Spark StructedStreaming:2.4.0
Web系统：SpringCloud

二、数据采集

第 1 节数据说明

智慧物流项目挑选出两个重要的场景，一个仓储预测，另一个是车货匹配。
其中仓储预测使用机器学习算法LightGBM;智能调度则使用动态规划算法。

算法开发的一般流程：
在这里插入图片描述
不管使用何种算法来解决问题，重要的都是我们能拿到的数据，只有充分利用现有的数据进行数据分析
和处理，选择合适的算法才能获取到比较好的结果。在一个成熟的企业中对于算法的选择和实现以及评
价调优都是由专门的算法工程师或者机器学习工程师来负责。我们大数据部门工程师主要负责提供满足
算法部门要求的数据。我们以仓储预测场景来梳理一个算法的开发流程。

大体流程
在这里插入图片描述
Mysql数据库：lg_orders(订单表),lg_items(商品表)，lg_item_cats(商品分类表),lg_enterports(仓库表)
数据采集：使用Sqoop把数据同步到Hive数仓中
数仓ETL：对数据统一化，清洗，预处理
数据特征工程：Python/Scala
训练模型：SparkMlib
模型预测：SparkMlib
模型评价：损失函数
模型优化：调整

第 2 节业务数据

根据算法部门的要求，仓储预测模型需要提供4类数据
sales_train;items;item_categories;entreports

sales_train
表示的是销售数据，有日期，月份,仓库，商品，价格和日销售量
统计出的是每个仓库每个商品的日销量数据，可以来自于订单表(lg_orders);同步到数仓中然后统计出
指标数据；数据库是mysql；
items
表示的是商品信息，有商品名称，商品id,商品分类id
可以来自商品表，数据库：mysql
entreports
仓库数据集，
仓库id,仓库名称(地址信息)
数据库：mysql
item_category
来源商品分类表，其中有商品分类id以及分类的名称
数据来自表，来源是mysql.

选择linux123上的Mysql模拟业务数据库
创建lg_lgstic数据库

mysql> create database lg_logstic;
Query OK, 1 row affected (0.01 sec)
mysql> use lg_logstic;
Query OK, 1 row affected (0.01 sec)
#上传sql脚本文件到root/mysql_logstic，source执行
mysql> source /root/mysql_logstic/lg_logstic.sql;
Query OK, 1 row affected (0.01 sec)

数据库中表如下：
在这里插入图片描述

第 3 节同步数据到Hive

数仓分层实现
数仓主要分为四层：

在Hive中分别创建四个数据库对应数仓的四层

create database lg_ods;
create database lg_dwd;
create database lg_dws;
create database lg_ads;

对于数据采集来说分为两部分，

一部分是离线数据采集，主要是Mysql中相关业务数据的采集，
一部分是实时数据采集，主要是车辆行驶相关数据采集。

3.1 离线数据采集

对于以上业务数据的采集来说，继续沿用离线数仓的数据采集方式，也就是针对不同的表类型选择不同的同步方式。

共有6张表

lg_orders,
lg_order_entrepot,
lg_order_items,
lg_items,
lg_entrepots,
lg_item_cats

事实表：lg_orders,lg_order_items，lg_order_entrepot
维度表：lg_items,lg_entrepots,lg_item_cats
对于lg_orders,lg_items,lg_entrepots,lg_item_cats在数仓中主要以拉链表方式保存；

Hive创建ODS层表

-- 创建ODS层商品分类表
drop table if exists `lg_ods`.`lg_item_cats`;
create table `lg_ods`.`lg_item_cats`(
catId bigint,
parentId bigint,
catName string,
isShow bigint,
isFloor bigint,
catSort bigint,
dataFlag bigint,
createTime string,
commissionRate double,
catImg string,
subTitle string,
simpleName string,
seoTitle string,
seoKeywords string,
seoDes string,
catListTheme string,
detailTheme string,
mobileCatListTheme string,
mobileDetailTheme string,
wechatCatListTheme string,
wechatDetailTheme string,
cat_level bigint,
modifyTime string
)
partitioned by (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

-- 创建ODS层商品表
drop table if exists `lg_ods`.`lg_items`;
create table `lg_ods`.`lg_items`(
itemsId bigint,
itemsSn string,
productNo string,
itemsName string,
itemsImg string,
entrepotId bigint,
itemsType bigint,
marketPrice double,
entrepotPrice double,
warnStock bigint,
itemsStock bigint,
itemsUnit string,
itemsTips string,
isSale bigint,
isBest bigint,
isHot bigint,
isNew bigint,
isRecom bigint,
itemsCatIdPath string,
itemsCatId bigint,
entrepotCatId1 bigint,
entrepotCatId2 bigint,
brandId bigint,
itemsDesc string,
itemsStatus bigint,
saleNum bigint,
saleTime string,
visitNum bigint,
appraiseNum bigint,
isSpec bigint,
gallery string,
itemsSeoKeywords string,
illegalRemarks string,
dataFlag bigint,
createTime string,
isFreeShipping bigint,
itemsSerachKeywords string,
modifyTime string
)
partitioned by (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

--创建ods层订单仓库关联表
drop table if exists `lg_ods`.`lg_order_entrepot`;
CREATE TABLE `lg_ods`.`lg_order_entrepot` (
oeId bigint,
orderId bigint,
itemId bigint,
itemNums bigint,
itemName string,
entrepotId int,
userName string,
userAddress string,
promotionJson string,
createtime string,
modifyTime string
) partitioned by (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

-- 创建ODS层仓库表
drop table if exists `lg_ods`.`lg_entrepots`;
CREATE TABLE `lg_ods`.`lg_entrepots` (
entrepotId bigint,
areaId bigint,
entrepotName string,
entrepotkeeper string,
telephone string,
entrepotImg string,
entrepotTel string,
entrepotQQ string,
entrepotAddress string,
invoiceRemarks string,
serviceStartTime bigint,
serviceEndTime bigint,
freight bigint,
entrepotAtive int,
entrepotStatus int,
statusDesc string,
dataFlag int,
createTime string ,
modifyTime string
) partitioned by (dt string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

-- 创建ODS层订单表
drop table if exists `lg_ods`.`lg_orders`;
CREATE TABLE lg_ods.lg_orders (
orderId bigint,
orderNo string,
userId bigint,
orderStatus bigint,
itemsMoney double,
deliverType bigint,
deliverMoney double,
totalMoney double,
realTotalMoney double,
payType bigint,
isPay bigint,
areaId bigint,
userAddressId bigint,
areaIdPath string,
userName string,
userAddress string,
userPhone string,
orderScore bigint,
isInvoice bigint,
invoiceClient string,
orderRemarks string,
orderSrc bigint,
needPay double,
payRand bigint,
orderType bigint,
isRefund bigint,
isAppraise bigint,
cancelReason bigint,
rejectReason bigint,
rejectOtherReason string,
isClosed bigint,
itemsSearchKeys string,
orderunique string,
isFromCart string,
receiveTime string,
deliveryTime string,
tradeNo string,
dataFlag bigint,
createTime string,
settlementId bigint,
commissionFee double,
scoreMoney double,
useScore bigint,
orderCode string,
extraJson string,
orderCodeTargetId bigint,
noticeDeliver bigint,
invoiceJson string,
lockCashMoney double,
payTime string,
isBatch bigint,
totalPayFee bigint,
modifyTime string
) partitioned by (dt string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','STORED AS TEXTFILE;

-- 创建ODS层订单明细表
drop table if exists `lg_ods`.`lg_order_items`;
create table `lg_ods`.`lg_order_items`(
ogId bigint,
orderId bigint,
itemsId bigint,
itemsNum bigint,
itemsPrice double,
payPrice double,
itemsSpecId bigint,
itemsSpecNames string,
itemsName string,
itemsImg string,
extraJson string,
itemsType bigint,
commissionRate double,
itemsCode string,
promotionJson string,
createtime string
)partitioned by (dt string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

使用sqoop实现采集；
在这里插入图片描述
sqoop的使用回顾[安装见 3.2.5 数据迁移工具 – Sqoop]

--query
sqoop在导入数据时，可以使用--query搭配sql来指定查询条件，并且还需在sql中添加$CONDITIONS，来实现并行运行mr的功能。

如果使用–query但是不加$CONDITIONS会报错。如下

ERROR tool.ImportTool: Import failed: java.io.IOException: Query [select * from Person where score>50] must contain '$CONDITIONS' in WHERE clause.

指定多个mapper导入数据；-m设置多个任务，比如 -m 10 注意必须加上-split-by参数用来指定如何划分数据

补充：sqoop会向关系型数据库例如mysql发送一个命令：select max(id),min(id) from test ;会把
max和min之间的区间平分10份，最后并行10个map去拉取数据。

数据源：linux123的Mysql数据库
目的地：Hive的ODS层表
编写sqoop导入任务，从mysql中抽取数据到Hive中

3.1.1 第一次全量导入

1、导入lg_orders表以lg_orders表为例：编写shell脚本
import_order_data.sh

#!/bin/bash
source /etc/profile
##如果第一个参数不为空，则作为工作日期使用
if [ -n "$1" ]
then
do_date=$1
else
##昨天日期，减一
do_date=`date -d "-1 day" +"%Y%m%d"`
fi
#定义sqoop命令位置，Hive命令位置，在hadoop2
sqoop=/opt/lagou/servers/sqoop-1.4.7/bin/sqoop
Hive=/opt/lagou/servers/hive-2.3.7/bin
#定义工作日期
#do_date=20200827
#编写导入数据通用方法 接收两个参数：第一个：表名，第二个：查询语句
import_data(){
$sqoop import \
--connect jdbc:mysql://linux123:3306/lg_logstic \
--username root \
--password 12345678 \
--target-dir /user/hive/warehouse/lg_ods.db/$1/dt=$do_date \
--delete-target-dir \
--query "$2 and \$CONDITIONS" \
--num-mappers 1 \
--fields-terminated-by ',' \
--null-string '\\N' \
--null-non-string '\\N'
}
# 全量导入订单数据方法
import_lg_orders(){
import_data lg_orders
"select
*
from lg_orders
where 1=1"
}
#调用全量导入订单数据方法
import_lg_orders
#注意sqoop导入数据的方式，对于Hive分区表来说需要执行添加分区操作，数据才能被识别到
$Hive -e "alter table lg_ods.lg_orders add partition(dt='$do_date');"

对于其它表可以仿照上面的实现编写独立脚本，也可以全部综合到一起，如下
import_all_data.sh

#!/bin/bash
source /etc/profile
##如果第一个参数不为空，则作为工作日期使用
if [ -n "$1" ]
then
do_date=$1
else
##昨天日期，减一
do_date=`date -d "-1 day" +"%Y%m%d"`
fi
#定义sqoop命令位置，Hive命令位置，在hadoop2
sqoop=/opt/lagou/servers/sqoop-1.4.7/bin/sqoop
Hive=/opt/lagou/servers/hive-2.3.7/bin
#定义工作日期
#编写导入数据通用方法 接收两个参数：第一个：表名，第二个：查询语句
import_data(){
$sqoop import \
--connect jdbc:mysql://linux123:3306/lg_logstic \
--username root \
--password 12345678 \
--target-dir /user/hive/warehouse/lg_ods.db/$1/dt=$do_date \
--delete-target-dir \
--query "$2 and \$CONDITIONS" \
--num-mappers 1 \
--fields-terminated-by ',' \
--null-string '\\N' \
--null-non-string '\\N'
}

# 全量导入订单数据方法
import_lg_orders(){
import_data lg_orders "select
*
from lg_orders
where 1=1"
}
# 全量导入订单明细数据(包含商品)方法
import_lg_order_items(){
import_data lg_order_items "select
*
from lg_order_items
where 1=1"
}
# 全量导入商品方法
import_lg_items(){
import_data lg_items "select
*
from lg_items
where 1=1"
}
# 全量导入仓库方法
import_lg_entrepots(){
import_data lg_entrepots "select
*
from lg_entrepots
where 1=1"
}
# 全量导入商品分类数据方法
import_lg_item_cats(){
import_data lg_item_cats "select
*
from lg_item_cats
where 1=1"
}
# 全量导入订单仓库关联数据方法
import_lg_order_entrepot(){
import_data lg_order_entrepot "select
*
from lg_order_entrepot
where 1=1"
}
#调用全量导入订单数据方法
import_lg_orders
#调用全量导入订单明细数据方法
import_lg_order_items
#调用全量导入商品数据方法
import_lg_items
#调用全量导入仓库数据方法
import_lg_entrepots
#调用全量导入商品分类数据方法
import_lg_item_cats
#调用全量导入订单仓库关联数据方法
import_lg_order_entrepot

#注意sqoop导入数据的方式，对于Hive分区表来说需要执行添加分区操作，数据才能被识别到
$Hive -e "alter table lg_ods.lg_orders add partition(dt='$do_date');
alter table lg_ods.lg_order_items add partition(dt='$do_date');
alter table lg_ods.lg_items add partition(dt='$do_date');
alter table lg_ods.lg_entrepots add partition(dt='$do_date');
alter table lg_ods.lg_item_cats add partition(dt='$do_date');
alter table lg_ods.lg_order_entrepot add partition(dt='$do_date');"

执行

sh import_all_data.sh 20210627

验证导入结果

select * from lg_ods.lg_orders limit 5;
select * from lg_ods.lg_order_items limit 5;
select * from lg_ods.lg_items limit 5;
select * from lg_ods.lg_entrepots limit 5;
select * from lg_ods.lg_item_cats limit 5;
select * from lg_ods.lg_order_entrepot limit 5;

3.1.2 增量导入

对于拉链表则ODS表需要每日拉取新增和更新的数据

--抽取每日新增和更新的订单数据 
SELECT * FROM lg_orders WHERE DATE_FORMAT(modifyTime, '%Y%m%d') = '${do_date}';

shell脚本文件
import_incr_data.sh

#!/bin/bash
source /etc/profile
##如果第一个参数不为空，则作为工作日期使用
if [ -n "$1" ]
then
do_date=$1
else
##昨天日期，减一
do_date=`date -d "-1 day" +"%Y%m%d"`
fi
sqoop=/opt/lagou/servers/sqoop-1.4.7/bin/sqoop
Hive=/opt/lagou/servers/hive-2.3.7/bin/hive

import_data(){
$sqoop import \
--connect jdbc:mysql://linux123:3306/lg_logstic \
--username root \
--password 12345678 \
--target-dir /user/hive/warehouse/lg_ods.db/$1/dt=$do_date \
--delete-target-dir \
--query "$2 and \$CONDITIONS" \
--num-mappers 1 \
--fields-terminated-by ',' \
--null-string '\\N' \
--null-non-string '\\N'
}
##导入新增订单数据的方法
import_lg_orders(){
import_data lg_orders "SELECT *
FROM lg_orders
WHERE DATE_FORMAT(modifyTime, '%Y%m%d') = '${do_date}' "
}
# 导入新增订单明细数据(包含商品)方法
import_lg_order_items(){
import_data lg_order_items "select
*
from lg_order_items
WHERE DATE_FORMAT(createTime, '%Y%m%d') = '${do_date}'"
}
# 导入新增和变化商品方法
import_lg_items(){
import_data lg_items "select
*
from lg_items
WHERE DATE_FORMAT(modifyTime, '%Y%m%d') = '${do_date}'"
}
# 导入新增和变化仓库方法
import_lg_entrepots(){
import_data lg_entrepots "select
*
from lg_entrepots
WHERE DATE_FORMAT(modifyTime, '%Y%m%d') = '${do_date}'"
}
# 导入新增商品分类数据方法
import_lg_item_cats(){
import_data lg_item_cats "select
*
from lg_item_cats
WHERE DATE_FORMAT(modifyTime, '%Y%m%d') = '${do_date}'"
}
# 导入新增订单仓库关联数据方法
import_lg_order_entrepot(){
import_data lg_order_entrepot "select
*
from lg_order_entrepot
WHERE DATE_FORMAT(modifyTime, '%Y%m%d') = '${do_date}'"
}
#导入新增订单明细数据方法
import_lg_order_items
#调用导入新增商品数据方法
import_lg_items
#调用导入新增仓库数据方法
import_lg_entrepots
#导入新增商品分类数据方法
import_lg_item_cats
#导入新增订单数据
import_lg_orders
#导入新增订单仓库关联数据
import_lg_order_entrepot

#执行Hive修复分区命令
$Hive -e "alter table lg_ods.lg_orders add partition(dt='$do_date');
alter table lg_ods.lg_order_items add partition(dt='$do_date');
alter table lg_ods.lg_items add partition(dt='$do_date');
alter table lg_ods.lg_entrepots add partition(dt='$do_date');
alter table lg_ods.lg_item_cats add partition(dt='$do_date');
alter table lg_ods.lg_order_entrepot add partition(dt='$do_date');"

执行

sh import_incr_data.sh 20210628

验证导入结果

select * from lg_ods.lg_orders where dt='20210628' limit 5;
select * from lg_ods.lg_order_items where dt='20210628' limit 5;
select * from lg_ods.lg_items where dt='20210628' limit 5;
select * from lg_ods.lg_entrepots where dt='20210628' limit 5;
select * from lg_ods.lg_item_cats where dt='20210628' limit 5;
select * from lg_ods.lg_order_entrepot where dt='20210628' limit 5;

三、数据ETL

在这里插入图片描述

ods–>dwd
- lg_ods.lg_orders:订单表;
- lg_ods.lg_order_items:订单明细表
- lg_order_entrepot:订单仓库关联表

订单表是周期性事实表；为保留订单状态，使用拉链表进行处理；

订单明细表是普通的事实表，不涉及到状态变化和保留；如果有数据清洗、数据转换的业务需求，使用每日增量同步到dwd层即可；

订单仓库关联表是普通的事实表，不涉及到状态变化和保留；如果有数据清洗、数据转换的业务需求，使用每日增量同步到dwd层即可；

如果没有数据清洗、数据转换的业务需求，保留在ODS，不做任何变化。

ods–>dim
- lg_ods.lg_entrepots:仓库表
- lg_ods.lg_items:商品表
- lg_ods.lg_item_cats:商品分类表

仓库表，商品表，商品分类表都是作为维度表存在，使用拉链表进行处理；
创建对应数据库

--创建DIM层数据库 
create database lg_dim;

备注：
dwd,dws,dim层表底层均以parquet格式，因为底层使用spark引擎，spark程序对parquet格式的数据处理效率更好。
与维表不同，订单事实表的记录数非常多订单有生命周期；
订单的状态不可能永远处于变化之中（订单的生命周期一般在15天左右）
订单是一个拉链表，而且是分区表；分区的目的：订单一旦终止，不会重复计算；分区的条件：订单创建日期；保证相同笔订单存储在同一个分区

第 1 节 ODS->DWD

1.1 lg_dwd.fact_orders–订单拉链表

1.1.1 创建DWD层订单表

-- 创建DWD层订单表--拉链表
drop table if exists `lg_dwd`.`fact_orders`;
CREATE TABLE lg_dwd.fact_orders (
orderId bigint,
orderNo string,
userId bigint,
orderStatus bigint,
itemsMoney double,
deliverType bigint,
deliverMoney double,
totalMoney double,
realTotalMoney double,
payType bigint,
isPay bigint,
areaId bigint,
userAddressId bigint,
areaIdPath string,
userName string,
userAddress string,
userPhone string,
orderScore bigint,
isInvoice bigint,
invoiceClient string,
orderRemarks string,
orderSrc bigint,
needPay double,
payRand bigint,
orderType bigint,
isRefund bigint,
isAppraise bigint,
cancelReason bigint,
rejectReason bigint,
rejectOtherReason string,
isClosed bigint,
itemsSearchKeys string,
orderunique string,
isFromCart string,
receiveTime string,
deliveryTime string,
tradeNo string,
dataFlag bigint,
createTime string,
settlementId bigint,
commissionFee double,
scoreMoney double,
useScore bigint,
orderCode string,
extraJson string,
orderCodeTargetId bigint,
noticeDeliver bigint,
invoiceJson string,
lockCashMoney double,
payTime string,
isBatch bigint,
totalPayFee bigint,
modifytime string,
start_date string,
end_date string
) partitioned by (dt string) STORED AS PARQUET ;

-- 创建DWD层临时订单表--拉链表
drop table if exists `lg_dwd`.`tmp_fact_orders`;
CREATE TABLE lg_dwd.tmp_fact_orders (
orderId bigint,
orderNo string,
userId bigint,
orderStatus bigint,
itemsMoney double,
deliverType bigint,
deliverMoney double,
totalMoney double,
realTotalMoney double,
payType bigint,
isPay bigint,
areaId bigint,
userAddressId bigint,
areaIdPath string,
userName string,
userAddress string,
userPhone string,
orderScore bigint,
isInvoice bigint,
invoiceClient string,
orderRemarks string,
orderSrc bigint,
needPay double,
payRand bigint,
orderType bigint,
isRefund bigint,
isAppraise bigint,
cancelReason bigint,
rejectReason bigint,
rejectOtherReason string,
isClosed bigint,
itemsSearchKeys string,
orderunique string,
isFromCart string,
receiveTime string,
deliveryTime string,
tradeNo string,
dataFlag bigint,
createTime string,
settlementId bigint,
commissionFee double,
scoreMoney double,
useScore bigint,
orderCode string,
extraJson string,
orderCodeTargetId bigint,
noticeDeliver bigint,
invoiceJson string,
lockCashMoney double,
payTime string,
isBatch bigint,
totalPayFee bigint,
modifytime string,
start_date string,
end_date string
) partitioned by (dt string) STORED AS PARQUET ;

1.1.2 订单拉链操作

第一次导入拉链表
开启Hive的动态分区，并根据数据的createtime字段进行分区划分，同一天创建的订单放在同一分区！！

#开启动态分区，默认是false
#开启允许所有分区都是动态的，否则必须要有静态分区才能使用
set hive.exec.dynamici.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

订单表数据：ODS层导入DWD层
(1) 之前全部历史数据进入拉链表

insert overwrite table lg_dwd.fact_orders partition(dt)
select
orderid ,
orderno ,
userid ,
orderstatus ,
itemsmoney ,
delivertype ,
delivermoney ,
totalmoney ,
realtotalmoney ,
paytype ,
ispay ,
areaid ,
useraddressid ,
areaidpath ,
username ,
useraddress ,
userphone ,
orderscore ,
isinvoice ,
invoiceclient ,
orderremarks ,
ordersrc ,
needpay ,
payrand ,
ordertype ,
isrefund ,
isappraise ,
cancelreason ,
rejectreason ,
rejectotherreason,
isclosed ,
itemssearchkeys ,
orderunique ,
isfromcart ,
receivetime ,
deliverytime ,
tradeno ,
dataflag ,
createtime ,
settlementid ,
commissionfee ,
scoremoney ,
usescore ,
ordercode ,
extrajson ,
ordercodetargetid,
noticedeliver ,
invoicejson ,
lockcashmoney ,
paytime ,
isbatch ,
totalpayfee ,
modifytime ,
--增加开始时间
date_format(modifytime,'yyyy-MM-dd') as start_date,
--增加结束时间
'9999-12-31' as end_date,
--指定动态分区使用的字段，动态分区的用法：就是查询字段的最后一个字段Hive表进行解析然后存入指定分区
--此次数据分区按照订单的创建时间
date_format(createtime,'yyyyMMdd')
from lg_ods.lg_orders where dt="20210627";

拉链表与每日合并

insert overwrite table lg_dwd.tmp_fact_orders partition(dt)
--新增数据的更新
select
orderid ,
orderno ,
userid ,
orderstatus ,
itemsmoney ,
delivertype ,
delivermoney ,
totalmoney ,
realtotalmoney ,
paytype ,
ispay ,
areaid ,
useraddressid ,
areaidpath ,
username ,
useraddress ,
userphone ,
orderscore ,
isinvoice ,
invoiceclient ,
orderremarks ,
ordersrc ,
needpay ,
payrand ,
ordertype ,
isrefund ,
isappraise ,
cancelreason ,
rejectreason ,
rejectotherreason,
isclosed ,
itemssearchkeys ,
orderunique ,
isfromcart ,
receivetime ,
deliverytime ,
tradeno ,
dataflag ,
createtime ,
settlementid ,
commissionfee ,
scoremoney ,
usescore ,
ordercode ,
extrajson ,
ordercodetargetid,
noticedeliver ,
invoicejson ,
lockcashmoney ,
paytime ,
isbatch ,
totalpayfee ,
modifytime ,
--增加开始时间
date_format(modifyTime,'yyyy-MM-dd') as start_date,
--增加结束时间
'9999-12-31' as end_date,
--指定动态分区使用的字段，动态分区的用法：就是查询字段的最后一个字段Hive表进行解析然后存入指定
分区
--此次数据分区按照订单的创建时间
date_format(createtime,'yyyyMMdd') as part
from lg_ods.lg_orders where dt="20200610"
union all
--历史拉链表更新数据
select
dw.orderid as orderid ,
dw.orderno as orderno ,
dw.userid as userid ,
dw.orderstatus as orderstatus ,
dw.itemsmoney as itemsmoney ,
dw.delivertype as delivertype ,
dw.delivermoney as delivermoney ,
dw.totalmoney as totalmoney ,
dw.realtotalmoney as realtotalmoney ,
dw.paytype as paytype ,
dw.ispay as ispay ,
dw.areaid as areaid ,
dw.useraddressid as useraddressid ,
dw.areaidpath as areaidpath ,
dw.username as username ,
dw.useraddress as useraddress ,
dw.userphone as userphone ,
dw.orderscore as orderscore ,
dw.isinvoice as isinvoice ,
dw.invoiceclient as invoiceclient ,
dw.orderremarks as orderremarks ,
dw.ordersrc as ordersrc ,
dw.needpay as needpay ,
dw.payrand as payrand ,
dw.ordertype as ordertype ,
dw.isrefund as isrefund ,
dw.isappraise as isappraise ,
dw.cancelreason as cancelreason ,
dw.rejectreason as rejectreason ,
dw.rejectotherreason as rejectotherreason ,
dw.isclosed as isclosed ,
dw.itemssearchkeys as itemssearchkeys ,
dw.orderunique as orderunique ,
dw.isfromcart as isfromcart ,
dw.receivetime as receivetime ,
dw.deliverytime as deliverytime ,
dw.tradeno as tradeno ,
dw.dataflag as dataflag ,
dw.createtime as createtime ,
dw.settlementid as settlementid ,
dw.commissionfee as commissionfee ,
dw.scoremoney as scoremoney ,
dw.usescore as usescore ,
dw.ordercode as ordercode ,
dw.extrajson as extrajson ,
dw.ordercodetargetid as ordercodetargetid ,
dw.noticedeliver as noticedeliver ,
dw.invoicejson as invoicejson ,
dw.lockcashmoney as lockcashmoney ,
dw.paytime as paytime ,
dw.isbatch as isbatch ,
dw.totalpayfee as totalpayfee ,
dw.modifytime as modifytime ,
dw.start_date as start_date ,
--修改end_date
case when ods.orderid is not null and dw.end_date ='9999-12-31'
then '2020-06-09'
else dw.end_date
end as end_date,
--动态分区需要的字段
dw.dt as part
from
lg_dwd.fact_orders dw
left join
(select * from lg_ods.lg_orders where dt ='20200610') ods
on dw.orderid=ods.orderid ;

注意：使用union all要保证两个子查询得到的字段名称一致！！
临时表中数据插入拉链表中

insert overwrite table lg_dwd.fact_orders partition(dt) select * from lg_dwd.tmp_fact_orders ;

1.2 lg_order_items–订单明细表

对于订单明细表,订单仓库信息的ETL来说，由于该表中数据是作为订单表的补充不会变化，也不涉及到数据清洗，数据处理等需求，所以ods->dwd可以不做。直接从ods层获取数据即可。

第 2 节 ODS–>DIM

主要以下三张表需要从ODS转到DIM层，并且使用拉链表保存。

lg_ods.lg_entrepots:仓库表
lg_ods.lg_items:商品表
lg_ods.lg_item_cats:商品分类表
仓库表，商品表，商品分类表都是作为维度表存在，使用拉链表进行处理；

2.1 lg_dim.lg_items 商品表

创建DIM层商品表

use lg_dim;
DROP TABLE IF EXISTS `lg_dim`.`lg_dim_items`;
CREATE TABLE `lg_dim`.`lg_dim_items`(
goodsId bigint,
goodsSn string,
productNo string,
goodsName string,
goodsImg string,
shopId bigint,
goodsType bigint,
marketPrice double,
shopPrice double,
warnStock bigint,
goodsStock bigint,
goodsUnit string,
goodsTips string,
isSale bigint,
isBest bigint,
isHot bigint,
isNew bigint,
isRecom bigint,
goodsCatIdPath string,
goodsCatId bigint,
shopCatId1 bigint,
shopCatId2 bigint,
brandId bigint,
goodsDesc string,
goodsStatus bigint,
saleNum bigint,
saleTime string,
visitNum bigint,
appraiseNum bigint,
isSpec bigint,
gallery string,
goodsSeoKeywords string,
illegalRemarks string,
dataFlag bigint,
createTime string,
isFreeShipping bigint,
goodsSerachKeywords string,
modifyTime string,
start_date string,
end_date string
)
STORED AS PARQUET;

DROP TABLE IF EXISTS `lg_dim`.`tmp_lg_dim_items`;
CREATE TABLE `lg_dim`.`tmp_lg_dim_items`(
goodsId bigint,
goodsSn string,
productNo string,
goodsName string,
goodsImg string,
shopId bigint,
goodsType bigint,
marketPrice double,
shopPrice double,
warnStock bigint,
goodsStock bigint,
goodsUnit string,
goodsTips string,
isSale bigint,
isBest bigint,
isHot bigint,
isNew bigint,
isRecom bigint,
goodsCatIdPath string,
goodsCatId bigint,
shopCatId1 bigint,
shopCatId2 bigint,
brandId bigint,
goodsDesc string,
goodsStatus bigint,
saleNum bigint,
saleTime string,
visitNum bigint,
appraiseNum bigint,
isSpec bigint,
gallery string,
goodsSeoKeywords string,
illegalRemarks string,
dataFlag bigint,
createTime string,
isFreeShipping bigint,
goodsSerachKeywords string,
modifyTime string,
start_date string,
end_date string
)
STORED AS PARQUET;

2.1.1 商品拉链表操作

第一次导入拉链表

insert overwrite table `lg_dim`.`lg_dim_items`
select
itemsid as goodsId,
itemssn as goodsSn,
productno as productNo,
itemsname as goodsName,
itemsimg as goodsImg,
entrepotid as shopId,
itemstype as goodsType,
marketprice as marketPrice,
entrepotprice as shopPrice,
warnstock as warnStock,
itemsstock as goodsStock,
itemsunit as goodsUnit,
itemstips as goodsTips,
issale as isSale,
isbest as isBest,
ishot as isHot,
isnew as isNew,
isrecom as isRecom,
itemscatidpath as goodsCatIdPath,
itemscatid as goodsCatId,
entrepotcatid1 as shopCatId1,
entrepotcatid2 as shopCatId2,
brandid as brandId,
itemsdesc as goodsDesc,
itemsstatus as goodsStatus,
salenum as saleNum,
saletime as saleTime,
visitnum as visitNum,
appraisenum as appraiseNum,
isspec as isSpec,
gallery as gallery,
itemsseokeywords as goodsSeoKeywords,
illegalremarks as illegalRemarks,
dataflag as dataFlag,
createtime as createTime,
isfreeshipping as isFreeShipping,
itemsserachkeywords as goodsSerachKeywords,
modifytime as modifyTime,
case when modifyTime is not null
then from_unixtime(unix_timestamp(modifyTime, 'yyyy-MM-dd
HH:mm:ss'),'yyyy-MM-dd')
else from_unixtime(unix_timestamp(createTime, 'yyyy-MM-dd HH:mm:ss'),
'yyyy-MM-dd')
end as start_date,
'9999-12-31' as end_date
from
`lg_ods`.`lg_items`
where dt = '20210627';

每日合并拉链表

-- 将历史数据 当日数据合并加载到临时表
drop table if exists `lg_dim`.`tmp_lg_dim_items`;
create table `lg_dim`.`tmp_lg_dim_items`
as
select
dim.goodsId,
dim.goodsSn,
dim.productNo,
dim.goodsName,
dim.goodsImg,
dim.shopId,
dim.goodsType,
dim.marketPrice,
dim.shopPrice,
dim.warnStock,
dim.goodsStock,
dim.goodsUnit,
dim.goodsTips,
dim.isSale,
dim.isBest,
dim.isHot,
dim.isNew,
dim.isRecom,
dim.goodsCatIdPath,
dim.goodsCatId,
dim.shopCatId1,
dim.shopCatId2,
dim.brandId,
dim.goodsDesc,
dim.goodsStatus,
dim.saleNum,
dim.saleTime,
dim.visitNum,
dim.appraiseNum,
dim.isSpec,
dim.gallery,
dim.goodsSeoKeywords,
dim.illegalRemarks,
dim.dataFlag,
dim.createTime,
dim.isFreeShipping,
dim.goodsSerachKeywords,
dim.modifyTime,
dim.start_date,
case when dim.end_date >= '9999-12-31' and ods.itemsid is not null
then '2020-06-09'
else dim.end_date
end as end_date
from
`lg_dim`.`lg_dim_items` dim
left join
(select * from `lg_ods`.`lg_items` where dt='20200610') ods
on dim.goodsId = ods.itemsid
union all
select
itemsid as goodsId,
itemssn as goodsSn,
productno as productNo,
itemsname as goodsName,
itemsimg as goodsImg,
entrepotid as shopId,
itemstype as goodsType,
marketprice as marketPrice,
entrepotprice as shopPrice,
warnstock as warnStock,
itemsstock as goodsStock,
itemsunit as goodsUnit,
itemstips as goodsTips,
issale as isSale,
isbest as isBest,
ishot as isHot,
isnew as isNew,
isrecom as isRecom,
itemscatidpath as goodsCatIdPath,
itemscatid as goodsCatId,
entrepotcatid1 as shopCatId1,
entrepotcatid2 as shopCatId2,
brandid as brandId,
itemsdesc as goodsDesc,
itemsstatus as goodsStatus,
salenum as saleNum,
saletime as saleTime,
visitnum as visitNum,
appraisenum as appraiseNum,
isspec as isSpec,
gallery as gallery,
itemsseokeywords as goodsSeoKeywords,
illegalremarks as illegalRemarks,
dataflag as dataFlag,
createtime as createTime,
isfreeshipping as isFreeShipping,
itemsserachkeywords as goodsSerachKeywords,
modifytime as modifyTime,
case when modifyTime is not null
then from_unixtime(unix_timestamp(modifyTime, 'yyyy-MM-dd
HH:mm:ss'),'yyyy-MM-dd')
else from_unixtime(unix_timestamp(createTime, 'yyyy-MM-dd HH:mm:ss'),
'yyyy-MM-dd')
end as start_date,
'9999-12-31' as end_date
from
`lg_ods`.`lg_items`
where dt = '20200610';

临时表中数据插入拉链表

insert overwrite table lg_dim.lg_dim_items select * from lg_dim.tmp_lg_dim_items;

2.2 lg_dim.dim_product_cat 商品分类表

lg_dim.dim_product_cat表为商品分类数据，在数仓中使用拉链表存储。

2.2.1 创建商品分类拉链表

drop table if exists lg_dim.dim_product_cat;
create table lg_dim.dim_product_cat(
catId bigint,
parentId bigint,
catName string,
isShow bigint,
isFloor bigint,
catSort bigint,
dataFlag bigint,
createTime string,
commissionRate double,
catImg string,
subTitle string,
simpleName string,
seoTitle string,
seoKeywords string,
seoDes string,
catListTheme string,
detailTheme string,
mobileCatListTheme string,
mobileDetailTheme string,
wechatCatListTheme string,
wechatDetailTheme string,
cat_level bigint,
modifyTime string,
start_date string,
end_date string
)
STORED AS PARQUET;

drop table if exists lg_dim.tmp_dim_product_cat;
create table lg_dim.tmp_dim_product_cat(
catId bigint,
parentId bigint,
catName string,
isShow bigint,
isFloor bigint,
catSort bigint,
dataFlag bigint,
createTime string,
commissionRate double,
catImg string,
subTitle string,
simpleName string,
seoTitle string,
seoKeywords string,
seoDes string,
catListTheme string,
detailTheme string,
mobileCatListTheme string,
mobileDetailTheme string,
wechatCatListTheme string,
wechatDetailTheme string,
cat_level bigint,
modifyTime string,
start_date string,
end_date string
)
STORED AS PARQUET;

2.2.2 分类表拉链操作

第一次导入拉链表

insert overwrite table lg_dim.dim_product_cat
select
catid ,
parentid ,
catname ,
isshow ,
isfloor ,
catsort ,
dataflag ,
createtime ,
commissionrate ,
catimg ,
subtitle ,
simplename ,
seotitle ,
seokeywords ,
seodes ,
catlisttheme ,
detailtheme ,
mobilecatlisttheme ,
mobiledetailtheme ,
wechatcatlisttheme ,
wechatdetailtheme ,
cat_level ,
modifytime ,
case when modifyTime is not null
then from_unixtime(unix_timestamp(modifyTime, 'yyyy-MM-dd
HH:mm:ss'),'yyyy-MM-dd')
else from_unixtime(unix_timestamp(createTime, 'yyyy-MM-dd HH:mm:ss'),
'yyyy-MM-dd')
end as start_date,
'9999-12-31' as end_date
from
`lg_ods`.`lg_item_cats`
where dt = '20210627';

每日合并拉链表

drop table if exists lg_dim.tmp_dim_product_cat;
create table lg_dim.tmp_dim_product_cat as
select
dim.catid ,
dim.parentid ,
dim.catname ,
dim.isshow ,
dim.isfloor ,
dim.catsort ,
dim.dataflag ,
dim.createtime ,
dim.commissionrate ,
dim.catimg ,
dim.subtitle ,
dim.simplename ,
dim.seotitle ,
dim.seokeywords ,
dim.seodes ,
dim.catlisttheme ,
dim.detailtheme ,
dim.mobilecatlisttheme ,
dim.mobiledetailtheme ,
dim.wechatcatlisttheme ,
dim.wechatdetailtheme ,
dim.cat_level ,
dim.modifytime ,
dim.start_date ,
case when dim.end_date >= '9999-12-31' and ods.catid is not null
then '2020-06-097'
else dim.end_date
end as end_date
from
`lg_dim`.`dim_product_cat` dim
left join
(select * from `lg_ods`.`lg_item_cats` where dt='20200610') ods
on dim.catid = ods.catid
union all
select
catid ,
parentid ,
catname ,
isshow ,
isfloor ,
catsort ,
dataflag ,
createtime ,
commissionrate ,
catimg ,
subtitle ,
simplename ,
seotitle ,
seokeywords ,
seodes ,
catlisttheme ,
detailtheme ,
mobilecatlisttheme ,
mobiledetailtheme ,
wechatcatlisttheme ,
wechatdetailtheme ,
cat_level ,
modifytime ,
case when modifyTime is not null
then from_unixtime(unix_timestamp(modifyTime, 'yyyy-MM-dd
HH:mm:ss'),'yyyy-MM-dd')
else from_unixtime(unix_timestamp(createTime, 'yyyy-MM-dd HH:mm:ss'),
'yyyy-MM-dd')
end as start_date,
'9999-12-31' as end_date
from
`lg_ods`.`lg_item_cats`
where dt = '20200610';

导入拉链表

insert overwrite table lg_dim.dim_product_cat select * from lg_dim.tmp_dim_product_cat ;

2.3、创建仓库拉链表

2.3.1 创建仓库拉链表

drop table if exists `lg_dim`.`dim_lg_entrepots`;
CREATE TABLE `lg_dim`.`dim_lg_entrepots` (
entrepotId bigint,
areaId bigint,
entrepotName string,
entrepotkeeper string,
telephone string,
entrepotImg string,
entrepotTel string,
entrepotQQ string,
entrepotAddress string,
invoiceRemarks string,
serviceStartTime bigint,
serviceEndTime bigint,
freight bigint,
entrepotAtive int,
entrepotStatus int,
statusDesc string,
dataFlag int,
createTime string ,
modifyTime string,
start_date string,
end_date string
) STORED AS PARQUET;

drop table if exists `lg_dim`.`tmp_dim_lg_entrepots`;
CREATE TABLE `lg_dim`.`tmp_dim_lg_entrepots` (
entrepotId bigint,
areaId bigint,
entrepotName string,
entrepotkeeper string,
telephone string,
entrepotImg string,
entrepotTel string,
entrepotQQ string,
entrepotAddress string,
invoiceRemarks string,
serviceStartTime bigint,
serviceEndTime bigint,
freight bigint,
entrepotAtive int,
entrepotStatus int,
statusDesc string,
dataFlag int,
createTime string ,
modifyTime string,
start_date string,
end_date string
) STORED AS PARQUET;

2.3.2 仓库表拉链操作

第一次导入拉链表

insert overwrite table lg_dim.dim_lg_entrepots
select
entrepotId ,
areaId ,
entrepotName ,
entrepotkeeper ,
telephone ,
entrepotImg ,
entrepotTel ,
entrepotQQ ,
entrepotAddress ,
invoiceRemarks ,
serviceStartTime,
serviceEndTime ,
freight ,
entrepotAtive ,
entrepotStatus ,
statusDesc ,
dataFlag ,
createTime ,
modifyTime ,
case when modifyTime is not null
then from_unixtime(unix_timestamp(modifyTime, 'yyyy-MM-dd
HH:mm:ss'),'yyyy-MM-dd')
else from_unixtime(unix_timestamp(createTime, 'yyyy-MM-dd HH:mm:ss'),
'yyyy-MM-dd')
end as start_date,
'9999-12-31' as end_date
from
`lg_ods`.`lg_entrepots`
where dt = '20210627';

每日合并拉链表

drop table if exists lg_dim.tmp_dim_product_cat;
create table lg_dim.tmp_dim_product_cat as
select
dim.catid ,
dim.parentid ,
dim.catname ,
dim.isshow ,
dim.isfloor ,
dim.catsort ,
dim.dataflag ,
dim.createtime ,
dim.commissionrate ,
dim.catimg ,
dim.subtitle ,
dim.simplename ,
dim.seotitle ,
dim.seokeywords ,
dim.seodes ,
dim.catlisttheme ,
dim.detailtheme ,
dim.mobilecatlisttheme ,
dim.mobiledetailtheme ,
dim.wechatcatlisttheme ,
dim.wechatdetailtheme ,
dim.cat_level ,
dim.modifytime ,
dim.start_date ,
case when dim.end_date >= '9999-12-31' and ods.catid is not null
then '2020-06-09'
else dim.end_date
end as end_date
from
`lg_dim`.`dim_product_cat` dim
left join
(select * from `lg_ods`.`lg_item_cats` where dt='20200610') ods
on dim.catid = ods.catid
union all
select
catid ,
parentid ,
catname ,
isshow ,
isfloor ,
catsort ,
dataflag ,
createtime ,
commissionrate ,
catimg ,
subtitle ,
simplename ,
seotitle ,
seokeywords ,
seodes ,
catlisttheme ,
detailtheme ,
mobilecatlisttheme ,
mobiledetailtheme ,
wechatcatlisttheme ,
wechatdetailtheme ,
cat_level ,
modifytime ,
case when modifyTime is not null
then from_unixtime(unix_timestamp(modifyTime, 'yyyy-MM-dd
HH:mm:ss'),'yyyy-MM-dd')
else from_unixtime(unix_timestamp(createTime, 'yyyy-MM-dd HH:mm:ss'),
'yyyy-MM-dd')
end as start_date,
'9999-12-31' as end_date
from
`lg_ods`.`lg_item_cats`
where dt = '20200610';

导入拉链表

insert overwrite table lg_dim.dim_lg_entrepots select * from lg_dim.tmp_dim_lg_entrepots ;

四、数据导出

针对仓储预测准备相关数据
根据算法部门的要求，仓储预测模型需要提供4个数据集
sales_train.csv;items.csv;item_categories.csv;entreports.csv

sales_train
历史销售数据，有日期字段，月份,仓库，商品，价格和日销售量

统计出的是每个仓库每个商品的日销量数据，
items.csv
表示的是商品信息，有商品名称，仓库id,商品分类id
entrepots.csv
仓库数据集，仓库id,仓库名称
item_category
来源商品分类表，其中有商品分类id以及分类的名称

第一节导出销量数据

观察sales数据，其中有商品id,商品价格，商品销量数据，这些字段可以从订单明细表中获取，
此外还需要提供仓库id，这个字段需要从订单仓库关联表中获取
最后关于月份的编号需要额外进行处理获取；

#创建tmp层
create database lg_tmp;
# 订单明细表关联订单仓库表，基于订单id与商品id
use lg_tmp;
drop table if exists lg_tmp.tmp_order_item_entrepot;
create table tmp_order_item_entrepot
as
select t1.itemsid as itemId,
t1.itemsname as itemName,
t2.entrepotId as entrepotId,
t1.itemsNum as itemNum,
t1.itemsPrice as itemPrice,
t1.dt as dt
from
lg_ods.lg_order_items t1
join
lg_ods.lg_order_entrepot t2
on t1.orderId=t2.orderId
and
t1.itemsid =t2.itemid;

#对上面结果数据进行汇总，按照天统计每个商品的销量
use lg_tmp;
drop table if exists lg_tmp.tmp_order_item_entrepot_day;
create table tmp_order_item_entrepot_day
as
select
itemid,
entrepotid,
dt,
sum(itemNum) as itemNum,
itemprice
from lg_tmp.tmp_order_item_entrepot
group by
itemId,
entrepotid,
dt ,
itemprice;

#验证结果
select * from lg_tmp.tmp_order_item_entrepot_day limit 5;

#增加月份信息 使用排名函数，需要对数据按照月份排序打上序号获取月份数据
drop table if exists tmp_order_item_entrepot_month;
create table tmp_order_item_entrepot_month as
select itemid,entrepotid,itemnum,itemprice,dt,substr(dt,0,6) as month from
lg_tmp.tmp_order_item_entrepot_day ;
#对月份进行排序
create table lg_tmp.tmp_order_item_rank
as
select
itemid,
entrepotid,
itemnum,
itemprice,
dt,
dense_rank() over(order by month ) rank
from lg_tmp.tmp_order_item_entrepot_month ;

导出为csv文件

hive -e “set hive.cli.print.header=true; select dt,rank as block_num,entrepotid as entrepot_id,itemid as item_id,itemprice as item_price,itemnum as item_cnt_day from lg_tmp.tmp_order_item_rank ;” |grep -v “WARN” | sed ‘s/\t/,/g’ >> /sales.csv

其中 grep 过滤错误提醒，sed’s/\t/,/g’ 是将制表符替换为逗号
在这里插入图片描述

第二节导出商品数据

观察商品数据特征，有商品名称，仓库id,商品分类id
基于lg_tmp.tmp_order_item_entrepot获取

#lg_tmp.tmp_order_item_entrepot与商品表进行关联（获取全部数据不用考虑拉链表实现）
drop table if exists lg_tmp.tmp_items;
create table lg_tmp.tmp_items
as
select
t1.itemname as itemname,
t1.itemid as itemid,
t2.goodscatid as catid
from
lg_tmp.tmp_order_item_entrepot t1
join
lg_dim.lg_dim_items t2
on t1.itemid =t2.goodsid ;

#导出为csv文件
hive -e "set hive.cli.print.header=true; select * from lg_tmp.tmp_items ;" |grep -v "WARN" | sed 's/\t/,/g' >> /items.csv

第三节导出仓库及商品分类数据

直接从相关表中导出即可。

#导出仓库数据
hive -e “set hive.cli.print.header=true; select entrepotname,entrepotid from lg_ods.lg_entrepots ;” |grep -v “WARN” | sed ‘s/\t/,/g’ >> /entrepots.csv

#导出商品分类数据
hive -e "set hive.cli.print.header=true; select
t2.catname as type,
t1.catName as subtype,
t1.catId as item_category_id
from (select catId,parentId,catName from lg_ods.lg_item_cats where cat_level = 3
) t1 JOIN
(select catId,catName from lg_ods.lg_item_cats where cat_level = 2 ) t2
on t1.parentId =t2.catId ;" |grep -v "WARN" | sed 's/\t/,/g' >> /item_categories.csv

weixin_47134119

关注

7
点赞
踩
51

收藏

觉得还不错? 一键收藏
5
评论
7.1.1智慧物流【数据采集（sqoop）、ETL、数据导出】

智慧物流文章目录智慧物流一、项目整体架构二、数据采集第 1 节数据说明第 2 节业务数据第 3 节同步数据到Hive3.1 离线数据采集3.1.1 第一次全量导入3.1.2 增量导入三、数据ETL第 1 节 ODS->DWD1.1 lg_dwd.fact_orders--订单拉链表1.1.1 创建DWD层订单表1.1.2 订单拉链操作1.2 lg_order_items--订单明细表第 2 节 ODS-->DIM2.1 lg_dim.lg_items 商品表2.1.1 商品拉链表操作
复制链接

扫一扫