层级关系和作业
再回忆一下层级关系:
一.提款
统计指标:提款申请量,提款申请人数,提款通过量,提款通过人数,协议签订量,协议签订人数,申请提款金额,协议签订金额,实际提款金额。
统计粒度:用户一次提款请求作为一条记录。
分析维度:提款日期,产品,证件类型,性别,渠道,用户类型,客户经理,借款期限。
dim层建表:
create table dim_product(
id bigint
,repay_amount decimal(11,2)
,amount decimal(11,2)
,loan_term int
,composite_rate decimal(11,2)
,interest_rate decimal(11,2)
,real_interest_rate decimal(11,2)
,service_rate decimal(11,2)
,gutee_rate decimal(11,2)
,gutee_service_rate decimal(11,2)
,guarantee_prop decimal(11,2)
,month_pty_service decimal(11,2)
,day_penalty decimal(11,2)
,roof_penalty decimal(11,2)
,term_repay_amount decimal(11,2)
,day_interest_rate decimal(11,2)
,mgmt_consultation decimal(11,2)
,mgmt_service decimal(11,2)
,service_gutee decimal(11,2)
,repay_amount_month decimal(11,2)
,prin_interest_month decimal(11,2)
,mgmt_con_month decimal(11,2)
,mgmt_service_month decimal(11,2)
,service_gutee_month decimal(11,2)
,gutee_fee decimal(11,2)
,gutee_service_fee decimal(11,2)
,insurance_amount decimal(11,2)
,insurance_limit decimal(11,2)
,created_at string
,updated_at string
,etl_time string
)
comment ''
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
insert overwrite table dim_product
select
id
,repay_amount
,amount
,loan_term
,composite_rate
,interest_rate
,real_interest_rate
,service_rate
,gutee_rate
,gutee_service_rate
,guarantee_prop
,month_pty_service
,day_penalty
,roof_penalty
,term_repay_amount
,day_interest_rate
,mgmt_consultation
,mgmt_service
,service_gutee
,repay_amount_month
,prin_interest_month
,mgmt_con_month
,mgmt_service_month
,service_gutee_month
,gutee_fee
,gutee_service_fee
,insurance_amount
,insurance_limit
,created_at
,updated_at
,from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss') as etl_time
from ods_dict_product;
表字段关系以及表字段含义:
create table dim_region(
id bigint
,name string
,code string
,province_id bigint
,province_name string
,province_code string
,is_new string
,created_at string
,updated_at string
,etl_time string
)
comment ''
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
-- 县以及: 区、县、市(县级市) 偃师市(区)-->洛阳市(大城市) 乡-镇-街道办事处 这些是一个级别的
insert overwrite table dim_region
select
a.id
,a.name
,a.code
,a.province_id
,b.name as province_name
,b.code as province_code
,a.is_new
,a.created_at
,a.updated_at
,from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss') as etl_time
from ods_dict_citys a
left join ods_dict_provinces b
on a.province_id=b.id;
表字段关系以及表字段含义:
由此,可以看到维度表表示对分析主题所属类型的描述。
创建dwd层:
create table dwd_fact_drawal_dtl(
drawal_date string
,idty_type string
,channel_id bigint
,user_type string
,manager_id bigint
,sex string
,drawal_id bigint
,user_id bigint
,apply_id bigint
,product_id bigint
,audit_id bigint
,credit_type string
,amount decimal(11,2)
,loan_term bigint
,repay_amount decimal(11,2)
,status string
,lend_time string
,due_date string
,is_current string
,aggrement_id string
,is_valid string
,next_repay_date string
,apply_type string
,fund_provided bigint
,check_aggrement string
,audit_type string
,use_type string
,loan_type string
,occupation_type string
,company_type string
,working_age string
,post string
,title string
,salary decimal(11,2)
,social_security string
,credit_card_or_loan string
,loan_usage string
,city_id bigint
,etl_time string
)
comment ''
partitioned by (partition_date string comment '分区日期')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions = 100000;
set hive.exec.max.dynamic.partitions.pernode=10000;
insert overwrite table dwd_fact_drawal_dtl partition(partition_date)
select
from_unixtime(unix_timestamp(t1.created_at,'yyyy-MM-dd HH:mm:ss'),'yyyyMMdd') drawal_date
,t4.idty_type
,t4.channel_id
,t4.user_type
,t4.manager_id
,t4.sex
,t1.id drawal_id
,t1.user_id
,t1.loan_apply_id apply_id
,t1.product_id
,t1.audit_id
,t1.credit_type
,t1.amount
,t1.loan_term
,t1.repay_amount
,t1.status
,t1.lend_time
,t1.due_date
,t1.is_current
,t1.aggrement_id
,t1.is_valid
,t1.next_repay_date
,t1.apply_type
,t1.fund_provided
,t1.check_aggrement
,t1.audit_type
,t1.use_type
,t1.loan_type
,t2.occupation_type
,t2.company_type
,t2.working_age
,t2.post
,t2.title
,t2.salary
,t2.social_security
,t2.credit_card_or_loan
,t2.loan_usage
,t3.city_id
,from_unixtime(unix_timestamp()+28800,'yyyy-MM-dd') etl_date
,from_unixtime(unix_timestamp(t1.created_at),'yyyy-MM-dd') as partition_date
from ods_drawal_apply t1
left join ods_drawal_companys t2 on t1.id = t2.drawal_apply_id
left join ods_drawal_address t3 on t1.id = t3.drawal_apply_id
left join ods_users t4 on t1.user_id = t4.id;
表字段关系如下:
需要注意的是:因为一个提款申请,对应多条提款公司或者多条地址信息,所以提款申请表应该关联最新的提款公司和提款地址信息。
再次对其进行更加详细的字段筛选如下代码:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_fact_drawal_dtl partition(partition_date)
select
from_unixtime(unix_timestamp(a.created_at),'yyyyMMdd') as drawal_date
,u.idty_type
,u.channel_id
,u.user_type
,u.manager_id
,u.sex
,a.id as drawal_id
,a.user_id
,a.loan_apply_id as apply_id
,a.product_id
,a.audit_id
,a.credit_type
,a.amount
,a.loan_term
,a.repay_amount
,a.status
,a.lend_time
,a.due_date
,a.is_current
,a.aggrement_id
,a.is_valid
,a.next_repay_date
,a.apply_type
,a.fund_provided
,a.check_aggrement
,a.audit_type
,a.use_type
,a.loan_type
,com_mid.occupation_type
,com_mid.company_type
,com_mid.working_age
,com_mid.post
,com_mid.title
,com_mid.salary
,com_mid.social_security
,com_mid.credit_card_or_loan
,com_mid.loan_usage
,addr_mid.city_id
,from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss') as etl_time
,from_unixtime(unix_timestamp(a.created_at),'yyyy-MM-dd') as partition_date
from ods_drawal_apply a
left join ods_users u on a.user_id=u.id
left join (
select
com.id,
com.user_id,
com.drawal_apply_id,
com.occupation_type,
com.company_type,
com.working_age,
com.post,
com.title,
com.salary,
com.social_security,
com.credit_card_or_loan,
com.loan_usage
from ods_drawal_companys com
inner join
(
select user_id,drawal_apply_id,max(id) as max_com_id
from ods_drawal_companys
group by user_id,drawal_apply_id
) x on com.user_id=x.user_id and com.drawal_apply_id=x.drawal_apply_id and com.id=x.max_com_id
) com_mid on a.id=com_mid.drawal_apply_id and a.user_id=com_mid.user_id
left join (
select
addr.id,
addr.user_id,
addr.drawal_apply_id,
addr.province_id,
addr.city_id
from ods_drawal_address addr
inner join
(
select user_id,drawal_apply_id,max(id) as max_addr_id
from ods_drawal_address
where address_type=1
group by user_id,drawal_apply_id
) y on addr.user_id=y.user_id and addr.drawal_apply_id=y.drawal_apply_id and addr.id=y.max_addr_id
) addr_mid on a.id=addr_mid.drawal_apply_id and a.user_id=addr_mid.user_id ;
表关系没有改变,仅是又增加了判断条件,使表更加细致符合要求。
制作dws层:
create table dws_fact_drawal_sum(
data_date string
,idty_type string
,channel_id bigint
,user_type string
,manager_id bigint
,sex string
,product_id bigint
,loan_term bigint
,drawal_app_num int -- 提款申请数量
,drawal_appuse_num int -- 提款申请用户
,drawal_app_amt decimal(11,2) -- 提款申请总金额
,pass_app_num int -- 通过提款申请数量
,pass_appuse_num int -- 通过提款申请用户
,pass_amt decimal(11,2) -- 通过放款金额
,deny_app_num int
,deny_appuse_num int
,deny_amt decimal(11,2)
,cancel_app_num int
,cancel_appuse_num int
,cancel_amt decimal(11,2)
,lent_app_num int -- 实际借出笔数
,lent_appuse_num int -- 实际借出人数
,lent_amt decimal(11,2) -- 实际借出金额
,etl_time string
)
comment ''
partitioned by (partition_date string comment '分区日期')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_fact_drawal_sum partition (partition_date)
select
drawal_date
,idty_type
,channel_id
,user_type
,manager_id
,sex
,product_id
,loan_term,
-- 提款申请的各类指标
count(drawal_id) drawal_app_num,
count(distinct user_id) drawal_appuse_num,
sum(amount) drawal_app_amt,
-- 通过审核的各种指标
count( case when upper(status) in ('LENT','OVERDUE','BIDDING','FINAL_BID_FAILED','COMPLETED') then drawal_id else null end) pass_app_num,
count(distinct case when upper(status) in ('LENT','OVERDUE','BIDDING','FINAL_BID_FAILED','COMPLETED') then user_id else null end) pass_appuse_num,
sum(case when upper(status) in ('LENT','OVERDUE','BIDDING','FINAL_BID_FAILED','COMPLETED') then amount else 0 end) pass_amt
--拒绝类指标
,count(case when upper(status) in('REJECTED') then drawal_id else null end) deny_app_num
,count(distinct (case when upper(status) in('REJECTED') then user_id else null end)) deny_appuse_num
,sum(case when upper(status) in('REJECTED') then amount else 0 end) deny_amt
--取消类指标
,count(case when upper(status) in('CANCELED') then drawal_id else null end) cancel_app_num
,count(distinct (case when upper(status) in('CANCELED') then user_id else null end)) cancel_appuse_num
,sum(case when upper(status) in('CANCELED') then amount else 0 end) cancel_amt
--实际放款类指标
,count(case when upper(status) in('LENT','OVERDUE') then drawal_id else null end) lent_app_num
,count(distinct (case when upper(status) in('LENT','OVERDUE') then user_id else null end)) lent_appuse_num
,sum(case when upper(status) in('LENT','OVERDUE') then amount else 0 end) lent_amt
--ETL时间和分区字段
,max(from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss')) as etl_time
,max(from_unixtime(unix_timestamp(),'yyyy-MM-dd')) as partition_date
from dwd_fact_drawal_dtl
group by drawal_date
,idty_type
,channel_id
,user_type
,manager_id
,sex
,product_id
,loan_term;
表字段关系如下:
对个字段都进行了详细的描述,每个颜色框起来的都有对应颜色的文字描述,不难看出,DWS层的数据是从DWD层转换而来的,两者在数据仓库架构中相互关联、相互配合,共同构成了数据仓库的核心部分。dws在保留了部分属性类型的字段后,通过对status的各种状态进行函数操作,得到了对应的字段,方便我们在出指标的时候使用。
对字段里status的各属性进行一下解释:
CANCELED取消提现,COMPLETED还完,FINAL__BID_FAILED最终绑定失败,LENT借出,OVERDUE逾期,REJECTED拒绝,BIDING绑定中。
那么在当前模块中我们的流程应该是:
借款状态 BIDDING(绑定手机ing) --> 绑定成功进入LENT(借款) ---> 还款完成COMPLETED
如果 逾期状态OVERDUE --> 绑定失败
来用我们的dws层出一个指标玩玩,来一个累计放贷金额:
create table ads_lent_summary(
data_date string,
lent_amt double, -- 总放款数
lent_app_num double, -- 总借款人数
avg_lent_amt double, -- 平均借款人数
avg_loan double, -- 平均借款期数
lent_amt_acc double -- 累计借款金额
)row format delimited
fields terminated by ',';
insert into ads_lent_summary
select data_date,lent_amt,lent_app_num,avg_lent_amt,avg_loan,
sum(lent_amt) over(order by data_date)
from(
select data_date ,
sum(lent_amt) lent_amt,
sum(lent_app_num) lent_app_num,
sum(lent_amt)/sum(lent_appuse_num) avg_lent_amt,
sum(lent_app_num*loan_term) / sum(lent_appuse_num) avg_loan
from dws_fact_drawal_sum
where partition_date = '日期自己造格式:yyyy-MM-dd'
group by data_date
)t1;
如果你遇到了如下的错误:
2024-09-23T14:31:23,619 WARN [HiveServer2-Background-Pool: Thread-1938] metastore.RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. setPartitionColumnStatistics
org.apache.thrift.transport.TTransportException: null一般都是插入的时候,第二次插入。
如果你是metastore 和hiveserver2 在一台服务器上,报以上错误,如果,你是两台服务器,服务启动在不同的服务器上,会报如下错误:
java.lang.ClassCastException: org.apache.hadoop.hive.metastore.api.StringColumnStatsData cannot be cast to org.apache.hadoop.hive.metastore.columnstats.cache.StringColumnStatsDataInspector解决方案:
使用 insert overwrite table 替代 insert into
或者 在建表之前 set hive.stats.autogather=false; 建表,插入数据
二.还款
统计指标:还款合同量,还款人数,还款金额,预期本金,预期利息,预期服务费,预期天数,预期次数,预期状态
统计粒度:用户一次还款请求作为一条记录
分析维度:还款日期、产品、证件类型,性别,渠道,用户类型,客户经理
因为 表结构 有问题,所以需要修改变字段,因为在mysql就错了,所以需要执行如下两条sql。
预览可以看到:
以及写了如果存在便删除,所以只要在mysql运行如下的两条sql就可以了。
【免费】数仓项目实战三!里的sql修改时需要的命令资源-CSDN文库
那么在mysql导入成功后我们便再导入到hive里:
drop table ods_repay_plan_item;
drop table ods_repay_plan_item_his;
导入文件夹中的两个sql语句到mysql
sqoop import \
--driver com.mysql.cj.jdbc.Driver \
--connect 'jdbc:mysql://shucang:3306/jrxd?characterEncoding=UTF-8&zeroDateTimeBehavior=CONVERT_TO_NULL' \
--username root \
--password 123456 \
--table repay_plan_item \
--num-mappers 1 \
--hive-import \
--hive-overwrite \
--hive-database finance \
--hive-table ods_repay_plan_item \
--null-non-string '\\N' \
--null-string '\\N'
sqoop import \
--driver com.mysql.cj.jdbc.Driver \
--connect 'jdbc:mysql://shucang:3306/jrxd?characterEncoding=UTF-8&zeroDateTimeBehavior=CONVERT_TO_NULL' \
--username root \
--password 123456 \
--table repay_plan_item_his \
--num-mappers 1 \
--hive-import \
--hive-overwrite \
--hive-database finance \
--hive-table ods_repay_plan_item_his \
--null-non-string '\\N' \
--null-string '\\N'
那么开始创建dwd层:
create table dwd_fact_repay_plan_dtl(
id string
,user_id string
,drawal_id bigint
,contract_amount decimal(11,2)
,status string
,loan_term int
,paid_amount decimal(11,2)
,prestore_amount decimal(11,2)
,unpaid_amount decimal(11,2)
,reduce_amount decimal(11,2)
,early_repay_penalty_amount decimal(11,2)
,sync_hx string
,sync_hx_time string
,created_at string
,updated_at string
,etl_time string
)
comment ''
partitioned by (partition_date string comment '分区日期')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
insert overwrite table dwd_fact_repay_plan_dtl partition(partition_date)
select
id
,user_id
,drawal_apply_id as drawal_id
,contract_amount
,status
,loan_term
,paid_amount
,prestore_amount
,unpaid_amount
,reduce_amount
,early_repay_penalty_amount
,sync_hx
,sync_xhx_time
,created_at
,updated_at
,from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss') as etl_time
,from_unixtime(unix_timestamp(created_at),'yyyy-MM-dd') as partition_date
from ods_repay_plan;
create table dwd_fact_repay_plan_item_dtl(
id bigint
,drawal_id bigint
,repay_plan_id bigint
,repay_term bigint
,status string
,due_date string
,dest_principal decimal(11,2)
,dest_interest decimal(11,2)
,dest_service decimal(11,2)
,dest_pty_interest decimal(11,2)
,dest_pty_fee decimal(11,2)
,dest_pty_late_fee decimal(11,2)
,dest_gutee_fee decimal(11,2)
,dest_gutee_service_fee decimal(11,2)
,dest_other decimal(11,2)
,paid_principal decimal(11,2)
,paid_interest decimal(11,2)
,paid_service decimal(11,2)
,paid_pty_interest decimal(11,2)
,paid_pty_fee decimal(11,2)
,paid_pty_late_fee decimal(11,2)
,paid_gutee_fee decimal(11,2)
,paid_gutee_service_fee decimal(11,2)
,paid_other decimal(11,2)
,reduce_principal decimal(11,2)
,reduce_interest decimal(11,2)
,reduce_service decimal(11,2)
,reduce_pty_interest decimal(11,2)
,reduce_pty_fee decimal(11,2)
,reduce_pty_late_fee decimal(11,2)
,reduce_gutee_fee decimal(11,2)
,reduce_gutee_service_fee decimal(11,2)
,reduce_other decimal(11,2)
,overdue_days int
,actual_repay_date string
,total_un_paid_amt decimal(11,2)
,updated_at string
,created_at string
,etl_time string
)
comment ''
partitioned by (partition_date string comment '分区日期')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
insert overwrite table dwd_fact_repay_plan_item_dtl
select
id
,drawal_apply_id as drawal_id
,repay_plan_id
,repay_term
,status
,due_date
,dest_principal
,dest_interest
,dest_service
,dest_pty_interest
,dest_pty_fee
,dest_pty_late_fee
,dest_gutee_fee
,dest_gutee_service_fee
,dest_other
,paid_principal
,paid_interest
,paid_service
,paid_pty_interest
,paid_pty_fee
,paid_pty_late_fee
,paid_gutee_fee
,paid_gutee_service_fee
,paid_other
,reduce_principal
,reduce_interest
,reduce_service
,reduce_pty_interest
,reduce_pty_fee
,reduce_pty_lat_fee
,reduce_gutee_fee
,reduce_gutee_service_fee
,reduce_other
,overdue_days
,actual_repay_date
,total_un_paid_amt
,updated_at
,created_at
,from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss') as etl_time
,from_unixtime(unix_timestamp(created_at),'yyyy-MM-dd') as partition_date
from ods_repay_plan_item;
create table dwd_fact_repay_plan_item_his(
id bigint
,drawal_id bigint
,repay_plan_id bigint
,repay_plan_item_id bigint
,repay_term int
,old_status string
,new_status string
,created_at string
,etl_time string
)
comment ''
partitioned by (partition_date string comment '分区日期')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
insert overwrite table dwd_fact_repay_plan_item_his
select
id
,drawal_apply_id as drawal_id
,repay_plan_id
,repay_plan_item_id
,repay_term
,old_status
,new_status
,created_at
,from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss') as etl_time
,from_unixtime(unix_timestamp(created_at),'yyyy-MM-dd') as partition_date
from ods_repay_plan_item_his;
以上的dwd设计的并不好,因为可以看出基本上都是选择了ods层的原表部分字段,非常的简单所以完全可以将三个表的数据或者两个表的数据关联在一起,由于过于简单所以不再分析表字段。
构建dws层:
create table dws_drawal_risk_sum(
drawal_id bigint
,con_status string
,dest_principal decimal(11,2)
,dest_interest decimal(11,2)
,dest_service decimal(11,2)
,dest_pty_interest decimal(11,2)
,dest_pty_fee decimal(11,2)
,dest_pty_late_fee decimal(11,2)
,dest_gutee_fee decimal(11,2)
,dest_gutee_service_fee decimal(11,2)
,dest_other decimal(11,2)
,paid_principal decimal(11,2)
,paid_interest decimal(11,2)
,paid_service decimal(11,2)
,paid_pty_interest decimal(11,2)
,paid_pty_fee decimal(11,2)
,paid_pty_late_fee decimal(11,2)
,paid_gutee_fee decimal(11,2)
,paid_gutee_service_fee decimal(11,2)
,paid_other decimal(11,2)
,reduce_principal decimal(11,2)
,reduce_interest decimal(11,2)
,reduce_service decimal(11,2)
,reduce_pty_interest decimal(11,2)
,reduce_pty_fee decimal(11,2)
,reduce_pty_late_fee decimal(11,2)
,reduce_gutee_fee decimal(11,2)
,reduce_gutee_service_fee decimal(11,2)
,reduce_other decimal(11,2)
,total_un_paid_amt decimal(11,2)
,first_due_date string
,last_due_date string
,curr_overdue_principal decimal(11,2)
,curr_overdue_interest decimal(11,2)
,curr_overdue_service decimal(11,2)
,curr_overdue_nums int
,curr_overdue_days int
,etl_time string
)
comment ''
partitioned by (partition_date string comment '分区日期')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
select min('1还款中','1还款中','2已结清'); -- 1还款中
select min('2已结清','2已结清','2已结清'); -- 2已结清
insert overwrite table dws_drawal_risk_sum
partition(partition_date)
---status 0:未到期;1:已逾期;2:已结束
select
drawal_id
,min(case when status in ('0','1') then '1还款中'
when status in ('2') then '2已结清'
else null
end) as con_status
,sum(dest_principal) as dest_principal
,sum(dest_interest) as dest_interest
,sum(dest_service) as dest_service
,sum(dest_pty_interest) as dest_pty_interest
,sum(dest_pty_fee) as dest_pty_fee
,sum(dest_pty_late_fee) as dest_pty_late_fee
,sum(dest_gutee_fee) as dest_gutee_fee
,sum(dest_gutee_service_fee) as dest_gutee_service_fee
,sum(dest_other) as dest_other
,sum(paid_principal) as paid_principal
,sum(paid_interest) as paid_interest
,sum(paid_service) as paid_service
,sum(paid_pty_interest) as paid_pty_interest
,sum(paid_pty_fee) as paid_pty_fee
,sum(paid_pty_late_fee) as paid_pty_late_fee
,sum(paid_gutee_fee) as paid_gutee_fee
,sum(paid_gutee_service_fee) as paid_gutee_service_fee
,sum(paid_other) as paid_other
,sum(reduce_principal) as reduce_principal
,sum(reduce_interest) as reduce_interest
,sum(reduce_service) as reduce_service
,sum(reduce_pty_interest) as reduce_pty_interest
,sum(reduce_pty_fee) as reduce_pty_fee
,sum(reduce_pty_late_fee) as reduce_pty_late_fee
,sum(reduce_gutee_fee) as reduce_gutee_fee
,sum(reduce_gutee_service_fee) as reduce_gutee_service_fee
,sum(reduce_other) as reduce_other
,sum(total_un_paid_amt) as total_un_paid_amt
,min(due_date) as first_due_date
,max(due_date) as last_due_date
,sum(case when status ='1' then dest_principal-paid_principal-reduce_principal
else 0
end) as curr_overdue_principal
,sum(case when status ='1' then dest_interest-paid_interest-reduce_principal
else 0
end) as curr_overdue_interest
,sum(case when status ='1' then dest_service-paid_service-reduce_service
else 0
end) as curr_overdue_service
,sum(case when status ='1' then 1
else 0
end) as curr_overdue_nums
,max(case when status ='1' then overdue_days
else 0
end) as curr_overdue_days
,max(from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss')) as etl_time
,max(from_unixtime(unix_timestamp(),'yyyy-MM-dd')) as partition_date
from dwd_fact_repay_plan_item_dtl
group by drawal_id;
字段关系如下:
为什么这个表的字段比较多而且大部分都是函数处理过的呢?
数仓的意义就是为了出指标,设计的时候要有一定的前瞻性,在我们项目的最开始不可能说能算无遗策的列出所有指标,都是从核心指标入手,后续缺什么的话就添什么,这也可以解释为什么到项目都要结束了但是ods层好像有的表都没有用过但还是抽到了hive里,因为我们以后有可能会用到。那么怎么区分数仓的好坏呢?比如当ads层想出一个指标的时候,我们一般是dws关联一些维度表就可以了,但是dws出不了的话就要去dwd找数据,而dwd也没办法就只能去ods层了,那就说明数仓比较的差。
做俩指标玩玩:
剩余贷款余额分布:
create table ads_repay_sum as
with t as (
select *,case when unpaid_amount < 5000 then '5000以下'
when unpaid_amount < 10000 then '5000至10000'
else '10000以上' end amount_range from dwd_fact_repay_plan_dtl --where status in (0,1)
)
select amount_range,count(1) user_num from t group by amount_range;
还款状态分布:
create table ads_repay_sum_bl as
with t as(
select
if(curr_overdue_nums<=6,curr_overdue_nums,'6+') qs,
count(1) users_num from dws_drawal_risk_sum group by curr_overdue_nums
),t2 as (
select qs,sum(users_num) users_num from t group by qs
),t3 as (
select qs,users_num,sum(users_num) over() total_users from t2
)
select qs,users_num,round(users_num/total_users,2) from t3;
再说一下dws层为什么对带汉字的值进行了min操作:
select min('1还款中','1还款中','2已结清'); -- 1还款中
select min('2已结清','2已结清','2已结清'); -- 2已结清
他会默认根据第一个去比较也就是实际上还是对1,2进行了比较,非常的不错。
三.员工分析
统计指标:入职人数,离职人数,在职天数,在职月份数
统计粒度:每天的入职人数和离职人数
分析维度:离职日期或入职日期,年龄(段),性别,部门
创建dws层如下:
此处的年龄设置的相当不好,因为年龄还带小数,会将同一天的数据变成很多条。下一次不要写了。
create table dws_com_manager_sum(
dateid string
,age string
,sex string
,deptname string
,days_in_com int
,month_in_com int
,entry_nums int
,departure_nums int
,etl_time string
)
comment ''
partitioned by (partition_date string comment '分区日期')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_com_manager_sum partition(partition_date)
select
dataid
,age
,sex
,deptname
,days_in_com
,month_in_com
,count(case when datatype=0 then id else null end) as entry_nums -- 在职人数
,count(case when datatype=1 then id else null end) as departure_nums -- 离职人数
,max(from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss')) as etl_time
,max(from_unixtime(unix_timestamp(),'yyyy-MM-dd')) as partition_date
from
(
--- 当天的入职数据
select
id
,entry_date as dataid
,(year('2024-06-28')- year(birthday)) age -- 年龄
,sex
,deptname
,datediff('2024-06-28',entry_date) as days_in_com -- 在职天数
,datediff('2024-06-28',entry_date)/30 as month_in_com -- 在职月数
,0 as datatype
from dim_com_manager
where departure_date is null -- 离职时间为空
union all
--- 当天的离职数据
select
id
,departure_date as dataid -- 离职时间
,year('2024-06-28')- year(birthday) as age
,sex
,deptname
,datediff(departure_date,entry_date) as days_in_com -- 在职天数
,datediff(departure_date,entry_date)/30 as month_in_com -- 在职月数
,1 as datatype
from dim_com_manager
where departure_date is not null -- 离职时间不为空,代表已经离职
)t1
group by
dataid
,age
,sex
,deptname
,days_in_com
,month_in_com;
那么在这个模块里如果想要计算年龄怎么办呢:
-- 假如一个人是2000-10-01 出生,现在的时间是2024-09-24 他的年龄不能直接是24岁,而应该是23.9岁
-- 简单粗暴的一种办法
select year(`current_date`()) - year('2000-10-01');
-- ((2024-2000)*12 + (9-10)) / 12 == 小数
select round(((year(`current_date`()) - year('2000-10-01'))*12 + (month(`current_date`())-month('2000-10-01')))/12,2);
四.宽表
什么是宽表?
一个表字段很多,就是宽表。(宽表不是数仓中的专业名词,数据库也有)
要在那一层构建呢?
不一定,一般在dws创建,但是dwd也可以。
缺点:
因为关联好几个表,集中了表的字段,所以宽表很显然数据冗余很严重,浪费磁盘空间,不符合三范式,不过天然的数仓就不符合三范式,因为是维度建模。而字段多导致的还有灵活性差,不易修改因此在创建宽表之间,要对业务非常熟悉。
那么它的优点是什么呢?以空间换时间,当需要多表关联的时候,关联宽表即可。
创建宽表语句:
-- 关于审核的宽表
create table dws_fact_credit_wide_sum(
data_date string
,idty_type string
,channel_id bigint
,user_type string
,manager_id bigint
,sex string
,regis_cnt int
,ocr_cnt int
,md5_app_cnt int
,md5_val_cnt int
,apply_num int
,apply_user_num int
,cs_num int
,cs_user_num int
,cs_app_num int
,cs_pass_num int
,cs_deny_num int
,cs_pass_amt decimal(11,2)
,zs_num int
,zs_user_num int
,zs_app_num int
,zs_pass_num int
,zs_deny_num int
,zs_pass_amt decimal(11,2)
,drawal_app_num int
,drawal_appuse_num int
,drawal_app_amt decimal(11,2)
,pass_app_num int
,pass_appuse_num int
,pass_amt decimal(11,2)
,deny_app_num int
,deny_appuse_num int
,deny_amt decimal(11,2)
,cancel_app_num int
,cancel_appuse_num int
,cancel_amt decimal(11,2)
,lent_app_num int
,lent_appuse_num int
,lent_amt decimal(11,2)
,etl_time string
)
comment ''
partitioned by (partition_date string comment '分区日期')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
插入数据语句:
insert overwrite table dws_fact_credit_wide_sum partition(partition_date)
select
a.data_date
,a.idty_type
,a.channel_id
,a.user_type
,a.manager_id
,a.sex
,sum(a.regis_cnt) as regis_cnt
,sum(a.ocr_cnt) as ocr_cnt
,sum(a.md5_app_cnt) as md5_app_cnt
,sum(a.md5_val_cnt) as md5_val_cnt
,sum(a.apply_num ) as apply_num
,sum(a.apply_user_num) as apply_user_num
,sum(a.cs_num) as cs_num
,sum(a.cs_user_num ) as cs_user_num
,sum(a.cs_app_num ) as cs_app_num
,sum(a.cs_pass_num) as cs_pass_num
,sum(a.cs_deny_num) as cs_deny_num
,sum(a.cs_pass_amt) as cs_pass_amt
,sum(a.zs_num) as zs_num
,sum(a.zs_user_num) as zs_user_num
,sum(a.zs_app_num ) as zs_app_num
,sum(a.zs_pass_num) as zs_pass_num
,sum(a.zs_deny_num) as zs_deny_num
,sum(a.zs_pass_amt) as zs_pass_amt
,sum(a.drawal_app_num) as drawal_app_num
,sum(a.drawal_appuse_num) as drawal_appuse_num
,sum(a.drawal_app_amt) as drawal_app_amt
,sum(a.pass_app_num) as pass_app_num
,sum(a.pass_appuse_num) as pass_appuse_num
,sum(a.pass_amt) as pass_amt
,sum(a.deny_app_num) as deny_app_num
,sum(a.deny_appuse_num) as deny_appuse_num
,sum(a.deny_amt ) as deny_amt
,sum(a.cancel_app_num) as cancel_app_num
,sum(a.cancel_appuse_num) as cancel_appuse_num
,sum(a.cancel_amt ) as cancel_amt
,sum(a.lent_app_num) as lent_app_num
,sum(a.lent_appuse_num) as lent_appuse_num
,sum(a.lent_amt ) as lent_amt
,max(from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss')) as etl_time
,max(from_unixtime(unix_timestamp(a.data_date,'yyyyMMdd'),'yyyy-MM-dd')) as partition_date
from
(
--注册指标
select
dateid data_date
,idty_type
,channel_id
,user_type
,manager_id
,sex
,regis_cnt
,ocr_cnt
,md5_app_cnt
,md5_val_cnt
,0 as apply_num
,0 as apply_user_num
,0 as cs_num
,0 as cs_user_num
,0 as cs_app_num
,0 as cs_pass_num
,0 as cs_deny_num
,0 as cs_pass_amt
,0 as zs_num
,0 as zs_user_num
,0 as zs_app_num
,0 as zs_pass_num
,0 as zs_deny_num
,0 as zs_pass_amt
,0 as drawal_app_num
,0 as drawal_appuse_num
,0 as drawal_app_amt
,0 as pass_app_num
,0 as pass_appuse_num
,0 as pass_amt
,0 as deny_app_num
,0 as deny_appuse_num
,0 as deny_amt
,0 as cancel_app_num
,0 as cancel_appuse_num
,0 as cancel_amt
,0 as lent_app_num
,0 as lent_appuse_num
,0 as lent_amt
from dws_fact_regiter_sum
union all
--申请指标
select
dateid data_date
,idty_type
,channel_id
,user_type
,manager_id
,sex
,0 as regis_cnt
,0 as ocr_cnt
,0 as md5_app_cnt
,0 as md5_val_cnt
,apply_num
,apply_user_num
,0 as cs_num
,0 as cs_user_num
,0 as cs_app_num
,0 as cs_pass_num
,0 as cs_deny_num
,0 as cs_pass_amt
,0 as zs_num
,0 as zs_user_num
,0 as zs_app_num
,0 as zs_pass_num
,0 as zs_deny_num
,0 as zs_pass_amt
,0 as drawal_app_num
,0 as drawal_appuse_num
,0 as drawal_app_amt
,0 as pass_app_num
,0 as pass_appuse_num
,0 as pass_amt
,0 as deny_app_num
,0 as deny_appuse_num
,0 as deny_amt
,0 as cancel_app_num
,0 as cancel_appuse_num
,0 as cancel_amt
,0 as lent_app_num
,0 as lent_appuse_num
,0 as lent_amt
from dws_fact_loan_apply_sum
union all
-- 审核指标
select
dateid
,idty_type
,channel_id
,user_type
,manager_id
,sex
,0 as regis_cnt
,0 as ocr_cnt
,0 as md5_app_cnt
,0 as md5_val_cnt
,0 as apply_num
,0 as apply_user_num
,cs_num
,cs_user_num
,cs_app_num
,cs_pass_num
,cs_deny_num
,cs_pass_amt
,zs_num
,zs_user_num
,zs_app_num
,zs_pass_num
,zs_deny_num
,zs_pass_amt
,0 as drawal_app_num
,0 as drawal_appuse_num
,0 as drawal_app_amt
,0 as pass_app_num
,0 as pass_appuse_num
,0 as pass_amt
,0 as deny_app_num
,0 as deny_appuse_num
,0 as deny_amt
,0 as cancel_app_num
,0 as cancel_appuse_num
,0 as cancel_amt
,0 as lent_app_num
,0 as lent_appuse_num
,0 as lent_amt
from dws_fact_loan_credit_sum
union all
--提款类指标
select
data_date
,idty_type
,channel_id
,user_type
,manager_id
,sex
,0 as regis_cnt
,0 as ocr_cnt
,0 as md5_app_cnt
,0 as md5_val_cnt
,0 as apply_num
,0 as apply_user_num
,0 as cs_num
,0 as cs_user_num
,0 as cs_app_num
,0 as cs_pass_num
,0 as cs_deny_num
,0 as cs_pass_amt
,0 as zs_num
,0 as zs_user_num
,0 as zs_app_num
,0 as zs_pass_num
,0 as zs_deny_num
,0 as zs_pass_amt
,drawal_app_num
,drawal_appuse_num
,drawal_app_amt
,pass_app_num
,pass_appuse_num
,pass_amt
,deny_app_num
,deny_appuse_num
,deny_amt
,cancel_app_num
,cancel_appuse_num
,cancel_amt
,lent_app_num
,lent_appuse_num
,lent_amt
from dws_fact_drawal_sum
) a
group by
a.data_date
,a.idty_type
,a.channel_id
,a.user_type
,a.manager_id
,a.sex;
五.拉链表
对于维度数据,如果变换频率很高,可以使用分区表,如果变换频率不高,使用拉链表,就不需要使用分区表了。
创建拉链表:
create table dim_product_zip(
id bigint
,repay_amount decimal(11,2)
,amount decimal(11,2)
,loan_term int
,composite_rate decimal(11,2)
,interest_rate decimal(11,2)
,real_interest_rate decimal(11,2)
,service_rate decimal(11,2)
,gutee_rate decimal(11,2)
,gutee_service_rate decimal(11,2)
,guarantee_prop decimal(11,2)
,month_pty_service decimal(11,2)
,day_penalty decimal(11,2)
,roof_penalty decimal(11,2)
,term_repay_amount decimal(11,2)
,day_interest_rate decimal(11,2)
,mgmt_consultation decimal(11,2)
,mgmt_service decimal(11,2)
,service_gutee decimal(11,2)
,repay_amount_month decimal(11,2)
,prin_interest_month decimal(11,2)
,mgmt_con_month decimal(11,2)
,mgmt_service_month decimal(11,2)
,service_gutee_month decimal(11,2)
,gutee_fee decimal(11,2)
,gutee_service_fee decimal(11,2)
,insurance_amount decimal(11,2)
,insurance_limit decimal(11,2)
,created_at string
,updated_at string
,start_dt string
,end_dt string
,etl_time string
)
comment ''
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
在我们创建了dim_product_zip的时候,dim_product就没用了,同样的维度表只用保留一个即可。
dim_product_zip 包含了 dim_product ,即 既可以查询最新的商品状态,又可以查询历史商品状态。
mysql 中的维度发生了变化,抽取到 ods 也会发生改变,ods 改变,以下生成拉链表的数据就会改变,这个设计完美。
怎么测试?
流程应该如下:mysql --> sqoop --> hive 的 ods --> 拉链表中
-- 对商品编号修改一条,插入一条,通过sqoop增量导入数据到hive的ods
INSERT INTO `dict_product` VALUES (5756,NULL,200000,5,34.2,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,'NULL',7910,1980,NULL,NULL,NULL,1582,396,NULL,NULL,NULL,NULL,NULL,'2024-09-23 03:23:57','2024-09-23 03:23:57');
INSERT INTO `dict_product` VALUES (5757,NULL,300000,5,34.2,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,'NULL',7910,1980,NULL,NULL,NULL,1582,396,NULL,NULL,NULL,NULL,NULL,'2024-09-23 03:23:57','2024-09-23 03:23:57');
sqoop增量抽取:
sqoop import \
--driver com.mysql.cj.jdbc.Driver \
--connect jdbc:mysql://shucang:3306/jrxd?characterEncoding=UTF-8 \
--username root \
--password 123456 \
--query 'select * from dict_product where DATE_FORMAT(updated_at,"%Y-%m-%d") = "2024-09-23" and $CONDITIONS' \
--target-dir /tmp/ods_dict_product \
--delete-target-dir \
--split-by id \
--num-mappers 1 \
--hive-import \
--hive-database finance \
--hive-table ods_dict_product \
--null-non-string '\\N' \
--null-string '\\N'
插入数据:
insert overwrite table dim_product_zip
select
t1.id
,t1.repay_amount
,t1.amount
,t1.loan_term
,t1.composite_rate
,t1.interest_rate
,t1.real_interest_rate
,t1.service_rate
,t1.gutee_rate
,t1.gutee_service_rate
,t1.guarantee_prop
,t1.month_pty_service
,t1.day_penalty
,t1.roof_penalty
,t1.term_repay_amount
,t1.day_interest_rate
,t1.mgmt_consultation
,t1.mgmt_service
,t1.service_gutee
,t1.repay_amount_month
,t1.prin_interest_month
,t1.mgmt_con_month
,t1.mgmt_service_month
,t1.service_gutee_month
,t1.gutee_fee
,t1.gutee_service_fee
,t1.insurance_amount
,t1.insurance_limit
,t1.created_at
,t1.updated_at,
t1.start_dt,
if(t1.end_dt = '9999-12-31' and t2.id is not null,'2024-09-22',t1.end_dt),
-- from_unixtime(unix_timestamp(`current_date`(),'yyyyMMdd'),'yyyy-MM-dd') etl_time
'2024-09-23' etl_time
from
(select * from dim_product_zip where substr(etl_time,1,10) ='2024-09-22') t1
left join
(select * from ods_dict_product where substr(updated_at,1,10) ='2024-09-23') t2
on t1.id = t2.id
union all
select
id
,repay_amount
,amount
,loan_term
,composite_rate
,interest_rate
,real_interest_rate
,service_rate
,gutee_rate
,gutee_service_rate
,guarantee_prop
,month_pty_service
,day_penalty
,roof_penalty
,term_repay_amount
,day_interest_rate
,mgmt_consultation
,mgmt_service
,service_gutee
,repay_amount_month
,prin_interest_month
,mgmt_con_month
,mgmt_service_month
,service_gutee_month
,gutee_fee
,gutee_service_fee
,insurance_amount
,insurance_limit
,created_at
,updated_at,
'2024-09-23' start_dt,
'9999-12-31' end_dt,
-- from_unixtime(unix_timestamp(`current_date`(),'yyyyMMdd'),'yyyy-MM-dd') etl_time
'2024-09-23' etl_time
from ods_dict_product
where substr(updated_at,1,10) = '2024-09-23';
再来模拟第二天又有数据修改:
这个dict_product 中的created_at 字段会根据时间修改而自动改变,所以将created_at 的current_timestap给删除掉,方便测试。
INSERT INTO `dict_product` VALUES (5758,NULL,300000,5,34.2,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,'NULL',7910,1980,NULL,NULL,NULL,1582,396,NULL,NULL,NULL,NULL,NULL,'2024-09-24 03:23:57','2024-09-24 03:23:57');
update `dict_product` set loan_term=12,updated_at='2024-09-24 13:23:57' where id = 5756 ;
sqoop再次抽取:
sqoop import \
--driver com.mysql.cj.jdbc.Driver \
--connect jdbc:mysql://shucang:3306/jrxd?characterEncoding=UTF-8 \
--username root \
--password 123456 \
--query 'select * from dict_product where DATE_FORMAT(updated_at,"%Y-%m-%d") = "2024-09-24" and $CONDITIONS' \
--target-dir /tmp/ods_dict_product \
--delete-target-dir \
--split-by id \
--num-mappers 1 \
--hive-import \
--hive-database finance \
--hive-table ods_dict_product \
--null-non-string '\\N' \
--null-string '\\N'
再次往拉链表增量插入:
每次都会覆盖以前的所有拉链表数据,因为重新查找的数据已经涵盖了以前的所有情况。
insert overwrite table dim_product_zip
select
t1.id
,t1.repay_amount
,t1.amount
,t1.loan_term
,t1.composite_rate
,t1.interest_rate
,t1.real_interest_rate
,t1.service_rate
,t1.gutee_rate
,t1.gutee_service_rate
,t1.guarantee_prop
,t1.month_pty_service
,t1.day_penalty
,t1.roof_penalty
,t1.term_repay_amount
,t1.day_interest_rate
,t1.mgmt_consultation
,t1.mgmt_service
,t1.service_gutee
,t1.repay_amount_month
,t1.prin_interest_month
,t1.mgmt_con_month
,t1.mgmt_service_month
,t1.service_gutee_month
,t1.gutee_fee
,t1.gutee_service_fee
,t1.insurance_amount
,t1.insurance_limit
,t1.created_at
,t1.updated_at,
t1.start_dt,
if(t1.end_dt = '9999-12-31' and t2.id is not null,'2024-09-23',t1.end_dt),
from_unixtime(unix_timestamp(`current_date`(),'yyyyMMdd'),'yyyy-MM-dd') etl_time
from
(select * from dim_product_zip where substr(etl_time,1,10) ='2024-09-23') t1
left join
(select * from ods_dict_product where substr(updated_at,1,10) ='2024-09-24') t2
on t1.id = t2.id
union all
select
id
,repay_amount
,amount
,loan_term
,composite_rate
,interest_rate
,real_interest_rate
,service_rate
,gutee_rate
,gutee_service_rate
,guarantee_prop
,month_pty_service
,day_penalty
,roof_penalty
,term_repay_amount
,day_interest_rate
,mgmt_consultation
,mgmt_service
,service_gutee
,repay_amount_month
,prin_interest_month
,mgmt_con_month
,mgmt_service_month
,service_gutee_month
,gutee_fee
,gutee_service_fee
,insurance_amount
,insurance_limit
,created_at
,updated_at,
'2024-09-23' start_dt,
'9999-12-31' end_dt,
from_unixtime(unix_timestamp(`current_date`(),'yyyyMMdd'),'yyyy-MM-dd') etl_time
from ods_dict_product
where substr(updated_at,1,10) = '2024-09-24';
来讲解一下具体逻辑吧,是这么做的呢?
思路
T=2023-06-06
1.dwd的拉链表T-1日的数据 关联ODS的T日的数据
union all
2.ODS的T日数据
select t1.id,..... start-dt,if(t2.id is null,t1.end_dt,'2023-06-05')
from (select * from dwd.nums_member_zip where dt = '2023-06-05')
t1 left join ( select * from dwd.nums_member_zip where dt = '2023-06-06')
t2 on t1.id = t2.id
union all
select id,xxx......'2023-06-06 start_dt,'9999-12-31' end_dt from ods.ums_member where dt = '2023-06-06'
因为今天没有结束的话数据是不完全的,所以只能每天对昨天的数据进行增量导入,那从拉链表筛选昨天的数据与今天的ods层商品表左关联这样就得到了我们昨天进行修改了的数据(并且在字段里通过if判断把昨天的截止日期从9999-12-31更换成昨天),这样我们还缺少新增的数据,那么在从ods层商品表筛选商品创建日期为今天的数据并把截止时间换成9999-12-31在union all就行了。
六.漏斗模型
漏斗表简单来说就是越往下数据越少,形状像一个漏斗一样,比方说在某个时间段内注册的有多少人,着这些人里发起了借款申请的又有多少人,借款申请里审核通过的又有多少,那这个人数肯定是越来越少的。
出一个指标玩玩:
漏斗指标
-- 查看'2019-11-21 到 2019-11-25 '注册的用户,并且完成贷款申请和审核通过用户数量
with t as (
select user_id,regis_cnt,md5_val_cnt,ocr_cnt from dwd_fact_user_regiter_dtl where dateid>='20191121' and dateid <='20191125'
),t2 as (
select * from dwd_fact_loan_apply_dtl
),t3 as (
select * from dwd_fact_credit_dtl where credit_type ='zs'
)
select
'2019-11-21' funnel_start,
'2019-11-25' funnel_end,
sum(t.regis_cnt) regis_cnt,
sum(t.ocr_cnt) ocr_cnt,
sum(t.md5_val_cnt) md5_val_cnt,
sum(t2.loan_app_cnt) loan_app_cnt,
sum(t3.pass_cnt) pass_cnt
from t
left join t2 on t.user_id=t2.user_id
left join t3 on t2.user_id = t3.user_id
;
七.DolphinScheduler
用ds去进程调度,实现增量导入,
先创建一个mysql的数据源:
再创建一个hive的数据源:
任务流定义如下:
第一个是shell脚本配置仅需要写入脚本:
脚本如下:
sqoop import \
--driver com.mysql.cj.jdbc.Driver \
--connect 'jdbc:mysql://shucang:3306/jrxd?characterEncoding=UTF-8&zeroDateTimeBehavior=CONVERT_TO_NULL' \
--username root \
--password 123456 \
--query 'select * from users where DATE_FORMAT(updated_at,"%Y-%m-%d") = "${dt}" and $CONDITIONS' \
--target-dir /tmp/users \
--delete-target-dir \
--num-mappers 1 \
--hive-import \
--hive-database finance \
--hive-table ods_users \
--null-non-string '\\N' \
--null-string '\\N'
这是抽取mysql的表增量到hive,
那么接下来是另一个表的增量,但是使用sqoop配置一下试试:
数据源是mysql,写入sql语句
select * from user_ocrlog where DATE_FORMAT(updated_at,'%Y-%m-%d') = '${dt}'
继续往下翻设置目标路径 :
第三个也是sqoop脚本:
sqoop import \
--driver com.mysql.cj.jdbc.Driver \
--connect 'jdbc:mysql://shucang:3306/jrxd?characterEncoding=UTF-8&zeroDateTimeBehavior=CONVERT_TO_NULL' \
--username root \
--password 123456 \
--query 'select * from user_md5 where DATE_FORMAT(updated_at,"%Y-%m-%d") = "${dt}" and $CONDITIONS' \
--target-dir /tmp/user_md5 \
--delete-target-dir \
--num-mappers 1 \
--hive-import \
--hive-database finance \
--hive-table ods_user_md5 \
--null-non-string '\\N' \
--null-string '\\N'
第四个使用sql写hivesql:
insert overwrite table dwd_fact_user_regiter_dtl
partition(partition_date)
select
from_unixtime(unix_timestamp(a.created_at),'yyyyMMdd') as dateid
,a.idty_type
,a.channel_id
,a.user_type
,a.manager_id
,a.sex
,a.id as user_id
,a.birthday
,a.name as user_name
,a.client_id
,a.message_auth
,a.idty_error_cnt
,a.idty_edit_cnt
,1 as regis_cnt
,b.ocr_cnt
,b.first_ocr_time
,b.last_ocr_time
,c.md5_app_cnt
,c.md5_val_cnt
,c.first_md5_time
,c.last_md5_time
,from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss') as etl_time
,from_unixtime(unix_timestamp(a.created_at),'yyyy-MM-dd') as partition_date
from ods_users a
left join (
select
user_id,
count(id) as ocr_cnt, -- orc认证的次数
min(created_at) as first_ocr_time,
max(created_at) as last_ocr_time
from ods_user_ocrlog
group by user_id
) b on a.id = b.user_id
left join (
select
user_id,
count(id) as md5_app_cnt, -- md5认证次数
sum((case when is_valid=1 then 1 else 0 end)) as md5_val_cnt, -- md5认证成功次数
min(created_at) as first_md5_time,
max(created_at) as last_md5_time
from ods_user_md5
group by user_id
) c on a.id=c.user_id
两条前置sql配置:
set hive.exec.dynamic.partition=true
set hive.exec.dynamic.partition.mode=nonstrict
第五个使用shell脚本写hive语句:
hive -e "
use finance;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dws_fact_regiter_sum partition(partition_date)
select
dateid
,idty_type
,channel_id
,user_type
,manager_id
,sex
,sum(regis_cnt) as regis_cnt
,sum(ocr_cnt) as ocr_cnt
,sum(md5_app_cnt) as md5_app_cnt
,sum(md5_val_cnt) as md5_val_cnt
,max(from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss')) as etl_time
,'${dt}' as partition_date
from dwd_fact_user_regiter_dtl
where partition_date='${dt}'
group by
dateid
,idty_type
,channel_id
,user_type
,manager_id
,sex;
"
第六个用sql:
、
insert into ads_register_cnt
select '${dt}' dt,sum(regis_cnt) regis_cnt,sum(ocr_cnt) ocr_cnt,sum(md5_val_cnt) md5_val_cnt
from dws_fact_regiter_sum
where partition_date='${dt}'
前置sql语句 :
create table if not exists ads_register_cnt(
dt string,
reg_cnt bigint,
ocr_cnt bigint,
md5_cnt bigint
)row format delimited
fields terminated by ','
配置完成后:
点击保存在弹出的窗口配置全局变量,
先上线再运行。
注意!修改的话要先下线再修改,否则不会对你的工作流生效,而是只对你的当次运行修改生效!
这样你点击运行后就可以依次运行增量了,
如果在ds过程中有以下错误:
在如下路径:
添加配置:
export SQOOP_HOME=/opt/installs/sqoop
export PATH=$SQOOP_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME1/bin:$SPARK_HOME2/bin:$PYTHON_HOME/bin:$JAVA_HOME/bin:$HIVE_HOME/bin:$FLINK_HOME/bin:$DATAX_HOME/bin:$SEATUNNEL_HOME/bin:$CHUNJUN_HOME/bin:$PATH
如果你的ds已经启动了,那么在修改后要重启才能生效。
八.datagear
执行如下sql把前面的指标全导入jrxd-bi数据库,然后用datagear将指标图表可视化:
先安装datagear,datagear安装包需要的话下载:
【免费】datagear-5.1.0.zip资源-CSDN文库
在虚拟机:
yum install -y unzip
上传解压: /opt/modules
解压:
unzip datagear-5.1.0.zip -d /opt/installs/
重命名:
mv datagear-5.1.0/ datagear
启动:
cd /opt/installs/datagear
chmod u+x startup.sh
chmod u+x shutdown.sh
启动 :
./startup.sh
访问:
打开浏览器,输入DataGear服务地址
http://[IP地址]:50401
例如:http://shucang:50401
进去之后先注册:
然后登陆后添加数据源:
随便在项目起个名字后在数据集:
添加上我们的的指标数据也就是select * 从指标表,再在图表实现可视化:
点点点即可没什么好说的,然后看板一放:
编辑如下,插入哪个表格点哪个地方即可!
结果:
我的背景是从官网下载的模板:
要哪个下载哪个就可以了,在看板一导入就能使用:
九.结束
到此,全项目完毕,希望对你有所帮助!