一、DataX Fundamentals
1、What is DataX?
DataX is the open-source version of Alibaba Cloud's commercial product DataWorks Data Integration. It is an offline data synchronization tool/platform (an ETL tool) for heterogeneous data sources. DataX provides efficient data synchronization between a wide range of heterogeneous sources, including MySQL, Oracle, OceanBase, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), Hologres, DRDS, and more.
Tips: "heterogeneous" means different types of applications or data sources, for example different database engines such as MySQL, Oracle, DB2, or MongoDB.
Tips: Sqoop and DataX are the tools most commonly used for offline data synchronization.
Tips: ETL (Extract-Transform-Load) describes the process of extracting data from source systems, transforming it, and loading it into a target. The goal is to consolidate an enterprise's scattered, messy, inconsistently formatted data so it can support decision making. ETL is most often used for data warehouses (DW), but is not limited to them.
2、DataX's Design Philosophy
To solve the problem of synchronizing heterogeneous data sources, DataX replaces the complex mesh of point-to-point sync links with a star-shaped topology, in which DataX sits in the middle as the transport hub connecting all data sources. When a new data source needs to be integrated, it only has to be connected to DataX, and it can then synchronize seamlessly with every existing source. Put simply, DataX acts like a middleman, brokering data between all of its clients.
3、DataX's Framework Design
Description: DataX itself is an offline data synchronization framework. It defines an offline (batch) data channel by its source and target data sets, provides abstract data-extraction plugins (Readers) and data-writing plugins (Writers), and on top of this framework defines a simplified intermediate data exchange format, which makes it possible to transfer data between arbitrary structured and semi-structured data sources.
The DataX architecture follows a flow similar to source (data origin) -> channel (buffering/transfer channel) -> sink (data destination).
Reader: the data collection module. It reads data from the source and hands it to the Framework.
Writer: the data writing module. It continuously pulls data from the Framework and writes it to the target.
Framework: connects the Reader and the Writer, acts as the data transfer channel between them, and handles the core concerns of buffering, flow control, concurrency, and data conversion.
Reader, Writer, and Transformer
DataX can move data between any pair of data sources as long as a Reader/Writer plugin has been implemented for each side; the official distribution already ships plugins for the mainstream sources such as MySQL, Oracle, and SQL Server.
4、DataX's Scheduling Flow
(1) Channel count (concurrency)
The concurrency is simply the number of channels required (needChannelNumber).
- By table count: if each channel is configured to handle 1 table, then processing 100 tables at the same time requires 100 channels.
- By record count: if each channel is configured to handle 1,000 records, then processing 10,000 records at the same time requires 10 channels.
(2) Task splitting
Split the Job into multiple Tasks: the Job is divided into Tasks according to the required number of channels.
- If the user configured a concrete list of tables, the split is done with the Table as the smallest unit (one Table maps to one Task), and the corresponding querySql is generated;
- If the user also configured splitPk, the split goes further: each Table is divided into splitPk ranges, and a querySql is generated for each range.
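As a rough illustration of the splitPk-based splitting just described (a sketch, not DataX's actual implementation), the snippet below cuts a table's integer key range into evenly sized slices and builds one query per slice; the table and column names are invented for the example.
def split_by_pk(table, columns, split_pk, pk_min, pk_max, slices):
    # Illustration only: cut [pk_min, pk_max] into `slices` ranges and
    # build one querySql per range, in the spirit of splitPk-based splitting.
    step = max(1, (pk_max - pk_min + 1) // slices)
    queries = []
    lower = pk_min
    while lower <= pk_max:
        upper = min(lower + step - 1, pk_max)
        queries.append(
            "select %s from %s where %s >= %d and %s <= %d"
            % (", ".join(columns), table, split_pk, lower, split_pk, upper)
        )
        lower = upper + 1
    return queries

# Hypothetical example: a table with keys 1..100 split into 4 slices.
for q in split_by_pk("business_info", ["bus_id", "bus_name"], "bus_id", 1, 100, 4):
    print(q)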
(3) Task allocation
1) Configure the channel count (channelNumber).
2) Set the number of channels assigned to each TaskGroup (by default, 5 channels per TaskGroup).
3) Compute the TaskGroup count = channel count / channels per TaskGroup.
4) Distribute the Tasks evenly across the TaskGroups.
(4) Scheduling example
For example:
Suppose a synchronization job is built to copy 100 MySQL tables into a Hive database, with the concurrency set to 20 (i.e. channelNumber = 20).
- Counting by tables, 100 channels would be needed;
- The Job is therefore split into 100 Tasks;
- DataX assigns 5 channels to each TaskGroup by default, so taskGroupNumber = channelNumber / channelsPerTaskGroup = 20 / 5 = 4;
- Following DataX's fair allocation strategy, the 100 Tasks are distributed evenly across the TaskGroups, so each TaskGroup handles taskNumber / taskGroupNumber = 100 / 4 = 25 Tasks.
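A minimal sketch of the arithmetic above, assuming the default of 5 channels per TaskGroup (the function name is ours, not part of DataX):
def plan_schedule(task_count, channel_number, channels_per_task_group=5):
    # Reproduce the back-of-the-envelope scheduling math from the example:
    # how many TaskGroups are created and how many Tasks each one handles.
    task_group_number = channel_number // channels_per_task_group
    tasks_per_group = task_count // task_group_number
    return task_group_number, tasks_per_group

# 100 tables with channelNumber=20 -> (4, 25): 4 TaskGroups, 25 Tasks each
print(plan_schedule(task_count=100, channel_number=20))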
(5) Detailed diagram
二、Deploying DataX on the Server
1、Extract the installation package into the app directory
[peizk@hadoop app]$ tar -zxvf datax.tar.gz -C ~/app/
2、Configure environment variables
[root@hadoop ~]# vim /etc/profile
Append the following:
#DATAX_HOME
export DATAX_HOME=/home/peizk/app/datax
export PATH=$PATH:$DATAX_HOME/bin
Don't forget to source /etc/profile!!!
3、Run the official sample job
In the job directory:
[peizk@hadoop job]$ datax.py job.json
If the job runs and prints its summary statistics, the configuration was successful!!
4、What each directory is for
├── bin # executable Python scripts
├── conf # DataX configuration files
├── job # offline synchronization jobs
├── lib # dependency libraries
├── log # job execution logs
├── log_perf
├── plugin # reader/writer plugins for the various data sources
│ ├── reader
│ └── writer
├── script # scripts
└── tmp # temporary files
三、A Quick Test: Transferring a MySQL Table to HDFS
1、Get the MySQL-to-HDFS job template
[peizk@hadoop job]$ datax.py -r mysqlreader -w hdfswriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
Please refer to the mysqlreader document:
https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md
Please refer to the hdfswriter document:
https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": [],
"connection": [
{
"jdbcUrl": [],
"table": []
}
],
"password": "",
"username": "",
"where": ""
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"column": [],
"compress": "",
"defaultFS": "",
"fieldDelimiter": "",
"fileName": "",
"fileType": "",
"path": "",
"writeMode": ""
}
}
}
],
"setting": {
"speed": {
"channel": ""
}
}
}
}
2、Template walkthrough
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader", -- name: the reader's name
"parameter": {
"column": [], -- columns to read
"connection": [
{
"jdbcUrl": [], -- JDBC connection info for the database
"table": [] -- table name(s)
【"querySql": []】 -- an arbitrary SQL statement is also supported; once configured, mysqlreader ignores table, column and where
}
],
"password": "", -- password
"username": "", -- username
"where": "" -- filter condition
【"splitPk": ""】 -- column used to split the data, usually the primary key; only integer types are supported
}
},
"writer": {
"name": "hdfswriter", -- name: the writer's name
"parameter": {
"column": [], -- columns to write; each entry's name gives the field name and type gives its type
"compress": "", -- compression codec for the HDFS file; empty (no compression) by default
"defaultFS": "", -- NameNode address
"fieldDelimiter": "", -- field delimiter
"fileName": "", -- name of the file to write
"fileType": "", -- file type; currently only "text" and "orc" are supported
"path": "", -- target path on HDFS
"writeMode": ""
-- how hdfswriter handles existing data before writing:
-- (1) append: no pre-processing; hdfswriter writes directly using fileName, which can produce duplicate data
-- (2) nonConflict: if the directory already contains files with the fileName prefix, fail with an error.
}
}
}
],
"setting": {
"speed": {
"channel": "" -- concurrency (limit it sensibly according to your CPU)
-- With channel=2 and streamwriter, the content is printed twice.
-- With other writers such as hdfswriter or mysqlwriter, it merely starts two concurrent tasks; the content is written once, not duplicated.
}
}
}
}
四、DataX's Two Configuration Modes and Parameter Passing
1、Column mode
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": [
"bus_id",
"bus_name",
"telephone",
"pic_url",
"bus_user_name",
"province",
"city",
"dist",
"addr",
"zipcode",
"is_open",
"open_time",
"closed_time"
],
"connection": [
{
"jdbcUrl": [
"jdbc:mysql://hadoop04:3306/prod"
],
"table": [
"business_info"
]
}
],
"password": "shumeng_123",
"username": "shumeng",
"where": "1=1"
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"column": [
{
"name":"bus_id",
"type":"string"
},
{
"name":"bus_name",
"type":"string"
},
{
"name":"telephone",
"type":"string"
},
{
"name":"pic_url",
"type":"string"
},
{
"name":"bus_user_name",
"type":"string"
},
{
"name":"province",
"type":"string"
},
{
"name":"city",
"type":"string"
},
{
"name":"dist",
"type":"string"
},
{
"name":"addr",
"type":"string"
},
{
"name":"zipcode",
"type":"string"
},
{
"name":"is_open",
"type":"string"
},
{
"name":"open_time",
"type":"string"
},
{
"name":"closed_time",
"type":"string"
}
],
"compress": "snappy",
"defaultFS": "hdfs://hadoop04:8020",
"fieldDelimiter": ",",
"fileName": "ods_business_info_full.txt",
"fileType": "orc",
"path": "/user/hive/warehouse/ods.db/ods_business_info_full/pt=2022-01-01",
"writeMode": "append"
}
}
}
],
"setting": {
"speed": {
"channel": "1"
}
}
}
}
2、SQL mode
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"connection": [
{
"jdbcUrl": [
"jdbc:mysql://hadoop04:3306/prod"
],
"querySql":[
"select bus_id,bus_name,telephone,pic_url,bus_user_name,province,city,dist,addr,zipcode,is_open,open_time,closed_time from prod.business_info "
]
}
],
"password": "shumeng_123",
"username": "shumeng",
"where": "1=1"
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"column": [
{
"name":"bus_id",
"type":"string"
},
{
"name":"bus_name",
"type":"string"
},
{
"name":"telephone",
"type":"string"
},
{
"name":"pic_url",
"type":"string"
},
{
"name":"bus_user_name",
"type":"string"
},
{
"name":"province",
"type":"string"
},
{
"name":"city",
"type":"string"
},
{
"name":"dist",
"type":"string"
},
{
"name":"addr",
"type":"string"
},
{
"name":"zipcode",
"type":"string"
},
{
"name":"is_open",
"type":"string"
},
{
"name":"open_time",
"type":"string"
},
{
"name":"closed_time",
"type":"string"
}
],
"compress": "snappy",
"defaultFS": "hdfs://hadoop04:8020",
"fieldDelimiter": ",",
"fileName": "ods_business_info_full_sql.txt",
"fileType": "orc",
"path": "/user/hive/warehouse/ods.db/ods_business_info_full/pt=2022-01-01",
"writeMode": "append"
}
}
}
],
"setting": {
"speed": {
"channel": "1"
}
}
}
}
3、Note: DataX's HDFS writer stores null values as the empty string '', while Hive's default null representation is \N. To make Hive treat the empty string as null when reading, change the table's null serialization format:
alter table ods_business_info_full set serdeproperties('serialization.null.format' = '');
4、Passing Parameters into DataX
1、Why
Goal: each day's synchronized data should be loaded into its own partition, so the target path (the HDFS writer's path parameter) has to change dynamically.
Method: reference a parameter in the JSON configuration file as ${param}, and supply its value when submitting the job with -p "-Dparam=value".
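Conceptually, the -p "-Dname=value" option substitutes every ${name} placeholder in the job JSON before the job runs. A rough sketch of that substitution (ours, not DataX's actual code):
import re

def substitute_params(json_text, params):
    # Replace each ${name} placeholder with the value supplied on the command line;
    # unknown placeholders are left untouched.
    return re.sub(r"\$\{(\w+)\}", lambda m: params.get(m.group(1), m.group(0)), json_text)

print(substitute_params('{"path": "/ods/pt=${start_date}"}', {"start_date": "2020-01-01"}))
# {"path": "/ods/pt=2020-01-01"}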
2、Example
(1) The modified JSON
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"connection": [
{
"jdbcUrl": [
"jdbc:mysql://hadoop04:3306/prod"
],
"querySql":[
"select bus_id,bus_name,telephone,pic_url,bus_user_name,province,city,dist,addr,zipcode,is_open,open_time,closed_time from prod.business_info where open_time>='${start_date}'"
]
}
],
"password": "shumeng_123",
"username": "shumeng",
"where": "1=1"
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"column": [
{
"name":"bus_id",
"type":"string"
},
{
"name":"bus_name",
"type":"string"
},
{
"name":"telephone",
"type":"string"
},
{
"name":"pic_url",
"type":"string"
},
{
"name":"bus_user_name",
"type":"string"
},
{
"name":"province",
"type":"string"
},
{
"name":"city",
"type":"string"
},
{
"name":"dist",
"type":"string"
},
{
"name":"addr",
"type":"string"
},
{
"name":"zipcode",
"type":"string"
},
{
"name":"is_open",
"type":"string"
},
{
"name":"open_time",
"type":"string"
},
{
"name":"closed_time",
"type":"string"
}
],
"compress": "snappy",
"defaultFS": "hdfs://hadoop04:8020",
"fieldDelimiter": ",",
"fileName": "ods_business_info_full_sql",
"fileType": "orc",
"path": "/user/hive/warehouse/test.db/ods_business_info_full_sql/pt=2022-01-01",
"writeMode": "append"
}
}
}
],
"setting": {
"speed": {
"channel": "1"
}
}
}
}
(2) The command to run
datax.py ods_business_info_full_sql.json -p -Dstart_date=2020-01-01
五、Some Business Logic
1、Orders vs. order details
Order table --> order detail table
One order may contain several items; the line items can be looked up in the detail table by order id.
2、Dimension data
Dimension data is usually synchronized in full.
This follows from the nature of the data: it is relatively fixed and changes infrequently.
3、Order data
Orders are usually synchronized incrementally, because a time-based filter is generally required,
and order data is normally processed offline on a T+1 schedule.
4、Is extracting the order table by create_time alone enough?
Extracting the order table by create_time alone is problematic.
Pulling only the rows created on the current day misses changes.
For example: an order placed a few days ago is only paid today, so its update_time changes (an order has a lifecycle of roughly 7-15 days).
5、Tables like the payment table
A payment has a fixed window (it must be completed within, say, 30 minutes).
To extract the payment table incrementally, either:
(1) filter by create_time and update_time, or
(2) design the batch schedule around it, e.g. run the extraction after 00:30 every day, by which time all payments created the previous day have already completed.
6、Tables like the favorites table
The favorites table has no bounded lifecycle (something favorited a year ago may be un-favorited a year later; something favorited yesterday may be un-favorited today),
so extract it using both create_time and update_time.
六、Analysis for the Python Script and the Synchronization
1、What we need in order to write the script
(1) the MySQL columns
and the MySQL table name
(2) the Hive columns,
the Hive file name,
and the HDFS location of the Hive table
2、What we need to do
Analysis: (1) extract the common parameters
(2) extract the MySQL columns
(3) extract the Hive columns, which must correspond one-to-one with the MySQL columns
(4) generate the JSON
(5) use parameters to control which tables are generated
3、How do we get what we need?
For the MySQL columns and table names (a Python sketch of the same lookup follows the query):
select
TABLE_NAME
,COLUMN_NAME
,DATA_TYPE
from information_schema.`COLUMNS` c
where TABLE_SCHEMA ='project'
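A minimal Python sketch of the same lookup, assuming the MySQLdb driver and the connection settings used by the scripts later in this article (the host, user and password here are placeholders):
# coding=utf-8
import MySQLdb

def list_columns(database):
    # Return (table_name, column_name, data_type) for every table in `database`,
    # using the same information_schema query as above.
    conn = MySQLdb.connect(host="hadoop", port=3306, user="root", passwd="123456")
    cursor = conn.cursor()
    cursor.execute(
        "SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE "
        "FROM information_schema.COLUMNS WHERE TABLE_SCHEMA = %s "
        "ORDER BY TABLE_NAME, ORDINAL_POSITION",
        [database],
    )
    rows = cursor.fetchall()
    cursor.close()
    conn.close()
    return rows

for table, column, data_type in list_columns("project"):
    print("%s %s %s" % (table, column, data_type))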
4、The overall idea of the scripts: what are we trying to do?
Generate the job JSON files with a Python script,
then have DataX run the generated JSON files to import the data.
5、The full-load synchronization logic?
Delete everything, then insert everything.
6、The incremental synchronization logic
For orders:
select distinct substr(create_time,1,10) as ts
from
seller_info
where create_time >='2022-01-06' and create_time <'2022-01-07'
or update_time >='2022-01-06' and update_time <'2022-01-07'
A query like this can return dates spanning up to 6 different days, because besides create_time there is update_time: a row updated during this day may have a create_time from an earlier day.
!!! For some incremental tables, such as payments, it is enough to schedule the batch run after 00:30.
!!! For others, such as orders, both create_time and update_time must be used.
Question: for incremental loads, can we write directly into the ODS table (partitioned by day, with each partition holding that day's data)?
No!
Take the SQL above: the data extracted for the 6th can contain create_time values from 6 different days, so it cannot all be written into a single daily partition.
Solution:
(1) Create an stg layer and first land each day's incremental extract in an stg table. This table has no partitions (each run it holds only the data extracted that day).
So how do we load it into ODS?
(2) First, query the rows in the stg table; call them a.
Then, derive the affected partitions from the stg table's create_time values, and use those partitions as the WHERE condition to query the incremental ODS table across all of its partitions; call that b.
UNION ALL the two queries into a temporary table (the rows coming from the ODS table keep their original pt, while the stg rows get pt = create_time).
(3) Then apply a window function over the temporary table: partition by order id, order by update_time desc, and keep only rn = 1.
Insert the result into the incremental ODS table using dynamic partitioning: partition(pt).
Final result:
For rows that changed, only the newest version (with the refreshed update_time) is kept in the corresponding partition; the stale version disappears.
Rows that did not change are simply rewritten into their original partitions.
This completes the whole incremental synchronization logic.
A simple illustration in code:
Scenario: ods_a_inc is the incremental table and already has partitions for days 1, 2 and 3.
ods_a_stg is its stg table. Today is day 4, and the stg table holds rows whose create_time falls on days 1, 2, 3 and 4. The code below implements the incremental load.
Column a is the order id, i.e. the primary key.
create table tmp_ods_a_inc as
select
a
,b
,create_time
,update_time
,pt
from
ods_a_inc
where
pt in
(
select
create_time
from
ods_a_stg
group by
create_time
)
union all
select
a
,b
,create_time
,update_time
,create_time as pt
from
ods_a_stg
;
--enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table ods_a_inc partition(pt)
select
a
,b
,create_time
,update_time
,pt
from
(
select
a
,b
,create_time
,update_time
,pt
,row_number() over(partition by a order by update_time desc ) as rn
from tmp_ods_a_inc
) t
where rn = 1;
七、Project Script Implementation
1、A Python script that generates the JSON files for the full-load tables
(1) Create the file export_job_json_full.py
[peizk@hadoop job]$ vim export_job_json_full.py
(2) The code is as follows
# coding=utf-8
import json
import getopt
import os
import sys
import MySQLdb
#MySQL connection settings
mysql_host = "hadoop"
mysql_port = "3306"
mysql_user = "root"
mysql_passwd = "123456"
#HDFS NameNode connection settings
hdfs_nn_host = "hadoop"
hdfs_nn_port = "9000"
#target directory for the generated JSON files
output_path = "/home/peizk/app/datax/job/ods_job"
def mysql_connect():
return MySQLdb.connect(host=mysql_host, port=int(mysql_port), user=mysql_user, passwd=mysql_passwd)
def mysql_meta(database, table):
connection = mysql_connect()
cursor = connection.cursor()
sql = "SELECT COLUMN_NAME,DATA_TYPE from information_schema.COLUMNS WHERE TABLE_SCHEMA=%s AND TABLE_NAME=%s ORDER BY ORDINAL_POSITION"
cursor.execute(sql, [database, table])
fetchall = cursor.fetchall()
cursor.close()
connection.close()
return fetchall
def mysql_col(database, table):
meta = mysql_meta(database, table)
return map(lambda x: x[0], meta)
def hive_col(database, table):
def type_mapping(mysql_type):
mappings = {
"bigint": "bigint",
"int": "bigint",
"smallint": "bigint",
"tinyint": "bigint",
"decimal": "string",
"double": "double",
"float": "float",
"binary": "string",
"char": "string",
"varchar": "string",
"datetime": "string",
"time": "string",
"timestamp": "string",
"date": "string",
"text": "string"
}
return mappings[mysql_type]
meta = mysql_meta(database, table)
return map(lambda x: {"name": x[0], "type": type_mapping(x[1].lower())}, meta)
def generate_json(source_database, source_table):
job = {
"job": {
"setting": {
"speed": {
"channel": 1
}
},
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": mysql_user,
"password": mysql_passwd,
"column": mysql_col(source_database, source_table),
"splitPk": "",
"connection": [{
"table": [source_table],
"jdbcUrl": ["jdbc:mysql://" + mysql_host + ":" + mysql_port + "/" + source_database]
}]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://" + hdfs_nn_host + ":" + hdfs_nn_port,
"fileType": "orc",
"path": "${hdfs_path}",
"fileName": source_table,
"column": hive_col(source_database, source_table),
"writeMode": "append",
"fieldDelimiter": "\t",
"compress": "snappy"
}
}
}]
}
}
if not os.path.exists(output_path):
os.makedirs(output_path)
with open(os.path.join(output_path, ".".join([source_database, source_table+"_full", "json"])), "w") as f:
json.dump(job, f)
def main(args):
source_database = ""
source_table = ""
options, arguments = getopt.getopt(args, '-d:-t:', ['sourcedb=', 'sourcetbl='])
for opt_name, opt_value in options:
if opt_name in ('-d', '--sourcedb'):
source_database = opt_value
if opt_name in ('-t', '--sourcetbl'):
source_table = opt_value
generate_json(source_database, source_table)
if __name__ == '__main__':
main(sys.argv[1:])
(3) Use the script to generate the JSON for one full-load table
python export_job_json_full.py -dprod -tbusiness_info
(4) Run the generated full-load JSON with DataX
datax.py prod.business_info_full.json -p "-Dhdfs_path=/user/hive/warehouse/ods.db/ods_business_info_full/pt=2022-01-01"
2、A Python script that generates the JSON files for the incremental tables
(1) Create the file export_job_json_incr.py
[peizk@hadoop utils_code]$ vim export_job_json_incr.py
(2) The script code is as follows
# coding=utf-8
import json
import getopt
import os
import sys
import MySQLdb
#MySQL connection settings
mysql_host = "hadoop"
mysql_port = "3306"
mysql_user = "root"
mysql_passwd = "123456"
#HDFS NameNode connection settings
hdfs_nn_host = "hadoop"
hdfs_nn_port = "9000"
#target directory for the generated JSON files
output_path = "/home/peizk/app/datax/job/ods_job"
def get_connect():
return MySQLdb.connect(host=mysql_host, port=int(mysql_port), user=mysql_user, passwd=mysql_passwd)
def mysql_meta(database, table):
connection = get_connect()
cursor = connection.cursor()
sql = "SELECT COLUMN_NAME,DATA_TYPE from information_schema.COLUMNS WHERE TABLE_SCHEMA=%s AND TABLE_NAME=%s ORDER BY ORDINAL_POSITION"
cursor.execute(sql, [database, table])
fetchall = cursor.fetchall()
cursor.close()
connection.close()
return fetchall
def mysql_col(database, table):
return map(lambda x: x[0], mysql_meta(database, table))
def mysql_query(database, table):
lis = list(mysql_col(database, table))
str_columns = ','.join(lis)
sql_query = 'select ' + str_columns + ' from ' + table + ' where (${create_time}>=\'${start_date}\' and ${create_time}<\'${end_date}\') ${is_or} (${update_time}>=\'${start_date}\' and ${update_time}<\'${end_date}\') '
return sql_query
def hive_col(database, table):
def type_mapping(mysql_type):
mappings = {
"bigint": "bigint",
"int": "bigint",
"smallint": "bigint",
"tinyint": "bigint",
"decimal": "string",
"double": "double",
"float": "float",
"binary": "string",
"char": "string",
"varchar": "string",
"datetime": "string",
"time": "string",
"timestamp": "string",
"date": "string",
"text": "string"
}
return mappings[mysql_type]
meta = mysql_meta(database, table)
return map(lambda x: {"name": x[0], "type": type_mapping(x[1].lower())}, meta)
def generate_json(source_database, source_table):
job = {
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": mysql_user,
"password": mysql_passwd,
"splitPk": "",
"connection": [{
"jdbcUrl": ["jdbc:mysql://" + mysql_host + ":" + mysql_port + "/" + source_database],
"querySql":[mysql_query(source_database, source_table)]
}]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://" + hdfs_nn_host + ":" + hdfs_nn_port,
"fileType": "orc",
"path": "${hdfs_path}",
"fileName": source_table,
"column": hive_col(source_database, source_table),
"writeMode": "append",
"fieldDelimiter": "\t",
"compress": "snappy"
}
}
}],
"setting": {
"speed": {
"channel": 1
}
}
}
}
if not os.path.exists(output_path):
os.makedirs(output_path)
with open(os.path.join(output_path, ".".join([source_database, source_table+"_incr", "json"])), "w") as f:
json.dump(job, f)
def main(args):
source_database = ""
source_table = ""
options, arguments = getopt.getopt(args, '-d:-t:', ['sourcedb=', 'sourcetbl='])
for opt_name, opt_value in options:
if opt_name in ('-d', '--sourcedb'):
source_database = opt_value
if opt_name in ('-t', '--sourcetbl'):
source_table = opt_value
generate_json(source_database, source_table)
if __name__ == '__main__':
main(sys.argv[1:])
(3) Use the script to generate the JSON for one incremental table
python export_job_json_incr.py -dprod -tseller_info
(4) Run the generated incremental JSON with DataX
datax.py prod.seller_info_incr.json -p "-Dhdfs_path=/user/hive/warehouse/datax.db/ods_seller_info_incr/pt=2022-01-01 -Dcreate_time=create_time -Dupdate_time=update_time -Dstart_date=2022-01-01 -Dend_date=2022-01-02 -Dis_or='or'"
3、A shell script that drives generation of the JSON files for all tables (both full-load and incremental)
(1) Create the script export_all_job.sh
vim export_all_job.sh
(2) The script contents are
#!/bin/bash
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tpub_class
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tpub_code
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tpub_area
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tpub_brand
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tbusiness_info
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tsku_info
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tspu_info
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tsku_para
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tsku_para_info
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tactivity_sku
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tcoupon_info
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tactivity_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tcollect_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tpay_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -trefund_pay_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tremark_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tseller_detail
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tseller_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tseller_refund_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tseller_detail_coupon
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tseller_detail_activity
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tuser_info
(3) Running this script generates the JSON files for every table we need (full-load and incremental alike).
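If you prefer not to repeat one command line per table, the same calls can be driven from table lists. A small optional sketch (the lists below simply mirror export_all_job.sh; adjust the script paths to your environment):
# coding=utf-8
import subprocess

FULL_TABLES = ["pub_class", "pub_code", "pub_area", "pub_brand", "business_info",
               "sku_info", "spu_info", "sku_para", "sku_para_info",
               "activity_sku", "coupon_info", "activity_info"]
INCR_TABLES = ["collect_info", "pay_info", "refund_pay_info", "remark_info",
               "seller_detail", "seller_info", "seller_refund_info",
               "seller_detail_coupon", "seller_detail_activity", "user_info"]

# Generate the full-load and incremental job JSON files for every table.
for table in FULL_TABLES:
    subprocess.call(["python", "export_job_json_full.py", "-dproject", "-t" + table])
for table in INCR_TABLES:
    subprocess.call(["python", "export_job_json_incr.py", "-dproject", "-t" + table])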
4、A shell script that runs the DataX JSON jobs for the full-load tables
(1) Create the file import_ods_table_full.sh
vim import_ods_table_full.sh
(2) The script contents are
#!/bin/bash
# If a date is passed in, etl_date takes that value; otherwise default to yesterday's date
if [ -n "$2" ] ;then
etl_date=$2
else
etl_date=`date -d "-1 day" +%F`
fi
#DataX installation directory
DATAX_HOME=/home/peizk/app/datax
#directory holding the job JSON files
job_dir="/home/peizk/app/datax/job/ods_job"
#common HDFS warehouse directory
hdfs_dir="/user/hive/warehouse/project.db"
#Hive database name
database="project"
#Drop and recreate the target partition first, to avoid duplicate data in the same partition
alter_partition() {
echo "------------Start clearing partition ${etl_date} of table $1----------------"
hive_table_name=$1
hive -e "alter table ${database}.${hive_table_name} drop partition(pt='${etl_date}')"
hive -e "alter table ${database}.${hive_table_name} add partition(pt='${etl_date}')"
}
#data synchronization
import_full_table() {
datax_json=$1
hdfs_path=$2
hive_table_url=${hdfs_path%/*}
hive_table_name=${hive_table_url##*/}
alter_partition $hive_table_name
echo "------------Start full import into partition ${etl_date} of table ${hive_table_name}------------"
python $DATAX_HOME/bin/datax.py $datax_json -p"-Dhdfs_path=$hdfs_path"
}
case $1 in
"pub_class")
import_full_table ${job_dir}/project.pub_class_full.json ${hdfs_dir}/ods_pub_class_full/pt=$etl_date
;;
"pub_area")
import_full_table ${job_dir}/project.pub_area_full.json ${hdfs_dir}/ods_pub_area_full/pt=$etl_date
;;
"activity_info")
import_full_table ${job_dir}/project.activity_info_full.json ${hdfs_dir}/ods_activity_info_full/pt=$etl_date
;;
"activity_sku")
import_full_table ${job_dir}/project.activity_sku_full.json ${hdfs_dir}/ods_activity_sku_full/pt=$etl_date
;;
"coupon_info")
import_full_table ${job_dir}/project.coupon_info_full.json ${hdfs_dir}/ods_coupon_info_full/pt=$etl_date
;;
"pub_code")
import_full_table ${job_dir}/project.pub_code_full.json ${hdfs_dir}/ods_pub_code_full/pt=$etl_date
;;
"pub_brand")
import_full_table ${job_dir}/project.pub_brand_full.json ${hdfs_dir}/ods_pub_brand_full/pt=$etl_date
;;
"business_info")
import_full_table ${job_dir}/project.business_info_full.json ${hdfs_dir}/ods_business_info_full/pt=$etl_date
;;
"sku_info")
import_full_table ${job_dir}/project.sku_info_full.json ${hdfs_dir}/ods_sku_info_full/pt=$etl_date
;;
"spu_info")
import_full_table ${job_dir}/project.spu_info_full.json ${hdfs_dir}/ods_spu_info_full/pt=$etl_date
;;
"sku_para")
import_full_table ${job_dir}/project.sku_para_full.json ${hdfs_dir}/ods_sku_para_full/pt=$etl_date
;;
"sku_para_info")
import_full_table ${job_dir}/project.sku_para_info_full.json ${hdfs_dir}/ods_sku_para_info_full/pt=$etl_date
;;
"all")
import_full_table ${job_dir}/project.pub_class_full.json ${hdfs_dir}/ods_pub_class_full/pt=$etl_date
import_full_table ${job_dir}/project.pub_area_full.json ${hdfs_dir}/ods_pub_area_full/pt=$etl_date
import_full_table ${job_dir}/project.pub_code_full.json ${hdfs_dir}/ods_pub_code_full/pt=$etl_date
import_full_table ${job_dir}/project.pub_brand_full.json ${hdfs_dir}/ods_pub_brand_full/pt=$etl_date
import_full_table ${job_dir}/project.business_info_full.json ${hdfs_dir}/ods_business_info_full/pt=$etl_date
import_full_table ${job_dir}/project.sku_info_full.json ${hdfs_dir}/ods_sku_info_full/pt=$etl_date
import_full_table ${job_dir}/project.spu_info_full.json ${hdfs_dir}/ods_spu_info_full/pt=$etl_date
import_full_table ${job_dir}/project.sku_para_full.json ${hdfs_dir}/ods_sku_para_full/pt=$etl_date
import_full_table ${job_dir}/project.sku_para_info_full.json ${hdfs_dir}/ods_sku_para_info_full/pt=$etl_date
import_full_table ${job_dir}/project.activity_info_full.json ${hdfs_dir}/ods_activity_info_full/pt=$etl_date
import_full_table ${job_dir}/project.activity_sku_full.json ${hdfs_dir}/ods_activity_sku_full/pt=$etl_date
import_full_table ${job_dir}/project.coupon_info_full.json ${hdfs_dir}/ods_coupon_info_full/pt=$etl_date
esac
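Usage, inferred from the argument handling above (so treat the exact form as an assumption): sh import_ods_table_full.sh <table_name|all> [etl_date], for example sh import_ods_table_full.sh all 2022-01-01; when no date is given the script defaults to yesterday.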
5、A shell script that runs the DataX JSON jobs for the incremental tables
(1) Create the file import_ods_table_incr.sh
vim import_ods_table_incr.sh
(2) The script contents are
#!/bin/bash
# shell string slicing reference: https://www.cnblogs.com/onmyway20xx/p/4308113.html
# If dates are passed in, start_date/end_date take those values; otherwise default to yesterday and today
if [ -n "$3" ] ;then
start_date=$2
end_date=$3
else
start_date=`date -d "-1 day" +%F`
end_date=`date +%F`
fi
#DataX installation directory
DATAX_HOME=/home/peizk/app/datax
#directory holding the job JSON files
job_dir="/home/peizk/app/datax/job/ods_job"
#common HDFS warehouse directory
hdfs_dir="/user/hive/warehouse/project.db"
#Hive database name
database="project"
#Truncate the stg table first, so that it only ever holds the current day's extract
truncate_table () {
echo "------------Start truncating table $1----------------"
hive -e "truncate table ${database}.${1} "
}
#data synchronization
import_incr_table() {
datax_json=$1
hdfs_path=$2
is_or=$3
create_time=$4
update_time=$5
hive_table_url=${hdfs_path}
hive_table_name=${hive_table_url##*/}
truncate_table ${hive_table_name}
echo "------------Start incremental import into table ${hive_table_name}----------------"
python $DATAX_HOME/bin/datax.py $datax_json -p"-Dhdfs_path=$hdfs_path -Dcreate_time=$create_time -Dupdate_time=${update_time} -Dstart_date=${start_date} -Dend_date=${end_date} -Dis_or=${is_or}"
}
case $1 in
"collect_info")
import_incr_table ${job_dir}/project.collect_info_incr.json ${hdfs_dir}/ods_collect_info_incr_stg or create_time cancel_time
;;
"pay_info")
import_incr_table ${job_dir}/project.pay_info_incr.json ${hdfs_dir}/ods_pay_info_incr_stg -- pay_create_time
;;
"refund_pay_info")
import_incr_table ${job_dir}/project.refund_pay_info_incr.json ${hdfs_dir}/ods_refund_pay_info_incr_stg -- refund_create_time refund_update_time
;;
"remark_info")
import_incr_table ${job_dir}/project.remark_info_incr.json ${hdfs_dir}/ods_remark_info_incr_stg -- create_time update_time
;;
"seller_detail")
import_incr_table ${job_dir}/project.seller_detail_incr.json ${hdfs_dir}/ods_seller_detail_incr_stg or create_time update_time
;;
"seller_info")
import_incr_table ${job_dir}/project.seller_info_incr.json ${hdfs_dir}/ods_seller_info_incr_stg or create_time update_time
;;
"seller_refund_info")
import_incr_table ${job_dir}/project.seller_refund_info_incr.json ${hdfs_dir}/ods_seller_refund_info_incr_stg -- create_time update_time
;;
"seller_detail_coupon")
import_incr_table ${job_dir}/project.seller_detail_coupon_incr.json ${hdfs_dir}/ods_seller_detail_coupon_incr_stg -- create_time update_time
;;
"seller_detail_activity")
import_incr_table ${job_dir}/project.seller_detail_activity_incr.json ${hdfs_dir}/ods_seller_detail_activity_incr_stg -- create_time update_time
;;
"user_info")
import_incr_table ${job_dir}/project.user_info_incr.json ${hdfs_dir}/ods_user_info_incr_stg or create_time update_time
;;
"all")
import_incr_table ${job_dir}/project.collect_info_incr.json ${hdfs_dir}/ods_collect_info_incr_stg or create_time cancel_time
import_incr_table ${job_dir}/project.pay_info_incr.json ${hdfs_dir}/ods_pay_info_incr_stg -- pay_create_time
import_incr_table ${job_dir}/project.refund_pay_info_incr.json ${hdfs_dir}/ods_refund_pay_info_incr_stg -- refund_create_time refund_update_time
import_incr_table ${job_dir}/project.remark_info_incr.json ${hdfs_dir}/ods_remark_info_incr_stg -- create_time update_time
import_incr_table ${job_dir}/project.seller_detail_incr.json ${hdfs_dir}/ods_seller_detail_incr_stg or create_time update_time
import_incr_table ${job_dir}/project.seller_info_incr.json ${hdfs_dir}/ods_seller_info_incr_stg or create_time update_time
import_incr_table ${job_dir}/project.seller_refund_info_incr.json ${hdfs_dir}/ods_seller_refund_info_incr_stg -- create_time update_time
import_incr_table ${job_dir}/project.seller_detail_coupon_incr.json ${hdfs_dir}/ods_seller_detail_coupon_incr_stg -- create_time update_time
import_incr_table ${job_dir}/project.seller_detail_activity_incr.json ${hdfs_dir}/ods_seller_detail_activity_incr_stg -- create_time update_time
import_incr_table ${job_dir}/project.user_info_incr.json ${hdfs_dir}/ods_user_info_incr_stg or create_time update_time
esac
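Usage, again inferred from the argument handling (treat the exact form as an assumption): sh import_ods_table_incr.sh <table_name|all> [start_date end_date], for example sh import_ods_table_incr.sh seller_info 2022-01-01 2022-01-02; with no dates it defaults to yesterday and today.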
(3) The SQL script that loads the incremental data from the stg tables into the ODS tables
--enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- 1 ods_collect_info_incr_stg
insert overwrite table ods_collect_info_incr partition(pt)
select
id, user_id, sku_id, is_cancel, create_time, cancel_time, pt
from
(
SELECT id, user_id, sku_id, is_cancel, create_time, cancel_time, pt,
row_number() over(partition by id order by cancel_time desc ) as rn
FROM
(
SELECT id, user_id, sku_id, is_cancel, create_time, cancel_time, pt
FROM ods_collect_info_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
FROM ods_collect_info_incr_stg
group by
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
)
union all
SELECT id, user_id, sku_id, is_cancel, create_time, cancel_time,
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
FROM ods_collect_info_incr_stg
) a
) a
where rn =1 ;
--2 ods_pay_info_incr_stg
insert overwrite table ods_pay_info_incr partition(pt)
SELECT pay_id, out_trade_no, seller_id, user_id, pay_type, trade_no, total_amount, subject, pay_status, pay_create_time, pay_update_time, pay_content, pt
from
(
SELECT pay_id, out_trade_no, seller_id, user_id, pay_type, trade_no, total_amount, subject, pay_status, pay_create_time, pay_update_time, pay_content, pt,
row_number() over(partition by pay_id order by pay_update_time desc ) as rn
from
(
SELECT pay_id, out_trade_no, seller_id, user_id, pay_type, trade_no, total_amount, subject, pay_status, pay_create_time, pay_update_time, pay_content, pt
FROM ods_pay_info_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(pay_create_time),'yyyy-MM-dd')
FROM ods_pay_info_incr_stg
group by from_unixtime(unix_timestamp(pay_create_time),'yyyy-MM-dd')
)
union all
SELECT pay_id, out_trade_no, seller_id, user_id, pay_type, trade_no, total_amount, subject, pay_status, pay_create_time, pay_update_time, pay_content,
from_unixtime(unix_timestamp(pay_create_time),'yyyy-MM-dd') as pt
FROM ods_pay_info_incr_stg
) a
) a
where rn =1 ;
--3 ods_refund_pay_info_incr_stg
insert overwrite table ods_refund_pay_info_incr partition(pt)
SELECT refund_pay_id, out_trade_no, seller_id, sku_id, pay_type, trade_no, total_amount, refund_status, refund_create_time, refund_update_time, refund_content, pt
from
(
SELECT refund_pay_id, out_trade_no, seller_id, sku_id, pay_type, trade_no, total_amount, refund_status, refund_create_time, refund_update_time, refund_content, pt,
row_number() over(partition by refund_pay_id order by refund_update_time desc ) as rn
from
(
SELECT refund_pay_id, out_trade_no, seller_id, sku_id, pay_type, trade_no, total_amount, refund_status, refund_create_time, refund_update_time, refund_content, pt
FROM ods_refund_pay_info_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(refund_create_time),'yyyy-MM-dd')
FROM ods_refund_pay_info_incr_stg
group by from_unixtime(unix_timestamp(refund_create_time),'yyyy-MM-dd')
)
union all
SELECT refund_pay_id, out_trade_no, seller_id, sku_id, pay_type, trade_no, total_amount, refund_status, refund_create_time, refund_update_time, refund_content,
from_unixtime(unix_timestamp(refund_create_time),'yyyy-MM-dd') as pt
FROM ods_refund_pay_info_incr_stg
) a
) a
where rn =1 ;
--4 ods_remark_info_incr_stg
insert overwrite table ods_remark_info_incr partition(pt)
SELECT id, user_id, head_img, sku_id, spu_id, seller_id, appraise, remark_string, create_time, update_time, pt
from
(
SELECT id, user_id, head_img, sku_id, spu_id, seller_id, appraise, remark_string, create_time, update_time, pt,
row_number() over(partition by id order by update_time desc ) as rn
from
(
SELECT id, user_id, head_img, sku_id, spu_id, seller_id, appraise, remark_string, create_time, update_time, pt
FROM ods_remark_info_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
FROM ods_remark_info_incr_stg
group by
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
)
union all
SELECT id, user_id, head_img, sku_id, spu_id, seller_id, appraise, remark_string, create_time, update_time,
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
FROM ods_remark_info_incr_stg
) a
) a
where rn =1 ;
--5 ods_seller_detail_activity_incr_stg
insert overwrite table ods_seller_detail_activity_incr partition(pt)
SELECT id, seller_id, seller_detail_id, activity_id, activity_rule_id, sku_id, create_time, update_time, pt
from
(
SELECT id, seller_id, seller_detail_id, activity_id, activity_rule_id, sku_id, create_time, update_time, pt,
row_number() over(partition by id order by update_time desc ) as rn
from
(
SELECT id, seller_id, seller_detail_id, activity_id, activity_rule_id, sku_id, create_time, update_time, pt
FROM ods_seller_detail_activity_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
FROM ods_seller_detail_activity_incr_stg
group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
)
union all
SELECT id, seller_id, seller_detail_id, activity_id, activity_rule_id, sku_id, create_time, update_time,
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
FROM ods_seller_detail_activity_incr_stg
) a
) a
where rn =1 ;
--6 ods_seller_detail_coupon_incr_stg
insert overwrite table ods_seller_detail_coupon_incr partition(pt)
SELECT id, seller_id, seller_detail_id, coupon_id, coupon_use_id, sku_id, create_time, update_time, pt
from
(
SELECT id, seller_id, seller_detail_id, coupon_id, coupon_use_id, sku_id, create_time, update_time, pt,
row_number() over(partition by id order by update_time desc ) as rn
from
(
SELECT id, seller_id, seller_detail_id, coupon_id, coupon_use_id, sku_id, create_time, update_time, pt
FROM ods_seller_detail_coupon_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
FROM ods_seller_detail_coupon_incr_stg
group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
)
union all
SELECT id, seller_id, seller_detail_id, coupon_id, coupon_use_id, sku_id, create_time, update_time,
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
FROM ods_seller_detail_coupon_incr_stg
) a
) a
where rn =1 ;
-- 7 ods_seller_detail_incr_stg
insert overwrite table ods_seller_detail_incr partition(pt)
SELECT seller_detail_id, seller_id, seller_status, sku_id, sku_name, bus_id, seller_price, sku_num, create_time, update_time, source_type, source_id, total_amount, split_activity_amount, split_coupon_amount, pt
from
(
SELECT seller_detail_id, seller_id, seller_status, sku_id, sku_name, bus_id, seller_price, sku_num, create_time, update_time, source_type, source_id, total_amount, split_activity_amount, split_coupon_amount, pt,
row_number() over(partition by seller_detail_id order by update_time desc ) as rn
from
(
SELECT seller_detail_id, seller_id, seller_status, sku_id, sku_name, bus_id, seller_price, sku_num, create_time, update_time, source_type, source_id, total_amount, split_activity_amount, split_coupon_amount, pt
FROM ods_seller_detail_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
FROM ods_seller_detail_incr_stg
group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
)
union all
SELECT seller_detail_id, seller_id, seller_status, sku_id, sku_name, bus_id, seller_price, sku_num, create_time, update_time, source_type, source_id, total_amount, split_activity_amount, split_coupon_amount,
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
FROM ods_seller_detail_incr_stg
) a
) a
where rn =1 ;
--8 ods_seller_info_incr_stg
insert overwrite table ods_seller_info_incr partition(pt)
SELECT seller_id, user_id, seller_status, buyer_name, buyer_mobile, buyer_province_id, buyer_address_province, buyer_address_city, buyer_address_area, buyer_address_detail, total_amount, activity_reduce_amount, coupon_reduce_amount, original_total_amount, payment_way, seller_comment, out_trade_no, trade_body, process_status, tracking_no, feight_fee, create_time, update_time, pt
from
(
SELECT seller_id, user_id, seller_status, buyer_name, buyer_mobile, buyer_province_id, buyer_address_province, buyer_address_city, buyer_address_area, buyer_address_detail, total_amount, activity_reduce_amount, coupon_reduce_amount, original_total_amount, payment_way, seller_comment, out_trade_no, trade_body, process_status, tracking_no, feight_fee, create_time, update_time, pt,
row_number() over(partition by seller_id order by update_time desc ) as rn
from
(
SELECT seller_id, user_id, seller_status, buyer_name, buyer_mobile, buyer_province_id, buyer_address_province, buyer_address_city, buyer_address_area, buyer_address_detail, total_amount, activity_reduce_amount, coupon_reduce_amount, original_total_amount, payment_way, seller_comment, out_trade_no, trade_body, process_status, tracking_no, feight_fee, create_time, update_time, pt
FROM ods_seller_info_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
FROM ods_seller_info_incr_stg
group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
)
union all
SELECT seller_id, user_id, seller_status, buyer_name, buyer_mobile, buyer_province_id, buyer_address_province, buyer_address_city, buyer_address_area, buyer_address_detail, total_amount, activity_reduce_amount, coupon_reduce_amount, original_total_amount, payment_way, seller_comment, out_trade_no, trade_body, process_status, tracking_no, feight_fee, create_time, update_time,
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
FROM ods_seller_info_incr_stg
) a
) a
where rn =1 ;
--9 ods_seller_refund_info_incr_stg
insert overwrite table ods_seller_refund_info_incr partition(pt)
SELECT seller_refund_id, user_id, seller_id, sku_id, refund_type, refund_num, refund_amount, refund_reason_type, refund_reason_txt, refund_status, create_time, update_time, pt
from
(
SELECT seller_refund_id, user_id, seller_id, sku_id, refund_type, refund_num, refund_amount, refund_reason_type, refund_reason_txt, refund_status, create_time, update_time, pt,
row_number() over(partition by seller_refund_id order by update_time desc ) as rn
from
(
SELECT seller_refund_id, user_id, seller_id, sku_id, refund_type, refund_num, refund_amount, refund_reason_type, refund_reason_txt, refund_status, create_time, update_time, pt
FROM ods_seller_refund_info_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
FROM ods_seller_refund_info_incr_stg
group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
)
union all
SELECT seller_refund_id, user_id, seller_id, sku_id, refund_type, refund_num, refund_amount, refund_reason_type, refund_reason_txt, refund_status, create_time, update_time,
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
FROM ods_seller_refund_info_incr_stg
) a
) a
where rn =1 ;
--10 ods_user_info_incr_stg
insert overwrite table ods_user_info_incr partition(pt)
SELECT user_id, login_name, nick_name, passwd, user_name, phone_num, email, head_img, user_level, birthday, gender, create_time, update_time, `status`, pt
from
(
SELECT user_id, login_name, nick_name, passwd, user_name, phone_num, email, head_img, user_level, birthday, gender, create_time, update_time, `status`, pt,
row_number() over(partition by user_id order by update_time desc ) as rn
from
(
SELECT user_id, login_name, nick_name, passwd, user_name, phone_num, email, head_img, user_level, birthday, gender, create_time, update_time, `status`, pt
FROM ods_user_info_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
FROM ods_user_info_incr_stg
group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
)
union all
SELECT user_id, login_name, nick_name, passwd, user_name, phone_num, email, head_img, user_level, birthday, gender, create_time, update_time, `status`,
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
FROM ods_user_info_incr_stg
) a
) a
where rn =1 ;