一、DataX Fundamentals
1、What is DataX?
DataX is the open-source version of Alibaba Cloud's commercial product DataWorks Data Integration. It is an offline data synchronization tool/platform (an ETL tool) for heterogeneous data sources. DataX provides efficient data synchronization between a wide range of heterogeneous sources, including MySQL, Oracle, OceanBase, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), Hologres, DRDS, and more.
Tips: "heterogeneous" means different types of applications or data sources, for example different database engines such as MySQL, Oracle, DB2, or MongoDB.
Tips: Sqoop and DataX are the tools most commonly used for offline data synchronization.
Tips: ETL (Extract-Transform-Load) describes the process of extracting data from source systems, transforming it, and loading it into a target. The goal is to consolidate an enterprise's scattered, messy, inconsistently formatted data so it can support decision making. ETL is most often used for data warehouses (DW), but is not limited to them.
2、DataX's Design Philosophy
To solve the problem of synchronizing heterogeneous data sources, DataX replaces the complex mesh of point-to-point sync links with a star-shaped topology, in which DataX sits in the middle as the transport hub connecting all data sources. When a new data source needs to be integrated, it only has to be connected to DataX, and it can then synchronize seamlessly with every existing source. Put simply, DataX acts like a middleman, brokering data between all of its clients.
3、DataX's Framework Design
Description: DataX itself is an offline data synchronization framework. It defines an offline (batch) data channel by its source and target data sets, provides abstract data-extraction plugins (Readers) and data-writing plugins (Writers), and on top of this framework defines a simplified intermediate data exchange format, which makes it possible to transfer data between arbitrary structured and semi-structured data sources.
The DataX architecture follows a flow similar to source (data origin) -> channel (buffering/transfer channel) -> sink (data destination).
Reader: the data collection module. It reads data from the source and hands it to the Framework.
Writer: the data writing module. It continuously pulls data from the Framework and writes it to the target.
Framework: connects the Reader and the Writer, acts as the data transfer channel between them, and handles the core concerns of buffering, flow control, concurrency, and data conversion.
Reader, Writer, and Transformer
DataX can move data between any pair of data sources as long as a Reader/Writer plugin has been implemented for each side; the official distribution already ships plugins for the mainstream sources such as MySQL, Oracle, and SQL Server.
4、DataX's Scheduling Flow
(1) Channel count (concurrency)
The concurrency is simply the number of channels required (needChannelNumber).
- By table count: if each channel is configured to handle 1 table, then processing 100 tables at the same time requires 100 channels.
- By record count: if each channel is configured to handle 1,000 records, then processing 10,000 records at the same time requires 10 channels.
(2) Task splitting
Split the Job into multiple Tasks: the Job is divided into Tasks according to the required number of channels.
- If the user configured a concrete list of tables, the split is done with the Table as the smallest unit (one Table maps to one Task), and the corresponding querySql is generated;
- If the user also configured splitPk, the split goes further: each Table is divided into splitPk ranges, and a querySql is generated for each range.
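As a rough illustration of the splitPk-based splitting just described (a sketch, not DataX's actual implementation), the snippet below cuts a table's integer key range into evenly sized slices and builds one query per slice; the table and column names are invented for the example.
def split_by_pk(table, columns, split_pk, pk_min, pk_max, slices):
    # Illustration only: cut [pk_min, pk_max] into `slices` ranges and
    # build one querySql per range, in the spirit of splitPk-based splitting.
    step = max(1, (pk_max - pk_min + 1) // slices)
    queries = []
    lower = pk_min
    while lower <= pk_max:
        upper = min(lower + step - 1, pk_max)
        queries.append(
            "select %s from %s where %s >= %d and %s <= %d"
            % (", ".join(columns), table, split_pk, lower, split_pk, upper)
        )
        lower = upper + 1
    return queries

# Hypothetical example: a table with keys 1..100 split into 4 slices.
for q in split_by_pk("business_info", ["bus_id", "bus_name"], "bus_id", 1, 100, 4):
    print(q)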
(3) Task allocation
1) Configure the channel count (channelNumber).
2) Set the number of channels assigned to each TaskGroup (by default, 5 channels per TaskGroup).
3) Compute the TaskGroup count = channel count / channels per TaskGroup.
4) Distribute the Tasks evenly across the TaskGroups.
(4) Scheduling example
For example:
Suppose a synchronization job is built to copy 100 MySQL tables into a Hive database, with the concurrency set to 20 (i.e. channelNumber = 20).
- Counting by tables, 100 channels would be needed;
- The Job is therefore split into 100 Tasks;
- DataX assigns 5 channels to each TaskGroup by default, so taskGroupNumber = channelNumber / channelsPerTaskGroup = 20 / 5 = 4;
- Following DataX's fair allocation strategy, the 100 Tasks are distributed evenly across the TaskGroups, so each TaskGroup handles taskNumber / taskGroupNumber = 100 / 4 = 25 Tasks.
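A minimal sketch of the arithmetic above, assuming the default of 5 channels per TaskGroup (the function name is ours, not part of DataX):
def plan_schedule(task_count, channel_number, channels_per_task_group=5):
    # Reproduce the back-of-the-envelope scheduling math from the example:
    # how many TaskGroups are created and how many Tasks each one handles.
    task_group_number = channel_number // channels_per_task_group
    tasks_per_group = task_count // task_group_number
    return task_group_number, tasks_per_group

# 100 tables with channelNumber=20 -> (4, 25): 4 TaskGroups, 25 Tasks each
print(plan_schedule(task_count=100, channel_number=20))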
(5) Detailed diagram
二、Deploying DataX on the Server
1、Extract the installation package into the app directory
[peizk@hadoop app]$ tar -zxvf datax.tar.gz -C ~/app/
2、Configure environment variables
[root@hadoop ~]# vim /etc/profile
Append the following:
#DATAX_HOME
export DATAX_HOME=/home/peizk/app/datax
export PATH=$PATH:$DATAX_HOME/bin
Don't forget to source /etc/profile!!!
3、Run the official sample job
In the job directory:
[peizk@hadoop job]$ datax.py job.json
If the job runs and prints its summary statistics, the configuration was successful!!
4、What each directory is for
├── bin # executable Python scripts
├── conf # DataX configuration files
├── job # offline synchronization jobs
├── lib # dependency libraries
├── log # job execution logs
├── log_perf
├── plugin # reader/writer plugins for the various data sources
│ ├── reader
│ └── writer
├── script # scripts
└── tmp # temporary files
三、A Quick Test: Transferring a MySQL Table to HDFS
1、Get the MySQL-to-HDFS job template
[peizk@hadoop job]$ datax.py -r mysqlreader -w hdfswriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
Please refer to the mysqlreader document:
https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md
Please refer to the hdfswriter document:
https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": [],
"connection": [
{
"jdbcUrl": [],
"table": []
}
],
"password": "",
"username": "",
"where": ""
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"column": [],
"compress": "",
"defaultFS": "",
"fieldDelimiter": "",
"fileName": "",
"fileType": "",
"path": "",
"writeMode": ""
}
}
}
],
"setting": {
"speed": {
"channel": ""
}
}
}
}
2、Template walkthrough
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader", -- name: the reader's name
"parameter": {
"column": [], -- columns to read
"connection": [
{
"jdbcUrl": [], -- JDBC connection info for the database
"table": [] -- table name(s)
【"querySql": []】 -- an arbitrary SQL statement is also supported; once configured, mysqlreader ignores table, column and where
}
],
"password": "", -- password
"username": "", -- username
"where": "" -- filter condition
【"splitPk": ""】 -- column used to split the data, usually the primary key; only integer types are supported
}
},
"writer": {
"name": "hdfswriter", -- name: the writer's name
"parameter": {
"column": [], -- columns to write; each entry's name gives the field name and type gives its type
"compress": "", -- compression codec for the HDFS file; empty (no compression) by default
"defaultFS": "", -- NameNode address
"fieldDelimiter": "", -- field delimiter
"fileName": "", -- name of the file to write
"fileType": "", -- file type; currently only "text" and "orc" are supported
"path": "", -- target path on HDFS
"writeMode": ""
-- how hdfswriter handles existing data before writing:
-- (1) append: no pre-processing; hdfswriter writes directly using fileName, which can produce duplicate data
-- (2) nonConflict: if the directory already contains files with the fileName prefix, fail with an error.
}
}
}
],
"setting": {
"speed": {
"channel": "" -- concurrency (limit it sensibly according to your CPU)
-- With channel=2 and streamwriter, the content is printed twice.
-- With other writers such as hdfswriter or mysqlwriter, it merely starts two concurrent tasks; the content is written once, not duplicated.
}
}
}
}
四、DataX's Two Configuration Modes and Parameter Passing
1、Column mode
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": [
"bus_id",
"bus_name",
"telephone",
"pic_url",
"bus_user_name",
"province",
"city",
"dist",
"addr",
"zipcode",
"is_open",
"open_time",
"closed_time"
],
"connection": [
{
"jdbcUrl": [
"jdbc:mysql://hadoop04:3306/prod"
],
"table": [
"business_info"
]
}
],
"password": "shumeng_123",
"username": "shumeng",
"where": "1=1"
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"column": [
{
"name":"bus_id",
"type":"string"
},
{
"name":"bus_name",
"type":"string"
},
{
"name":"telephone",
"type":"string"
},
{
"name":"pic_url",
"type":"string"
},
{
"name":"bus_user_name",
"type":"string"
},
{
"name":"province",
"type":"string"
},
{
"name":"city",
"type":"string"
},
{
"name":"dist",
"type":"string"
},
{
"name":"addr",
"type":"string"
},
{
"name":"zipcode",
"type":"string"
},
{
"name":"is_open",
"type":"string"
},
{
"name":"open_time",
"type":"string"
},
{
"name":"closed_time",
"type":"string"
}
],
"compress": "snappy",
"defaultFS": "hdfs://hadoop04:8020",
"fieldDelimiter": ",",
"fileName": "ods_business_info_full.txt",
"fileType": "orc",
"path": "/user/hive/warehouse/ods.db/ods_business_info_full/pt=2022-01-01",
"writeMode": "append"
}
}
}
],
"setting": {
"speed": {
"channel": "1"
}
}
}
}
2、SQL mode
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"connection": [
{
"jdbcUrl": [
"jdbc:mysql://hadoop04:3306/prod"
],
"querySql":[
"select bus_id,bus_name,telephone,pic_url,bus_user_name,province,city,dist,addr,zipcode,is_open,open_time,closed_time from prod.business_info "
]
}
],
"password": "shumeng_123",
"username": "shumeng",
"where": "1=1"
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"column": [
{
"name":"bus_id",
"type":"string"
},
{
"name":"bus_name",
"type":"string"
},
{
"name":"telephone",
"type":"string"
},
{
"name":"pic_url",
"type":"string"
},
{
"name":"bus_user_name",
"type":"string"
},
{
"name":"province",
"type":"string"
},
{
"name":"city",
"type":"string"
},
{
"name":"dist",
"type":"string"
},
{
"name":"addr",
"type":"string"
},
{
"name":"zipcode",
"type":"string"
},
{
"name":"is_open",
"type":"string"
},
{
"name":"open_time",
"type":"string"
},
{
"name":"closed_time",
"type":"string"
}
],
"compress": "snappy",
"defaultFS": "hdfs://hadoop04:8020",
"fieldDelimiter": ",",
"fileName": "ods_business_info_full_sql.txt",
"fileType": "orc",
"path": "/user/hive/warehouse/ods.db/ods_business_info_full/pt=2022-01-01",
"writeMode": "append"
}
}
}
],
"setting": {
"speed": {
"channel": "1"
}
}
}
}
3、Note: DataX's HDFS writer stores null values as the empty string '', while Hive's default null representation is \N. To make Hive treat the empty string as null when reading, change the table's null serialization format:
alter table ods_business_info_full set serdeproperties('serialization.null.format' = '');
4、Passing Parameters into DataX
1、Why
Goal: each day's synchronized data should be loaded into its own partition, so the target path (the HDFS writer's path parameter) has to change dynamically.
Method: reference a parameter in the JSON configuration file as ${param}, and supply its value when submitting the job with -p "-Dparam=value".
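Conceptually, the -p "-Dname=value" option substitutes every ${name} placeholder in the job JSON before the job runs. A rough sketch of that substitution (ours, not DataX's actual code):
import re

def substitute_params(json_text, params):
    # Replace each ${name} placeholder with the value supplied on the command line;
    # unknown placeholders are left untouched.
    return re.sub(r"\$\{(\w+)\}", lambda m: params.get(m.group(1), m.group(0)), json_text)

print(substitute_params('{"path": "/ods/pt=${start_date}"}', {"start_date": "2020-01-01"}))
# {"path": "/ods/pt=2020-01-01"}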
2、Example
(1) The modified JSON
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"connection": [
{
"jdbcUrl": [
"jdbc:mysql://hadoop04:3306/prod"
],
"querySql":[
"select bus_id,bus_name,telephone,pic_url,bus_user_name,province,city,dist,addr,zipcode,is_open,open_time,closed_time from prod.business_info where open_time>='${start_date}'"
]
}
],
"password": "shumeng_123",
"username": "shumeng",
"where": "1=1"
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"column": [
{
"name":"bus_id",
"type":"string"
},
{
"name":"bus_name",
"type":"string"
},
{
"name":"telephone",
"type":"string"
},
{
"name":"pic_url",
"type":"string"
},
{
"name":"bus_user_name",
"type":"string"
},
{
"name":"province",
"type":"string"
},
{
"name":"city",
"type":"string"
},
{
"name":"dist",
"type":"string"
},
{
"name":"addr",
"type":"string"
},
{
"name":"zipcode",
"type":"string"
},
{
"name":"is_open",
"type":"string"
},
{
"name":"open_time",
"type":"string"
},
{
"name":"closed_time",
"type":"string"
}
],
"compress": "snappy",
"defaultFS": "hdfs://hadoop04:8020",
"fieldDelimiter": ",",
"fileName": "ods_business_info_full_sql",
"fileType": "orc",
"path": "/user/hive/warehouse/test.db/ods_business_info_full_sql/pt=2022-01-01",
"writeMode": "append"
}
}
}
],
"setting": {
"speed": {
"channel": "1"
}
}
}
}
(2) The command to run
datax.py ods_business_info_full_sql.json -p -Dstart_date=2020-01-01
五、Some Business Logic
1、Orders vs. order details
Order table --> order detail table
One order may contain several items; the line items can be looked up in the detail table by order id.
2、Dimension data
Dimension data is usually synchronized in full.
This follows from the nature of the data: it is relatively fixed and changes infrequently.
3、Order data
Orders are usually synchronized incrementally, because a time-based filter is generally required,
and order data is normally processed offline on a T+1 schedule.
4、Is extracting the order table by create_time alone enough?
Extracting the order table by create_time alone is problematic.
Pulling only the rows created on the current day misses changes.
For example: an order placed a few days ago is only paid today, so its update_time changes (an order has a lifecycle of roughly 7-15 days).
5、Tables like the payment table
A payment has a fixed window (it must be completed within, say, 30 minutes).
To extract the payment table incrementally, either:
(1) filter by create_time and update_time, or
(2) design the batch schedule around it, e.g. run the extraction after 00:30 every day, by which time all payments created the previous day have already completed.
6、Tables like the favorites table
The favorites table has no bounded lifecycle (something favorited a year ago may be un-favorited a year later; something favorited yesterday may be un-favorited today),
so extract it using both create_time and update_time.
六、Analysis for the Python Script and the Synchronization
1、What we need in order to write the script
(1) the MySQL columns
and the MySQL table name
(2) the Hive columns,
the Hive file name,
and the HDFS location of the Hive table
2、What we need to do
Analysis: (1) extract the common parameters
(2) extract the MySQL columns
(3) extract the Hive columns, which must correspond one-to-one with the MySQL columns
(4) generate the JSON
(5) use parameters to control which tables are generated
3、How do we get what we need?
For the MySQL columns and table names (a Python sketch of the same lookup follows the query):
select
TABLE_NAME
,COLUMN_NAME
,DATA_TYPE
from information_schema.`COLUMNS` c
where TABLE_SCHEMA ='project'
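A minimal Python sketch of the same lookup, assuming the MySQLdb driver and the connection settings used by the scripts later in this article (the host, user and password here are placeholders):
# coding=utf-8
import MySQLdb

def list_columns(database):
    # Return (table_name, column_name, data_type) for every table in `database`,
    # using the same information_schema query as above.
    conn = MySQLdb.connect(host="hadoop", port=3306, user="root", passwd="123456")
    cursor = conn.cursor()
    cursor.execute(
        "SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE "
        "FROM information_schema.COLUMNS WHERE TABLE_SCHEMA = %s "
        "ORDER BY TABLE_NAME, ORDINAL_POSITION",
        [database],
    )
    rows = cursor.fetchall()
    cursor.close()
    conn.close()
    return rows

for table, column, data_type in list_columns("project"):
    print("%s %s %s" % (table, column, data_type))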
4、The overall idea of the scripts: what are we trying to do?
Generate the job JSON files with a Python script,
then have DataX run the generated JSON files to import the data.
5、The full-load synchronization logic?
Delete everything, then insert everything.
6、The incremental synchronization logic
For orders:
select distinct substr(create_time,1,10) as ts
from
seller_info
where create_time >='2022-01-06' and create_time <'2022-01-07'
or update_time >='2022-01-06' and update_time <'2022-01-07'
A query like this can return dates spanning up to 6 different days, because besides create_time there is update_time: a row updated during this day may have a create_time from an earlier day.
!!! For some incremental tables, such as payments, it is enough to schedule the batch run after 00:30.
!!! For others, such as orders, both create_time and update_time must be used.
Question: for incremental loads, can we write directly into the ODS table (partitioned by day, with each partition holding that day's data)?
No!
Take the SQL above: the data extracted for the 6th can contain create_time values from 6 different days, so it cannot all be written into a single daily partition.
Solution:
(1) Create an stg layer and first land each day's incremental extract in an stg table. This table has no partitions (each run it holds only the data extracted that day).
So how do we load it into ODS?
(2) First, query the rows in the stg table; call them a.
Then, derive the affected partitions from the stg table's create_time values, and use those partitions as the WHERE condition to query the incremental ODS table across all of its partitions; call that b.
UNION ALL the two queries into a temporary table (the rows coming from the ODS table keep their original pt, while the stg rows get pt = create_time).
(3) Then apply a window function over the temporary table: partition by order id, order by update_time desc, and keep only rn = 1.
Insert the result into the incremental ODS table using dynamic partitioning: partition(pt).
Final result:
For rows that changed, only the newest version (with the refreshed update_time) is kept in the corresponding partition; the stale version disappears.
Rows that did not change are simply rewritten into their original partitions.
This completes the whole incremental synchronization logic.
A simple illustration in code:
Scenario: ods_a_inc is the incremental table and already has partitions for days 1, 2 and 3.
ods_a_stg is its stg table. Today is day 4, and the stg table holds rows whose create_time falls on days 1, 2, 3 and 4. The code below implements the incremental load.
Column a is the order id, i.e. the primary key.
create table tmp_ods_a_inc as
select
a
,b
,create_time
,update_time
,pt
from
ods_a_inc
where
pt in
(
select
create_time
from
ods_a_stg
group by
create_time
)
union all
select
a
,b
,create_time
,update_time
,create_time as pt
from
ods_a_stg
;
--enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table ods_a_inc partition(pt)
select
a
,b
,create_time
,update_time
,pt
from
(
select
a
,b
,create_time
,update_time
,pt
,row_number() over(partition by a order by update_time desc ) as rn
from tmp_ods_a_inc
) t
where rn = 1;
七、Project Script Implementation
1、A Python script that generates the JSON files for the full-load tables
(1) Create the file export_job_json_full.py
[peizk@hadoop job]$ vim export_job_json_full.py
(2) The code is as follows
# coding=utf-8
import json
import getopt
import os
import sys
import MySQLdb
#MySQL connection settings
mysql_host = "hadoop"
mysql_port = "3306"
mysql_user = "root"
mysql_passwd = "123456"
#HDFS NameNode connection settings
hdfs_nn_host = "hadoop"
hdfs_nn_port = "9000"
#target directory for the generated JSON files
output_path = "/home/peizk/app/datax/job/ods_job"
def mysql_connect():
return MySQLdb.connect(host=mysql_host, port=int(mysql_port), user=mysql_user, passwd=mysql_passwd)
def mysql_meta(database, table):
connection = mysql_connect()
cursor = connection.cursor()
sql = "SELECT COLUMN_NAME,DATA_TYPE from information_schema.COLUMNS WHERE TABLE_SCHEMA=%s AND TABLE_NAME=%s ORDER BY ORDINAL_POSITION"
cursor.execute(sql, [database, table])
fetchall = cursor.fetchall()
cursor.close()
connection.close()
return fetchall
def mysql_col(database, table):
meta = mysql_meta(database, table)
return map(lambda x: x[0], meta)
def hive_col(database, table):
def type_mapping(mysql_type):
mappings = {
"bigint": "bigint",
"int": "bigint",
"smallint": "bigint",
"tinyint": "bigint",
"decimal": "string",
"double": "double",
"float": "float",
"binary": "string",
"char": "string",
"varchar": "string",
"datetime": "string",
"time": "string",
"timestamp": "string",
"date": "string",
"text": "string"
}
return mappings[mysql_type]
meta = mysql_meta(database, table)
return map(lambda x: {"name": x[0], "type": type_mapping(x[1].lower())}, meta)
def generate_json(source_database, source_table):
job = {
"job": {
"setting": {
"speed": {
"channel": 1
}
},
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": mysql_user,
"password": mysql_passwd,
"column": mysql_col(source_database, source_table),
"splitPk": "",
"connection": [{
"table": [source_table],
"jdbcUrl": ["jdbc:mysql://" + mysql_host + ":" + mysql_port + "/" + source_database]
}]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://" + hdfs_nn_host + ":" + hdfs_nn_port,
"fileType": "orc",
"path": "${hdfs_path}",
"fileName": source_table,
"column": hive_col(source_database, source_table),
"writeMode": "append",
"fieldDelimiter": "\t",
"compress": "snappy"
}
}
}]
}
}
if not os.path.exists(output_path):
os.makedirs(output_path)
with open(os.path.join(output_path, ".".join([source_database, source_table+"_full", "json"])), "w") as f:
json.dump(job, f)
def main(args):
source_database = ""
source_table = ""
options, arguments = getopt.getopt(args, '-d:-t:', ['sourcedb=', 'sourcetbl='])
for opt_name, opt_value in options:
if opt_name in ('-d', '--sourcedb'):
source_database = opt_value
if opt_name in ('-t', '--sourcetbl'):
source_table = opt_value
generate_json(source_database, source_table)
if __name__ == '__main__':
main(sys.argv[1:])
(3) Use the script to generate the JSON for one full-load table
python export_job_json_full.py -dprod -tbusiness_info
(4) Run the generated full-load JSON with DataX
datax.py prod.business_info_full.json -p "-Dhdfs_path=/user/hive/warehouse/ods.db/ods_business_info_full/pt=2022-01-01"
2、A Python script that generates the JSON files for the incremental tables
(1) Create the file export_job_json_incr.py
[peizk@hadoop utils_code]$ vim export_job_json_incr.py
(2) The script code is as follows
# coding=utf-8
import json
import getopt
import os
import sys
import MySQLdb
#MySQL connection settings
mysql_host = "hadoop"
mysql_port = "3306"
mysql_user = "root"
mysql_passwd = "123456"
#HDFS NameNode connection settings
hdfs_nn_host = "hadoop"
hdfs_nn_port = "9000"
#target directory for the generated JSON files
output_path = "/home/peizk/app/datax/job/ods_job"
def get_connect():
return MySQLdb.connect(host=mysql_host, port=int(mysql_port), user=mysql_user, passwd=mysql_passwd)
def mysql_meta(database, table):
connection = get_connect()
cursor = connection.cursor()
sql = "SELECT COLUMN_NAME,DATA_TYPE from information_schema.COLUMNS WHERE TABLE_SCHEMA=%s AND TABLE_NAME=%s ORDER BY ORDINAL_POSITION"
cursor.execute(sql, [database, table])
fetchall = cursor.fetchall()
cursor.close()
connection.close()
return fetchall
def mysql_col(database, table):
return map(lambda x: x[0], mysql_meta(database, table))
def mysql_query(database, table):
lis = list(mysql_col(database, table))
str_columns = ','.join(lis)
sql_query = 'select ' + str_columns + ' from ' + table + ' where (${create_time}>=\'${start_date}\' and ${create_time}<\'${end_date}\') ${is_or} (${update_time}>=\'${start_date}\' and ${update_time}<\'${end_date}\') '
return sql_query
def hive_col(database, table):
def type_mapping(mysql_type):
mappings = {
"bigint": "bigint",
"int": "bigint",
"smallint": "bigint",
"tinyint": "bigint",
"decimal": "string",
"double": "double",
"float": "float",
"binary": "string",
"char": "string",
"varchar": "string",
"datetime": "string",
"time": "string",
"timestamp": "string",
"date": "string",
"text": "string"
}
return mappings[mysql_type]
meta = mysql_meta(database, table)
return map(lambda x: {"name": x[0], "type": type_mapping(x[1].lower())}, meta)
def generate_json(source_database, source_table):
job = {
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": mysql_user,
"password": mysql_passwd,
"splitPk": "",
"connection": [{
"jdbcUrl": ["jdbc:mysql://" + mysql_host + ":" + mysql_port + "/" + source_database],
"querySql":[mysql_query(source_database, source_table)]
}]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://" + hdfs_nn_host + ":" + hdfs_nn_port,
"fileType": "orc",
"path": "${hdfs_path}",
"fileName": source_table,
"column": hive_col(source_database, source_table),
"writeMode": "append",
"fieldDelimiter": "\t",
"compress": "snappy"
}
}
}],
"setting": {
"speed": {
"channel": 1
}
}
}
}
if not os.path.exists(output_path):
os.makedirs(output_path)
with open(os.path.join(output_path, ".".join([source_database, source_table+"_incr", "json"])), "w") as f:
json.dump(job, f)
def main(args):
source_database = ""
source_table = ""
options, arguments = getopt.getopt(args, '-d:-t:', ['sourcedb=', 'sourcetbl='])
for opt_name, opt_value in options:
if opt_name in ('-d', '--sourcedb'):
source_database = opt_value
if opt_name in ('-t', '--sourcetbl'):
source_table = opt_value
generate_json(source_database, source_table)
if __name__ == '__main__':
main(sys.argv[1:])
(3) Use the script to generate the JSON for one incremental table
python export_job_json_incr.py -dprod -tseller_info
(4) Run the generated incremental JSON with DataX
datax.py prod.seller_info_incr.json -p "-Dhdfs_path=/user/hive/warehouse/datax.db/ods_seller_info_incr/pt=2022-01-01 -Dcreate_time=create_time -Dupdate_time=update_time -Dstart_date=2022-01-01 -Dend_date=2022-01-02 -Dis_or='or'"
3、A shell script that drives generation of the JSON files for all tables (both full-load and incremental)
(1) Create the script export_all_job.sh
vim export_all_job.sh
(2) The script contents are
#!/bin/bash
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tpub_class
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tpub_code
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tpub_area
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tpub_brand
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tbusiness_info
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tsku_info
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tspu_info
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tsku_para
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tsku_para_info
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tactivity_sku
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tcoupon_info
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py -dproject -tactivity_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tcollect_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tpay_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -trefund_pay_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tremark_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tseller_detail
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tseller_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tseller_refund_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tseller_detail_coupon
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tseller_detail_activity
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py -dproject -tuser_info
(3) Running this script generates the JSON files for every table we need (full-load and incremental alike).
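If you prefer not to repeat one command line per table, the same calls can be driven from table lists. A small optional sketch (the lists below simply mirror export_all_job.sh; adjust the script paths to your environment):
# coding=utf-8
import subprocess

FULL_TABLES = ["pub_class", "pub_code", "pub_area", "pub_brand", "business_info",
               "sku_info", "spu_info", "sku_para", "sku_para_info",
               "activity_sku", "coupon_info", "activity_info"]
INCR_TABLES = ["collect_info", "pay_info", "refund_pay_info", "remark_info",
               "seller_detail", "seller_info", "seller_refund_info",
               "seller_detail_coupon", "seller_detail_activity", "user_info"]

# Generate the full-load and incremental job JSON files for every table.
for table in FULL_TABLES:
    subprocess.call(["python", "export_job_json_full.py", "-dproject", "-t" + table])
for table in INCR_TABLES:
    subprocess.call(["python", "export_job_json_incr.py", "-dproject", "-t" + table])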
4、A shell script that runs the DataX JSON jobs for the full-load tables
(1) Create the file import_ods_table_full.sh
vim import_ods_table_full.sh
(2) The script contents are
#!/bin/bash
# If a date is passed in, etl_date takes that value; otherwise default to yesterday's date
if [ -n "$2" ] ;then
etl_date=$2
else
etl_date=`date -d "-1 day" +%F`
fi
#DataX installation directory
DATAX_HOME=/home/peizk/app/datax
#directory holding the job JSON files
job_dir="/home/peizk/app/datax/job/ods_job"
#common HDFS warehouse directory
hdfs_dir="/user/hive/warehouse/project.db"
#Hive database name
database="project"
#Drop and recreate the target partition first, to avoid duplicate data in the same partition
alter_partition() {
echo "------------Start clearing partition ${etl_date} of table $1----------------"
hive_table_name=$1
hive -e "alter table ${database}.${hive_table_name} drop partition(pt='${etl_date}')"
hive -e "alter table ${database}.${hive_table_name} add partition(pt='${etl_date}')"
}
#data synchronization
import_full_table() {
datax_json=$1
hdfs_path=$2
hive_table_url=${hdfs_path%/*}
hive_table_name=${hive_table_url##*/}
alter_partition $hive_table_name
echo "------------Start full import into partition ${etl_date} of table ${hive_table_name}------------"
python $DATAX_HOME/bin/datax.py $datax_json -p"-Dhdfs_path=$hdfs_path"
}
case $1 in
"pub_class")
import_full_table ${job_dir}/project.pub_class_full.json ${hdfs_dir}/ods_pub_class_full/pt=$etl_date
;;
"pub_area")
import_full_table ${job_dir}/project.pub_area_full.json ${hdfs_dir}/ods_pub_area_full/pt=$etl_date
;;
"activity_info")
import_full_table ${job_dir}/project.activity_info_full.json ${hdfs_dir}/ods_activity_info_full/pt=$etl_date
;;
"activity_sku")
import_full_table ${job_dir}/project.activity_sku_full.json ${hdfs_dir}/ods_activity_sku_full/pt=$etl_date
;;
"coupon_info")
import_full_table ${job_dir}/project.coupon_info_full.json ${hdfs_dir}/ods_coupon_info_full/pt=$etl_date
;;
"pub_code")
import_full_table ${job_dir}/project.pub_code_full.json ${hdfs_dir}/ods_pub_code_full/pt=$etl_date
;;
"pub_brand")
import_full_table ${job_dir}/project.pub_brand_full.json ${hdfs_dir}/ods_pub_brand_full/pt=$etl_date
;;
"business_info")
import_full_table ${job_dir}/project.business_info_full.json ${hdfs_dir}/ods_business_info_full/pt=$etl_date
;;
"sku_info")
import_full_table ${job_dir}/project.sku_info_full.json ${hdfs_dir}/ods_sku_info_full/pt=$etl_date
;;
"spu_info")
import_full_table ${job_dir}/project.spu_info_full.json ${hdfs_dir}/ods_spu_info_full/pt=$etl_date
;;
"sku_para")
import_full_table ${job_dir}/project.sku_para_full.json ${hdfs_dir}/ods_sku_para_full/pt=$etl_date
;;
"sku_para_info")
import_full_table ${job_dir}/project.sku_para_info_full.json ${hdfs_dir}/ods_sku_para_info_full/pt=$etl_date
;;
"all")
import_full_table ${job_dir}/project.pub_class_full.json ${hdfs_dir}/ods_pub_class_full/pt=$etl_date
import_full_table ${job_dir}/project.pub_area_full.json ${hdfs_dir}/ods_pub_area_full/pt=$etl_date
import_full_table ${job_dir}/project.pub_code_full.json ${hdfs_dir}/ods_pub_code_full/pt=$etl_date
import_full_table ${job_dir}/project.pub_brand_full.json ${hdfs_dir}/ods_pub_brand_full/pt=$etl_date
import_full_table ${job_dir}/project.business_info_full.json ${hdfs_dir}/ods_business_info_full/pt=$etl_date
import_full_table ${job_dir}/project.sku_info_full.json ${hdfs_dir}/ods_sku_info_full/pt=$etl_date
import_full_table ${job_dir}/project.spu_info_full.json ${hdfs_dir}/ods_spu_info_full/pt=$etl_date
import_full_table ${job_dir}/project.sku_para_full.json ${hdfs_dir}/ods_sku_para_full/pt=$etl_date
import_full_table ${job_dir}/project.sku_para_info_full.json ${hdfs_dir}/ods_sku_para_info_full/pt=$etl_date
import_full_table ${job_dir}/project.activity_info_full.json ${hdfs_dir}/ods_activity_info_full/pt=$etl_date
import_full_table ${job_dir}/project.activity_sku_full.json ${hdfs_dir}/ods_activity_sku_full/pt=$etl_date
import_full_table ${job_dir}/project.coupon_info_full.json ${hdfs_dir}/ods_coupon_info_full/pt=$etl_date
esac
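Usage, inferred from the argument handling above (so treat the exact form as an assumption): sh import_ods_table_full.sh <table_name|all> [etl_date], for example sh import_ods_table_full.sh all 2022-01-01; when no date is given the script defaults to yesterday.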
5、A shell script that runs the DataX JSON jobs for the incremental tables
(1) Create the file import_ods_table_incr.sh
vim import_ods_table_incr.sh
(2) The script contents are
#!/bin/bash
# shell string slicing reference: https://www.cnblogs.com/onmyway20xx/p/4308113.html
# If dates are passed in, start_date/end_date take those values; otherwise default to yesterday and today
if [ -n "$3" ] ;then
start_date=$2
end_date=$3
else
start_date=`date -d "-1 day" +%F`
end_date=`date +%F`
fi
#DataX installation directory
DATAX_HOME=/home/peizk/app/datax
#directory holding the job JSON files
job_dir="/home/peizk/app/datax/job/ods_job"
#common HDFS warehouse directory
hdfs_dir="/user/hive/warehouse/project.db"
#Hive database name
database="project"
#Truncate the stg table first, so that it only ever holds the current day's extract
truncate_table () {
echo "------------Start truncating table $1----------------"
hive -e "truncate table ${database}.${1} "
}
#data synchronization
import_incr_table() {
datax_json=$1
hdfs_path=$2
is_or=$3
create_time=$4
update_time=$5
hive_table_url=${hdfs_path}
hive_table_name=${hive_table_url##*/}
truncate_table ${hive_table_name}
echo "------------Start incremental import into table ${hive_table_name}----------------"
python $DATAX_HOME/bin/datax.py $datax_json -p"-Dhdfs_path=$hdfs_path -Dcreate_time=$create_time -Dupdate_time=${update_time} -Dstart_date=${start_date} -Dend_date=${end_date} -Dis_or=${is_or}"
}
case $1 in
"collect_info")
import_incr_table ${job_dir}/project.collect_info_incr.json ${hdfs_dir}/ods_collect_info_incr_stg or create_time cancel_time
;;
"pay_info")
import_incr_table ${job_dir}/project.pay_info_incr.json ${hdfs_dir}/ods_pay_info_incr_stg -- pay_create_time
;;
"refund_pay_info")
import_incr_table ${job_dir}/project.refund_pay_info_incr.json ${hdfs_dir}/ods_refund_pay_info_incr_stg -- refund_create_time refund_update_time
;;
"remark_info")
import_incr_table ${job_dir}/project.remark_info_incr.json ${hdfs_dir}/ods_remark_info_incr_stg -- create_time update_time
;;
"seller_detail")
import_incr_table ${job_dir}/project.seller_detail_incr.json ${hdfs_dir}/ods_seller_detail_incr_stg or create_time update_time
;;
"seller_info")
import_incr_table ${job_dir}/project.seller_info_incr.json ${hdfs_dir}/ods_seller_info_incr_stg or create_time update_time
;;
"seller_refund_info")
import_incr_table ${job_dir}/project.seller_refund_info_incr.json ${hdfs_dir}/ods_seller_refund_info_incr_stg -- create_time update_time
;;
"seller_detail_coupon")
import_incr_table ${job_dir}/project.seller_detail_coupon_incr.json ${hdfs_dir}/ods_seller_detail_coupon_incr_stg -- create_time update_time
;;
"seller_detail_activity")
import_incr_table ${job_dir}/project.seller_detail_activity_incr.json ${hdfs_dir}/ods_seller_detail_activity_incr_stg -- create_time update_time
;;
"user_info")
import_incr_table ${job_dir}/project.user_info_incr.json ${hdfs_dir}/ods_user_info_incr_stg or create_time update_time
;;
"all")
import_incr_table ${job_dir}/project.collect_info_incr.json ${hdfs_dir}/ods_collect_info_incr_stg or create_time cancel_time
import_incr_table ${job_dir}/project.pay_info_incr.json ${hdfs_dir}/ods_pay_info_incr_stg -- pay_create_time
import_incr_table ${job_dir}/project.refund_pay_info_incr.json ${hdfs_dir}/ods_refund_pay_info_incr_stg -- refund_create_time refund_update_time
import_incr_table ${job_dir}/project.remark_info_incr.json ${hdfs_dir}/ods_remark_info_incr_stg -- create_time update_time
import_incr_table ${job_dir}/project.seller_detail_incr.json ${hdfs_dir}/ods_seller_detail_incr_stg or create_time update_time
import_incr_table ${job_dir}/project.seller_info_incr.json ${hdfs_dir}/ods_seller_info_incr_stg or create_time update_time
import_incr_table ${job_dir}/project.seller_refund_info_incr.json ${hdfs_dir}/ods_seller_refund_info_incr_stg -- create_time update_time
import_incr_table ${job_dir}/project.seller_detail_coupon_incr.json ${hdfs_dir}/ods_seller_detail_coupon_incr_stg -- create_time update_time
import_incr_table ${job_dir}/project.seller_detail_activity_incr.json ${hdfs_dir}/ods_seller_detail_activity_incr_stg -- create_time update_time
import_incr_table ${job_dir}/project.user_info_incr.json ${hdfs_dir}/ods_user_info_incr_stg or create_time update_time
esac
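Usage, again inferred from the argument handling (treat the exact form as an assumption): sh import_ods_table_incr.sh <table_name|all> [start_date end_date], for example sh import_ods_table_incr.sh seller_info 2022-01-01 2022-01-02; with no dates it defaults to yesterday and today.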
(3) The SQL script that loads the incremental data from the stg tables into the ODS tables
--enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- 1 ods_collect_info_incr_stg
insert overwrite table ods_collect_info_incr partition(pt)
select
id, user_id, sku_id, is_cancel, create_time, cancel_time, pt
from
(
SELECT id, user_id, sku_id, is_cancel, create_time, cancel_time, pt,
row_number() over(partition by id order by cancel_time desc ) as rn
FROM
(
SELECT id, user_id, sku_id, is_cancel, create_time, cancel_time, pt
FROM ods_collect_info_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
FROM ods_collect_info_incr_stg
group by
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
)
union all
SELECT id, user_id, sku_id, is_cancel, create_time, cancel_time,
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
FROM ods_collect_info_incr_stg
) a
) a
where rn =1 ;
--2 ods_pay_info_incr_stg
insert overwrite table ods_pay_info_incr partition(pt)
SELECT pay_id, out_trade_no, seller_id, user_id, pay_type, trade_no, total_amount, subject, pay_status, pay_create_time, pay_update_time, pay_content, pt
from
(
SELECT pay_id, out_trade_no, seller_id, user_id, pay_type, trade_no, total_amount, subject, pay_status, pay_create_time, pay_update_time, pay_content, pt,
row_number() over(partition by pay_id order by pay_update_time desc ) as rn
from
(
SELECT pay_id, out_trade_no, seller_id, user_id, pay_type, trade_no, total_amount, subject, pay_status, pay_create_time, pay_update_time, pay_content, pt
FROM ods_pay_info_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(pay_create_time),'yyyy-MM-dd')
FROM ods_pay_info_incr_stg
group by from_unixtime(unix_timestamp(pay_create_time),'yyyy-MM-dd')
)
union all
SELECT pay_id, out_trade_no, seller_id, user_id, pay_type, trade_no, total_amount, subject, pay_status, pay_create_time, pay_update_time, pay_content,
from_unixtime(unix_timestamp(pay_create_time),'yyyy-MM-dd') as pt
FROM ods_pay_info_incr_stg
) a
) a
where rn =1 ;
--3 ods_refund_pay_info_incr_stg
insert overwrite table ods_refund_pay_info_incr partition(pt)
SELECT refund_pay_id, out_trade_no, seller_id, sku_id, pay_type, trade_no, total_amount, refund_status, refund_create_time, refund_update_time, refund_content, pt
from
(
SELECT refund_pay_id, out_trade_no, seller_id, sku_id, pay_type, trade_no, total_amount, refund_status, refund_create_time, refund_update_time, refund_content, pt,
row_number() over(partition by refund_pay_id order by refund_update_time desc ) as rn
from
(
SELECT refund_pay_id, out_trade_no, seller_id, sku_id, pay_type, trade_no, total_amount, refund_status, refund_create_time, refund_update_time, refund_content, pt
FROM ods_refund_pay_info_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(refund_create_time),'yyyy-MM-dd')
FROM ods_refund_pay_info_incr_stg
group by from_unixtime(unix_timestamp(refund_create_time),'yyyy-MM-dd')
)
union all
SELECT refund_pay_id, out_trade_no, seller_id, sku_id, pay_type, trade_no, total_amount, refund_status, refund_create_time, refund_update_time, refund_content,
from_unixtime(unix_timestamp(refund_create_time),'yyyy-MM-dd') as pt
FROM ods_refund_pay_info_incr_stg
) a
) a
where rn =1 ;
--4 ods_remark_info_incr_stg
insert overwrite table ods_remark_info_incr partition(pt)
SELECT id, user_id, head_img, sku_id, spu_id, seller_id, appraise, remark_string, create_time, update_time, pt
from
(
SELECT id, user_id, head_img, sku_id, spu_id, seller_id, appraise, remark_string, create_time, update_time, pt,
row_number() over(partition by id order by update_time desc ) as rn
from
(
SELECT id, user_id, head_img, sku_id, spu_id, seller_id, appraise, remark_string, create_time, update_time, pt
FROM ods_remark_info_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
FROM ods_remark_info_incr_stg
group by
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
)
union all
SELECT id, user_id, head_img, sku_id, spu_id, seller_id, appraise, remark_string, create_time, update_time,
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
FROM ods_remark_info_incr_stg
) a
) a
where rn =1 ;
--5 ods_seller_detail_activity_incr_stg
insert overwrite table ods_seller_detail_activity_incr partition(pt)
SELECT id, seller_id, seller_detail_id, activity_id, activity_rule_id, sku_id, create_time, update_time, pt
from
(
SELECT id, seller_id, seller_detail_id, activity_id, activity_rule_id, sku_id, create_time, update_time, pt,
row_number() over(partition by id order by update_time desc ) as rn
from
(
SELECT id, seller_id, seller_detail_id, activity_id, activity_rule_id, sku_id, create_time, update_time, pt
FROM ods_seller_detail_activity_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
FROM ods_seller_detail_activity_incr_stg
group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
)
union all
SELECT id, seller_id, seller_detail_id, activity_id, activity_rule_id, sku_id, create_time, update_time,
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
FROM ods_seller_detail_activity_incr_stg
) a
) a
where rn =1 ;
--6 ods_seller_detail_coupon_incr_stg
insert overwrite table ods_seller_detail_coupon_incr partition(pt)
SELECT id, seller_id, seller_detail_id, coupon_id, coupon_use_id, sku_id, create_time, update_time, pt
from
(
SELECT id, seller_id, seller_detail_id, coupon_id, coupon_use_id, sku_id, create_time, update_time, pt,
row_number() over(partition by id order by update_time desc ) as rn
from
(
SELECT id, seller_id, seller_detail_id, coupon_id, coupon_use_id, sku_id, create_time, update_time, pt
FROM ods_seller_detail_coupon_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
FROM ods_seller_detail_coupon_incr_stg
group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
)
union all
SELECT id, seller_id, seller_detail_id, coupon_id, coupon_use_id, sku_id, create_time, update_time,
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
FROM ods_seller_detail_coupon_incr_stg
) a
) a
where rn =1 ;
-- 7 ods_seller_detail_incr_stg
insert overwrite table ods_seller_detail_incr partition(pt)
SELECT seller_detail_id, seller_id, seller_status, sku_id, sku_name, bus_id, seller_price, sku_num, create_time, update_time, source_type, source_id, total_amount, split_activity_amount, split_coupon_amount, pt
from
(
SELECT seller_detail_id, seller_id, seller_status, sku_id, sku_name, bus_id, seller_price, sku_num, create_time, update_time, source_type, source_id, total_amount, split_activity_amount, split_coupon_amount, pt,
row_number() over(partition by seller_detail_id order by update_time desc ) as rn
from
(
SELECT seller_detail_id, seller_id, seller_status, sku_id, sku_name, bus_id, seller_price, sku_num, create_time, update_time, source_type, source_id, total_amount, split_activity_amount, split_coupon_amount, pt
FROM ods_seller_detail_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
FROM ods_seller_detail_incr_stg
group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
)
union all
SELECT seller_detail_id, seller_id, seller_status, sku_id, sku_name, bus_id, seller_price, sku_num, create_time, update_time, source_type, source_id, total_amount, split_activity_amount, split_coupon_amount,
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
FROM ods_seller_detail_incr_stg
) a
) a
where rn =1 ;
--8 ods_seller_info_incr_stg
insert overwrite table ods_seller_info_incr partition(pt)
SELECT seller_id, user_id, seller_status, buyer_name, buyer_mobile, buyer_province_id, buyer_address_province, buyer_address_city, buyer_address_area, buyer_address_detail, total_amount, activity_reduce_amount, coupon_reduce_amount, original_total_amount, payment_way, seller_comment, out_trade_no, trade_body, process_status, tracking_no, feight_fee, create_time, update_time, pt
from
(
SELECT seller_id, user_id, seller_status, buyer_name, buyer_mobile, buyer_province_id, buyer_address_province, buyer_address_city, buyer_address_area, buyer_address_detail, total_amount, activity_reduce_amount, coupon_reduce_amount, original_total_amount, payment_way, seller_comment, out_trade_no, trade_body, process_status, tracking_no, feight_fee, create_time, update_time, pt,
row_number() over(partition by seller_id order by update_time desc ) as rn
from
(
SELECT seller_id, user_id, seller_status, buyer_name, buyer_mobile, buyer_province_id, buyer_address_province, buyer_address_city, buyer_address_area, buyer_address_detail, total_amount, activity_reduce_amount, coupon_reduce_amount, original_total_amount, payment_way, seller_comment, out_trade_no, trade_body, process_status, tracking_no, feight_fee, create_time, update_time, pt
FROM ods_seller_info_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
FROM ods_seller_info_incr_stg
group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
)
union all
SELECT seller_id, user_id, seller_status, buyer_name, buyer_mobile, buyer_province_id, buyer_address_province, buyer_address_city, buyer_address_area, buyer_address_detail, total_amount, activity_reduce_amount, coupon_reduce_amount, original_total_amount, payment_way, seller_comment, out_trade_no, trade_body, process_status, tracking_no, feight_fee, create_time, update_time,
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
FROM ods_seller_info_incr_stg
) a
) a
where rn =1 ;
--9 ods_seller_refund_info_incr_stg
insert overwrite table ods_seller_refund_info_incr partition(pt)
SELECT seller_refund_id, user_id, seller_id, sku_id, refund_type, refund_num, refund_amount, refund_reason_type, refund_reason_txt, refund_status, create_time, update_time, pt
from
(
SELECT seller_refund_id, user_id, seller_id, sku_id, refund_type, refund_num, refund_amount, refund_reason_type, refund_reason_txt, refund_status, create_time, update_time, pt,
row_number() over(partition by seller_refund_id order by update_time desc ) as rn
from
(
SELECT seller_refund_id, user_id, seller_id, sku_id, refund_type, refund_num, refund_amount, refund_reason_type, refund_reason_txt, refund_status, create_time, update_time, pt
FROM ods_seller_refund_info_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
FROM ods_seller_refund_info_incr_stg
group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
)
union all
SELECT seller_refund_id, user_id, seller_id, sku_id, refund_type, refund_num, refund_amount, refund_reason_type, refund_reason_txt, refund_status, create_time, update_time,
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
FROM ods_seller_refund_info_incr_stg
) a
) a
where rn =1 ;
--10 ods_user_info_incr_stg
insert overwrite table ods_user_info_incr partition(pt)
SELECT user_id, login_name, nick_name, passwd, user_name, phone_num, email, head_img, user_level, birthday, gender, create_time, update_time, `status`, pt
from
(
SELECT user_id, login_name, nick_name, passwd, user_name, phone_num, email, head_img, user_level, birthday, gender, create_time, update_time, `status`, pt,
row_number() over(partition by user_id order by update_time desc ) as rn
from
(
SELECT user_id, login_name, nick_name, passwd, user_name, phone_num, email, head_img, user_level, birthday, gender, create_time, update_time, `status`, pt
FROM ods_user_info_incr
where pt in
(
SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
FROM ods_user_info_incr_stg
group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
)
union all
SELECT user_id, login_name, nick_name, passwd, user_name, phone_num, email, head_img, user_level, birthday, gender, create_time, update_time, `status`,
from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
FROM ods_user_info_incr_stg
) a
) a
where rn =1 ;