Deploying DataX on a Server, Importing Data with Python Scripts, and Project Data

I. DataX Basics

1. What is DataX?

DataX is the open-source edition of DataWorks Data Integration, Alibaba Cloud's commercial product. It is an offline data synchronization tool/platform (an ETL tool) for heterogeneous data sources. DataX implements efficient data synchronization between a wide range of heterogeneous sources, including MySQL, Oracle, OceanBase, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), Hologres, and DRDS.

Tips: "Heterogeneous" means applications or data sources of different types, for example database sources such as MySQL, Oracle, DB2, and MongoDB.
Tips: Sqoop and DataX are the tools commonly used for offline data synchronization.
Tips: ETL (Extract-Transform-Load) describes the process of extracting data from source systems, transforming it, and loading it into a target system. Its purpose is to consolidate an enterprise's scattered, messy, and inconsistently standardized data to support decision-making analysis. ETL is most commonly used for data warehouses, but it is not limited to data warehousing (DW).

2. DataX's Design Philosophy

To solve the problem of synchronizing heterogeneous data sources, DataX turns the complex mesh of synchronization links into a star-shaped data topology, with DataX itself acting as the central transport hub that connects all data sources. When a new data source needs to be integrated, it only has to be connected to DataX to synchronize seamlessly with every data source already supported. Put simply, DataX acts like a middleman, matching supply and demand for each party it serves, as the small count sketched below illustrates.
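The payoff of the star topology can be seen with a quick back-of-the-envelope count (the numbers below are purely illustrative, not taken from DataX itself):

# Illustrative only: pairwise sync links vs. the DataX star model.
n_sources, n_targets = 10, 10

mesh_links   = n_sources * n_targets   # a dedicated sync link per (source, target) pair
star_plugins = n_sources + n_targets   # one Reader per source + one Writer per target

print(mesh_links, star_plugins)        # 100 vs. 20

Adding an eleventh source type then costs one new Reader plugin instead of ten new point-to-point links.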

3. DataX's Framework Design

Description: DataX itself is an offline data synchronization framework. An offline (batch) data channel is defined by its source and target data sources/datasets. The framework provides an abstraction for data extraction plugins (Readers) and data writing plugins (Writers), plus a simplified intermediate data exchange format built on top of them, so that data can be transferred between arbitrary structured and semi-structured data sources.

DataX's architecture follows a flow similar to source (data origin) -> channel (buffered transfer pool) -> sink (data destination).

Reader: the data collection module. It reads data from the source and sends it to the Framework.

Writer: the data writing module. It continuously pulls data from the Framework and writes it to the target.

Framework: connects the Reader and the Writer, acting as the data transfer channel between them and handling core concerns such as buffering, flow control, concurrency, and data conversion.

Reader, Writer, and Transformer

DataX can support any source-to-target combination as long as a Reader/Writer plugin is implemented for it. Plugins for the mainstream data sources, such as MySQL, Oracle, and SQL Server, are already provided officially. A minimal conceptual sketch of the Reader/Framework/Writer flow follows.
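As a rough mental model only (this is not DataX source code; the names and data are invented for illustration), the Reader -> Framework -> Writer flow can be sketched as a producer and a consumer joined by a bounded queue, with the queue standing in for the Framework's channel and its buffering and flow-control role:

# A minimal conceptual sketch of the Reader -> Framework -> Writer flow (illustration only).
from queue import Queue

def reader(channel):
    # Reader plugin: pull records from the source and push them into the channel.
    for record in ({"id": i} for i in range(3)):   # stand-in for a real data source
        channel.put(record)
    channel.put(None)                              # end-of-data marker

def writer(channel):
    # Writer plugin: keep taking records from the channel and write them out.
    while True:
        record = channel.get()
        if record is None:
            break
        print("write:", record)                    # stand-in for a real target

channel = Queue(maxsize=100)   # the Framework's channel: buffering and back-pressure
reader(channel)
writer(channel)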

4. DataX's Scheduling Process

(1) Concurrency

Concurrency here means the number of channels required (needChannelNumber).

  • By table count: if each channel is configured to handle 1 table, then processing 100 tables at the same time requires 100 channels.

  • By record count: if each channel is configured to handle 1,000 records, then processing 10,000 records at the same time requires 10 channels.

(2) Task Splitting

Split the Job into multiple Tasks: the Job is split into Tasks according to the required number of channels.

  • If the user configures specific tables, the table is the smallest unit of splitting (one Table corresponds to one Task) and the corresponding querySql is generated;

  • If the user also configures splitPk, splitting is done by splitPk: the table is split into splitPk ranges and the corresponding querySql is generated for each range.

(3) Task Allocation

1) Configure the channel count (channelNumber).

2) Set the number of channels assigned to each TaskGroup (5 channels per TaskGroup by default).

3) Compute the number of TaskGroups = channel count / channels per TaskGroup.

4) Distribute the Tasks evenly across the TaskGroups.

(4) Scheduling Example

For example:

A synchronization job is built to sync 100 MySQL tables into a Hive database, and concurrency is set to 20 (i.e. channelNumber = 20).

  1. Counting by tables, 100 channels (needChannelNumber) would be required;

  2. The Job is therefore split into 100 Tasks;

  3. DataX assigns 5 Channels to each TaskGroup by default, so taskGroupNumber = channelNumber / channelsPerTaskGroup = 20 / 5 = 4;

  4. Following DataX's fair allocation strategy, the 100 Tasks are distributed evenly across the TaskGroups, so each TaskGroup handles taskNumber / taskGroupNumber = 100 / 4 = 25 Tasks. The arithmetic is worked through in the short sketch after this list.
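The calculation above can be written out as a short sketch (the values mirror the example and DataX's documented default of 5 channels per TaskGroup; this is not DataX code):

# Worked example: 100 tables, channelNumber = 20, 5 channels per TaskGroup by default.
table_count            = 100
channel_number         = 20
channels_per_taskgroup = 5

task_number       = table_count                               # one Task per table -> 100 Tasks
task_group_number = channel_number // channels_per_taskgroup  # 20 / 5  = 4 TaskGroups
tasks_per_group   = task_number // task_group_number          # 100 / 4 = 25 Tasks per TaskGroup

print(task_number, task_group_number, tasks_per_group)        # 100 4 25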

(5) Detailed Diagram (the original scheduling diagram is not reproduced here)

II. Deploying DataX on the Server

1. Extract the Installation Package into the app Directory

[peizk@hadoop app]$ tar -zxvf datax.tar.gz  -C ~/app/

2. Configure Environment Variables

[root@hadoop ~]# vim /etc/profile

Add the following:

#DATAX_HOME
export DATAX_HOME=/home/peizk/app/datax
export PATH=$PATH:$DATAX_HOME/bin

Don't forget to run source /etc/profile afterwards!

3. Run the Official Example

In the job directory:

[peizk@hadoop job]$ datax.py job.json 

The configuration works!

4. Purpose of Each Directory

├── bin   # executable Python scripts
├── conf  # DataX configuration files
├── job   # offline synchronization jobs
├── lib   # dependency libraries
├── log   # job execution logs
├── log_perf
├── plugin # reader/writer plugins for the various data sources
│   ├── reader
│   └── writer
├── script # scripts
└── tmp    # temporary directory

III. Testing Data Transfer from a MySQL Table to HDFS

1. Get the MySQL-to-HDFS Template

[peizk@hadoop job]$ datax.py -r mysqlreader -w hdfswriter 

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


Please refer to the mysqlreader document:
     https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md 

Please refer to the hdfswriter document:
     https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md 
 
Please save the following configuration as a json file and  use
     python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json 
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader", 
                    "parameter": {
                        "column": [], 
                        "connection": [
                            {
                                "jdbcUrl": [], 
                                "table": []
                            }
                        ], 
                        "password": "", 
                        "username": "", 
                        "where": ""
                    }
                }, 
                "writer": {
                    "name": "hdfswriter", 
                    "parameter": {
                        "column": [], 
                        "compress": "", 
                        "defaultFS": "", 
                        "fieldDelimiter": "", 
                        "fileName": "", 
                        "fileType": "", 
                        "path": "", 
                        "writeMode": ""
                    }
                }
            }
        ], 
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}

2. Template Walkthrough

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",        -- name: the reader name
                    "parameter": {
                        "column": [],             -- column names
                        "connection": [
                            {
                                "jdbcUrl": [],    -- JDBC connection info for the database
                                "table": []       -- table names
                                【"querySql":[]】 -- supports a raw SQL statement; once configured,
                                                  -- mysqlreader ignores table, column and where
                            }
                        ],
                        "password": "",        -- password
                        "username": "",        -- username
                        "where": ""            -- filter condition
                       【"splitPk": ""】       -- column used to shard the data, usually the
                                               -- primary key; only integer types are supported
                    }
                },
                "writer": {
                    "name": "hdfswriter",       -- name: the writer name
                    "parameter": {
                        "column": [],     -- columns to write; "name" gives the column name
                                          -- and "type" gives the column type
                        "compress": "",   -- HDFS file compression type, empty by default
                        "defaultFS": "",  -- NameNode address
                        "fieldDelimiter": "", -- field delimiter
                        "fileName": "",       -- name of the file to write
                        "fileType": "",     -- file type; currently only "text" and "orc"
                                            -- are supported
                        "path": "",      -- path on the HDFS file system
                        "writeMode": ""
                        -- how hdfswriter handles existing data before writing:
                        --(1) append: no cleanup, write directly using fileName; data may be duplicated
                        --(2) nonConflict: if the directory already contains files prefixed with fileName, fail immediately
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""  -- concurrency (set the channel count sensibly for your CPU)
                -- With channel=2 and streamwriter, the printed content appears twice.
                -- With other writers such as hdfswriter or mysqlwriter, only two concurrent
                -- tasks are started; the content is written once, not duplicated.
            }
        }
    }
}

IV. DataX's Two Configuration Modes and Parameter Passing

1. Column Mode

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader", 
                    "parameter": {
                        "column": [
					  "bus_id",
                        "bus_name",
                        "telephone",
                        "pic_url",
                        "bus_user_name",
                        "province",
                        "city",
                        "dist",
                        "addr",
                        "zipcode",
                        "is_open",
                        "open_time",
					  "closed_time"
                        ], 
                        "connection": [
                            {
                                "jdbcUrl": [
                                "jdbc:mysql://hadoop04:3306/prod"
                                ], 
                                "table": [
                                "business_info"
                                ]
                            }
                        ], 
                        "password": "shumeng_123", 
                        "username": "shumeng", 
                        "where": "1=1"
                    }
                }, 
                "writer": {
                    "name": "hdfswriter", 
                    "parameter": {
                        "column": [
						    {
                        		"name":"bus_id",
                        		"type":"string"
                        	},
                        	{
                        		"name":"bus_name",
                        		"type":"string"
                        	},
                        	{
                        		"name":"telephone",
                        		"type":"string"
                        	},
                        	{
                        		"name":"pic_url",
                        		"type":"string"
                        	},                    	
                        	{
                        		"name":"bus_user_name",
                        		"type":"string" 
                        	},
							{
                        		"name":"province",
                        		"type":"string"
                        	},
                        	{
                        		"name":"city",
                        		"type":"string"
                        	},
                        	{
                        		"name":"dist",
                        		"type":"string"
                        	},                    	
                        	{
                        		"name":"addr",
                        		"type":"string" 
                        	},
                        	{
                        		"name":"zipcode",
                        		"type":"string"
                        	},
                        	{
                        		"name":"is_open",
                        		"type":"string"
                        	},                    	
                        	{
                        		"name":"open_time",
                        		"type":"string" 
                        	}, 
							{
                        		"name":"closed_time",
                        		"type":"string" 
                        	}
                        ], 
                        "compress": "snappy", 
                        "defaultFS": "hdfs://hadoop04:8020", 
                        "fieldDelimiter": ",", 
                        "fileName": "ods_business_info_full.txt", 
                        "fileType": "orc", 
                        "path": "/user/hive/warehouse/ods.db/ods_business_info_full/pt=2022-01-01", 
                        "writeMode": "append"
                    }
                }
            }
        ], 
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}

2. SQL Mode

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader", 
                    "parameter": {
                       
                        "connection": [
                            {
                                "jdbcUrl": [
                                "jdbc:mysql://hadoop04:3306/prod"
                                ], 
								"querySql":[
                                "select bus_id,bus_name,telephone,pic_url,bus_user_name,province,city,dist,addr,zipcode,is_open,open_time,close_time from prod.business_info "
                                ]
                            }
                        ], 
                        "password": "shumeng_123", 
                        "username": "shumeng", 
                        "where": "1=1"
                    }
                }, 
                "writer": {
                    "name": "hdfswriter", 
                    "parameter": {
                        "column": [
						    {
                        		"name":"bus_id",
                        		"type":"string"
                        	},
                        	{
                        		"name":"bus_name",
                        		"type":"string"
                        	},
                        	{
                        		"name":"telephone",
                        		"type":"string"
                        	},
                        	{
                        		"name":"pic_url",
                        		"type":"string"
                        	},                    	
                        	{
                        		"name":"bus_user_name",
                        		"type":"string" 
                        	},
							{
                        		"name":"province",
                        		"type":"string"
                        	},
                        	{
                        		"name":"city",
                        		"type":"string"
                        	},
                        	{
                        		"name":"dist",
                        		"type":"string"
                        	},                    	
                        	{
                        		"name":"addr",
                        		"type":"string" 
                        	},
                        	{
                        		"name":"zipcode",
                        		"type":"string"
                        	},
                        	{
                        		"name":"is_open",
                        		"type":"string"
                        	},                    	
                        	{
                        		"name":"open_time",
                        		"type":"string" 
                        	}, 
				           {
                        		"name":"closed_time",
                        		"type":"string" 
                        	}
                        ], 
                        "compress": "snappy", 
                        "defaultFS": "hdfs://hadoop04:8020", 
                        "fieldDelimiter": ",", 
                        "fileName": "ods_business_info_full_sql.txt", 
                        "fileType": "orc", 
                        "path": "/user/hive/warehouse/ods.db/ods_business_info_full/pt=2022-01-01", 
                        "writeMode": "append"
                    }
                }
            }
        ], 
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}

3. Note: DataX's HDFS writer stores null values as the empty string '', while Hive's default null representation is \N. Align the two with the following table property:

alter table ods_business_info_full set serdeproperties('serialization.null.format' = '');

4. Passing Parameters into DataX

1. Why

Goal: load each day's synchronized data into its own partition, which means the target path changes every day and the HDFS writer's path parameter must be set dynamically.

Method: reference parameters in the JSON configuration file as ${param}, and pass the values when submitting the job with -p "-Dparam=value".

2. Example

(1) The modified JSON

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader", 
                    "parameter": {
                       
                        "connection": [
                            {
                                "jdbcUrl": [
                                "jdbc:mysql://hadoop04:3306/prod"
                                ], 
								"querySql":[
                                "select bus_id,bus_name,telephone,pic_url,bus_user_name,province,city,dist,addr,zipcode,is_open,open_time,closed_time from prod.business_info where open_time>='${start_date}'"
                               ]
                            }
                        ], 
                        "password": "shumeng_123", 
                        "username": "shumeng", 
                        "where": "1=1"
                    }
                }, 
                "writer": {
                    "name": "hdfswriter", 
                    "parameter": {
                        "column": [
						    {
                        		"name":"bus_id",
                        		"type":"string"
                        	},
                        	{
                        		"name":"bus_name",
                        		"type":"string"
                        	},
                        	{
                        		"name":"telephone",
                        		"type":"string"
                        	},
                        	{
                        		"name":"pic_url",
                        		"type":"string"
                        	},                    	
                        	{
                        		"name":"bus_user_name",
                        		"type":"string" 
                        	},
							{
                        		"name":"province",
                        		"type":"string"
                        	},
                        	{
                        		"name":"city",
                        		"type":"string"
                        	},
                        	{
                        		"name":"dist",
                        		"type":"string"
                        	},                    	
                        	{
                        		"name":"addr",
                        		"type":"string" 
                        	},
                        	{
                        		"name":"zipcode",
                        		"type":"string"
                        	},
                        	{
                        		"name":"is_open",
                        		"type":"string"
                        	},                    	
                        	{
                        		"name":"open_time",
                        		"type":"string" 
                        	}, 
				           {
                        		"name":"closed_time",
                        		"type":"string" 
                        	}
                        ], 
                        "compress": "snappy", 
                        "defaultFS": "hdfs://hadoop04:8020", 
                        "fieldDelimiter": ",", 
                        "fileName": "ods_business_info_full_sql",
                        "fileType": "orc", 
                        "path": "/user/hive/warehouse/test.db/ods_business_info_full_sql/pt=2022-01-01", 
                        "writeMode": "append"
                    }
                }
            }
        ], 
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}

(2) The command to run

datax.py  ods_business_info_full_sql.json -p -Dstart_date=2020-01-01

V. Some Business Logic

1. Orders and Order Details

Order table --> order detail table

One order may contain several items; use the order id to look them up in the detail table.

2. Dimension Data


Dimension data is generally synchronized in full.

This is determined by the fact that dimension data is relatively stable and changes infrequently.

3. Order Data


Orders are generally synchronized incrementally, because a time-based filter is usually required.

Order data also normally goes through an offline T+1 pipeline.

4. Can the Order Table Be Extracted by create_time Alone?

Extracting the order table by create_time alone is problematic:
pulling only the rows created on the current day misses changes.

For example: an order placed a few days ago is only paid today, so its update_time changes (an order has a lifecycle of roughly 7-15 days).

5. Tables Like the Payment Table

Payments have a fixed window (payment must complete within a set number of minutes, for example 30 minutes).

To extract the payment table incrementally, either:

(1) extract by create_time and update_time, or
(2) schedule its batch run after 00:30 each day, by which time every payment with the previous day's create_time has already completed.

6. Tables Like the Favorites Table

The favorites table has no determinable lifecycle (an item favorited a year ago may be unfavorited a year later; one favorited a day ago may be unfavorited a day later).

Extract it using create_time and update_time.


VI. Analysis for Writing the Python Scripts and for Synchronization

1. What the Script Needs

(1) The MySQL columns and the MySQL table name

(2) The Hive columns, the Hive file name, and the Hive file path

2. What We Need to Do

Analysis:
(1) extract the common parameters;
(2) extract the MySQL columns;
(3) extract the Hive columns, mapped one-to-one to the MySQL columns;
(4) generate the JSON;
(5) use command-line arguments to control which tables are generated.

3. How Do We Get What We Need?

For the MySQL columns and table names:

select
  TABLE_NAME 
  ,COLUMN_NAME 
  ,DATA_TYPE 
from information_schema.`COLUMNS` c 
where TABLE_SCHEMA ='project'

4. Overall Script Approach: What Does It Do?

A Python script generates the full-load JSON files.

DataX then executes the corresponding JSON files to import the data.
 

5. Full Synchronization Logic

Delete everything, then insert everything (in practice: drop and recreate the target partition, then reload it in full).

6. Incremental Synchronization Logic

For orders:

select distinct substr(create_time,1,10) as ts
from 
  seller_info 
where (create_time >='2022-01-06' and create_time <'2022-01-07')
   or (update_time >='2022-01-06' and update_time <'2022-01-07')

A single day's extract can therefore contain rows from up to 6 different create_time dates: besides the rows created on that day, there are rows whose update_time changed on that day but whose create_time falls on an earlier day.

!!! For some incremental data, such as payments, it is enough to control the batch run time, scheduling it after 00:30.


!!! Other incremental data, such as orders, must be controlled with both create_time and update_time.

Question: for incremental data, can we write directly into the ODS table (which is partitioned by day, one partition holding that day's data)?
No!

For example, with the SQL above, the extract for the 6th contains create_time values spanning 6 days, so it cannot be written straight into a single daily partition.

Solution:

(1) Create an stg layer and first store each day's incremental extract in an stg table. This table is not partitioned (on each run it holds only the data extracted that day).

So how is it loaded into ODS?

(2) First, query the data in the stg table; call it a.
Then derive the affected partitions from the stg table's create_time values, and use those partitions as a where condition to query the incremental ODS table across all of its partitions; call that b.

Union-all these two queries into a temporary table (the rows from b, the ODS table, keep their original pt; the rows from a, the stg table, get pt = create_time).

(3) Then apply a window function over this temporary table, partitioned by order id and ordered by update_time desc, keep the rows with rn = 1, and insert the result into the incremental ODS table using dynamic partitioning (partition(pt)).

End result:
for rows that changed, the latest version (with its updated update_time) is the one written back and the stale version disappears;
rows that did not change are written back into their original partitions.

This completes the incremental synchronization logic.

A simple code illustration.
Scenario: ods_a_inc is the incremental table and contains partitions for days 1, 2 and 3;
ods_a_stg is its stg table. Today is day 4, and the stg table contains rows whose create_time is 1, 2, 3 or 4.
Column a is the order id, i.e. the primary key. The incremental load looks like this:

create table tmp_ods_a_inc as
select * from
(
  select 
     a
    ,b
    ,create_time
    ,update_time
    ,pt
  from 
    ods_a_inc
  where 
    pt in 
  (
    select 
      create_time
    from 
      ods_a_stg
    group by 
      create_time
  )
  union all

  select
     a
    ,b  
    ,create_time 
    ,update_time  
    ,create_time as pt
  from 
    ods_a_stg
) t;

-- enable dynamic partitioning
set hive.exec.dynamic.partition=true;    
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table ods_a_inc partition(pt)
select
  a
  ,b 
  ,create_time
  ,update_time
  ,pt 
from 
(
  select 
    a
    ,b 
    ,create_time
    ,update_time
    ,pt 
    ,row_number() over(partition by a order by update_time desc ) as rn 
  from tmp_ods_a_inc
) t
where rn = 1;

VII. Project Script Implementation

1. Python Script That Generates the Full-Table JSON

(1) Create the file export_job_json_full.py

[peizk@hadoop job]$ vim export_job_json_full.py

(2) The code is as follows

# coding=utf-8
import json
import getopt
import os
import sys
import MySQLdb

# MySQL connection settings
mysql_host = "hadoop"
mysql_port = "3306"
mysql_user = "root"
mysql_passwd = "123456"

# HDFS NameNode connection settings
hdfs_nn_host = "hadoop"
hdfs_nn_port = "9000"

# Target directory for the generated JSON files
output_path = "/home/peizk/app/datax/job/ods_job"

def mysql_connect():
    return MySQLdb.connect(host=mysql_host, port=int(mysql_port), user=mysql_user, passwd=mysql_passwd)

def mysql_meta(database, table):
    connection = mysql_connect()
    cursor = connection.cursor()
    sql = "SELECT COLUMN_NAME,DATA_TYPE from information_schema.COLUMNS WHERE TABLE_SCHEMA=%s AND TABLE_NAME=%s ORDER BY ORDINAL_POSITION"
    cursor.execute(sql, [database, table])
    fetchall = cursor.fetchall()
    cursor.close()
    connection.close()
    return fetchall

def mysql_col(database, table):
    meta = mysql_meta(database, table)
    return [x[0] for x in meta]  # plain list, so json.dump can serialize it

def hive_col(database, table):
    def type_mapping(mysql_type):
        mappings = {
            "bigint": "bigint",
            "int": "bigint",
            "smallint": "bigint",
            "tinyint": "bigint",
            "decimal": "string",
            "double": "double",
            "float": "float",
            "binary": "string",
            "char": "string",
            "varchar": "string",
            "datetime": "string",
            "time": "string",
            "timestamp": "string",
            "date": "string",
            "text": "string"
        }
        return mappings[mysql_type]

    meta = mysql_meta(database, table)
    return [{"name": x[0], "type": type_mapping(x[1].lower())} for x in meta]  # plain list for JSON serialization

def generate_json(source_database, source_table):
    job = {
        "job": {
            "setting": {
                "speed": {
                    "channel": 1
                }
            },
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": mysql_user,
                        "password": mysql_passwd,
                        "column": mysql_col(source_database, source_table),
                        "splitPk": "",
                        "connection": [{
                            "table": [source_table],
                            "jdbcUrl": ["jdbc:mysql://" + mysql_host + ":" + mysql_port + "/" + source_database]
                        }]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://" + hdfs_nn_host + ":" + hdfs_nn_port,
                        "fileType": "orc",
                        "path": "${hdfs_path}",
                        "fileName": source_table,
                        "column": hive_col(source_database, source_table),
                        "writeMode": "append",
                        "fieldDelimiter": "\t",
                        "compress": "snappy"
                    }
                }
            }]
        }
    }
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    with open(os.path.join(output_path, ".".join([source_database, source_table+"_full", "json"])), "w") as f:
        json.dump(job, f)


def main(args):
    source_database = ""
    source_table = ""

    options, arguments = getopt.getopt(args, '-d:-t:', ['sourcedb=', 'sourcetbl='])
    for opt_name, opt_value in options:
        if opt_name in ('-d', '--sourcedb'):
            source_database = opt_value
        if opt_name in ('-t', '--sourcetbl'):
            source_table = opt_value

    generate_json(source_database, source_table)


if __name__ == '__main__':
    main(sys.argv[1:])

(3) Generate the JSON for one full-load table with the script

python export_job_json_full.py -dprod -tbusiness_info

(4) Run the generated full-load JSON with DataX

datax.py prod.business_info_full.json -p "-Dhdfs_path=/user/hive/warehouse/ods.db/ods_business_info_full/pt=2022-01-01"

2. Python Script That Generates the Incremental-Table JSON

(1) Create the file export_job_json_incr.py

[peizk@hadoop utils_code]$ vim export_job_json_incr.py

(2) The script code is as follows

# coding=utf-8
import json
import getopt
import os
import sys
import MySQLdb

# MySQL connection settings
mysql_host = "hadoop"
mysql_port = "3306"
mysql_user = "root"
mysql_passwd = "123456"

# HDFS NameNode connection settings
hdfs_nn_host = "hadoop"
hdfs_nn_port = "9000"

# Target directory for the generated JSON files
output_path = "/home/peizk/app/datax/job/ods_job"


def get_connect():
    return MySQLdb.connect(host=mysql_host, port=int(mysql_port), user=mysql_user, passwd=mysql_passwd)


def mysql_meta(database, table):
    connection = get_connect()
    cursor = connection.cursor()
    sql = "SELECT COLUMN_NAME,DATA_TYPE from information_schema.COLUMNS WHERE TABLE_SCHEMA=%s AND TABLE_NAME=%s ORDER BY ORDINAL_POSITION"
    cursor.execute(sql, [database, table])
    fetchall = cursor.fetchall()
    cursor.close()
    connection.close()
    return fetchall


def mysql_col(database, table):
    return map(lambda x: x[0], mysql_meta(database, table))

def mysql_query(database, table):
    lis = list(mysql_col(database, table))
    str_columns = ','.join(lis)
    sql_query = 'select ' + str_columns + ' from ' + table + ' where (${create_time}>=\'${start_date}\' and ${create_time}<\'${end_date}\') ${is_or} (${update_time}>=\'${start_date}\' and ${update_time}<\'${end_date}\') '
    return sql_query


def hive_col(database, table):
    def type_mapping(mysql_type):
        mappings = {
            "bigint": "bigint",
            "int": "bigint",
            "smallint": "bigint",
            "tinyint": "bigint",
            "decimal": "string",
            "double": "double",
            "float": "float",
            "binary": "string",
            "char": "string",
            "varchar": "string",
            "datetime": "string",
            "time": "string",
            "timestamp": "string",
            "date": "string",
            "text": "string"
        }
        return mappings[mysql_type]

    meta = mysql_meta(database, table)
    return [{"name": x[0], "type": type_mapping(x[1].lower())} for x in meta]  # plain list for JSON serialization


def generate_json(source_database, source_table):
    job = {
        "job": {
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": mysql_user,
                        "password": mysql_passwd,
                        "splitPk": "",
                        "connection": [{
                            "jdbcUrl": ["jdbc:mysql://" + mysql_host + ":" + mysql_port + "/" + source_database],
                            "querySql":[mysql_query(source_database, source_table)]
                        }]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://" + hdfs_nn_host + ":" + hdfs_nn_port,
                        "fileType": "orc",
                        "path": "${hdfs_path}",
                        "fileName": source_table,
                        "column": hive_col(source_database, source_table),
                        "writeMode": "append",
                        "fieldDelimiter": "\t",
                        "compress": "snappy"
                    }
                }
            }],
            "setting": {
                "speed": {
                    "channel": 1
                }
            }
        }
    }
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    with open(os.path.join(output_path, ".".join([source_database, source_table+"_incr", "json"])), "w") as f:
        json.dump(job, f)


def main(args):
    source_database = ""
    source_table = ""

    options, arguments = getopt.getopt(args, '-d:-t:', ['sourcedb=', 'sourcetbl='])
    for opt_name, opt_value in options:
        if opt_name in ('-d', '--sourcedb'):
            source_database = opt_value
        if opt_name in ('-t', '--sourcetbl'):
            source_table = opt_value

    generate_json(source_database, source_table)


if __name__ == '__main__':
    main(sys.argv[1:])

(3) Generate the JSON for one incremental table with the script

python export_job_json_incr.py -dprod -tseller_info

(4) Run the generated incremental JSON with DataX

datax.py prod.seller_info_incr.json -p "-Dhdfs_path=/user/hive/warehouse/datax.db/ods_seller_info_incr/pt=2022-01-01 -Dcreate_time=create_time -Dupdate_time=update_time -Dstart_date=2022-01-01 -Dend_date=2022-01-02 -Dis_or='or'"

3. A Shell Script That Generates the JSON for All Tables (Full and Incremental)

(1) Create the script export_all_job.sh

vim  export_all_job.sh

(2) The script content is

#!/bin/bash
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py  -dproject -tpub_class
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py  -dproject -tpub_code
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py  -dproject -tpub_area
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py  -dproject -tpub_brand
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py  -dproject -tbusiness_info
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py  -dproject -tsku_info
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py  -dproject -tspu_info
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py  -dproject -tsku_para
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py  -dproject -tsku_para_info
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py  -dproject -tactivity_sku
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py  -dproject -tcoupon_info
python /home/peizk/app/datax/job/utils_code/export_job_json_full.py  -dproject -tactivity_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py  -dproject -tcollect_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py  -dproject -tpay_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py  -dproject -trefund_pay_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py  -dproject -tremark_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py  -dproject -tseller_detail
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py  -dproject -tseller_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py  -dproject -tseller_refund_info
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py  -dproject -tseller_detail_coupon
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py  -dproject -tseller_detail_activity
python /home/peizk/app/datax/job/utils_code/export_job_json_incr.py  -dproject -tuser_info

(3) Running this script generates the JSON for every table we need in one batch (both full and incremental).

4. Extraction Script That Runs the Full-Table JSON with DataX

(1) Create the file import_ods_table_full.sh

vim  import_ods_table_full.sh

(2) The code is

#!/bin/bash
# If a date is passed in, etl_date takes that date; otherwise default to the previous day
if [ -n "$2" ] ;then
    etl_date=$2
else
    etl_date=`date -d "-1 day" +%F`
fi
 
# DataX home directory
DATAX_HOME=/home/peizk/app/datax
 
# Directory where the job JSON files are stored
job_dir="/home/peizk/app/datax/job/ods_job"
 
# Common HDFS base directory
hdfs_dir="/user/hive/warehouse/project.db"
 
# Hive database name
database="project"
 
 
# Drop the table partition first and recreate it, to prevent duplicate data in the same partition
alter_partition() {
  echo "------------开始清空$1表${etl_date}分区数据----------------"
  hive_table_name=$1
  hive  -e "alter table ${database}.${hive_table_name} drop partition(pt='${etl_date}')"   
  hive  -e "alter table ${database}.${hive_table_name} add  partition(pt='${etl_date}')"  
}
 
# Data synchronization
import_full_table() {
  datax_json=$1
  hdfs_path=$2
  hive_table_url=${hdfs_path%/*}
  hive_table_name=${hive_table_url##*/}  
  alter_partition $hive_table_name
  echo "------------开始全量导入$1表${etl_date}分区数据------------"
  python $DATAX_HOME/bin/datax.py $datax_json -p"-Dhdfs_path=$hdfs_path" 
}
 
 
case $1 in
"pub_class")
  import_full_table ${job_dir}/project.pub_class_full.json ${hdfs_dir}/ods_pub_class_full/pt=$etl_date
  ;;
"pub_area")
  import_full_table ${job_dir}/project.pub_area_full.json ${hdfs_dir}/ods_pub_area_full/pt=$etl_date
  ;;
"activity_info")
  import_full_table ${job_dir}/project.activity_info_full.json ${hdfs_dir}/ods_activity_info_full/pt=$etl_date
  ;;
"activity_sku")
  import_full_table ${job_dir}/project.activity_sku_full.json ${hdfs_dir}/ods_activity_sku_full/pt=$etl_date
  ;;
"coupon_info")
  import_full_table ${job_dir}/project.coupon_info_full.json ${hdfs_dir}/ods_coupon_info_full/pt=$etl_date
  ;;
"pub_code")
  import_full_table ${job_dir}/project.pub_code_full.json ${hdfs_dir}/ods_pub_code_full/pt=$etl_date
  ;;
"pub_brand")
  import_full_table ${job_dir}/project.pub_brand_full.json ${hdfs_dir}/ods_pub_brand_full/pt=$etl_date
  ;;
"business_info")
  import_full_table ${job_dir}/project.business_info_full.json ${hdfs_dir}/ods_business_info_full/pt=$etl_date
  ;;  
"sku_info")
  import_full_table ${job_dir}/project.sku_info_full.json ${hdfs_dir}/ods_sku_info_full/pt=$etl_date
  ;;
"spu_info")
  import_full_table ${job_dir}/project.spu_info_full.json ${hdfs_dir}/ods_spu_info_full/pt=$etl_date
  ;;
"sku_para")
  import_full_table ${job_dir}/project.sku_para_full.json ${hdfs_dir}/ods_sku_para_full/pt=$etl_date
  ;;
"sku_para_info")
  import_full_table ${job_dir}/project.sku_para_info_full.json ${hdfs_dir}/ods_sku_para_info_full/pt=$etl_date
  ;;
  
"all")
  import_full_table ${job_dir}/project.pub_class_full.json ${hdfs_dir}/ods_pub_class_full/pt=$etl_date
  import_full_table ${job_dir}/project.pub_area_full.json ${hdfs_dir}/ods_pub_area_full/pt=$etl_date
  import_full_table ${job_dir}/project.pub_code_full.json ${hdfs_dir}/ods_pub_code_full/pt=$etl_date
  import_full_table ${job_dir}/project.pub_brand_full.json ${hdfs_dir}/ods_pub_brand_full/pt=$etl_date
  import_full_table ${job_dir}/project.business_info_full.json ${hdfs_dir}/ods_business_info_full/pt=$etl_date
  import_full_table ${job_dir}/project.sku_info_full.json ${hdfs_dir}/ods_sku_info_full/pt=$etl_date
  import_full_table ${job_dir}/project.spu_info_full.json ${hdfs_dir}/ods_spu_info_full/pt=$etl_date
  import_full_table ${job_dir}/project.sku_para_full.json ${hdfs_dir}/ods_sku_para_full/pt=$etl_date
  import_full_table ${job_dir}/project.sku_para_info_full.json ${hdfs_dir}/ods_sku_para_info_full/pt=$etl_date
  import_full_table ${job_dir}/project.activity_info_full.json ${hdfs_dir}/ods_activity_info_full/pt=$etl_date
  import_full_table ${job_dir}/project.activity_sku_full.json ${hdfs_dir}/ods_activity_sku_full/pt=$etl_date
  import_full_table ${job_dir}/project.coupon_info_full.json ${hdfs_dir}/ods_coupon_info_full/pt=$etl_date

esac

5. Extraction Script That Runs the Incremental-Table JSON with DataX

(1) Create the file import_ods_table_incr.sh

vim  import_ods_table_incr.sh

(2) The code is

#!/bin/bash
# Shell string manipulation reference: https://www.cnblogs.com/onmyway20xx/p/4308113.html
# If dates are passed in, use them as start_date/end_date; otherwise default to yesterday and today
if [ -n "$3" ] ;then
    start_date=$2
    end_date=$3
else
    start_date=`date -d "-1 day" +%F`
    end_date=`date +%F`
fi

# DataX home directory
DATAX_HOME=/home/peizk/app/datax

# Directory where the job JSON files are stored
job_dir="/home/peizk/app/datax/job/ods_job"

# Common HDFS base directory
hdfs_dir="/user/hive/warehouse/project.db"

# Hive database name
database="project"


# Truncate the stg table first, to prevent duplicate data from repeated runs
truncate_table () {
  echo "------------开始清空$1表${etl_date}分区数据----------------"
  hive  -e "truncate table  ${database}.${1} "   
}

# Data synchronization
import_incr_table() {
  datax_json=$1
  hdfs_path=$2
  is_or=$3
  create_time=$4
  update_time=$5
  hive_table_url=${hdfs_path}
  hive_table_name=${hive_table_url##*/} 
  truncate_table ${hive_table_name}
echo "------------开始全量导入${hive_table_incr}表----------------"
  python $DATAX_HOME/bin/datax.py $datax_json -p"-Dhdfs_path=$hdfs_path -Dcreate_time=$create_time -Dupdate_time=${update_time} -Dstart_date=${start_date} -Dend_date=${end_date} -Dis_or=${is_or}" 
}


# Note: for tables where only the creation-time condition should apply, "--" is passed as
# the is_or argument, which turns the update_time condition in the generated querySql
# into an SQL comment.
case $1 in
"collect_info")
  import_incr_table ${job_dir}/project.collect_info_incr.json ${hdfs_dir}/ods_collect_info_incr_stg or create_time cancel_time
  ;;
"pay_info")
  import_incr_table ${job_dir}/project.pay_info_incr.json ${hdfs_dir}/ods_pay_info_incr_stg         -- pay_create_time 
  ;;
"refund_pay_info")
  import_incr_table ${job_dir}/project.refund_pay_info_incr.json ${hdfs_dir}/ods_refund_pay_info_incr_stg -- refund_create_time refund_update_time
  ;;
"remark_info")
  import_incr_table ${job_dir}/project.remark_info_incr.json ${hdfs_dir}/ods_remark_info_incr_stg  -- create_time update_time
  ;;
"seller_detail")
  import_incr_table ${job_dir}/project.seller_detail_incr.json ${hdfs_dir}/ods_seller_detail_incr_stg or create_time update_time
  ;;  
"seller_info")
  import_incr_table ${job_dir}/project.seller_info_incr.json ${hdfs_dir}/ods_seller_info_incr_stg or create_time update_time
  ;;
"seller_refund_info")
  import_incr_table ${job_dir}/project.seller_refund_info_incr.json ${hdfs_dir}/ods_seller_refund_info_incr_stg -- create_time update_time
  ;;
"seller_detail_coupon")
  import_incr_table ${job_dir}/project.seller_detail_coupon_incr.json ${hdfs_dir}/ods_seller_detail_coupon_incr_stg -- create_time update_time
  ;;
"seller_detail_activity")
  import_incr_table ${job_dir}/project.seller_detail_activity_incr.json ${hdfs_dir}/ods_seller_detail_activity_incr_stg -- create_time update_time
  ;;
"user_info")
  import_incr_table ${job_dir}/project.user_info_incr.json ${hdfs_dir}/ods_user_info_incr_stg     or create_time update_time
  ;; 
  
"all")
   import_incr_table ${job_dir}/project.collect_info_incr.json ${hdfs_dir}/ods_collect_info_incr_stg or create_time cancel_time
  
  import_incr_table ${job_dir}/project.pay_info_incr.json ${hdfs_dir}/ods_pay_info_incr_stg -- pay_create_time 

  import_incr_table ${job_dir}/project.refund_pay_info_incr.json ${hdfs_dir}/ods_refund_pay_info_incr_stg -- refund_create_time refund_update_time

  import_incr_table ${job_dir}/project.remark_info_incr.json ${hdfs_dir}/ods_remark_info_incr_stg  -- create_time update_time

  import_incr_table ${job_dir}/project.seller_detail_incr.json ${hdfs_dir}/ods_seller_detail_incr_stg or create_time update_time

  import_incr_table ${job_dir}/project.seller_info_incr.json ${hdfs_dir}/ods_seller_info_incr_stg or create_time update_time

  import_incr_table ${job_dir}/project.seller_refund_info_incr.json ${hdfs_dir}/ods_seller_refund_info_incr_stg -- create_time update_time

  import_incr_table ${job_dir}/project.seller_detail_coupon_incr.json ${hdfs_dir}/ods_seller_detail_coupon_incr_stg -- create_time update_time

  import_incr_table ${job_dir}/project.seller_detail_activity_incr.json ${hdfs_dir}/ods_seller_detail_activity_incr_stg -- create_time update_time

  import_incr_table ${job_dir}/project.user_info_incr.json ${hdfs_dir}/ods_user_info_incr_stg  or create_time update_time
esac

(3) Script to load the incremental data from the stg tables into the ODS tables

-- enable dynamic partitioning
set hive.exec.dynamic.partition=true;    
set hive.exec.dynamic.partition.mode=nonstrict;

-- 1 ods_collect_info_incr_stg 

insert overwrite table ods_collect_info_incr partition(pt)
select
id, user_id, sku_id, is_cancel, create_time, cancel_time, pt
from
(
    SELECT id, user_id, sku_id, is_cancel, create_time, cancel_time, pt,
    row_number() over(partition by id order by cancel_time desc ) as rn
    FROM 
    (
        
        SELECT id, user_id, sku_id, is_cancel, create_time, cancel_time, pt
        FROM ods_collect_info_incr
        where pt in 
        (
            SELECT  from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
            FROM ods_collect_info_incr_stg
            group by 
            from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
        )
        union all
        SELECT id, user_id, sku_id, is_cancel, create_time, cancel_time, 
        from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt
        FROM ods_collect_info_incr_stg
    
    ) a
) a
where rn =1 ;


--2  ods_pay_info_incr_stg

insert overwrite table ods_pay_info_incr partition(pt)
SELECT pay_id, out_trade_no, seller_id, user_id, pay_type, trade_no, total_amount, subject, pay_status, pay_create_time, pay_update_time, pay_content, pt
from 
(
    SELECT pay_id, out_trade_no, seller_id, user_id, pay_type, trade_no, total_amount, subject, pay_status, pay_create_time, pay_update_time, pay_content, pt,
    row_number() over(partition by pay_id order by pay_update_time desc ) as rn
    from
    (
        SELECT pay_id, out_trade_no, seller_id, user_id, pay_type, trade_no, total_amount, subject, pay_status, pay_create_time, pay_update_time, pay_content, pt
        FROM ods_pay_info_incr
        where pt in 
            (
                SELECT  from_unixtime(unix_timestamp(pay_create_time),'yyyy-MM-dd')
                FROM ods_pay_info_incr_stg
                group by from_unixtime(unix_timestamp(pay_create_time),'yyyy-MM-dd')
            )
        union all 
        SELECT pay_id, out_trade_no, seller_id, user_id, pay_type, trade_no, total_amount, subject, pay_status, pay_create_time, pay_update_time, pay_content,
        from_unixtime(unix_timestamp(pay_create_time),'yyyy-MM-dd') as pt 
        FROM ods_pay_info_incr_stg
    ) a
) a
where rn =1 ;

--3  ods_refund_pay_info_incr_stg


insert overwrite table ods_refund_pay_info_incr partition(pt)
SELECT refund_pay_id, out_trade_no, seller_id, sku_id, pay_type, trade_no, total_amount, refund_status, refund_create_time, refund_update_time, refund_content, pt
from 
(
    SELECT refund_pay_id, out_trade_no, seller_id, sku_id, pay_type, trade_no, total_amount, refund_status, refund_create_time, refund_update_time, refund_content, pt,
    row_number() over(partition by refund_pay_id order by refund_update_time desc ) as rn
    from
    (
        SELECT refund_pay_id, out_trade_no, seller_id, sku_id, pay_type, trade_no, total_amount, refund_status, refund_create_time, refund_update_time, refund_content, pt
        FROM ods_refund_pay_info_incr
        where pt in 
            (
                SELECT from_unixtime(unix_timestamp(refund_create_time),'yyyy-MM-dd')
                FROM ods_refund_pay_info_incr_stg
                group by from_unixtime(unix_timestamp(refund_create_time),'yyyy-MM-dd')
            )
        union all 
        SELECT refund_pay_id, out_trade_no, seller_id, sku_id, pay_type, trade_no, total_amount, refund_status, refund_create_time, refund_update_time, refund_content,
        from_unixtime(unix_timestamp(refund_create_time),'yyyy-MM-dd') as pt  
        FROM ods_refund_pay_info_incr_stg
    ) a
) a
where rn =1 ;



--4  ods_remark_info_incr_stg


insert overwrite table ods_remark_info_incr partition(pt)
SELECT id, user_id, head_img, sku_id, spu_id, seller_id, appraise, remark_string, create_time, update_time, pt
from 
(
    SELECT id, user_id, head_img, sku_id, spu_id, seller_id, appraise, remark_string, create_time, update_time, pt,
    row_number() over(partition by id order by update_time desc ) as rn
    from
    (
        SELECT id, user_id, head_img, sku_id, spu_id, seller_id, appraise, remark_string, create_time, update_time, pt
        FROM ods_remark_info_incr
        where pt in 
            (
                SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
                FROM ods_remark_info_incr_stg
                group by  
                from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
            )
        union all 
        SELECT id, user_id, head_img, sku_id, spu_id, seller_id, appraise, remark_string, create_time, update_time,
        from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt  
        FROM ods_remark_info_incr_stg
    ) a
) a
where rn =1 ;



--5  ods_seller_detail_activity_incr_stg

insert overwrite table ods_seller_detail_activity_incr partition(pt)
SELECT  id, seller_id, seller_detail_id, activity_id, activity_rule_id, sku_id, create_time, update_time, pt
from 
(
    SELECT id, seller_id, seller_detail_id, activity_id, activity_rule_id, sku_id, create_time, update_time, pt,
    row_number() over(partition by id order by update_time desc ) as rn
    from
    (
        SELECT id, seller_id, seller_detail_id, activity_id, activity_rule_id, sku_id, create_time, update_time, pt
        FROM ods_seller_detail_activity_incr
        where pt in 
            (
                SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
                FROM ods_seller_detail_activity_incr_stg
                group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
            )
        union all 
        SELECT id, seller_id, seller_detail_id, activity_id, activity_rule_id, sku_id, create_time, update_time,
        from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt  
        FROM ods_seller_detail_activity_incr_stg
    ) a
) a
where rn =1 ;



--6  ods_seller_detail_coupon_incr_stg


insert overwrite table ods_seller_detail_coupon_incr partition(pt)
SELECT  id, seller_id, seller_detail_id, coupon_id, coupon_use_id, sku_id, create_time, update_time, pt
from 
(
    SELECT id, seller_id, seller_detail_id, coupon_id, coupon_use_id, sku_id, create_time, update_time, pt,
    row_number() over(partition by id order by update_time desc ) as rn
    from
    (
        SELECT id, seller_id, seller_detail_id, coupon_id, coupon_use_id, sku_id, create_time, update_time, pt
        FROM ods_seller_detail_coupon_incr
        where pt in 
            (
                SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
                FROM ods_seller_detail_coupon_incr_stg
                group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
            )
        union all 
        SELECT id, seller_id, seller_detail_id, coupon_id, coupon_use_id, sku_id, create_time, update_time,
        from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt  
        FROM ods_seller_detail_coupon_incr_stg
    ) a
) a
where rn =1 ;


-- 7  ods_seller_detail_incr_stg

insert overwrite table ods_seller_detail_incr partition(pt)
SELECT seller_detail_id, seller_id, seller_status, sku_id, sku_name, bus_id, seller_price, sku_num, create_time, update_time, source_type, source_id, total_amount, split_activity_amount, split_coupon_amount, pt
from 
(
    SELECT seller_detail_id, seller_id, seller_status, sku_id, sku_name, bus_id, seller_price, sku_num, create_time, update_time, source_type, source_id, total_amount, split_activity_amount, split_coupon_amount, pt,
    row_number() over(partition by seller_detail_id order by update_time desc ) as rn
    from
    (
        SELECT seller_detail_id, seller_id, seller_status, sku_id, sku_name, bus_id, seller_price, sku_num, create_time, update_time, source_type, source_id, total_amount, split_activity_amount, split_coupon_amount, pt
        FROM ods_seller_detail_incr
        where pt in 
            (
                SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
                FROM ods_seller_detail_incr_stg
                group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
            )
        union all 
        SELECT seller_detail_id, seller_id, seller_status, sku_id, sku_name, bus_id, seller_price, sku_num, create_time, update_time, source_type, source_id, total_amount, split_activity_amount, split_coupon_amount,
        from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt  
        FROM ods_seller_detail_incr_stg
    ) a
) a
where rn =1 ;


--8 ods_seller_info_incr_stg

insert overwrite table ods_seller_info_incr partition(pt)
SELECT seller_id, user_id, seller_status, buyer_name, buyer_mobile, buyer_province_id, buyer_address_province, buyer_address_city, buyer_address_area, buyer_address_detail, total_amount, activity_reduce_amount, coupon_reduce_amount, original_total_amount, payment_way, seller_comment, out_trade_no, trade_body, process_status, tracking_no, feight_fee, create_time, update_time, pt
from 
(
    SELECT seller_id, user_id, seller_status, buyer_name, buyer_mobile, buyer_province_id, buyer_address_province, buyer_address_city, buyer_address_area, buyer_address_detail, total_amount, activity_reduce_amount, coupon_reduce_amount, original_total_amount, payment_way, seller_comment, out_trade_no, trade_body, process_status, tracking_no, feight_fee, create_time, update_time, pt,
    row_number() over(partition by seller_id order by update_time desc ) as rn
    from
    (
        SELECT seller_id, user_id, seller_status, buyer_name, buyer_mobile, buyer_province_id, buyer_address_province, buyer_address_city, buyer_address_area, buyer_address_detail, total_amount, activity_reduce_amount, coupon_reduce_amount, original_total_amount, payment_way, seller_comment, out_trade_no, trade_body, process_status, tracking_no, feight_fee, create_time, update_time, pt
        FROM ods_seller_info_incr
        where pt in 
            (
                SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
                FROM ods_seller_info_incr_stg
                group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
            )
        union all 
        SELECT seller_id, user_id, seller_status, buyer_name, buyer_mobile, buyer_province_id, buyer_address_province, buyer_address_city, buyer_address_area, buyer_address_detail, total_amount, activity_reduce_amount, coupon_reduce_amount, original_total_amount, payment_way, seller_comment, out_trade_no, trade_body, process_status, tracking_no, feight_fee, create_time, update_time,
        from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt  
        FROM ods_seller_info_incr_stg
    ) a
) a
where rn =1 ;


--9 ods_seller_refund_info_incr_stg

insert overwrite table ods_seller_refund_info_incr partition(pt)
SELECT seller_refund_id, user_id, seller_id, sku_id, refund_type, refund_num, refund_amount, refund_reason_type, refund_reason_txt, refund_status, create_time, update_time, pt
from 
(
    SELECT seller_refund_id, user_id, seller_id, sku_id, refund_type, refund_num, refund_amount, refund_reason_type, refund_reason_txt, refund_status, create_time, update_time, pt,
    row_number() over(partition by seller_refund_id order by update_time desc ) as rn
    from
    (
        SELECT seller_refund_id, user_id, seller_id, sku_id, refund_type, refund_num, refund_amount, refund_reason_type, refund_reason_txt, refund_status, create_time, update_time, pt
        FROM ods_seller_refund_info_incr
        where pt in 
            (
                SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
                FROM ods_seller_refund_info_incr_stg
                group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
            )
        union all 
        SELECT seller_refund_id, user_id, seller_id, sku_id, refund_type, refund_num, refund_amount, refund_reason_type, refund_reason_txt, refund_status, create_time, update_time,
        from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt  
        FROM ods_seller_refund_info_incr_stg
    ) a
) a
where rn =1 ;


--10 ods_user_info_incr_stg

insert overwrite table ods_user_info_incr partition(pt)
SELECT user_id, login_name, nick_name, passwd, user_name, phone_num, email, head_img, user_level, birthday, gender, create_time, update_time, `status`, pt
from 
(
    SELECT user_id, login_name, nick_name, passwd, user_name, phone_num, email, head_img, user_level, birthday, gender, create_time, update_time, `status`, pt,
    row_number() over(partition by user_id order by update_time desc ) as rn
    from
    (
        SELECT user_id, login_name, nick_name, passwd, user_name, phone_num, email, head_img, user_level, birthday, gender, create_time, update_time, `status`, pt
        FROM ods_user_info_incr
        where pt in 
            (
                SELECT from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
                FROM ods_user_info_incr_stg
                group by from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd')
            )
        union all 
        SELECT user_id, login_name, nick_name, passwd, user_name, phone_num, email, head_img, user_level, birthday, gender, create_time, update_time, `status`,
        from_unixtime(unix_timestamp(create_time),'yyyy-MM-dd') as pt  
        FROM ods_user_info_incr_stg
    ) a
) a
where rn =1 ;
