I. Refreshing a table's contents, replacing them with data from an Excel sheet
1. In Hue, point to the CSV path and create the table. The resulting table definition looks like:
createtab_stmt
CREATE TABLE `test.table_name`(
  `field_1` string COMMENT 'from deserializer',
  `field_2` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'escapeChar'='\\',
  'quoteChar'='"',
  'separatorChar'=',')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://xxx-cluster/user/hive/warehouse/test.db/table_name'
TBLPROPERTIES (
  'bucketing_version'='2',
  'transient_lastDdlTime'='1615960711')
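(The definition above is what Hive's show create table prints; running the same statement against the production table is how to fetch its properties for the test build mentioned in the Tips below.)
show create table test.table_name;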
After testing against the test table, operate on the production table:
2. Back up the existing data
hadoop fs -cp /user/hive/warehouse/ods.db/table_name /user/bantongshui/upload_src/
3. Upload the new data to HDFS
hadoop fs -put table_name.csv /user/bantongshui/upload_src/
4. Load it into the table
truncate table ods.table_name;
load data inpath '/user/bantongshui/upload_src/table_name.csv' into table ods.table_name;
The table data can now be queried in Hive.
Impala, however, needs a metadata refresh (invalidate metadata table_name) before the table can be queried (the table path now contains the CSV file rather than files named like 000000_0); otherwise it throws: No such file or directory.
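For example, in impala-shell (the ods database name follows the paths above; the count query is just a hypothetical sanity check):
invalidate metadata ods.table_name;
select count(1) from ods.table_name;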
Tips:
Since the table is already in use in production, create a test table from the production table's properties first, and back up the original data, to make sure the import format is correct and to leave room for a rollback.
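If anything looks wrong after the load, the backup allows a rollback; a rough sketch using the paths above (the exact filenames under the backup directory depend on the table's original files):
hadoop fs -rm /user/hive/warehouse/ods.db/table_name/table_name.csv
hadoop fs -cp /user/bantongshui/upload_src/table_name/* /user/hive/warehouse/ods.db/table_name/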
II. Data validation: when table data needs updating, test ahead of time and validate the data to understand the scope and accuracy of the changes. The query below counts rows whose concatenated field values differ between the current table and its updated version:
select count(1)
from (
    select concat(nvl(t1.field1,''),nvl(cast(t1.field2 as string),''),...,nvl(cast(t1.fieldn as string),'')) ss1
          ,concat(nvl(t2.field1,''),nvl(cast(t2.field2 as string),''),...,nvl(cast(t2.fieldn as string),'')) ss2
    from
    (
        select *
        from tableA
    ) t1
    join (
        select *
        from update_tableA
    ) t2
    on t1.lidu_field = t2.lidu_field -- join on the grain key
    where t1.field <> t2.field -- the updated column(s)
) t
where ss1 <> ss2
;
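The inner join above only compares rows whose grain key exists on both sides; a hypothetical companion check lists keys present in only one of the two tables:
select t1.lidu_field as old_key, t2.lidu_field as new_key
from tableA t1
full outer join update_tableA t2
on t1.lidu_field = t2.lidu_field
where t1.lidu_field is null or t2.lidu_field is null;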
III. Getting to know Airflow SubDAGs
Reference (official docs): https://airflow.apache.org/docs/apache-airflow/stable/concepts.html#
SubDAGs are perfect for repeating patterns. Defining a function that returns a DAG object is a nice design pattern when using Airflow.
Airbnb uses the stage-check-exchange pattern when loading data. Data is staged in a temporary table, after which data quality checks are performed against that table. Once the checks all pass the partition is moved into the production table.
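A hypothetical Hive sketch of that stage-check-exchange flow (staging_table, prod_table, source_table, the field names, and the dt partition are all made-up; assume both tables share the same schema and partitioning):
-- 1. stage: write the day's data into the staging table
insert overwrite table staging_table partition (dt='2019-10-21')
select field_1, field_2 from source_table where dt = '2019-10-21';
-- 2. check: run data quality checks against the staged partition
select count(1) from staging_table where dt = '2019-10-21' and field_1 is null; -- expect 0
-- 3. exchange: move the verified partition into the production table
alter table prod_table exchange partition (dt='2019-10-21') with table staging_table;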
Setup: install the airflow package in PyCharm: pip install apache-airflow==1.10.10
from datetime import datetime
from airflow import DAG
# from airflow.operators.dummy import DummyOperator  # Airflow 2.x import path
from airflow.operators.dummy_operator import DummyOperator  # Airflow 1.10 path
# from airflow.operators.subdag import SubDagOperator  # Airflow 2.x import path
# from airflow.example_dags.subdags.subdag import subdag  # built-in example factory (parallel tasks)
from airflow.operators.subdag_operator import SubDagOperator  # Airflow 1.10 path
DAG_NAME = 'example_subdag_operator'
args = {
    'owner': 'hadoop',
    'start_date': datetime(2019, 10, 21)
}

dag = DAG(
    dag_id=DAG_NAME, default_args=args, schedule_interval="@once"
)

start = DummyOperator(
    task_id='start',
    dag=dag,
)
def subdag(parent_dag_name, child_dag_name, args):
    # factory function returning a DAG to be used as a subdag
    dag_subdag = DAG(
        dag_id='%s.%s' % (parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval="@daily",
    )
    last_task = None
    for i in range(5):
        curr_task = DummyOperator(
            task_id='%s-task-%s' % (child_dag_name, i + 1),
            default_args=args,
            dag=dag_subdag,
        )
        # chain each task after the previous one, so they run serially
        if last_task:
            last_task >> curr_task
        last_task = curr_task
    return dag_subdag
section_1 = SubDagOperator(
    task_id='section-1',
    subdag=subdag(DAG_NAME, 'section-1', args),
    dag=dag,
)

some_other_task = DummyOperator(
    task_id='some-other-task',
    dag=dag,
)

section_2 = SubDagOperator(
    task_id='section-2',
    subdag=subdag(DAG_NAME, 'section-2', args),
    dag=dag,
)

end = DummyOperator(
    task_id='end',
    dag=dag,
)
start >> section_1 >> some_other_task >> section_2 >> end
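Once the file is placed in the dags folder, the parsed structure can be sanity-checked with the Airflow 1.10 CLI before looking at the UI:
airflow list_dags
airflow list_tasks example_subdag_operator --tree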
Result:
1. A new DAG appears in Airflow.
2. Click Graph View -> section-1 -> Zoom into Sub DAG to see the subdag's five tasks chained one after another (section-1-task-1 through section-1-task-5).
P.S. If the subdag used is instead the one defined in airflow.example_dags.subdags.subdag, the tasks inside section-1 run in parallel.
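For reference, a minimal sketch of that parallel variant: the loop body in the subdag function above simply creates the five tasks without chaining them, so no dependencies are set between them:
    for i in range(5):
        DummyOperator(
            task_id='%s-task-%s' % (child_dag_name, i + 1),
            default_args=args,
            dag=dag_subdag,
        )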