I. Refreshing a table's contents, replacing them with data from an Excel sheet
1. In Hue, point to the CSV path and create the table. The resulting table definition looks like:
createtab_stmt
CREATE TABLE `test.table_name`(
  `field_1` string COMMENT 'from deserializer',
  `field_2` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'escapeChar'='\\',
  'quoteChar'='"',
  'separatorChar'=',')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://xxx-cluster/user/hive/warehouse/test.db/table_name'
TBLPROPERTIES (
  'bucketing_version'='2',
  'transient_lastDdlTime'='1615960711')
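(The definition above is what Hive's show create table prints; running the same statement against the production table is how to fetch its properties for the test build mentioned in the Tips below.)
show create table test.table_name;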
After testing against the test table, operate on the production table:
2. Back up the existing data
hadoop fs -cp /user/hive/warehouse/ods.db/table_name /user/bantongshui/upload_src/
3. Upload the new data to HDFS
hadoop fs -put table_name.csv /user/bantongshui/upload_src/
4. Load it into the table
truncate table ods.table_name;
load data inpath '/user/bantongshui/upload_src/table_name.csv' into table ods.table_name;
The table data can now be queried in Hive.
Impala, however, needs a metadata refresh (invalidate metadata table_name) before the table can be queried (the table path now contains the CSV file rather than files named like 000000_0); otherwise it throws: No such file or directory.
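For example, in impala-shell (the ods database name follows the paths above; the count query is just a hypothetical sanity check):
invalidate metadata ods.table_name;
select count(1) from ods.table_name;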
Tips:
Since the table is already in use in production, create a test table from the production table's properties first, and back up the original data, to make sure the import format is correct and to leave room for a rollback.
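If anything looks wrong after the load, the backup allows a rollback; a rough sketch using the paths above (the exact filenames under the backup directory depend on the table's original files):
hadoop fs -rm /user/hive/warehouse/ods.db/table_name/table_name.csv
hadoop fs -cp /user/bantongshui/upload_src/table_name/* /user/hive/warehouse/ods.db/table_name/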
II. Data validation: when table data needs updating, test ahead of time and validate the data to understand the scope and accuracy of the changes. The query below counts rows whose concatenated field values differ between the current table and its updated version:
select count(1)
from (
    select concat(nvl(t1.field1,''),nvl(cast(t1.field2 as string),''),...,nvl(cast(t1.fieldn as string),'')) ss1
          ,concat(nvl(t2.field1,''),nvl(cast(t2.field2 as string),''),...,nvl(cast(t2.fieldn as string),'')) ss2
    from
    (
        select *
        from tableA
    ) t1
    join (
        select *
        from update_tableA
    ) t2
    on t1.lidu_field = t2.lidu_field -- join on the grain key
    where t1.field <> t2.field -- the updated column(s)
) t
where ss1 <> ss2
;
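The inner join above only compares rows whose grain key exists on both sides; a hypothetical companion check lists keys present in only one of the two tables:
select t1.lidu_field as old_key, t2.lidu_field as new_key
from tableA t1
full outer join update_tableA t2
on t1.lidu_field = t2.lidu_field
where t1.lidu_field is null or t2.lidu_field is null;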
III. Getting to know Airflow SubDAGs
Reference (official docs): https://airflow.apache.org/docs/apache-airflow/stable/concepts.html#
SubDAGs are perfect for repeating patterns. Defining a function that returns a DAG object is a nice design pattern when using Airflow.
Airbnb uses the stage-check-exchange pattern when loading data. Data is staged in a temporary table, after which data quality checks are performed against that table. Once the checks all pass the partition is moved into the production table.
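A hypothetical Hive sketch of that stage-check-exchange flow (staging_table, prod_table, source_table, the field names, and the dt partition are all made-up; assume both tables share the same schema and partitioning):
-- 1. stage: write the day's data into the staging table
insert overwrite table staging_table partition (dt='2019-10-21')
select field_1, field_2 from source_table where dt = '2019-10-21';
-- 2. check: run data quality checks against the staged partition
select count(1) from staging_table where dt = '2019-10-21' and field_1 is null; -- expect 0
-- 3. exchange: move the verified partition into the production table
alter table prod_table exchange partition (dt='2019-10-21') with table staging_table;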
Setup: install the airflow package in PyCharm: pip install apache-airflow==1.10.10
from datetime import datetime
from airflow import DAG
# from airflow.operators.dummy import DummyOperator  # Airflow 2.x import path
from airflow.operators.dummy_operator import DummyOperator  # Airflow 1.10 path
# from airflow.operators.subdag import SubDagOperator  # Airflow 2.x import path
# from airflow.example_dags.subdags.subdag import subdag  # built-in example factory (parallel tasks)
from airflow.operators.subdag_operator import SubDagOperator  # Airflow 1.10 path
DAG_NAME = 'example_subdag_operator'
args = {
    'owner': 'hadoop',
    'start_date': datetime(2019, 10, 21)
}

dag = DAG(
    dag_id=DAG_NAME, default_args=args, schedule_interval="@once"
)

start = DummyOperator(
    task_id='start',
    dag=dag,
)
def subdag(parent_dag_name, child_dag_name, args):
    # factory function returning a DAG to be used as a subdag
    dag_subdag = DAG(
        dag_id='%s.%s' % (parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval="@daily",
    )
    last_task = None
    for i in range(5):
        curr_task = DummyOperator(
            task_id='%s-task-%s' % (child_dag_name, i + 1),
            default_args=args,
            dag=dag_subdag,
        )
        # chain each task after the previous one, so they run serially
        if last_task:
            last_task >> curr_task
        last_task = curr_task
    return dag_subdag
section_1 = SubDagOperator(
    task_id='section-1',
    subdag=subdag(DAG_NAME, 'section-1', args),
    dag=dag,
)

some_other_task = DummyOperator(
    task_id='some-other-task',
    dag=dag,
)

section_2 = SubDagOperator(
    task_id='section-2',
    subdag=subdag(DAG_NAME, 'section-2', args),
    dag=dag,
)

end = DummyOperator(
    task_id='end',
    dag=dag,
)
start >> section_1 >> some_other_task >> section_2 >> end
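Once the file is placed in the dags folder, the parsed structure can be sanity-checked with the Airflow 1.10 CLI before looking at the UI:
airflow list_dags
airflow list_tasks example_subdag_operator --tree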
Result:
1. A new DAG appears in Airflow.
2. Click Graph View -> section-1 -> Zoom into Sub DAG to see the subdag's five tasks chained one after another (section-1-task-1 through section-1-task-5).
P.S. If the subdag used is instead the one defined in airflow.example_dags.subdags.subdag, the tasks inside section-1 run in parallel.
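For reference, a minimal sketch of that parallel variant: the loop body in the subdag function above simply creates the five tasks without chaining them, so no dependencies are set between them:
    for i in range(5):
        DummyOperator(
            task_id='%s-task-%s' % (child_dag_name, i + 1),
            default_args=args,
            dag=dag_subdag,
        )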