etlsdk Plugin Development
Quick-Start Plugin
Example of running the command from a terminal:
API
DatasourceFactory() class
read_dataframe(input, **kwargs)
write_dataframe(df, output, **kwargs)
Table() class:
db_table_name
full_name
name
db_name
schema
partition
all_columns
id
storage_type
storage_settings
extras
ETLSession() class
spark_session
inputs
outputs
args
spark_conf(config)
ETLSessionUT() class:
create_etlsession(inputs, outputs, args):
get_sparksession():
get_writeout_results():
Quick-Start Plugin
from etlsdk.lib.datasources.datasource_factory import DatasourceFactory

class WeixinProcessPlugin():
    # etlsdk passes in inputs, outputs, args
    def joinimage(self, inputs, outputs, args):
        """
        :param inputs: datasources uniquely identified by dayu_id and dayu_full_name, each carrying its corresponding dataframe. Example:
            inputs = {
                "image": {
                    "type": "hive",
                    "name": "table_name1",
                    "dayu_full_name": "Hive:testdb:table_name1",
                    "dayu_id": "10",
                    "partition": [{"tdate": "2019-01-01"}],
                    "table": the instance of Table,
                    "df": DataFrame,
                },
                "article": {
                    "type": "hive",
                    "name": "table_name2",
                    "dayu_full_name": "Hive:testdb:table_name2",
                    "dayu_id": "11",
                    "partition": {"tdate": "2019-01-01"},
                    "table": the instance of Table,
                    "df": DataFrame
                }
            }
            This structure corresponds to the command-line arguments --inputs image:name=Hive:testdb:table_name1\
                --inputs article:name=Hive:testdb:table_name2\
                --partition 2019-01-01
        :param outputs: same structure as inputs, but without the df field. Example:
            outputs = {
                "image_join_article": {
                    "type": "hive",
                    "name": "table_name3",
                    "dayu_full_name": "Hive:testdb:table_name3",
                    "dayu_id": "12",
                    "partition": {"tdate": "2019-01-01"},
                    "table": the instance of Table
                }
            }
        :param args: other configuration for the plugin run; base_args is merged into it as well, and a newly initialized spark_session instance is stored according to spark_conf:
            args = {
            }
        :return: no return value; results are written to the outputs datasources according to the configuration
        """
        article = inputs['article']['df']
        image = inputs['image']['df']
        image_df = <df transform>
        DatasourceFactory.write_dataframe(image_df, outputs['image_join_article'])  # write the df to the corresponding schema based on the date and the table schema

    def image2es(self, inputs, outputs, args):
        pass
Example of running the command from a terminal:
python3 -m etlsdk data_pipeline.plugins.WeixinProcessPlugin.joinimage
    --inputs image:name=Hive:testdb:table_name1\
    --inputs article:name=Hive:testdb:table_name2\
    --output image_join_article:name=Hive:testdb:table_name3\
    --executor_num 5\
    --partition "2019-01-01"
API
DatasourceFactory() class
Factory class for obtaining Datasources. Currently supported datasource types: ['oss', 'es', 'hive', 'kafka']
For the datasources supported when ETLSession isstreaming is True/False, see: https://git.aipp.io/pub/wiki/blob/master/系统/Pony/etlsdk/实时流支持情况.md
read_dataframe(input, **kwargs)
Read a df according to input
:param input(dict):
input required keys: (`dayu_id`) or (`dayu_full_name`)
:key `dayu_id`: same as the table `id` shown on the Dayu platform. value type: int
:key `dayu_full_name`: same as the table `full name` shown on the Dayu platform. value type: str.
Optional keys: (`partition`)
:key `partition`: the table partition(s) to read. default None: read the whole table. value type: {$partition_column_name: $partition_column_value} (list supported).
return DataFrame
Example:
from etlsdk.lib.datasources.datasource_factory import DatasourceFactory
input = {
    "dayu_id": 1740,
    "partition": [{"tdate": "2019-01-01"}]
}
df = DatasourceFactory.read_dataframe(input)
write_dataframe(df, output, **kwargs)
Write the df out to the output table
:param df: the instance of DataFrame
:param output(dict):
output required keys: (`dayu_id`, `partition`) or (`dayu_full_name`, `partition`)
:key `dayu_id`: same as the table `id` shown on the Dayu platform. value type: int
:key `dayu_full_name`: same as the table `full name` shown on the Dayu platform. value type: str.
:key `partition`: the table partition to write to. value type: {$partition_column_name: $partition_column_value}.
:kwargs dqc(dict): run DQC checks on the data. default: {"rules": [{'rule': 'count', 'restrict': 1}]}.
DQC configuration reference: https://git.aipp.io/pub/wiki/blob/master/系统/Pony/ETL文档/DQC.md
:kwargs uuid(dict): required when the output table's storage type is oss or es.
Example:
from etlsdk.lib.datasources.datasource_factory import DatasourceFactory
output = {
    "dayu_full_name": "Hive:parsed:dcrawl_parsed_weixin",
    "partition": [{"tdate": "2019-01-01"}]
}
dqc_config = {"rules": [{'rule': 'count', 'restrict': 10}]}
DatasourceFactory.write_dataframe(df, output, dqc=dqc_config)  # df: the DataFrame to write out
Table() class:
Holds the table's Dayu schema, connector configuration, and other information.
db_table_name:
Returns db_name.table_name
full_name:
Returns the Dayu `full name`
name:
Returns the table name
db_name:
Returns the table's Dayu database name
schema:
Returns the Dayu table column schema: list of column.
column: {"name": $column_name, "type": $column_type, "comment": }
partition:
Returns the Dayu table partitions: list of partition.
partition: {"name": $partition_name, "type": $partition_type, "comment": }
all_columns:
Returns all Dayu table columns (schema plus partitions): list of column.
column: {"name": $column_name, "type": $column_type, "comment": }
id:
Returns the Dayu table id
storage_type:
Returns the table's storage type. enum: ['oss', 'es', 'hive']
storage_settings:
Returns the table's storage configuration.
extras:
Returns the table's extra configuration (see the sketch after this list).
When storage_type is oss, extras returns {"prefix_pattern": $prefix_pattern}.
When storage_type is es, extras returns {"doc_type": $doc_type, "source": "article", "mapping": $mapping}.
Otherwise, extras returns {}.
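Example (a minimal sketch inside a plugin function, using the `image` input from the quick-start example; it assumes the property names listed above map directly to Python attributes of the Table instance):
table = inputs['image']['table']  # the Table instance attached to an input
print(table.full_name)      # e.g. "Hive:testdb:table_name1"
print(table.db_table_name)  # "testdb.table_name1"
print(table.schema)         # [{"name": ..., "type": ..., "comment": ...}, ...]
if table.storage_type == 'oss':
    prefix_pattern = table.extras['prefix_pattern']  # oss-only extra configuration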
ETLSession() class
Users obtain an ETLSession instance via `ETLSession.get_instance()`. ETLSession holds the ETL variables: spark_session, inputs, outputs, and args.
spark_session
spark session. The SparkSession created from the configuration.
inputs
The plugin's inputs. Format: {
"name": $dayu_name,
"type": $table_storage_type,
"df": Dataframe,
"dayu_id": $dayu_id,
"dayu_full_name": $dayu_full_name,
"partition": $partition,
"table": the instance of [Table](https://git.aipp.io/pub/wiki/blob/master/系统/Pony/ETL上线文档/ETLSDK/etlsdk_plugin开发.md#Table).
}
outputs
The plugin's outputs. Format: {
"name": $dayu_name,
"type": $table_storage_type,
"dayu_id": $dayu_id,
"dayu_full_name": $dayu_full_name,
"partition": $partition,
"table": the instance of [Table](https://git.aipp.io/pub/wiki/blob/master/系统/Pony/ETL上线文档/ETLSDK/etlsdk_plugin开发.md#Table).
}
args
The plugin's args. Format: dict.
Example:
from etlsdk.lib.session import ETLSession
etl_session = ETLSession.get_instance()
spark = etl_session.spark_session
df = spark.table('default.dcrawl_parsed_cars')
inputs = etl_session.inputs
>>> print(etl_session.inputs)
{'RawTable': {'dayu_full_name': 'Hive:etlsdk_test:raw_table',
'dayu_id': 1730,
'df': DataFrame[key: string, json: string, tdate: string],
'name': 'raw_table',
'partition': [{'partition_date': '2019-01-01'}],
'table': <etlsdk.lib.handlers.table_handler.Table object at 0x7f3100910a90>,
'type': 'hive'}}
>>> print(etl_session.outputs)
{'ParsedTable': {'dayu_full_name': 'Hive:etlsdk_test:parsed_table',
'dayu_id': 1740,
'name': 'parsed_table',
'partition': {'partition_date': '2019-01-01'},
'table': <etlsdk.lib.handlers.table_handler.Table object at 0x7f310090cf98>,
'type': 'hive'}}
spark_conf(config)
Decorator that sets the Spark configuration for a plugin function
:param config(dict): the Spark configuration
:key config: spark configuration. value type: {$spark_conf_key: $spark_conf_value}.
Available config options: https://spark.apache.org/docs/latest/configuration.html
:key dependency: packages the function depends on.
dependency value type: {$module_import_name: $module_url}.
module_url: supports pypi links and hdfs storage paths. Files stored on hdfs must be a zip archive of the module's setup directory.
Example:
from etlsdk.session import spark_conf

class Plugin():
    spark_config = {
        "config": {
            "spark.executor.instances": '2',
            "spark.executor.memory": '1g',
            "spark.partition.num": '100'
        },
        "dependency": {
            "jieba": "https://pypi.aidigger.com/packages/EigenJieba-0.0.1.tar.gz"
        }
    }

    @spark_conf(spark_config)
    def run(inputs, outputs, args):
        pass
ETLSessionUT() class:
Used to create an ETLSession for unit tests.
create_etlsession(inputs, outputs, args):
Creates an ETLSession. The spark_session is configured with [('spark.executor.cores', '2'), ('spark.executor.instances', '1'), ('spark.executor.memory', '2g')] and runs in local[2] mode.
:param inputs(list): the plugin's inputs. value type: list of input. input type: dict. inputs may be [].
input required keys: (`name`, `dayu_full_name`, `mock_datas`)
:key `name`: input alias name. value type: string.
:key `dayu_full_name`: same as the table `full name` shown on the Dayu platform. value type: str.
:key `mock_datas`: used to mock the input df. value type: list of Row.asDict().
Optional keys: (`partition`)
:key `partition`: the table partition(s) to read. default None: read the whole table. value type: {$partition_column_name: $partition_column_value} (list supported).
:param outputs(list): the plugin's outputs. value type: list of output. output type: dict. outputs may be [].
output required keys: (`name`, `dayu_id`, `partition`) or (`name`, `dayu_full_name`, `partition`)
:key `name`: output alias name. value type: string.
:key `dayu_id`: same as the table `id` shown on the Dayu platform. value type: int
:key `dayu_full_name`: same as the table `full name` shown on the Dayu platform. value type: str.
:key `partition`: the table partition to write to. value type: {$partition_column_name: $partition_column_value}.
:param args(dict): the plugin's args. format: {$args_key: $args_value}
args_key options: (`isstreaming`, $user_args_key)
:key `isstreaming`: toggles ETLSession streaming mode. default: False. value type: bool.
:key $user_args_key: args used inside the user's plugin. user_args_value type: string.
return an `ETLSession` instance
Example:
from etlsdk.tools.ut_utils import ETLSessionUT
from data_pipeline.plugins.oss2hive import OSS2HivePlugin
mock_datas = [{"key":str(num), "json":"json %d"%num, "tdate":"2019-01-01"} for num in range(10)]
inputs = [
    {
        "name": "RawTable",
        "dayu_full_name": "Hive:etlsdk_test:raw_table",
        "partition": [{"partition_date": "2019-01-01"}],
        "mock_datas": mock_datas
    }
]
outputs = [
    {
        "name": "ParsedTable",
        "dayu_full_name": "Hive:etlsdk_test:parsed_table",
        "partition": {"partition_date": "2019-01-01"}
    }
]
args = {'test_args': 'test_args_value'}
etlsession = ETLSessionUT.create_etlsession(inputs, outputs, args)
>>> print(etlsession.inputs)
{'RawTable': {'dayu_full_name': 'Hive:etlsdk_test:raw_table',
'dayu_id': 1730,
'df': DataFrame[key: string, json: string, tdate: string],
'name': 'raw_table',
'partition': [{'partition_date': '2019-01-01'}],
'table': <etlsdk.lib.handlers.table_handler.Table object at 0x7f3100910a90>,
'type': 'hive'}}
>>> print(etlsession.outputs)
{'ParsedTable': {'dayu_full_name': 'Hive:etlsdk_test:parsed_table',
'dayu_id': 1740,
'name': 'parsed_table',
'partition': {'partition_date': '2019-01-01'},
'table': <etlsdk.lib.handlers.table_handler.Table object at 0x7f310090cf98>,
'type': 'hive'}}
>>> print(etlsession.args)
{'test_args': 'test_args_value'}
oss2hive = OSS2HivePlugin()
oss2hive.run(etlsession.inputs, etlsession.outputs, etlsession.args)
writeout_results = etlsession.get_writeout_results()
get_sparksession():
Returns a SparkSession configured with [('spark.executor.cores', '2'), ('spark.executor.instances', '1'), ('spark.executor.memory', '2g')], running in local[2] mode.
return SparkSession
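Example (a minimal sketch; the DataFrame contents are invented for illustration):
from etlsdk.tools.ut_utils import ETLSessionUT
spark = ETLSessionUT.get_sparksession()
# build a small local DataFrame for test fixtures
df = spark.createDataFrame([('1', 'json 1', '2019-01-01')], ['key', 'json', 'tdate'])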
get_writeout_results():
Returns the df results that were passed to DatasourceFactory.write_dataframe.
return {$dayu_full_name: $df_datas}.
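Example (a minimal sketch continuing the create_etlsession example above; it assumes $df_datas is the collection of rows written for each output full name):
writeout_results = etlsession.get_writeout_results()
parsed_datas = writeout_results['Hive:etlsdk_test:parsed_table']
assert len(parsed_datas) == 10  # hypothetical: assumes the plugin writes one row per mocked input record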
etlsdk
isstreaming: bool. default: False
True: run in `streaming` mode
False: run in `batch` mode
*Note: a single task supports only one mode, streaming or batch*
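Example (a minimal sketch of switching a task to streaming mode via args, reusing the inputs and outputs from the create_etlsession example above):
args = {'isstreaming': True}  # the whole task now runs in streaming mode
etlsession = ETLSessionUT.create_etlsession(inputs, outputs, args)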
Streaming and batch support in etlsdk
1. OSS

| Read/Write | Streaming | Batch |
| --- | --- | --- |
| read_dataframe | Not supported | Supported |
| write_dataframe | Supported | Supported |

2. HIVE

| Read/Write | Streaming | Batch |
| --- | --- | --- |
| read_dataframe | Not supported | Supported |
| write_dataframe | Not supported | Supported |

3. ES

| Read/Write | Streaming | Batch |
| --- | --- | --- |
| read_dataframe | Not supported | Supported |
| write_dataframe | Supported | Supported |

4. MYSQL

| Read/Write | Streaming | Batch |
| --- | --- | --- |
| read_dataframe | Not supported | Supported |
| write_dataframe | Supported | Supported |

5. Kafka

| Read/Write | Streaming | Batch |
| --- | --- | --- |
| read_dataframe | Supported | Not supported |
| write_dataframe | Supported | Not supported |
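Example (a hypothetical sketch based on the tables above: Kafka is the only source that supports streaming reads, so a streaming task would read from a Kafka table; the full name below is invented for illustration):
from etlsdk.lib.datasources.datasource_factory import DatasourceFactory
# hypothetical Kafka table registered on the Dayu platform
kafka_input = {'dayu_full_name': 'Kafka:testdb:event_stream'}
# with args = {'isstreaming': True}, read_dataframe yields a streaming DataFrame;
# per the tables above it can be written out to OSS, ES, MySQL, or Kafka,
# but not to Hive (no streaming write support)
df = DatasourceFactory.read_dataframe(kafka_input)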