etlsdk Plugin Development
Quick-Start Plugin
Example of running the command from a terminal:
API
DatasourceFactory() class
read_dataframe(input, **kwargs)
write_dataframe(df, output, **kwargs)
Table() class:
db_table_name
full_name
name
db_name
schema
partition
all_columns
id
storage_type
storage_settings
extras
ETLSession() class
spark_session
inputs
outputs
args
spark_conf(config)
ETLSessionUT() class:
create_etlsession(inputs, outputs, args):
get_sparksession():
get_writeout_results():
Quick-Start Plugin
from etlsdk.lib.datasources.datasource_factory import DatasourceFactory

class WeixinProcessPlugin():
    # etlsdk passes in inputs, outputs, args
    def joinimage(self, inputs, outputs, args):
        """
        :param inputs: datasources uniquely identified by dayu_id and dayu_full_name, each carrying its corresponding dataframe. Example:
            inputs = {
                "image": {
                    "type": "hive",
                    "name": "table_name1",
                    "dayu_full_name": "Hive:testdb:table_name1",
                    "dayu_id": "10",
                    "partition": [{"tdate": "2019-01-01"}],
                    "table": the instance of Table,
                    "df": DataFrame,
                },
                "article": {
                    "type": "hive",
                    "name": "table_name2",
                    "dayu_full_name": "Hive:testdb:table_name2",
                    "dayu_id": "11",
                    "partition": {"tdate": "2019-01-01"},
                    "table": the instance of Table,
                    "df": DataFrame
                }
            }
            This structure corresponds to the command-line arguments --inputs image:name=Hive:testdb:table_name1\
                --inputs article:name=Hive:testdb:table_name2\
                --partition 2019-01-01
        :param outputs: same structure as inputs, but without the df field. Example:
            outputs = {
                "image_join_article": {
                    "type": "hive",
                    "name": "table_name3",
                    "dayu_full_name": "Hive:testdb:table_name3",
                    "dayu_id": "12",
                    "partition": {"tdate": "2019-01-01"},
                    "table": the instance of Table
                }
            }
        :param args: other configuration for the plugin run; base_args is merged into it as well, and a newly initialized spark_session instance is stored according to spark_conf:
            args = {
            }
        :return: no return value; results are written to the outputs datasources according to the configuration
        """
        article = inputs['article']['df']
        image = inputs['image']['df']
        image_df = <df transform>
        DatasourceFactory.write_dataframe(image_df, outputs['image_join_article'])  # write the df to the corresponding schema based on the date and the table schema

    def image2es(self, inputs, outputs, args):
        pass
Example of running the command from a terminal:
python3 -m etlsdk data_pipeline.plugins.WeixinProcessPlugin.joinimage
    --inputs image:name=Hive:testdb:table_name1\
    --inputs article:name=Hive:testdb:table_name2\
    --output image_join_article:name=Hive:testdb:table_name3\
    --executor_num 5\
    --partition "2019-01-01"
API
DatasourceFactory() class
Factory class for obtaining Datasources. Currently supported datasource types: ['oss', 'es', 'hive', 'kafka']
For the datasources supported when ETLSession isstreaming is True/False, see: https://git.aipp.io/pub/wiki/blob/master/系统/Pony/etlsdk/实时流支持情况.md
read_dataframe(input, **kwargs)
Read a df according to input
:param input(dict):
input required keys: (`dayu_id`) or (`dayu_full_name`)
:key `dayu_id`: same as the table `id` shown on the Dayu platform. value type: int
:key `dayu_full_name`: same as the table `full name` shown on the Dayu platform. value type: str.
Optional keys: (`partition`)
:key `partition`: the table partition(s) to read. default None: read the whole table. value type: {$partition_column_name: $partition_column_value} (list supported).
return DataFrame
Example:
from etlsdk.lib.datasources.datasource_factory import DatasourceFactory
input = {
    "dayu_id": 1740,
    "partition": [{"tdate": "2019-01-01"}]
}
df = DatasourceFactory.read_dataframe(input)
write_dataframe(df, output, **kwargs)
Write the df out to the output table
:param df: the instance of DataFrame
:param output(dict):
output required keys: (`dayu_id`, `partition`) or (`dayu_full_name`, `partition`)
:key `dayu_id`: same as the table `id` shown on the Dayu platform. value type: int
:key `dayu_full_name`: same as the table `full name` shown on the Dayu platform. value type: str.
:key `partition`: the table partition to write to. value type: {$partition_column_name: $partition_column_value}.
:kwargs dqc(dict): run DQC checks on the data. default: {"rules": [{'rule': 'count', 'restrict': 1}]}.
DQC configuration reference: https://git.aipp.io/pub/wiki/blob/master/系统/Pony/ETL文档/DQC.md
:kwargs uuid(dict): required when the output table's storage type is oss or es.
Example:
from etlsdk.lib.datasources.datasource_factory import DatasourceFactory
output = {
    "dayu_full_name": "Hive:parsed:dcrawl_parsed_weixin",
    "partition": [{"tdate": "2019-01-01"}]
}
dqc_config = {"rules": [{'rule': 'count', 'restrict': 10}]}
DatasourceFactory.write_dataframe(df, output, dqc=dqc_config)  # df: the DataFrame to write out
Table() class:
Holds the table's Dayu schema, connector configuration, and other information.
db_table_name:
Returns db_name.table_name
full_name:
Returns the Dayu `full name`
name:
Returns the table name
db_name:
Returns the table's Dayu database name
schema:
Returns the Dayu table column schema: list of column.
column: {"name": $column_name, "type": $column_type, "comment": }
partition:
Returns the Dayu table partitions: list of partition.
partition: {"name": $partition_name, "type": $partition_type, "comment": }
all_columns:
Returns all Dayu table columns (schema plus partitions): list of column.
column: {"name": $column_name, "type": $column_type, "comment": }
id:
Returns the Dayu table id
storage_type:
Returns the table's storage type. enum: ['oss', 'es', 'hive']
storage_settings:
Returns the table's storage configuration.
extras:
Returns the table's extra configuration (see the sketch after this list).
When storage_type is oss, extras returns {"prefix_pattern": $prefix_pattern}.
When storage_type is es, extras returns {"doc_type": $doc_type, "source": "article", "mapping": $mapping}.
Otherwise, extras returns {}.
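Example (a minimal sketch inside a plugin function, using the `image` input from the quick-start example; it assumes the property names listed above map directly to Python attributes of the Table instance):
table = inputs['image']['table']  # the Table instance attached to an input
print(table.full_name)      # e.g. "Hive:testdb:table_name1"
print(table.db_table_name)  # "testdb.table_name1"
print(table.schema)         # [{"name": ..., "type": ..., "comment": ...}, ...]
if table.storage_type == 'oss':
    prefix_pattern = table.extras['prefix_pattern']  # oss-only extra configuration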
ETLSession() class
Users obtain an ETLSession instance via `ETLSession.get_instance()`. ETLSession holds the ETL variables: spark_session, inputs, outputs, and args.
spark_session
spark session. The SparkSession created from the configuration.
inputs
The plugin's inputs. Format: {
"name": $dayu_name,
"type": $table_storage_type,
"df": Dataframe,
"dayu_id": $dayu_id,
"dayu_full_name": $dayu_full_name,
"partition": $partition,
"table": the instance of [Table](https://git.aipp.io/pub/wiki/blob/master/系统/Pony/ETL上线文档/ETLSDK/etlsdk_plugin开发.md#Table).
}
outputs
The plugin's outputs. Format: {
"name": $dayu_name,
"type": $table_storage_type,
"dayu_id": $dayu_id,
"dayu_full_name": $dayu_full_name,
"partition": $partition,
"table": the instance of [Table](https://git.aipp.io/pub/wiki/blob/master/系统/Pony/ETL上线文档/ETLSDK/etlsdk_plugin开发.md#Table).
}
args
The plugin's args. Format: dict.
Example:
from etlsdk.lib.session import ETLSession
etl_session = ETLSession.get_instance()
spark = etl_session.spark_session
df = spark.table('default.dcrawl_parsed_cars')
inputs = etl_session.inputs
>>> print(etl_session.inputs)
{'RawTable': {'dayu_full_name': 'Hive:etlsdk_test:raw_table',
'dayu_id': 1730,
'df': DataFrame[key: string, json: string, tdate: string],
'name': 'raw_table',
'partition': [{'partition_date': '2019-01-01'}],
'table': <etlsdk.lib.handlers.table_handler.Table object at 0x7f3100910a90>,
'type': 'hive'}}
>>> print(etl_session.outputs)
{'ParsedTable': {'dayu_full_name': 'Hive:etlsdk_test:parsed_table',
'dayu_id': 1740,
'name': 'parsed_table',
'partition': {'partition_date': '2019-01-01'},
'table': <etlsdk.lib.handlers.table_handler.Table object at 0x7f310090cf98>,
'type': 'hive'}}
spark_conf(config)
Decorator that sets the Spark configuration for a plugin function
:param config(dict): the Spark configuration
:key config: spark configuration. value type: {$spark_conf_key: $spark_conf_value}.
Available config options: https://spark.apache.org/docs/latest/configuration.html
:key dependency: packages the function depends on.
dependency value type: {$module_import_name: $module_url}.
module_url: supports pypi links and hdfs storage paths. Files stored on hdfs must be a zip archive of the module's setup directory.
Example:
from etlsdk.session import spark_conf

class Plugin():
    spark_config = {
        "config": {
            "spark.executor.instances": '2',
            "spark.executor.memory": '1g',
            "spark.partition.num": '100'
        },
        "dependency": {
            "jieba": "https://pypi.aidigger.com/packages/EigenJieba-0.0.1.tar.gz"
        }
    }

    @spark_conf(spark_config)
    def run(inputs, outputs, args):
        pass
ETLSessionUT() class:
Used to create an ETLSession for unit tests.
create_etlsession(inputs, outputs, args):
Creates an ETLSession. The spark_session is configured with [('spark.executor.cores', '2'), ('spark.executor.instances', '1'), ('spark.executor.memory', '2g')] and runs in local[2] mode.
:param inputs(list): the plugin's inputs. value type: list of input. input type: dict. inputs may be [].
input required keys: (`name`, `dayu_full_name`, `mock_datas`)
:key `name`: input alias name. value type: string.
:key `dayu_full_name`: same as the table `full name` shown on the Dayu platform. value type: str.
:key `mock_datas`: used to mock the input df. value type: list of Row.asDict().
Optional keys: (`partition`)
:key `partition`: the table partition(s) to read. default None: read the whole table. value type: {$partition_column_name: $partition_column_value} (list supported).
:param outputs(list): the plugin's outputs. value type: list of output. output type: dict. outputs may be [].
output required keys: (`name`, `dayu_id`, `partition`) or (`name`, `dayu_full_name`, `partition`)
:key `name`: output alias name. value type: string.
:key `dayu_id`: same as the table `id` shown on the Dayu platform. value type: int
:key `dayu_full_name`: same as the table `full name` shown on the Dayu platform. value type: str.
:key `partition`: the table partition to write to. value type: {$partition_column_name: $partition_column_value}.
:param args(dict): the plugin's args. format: {$args_key: $args_value}
args_key options: (`isstreaming`, $user_args_key)
:key `isstreaming`: toggles ETLSession streaming mode. default: False. value type: bool.
:key $user_args_key: args used inside the user's plugin. user_args_value type: string.
return an `ETLSession` instance
Example:
from etlsdk.tools.ut_utils import ETLSessionUT
from data_pipeline.plugins.oss2hive import OSS2HivePlugin
mock_datas = [{"key":str(num), "json":"json %d"%num, "tdate":"2019-01-01"} for num in range(10)]
inputs = [
    {
        "name": "RawTable",
        "dayu_full_name": "Hive:etlsdk_test:raw_table",
        "partition": [{"partition_date": "2019-01-01"}],
        "mock_datas": mock_datas
    }
]
outputs = [
    {
        "name": "ParsedTable",
        "dayu_full_name": "Hive:etlsdk_test:parsed_table",
        "partition": {"partition_date": "2019-01-01"}
    }
]
args = {'test_args': 'test_args_value'}
etlsession = ETLSessionUT.create_etlsession(inputs, outputs, args)
>>> print(etlsession.inputs)
{'RawTable': {'dayu_full_name': 'Hive:etlsdk_test:raw_table',
'dayu_id': 1730,
'df': DataFrame[key: string, json: string, tdate: string],
'name': 'raw_table',
'partition': [{'partition_date': '2019-01-01'}],
'table': <etlsdk.lib.handlers.table_handler.Table object at 0x7f3100910a90>,
'type': 'hive'}}
>>> print(etlsession.outputs)
{'ParsedTable': {'dayu_full_name': 'Hive:etlsdk_test:parsed_table',
'dayu_id': 1740,
'name': 'parsed_table',
'partition': {'partition_date': '2019-01-01'},
'table': <etlsdk.lib.handlers.table_handler.Table object at 0x7f310090cf98>,
'type': 'hive'}}
>>> print(etlsession.args)
{'test_args': 'test_args_value'}
oss2hive = OSS2HivePlugin()
oss2hive.run(etlsession.inputs, etlsession.outputs, etlsession.args)
writeout_results = etlsession.get_writeout_results()
get_sparksession():
Returns a SparkSession configured with [('spark.executor.cores', '2'), ('spark.executor.instances', '1'), ('spark.executor.memory', '2g')], running in local[2] mode.
return SparkSession
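Example (a minimal sketch; the DataFrame contents are invented for illustration):
from etlsdk.tools.ut_utils import ETLSessionUT
spark = ETLSessionUT.get_sparksession()
# build a small local DataFrame for test fixtures
df = spark.createDataFrame([('1', 'json 1', '2019-01-01')], ['key', 'json', 'tdate'])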
get_writeout_results():
Returns the df results that were passed to DatasourceFactory.write_dataframe.
return {$dayu_full_name: $df_datas}.
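Example (a minimal sketch continuing the create_etlsession example above; it assumes $df_datas is the collection of rows written for each output full name):
writeout_results = etlsession.get_writeout_results()
parsed_datas = writeout_results['Hive:etlsdk_test:parsed_table']
assert len(parsed_datas) == 10  # hypothetical: assumes the plugin writes one row per mocked input record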
etlsdk
isstreaming: bool. default: False
True: run in `streaming` mode
False: run in `batch` mode
*Note: a single task supports only one mode, streaming or batch*
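Example (a minimal sketch of switching a task to streaming mode via args, reusing the inputs and outputs from the create_etlsession example above):
args = {'isstreaming': True}  # the whole task now runs in streaming mode
etlsession = ETLSessionUT.create_etlsession(inputs, outputs, args)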
Streaming and batch support in etlsdk
1. OSS

| Read/Write | Streaming | Batch |
| --- | --- | --- |
| read_dataframe | Not supported | Supported |
| write_dataframe | Supported | Supported |

2. HIVE

| Read/Write | Streaming | Batch |
| --- | --- | --- |
| read_dataframe | Not supported | Supported |
| write_dataframe | Not supported | Supported |

3. ES

| Read/Write | Streaming | Batch |
| --- | --- | --- |
| read_dataframe | Not supported | Supported |
| write_dataframe | Supported | Supported |

4. MYSQL

| Read/Write | Streaming | Batch |
| --- | --- | --- |
| read_dataframe | Not supported | Supported |
| write_dataframe | Supported | Supported |

5. Kafka

| Read/Write | Streaming | Batch |
| --- | --- | --- |
| read_dataframe | Supported | Not supported |
| write_dataframe | Supported | Not supported |
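Example (a hypothetical sketch based on the tables above: Kafka is the only source that supports streaming reads, so a streaming task would read from a Kafka table; the full name below is invented for illustration):
from etlsdk.lib.datasources.datasource_factory import DatasourceFactory
# hypothetical Kafka table registered on the Dayu platform
kafka_input = {'dayu_full_name': 'Kafka:testdb:event_stream'}
# with args = {'isstreaming': True}, read_dataframe yields a streaming DataFrame;
# per the tables above it can be written out to OSS, ES, MySQL, or Kafka,
# but not to Hive (no streaming write support)
df = DatasourceFactory.read_dataframe(kafka_input)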