Apache DolphinScheduler
Apache DolphinScheduler is a distributed, extensible, open-source workflow orchestration platform with a powerful visual DAG interface.
1. Features
- Native high-availability (HA) support.
- Key information such as task status, task type, retry count, the machine a task runs on, and visualized variables is visible at a glance.
- Task-queue mechanism: the number of tasks schedulable on a single machine is configurable; excess tasks are buffered in the queue instead of overloading the machine.
- All process-definition operations are visual: draw the DAG by dragging and dropping task nodes, then configure data sources and resources. For third-party systems, an API is provided (see the curl sketch after this list).
- One-click deployment.
- Supports pause and resume operations.
- DolphinScheduler users can be mapped many-to-one or one-to-one onto tenants and Hadoop users, which is essential for scheduling big-data jobs.
- Supports traditional shell tasks as well as big-data platform tasks: MR, Spark, SQL (MySQL, PostgreSQL, Hive, Spark SQL), Python, Procedure, Sub_Process.
- Distributed scheduling: overall scheduling capacity grows linearly with cluster size, and Masters and Workers support dynamic scale-up and scale-down.
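As a sketch of the API route, the snippet below logs in and then lists projects. The host and port follow this guide's examples; the endpoint paths are the ones exposed by the 2.0.x API server, so verify them against your version's API docs:

```
# Log in and keep the session cookie (default admin account from this guide)
curl -s -c /tmp/ds-cookie -X POST "http://spark01:12345/dolphinscheduler/login" \
  -d "userName=admin&userPassword=dolphinscheduler123"

# List projects using the stored session
curl -s -b /tmp/ds-cookie "http://spark01:12345/dolphinscheduler/projects?pageNo=1&pageSize=10"
```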
2. Architecture
- UI
- API gateway (API server)
- Master server
- Worker server
- DB (MySQL, PostgreSQL)
- ZooKeeper
3. Terminology
- Process definition: the visual DAG formed by dragging task nodes onto the canvas and wiring up their dependencies.
- Process instance: the instantiation of a process definition, created by a manual start or by the scheduler; each run of a process definition produces one process instance.
- Task instance: the instantiation of a task node within a process definition; it tracks the execution state of a specific task.
- Task type: SHELL, SQL, SUB_PROCESS, PROCEDURE, MR, SPARK, PYTHON, and DEPENDENT are currently supported, with dynamic plugin extension planned. Note: a SUB_PROCESS is itself a standalone process definition and can be started and executed on its own.
- Scheduling: the system supports cron-based timed scheduling and manual triggering. Supported command types: start workflow, start from the current node, recover a fault-tolerant workflow, resume a paused process, start from failed nodes, backfill, schedule, rerun, pause, stop, resume waiting threads. Of these, "recover a fault-tolerant workflow" and "resume waiting threads" are used internally by the scheduler and cannot be invoked externally.
- Timed scheduling: the system uses the Quartz distributed scheduler and also supports visual generation of cron expressions (see the sketch after this list).
- Dependency: besides the simple predecessor/successor dependencies within a DAG, the system provides a DEPENDENT task node that supports custom cross-process task dependencies.
- Priority: both process instances and task instances support priorities; if none is set, scheduling defaults to first-in-first-out.
- Email alerts: supports emailing SQL task query results, process instance run results, and fault-tolerance alert notifications.
- Failure strategy: for tasks running in parallel, two strategies are offered when one task fails. "Continue" lets the other parallel tasks run to their final state regardless, after which the process ends as failed; "End" kills the running parallel tasks as soon as a failure is detected and ends the process as failed.
- Backfill: re-runs over historical dates; both parallel and serial backfill across a date range are supported.
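Quartz cron expressions have up to seven fields: seconds, minutes, hours, day-of-month, month, day-of-week, and an optional year. A few illustrative schedules (these particular expressions are examples, not taken from the text above):

```
0 0 2 * * ? *      # every day at 02:00:00
0 */30 * * * ? *   # every 30 minutes
0 0 8 ? * MON *    # every Monday at 08:00:00
```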
4. Deployment
Standalone mode
1. In standalone mode, all services run inside a single StandaloneServer process, which embeds both the ZooKeeper registry and an H2 database. With nothing but a JDK configured, DolphinScheduler starts with one command for a quick hands-on trial.
2. Standalone is only meant for quick trials and is recommended for fewer than 20 workflows: it relies on an in-memory H2 database and a ZooKeeper testing server, so too many tasks can make it unstable, and restarting or stopping standalone-server wipes the in-memory database.
3. Deployment steps

Create the software directory:

```
mkdir -p /opt/soft
```
3.1 Download the DolphinScheduler binary package:

```
wget https://archive.apache.org/dist/dolphinscheduler/2.0.8/apache-dolphinscheduler-2.0.8-bin.tar.gz --no-check-certificate
```
Unpack:

```
tar -xvzf apache-dolphinscheduler-2.0.8-bin.tar.gz -C /opt/soft
```
Start and stop:

```
# Start the Standalone Server
bash ./bin/dolphinscheduler-daemon.sh start standalone-server
# Stop the Standalone Server
bash ./bin/dolphinscheduler-daemon.sh stop standalone-server
```
Log in

Open http://spark01:12345/dolphinscheduler. The default username/password is admin/dolphinscheduler123.
Web UI
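Before opening a browser, a quick liveness check from the shell can confirm the service is up (a sketch; the hostname, port, and URL are this guide's example values, and the exact HTTP status may vary by version):

```
# The standalone daemon registers with the JVM as StandaloneServer
jps | grep StandaloneServer
# Print the HTTP status of the UI URL; expect 200 (or a redirect) once it is serving
curl -s -o /dev/null -w "%{http_code}\n" http://spark01:12345/dolphinscheduler
```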
Pseudo-cluster mode
Pseudo-cluster mode deploys all DolphinScheduler services on a single machine: master, worker, api server, logger server, and so on all run on the same host. Unlike standalone mode, ZooKeeper and the database must be installed and configured separately.
Cluster mode
Cluster mode differs from pseudo-cluster mode only in that the services are deployed across multiple machines, and multiple Masters and multiple Workers can be configured. The layout used here:

| Host | Roles |
|---|---|
| spark01 | master, worker |
| spark02 | worker |
| spark03 | worker |
Download the DolphinScheduler package:

```
wget https://archive.apache.org/dist/dolphinscheduler/2.0.8/apache-dolphinscheduler-2.0.8-bin.tar.gz --no-check-certificate
```
Unpack:

```
tar -xvzf apache-dolphinscheduler-2.0.8-bin.tar.gz -C /opt/soft
```
Configure MySQL:

```
CREATE DATABASE dolphinscheduler DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
CREATE USER 'dolphinscheduler'@'%' IDENTIFIED BY 'dolphinscheduler';
GRANT ALL PRIVILEGES ON dolphinscheduler.* TO 'dolphinscheduler'@'%';
FLUSH PRIVILEGES;
```
If MySQL rejects the password for failing the password-validation policy:

```
SHOW VARIABLES LIKE 'validate_password%';
SET GLOBAL validate_password.policy = LOW;
```
Configure the DolphinScheduler install file:

```
cd conf/config
vim install_config.conf
```
ips="spark01,spark02,spark03" masters="spark01" workers="spark01:default,spark02:default,spark03:default" alertServer="spark02" apiServers="spark03" #pythonGatewayServers="ds1" installPath="/opt/soft/dolphinscheduler" deployUser="root" javaHome="/opt/soft/jdk-8" DATABASE_TYPE=${DATABASE_TYPE:-"mysql"} "jdbc:mysql://spark03:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8" SPRING_DATASOURCE_USERNAME=${SPRING_DATASOURCE_USERNAME:-"dolphinscheduler"} SPRING_DATASOURCE_PASSWORD=${SPRING_DATASOURCE_PASSWORD:-"dolphinscheduler"} registryServers="spark01:2181,spark02:2181,spark03:2181" resourceStorageType="HDFS" defaultFS="hdfs://mycluster" yarnHaIps="spark01,spark02" singleYarnIp="" hdfsRootUser="root"
```
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# ---------------------------------------------------------
# INSTALL MACHINE
# ---------------------------------------------------------
# A comma separated list of machine hostname or IP would be installed DolphinScheduler,
# including master, worker, api, alert. If you want to deploy in pseudo-distributed
# mode, just write a pseudo-distributed hostname
# Example for hostnames: ips="ds1,ds2,ds3,ds4,ds5", Example for IPs: ips="192.168.8.1,192.168.8.2,192.168.8.3,192.168.8.4,192.168.8.5"
ips="spark01,spark02,spark03"

# Port of SSH protocol, default value is 22. For now we only support same port in all `ips` machine
# modify it if you use different ssh port
sshPort="22"

# A comma separated list of machine hostname or IP would be installed Master server, it
# must be a subset of configuration `ips`.
# Example for hostnames: masters="ds1,ds2", Example for IPs: masters="192.168.8.1,192.168.8.2"
masters="spark01"

# A comma separated list of machine <hostname>:<workerGroup> or <IP>:<workerGroup>.All hostname or IP must be a
# subset of configuration `ips`, And workerGroup have default value as `default`, but we recommend you declare behind the hosts
# Example for hostnames: workers="ds1:default,ds2:default,ds3:default", Example for IPs: workers="192.168.8.1:default,192.168.8.2:default,192.168.8.3:default"
workers="spark01:default,spark02:default,spark03:default"

# A comma separated list of machine hostname or IP would be installed Alert server, it
# must be a subset of configuration `ips`.
# Example for hostname: alertServer="ds3", Example for IP: alertServer="192.168.8.3"
alertServer="spark02"

# A comma separated list of machine hostname or IP would be installed API server, it
# must be a subset of configuration `ips`.
# Example for hostname: apiServers="ds1", Example for IP: apiServers="192.168.8.1"
apiServers="spark03"

# A comma separated list of machine hostname or IP would be installed Python gateway server, it
# must be a subset of configuration `ips`.
# Example for hostname: pythonGatewayServers="ds1", Example for IP: pythonGatewayServers="192.168.8.1"
#pythonGatewayServers="ds1"

# The directory to install DolphinScheduler for all machine we config above. It will automatically be created by `install.sh` script if not exists.
# Do not set this configuration same as the current path (pwd)
installPath="/opt/soft/dolphinscheduler"

# The user to deploy DolphinScheduler for all machine we config above. For now user must create by yourself before running `install.sh`
# script. The user needs to have sudo privileges and permissions to operate hdfs. If hdfs is enabled than the root directory needs
# to be created by this user
deployUser="root"

# The directory to store local data for all machine we config above. Make sure user `deployUser` have permissions to read and write this directory.
dataBasedirPath="/tmp/dolphinscheduler"

# ---------------------------------------------------------
# DolphinScheduler ENV
# ---------------------------------------------------------
# JAVA_HOME, we recommend use same JAVA_HOME in all machine you going to install DolphinScheduler
# and this configuration only support one parameter so far.
javaHome="/opt/soft/jdk-8"

# DolphinScheduler API service port, also this is your DolphinScheduler UI component's URL port, default value is 12345
apiServerPort="12345"

# ---------------------------------------------------------
# Database
# NOTICE: If database value has special characters, such as `.*[]^${}\+?|()@#&`, Please add prefix `\` for escaping.
# ---------------------------------------------------------
# The type for the metadata database
# Supported values: ``postgresql``, ``mysql`, `h2``.
DATABASE_TYPE=${DATABASE_TYPE:-"mysql"}

# Spring datasource url, following <HOST>:<PORT>/<database>?<parameter> format, If you using mysql, you could use jdbc
# string jdbc:mysql://127.0.0.1:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8 as example
SPRING_DATASOURCE_URL=${SPRING_DATASOURCE_URL:-"jdbc:mysql://spark03:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8"}

# Spring datasource username
SPRING_DATASOURCE_USERNAME=${SPRING_DATASOURCE_USERNAME:-"dolphinscheduler"}

# Spring datasource password
SPRING_DATASOURCE_PASSWORD=${SPRING_DATASOURCE_PASSWORD:-"dolphinscheduler"}

# ---------------------------------------------------------
# Registry Server
# ---------------------------------------------------------
# Registry Server plugin name, should be a substring of `registryPluginDir`, DolphinScheduler use this for verifying configuration consistency
registryPluginName="zookeeper"

# Registry Server address.
registryServers="spark01:2181,spark02:2181,spark03:2181"

# Registry Namespace
registryNamespace="dolphinscheduler"

# ---------------------------------------------------------
# Worker Task Server
# ---------------------------------------------------------
# Worker Task Server plugin dir. DolphinScheduler will find and load the worker task plugin jar package from this dir.
taskPluginDir="lib/plugin/task"

# resource storage type: HDFS, S3, NONE
resourceStorageType="HDFS"

# resource store on HDFS/S3 path, resource file will store to this hdfs path, self configuration, please make sure the directory exists on hdfs and has read write permissions. "/dolphinscheduler" is recommended
resourceUploadPath="/dolphinscheduler"

# if resourceStorageType is HDFS,defaultFS write namenode address,HA, you need to put core-site.xml and hdfs-site.xml in the conf directory.
# if S3,write S3 address,HA,for example :s3a://dolphinscheduler,
# Note,S3 be sure to create the root directory /dolphinscheduler
defaultFS="hdfs://mycluster"

# if resourceStorageType is S3, the following three configuration is required, otherwise please ignore
s3Endpoint="http://192.168.xx.xx:9010"
s3AccessKey="xxxxxxxxxx"
s3SecretKey="xxxxxxxxxx"

# resourcemanager port, the default value is 8088 if not specified
resourceManagerHttpAddressPort="8088"

# if resourcemanager HA is enabled, please set the HA IPs; if resourcemanager is single node, keep this value empty
yarnHaIps="spark01,spark02"

# if resourcemanager HA is enabled or not use resourcemanager, please keep the default value; If resourcemanager is single node, you only need to replace 'yarnIp1' to actual resourcemanager hostname
singleYarnIp=""

# who has permission to create directory under HDFS/S3 root path
# Note: if kerberos is enabled, please config hdfsRootUser=
hdfsRootUser="root"

# kerberos config
# whether kerberos starts, if kerberos starts, following four items need to config, otherwise please ignore
kerberosStartUp="false"
# kdc krb5 config file path
krb5ConfPath="$installPath/conf/krb5.conf"
# keytab username,watch out the @ sign should followd by \\
keytabUserName="hdfs-mycluster\\@ESZ.COM"
# username keytab path
keytabPath="$installPath/conf/hdfs.headless.keytab"
# kerberos expire time, the unit is hour
kerberosExpireTime="2"

# use sudo or not
sudoEnable="true"

# worker tenant auto create
workerTenantAutoCreate="false"
```
Copy core-site.xml and hdfs-site.xml into the conf directory:

```
cp /opt/soft/hadoop-3/etc/hadoop/core-site.xml conf/
cp /opt/soft/hadoop-3/etc/hadoop/hdfs-site.xml conf/
```
Initialize the database:

1. Copy mysql-connector-j-8.0.33.jar into the lib directory.
2. From the unpacked directory, run script/create-dolphinscheduler.sh.
Run the installation:

```
./install.sh
```
Web UI

Open http://spark03:12345/dolphinscheduler. The default username/password is admin/dolphinscheduler123.
Start and stop:

```
cd /opt/soft/dolphinscheduler/
./bin/start-all.sh
./bin/stop-all.sh
```
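A quick sketch for verifying that each node runs the daemons assigned to it in the table above (the process names below are the ones the 2.0.x daemons register with the JVM):

```
# Run on each host: expect MasterServer on spark01, WorkerServer/LoggerServer on
# all three hosts, AlertServer on spark02, and ApiApplicationServer on spark03
jps | grep -E "MasterServer|WorkerServer|LoggerServer|AlertServer|ApiApplicationServer"
```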
Shell workflow

A demo workflow with three Shell task nodes:

```
nodeA: echo "A"
nodeB: echo "B"
nodeC: echo "C"
```
Local and global parameters

Local parameters are effective only on the single task node where they are defined. A task script references one like this:

```
echo "${dt}"
```

Global parameters are effective on all task nodes of the workflow.
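From inside a task script both kinds look identical; only where they are defined differs. A minimal sketch (the names dt and bizdate are illustrative):

```
# dt:      local parameter, defined in this task node's own parameter list
# bizdate: global parameter, defined in the workflow's save dialog
echo "local dt = ${dt}"
echo "global bizdate = ${bizdate}"
```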
Passing parameters

DolphinScheduler supports passing parameters from upstream task nodes to downstream ones. The task types currently supporting this feature are Shell, SQL, and Procedure. The following case uses Shell task nodes.

An upstream node declares output parameters with setValue:

```
echo '${setValue(key=value)}'
echo '${setValue(data=2024-01-03)}'
```

nodeA (upstream: sets the values)

nodeC (downstream: references them)
Parameter priority: a parameter referenced by a task node can come from three sources: global parameters, parameters passed by upstream tasks, and local parameters. Because a value may have several sources, priority matters when names collide. DolphinScheduler resolves from highest to lowest: parameters passed by upstream tasks > global parameters > local parameters. Among upstream-passed parameters, several upstream tasks may pass a parameter of the same name downstream; in that case the downstream node prefers a non-empty value, and if multiple non-empty values exist it sorts by upstream completion time and takes the parameter from the upstream task that completed earliest.
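Putting the pieces together, a minimal two-node sketch (the node and parameter names are illustrative):

```
# nodeA (upstream Shell task): declare an output parameter named dt
echo '${setValue(dt=2024-01-03)}'

# nodeC (downstream Shell task): dt arrives from nodeA; per the priority rule
# above, this upstream value would shadow a global or local parameter named dt
echo "received dt=${dt}"
```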
Referencing dependent resources

Some tasks need extra resources: MR and Spark tasks must reference jar packages, Shell tasks may reference other scripts, and so on. DolphinScheduler provides a Resource Center to manage these resources centrally; its storage backend can be the local filesystem, HDFS, etc. Besides file management, the Resource Center also offers management of Hive user-defined functions. The following uses a Shell task to demonstrate referencing a script stored in the Resource Center.
1. Create the script in the Resource Center.
2. Have nodeA reference the resource.
3. Invoke it in the task command as: bash + the script's relative path (see the sketch below).
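For example, assuming a script uploaded to the Resource Center under the hypothetical path test/hello.sh and selected as a resource of nodeA, the node's Shell command would be:

```
# test/hello.sh is a hypothetical resource path; DolphinScheduler stages the
# selected resources into the task's working directory, so the relative path resolves
bash test/hello.sh
```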
Alert notification

DolphinScheduler supports multiple alert channels; email is used here for demonstration.

1. Create an alert instance.
2. Configure alerting when starting the workflow.
3. The alert is delivered successfully.