Deployment and Basic Usage of Apache DolphinScheduler

Apache DolphinScheduler

Apache DolphinScheduler is a distributed, extensible, open-source workflow orchestration platform with a powerful visual DAG interface.

1. Features

  1. High availability (HA) is supported out of the box.

  2. Key information such as task status, task type, retry count, the machine a task runs on, and visual variables is visible at a glance.

  3. Task queue mechanism: the number of tasks that can be scheduled on a single machine is configurable; when there are too many tasks they are buffered in the queue instead of overloading the machine.

  4. All workflow-definition operations are visual: drag and drop tasks to draw the DAG and configure data sources and resources. For third-party systems, an API is provided.

  5. One-click deployment.

  6. Supports pause and resume operations.

  7. A DolphinScheduler user can be mapped to a tenant (Hadoop user) in a many-to-one or one-to-one relationship, which is important for scheduling big data jobs.

  8. Supports traditional shell tasks as well as big data tasks: MR, Spark, SQL (MySQL, PostgreSQL, Hive, Spark SQL), Python, Procedure, and Sub_Process.

  9. Scheduling is distributed, so overall scheduling capacity grows roughly linearly with cluster size; Masters and Workers can be brought online and taken offline dynamically.

2. Architecture

  • UI

  • API Server (gateway)

  • Master Server

  • Worker Server

  • DB (MySQL or PostgreSQL)

  • ZooKeeper

3. Terminology

  • Process definition: the visual DAG formed by dragging task nodes onto the canvas and connecting them.

  • Process instance: an instantiation of a process definition, created either by a manual start or by timed scheduling; each run of a process definition produces one process instance.

  • Task instance: an instantiation of a task node within a process definition; it records the execution state of that specific task.

  • Task types: currently SHELL, SQL, SUB_PROCESS (sub-workflow), PROCEDURE, MR, SPARK, PYTHON, and DEPENDENT are supported, with dynamic plugin extension planned. Note that a SUB_PROCESS is itself a separate process definition and can be started and executed on its own.

  • Scheduling methods: the system supports timed scheduling based on cron expressions as well as manual scheduling. Supported command types: start workflow, execute from the current node, recover a fault-tolerant workflow, resume a paused process, execute from failed nodes, backfill, schedule, rerun, pause, stop, and recover waiting threads. The two command types "recover a fault-tolerant workflow" and "recover waiting threads" are used internally by the scheduler and cannot be invoked externally.

  • Timed scheduling: the system uses the Quartz distributed scheduler and also supports visual generation of cron expressions (see the example after this list).

  • Dependencies: beyond the simple predecessor/successor dependencies within a DAG, the system provides a dependent task node that supports custom cross-workflow task dependencies.

  • Priority: priorities are supported for both process instances and task instances; if no priority is set, execution defaults to first-in, first-out.

  • Email alerts: supports emailing the results of SQL task queries, as well as alert emails for process-instance run results and fault-tolerance notifications.

  • Failure strategy: for tasks running in parallel, two strategies are available when a task fails. "Continue" ignores the state of the other parallel tasks and lets the process run until it ends as failed; "End" kills the parallel tasks that are still running as soon as a failed task is detected, and the process ends as failed.

  • Backfill: backfilling historical data; both parallel and serial backfill over a date range are supported.
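For example, a Quartz cron expression that triggers a workflow every day at 02:00 (field order: second, minute, hour, day of month, month, day of week, optional year):

0 0 2 * * ? *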

4. Deployment

Standalone mode
1. In standalone mode, all services run inside a single StandaloneServer process, with an embedded ZooKeeper registry and an embedded H2 database. Only a JDK needs to be configured, so DolphinScheduler can be started with one command for a quick trial.
2. Standalone mode is intended only for a quick trial of DolphinScheduler. It is recommended for no more than 20 workflows, because it uses the in-memory H2 database and a ZooKeeper testing server; too many tasks may make it unstable, and restarting or stopping standalone-server wipes the data held in the in-memory database.
3. Deployment steps
Create the software directory
mkdir -p /opt/soft
Download the DolphinScheduler binary package
wget https://archive.apache.org/dist/dolphinscheduler/2.0.8/apache-dolphinscheduler-2.0.8-bin.tar.gz --no-check-certificate
Extract it
tar -xvzf apache-dolphinscheduler-2.0.8-bin.tar.gz -C /opt/soft
Start and stop (run from the extracted directory)
# Start the Standalone Server
bash ./bin/dolphinscheduler-daemon.sh start standalone-server
# Stop the Standalone Server
bash ./bin/dolphinscheduler-daemon.sh stop standalone-server
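A quick sanity check after starting (a minimal sketch; the process name printed by jps may differ slightly between releases, and spark01 is the host used in this walkthrough):
# the standalone build runs everything in one JVM; it should show up in jps
jps | grep -i standalone
# the API/UI port should answer once startup has finished
curl -I http://spark01:12345/dolphinscheduler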
​
Log in
Open http://spark01:12345/dolphinscheduler in a browser
Default credentials: admin / dolphinscheduler123
(Web UI screenshot)

Pseudo-cluster mode

Pseudo-cluster mode deploys all DolphinScheduler services on a single machine: the master, worker, api server, logger server and other services all run on the same host. ZooKeeper and the database must be installed and configured separately.

Cluster mode

Cluster mode differs from pseudo-cluster mode in that the DolphinScheduler services are deployed across multiple machines, and multiple Masters and multiple Workers can be configured. The node layout used in this walkthrough:
spark01: master, worker
spark02: worker
spark03: worker

Download the DolphinScheduler package
wget https://archive.apache.org/dist/dolphinscheduler/2.0.8/apache-dolphinscheduler-2.0.8-bin.tar.gz --no-check-certificate
Extract it
tar -xvzf apache-dolphinscheduler-2.0.8-bin.tar.gz -C /opt/soft
Configure MySQL
CREATE DATABASE dolphinscheduler DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
CREATE USER 'dolphinscheduler'@'%' IDENTIFIED BY 'dolphinscheduler';
GRANT ALL PRIVILEGES ON dolphinscheduler.* TO 'dolphinscheduler'@'%';
FLUSH PRIVILEGES;

If MySQL rejects the password because it does not meet the password policy:

show variables like 'validate_password%';
set global validate_password.policy=LOW;
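The variable name validate_password.policy is MySQL 8.0 syntax (validate_password component). If the server is MySQL 5.7 with the validate_password plugin instead (an assumption about your environment), the equivalent variable uses an underscore:
set global validate_password_policy=LOW;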
Configure the DolphinScheduler installation file (the key values to change are listed first; the complete annotated file follows below)
cd conf/config
vim install_config.conf
ips="spark01,spark02,spark03"
masters="spark01"
workers="spark01:default,spark02:default,spark03:default"
alertServer="spark02"
apiServers="spark03"
#pythonGatewayServers="ds1"
installPath="/opt/soft/dolphinscheduler"
deployUser="root"
javaHome="/opt/soft/jdk-8"
DATABASE_TYPE=${DATABASE_TYPE:-"mysql"}
SPRING_DATASOURCE_URL=${SPRING_DATASOURCE_URL:-"jdbc:mysql://spark03:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8"}
SPRING_DATASOURCE_USERNAME=${SPRING_DATASOURCE_USERNAME:-"dolphinscheduler"}
SPRING_DATASOURCE_PASSWORD=${SPRING_DATASOURCE_PASSWORD:-"dolphinscheduler"}
registryServers="spark01:2181,spark02:2181,spark03:2181"
resourceStorageType="HDFS"
defaultFS="hdfs://mycluster"
yarnHaIps="spark01,spark02"
singleYarnIp=""
hdfsRootUser="root"

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
​
# ---------------------------------------------------------
# INSTALL MACHINE
# ---------------------------------------------------------
# A comma separated list of machine hostname or IP would be installed DolphinScheduler,
# including master, worker, api, alert. If you want to deploy in pseudo-distributed
# mode, just write a pseudo-distributed hostname
# Example for hostnames: ips="ds1,ds2,ds3,ds4,ds5", Example for IPs: ips="192.168.8.1,192.168.8.2,192.168.8.3,192.168.8.4,192.168.8.5"
ips="spark01,spark02,spark03"
​
# Port of SSH protocol, default value is 22. For now we only support same port in all `ips` machine
# modify it if you use different ssh port
sshPort="22"
​
# A comma separated list of machine hostname or IP would be installed Master server, it
# must be a subset of configuration `ips`.
# Example for hostnames: masters="ds1,ds2", Example for IPs: masters="192.168.8.1,192.168.8.2"
masters="spark01"
​
# A comma separated list of machine <hostname>:<workerGroup> or <IP>:<workerGroup>.All hostname or IP must be a
# subset of configuration `ips`, And workerGroup have default value as `default`, but we recommend you declare behind the hosts
# Example for hostnames: workers="ds1:default,ds2:default,ds3:default", Example for IPs: workers="192.168.8.1:default,192.168.8.2:default,192.168.8.3:default"
workers="spark01:default,spark02:default,spark03:default"
​
# A comma separated list of machine hostname or IP would be installed Alert server, it
# must be a subset of configuration `ips`.
# Example for hostname: alertServer="ds3", Example for IP: alertServer="192.168.8.3"
alertServer="spark02"
​
# A comma separated list of machine hostname or IP would be installed API server, it
# must be a subset of configuration `ips`.
# Example for hostname: apiServers="ds1", Example for IP: apiServers="192.168.8.1"
apiServers="spark03"
​
# A comma separated list of machine hostname or IP would be installed Python gateway server, it
# must be a subset of configuration `ips`.
# Example for hostname: pythonGatewayServers="ds1", Example for IP: pythonGatewayServers="192.168.8.1"
#pythonGatewayServers="ds1"
​
# The directory to install DolphinScheduler for all machine we config above. It will automatically be created by `install.sh` script if not exists.
# Do not set this configuration same as the current path (pwd)
installPath="/opt/soft/dolphinscheduler"
​
# The user to deploy DolphinScheduler for all machine we config above. For now user must create by yourself before running `install.sh`
# script. The user needs to have sudo privileges and permissions to operate hdfs. If hdfs is enabled than the root directory needs
# to be created by this user
deployUser="root"
​
# The directory to store local data for all machine we config above. Make sure user `deployUser` have permissions to read and write this directory.
dataBasedirPath="/tmp/dolphinscheduler"
​
# ---------------------------------------------------------
# DolphinScheduler ENV
# ---------------------------------------------------------
# JAVA_HOME, we recommend use same JAVA_HOME in all machine you going to install DolphinScheduler
# and this configuration only support one parameter so far.
javaHome="/opt/soft/jdk-8"
​
# DolphinScheduler API service port, also this is your DolphinScheduler UI component's URL port, default value is 12345
apiServerPort="12345"
​
# ---------------------------------------------------------
# Database
# NOTICE: If database value has special characters, such as `.*[]^${}\+?|()@#&`, Please add prefix `\` for escaping.
# ---------------------------------------------------------
# The type for the metadata database
# Supported values: ``postgresql``, ``mysql`, `h2``.
DATABASE_TYPE=${DATABASE_TYPE:-"mysql"}
​
# Spring datasource url, following <HOST>:<PORT>/<database>?<parameter> format, If you using mysql, you could use jdbc
# string jdbc:mysql://127.0.0.1:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8 as example
SPRING_DATASOURCE_URL=${SPRING_DATASOURCE_URL:-"jdbc:mysql://spark03:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8"}
​
# Spring datasource username
SPRING_DATASOURCE_USERNAME=${SPRING_DATASOURCE_USERNAME:-"dolphinscheduler"}
​
# Spring datasource password
SPRING_DATASOURCE_PASSWORD=${SPRING_DATASOURCE_PASSWORD:-"dolphinscheduler"}
​
# ---------------------------------------------------------
# Registry Server
# ---------------------------------------------------------
# Registry Server plugin name, should be a substring of `registryPluginDir`, DolphinScheduler use this for verifying configuration consistency
registryPluginName="zookeeper"
​
# Registry Server address.
registryServers="spark01:2181,spark02:2181,spark03:2181"
​
# Registry Namespace
registryNamespace="dolphinscheduler"
​
# ---------------------------------------------------------
# Worker Task Server
# ---------------------------------------------------------
# Worker Task Server plugin dir. DolphinScheduler will find and load the worker task plugin jar package from this dir.
taskPluginDir="lib/plugin/task"
​
# resource storage type: HDFS, S3, NONE
resourceStorageType="HDFS"
​
# resource store on HDFS/S3 path, resource file will store to this hdfs path, self configuration, please make sure the directory exists on hdfs and has read write permissions. "/dolphinscheduler" is recommended
resourceUploadPath="/dolphinscheduler"
​
# if resourceStorageType is HDFS,defaultFS write namenode address,HA, you need to put core-site.xml and hdfs-site.xml in the conf directory.
# if S3,write S3 address,HA,for example :s3a://dolphinscheduler,
# Note,S3 be sure to create the root directory /dolphinscheduler
defaultFS="hdfs://mycluster"
​
# if resourceStorageType is S3, the following three configuration is required, otherwise please ignore
s3Endpoint="http://192.168.xx.xx:9010"
s3AccessKey="xxxxxxxxxx"
s3SecretKey="xxxxxxxxxx"
​
# resourcemanager port, the default value is 8088 if not specified
resourceManagerHttpAddressPort="8088"
​
# if resourcemanager HA is enabled, please set the HA IPs; if resourcemanager is single node, keep this value empty
yarnHaIps="spark01,spark02"
​
# if resourcemanager HA is enabled or not use resourcemanager, please keep the default value; If resourcemanager is single node, you only need to replace 'yarnIp1' to actual resourcemanager hostname
singleYarnIp=""
​
# who has permission to create directory under HDFS/S3 root path
# Note: if kerberos is enabled, please config hdfsRootUser=
hdfsRootUser="root"
​
# kerberos config
# whether kerberos starts, if kerberos starts, following four items need to config, otherwise please ignore
kerberosStartUp="false"
# kdc krb5 config file path
krb5ConfPath="$installPath/conf/krb5.conf"
# keytab username,watch out the @ sign should followd by \\
keytabUserName="hdfs-mycluster\\@ESZ.COM"
# username keytab path
keytabPath="$installPath/conf/hdfs.headless.keytab"
# kerberos expire time, the unit is hour
kerberosExpireTime="2"
​
# use sudo or not
sudoEnable="true"
​
# worker tenant auto create
workerTenantAutoCreate="false"
​
Copy core-site.xml and hdfs-site.xml into the conf directory
cp  /opt/soft/hadoop-3/etc/hadoop/core-site.xml conf/
cp  /opt/soft/hadoop-3/etc/hadoop/hdfs-site.xml conf/
Initialize the database

1. Copy mysql-connector-j-8.0.33.jar into the lib directory

2. From the extracted directory, run script/create-dolphinscheduler.sh
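Putting the two steps together as shell commands (the extraction directory name and the local path of the driver jar are assumptions for illustration):
cd /opt/soft/apache-dolphinscheduler-2.0.8-bin
# 1. place the MySQL JDBC driver on the classpath of the DolphinScheduler services
cp /opt/soft/mysql-connector-j-8.0.33.jar lib/
# 2. create the metadata tables in the dolphinscheduler database configured above
sh script/create-dolphinscheduler.sh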

Run the installation

./install.sh
Web UI
Open http://spark03:12345/dolphinscheduler in a browser
Default credentials: admin / dolphinscheduler123

Start and stop
cd /opt/soft/dolphinscheduler/
./bin/start-all.sh
./bin/stop-all.sh
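Individual services can also be started and stopped with the daemon script on the node that hosts them (a sketch; the service names below follow the 2.0.x convention and may differ in other releases):
cd /opt/soft/dolphinscheduler
bash ./bin/dolphinscheduler-daemon.sh stop worker-server
bash ./bin/dolphinscheduler-daemon.sh start worker-server
bash ./bin/dolphinscheduler-daemon.sh stop api-server
bash ./bin/dolphinscheduler-daemon.sh start api-server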

Shell workflow

Create three Shell task nodes:
nodeA : echo "A"
nodeB : echo "B"
nodeC : echo "C"

Local and global parameters

Local parameters take effect only on the single task node on which they are defined, e.g. a task script referencing a local parameter dt:

echo "${dt}"

Global parameters take effect on all task nodes of the workflow (see the sketch below).
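A minimal sketch of how a local parameter is consumed, assuming the task node defines a parameter named dt whose value is the built-in date expression $[yyyy-MM-dd] (both the name and the value are assumptions for illustration):
# local parameter defined on the node: dt = $[yyyy-MM-dd]
# the task script references it by name
echo "processing partition ${dt}"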

Passing parameters

DolphinScheduler supports passing parameters from an upstream task node to downstream task nodes. The task types that currently support this feature are Shell, SQL and Procedure. The example below uses Shell task nodes; the syntax for setting an output value is:

echo '${setValue(key=value)}'
echo '${setValue(data=2024-01-03)}'

(screenshots: nodeA sets the parameter, nodeC references it)
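A minimal sketch of both sides, assuming nodeA declares a custom parameter named data with direction OUT so that its value is passed downstream:
# nodeA (upstream): write the value into the OUT parameter
echo '${setValue(data=2024-01-03)}'
# nodeC (downstream): the passed value is referenced like any other parameter
echo "received data=${data}"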

Parameter priority: a parameter referenced by a task node can come from three sources: global parameters, parameters passed from upstream tasks, and local parameters. Because the same parameter name can receive values from several sources, priority matters. From high to low, DolphinScheduler resolves parameters as: local parameters > parameters passed from upstream tasks > global parameters. For upstream-passed parameters, several upstream tasks may pass a parameter with the same name to the downstream node. In that case the downstream node prefers a parameter whose value is non-empty; if several non-empty values exist, the upstream tasks are ordered by completion time and the parameter from the earliest-finishing upstream task is used.

Referencing dependent resources

Some tasks need to reference extra resources: MR and Spark tasks need jar packages, Shell tasks may need other scripts, and so on. DolphinScheduler provides a Resource Center to manage these resources in one place. The Resource Center can store files on the local file system or on HDFS, among others. Besides file management, it also offers management of Hive user-defined functions. The example below uses a Shell task to show how to reference a script kept in the Resource Center.

1. Create the script in the Resource Center

2. Reference the resource from nodeA

bash <relative path of the script>
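A minimal sketch, assuming a script named hello.sh has been uploaded to the Resource Center root and selected in the task's resource list (the file name is an assumption):
# the selected resource is downloaded into the task's working directory,
# so it can be invoked by its relative path
bash hello.sh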

Alert notifications

DolphinScheduler supports multiple alert channels; email is used here as the example. Creating an email alert instance typically means filling in the SMTP host and port, the sender address, the account and password, and whether SSL/STARTTLS is used (exact field names vary by version).

1. Create an alert instance

2. Configure the alert (alert group and notification strategy) when starting the workflow

3. The alert is delivered successfully
