Apache DolphinScheduler
Apache DolphinScheduler is a distributed, extensible, open-source workflow orchestration platform with a powerful visual DAG interface.
1. Features
- Native high-availability (HA) support.
- Key information such as task status, task type, retry count, the machine a task runs on, and visualized variables is visible at a glance.
- Task-queue mechanism: the number of tasks schedulable on a single machine is configurable; excess tasks are buffered in the queue instead of overloading the machine.
- All process-definition operations are visual: draw the DAG by dragging and dropping task nodes, then configure data sources and resources. For third-party systems, an API is provided (see the curl sketch after this list).
- One-click deployment.
- Supports pause and resume operations.
- DolphinScheduler users can be mapped many-to-one or one-to-one onto tenants and Hadoop users, which is essential for scheduling big-data jobs.
- Supports traditional shell tasks as well as big-data platform tasks: MR, Spark, SQL (MySQL, PostgreSQL, Hive, Spark SQL), Python, Procedure, Sub_Process.
- Distributed scheduling: overall scheduling capacity grows linearly with cluster size, and Masters and Workers support dynamic scale-up and scale-down.
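As a sketch of the API route, the snippet below logs in and then lists projects. The host and port follow this guide's examples; the endpoint paths are the ones exposed by the 2.0.x API server, so verify them against your version's API docs:

```
# Log in and keep the session cookie (default admin account from this guide)
curl -s -c /tmp/ds-cookie -X POST "http://spark01:12345/dolphinscheduler/login" \
  -d "userName=admin&userPassword=dolphinscheduler123"

# List projects using the stored session
curl -s -b /tmp/ds-cookie "http://spark01:12345/dolphinscheduler/projects?pageNo=1&pageSize=10"
```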
2. Architecture
- UI
- API gateway (API server)
- Master server
- Worker server
- DB (MySQL, PostgreSQL)
- ZooKeeper
3. Terminology
- Process definition: the visual DAG formed by dragging task nodes onto the canvas and wiring up their dependencies.
- Process instance: the instantiation of a process definition, created by a manual start or by the scheduler; each run of a process definition produces one process instance.
- Task instance: the instantiation of a task node within a process definition; it tracks the execution state of a specific task.
- Task type: SHELL, SQL, SUB_PROCESS, PROCEDURE, MR, SPARK, PYTHON, and DEPENDENT are currently supported, with dynamic plugin extension planned. Note: a SUB_PROCESS is itself a standalone process definition and can be started and executed on its own.
- Scheduling: the system supports cron-based timed scheduling and manual triggering. Supported command types: start workflow, start from the current node, recover a fault-tolerant workflow, resume a paused process, start from failed nodes, backfill, schedule, rerun, pause, stop, resume waiting threads. Of these, "recover a fault-tolerant workflow" and "resume waiting threads" are used internally by the scheduler and cannot be invoked externally.
- Timed scheduling: the system uses the Quartz distributed scheduler and also supports visual generation of cron expressions (see the sketch after this list).
- Dependency: besides the simple predecessor/successor dependencies within a DAG, the system provides a DEPENDENT task node that supports custom cross-process task dependencies.
- Priority: both process instances and task instances support priorities; if none is set, scheduling defaults to first-in-first-out.
- Email alerts: supports emailing SQL task query results, process instance run results, and fault-tolerance alert notifications.
- Failure strategy: for tasks running in parallel, two strategies are offered when one task fails. "Continue" lets the other parallel tasks run to their final state regardless, after which the process ends as failed; "End" kills the running parallel tasks as soon as a failure is detected and ends the process as failed.
- Backfill: re-runs over historical dates; both parallel and serial backfill across a date range are supported.
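Quartz cron expressions have up to seven fields: seconds, minutes, hours, day-of-month, month, day-of-week, and an optional year. A few illustrative schedules (these particular expressions are examples, not taken from the text above):

```
0 0 2 * * ? *      # every day at 02:00:00
0 */30 * * * ? *   # every 30 minutes
0 0 8 ? * MON *    # every Monday at 08:00:00
```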
4. Deployment
Standalone mode
1. In standalone mode, all services run inside a single StandaloneServer process, which embeds both the ZooKeeper registry and an H2 database. With nothing but a JDK configured, DolphinScheduler starts with one command for a quick hands-on trial.
2. Standalone is only meant for quick trials and is recommended for fewer than 20 workflows: it relies on an in-memory H2 database and a ZooKeeper testing server, so too many tasks can make it unstable, and restarting or stopping standalone-server wipes the in-memory database.
3. Deployment steps

Create the software directory:

```
mkdir -p /opt/soft
```
3.1 Download the DolphinScheduler binary package:

```
wget https://archive.apache.org/dist/dolphinscheduler/2.0.8/apache-dolphinscheduler-2.0.8-bin.tar.gz --no-check-certificate
```
Unpack:

```
tar -xvzf apache-dolphinscheduler-2.0.8-bin.tar.gz -C /opt/soft
```
Start and stop:

```
# Start the Standalone Server
bash ./bin/dolphinscheduler-daemon.sh start standalone-server
# Stop the Standalone Server
bash ./bin/dolphinscheduler-daemon.sh stop standalone-server
```
Log in

Open http://spark01:12345/dolphinscheduler. The default username/password is admin/dolphinscheduler123.
Web UI
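Before opening a browser, a quick liveness check from the shell can confirm the service is up (a sketch; the hostname, port, and URL are this guide's example values, and the exact HTTP status may vary by version):

```
# The standalone daemon registers with the JVM as StandaloneServer
jps | grep StandaloneServer
# Print the HTTP status of the UI URL; expect 200 (or a redirect) once it is serving
curl -s -o /dev/null -w "%{http_code}\n" http://spark01:12345/dolphinscheduler
```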
Pseudo-cluster mode
Pseudo-cluster mode deploys all DolphinScheduler services on a single machine: master, worker, api server, logger server, and so on all run on the same host. Unlike standalone mode, ZooKeeper and the database must be installed and configured separately.
Cluster mode
Cluster mode differs from pseudo-cluster mode only in that the services are deployed across multiple machines, and multiple Masters and multiple Workers can be configured. The layout used here:

| Host | Roles |
|---|---|
| spark01 | master, worker |
| spark02 | worker |
| spark03 | worker |
Download the DolphinScheduler package:

```
wget https://archive.apache.org/dist/dolphinscheduler/2.0.8/apache-dolphinscheduler-2.0.8-bin.tar.gz --no-check-certificate
```
Unpack:

```
tar -xvzf apache-dolphinscheduler-2.0.8-bin.tar.gz -C /opt/soft
```
Configure MySQL:

```
CREATE DATABASE dolphinscheduler DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
CREATE USER 'dolphinscheduler'@'%' IDENTIFIED BY 'dolphinscheduler';
GRANT ALL PRIVILEGES ON dolphinscheduler.* TO 'dolphinscheduler'@'%';
FLUSH PRIVILEGES;
```
If MySQL rejects the password for failing the password-validation policy:

```
SHOW VARIABLES LIKE 'validate_password%';
SET GLOBAL validate_password.policy = LOW;
```
Configure the DolphinScheduler install file:

```
cd conf/config
vim install_config.conf
```
ips="spark01,spark02,spark03" masters="spark01" workers="spark01:default,spark02:default,spark03:default" alertServer="spark02" apiServers="spark03" #pythonGatewayServers="ds1" installPath="/opt/soft/dolphinscheduler" deployUser="root" javaHome="/opt/soft/jdk-8" DATABASE_TYPE=${DATABASE_TYPE:-"mysql"} "jdbc:mysql://spark03:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8" SPRING_DATASOURCE_USERNAME=${SPRING_DATASOURCE_USERNAME:-"dolphinscheduler"} SPRING_DATASOURCE_PASSWORD=${SPRING_DATASOURCE_PASSWORD:-"dolphinscheduler"} registryServers="spark01:2181,spark02:2181,spark03:2181" resourceStorageType="HDFS" defaultFS="hdfs://mycluster" yarnHaIps="spark01,spark02" singleYarnIp="" hdfsRootUser="root"
```
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# ---------------------------------------------------------
# INSTALL MACHINE
# ---------------------------------------------------------
# A comma separated list of machine hostname or IP would be installed DolphinScheduler,
# including master, worker, api, alert. If you want to deploy in pseudo-distributed
# mode, just write a pseudo-distributed hostname
# Example for hostnames: ips="ds1,ds2,ds3,ds4,ds5", Example for IPs: ips="192.168.8.1,192.168.8.2,192.168.8.3,192.168.8.4,192.168.8.5"
ips="spark01,spark02,spark03"

# Port of SSH protocol, default value is 22. For now we only support same port in all `ips` machine
# modify it if you use different ssh port
sshPort="22"

# A comma separated list of machine hostname or IP would be installed Master server, it
# must be a subset of configuration `ips`.
# Example for hostnames: masters="ds1,ds2", Example for IPs: masters="192.168.8.1,192.168.8.2"
masters="spark01"

# A comma separated list of machine <hostname>:<workerGroup> or <IP>:<workerGroup>.All hostname or IP must be a
# subset of configuration `ips`, And workerGroup have default value as `default`, but we recommend you declare behind the hosts
# Example for hostnames: workers="ds1:default,ds2:default,ds3:default", Example for IPs: workers="192.168.8.1:default,192.168.8.2:default,192.168.8.3:default"
workers="spark01:default,spark02:default,spark03:default"

# A comma separated list of machine hostname or IP would be installed Alert server, it
# must be a subset of configuration `ips`.
# Example for hostname: alertServer="ds3", Example for IP: alertServer="192.168.8.3"
alertServer="spark02"

# A comma separated list of machine hostname or IP would be installed API server, it
# must be a subset of configuration `ips`.
# Example for hostname: apiServers="ds1", Example for IP: apiServers="192.168.8.1"
apiServers="spark03"

# A comma separated list of machine hostname or IP would be installed Python gateway server, it
# must be a subset of configuration `ips`.
# Example for hostname: pythonGatewayServers="ds1", Example for IP: pythonGatewayServers="192.168.8.1"
#pythonGatewayServers="ds1"

# The directory to install DolphinScheduler for all machine we config above. It will automatically be created by `install.sh` script if not exists.
# Do not set this configuration same as the current path (pwd)
installPath="/opt/soft/dolphinscheduler"

# The user to deploy DolphinScheduler for all machine we config above. For now user must create by yourself before running `install.sh`
# script. The user needs to have sudo privileges and permissions to operate hdfs. If hdfs is enabled than the root directory needs
# to be created by this user
deployUser="root"

# The directory to store local data for all machine we config above. Make sure user `deployUser` have permissions to read and write this directory.
dataBasedirPath="/tmp/dolphinscheduler"

# ---------------------------------------------------------
# DolphinScheduler ENV
# ---------------------------------------------------------
# JAVA_HOME, we recommend use same JAVA_HOME in all machine you going to install DolphinScheduler
# and this configuration only support one parameter so far.
javaHome="/opt/soft/jdk-8"

# DolphinScheduler API service port, also this is your DolphinScheduler UI component's URL port, default value is 12345
apiServerPort="12345"

# ---------------------------------------------------------
# Database
# NOTICE: If database value has special characters, such as `.*[]^${}\+?|()@#&`, Please add prefix `\` for escaping.
# ---------------------------------------------------------
# The type for the metadata database
# Supported values: ``postgresql``, ``mysql`, `h2``.
DATABASE_TYPE=${DATABASE_TYPE:-"mysql"}

# Spring datasource url, following <HOST>:<PORT>/<database>?<parameter> format, If you using mysql, you could use jdbc
# string jdbc:mysql://127.0.0.1:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8 as example
SPRING_DATASOURCE_URL=${SPRING_DATASOURCE_URL:-"jdbc:mysql://spark03:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8"}

# Spring datasource username
SPRING_DATASOURCE_USERNAME=${SPRING_DATASOURCE_USERNAME:-"dolphinscheduler"}

# Spring datasource password
SPRING_DATASOURCE_PASSWORD=${SPRING_DATASOURCE_PASSWORD:-"dolphinscheduler"}

# ---------------------------------------------------------
# Registry Server
# ---------------------------------------------------------
# Registry Server plugin name, should be a substring of `registryPluginDir`, DolphinScheduler use this for verifying configuration consistency
registryPluginName="zookeeper"

# Registry Server address.
registryServers="spark01:2181,spark02:2181,spark03:2181"

# Registry Namespace
registryNamespace="dolphinscheduler"

# ---------------------------------------------------------
# Worker Task Server
# ---------------------------------------------------------
# Worker Task Server plugin dir. DolphinScheduler will find and load the worker task plugin jar package from this dir.
taskPluginDir="lib/plugin/task"

# resource storage type: HDFS, S3, NONE
resourceStorageType="HDFS"

# resource store on HDFS/S3 path, resource file will store to this hdfs path, self configuration, please make sure the directory exists on hdfs and has read write permissions. "/dolphinscheduler" is recommended
resourceUploadPath="/dolphinscheduler"

# if resourceStorageType is HDFS,defaultFS write namenode address,HA, you need to put core-site.xml and hdfs-site.xml in the conf directory.
# if S3,write S3 address,HA,for example :s3a://dolphinscheduler,
# Note,S3 be sure to create the root directory /dolphinscheduler
defaultFS="hdfs://mycluster"

# if resourceStorageType is S3, the following three configuration is required, otherwise please ignore
s3Endpoint="http://192.168.xx.xx:9010"
s3AccessKey="xxxxxxxxxx"
s3SecretKey="xxxxxxxxxx"

# resourcemanager port, the default value is 8088 if not specified
resourceManagerHttpAddressPort="8088"

# if resourcemanager HA is enabled, please set the HA IPs; if resourcemanager is single node, keep this value empty
yarnHaIps="spark01,spark02"

# if resourcemanager HA is enabled or not use resourcemanager, please keep the default value; If resourcemanager is single node, you only need to replace 'yarnIp1' to actual resourcemanager hostname
singleYarnIp=""

# who has permission to create directory under HDFS/S3 root path
# Note: if kerberos is enabled, please config hdfsRootUser=
hdfsRootUser="root"

# kerberos config
# whether kerberos starts, if kerberos starts, following four items need to config, otherwise please ignore
kerberosStartUp="false"
# kdc krb5 config file path
krb5ConfPath="$installPath/conf/krb5.conf"
# keytab username,watch out the @ sign should followd by \\
keytabUserName="hdfs-mycluster\\@ESZ.COM"
# username keytab path
keytabPath="$installPath/conf/hdfs.headless.keytab"
# kerberos expire time, the unit is hour
kerberosExpireTime="2"

# use sudo or not
sudoEnable="true"

# worker tenant auto create
workerTenantAutoCreate="false"
```
Copy core-site.xml and hdfs-site.xml into the conf directory:

```
cp /opt/soft/hadoop-3/etc/hadoop/core-site.xml conf/
cp /opt/soft/hadoop-3/etc/hadoop/hdfs-site.xml conf/
```
Initialize the database:

1. Copy mysql-connector-j-8.0.33.jar into the lib directory.
2. From the unpacked directory, run script/create-dolphinscheduler.sh.
Run the installation:

```
./install.sh
```
Web UI

Open http://spark03:12345/dolphinscheduler. The default username/password is admin/dolphinscheduler123.
Start and stop:

```
cd /opt/soft/dolphinscheduler/
./bin/start-all.sh
./bin/stop-all.sh
```
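A quick sketch for verifying that each node runs the daemons assigned to it in the table above (the process names below are the ones the 2.0.x daemons register with the JVM):

```
# Run on each host: expect MasterServer on spark01, WorkerServer/LoggerServer on
# all three hosts, AlertServer on spark02, and ApiApplicationServer on spark03
jps | grep -E "MasterServer|WorkerServer|LoggerServer|AlertServer|ApiApplicationServer"
```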
Shell workflow

A demo workflow with three Shell task nodes:

```
nodeA: echo "A"
nodeB: echo "B"
nodeC: echo "C"
```
Local and global parameters

Local parameters are effective only on the single task node where they are defined. A task script references one like this:

```
echo "${dt}"
```

Global parameters are effective on all task nodes of the workflow.
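From inside a task script both kinds look identical; only where they are defined differs. A minimal sketch (the names dt and bizdate are illustrative):

```
# dt:      local parameter, defined in this task node's own parameter list
# bizdate: global parameter, defined in the workflow's save dialog
echo "local dt = ${dt}"
echo "global bizdate = ${bizdate}"
```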
Passing parameters

DolphinScheduler supports passing parameters from upstream task nodes to downstream ones. The task types currently supporting this feature are Shell, SQL, and Procedure. The following case uses Shell task nodes.

An upstream node declares output parameters with setValue:

```
echo '${setValue(key=value)}'
echo '${setValue(data=2024-01-03)}'
```

nodeA (upstream: sets the values)

nodeC (downstream: references them)
Parameter priority: a parameter referenced by a task node can come from three sources: global parameters, parameters passed by upstream tasks, and local parameters. Because a value may have several sources, priority matters when names collide. DolphinScheduler resolves from highest to lowest: parameters passed by upstream tasks > global parameters > local parameters. Among upstream-passed parameters, several upstream tasks may pass a parameter of the same name downstream; in that case the downstream node prefers a non-empty value, and if multiple non-empty values exist it sorts by upstream completion time and takes the parameter from the upstream task that completed earliest.
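Putting the pieces together, a minimal two-node sketch (the node and parameter names are illustrative):

```
# nodeA (upstream Shell task): declare an output parameter named dt
echo '${setValue(dt=2024-01-03)}'

# nodeC (downstream Shell task): dt arrives from nodeA; per the priority rule
# above, this upstream value would shadow a global or local parameter named dt
echo "received dt=${dt}"
```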
Referencing dependent resources

Some tasks need extra resources: MR and Spark tasks must reference jar packages, Shell tasks may reference other scripts, and so on. DolphinScheduler provides a Resource Center to manage these resources centrally; its storage backend can be the local filesystem, HDFS, etc. Besides file management, the Resource Center also offers management of Hive user-defined functions. The following uses a Shell task to demonstrate referencing a script stored in the Resource Center.
1. Create the script in the Resource Center.
2. Have nodeA reference the resource.
3. Invoke it in the task command as: bash + the script's relative path (see the sketch below).
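For example, assuming a script uploaded to the Resource Center under the hypothetical path test/hello.sh and selected as a resource of nodeA, the node's Shell command would be:

```
# test/hello.sh is a hypothetical resource path; DolphinScheduler stages the
# selected resources into the task's working directory, so the relative path resolves
bash test/hello.sh
```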
Alert notification

DolphinScheduler supports multiple alert channels; email is used here for demonstration.

1. Create an alert instance.
2. Configure alerting when starting the workflow.
3. The alert is delivered successfully.