Apache Hop: Usage Guide [work in progress]

沧海之巅

Last modified 2022-12-06 18:59:17

First published 2022-12-05 13:23:24

Article link: https://blog.csdn.net/linjie_830914/article/details/128097769

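Apache Hop is a powerful data processing tool. This article explains in detail how to build and manage pipelines with it, covering the Pipeline Editor, pipeline creation, transforms, error handling, and integration with Apache Beam. It also covers why unit tests matter and how to run them, and the use of various transforms for data cleaning, integration, error management, stream processing and more, to help users understand and apply Apache Hop.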

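Pipelines

Pipeline Editor

TOOLBAR

You learned how to create a pipeline in "Create a Pipeline". You will do most of your work on the pipeline canvas, but the pipeline editor's main toolbar offers a lot of additional functionality.

Two other important toolbars handle unit tests and projects and environments. See the related pages for more on managing projects and environments and on writing unit tests for pipelines.

Let's look at the top toolbar: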

Action	Description
run	Start the execution of the pipeline
pause	Pause the execution of the pipeline
stop	Stop the execution of the pipeline
preview	Preview the pipeline
debug	Debug the pipeline
print	Print the pipeline
undo	Undo an operation
redo	Redo an operation
align	Align the selected transforms to the specified grid size
align left	Align the selected transforms with the left-most transform in the selection
align right	Align the selected transforms with the right-most transform in the selection
align top	Align the selected transforms with the top-most transform in the selection
align bottom	Align the selected transforms with the bottom-most transform in the selection
distribute horizontally	Distribute the selected transforms evenly between the left-most and right-most transform in the selection
distribute vertically	Distribute the selected transforms evenly between the top-most and bottom-most transform in the selection

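Create a Pipeline

How pipelines work

Pipelines are the basic building blocks of a Hop project.

Pipelines do the heavy lifting: they read data from a variety of sources, perform operations on it (combining, cleaning, enriching, transforming and so on), and write the data to a target platform. A pipeline performs all of these operations in a predefined order and in parallel.

In the image below, a very simple pipeline reads data from a database, adds a message to the data and then sends an email. These operations run in the predefined order (read from the database, add the message, send the mail) and in parallel. To see how the pipeline executes these transforms, suppose the database table or query returns thousands of rows. The pipeline starts reading rows from the query result and passes them to the 'Add message' transform. Once the message has been added to a row, the mail transform sends a mail for it. All of this happens in parallel, so the mail transform can start sending before the last row has been read.

[Image]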
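The "predefined order, but in parallel" behaviour can be pictured with a minimal sketch that is not Hop code: each transform runs in its own thread and each hop is a small buffered queue. The function names, the `name` field and the buffer size are assumptions chosen only for illustration.

```python
import queue
import threading

END = object()  # sentinel marking the end of the row stream


def table_input(out_hop, rows):
    """Stand-in for 'Table input': emit rows one by one."""
    for row in rows:
        out_hop.put(row)
    out_hop.put(END)


def add_message(in_hop, out_hop):
    """Stand-in for 'Add message': enrich each row as soon as it arrives."""
    while (row := in_hop.get()) is not END:
        out_hop.put({**row, "message": f"Hello {row['name']}"})
    out_hop.put(END)


def mail(in_hop, sent):
    """Stand-in for 'mail': record the message instead of sending real mail."""
    while (row := in_hop.get()) is not END:
        sent.append(row["message"])


def run_pipeline(rows):
    # Small buffers force the three transforms to work at the same time.
    hop1 = queue.Queue(maxsize=2)
    hop2 = queue.Queue(maxsize=2)
    sent = []
    threads = [
        threading.Thread(target=table_input, args=(hop1, rows)),
        threading.Thread(target=add_message, args=(hop1, hop2)),
        threading.Thread(target=mail, args=(hop2, sent)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sent


if __name__ == "__main__":
    print(run_pipeline([{"name": "Ann"}, {"name": "Bob"}, {"name": "Cy"}]))
```

Because every hop holds at most two rows here, the reader cannot run far ahead of the mail step, which is the overlap the paragraph above describes.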

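Concepts

A pipeline consists of transforms connected by hops. In the mail example, "Table input", "Add message" and "mail" are all transforms.

A transform is the basic operation in a pipeline. A pipeline usually consists of many transforms connected by hops. Transforms are fine-grained in the sense that each one is designed and optimized to perform one task and one task only. A single transform may not offer spectacular functionality on its own, but the combination of all the transforms in a pipeline is what makes it powerful.
A hop connects transforms. When a transform has finished processing the data it received, that data is passed through a hop to the next transform. Hops are unidirectional (data cannot flow backwards). A hop only buffers and passes data; it is independent of the transforms and does not know where the data comes from or which transforms it goes to. Some transforms can conditionally read from or write to specific other transforms, but that is transform-specific configuration and the hop is unaware of it. A hop can be disabled by clicking or right-clicking it.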

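Create a pipeline

Create a new pipeline from the work item dialog. You will see the dialog below.

[Image]

When you have finished creating the pipeline, save it. You can do this from the "File" menu, with the save icon, or with CTRL+S (Command+S). For a new pipeline, a file browser opens so you can navigate to the location where you want to store the file.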

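Add Transform to your pipelines

Click anywhere on the pipeline canvas, that is, the area shown in the image below.

[Image]

After the click, the dialog shown below appears. Use the search box at the top to search by transform name, tag (TODO), and so on. Once you have found the transform you are looking for, click it to add it to the pipeline. Instead of clicking, you can navigate with the arrow keys and press Enter. Repeat this step whenever you want to add more transforms to the pipeline. After a transform has been added, you can drag it to reposition it.

See the list of transforms for details on the transforms you can add to a pipeline.

[Image]

Add a "Generate Rows" and an "Add Sequence" transform; your pipeline should look like the image below.

[Image]

A single click on a transform opens its configuration options. The menu shown below depends on the transform.

[Image]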

Action	Description
Edit	Edit the transform's metadata
Copy to clipboard	Copy the selected items to the clipboard
Create hop	Create a new hop between two transforms
Detach transform	Detach the transform from the pipeline
Show input fields	Show the fields entering this transform
Show output fields	Show the fields leaving this transform
Edit transform description	Add a description to the transform
Delete	Delete the selected transform from the canvas
Data routing
Specify copies	Set the number of copies of this transform to run
Copy rows	If there is more than one outgoing hop, the data is copied to each of the next transforms
Set partitioning	Specify how rows of data need to be grouped into partitions, allowing parallel execution where similar rows need to end up on the same transform copy
Error handling	Set the error handling for the transform; not available for all transforms
Add web service
Preview
View output
Preview output	Preview the results of the transform
Debug output
Sniff output	Look at 50 rows coming out of this transform, shown in a real-time table with the continuous output of the selected transform
Add data probe	Add a data probe
Logging
Edit Custom Logging	Edit the custom log settings for this transform. This changes the log level used for this transform
Clear Custom Logging	Clear the custom log settings. This clears the log level used for this transform
Unit Testing
Create data set	Create an empty data set with the output fields of this transform
Write rows to data set	Run the current pipeline and write the data to a data set
Other properties
Set the number of transforms	Start several instances of a transform in parallel
Show the fields entering this transform	Show metadata, such as the field name and type, for fields coming into the transform
Show the fields exiting this transform	Show metadata, such as the field name and type, for fields coming out of the transform
Distribute rows	If there is more than one outgoing hop, the data is distributed between the next transforms
Set input data set	Define which data to use instead of the active input transform; applies to the selected unit test
Clear input data set	Remove a defined data set from the selected unit test
Set golden data set	The input to this transform is taken and compared to the golden data set you select. The transform itself is not executed during testing
Clear golden data set	Remove the defined golden data set from this transform in the unit test
Remove from test	When this unit test is run, do not include this transform
Include in test	When this unit test is run, include this transform
Bypass in test	When this unit test is run, bypass this transform (replace it with a dummy)
Remove bypass in test	Do not bypass this transform in the current pipeline during testing
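The difference between "Copy rows" and "Distribute rows" in the table above can be shown with a small sketch. It is not Hop's implementation: the targets are plain lists standing in for the next transforms, and round-robin distribution is an assumption about how rows are dealt out.

```python
from itertools import cycle


def copy_rows(rows, targets):
    """'Copy rows': every target receives every row."""
    for row in rows:
        for target in targets:
            target.append(row)


def distribute_rows(rows, targets):
    """'Distribute rows': rows are dealt out in turn (round-robin assumed)."""
    for row, target in zip(rows, cycle(targets)):
        target.append(row)


if __name__ == "__main__":
    a, b = [], []
    distribute_rows([1, 2, 3, 4, 5], [a, b])
    print(a, b)
```

With copying, each downstream transform sees the full data set; with distribution, each sees only a share of it, which is what makes parallel copies useful.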

Add a Hop between transforms

There are several ways to create a hop:

shift-drag: hold down the Shift key, click a transform and, while holding the primary mouse button, drag to the second transform. Release the mouse button and the Shift key.
scroll-drag: click a transform with the scroll button of the mouse and, while holding it down, drag to the second transform. Release the scroll button.
Click a transform in the pipeline to open the "click anywhere" dialog, click the 'Create hop' button, and select the transform you want to create the hop to.

[Image]

Some transforms produce different types of hops.

Hop	Description
Result is TRUE	Specifies that the transform will be executed only when the result from the previous transform is true
Result is FALSE	Specifies that the transform will be executed only when the result from the previous transform is false
Main output of transform	The default hop between two transforms
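As a rough illustration of the TRUE and FALSE hop types, the sketch below routes each row to one of two outputs depending on a condition. The function name and the use of lists as hop targets are assumptions for the example, not Hop API.

```python
def filter_rows(rows, condition):
    """Split rows over a 'Result is TRUE' and a 'Result is FALSE' hop."""
    true_hop, false_hop = [], []
    for row in rows:
        # Each row follows exactly one of the two hops.
        (true_hop if condition(row) else false_hop).append(row)
    return true_hop, false_hop


if __name__ == "__main__":
    adults, minors = filter_rows(
        [{"age": 17}, {"age": 30}], lambda row: row["age"] >= 18
    )
    print(adults, minors)
```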

Pipeline properties

Pipeline properties are a collection of properties that describe the pipeline and configure its behaviour.

The properties dialog can be opened by clicking or double-clicking the pipeline canvas.

The following properties can be configured:

Pipeline
Parameters
Monitoring

[Image]

The Pipeline tab lets you specify the general properties of the pipeline, including:

Property	Description
Pipeline name	The name of the pipeline
Synchronize name with filename	If this option is enabled, the filename and the pipeline name are kept in sync
Pipeline filename	The filename of the pipeline
Description	Short description of the pipeline
Extended description	Long extended description of the pipeline
Status	Draft or production status
Version	Description of the version
Created by	Displays the original creator of the pipeline
Created at	Displays the date and time when the pipeline was created
Last modified by	Displays the last user that modified the pipeline
Last modified at	Displays the date and time when the pipeline was last modified

The Parameters tab lets you specify parameters that are specific to the pipeline. A parameter is defined by a name, a default value and a description.

[Image]
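To make the name / default value / description triple concrete, here is a minimal sketch of how a parameter reference could be resolved. It assumes the `${NAME}` reference style and invents a `Parameter` class and a `resolve` helper purely for illustration; it is not how Hop itself is implemented.

```python
import re
from dataclasses import dataclass


@dataclass
class Parameter:
    name: str
    default: str
    description: str = ""


def resolve(text, parameters, overrides=None):
    """Replace ${NAME} with the override if given, else the parameter default."""
    values = {p.name: p.default for p in parameters}
    values.update(overrides or {})
    # Unknown names are left untouched so the problem stays visible.
    return re.sub(r"\$\{(\w+)\}", lambda m: values.get(m.group(1), m.group(0)), text)


if __name__ == "__main__":
    params = [Parameter("INPUT_DIR", "/tmp/in", "folder with the input files")]
    print(resolve("${INPUT_DIR}/customers.csv", params))
    print(resolve("${INPUT_DIR}/customers.csv", params, {"INPUT_DIR": "/data"}))
```

The default is used when nothing else is supplied, and a value passed at run time takes precedence.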

The Monitoring tab lets you configure monitoring for the pipeline.

[Image]

The options that can be set on this tab are:

Property	Description	Type
Enable transform performance monitoring	Enable performance monitoring for the transforms in this pipeline	boolean
Transform performance measurement interval (ms)	The interval (in milliseconds) at which the performance of the transforms in this pipeline is measured	integer
Maximum number of snapshots in memory	The number of performance monitoring snapshots to keep in memory for the transforms in this pipeline	integer
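How the interval and the snapshot limit interact can be sketched as follows. This is an illustration only: the class and method names are made up, and dropping the oldest snapshot once the limit is reached is an assumption about the behaviour.

```python
from collections import deque


class PerformanceMonitor:
    def __init__(self, interval_ms, max_snapshots):
        self.interval_ms = interval_ms
        # A bounded deque silently drops the oldest snapshot when full.
        self.snapshots = deque(maxlen=max_snapshots)
        self._last_ms = None

    def maybe_snapshot(self, now_ms, rows_written):
        """Record a snapshot if at least interval_ms passed since the last one."""
        if self._last_ms is None or now_ms - self._last_ms >= self.interval_ms:
            self.snapshots.append((now_ms, rows_written))
            self._last_ms = now_ms
            return True
        return False


if __name__ == "__main__":
    monitor = PerformanceMonitor(interval_ms=1000, max_snapshots=2)
    for now, rows in [(0, 0), (500, 10), (1000, 20), (2000, 40)]:
        monitor.maybe_snapshot(now, rows)
    print(list(monitor.snapshots))
```

A shorter interval gives finer-grained measurements, and the snapshot limit bounds how much memory they can take.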

Run, Preview and Debug a Pipeline

Running a Pipeline

You can run a pipeline and watch it execute in any of the following ways:

Click the Run icon
Select Run and then Start execution from the top menu
Press F8

In the pipeline run dialog, click the "New" button in the top right corner to create a new "Pipeline run configuration".

[Image]

In the dialog that pops up, enter 'Local Pipeline' as the pipeline run configuration name and select 'Hop local pipeline engine' as the engine type.

[Image]

Click "Ok" to return to the pipeline run dialog.

Select a log level from the ones described below.

LogLevel	Description
Nothing	Do not record any logging output
Error	Only record errors in the logging output
Minimal	Only use minimal logging
Basic	This is the default logging level
Detailed	Gives a detailed logging output
Debugging	Results in a very detailed output for debugging purposes
Row Level (very detailed)	Logging at row level

Make sure your configuration is selected and click "Launch".
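Outside the GUI, Hop also has a command-line runner, hop-run. As a rough sketch, the helper below only assembles an argument list and does not start anything. The flag names (`--file`, `--runconfig`, `--level`) and the `BASIC` spelling of the log level are written from memory of that tool and should be checked against the help output of your Hop version; the launcher path and the pipeline file name are hypothetical.

```python
def hop_run_command(pipeline_file, run_config="Local Pipeline",
                    log_level="BASIC", launcher="./hop-run.sh"):
    """Build an argv list for Hop's command-line runner (nothing is executed)."""
    return [
        launcher,
        "--file", pipeline_file,     # the pipeline to run
        "--runconfig", run_config,   # name of the pipeline run configuration
        "--level", log_level,        # log level (spelling is an assumption)
    ]


if __name__ == "__main__":
    print(hop_run_command("generate-rows.hpl"))
```

The resulting list could be handed to `subprocess.run(..., check=True)` from the Hop installation directory once the flags have been verified.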

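When the pipeline has run successfully, a green check mark appears in the top right corner of each transform.

[Image]

Each transform also shows a small table icon, marked with a red box in the image below. Click it to preview the results of the transform.

[Image]

When the pipeline fails, a red triangle appears in the top right corner of the failed transform. Hover over the red error triangle for a quick look at the error message. The full stack trace is available in the log. See pipeline error handling to learn how to handle errors in a pipeline gracefully (which is not necessarily what you want).

[Image]

After every run, the execution results are shown in a panel at the bottom of the window. The execution results contain two tabs:

Metrics
Logging

The Metrics tab shows the metrics for each transform.

[Image]

The following metrics are shown:

Metric	Description
Copy
Input	Number of rows read from an input source
Read	Number of rows coming from the previous transform
Written	Number of rows leaving this transform toward the next transform
Output	Number of rows written to a file or table
Updated	Number of rows updated by the transform
Rejected
Errors	Number of errors in the execution. The whole row is marked red if an error occurs
Buffers Input	Input buffer
Buffers Output	Output buffer
Duration	The duration of the execution of the transform
Speed
Status	The transform status: Running, Stopped, Finished

The Logging tab shows the pipeline log according to the log level selected at execution time.

[Image]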

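Preview a pipeline

You can preview the results of a pipeline in any of the following ways:

Click the preview icon
Select Run and then Preview from the menu
Select "Preview output" from the transform action menu
After running the transform, click the small icon in its lower right corner

In the pipeline preview dialog you can select the transform whose results should be previewed. You can also set the number of rows to preview and a pause condition. Press the Quick launch button when you are done. To change the pipeline run configuration, click Configure. The results are shown in ascending order. A result preview looks like this:

[Image]

Debug a pipeline

You can debug a pipeline to eliminate errors in any of the following ways:

Click the debug icon
Select the Debug button under the Run menu
Select "Debug output" from the transform action menu

The dialog is the same one that is shown for the preview function, only with different options enabled.

[Image]

In the pipeline debug dialog you can select the transform whose results should be debugged. You can also set the number of rows and a pause condition. Press the Quick launch button when you are done. To change the pipeline run configuration, click Configure. The results are shown in descending order.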

Error Handling

When a serious failure occurs in one of the transforms, the pipeline is notified and stops all active operations. This is fine in most cases, and pipeline failures are usually handled in the parent workflow (see the create workflow page for error hops in workflows). In some cases, however, you want to handle certain errors gracefully without stopping the whole pipeline.

For those cases, where you do not want the pipeline to fail when an error occurs, Hop pipelines support error handling on transforms and hops.

When you create a hop from a transform that supports error handling to another transform, the Hop pipeline editor asks whether the hop is for the main output or for the error handling of the transform.

[Image]

If you choose to create an error handling hop, the hop is shown in red instead of the default black (or white, if you are in dark mode).

[Image]

For every transform that supports error handling you can configure a number of options. Click the transform icon to open the context dialog and select the "Error handling" icon.

[Image]

In the error handling dialog you can specify the additional fields that will be added to the pipeline stream.

[Image]

The available options are:

option	description
target transform	the transform that will receive the error information
enable the error handling	enable error handling for this transform
nr of errors fieldname	name of the field that will contain the number of errors that occurred
error description fieldname	name of the field that will contain the error description
error fields fieldname	name of the field that will contain the pipeline field where the error occurred
error codes fieldname	name of the field that will contain the error code of the error that occurred
max nr errors allowed	maximum number of errors allowed before the pipeline fails
max % errors allowed (empty = 100%)	the percentage of errors that is allowed before the pipeline fails
min nr of rows to read before doing % evaluation	number of rows to read before doing the percentage evaluation. These rows are taken into account in the evaluation, but the evaluation is only performed once the specified number of rows has been processed

Sample output for converting an invalid date string to a date is shown below.

[Image]
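The options above can be pictured with a small sketch that parses date strings and routes failures to an error stream instead of failing. It is an illustration under assumptions, not Hop's logic: the output field names, the `DATE_PARSE` code, the date format and the exact threshold semantics (fail once a limit is exceeded) are all hypothetical.

```python
from datetime import datetime


def parse_dates(rows, max_errors=None, max_pct=None, min_rows_for_pct=0):
    """Parse row['date']; send bad rows to an error stream instead of failing."""
    good, bad = [], []
    for seen, row in enumerate(rows, start=1):
        try:
            parsed = datetime.strptime(row["date"], "%Y-%m-%d").date()
            good.append({**row, "date": parsed})
        except ValueError as exc:
            # Extra fields added to the error row, mirroring the dialog options.
            bad.append({
                **row,
                "nr_errors": 1,
                "error_description": str(exc),
                "error_fields": "date",
                "error_codes": "DATE_PARSE",  # hypothetical error code
            })
        if max_errors is not None and len(bad) > max_errors:
            raise RuntimeError(f"more than {max_errors} errors")
        # The percentage check only starts after enough rows were processed.
        if max_pct is not None and seen >= min_rows_for_pct:
            if 100.0 * len(bad) / seen > max_pct:
                raise RuntimeError(f"more than {max_pct}% errors")
    return good, bad


if __name__ == "__main__":
    ok, errors = parse_dates([{"date": "2022-12-05"}, {"date": "not a date"}])
    print(len(ok), errors[0]["error_fields"])
```

Without limits, every bad row simply ends up in the error stream; with limits, the whole run still fails once too many rows are bad.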

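Getting started with Apache Beam

What is Apache Beam?

Apache Beam is an advanced unified programming model that lets you implement batch and streaming data processing jobs that run on any execution engine. Popular execution engines include Apache Spark, Apache Flink and Google Cloud Dataflow.

How does it work?

Apache Beam lets you write programs against the standard Beam API in several programming languages (such as Java, Python and Go). These programs build data pipelines, which can then be executed on a variety of execution engines through Beam runners.

How is Hop using Beam?

Hop uses the Beam API to create Beam pipelines from the Hop pipelines you design visually. Hop and Beam use the same terminology. Hop provides four standard ways to execute the pipelines you design: on Spark, Flink, Dataflow, or the Direct runner.

See the documentation of the relevant plugins for details.

What software versions are supported

Hop version	Beam version	Spark version	Flink version
1.0.0	2.32.0	2.4.8	1.11
1.1.0	2.35.0	3.1.2	1.13
1.2.0	2.35.0	3.1.2 (scala 2.12)	1.13 (scala 2.11)
2.0.0	2.38.0	3.1.3 (scala 2.12)	1.14.4 (scala 2.11)
2.1.0	2.41.0	3.3.0 (scala 2.12)	1.15.2

How are my pipelines executed?

An Apache Hop pipeline is just metadata. The various Beam pipeline engine plugins look at this metadata one transform at a time and decide what to do with each transform based on the Hop transform handler that is provided. Transforms generally fall into the types described below.

It is important to remember that Beam pipelines try to solve every action in an 'embarrassingly parallel' way. This means that every transform can, and usually will, run in more than one copy. On large clusters you should expect many copies of the same code to run at any given time.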
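The idea of "metadata in, one handler per transform type" can be sketched in a few lines. This is a toy model and not the Hop Beam engine: the transform types `AddConstant` and `SelectFields`, the metadata keys and the `build` function are all hypothetical.

```python
def handle_add_constant(meta):
    name, value = meta["field"], meta["value"]
    return lambda rows: [{**row, name: value} for row in rows]


def handle_select_fields(meta):
    keep = meta["fields"]
    return lambda rows: [{key: row[key] for key in keep} for row in rows]


# One handler per transform type; the engine never hard-codes a pipeline.
HANDLERS = {
    "AddConstant": handle_add_constant,
    "SelectFields": handle_select_fields,
}


def build(pipeline_meta):
    """Visit each transform's metadata once and chain the resulting steps."""
    steps = []
    for meta in pipeline_meta:
        handler = HANDLERS.get(meta["type"])
        if handler is None:
            raise ValueError(f"no handler for transform type {meta['type']!r}")
        steps.append(handler(meta))

    def run(rows):
        for step in steps:
            rows = step(rows)
        return rows

    return run


if __name__ == "__main__":
    run = build([
        {"type": "AddConstant", "field": "source", "value": "hop"},
        {"type": "SelectFields", "fields": ["id", "source"]},
    ])
    print(run([{"id": 1, "x": 9}]))
```

A transform type without a handler cannot be translated, which is why only some transforms work on a given engine.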

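Beam specific transforms

A number of Beam-specific transforms are available that only work on the provided Beam pipeline execution engines. Examples are Beam Input, which reads text file data from one or more files, and Beam BigQuery Output, which writes data to BigQuery.

You can find these transforms in the Big Data category. Their names all start with Beam, which makes them easy to recognize.

Below is a simple example pipeline that reads the files in a folder (on gs://), filters out the data from California, removes and renames some fields, and writes the data back to another set of files:

[Image]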
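The same read, filter, trim, rename and write sequence can be mimicked in plain Python on in-memory CSV text. This is only a sketch of the data flow, not Beam or Hop code: the column names are invented, and "filters out" is read here as "removes the California rows" (flip the condition if the opposite is meant).

```python
import csv
import io


def process(csv_text):
    """Read CSV text, drop California rows, trim and rename fields, write CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["customer", "state"],
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row["state"] == "California":
            continue  # assumption: 'filter out' means these rows are removed
        # Keep two fields and rename 'name' to 'customer'; 'phone' is dropped.
        writer.writerow({"customer": row["name"], "state": row["state"]})
    return out.getvalue()


if __name__ == "__main__":
    print(process("name,state,phone\nAnn,California,1\nBob,Texas,2\n"))
```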

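Universal transforms

A few transforms are translated into Beam variants:

Memory Group By: this transform lets you aggregate data across large data volumes. On the Beam engines it uses org.apache.beam.sdk.transforms.GroupByKey.
Merge Join: you can use this transform to join two data sources. The main difference is that on the Beam engines the input data does not need to be sorted. The Beam class used to perform this operation is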