How Apache NiFi works: surf on your dataflow, don’t drown in it

by François Paupier

Introduction

That’s a crazy flow of water. Just like your application deals with a crazy stream of data. Routing data from one storage system to another, applying validation rules, and addressing questions of data governance and reliability in a Big Data ecosystem are hard to get right if you do it all by yourself.

Good news, you don’t have to build your dataflow solution from scratch — Apache NiFi got your back!

At the end of this article, you’ll be a NiFi expert — ready to build your data pipeline.

What I will cover in this article:
  • What Apache NiFi is, in which situations you should use it, and which key concepts you need to understand.

What I won’t cover:
  • Installation, deployment, monitoring, security, and administration of a NiFi cluster.

For your convenience, here is the table of contents. Feel free to jump straight to wherever your curiosity takes you. If you’re a NiFi first-timer, I advise reading this article in the order shown.

Table of Contents
  • What is Apache NiFi?
  • Apache NiFi under the microscope
  • Conclusion and call to action
  • Resources

What is Apache NiFi?

On the website of the Apache NiFi project, you can find the following definition:

An easy to use, powerful, and reliable system to process and distribute data.

Let’s analyze the keywords there.

Defining NiFi

Process and distribute data

That’s the gist of NiFi. It moves data between systems and gives you tools to process it.

NiFi can deal with a great variety of data sources and formats. You take data in from one source, transform it, and push it to a different data sink.

Easy to use

Processors (the boxes) linked by connectors (the arrows) create a flow. NiFi offers a flow-based programming experience.

NiFi makes it possible to understand, at a glance, a set of dataflow operations that would take hundreds of lines of source code to implement.

Consider the pipeline below:

To build the data flow above in NiFi, you go to the NiFi graphical user interface, drag and drop three components onto the canvas, and that’s it. It takes two minutes to build.

Now, if you write code to do the same thing, it’s likely to take several hundred lines to achieve a similar result.

You don’t capture the essence of the pipeline through code as you do with a flow-based approach. NiFi is more expressive for building a data pipeline; it’s designed to do that.

Powerful

NiFi provides many processors out of the box (293 in NiFi 1.9.2). You’re on the shoulders of a giant. Those standard processors handle the vast majority of use cases you may encounter.

NiFi is highly concurrent, yet its internals encapsulate the associated complexity. Processors offer you a high-level abstraction that hides the inherent complexity of parallel programming. Processors run simultaneously, and you can spawn multiple threads of a processor to cope with the load.

Concurrency is a computing Pandora’s box that you don’t want to open. NiFi conveniently shields the pipeline builder from the complexities of concurrency.

Reliable

The theory backing NiFi is not new; it has solid theoretical anchors. It’s similar to models like SEDA.

For a dataflow system, one of the main topics to address is reliability. You want to be sure that data sent somewhere is actually received.

NiFi achieves a high level of reliability through multiple mechanisms that keep track of the state of the system at any point in time. Those mechanisms are configurable so you can make the appropriate tradeoffs between latency and throughput required by your applications.

NiFi tracks the history of each piece of data with its lineage and provenance features. It makes it possible to know what transformation happens on each piece of information.

The data lineage solution proposed by Apache NiFi proves to be an excellent tool for auditing a data pipeline. Data lineage features are essential to bolster confidence in big data and AI systems in a context where transnational actors such as the European Union propose guidelines to support accurate data processing.

Why use NiFi?

First, I want to make it clear I’m not here to evangelize NiFi. My goal is to give you enough elements so you can make an informed decision on the best way to build your data pipeline.

It’s useful to keep the four Vs of big data in mind when sizing your solution.

  • Volume: At what scale do you operate? In order of magnitude, are you closer to a few gigabytes or hundreds of petabytes?
  • Variety: How many data sources do you have? Is your data structured? If yes, does the schema vary often?
  • Velocity: What is the frequency of the events you process? Is it credit card payments? Is it a daily performance report sent by an IoT device?
  • Veracity: Can you trust the data? Or do you need to apply multiple cleaning operations before manipulating it?

NiFi seamlessly ingests data from multiple data sources and provides mechanisms to handle different schemas in the data. Thus, it shines when there is a high variety in the data.

NiFi is particularly valuable if the data is of low veracity, since it provides multiple processors to clean and format it.

With its configuration options, NiFi can address a broad range of volume/velocity situations.

An increasing list of applications for data routing solutions

New regulations, the rise of the Internet of Things and the flow of data it generates emphasize the relevance of tools such as Apache NiFi.

  • Microservices are trendy. In those loosely coupled services, the data is the contract between the services. NiFi is a robust way to route data between those services.
  • The Internet of Things brings a multitude of data to the cloud. Ingesting and validating data from the edge to the cloud poses a lot of new challenges that NiFi can efficiently address (primarily through MiNiFi, the NiFi subproject for edge devices).
  • New guidelines and regulations are put in place to readjust the Big Data economy. In this context of increasing monitoring, it is vital for businesses to have a clear overview of their data pipeline. NiFi data lineage, for example, can be helpful on the path towards compliance with regulations.

Bridge the gap between big data experts and the others

As you can see from the user interface, a dataflow expressed in NiFi is an excellent way to communicate about your data pipeline. It can help members of your organization become more knowledgeable about what’s going on in the data pipeline.

  • An analyst is asking why this data arrives here that way? Sit together and walk through the flow. In five minutes, you give someone a strong understanding of the Extract, Transform, and Load (ETL) pipeline.
  • You want feedback from your peers on a new error handling flow you created? NiFi makes it a design decision to treat error paths as being as likely as valid outcomes. Expect the flow review to be shorter than a traditional code review.

Should you use it? Yes, No, Maybe?

NiFi brands itself as easy to use. Still, it is an enterprise dataflow platform. It offers a complete set of features from which you may only need a reduced subset. Adding a new tool to the stack is not benign.

If you are starting from scratch and manage a small amount of data from trusted sources, you may be better off setting up your own Extract, Transform, and Load (ETL) pipeline. Maybe change data capture from a database and a few data preparation scripts are all you need.

On the other hand, if you work in an environment with existing big data solutions in use (be it for storage, processing, or messaging), NiFi integrates well with them and is more likely to be a quick win. You can leverage the out-of-the-box connectors to those other Big Data solutions.

It’s easy to be hyped by new solutions. List your requirements and choose the solution that answers your needs as simply as possible.

Now that we have seen the big picture of Apache NiFi, let’s take a look at its key concepts and dissect its internals.

Apache NiFi under the microscope

“NiFi is boxes-and-arrows programming” may be fine for communicating the big picture. However, if you have to operate NiFi, you may want to understand a bit more about how it works.

In this second part, I explain the critical concepts of Apache NiFi with diagrams. This black box model won’t be a black box to you afterward.

Unboxing Apache NiFi

When you start NiFi, you land on its web interface. The web UI is the blueprint on which you design and control your data pipeline.

In NiFi, you assemble processors linked together by connections. In the sample dataflow introduced previously, there are three processors.

The NiFi canvas user interface is the framework in which the pipeline builder works.

Making sense of NiFi terminology

To express your dataflow in NiFi, you must first master its language. No worries, a few terms are enough to grasp the concepts behind it.

The black boxes are called processors, and they exchange chunks of information named FlowFiles through queues that are named connections. Finally, the Flow Controller is responsible for managing the resources between those components.

Let’s take a look at how this works under the hood.

FlowFile

In NiFi, the FlowFile is the information packet moving through the processors of the pipeline.

A FlowFile comes in two parts:

  • Attributes, which are key/value pairs. For example, the file name, file path, and a unique identifier are standard attributes.
  • Content, a reference to the stream of bytes that composes the FlowFile content.

The FlowFile does not contain the data itself. That would severely limit the throughput of the pipeline.

Instead, a FlowFile holds a pointer that references data stored at some place in the local storage. This place is called the Content Repository.

To access the content, the FlowFile claims the resource from the Content Repository. The latter keeps track of the exact disk offset where the content is stored and streams it back to the FlowFile.

Not all processors need to access the content of the FlowFile to perform their operations. For example, aggregating the content of two FlowFiles doesn’t require loading their content into memory.

When a processor modifies the content of a FlowFile, the previous data is kept. NiFi uses copy-on-write: it modifies the content while copying it to a new location. The original information is left intact in the Content Repository.

Example

Consider a processor that compresses the content of a FlowFile. The original content remains in the Content Repository, and a new entry is created for the compressed content.

The Content Repository finally returns the reference to the compressed content. The FlowFile is updated to point to the compressed data.
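
The compression example can be sketched in a few lines of Python. This is only a toy model to illustrate copy-on-write, not how NiFi is implemented: the `ContentRepository` and `FlowFile` classes, the integer claim ids, and the `compressed` attribute are all hypothetical names chosen for this illustration.

```python
import zlib
from dataclasses import dataclass, field


class ContentRepository:
    """Toy append-only content store: existing claims are never overwritten."""

    def __init__(self):
        self._claims = {}
        self._next_id = 0

    def write(self, data: bytes) -> int:
        claim_id = self._next_id
        self._claims[claim_id] = data
        self._next_id += 1
        return claim_id

    def read(self, claim_id: int) -> bytes:
        return self._claims[claim_id]


@dataclass
class FlowFile:
    attributes: dict = field(default_factory=dict)
    content_claim: int = -1  # a pointer into the repository, not the bytes


def compress(flowfile: FlowFile, repo: ContentRepository) -> FlowFile:
    """Copy-on-write: store the compressed bytes as a new claim and repoint."""
    original = repo.read(flowfile.content_claim)
    new_claim = repo.write(zlib.compress(original))
    return FlowFile({**flowfile.attributes, "compressed": "true"}, new_claim)


repo = ContentRepository()
raw = b"hello nifi " * 100
before = FlowFile({"filename": "greeting.txt"}, repo.write(raw))
after = compress(before, repo)
```

After the call, `before` still points at the untouched original bytes, while `after` points at a new claim holding the compressed copy.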

The drawing below sums up the example with a processor that compresses the content of FlowFiles.

Reliability

NiFi claims to be reliable, but how does that hold up in practice? The attributes of all the FlowFiles currently in use, as well as the reference to their content, are stored in the FlowFile Repository.

At every step of the pipeline, a modification to a FlowFile is first recorded in the FlowFile Repository, in a write-ahead log, before it is performed.

For each FlowFile that currently exists in the system, the FlowFile Repository stores:
  • The FlowFile attributes
  • A pointer to the content of the FlowFile located in the Content Repository
  • The state of the FlowFile, for example which queue the FlowFile belongs to at this instant.

The FlowFile repository gives us the most current state of the flow; thus it’s a powerful tool to recover from an outage.
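
To make the write-ahead idea concrete, here is a minimal Python sketch under simplifying assumptions: the log is an in-memory list standing in for a durable file, and the class name, record fields, and queue names are hypothetical. It shows only the general technique of logging a change before applying it, then replaying the log to recover.

```python
import json


class FlowFileRepository:
    """Toy write-ahead log: each change is appended to the log before it is applied."""

    def __init__(self):
        self.log = []     # stands in for the durable on-disk log
        self.state = {}   # in-memory view: FlowFile id -> latest record

    def update(self, flowfile_id, attributes, content_claim, queue):
        record = {
            "id": flowfile_id,
            "attributes": attributes,
            "content_claim": content_claim,
            "queue": queue,
        }
        self.log.append(json.dumps(record))  # 1. record the change first
        self.state[flowfile_id] = record     # 2. then apply it

    @classmethod
    def recover(cls, log):
        """Rebuild the latest state by replaying the log after an outage."""
        repo = cls()
        for line in log:
            record = json.loads(line)
            repo.state[record["id"]] = record
        repo.log = list(log)
        return repo


repo = FlowFileRepository()
repo.update("ff-1", {"filename": "a.csv"}, content_claim=0, queue="fetch->compress")
repo.update("ff-1", {"filename": "a.csv"}, content_claim=1, queue="compress->store")

restored = FlowFileRepository.recover(repo.log)
```

Because every change reaches the log before the in-memory state, replaying the log is enough to get back to the most recent picture of the flow.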

NiFi provides another tool to track the complete history of all the FlowFiles in the flow: the Provenance Repository.

Provenance Repository

Every time a FlowFile is modified, NiFi takes a snapshot of the FlowFile and its context at this point. The name for this snapshot in NiFi is a Provenance Event. The Provenance Repository records Provenance Events.

Provenance enables us to retrace the lineage of the data and build the full chain of custody for every piece of information processed in NiFi.

On top of offering the complete lineage of the data, the Provenance Repository also lets you replay the data from any point in time.

Wait, what’s the difference between the FlowFile Repository and the Provenance Repository?

The ideas behind the FlowFile Repository and the Provenance Repository are quite similar, but they don’t address the same issue.

  • The FlowFile repository is a log that contains only the latest state of the in-use FlowFiles in the system. It is the most recent picture of the flow and makes it possible to recover from an outage quickly.
  • The Provenance Repository, on the other hand, is more exhaustive since it tracks the complete life cycle of every FlowFile that has been in the flow.

Where the FlowFile Repository gives you only the most recent picture of the system, the Provenance Repository gives you a collection of photos: a video. You can rewind to any moment in the past, investigate the data, and replay operations from a given time. It provides a complete lineage of the data.
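
The photo-versus-video contrast can be illustrated with a short Python sketch. It is a simplification with hypothetical names: a dictionary plays the role of the FlowFile Repository (latest state only) and an append-only list plays the role of the Provenance Repository (every event). The event type strings are illustrative labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceEvent:
    flowfile_id: str
    event_type: str     # illustrative label, e.g. "RECEIVE" or "SEND"
    content_claim: int


latest_state = {}       # "photo": one entry per in-use FlowFile
provenance = []         # "video": every event, in order


def record(flowfile_id, event_type, content_claim):
    latest_state[flowfile_id] = content_claim   # overwrite the previous state
    provenance.append(ProvenanceEvent(flowfile_id, event_type, content_claim))


def lineage(flowfile_id):
    """Return the ordered history of one FlowFile."""
    return [e for e in provenance if e.flowfile_id == flowfile_id]


record("ff-1", "RECEIVE", 0)
record("ff-1", "CONTENT_MODIFIED", 1)
record("ff-1", "SEND", 1)

history = lineage("ff-1")
```

The dictionary only remembers that `ff-1` currently points at claim 1, while the event list still knows it started at claim 0, which is what makes rewinding and replaying possible.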

FlowFile Processor

A processor is a black box that performs an operation. Processors have access to the attributes and the content of the FlowFile to perform all kinds of actions. They enable you to perform many operations: data ingress, standard data transformation and validation tasks, and saving this data to various data sinks.

NiFi comes with many processors when you install it. If you don’t find the perfect one for your use case, it’s still possible to build your own processor. Writing custom processors is outside the scope of this blog post.

Processors are high-level abstractions that fulfill one task. This abstraction is very convenient because it shields the pipeline builder from the inherent difficulties of concurrent programming and the implementation of error handling mechanisms.

Processors expose an interface with multiple configuration settings to fine-tune their behavior.

The properties of those processors are the last link between NiFi and the business reality of your application requirements.

The devil is in the details, and pipeline builders spend most of their time fine-tuning those properties to match the expected behavior.

Scaling

For each processor, you can specify the number of concurrent tasks you want to run simultaneously. This way, the Flow Controller allocates more resources to this processor, increasing its throughput. Processors share threads. If one processor requests more threads, other processors have fewer threads available to execute. Details on how the Flow Controller allocates threads are available here.

Horizontal scaling. Another way to scale is to increase the number of nodes in your NiFi cluster. Clustering servers makes it possible to increase your processing capability using commodity hardware.

Process Group

This one is straightforward now that we’ve seen what processors are.

A bunch of processors put together with their connections can form a process group. You add an input port and an output port so it can receive and send data.

Process groups are an easy way to create new processors based on existing ones.

Connections

Connections are the queues between processors. These queues allow processors to interact at differing rates. Connections can have different capacities, just as water pipes come in different sizes.

Because processors consume and produce data at different rates depending on the operations they perform, connections act as buffers of FlowFiles.

There is a limit on how much data can be in the connection. Similarly, when your water pipe is full, you can’t add water anymore, or it overflows.

In NiFi you can set limits on the number of FlowFiles and the size of their aggregated content going through the connections.

What happens when you send more data than the connection can handle?

If the number of FlowFiles or the quantity of data goes above the defined threshold, backpressure is applied. The Flow Controller won’t schedule the previous processor to run again until there is room in the queue.

Let’s say you have a limit of 10 000 FlowFiles between two processors. At some point, the connection has 7 000 elements in it. It is ok since the limit is 10 000. P1 can still send data through the connection to P2.

Now let’s say that P1 sends 4 000 new FlowFiles to the connection. 7 000 + 4 000 = 11 000, so we go above the connection threshold of 10 000 FlowFiles.

The limits are soft limits, meaning they can be exceeded. However, once they are, the previous processor, P1, won’t be scheduled until the connection goes back below its threshold value of 10 000 FlowFiles.

This simplified example gives the big picture of how backpressure works.
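
The numbers above can be replayed with a small Python model. It is a sketch under assumptions, not NiFi’s scheduler: the `Connection` class is hypothetical, it only tracks a FlowFile count (not data size), and it treats a queue that has reached the threshold as full.

```python
class Connection:
    """Toy queue with a soft object-count threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.queued = 0

    def backpressure(self):
        return self.queued >= self.threshold

    def offer(self, count):
        """Accept a batch from the upstream processor if it is allowed to run."""
        if self.backpressure():
            return False          # the upstream processor is not scheduled
        self.queued += count      # soft limit: a running batch may overshoot
        return True

    def take(self, count):
        """The downstream processor drains FlowFiles from the queue."""
        self.queued -= min(count, self.queued)


conn = Connection(threshold=10_000)
conn.offer(7_000)
accepted = conn.offer(4_000)      # 7 000 is below 10 000, so P1 still runs
blocked = not conn.offer(1)       # 11 000 is above the threshold, P1 is held back
conn.take(2_000)                  # P2 drains the queue down to 9 000
resumed = conn.offer(1)           # below the threshold again, P1 runs
```

The check happens before a batch is accepted, which is why the queue can temporarily hold 11 000 FlowFiles even though the threshold is 10 000.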

You want to set up connection thresholds appropriate to the Volume and Velocity of the data to handle. Keep in mind the Four Vs.

The idea of exceeding a limit may sound odd. When the number of FlowFiles or the associated data go beyond the threshold, a swap mechanism is triggered.

For another example on backpressure, this mail thread can help.

Prioritizing FlowFiles

The connections in NiFi are highly configurable. You can choose how you prioritize FlowFiles in the queue to decide which one to process next.

Among the available options there is, for example, First In First Out (FIFO) order. However, you can even use an attribute of your choice from the FlowFile to prioritize incoming packets.
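
As an illustration of attribute-based ordering, here is a minimal Python sketch built on a heap. The `PrioritizedQueue` class, the `priority` attribute name, and the convention that a lower number is served first are assumptions made for this example; ties fall back to arrival order, which gives FIFO behavior among equal priorities.

```python
import heapq
import itertools


class PrioritizedQueue:
    """Toy connection queue ordered by a numeric FlowFile attribute."""

    def __init__(self, attribute):
        self.attribute = attribute
        self._heap = []
        self._arrival = itertools.count()   # tie-breaker keeps FIFO order

    def put(self, attributes):
        priority = int(attributes.get(self.attribute, 0))
        heapq.heappush(self._heap, (priority, next(self._arrival), attributes))

    def get(self):
        return heapq.heappop(self._heap)[2]


queue = PrioritizedQueue("priority")
queue.put({"filename": "daily-report.csv", "priority": "5"})
queue.put({"filename": "payment.json", "priority": "1"})
queue.put({"filename": "refund.json", "priority": "1"})

order = [queue.get()["filename"] for _ in range(3)]
```

Here the two payment-related FlowFiles jump ahead of the daily report even though the report arrived first.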

Flow Controller

The Flow Controller is the glue that brings everything together. It allocates and manages threads for processors. It’s what executes the dataflow.

Also, the Flow Controller makes it possible to add Controller Services.

Those services facilitate the management of shared resources like database connections or cloud services provider credentials. Controller services are daemons. They run in the background and provide configuration, resources, and parameters for the processors to execute.

For example, you may use an AWS credentials provider service to make it possible for your services to interact with S3 buckets without having to worry about the credentials at the processor level.
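
The idea of a shared service can be sketched as follows. This is a plain Python illustration of the pattern, not NiFi or AWS code: `CredentialsService`, `PutObjectProcessor`, the bucket names, and the placeholder keys are all hypothetical, and nothing here talks to S3.

```python
class CredentialsService:
    """Toy controller service: one shared place that holds the credentials."""

    def __init__(self, access_key, secret_key):
        self._access_key = access_key
        self._secret_key = secret_key

    def credentials(self):
        return {"access_key": self._access_key, "secret_key": self._secret_key}


class PutObjectProcessor:
    """Toy processor: it references the service instead of storing secrets."""

    def __init__(self, bucket, credentials_service):
        self.bucket = bucket
        self.credentials_service = credentials_service

    def describe_request(self, key):
        creds = self.credentials_service.credentials()
        return {"bucket": self.bucket, "key": key, "access_key": creds["access_key"]}


shared = CredentialsService("EXAMPLE_KEY", "EXAMPLE_SECRET")
raw_writer = PutObjectProcessor("raw-data", shared)
clean_writer = PutObjectProcessor("clean-data", shared)

request = clean_writer.describe_request("2019/04/report.csv")
```

Both processors hold a reference to the same service object, so rotating the credentials means changing them in one place only.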

Just like with processors, a multitude of controller services are available out of the box.

You can check out this article for more content on the controller services.

Conclusion and call to action

In the course of this article, we discussed NiFi, an enterprise dataflow solution. You now have a strong understanding of what NiFi does and how you can leverage its data routing features for your applications.

If you’re reading this, congrats! You now know more about NiFi than 99.99% of the world’s population.

Practice makes perfect. You now know all the concepts required to start building your own pipeline. Make it simple; make it work first.

Here is a list of exciting resources that, together with my work experience, I drew on to write this article.

Resources

The bigger picture

Because designing a data pipeline in a complex ecosystem requires proficiency in multiple areas, I highly recommend the book Designing Data-Intensive Applications by Martin Kleppmann. It covers the fundamentals.

  • A cheat sheet with all the references quoted in Martin’s book is available on his GitHub repo.

This cheat sheet is a great place to start if you already know what kind of topic you’d like to study in-depth and you want to find quality materials.

Alternatives to Apache NiFi

Other dataflow solutions exist.

Open source:

Most of the existing cloud providers offer dataflow solutions. Those solutions integrate easily with other products you use from this cloud provider. At the same time, they solidly tie you to a particular vendor.

  • The official NiFi documentation, and especially the NiFi In-depth section, are gold mines.
  • Registering to the NiFi users mailing list is also a great way to stay informed. For example, this conversation explains back-pressure.
  • Hortonworks, a big data solutions provider, has a community website full of engaging resources and how-tos for Apache NiFi.
    - This article goes in depth about connectors, heap usage, and back pressure.
    - This one shares dimensioning best practices when deploying a NiFi cluster.
  • The NiFi blog distills a lot of insights on NiFi usage patterns, as well as tips on how to build pipelines.
  • Claim Check pattern explained
  • The theory behind Apache NiFi is not new; SEDA, referenced in the NiFi documentation, is extremely relevant.
    - Matt Welsh. Berkeley. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services [online]. Retrieved: 21 Apr 2019, from http://www.mdw.la/papers/seda-sosp01.pdf

Translated from: https://www.freecodecamp.org/news/nifi-surf-on-your-dataflow-4f3343c50aa2/
