Draft

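As data flows through NiFi, NiFi uses a write-ahead log to record changes to FlowFiles. The write-ahead log records changes to the FlowFiles themselves, such as a FlowFile's attributes (metadata made up of key/value pairs), and to their state, such as which Connection/Queue a FlowFile belongs to.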

 

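SerDe: the interface that serializes and deserializes records and record updates.

Transaction ID Generator: generates an auto-incrementing ID from an AtomicLong whenever the edit log or the snapshot is written.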

 

 

FlowFile

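A FlowFile is a logical notion that correlates a piece of data with a set of attributes about that data. These attributes include the FlowFile's unique identifier, as well as its name, size, and any number of other flow-specific values. While the contents and attributes of a FlowFile can change, the FlowFile object itself is immutable. Modifications to a FlowFile are made through the ProcessSession.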
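To make the "immutable object, changeable attributes" idea concrete, here is a minimal sketch. It is written in Python for brevity and is not NiFi's Java API; `FlowFileSketch` and `AttributeSessionSketch` are hypothetical names invented for this illustration. The point it shows is that a modification returns a new object and leaves the original untouched.

```python
from dataclasses import dataclass
from types import MappingProxyType
import uuid


@dataclass(frozen=True)
class FlowFileSketch:
    """Hypothetical stand-in for a FlowFile: read-only attributes plus a size."""
    attributes: MappingProxyType
    size: int = 0


class AttributeSessionSketch:
    """Hypothetical stand-in for the part of a session that edits attributes."""

    def create(self, filename):
        # Every FlowFile gets a unique identifier and a name.
        attrs = {"uuid": str(uuid.uuid4()), "filename": filename}
        return FlowFileSketch(MappingProxyType(attrs))

    def put_attribute(self, flowfile, key, value):
        # Build a new object; the one passed in is never mutated.
        updated = dict(flowfile.attributes)
        updated[key] = value
        return FlowFileSketch(MappingProxyType(updated), flowfile.size)
```

A caller keeps working with the returned object, for example `flowfile = session.put_attribute(flowfile, "mime.type", "text/csv")`.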

 

 

 

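Writing to the log first as data enters a component

Steps:

1. Check whether restored is true. If it is not, throw an IllegalStateException.

2. Obtain the shared lock (read lock).

3. Claim a partition that is not in use:

   - Increment an AtomicLong and take it modulo the number of partitions to get the partition index (AtomicLong % partitions.size).

   - Try to claim partition[partitionIndex] by obtaining its write lock. If that fails, go back to the previous sub-step and try again.

4. If no output stream exists yet for the FlowFile edit log, create one and write the SerDe class name and version.

5. Obtain a Transaction ID and write it to the log.

6. Write the update to the partition:

   - Serialize the record update.

   - If there are more records, write a TransactionContinue marker and serialize the next update; otherwise write a TransactionCommit marker.

7. Update the global record map so that it holds the latest version of each record.

8. Release the write lock on the claimed partition.

9. Release the shared read lock.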
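The steps above can be sketched roughly as follows. This is an in-memory Python illustration, not NiFi's Java implementation: `MiniWriteAheadLog`, `Partition`, and the tuple-based log entries are hypothetical, `itertools.count` stands in for the AtomicLong, `RuntimeError` stands in for IllegalStateException, and the shared read lock of steps 2 and 9 is left out because Python's standard library has no read/write lock.

```python
import itertools
import threading

TRANSACTION_CONTINUE = "CONTINUE"
TRANSACTION_COMMIT = "COMMIT"


class Partition:
    def __init__(self):
        self.lock = threading.Lock()  # stands in for the partition's write lock
        self.entries = []             # stands in for the edit-log output stream


class MiniWriteAheadLog:
    def __init__(self, partition_count=4):
        self.partitions = [Partition() for _ in range(partition_count)]
        self.partition_counter = itertools.count()  # stand-in for AtomicLong
        self.transaction_ids = itertools.count()    # transaction ID generator
        self.record_map = {}                        # latest version of each record
        self.restored = False

    def restore(self):
        self.restored = True

    def update(self, records):
        """records: list of (record_id, attributes) pairs."""
        if not self.restored:                       # step 1
            raise RuntimeError("restore() must be called before update()")
        while True:                                 # step 3: claim a free partition
            index = next(self.partition_counter) % len(self.partitions)
            partition = self.partitions[index]
            if partition.lock.acquire(blocking=False):
                break
        try:
            if not partition.entries:               # step 4: header on first use
                partition.entries.append(("HEADER", "DictSerDe", 1))
            tx_id = next(self.transaction_ids)      # step 5
            partition.entries.append(("TX", tx_id))
            for position, (record_id, attrs) in enumerate(records):  # step 6
                partition.entries.append(("UPDATE", record_id, dict(attrs)))
                is_last = position == len(records) - 1
                marker = TRANSACTION_COMMIT if is_last else TRANSACTION_CONTINUE
                partition.entries.append((marker,))
            for record_id, attrs in records:        # step 7
                self.record_map[record_id] = dict(attrs)
        finally:
            partition.lock.release()                # step 8
        return index, tx_id
```

Calling `update` before `restore` raises, which mirrors the guard in step 1; each successful call returns the partition index it used and the transaction ID it wrote.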

 

 

 

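Checkpointing the Write-Ahead Log (writing the log to a partial file)

1. Obtain the mutually exclusive lock (write lock) so that no partition can be updated.

2. Create the partial file.

3. Write the SerDe class name and version.

4. Write the current maximum Transaction ID.

5. Write the number of records in the global record map.

6. For each record, serialize the record.

7. Close the partial file's output stream.

8. Delete the current snapshot file.

9. Rename the partial file to the snapshot file.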
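The checkpoint steps can be sketched like this. It is a Python illustration under several assumptions, not NiFi's on-disk format: the file names `snapshot` and `snapshot.partial`, the line-oriented JSON layout, the `JsonSerDe` label, and the module-level `EXCLUSIVE_LOCK` (standing in for the write lock) are all invented for the example.

```python
import json
import os
import threading

EXCLUSIVE_LOCK = threading.Lock()  # hypothetical stand-in for the write lock


def checkpoint(directory, record_map, max_transaction_id,
               serde_name="JsonSerDe", serde_version=1):
    """Write record_map to a partial file, then swap it in as the snapshot."""
    snapshot = os.path.join(directory, "snapshot")
    partial = os.path.join(directory, "snapshot.partial")
    with EXCLUSIVE_LOCK:                                  # no updates meanwhile
        with open(partial, "w", encoding="utf-8") as out: # create partial file
            out.write(f"{serde_name}\n{serde_version}\n") # SerDe name, version
            out.write(f"{max_transaction_id}\n")          # max transaction ID
            out.write(f"{len(record_map)}\n")             # number of records
            for record_id, attrs in record_map.items():   # serialize each record
                out.write(json.dumps({"id": record_id, "attributes": attrs}))
                out.write("\n")
        # Leaving the inner with-block closes the output stream.
        if os.path.exists(snapshot):
            os.remove(snapshot)                           # delete old snapshot
        os.rename(partial, snapshot)                      # partial -> snapshot
    return snapshot
```

Writing to a separate partial file first means that a crash during the write leaves the previous snapshot intact.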

 

 

 

 

 

Processor

 

This is the most commonly used component in NiFi and tends to be the easiest place for newcomers to jump in. A Processor is a component that is responsible for bringing data into the system, pushing data out to other systems, or performing some sort of enrichment, extraction, transformation, splitting, merging, or routing logic for a particular piece of data.

 

Processor Node

The Processor Node is essentially a wrapper around a Processor and maintains state about the Processor itself. The Processor Node is responsible for maintaining, among other things, state about a Processor's positioning on the graph, the configured properties and settings of the Processor, its scheduled state, and the annotations that are used to describe the Processor.

 

Reporting Task

A Reporting Task is a NiFi extension point that is capable of reporting and analyzing NiFi's internal metrics in order to provide the information to external resources or report status information as bulletins that appear directly in the NiFi User Interface. Unlike a Processor, a Reporting Task does not have access to individual FlowFiles. Rather, a Reporting Task has access to all Provenance Events, bulletins, and the metrics shown for components on the graph, such as FlowFiles In, Bytes Read, and Bytes Written.

 

 

Controller Service

The Controller Service is a mechanism that allows state or resources to be shared across multiple components in the flow. The SSLContextService, for instance, allows a user to configure SSL information only once and then configure any number of resources to use that configuration. Other Controller Services are used to share configuration. For example, if a very large dataset needs to be loaded, it will generally make sense to use a Controller Service to load the dataset. This allows multiple Processors to make use of this dataset without having to load the dataset multiple times.
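The "load once, share with many Processors" idea might look like the sketch below. It is plain Python rather than NiFi's ControllerService interface, and `DatasetServiceSketch`, `LookupProcessorSketch`, and the `loader` callable are hypothetical names used only for illustration.

```python
import threading


class DatasetServiceSketch:
    """Hypothetical shared service: loads a dataset once, serves many callers."""

    def __init__(self, loader):
        self._loader = loader          # callable that performs the expensive load
        self._lock = threading.Lock()
        self._dataset = None
        self.load_count = 0            # exposed only to show the load happens once

    def get_dataset(self):
        with self._lock:
            if self._dataset is None:
                self._dataset = self._loader()
                self.load_count += 1
            return self._dataset


class LookupProcessorSketch:
    """Hypothetical processor that enriches a key using the shared dataset."""

    def __init__(self, service):
        self._service = service

    def enrich(self, key):
        return self._service.get_dataset().get(key)
```

Two processors constructed with the same service instance share one in-memory copy of the dataset, so the loader runs a single time no matter how many of them call `enrich`.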

 

 

Process Session

The ProcessSession (often referred to simply as a "session") provides Processors access to FlowFiles and provides transactional behavior across the tasks that are performed by a Processor. The session provides get() methods for obtaining access to FlowFiles that are queued up for a Processor, methods to read from and write to the contents of a FlowFile, add and remove FlowFiles from the flow, add and remove attributes from a FlowFile, and route a FlowFile to a particular relationship. Additionally, the session provides access to the ProvenanceReporter that is used by Processors to emit Provenance Events.

 

Once a Processor is finished performing its task, the Processor has the ability to either commit or rollback the session. If a Processor rolls back the session, the FlowFiles that were accessed during that session will all be reverted to their previous states. Any FlowFile that was added to the flow will be destroyed. Any FlowFile that was removed from the flow will be re-queued in the same queue that it was pulled from. Any FlowFile that was modified will have both its contents and attributes reverted to their previous values, and the FlowFiles will all be re-queued into the FlowFile Queue that they were pulled from. Additionally, any Provenance Events will be discarded.
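The commit and rollback behavior described above can be illustrated with a small sketch. It is Python, not the real ProcessSession API: `TransactionalSessionSketch` is a hypothetical class, FlowFiles are reduced to attribute dictionaries, and content handling is omitted entirely.

```python
from collections import deque


class TransactionalSessionSketch:
    def __init__(self, queue):
        self._queue = queue      # deque of attribute dicts waiting for the processor
        self._pulled = []        # untouched originals, kept so rollback can re-queue
        self._outgoing = []      # (relationship, flowfile) pairs awaiting commit
        self._events = []        # provenance events awaiting commit

    def get(self):
        if not self._queue:
            return None
        original = self._queue.popleft()
        self._pulled.append(original)
        return dict(original)    # the processor works on a copy

    def put_attribute(self, flowfile, key, value):
        updated = dict(flowfile)
        updated[key] = value
        self._events.append(("ATTRIBUTES_MODIFIED", key))
        return updated

    def transfer(self, flowfile, relationship):
        self._outgoing.append((relationship, flowfile))

    def commit(self, destinations):
        # Only now do FlowFiles reach downstream queues and events become visible.
        for relationship, flowfile in self._outgoing:
            destinations[relationship].append(flowfile)
        committed_events = list(self._events)
        self._reset()
        return committed_events

    def rollback(self):
        # Put the originals back at the front of the queue, in their old order,
        # and discard every pending change and provenance event.
        self._queue.extendleft(reversed(self._pulled))
        self._reset()

    def _reset(self):
        self._pulled, self._outgoing, self._events = [], [], []
```

After `rollback()` the source queue looks exactly as it did before `get()` was called; after `commit()` the modified FlowFile appears in the destination queue for its relationship.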

 

 

 

 

 

FlowFile Repository

The FlowFile Repository is responsible for storing the FlowFiles' attributes and state, such as creation time and which FlowFile Queue the FlowFile belongs in. The default implementation is the WriteAheadFlowFileRepository, which persists the information to a write-ahead log that is periodically "checkpointed". This allows extremely high transaction rates, as the files that it writes to are "append-only," so the OutputStreams are able to be kept open. Periodically, the repository will checkpoint, meaning that it will begin writing to new write-ahead logs, write out the state of all FlowFiles at that point in time, and delete the old write-ahead logs. This prevents the write-ahead logs from growing indefinitely.

 

Content Repository

The Content Repository is responsible for storing the content of FlowFiles and providing mechanisms for reading the contents of a FlowFile. This abstraction allows the contents of FlowFiles to be stored independently and efficiently based on the underlying storage mechanism. The default implementation is the FileSystemRepository, which persists all data to the underlying file system.

Note: While the Content Repository is pluggable, it is considered a 'private API' and its interface could potentially be changed between minor versions of NiFi. It is, therefore, not recommended that implementations be developed outside of the NiFi codebase.

Provenance Repository

The Provenance Repository is responsible for storing, retrieving, and querying all Data Provenance Events. Each time that a FlowFile is received, routed, cloned, forked, modified, sent, or dropped, a Provenance Event is generated that details this information. The event contains information about what the Event Type was, which FlowFile(s) were involved, the FlowFile's attributes at the time of the event, details about the event, and a pointer to the Content of the FlowFile before and after the event occurred (which allows a user to understand how that particular event modified the FlowFile).

The Provenance Repository allows this information to be stored about each FlowFile as it traverses through the system and provides a mechanism for assembling a "Lineage view" of a FlowFile, so that a graphical representation can be shown of exactly how the FlowFile was handled. In order to determine which lineages to view, the repository exposes a mechanism whereby a user is able to search the events and associated FlowFile attributes.

The default implementation is PersistentProvenanceRepository. This repository stores all data immediately to a disk-backed write-ahead log and periodically "rolls over" the data, indexing and compressing the data. The search capabilities are provided by an embedded Lucene engine. For more information on how this repository is designed and implemented, see the Persistent Provenance Repository Design page.

Note: While the Provenance Repository is pluggable, it is considered a 'private API' and its interface could potentially be changed between minor versions of NiFi. It is, therefore, not recommended that implementations be developed outside of the NiFi codebase.

 

Process Scheduler

In order for a Processor or a Reporting Task to be invoked, it needs to be scheduled to do so. This responsibility belongs to the Process Scheduler. In addition to scheduling Processors and Reporting Tasks, the scheduler is also responsible for scheduling framework tasks to run at periodic intervals and maintaining the schedule state of each component, as well as the current number of active threads. The Process Scheduler is able to inspect a particular component in order to determine which Scheduling Strategy to use (Cron Driven, Timer Driven, or Event Driven), as well as the scheduling frequency.

 

FlowFileQueue

Though it sounds sufficiently simple, the FlowFile Queue is responsible for implementing quite a bit of logic. In addition to queuing the FlowFiles for another component to pull from, the FlowFile Queue must also be able to prioritize the data following the user's prioritization rules. The queue keeps state about the number of FlowFiles as well as the data size of those FlowFiles and must keep state about the number of "in-flight" FlowFiles, that is, those that have been pulled from the queue but have not yet been removed from the system or transferred to another queue.

When an instance of NiFi has a very large number of active FlowFiles, the attributes associated with those FlowFiles can be quite a burden on the JVM's heap. To alleviate this problem, the framework may choose to "swap out" some of the FlowFiles, writing the attributes to disk and removing them from the JVM's heap when a queue grows very large and later "swap in" those FlowFiles. During this process, the FlowFile Queue is also responsible for keeping track of the number of FlowFiles and size of the FlowFiles' contents so that accurate numbers can be reported to users.

Finally, the FlowFile Queue is also responsible for maintaining state about backpressure and FlowFile Expiration. Backpressure is the mechanism by which a user is able to configure the flow to temporarily stop scheduling a given component to run when its output queue is too full. By doing this, we are able to cause the flow to stop accepting incoming data for a short period, or to route data differently. This provides us the ability to prevent resource exhaustion. In a clustered environment, this also allows a particular node that is falling behind for one reason or another to avoid ingesting data so that other nodes in the cluster that are more capable can handle the workload.

FlowFile Expiration is the mechanism by which data is eventually purged from the system because it is no longer of value. It can be thought of as a flow's pressure release valve. This is used, for instance, in an environment when there is not enough bandwidth to send all of the data to its desired destination. In such a case, the node will eventually run out of resources. In order to avoid this, the user can configure a queue to expire data that reaches a certain age. For example, the user can indicate that data that is one hour old should automatically be purged. This capability is then coupled with the ability to prioritize the data so that the most important data is always sent first and the less important data eventually expires.
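A toy queue can show how backpressure, expiration, and the in-flight count fit together. This is a Python sketch with hypothetical names (`QueueSketch`, `may_schedule`, `backpressure_count`, `expiration_seconds`); time is passed in explicitly as `now` to keep the example deterministic, and swapping and prioritization are left out.

```python
from collections import deque


class QueueSketch:
    def __init__(self, backpressure_count, expiration_seconds):
        self.backpressure_count = backpressure_count
        self.expiration_seconds = expiration_seconds
        self._items = deque()   # (enqueue_time, flowfile), oldest first
        self.in_flight = 0      # pulled but not yet acknowledged

    def is_full(self):
        return len(self._items) >= self.backpressure_count

    def put(self, flowfile, now):
        self._items.append((now, flowfile))

    def poll(self, now):
        # Expiration: purge anything that has reached the configured age.
        while self._items and now - self._items[0][0] >= self.expiration_seconds:
            self._items.popleft()
        if not self._items:
            return None
        self.in_flight += 1
        return self._items.popleft()[1]

    def acknowledge(self):
        # Called once the FlowFile was transferred elsewhere or removed.
        self.in_flight -= 1


def may_schedule(output_queue):
    """Backpressure: do not run a component whose output queue is full."""
    return not output_queue.is_full()
```

With `backpressure_count=2`, the upstream component stops being scheduled once two FlowFiles are waiting; with `expiration_seconds=3600`, a FlowFile that has waited an hour is dropped the next time the queue is polled.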

 

FlowFilePrioritizer (FlowFile algorithm)

A core tenet of NiFi is that of data prioritization. The user should have the ability to prioritize the data in whatever order makes sense for a particular dataflow at a particular point in time. This is especially important for time-sensitive data and becomes even more important when processing data in an environment where the rate of data acquisition exceeds the rate at which the data can be egressed. In such an environment, it is important to be able to egress the data in such a way that the most important data is sent first, leaving the less important data to be sent when the bandwidth allows for it, or eventually be aged off.

This tenet is realized through the use of FlowFile Prioritizers. The FlowFile Queue is responsible for ensuring the data is ordered in the way that the user has chosen. This is accomplished by applying FlowFile Prioritizers to the queue. A FlowFile Prioritizer has access to all FlowFile information but does not have access to the data that it points to. The Prioritizer can then compare two FlowFiles in order to determine which should be made available first.

The Prioritizer is an extension point, and its API will not change from one minor release of NiFi to another but may change with a new major release of NiFi.
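Since a Prioritizer is essentially a comparison of two FlowFiles, it can be sketched as a comparator function. The example below is Python, not NiFi's FlowFilePrioritizer interface; the function names and the `priority` and `entry_time` keys are assumptions made for illustration. Each comparator sees only attribute-level information, never the content, and several comparators can be chained so that later ones break ties.

```python
from functools import cmp_to_key


def priority_attribute_prioritizer(a, b):
    """Lower 'priority' number goes first; a missing value defaults to 10."""
    pa = int(a.get("priority", 10))
    pb = int(b.get("priority", 10))
    return (pa > pb) - (pa < pb)


def oldest_first_prioritizer(a, b):
    """Earlier 'entry_time' goes first."""
    return (a["entry_time"] > b["entry_time"]) - (a["entry_time"] < b["entry_time"])


def chain(*prioritizers):
    """Apply prioritizers in order; the first non-zero result decides."""
    def compare(a, b):
        for prioritizer in prioritizers:
            result = prioritizer(a, b)
            if result != 0:
                return result
        return 0
    return compare


def order_queue(flowfiles, *prioritizers):
    # sorted() is stable, so FlowFiles that compare equal keep arrival order.
    return sorted(flowfiles, key=cmp_to_key(chain(*prioritizers)))
```

For example, `order_queue(flowfiles, priority_attribute_prioritizer, oldest_first_prioritizer)` returns the highest-priority FlowFiles first and, within the same priority, the ones that entered earliest.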

 

 
