
by Ljubica Lazarevic

How to embrace event-driven graph analytics using Neo4j and Apache Kafka

Introduction

With the new Neo4j Kafka streams now available, my fellow Neo4j colleague Tom Geudens and I were keen to try it out. We have many use-cases in mind that leverage the power of graph databases and event-driven architectures. The first one we explore combines the power of Graph Algorithms with a transactional database.

The new Neo4j Kafka streams library is a Neo4j plugin that you can add to each of your Neo4j instances. It enables three types of Apache Kafka mechanisms:

  • Producer: based on the topics set up in the Neo4j configuration file. Outputs to said topics will happen when specified node or relationship types change

  • Consumer: based on the topics set up in the Neo4j configuration file. When events for said topics are picked up, the specified Cypher query for each topic will be executed

  • Procedure: a direct call in Cypher to publish a given payload to a specified topic

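As a rough sketch of what these three mechanisms look like in practice (the topic, label, and property names here are hypothetical, not from this article's example), the producer and consumer are configured in neo4j.conf:

# Producer: publish changes to Person nodes to the 'people' topic
streams.source.topic.nodes.people=Person{*}

# Consumer: run a Cypher statement for every event arriving on 'people'
streams.sink.enabled=true
streams.sink.topic.cypher.people=MERGE (p:Person {id: event.id}) SET p.name = event.name

And the procedure is called directly from Cypher:

CALL streams.publish('people', {id: 42, name: 'Alice'})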

You can get a more detailed overview of what each of these looks like here.

Overview of the situation

Graph algorithms provide powerful analytical abilities. They help us understand the context of our data better by analysing relationships. For example, graph algorithms are used to:

  • Understand network dependencies

  • Detect communities

  • Identify influencers

  • Calculate recommendations

  • And so forth.

Neo4j offers a set of graph algorithms out of the box via a plugin that can run directly on data within Neo4j. This library of algorithms has been very well received. Many times I’ve received feedback that the plugins are as fast as or faster than what clients have used before. With such wonderful feedback, why wouldn’t we want to apply these optimised and performant algorithms to a Neo4j database?
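
For instance, once the library is installed, an algorithm can be run and its results written back with a single Cypher call. A minimal sketch, with illustrative label, relationship, and parameter choices:

CALL algo.pageRank('Person', 'ACTED_IN',
  {iterations: 20, dampingFactor: 0.85, write: true, writeProperty: 'pagerank'})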

Getting the full advantage of any analytical process requires resources. For a nice, performant experience, we want to provide as much CPU and memory as we can afford.

Now, we could run this kind of work on our transactional cluster. But in this typical architecture, we’re going to run into some challenges. For example, if one machine is big, the other machines in the cluster need to match it. This could make the resulting architecture expensive.

The other challenge we face is that our cluster is supposed to be managing transactions — day-to-day queries such as processing requests. We don’t want to weigh it down with crunching through various iterations and permutations of a model. Ideally, we want to offload this along with associated analytical work.

If we know that the heavy querying that is going to take place is read-only, then it’s an easy solution. We can spin up read replicas to manage the load. This keeps the cluster focussed on what it’s supposed to be doing, supporting an operational, transactional system.

But how do we handle write backs to the operational graph as part of the analytical processing? We want those results, such as recommendations, as soon as they are available.

Read replicas are, as the name suggests, for read-only applications. They are involved neither in cluster leader elections nor in writes. Using Neo4j-streams, we can stream the results from the read replica back to the cluster for consumption.

The big advantages of approaching it this way include:

  • We retain the high availability/disaster recovery afforded to us by the cluster.

  • The data is going to be identical on both the read replica and the cluster. We don’t have to worry about updating the read replica because the cluster is going to take care of that for us.

  • The ids for nodes and relationships will be identical on both the servers of the cluster and the read replica. This makes updating really easy.

  • We can provision resources as necessary to the read replica, which is likely to be very different from the cluster.

Our architecture will look as follows: A is our read replica, and B is our causal cluster. A will receive transactional information from B. Any results calculated by A will be streamed back to B via Kafka messages.

So with our updated pattern, let’s continue with our simple example.

The Example Data Set

We’re going to use the Movie Database data set available from the :play movie-guide guide in Neo4j Browser. For this example we are going to use four Neo4j instances:

  • The analytics instance — this is going to be our read replica, and on this instance we’re going to run PageRank on all Person nodes in the data set. We will call the streams.publish() procedure to post the output to our Kafka topic.

  • The operational instances — this is going to be our three-server causal cluster, which is going to be listening for any changes to the Person nodes. We will update as changes come in.

For Kafka, we’ll follow the instructions from the quick start guide up until step 2. Before we get Kafka up and running, we will need to set up the consumer elements in the Neo4j configuration files. We also will set up the cluster itself. Please note that at the moment neo4j-streams only works with Neo4j version 3.4.x.

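For reference, step 1 of the quick start amounts to downloading and unpacking Kafka (the version shown is illustrative); we hold off on step 2, starting the servers, until the configuration described below is in place:

tar -xzf kafka_2.11-2.0.0.tgz
cd kafka_2.11-2.0.0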

To set up the three-server cluster and a read replica, we will follow the instructions provided in the Neo4j operations manual. Follow the instructions for the cores, and also for one read replica.

Additionally, we’re going to need to add the following to neo4j.conf for the causal cluster servers:

#************
# Kafka Config - Consumer
#************
kafka.zookeeper.connect=localhost:2181
kafka.bootstrap.servers=localhost:9092
kafka.group.id=neo4j-core1
streams.sink.enabled=true
streams.sink.topic.cypher.neorr=WITH event.payload as payload MATCH (p:Person) WHERE ID(p)=payload[0] SET p.pagerank = payload[1]

Note that for the other two core servers we change kafka.group.id to neo4j-core2 and neo4j-core3 respectively.

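To make that sink query concrete: each message published to neorr will carry a two-element payload of node id and PageRank score, exposed to the query as event.payload. With illustrative values, a message and its effect would look like this:

{"payload": [123, 1.85]}

MATCH (p:Person) WHERE ID(p) = 123 SET p.pagerank = 1.85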

For the read replica, we’ll need to add the following to neo4j.conf:

#************
# Kafka Config - Procedure
#************
kafka.zookeeper.connect=localhost:2181
kafka.bootstrap.servers=localhost:9092
kafka.group.id=neo4j-read1

You will need to download and save the neo4j-streams jar into the plugins folder. You also need to add the graph algorithms library, via Neo4j Desktop or manually.

With these changes made to the respective config files and the plugins installed, we will start everything up in the following order (a shell sketch follows the list):

  • Apache Zookeeper

  • Apache Kafka

  • The three instances for the Neo4j causal cluster

  • The read replica

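In shell terms, assuming everything runs locally and using illustrative installation paths, the sequence might look like:

# From the Kafka installation directory
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties

# From each Neo4j installation: the three cores, then the read replica
core1/bin/neo4j start
core2/bin/neo4j start
core3/bin/neo4j start
readreplica1/bin/neo4j start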

Once all of the Neo4j instances are up and running and the cluster has discovered all of the members, we can now run the following query on the read replica:

CALL algo.pageRank.stream(
  'MATCH (p:Person) RETURN id(p) AS id',
  'MATCH (p1:Person)-->()<--(p2:Person) RETURN distinct id(p1) AS source, id(p2) AS target',
  {graph:'cypher'})
YIELD nodeId, score
WITH [nodeId, score] AS res
CALL streams.publish('neorr', res)
RETURN COUNT(*)

This Cypher query will call the PageRank algorithm with the specified configuration. Once the algorithm is complete, we will stream the returned node ids and PageRank scores to the specified topic.

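Once the cluster has consumed the messages, a quick sanity check on any core server might look like this (query illustrative):

MATCH (p:Person) WHERE exists(p.pagerank)
RETURN p.name, p.pagerank ORDER BY p.pagerank DESC LIMIT 10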

We can have a look at what the neorr topic looks like by running Step 5 of the Apache Kafka quick start guide (replacing test with neorr):

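Assuming a local broker on the default port, that console consumer invocation would be:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic neorr --from-beginning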

Summary

In this post we’ve demonstrated:

  • Separating transactional and analytical data concerns

  • Painlessly flowing analytical results back for real-time consumption

Whilst we’ve used a simple example, you can see how complex analytical work can be carried out, supporting an event-driven architecture.

Translated from: https://www.freecodecamp.org/news/how-to-embrace-event-driven-graph-analytics-using-neo4j-and-apache-kafka-474c9f405e06/
