by Andrea Santurbano

How to leverage Neo4j Streams and build a just-in-time data warehouse

In this article, we’ll show how to create a Just-In-Time Data Warehouse by using Neo4j and the Neo4j Streams module with Apache Spark’s Structured Streaming APIs and Apache Kafka.

To show how to integrate them, and to let you test the whole project by hand, I’ll use Apache Zeppelin, a notebook runner that can interact natively with Neo4j.

Leveraging Neo4j Streams

The Neo4j Streams project is composed of three main pillars:

  • The Change Data Capture (the subject of this first article) that allows us to stream database changes over Kafka topics

  • The Sink that allows consuming data streams from the Kafka topic


  • A set of procedures that allows us to Produce/Consume data to/from Kafka Topics

What is a Change Data Capture?

It’s a system that automatically captures changes from a source system (a Database, for instance) and automatically provides these changes to downstream systems for a variety of use cases.


CDC typically forms part of an ETL pipeline. This is an important component for ensuring Data Warehouses (DWH) are kept up to date with any record changes.

Traditionally, CDC applications work off transaction logs, which allows us to replicate a database without much impact on the performance of its normal operation.


How does the Neo4j Streams CDC module deal with database changes?

Every transaction inside Neo4j gets captured and transformed in order to stream an atomic element of the transaction.


Let’s suppose we have a simple creation of two nodes and one relationship between them:


CREATE (andrea:Person{name:"Andrea"})-[knows:KNOWS{since:2014}]->(michael:Person{name:"Michael"})

The CDC module will transform this transaction into three events (two node creations and one relationship creation).


The Event structure was inspired by the Debezium format and has the following general structure:


{
  "meta": { /* transaction meta-data */ },
  "payload": { /* the data related to the transaction */
    "before": { /* the data before the transaction */ },
    "after": { /* the data after the transaction */ }
  }
}

Node source (andrea):


{
  "meta": {
    "timestamp": 1532597182604,
    "username": "neo4j",
    "tx_id": 1,
    "tx_event_id": 0,
    "tx_events_count": 3,
    "operation": "created",
    "source": { "hostname": "" }
  },
  "payload": {
    "id": "1004",
    "type": "node",
    "after": {
      "labels": ["Person"],
      "properties": { "name": "Andrea" }
    }
  }
}

Node target (michael):


{
  "meta": {
    "timestamp": 1532597182604,
    "username": "neo4j",
    "tx_id": 1,
    "tx_event_id": 1,
    "tx_events_count": 3,
    "operation": "created",
    "source": { "hostname": "" }
  },
  "payload": {
    "id": "1006",
    "type": "node",
    "after": {
      "labels": ["Person"],
      "properties": { "name": "Michael" }
    }
  }
}

Relationship knows:


{
  "meta": {
    "timestamp": 1532597182604,
    "username": "neo4j",
    "tx_id": 1,
    "tx_event_id": 2,
    "tx_events_count": 3,
    "operation": "created",
    "source": { "hostname": "" }
  },
  "payload": {
    "id": "1007",
    "type": "relationship",
    "label": "KNOWS",
    "start": { "labels": ["Person"], "id": "1004" },
    "end": { "labels": ["Person"], "id": "1006" },
    "after": {
      "properties": { "since": 2014 }
    }
  }
}
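To make the relationship between one transaction and its events more concrete, here is a minimal sketch in plain Scala. The types and the `toEvents` helper are hypothetical and heavily simplified; they are not the classes used by the Neo4j Streams module, and only mirror a few of the fields shown in the JSON above.

```scala
// Hypothetical, simplified model of the CDC event envelope.
// These types are illustrative only; they are not the module's real classes.
object CdcEventSketch {
  final case class Meta(txId: Long, txEventId: Int, txEventsCount: Int, operation: String)
  final case class Payload(id: String, kind: String, properties: Map[String, Any])
  final case class Event(meta: Meta, payload: Payload)

  // Emit one event per created entity; every event carries its position
  // in the transaction and the total number of events in that transaction.
  def toEvents(txId: Long, created: Seq[Payload]): Seq[Event] =
    created.zipWithIndex.map { case (payload, index) =>
      Event(Meta(txId, index, created.size, "created"), payload)
    }

  // The two nodes and the relationship from the CREATE statement above.
  val events: Seq[Event] = toEvents(1L, Seq(
    Payload("1004", "node", Map("name" -> "Andrea")),
    Payload("1006", "node", Map("name" -> "Michael")),
    Payload("1007", "relationship", Map("since" -> 2014))
  ))
}
```

With `tx_id`, `tx_event_id` and `tx_events_count` on every event, a consumer can in principle tell when it has received all the events belonging to one transaction.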

By default, all the data will be streamed on the neo4j topic. The CDC module allows controlling which nodes are sent to Kafka, and which of their properties you want to send to the topic:


With the following example:


streams.source.topic.nodes.products=Product{name, code}

The CDC module will send all the nodes that have the label Product to the products topic, and it will send only the changes to the name and code properties. Please refer to the official documentation for a full description of how label filtering works.
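The effect of such a pattern can be sketched in plain Scala. This is a hypothetical, simplified re-implementation written for illustration: it handles only the `Label{prop, prop}` form used above and is not the module's actual parser, which supports a richer syntax described in the official documentation.

```scala
// Hypothetical sketch of `Label{prop, ...}` filtering.
// NOT the Neo4j Streams parser; it covers only the simple form used in this article.
object NodeFilterSketch {
  final case class NodePattern(label: String, properties: Set[String])

  private val Pattern = """(\w+)\{([^}]*)\}""".r

  // Parse "Product{name, code}" into a label and a set of property names.
  def parse(pattern: String): Option[NodePattern] = pattern.trim match {
    case Pattern(label, props) =>
      Some(NodePattern(label, props.split(",").map(_.trim).filter(_.nonEmpty).toSet))
    case _ => None
  }

  // Return the properties to publish, or None when the node lacks the label.
  def project(pattern: NodePattern,
              labels: Set[String],
              props: Map[String, Any]): Option[Map[String, Any]] =
    if (labels.contains(pattern.label))
      Some(props.filter { case (key, _) => pattern.properties.contains(key) })
    else None
}
```

For example, a Product node with name, code and price properties would be published with only name and code, while a Person node would not be sent to the products topic at all.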

For a more in-depth description of the Neo4j Streams project and how and why we at LARUS and Neo4j built it, check out this article.

Beyond the traditional Data Warehouse

A traditional DWH requires data teams to constantly build multiple costly and time-consuming Extract Transform Load (ETL) pipelines to ultimately derive business insights.


One of the biggest pain points is that Enterprise Data Warehouses are inherently rigid and difficult to change. That’s because:

  • they are based on the Schema-On-Write architecture: first, you define your schema, then you write your data, then you read your data and it comes back in the schema you defined up-front

  • they are based on (expensive) batched/scheduled jobs

This results in having to build costly and time-consuming ETL pipelines to access and manipulate the data. And as new data types and sources are introduced, the need to augment your ETL pipelines exacerbates the problem.

Thanks to the combination of the stream data processing with the Neo4j Streams CDC module and the Schema-On-Read approach provided by Apache Spark, we can overcome this rigidity and build a new kind of (flexible) DWH.

A paradigm shift: Just-In-Time Data Warehouse

A JIT-DWH solution is designed to easily handle a wider variety of data from different sources, and it starts from a different approach to handling and managing data: Schema-On-Read.

Schema-On-Read

Schema-On-Read follows a different sequence: it just loads the data as-is and applies your own lens to the data when you read it back out. With this kind of approach, you can present data in a schema that is adapted best to the queries being issued. You’re not stuck with a one-size-fits-all schema. With schema-on-read, you can present the data back in a schema that is most relevant to the task at hand.
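The idea can be illustrated with a few lines of plain Scala. This is only a sketch under simplifying assumptions: records are modelled as string maps, the `inferSchema` and `read` helpers are hypothetical names, and the sample values in the usage note are made up; Spark's own schema inference is considerably more sophisticated.

```scala
// Illustrative only: records are stored as-is and a schema is derived at read time.
object SchemaOnReadSketch {
  type Record = Map[String, String]

  // The "schema" is simply every field name seen across the stored records.
  def inferSchema(records: Seq[Record]): Seq[String] =
    records.flatMap(_.keys).distinct.sorted

  // Project every record onto the inferred schema; missing fields become None.
  def read(records: Seq[Record]): Seq[Seq[Option[String]]] = {
    val schema = inferSchema(records)
    records.map(record => schema.map(field => record.get(field)))
  }
}
```

If one stored record has only a name and a later one also has a birth field, the schema inferred at read time contains both fields, and the older record simply has no value for birth. Nothing had to be declared before the data was written.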

Set Up the Environment

In the following GitHub repo you’ll find everything you need to replicate what I’m presenting in this article. All you need to get started is Docker. You can then spin up the stack by entering the directory and executing the following command from the terminal:

$ docker-compose up

This will start up the whole environment, which comprises:


  • Neo4j + Neo4j Streams module + APOC procedures

  • Apache Kafka

  • Apache Spark

  • Apache Zeppelin


By going into Apache Zeppelin @ http://localhost:8080 you’ll find in the directory Medium/Part 1 two notebooks:

  • Create a Just-In-Time Data Warehouse: in this notebook, we will build the JIT-DWH

  • Query The JIT-DWH: in this notebook, we will perform some queries over the JIT-DWH

The Use-Case

We’ll create a fake, social-network-like dataset. This will activate the CDC module of Neo4j Streams, and via Apache Spark we’ll intercept these events and persist them on the file system as JSON.

Then we’ll demonstrate how new fields added to our nodes are automatically added to our JIT-DWH without modifying the ETL pipeline, thanks to the Schema-On-Read approach.


We’ll execute the following steps:


  1. Create the fake data set

  2. Build our data pipeline that intercepts the Kafka events published by the Neo4j Streams CDC module

  3. Make the first query over our JIT-DWH on Spark

  4. Add a new field in our graph model

  5. Show how the new field is automatically exposed in real time thanks to the Neo4j Streams CDC module (without the need for changes over our ETL pipeline thanks to the Schema-On-Read approach).

Notebook 1: Create a Just-In-Time Data Warehouse

We’ll create a fake social network by using the APOC apoc.periodic.repeat procedure that executes this query every 15 seconds:

WITH ["M", "F", ""] AS gender
UNWIND range(1, 10) AS id
CREATE (p:Person {id: apoc.create.uuid(), name: "Name-" + apoc.text.random(10), age: round(rand() * 100), index: id, gender: gender[toInteger(size(gender) * rand())]})
WITH collect(p) AS people
UNWIND people AS p1
UNWIND range(1, 3) AS friend
WITH p1, people[(p1.index + friend) % size(people)] AS p2
CREATE (p1)-[:KNOWS{years: round(rand() * 10), engaged: (rand() > 0.5)}]->(p2)

If you need more details about the APOC project, please follow this link.


So the resulting graph model is quite straightforward:


Let’s create an index over the Person node:


%neo4j
CREATE INDEX ON :Person(id)

Now let’s set the Background Job in Neo4j:


%neo4j
CALL apoc.periodic.repeat('create-fake-social-data', 'WITH ["M", "F", "X"] AS gender UNWIND range(1, 10) AS id CREATE (p:Person {id: apoc.create.uuid(), name: "Name-" + apoc.text.random(10), age: round(rand() * 100), index: id, gender: gender[toInteger(size(gender) * rand())]}) WITH collect(p) AS people UNWIND people AS p1 UNWIND range(1, 3) AS friend WITH p1, people[(p1.index + friend) % size(people)] AS p2 CREATE (p1)-[:KNOWS{years: round(rand() * 10), engaged: (rand() > 0.5)}]->(p2)', 15) YIELD name
RETURN name AS created

This background query causes the Neo4j Streams CDC module to stream the related events over the “neo4j” Kafka topic (the default topic of the CDC).

Now let’s create a Structured Streaming Dataset that consumes the data from the “neo4j” topic:

val kafkaStreamingDF = (spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9093")
    .option("startingoffsets", "earliest")
    .option("subscribe", "neo4j")
    .load())

The kafkaStreamingDF DataFrame is basically a representation of a Kafka ProducerRecord; in fact, its schema is:

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)

Now let’s create the Structure of the data streamed by the CDC using the Spark APIs in order to read the streamed data:

import org.apache.spark.sql.types._

// Transaction metadata
val cdcMetaSchema = (new StructType()
    .add("timestamp", LongType)
    .add("username", StringType)
    .add("operation", StringType)
    .add("source", MapType(StringType, StringType, true)))

// State of the entity before or after the transaction
val cdcPayloadSchemaBeforeAfter = (new StructType()
    .add("labels", ArrayType(StringType, false))
    .add("properties", MapType(StringType, StringType, true)))

// Payload shared by node and relationship events
val cdcPayloadSchema = (new StructType()
    .add("id", StringType)
    .add("type", StringType)
    .add("label", StringType)
    .add("start", MapType(StringType, StringType, true))
    .add("end", MapType(StringType, StringType, true))
    .add("before", cdcPayloadSchemaBeforeAfter)
    .add("after", cdcPayloadSchemaBeforeAfter))

// Full event envelope
val cdcSchema = (new StructType()
    .add("meta", cdcMetaSchema)
    .add("payload", cdcPayloadSchema))

The cdcSchema is suitable for both node and relationship events.


What we need now is to extract only the CDC event from the Dataframe, so let’s perform a simple transformation query over Spark:


val cdcDataFrame = (kafkaStreamingDF
    .selectExpr("CAST(value AS STRING) AS VALUE")
    .select(from_json('VALUE, cdcSchema) as 'JSON))

The cdcDataFrame contains just one column JSON which is the data streamed from the Neo4j-Streams CDC module.

Let’s perform a simple ETL query in order to extract fields of interest:


val dataWarehouseDataFrame = (cdcDataFrame
    // keep only node events that carry the Person label
    .where("json.payload.type = 'node' and (array_contains(nvl(json.payload.after.labels, json.payload.before.labels), 'Person'))")
    .selectExpr("json.payload.id AS neo_id",
        "CAST(json.meta.timestamp / 1000 AS Timestamp) AS timestamp",
        "json.meta.source.hostname AS host",
        "json.meta.operation AS operation",
        "nvl(json.payload.after.labels, json.payload.before.labels) AS labels",
        // one output row per property, exposed as the key and value columns
        "explode(json.payload.after.properties)"))

This query is quite important, because it determines how the data will be persisted on the file system. Every node is exploded into several JSON snippets, one per node property.
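As a rough illustration of that shape, the following plain-Scala sketch mimics what exploding a properties map does to a single node event. The `NodeEvent` and `Row` types are hypothetical stand-ins for the Spark rows, with the timestamp and host columns left out for brevity.

```scala
// Illustrative stand-in for exploding a properties map column:
// one output row per property, with the node-level columns repeated on each row.
object ExplodeSketch {
  final case class NodeEvent(neoId: String,
                             operation: String,
                             labels: Seq[String],
                             properties: Map[String, String])

  final case class Row(neoId: String,
                       operation: String,
                       labels: Seq[String],
                       key: String,
                       value: String)

  def explode(event: NodeEvent): Seq[Row] =
    event.properties.toSeq.map { case (key, value) =>
      Row(event.neoId, event.operation, event.labels, key, value)
    }
}
```

A Person node with a name and an age therefore becomes two rows that share the same neo_id, operation and labels, and differ only in their key and value.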


This kind of structure can be easily turned into tabular representation (we’ll see in the next few steps how to do this).


Now let's write a Spark continuous streaming query that saves the data to the file system as JSON:


val writeOnDisk = (dataWarehouseDataFrame
    .writeStream
    .format("json")
    .option("checkpointLocation", "/zeppelin/spark-warehouse/jit-dwh/checkpoint")
    .option("path", "/zeppelin/spark-warehouse/jit-dwh")
    .queryName("nodes")
    .start())

We have now created a simple JIT-DWH. In the second notebook we’ll learn how to query it, and how simple it is to deal with dynamic changes in the data structures thanks to schema-on-read.

Notebook 2: Query The JIT-DWH

The first paragraph lets us query and display our JIT-DWH:


val flattenedDF = (spark.read.format("json").load("/zeppelin/spark-warehouse/jit-dwh/**")
    .where("neo_id is not null")
    .groupBy("neo_id", "timestamp", "host", "labels", "operation")
    // turn every distinct property key into its own column
    .pivot("key")
    .agg(first($"value")))

Remember how we saved the data as JSON a few steps earlier? The flattenedDF simply pivots the JSON rows over the key field, grouping the data by the five columns that together represent the “unique key” (“neo_id”, “timestamp”, “host”, “labels”, “operation”). This gives us a tabular representation of the source data.
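The pivot step can be pictured with plain Scala collections. This is a simplified sketch rather than what Spark does internally: the hypothetical `pivot` helper groups by neo_id only (instead of the five grouping columns), and the sample values in the usage note are made up.

```scala
// Simplified picture of groupBy(...).pivot("key").agg(first("value")).
object PivotSketch {
  final case class Row(neoId: String, key: String, value: String)

  // Returns the column names and, for each node id, one optional value per column.
  def pivot(rows: Seq[Row]): (Seq[String], Map[String, Seq[Option[String]]]) = {
    // Every distinct key found in the data becomes a column.
    val columns = rows.map(_.key).distinct.sorted
    val table = rows.groupBy(_.neoId).map { case (id, group) =>
      // Keep the first value seen for each key, like agg(first(value)).
      val firstByKey = group.groupBy(_.key).map { case (key, rs) => key -> rs.head.value }
      id -> columns.map(column => firstByKey.get(column))
    }
    (columns, table)
  }
}
```

Because the columns are computed from the keys present in the data, a property that appears on only one node (for instance a birth field) shows up as an extra column without any change to the code; nodes that lack it simply have no value there.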

Now imagine that our Person dataset gets a new field: birth. Let's add this new field to one node; in this case, you must choose an id from your dataset and update it with the following paragraph:

Now the final step: reuse the same query and filter the DWH by the id that we have previously changed in order to check how our dataset changed according to the changes made over Neo4j.


Conclusions

In this first part, we learned how to leverage the events produced by the Neo4j Streams CDC module to build a simple (real-time) JIT-DWH that uses the Schema-On-Read approach.

In Part 2 we’ll discover how to use the Sink module in order to ingest data into Neo4j directly from Kafka.


If you have already tested the Neo4j Streams module, or have tried it via these notebooks, please fill out our feedback survey.


If you run into any issues or have thoughts about improving our work, please raise a GitHub issue.


