

Recently, I got the chance to do some R&D on a requirement to read files stored in a Cloud Storage bucket, process and transform them into the desired format, and store the result in an in-memory data store, Memorystore, for faster access. Honestly, it took several days to figure out the right approach and the right technologies to implement it.

One of the best services in Google Cloud Platform that I have worked and experimented with is Cloud Dataflow, a fully managed service for executing pipelines within the Google Cloud Platform ecosystem. It is dedicated to transforming and enriching data in stream (real-time) and batch (historical) modes. It takes a serverless approach, so users can focus on programming instead of managing server clusters, and it integrates with Operations (formerly Stackdriver), which lets you monitor and troubleshoot pipelines as they run.

Memorystore is Google's managed implementation of the Redis data store, offering low latency and high scalability. Caching is a technique used to accelerate application response times and help applications scale by placing frequently needed data very close to the application. Memorystore for Redis is a fully managed service powered by the Redis in-memory data store, used to build application caches that provide sub-millisecond data access.

Prerequisites

Before creating our Dataflow pipeline, we need to do three things:


  1. Create two GCS buckets


One bucket stores the input file(s), which are read, transformed and then stored in the Redis data store; the other bucket is used for staging the Dataflow pipeline code.


You can give any names to your buckets

If you are not familiar with the creation of buckets, refer to this GCS documentation.


2. Create a Redis Instance


A Memorystore for Redis instance is required to store the processed data after the Cloud Dataflow pipeline is executed.


You can give any name to your Redis instance (it won’t matter as we will be using the ip-address only)

The IP address of the instance (the Redis host) must be provided on the command line when executing the Dataflow pipeline. If you are not familiar with the creation of a Memorystore instance, refer to this Memorystore documentation.

3. Upload the input file to the GCS bucket


For the Dataflow pipeline to be executed, an input file must be uploaded to the input GCS bucket, which in our case is cloud-dataflow-input-bucket.


The input file is uploaded to the bucket beforehand

The input file contains pipe-separated data in the form: guid|firstname|lastname|dob|postalcode

The input file can be accessed from here.


What will the Dataflow pipeline do?

The idea is that the input file will be read, transformed and stored into a running Redis data-store instance.


The transformation step of the pipeline will split the data from the input file and then store it with corresponding keys in the data-store along with the guid.


For example, suppose a line of the input file is xxxxxx|bruce|wayne|31051989|4444, where xxxxxx is the guid, bruce is the firstname, wayne is the lastname, 31051989 is the dob (in DDMMYYYY format) and 4444 is the postalcode.
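As a rough illustration of that line format, here is a minimal sketch in plain Java (no Beam involved) that parses one line and rejects malformed input. The PersonRecord class and its parse method are hypothetical names made up for this example; they are not part of the repository. It assumes exactly five fields and a dob in DDMMYYYY format, as described above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical value class for one input line; not part of the original repository. */
final class PersonRecord {
    // DDMMYYYY, as in the example line xxxxxx|bruce|wayne|31051989|4444.
    private static final DateTimeFormatter DOB_FORMAT = DateTimeFormatter.ofPattern("ddMMyyyy");

    final String guid;
    final String firstName;
    final String lastName;
    final LocalDate dob;
    final String postalCode;

    private PersonRecord(String guid, String firstName, String lastName,
                         LocalDate dob, String postalCode) {
        this.guid = guid;
        this.firstName = firstName;
        this.lastName = lastName;
        this.dob = dob;
        this.postalCode = postalCode;
    }

    /** Parses guid|firstname|lastname|dob|postalcode, rejecting a wrong field count. */
    static PersonRecord parse(String line) {
        // The -1 limit keeps trailing empty fields, so "a|b|c|d|" still yields five fields.
        String[] fields = line.split("\\|", -1);
        if (fields.length != 5) {
            throw new IllegalArgumentException(
                    "Expected 5 fields but got " + fields.length + ": " + line);
        }
        // LocalDate.parse throws DateTimeParseException if the dob is not a valid DDMMYYYY date.
        return new PersonRecord(fields[0], fields[1], fields[2],
                LocalDate.parse(fields[3], DOB_FORMAT), fields[4]);
    }

    public static void main(String[] args) {
        PersonRecord record = PersonRecord.parse("xxxxxx|bruce|wayne|31051989|4444");
        System.out.println(record.firstName + " " + record.lastName + " born " + record.dob);
    }
}
```

Validating the field count up front is a design choice for this sketch: indexing into the split array without a check would fail with an ArrayIndexOutOfBoundsException on a short line.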

The pipeline will store the transformed data in the Redis instance like below:


firstname:bruce xxxxxx


lastname:wayne xxxxxx


dob:31051989 xxxxxx


postalcode:4444 xxxxxx
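
The mapping from one input line to these key/guid pairs can be sketched in plain Java as follows. This is only an illustration of the idea, not the pipeline code itself; IndexEntries and fromLine are hypothetical names, and the sketch assumes the field order guid|firstname|lastname|dob|postalcode.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that builds the "field:value" -> guid pairs for one line. */
final class IndexEntries {
    // Field names in the order they follow the guid in the input line.
    private static final String[] FIELD_NAMES = {"firstname", "lastname", "dob", "postalcode"};

    static Map<String, String> fromLine(String line) {
        String[] fields = line.split("\\|");
        String guid = fields[0];
        // LinkedHashMap keeps insertion order so the output matches the field order.
        Map<String, String> entries = new LinkedHashMap<>();
        for (int i = 0; i < FIELD_NAMES.length; i++) {
            entries.put(FIELD_NAMES[i] + ":" + fields[i + 1], guid);
        }
        return entries;
    }

    public static void main(String[] args) {
        // Prints one "key guid" pair per line for the example record.
        fromLine("xxxxxx|bruce|wayne|31051989|4444")
                .forEach((key, guid) -> System.out.println(key + " " + guid));
    }
}
```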


Creating the Dataflow pipeline

We will create a template from scratch; we learned the core concepts from this documentation. We will create a Dataflow batch job, for which we need the Dataflow SDK 2.x and the Apache Beam I/O connector for Redis (RedisIO).


For the pipeline code, we construct a StorageToRedisOptions object (the name is arbitrary) with the method PipelineOptionsFactory.fromArgs, which reads the options from the command line.


Main Class


public static void main(String[] args) {
    // Build the StorageToRedisOptions object from the command-line arguments.
    StorageToRedisOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(StorageToRedisOptions.class);

    Pipeline p = Pipeline.create(options);
    p.apply("Reading Lines...", TextIO.read().from(options.getInputFile()))
            .apply("Transforming data...",
                    ParDo.of(new DoFn<String, String[]>() {
                        @ProcessElement
                        public void TransformData(@Element String line, OutputReceiver<String[]> out) {
                            // Split each pipe-separated line into its fields.
                            String[] fields = line.split("\\|");
                            out.output(fields);
                        }
                    }))
            .apply("Processing data...",
                    ParDo.of(new DoFn<String[], KV<String, String>>() {
                        @ProcessElement
                        public void ProcessData(@Element String[] fields, OutputReceiver<KV<String, String>> out) {
                            if (fields[RedisIndex.GUID.getValue()] != null) {
                                out.output(KV.of("firstname:".concat(fields[RedisIndex.FIRSTNAME.getValue()]), fields[RedisIndex.GUID.getValue()]));
                                out.output(KV.of("lastname:".concat(fields[RedisIndex.LASTNAME.getValue()]), fields[RedisIndex.GUID.getValue()]));
                                out.output(KV.of("dob:".concat(fields[RedisIndex.DOB.getValue()]), fields[RedisIndex.GUID.getValue()]));
                                out.output(KV.of("postalcode:".concat(fields[RedisIndex.POSTAL_CODE.getValue()]), fields[RedisIndex.GUID.getValue()]));
                            }
                        }
                    }))
            // Assumed write method: SADD, since the checks later in this post use sinter and smembers.
            .apply("Writing field indexes into redis",
                    RedisIO.write().withMethod(RedisIO.Write.Method.SADD)
                            .withEndpoint(options.getRedisHost(), options.getRedisPort()));
    p.run();
}

You can clone the complete code from this GitHub repository. You can also refer to this documentation for designing your pipeline.
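
The pipeline code refers to a RedisIndex type that is not shown in this post. A minimal sketch of what it might look like, assuming it simply maps each field name to its column position in guid|firstname|lastname|dob|postalcode, is below. The constant names and getValue() come from the code above; the numeric values and the RedisIndexDemo class are assumptions made for illustration, so check the repository for the actual definition.

```java
/** Small demo showing how the enum is used to index into a split line. */
final class RedisIndexDemo {
    public static void main(String[] args) {
        String[] fields = "xxxxxx|bruce|wayne|31051989|4444".split("\\|");
        System.out.println("guid = " + fields[RedisIndex.GUID.getValue()]);
        System.out.println("postalcode = " + fields[RedisIndex.POSTAL_CODE.getValue()]);
    }
}

/** Assumed reconstruction: column positions of each field in an input line. */
enum RedisIndex {
    GUID(0),
    FIRSTNAME(1),
    LASTNAME(2),
    DOB(3),
    POSTAL_CODE(4);

    private final int value;

    RedisIndex(int value) {
        this.value = value;
    }

    int getValue() {
        return value;
    }
}
```

Keeping the positions in one enum means a change to the column order of the input file only has to be made in one place.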

Executing the Dataflow pipeline

Execute the command below to create the Dataflow template.


mvn compile exec:java \
-Dexec.mainClass=com.viveknaskar.DataFlowPipelineForMemStore \
-Dexec.args="--project=your-project-id \
--jobName=dataflow-memstore-job \
--inputFile=gs://cloud-dataflow-input-bucket/*.txt \
--redisHost=<ip-address-of-the-running-redis-instance> \
--stagingLocation=gs://dataflow-pipeline-batch-bucket/staging/ \
--dataflowJobFile=gs://dataflow-pipeline-batch-bucket/templates/dataflow-template \
--gcpTempLocation=gs://dataflow-pipeline-batch-bucket/tmp/ \
--runner=DataflowRunner"

Here:

  • project: the ID of the project in which the Dataflow pipeline job is created
  • jobName: the name of the Dataflow pipeline job
  • inputFile: the bucket path from which the pipeline reads the input file(s)
  • redisHost: the IP address of the running Redis instance
  • dataflowJobFile: the bucket path where the Dataflow template is created
  • runner: DataflowRunner (for running the pipeline on Dataflow)
  • stagingLocation and gcpTempLocation (the temp location) also need to be provided

Once the build is successful, the Dataflow template is created and a Dataflow job runs.


Dataflow job is created after the successful execution

The Dataflow job is also shown as a graph that summarizes the stages of the pipeline. You can also check the logs.

Check the data inserted in the Memorystore instance

To check whether the processed data is stored in the Redis instance after the Dataflow pipeline has run successfully, first connect to the Redis instance from a Compute Engine VM instance located in the same project, region and network as the Redis instance.

  • Create a VM instance and SSH to it

  • Install telnet with apt-get in the VM instance

sudo apt-get install telnet

  • From the VM instance, connect to the IP address of the Redis instance

telnet instance-ip-address 6379

  • Once you are connected to Redis, check the keys inserted

keys *

  • Check whether the data is inserted by using the intersection command to get the guid

sinter firstname:<firstname> lastname:<lastname> dob:<dob> postalcode:<post-code>

  • Check an individual entry with the command below to get the guid

smembers firstname:<firstname>

  • Command to clear the Redis datastore (for example FLUSHALL, which deletes every key in the instance)


You can read more about Redis commands in this documentation.
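
To see why the intersection returns a single guid, here is a small in-memory simulation of the three set commands in plain Java. It does not talk to Redis; SetIndexSimulation and its methods are hypothetical names that only mimic the behaviour of SADD, SMEMBERS and SINTER, and the two sample rows are made up for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** In-memory stand-in for the Redis set commands used above (not a Redis client). */
final class SetIndexSimulation {
    private final Map<String, Set<String>> sets = new HashMap<>();

    /** Mimics SADD: adds a member to the set stored at key. */
    void sadd(String key, String member) {
        // Redis sets are unordered; TreeSet is used here only for stable printing.
        sets.computeIfAbsent(key, k -> new TreeSet<>()).add(member);
    }

    /** Mimics SMEMBERS: returns a copy of the set at key, empty if the key is missing. */
    Set<String> smembers(String key) {
        return new TreeSet<>(sets.getOrDefault(key, Collections.<String>emptySet()));
    }

    /** Mimics SINTER: returns the members present in every named set. */
    Set<String> sinter(String first, String... rest) {
        Set<String> result = smembers(first);
        for (String key : rest) {
            result.retainAll(smembers(key));
        }
        return result;
    }

    /** Indexes one guid|firstname|lastname|dob|postalcode line, as the pipeline does. */
    void index(String line) {
        String[] f = line.split("\\|");
        sadd("firstname:" + f[1], f[0]);
        sadd("lastname:" + f[2], f[0]);
        sadd("dob:" + f[3], f[0]);
        sadd("postalcode:" + f[4], f[0]);
    }

    public static void main(String[] args) {
        SetIndexSimulation redis = new SetIndexSimulation();
        // Made-up sample rows: two people sharing a first name.
        redis.index("guid-1|bruce|wayne|31051989|4444");
        redis.index("guid-2|bruce|banner|18121969|5555");
        System.out.println(redis.smembers("firstname:bruce"));
        System.out.println(redis.sinter("firstname:bruce", "lastname:wayne"));
    }
}
```

Because several records can share a first name, smembers on a single key may return several guids, while intersecting more field keys narrows the result down to the matching record.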


Finally, we have achieved what we wanted…

Dataflow pipeline jobs excel at processing bulk data, in our case within seconds. I highly recommend that you try it yourself and see how fast it is. I have tried to attach as many resources as possible, and if you go through the code, it is fairly simple. Still, you will get the gist of it best when you experiment on your own. 🙂

Originally published at http://thedeveloperstory.com on August 30, 2020.

Translated from: https://medium.com/swlh/exporting-data-from-storage-to-memorystore-using-cloud-dataflow-5d37287139e7




