Exporting Data from Storage to Memorystore using Cloud Dataflow

Recently, I got a chance to do some R&D on a requirement where I needed to read files stored in a Cloud Storage bucket, process and transform them into the desired format, and store the result in an in-memory data store, i.e., Memorystore, for faster access. Well, honestly, it took several days to figure out the correct approach before finding the right technologies to implement this.

One of the best services in Google Cloud Platform that I have worked with and experimented on is Cloud Dataflow, a fully-managed service for executing pipelines within the Google Cloud Platform ecosystem. It is a service fully dedicated to transforming and enriching data in stream (real-time) and batch (historical) modes. It takes a serverless approach, where users can focus on programming instead of managing server clusters, and it can be integrated with Cloud Operations (formerly Stackdriver), which lets you monitor and troubleshoot pipelines as they are running.

Memorystore is Google's implementation of the Redis data store, with reduced latency and high scalability. Caching is a technique used to accelerate application response times and help applications scale by placing frequently needed data very close to the application. Memorystore for Redis provides a fully-managed service, powered by the Redis in-memory data store, for building application caches that provide sub-millisecond data access.

Prerequisites

Before creating our dataflow pipeline for the implementation, we need to do three things:

1. Create two GCS buckets

One GCS bucket is required for storing the input file(s), which will be read, transformed and then stored in the Redis data store; the other bucket is required for staging the dataflow pipeline code.

[Image] You can give any names to your buckets

If you are not familiar with the creation of buckets, refer to this GCS documentation.
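If you prefer the command line over the console, the two buckets used in this walkthrough could be created with gsutil roughly as follows (the region is only an example; pick whichever region you plan to run the pipeline in):

gsutil mb -l us-central1 gs://cloud-dataflow-input-bucket
gsutil mb -l us-central1 gs://dataflow-pipeline-batch-bucket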

2. Create a Redis Instance

A Memorystore (for Redis) instance is required for our implementation to store the processed data after the cloud dataflow pipeline is executed.

[Image] You can give any name to your Redis instance (it won't matter as we will be using the IP address only)

The IP address (Redis host) needs to be provided on the command line when executing the dataflow pipeline. If you are not familiar with the creation of a Memorystore instance, refer to this Memorystore documentation.
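The instance can also be created from the command line; a minimal gcloud sketch, assuming an instance name of my-redis-instance, a 1 GB capacity and the us-central1 region, would be:

gcloud redis instances create my-redis-instance --size=1 --region=us-central1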

3. Upload the input file in the GCS bucket

For the dataflow pipeline to be executed, an input file needs to be uploaded to the input GCS bucket; in our case, that is cloud-dataflow-input-bucket.
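If you are uploading from the command line, a gsutil copy along these lines works (input.txt is just a placeholder for your input file name):

gsutil cp input.txt gs://cloud-dataflow-input-bucket/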

[Image] The input file uploaded in the input bucket

The input file has its data separated by a "pipe" character and is of the form: guid|firstname|lastname|dob|postalcode

The input file can be accessed from here.

What will the dataflow pipeline do?

The idea is that the input file will be read, transformed and stored in a running Redis data-store instance.

The transformation step of the pipeline will split the data from the input file and then store it with corresponding keys in the data-store along with the guid.

For example, suppose a record in the input file is xxxxxx|bruce|wayne|31051989|4444, where xxxxxx is the guid, bruce is the firstname, wayne is the lastname, 31051989 is the dob (in DDMMYYYY format) and 4444 is the postalcode.

The pipeline will store the transformed data in the Redis instance like below:

firstname:bruce xxxxxx
lastname:wayne xxxxxx
dob:31051989 xxxxxx
postalcode:4444 xxxxxx

Creating the dataflow pipeline

We will create a template from scratch and, obviously, we referred to and understood the core concepts from this documentation. We will be creating a dataflow batch job, and for that we have to use the Dataflow SDK 2.x and the Apache Beam Redis IO SDK.

<dependency>
    <groupId>com.google.cloud.dataflow</groupId>
    <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
    <version>2.5.0</version>
</dependency>
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-redis</artifactId>
    <version>2.23.0</version>
</dependency>

For the pipeline code, we have to construct a StorageToRedisOptions (or give it any name) object using the method PipelineOptionsFactory.fromArgs to read options from the command line.
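The options interface itself is not listed in this post, but its getters are used by the pipeline code below. A minimal sketch, assuming the standard Beam PipelineOptions annotations and a default Redis port of 6379, could look like this:

import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

public interface StorageToRedisOptions extends PipelineOptions {

    @Description("Path of the input file to read from")
    String getInputFile();
    void setInputFile(String value);

    @Description("Redis host (IP address of the Memorystore instance)")
    String getRedisHost();
    void setRedisHost(String value);

    @Description("Redis port")
    @Default.Integer(6379)
    int getRedisPort();
    void setRedisPort(int value);
}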

Main Class

public static void main(String[] args) {
    /**
     * Construct StorageToRedisOptions object using the method PipelineOptionsFactory.fromArgs
     * to read options from the command line
     */
    StorageToRedisOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(StorageToRedisOptions.class);

    Pipeline p = Pipeline.create(options);
    p.apply("Reading Lines...", TextIO.read().from(options.getInputFile()))
            .apply("Transforming data...",
                    ParDo.of(new DoFn<String, String[]>() {
                        @ProcessElement
                        public void TransformData(@Element String line, OutputReceiver<String[]> out) {
                            String[] fields = line.split("\\|");
                            out.output(fields);
                        }
                    }))
            .apply("Processing data...",
                    ParDo.of(new DoFn<String[], KV<String, String>>() {
                        @ProcessElement
                        public void ProcessData(@Element String[] fields, OutputReceiver<KV<String, String>> out) {
                            if (fields[RedisIndex.GUID.getValue()] != null) {
                                out.output(KV.of("firstname:".concat(fields[RedisIndex.FIRSTNAME.getValue()]), fields[RedisIndex.GUID.getValue()]));
                                out.output(KV.of("lastname:".concat(fields[RedisIndex.LASTNAME.getValue()]), fields[RedisIndex.GUID.getValue()]));
                                out.output(KV.of("dob:".concat(fields[RedisIndex.DOB.getValue()]), fields[RedisIndex.GUID.getValue()]));
                                out.output(KV.of("postalcode:".concat(fields[RedisIndex.POSTAL_CODE.getValue()]), fields[RedisIndex.GUID.getValue()]));
                            }
                        }
                    }))
            .apply("Writing field indexes into redis",
                    RedisIO.write().withMethod(RedisIO.Write.Method.SADD)
                            .withEndpoint(options.getRedisHost(), options.getRedisPort()));
    p.run();
}
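The code above also uses a RedisIndex enum to map each field name to its position in the pipe-separated record; it lives in the repository rather than in this post. A minimal sketch, assuming the field order guid|firstname|lastname|dob|postalcode, would be:

// Maps each field name to its position in the pipe-separated input record
public enum RedisIndex {
    GUID(0),          // unique identifier of the record
    FIRSTNAME(1),
    LASTNAME(2),
    DOB(3),           // date of birth in DDMMYYYY format
    POSTAL_CODE(4);

    private final int value;

    RedisIndex(int value) {
        this.value = value;
    }

    public int getValue() {
        return value;
    }
}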

You can clone the complete code from this GitHub repository. You can also refer to this documentation for designing your pipeline.

Executing the dataflow pipeline

We would have to execute the below command to create the dataflow template.

mvn compile exec:java \
-Dexec.mainClass=com.viveknaskar.DataFlowPipelineForMemStore \
-Dexec.args="--project=your-project-id \
--jobName=dataflow-memstore-job \
--inputFile=gs://cloud-dataflow-input-bucket/*.txt \
--redisHost=127.0.0.1 \
--stagingLocation=gs://dataflow-pipeline-batch-bucket/staging/ \
--dataflowJobFile=gs://dataflow-pipeline-batch-bucket/templates/dataflow-template \
--gcpTempLocation=gs://dataflow-pipeline-batch-bucket/tmp/ \
--runner=DataflowRunner"

Here:
  • project: the project where the dataflow pipeline job is created
  • jobName: the name of the dataflow pipeline job
  • inputFile: the bucket from which the input file is read by the pipeline
  • redisHost: the IP address of the running Redis instance
  • dataflowJobFile: the bucket where the dataflow template is created
  • runner: DataflowRunner (for running the dataflow pipeline)
  • stagingLocation and gcpTempLocation also need to be provided

Once the build is successful, the dataflow template is created and a dataflow job runs.

[Image] The dataflow job created after successful execution

The dataflow job is also represented as a graph summarizing the various stages of the pipeline. You can also check the logs.
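To keep an eye on the job from the command line as well, listing the Dataflow jobs with gcloud shows the job and its current state (the region flag below is an assumption; use the region your job actually runs in):

gcloud dataflow jobs list --region=us-central1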

Check the data inserted in the Memorystore instance

To check whether the processed data is stored in the Redis instance after the dataflow pipeline has executed successfully, you must first connect to the Redis instance from a Compute Engine VM instance located within the same project, region and network as the Redis instance.

  • Create a VM instance and SSH into it
  • Install telnet from apt-get on the VM instance
sudo apt-get install telnet
  • From the VM instance, connect to the IP address of the Redis instance
telnet instance-ip-address 6379
  • Once you are in Redis, check the keys inserted
keys *
  • Check whether the data was inserted, using the set-intersection command to get the guid (see the example after this list)
sinter firstname:<firstname> lastname:<lastname> dob:<dob> postalcode:<post-code>
  • Check an individual entry using the below command to get the guid
smembers firstname:<firstname>
  • Command to clear the Redis datastore
flushall
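As a concrete example with the sample record used earlier (assuming the bruce/wayne line was in the input file), both of the following should return the guid, i.e., xxxxxx:

sinter firstname:bruce lastname:wayne dob:31051989 postalcode:4444
smembers firstname:bruce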

You can read more about Redis commands in this documentation.

Finally, we have achieved what we wanted…

Dataflow pipeline jobs are champions when it comes to processing bulk data within seconds. I highly recommend you try it yourself and see how fast it is. Well, I have tried to attach as many resources as possible, and if you go through the code, it is fairly simple. Still, you will get the gist of it when you experiment on your own. 🙂

Originally published at http://thedeveloperstory.com on August 30, 2020.

Translated from: https://medium.com/swlh/exporting-data-from-storage-to-memorystore-using-cloud-dataflow-5d37287139e7
