This blog details how to handle a large volume of event-triggered data for real-time backend analysis with AWS Kinesis, AWS Lambda, and Google BigQuery.
Background
We have a user-facing app that triggers a large number of events every second, including searching with tags, browsing certain pages, locating on maps, etc. All of this data is stored in Google's BigQuery.
In the old app, the data was inserted into BigQuery by making API calls directly from the app backend to BigQuery. This works, but it creates a bottleneck as we scale up and start hitting BigQuery's rate and size limits, and it risks data loss because there is no retry logic.
So a better solution is needed to handle the scaling and avoid possible data loss.
Design
Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service.
Kinesis is a reliable tool for large-scale data streaming, with competitive pricing compared to other message queue services. With Kinesis in the picture, AWS Lambda is a natural choice on the consumer side because of its easy integration with Kinesis.
One thing to note about the Kinesis + Lambda pipeline is error handling. The automatic retry this combination provides is useful, but when a corrupted record shows up, the whole pipeline gets stuck on the chunk of data that contains the "bad record". Retries keep going until the record expires (24 hours by default) and is removed from the Kinesis stream.
To solve this issue, an SQS queue is added as a DLQ (dead-letter queue) for the corrupted records. The maximum number of retry attempts and the option to split a failed batch are also configurable in the event source mapping configuration for the Lambda function.
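As a sketch, those event source mapping settings can be captured in a JSON file and passed to `aws lambda create-event-source-mapping --cli-input-json file://mapping.json`. The function name, stream ARN, and queue ARN below are placeholders:

```json
{
  "FunctionName": "process-kinesis-records",
  "EventSourceArn": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
  "StartingPosition": "LATEST",
  "MaximumRetryAttempts": 2,
  "BisectBatchOnFunctionError": true,
  "DestinationConfig": {
    "OnFailure": {
      "Destination": "arn:aws:sqs:us-east-1:123456789012:my-dlq"
    }
  }
}
```

`BisectBatchOnFunctionError` is what splits a failed batch in half on retry, which isolates the bad record instead of blocking the whole chunk; `DestinationConfig.OnFailure` routes the record metadata to the SQS DLQ once `MaximumRetryAttempts` is exhausted.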
Hands-on
Setup Kinesis
Kinesis is fairly easy to set up as it doesn’t require much configuration aside from the number of shards.
Streams are
- Made up of Shards
- Each Shard ingests data up to 1MB/sec
- Each Shard emits data up to 2MB/sec
The number of shards needed for a stream can be computed with the formula provided by AWS:
number of shards = max(incoming_write_bandwidth_in_KB/1000, outgoing_read_bandwidth_in_KB / 2000)
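Since shards come in whole units, the result is rounded up in practice. The formula can be sketched in Java (the class and method names are my own):

```java
public class ShardCalculator {

    // number_of_shards = max(incoming_write_KB / 1000, outgoing_read_KB / 2000),
    // rounded up because shards are provisioned in whole units.
    // 1000 KB/sec is a shard's write limit, 2000 KB/sec its read limit.
    static int shardsNeeded(double incomingWriteKBps, double outgoingReadKBps) {
        return (int) Math.max(Math.ceil(incomingWriteKBps / 1000.0),
                              Math.ceil(outgoingReadKBps / 2000.0));
    }

    public static void main(String[] args) {
        // e.g. 3 MB/sec in, 4 MB/sec out -> max(3, 2) = 3 shards
        System.out.println(shardsNeeded(3000, 4000)); // 3
    }
}
```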
After deciding on the name of your stream and the number of shards needed, we can create the data stream either in the console or through the AWS CLI:
aws kinesis create-stream --stream-name <name> --shard-count <number_of_shards> --region <AWS_Region>
Create Lambda Consumer
After setting up the pipe, it's time to create the consumer. Even though Node.js is the most recommended language when it comes to Lambda, Java is my main coding language and the one I'm most confident in, so I decided to stick with it (there are already LOTS of uncertainties and learnings in the whole infrastructure setup).
The simple handler function looks like this:
public class ProcessKinesisRecords implements RequestHandler<KinesisEvent, String> {

    private static final BigQueryService bigQueryService = new BigQueryService();
    private static final Gson gson = new GsonBuilder().setPrettyPrinting().create();
    private static final ObjectMapper mapper = new ObjectMapper();

    @Override
    public String handleRequest(KinesisEvent event, Context context) {
        LambdaLogger logger = context.getLogger();
        // log execution details
        logger.log("ENVIRONMENT VARIABLES: " + gson.toJson(System.getenv()));
        logger.log("CONTEXT: " + gson.toJson(context));
        // process event
        logger.log("EVENT: " + gson.toJson(event));
        Map<String, Map<String, Object>> rows = new HashMap<>();
        for (KinesisEventRecord rec : event.getRecords()) {
            // extract data
            KinesisEvent.Record record = rec.getKinesis();
            // partition key + sequence number uniquely identifies the record
            String id = record.getPartitionKey() + record.getSequenceNumber();
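Inside that loop, the record's payload arrives as a `ByteBuffer` (via `record.getData()` in the aws-lambda-java-events library), so it has to be decoded back into the UTF-8 JSON string the producer wrote. A minimal sketch of that step, with a helper class name of my own:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class RecordDecoder {

    // Copy the buffer's remaining bytes and decode them as UTF-8.
    // duplicate() reads through an independent position, so the
    // original buffer is left untouched for any later consumer.
    static String decode(ByteBuffer data) {
        byte[] bytes = new byte[data.remaining()];
        data.duplicate().get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer payload = ByteBuffer.wrap(
            "{\"tag\":\"search\"}".getBytes(StandardCharsets.UTF_8));
        System.out.println(decode(payload)); // {"tag":"search"}
    }
}
```

The decoded string can then be parsed with the `ObjectMapper` already declared in the handler and collected into the `rows` map keyed by the record id.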