Build a Scalable Data Pipeline with AWS Kinesis, AWS Lambda, and Google BigQuery

This blog details how to handle a large volume of event-triggered data for real-time backend analysis with AWS Kinesis, AWS Lambda, and Google BigQuery.

Photo by NASA on Unsplash

Background

We have a user-facing app that triggers a large number of events every second, such as searching with tags, browsing certain pages, and locating on maps. All of this data is stored in Google’s BigQuery.

In the old app, the data was inserted into BigQuery by making API calls directly from the app backend. Although this works, it becomes a bottleneck as we scale up and hit BigQuery’s rate and size limits, and it risks data loss in the absence of retry logic.

So a better solution is needed to handle the scaling and avoid possible data loss.

Design

[Architecture diagram]

Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service.

Kinesis is a reliable tool for large-scale data streaming, with competitive pricing compared to other message queue services. With Kinesis in the picture, AWS Lambda is a natural choice on the consumer side because of its easy integration with Kinesis.

One thing to note about the Kinesis + Lambda pipeline is error handling. While the auto-retry provided by this combination is useful, a corrupted record blocks the whole pipeline on the batch that contains the “bad record”. The retries keep going until the record expires (after 24 hours by default) and is removed from the Kinesis stream.

To solve the above issue, an SQS queue is added here as a dead-letter queue (DLQ) for the corrupted records. The maximum number of retries and the option to split a failed batch are also configurable in the event source mapping for the Lambda function.
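For reference, here is a rough sketch of how such a mapping could be configured with the AWS CLI; the function name, stream ARN, queue ARN, and the retry/batch values are placeholders rather than the exact settings used in this project:

aws lambda create-event-source-mapping \
  --function-name <lambda_function_name> \
  --event-source-arn <kinesis_stream_arn> \
  --starting-position LATEST \
  --batch-size 100 \
  --maximum-retry-attempts 3 \
  --bisect-batch-on-function-error \
  --destination-config '{"OnFailure":{"Destination":"<sqs_queue_arn>"}}'

With a configuration like this, a failing batch is retried a bounded number of times, split in half on function errors so a single bad record can be isolated, and the metadata of records that still fail (shard and sequence number range, not the payload itself) is sent to the SQS queue.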

Hands-on

Setup Kinesis

Kinesis is fairly easy to set up as it doesn’t require much configuration aside from the number of shards.

Streams are made up of shards:

  • Each shard ingests data at up to 1 MB/sec
  • Each shard emits data at up to 2 MB/sec

The number of shards needed for a stream can be computed with the formula provided by AWS:

number_of_shards = max(incoming_write_bandwidth_in_KB / 1000, outgoing_read_bandwidth_in_KB / 2000)
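For example, if producers write 3 MB/sec (3,000 KB/sec) and consumers read 4 MB/sec (4,000 KB/sec), the stream needs max(3000 / 1000, 4000 / 2000) = max(3, 2) = 3 shards.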

After deciding on the name of your stream and the number of shards needed, we can create the data stream either in the console or through the AWS CLI:

aws kinesis create-stream --stream-name <name> --shard-count <number_of_shards> --region <AWS_Region>
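Once the stream is active, you can sanity-check it by pushing a test record from the CLI. The payload below is just an illustrative event; with AWS CLI v2 the --cli-binary-format flag lets you pass it as raw JSON instead of base64:

aws kinesis put-record \
  --stream-name <name> \
  --partition-key user-123 \
  --data '{"type":"search","tags":["coffee"]}' \
  --cli-binary-format raw-in-base64-out \
  --region <AWS_Region>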

Create Lambda Consumer

Photo by Pravin Chavda on Unsplash

After setting up the pipeline, it’s time to create the consumer. Even though Node.js is the most commonly recommended language for Lambda, Java is my main coding language and the one I’m most confident in, so I decided to stick with it (there are already LOTS of uncertainties and learnings in the whole infrastructure setup).

The simple handler function looks like this:

// imports (aws-lambda-java-events, Jackson, Gson) are omitted for brevity
public class ProcessKinesisRecords implements RequestHandler<KinesisEvent, String> {

  private static final BigQueryService bigQueryService = new BigQueryService();
  private static final Gson gson = new GsonBuilder().setPrettyPrinting().create();
  private static final ObjectMapper mapper = new ObjectMapper();

  @Override
  public String handleRequest(KinesisEvent event, Context context) {

    LambdaLogger logger = context.getLogger();
    // log execution details
    logger.log("ENVIRONMENT VARIABLES: " + gson.toJson(System.getenv()));
    logger.log("CONTEXT: " + gson.toJson(context));

    // process event
    logger.log("EVENT: " + gson.toJson(event));
    Map<String, Map<String, Object>> rows = new HashMap<>();

    for (KinesisEventRecord rec : event.getRecords()) {

      // extract data
      KinesisEvent.Record record = rec.getKinesis();
      // partition key + sequence number gives a unique id for each record
      String id = record.getPartitionKey() + record.getSequenceNumber();

      // The original post is truncated here; the rest of the loop is an assumed
      // completion: deserialize the JSON payload and collect it as a BigQuery row.
      try {
        Map<String, Object> row =
            mapper.readValue(record.getData().array(), new TypeReference<Map<String, Object>>() {});
        rows.put(id, row);
      } catch (IOException e) {
        // rethrowing makes the batch fail, so it gets retried / bisected / sent to the DLQ
        throw new RuntimeException("Failed to parse record " + id, e);
      }
    }

    // stream the collected rows into BigQuery (insertRows is an assumed method name)
    bigQueryService.insertRows(rows);
    return "OK";
  }
}
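The BigQueryService used above is the author’s own wrapper and its implementation isn’t shown in the post. As a rough idea of what it could look like, here is a minimal sketch built on the google-cloud-bigquery client library; the dataset and table names, as well as the insertRows method name, are assumptions made for illustration:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;

import java.util.Map;

// Minimal sketch, not the actual implementation from the post.
public class BigQueryService {

  private final BigQuery bigQuery = BigQueryOptions.getDefaultInstance().getService();

  // dataset and table names are placeholders
  private static final TableId TABLE_ID = TableId.of("<dataset>", "<table>");

  public void insertRows(Map<String, Map<String, Object>> rows) {
    InsertAllRequest.Builder builder = InsertAllRequest.newBuilder(TABLE_ID);
    // use the Kinesis partition key + sequence number as BigQuery's insertId
    rows.forEach(builder::addRow);
    InsertAllResponse response = bigQuery.insertAll(builder.build());
    if (response.hasErrors()) {
      // failing loudly lets Lambda's retry / bisect / DLQ configuration take over
      throw new RuntimeException("BigQuery insert errors: " + response.getInsertErrors());
    }
  }
}

Passing the record id as the insertId gives BigQuery a hint for best-effort deduplication when the same batch is retried.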