This blog details how to handle a large volume of event-triggered data for real-time backend analysis with AWS Kinesis, AWS Lambda, and Google BigQuery.
Background
We have a user-facing app that triggers a large number of events every second, including searching with tags, browsing certain pages, locating on maps, etc. All of this data is stored in Google's BigQuery.
In the old app, the data was inserted into BigQuery by making API calls directly from the app backend to BigQuery. This works, but it creates a bottleneck as we scale up and start hitting BigQuery's rate and size limits, and it risks data loss because there is no retry logic.
So a better solution is needed to handle the scaling and avoid possible data loss.
Design
Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service.
Kinesis is a reliable tool for large-scale data streaming, with competitive pricing compared to other message queue services. With Kinesis in the picture, AWS Lambda is a natural choice on the consumer side because of its easy integration with Kinesis.
One thing to note about the Kinesis + Lambda pipeline is error handling. The automatic retry this combination provides is useful, but when a corrupted record shows up, the whole pipeline gets stuck on the chunk of data that contains the "bad record". Retries keep going until the record expires (24 hours by default) and is removed from the Kinesis stream.
To solve this issue, an SQS queue is added as a DLQ (dead-letter queue) for the corrupted records. The maximum number of retry attempts and the option to split a failed batch are also configurable in the event source mapping configuration for the Lambda function.
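As a sketch, those event source mapping settings can be captured in a JSON file and passed to `aws lambda create-event-source-mapping --cli-input-json file://mapping.json`. The function name, stream ARN, and queue ARN below are placeholders:

```json
{
  "FunctionName": "process-kinesis-records",
  "EventSourceArn": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
  "StartingPosition": "LATEST",
  "MaximumRetryAttempts": 2,
  "BisectBatchOnFunctionError": true,
  "DestinationConfig": {
    "OnFailure": {
      "Destination": "arn:aws:sqs:us-east-1:123456789012:my-dlq"
    }
  }
}
```

`BisectBatchOnFunctionError` is what splits a failed batch in half on retry, which isolates the bad record instead of blocking the whole chunk; `DestinationConfig.OnFailure` routes the record metadata to the SQS DLQ once `MaximumRetryAttempts` is exhausted.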
Hands-on
Setup Kinesis
Kinesis is fairly easy to set up as it doesn’t require much configuration aside from the number of shards.
Streams are
- Made up of Shards
- Each Shard ingests data up to 1MB/sec
- Each Shard emits data up to 2MB/sec
The number of shards needed for a stream can be computed with the formula provided by AWS:
number of shards = max(incoming_write_bandwidth_in_KB/1000, outgoing_read_bandwidth_in_KB / 2000)
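Since shards come in whole units, the result is rounded up in practice. The formula can be sketched in Java (the class and method names are my own):

```java
public class ShardCalculator {

    // number_of_shards = max(incoming_write_KB / 1000, outgoing_read_KB / 2000),
    // rounded up because shards are provisioned in whole units.
    // 1000 KB/sec is a shard's write limit, 2000 KB/sec its read limit.
    static int shardsNeeded(double incomingWriteKBps, double outgoingReadKBps) {
        return (int) Math.max(Math.ceil(incomingWriteKBps / 1000.0),
                              Math.ceil(outgoingReadKBps / 2000.0));
    }

    public static void main(String[] args) {
        // e.g. 3 MB/sec in, 4 MB/sec out -> max(3, 2) = 3 shards
        System.out.println(shardsNeeded(3000, 4000)); // 3
    }
}
```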
After deciding on the name of your stream and the number of shards needed, we can create the data stream either in the console or through the AWS CLI:
aws kinesis create-stream --stream-name <name> --shard-count <number_of_shards> --region <AWS_Region>
Create Lambda Consumer
After setting up the pipe, it's time to create the consumer. Even though Node.js is the most recommended language when it comes to Lambda, Java is my main coding language and the one I'm most confident in, so I decided to stick with it (there are already LOTS of uncertainties and learnings in the whole infrastructure setup).
The simple handler function looks like this:
public class ProcessKinesisRecords implements RequestHandler<KinesisEvent, String> {

    private static final BigQueryService bigQueryService = new BigQueryService();
    private static final Gson gson = new GsonBuilder().setPrettyPrinting().create();
    private static final ObjectMapper mapper = new ObjectMapper();

    @Override
    public String handleRequest(KinesisEvent event, Context context) {
        LambdaLogger logger = context.getLogger();
        // log execution details
        logger.log("ENVIRONMENT VARIABLES: " + gson.toJson(System.getenv()));
        logger.log("CONTEXT: " + gson.toJson(context));
        // process event
        logger.log("EVENT: " + gson.toJson(event));
        Map<String, Map<String, Object>> rows = new HashMap<>();
        for (KinesisEventRecord rec : event.getRecords()) {
            // extract data
            KinesisEvent.Record record = rec.getKinesis();
            // partition key + sequence number uniquely identifies the record
            String id = record.getPartitionKey() + record.getSequenceNumber();
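Inside that loop, the record's payload arrives as a `ByteBuffer` (via `record.getData()` in the aws-lambda-java-events library), so it has to be decoded back into the UTF-8 JSON string the producer wrote. A minimal sketch of that step, with a helper class name of my own:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class RecordDecoder {

    // Copy the buffer's remaining bytes and decode them as UTF-8.
    // duplicate() reads through an independent position, so the
    // original buffer is left untouched for any later consumer.
    static String decode(ByteBuffer data) {
        byte[] bytes = new byte[data.remaining()];
        data.duplicate().get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer payload = ByteBuffer.wrap(
            "{\"tag\":\"search\"}".getBytes(StandardCharsets.UTF_8));
        System.out.println(decode(payload)); // {"tag":"search"}
    }
}
```

The decoded string can then be parsed with the `ObjectMapper` already declared in the handler and collected into the `rows` map keyed by the record id.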