I'm Anusha. I'm a Principal Software Engineer with AWS, in the AWS Messaging and Streaming organization. Today, I'm excited to introduce our customer, who's going to be talking about how they modernized their data streaming architecture using AWS. We have Jon Lee, the Head of the Samsung SmartThings Cloud Engineering team, doing a session for us today.
You know, one of my favorite parts of being an engineer at AWS is enabling our customers to manage less and less every single day, so you can take the time you've saved and go do the business mission you set out to do. And I'm so happy that one of those stories is going to be shared here today: how Samsung SmartThings modernized their data pipeline from a self-managed Apache Spark architecture to a fully managed data streaming architecture on AWS, using Kinesis Data Analytics, which is our managed service for Apache Flink.
Jon is going to walk us through how this led to them increasing their business value, reducing their developer time, and also reducing their total cost of ownership. I know you're all as excited as I am to hear it. So without further ado, introducing Jon on stage. We're going to be hanging out by the stage after the presentation to answer any Q&A. Over to you, Jon.
Jon Lee: Thank you, Anusha, for the introduction. Hello everyone. My name is Jon Lee, and I've been working for Samsung as the Head of SmartThings Cloud Engineering.
Today, I'm going to talk about our journey of modernizing the event processing architecture from batch to real time. The story starts with a brief introduction to Samsung SmartThings and its data platform. After that, we'll look through the problems we had in the previous architecture and the reason why we ended up making the decision to evolve. A brief walkthrough of the options we considered and which option we chose in the end, as well as what the new architecture looks like, will follow.
And I will go a little bit deeper into the challenges we had to overcome during the migration as well. Lastly, I'll talk about the impact of this journey on the platform. Let's get started.
Samsung SmartThings is Samsung's IoT platform, supporting hundreds of millions of users and devices. As a home IoT platform, SmartThings enables users to easily connect and control IoT devices.
This diagram shows the holistic view of the Samsung SmartThings ecosystem. The Samsung SmartThings platform supports three different types of devices, according to their connectivity. One type is directly connected devices, such as Samsung smart TVs or refrigerators. Most Samsung appliances belong to this category, and these devices mainly use WiFi for connectivity. They make up the vast majority of the devices in the SmartThings platform.
Another type is hub-connected devices that connect to the cloud via a hub. Most sensors out there using Zigbee or Z-Wave, as well as Matter-compatible devices, are classified as this type. The other type is C2C, also known as cloud-to-cloud: devices that connect to the SmartThings platform via a partner cloud. For example, if you have a Philips Hue bulb in your home, you can control that bulb from SmartThings by making use of this C2C integration. The SmartThings cloud platform is the backbone of various smart home services, and it also adopts a microservices architecture to provide meaningful value to end users.
As we saw earlier, all data sent by IoT devices flows through the cloud, and it needs to be processed on a real-time basis. Since SmartThings is an IoT platform and all user experiences are expected to work based on this data, the SmartThings data platform takes responsibility for this data processing, and it processes around 30 terabytes of data on average every day.
With this data, the SmartThings data platform provides various features in the platform. There are two types of histories: one is the history of what happened on devices, and the other is the history of what users did with the devices. All the histories for the last seven days are stored in the data platform. Another feature is usage statistics. The key use case of this feature is to provide users with energy usage patterns or savings for eligible devices. The last feature is analytics and recommendations. We analyze all the data to get device usage patterns or other meaningful insights, and from time to time we recommend useful functions based on these analytics.
This diagram shows how the SmartThings data platform looked in the past. As you can see in this architecture, we used the Apache Spark framework as the data processing engine on top of an Apache Hadoop cluster, and all of these were self-managed services running on Amazon EC2. Apache Spark is an open source framework for cluster computing, and Apache Hadoop is an open source project suitable for distributed computing, making use of HDFS as a distributed file system.
Let's take a closer look at how it works. Sensor data from IoT devices such as TVs or refrigerators comes to the cloud through the device gateway, which is an endpoint between the device and the cloud. This data typically goes to the Kafka pipeline, which is an open source distributed event streaming platform. Then Spark apps get data from multiple sources such as Kafka or Kinesis and do the processing, the so-called ETL. All the processed data is sent to various destinations such as the database and the data lake. The final enriched or aggregated data is stored in the database or S3 so that it can be used by other microservices when needed. We are using HBase as the database, which is a NoSQL database for big data running on top of Hadoop and HDFS.
At any time, when a mobile client requests data through the API gateway, the relevant service will query the database and return the result to the client. We had 50 Spark applications to handle all the different business requirements, and they processed more than 30 terabytes of data on average per day, a number that tended to rise in summer because of seasonal events.
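To picture the old pipeline in code: below is a minimal, hypothetical Java sketch of a Spark Structured Streaming job of the kind described above, reading device events from Kafka, doing a trivial transformation, and writing micro-batches out to the data lake. The app name, topic, broker, and bucket paths are illustrative placeholders, not SmartThings' actual configuration.

```java
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class DeviceEventEtlJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("device-event-etl")   // hypothetical app name
                .getOrCreate();

        // Read micro-batches of device events from Kafka (broker and topic are illustrative).
        Dataset<Row> raw = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "kafka:9092")
                .option("subscribe", "device-events")
                .load();

        // ETL step: the real jobs parsed, filtered, and enriched the payload;
        // here we only project the raw value for brevity.
        Dataset<Row> events = raw.selectExpr("CAST(value AS STRING) AS payload");

        // Write each micro-batch out; a bulk write to the serving database would sit here too.
        VoidFunction2<Dataset<Row>, Long> writeBatch = (batch, batchId) ->
                batch.write().format("parquet").mode("append").save("s3a://example-bucket/enriched/");

        events.writeStream()
                .trigger(Trigger.ProcessingTime("1 minute"))
                .foreachBatch(writeBatch)
                .option("checkpointLocation", "s3a://example-bucket/checkpoints/device-event-etl")
                .start()
                .awaitTermination();
    }
}
```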
At the very beginning, everything worked pretty well with this architecture for years. However, as the data volume to process dramatically increased with the rapid growth of SmartThings users and connected devices, we started struggling with several issues. Let's take a look at what issues we had and how these challenges affected our platform.
Since multiple Spark applications might be running on top of the Hadoop cluster with YARN, on the same node of a single EC2 instance, it was very difficult to reserve resources for individual applications. So oftentimes we saw resource contention among Spark apps, particularly when data ingestion traffic surged. As a result, some of the Spark apps could be starved of CPU and memory, which caused delays and performance degradation while processing data, since we didn't have a solution in place to detect those noisy neighbors. In this case, we had to increase the number of nodes in the Hadoop cluster manually according to the traffic, or restart Spark apps in order to recover from failures.
Moreover, we spent a significant amount of time remediating outages caused by maxed-out CPU and lack of memory in Spark apps, because we were not able to use auto scaling in Spark. In order to leverage auto scaling in Spark, we would have needed both node-level and application-level auto scaling at the same time, but that required implementing an additional external shuffle service. These challenges sometimes led us to drop data accidentally, and in the end it caused a bad user experience, such as inaccurate device history and wrong energy consumption statistics. As a result, it brought about a high cost overhead in maintaining the workload in operation.
These numbers represent the incident alerts we received in a month, which came from the outages in Spark. We had to address more than five outages on average in a day. When any Spark app goes down, the data won't be consumed in time, and then we'll get delay notifications; app failure represents this case. Manual recovery is the case where we had outages in Hadoop because of failures in components such as YARN. These numbers tell you how much we suffered from the challenges I mentioned earlier, and they also tell you why we had to retrofit our data platform architecture.
I think there is no one-size-fits-all solution, but we had to rearchitect our data platform, so we kicked off the Data Platform 2.0 project. In this project, we set three objectives. The old platform was pretty expensive, because we had to reserve more resources than required in order to keep working well even when some of the Spark apps went down. So the first objective was to increase cost efficiency. We also wanted to improve the stability and availability of our data platform and to make operations more efficient. In other words, we had to find an alternative that would cost less than Spark apps running on top of an in-house Hadoop cluster and ensure higher stability with better operational efficiency.
So we explored all the available alternative solutions and set criteria to evaluate each of them. The main considerations were as follows. The alternative solution should be able to process data at large scale on a real-time basis. Our data platform receives hundreds of thousands of events per second, so we needed to deal with this amount of data appropriately. We also tried to find a solution that makes it much easier to maintain the cluster with high availability. We preferred fully managed solutions to self-managed ones, because it was too painful to manage the infrastructure. I was also hoping that the engineering team would be able to focus on delivering more meaningful business value without needing to bother with infrastructure management.
The last consideration was that nodes and applications should scale automatically according to the workload; that is, we should be able to use auto scaling. That's because the fact that we were not able to leverage auto scaling in our old architecture hurt our cost and efficiency.
Under these considerations, we ended up narrowing down the alternatives to two candidates: Apache Spark and Apache Flink. Apache Spark operates in micro-batches, collecting and processing small units of streaming data, while Apache Flink processes each streaming record on a real-time basis without batching. Auto scaling had to be applied in order to improve operating cost and operational efficiency.
Although Apache YARN was previously utilized as the cluster manager, Kubernetes would provide better stability and maintainability than YARN when operating several applications. In addition, a cluster using Kubernetes can be scaled according to the constantly changing data traffic to lower the system cost, which also requires auto scaling of the applications. However, to use auto scaling of Apache Spark on Kubernetes, which was selected as the cluster manager, the external shuffle service or the unstable dynamic allocation feature had to be used. This was a pretty big burden to go with Apache Spark.
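To make that burden concrete, here is a hedged Java sketch of the kind of configuration Spark dynamic allocation needs before it can auto scale on Kubernetes. The property keys are standard Spark settings; the values, and the assumption that Spark on Kubernetes with shuffle tracking is the setup under evaluation, are illustrative only.

```java
import org.apache.spark.SparkConf;

public class SparkOnK8sScalingConf {
    // Illustrative settings only: dynamic allocation on Kubernetes needs either an
    // external shuffle service or shuffle tracking before executors can scale,
    // which is the operational burden described above.
    public static SparkConf build() {
        return new SparkConf()
                .set("spark.dynamicAllocation.enabled", "true")
                .set("spark.dynamicAllocation.shuffleTracking.enabled", "true") // K8s alternative to the external shuffle service
                .set("spark.dynamicAllocation.minExecutors", "2")
                .set("spark.dynamicAllocation.maxExecutors", "20")
                .set("spark.dynamicAllocation.executorIdleTimeout", "120s");
    }
}
```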
Since the SmartThings data platform mostly has applications that process data on a real-time basis, we thought the real-time data processing model of Apache Flink was more suitable than the micro-batch processing of Apache Spark. We ended up choosing Kinesis Data Analytics for Flink as the final solution for Data Platform 2.0 because of the advantages it has. First, with constantly changing data traffic in the IoT environment, being serverless greatly helps in reducing operational human resources, and it is also a fully managed solution. Secondly, each application behaves independently, not affecting other applications in terms of system resources. And lastly, it can periodically create a savepoint and use those savepoints to recover in case of a failure, so we can expect high availability without any additional effort.
Although we had to migrate from Apache Spark to Apache Flink, we chose Kinesis Data Analytics for Flink as our final choice because it better responds to the requirements of the Samsung SmartThings data platform and can better address the issue of high availability in the long term. So we first carried out a PoC for the migration, with the top three most data-consuming Spark apps, over 100 days, and got the green light to move forward.
This diagram shows what the SmartThings data platform looks like in the current architecture. The SmartThings data platform mainly consists of two parts: the data processing part, which you can see in the upper part of this diagram, and the data serving part. As you can see, all the EC2 boxes for Spark applications in the previous architecture were replaced by Kinesis Data Analytics for Flink.
Unlike the old architecture, the incoming traffic goes to managed Kafka first and is consumed by Kinesis Data Analytics for Flink. Filtered and enriched data goes back to Kafka, and then it is aggregated and stored in the database. The microservices accessing the database have also been containerized with EKS, so we can say the operation of the microservices in the data platform has been modernized as well.
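As a minimal Java sketch of what a Flink application in this topology could look like, assuming the standard Flink Kafka connectors: consume from a Kafka topic, filter and enrich, and write the result back to another topic. The broker address, topic names, group id, and the enrichment logic are placeholders, not the actual SmartThings jobs.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EnrichmentPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // checkpoints drive offset commits and recovery

        // Source: raw device events from (managed) Kafka.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")                 // illustrative broker
                .setTopics("device-events")                        // illustrative topic
                .setGroupId("smartthings-enrichment")              // illustrative group id
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Sink: filtered and enriched events go back to Kafka for aggregation downstream.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("device-events-enriched")        // illustrative topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                .filter(event -> !event.isEmpty())                 // placeholder filtering
                .map(event -> event + "|enriched")                 // placeholder enrichment
                .returns(Types.STRING)
                .sinkTo(sink);

        env.execute("device-event-enrichment");
    }
}
```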
In addition to the migration, we wanted to improve CI/CD as well. Previously, we divided the steps into uploading the jar to S3 and using Jenkins to download the jar from S3 and register the application on YARN. However, this had several constraints, so we modernized the workflow as shown in this diagram, combining CI and CD into a single workflow to seamlessly deploy the latest code.
When application code is committed to Git, the CI tool starts the build automatically. After the build is completed, the application jar and configuration are uploaded to S3. After uploading the files, the CI tool triggers the deployment task in Spinnaker. Spinnaker then calls the Jenkins job of each application to deploy it. When the Jenkins job runs, it downloads the application jar and configuration files from S3, and finally the Flink application is created or updated using the downloaded jar and configuration files.
With this improvement, we were able to apply deployments to multiple applications at the same time, leverage our existing Spinnaker-based continuous delivery infrastructure, and keep track of deployment history.
Although we decided on the migration because Kinesis Data Analytics for Flink is suitable for the SmartThings data platform, the process leading up to the migration was quite complicated. There were three major challenges we had to overcome.
First, we needed auto scaling that reacts quickly to changes in CPU usage. Second, we needed to fill the gaps in performance between Spark and Flink. And lastly, we needed to manage the offsets of the streaming data appropriately. Let's take a look at each challenge and how we got through them in detail.
Although we chose Kinesis Data Analytics for Flink partly because it supports application auto scaling, the built-in auto scaling of the service has some limitations. We could not change the scaling conditions for Kinesis Data Analytics for Flink, because the requirements for scaling are fixed: it scales out when an application uses more than 75% CPU for 15 minutes, and scales in when it uses less than 10% CPU for six hours. It takes too long to decide to scale in, and we could not adjust the scale-out requirements per application, which led us to build a custom autoscaler.
The diagram here illustrates the structure of the custom autoscaler and explains how it works. As you can see, Amazon CloudWatch tracks CPU usage, and AWS Lambda changes the application parallelism value when the CPU usage meets the scale-in or scale-out condition. We are using Terraform to manage these components and the scaling policy, so we were able to set different scaling policies for each application by setting values such as the cooldown, maximum parallelism, CPU threshold, and so on. By using this, we were able to set our own scaling policies that suit each application, which can scale automatically to meet the event volume. Thanks to this, ultimately, we were able to reduce operation costs.
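As a rough illustration of the Lambda piece of such a custom autoscaler (not SmartThings' actual code), the sketch below uses the AWS SDK for Java v2 to update the parallelism of a Kinesis Data Analytics application when a CloudWatch alarm fires. The event shape, application name, and target parallelism are hypothetical; a real scaler would also apply the Terraform-managed cooldown and threshold logic, and the alarm would typically reach the function via SNS or EventBridge.

```java
import java.util.Map;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import software.amazon.awssdk.services.kinesisanalyticsv2.KinesisAnalyticsV2Client;
import software.amazon.awssdk.services.kinesisanalyticsv2.model.*;

// Hypothetical Lambda handler: scales a Kinesis Data Analytics for Flink application
// by updating its parallelism when a CloudWatch CPU alarm triggers it.
public class FlinkParallelismScaler implements RequestHandler<Map<String, Object>, String> {

    private final KinesisAnalyticsV2Client kda = KinesisAnalyticsV2Client.create();

    @Override
    public String handleRequest(Map<String, Object> alarmEvent, Context context) {
        // Simplified, hypothetical event payload: which app to scale and to what parallelism.
        String appName = (String) alarmEvent.getOrDefault("applicationName", "device-event-enrichment");
        int newParallelism = ((Number) alarmEvent.getOrDefault("targetParallelism", 4)).intValue();

        // The update API needs the current application version id.
        ApplicationDetail detail = kda.describeApplication(
                        DescribeApplicationRequest.builder().applicationName(appName).build())
                .applicationDetail();

        kda.updateApplication(UpdateApplicationRequest.builder()
                .applicationName(appName)
                .currentApplicationVersionId(detail.applicationVersionId())
                .applicationConfigurationUpdate(ApplicationConfigurationUpdate.builder()
                        .flinkApplicationConfigurationUpdate(FlinkApplicationConfigurationUpdate.builder()
                                .parallelismConfigurationUpdate(ParallelismConfigurationUpdate.builder()
                                        .configurationTypeUpdate(ConfigurationType.CUSTOM)
                                        .autoScalingEnabledUpdate(false)   // take over from the built-in scaler
                                        .parallelismUpdate(newParallelism)
                                        .build())
                                .build())
                        .build())
                .build());

        return "scaled " + appName + " to parallelism " + newParallelism;
    }
}
```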
Kinesis Data Analytics for Apache Flink uses a resource allocation unit called the Kinesis Processing Unit, or KPU. A single KPU provides one vCPU, four gigabytes of memory, and 50 gigabytes of storage. When we used Apache Spark, we used Apache Hadoop YARN and Apache Livy to allocate resources. At that time, we allocated the number of CPU cores and the memory necessary for each application, and converted that into the number of KPUs to be assigned to each application.
Kinesis Data Analytics for Flink also has a setting called ParallelismPerKPU. It determines the degree of parallelism per KPU for parallel execution, which is used to appropriately handle applications that require many CPU cores. But simply converting the resource allocation when migrating from Apache Spark to Kinesis Data Analytics for Apache Flink led to several unexpected issues related to application operation. Compared to Apache Spark, Apache Flink generally has better performance in terms of streaming data processing.
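As a rough worked example of that conversion (the numbers here are illustrative, not SmartThings' settings): since the KPUs an application consumes are approximately its parallelism divided by ParallelismPerKPU, a Spark app that used four cores could map to parallelism 4 with ParallelismPerKPU = 1, about 4 KPUs, while a more I/O-bound app might instead run parallelism 8 with ParallelismPerKPU = 2 and still consume roughly the same 4 KPUs.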
However, in some cases, Apache Flink had lower performance than Apache Spark, depending on the attributes of the application. This required us to implement the following measures. The first one is async I/O. Because Apache Flink processes one event after another, applying async I/O is recommended if interactions with external systems are needed. We applied async I/O, considering that the SmartThings data platform frequently interacts with other microservices to process data. We are also applying async I/O when writing data to the database.
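A hedged Java sketch of the async I/O pattern described here, using Flink's AsyncDataStream with a RichAsyncFunction; the enrichment lookup, timeout handling, and capacity values are placeholders rather than the actual SmartThings implementation, and a production version would use a truly non-blocking client with a dedicated executor.

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

// Hypothetical enrichment call to another microservice, wrapped in Flink async I/O
// so the operator does not block on each request.
public class DeviceProfileAsyncEnricher extends RichAsyncFunction<String, String> {

    @Override
    public void asyncInvoke(String event, ResultFuture<String> resultFuture) {
        // lookupProfile(...) stands in for a non-blocking HTTP or database client call.
        CompletableFuture
                .supplyAsync(() -> lookupProfile(event))
                .thenAccept(profile -> resultFuture.complete(
                        Collections.singleton(event + "|" + profile)));
    }

    @Override
    public void timeout(String event, ResultFuture<String> resultFuture) {
        // Emit the event unenriched rather than failing the job on a slow dependency.
        resultFuture.complete(Collections.singleton(event));
    }

    private String lookupProfile(String event) {
        return "profile-placeholder"; // illustrative only
    }

    public static DataStream<String> enrich(DataStream<String> events) {
        // Up to 100 requests in flight, 5 second timeout, order not preserved.
        return AsyncDataStream.unorderedWait(events, new DeviceProfileAsyncEnricher(),
                5, TimeUnit.SECONDS, 100);
    }
}
```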
The second one is database batch reads and writes. Because Flink works with a pure streaming processing model, it processes data one by one, but this can undermine read and write performance depending on the type of database. The SmartThings data platform uses HBase as the database, and when writing to the database there is a significant performance difference between writing one unit of data 100 times and writing 100 units of data at once. The following chart shows the performance test results we conducted. We concluded that HBase absolutely requires batch writes, even if memory and checkpoint size increase slightly, so we aggregate data within the application before writing it out.
Flink supports two types of window triggers by default: time and count. We applied both time and count triggers, through which we could control the number of records sent to the database and keep that number consistent.
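To illustrate the count-or-time flushing idea in code: the platform uses Flink's built-in time and count window triggers, while the self-contained Java sketch below expresses the same behavior with a KeyedProcessFunction that buffers records and emits a batch after 100 elements or 5 seconds, whichever comes first. The batch size, interval, record type, and downstream HBase write are placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Buffers events per key and flushes them as one batch either when 100 records have
// accumulated or after 5 seconds. The emitted List<String> would then go to a batched
// HBase sink (e.g. a single multi-put) instead of one put per record.
public class CountOrTimeBatcher extends KeyedProcessFunction<String, String, List<String>> {

    private static final int MAX_BATCH_SIZE = 100;
    private static final long FLUSH_INTERVAL_MS = 5_000L;

    private transient ListState<String> buffer;
    private transient ValueState<Integer> count;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(new ListStateDescriptor<>("buffer", String.class));
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Integer.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<List<String>> out) throws Exception {
        Integer current = count.value();
        if (current == null || current == 0) {
            // First element of a new batch: arm the time-based flush.
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + FLUSH_INTERVAL_MS);
            current = 0;
        }
        buffer.add(value);
        count.update(current + 1);
        if (current + 1 >= MAX_BATCH_SIZE) {
            // Count-based flush; a leftover timer may flush the next batch a bit early,
            // which is harmless for this sketch.
            flush(out);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<List<String>> out) throws Exception {
        flush(out); // time-based flush
    }

    private void flush(Collector<List<String>> out) throws Exception {
        List<String> batch = new ArrayList<>();
        for (String element : buffer.get()) {
            batch.add(element);
        }
        if (!batch.isEmpty()) {
            out.collect(batch);
        }
        buffer.clear();
        count.clear();
    }
}
```

It would be applied as `events.keyBy(...).process(new CountOrTimeBatcher())`, with the downstream sink doing the actual batched HBase write.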
As we saw in the architecture diagram earlier, data is generally ingested using Kafka in the SmartThings data platform. However, in some cases, data is ingested using Amazon Kinesis Data Streams. If the Kinesis Client Library is used to retrieve data from a Kinesis data stream in Apache Spark, the offset is managed in DynamoDB. But one has to be careful, since managing offsets in Apache Flink is different from doing so in Apache Spark.
Whereas Apache Spark requires a developer to be somewhat directly involved in deciding how offsets are stored, Flink follows the implementation of its connectors. The table here summarizes how the offset is stored when using Apache Spark on the SmartThings data platform and after the migration to Flink. The Kafka connector of Flink stores the offset in a reserved Kafka topic each time a checkpoint is saved, and the offset is also stored in the savepoint. If a Flink savepoint is used, applications use the value stored in the savepoint as the highest priority, and retrieve the value from the Kafka topic when the savepoint is not used. The Kinesis connector of Apache Flink completely relies on savepoints and does not store offsets in DynamoDB.
Therefore, snapshots in Kinesis Data Analytics for Flink must be enabled to use Flink's savepoint functionality when retrieving data from Kinesis data streams.
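A hedged Java sketch pulling the offset behavior together, assuming the standard Flink connectors: with checkpointing enabled, the Kafka source keeps its offsets in checkpoint/savepoint state (and commits them back to Kafka), while the Kinesis consumer keeps shard positions only in that state rather than in DynamoDB. Stream name, region, topics, and broker are illustrative.

```java
import java.util.Properties;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;
import org.apache.kafka.clients.consumer.OffsetResetStrategy;

public class OffsetHandlingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // offsets travel with checkpoints and savepoints/snapshots

        // Kafka: offsets are part of checkpoint state, committed back to Kafka on checkpoint
        // completion, and the committed offsets are only consulted when no savepoint/snapshot
        // is being restored.
        KafkaSource<String> kafka = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")                   // illustrative broker
                .setTopics("device-events")                          // illustrative topic
                .setGroupId("data-platform-app")                     // illustrative group id
                .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.LATEST))
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Kinesis: unlike Spark's KCL-based consumer, the Flink connector keeps shard positions
        // only in checkpoints/savepoints (KDA snapshots), not in DynamoDB.
        Properties kinesisProps = new Properties();
        kinesisProps.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");
        kinesisProps.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");
        FlinkKinesisConsumer<String> kinesis =
                new FlinkKinesisConsumer<>("device-stream", new SimpleStringSchema(), kinesisProps);

        env.fromSource(kafka, WatermarkStrategy.noWatermarks(), "kafka-source").print();
        env.addSource(kinesis).name("kinesis-source").print();
        env.execute("offset-handling-sketch");
    }
}
```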
Let's take a look at the impact we've seen after addressing these challenges. The first impact is scalability. We applied auto scaling to all of our dozens of data processing applications to establish a system that can handle traffic increases, even if the amount of traffic gets four times bigger. We also went serverless to remove human intervention from system expansion and reduction, and so successfully reduced the human resources needed for operation by over 60%. As you can see in the chart, the application's KPU usage moves according to the event traffic pattern, which means the resources for data processing automatically scale in and out dynamically.
The second impact is reliability and availability. The operational issues were addressed by migrating all real-time data processing components of the SmartThings data platform to serverless. We also removed the fundamental cause of cascading failures by configuring independent clusters for each application, so when we ran into an incident, it didn't affect the other applications. As a result, we were able to greatly improve reliability, even when one cluster went down.
The last impact is efficiency. Previously, it took more than an hour to recover from an error when directly managing cluster nodes, but we can now recover the clusters allocated to applications within just a few minutes. As you can see, the number of delay notifications on data processing decreased to less than 40% of what it was, and we've never experienced an app failure or a manual cluster recovery case at all after the migration. The total cost of ownership was also reduced, in spite of the increase in computing resources from 500 to 600 KPUs, and we significantly reduced the total operational resources by around 80%.
With this efficiency improvement, we can now spend our engineering resources on more valuable work. So I'm very happy that the engineering team is focusing more on delivering meaningful business value rather than spending their time on routine tasks.
Although we've modernized our data platform with the aid of Kinesis Data Analytics for Flink, we are looking forward to some additional improvements. While it took under one minute to deploy a Spark app in the past, it takes a few minutes with Kinesis Data Analytics. If we could deploy Flink applications much faster, it would really help improve our efficiency. Also, Kinesis Data Analytics for Flink doesn't yet support using different log level policies or KPU-level metrics. As far as I know, most of these features are under development by the AWS service team, and I hope we can use them as soon as possible.
All right, that's all I have for today. Thank you for joining the session, and we'll take the Q&A off stage. Thank you.