Abhay Saxena: Hello, everyone. Thanks for coming to DAT215. Welcome. We are here to talk about how Samsung SmartThings is powering their home automation systems with Amazon MemoryDB. So thank you for taking the time. My name is Abhay Saxena. I'm a product manager with the Amazon MemoryDB team here at AWS, and I'm here with Tim and Kent, who will be walking us through how SmartThings uses Amazon MemoryDB to power their home automation and IoT platform.
So we'll walk through a little bit of detail on what SmartThings is, what they hope to accomplish, what they're doing today, and some of the architecture and constraints that they're running into.
Then we will walk through what led them to choose MemoryDB to power their next generation platform. Taking a quick poll: who here is familiar with MemoryDB? All right, not a lot, but a few. That's great. So we'll go a little bit deeper into MemoryDB as well. I'll spend some time talking about what MemoryDB for Redis is, how it works, and so on, and then we'll switch back and learn how MemoryDB is powering the next generation platform and architecture for the IoT platform at SmartThings.
And then of course, we'll have a few demos showing how it all works together. We may not have time for Q&A at the end because we don't have microphones in the audience here, but we will hang out afterwards. So if you have any questions, we're happy to answer them then.
So with that, I'm gonna hand it over to Kent, who's gonna walk us through it.
Kent: Hi, I'm Kent. I'm a software engineer at SmartThings. How many of you have heard of SmartThings or worked with home IoT devices before? Well, have you ever wondered how it works behind the scenes? How do you push a button on your smartphone and suddenly your lights turn on? Now, imagine this system handling billions of events like this every day. Today, we'll pull back the curtain and look at how SmartThings achieves this and how we're preparing for the future.
But first, let's quickly go over some history of SmartThings to put things into perspective. SmartThings was founded in 2012 as a Kickstarter project with a goal of $250,000. But after just 30 days, we ended up raising $1.2 million with 5,700 backers. About a year after that, SmartThings started selling home IoT devices commercially. And by late 2014, SmartThings was acquired by Samsung and has since become an integral part of their IoT strategy.
In the beginning, our platform was centered around a monolithic application with just a few ancillary services. We had one type of hub and supported a handful of devices. But fast forward to today, we have over 200 microservices in our cloud. We have multiple hub variants and we support thousands of devices. And on a given day, we connect to over 100 million devices and process over 13 billion events.
But how does this work from a technical perspective? How do we handle all of these devices and events? Let's take a look at the SmartThings ecosystem. First, we have the SmartThings app. The app is used to interact with your smart home and its devices. You can onboard new devices, create automations, and trigger state changes for devices, like turning on a light or unlocking a door.
The largest component of our ecosystem is our cloud. This is where those billions of events sprawl out across over 200 services. It's where the source of truth lives for our users, their devices, and other metadata that lets us do intelligent things. Depending on the type of automations you have set up, some of those may execute here in the cloud before being sent down to the devices in your home.
We have some devices that connect directly to the cloud like Samsung appliances. And we also integrate with some partners in a cloud to cloud interaction.
Finally, the SmartThings edge comprises the physical things in your home. The SmartThings hub is the gateway for interacting with your devices, like light bulbs, water sensors, and motion sensors. Like the cloud, depending on the type of devices you have, the hub can execute some of these automations locally. This can improve the speed of these automations, since you don't need to transit the internet to trigger them.
But what we're most interested in for this talk is the layer between the hub and the cloud. We call this the hub connectivity platform. The hub connectivity platform is the vital link between the cloud and edge layers. This is where we maintain hundreds of thousands of concurrent TCP connections between the hubs and a cloud service called DevCon, short for Device Connectivity.
The cloud and the hub speak to each other using a custom binary protocol called Smart Command. Within the SmartThings hub, these Smart Commands are translated to IoT-native protocols like Zigbee and Z-Wave.
The hub connectivity platform is bidirectional. The cloud sends messages to the hub like a device command to turn on a light bulb and the hub sends messages to the cloud like a device event reporting a state change.
The hub connectivity platform is also one of the oldest parts of the SmartThings ecosystem. The first commit for that central service, DevCon, was back in June 2012. Unlike wine, however, things haven't improved as they've aged. Since those first commits over 10 years ago, SmartThings has grown tremendously, and over that time we've made some architectural decisions that have started to show their age.
The first of these is our shard-based architecture. Globally, the SmartThings cloud is deployed in multiple AWS regions. Some modern services are what we call global services, where their data is replicated between regions and they can serve any request regardless of its origin. However, some of our older services, like those that make up the hub connectivity platform, are sharded services. While these services are deployed in multiple regions, their data is not replicated. That means that for these services, a user and their existence within the SmartThings cloud live in a single region. To top it off, we aren't adding new shards, so as more users come onto the platform, each of these shards gets more and more dense.
Secondly, the infrastructure that we use is starting to show its age. The hub connectivity platform is an event-driven software architecture where we use clusters of self-hosted RabbitMQ as our messaging layer. As we've grown, this messaging layer has become much more difficult to manage. With our traffic and our use case, RabbitMQ is becoming more difficult to scale, and as our RabbitMQ clusters have grown over time, so too have the maintenance requirements.
To put it simply, the growth of our users is outpacing our original architecture. As we've grown over the last 10 years, we've come to adopt some core principles for our platform.
First, we need something that is extremely fast, since our customers gain value from our ability to do intelligent things quickly. Any slowdown will degrade their experience. We measure total latency in milliseconds, and each one counts. Also keep in mind that one user interaction will likely span a dozen services in its lifetime.
Next, we need something that's reliable. A user's requests shouldn't fail or be dropped regardless of the state of the system serving those requests. A user doesn't care and shouldn't know if an availability zone or region goes down. They only care that they can reliably unlock their front door when they get home. So we have to build something that is very fault tolerant and highly available.
Third, the home IoT space is rapidly growing and we need something that can keep pace with that growth. It's a great problem to have. But the burden is on us to make sure that we can accommodate each user and give them the same smooth experience.
Finally, we want something that is easy to maintain. We want our developers spending more time working on features and functionality that set us apart from our competitors. We want to minimize the amount of time it takes to perform routine maintenance operations, like application deployments or OS upgrades.
So these principles have guided our path to success in the past. But how do things stack up if we look at our growth over the past several years using our original architecture? Do we have ultrafast performance? Yes. Most of our services are very fast, and we get in-memory performance from RabbitMQ. The round-trip time for messages between the hub and the cloud is around 200 milliseconds.
Is it reliable? Sort of. When things are stable, it is. But if you introduce some entropy, the story changes. In particular, since we have that shard-based architecture, hubs can only connect to one region. So if a shard has a problem, like a RabbitMQ node becoming degraded, so does the reliability of our platform for all of the hubs in that shard. Is it scalable? Barely. Due to our shard architecture and aging infrastructure, we are quickly approaching the limits of our ability to scale.
Is it maintainable? No. The growing shard size makes deployments much more time consuming and makes changes to infrastructure more risky. Developers have to take time to upgrade RabbitMQ versions and resolve degraded nodes.
So we know that the original architecture can't pass our checklist for the natural growth we've seen over the years. But before we find ways to solve these problems, we also need to take a step back and look at where the IoT industry is heading. We don't want to architect something new only for it to be outgrown in a couple of years. And there are some big initiatives looming on the horizon that could have profound impacts on SmartThings.
First is an industry initiative called Matter. Matter aims to create a unified architecture and protocol for IoT devices that is secure, reliable, and easy to use. This standard will pave the way for a more unified IoT space where users don't have to pick one walled garden for their smart home. Instead, Matter devices can be used across multiple platforms, freeing customers from being locked into one place.
This also means an incredible increase in the number of new IoT devices out in the wild. Some estimate that by 2030 there could be over 5 billion Matter devices. For us, this means that the floodgates are open, as each hub will now potentially have significantly more devices attached to it.
The second is a Samsung initiative called Hub Everywhere. One of the issues today is that you typically need some sort of standalone hub to interact with our ecosystem. This raises the bar for entry, since you need another piece of equipment to use your IoT devices.
Well, Samsung wants to make this easier for users by making some devices act as their own hub. Things like smart TVs and refrigerators can act as a device but also a hub, which is now one less thing that the user needs to worry about setting up. But for us, this means a surge in the number of hubs connected to the platform.
But at the end of the day, we don't fully know how the future will unfold. We know there is going to be growth thanks to Matter and Hub Everywhere, but we don't know how much. We project that SmartThings alone will grow to 500 million users over the next few years. And what other features or initiatives may arise? As our world becomes more conscious about energy usage, could we see a boom in energy-related devices? What about wearable technology like health and fitness monitors? All of these things combined, and probably more: what impact will they have on SmartThings, and in particular the hub connectivity platform? Will we see a 5x increase in traffic? Would we be able to handle a 100x increase in traffic?
So with that in mind, let's go back to our checklist and account for this industry growth on our original architecture. Is it still going to have low latency? Probably not. As we put ever-increasing pressure on our infrastructure, there's going to come a point of diminishing returns. Scaling bigger and wider won't solve our problems, and the snappy performance we see today will get worse over time.
What about reliability? As our shards get bigger and denser, like before, more pressure is going to be applied to those services and infrastructure. On top of this, incidents will be more difficult and time consuming to recover from. Whereas before we might have had to recover 500,000 hubs in a shard, in the future, if we have an incident, there could be 10 million hubs in a given shard. And for scalability and maintainability, it's a similar story: our original architecture isn't scalable or maintainable at existing levels, let alone after Matter and Hub Everywhere roll out.
So we know that our original system is going to struggle with the simple natural growth of our platform. And we definitely know it isn't going to be able to handle these big waves of growth.
So how can we build a new platform that supports this future and adheres to our core principles? How do we take what we've learned over the past 10 years and build something that can grow for the next 10?
Well, we started from scratch and looked at things individually. What kind of latency do we actually want? Ideally, each hop through the event pipeline should take 10 milliseconds or less. Since each user interaction involves numerous touch points in the broader platform, we want to make sure that the hub connectivity part is as fast as possible. Because when you boil everything down, we're basically just a transport layer between the hub and the cloud.
Kent: We also need something that is very fault tolerant, and instead of slowly or partially failing, we want to fail fast. For example, if an availability zone or region is having packet loss or latency, what we want to do is move the hub traffic around that problem. A brief period of hub reconnection is much better than minutes or hours of flickering performance issues. And the new architecture should be frictionless to scale and grow with us. We want the same performance and reliability at 100 million hubs that we had at 100,000 hubs.
Finally, the architecture should be easy to maintain and reason about. We want to spend less time worrying about infrastructure and more time delivering the features that our users are excited about. At the same time, day-to-day operations should be low impact, effortless, and not something that the on-call person dreads. So what changes did we make, and how will those changes help us? For that, we turn it over to Tim.
Tim: Thanks, Kent. I'm Tim F Newman, a senior staff software engineer at SmartThings, working on the team with Kent. So as Kent said, he mentioned where our platform was, he mentioned the pain points, he mentioned where the industry is going, and also the requirements that we set out to achieve. So how did we do that?
Well, the first thing that we did was try to address the issues with those shards, these things that are constantly growing bigger and bigger. But instead of trying to find something even bigger to support them, we went the opposite direction and looked at something smaller. We're breaking these shards apart into smaller cells; we're adopting a cell architecture. We'll still be in that multi-region deployment, but in every single region, we're going to have multiple little cells. Each cell is still going to be our full deployment of services and infrastructure. But something that's different with cells versus shards is that in a cell, we are capping the number of hubs that we allow to connect. So instead of having these things that are constantly growing, we're going to cap them at a fixed number so that we can control the blast radius. If something goes wrong inside of that cell, we're now lowering the number of users that are impacted, and we're also not putting as much strain on that infrastructure.
As Kent mentioned, another problem is that today we can't really move hubs around. Well, we're getting away from that, so that now, with the cell architecture, hubs can move. If an availability zone has a problem or an entire region has a problem, we can drain the connections from that one region and put them into another region, and it's transparent to the user; their stuff still works.
And lastly, because of the way we have designed the cell architecture, if we ever need more capacity, we simply spin up new cells, and we can do that within about an hour. Whereas with the shards, it would take a long time to even think about getting one stood up.
The second thing we did was try to address that messaging layer with RabbitMQ. As Kent said, we've essentially hit the scaling limit of our architecture with RabbitMQ. So we're replacing it with MemoryDB for Redis. Why MemoryDB for Redis? Well, most people think of Redis as a cache, a key-value cache that you put in front of DynamoDB or Cassandra or MySQL. But Redis is actually a very rich in-memory database. It has a bunch of different data structures inside of it.
For example, we're using Redis streams as a replacement for RabbitMQ. We're also treating MemoryDB as our primary source of record for a lot of the hub information, so we're using things like Redis hashes. And then for some of our connection business logic, for figuring out which cell and which DevCon instance a hub will connect to, we're using sorted sets so that we can rank those cells and instances and very quickly find which one has the most available capacity, and then route that hub to it.
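To make those three patterns concrete, here is a minimal sketch in Python using the redis-py client. This is illustrative only: the endpoint, key names, scores, and field layout are hypothetical stand-ins, not SmartThings' actual schema.

```python
import redis

# Connect to the cluster endpoint (MemoryDB requires TLS); hostname is a placeholder.
r = redis.Redis(host="clustercfg.photon.example.memorydb.us-east-1.amazonaws.com",
                port=6379, ssl=True)

# 1. Streams as the messaging layer: appending an entry is the publish.
r.xadd("events:inbound", {"hub_id": "hub-123", "payload": "..."})

# 2. Hashes as the primary source of record for hub metadata.
r.hset("hub:hub-123", mapping={"cell": "cell-7", "devcon": "104"})

# 3. Sorted sets for capacity ranking: score = current connection count,
#    so the lowest-ranked member has the most available capacity.
r.zadd("capacity:cell-7", {"devcon-104": 41250})
least_loaded = r.zrange("capacity:cell-7", 0, 0)
```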
Second, MemoryDB is very fast; it's in-memory, as the name says. In our small-scale testing, when we were pushing over 155,000 messages per second through streams, our mean round-trip time, that is, going from the producer to MemoryDB and on to the consumer, was less than 7 milliseconds, and the p99s were less than 25 milliseconds.
Third, it's massively scalable. As Kent mentioned, our growth over time is going to explode, so we need something that's going to be able to support us not just today but a year or two from now. It also needs to be able to support failover events. Say an entire region fails over: all of the hubs and events going through it move to another region, and MemoryDB needs to be able to suddenly handle 2x the traffic. When we've tested this, we've seen only a very minor impact to performance, if any. It just simply works.
Another thing is that Samsung and SmartThings are really conscious about our costs and our energy footprint, so having something that we can scale up and down throughout the day is important to us. Our workload is somewhat cyclical: if you think about it, we're in the home IoT space, so most people are home in the morning and in the evening. Having something that we can scale down during the day or at night to save money and lower our energy footprint is important for us. And doing that with RabbitMQ in our current architecture, at our scale, has proven to be very difficult, if not impossible, in the past.
Next, as we mentioned before, we need something that's very reliable. If we're going all in on MemoryDB for Redis for all of our event streams, we can't drop those events. If we drop an event, that means you can't open your front door when you get home. With MemoryDB, everything is durable. We know that as soon as MemoryDB acknowledges an event, it's persisted to a transaction log, so even if the node goes down, that data is not lost. Abe will speak more to that in a bit, but we have that guarantee that whatever we put into MemoryDB is written to a transaction log, and we know it's going to be there even if the node goes down. Also, coming from self-hosted RabbitMQ, we know all the pain points of doing that: having to do the OS upgrades, having to do the patch versions for RabbitMQ. We wanted to get away from that, and MemoryDB is a managed service. Provisioning all the hardware is done for us, the patch upgrades are done for us, and if a node goes down and a new one needs to be brought in, that's handled automatically, as are failovers. This frees our engineers from the day-to-day babysitting of the infrastructure and lets us do more things like creating the features and functionality that set us apart from our competitors.
But at the same time, we want something that we can tweak to our needs. As requirements change over time, we might need to add more Redis shards or replicas to meet new demand. Using Terraform or the MemoryDB console, we can easily tweak those things or upgrade the instance size as our demand changes over time.
And lastly, we already have a partnership with Amazon; SmartThings runs entirely in AWS. We are familiar with the infrastructure, the network, the security, all of that sort of stuff, so we know how to interact with Amazon. Unlike if we were to go to a third party, where we would basically have to relearn how to interact with them. Also, the support team for MemoryDB has been great. As we've been proving the system out, we've hit some stumbling blocks, but they've always been there to get quick resolution to our problems. And to talk more about MemoryDB for Redis, I'm gonna turn it over to Abe for a bit.
Abe: Thanks, Tim. So just a brief overview of MemoryDB before we go back and see how SmartThings is using it. Amazon MemoryDB for Redis is a service that launched last year, in August 2021. It's a Redis-compatible, highly durable in-memory database service. It's intended to be used as a primary database in workloads that need really ultrafast performance and high scale, but at the same time need data durability.
Everything that you write into a database, you expect it to be there; you don't expect it to vanish suddenly. MemoryDB is designed to operate in that manner: it is designed for performance and durability. It is an in-memory service, so you get the benefits of in-memory performance: sub-millisecond, even microsecond, reads. Writes are slightly slower: you get single-digit millisecond writes, usually a p50 on the order of 3 to 4 milliseconds and a p95 on the order of 8 to 9 milliseconds. So still relatively fast, but slower than what you would expect from a purely in-memory solution, and we will get to why writes are slower in a bit.
It also offers multi-AZ data durability. Everything you write into MemoryDB is stored across at least three copies in at least two availability zones before it is acknowledged back to the client. It is designed for high scale: you can scale horizontally, adding more shards as your data grows, or you can scale vertically. If you have a workload that is, let's say, extremely IO heavy and CPU intensive, you can use a larger instance to run MemoryDB and get that additional throughput. We support online scaling.
So when you're scaling out, let's say, as Tim was saying, because you anticipate an incoming increase in traffic, you can always scale out and scale back in in an online fashion. There is no impact to your application and service. You can also run MemoryDB in high-availability mode with Redis replicas.
Usually you might run MemoryDB with, let's say, just a primary node per shard, and maybe have multiple shards. But if that primary node goes down, even though there is no data loss, it takes some time to initialize a new node. What you can do instead is have read replicas, or just replicas, let's call them replicas for now. If you create even one more replica per shard, then Amazon MemoryDB will automatically fail over to that replica whenever the primary node goes down, and that gives you high availability.
You can also use those replicas for serving and scaling reads. If you have an application that is extremely read heavy and you want to scale out the number of reads you can perform, use those read replicas.
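As a sketch of what that looks like from the client side, redis-py's cluster client can route reads to replicas; the endpoint below is a placeholder, and this is a minimal illustration rather than a recommended production configuration.

```python
from redis.cluster import RedisCluster

# read_from_replicas lets the client send reads to replica nodes,
# spreading read load across each shard; writes still go to the primary.
rc = RedisCluster(
    host="clustercfg.example.memorydb.us-east-1.amazonaws.com",
    port=6379,
    ssl=True,
    read_from_replicas=True,
)

rc.set("a", 100)      # write: always served by the shard's primary
value = rc.get("a")   # read: may be served by a replica (eventually consistent)
```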
We've been busy the last year developing some features for MemoryDB that I wanted to highlight. The most interesting one is what we call data tiering. If you're not familiar with it: customers have told us that they don't necessarily want to store all data in memory at all times. Imagine a use case like an ecommerce application where not all your users are logged in at the same time; only a subset of users or data needs to be hot at any given time.
So with MemoryDB, you can use what are called r6gd instance types. In r6gd, the d stands for attached disk: they have locally attached NVMe-based SSD disks, and MemoryDB utilizes the total storage space available on those nodes to store your data, memory as well as disk. As soon as the memory is full, it spills data over to the disk. So you get expanded storage at a lower cost, because SSDs are cheaper than RAM; you essentially get a different price-performance option here. And we've seen customers save over 60% versus keeping all of that data in memory only.
We are HIPAA, SOC, and otherwise compliant, so if you have a use case that needs to store sensitive data, you can use MemoryDB. And lastly, we support automatic failovers. It's a fully managed service, so we manage hardware changes, software upgrades, patching, failovers, node replacements, and all of that good stuff.
So let's go a little bit deeper. I'm sure you're curious how MemoryDB actually achieves durability. Here's a view of what a normal MemoryDB for Redis cluster might look like: you have a primary node in, let's say, availability zone A, and in this case I'm showing you an example with two replicas, one each in zone B and zone C. And let's say you have a Redis client sending a request that sets the value of A to 100; I'm taking a simple key-value example, set the value of A to 100.
In traditional Redis, what happens is that the primary node will write that value, A equals 100, in memory and respond back with an acknowledgement immediately. That's how you get that sub-millisecond, microsecond write performance. But you are at risk of data loss here, because if that data is not yet replicated to the replicas and the primary node goes down, that data is lost. And especially if you're writing at really high throughput, the amount of data loss can grow.
So what we did was decide to use a separate multi-AZ transaction log system to store data and persist it so that it is not lost. This multi-AZ transaction log is a proprietary technology that we built, and it is also used by a lot of other internal AWS and Amazon systems that need durability, like S3; even Amazon.com ordering data is persisted on this system. So we're using that same system.
In this case, what happens now is that when that set A equals 100 comes into the primary, it's written to memory, but at the same time it's also written synchronously to the multi-AZ transaction log, and only when that log responds back with an ack do we respond back with an acknowledgement to the client. This is why writes are slower: they take, as I was saying, a p50 of up to 3 to 4 milliseconds and a p95 of about 8 to 9 milliseconds. But data is written durably.
And because the multi-AZ transaction log is separate from the cluster itself, the MemoryDB nodes themselves, you get the same durability even if you were running MemoryDB on a single node. You could be running MemoryDB on a single primary node, and all data will still be stored durably across at least two availability zones and three copies. In this case, though, I'm showing you an example with two replicas, and the data is replicated into the replicas' memory asynchronously.
Those replicas read the data from the multi-AZ transaction log and asynchronously replicate it into their memory, and then it's available for reads. Because this is asynchronous, if you are reading this data from the replicas, you should expect eventual consistency between what you've written and what you're reading.
So this is essentially how we achieve zero data loss. Even if the entire cluster is lost, we are still able to recover your data from the multi-AZ transaction log. Anything you write is persisted and is not lost.
So with that, I'm gonna turn it back to Tim, who's gonna walk us through the architecture and how they're using MemoryDB.
Tim: Ok. Now on to everybody's favorite part, where we get to show some pretty pictures of the inner workings of SmartThings.
So as I said before, there are two main flows that we deal with: inbound traffic, or ingress, which is messages from the hubs up to the cloud, and outbound, or egress, which is things coming from the cloud down to the edge.
We'll look at the ingress first. What you're looking at is a representation of all the services and infrastructure in a single cell. As you can see in the middle here, in blue, we have a MemoryDB cluster that we call Photon. We got creative and gave all of our MemoryDB clusters elementary particle names. So we got creative for once.
In the bottom left, you'll see that gray square; that is what we're representing as a hub, and attached to it, in green, there is a light bulb. So in our case, let's say that light turns on. That light is going to notify the hub it's attached to: hey, my state has changed, I am now on. That hub is then going to create a Smart Command, which Kent mentioned earlier, and send it over the persistent TCP connection to the DevCon instance that it is connected to. In this case, it's connected to instance 101.
So that Smart Command gets sent over and is received by DevCon. DevCon immediately publishes it to the Photon cluster. We want to get everything off DevCon as fast as possible and into that durable MemoryDB cluster, so that if something does happen and DevCon goes down, we still have that event and we can still process it.
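A hedged sketch of that hand-off: DevCon appends the still-encoded frame to a stream and moves on, so the event is durable before any decoding happens. The stream name, fields, and handler are illustrative assumptions, not the actual DevCon code.

```python
import time
import redis

photon = redis.Redis(host="photon.example.amazonaws.com", port=6379, ssl=True)

def on_smart_command(hub_id: str, frame: bytes) -> None:
    """Hypothetical DevCon handler for a frame arriving on a hub's TCP socket.

    The event is appended to the stream without being decoded, so it is
    durable in MemoryDB as quickly as possible; Prism decodes it later.
    """
    photon.xadd("ingress", {
        "hub_id": hub_id,
        "frame": frame,                         # opaque Smart Command bytes
        "received_at": int(time.time() * 1000), # arrival timestamp in ms
    })
```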
So DevCon is just shoving all these events, like a fire hose, into MemoryDB. Then we have another service called Prism. Prism reads from that fire hose of events and partially decodes every single message to figure out what type of message it is. In this case, it's a device state change, but it could be a device join type, or some sort of inventory message.
So it reads what type of message it is and then republishes it onto a different Redis stream that is unique to that event type. There will be a stream for device joins, one for device state changes, one for those inventories, and so on and so forth.
Then we have another collection of nano services called Wavelength that consume from each of those individual streams and do the actual processing. If it's a state change, that might mean just writing it into the database, or it might mean calling out to some other service on the platform.
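Here is a rough sketch of that two-stage pipeline using Redis consumer groups. The stream and group names and the partial-decode helper are hypothetical; the real Prism and Wavelength services are certainly more involved.

```python
import redis

r = redis.Redis(host="photon.example.amazonaws.com", port=6379, ssl=True)

def peek_type(frame: bytes) -> str:
    """Hypothetical partial decode: read just enough to classify the message."""
    return "device-state"

# Prism: consume the firehose as a group, republish onto per-type streams.
try:
    r.xgroup_create("ingress", "prism", id="0", mkstream=True)
except redis.ResponseError:
    pass  # group already exists

while True:
    batches = r.xreadgroup("prism", "prism-1", {"ingress": ">"},
                           count=100, block=5000)
    for _stream, entries in batches or []:
        for msg_id, fields in entries:
            event_type = peek_type(fields[b"frame"])
            r.xadd(f"events:{event_type}", fields)  # e.g. events:device-state
            r.xack("ingress", "prism", msg_id)
# A Wavelength nano service would run the same consumer-group loop against
# one per-type stream (e.g. "events:device-state") and do the real processing.
```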
Originally, Wavelength was a single monolithic service, but we decided to break it out to avoid the noisy neighbor problem we were suffering from. Eventually that request will bounce around through some more SmartThings services and show up in your phone, like we saw in that demo earlier: we turn on the light, and then we see in the app that the state has changed.
The opposite flow of this is the outbound. Let's say you're coming home and you want to unlock your front door. Again, you're in the app, and you push the button that says unlock my door. It's going to bounce around some services until it hits a service called Pit Boss.
Pit Boss verifies that: yep, you can do this thing, I have the information I need, I know what hub it is, I know what lock you want to unlock. And then it's going to create a Smart Command, that language the hub knows how to speak.
Once Pit Boss has created that, it sends it to a service that we call Dispatcher. What Dispatcher does is figure out where that hub is located on our platform. Remember, we have multiple cells across the globe, so it's Dispatcher's job to figure out where in the world, literally, that hub is connected. If it is in a different cell than the one that received the request, we need to do cross-cell routing.
So we'll send that to the cell that it's connected to. Once it gets to the right cell, it looks into another MemoryDB cluster called Higgs. Higgs is our source of truth for where all the hubs are located in a given cell.
We have another service called Switchboard that acts as our controller and our router for determining where hubs connect to the platform. Switchboard is what uses those sorted sets we mentioned earlier, determining which cell has the most available capacity, then which DevCon instance has the most available capacity.
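The ranking itself maps naturally onto sorted-set operations. Below is a simplified sketch, with hypothetical key names, of how a placement decision could look; a real implementation would need a Lua script or transaction to make the read-and-increment atomic under concurrency.

```python
import redis

higgs = redis.Redis(host="higgs.example.amazonaws.com", port=6379, ssl=True)

def place_hub(hub_id: str) -> str:
    """Pick the least-loaded cell, then its least-loaded DevCon instance."""
    # Score = current connection count, so index 0 has the most headroom.
    cell = higgs.zrange("capacity:cells", 0, 0)[0].decode()
    instance = higgs.zrange(f"capacity:{cell}", 0, 0)[0].decode()

    # Record the new connection and remember where this hub now lives.
    higgs.zincrby("capacity:cells", 1, cell)
    higgs.zincrby(f"capacity:{cell}", 1, instance)
    higgs.hset(f"hub:{hub_id}", mapping={"cell": cell, "devcon": instance})
    return instance
```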
So all that placement is managed by Switchboard and persisted into that Higgs MemoryDB cluster. Now Dispatcher knows which instance the hub is connected to, and it's going to publish that message onto a Redis stream again, this time in the Muon cluster.
That stream is specific to that individual DevCon instance. Here we see that the gray square, our hub, is connected to instance 104, so we're going to publish to a stream called outbound 104.
The DevCon instance reads from that stream and sends the message down that persistent TCP connection to the hub. The hub then translates that Smart Command into a native IoT protocol like Zigbee or Z-Wave that the IoT device, in our case a lock, understands, sends it over, and voila, your door is now unlocked.
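Putting the egress pieces together, here is a hedged sketch of the Dispatcher step: look up the hub's placement in Higgs, then publish to the per-instance stream in Muon. The endpoints, the exact stream key format, and the field names are again illustrative assumptions.

```python
import redis

higgs = redis.Redis(host="higgs.example.amazonaws.com", port=6379, ssl=True)
muon = redis.Redis(host="muon.example.amazonaws.com", port=6379, ssl=True)

def dispatch(hub_id: str, smart_command: bytes) -> None:
    """Route an outbound Smart Command to the DevCon instance holding the hub."""
    placement = higgs.hgetall(f"hub:{hub_id}")  # source of truth for location
    instance = placement[b"devcon"].decode()    # e.g. "104"

    # Only DevCon instance 104 consumes "outbound:104"; it writes the frame
    # down the hub's persistent TCP connection.
    muon.xadd(f"outbound:{instance}", {"hub_id": hub_id, "frame": smart_command})
```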
So with all this new architecture in mind, let's go back to those core principles and see how we fared.
How is our latency? We're good now. Even at 155,000 messages per second, we're still getting single-digit millisecond latency through the system. And depending on the type of event, we actually see the entire span from DevCon up to the Wavelength service taking just a little over 10 milliseconds.
Do we have reliability? Yes. With MemoryDB, all of our data is persisted; as soon as we get it off DevCon, everything is persisted. If something crashes, we still have that event. But also, because of our cell architecture, if there's a problem in an availability zone, we can easily shift traffic away from it, and again, it's transparent to the user.
So from the user's perspective, reliability is great, even though we know us-east-1 is on fire again. Is it scalable? Yes. Using those cells, if we ever need more scalability, we just increase the number of cells. And we've also seen that MemoryDB can scale very far and can handle sudden swings in load.
We have the consistent natural growth, but we also need something that can handle very quick swings in load and traffic. Say there's a product release that comes out and all of a sudden it's sending twice as many events to us; we need to support that without failing, and with MemoryDB we can see that that is true.
Is it maintainable? Again, yes. Now that we're using a managed service, a lot of the maintainability of our data layer is now managed for us, so we don't have to deal with that anymore. And at the same time, because we're moving to those cells and hubs can move around, if we need to do some sort of maintenance in one region or one cell, we can evacuate that area, push all the traffic somewhere else, make our modifications, and then balance everything back out.
So we can make modifications to production infrastructure without having to risk impact to our customers. To tie everything back together real quick, let's return to that original demo.
On the left is that same thing you saw earlier. But on the bottom right, you're going to see three little boxes. These are actual terminals that are just tailing logs: the bottom is DevCon, the middle is that Prism service, and the top is Wavelength. As this plays, you'll see that as we turn on the light, those events pop up almost instantaneously.
There is some nuance here: they might not show up at exactly the same time, but that's the nature of asynchronous programming and also asynchronous logging. So there is a little difference. But if you were to look really closely at these logs, you'd see the timestamps are just boom, boom, boom.
So hopefully this has given you a better idea of what it looks like from the user's perspective as well as what's happening behind the scenes with MemoryDB. But as with most things in life, there's always room for improvement, and this project isn't any different.
One of the things that we want to explore is some sort of compression or serialization for all of our events. We're shoving a lot of traffic into MemoryDB, and the great thing with MemoryDB is that you don't get charged for the reads, but you do get charged for the writes, based on the aggregate amount you're putting in.
So you get charged per gigabyte written. Right now, we're just using uncompressed JSON strings, but as more traffic comes onto our platform, we need to start thinking about more novel ways to minimize the amount of physical data we're shoving into MemoryDB. One of the ways we can do that is by looking at something like Protobufs or some other tightly packed byte format that we can send in. This is going to become more and more important for us next year as these big initiatives actually come online.
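As a rough illustration of the direction (not the format SmartThings has settled on), here's what shrinking a JSON event before the write could look like; zlib stands in for the tighter schema-based encoding, like Protobuf, that Tim mentions, and the event fields are hypothetical.

```python
import json
import zlib
import redis

r = redis.Redis(host="photon.example.amazonaws.com", port=6379, ssl=True)

event = {"hub_id": "hub-123", "capability": "switch", "value": "on"}

raw = json.dumps(event).encode()
packed = zlib.compress(raw)          # a schema-based format would shrink it more
print(len(raw), "->", len(packed))   # fewer bytes written = lower write charges

# Store the compressed payload; the "enc" field tells consumers how to decode.
r.xadd("ingress", {"body": packed, "enc": "zlib"})
```

For payloads this small, a general-purpose compressor may not save much, which is part of why a schema-based format like Protobuf is the more interesting option here.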
The other thing is that in the current version of MemoryDB, I believe it's 6.2, one of the things we need to do is know when a consumer goes offline, because we need to do some cleanup operations.
When a Redis stream consumer goes away, we need to make sure there are no pending messages that didn't get processed by it. For example, say a consumer got a message but then suddenly crashed. If that happened, we need to make sure that message gets rerouted to a healthy consumer.
In the version of Redis that MemoryDB is currently on, there is the notion of an idle consumer time, but it's a little misleading: instead of being the last time a consumer interacted with the server, it's only the last time that consumer actually processed something.
So if you have a very low throughput stream, with only one message every 10 minutes or so, that idle time is going to be 10 minutes. But that's not really an accurate representation of when that consumer last interacted.
In Redis 7, they actually changed this. Now they make a distinction between two things: the idle time, which is when the consumer last read something from the stream, and a new field that is the last time the consumer actually interacted with the server.
Once MemoryDB gets that version, we'll be able to use that field and stop doing some of the hacky things where we have to manually track this; we can use it instead to know when those consumers are unhealthy.
Therefore, we know when we can act: if you're familiar with Redis streams, it's the pending entries list that we can prune, moving those entries to a healthy consumer.
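In the meantime, the recovery pattern looks roughly like the following sketch: periodically scan the pending entries list and claim anything that has sat idle too long. XAUTOCLAIM exists in Redis 6.2, so this works on the current MemoryDB version; the stream, group, and handler names are hypothetical.

```python
import redis

r = redis.Redis(host="photon.example.amazonaws.com", port=6379, ssl=True)

def process(fields: dict) -> None:
    """Hypothetical handler for a reclaimed event."""

def reclaim_orphans(stream: str, group: str, consumer: str) -> None:
    """Claim entries stuck in the pending entries list of a dead consumer."""
    cursor = "0-0"
    while True:
        # Transfer ownership of anything pending for more than 60 seconds.
        result = r.xautoclaim(stream, group, consumer,
                              min_idle_time=60_000, start_id=cursor, count=100)
        cursor, claimed = result[0], result[1]
        for msg_id, fields in claimed:
            process(fields)                # finish what the crashed consumer started
            r.xack(stream, group, msg_id)
        if not claimed:
            break

reclaim_orphans("events:device-state", "wavelength", "wavelength-1")
```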
So there you have it. That's how SmartThings, at a very high level, is using MemoryDB to power that home automation connectivity layer.
So I'm gonna turn it back over to Abe to close this out real quick.
Abe: Thank you, Tim. So hopefully that was useful, and you learned about the inner workings of the IoT platform at SmartThings.
Here are some helpful links if you want to get in touch with the MemoryDB team or learn more about Samsung SmartThings and Matter. These will stay on the screen for a bit, so please feel free to scan the QR codes for those links.
Unfortunately, we do not have a mic in the audience, so we won't be able to take live questions. But we have about 13 minutes left, so we'll wrap up early and hang out. If you have questions about the decisions the SmartThings team made or about MemoryDB, please feel free to come up here; we are happy to answer them. And yeah, thank you.