Backup and disaster recovery strategies for increased resilience

So today we're going to talk about backup and disaster recovery strategies for increased resilience. And we're going to start with resilience requirements.

The requirements are key. Before we start talking about strategies, we first need to identify what we're trying to achieve and what our requirements are for data protection, resilience, disaster recovery, and backup, and make sure that we all speak the same language so we can define the metrics and KPIs that we want to meet.

We'll talk about the strategies and dive deep into disaster recovery and backup. We'll go deep into some architectural diagrams and go over the workflows. And then we're going to talk about ransomware recovery specifically and the mitigations that we want to have for ransomware.

So what do we need in order to succeed? What does each and every business need to define for its own resilience requirements? We're going to touch on several key aspects of resilience and data protection requirements, and let's start with recovery objectives.

So recovery objectives: RPO and RTO are among the more common concepts for recovery objectives and requirements for resilience. But let's go over them and make sure that we understand what they mean.

I'll start with RPO - Recovery Point Objective. That basically identifies the amount of data that I can afford to lose in case of an event, whether it's an infrastructure failure, a ransomware attack, as we've mentioned before, data corruption, human error, etc.

So for example, if I'm taking daily snapshots of my data, I can potentially lose a full day of data when I need to recover. So we'll talk about this requirement and how we can achieve various recovery point objectives with the different strategies.

And we have the RTO - Recovery Time Objective, which basically means how much downtime can I afford to have when I have an event? So how long will it take me to get my business back up and running even if I lost some data? How long will I be down until I'm recovered and can work?

So RPO and RTO are extremely important. And it's not one-size-fits-all. It varies by the business, obviously, but also per system or application, and at times even per resource. So you'll need to define your recovery objectives for the different workloads.

Next, we're going to talk about the deployment pattern. And that obviously varies per business. Do I still have workloads running on premises which I need to recover? Do I need to go multi-cloud? Do I need disaster recovery strategies between clouds, or do I have all my workloads on AWS?

And then also I'll need to define my strategy. Do I need cross region recovery or cross availability zone? And that is also a key concept that we'll need to discuss and define before we choose the right solution.

So next, we're going to determine our backup policies, starting with retention. How long do we need to retain our data when we back it up? Do we need a week? Two weeks? And what is the frequency?

And these are driven by multiple requirements - they can come from compliance or legal requirements depending on the vertical, or they can come from your data protection requirements. So I need to know how long I need to retain, and at what frequency.

And we see customers using multiple policies for different workloads, as we mentioned for RPO, also for backup. Do I need cross-region or in-region restoration? Do I need object lock, meaning do I need immutability for my snapshots? We'll talk more about that when we dive into ransomware and account topology.
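To make the frequency and retention knobs concrete, here is a minimal sketch of a daily, two-week policy using AWS Backup via boto3. The plan and vault names are hypothetical placeholders, and the vault is assumed to already exist:

```python
import boto3

backup = boto3.client("backup")

# Hypothetical policy: one daily backup at 05:00 UTC, retained for 14 days.
backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "daily-14d-example",  # hypothetical name
        "Rules": [
            {
                "RuleName": "daily",
                "TargetBackupVaultName": "my-vault",        # assumed to exist
                "ScheduleExpression": "cron(0 5 ? * * *)",  # the frequency knob
                "Lifecycle": {"DeleteAfterDays": 14},       # the retention knob
            }
        ],
    }
)
```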

What is the topology of the accounts across my backups, my disaster recovery, and my production workloads going to look like? And again, ransomware is a key driver here, but there are others; we'll talk about that as well.

Next, I need to identify what I am protecting, not just what I need to achieve, in order to choose the right solution for the right resources. We have various options across the different storage services, so I need to know what I'm protecting: which storage services, which compute services, and which databases.

Each of those will have its own unique constraints and unique options that we can choose from, and we need to tie everything into our wider strategy.

So before we dive into some high level strategies and options, let's talk a little bit about the RPO and RTO and take a step back to what we call the old way of doing disaster recovery.

So if I look at the disaster recovery patterns on premises: I have my primary site, my primary data center, and I have hardware that needs to run production workloads. So it has performant storage, it has performant CPUs; everything is fully provisioned to run my production workloads.

Now, the common practice was to have a secondary recovery site on premises as well. So I need another data center and I need to enable some form of replication between the two sites at the storage level or the application level or both.

But my recovery site needs to run my production workloads in case of a failover. So it has to be fully provisioned: I need the same performant storage and the same performant CPU and compute hardware on my recovery site.

So basically, I double my cost. Now, most ongoing replication solutions do not require fully provisioned resources. They replicate data; they don't do complex compute, and you don't read the data that much.

So basically, the old way was: duplicate your spend and pay for everything up front, because someday you might need it, even though you're not using it now; it's just sitting out there gathering dust.

So why use the cloud? Why use AWS for disaster recovery? In a single word: elasticity.

It's the model that basically states you pay for what you use, when you use it. You're not paying in advance just because you're going to have to use it someday.

And if you look at that paradigm of what we call the new way of disaster recovery, whether your production is running on premises or in the cloud on AWS, whether it's cross-region or cross-availability-zone, I still need to have my primary site fully provisioned; it's running my production workloads.

But my recovery site on AWS does not need to be fully provisioned. It needs to run only the very minimal resources that are required to support ongoing replication.

So on an ongoing basis, I'm cutting my cost down drastically to the bare minimum of what I need to support my ongoing replication. And only when I need to either run a drill or fail over to my recovery site on AWS will I provision full production resources and infrastructure.

So if I'm replicating the entire year and I'm running a one-day DR drill, I'll pay for full production resources only for that one day.
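To make that concrete, here is a back-of-the-envelope sketch. The daily prices are invented placeholders purely for illustration, not actual AWS rates:

```python
# Hypothetical numbers for illustration; real costs depend on instance
# types, storage, and region.
prod_cost_per_day = 1000.0       # fully provisioned production fleet
replication_cost_per_day = 50.0  # lightweight replication resources
drill_days_per_year = 1

old_way = prod_cost_per_day * 365                 # duplicate everything, all year
new_way = (replication_cost_per_day * 365
           + prod_cost_per_day * drill_days_per_year)

print(f"old way: {old_way:,.0f}, new way: {new_way:,.0f}")
# old way: 365,000, new way: 19,250
```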

So let's talk about data resilience options in the cloud as a whole. So we understand what disaster recovery in the cloud means and that it changes and breaks the old paradigm.

Once we go into the architectural diagrams, we'll dive deep into those areas and see how it works, where the magic is. But for now, let's take the bird's-eye view and look at the data resilience options in the cloud.

So first of all, as you can see, we're looking at a spectrum. Why do we have multiple options? As always, the answer is because there is a trade-off, a key trade-off, in those options.

Now, there are actually multiple trade-offs: complexity, feature sets, simplicity of reporting, monitoring, metrics, etc. There are a lot of trade-offs when you choose a solution. But I'll focus on the simplest and the main driver that we see, and that's the trade-off between the recovery objectives that we've discussed, RPO and RTO, and cost.

Very simply, if I want more aggressive and lower recovery objectives, I'll need to pay more for the infrastructure.

So let's look at the options. On the very far edge of that spectrum, we have backup and restore. Our goal there will be to cut costs as much as possible, and it's OK to get RPOs sometimes even close to a day, and potentially hours of RTO.

Then I'll jump over to the other far edge: active-active, or high-availability, solutions. For those, we have RPOs of seconds and RTOs that are near real time, meaning something more similar to what we saw in the on-premises world: I have fully provisioned resources, but only for the extreme cases where I need a very, very low, near-real-time recovery time objective and can't suffer any downtime.

In that case, I'll have to have basically fully provisioned infrastructure both in my primary region and in my recovery availability zone or region. But then we have pilot light and warm standby, which are somewhere in between.

And again, looking at the trade-off: I can have an RTO and RPO of minutes with warm standby and pay a little less than HA, or an RPO of minutes and an RTO of hours with the pilot light paradigm and pay even less.

So we have that paradigm, and it's not a new paradigm. But with Elastic Disaster Recovery, and in AWS in general, we want to challenge those paradigms and achieve better, meaning we want to offer the best recovery objectives we can at the lowest cost we can.

So if I look at Elastic Disaster Recovery, you can see that we have a nice mix. We'll have an RPO of seconds, the RPO of active-active or high availability; that's the amount of data that we can afford to lose. We hardly lose any data, because replication is always in very, very small increments. We'll see how that works.

So my RPO is seconds. My RTO is minutes, because I have some resources that are ready to be provisioned, but they still have to do some work on recovery. And the cost goes all the way down to pilot light levels. We'll see how that works.

So one of our main missions is to challenge these paradigms and provide more cost-effective solutions that deliver more aggressive recovery objectives.

So let's talk about some of the key benefits of Elastic Disaster Recovery, and then we'll dive into it. We'll see exactly how it works, how we can achieve more aggressive recovery objectives with lower costs, and then go into the architectural diagram and see the evidence.

So the first and key element is fast recovery. Circling back to the cost: we want an RTO of minutes with lower cost. We don't want an RTO of hours.

So that means that we have a very aggressive RTO and a very cost-effective one, and testing is easy. When we look at strategies, having the bottom line of data protection, lower cost, and aggressive recovery objectives is nice. But how do I really know that I can count on my disaster recovery solution and on my ability to recover if I need to?

And that boils down to testing. We need to have, first of all, non-disruptive DR drills, which means that I can test as much as I want without impacting my ongoing replication or my recovery posture, and obviously with zero impact on my production environment.

And I want those drills to be extensive, meaning I can choose and test the wide spectrum of recovery functionality. So I want to be able to test failover, and we just launched the ability to plug application-level verifications into the DR drill.

So the DR drill will provision all of the resources and do a failover with zero impact on the source environment. But there are some predefined verifications at the application level: it's not enough that the infrastructure is up and running; I want to see that my database is working, that various applications are running, that an endpoint is reachable, etc.

So you'll be able to plug in multiple automatic verifications so that the DR drill will be simple and you'll be able to see the feedback and do it as frequently as you want.

It also doesn't cost that much, because you only pay for the provisioned resources for the short time of the DR drill.
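As a sketch of how such a drill can be kicked off programmatically, here is the Elastic Disaster Recovery StartRecovery call invoked in drill mode via boto3; the source server ID is a hypothetical placeholder:

```python
import boto3

drs = boto3.client("drs")

# Launch a DR drill: provisions recovery instances from the latest
# point-in-time data without touching the source environment.
job = drs.start_recovery(
    isDrill=True,  # a drill, not an actual failover
    sourceServers=[{"sourceServerID": "s-1234567890abcdef0"}],  # hypothetical ID
)
print(job["job"]["jobID"], job["job"]["status"])
```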

We talked about lower cost; we'll see how that works. Ransomware recovery is a big topic, and we'll talk about it in the context of backup and disaster recovery, the differences between them, and some of the best practices for data protection.

As we said, we have very low RPOs.

We want to lose as little data as we can. And obviously we want point-in-time recovery in case of data corruption, ransomware, et cetera.

So let's have a quick run over the life cycle, and then we'll dive into how it works.

So first of all, you start by installing an agent on your source servers. If they're running on premises, you install the agent, which will do the actual replication. If they're running on EC2, we can enable it from the console today, or from the APIs, so you don't need to install the agents manually. That will start the replication automatically and provision the resources that are required for replication.

So there's no overhead in terms of managing or provisioning resources for replication. And then you can test: you can see that you can recover, that you can do a failover, that you can do a failback to your original infrastructure. Once again, it's non-disruptive, you can tie in your application-level testing, and so on.

And then we get to the operation phase. Disaster recovery, as in replication, is an ongoing thing. So replication will be ongoing, and you can run your periodic drills and tie in your alerts. If anything goes wrong, if you block a port used for the replication for some reason, you'll get the alerts, so you'll know that you're always protected.
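One way to script that kind of health check, as a rough sketch: poll each source server's replication state and flag anything that isn't syncing or continuously replicating (a blocked port typically surfaces as a stalled state). The alerting hook is left as a placeholder:

```python
import boto3

drs = boto3.client("drs")

# Walk all source servers and flag any whose replication is not healthy.
paginator = drs.get_paginator("describe_source_servers")
for page in paginator.paginate():
    for server in page["items"]:
        state = server.get("dataReplicationInfo", {}).get("dataReplicationState")
        if state not in ("CONTINUOUS", "INITIAL_SYNC"):
            # Hook in your alerting (SNS, PagerDuty, ...) here.
            print(f"ALERT: {server['sourceServerID']} replication state: {state}")
```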

And then, in case of an event, you can fail over, again cross-region, cross-AZ, or on-prem to cloud, and fail back when needed. Failing back will take all of the deltas, the work that you've done while failed over, and replicate it back to your primary site.

So now we're going to dive into how it works, and we'll see the details. We'll see how we can achieve lower costs while still keeping such aggressive recovery time objectives.

I'll start with the example of on-premises data centers to AWS. On the left side, you can see your data center; you have your servers and your VMs running on premises. In this example, we'll use AWS as the recovery site. And by the way, we see this as a very common migration pattern: customers decide to exit their recovery site or recovery data center first, starting by using AWS as the recovery site, and then, when they want to complete the migration, they can simply fail over and never fail back.

So once we install the agents on the source servers, we start the replication, and the replication is continuous, asynchronous, and at the block level. What that means is, first of all, that because of the asynchronous nature of the replication, it has a minimal footprint and impact on the source server; it doesn't interfere with ongoing writes. That's key, it's extremely important. But it also replicates in very small chunks on an ongoing basis, which means that we can keep RPOs minimal, actually seconds, or usually even sub-second.

And that means that the ongoing replication, which is obviously compressed and encrypted to save on networking costs, brings AWS into a state where your latest data is always there. In order to do that, we need to provision resources on AWS to receive the data that we replicate.

So on the right-hand side, you can see your AWS account. In your AWS account, DRS, Elastic Disaster Recovery, will provision all of the resources that are required for replication. That is the staging area. You can use different subnets for your ongoing replication and for your recovery; you can also choose different VPCs, and you can even choose different accounts. We'll see when and why we'd want to do that.

But first, to understand the basic building blocks: we have our staging area, where DRS will provision low-cost EC2 instances to be the target of the replication. Those EC2 instances can also act as one-to-many, so for multiple source VMs we can use a single lightweight EC2 instance to handle the replication. And we provision an EBS volume for replication per disk on the source.

Now, the volume type is dynamic. If you think about replication, the interesting bit is that it almost never reads. You need to write data, and you need to write a lot of it, but you don't need to read that much. And when you don't need to read, you're not sensitive to hydration, meaning you can change volume types dynamically and get exactly the performance that you need.

So DRS will automatically toggle between different volume types depending on the rate of data change on the source, to keep costs as low as possible at all times. But the data is still there, so once we need to recover, we have the latest data, even though in the ongoing state we've cut costs drastically.

And then we have the recovery subnet, which again can be in a different VPC or a different account. Here, based on EC2 launch templates or our smart right-sizing mechanisms and framework, DRS will take the latest data and, in case of a recovery or a DR drill, run it in your target VPC, account, or subnet, fully provisioned to what you need.

So if you fail over and run for an hour, you'll get fully provisioned, high-performance volumes and high-performance instances, fully provisioned for production. Now, if you need to fail back, we support failing back to on premises, replicating all the data that was generated while you were failed over to AWS. We have a failback client that can run on premises and pull all of those deltas, so you can fail back to your on-premises environment.

So let's talk about cross-region DR and the differences between on-prem-to-cloud and cross-region. I'll get to the details of the differences in the architectural diagram soon, but let's start with a key benefit of cross-region, and that is network configuration replication and recovery.

We'll see that over the diagram. First of all, on the left-hand side, I have one AWS region; it runs my production. We install the DRS agent. The installation process itself uses EC2 instance profiles: you don't need credentials, you can do it from the console, and you don't need to log into the instance.

So the experience is really nice, but the underlying infrastructure is the same: we do continuous, asynchronous, block-level replication to your other AWS region or availability zone. But on top of the data, there are the configurations. Let's say that I ran the DR drill and it was successful, and we tied in all of the application-level testing that we wanted to.

How do I know that my VPC is set up properly the next time I need to recover? Maybe my security groups have changed and I didn't update them on my recovery site. Maybe some ACLs or routes changed, or the subnets or topology look a bit different.

So in order to mitigate that, DRS will replicate all of those VPC settings as well as the data itself. If I change a security group in my primary region, that change will be replicated to my recovery region as well. So when I run my periodic drill, and when I need to fail over, I know that it's going to be in a working state.

So as we said, security groups, ACLs, route tables, subnets, etc. are replicated along with the actual data, as well as some metadata of the EC2 instances themselves.

And when we need to fail back cross-region, we don't need the failback client, right? Because we're not failing all the way back to an on-premises data center, it's pretty symmetric: I can just invoke an API, or go to the console, and change the direction of the replication.

And we see some customers doing a kind of summer-home, winter-home strategy: periodically they simply fail over from one region or one availability zone to another and just stay there for a few months. That way they know that they can fail over periodically, and they feel ready for any event in case they need to recover from an outage.

So let's summarize some of the key benefits that we've seen so far, and why that strategy of asymmetric replication is so strong.

If you look at the asymmetry of that replication: I have fully provisioned resources on one side and very lightly provisioned resources on the other side, but I'm replicating in small chunks; I'm not taking daily snapshots. We'll talk about that as well, because sometimes we do want to take snapshots over longer periods at a lower frequency.

But for low-cost disaster recovery, we want that asymmetry as a key point, along with those very small chunks of data that we replicate continuously. That lowers cost while keeping my RPOs aggressive. Testing is a key element: you need to test your solution, and you need to test it often.

Now, even if for some compliance requirement you're only doing an annual disaster recovery drill, we want to encourage customers to test more and more often. And in order to test often, we need to remove and reduce the friction of running a drill, both on cost, it shouldn't be costly, and on operational overhead: it should be easy, it should be simple, and it should have zero impact on your production. And we see customers starting to run those drills much more often than in the past because of the improvements over the past months and years.

So now let's talk about backup and restore. We've talked about disaster recovery, where we want to keep our RPOs as low as possible, but that still comes with a cost. We want to optimize that cost as much as possible and lower it, but it still requires those lightweight EC2 instances and those low-cost EBS volumes, which still add some cost.

Now, what about my less critical applications and resources? Let's talk about backup and restore. Usually when we say backup, we're looking at, as we said, hours of RTO and hours of RPO; most commonly, we see daily backups as the main frequency. On the other hand, retention is longer.

So let's go over an example of how we're going to use backup and restore to recover from an event. In this example, I have my workload running on AWS in a region, and you can see an EC2 instance with an EBS volume and an RDS instance, all running in a single availability zone. We also have an EFS file system that is connected to that EC2 instance.

Now I'll enable my backup policy and run those backups: EBS snapshots, RDS snapshots, and AWS Backup to back up my EFS. A key element here is that all of these backups are stored regionally, not zonally, right?

So they're not sensitive to what happens in that availability zone. Now let's say that I have an outage. In this example, I'm looking at an AZ outage, but it can also be data corruption; it can be something that is very specific to your environment or to a specific resource. It can be that a specific resource has malfunctioned.

We always want to be prepared for faults. But in this case, I have that AZ not functioning. Recall that my backups are regional, not zonal. So now I can restore from my EBS snapshots, create an EC2 instance that mounts those restored volumes, connect my EC2 instance to my EFS, which again was not impacted, and restore my RDS.

And all of this is done in a different availability zone.
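As a minimal sketch of the EBS piece of that restore (the snapshot, AZ, and instance IDs are hypothetical), materializing a regional snapshot as a volume in a healthy AZ and attaching it looks roughly like this:

```python
import boto3

ec2 = boto3.client("ec2")

# Snapshots are regional, so we can materialize a volume in any healthy AZ.
volume = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",  # hypothetical snapshot ID
    AvailabilityZone="us-east-1b",        # a different, healthy AZ
    VolumeType="gp3",
)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# Attach to an instance already launched in that AZ (hypothetical ID).
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)
```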

So now I can restore from my backups. RTO is not highly impacted, so RTO would still be relatively aggressive. But RPO, when I don't have aggressive RPO requirements, can potentially be longer.

Now, we can apply the same approach for cross-region backup and restore, and we can do that by enabling cross-region replication at the snapshot level. That will basically make my snapshots not just regional versus zonal, but also cross-regional. So I can choose the region where I want to replicate my snapshots to, and now my snapshots are always in a different region.

So if I want to recover to another region, let's say, in case of a more extreme event, I can then simply restore all of those resources into a different region. So we've seen disaster recovery and we've seen backup and restore.
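A sketch of copying an EBS snapshot into a recovery region with boto3; note the copy is requested from the destination region, and the IDs and regions here are placeholders:

```python
import boto3

# The copy is requested in the destination (recovery) region.
ec2_west = boto3.client("ec2", region_name="us-west-2")

copy = ec2_west.copy_snapshot(
    SourceRegion="us-east-1",                   # where the snapshot lives today
    SourceSnapshotId="snap-0123456789abcdef0",  # hypothetical snapshot ID
    Description="cross-region DR copy",
)
print(copy["SnapshotId"])  # the new snapshot ID in us-west-2
```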

So the question is, when do I choose each of those options? And again, the answer is: it depends on the specific requirements of each application, sometimes down to the resource level. And at times we want to apply both strategies, meaning we want disaster recovery with retention narrowed down to a single day, and then backups with longer retention in case I have compliance or data protection requirements.

And that brings us to the next topic before we dive into ransomware recovery: let's talk about detection. The question is, how do I know that I need to recover, that I need to restore, that I need to run a failover? Sometimes it's trivial, right? My production environment is down, it's not available at all, and then I know that I need to recover. I'll have my operational alerts; I know that I'm having downtime and I know that I need to recover.

But at times it's more complex. What if a single record in a database was corrupted? If that is the case and I now need to restore, I might learn about it or detect it much later; it can be days after. So in that scenario, I'll need to retain snapshots for longer periods of time, and that's where the retention policy comes into place.

But if I need to go back a few days, it's probably not that important to have granularity of minutes or hours within the day. So I'll probably want to cut down cost and keep daily retention. If I go even farther back, for compliance audits, etc., I can increase those gaps.

So we very commonly see retention policies that are more dynamic, or complex. For example: storing daily snapshots for a month. Or, even starting with disaster recovery, storing every 10 minutes for the first hour, so if I have an event and detect it 20 minutes later, I don't have to roll back a whole day; then storing every hour for the previous 24 hours, for the previous day. So I'm increasing the gaps the farther back I go.

Then daily retention for 30 days, then monthly for a year, and then annual retention based on my compliance requirements. This is another way that we cut down ongoing cost and manage those trade-offs.
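A sketch of that kind of tiered policy expressed as multiple AWS Backup rules; the names and tiers are illustrative, and the 10-minute tier would typically come from DRS point-in-time snapshots rather than from AWS Backup:

```python
import boto3

backup = boto3.client("backup")

backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "tiered-retention-example",  # hypothetical name
        "Rules": [
            {   # hourly backups, kept for a day
                "RuleName": "hourly-24h",
                "TargetBackupVaultName": "my-vault",
                "ScheduleExpression": "cron(0 * ? * * *)",
                "Lifecycle": {"DeleteAfterDays": 1},
            },
            {   # daily backups, kept for 30 days
                "RuleName": "daily-30d",
                "TargetBackupVaultName": "my-vault",
                "ScheduleExpression": "cron(0 5 ? * * *)",
                "Lifecycle": {"DeleteAfterDays": 30},
            },
            {   # monthly backups, kept for a year
                "RuleName": "monthly-1y",
                "TargetBackupVaultName": "my-vault",
                "ScheduleExpression": "cron(0 5 1 * ? *)",
                "Lifecycle": {"DeleteAfterDays": 365},
            },
        ],
    }
)
```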

Now let's talk about ransomware recovery. Ransomware is a very hot topic, in the sense that over the past few years we're seeing it become a bigger pain point and a bigger concern for customers.

So, first of all, what is ransomware? Basically, it's just a piece of software, but a malicious one, a form of malware. Usually it will infiltrate the organization and start encrypting data. Once the data is encrypted, the attackers will usually ask for payment in order to provide the keys to decrypt it. That's the ransom.

There are a lot of reasons why customers would not want to have to pay a ransom. Obviously, we don't want to pay it, but it also generates downtime; it takes time to complete that transaction, and during that time your organization is down. Obviously, it can happen again and again. And it's not the greatest reputation to have.

So we want to protect ourselves from ransomware. And there are a few key elements for that and we'll start with recovery.

So let's dive into an example where I have my production account. I have various workloads running: data stores, compute, etc. And I have AWS Backup configured to back those up.

Now, one of the key risks of ransomware is not just the encryption of data; it's how they were able to encrypt the data. Did they have access to the account? Did they have root access? Did they steal an identity? How did that happen?

We definitely don't want to try to understand that while we are impacted. So what we want to do is isolate and separate the different problems.

So for that, we'll open up a data bunker account: a different account with a different set of KMS keys. We define another AWS Backup vault there, but we enable backup vault lock in compliance mode. And that means that no one can delete snapshots.

The key risk is that someone would encrypt my data. But if they have access to that account, they might turn off backup, they might delete the snapshots. That's the key risk.

But if I'm using a completely separate data bunker account and enabling the vault lock, then even with root access they can't delete the snapshots. But I need the snapshots to get there.

So I need to enable pushing of the snapshots between the vaults. And now I have my bunker account, which is absolutely safe from any tampering, with those immutable snapshots.
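A sketch of setting up that bunker vault with boto3, run in the data bunker account; the vault name and retention bounds are hypothetical. With compliance-mode Vault Lock, once the cooling-off window passes, the lock itself becomes immutable:

```python
import boto3

# Run in the data bunker account.
backup = boto3.client("backup")

backup.create_backup_vault(BackupVaultName="bunker-vault")  # hypothetical name

# Vault Lock in compliance mode: after the cooling-off period below,
# the lock cannot be changed and recovery points cannot be deleted early.
backup.put_backup_vault_lock_configuration(
    BackupVaultName="bunker-vault",
    MinRetentionDays=30,    # hypothetical retention floor
    MaxRetentionDays=365,   # hypothetical retention ceiling
    ChangeableForDays=3,    # after this, the lock itself is immutable
)

# In the production account's backup plan, push copies into this vault, e.g.:
# Rules[i]["CopyActions"] = [{
#     "DestinationBackupVaultArn":
#         "arn:aws:backup:us-east-1:222222222222:backup-vault:bunker-vault",
# }]
```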

But let's take a step back. Malware, or ransomware, has two very critical events, or points in time. One is the time of infection: when did that software hit my resources? The second is the time of encryption: when did it start to encrypt my data? And there could be some time between the two.

So we do see forms of attack where the attacker will infect the system, turn off the backups, and encrypt later. If you detect the encryption, it might be too late; you already don't have backups. So even if the attacker did not delete the backups, they might turn them off and wait.

So we want to make sure that we have proper alerts, monitoring, and mechanisms to alert us in case the backups are stopped or the configuration is changed. And we don't want that in our production account; we want that in our data bunker account.

So if there is any case where someone tries to turn off the backups, we'll get all the alerts that we need.
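A sketch of wiring those alerts on the bunker vault itself; the SNS topic ARN is a placeholder:

```python
import boto3

backup = boto3.client("backup")

# Send vault events to SNS so a failed, expired, or tampered backup
# raises an alert outside the production account.
backup.put_backup_vault_notifications(
    BackupVaultName="bunker-vault",  # hypothetical, from the earlier sketch
    SNSTopicArn="arn:aws:sns:us-east-1:222222222222:backup-alerts",  # placeholder
    BackupVaultEvents=[
        "BACKUP_JOB_FAILED",
        "BACKUP_JOB_EXPIRED",
        "COPY_JOB_FAILED",
        "RECOVERY_POINT_MODIFIED",
    ],
)
```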

Now let's say that in this example we were impacted by ransomware, and we want to recover. Potentially, that workload account is compromised, and if it is compromised, I want to be able to recover as quickly as possible without first trying to understand what happened with that account.

So one of the best practices is simply to set up a separate account for recovery. I have another AWS Backup vault in that account, and I push the backups there just for recovery. From there, I can just use AWS Backup restore to recover my production workloads.
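A sketch of the restore side in that dedicated recovery account; the ARNs are placeholders, and the required Metadata keys vary by resource type:

```python
import boto3

backup = boto3.client("backup")

# Restore a recovery point that was pushed into this account's vault.
job = backup.start_restore_job(
    RecoveryPointArn="arn:aws:ec2:us-east-1::snapshot/snap-0123456789abcdef0",  # placeholder
    IamRoleArn="arn:aws:iam::333333333333:role/backup-restore-role",            # placeholder
    ResourceType="EBS",
    Metadata={
        # Resource-type-specific restore parameters; EBS needs an AZ, for example.
        "availabilityZone": "us-east-1a",
    },
)
print(job["RestoreJobId"])
```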

Now, there are some complementary services around this solution. We want alerts, metrics, and logs using CloudWatch and CloudTrail, and we also want detection and scanning tools; we have GuardDuty, etc., to complement the solution.

So let's talk about ransomware mitigation with a lower RPO. Now, what does a lower RPO even mean for ransomware attacks? We said that the data is encrypted, but you want to lose a minimal amount of data. That means I need frequent point-in-time snapshots; I don't want to roll back a whole day. But that's not enough: I also need fast detection, and I need a cleanup.

So that's where disaster recovery comes into play for mitigation. We have our backup plan, and on top of that, if we use Elastic Disaster Recovery to protect from ransomware, I'll apply the same concept. I'll start with the account isolation: I want the data that I'm replicating to be in a separate account, with immutable snapshots, completely safe and separated from my production account and from my recovery account.

We have the immutable snapshots, similar to what we've seen in the backup flow, but the replication is at the block level, not daily snapshots. So I can have those tight point-in-time snapshots, and if I have fast enough detection, let's say I know that the ransomware started to encrypt my data and I catch it 10 minutes later, I can roll back those 10 minutes and only lose 10 minutes of data, rather than rolling back a whole day.

But for that, I need to detect it. So Elastic Disaster Recovery has integrations with threat detection software; some examples are CrowdStrike and SentinelOne, which we have partnerships with, at the source.

So once they detect ransomware, very shortly after it starts encrypting, they are fully integrated with DRS to invoke recovery to the closest clean state of the data.
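A sketch of that point-in-time rollback using the DRS APIs: list the recovery snapshots for a source server, pick the newest one from before the detected encryption time, and recover to it. The server ID and the 10-minute detection window are hypothetical:

```python
import boto3
from datetime import datetime, timedelta, timezone

drs = boto3.client("drs")
server_id = "s-1234567890abcdef0"  # hypothetical source server

# Suppose detection tells us encryption started ~10 minutes ago.
clean_before = datetime.now(timezone.utc) - timedelta(minutes=10)

# Newest-first list of point-in-time snapshots for this server.
snaps = drs.describe_recovery_snapshots(sourceServerID=server_id, order="DESC")

# Pick the most recent snapshot taken before the encryption started.
# (A real script should handle the case where none qualifies.)
clean = next(
    s for s in snaps["items"]
    if datetime.fromisoformat(s["timestamp"].replace("Z", "+00:00")) < clean_before
)

# Recover to that clean point in time, losing only minutes of data.
drs.start_recovery(
    isDrill=False,
    sourceServers=[{
        "sourceServerID": server_id,
        "recoverySnapshotID": clean["snapshotID"],
    }],
)
```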

So we're not only avoiding or minimizing downtime with the ability to restore; we're also going to minimize the data loss. Again, we need tight point-in-time recovery, we need very fast detection, and we need that separation, because we don't want to recover to a state that will be infected again in a minute.

Now, recall that we had two distinct points in time for a ransomware attack: the infection time and the encryption time. If we want to minimize data loss, we don't want to find the time when we were infected and go all the way back to that point.

Instead, we want a very quick recovery, and then a cleanup mechanism that will get rid of the ransomware.

So what's next? I encourage you to, first of all, define your requirements. Start with that: make sure that you understand your recovery objectives and the constraints of your organization in terms of cost, data loss, and downtime.

And then simply go ahead and check out the services - Elastic Disaster Recovery, AWS Backup, whatever fits your needs. We have free online training, etc.

And with that, I want to thank you for your time today. I hope that you've learned something about backup and disaster recovery.
