Achieving Amazon S3 data lake resilience at LexisNexis

This article discusses why backing up data in Amazon S3 matters, especially for services that contain business-critical data. It covers LexisNexis's architecture, how they worked with Clumio to address a high rate of data change and fast-recovery requirements, and the innovations that came out of it, including near-real-time data recovery with Clumio's instant access and cost reductions.

Hello, my name is Woon Jung. I'm the co-founder and CTO of Clumio, and today I have Mark Ser from LexisNexis with me.

Mark, do you want to introduce yourself?

I'm Mark Ser. I'm a Senior Consulting Software Engineer at LexisNexis.

All right. As a quick agenda: I'll cover why S3 backup is needed, then Mark will cover a bit about LexisNexis's architecture and some of the data resiliency challenges in their data lake. Then we'll go over a high-level overview of Clumio's architecture and some of the innovations we've worked on together over the last year.

OK, so let's start by talking about why backing up S3 is important. It actually goes beyond S3: the same applies to every data service that contains business-critical data. When it comes to data resiliency, it's a shared responsibility model. AWS guarantees the infrastructure resiliency; they make sure your compute and network are in place. But data resiliency is a challenge you have to solve yourself. Think about a delete event: if somebody calls something like S3 DeleteObject, AWS has no way to tell whether that delete is legitimate, accidental, or an attack. To them, a delete is a delete. Therefore, it is your responsibility to protect that data.

We've talked to a lot of customers, and there are essentially a few reasons people back up their data in S3. The first is operational recovery: accidental deletions can happen, say from the wrong lifecycle rule on a bucket that ends up deleting the entire bucket or a subset of it. Another reason is cyber attacks, which get talked about more and more these days. And lastly, there are compliance requirements; various regulations require you to keep backups of your business-critical data.

Next, I'm going to pass it over to Mark.

Thank you. So I work at LexisNexis, in the Legal & Professional division. LexisNexis is a global provider of legal, regulatory, and business information, and it's actually one of the world's largest online databases: we have about 144 billion records and we're adding about 1.2 million new records every day. With that challenge, my team is responsible for the data lake that houses most of this content at LexisNexis.

Here's a high-level architectural diagram of our data lake. One thing you might notice is that there's no VPC: we're a completely serverless architecture, and we don't even have a VPC in our AWS account. This gives us the scalability to meet the needs of the content coming in, because a lot of the time we get big spikes of content. We load about 6 to 12 million new documents every hour, so we have a high rate of change. The documents are quite small, typically about 35 kilobytes on average. And we're a completely RESTful service, which means a company doesn't have to be in AWS to use us.

One of the benefits we use is Amazon CloudFront in front of our endpoints, and we use a feature called origin groups. Origin groups provide the capability to fail over on a per-request basis, and that's our DR resilience solution: we have all the data replicated between two AWS regions, so if a request to the primary bucket ever fails, it automatically fails over to the second bucket.
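
The talk doesn't show the actual distribution settings. As a minimal sketch only, the origin-group portion of a CloudFront distribution config (as passed to boto3's create_distribution or update_distribution) could look like the following; the origin IDs and failover status codes are invented for illustration.

```python
# Hypothetical origins: the same data replicated to buckets in two regions.
origin_group_config = {
    "Quantity": 1,
    "Items": [
        {
            "Id": "s3-failover-group",
            "FailoverCriteria": {
                # Retry against the secondary origin when the primary returns these codes.
                "StatusCodes": {"Quantity": 3, "Items": [500, 502, 404]}
            },
            "Members": {
                "Quantity": 2,
                "Items": [
                    {"OriginId": "primary-us-east-1-bucket"},
                    {"OriginId": "secondary-us-west-2-bucket"},
                ],
            },
        }
    ],
}
# The distribution's DefaultCacheBehavior.TargetOriginId would then point at
# "s3-failover-group", so each individual request can fail over to the second bucket.
```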

This becomes important when we talk later about Clumio and instant access.

So when we started looking at our backup requirements, we already had our DR capabilities handled through CloudFront origin groups and replicating the data between S3 buckets. But one problem we hadn't covered was: what happens if somebody accidentally deletes the data, or someone gets into the account with ransomware, or a code bug happens that we didn't expect? We needed to make sure we had backups of this data that were immutable and air-gapped, so that even a super admin at the company has no way to touch, modify, delete, or encrypt any of the data.

We also had a low RPO requirement, 15 minutes, because again, we're loading between 6 and 10 million documents every hour, so we need a low RPO so we don't lose data. And we also wanted a low RTO to recover the data, because we're a very large database and any time you're out for more than a couple of hours, revenue starts to take a hit.

As we started looking at vendors and what we could do in this space, we realized we had some challenges, mostly around what we do and how we do it. We have about 26 billion records in a single S3 bucket, about 100 terabytes of data, and we're loading between 6 and 12 million objects an hour, so we have a very high rate of data change.

And again, the average object size is 35 kilobytes, so we have a lot of very small files.

And those 6 to 12 million new objects per hour are just the new ones. That doesn't account for any of the deletes we have, where we either lose access to content or versions roll off, because we only keep so many versions. So really there's another 5 to 10 million requests per hour also hitting our S3 storage.

When we looked at what we could do to repopulate from our vendors, it was going to take six months, which obviously was not an acceptable solution. So we started looking at different vendors to see who could solve this challenge, and that's how we came to Clumio.

All right, thank you. So let's start with a high-level overview of the architecture and then drill down into the innovations we worked on together.

I'm going to start on the left-hand side (or the right-hand side, depending on where you are): this is the customer account. In this example, that would be the LexisNexis AWS account, and the bucket in the middle is the bucket we have to protect. The onboarding process is pretty straightforward: we provide you with a CloudFormation template or a Terraform script that you install in your environment.

Obviously, every environment is a little unique, so we also let you customize those templates before installing them. Once the template is installed, we set up all the assets Clumio needs to access your data and perform the backup. For example, we create an IAM role in your account, and that's the IAM role we assume in your account to carry out the backup processes.
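
The actual template isn't shown in the talk. As a hedged sketch of the cross-account role it describes, created here with boto3 instead of CloudFormation or Terraform, the trust and access policies might look roughly like this; the vendor account ID, external ID, role name, and bucket name are all placeholders.

```python
import json
import boto3

iam = boto3.client("iam")

# Placeholders: the backup vendor's AWS account and an external ID it supplies.
BACKUP_VENDOR_ACCOUNT = "111122223333"
EXTERNAL_ID = "example-external-id"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{BACKUP_VENDOR_ACCOUNT}:root"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": EXTERNAL_ID}},
    }],
}

# The role the backup service assumes in the customer account.
iam.create_role(
    RoleName="s3-backup-access",  # placeholder role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Role the backup service assumes to read the protected bucket",
)

# Read-only access to the protected bucket; a real template would scope this further.
iam.put_role_policy(
    RoleName="s3-backup-access",
    PolicyName="bucket-read",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:GetObjectVersion", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::lexisnexis-data-lake",    # hypothetical bucket
                "arn:aws:s3:::lexisnexis-data-lake/*",
            ],
        }],
    }),
)
```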

Along with that, we also set up things like S3 Inventory and S3 EventBridge notifications. These are the mechanisms by which we get the full list of objects we need to back up: S3 Inventory gives us the full list, while EventBridge gives us the changes happening in your bucket, all the PUTs and DELETEs, minute by minute. That's the mechanism we use to build the continuous backup and deliver the 15-minute RPO.
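
A minimal sketch of the two S3-side pieces described here, an inventory configuration plus EventBridge notifications, might look like this with boto3; the bucket and inventory destination names are hypothetical, and this is not Clumio's actual setup code.

```python
import boto3

s3 = boto3.client("s3")
SOURCE_BUCKET = "lexisnexis-data-lake"            # hypothetical names
INVENTORY_DEST = "arn:aws:s3:::backup-inventory-dest"

# 1) Daily S3 Inventory report: the periodic "full list" of objects and versions.
s3.put_bucket_inventory_configuration(
    Bucket=SOURCE_BUCKET,
    Id="backup-full-listing",
    InventoryConfiguration={
        "Id": "backup-full-listing",
        "IsEnabled": True,
        "IncludedObjectVersions": "All",
        "Schedule": {"Frequency": "Daily"},
        "Destination": {
            "S3BucketDestination": {
                "Bucket": INVENTORY_DEST,
                "Format": "Parquet",
            }
        },
        "OptionalFields": ["Size", "LastModifiedDate", "ETag"],
    },
)

# 2) Turn on EventBridge notifications so every PUT/DELETE on the bucket is
#    emitted as an event; this stream feeds the continuous, 15-minute-RPO backup.
s3.put_bucket_notification_configuration(
    Bucket=SOURCE_BUCKET,
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)
```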

Once that's set up, on the right-hand side we create a dedicated AWS account for that specific customer; think of it as a bunker account we create for that one customer. In this example, that account is dedicated to LexisNexis. That's the account where we hold the data: all the processing and retention happens in that dedicated account, which is managed by Clumio. And everything on the right-hand side, in that dedicated account, is purely serverless; everything is Lambda-driven and dynamically scales up and down depending on the workload we get.

Like Mark mentioned, they're doing about 10 million PUTs every hour on that bucket. What that means is 10 million events hitting EventBridge, and when 10 million events happen, we fire a bunch of Lambda functions that handle that load. Basically, we want the backup to scale with the primary, because otherwise the backup becomes the bottleneck, right?
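
Clumio's internal code isn't shown in the talk. Purely as an illustration of the pattern, a Lambda consumer of those S3 EventBridge events might be shaped like this; the staging bucket and key layout are hypothetical.

```python
import json
import boto3

# Hypothetical staging location inside the dedicated "bunker" account.
BACKUP_BUCKET = "clumio-bunker-staging"
s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by an EventBridge rule matching S3 "Object Created"/"Object Deleted"
    events from the protected bucket. Each invocation records one change, so the
    continuous backup keeps pace with the primary's write rate."""
    detail = event["detail"]
    change = {
        "bucket": detail["bucket"]["name"],
        "key": detail["object"]["key"],
        "event": event["detail-type"],           # e.g. "Object Created"
        "time": event["time"],
        "version_id": detail["object"].get("version-id"),
    }
    # Append the change record; a real system would batch and index these.
    s3.put_object(
        Bucket=BACKUP_BUCKET,
        Key=f"changelog/{change['bucket']}/{event['id']}.json",
        Body=json.dumps(change).encode(),
    )
```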

All right. Next, I'm going to talk about the work I've been doing with Mark's team over the last year: instant access. First, let me explain the basic restore process.

When it comes to a restore, something clearly bad has happened in your account or your bucket. So you come to me and tell me: hey, I want to restore this bucket to a specific point in time, say yesterday at 5 p.m. I want that bucket exactly as it was yesterday at 5 p.m. The process goes as follows.

First, we need to find out which objects were visible yesterday at 5 p.m. We need a list of objects, because obviously the objects change over time, but what you're asking me is to restore back to yesterday at 5 p.m. So first we run a query so we know the list of objects that were visible at that point.

The second step is for us to copy those objects back to the final destination where you want to restore them. However, if you think about it, copying takes time, because we're actually moving bits from one location to another. In the case of LexisNexis, with on the order of 30 billion objects, it would take days to restore. But we wanted to improve the RTO; we wanted to offer an RTO that is significantly faster than days, in hours if possible.
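
As a rough illustration of the "query, then copy" pattern described above (not Clumio's actual implementation), a point-in-time restore against a versioned bucket could be sketched like this; the bucket names and cutoff time are hypothetical.

```python
from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")
SOURCE = "protected-bucket"          # hypothetical
DESTINATION = "restore-target"       # hypothetical
CUTOFF = datetime(2023, 11, 27, 17, 0, tzinfo=timezone.utc)  # "yesterday at 5 p.m."

def visible_versions_at(bucket, cutoff):
    """Yield (key, version_id) for the newest version of each key that existed
    at `cutoff` and was not hidden by a delete marker at that time."""
    best = {}  # key -> (last_modified, version_id, is_delete_marker)
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket):
        for v in page.get("Versions", []):
            if v["LastModified"] <= cutoff:
                cur = best.get(v["Key"])
                if cur is None or v["LastModified"] > cur[0]:
                    best[v["Key"]] = (v["LastModified"], v["VersionId"], False)
        for d in page.get("DeleteMarkers", []):
            if d["LastModified"] <= cutoff:
                cur = best.get(d["Key"])
                if cur is None or d["LastModified"] > cur[0]:
                    best[d["Key"]] = (d["LastModified"], d["VersionId"], True)
    for key, (_, version_id, deleted) in best.items():
        if not deleted:
            yield key, version_id

# Step 1: query the objects visible at the point in time.
# Step 2: copy each one to the restore destination (the slow part at scale).
for key, version_id in visible_versions_at(SOURCE, CUTOFF):
    s3.copy_object(
        Bucket=DESTINATION,
        Key=key,
        CopySource={"Bucket": SOURCE, "Key": key, "VersionId": version_id},
    )
```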

So what we ended up doing is implementing something called instant access. The first part of the restore stays the same: if you come to me asking to restore back to yesterday at 5 p.m., we still run the query and get the list of objects visible as of yesterday at 5 p.m. The difference is that this time we don't copy those objects back to the final destination. Instead, we stand up an S3 access point with an S3 Object Lambda integration.

That endpoint starts behaving like an S3 endpoint you can connect to, and your application can start using it directly, without copying all the bits into your final destination. While all this is happening, we can still rehydrate the data back to the destination you wanted. This capability lets you get up and running within hours rather than days while the restore happens, and that endpoint behaves like any other S3 endpoint.
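
The talk only names the building blocks (an S3 access point plus a Lambda integration), so the following is purely a sketch of how an S3 Object Lambda GetObject handler can sit behind such an endpoint and decide which bytes to return; the point-in-time lookup itself is only described in the comments.

```python
import urllib.request
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Minimal S3 Object Lambda GetObject handler: the access point invokes this
    for every GET, so the function can decide which backed-up bytes to return.
    A point-in-time design would resolve the version visible at the restore
    cutoff instead of simply proxying the default object."""
    ctx = event["getObjectContext"]

    # Fetch the object the access point would have returned by default.
    with urllib.request.urlopen(ctx["inputS3Url"]) as resp:
        body = resp.read()

    # Echo the routing tokens back along with the (possibly substituted) body.
    s3.write_get_object_response(
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
        Body=body,
    )
    return {"statusCode": 200}
```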

Mark, do you want to talk about the integration?

Yeah, sure. This was what was novel and innovative for us. Because, again, we use origin groups with CloudFront, it allowed us to take our primary origin, rather than our S3 bucket (which is no longer there, or compromised, or whatever may have happened), and point it at the backup we had in Clumio, the instant access endpoint, and that provides immediate read access to all of that content.

And we were actually able to test that: we restored the 26 billion records in two hours and 48 minutes, which is awesome for our business because we can get to our critical data quickly. At the same time, at the bottom, you can see we're rehydrating the data, so we have a failover back to what will become our primary origin. At some point in the future we don't have to read from Clumio anymore; we switch our origin groups around behind the scenes, our customers don't notice anything different, and we're getting the data from our own bucket again. So this is what really set Clumio apart when we were looking at how to recover the data, not just from a backup perspective.
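
How LexisNexis actually flips the origins isn't shown. As a hedged sketch only, reordering the failover group's members so the instant-access endpoint becomes the primary (and switching back after rehydration) could be scripted roughly like this with boto3; the distribution ID and origin IDs are placeholders.

```python
import boto3

cf = boto3.client("cloudfront")
DISTRIBUTION_ID = "E1EXAMPLE"                      # placeholder
INSTANT_ACCESS_ORIGIN = "clumio-instant-access"    # placeholder origin IDs
PRIMARY_BUCKET_ORIGIN = "primary-us-east-1-bucket"

def set_primary_origin(first_origin_id, second_origin_id):
    """Reorder the origin group's members; CloudFront tries them in order."""
    resp = cf.get_distribution_config(Id=DISTRIBUTION_ID)
    config, etag = resp["DistributionConfig"], resp["ETag"]
    group = config["OriginGroups"]["Items"][0]
    group["Members"]["Items"] = [
        {"OriginId": first_origin_id},
        {"OriginId": second_origin_id},
    ]
    cf.update_distribution(Id=DISTRIBUTION_ID, DistributionConfig=config, IfMatch=etag)

# During a recovery: serve reads from the backup's instant-access endpoint.
set_primary_origin(INSTANT_ACCESS_ORIGIN, PRIMARY_BUCKET_ORIGIN)
# After rehydration completes: switch back to the primary bucket.
set_primary_origin(PRIMARY_BUCKET_ORIGIN, INSTANT_ACCESS_ORIGIN)
```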

So yeah, the onboarding process, like I mentioned, is pretty straightforward. It's just a Terraform template or a CloudFormation template; in 15 minutes you're up and going. You apply the policy at the bucket level, configure the continuous backup, and the whole thing flows. But do you want to spend a little time explaining the results?

Yeah. So it was very easy to set up; they do have one-click integration in the console, and we started working with them. However, we also had requirements that meant we weren't able to use that: we didn't want anyone to be able to modify our production buckets outside of IaC.

So we actually had to work with Clumio, and they were able to make some changes that allowed us to maintain control of the bucket configuration. Once the bucket was configured properly, Clumio was able to automatically start pulling that data.

And then when we actually started doing the backup, we were able to back up all 26 billion records, 100 terabytes, in less than a week. Ironically, Clumio has some scale protections to make sure they don't pull so much data that it impacts you. So this could have gone faster; we kept telling Clumio to go faster, and they were like, no, no, we don't want to hurt you.

And then, the nice part of working with Clumio as a vendor was that they really understand the space and they listen to customers, which is exactly what we saw when we started looking at all the different features and capabilities.

You know, we're at the scale of billions, and a lot of the time you usually hear about millions. Between millions and billions, there are different things you have to think about.

One of the things we came across was that when Clumio did restores, they would add two tags to every object, indicating that it was a restored object and giving the timestamp, so you knew these objects were restored by Clumio and weren't original to the bucket. However, you couldn't do a restore without them.

When we talked with Clumio, we explained that if we were going to do that restore with 26 billion records, it was going to cost us $52,000 in S3 tagging costs, and then another $52,000 to remove those tags. They were able to make a change, actually within a couple of days, to make those tags optional.

Another way we were able to fine-tune and help the cost was with the continuous backup, the EventBridge piece that gives you the 15-minute RPO. By default, all the object changes are replicated: both the creates and the deletes, the PUTs and the DELETEs.

For our use case, based on our architecture, we didn't need the deletes to be within the 15-minute RPO. We could just let the nightly incremental, where they validate everything that's there and should be there, handle the deletes. So they made changes so we could specify that only PUTs need to be under the 15-minute RPO for continuous backup, and that was actually a 48% cost reduction on our side.
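
Conceptually, the filter Mark describes amounts to matching only object-creation events from the protected bucket. Clumio's actual mechanism isn't shown in the talk; a sketch of an equivalent EventBridge rule, with a hypothetical rule name, bucket, and target, could look like this.

```python
import json
import boto3

events = boto3.client("events")

# Match only "Object Created" S3 events (PUTs), not "Object Deleted",
# so deletes are picked up by the nightly incremental instead.
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["lexisnexis-data-lake"]}},  # hypothetical bucket
}

events.put_rule(
    Name="continuous-backup-puts-only",  # hypothetical rule name
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)
events.put_targets(
    Rule="continuous-backup-puts-only",
    Targets=[{
        "Id": "backup-consumer",
        # Hypothetical Lambda that records each change for the 15-minute RPO.
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:backup-consumer",
    }],
)
```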

And then the CloudFront integration. Ironically, when we started talking with Clumio about the instant access feature, they explained it and it was awesome. We were like, this is great, this is exactly what we need. And we said, great, we can use it with CloudFront, right? And they were like, what do you mean, use it with CloudFront? It wasn't even on the roadmap.

They were just using it as an S3 endpoint. And so we said, look, for a business, being able to recover all the data in hours, while we're still copying the data, that's priceless. So is there any way you can get CloudFront integration to work with this the way you're designing it?

So they went back to the drawing board, and for the GA release they were able to make it work with CloudFront. So now we're able to get our data restored,

basically the entire data lake, in under three hours if we ever lost it. So, what are other ways people are using instant access outside of CloudFront?

Yeah. So the whole idea of instant access came from our obsession with reducing the RTO, and then the cost aspects of it. One of the use cases we've talked about is how it reduces your RTO, but the original use case, the one people are already using, is things like DR testing.

If you have a production bucket and you're backing it up, and you have to do a DR test every quarter, what does that mean? Do you have to restore hundreds of millions of objects into a final location, point your DR application at it, do a quick test, and then destroy the bucket? That's a lot of wasted time and resources, and when you're pouring in hundreds of millions or billions of objects, sometimes it's cost-prohibitive.

So what we wanted to do is enable that use case: let you quickly go back in time and stand up an S3 endpoint you can point at and run a DR test against, at a fraction of the cost and a fraction of the time. That's one use case we've heard. Obviously, it was a pleasure working with Mark and his very technical team, and the CloudFront feature request was a great one, so we acted on it immediately. Some of the other requests we're getting are things like: can you integrate this with, say, AWS Athena? If you want to take the data lake back to, using my example, yesterday at 5 p.m., and do it really quickly and run a quick Athena query by hooking it up directly, that's going to be another use case you can integrate with instant access.
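
As a tiny illustration of the DR-testing idea (not a documented Clumio interface), an application under test could simply be pointed at the instant-access endpoint, for example an S3 access point ARN, instead of the production bucket; the ARN and key below are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder: the endpoint handed back by the backup tool for the point-in-time view.
INSTANT_ACCESS_ENDPOINT = (
    "arn:aws:s3:us-east-1:123456789012:accesspoint/pit-2023-11-27-1700"
)

def read_document(key):
    """Read a document from the point-in-time view during a DR test,
    without restoring anything into the production bucket."""
    resp = s3.get_object(Bucket=INSTANT_ACCESS_ENDPOINT, Key=key)
    return resp["Body"].read()

# The DR test exercises the same read path the application uses in production.
print(len(read_document("documents/sample-doc.xml")))
```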

All right. So net-net, in terms of scale: we can process, I believe, about 10 million objects every hour, give or take, which is roughly what you're doing, and a restore of around 30 billion objects takes about three hours.

All right, last slide, I guess. So that's pretty much it. Thank you for stopping by; we're at the booth right over there, booth 605. Thank you.
