Accelerate secure data migrations at scale with AWS DataSync

Hey folks, thanks for coming out today. I hope everybody's having a great time at re:Invent. My name is Jeff Bartley. I'm a product manager on the AWS DataSync team, and I'm joined today by Raghavendra, who's with the Workday team.

He's going to talk to us today, and I'm excited to hear his story about how Workday used DataSync as part of the migration of their Hadoop cluster into AWS. We'll be talking about that a little bit later.

From an agenda perspective, I'll start with an overview of the DataSync service, and then I'll go into a deep dive on some of the inner workings of DataSync to help you understand a little bit more about how the service operates.

I'll then talk a little bit about performance optimization and cost optimization: things you can do with DataSync to get the best performance out of your DataSync tasks, and to save a little money as you're using the service.

I'll then turn it over to Raghavendra, who will talk about Workday's story, and then I'll wrap it up at the end.

If you've ever had to move a lot of data as part of a migration, and I used to do this many years ago, you've probably experienced a lot of the challenges we have listed up there.

When we talk about large-scale data transfers, we're talking about terabytes, hundreds of terabytes, maybe even petabytes of data, and millions or billions of files. I work with a lot of customers who frequently start out thinking, OK, I think I can do this, moving data ought to be pretty simple, I'll write some scripts and do this on my own. They quickly begin to realize the level of effort that's required in something like that.

You run into challenges around security: am I going to encrypt? How do I encrypt my data in flight and keep it safe?

How do I verify my data? This is often a big one. Sometimes I'll work with customers who say, look, I'm going to move my data and then I'll just spot check it: yeah, this file made it, and that file made it. That's OK if it's a thousand files, or tens of thousands. But if you've got millions or billions of files, spot checking is not going to work.

So verifying, making sure that data made it all the way through to the end, is super important. So is recovering from errors. I can't count how many migrations I've done where, a quarter of the way in, I realize, oh man, I forgot to flip that switch, the metadata is not copying correctly, and I have to go back and start all over again. Dealing with those kinds of errors and making sure you can recover from them cleanly matters.

If you're copying online, you also have to think about network availability: how do I make sure I'm using my network bandwidth efficiently and effectively, so that I'm not taking it away from my other users? So there are a lot of different challenges out there, particularly when you're moving data at scale, and AWS DataSync was built to help our customers deal with them.

DataSync in particular was built with a custom protocol on top of TCP/IP. We use parallel TCP streams to achieve a high level of throughput; several of our customers achieve up to 10 gigabits per second per task.

And they can scale that out. I've had one customer achieve a petabyte a day of data movement using DataSync, a significant amount of data movement, so we can move data very fast.

Security really is job one here at AWS. We say that, but it really is true. DataSync has built-in encryption in flight and at rest; all data is encrypted in flight using TLS 1.3. We also have options for verifying data, both in flight and at rest, to make sure your data is fully transferred, byte for byte. And DataSync is a fully managed service.

What we mean by that is it integrates with the AWS services that you might expect, like CloudWatch Logs and Events as well as CloudTrail, and we also make it easy to connect to the storage services that we support: Amazon S3, the Amazon FSx file systems (we support all the FSx file system types), and Amazon EFS.

And we've designed it to be easy to use, to overcome a lot of those challenges I talked about earlier. It's got capabilities like built-in scheduling and filtering, you can throttle the bandwidth usage of your tasks, and there are a lot of other capabilities designed to make it easier to transfer particularly large amounts of data.

Now, our customers are using DataSync for four main use cases. The first is helping them accelerate their recurring business workflows. A good example I often give is customers in the life sciences space.

These customers may be running things like genome sequencers on premises, and those sequencers are putting out a lot of data on a daily basis, often going to a local file system on premises. But these customers are often startups, and they don't have the compute capacity on premises to actually process the data being produced; they want to do that in the cloud.

So they'll use DataSync to copy that data, in near real time, up into the cloud as it comes off those machines. Once the sequencing job is done, they'll trigger their downstream processing. So again, helping customers accelerate data movement as part of these workflows.

Migration is another key use case: customers migrating file systems. Raghavendra will talk about his migration with Hadoop. Then there's protecting data. I've got a lot of customers who have file systems or object storage systems on premises and want to make a second copy of that data in the cloud. DataSync has built-in scheduling, so you can replicate on an hourly, daily, or weekly basis, whatever works for you, or on custom schedules. And then archiving.

You probably know there's a lot of data out there, and a lot of it is cold. Think about those PowerPoint presentations you haven't touched in a year, those Excel spreadsheets you created two years ago. A lot of it is just sitting out there cold, but many of our customers have to retain that data for compliance purposes or for other reasons.

But that data is consuming their on-premises capacity, so they need to move it. What they'll often do is move it into something like S3, S3 Glacier, or Glacier Deep Archive, take advantage of the cost savings those services provide, and then free up capacity on premises. DataSync can help our customers with that as well.

So we'll talk about three different ways you can use DataSync. The first, and I've alluded to it, is moving data from on-premises or edge locations. We can support really any storage that talks the NFS or SMB protocol, like file servers and Windows servers, as well as object storage systems that support the S3 API, and of course Hadoop, which Raghavendra will talk about.

DataSync can also work at edge locations. The AWS Snowcone device, if you're familiar with it, is a small form factor, ruggedized device, and it comes with the DataSync software built in. Our customers will often take the Snowcone device to a remote location, collect some data on it, bring the device back to their office or data center, plug it in, and use DataSync to copy the data from the device up into the cloud.

DataSync also supports copying data to and from S3 storage on Outposts, so that covers the edge presence. We also have a variety of customers who, for one reason or another, operate in multicloud environments. Maybe they've acquired another organization that operates in a different cloud, or maybe they've got users within their organization who have stored data in a different cloud and they want to bring it into AWS.

Regardless, it can be complicated to operate in a multicloud scenario, and AWS has been building out capabilities over the last few years to help our customers operate in these kinds of environments. DataSync is one of those services. Last year we launched support for copying data both to and from Google Cloud Storage and Azure Files.

Earlier this year we launched Azure Blob Storage support, and we added support for a number of other clouds that offer S3-compatible object storage. So you can use DataSync to copy data from those other clouds into any of the AWS storage services shown there: not just S3, but FSx or EFS.

You can also copy data from AWS to those clouds if you need to, so it works in both directions, giving customers who operate in multicloud environments extra help moving their data between clouds.

And finally, we have a lot of customers using DataSync simply to copy data back and forth between AWS storage services. This may be copying a subset of your data from one S3 bucket to another, or going from EFS to S3, any combination of the services you see there. DataSync can copy data between them, and it intelligently deals with things like metadata translation, making sure the data lands in the right way and in the right format, because when you're going from a file system to an object store, these are things you have to think about.

So at a high level, how does DataSync work? If you're copying from an on-premises location, an edge location, or even another cloud, you start by deploying an agent, which is a virtual machine, and that agent is set up to connect to your storage. You then define source and destination locations, which are simply a way of telling DataSync, hey, here's how you connect to my storage: an IP address or a URL, the authentication user name and password, things like that.

Then you run a DataSync task, and that task is responsible for copying data from the source to the destination. If you're copying from on-premises environments, we support copying data over the internet, over AWS Direct Connect, or over site-to-site VPN. Regardless of which method you use, all data is encrypted in flight using TLS 1.3.
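To make that flow concrete, here is a minimal sketch of how those pieces map to the DataSync API using Python and boto3. The hostnames, ARNs, paths, and schedule below are placeholders, and a real setup would also need the agent activated and an IAM role that allows DataSync to write to the bucket.

```python
import boto3

datasync = boto3.client("datasync")

# Source: an on-premises NFS share, reached through a previously activated agent.
source = datasync.create_location_nfs(
    ServerHostname="nfs.example.internal",  # placeholder hostname
    Subdirectory="/exports/projects",
    OnPremConfig={"AgentArns": ["arn:aws:datasync:us-east-1:111122223333:agent/agent-0123456789abcdef0"]},
)

# Destination: an S3 bucket, accessed via an IAM role that DataSync can assume.
destination = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::my-migration-bucket",  # placeholder bucket
    Subdirectory="/incoming",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
)

# A task ties the two locations together; the optional schedule runs it nightly.
task = datasync.create_task(
    SourceLocationArn=source["LocationArn"],
    DestinationLocationArn=destination["LocationArn"],
    Name="nightly-nfs-to-s3",
    Schedule={"ScheduleExpression": "cron(0 2 * * ? *)"},  # 02:00 UTC every day
)

# Kick off an execution immediately instead of waiting for the schedule.
execution = datasync.start_task_execution(TaskArn=task["TaskArn"])
print(execution["TaskExecutionArn"])
```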

OK, so let's go into a deep dive on DataSync, starting with on-premises transfers. We talked about how DataSync can connect to different types of storage systems over the NFS or SMB protocols, and we support different versions of those protocols, so we have pretty good connectivity capabilities there.

We often have customers copying from a NetApp system or, like I said, Windows servers, things like that. We support really any on-premises object storage system that is generally S3 API compatible; it doesn't have to be completely compatible with the S3 API, and our documentation lists the API operations we require. We also talk to Hadoop clusters using the HDFS protocol.
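As a hedged illustration of that HDFS support, a Hadoop cluster can be registered as a DataSync location roughly like this with boto3. The NameNode hostname, user, and agent ARN are placeholders, and a Kerberized cluster would use the Kerberos parameters instead of simple authentication.

```python
import boto3

datasync = boto3.client("datasync")

# Register an HDFS source location; the agent reads from the cluster's NameNode.
hdfs_location = datasync.create_location_hdfs(
    NameNodes=[{"Hostname": "namenode.hadoop.example.internal", "Port": 8020}],  # placeholder
    AuthenticationType="SIMPLE",   # a production cluster would more likely use KERBEROS
    SimpleUser="hdfs",             # placeholder Hadoop user
    Subdirectory="/data/events",
    AgentArns=["arn:aws:datasync:us-east-1:111122223333:agent/agent-0123456789abcdef0"],
)
print(hdfs_location["LocationArn"])
```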

One important thing to note here is that the agent doesn't talk directly to the storage services themselves. It actually goes through the DataSync service. The reason we do that comes back to the custom protocol I mentioned, which is built for WAN optimization and high throughput. The agent uses that protocol to talk to the DataSync service, which then routes the data correctly and at the fastest speed to the different storage services in your environment.

An agent is really intended for communicating with systems outside of AWS. Like I said, that could be on premises, in other clouds, or at the edge; on a Snowcone device, the DataSync agent is actually preinstalled.

It's provided as a virtual machine image, or as an AMI that you can run on an EC2 instance. You can deploy it on VMware, Hyper-V, KVM, or EC2.

The nice thing about DataSync agents is that software updates are completely handled by DataSync. You don't have to schedule them or anything; we do it in the background when tasks aren't running, so we keep all the software up to date.

The basic way you deploy an agent is to install the virtual machine and then activate that agent in the region and account where you're actually going to be using it, associated with your storage. So if you're copying to an S3 bucket, you would activate your agent in the region where that S3 bucket is located, using the account you'll be working in.
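A rough sketch of that activation step: after the agent VM is deployed and you have obtained its activation key from the agent itself, registering it in your account and region is a single API call. The key and name below are placeholders.

```python
import boto3

# The region of the client is the region the agent gets activated into,
# which should match the region of the AWS storage you plan to copy to or from.
datasync = boto3.client("datasync", region_name="us-east-1")

agent = datasync.create_agent(
    ActivationKey="ABCDE-12345-FGHIJ-67890-KLMNO",  # placeholder key from the agent VM
    AgentName="onprem-nfs-agent",
)
print(agent["AgentArn"])
```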

Now, we provide a couple of different ways for that agent to connect into AWS, because DataSync is copying data over the network. The first is public endpoints, which are our standard endpoints available over the internet. Again, the agent is always communicating using TLS, so it's all encrypted in flight.

We also support VPC endpoints, so customers who need private connectivity between their on-premises environments and AWS can use this. You create a DataSync VPC endpoint in your VPC up in AWS, and then you can use something like Direct Connect to create that private connection between your agent and your VPC.

And then for customers who need FIPS capabilities, typically customers working with the US government or in GovCloud, we have FIPS endpoints as well if you need that added level of security.

Now, one thing to think about as you're using agents with other clouds: you really have two options. Remember that you can deploy the agent as an EC2 instance, and you can actually also deploy it in the other clouds, with a little bit of extra work to convert the image into a virtual machine that can run there.

So why would you choose one over the other? Sometimes you may have data that's compressible. One nice thing about our protocol is that data is compressed in flight. For some of our customers doing a migration from another cloud into AWS, egress charges are a concern, so they put the DataSync agent in the other cloud, and as it runs it compresses the data in flight, potentially reducing egress charges out of that cloud.

That's one of the advantages. We've got a couple of blogs out there that walk through deploying the agent in Google Cloud or in Azure, so you can check those out; those are two of the most popular options. But you have choices for how you deploy this. We've also had customers who deploy it as an EC2 agent just because it's a little easier to do and to automate that way.

I mentioned the customer who was copying a petabyte a day: that was actually from another cloud, and they were scaling out EC2 agents, which is a performance optimization I'll talk about later. But you have these two options when using agents with other clouds.

Now, when it comes to AWS-to-AWS transfers, one of the nice things is that there are no agents to deploy, because we're talking entirely to AWS storage services. So there are no agents to deploy and no infrastructure to manage.

You can literally just call an API and say, copy from this S3 bucket to that S3 bucket. It's super easy, and you can copy data within the same region or across regions using DataSync.

The nice thing is that all the traffic stays within the AWS network. It never goes out to the internet and comes back or anything like that; it's all within AWS, and again, all traffic is encrypted in flight.
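To show how little setup an in-AWS transfer needs, here is a hedged boto3 sketch of an S3-to-S3 copy: two locations, one task, no agents. The bucket names and the IAM role are placeholders.

```python
import boto3

datasync = boto3.client("datasync")
role_arn = "arn:aws:iam::111122223333:role/DataSyncS3Role"  # placeholder role DataSync can assume

source = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::source-bucket",
    Subdirectory="/reports/2023",
    S3Config={"BucketAccessRoleArn": role_arn},
)
destination = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::destination-bucket",
    Subdirectory="/reports/2023",
    S3Config={"BucketAccessRoleArn": role_arn},
)

task = datasync.create_task(
    SourceLocationArn=source["LocationArn"],
    DestinationLocationArn=destination["LocationArn"],
    Name="s3-to-s3-copy",
)
datasync.start_task_execution(TaskArn=task["TaskArn"])
```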

OK, so DataSync is a task-driven service, which means it's designed to copy data from a source to a destination. A DataSync task is configured with a source location, a destination location, and configuration options that control how the task actually runs.

Those options might control how verification works, how data is actually copied, and what metadata is or isn't copied, things like that.
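As a rough illustration of those task options, here is what a hedged Options block might look like in boto3. The specific choices below (verifying only transferred files, preserving POSIX metadata, capping bandwidth) are examples, not recommendations, and the task ARN is a placeholder.

```python
import boto3

datasync = boto3.client("datasync")

# Example task options; every value here is illustrative.
options = {
    "VerifyMode": "ONLY_FILES_TRANSFERRED",   # or POINT_IN_TIME_CONSISTENT / NONE
    "OverwriteMode": "ALWAYS",                # overwrite changed files at the destination
    "PosixPermissions": "PRESERVE",           # keep file mode bits where the destination supports them
    "Uid": "INT_VALUE",                       # preserve numeric owner and group IDs
    "Gid": "INT_VALUE",
    "PreserveDeletedFiles": "PRESERVE",       # don't delete destination files missing from the source
    "BytesPerSecond": 100 * 1024 * 1024,      # throttle to roughly 100 MB/s
}

datasync.update_task(
    TaskArn="arn:aws:datasync:us-east-1:111122223333:task/task-0123456789abcdef0",  # placeholder
    Options=options,
)
```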

A few things to remember when working with DataSync. A DataSync agent can run one task at a time. It can queue up tasks if the agent is in use, but it only runs one at a time.

A task copies data from one source to one destination. You can create multiple tasks if you want to copy from different locations, but it's one source to one destination today.

A task can process up to 50 million files. We're working on overcoming that limit with changes we're making, but that's the limit today.

And then, as I mentioned, you can configure a task to run on a schedule; we support different time frames and you can create custom schedules as well. A task, when it runs, goes through various stages.

It starts in a launching phase, where we're launching some of the infrastructure on our back end to process the task. Then it goes into a preparing phase, where DataSync scans the source and the destination, compares them, and figures out what needs to be transferred.

Then we go into a transferring phase, where the data is actually moved over the wire to the destination, and finally a verifying phase, where we verify the data and make sure it has fully landed at the destination, byte for byte.

If there are errors anywhere along the way, we report those in CloudWatch, and we also recently added capabilities for more in-depth error reporting, where you can see differences in things like hashes or checksums to get a deeper understanding of what went wrong.
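Those phases are visible when you poll a running execution. Here is a hedged sketch that watches an execution's status until it finishes; the execution ARN is a placeholder, and the status values in the comment reflect the DescribeTaskExecution API.

```python
import time
import boto3

datasync = boto3.client("datasync")
execution_arn = (
    "arn:aws:datasync:us-east-1:111122223333:task/task-0123456789abcdef0"
    "/execution/exec-0123456789abcdef0"  # placeholder
)

# Status moves through QUEUED, LAUNCHING, PREPARING, TRANSFERRING, VERIFYING,
# ending in SUCCESS or ERROR.
while True:
    execution = datasync.describe_task_execution(TaskExecutionArn=execution_arn)
    status = execution["Status"]
    print(f"{status}: {execution.get('FilesTransferred', 0)} files, "
          f"{execution.get('BytesTransferred', 0)} bytes transferred")
    if status in ("SUCCESS", "ERROR"):
        break
    time.sleep(30)
```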

So again, that gives you a bit of an understanding of how DataSync works. From a performance optimization perspective, there are a few things to think about, and a lot of this stems from the fact that DataSync is an online data transfer service that is talking to storage systems on both ends. Those are the main things to keep in mind here.

There are really three legs to the network path that data travels through when you run a DataSync task. If you're copying from on premises, you've got the leg from your on-premises storage to the DataSync agent, the leg from the agent to the DataSync managed service, and the leg from the managed service to the destination storage.

We'll talk a little about each of those. Some things to think about on the agent-to-storage leg: first, DataSync uses these common protocols to communicate with your storage, so you need to make sure that firewalls and the like are open between your agent and your storage.

We use a multithreaded architecture on our agent to scale storage transfers and achieve high levels of throughput and IOPS. What that means is we could potentially impact your storage system. That's why I always recommend to customers, particularly when they're running against older storage systems they're migrating off of, say an old NetApp running slower 7,200 or 10,000 RPM drives, that we're probably going to impact the IOPS on that system.

So you want to check that out and be careful there; definitely run a POC. As I mentioned, slow storage networking or IO can impact DataSync performance.

You also want to minimize the latency between the agent and the storage. Sometimes customers will come to me and say, look, I've got a Direct Connect from my VPC in AWS down to the storage running in my data center; can I just run the agent as an EC2 instance? And like all good engineers, my answer is: it depends.

It really depends on the latency of that network, because storage protocols, NFS and SMB in particular, are sensitive to latency. If there's too much latency between AWS and your on-premises storage, you're going to run into a lot of problems, to the point where it may not even work at all.

That's why we typically recommend that customers deploy the agent as close to their storage as possible to minimize that latency. We also provide options for encrypting storage communication. We do this automatically with the SMB protocol, and we have options with Hadoop and the other protocols that we support.

Now, for the leg between the agent and the DataSync managed service, this uses the custom protocol we provide, and as I mentioned earlier, the agents communicate with our service, not directly with the storage. The agent doesn't talk directly to S3 or to EFS or FSx; it goes through our service first.

You can use Direct Connect or VPN with VPC endpoints for private networking. We provide throttles, so you can configure DataSync on a per-task basis and tell it how much bandwidth to use: 50 megabytes per second, 500 megabytes per second, whatever you need.

And you can change that throttle dynamically while the task is running. I've had customers who run really long tasks, and during the day they throttle it down while their business processes are using the network; then at night, when their users go home and stop using the network, they throttle it back up so the data team can use more bandwidth.
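That day and night pattern maps to a single API call against the running execution. A hedged sketch; the execution ARN is a placeholder, and UpdateTaskExecution only accepts the BytesPerSecond option.

```python
import boto3

datasync = boto3.client("datasync")
execution_arn = (
    "arn:aws:datasync:us-east-1:111122223333:task/task-0123456789abcdef0"
    "/execution/exec-0123456789abcdef0"  # placeholder
)

def throttle(bytes_per_second: int) -> None:
    """Change the bandwidth cap of a running task execution; -1 means unlimited."""
    datasync.update_task_execution(
        TaskExecutionArn=execution_arn,
        Options={"BytesPerSecond": bytes_per_second},
    )

throttle(50 * 1024 * 1024)   # business hours: cap at roughly 50 MB/s
throttle(-1)                 # overnight: remove the cap entirely
```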

We can compress data where possible for sending over the wire, and again, it's all encrypted in flight using TLS.

Then, on the last leg between the DataSync service and the AWS storage, we're again using the common storage protocols: NFS, SMB, or the S3 API. In this case, DataSync manages the connectivity automatically; there's very little you need to do, though you do need to configure it a bit.

For example, when we're talking to S3, you need to give us a role that our service can use to talk to your bucket, but that's pretty standard for working with AWS services. The same caveat applies here, though.

You can configure some storage systems in AWS, like FSx for Windows File Server, with HDD storage, and the IOPS are going to be a lot lower than if you configured it with SSD. So the same caveats apply.

DataSync is a multithreaded service. We do some optimizations when talking to certain services to make sure we don't try to overwhelm them, but you still want to keep in mind that if you've configured your file systems or storage with lower-cost, lower-performance storage, you could run into IO issues there as well. So keep that in mind.

We encrypt traffic automatically and provide options for doing so, and again, all traffic stays within the AWS network for the region where your storage exists.

I talked earlier about the customer copying a petabyte a day, which is a lot of data; that's operating at over 10 gigabytes per second. It was a huge amount of data they were copying.

The way they did it was effectively by deploying multiple agents. They actually had 30 different agents, every one of those agents was running an individual task, and they configured their tasks to point to specific prefixes in their object storage. Then they just kicked off DataSync and let it run.

All of those tasks were running simultaneously, pulling data from different locations in their storage, all in parallel, so they were able to reach very high levels of throughput using this method.

This is one way to increase your throughput, and you don't have to run it in the cloud; you can do it on premises as well. For customers who have, say, a 10 gigabit per second link, the way they'll typically max that out is with three or four tasks running in parallel when they're doing a high-speed migration.
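A hedged sketch of that scale-out pattern: one task per prefix, each bound to its own agent through its own source location, all started in parallel. The bucket, prefixes, hostnames, and agent ARNs are placeholders, and this assumes the destination location and IAM role already exist; it also assumes an S3-compatible object store as the source.

```python
import boto3

datasync = boto3.client("datasync")

prefixes = ["/year=2021", "/year=2022", "/year=2023"]            # placeholder prefixes
agent_arns = [
    "arn:aws:datasync:us-east-1:111122223333:agent/agent-aaaa",  # placeholder agents,
    "arn:aws:datasync:us-east-1:111122223333:agent/agent-bbbb",  # one per task
    "arn:aws:datasync:us-east-1:111122223333:agent/agent-cccc",
]
destination_arn = "arn:aws:datasync:us-east-1:111122223333:location/loc-destination"  # placeholder

executions = []
for prefix, agent_arn in zip(prefixes, agent_arns):
    # Each source location covers one prefix of a self-managed, S3-compatible object store.
    source = datasync.create_location_object_storage(
        ServerHostname="objects.example.internal",   # placeholder object store endpoint
        BucketName="analytics-data",
        Subdirectory=prefix,
        AgentArns=[agent_arn],
    )
    task = datasync.create_task(
        SourceLocationArn=source["LocationArn"],
        DestinationLocationArn=destination_arn,
        Name=f"migrate{prefix.replace('/', '-').replace('=', '-')}",
    )
    executions.append(datasync.start_task_execution(TaskArn=task["TaskArn"]))
```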

So this is one way to get a higher level of throughput using DataSync. Another option is to use multiple agents with a single task.

The reason you would do this is when you have a data set with lots of small files, where you need higher IOPS rather than raw throughput. You can assign up to four agents to a single task, and each of those agents will copy many files in parallel. So this is an approach you can use if you're migrating, for instance, a system with a lot of small files on it.
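In the API, the extra agents are attached to the source location rather than to the task itself; a single location can list up to four agent ARNs. A hedged example with placeholder values:

```python
import boto3

datasync = boto3.client("datasync")

# One NFS source location served by four agents working in parallel (the maximum).
source = datasync.create_location_nfs(
    ServerHostname="nfs.example.internal",   # placeholder file server
    Subdirectory="/exports/small-files",
    OnPremConfig={
        "AgentArns": [
            "arn:aws:datasync:us-east-1:111122223333:agent/agent-aaaa",
            "arn:aws:datasync:us-east-1:111122223333:agent/agent-bbbb",
            "arn:aws:datasync:us-east-1:111122223333:agent/agent-cccc",
            "arn:aws:datasync:us-east-1:111122223333:agent/agent-dddd",
        ]
    },
)
```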

That was a lot on performance optimization. From a cost optimization perspective, there are some things we do as well. DataSync charges on the amount of data we transfer, on a per-gigabyte basis, and DataSync always does incremental transfers. That means we only try to copy the data that has actually changed between the source and the destination. The way we do this is that during the preparing phase we scan the source, scan the destination, compare them at both the data and metadata level, and see what has changed; then we copy only what has changed. That way we're not copying the same data over and over again and charging you for it; we only do those incremental transfers.

You can also use filters to narrow down the data you need to copy. For example, you might have a file structure that looks something like this, and you might say, I only want to copy folders C and D. DataSync has filtering capabilities. We have an include filter capability, so you could give it a string that says, hey, only copy folders C and D, and in that case DataSync will completely ignore folder A and folder B. We won't even scan them; they're ignored in both the preparing and transferring phases.

We also support exclude filters, so you could say, don't bother copying files with a certain extension, or something like that. So this is another way to limit the amount of data that DataSync is scanning and transferring.
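Filters can be set on the task or passed per execution; patterns are pipe-delimited path patterns relative to the source location. A hedged example using placeholder folder names and a placeholder task ARN:

```python
import boto3

datasync = boto3.client("datasync")

datasync.start_task_execution(
    TaskArn="arn:aws:datasync:us-east-1:111122223333:task/task-0123456789abcdef0",  # placeholder
    # Only walk and copy /c and /d; /a and /b are never scanned.
    Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": "/c|/d"}],
    # Skip temporary files anywhere in the included folders.
    Excludes=[{"FilterType": "SIMPLE_PATTERN", "Value": "*.tmp"}],
)
```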

And then the last one is the ability to do direct transfers into S3 storage classes. If you're familiar with S3, you know there's a cost to running lifecycle policies to transition data from one storage class to another. With DataSync, if you know you want your data to land in Glacier, or in Intelligent-Tiering, or wherever, you can configure DataSync to copy directly into that storage class, removing the need for lifecycle policies to transition the objects, and saving yourself a little money.
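The storage class is set on the S3 location itself, so everything the task writes lands directly in that class. A hedged sketch targeting Glacier Deep Archive; the bucket and role are placeholders.

```python
import boto3

datasync = boto3.client("datasync")

archive_destination = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::compliance-archive",  # placeholder bucket
    Subdirectory="/retention/2015",
    S3StorageClass="DEEP_ARCHIVE",                  # objects land directly in Glacier Deep Archive
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
)
```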

The one caveat when using DataSync with S3 is just to be aware that S3 has request charges. If you're using DataSync to copy a lot of data to and from S3 over and over again, DataSync is doing things like LIST and HEAD operations on the objects, and that can lead to request charges in some cases. You'd have to have millions and millions of objects being scanned, and be doing it on a frequent basis, to see any real impact, but it's something to be aware of, because it does sometimes affect our customers.

And with that, I'm going to turn it over to Raghavendra, who's going to talk about Workday's story using DataSync.

Thank you, Jeff. Hi, everybody. My name is Raghavendra. I lead the Operational Data Lake team at Workday, and I'll be talking about Workday's journey of data migration using DataSync.

A quick introduction to Workday. Workday is a cloud-based human capital management (HCM) and financial management company. Workday provides solutions for managing many aspects of organizations, such as HR, payroll, benefits, time tracking, talent management, and more. Workday's HCM platform helps organizations streamline their HR processes, improve workforce management, and enhance employee engagement. In addition to the HCM platform, Workday also offers financial management solutions to help organizations manage their financial operations, such as accounting, budgeting, procurement, and financial reporting.

Before the migration, we had been operating an operational data lake on premises for about eight to nine years, with 10 petabytes of historical data plus continuous streaming data. During this period, we encountered several limitations running such a large operational data lake on premises. Some of them were aging hardware and network constraints: limited capacity in the data center, frequent hardware maintenance, and limited networking capacity.

Then, as the business and the company grew across different geographical locations, we also had challenges with geographical constraints, such as limited redundancy across geographies, disaster recovery, higher latency for remote access, and challenges with geo-local data regulations.

There was also a lack of elasticity, primarily driven by limited capacity and limited automation in the on-premises infrastructure, as well as inefficient resource utilization due to homogeneous hardware, inefficient capacity planning, and limited fault tolerance.

And finally, there were cost challenges. With an on-premises data center for such a large data lake, you have to make upfront hardware investments, pay for ongoing maintenance and upgrades as well as licensing and software, and you run into challenges with unit economics at scale.

So, putting all this together, we made a strategic decision to move the operational data lake to the cloud, which unlocked greater scalability, cost efficiency, and accessibility, and also set our platform up for modernization.

When we started evaluating data migration solutions for our use case, we had three choices: DataSync, Snowball Edge, and S3DistCp (S3 distributed copy). A quick definition of Snowball Edge: it's a physical device with both storage and compute capabilities. You copy the data offline, with the device connected in your data center, and then ship it to an AWS data center, from where the data is ingested into AWS storage services. S3DistCp is a command-line tool that supports parallel copies between the Hadoop HDFS file system and S3.

We made the choice to move forward with DataSync, primarily because of the flexibility and agility it provides during the migration process. For example, we had the ability to control the network or data transfer speed during the migration. Second, we had the ability, which is a powerful one, to incrementally copy the data instead of reprocessing it from the beginning if we encountered any errors, which did happen in our case. So how did DataSync enable petabytes of data migration for us?

If you look at the picture: on the left you have on premises, in the middle you have the migration account, which was a temporary account we established for the migration process, and on the right you have the production account. On the left is the production Hadoop cluster, which is completely isolated. We deployed all the DataSync agents in a staging cluster, querying the files in the HDFS that runs in prod, so that we did not disrupt any of the existing applications using the prod HDFS cluster.

The DataSync agents are connected to a Direct Connect gateway through the DataSync VPC endpoints in the migration account, and through that we copied all the data as-is to the raw data bucket. This is an intermediate bucket holding all the files that were copied from HDFS. The primary reason we took this two-phase approach, copy the raw data first and then transform it, was the limited compute capacity we had in the data center.

Once we copied the data to the raw data bucket, we then applied our transformations and business logic to that data in the migration account through EMR Spark jobs, and finally wrote it back to the production account, to the final data bucket where users query the data.

During this process, we also encountered several challenges. These are not specific to DataSync; they're challenges we encountered during the migration process in general.

First, if you're starting with DataSync and you already have custom validation of your data, I would highly suggest disabling the checksum verification within DataSync; it will enable faster data transfers.
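As a hedged illustration of that suggestion, verification is controlled by the task's VerifyMode option; setting it to NONE skips the post-transfer verification phase, which only makes sense if, as in Workday's case, you validate the data yourself downstream. The task ARN is a placeholder.

```python
import boto3

datasync = boto3.client("datasync")

# Skip DataSync's own verification phase because validation happens downstream.
datasync.update_task(
    TaskArn="arn:aws:datasync:us-east-1:111122223333:task/task-0123456789abcdef0",  # placeholder
    Options={"VerifyMode": "NONE"},
)
```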

During the migration we also faced some challenges with the S3 output writer for custom partitions. Defining the right prefix structure in S3 is very important for the performance of S3 writes, so that's another one.

The third one is transforming data from one file format to another during the migration, which is very important, and in most cases that's what happens during a migration. Our historical data was in sequence files, and we had to transform it to Parquet. Most of the files residing on prem did not have any associated schema or schema evolution.

So we had to transform all those sequence files into Parquet files; that was one of the challenges.

Another is managing and optimizing data transfer speeds so that you do not impact the existing network flows where other applications are running, or exhaust your networking capacity; addressing any unexpected errors or failures; and reprocessing data with checkpointing. This is also key: you do not want to run a long process, copy all the data, and only at the end realize the data was not valid and have to reprocess it. It's always important to have some level of checkpointing during the data migration, and to handle potential data integrity issues by tracing back to the source systems.

This is one of the key things: DataSync has the ability to identify files that have already been migrated to S3, and if you go back and change a file for any reason, it will copy the incremental change automatically.

The reason this is an important feature is that in our case we identified several files that were corrupted during the transfer, or in the source system itself. For example, in HDFS we had some corrupted replicated data that was copied by DataSync, and we had to go back and correct those replicated files; DataSync then copied those files back to S3 incrementally and automatically.

Finally, what was the impact of this journey, and what were our takeaways?

First, we had predictable data transfer speeds that helped in planning network capacity. Networking was a huge win for us using DataSync.

Second, the two-phase approach enabled thorough testing of major transformations, which were then reused for our streaming pipeline. During this migration process, all of our transformations were well tested, and we reused similar business logic and code in the streaming pipeline for in-flight data.

We improved SLAs and data reliability compared to on premises, where we had aging hardware, frequent hardware failures, and so on.

And the new platform on the cloud enabled modernization, with access to cutting-edge analytics tools that support use cases such as machine learning and AI.

So, wrapping up, I'm handing it back over to Jeff.

Thank you. Yeah, thanks, Raghavendra. I liked some of the things he called out, which is that when you're thinking about migrations, there's still a lot of planning you have to do. DataSync is going to help you with a lot of the basic data movement challenges, but there's still a lot of planning that needs to happen in order to pull off a successful migration. So it pays to put in the effort up front and avoid some of the challenges that frequently trip up migrations.

Just wrapping up real quick, some of the key takeaways. DataSync gives you data movement capabilities for those four use cases I mentioned: migration, archive, replication for data protection, and data movement workflows. You can use DataSync to move your data between on-premises storage, edge locations, other clouds, and within AWS. We provide those different network endpoints and features for secure, fast, and reliable data movement.

You can maximize your available bandwidth using multiple tasks and scaling out like that, and as Raghavendra said, you can also use bandwidth throttling to control how much network bandwidth you're using.

If you want to learn more, definitely check out the DataSync website. We've got lots of blogs, tutorials, references, videos, things like that. You can scan that QR code and it'll take you directly to our website.

I'd also encourage you to continue with your AWS storage learning. We've got lots of resources out there for you to learn more about the storage capabilities we provide, and ultimately to earn some of those AWS storage badges; again, the QR code leads to that URL.

And with that, I just want to say thank you for your time. Hopefully this was useful. We're going to take some questions off stage if we've got them. But thank you very much.
