How Electronic Arts modernized its data platform with Amazon EMR

Well, hello everybody, and welcome to this session where we are going to talk about how Electronic Arts modernized its data platform with Amazon EMR. My name is Arturo Blo; I am the data analytics specialist for our strategic customers at AWS. And I've got the pleasure of sharing the stage today with Shika Verma, who is the principal product manager for Amazon EMR, and with Alex Ignatius, who is the senior director for data and analytics engineering at Electronic Arts and who was also the executive sponsor for the migration project that we are going to share with you all this afternoon.

But before diving deeper into EA's story, let's reflect briefly on why customers are modernizing their data platforms. The fundamental reason is that the total amount of data we are capturing, generating, copying and consuming globally is growing very, very rapidly, reaching over 180 zettabytes in 2025. According to some analyst reports, that is a growth rate of ten times more data every five years.

Since most of us here today are builders, if we want to design a data platform for our organizations that can last for, let's say, 10-15 years, that means we need to design it so that it can scale 1,000 times from where we are today, according to those growth numbers. Now, building for that scale is not easy. It brings challenges across multiple dimensions that you need to consider, such as high cost when multi-year commitments on hardware, software or support fees are in place (and this is exacerbated when compute and storage are tightly coupled), or operations and management overhead, which also tends to grow at the same pace as the data platform. And there are cases where the data platform itself cannot scale because it cannot cope with the amount of data, processes, or even the people using it: new people bringing new requirements such as machine learning, data sharing, or self-service.

So when customers look at solutions to modernize their data platform, they want a solution that can address all of these challenges, and one of the solutions customers are looking at is Amazon EMR. And who better to tell you more about why than Shika, the principal product manager for the service. Over to you, Shika.

Hello everyone. My name is Shika Verma and I'm part of the Amazon EMR product management team. Today I want to talk to you a bit about Amazon EMR and the value that EMR brings to our customers.

So first, what is Amazon EMR? EMR is an industry-leading cloud big data solution for petabyte-scale data processing, interactive analytics and machine learning using open source frameworks such as Apache Spark, Apache Hive, HBase, Hudi and Flink. EMR is 100% compliant with open source APIs, which means you don't need to change your application code when you come to EMR. With EMR, you can run petabyte-scale analytics at half the cost of on-premises solutions and over 3x faster than standard Apache Spark. EMR makes it easy to create, operate and scale big data environments by automating time-consuming tasks such as provisioning capacity and tuning clusters. With EMR, you can create a cluster and provision one, hundreds, or thousands of compute instances to process data at any scale. EMR can manage the cluster size for you, scaling up or down depending on your utilization, and you only pay for what you use.
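To make that create-a-cluster model concrete, here is a minimal sketch (not from the talk) of provisioning a small, transient Spark cluster with the boto3 `run_job_flow` call. The cluster name, instance counts, region and log bucket are assumptions for illustration only.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Provision a small Spark/Hive cluster; sizes and names here are illustrative.
response = emr.run_job_flow(
    Name="example-spark-cluster",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # terminate once the steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",          # default EMR instance profile
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-example-bucket/emr-logs/",  # hypothetical bucket
)
print(response["JobFlowId"])
```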

EMR decouples compute and storage, allowing you to scale each independently. For storage, you can take advantage of the tiered storage of Amazon S3, and for compute, you can take advantage of EC2 Spot Instances, which can save you up to 80% over the cost of On-Demand Instances. EMR enables interactive analytics for data scientists and analysts through EMR Studio and deep integrations with SageMaker Studio, allowing you to visualize, build and debug applications easily.

EMR offers multiple deployment models. With EMR on EC2, you have access to a broad range of compute instances in the cloud, allowing you to optimize for price performance. Customers can bring their Spark workloads to EMR along with container applications using Elastic Kubernetes Service, or EKS. With EMR Serverless, you can run petabyte-scale applications without having to manage or operate a cluster. And if you want the benefits of the managed services in EMR but need to keep your equipment in an on-premises environment, then AWS Outposts is also an option.

EMR runs directly against your data stored in an S3 data lake, which means you don't need to move or transform your data when you come to EMR. And because EMR runs against S3, you can have multiple clusters operating on the same data.
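As a small illustration of working in place against S3, here is the kind of PySpark job you could run on an EMR cluster, reading raw data from the lake and writing a summary back; the bucket, paths and column names are made up for the example.

```python
from pyspark.sql import SparkSession

# On EMR, SparkSession is already configured to read and write S3 via EMRFS,
# so the data never has to leave the data lake. Paths below are hypothetical.
spark = SparkSession.builder.appName("s3-data-lake-example").getOrCreate()

events = spark.read.parquet("s3://my-example-bucket/raw/events/")
daily = events.groupBy("event_date", "title").count()

# Any other cluster pointed at the same prefix can query this output directly.
daily.write.mode("overwrite").parquet("s3://my-example-bucket/curated/daily_counts/")
```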

Next, I want to talk a bit about how customers use EMR. Customers use EMR to build data lakes as part of their modern data architecture for scalable data analytics. This can include things such as change data capture or ingesting streaming telemetry events. Customers use EMR to query petabytes of data in batch or real time using open source frameworks such as Spark, Presto and Hive.

Customers come to EMR from expensive solutions. The cost savings on EMR are not just support costs; there are also savings to be had on hardware acquisition, personnel and maintenance. Essentially, with EMR you can create a cluster in minutes without having to do any of the typical management activities you previously had to do.

Customers use EMR in conjunction with EMR Studio and SageMaker to transform, analyze and prepare large quantities of data as part of their data science and machine learning workflows. And lastly, customers use notebooks to build big data applications and leverage other AWS analytics services.

Lastly, I just want to touch on the areas where EMR is continuously innovating. One area where we're constantly innovating is performance, because we want to make sure that EMR is the best place for you to run your Spark workloads, as well as workloads on other open source frameworks such as Presto and Hive. In addition, there are four main themes where we continue to innovate.

First is cost and performance. In addition to the performance improvements we've made in our runtime, we focus on two main areas: compute optimizations and cluster management policies. For compute optimizations, while cost depends on the number and types of compute instances you deploy, and On-Demand offers straightforward rates, you can further cost optimize by purchasing Reserved Instances or Spot Instances. EMR also provides cluster management policies, configuration options that let you control things such as whether your cluster is terminated automatically or manually. This can further reduce your cost by allowing your clusters to run only for the time needed.
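For instance, an idle-timeout policy can be attached to a running cluster so it terminates itself when no work is queued. This is a minimal sketch using the boto3 `put_auto_termination_policy` call; the cluster ID and timeout value are placeholders.

```python
import boto3

emr = boto3.client("emr")

# Shut the cluster down automatically after one hour of idleness.
emr.put_auto_termination_policy(
    ClusterId="j-EXAMPLE12345",                    # hypothetical cluster ID
    AutoTerminationPolicy={"IdleTimeout": 3600},   # idle seconds before termination
)
```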

The next area I want to call out is ease of use, and the main innovations here are in EMR Studio. EMR Studio is an integrated development environment that allows data engineers and data scientists to develop and deploy code easily. It's a fully managed application with single sign-on, Jupyter notebooks, automated infrastructure provisioning and job diagnosis.

The next area where we continue to innovate is transactional data lakes, and the main innovations here are in ingestion, querying and administration for the creation and management of data lakes.

And lastly, security. EMR has a comprehensive set of features across isolation, authentication and authorization, and EMR enables you to work with data that is encrypted at rest and in transit.

So with that, next up we'll get a firsthand view of how customers are modernizing with Amazon EMR, presented by Alex Ignatius, who is a senior director of data and analytics at Electronic Arts.

Thank you. Thank you, Shika. Pleased to be here. So, coming to the meat of the presentation: I'm here to talk about our journey with EMR, or in other words, you could call it where the rubber meets the road.

So, welcome to Electronic Arts in the first place. If any of you have teenage children, you would definitely know what Electronic Arts does. We are a gaming company. We make games for various platforms, including Xbox, PlayStation, PC, online, every gameplay platform that you can imagine. What you see here is the makeup of what Electronic Arts is: we have plenty of studios all over the world making different genres of games. In other words, we exist to inspire the world to play, with 20-plus studios distributed all over the world making flagship titles like FC, Battlefield, Madden or The Sims, many of which you would have heard of.

So naturally, we cater to a very large global audience, with different seasonality of games and game releases, and data flowing in from each and every one of our studios. How do we handle this? Because each of these studios and each of these games functions, as I always used to say, like a company of companies: they have a unique culture, a unique architecture, a unique way of doing things. Everything is their own. How do we bring the data together? Because if the data is going to be siloed, that's going to be a very big challenge. And that's where we come in.

We are EA Digital Platform, and I belong to the Data and AI vertical. I have a couple of my colleagues here with me as well. Our job is to deliver a platform as a service. We don't want each and every one of them building a data platform on their own and doing every nut and bolt before they can even consume the data and make meaningful business decisions. So we are part of the digital platform team and we offer platform as a service, from data acquisition all the way to collection, to processing, and sometimes summarizing and delivering to our customers. We take care of all of that, and this naturally includes all the compliance management, the enriching and summarizing, and making sure GDPR compliance is included. Everything is taken care of in terms of consumption.

We have all sorts of consumers. Our personas include a producer who could be needing that data, or a data engineer, or a game artist, or the business for a business decision, or sales; any persona you can think of, we have all of them in each and every one of these game units. So there is a common language that they all speak, and at the same time there is a unique language to each of these games and studios. That's where we come in as a central data platform: we try to unify everything related to that common data language that all of them speak. I jokingly used to say, all of us being data professionals, that you are either already in a data storm, or you are about to get into one, or you are getting out of one, one or the other. And by the time you thought you had solved it for the next few years, here comes another surprise. That's where we were, constantly and continuously.

Now, with one central data platform team, we acquire all this data: tens of billions of rows per day, multiple tens of billions of rows per day that we collect. And we as a team put it together, sanitize that data, and make it readily available. Whether you want to consume it in raw form, or in a summarized form, or directly through an API, or you just want a raw data copy of it right from storage, or you want to consume it through a BI tool or an ETL tool, whichever way you want, we support all of it. So on the left side, what you see is our digital platform, and we are part of it, Data and AI; on the right side, hundreds of data teams. Just because we have one central data team does not mean each of those games totally depends on us. That's not the way it is. Each and every one of our games, game teams and studios has its own data staff, from a small number of data engineers all the way up to data scientists and engineers, everyone who consumes and uses this data on a day-to-day basis.

So from a user perspective, there are varied types of data sets that we need to cater to, varied types of personas, and naturally various types of problems that we need to cater to as well. Now, with this many studios, there are tens of terabytes of telemetry data flowing in every single day.

Thousands of ETLs of every velocity in nature - real time, near real time, batch, mini batch, every aspect of velocity you can think about - thousands of ETLs streaming in every day, and dozens and dozens of petabytes of data accumulated. It's a large data platform, and all of this - the volume, the velocity, the real time and mini batch - leads to a proliferation of tech; I wouldn't call it proliferation so much as an organic growth over a period of 10-11 years.

There were unique individual problems we were trying to solve where a technology helped, and there were common problems we were trying to solve across the studios where a particular area really helped. So our platform has grown over the last 10-plus years with a bunch of vendor stack and a whole bunch of open source stack. It's a combination of both with which we deliver what we deliver.

So as you look at this picture - I really don't know how else to put it - this is not where we are; this is where we were. We have already migrated, we are already in production; our current-state solution with EMR is already in production, delivering, up and running. So this is where we were, the legacy state per se.

Let me start with the left side, where our data emerges. As I said, as each and every one of your kiddos picks up that controller and pushes a button, we get data. All our partners and their services work synchronously to bring in that data: our internal game servers, game backends or the game clients, and then our first-party providers like, for example, Sony or Microsoft or even Amazon or Meta. This data could emerge from anywhere, and we have a robust event collection service which seamlessly collects all that data and dumps it into S3.

Now, once the data is in S3, our scheduler picks it up, and at that spot we had a huge data processing platform: our Hadoop/Hive-based stack, with close to 500-plus nodes at one time, 3-plus petabytes of HDFS space, and 2,000-plus ETLs really grinding at it. This was our legacy state per se.

For orchestration we had Oozie, and then we had the Hive Metastore for our metadata, and then our typical health-monitoring portals like Slack or email notifications, plus an internal DevOps portal where all of that came together. Coming to the right side of the picture, which is our consumption stack: as I mentioned, we deliver data in several forms to several destinations.

You can just subscribe to a data set to get your data to your preferred destination, or we make the data available in Redshift or Snowflake or BigQuery or whichever platforms the consumers are on. We don't dictate "this is where you need to access it"; rather, they have the option of having the data delivered to the endpoints, or what we call sinks, where they really want to have the data.

We do have an in-house querying mechanism as well, an in-house environment we call Pond, where customers can access any of those data sets directly. And we have a whole rich set of APIs through which customers can consume, because one of the use cases of data consumption is our live services. Our live services are directly game integrated: sometimes when you go into a game, when you are in a messaging service inside the game, the small pop-ups and images that you see are rendered from the data platform.

So we have a live services component of our data platform for which the data needs to be rendered as well. We have a rich set of APIs and several operational DBs too. So this is, overall, our legacy platform. Now the question could be: it looks clean enough, neat enough, so what's your problem? Why do you want to radically make over this whole platform, and why?

As I mentioned, we are always either entering into a data storm or getting out of one. For all of us who have been in this industry for a decade plus - at least for the three decades plus that I have been in it - that is the reality.

Now, first of all, our first challenge was a legacy and aging stack: too many components accumulated over a period of a decade plus. The platform evolves over and over, and we always have this necessary evil called backwards compatibility, right? You never get to make over things that easily; anything and everything that goes to production, every service, needs to be backwards compatible.

So slowly but steadily: too many components in the stack, a lot of interdependencies and versions, and very, very slow-paced upgrades. Obviously AWS has evolved over this entire period as well, and we had some of our dependencies built into the AMI versions that we were using. So it's all very tightly coupled, with a lot of components in it, and that naturally increases the number of data hops in your environment. That was one of our major challenges. Then, effective auto scaling: with this many components, you may be able to auto scale one piece of your stack but not another piece. How good is that? At the end of the day, the customer SLA has not changed in any way, and data delayed is data denied.

So in that kind of environment, we needed a central mechanism where we could handle critical things such as retention, access control and provisioning, and making that work with this many versions and variations of technology components was really a herculean challenge.

And then, very importantly, sometimes our customers don't even need a faster SLA, but they need a consistent SLA: "Hey, tell me the data is going to be delayed by an hour every day; I'm OK with it. But don't tell me today it arrived on time, tomorrow it arrived later, and the following day sooner," with that cycle repeating. Our inability to predict our SLAs was another major driver for us.

And then, as I mentioned, a very tightly coupled architecture, some AMI and instance dependencies, and needless to say, heavy data movement between HDFS and S3. As I mentioned on the architecture slide, we had 3-plus petabytes of HDFS storage with minimal retention: no data ever sits in our HDFS for more than seven days, and most of the data is barely two or three days. That's how we keep the data. So it's more of a processing place where we receive all of our telemetry and do the heavy lifting, after which the data gets into S3 for consumption. From that point onwards the consumption really starts, which means there are tens of terabytes of data movement between HDFS and S3 every single day.

Once the processing is complete, we had backup mechanisms that would back the data up into S3, then it goes through one more level of processing, and so on. Each and every one of these is a hop that takes a considerable amount of time.

Last but not least, as I mentioned: by the time you think you have solved the problem, there are ten others waiting for you already. That's the typical nature of the data world. So, increasing business demand. I have been with EA a long time, and there were times when your data pretty much doubled every five years; now it pretty much doubles every year or every other year. The increasing business demand - the amount of data they want in real time, near real time, mini batch and batch - is continuously growing, and naturally, with this kind of stack, this many components and this much business demand, the operations overhead keeps increasing, out of the many hops it takes and the many technologies involved. Even one single piece of it can kill your SLA, and if it is not this tech this time, it is something else next time, one or the other.

So with the increasing operations overhead, our dev engineering investment was being outweighed by the operations work, and naturally that leads to cost and cost management challenges: everything, the file formats, the tech versions, everything.

So we needed a next-generation platform that would help us. First of all, the bottom line that was given to us was: seamless, absolutely seamless, because we have games launching throughout the year. It's not that we have a launch season. Yes, our sports properties like the NFL title or FC have a specific time of launch, but there is a slew of other games, especially our mobile properties, that launch continuously all through the year, and they don't wait for "OK, this is a downtime, that is a downtime." We didn't have that luxury at all. From current state to future state, it had to be an absolutely seamless transition for us. No downtime, bottom line.

And then, handle mixed and diversified workloads. Sometimes we optimize quite a lot for reads, sometimes quite a lot for writes, and for us it is all through the day, with demand constantly and continuously plowing through us. So mixed workload scenarios with a central data catalog was our vision.

Then, optimize our DevOps - I already spoke to that a little - reduce our operations overhead and have an effective CI/CD mechanism, and enhance the monitoring and alerting services. And last but not least, the fundamental thing our customers expected from us: predictable SLAs, and guarantees that they can sign up for. There are multiple levels of customers: for example, the C-level expects a different kind of data set, engineering expects a different kind of data set, studio producers expect a different kind of data set, with different SLAs, and all of them need to be brought together.

That's where we turned to EMR. We wanted it to be as fully managed as possible: a fully managed big data processing platform. The key thing for us is that it supports a wide range of open source frameworks, because as I mentioned earlier, our platform in the legacy state was already quite a bit of open source plus vendor stack. So that open source support - say Spark or Hive or Flink or Iceberg - was very critical to us, because we just can't start over fresh somewhere.

And our platform is predominantly built on AWS. We are a multi-cloud environment - we are present in all three clouds, and some form of our services runs in each of them - but in this case our platform is predominantly built on AWS, so EMR fits nicely into the whole ecosystem of our existing platform. Then we were looking for quantifiable SLA improvements, and, very importantly, the next point, which I'll also talk about a little later: resource isolation. In our legacy state we had a queuing mechanism whereby all the jobs in a particular queue shared all the resources, but that did not really work for us, because one game - sometimes it could even be a bug, or a bot, or something like that - could pump in an enormous amount of data that would impact every other job in the same queue. Queue-level optimization, SLA guarantees or resource isolation is not enough for us.

We needed job-level isolation, and Spot Instances are an incentive for us from a cost management perspective, as is compute and storage scalability, the ability to scale both of them independently. And then managed scaling - I want to be cautious there: somebody goes and runs a select-star query on a 10-terabyte table, and for that I end up scaling up and paying for it. We don't want those kinds of things to happen, because the analyst community turns over continuously and constantly, and a newer analyst who doesn't know the size of the game, the table or the data could easily blow it up. So we wanted managed scaling with defined perimeters for how we can scale up and scale down very effectively. That is something we were looking for.
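As a hedged illustration of what "managed scaling with defined perimeters" can look like on EMR, here is a minimal sketch using the boto3 `put_managed_scaling_policy` call, which sets a hard floor and ceiling on cluster capacity. The cluster ID and the unit counts are placeholders, not EA's actual limits.

```python
import boto3

emr = boto3.client("emr")

# Bound managed scaling so a single runaway query cannot grow the cluster
# past the capacity we are willing to pay for.
emr.put_managed_scaling_policy(
    ClusterId="j-EXAMPLE12345",                  # hypothetical cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 3,
            "MaximumCapacityUnits": 20,
            "MaximumOnDemandCapacityUnits": 5,   # remaining capacity can come from Spot
        }
    },
)
```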

I'll also talk about the t-shirt sizes and how we came up with them a little later in my presentation. Now, with that, where did we end up? What is our current state? Where are we currently functioning? I'm not going to spend much time on the left side, because nothing much has changed there, except that from Perforce we went to Git. Other than that, what I explained in the legacy state about where our data originates and how we bring that data in is pretty much similar. Yes, of course, we are constantly working on newer collection service and acquisition mechanisms, but that will happen organically no matter what, because our games keep growing and scaling.

The critical differences are in the middle part of what you see. We came up with this decoupled architecture that is flexible enough for us to pick our software and its versions, as flexible as possible, and at the end of the day everything needs to talk to everything else in a seamless fashion. Whether you are delivering the data to Redshift or to Snowflake, each of them following a unique way to connect and collect - that's not the way we wanted it. So it's a very decoupled architecture that's very flexible to plug and play, whether it is a tech to acquire, a tech to process or a tech to consume. We wanted all of it to work seamlessly with each of the others.

Then, the next thing is that you may see the large, extra large, medium kind of labels. First of all, let alone the processing part of it, this migration, frankly, when we worked with our AWS partners, they were quite puzzled: are you really going to do it? Thousands of jobs, three petabytes of HDFS, a one-year timescale? We have seen companies taking three-plus years to deliver this kind of thing. There was a lot of skepticism around it, internally as well. And we had an army of engineers going at it, and we couldn't afford to let each of them make their own decisions on "What instance type would really work for me? What is my compute demand, my storage demand, my I/O demand? What is the SLA I need?" If we took that kind of an approach, it was going to take not three years but possibly five years for us.

So we invested heavily - I'll speak about it a little more later as well - in the planning, architecture and preparation stage, where we had t-shirt sizes of clusters already predefined. We profiled all our jobs and had those defined, so that when it comes to an engineer, as long as he knows his job and what he wants, that's enough: there will be something equivalent for him to pick up as a t-shirt-sized configuration, spin it up, and run it. Our storage layer is Intelligent-Tiering on S3. So end to end, we wanted it to be as decoupled as possible, but at the same time we wanted minimal decision-making at every stage of the track the data passes through. If each and every one of them were going to take time to decide, even one wrong decision would cost us quite dearly in terms of our external delivery.

So what you see as large, extra large or medium is basically the cluster types and instance types, with specific reasons and profiling of our jobs behind them, that we picked. Then we switched from Oozie to Airflow from an orchestration perspective, with our metadata sitting in DynamoDB. For the observability stack at the top, the existing mechanisms continued, and we added Amazon CloudWatch, Prometheus and Grafana to our stack for monitoring - and not only monitoring post-go-live. There was an enormous amount of testing and iteration at each and every one of those stages: how long does it take to spin up a cluster, how much compute is it using, how many instances are being used, how much memory is being used, are we using it optimally? All of this needed to be continuously tracked for us to make sure we were on the right trajectory. Otherwise, at the end of the year I can't afford to go and say, at that point, that these jobs are not meeting our goal and we are breaking the bank. We can't afford that. So we needed very close observability right from the get-go all the way to delivery. The alerts to our Slack and email are the same.

Then the consumption stack. A key part is AWS Glue: as I mentioned, we were missing that interconnectivity across the platform. It's not just that Oozie was not capable or anything; this is where the version dependency also comes in, because some of our stack versions, including Oozie, were really old and simply didn't support it. So we used this as an opportunity. There was a huge project that happened in parallel to the EMR migration itself where we moved to AWS Glue for the metadata repository, and EMR took full advantage of it on the other side as well. On the right side you see the delivery stack. And all of this was done - this is the very important part - in a seamless manner, meaning our legacy stack and the future stack were constantly talking to each other and doing the jobs.

The reason being, there were two approaches, which leads to the next point. With thousands of jobs accumulated, I had two options in front of me. One is to go workflow by workflow: a workflow has 10 or 20 jobs; take each of them, move it end to end, nice and clean, to the destination, and then walk away. Ideally that's what I would love to do, but it was not practical in our case. First of all, for a platform of this age, ironing out each of the interdependencies, finding all the associated objects, all the libraries, all the OS dependencies, and then moving them together in one shot with all the validation - we would be in this for many, many years. So that was a real challenge. It's the right way to do it, but unfortunately it was not an option for us.

So the next option is basically job by job, and even in that, we went by impact rather than mere size. With thousands and thousands of jobs, instead of the typical 80/20 I would say it's a 70/30 rule: 30% of the jobs really delivered 100% of what we wanted to get. So we targeted the smaller set of jobs which are really heavy compute in nature and which take the juice out of the entire platform, and we moved them, even there, job by job. But the challenge with the job-by-job approach is that half of your workflow is running in the future state and half is running in the legacy state, and these two need to constantly exchange status to make sure the entire workflow doesn't fail and the customer is not disappointed, either by not seeing what they should see or because the data is super delayed. So this preparatory phase was very, very critical for us, and we did that.

So, the job and ETL classification - kudos to my team, my whole org in fact. They painstakingly sat and classified all the jobs into groups with a common compute profile, common storage profile and common demands from every aspect, including customer delivery and the criticality of the data. If any of you are considering that kind of migration, this is where you want to invest, even before you write a single line of code. If this phase is done well, you are well begun - that's what I would say. Our investment in this phase helped us so very much. And not only that: in this phase we also invested quite heavily, beyond the job classification itself, in building frameworks so that it would be rinse and repeat for anyone that comes after. Meaning we needed a development framework, a validation framework, a testing framework and a deployment framework, because we didn't want everyone going through the manual steps of all of this each time.

So rather, we invested in those frameworks and made it literally a cookie-cutter mechanism for our engineers, so that they don't need to think a lot or manage a lot: just set this configuration, here is your cluster that will spin up, and here is where you set your configuration for deployment and CI/CD. So this is the phase where we really built those dev, test and deploy frameworks, so that we could literally get into a factory model of several engineers jumping in and doing this entire migration work with zero downtime. I already mentioned backwards compatibility, and observability was critical all through the migration for us, not just after releasing it. At this point we also defined a development and migration strategy: with this many jobs, both in terms of development and in terms of deployment, once we did this whole classification we grouped them into 12 waves - 12 groups of jobs that are fully isolated for parallel development and deployment - because doing it sequentially was not an option at all.

So again, in the order of criticality. This is where I do not have that big of an engineering shop, because I have a live platform with tens of petabytes to sustain while we are building this and transitioning into the new. So this is where we took full advantage of AWS ProServe and had them partner with us on the whole development activity, so that we could focus on validation, testing, deployment and migration, because my people know my data really well; it's very difficult to expect somebody external to come in and bring that kind of expertise. So we focused more on testing, validation and deployment, making sure the previous track and the current track were producing the same output, while AWS ProServe, on the other hand, wave by wave, built, developed and deployed, then worked with us to hand it off. And, as I mentioned, we wanted to get all this done within a year, and even within that year we had to take into account every single game launch: we had so many freezes and deep freezes in the entire year that we had to factor all of that in to deliver.

So with the evaluation and the migration assessment, and once we built all those frameworks, the waves were completely iterative: we finish development, we go into the deployment stage, my internal team takes over, they start validating, verifying and pushing into production, comparing the previous and current state, while the next wave gets developed - a complete factory model, doing it end to end. The first period, January through March, was pretty much our framework and architecture investments, and then over the following year we delivered what we promised. This is how we executed the project overall.

So the job classification process, as I mentioned: 2,000-plus jobs. What are our production ETLs, and what percentage do they make up? Subscriptions are nothing but our data delivery mechanisms - the sinks, the destinations to which the data gets delivered. Somebody could be subscribing through an API, somebody could have a regular job running that pushes data out somewhere, somebody could be pointing their dashboard directly at us. Then our compliance jobs, like GDPR extracts and other things, and then our pre-prod environment ETLs. And we also used this opportunity to retire a whole bunch of jobs - how many times do you get to look back over 10 years and see what's really required? So we used this opportunity to do that as well.

Now, as I mentioned, the t-shirt sizes: all that our engineers had to do is what is on the left side. We made 20-plus cluster types and instance types readily available to them, and each of them has predefined configurations underneath: what is the instance type, what is the EBS size, is it On-Demand or Spot. All of those are predefined and pre-decided for them. They don't need to break their heads; just know your data well, know your job and its behavior well. As long as you know that, you can pick one of the 25-plus predefined configurations, spin up your instance, and go from there.

So the framework being config-driven really enabled us there, and it expedited our transition quite a bit. Also, very importantly, this limited our cost exposure, because if each engineer is going to make an instance choice - consider you are tuning for performance, tuning for SLA - I could throw a bunch of hardware at it and get the same SLA done, but I have broken the bank, or it could happen vice versa. This predefined mechanism really helped us contain cost and manage this whole migration in a better way.
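To show the shape of such a config-driven, t-shirt-sized catalog, here is a hypothetical sketch (not EA's actual framework): a small table of predefined cluster shapes and a helper that launches a transient EMR cluster from a size label, so an engineer only picks "medium" or "large" instead of individual instance settings. All sizes, instance types and role names are illustrative.

```python
import boto3

# Hypothetical t-shirt catalog: each size maps to a pre-approved cluster shape.
TSHIRT_SIZES = {
    "medium": {"InstanceType": "m5.xlarge",  "CoreCount": 4,  "Market": "SPOT"},
    "large":  {"InstanceType": "m5.2xlarge", "CoreCount": 10, "Market": "SPOT"},
    "xlarge": {"InstanceType": "r5.4xlarge", "CoreCount": 20, "Market": "ON_DEMAND"},
}

def launch_cluster(size: str, job_name: str) -> str:
    """Spin up a transient EMR cluster from a t-shirt size definition."""
    cfg = TSHIRT_SIZES[size]
    emr = boto3.client("emr")
    resp = emr.run_job_flow(
        Name=f"{job_name}-{size}",
        ReleaseLabel="emr-6.9.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": cfg["InstanceType"], "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE", "Market": cfg["Market"],
                 "InstanceType": cfg["InstanceType"], "InstanceCount": cfg["CoreCount"]},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # transient: terminate after the steps
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return resp["JobFlowId"]
```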

From our job orchestration perspective, while on-demand, transient clusters are something EMR provides and does really well, our challenge was that some of our jobs run frequently throughout the day. If your job takes 10 minutes but it takes 7 minutes to spin up an EMR cluster, that defeats the whole purpose - you've already missed 70% of your runtime.

So to mitigate that, we had long running clusters in addition to transient clusters. We kept long running clusters to a minimum - about 5% of our total clusters. This is where multiple jobs being deployed to the same cluster helped us. Jobs running every 15-20 minutes get deployed to the same cluster so they keep running while the majority of deployments are on transient clusters.
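For the long-running case, the usual pattern on EMR is to submit each frequent job as a step to an already-running cluster rather than creating a new one. A minimal sketch with the boto3 `add_job_flow_steps` call is below; the cluster ID and script path are placeholders.

```python
import boto3

emr = boto3.client("emr")

# Submit a frequent job as a step on a long-running cluster, avoiding the
# multi-minute cluster spin-up time for every run.
emr.add_job_flow_steps(
    JobFlowId="j-LONGRUNNING123",   # hypothetical long-running cluster
    Steps=[{
        "Name": "fifteen-minute-aggregation",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-example-bucket/jobs/aggregate.py"],
        },
    }],
)
```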

Data and metadata integration between HDFS and S3 - minimizing data movement or removing it completely - these were critical. After a job completed in HDFS, we used to copy data to S3. Now that movement is gone since EMR operates directly on S3 and EBS. For metadata, we made integration seamless through query rewrites. Customers don't know which catalog a table is in - Hive, Glue, legacy or current. We capture the query, rewrite it, and route it appropriately so migration is invisible.
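EA's actual rewrite layer is internal, so this is only an illustrative sketch of the routing idea: a thin function checks which catalog currently owns a table and rewrites the query accordingly, so consumers never need to know whether that table has been migrated yet. The table names and catalog prefix are hypothetical.

```python
# Hypothetical set of tables whose metadata has already moved to the Glue catalog.
MIGRATED_TABLES = {"telemetry.session_events", "telemetry.purchases"}

def route_query(sql: str, table: str) -> str:
    """Rewrite a query to the Glue-backed catalog if the table has migrated,
    otherwise leave it pointed at the legacy Hive metastore."""
    if table in MIGRATED_TABLES:
        return sql.replace(table, f"glue_catalog.{table}")
    return sql  # still served by the legacy catalog

print(route_query("SELECT count(*) FROM telemetry.session_events",
                  "telemetry.session_events"))
```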

Some other migration considerations:

  • UDF compatibility - evaluate custom UDFs and leverage native UDFs where possible
  • Small files issue
  • Data validation
  • Instance availability - cluster size and auto-terminate impact cost
  • Snowflake integration
  • Cluster sharing for multiple jobs

We used a "rinse and repeat" model for migration waves - pick a job, check dependencies, review config, deploy, and repeat. We built a development pipeline with CI/CD and integrated monitoring through CloudWatch, Datadog and Grafana. This gave us an SLA dashboard with fine-grained visibility.

Last year's holiday volume was more than double what we anticipated. Job-level isolation helped where previously we had queue-level isolation. One job doesn't impact another. This let us handle the highest ever volumes this year.

Daily data volume increased 113% year-over-year. We delivered 90 minutes of SLA improvement with 20% lower TCO. Most importantly, we maintained consistent SLAs through peak volumes.

There's a lot more we want to do - optimization, new technologies like EMR Serverless. AWS partnership has been vital for us to evolve the platform.
