Modernize analytics by moving your data warehouse to Amazon Redshift

Hello and welcome to re:Invent. I hope you are having a really good re:Invent.

So, welcome to the session titled "Modernize Analytics by Moving Your Data Warehouse to Amazon Redshift". My name is Manan Goel. I'm a Principal Product Manager with the Amazon Redshift team - the team that builds Amazon Redshift, our cloud data warehousing service.

So for the next hour or so, we'll talk about four main things:

  1. We'll start out a little broad, talking about how we at AWS think about helping you with your data and analytics needs.

  2. Then we'll talk specifically about data warehousing: how Redshift is evolving, the key investment areas, and the new capabilities we are bringing to market with Redshift.

  3. We'll talk about some Redshift use cases. I'll share a couple of customer examples of different architecture patterns in which customers are using Redshift and getting value out of it.

  4. And then we have a real treat for you - we have two customers here today. They both happen to be in the airline industry, although Redshift has tens of thousands of customers across every industry. United Airlines and GE Aerospace will share how they have modernized their data and analytics with Redshift and the value they are getting as a result.

So just to kick things off, first of all, at AWS, we are committed to helping you get the most value out of your data and analytics. And the way we do that is with our AWS for Analytics vision which really helps you break free from data silos. It helps you bring data together from different sources in your organization, store it in a data warehouse or data lake, analyze it with the tools of your choice. And then it also gives you capabilities to make this data available pervasively within your organization. So all of your users can benefit from the goodness of data and analytics.

Now, specifically, the AWS for Data vision delivers capabilities around three main areas:

  1. It gives you purpose-built database, analytics, and machine learning services so you can use the right tool for the right job. For transactional workloads, for example, we give you databases like Aurora. For analytics workloads, we give you data warehouses like Redshift. For big data analytics, we give you services like EMR. So you can pick and choose the right tool for your use case and get the best price performance, features, and capabilities from each of these services.

  2. The second thing AWS for Data does is build out-of-the-box integration between these services, so data seamlessly flows from, say, your transactional database to your analytics database using capabilities such as Zero ETL. You don't have to build data pipelines - we do that work for you - so you can instead focus on building your applications, analyzing your data, and getting value from it (a minimal sketch of the Redshift side follows this list).

  3. And the third thing we do is provide out-of-the-box governance, security, and compliance, so you can focus on innovation and pervasively make insights available across your organization without worrying about those concerns - they are built into each of the AWS services.
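
To make that concrete, here is a minimal sketch of what the Redshift side of a Zero ETL integration can look like, assuming an integration has already been created from the source database (for example, Aurora MySQL); the integration ID, database, and table names below are hypothetical:

```sql
-- Surface the zero-ETL integration as a local database
-- (the integration ID below is a placeholder):
CREATE DATABASE orders_replica FROM INTEGRATION 'a1b2c3d4-example-integration-id';

-- Replicated tables are then queryable with plain SQL, no pipeline code:
SELECT o.order_status, COUNT(*) AS order_count
FROM orders_replica.public.orders AS o
GROUP BY o.order_status;
```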

Now, within AWS for Data, Redshift is our core data warehousing service. Tens of thousands of customers across different industries use Redshift every day to bring together terabytes to petabytes of structured and semi-structured data, analyze it with complex SQL analytics and other styles of analytics, and make the resulting insights pervasively available across their organizations.

Redshift has been in the market for over 10 years now. Since Redshift went generally available almost 10 years ago, the journey has always been to work backwards from customer requirements and bring out new capabilities that help you get the most value from your data and analytics.

Going back to 2013, when we launched Redshift, customers told us that as their data volumes exploded, they wanted a highly performing, cost-effective data warehouse. So we launched Redshift as the first massively parallel processing data warehouse in the cloud at an unmatched price - $1,000 per terabyte per year - which was practically unheard of in the industry.

Fast forward a couple of years: as data volumes continued to increase, we added capabilities like concurrency scaling, so you can scale to unlimited users and unlimited queries; data lake analytics, so you can seamlessly run SQL queries on data sitting in your S3 data lakes; and separation of storage and compute, so you can scale your data warehouse infrastructure and get the best performance from it.

More recently, there are four key areas where we continue to invest and bring out new capabilities:

  1. Continuing to build the best-of-breed data warehouse - delivering on things like ease of use, price performance, reliability, and availability.

  2. Making it easy for you to unify all of your data across databases, data lakes, and data warehouses. If you want to move data around, capabilities like Zero ETL make that easy without building data pipelines. If you want to leave data in place, capabilities like federated query and data lake analytics let you analyze it seamlessly where it sits.

  3. Continuing to add more and more analytics capabilities in the data warehouse - I'll point out some of the new capabilities there, but we continue to bring out new capabilities around SQL analytics, open source analytics with Spark, and machine learning with Redshift ML (a short sketch follows this list).

  4. And finally, helping you innovate faster with secure data collaboration - building security, governance, and compliance into the data warehouse so you get those capabilities out of the box, and also giving you capabilities like data sharing.
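
As a short sketch of the analytics capabilities in area 3, here is what Redshift ML looks like in practice: a model is trained with a SQL statement and then invoked as a SQL function. The table, columns, and S3 bucket below are hypothetical:

```sql
-- Train a churn model; Redshift ML handles the model building behind the scenes.
CREATE MODEL customer_churn
FROM (SELECT age, tenure_months, monthly_spend, churned
      FROM customer_activity)
TARGET churned
FUNCTION predict_churn
IAM_ROLE default
SETTINGS (S3_BUCKET 'example-redshift-ml-bucket');

-- Once trained, the model is callable like any SQL function:
SELECT customer_id,
       predict_churn(age, tenure_months, monthly_spend) AS churn_prediction
FROM customer_activity;
```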

So this week, a lot of announcements have already been made. I just wanted to point out a couple of key features in each of these areas:

  • In best-of-breed data warehouse capabilities: a couple of years ago we launched Redshift Serverless, which makes it really easy to get started with a data warehouse without worrying about infrastructure - nodes, hardware, and all of those things. You focus on building the data warehouse; we take care of auto scaling, patching, backups, and recovery. We continue to invest heavily in autonomics, using ML to give you the best price performance and scalability. We are also adding capabilities like multi-AZ support for scalability and reliability, which was recently announced in general availability.

  • In giving you the ability to analyze all your data: I hope you caught some of the announcements this morning around Zero ETL support. Aurora MySQL to Redshift is already generally available, and this morning we announced additional data sources - DynamoDB, Aurora PostgreSQL, and RDS for MySQL - with Zero ETL integrations into Redshift as well. Besides that, you also have capabilities like data lake analytics and federated queries, so if you want to leave data in place and analyze it, you can do that too.

  • In adding more analytics capabilities: we added open source analytics with Spark, which has been available for quite some time now, and ML with Redshift ML, which we continue to enhance with more features in this area.

  • And finally, in security: all the things you need - row-level security, column-level security, role-based access control, dynamic data masking, and so on.
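
As a minimal sketch of the row-level security mentioned in that last bullet (the table, policy, and role names are hypothetical):

```sql
-- Define a policy that only exposes rows for a given sales region:
CREATE RLS POLICY policy_emea_only
WITH (sales_region VARCHAR(30))
USING (sales_region = 'EMEA');

-- Attach it to a table for a specific role, then switch RLS on:
ATTACH RLS POLICY policy_emea_only ON sales TO ROLE emea_analyst;
ALTER TABLE sales ROW LEVEL SECURITY ON;
```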

And as a result, we see tremendous adoption and growth for Redshift. Customers definitely use their data warehouse for classic use cases like BI and dashboarding, but because of all the capabilities we've added, we also see the use cases evolving into other areas: self-service analytics, data mesh architectures built on Redshift, real-time analytics, data sharing and collaboration, and machine learning and AI analytics.

What I want to do over the next few minutes is share a couple of customer examples to whet your appetite for how customers are using Redshift in differentiated ways to get value from their data and analytics.

The first example I have for you is from a company called Peloton. Just a show of hands, how many people are familiar with Peloton? That's awesome.

As you know, Peloton is an internet-connected fitness equipment company. They are really popular for fitness equipment like exercise bikes and treadmills that connect to the internet, and they also offer subscription-based classes around that equipment. Their goal, they tell us, is to help people burn the most calories - they count the number of calories people have burned using their platform, which is in the billions, and that's what they pride themselves on.

Peloton is a great customer story about scalability - using the cloud to achieve tremendous scale in data warehousing.

Think about it: the company started around 2012. When COVID hit and gyms were closing down, fitness moved online, and the company saw tremendous growth in data volumes and revenue. Between 2019 and 2021, their revenue went from under $900 million to over $4 billion, and the user count went from under 400,000 to 2.7 million.

The way they scaled is similar to how many of you adopt Redshift. They started with Redshift early in their journey, in 2019, because of the performance, concurrency scaling, and ease-of-use capabilities Redshift offers. As volumes across their system increased, they adopted newer features like RA3 nodes for separation of storage and compute, and more recently Serverless and data sharing for self-service analytics.

In terms of the architecture diagram, this is what their reference architecture looks like. It's a classic hub-and-spoke pattern: a combination of data warehouses delivering the scale, performance, and capabilities they need to manage the growth they are seeing across their platform.

In the center, you have a provisioned data warehouse. On the left-hand side is a classic data pipeline: data comes from different sources and is processed with technologies like Spark and dbt, then lands in the central hub data warehouse - a provisioned environment into which data is continuously fed.

On the spoke side, they have multiple data warehouses, both provisioned and serverless, which use data sharing to read from this common copy of the data. These spoke data warehouses give them workload isolation and the ability to scale. Workload isolation means teams don't step on each other's toes - the processing for dashboarding workloads stays separate from, say, more resource-intensive workloads.
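
To illustrate the mechanics behind a hub-and-spoke setup like this, here is a minimal sketch of the producer side of Redshift data sharing; the share, schema, and namespace identifiers are hypothetical:

```sql
-- On the hub (producer) warehouse, publish a schema as a data share:
CREATE DATASHARE fitness_share;
ALTER DATASHARE fitness_share ADD SCHEMA public;
ALTER DATASHARE fitness_share ADD ALL TABLES IN SCHEMA public;

-- Grant the share to a spoke warehouse's namespace (placeholder GUID):
GRANT USAGE ON DATASHARE fitness_share TO NAMESPACE '86b5169f-0000-0000-0000-example';
```

Each spoke then creates a database from the share and reads the hub's data live, without copies moving between warehouses.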

It's a really interesting modern architecture: they started with a single data warehouse cluster, then evolved as data volumes and user counts grew into this multi-cluster, hub-and-spoke pattern with specific data warehouses for different workloads.

By the way, this architecture pattern also helped them save a lot of money, which is really important. With Redshift Serverless capabilities like auto pause and resume, many of these spoke data warehouses - for Tableau workloads or ad hoc workloads - don't need to run on the weekend. They pause automatically because nobody is querying them, and Peloton saves money on compute as a result.

So it's a really interesting architecture pattern if you're looking for scale, flexibility, performance - all of those kinds of capabilities.

The second example I have for you is from another company called Fannie Mae. Show of hands, how many people are familiar with Fannie Mae? Alright, quite a lot of you - that's pretty awesome.

Fannie Mae's claim to fame is making the dream of home ownership possible. When you buy a home, you typically get a mortgage from a bank like Bank of America. These banks then package those mortgages and sell them to a company like Fannie Mae to get them off their books, so they can fund more home mortgages.

So Fannie Mae plays an important part in home ownership in America - one in four homes is touched by Fannie Mae. That means lots of data and a lot of PII requirements. What Fannie Mae wanted was to build a data mesh architecture where individual business units have the freedom to innovate at their own pace, with agility and ownership of their own data.

But they also wanted to promote sharing and collaboration between these different teams, so everybody can learn from and benefit from each other. They implemented a multi-account strategy with AWS where each business unit has its own account, and within those accounts they run their own operational systems and analytics systems, like data lakes and data warehouses.

They also have a centralized AWS account running the AWS Glue Data Catalog, which acts as the central place for data mesh governance and as a marketplace for data.

The idea is that each of these business units can operate in its own environment and innovate at its own pace. But if somebody is looking for a shared data set, they can go to this data marketplace run by the Glue Catalog, search for a particular data set, and trigger a just-in-time access control mechanism: the request goes to the owner of the data, who can approve or deny it, and the requester then gets access to the data.

So it's a really interesting pattern: this company is able to innovate a lot faster, gain agility, and keep separate ownership of data with decentralization, yet still benefit from shared governance and collaboration.

The final example I have for you is from a company called Moderna. Again, show of hands - how many of you are familiar with Moderna? Very nice. How many of you got the Moderna vaccine, by the way? Very nice. For some reason, I feel the side effects of Moderna are a lot less than Pfizer's, but, you know, whatever.

All right. Moderna, as you know, is a pioneer in mRNA vaccines, and data and analytics are really important in their scenario. As a pharmaceutical, life sciences, and healthcare company, they are looking at real-world data.

Real-world data for them means things like clinical trials and data from researchers. If you look at how this data is typically onboarded within organizations, the top of this slide is what they had before.

Basically, a spaghetti of ad hoc processes: if you want to onboard third-party data, you typically go to the data provider and buy the data set, the provider sends it to you, you FTP it into S3, and then you build a data pipeline to load it into Redshift.

So it was a really complex process, with no governance or centralized sharing. Moderna went from that architecture to the one at the bottom, using AWS Data Exchange and the Redshift data sharing capability. AWS Data Exchange is basically a marketplace for data.

Data sellers and providers can list their products there. Moderna teams can go there, search for the data sets they want, click a button to subscribe, and access the data within their Redshift data warehouses as data shares.
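
On the consumer side, a subscribed data share surfaces as a database you can query directly. A minimal sketch, with hypothetical share, account, and table identifiers:

```sql
-- Create a local database from the subscribed data share:
CREATE DATABASE real_world_data
FROM DATASHARE clinical_trials_share
OF ACCOUNT '111122223333' NAMESPACE '86b5169f-0000-0000-0000-example';

-- Query the provider's data in place - no FTP, no load pipeline:
SELECT trial_id, enrollment_count
FROM real_world_data.public.trial_summary
LIMIT 10;
```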

Imagine the benefits of going from the ad hoc, error-ridden mechanism at the top to the architecture at the bottom: they cut their data onboarding process from eight days to less than three days, and as a result they can bring vaccines to market a lot faster.

So these are pretty interesting use cases - what customers can do with Redshift using newer patterns like hub and spoke, data mesh, and third-party data acquisition.

At this point, I'd like to bring up Sanjay Nayer from United Airlines. Sanjay will talk about his journey with Redshift and how United went about modernizing its data and analytics architecture.

Thank you, Manan, and good afternoon, everybody. I'm glad to be here to share United's story. Again, I'm Sanjay Nayer, and I head up our data, AI, and analytics engineering teams at United. I'm going to add the customer perspective to Manan's product and technology story: our journey of transforming our data in the cloud with Redshift.

We'll start with a quick introduction about our team, then talk about our problem statement and the challenges we faced, then our solution - how we worked and partnered with AWS to design an architecture that laid the foundation for our transformation - and then close out with our path forward and where we feel the journey will continue over the near future.

Our mission at United Airlines is to be the best airline in aviation. And that's more than just the number of flights per day, the size of our network, the number of destinations we fly to, or the number of customers we carry.

It's also to deliver the best customer experience, drive the highest customer loyalty, and engender an engaged employee workforce. But what's really exciting is our growth story, which we are calling United Next. It's about adding more than 700 aircraft over the next few years, almost doubling our fleet.

And what's really exciting is the way we are going about it, which is our motto: Good Leads The Way. It's about doing good for the communities we are privileged to serve and the customers we carry on our aircraft.

Now, one of our leaders has called United a technology company with wings. If that's a true statement, I would say data is the fuel for the technology that drives the airline. And we have a really complex business, whether it's flight operations, network operations, inflight, revenue management, or tech ops and maintenance.

It's data that really connects the dots between these complex business domains and divisions so they can create value for our airline. The data engineering team that I have the privilege of leading covers our commercial, operations, and digital domains - the entire enterprise.

What we faced with our prior technology platform and architecture was, of course, scale: over 50 business domains, thousands of users, hundreds of thousands of workloads. But the real challenge was that the data was in silos - and that's an understatement. This was primarily because of the variety of legacy platforms we had, through no fault of anybody.

Over the past 20-25 years and more - we are a 100-year airline - people brought in solutions to solve problems, and those eventually became point solutions. So it was not just data silos; it was huge walls we had to overcome to get data together. And this is hard to believe, but it was not just data between domains that was a challenge - sometimes even within a business domain, it was a challenge to join data together and come up with the analytics and the business value that was needed. And of course, there was inconsistent data quality because of the difficulty of bringing these data together with joins.

So what was the outcome? It took weeks and months to get data to our business users. And the end result was not actionable insights - it ended up being departmental reporting, because value at a higher level could not be generated.

And of course, we had big challenges around cybersecurity: because of the sprawl of data systems, we could not take a consistent approach to data security, which is really critical in the current world.

So how did we tackle it? We got together with our strategic partners - especially AWS, as our cloud partner - and we figured out a way to get aligned with our business first. We went on a real listening tour, and that's really important: we listened to all of our business users to understand what their needs are.

There are big, huge business domains and small ones as well, and we had to cater to all of their needs. So one of the big first guiding principles was a purpose-built data access layer - Manan alluded to it a little earlier - and what this really did was cater to the needs of everybody, built on top of a lakehouse architecture with AWS.

What we also heard from our business users was an insatiable demand for data. We talked about the growth we have, so there were going to be big, huge volumes - and scalability from a volume perspective was, of course, a big strength of the cloud and AWS.

The second need was around more and more real-time data, to get real-time operational insights. This was the second V - velocity - that our business users needed, and again, this was a strength of the architecture we had. The third was around variety of data.

During COVID especially, we had COVID documents that our customers were uploading on our digital channels, and those had to be figured out - also voice as a data asset, and images as well. These were all data artifacts that could not be handled in our previous legacy platforms, and that is something we had to take care of.

And the bigger need was that our internal customers didn't want to hunt and peck for data all over the place. So we had to unify our data - distributed in a business domain architecture, but in one place so they could get to it - which leads to the next point: an interconnected data platform where they could get value out of it quickly.

All of this had to be done while we migrated data from our legacy data platforms. That meant we couldn't hang a "pardon our dust" or "under construction" sign outside our shop and say we'd get back to your problems later. We had to solve problems and add business value at the same time we were doing this whole transformation and migration of data. Those were some of the big challenges we had to address with the solution.

So what were some of the key decisions and learnings? I'm going to focus on these four - this is the part of the session where we share the information and tell you some of our learnings.

First, focus early on data quality. The fourth V, if you will, is veracity - trust. All of our business users wanted to trust the data, and to get there we had to have mechanisms in place to make sure the data was validated.

The second was around reliability: an observability framework implemented with the right logging, monitoring, and tracing, to build on the trust our customers wanted to have in our platform.

The third was around resilience and business continuity. More and more data, as I mentioned, was in real time, with operational implications and impact, so that had to be built in - and AWS again was a strength here, with cross-region replication. That was a very important learning for us as well.

The fourth, as I mentioned earlier, was that we were laying the runway at the same time planes were taking off and landing. It feels like an impossible thing, but that's what we were doing: building our platform at the same time we were migrating.

So we leaned very heavily on AWS and partnered closely with them to make sure the capacity planning for our cloud infrastructure - a combination of provisioned and serverless - was done with that in mind. Those were some of our key learnings. As for the journey, it happened primarily during COVID, and it brought to life the saying that you don't want to let a crisis go to waste.

While on some days there were more employees in our airports than customers, we were spending the time in data engineering and digital technology to architect and build out the platform that would hold us in good stead when we knew customers would come back. And that's the point we hit almost two years later.

In this journey, one of the most exciting things we built was around machine learning, on top of what we call our United Data Hub - our data platform on AWS. That really delivered value from a machine learning perspective with the data we had.

Some of the results are here for you to see: we grew our data warehouse 3x, our batch pipelines 6x, and our streaming pipelines 3x as well. These outcomes built a virtuous cycle - more and more data is now getting onboarded as trust in the data builds.

Now, this is the quintessential architecture diagram - a little bit of an eye chart - but I want you to take a few key points away from this slide.

The first fundamental philosophy shift in this architecture, which is somewhat latent here, is that in the past it took weeks and months to get our data to market, primarily because folks had to spell out every little data field and every aspect of the metadata had to be documented. In this new architecture, we could move our data en bloc into S3 and then, in a more agile fashion, move the data we needed into our Redshift data warehouse. That was the first fundamental philosophy shift.

The second was around the layers - Manan talked about this as well - the purpose-built architecture and the separation of storage and compute. That gave us more flexibility and more rigor around cybersecurity, which is critical in this environment. These two fundamental shifts really helped us in that area as well.

And of course, Redshift is the crown jewel of this architecture and environment. It's fully managed - AWS takes care of the infrastructure - and we made use of some key performance capabilities that Manan talked about as well: concurrency scaling, to make sure our queries perform; and minimizing data movement while maximizing data sharing, with things like federated queries, which we use to make that happen. That was really helpful for our users as well.
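
For reference, a minimal sketch of what setting up such a federated query looks like in Redshift SQL, assuming a PostgreSQL operational store; the endpoint, ARNs, and table names are hypothetical:

```sql
-- Map a live operational PostgreSQL database into the warehouse:
CREATE EXTERNAL SCHEMA ops_pg
FROM POSTGRES
DATABASE 'operations' SCHEMA 'public'
URI 'ops-db.example.internal' PORT 5432
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:111122223333:secret:ops-db-creds';

-- Warehouse tables and live operational rows can then be joined in one query:
SELECT f.flight_number, g.gate_status
FROM flights AS f
JOIN ops_pg.gate_assignments AS g ON g.flight_id = f.flight_id;
```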

From a pipeline perspective, for batch pipelines we are largely an AWS Glue shop, but there's more and more demand for real-time and streaming pipelines, and we use EMR with MSK and Kinesis for those as well. We also built accelerators and adapters bolted on top of this platform. What you see in the middle is what we call ICON, a framework we built that gave us flexibility around audit, balance, and control frameworks, and also accelerated our data engineering development activities on the cloud.

Again, AWS was a big help that we leaned on in creating this hybrid environment - a combination of provisioned and serverless - which was really useful and positive for us. But this architecture was the technology side of things, and that's not all we wanted to transform: we also wanted to transform our data culture, and that's where this technology platform really helped.

The first thing I'll talk about is machine learning, which I mentioned earlier. We built what we call our MARS platform on top of UDH, our United Data Hub, and that really accelerated the availability of data for machine learning - 80% of ML, as you're aware, is having the right data in the right place. So we were able to build a machine learning platform with a feature store, feature catalog, model registry, all of that, on top of our data platform.

The second, of course, is enabling our business users to connect all this data together, and we built a data mesh architecture for this. That really accelerated the availability of data and empowered our data scientists and analysts to self-serve their data. With the adapters and accelerators I mentioned, some pipelines could be entirely self-served by our data scientists - a capability they didn't have earlier - with accelerated time to market as well. The goal is building analytics communities that share knowledge and information and train people where needed, on their own. It's a bottoms-up approach that is spreading what we call analytics everywhere, which is important for us to really gain value from all the work we did on our data platform.

Those were some of the key benefits we got from this work. In terms of our path forward, we want to continue to partner with AWS as they build more and more capabilities into Redshift. Some of the key ones we're using now and will use in the near future are cross-region data sharing, minimizing data movement with Zero ETL integration, Serverless - with more and more capabilities coming there - as well as Redshift Spectrum.

And as this data platform becomes more and more operationally critical, it's going to be essential that it is resilient and reliable. Self-healing data pipelines are something we're working on, with more and more automation made available in the platform so that it becomes resilient and highly available.

All of this, of course, is about interconnecting data with the data mesh for all of our business users, so they can continue to gain more and more value out of the data. The two real key takeaways: we want the data to be interoperable, and we want to minimize data movement, so our data engineers can focus on creating data products that are ready for our business users to use - so they, in turn, can create products that deliver a great customer experience for our customers as they fly our aircraft.

That's our story. Thank you. And with that, I'm going to bring up Alcuin for his story on GE Aerospace.

Great - and thank you, Sanjay, for that. As mentioned, my name is Alcuin Weidus, and I'm a Senior Principal Data Architect at GE Aerospace, within our Strategic Shared Services organization. What that means for our organization is that we host, maintain, and manage the strategic analytics platforms that our business builds its data products and analytics on top of. With that power, as I'm sure you can all imagine, comes some great responsibility.

I'm very excited to tell you our story about our usage of Amazon Redshift. Before we jump into that, I want to give you a little bit of background on our business at GE Aerospace. When thinking of GE Aerospace, I had no other place to start than with our purpose statement.

We invent the future of flight, we lift people up, and we bring them home safely. That's something every member of our team holds deeply in their view of themselves and the company. And it's not just a belief we all have; it's a responsibility we take with the products we build. I really thought there was no better way to connect that mission and purpose to you all than with this story in terms of numbers.

If we look at the numbers here: every two seconds, an aircraft with GE engine technology is taking off somewhere around the world; three-quarters of all takeoffs are powered by GE technology; and - perhaps the most impactful - over 650,000 people are flying at any given time on GE engine technology. I imagine many of you flew on these aircraft as we were all getting into Las Vegas for re:Invent.

So this is something that connects each and every one of us to GE Aerospace and to our company. Now, getting into the story of our data modernization and how we did it with Amazon Redshift, I first want to take you back roughly 10 years, to where our story started.

We like to think of this as our data lake 1.0 - our first foray into the data lake pattern. I'll give you some metrics around the scope and the user base we serve, and a look into the tech stack that we were then able to modernize over the last roughly two and a half years.

Within our data lake 1.0 scope, we had 300 or so source systems from around the business, all brought together to give our citizen developers the chance to build trusted analytics and data products in our environment. These source systems come from all over the business - ERP systems, finance systems, mainframes; you name it, we probably have it - and all told, this is over 100 terabytes of data.

What this means is a lot of data ingest, but also a lot of data creation. On the ingest side, we have roughly 15,000 base tables - the tables we bring into this analytics environment and then serve up for all of our developers to build on top of.

Now, as you can imagine from the next stat - over 100,000 data objects - there is a lot of creation within this environment. We bring in 15,000 or so objects, but the real power comes from all the developers and data product engineers building analytics within the environment. All told, we have thousands of data developers around the business using this environment and relying on it to power their trusted analytics and their business processes.

From a tech stack perspective, this stack probably doesn't look that different from many others around this room. We had an enterprise partner ELT solution - one piece doing continuous change data capture and one funneling bulk data transfers into our environment, a very typical truncate-and-reload pattern.

We also had a powerful MPP data warehouse, and we had built custom orchestration tooling that was a key enabler for our end users to build and maintain their own pipelines within our environment.

At the end of this data platform, we have our BI and reporting layer. This is where the majority of our users interact with our data and with the analytics built in this environment. Some of them also take advantage of our low-code tooling for advanced analytics.

What that means is they're able to use more advanced languages - Spark, Python, R - to build and maintain their analytics within this environment. And all told, this environment ran on premises. As I mentioned, we started about 10 years ago, so it was a powerful environment, but it obviously came with some limitations.

About those limitations: when it really comes down to it, our legacy platform lacked agility. It could be as powerful as it wanted to be, but what we were hearing from our business community and our end users - the people out there building and powering these analytics - is that they demanded more.

Our lengthy hardware and software procurement cycles sometimes took upwards of six months to plan ahead, thinking through the scale and licensing needs we would have to meet to allow our business to continue developing and building these very powerful analytics. That six-month time frame could be challenging.

I like to think of it as tossing a ball out in front of you and doing your best to estimate how far to throw it, because eventually your usage of the environment catches up to it. And not only that, we had a very monolithic system design. That isn't always inherently a problem, but for us it was fragile: everything funneled through one shared pool of compute, and a lot of competing interests operating within that monolith tested that fragility on a day-to-day basis.

And lastly, when we talk analytics, we have to talk about the value and the power they bring to the business. A problem with our legacy environment was that it was siloed off. As I mentioned, it was on premises, which meant this really strategic storage layer could not be tapped into from all the different areas of our business, or by the teams that were already adopting cloud, for example.

So with these limitations and this need for agility and flexibility, roughly two years ago we embarked on an effort to modernize this platform, take it to the next level, and partner with AWS in doing so.

You may ask yourself: why Amazon Redshift? For us, it was about addressing those issues with our data lake 1.0 platform, and about making us nimble - doing right by our end users, the ones who were testing that fragility day in and day out.

From an Amazon Redshift perspective, the flexible and scalable storage layer set us up for the future from day one. With the advent of newer features like concurrency scaling and other innovations within Amazon Redshift, we knew the service would scale with us: as we threw more at it, it would be able to catch it and handle it.

Not only that, it was a technology - and a partner - ready to support our more distributed business. Our business built up around our data lake, and analytics in that environment were built very centrally. But what we've seen over the last few years is a more distributed architecture within our company, with numerous decentralization efforts underway, and we had to be able to support that with our technology as well.

Amazon Redshift does that: it allows us to adopt more of a data mesh pattern with something as big and critical as our data lake environment. And lastly - something I'll get into with an architecture diagram and some looks ahead - the deep integrations of the ecosystem.

That's something we were extremely excited to embrace with Amazon Redshift and its partner AWS services, because it allowed us to unify much more of our analytics and bring them into an area where other teams, other domains, other parts of the business could interact with this critical layer - without us having to move everything around or send copies off to other parts of the business.

When we talk about unified, it's hard to do that and not also talk about trust, because for us those two go hand in hand.

This is a pretty stock image of modern architecture that you probably see all the time, but what I want to do is really dig into the main point of trust. We look at the Amazon ecosystem, at Amazon Redshift, and at the S3 layers, and it's the unification of these services that allows us to evolve our architecture and adopt a more modern approach in the process.

We also get to deliver more trusted analytics to our end users, and the first way we do that is by breaking down silos. Silos are very challenging for trust, and what we observed in our legacy data platform is that they tend to promote duplication of data - what we like to think of as copy sprawl. Everyone loves even our most widely used data sets, but they copy them for their own use, and that can be a pretty challenging situation to be up against.

What we were looking for was a way to unify the storage so data doesn't have to be copied to be useful, and to build our technology in a way where the right behaviors are encouraged. To do that, we had to embrace a culture of produce once, consume many - more of a modern pub/sub take on data warehousing. This allowed us to start to normalize the idea of producers and consumers within our environment.

And again, it's another behavior encouragement: we encourage producing really important, trusted data sets, but also listening to your consumer community and allowing them to consume in place, as opposed to copying everywhere to make use of your data.

And lastly - I alluded to it before - the decentralization of our business and that domain orientation. The modern architecture sets us up perfectly for that. You hear domain orientation, data mesh, all of those things; we want them because our business is organizing itself in a similar fashion, and our technology should align with that and encourage the right behaviors.

So, all of this being told, we believed in Amazon Redshift, we took on this modernization journey, and we kicked it off roughly two and a half years ago. I'm now happy to share our end-state solution architecture with you as well.

Again, this may be a familiar type of chart, albeit with some slight nuances, but I'll talk you through it and hit the key points. This is the platform and solution architecture we were able to get to, and it delivered on all of the items we wanted to address and modernize.

From left to right: we still bring in those 300 or so source systems, into our producer landing zones. We have isolated, at an account and cluster level, our ability to produce data out to the rest of our environment. This supports not only our current environment but also that growing trend of decentralization.

When we bring data in, we process it through EMR and store it in one of the big three open table formats - so we deliver on that lakehouse architecture before we even get to the Amazon Redshift layer. From there, we sync those tables with Amazon Redshift and produce them outward.
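
For reference, a minimal sketch of how open-table-format data cataloged in AWS Glue can be exposed to Redshift through an external schema; the database, role, and table names are hypothetical:

```sql
-- Expose the Glue Data Catalog database that tracks the open-table-format
-- data in S3:
CREATE EXTERNAL SCHEMA lake
FROM DATA CATALOG
DATABASE 'producer_landing'
IAM_ROLE 'arn:aws:iam::111122223333:role/SpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Lake tables can then be queried next to native warehouse tables:
SELECT engine_model, COUNT(*) AS reading_count
FROM lake.engine_telemetry
GROUP BY engine_model;
```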

On the right side, you see our consumer cluster account. This is where the bulk, if not all, of our end users interact with our data - on Amazon Redshift directly, or through some of the partner AWS ecosystem services: EKS, advanced analytics, and our BI and dashboarding solution as well.

Something you may notice from this architecture diagram is that we're able to serve different user personas. Within our business, that's the reality: we have different personas and a growing compliance need for least-privileged access. We're able to do this within our solution, and then share data across - or, as we like to think of it, "read down" - from persona A to persona B where warranted. So we're able to do this and still stay forward-leaning on compliance and cybersecurity in our environment.

That's extremely important to us, because the safety of this data and these analytics is always number one. Another call-out I'll make, because you may have seen the icon, is that all of this runs within AWS GovCloud and the more protected regions of AWS, which was an extremely big win for us.

As you can imagine, we take advantage of a few very cutting-edge features in this architecture, and in deep partnership with the Amazon Redshift team and a lot of teams across AWS, these advanced features are being delivered to the GovCloud regions. That's something we're able to tap into, protecting our data while giving our end users and our business community this more modern architecture. That was a huge win all around.

Now, before moving off - and the reason I can stand up here with confidence in the solution and in what we have moving forward - we have this cloud promise: performance features keep being released, with the numerous announcements that happen at re:Invent and all throughout the year. That continuous investment in Amazon Redshift gives us and our business confidence in partnering with AWS to make sure we're delivering not only for our users today but for our users tomorrow.

What this means is that we're able to deliver these features to our end users and deliver on that cloud promise, where features are released sometimes overnight - and the very next day, workloads run better. We get improved performance across our BI and analytics use cases, or maybe it's just the simple fact that your dashboard loads a little faster.

Items like Redshift Serverless autonomics or improvements in auto WLM are the types of things that, when released, are just background processes we take advantage of. Data lake analytics and all the goodness you see happening with Spectrum and improved Glue Data Catalog management - that's again something we take advantage of with that open table layer.

And I mentioned the ability to scale: concurrency scaling is at the heart of that. Any time concurrency-eligible operations are improved, our scale improves, and our time and performance on BI workloads improve.

And I'll leave you with all the things that probably aren't even on here. We saw some really fun and exciting announcements in the way of LLMs and Redshift ML. It just keeps getting better, and that's really the promise we want to pass right along to our users - to give our data product community an ecosystem that is a thriving place to design and keep building value for their business processes.

Alright - I want to say thank you for hearing our journey and letting me tell the story of our modernization. I'll pass it back over to Manan for some closing remarks.

Alright - thank you, Sanjay and Alcuin. Hopefully we gave you folks some ideas here in terms of new Redshift features. If you're not using some of these new features, you might want to take a look at them.

If you're new to Redshift, I also want to encourage you to look at it. You don't have to go it alone: at AWS, we provide a number of resources and tools for you to be successful with your migration journey.

If you're getting started on your migration journey and want to come to Redshift and do a POC, or if you're already using Redshift and want to investigate another feature, there are a number of programs available to you - everything from help with technical resources to migration incentives and credits.

And then there's tooling across the board - migration tooling and other things like that. So hopefully you can take some of these learnings from today's session and implement them when you go back to work on Monday. A couple of other things I wanted to leave you with - a few resources.

If you want to continue to learn more about Redshift, you can visit the web page: there are customer success stories and links to newly launched features, as well as LinkedIn channels and books if you're interested.

By the way, congratulations if you are participating in the Data Heroes program - you have earned it: you have completed an AWS analytics superhero session. Feel free to scan that code and add it to your accomplishments.

And finally, thank you for attending the session. Please leave feedback so we can continue to bring sessions like this to you in the future.

I just want to thank Sanjay and Alcuin for taking the time, and all of you for joining us today. We'll be here...
