What’s new in Amazon Redshift

All right, thanks everyone for joining our session today. My name is Neja Rintala, and I lead product management for Amazon Redshift. I'm delighted to introduce my co-speaker Matt Sandler, Senior Director of Data and Analytics at McDonald's, who is here with us to share McDonald's journey with AWS and Amazon Redshift.

Alright, so this is our agenda today:

First, we'll start with a brief discussion on the trends in modern cloud data warehousing.

We'll talk about Amazon Redshift's evolution and advancements over the last few years, including some of the exciting announcements at re:Invent this year.

And then Matt is going to talk about how all these innovations come to life at scale using McDonald's as an example.

Broadly, there are four trends that we see in data warehousing:

  1. The boundaries between the data lake, data warehouse and machine learning are blurring and the use cases are converging. We see that customers just want to maintain a single copy of the data and they want to do all types of analytics on top of it - SQL, Spark, machine learning and more.

  2. Near real time access to the data. Customers want to set up a seamless flow of data from their operational and transactional systems into their analytical stores, so that they can be much more proactive in their business operations, improve their customer experiences and turn it into a competitive advantage.

  3. And as you have all this data in one place, you want to make it available to all the users in your organization - data scientists, data engineers, developers - and ease of use is a critical element of that.

  4. And as you make data more broadly available in the organization, it is very important to have the right security and governance controls in place, so that the right users have the right access, all the data is centrally managed and you are able to govern data access.

Many of you might know that in 2013, AWS introduced Amazon Redshift as the first fully managed, petabyte-scale, enterprise-grade data warehouse, which made it really simple and cost effective to analyze large amounts of data. This was a significant leap from traditional on-premises data warehousing, which was not elastic, not agile, expensive and much harder to tune and operate.

Since then, we have added a lot of innovations to Redshift to keep pace with customer demands and expanded the horizons of data warehousing use cases - not just business intelligence and analytics at scale, but also machine learning, Spark integration, data sharing, monetizing your data and more.

So I just wanted to quickly walk through the evolution of Redshift over the years since its inception:

  • In 2013, we launched Redshift as the best price performance cloud data warehouse.

  • In 2017, we introduced data lake querying, so you are able to query open data formats such as Parquet or ORC in your Amazon S3 data lake. In fact, Redshift was the first query engine to offer high performance analytics on top of the data lake.

  • In 2018, we introduced elastic scaling to Redshift, whereby when you have more users and more queries, Redshift automatically adds capacity to maintain performance.

  • In 2019, we introduced Redshift managed storage, which I would say was a breakthrough for Redshift: as a customer, you were now able to scale and pay for compute and storage independently. With this model, the durable storage is in S3 and the local cluster nodes - the SSD nodes - really serve as caches for the data. Redshift managed storage is an intelligent software layer that decides what data should stay in the caches and what should be in durable storage.

  • Building on Redshift managed storage, in 2020 we introduced Redshift data sharing, with which you are able to spin up separate compute for your analytic workloads - for example, one type of compute for ad hoc queries and a different type for data science - all sharing the same data. So without making any copies of the data, you can provide compute isolation for your read workloads.

  • We also started investing in deeper integration with AWS services. We introduced streaming ingestion - direct streaming from services like Kinesis and Amazon MSK (Managed Streaming for Apache Kafka). Last year we introduced zero-ETL from operational databases like Amazon Aurora. And we also started expanding on the analytics side: we made Spark access easy from EMR and Glue.

  • And last year in July, we introduced Redshift Serverless, with which we have automated compute management. Redshift Serverless automatically provisions the compute, scales the compute and takes care of things like patching, so you can just focus on getting business insights. Serverless is pay for use: for example, if in an hour you are only doing work for five minutes, you literally pay for that workload duration, which is five minutes. So it's a pay-for-use system.

  • We also integrated with AWS Data Exchange so that you can easily consume third party data sets from your data warehouse. And you are also able to take your data warehouse data and make it available to your customers, your partners, your business ecosystem - license it and monetize it.

So with all the work that we have been doing, we have tens of thousands of customers using Redshift today. They are collectively processing exabytes of data. We have tens of billions of queries per week. Even some of our new capabilities are showing tremendous numbers.

For example, with zero-ETL from Amazon Aurora, in internal tests we see 1 million transactions per minute coming from multiple Aurora databases landing in Redshift and becoming available for analytics within 15 seconds. With streaming ingestion, from a single stream we see a throughput of 2.5 gigabytes per second. And with Redshift ML - a capability with which you can use SQL to create, train and invoke models - we have customers doing more than 10 billion in-database inferences per day.

So when we think about Redshift innovation - where we want to spend our time and what our roadmap should look like - there are broadly four pillars that we work on:

  1. From day one, we have been focused on offering best price performance at scale. So we do everything behind the scenes so that we can make sure you are getting best value from the system and you are also able to scale seamlessly as your needs grow.

  2. The second pillar is unifying all your data. We want to make it as simple as possible to bring data into Redshift, and also, when relevant, to query the data in place.

  3. The third pillar is being able to work with all the data for rich analytics. We want you to do SQL, Spark, machine learning and, increasingly, generative AI applications using all the data in Redshift.

  4. And none of this matters if you don't have a secure and reliable platform. This is fundamental for AWS services: we want to make sure the platform is secure and reliable and that you are able to collaborate on data with confidence.

Let's dive in a little to look at some of the innovations and new features:

On price performance, Redshift offers up to 66% better price performance than alternative cloud data warehouses. We have a lot of performance capabilities - columnar storage, MPP processing, a pretty sophisticated cost-based optimizer, materialized views, compilation caches, result caching - and they all contribute to the lead we have in terms of price performance.

And Redshift also scales as the number of users and queries grow so that we can maintain the costs as your usage grows.

For example, this is a scenario of a real life workload, a dashboarding application, where you have lots and lots of queries - a high concurrency, low latency workload with very small queries. These are very common patterns that we see in the fleet, and we have done a lot of optimizations for them. In this scenario, Redshift is able to offer up to 7x better price performance than alternative cloud data warehouses.

So we continuously look into the query patterns in our fleet to see what kind of optimizations we need to do so that we can be a lot more customer focused in terms of the innovations we are doing around performance.

For example, earlier in the year, we introduced efficient CPU encoding with which we are able to speed up string processing by 63x. Similarly, we have sped up encryption, and we have made our automatic scaling much more functional by adding write operation support. And sometimes we do these performance optimizations behind the scenes - you just experience the benefit.

For example, we recently started onboarding Graviton instances for Serverless, with which we are seeing up to 20% better price performance.

And customers do see these benefits in their deployments. GlobalFoundries is one of the world's leading semiconductor manufacturing companies, and they see 3x better performance with Redshift compared to their on-premises systems. They were able to improve their ETL performance, run their data loads much faster and bring a lot more data to the cloud for analytics using Redshift.

In addition to the core performance optimizations, Redshift also has industry leading predictive optimizations which means that the system is continuously monitoring your workload to figure out the strategies for improving the data layout, improving the compute management.

Redshift already does things like automatically figuring out the sort key, distribution key and encoding, sorting the data, automatic materialized views to seamlessly speed up repetitive queries, and automatic workload management. These are all capabilities we have in the product.

The benefit of this is that as you use the data warehouse more, performance automatically improves.

At re:Invent, we are very excited to introduce a new capability in Redshift called multidimensional data layouts. This is also a predictive optimization - a very powerful sorting technique with which we are able to speed up repeatable queries.

Unlike traditional sorting methods, which sort the data by a single column - like the cost column you see here - multidimensional data layouts sort the data based on the incoming queries' actual query filters. The benefit is that when you have large tables, the time it takes for us to read the data is much, much shorter.

In our internal experiments, we see up to 74% performance improvement compared to tables that have no sort keys. And the good thing about this feature is you don't have to do anything. The system observes the workload and figures out whether a single-column sort key is better or whether it should go for a multidimensional data layout, and it adopts the right strategy.

And when it comes to performance, it's also not just single cluster performance. With Redshift data sharing, we have thousands of customers that have deployed Redshift in a multi-cluster architecture. With data sharing, you can share live, transactionally consistent data within an AWS account, across accounts, across regions and across organizations. There is no copy or movement of the data.
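To make the mechanics concrete, here is a minimal sketch of setting up a read data share in SQL; the schema, table and namespace values are hypothetical placeholders:

```sql
-- On the producer warehouse: create a share and add objects to it
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA sales;
ALTER DATASHARE sales_share ADD TABLE sales.orders;

-- Grant the share to the consumer warehouse's namespace (placeholder GUID)
GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-guid>';

-- On the consumer warehouse: surface the share as a database and query it live
CREATE DATABASE sales_from_producer FROM DATASHARE sales_share
    OF NAMESPACE '<producer-namespace-guid>';
SELECT COUNT(*) FROM sales_from_producer.sales.orders;
```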

So we have customers deploying these kinds of common multi-cluster architecture patterns:

  • A data mesh where you have different business groups, they're all collaborating on data, no data movement required

  • Or a hub and spoke model where you bring data from different sources, put it in a central ETL or data processing environment and then you make it available for different analytics like ad hoc queries, machine learning, different workloads using different compute.

So at re:Invent, we have extended our data sharing capabilities and introduced something called multi-data-warehouse writes. With the data sharing capability we had before, you were able to scale read workloads - you could put your dashboarding workloads in one cluster and your data science in another. But with multi-data-warehouse writes, you can scale your data processing as well.

For example, you might say: at night I have a large amount of data to be processed, so I want to run that on a separate, large cluster. Throughout the day, I have incremental data trickling in, so I can use a different-sized cluster. And maybe once in a while you have a weekly or monthly compliance report to produce, and you can use yet another cluster. The beauty is that all of them operate on the same data.

So with multi-data-warehouse writes, a data warehouse can write to a database in a different Redshift data warehouse. Once the data is committed, it is visible to all the other data warehouses that have permission to access it. Another use case for multi-data-warehouse writes is live collaboration.

Many organizations have entity 360 data sets - like customer 360 - and they want different teams to collaborate to build those data sets. This is an example where sales, marketing and finance are all working on that customer 360 data: they can populate different columns and add data, so you are collaborating on live data sets.
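As a rough sketch of what that collaboration could look like once a share is writable (multi-data-warehouse writes were in preview at the time, so the exact permission syntax may differ; the database and table names below are hypothetical), a consumer warehouse writes to the shared database using three-part names:

```sql
-- On the marketing warehouse (a datashare consumer with write permissions):
-- append newly enriched attributes into the shared customer 360 table.
INSERT INTO cust360_from_producer.c360.customer_profile (customer_id, marketing_segment)
SELECT customer_id, segment
FROM   staging.latest_segments;

-- Once the transaction commits, the new rows are visible to every other
-- warehouse that has permission to access the shared data.
```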

We talked about how important ease of use is when it comes to scaling analytics in the organization. Last year, as I mentioned, in July we made Serverless generally available, and Serverless simplifies overall data warehouse management. It takes care of provisioning the compute, scaling the compute, patching and workload management - everything is taken care of - and it's a pay-for-use system.

One very interesting pattern we have seen in the last year is that customers use a combination of provisioned and serverless Redshift data warehouses to form multi-cluster architectures. This is an example: Peloton is a fitness equipment company known for its internet-connected bikes, where fitness classes can be streamed on demand.

They bring data from different data sources and process it with Spark and dbt, using Airflow for orchestration. They bring the data into a central Redshift cluster, which is a provisioned cluster. And then on the consumption side, they have different patterns of workloads: ad hoc queries, Looker dashboards and Tableau dashboards.

For all these different workloads, they have a combination of provisioned and serverless. And the nice thing is they are not duplicating any data: the data is in the central cluster and all the consuming clusters are simply reading the live data from the central environment. Play is another example, where they have hundreds of terabytes of data.

They are one of the early users of Serverless, and they have used Serverless as a data sharing consumer so that they can make the data available for ad hoc queries to their users. The benefit is that by clearly separating these kinds of workloads, they are able to maintain the SLAs on their recurring, steady-state workloads, and they have also reduced their monthly data warehousing costs by up to 20%.

We have been enhancing Serverless with a lot of capabilities throughout the year since its general availability, including a number of monitoring and manageability improvements - for example, cross-region snapshot support, with which you can do disaster recovery.

We have things like max RPU, with which you can say that at any point in time, I don't want Serverless to go beyond this much in resources. There is Secrets Manager integration, and something called custom domain names, so you can give a user-friendly name to access the data warehouse. A number of these improvements make the Serverless experience much more seamless, simplify operations and provide robust data protection.

One of the exciting capabilities we launched, actually as part of Peter's keynote, is Redshift Serverless AI-driven scaling and optimizations, and this is one feature we are super excited about. The idea is very simple: as a customer, you now have a very intuitive way to specify your price performance requirements to Redshift.

You can say: I want to optimize my system for cost, or I want to optimize for performance - give me the best performance possible - or keep it balanced. The default is balanced, but you can adjust the slider as needed. And Serverless takes care of everything else: provisioning the resources, scaling the resources and deploying optimizations so that we can actually meet that price performance level.

The other nice thing with this feature is that it adapts as your workloads change. For example, you might be processing a terabyte of data today and meeting your SLA in 10 minutes; tomorrow, your data size doubles, but you still want that job to finish in 10 minutes. This capability helps with that.

It automatically scales the resources so that we not only meet the performance level but maintain it as workloads change. This capability is in preview - we encourage you to try it out and give us feedback.

Let's move to the next pillar, which is unifying the data with zero-ETL. Over the last 12 to 18 months, we have been investing heavily in simplifying data ingestion into Redshift, because we have been hearing from customers that building and managing pipelines is hard, and sometimes these pipelines are fragile, which leads to delays in getting access to the data.

And sometimes these delays mean missed business opportunities, because the insights from transactional data may be time sensitive - they are valid only for a certain amount of time. So customers really want access to near real time data, and this is what we have been focusing on: near real time ingestion into Redshift.

This is where AWS's zero-ETL vision comes in: we want to provide an easy and secure way to enable near real time analytics on petabytes of transactional data, without you having to worry about ingestion pipelines and their management.

Once the data is in Redshift and you start to explore it, you might still want to transform it or aggregate it - you can do that. But the point of zero-ETL is getting you access to the data for analytics as quickly as possible.

We launched the general availability of the Aurora MySQL zero-ETL integration in the late October, early November time frame. It had been in preview for several months, we got very good feedback, and it has all been incorporated.

Beyond the Aurora MySQL general availability, at re:Invent we launched three more zero-ETL integrations into Redshift: Aurora PostgreSQL to Redshift, DynamoDB to Redshift and RDS MySQL to Redshift. All of these are in preview, and you can try them out. Working with zero-ETL is very easy, and the experience is consistent across all of these integrations.

You go to your source system - it could be Aurora, it could be RDS; in this case the example is DynamoDB - where you have a new option called integrations. You just click on it, choose your source DynamoDB table and then choose your target Redshift data warehouse.

You configure the resource policies to make sure it's all secure. And on the Redshift side, you just need to accept the integration, which you do by creating a database from the integration that you created in the source system. And that's it - the data automatically starts flowing and replicating; it is change data capture (CDC) based replication.
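On the Redshift side, accepting the integration is a single statement. Here is a hedged sketch - the integration ID is a placeholder, and the optional source database clause applies to the Aurora and RDS integrations rather than DynamoDB:

```sql
-- Create a local database that materializes the zero-ETL integration.
-- The integration ID comes from the source system's integrations page.
CREATE DATABASE orders_from_dynamodb
FROM INTEGRATION '<integration-id>';

-- For the Aurora/RDS MySQL integrations, the source database is named explicitly:
-- CREATE DATABASE orders_from_aurora
-- FROM INTEGRATION '<integration-id>' DATABASE 'orders_prod';
```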

As soon as data lands in your source system, within seconds it appears in Redshift, and you can go to the Redshift query editor - or whatever BI tool you are using - and start querying the data.

This example uses DynamoDB, which is a key-value system, so we bring the data in as the SUPER data type - Redshift's semi-structured (JSON) data type. With the DynamoDB data in SUPER, you can navigate it and convert it into relational form, so you can do any post-processing you want after the data lands in Redshift.
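As a small illustration of that post-processing (the database, table and attribute names are hypothetical), you can navigate the SUPER column with dot notation and cast pieces of it into relational columns:

```sql
-- Navigate the SUPER payload that the DynamoDB integration lands in Redshift.
-- 'orders' and its 'value' SUPER column are illustrative names only.
SELECT value.customer_id::VARCHAR            AS customer_id,
       value.total_amount::DECIMAL(10, 2)    AS total_amount
FROM   orders_from_dynamodb.public.orders
WHERE  value.order_status::VARCHAR = 'SHIPPED';
```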

Zero-ETL has been tried by many, many customers. Here are just a couple of examples to highlight the common feedback patterns. One common theme is "It used to take months for us to set up the analytics environment. It's super easy now."

Another pattern is "I am able to replicate data into Redshift without impacting my production systems," because zero-ETL happens at the storage level. It's storage-level replication - you are not using any of the compute on either the source or the target. That's a helpful scenario for customers, and of course there is the near real time access to the data, so you don't have to do this in a batch fashion.

Another capability we added last year is streaming ingestion. This is direct streaming ingestion into Redshift, avoiding S3 as a staging area. With this, you can take data from Amazon Kinesis or Kafka and ingest it directly into Redshift. The way it works is pretty straightforward: you create a materialized view in Redshift and point it at a Kinesis stream or a Kafka topic, and as new data lands on the stream, the materialized view in Redshift is refreshed and the data automatically comes into Redshift.
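Here is a minimal sketch of that setup for Kinesis; the stream name is a placeholder, and the same pattern applies to Kafka topics using an MSK external schema:

```sql
-- Map the Kinesis streams in the account to an external schema
CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE default;

-- Materialized view over a stream; refreshes pull new records directly
CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       partition_key,
       JSON_PARSE(FROM_VARBYTE(kinesis_data, 'utf-8')) AS payload
FROM kinesis_schema."my-click-stream";
```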

So as you can see, near real time data is a focus, and simplifying data ingestion into Redshift is a focus for us. We have enhanced streaming ingestion with capabilities like cross-account support and purging of streaming materialized views - the data may not stay relevant for long, so you need a way to delete it from these materialized views, and this is something we have introduced.

There are thousands of customers using streaming ingestion; in fact, it is one of our most widely adopted features because of its simplicity. Joomy is one example: they use streaming ingestion from Kinesis for risk control on their users' financial activity - recharges, refunds, rewards and so on - so they can see all the risk-related activity in one place and evaluate it on a near real time basis, unlike before, when they were doing it on an hourly basis, thereby improving the efficiency of the overall business.

So we have seen how we have made it easy to bring data into Redshift. We are also continuing to expand on querying data in place.

At re:Invent, we launched the general availability of Apache Iceberg support. Redshift already supports Parquet and JSON, and we have added support for Delta Lake, Hudi and now Apache Iceberg. So you can easily query the data from Iceberg, join it with the data in Redshift and start leveraging it for your analytics.

One of our customers who has standardized on Iceberg is able to leverage Redshift to read that data. We have also invested in data lake performance: when customers standardize on the data lake, keeping the data in open formats, we want to make sure we are providing the best possible performance as well, and we have improved data lake query performance by up to 45%.

One of the interesting features we launched at re:Invent is incremental refresh support for materialized views on data lake data. For example, if your data is in the data lake but you want to power a dashboard with it, this feature comes in handy: you go to Redshift, create a materialized view, it automatically gets refreshed, and you get access to the latest data with very good latency.
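To make this concrete, here is a hedged sketch of querying an Iceberg table registered in the AWS Glue Data Catalog and building a materialized view over it to power a dashboard; the catalog database and table names are hypothetical:

```sql
-- Map a Glue Data Catalog database (containing Iceberg tables) into Redshift
CREATE EXTERNAL SCHEMA lake
FROM DATA CATALOG
DATABASE 'lakehouse_db'
IAM_ROLE default;

-- Join data lake data with local Redshift data directly
SELECT c.customer_segment, SUM(o.amount) AS total_sales
FROM   lake.iceberg_orders o
JOIN   public.customers c ON c.customer_id = o.customer_id
GROUP BY c.customer_segment;

-- Materialized view over the data lake table to power a dashboard;
-- the incremental refresh announced at re:Invent keeps it up to date.
CREATE MATERIALIZED VIEW daily_lake_sales AS
SELECT order_date, SUM(amount) AS amount
FROM   lake.iceberg_orders
GROUP BY order_date;

REFRESH MATERIALIZED VIEW daily_lake_sales;
```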

Moving to the third pillar, focusing on analytics and machine learning. The first type of analytics customers do with Redshift is SQL, so we continue to invest in SQL. We have launched a lot of new SQL capabilities like MERGE, ROLLUP, CUBE, GROUPING SETS - different types of SQL syntax - and recently QUALIFY. The idea is that you can build rich applications but also migrate easily from traditional data warehouse systems.
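To make the newer syntax concrete, here are two small sketches with hypothetical table names: MERGE to upsert staged changes, and QUALIFY to filter on a window function without a subquery.

```sql
-- Upsert staged order changes into the target table
MERGE INTO orders
USING orders_staging s
ON orders.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET amount = s.amount, status = s.status
WHEN NOT MATCHED THEN INSERT VALUES (s.order_id, s.amount, s.status);

-- Keep only the most recent order per customer, filtering on the window result
SELECT customer_id, order_id, order_ts
FROM   orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_ts DESC) = 1;
```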

One of the new SQL capabilities we launched at re:Invent is Glue Data Catalog views. This is a cross-AWS-service feature: you create the view in the Glue Data Catalog once and you can use it across multiple engines. You can query it from Redshift, from Athena or from Spark, but you create the view once and secure it once - you are not repeating that three times for access by three engines. So this is a pretty handy feature.

This is one of the common requests we get from customers. As you can see in this example, the first step is creating the view in Redshift. Then you can customize it for the Athena dialect as much as needed, and the Athena user can come in and query the view. But as an administrator, you are managing the permissions for the view only once. We are also excited to introduce a few generative AI capabilities in Redshift.

The first one is that Redshift machine learning is now able to access large language models (LLMs). With Redshift ML, as I mentioned briefly earlier, you can use SQL to create, train and invoke models. We have extended Redshift ML with support for SUPER, the semi-structured data type, and because of that it can now integrate with SageMaker JumpStart, so you can invoke these large language models directly from Redshift ML. There are a lot of models available - text generation, text summarization, text classification and more.

And we hear requirements from customers like: I want to summarize product feedback, or I want to classify entities in the data stored in Redshift. The large language model (LLM) support is going to help with that.
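For context, this is the basic create, train and invoke pattern that Redshift ML has offered from SQL (the table, column and bucket names below are hypothetical); the new LLM support layers SageMaker JumpStart endpoints and SUPER inputs on top of this same interface rather than replacing it:

```sql
-- Train a model from a SQL query; Redshift ML handles SageMaker behind the scenes
CREATE MODEL customer_churn
FROM (SELECT age, tenure_months, monthly_spend, churned FROM churn_training)
TARGET churned
FUNCTION predict_churn
IAM_ROLE default
SETTINGS (S3_BUCKET 'my-redshift-ml-bucket');

-- Invoke the trained model in-database for inference
SELECT customer_id, predict_churn(age, tenure_months, monthly_spend) AS churn_risk
FROM   customers;
```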

Another capability we have introduced is generative SQL in the Amazon Redshift query editor, our out-of-the-box query experience, with which you can improve your query authoring experience by asking questions in natural language and getting SQL recommendations.

Generative SQL analyzes the user's intent, looks at past query history and analyzes the available schema metadata to provide accurate recommendations, and it is a conversational experience. So you can get business insights by asking questions without having to understand complex organizational metadata.

This feature is available in the query editor for you to try out.

Moving on to the final pillar. One of the nice things with Redshift is that we offer a lot of security capabilities out of the box: many authentication mechanisms, fine-grained access control, auditing, encryption - everything is there out of the box, at no extra cost. And we continue to improve our security capabilities; we now have column-level permissions, row-level permissions and dynamic data masking.
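As one concrete illustration of those controls, here is a minimal row-level security sketch with hypothetical table, column and role names:

```sql
-- Only rows for the 'EMEA' region are visible to members of the attached role
CREATE RLS POLICY emea_only
WITH (region VARCHAR(16))
USING (region = 'EMEA');

ATTACH RLS POLICY emea_only ON sales TO ROLE emea_analyst;

-- Turn row-level security on for the table
ALTER TABLE sales ROW LEVEL SECURITY ON;
```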

At re:Invent, we added support for metadata security: if a user has access to only one schema out of the 300 that exist, they see just that one schema and can query its data. Metadata security is very important in multi-tenant environments. We also added support for fine-grained access controls on nested data in late-binding views.

We have introduced table-level controls for data sharing consumers. Another capability we introduced - you might have heard the announcement at re:Invent - is Redshift's integration with AWS IAM Identity Center. This is again a cross-analytics-service feature that provides a unified identity across AWS analytics services: Redshift now integrates with IAM Identity Center to enable organizations to propagate trusted identities throughout the stack.

It's pretty straightforward: as a customer, you go to Identity Center and configure your identity provider - it could be Okta, OneLogin or Microsoft's identity provider - in one place, and those organizational identities get a single sign-on experience throughout the AWS stack. So you can come through QuickSight or the query editor, go to Redshift, go to Lake Formation, go to S3 - across that entire stack, users don't need to re-authenticate again and again; the same identity is propagated.

And the nice thing for administrators is that because these identities are passed through, you can do fine-grained access control using the organization's identities. This simplifies use cases where you have more than one user, more than one Redshift data warehouse or more than one AWS service involved.

Another capability we launched a few weeks ago is Redshift Multi-AZ deployments, with which we are raising the SLA of Redshift to four nines instead of three nines. In a Multi-AZ deployment, Redshift is deployed in multiple Availability Zones, so in the rare case of an AZ failure, your business continues to operate.

There are a couple of things about Multi-AZ that are differentiating and that you don't get with other solutions. One is that the endpoint - the application connectivity - is not interrupted: you keep the same endpoint while we fail over behind the scenes. The second is that the capacity we keep in the other Availability Zone is not just sitting around doing nothing; it's not active-passive. When everything is healthy, the capacity in both Availability Zones is actively serving workloads, so you get much better throughput. And in the rare case of a failure, requests continue to be served from the remaining Availability Zone.

So just to summarize, we have been enhancing Redshift with these different innovations. The idea is to bring together the data from all your data sources, make it very easy to access, and make it available for all types of analytics - not just SQL but Spark, machine learning and generative AI applications - on a secure, reliable, performant and scalable platform.

With this, I would like to invite Matt Sandler to talk to us about McDonald's journey. Matt has been a great ally, advocate and partner for Redshift, so I'm eager to hear from you. Thank you so much.

Thank you - very cool new features to summarize. So I'm going to talk about Redshift and how we've used and applied it at McDonald's over time. This started a little bit as a blog post where we shared driving the McDonald's business one bite at a time, and I'll go from there.

Some background context on McDonald's: all of you probably know your local McDonald's, and you hit a few when you travel, but a lot of people don't realize just how big we really are. We have 40,000 restaurants around the globe; only about 14,000 of those are in the US, so by far the vast majority are now overseas. We're in over 120 countries. Last year we did about $118 billion in restaurant sales, and just under 1% of the world's population eats at McDonald's every day.

So, a quick question - think about this for a moment: what's your favorite menu item? I'm going to count to three, then yell it out. One, two, three. I've gotten different answers - I think I heard Quarter Pounder with Cheese. Sometimes it's fries. I love the McRib, I know. And my three-year-old loves Happy Meals - it's fantastic.

We all have memories of McDonald's, or many of us do, going to the play places and being there. It's a great gathering place for a lot of people. So it's a huge chain, really quite dominant in the restaurant space.

So let's talk about technology at McDonald's. We've been around since 1955, and we were already reasonably large, though not as big as we are now, by the eighties, without a lot of technology. If you've ever seen The Founder, it's actually a great movie - we were really known for operational reliability, consistency, value and cleanliness. Those are the hallmarks of our brand when you think about it.

"But we've really grown in the last five years from a technology point of view. I have just a little bit of there we go. So digital channels are now driving our growth. And so our, we, we now have a mobile app that is last, I looked top food app on the apple store and google it comes with, we're now nearing to be one of the world's largest loyalty programs. On top of that, our kiosks are super popular. I've been to the airport a bunch of times and I see like a long line at the kiosk, no one in line for the front counter. And then you know, outdoor menu drive uh outdoor driver boards kind of very engaging. We also have a delivery program that's quite big. And now you can keep having new features with our mobile app. For example, when you order the food is hot and ready, right? As you show up, there's no weight, you can also do delivery to the home through the mobile app.

As you can imagine, this creates an enormous amount of data and a lot of opportunity, both to understand our business and to build data products on top of these digital platforms that drive the business. From a technology point of view, given our growth, like a lot of very large companies that have been around for a while, we have a lot of country-specific solutions that emerged over time: variations in our point of sale, different supply chain systems around the globe, different mobile app variations as we moved fast, and different marketing systems. But now we have a big push at the technology level to bring it all together - globalize, standardize, make it operate as one global stack - and we're making progress on this.

On the data side of all that digital data and all the subsystems, at the start of this journey we had silos of data, performance bottlenecks, surging amounts of data as these things rolled out and got so much usage, and then big delays in data processing and, of course, in business insights. So we needed to bring that data together in a unified way for analysis, look at it on a timely basis, and make it available for self-service access across the company.

I'm going to tell you a little about our journey with Redshift, in brief, and then go back through the steps. We started with one departmental Redshift cluster. We then went all in on AWS and Redshift to be fully based there. And then we've evolved with these new cloud-native features, some of which you saw in the prior presentation, that we'll cover here.

When we started, a few years ago, we had our on-premises data warehouse, there in the middle of the slide. We fed S3 with some of our core transaction data, brought in the new digital capabilities - the mobile app information and the back-end services - and created a US customer 360. Many of you from large, consumer-facing companies probably have something similar. We have customer slices, offer redemption information, and sales by channel, restaurant and menu. We did this in close collaboration with the Redshift teams - sizing, tuning and resizing as needed - and as you've seen today, a lot of the need to do some of those things has gradually gone away.

That was successful and we got a number of good quality results from it, so based on that success we made the big shift to go all in. All the core data that was in the on-premises data warehouse - customers, offers, sales transactions, all the supply chain information - we put into S3, and we used Talend as traditional ETL to fill our global data warehouse. However, at our scale, Redshift at the time wasn't able to handle all of that, so we added four more country-specific data marts in addition to the one we had before, and it took over 500 ETL and ELT jobs to bring it all together. So a lot to manage, but it worked and gave us a lot of value.

Along the way, we had a lot of AWS engineering visits and put in new feature requests for performance and workload management, and we'll see some of the benefits of that in a moment. On usage, we are now able to support over 3,000 BI tool users, and we opened up these large-scale main data sets for direct access for the first time, with over 100 BI users and SQL analysts starting to work with the data - our first exposure to self-service on the large data sets at McDonald's.

Now, where it gets really interesting for those of you already on this journey is the new features that have come on board and what they've enabled. This is a very similar diagram: in the middle you see that large global data warehouse with the country-specific data marts, each getting data from the main one but also holding some incremental, country-specific data. As the capabilities grew - particularly RA3 nodes for independent compute and storage, automated concurrency scaling so that as people work away we can scale up the clusters, and especially data sharing - we've been able to simplify dramatically and produce the pattern you see here, where we have another consumer of the global data warehouse for all these self-service workloads. We got rid of an enormous fraction of our ETL jobs, and data sharing is taking care of all that propagation for us. The row- and column-level security features that were mentioned have also been very valuable in making this work.

Then, filling out the story in terms of how we get the data in: as you may remember from a prior diagram, we had S3 with Talend. But now we've also been able to take advantage of transient EMR to spin up clusters - we still get a lot of batch data overnight, and we can very efficiently process it all in the wee hours of the morning to have our daily sales ready to go. And MSK, Managed Streaming for Apache Kafka, is now our eventing platform; we use Glue and Lambda to feed our primary producer and get very low-latency propagation of that data for availability.

This is a great example of where Redshift, which is more the right-hand side, is now getting the benefit of a lot of upstream AWS integration for us: high performance, cost effective, easily managed. It's been working well for us. To give you some summary statistics on how this pattern is coming together at McDonald's: it's very performant, with over 12 petabytes of actively used, long-lived data that we process. We have 50,000 users serving all parts of the company through data products as well as self-service analysts, and over 260,000 queries a day coming in from all the different BI dashboards, data products and users. So it has really met the needs we saw a few slides ago that were the big problem.

At the same time, we've done a lot of data quality work. That's not really the topic of this presentation, but we've been able to bring it all together - performant data systems and high-quality data - to drive business value, and I'll mention a couple of applications here.

One of the things we're doing to drive our business forward is marketing. We've had national marketing campaigns for a long time, but now we're also getting big into martech, where we look at customer-specific information. A great example is abandoned cart. We have Redshift with all sorts of historical purchase data, new information coming into Redshift on very recent purchases, and real-time data from the mobile app coming in through Managed Streaming for Apache Kafka. We combine them, and we just recently launched our first abandoned cart campaign: if you're using the mobile app and, based on what's in your cart, we think you're very likely to purchase but you haven't touched it in a little while, we can send you a push notification in the app. It's been fantastic.

Another example is in supply chain. We're long known for incredibly good availability of our core products, like the fries and the Big Mac, but when we do promotions it can be really hard to tune - we're not always sure how much of a product people are going to want, and we try to have enough for the promotion window but not too much more, so it's a capacity planning problem. Now real-time data can come streaming in, we combine it with historicals and marketing calendars, and we can tune ever tighter that right amount of availability during promotions.

These are just a couple of examples of things we're doing right now. So what does the future look like? From an Amazon Redshift point of view, we see Managed Streaming for Apache Kafka and the whole data streaming capability, as it grows within the ecosystem and ties into our data systems, really driving our event platform across the company. Write support for data sharing allows us to drive down latency across the data sharing cluster pattern you saw earlier. Redshift ML lets us up our game ever more on advanced analytics, becoming more efficient and more effective. And Multi-AZ Redshift matters for resiliency at scale: as you've seen with some of these applications, like that marketing example, if our cluster isn't available we now lose transactional kinds of integration.

As we build on top of our data platform, we have all this clean, great data on our restaurants and how things are going, and we're looking to use generative AI to access that structured data and bring it to life more effectively than we've ever been able to before for our restaurants. And we've talked a lot about our business-facing data - supply chain, marketing - but we also have other back-end data systems, like people data and finance data, and we're now working on bringing that unified SQL layer together for the company, so an analyst can just sit down and write a query that accesses any data across the company. That standardization, combined with these data sharing features, will drive tremendous gains in data accessibility at McDonald's.

That's the last slide. I want to take a moment to thank the Redshift team for being a great partner, and to thank my teams, who have done so much great work driving this over time and incorporating these great new features. It's been great to share the McDonald's story with all of you.

Thank you, Matt - an insightful presentation. It's nice to see how you're leveraging these innovations to improve your data platform even more.

So, just to wrap up: we have a lot of Redshift resources. You can go to our feature page and explore the new features - there are a lot of demos, use cases and customer stories you can try out. And if you want to get in touch with experts, there are links here. The slide deck will be available, so feel free to refer to the resources mentioned in the deck.

So with that, we can open it up.
