Good morning everyone. So we had some technical difficulties, which actually helped us warm up. We think we might extend that favor to the audience just to warm you up as well. So, really quick, by way of standing up at your seats: for how many of you is this your first re:Invent?
We were. Yeah. So ideally we get everybody to stand up and we'll count up from there. You're the first breakout after Adam's presentation, and we wanna make sure your blood's flowing; we're right after lunch here. So if this is your first or second re:Invent, please take a seat.
Wow. So it's great to do these in person again, but the majority of the room just sat down. So how about if this is your fourth or fewer re:Invent, please take a seat. So we have two people, three people. Now the battle begins. All right, we're gonna go to five. OK, down to two. So what's the next number you want to pick, Taz? Oh, I think we have one. Yeah, exactly. No. Thank you for playing along. Thank you. It's great to be back in person with everyone, and Taz and I are both excited.
All right. Well, good morning again. My name is Imtiaz (Taz) Sayed. I'm the Tech Leader for Data Analytics at AWS. And my name is Adam Driver, not the actor Adam Driver, but the AWS employee Adam Driver. I'm the worldwide analytics solutions architecture leader.
All right, great to see you all in person. And thank you for taking the time for this session. Quite an ambitious topic. Democratizing analytics. Thank you. Yeah. So our intent today is to provide a simple perspective on how AWS works towards the democratization of analytics for our customers, the differentiating factors that drive this experience and how AWS incorporates these factors into its services and solutions.
Our storyline starts with exploring the attraction that data has towards the cloud, the corresponding adoption and growth challenges it brings for our customers, and how democratization of analytics helps mitigate these challenges. We'll then look at two of the most compelling democratization differentiators as seen from our customers' viewpoint and dive into a few of these with the help of demos. So let's begin by acknowledging how far along we have come today.
In fact, I remember the time when the biggest decision used to be: which relational database should we choose? And today, the biggest question seems to be: should we do this on premises or should we do this in the cloud? So, working with our customers on their cloud journeys, we see several cloud deployment strategies: hybrid, cloud only, cloud first and on premises.
So again, a quick warm-up exercise just to get a sense of how this audience is distributed, in terms of where your organizations are with regards to their cloud deployment strategies. Adam? Yeah, why don't we just define what these are first, guys?
So cloud only is pretty self-explanatory, right? It's where anything you build or do is going to be only in the cloud. We won't make an on-premises version; it'll be natively born in the cloud. Cloud first is anything new goes to the cloud, and then maybe you think about expanding it to on premises later, but the leading factor is that the cloud deployment comes first. Hybrid is where we combine these, right: the on-premises and the cloud, and maybe even multiple cloud platforms, not just a singular cloud platform.
And then on premises is pretty much exactly what it means: those deployments are on premises, and there's no cloud strategy or deployment in mind for that set of applications. So you thought you were done with active participation? Not yet. By a raising of your hands, we wanna see what the distribution looks like across this audience, starting with how many people are in on-premises-only orgs.
So if you could raise your hand if you're predominantly an on-premises organization today, no cloud, just on premises. Let's see, a couple of hands up. All right, a couple of hands here, Adam. OK. How about if you're a hybrid organization? OK. Thank you for participating. Looks like about half the room. OK, that's about what we would expect. Cloud only? Awesome. All right, that's great, close to half the room, nice. And then if you're a cloud-first organization? Slowly raising their hands, but I think I see about a quarter of the room.
So again, thank you for your participation; it gives us energy to see your hands go up and come down. It's about what we expected. Based on this O'Reilly study, 90% are on that hybrid deployment, followed by cloud first and then cloud only, and on premises is obviously the smallest percentage we would expect. That is consistent with the usage patterns that we see with our customers and with similar studies like the O'Reilly one we just called out, right?
So Adam, what this is telling us is that we are at an inflection point where most of our customers, most of our users are already working with data or will be working with data that is either distributed across different platforms, different solutions or even different cloud providers. There is a huge diversity of choices available today, both on premises as well as on the cloud and not just in terms of storing data but also new ways of working with data.
So what is driving organizations towards the cloud? Several reasons, actually. Customers are seeking a lower organizational footprint to manage their environments. They want to take advantage of components of the cloud to help them do things that they couldn't do before, and they want to build for the cloud instead of in the cloud by using managed and serverless products.
So what are the dynamics of this cloud movement when it comes to data? For the most part, we have data moving from on premises to the cloud, and we see data exhibiting a gravitational force: as applications and data move to the cloud, all of the adjacent disciplines, the machine learning, the data warehousing, all of the applications that work with data, are also pulled into the cloud. And as the center of data gravity shifts, this behavior just keeps on increasing.
This study by Flexera shows the different rates at which many big data and analytics disciplines are moving to the cloud today. Another data point is really around the impact of this data gravity on COTS, or commercial off-the-shelf, software solutions. Those providers are moving towards subscription and consumption pricing models and getting away from the classic on-premises model of perpetual licensing and maintenance. Most SaaS providers today actually have a cloud-first or cloud-only strategy for their new capabilities: faster, more agile deployments, et cetera.
And in some cases those capabilities are only in the cloud, with no intention to actually do it on premises. In fact, there are multiple studies; it's not just what we're seeing working with you, our customers, and with other organizations, it is actually the reality that we have in our marketplace. An IDC study and a Gartner study further amplify this gravity effect.
And as of September this year, 2022, just a few months ago, the cloud accounts for 44% of all big data and analytics spend. It's a big number: 90 billion. And by 2024, organizations plan to move over 70% of their data analytics workloads to the cloud. Customers are feeling a sense of urgency, even in these economic times, to get started now.
And applications and data are moving to the cloud, especially with the hybrid, cloud-first and cloud-only deployment strategies. What this is telling us is that the customer transition to the cloud is a multi-year, long-term effort. Data, and all the things that work with data, are not only moving to the cloud at different rates, but this movement also brings unique challenges for our customers when you're operating between on premises and the cloud, and in some cases, more than one cloud provider.
And like I mentioned, it brings several challenges for our customers: the cost of managing data, not just in the affordability sense but also in terms of governance, data sharing and data discovery; the challenge of making your data work and interoperate with other types of data, meaning different formats and different storage types; automating your data processing functions and operations; and when scale becomes a function of speed to achieve performance. And finally, there is the challenge of organizations looking to become data driven. AWS understands these challenges faced by our customers, and we are helping them by democratizing analytics, where AWS makes analytics available, accessible and affordable. AWS understands that the cloud requires a fundamental cultural shift, and we help address that shift by ensuring that the right tools are available to all roles and functions working with data, and that these tools are accessible not only in terms of user experience but also in terms of not requiring customers to break the bank. Availability, for AWS, is about choice.
It's about providing services and solutions that span all personas, roles and functions working with data. This AWS analytics mind map demonstrates that choice. The AWS cloud provides solutions and services that recognize the data gravity effect and the corresponding needs of customers moving to the cloud for their analytics and machine learning needs. This is a representation of the diversity of choices available today, with the reach to provide for every data function regardless of where they are in their data journey, helping break down barriers that may come up due to cost, scale, skills and experience.
So let's drill in a little bit closer to this mind map. And we're gonna start with the diversity of choice available to data analysts; perhaps we have some in the audience today. Besides their conventional role, quite often we see data analysts today performing the analyst function in areas of streaming ingestion and machine learning, working to collect, clean and catalog raw data for high-quality inputs, and doing this without code.
The second one I wanted to highlight is the data scientist, where we're definitely training and deploying machine learning models and collaborating using notebooks, sometimes multiple notebooks, to develop and debug data science applications. Similarly, AWS has choices available to the data engineer, and I know we have some of those in the audience, to build the data processing workflows, the orchestration and the data pipelines, and to rely on governed tables for compliance and overall data governance. And lastly, the other one I wanted to highlight is for the developers, which is incredibly important: to be able to process data with the preferred applications of their choice, such as Apache Spark, Flink or Python.
We were chatting with someone before the presentation who has had tremendous cost savings by building applications a little bit differently. So what this is telling us, again, is that every customer is unique, their requirements are unique, and the diversity of services and solutions available on AWS exists so that customers have the ability to make the choice that is best suited to their unique situation. And availability of choice is the first step towards democratizing analytics.
Now, let's look at making it accessible and affordable. So we work with a large number of customers, different types of customers, from customers with a very large footprint in the cloud to customers who are just starting out, customers across different industry spectrums: ad tech, retail, start-ups, manufacturing and so on. And this is what we have learned: their cloud journey has multiple factors that influence their rate of cloud adoption, factors such as cost, skills, security and compliance.
These are all important factors, but none more compelling than ease of use and price performance. These are the differentiating factors that vastly influence their cloud-adoption-related business decisions. So working together with our customers and learning from them, we've broken down these differentiators into prescriptive attributes, and we'll demonstrate how these attributes reflect the AWS way of building services and solutions.
I will cover ease of use, followed by Adam, who will take us through price performance. Thank you, Adam. So let's take a deeper look at ease of use. The ease of use differentiator comes down to three main attributes for the customer: a low entry barrier, reduced operations, and a low-code/no-code experience.
The low entry barrier is about minimalistic requirements. So we are building services and solutions that are intuitive enough for a user to just pick up and run with. These enable users to move quickly from start to finish rather than spending cycles learning the tool, and they encourage data users to wear multiple hats by removing stringent skills requirements.
Next, reducing the operational burden is when customers can offload their operational overheads: for example, identifying the events to monitor, building automation to react to those events, and notifying or performing self-healing actions. These are all necessary operational details; however, they do not add value to the overall customer outcome. Many AWS services are designed to shift that burden away from the customer to AWS.
A low-code or no-code experience provides building blocks that can be assembled to customer needs. Users can move quickly by eliminating many aspects of traditional software development. A large number of AWS services facilitate quick deployments and faster iteration cycles that in turn boost user productivity, and they also lend themselves to a high degree of automation and customization, thereby reducing operational expenses.
Now, ease of use is best experienced when demonstrated. So let's take a look at an example with a demo.
All right. So our first demo is a common scenario around working with data from SaaS applications. Software-as-a-service, or SaaS, applications play an important role in many customer analytics pipelines. In this scenario, we'll be ingesting Salesforce data, running transformations on it, cataloging it for business use, and preparing the data for downstream analysis with other data sources in a data lake.
The demo is also published on the AWS blog for those who would like to try it out later. The AWS services we will be using for this demo include Amazon AppFlow, a visual, no-code data integration service. It makes it easy to set up data flows between third-party SaaS providers and AWS services. AppFlow has a point-and-click user interface: zero coding required.
Then we'll be using AWS Glue DataBrew, another visual tool, for data preparation at scale, with over 250 prebuilt transformations available. It simplifies connections to multiple data sources and provides an intuitive, interactive interface to explore and clean data.
And then we'll be using the AWS Glue crawler and the Glue Data Catalog. These automatically extract the schema and partition structure and store the metadata in the Glue Data Catalog, an Apache Hive-compatible metastore. And then finally, we'll be using Amazon Athena, again an interactive and serverless SQL query service natively integrated with AWS Glue, so your tables and data are quickly available to query and dashboard for upstream reporting. We bring these services together to create a simple architecture pattern.
We'll be using Salesforce as the third-party SaaS input and create a data flow with AppFlow, connecting to Salesforce to ingest data.
We'll then leverage DataBrew to simulate a few transformations on the extracted Salesforce data and store it in S3, which will then be crawled and cataloged using Glue Crawler and Data Catalog respectively. And then finally, we'll query the data with Athena.
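As an aside for readers: the same pattern can also be scripted with the AWS SDK rather than the console. Here is a minimal boto3 sketch of the AppFlow piece, assuming a Salesforce connector profile and an S3 bucket already exist; every name below is a placeholder for illustration, not the exact configuration used in the demo, and the task list is deliberately simplified.

```python
import boto3

appflow = boto3.client("appflow")

# Create an on-demand flow that pulls the Salesforce Opportunity object into S3.
# "salesforce-dev" and "my-analytics-bucket" are placeholder names for this sketch.
appflow.create_flow(
    flowName="salesforce-opportunity-flow",
    triggerConfig={"triggerType": "OnDemand"},
    sourceFlowConfig={
        "connectorType": "Salesforce",
        "connectorProfileName": "salesforce-dev",
        "sourceConnectorProperties": {"Salesforce": {"object": "Opportunity"}},
    },
    destinationFlowConfigList=[
        {
            "connectorType": "S3",
            "destinationConnectorProperties": {
                "S3": {
                    "bucketName": "my-analytics-bucket",
                    "bucketPrefix": "appflow/opportunity",
                }
            },
        }
    ],
    # Map every source field straight through to the destination.
    tasks=[{"sourceFields": [], "taskType": "Map_all", "taskProperties": {}}],
)

# Kick off the flow on demand, just like clicking "Run flow" in the console.
appflow.start_flow(flowName="salesforce-opportunity-flow")
```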
So let's jump into this demo, a recorded video that we'll walk through.
All right. So we'll go to the AWS console and make our way to AppFlow and begin by creating a new data flow. We give the flow a name. The flow needs a source, and for this demo we are going with Salesforce. We connect to the source and give the connection a name. I'm using a free Salesforce developer account.
Over here we grant AppFlow access to my data in Salesforce. By the way, data and other artifacts such as tokens and keys are encrypted by default, with the choice for customers to bring custom encryption keys as well.
Now we need to provide a destination. We choose the Opportunity object from Salesforce. I also go with the default automatic Salesforce API option to make it easy to switch between standard and bulk API calls. A pre-created S3 bucket with a prefix is what I'm using to collect my output data.
And we define a trigger. The next configuration step gives us options to map the source and destination fields, define storage partitioning and aggregation settings, and optional data validation steps if needed. And this is followed by the ability to define filters to restrict the flow to only data that you would need to work with.
Once we are done with the configuration, we run the flow. It takes a quick few seconds for the flow to complete, so the flow's done over here. We then move to the DataBrew console to clean up our Salesforce data.
We start by creating a project that includes reusable and scalable constructs like a recipe, which is an ordered list of all the transformations that we will be performing on the data. We also define a source dataset, which in this case we point to the AppFlow dataset that we just created.
DataBrew does support a wide variety of additional data sources, including anything that can talk to a JDBC endpoint. We also create a new IAM role for DataBrew to be able to talk to AppFlow.
The next step demonstrates the ease of use of data manipulation. Over here we will be working with some prebuilt transform functions using simple point-and-click actions, like splitting the CloseDate column values into unique Year, Month and Day columns. We'll also be working with the Amount column to do some Z-score binning to flag certain outliers in those values; just an example of the kind of transforms you can do easily in DataBrew.
Now, on the right, you will see that our recipe is being built as we apply these transforms. Once the recipe is done, we publish it for future reuse. And once that is done, we move ahead to creating a job in DataBrew that will apply those transforms to the Salesforce data.
Creating the job is a simple step of identifying the dataset and the recipe that you want to use for this job. I'm also using the same bucket that I created earlier with a different prefix, but it can work with a different bucket altogether. We reuse the earlier IAM role to grant DataBrew permissions to transform the AppFlow dataset, and then we create and run the DataBrew job.
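If you would rather script this step than click through the console, a rough boto3 equivalent might look like the following. The dataset, recipe, bucket and role names are placeholders for this sketch, not values taken from the demo.

```python
import boto3

databrew = boto3.client("databrew")

# Create a recipe job that applies the published recipe to the AppFlow dataset
# and writes the cleaned output to S3. All names below are placeholders.
databrew.create_recipe_job(
    Name="opportunity-cleanup-job",
    DatasetName="appflow-opportunity-dataset",
    RecipeReference={"Name": "opportunity-cleanup-recipe", "RecipeVersion": "1.0"},
    RoleArn="arn:aws:iam::123456789012:role/DataBrewAppFlowRole",
    Outputs=[
        {
            "Location": {"Bucket": "my-analytics-bucket", "Key": "databrew-output/"},
            "Format": "PARQUET",  # columnar output keeps downstream scans cheap
        }
    ],
)

# Start the job run, the equivalent of "Create and run job" in the console.
run = databrew.start_job_run(Name="opportunity-cleanup-job")
print(run["RunId"])
```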
So this job takes a couple of minutes for us. Once it finishes the run, we confirm the output in S3. So we have our data over there and then we quickly move to Glue to crawl this dataset and catalog it in the Glue Data Catalog.
And right from the Glue Data Catalog, we can view it directly in Athena, as we'll see in a couple of seconds. You can also extend this flow to feed into JDBC-supported dashboards. And that was basically a quick example of creating simple and effective data pipelines with third-party data.
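The crawl-and-query steps can be scripted too. A minimal sketch follows, assuming the DataBrew output landed under a placeholder S3 prefix; the database, role, table and column names are illustrative rather than the exact ones from the demo.

```python
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Crawl the DataBrew output so its schema and partitions land in the Glue Data Catalog.
glue.create_crawler(
    Name="opportunity-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="salesforce_demo",
    Targets={"S3Targets": [{"Path": "s3://my-analytics-bucket/databrew-output/"}]},
)
glue.start_crawler(Name="opportunity-crawler")

# Once the crawler finishes, the cataloged table is immediately queryable from Athena.
athena.start_query_execution(
    QueryString=(
        "SELECT stagename, SUM(amount) AS total_amount "
        "FROM databrew_output GROUP BY stagename"
    ),
    QueryExecutionContext={"Database": "salesforce_demo"},
    ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
)
```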
Now let's switch gears from data integration and data preparation to data warehousing with Amazon Redshift, a cloud data warehouse to analyze all types of data, log files, transactional data, click stream data. The managed Redshift service abstracts away most of the data warehousing administrative overheads such as performance tuning, maintenance and scalability while easily supporting complex use cases such as data sharing, federated query and machine learning.
And now Redshift Serverless not only brings all those benefits with it, but it also takes away the infrastructure management overheads of a data warehouse. So let's talk about how straightforward it is to do machine learning with Redshift Serverless.
Redshift ML makes it easy for users to create, train and deploy machine learning models using SQL commands. It's a two-step process. The first step is training, where you run the CREATE MODEL SQL command to create and train the model.
Redshift securely exports your data to S3 and calls Amazon SageMaker Autopilot, a machine learning service that automatically prepares the data, selects the appropriate prebuilt algorithm and applies the algorithm for model training.
Next, the training step is followed by running inference to do prediction where it uses Amazon SageMaker Neo, another ML service that optimizes machine learning models for inference. It compiles the best model and deploys it locally to Redshift as a user defined function.
How about we experience this through another demo?
Right. So for this experiment, we will use a linear learner algorithm to predict the age of an abalone. Linear regression algorithms are typically used for cases where you want to find out the revenue over time or you want to do predictions over customer value over time.
By the way, any guesses? I'll be honest over here: when I first did this demo, I had no clue what an abalone was, and I'm sure some of you don't either. So, it's actually a snail. Thank you. And this is also based off of a blog, so if you would like to try this out later, you can use this link.
And also we are handing out AWS credit vouchers at the Analytics booth in the Expo. If you would like to use those to run these experiments, please feel free.
So let's walk through this demo.
All right. So we start by provisioning an Amazon Redshift endpoint. We navigate to the AWS console and use the default settings to create a default namespace and a workgroup. A namespace is a collection of database objects. And a workgroup defines the compute and networking resources that will host that Redshift infrastructure.
Clicking on the Save Configuration button initiates the provisioning setup. Redshift Serverless scales automatically based on workload and also encrypts your data by default. And within a few minutes, Redshift will create my serverless endpoint.
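The same provisioning can be done programmatically. A minimal boto3 sketch, using illustrative names and a base capacity value you would tune for your own workload:

```python
import boto3

rs = boto3.client("redshift-serverless")

# A namespace holds the database objects; a workgroup provides the compute.
rs.create_namespace(namespaceName="default", dbName="dev")
rs.create_workgroup(
    workgroupName="default",
    namespaceName="default",
    baseCapacity=32,  # Redshift Processing Units; an illustrative starting point
)
```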
I explore my default namespace over here and navigate to the Security and Encryption section to attach the Amazon Redshift default IAM role to it. This role simplifies SQL operations that access other AWS services. So I go to Manage IAM Roles; for the purpose of this demo, I already have this role assigned to my endpoint.
All right, I save my changes and then I move on to the Amazon Redshift Query Editor v2 directly from the console itself. Once the editor loads up, I connect to my dev endpoint over here. And the first thing that we do is we create a table in the dev instance that will hold this abalone dataset, which is an aggregation of different physical measurements.
And then we use the COPY command in Redshift to load that CSV-formatted data from a publicly available S3 bucket into the table we just created. We then create two additional tables. One is the training table, which holds 80% of the data; this is the data we will be using for training. The other is the validation table, which holds the 20% of the data that we will validate our results against.
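For reference, here is roughly what that load looks like if you drive it through the Redshift Data API instead of the query editor. The bucket path and workgroup name are placeholders, and the table uses the standard UCI abalone columns rather than the exact DDL from the demo.

```python
import boto3

rsd = boto3.client("redshift-data")

def run_sql(sql):
    """Submit a statement to the serverless workgroup and return its statement id."""
    resp = rsd.execute_statement(WorkgroupName="default", Database="dev", Sql=sql)
    return resp["Id"]

# Table for the raw abalone measurements (standard UCI abalone columns).
run_sql("""
CREATE TABLE abalone (
    sex VARCHAR(2), length FLOAT, diameter FLOAT, height FLOAT,
    whole_weight FLOAT, shucked_weight FLOAT, viscera_weight FLOAT,
    shell_weight FLOAT, rings INT)
""")

# Bulk-load the CSV from S3; IAM_ROLE default uses the namespace's default IAM role.
run_sql("""
COPY abalone
FROM 's3://my-analytics-bucket/abalone/abalone.csv'
IAM_ROLE default
CSV
""")
```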
Once we have these tables set up, we now create a model in Redshift using the CREATE MODEL command against the training table holding the 80% of the data. When we create this model, we specify several things, but two in particular: one is the problem type, which we will come to in a bit.
So we've created the model, and now we're looking at how it is doing. The two things we'll be looking at: one is the problem type, which is the linear learner problem type, because that is the problem we are running predictions on. The other is the objective: we are using MSE, mean squared error, a common metric used to measure linear regression problems.
We are also checking a few things in terms of the IAM role, the S3 bucket and the timeout that we have given. This particular CREATE MODEL command takes about two hours to run to create the model; it's also working with Amazon SageMaker, as I mentioned earlier. We come back in two hours and look at how our model is doing as it moves from training to ready, and we see that, yes, it has now moved to the ready state.
We also make a note of the MSE value it has determined for us: 4.25. Now, right after this, we run the prediction query against the 20% validation dataset, which lets us validate the accuracy of the model against both datasets. And we see that it returns a value of 4.93, which indicates that the model is accurate enough relative to the actual values from our validation dataset.
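For those who want to try this later, the two statements at the heart of the demo look roughly like this. Table, function and bucket names are placeholders, and rings is the standard abalone target column used as a proxy for age; treat this as a sketch of the pattern rather than the demo's exact SQL.

```python
import boto3

rsd = boto3.client("redshift-data")

def run_sql(sql):
    return rsd.execute_statement(WorkgroupName="default", Database="dev", Sql=sql)["Id"]

# Train: Redshift ML exports the training data to S3 and hands it to SageMaker.
run_sql("""
CREATE MODEL predict_abalone_age
FROM abalone_training
TARGET rings
FUNCTION predict_abalone_rings
IAM_ROLE default
MODEL_TYPE LINEAR_LEARNER
PROBLEM_TYPE REGRESSION
OBJECTIVE 'MSE'
SETTINGS (S3_BUCKET 'my-analytics-bucket', MAX_RUNTIME 7200)
""")

# Predict: once the model is READY, the compiled model runs locally in Redshift
# as a SQL function against the 20% validation table.
run_sql("""
SELECT rings AS actual,
       predict_abalone_rings(sex, length, diameter, height, whole_weight,
                             shucked_weight, viscera_weight, shell_weight) AS predicted
FROM abalone_validation
""")
```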
And pretty much just like that, we see the ease of use with Redshift ML to solve regression problems. And with that, I would like to thank you for your time again and hand over to Adam to take us through price performance with AWS.
"Um but they run directly against the data in your S3 data lake. And what does that mean? You don't have to move the data, transform the data. Although we have technologies to do that, you're not forced to do that or that's not a requirement in order to get some benefit.
We launched EMR in 2009. It's been a big journey, and we have constantly been innovating in areas such as performance to make sure EMR is the best place to run your workloads, specifically Spark workloads as well as workloads with other frameworks.
We see true benefit with Amazon EMR specifically because we're allowing people to move to EMR in a seamless fashion. One of those benefits is that you don't have to change your application code to embrace EMR. So if you're moving from Cloudera, Hortonworks or another provider, you can actually drop directly onto Amazon EMR without changing your application code to get going.
We know that it's important for people to be able to deliver and scale these applications very quickly. So we want you to easily create clusters and provision hundreds or even thousands of compute instances to process your data at scale. EMR will automatically do that for you and size up and down based on utilization. And then you only pay for what you use, similar to Athena, where you only pay for what is scanned, and that drives more price performance benefits.
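As a rough illustration of that create-it, scale-it, pay-for-what-you-use model, here is a boto3 sketch of launching a Spark cluster with managed scaling enabled. The release label, instance types and scaling limits are placeholders you would adjust for your own workload.

```python
import boto3

emr = boto3.client("emr")

# Launch a Spark cluster that EMR managed scaling can grow and shrink
# between 2 and 100 instances based on utilization.
response = emr.run_job_flow(
    Name="spark-demo-cluster",
    ReleaseLabel="emr-6.9.0",          # placeholder release label
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,
            "MaximumCapacityUnits": 100,
        }
    },
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
)
print(response["JobFlowId"])
```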
Here are some numbers. I love the numbers game, right? What do these numbers mean? Are they good or bad? We don't know without context. So the first one is our performance running Apache Spark: it's 3.9x faster than standard Apache Spark, and we're very proud of the recent benchmarks that have been done there against open-source Spark. For Trino, we're running 4.2x faster than standard Presto. And obviously, leveraging our Graviton2 or Graviton3 processors, we're seeing tremendous benefit; I'm guessing most of the audience knows that's our custom CPU, built on the 64-bit ARM architecture, which is the Graviton foundation. And last but not least, something we're very proud of: 100% compatibility through these migrations, so that you're not forced to redo things as you're moving and migrating into the cloud, making it simple and easy for people to make this migration.
As I mentioned, Spark is one of the more popular applications for EMR, so let me just show you some of the price performance benefits. There are 12 features on this slide that are very important to the overall picture, and there are more, but I'm only gonna highlight two of them, because I want to share just the benefits of those two with you.
The first one I wanna highlight is dynamically sized executors, called DSE internally. Today, you have to specify the actual size of your executors, and your specifications should match the underlying instances to reduce wastage. What DSE solves is this: you only submit the job to EMR, and it figures out the size of the containers to ensure the most optimal infrastructure required to actually do the job, and it does it automatically for you. So you're not trial-running things, resizing and running again.
So that's DSE. The second is dynamic pruning of data columns. This one is about shuffle; one of the main efforts is to reduce that shuffle in Spark. Open-source Spark has a dynamic partition pruning capability that minimizes the shuffle, especially when you're joining two tables on their partition keys. What we've done is extend that capability to the data columns as well, not just the partition keys.
So let me share with you a little bit about some of the managed deployment options for EMR. Maybe you're very familiar with these, but I just wanna talk through a couple of them, because it'll answer "which EMR deployment is best for me?" and "what do I want to consider when going down that path?"
So on the left there are a few features listed, and typically our customers' feedback is that these are the most helpful ones to determine which type of EMR deployment to consider. The ones that jump out I have color-coded to highlight the different choices. So we're gonna walk through three different types of EMR deployments. The first one is classic EMR on EC2, right? This is where, if everything's equal and your organization needs the ability to customize and change all the configurations and make it very specific to what you need, this is a good option: you can allocate the cost by cluster, the pricing is pretty simple by instance type used, and you get to control the size of the cluster, not asking for anything to happen automatically.
The middle option, the second option, is when we're running on EKS: Amazon EMR running in a container environment. This is a solid choice, specifically if the pricing is around CPU and memory used and you have a big Kubernetes deployment. If you're using Kubernetes today for your applications and you're very comfortable with it and skilled up, then it's a logical choice to continue to run Spark on Kubernetes as well and just take that benefit you have in your organization and apply it to EMR. The last option is the serverless option.
And these aren't in any particular order; they're just to highlight some of the differences. We know serverless is great for multi-AZ, and we know it's great for cost allocation by CPU and memory used; customers often wanna charge back to internal organizations by job or by application, and this is a great option there. We use this deployment option when we don't want to control the levers or change the bells and whistles; we want it to automatically figure it out as it goes, rather than me suggesting what I think the size of the cluster should be. And all of these choices get the same bits, right? The distribution is the same, and they get the same performance benefits of the EMR runtime. So you're not losing anything by choosing one or the other. It provides choice that matches different cost profiles and will affect the price performance where it makes sense.
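To make the serverless option concrete, here is a minimal boto3 sketch of submitting a Spark job to EMR Serverless, where you never pick instance types or cluster sizes. The application name, role ARN and script path are placeholders.

```python
import boto3

emr_serverless = boto3.client("emr-serverless")

# Create a Spark application; EMR Serverless sizes and scales the workers itself.
app = emr_serverless.create_application(
    name="spark-serverless-demo",   # placeholder name
    releaseLabel="emr-6.9.0",
    type="SPARK",
)

# Submit a PySpark job; you're billed for the resources the job actually consumes.
emr_serverless.start_job_run(
    applicationId=app["applicationId"],
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",  # placeholder
    jobDriver={"sparkSubmit": {"entryPoint": "s3://my-analytics-bucket/jobs/etl.py"}},
)
```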
Similar to EMR Serverless, and there were some big announcements earlier in Adam's keynote that I'll touch on, one of them just about an hour ago, is Amazon Athena. Is everybody somewhat familiar with Amazon Athena? Does anybody know what version we're on, or how old the service is? Yeah, so we're on version three, and if you feel like we should be on a higher version than where we are, fair enough; it has definitely been around for a while. So why is it so successful? Because it's serverless, right? Zero maintenance, zero administration. You just point at your data and go, using SQL. You only pay for the queries that you're running, and that comes to about $5 per terabyte scanned. Very reasonable. The price of this service has never changed from the beginning; we have left it as is and continue to add more. You can get even more benefit, the 30 to 90% down there, through compression, partitioning and converting your data into columnar data formats.
And that drives even more price performance, if interested and needed. The next one: open and flexible. Why it's so heavily adopted is because of the different data formats; ORC, CSV, JSON and Avro are incredibly standard formats for our customers to use. And then, to drive Taz's point home from earlier: easy to use. We only do this if it's easy to use and we reduce friction; if it becomes too hard and is full of problems and potholes along the way, then we kind of take a step back.
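That 30 to 90% saving is mostly about scanning fewer bytes. One common way to get there is a CTAS statement that rewrites a raw CSV table as compressed, partitioned Parquet. Here is a sketch with placeholder database, table, bucket and column names:

```python
import boto3

athena = boto3.client("athena")

# Rewrite a raw CSV table as partitioned, Snappy-compressed Parquet so future
# queries scan far less data (and therefore cost less per query).
ctas = """
CREATE TABLE sales_parquet
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    partitioned_by = ARRAY['sale_year'],
    external_location = 's3://my-analytics-bucket/sales-parquet/'
) AS
SELECT customer_id, amount, sale_year
FROM sales_csv
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "salesforce_demo"},
    ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
)
```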
So since we launched this service, it's been great to really drive continuous improvements, and the feedback we get from our customers allows us to round it out and make it simple and easy to use. When we first launched this, it was very simple; it pointed to S3. In 2019 we added a lot more data sources, like DynamoDB and Redshift. And now, as of recently, you can do actual deep Spark analytics using PySpark; this was something that was announced earlier today to really extend the value of Athena. And that is all in the same simple, instant-on capability that Athena is known for: a simple end-user experience.
So with these data connectors, we not only expand connections to native services, but you get the combination of joining DynamoDB directly with S3, delivering tremendous value both in our cloud and perhaps even other clouds. Some additional connectors, because a lot of people think this is just specific to native services within AWS, but if you look at that list: SAP HANA, Oracle, Teradata, Google BigQuery, right? So we've really extended the capability because you'll get more value out of it. And these connectors are free to use; again, you only pay for data scanned.
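Once a connector is deployed and registered as a data source, the join is just SQL. A sketch, assuming a DynamoDB connector registered under a catalog named "ddb"; all catalog, table and column names here are illustrative.

```python
import boto3

athena = boto3.client("athena")

# Join live DynamoDB items (via the federated "ddb" catalog) with an S3-backed
# table in the Glue Data Catalog, with no data movement required.
query = """
SELECT o.order_id, o.status, c.customer_name
FROM "ddb"."default"."orders" o
JOIN "awsdatacatalog"."salesforce_demo"."customers" c
  ON o.customer_id = c.customer_id
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "salesforce_demo"},
    ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
)
```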
Now let me move from Athena towards Redshift, because I know Taz spent some time on that. But first, on these connectors: no cost, easy to configure, configure once and share across. A lot of people ask, can I use this cross-region and actually share with my teams via cross-account access? And the answer is yes with Athena. So that has been a tremendous benefit for our customers.
And on to a product that's celebrating 10 years: I'm gonna go to Redshift. So Redshift is 10 years young, or 10 years old, and this is obviously the data warehousing solution that Taz provided a demo of. But I wanna share some of the ML that's built into Redshift to allow it to do the things it's capable of doing. We'll talk through a quick benchmark and then a couple of other things, and get you out of here on time.
The first one I wanted to highlight, because workloads are always changing and everybody's trying to optimize for peak performance while the workloads and requirements fluctuate over time, is automatic vacuum delete: cleaning up the environment as we go and releasing resources that are no longer required. Incredibly helpful and a must-have for our customers, and it's been available for a while. A complementary feature is automatic table sort, which a lot of people are embracing for obvious reasons, because it works with the table statistics, the metadata associated with the table, to improve the query execution plan, which gives you better performance and optimal use of your system resources as well.
The common scenario we see is that auto analyze now runs directly in the background, handling those statistics and that metadata, while the table sort allows us to continually optimize query performance. The whole benefit here is that this has allowed us to do some recent benchmarks that we're really, really proud of; not to say that we're the right fit all of the time, but we think we've really come a long way in these 10 years.
So there's a QR code to a blog that we published recently around Redshift. The chart on the right has four different lines: we're the one on top, the yellow-orange line, and then there are red and blue lines for other cloud providers we ran this benchmark against. It was a really quick benchmark; we didn't know what the output would be, but we wanted to show what we could deliver, and it's 8x better performance. These are short queries, and the reason we want to highlight the short-query TPC benchmark is that all the queries take less than a second, most of them less than half a second, because we use a small 10 GB dataset. We loaded this up because we hear from our customers that these are common scorecarding, visualization and dashboarding workloads, with hundreds of users going to tens of thousands of users in the same day.
So we wanted to build this benchmark to show the elasticity and the scale that can happen with Redshift very quickly. We sized the other vendors as closely as we could to our on-demand pricing, to be equivalent in size. And the shape in this case: we used an ra3.4xlarge as the node shape to run this benchmark.
So why were we successful? What's the secret sauce? One of the reasons was just a clear improvement that we made for ourselves around reducing the query planning steps. And that really shows when you've got 120 users going to 12,000 users at the close of the market, right, and those reports and that information have to be distributed. So small improvements there on short-running queries have a big impact over time.
And then we did a lot of investment around concurrent process optimization, which Taz and I would love to talk about in more detail. These little things have a big impact on how we do resource optimization, swapping and utilization, and that drove a lot. And then finally, just some improved query parallelism, which was incredibly important so we could measure across the entire cluster. A metric we track is how many queries are taking less than a second, and then we make sure we track, allocate and share things across the cluster.
Here is an interesting serverless TCO analysis, and I would recommend it if you're in your company trying to figure out how to justify moving to serverless. This is a great article from Deloitte. I think it really accurately reflects what we know: the adoption of serverless strategies is on the rise. In fact, 75% have either implemented them or are gonna be doing so in the next two years. So it's good to get educated around this, or just get a refresher. We know it's growing. Why? Faster time to deployment, faster allocation of compute rather than fixed capacity and paying upfront, and all the other capabilities.
Let me orient you quickly on this; there's a lot on this slide. On the left axis is time to deploy, going from low to high, slower to faster. On the bottom axis is flexibility and scalability: do you want that to be low or high? The top axis is spend: pay upfront and reserve your pricing, or pay as you go. And that leaves the right axis, which is capacity, fixed at the bottom to variable at the top, so that you can figure out what the right thing is for you.
I'll just highlight that there are several workloads that are probably a good fit, and I think I put a circle there. Yeah. So for workloads that are really computational, event-driven, heavy big data infrastructure, you'll decide what the right server-based deployment is and figure out if it falls in that sweet spot within the graphic. And then, obviously, we think serverless is definitely for applications that need to be a lot more agile, with a high degree of flexibility.
I definitely wanna pay as I go; I don't want a fixed set of capacity, I wanna figure that out as I go, but I need that scalability and flexibility to happen in an automatic fashion. So serverless data analytics is incredibly helpful, and this is what I wanted to highlight. This was a late add, thanks to our technical glitches earlier. You can see serverless is not new for us, specifically in the Amazon space; all of these are the serverless data analytics capabilities that we have ready for deployment. Anybody know the one missing? OpenSearch, which was just announced like an hour ago; Adam unveiled that OpenSearch is now serverless, so this graphic will have another circle with OpenSearch. Amazon OpenSearch Serverless is now in preview and available for customers to use.
I wanted to summarize, because we're getting close to the end of our time. We hope we've at least planted a couple of seeds and covered where we think ease of use can really move the needle in democratizing analytics, making it available, making it ubiquitous. You know all these phrases, data is like oil, like water; data assets are hugely important to organizations. We hope we've left you with a few ideas and have hopefully shown you some products to support that. And more importantly, this is based on the interactions we have with our customers and the benefits that they're seeing.
We know that price performance and ease of use drive a lot of decisions, and they'll also allow you to accelerate your customer experience. I wanted to highlight the Data Lab. The Data Lab is a service, if you're not familiar with it; scan that QR code. It allows you to come with an idea, we do a design-and-build session with you over a couple of days, and you actually leave with a prototype. So this is an incredibly fast way to move forward exploring our analytics capabilities, and we have one specific to embedded analytics.
So if you are a customer figuring out how to embed visualizations, dashboards or just a metric into another application leveraging APIs, it's worthwhile to check this out. This is a very quick one, a matter of just, I think, a day and a half, but it has been incredibly in demand from our customers. So again, that's the embedded analytics Data Lab. It's definitely up there. Keep going.
And then, last but not least, there's training and certification that you're all aware of. You can go by the booth, get more information on the new classes that we've made available, and then obviously get additional certifications.
So thank you. On behalf of myself and Taz, we would like to take a selfie with everybody from up here, if you're willing, and we'll get it published on social media if you're OK with that. But we thought we'd ask before we just did it. So Taz, come on up and let's take this selfie, and thank you for your attention.