What’s new with Amazon EMR and Amazon Athena

Good morning, everyone. It's day three of re:Invent at nine o'clock in the morning, and you're all here. Congratulations, and I want to say I appreciate you. My name is Vinita Ananth. I'm the product lead for open source data analytics, the EMR and Athena services. And I am very excited to be with you today, because this is my favorite time of the year: sharing with you all of the features and capabilities that our engineering teams have been working on for several months leading up to this event.

And I also wanna say a thank you to all of you because you've been candid and provided feedback to us throughout the year on the features and capabilities that you'd like to see in the product. And therefore I'm here to provide you an update on the features that you've desired.

With me today are two presenters. We have Mohammed Rehan, who is a VP at Telkomsel, where he leads enterprise strategy and planning. Telkomsel is Indonesia's largest telecommunications provider. And Radhika Ravirala is a Principal Product Manager for open source data analytics. Radhika is a veteran re:Invent presenter, so some of you in the audience might already recognize her and know of her expertise.

Thank you. Thank you. As the world continues to run on cloud, we're at the brink of another platform shift which is going to accelerate generations of innovation and value creation. Artificial intelligence has changed the way we work and will continue to change the way we work, live and interact with machines. We expect the knowledge and information economy will go through a major reinvention and it will be defined by artificial intelligence and the underlying data that will differentiate your companies from your competitors.

As we think of this transformation, the bedrock for successful reinvention lies in how we use AI, augmenting the large language models with the proprietary data that is your secret sauce. Data volumes are increasing at an unprecedented rate, going from terabytes to petabytes and sometimes exabytes of data. Traditional on-premises data analytics approaches just aren't able to handle these data volumes, because they don't scale well or they're extremely expensive.

So we hear from companies all the time that you're looking to extract additional value from your data, but struggle to capture, store and analyze all the data generated by your digital businesses. And this data is growing exponentially, it's coming from new sources, it's varied in nature and what's more important is that it needs to be securely accessed and analyzed by applications and people.

Your data foundation should offer a range of tools and solutions to address the needs of a range of use cases you're trying to address, spanning from traditional data lakes to state of the art large language models. And it should be crafted to have seamless integration. It should also provide streamlined task completion. Additionally, it should be providing end-to-end governance and it should leverage generative AI as well as machine learning to improve efficiency, speed and cost effectiveness.

You've also mentioned that you have several challenges associated with proliferation of tools. Some of our customers have shared that you use anywhere between 5 to 20 to even 25 tools in your data foundation. And you've expressed your desire for simplification. You've said, you know, you want a seamless, simple, easy to use unified experience. At the same time, you want the diverse set of capabilities that is required of the platform.

Earlier this year, we brought the EMR and Athena teams into one organization and we did this to facilitate the integration of our products, enhancing our ability to provide you with the diverse capabilities to support you with your diverse needs more efficiently. And you will hear, if you were in other sessions, you will hear the theme of unification and organizations coming together, especially with AWS more so this year and continuing into 2024.

Amazon's cloud-centric platform for big data analytics, incorporating EMR and Athena, is enabled to perform distributed data processing at petabyte scale. Many of you have been using these solutions and what it does is provide you a range of open source applications to choose from - Apache Spark, Hive, Presto, Flink and so many more.

You've also given us direction on our focus areas for innovation, and it is informed by four main pillars. The first is supporting the latest versions of open source frameworks. We commit to providing the latest and greatest innovations to you through our platform within 90 days; for some of the more popular frameworks, such as Apache Spark, we try to do this within a 30-day period.

We also have support for the very popular open table formats. Many of you have adopted the open table formats actively, and therefore we have supported Apache Iceberg, Hudi, and Delta Lake over the last two years.

You've also told us that reducing costs, or price performance, is extremely important to you. We hear that this is a continued theme going into 2024, and therefore we bring you the best performance and lowest cost option through our platforms. We are hyper-focused on delivering better price performance through our EMR and Athena runtimes, and I'll talk a little more about the price-performance optimization work that our engineering teams typically do in this regard.

We also have provided now a range of options to reduce costs, for example, using Spot Instances which harnesses unused capacity in Amazon EC2. We offer the ability to do very fine grained usage - essentially billing is done on a per second basis, very similar to EC2 - so that you're paying only for the instances that are being used and you can spin up and spin down. And there are several other options for optimizing costs that I will cover in the upcoming slides. But this is an area that we've deeply focused on based on your feedback.
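The per-second billing and Spot discounting described above can be sketched as a toy cost model. This is purely illustrative: the hourly rate and discount percentage below are placeholders, not AWS pricing.

```python
# Toy model of EMR's per-second billing: you pay only for the seconds
# each instance actually runs. Prices here are placeholders, not quotes.

def cluster_cost(instances: int, seconds: int, hourly_rate: float,
                 spot_discount: float = 0.0) -> float:
    """Cost of a cluster billed per second, with an optional Spot discount."""
    per_second = hourly_rate / 3600.0
    return instances * seconds * per_second * (1.0 - spot_discount)

# A 10-node cluster that runs for 20 minutes at a hypothetical $0.36/hr rate:
on_demand = cluster_cost(10, 20 * 60, 0.36)
# The same job on Spot capacity at a hypothetical 70% discount:
spot = cluster_cost(10, 20 * 60, 0.36, spot_discount=0.70)

print(round(on_demand, 2))  # 1.2
print(round(spot, 2))       # 0.36
```

The point of the model is that short-lived clusters cost what they actually consume: spinning down 40 minutes early is 40 minutes you don't pay for.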

Additionally, we provide you a flexible and versatile platform that offers batch, streaming, interactive, and SQL workloads. We also have multiple deployment models: EC2, EKS, and Serverless. And last but not least, we're providing a unified data access platform and overall governance, because we understand that security especially is becoming extremely important across the board. We have a series of announcements on fine-grained access controls as well as identity propagation that Radhika will cover later in the session.

What is differentiated on our EMR platform is the EMR Apache Spark runtime. Our engineers spend dedicated time improving performance year over year. We started optimizing the EMR runtime several years ago, and we've gone from 2x performance to 5.15x performance in the latest version, EMR 6.15. And we're not stopping there; more improvements are coming soon.

In addition, because of the widespread adoption of Iceberg, we have run TPC-DS performance benchmarks against Iceberg 1.4.0, which also showed a performance improvement of 14% over open source Spark. What is important, and something that you've all told us, is that you want 100% open source compatibility, ensuring that at any point in time, if you wanted to port away from or to the cloud, our runtime remains fully compatible with open source. And that is how we deliver it through EMR.

We have focused on Iceberg for our performance benchmarks. However, several customers are also using Delta and Hudi. And therefore these open table formats are also supported on our platform.

Amazon EMR provides support for EC2 M7g, C7g, and R7g instances, which are optimized for big data workloads and run the latest Graviton3 processors. These processors are custom-built by AWS using 64-bit Arm cores. The synergy between the EMR runtime for Spark and the C7g Graviton3 instances results in better performance, up to a 20% improvement, and a reduction in total cost of 15% over the previous generation of Graviton instances.

I'm very excited to announce that EMR now supports S3 Express One Zone. If you were at Adam's keynote yesterday, you would have heard the announcement of S3 Express One Zone as the new storage class. We have done a series of tests to determine what the performance improvement would be with S3 Express One Zone, and we're excited with the results: we're seeing 2 to 4x faster access. And with that, it also results in a lower total cost of ownership.
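To make the integration concrete: S3 Express One Zone uses "directory buckets" whose names embed an Availability Zone ID and end in a fixed suffix, and from the job's point of view only the bucket changes. The bucket prefix and AZ ID below are made up for illustration.

```python
# S3 Express One Zone uses "directory buckets" whose names embed the
# Availability Zone ID and end in the "--x-s3" suffix. The bucket and AZ
# names below are made up for illustration.

def express_bucket_name(prefix: str, az_id: str) -> str:
    """Build a directory-bucket name following the S3 Express naming scheme."""
    return f"{prefix}--{az_id}--x-s3"

def is_express_bucket(name: str) -> bool:
    return name.endswith("--x-s3")

bucket = express_bucket_name("my-emr-scratch", "usw2-az1")
print(bucket)                      # my-emr-scratch--usw2-az1--x-s3
print(is_express_bucket(bucket))   # True

# A Spark job on EMR would then read s3://<bucket>/... as usual;
# the storage class changes, not the application code.
```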

We've had several customers trying out EMR on EKS since we released it. And while we like to build new services, it's really important to ensure that our customers are driving value from our offerings. GoDaddy, the leading domain registrar and web services provider, determined that they wanted to reduce complexity and empower their organization, and their researchers in particular, to run data processing on a more optimized offering.

We embarked on a journey with GoDaddy to provide a platform based on EMR on EKS and Graviton2. That resulted in a 60% reduction in overall costs. But more importantly, they were able to improve the efficiency of their developers by 5x.

Bridgewater Associates is using our EMR on EKS platform to power their financial models; their researchers are using EMR on EKS. Facing a very similar challenge of reducing complexity and supporting business growth while also advancing their security measures, by leveraging EMR on EKS Bridgewater was able to improve their simulations and increase usage of the platform for simulations by 5x.

You've mentioned that you want a variety of deployment options, and for different reasons. Our EMR Serverless offering is one of the easiest and simplest ways to get started on EMR. This offering allows you to start within seconds and scales up rapidly on your behalf. EMR Serverless manages capacity, instances, versioning, and patching, making it super simple and removing the operational overhead of leveraging the platform for data processing.

EMR on EC2 provides the broadest range of offerings which includes a variety of open source frameworks, up to 20+ frameworks to choose from, and even a large range of compute instances to pick from.

EMR on EKS is typically the natural choice for organizations already standardizing on Kubernetes, because they are looking to streamline their instance management or run EMR on Kubernetes clusters that are running other applications as well.

Let's dive a little deeper into EMR Serverless. EMR Serverless has become one of AWS's fastest growing services, expanding swiftly. At Peter DeSantis' keynote, you may have heard him announce several serverless offerings. It is an area that we are deeply investing in. We want to make it super easy, removing the operational overhead that our customers experience with some of our offerings and making it super simple to leverage the services for the specific use case.

With EMR Serverless, we are removing decision-making overhead and making it effortless to use the platform. One of the key value-adds that EMR Serverless provides is the ability to scale up and scale down on demand, without having to go pick and reserve capacity on machines.
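The submission flow described here can be sketched as a request payload. The application ID, role ARN, and S3 path below are placeholders; the shape follows the EMR Serverless `StartJobRun` API, and the actual boto3 call is shown only as a comment so the sketch stays self-contained.

```python
# Sketch of submitting a Spark job to an EMR Serverless application.
# ARNs and S3 paths are placeholders.

def build_job_run(app_id: str, role_arn: str, script: str) -> dict:
    """Assemble a StartJobRun-shaped request for a Spark script."""
    return {
        "applicationId": app_id,
        "executionRoleArn": role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script,
                "sparkSubmitParameters": "--conf spark.executor.memory=4g",
            }
        },
    }

request = build_job_run(
    "00example123",
    "arn:aws:iam::111122223333:role/emr-serverless-job-role",
    "s3://example-bucket/scripts/etl.py",
)

# With boto3 this would be submitted as:
#   boto3.client("emr-serverless").start_job_run(**request)
print(request["jobDriver"]["sparkSubmit"]["entryPoint"])
```

Note there is no cluster sizing anywhere in the request; that is the decision-making overhead the service removes.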

EMR Serverless has also added a series of features since its launch a couple of years ago. I'll summarize them in a couple of slides. But I just wanted to let you know that this is a fast evolving space, and we are going to be releasing a range of features in 2024 as well.

One feature that I'd like to dive into today is the interactive experience using EMR Studio. Today you can run interactive workloads on demand. EMR Studio provides an IDE based on Jupyter notebooks that supports PySpark, Scala, and Python kernels, allowing data engineers and data scientists to run interactive workloads.

We're also providing the ability to easily create, visualize, and debug your applications within Jupyter notebooks in the EMR Studio environment. What's really special here is that we're also integrating with CodeWhisperer, our generally available AI coding companion.

CodeWhisperer's primary function is to support your developers by providing suggestions based on the code or comments they're writing, helping your developers become more efficient in their development. I also want to note that this functionality is available not just on EMR Serverless but also on EMR on EC2 and EMR on EKS.

Apache Airflow and AWS Step Functions stand out as two of the most widely used orchestration engines for data pipelines on the AWS platform. And I'm excited to say that we are now supporting AWS Step Functions with EMR Serverless. I'm pleased to announce that it's simple, easy drag-and-drop functionality using the AWS Step Functions Workflow Studio. It also facilitates the ability to synchronously track your jobs and their progress on the Step Functions console as well.
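A minimal workflow of the kind described here could look like the Amazon States Language sketch below. The resource string follows the Step Functions optimized-integration naming pattern, and the application ID and ARN are placeholders; treat the field names as illustrative rather than a verified template.

```python
import json

# Minimal Amazon States Language sketch of a workflow that runs an EMR
# Serverless job and waits for it. ARNs and IDs are placeholders.

state_machine = {
    "StartAt": "RunSparkJob",
    "States": {
        "RunSparkJob": {
            "Type": "Task",
            # The ".sync" suffix tells Step Functions to wait for the job to
            # finish, which is what lets you track progress from the console.
            "Resource": "arn:aws:states:::emr-serverless:startJobRun.sync",
            "Parameters": {
                "ApplicationId": "00example123",
                "ExecutionRoleArn": "arn:aws:iam::111122223333:role/job-role",
                "JobDriver": {
                    "SparkSubmit": {
                        "EntryPoint": "s3://example-bucket/scripts/etl.py"
                    }
                },
            },
            "End": True,
        }
    },
}

# Workflow Studio produces JSON of this shape via drag and drop.
print(json.dumps(state_machine)[:20])
```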

To summarize the launches that we delivered in 2023 for EMR Serverless: we announced Graviton2 support, and we're supporting custom images, including custom Docker images, an area that you requested and we've delivered.

You've asked for job-level cost visibility, and we've delivered it. If you're using OpenSearch or Splunk to do log analysis, we have integrated CloudWatch logging that can then flow into OpenSearch and Splunk. We now have the integration with Secrets Manager for you to save your passwords. And I mentioned earlier the Step Functions integration and the EMR Studio interactive capability.

Also coming up is improved Spark Structured Streaming support.

EMR on EC2 is a very popular offering that many of you are likely using today if you're looking for cluster-based options. And as I mentioned earlier, we have a large selection of instances as well as open source frameworks that you can pick from, including Apache Spark, Hive, Presto, HBase, and so on. Several of you had asked for high availability options for instance fleet clusters.

So I'm very excited to share the news, because this is one of our most requested capabilities: we are launching high availability for instance fleet clusters, so your primary node is no longer a single point of failure. In the event of a failure, EMR will fail over to the standby nodes.

Today we provide one active node and two standby nodes as part of this offering. It will also increase the diversity of instances that you can pick from for these HA clusters.

You gave us a lot of feedback on our console, and we took it to heart. We've done a lot of great work in this area to bring you the best console experience we can.

We made this console experience the default in Q3 and Q4. It offers an improved single-screen create-cluster workflow, simplified navigation, better performance, more responsive controls, and the latest features like workspaces and new cluster configuration options.

In 2024 we're investing heavily in improving the user experience of EMR, Athena, and a range of data analytics services that AWS offers. So stay tuned; we have a lot more elevated user experience coming soon.

To summarize, our EMR on EC2 releases included ensuring that we provided you the latest and greatest open source frameworks with EMR release 6.15.

We improved managed scaling to help you with automatic instance selection; we've improved the algorithms behind the scenes.

While you're using Spot Instances, you asked whether you could pick not only based on reduced cost but also on the availability of capacity. So we've provided the price-capacity-optimized option for choosing Spot Instances.
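In practice this shows up as an allocation strategy on an instance fleet. The sketch below builds a fleet definition of the shape used by EMR's `RunJobFlow` API; the instance types, counts, and timeout values are illustrative choices, not recommendations.

```python
# Sketch of an EMR instance-fleet configuration that asks the Spot
# allocation strategy to weigh both price and available capacity.
# Instance types and counts are illustrative.

task_fleet = {
    "Name": "task-fleet",
    "InstanceFleetType": "TASK",
    "TargetSpotCapacity": 8,
    # Listing several instance types gives the allocator more Spot pools
    # to choose from, which improves the odds of getting capacity.
    "InstanceTypeConfigs": [
        {"InstanceType": "m7g.2xlarge", "WeightedCapacity": 2},
        {"InstanceType": "r7g.2xlarge", "WeightedCapacity": 2},
        {"InstanceType": "c7g.4xlarge", "WeightedCapacity": 4},
    ],
    "LaunchSpecifications": {
        "SpotSpecification": {
            "TimeoutDurationMinutes": 10,
            "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            "AllocationStrategy": "price-capacity-optimized",
        }
    },
}

# This fleet definition would go into the Instances.InstanceFleets list
# of an EMR RunJobFlow request.
print(task_fleet["LaunchSpecifications"]["SpotSpecification"]["AllocationStrategy"])
```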

We've improved cluster start-up times by 35%, and it's now under five minutes. We've modernized and improved the EMR console.

In addition, we've added CloudWatch events to support better debugging and provide actionable insights when you're provisioning and scaling your clusters. And we've improved the resiliency of our clusters with the high availability options.

So with that, I know we covered a lot here. I would like to invite Mohammed Rehan on stage to talk about how Telkomsel is leveraging the AWS platform for their transformation.

Thank you, Vinita. It's a 15-hour time difference and also a 17-hour flight, but of course I'm very excited to be here in Vegas, and also very excited to hear what Vinita has shared: new features, a lot of improvements, and a lot of different deployment options on EMR. Talking about EMR, I would like to share some of Telkomsel's story, my company's journey in using EMR to modernize our critical operational data stores.

Let's start first with some context on Telkomsel. Telkomsel is the largest telecommunications service provider in Indonesia. We have around 150 million customers across the nation, and that is reflected in the huge amount of data flowing through our networks, more than 1,400 petabytes per month. And we are an archipelago nation, so it's quite challenging to provide full coverage.

We have around 220,000 BTS (base transceiver stations) to cover the whole nation. In terms of products and services, we initially started with wireless mobile connectivity, with prepaid and postpaid subscriptions, and then we started to grow our business by entering digital services as well.

We have platforms for video, gaming, and music, and just recently we acquired our parent company's product called IndiHome, a fixed broadband product. So since 1 July this year, we have become a fully fixed and mobile convergence communications service provider.

I'll continue with some sharing on Telkomsel's cloud journey. We started a bit late, actually; Telkomsel began the journey around two years ago. The restrictions and regulations in Indonesia are quite strict, so we had some difficult times convincing our government. Fortunately, we were able to do that, and it coincided almost exactly with the AWS Jakarta Region opening, so it was good timing.

So why did we start our cloud transformation journey? Because as we continued our digital transformation, building more digital applications and more product-offering innovation, we started to have issues with agility and scalability. This is quite common to all enterprises and IT companies.

Having said that, cloud is not only about moving from on premises to the public cloud; we set up the cloud journey as part of our critical business transformation strategy.

In order to accelerate that, we set up a Cloud Center of Excellence. This is the team with the responsibility to kick-start the adoption of the cloud. We trained a lot of people, and now more than 100 of our employees are already AWS certified.

We worked with AWS to kick-start a strategic partnership, and we have already migrated some of our workloads, especially those in the digital product channels: for example, our self-service app for customers, as well as the mobile application for resellers and our virtual assistant.

Now we come to the data part. Like almost every company in the telco industry, we have quite a variety of data in our big data platform, for example, network data.

We need to ingest things like network quality data, location data, and people-movement data, as well as data from the core BSS systems of the telco itself, like customer care, order management, and payment systems, and so on.

We have around 1,000 data sources identified and ingested into our big data platforms. In terms of volume, around 50 terabytes of data are ingested per day, and the growth is also quite high, almost 20% per year.

We use this data a lot for operational improvement and customer service operations. We also do some data monetization: we sell insights as a service to our partners as B2B services. And we work with our network team on network optimizations.

Now, as the data grows continuously, especially after the merger with our fixed product, data scale is something that is currently very difficult for us to handle, especially as data center space is very limited. It's not easy to spin up a new data center, of course, not to mention the complexity of maintaining the infrastructure.

So this is the background why we decide to continue our journey and the cloud transformation. After migrating our digital apps, we now start to see whether we also need to migrate our big data and analytics workload.

Finally, around a year ago, we decided to do it. We started with three main components. The first one is the operational data store, which will be the main focus of this session.

The operational data store (ODS) is connected with various mission-critical applications. The data is quite big as well, seven terabytes ingested daily, and there are some nonfunctional requirements around it.

We expect very minimal downtime, below-two-second query response, and around 1,000 queries per second. Before we migrated, we used an open source on-premises Hadoop stack, with Spark, HBase, and also Impala. This is what we migrated to Amazon EMR.

The other two components are the fast analytics data warehouse, which we also moved to AWS, and the machine learning services, which we moved to Amazon EKS.

Let's talk in a bit more detail about the operational data store. This is the solution architecture we use. As the data sources, we have 11 applications sitting in our on-premises data center.

We also established AWS Direct Connect to have a more reliable, dedicated connection to the AWS cloud in the Jakarta Region. We use AWS DataSync to pull out the data and store it in an S3 bucket as the landing zone first.

Then we use Spark running on top of EMR to do the ETL and transformations. The results are stored in another S3 bucket in HBase format.

For the query side, we have EMR deployed in two AZs: one is the primary cluster and the other is the replica. We use Apache Phoenix as the layer that translates the SQL queries from all the consumers of the data to retrieve data from our ODS.

There is a load balancer as well, of course, to distribute the traffic. On the right you can see some of the consuming applications, like our e-billing, aggregation, and customer care applications, and also our middleware. And there is a requirement from law enforcement as well.
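The primary/replica read pattern described above can be illustrated with a toy router: reads rotate across healthy clusters in different AZs, and if one goes down, traffic falls back to the other. This is a simplified model of the idea, not Telkomsel's actual load balancer.

```python
import itertools

# Toy illustration of load-balancing reads across a primary and a read
# replica in different AZs, with failover when a cluster is unhealthy.

class ReadRouter:
    def __init__(self, endpoints):
        self.health = {e: True for e in endpoints}
        self._cycle = itertools.cycle(endpoints)

    def mark_down(self, endpoint):
        self.health[endpoint] = False

    def pick(self):
        # Round-robin, skipping endpoints marked unhealthy.
        for _ in range(len(self.health)):
            candidate = next(self._cycle)
            if self.health[candidate]:
                return candidate
        raise RuntimeError("no healthy clusters")

router = ReadRouter(["primary-az1", "replica-az2"])
print(router.pick())  # primary-az1
router.mark_down("primary-az1")
print(router.pick())  # replica-az2
print(router.pick())  # replica-az2
```

With both clusters serving reads, the replica also absorbs query load in normal operation rather than sitting idle.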

OK, so this is the last part of my session. Having successfully migrated to EMR just recently, we have been able to get a lot of benefits. The first one is open source compatibility. As you know, we previously used open source software on premises, and it runs smoothly and compatibly on EMR as well. And not only compatibility: we also see huge improvement in terms of performance; query latency improved and resource utilization was reduced.

I also forgot to mention that for our read replica cluster, we use Graviton instances, which further improves our cost optimization. This empowers our business to do more analysis, and in the end it provides better services to our customers in terms of data refresh cycles.

Previously, since the data is quite big, the ingestion cycle could only be done daily. Now, after migrating to EMR, we are able to do it on an hourly basis.

Then on automation: a lot of our effort has now been optimized, because previously we needed to maintain the cluster ourselves and do upgrades. Now, since EMR is managed by AWS, all of that can be done automatically.

Also, the time required is only around 30 minutes per cluster process, which is quite fast. On the technical side, for key learnings: first, on the loading part, this is where we worked together with the AWS team to define how to optimize the process and the time required to execute.

We used a predefined key range and chose the correct loading method, so that for the initial data load we were able to meet the required time frame. Then there is fine-tuning of the parameters: by default, EMR already has a good configuration, but sometimes at large scale you need to do some fine-tuning, for example of the JVM parameters and the cache size.

On the project implementation: of course, this is quite a complex project, so we worked collaboratively with AWS and several other partners as one team, and they put in their best people. That is how we were able to do this successfully, and we are looking forward to doing more in the BI and analytics space.

I think that's all from me. I hope you found it useful, and I'll now pass the floor to Radhika to continue.

That's quite a lot. How many of you are running EMR on EKS? OK. And just out of curiosity, any EMR on EC2 fans? OK. We hope to convert you.

So, EMR on EKS is a deployment option that allows you to run open source frameworks on Amazon EKS, Amazon's managed Kubernetes service. This deployment option has been available for a couple of years now. With it, you can focus on running your analytics workloads and leave the burden of building, configuring, and managing the containers required to run your open source frameworks to EMR on EKS.

In addition to all that, there are a lot of reasons customers love to run their workloads on EMR on EKS. For instance, if you're already running your applications on EKS, it makes a lot of sense to simplify infrastructure management and improve resource utilization by consolidating all your EMR workloads onto EKS. You can take advantage of the multi-AZ characteristics provided by the underlying EKS cluster and spread your nodes across AZs to get that resiliency. You can also use cost-saving options like Spot Instances as well as Savings Plans, which will optimize your costs for running on EKS and help you run workloads more efficiently.

There are also a lot of things we have carried over from EMR on EC2. You run the same runtime environment as EMR on EC2 for Spark, so you get the same Spark performance. Other carry-overs from EC2 include monitoring of your Spark jobs through integration with CloudWatch and S3.

And most importantly, coming back to cost optimization, customers have often relied on auto scaling. Auto scaling is provided on EMR on EKS through a feature called Karpenter. Karpenter allows you to horizontally scale your container nodes and also bin-pack your containers onto a diverse set of mixed-size instances for improved efficiency and reduced costs.
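The bin-packing idea behind that efficiency gain can be shown with a toy first-fit-decreasing packer: given pod CPU requests and a menu of node sizes, pick nodes so little capacity sits idle. Real Karpenter considers far more dimensions (memory, price, zones, disruption budgets); all numbers here are made up.

```python
# Toy model of the bin-packing behind Karpenter: place pod CPU requests
# onto the smallest nodes that fit, minimizing wasted capacity.

def pack(pods, node_sizes):
    """First-fit-decreasing packing; returns a list of [capacity, used]."""
    nodes = []
    for pod in sorted(pods, reverse=True):
        for node in nodes:
            if node[1] + pod <= node[0]:
                node[1] += pod
                break
        else:
            # No existing node fits: launch the smallest size that does.
            size = min(s for s in node_sizes if s >= pod)
            nodes.append([size, pod])
    return nodes

pods = [3, 1, 2, 6, 1, 4]   # vCPU requests from pending pods
node_sizes = [4, 8, 16]     # available instance sizes (vCPUs)
nodes = pack(pods, node_sizes)
waste = sum(cap - used for cap, used in nodes)
print(nodes, waste)
```

A diverse set of instance sizes is what makes the waste small: with only one large node size, every partially filled node strands far more capacity.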

So overall, there are a lot of advantages to running your workloads on the EMR on EKS deployment option. Because this is so important for reducing your cost, we have been focusing a lot on auto scaling. To that end, we have added a new feature called vertical autoscaling, or dynamic pod autoscaling. What this feature does is automatically configure the memory and CPU requirements to adapt to your workloads and scale the containers vertically, so you don't have to worry about manually tuning your EKS clusters, which is often cumbersome and complex. The beauty of this feature is that it works in conjunction with Karpenter and the Kubernetes Cluster Autoscaler, which means that if you are already using horizontal scaling, you can add this on top to improve the efficiency of your workloads and run them without having to worry about any kind of tuning.

Another feature we added to help our customers run stateful computations on streaming data is support for Apache Flink on EMR on EKS. Starting with EMR release 6.13, we support the Flink Kubernetes operator on EMR on EKS. With this operator, you can submit your jobs to an EKS cluster simply by submitting the job directly to the operator. You can auto scale based on the SLAs for your upstream data using both the horizontal and vertical scaling we just spoke about, and you get improved resiliency with the multi-AZ EKS benefit as well when you're running your Flink applications.

The other advantage of running your Flink applications using the Flink Kubernetes operator is that once you deploy the operator onto an EKS cluster, the operator manages the lifecycle of your Flink application.
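The lifecycle management happens through a custom resource that the operator watches. The sketch below shows a `FlinkDeployment`-shaped spec as a Python dict; the image path, resources, and job name are placeholders, and field names follow the open source Flink Kubernetes operator CRD.

```python
# Sketch of the custom resource the Flink Kubernetes operator watches.
# Once applied to the cluster, the operator deploys, upgrades, and
# recovers the application described here. Names and paths are placeholders.

flink_deployment = {
    "apiVersion": "flink.apache.org/v1beta1",
    "kind": "FlinkDeployment",
    "metadata": {"name": "clickstream-enrichment"},
    "spec": {
        "flinkVersion": "v1_17",
        "jobManager": {"resource": {"memory": "2048m", "cpu": 1}},
        "taskManager": {"resource": {"memory": "4096m", "cpu": 2}},
        "job": {
            "jarURI": "local:///opt/flink/usrlib/enrichment-job.jar",
            "parallelism": 4,
            # "last-state" upgrades resume from the latest checkpoint,
            # which is what keeps stateful streaming jobs resilient.
            "upgradeMode": "last-state",
        },
    },
}

# Serialized to YAML, this would be applied with:
#   kubectl apply -f flink-deployment.yaml
print(flink_deployment["kind"])
```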

Besides the dynamic pod autoscaling and the support for Apache Flink on EMR on EKS, we've been busy adding a bunch of other features to the EKS platform. They include native Spark UI support through the EMR console, which means you can go to your virtual cluster, pick your job, and look at the Spark UI for that job.

We have also improved the efficiency of analytics and ML workloads with the addition of the Spark RAPIDS accelerator integration. This is the NVIDIA RAPIDS Accelerator for Apache Spark plugin that you can use to accelerate your data science workloads.

There are other interesting things from an observability point of view, which have to do with integration with Prometheus and Grafana, S3, and CloudWatch. This gives you clear visibility into how your jobs ran and the resource consumption for your jobs, as well as helping you plan how you want to launch your clusters going forward.

We have also improved performance by adding support for Java 17 and Amazon Linux 2023 to this platform.

Switching gears a little bit and moving to Amazon Athena. Amazon Athena, as you all know, is a service that allows you to query all your data using SQL or Python. It is very simple to start: put your data in S3, point your query at that data, and start running your queries. It's serverless, so there is no setup required.

It has optimized runtimes for fast results; the engine powering Athena's SQL is Trino. And it supports a variety of use cases, including interactive as well as advanced analytics. For example, if you want to run federated queries across multiple data sources, you can use Athena today: simply use one of the connectors and start running your queries. And you have the flexibility to use the language of your choice, SQL or Python.
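The "point your query at the data" flow maps to a single API request. The database, table, and output location below are placeholders; the request shape follows Athena's `StartQueryExecution` API, and the boto3 call is shown as a comment so the sketch stays self-contained.

```python
# Sketch of an Athena query submission. Names and S3 paths are placeholders.

def build_query_request(sql: str, database: str, output_s3: str) -> dict:
    """Assemble a StartQueryExecution-shaped request."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

request = build_query_request(
    "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page LIMIT 10",
    "weblogs",
    "s3://example-bucket/athena-results/",
)

# With boto3:
#   boto3.client("athena").start_query_execution(**request)
print(request["QueryExecutionContext"]["Database"])
```

Notice there is no cluster or endpoint in the request at all; the data's S3 location and the catalog database are the only "infrastructure" a query needs.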

We support two engines on Athena: the Spark engine and the traditional SQL engine, which came first. You can use either engine depending on your workload.

It's also very cost effective, because you're only paying for the amount of data scanned by your query. You're able to save anywhere from 30 to 90% in per-query cost through compression of your data. So you can use the open table formats that Vinita talked about earlier, compress the data, and query it using Athena.
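The scan-based pricing makes that saving easy to reason about. The $5/TB rate below matches Athena's published us-east-1 price at the time of writing, but treat it as illustrative; the scan sizes are made up.

```python
# Athena bills per terabyte scanned, so compressing data or converting it
# to a columnar format directly lowers query cost. Rate and sizes are
# illustrative.

PRICE_PER_TB = 5.00

def query_cost(bytes_scanned: float) -> float:
    return bytes_scanned / 1024**4 * PRICE_PER_TB

raw_scan = 2 * 1024**4          # 2 TB scanned over uncompressed CSV
columnar_scan = 0.2 * 1024**4   # same query over compressed columnar data

saving = 1 - query_cost(columnar_scan) / query_cost(raw_scan)
print(f"${query_cost(raw_scan):.2f} -> ${query_cost(columnar_scan):.2f} "
      f"({saving:.0%} cheaper)")  # $10.00 -> $1.00 (90% cheaper)
```

Columnar formats help twice over: compression shrinks the bytes, and column pruning means a query touching three columns never scans the other fifty.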

So on Athena, we have been introducing some new features, which include running queries using provisioned capacity. Traditionally, customers who have been running advanced use cases on Athena have had some challenges, and they include queuing: if you submit a high volume of queries to Athena, chances are that those queries are getting queued, and that causes a disruption to your business applications.

The second challenge that we had seen and heard from customers is that customers run differentiated workloads, which means some queries are more important or have a higher priority than others, and customers want those queries to run before some of the other ones. And lastly, customers have also been interested in having predictable costs for their queries.

So we designed provisioned capacity to address these challenges. It is focused on giving you more control over your workloads while retaining Athena's serverless user experience. It is fully managed, and you get complete control over the concurrency and scale of your queries, as well as the ability to decide which queries you want to run as a priority. Performance is something that is always on our agenda. We are constantly striving to improve the performance of the engines, and Athena is no different.
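As a sketch of how provisioned capacity is wired up: you create a capacity reservation sized in DPUs and then dedicate it to one or more workgroups, so queries in those workgroups draw on reserved capacity instead of the shared pool. The reservation and workgroup names below are placeholders, and the 24-DPU figure reflects the minimum at launch (verify the current minimum in the Athena docs):

```python
import json

# Placeholder names throughout; 24 DPUs was the launch-time minimum.
reservation = {"Name": "etl-reservation", "TargetDpus": 24}

assignment = {
    "CapacityReservationName": "etl-reservation",
    "CapacityAssignments": [{"WorkGroupNames": ["high-priority-etl"]}],
}

# With boto3 (API shape per the Athena client docs — verify before use):
#   athena = boto3.client("athena")
#   athena.create_capacity_reservation(**reservation)
#   athena.put_capacity_assignment_configuration(**assignment)
print(json.dumps(reservation), json.dumps(assignment))
```

Routing the high-priority workgroup onto the reservation is what addresses both the queuing and the prioritization challenges described above.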

So recently we added support for a cost-based optimizer (CBO). This is a feature that allows the query planner to accelerate query performance. It leverages the table and column statistics from the Glue Data Catalog; we added the capability to the Glue Data Catalog to generate these statistics, so the CBO can start leveraging them. When statistics are available, the CBO can take the most optimized plan and execute the query using that plan, keeping costs and optimizations in mind. It is turned on by default, with no configuration changes required from you, which means it's available out of the box.
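One way to populate those statistics is Glue's column-statistics task. The sketch below shows the shape of such a request; the database, table, column, and role names are all placeholders, and the API shape should be verified against the current Glue client documentation:

```python
import json

# All identifiers below are illustrative placeholders.
stats_task = {
    "DatabaseName": "sales_db",
    "TableName": "orders",
    "ColumnNameList": ["order_date", "customer_id"],
    "Role": "arn:aws:iam::123456789012:role/GlueStatsRole",
}

# With boto3 (check the glue client docs for the exact parameters):
#   boto3.client("glue").start_column_statistics_task_run(**stats_task)
print(json.dumps(stats_task, indent=2))
```

Once the statistics exist in the catalog, the CBO picks them up automatically — no query changes needed, consistent with the "on by default" behavior described above.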

As we were talking about earlier, S3 Express One Zone is a new storage class that was introduced by S3 at re:Invent, and Athena is happy to support it. So if you are interested in accelerating your queries for high performance, you can simply store your data in this storage class and point your Athena queries to it.

In our benchmarking, we found up to four times higher performance for applications without changing any of your SQL code, which simply means that you take your existing queries, point them at the data in S3 Express One Zone, and start running them, and you will see this performance difference automatically.
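"Without changing any SQL" in practice means only the table's storage location changes. A hedged illustration: S3 Express One Zone uses directory buckets, whose names carry an Availability Zone suffix (the bucket, AZ ID, and table names below are placeholders):

```python
# Illustrative DDL pointing an Athena table at an S3 Express One Zone
# directory bucket; note the "--<az-id>--x-s3" bucket-name suffix.
DDL = """
CREATE EXTERNAL TABLE hot_events (
    event_id string,
    ts timestamp
)
STORED AS PARQUET
LOCATION 's3://my-hot-data--use1-az4--x-s3/events/'
"""
print(DDL.strip())
```

Every existing `SELECT` against `hot_events` then runs unchanged; only the LOCATION moved to the faster storage class.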

You can also query against data stored in Glacier. Glacier is a low-cost, cold storage tier in S3, and up until now, we weren't able to run queries against Glacier. So this year we added support for that.

And the last thing on the storage side is that we added support for S3 Object Lambda, which lets you add code to modify your data on the fly. For instance, if you have sensitive columns that you want to mask before you run the query, you can kick off the S3 Object Lambda feature.
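A minimal sketch of what such an Object Lambda function might look like: the handler fetches the original object via the presigned URL in the event, masks a column, and would hand the result back with `WriteGetObjectResponse`. The CSV layout and the masking rule are illustrative, not from the talk:

```python
import urllib.request

def mask_ssn_column(csv_text: str, col: int = 2) -> str:
    """Replace the given column's value with *** on every data row."""
    out = []
    for i, line in enumerate(csv_text.splitlines()):
        fields = line.split(",")
        if i > 0 and len(fields) > col:   # keep the header row intact
            fields[col] = "***"
        out.append(",".join(fields))
    return "\n".join(out)

def handler(event, context):
    """S3 Object Lambda entry point (sketch)."""
    ctx = event["getObjectContext"]
    raw = urllib.request.urlopen(ctx["inputS3Url"]).read().decode()
    masked = mask_ssn_column(raw)
    # A real handler returns the transformed object via:
    #   boto3.client("s3").write_get_object_response(
    #       RequestRoute=ctx["outputRoute"],
    #       RequestToken=ctx["outputToken"],
    #       Body=masked)
    return masked
```

Any reader going through the Object Lambda access point — Athena included — then sees the masked data rather than the raw object.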

And on the user experience side, we have heard from a lot of customers who are using single sign-on through EMR Studio or similar applications that they would like a similar experience for Athena. So we added support for running Athena queries from Amazon EMR Studio. You can query all your Athena databases, tables, and views, toggle between different workgroups in Athena, run your queries against those workgroups, and also look at the saved queries, recent queries, and all the things that you are familiar with from Athena on the console. You can do that in EMR Studio as well. It gives you a simplified experience, and it is primarily designed for analysts who are also using EMR Studio.

And another addition to our Athena features is the support for CloudTrail Lake. So you can analyze your CloudTrail Lake data with Amazon Athena. Earlier, there was a lot of setup that you needed to do: you needed to get all the logs from CloudTrail, put them in S3, cleanse them, and then run your queries. With this addition, you can query and analyze your CloudTrail Lake data and join it with other datasets using Amazon Athena.

A lot of times, for auditability and other reasons, customers combine CloudTrail data with other data, and this lets you do that very easily. You can also visualize and build dashboards using QuickSight with your CloudTrail data. And these CloudTrail Lake tables are available in the Glue Data Catalog if you want to explore them through Athena as well.
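Once the CloudTrail Lake tables appear in the Glue Data Catalog, joining them with your own data is ordinary Athena SQL. A hedged example — the database, table, and column names are all placeholders, and the nested field path follows CloudTrail's typical `useridentity` record shape:

```python
# Illustrative Athena SQL joining a CloudTrail Lake event data store table
# with a hypothetical in-house table; every identifier is a placeholder.
QUERY = """
SELECT ct.eventname, ct.useridentity.arn, emp.team
FROM   "cloudtrail_db"."event_store_table" AS ct
JOIN   "hr_db"."employees" AS emp
       ON ct.useridentity.arn = emp.role_arn
WHERE  ct.eventtime > '2023-11-01'
LIMIT  50
"""
print(QUERY.strip())
```

The same query result can feed a QuickSight dataset, which is how the dashboarding mentioned above typically gets wired up.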

Most of you know Amazon DataZone as a service that helps you govern data access. And we have been fortunate to add support for DataZone this year for Athena. With DataZone, you can discover your data, share your data, and make it available for consumption layers like Redshift and Athena to query.

So we added support for DataZone earlier this year for customers who want to set up environments, discover this data, and do a simple click in their DataZone environment to run queries with Athena.

Security is job zero at AWS, and so I'm very excited that we have added a number of security features to both Athena and EMR this year.

On the Amazon EMR side, we started our support for native LDAP integration in the summer of this year. This is a feature that allows you to simply use your corporate directory servers, such as OpenLDAP or Microsoft Active Directory, integrate them with your EMR clusters, and have your users sign in using those corporate identities.

Now, earlier, before we had this native integration, you could do this manually, but it was laborious and you had to do it for each application that you were running on EMR. For example, if you had Presto, Hue, and Hive Server 2 running, you would need to individually configure each of these applications with LDAP for your users to sign in to them. With this, all that pain is gone. You simply launch your cluster with LDAP integration, which can be specified in the security configuration, and you can use any of the supported applications, which include Apache Livy, Hive Server 2, Trino, and Presto, as well as the Hue interface, to interact with your cluster. And best of all, customers who have been setting up Kerberos in order to use Ranger no longer have to do it. You can simply integrate your EMR cluster with LDAP and then start using Apache Ranger.

Another important feature that we are very excited to talk to you about is called Trusted Identity Propagation. This is a feature from AWS IAM Identity Center that allows corporate identities to authenticate to analytics services such as Amazon EMR, Redshift, QuickSight, Lake Formation, and the Glue Data Catalog, and we'll be adding support for Athena shortly. The beauty and the advantage of this feature is that you can bring in your corporate identities from the identity provider that you're using currently, which might be Okta, Azure AD, or Microsoft Active Directory, and use those identities to log into EMR Studio as well as interact with the EMR cluster through the studio.

It also propagates this identity to the downstream services that also support trusted identity propagation. And most important of all, you get end-to-end auditability for the journey the user identity took, from the beginning all the way to the last application in your call chain.

If you take a look at the diagram on the right-hand side, it depicts what the journey looks like. You have an admin who integrates your IdP with Identity Center, and then the admin creates the Identity Center applications for your analytics services, such as EMR Studio, EMR on EC2, Lake Formation, S3 Access Grants, and CloudTrail.

Then the user simply logs into EMR Studio. The user identity and context are passed along from the moment they log into EMR Studio, and they appear in the CloudTrail log throughout that whole call chain.

We are also excited to talk a little bit about fine-grained access control. Previously, EMR supported fine-grained access control using Lake Formation, but only at the column level. We recently added table, column, row, and cell level support with Lake Formation, which means that if you have your data lake and tables set up in the Glue Data Catalog, you are able to simply use Lake Formation with your EMR on EC2 clusters.

The other things that you get out of this: it also supports all the open table formats, which include Hudi, Iceberg, and Delta Lake. You can apply data filters on nested attributes, if your columns have child columns. And you also get access control for your Spark and YARN users.
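Under the hood, Lake Formation expresses these row- and cell-level rules as data cells filters. The sketch below shows the shape of such a filter; every identifier (account ID, database, table, columns, row predicate) is a placeholder, and the request shape should be checked against the current Lake Formation API reference:

```python
import json

# All identifiers below are illustrative placeholders.
data_cells_filter = {
    "TableData": {
        "TableCatalogId": "123456789012",
        "DatabaseName": "sales_db",
        "TableName": "orders",
        "Name": "us_only_no_pii",
        "RowFilter": {"FilterExpression": "country = 'US'"},
        # Cell-level effect: rows are restricted AND sensitive columns
        # are simply left out of the visible column list.
        "ColumnNames": ["order_id", "order_date", "amount"],
    }
}

# With boto3 (verify parameters against the lakeformation client docs):
#   boto3.client("lakeformation").create_data_cells_filter(**data_cells_filter)
print(json.dumps(data_cells_filter, indent=2))
```

Granting a principal `SELECT` on this filter, rather than on the table, is what gives Spark and YARN users on the EMR cluster the restricted view.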

So I know we have a lot more to cover in terms of security, but we are out of time today. I want to thank you for joining us for this session, and we'll be happy to take any questions outside the room. Thank you very much.
