Building an open source data strategy on AWS

There we are, good. So I'm very pleased to be joined by Narayanan, one of our senior solutions architects in the analytics practice. We're going to go through the theory and the overview, and then we're going to talk in detail about some of the architectures that we see our customers building today.

So first and foremost, open source has really revolutionized the world of software engineering, whether it's databases, operating systems, or utilities that add to the functionality of the software that you build, allowing you to extend and modify functionality in a communal model, which has hugely benefited productivity and security. And running open source on AWS is something our customers are very passionate about. It really allows you to get more from your developer life cycle.

You're able to change software to meet your particular needs, to have transparency around security, and to improve performance for your particular workloads. You're able to take a piece of software that works for a particular use case and extend it, knowing that you're going to have more than just your own contributions represented in the code base. We love open source at AWS, and we want to share a little bit about what we do to use, build and run open source, but also how we contribute back.

Open source allows us to be really customer obsessed and learn from the community, from our customers, and from where they're innovating. We see really important and healthy communities that help us evolve world class code bases, whether that's MySQL, PostgreSQL, Linux or Redis. Our job, then, we believe, is to help reduce the maintenance overhead of running open source on the cloud at high scale, and to improve the quality and security of open source modules over time.

We want to be the best place to run open source. So we have a broad variety of capabilities that are based upon open source and allow you to build a comprehensive data strategy, end to end, from your operational processing all the way through analytics and into machine learning. And we're going to highlight how those hang together.

We also make a lot of contributions upstream to improve quality and particularly scale. Coming from the EMR team, where we allow you to run Spark, Hive and Hadoop based workloads as well as HBase and Flink, making open source contributions back upstream is a vital part of how we scale and how we deliver the performance enhancements you get in the cloud to all customers, whether they're running on AWS or not. And we do a lot to partner with open source leaders and communities to make sure that we're delivering this value in partnership with the community, not unilaterally.

On the cloud, you're taking open source capabilities and making it much easier and faster to get started. They should be performant and scalable and designed to be secure. Again, with that transparency you get by having access to the underlying source code, we contribute to things like high availability, and we make it easy for you to maintain and upgrade the open source software you're running across a big, broad portfolio of use cases, while also making it so that you only pay for what you use. The open source model allows you to consume this software, but it is expensive and hard to maintain and operate over time. By running on the cloud, we take away all of that undifferentiated heavy lifting while still giving you the benefit of using open source software.

Some examples of where our customers are being really successful with open source are highlighted here. Traveloka, for example, was able to standardize their API contract creation and management through an open source platform called Backstage, which is then deployed onto Amazon API Gateway. Zomato uses the EMR platform plus Druid and Presto, which are both open source frameworks. They boosted their performance by 25% and cut their compute costs by 30% by migrating onto our Graviton chipset underneath those open source frameworks. And Augury is doing a lot with Spark and ONNX to do low latency machine learning on top of AWS, using those open source frameworks to innovate. In terms of open source collaborations, we contribute in a broad variety of areas, through things like the Xen Project and Kubernetes; we're collaborating with the Rust Foundation around innovations to the Rust language; and we have fully managed and operated open source offerings like the OpenSearch project, which powers the OpenSearch Service and delivers scalable search infrastructure and vector database processing on AWS.

You'll find two models for how we run open source. One of those is what we call managed open source. These are models where we take the open source software and extend it or package it for operation on the cloud. We are responsible for deploying and managing the software, but it typically runs in your accounts on infrastructure that you may also have direct access to, and the service we're offering aims to give you very close currency to the open source release cycle.

In EMR, for example, our objective is to be within 30 to 60 days of releases in the Spark ecosystem so that you can take advantage of new innovations as they come out. You do that by upgrading the EMR software that's running within your accounts, on EC2 instances you have access to, on Kubernetes pods that you're running, or through our serverless infrastructure.

We also have things like managed Airflow, Apache Kafka and so on. And the OpenSearch Service is a great example where you can take the open source project and run it on your own infrastructure, or you can take advantage of the OpenSearch Service.

The other model is open source compatibility. This is where the service offers library, client or protocol compatibility with an open source client, but may or may not actually be implemented on top of the same open source software. In this case, we are completely responsible for managing that infrastructure footprint and its availability, its scalability and so on.

In terms of where it runs, it tends to run in what we call service accounts, where the networking may be exposed through your VPCs, but where we're taking responsibility for what we call the data plane. We build these services based upon what we see happening in the open source ecosystem, as well as the features and requests that you give us as customers and where we need to focus on upstream contributions and so on.

So an example of this would be something like DocumentDB with MongoDB compatibility, or Keyspaces for Cassandra, where you can use an open source client to connect into these databases, but the underlying implementation is something that we take care of.

We talk to a lot of customers about their use of open source and what their objectives are for using it. In some cases, it's features. In other cases, it's because you want to be careful and thoughtful about where you expose your code to a vendor's proprietary implementations. So one thing my customers have found really helpful is to be very detailed about when you're using open source and, when that open source is being run on top of a cloud provider like AWS, where you are touching open source technology and where you are touching things that may be run and managed by AWS.

And so we want to decompose an AWS service and give you a sense of how these are shaped and where you then interact with the open source layers. So with an AWS service, we tend to divide the architecture into two parts, a control plane and a data plane.

The control plane tends to include the APIs and the service endpoints that you interact with, either from the AWS API or from open source or proprietary orchestration and control software like AWS CloudFormation or Terraform. That hits service endpoints, which in AWS tend to be HTTP JSON REST based APIs, and those service endpoints use a combination of technologies and techniques to control what's happening within the open source service.

For control plane functions, we may use function as a service like Lambda, or we have service fleets, which are used for longer running tasks, where we have a pool of capacity that we terminate the API requests onto.

Then there's the data plane. The data plane is where your data resides and where you interact with the service once you've provisioned it, updated it and made changes. The core of that, for open source based services, is typically an open source image.

For example, in EMR, you can take advantage of the EMR AMI, which is articulated as an EC2 machine image, or you can extend that with your own custom AMI, where you build on top of our AMI and layer your own software on it. That might be additional open source components or proprietary components. When the control plane receives a request, it is responsible for taking those open source images and applying them to a pool of capacity that is allocated within the data plane layer.

And that data plane layer might be just EC2, where you tell us which VPC you want to place that capacity into, or you might want us to provision an EKS cluster or run on top of an EKS cluster that you've already provisioned. That capacity pool has a lot of different shapes, but ultimately yields a service instance endpoint: you get an IP address or a CNAME that you can connect to using your service clients.
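To make that control plane/data plane split concrete, here's a minimal sketch of driving it yourself with boto3: the run_job_flow call goes to the EMR service endpoint, and EMR then applies the release's open source images (optionally layered with your custom AMI) to EC2 capacity in the subnet you choose. The names, AMI ID and subnet ID below are placeholders, not values from the talk.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="open-source-analytics",
    ReleaseLabel="emr-6.15.0",                 # tracks the open source release cycle
    Applications=[{"Name": "Spark"}, {"Name": "Flink"}],
    CustomAmiId="ami-0123456789abcdef0",       # optional: your AMI built on top of the EMR base
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2SubnetId": "subnet-0123456789abcdef0",   # the VPC/subnet in your account
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
)
print(response["JobFlowId"])   # the cluster whose endpoints your open source clients will use
```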

And the service clients, for an open source service, whether it's a managed service or a compatible service, are going to be open source clients. They're going to be doing things like talking JDBC, talking Gremlin in the case of graph databases, or using spark-submit in the case of analytical processing. And typically the underlying wire protocol is just the open source protocol.

In some cases, we also build drivers that we open source. Those take advantage of optimizations that we've made to the underlying infrastructure and allow you to scale.

So with this kind of architecture in mind, I'm going to hand over to Narayanan, who's going to show you how to build a data strategy with this fundamental building block across a bunch of layers of the stack.

So welcome up, Narayanan.

Narayanan: Hello everyone, and thank you, Ian.

All right, let's do this. When you're building your data strategy on AWS, there are a few different things you need to think about. You need to think about what your data lake storage layer is going to be. How do you catalog your data, and what services and tools will you use for that? What data processing tools will you use? Is it going to be batch, streaming, or both? What purpose-built databases do you need? How will you orchestrate a pipeline end to end, and what services will you use for that orchestration?

And then there are customers who also want to make sure they are building this data strategy using open source technology. So on the slide here, you can see the different building blocks you need to think about when you're building a data strategy, and at the same time, how you can build that data strategy using open source frameworks on AWS.

So now in the rest of the session, we'll go over these building blocks in a little bit more detail and at the end, we'll tie it all together in a few common architecture patterns that our customers are using.

So the first thing you need to decide when you're building your data lake is where you want to store your data. There are different aspects that go into this decision, such as the durability, scalability, availability, cost efficiency, and ecosystem integration of your storage layer. That is why customers are building their data lakes on Amazon S3: because of the unmatched durability, scalability and availability that S3 has.

So S3 is highly available, has 11 nines of durability, and is also very highly scalable. It currently holds over 350 trillion objects, exabytes of data, and it can serve on average over 100 million requests per second. It is also very cost efficient, and you can further optimize cost by moving objects into lower cost storage tiers.

Now when I talk to customers about this, they usually tell me, hey, I have a huge data lake, I don't know what objects are being accessed at what time, and it's very difficult to keep track of. For that, we have S3 Intelligent-Tiering. All you need to do is enable Intelligent-Tiering on your S3 buckets, and S3 does the monitoring for you: it monitors object access patterns and moves objects that are not being accessed on a regular basis into lower cost storage tiers to save you cost.
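One way to put that into practice is with boto3: a minimal sketch, assuming new objects are written with the INTELLIGENT_TIERING storage class and that you optionally want the archive tiers enabled on the bucket. The bucket and key names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Optional: opt the bucket's Intelligent-Tiering objects into the archive tiers.
# The frequent/infrequent access tiers work automatically without this.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket="my-data-lake-bucket",
    Id="archive-cold-data",
    IntelligentTieringConfiguration={
        "Id": "archive-cold-data",
        "Status": "Enabled",
        "Tierings": [
            {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
            {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
        ],
    },
)

# New objects opt in by using the INTELLIGENT_TIERING storage class.
s3.put_object(
    Bucket="my-data-lake-bucket",
    Key="raw/events/2023/11/01/part-0000.json",
    Body=b"{}",
    StorageClass="INTELLIGENT_TIERING",
)
```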

In terms of security, S3 provides client side and server side encryption, and it integrates with IAM so you can easily handle authentication. And S3 integrates with a broad number of AWS and open source analytics services, which makes S3 the best place to build a data lake on.

Now when you're talking about data lake storage, you have to talk about open table formats. Apache Iceberg is one of the most common open table formats out there, which a lot of customers are using, because it is distributed, community driven, and 100% open source.

Now with Apache Iceberg, you can build transactional data lakes, because it lets you insert, update and delete data. It also has fast scan planning. Scan planning is the process a query engine goes through when you run a query: it needs to figure out which objects to read to answer that query. With Iceberg, you have fast scan planning and advanced filtering, which means the process of getting the data for your query is going to be very fast.

There's also full schema evolution. With Iceberg, you can easily add and remove columns on your tables without affecting the underlying data. There are other capabilities such as time travel and version control, and Iceberg also supports multiple concurrent writers through optimistic concurrency.
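Here's a minimal PySpark sketch of those Iceberg capabilities, assuming the Iceberg Spark runtime and AWS Glue catalog integration are available on the classpath (on EMR they ship with the release when Iceberg is enabled). The catalog name, namespace, table, bucket and snapshot ID are all placeholders.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
         .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
         .config("spark.sql.catalog.glue.warehouse", "s3://my-data-lake-bucket/warehouse/")
         .getOrCreate())

spark.sql("CREATE NAMESPACE IF NOT EXISTS glue.finance")
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue.finance.transactions
        (user_id string, amount double, tx_time timestamp)
    USING iceberg
""")

# Transactional writes: inserts, updates and deletes against the same table.
spark.sql("INSERT INTO glue.finance.transactions VALUES ('u1', 42.0, current_timestamp())")
spark.sql("DELETE FROM glue.finance.transactions WHERE user_id = 'u1'")

# Schema evolution without rewriting the underlying data files.
spark.sql("ALTER TABLE glue.finance.transactions ADD COLUMN merchant string")

# Time travel: query an earlier snapshot (the snapshot ID here is a placeholder).
spark.sql("SELECT * FROM glue.finance.transactions VERSION AS OF 1234567890").show()
```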

Now with AWS, most of our analytics services already support Iceberg; read and write support is already there. And we are collaborating very closely with the open source community to add enhancements and improve Iceberg performance across various projects.

We currently have one PMC member and three committers on the Iceberg project. And recently we've been working with the open source community on improving read performance on Iceberg with Trino, and on overall Spark performance.

Now, there are other projects that we have contributed to with the open source community as well. When it comes to S3, one of them is S3A, which Hadoop jobs can use to easily interact with S3.

We also made some upstream contributions to Trino to improve query performance with S3, by pushing down operations such as filtering and only returning a subset of data back to the engine so that queries run faster.

But then there are other customers that are using domain specific applications. These applications don't natively support the S3 object API, and they have to do a lot of undifferentiated heavy lifting to make those applications work easily with Amazon S3.

So for that, we recently announced Mountpoint for Amazon S3, which is an open source file based client that these domain specific applications can use.

So basically, with Mountpoint, your S3 bucket is mounted on a compute instance, and these domain specific applications interact with the S3 objects as if they were local files in the file system.

So local file APIs are used to interact with the S3 objects, and Mountpoint converts those file API calls into object API calls and interacts with S3.

With this, you can run jobs at very high throughput and scale to petabytes of data for your domain specific applications.
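A minimal sketch of what that looks like from the application's side, assuming the bucket has already been mounted with the open source Mountpoint client (for example `mount-s3 my-data-lake-bucket /mnt/datalake`). The code below only uses standard local file APIs; Mountpoint translates them into S3 object API calls underneath. The mount path and file name are placeholders.

```python
import csv

row_count = 0
# To the application this is just a local file, even though it lives in S3.
with open("/mnt/datalake/exports/customers.csv", newline="") as f:
    for row in csv.DictReader(f):
        row_count += 1   # domain-specific processing would go here

print(f"read {row_count} records through the mounted bucket")
```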

All right. So now you can have the best data lake out there, really well designed and well optimized, but you still need purpose-built databases for specific use cases. And at AWS, we have a broad set of databases that you can use for your specific use cases.

But now you'll be thinking, OK, I'm here for open source, which of these databases actually support open source? Everything within the box is either managed open source or open source compatible, and we'll go through them in a little more detail.

Now, Amazon RDS makes it easy for you to set up, operate and scale your database on the cloud. RDS stands for Relational Database Service. Three open source databases are supported with RDS: MySQL, PostgreSQL and MariaDB.

You might think, OK, why should I use RDS? Well, if anybody has done database administration, you know how complex and difficult a job it is. You have to spin up your database, install everything, make sure the connections are set up correctly, make sure everything's working as expected and reports are running fine, and then set up another database and ensure the sync is happening for DR. With RDS, all of that is taken care of by the service.

So you don't have to worry about any of the infrastructure maintenance: Multi-AZ, high durability, all of that is part of the service. And we've been doing this for a while, so we've learned a lot of lessons over the years on operational best practices and security best practices, and we bring that to you as part of the service so you can benefit from our experience.

Amazon Aurora is also a relational database service, and it is built for the cloud. Amazon Aurora is fully compatible with MySQL and PostgreSQL, so if you have applications that run on MySQL and PostgreSQL today, the transition to Aurora should be seamless.

Now with Amazon Aurora MySQL, you can get up to 5x better performance than stock MySQL, and with Aurora PostgreSQL, you can get up to 3x better performance than stock PostgreSQL.

And with Aurora, you also get the security, high availability and scalability capabilities that you usually get in a commercial database, but at one tenth of the cost.

Now, if you think about it, you're getting the capabilities of a commercial database, but at the same time you're getting the cost efficiency and ease of use of an open source database, all within one service.

And if you want to further abstract away the management of the database, Aurora also has a serverless option, so that is something you can use as well.

So now we have our database figured out, and there are a couple of relational database options. What about caching? Is there a caching option?

Amazon ElastiCache provides a highly available and scalable caching solution on AWS. You can add ElastiCache in front of your applications and databases to make them run faster, scale to handle hundreds of millions of operations per second, and get microsecond response times.

With Amazon ElastiCache, there is support for Memcached and Redis. So if your applications work with Redis and Memcached today, the transition to Amazon ElastiCache should be seamless. And with ElastiCache, there's also support for Global Datastore for Redis.

What Global Datastore does is create read replicas of your cache in a different region. The advantage is that if you have a globally distributed application, irrespective of where the application is running, it has a local cache it can access, which helps with latency and performance.

And we've been working very closely with the open source community on contributing to Redis. One of the recent features we worked on with the community was replacing cluster metadata with slot specific dictionaries, and there are other features that we are constantly working on with the open source community as well.
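Because ElastiCache speaks the Redis protocol, the standard open source client just works. A minimal sketch with the redis-py library, assuming a Redis-based cluster with in-transit encryption enabled; the endpoint is a placeholder.

```python
import redis

cache = redis.Redis(
    host="my-cache.xxxxxx.ng.0001.use1.cache.amazonaws.com",  # ElastiCache endpoint
    port=6379,
    ssl=True,
)

cache.set("user:42:profile", '{"name": "Ada"}', ex=300)  # cache entry expires in 5 minutes
print(cache.get("user:42:profile"))
```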

Now, ElastiCache is a great choice as a cache in front of your database. But what if you want a database that gives you ultra fast performance, like single digit millisecond writes and microsecond read performance?

That's where Amazon MemoryDB for Redis can be used. Amazon MemoryDB for Redis can provide this fast performance because all of your data is in memory. It is compatible with Redis, so any application that works with Redis today should work with MemoryDB as well.

But not all of your data is accessed all the time. So with MemoryDB, there is also an option to use data tiering. With data tiering, data that is not accessed frequently is moved to disk, and only the frequently accessed data stays in memory.

So when your application makes a request, if the data is in memory, it is returned to the application. If the data is not in memory but on disk, MemoryDB gets it from the disk, returns it to the application, and then adds it back into memory. Data tiering is a great choice if only up to about 20% of your data is accessed on a regular basis; in that case, I think data tiering is a great cost saving option.

All right. Now what if you have documents that you want to store in a non-relational database? That's where Amazon DocumentDB comes in.

Amazon DocumentDB is a fully managed JSON document database, and it comes with MongoDB compatibility.

Amazon DocumentDB simplifies your architecture quite a bit, because it comes with built in security capabilities, it can do continuous backups for you, and it integrates natively with a lot of AWS services.

So the integration with AWS services is very easy to set up, and it's also very easy to scale DocumentDB. If you just want to scale your reads, you can add read replicas. But if you want to scale writes, you can create elastic clusters, and with elastic clusters you can have millions of reads and writes per second and petabytes of storage in DocumentDB.

And earlier this year, we announced support for the MongoDB 4.2 wire protocol and field level encryption. So with this, you can now encrypt sensitive data in your application and store it in DocumentDB.

So think of things like PII data: if you want to store it in DocumentDB, you can encrypt it in your application, then store it in DocumentDB, and still stay compliant with all the regulatory requirements.
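Because DocumentDB speaks the MongoDB wire protocol, the open source PyMongo driver works against it. A minimal sketch, assuming a cluster with TLS enabled and the Amazon RDS CA bundle downloaded locally; the endpoint, credentials and file path are placeholders.

```python
from pymongo import MongoClient

client = MongoClient(
    "mongodb://appuser:secret@my-docdb.cluster-xxxx.us-east-1.docdb.amazonaws.com:27017/",
    tls=True,
    tlsCAFile="global-bundle.pem",
    retryWrites=False,   # DocumentDB does not support retryable writes
)

orders = client["shop"]["orders"]
orders.insert_one({"order_id": 1001, "status": "shipped"})
print(orders.find_one({"order_id": 1001}))
```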

Amazon Keyspaces is a Cassandra compatible database. So if you have Cassandra use cases or you use a Cassandra database today, Amazon Keyspaces is a great choice on AWS.

It provides you with single digit millisecond read and write performance, so it's super fast. It also supports globally distributed applications: if you have a globally distributed application, Amazon Keyspaces can help you distribute your database across multiple regions.

And it is compatible with the Cassandra Query Language (CQL) APIs, so any applications that use CQL APIs today can continue to work with Keyspaces as well.
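A minimal sketch with the open source DataStax Python driver for Cassandra, assuming you have service-specific credentials and the Starfield root certificate downloaded locally; the endpoint, credentials and file name below are placeholders.

```python
import ssl
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ssl_context.load_verify_locations("sf-class2-root.crt")   # trusted CA for the TLS endpoint

auth = PlainTextAuthProvider(username="my-service-user", password="my-service-password")
cluster = Cluster(["cassandra.us-east-1.amazonaws.com"], port=9142,
                  ssl_context=ssl_context, auth_provider=auth)
session = cluster.connect()

# Plain CQL, exactly as you would run it against self-managed Cassandra.
for row in session.execute("SELECT keyspace_name FROM system_schema.keyspaces"):
    print(row.keyspace_name)
```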

And finally, we have Amazon Neptune, which is a managed graph database. Amazon Neptune also works with the open source APIs that are commonly used for graph databases.

And a few months back, we announced Graph Explorer, which is a UI based way to traverse your graph database. So instead of writing queries against your graph database, you can do that whole traversal of your graph in a UI, which makes it very easy for even non-technical users to use your graph database through Graph Explorer.

It not only works with Amazon Neptune; it actually works with any open source graph database that has Apache TinkerPop Gremlin support as well as a SPARQL 1.1 endpoint.
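Since the Gremlin protocol is the interface, the open source gremlinpython client is how an application would talk to Neptune. A minimal sketch, assuming IAM authentication is not enabled for brevity; the cluster endpoint and vertex label are placeholders.

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

conn = DriverRemoteConnection(
    "wss://my-neptune.cluster-xxxx.us-east-1.neptune.amazonaws.com:8182/gremlin", "g")
g = traversal().withRemote(conn)

# The same traversal works against any TinkerPop-compatible graph store.
print(g.V().hasLabel("person").count().next())
conn.close()
```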

All right. Now let's talk about data processing. We'll start with batch and then move on to streaming.

Now, in terms of batch data processing, some of the common frameworks are Spark, Hive, Presto and Hadoop. With Amazon EMR, it's very easy for you to spin up a cluster with these applications installed and start doing data processing in a matter of a few minutes.

How does Amazon EMR do it? When you launch a cluster, you tell EMR which specific applications you want installed on the cluster, let's say Spark. Amazon EMR then provisions infrastructure for you, installs those applications onto that infrastructure, and provides you with a performance optimized runtime.

Once that is done, you can start running your jobs, and all of this takes just a few minutes. The applications installed on EMR are all 100% open source compatible, and as new versions of the frameworks are released in open source, we aim to make them available on EMR within 90 days.

There's also a studio experience, EMR Studio. It has Jupyter notebooks where you can create notebooks and collaborate with your teammates on the same project, and it also integrates with GitHub, so you can easily check in and check out code, all as part of the studio experience.

With version 6 of EMR, we currently have over 25 frameworks that are supported, and we are constantly working with the open source community to contribute to these projects. For example, with EMR we have four people working with the open source community, and the same with the other projects.

Now, with EMR, there are a few different deployment options, and you can choose which one is right for your use case. The first one is Amazon EMR on EC2. EMR on EC2 is a good choice for customers who have a deep understanding of the frameworks they're running and want to run them in a really optimized manner, or who want the flexibility to customize configurations. It's also a great choice if you have long running clusters, running long jobs or streaming jobs.

But if you want to abstract away some of that maintenance and management, if you don't even want to think about scaling your clusters or which EC2 instances to use, EMR Serverless is a great choice. With EMR Serverless, you don't have to think about under-utilizing or over-utilizing your clusters, and you don't have to think about scaling; all of that is taken care of by the service. All you need to do is write a job, spin up an application, and run the job.
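A minimal sketch of that flow with boto3: create an EMR Serverless application, then submit a Spark job against it. The application name, release label, IAM role ARN and script location are placeholders.

```python
import boto3

emrs = boto3.client("emr-serverless", region_name="us-east-1")

app = emrs.create_application(name="nightly-etl", releaseLabel="emr-6.15.0", type="SPARK")

job = emrs.start_job_run(
    applicationId=app["applicationId"],
    executionRoleArn="arn:aws:iam::111122223333:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-data-lake-bucket/jobs/transform.py",
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
)
print(job["jobRunId"])   # the service handles capacity and scaling for the run
```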

If you're running big data applications on Kubernetes today, Amazon EMR on EKS makes it a lot easier for you to run those jobs on Kubernetes.

And finally, if you want to run EMR closer to your on-premises environment, EMR on Outposts is an option you can use.

Now, talking about streaming options for data processing, Amazon MSK is a fully managed, highly available and secure service for Apache Kafka. If you have managed Apache Kafka clusters on your own, you know it's very difficult to do right: you need to think about security, scalability, high availability, and making sure you're running everything with best practices. That's where Amazon MSK can help. At Amazon, we've been running these clusters at scale for a long time, and we've learned a lot of lessons and best practices, which we bring back to you as defaults and automations in the MSK service. So when you're running MSK, you're taking advantage of all the learnings we've had over the years operating Kafka.
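Because MSK exposes standard Apache Kafka bootstrap brokers, your existing open source clients keep working. A minimal sketch with the kafka-python library, assuming a cluster with TLS-enabled brokers; the bootstrap string and topic name are placeholders.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="b-1.mymsk.xxxxxx.c2.kafka.us-east-1.amazonaws.com:9094",
    security_protocol="SSL",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish an event exactly as you would against self-managed Kafka.
producer.send("card-transactions", {"user_id": "u1", "amount": 42.0})
producer.flush()
```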

If you want to further abstract away the management of clusters, meaning you don't want to think about right-sizing, scaling, or even partition management, MSK Serverless is a great choice. Some customers simply prefer using serverless options where possible, and with MSK that option is there. And there are a lot of innovations being done in this space as well.

One of them is tiered storage. With tiered storage, you get secondary storage that sits off your cluster, so compute and storage are separated. You can offload your data to the secondary storage after a certain number of days, depending on your use case. With this, you get virtually unlimited storage in your Kafka clusters, you get that storage at a lower cost, and even things like partition rebalancing become faster, because the data is not on your brokers, it's in the secondary storage, so when partition rebalancing happens, all that data doesn't need to move around.

And with Kafka, there's also MSK Connect if you want to run Kafka Connect connectors. So now your data has come into Kafka and you need to do some kind of stream analysis: aggregations, rolling window aggregations and those kinds of things. For that, we have a managed service for Apache Flink that you can use. It's a serverless service, so you don't have to worry about any infrastructure setup or scaling; it is all handled as part of the service.

There is a studio environment with the managed service for Flink, which is Zeppelin based, so you can build your jobs using the Zeppelin studio, and once you're happy with the job you've created, you can add it to the application and start running it. And similar to Kafka, the lessons that we have learned operating this at scale for so many years are incorporated back into the service, so you're getting those best practices as part of the service as well.

There is continuous collaboration with the open source community on Flink, and we are working with them to maintain connectors, build new connectors, and help with security and bug fixes as well.

And finally, you can run Flink on EMR. The difference here is that with EMR, you still have to spin up a cluster and run your Flink application as part of the EMR cluster. The managed service is more hands off, more of a serverless experience.

Why would you want to run it on EMR? You could have a long running cluster that is already running other applications on EMR, and you want to run Flink alongside that same set of applications as a YARN application. In that case, Flink on EMR is a great choice.

In terms of search and operational analytics, OpenSearch is an Apache 2.0 licensed project. It has a few different components: the OpenSearch engine itself for search and analytics, OpenSearch Dashboards if you want to do visualization, and a bunch of plugins that come with it for anomaly detection, search, alerting and so on.

OpenSearch has been around for a few years and has become very popular since its launch. As you can see from the numbers, it has had over 300 million downloads, we are working with a lot of different partners and the list is constantly growing, there have been hundreds of features added, and we have multiple contributors to the project as well. In terms of observability, Prometheus and Grafana are very common tools, and many customers are already using them.

We have a managed service for Prometheus, which is compatible with the same Prometheus query language that the open source version uses. The advantage here is that it is a managed service: you don't have to worry about multi-AZ deployments or the underlying infrastructure; it is all taken care of by the service.

And the same with Grafana: if you want to visualize your Prometheus alerts in Grafana, there is a managed service for Grafana that you can use.

So now we have figured out the different tools we want to use. But when you're building a data pipeline, you still need an end to end setup where the data starts at your source, goes through multiple steps along the way, and ends up in a destination. For that, you need orchestration, and Apache Airflow is a very popular tool that many customers use to orchestrate their data pipelines. With Managed Workflows for Apache Airflow, you get a managed experience of that same Apache Airflow product.
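Because it is the same Apache Airflow, the pipeline itself is just a standard DAG. A minimal sketch, assuming the transformation step is submitted elsewhere (for example via boto3 to EMR Serverless or Glue); the DAG ID, schedule and task body are placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def submit_transform(**context):
    # Placeholder: in a real pipeline this might start an EMR Serverless job run
    # or a Glue job with boto3, then poll for completion.
    print("submit the Spark transformation here")

with DAG(
    dag_id="daily_batch_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="transform", python_callable=submit_transform)
```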

And then finally, everything needs to be in code. You cannot be going into the AWS console and just clicking on things and deploying applications; it all has to be codified so that you can easily replicate it across multiple accounts, multiple regions, and multiple environments like dev, test, prod and so on.

For that, we have AWS CloudFormation, which is a template based way to do infrastructure as code. It supports JSON and YAML.

If you want to abstract that further, there is the AWS Cloud Development Kit, where you can define your infrastructure using familiar programming languages like Python or TypeScript.
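As a minimal CDK (v2) sketch in Python: a single stack that defines a data lake bucket. The stack and construct names are placeholders; `cdk deploy` would synthesize this into a CloudFormation template and deploy it.

```python
from aws_cdk import App, Stack, RemovalPolicy
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DataLakeStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Raw zone bucket for the data lake; retained even if the stack is deleted.
        s3.Bucket(self, "RawZone",
                  versioned=True,
                  removal_policy=RemovalPolicy.RETAIN)

app = App()
DataLakeStack(app, "DataLakeStack")
app.synth()
```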

And there are common architecture patterns, such as building a data lake, streaming applications, and operational analytics using OpenSearch, that we have made available in the AWS Analytics Reference Architecture. We have created CDK constructs for those, and you can find them via the QR code right there: there's a blog, and the blog points you to a GitHub repository with all of this information.

So now let's talk about a few simple architecture patterns you can build using the services we just talked about.

First, we'll talk about the stream processing pattern. Think of a use case where you have a credit card company and credit card transactions are constantly coming in. You've been asked to do two things: one, make sure the incoming data is transformed and available in your data lake so an analyst can easily run queries on it; and two, do some anomaly detection to see if there is any fraud going on in the data that is coming in.

The first thing we'll do is take the data coming in from your streaming source and create an event bus architecture. We'll use MSK as the event bus, and we'll add tiered storage to it. Like I talked about, that's a way to offload some of your storage from your Kafka clusters into secondary storage for cost savings and longer retention. In this case, I'll say that after two days, I want all of my data to be offloaded to the tiered storage layer.

Then I will use Amazon EMR to run Apache Flink to do some aggregations and rolling window calculations. The reason I'm using Amazon EMR for Flink, and not the managed service, is that in this case I'm also using other tools like Spark on the same EMR cluster, so it made sense to run Flink on EMR. But we could easily replace that EMR box with the managed service for Apache Flink, and the architecture would work the same way.

And the managed service for Apache Flink is fully managed, so you wouldn't even have to spin up any clusters. So we'll do some processing and write it out into our data lake in S3. In this case, I'm writing it out in the Iceberg format, because it's credit card data, it's financial, and there could be requirements where a user says, hey, you need to remove all of my data from your systems. If it's in Iceberg format, that's a simple DELETE statement. If it's plain Parquet, it's a much more complex process. That's why I'll store all of my data in Iceberg.

Now, once the data is in the data lake storage layer, we need to catalog it; that's how you'll run queries on it. We'll use the Glue Data Catalog, which is a Hive metastore compatible catalog. There are a few ways to add a table into Glue. You can write a DDL statement, like a CREATE TABLE statement, and create it that way. But there's an easier way to do it.

You can use an AWS Glue crawler and point it to your S3 location. When you run the crawler, it looks at your data, infers the schema, and creates the table for you automatically. So all you have to do is create a Glue crawler, point it to an S3 location, run the crawler, and the table gets created.

Now, once the data is in the Glue catalog, you can use Amazon Athena to run queries on it. Amazon Athena is a SQL based serverless service. It's used for interactive analytics and it's very easy to get started: you can log into the console and start running Athena queries right now, with no setup cost and no setup time.
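A minimal boto3 sketch of those two steps: create and run a crawler over the curated S3 prefix, then query the resulting table with Athena. The crawler name, IAM role, database, table and S3 paths are placeholders.

```python
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

glue.create_crawler(
    Name="transactions-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="finance",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake-bucket/curated/transactions/"}]},
)
glue.start_crawler(Name="transactions-crawler")   # infers the schema, creates the table

# Once the crawler has created the table, Athena can query it directly.
query = athena.start_query_execution(
    QueryString="SELECT user_id, SUM(amount) AS total FROM transactions GROUP BY user_id LIMIT 10",
    QueryExecutionContext={"Database": "finance"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake-bucket/athena-results/"},
)
print(query["QueryExecutionId"])
```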

Now, on the top of the diagram, in Apache Flink, we are doing aggregations: the total number of transactions per user in the last five minutes, and the total number of dollars spent by each user in the last five minutes. We do that for all of the users, and we use that information to do anomaly detection.
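A minimal PyFlink sketch of that kind of rolling window aggregation, assuming the Flink Kafka SQL connector is available and that the transaction events arrive as JSON; the broker address, topic and schema are placeholders.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source table backed by the MSK topic carrying card transactions.
t_env.execute_sql("""
    CREATE TABLE transactions (
        user_id STRING,
        amount  DOUBLE,
        tx_time TIMESTAMP(3),
        WATERMARK FOR tx_time AS tx_time - INTERVAL '30' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'card-transactions',
        'properties.bootstrap.servers' = 'b-1.mymsk.xxxxxx.c2.kafka.us-east-1.amazonaws.com:9094',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# Five-minute tumbling window per user: transaction count and total spend.
result = t_env.sql_query("""
    SELECT user_id,
           COUNT(*)    AS tx_count,
           SUM(amount) AS total_spend,
           TUMBLE_END(tx_time, INTERVAL '5' MINUTE) AS window_end
    FROM transactions
    GROUP BY user_id, TUMBLE(tx_time, INTERVAL '5' MINUTE)
""")

result.execute().print()   # in the real pattern this would be written to OpenSearch
```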

Now, as I said earlier, OpenSearch has a built in anomaly detection capability. So we send these calculations to OpenSearch, the anomaly detection capability kicks in, and if an anomaly is detected, it alerts the right folks to take the right action.

If you think about it, this is all happening in a matter of seconds: the data comes in, Flink does the aggregation and sends it to OpenSearch, and OpenSearch sends the alerts. So if there is some fraud going on, you can catch it in a matter of minutes rather than finding out much later.

And finally, we'll add some observability to it. We'll add Prometheus and Grafana to make sure our MSK brokers are running fine, our Flink cluster is healthy, and there are no issues going on. With Prometheus and Grafana, you can also set up alerts in case there are infrastructure issues you have to look into.

OK. Next we'll talk about a batch processing pattern. Again, this is a very common pattern that many customers are using, where data comes in from multiple different sources. Here we are showing three. One is the Database Migration Service, which can bring data in from any database source and write it out to the destination of your choice, which could be another database or even Amazon S3.

Amazon AppFlow is a service you can use to bring data in from a third party source such as Salesforce. If you have data in a third party source, you can bring it in using Amazon AppFlow and drop it into S3.

And then AWS Transfer Family covers something similar to an FTP use case. All of that data we bring into our data lake storage layer in Amazon S3, and we'll use Spark on EMR to do our transformations, writing the transformed, cleaned up data back into S3 in a different location in the Iceberg format.

Just like we did earlier, we'll catalog the data using the AWS Glue Data Catalog. And finally, you can use any of the purpose built analytics tools like Amazon Athena or Amazon EMR for analysis, you can use SageMaker for building machine learning models, and Amazon Redshift is another data warehouse service, not shown here, that can easily query the data in your data lake using Redshift Spectrum as well.

All right, now building on the previous batch processing pattern, let's see how you can use that same pattern to do some training for a generative AI model.

Since it's a new topic, I just want to level set a few basic terminologies. Generative AI applications are built on what we call large language models, or LLMs: machine learning models trained on vast amounts of data, think all of the data on the internet, that scale. Models that have been trained on a large amount of public data are called foundation models.

Foundation models are more general in nature. You can tune these foundation models with your domain specific data sets to make them more focused on your company's use cases, and that process is called fine tuning.

Of course, there is a lot more going on in the generative AI world, but for this pattern, this much information should be good enough.

All right. So we have an application database in Amazon RDS. Just like in the previous architecture, we bring that data in using AWS Database Migration Service and write it into the data lake storage layer in raw format.

We are using Spark to do some ETL processing. In this case, we are writing it out to S3 in the JSON Lines format instead of Iceberg, because the model we are using and the fine tuning we are doing require the data in a specific format, so you need to write it out in that format.
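A minimal PySpark sketch of that export step: Spark's JSON writer emits one JSON object per line, which is JSON Lines. The source path, column names and the prompt/completion layout are assumptions; the exact fields depend on the model you are fine tuning.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

curated = spark.read.parquet("s3://my-data-lake-bucket/curated/support_tickets/")

# Shape the records into the prompt/completion layout the fine-tuning job expects.
training = curated.select(
    F.col("question").alias("prompt"),
    F.col("resolution").alias("completion"),
)

# Each output file contains one JSON object per line (JSON Lines).
training.write.mode("overwrite").json("s3://my-data-lake-bucket/fine-tuning/jsonl/")
```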

Then we have the model in SageMaker, and we use this JSON Lines data set, which is still your data set, your company's data, to fine tune the model for your specific, domain focused use cases.

All right. Now once the model is trained and ready to be used, you'll have a generative AI application, which you can build on Lambda.

In the generative AI application, let's say there is a question that a user is asking. That is the input coming into the application. The application takes that input, and it also looks up the PostgreSQL database to see if there is any contextually relevant data it can find, based on the question being asked by the customer.

It takes the contextually relevant data, adds it to the input from the user to make an engineered prompt, then sends the request to the model and gets the response back.

Now, this process of looking up the PostgreSQL database to get contextually relevant data is something we'll dive a little deeper into in the next pattern, so we'll build on this in the pattern we talk about next.

Ok. So here, let's start from the right side.

In order to improve model accuracy in generative AI, you need to provide it with contextually relevant data, just like we talked about. So where is this data coming from? It comes from your own data stores: it could be your data lake, it could be a database, it could be any of the data stores you're using right now.

Once you get the data from your databases or data stores, you'll use a processing tool such as Glue or EMR to do some basic processing, and you also split the data into relevant elements. An element could be a word, a phrase or a sentence; it depends on the use case.

Once the data comes in, you take those elements and send them to the model, which could be in SageMaker or Bedrock. The model generates a vector, and a vector is just a string of numbers; it's easiest to think of it that way. That vector is stored in a vector database, which could be Aurora, RDS or OpenSearch.

Now, if you go to the left side, you have a user who is accessing a generative AI application and asking a question. That question is again broken down into elements, just like we did on the other side. Each element is sent to the model, the model responds with a vector, and that vector is compared against the vectors that were stored previously from your domain specific data, doing a nearest neighbor search to find contextually relevant data. The application then takes that data, creates an engineered prompt, and sends it to the model to get a response back.
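A minimal sketch of that store-and-retrieve flow, assuming an RDS or Aurora PostgreSQL instance with the pgvector extension enabled and a table created up front with `CREATE EXTENSION vector; CREATE TABLE documents (id serial PRIMARY KEY, body text, embedding vector(1536));`. The `embed()` helper is hypothetical; it stands in for whatever embedding model you call (a SageMaker endpoint, Bedrock, etc.) and returns a list of floats. Endpoint and credentials are placeholders.

```python
import psycopg2

def embed(text: str) -> list:
    # Hypothetical placeholder: call your embedding model here.
    raise NotImplementedError

def to_pgvector(vec) -> str:
    # pgvector accepts the text form '[v1,v2,...]'
    return "[" + ",".join(str(x) for x in vec) + "]"

conn = psycopg2.connect(host="my-db.cluster-xxxx.us-east-1.rds.amazonaws.com",
                        dbname="appdb", user="app", password="...")
cur = conn.cursor()

# Ingest side: store each element together with its embedding.
body = "Refunds are issued within five business days."
cur.execute("INSERT INTO documents (body, embedding) VALUES (%s, %s::vector)",
            (body, to_pgvector(embed(body))))

# Query side: nearest-neighbor search for the user's question.
cur.execute("SELECT body FROM documents ORDER BY embedding <-> %s::vector LIMIT 3",
            (to_pgvector(embed("How long do refunds take?")),))
context_snippets = [row[0] for row in cur.fetchall()]
conn.commit()
# context_snippets then goes into the engineered prompt sent to the model.
```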

I know there was a lot that we discussed here, and if you're not in the gen AI space, this might be a little too much. So we have a blog that talks about this in a great amount of detail; it covers generative AI concepts as well as using vector databases with AWS. But for this particular architecture, what I really wanted to cover is that using vector databases with generative AI is important to make sure your model accuracy is up to the mark you want it to be, and there are different vector database options on AWS: Amazon RDS, Amazon Aurora and Amazon OpenSearch.

All right. So to recap, we talked about AWS's open source strategy and how we are collaborating with the open source community, we talked about the different services you can use to build an open source data strategy on AWS, and we looked at a few common architecture patterns.

There are a few resources on the slide here that you can use. They cover our open source data strategy and the different database and analytics services that we have. And if you are building on EKS, there is a Data on EKS GitHub repo that has a lot of good resources as well.

So with that, I want to thank you all for attending our talk. I know there was a lot of content that we covered, so thanks for sticking with us for the last hour. If you have any questions about the content we discussed, or about specific use cases you are working on, you can find us in the hallway, and we are happy to answer questions or have a further discussion.

So yeah, thank you for your time. Please take the time to complete the survey and enjoy the rest of re:Invent. Thank you.
