Build an end-to-end data strategy for analytics and generative AI

Hey, hello and welcome. I hope your re:Invent is off to a great start. My name is Chandra and I'm a Principal Product Manager with AWS Analytics. In this session, we'll talk about how to build an end-to-end data strategy for analytics and generative AI.

We will start with an overview of why you need an end-to-end data strategy. Then we will talk about how AWS can help you build this end-to-end data strategy. Then I will turn it over to Ram, who will show you a demo of some of the innovations we talked about in the session. And finally, we'll turn it over to Karen, our friend and guest from Fannie Mae, who will talk about how Fannie Mae built their end-to-end data strategy on AWS.

Now, we're presenting on a Monday and there are a lot of great sessions ahead of us this week. So where possible, I'm going to include a callout at the bottom of the screen about interesting sessions that are relevant to what we're talking about on that slide. For example, this callout recommends you tune in to Swami's keynote, where he will talk about the innovations in our databases, analytics, machine learning, and generative AI services.

Let's dive right in and talk about why we need an end-to-end data strategy. It's so easy, especially this year, to focus on that new generative AI application, the increasingly accurate machine learning models, or the insightful dashboard, and miss the proverbial iceberg beneath them. These generative AI applications, machine learning models, and dashboards are built on solid data foundations. This is where the complexity is, this is where the hard work is, and building a solid data foundation for your company is the first step in deriving value and insights from your data.

Unfortunately, sometimes this can be challenging because you have to break down the silos that exist in your organization. You may need to break down the data silos, where your data lives across disparate databases, data warehouses, data lakes, and even third party systems like Salesforce or SAP. You may need to break down people silos by making data and analytics self-service, so they are easily accessible to everyone inside your organization, including the less technical people in your organization. And finally, you may need to break down business barriers that prevent cross-account, cross-organization data sharing due to compliance issues or cost attribution.

To overcome these challenges, companies of all shapes and sizes are building decentralized, end-to-end data strategies that let data producers with the domain expertise build and share curated data products across their organization. These curated data products are then utilized by data consumers who understand business priorities and use them to drive business results. Helping coordinate the data producers and the data consumers is a data foundations team that is responsible for selecting and deploying the tools that enable the various stakeholders to easily share data.

Now, of course, all the sharing has to be governed in order to ensure that organizations comply with applicable regulations. To implement the strategy, customers often use a multi-account architecture on AWS. Data producers use separate accounts to isolate data products from each other. And increasingly we see data producers using separate accounts to manage the cost of creating a data product separately from the cost of sharing that data product across the entire organization.

In addition, data producers are responsible for their own infrastructure and for meeting business-defined SLAs around data timeliness and data quality. To facilitate the sharing of these data products across the organization, the foundations team typically provides discovery tools like a business data catalog along with management tooling, and they are responsible for defining the governance, auditing, and compliance requirements for the entire organization.

Consumers can then discover and subscribe to these data assets and use them to build analytics, machine learning models, or applications. To help you build your end-to-end data strategy, AWS offers a comprehensive set of purpose-built services for a variety of use cases, optimized for cost and performance. And many of our services support multiple deployment options, so you can get started quickly by using a serverless option, or you can optimize the cost performance of your workloads by running on pre-provisioned compute, Kubernetes, Spot Instances, or Reserved Instances.

Now, you may be wondering, do I really need a comprehensive set of tools? And as anyone wearing the wrong size shoes this week will find out, one size truly does not fit all. Joking aside, a database is not appropriate for every use case, just like a data warehouse is not appropriate for every use case, just like even a data lake is not appropriate for every use case. In fact, in our experience, it's common for customers to start with one service or architectural approach and then, as they understand their workload and the usage patterns of that particular application, switch to another service or architectural approach because it is better suited for the task.

For example, they may start off with a relational database because they understand it and can get started with it quickly, but then switch to a non-relational key-value store because it is a better fit for their use case and allows them to really fine-tune the cost performance profile of their application. To support customers, AWS provides a comprehensive set of services to help them store and utilize data, to help them integrate the data across their organization so they have visibility into their entire business and customers, and to help them govern their data assets so they can comply with their regulatory obligations.

Let's dive in and take a look at some of the services we provide to help you store and utilize your data. In addition to S3, our durable object store, we offer the industry's most complete set of relational databases, such as Aurora, and purpose-built databases like DynamoDB, a scalable key-value store, Neptune, a graph database, and Timestream, a database purpose-built for your time series data. These databases are uniquely designed to provide optimal price performance for their respective use cases, so developers always have the right tool for the job.

Aurora is our MySQL- and PostgreSQL-compatible relational database service designed for unparalleled performance, including scalability, availability, and reliability, all at one-tenth the cost of commercial-grade enterprise databases.

Now, it's hard to cover all the innovations in Aurora, or any of the services I'll mention in this talk, but I do want to highlight some recent launches that are helping break down data silos inside organizations. One such Aurora launch is the Aurora MySQL to Redshift zero-ETL integration, which seamlessly replicates data from Aurora into Redshift in seconds, in fact with a p50 of less than 15 seconds, so customers can use Redshift for near real time analytics on petabyte-scale data. And the best part is you don't have to do anything to set up this data integration pipeline. You simply tell us the tables you want to replicate to Redshift, and we take care of everything seamlessly in the background for you.
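
To make that concrete, here is a minimal sketch of setting up a zero-ETL integration programmatically with boto3. The integration name, both ARNs, and the follow-up SQL are placeholders I'm assuming for illustration; the console walks you through the same steps.

```python
import boto3

rds = boto3.client("rds")

# Create the zero-ETL integration from an Aurora MySQL cluster to a Redshift
# namespace (both ARNs below are placeholders for your own resources).
integration = rds.create_integration(
    IntegrationName="orders-to-redshift",
    SourceArn="arn:aws:rds:us-east-1:111122223333:cluster:orders-aurora-cluster",
    TargetArn="arn:aws:redshift-serverless:us-east-1:111122223333:namespace/analytics-namespace",
)
print(integration["Status"])

# Once the integration is active, you create a destination database from it on the
# Redshift side, for example (run in the Redshift query editor; the exact syntax
# can vary by integration type, so check the current documentation):
#   CREATE DATABASE orders_analytics FROM INTEGRATION '<integration_id>';
```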

In addition to the database services, AWS offers a comprehensive set of analytics services, and many of these services offer a serverless option so you can get started with them quickly. For data warehousing, we offer Redshift. The beauty of Redshift is the scale of data it can operate on with consistently high performance while keeping your costs predictable. Redshift today offers 7.9 times better price performance compared to other cloud data warehouses, and we will continue to innovate in this area. Redshift is the only cloud data warehouse that lets you run queries at exabyte scale against your data lake as well as at petabyte scale inside your clusters.

In addition, with Redshift's federated query capabilities, customers can query their operational data stores like Aurora or RDS directly. And as data becomes more democratized within organizations, Redshift is delivering on easy analytics for everyone. A recent launch in this area I want to highlight is the autocomplete and syntax highlighting feature in Redshift Query Editor v2. This feature enables less technical users in an organization to build analytics and queries more efficiently and accurately. Embedding machine learning and generative AI capabilities into our services to empower all users is a theme you will hear about in many of the deep dive sessions this week.
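
As a rough illustration of federated query, here is a sketch that registers an Aurora PostgreSQL database as an external schema in Redshift and then joins the live operational rows with a warehouse table. The endpoint, IAM role, secret ARN, table names, and serverless workgroup are all hypothetical.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# One-time setup: map an Aurora PostgreSQL database into Redshift as an external schema.
setup_sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS orders_live
FROM POSTGRES
DATABASE 'orders' SCHEMA 'public'
URI 'orders-aurora.cluster-abc123.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:111122223333:secret:aurora-creds';
"""

# Then join live operational rows with warehouse tables, with no replication pipeline.
query_sql = """
SELECT w.customer_id, w.lifetime_spend, o.status
FROM analytics.customer_summary w
JOIN orders_live.orders o ON o.customer_id = w.customer_id
WHERE o.status = 'OPEN';
"""

for sql in (setup_sql, query_sql):
    redshift_data.execute_statement(WorkgroupName="analytics-wg", Database="dev", Sql=sql)
```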

For big data processing, we offer EMR. EMR makes it easy to run big data processing frameworks like Spark, Hive, Presto, or Flink. It supports the latest versions of these open source frameworks within 90 days, and it provides the best performance at the lowest cost; in fact, Spark workloads run five times faster than open source Spark. In addition, EMR has flexible deployment options, like running on EKS (Kubernetes) or on provisioned compute, and by running on Spot or Reserved Instances you can save 50 to 80%.

A recent EMR launch that I want to highlight is fine-grained access control with Lake Formation permissions for Spark jobs. This feature makes it easy to share data across your organization. Instead of creating different versions of the same table that you then share with different user groups that have different permissions on that table, you can now use table, column, and row level permissions to share just the portions of that table with the appropriate users inside your organization. Be sure to check it out.
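
For a sense of what column-level sharing looks like in practice, here is a minimal boto3 sketch that grants a consumer role SELECT on just two columns of a cataloged table; the role, database, and table names are made up for illustration.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant SELECT on only two columns of a shared table to an analyst role.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/MarketingAnalysts"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_total"],
        }
    },
    Permissions=["SELECT"],
)
```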

For business intelligence and dashboarding, we offer QuickSight, which allows everyone in your organization to understand your data, whether by asking questions in natural language using QuickSight Q, by exploring your data through interactive dashboards, or by looking for outliers in your data powered by machine learning. QuickSight enables business intelligence for everybody. And with QuickSight embedded dashboards, these insights can now be embedded directly into your operational tools and internal portals, so your users have access to the data they need with the appropriate business context.

For example, Bolt uses embedded dashboards to share deeper insights into how shoppers with Bolt accounts compare with guest shoppers. And of course, because these are interactive dashboards, it's always possible to drill into the details required for your use case.
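
Here is a small sketch of how an application backend might generate an embed URL for a registered QuickSight user and drop it into a portal page; the account ID, user ARN, and dashboard ID are placeholders.

```python
import boto3

quicksight = boto3.client("quicksight")

# Generate a one-time embed URL for a registered QuickSight user.
resp = quicksight.generate_embed_url_for_registered_user(
    AwsAccountId="111122223333",
    UserArn="arn:aws:quicksight:us-east-1:111122223333:user/default/analyst",
    ExperienceConfiguration={
        "Dashboard": {"InitialDashboardId": "shopper-insights-dashboard-id"}
    },
    SessionLifetimeInMinutes=60,
)
print(resp["EmbedUrl"])  # drop this URL into an iframe in your portal
```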

Now, to support your interactive log analytics, real-time application monitoring, and website search use cases, we offer OpenSearch Service. It is a fully managed service that makes it easy to deploy, operate, and scale OpenSearch clusters on AWS. Customers use OpenSearch, for example, for log analytics, so they can detect potential threats and respond to changes in system state, all through an open source solution for observability.

Like many of our services, OpenSearch lets you fine-tune the cost performance profile of your workloads. We know that the costs of storing log and application data increase as the data grows, and with OpenSearch Service you can use different storage tiers to optimize the cost performance of your workloads. For example, you can keep your highest priority workloads on hot storage for fast performance while moving the data for your lower priority workloads to cold storage in order to optimize costs.

With all the interest in generative AI, a recent OpenSearch feature I want to highlight is the vector engine for OpenSearch Serverless, which speeds up searching across all the unstructured, semi-structured, and structured data in an organization. At a very high level, OpenSearch uses a data structure called an index. You can think of an index as a list of all the documents that contain a specific word, like all the documents that contain the word "restaurant". But what if there are documents that contain words similar to "restaurant", like "diner", "cafeteria", "steakhouse", or "pizzeria"? How do you find those documents when you search for "restaurant"? To group these documents together, you can use embeddings. You can think of an embedding as a mathematical signature for a document that lets you find documents about the same concepts even if those documents don't contain the exact same words. And with the vector engine, developers can now search across their structured, semi-structured, and unstructured data using descriptive text and metadata, like they do today, as well as search against these embeddings.
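
To illustrate the idea, here is a minimal opensearch-py sketch that creates an index with a knn_vector field and runs a k-NN query against it. It assumes you already have a vector collection endpoint, and the dummy vectors below stand in for real embeddings produced by a model such as Titan Embeddings.

```python
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

# Hypothetical OpenSearch Serverless vector collection endpoint.
host = "my-collection-id.us-east-1.aoss.amazonaws.com"
auth = AWSV4SignerAuth(boto3.Session().get_credentials(), "us-east-1", "aoss")
client = OpenSearch(hosts=[{"host": host, "port": 443}], http_auth=auth,
                    use_ssl=True, connection_class=RequestsHttpConnection)

# Create an index with a knn_vector field to hold document embeddings.
client.indices.create(index="docs", body={
    "settings": {"index": {"knn": True}},
    "mappings": {"properties": {
        "embedding": {"type": "knn_vector", "dimension": 1536,
                      "method": {"name": "hnsw", "engine": "faiss", "space_type": "l2"}},
        "text": {"type": "text"},
    }},
})

# Index a document with a precomputed embedding (dummy vector used here).
doc_vector = [0.12] * 1536
client.index(index="docs", body={"text": "Great steakhouse near the river",
                                 "embedding": doc_vector})

# k-NN query: find the documents whose embeddings are closest to the query embedding.
query_vector = [0.11] * 1536
results = client.search(index="docs", body={
    "size": 5,
    "query": {"knn": {"embedding": {"vector": query_vector, "k": 5}}},
})
print(results["hits"]["hits"])
```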

Customers can use this capability not only to improve their search results and the relevance of those results, but also to create personalized responses in generative AI applications, by finding all the data within their organization about a specific customer, a business, or a specific topic and using that information to create the prompts they feed to large language models.

To support your machine learning workloads, we offer a broad set of machine learning capabilities, from support for deep learning frameworks like PyTorch or TensorFlow, to services like Amazon SageMaker that make it easy to create your own machine learning models or AI-powered applications, to AI services with built-in machine learning capabilities like Transcribe, which can power your speech-to-text use cases, or Textract, which can extract text, handwriting, and data from your scanned documents.

Now, of course, many of our customers are interested in how generative AI could help transform their business, and to support these customers we offer Bedrock. Bedrock is a fully managed service that makes it easy to build and scale generative AI applications with your choice of high performing foundation models, all while maintaining privacy and security.

Bedrock includes all the capabilities you need to build generative AI applications: experiment with foundation models, customize foundation models with your private data, and create agents that can execute business tasks like booking travel or creating an ad campaign. Some of the foundation models that Bedrock supports include Claude from Anthropic, Llama 2 from Meta, and Titan from Amazon.
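
As a minimal sketch of calling one of these models, here is an invocation of Claude through the bedrock-runtime API; it assumes your account has been granted access to the model in that region, and the prompt is just an example.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Claude's prompt format (2023-era) requires the Human/Assistant framing.
body = json.dumps({
    "prompt": "\n\nHuman: Suggest three taglines for a travel planning app.\n\nAssistant:",
    "max_tokens_to_sample": 300,
})
response = bedrock.invoke_model(modelId="anthropic.claude-v2", body=body)
print(json.loads(response["body"].read())["completion"])
```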

So far we've reviewed the services we provide to help you store and utilize the data inside your organization. Now let's dig into the services we provide to help you integrate your data.

Let's start by discussing why data integration is important. Data integration is important because your data lives in disparate databases, data warehouses, data lakes, and SaaS systems across your organization.

"And you need to integrate this data together to create a holistic picture of your business. And customers often this involves developing code to clean and transform your data, developing code to replicate or move data between systems as well as orchestrating end to end workflows.

For example, you may want to manage dependencies, like only running an aggregation step once all the data has been loaded into Redshift. And so, to help you prepare your data, we offer use-case-specific data integration services.

For example, to help you migrate data from open source and commercial databases into your data lakes and data warehouses, we offer the Database Migration Service. And if you want to purchase third party data sets to augment and enrich your data, you can use Data Exchange, a marketplace where you can access over 3,500 data sets from over 300 data providers.

And finally, for orchestration and workflow management, we offer Amazon Managed Workflows for Apache Airflow (MWAA) as well as Step Functions.
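
To illustrate the dependency example from a moment ago (run the aggregation only after the load finishes), here is a minimal Airflow DAG of the kind you could run on MWAA; the task bodies are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_into_redshift():
    """Placeholder: e.g. issue a COPY from S3 into a Redshift staging table."""


def run_aggregation():
    """Placeholder: run the aggregation query once the load has completed."""


with DAG(
    dag_id="load_then_aggregate",
    start_date=datetime(2023, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load = PythonOperator(task_id="load_into_redshift", python_callable=load_into_redshift)
    aggregate = PythonOperator(task_id="run_aggregation", python_callable=run_aggregation)

    # The dependency encodes "only run this aggregation once all the data has been loaded".
    load >> aggregate
```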

And of course, Glue. Glue is our anchor data integration service, and it features a serverless execution engine that can scale to support all your workloads. Glue supports all the users inside your organization with persona-specific authoring tools: notebooks for your technical users, DataBrew for users that want an Excel-style wrangling interface, and Glue Studio for users that want a visual job authoring experience for developing their data integration pipelines and jobs.

A recent Glue launch that I want to highlight is the ETL coding assistant powered by CodeWhisperer. Using this feature, users can build data integration pipelines using natural language input. In this example, just by putting in the comment "Write DataFrame into Redshift", CodeWhisperer will give you the Spark code for writing a Spark DataFrame into Redshift.
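
The generated code typically looks something like the following Glue sketch, which converts a Spark DataFrame to a DynamicFrame and writes it to Redshift through a Glue connection. The connection name, target table, and temp bucket are assumptions for illustration, and the assistant's actual output may differ.

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# A tiny example DataFrame standing in for the data built earlier in the job.
df = glueContext.spark_session.createDataFrame([(1, "widget"), (2, "gadget")], ["id", "name"])

# Write the DataFrame into Redshift via a pre-created Glue connection.
dynamic_frame = DynamicFrame.fromDF(df, glueContext, "output")
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dynamic_frame,
    catalog_connection="redshift-connection",  # Glue connection to the cluster
    connection_options={"dbtable": "public.sales", "database": "dev"},
    redshift_tmp_dir="s3://my-temp-bucket/redshift-staging/",
)
```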

While tools like this will simplify the development of data integration pipelines, in many cases, we can do better. And that's why we're investing in a zero ETL future. Zero ETL eliminates the need for ETL pipelines that you were previously building and managing by hand. And in many ways, this is similar to how we use the term “serverless” right? Serverless doesn't mean that there are no servers. It means that AWS removed the undifferentiated heavy lifting of provisioning, monitoring and deprovisioning servers so you could focus on your business logic.

And in a similar way, as we progress on the zero-ETL journey, we will evolve our offerings to remove much of the undifferentiated heavy lifting of building, monitoring, and managing data integration pipelines so you can focus on your business-specific transformations.

In practice, what this means is that we will invest in making it easier for you to access data in place by expanding the federated query capabilities in Athena and Redshift. So instead of building a pipeline to replicate data from your operational data stores into your data warehouses or data lakes so you can query it, there's no need to move the data; you can query it directly in place.

We will also move machine learning and analytics closer to where the data resides. For example, many of our data services including Aurora, Redshift, QuickSight and Neptune have integrated machine learning capabilities in them already. And just yesterday, Redshift announced the ability to access large language models in SageMaker JumpStart from SQL.
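
As an illustration of bringing machine learning to the data, here is the general Redshift ML remote-inference pattern, which exposes a SageMaker endpoint as a SQL function. The endpoint, role, workgroup, and table names are hypothetical, and the new JumpStart LLM integration has its own options, so treat this as a sketch rather than the exact syntax of that launch.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Expose a SageMaker endpoint as a SQL function (endpoint, role, and names are placeholders).
create_model_sql = """
CREATE MODEL remote_sentiment
FUNCTION fn_remote_sentiment(varchar)
RETURNS varchar
SAGEMAKER 'my-sagemaker-endpoint'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftMLRole';
"""

# Then call the model directly from SQL, right where the data lives.
score_sql = """
SELECT review_id, fn_remote_sentiment(review_text) AS sentiment
FROM product_reviews
LIMIT 10;
"""

for sql in (create_model_sql, score_sql):
    redshift_data.execute_statement(WorkgroupName="analytics-wg", Database="dev", Sql=sql)
```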

And finally, we'll build point to point integrations between our services so we take care of the undifferentiated heavy lifting of moving and replicating data between our services so you don't have to. This is similar to the Aurora MySQL to Redshift zero ETL integration that I covered previously.

Now let's review the tools we provide to help you govern access to your data. So far, we've talked in detail about the tools we provide to help data producers create curated data sets. And we've talked in great detail about the tools we provide to help data consumers utilize these data sets to create interactive dashboards, data integration pipelines and so forth. But without the governance foundations in place, sharing data across your organization will be a difficult task.

And as recent GDPR and CCPA fines have shown, governance cannot be an afterthought. A proper data governance framework is critical because it helps you move faster with data while complying with your regulatory obligations. And to support you, many of our tools already have built in governance capabilities.

For example, SageMaker helps you address common machine learning challenges, from onboarding new users to centralizing model information for multiple users in a single location. To govern your data lakes, we provide Lake Formation, which helps you easily build, govern, and audit your S3-based data lakes, and it features table, column, and row level permissions so you can share the right data with the right users inside your organization.
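
Row-level permissions are expressed as data cell filters in Lake Formation. Here is a minimal sketch that defines a filter restricting a shared table to EU rows; the names and filter expression are made up, and you would then grant the filter to a principal just like any other Lake Formation resource.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Define a row-level filter so a consumer only sees EU rows of a shared table.
lakeformation.create_data_cells_filter(
    TableData={
        "TableCatalogId": "111122223333",
        "DatabaseName": "sales_db",
        "TableName": "orders",
        "Name": "eu_rows_only",
        "RowFilter": {"FilterExpression": "region = 'EU'"},
        "ColumnWildcard": {},  # all columns, filtered to matching rows only
    }
)
```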

And for true end-to-end data governance across your entire organization and all of our services, we offer DataZone. DataZone provides the key components required to share data products across your organization. It includes an organization-wide business data catalog where data producers can publish data assets; data consumers can then discover these data assets and request permission to access them. And once that subscription has been approved and those permissions have been granted, DataZone sets up all the permissions, including cross-account permissions, for you seamlessly in the background so your consumers can just access the data.
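
The same publish, discover, and subscribe flow is also available through the DataZone APIs. The following is only a rough boto3 sketch of the consumer side, searching the catalog and raising a subscription request; the domain, project, and listing identifiers are placeholders, and the exact parameter shapes should be checked against the boto3 DataZone reference.

```python
import boto3

datazone = boto3.client("datazone")
domain_id = "dzd_exampledomain"     # your DataZone domain identifier (placeholder)
consumer_project = "prj_consumer"   # the consumer project identifier (placeholder)

# Search the business data catalog for a published listing.
results = datazone.search_listings(domainIdentifier=domain_id, searchText="user profile")
listing_id = "..."  # take the listing identifier from the search results

# Raise a subscription request; DataZone routes it to the producer for approval
# and, once approved, wires up the cross-account permissions automatically.
datazone.create_subscription_request(
    domainIdentifier=domain_id,
    requestReason="Need user profile data for the travel assistant",
    subscribedListings=[{"identifier": listing_id}],
    subscribedPrincipals=[{"project": {"identifier": consumer_project}}],
)
```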

DataZone is a new service, we just launched it in October, and if you are not familiar with it, I encourage you to learn more about it this week. It really provides a lot of the connective tissue needed to build an end-to-end data strategy on AWS.

That was a quick overview of why you need an end-to-end data strategy as well as some of the services we provide to help you build it. Now I will hand it over to Ram, who will walk through a demo of how to build end-to-end systems on AWS using some of the innovations we just talked about. Ram?

Now, this part of the demo assumes that you have your DataZone domain set up and the environment up and running. If you are interested in knowing more about those details, please take a look at the 313 session, where the workshop gives you a lot more detail.

Just to let you know, here I have logged into two different browsers because we are talking about two personas, right? A producer and a consumer. The producer is the person or team who is publishing the data, and the consumer is the one who is requesting the data, searching for it, finding it, sending a request, and consuming it. I've put a callout at the top so you have an idea whether I'm on the producer screen or the consumer screen.

Now, because I'm part of the producer team, I have access to the DataZone project, which is the customer publisher project. I have the data source available, so let's go to that specific data source and bring in the metadata from the source systems by clicking the Run button. Once that run is over, it has imported all the metadata from the Redshift database; you can see the user profile and hotel booking tables there. Now our inventory is done.

So let's go to the inventory section, and you can see both tables there. Let's select the user profile data. Here you can see the metadata that is automatically generated by DataZone for you. You can review it and edit it if required, but for me everything looks good, so let's go ahead and accept all. Now the inventory is done and we have validated the metadata, so let's go and publish it. Let's go to that specific data set, click Publish asset, and confirm. That's all you need to do for your data to be visible and available in your business data catalog so that other users can search for it and consume it.

Now we are moving to the consumer. As a consumer, they also have an environment available, and you can see it's a Redshift environment they are going to work with. The first thing the consumer does is search for a data asset. They search for user profile information, see the published data set, and everything looks good: this is the data set I want, so let's go ahead and subscribe. Hit the Subscribe button, provide a reason explaining why we are requesting access to this data set, and hit Subscribe.

Now this starts a workflow, because a request is sent to the producer saying, hey, there is a consumer who wants to access this data. Let's go back to the producer, because the producer has to act on it. On the producer side, you will see a notification that a subscriber has requested this data. The producer takes a look at it, and if everything looks good, puts in a decision comment and approves it. That's all the producer has to do; behind the scenes, DataZone sets up all the permissions needed so that the consumer has access to this data from their environment.

So let's go to the consumer and take a look. You can see that in the subscribed data on the consumer side, the user profile is listed, and we go to the consumer's Redshift environment to see if we are able to see and query that data. There is a view available now on the consumer side, so they are using the consumer's compute, the consumer's Redshift cluster, and querying against that view to get access to the data. You can work with this, analyze it, and perform all the other activities you want on it.

What I want to highlight here is how easy it was, as a publisher, to publish a data set, and as a consumer, to search for the data you are looking for, get access to it, and start using it.

Now let's build that generative AI chatbot. In this demo, we are using the data we have curated to power a generative AI application. We are going to build a travel planning assistant chatbot, which takes the user's input, queries Redshift for the data it needs, and then calls Amazon Bedrock, which invokes the Claude large language model and passes all that information along to provide personalized suggestions for this user. Let's take a look.
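
Under the hood, the pattern Ram describes can be sketched in a few lines: fetch the user's profile and bookings from Redshift with the Data API, fold them into the prompt, and call Claude through Bedrock. Everything here, the workgroup, the table and column names, and the user ID, is assumed for illustration and simplified relative to the actual demo code.

```python
import json
import time
import boto3

redshift = boto3.client("redshift-data")
bedrock = boto3.client("bedrock-runtime")


def query_redshift(sql):
    """Run a SQL statement through the Redshift Data API and return the result rows."""
    stmt = redshift.execute_statement(WorkgroupName="travel-demo", Database="dev", Sql=sql)
    while redshift.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)
    return redshift.get_statement_result(Id=stmt["Id"])["Records"]


user_id = "blake-001"  # hypothetical user id entered in the chatbot
profile = query_redshift(
    f"SELECT interests, favorite_foods FROM user_profile WHERE user_id = '{user_id}'")
bookings = query_redshift(
    f"SELECT city, checkin_date FROM hotel_booking WHERE user_id = '{user_id}'")

# Fold the retrieved rows into the prompt so the model can personalize its answer.
prompt = (
    f"\n\nHuman: You are a travel planning assistant. "
    f"User profile: {profile}. Upcoming hotel bookings: {bookings}. "
    f"Build a day-by-day itinerary for this trip.\n\nAssistant:"
)
resp = bedrock.invoke_model(
    modelId="anthropic.claude-v2",
    body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 800}),
)
print(json.loads(resp["body"].read())["completion"])
```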

Here we are in the Redshift Query Editor. This time we want to look at a specific user, because we want to understand what their interests are, what their hotel booking schedule looks like, and so on. There is a user ID here, so let's query it. You can see that Blake likes quilting, ice skating, and tabletop games, and their favorite foods are macarons, waffles, and pudding. Now let's look at the hotel reservations: it looks like there is an upcoming trip in June to Manchester, Brussels, and Paris.

Now let's go to our chatbot. The first thing the chatbot asks for is a user ID, so let's provide Blake's user ID. At this point, the chatbot goes against Redshift and retrieves all the information we just looked at, so that any time the user asks a question, it can pass that additional context along and the large language model can provide personalized suggestions.

So Blake asks, hey, can you help me with the travel itinerary for my upcoming trip? You can see the suggestions coming up. If you look at them, the itinerary considers each of the cities Blake is traveling to and the respective dates. And within each of those itineraries it works in Blake's interests: it suggests tabletop games and skating, includes the love of waffles, and how could we miss some macarons in Paris? So that's how it works.

How about if Blake wants to take a trip in the US? This time, the model understands that the user is asking a question but doesn't have booking information for that particular date or location. It still provides an answer, but instead of specific dates from the table, it gives its own suggestions along with the very specific interests of the user.

Now, in the last 12 minutes or so, we looked at how you can build end-to-end systems using databases, analytics, and machine learning services on AWS. Setting the environment setup aside, we spent almost three minutes creating a zero-ETL integration between a transactional database and a data warehouse, and less than three minutes publishing a data set, sharing it, and consuming it from a sharing standpoint. And then we also looked at how we can use the data we curated to power a generative AI application.

So with that said, I will hand it over to Karen.

Thanks Chandra and Ram. I'm really glad they laid out the foundational story of why an end-to-end data strategy is needed. I'm Karen Ramini, Vice President for Single-Family Architecture, Cloud, Data and Infrastructure at Fannie Mae. As we've all seen, a data strategy has to be inclusive and end to end.

I'm thrilled to be here to share our journey into data mesh and the insights we have learned during that journey as it applies to the financial sector.

Speaking of the financial sector, you are probably all familiar with what Fannie Mae does. We are one of the most valued housing partners: we facilitate equitable, sustainable access to homeownership and quality, affordable rental housing across America. To put that in perspective:

In 2021, we facilitated around $1.4 trillion of liquidity into the market. Roughly one in four home mortgages is facilitated by Fannie Mae.

So how does this relate to you? When you talk about data, think about all the data you have provided during your mortgage, from loan origination to loan servicing, the whole nine yards. We host all of that data. Now think about everyone, one in four homes, whose mortgage and loan servicing we are facilitating.

All of that adds up to plenty of petabytes of data that we host.

Now let's talk about data. Traditionally, organizations struggle with an elusive integrated data experience, and the fundamental reason is the approach the organization has taken. Typically, it is treated as a technical issue, and quite often you see people referring to data as an asset; there lies the problem. Data is an enabler of a business capability, not an asset. That's the shift, if you will.

And if you look at where we all evolved from: in the first generation of data platforms, you essentially take your operational data and dump it into data warehouses. In the second generation, you dump it, without much ETL, into something like Hadoop and extract data out of it. And the third generation is where you take all the data, do some ETL, and dump it into a single account, in AWS if you will. If you look at all of those approaches, there is a fundamental issue there.

It's the one-size-fits-all issue mentioned earlier: your data needs vary by the data domain, the business domain you are in, and the persona that leverages the data to power business capabilities. Like any other company, Fannie Mae also evolved from that particular approach, a single-account, third-generation data platform with all the data in one place in a data warehouse. Obviously, your queries will step on someone else's.

And there are resource exhaustion issues. That's when we went to a multi-account-based structure.

So the way Fannie Mae approached it: we realized at the end of 2020, in Q4 2020, that we needed a more integrated data experience.

And we approached that as a socio-technical problem; we took a more socio-technical approach to it.

The term socio-technical systems is not new. It was actually coined way back during the World War era for coal mining. In coal mining, safety and soundness are very important, and the process you follow to realize the value is very important.

The same thing applies to data as well: you're realizing value out of the data. And if you apply that people, process, and technology lens to data, it translates to ownership: business-domain-driven data ownership.

It also translates to treating data as a product and not as an asset, if you will, and to self-service. Basically, when you need data, you need it, and centralizing the provisioning of data access slows you down and affects your time to market.

The next thing that comes in is that safety and soundness are important. If you make governance fully centralized, it slows you down and impacts your time to market; that's where federated governance becomes very important. So Fannie Mae took this approach, and we went to a multi-account structure.

There are several ways to achieve data mesh as a principle, if you will, but we took more of a hub-and-spoke kind of model. What you're seeing on the extreme left is the business domain accounts. The spokes are where you host your business domain data, and when I say data, it's the data, compute, and infrastructure all together; they're all co-located.

You don't want to unnecessarily move the data around; that's going to create latency and impact your completeness and timeliness. The whole idea is that they're all co-located. But there is a central function for governance and enforcement of controls: data quality checks and data ingestion controls.

So there is a central governance function, but make no mistake: what I mean by central is a centralized, cohesive control plane. The actual execution plane and data movement plane are all on the spokes. Think of it as a central control plane with distributed data and execution planes. Then on the extreme right, you're seeing the end users; their needs are completely different from your typical business application capabilities.

That's where the end users are: the data scientists, ad hoc business intelligence reports, and greenfield experimentation. So you see completely different spokes for those.

Now let's look at what this means on the technology side. What you're seeing is one or more accounts per business domain. We don't restrict it to one account per domain; you can have more than one account if you will. It's fully scalable.

So you have a business-product-specific operational account where all your operational data and compute are hosted, and the data is ingested in near real time into the spoke of the enterprise data lake. You can think of that as a mesh of lakes, fed in near real time.

And what you're seeing in the middle is the central data mesh governance. Essentially, that's where all the central services are hosted to provide the needed controls, to enforce data quality, data ingestion, and data movement, and also to host the data catalog so you can discover your data.

So essentially business units manage their own code and infrastructure, it's self-service based, and we have an enterprise data catalog. One of the things you will realize is that as you mature into this multi-account structure and start migrating data to the cloud, you will end up with thousands of data sets.

Now, discovering this data and the relationships between the data becomes a challenge. That's where a central data catalog, which is again federated into each of the spokes but has a cohesive control plane in the central data mesh hub, becomes essential.

On the extreme right are the insights: you are a data scientist, you are doing business reporting. Now they have a cohesive, integrated data experience.

So here are the recommendations for data modeling and domain owners. There are two terms I'm going to introduce: data as a product and data products. Data as a product is basically bringing product discipline to your data: what business capabilities is it providing, with the infrastructure and code altogether. A data product is something that provides or empowers a capability.

So what you're seeing here is: think about the business capabilities your data is providing, not about data as an asset. Within segments, you have one or more capabilities, and together they make up a bigger business capability, if you will.

Now translate that into how data sets and data products come into the picture: a business application can be the producer of one or more data sets, one or more data sets can be part of a data product, and a data product can span multiple domains. It doesn't have to be one domain: underwriting or finance, for example. A finance domain product will span multiple domains by its very nature.

So when you talk about domain-driven data, it doesn't mean one business domain; a data product can span multiple domains. Now, how do they relate? You heard Chandra and Ram talk about various capabilities.

One thing we have done is that everything you're seeing here is contract driven. Meaning, let's say you want to ingest data: it's transport independent; it's about design by contract. Today we use DataSync, and we also use S3 or Kinesis streams as a way to ingest data in real time. As zero-ETL comes into the picture, we're looking into zero-ETL as a way to accelerate ingestion of the data in real time.

So think about designing the contract interfaces and making them transport independent. That also includes your public data sets or external data sets. Basically, what it provides is decoupling the management of data and technology by adopting data mesh as a principle.

Now we have integrated data products that cross domains, making them more valuable. With the centralized data catalog, what you have is enterprise-wide terminology for the various data attributes, so discovery and use of the data becomes simpler: basically, an enterprise logical data model.

And now we also have ownership: well-established ownership of the data sets or data products, which means there is accountability for the upkeep and SLAs of those data products.

Now, just to wrap up, let's look at the various benefits. What you have is the ability to adopt new technologies, agility so you can improve your time to market, and horizontal scalability, because you're going to deal with petabyte-scale data.

And you have faster access to critical data when you need it, because it's all about self-service. You also have an integrated data experience where you can visualize and leverage data across business domains rather than being siloed in one domain.

You also have independence: if you want to switch out or adopt new technologies, you can do so, because this is an innovative, evolving space, which means you should be able to adopt new technologies for any particular function.

And the type of technology you use will change based on the business domain. So what are we looking at next? We have made tremendous progress, and we are continuing to accelerate the adoption of data mesh as a principle. We are looking into new technologies; architecture is a continuous process.

We are looking into new technologies that will simplify our architecture, basically using DataZone, AWS Glue, Amazon Bedrock, and Redshift ML, which is an interesting one. We have this one principle of bringing compute to data, not data to compute. What I mean by that is, if you think of Redshift ML and those kinds of technologies, you're taking compute to the data, which means you're eliminating the need for data movement.

Those are some of the new things we are exploring as well. And to improve our overall governance, we have zero trust as a fundamental need and principle; we continue to adopt zero trust principles to improve our overall data governance.

So for the conclusion, I'm going to turn it back over to Chandra.

Thank you, Karen. It's always amazing to hear about our customers using the capabilities and the innovations we build on their behalf. I want to keep the conclusion relatively short.

Really, the theme of this entire session was how to break down the data silos inside your organization and build solid data foundations using the comprehensive set of tools that AWS provides, which help you store and utilize your data, and help you integrate the data across all the disparate data sources inside your organization to create a cohesive whole,

so you can understand your business and your customers, and which help you govern your data. If you are participating in the analytics superhero sessions, scan now so you get credit for attending this session. I'll keep it up for a couple of seconds in case you're doing it. Okay, awesome. And that's the conclusion.

Thank you. Please be sure to fill in the session survey.
