Building data mesh architectures on AWS

Good afternoon. Thanks, everybody, for coming. It's lovely to see such a full room. My name is Ian Myers, I'm the director of product with AWS's analytics group, and we're going to talk about data mesh today. We're very lucky to have a customer speaker, Travis from GoDaddy, who's going to talk in detail about what they built. I think it's going to be a pretty good deep dive for you.

We're going to cover the modern data strategy we see our customers building, and talk a little bit about the challenges customers face as they scale across their business, then link that to why some of our customers are finding the data mesh pattern useful. We'll go into detail about the design goals of data mesh and the principles organizations need to think about as they adopt it, and show you how you can build one on AWS, including a deep dive with GoDaddy on what their architecture is and why it suited them so well.

So to open: a modern data strategy really needs to cover a broad set of your data sources and be expansive and inclusive of all the different types of operational data you need to capture, and allow it to scale with your business. As data comes into your strategy, we need to apply a strong catalog and governance model so that you can make sense of the data you have and aid discoverability, as well as apply the security guardrails that allow you to scale with confidence.

And around that you build a set of analytics applications, machine learning and integrate your databases and data lakes to store the data and solve the problem that you have at hand. This information all gets surfaced out to your end users, your business analysts, as well as your applications and devices that depend on better decision making. And we try and close the loop and bring those types of decision support back into the data sources to improve your customer experience.

When applied to AWS, we start with a foundation on Amazon S3. The Simple Storage Service is designed to store objects of any type with ultra high durability and availability so that your data is safe and secure regardless of what type of data you have. And really to be able to scale with your business.

And sitting around that, the AWS Glue Catalog stores references to how your data is structured. We support a bunch of different file formats. And we treat your files that sit on S3 as tables that live in databases that allow you to organize your data and access it with a bit more structure than you would get if you were just dealing with raw files.
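As an illustrative aside from the editor (not something shown in the talk), here's a minimal boto3 sketch of what cataloging files on S3 as a Glue table can look like. The `sales` database, `orders` table, columns, and bucket name are all hypothetical.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a logical database in the Glue Data Catalog (hypothetical name).
glue.create_database(DatabaseInput={"Name": "sales"})

# Register Parquet files on S3 as an external table with a defined schema.
glue.create_table(
    DatabaseName="sales",
    TableInput={
        "Name": "orders",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            "Location": "s3://example-bucket/sales/orders/",
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "order_ts", "Type": "timestamp"},
                {"Name": "amount", "Type": "double"},
            ],
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```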

AWS Lake Formation is our permissions layer. It allows you to create fine-grained access controls over your data and to use tag-based security, so that instead of just granting an individual access to a database and a table, you describe your data through metadata stored in the catalog and create permissions policies based on that metadata. That lets you secure data based on whether it includes PII, which cost center it belongs to, or which data protection regime it's subject to.
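To make the tag-based model concrete, here is an editorial sketch (not the speaker's example) of Lake Formation tag-based access control with boto3: define an LF-tag, attach it to a table, and grant on the tag expression rather than on individual tables. The tag key, values, table, account ID, and `analyst` role are assumptions for illustration.

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Define a tag vocabulary once (hypothetical key/values).
lf.create_lf_tag(TagKey="classification", TagValues=["pii", "public"])

# Attach the tag to a table in the Glue Data Catalog.
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "sales", "Name": "orders"}},
    LFTags=[{"TagKey": "classification", "TagValues": ["public"]}],
)

# Grant SELECT to an analyst role on everything tagged classification=public.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analyst"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "classification", "TagValues": ["public"]}],
        }
    },
    Permissions=["SELECT"],
)
```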

And then Amazon Athena is our serverless query and Spark processing service. We're very pleased to have announced support for Spark at re:Invent this year. Athena allows you to get started very rapidly working with your data in a modern data strategy, whether you need SQL or you want Spark-style processing for more advanced use cases.
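For readers who haven't driven Athena programmatically, a small sketch of submitting a SQL query and reading the results with boto3; the database, query, and results bucket are placeholders.

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit a query against the hypothetical "sales" database.
qid = athena.start_query_execution(
    QueryString="SELECT order_id, amount FROM orders WHERE amount > 100 LIMIT 10",
    QueryExecutionContext={"Database": "sales"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then print the rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```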

Sitting around that is a sphere of purpose-built services. These include OpenSearch for log analytics and data discovery, and more operational data sources like Amazon Aurora, a highly reliable and scalable RDBMS that's compatible with PostgreSQL and MySQL, or DynamoDB, a fully managed NoSQL database that offers predictable latency at virtually any scale.

EMR gives you a lot of flexibility in how you customize your data lake processing and allows you to run Spark, Presto, Hive as well as any type of software that you want to bring into the cloud for data processing.

Redshift is our fully managed parallel data warehouse, supporting both cluster-based and serverless models. And all of these services together are in aid of pushing data into SageMaker, where you can build, train, and manage your ML models at scale across all of these services.

First and foremost, security is job zero, and we offer Lake Formation permissions as well as a broad array of security services that help you keep your data safe and secure and understand where it is and who's operating on it, whether that's CloudTrail for auditing, IAM for role-based permissions and access to the cloud, or services like AWS Shield and Security Hub for protecting your accounts.

Durability and availability are part and parcel of storing data in the cloud. Services like S3 are designed for eleven nines of durability with the standard storage class, which means that out of 10,000 objects, you could theoretically expect to lose one in 10 million years. Those durability guarantees are backed up by active management of the integrity of your data so that we can offer you the highest durability possible.

Simplicity and ease of use is a renewed focus for AWS, with services like DataZone that we also announced at re:Invent. We're trying to drive simplicity in how you catalog your data sets, how you find those resources, and how you share them across your business. That simplicity will also be driven down through things like our announcement of OpenSearch Serverless; we're now able to support serverless options across all of our analytics offerings.

Price performance continues to be a major focus of our analytics services. It's a key area where we differentiate: we relentlessly drive performance out of the estates we run through optimization, and we're able to offer significant gains in performance year over year, whether that's in Redshift, in EMR, or particularly in our investments in Spark.

You've heard some about data connectivity and integration. We announced that we now have more than 50 connectors for Amazon AppFlow, which lets you synchronize data from SaaS platforms with your data lakes. We also support more than 120 connectors in AWS Glue and more than 30 in Amazon Athena, with the goal that all of your data can flow into the cloud and be processed with our purpose-built services.

And last but certainly not least is a huge focus on data governance. Services like Lake Formation give us permissions, which are the foundation of governance, and CloudTrail provides auditability. Over the coming months and years we'll continue to invest in data governance controls that help you create the security guardrails that, in the data mesh pattern, can be pushed down and federated into your business unit accounts so that your customers can innovate.

So, good, we're done. I wish. Certainly one challenge customers face is more data sources every year. With every new business unit and every new use case there is more and more data, the variety continues to increase, and innovation in open source continues to drive new use cases. A huge rise in interest in data formats like Iceberg, for example, now makes us ask: what are we doing about Iceberg? How much do we care about open transactional table formats, and how are we going to use them within our business?

You are probably also building more and more business units, each with disparate requirements. And I think it's fair to say that all of our business units today are all in on analytics in the cloud, so you have to meet them where they are, including very unique requirements that are specific to the business you're in.

And with this growth comes the need for stronger governance. Often the combination of more business units and more data can feel like it's in tension with that need. Lines of business have made very big investments in data lakes, to the point that in some cases these are proliferating: we have customers with tens of data lakes in the cloud (I've not seen anybody with hundreds quite yet, but raise your hand), each from a specific business unit, each with different data and specific requirements for the engines they use to process it. They can be very unique, which makes it more difficult to share data across those data lake environments.

Some of the challenges we hear in this operational environment are around things like data discovery: being able to find the data people need, wondering if the organization has a piece of data, knowing it must exist. But where is it? Is it good quality if you can find it? When was it last updated? Who owns it? These can be very difficult questions to answer unless you invest in a model that facilitates data discovery.

And then ok, this looks good. It's up to date. It's real. Can I have access, please? Who do I call? How do I get that access? It can be very complex to build that workflow and facilitate sharing.

So we certainly hear from some customers who say: this is so hard, we need to centralize everything, we need to get everything in one place. That can be an effective strategy, it really depends on your business, but this can reduce the friction... sorry, increase the friction, important point: you increase the friction of getting everybody into one way of doing things.

We started from a place where business units were able to innovate in the cloud independently of each other, and now we may find that we're trying to constrain how they work and create something where one size ultimately doesn't fit all. Centralized platforms can be complex to scale, it can be difficult to offer a consistent set of tooling, and even if you can offer consistent, up-to-date tooling, it isn't always what your end users need.

You may have a particular business unit that needs something very exotic and unproven, and then how do you bring that into a centralized environment in a durable and accessible way? There's one other thing that happens: data is being shared, you just don't know it. Based on relationships within your organization, folks are striking up sharing agreements and allowing direct connectivity to databases in an ad hoc and ungoverned way, which can increase the risk of breaches. You need to be able to audit and understand data access across a highly federated environment.

One of the other challenges is that there are organizational constraints that make it hard to get the kinds of projects done that we need, and to facilitate data sharing. I had this direct quote from a customer and it stuck with me: "Everybody wants to be a consumer, nobody wants to be a producer." That is a consistent problem we hear. The benefits of being a data producer are often realized by other parts of the business and are not directly attributable to the producer's efforts and investments. So organizational incentives, and how you think about data sharing, can be one of the most important things you work on as you develop and evolve your modern data strategy in the cloud.

So why data mesh? Why would we think this could be a path forward? Well, firstly, it acknowledges and helps us to leverage the investments that have been made in these independent platforms. Rather than saying the only path forward is centralization, we say: those things are out there, they're good, they do what they do, we're going to realize the benefit, and we're going to treat each of them as a data domain.

We then have to figure out how we interoperate. So we push policy and governance down into those environments, which lets us set the security guardrails: yes, I know you're using a different technology stack, and maybe you're using different data formats, but we know that things like data security, data accessibility, and metadata standards are all being maintained, even through that heterogeneous implementation.

Data mesh also advocates for a mechanism for data discovery. In practical terms, this is usually some sort of centralized mechanism. Sometimes the mental model you have of data mesh is truly point to point, and that is a data mesh, but then it becomes very difficult to find all of the data in your organization. So at some level you have to bring those resources together into some sort of catalog of catalogs, or index, or search capability, so that you can find where the data exists across all of these domains.

Data mesh also advocates for self-service data sharing, where business units can come in, find the data they need, and make a request. That request gets dispatched to the producer, who can then decide whether to grant access, and in fact what access to grant. There may be further guardrails applied: you can access this data for only 30 days, or you can access this data but not the portions that include highly confidential or sensitive information.
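The "everything except the sensitive columns" kind of grant maps naturally onto Lake Formation data cell filters. A minimal editorial sketch, with hypothetical account IDs, table, and column names, is below; a 30-day time limit isn't a built-in grant option, so in practice it would usually be enforced by automated revocation.

```python
import boto3

lf = boto3.client("lakeformation")
CATALOG_ID = "111122223333"  # producer/governance account that owns the table (hypothetical)

# Define a filter that exposes all rows but hides sensitive columns.
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": CATALOG_ID,
        "DatabaseName": "sales",
        "TableName": "orders",
        "Name": "orders_no_pii",
        "RowFilter": {"AllRowsWildcard": {}},
        "ColumnWildcard": {"ExcludedColumnNames": ["customer_email", "card_number"]},
    }
)

# Grant the consumer SELECT through the filter rather than on the raw table.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "444455556666"},  # consumer account (hypothetical)
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": CATALOG_ID,
            "DatabaseName": "sales",
            "TableName": "orders",
            "Name": "orders_no_pii",
        }
    },
    Permissions=["SELECT"],
)
```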

And importantly, the small part of centralization that sits within a data mesh platform or product within your organization helps you to measure and audit usage and permissions and make business investments. What are our hot data sets? What are people wanting access to? Where should we be creating better data quality over time?

So with that, I'm going to close and hand you over to Nivas Shankar. He's a principal product manager on Lake Formation, and he's going to talk to you about how we do it in practice. Thanks.

Thank you. Hello everyone, it's great to be here. My name is Nivas Shankar, principal product manager on the Lake Formation team. You've seen the challenges our customers have described to us: the need to operate data at scale with a federated governance model and self-service capabilities. Let's take a moment to understand the design goals, where these principles originated, and the architecture patterns AWS recommends for building this on AWS.

So let's deep dive into the core principles. The data mesh idea was first articulated at Thoughtworks by Zhamak Dehghani; she's the one who framed data mesh around four key principles, and she has talked at length about how those principles drive design thinking on the data mesh pattern.

As you've seen, the first challenge is scale, and the first principle is autonomous, decentralized ownership of data by data owners. How would we do that? How do we have responsible data owners who know where the data is stored, how it is governed, and how it has been shared? That advocates for a data owner role in this model.

Second, as organizations build out domain ownership and decentralization of data, they also need to treat data as a product, which is exactly where the core value of the data lives. You're transforming your raw data and enriching it, but at the end of the day, consumers only want to see the value they're going to get from it.

So they rely on personas like data engineers to cleanse and transform the data and make it available in the catalog, so that it's discoverable and can be shared with end consumers.

Third, and very important: with this decentralized thinking, where data owners set up their environments and data engineers build the products, how do we still provide federated governance by domain owners, so they know how they are operating data access, how they are onboarding data, and how they are granting access to it?

That advocates for a data governance council, and you'll see from Travis how GoDaddy built their data council team to manage data set sharing between producers and consumers.

And fourth, the self-serve principle: consumers shouldn't have to submit a request in an IT ticketing system and then wait a week to get access.

How would you democratize that? How do you make sure the catalogs are easily available and accessible to end consumers, even generalists? They should be able to come in, see the set of data sets that exists, and decide what they need access to; browsing the catalog is what gets them interested.

And when they submit an access request, how do they get access back so they can run their use cases, whether data warehousing or machine learning, seamlessly?

If you take these four principles together, that's exactly what Zhamak Dehghani's work at Thoughtworks describes: driving towards data at scale, providing federated governance to manage independent, decentralized data domains, and making the data platform accessible to consumers in a self-serve manner.

Next, taking these challenges and these principles, let's look at the design goals. What do we need to do to build this in an AWS environment?

First, as we go into this data domain setup, each organization has to get value from the data, so that they move away from the centralized bottleneck that Ian pointed out.

Second, when the business starts to create data products, those products should support the top-level strategy goals. Domains have complete autonomy and the ability to quickly spin up analytical applications and serve their consumers' needs; every domain owner comes in and builds a platform in their own setup.

From a security organization standpoint, how do you make sure the governance doesn't fragment along with the data? Can we maintain consistency in where the data is stored, how the catalog is accessed, and how policies are granted to consumers?

What this design goal really encourages is data-driven agility. Every business owner and data steward now has the autonomy to build their platform, and they start thinking: I can build more applications around it; I don't have to depend on IT architects to grant access to S3 buckets.

Now I have a platform-based approach to set up my own domain and my own catalog. It's enriched, it's protected, and it's owned by me, so I can grant access to it.

As all these things come together, sharing a data product becomes much easier. Domains are only obligated to share the data sets they manage, instead of managing somebody else's domain data.

From a data mesh angle, each domain owner has their own catalog and is responsible for sharing their own data sets.

Keep these design goals and principles in mind, and you'll see how they're represented in an architecture you can implement on AWS.

You've seen the strategy covering various data lakes and data warehouses in Ian's slides; now let's look at it from an architecture standpoint. What does it mean? You take the strategy and break it down into a modern data architecture, where each of the lakes and warehouses contributes its own set of data that the business wants to operate on.

Together they make a domain: it could be data in a lake, data in a data warehouse, or data in a relational database. Somebody may want to run a machine learning model, store their assets, and keep all of that consistently within the domain.

You can think of each line of business maintaining its own domain-centric data. As the domain takes shape, some data will flow inbound and outbound between data stores, whether within the data lake or the data warehouse.

That raises scalability questions: I want to share my data from the data lake with users running a data warehouse use case, or vice versa; and how do I take my data from an RDS instance and share it back into my data lake?

That way machine learning can run against the transactional data. In that mode, the first thing you need is data sharing.

So the first operation within my domain is: how do I take the data sets and data products and share them? And within that, how do I enforce unified governance, which again is through Lake Formation, as we saw from the strategy perspective?

Then, how do you enable all consumers to discover what's available across the data domains, so they have one single unified view of what the catalog represents and what that means for their use case?

That brings up five key areas from an architecture pillar standpoint. First, you're focused on building scalable data lakes and data warehouses.

Second, beyond that, you're building purpose-built data stores within the data domain construct.

Third, using those data products, you need mechanisms to enable data sharing from producers to consumers.

Fourth, you're formulating unified governance around how domain owners manage those data sets and how consumers get access to them.

And fifth, you're making all the data easy to discover, so people can request access and build their use cases around it.

How many of you have heard of DataZone? We launched Amazon DataZone at this re:Invent and there's a lot of talk around it. I want to connect the point of view of where we are today with where DataZone is going, and how AWS can solve these problems.

This is one representation of the logical architecture of a data mesh in an AWS environment. Think of the lower tier as the data domains, where producers manage the data sets in their data lakes and data warehouses, maintaining consistency around what the data means, how the product is maintained, and what the value of the data is.

They maintain full ownership of it. There can be multiple domains of that nature: lines of business, or any number of businesses who want to secure their own data lakes and data warehouses as they join the data mesh.

The central piece is where the federated governance and unified access layer come into place, which you can do through Lake Formation, and also through DataZone, which will provide unified policy management.

Here are the data owners who register the data sets or data products in this environment, and here is where governance of access to those data products coming from the data domains is centralized.

And here is how they can grant access based on the level of access within their domains. That central layer makes the data visible to consumers, while enforcing strong governance that is federated to the domain owners and their domain-specific data.

Consumers can come from various angles. Not every consumer is running daily batch reports, loading data into a data warehouse, and running reports.

Some users just want to understand what's going on in the data. Take a legal and compliance use case: they may come in and ask who accessed all the sensitive data, and want to run SQL analysis on that data without disturbing production workloads.

They come in as a different consumer use case and can use something like SageMaker to run machine learning analysis seamlessly, without interrupting the production workloads; consumers one and two can still continue to run their own models for their use cases.

But it's all unified and access-controlled through the unified permission model in the central account. DataZone is going to solve some of those core problems: centralizing the catalog, provisioning the workflow between producers and consumers, and keeping all the policy management in one place.

Let's look at each principle in detail, data ownership and decentralization, data as a product, and the rest, and how those translate in the AWS world.

I talked about the domain owners, or data owners; they are ultimately responsible for making the data reliable. So what do they do? First, they create an organizational construct. Say we have one line of business that wants to maintain various domains and data sets within their own organization: they can create a number of accounts within their AWS organization, such as production accounts, analytics accounts, or operational accounts, independent of each other.

But they all operate within one domain. Similarly, many domain owners can create their own sets of accounts and manage their own data constructs within their organizational units.

Next, they store their data, decide how they want to register it in the catalog, and make sure the data is protected by default, so anybody coming into the organization's data has to be explicitly granted access before they can see the data in the first place.
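One way to read "protected by default" in Lake Formation terms, sketched here by the editor with a hypothetical bucket, is to remove the default catalog permissions and register the domain's S3 location so that access is only vended through explicit grants.

```python
import boto3

lf = boto3.client("lakeformation")

# Make the catalog "deny by default": new databases and tables get no implicit
# IAM-principal permissions, so every access must be granted explicitly.
settings = lf.get_data_lake_settings()["DataLakeSettings"]
settings["CreateDatabaseDefaultPermissions"] = []
settings["CreateTableDefaultPermissions"] = []
lf.put_data_lake_settings(DataLakeSettings=settings)

# Register the domain's S3 location so Lake Formation vends credentials for it.
lf.register_resource(
    ResourceArn="arn:aws:s3:::example-sales-domain-bucket/curated/",
    UseServiceLinkedRole=True,
)
```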

And third, how do they manage the quality of the data and make sure ease of use and auditability are maintained? That truly comes down to ownership: the data owner keeps it consistent.

They are also responsible for sharing the data, because they are the ones who know how it is being shared across the board. They are the accountable owners who, for legal and compliance reasons, can justify exactly what data they are sharing with which consumers.

You can think of all three sharing models being supported on AWS, data lake sharing, marketplace-based sharing, and data warehouse data sharing, all governed centrally from the data owner's side.

This creates a bounded context: how they want to separate their organization-specific data by domain, and how the domains remain independent of each other from a data owner standpoint.

Second, as the domain owner starts to create data within their domains, how do you make sure the data is actually valuable? The data-as-a-product principle comes into play here: data engineers enrich the data and make it seamlessly available, running the business logic and transformations needed against these tables, and they are the ones removing the friction.

How do I make sure my business owners and consumers don't have to transform or cleanse the data before they run their use case? Can we make sure all the cleansing and transformation happens within this data product zone?

At the end of the day, they're making the data products more valuable. This means the engineering personas work closely with the data owners to provide the insights consumers need.

Third, as both of those happen in parallel, consumers come in, and their model of accessing the data catalog needs to be seamless and self-serve.

Once data is shared as select-only or describe-only, consumers can see what data sets exist and in which domains, request access to the domain data they're interested in, and then access the data once the policy has been set and access has been granted.

This makes it easy for generalist users not to worry about whether the data sits in S3, Redshift, or RDS: "I don't care, I just need the data to do my own work."

This platform thinking makes it easier for consumers: the data is available in a catalog, go use it. You can run your own compute engines according to your use cases.

All the complexity is abstracted away behind the scenes, handled between the data owners and the engineers, so consumers don't have to constantly worry about keeping up with the data.

Say the domain owners change the data: do I need to go back and ask them what changed? No. All the changes should be reflected to consumers seamlessly behind the scenes, and we make sure that principle is covered too.

Finally, federated governance. This is the interesting part for the security folks: as more domain owners and engineers come into one place, how do you make sure governance is enforced at the domain level, so you haven't just created a new problem by opening up multiple accounts to talk to each other?

Can I be sure of who is communicating with what, and that data is only accessible to the domain owners who are allowed to share it with others? Think of each domain owner coming into this standardized model, operating within the governance model, and being accountable for what they share with different users.

You can control that with tag-based management of the data sets, so you know what data sets are there and how they're managed, and all your auditing is in one place: you know what data sets are registered and what data sets are shared, which makes the job of CISOs and security and compliance teams much easier.
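For the auditing point, Lake Formation credential vending is recorded in CloudTrail (for example as `GetDataAccess` events), so a compliance team can ask "who touched governed data recently?" A rough editorial sketch:

```python
from datetime import datetime, timedelta

import boto3

ct = boto3.client("cloudtrail")

# Look up Lake Formation credential-vending events from the last 24 hours;
# each GetDataAccess event records a principal reading governed data.
events = ct.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "GetDataAccess"}],
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
)

for e in events["Events"]:
    print(e["EventTime"], e.get("Username"), e["EventName"])
```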

And finally, all the credential vending for consumers accessing the data also happens in one place, so the scoped-down credentials, who is accessing what and in which session, are known to and managed by the central governance account. All of this is automated, so onboarding a domain owner, creating a data product, or sharing access between producers and consumers becomes easy.

Now that we've seen all the principles, let me take you on a quick journey through the reference architecture, and then I'll pass it on to Travis; I can't wait to see how GoDaddy has taken this pattern and implemented it on their own. In the data mesh architecture, one level below the principles, think in three constructs: the producer accounts, or domain accounts; the central governance account in the middle; and the consumer domain accounts.

Now let's think about how data lake data product sharing happens in this architecture pattern. You build a data lake, you enrich it, you transform it, and you make sure the data is created and ready to share. The first step is to register the data locations and catalogs in the central governance account, populating all the tables, columns, and metadata definitions. You'll also be able to use DataZone here once the preview is available, so all the catalogs sit in one place; you can define data projects and data domains describing where the data comes from, and attach more business context to it.

Once it's registered and populated, you share those catalog tables back to the producer, or owning, accounts. Their catalogs are populated with the same information, but governed and approved through the governance account. They can create local databases, create resource links based on those resource shares, and run their transformations using services like Glue and EMR, running Spark workloads or moving data between domain accounts or between tables. That keeps the producer domain owners centered on their own domain accounts.
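To make the register-and-share-back step concrete, here is a hedged two-account sketch from the editor: the governance account shares a catalog database to the producer account with grant option, and the producer creates a resource link so its local engines can reference it by name. Account IDs and names are placeholders; table-level SELECT grants follow the same pattern.

```python
import boto3

GOV_ACCOUNT = "111111111111"       # central governance account (hypothetical)
PRODUCER_ACCOUNT = "222222222222"  # owning/producer domain account (hypothetical)

# In the governance account: share the catalog database with the producer,
# with grant option so the producer can manage onward grants for its domain.
lf_gov = boto3.client("lakeformation")  # governance-account credentials
lf_gov.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": PRODUCER_ACCOUNT},
    Resource={"Database": {"CatalogId": GOV_ACCOUNT, "Name": "sales"}},
    Permissions=["DESCRIBE"],
    PermissionsWithGrantOption=["DESCRIBE"],
)

# In the producer account: create a resource link pointing at the shared
# database so local engines (Glue, EMR, Athena) can reference it by name.
glue_prod = boto3.client("glue")  # producer-account credentials
glue_prod.create_database(
    DatabaseInput={
        "Name": "sales_link",
        "TargetDatabase": {"CatalogId": GOV_ACCOUNT, "DatabaseName": "sales"},
    }
)
```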

Any transformation changes, or metadata updates like new partitions and incoming data changes, are constantly reflected in the governance catalog, so there is no separate workflow needed between the producer domains and the governance account to know what changed or how things were transformed. That is all work Lake Formation does under the hood to manage catalog changes between accounts. This completes the domain ownership setup, and now the domain owners can decide to share the data with consumers.

They choose the set of tables the consumer asked for access to, and they share those tables at the database level, or even with fine-grained access control, to the consumers. That means the catalog tables get populated in the consumer account, which has its own catalog. Think of it this way: the data is still stored in one place. You haven't copied the data; you've just enabled the mechanisms to scale data access from one account to another, all federated from the individual owner's standpoint.
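A sketch of that consumer-facing grant, again with hypothetical account IDs: a database-level share that exposes every table to the consumer account, with a note on narrowing to columns for fine-grained control.

```python
import boto3

lf = boto3.client("lakeformation")  # producer or governance account credentials

GOV_ACCOUNT = "111111111111"       # account that owns the shared catalog (hypothetical)
CONSUMER_ACCOUNT = "333333333333"  # consuming domain account (hypothetical)

# Database-level share: every table in "sales" becomes visible to the consumer.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": CONSUMER_ACCOUNT},
    Resource={
        "Table": {
            "CatalogId": GOV_ACCOUNT,
            "DatabaseName": "sales",
            "TableWildcard": {},
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
    PermissionsWithGrantOption=["SELECT", "DESCRIBE"],
)

# For fine-grained sharing, swap the resource for "TableWithColumns" (specific
# columns) or a data cells filter, as in the earlier sketch.
```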

Once consumers see this in the catalog, they create local databases, create the permissions for their analysts or consumer personas, and then they can subscribe and query the data products with engines like Amazon Athena and Amazon Redshift Spectrum. Some of you will have seen the new announcement of the data marketplace integration with Lake Formation, one of the newest launches at this re:Invent: if you want to monetize data within your organization or externally, you can bring that into the governance account too, and Lake Formation can centrally govern and manage permissions for all the subscribers. That way, all your data lake data can be shared using these primitives between the governance, producer, and consumer accounts, all through the central account.
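And on the consumer side, a sketch of the local resource link, the grants to an analyst role, and an Athena query through the link; all identifiers below are placeholders.

```python
import boto3

GOV_ACCOUNT = "111111111111"  # governance account that shared the tables (hypothetical)

glue = boto3.client("glue")            # consumer-account credentials
lf = boto3.client("lakeformation")
athena = boto3.client("athena")

# 1. Create a local database resource link to the shared database.
glue.create_database(
    DatabaseInput={
        "Name": "sales_shared",
        "TargetDatabase": {"CatalogId": GOV_ACCOUNT, "DatabaseName": "sales"},
    }
)

# 2. Let the local analyst role use the link and read the underlying tables.
analyst = {"DataLakePrincipalIdentifier": "arn:aws:iam::333333333333:role/analyst"}
lf.grant_permissions(
    Principal=analyst,
    Resource={"Database": {"Name": "sales_shared"}},
    Permissions=["DESCRIBE"],
)
lf.grant_permissions(
    Principal=analyst,
    Resource={"Table": {"CatalogId": GOV_ACCOUNT, "DatabaseName": "sales", "TableWildcard": {}}},
    Permissions=["SELECT"],
)

# 3. Query through the resource link with Athena.
athena.start_query_execution(
    QueryString='SELECT count(*) FROM "sales_shared"."orders"',
    ResultConfiguration={"OutputLocation": "s3://example-consumer-athena-results/"},
)
```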

In the same way, any number of domains can onboard their data sets in this model, and you can see how it scales as domain-specific owners onboard data the same way. Any number of consumers can then come and request access to the data, and it all happens under the hood. Both the producer and consumer sides scale, while the central federated governance still enforces all the access policies: producers manage permissions and consumers receive them.

Now let's look at how this data lake pattern extends to the data warehouse. We recently announced that Redshift data sharing is also governed by Lake Formation policies, so this pattern can help you manage both your data lake and your data warehouse through Lake Formation's centralized governance controls.

First, just like the previous reference pattern, you onboard your data lake into the central account, populate the catalog (and DataZone), and make sure your domains and data locations are registered. In a similar way, you can now bring your Redshift data share tables into the governance account, which will hold all the tables you've tagged as shared through a Redshift datashare.

So now we have two different catalogs coming from the same domain: one with the data lake information and another with the data warehouse information. The same domain owners are responsible for transforming and enriching within their own accounts, which means any changes they make to the catalog are seamlessly known in the central governance account, and the relationships between domains and governance stay intact when they decide to share those data sets with consumers.

Then they grant access to specific databases or tables, or with fine-grained access control, to a consumer account. The consumer gets the catalog and can then grant access within their own account: here is the Redshift data, and here is the datashare they can operate on. This way you can combine both data lake and data warehouse data within this pattern.
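For the Redshift side, a loose editorial sketch using the Redshift Data API to create a datashare on the producer cluster and expose a schema. The final authorization step for a Lake Formation-managed datashare is shown as the editor understands it from the Redshift documentation, but the exact clause should be verified for your setup; all identifiers are hypothetical.

```python
import boto3

rsd = boto3.client("redshift-data")


def run(sql: str):
    # Execute SQL on the producer cluster (hypothetical cluster and secret).
    return rsd.execute_statement(
        ClusterIdentifier="sales-producer",
        Database="dev",
        SecretArn="arn:aws:secretsmanager:us-east-1:222222222222:secret:redshift-admin",
        Sql=sql,
    )


# Create a datashare and add the schema/tables the domain wants to expose.
run("CREATE DATASHARE sales_share")
run("ALTER DATASHARE sales_share ADD SCHEMA sales")
run("ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA sales")

# Authorize the share to the central governance account's Data Catalog so that
# Lake Formation can govern grants to consumers (verify syntax in the docs).
run("GRANT USAGE ON DATASHARE sales_share TO ACCOUNT '111111111111' VIA DATA CATALOG")
```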

That gives you an idea of how the challenges were framed, what our design thinking is, and what reference patterns we recommend building on AWS. A similar pattern is what GoDaddy followed. With that, I'm going to pass it on to Travis to see how he ran this journey with AWS and built a data mesh architecture. Thank you.

I hear that as well from Amazon: how do we think about bringing that lift and shift over? We can talk about that after the session if you have any questions on those fronts.

Well, it's been a great journey, and we expect to continue that journey going forward. So, data mesh: how do we think about this from a customer clarity standpoint? How do we ensure that any of our incoming customers are able to access the facilities they need? And all of that has to be rooted in data.

So we think about how data comes into the system. Thanks, Nivas, for walking us through that process. What are our instrumentation patterns? How do we think about our telemetry, and the preparation and augmentation of the data? You start looking at things like GTMs and Front Door and traffic coming into the system; you're looking at different elements, going server-side versus client-based, cookie-based systems.

How do you bring those data sources together, along with other data purchases coming into the system and other telemetry our applications are emitting? That shapes how we think about this journey for our consumers.

Once we get the data into that position, we start thinking about the interactions and the governance; we'll talk a little about that as well. That motivates how we think about discovery and exploration of our data systems. Ultimately we have many domains within our system, and as we think about our insights and sharing platforms and where we're driving towards, it helps us, in the same context, to build better systems that enable ubiquitous access by consumers and the ability for data producers to own their own data and push it through.

So let's look a little at a customer journey. As you see here, we have a number of different fronts. Think about incoming customers: they might be users, they might be potential customers, they might even be bots coming through the system, or returning users. How do we think about that flow? In this particular case, we're looking at managing properties.

How do we think about the products we're managing and the telemetry we're pulling out when we look at campaigns coming through the system? The data flowing in from our telemetry then moves us into product engagement: how engaged are our customers in the product, how many websites are they spinning up, what patterns are we seeing? And then ultimately, how do we think about first milestone use?

This is an interesting path. When you start to think about data mesh, it becomes very interesting, because when we look at the telemetry, we have all of these different systems. You could come in as a domain purchaser, you could come in through the website, and you might have different telemetry patterns. How do you unify and bring this together in a way that reduces the load on the internal data systems? How do you stitch raw business events into functional business events without your analysts spending 80% of their time trying to munge through the data sources and figure out how those events come together?

There are many more intersections you can find as you think about this customer path. So how do we think about this, going back to what Nivas was sharing? We think about these functions, and I'll push through this a little more quickly: the data owner, the engineers. Now, the interesting thing is data stewards versus the data governance council; they're a bit different here. The way we've implemented data stewards is as individuals within the data domains who own certain aspects of data governance, while a steering council sets up different types of measurements.

How we measure business advances, how we think about naming conventions and dictionaries, how we pull together a number of those data fronts. And then how do we bring that all together across our customers, from visiting users to prospective users all the way to high-value users, pulling through that system?

So how do we build this through the data mesh? We look at what we've done with this ingress and pulling data into the systems. Domain ownership is a bit of a challenge for us to go after; it has been a challenge as the various lines of business try to optimize for their different data patterns. Generally speaking, before this, in HDFS, it was all coming through a consolidated data engineering team.

That team was building all of the components for this, acting on behalf of the different lines of business and then trying to stitch the data together and provide it in a coherent piece. The interesting thing here is that as we've moved into AWS, we're now setting this up so that our data producers can own their own data sets through those core systems Nivas was talking about, some of the architecture we're trying to push forward. Looks like the slide's not advancing.

Thank you. So when we think about self-service all the way through this, how do we provide the right utilities: the APIs, how we're using managed Airflow operations, and the different tooling systems we're pulling together? How do we set up schema registration? We talk about an ABC type of schema, where corporate-level schemas are basically immutable, and then domain or context-bounded schemas let us understand how we're going to stitch those events together. That all flows down into the lines of business.

And then they come together, and as they build on top of the data they understand best, they also have principles and guidelines they're following, and easier APIs as we look at low-code and no-code solutions that enable them to drive their telemetry into the system. On the other side of this is how we think about our consumers, just as we were talking about.

Now, 95% of our analytics comes through visualization. What are the elements we need to think about in that visualization, and how do we drive towards the right endpoints? This is really interesting: we're sitting down with the QuickSight team, talking about Q and how we progressively disclose analytics information. That all starts with how we think about our consumer setup. In the middle of all of this is what we're doing in our core data processing and within the data lake itself.

If you notice here at the bottom, we have a number of native AWS services we're using, but we're also using third-party services. I put Spark in there as a third-party service; I get to strike that out now. At re:Invent this year, with all these announcements, Spark support in Redshift, Spark support with Lake Formation, all of these areas, they're enabling us to take the workload and the effort we're putting into all of our Spark development and leverage that tooling from AWS to take the next leap forward.

One of the interesting pieces of data governance, again, is how we think about measurement. Just as an example: working with the marketing team at GoDaddy, we were having an ongoing discussion about how we measure business processes, how we think about that data, and the naming conventions for that measurement. The debate was return on ad spend versus CPA, cost per acquisition. As we started looking at that, we found that in some cases we were following industry standards and in other cases we weren't, so stabilizing that through data governance helped ensure we were measuring consistently. I'll talk in just a second about why that's so impactful.

Of course, our naming conventions and how we set up the discovery process are very important as well. This is where our catalog, Alation, comes in: how we're thinking about our dictionary, and doing all the work, as we've migrated this data into AWS, to describe it more fully, with mechanisms that enable our consumers to go and query this data.

Using Alation for ad hoc queries, or using Athena as well, to accelerate those query patterns is something we focus on very heavily from our applications within the different lines of business, and we're pushing the permissions for that as well as visualization. At the top of this chain is really how we think about automating this process.

This is where it becomes a little interesting when we think about the GoDaddy culture. It's a learning culture: how do we build authentic relationships with our customers? Experimentation is at the root of that, and experimentation is itself another data producer and consumer. How do we get to statistical significance on lift tests we're driving across our customer base? How do we look at all the manual tests we're running? Those feed into the data producers.

Now, there's a little secret here. I love that Ian said that at the beginning of the presentation; I've heard it many times before: teams want to be consumers, they don't want to be producers. But here's the fact: data producers do want to be producers, they just don't know it. Take experimentation, for example. This is one of the things we've seen at GoDaddy, where we're constantly experimenting: hundreds to thousands of experiments per year, running actively with our customers, trying to understand customer intent and how we drive towards it. That generates massive amounts of inference data that feeds back through our data production pipelines, which we can then turn around and leverage in our data systems to build data products on top of.

This is one way we've embraced this as a company: we've worked with our community on data literacy, driving that notion across GoDaddy as a company to learn how to learn and how to produce. That's changed a lot of the culture. I don't hear that comment as much anymore, Ian; I hear, "hey, how do we produce more data?" The next step, in our machine learning, as we work with SageMaker and build these components on top of our jobs and ML components, is how we automate that process as well.

It's taking that experimentation and mapping it into our machine learning so we can automate those processes and learn about our customers even faster. And then of course the tooling: how do we provide the right tooling back to our data producers so they have the right facilities at the end of the day?

So again, data producers just don't know yet that they want to be data producers. That's the big learning we've had at GoDaddy, and something we've focused on: bringing the whole company around in terms of providing the right data sets. That's a big part of data literacy.

As we walk through this, I'm going to push through pretty quickly; I've got about three minutes left. As we look at where we're driving towards producers, we have a number of different domains, with over 10 lines of business pushing data through our systems. Of course, we're registering that in the catalog, and we have the naming conventions; the data governance council helps drive a number of those specifications, which we can push down to data stewards or others within the lines of business and to our data engineers, to ensure we're pulling in the right data.

Secondly, looking at what we're driving in terms of data processing, and back up to the experimentation: this lets us provide APIs and linkages not only into those systems but into our data systems, as we build out our event systems, batch as well as streaming, to engender a faster flow of data through the system and better outcomes in experimentation and in our online machine learning models.

Of course, governance: I have it as number four, but it's ubiquitous across the board, for both data producers and consumers. How do we enforce those policies, driving them down into the lines of business, and then automate that? I'd suggest you go read the blog post we just published on how we're automating some of those processes.

And then of course our data ingress and egress, which isn't contained just in the visualization layer. Just a fun note on this: when we did the data shift, we had about 4,500 dashboards in Hadoop across the business. We cut 4,000 of those dashboards out of the cycle, determined business criticality, and then started looking at how we build dynamic referencing into our dashboards, so that we drive business focus and data access instead of spinning up all of these dashboards.

So with that, as we look at where we're going at GoDaddy, we think about the bounded contexts. You heard Nivas talk about this. As we look at the domains and where we're pushing, whether it be marketing, care and support teams, US Independents, which is one of our lines of business, or any other spoke, we lovingly call this the hub and spoke as we talk about our data mesh. We are empowering all of our data producers and consumers.

We know that when we take a multidimensional view of this, most of these bounded contexts, these lines of business, want access to each other's data. That becomes the interesting part as we intersect and cross that: data products like user insights, product insights, or even financial insights have to traverse all of those lines of business in order to be effective, in order for us to build those authentic relationships with our customers and connect the dots in the data mesh.

A couple of business outcomes: we've driven a number of different things, many data products across 10 lines of business and 300-plus teams globally. We've seen tens of millions of dollars in benefit already, and we just finished the lift and shift, by the way, on April 27th of this year. It's amazing to see the acceleration and the traction we've been able to build as we've thought more about mesh architectures on top of AWS; it's created an acceleration force that's enabled us to move at the speed of light.

Thank you. Please complete the session survey in the mobile app, and I think we'll entertain questions off stage. Thank you, everyone.
