Easily and securely prepare, share, and query data

Hello, everyone. How's everyone doing? Thank you for being here. I know it's early in the morning, and these lights are in my eyes, so bear with me. So thank you for being here. Today we're going to talk about a very important topic: data sharing and why it is important.

On stage with me: first of all, my name is Navneet. I'm a Principal Specialist with AWS. My role is that of a data strategist, and I work with a lot of biopharma and healthcare companies on building data strategies for them.

With me on the stage, I have Jason and Raj.

Hi, I'm Jason Berkowitz. I run product for Lake Formation, the Glue Data Catalog, crawlers, and some new stuff we'll talk about today. I've been with AWS for eight years, and in analytics my entire career, until maybe some of this AI replaces me, but hopefully not. I'm excited to be here and talk about data sharing; it's something that is dear to me, and I'm excited to be here with Raj talking about Salesforce as well. I'm looking forward to hearing your feedback. Thank you very much.

Hi, everyone. My name is Raj Kumar. I'm with Salesforce. I'm a Data Cloud product lead responsible for Insights, Actions, and Data Sharing. Really excited to be partnering with AWS and excited to talk to you all. Thank you.

So what do we have in store for all of you today? Our agenda: we're going to look at breaking down data silos. Data silos have been a big challenge, in my opinion, where organizations have data spread across multiple silos; you may want to call them lines of business, domains, and so on.

We're also going to look at governed data sharing using Amazon DataZone, and we'll take a look at the service at a very high level. There have been a lot of sessions at re:Invent on Amazon DataZone over the last two days, and of course more today and tomorrow.

After that, we're going to learn a little bit about zero ETL data sharing patterns, including some great announcements that Jason is going to share. And last but not least, you're going to hear from our partner Salesforce, from Raj, on how they have built bring-your-own-lake zero ETL data sharing. I'm really looking forward to that as well.

So here is what we're hearing across the industry from customers about data sharing, and you may be able to relate to some of them or maybe all of them. For example: data is all siloed; searching and analyzing data is difficult; I have to turn our organizational data into an asset or a product; I wish to focus on data and not on managing, maintaining, and administering it. Just a show of hands: do these challenges resonate with you, with your organization, with your business?

OK. Why that is important: I'm going to ask a question. Is your company interested in growth? Everyone is interested in growth; that's why we all are here. But being a data-driven business can help stack the odds in your favor. Per Forrester, you are 8.5 times more likely to see at least 20% growth if your organization is a data-driven organization. What does it actually mean to be a data-driven organization? I'm sure this stat doesn't surprise you. Jason just mentioned AI as well, and all of these generative AI and AI models that are really helping our customers drive the business need, underneath, data access across different lines of business and domains. That's why it's important.

Having said that, it is only a 26% success rate, which means 74% of businesses are not able to see that growth from being a data-driven organization. And the last part of that is this: we know being data driven is important, it's essential, but at the same time data needs to be treated as an organizational asset that everybody can see and everybody can share in a secure environment.

So here are some industry examples of data sharing as a strategic asset. At the top of this slide, we call out two main components. One: in order to enable a good data sharing mechanism across your organization, you need simplified data access and sharing controls. If it is too hard for me to share, I'll just duplicate and copy the data, right? Then it's in my domain; I go through that pain one time and that's pretty much it, unless you have a truly self-serve access model that enables data sharing in a more governed way. Which brings me to the second point on this slide: greater data governance, which brings greater operational efficiency.

Now, we're going to talk about what data governance is. There are a lot of different ways you can look at data governance; we can point out a few. But if you look at the industry use cases down below, the one I want to point out is from my own industry, which is healthcare and life sciences. Secondary analysis within clinical trials is a major area where our customers and partners are saving a lot of cost and money and really enabling business insights to be driven faster. Why is that? Because while the primary trial is still active, the data is not allowed to be shared; that's how it works per the regulations. When the trial is done and submitted and no longer active, you want to leverage all the data you collected during the trial for future trials; otherwise you're doing everything from scratch all over again. So how does it work? You basically want to build a model that enables you to share data across different lines of business, in this case to a secondary trial. And I'm sure there are other use cases you may be able to relate to: media and entertainment, financial services, advertising data, market attribution, and things like that. Commercial analytics is another one I want to point out.

Commercial analytics in life sciences companies is about when the drug is already in the market, post-market. You really want to analyze the adverse effects, the market analytics, the financial analytics of that. So you may be getting data not only from your internal data sources but also from external ones like Salesforce; a lot of data comes in for commercial analytics for those stakeholders to drive insights. In all of these scenarios, having access to data faster, and being in an environment or a data-driven business that enables data sharing, is really the key to solving some of these challenges.

Now, I want to introduce the AWS data governance framework. This is a framework we have published and that we really want to put at the center of every modern data governance and modern data strategy discussion. Note that right in the middle of that wheel we have mindset plus people and process. It is not a technology-only problem to be a data-driven organization; there also needs to be people and process alignment. That's a larger topic we could spend a lot more time on, but I want to call it out because people and process are equally important, and we're going to look at some of the people and process aspects in a moment. In this wheel we have three main components of data governance: understand, protect, and curate. Data governance is not only about providing access to data; it's much more than that. To enable data sharing and data governance, you need to look at all of these pillars.

Let's take a look at a few. In understand, I want to make sure that I have a catalog through which I can list my data, add business metadata to it, and make it visible to consumers so that they can come and search for the right data assets. That's a very basic pillar. You may have heard of the FAIR data principles: Findable, Accessible, Interoperable, and Reusable. Findable is the key; without finding the data, I do not know the data exists. So it's the responsibility of both producers and consumers of data to be in a position where the data can easily be found. Other parts of understand are data profiling and data lineage. In protect, we're going to talk about some of the security controls today, but things like data life cycle, data security, and sharing are really the key when it comes to building a good modern data governance architecture. And last but not least is the curation of data, which involves data integration services. We heard a lot today and yesterday about zero ETL; zero ETL, data integration services, and master data management are all part of this data governance pillar.

Now, I'm not saying that you must have all of these boxes checked before you start building a modern data strategy. What I'm really saying is that you need to understand it is the combination of mindset, people, and process, plus deciding which of the basics you want to enable first. You may want to start with the data catalog plus data security and then go to data lineage and data profiling, depending on where you are in your journey. On one side of governance you have the ingestion of data, and on the other side you have process and consume.

So I talked about people and process. As an example, here are some of the personas for data sharing I want to call out: in order to build a good data governance strategy, you need to have people and process alignment with the technology. Now, you may have heard of the paradigm called data mesh; we're going to talk about that a little bit today as well. Aligning with that paradigm, I've called out four main personas, starting with data domain ownership.

The data owner is someone who really owns the data. He or she is responsible for building the data product and owns the full life cycle of the data, from the point of inception to retirement. The data engineer is someone who is responsible for actually building the data pipelines and the data product and contributing to building more data products in your organization. Then you have the data steward, a very important role, who manages the federated computational governance: someone who is there for the auditability of metadata and who really defines the process, or the workflow, for data sharing.

Remember, at a high level data sharing is all about me as a consumer finding data in a catalog and then requesting access to it. But what happens when I click the button requesting access? A workflow gets invoked; it goes to somebody in my organization who has the rights to approve or deny it. All of that federated computational governance is defined by the data steward in combination with central IT and other data governance roles. And finally, the data consumer, who is a self-serve consumer: these are your research scientists, your dashboard builders, your data scientists, your data analysts of the world who just want to get data access faster. They really don't care where the data is located within the organization; they want to get to the dataset faster. As governance, you want to make sure they cannot get to data without following all the security protocols, but you want to make it easy for them through a self-service workflow, self-service analytics.

Now, there are some data sharing patterns based on the challenges we heard about data sharing and on the data governance pillars. Note that we have four main categories here for data sharing: hub and spoke, data mesh, which I briefly talked about, business to business, and partner collaboration.

The way to understand this: if you look at the left side, hub and spoke and data mesh are largely within your organization. The very first pattern is that I have a set of AWS accounts and I'm the one producing the data. If I go back to my example, I'm the one who has the data for real-world patients, the data for clinical trials, whatever datasets I have. So I have a lot of AWS accounts, and I want to effectively and securely share my data with a set of consumers on the other side. You can leverage AWS Lake Formation, fine-grained access controls, data sharing policies, and things like that, which Jason is going to talk about, and through those you can simply create a hub-and-spoke model that enables your organization to easily share the data.
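As a rough sketch of what the hub side of that looks like in code (the account ID, database, and table names below are placeholders, not from the talk), the producer account can grant a consumer account read access to a catalog table with Lake Formation, and the consumer's admin can then re-share it inside their own account:

```python
import boto3

# Producer ("hub") account: grant one consumer account read access to a table.
# All identifiers below are illustrative placeholders.
lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "222233334444"},  # consumer AWS account
    Resource={
        "Table": {
            "DatabaseName": "clinical_trials",
            "Name": "trial_results",
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
    # Grant option lets the consumer's data lake admin re-share to roles
    # and users inside their own account (the "spoke").
    PermissionsWithGrantOption=["SELECT", "DESCRIBE"],
)
```

On the consumer side, the shared table typically arrives via AWS RAM and is exposed through a resource link in the local Glue Data Catalog before it can be queried with Athena or Redshift Spectrum.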

Moving to the second pattern, data mesh, you make it even more modern by implementing a paradigm like that. The reason is that in a good data strategy, you want your producers to also be able to act as consumers, and your consumers to also be able to act as producers. You want to build an environment where somebody who is producing data can also consume other data. An example: if I'm a research scientist building a set of patient cohorts based on certain needs I have, why can't I simply share that for others to use once I'm done? So you want an environment where there are bidirectional arrows between the producers and the consumers of data. You can still manage that using AWS Lake Formation, and we will talk about DataZone briefly today as well. In a data mesh world, also think about the roles I just talked about on the previous slides.

Then the last two models are not only for intra-organization, intra-company sharing, but also external. So, business to business: somebody in your organization, a producer, wants to share data not only internally but also externally. Maybe you want to monetize it, you want to buy data, or you want to sell data, something like that.

In that case, you can leverage not only Lake Formation, which is a service for sharing data internally, but also AWS Data Exchange to share your data externally with your trusted partners. You can monetize it, and you can also be the one subscribing to data assets from AWS Data Exchange.

Finally, partner collaboration. This is where you want to collaborate with your partners, but you are not ready to subscribe to the data because you do not know yet whether it is good enough for you or whether it is the right data for you, and you are also going outside of your organization, so it's not as easy as going to a data governor and saying, OK, I want access to this data.

So we have AWS Clean Rooms, a service we introduced at last re:Invent, which enables you and your partner to collaborate in a secure environment where each participant can set the rules for how their data can be used. As an example, if I'm one of the data producers in a clean room, I can say: I'm putting my data here, but I only want my data to be seen in a certain way, only in aggregate, for example, or without certain columns exposed.
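To make that concrete, here is a hedged sketch using the boto3 Clean Rooms client; the collaboration and configured table are assumed to already exist, the table and column names are made up, and the exact layout of the analysis-rule policy is recalled from memory, so treat this as illustrative rather than authoritative:

```python
import boto3

# Sketch: attach an AGGREGATION analysis rule to a configured table so that
# collaborators can only see aggregated results, never row-level data.
# Identifiers and the policy layout below are illustrative assumptions.
cleanrooms = boto3.client("cleanrooms")

cleanrooms.create_configured_table_analysis_rule(
    configuredTableIdentifier="ct-1234567890abcdef",  # placeholder table id
    analysisRuleType="AGGREGATION",
    analysisRulePolicy={
        "v1": {
            "aggregation": {
                "aggregateColumns": [
                    {"columnNames": ["purchase_amount"], "function": "SUM"}
                ],
                "joinColumns": ["hashed_customer_id"],  # only joinable on this key
                "joinRequired": "QUERY_RUNNER",
                "dimensionColumns": ["region"],         # allowed group-by columns
                "scalarFunctions": [],
                "outputConstraints": [
                    # Suppress result rows built from fewer than 100 distinct customers.
                    {
                        "columnName": "hashed_customer_id",
                        "minimum": 100,
                        "type": "COUNT_DISTINCT",
                    }
                ],
            }
        }
    },
)
```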

So when you think about these four models of data sharing, and there could be many more, I just want to point out these four because they really prompt some thinking about how we should approach data sharing within and outside of the organization.

Really quickly, Amazon DataZone. Thinking about the data mesh paradigm and the data sharing patterns I talked about: Amazon DataZone is a service we announced at last re:Invent, it went GA this October, and we have several customers using it. You need to think of Amazon DataZone as a service that really helps you govern data access and gives you an environment for data collaboration.

It does so by giving you business metadata and a business glossary for you to build a catalog, and it connects with Lake Formation to build those policies for data access. But more importantly, it also gives you an environment, a workspace, where different users can collaborate with one another.

So, governed data access: we saw how important that is in data sharing. Connect data with people: remember that if I am part of a team working on a use case and I'm a member of a ten-member team, I don't want to go through a process where each of those members independently needs to request data access. I want to get data access based on the use case I'm working on, the project I'm part of, the team I'm part of. That really enables collaboration and closes the last mile of data access. That's what DataZone provides.

Then of course, automated metadata discovery: we heard yesterday that we released a new generative-AI-based feature wherein you can generate metadata using Amazon DataZone.

So what are the core components of Amazon DataZone? At a high level, there is a domain. A domain in your organization can be a business unit, a collection of sub-domains or sub-business-units, or a collection of use cases. On either side you have data producers and consumers; remember the data sharing models I showed before and align them with that. Within a domain, you have a business data catalog that helps you define business metadata, glossaries, and things like that, and that's also where you can utilize some of the generative AI features. Projects and environments give you a boundary where multiple projects can be created for your users to collaborate with one another. And of course there is governance and access control, and it has its own dedicated portal which is backed by the API.
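For readers who prefer APIs to the portal, a minimal sketch of standing up those constructs with the boto3 DataZone client might look like the following; the role ARN and all names are placeholders, and in practice most teams would do this through the DataZone portal mentioned next:

```python
import boto3

# Minimal sketch: create a DataZone domain and a project inside it.
# The execution role ARN and all names are illustrative placeholders.
datazone = boto3.client("datazone")

domain = datazone.create_domain(
    name="commercial-analytics",
    description="Domain for the commercial analytics line of business",
    domainExecutionRole="arn:aws:iam::111122223333:role/DataZoneExecutionRole",
)
domain_id = domain["id"]

project = datazone.create_project(
    domainIdentifier=domain_id,
    name="field-sales-insights",
    description="Project where producers and consumers collaborate on sales data",
)
print("Domain:", domain_id, "Project:", project["id"])
```

From there, consumers raise subscription requests against catalog listings from the portal, and approvals translate into Lake Formation grants behind the scenes, as Jason describes later.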

So instead of going to the AWS console, your users, your producers and consumers, will be going to a dedicated portal. We have several booths at the expo showcasing DataZone demos, so please stop by. We have one in healthcare and life sciences, and we also have one at the analytics booth.

With that, I'm going to hand over to my friend Jason to talk about governed data sharing with zero ETL. Thank you.

Thank you, everybody. I'm looking forward to talking with you today. We're going to go a bit deeper into how the AWS Glue Data Catalog and Lake Formation support zero ETL data sharing experiences.

The first thing I want to share is that we're taking an open approach to data sharing. Historically, we have had best-of-breed services such as Lake Formation and the Glue Data Catalog, so customers can build their data governance solutions within these services. We have a set of partners, such as Salesforce, Collibra, and Privacera, building best-in-class experiences that are easy to use, so your users can find data, request access, and have those permissions pushed down to Lake Formation. We play the role of enforcing those permissions securely within the engines.

Now, with the introduction of DataZone, which Navneet just talked about, we're providing an overarching experience on top of these services, so you get an easy-to-use experience for your collaboration on data. Best of all, these work out of the box with all of our native engines: Amazon Athena, Amazon Redshift data shares, Redshift Spectrum, and so forth. It's open as well: the Glue Data Catalog provides open, Hive-compatible APIs, and it sits on top of the hundreds of thousands of data lakes running on AWS on S3.

As we move forward, we're going to continue to provide a cohesive solution by providing fine-grained access controls and deep security within our foundational services. So if you choose to integrate directly with those, you can do that, as well as use partners and our solutions such as Amazon DataZone.

I'm going to go a bit deeper into the Glue Data Catalog and Lake Formation, so you start to hear some of the new features that will be exposed up through DataZone as well.

The Glue Data Catalog and Lake Formation help you govern, secure, and share data globally and across your organization for machine learning and analytics. The Glue Data Catalog is your technical catalog for your data lake and data warehouses, and it allows you to build permissions with Lake Formation that include fine-grained access controls on structured and semi-structured datasets.

I don't know if you saw, but last week we launched permissions on nested columns. So if you have semi-structured arrays, you can now put column-level permissions on those semi-structured datasets and access that data via Athena or EMR and so forth.

The next area we're focused on is scaling and optimizing your data lakes and data warehouses. We do that through a few areas, one of which is scaling permissions, which I'm going to go deeper on in a moment, through tag-based access controls. We also have Glue crawlers, which allow you to manage your schemas without writing code: you just point your crawler at the location and we infer your schema. And there is a set of launches where we're going bigger on optimizing your data lake performance.

So, how many people here are using open table formats like Apache Iceberg, or are looking at them? OK, a few. Great. If you're not already, it's something that comes up in pretty much every conversation I'm having here at re:Invent and before: customers want to bring the power of transactional systems to their data lake, and they're using open table formats such as Apache Iceberg.

When you adopt an open table format, you have to optimize it, handling the things that would traditionally have been done inside relational databases, one of which is compaction. So we launched automated compaction for Apache Iceberg a few weeks ago, where you just go in, pick your table, and say: I want AWS to manage compaction for me. We monitor all of the transactions on those tables and kick off compaction when we think it's the right time to optimize your table.

The other thing we launched a few weeks ago was Glue catalog column statistics. With a single API call, or with a click in our console, you can say: I want statistics run for this table. We run the statistics for you, and they work with the cost-based optimizers within Athena and Redshift. We're seeing amazing results on query performance improvement, so please test that out. It could save you cost and also get results back to your users faster.
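As a rough illustration of those two launches (the database, table, role, and column names are placeholders, and the parameter names are recalled from the Glue API, so double-check them against the current SDK), enabling managed Iceberg compaction and kicking off a column statistics run might look like this:

```python
import boto3

glue = boto3.client("glue")

# 1) Ask AWS to manage Iceberg compaction for a table (identifiers are placeholders).
glue.create_table_optimizer(
    CatalogId="111122223333",
    DatabaseName="sales",
    TableName="orders_iceberg",
    Type="compaction",
    TableOptimizerConfiguration={
        "roleArn": "arn:aws:iam::111122223333:role/GlueOptimizerRole",
        "enabled": True,
    },
)

# 2) Kick off a column statistics run so the Athena and Redshift cost-based
#    optimizers can use the stats for better query plans.
glue.start_column_statistics_task_run(
    DatabaseName="sales",
    TableName="orders_iceberg",
    ColumnNameList=["customer_id", "order_date", "order_total"],
    Role="arn:aws:iam::111122223333:role/GlueStatsRole",
)
```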

The next area is connecting with data sharing. Lake Formation is core to our customers' data meshes. Companies such as Gilead, JPMorgan, and Itaú are building data meshes on top of Lake Formation and the Glue Data Catalog to share data internally, protected by their compliance controls.

And the last area is audit and security. All of our APIs are logged to CloudTrail, so you can look in CloudTrail and see all of the audit activity, as well as build your own reporting solution.

So we're going to go a bit deeper into connecting with data sharing. I'm going to talk about a few challenges as you scale your data sharing solution globally. Navneet talked about all those patterns, and for each one of those patterns you need to account for this: you do not want data leakage, inside your organization or outside it. The first area to focus on is preventing data access sprawl.

Data access sprawl occurs when you're granting permissions, policies, or grants on specific tables or databases to individual users and groups. You absolutely want higher-level constructs, using business ontology, to manage those permissions. With DataZone, we launched a new capability around managing subscriptions to projects, so you can now manage access at a business-intent layer within DataZone. Under the hood it's actually granting Lake Formation permissions, but you don't have to worry about that; it's taken care of for you.

So within DataZone, you have your subscription and then you're accessing data, let's say for marketing churn. You're now getting access to the data just for that use case, and think of the power of that: you know exactly who's using what data for what use case, which is a whole different paradigm than traditional role-based access control.

The second area to be concerned about is preventing sharing sprawl. Sharing sprawl occurs when two things happen. One, you could be bottlenecking on a single administrator, saying this person or this role is responsible for granting access globally; think about the time and effort they'd have to put into that. The second thing is not controlling it and giving too many people the ability to provide grants.

Lake Formation helps in a few areas, and DataZone automates this for you. The first thing is that we can delegate permissions to data owners. You pick your data owners, either users or groups, or through Identity Center. We launched Identity Center identity propagation this week, so you can use Identity Center users and groups as well, and you can delegate ownership permissions there.

The second area is role-based or tag-based access control, or the concept that DataZone introduces with project-based subscriptions. With tag-based access controls, you create a business ontology, so you could have cost center or department or classification, and then you're only granting access to data for users that should have access to that data.

We're also seeing companies that have a different ontology per department and want to delegate ownership of those business ontologies to the different departments. Duke Energy is talking about that today: how they're managing different ontologies and making sure that their business rules for data access apply to the right user, based on the role they're in within their corporation.

And the last thing is that you can now delegate the editability of those tags. We launched, this past summer, the ability to choose whether a user or group can be an owner of a tag, can create and edit tags, and can delegate grants to specific users.
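Here's a hedged sketch of that tag-based flow with boto3; the tag keys, values, database, table, and principal ARN are made-up examples. You create the ontology tag, attach it to a table, and grant access by tag expression instead of table by table:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# 1) Define a business-ontology tag (illustrative key and values).
lakeformation.create_lf_tag(TagKey="department", TagValues=["marketing", "finance"])

# 2) Attach the tag to a table so it carries that classification.
lakeformation.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "sales", "Name": "campaign_results"}},
    LFTags=[{"TagKey": "department", "TagValues": ["marketing"]}],
)

# 3) Grant SELECT on everything tagged department=marketing to an analyst role,
#    instead of granting table-by-table to individual users.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/MarketingAnalysts"
    },
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "department", "TagValues": ["marketing"]}],
        }
    },
    Permissions=["SELECT"],
)
```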

So I'm going to keep going a bit deeper here and talk about some specific data sharing use cases. The first one is Redshift. Redshift data sharing allows you to instantly and securely share live data that's transactionally consistent. This past week, they actually announced a preview of multi-data-warehouse writes, so now you can write to data shares as well.

And when you create that data share, you can choose to use Lake Formation as your governing model. This way, you're centralizing your data governance within Lake Formation, you get that central audit capability, and you're no longer managing individual users within Redshift anymore.
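As a hedged sketch of what that looks like from the producer warehouse (the workgroup, database, schema, table, and account identifiers are placeholders, and the exact clause for routing the share through the Data Catalog should be checked against the current Redshift data sharing docs), you create the datashare in SQL and grant it through the catalog rather than to individual users:

```python
import boto3

# Run the datashare DDL on the producer warehouse via the Redshift Data API.
# Workgroup, database, schema, table, and account ID are illustrative placeholders.
redshift_data = boto3.client("redshift-data")

ddl_statements = [
    "CREATE DATASHARE marketing_share;",
    "ALTER DATASHARE marketing_share ADD SCHEMA analytics;",
    "ALTER DATASHARE marketing_share ADD TABLE analytics.customer_churn;",
    # Route the grant through the Glue Data Catalog so Lake Formation,
    # not per-user Redshift grants, governs who can consume the share.
    "GRANT USAGE ON DATASHARE marketing_share TO ACCOUNT '222233334444' VIA DATA CATALOG;",
]

for sql in ddl_statements:
    redshift_data.execute_statement(
        WorkgroupName="producer-wg",  # or ClusterIdentifier=... for provisioned clusters
        Database="dev",
        Sql=sql,
    )
```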

So absolutely start looking at that. The last area is AWS Data Exchange: you can subscribe to data, you can publish data, and so forth, using Redshift as that hub, and Redshift obviously works with S3 as well, so it's further unifying your data lake and data warehouse for analytics.

The next use case I'm hearing a lot of feedback about: this week we launched Glue views. Glue views are multi-dialect SQL views. We haven't done a lot of press about it yet; you should see some blogs come out in the next week.

You'll see some announcements this week. What a Glue view is, is a single view schema. You create that single view schema, and I'll walk you through how to create a view in a moment, and that view schema works across engines. So here you create the first view, you have the view schema working with Athena and Redshift, and you can also have EMR Spark querying that view consistently across engines.

Before this, you'd actually have to create the views in the different engines and secure them consistently in each engine, which requires a lot of manual work or automation. Now, you just have the grant on the view. Customers are also using this as a security feature: with Glue views, you no longer have to grant access to the underlying tables.

Customers are using this for data access layers, a common use case. They want to share data but only provide access to the view, without exposing the underlying details to the user, so the user only gets access to the data they need. Common examples we're seeing are meeting regulatory requirements like the Digital Markets Act and doing joins for GDPR, making sure you honor opt-outs.

Let's say your view joins to an opt-out table, and your opt-out table should not be exposed to the users querying that data. You provide access at the view level; they do not see the underlying tables, and they only see the data they're supposed to use. We're seeing use cases for region isolation as well.

The next thing is you can then share that view across accounts and across regions, further extending your ability to share data and control who can access what data. So please try that out; it went into preview this week, and we're looking forward to feedback. We'll be GA shortly, and EMR and EC2 support should come in the next couple of weeks. So we're looking forward to your feedback, and thank you for that.

Next, I want to show you how to create a view. The first thing is, let's say you start with Athena: very simple, familiar syntax, CREATE VIEW AS SELECT, and there you have your view; you can start querying it in Athena. If you want to add dialects to that view, you just do ALTER VIEW name ADD DIALECT. Very simple. For people who are familiar with creating views, this should be very straightforward.

And the next thing you do is grant SELECT on the view to the user. Very simple, no learning new APIs; these will be DDL commands within all of these engines. So you can use familiar syntax that you know and love, and that data becomes accessible via the tools your users know and love. They don't need to jump between engines because they want to access one view; we're providing data to the users where they are.
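Here's a rough sketch of that flow driven through the Athena and Lake Formation APIs; the database, view, table, and role names are placeholders, and because multi-dialect views were in preview at the time of this talk, the exact DDL keywords may differ from what's shown:

```python
import boto3

athena = boto3.client("athena")
lakeformation = boto3.client("lakeformation")

# 1) Create the view once with the familiar CREATE VIEW ... AS SELECT syntax
#    (database, table, and view names are placeholders).
athena.start_query_execution(
    QueryString="""
        CREATE VIEW analytics.opted_in_customers AS
        SELECT c.customer_id, c.region, c.email
        FROM analytics.customers c
        JOIN analytics.consent o ON o.customer_id = c.customer_id
        WHERE o.opted_out = false
    """,
    QueryExecutionContext={"Database": "analytics"},
    WorkGroup="primary",
)

# 2) Per the talk, you then run "ALTER VIEW <name> ADD DIALECT ..." from another
#    engine (for example Redshift) so the same catalog view resolves there too;
#    the exact preview DDL is not reproduced here because it may have changed.

# 3) Grant on the view itself, not the underlying tables, so consumers never
#    see or need access to the base data (including the opt-out table).
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/ReportingUsers"
    },
    Resource={"Table": {"DatabaseName": "analytics", "Name": "opted_in_customers"}},
    Permissions=["SELECT"],
)
```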

So things like views and Lake Formation data sharing are really picking up in our business-to-business and business-to-consumer data sharing, which Navneet touched on and which Raj will talk about in a moment. The first one is CloudTrail Lake. CloudTrail Lake launched, on Sunday, the ability to share data from CloudTrail Lake back to your own AWS account.

So let's say I'm a user and I want access to my CloudTrail data very seamlessly via Athena. I go into CloudTrail Lake and say: share my data with my data lake. Instantaneously that data shows up in my catalog, I can start querying via Athena, I can provide grants on that data with Lake Formation, and now I can protect that data very seamlessly so it only provides access to the data that you need to see.

Amazon Omics is providing similar capabilities, where they're controlling ownership of the genomic data and only sharing what customers need. Amazon Connect has provided similar capabilities as well. So we're really seeing this uptick in zero ETL data sharing, and that's really because of the power of S3: S3 being a global service with so much scale, we no longer have to move data. You just provide access, you expose it via the catalog, and with so many customers already using the Glue Data Catalog, it's just an easy way to expose data access.

The last one is Salesforce Data Cloud, which I'll hand off to Raj for in a moment. I want to share a personal story about why I'm so excited to have Salesforce on the stage with me today. Ten or twelve years ago, I was running a data lake and data warehousing team. My marketing team said: we want to pull all of our data, we want to get all of our CRM data out, we want to build segmentation, we want to do customer churn attribution, and we also want that data back in Salesforce.

And yes, I had to build complex pipelines, I had to do authentication and authorization, and I had to hire unique skill sets to build objects back into Salesforce as well. This was a complex endeavor for me, and it did not happen as fast as my business wanted. They wanted data immediately: they want to go in, see a customer, and see what segment they're in. They don't want to be pushed to a different BI technology; they want to see it right in Salesforce.

So I want to thank Raj for working on Salesforce Data Cloud, so my past self doesn't have to fight that battle anymore. Now you just share data; it's just amazing. With that, I'm going to hand it off to Raj. Thank you very much. We'll be available for questions after.

Thank you, Jason. All right. So with that, I'm going to talk a little bit about how we have done the Salesforce integration. Forward-looking statement: anything we share involves assumptions and uncertainties, so keep that in mind.

I want to start by saying thank you: thank you for this great partnership with AWS, but also thank you to all of you for being here, very much appreciated. For this portion of the presentation, I want to cover a brief overview of what Salesforce Data Cloud is, talk about the innovation we are building with AWS and Salesforce Data Cloud, and show how we are bringing a data-driven architecture with zero ETL and bidirectional bring-your-own-lake to improve the data-driven experiences your customers expect. And I'll end with a brief demo and a scenario walkthrough.

But before I go into this, a quick show of hands: how many of you have heard of Salesforce Data Cloud or are using Salesforce Data Cloud? OK, a few. And how many of you are using Redshift? A few more? OK, great. That's good to know.

Navneet talked about, and started the presentation with, how data-driven companies see 20% growth in revenue. I want to anchor on that. We know data-driven companies outperform their competition, but what does data driven really mean? At Salesforce, we define data driven as being able to access the right data at the right time, drive insights from it, and make informed decisions with it.

But for that, you need data coming in from various sources to be able to deliver these experiences. Why do we need all of the data? Let's take an example: you want to send a personalized email to a customer. If you send a personalized email, the experience you're delivering to that customer improves significantly and you'll see better engagement from them. The more data you bring in, the more personalization you can offer and the better results you're going to see.

But that's easier said than done. Why? Because data sits in silos. Think about it: I call a service agent, and this is an experience I've gone through with some of the utility providers, and they can't tell I'm the same customer who called last week through a different engagement channel, maybe a different phone number or email, and filed a different case. They can't tell it's the same person. Or your sales agent picks up a call from a customer and can't tell what the customer has been doing on the website or what products they've been looking at. There's a real disconnect.

And so connecting all of your data is a huge problem. We believe there are, on average, over a thousand different applications storing bits and pieces of data, and just tracking where data came from and who's using what is hard. It could be web and mobile data, data coming from your point of sale, ads, or even an external lake or warehouse. It's complicated.

Now, ETL: we've talked about it in both Navneet's and Jason's presentations, and it is a big problem as well. On top of what Jason shared, a personal story: I was talking to an investment and personal wealth management company not too long ago, and they needed to bring about 700 million records from CRM into an external lake. They have a complex pipeline they've built, and they need to run that pipeline and do a full copy of all of the data from source to destination, which takes about a day, sometimes more. And that is complex. They spend over a million dollars just doing that, including all of the resources and not including the license cost for the tools and whatever else they need, and it still brings brittleness.

Not to mention that because it takes that long to copy all of the data from source to destination, where you can bring all of the data together, it adds to the problem of data freshness. You can only do that once every two weeks, maybe even less often; some people just don't touch the data because it's so complex. Now imagine you have hundreds and hundreds of Salesforce objects that you want to bring in and share, or you want to bring data from Redshift into Data Cloud. How do you do this? It gets more complex. It's a nightmare for all of the data engineers who have to work with the different data, so they choose to leave the data sitting in different silos, because it's a complex problem that nobody wants to tackle. And that's a really big problem.

On top of all of that, it's also insecure, because you're now moving data all over and you need governance on top of that, and all the patterns we touched on. How many of you have ETL pipelines? Can I get a show of hands? That's a lot of you. You can imagine all the problems I'm talking about.

So we'll continue down this line. This is where Salesforce Data Cloud, a hyperscale data platform, really shines. I want to take a brief moment to talk about Salesforce Data Cloud and then we'll go into zero ETL data sharing. Data Cloud is how we bring all of the data from different sources easily into the platform, federate and harmonize that data, and give you a single source of truth, a single view of the customer, so that you can take actions and send customers on different activation and marketing journeys.

You can do insights, you can do your AI and BI analysis on top of that, and you can do zero ETL sharing, whether that is bringing data into Data Cloud or sharing out to Redshift or any of your lakes.

Salesforce Data Cloud is built on top of Hyperforce, which is our public cloud infrastructure. What that means is it provides enterprise scale. As you saw on the previous slide, we are able to easily handle over two trillion records a month; all of that processing happens automatically, and we provide real-time automation.

Most importantly, it powers the Customer 360 experiences across the clouds using the data we have in a single place. So let's take a look at how Data Cloud works. On the left you'll see all the connectors. We have the CRM apps: your Sales Cloud, Service Cloud, Commerce Cloud. All of those have prebuilt connectors which are easy to connect and use. And you have your hyperscalers: your S3 buckets, your Google Cloud and Azure storage.

We can also bring data in easily from your web and mobile data, from APIs that integrate with MuleSoft to your external systems, or from any of your external lakes, including Redshift.

Once you bring all of the data in, you want to be able to easily prep and transform it. We provide built-in data recipes you can use, or you can use streaming or batch transforms to enrich the data you have just brought into the system.

Now you've brought the data in, but it's still a lot of data, and you need to be able to see it by organization, by brand, by department. That's where Data Spaces is super useful: you can look at just the data you want to see, not all of the data, and you can grant users specific permissions on what they should and should not be able to see.

Now you have access to the data, the specific data you care about, and you can harmonize all of it by building canonical models: you build a canonical model and you shape the data. But you still have one problem left: identities.

Imagine your marketing system calls your user a subscriber, and your sales and service system calls your user a contact. Are they the same person? Raj may have come in from a different email ID and a different phone number, and now you have multiple records. That's where our identity resolution helps you stitch all of these identities together, so you have a unified, single view of the customer.

Now you have a single view of the customer: you have unified all of the identities and you have all of the data. We then easily create data graphs, which are basically an application view of the data for the unified individual and all of the related engagements the user has gone through, all of that at your hands.

Now you can actually drive actions. With all of this at your fingertips, what do you want to do? You can do BI analysis, you can build AI prediction models, you can create segments and send them on marketing journeys, or you can do any amount of model building, including what we do with SageMaker.

Data Cloud is open and extensible. We have an open architecture, and we have done all the hard work of the integration behind the scenes: all the cool things Jason talked about with open data access, the integrations with the analytics products we have with AWS, the integrations with our first-party advertising, as well as bring-your-own-AI with Amazon SageMaker.

And AppExchange is a great platform for us, where we have over 9,000 applications that have been built and all the exchange partners available to you.

Right now, together, we're building a tighter integration, moving away from the traditional model and into a zero ETL architecture. So we're doing four things. We're giving you a secure way to share the data without doing any copies; there's no movement of the data. It's immediately available to you and it's fresh, so you can now deliver real-time experiences to your customers.

Imagine you have data coming in and you want to send a coupon immediately to a customer at the point where there was a change or an interaction. You can do that now in real time, because data is available to you instantly, and you can trust the data because there's no brittleness.

You don't have to keep multiple copies of the data, you don't have to deal with errors in broken pipelines; you can trust the data. And last but not least, it's cost efficient, because you're not moving the data and you don't have to write ETL pipelines. You can use this data sharing to take full advantage of our model. It's bidirectional: with this unified architecture, you can bring in data from your first-party or third-party sources, point-of-sale data, HR data, web and mobile engagements, or take advantage of any of our prebuilt integrations to bring data into Data Cloud, enrich it, and share it back out for your analysis through any of the integrations you have externally. That lights up all the personas.

I talked about four personas, and I think they fit in right here. On your left side you have the data engineer persona, who now needs to be able to easily handle all of the data to meet the business users' requirements. Imagine you are a marketer: you need to send customers on a marketing journey, and now you have data that you can actually get from all of the sources, unified in one place.

OK. Now I'm going to go a little bit deeper and talk about data federation and data sharing, and how we built this integration, with an example. With data federation, what you see on the right side of the screen is very simple, and I'll show this a little more when we get to the demo: it's all point and click, you have to write zero code. Point, click, that easy.

You select the Redshift data stream: you create the data stream, and you instantly have all of the data that you want to bring in from Redshift into Data Cloud. That's data federation. Once you've brought the data in, you can do all sorts of interesting things with it.

And the cool thing is we now have a data federation with Redshift pilot coming very soon. So if you're interested in this, please come talk to us and follow along; we'll be going GA on this very shortly.

A quick example of how this works. On the left hand side you have your Salesforce Data Cloud: you have your unified individuals, remember, the ones you have harmonized and run identity resolution on, and your engagement data. Now you want to bring your billing history in from Redshift: you create a data stream, point and click, and you can see live data. It's that simple. That's JDBC access. When you bring the data in, it's available in Data Cloud as a data lake object, which is basically your external entity.

At that point, you can do anything you want with the data, just like you would with any other data that's already in Data Cloud. Pretty soon we'll also be making this available at the file sharing level, which will be really cool: we will have file access for this data federation capability. Now, let's look at data sharing.

Very similar again, very easy to use, no code, just point and click. In this scenario, you can take any of the data lake objects or the harmonized model objects and easily share them with, say, Redshift. All you have to do is create a share and link it to your share target, point and click, and you have the data available in your Redshift.

Now you can run your queries in Redshift and you have to do nothing else. And as Jason was talking about Glue views, we're going to be providing the Glue views integration fairly soon, and that's going to let us light up everything we're doing with sharing across all the different engines. Here is a quick example.

Again, I'm going to breeze through this. Same thing: Salesforce Data Cloud. Now imagine you want to bring that unified individual engagement data from Data Cloud into Redshift. You do the data sharing with point and click, and once the data is available on the Redshift side, you treat it just like any other Redshift table.

You write a SQL query, you run the query, you operate normally, and you can do your analysis and train your models.

OK, going a little bit deeper on what the architecture looks like: how do we do this? You may be thinking, OK, no copies, no movement of data, how does this really work? On the left side you'll see Salesforce Data Cloud. Everything in Data Cloud is essentially an Iceberg table in Parquet format, sitting in S3 buckets.

What we have done is simply created a custodian Redshift account inside our Data Cloud for this data sharing. That allows us, selectively, remember I talked about being selective, to share only the data you care about: we basically create metadata for the data you want to share and have it seamlessly created within our custodian account.

That enables us to create views, and those views are what you're able to see on the Redshift side. The ability to easily create that metadata and have the views be visible on the Redshift side eliminates the need to move data: all of our data still sits in our S3 buckets, the Redshift data sits wherever it is, and you can still see all the data you want to see in a very secure, easy, and quick way.

And once we have Glue views, we will be able to light up all the other engines in addition to Redshift: Athena as well as EMR Spark.

OK. So I'm going to jump quickly into the demo and walk you through how easy it is to do all of this, both data federation and what we can do with that, and data sharing.

I have three parts to this demo, and it will not take too long. Part one is the federation piece, where you take the customer's Redshift account and whatever tables you want to federate from Redshift into Salesforce, and selectively bring them into Salesforce Data Cloud. I'll show how that works.

Number two: once you have brought the data into Data Cloud, how easy it is for you to create any of the prediction models, create model scores, and do interesting things with the data you have created.

And then lastly, I'll show how you can share data easily back into Redshift, anything that you want to make available in Redshift.

So let's focus on the first part. Imagine Bob is the Data Cloud architect. The Data Cloud architect is a persona; it could be the same as the data owner persona, think of it as the same thing. In this case, Bob is responsible for creating the streams and making sure the data is available.

I want to bring some selective data from Redshift into Data Cloud. I want to focus on a few things: once you see these columns and rows, you can see all the different types of connections we have, the connector types, and you have the stream type, where you can see federated and ingested. All of the data that we do with sharing is essentially federated data; that's how you're able to bring data in.

So now I have a simple screen, point and click. I want to create a new data stream to bring in the Redshift data. You can see all these prebuilt connectors that I talked about earlier, and you have your federated sources; we have integrations with many different external lakes, and you can see Redshift is available.

So, as Bob, I select that. Now I'm able to create a new data stream. I go in there, and all I have to do is provide the connection information and the database information. I can see that I have a few tables I can select, and I choose to bring in the web engagement scores, product catalog, and purchase history, because I want to create a propensity score.

So I'm bringing this data into Data Cloud, and I select these tables. The beauty of this is that for the tables I selected, I can actually see what fields are available to bring in, so I don't have to bring in hundreds of columns and every row that exists.

I can add filters to selectively bring in what I want. So I choose only customer ID and engagement; I don't need to bring in everything else. I select that and click next. There you go: you have essentially everything you need.

You've created the filters you want. And the beauty of it is that you can also assign this to a specific data space. I've chosen default as the data space, but you can create a space that's specific to a particular user who shouldn't see everything, maybe brand specific or region specific. You can do that.

Once you have done that, those three tables now have a stream established to bring the Redshift data into Data Cloud seamlessly. You can see that: one, two, three, I've got my three tables federated into Data Cloud. It shows the stream type as federated, everything works, and I can see how much data I brought in. All you have to do is point and click; you didn't write any code at all.

Now that I've brought my data into Data Cloud, I can do a canonical mapping of that data. I see that I have brought in a product purchase history, but it's in the Redshift format. I want to easily map it: I see that I have a sales order table, and I just map it. It's simply drag and drop, click, and I can say, OK, this is the ID type, or this is the list price, and so I can create the canonical model.

I have everything ready to go; I've created my model. A data graph is an easy way for me to hang all of these contacts, all of the things the individual has touched, off a unified individual perspective. It gives me a simple, easy way to visualize what an individual has been doing, and this is important because I want to take it to the next level and create a propensity score.

So, Kyle is the data scientist. Bob has finished sharing the data and bringing it in, and he's now handing it to the data scientist. The data scientist goes in and can easily take the data that was shared, the DMOs, the data models that were just prepared, and bring them into Data Wrangler. Now I can take that whole thing and run the model within SageMaker; I want to create a purchase propensity out of it.

Once I've done that, it's as simple as running and training the model. Kyle works on it, preparing the propensity models and creating the scores for each individual. Once the scores are created, the model is created, and the inference is available, I can easily take that and bring it back into Einstein Copilot.

Now we are back in Data Cloud. I have my propensity model available to me, and I can do something interesting with it. I have my propensity score that has been created, and I have my data graph for each individual. Now, for that individual's data graph, in addition to the data I brought from Redshift, I can also see account, email engagement, and case objects that were already in Data Cloud.

So now it's a beautiful view where you can see how I can combine the data that already exists in Data Cloud with the data I brought in from Redshift. I was able to create a propensity score out of that, and I can use that combined data to create segments.

So let's see how easy that is. I go to segments; the tabs at the top of the Data Cloud platform let you just click to go to any of these. You click on segments. Now I want to create a new segment for all the customers who have a high propensity to purchase, based on the purchase propensity scores that were just created.

Let's see how simple that is. I go to the segment creation tab and assign it to a data space, going through the same motions. I see that there is a total of 132,000 customers. I don't want all of them; I want to apply the propensity score on top of that. I choose the product catalog to be jackets, because I want to target the customers who are interested in jackets. I add the propensity score and say anything equal to or greater than 65. Voila: I now have a segment population of 82,000 that I want to target. A segment was created.

Now that your segment is created, you create an activation journey and send them on it; maybe in this case it's a campaign. You create that campaign, and all of those 82,000 now go on a journey and get a campaign, which could be pushed to whatever channel you want. It's as simple as that.

So imagine you've done all of that and it's working: you've brought your data in, created segments, and sent them on an activation journey. And now you choose to share some data from Data Cloud back into Redshift. Let's see how simple that is.

I click New; I want to create a new data share. I see that some of the integrations are already available, and I choose Redshift because I want to share some of the data to Redshift. First, I want to create a data share target; this is the link from your Data Cloud into Redshift. You give it a name, you provide the S3 tenant folder, the IDs, and the token, and your link is created. Now you have your data share target.

Next, I want to specify what data I want to share to Redshift. Now that I have established a connection from Data Cloud to Redshift, I go here and specify: I've created some segments, and I want to share them with Redshift so the team can do analysis there.

I come here and pick and choose the data I want to share: I pick case individual and unified individual. All of that, I want to share out to Redshift. That's it. You click create and you click save, and your share is created.

All of the objects you selected are now shared. Remember how easy it is to bring all of the data: it's just metadata, and the views are now available. You can basically go into Redshift and see all of that.

There's one simple step left. You created the share, you created the target; all you're doing now is taking the share and linking it to the target, saying I want to share this new share object, with these four tables I've created, and link it into Redshift. That's it. Once you have done that, the data is available.

So Maya, the data analyst, comes in on the Redshift side. She can go look at those tables in the Redshift query editor. On the left side, you can see that she wants to create a churn model output based on that data. What's nice is she can see the case individual, the unified individual, all of the data that was just shared, immediately available to her.
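As a small illustration of Maya's step (every workgroup, schema, table, and column name here is hypothetical, since the demo's actual object names aren't shown), the shared Data Cloud objects can be queried like any other Redshift tables:

```python
import boto3

# Query the shared Data Cloud objects from Redshift like ordinary tables.
# Workgroup, database, schema, table, and column names are hypothetical placeholders.
redshift_data = boto3.client("redshift-data")

redshift_data.execute_statement(
    WorkgroupName="analytics-wg",
    Database="dev",
    Sql="""
        SELECT ui.unified_individual_id,
               COUNT(ci.case_id)           AS open_cases,
               MAX(ui.purchase_propensity) AS propensity_score
        FROM datacloud_share.unified_individual ui
        LEFT JOIN datacloud_share.case_individual ci
               ON ci.unified_individual_id = ui.unified_individual_id
        GROUP BY ui.unified_individual_id
        ORDER BY open_cases DESC
        LIMIT 100;
    """,
)
```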

So she can run her prediction and analysis for the churn model. It's as simple as writing a query, and you're done. OK, so with that, let me pause and bring Jason and Navneet back, and we'll do a quick wrap-up. I hope this was helpful, and we'll be here for any questions. Jason, do you want to bring it home?

Sure. Again, thank you very much. We'll hang out for a couple of minutes to answer any questions. A couple of things to be aware of: obviously, Salesforce has their content hub to learn more. G2, our VP of Analytics, has an innovation talk where he's going to cover a lot of the things we talked about, but in more detail and at larger scale, and he has a few customers joining him as well, so please check that out. A few other things you could take action on: we have a data governance workshop that was announced a couple of weeks ago, a self-service workshop where you can try out all these things, including DataZone, and Lake Formation has a workshop as well if you want to test these out.

I think the key thing for me is that you're seeing a SaaS provider such as Salesforce have thousands of customers securely sharing data back and forth, and we have lots of customers trying to do just this internally. It's a hard problem, and if they're able to trust the technology to do business-to-business data sharing, then it can absolutely work for your internal use cases.

So feel free to reach out to your solutions architects if you have questions about what's happening under the hood; we can share the details with you. If you have questions on the security, it's a very sensitive topic, so we're here to help you. And thank you very much. Do you have anything else?

Alright. Thank you.
