Multi-data warehouse writes through Amazon Redshift data sharing

Welcome to this talk. I am Sudipto Das, a senior principal engineer working on Amazon Redshift, and I'll be presenting a brand new launch that we did earlier this week called multi-data warehouse writes through Amazon Redshift data sharing.

This is the next evolution in our journey of multi-cluster architectures for Redshift, and it helps you scale your workloads even further.

Data is your differentiator in today's world. A statistic from IDC says that more than 90% of today's data was created in the last two years. A study from Forrester says that you are 8.5x more likely to have at least 20% revenue growth in your business if you are able to use this data effectively. But a similar study from Accenture says that only 32% of organizations are able to effectively leverage this data, and we are on this journey to make it easier for you to get the value out of that data.

The competitive advantage for your business from all of this data comes from many different facets: for example, personalizing the experience for your customers when they are using your applications, inventory optimization or supply chain optimization, optimizing how your applications are rendered, how you run your marketing campaigns, and so on.

Amazon Redshift sits at the core of your data journey in order to get you the insights out of your data. It is a fully managed, AI-powered, scalable cloud data warehouse that allows you to run your analytics on a broad class of data, whether it's transactional data, clickstream data, IoT telemetry, application logs, and so on. It supports a rich class of features to get you the insights that your applications need, whether it's rich SQL analytics, your machine learning workloads, Spark workloads, and many more. It provides best-in-class security, governance, and compliance, a familiar SQL surface area, and ease of use via serverless, where the infrastructure gets out of the way and you can focus on your analytics. Redshift is a leader in terms of price performance.

This is a chart that shows how Redshift compares with some popular cloud data warehouse alternatives, using an industry-standard benchmark derived from TPC-DS at 3 TB. In this figure, lower means your workload runs faster per unit of price that you pay for that performance. As you can see, Redshift delivers up to 6x better out-of-the-box price performance compared to some of the cloud data warehouse alternatives. And as you may have seen throughout the launches at this re:Invent, as you run the workload this gets even better, using the AI-powered smarts that are built into the data warehouse.

The previous chart was about single-query performance, where you run one query at a time; that's how the benchmark usually runs. But real workloads are a lot more concurrent, and we have use cases around dashboards that run many, many concurrent queries, where the story gets even better: you can get up to 7x better price performance for the short-running concurrent workloads that we often see with BI dashboards.

So as concurrency increases, Redshift becomes even better in terms of price performance. Tens of thousands of customers across a broad class of industry segments and verticals have been using Redshift successfully in production, processing exabytes of data per day. These customers include many common names that you see on this slide, across industry segments such as media and entertainment, financial services, healthcare, consumer services, web and technology, and many more. To give you a high-level overview of how Redshift is able to give you these rich insights, the way we explain it to our customers is that Redshift enables analytics on all of your data and for all of your users.

On the left-hand side of this figure are the various places data gets generated that you want to derive your insights from. If your application is generating data that lands as files in S3, Redshift's auto-copy feature allows you to automatically ingest the data from these files into Redshift. If you are using streams such as Kinesis or Kafka, the streaming ingestion that Redshift supports allows you to get the data in near real time into your Redshift data warehouse so that you can power your analytics from there.

Last year, we embarked on a zero-ETL journey, and we have been continuously investing in a variety of zero-ETL sources. We started with Aurora MySQL last year, which went GA earlier this year, and at this re:Invent we announced three new zero-ETL sources for Redshift: Aurora PostgreSQL, RDS for MySQL, and DynamoDB. So again, depending on where your data starts, you can very easily get it into Redshift, because we want our customers to be able to get these insights out of their data very quickly.

In cases where you do not want to move the data into Redshift, Redshift supports federated queries with Aurora, for example, where you can query the Aurora database in situ. Or, if you want to combine your data with other marketplace data, Redshift also supports integration with AWS Data Exchange, where you can combine your business and operational data with third-party marketplace sources.

Redshift has its own managed storage, which provides highly durable and scalable storage for petabytes of data. In addition, Redshift also works very easily with data in open formats via querying of the open data lake. Redshift has had support for Parquet and many other file formats, and at this re:Invent we launched Apache Iceberg support in GA. On the right-hand side are the broad class of tools that you can use with Redshift to get the insights that you care about.

Not only can you use the rich SQL in Redshift, you can connect a broad set of BI tools, including AWS services as well as third-party partner services. Redshift has support for the Data API so that your serverless application deployment, or your application deployment in general, becomes a lot easier: you don't have to manage JDBC or ODBC connections, and you can access Redshift via a RESTful API for ease of development. Redshift supports a broad class of integrations with various machine learning services within AWS, such as SageMaker, Forecast, and so on, so that you can cater to your applications' ML use cases as needed. And last but not least, there are also other analytics services that Redshift integrates with.

Last year, we launched the Spark integration with Redshift, where AWS services such as EMR, Athena, or Glue that are running Spark workloads can access the data in Redshift. All of the data that we are bringing in from these various sources can be accessed easily within Redshift without having to move the data out, so your Spark applications get the live transactional data in near real time. And earlier this year, we also open-sourced that connector so that your own self-managed open-source Spark deployments can benefit from it as well.

Now, as you can see, the more data you get in and the more use cases that come onto Redshift, a question that often comes to us is: how do we scale this? Some of you may remember that when Redshift started its journey about 10 years back, it started from its humble origins as a cluster; that's how the world looked 10 years ago. In that case, if your cluster becomes a scaling bottleneck, you have to create multiple clusters, you have to move the data, you have to copy the data, and there is a lot more complexity that comes with it. Observing this trend, and the asks from customers, about two to three years back we launched a new capability in Redshift called Redshift data sharing.

Not only can you use this for the cross-organizational sharing that many other industry offerings support; it is also Redshift's evolution to enable multi-cluster architectures, where it allows you to securely and easily share live transactional data without moving any data, and provides cross-data-warehouse access to the same shared data.

The way it works is that you choose which objects in your source cluster you want to share; in Redshift terminology, that cluster is called the producer. You decide who gets access to it. As part of this data sharing construct, Redshift introduced a construct called namespaces: think of a namespace as a container for data for which you can define certain privileges and accesses. So you create these namespaces, and you can grant access to another namespace; this is how cross-cluster or cross-warehouse access works within the same AWS account. Or you can share across accounts, for scenarios where, within an organization, you may be using multiple AWS accounts to scale your analytics use cases.
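
To make this concrete, here is a minimal sketch of that producer/consumer setup in SQL. The data share, schema, table, and database names, the namespace GUID, and the account ID are illustrative placeholders, not values from the talk.

```sql
-- On the producer: create a data share and add the objects to be shared.
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD TABLE public.orders;

-- Grant access to a consumer namespace in the same account ...
GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-guid>';
-- ... or to another AWS account for cross-account sharing.
GRANT USAGE ON DATASHARE sales_share TO ACCOUNT '<consumer-account-id>';

-- On the consumer: map the shared data into a local database.
CREATE DATABASE sales_db FROM DATASHARE sales_share
  OF NAMESPACE '<producer-namespace-guid>';
```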

As soon as you set this connection up, you can query the same live data across these different warehouses, and that gives you a very easy way to scale your read workloads. This functionality is supported across both Redshift provisioned clusters and serverless workgroups: you can have a provisioned producer and a serverless consumer and vice versa, or provisioned and serverless as both producers and consumers, and you can have multiple consumers linked to the same producer. Since launch, it has been a widely popular architecture pattern that customers have adopted to scale, and as of today customers run more than 10 million queries per day using Redshift's data sharing architecture to scale their workloads.

As I said, behind this is access to the live data. So how does it work? This slide gives you a very high-level overview of how read data sharing works. There is the producer cluster. As I mentioned, Redshift has a highly scalable storage layer called Redshift managed storage, which provides durable storage for the data, decoupled from the Redshift compute. The producer cluster stores the data in Redshift managed storage, and when a query comes in on the consumer cluster, the consumer interacts with the producer cluster to get the relevant metadata and then directly accesses the live data from Redshift managed storage as transactions commit on the producer.

So these read workloads are isolated from the producer, because all accesses happen via the storage layer, decoupled from the compute. And not only does this work cross-cluster or cross-account, it also works across regions. You can set up cross-region data shares and have regional access; in this case, what you can see is that the set of clusters or serverless workgroups on the right can be in a completely different region, on the other side of the world.

Given the adoption of this architecture and how customers have successfully used it, we have been making continuous investments to improve the performance of read data sharing over the course of this year. For example, we have made several improvements that significantly improve the performance of a query on its first run on the consumer cluster; how quickly the consumer catches up when updates are happening on the producer; and, once the consumer is warmed up, how many more concurrent queries it can run while still getting the benefits of data sharing. This also allows you to appropriately size your workloads, so that you can create serverless workgroups or provisioned consumers of the size you need, whether larger or smaller than the producer. And there have been many enhancements over this past year that allow for better performance when the consumer is much smaller than the producer cluster.

This slide is an example of how customers have used this successfully. Many of you may know Peloton as one of the successful exercise equipment providers. This is an architecture that Peloton uses, with Redshift's multi-cluster architecture at its core. They get data in via various ETL tools; the data lands in the Redshift cluster in the middle, which is the producer that curates all the data, and then they create additional consumers for specific workloads. For example, ad hoc workloads have a separate endpoint, Looker dashboards have a separate cluster, and Tableau dashboards are powered by another cluster. These workloads stay independent and do not interfere with each other. Not only does this result in better scaling and isolation; Peloton also saved about $300,000 annually by adopting this mesh architecture, or hub-and-spoke model, and appropriately sizing their clusters for their workloads.

So you got an overview of how Redshift started on this multi-cluster journey and how Redshift can provide value out of your data. One of the things we hear from our customers is how to scale the write workloads: the write workloads are still on that one warehouse, which is the producer. We will talk about some of the challenges and patterns that we are seeing from customers, which led us to come up with this new offering that supports multi-data warehouse writes via Redshift data sharing, so that not only can you scale your read workloads, but you can also scale your writes, ETL workloads, and many more.

We'll cover how all of this works, present a demo of this new feature, and finally conclude this talk by opening up for questions and answers. To talk you through the next segment, I'll invite my co-presenter to come up and walk you through it. Thank you.

Hey, everyone. My name is Ryan. I am a product manager at Redshift, and I lead the product for Redshift data sharing. I want to talk to you a little bit about some of the issues that we were hearing specifically with read scaling.

So as Sudipto has explained, with read scaling we have a very robust architecture for making it easy to share read access to your data, so that you don't have to constantly re-ETL all the data between different clusters. But sometimes you need to scale write compute as well. One of the scenarios that comes up often is that customers set up their cluster or their serverless workgroup for their initial needs.

So here, for instance, you might have a job that's running three times per day. You insert about 15 million rows and you update 2 million rows in a couple of different tables. You have a 64 RPU workgroup and it's great: your job finishes predictably, and your workload performance is worth the price that you're paying. However, you need to evolve, and you might get into a situation where you have another job that's very different from your existing job.

So maybe you have a web interactions job, and for this job you're not ingesting a few million rows three times a day; you might be ingesting hundreds of millions of rows, 15 times per hour. This introduces some issues. The jobs are competing with each other for compute, so they're no longer finishing predictably; they're no longer finishing on time. You say, "Hey, I'm willing to pay for better performance," and you might increase the size of your workgroup. At that point you say, okay, I need a much larger workgroup for this to work. But then you get into the issue where it's a little bit oversized for the initial ETL job. You don't need 512 RPUs, or Redshift Processing Units, for that; you could have done it with a much smaller workgroup. And as it gets more complicated than this, you might find you don't even really know which workloads are driving up the cost, or why it is becoming more expensive.

And then there's another scenario we've been seeing, where customers have issues with multi-team architectures. For instance, many organizations have a customer 360. A customer 360 is where you centralize all of the data about your customers, so you might see their sales history and their marketing history, and you use this to better understand what is happening with your customers. In this case, you have a central database account where you're storing this customer 360 data, and you have a sales account and a marketing account, each of which needs to contribute data to it. You might say, hey, let's go and do this via data shares.

We'll go and we'll query the updates. Well, that is convenient because it's via SQL, but it's not necessarily easy, because you have to develop change data capture mechanisms to see what actually needs to be ingested. You also have to somehow trigger the ingestion to happen live, because you'll update the data but the producer won't necessarily know, "Hey, this needs to be ingested." So you'll have to build a multi-service ETL pipeline, perhaps using Amazon SNS to send notifications that trigger a Lambda function that triggers the ETL.
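
As a rough illustration of the hand-rolled change data capture such a pipeline ends up running (all object names here are hypothetical), the central warehouse has to work out for itself which shared rows are new before ingesting them:

```sql
-- Central account: pull only the rows from the shared sales data that
-- have not been ingested yet. This anti-join is the "CDC" logic teams
-- end up writing and scheduling themselves under the read-only model.
INSERT INTO public.sales_history
SELECT s.*
FROM sales_share_db.public.sales AS s        -- data share database (read-only)
LEFT JOIN public.sales_history AS h
  ON h.sale_id = s.sale_id
WHERE h.sale_id IS NULL;
```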

And not only do you have to have engineers spend time building this, they also have to continually monitor it. The more steps you add to the pipeline and the more services you integrate with, the more opportunities there are for something to go wrong over time. And that's why we're introducing multi-data warehouse writes through data sharing.

So now, flexibly scaling ETL is easy: you can use different warehouse types and sizes based on your workload's price performance needs. If you need a large cluster and a small serverless workgroup, or any combination like that, it's now possible for your writes as well.

Additionally, this helps you keep your workloads predictable. You can now separate your compute out to different workgroups or clusters if you want to, and when you do, the compute will no longer interfere. You'll also be able to separate out the pricing, so you'll know exactly what's driving up your cost.

And finally, it allows you to collaborate on live data, which is easy because multiple teams can write to the same Redshift databases with just a few clicks. And just as cross-account and cross-region sharing works for reads, it's the same here: you can also start writing across regions and across accounts.

So let me explain how this works. It's pretty secure and easy to do this as a producer. Just like with reads today, you'll grant privileges on your objects to the data share. So you might say, "Hey, this schema they get CREATE and USAGE on; I want them to be able to create objects in the schema." And you might say, "Okay, this table they need to be able to write to, so I'll allow them to do that."

And then you'll also say, "Oh, but they only need SELECT on this table; I don't need them modifying it." Very possible. Just like with read data sharing, you get to choose who gets access to it: this namespace gets access, this account gets access. The same process still exists for this.
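
Assuming the ETL scenario described above, the producer-side grants might look roughly like the following sketch. The schema, table, and data share names are illustrative, and the exact GRANT ... TO DATASHARE grammar should be verified against the Redshift documentation.

```sql
-- Producer side: share a schema with CREATE and USAGE so the consumer can
-- create objects in it, allow writes on one table, and reads only on another.
CREATE DATASHARE etl_share;

GRANT USAGE, CREATE ON SCHEMA staging TO DATASHARE etl_share;
GRANT SELECT, INSERT, UPDATE, DELETE, TRUNCATE
  ON TABLE staging.customer TO DATASHARE etl_share;

GRANT USAGE ON SCHEMA prod TO DATASHARE etl_share;
GRANT SELECT ON TABLE prod.orders TO DATASHARE etl_share;

-- Choose who gets access, same as with read data sharing.
GRANT USAGE ON DATASHARE etl_share TO NAMESPACE '<consumer-namespace-guid>';
```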

For cross-account, though, we are introducing a new way to control access. Today, for cross-account, you can't do this fully in SQL, because we want organizations to be able to control where their data goes; if you're sharing data across accounts, you need an account admin to say, "Hey, this account can read it."

We're introducing a new control where account admins can now also authorize a data share for writes. What this does is say, "Hey, if there are write privileges in the data share, they can be passed through." If you do not explicitly authorize for writes, then the consumer will not be able to write to the data share; they will only have the read privileges passed through.

So it's like a veto function. It does not grant anything additional, but it allows the write privileges that are already there to be passed through.

And for the consumers, it works the same way. Today, when you get access to a data share from another account (so for cross-account only), you can specify that the entire account gets access, or a specific region gets access, or a specific namespace gets access.

Now you can also do that with writes. If the producer allows writes and the consumer also allows writes, then they will be passed through. And to make sure this is safe and backwards compatible, we've set it up so that if you specify nothing, it defaults to today's behavior, where only reads are passed through.

At this point, you create a database from the data share. And this is actually something we've added now, because we've had a lot of feedback that people want to be able to grant granular privileges on the data share objects themselves. They want to be able to say you get access to some objects in the data share, not all of them.

Now you'll be able to do that. When you specify that you're creating a data share database WITH PERMISSIONS, you are allowing yourself to grant individual privileges on individual objects to users and roles.

So you can say that the data engineer role, for instance, might get SELECT, INSERT, and UPDATE, but the BI user role only gets SELECT. And as you can see, this is how it works: it basically functions similarly to local, where you grant privileges on the specific objects.
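
A sketch of that consumer-side flow, assuming a hypothetical shared database named marketing_ds created from the etl_share data share above, and roles named data_engineer and bi_user. The exact grammar for granting on objects inside a shared database may differ slightly from what is shown; treat this as illustrative rather than definitive.

```sql
-- Consumer side: create the database from the data share WITH PERMISSIONS,
-- then grant individual privileges on the shared objects to local roles.
CREATE DATABASE marketing_ds
  FROM DATASHARE etl_share OF NAMESPACE '<producer-namespace-guid>'
  WITH PERMISSIONS;

CREATE ROLE data_engineer;
CREATE ROLE bi_user;

GRANT USAGE ON DATABASE marketing_ds TO ROLE data_engineer;
GRANT USAGE ON DATABASE marketing_ds TO ROLE bi_user;

-- Data engineers can read and write the staging table; BI users can only read it.
GRANT SELECT, INSERT, UPDATE ON TABLE marketing_ds.staging.customer TO ROLE data_engineer;
GRANT SELECT ON TABLE marketing_ds.staging.customer TO ROLE bi_user;
```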

And once you do that, you get access. We're also offering some new ease-of-use options for consumers. One of them is the USE SQL command, where you can say, "Hey, I'm going to use this database," and this allows you to do two things.

One, you no longer have to use three-part notation specifying the database. You can simply say "I'm using this database" and then write with two-part notation, specifying the schema and the table just like any other query you run.

The second thing this allows is multi-statement transactions, where you can issue a BEGIN statement, do multiple writes, and then COMMIT them; none of the writes become visible until you end the transaction, at which point they all commit together.
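
A minimal sketch of those two conveniences together, using the same hypothetical shared database and table as above (column names and values are illustrative):

```sql
-- Switch context to the shared database; two-part names work from here on.
USE marketing_ds;

-- Multi-statement transaction: nothing is visible to other sessions
-- until the COMMIT at the end.
BEGIN;
  INSERT INTO staging.customer (customer_id, customer_name, region)
  VALUES (1001, 'Example GmbH', 'EMEA');

  UPDATE staging.customer
  SET region = 'EMEA'
  WHERE customer_id = 1001;
COMMIT;
```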

The other option we're adding is the ability to connect directly to a data share database. That was another piece of feedback we got: customers wanted to be able to connect directly to it via their JDBC, ODBC, or Python drivers, and that will be available as well.

And when you do this, you will similarly be able to run multi-statement transactions and use two-part notation. What this allows us to do is work with partners, because we know that a lot of people who want to use ETL tools don't want to write their own scripts every time.

So we're working with partners to make it so that you'll be able to connect directly to the data shared databases and connect your ETL tools. So they feel very much like local.

So just to recap the new user surface: producers can now grant granular privileges on their objects to data shares. You can say SELECT on this table, SELECT and INSERT on another, CREATE on this schema, just USAGE on another. Now that's possible.

Additionally, the consumers on the other side can now grant granular permissions to users and roles. You can say this user gets this privilege and another one gets a different privilege, just like local.

We're introducing the USE semantics as well, so you can USE another database, run multi-statement transactions in that database, and use two-part notation.

And we are also launching the ability to go and directly connect to data shared databases via JDBC, ODBC and Python drivers.

So let's quickly go over the use cases here. As we said earlier, you can easily and flexibly scale compute for ETL workloads. You have a 64 RPU workgroup for the smaller workload that's running three times per day, so it will start up, run quickly, then turn off. And for the job that's running all the time, you can say, "Hey, this needs a 32-node cluster," and it can run there.

So you can use different workgroups, different warehouses, different clusters based on the price performance needs, and you can change this with just a few clicks.

It also allows you to keep things predictable, because you can split up your ETL jobs to make sure they're not competing with each other, the compute is not contended, and you don't have jobs delaying each other.

If you have separate jobs that are not related, split them up; each warehouse is also separately billed for its own compute. So if you have different applications from different services and you need to keep the compute separate, or to keep the billing separate, that is now possible as well.

And additionally, you can collaborate on live data very quickly and easily. You simply create the data shares, grant write permissions on the objects of the data shares, and specify who gets access. Then you can start writing data through the data shares, and this even works, as we said, across accounts and regions.

The data is live as soon as any cluster or workgroup writes it and the transaction is committed.

So now I'll turn it over to Sudipto, who is going to give you a deep dive into how this new feature works.

Now that we have walked through this user surface, you will see what exactly is happening behind the scenes, why we can give you that isolation, and how we can get this live transactional data to you in an efficient fashion.

We'll go back to the familiar picture from the original slide, where I explained earlier how data sharing works. There is the producer cluster, and there is a consumer cluster that can now perform writes. The producer is the owner of the data: they set up the data share and set up the write permissions. That's where the journey starts.

That is the first step of the interaction. These two clusters communicate via the managed storage, which is where all the read operations already happen with data sharing, and now the writes with this new functionality that we launched as well.

Now, clients connect to the consumer, either by connecting directly to the data share database or by connecting to a local database on the consumer and executing USE or using three-part names, and they can execute DMLs or DDLs on remote databases that are shared by the producer.

Once the query arrives on the consumer, the consumer parses the query using the table metadata obtained from the producer, much like how reads work, and then fetches the live transaction context. That is how we are able to provide the transactional semantics at the very beginning of the transaction.

Once the live transaction context is obtained, the consumer subsequently does the entire heavy lifting of the query execution using its own compute resources. Think of the example that Ryan walked through, where you have millions of rows being ingested, or you could have terabytes of data being ingested in a big ETL job: all of that runs on the consumer compute without the producer having any interaction in the mix.

Once all of these writes are performed (and Ryan also walked you through the multi-statement transactions), the client either issues an explicit COMMIT or ROLLBACK, depending on what they did, or they may have the AUTOCOMMIT setting enabled, in which case every statement commits by itself, depending on that setting.

When the client issues the end of the transaction, the consumer prepares to serialize the entire transaction by writing its data directly to Redshift managed storage. Again, if you are ingesting terabytes of data, all of those writes go directly to storage without any impact on, or interaction with, the producer.

And as the very last step, since the producer is the owner of the data and has all of the necessary permissions to commit the transaction, the producer does a final commit of the transaction, which is a reasonably lightweight operation that finalizes the data and makes it visible to other consumer transactions, or other transactions that may be running on the producer using the same shared data.

Once the producer finalizes the commit, it sends an acknowledgement back to the consumer. At that point, the consumer knows that the data is durably stored and made visible so it can send the acknowledgement back to its own client where the entire process finishes.

So because you are using the consumer compute resources to perform your write operations, this is what allows you to create multiple different consumer warehouses, whether serverless or provisioned, size them based on your workload, and have the compute on the consumers be the compute that writes the data.

And the producer is always in control of who gets access to what and what transactions can commit. And if there are runaway transactions, the producer can also come and intervene to take any corrective action.

So in this case, for example, what we see is that there is a sales ETL cluster that is doing ETL on the sales data, a reporting endpoint, a serverless endpoint that runs the BI dashboards with only read queries, and another CRM ETL cluster that is doing customer relationship management ETL workflows.

So some important things to note here, as I mentioned, the producer is the owner of the data and is the final authority for writing to the shared data.

This is not only for the permissions but also for the transaction semantics: how these transactions execute, what locks they acquire, and so on. There is an important aspect that we adopted and expanded on for this multi-data warehouse writes functionality.

If you have paid attention to what happens behind the scenes for transactions, there are various ways you can provide transactional correctness and semantics, which are defined by the various isolation levels. Redshift started with the serializable isolation level, but as we saw more and more workloads, at bigger scale and with bigger requirements, we realized the need to make snapshot isolation the default for our serverless offering.

So we are building on that journey by making this functionality also require your databases to use snapshot isolation. Whenever you are on the consumer and you want the database to be writeable, you set the database to snapshot isolation. That gives you the benefit of multiple reads working on the same snapshot of the data, while also making sure two writes cannot conflict with each other and overwrite each other's data.
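
In SQL terms, and assuming a database simply named dev, setting that up is a one-line change; per the talk, serverless databases already default to snapshot isolation, so this mainly matters for databases still on serializable isolation.

```sql
-- Switch an existing database to snapshot isolation so it can be written
-- to through a data share.
ALTER DATABASE dev ISOLATION LEVEL SNAPSHOT;

-- Or choose the isolation level up front when creating a database.
CREATE DATABASE analytics_db ISOLATION LEVEL SNAPSHOT;
```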

That gives you correctness, but it also allows the system to add a lot more concurrency, so your workloads scale a lot better. In addition to cross-cluster scenarios, the new syntax that we shared, USE, also allows you to do cross-database writes within a cluster. Let's say you are connected to a database db1, and there is another database db2 on the same cluster that you want to write to. Today in Redshift, you would have to reconnect to that second database, db2, in order to write to it.

With this new functionality, you can just say USE db2 and execute the write while connected to db1. Behind the scenes, that functionality also uses the same cross-cluster mechanism that we developed and follows the same snapshot isolation semantics.
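
For example, while connected to db1 (the database and table names here are illustrative):

```sql
-- Connected to db1; write into db2 on the same cluster without reconnecting.
USE db2;
INSERT INTO public.load_audit
SELECT getdate(), 'nightly load complete';
```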

And one of the things we heard from a lot of customers about the serializable isolation level, which is the much stricter guarantee that Redshift supported, is that there was contention between DDLs and DMLs that often gets in the way: between SELECTs, your dashboard queries, and your jobs running concurrently.

By building on snapshot isolation, multi-data warehouse writes removes this contention between DDLs and SELECTs. If you are running a dashboard query on a table that is being concurrently curated by the ETL workloads, your dashboard queries run on their snapshot and continue to function. The ETL workload can come in behind the scenes, do a TRUNCATE, make whatever changes are needed, flip the data, prepare the data. Once that transaction commits, the next dashboard query comes in and sees the fresh data. We were able to remove this contention so that there is no blocking between these different types of statements.

Again, this is an enhancement to help you scale your workloads even better. Now that I have walked you through this deep dive, we also want to show you the feature in action, so I'll have Ryan come back to the stage to walk you through a demo of how this works.

I just want to add one other thing to note, just so that it's clear: the producer can be paused, and regardless of whether or not the producer is paused, the consumer writes will work. You do not get billed for anything on the producer; the producer does not get billed when the consumer takes an action. So at this point, I'm going to show you a demo of a classic ETL use case that this supports.

Take the following scenario: you have an existing provisioned cluster that you're using to store regional customer data. You'd like to use a separate serverless workgroup to ETL your EMEA customer data into this provisioned cluster, and you want to grant the minimum needed permissions to this workgroup: read-only for some objects, writes on others.

So I'm going to show you how this works in the Redshift console. We're going to start here, and we are going to go to your provisioned cluster. You enter it, go to the data shares section of your cluster, and create a data share.

Now, when you create a data share, you'll put in a name just like you have previously; we'll call this one etl_scaling. You select the database that you're using; for this, we're using the marketing database. You have options for public accessibility as well. I generally recommend turning this on, as it allows the consumer to choose how their cluster is configured in order to access the data.

And then you add the data share objects. You can see here I'm going to first add the prod schema, and I'm going to choose that I want tables and views; I don't want UDFs in this. I could then specify which objects within the schema I want, but here I'm just saying I want all of them.

And then for the other schema I specified, the staging schema, I want no objects yet. So at this point, all objects within the prod schema are in the data share, and I've also added the staging schema without adding any objects to it.

Now, we're going to take another step here where, for the staging schema, we're going to say that we want to share additional privileges. In the past, adding a schema was the equivalent of just sharing USAGE, so you could select from the schema. Now you have other options.

So for this one, we're going to say you don't only have USAGE permissions; you can also create objects within the staging schema if you're so inclined. We've set that for the staging schema, but for the prod schema you're unable to create objects, because you do not want to share that with this workgroup.

Now, one other thing that we introduced is something called scoped permissions. What scoped permissions does is allow you to grant permissions on all current and future objects. So you can say, hey, I want this data share to have SELECT permission on every table and view within this data share.

So I'll select that and continue from here. The next thing we're going to do is go into the actual table, the EMEA table, and grant privileges on it. We're going to select all of the privileges here: we're saying, hey, we want SELECT, INSERT, UPDATE, TRUNCATE, ALTER, everything.

And at this point, we're going to select the namespace that we're going to use for this. It's a serverless workgroup that we created for this, a 64 RPU serverless workgroup in addition to the cluster, and this is what we're going to be using for the write operations.

I'm going to switch over real quick to a different page for the demo. You do actually have the ability to create a database from the data share in the console if you want to; I'm going to show it via SQL here. The way data sharing works is you create a database from the data share, and that's what allows you to interact with the objects in the data share.

Right here, you're seeing Query Editor V2, which is Redshift's tool for querying data. You'll see over here that I'm selecting the 64 RPU workgroup to run these queries, and on the right you're also seeing the database that I'm choosing; I'm choosing to run this in the dev database.

Now, if you're doing this in the query editor, you'll need to grab the namespace for this. You can see here that if you run the command SHOW DATASHARES, it shows you the data shares. You could also do this in the console itself.

At this point, we've created a database from the data share, and we're going to use the marketing database (that's the name of the database) and create the table here. If you refresh on the left, you would actually see that there's now a new table in the producer database; it's in the staging schema and it's called customer.
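
Approximately, the SQL behind those steps, run from the 64 RPU consumer workgroup, looks like the sketch below; the namespace GUID and the column definitions are placeholders rather than values shown in the demo.

```sql
-- Find the producer namespace for the share.
SHOW DATASHARES;

-- Map the shared data into a local database named marketing.
CREATE DATABASE marketing
  FROM DATASHARE etl_scaling OF NAMESPACE '<producer-namespace-guid>';

-- Create the staging table on the producer, from the consumer workgroup.
USE marketing;
CREATE TABLE staging.customer (
  customer_id   BIGINT,
  customer_name VARCHAR(256),
  region        VARCHAR(32)
);
```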

So at this point, we're going to run a multi-statement transaction. Here's what we're going to do: we're going to use the marketing database and then begin a transaction. The transaction has a couple of steps. The first step ingests all of the customer data, copying it all into the staging table that we created.

The second thing it does is run a query so that we only insert the data from the EMEA regions. After that, it truncates the customer table that we just created, so it can be reused quickly in the future, and then it commits the transaction.
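
Expressed as SQL, the transaction described above looks roughly like this; the S3 path, IAM role ARN, file format, and the name of the EMEA target table are placeholders.

```sql
USE marketing;

BEGIN;
  -- 1. Land the full customer extract into the staging table we created.
  COPY staging.customer
  FROM 's3://<bucket>/customers/'
  IAM_ROLE '<iam-role-arn>'
  FORMAT AS CSV;

  -- 2. Keep only the EMEA customers in the shared production table.
  INSERT INTO prod.customer_emea
  SELECT * FROM staging.customer
  WHERE region = 'EMEA';

  -- 3. Clear the staging table so it can be reused on the next run.
  TRUNCATE staging.customer;
COMMIT;   -- all three steps become visible together at this point
```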

So we'll start this here. You press run, and what you'll see is that there are zero rows in the table over here; this is the EMEA customer table, and right now it has zero rows in it. Then we'll go through this, and you can see each statement running.

If you're an experienced Redshift user, you might be a little surprised by this, because previously you could not truncate a table without committing the transaction. We are moving to a world where you can do that. Right now, if you use the USE statement specifically, whether on the local database or on a data share database, you can have a transactional TRUNCATE. The plan is to make TRUNCATE transactional overall at some point, but for now it's within the USE statement.

So you'll truncate the table, you'll then commit the transaction, and you will see here that you've gone from zero rows to, in the last few seconds, with this other workgroup, having ingested almost 900,000 rows into this table. Additionally, as we said earlier, you can now grant permissions on different objects to different users if you so choose; that's possible as well.

Same as before, you'll say, hey, I want to grant ALL in the schema marketing.prod, or you'll grant specific privileges; you might want different privileges for the BI user role versus the data engineering role. And it all works very similar to local. Say again?

[Responding to an audience question] You would have to use a view for now. That is on the roadmap and should be out in the fairly near future, so feel free to find us afterwards; but right now you would do that primarily with views.

Yeah, we're almost done, so give me about a minute and a half and then I will gladly answer your question.

Okay. So just in conclusion: this new feature, the ability to do multi-data warehouse writes through data sharing, allows you to easily and flexibly scale your ETL. Different workloads can have different warehouse types and sizes based on their price performance needs. It allows you to keep your workloads predictable while isolating the cost as well, because the compute can be separated.

And it also enables that cross-team collaboration use case: you can start collaborating on live data across accounts and across regions with just a couple of clicks. And moving on to Sudipto's favorite slide, the one that spins around: if you'd like to take the feature for a first spin, we would love to hear your feedback.

So, thanks very much. If you'd like to reach out to us, our emails are there, and we'd love to hear your feedback. Please complete the survey as well.
