Breaking the data pipeline bottleneck with zero-ETL

Hello. So I'm Rob Koch. I use sign language interpreters to communicate. The interpreters sitting in the front are voicing for me, and if anything is misheard or misspoken, you can just blame her. Ok.

Yes. So the talk I'm giving today is about zero-ETL and breaking the pipeline bottleneck. I should also mention that I work for Slalom, and I'm an AWS Data Hero; I was honored to be part of one of the first Data Hero cohorts.

I've gone through a little bit of this content on zero-ETL before. I'll explain what the ETL part means, Extract, Transform, and Load, and how it happens here in a slightly different order, which I'll get into later. We're trying to make data available by extracting it and loading it into different places. Sometimes there are very specific needs, so we'll go through part of that as well.

I also have a demo that I'll show through the console; those are pre-recorded videos I made, so we won't have any awkward moments of silence. We'll do a walkthrough of the console itself, where I'll be taking an Aurora MySQL database into Redshift, which is more of an analytical, MPP (massively parallel processing) database built for analytics.

Then we'll use DBeaver, or whatever your favorite database IDE is, and we'll see how to integrate all of that, from the relational database to the analytics database, without building pipelines. We'll do a few inserts, updates, and deletes, so I'll walk you through that.

So, zero-ETL. Some of you may have data pipelines that require orchestration. They need maintenance, there are various errors that need fixing, and maybe that burns a lot of your time during the day. Zero-ETL makes it as easy as the push of a button. With some relational databases, running analytical queries directly against them can cause issues for the end users, so we want to move the heavy work over to the analytics platform. With this data integration we can analyze the data in place, and I know that's sometimes a struggle, with latency issues and the congestion that happens there. Whether it's ETL or ELT depends on your situation: how the data is orchestrated and delivered, maybe on a repeatable schedule, improving the timing there, as well as the DevOps side, source control, deployment, security, et cetera.

So I prepared this demo for you. This is the console, setting up a new Aurora database. Create the name you want for it, and pick a particular version; an older version will also work. Then name the database and create the password. It takes a little time to spin up. Yes, I'm cheap, so I wanted to save money and used a smaller instance class here. Public access is turned on for now so that DBeaver can reach it. It takes a few moments to get it up and running and create the database.
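
For reference, here is roughly the same setup scripted with the AWS CLI instead of the console. It's a minimal sketch: the identifiers, engine version, and instance class below are placeholders, not the exact values from the demo.

```bash
# Create the Aurora MySQL cluster (placeholder names and version).
aws rds create-db-cluster \
  --db-cluster-identifier zero-etl-source \
  --engine aurora-mysql \
  --engine-version 8.0.mysql_aurora.3.05.2 \
  --master-username admin \
  --master-user-password '<choose-a-password>' \
  --database-name demo

# A small instance keeps the demo cheap; public access lets DBeaver connect.
aws rds create-db-instance \
  --db-instance-identifier zero-etl-source-instance-1 \
  --db-cluster-identifier zero-etl-source \
  --db-instance-class db.r6g.large \
  --engine aurora-mysql \
  --publicly-accessible
```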

Next, we're going to set up the Redshift side. Go to the next slide... there we go. You create this preview workgroup, which supports the Aurora PostgreSQL, MySQL, and DynamoDB integrations; we'll do the MySQL portion of it. Create the namespace, and then set up that workgroup name as well. You have to set up the IAM roles, of course, through the command line, following the Redshift documentation. I'm using a fairly low capacity here because this is for demo purposes. Here are the IAM resources; make sure you add the right roles. You can see my account ID there. Then we add the integration source, which I've named here, pointing at the MySQL instance. So now if we go back to MySQL and look for that particular instance, we're working with Redshift Serverless, and it can be a little odd: you have to watch out for the slashes versus the dashes, so I had a few lessons learned there. You've got to add the ARN as well. And now it's integrated successfully.
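
The equivalent CLI calls look roughly like the following. This is a sketch with hypothetical names, region, and account ID, and it leaves out the IAM role and resource-policy setup that the documentation walks you through.

```bash
# Redshift Serverless target: a namespace plus a workgroup
# (8 RPUs is the minimum base capacity, plenty for a demo).
aws redshift-serverless create-namespace \
  --namespace-name zero-etl-ns

aws redshift-serverless create-workgroup \
  --workgroup-name zero-etl-wg \
  --namespace-name zero-etl-ns \
  --base-capacity 8

# The integration itself is an RDS API call: source is the Aurora cluster ARN,
# target is the Redshift Serverless namespace ARN. Watch the slashes versus
# colons in the ARNs; the formats differ between the two services.
aws rds create-integration \
  --integration-name aurora-to-redshift \
  --source-arn arn:aws:rds:us-east-1:123456789012:cluster:zero-etl-source \
  --target-arn arn:aws:redshift-serverless:us-east-1:123456789012:namespace/<namespace-uuid>
```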

So we'll go into this and make sure it's the right one. This is created from the source; just make up any sort of name here. Then we identify the database as a Redshift database. I'm not sure if I have that in this part of the video, but we have time, so I'll let you watch a little more of it. Again, the creation will take some time. And this is important to know too: the parameters and the parameter groups. The cluster has to have the binlog enabled, so setting the binlog properties matters. The binlog carries a bit more change information through, which makes it easier to keep the two databases in sync between Redshift and Aurora.
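
In CLI terms, the parameter-group requirement and the database creation on the Redshift side look roughly like this. The parameter list is trimmed to the essential binlog_format setting; check the zero-ETL documentation for the full set required by your engine version, and treat the names as placeholders.

```bash
# Aurora MySQL needs row-based binary logging for the integration to replicate changes.
aws rds create-db-cluster-parameter-group \
  --db-cluster-parameter-group-name zero-etl-params \
  --db-parameter-group-family aurora-mysql8.0 \
  --description "binlog settings for zero-ETL"

aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name zero-etl-params \
  --parameters "ParameterName=binlog_format,ParameterValue=ROW,ApplyMethod=pending-reboot"

# On the Redshift side, the destination database is created from the integration ID.
aws redshift-data execute-statement \
  --workgroup-name zero-etl-wg \
  --database dev \
  --sql "CREATE DATABASE zeroetl_db FROM INTEGRATION '<integration-id>';"
```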

Now we'll go to the Redshift query editor. I've set up the users and the tables in Aurora, and I've inserted some records there; it looks like there are three records for now, inserted one at a time. If we go to Redshift, we can see the data showing up, it's integrated. That's great. But you don't see all of it yet; it takes a couple of seconds for these things to be transferred between the databases. If you refresh a couple of times, you'll eventually see it all here.
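
As a sketch of what the demo is doing (the table and column names are made up for illustration), the inserts run against the Aurora side and the check runs against Redshift a few seconds later:

```bash
# Insert a row into the Aurora source; the demo's sample data includes a city of Lufkin.
mysql -h <aurora-endpoint> -u admin -p demo \
  -e "INSERT INTO users (id, name, city) VALUES (1, 'Rob', 'Lufkin');"

# Query the replicated table from Redshift (the source database shows up as a schema
# inside the database created from the integration).
aws redshift-data execute-statement \
  --workgroup-name zero-etl-wg \
  --database zeroetl_db \
  --sql "SELECT id, name, city FROM demo.users;"
```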

If you play around with the Redshift data, maybe you want to remove records or run an update or insert, you can see what happens. Once it's integrated, you can't delete or change a record from within Redshift. Otherwise, how would you keep things in sync, right? So you have to go back to the source, make the changes there, and then they'll show up. I'll demo that as well.

This is just a walkthrough of the different properties. I can show you these integration IDs as well: if you're doing DevOps, or you need to see which integration ID is which to keep track of what you're doing, this is where you can find that information and make sure it matches. Multiple integrations are possible as well.
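
If you'd rather pull the same metadata from the CLI, listing the integrations in the account shows each integration's ID, source, and target, which helps when you're juggling several:

```bash
# List zero-ETL integrations with their IDs, sources, targets, and status.
aws rds describe-integrations
```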

So we can go to the next one and go back to Aurora. Within Aurora, I'm using an app called DBeaver. I'm just checking to make sure this is the right one: there's the name of the city you see here, Lufkin. I'm going to change that to Seattle, keyed on the ID, and then add WA for Washington; you can't have Seattle without Washington, and vice versa. So we'll go ahead and make that update.
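
The update the demo runs in DBeaver boils down to a plain UPDATE against the Aurora source; the table and column names are hypothetical, keyed on the row's ID as in the demo:

```bash
# Change the city from Lufkin to Seattle and add the WA state on the source side.
mysql -h <aurora-endpoint> -u admin -p demo \
  -e "UPDATE users SET city = 'Seattle', state = 'WA' WHERE id = 1;"
```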

So this is proof that the change has gone in. Now we wait and see if it comes through to Redshift. Again, it takes a couple of seconds; it's pretty fast, but there's a lot going on behind the scenes, so it does take a moment. I click refresh a couple of times, and then you can see the record has been updated automatically.

So that's really how easy this integration is. Now let's say we want to clean things up on the Aurora database side. We can delete the record, and it works. You can add a record, and everything stays in sync from that point forward. And it should update here as well.
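
Cleanup works the same way: deletes and fresh inserts run against the Aurora source and flow through to Redshift on their own (again, hypothetical table and values):

```bash
# Delete on the source and add a new row; both changes replicate to Redshift automatically.
mysql -h <aurora-endpoint> -u admin -p demo \
  -e "DELETE FROM users WHERE id = 1;
      INSERT INTO users (id, name, city, state) VALUES (2, 'Ana', 'Denver', 'CO');"
```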

So why should we use zero-ETL in the first place? It can save time. When you're working, you're always doing a lot of little things you might be struggling with: custom coding, networking, opening up ports, dealing with different subscriptions and software. All of those little things take up a lot of time. If you want to keep data movement simple, from one place to another, that's really it. Of course, there are fewer data pipelines to maintain, which really does protect your time and your sanity and makes your life easier. You don't have to worry about storage issues either: sometimes when migrating data from a relational database to an analytical database, the logging causes problems, the logs pile up in the system, and you can max out your storage, so that's worth mentioning here. So it's about saving resources, keeping things simple, and having less to troubleshoot in general. If the integration is just one thing, it makes life a lot simpler.

Then there's the transform. Doing it in a relational database can cause performance issues for other users on the same database, so you might want to take that work out and push it to the right-hand side. Relational databases should stay relational and analytics should remain analytics, and this process takes advantage of that by moving the work to the right. I know it says zero-ETL, but it's really more like ELT, and that's why it's a little confusing; that's why at the start of my talk I asked, is it ETL or is it ELT? The transformation happens on the analytics side, which is what that side is designed for, so it's really fast. Taking advantage of the compute on the analytics side is a benefit.

And with the analytics database, the analysts know what they want to analyze when they're doing their job. They're pretty good with querying and the query language, and also with making things complex for the infrastructure folks, right? And later down the line, if you're pushing things further to the right side, you're better prepared for a data lake and things like that.

There are some drawbacks I should mention; we're honest people here, so we can talk about the drawbacks. You can't customize the extract and load process at this point. I know AWS will come up with something soon; they're always improving their services, so I have faith in that. There are also some behavioral questions: I'm not sure what's going on behind the scenes with zero-ETL when you're moving data from Aurora to Redshift, or exactly how it works. It's a black box to us, so keep that in mind if you have concerns or questions about it. Filtering is also on the way; I've been told it's ok to share that. You might have some records, for confidentiality reasons, that you want filtered out before they reach the analytics side.

It also may not work for complex data types, things like geospatial, or maybe you have some custom function; those things might not work in this process, and you'd have to handle them on the other side. Having a black box, and maybe not knowing what's happening, can be an issue and can hurt your ability to pivot quickly. Like I was saying, if something goes wrong with the integration, you might have to start using another tool to handle the transfer of the data, and if you're in a hurry, that could be problematic. But if you wait it out and trust it, then going forward that transition should be smoother.

This is my quote: "Zero ETL definitely has a place in your data platform." It checks off all six pillars of the AWS Well-Architected Framework: your operations, your security, your sustainability, all of those different pillars. It covers all of them.

And of course, we've arrived at the last slide. These are the places where I gathered the information for the demo, from when they announced the preview and from playing around with it. I have QR codes here with links to what I looked into while figuring out how this all works; they were a nice resource to have, along with the step-by-step guidance. One note: some older versions of the software on CloudShell and the like might need to be updated for things to go smoothly.

Thank you so much for coming. There's a LinkedIn profile link here, and if you don't mind filling out the survey, I would appreciate the feedback. Thank you so much for that.
