Upgrading from the modern data stack to the modern data lake

Monica: Hello everyone. Thank you so much for joining us today. My name is Monica. This is my good friend Emma and we are known around Starburst as the Galaxy Gals.

So today, what we're gonna do is talk to you about how to upgrade from the modern data stack to the modern data lake. So let's go ahead and get started.

So if you've been in the data space recently, within the last five years, you've probably heard of the modern data stack. The goal of the modern data stack was to provide a flexible data architecture. Everything had gotten too complicated; we were leaning into customization too much.

And so with the modern data stack, the goal was to simplify. Essentially, the modern data stack is a flexible data architecture built from cloud tools and data technologies to collect, process, and analyze data. And it's really interesting, because as you go and look at all of the definitions of the modern data stack, they're all very different. Everyone has their own slice and perspective, which really just contributes to the allure that is the modern data stack.

And you know what, it started off as a really good idea, but it's really just up-leveled the complexity that we have today. So here's what the modern data stack traditionally looks like. You can see you've got your ingestion, your transformation, your storage, your visualization, but then you start adding other components like your testing, your governance, your monitoring, and you end up with the stack that we know today.

Emma: Monica, do you think that the modern data stack is really that modern?

Monica: I think this might be a trick question. I'm gonna say no, not really. The modern data stack is just a redo of the legacy data architecture that we all know and love.

I used to work as a data engineer, and I was actually tasked with the job of taking our legacy data architecture and modernizing it, whatever "modernize" may mean. And essentially what we ended up doing was taking the exact same data and the exact same components and just switching out a cloud component for each old legacy component, which in the end is not really modernization. You're not thinking about anything differently; you're just replicating the situation you already created.

And so, you know, the modern data stack is a big lie. The modern data stack is indeed just the data stack from decades ago, just built within the cloud.

And one of the constant things that people get tripped up on is that you don't actually have to migrate to modernize. We have this notion, built around the cloud data warehouse, that you have to centralize all your data and move it into the cloud data warehouse to actually get that modernization out of it. But really, that's not the case at all.

You know, at Starburst, we'd say you really should not follow that strategy; you should be building on the storage solution that makes sense for you. And a lot of times that flows right into the point that we think data lakes are the more pertinent choice of storage solution over the data warehouse.

You know, Iceberg, Delta Lake, Hudi, they've done tremendous things in applying those data-warehouse-like capabilities to the data lake. And that has just been a game changer for up-leveling what a data lake can do.

And so with the data lake today, you can really capitalize on the separation of storage and compute, and instead focus on building for the future: building for open data standards, building for things that will prevent vendor lock-in down the road.

And I know it might be scary to feel like this revolves around the data lake, because data lakes can be tricky, and without the right organization or governance they can end up without a happy story. But what we're gonna do today is talk about the process and the modernization steps you can take to really up-level that data lake and avoid what happens in this meme.

So to get started, we're gonna shift our concept and focus on modernization as a process, not a data stack, because we fall into this trap within the modern data stack narrative of being so focused on building for that proprietary cloud data warehouse.

We need to look at the technologies we have today to up-level what's available. So we're gonna build with our data lake: we're gonna start with the storage component, then we're gonna select an engine, then we're gonna talk about utilizing those open table formats I've already mentioned, and then we're gonna build the access component, creating that one layer for you to access everything going forward.

So let's get started with the storage. The first thing to do is really be mindful of how you organize the data within your data lake. So what we'd like to suggest is building out your data lake into three different zones.

The land layer is gonna have all your unmodified source data. That's data that's just landed in there; you're not gonna touch it. You're gonna focus on keeping this raw data as-is.

Then you're gonna up-level that and move to your structure zone, and in that structure zone you're going to have filtered, cleaned, joined data. You're gonna do all of the data manipulation and transformation you need to get it ready to be viewed by the data consumer.

And then your final layer is the consume layer. There you're gonna run any final aggregations you may need, pulling your data together and joining tables that need to be accessed from multiple different data sources.

And then you're gonna figure out the best way for your data consumers to access that data, be it through data products, a dashboard, whatever makes sense for your organization. That is really how you set that organizational foundation, so you know where the data is and it doesn't turn into that data swamp.
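
To make those three zones concrete, here's a minimal sketch in Trino-style SQL. The catalog, schema, table, and column names are all hypothetical, just to show the flow: raw data lands untouched, the structure zone cleans and types it, and the consume zone aggregates it for consumers.

```sql
-- Hypothetical three-zone layout; 'lake' is an assumed catalog name.
CREATE SCHEMA IF NOT EXISTS lake.land;       -- unmodified source data
CREATE SCHEMA IF NOT EXISTS lake.structure;  -- filtered, cleaned, joined data
CREATE SCHEMA IF NOT EXISTS lake.consume;    -- aggregated, consumer-ready data

-- Structure zone: clean and type the raw records, leaving the landed data untouched.
CREATE TABLE lake.structure.orders AS
SELECT
  CAST(order_id AS bigint)       AS order_id,
  lower(trim(customer_email))    AS customer_email,
  CAST(order_ts AS timestamp(6)) AS order_ts,
  total_amount
FROM lake.land.raw_orders
WHERE order_id IS NOT NULL;

-- Consume zone: aggregate into the shape data consumers actually query.
CREATE TABLE lake.consume.daily_revenue AS
SELECT date(order_ts) AS order_date, sum(total_amount) AS revenue
FROM lake.structure.orders
GROUP BY date(order_ts);
```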

The second thing we're gonna talk about is how to choose a performant, scalable engine. We've already talked about the separation of compute from storage, but really capitalizing on that separation is what up-levels modernization into the process we want it to be, rather than just the lift-and-shift tech stack we carried from the past into the present.

And so to do that, we need to find a scalable query engine that is performant with high concurrency at high scale. And we also need to keep in mind leveraging elastic cloud compute to scale your resources up and down while it's running, which is critical.

And then we also wanna think about not only having an engine that works for one specific instance; we wanna have an engine that can work for multiple different uses. We wanna do the interactive use case, we wanna do batch analytic workloads, and we wanna return data lake queries with scale and speed. That's where you wanna have a query engine that can handle all three at the same time.

And if you haven't heard of Trino, we would love for you to go over to booth 1151 right over there, and we'll tell you more about that performant, scalable engine.

Emma: Hey Monica. So for our third step, what we're gonna do is actually move to open table and file formats. I've already touched on this, but it's really that ability to up-level to ACID transactions within your data lake, and using those performant file formats, ORC, Parquet, and Avro, really lets you up-level toward that process that you need.
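
As a rough illustration of those warehouse-like capabilities on a lake (assuming a catalog that uses the Iceberg connector; the names are hypothetical), an open table format is what makes transactional updates and deletes possible on plain object storage:

```sql
-- Hypothetical Iceberg table on object storage, stored as Parquet files.
CREATE TABLE lake.structure.customers (
  customer_id bigint,
  email       varchar,
  country     varchar
)
WITH (format = 'PARQUET');

-- ACID-style operations that the table format enables on the lake:
UPDATE lake.structure.customers SET country = 'US' WHERE country = 'USA';
DELETE FROM lake.structure.customers WHERE email IS NULL;
```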

And really, you're just trying to aim for the simplicity and efficiency we were after in the initial idea of the modern data stack. And so our last step in this process is actually to provide a single point of access and governance.

So if you're thinking, "oh, Monica is just telling me I need to go from a cloud data warehouse to a data lake": no, I'm not. I would be so sad if that's how you left this talk. What I really want you to do is think about integrating your data lake with your other data sources.

You know, as I said, I was a data engineer, and, not to get religious about it, I don't believe in data centralization. I just don't think it's possible. While I was a data engineer, I moved way too many copies of the exact same data from one place to another, and maintained them for too many uses, to really believe that it's possible.

And so really what we wanna do here is utilize a semantic layer to provide that single point of access and governance, so that you can treat the data lake as your center of gravity and also access the data that lives in its orbit.
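
In practice, that single point of access looks something like the following hypothetical federated query, where 'lake' is an Iceberg catalog (the center of gravity) and 'crm' is, say, a PostgreSQL catalog (data living in the orbit):

```sql
-- One statement reaching into the lake and an operational database together.
SELECT a.account_manager, sum(o.total_amount) AS total_revenue
FROM lake.structure.orders AS o
JOIN crm.public.accounts AS a
  ON o.customer_email = a.contact_email
GROUP BY a.account_manager
ORDER BY total_revenue DESC;
```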

And so here we're just summarizing the points I've talked about, about modernization being a process and not a stack. So we're going to look at that single point of access and governance. We're going to add those advanced warehouse-like capabilities directly on the data lake using our table formats and our file formats.

We're gonna look for an architecture we can build that's actually vendor agnostic. And we're gonna find a system that is scalable and cost effective. And that's what we call the modern data lake.

And again, if you'd love to hear more about this, we could have a very extensive chat at booth 1151 to learn more about the modern data lake. But essentially, it's taking all the components we've talked about: building on commodity storage, building on the open file formats and open table formats, selecting your query engine of choice, and then being able to access your federated data across data sources beyond your lake.

And so at Starburst, we have developed the easiest and fastest way to manage and build your modern data lake. And that is the Starburst data lake analytics platform, which Emma is gonna tell you all about now.

Emma: Thank you, Monica. So the architecture that you just saw is composed of components that we're all thinking about in our everyday lives as we work with and manage data. I don't know that any of it is crazy or net new, but it's the way the components work together that we're gonna talk to you about here today.

And it's about the learnings our product team has gathered as they've been building out this platform and managing our own internal data lake, which we think are really important to impart to you today.

And so here's kind of what a typical architecture diagram might look like. We've obviously abstracted away the logos, but you might picture Apache Ranger or Lake Formation here; maybe you have Atlan; you have all of your tables in your data lake here with your object storage; and then you have your observability layer. And all of these pieces of the puzzle, all the places where you see these different arrows, mean you're integrating different technologies, you're learning those technologies, and you're maintaining them.

And so when we're talking to different data teams, we're realizing that every day, as the teams grow, scale, and start to mature their data lake, they're running into problems with time, money, and resources. Their teams are constantly bogged down. They're trying to onboard new use cases with the emergence of AI.

Data teams suddenly were being pounded by the business teams asking: OK, is our data ready? Are we onboarding new use cases? How can we use AI? And the data team is still trying to figure out how to get this poor data lake architecture up and running.

And so a lot of the time this architecture looks easy, but in the actual implementation there are a lot more breakpoints, and it requires a lot more resources than the initial project team thought it would take.

So what did our engineering and product team do? Well, they took this architecture and thought of merging it into a single platform. But Starburst was also created by the founders of Trino, so we are open source purists. We're not trying to take control of your data.

And that's why, when you're looking at this slide, I really want you to focus on the bottom layer, your data sources. You continue to own your data in any table format, file format, or data source: on prem, in the cloud, hybrid. Whatever your data stack looks like today, you can use Starburst, and Starburst will also scale with you tomorrow.

So that's the core: we want you to own your own data, and we'll provide a unified analytics platform. So those breakpoints and those potential integration issues that I just talked about are removed. Instead, we built three distinct layers, which come from our learnings of working with data lakes over the years, into a unified platform that we call Starburst Galaxy today.

And to avoid this being a sales pitch, because I'm sure we've all heard enough of those over the week, what I'm gonna do is take you through these three different layers, how we architected them, and why we did it that way. Because even if you're not using Starburst, there were some key learnings that we had over the past few years as we built this platform.

So we're gonna start with the single point of access. As Monica said, data centralization honestly is a myth. If your company is growing, it's going to go through acquisitions, it's gonna go through mergers, you're gonna want to go multi-cloud. Things are always changing, new use cases are emerging, and it's just not realistic to expect all of your data to land in the same place.

Now, that being said, we believe the data lake is the center of gravity for the future. So we're not saying that you shouldn't try to centralize at all. We are saying that complete centralization is impossible, and that the center of your data gravity should live in a data lake, based on the architectures that we've seen.

So how do we account for this? Well, we started with a single point of access, and the great part about our engine and our system is that it's built on Trino, which connects to 50-plus data sources in the cloud and on prem. And you can see here on the side of the slide that we also treat every table format as a first-class citizen.

So if you're coming to us from Delta Lake, if you're coming to us from Hudi, if you're coming to us from Iceberg, we will take your data in that table format and treat it equally, regardless of your current architecture and your current implementation.

And last but not least, you need to be able to query that data, so we provide that as well. And that's really the foundation: we're providing the single point of access, you're connecting your sources. But the next layer is our bread and butter. This is where we excel.

And what we learned here is that when our co-founders built Trino, they built something great: it's used for SQL analytics at internet scale at companies whose products you use every single day. We're talking Netflix, we're talking Pinterest, we're talking Facebook. All of these companies use Trino for SQL analytics. But the co-creators of Trino came together and rethought the product.

And so what we have in Galaxy is an enhanced version of Trino. If you're a Trino user today, you might be used to having to perform your own indexing and your own caching, creating those layers yourself. We have a technology that leverages nanoblock indexing to provide fast queries against your data lake. Fun fact: it also reduces the GET API costs on your S3 bucket.

So if that is something that's of interest, I recommend checking it out. We've seen it reduce those costs by up to 70%. I'm sorry, that was a sales pitch. I will now step back into my learning zone.

And last but not least, it's really important to have a fault-tolerant engine. Monica is going to show you the rest of the slide.

Monica: Oh my gosh. So the last part of the engine is that a lot of teams today are alternating between Trino and Spark for their batch and their interactive SQL workloads. But that is a lot of cognitive load on the team. That's two separate tools to maintain.

And we do know that a limitation of Trino in the past was fault-tolerant execution. And so in Galaxy, we built that into the engine; we can execute queries up to 60 terabytes. And so again, that's just another architectural consideration to think about for your team as you architect your analytics platform: how many tools do you wanna be running in your stack and maintaining?
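
For comparison, in open source Trino, fault-tolerant execution is something you opt into with configuration rather than getting out of the box. A minimal sketch, with the spooling bucket name as a placeholder, looks roughly like this:

```properties
# config.properties: retry failed tasks instead of failing the whole query
retry-policy=TASK

# exchange-manager.properties: spool intermediate data to object storage
# (bucket name is a placeholder; additional S3 settings may be needed)
exchange-manager.name=filesystem
exchange.base-directories=s3://example-exchange-spooling-bucket
```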

Last but not least, we have our Gravity layer. This is everything that exists around the data lake to make it possible. Like we've talked about earlier in this presentation, and as I'm sure you've heard at many of the booths out here today, a data lake is impossible without understanding the data that's in your lake and cataloging it, without governing it and ensuring that the right people have the right access at the right time, down to the row and column level, and, last but not least, without figuring out how to share that data, maybe with other internal teams with proper documentation, or maybe through external data sharing.

Gravity is our universal discovery, governance, and sharing layer that allows you to do all of those pieces of the puzzle without spending the time learning and integrating different platforms.
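
Where you don't have point-and-click policies, the same row- and column-level idea can be approximated in plain SQL. Here's a hedged sketch with hypothetical names: a view that filters rows to one region and masks an email column, granted to a role instead of the underlying table.

```sql
-- Hypothetical policy-as-a-view: analysts see only EMEA rows, emails masked.
-- (Assumes the orders table carries a region column.)
CREATE VIEW lake.consume.orders_emea_masked AS
SELECT
  order_id,
  regexp_replace(customer_email, '^[^@]+', '***') AS customer_email,
  order_ts,
  total_amount
FROM lake.structure.orders
WHERE region = 'EMEA';

-- Grant access to the view, not the raw table.
GRANT SELECT ON lake.consume.orders_emea_masked TO ROLE analyst;
```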

So again, it's built in: you get the power of Trino, but with Gravity it stretches across all of your data sources, which is great. And so this is really Starburst Galaxy. This is what we call our unified analytics platform, and it's based on the many hours we've spent working with customers who are trying to mature their data lake or bring it to the next level.

But this isn't all. Everything I talked about today is in and around the lake, and we have another couple of exciting announcements: customers also have to be able to get data into the lake, activate that data, and query it at scale. And I am running out of time.

So what I'm gonna do is rapid-fire through these announcements, and feel free to come talk to us later at the booth; we have demos of all of these. Our biggest announcement, which we just released this morning:

Streaming ingestion. This is how you can build your data lake in real time, land that data in Iceberg tables, and start querying it within a minute. So that's really exciting, and it's fully managed. So now you have your data in the lake, but let's make sure you are only giving that data to the right people.

So we have automatic data classification: you can use AI-powered models to intelligently identify and tag your sensitive data, so you're not at risk of human error, or waiting on humans to sift through all that data, when you're using your attribute-based access control policies. And then we also introduced data lake optimization.

Emma: So this is everything that happens in your warehouse under the hood. A lot of times, when you're dealing with your core S3 technology, you have to build these cron jobs out yourself. We're automating that for you, so this is really exciting. You still own the Iceberg tables, but we will help you automate those cron jobs with literally just a click of a button.
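
For reference, the manual version of that housekeeping is exactly the kind of statement teams schedule in cron today. In Trino's Iceberg connector it looks like this (table name hypothetical, thresholds illustrative):

```sql
-- Compact small files into larger ones (the classic cron-job chore).
ALTER TABLE lake.structure.orders
  EXECUTE optimize(file_size_threshold => '128MB');

-- Expire old snapshots to reclaim storage once the retention window passes.
ALTER TABLE lake.structure.orders
  EXECUTE expire_snapshots(retention_threshold => '7d');
```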

And then last but not least, let's get your data out of your lake and have you start sharing the value of that data, with other organizations or with other departments within your company, through data sharing.

And so it's trusted data sharing: you can package your data sets up into a data product and you're good to go. And so now you have not only built your data lake; you've governed it, you've secured it, and you've shared that data.

So that's a little bit about why the modern data stack is dead and the modern data lake is here to stay. Thank you for listening.

And one last one: text-to-SQL and SQL-to-text. Come by the booth and check it out; we'll be happy to show you a demo. So yeah, booth 1151, right over there.

Thank you so much. Thank you.
