How to build a platform for AI and analytics based on Apache Iceberg

I'm here to talk about Tabular and building an AI platform - or mostly an analytic platform. Let's be fair: we threw AI in there at the last minute. But we're going to talk about building an analytic platform on AWS, using a lot of AWS services on top of S3 and Apache Iceberg, to get central storage that can scale and that works with any of the compute projects you probably use in your day-to-day work.

So today we're gonna talk about:

  • What is Apache Iceberg and why it turned out to be a bit more than we realized when we were initially creating it.

  • The modular data architecture that you can unlock by using Iceberg.

  • Tabular's platform and how we make that modular architecture a lot easier to deploy and run at scale.

  • I'm then going to do a quick demo. I think it's under two minutes, but we'll see - I may have lied there as well.

Jumping into it - what is Apache Iceberg?

Iceberg is the solution to a lot of the problems that my co-founder and co-creator of Apache Iceberg, Dan, and I had when we worked at Netflix. We were running a really, really large-scale data platform and data architecture with S3 as the source of truth. And that just highlighted a lot of long-standing issues with the Hadoop ecosystem and the Hadoop space.

If you're familiar with Hadoop, we were running Hive tables, and those tables had a very simplistic model for tracking data and giving you the abstraction of a table on top of data files. Basically it said: let's throw a whole bunch of data files into a directory structure, and when you list a directory, whatever comes back is what's in the table.

Well, that had a lot of problems. And there were three main areas of problems that we ended up seeing:

  1. You can't make atomic changes - you can't have database transactions - if file listing is the primitive that tells you what files live in your table, because you can't say, "hey S3, take these 100 files and make them all appear in the listing at exactly the same time." There's just no transactional primitive there.

  2. The other issue was performance - using S3 instead of HDFS gave us amazing scale but higher latency. So if we needed to go list 100 or 1,000 directories, there was a huge scale challenge there.

  3. And finally, we had all sorts of data engineering and productivity challenges - and this was the real motivation for solving the other two. If your data engineers can't rely on correctness guarantees - if changing a table through one process makes another process get the wrong results - then you can't trust your data. And not being able to trust your data means you may as well not have it at all. The hoops you had to jump through to build trust in the results you were seeing were just too great.

Similarly, at our scale, we just needed a better way to cut down on the data that all of our queries were reading - really get that cost down and deliver the performance that our data engineers needed.

And finally, we also had all of these issues where tables just didn't behave like tables. You could drop a column, add a column with the same name, and you'd just brought the old data back from the dead. And that's really bad - people don't like getting rid of data and then having it show up three months later because they added a column back with the same name.

So what we realized was that all three of these large problem spaces had an intersection, and that intersection was simplicity: the Hive table format was just too simple. It didn't do enough to keep track of schemas and things like that.

So we decided to fix these problems. And what we did was, in retrospect, we basically just stole a whole bunch of great ideas from the SQL world.

So what we did was create Iceberg. It's an open standard for tables, and I think the critical thing was that we added SQL behavior. We designed for ACID semantics, and we designed for S3. We also applied basic SQL techniques to make sure there were no unpleasant surprises - so you didn't need to retrain all of your analysts just to be able to query your tables.

We built in hidden partitioning, and you also get safe schema evolution and those sorts of things. We also built for productivity: time travel, SQL MERGE, compaction and optimization, and so on.

And it turns out this was actually much more powerful than we'd realized. This is why you see things like the S3 keynote mentioning Iceberg these days - bringing ACID transactions and transactional behavior to data lakes was a much larger innovation than we knew. It takes you from being able to use Flink and Spark safely to universal analytic storage. And that is incredibly important.

It means that all these components and pieces that you use every day can actually work really well together. It's no longer the case that you have tables in Redshift and tables in EMR, where getting them to play nicely with one another is a very difficult thing - Iceberg brings them together as one coherent and useful architecture. You can also use things like Snowflake and Databricks, and basically anything else - all of those commercial databases that we never even thought would support Iceberg have adopted it as an open standard.

So it really has become a universal analytic storage platform. And what that unlocks is a modular data architecture that makes a whole lot more sense than what we've been doing in the past.

So where are we going from here? We've taken S3 and said: okay, objects are a little too simple for what we're doing. We need a table abstraction - we want to build on top of tables, just like everything has for the last 30 years. Now we can uplevel that abstraction layer to tables on top of S3: one central store for all of the data, with a table abstraction that you can use across any database product, not just one application.

And what we're seeing is a move to a modular data architecture where all of those components fit together like Legos, where you use declarative practices to manage and update your data, and where you rely on open standards to provide connectivity across all of those engines and components. You can just plug them together and drop a new engine into your data architecture that accepts and uses your existing storage policy, your users, and everything else.

There are some really important principles here, like declarative approaches. When you're using four engines instead of just one, you need to configure things in one place so that all of those engines know how to behave. So you declare settings at the table level - this is actually a very old SQL-style approach - and every engine picks them up from there.

That's even more important with security. We need to move security from the engine level down to the data level, so that we're securing the tables themselves and not the access points. Because if you have a security policy and a governance structure that requires syncing policy to seven different systems, then when the eighth comes online you've got a lot more work to do, and you really don't know if you're secure or if everything is up to date.

So there are a lot of changes that we're seeing to come to this modular data architecture. But I really think that this is where the future is heading.

And that brings us to Tabular. We've seen this modular architecture work at a number of different companies within the Apache Iceberg community - from our own experience at Netflix to working with engineers at Apple, LinkedIn, and other large tech companies.

And so we left Netflix a couple of years ago to build Tabular, which brings this architecture to life on AWS and on other platforms.

So what I'm gonna show today is essentially the architecture that we're seeing here:

  • S3 as the global store underneath

  • Iceberg as that universal table format

  • Tabular providing the catalog, access controls, optimization, and data maintenance

  • And then all the engines - anything that you want - up top. That can be Redshift, which has Iceberg support that I think is getting better day by day; EMR; and what we're going to demo today, Athena, to show how easy it is to connect these things up.

So in Tabular's Iceberg platform, what we're going to see is:

  • First and foremost, the Iceberg REST catalog, which is much like the table format - it's a standard for interacting with catalogs. So how do we get Athena and EMR and Snowflake to all talk the same language about what tables are there and how to secure them? That's the REST catalog.

  • Our platform includes Iceberg tables that live in your own storage - at the end of the day, you're in control of it.

  • We're going to see unified access controls, and we also have things that we're not going to see today, like SSO, SCIM, and IAM integration, as well as all the other managed services.

  • Tabular also brings in automatic file loading, automatic optimization, CDC table mirroring and lots more features that just make your life easier. We're trying to take all of the undifferentiated heavy lifting and build that into the platform.

  • We also have a number of compute integrations - starting with EMR and Athena, but also Kafka Connect for MSK, Redshift via Glue sync, Kinesis and Flink, as well as Snowflake and Databricks. So you can use any tool you want in the data space - even if you just want a Python process to go talk to your tables, as the sketch after this list shows.

Alright, so let's see this in action.

[Demo transcript omitted for brevity]

So that was a quick demo of Tabular and Iceberg. Let me know if you have any other questions!
