[O'Reilly Data Show] How machine learning will accelerate data management systems

OReillyData

Article link: https://blog.csdn.net/zkh880loLh3h21AJTH/article/details/78883482

How machine learning will accelerate data management systems

The O’Reilly Data Show Podcast: Tim Kraska on why ML will change how we build core algorithms and data structures.

In this episode of the Data Show, I spoke with Tim Kraska, associate professor of computer science at MIT. To take advantage of big data, we need scalable, fast, and efficient data management systems. Database administrators and users often find themselves tasked with building index structures (“indexes” in database parlance), which are needed to speed up data access.

Some common examples include:

B-Trees—used for range requests (e.g., assemble all sales orders within a certain time frame)
Hash maps—used for key-based lookups
Bloom filters—used to check whether an element or piece of data is present in a set
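
To make the three structures above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the sample orders are made up, a sorted list searched with `bisect` stands in for a B-Tree, a plain `dict` plays the hash map, and the small `BloomFilter` class is a hypothetical toy rather than any particular database's implementation.

```python
import bisect
import hashlib

# Hypothetical sample data: (order_date, order_id) pairs, kept sorted by date.
orders = sorted([
    ("2017-01-05", 101),
    ("2017-02-11", 102),
    ("2017-02-20", 103),
    ("2017-03-02", 104),
    ("2017-04-15", 105),
])
dates = [d for d, _ in orders]


def range_query(start, end):
    """Order ids with start <= date <= end (a sorted list stands in for a B-Tree)."""
    lo = bisect.bisect_left(dates, start)
    hi = bisect.bisect_right(dates, end)
    return [oid for _, oid in orders[lo:hi]]


# Hash map: key-based lookup from order id to order date.
by_id = {oid: d for d, oid in orders}


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for _, oid in orders:
    seen.add(oid)
```

Here `range_query("2017-02-01", "2017-03-31")` answers a time-window request, `by_id[104]` is a key-based lookup, and `seen.might_contain(104)` is a membership check that can occasionally say yes for an id that was never added but never says no for one that was.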

Index structures take up space in a database, so you need to be selective about what to index. Traditional index structures also do not take advantage of the underlying data distributions. I’ve worked in settings where an administrator or expert user carefully implements a strategy for building indexes for a data warehouse based on important and common queries.

Indexes are really models or mappings: for instance, a Bloom filter can be thought of as a classification problem. In a recent paper, Kraska and his collaborators approach indexing as a learning problem. As such, they are able to build indexes that take into account underlying data distributions, are smaller in size (thus allowing for a more liberal indexing strategy), and execute faster. Software and hardware for computation are getting cheaper and better, so using machine learning to create index structures may indeed become routine.
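
One way to picture "an index is a model" is a function that maps a key to its position in a sorted array. The sketch below is a simplified illustration, not the method from the paper: it assumes numeric keys in a non-empty sorted list, fits a single straight line to (key, position) pairs, records the worst prediction error, and then searches only inside that error window. The class name `LinearLearnedIndex` and its methods are hypothetical.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index: a fitted line predicts a position, a bounded search corrects it.

    Assumes a non-empty, sorted list of numeric keys.
    """

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)
        n = len(self.keys)
        # Least-squares fit of position ~ slope * key + intercept.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var_k = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(self.keys))
        self.slope = cov / var_k if var_k else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # The worst-case error over stored keys bounds the search window.
        self.max_err = max(abs(self._predict(k) - p) for p, k in enumerate(self.keys))

    def _predict(self, key):
        # One multiplication and one addition, clamped to a valid position.
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted list, or None if it is absent."""
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, len(self.keys))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None
```

For example, `LinearLearnedIndex([3, 8, 15, 16, 23, 42, 77, 100]).lookup(23)` returns the position of 23 in the list, and a lookup for a key that is not stored returns `None`. The better the line matches the key distribution, the smaller `max_err` becomes and the less searching is left to do, while the "index" itself is just two floats and an integer.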

This ties in with a larger trend of using machine learning to improve software systems and even software development. In the future, we’ll have database administrators who have machine learning tools at their disposal, which would allow them to manage larger and more complex systems, and these ML tools will free them to focus on complex tasks that are harder to automate.

Here are some highlights from our conversation:

Why use machine learning to learn index structures

I think it used to be the case that if you knew the key distribution, you could leverage that, but you needed to build a very specialized system for it. Then, if the data distribution changed, you needed to adjust the whole system. At the same time, any learning mechanism was, in the past, way too expensive.

Things have changed a little bit because compute is becoming much cheaper. Suddenly, using machine learning to train this mapping actually pays off. B-tree structures are composed of a whole bunch of “IF statements,” and in the past, multiplications were very expensive. Now multiplications are getting cheaper and cheaper. Scaling “IF statements” is hard, but scaling math operations is relatively easy. In essence, we can trade these “IF statements” for multiplications, and that's actually why suddenly learning the data distribution pays off.
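
The trade described here can be put in rough numbers. The sketch below is a back-of-the-envelope illustration, not a measurement from the paper: it counts worst-case key comparisons (the "IF statements") for a plain binary search over `n` keys, and for a model-guided lookup that spends one multiply-add on a prediction and then searches a window of `2 * max_err + 1` slots. The function names and the example values of `n` and `max_err` are assumptions chosen for illustration.

```python
import math


def binary_search_steps(n):
    """Worst-case comparisons for binary search over n sorted keys."""
    return math.ceil(math.log2(n + 1))


def model_guided_steps(max_err):
    """Worst-case comparisons after the model's prediction.

    The prediction itself costs one multiply-add; the remaining binary search
    covers a window of 2 * max_err + 1 slots around the predicted position.
    """
    return math.ceil(math.log2(2 * max_err + 2))


if __name__ == "__main__":
    n = 1_000_000   # hypothetical number of keys
    max_err = 100   # hypothetical worst-case prediction error of the model
    print("binary search comparisons:", binary_search_steps(n))
    print("model-guided comparisons: ", model_guided_steps(max_err))
```

Comparison counts are only a proxy for run time, since cache behavior and branch prediction also matter, but they show why a model with a small prediction error leaves far fewer branches to execute.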

... For B-trees, for example, we saw speed-ups of up to roughly 2X. However, the indexes were up to two orders of magnitude smaller.

Why B-Trees are models. Image from Tim Kraska, used with permission.

The future of data management systems

If this machine learning approach really works out, I think this might change the way database systems are built. ... Maybe the database administrator (DBA) of the future becomes a machine learning expert.

... My hope is that the system can figure out what model to use, but maybe if you want to have the best performance and you know your data very well, I can see that maybe the DBA / machine learning expert chooses a certain type of model to tune the index.

… There was this Tweet, essentially saying that machine learning will change how we build core algorithms and data structures. I think this is currently still the better analogy.

Related resources:

Tupleware—redefining modern data analytics: a Strata Data 2014 presentation by Tim Kraska
Artificial intelligence in the software engineering workflow: a 2017 AI Conference keynote by Peter Norvig
"A scalable time-series database that supports SQL"
"Architecting and building end-to-end streaming applications"
Data is only as valuable as the decisions it enables

Ben Lorica

Ben Lorica is the Chief Data Scientist at O'Reilly Media, Inc. and is the Program Director of both the Strata Data Conference and the O'Reilly Artificial Intelligence Conference. He has applied Business Intelligence, Data Mining, Machine Learning and Statistical Analysis in a variety of settings including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services.
