Database Meets AI: A Survey

This article is my partial walkthrough of the survey on integrating AI techniques into databases by Professor Guoliang Li of Tsinghua University, written while I took part in the AI4DB track of the Huawei Innovation Program (mainly because I didn't feel like typing everything out myself).
![figure](https://img-blog.csdnimg.cn/20200812001625486.png)

Abstract

Database and Artificial Intelligence (AI) can benefit from each other. On one hand, AI can make database more intelligent (AI4DB). For example, traditional empirical database optimization techniques (e.g., cost estimation, join order selection, knob tuning, index and view selection) cannot meet the high-performance requirement for large-scale database instances, various applications and diversified users, especially on the cloud. Fortunately, learning-based techniques can alleviate this problem. On the other hand, database techniques can optimize AI models (DB4AI). For example, AI is hard to deploy in real applications, because it requires developers to write complex codes and train complicated models. Database techniques can be used to reduce the complexity of using AI models, accelerate AI algorithms and provide AI capability inside databases. Thus both DB4AI and AI4DB have been extensively studied recently. In this paper, we review existing studies on AI4DB and DB4AI. For AI4DB, we review the techniques on learning-based configuration tuning, optimizer, index/view advisor, and security. For DB4AI, we review AI-oriented declarative language, AI-oriented data governance, training acceleration, and inference acceleration. Finally, we provide research challenges and future directions.
From the abstract above we learn that databases and artificial intelligence can benefit from each other. Why? From the AI-for-database side: traditional, experience-based database optimization can no longer meet the high performance demanded by large-scale database instances, diverse applications and diversified users, especially on the cloud. The traditional optimization techniques are, e.g., cost estimation, join order selection, knob tuning, and index and view selection. Learning-based techniques can now be used to alleviate the shortcomings of these traditional approaches. The other direction is that databases can in turn benefit AI: database techniques can optimize AI models. For example, AI is hard to apply in real applications because it asks a lot of developers, who have to write complex code and train complicated models. Database techniques can reduce the complexity of using AI models, accelerate AI algorithms, and provide AI capability inside the database. Both AI4DB and DB4AI are now being actively studied. For AI4DB, the survey reviews learning-based configuration tuning, optimizers, index/view advisors, and security. For DB4AI, it reviews AI-oriented declarative languages, AI-oriented data governance, training acceleration, and inference acceleration. Finally, it presents research challenges and future directions.

Keywords: Database, Artificial Intelligence, DB4AI, AI4DB

INTRODUCTION

Artificial intelligence (AI) and database (DB) have been
extensively studied over the last five decades. First, database
systems have been widely used in many applications, because databases are easy to use by providing user-friendly
declarative query paradigms and encapsulating complicated query optimization functions. Second, AI has recently
made breakthroughs due to three driving forces: large-scale
data, new algorithms and high computing power. Moreover,
AI and database can benefit from each other [144], [55]. On
one hand, AI can make database more intelligent (AI4DB).
For example, traditional empirical database optimization
techniques (e.g., cost estimation, join order selection, knob
tuning, index and view selection) cannot meet the high-performance requirement for large-scale database instances,
various applications and diversified users, especially on the
cloud. Fortunately, learning-based techniques can alleviate
this problem. For instance, deep learning can improve the
quality of cost estimation and deep reinforcement learning
can be used to tune database knobs. On the other hand,
database techniques can optimize AI models (DB4AI). AI
is hard to deploy in real applications, because it requires
developers to write complex codes and train complicated
models. Database techniques can be used to reduce the
complexity of using AI models, accelerate AI algorithms and
provide AI capability inside databases. Thus both DB4AI
and AI4DB have been extensively studied recently.

AI for DB

Traditional database design is based on empirical methodologies and specifications, and requires human involvement
(e.g., DBAs) to tune and maintain the databases [78], [17]. AI
techniques are used to alleviate these limitations – exploring
more design space than humans and replacing heuristics to
address hard problems. We categorize existing techniques
of using AI to optimize DB as below.
This paragraph says that traditional database design is based on human experience and needs humans to maintain the database, whereas AI techniques can change this: they explore a larger design space and replace heuristics for tackling hard problems. What is a heuristic? A heuristic (from the ancient Greek εὑρίσκω; also rendered as strategy method, discovery aid, or rule of thumb) is a technique for finding a solution to a problem in a short time based on limited knowledge (or "incomplete information"); it is a form of analysis that draws conclusions about a system from limited understanding of and hypotheses about that system. In short, a heuristic is experience-based. Below, AI4DB is categorized as follows:

Learning-based Database Configuration. (1) Knob tuning.
Databases have hundreds of knobs and it requires DBAs
to tune the knobs so as to adapt to different scenarios.
Obviously, DBAs are not scalable to millions of database
instances on cloud databases. Recently the database community attempts to utilize learning-based techniques [3], [77],
[157] to automatically tune the knobs, which can explore
more knob combination space and recommend high-quality
knob values, thus achieving better results than DBAs. (2)
Index/View advisor. Database indexes and views are fairly
crucial to achieve high performance. However, traditional
databases rely highly on DBAs to build and maintain indexes and views. As there are a huge number of column/table combinations, it is expensive to recommend and
build appropriate indexes/views. Recently, there are some
learning-based works that automatically recommend and
maintain the indexes and views. (3) SQL Rewriter. Many
SQL programmers cannot write high-quality SQL queries, which need to be rewritten to improve performance. For example, nested queries can be rewritten into join queries to enable further SQL optimization. Existing methods employ rule-based strategies, which use predefined rules to rewrite SQL queries. However, these rule-based methods rely on high-quality rules and cannot scale to a large number of rules. Thus, deep reinforcement learning can be used to judiciously select the appropriate rules and apply them in a good order.
This paragraph covers learning-based database configuration. First, AI can take over knob tuning from DBAs, exploring a larger space of knob combinations and recommending better knob values, because DBAs cannot manage and optimize the huge number of database instances on the cloud. Second, AI can replace DBAs for index and view selection, since there are far too many column/table combinations to maintain by hand. Third, automatic SQL rewriting: for example, nested queries are rewritten into join queries so that further SQL optimizations become applicable. Today's SQL rewriting relies on predefined rules; these depend on high-quality rules and do not scale as the rule set grows, so deep reinforcement learning can be used to automatically choose which rules to apply and in what order.
Learning-based Database Optimization. (1) Cardinality/Cost Estimation. The database optimizer relies on cost and
cardinality estimation to select an optimized plan, but traditional techniques cannot effectively capture the correlations
between different columns/tables and thus cannot provide
high-quality estimation. Recently deep learning based techniques [64], [110] are proposed to estimate the cost and
cardinality which can achieve better results, by using deep
neural networks to capture the correlations. (2) Join order
selection. A SQL query may have millions, even billions
of possible plans and it is very important to efficiently find a good plan. Traditional database optimizers cannot
find good plans for dozens of tables, because it is rather
expensive to explore the huge plan space. Thus there are
some deep reinforcement learning based methods to automatically select good plans. (3) End-to-end Optimizer. A
full-fledged optimizer not only relies on cost/cardinality
estimation and join order, but also requires to consider
indexes and views, and it is important to design an end-to-end optimizer. Learning-based optimizers [100], [148] use
deep neural networks to optimize SQL queries.
This paragraph is mainly about database optimization. First, cardinality/cost estimation: when choosing a plan, the optimizer must estimate its cost, but traditional techniques cannot effectively capture the correlations between tables and between columns and therefore cannot give high-quality estimates; deep neural networks can be used to capture these correlations. Second, join order selection: a traditional optimizer cannot find good join orders over dozens of tables because exploring the huge plan space is too expensive, so deep reinforcement learning can be used to automatically pick good plans. Third, the end-to-end optimizer: a good optimizer needs not only cost estimation and join ordering but must also take indexes and views into account; learning-based optimizers use deep neural networks to optimize SQL queries end to end.

Learning-based Database Design. Traditional database is
designed by database architects based on their experiences,
but database architects can only explore a limited number of
possible design spaces. Recently some learning-based self-design techniques are proposed. (1) Learned indexes [66] are proposed for not only reducing the index size but also improving the indexing performance. (2) Learned data structure design [53]. Different data structures may be suitable for different environments (e.g., different hardware, different read/write applications) and it is hard to design an appropriate
structure for every scenario. Data structure alchemy [53]
is proposed to create a data inference engine for different
data structures, which can be used to recommend and
design data structures. (3) Learning-based Transaction Management. Traditional transaction management techniques
focus on transaction protocols, e.g., OCC, PCC, MVCC, 2PC.
Recently, some studies try to utilize AI techniques to predict
the transactions and schedule the transactions [92], [127].
They learn from existing data patterns, efficiently predict
future workload trend and effectively schedule them by
balancing the conflict rates and concurrency.
This paragraph is roughly about using AI to design data structures and to manage transactions automatically, handling conflicts and concurrency.
Learning-based Database Monitoring. Database monitoring can capture database runtime metrics, e.g., read/write
latency, CPU/memory usage, and thus can remind administrators when anomalies happen (e.g., performance slowdown and database attacks). However, traditional monitoring methods rely on database administrators to monitor
most database activities and report the problems, which
is incomplete and inefficient. Therefore, machine learning
based techniques are proposed to optimize database monitoring [62], [40], which determine when and how to monitor
which database metrics.
This paragraph is roughly about using AI to automatically monitor the runtime state of the database, including memory usage, I/O activity, and so on.
Learning-based Database Security. Traditional database
security techniques, e.g., data masking and auditing, rely
on user-defined rules, but cannot automatically detect the
unknown security vulnerabilities. Thus AI based algorithms
are proposed to discover sensitive data, detect anomaly [84],
conduct access control [39], and avoid SQL injection [135],
[156]. (1) Sensitive data discovery is to automatically identify sensitive data using machine learning. (2) Anomaly
detection is to monitor database activities and detect vulnerabilities. (3) Access control is to avoid data leak by
automatically estimating different data access actions. (4) SQL
injection is to mine user behavior and identify SQL injection
attacks with deep learning.
This paragraph is about using AI to automatically discover sensitive data and vulnerable points, as well as for access control and protection against SQL injection.

DB for AI

Although AI can address many real-world problems, there
is no widely deployed AI system that can be used in different fields as popular as DBMS, because existing AI systems
have poor replicability and are hard to be used by ordinary
users. To address this problem, database techniques can be
used to lower the barrier of using AI.

Declarative Query Paradigm. SQL is relatively easy to be
used and widely accepted in database systems. However,
SQL lacks some complex processing patterns (e.g., iterative
training) compared with other high-level machine learning
languages. Fortunately, SQL can be extended to support AI
models [121], and we can also design user-friendly tools to
support AI models in SQL statements [33].
This paragraph says that SQL cannot handle AI workloads the way languages like Python do (it lacks patterns such as iterative training), so SQL needs to be extended to support AI models.
Data Governance. Data quality is important for machine
learning, and data governance can improve the data quality,
which includes data discovery, data cleaning, data integration, data labeling, data lineage. (1) Data discovery. Learning
based data discovery enhances the ability of finding relevant
data, which effectively finds out relevant data among a
large number of data sources [34]. (2) Data cleaning. Dirty
or inconsistent data can affect the training performance
terribly. Data cleaning and integration techniques can detect
and repair the dirty data, and integrate the data from
multiple sources to generate high-quality data [147]. (3)
Data labeling. With domain experts, crowdsourcing [76] and
knowledge base [102], we can properly utilize manpower or
existing knowledge to label a large number of training data
for ML algorithms. (4) Data lineage. Data lineage depicts
the relationship between input and output and is important
to ensure ML models working properly. With database
technologies like join and graph mapping, we can trace data
relationships backwardly and forwardly [18], [20].
This paragraph is roughly about data governance for AI: learning-based data discovery to automatically find relevant data among massive data sources; automatic detection and repair of dirty or inconsistent data together with automatic data integration; labeling large amounts of training data with the help of domain experts, crowdsourcing, and knowledge bases; and tracing data lineage backward and forward with database techniques such as joins and graph mappings.

Model Training. Model training aims to train a good model
that will be used for online inference. Model training is a
time consuming and complicated process, and thus it requires optimization techniques, including feature selection,
model selection, model management and hardware acceleration. (1) Feature selection. It aims to select appropriate
features from a large number of possible features. It is
time consuming to select and evaluate the possible features.
Therefore, technologies like batching, materialization [155]
are proposed to address this issue. (2) Model selection. It
aims to select an appropriate model (and parameter values) from a large number of possible models. Thus, some
parallelism techniques are proposed to accelerate this step,
including task parallel [104], bulk synchronous parallel [67],
parameter server [80] and model hop parallelism [107]. (3)
Model management. Since model training is a trial-and-error process that needs to maintain many models and
parameters that have been tried, it is necessary to design
a model management system to track, store and search the
ML models. We review GUI-based [9] and command-based [140] model management systems. (4) Hardware acceleration. Hardware such as GPUs and FPGAs is also utilized to
accelerate the model training. We introduce hardware acceleration techniques in row-store [95] and column-store [63]
databases respectively.
This paragraph is roughly about optimizing model training: feature selection, model selection, model management, and hardware acceleration.
Model Inference. Model inference aims to effectively infer
the results using a trained model, and in-database optimization techniques include operator support, operator selection,
and execution acceleration. (1) Operator support. An ML
model may contain different types of operators (e.g., scalar,
tensor), which have different optimization requirements.

Thus some in-database techniques are proposed to support
AI operators, including scalar operations [49], tensor operations [12], and tensor partitions [49]. (2) Operator selection.
The same ML model can be converted to different physical
operators, which may bring significant performance difference. In-database operator selection can estimate resource
consumption and judiciously schedule the operators [12].
(3) Execution acceleration. Inference acceleration aims to
enhance the execution efficiency. On one hand, in-memory
databases compress sample/model data inside memory and
conduct in-memory optimization [72]. On the other hand,
distributed databases enhance execution efficiency by issuing tasks to different nodes [118] .
This paragraph is about model inference: using a trained model to produce results efficiently, with in-database optimizations covering operator support, operator selection, and execution acceleration.

Contributions

We make the following contributions (see Figure 1).
(1) We review AI4DB techniques that utilize AI techniques to
optimize database, including learning-based configuration
tuning, optimizer, and index/view advisor (see Section 2).
(2) We review DB4AI techniques that utilize DB techniques
to make AI easy-to-use, improve model performance and accelerate AI algorithm, including declarative language, data
governance, model training and inference (see Section 3).
(3) We provide research challenges and future directions,
including hybrid DB and AI data model, co-optimization,
and hybrid DB and AI system (see Section 4).

AI FOR DB (Detailed Discussion)

AI techniques can be utilized to optimize databases from
many aspects, including (1) learning-based database configuration, (2) learning-based database optimization, (3) learning-based database design, (4) learning-based database monitoring and protection, (5) learning-based security. This section reviews existing studies from these aspects.

Learning-based Database Configuration

Learning-based database configuration aims to utilize machine learning techniques to automate database configurations, e.g., knob tuning, index advisor, and view advisor.

Knob Tuning

Databases and big data analytics systems have
hundreds of tunable system knobs (e.g., Work_Mem,
Max_Connections, Active_Statements) [87], which
control many important aspects of databases (e.g., memory
allocation, I/O control, logging). Traditional manual
methods leverage DBAs to manually tune these knobs
based on their experiences but they always spend too
much time to tune the knobs (several days to weeks) and
cannot handle millions of database instances on cloud
databases. To address this issue, self-tuning is proposed to
automatically tune these knobs, which uses ML techniques
to not only achieve higher tuning performance but also less
tuning time (thus saving the human efforts and enabling
online tuning). We can categorize existing knob tuning
techniques into four categories, including search-based
tuning, traditional ML-based tuning, deep learning based
tuning and reinforcement learning based tuning.

Search-based Tuning. To reduce the manpower, Zhu et
al [160] propose a recursive bound-and-search tuning
method BestConfig, which, given a query workload, finds
the similar workloads from historical data and returns the
corresponding knob values. Specifically, given n knobs,
BestConfig divides the value range of each knob into k
intervals and these knob intervals form a discrete space with
k^n subspaces (bounded space). And then, in each iteration,
BestConfig randomly selects k samples from the bounded
space, and selects the sample with the best performance
from the k selected samples, denoted as C1. In the next
iteration, it only gets samples from the bounded space
close to C1. In this way, BestConfig iteratively reduces the
bounded space and finally gets a good knob combination.
However, this search-based method has several limitations.
First, it is heuristic and may not find optimal knob values
in limited time. Second, it cannot achieve high performance,
because it needs to search the whole space.
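
To make this concrete, below is a minimal, hypothetical sketch of the recursive bound-and-search loop; the knob names, ranges, and the synthetic `benchmark` function are illustrative assumptions rather than BestConfig's actual implementation:

```python
import random

# Illustrative knob ranges (names and bounds are assumptions for this sketch).
KNOB_RANGES = {"work_mem_mb": (4.0, 4096.0), "shared_buffers_mb": (16.0, 8192.0)}

def benchmark(cfg):
    """Stand-in for replaying the workload under `cfg`; here a synthetic score
    that peaks at an optimum unknown to the tuner."""
    optimum = {"work_mem_mb": 512.0, "shared_buffers_mb": 2048.0}
    return -sum((cfg[k] - optimum[k]) ** 2 for k in cfg)

def bound_and_search(ranges, evaluate, iterations=6, samples_per_iter=16):
    best_cfg, best_perf = None, float("-inf")
    for _ in range(iterations):
        # Sample k configurations uniformly from the current bounded space.
        for _ in range(samples_per_iter):
            cfg = {k: random.uniform(lo, hi) for k, (lo, hi) in ranges.items()}
            perf = evaluate(cfg)
            if perf > best_perf:
                best_cfg, best_perf = cfg, perf
        # Shrink each knob's range around the best sample so far (the "bound" step).
        ranges = {k: (max(lo, best_cfg[k] - (hi - lo) / 4),
                      min(hi, best_cfg[k] + (hi - lo) / 4))
                  for k, (lo, hi) in ranges.items()}
    return best_cfg, best_perf

print(bound_and_search(KNOB_RANGES, benchmark))
```

Each round keeps only a neighborhood of the best sample, so the search space shrinks geometrically; this is also why the method can miss the global optimum within limited time.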

Traditional ML-based Tuning. To solve the problems in
the search-based tuning method, traditional ML models are
proposed to automatically tune the knobs (e.g., Gaussian
process [3], decision tree [36]). Aken et al propose an ML-based database tuning system OtterTune [3]. OtterTune uses
Gaussian Process (GP) to recommend suitable knobs for
different workloads. First, it selects some query templates,
where each query template contains a query workload
and its corresponding appropriate knob values. Second, it
extracts internal state of the database (e.g., the number
of pages read/written, the utilization of query cache) to
reflect workload characteristics. From internal state features,
OtterTune filters the irrelevant features using factor analysis,
and then uses simple unsupervised learning methods (e.g.,
K-means) to select K features which are most related to the tuning problem. And OtterTune uses these K features
to profile the workload characters. Third, it uses these
selected features to map the current workload to the most
similar template. OtterTune directly recommends the knob
configuration of this template as the optimal configuration.
And it also inputs the query workload into the GP model to
learn a new configuration for updating the model. Formally,
it trains the model as follows. Given a training sample (W, W', C', R), where W is a workload, W' is the workload template most similar to W, C' is the recommended configuration of W', C'' is the configuration recommended by the GP model, and R is the performance difference between C' and C'', it trains the model by minimizing the difference between C' and C''. OtterTune uses these samples to train the model.

This ML-based method has good generalization ability
and can perform well in different database environments.
Moreover, it can make effective use of the experiences
learned from the historical tasks and apply the experiences
to future inference and training. However, it also has some
limitations. Firstly, it adopts a pipeline architecture. The
optimal solution obtained in the current stage is not guaranteed to be optimal in the next stage, and the models
used in different stages may not be generalized well. Secondly, it requires a large number of high-quality samples
for training, which are difficult to obtain. For example,
database performance is affected by many factors (e.g., disk
capacity, CPU status, workload features), and it is difficult
to reproduce similar scenarios due to the huge search space.
Thirdly, it cannot effectively support knob tuning with high-dimensional and continuous space. OtterTune still needs to
filter out most knobs before utilizing the GP. For example, it
only tunes 10 knobs on PostgreSQL.
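
A rough sketch of the two core steps (workload mapping followed by GP-based recommendation) might look as follows; the metric vectors, knob dimensions, and numbers are made up for illustration and greatly simplify OtterTune's real pipeline:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# 1) Map the current workload to the most similar historical template using its
#    internal-metric features (assumed already reduced, e.g., by factor analysis
#    and K-means as described above).
templates = {"tpl_A": np.array([0.2, 0.7, 0.1]), "tpl_B": np.array([0.9, 0.1, 0.4])}
current_metrics = np.array([0.8, 0.2, 0.5])
matched = min(templates, key=lambda t: np.linalg.norm(templates[t] - current_metrics))

# 2) Fit a GP on (knob setting -> performance) samples observed for that template
#    and recommend the candidate with the best predicted performance plus a crude
#    uncertainty bonus, standing in for OtterTune's exploration strategy.
X_hist = np.array([[0.1, 0.2], [0.5, 0.5], [0.9, 0.3], [0.4, 0.8]])  # normalized knobs
y_hist = np.array([120.0, 340.0, 210.0, 390.0])                      # e.g., throughput
gp = GaussianProcessRegressor().fit(X_hist, y_hist)

candidates = np.random.rand(1000, 2)
mean, std = gp.predict(candidates, return_std=True)
recommended = candidates[np.argmax(mean + std)]
print(matched, recommended)
```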

Reinforcement Learning for Tuning. Reinforcement learning (RL) improves the generalization ability through continuous interactions with the environment (e.g., database
state and workload). Zhang et al [157] propose a DRL-based
database tuning system CDBTune. The main challenge of
using DRL in knob tuning is to design the five modules
in DRL. CDBTune maps the database tuning problem into
the five modules in the reinforcement learning framework
as follows. It takes the cloud database instance as the
Environment, internal metrics of the instance as the State,
the tuning model as the Agent, the knob tuning as the
Action, and the performance change after tuning as the
Reward. The Agent adopts a neural network (the Actor)
as the tuning strategy which takes as input the metrics
from State and outputs the knob values. Besides Agent
adopts another neural network (the Critic) for tuning the
Actor, which takes as input the knob values and internal
metrics and outputs the Reward. In a tuning procedure, the
Agent inputs the state of the database instance. Based on
the state, Agent outputs a tuning action and applies it on
the database instance. And then, it executes the workload
on the database, gets the performance changes, and uses
the change as the reward to update the Critic in the
Agent. Next the Critic updates Actor, which captures
the relationships between knob values and internal metrics.

Unlike traditional supervised learning, the training of
reinforcement learning model does not require extensive
high-quality data. Through the trial-and-error process, the
learner in the DRL model repeatedly generates tuples (s_t, r_t, a_t, s_{t+1}), where s_t is the database state at time t, r_t is the reward at time t, and a_t is the action at time t, and uses the
action to optimize the tuning strategy. With exploration
and exploitation mechanisms, reinforcement learning can
make a trade-off between exploring unexplored space and
exploiting existing knowledge. Given an online workload,
CDBTune runs the workload and obtains the metrics from
the databases. Then CDBTune uses the Actor model to
recommend the knob values.
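
The following is a highly simplified actor-critic sketch of this mapping (a full DDPG-style implementation additionally keeps target networks, a replay buffer, and exploration noise); the dimensions and the `run_workload` callback are assumptions for illustration only:

```python
import torch
import torch.nn as nn

STATE_DIM, KNOB_DIM = 63, 16   # illustrative sizes: internal metrics -> knob values

# Actor: internal database metrics (State) -> normalized knob values (Action).
actor = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                      nn.Linear(128, KNOB_DIM), nn.Sigmoid())
# Critic: (State, Action) -> predicted performance change (Reward).
critic = nn.Sequential(nn.Linear(STATE_DIM + KNOB_DIM, 128), nn.ReLU(),
                       nn.Linear(128, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def tune_step(state, run_workload):
    """One interaction: recommend knobs, observe the reward, update critic then actor."""
    action = actor(state)
    reward = run_workload(action.detach())   # performance change after applying the knobs

    # The critic learns to predict the observed reward for (state, action).
    q = critic(torch.cat([state, action.detach()], dim=-1))
    critic_loss = nn.functional.mse_loss(q, reward)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # The actor is updated to output knob values that the critic scores highly.
    actor_loss = -critic(torch.cat([state, actor(state)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Usage sketch with placeholder data:
state = torch.rand(1, STATE_DIM)
tune_step(state, lambda knobs: torch.tensor([[1.0]]))
```
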
However, there are still some limitations in CDBTune.
First, CDBTune can only provide a coarse-grained tuning,
e.g., tuning knobs for a workload, but cannot provide a
fine-grained tuning, e.g., tuning knobs for some specific
queries. Second, previous DRL-based tuning works directly
use existing models (e.g., Q-learning [146], DDPG [83]);
however Q-learning cannot support input/output with continuous values, and in DDPG, the agent tunes database only
based on database state, without considering the workload
features. Thus Li et al. [77] propose a query-aware tuning
system QTune to solve the above problems. First, QTune
uses a double-state deep reinforcement learning model (DS-DRL), which can embed workload characteristics and consider the effects of both the action (for inner state) and
the workload (for outer metrics). Therefore, it can provide
finer-grained tuning and can support online tuning even
if workload changes. Second, QTune clusters the queries
based on their “best” knob values and supports cluster-level
tuning, which recommends the same knob values for the
queries in the same cluster and different knob values for
queries in different clusters.

In conclusion, DRL-based methods can greatly improve
the tuning performance compared with traditional methods.
Firstly, DRL does not need a lot of training data to train the
model because it can generate training samples by iteratively running a workload under different database states.
Secondly, it combines advanced learning methods (e.g.,
Markov Decision Process, Bellman function and gradient
descent), and thus can adapt to database changes efficiently.

Deep Learning for Buffer Size Tuning. Above methods
focus on tuning general knobs. There are also some methods
that tune specific knobs to optimize database performance.
Tan et al. [133] propose iBTune, which only tunes buffer pool
size for individual database instances. It uses deep learning
to decide the time to tune databases to ensure minimum
negative effects on query latency and throughput. First, it
collects samples of database state metrics (e.g., miss ratio),
tuning action, and performance from the history records.
Second, it uses these samples to train a pairwise neural
network to predict the upper bounds of the latency. The
input of the network includes two parts: current database
instance and target database instance (after tuning buffer
pool size), each of which includes the performance metrics
(e.g., CPU usage, read I/O), and the encoding of current
time (e.g., 21:59). And the output of the network is the
predicted response time. If the predicted time is too high,
iBTune will tune the memory size. iBTune achieves a balance
between tuning performance and tuning frequency. Kunjir
et al [73] propose a multi-level tuning method RelM for
memory allocation. Different from iBTune, RelM first uses Guided Bayesian Optimization to compute the performance
metric values (e.g. shuffle memory usage) based on the application types (e.g., SQL, shuffling). RelM uses these metrics
to select knobs and uses DDPG to tune these selected knobs.
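
A toy version of iBTune's safety check, predicting latency from current- and target-instance features and applying the buffer-pool change only if the prediction stays within the SLA, could look like this; the feature layout, synthetic data, and the simple regressor are placeholders for the pairwise neural network described above:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical training data: each row is [current-instance metrics || target-instance
# metrics || hour-of-day], and the label is the observed response time in ms.
X = np.random.rand(500, 2 * 8 + 1)
y = np.random.rand(500) * 10.0
latency_model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, y)

def safe_to_shrink(current_metrics, target_metrics, hour, sla_ms):
    """Shrink the buffer pool only if the predicted latency stays within the SLA."""
    features = np.concatenate([current_metrics, target_metrics, [hour / 24.0]])
    predicted_latency = latency_model.predict(features.reshape(1, -1))[0]
    return predicted_latency <= sla_ms

print(safe_to_shrink(np.random.rand(8), np.random.rand(8), hour=21, sla_ms=8.0))
```
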
ML-based Tuning for Other Systems. Besides database tuning, there are many tuning methods for other data systems
like Hadoop [50], [61] and Spark [108], [41], [38], [141]. Big
data systems also have many knobs (e.g., >190 in Hadoop, >200 in Spark), which can have a significant impact on
the performance [87]. However, tuning for big data systems
is different from database tuning. (1) Big data systems like
Spark have more diverse workloads (e.g., machine learning
tasks on MLlib, real-time analytics on Spark streaming,
graph processing tasks on GraphX), and they need to tune
the systems for different tasks and environments. (2) Existing database tuning methods are mainly on single nodes,
while most big data systems adopt a distributed architecture. (3) The performance metrics are different – big data
systems focus on resource consumption and load balance,
while database tuning optimizes throughput and latency.

Herodotou et al [50] propose a self-tuning system Starfish
for Hadoop. For job-level tuning, Starfish captures online
features while running the job (e.g., job, data, and cluster characteristics) and tunes the parameters based on estimated resource consumption (e.g., time, CPU, memory).
For workflow-level tuning, a workflow can be distributed
to different cluster nodes. So Starfish learns policies to
schedule data across nodes based on both the local features
and distributed system features. For workload-level tuning,
Starfish mainly tunes system parameters like node numbers,
node configurations according to the access patterns of those
workflows and the network configurations. To tune these
parameters, it simulates the actual working scenarios, and
estimates the database performance after tuning based on
the execution cost predicted by the database optimizer.

For Spark tuning, Gu et al [108] propose a machine
learning based model to tune configurations of applications
running on Spark. First, it uses N neural networks, each of
which takes as input the default values of parameters and
outputs a recommended parameter combination. Second, it
uses a random forest model to estimate the performance of
the N recommended combinations and chooses the optimal
one to actually tune the configuration. This method is extremely useful for distributed cluster, where each network
can be tuned based on the state of each node.
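
A compact sketch of this generate-then-select pattern is shown below; random perturbations of the defaults stand in for the N recommendation networks, the random forest plays the role of the performance estimator, and all data is synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

N_PARAMS = 6  # illustrative number of Spark knobs, normalized to [0, 1]

# Hypothetical history of (configuration -> runtime in seconds) pairs from the cluster.
X_hist = np.random.rand(300, N_PARAMS)
y_hist = np.random.rand(300) * 100.0
perf_model = RandomForestRegressor(n_estimators=100).fit(X_hist, y_hist)

# Stand-in for the N recommendation networks: each proposes one candidate
# configuration starting from the default values.
defaults = np.full(N_PARAMS, 0.5)
candidates = np.clip(defaults + np.random.normal(0, 0.2, (8, N_PARAMS)), 0, 1)

# Estimate the runtime of every candidate and pick the fastest one to apply.
best = candidates[np.argmin(perf_model.predict(candidates))]
print("configuration to apply:", best)
```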

Index Selection

In DBMS, indexes are vital to speed up query execution,
and it is very important to select appropriate indexes to
achieve high performance. We first define index selection
problem. Considering a set of tables, let C denote the set of
columns in these tables and size(c ∈ C) denote the index
size of a column c ∈ C. Given a query workload Q, let
benefit(q ∈ Q, c ∈ C) denote the benefit of building an index on column c for query q, i.e., the reduction in cost of executing query q with the index on column c compared with executing it without the index. Given a space budget B, the index selection problem aims to find a subset C' of columns on which to build indexes so as to maximize the total benefit while keeping the total index size within the budget, i.e., maximize Σ_{q∈Q} Σ_{c∈C'} benefit(q, c) subject to Σ_{c∈C'} size(c) ≤ B.
There are several challenges. The first is how to estimate benefit(q, c), and a well-known method is to use a hypothetical index, which adds the index information to
the data dictionary of DBMS rather than creating the actual
index, so that the query optimizer in DBMS can recognize
the existence of the index and estimate the cost of query
execution without building the real index. The estimated
benefit of an index is the decrease of the estimated cost of
query execution with and without the hypothetical index.
The second is to solve the above optimization problem.
There are two main solutions to address these challenges
– offline index selection and online index selection. Offline index selection methods require DBAs to provide a
representative workload and select an index scheme by
analyzing this workload. Online index selection methods
monitor the DBMS and select index scheme according to the
change of workloads. The main difference between offline
methods and online methods is that offline methods only
select and materialize an index plan, while online methods
dynamically create or delete some indexes according to the
change of workloads.
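
To illustrate the hypothetical-index trick for estimating benefit(q, c), here is a small sketch assuming a PostgreSQL instance with the HypoPG extension installed; the connection, table, and index statements are placeholders, and other DBMSs expose similar "what-if" interfaces under different names:

```python
import json
import psycopg2  # assumed client library; any PostgreSQL driver works similarly

def planner_cost(cur, query):
    """The planner's estimated total cost of `query` (nothing is executed)."""
    cur.execute("EXPLAIN (FORMAT JSON) " + query)
    plan = cur.fetchone()[0]
    if isinstance(plan, str):          # some drivers return the JSON plan as text
        plan = json.loads(plan)
    return plan[0]["Plan"]["Total Cost"]

def index_benefit(conn, query, create_index_stmt):
    """benefit(q, c): estimated cost without the index minus the cost with a
    hypothetical index that exists only in the planner's view of the catalog."""
    with conn.cursor() as cur:
        cost_without = planner_cost(cur, query)
        cur.execute("SELECT * FROM hypopg_create_index(%s)", (create_index_stmt,))
        cost_with = planner_cost(cur, query)
        cur.execute("SELECT hypopg_reset()")   # drop all hypothetical indexes
    return cost_without - cost_with

# Example (assuming a reachable database with HypoPG installed):
# conn = psycopg2.connect("dbname=shop")
# print(index_benefit(conn, "SELECT * FROM orders WHERE customer_id = 42",
#                     "CREATE INDEX ON orders(customer_id)"))
```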

Offline Index Selection. It relies on DBAs to choose some
frequent queries from the query log as the representative
workload, and uses the workload to select indexes. Chaudhuri et al propose an index selection tool AutoAdmin [15]
for Microsoft SQL server. The main idea is to select well-performed index schemes for each query and then extend them to the whole workload. For each query, it has a corresponding
optimal indexing strategy, and then for all queries in Q,
AutoAdmin selects top-k schemes based on the benefit. Then
for each top-k scheme, AutoAdmin uses a greedy algorithm
to incrementally add indexable columns until the size is
equal to a threshold B. Finally, the scheme with the highest
benefit and within the storage budget will be selected. Zilio
et al [139] model the index selection problem as a knapsack
problem and propose DB2 Advisor. Similar to AutoAdmin,
DB2 Advisor first enumerates index schemes with their
benefit. Then, it models the index selection problem as a
knapsack problem. More specifically, it regards each candidate index scheme as an item, the size of the scheme as the
item weight and the benefit of the scheme as the value. Then
DB2 uses dynamic programming to solve this problem.
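
The knapsack view of index selection can be written down directly; the sketch below applies textbook 0/1 knapsack dynamic programming to made-up candidate schemes, mirroring the formulation attributed to DB2 Advisor above rather than its actual implementation:

```python
def select_index_schemes(candidates, budget):
    """0/1 knapsack over candidate index schemes.
    candidates: list of (name, size, benefit) with integer sizes (e.g., in MB)."""
    # dp[b] = (best total benefit within size budget b, chosen scheme names)
    dp = [(0.0, [])] * (budget + 1)
    for name, size, benefit in candidates:
        for b in range(budget, size - 1, -1):
            new_benefit = dp[b - size][0] + benefit
            if new_benefit > dp[b][0]:
                dp[b] = (new_benefit, dp[b - size][1] + [name])
    return dp[budget]

# Illustrative candidates: (scheme, size in MB, estimated benefit).
schemes = [("idx_orders_date", 120, 35.0),
           ("idx_lineitem_part", 300, 80.0),
           ("idx_customer_region", 80, 20.0)]
print(select_index_schemes(schemes, budget=400))
```
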
The downside of offline methods is that they are not
flexible enough to handle the dynamic changes of workloads. Even worse, selecting a representative workload will
increase the burden of the DBAs. To address these problems,
some online methods are proposed as follows.

Online Index Selection. There are many online index
selection methods, which can be divided into three categories: traditional online index selection methods, semi-automatic index selection methods and ML-based index selection methods. Traditional online index selection methods
continuously analyze the workload and update the index
scheme on-the-fly according to the change of workloads.
Luhring et al. present a soft index autonomous management
method based on the “observation-prediction-reaction” cycle [89]. First, it extracts indexable columns from queries
and enumerates candidate index schemes. Then, it uses
a greedy policy to select schemes that have the highest
estimated benefit and adds them into the final result. Finally,
the selected index scheme will be materialized when the
workload of DBMS is not heavy. Schnaitter et al. propose COLT [123], which supports automatic online implementation of new indexes based on the current index scheme. It
models the index selection problem as a knapsack problem
as we described in offline index selection (DB2) and applies
dynamic programming to get an index scheme. Once the
final index scheme is derived, it will be immediately materialized. However, traditional online methods do not take
the experiences from DBAs into account. Also, continuous
change of the index scheme may affect the stability of the
DBMS and result in a high overhead.

Schnaitter proposed a semi-automatic index selection
algorithm WFIT [124]. WFIT also works online and takes
the feedback from DBAs into consideration. It monitors
the DBMS in real time, dynamically analyzes the workload
and enumerates some candidate schemes to tune the index
structure. But before implementing the index schemes, WFIT
needs DBAs to judge whether a column should be indexed
or not. Then in the subsequent index selection process, WFIT
will eliminate the column which should not be indexed from
the index schemes according to the DBAs' experience. Similarly,
it can also add the column that should be indexed into
the index schemes. Compared with traditional methods,
WFIT does not need to select a representative workload and
thus reduces the DBAs’ burden. Though semi-automatic
index selection methods take the experience from DBAs
into consideration, this experience may not always be useful. ML-based index selection methods can automatically learn experience rather than relying on DBAs' feedback and apply it to validate whether an index is useful.

ML-based index selection methods automatically learn
experience from historical data rather than DBAs’ feedback. Pedrozo et al. propose an index selection method
ITLCS [113] based on a learning classifier system (LCS)
as well as a genetic algorithm. First, ITLCS uses LCS to
generate indexing rules in column level. Each rule consists
of two parts: (i) index-related information from DBAs, e.g.,
“the percentage of null tuples in the column”, “the data type in the column”; (ii) an action denotes whether to create or
delete an index. Second, ITLCS uses a genetic algorithm to
eliminate LCS rules and generates composited rules as the
final indexing strategy. However, it is hard to generate the
rules. Sadri et al [120] propose a reinforcement-learning-based index selection method. First, without expert rules,
they denote workload features as the arrival rate of queries,
column features as the access frequency and selectivity of
each column. Second, they use the Markov Decision Process
model (MDP) to learn from the features of queries and columns, and output a set of actions, which denote creating or dropping an index.

View Advisor

View materialization is rather important in DBMS that
utilizes views to improve the query performance based on
the space-for-time trade-off principle. Judiciously selecting
materialized views can significantly improve the query
performance within an acceptable overhead. However, it
is hard to automatically generate materialized views for
ordinary users. Existing methods rely on DBAs to generate and maintain materialized views. Unfortunately, even
DBAs cannot handle large-scale databases, especially cloud
databases that have millions of database instances and support millions of users. Thus, it calls for the view advisor,
which automatically identifies the appropriate views for a
given query workload.

View advisor has been studied for many years and modern DBMSs such as Oracle DB and IBM DB2 provide tools to
support materialized views [22], [161], [162]. There are two
main tasks in view advisor: (1) Candidate view generation.
View advisor should discover candidate views from the
history workloads. (2) View selection. Since materializing
all candidate views is impossible due to the system resource
constraints, view advisor selects a subset of candidates as
materialized views that can improve the performance most.

Candidate View Generation. Since the number of all possible views grows exponentially, the goal of view generation is
to generate a set of high-quality candidate views that can be
utilized for future queries. There are mainly two methods.
(1) Identify equivalent sub-queries that appear frequently in the workload. Dokeroglu et al. [25] propose a
heuristic method which includes Branch-and-Bound, Genetic, Hill-Climbing, and Hybrid Genetic-Hill-Climbing algorithms to find a near-optimal multi-query execution
plan in which queries share common computation tasks.
They decompose queries into sub-queries which contain
selection, projection, join, sorting, and data shipping. Next,
to make these sub-queries have more chances to share
common tasks, they generate alternative query plans for
each query. Their plan generator interacts with a cost model that estimates the query cost by considering the potential
reuse ability of its sub-queries. Then they detect common
tasks among these alternative plans and search for a global
plan with lower cost where queries share common tasks.
(2) Rewrite sub-queries to make a view answer more
queries. Sub-queries directly extracted from the queries are
still not general enough, because they cannot be generalized
to other queries even with minor difference. To alleviate
this problem, Zilio et al. [162] generalize extracted views
by merging similar views, changing the select condition,
adding “group by” clauses. This reduces the resource consumption and more queries can utilize the views.

View Selection. Candidate views cannot all be materialized
due to the system resource restriction. Therefore, the view
advisor should choose a subset of views that lead to good
performance to materialize and maintain them dynamically.
The view selection problem aims to select a subset of
candidates as materialized views. Assume that there is a disk space budget; for each candidate view, two factors matter: (1) how much disk space the view occupies, and (2) the utility of
the view, which is the estimated reduction of query cost
when using this view. After temporarily deciding the view selection state, the second step is to solve the remaining
smaller ILP problem by an ILP solver. In the second step,
given each query, the algorithm will decide which view
to be used to answer the query, and estimate the utility
of each selected view. Then the utility will be fed back to
the next iteration to select the candidate views until the
selected views as well as their corresponding queries do
not vary or the number of iterations reaches a threshold.
The time complexity of solving the smaller ILP problems is
more acceptable and the solution is near-optimal.

However, the above method is an offline method and
takes hours to analyze a workload. For supporting online workloads, they propose an online computation reuse
framework, CLOUDVIEWS [60]. The workflow is similar
to [59]. It also first generates candidate views, and then
given a query, studies how to select an appropriate view
for the query. CLOUDVIEWS decomposes the query plan
graph into multiple sub-graphs and chooses high-frequent
sub-graphs as candidate views. Then it estimates the utility
and cost of the candidate views online by collecting and
analyzing the information from the previously executed
workload, such as compile-time and run-time statistics.
Given the candidate views and their estimated utility and
cost, they convert the view selection problem to a packing
problem, where the utility is regarded as the value of a
view, the cost is regarded as the weight of a view, and the
disk space constraint is regarded as the capacity. To make
the view selection problem efficient, they also propose additional speed-up techniques.
To address these limitations, Yuan et al [154] propose to
use reinforcement learning (RL) to address view selection
and maintenance problems. First, they use a wide-deep model
to estimate the benefit of using a materialized view to
answer a query. Second, they model the view selection
as an Integer Linear Programming (ILP) problem and use
reinforcement learning to find the best policy.

Some materialized views may need to be evicted when
a new one is ready to be created but there is not enough
disk budget. For view eviction, a credit-based model is
proposed [82] that evicts materialized views with the lowest
credit when the storage limit is reached. The credit of a view
is the sum of its future utility and recreation cost. Higher
credit means that we will sacrifice more utility if we evict
the view but cost more when we keep it. It is similar to
the Reward function in DQM and can be calculated in a similar way. However, because they train the RL model with
real runtime of workloads, it has the assumption that the
database environment will not change. This results in an
expensive cost of model training in the cold-start step.
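
A minimal sketch of such a credit-based eviction policy; the view names, sizes, and credit components are illustrative:

```python
def evict_views(materialized, storage_limit):
    """Evict the lowest-credit materialized views until the storage limit is met.
    materialized: dict name -> dict(size=..., future_utility=..., recreation_cost=...)."""
    total = sum(v["size"] for v in materialized.values())
    # credit = future utility + recreation cost; a low credit means the view is
    # cheap to give up and cheap to rebuild if it is needed again.
    by_credit = sorted(materialized.items(),
                       key=lambda kv: kv[1]["future_utility"] + kv[1]["recreation_cost"])
    evicted = []
    for name, info in by_credit:
        if total <= storage_limit:
            break
        total -= info["size"]
        evicted.append(name)
    return evicted

views = {"v_daily_sales": dict(size=200, future_utility=50, recreation_cost=30),
         "v_top_users":   dict(size=120, future_utility=10, recreation_cost=5),
         "v_join_cache":  dict(size=300, future_utility=80, recreation_cost=60)}
print(evict_views(views, storage_limit=400))
```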

SQL Rewrite

Many database users, especially cloud users, may not write
high-quality SQL queries, and SQL rewriter aims to transform SQL queries to the equivalent forms (e.g., pushing down the filters, transforming nested queries to join
queries), which can be efficiently executed in databases [97].
Most of existing SQL rewrite methods adopt rule-based
techniques [97], [6], [16], which, given a set of query
rewrite rules, find the rules that can apply to a query and
use the rule to rewrite the query. However, it is costly
to evaluate various combinations of rewriting operations,
and traditional methods often fall into sub-optimal rewrites.
Besides, rewriting rules are highly related to applications,
and it is hard to efficiently identify rules on new scenarios.
Hence, machine learning can be used to optimize SQL rewrite
from two aspects. (1) Rule selection: since there are many
rewriting rules, a reinforcement model can be used to make
the rewrite decision. At each step, the agent estimates the
execution cost of different rewrite methods and selects the
method with the lowest cost. The model iteratively generates a rewrite solution and updates its decision strategy
according to the results. (2) Rule generation: according to
the sets of rewrite rules of different scenarios, an LSTM model
is used to learn the correlations between queries, compilers,
hardware features and the corresponding rules. Then for a
new scenario, the LSTM model captures features inside gate
cells and predicts proper rewrite rules.
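
As a sketch of the rule-selection idea in (1), the snippet below replaces the reinforcement-learning agent's value estimates with a plain cost-estimator callback and greedily applies the cheapest applicable rule until no rule improves the query; the rule set and cost function are assumptions supplied by the caller:

```python
def rewrite_with_learned_policy(query, rules, estimate_cost, max_steps=10):
    """Greedy stand-in for the RL agent: at each step apply the rewrite rule whose
    result has the lowest estimated execution cost, and stop when no rule helps.
    `rules` is a list of (name, is_applicable, apply) tuples of callables."""
    current, current_cost = query, estimate_cost(query)
    for _ in range(max_steps):
        candidates = [(name, apply_fn(current))
                      for name, is_applicable, apply_fn in rules
                      if is_applicable(current)]
        if not candidates:
            break
        name, best = min(candidates, key=lambda c: estimate_cost(c[1]))
        best_cost = estimate_cost(best)
        if best_cost >= current_cost:
            break                      # no rule improves the query any further
        current, current_cost = best, best_cost
    return current
```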

Learning-based Database Optimization

Learning-based database optimization aims to utilize the
machine learning techniques to address the hard problems
in database optimization, e.g., cost/cardinality estimation,
join order selection and end-to-end optimizer.

Cardinality and Cost Estimation

Cardinality estimation is one of the most challenging problems in databases; it is often called the ‘Achilles heel’ of
modern query optimizers [75] and has been studied for
decades. Traditional cardinality estimation methods can
be divided into three categories: data sketching [51],
histogram [88], [114] and sampling [85], [105]. However,
they have the following drawbacks. Firstly, data sketching
and histogram-based methods can only handle the data
distribution of one column, while for multi-column tables
or even multi-table joins, they will produce large errors due
to column correlations. Secondly, although sampling-based
methods can capture the correlations of multi-columns and
multi-tables by using indexes, they cannot work well for
sparse or complex queries due to the 0-tuple problem.

Cost estimation predicts the resource usage of a physical execution plan for a query, including I/O usage and
CPU usage. Traditional cost estimation methods utilize a
cost model built upon the estimated cardinality to incorporate the physical operators. Compared with the estimated
cardinality, the estimated cost provides more direct running
overhead for guiding the plan selection.

Recently, database researchers propose to use deep learning techniques to estimate the cardinality and cost. The
learning-based cardinality estimates can be classified into
supervised methods and unsupervised methods.

Supervised Methods.
We can further divide the supervised
methods into the following categories according to the
model they adopt.
(1) Mixture Model. Park et al [112] propose a query-based method with mixture models. The method uses a
mixture model to fit observed predicates and selectivities of
them by minimizing the difference between the model and
observed distribution. It can avoid the overhead of multi-dimensional histograms. However, these methods do not
support LIKE, EXISTS, and ANY keywords.
(2) Fully Connected Neural Network. Ortiz et al [111]
build a cardinality estimation model with fully connected
neural network and take an encoded query as input features. They conduct experiments to test trade-offs between
model size and estimated errors. Dutt et al [28] propose a regression model on multi-dimensional numerical range predicates for selectivity estimation. The regression model can
give an estimated selectivity with small space and time overhead, and it can learn the correlations of multiple columns,
which outperform methods with attribute value independence assumption. However, this method is hard to capture
the join correlations. Wu et al [148] propose a learning-based
method for workloads in shared clouds named CardLearner,
which extracts overlapped subqueries from workloads, and
classifies them according to the structure. Each category of
subqueries is a template, and CardLearner trains a cardinality
estimation model for each template. In this way, CardLearner
can replace traditional cardinality estimates for subqueries
with shared templates. However, the effectiveness of this
method is constrained by a workload.
(3) Convolutional Neural Network (CNN). Kipf et
al [64] propose a multi-set CNN to learn the cardinality of
joins. The model divides a query into three parts, selected
tables, join conditions and filter predicates. Each part is
represented by a convolutional network and concatenated
after average pooling. This is the first method to represent
whole queries and learn the cardinality in an end-to-end
manner, and it can give an accurate cardinality with low
evaluation overhead. However, this method is hard to be
used in a plan-based query optimization scenario directly,
because the representation of children nodes cannot be
used by the parent in a query plan. Marcus et al [100]
propose an end-to-end query optimizer named Neo. As an
important ingredient of the reinforcement learning model,
they propose a neural network which contains both query
encoding and partial plan encoding to evaluate the best
performance of the current query. For plan encoding, they
use a CNN to aggregate joins layer by layer. For predicates
encoding, they utilize the row vectors which are trained by
word2vec models, and encode each predicate by leveraging
the average value of selected rows. However, row vectors
representation is time-consuming for online encoding.
(4) Recurrent Neural Network. Ortiz et al [111] propose
an RNN-based model for estimating the cardinality of left-deep
plan. At each iteration, one node will be added into the plan
tree, and the node sequence is the input of the RNN model. Marcus et al [99] propose an RNN-based cost model built on
estimated cardinality and other statistics parameters within
traditional RDBMS. The model encodes the query plans into
a tree-structured deep neural network. However, it doesn’t
encode the predicates and relies on cardinality estimated
by RDBMS, and the inaccurate estimated cardinality can
bring large errors to cost estimations. Sun et al [131] propose
an end-to-end learning-based cost estimator with a tree-structured LSTM. It learns a representation for each sub-plan with its physical operator and predicates, and outputs
the estimated cardinality and cost simultaneously by using
another estimation layer. Moreover, the method encodes
keywords in the predicates by using word embedding.
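
To give a feel for the supervised approach, here is a deliberately simplified learned estimator: query features are flattened into one vector and fed to a small fully connected regressor trained on log-scale cardinalities. The real multi-set CNNs and tree-structured LSTMs above encode tables, joins, and predicates separately; the feature layout and data here are synthetic assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical featurization: one-hot tables, one-hot join predicates, and for each
# numeric filter a (column one-hot, operator one-hot, normalized literal) triple,
# flattened into a single 48-dimensional vector.
X_train = np.random.rand(2000, 48)        # encoded training queries
y_train = np.random.rand(2000) * 12.0     # log of the true cardinalities

model = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=300).fit(X_train, y_train)

def estimate_cardinality(encoded_query):
    # Train and predict in log space so large and small results are weighted evenly.
    return float(np.exp(model.predict(encoded_query.reshape(1, -1))[0]))

print(estimate_cardinality(np.random.rand(48)))
```
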
Unsupervised Methods. There are also studies [48], [45],
[152] on fitting underlying distributions of datasets by
using unsupervised density models, but they are hard to
support complex queries like multi-joins. Heimel et al [48]
propose a Kernel Density Estimator (KDE) based selectivity
estimator which is lightweight to construct and maintain
with database changes, and it optimizes the Kernel Density
Estimator model numerically for better estimation quality
by selecting the optimal bandwidth parameters. KDE is fast
to fit the underlying data distribution, is easy to construct and
maintain, and is robust for data correlations. Hasan et al [45]
and Yang et al [152] utilize autoregressive densities model
to represent the joint data distribution among columns.
The model returns a list of conditional densities present
in the chain rule with an input tuple. To support range
predicates, the model adopts progressive sampling to select
the meaningful samples by leveraging the learned densities
model, and works even with skewed data. However, it
cannot support high-dimensional data or multi-table joins.
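
A one-dimensional illustration of KDE-based selectivity estimation is shown below; the estimators in [48] are multi-dimensional and tune their bandwidths numerically, so this only conveys the core idea:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Fit a KDE on a sample of one column (synthetic data standing in for a table sample).
column_sample = np.random.normal(loc=50.0, scale=10.0, size=5000)
kde = gaussian_kde(column_sample)

def selectivity(lo, hi):
    """Estimated fraction of rows with lo <= value <= hi (a range predicate)."""
    return kde.integrate_box_1d(lo, hi)

print(selectivity(40.0, 60.0))   # roughly the mass within one standard deviation
```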

The above methods have achieved great improvements, but they only support simple/fixed queries, and queries with the DISTINCT keyword may require different techniques in the query planner (e.g., set-theoretic cardinality). Hence, Hayek et al [46] use two methods to handle general queries. First, they use a deep learning scheme to predict the unique rate R of the query results with duplicate rows, where the query is denoted as a collection of (attributes, tables, joins, predicates). Second, they extend existing cardinality methods by multiplying the unique rate with the estimated cardinality of the results including duplicates.

Join Order Selection

Join order selection plays a very important role in database
systems which has been studied for many years [42]. Traditional methods typically search the solution space of all
possible join orders with some pruning techniques, based
on cardinality estimation and cost models. For example, the dynamic programming (DP) based algorithms [58] often
select the best plan but have high computational overhead.
Also, the generated plan by DP algorithms may have large
costs due to the wrong cost estimation. Heuristic methods,
GEQO [10], QuickPick-1000 [142], and GOO [32], may generate plans more quickly, but may not produce good plans.

To address these problems, machine-learning-based
methods are proposed to improve the performance of join
order selection in recent years, which utilize machine learning to learn from previous examples and overcome the
bias caused by the inaccurate estimation. Moreover, the
learning-based methods can efficiently select a better plan
in a shorter time. We can categorize existing methods into
offline-learning methods and online learning methods.

Offline-Learning Methods. Some studies [130], [69], [98],
[153] learn from the previous queries to improve the performance of future queries. Stillger et al [130] propose a
learned optimizer LEO, which utilizes the feedback of query
execution to improve the cost model of query optimizer. It
uses a two-layer approach to guide the plan search. One
layer represents statistical information from the database and the
other is the system catalog which is analyzed from past
executions. When a query comes, LEO uses a combination of
statistics and the system catalog to estimate the cardinality and cost, and generates a plan. After a query is executed, LEO compares the accurate cost with the estimated cost of this query's plan and then updates the system catalog.

DQ [69] and ReJoin [98] are proposed to use neural
network and reinforcement learning to optimize the join
orders, inspired by LEO’s feedback-based methods. Both
DQ and ReJoin use the cost of previous plans as training
data to train the neural network so as to evaluate each
join order. DQ uses a one-hot vector G to represent each
join state. Each cell in a vector indicates the existence of
each table in the join tree and the operation selection. Then
it uses a multi-layer perceptron (MLP) with G as input
to measure each join state and uses DQN to guide the
join order selection. Once a plan is generated, the cost of
the plan will be treated as a feedback to train the neural
network. Different from DQ, the input of ReJoin is composed
of a depth vector and a query vector. The depth vector
represents the depth information of each table in the join
tree and the query vector represents the query information.
ReJoin uses Proximal Policy Optimization (PPO) [125] to
guide the join order selection. The results show that both
DQ and ReJoin can generate good join order compared with
PostgreSQL’s optimizer with low cost while keeping high
efficiency. However, the neural networks in ReJoin and DQ
are simple, and cannot represent the structure of the join tree sufficiently, and they cannot learn the latency of query plans.
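
A toy version of DQ's state encoding and value-guided selection could look as follows; the table list, the feature layout, and the simple regressor standing in for the DQN are illustrative assumptions trained on placeholder plan costs:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

TABLES = ["customer", "orders", "lineitem", "part"]

def encode_join_state(joined, candidate):
    """One-hot style vector G: which tables are already in the join tree, plus which
    table the next action would join (a simplification of DQ's encoding)."""
    in_tree = [1.0 if t in joined else 0.0 for t in TABLES]
    action = [1.0 if t == candidate else 0.0 for t in TABLES]
    return np.array(in_tree + action)

# The value network scores (state, action) pairs; it is trained on the costs of
# previously generated plans (random placeholders here).
X = np.array([encode_join_state({"customer"}, t) for t in TABLES[1:]] * 50)
y = np.random.rand(len(X)) * 100.0
q_net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=300).fit(X, y)

def next_table(joined, remaining):
    """Greedy step of the value-guided search: join the table with the lowest predicted cost."""
    return min(remaining,
               key=lambda t: q_net.predict(encode_join_state(joined, t).reshape(1, -1))[0])

print(next_table({"customer"}, ["orders", "lineitem", "part"]))
```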

Moreover, the above two methods cannot support updates on schemas. To address this problem, RTOS [153] uses
a two stage training to generate better join order on latency
with a well-designed neural network structure. In order to
address the problem that the DNN design in ReJoin and
DQ cannot catch the structure of join tree, RTOS proposes a
model that utilizes the TreeLSTM to represent the join state.
RTOS first designs the representation for column, table and
join tree. Then the DQN is used to guide the join order
generation after the representation is obtained. Next, RTOS
first uses cost to pre-train the DNN and then utilizes latency
information to train the neural network online. The results
show that it can generate better join order with low latency.

Online-learning Methods. This class of methods [5], [138],
[137] focus on learning a join order using adaptive query
processing, which can change the join order even during the
execution of the queries. Avnur et al [5] propose an adaptive
query processing mechanism called Eddy, which combines
the execution and optimization of a query. It learns and
generates the join order during the online execution of a
query. Eddy splits the query process as many operators, e.g.,
two join operators between three relations. Eddy uses two
routing methods to control the order of these operators to
handle the coming tuples: Naive Eddy and Lottery Eddy.
Naive Eddy can route more tuples to the operator with
less cost; while Lottery Eddy can route more tuples to
the operator with smaller selectivity. However, these two
routing methods are designed to specific scenarios and it
calls for designing general routing algorithms to deal with
more complex scenarios.

Tzoumas et al [138] model the query execution with
Eddy as a reinforcement learning problem (Eddy-RL) and
automatically learn the routing methods from the query
execution without any human design. Eddy defines the
eddy itself as an agent, the progress of tuples as state, the
operators in the query as actions, the execution time as
reward (cost). The routing methods that decide the order of operators are exactly what this RL problem will solve. Eddy-RL uses Q-learning [146] to solve this RL problem. It defines
a Q function as a sum of all operators’ cost. By minimizing
the Q function, it guides which operator to choose each
time. However, the above eddy-style optimizer does not analyze the relationship between the expected execution time and the optimum, and it does not discard intermediate results.

Trummer et al [137] propose a reinforcement learning
model SkinnerDB based on Eddy-RL [138]. SkinnerDB uses
the Upper Confidence Bounds to Trees (UCT) [65] instead
of Q-learning in Eddy-RL, because UCT provides theoretical
guarantees on accumulated regret over all choices. SkinnerDB divides the query execution into many small time
slices and in each slice, it chooses one join order to execute.
Using the real performance of a join order in the time slice,
SkinnerDB trains the UCT model to guide a better join order selection. Finally, SkinnerDB merges the tuples produced
in each time slice to generate the final result.

Learning-based Optimizer

Although many researchers have tried to use machine learning methods to solve cost/cardinality estimation and join order selection problems, there are still many factors that need to be considered in physical plan optimization, e.g., index and view selection. Join order selection methods [69], [98],
[153], [137] provide the logical plan, rely on the database
optimizer to select the physical operators and indexes, and
utilize the cost model to generate the final physical plan. Recently, Marcus et al [100] propose an end-to-end optimizer
NEO which does not use any cost model and cardinality
estimation to generate the final physical plan. NEO is also
an offline-learning method based on ReJoin[98]. Similar to
RTOS [153], NEO also uses Tree-CNN to catch the structural
information. It uses row vectors to represent the predicates.
In order to produce the physical plan, NEO uses a onehot vector to represent each physical operator selection and
index selection in neural networks. Then NEO performs
a DNN-guided search which keeps expanding the state
with minimal value to find the physical plan. In addition,
without any information from the cost model, NEO uses
PostgreSQL’s plan to pre-train the neural network and uses
latency as feedback to train the neural network. This end-toend method learns from the latency to generate the whole
physical plan, which can be applied to any environment and
robust to estimation errors.

Learning-based Design

Learning-based design aims to utilize machine learning
techniques to design database components, e.g., indexes.

Learned Data Structure

The database and machine learning communities have investigated learning-based structures, e.g., the learned B-tree, which use learned models to replace traditional indexes in order to reduce index size and improve performance.

Learned B+tree. Kraska et al. [66] propose that indexes are models: a B+tree index can be seen as a model that maps each query key to its page. For a sorted array, a larger position id means a larger key value, so a range index should effectively approximate the cumulative distribution function (CDF). Based on this observation, Kraska et al. [66] propose a recursive model index, which uses a learned model to estimate the page id of a key. This method outperforms the B+tree in an in-memory environment. However, it cannot support data updates, and it has not been shown to be effective for secondary indexes.
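As a rough illustration of the "index as CDF model" idea, the sketch below fits a single linear model from key to position and corrects the prediction within the worst-case error observed at build time; a real recursive model index uses a hierarchy of models, and all names here are illustrative.

```python
import bisect

class LearnedIndex:
    """Minimal single-model sketch of a learned range index:
    approximate the key -> position mapping (the CDF) with a line,
    then correct within the worst-case error observed at build time."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        # Fit position ~ slope * key + intercept from the endpoints
        # (a real system would use least squares or a model hierarchy).
        k0, k1 = sorted_keys[0], sorted_keys[-1]
        self.slope = (n - 1) / (k1 - k0) if k1 != k0 else 0.0
        self.intercept = -self.slope * k0
        # Record the maximum prediction error to bound the search window.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(sorted_keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        pos = self._predict(key)
        lo = max(0, pos - self.err)
        hi = min(len(self.keys), pos + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None

keys = sorted(range(0, 10_000, 7))
idx = LearnedIndex(keys)
print(idx.lookup(7 * 123), idx.lookup(5))  # position of an existing key, None for a missing one
```

The local correction step is what lets the structure tolerate model error, at the cost of a small bounded search around the predicted position.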

Another learned index, Fitting-tree, proposed by Galakatos et al. [35], provides strict error bounds and predictable performance, and it supports two kinds of insertion strategies. With the in-place insertion strategy, Fitting-tree keeps extra insertion space at each end of a page so that in-place insertions do not violate the page error; however, the insertion cost may be high for large segments. The delta insertion strategy keeps a fixed-size buffer zone in each segment, and keys are inserted into the buffer and kept sorted. Once the buffer overflows, the index has to split and combine segments. Similar to Fitting-tree, the Alex-index [24] also reserves space for inserted keys. The difference is that the reserved space in the Alex-index is scattered, and an inserted key is put directly into the position predicted by the model. If the position is occupied, more gaps are inserted (for a gapped array) or the array expands itself (for a packed memory array). The Alex-index is therefore more flexible in balancing the trade-off between space and efficiency.

Tang et al. [134] propose a workload-aware learned index called Doraemon. Doraemon incorporates read access patterns by making several copies of frequently visited queries in the training data, so that frequent queries contribute more to the error and are optimized more than other queries. Doraemon also reuses pre-trained models for similar data distributions, based on the observation that similar data distributions require the same model structure.

Learned Secondary Index. Wu et al. [149] propose a succinct secondary indexing mechanism called Hermit, which leverages a Tiered Regression Search Tree (TRS-Tree) to capture column correlations and outliers. The TRS-Tree is a machine-learning-enhanced tree index: each leaf node contains a learned linear regression model that maps the target values to the correlated values, and each internal node maintains a fixed number of pointers to its children. Hermit answers a query in three steps: it first searches the TRS-Tree to map the target column to an existing index, then leverages the existing index to get candidate tuples, and finally validates the tuples. Hermit is effective in both in-memory and disk-based RDBMSs.

Learned Hashmap. Kraska et al. [66] propose to approximate the CDF of the keys as a hash function, so that the keys are distributed evenly across hash buckets and conflicts are reduced.

Learned Bloom Filters. A Bloom filter is a commonly used index to determine whether a value exists in a given set, but a traditional Bloom filter may occupy a large amount of memory for the bit array and hash functions. To reduce the size of Bloom filters, Kraska et al. [66] propose a learning-based Bloom filter. They train a binary classifier to recognize whether a query key exists in the dataset. New queries first pass through the classifier, and the negative ones then pass through a traditional Bloom filter to guarantee that no false negatives exist. Mitzenmacher et al. [103] propose a formal mathematical framework for improving the performance of learned Bloom filters. They propose a sandwich structure with three layers: the first layer is a traditional Bloom filter that removes most of the queries not in the dataset, the second layer is a neural network that removes false positives, and the last layer is another traditional Bloom filter that guarantees no false negatives. They provide both mathematical and intuitive analysis to show that the sandwich structure is better than the two-layer Bloom filter. Besides, they design the Bloomier filter, which not only determines key existence but also returns the value associated with the key in the dataset. However, training a Bloom filter from scratch is not practical for ephemeral input streams with high throughput, which motivates the work of Rae et al. [116], who propose a learned Bloom filter with few-shot neural data structures to support data updates.
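The following is a minimal sketch of the classifier-plus-backup-filter idea (the basic learned Bloom filter above, not the full sandwich): keys that the model scores below the threshold go into a conventional Bloom filter, so the structure never produces false negatives. The `classifier_score` model and the toy key set are placeholders.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter with double hashing (illustrative, not tuned)."""
    def __init__(self, m=1 << 16, k=4):
        self.m, self.k, self.bits = m, k, bytearray(m // 8 + 1)
    def _hashes(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]
    def add(self, item):
        for h in self._hashes(item):
            self.bits[h // 8] |= 1 << (h % 8)
    def __contains__(self, item):
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in self._hashes(item))

def classifier_score(key):
    """Hypothetical learned model: here just a hand-written rule that
    gives the in-set keys (even numbers in this toy data) high scores."""
    return 0.9 if key % 2 == 0 else 0.1

class LearnedBloomFilter:
    def __init__(self, keys, threshold=0.5):
        self.threshold = threshold
        self.backup = BloomFilter()
        # Keys the model misses must go into the backup filter,
        # so the structure never produces false negatives.
        for k in keys:
            if classifier_score(k) < threshold:
                self.backup.add(k)
    def __contains__(self, key):
        return classifier_score(key) >= self.threshold or key in self.backup

keys = list(range(0, 1000, 2)) + [3, 7, 11]   # mostly even, a few odd keys
lbf = LearnedBloomFilter(keys)
print(4 in lbf, 3 in lbf, 5 in lbf)           # True, True, (almost surely) False
```

False positives remain possible, as with any Bloom filter, but the classifier absorbs most of the membership checks so the backup filter can stay small.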

For multidimensional datasets, it is space- and time-consuming to search a set of single Bloom filters one by one. Macke et al. [94] propose an efficient learned Bloom filter for multidimensional data using the sandwich structure. For attribute embedding, they represent values of high-cardinality attributes with character-level RNNs to reduce the model size. Moreover, they select the classifier cutoff threshold that maximizes the KL divergence between the true positive and false positive rates. To reduce the influence of noisy in-index data, they introduce a shift parameter for each positive training sample.

Learned Index for Spatial Data. Conventional spatial indexes, e.g., the R-tree, kd-tree and G-tree, cannot capture the distribution of the underlying data, so the look-up time and space overhead can be further optimized with learning-based techniques. For example, Wang et al. [143] propose a learned ZM index, which first maps multi-dimensional geospatial points into a 1-dimensional vector with Z-ordering and then constructs a neural network index to fit the distribution and predict the locations of queries.
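As a small illustration of the Z-ordering step used by the ZM index, the sketch below interleaves coordinate bits into a Morton code; a 1-D learned model (like the CDF sketch earlier) could then be fit over the sorted codes. The bit width and the sample points are arbitrary.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton (Z-order) code: interleave the bits of x and y so that
    points close in 2-D tend to be close on the resulting 1-D curve."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x bits at even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y bits at odd positions
    return code

# Map (integer-scaled) geospatial points to Z-values, then sort; a learned
# model over the sorted Z-values can predict a point's location for queries.
points = [(3, 5), (10, 2), (7, 7), (0, 15)]
zvals = sorted(interleave_bits(x, y) for x, y in points)
print(zvals)
```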

Learned Index for High-dimensional Data. The nearest neighbor search (NNS) problem on high-dimensional data aims to find the k nearest points of a query efficiently. Traditional methods for the approximate NNS problem fall into three categories: hashing-based, partition-based and graph-based indexes. Some studies [26], [122], [119] improve the first two types of indexes using machine learning techniques. Schlemper et al. [122] propose an end-to-end deep hashing method that uses a supervised convolutional neural network. It combines two losses, a similarity loss and a bit rate loss, so it can discretize the data and minimize the collision probability at the same time. Sablayrolles et al. [119] propose a similar end-to-end deep learning architecture that learns a catalyzer function to increase the quality of the subsequent coding phase. They introduce a loss derived from the Kozachenko-Leonenko differential entropy estimator to favor uniformity in the spherical output space. Dong et al. [26] reduce the high-dimensional space partitioning problem to a balanced graph partitioning and supervised classification problem. They first partition the KNN graph into balanced small partitions using the graph partitioning algorithm KaHIP, then learn a neural model to predict the probability that the KNNs of a query fall in each partition, and search the partitions with the largest probabilities.

Learned KV-store Design. Idreos et al. [54], [53] show that data structures in key-value stores can be constructed from fundamental components, and that a learned cost model can guide the construction. They define the design space as all designs that can be described by fundamental design components such as fence pointers, links and temporal partitioning; a design continuum is a subspace of the design space that connects more than one design. To design a data structure, they first identify the bottleneck of the total cost and the knob that can be tweaked to alleviate it, and then tweak the knob in one direction until they reach its boundary or the total cost reaches a minimum. This process is similar to gradient descent and can be conducted automatically.
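A minimal sketch of this cost-guided, gradient-descent-like tweaking loop is shown below; the `total_cost` model and the `size_ratio` knob are hypothetical placeholders, not the cost synthesis of [54], [53].

```python
def total_cost(design):
    """Hypothetical learned cost model: trade-off between read and write
    amplification controlled by a single 'size_ratio' knob."""
    r = design["size_ratio"]
    return 100.0 / r + 3.0 * r   # reads get cheaper, writes costlier, as r grows

def tweak_until_minimum(design, knob, step, lo, hi):
    """Move one knob in one direction while the total cost keeps improving,
    stopping at the knob's boundary or at a local minimum."""
    best = total_cost(design)
    while lo <= design[knob] + step <= hi:
        candidate = dict(design, **{knob: design[knob] + step})
        cost = total_cost(candidate)
        if cost >= best:
            break
        design, best = candidate, cost
    return design, best

design, cost = tweak_until_minimum({"size_ratio": 2}, "size_ratio", step=1, lo=2, hi=64)
print(design, round(cost, 2))   # stops near the knob value that minimizes the model
```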

Learned Transaction Management

As the number of CPU cores increases, concurrency control for heavy workloads becomes more challenging [127]. Effective workload scheduling can greatly improve performance by avoiding conflicts. We introduce learned transaction management techniques from two aspects: transaction prediction and transaction scheduling.

Transaction Prediction. Transaction prediction is important for database optimization (e.g., resource control, transaction scheduling). Traditional workload prediction methods are rule-based. For example, a rule-based method [23] uses domain knowledge of database engines (e.g., internal latency, resource utilization) to identify signals relevant to workload characteristics, such as memory utilization, and directly uses memory utilization to predict the future workload trend. However, rule-based methods waste much time rebuilding a statistics model when the workload changes, so Ma et al. [92] propose an ML-based system, QB5000, that predicts the future trend of different workloads. QB5000 mainly contains three components: Pre-Processor, Cluster and Forecaster. First, Pre-Processor records incoming query features (e.g., syntax tree, arrival rate) and aggregates queries with the same template to approximate workload features. Second, Cluster groups templates with similar arrival rates using a modified DBSCAN algorithm. Third, Forecaster predicts the arrival rate pattern of the queries in each cluster. QB5000 tries six different forecasting models, where the training data is the history of workloads from past observations.
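To illustrate the Cluster step, the sketch below groups query templates with similar arrival-rate shapes using scikit-learn's plain DBSCAN (QB5000 uses a modified DBSCAN); the templates and hourly rates are toy data standing in for the Pre-Processor's output.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hourly arrival rates (queries/hour) per query template -- toy data in
# place of QB5000's Pre-Processor output.
arrival_rates = {
    "SELECT * FROM orders WHERE id = ?":    [120, 130, 125, 700, 710, 705],
    "SELECT * FROM users WHERE email = ?":  [118, 128, 122, 690, 705, 700],
    "INSERT INTO audit_log VALUES (?)":     [ 10,  12,  11,  13,  12,  10],
}

templates = list(arrival_rates)
X = np.array([arrival_rates[t] for t in templates], dtype=float)
# Normalize each series so clustering keys on the *shape* of the pattern,
# not the absolute query volume.
X = X / X.max(axis=1, keepdims=True)

labels = DBSCAN(eps=0.3, min_samples=1).fit_predict(X)
for template, label in zip(templates, labels):
    print(label, template)
# A forecaster (e.g., linear regression or an RNN) would then be trained
# per cluster on the aggregated arrival-rate series.
```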

Transaction Scheduling. Traditional database systems either schedule workloads sequentially, which cannot consider potential conflicts, or schedule workloads based on the execution costs predicted by the database optimizer. But a traditional database optimizer [74] estimates cost based on assumptions like uniformity and independence, which may be wrong when there are correlations between joined attributes. Hence Sheng et al. [127] propose a machine learning based transaction scheduling method that balances concurrency and conflict rates. First, they estimate the conflict probabilities using a supervised algorithm: they build a classifier M to identify whether any pair of transactions would conflict.

Database Monitoring

Database monitoring records the system's running status and examines the workload to ensure database stability, which is very important for database optimization and diagnosis. For example, knob tuning relies on database monitoring metrics, e.g., system load and read/write blocks. We broadly divide database monitoring into three cases: database health monitoring, activity monitoring and performance prediction.

Database Health Monitor. A database health monitor (DHM) records database-health-related metrics, e.g., the number of queries per second and the query latency, to optimize the database or diagnose failures. In [93], the authors assume that intermittent slow queries with similar key performance indicators (e.g., cpu.usage, mysql.tps) have the same root causes, so they adopt a two-stage diagnosis: (i) in the offline stage, they extract slow SQLs from the failure records, cluster them by KPI states, and ask DBAs to assign a root cause to each cluster; (ii) in the online stage, for an incoming slow SQL, they match it to a cluster C based on the similarity score of KPI states. If matched, they report the root cause of C to the DBAs; otherwise, they generate a new cluster and ask the DBAs to assign its root cause. However, [93] cannot prevent potential database failures and relies heavily on DBAs' experience. Moreover, monitoring many database metrics is rather expensive, because monitoring itself consumes resources. To tackle this issue, Taft et al. [132] propose an elastic database system, P-Store, which combines database monitoring with workload prediction. The basic idea is to proactively monitor the database to adapt to workload changes.

Database Activity Monitor. Different from the health monitor, a database activity monitor (DAM) externally monitors and controls database activities (e.g., creating new accounts, viewing sensitive information), which is vital for protecting sensitive data. We divide DAM into two classes: activity selection and activity tracing. For activity selection, there are different levels of database activities (e.g., DBA activities, user transactions including DML, DDL and DCL). Traditional DAM methods record all the activities on extra systems according to trigger rules [21]. For example, a company might create a rule that generates an alert every time a DBA performs a select query on a credit card column that returns more than 5 results. However, it is a heavy burden to record all the activities, which causes frequent data exchanges between the databases and the monitoring systems. Hence, to automatically select and record risky activities, Hagit et al. [40] model database monitoring as a multi-armed bandit (MAB) problem. The MAB model is a decision-making algorithm that selects risky database activities by exploiting the current policy and exploring new policies. The goal is to train an optimal policy that maximizes the risk score, so in each step the MAB model samples some users with the highest risk to exploit the policy and some other users to explore a better policy. For activity tracing, after deciding which activities to monitor, we need to trace high-risk activities and optimize database performance.
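The sketch below is a minimal epsilon-greedy stand-in for this bandit-style activity selection; the user list, the risk feedback in `observed_risk`, and the monitoring budget are toy placeholders rather than the model of [40].

```python
import random

users = ["dba_1", "analyst_2", "app_service", "intern_3"]
# Running estimate of each user's risk: (observations, mean_risk_score).
estimates = {u: (0, 0.0) for u in users}

def observed_risk(user):
    """Hypothetical feedback from auditing the selected user's activities."""
    true_risk = {"dba_1": 0.7, "analyst_2": 0.2, "app_service": 0.1, "intern_3": 0.4}
    return max(0.0, min(1.0, random.gauss(true_risk[user], 0.1)))

def pick_user(eps=0.2):
    """Exploit the currently riskiest user; explore a random one with prob. eps."""
    if random.random() < eps:
        return random.choice(users)
    return max(users, key=lambda u: estimates[u][1])

budget = 200  # how many activity traces we can afford to record
for _ in range(budget):
    u = pick_user()
    r = observed_risk(u)
    n, mean = estimates[u]
    estimates[u] = (n + 1, mean + (r - mean) / (n + 1))

print(sorted(estimates.items(), key=lambda kv: -kv[1][1]))  # riskiest users first
```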

Performance Prediction. Query performance prediction is vital for meeting service level agreements (SLAs), especially for concurrent queries. A traditional prediction method [27] only captures logical I/O metrics (e.g., page access time), neglects many resource-related features, and cannot produce accurate results. So Marcus et al. [99] use deep learning to predict query latency under concurrency, including interactions between child/parent operators and parallel plans. However, it adopts a pipeline structure (causing information loss) and fails to capture operator-to-operator relations such as data sharing/conflict features. Hence, Zhou et al. [159] propose a performance prediction method with graph embedding. They use a graph model to characterize concurrent queries, where the vertices are operators and the edges capture operator correlations (e.g., data passing, access conflicts, resource competition). They use a graph convolutional network to embed the workload graph, extract performance-related features from the graph, and predict the performance based on these features.

Learning-based Security

Learning-based database security aims to use machine learning techniques to ensure the confidentiality, integrity and availability of databases. We review recent works on sensitive data discovery, access control and SQL injection.

Learning-based Sensitive Data Discovery. Since sensitive data leakage causes great financial and personal-information losses, it is important to protect the sensitive data in a database. Sensitive data discovery aims to automatically detect and protect confidential data. Traditional methods use user-defined rules (e.g., for IDs, credit cards and passwords) to detect sensitive data [44], [117]. For example, DataSunrise is a search-based data discovery framework.
In the implementation, they directly add independent Laplace noise to q(T, r) for each pattern r and select the k patterns with the highest perturbed frequencies, which represent the highest attack risk. In this way, they improve the accuracy of data discovery within resource limitations.

Access Control. Access control aims to prevent unauthorized users from accessing the data, including table-level and record-level access control. There are several traditional methods for access control, such as protocol-based [39], role-based [129], query-based [150] and purpose-based [13] methods. However, these methods are mainly based on static rules, and advanced techniques (e.g., identity impersonation, metadata modification, illegal invasion) may fake access privileges, so traditional methods cannot effectively prevent these attacks. Recently, machine learning algorithms have been proposed to estimate the legality of access requests. Colombo et al. [19] propose a purpose-based access control model, which customizes control policies to regulate data requests. As different actions and data content may lead to different privacy problems, this method aims to learn legal access purposes.

SQL Injection. SQL injection is a common and harmful vulnerability of databases. Attackers may modify or view data beyond their privileges by injecting additional information or interfering with the SQL statement, e.g., retrieving hidden data, subverting application logic, union attacks, etc. [106]. For example, an application may allow users to access product information by filling in a SQL statement like "SELECT price, abstract FROM product WHERE pname='?' AND released=1". But an attacker may retrieve the hidden information of unreleased products by adding extra content to the pname value, like "SELECT price, abstract FROM product WHERE pname='car'-- AND released=1", where "--" is the comment indicator in SQL and removes the "released=1" restriction. Traditional defenses are rule-based (e.g., parameterized queries) and have two limitations. First, they take a long time to scan illegal parameters. Second, illegal parameters have many variants that are not enumerable, so traditional feature matching methods fail to recognize all attacks. Recently, two main types of SQL injection detection methods utilize machine learning techniques: classification trees [128], [86] and fuzzy neural networks [8]. Moises et al. [86] propose a classification algorithm for detecting SQL injection. Since frequent SQL attacks are caused by logical failures or bad filters in the query parameters, they build a classification tree based on tokens extracted from SQL queries to predict possible SQL injections. The training samples are queries with typical SQL injections and risk levels (dangerous/normal/none) collected from the database logs. However, this method requires much training data and cannot generalize its knowledge to different detection targets. To address the problem of limited training samples, Batista et al. [8] propose a fuzzy neural network (FNN) for SQL attacks. The basic idea is to identify attack patterns with fuzzy rules and memorize these rules inside the neural network.
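As a small illustration of token-based classification in the spirit of [86], the sketch below tokenizes query strings and trains a scikit-learn decision tree; the five training queries and their risk labels are toy examples, not real log data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy training set: queries with their risk labels (normal / dangerous).
queries = [
    "SELECT price FROM product WHERE pname='car' AND released=1",
    "SELECT name FROM users WHERE id=42",
    "SELECT price FROM product WHERE pname='car'-- ' AND released=1",
    "SELECT name FROM users WHERE id=1 OR 1=1",
    "SELECT a FROM t WHERE x='' UNION SELECT password FROM users--",
]
labels = ["normal", "normal", "dangerous", "dangerous", "dangerous"]

# Tokenize on SQL-ish boundaries; keep punctuation tokens such as -- and '.
vectorizer = CountVectorizer(token_pattern=r"--|[a-zA-Z_]+|=|'")
X = vectorizer.fit_transform(queries)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, labels)

test = ["SELECT price FROM product WHERE pname='bike' AND released=1",
        "SELECT price FROM product WHERE pname='x' OR 1=1--"]
print(clf.predict(vectorizer.transform(test)))
```

With so few samples the tree keys on tokens like `--`, `or` and `union`, which is exactly why such a model needs much more training data before it generalizes to new attack variants.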

DB FOR AI (Detailed Explanation)

Existing machine learning platforms are hard to use, because users have to write code (e.g., in Python) to utilize AI algorithms for data discovery/cleaning, model training and model inference. To lower the barrier to using AI, the database community extends database techniques to encapsulate the complexity of AI algorithms and enables users to use declarative languages, e.g., SQL, to invoke AI algorithms. In this section, we summarize the database techniques for reducing the complexity of using AI.

Declarative Language Model

Traditional machine learning algorithms are mostly implemented with programming languages (e.g., Python, R) and have several limitations. First, they require engineering skills to define the complete execution logic, e.g., the iterative patterns of model training, and tensor operations such as matrix multiplication and flattening. Second, machine learning algorithms have to load data from database systems, and the data import/export costs may be high.

Hence, AI-oriented declarative language models are proposed to democratize machine learning with extended SQL syntax. We classify declarative language models into three categories: hybrid language models [33], unified language models [49], [101], [81], and drag-and-drop methods [121], [33].

Hybrid language model. A hybrid language model, e.g., BigQuery ML [33], contains both AI and DB operations. Generally, for each query, it splits the statements into AI operations and DB operations, then executes the AI operations on AI platforms (e.g., TensorFlow, Keras) and the DB operations on databases. The advantage of hybrid language models is that they are easy to implement, but the downside is that they have to frequently migrate data between the DB and AI platforms, which results in low efficiency.

Unified language model. To fully utilize data management techniques, unified language models [49], [101], [81] are proposed to natively support AI queries in databases without data migration. Hellerstein et al. propose an in-database analytics method, MADlib [49], which provides a suite of SQL-based machine learning algorithms in three steps. (1) AI operators need frequent matrix computation, including multiplication, transposition, etc., so MADlib implements a customized sparse matrix library within PostgreSQL; for example, linear algebra is coded with loops of basic arithmetic in C. (2) MADlib abstracts many AI operators inside databases, including data acquisition, access, sampling, and model definition, training and inference. (3) MADlib supports iterative training in databases. For a training process with n iterations, MADlib first declares a virtual table; then, for each iteration, it maintains the training results (e.g., gradients of neural units) of m samples as a view and joins the virtual table with the view to update the model parameters.
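The sketch below shows what invoking in-database training in this style can look like from Python via psycopg2, assuming a reachable PostgreSQL instance with MADlib installed and a houses(id, price, tax, bath, size) table; the madlib.linregr_train/linregr_predict calls follow MADlib's documented pattern, but the exact signatures may differ across versions, so treat them as illustrative.

```python
import psycopg2

# Assumes a PostgreSQL instance with the MADlib extension and a table
# houses(id, price, tax, bath, size); the connection string is a placeholder.
conn = psycopg2.connect("dbname=demo user=postgres")
cur = conn.cursor()

# Train a linear regression entirely inside the database (MADlib-style call;
# check your MADlib version's documentation for the exact signature).
cur.execute("""
    SELECT madlib.linregr_train(
        'houses',                     -- source table
        'houses_model',               -- output table for coefficients
        'price',                      -- dependent variable
        'ARRAY[1, tax, bath, size]'   -- independent variables (with intercept)
    );
""")

# Score rows without moving the data out of the database.
cur.execute("""
    SELECT h.id,
           madlib.linregr_predict(m.coef, ARRAY[1, h.tax, h.bath, h.size])
    FROM houses h, houses_model m;
""")
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```

The point of the unified model is visible here: both training and inference stay inside PostgreSQL, so no data is exported to an external AI platform.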

Drag-and-Drop Interface. Some SQL statements are still complicated (e.g., nested structures, multiple joins) and hard for inexperienced users to understand. Hence, some studies use drag-and-drop methods to make AI techniques accessible [121], [33]. The spreadsheet is an indispensable tool for many data analysts. BigQuery ML [121] provides a virtual spreadsheet technology, Connected Sheets, which combines the simplicity of the spreadsheet interface with machine learning algorithms in database systems. It first presents billions of rows of data to users in a virtual spreadsheet, on which users can explore the data. It then automatically translates the data operations performed by users into SQL statements and sends them to the database. In this way, data can be analyzed with conventional worksheet functions (e.g., formulas, pivot tables and charts).

Data Governance

AI models rely on high-quality data, and data governance aims to discover, clean, integrate and label the data to obtain high-quality data, which is important for deploying AI models.

Data Discovery. Suppose an AI developer aims to build an AI model over a dataset corpus; the user needs to find relevant datasets to build the model. Data discovery aims to automatically find relevant datasets from a data warehouse considering the applications and user needs. Many companies have proposed data discovery systems, like Infogather [151] at Microsoft and Goods [43] at Google. The former mainly focuses on schema complement at the attribute level, aiming to enrich the attributes of a table from a massive corpus of Web tables. The latter is a dataset-level method that stores information like dataset schema, similarity and provenance between datasets, so that users can search and manage the datasets they want. However, the discovery process of such systems is built in and the stored information is predefined, so they cannot generalize to common use cases because only limited relationships between datasets can be represented. To this end, Fernandez et al. propose Aurum [34], a data discovery system that provides flexible queries to search datasets based on users' demands. It leverages an enterprise knowledge graph (EKG) to capture a variety of relationships and support a wide range of queries. The EKG is a hypergraph where each node denotes a table column, each edge represents the relationship between two nodes, and hyperedges connect nodes that are hierarchically related, such as columns in the same table.

Data Cleaning. Most data are dirty and inconsistent, and dirty or inconsistent data lead to unreliable decisions and biased analysis, so it is necessary to clean the data. The data cleaning pipeline consists of error detection and error repair (see [57] for a survey). Most data cleaning methods focus on cleaning an entire dataset. However, data cleaning is task dependent, and different tasks may use different cleaning techniques to repair different parts of the data. Hence, Wang et al. propose a cleaning framework, ActiveClean, for machine learning tasks [68]. Given a dataset and a machine learning model with a convex loss, it iteratively selects the records whose cleaning can most improve the model's performance. ActiveClean consists of four modules: sampler, cleaner, updater and estimator. The sampler selects a batch of records to be cleaned; the selection criterion is how much improvement cleaning a record can bring, i.e., the variation of the gradient, which is estimated by the estimator. The selected records are then checked and repaired by the cleaner. Next, the updater updates the gradient based on the verified dirty data. These steps are repeated until the budget is used up.
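A minimal sketch of the sampler/cleaner/updater/estimator loop for a convex (least-squares) model is given below; the gradient-magnitude sampling rule, the `clean` function and the toy data are simplifications, not ActiveClean's actual estimator.

```python
import numpy as np

def grad(w, x, y):
    """Per-record gradient of the squared loss for a linear model (convex)."""
    return 2.0 * (np.dot(w, x) - y) * x

def clean(record):
    """Placeholder cleaner: in practice a human or a repair rule fixes the record."""
    x, y, dirty = record
    return (x, y / 10.0 if dirty else y, False)   # e.g., undo a unit error

# Toy dataset of (features, label, is_dirty); dirty labels are inflated 10x.
rng = np.random.default_rng(0)
data = []
for _ in range(200):
    x = rng.normal(size=2)
    y = 3.0 * x[0] - x[1]
    dirty = rng.random() < 0.2
    data.append((x, y * 10.0 if dirty else y, dirty))

w = np.zeros(2)
budget, batch, lr = 30, 20, 0.05
for _ in range(budget):
    # Sampler + estimator: prefer records whose cleaning would change the
    # gradient the most, roughly approximated by the current gradient magnitude.
    scores = [np.linalg.norm(grad(w, x, y)) for x, y, _ in data]
    picked = np.argsort(scores)[-batch:]
    # Cleaner: repair the sampled records.
    for i in picked:
        data[i] = clean(data[i])
    # Updater: take a gradient step using the verified (cleaned) records.
    g = np.mean([grad(w, data[i][0], data[i][1]) for i in picked], axis=0)
    w -= lr * g

print(np.round(w, 2))  # drifts toward the clean model's weights [3, -1]
```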

Data Labeling. Sophisticated machine learning algorithms like deep learning often require a large amount of training data to train a good model, so how to obtain such a massive amount of labeled data is a challenging problem. There are mainly three ways to obtain training data: domain experts, non-experts (crowdsourcing) and distant supervision. First, domain experts can provide high-quality labels, which produce well-performing models. However, it is usually too expensive to ask experts to label a large amount of data, so active learning techniques [1], [91], [90] are used to leverage a smaller number of labeled examples to train a model. Second, thanks to commercial public crowdsourcing platforms such as Amazon Mechanical Turk (AMT), crowdsourcing (see [76] for a survey) is an effective way to label data by utilizing hundreds or thousands of workers. Third, distant supervision [102] labels data automatically by making use of knowledge bases, such as Freebase, or domain-specific knowledge. Traditionally, supervised learning needs a large amount of training data, which is expensive to obtain, while unsupervised methods need no training data but usually suffer from poor performance; distant supervision combines their advantages. Suppose we want to extract relations from a corpus without labeled data using distant supervision with the help of Freebase. In the training step, if a sentence in a document contains two entities that are an instance of one of Freebase's relations, distant supervision extracts features from the sentence, adds the feature vector for the relation, and performs training. In the testing step, given some pairs of entities as candidates, it uses the features to predict the relation based on the trained model.
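The sketch below illustrates the automatic labeling step with a toy dictionary standing in for Freebase: sentences containing an entity pair known to the knowledge base are labeled with its relation and converted to simple features.

```python
# Toy knowledge base standing in for Freebase: (entity1, entity2) -> relation.
kb = {
    ("steve jobs", "apple"): "founder_of",
    ("paris", "france"): "capital_of",
}

corpus = [
    "Steve Jobs started Apple in a garage .",
    "Paris is the beautiful capital of France .",
    "The weather in Paris was lovely .",
]

def featurize(sentence, e1, e2):
    """Very small feature set: the words between the two entity mentions."""
    toks = sentence.lower().split()
    i, j = toks.index(e1.split()[-1]), toks.index(e2.split()[-1])
    lo, hi = sorted((i, j))
    return tuple(toks[lo + 1:hi])

training = []
for sent in corpus:
    low = sent.lower()
    for (e1, e2), rel in kb.items():
        if e1 in low and e2 in low:           # both entities co-occur
            training.append((featurize(sent, e1, e2), rel))

for features, rel in training:
    print(rel, features)
# A relation classifier would then be trained on these automatically
# labeled (features, relation) pairs.
```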

Data Lineage. Data lineage [115], [158] aims to depict the relationships between the inputs and outputs of a machine learning pipeline. For example, suppose we detect an erroneous result of a workflow and want to debug it; it is very beneficial to trace back to the source data that generated the error. Also, if there are some dirty input data, it is helpful to identify the corresponding output data to prevent incorrect results. The relationships described by data lineage fall into two categories: (1) backward relationships return the subset of input data that generated the given output records; (2) forward relationships return the subset of output records that were generated from the given input records. There are several data lineage approaches to building relationships between input and output data, including the lazy approach [18], [20], the eager linear capture approach [30], [56], and fine-grained data lineage [115], [158]. The lazy approaches [18], [20] regard lineage queries as relational queries and execute them directly on the input relations; the advantage is that the base queries incur no overhead, but the drawback is the computational cost of executing the lineage queries. The eager linear capture approach [30], [56] builds a lineage graph to speed up lineage queries and uses the paths in the graph to support backward and forward queries. The fine-grained data lineage system tightly integrates the lineage capture logic with the physical operators in the database [115] and uses a lightweight, write-efficient index to achieve low-overhead lineage capture.
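A minimal sketch of the eager capture idea is shown below: a toy filter operator records input-to-output edges while it runs, and backward/forward lineage queries become simple graph lookups; identifiers and the operator are illustrative.

```python
from collections import defaultdict

forward = defaultdict(set)   # input record id  -> output record ids
backward = defaultdict(set)  # output record id -> input record ids

def run_filter_operator(inputs, predicate):
    """Toy operator that captures lineage edges eagerly while executing."""
    outputs = {}
    for in_id, row in inputs.items():
        if predicate(row):
            out_id = f"out_{in_id}"
            outputs[out_id] = row
            forward[in_id].add(out_id)      # eager lineage capture
            backward[out_id].add(in_id)
    return outputs

inputs = {"in_1": {"price": 5}, "in_2": {"price": 50}, "in_3": {"price": 70}}
outputs = run_filter_operator(inputs, lambda r: r["price"] > 10)

# Backward query: which inputs produced a given (possibly erroneous) output?
print(backward["out_in_2"])          # {'in_2'}
# Forward query: which outputs are tainted by a given dirty input?
print(forward["in_3"])               # {'out_in_3'}
```

The trade-off discussed above is visible even in this sketch: the edges are maintained during execution (extra work per operator), but lineage queries then avoid re-running the base query.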

Model Training for AI

Model training is an indispensable step in applying AI algorithms; it includes feature selection, model selection, model management and model acceleration. In this section, we summarize existing model training techniques.
One widely used technique for accelerating model training is parallelism, and popular parallelism strategies include task parallelism [104], bulk synchronous parallelism [67], model hop parallelism [107] and the parameter server [80].

DB Techniques for Accelerating Model Inference

Model inference uses a pre-trained model to predict for the testing samples, and involves operator support, operator selection and execution acceleration.
For example, SystemML handles matrix operations according to their access patterns, like cell-wise matrix additions and transposes. SystemML then optimizes these operations with algebraic rewrites, adjusts the operator ordering of matrix multiplication chains, and compiles them into a DAG of low-level aggregation operators such as grouping, summing and counting. Thus SystemML can efficiently execute matrix operators inside databases.


RESEARCH CHALLENGES AND FUTURE WORK


CONCLUSION

