- Druid vs Elasticsearch
- Druid vs Key/Value Stores (HBase/Cassandra)
- Druid vs Redshift
- Druid vs Spark
- Druid vs SQL-on-Hadoop (Hive/Impala/Drill/Spark SQL/Presto)
Druid vs Elasticsearch
We are not experts on search systems, if anything is incorrect about our portrayal, please let us know on the mailing list or via some other means.
Elasticsearch is a search systems based on Apache Lucene. It provides full text search for schema-free documents and provides access to raw event level data. Elasticsearch is increasingly adding more support for analytics and aggregations. Some members of the community have pointed out
the resource requirements for data ingestion and aggregation in Elasticsearch is much higher than those of Druid.
Elasticsearch also does not support data summarization/roll-up at ingestion time, which can compact the data that needs to be stored up to 100x with real-world data sets. This leads to Elasticsearch having greater storage requirements.
Druid focuses on OLAP work flows. Druid is optimized for high performance (fast aggregation and ingestion) at low cost, and supports a wide range of analytic operations. Druid has some basic search support for structured event data, but does not support full text search. Druid also does not support completely unstructured data. Measures must be defined in a Druid schema such that summarization/roll-up can be done.
Druid vs. Key/Value Stores (HBase/Cassandra/OpenTSDB)
Druid is highly optimized for scans and aggregations, it supports arbitrarily deep drill downs into data sets. This same functionality is supported in key/value stores in 2 ways:
Pre-compute all permutations of possible user queries
Range scans on event data
When pre-computing results, the key is the exact parameters of the query, and the value is the result of the query.
The queries return extremely quickly, but at the cost of flexibility, as ad-hoc exploratory queries are not possible with pre-computing every possible query permutation. Pre-computing all permutations of all ad-hoc queries leads to result sets that grow exponentially with the number of columns of a data set, and pre-computing queries for complex real-world data sets can require hours of pre-processing time.
The other approach to using key/value stores for aggregations to use the dimensions of an event as the key and the event measures as the value. Aggregations are done by issuing range scans on this data. Timeseries specific databases such as OpenTSDB use this approach. One of the limitations here is that the key/value storage model does not have indexes for any kind of filtering other than prefix ranges, which can be used to filter a query down to a metric and time range, but cannot resolve complex predicates to narrow the exact data to scan. When the number of rows to scan gets large, this limitation can greatly reduce performance. It is also harder to achieve good locality with key/value stores because most don’t support pushing down aggregates to the storage layer.
For arbitrary exploration of data (flexible data filtering), Druid’s custom column format enables ad-hoc queries without pre-computation. The format also enables fast scans on columns, which is important for good aggregation performance.
Druid vs Redshift
How does Druid compare to Redshift?
In terms of drawing a differentiation, Redshift started out as ParAccel (Actian), which Amazon is licensing and has since heavily modified.
Aside from potential performance differences, there are some functional differences:
Real-time data ingestion
Because Druid is optimized to provide insight against massive quantities of streaming data; it is able to load and aggregate data in real-time.
Generally traditional data warehouses including column stores work only with batch ingestion and are not optimal for streaming data in regularly.
Druid is a read oriented analytical data store
Druid’s write semantics are not as fluid and does not support full joins (we support large table to small table joins). Redshift provides full SQL support including joins and insert/update statements.
Data distribution model
Druid’s data distribution is segment-based and leverages a highly available “deep” storage such as S3 or HDFS. Scaling up (or down) does not require massive copy actions or downtime; in fact, losing any number of historical nodes does not result in data loss because new historical nodes can always be brought up by reading data from “deep” storage.
To contrast, ParAccel’s data distribution model is hash-based. Expanding the cluster requires re-hashing the data across the nodes, making it difficult to perform without taking downtime. Amazon’s Redshift works around this issue with a multi-step process:
set cluster into read-only mode
copy data from cluster to new cluster that exists in parallel
redirect traffic to new cluster
Druid employs segment-level data distribution meaning that more nodes can be added and rebalanced without having to perform a staged swap. The replication strategy also makes all replicas available for querying. Replication is done automatically and without any impact to performance.
ParAccel’s hash-based distribution generally means that replication is conducted via hot spares. This puts a numerical limit on the number of nodes you can lose without losing data, and this replication strategy often does not allow the hot spare to help share query load.
Along with column oriented structures, Druid uses indexing structures to speed up query execution when a filter is provided. Indexing structures do increase storage overhead (and make it more difficult to allow for mutation), but they also significantly speed up queries.
ParAccel does not appear to employ indexing strategies.
Druid vs Spark
Druid and Spark are complementary solutions as Druid can be used to accelerate OLAP queries in Spark.
Spark is a general cluster computing framework initially designed around the concept of Resilient Distributed Datasets (RDDs). RDDs enable data reuse by persisting intermediate results in memory and enable Spark to provide fast computations for iterative algorithms. This is especially beneficial for certain work flows such as machine learning, where the same operation may be applied over and over again until some result is converged upon. The generality of Spark makes it very suitable as an engine to process (clean or transform) data. Although Spark provides the ability to query data through Spark SQL, much like Hadoop, the query latencies are not specifically targeted to be interactive (sub-second).
Druid’s focus is on extremely low latency queries, and is ideal for powering applications used by thousands of users, and where each query must return fast enough such that users can interactively explore through data. Druid fully indexes all data, and can act as a middle layer between Spark and your application. One typical setup seen in production is to process data in Spark, and load the processed data into Druid for faster access.
For more information about using Druid and Spark together, including benchmarks of the two systems, please see:
Druid vs SQL-on-Hadoop (Impala/Drill/Spark SQL/Presto)
SQL-on-Hadoop engines provide an execution engine for various data formats and data stores, and many can be made to push down computations down to Druid, while providing a SQL interface to Druid.
For a direct comparison between the technologies and when to only use one or the other, things basically comes down to your product requirements and what the systems were designed to do.
Druid was designed to
be an always on service
ingest data in real-time
handle slice-n-dice style ad-hoc queries
SQL-on-Hadoop engines generally sidestep Map/Reduce, instead querying data directly from HDFS or, in some cases, other storage systems. Some of these engines (including Impala and Presto) can be colocated with HDFS data nodes and coordinate with them to achieve data locality for queries. What does this mean? We can talk about it in terms of three general areas
Druid segments stores data in a custom column format. Segments are scanned directly as part of queries and each Druid server calculates a set of results that are eventually merged at the Broker level. This means the data that is transferred between servers are queries and results, and all computation is done internally as part of the Druid servers.
Most SQL-on-Hadoop engines are responsible for query planning and execution for underlying storage layers and storage formats. They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce).
Some (Impala/Presto) SQL-on-Hadoop engines have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly how much of a performance impact this makes.
Druid is built to allow for real-time ingestion of data. You can ingest data and query it immediately upon ingestion, the latency between how quickly the event is reflected in the data is dominated by how long it takes to deliver the event to Druid.
SQL-on-Hadoop, being based on data in HDFS or some other backing store, are limited in their data ingestion rates by the rate at which that backing store can make data available. Generally, the backing store is the biggest bottleneck for how quickly data can become available.
Druid’s query language is fairly low level and maps to how Druid operates internally. Although Druid can be combined with a high level query planner such as Plywood to support most SQL queries and analytic SQL queries (minus joins among large tables), base Druid is less flexible than SQL-on-Hadoop solutions for generic processing.
SQL-on-Hadoop support SQL style queries with full joins.
Druid vs Parquet
Parquet is a column storage format that is designed to work with SQL-on-Hadoop engines. Parquet doesn’t have a query execution engine, and instead relies on external sources to pull data out of it.
Druid’s storage format is highly optimized for linear scans. Although Druid has support for nested data, Parquet’s storage format is much more hierachical, and is more designed for binary chunking. In theory, this should lead to faster scans in Druid.