Apache Tajo™ - An open source big data warehouse system in Hadoop

Latest recommended article published on 2021-04-29 11:02:26

macyang

The main goal of the Apache Tajo™ project is to build an advanced open source data warehouse system in Hadoop for processing web-scale data sets.

Features

Interactive and Batch Queries
- Fully distributed SQL query processing on large data sets stored in HDFS and other data sources
- Very low response time (from about 100 msec) for simple queries (e.g., a plain aggregation or a small-to-large table join) on reasonably sized data
Long-Running Query Support
- Fault tolerance support that avoids restarting a query when some tasks fail
- Dynamic scheduling support that handles straggling and heterogeneous cluster nodes
Query Optimization
- Cost-based optimization for bushy join trees
- Progressive query optimization for reoptimizing running queries
ETL
- ETL features that transform data from one format to another
- Support for various file formats, such as CSV, RCFile, and RowFile (a row store file)
Extensibility
- User-defined function support
- Scanner/Appender interface for custom file formats
Compatibility
- ANSI/ISO SQL standard compliance and PostgreSQL compliance for non-standard parts
- HiveQL mode support
- Access to tables in HCatalog and Hive MetaStore
- JDBC driver support
Ease of Use
- Interactive shell to allow users to submit SQL queries to Tajo clusters
- Backup/Restore utility
- Asynchronous/Synchronous Java API to enable clients to submit SQL queries to Tajo clusters

Ref: http://tajo.apache.org/

The key difference between Tajo and Impala is the design goal. To increase
query processing performance, Impala takes an approach in which main
memory is used as much as possible and intermediate data are transferred
via streaming. If a query requires too much memory, Impala cannot process
it. Thus, Impala itself states that it is not an alternative to Hive.

Tajo, in contrast, uses query optimization that considers the user query,
the characteristics of the data, the status of the cluster, and so on.
Thus, Tajo can process a query with Impala's algorithm, Hive's algorithm,
or any other algorithm. For example, Tajo can process a join query using
either a repartition join or a merge join. Intermediate results can be
materialized to disk or kept in memory. Since Tajo builds a query plan
that takes these factors into account, it can always process user queries.
So, we can say that Tajo can be an alternative to Hive.
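
To make the two join strategies named above concrete, here is a minimal single-process sketch in Python. It is not Tajo's implementation: the function names, the (key, value) row layout, and the in-memory lists standing in for cluster nodes are all assumptions made for illustration. The point is only that a repartition (hash) join and a sort-merge join produce the same rows, so an optimizer is free to pick whichever one suits the data and the cluster.

```python
from collections import defaultdict


def repartition_join(left, right, num_partitions=4):
    """Toy repartition join: hash both inputs on the join key, then join
    each pair of partitions independently (one pair per hypothetical node)."""
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    for key, value in left:
        left_parts[hash(key) % num_partitions].append((key, value))
    for key, value in right:
        right_parts[hash(key) % num_partitions].append((key, value))

    result = []
    for p in range(num_partitions):
        # Build a hash table on the left partition, probe it with the right.
        table = defaultdict(list)
        for key, value in left_parts[p]:
            table[key].append(value)
        for key, right_value in right_parts[p]:
            for left_value in table.get(key, ()):
                result.append((key, left_value, right_value))
    return result


def merge_join(left, right):
    """Toy sort-merge join: sort both inputs on the key, then advance two
    cursors and emit the cross product of each run of equal keys."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    i = j = 0
    result = []
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    result.append((left_key, left_value, right_value))
            i, j = i_end, j_end
    return result


if __name__ == "__main__":
    orders = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
    customers = [(2, "x"), (3, "y"), (4, "z"), (2, "w")]
    # Row order may differ between strategies, so compare after sorting.
    print(sorted(repartition_join(orders, customers)))
    print(sorted(merge_join(orders, customers)))
```

The repartition variant needs memory for one hash table per partition, while the merge variant only needs sorted inputs that could be streamed from disk. That is the kind of trade-off the paragraph above says the planner weighs.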

Tajo can outperform Hive for most queries. The key reason is that Tajo
uses its own query engine, while Hive uses MapReduce. This restricts Hive
to MapReduce-based algorithms, whereas Tajo can use a more optimized
algorithm.

A sort query is a good example. Hive supports only hash partitioning.
Thus, each node sorts data locally in the map phase, and *ONE NODE* must
perform the global sort in the reduce phase.
Tajo, however, supports a sort algorithm that uses range partitioning. In
the first phase, each node sorts data locally as in Hive, but the
intermediate data are partitioned by ranges of the sort key. In the
second phase, each node sorts its own partition locally to produce the
final results. Since the intermediate data are partitioned by ranges of
the sort key, the partitions taken in range order form a correctly sorted
final result.
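
The two-phase sort described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Tajo's code: the helper names, the sampling rule used to pick range boundaries, and the plain lists standing in for cluster nodes are all assumptions.

```python
import bisect


def choose_boundaries(sorted_chunks, num_partitions):
    """Pick num_partitions - 1 split keys from a sample of the data.
    Sampling every second key is an arbitrary choice for this sketch."""
    sample = sorted(key for chunk in sorted_chunks for key in chunk[::2])
    if not sample:
        return []
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def range_partition(sorted_chunks, boundaries):
    """Send each key to the partition whose key range contains it."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for chunk in sorted_chunks:
        for key in chunk:
            partitions[bisect.bisect_right(boundaries, key)].append(key)
    return partitions


def range_partitioned_sort(chunks, num_partitions):
    """Two-phase sort: local sort, range partitioning, then a second local
    sort per partition. No single node ever sorts the whole data set."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Phase 1: every node sorts its own chunk.
    sorted_chunks = [sorted(chunk) for chunk in chunks]
    boundaries = choose_boundaries(sorted_chunks, num_partitions)
    partitions = range_partition(sorted_chunks, boundaries)
    # Phase 2: every node sorts the partition it received.
    sorted_partitions = [sorted(partition) for partition in partitions]
    # Partition i holds only keys below every key in partition i + 1, so
    # concatenating in range order gives the globally sorted result.
    return [key for partition in sorted_partitions for key in partition]


if __name__ == "__main__":
    node_data = [[5, 3, 9, 1], [8, 2, 7], [6, 4, 0, 10]]
    print(range_partitioned_sort(node_data, num_partitions=3))
```

In this sketch the second phase re-sorts each partition with `sorted`. Because every incoming run is already sorted from phase one, a real engine could merge the runs instead.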

Ref: http://mail-archives.apache.org/mod_mbox/tajo-dev/201305.mbox/%3CCACZfFK6PNE+AuNX6CQ0WD784ZxUavEykEKa-rWFMXp0xdyAHmg@mail.gmail.com%3E