Apache Tajo™ - An open source big data warehouse system in Hadoop

The main goal of the Apache Tajo™ project is to build an advanced open source data warehouse system in Hadoop for processing web-scale data sets.

Features

  • Interactive and Batch Queries
    • Fully distributed SQL query processing on large data sets stored in HDFS and other data sources
    • Very low response times (from about 100 ms) for simple queries (e.g., a plain aggregation or a small-large join) on reasonably sized data
  • Long running query support
    • Fault tolerance support that avoids restarting a query when some of its tasks fail
    • Dynamic scheduling support that handles straggling and heterogeneous cluster nodes
  • Query Optimization
    • Cost-based optimization for bushy join trees
    • Progressive query optimization for reoptimizing running queries
  • ETL
    • ETL features that transform data from one format to another
    • Support for various file formats, such as CSV, RCFile, and RowFile (a row store file)
  • Extensibility
    • User-defined function support
    • Scanner/Appender interface for custom file formats
  • Compatibility
    • ANSI/ISO SQL standard compliance and PostgreSQL compliance for non-standard parts
    • HiveQL mode support
    • Access to tables in HCatalog and Hive MetaStore
    • JDBC driver support
  • Ease of Use
    • Interactive shell to allow users to submit SQL queries to Tajo clusters
    • Backup/Restore utility
    • Asynchronous/Synchronous Java API to enable clients to submit SQL queries to Tajo clusters


The key difference between Tajo and Impala is the design goal. To increase
query processing performance, Impala uses main memory as much as possible
and transfers intermediate data via streaming. If a query requires too much
memory, Impala cannot process it. For this reason, Impala itself is
described as not being a replacement for Hive.

Tajo, by contrast, uses a query optimizer that considers the user query,
the characteristics of the data, the status of the cluster, and so on.
Tajo can therefore process a query with Impala's algorithm, Hive's
algorithm, or any other algorithm. For example, Tajo can process a join
query using either a repartition join or a merge join, and intermediate
results can be materialized to disk or kept in memory. Because Tajo builds
its query plan with these factors in mind, it can always process user
queries. So we can say that Tajo can be a replacement for Hive.
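
The idea of choosing how to handle intermediate data can be illustrated with a toy example. This is not Tajo's optimizer: the names `Plan` and `choose_plan` and the single memory-based rule are hypothetical, and a real planner would weigh many more factors. The sketch only shows a planner that falls back to disk instead of failing when the data does not fit in memory.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    # "streaming": keep intermediate data in memory and stream it onward.
    # "materialized": write intermediate data to disk between stages.
    strategy: str
    intermediate: str


def choose_plan(estimated_bytes: int, free_memory_bytes: int) -> Plan:
    """Hypothetical rule: prefer memory, fall back to disk when it will not fit."""
    if estimated_bytes <= free_memory_bytes:
        return Plan(strategy="streaming", intermediate="memory")
    # Instead of rejecting the query, pick a slower plan that can still run.
    return Plan(strategy="materialized", intermediate="disk")


small = choose_plan(estimated_bytes=10, free_memory_bytes=100)
large = choose_plan(estimated_bytes=1000, free_memory_bytes=100)
```

Here `small` ends up with an in-memory plan and `large` with a disk-backed one, so both queries can be processed.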

Tajo can outperform Hive for most queries. The key reason is that Tajo uses
its own query engine while Hive uses MapReduce. This restricts Hive to
MapReduce-based algorithms, whereas Tajo can use a more optimized
algorithm.

A sort query is a good example. Hive supports only hash partitioning.
Thus, each node sorts its data locally in the map phase, and *ONE NODE*
must perform the global sort in the reduce phase.
Tajo, however, supports a sort algorithm that uses range partitioning. In
the first phase, each node sorts its data locally as in Hive, but the
intermediate data are partitioned by ranges of the sort key. In the second
phase, each node sorts its own partition locally to produce the final
results. Since the intermediate data are partitioned by ranges of the sort
key, concatenating the partitions in range order yields a correctly sorted
result.
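
The two-phase sort described above can be sketched in a few lines of Python. This is a single-process illustration, not Tajo's implementation: each inner list stands in for the data on one node, and the function names and the sampling-based choice of split points are assumptions made for the example.

```python
import bisect


def choose_boundaries(sorted_chunks, num_partitions):
    """Pick num_partitions - 1 split points from a sample of the keys.

    Assumption: boundaries are taken as quantiles of every second key.
    """
    sample = sorted(key for chunk in sorted_chunks for key in chunk[::2])
    if not sample:
        return []
    step = len(sample) / num_partitions
    return [sample[int(i * step)] for i in range(1, num_partitions)]


def range_partition_sort(chunks, num_partitions):
    # Phase 1: every "node" sorts its own chunk locally.
    sorted_chunks = [sorted(chunk) for chunk in chunks]
    boundaries = choose_boundaries(sorted_chunks, num_partitions)

    # Shuffle: route each key to the partition whose key range contains it.
    partitions = [[] for _ in range(num_partitions)]
    for chunk in sorted_chunks:
        for key in chunk:
            partitions[bisect.bisect_right(boundaries, key)].append(key)

    # Phase 2: every "node" sorts the partition it received.
    return [sorted(partition) for partition in partitions]


def concatenate(partitions):
    # Partitions cover increasing key ranges, so no final merge is needed.
    return [key for partition in partitions for key in partition]


chunks = [[5, 3, 9], [1, 8, 2], [7, 4, 6]]
result = concatenate(range_partition_sort(chunks, num_partitions=3))
```

Because every key in one partition is no larger than any key in the next, the second phase can run on all nodes in parallel and no single node has to sort the whole data set.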
Ref:  http://mail-archives.apache.org/mod_mbox/tajo-dev/201305.mbox/%3CCACZfFK6PNE+AuNX6CQ0WD784ZxUavEykEKa-rWFMXp0xdyAHmg@mail.gmail.com%3E