Hadoop hive General

Latest recommended article published on 2024-04-26 11:43:32

macyang

Category: hive  Tags: hadoop scalability jobs mapreduce processing query

Article link: https://blog.csdn.net/macyang/article/details/7485393


What is Hive?

Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools for easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
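To make the "custom mappers and reducers" idea concrete, here is a minimal sketch in Python. It assumes the streaming-style convention in which a script receives rows as tab-separated lines and emits tab-separated lines in return. The word-count task and the function names `map_rows` and `reduce_rows` are hypothetical, chosen only for illustration.

```python
from itertools import groupby


def map_rows(lines):
    """Mapper sketch: emit "word<TAB>1" for every word in the first column."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        for word in fields[0].split():
            yield f"{word.lower()}\t1"


def reduce_rows(lines):
    """Reducer sketch: sum the counts per key.

    Assumes the input lines are already sorted (clustered) by key,
    which the framework is expected to do between the two steps.
    """
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"


# Local dry run: sorted() stands in for the shuffle/sort between map and reduce.
sample = ["Hello world\t1\n", "hello Hive\t2\n"]
mapped = sorted(map_rows(sample))
reduced = list(reduce_rows(mapped))
```

In a real script, each function would be applied to `sys.stdin` and its results printed to standard output, so the same logic can be plugged into a query as an external program.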

Hive does not require that data be read or written in a "Hive format"; there is no such thing. Hive works equally well on Thrift, control-delimited, or your own specialized data formats. Please see File Format and SerDe in the Developer Guide for details.
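As a rough illustration of what "control-delimited" means, the sketch below splits one text record on the Ctrl-A character (`\x01`), which is the default field delimiter for Hive text tables, and attaches column names at read time. This is plain Python, not Hive's SerDe interface, and the column names are made up for the example.

```python
FIELD_DELIM = "\x01"  # Ctrl-A, the default field delimiter for Hive text tables


def parse_record(line, columns):
    """Split one control-delimited line and pair each value with a column name.

    `columns` is a hypothetical schema supplied by the caller; the structure
    is applied when the data is read, not stored in the file itself.
    """
    values = line.rstrip("\n").split(FIELD_DELIM)
    if len(values) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(values)}")
    return dict(zip(columns, values))


record = parse_record("42\x01alice\x01US\n", ["id", "name", "country"])
```

The file holds only delimited bytes. The table definition decides how those bytes are interpreted, which is why the same data can be read without first being converted to a special format.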

What Hive is NOT

Hive is based on Hadoop, which is a batch processing system. As a result, Hive does not and cannot promise low latencies on queries. The paradigm here is strictly one of submitting jobs and being notified when they complete, as opposed to real-time queries. In systems such as Oracle, analysis is run on a significantly smaller amount of data but proceeds much more iteratively, with response times between iterations of less than a few minutes. In contrast, Hive query response times for even the smallest jobs can be on the order of several minutes, and larger jobs (e.g., jobs processing terabytes of data) may in general run for hours.

In summary, low-latency performance is not the top priority of Hive's design principles. What Hive values most are scalability (scaling out as more machines are added dynamically to the Hadoop cluster), extensibility (through the MapReduce framework and UDF/UDAF/UDTF), fault tolerance, and loose coupling with its input formats.

The Hive query language provides basic SQL-like operations that work on tables or partitions. These operations are:

Ability to filter rows from a table using a where clause.
Ability to select certain columns from the table using a select clause.
Ability to do equi-joins between two tables.
Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
Ability to store the results of a query into another table.
Ability to download the contents of a table to a local (e.g., NFS) directory.
Ability to store the results of a query in a Hadoop DFS directory.
Ability to manage tables and partitions (create, drop and alter).
Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.
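Several of the operations listed above (filtering with a where clause, selecting columns, an equi-join, aggregation over multiple "group by" columns, and storing a query result in another table) can be tried out with a small sketch. Note the substitution: this uses Python's built-in `sqlite3` module as a stand-in for Hive, so it shows only the shape of the queries, not Hive itself. Hive's own syntax differs in places, and the table and column names here are hypothetical.

```python
import sqlite3


def run_demo():
    """Run a few SQL-like operations against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Manage tables: create two small hypothetical tables and load sample rows.
    cur.execute("CREATE TABLE page_views (user_id INTEGER, page TEXT, country TEXT)")
    cur.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
    cur.executemany(
        "INSERT INTO page_views VALUES (?, ?, ?)",
        [(1, "/home", "US"), (1, "/docs", "US"), (2, "/home", "DE"), (3, "/home", "US")],
    )
    cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob"), (3, "cy")])

    # Filter rows with WHERE and select only certain columns.
    us_pages = cur.execute(
        "SELECT user_id, page FROM page_views "
        "WHERE country = 'US' ORDER BY user_id, page"
    ).fetchall()

    # Equi-join between two tables on user_id.
    joined = cur.execute(
        "SELECT u.name, v.page FROM page_views v "
        "JOIN users u ON v.user_id = u.user_id ORDER BY u.name, v.page"
    ).fetchall()

    # Aggregate on multiple GROUP BY columns and store the result in another table.
    cur.execute(
        "CREATE TABLE views_by_country_page AS "
        "SELECT country, page, COUNT(*) AS views "
        "FROM page_views GROUP BY country, page"
    )
    summary = cur.execute(
        "SELECT country, page, views FROM views_by_country_page ORDER BY country, page"
    ).fetchall()

    conn.close()
    return us_pages, joined, summary


us_pages, joined, summary = run_demo()
```

On a real cluster, each of these statements would be compiled into batch jobs rather than answered interactively, which is the latency trade-off described earlier.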

In short, the passage above makes the point that Hive's strength lies not in low-latency performance but in scalability, extensibility, and fault tolerance.
