HIVE - Programming Hive Reading (Chapter 1)

CHAPTER 1: Introduction

🔑1. Introduction

From the early days of the Internet’s mainstream breakout, the major search engines and ecommerce companies wrestled with ever-growing quantities of data. More recently, social networking sites experienced the same problem.

Today, many organizations realize that the data they gather is a valuable resource for understanding their customers, the performance of their business in the marketplace, and the effectiveness of their infrastructure.

The Hadoop ecosystem emerged as a cost-effective way of working with such large data sets. It imposes a particular programming model, called MapReduce, for breaking up computation tasks into units that can be distributed around a cluster of commodity, server-class hardware, thereby providing cost-effective, horizontal scalability. Underneath this computation model is a distributed file system called the Hadoop Distributed Filesystem (HDFS). The filesystem is “pluggable,” and there are now several commercial and open source alternatives.

However, a challenge remains: how do you move an existing data infrastructure to Hadoop when that infrastructure is based on traditional relational databases and the Structured Query Language (SQL)? What about the large base of SQL users, both expert database designers and administrators and casual users who use SQL to extract information from their data warehouses?

This is where Hive comes in. Hive provides an SQL dialect, called Hive Query Language (abbreviated HiveQL or just HQL), for querying data stored in a Hadoop cluster.

SQL knowledge is widespread for a reason; it’s an effective, reasonably intuitive model for organizing and using data. Mapping these familiar data operations to the low-level MapReduce Java API can be daunting, even for experienced Java developers. Hive does this dirty work for you, so you can focus on the query itself. Hive translates most queries to MapReduce jobs, thereby exploiting the scalability of Hadoop, while presenting a familiar SQL abstraction.

Hive is best suited for data warehouse applications, where relatively static data is analyzed and fast response times are not required. In short: OLAP rather than OLTP.

Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do. The biggest limitation is that (1) Hive does not provide record-level update, insert, or delete. You can generate new tables from queries or output query results to files. Also, because Hadoop is a batch-oriented system, (2) Hive queries have higher latency, due to the start-up overhead for MapReduce jobs. Queries that would finish in seconds for a traditional database take longer for Hive, even for relatively small data sets. Finally, (3) Hive does not provide transactions.

So, Hive doesn’t provide crucial features required for OLTP (Online Transaction Processing). It’s closer to being an OLAP (Online Analytic Processing) tool, but as we’ll see, Hive isn’t ideal for satisfying the “online” part of OLAP, at least today. There can be significant latency between issuing a query and receiving a reply, due both to the overhead of Hadoop and to the size of the data sets Hadoop was designed to serve.

If you need OLTP features for large-scale data, you should consider using a NoSQL database. Examples include HBase (a NoSQL database integrated with Hadoop), Cassandra, and DynamoDB.
Note [NoSQL]: NoSQL databases (aka “not only SQL”) store data differently than relational tables. The main types are document, key-value, wide-column, and graph.

So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

2. An Overview of Hadoop and MapReduce
2.1 MapReduce

MapReduce is a computing model that decomposes large data manipulation jobs into individual tasks that can be executed in parallel across a cluster of servers. The results of the tasks can be joined together to compute the final results.

The term MapReduce comes from the two fundamental data-transformation operations used, map and reduce. A map operation converts the elements of a collection from one form to another. In this case, input key-value pairs are converted to zero-to-many output key-value pairs, where the input and output keys might be completely different and the input and output values might be completely different.

In MapReduce, all the key-value pairs for a given key are sent to the same reduce operation. Specifically, the key and a collection of the values are passed to the reducer. The goal of “reduction” is to convert the collection to a value, such as summing or averaging a collection of numbers, or to another collection. A final key-value pair is emitted by the reducer. Again, the input versus output keys and values may be different. Note that if the job requires no reduction step, then it can be skipped.
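The map and reduce steps described above can be sketched in a few lines of plain Python. This is only an in-memory illustration, not the Hadoop API: `run_mapreduce`, `city_temp_mapper`, and `mean_reducer` are hypothetical names, and the sample records are made up for the example.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer=None):
    """Toy single-process model of a MapReduce job (hypothetical helper)."""
    # Map phase: each input pair yields zero or more output pairs.
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    if reducer is None:
        # Map-only job: the reduction step is skipped entirely.
        return mapped
    # Group all values that share a key before reducing.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: each key and its collected values go to one reducer call.
    return [reducer(key, values) for key, values in sorted(groups.items())]


def city_temp_mapper(line_no, line):
    # The input key (a line number) differs from the output key (a city name).
    city, temp = line.split(",")
    if temp.strip():  # emit nothing when the reading is missing
        yield city.strip(), float(temp)


def mean_reducer(city, temps):
    # Reduce the collection of values to a single value: their average.
    return city, sum(temps) / len(temps)


records = [(0, "Oslo, 4.0"), (1, "Cairo, 30.0"), (2, "Oslo, 6.0"), (3, "Cairo,")]
result = run_mapreduce(records, city_temp_mapper, mean_reducer)
print(result)  # [('Cairo', 30.0), ('Oslo', 5.0)]
```

Passing no reducer returns the mapper output unchanged, which mirrors the note that the reduction step can be skipped when a job does not need it.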

The Hadoop Distributed Filesystem, HDFS, manages data across the cluster. Each block is replicated several times (three copies is the usual default), so that no single hard drive or server failure results in data loss. Also, HDFS uses very large block sizes, typically 64 MB or multiples thereof. Such large blocks can be stored contiguously on hard drives so they can be written and read with minimal seeking of the drive heads, thereby maximizing write and read performance.
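To get a feel for the numbers, here is a rough back-of-the-envelope sketch of how many blocks and how much raw disk space one file needs. It assumes the 64 MB block size and three-way replication mentioned above; `hdfs_footprint` is a hypothetical helper, not part of any Hadoop tool.

```python
import math

BLOCK_SIZE_MB = 64  # block size mentioned in the text
REPLICATION = 3     # usual default replication factor


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (block_count, block_replicas, raw_storage_mb) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    # A partially filled last block only occupies the space of the data it
    # holds, so raw storage is the file size times the replication factor.
    return blocks, blocks * replication, file_size_mb * replication


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)  # 16 48 3000
```

So a 1,000 MB file is split into 16 blocks, stored as 48 block replicas across the cluster, and consumes about 3,000 MB of raw disk.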

Word Count Example:

  • In real scenarios, large documents might be split and each split would be sent to a separate Mapper. Also, there are techniques for combining many small documents into a single split for a Mapper. We won’t worry about those details now. (Note: this is load balancing.)
  • The key passed to the mapper is the character offset into the document at the start of the line. The corresponding value is the text of the line.
  • The mapper outputs a key-value pair, with the word as the key and the number 1 as the value.
  • Part of Hadoop’s magic is the Sort and Shuffle phase that comes next. Hadoop sorts the key-value pairs by key and it “shuffles” all pairs with the same key to the same Reducer. There are several possible techniques that can be used to decide which reducer gets which range of keys.
  • To finish the algorithm, all the reducer has to do is add up all the counts in the value collection and write a final key-value pair consisting of each word and the count for that word.
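The word count steps above can be simulated in a single Python process. This is a sketch under stated assumptions, not Hadoop code: the two-reducer setup, the function names, and the sample text are all made up, and hash partitioning (here via `zlib.crc32`) is just one of the possible ways to assign keys to reducers.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2  # assumed reducer count for this sketch


def mapper(offset, line):
    # The key (character offset of the line) is ignored; emit (word, 1) per word.
    for word in line.lower().split():
        yield word, 1


def partition(word, num_reducers=NUM_REDUCERS):
    # Hash partitioning decides which reducer owns a key. crc32 is used
    # instead of hash() so the assignment is stable across runs.
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def reducer(word, counts):
    # Add up all the counts in the value collection for this word.
    return word, sum(counts)


def word_count(document):
    # Build (offset, line) input pairs, matching the mapper input described above.
    pairs, offset = [], 0
    for line in document.splitlines(keepends=True):
        pairs.append((offset, line))
        offset += len(line)

    # Shuffle: route every (word, 1) pair to the reducer that owns the word.
    buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for off, line in pairs:
        for word, one in mapper(off, line):
            buckets[partition(word)][word].append(one)

    # Sort keys within each reducer, then reduce each key's value collection.
    output = []
    for bucket in buckets:
        output.extend(reducer(word, bucket[word]) for word in sorted(bucket))
    return dict(output)


counts = word_count("hive on hadoop\nhadoop uses mapreduce\nhive uses hadoop\n")
print(counts["hadoop"], counts["hive"], counts["uses"])  # 3 2 2
```

Because every occurrence of a word hashes to the same bucket, each reducer sees the complete value collection for the words it owns, which is what makes the final sum correct.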

3. Hive in the Hadoop Ecosystem


  • CLI, command-line interface
  • All commands and queries go to the Driver, which compiles the input, optimizes the computation required, and executes the required steps, usually with MapReduce jobs.
  • Hive communicates with the JobTracker to initiate the MapReduce job.
  • The Metastore is a separate relational database (usually a MySQL instance) where Hive persists table schemas and other system metadata.

4. Alternatives to MapReduce

Because Hadoop is a batch-oriented system, there are tools with different distributed computing models that are better suited for event stream processing, where closer to “real-time” responsiveness is required. Here we list several of the many alternatives
