Hive Queries on Tables

https://karmasphere.com/hive-queries-on-table-data


This page is a reference for the Hive queries used to analyze data in Hive.

  1. Select Syntax
  2. JOIN Syntax
  3. Lateral View Syntax
  4. UNION Syntax
  5. Subquery Syntax
  6. Sampling Syntax
  7. EXPLAIN Syntax
  8. Virtual Columns



1. Select Syntax


SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[   CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]

  • A SELECT statement can be part of a union query or a subquery of another query.

  • table_reference indicates the input to the query. It can be a regular table, a view, a join construct or a subquery.

  • Simple query. For example, the following query retrieves all columns and all rows from table t1.
SELECT * FROM t1


WHERE Clause

The WHERE condition is a boolean expression. For example, the following query returns only those sales records which have an amount greater than 10 from the US region. Hive does not support IN, EXISTS, or subqueries in the WHERE clause.

SELECT * FROM sales WHERE amount > 10 AND region = "US"




ALL and DISTINCT Clauses

The ALL and DISTINCT options specify whether duplicate rows should be returned. If neither option is given, the default is ALL (all matching rows are returned). DISTINCT specifies removal of duplicate rows from the result set.
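
For example (a minimal illustration; col1 and col2 are assumed columns of t1):

SELECT col1, col2 FROM t1
SELECT DISTINCT col1, col2 FROM t1
SELECT DISTINCT col1 FROM t1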

Partition Based Queries

In general, a SELECT query scans the entire table (other than for sampling). If a table is created using the PARTITIONED BY clause, a query can do input pruning and scan only the fraction of the table relevant to the query. Hive currently does input pruning only if the partition predicates are specified in the WHERE clause closest to the table_reference in the FROM clause. For example, if table page_views is partitioned on column date, the following query retrieves rows for just the days between 2008-03-01 and 2008-03-31.

SELECT page_views.*
FROM page_views 
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'


HAVING Clause

Hive currently does not support the HAVING clause. A similar effect can be achieved by using a subquery. For example,

SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10

can also be expressed as

SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 WHERE t2.col2sum > 10




LIMIT Clause

Limit indicates the number of rows to be returned. The rows returned are chosen at random. The following query returns 5 rows from t1 at random.

SELECT * FROM t1 LIMIT 5
  • Top k queries. The following query returns the top 5 sales records with respect to amount.
SET mapred.reduce.tasks = 1
SELECT * FROM sales SORT BY amount DESC LIMIT 5


REGEX Column Specification

A SELECT statement can take a regex-based column specification. For example, the following query selects all columns except ds and hr:

SELECT `(ds|hr)?+.+` FROM sales


Group By Syntax

groupByClause: GROUP BY groupByExpression (, groupByExpression)*

groupByExpression: expression

groupByQuery: SELECT expression (, expression)* FROM src groupByClause?



Simple Examples

In order to count the number of rows in a table, use count(1) or count(*):

SELECT COUNT(1) FROM table2;
SELECT COUNT(*) FROM table2;


In order to count the number of distinct users by gender one could write the following query:

 INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count (DISTINCT pv_users.userid)
  FROM pv_users
  GROUP BY pv_users.gender;


Multiple aggregations can be done at the same time; however, no two aggregations can have different DISTINCT columns. E.g., the following is possible:

 INSERT OVERWRITE TABLE pv_gender_agg
  SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(1), sum(DISTINCT pv_users.userid)
  FROM pv_users
  GROUP BY pv_users.gender;


However, the following query is not allowed, because it uses DISTINCT expressions over two different columns in the same query.

  INSERT OVERWRITE TABLE pv_gender_agg
  SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
  FROM pv_users
  GROUP BY pv_users.gender;



Advanced Features

Multi-Group-By Inserts

The output of the aggregations or simple selects can be further sent into multiple tables or even to Hadoop DFS files (which can then be manipulated using HDFS utilities). For example, if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:

  FROM pv_users
  INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.gender
  INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
    SELECT pv_users.age, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.age;



Map-side Aggregation for Group By

hive.map.aggr controls how we do aggregations. The default is false. If it is set to true, Hive will do the first-level aggregation directly in the map task. This usually provides better efficiency, but may require more memory to run successfully.

  SET hive.map.aggr=true;
  SELECT COUNT(1) FROM table2;



Syntax of Order By

The ORDER BY syntax in Hive QL is similar to the syntax of ORDER BY in SQL language.

colOrder: ( ASC | DESC )
orderBy: ORDER BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src orderBy


There are some limitations in the "order by" clause. In the strict mode (i.e., hive.mapred.mode=strict), the order by clause has to be followed by a "limit" clause. The limit clause is not necessary if you set hive.mapred.mode to nonstrict. The reason is that in order to impose total order of all results, there has to be one reducer to sort the final output. If the number of rows in the output is too large, the single reducer could take a very long time to finish.
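
For example, the following sketch shows the strict-mode behavior, reusing the sales table from the WHERE clause example above:

SET hive.mapred.mode=strict;
-- rejected in strict mode: ORDER BY without a LIMIT clause
SELECT * FROM sales ORDER BY amount DESC;
-- accepted: the LIMIT bounds the work done by the single reducer
SELECT * FROM sales ORDER BY amount DESC LIMIT 100;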


Syntax of Sort By

The SORT BY syntax is similar to the syntax of ORDER BY in SQL language.

colOrder: ( ASC | DESC )
sortBy: SORT BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src sortBy


Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.


Difference between Sort By and Order By

Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.

Basically, the data in each reducer will be sorted according to the order that the user specified. The following example shows a query using SORT BY with two sort columns:

SELECT key, value FROM src SORT BY key ASC, value DESC


The query had 2 reducers, and the output of each is:

0   5
0   3
3   6
9   1


0   4
0   3
1   1
2   5


Setting Types for Sort By

After a transform, variable types are generally considered to be strings, meaning that numeric data will be sorted lexicographically. To overcome this, a second SELECT statement with casts can be used before using SORT BY.

FROM (FROM (FROM src
            SELECT TRANSFORM(value)
            USING 'mapper'
            AS value, count) mapped
      SELECT cast(value AS double) AS value, cast(count AS int) AS count
      SORT BY value, count) sorted
SELECT TRANSFORM(value, count)
USING 'reducer'
AS whatever


Syntax of Cluster By and Distribute By

Cluster By and Distribute By are used mainly with the Transform/Map-Reduce scripts. But they are sometimes useful in SELECT statements if there is a need to partition and sort the output of a query for subsequent queries.

Cluster By is a short-cut for both Distribute By and Sort By.

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.

For example, we are Distributing By x on the following 5 rows to 2 reducers:

x1
x2
x4
x3
x1

Reducer 1 got

x1
x2
x1

Reducer 2 got

x4
x3

Note that all rows with the same key x1 are guaranteed to be distributed to the same reducer (reducer 1 in this case), but they are not guaranteed to be clustered in adjacent positions.

In contrast, if we use Cluster By x, the two reducers will further sort rows on x:

Reducer 1 got

x1
x1
x2

Reducer 2 got

x3
x4

Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.

SELECT col1, col2 FROM t1 CLUSTER BY col1

SELECT col1, col2 FROM t1 DISTRIBUTE BY col1

SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC

  FROM (
    FROM pv_users
    MAP ( pv_users.userid, pv_users.date )
    USING 'map_script'
    AS c1, c2, c3
    DISTRIBUTE BY c2
    SORT BY c2, c1) map_output
  INSERT OVERWRITE TABLE pv_users_reduced
    REDUCE ( map_output.c1, map_output.c2, map_output.c3 )
    USING 'reduce_script'
    AS date, count;


Transform/Map-Reduce Syntax

Users can also plug in their own custom mappers and reducers in the data stream by using features natively supported in the Hive language. For example, in order to run a custom mapper script (map_script) and a custom reducer script (reduce_script), the user can issue the following command, which uses the TRANSFORM clause to embed the mapper and the reducer scripts.

By default, columns will be transformed to STRING and delimited by TAB before feeding to the user script; similarly, all NULL values will be converted to the literal string \N in order to differentiate NULL values from empty strings. The standard output of the user script will be treated as TAB-separated STRING columns, any cell containing only \N will be re-interpreted as a NULL, and then the resulting STRING column will be cast to the data type specified in the table declaration in the usual way. User scripts can output debug information to standard error which will be shown on the task detail page on hadoop. These defaults can be overridden with ROW FORMAT...
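
As a minimal illustration of these defaults (a sketch, assuming the src table with columns key and value used elsewhere on this page), the following identity transform streams each row through /bin/cat and reads the tab-separated output back as two STRING columns:

SELECT TRANSFORM(key, value)
USING '/bin/cat'
AS (k STRING, v STRING)
FROM src;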

In the syntax, both MAP ... and REDUCE ... can also be written as SELECT TRANSFORM ( ... ). There is actually no difference between the three. Hive runs the reduce script in the reduce task (instead of the map task) because of the clusterBy/distributeBy/sortBy clause in the inner query.

Please also see Sort By / Cluster By / Distribute By and Larry Ogrodnek's blog post.

clusterBy: CLUSTER BY colName (',' colName)*
distributeBy: DISTRIBUTE BY colName (',' colName)*
sortBy: SORT BY colName (ASC | DESC)? (',' colName (ASC | DESC)?)*

rowFormat
  : ROW FORMAT
    (DELIMITED [FIELDS TERMINATED BY char]
               [COLLECTION ITEMS TERMINATED BY char]
               [MAP KEYS TERMINATED BY char]
               [ESCAPED BY char]
               [LINES SEPARATED BY char]
     |
     SERDE serde_name [WITH SERDEPROPERTIES
                            property_name=property_value,
                            property_name=property_value, ...])

outRowFormat : rowFormat
inRowFormat : rowFormat
outRecordReader : RECORDREADER className

query:
  FROM (
    FROM src
    MAP expression (',' expression)*
    (inRowFormat)?
    USING 'my_map_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
    ( clusterBy? | distributeBy? sortBy? ) src_alias
  )
  REDUCE expression (',' expression)*
    (inRowFormat)?
    USING 'my_reduce_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?

  FROM (
    FROM src
    SELECT TRANSFORM '(' expression (',' expression)* ')'
    (inRowFormat)?
    USING 'my_map_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
    ( clusterBy? | distributeBy? sortBy? ) src_alias
  )
  SELECT TRANSFORM '(' expression (',' expression)* ')'
    (inRowFormat)?
    USING 'my_reduce_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?


Example #1:

  FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script'
    AS dt, uid
    CLUSTER BY dt) map_output
  INSERT OVERWRITE TABLE pv_users_reduced
    REDUCE map_output.dt, map_output.uid
    USING 'reduce_script'
    AS date, count;

or equivalently, using SELECT TRANSFORM:

  FROM (
    FROM pv_users
    SELECT TRANSFORM(pv_users.userid, pv_users.date)
    USING 'map_script'
    AS dt, uid
    CLUSTER BY dt) map_output
  INSERT OVERWRITE TABLE pv_users_reduced
    SELECT TRANSFORM(map_output.dt, map_output.uid)
    USING 'reduce_script'
    AS date, count;



Example #2

  FROM (
    FROM src
    SELECT TRANSFORM(src.key, src.value) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
    USING '/bin/cat'
    AS (tkey, tvalue) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
    RECORDREADER 'org.apache.hadoop.hive.ql.exec.TypedBytesRecordReader'
  ) tmap
  INSERT OVERWRITE TABLE dest1 SELECT tkey, tvalue


Schema-less Map-reduce Scripts

If there is no AS clause after USING my_script, Hive assumes that the output of the script contains 2 parts: key which is before the first tab, and value which is the rest after the first tab. Note that this is different from specifying AS key, value because in that case, value will only contain the portion between the first tab and the second tab if there are multiple tabs.

Note that we can directly do CLUSTER BY key without specifying the output schema of the scripts.

  FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script'
    CLUSTER BY key) map_output
  INSERT OVERWRITE TABLE pv_users_reduced
    REDUCE map_output.key, map_output.value
    USING 'reduce_script'
    AS date, count;


Typing the output of TRANSFORM

The output fields from a script are typed as strings by default; for example in

  SELECT TRANSFORM(stuff)
  USING 'script'
  AS thing1, thing2


They can be cast immediately with the following syntax:

  SELECT TRANSFORM(stuff)
  USING 'script'
  AS (thing1 INT, thing2 INT)




2. JOIN Syntax

Hive supports the following syntax for joining tables:

join_table:
    table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition

table_reference:
    table_factor
  | join_table

table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )

join_condition:
    ON equality_expression ( AND equality_expression )*

equality_expression:
    expression = expression
 


Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a MapReduce job. Also, more than two tables can be joined in Hive.

Some salient points to consider when writing join queries are as follows:

  • Only equality joins are allowed e.g.
  SELECT a.* FROM a JOIN b ON (a.id = b.id)

  SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)

are both valid joins, however

  SELECT a.* FROM a JOIN b ON (a.id <> b.id)


is NOT allowed

  • More than two tables can be joined in the same query. For example, the following is a valid JOIN.
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)


Hive converts joins over multiple tables into a single MapReduce job if for every table the same column is used in the join clauses e.g.

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)


is converted into a single MapReduce job, as only the key1 column of b is involved in the join. On the other hand,

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)


is converted into two MapReduce jobs because key1 column from b is used in the first join condition and key2 column from b is used in the second one. The first MapReduce job joins a with b and the results are then joined with c in the second MapReduce job.

In every MapReduce stage of the join, the last table in the sequence is streamed through the reducers whereas the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest table appears last in the sequence. For example, in


SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)


all three tables are joined in a single MapReduce job, and the values for a particular value of the key for tables a and b are buffered in memory in the reducers. Then, for each row retrieved from c, the join is computed with the buffered rows. Similarly, for

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)


there are two MapReduce jobs involved in computing the join. The first of these joins a with b and buffers the values of a while streaming the values of b in the reducers. The second job buffers the results of the first join while streaming the values of c through the reducers.

In every MapReduce stage of the join, the table to be streamed can be specified via a hint. e.g. in

SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM 
 a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)


all three tables are joined in a single MapReduce job, and the values for a particular value of the key for tables b and c are buffered in memory in the reducers. Then, for each row retrieved from a, the join is computed with the buffered rows.

  • LEFT, RIGHT, and FULL OUTER joins exist in order to provide more control over ON clauses for which there is no match. For example, this query:
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)

  • will return a row for every row in a. This output row will be a.val,b.val when there is a b.key that equals a.key, and the output row will be a.val,NULL when there is no corresponding b.key. Rows from b which have no corresponding a.key will be dropped. The syntax "FROM a LEFT OUTER JOIN b" must be written on one line in order to understand how it works--a is to the LEFT of b in this query, and so all rows from a are kept; a RIGHT OUTER JOIN will keep all rows from b, and a FULL OUTER JOIN will keep all rows from a and all rows from b. OUTER JOIN semantics should conform to standard SQL specs.
  • Joins occur BEFORE WHERE CLAUSES. So, if you want to restrict the OUTPUT of a join, a requirement should be in the WHERE clause, otherwise it should be in the JOIN clause. A big point of confusion for this issue is partitioned tables:
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key) 
 WHERE a.ds='2009-07-07' AND b.ds='2009-07-07'


will join a on b, producing a list of a.val and b.val. The WHERE clause, however, can also reference other columns of a and b that are in the output of the join, and then filter them out. However, whenever a row from the JOIN has found a key for a and no key for b, all of the columns of b will be NULL, including the ds column. This is to say, you will filter out all rows of join output for which there was no valid b.key, and thus you have outsmarted your LEFT OUTER requirement. In other words, the LEFT OUTER part of the join is irrelevant if you reference any column of b in the WHERE clause. Instead, when OUTER JOINing, use this syntax:

  SELECT a.val, b.val FROM a LEFT OUTER JOIN b
  ON (a.key=b.key AND b.ds='2009-07-07' AND a.ds='2009-07-07')


the result is that the output of the join is pre-filtered, and you won't get post-filtering trouble for rows that have a valid a.key but no matching b.key. The same logic applies to RIGHT and FULL joins.

  • Joins are NOT commutative! Joins are left-associative regardless of whether they are LEFT or RIGHT joins.
  SELECT a.val1, a.val2, b.val, c.val
  FROM a
  JOIN b ON (a.key = b.key)
  LEFT OUTER JOIN c ON (a.key = c.key)
 


This statement first joins a on b, throwing away everything in a or b that does not have a corresponding key in the other table. The reduced table is then joined on c. This provides unintuitive results if there is a key that exists in both a and c but not b: The whole row (including a.val1, a.val2, and a.key) is dropped in the "a JOIN b" step because it is not in b. The result does not have a.key in it, so when it is LEFT OUTER JOINed with c, c.val does not make it in because there is no c.key that matches an a.key (because that row from a was removed). Similarly, if this were a RIGHT OUTER JOIN (instead of LEFT), we would end up with an even weirder effect: NULL, NULL, NULL, c.val, because even though we specified a.key=c.key as the join key, we dropped all rows of a that did not match the first JOIN.

To achieve the more intuitive effect, we should instead do FROM c LEFT OUTER JOIN a ON (c.key = a.key) LEFT OUTER JOIN b ON (c.key = b.key).
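
Written out in full, the reordered query would look like the following sketch (same tables and columns as above):

SELECT a.val1, a.val2, b.val, c.val
FROM c
LEFT OUTER JOIN a ON (c.key = a.key)
LEFT OUTER JOIN b ON (c.key = b.key)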

  • LEFT SEMI JOIN implements the correlated IN/EXISTS subquery semantics in an efficient way. Since Hive currently does not support IN/EXISTS subqueries, you can rewrite your queries using LEFT SEMI JOIN. The restriction of using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition (ON clause), but not in WHERE or SELECT clauses, etc.
  SELECT a.key, a.value
  FROM a
  WHERE a.key IN
   (SELECT b.key FROM b);


can be rewritten to:

   SELECT a.key, a.value
   FROM a LEFT SEMI JOIN b ON (a.key = b.key)


If all but one of the tables being joined are small, the join can be performed as a map only job. The query

  SELECT /*+ MAPJOIN(b) */ a.key, a.value
  FROM a JOIN b ON a.key = b.key


does not need a reducer. For every mapper of A, B is read completely. The restriction is that a FULL/RIGHT OUTER JOIN b cannot be performed.

  • If the tables being joined are bucketized, and the buckets are a multiple of each other, the buckets can be joined with each other. If table A has 8 buckets and table B has 4 buckets, the following join
  SELECT /*+ MAPJOIN(b) */ a.key, a.value
  FROM a JOIN b ON a.key = b.key


can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required buckets are fetched. For the query above, the mapper processing bucket 1 of A will only fetch bucket 1 of B. This is not the default behavior; it is governed by the following parameter:

  SET hive.optimize.bucketmapjoin = true;

If the tables being joined are sorted and bucketized, and the number of buckets is the same, a sort-merge join can be performed. The corresponding buckets are joined with each other at the mapper. If both A and B have 4 buckets,

  SELECT /*+ MAPJOIN(b) */ a.key, a.value
  FROM A a JOIN B b ON a.key = b.key


can be done on the mapper only. The mapper for the bucket for A will traverse the corresponding bucket for B. This is not the default behavior, and the following parameters need to be set:

SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;


3. Lateral View Syntax

lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*


A lateral view is used in conjunction with user-defined table-generating functions (UDTFs) such as explode(). Note: a UDTF generates one or more output rows for each input row. A lateral view first applies the UDTF to each row of the base table and then joins the resulting output rows to the input rows to form a virtual table having the supplied table alias.


Important note

Currently, lateral view does not support the predicate push-down optimization. If you use a WHERE clause, your query may not compile. Look for the fix to come out at https://issues.apache.org/jira/browse/HIVE-1056

Until then, try adding SET hive.optimize.ppd=false; before your query.
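
For example (a sketch that reuses the pageAds table introduced below, with an illustrative filter on the exploded column):

SET hive.optimize.ppd=false;
SELECT pageid, adid
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid
WHERE adid = 3;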


Example

Consider the following base table named pageAds. It has two columns: pageid (name of the page) and adid_list (an array of ads appearing on the page):

pageid (string)     adid_list (Array<int>)
"front_page"        [1, 2, 3]
"contact_page"      [3, 4, 5]

and the user would like to count the total number of times an ad appears across all pages.

A lateral view with explode() can be used to convert adid_list into separate rows using the query:

SELECT pageid, adid
        FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;

The resulting output will be

pageid (string)     adid (int)
"front_page"        1
"front_page"        2
"front_page"        3
"contact_page"      3
"contact_page"      4
"contact_page"      5

Then in order to count the number of times a particular ad appears, count/group by can be used:

SELECT adid, count(1)
        FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid
GROUP BY adid;


adid (int)    count(1)
1             1
2             1
3             2
4             1
5             1


Multiple Lateral Views

A FROM clause can have multiple LATERAL VIEW clauses. Subsequent LATERAL VIEW clauses can reference columns from any of the tables appearing to the left of the LATERAL VIEW.

For example, the following could be a valid query:

SELECT * FROM exampleTable
        LATERAL VIEW explode(col1) myTable1 AS myCol1
        LATERAL VIEW explode(myCol1) myTable2 AS myCol2;


LATERAL VIEW clauses are applied in the order that they appear. For example with the following base table:

col1 (Array<int>)    col2 (Array<string>)
[1, 2]               ["a", "b", "c"]
[3, 4]               ["d", "e", "f"]

The query:


SELECT myCol1, col2 FROM baseTable
        LATERAL VIEW explode(col1) myTable1 AS myCol1;

Will produce:

myCol1 (int)    col2 (Array<string>)
1               ["a", "b", "c"]
2               ["a", "b", "c"]
3               ["d", "e", "f"]
4               ["d", "e", "f"]

A query that adds an additional LATERAL VIEW:

SELECT myCol1, myCol2 FROM baseTable
        LATERAL VIEW explode(col1) myTable1 AS myCol1
        LATERAL VIEW explode(col2) myTable2 AS myCol2;


Will produce:

myCol1 (int)    myCol2 (string)
1               "a"
1               "b"
1               "c"
2               "a"
2               "b"
2               "c"
3               "d"
3               "e"
3               "f"
4               "d"
4               "e"
4               "f"



4. UNION Syntax

select_statement UNION ALL select_statement UNION ALL select_statement ...


UNION is used to combine the results from multiple SELECT statements into a single result set. Hive currently only supports UNION ALL (bag union), i.e., duplicates are not eliminated. The number and names of columns returned by each select_statement have to be the same; otherwise, a schema error is thrown.

If some additional processing has to be done on the result of the UNION, the entire statement expression can be embedded in a FROM clause like below:

SELECT *
FROM (
  select_statement
  UNION ALL
  select_statement
) unionResult

For example, if we suppose there are two different tables that track which user has published a video and which user has published a comment, the following query joins the results of a union all with the user table to create a single annotated stream for all the video publishing and comment publishing events:

    SELECT u.id, actions.date
    FROM (
        SELECT av.uid AS uid
        FROM action_video av
        WHERE av.date = '2008-06-03'
        UNION ALL
        SELECT ac.uid AS uid
        FROM action_comment ac
        WHERE ac.date = '2008-06-03'
     ) actions JOIN users u ON (u.id = actions.uid) 



5. Subquery Syntax

SELECT ... FROM (subquery) name ...


Hive supports subqueries only in the FROM clause. The subquery has to be given a name because every table in a FROM clause must have a name. Columns in the subquery select list must have unique names. The columns in the subquery select list are available in the outer query just like columns of a table. The subquery can also be a query expression with UNION. Hive supports arbitrary levels of sub-queries.

Example with simple subquery:

SELECT col
FROM (
  SELECT a+b AS col
  FROM t1
) t2

Example with subquery containing a UNION ALL:

SELECT t3.col
FROM (
  SELECT a+b AS col
  FROM t1
  UNION ALL
  SELECT c+d AS col
  FROM t2
) t3


6. Sampling Syntax

table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])


The TABLESAMPLE clause allows the users to write queries for samples of the data instead of the whole table. The TABLESAMPLE clause can be added to any table in the FROM clause. The buckets are numbered starting from 1. colname indicates the column on which to sample each row in the table. colname can be one of the non-partition columns in the table or rand() indicating sampling on the entire row instead of an individual column. The rows of the table are 'bucketed' on the colname randomly into y buckets numbered 1 through y. Rows which belong to bucket x are returned.

In the following example, the 3rd bucket out of the 32 buckets of the table source is chosen. 's' is the table alias.

SELECT *
FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;

Input pruning: Typically, TABLESAMPLE will scan the entire table and fetch the sample. But, that is not very efficient. Instead, the table can be created with a CLUSTERED BY clause which indicates the set of columns on which the table is hash-partitioned/clustered on. If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table.

For example, if the table 'source' in the query above was created with 'CLUSTERED BY id INTO 32 BUCKETS', then

    TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)


would pick out the 3rd and 19th clusters as each bucket would be composed of (32/16)=2 clusters.

On the other hand the tablesample clause:

    TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)


would pick out half of the 3rd cluster as each bucket would be composed of (32/64)=1/2 of a cluster.
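
Putting input pruning together, the bucketed table could be declared and sampled as in the following sketch (the val column is an assumption for illustration):

CREATE TABLE source (id INT, val STRING)
CLUSTERED BY (id) INTO 32 BUCKETS;

SELECT s.id, s.val
FROM source TABLESAMPLE(BUCKET 3 OUT OF 16 ON id) s;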

7. EXPLAIN Syntax

Hive provides an EXPLAIN command that shows the execution plan for a query. The syntax for this statement is as follows:

EXPLAIN [EXTENDED] query


The use of EXTENDED in the EXPLAIN statement produces extra information about the operators in the plan. This is typically physical information like file names.
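
For example (a minimal sketch, reusing the src table):

EXPLAIN EXTENDED
SELECT * FROM src;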

A Hive query gets converted into a sequence (it is more accurately a directed acyclic graph) of stages. These stages may be MapReduce stages, or they may be stages that perform metastore or file system operations like move and rename. The EXPLAIN output consists of three parts:

  • The Abstract Syntax Tree for the query
  • The dependencies between the different stages of the plan
  • The description of each of the stages

The description of the stages itself shows a sequence of operators with the metadata associated with the operators. The metadata may include things like filter expressions for the FilterOperator, the select expressions for the SelectOperator, or the output file names for the FileSinkOperator.

As an example, consider the following EXPLAIN query:

EXPLAIN
FROM src INSERT OVERWRITE TABLE dest_g1 SELECT src.key, sum(substr(src.value,4)) GROUP BY src.key;

The output of this statement contains the following parts:

  • The Abstract Syntax Tree
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF src)) 
    (TOK_INSERT (TOK_DESTINATION (TOK_TAB dest_g1)) 
      (TOK_SELECT (TOK_SELEXPR (TOK_COLREF src key)) 
        (TOK_SELEXPR (TOK_FUNCTION sum (TOK_FUNCTION substr (TOK_COLREF src value) 4)))) 
        (TOK_GROUPBY (TOK_COLREF src key))
     )
   )


STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2
  • This shows that Stage-1 is the root stage, Stage-2 is executed after Stage-1 is done and Stage-0 is executed after Stage-2 is done.
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        src
            Reduce Output Operator
              key expressions:
                    expr: key
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: rand()
                    type: double
              tag: -1
              value expressions:
                    expr: substr(value, 4)
                    type: string
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(UDFToDouble(VALUE.0))
          keys:
                expr: KEY.0
                type: string
          mode: partial1
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.mapred.SequenceFileOutputFormat
                name: binary_table

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        /tmp/hive-zshao/67494501/106593589.10001
          Reduce Output Operator
            key expressions:
                  expr: 0
                  type: string
            sort order: +
            Map-reduce partition columns:
                  expr: 0
                  type: string
            tag: -1
            value expressions:
                  expr: 1
                  type: double
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE.0)
          keys:
                expr: KEY.0
                type: string
          mode: final
          Select Operator
            expressions:
                  expr: 0
                  type: string
                  expr: 1
                  type: double
            Select Operator
              expressions:
                    expr: UDFToInteger(0)
                    type: int
                    expr: 1
                    type: double
              File Output Operator
                compressed: false
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
                    name: dest_g1

  Stage: Stage-0
    Move Operator
      tables:
            replace: true
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
                name: dest_g1
  • In this example there are 2 MapReduce stages (Stage-1 and Stage-2) and 1 File System related stage (Stage-0). Stage-0 basically moves the results from a temporary directory to the directory corresponding to the table dest_g1.

A MapReduce stage itself comprises of 2 parts:

  • A mapping from table alias to Map Operator Tree - This mapping tells the mappers which operator tree to call in order to process the rows from a particular table or the result of a previous MapReduce stage. In Stage-1 in the above example, the rows from the src table are processed by the operator tree rooted at a Reduce Output Operator. Similarly, in Stage-2 the rows of the results of Stage-1 are processed by another operator tree rooted at another Reduce Output Operator. Each of these Reduce Output Operators partitions the data to the reducers according to the criteria shown in the metadata.
  • A Reduce Operator Tree - This is the operator tree which processes all the rows on the reducer of the MapReduce job. In Stage-1, for example, the Reduce Operator Tree is carrying out a partial aggregation, whereas the Reduce Operator Tree in Stage-2 computes the final aggregation from the partial aggregates computed in Stage-1.



8. Virtual Columns

Hive currently supports only two simple virtual columns:

One is INPUT__FILE__NAME, which is the input file's name for a mapper task.

The other is BLOCK__OFFSET__INSIDE__FILE, which is the current global file position.

For a block-compressed file, it is the current block's file offset, i.e., the file offset of the current block's first byte.


Simple Examples

SELECT INPUT__FILE__NAME, key, BLOCK__OFFSET__INSIDE__FILE FROM src;
SELECT key, count(INPUT__FILE__NAME) FROM src GROUP BY key ORDER BY key;
SELECT * FROM src WHERE BLOCK__OFFSET__INSIDE__FILE > 12000 ORDER BY key;

Note: sections of this reference material are derived from the Apache Hive project, specifically from its wiki.
