select … from clause:
1.1array类型:
hive> SELECT name, subordinates FROM employees;
John Doe ["Mary Smith","Todd Jones"]
Todd Jones []
--( ‘subordinates’为array类型,并且字符串带有双引号。)
hive> SELECT name, subordinates[0] FROM employees;
John Doe Mary Smith
Todd Jones NULL
--(可以和java一样进行索引选择,此时字符串没有双引号。空值返回NULL)
1.2 map类型:
hive> SELECT name, deductions FROM employees;
John Doe {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
hive> SELECT name, deductions["State Taxes"] FROM mployees;
John Doe 0.05
--(’deductions’为map类型,也可以和Java一样对map通过键找到值)
1.3 struct类型:
hive> SELECT name, address FROM employees;
John Doe {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
hive> SELECT name, address.city FROM employees;
John Doe Chicago
--(struct类型的成员引用和c的一样。)
2.LIKE 和RLIKE区别:
(1)LIKE只支持‘_’和‘%’,其中前者代表匹配一位占位符,或者可以是一或多个占位符。
(2)RLIKE是regular like的缩写,顾名思义,其支持和Java一样的正则表达式,比like功能更强大。
3.在select中使用聚集函数时记得设置:
hive> SET hive.map.aggr=true;
hive> SELECT count(*), avg(salary) FROM employees;
hive> SELECT count(DISTINCT symbol) FROM stocks;
4.Table generating functions
(1)生成单列表:
hive> SELECT explode(subordinates) AS sub FROM employees;
Mary Smith
Todd Jones
(2)生成多列表:
SELECT parse_url_tuple(url, 'HOST', 'PATH', 'QUERY') as (host, path, query)
FROM url_table;
--(其他生成功能函数见page 87 of programming hive )
5. 嵌套查询:
hive> FROM (
> SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
> FROM employees
> ) e
> SELECT e.name, e.salary_minus_fed_taxes
> WHERE e.salary_minus_fed_taxes > 70000 limit 2;
JOHN DOE 100000.0 0.2 80000
--(列的alias用‘as’,表的alias不用‘as’)
6.CASE …. WHEN字句:
hive> SELECT name, salary,
> CASE
> WHEN salary < 50000.0 THEN 'low'
> WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
> WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
> ELSE 'very high'
> END AS bracket FROM employees;
John Doe 100000.0 very high
Mary Smith 80000.0 high
--(CASE… WHEN…THEN… ELSE …END 相当于if … else…字句)
7. When Hive Can Avoid MapReduce
set hive.exec.mode.local.auto=true;
--( Hive will attempt to run other operations in local mode,比如:
Select * from aa.)
--(同样适用于分区表:SELECT * FROM employees
WHERE country = 'US' AND state = 'CA'
LIMIT 100;)
8. Gotchas with Floating-Point Comparisons
hive> SELECT name, salary, deductions['Federal Taxes']
> FROM employees WHERE deductions['Federal Taxes'] > 0.2;
John Doe 100000.0 0.2
Boss Man 200000.0 0.3
--(deductions['Federal Taxes']为FLOAT类型,而hive会自动将小数(0.2)变为double类型)
--(In this particular case, the closest exact value is just slightly greater than 0.2, with a few nonzero bits at the least significant end of the number.
To simplify things a bit, let’s say that 0.2 is actually 0.2000001 for FLOAT and 0.200000000001 for DOUBLE, because an 8-byte DOUBLEhas more significant digits (after the decimal point). When the FLOATvalue from the table is converted to DOUBLEby Hive, it produces the DOUBLEvalue 0.200000100000, which is greater than 0.200000000001. That’s why the query results appear to use >=not >!)
--(解决方案:1.将deductions['Federal Taxes']以字符串形式读出再转化为double类型比较;
2. 可以将自己写的小数(0.2)用cast(0.2 AS FLOAT)函数转换到float类型再比较)
9.Having和Group by子句
hive> SELECT year(ymd), avg(price_close) FROM stocks
> WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
> GROUP BY year(ymd)
> HAVING avg(price_close) > 50.0;
1987 53.88968399108163
1991 52.49553383386182
--(GROUP BY要结合聚合函数使用)
10. JOIN Statements
10.1 hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';
1987-05-11 AAPL 77.0 0.015
1987-08-10 AAPL 48.25 0.015
--(Hive supports the classic SQL JOIN statement, but only equi-joins are supported.)
--(Also, Hive does not currently support using OR between predicates in ON clauses)
10.2多个表的join:
hive> SELECT a.ymd, a.price_close, b.price_close , c.price_close
> FROM stocks a JOIN stocks b ON a.ymd = b.ymd
> JOIN stocks c ON a.ymd = c.ymd
> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';
2010-01-04 214.01 132.45 15.45
2010-01-05 214.38 130.85 15.53
--(Most of the time, Hive will use a separate MapReduce job for each pair of things to join. In this example, it would use one job for tables a and b, then a second job to join the output of the first join with c. Why not join band c first? Hive goes from left to right.)
10.3 Join Optimizations
在多个表进行join时,hive会缓存小表,之后会把大表给stream化.
Hive also assumes that the lasttable in the query is the largest. It attempts to buffer the other tables and then stream the last table through, while performing joins on individual records. Therefore, you should structure your join queries so the largest table is last.
Fortunately, you don’t have to put the largest table last in the query. Hive also provides a “hint” mechanism to tell the query optimizer which table should be streamed:
SELECT /*+ STREAMTABLE(s) */ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
10.4 LEFT OUTER JOIN
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';
...
1987-05-01 AAPL 80.0 NULL
1987-05-11 AAPL 77.0 0.015
10.5 关于outer join 所要注意的:
You might wonder if you can move the predicates from the WHERE clause into the ON clause, at least the partition filters. This does not work for outer joins, despite documentation on the Hive Wiki that claims it should work.
However, using such filter predicates in ON clauses for inner joins does work!
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d
> ON s.ymd = d.ymd AND s.symbol = d.symbol
> AND s.symbol = 'AAPL' AND s.exchange = 'NASDAQ' AND d.exchange = 'NASDAQ';
1962-01-02 GE 74.75 NULL
1962-01-02 IBM 572.0 NULL
1962-01-03 GE 74.0 NULL
1962-01-03 IBM 577.0 NULL
10.6 outer join的其他类型:
RIGHT OUTER JOIN
Right-outer joins return all records in the right hand table that match the WHERE clause. NULL is used for fields of missing records in the left hand table.
FULL OUTER JOIN
Finally, a full-outer join returns all records from all tables that match the WHERE clause. NULL is used for fields in missing records in either table.
LEFT SEMI-JOIN
A left semi-join returns records from the left hand table if records are found in the right hand table that satisfy the ON predicates.相当于MySQL里的IN语句,MySQL语句实例如下:.
Example 6-2. Query that will not work in Hive
SELECT s.ymd, s.symbol, s.price_close FROM stocks s
WHERE s.ymd, s.symbol IN
(SELECT d.ymd, d.symbol FROM dividends d);
Instead, you use the following LEFT SEMI JOIN syntax:
hive> SELECT s.ymd, s.symbol, s.price_close
> FROM stocks s LEFT SEMI JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol;
1962-08-07 IBM 373.25
1962-05-08 IBM 459.5
--( Note that the SELECT and WHERE clauses can’t reference columns from the right hand table.)
--( 注意: Right semi-joins are not supported in Hive.)
Cartesian Product JOINs
笛卡尔积跟inner join相似,只是没有on 子句.
hive > SELECT * FROM stocks JOIN dividends
> WHERE stock.symbol = dividends.symbol and stock.symbol='AAPL';
--( 注意: In Hive, this query computes the full Cartesian product before applying the WHERE clause. It could take a very long time to finish. )
Map-side Joins
Ø 在mapper端把小表缓存,大表stream化,减少reducer端的处理步骤,甚至有时会减少mapper端的处理步骤.
Ø 在v0.7之前的hive需要添加一个暗语(hint),来达到优化.
SELECT /*+ MAPJOIN(d) */ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
Ø 在v0.7开始不赞成这么做,而是设置hive.auto.convert.join=true
Ø 多大的小表将被mapper端缓存,其临界值可以进行设置,默认为(byte)
hive.mapjoin.smalltable.filesize=25000000,
Ø Hive does not support the optimization for right- and full-outer joins.
11. ORDER BY and SORT BY
在hive中SORT BY相当于sql中的order by,如下:
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
SORT BY(ORDER BY ) s.ymd ASC, s.symbol DESC;
--(在hive中ORDER BY and SORT BY 的区别:
Order by:所有数据的排序通过一个reducer
Sort by:可以有多个reducer来进行排序,提高效率.)
--( 因为order by用时太长,所以需要limit子句于order by一起用,当hive.mapred.mode=strict )
12. DISTRIBUTE BY with SORT BY
DISTRIBUTE BY相当于sql中的group by语句,其原理(By default, MapReduce computes a hash on the keys output by mappers and tries to
evenly distribute the key-value pairs among the available reducers using the hash values.这样分配后比较乱,reducer在排序中容易重复比较同一键值对(即比较过一个条件后又比较另一个条件).所以We can use DISTRIBUTE BY to ensure that the records for each stock symbol go to the same reducer, then use SORT BY to order the data the way we want.)
实例:
hive> SELECT s.ymd, s.symbol, s.price_close
> FROM stocks s
> DISTRIBUTE BY s.symbol
> SORT BY s.symbol ASC, s.ymd ASC;
1984-09-07 AAPL 26.5
1984-09-10 AAPL 26.37
--( Note that Hive requires that the DISTRIBUTE BY clause come before the SORT BY clause.)
13. CLUSTER BY
--(Using DISTRIBUTE BY ... SORT BY or the shorthand CLUSTER BY clauses is a way to exploit the parallelism of SORT BY, yet achieve a total ordering across the output files.可以看成是”DISTRIBUTE BY ... SORT BY”的简单写法。)
实例1:
原数据:(string,string,int,string)
b,w,15,man
c,w,21,woman
c,w,20,man
muse,a,24,man
经过CLUSTER BY之后:
hive (wang)> select id,name,phone
> from wang_manage
> cluster by name;
muse a 24
b w 15
c w 21
c w 20
当CLUSTER BY 之后为数字列时的排序如下:
b w 15
c w 20
c w 21
muse a 24
--(可以看出CLUSTER BY的列(不管是字符列还是数字列)默认按ascending,当该列相等时其他列若为字符(第一列)则默认按ascending,其他列若为数字类型(第三列)则默认按descending)
14. UNION ALL
-- (UNION ALL combines two or more tables. Each subquery of the union query must produce the same number of columns, and for each column, its type must match all the column types in the same position. For example, if the second column is a FLOAT, then the second column of all the other query results must be a FLOAT.)
SELECT log.ymd, log.level, log.message
FROM (
SELECT l1.ymd, l1.level,
l1.message, 'Log1' AS source
FROM log1 l1
UNION ALL
SELECT l2.ymd, l2.level,
l2.message, 'Log2' AS source
FROM log1 l2
) log
SORT BY log.ymd ASC;
--(UNION may be used when a clause selects from the same source table. Logically, the same results could be achieved with a single SELECT and WHERE clause. This technique increases readability by breaking up a long complex WHERE clause into two or more UNION queries. However, unless the source table is indexed, the query will have to make multiple passes over the same source data. )
For example:
FROM (
FROM src SELECT src.key, src.value WHERE src.key < 100
UNION ALL
FROM src SELECT src.* WHERE src.key > 110
) unioninput
INSERT OVERWRITE DIRECTORY '/tmp/union.out' SELECT unioninput.*
--( 可以看出当有多个select语句扫描同一个表时,这种写法效率更高,避免扫描多次.)