Hive Streaming


1. Introduction to Hive Streaming

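As the earlier articles showed, implementing UDFs, UDTFs, and UDAFs is not trivial and requires a good command of Java, whereas Hive was designed to be usable by people who do not write Java. Hive therefore offers another way to process data, Streaming, which requires no Java code at all and works with many languages. The trade-off is that Streaming is usually slower than an equivalent UDF or a custom InputFormat, because every row has to be serialized, sent through a pipe to an external process, and deserialized again. Programs built this way are also hard to debug by conventional means.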
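
To make the cost of that round trip concrete, here is a rough Python sketch of what happens to each row under Streaming. It is only an illustration, not Hive's actual implementation: the helper name `stream_rows` is hypothetical, and the sketch assumes the default text format in which columns are joined with tabs and each row ends with a newline.

```python
import subprocess
import sys

def stream_rows(rows, command):
    """Serialize rows to tab-separated text, pipe them through `command`,
    and parse the child's stdout back into rows (hypothetical helper)."""
    # Serialize: every value becomes a string, columns are joined by tabs.
    payload = "".join("\t".join(str(v) for v in row) + "\n" for row in rows)
    # Pipe the text through the external process.
    result = subprocess.run(
        command, input=payload, stdout=subprocess.PIPE,
        universal_newlines=True, check=True,
    )
    # Deserialize: split lines and columns again; values come back as strings.
    return [line.split("\t") for line in result.stdout.splitlines()]

if __name__ == "__main__":
    # A tiny child process that echoes stdin back unchanged.
    echo = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read())"]
    rows = [("John Doe", 100000.0), ("Mary Smith", 80000.0)]
    print(stream_rows(rows, echo))
```

Even for a script that does nothing, every value is converted to text, copied through a pipe, and parsed again, and typed values come back as plain strings. A UDF avoids that extra work.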

Hive provides several syntaxes for Streaming:

  • MAP()
  • REDUCE()
  • TRANSFORM()

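Note, however, that MAP() does not actually force the script to run in the map phase, just as REDUCE() does not force it to run in the reduce phase. Since all three provide the same functionality, TRANSFORM() is generally recommended to avoid confusion.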


2. Writing and Using Streaming

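Streaming is implemented with the TRANSFORM() function and the USING keyword: the arguments to TRANSFORM() are the table columns to pass to the script, and USING specifies the script or command to run. This section keeps using the employee table from Hive UDF Tutorial (Part 1).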


Example 1: Streaming with a Linux command

First, let Streaming call the Linux cat command directly to query the table. cat.q is a HiveQL file with the following content:

SELECT TRANSFORM(e.name, e.salary)
USING '/bin/cat' AS name, salary
FROM employee e;

Execution result:

hive (mydb)> SOURCE cat.q;
OK
Time taken: 0.044 seconds
Query ID = root_20160120000909_2de2d4f9-b50c-4ed1-a876-768c0127f067
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1453275977382_0001, Tracking URL = http://master:8088/proxy/application_1453275977382_0001/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job  -kill job_1453275977382_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-01-20 00:10:16,258 Stage-1 map = 0%,  reduce = 0%
2016-01-20 00:10:22,942 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.12 sec
MapReduce Total cumulative CPU time: 1 seconds 120 msec
Ended Job = job_1453275977382_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.12 sec   HDFS Read: 1040 HDFS Write: 139 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 120 msec
OK
John Doe	100000.0
Mary Smith	80000.0
Todd Jones	70000.0
Bill King	60000.0
Boss Man	200000.0
Fred Finance	150000.0
Stacy Accountant	60000.0
Time taken: 24.758 seconds, Fetched: 7 row(s)
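
Any executable that reads rows from standard input and writes rows to standard output can take the place of /bin/cat. As a sketch, the hypothetical script below (call it passthrough.py; the name is an assumption and not part of the original tutorial) splits each incoming line into the name and salary columns and writes them back unchanged, which makes the tab-separated row format explicit.

```python
#!/usr/bin/env python
import sys

def parse_row(line):
    """Split one tab-separated input line into (name, salary)."""
    name, salary = line.rstrip("\n").split("\t")
    return name, float(salary)

def format_row(name, salary):
    """Join the output columns with a tab, one row per line."""
    return "%s\t%s" % (name, salary)

if __name__ == "__main__":
    for line in sys.stdin:
        name, salary = parse_row(line)
        print(format_row(name, salary))
```

A real transformation would go between parse_row and format_row; the surrounding loop stays the same.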

Example 2: Streaming with a Python script

Next, compare Hive's built-in sum() function with a Python script, sum.py, that does the same job. First, the built-in sum():

hive (mydb)> SELECT sum(salary) FROM employee;
Query ID = root_20160120012525_1abf156b-d44b-4f1c-b2c2-3604e4c1bba0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1453281391968_0002, Tracking URL = http://master:8088/proxy/application_1453281391968_0002/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job  -kill job_1453281391968_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-01-20 01:25:20,364 Stage-1 map = 0%,  reduce = 0%
2016-01-20 01:25:31,620 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.55 sec
2016-01-20 01:25:42,394 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.73 sec
MapReduce Total cumulative CPU time: 2 seconds 730 msec
Ended Job = job_1453281391968_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.73 sec   HDFS Read: 1040 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 730 msec
OK
720000.0
Time taken: 33.891 seconds, Fetched: 1 row(s)

Then the Streaming version. The sum.py script:

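#!/usr/bin/env python
# Sum one numeric value per input line read from stdin.

import sys

def main():
    total = 0.0
    for line in sys.stdin:
        line = line.strip()
        if line:  # skip blank lines
            total += float(line)
    # print() with a single argument behaves the same on Python 2 and 3
    print(total)

if __name__ == "__main__":
    main()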

The HiveQL script sum.q:

SELECT TRANSFORM(salary)                     
USING 'python /root/experiment/hive/sum.py' AS total
FROM employee; 

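Finally, the execution result. (The cluster is a fully distributed setup running on virtual machines and the data set is tiny, so the timing comparison with sum() is for reference only.)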

hive> source sum.q;
OK
Time taken: 0.022 seconds
Query ID = root_20160120002626_0ced0b93-e4e8-4f3a-91d0-f2aaa06b5f11
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1453278047512_0002, Tracking URL = http://master:8088/proxy/application_1453278047512_0002/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job  -kill job_1453278047512_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-01-20 00:26:28,341 Stage-1 map = 0%,  reduce = 0%
2016-01-20 00:26:36,185 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.4 sec
MapReduce Total cumulative CPU time: 1 seconds 400 msec
Ended Job = job_1453278047512_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.4 sec   HDFS Read: 1040 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 400 msec
OK
720000.0
Time taken: 17.048 seconds, Fetched: 1 row(s)
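
One caveat with this approach: the query is a map-only job, and Hive starts one copy of the script per map task. The log shows a single mapper, so a single total comes back; with several mappers, each copy would print its own partial sum. The sketch below simulates that behaviour in plain Python. The function names and the way the salaries are split across two mappers are assumptions made purely for illustration.

```python
def run_sum_script(lines):
    """Mimic sum.py: add up one number per input line."""
    total = 0.0
    for line in lines:
        total += float(line)
    return total

def simulate_map_only_job(splits):
    """Run one script instance per input split, as a map-only job would."""
    return [run_sum_script(split) for split in splits]

if __name__ == "__main__":
    salaries = ["100000.0", "80000.0", "70000.0", "60000.0",
                "200000.0", "150000.0", "60000.0"]
    # One mapper: a single row holding the full total.
    print(simulate_map_only_job([salaries]))
    # Two mappers (hypothetical split): two rows of partial sums.
    print(simulate_map_only_job([salaries[:4], salaries[4:]]))
```

On larger tables a streaming aggregation therefore needs a second step that combines the partial results, much like the mapper and reducer pair in the WordCount example.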

Example 3: WordCount with Streaming

To close this section, here is WordCount run through Hive Streaming. First, the docs table:

hive (mydb)> SELECT * FROM docs;                                      
OK
hello world
hello hadoop
hello spark
Time taken: 0.044 seconds, Fetched: 3 row(s)

wc_mapper.py handles the map phase: it extracts every word and emits it with a count of 1:

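#!/usr/bin/env python
# Mapper: emit "<word>\t1" for every word on every input line.

import sys

def splitWord(row):
    # split() with no argument also drops empty strings from repeated spaces
    for word in row.split():
        # print() with a single argument works on both Python 2 and 3
        print("%s\t1" % word)

if __name__ == "__main__":
    for line in sys.stdin:
        splitWord(line)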

wc_reducer.py handles the reduce phase: it totals the counts for each word:

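#!/usr/bin/env python
# Reducer: sum the counts of consecutive rows that share the same word.
# Input must arrive grouped by word (see CLUSTER BY in wc.q).

import sys

(lastKey, lastCount) = (None, 0)
for line in sys.stdin:
    (key, count) = line.rstrip("\n").split("\t")
    if lastKey is not None and lastKey != key:
        # The key changed: emit the finished word and start a new tally.
        print("%s\t%d" % (lastKey, lastCount))
        (lastKey, lastCount) = (key, int(count))
    else:
        lastKey = key
        lastCount += int(count)

# Emit the final word once the input is exhausted.
if lastKey is not None:
    print("%s\t%d" % (lastKey, lastCount))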

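The HiveQL script wc.q contains the statements to run. I store the result in an intermediate table, wordcount, although the query output could also be returned directly: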

CREATE TABLE IF NOT EXISTS wordcount(
    word STRING,
    count INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

FROM(
    FROM docs
    SELECT TRANSFORM(line) USING 'python /root/experiment/hive/wc_mapper.py'
    AS word, count
    CLUSTER BY word) wc
INSERT OVERWRITE TABLE wordcount
SELECT TRANSFORM(wc.word, wc.count) USING 'python /root/experiment/hive/wc_reducer.py'
AS words, counts;
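
The reducer script only works if all rows for the same word arrive next to each other, which is what CLUSTER BY word provides by distributing and sorting on that column. A quick way to check the logic without a cluster is to reproduce the three steps in plain Python. The sketch below is a local simulation under that assumption, not what Hive runs internally; `map_words`, `reduce_counts`, and `word_count` are hypothetical names, and `itertools.groupby` stands in for the manual `lastKey` bookkeeping.

```python
from itertools import groupby

def map_words(lines):
    """Mapper step: emit (word, 1) for every word in every line."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_counts(sorted_pairs):
    """Reducer step: sum counts over runs of identical keys.
    Requires the input to be sorted (or at least grouped) by word."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

def word_count(lines):
    # sorted() plays the role of CLUSTER BY word.
    return list(reduce_counts(sorted(map_words(lines))))

if __name__ == "__main__":
    docs = ["hello world", "hello hadoop", "hello spark"]
    for word, count in word_count(docs):
        print("%s\t%d" % (word, count))
```

If the sort is skipped, the same word can appear in several separate runs and is emitted more than once, which is the failure the CLUSTER BY clause prevents.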

Finally, the execution result. Because the output goes into the wordcount table, the result has to be queried from that table afterwards:

hive (mydb)> SOURCE wc.q;
OK
Time taken: 0.022 seconds
OK
Time taken: 0.066 seconds
Query ID = root_20160120013535_c6e957a9-1981-475a-b21a-e73576df6a99
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1453281391968_0003, Tracking URL = http://master:8088/proxy/application_1453281391968_0003/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job  -kill job_1453281391968_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-01-20 01:35:53,691 Stage-1 map = 0%,  reduce = 0%
2016-01-20 01:36:00,339 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.15 sec
2016-01-20 01:36:08,961 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.98 sec
MapReduce Total cumulative CPU time: 2 seconds 980 msec
Ended Job = job_1453281391968_0003
Loading data to table mydb.wordcount
Table mydb.wordcount stats: [numFiles=1, numRows=4, totalSize=33, rawDataSize=29]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.98 sec   HDFS Read: 260 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 980 msec
OK
Time taken: 25.652 seconds

hive (mydb)> SELECT * FROM wordcount;
OK
hadoop	1
hello	3
spark	1
world	1
Time taken: 0.047 seconds, Fetched: 4 row(s)


