Hive Streaming

1. Introduction to Hive Streaming

In the previous posts we saw that implementing UDFs, UDTFs, and UDAFs is not exactly simple, and it requires a good grasp of Java, whereas Hive was originally designed to be convenient for people who are not Java developers. Hive therefore offers another way to process data: Streaming, which removes the need to write Java code; in fact, a streaming script can be written in almost any language. The trade-off is that Streaming is usually less efficient than an equivalent UDF or a custom InputFormat: serializing and deserializing data through a pipe is typically slow, and the whole program is harder to debug by the usual means.

Hive provides several syntactic forms for Streaming, including:

MAP()

REDUCE()

TRANSFORM()

Note, however, that MAP() does not actually force the streaming script to run in the mapper phase, just as REDUCE() does not force it to run in the reducer phase. For that reason, TRANSFORM() is usually recommended for all of these cases to avoid confusion.
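
For example, the two statements below should produce the same plan; the MAP keyword is only an alias and does not pin the script to the map phase. This is an illustrative sketch only, with src, key, and value as placeholder names that do not appear in the examples later in this post:

SELECT TRANSFORM(key, value) USING '/bin/cat' AS k, v FROM src;

FROM src MAP key, value USING '/bin/cat' AS k, v;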

2. Writing and Using Streaming

A Streaming query is built from the TRANSFORM() function and the USING keyword: the arguments to TRANSFORM() are the table columns to hand to the script, and USING names the script or command to run. The data in this section is again the employee table used in part one of the Hive UDF tutorial.
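
Whatever language the script is written in, the contract is always the same: Hive writes each input row to the script's standard input as tab-separated column values ending in a newline, and reads tab-separated rows back from the script's standard output. A minimal identity script in Python (hypothetical, shown only to illustrate the contract; it is not used below) would be:

#!/usr/bin/env python
import sys

# Echo every tab-separated row back unchanged, which is what
# the /bin/cat example below does with an external command.
for line in sys.stdin:
    columns = line.rstrip("\n").split("\t")
    print "\t".join(columns)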

Example 1: Streaming with a Linux command

First, let Streaming query the table through the Linux cat command. cat.q is the HiveQL file, with the following contents:

SELECT TRANSFORM(e.name, e.salary)
USING '/bin/cat' AS name, salary
FROM employee e;

Execution result:

hive (mydb)> SOURCE cat.q;

OK

Time taken: 0.044 seconds

Query ID = root_20160120000909_2de2d4f9-b50c-4ed1-a876-768c0127f067

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1453275977382_0001, Tracking URL = http://master:8088/proxy/application_1453275977382_0001/

Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job -kill job_1453275977382_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2016-01-20 00:10:16,258 Stage-1 map = 0%, reduce = 0%

2016-01-20 00:10:22,942 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.12 sec

MapReduce Total cumulative CPU time: 1 seconds 120 msec

Ended Job = job_1453275977382_0001

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Cumulative CPU: 1.12 sec HDFS Read: 1040 HDFS Write: 139 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 120 msec

OK

John Doe	100000.0

Mary Smith	80000.0

Todd Jones	70000.0

Bill King	60000.0

Boss Man	200000.0

Fred Finance	150000.0

Stacy Accountant	60000.0

Time taken: 24.758 seconds, Fetched: 7 row(s)
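
By default the columns coming back from the script are treated as strings. If typed output is needed, the AS clause can declare column types explicitly; a variant of the query above (a sketch, not run here) would be:

SELECT TRANSFORM(e.name, e.salary)
USING '/bin/cat' AS (name STRING, salary DOUBLE)
FROM employee e;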

Example 2: Streaming with a Python script

Next, let's compare Hive's built-in sum() function with a Python script, sum.py, that does the same job. First, the built-in sum():

hive (mydb)> SELECT sum(salary) FROM employee;

Query ID = root_20160120012525_1abf156b-d44b-4f1c-b2c2-3604e4c1bba0

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1453281391968_0002, Tracking URL = http://master:8088/proxy/application_1453281391968_0002/

Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job -kill job_1453281391968_0002

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2016-01-20 01:25:20,364 Stage-1 map = 0%, reduce = 0%

2016-01-20 01:25:31,620 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.55 sec

2016-01-20 01:25:42,394 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.73 sec

MapReduce Total cumulative CPU time: 2 seconds 730 msec

Ended Job = job_1453281391968_0002

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.73 sec HDFS Read: 1040 HDFS Write: 9 SUCCESS

Total MapReduce CPU Time Spent: 2 seconds 730 msec

OK

720000.0

Time taken: 33.891 seconds, Fetched: 1 row(s)

Now the Streaming version. The sum.py script:

#!/usr/bin/env python
import sys

# Add one value to the running total.
def sum(arg):
    global total
    total += arg

if __name__ == "__main__":
    total = 0.0
    # Hive feeds one salary per input row on stdin.
    for arg in sys.stdin:
        sum(float(arg))
    print total

The HiveQL script sum.q:

SELECT TRANSFORM(salary)
USING 'python /root/experiment/hive/sum.py' AS total
FROM employee;

Finally, the execution result (this runs in a virtual machine on a very small data set, so the runtime comparison with the built-in sum() is only indicative):

hive> source sum.q;

OK

Time taken: 0.022 seconds

Query ID = root_20160120002626_0ced0b93-e4e8-4f3a-91d0-f2aaa06b5f11

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1453278047512_0002, Tracking URL = http://master:8088/proxy/application_1453278047512_0002/

Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job -kill job_1453278047512_0002

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2016-01-20 00:26:28,341 Stage-1 map = 0%, reduce = 0%

2016-01-20 00:26:36,185 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.4 sec

MapReduce Total cumulative CPU time: 1 seconds 400 msec

Ended Job = job_1453278047512_0002

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Cumulative CPU: 1.4 sec HDFS Read: 1040 HDFS Write: 9 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 400 msec

OK

720000.0

Time taken: 17.048 seconds, Fetched: 1 row(s)
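
Note that sum.q points USING at an absolute local path, which only works if the script exists at that path on every node that runs a task. On a multi-node cluster the usual pattern is to ship the script with ADD FILE so Hive distributes it to the task nodes; a sketch of the same query written that way (not run here):

ADD FILE /root/experiment/hive/sum.py;

SELECT TRANSFORM(salary)
USING 'python sum.py' AS total
FROM employee;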

Example 3: WordCount with Streaming

To close this section, here is a WordCount example implemented with Hive Streaming. First, the docs table:

hive (mydb)> SELECT * FROM docs;

OK

hello world

hello hadoop

hello spark

Time taken: 0.044 seconds, Fetched: 3 row(s)
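
For the query in wc.q below to work, docs needs a single STRING column named line holding one line of text per row. A table like it could be created along these lines (a sketch; the source file path is hypothetical and not from this post):

CREATE TABLE IF NOT EXISTS docs (line STRING);

LOAD DATA LOCAL INPATH '/root/experiment/hive/docs.txt' OVERWRITE INTO TABLE docs;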

wc_mapper.py does the mapper-phase work: it extracts every word and emits it with a count of 1:

#!/usr/bin/env python
import sys

# Split a line into words and emit "word<TAB>1" for each of them.
def splitWord(rows):
    words = rows.strip().split(" ")
    for word in words:
        print "%s\t1" % (word)

if __name__ == "__main__":
    for line in sys.stdin:
        splitWord(line)

wc_reducer.py does the reducer-phase work, totalling the counts per word. It relies on the CLUSTER BY clause in the query below to deliver all rows for the same word consecutively:

#!/usr/bin/env python
import sys

(lastKey, lastCount) = (None, 0)

for line in sys.stdin:
    (key, count) = line.strip().split("\t")
    if (lastKey) and (lastKey != key):
        # The word changed: emit the finished word and start a new count.
        print "%s\t%d" % (lastKey, lastCount)
        (lastKey, lastCount) = (key, int(count))
    else:
        # First row, or the same word again: keep accumulating.
        lastKey = key
        lastCount += int(count)

# Emit the last word.
if lastKey:
    print "%s\t%d" % (lastKey, lastCount)

The HiveQL script wc.q contains the statements to execute. I use an intermediate table, wordcount, to store the result, though the output could also be selected directly (see the sketch after the script):

CREATE TABLE IF NOT EXISTS wordcount(
  word STRING,
  count INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

FROM (
  FROM docs
  SELECT TRANSFORM(line) USING 'python /root/experiment/hive/wc_mapper.py'
  AS word, count
  CLUSTER BY word) wc
INSERT OVERWRITE TABLE wordcount
SELECT TRANSFORM(wc.word, wc.count) USING 'python /root/experiment/hive/wc_reducer.py'
AS words, counts;
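
As mentioned above, the intermediate table is optional; dropping the INSERT OVERWRITE clause and selecting the reducer output directly should work just as well (a sketch of the same pipeline, not run here):

FROM (
  FROM docs
  SELECT TRANSFORM(line) USING 'python /root/experiment/hive/wc_mapper.py'
  AS word, count
  CLUSTER BY word) wc
SELECT TRANSFORM(wc.word, wc.count) USING 'python /root/experiment/hive/wc_reducer.py'
AS words, counts;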

Finally, the execution result. Because the result goes into the intermediate wordcount table, we still need to query that table after the job finishes:

hive (mydb)> SOURCE wc.q;

OK

Time taken: 0.022 seconds

OK

Time taken: 0.066 seconds

Query ID = root_20160120013535_c6e957a9-1981-475a-b21a-e73576df6a99

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1453281391968_0003, Tracking URL = http://master:8088/proxy/application_1453281391968_0003/

Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job -kill job_1453281391968_0003

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2016-01-20 01:35:53,691 Stage-1 map = 0%, reduce = 0%

2016-01-20 01:36:00,339 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.15 sec

2016-01-20 01:36:08,961 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.98 sec

MapReduce Total cumulative CPU time: 2 seconds 980 msec

Ended Job = job_1453281391968_0003

Loading data to table mydb.wordcount

Table mydb.wordcount stats: [numFiles=1, numRows=4, totalSize=33, rawDataSize=29]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.98 sec HDFS Read: 260 HDFS Write: 103 SUCCESS

Total MapReduce CPU Time Spent: 2 seconds 980 msec

OK

Time taken: 25.652 seconds

hive (mydb)> SELECT * FROM wordcount;

OK

hadoop	1

hello	3

spark	1

world	1

Time taken: 0.047 seconds, Fetched: 4 row(s)
