1. How Hive runs custom mapred scripts
1.1) Syntax and examples for using custom mapred scripts in HQL
Syntax:
FROM (
  FROM src
  MAP expression (',' expression)*
  (inRowFormat)?
  USING 'my_map_script'
  ( AS colName (',' colName)* )?
  (outRowFormat)? (outRecordReader)?
  ( clusterBy? | distributeBy? sortBy? ) src_alias
)
REDUCE expression (',' expression)*
  (inRowFormat)?
  USING 'my_reduce_script'
  ( AS colName (',' colName)* )?
  (outRowFormat)? (outRecordReader)?

FROM (
  FROM src
  SELECT TRANSFORM '(' expression (',' expression)* ')'
  (inRowFormat)?
  USING 'my_map_script'
  ( AS colName (',' colName)* )?
  (outRowFormat)? (outRecordReader)?
  ( clusterBy? | distributeBy? sortBy? ) src_alias
)
SELECT TRANSFORM '(' expression (',' expression)* ')'
  (inRowFormat)?
  USING 'my_reduce_script'
  ( AS colName (',' colName)* )?
  (outRowFormat)? (outRecordReader)?
HQL examples:
FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script'
  AS dt, uid
  CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE map_output.dt, map_output.uid
  USING 'reduce_script'
  AS date, count;

FROM (
  FROM pv_users
  SELECT TRANSFORM(pv_users.userid, pv_users.date)
  USING 'map_script'
  AS dt, uid
  CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  SELECT TRANSFORM(map_output.dt, map_output.uid)
  USING 'reduce_script'
  AS date, count;
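The example above never shows what 'map_script' and 'reduce_script' contain. The following is only a minimal sketch of what they might look like, assuming Hive's default TRANSFORM row format (one row per line on stdin, columns separated by tabs). The file name pv_scripts.py, the function names, and the "map"/"reduce" command-line argument are hypothetical and do not come from the original example.

```python
#!/usr/bin/env python
# coding=utf-8
# Hypothetical stand-in for 'map_script' and 'reduce_script', kept in one file.
import sys
from itertools import groupby


def map_rows(lines):
    """Turn 'userid<TAB>date' input rows into 'dt<TAB>uid' output rows."""
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield date + "\t" + userid


def reduce_rows(lines):
    """Count rows per dt.

    Assumes rows with the same dt arrive next to each other, which is what
    CLUSTER BY dt in the subquery is meant to provide.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for dt, group in groupby(rows, key=lambda row: row[0]):
        yield dt + "\t" + str(sum(1 for _ in group))


# Only act as a streaming script when called with an explicit role argument.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = map_rows if sys.argv[1] == "map" else reduce_rows
    for out in step(sys.stdin):
        sys.stdout.write(out + "\n")
```

Under these assumptions, the file would be shipped with ADD FILE pv_scripts.py and referenced as USING 'python pv_scripts.py map' and USING 'python pv_scripts.py reduce'.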
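1.2) How Hive runs custom mapred scripts
The MAP and REDUCE keywords are only syntactic aliases for SELECT TRANSFORM. In other words, a script invoked with MAP does not necessarily run in the map phase, and a script invoked with REDUCE does not necessarily run in the reduce phase.
The following cases illustrate this: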
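red.py source:
#!/usr/bin/python
# coding=utf-8
import sys
import time
import traceback

# Log the start time to stderr so it shows up in the task logs.
sys.stderr.write("we are running at " + time.strftime("%Y-%m-%d %H:%M:%S") + "\n")
num = 10  # pass through at most this many rows
i = 0
try:
    # Rows beyond the first num are still read, but dropped.
    for l in sys.stdin:
        if i < num:
            # Strip the trailing newline, then write the row back out.
            sys.stdout.write(l.rstrip("\n") + "\n")
            i += 1
except Exception:
    traceback.print_exc(file=sys.stderr)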
Case 1: this HQL statement generates one mapred job. The job runs only the map phase; the number of reducers is zero.
FROM (
  FROM testhive
  MAP testhive.channel, testhive.pv
  USING '/bin/cat'
  AS dt, uid) map_output
REDUCE map_output.dt, map_output.uid
  USING '/bin/cat'
  AS date, count;
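Case 2:
Each of the two HQL statements below generates one mapred job. The job runs only the map phase; the number of reducers is zero, whether the MAP or the REDUCE keyword is used.
FROM testhive
MAP testhive.channel, testhive.pv
USING 'red.py'
AS c, p;

FROM testhive
REDUCE testhive.channel, testhive.pv
USING 'red.py'
AS c, p;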
Case 3:
This HQL statement generates one mapred job, and the job runs both the map and the reduce phases.
FROM (
  FROM testhive
  MAP testhive.channel, testhive.pv
  USING 'red.py'
  AS dt, uid
  CLUSTER BY dt) map_output
REDUCE map_output.dt, map_output.uid
  USING 'red.py'
  AS date, count;
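Summary: when a custom script is embedded in HQL, the phase in which it runs depends on its syntactic position in the statement. That is, when Hive translates the HQL statement into a mapred job, if the statement contains clauses that require a reduce phase (GROUP BY, CLUSTER BY, SORT BY, and so on) and the MAP or REDUCE clause comes after those clauses, the custom script runs on the reduce side. Otherwise, as in cases 1 and 2, it runs on the map side.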
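The rule can be made concrete with a toy, in-memory model. This is only an illustrative sketch and not how Hive is implemented: the function names are hypothetical, and crc32 is an assumed stand-in for the real partitioner. It shows that without a shuffle the script runs once per input split (map side), while after a CLUSTER BY the script runs once per reducer on rows that have been partitioned and sorted by key.

```python
import zlib


def first_n(rows, n=10):
    """Mimic red.py: pass through at most n rows of whatever the task receives."""
    return rows[:n]


def run_map_only(splits, script):
    """No shuffle: the script runs once per input split, inside each map task."""
    return [script(split) for split in splits]


def run_after_cluster_by(splits, key_index, num_reducers, script):
    """Shuffle rows by key, sort each partition, then run the script per reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for split in splits:
        for row in split:
            # crc32 is a stand-in partitioner, not the hash Hive actually uses.
            bucket = zlib.crc32(row[key_index].encode("utf-8")) % num_reducers
            partitions[bucket].append(row)
    return [script(sorted(p, key=lambda row: row[key_index])) for p in partitions]
```

In this model, a script placed before the CLUSTER BY corresponds to run_map_only, and a script placed after it corresponds to run_after_cluster_by, so the same red.py can emit up to 10 rows per map task in one position and up to 10 rows per reducer in the other.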
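Note: if a mapred script was developed on Windows, it is best to run dos2unix on it after copying it to Linux. Otherwise errors like the following may occur.
Error reported by the Hadoop mapred framework: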
java.io.IOException: Cannot run program "/data9/mapred/local/taskTracker/zhaoxiuxiang/jobcache/job_201311120949_47441/attempt_201311120949_47441_m_000003_2/work/././map.py": error=2, No such file or directory
Error when running the script locally:
-bash: ./map.py: /usr/bin/python^M: bad interpreter: No such file or directory
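The ^M in the message is a carriage return: the shebang line ends in \r\n, so the kernel looks for an interpreter literally named "/usr/bin/python\r". Where dos2unix is not installed, a rough equivalent can be sketched in Python. The function names below are hypothetical, and the sketch only handles \r\n line endings.

```python
def has_crlf(data):
    """Return True if the byte string contains Windows (CRLF) line endings."""
    return b"\r\n" in data


def to_unix_newlines(data):
    """Replace every CRLF with LF, leaving all other bytes untouched."""
    return data.replace(b"\r\n", b"\n")


def fix_script(path):
    """Rewrite the file at path in place if it has CRLF line endings.

    Returns True if the file was changed.
    """
    with open(path, "rb") as f:
        data = f.read()
    if not has_crlf(data):
        return False
    with open(path, "wb") as f:
        f.write(to_unix_newlines(data))
    return True
```

For example, calling fix_script("map.py") before submitting the job would be expected to have the same effect as running dos2unix on map.py for this kind of file.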
Note: use the following parameters to enable reporter and debug output from the script's stderr:
stream.stderr.reporter.enabled=true
stream.stderr.reporter.prefix=reporter:
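With these settings, lines that a script writes to stderr starting with the configured prefix are treated as reporter messages rather than plain log text. The sketch below assumes the Hadoop Streaming convention of "reporter:counter:<group>,<counter>,<amount>" and "reporter:status:<message>". The helper names, the counter group "red.py", and the counter name "rows_seen" are hypothetical.

```python
import sys


def counter_line(group, counter, amount, prefix="reporter:"):
    """Build a stderr line that increments a job counter by amount."""
    return "%scounter:%s,%s,%d\n" % (prefix, group, counter, amount)


def status_line(message, prefix="reporter:"):
    """Build a stderr line that updates the task status message."""
    return "%sstatus:%s\n" % (prefix, message)


def passthrough(lines, out=sys.stdout, err=sys.stderr):
    """Copy rows to out unchanged, then report how many were seen on err."""
    n = 0
    for line in lines:
        out.write(line)
        n += 1
    err.write(counter_line("red.py", "rows_seen", n))
    err.write(status_line("processed %d rows" % n))
    return n
```

A streaming script would call passthrough(sys.stdin). The prefix argument should match the value of stream.stderr.reporter.prefix if that setting is changed from "reporter:".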