Calling a C++ binary for word segmentation in MapReduce
# -*- coding:utf-8 -*-
import sys
import os
from collections import defaultdict

def main():
    # Collect the map input into a local file
    with open("input.txt", "w", encoding="utf-8") as fout:
        for line in sys.stdin:
            fout.write(line)
    # Call the binary to run the word segmentation
    os.system("/workdir/seg_binary /workdir/word2cnt.txt input.txt output.txt")
    # Write the segmented result to stdout
    with open("output.txt", "r", encoding="utf-8") as fin:
        for line in fin:
            print(line.strip())

if __name__ == '__main__':
    main()
mapred.task.timeout
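Problem: the map task produced no output for a long time, so the MapReduce job was killed.
Analysis: the timeout is controlled by the property mapred.task.timeout, which defaults to 600000 ms, i.e. 10 min.
Description of mapred.task.timeout: The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string. A value of 0 disables the timeout.
How MR handles it: if a task attempt does not report progress within the configured interval (mapreduce.task.timeout, the newer name of the same property), the attempt is considered failed and a TA_TIMED_OUT event is sent, telling the ApplicationMaster to kill that attempt.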
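Solutions
- Increase mapred.task.timeout. Avoid setting it too high: with a 1-hour timeout, if a machine dies while the job is running, the failure is only detected and handled after 1 hour.
- Make sure the map produces output at regular intervals. This is the preferred option.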
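The preferred option can be sketched as follows. The mapper above is silent the whole time the binary is running, so one way to keep the task alive is to update its status string while waiting. This is a minimal sketch, assuming the job runs under Hadoop Streaming, which treats stderr lines of the form `reporter:status:<message>` as status updates. The helper name `run_with_heartbeat` and the interval values are hypothetical and not part of the original script.

```python
import subprocess
import sys
import time


def run_with_heartbeat(cmd, interval_sec=60.0, poll_sec=1.0, report=None):
    """Run cmd and report a status message every interval_sec until it exits.

    Returns the exit code of the child process.
    """
    if report is None:
        def report(message):
            # Assumption: Hadoop Streaming reads this stderr line as a status
            # update, which counts as activity for the task timeout.
            sys.stderr.write("reporter:status:%s\n" % message)
            sys.stderr.flush()

    proc = subprocess.Popen(cmd)
    start = time.time()
    last_beat = start
    # Poll the child process; emit a heartbeat whenever the interval has passed.
    while proc.poll() is None:
        time.sleep(poll_sec)
        now = time.time()
        if now - last_beat >= interval_sec:
            report("segmenting, %d s elapsed" % int(now - start))
            last_beat = now
    return proc.returncode
```

In `main()`, the `os.system(...)` call could then be replaced with `run_with_heartbeat(["/workdir/seg_binary", "/workdir/word2cnt.txt", "input.txt", "output.txt"])`. The heartbeat interval should stay well below the configured timeout so that several status updates arrive within each timeout window.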