[Detailed, Personally Tested] A Complete Summary of Hadoop Streaming

While working on a project, I wanted to use Hadoop Streaming to parallelize some existing Python code and speed it up.

A good introductory example: writing MapReduce functions in Python, using WordCount as the running example.

First, the main advantage of Hadoop Streaming: it lets map and reduce programs written in any language run on a Hadoop cluster. The only contract is that the map/reduce programs read from standard input (stdin) and write to standard output (stdout); in Python you rarely need to manage stdout explicitly, ordinary print output is enough.

Hadoop Streaming jobs can also be debugged locally on a single machine:

cat inputFileName | python map.py | sort | python reduce.py

The sort stage can be omitted, piping map output straight into python reduce.py, and the output can be redirected to a file, for example:

cat inputFileName | python map.py | sort | python reduce.py > outputFileName
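A minimal WordCount pair that fits the pipeline above might look like the following. This is a sketch: in a real job the two functions live in two separate scripts (map.py and reduce.py, matching the commands above), each driven by a loop over sys.stdin.

```python
# Sketch of map.py and reduce.py in one file, for illustration only.
# map.py would run:    for out in map_lines(sys.stdin): print(out)
# reduce.py would run: for out in reduce_pairs(sys.stdin): print(out)
from itertools import groupby

def map_lines(lines):
    """Mapper: emit one tab-separated (word, 1) pair per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_pairs(sorted_lines):
    """Reducer: sum the counts for each word. Input must be sorted by key,
    which is exactly what the intervening `sort` in the pipeline provides."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"
```

Note that if you drop the `sort` stage, `reduce_pairs` no longer sees grouped keys, so the per-key sums come out fragmented; on the cluster, Hadoop's shuffle performs this grouping for you.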

Basic HDFS operations

hadoop fs -put fileName hdfsPath (upload a local file to HDFS)
hadoop fs -get hdfsPath (download a file from HDFS)
hadoop fs -rm -r dirName (delete the directory dirName)

Merging all files under an HDFS directory into one file

hadoop fs -cat pathsToMerge (e.g. A/*, all files under directory A) | hadoop fs -put - A/merge.txt

The command above concatenates every file under directory A into A/merge.txt.

A pitfall (with a likely cause)

Here is a pitfall I hit: other people's jobs set -D mapreduce.map.cpu.vcores=4 without trouble, but with that flag my job never produced any result and printed no error either. As soon as I removed -D mapreduce.map.cpu.vcores=4, the job started producing results. At first I could not explain this; suggestions are welcome.

Cause: I later found a likely explanation. If each map is given 4 CPU cores, a node runs 4 maps at the same time, and the memory available to each map becomes the node's memory divided by the number of cores. Moreover, for a job with only a single map, giving that map 4 cores is pointless and makes it extremely slow (my guess is that this slowness is why the streaming job never finished).
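The per-map memory arithmetic in the explanation above can be made concrete. The numbers here are hypothetical (a 64 GB node is my assumption, not from the original setup):

```python
# Hypothetical node: 64 GB of RAM, maps configured with 4 vcores each,
# so (per the reasoning above) 4 maps run concurrently on the node.
node_memory_gb = 64
concurrent_maps = 4

memory_per_map_gb = node_memory_gb / concurrent_maps
print(memory_per_map_gb)  # each map now sees 16.0 GB instead of the full 64 GB
```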

Hadoop Streaming configuration

Options:

-input: input path
-output: output path
-mapper: the user's own mapper program; an executable or a script
-reducer: the user's own reducer program; an executable or a script
-file: ship a file with the submitted job, e.g. an input file the mapper or reducer needs, such as a configuration file or a dictionary
-partitioner: a user-defined partitioner program
-combiner: a user-defined combiner program
-D: job properties (older versions used -jobconf)

Example:

hadoop jar $HADOOP_STREAMING \
-D mapred.job.name="my job name" \
-D mapreduce.map.memory.mb=16384 \ (16 GB of memory per map)
-D mapreduce.reduce.memory.mb=16384 \ (memory per reduce; before submitting a job, estimate whether your map/reduce memory and task counts will take too many resources, because if your job hogs the cluster, nobody else's jobs can run)
-D mapred.map.tasks=1024 \ (number of map tasks)
-D mapred.reduce.tasks=1024 \ (number of reduce tasks; set to 0 to skip the reduce phase entirely)
-D mapred.task.timeout=86400000 \ (maximum time a task may go without progress; more on this below)
-files test.py \
-input inputPath \
-output outputPath \
-mapper "python test.py Map" \
-reducer "python test.py Reduce" \
-cacheArchive hdfsPathOfPackedArchive/A.tar.gz#B (#B means the unpacked contents of A end up under directory B; -cacheArchive lets you ship large files to the cluster as a .tar.gz archive)
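The single-script pattern in the example, invoked as `python test.py Map` or `python test.py Reduce`, dispatches on its first argument. A hypothetical skeleton (the post does not show the actual test.py):

```python
import sys

def run_map(stdin):
    """Mapper stage: emit tab-separated key/value lines to stdout."""
    for line in stdin:
        for word in line.split():
            print(f"{word}\t1")

def run_reduce(stdin):
    """Reducer stage: sum values per key; Hadoop delivers input sorted by key."""
    current_key, total = None, 0
    for line in stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{total}")
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        print(f"{current_key}\t{total}")

if __name__ == "__main__" and len(sys.argv) > 1:
    # "Map" or "Reduce" comes from the -mapper / -reducer command line.
    {"Map": run_map, "Reduce": run_reduce}[sys.argv[1]](sys.stdin)
```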

Linux commands for YARN (inspecting/killing Hadoop jobs)

  • Kill a job on the cluster: yarn application -kill jobId. Pressing Ctrl+C in your local Linux shell does nothing; the job keeps running on the cluster.
  • List the jobs running on YARN: yarn application -list
  • Use grep to pick out matching jobs: yarn application -list | grep; when there are too many jobs, grep filters the list down to the ones you care about.

Hadoop Streaming: common errors and their causes

1. Pitfalls from practice

  • 1) Files shipped with -cacheArchive (the ones stored on the cluster) must not reuse the same name, e.g. Res.tar.gz#Res
  • 2) A related tar pitfall I hit: on Linux, tar zcvf B.tar.gz A packs directory A under the name B.tar.gz, but extracting it (tar zxvf B.tar.gz) still produces A.
    For example: a directory named 0706A, compressed and renamed to 0630B.tar.gz, still extracts back to the 0706A directory.
  • 3) Another problem everyone has surely seen: running out of memory. It looks like this:

Container[pid=41884,containerID=container_1405950053048_0016_01_000284] is running beyond virtual memory limits. Current usage: 314.6 MB of 2.9 GB physical memory used; 8.7 GB of 6.2 GB virtual memory used. Killing container.

This happens because the container's minimum and maximum memory were 3000 MB and 10000 MB respectively; the reduce setting defaulted below 2000 MB and the map value was unset, so both were raised to 3000 MB, which is the "2.9 GB physical memory used" in the log. With the default virtual memory ratio of 2.1, the total virtual memory allowed for each Map Task and Reduce Task is 3000 * 2.1 = 6300 MB, roughly 6.2 GB. The application's virtual memory exceeded that limit, hence the error.

Fix: increase the memory requested for map and reduce, e.g. mapreduce.map.memory.mb=2048 (2 GB).
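The container-kill arithmetic above can be checked directly, with the values taken from the log and the explanation:

```python
# Values from the log/explanation above.
container_mb = 3000          # physical memory granted to the container
vmem_ratio = 2.1             # default virtual-to-physical memory ratio
vmem_limit_gb = container_mb * vmem_ratio / 1024

used_vmem_gb = 8.7           # what the log reported as in use
print(round(vmem_limit_gb, 1))        # 6.2 (GB), the limit in the log
print(used_vmem_gb > vmem_limit_gb)   # True, so YARN kills the container
```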
  • 4) Another very common issue: when a configuration file (not the job's input file) is large, the program can take a very long time (e.g. 10 h to finish a single reduce), and you then need to raise the task timeout.

The parameter means: if a task makes no progress within the given window, i.e. reads no new data and produces no output, it is considered blocked, possibly stuck forever; to keep a user program from blocking without ever exiting, Hadoop enforces this timeout (in milliseconds), 300000 by default. If your program spends a long time on each input record (e.g. querying a database or fetching data over the network), raise this parameter.
When the value is too small, the typical error message is:

“AttemptID:attempt_14267829456721_123456_m_000224_0 Timed out after 300 secs. Container killed by the ApplicationMaster.”

Fix: set the maximum time a task may go without responding (no input or output), e.g. mapred.task.timeout=86400000 (the unit is ms, so this is 24 h). With the limit set to 24 h, the job keeps running, and will continue even beyond 24 h.
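Besides raising the timeout, a streaming task can report progress itself: Hadoop Streaming treats stderr lines of the form `reporter:status:<message>` (and `reporter:counter:<group>,<counter>,<amount>`) as progress updates, which also resets the timeout clock. A minimal sketch of a heartbeat during a slow per-record computation (the `process_record` body is a hypothetical stand-in):

```python
import sys
import time

def process_record(line):
    """Stand-in for a slow per-record computation (hypothetical)."""
    return line.strip().upper()

def run(stdin, heartbeat_secs=60):
    """Process stdin line by line, emitting a status line to stderr
    periodically so the framework knows the task is still alive."""
    last_beat = time.monotonic()
    for line in stdin:
        print(process_record(line))
        if time.monotonic() - last_beat > heartbeat_secs:
            sys.stderr.write("reporter:status:still processing\n")
            sys.stderr.flush()
            last_beat = time.monotonic()
```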

2. Error causes, in theory

The exception you will run into most often is: PipeMapRed.waitOutputThreads(): subprocess failed with code N, where N maps to the codes below:

“OS error code 1: Operation not permitted”
“OS error code 2: No such file or directory”
“OS error code 3: No such process”
“OS error code 4: Interrupted system call”
“OS error code 5: Input/output error”
“OS error code 6: No such device or address”
“OS error code 7: Argument list too long”
“OS error code 8: Exec format error”
“OS error code 9: Bad file descriptor”
“OS error code 10: No child processes”
“OS error code 11: Resource temporarily unavailable”
“OS error code 12: Cannot allocate memory”
“OS error code 13: Permission denied”
“OS error code 14: Bad address”
“OS error code 15: Block device required”
“OS error code 16: Device or resource busy”
“OS error code 17: File exists”
“OS error code 18: Invalid cross-device link”
“OS error code 19: No such device”
“OS error code 20: Not a directory”
“OS error code 21: Is a directory”
“OS error code 22: Invalid argument”
“OS error code 23: Too many open files in system”
“OS error code 24: Too many open files”
“OS error code 25: Inappropriate ioctl for device”
“OS error code 26: Text file busy”
“OS error code 27: File too large”
“OS error code 28: No space left on device”
“OS error code 29: Illegal seek”
“OS error code 30: Read-only file system”
“OS error code 31: Too many links”
“OS error code 32: Broken pipe”
“OS error code 33: Numerical argument out of domain”
“OS error code 34: Numerical result out of range”
“OS error code 35: Resource deadlock avoided”
“OS error code 36: File name too long”
“OS error code 37: No locks available”
“OS error code 38: Function not implemented”
“OS error code 39: Directory not empty”
“OS error code 40: Too many levels of symbolic links”
“OS error code 42: No message of desired type”
“OS error code 43: Identifier removed”
“OS error code 44: Channel number out of range”
“OS error code 45: Level 2 not synchronized”
“OS error code 46: Level 3 halted”
“OS error code 47: Level 3 reset”
“OS error code 48: Link number out of range”
“OS error code 49: Protocol driver not attached”
“OS error code 50: No CSI structure available”
“OS error code 51: Level 2 halted”
“OS error code 52: Invalid exchange”
“OS error code 53: Invalid request descriptor”
“OS error code 54: Exchange full”
“OS error code 55: No anode”
“OS error code 56: Invalid request code”
“OS error code 57: Invalid slot”
“OS error code 59: Bad font file format”
“OS error code 60: Device not a stream”
“OS error code 61: No data available”
“OS error code 62: Timer expired”
“OS error code 63: Out of streams resources”
“OS error code 64: Machine is not on the network”
“OS error code 65: Package not installed”
“OS error code 66: Object is remote”
“OS error code 67: Link has been severed”
“OS error code 68: Advertise error”
“OS error code 69: Srmount error”
“OS error code 70: Communication error on send”
“OS error code 71: Protocol error”
“OS error code 72: Multihop attempted”
“OS error code 73: RFS specific error”
“OS error code 74: Bad message”
“OS error code 75: Value too large for defined data type”
“OS error code 76: Name not unique on network”
“OS error code 77: File descriptor in bad state”
“OS error code 78: Remote address changed”
“OS error code 79: Can not access a needed shared library”
“OS error code 80: Accessing a corrupted shared library”
“OS error code 81: .lib section in a.out corrupted”
“OS error code 82: Attempting to link in too many shared libraries”
“OS error code 83: Cannot exec a shared library directly”
“OS error code 84: Invalid or incomplete multibyte or wide character”
“OS error code 85: Interrupted system call should be restarted”
“OS error code 86: Streams pipe error”
“OS error code 87: Too many users”
“OS error code 88: Socket operation on non-socket”
“OS error code 89: Destination address required”
“OS error code 90: Message too long”
“OS error code 91: Protocol wrong type for socket”
“OS error code 92: Protocol not available”
“OS error code 93: Protocol not supported”
“OS error code 94: Socket type not supported”
“OS error code 95: Operation not supported”
“OS error code 96: Protocol family not supported”
“OS error code 97: Address family not supported by protocol”
“OS error code 98: Address already in use”
“OS error code 99: Cannot assign requested address”
“OS error code 100: Network is down”
“OS error code 101: Network is unreachable”
“OS error code 102: Network dropped connection on reset”
“OS error code 103: Software caused connection abort”
“OS error code 104: Connection reset by peer”
“OS error code 105: No buffer space available”
“OS error code 106: Transport endpoint is already connected”
“OS error code 107: Transport endpoint is not connected”
“OS error code 108: Cannot send after transport endpoint shutdown”
“OS error code 109: Too many references: cannot splice”
“OS error code 110: Connection timed out”
“OS error code 111: Connection refused”
“OS error code 112: Host is down”
“OS error code 113: No route to host”
“OS error code 114: Operation already in progress”
“OS error code 115: Operation now in progress”
“OS error code 116: Stale NFS file handle”
“OS error code 117: Structure needs cleaning”
“OS error code 118: Not a XENIX named type file”
“OS error code 119: No XENIX semaphores available”
“OS error code 120: Is a named type file”
“OS error code 121: Remote I/O error”
“OS error code 122: Disk quota exceeded”
“OS error code 123: No medium found”
“OS error code 124: Wrong medium type”
“OS error code 125: Operation canceled”
“OS error code 126: Required key not available”
“OS error code 127: Key has expired”
“OS error code 128: Key has been revoked”
“OS error code 129: Key was rejected by service”
“OS error code 130: Owner died”
“OS error code 131: State not recoverable”
“MySQL error code 132: Old database file”
“MySQL error code 133: No record read before update”
“MySQL error code 134: Record was already deleted (or record file crashed)”
“MySQL error code 135: No more room in record file”
“MySQL error code 136: No more room in index file”
“MySQL error code 137: No more records (read after end of file)”
“MySQL error code 138: Unsupported extension used for table”
“MySQL error code 139: Too big row” (in my case this code basically meant a segmentation fault; check whether your code has a bug)
“MySQL error code 140: Wrong create options”
“MySQL error code 141: Duplicate unique key or constraint on write or update”
“MySQL error code 142: Unknown character set used”
“MySQL error code 143: Conflicting table definitions in sub-tables of MERGE table”
“MySQL error code 144: Table is crashed and last repair failed”
“MySQL error code 145: Table was marked as crashed and should be repaired”
“MySQL error code 146: Lock timed out; Retry transaction”
“MySQL error code 147: Lock table is full; Restart program with a larger locktable”
“MySQL error code 148: Updates are not allowed under a read only transactions”
“MySQL error code 149: Lock deadlock; Retry transaction”
“MySQL error code 150: Foreign key constraint is incorrectly formed”
“MySQL error code 151: Cannot add a child row”
“MySQL error code 152: Cannot delete a parent row”
