Python Hadoop Streaming: using files with Hadoop Streaming in Python

I am completely new to Hadoop and MapReduce and am trying to work my way through it.

I am trying to develop a MapReduce application in Python that uses data from two .csv files. The mapper simply reads the two files and prints the key-value pairs from them to sys.stdout.

The program runs fine on a single machine, but with Hadoop Streaming I get an error. I think I am making a mistake when reading the files in the mapper on Hadoop. Please help me with the code, and tell me how to handle files in Hadoop Streaming. The mapper.py code is below (the comments explain it):

#!/usr/bin/env python

import sys

def read_input(inVal):
    for line in inVal:
        # yield each line with surrounding whitespace stripped
        yield line.strip()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    labels = []
    data = []
    incoming = read_input(sys.stdin)
    for vals in incoming:
        # short lines are labels, long lines are data rows
        if len(vals) > 10:
            data.append(vals)
        else:
            labels.append(vals)
    # write the results to STDOUT (standard output);
    # what we output here will be the input for the
    # reduce step, i.e. the input for reducer.py
    # (tab-delimited)
    for i in range(len(labels)):
        print "%s%s%s" % (labels[i], separator, data[i])

if __name__ == "__main__":
    main()

There are 60,000 records fed to this mapper from the two .csv files, like so (on a single machine, not a Hadoop cluster):

cat mnist_train_labels.csv mnist_train_data.csv | ./mapper.py
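The pairing logic can be sanity-checked locally without Hadoop. The sketch below re-implements the mapper's loop for Python 3 and feeds it two made-up "files" concatenated in memory (the sample lines are invented, not real MNIST data):

```python
import io

def mapper(stream, separator='\t'):
    # mimic the mapper: short lines are labels, long lines are data rows
    labels, data = [], []
    for line in stream:
        vals = line.strip()
        if len(vals) > 10:
            data.append(vals)
        else:
            labels.append(vals)
    # pair each label with the data row at the same index
    return [labels[i] + separator + data[i] for i in range(len(labels))]

# two fake "files" concatenated, as `cat labels.csv data.csv` would produce
sample = io.StringIO("7\n2\n0,0,0,128,255,64,0,0,0,0,0,0\n0,0,9,200,13,0,0,0,0,0,0,0\n")
for pair in mapper(sample):
    print(pair)
```

Note that this pairing only works because `cat` emits all labels before all data rows, so the two lists stay index-aligned.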

Solution

I was able to resolve the issue after searching for a solution for about 3 days.

The problem is with newer versions of Hadoop (2.2.0 in my case). The mapper, while reading values from the files, was exiting with a non-zero code at some point (perhaps because it was reading a huge list of 784 values at a time). Hadoop 2.2.0 has a setting that tells the Hadoop system to treat any non-zero exit code as a general error ("subprocess failed with code 1"). This setting is true by default. I just had to set the property to false, and my code ran without any errors.
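The failure mode can be reproduced outside Hadoop: streaming just watches the mapper's exit status, so any script that dies mid-stream reports a non-zero code to its parent process. A minimal illustration (the inline child script is a deliberately tiny stand-in for the mapper):

```python
import subprocess
import sys

# run a child that produces some output and then exits non-zero,
# as the mapper did under Hadoop
proc = subprocess.run(
    [sys.executable, "-c", "import sys; print('partial output'); sys.exit(1)"],
    capture_output=True, text=True,
)
# streaming sees exactly this returncode and, by default,
# reports "subprocess failed with code 1"
print(proc.returncode)  # prints 1
```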

The setting is stream.non.zero.exit.is.failure. Just set it to false when streaming, so the streaming command looks something like:

hadoop jar ... -D stream.non.zero.exit.is.failure=false ...

Hope it helps someone, and saves 3 days... ;)
