What I have been working on recently is writing MapReduce jobs with mrjob that take their data from MongoDB. My approach is simple and easy to follow: since mrjob can read from sys.stdin, I use one Python script to read the data out of Mongo and pipe it into the mrjob script, which then processes it and writes the output.
Specifically:
readInMongoDB.py:
#coding:UTF-8
'''
Created on 2014-05-28
@author: hao
'''
import pymongo

host = 'localhost'  # MongoDB host; adjust to your deployment
pyconn = pymongo.Connection(host, port=27017)
# 'db' is a placeholder database name; batch_size(30) keeps the cursor
# from pulling too many documents at once
pycursor = pyconn.db.userid_cid_score.find().batch_size(30)
for i in pycursor:
    userId = i['userId']
    cid = i['cid']
    score = i['score']
    # one CSV line per document, for step1.py to consume on stdin
    print str(userId) + ',' + str(cid) + ',' + str(score)
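(Note: pymongo.Connection has since been deprecated in favor of MongoClient. On a newer pymongo the reader would look roughly like this; a minimal sketch, with the same placeholder host, database, and collection names as above:)

import pymongo

client = pymongo.MongoClient('localhost', 27017)
coll = client['db']['userid_cid_score']
for doc in coll.find().batch_size(30):
    print '%s,%s,%s' % (doc['userId'], doc['cid'], doc['score'])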
step1.py:
#coding:UTF-8
'''
Created on 2014-05-27
@author: hao
'''
from mrjob.job import MRJob
import pymongo
import logging
import simplejson as sj


class step(MRJob):
    '''
    Two-step job: group (user, rating) pairs by content id, then emit
    rating pairs for every two users who rated the same content.
    '''

    def parseMatrix(self, _, line):
        '''
        Input: one stdin line per pymongo result, "userId,cid,score".
        Output: cid, (userId, rating)
        '''
        line = str(line).split(',')
        userId = line[0]
        cid = line[1]
        score = float(line[2])
        yield cid, (userId, score)

    def scoreCombine(self, cid, userRating):
        '''
        Collect all the (user, rating) pairs for one content id into a list.
        '''
        userRatings = list()
        for i in userRating:
            userRatings.append(i)
        yield cid, userRatings

    def userBehavior(self, cid, userRatings):
        '''
        Flatten the per-reducer lists, then emit every ordered pair of
        distinct users together with their two ratings.
        '''
        scoreList = list()
        for doc in userRatings:
            # each doc is one list produced by scoreCombine
            for i in doc:
                scoreList.append(i)
        for user1 in scoreList:
            for user2 in scoreList:
                if user1[0] == user2[0]:
                    continue
                yield (user1[0], user2[0]), (user1[1], user2[1])

    def steps(self):
        return [self.mr(mapper=self.parseMatrix,
                        reducer=self.scoreCombine),
                self.mr(reducer=self.userBehavior)]


if __name__ == '__main__':
    fp = open('a.txt', 'w')
    fp.write('[')
    step.run()
    fp.write(']')
    fp.close()
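To make the two steps concrete, here is how a tiny made-up input would flow through the job:

# stdin:
#   u1,c1,5.0
#   u2,c1,3.0
# parseMatrix (step 1 mapper):
#   'c1' -> ('u1', 5.0)
#   'c1' -> ('u2', 3.0)
# scoreCombine (step 1 reducer):
#   'c1' -> [('u1', 5.0), ('u2', 3.0)]
# userBehavior (step 2 reducer):
#   ('u1', 'u2') -> (5.0, 3.0)
#   ('u2', 'u1') -> (3.0, 5.0)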
Then run: python readInMongoDB.py | python step1.py >> out.txt
Locally this works very well, with no problems at all (apart from mrjob's speed, which hardly matters for this application).
But the trouble appeared when I moved it into the Hadoop environment and ran
python readInMongoDB.py | python step1.py -r hadoop >> out.txt
which produced:
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/step1.root.20140606.091711.815391
writing wrapper script to /tmp/step1.root.20140606.091711.815391/setup-wrapper.sh
reading from STDIN
Copying local files into hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/
Using Hadoop version 2.0.0
HADOOP: packageJobJar: [] [/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/hadoop-streaming.jar] /tmp/streamjob8615643898520402804.jar tmpDir=null
HADOOP: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
HADOOP: Total input paths to process : 1
HADOOP: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
HADOOP: Running job: job_201405161502_0059
HADOOP: To kill this job, run:
HADOOP: /opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=v-lab-110:8021 -kill job_201405161502_0059
HADOOP: Tracking URL: http://v-lab-110:50030/jobdetails.jsp?jobid=job_201405161502_0059
HADOOP: map 0% reduce 0%
HADOOP: map 100% reduce 100%
HADOOP: To kill this job, run:
HADOOP: /opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=v-lab-110:8021 -kill job_201405161502_0059
HADOOP: Tracking URL: http://v-lab-110:50030/jobdetails.jsp?jobid=job_201405161502_0059
HADOOP: Job not successful. Error: NA
HADOOP: killJob...
HADOOP: Streaming Command Failed!
Job failed with return code 256: ['/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/bin/hadoop', 'jar', '/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/hadoop-streaming.jar', '-files', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/setup-wrapper.sh#setup-wrapper.sh,hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/step1.py#step1.py', '-archives', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/mrjob.tar.gz#mrjob.tar.gz', '-input', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/STDIN', '-output', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/step-output/1', '-mapper', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --mapper', '-combiner', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --combiner', '-reducer', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --reducer']
Scanning logs for probable cause of failure
Traceback (most recent call last):
File "step1.py", line 176, in <module>
step.run()
File "/usr/local/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
mr_job.execute()
File "/usr/local/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
super(MRJob, self).execute()
File "/usr/local/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
self.run_job()
File "/usr/local/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
runner.run()
File "/usr/local/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
self._run()
File "/usr/local/lib/python2.7/site-packages/mrjob/hadoop.py", line 239, in _run
self._run_job_in_hadoop()
File "/usr/local/lib/python2.7/site-packages/mrjob/hadoop.py", line 358, in _run_job_in_hadoop
raise CalledProcessError(returncode, step_args)
subprocess.CalledProcessError: Command '['/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/bin/hadoop', 'jar', '/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/hadoop-streaming.jar', '-files', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/setup-wrapper.sh#setup-wrapper.sh,hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/step1.py#step1.py', '-archives', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/mrjob.tar.gz#mrjob.tar.gz', '-input', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/STDIN', '-output', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/step-output/1', '-mapper', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --mapper', '-combiner', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --combiner', '-reducer', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --reducer']' returned non-zero exit status 256
So the link between MongoDB and mrjob has to be pieced together by hand: run the MapReduce first, then use another Python script that reads from Mongo and fills in the collaborative-filtering results.
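For reference, that stitching script would look roughly like this; a minimal sketch, where the 'db' database and the 'user_sim' target collection are placeholders, and which assumes mrjob's default output protocol of a JSON key and a JSON value separated by a tab:

#coding:UTF-8
# fillResults.py - sketch of the manual stitching described above
import pymongo
import simplejson as sj

conn = pymongo.Connection('localhost', port=27017)  # adjust host
coll = conn.db.user_sim  # placeholder target collection

with open('out.txt') as fp:
    for line in fp:
        # mrjob's default protocol writes "<JSON key>\t<JSON value>"
        key, value = line.rstrip('\n').split('\t', 1)
        users = sj.loads(key)      # e.g. ["u1", "u2"]
        ratings = sj.loads(value)  # e.g. [5.0, 3.0]
        coll.insert({'users': users, 'ratings': ratings})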
Leaving this post as a record of the ordeal.