The conflict between mrjob and pymongo


What I have been doing lately is writing MapReduce jobs with mrjob that read their data from MongoDB. My approach is simple and easy to follow: since mrjob can read from sys.stdin, I use one Python script to pull the data out of Mongo and print it, and pipe that into the mrjob script, which receives the input, processes it, and writes the output.


Here is the concrete setup:

readInMongoDB.py:

# coding: UTF-8
'''
Created on May 28, 2014

@author: hao
'''
import pymongo

host = 'localhost'  # placeholder: the host was left unspecified in the original
pyconn = pymongo.Connection(host, port=27017)
# attribute access on a Connection selects a database; the database name was
# omitted in the original, so 'mydb' below is a placeholder
pycursor = pyconn.mydb.userid_cid_score.find().batch_size(30)
for i in pycursor:
    userId = i['userId']
    cid = i['cid']
    score = i['score']
    print str(userId) + ',' + str(cid) + ',' + str(score)

step1.py:

# coding: UTF-8
'''
Created on May 27, 2014

@author: hao
'''
from mrjob.job import MRJob
import pymongo  # importing pymongo here is what later breaks the Hadoop run


class step(MRJob):
    def parseMatrix(self, _, line):
        '''
        Input: one stdin line per pymongo result row
        Output: contentId, (userId, rating)
        '''
        userId, cid, score = str(line).split(',')
        yield cid, (userId, float(score))

    
    def scoreCombine(self, cid, userRating):
        '''
        Collect every (userId, rating) pair for one content id into a list.
        '''
        yield cid, list(userRating)
        
    def userBehavior(self, cid, userRatings):
        '''
        Flatten the combined lists, then emit a rating pair for every
        ordered pair of distinct users who scored the same content.
        '''
        scoreList = list()
        for doc in userRatings:
            # each doc is one combined result list
            for i in doc:
                scoreList.append(i)
        for user1 in scoreList:
            for user2 in scoreList:
                if user1[0] == user2[0]:
                    continue
                yield (user1[0], user2[0]), (user1[1], user2[1])
    
    
    def steps(self):
        return [self.mr(mapper=self.parseMatrix,
                        reducer=self.scoreCombine),
                self.mr(reducer=self.userBehavior)]
    
    
if __name__ == '__main__':
    # note: the brackets go to a.txt, while the job's own
    # output still goes to stdout
    fp = open('a.txt', 'w')
    fp.write('[')
    step.run()
    fp.write(']')
    fp.close()

Then run the pipeline:  python readInMongoDB.py | python step1.py >> out.txt
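The split that the mapper performs on each piped line can be checked outside of mrjob entirely; a minimal sketch of the same logic (the helper name parse_line is mine, not from the scripts above):

```python
def parse_line(line):
    """Split one 'userId,cid,score' line the way the mapper does."""
    user_id, cid, score = line.strip().split(',')
    return cid, (user_id, float(score))

# example: one row as printed by readInMongoDB.py
print(parse_line('u1,c42,3.5'))  # → ('c42', ('u1', 3.5))
```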


Run locally, this works perfectly, with no problems at all (apart from mrjob's speed, which hardly matters for this application).


But the trouble started when I moved it onto the Hadoop cluster and ran:

python readInMongoDB.py | python step1.py -r hadoop >> out.txt

The job failed with the following output:

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/step1.root.20140606.091711.815391
writing wrapper script to /tmp/step1.root.20140606.091711.815391/setup-wrapper.sh
reading from STDIN
Copying local files into hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/
Using Hadoop version 2.0.0
HADOOP: packageJobJar: [] [/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/hadoop-streaming.jar] /tmp/streamjob8615643898520402804.jar tmpDir=null
HADOOP: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
HADOOP: Total input paths to process : 1
HADOOP: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
HADOOP: Running job: job_201405161502_0059
HADOOP: To kill this job, run:
HADOOP: /opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=v-lab-110:8021 -kill job_201405161502_0059
HADOOP: Tracking URL: http://v-lab-110:50030/jobdetails.jsp?jobid=job_201405161502_0059
HADOOP:  map 0%  reduce 0%
HADOOP:  map 100%  reduce 100%
HADOOP: To kill this job, run:
HADOOP: /opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=v-lab-110:8021 -kill job_201405161502_0059
HADOOP: Tracking URL: http://v-lab-110:50030/jobdetails.jsp?jobid=job_201405161502_0059
HADOOP: Job not successful. Error: NA
HADOOP: killJob...
HADOOP: Streaming Command Failed!
Job failed with return code 256: ['/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/bin/hadoop', 'jar', '/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/hadoop-streaming.jar', '-files', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/setup-wrapper.sh#setup-wrapper.sh,hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/step1.py#step1.py', '-archives', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/mrjob.tar.gz#mrjob.tar.gz', '-input', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/STDIN', '-output', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/step-output/1', '-mapper', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --mapper', '-combiner', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --combiner', '-reducer', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --reducer']

Scanning logs for probable cause of failure

Traceback (most recent call last):
  File "step1.py", line 176, in <module>
    step.run()
  File "/usr/local/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
    mr_job.execute()
  File "/usr/local/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/usr/local/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/usr/local/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/usr/local/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
    self._run()
  File "/usr/local/lib/python2.7/site-packages/mrjob/hadoop.py", line 239, in _run
    self._run_job_in_hadoop()
  File "/usr/local/lib/python2.7/site-packages/mrjob/hadoop.py", line 358, in _run_job_in_hadoop
    raise CalledProcessError(returncode, step_args)
subprocess.CalledProcessError: Command '['/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/bin/hadoop', 'jar', '/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/hadoop-streaming.jar', '-files', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/setup-wrapper.sh#setup-wrapper.sh,hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/step1.py#step1.py', '-archives', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/mrjob.tar.gz#mrjob.tar.gz', '-input', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/STDIN', '-output', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/step-output/1', '-mapper', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --mapper', '-combiner', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --combiner', '-reducer', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --reducer']' returned non-zero exit status 256


The Hadoop environment is Cloudera's CDH, and plenty of other sample jobs had already passed on it, so the failure made no sense at first. After many rounds of elimination testing I finally found the cause: step1.py must not import pymongo. Commenting out that single import makes the job run. I suspect this is a bug and have filed an issue on mrjob's GitHub.
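Until that is fixed upstream, one workaround is to make the import optional, so the identical script can run both locally and under Hadoop Streaming. This is a sketch I have not verified on that cluster:

```python
try:
    import pymongo  # only needed when the script talks to Mongo directly
except ImportError:
    pymongo = None  # on the Hadoop nodes the job never uses it anyway
```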


So for now the link between MongoDB and mrjob has to be stitched together by hand: run the MapReduce job first, then use a separate Python script to read from Mongo and fill in the collaborative-filtering results.
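As a sketch of that manual bridge: the second script only needs to parse mrjob's default output format (one "JSON key, tab, JSON value" line per record) before writing documents back to Mongo. The field and collection names below are illustrative, not from the original scripts:

```python
import json

def load_results(path):
    """Parse mrjob's default output: JSON key, tab, JSON value per line."""
    docs = []
    with open(path) as fp:
        for line in fp:
            key, value = line.rstrip('\n').split('\t', 1)
            users = json.loads(key)      # e.g. ["user1", "user2"]
            ratings = json.loads(value)  # e.g. [4.0, 3.5]
            docs.append({'users': users, 'ratings': ratings})
    return docs

# writing back would then be something like (illustrative names):
# pymongo.MongoClient(host)['mydb']['cf_results'].insert(load_results('out.txt'))
```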


Leaving this post here as a record of the pain.



