What I have been working on recently is writing MapReduce jobs with mrjob that take their data from MongoDB. My approach is simple and easy to follow: since mrjob can read from sys.stdin, I use one Python script to read the data out of Mongo and pipe it into the mrjob script, which then processes it and writes the output.
Specifically:
readInMongoDB.py:
#coding:UTF-8
'''
Created on 2014-05-28
@author: hao
'''
import pymongo

host = 'localhost'  # MongoDB host; adjust to your deployment
pyconn = pymongo.Connection(host, port=27017)
# 'db' is a placeholder database name; batch_size(30) keeps the cursor
# from pulling too many documents at once
pycursor = pyconn.db.userid_cid_score.find().batch_size(30)
for i in pycursor:
    userId = i['userId']
    cid = i['cid']
    score = i['score']
    # one CSV line per document, for step1.py to consume on stdin
    print str(userId) + ',' + str(cid) + ',' + str(score)
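(Note: pymongo.Connection has since been deprecated in favor of MongoClient. On a newer pymongo the reader would look roughly like this; a minimal sketch, with the same placeholder host, database, and collection names as above:)

import pymongo

client = pymongo.MongoClient('localhost', 27017)
coll = client['db']['userid_cid_score']
for doc in coll.find().batch_size(30):
    print '%s,%s,%s' % (doc['userId'], doc['cid'], doc['score'])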
step1.py:
#coding:UTF-8
'''
Created on 2014-05-27
@author: hao
'''
from mrjob.job import MRJob
import pymongo
import logging
import simplejson as sj


class step(MRJob):
    '''
    Two-step job: group (user, rating) pairs by content id, then emit
    rating pairs for every two users who rated the same content.
    '''

    def parseMatrix(self, _, line):
        '''
        Input: one stdin line per pymongo result, "userId,cid,score".
        Output: cid, (userId, rating)
        '''
        line = str(line).split(',')
        userId = line[0]
        cid = line[1]
        score = float(line[2])
        yield cid, (userId, score)

    def scoreCombine(self, cid, userRating):
        '''
        Collect all the (user, rating) pairs for one content id into a list.
        '''
        userRatings = list()
        for i in userRating:
            userRatings.append(i)
        yield cid, userRatings

    def userBehavior(self, cid, userRatings):
        '''
        Flatten the per-reducer lists, then emit every ordered pair of
        distinct users together with their two ratings.
        '''
        scoreList = list()
        for doc in userRatings:
            # each doc is one list produced by scoreCombine
            for i in doc:
                scoreList.append(i)
        for user1 in scoreList:
            for user2 in scoreList:
                if user1[0] == user2[0]:
                    continue
                yield (user1[0], user2[0]), (user1[1], user2[1])

    def steps(self):
        return [self.mr(mapper=self.parseMatrix,
                        reducer=self.scoreCombine),
                self.mr(reducer=self.userBehavior)]


if __name__ == '__main__':
    fp = open('a.txt', 'w')
    fp.write('[')
    step.run()
    fp.write(']')
    fp.close()
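To make the two steps concrete, here is how a tiny made-up input would flow through the job:

# stdin:
#   u1,c1,5.0
#   u2,c1,3.0
# parseMatrix (step 1 mapper):
#   'c1' -> ('u1', 5.0)
#   'c1' -> ('u2', 3.0)
# scoreCombine (step 1 reducer):
#   'c1' -> [('u1', 5.0), ('u2', 3.0)]
# userBehavior (step 2 reducer):
#   ('u1', 'u2') -> (5.0, 3.0)
#   ('u2', 'u1') -> (3.0, 5.0)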
Then run: python readInMongoDB.py | python step1.py >> out.txt
Locally this works very well, with no problems at all (apart from mrjob's speed, which hardly matters for this application).
But the trouble appeared when I moved it into the Hadoop environment and ran
python readInMongoDB.py | python step1.py -r hadoop >> out.txt
which produced:
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/step1.root.20140606.091711.815391
writing wrapper script to /tmp/step1.root.20140606.091711.815391/setup-wrapper.sh
reading from STDIN
Copying local files into hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/
Using Hadoop version 2.0.0
HADOOP: packageJobJar: [] [/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/hadoop-streaming.jar] /tmp/streamjob8615643898520402804.jar tmpDir=null
HADOOP: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
HADOOP: Total input paths to process : 1
HADOOP: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
HADOOP: Running job: job_201405161502_0059
HADOOP: To kill this job, run:
HADOOP: /opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=v-lab-110:8021 -kill job_201405161502_0059
HADOOP: Tracking URL: http://v-lab-110:50030/jobdetails.jsp?jobid=job_201405161502_0059
HADOOP: map 0% reduce 0%
HADOOP: map 100% reduce 100%
HADOOP: To kill this job, run:
HADOOP: /opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=v-lab-110:8021 -kill job_201405161502_0059
HADOOP: Tracking URL: http://v-lab-110:50030/jobdetails.jsp?jobid=job_201405161502_0059
HADOOP: Job not successful. Error: NA
HADOOP: killJob...
HADOOP: Streaming Command Failed!
Job failed with return code 256: ['/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/bin/hadoop', 'jar', '/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/hadoop-streaming.jar', '-files', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/setup-wrapper.sh#setup-wrapper.sh,hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/step1.py#step1.py', '-archives', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/mrjob.tar.gz#mrjob.tar.gz', '-input', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/STDIN', '-output', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/step-output/1', '-mapper', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --mapper', '-combiner', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --combiner', '-reducer', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --reducer']
Scanning logs for probable cause of failure
Traceback (most recent call last):
File "step1.py", line 176, in <module>
step.run()
File "/usr/local/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
mr_job.execute()
File "/usr/local/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
super(MRJob, self).execute()
File "/usr/local/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
self.run_job()
File "/usr/local/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
runner.run()
File "/usr/local/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
self._run()
File "/usr/local/lib/python2.7/site-packages/mrjob/hadoop.py", line 239, in _run
self._run_job_in_hadoop()
File "/usr/local/lib/python2.7/site-packages/mrjob/hadoop.py", line 358, in _run_job_in_hadoop
raise CalledProcessError(returncode, step_args)
subprocess.CalledProcessError: Command '['/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/bin/hadoop', 'jar', '/opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop/hadoop-streaming.jar', '-files', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/setup-wrapper.sh#setup-wrapper.sh,hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/step1.py#step1.py', '-archives', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/mrjob.tar.gz#mrjob.tar.gz', '-input', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/files/STDIN', '-output', 'hdfs:///user/root/tmp/mrjob/step1.root.20140606.091711.815391/step-output/1', '-mapper', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --mapper', '-combiner', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --combiner', '-reducer', 'sh -e setup-wrapper.sh python step1.py --step-num=0 --reducer']' returned non-zero exit status 256
So the link between MongoDB and mrjob has to be pieced together by hand: run the MapReduce first, then use another Python script that reads from Mongo and fills in the collaborative-filtering results.
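For reference, that stitching script would look roughly like this; a minimal sketch, where the 'db' database and the 'user_sim' target collection are placeholders, and which assumes mrjob's default output protocol of a JSON key and a JSON value separated by a tab:

#coding:UTF-8
# fillResults.py - sketch of the manual stitching described above
import pymongo
import simplejson as sj

conn = pymongo.Connection('localhost', port=27017)  # adjust host
coll = conn.db.user_sim  # placeholder target collection

with open('out.txt') as fp:
    for line in fp:
        # mrjob's default protocol writes "<JSON key>\t<JSON value>"
        key, value = line.rstrip('\n').split('\t', 1)
        users = sj.loads(key)      # e.g. ["u1", "u2"]
        ratings = sj.loads(value)  # e.g. [5.0, 3.0]
        coll.insert({'users': users, 'ratings': ratings})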
Leaving this post as a record of the ordeal.