Hadoop and Python

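Last.fm, the well-known music site, has released Dumbo (as in the flying elephant), a Python-based project. Dumbo makes it easier for Python developers to write Hadoop applications and provides a flexible, easy-to-use Python API for MapReduce programs. Klaas Bosteels, a Last.fm developer and the founder of the Dumbo project, believes that using Python instead of Java makes writing custom Hadoop applications more productive.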
Project website: http://wiki.github.com/klbostee/dumbo/
Setting up an nginx and Python environment: http://www.dbasky.net/archives/2009/08/nginx-python-django-memcached-mysql-fastcgi.html
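
To give a sense of the Python API described above, here is a minimal word-count sketch in the generator style used by Dumbo's tutorial: a mapper that yields (word, 1) pairs and a reducer that sums the values for each key. The `run_local` helper is hypothetical and not part of Dumbo; it only imitates the map, sort, and reduce phases in memory so the two functions can be tried without Hadoop or Dumbo installed.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # key: position of the input record (unused here); value: one line of text
    for word in value.split():
        yield word, 1


def reducer(key, values):
    # values: an iterable of every count emitted for this key
    yield key, sum(values)


def run_local(lines):
    """Hypothetical in-memory stand-in for a cluster run (not a Dumbo API)."""
    # Map phase: feed every line to the mapper and collect its pairs
    pairs = [kv for offset, line in enumerate(lines) for kv in mapper(offset, line)]
    # Shuffle phase: sort by key so that equal keys become adjacent
    pairs.sort(key=itemgetter(0))
    # Reduce phase: call the reducer once per distinct key
    results = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        results.extend(reducer(key, (count for _, count in group)))
    return results


if __name__ == "__main__":
    for word, count in run_local(["to be or not to be", "to do"]):
        print("%s\t%d" % (word, count))
```

With Dumbo installed, the same two functions would be handed to Dumbo instead of `run_local`; the project's tutorial does this with `dumbo.run(mapper, reducer)`.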
Hadoop MapReduce is a distributed computing framework for processing large data sets. The mapper and the reducer are its two main components. Python, a popular programming language, can also be used to write Hadoop MapReduce jobs.

To write a MapReduce job in Python, you can use the Hadoop Streaming API, which lets you use any executable as the mapper and the reducer. The following example implements a word-count mapper and reducer in Python.

Mapper:

```python
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
```

Reducer:

```python
#!/usr/bin/env python
import sys

current_word = None
current_count = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    try:
        # parse the input we got from mapper.py
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
    except ValueError:
        # the line had no tab or count was not a number,
        # so silently ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word, if there was any input
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
```

These scripts can be submitted as a MapReduce job through the Hadoop Streaming API as follows:

```bash
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -input input_file \
    -output output_directory \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
```

Here, input_file is the path to the input file, output_directory is the path to the output directory, and mapper.py and reducer.py are the file names of the Python scripts above.
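
The reducer's reliance on sorted input is easy to check locally. The sketch below is an illustration rather than part of the job: `reduce_stream` is a hypothetical helper that applies the same adjacent-key logic using `itertools.groupby`, and the sample records are made up. Feeding it unsorted mapper output splits a word's count across several records, while sorting by key first, as Hadoop does between the map and reduce steps, yields one total per word.

```python
from itertools import groupby


def parse(line):
    # Split one tab-delimited "word<TAB>count" record from the mapper
    word, count = line.rstrip("\n").split("\t", 1)
    return word, int(count)


def reduce_stream(lines):
    """Sum counts over runs of adjacent identical keys (hypothetical helper)."""
    records = (parse(line) for line in lines)
    return [
        (word, sum(count for _, count in run))
        for word, run in groupby(records, key=lambda rec: rec[0])
    ]


# Made-up mapper output in arrival order, i.e. not sorted by key
mapper_output = ["b\t1", "a\t1", "b\t1", "a\t1", "c\t1"]

# Without sorting, every run of equal keys has length 1
unsorted_result = reduce_stream(mapper_output)

# Sorting by key first makes equal keys adjacent, as the reducer expects
by_key = sorted(mapper_output, key=lambda line: line.split("\t", 1)[0])
sorted_result = reduce_stream(by_key)

if __name__ == "__main__":
    print(unsorted_result)
    print(sorted_result)
```

The same check can be run against the real scripts by piping the mapper's output through a sort step before handing it to the reducer.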