Preface:
I recently set up a Hadoop MapReduce cluster inside our company.
Hadoop version: 1.0.4
Hive version: 0.10
I also deployed Ganglia 3.0.3 for system monitoring, which tracks real-time metrics for every machine in the cluster.
Data can already be queried through Hive; the next step is to write custom MapReduce jobs in Python.
Chapter 1, Part 1
Installing Python and mrjob
First, install pip
pip is Python's package manager, which makes it easy to install the Python modules you need. The distribute project used below is at: https://pypi.python.org/pypi/distribute
Installing and using pip:
Prerequisite:
$ curl -O http://python-distribute.org/distribute_setup.py
$ python distribute_setup.py
Install:
$ curl -O https://raw.github.com/pypa/pip/master/contrib/get-pip.py
$ python get-pip.py
About mrjob:
mrjob is an open-source Python framework that wraps Hadoop Streaming and is actively developed by Yelp. Because Yelp runs entirely on Amazon Web Services, mrjob's integration with EMR is remarkably smooth and easy (via the boto package).
mrjob provides a Python API on top of Hadoop Streaming and lets you use arbitrary objects as keys and values. By default these objects are serialized internally as JSON, but pickled objects are also supported. No other binary I/O formats are available out of the box, but there is a mechanism for implementing custom serialization.
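The internal serialization amounts to writing each key/value pair as two JSON documents separated by a tab. A minimal sketch using only the standard library (the helper names `encode_pair`/`decode_pair` are illustrative, not part of mrjob's API):

```python
import json

def encode_pair(key, value):
    # one mrjob-style internal line: JSON-encoded key, a tab, JSON-encoded value
    return "%s\t%s" % (json.dumps(key), json.dumps(value))

def decode_pair(line):
    # split on the first tab only, so tabs inside the value survive
    raw_key, raw_value = line.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)

line = encode_pair("word", 3)   # '"word"' + tab + '3'
key, value = decode_pair(line)  # round-trips back to ("word", 3)
```

Because keys and values are plain JSON, any JSON-serializable Python object works; switching to pickle trades that readability for support of arbitrary objects.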
Documentation: http://pythonhosted.org/mrjob/
Installing mrjob:
pip install mrjob
A first example:
1) Create your own MapReduce job by subclassing MRJob:
from mrjob.job import MRJob

class MRWordCounter(MRJob):

    def get_words(self, key, line):
        # mapper: emit (word, 1) for every whitespace-separated token
        for word in line.split():
            yield word, 1

    def sum_words(self, word, occurrences):
        # reducer: sum all counts for the same word
        yield word, sum(occurrences)

    def steps(self):
        return [self.mr(self.get_words, self.sum_words)]

if __name__ == '__main__':
    MRWordCounter.run()
2) Run it locally. This is mainly for debugging: once it works locally, the same script can run on the Hadoop cluster, which is one of mrjob's strengths.
python your_mr_job_sub_class.py < log_file_or_whatever > output
I ran it against a plain text file:
python mr1.py /home/hadoop/sqoop/hadoop.log mr1.log
Run output:
no configs found; falling back on auto-configuration
creating tmp directory /tmp/mr1.hadoop.20130303.044656.895346
> /usr/bin/python mr1.py --step-num=0 --mapper /tmp/mr1.hadoop.20130303.044656.895346/input_part-00000
writing to /tmp/mr1.hadoop.20130303.044656.895346/step-0-mapper_part-00000
> /usr/bin/python mr1.py --step-num=0 --mapper /tmp/mr1.hadoop.20130303.044656.895346/input_part-00001
writing to /tmp/mr1.hadoop.20130303.044656.895346/step-0-mapper_part-00001
Counters from step 1:
(no counters found)
writing to /tmp/mr1.hadoop.20130303.044656.895346/step-0-mapper-sorted
> sort /tmp/mr1.hadoop.20130303.044656.895346/step-0-mapper_part-00000 /tmp/mr1.hadoop.20130303.044656.895346/step-0-mapper_part-00001
> /usr/bin/python mr1.py --step-num=0 --reducer /tmp/mr1.hadoop.20130303.044656.895346/input_part-00000
writing to /tmp/mr1.hadoop.20130303.044656.895346/step-0-reducer_part-00000
> /usr/bin/python mr1.py --step-num=0 --reducer /tmp/mr1.hadoop.20130303.044656.895346/input_part-00001
writing to /tmp/mr1.hadoop.20130303.044656.895346/step-0-reducer_part-00001
Counters from step 1:
(no counters found)
Moving /tmp/mr1.hadoop.20130303.044656.895346/step-0-reducer_part-00000 -> /tmp/mr1.hadoop.20130303.044656.895346/output/part-00000
Moving /tmp/mr1.hadoop.20130303.044656.895346/step-0-reducer_part-00001 -> /tmp/mr1.hadoop.20130303.044656.895346/output/part-00001
Streaming final output from /tmp/mr1.hadoop.20130303.044656.895346/output
"!" 240
"--append" 240
"--connect" 240
"--hive-import" 240
"--hive-overwrite" 240
"table" 551
"times(H):2" 1
removing tmp directory /tmp/mr1.hadoop.20130303.044656.895346
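The local runner's flow in the log above (run the mapper over input splits, sort, then reduce) can be sketched with plain Python using the same word-count logic; no mrjob required (`local_word_count` is an illustrative helper, not part of mrjob):

```python
from itertools import groupby

def local_word_count(lines):
    # map phase: emit (word, 1) for every word, across all input lines
    pairs = [(word, 1) for line in lines for word in line.split()]
    # shuffle/sort phase: order pairs by key, like the runner's `sort` step
    pairs.sort(key=lambda kv: kv[0])
    # reduce phase: sum the occurrences for each distinct word
    return {word: sum(n for _, n in group)
            for word, group in groupby(pairs, key=lambda kv: kv[0])}

print(local_word_count(["table table", "--append table"]))
# {'--append': 1, 'table': 3}
```

This mirrors why the `"table" 551` style lines appear in the final output: each reducer receives all counts for one key after the sort.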
3) Running on the Hadoop cluster
a) Deploy Hadoop; see: http://hadoop.apache.org/common/docs/current/
b) Make sure HADOOP_HOME is set in ~/.bash_profile
My configuration file is as follows:
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
# User specific environment and startup programs
PATH=$PATH:$HOME/bin
export JAVA_HOME=/opt/hadoop/java/jdk1.6.0_38
export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export HADOOP_HOME=/opt/hadoop/hadoop/hadoop-1.0.4
export HIVE_HOME=/opt/hadoop/hive/hive-0.10.0-bin
export SQOOP_HOME=/opt/hadoop/sqoop/sqoop-1.4.2.bin__hadoop-1.0.0
export PATH=/usr/sbin:$PATH
export PATH=$SQOOP_HOME/bin:/opt/mysql/mysql-advanced-5.5.28-linux2.6-x86_64/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH:$HOME/bin:$PATH
c) Run the MapReduce job on the Hadoop cluster
python your_mr_job_sub_class.py -r hadoop < input > output
Note that the input argument must start with hdfs://...., otherwise mrjob will not find the file.
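A small guard before launching the job catches this mistake early. A sketch, assuming the job script and paths shown (the function `run_on_hadoop` is an illustrative wrapper, not part of mrjob):

```python
import subprocess

def run_on_hadoop(script, hdfs_input, output_file):
    # the hadoop runner reads input straight from HDFS, so the path
    # must carry an explicit hdfs:// scheme or the job will fail
    if not hdfs_input.startswith("hdfs://"):
        raise ValueError("input must start with hdfs://, got %r" % hdfs_input)
    cmd = ["python", script, "-r", "hadoop", hdfs_input]
    with open(output_file, "w") as out:
        # final results stream to stdout, so redirect them to a file
        return subprocess.call(cmd, stdout=out)

# hypothetical invocation:
# run_on_hadoop("mr1.py", "hdfs:///user/hadoop/input/hadoop.log", "out.log")
```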
I ran the MapReduce job over our monitoring data for March 2nd, from 22:00 to 24:00; the input file was 14 GB.