Author: Wingter Wu
1 Environment
VirtualBox 5.0.24
CentOS-7-x86_64-Minimal-1611: http://isoredirect.centos.org/centos/7/isos/x86_64/CentOS-7-x86_64-Minimal-1611.iso
JDK 1.7u75: from a network drive
Hadoop-2.7.0-bin.tar.gz: http://mirrors.hust.edu.cn/apache/hadoop/common/
Main reference:
http://blog.csdn.net/circyo/article/details/46724335 (CentOS 7 + Hadoop 2.7)
Additional references:
http://blog.csdn.net/stark_summer/article/details/43484545 (config files + official configuration property list)
http://www.tuicool.com/articles/JJziMr
2 Installing the VM
2.1 Configuring the virtual machine
Disk: this guide uses 20 GB, dynamically allocated, VDI format
Memory: 1-2 GB
CPU: 1 core
Network: a bridged adapter is the simplest option.
2.2 Installing CentOS in the VM
Settings -> add a virtual optical drive -> CentOS-7-x86_64-Minimal-1611.iso
Start the VM and run the CentOS 7 installer.
Partitioning:
Standard partitions are used here; LVM has not been tried.
Mount point sizes (20 GB total): /boot = 200 MB, / = 10 GB, swap = 1 GB (or 2x RAM), the rest goes to /home
Network and hostname:
Set the hostname to hadoop.master (using '.' in a hostname is actually not recommended; it may be truncated in the Bash prompt)
Network settings:
IPv4: static (192.168.1.103), with a MAC-IP binding configured in the router's admin page
Netmask: 255.255.255.0
Gateway: 192.168.1.1 (the router's IP)
DNS servers: 222.201.130.30 222.201.130.33 (same as the host machine)
Start the installation... (set the root password while it runs; you can skip creating a hadoop user and deploy Hadoop as root, or create one later)
When the installation finishes, reboot the VM.
2.3 CentOS VM environment configuration
Note: to save work, the approach in this document is to finish all configuration on hadoop.master (i.e., network (adjusted on the slaves after cloning) + JDK + Hadoop), then clone slave1 and slave2 from it, tweak their network settings, and finally set up SSH, which completes the cluster.
Now configure the master node, hadoop.master.
Network configuration:
A minimal CentOS 7.0 install does not ship the ifconfig command; install it with:
# yum search ifconfig
# yum install net-tools.x86_64
Run ifconfig to check that an IP address was obtained, and use ping to confirm Internet access.
To allow SSH, stop the firewall (and SELinux):
# systemctl stop firewalld
Disable SELinux (temporarily):
# getenforce
# setenforce 0
Set the hostname to hadoop.master; the files to modify are:
# vim /etc/sysconfig/network
# vim /etc/hostname
# vim /etc/hosts
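As a rough sketch, on the master the first two files end up containing the following (the hosts file is shown next):
/etc/hostname:
hadoop.master
/etc/sysconfig/network:
NETWORKING=yes
HOSTNAME=hadoop.master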
Configure the local hosts entries; here we assume the two slaves' IPs are already decided (even though they have not been cloned yet):
# vim /etc/hosts
192.168.1.103 hadoop.master
192.168.1.104 hadoop.slave1
192.168.1.105 hadoop.slave2
2.4 Java setup
SSH is usually already enabled on a Linux install.
Use an SSH client on the host machine (e.g., SSH Secure Shell Client) to upload the JDK and Hadoop archives.
Install JDK 1.7 into /usr/java:
# mkdir -p /usr/java
# tar zxvf jdk-7u75-linux-x64.tar.gz -C /usr/java
Create a symlink to simplify the environment variables:
# ln -s /usr/java/jdk1.7.0_75 /usr/java/jdk
Set the Java environment variables. You can edit /etc/profile directly, but a more maintainable way is to create /etc/profile.d/java.sh:
# vim /etc/profile.d/java.sh
Add the following:
export JAVA_HOME=/usr/java/jdk
export JRE_HOME=/usr/java/jdk/jre
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
Apply it:
# source /etc/profile
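A quick sanity check (not part of the original steps) that the JDK is now picked up:
# java -version
java version "1.7.0_75"    (or similar)
# echo $JAVA_HOME
/usr/java/jdk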
2.5 Hadoop setup
Assume the Hadoop tarball uploaded from the host is in ~/
Extract (install) it to /usr/hadoop:
# tar zxvf hadoop-2.7.0.tar.gz -C /usr/
# mv /usr/hadoop-2.7.0 /usr/hadoop
Set the Hadoop-related environment variables:
# vim /etc/profile.d/hadoop.sh
Add:
export HADOOP_HOME=/usr/hadoop
export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${PATH}
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib:${HADOOP_HOME}/lib/native"
Apply it:
# source /etc/profile
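Another quick check (not in the original steps) that the variables took effect:
# hadoop version
Hadoop 2.7.0
... (build details follow)
# echo $HADOOP_HOME
/usr/hadoop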
Edit Hadoop's configuration files:
Set the Java environment variable in hadoop-env.sh, yarn-env.sh, and mapred-env.sh
# cd /usr/hadoop/etc/hadoop/
# vim hadoop-env.sh
# vim yarn-env.sh
# vim mapred-env.sh
// In each file, set JAVA_HOME
export JAVA_HOME=/usr/java/jdk
Configure core-site.xml (this specifies the NameNode):
# vim core-site.xml
// Change the file content to the following
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name> <!-- the reference article used fs.default.name; both properties are documented on the official site -->
<value>hdfs://hadoop.master:9000</value>
</property>
</configuration>
Configure hdfs-site.xml (this specifies where HDFS blocks are stored locally and the replication factor):
# vim hdfs-site.xml
// Change the file content to the following
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///usr/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///usr/hadoop/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>hadoop-cluster1</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop.master:50090</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
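The local directories referenced above do not exist yet; it does no harm to create them in advance (Hadoop would otherwise create them during format/startup):
# mkdir -p /usr/hadoop/tmp /usr/hadoop/dfs/name /usr/hadoop/dfs/data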
Configure mapred-site.xml (this tells MapReduce to run on YARN and sets a few status/query addresses; if the file does not exist yet, copy it from mapred-site.xml.template first):
# vim mapred-site.xml
// Change the file content to the following
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<final>true</final> <!-- probably not required -->
</property>
<property>
<name>mapreduce.jobtracker.http.address</name>
<value>hadoop.master:50030</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop.master:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop.master:19888</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>http://hadoop.master:9001</value>
</property>
</configuration>
Configure yarn-site.xml (this specifies the ResourceManager location and a set of status/query addresses):
# vim yarn-site.xml
// Change the file content to the following
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop.master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop.master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop.master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop.master:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoop.master:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop.master:8088</value>
</property>
</configuration>
Edit the slaves file on the master (these files should not be needed on the two slaves)
# cd /usr/hadoop/etc/hadoop
# vim slaves    (Hadoop 2.x appears to have no masters file)
// Set the file content to the two slaves' hostnames
hadoop.slave1
hadoop.slave2
* At this point everything on hadoop.master is configured.
2.6 Cloning the slaves
First set the "Default Machine Folder" in VirtualBox's global preferences,
then clone hadoop.master.
Clone type: "Full clone"; snapshots: "Everything"; leave "Reinitialize the MAC address of all network cards" unchecked.
(The clone dialog screenshot is omitted here.)
Alternatively, use the VBoxManage command line (not tested):
C:\Program Files\Oracle\VirtualBox>VBoxManage clonehd "D:\Linux\CentOS\h1.vdi" "D:\Linux\CentOS\h2.vdi"
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Clone hard disk created in format 'VDI'. UUID: 74a09b9d-c4d8-4689-9186-87e34e4b5265
2.7 Adjusting the configuration on the slaves
Before starting a cloned slave, regenerate its MAC address in the hypervisor, because by default the clone has the same MAC as the master.
Log in to each of the three VMs and fix the IP address, hostname, and hosts entries so that they can all ping each other.
# vim /etc/hostname
hadoop.slave1
# vim /etc/hosts
Note: the first 127.0.0.1 line must not contain the machine's own hostname (e.g., hadoop.slave1); otherwise the DataNode cannot connect to the NameNode.
From the second line on, the content is the same as on the master; it is just local name resolution.
# vim /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hadoop.slave1
Change the IP address:
# vim /etc/sysconfig/network-scripts/ifcfg-eth0    # the interface name at the end may differ on your system
(Screenshot omitted; it showed an example rather than slave1's actual settings. The actual IP should be 192.168.1.10x, netmask /24 (255.255.255.0), gateway 192.168.1.1.)
Delete the UUID and HWADDR lines (they are regenerated on reboot).
After saving and exiting, delete the file that binds the old MAC address (the OS regenerates the MAC on reboot and rebinds it to the IP);
if it is not deleted, the OS keeps the MAC address inherited from the clone.
The simplest fix is to delete the 70-persistent-net.rules file directly:
# rm -fr /etc/udev/rules.d/70-persistent-net.rules
(I did not find this file on the cloned slaves; the similar file /etc/udev/rules.d/70-persistent-ipoib.rules contained only comments, so I skipped this step.)
# reboot
After the reboot the system generates a new one automatically.
Start hadoop.master and the two slaves and try pinging each other to verify connectivity.
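For example (hostnames assume the /etc/hosts entries configured above):
# ping -c 3 hadoop.slave1
# ping -c 3 hadoop.slave2
# ping -c 3 hadoop.master    (from each slave)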
2.8 Configuring passwordless SSH login among the three VMs
// Roughly how key-based authentication works: for the master to log in to a slave without a password, it generates a key pair and sends the public key to the slave, which appends the contents of id_rsa.pub to its list of trusted keys; when the master opens an SSH connection, the slave issues a challenge that the master answers using its private key, the slave verifies the answer with the master's public key, and if it checks out the password prompt is skipped.
The following is done on hadoop.master.
Generate an SSH key pair:
# ssh-keygen
// Two files are generated in the default /root/.ssh/ directory (root is used as the Hadoop account here)
Append id_rsa.pub to the authorized keys:
# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Fix the permissions on "authorized_keys":
# chmod 600 ~/.ssh/authorized_keys
Adjust the SSH daemon configuration:
# vim /etc/ssh/sshd_config
// Set the following three options (this can be deferred)
RSAAuthentication yes # enable RSA authentication
PubkeyAuthentication yes # enable public/private key authentication
AuthorizedKeysFile .ssh/authorized_keys # public key file path (same file as generated above)
Restart the SSH service:
# service sshd restart
Copy the public key to every slave:
# format: scp ~/.ssh/id_rsa.pub remote_user@remote_ip:~/
# scp ~/.ssh/id_rsa.pub root@192.168.1.104:~/
# scp ~/.ssh/id_rsa.pub root@192.168.1.105:~/
The following is done on each slave.
Create the .ssh directory on the slave:
# mkdir ~/.ssh
// fix permissions
# chmod 700 ~/.ssh
Append the key to "authorized_keys":
# cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
Fix permissions:
# chmod 600 ~/.ssh/authorized_keys
Remove the now-unneeded .pub file:
# rm -f ~/id_rsa.pub
Test from the master:
# ssh 192.168.1.104
# ssh 192.168.1.105
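Logging in by hostname should also work now; the first connection asks you to accept the host key, after which no password should be requested:
# ssh hadoop.slave1
# ssh hadoop.slave2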
2.9 关闭各节点的防火墙
#serviceiptables stop# 但centos7默认防火墙不是iptables
#servicefirewalld stop # 暂时关闭
#systemctl disable firewalld.service # 禁止firewall开机启动
可以把SELinux也关闭了
#sestatus # 查看状态
#setenforce 0 # 关闭
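setenforce only lasts until the next reboot; to make the change permanent (an extra step not in the original notes), edit /etc/selinux/config:
# vim /etc/selinux/config
SELINUX=permissive    (or disabled)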
2.10 Testing Hadoop
Log in to hadoop.master and test Hadoop.
Format the HDFS filesystem:
# run the following on the master
# hadoop namenode -format    (or the newer form: hdfs namenode -format)
Start HDFS and YARN:
# cd /usr/hadoop/sbin
# ./start-all.sh
// The recommended way is to start them separately:
# cd /usr/hadoop/sbin
# ./start-dfs.sh
# ./start-yarn.sh
Expected output along these lines (these hostnames come from the reference article; with the configuration above you would see hadoop.master / hadoop.slave1 / hadoop.slave2):
Starting namenodes on [Master.Hadoop]
Master.Hadoop: starting namenode, logging to /usr/hadoop/logs/hadoop-root-namenode-localhost.localdomain.out
Slave2.Hadoop: starting datanode, logging to /usr/hadoop/logs/hadoop-root-datanode-Slave2.Hadoop.out
Slave1.Hadoop: starting datanode, logging to /usr/hadoop/logs/hadoop-root-datanode-Slave1.Hadoop.out
starting yarn daemons
starting resourcemanager, logging to /usr/hadoop/logs/yarn-root-resourcemanager-localhost.localdomain.out
Slave1.Hadoop: starting nodemanager, logging to /usr/hadoop/logs/yarn-root-nodemanager-Slave1.Hadoop.out
Slave2.Hadoop: starting nodemanager, logging to /usr/hadoop/logs/yarn-root-nodemanager-Slave2.Hadoop.out
Check which processes started with jps:
# run directly on the master or a slave:
# jps
# expected output (the leading numbers are process IDs and will differ)
# Master:
3930 ResourceManager
4506 Jps
3693 NameNode
# Slave:
2792 NodeManager
2920 Jps
2701 DataNode
Check the cluster status with the dfsadmin command (you may need to wait briefly for HDFS to leave safe mode):
// run:
# hadoop dfsadmin -report
// expected output:
Configured Capacity: 14382268416 (13.39 GB)
Present Capacity: 10538565632 (9.81 GB)
DFS Remaining: 10538557440 (9.81 GB)
DFS Used: 8192 (8 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (2):
Name: 192.168.1.104:50010 (hadoop.slave1)
Hostname: hadoop.slave1
Decommission Status : Normal
Configured Capacity: 7191134208 (6.70 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 1921933312 (1.79 GB)
DFS Remaining: 5269196800 (4.91 GB)
DFS Used%: 0.00%
DFS Remaining%: 73.27%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Jul 02 10:45:04 CST 2017
Name: 192.168.1.105:50010 (hadoop.slave2)
Hostname: hadoop.slave2
Decommission Status : Normal
Configured Capacity: 7191134208 (6.70 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 1921769472 (1.79 GB)
DFS Remaining: 5269360640 (4.91 GB)
DFS Used%: 0.00%
DFS Remaining%: 73.28%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Stop the YARN and HDFS daemons:
# stop-yarn.sh
# stop-dfs.sh
3 Testing an MR program
Now write a MyWordCount job and run it on the cluster.
It follows the standard WordCount implementation, which ships inside the Hadoop distribution; the source .java lives in:
hadoop-2.7.0.tar.gz\hadoop-2.7.0\share\hadoop\mapreduce\sources\hadoop-mapreduce-examples-2.7.0-sources.jar
After unpacking that jar, WordCount.java can be found under org\apache\hadoop\examples.
Following the official code, I simplified the package structure and renamed the main class to MyWordCount; the code is as follows:
/*
 * MyWordCount.java
 * This is a version of wordcount coded
 * by Wingter
 * on 06/20/2017,
 * basically identical to the official implementation of wordcount
 * provided in hadoop-mapreduce-examples-2.7.0-sources.jar
 * as org.apache.hadoop.examples.WordCount.java.
 * This class of mine is named MyWordCount and used to test Hadoop 2.7.0.
 */
package hadoopTests;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class MyWordCount {

    // Mapper: tokenize each input line and emit <word, 1>
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            // otherArgs begin from <input path> and end with <output path>
            System.err.println("Usage: MyWordCount <in> [<in>..] <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "wwt word count");
        job.setJarByClass(MyWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        for (int i = 0; i < otherArgs.length - 1; i++) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
MyWordCount.java is saved under ~/programs/hadoopTests/. After saving it (:wq), it has to be compiled with javac on the command line, and a MapReduce program naturally depends on several Hadoop jars.
The classic compile command for old Hadoop versions was:
# javac -classpath $HADOOP_HOME/hadoop-core-1.0.1.jar -d MyWordCount/ MyWordCount.java    // old programs only depended on the single core jar
In 2.x, however, several jars are required.
According to <http://blog.csdn.net/wang_zhenwei/article/details/47439623>, version 2.4.1 for example needs:
$HADOOP_HOME/share/hadoop/common/hadoop-common-2.4.1.jar
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.4.1.jar
$HADOOP_HOME/share/hadoop/common/lib/commons-cli-1.2.jar
So one option is to list them all after -classpath.
A smarter option is to put them into the Java CLASSPATH environment variable once, so that a plain javac works.
The smartest way to pick up all MapReduce dependencies is to add this to /etc/profile.d/hadoop.sh:
# export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath):$CLASSPATH
# because the command `hadoop classpath` prints every dependency path Hadoop uses.
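For reference, `hadoop classpath` prints a single colon-separated line that looks roughly like this (exact paths depend on the install):
/usr/hadoop/etc/hadoop:/usr/hadoop/share/hadoop/common/lib/*:/usr/hadoop/share/hadoop/common/*:/usr/hadoop/share/hadoop/hdfs:/usr/hadoop/share/hadoop/hdfs/lib/*:...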
So now the compile command is simply:
# javac -d myWordCount/ MyWordCount.java
# -d sends the output to the myWordCount/ directory, which will be packed into a jar next;
# with -d, javac automatically creates the hadoopTests/ subdirectory (because the code declares a package); without -d all .class files would be scattered in the current directory, which is inconvenient for jar packaging.
Package it into a jar:
# jar -cvf myWordCount.jar -C myWordCount/ .
# Do not forget the trailing dot. -C temporarily changes into myWordCount/ to pick up the .class files without adding an extra top-level directory (i.e., the jar's top level is not myWordCount/); the final . is the path, relative to that directory, of the files to add.
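To double-check the package structure, list the jar contents; it should show something like:
# jar tf myWordCount.jar
META-INF/
META-INF/MANIFEST.MF
hadoopTests/
hadoopTests/MyWordCount.class
hadoopTests/MyWordCount$TokenizerMapper.class
hadoopTests/MyWordCount$IntSumReducer.class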
Prepare some data and upload it to HDFS:
# hadoop dfs -mkdir /testdata    # HDFS is organized like a Linux filesystem, rooted at /
# hadoop dfs -copyFromLocal ~/testdata/wikepedia.txt /testdata/
Run the MyWordCount job:
# hadoop jar ~/programs/myWordCount.jar hadoopTests.MyWordCount /testdata /output
# hadoopTests.MyWordCount is the main class name (include the package); /output is the HDFS output directory and must not already exist.
When it finishes, check the output and the job status:
# hadoop dfs -cat /output/part-r-00000
View the log on the ResourceManager:
# less $HADOOP_HOME/logs/yarn-root-resourcemanager-hadoop.master.log
# inside less, use the arrow keys to scroll and G to jump to the end
Check the job status in the web UI: 192.168.1.103:8088
Browse the output file in the web UI: 192.168.1.103:50070 -> Utilities -> Browse the file system
4 Testing Hadoop Streaming
How it works:
The streaming tool creates a MapReduce job, launches a process on each node that runs the user-supplied script or program, and monitors the job as it runs. Hadoop splits the input files into lines and feeds them to the mapper's or reducer's standard input; the mapper and reducer read that data from stdin, process it line by line, and write lines to stdout, which Hadoop then turns back into key/value pairs.
Take the mapper as an example: if a file (an executable or a script) is used as the mapper, each mapper task starts it as a separate process when the task initializes. While the task runs, it splits its input into lines and feeds each line to the process's standard input; at the same time it collects the process's standard output and converts every output line into a key/value pair, which becomes the mapper's output. By default, everything before the first tab in a line is the key and everything after it (excluding the tab) is the value; if the line contains no tab, the whole line is the key and the value is null.
The reducer works analogously.
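A tiny illustration of that key/value convention (hypothetical data):
mapper stdout line:  apple<TAB>1        ->  key = "apple", value = "1"
mapper stdout line:  just some text     ->  key = "just some text", value = null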
The general pattern for using Hadoop Streaming:
// in Hadoop 2.7 the streaming jar is at: $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input myInputDirs \
-output myOutputDir \
-mapper cat \
-reducer wc
4.1 Wordcount with shell scripts as mapper and reducer:
The wc program that ships with Linux only counts lines, words, and bytes, so to get wordcount semantics we write two shell scripts, mapper.sh and reducer.sh:
//mapper.sh
#! /bin/bash
# emit "<word> 1" for every word on every input line
while read LINE; do
    for word in $LINE
    do
        echo "$word 1"
    done
done

//reducer.sh
#! /bin/bash
# input arrives sorted by key, so equal words are adjacent;
# keep a running count and flush it whenever the word changes
count=0
started=0
word=""
while read LINE; do
    newword=`echo $LINE | cut -d ' ' -f 1`
    if [ "$word" != "$newword" ]; then
        [ $started -ne 0 ] && echo -e "$word\t$count"
        word=$newword
        count=1
        started=1
    else
        count=$(( $count + 1 ))
    fi
done
echo -e "$word\t$count"
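One practical note (not in the original steps): make both scripts executable before submitting, or invoke them explicitly via an interpreter in -mapper/-reducer, since streaming runs them directly on the task nodes:
# chmod +x ~/programs/mapper.sh ~/programs/reducer.sh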
Run the bash version under Hadoop Streaming:
* Pass the local script files with the -file option; Hadoop then ships both files to every node automatically:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar \
-input /myInputDirs \
-output /myOutputDir \
-mapper mapper.sh \
-reducer reducer.sh \
-file ~/programs/mapper.sh \
-file ~/programs/reducer.sh
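When the job finishes, the result can be inspected in the output directory (streaming jobs write part-00000-style files):
# hadoop dfs -ls /myOutputDir
# hadoop dfs -cat /myOutputDir/part-00000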
4.2 Wordcount with C programs as mapper and reducer:
//mapper.c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define BUF_SIZE 2048
#define DELIM "\n"

int main(int argc, char *argv[])
{
    char buffer[BUF_SIZE];
    while (fgets(buffer, BUF_SIZE - 1, stdin))
    {
        int len = strlen(buffer);
        if (buffer[len-1] == '\n')
            buffer[len-1] = 0;
        char *token = NULL;
        // use strtok() to tokenize a string
        token = strtok(buffer, " ");
        while (token)
        {
            printf("%s\t1\n", token);
            // use NULL as the param from the second time
            // to get the next token
            token = strtok(NULL, " ");
        }
    }
    return 0;
}
//reducer.c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define BUFFER_SIZE 1024
#define DELIM "\t"

int main(int argc, char *argv[])
{
    char strLastKey[BUFFER_SIZE];
    char strLine[BUFFER_SIZE];
    int count = 0;

    *strLastKey = '\0';
    *strLine = '\0';

    // get a line from the input
    // with max length = BUFFER_SIZE - 1
    while (fgets(strLine, BUFFER_SIZE - 1, stdin))
    {
        char *strCurrKey = NULL;
        char *strCurrNum = NULL;
        strCurrKey = strtok(strLine, DELIM);
        strCurrNum = strtok(NULL, DELIM); /* necessary to check error but.... */
        if (strLastKey[0] == '\0')
        {
            strcpy(strLastKey, strCurrKey);
        }
        if (strcmp(strCurrKey, strLastKey))
        {
            printf("%s\t%d\n", strLastKey, count);
            count = atoi(strCurrNum);
            strcpy(strLastKey, strCurrKey);
        } else {
            count += atoi(strCurrNum);
        }
    } // end while
    printf("%s\t%d\n", strLastKey, count); /* flush the count */
    return 0;
}
Run the C version under Hadoop Streaming:
First install gcc on every node:
# yum install gcc
Compile mapper.c and reducer.c locally:
# gcc mapper.c -o mapper.out
# gcc reducer.c -o reducer.out
Submit via hadoop-streaming ($STREAMING_JAR below stands for the streaming jar path given at the start of section 4):
$HADOOP_HOME/bin/hadoop jar $STREAMING_JAR \
-input /myInputDirs \
-output /C_output \
-mapper mapper.out \
-reducer reducer.out \
-file ~/programs/mapper.out \
-file ~/programs/reducer.out
4.3 Wordcount with Python scripts as mapper and reducer:
mapper.py:
#!/usr/bin/env python
# mapper.py
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words while removing any empty strings
    words = line.split()
    # increase counters
    for word in words:
        print '%s\t%s' % (word, 1)
reducer.py:
#!/usr/bin/env python
# reducer.py
from operator import itemgetter
import sys

# use a dict to store <word, count>
word2count = {}

# input comes from sys.stdin
for line in sys.stdin:
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split()
    # convert count (currently a string) to int
    try:
        count = int(count)
        # add to its running count; a new word starts from 0
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the dict lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s' % (word, count)
Run the Python version under Hadoop Streaming:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar \
-input /myInputDirs \
-output /myOutputDir \
-mapper mapper.py \
-reducer reducer.py \
-file ~/programs/mapper.py \
-file ~/programs/reducer.py
Run complete!
5 Simple debugging for Hadoop Streaming
Debugging a Hadoop Streaming program is easy: the whole flow can be simulated on a single machine with pipes.
* Remember to insert a sort (the Linux sort command) between the mapper output and the reducer input to simulate Hadoop's sorting of intermediate results.
# python debug
# cat input.txt | python mapper.py | sort | python reducer.py
# C program debug
#cat input.txt | ./mapper.out | sort | ./reducer.out
# shell debug
# cat input.txt | /bin/bash mapper.sh | sort | /bin/bash reducer.sh