Deployment environment:
OS: Red Hat Enterprise Linux Client release 5.5
JDK: 1.6.0_23
Hadoop: hadoop-0.20.2
VMware: 8.0
Network layout:
192.168.2.86 namenode, secondary namenode
192.168.4.123 datanode
192.168.4.124 datanode
Configuration and installation steps:
1. Network setup
(1) Turn off the firewall on every host: service iptables stop (a quick check follows the host list below)
(2) Append the following to /etc/hosts on every host:
192.168.2.86 master
192.168.4.123 slave1
192.168.4.124 slave2
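A minimal sanity check, assuming /etc/hosts has been updated on all three machines; the chkconfig line is an addition here so the firewall stays off after a reboot:
service iptables status    # should report that the firewall is stopped
chkconfig iptables off     # keep it disabled across reboots
ping -c 1 slave1           # the hostname should now resolve via /etc/hosts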
2. Set up SSH trust between the hosts
(1) Edit the SSH daemon configuration with vi /etc/ssh/sshd_config. The full file after editing is shown below; restart sshd afterwards (see the command after the listing).
# $OpenBSD: sshd_config,v 1.73 2005/12/06 22:38:28 reyk Exp $
# This is the sshd server system-wide configuration file. See
# sshd_config(5) for more information.
# This sshd was compiled with PATH=/usr/local/bin:/bin:/usr/bin
# The strategy used for options in the default sshd_config shipped with
# OpenSSH is to specify options with their default value where
# possible, but leave them commented. Uncommented options change a
# default value.
#Port 22
#Protocol 2,1
Protocol 2
#AddressFamily any
#ListenAddress 0.0.0.0
#ListenAddress ::
# HostKey for protocol version 1
#HostKey /etc/ssh/ssh_host_key
# HostKeys for protocol version 2
#HostKey /etc/ssh/ssh_host_rsa_key
#HostKey /etc/ssh/ssh_host_dsa_key
# Lifetime and size of ephemeral version 1 server key
#KeyRegenerationInterval 1h
#ServerKeyBits 768
# Logging
# obsoletes QuietMode and FascistLogging
#SyslogFacility AUTH
SyslogFacility AUTHPRIV
#LogLevel INFO
# Authentication:
#LoginGraceTime 2m
#PermitRootLogin yes
#StrictModes yes
#MaxAuthTries 6
#RSAAuthentication yes
#PubkeyAuthentication yes
#AuthorizedKeysFile .ssh/authorized_keys
# For this to work you will also need host keys in /etc/ssh/ssh_known_hosts
#RhostsRSAAuthentication no
# similar for protocol version 2
#HostbasedAuthentication no
# Change to yes if you don't trust ~/.ssh/known_hosts for
# RhostsRSAAuthentication and HostbasedAuthentication
#IgnoreUserKnownHosts no
# Don't read the user's ~/.rhosts and ~/.shosts files
#IgnoreRhosts yes
# To disable tunneled clear text passwords, change to no here!
#PasswordAuthentication yes
#PermitEmptyPasswords no
PasswordAuthentication no
AuthorizedKeysFile .ssh/authorized_keys
# Change to no to disable s/key passwords
#ChallengeResponseAuthentication yes
ChallengeResponseAuthentication no
# Kerberos options
#KerberosAuthentication no
#KerberosOrLocalPasswd yes
#KerberosTicketCleanup yes
#KerberosGetAFSToken no
# GSSAPI options
#GSSAPIAuthentication no
GSSAPIAuthentication yes
#GSSAPICleanupCredentials yes
GSSAPICleanupCredentials yes
# Set this to 'yes' to enable PAM authentication, account processing,
# and session processing. If this is enabled, PAM authentication will
# be allowed through the ChallengeResponseAuthentication mechanism.
# Depending on your PAM configuration, this may bypass the setting of
# PasswordAuthentication, PermitEmptyPasswords, and
# "PermitRootLogin without-password". If you just want the PAM account and
# session checks to run without PAM authentication, then enable this but set
# ChallengeResponseAuthentication=no
#UsePAM no
UsePAM yes
# Accept locale-related environment variables
AcceptEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
AcceptEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
AcceptEnv LC_IDENTIFICATION LC_ALL
#AllowTcpForwarding yes
#GatewayPorts no
#X11Forwarding no
X11Forwarding yes
#X11DisplayOffset 10
#X11UseLocalhost yes
#PrintMotd yes
#PrintLastLog yes
#TCPKeepAlive yes
#UseLogin no
#UsePrivilegeSeparation yes
#PermitUserEnvironment no
#Compression delayed
#ClientAliveInterval 0
#ClientAliveCountMax 3
#ShowPatchLevel no
#UseDNS yes
#PidFile /var/run/sshd.pid
#MaxStartups 10
#PermitTunnel no
#ChrootDirectory none
# no default banner path
#Banner /some/path
# override default of no subsystems
Subsystem sftp /usr/libexec/openssh/sftp-server
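After saving the file, restart sshd on every host so the changes take effect (the standard RHEL 5 service command):
service sshd restart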
(2) Generate the public/private key pair
Generate the keys under the user's home directory (note: ssh-keygen is one word):
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Copy the authorized_keys file to the two slave hosts (run from ~/.ssh; see the permissions note below):
scp authorized_keys slave1:~/.ssh/
scp authorized_keys slave2:~/.ssh/
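sshd rejects keys whose files are too loosely protected (the StrictModes default shown in the listing above). If key login still prompts for a password, tighten the permissions on each host; this step is an assumption added here, not part of the original write-up:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys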
(3) Start the ssh-agent service
ssh-agent bash --login -i
ssh-add    # add the private key to the agent
[About ssh-agent]
ssh-agent is an agent process that holds private keys. Keys are added to it with ssh-add, and every client of the agent can then share them.
Benefit 1: no repeated passphrase entry.
When a passphrase-protected private key is added with ssh-add, the passphrase is asked for once; after that ssh-agent uses the key directly, with no further prompting.
Benefit 2: no need to deploy the private key everywhere.
Suppose one key can log in to hosts A and B on the same internal network, but for some reason B cannot be reached directly. You can deploy the key on A or set up port forwarding to reach B, or you can forward the authentication agent connection and log in to B from A with the key held by your local ssh-agent, as the example below shows.
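A minimal sketch of agent forwarding using this cluster's hostnames (-A forwards the agent connection; it assumes the key was added with ssh-add on master):
ssh -A slave1    # from master, carrying the agent connection along
ssh slave2       # from slave1, authenticated through the forwarded agent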
[root@master ~]# ssh 192.168.4.123
Last login: Mon Jul 23 23:04:54 2012 from 192.168.2.86
[root@localhost ~]#
Hadoop environment variables and configuration files
1. Environment variables
The environment variables can be written into ~/.bash_profile, then reloaded and verified as shown after the list:
export JAVA_HOME=/usr/java/jdk1.6.0_23
export HADOOP_HOME=/root/bin/hadoop-0.20.2
export HADOOP_CLASSPATH=/user/root/build/classes
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
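Reloading and checking the profile (hadoop version works once $HADOOP_HOME/bin is on the PATH):
source ~/.bash_profile
echo $HADOOP_HOME
hadoop version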
2. Configuration files
(1) hadoop-env.sh, where two settings matter:
# The java implementation to use. Required.
export JAVA_HOME=/usr/java/jdk1.6.0_23
# Extra Java CLASSPATH elements. Optional.
export HADOOP_CLASSPATH=/user/root/build/classes
(2) core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-root</value>
</property>
</configuration>
Note 1: every HDFS storage location is derived from hadoop.tmp.dir. The namenode keeps its namespace under ${hadoop.tmp.dir}/dfs/name and the datanodes keep their blocks under ${hadoop.tmp.dir}/dfs/data, so once hadoop.tmp.dir is set, the other important directories all live beneath it; it acts as the root directory.
Note 2: fs.default.name names the host the namenode runs on; the port is 9000.
Note 3: core-site.xml has a counterpart core-default.xml, hdfs-site.xml has hdfs-default.xml, and mapred-site.xml has mapred-default.xml. The default files carry the default settings; the point of editing the three site files is to override them.
(3) hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
dfs.replication sets the number of replicas kept for each data block. The default is 3; with fewer than 3 slave nodes, set it to 1 or 2 accordingly.
(4) mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
</property>
</configuration>
mapred.job.tracker names the host the jobtracker runs on; the port is 9001. Note the value is host:port, not an HTTP URL.
(5) masters and slaves
The conf/masters and conf/slaves files take one hostname per line; the IPs resolve through the /etc/hosts entries above. Here conf/masters contains:
master
and conf/slaves contains:
slave1
slave2
Copy the configured hadoop-0.20.2 directory to the other two hosts and apply the same settings there. The cluster configuration is now complete; running hadoop with no arguments lists the available commands. Format HDFS and start the daemons as shown below.
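The original steps stop before start-up; a minimal sketch using the standard 0.20.2 scripts, run on master:
hadoop namenode -format    # format HDFS (first run only)
start-all.sh               # start namenode/jobtracker here and the datanode/tasktracker daemons on the slaves
jps                        # run on each host to confirm the expected daemons are up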
Map/Reduce program
1. Map
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper extends
        Mapper<LongWritable, Text, Text, IntWritable> {

    // NCDC readings use 9999 for a missing temperature
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Fixed-width NCDC record: year at columns 15-19,
        // signed temperature at 87-92, quality code at 92
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') {
            // Integer.parseInt on Java 6 rejects a leading '+'
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature < MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}
2. Reduce
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Keep the maximum temperature observed for this year
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}
3. Client
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job and block until it finishes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
4. Run
Package the classes above into a jar (test.jar here) and place it under the $HADOOP_CLASSPATH path.
Then run: hadoop jar test.jar com.tx.weather.MaxTemperature input/sample.txt output
The full sequence, including staging the input file in HDFS, is sketched below.
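The input has to be in HDFS before the job runs. A sketch, assuming sample.txt sits in the current local directory; part-r-00000 is the default output filename for the new mapreduce API:
hadoop fs -mkdir input
hadoop fs -put sample.txt input/sample.txt
hadoop jar test.jar com.tx.weather.MaxTemperature input/sample.txt output
hadoop fs -cat output/part-r-00000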