My Hadoop: Hadoop 0.23 setup

1 Download 

Choose a mirror: http://www.apache.org/dyn/closer.cgi/hadoop/core/

For 0.23, download hadoop-0.23.0.tar.gz (I used the renren mirror).

1.1 untar 

tar zxfv hadoop-0.23.0.tar.gz
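
The rest of the commands in this guide use relative paths like bin/hadoop, so cd into the extracted directory first:

cd hadoop-0.23.0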

2 Run first hadoop program (locally)

2.1 compute pi

bin/hadoop jar hadoop-mapreduce-examples-0.23.0.jar pi -Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory -libjars modules/hadoop-mapreduce-client-jobclient-0.23.0.jar 16 10000


Job Finished in 6.014 seconds
Estimated value of Pi is 3.14127500000000000000

2.2 word count

bin/hadoop jar hadoop-mapreduce-examples-0.23.0.jar wordcount -Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory -libjars modules/hadoop-mapreduce-client-jobclient-0.23.0.jar LICENSE.txt output

The result is in the output dir.


Congratulations, you have just run your first MapReduce program.

Hadoop, of course, is built for parallel/distributed computing, so next let's set it up one node at a time.

3 Setup the first node (master)

3.1 SSH

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys


id_dsa.pub is the public key of localhost.

authorized_keys holds all the public keys trusted on the current host.

Append the localhost public key to authorized_keys, and you can ssh to localhost without a passphrase.


Similarly, you can append id_dsa.pub to another host's authorized_keys file; then you can ssh to that host without a passphrase, as sketched below.
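
A minimal sketch of copying the key over (assuming the other host, e.g. 172.16.100.130 from the cluster section, still allows password logins):

# Append the local public key to the remote authorized_keys
cat ~/.ssh/id_dsa.pub | ssh 172.16.100.130 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'

# Verify the passphraseless login works
ssh 172.16.100.130 hostname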

3.2 Config HDFS

etc/hadoop/core-site.xml (defaults are listed in core-default.xml)

<configuration>
     <property>
         <name>fs.defaultFS</name>
         <value>hdfs://172.16.100.122:9000</value>
     </property>
</configuration>

etc/hadoop/hdfs-site.xml (defaults are listed in hdfs-default.xml)

<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
     <property>
         <name>dfs.namenode.name.dir</name>
         <value>file:/home/tntuser/hadoop-0.23.0/data/hdfs/namenode</value>
     </property>
     <property>
         <name>dfs.datanode.data.dir</name>
         <value>file:/home/tntuser/hadoop-0.23.0/data/hdfs/datanode</value>
     </property>
</configuration>

A full URI is required for the name dir and data dir, e.g. file:/home/tntuser/hadoop-0.23.0/data/hdfs/namenode rather than a bare path.

3.3 Format HDFS

mkdir -p data/hdfs/namenode
mkdir -p data/hdfs/datanode
bin/hdfs namenode -format

3.4 Start HDFS

sbin/hadoop-daemon.sh start|stop namenode
sbin/hadoop-daemon.sh start|stop datanode


Check

jps should show NameNode and DataNode.
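
One quick way to check, since jps ships with the JDK:

# Both daemons should appear in the jps listing
jps | grep -E 'NameNode|DataNode'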


Run a few HDFS commands:

bin/hadoop fs -ls

bin/hadoop fs -mkdir test

bin/hadoop fs -rm -r test
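
As a fuller smoke test, you can round-trip a file through HDFS (a sketch using the LICENSE.txt that ships in the tarball):

# Upload a local file, read it back, then clean up
bin/hadoop fs -mkdir test
bin/hadoop fs -put LICENSE.txt test/
bin/hadoop fs -cat test/LICENSE.txt | head
bin/hadoop fs -rm -r test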

3.5 Config MapReduce

etc/hadoop/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
     <property>
         <name>mapreduce.framework.name</name>
         <value>yarn</value>
     </property>
</configuration>

conf/yarn-site.xml

<?xml version="1.0"?>
<configuration>
     <property>
         <name>yarn.nodemanager.aux-services</name>
         <value>mapreduce.shuffle</value>
     </property>
     <property>
         <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
         <value>org.apache.hadoop.mapred.ShuffleHandler</value>
     </property>
     <property>
         <name>yarn.resourcemanager.resource-tracker.address</name>
         <value>172.16.100.122:8025</value>
     </property>
     <property>
         <name>yarn.resourcemanager.scheduler.address</name>
         <value>172.16.100.122:8030</value>
     </property>
     <property>
         <name>yarn.resourcemanager.address</name>
         <value>172.16.100.122:8040</value>
     </property>
</configuration>

conf/yarn-env.sh

export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-$YARN_HOME/etc/hadoop}"
export HADOOP_COMMON_HOME="${HADOOP_COMMON_HOME:-$YARN_HOME}"
export HADOOP_HDFS_HOME="${HADOOP_HDFS_HOME:-$YARN_HOME}"


The conf directory that comes with Hadoop is no longer the default configuration directory. Rather, Hadoop looks in etc/hadoop for configuration files.
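
If a daemon still picks up the wrong directory, one option (a sketch, using the install path from the configs above) is to export the location explicitly before starting anything:

# Point the hadoop/yarn scripts at the etc/hadoop config directory
export HADOOP_CONF_DIR=/home/tntuser/hadoop-0.23.0/etc/hadoop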

sbin/hadoop-daemon.sh calls hdfs-config.sh, which in turn calls $HADOOP_COMMON_HOME/libexec/hadoop-config.sh.

3.6 Start MapReduce (YARN) Daemon

bin/yarn-daemon.sh start resourcemanager
bin/yarn-daemon.sh start nodemanager
bin/yarn-daemon.sh start historyserver

The NodeManager may fail to start because port 8080 is already in use (e.g., by Tomcat). In that case, change the shuffle port:

conf/yarn-site.xml

<property>
  <name>mapreduce.shuffle.port</name>
  <value>8090</value>
</property>
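
To confirm the port conflict before changing anything, a quick sketch (netstat flags vary by distribution; lsof -i :8080 works too):

# Show which process is listening on port 8080
netstat -tlnp | grep ':8080'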

4 Run the Hadoop program on a single node

MapReduce JobHistory Server: http://jhs_host:port/ (default HTTP port 19888). In the job details you can see which node executed each task.

NameNode: http://nn_host:port/ (default HTTP port 50070); browse HDFS and the HDFS nodes.

ResourceManager: http://rm_host:port/ (default HTTP port 8088); browse the MapReduce nodes.
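
A scripted check of the three web UIs, as a sketch (assuming the daemons run on this host and use the default ports above):

# A 200 or a redirect code means the daemon is answering
for port in 50070 8088 19888; do
  echo -n "port $port: "
  curl -s -o /dev/null -w '%{http_code}\n' "http://localhost:$port/"
done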

5 Setup the slave node

5.1 untar on the slave

5.2 copy config from master

scp 172.16.100.122:/home/tntuser/hadoop-0.23.0/etc/hadoop/*.xml etc/hadoop

scp 172.16.100.122:/home/tntuser/hadoop-0.23.0/conf/yarn-* conf

5.3 (re)format HDFS on the master

Shut down the daemons on the master first.

bin/hdfs namenode -format -clusterid hadoop_cluster

5.4 add slave hosts

conf/slaves

172.16.100.122
172.16.100.130

The first entry is the master, which also runs slave daemons.

5.5 Start Master Daemons

sbin/hadoop-daemon.sh start|stop namenode
sbin/hadoop-daemon.sh start|stop datanode

bin/yarn-daemon.sh start resourcemanager
bin/yarn-daemon.sh start nodemanager
bin/yarn-daemon.sh start historyserver

5.6 Start Slave Daemons

sbin/hadoop-daemon.sh start|stop datanode

bin/yarn-daemon.sh start nodemanager

6 Run the Hadoop program in the cluster

issue 1: the temp directory already exists


 hdfs://172.16.100.122:9000/user/tntuser/QuasiMonteCarlo_TMP_3_141592654 already exists.  Please remove it first.

bin/hadoop fs -rm -r QuasiMonteCarlo_TMP_3_141592654
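
The pi example builds the temp directory name from its arguments, so if you have several leftovers a glob cleans them all up (a sketch; make sure nothing else matches the pattern):

# Remove all leftover pi-example temp directories
bin/hadoop fs -rm -r 'QuasiMonteCarlo_TMP_*'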


issue 2: FileNotFoundException on the reduce output

java.io.FileNotFoundException: File does not exist: hdfs://172.16.100.122:9000/user/tntuser/QuasiMonteCarlo_TMP_3_141592654/out/reduce-out
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:764)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1614)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1638)
	at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:314)
	at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:351)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
	at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:360)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:4

1) Check the DNS config in /etc/resolv.conf; make sure the nameserver is right.

2) Add the master and slave hostnames to each other's /etc/hosts:

172.16.100.122          dev122
172.16.100.130          dev130

3) Check the Hadoop slaves config file conf/slaves; make sure each hostname or IP is right. You can verify name resolution as sketched below.
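
A sketch of verifying the fixes from each node (getent consults /etc/hosts as well as DNS; run it on both master and slave):

# Each lookup should return the address configured above
getent hosts dev122
getent hosts dev130

# The nodes should also reach each other
ping -c 1 dev130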

