Hadoop has three run modes:
1. Standalone (local, non-distributed) mode
2. Pseudo-distributed mode (separate processes on a single machine simulate the daemons of a distributed cluster)
3. Fully distributed mode
1. Standalone mode
[hadoop@ hadoop_home]$ cd hadoop-0.20.205.0
[hadoop@ hadoop-0.20.205.0]$ mkdir input
[hadoop@ hadoop-0.20.205.0]$ cp conf/*.xml input/
[hadoop@ hadoop-0.20.205.0]$ vim conf/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.6.0_29
[hadoop@ hadoop-0.20.205.0]$ hadoop jar hadoop-examples-0.20.205.0.jar grep input output 'dfs[a-z.]+'
......
[hadoop@ hadoop-0.20.205.0]$ hadoop fs -cat output/part-*
1       dfsadmin
[hadoop@ hadoop-0.20.205.0]$
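In standalone mode no Hadoop daemons run and there is no HDFS: both input and output are ordinary local directories, so plain shell tools can inspect the result directly. A minimal check (paths as in the session above):

cat output/part-*    # prints the same "1  dfsadmin" without going through hadoop fs
ls -l output/        # the part-* files live on the local filesystem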
2. Pseudo-distributed mode
Reference: http://hadoop.apache.org/common/docs/current/single_node_setup.html
I. Install the JDK
1. Check the current Java version:
[root@ ~]# java -version
java version "1.4.2"
gcj (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
Copyright (C) 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2. Download the JDK from http://www.oracle.com/technetwork/java/javase/downloads/jdk-6u29-download-513648.html, choosing the build that matches your machine's architecture; here that is Linux x64 (81.45 MB).
3. As root, run:
chmod +x jdk-6u29-linux-x64.bin
./jdk-6u29-linux-x64.bin
mv ./jdk1.6.0_29 /usr/java/jdk1.6.0_29
4. Edit ~/.bash_profile:
JAVA_HOME=/usr/java/jdk1.6.0_29
JAVA_BIN=/usr/java/jdk1.6.0_29/bin
PATH=$JAVA_HOME/bin:$PATH
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JAVA_HOME JAVA_BIN PATH CLASSPATH
5. Reload ~/.bash_profile:
source ~/.bash_profile
6. Check the Java version again:
[root@ jdk]# java -version
java version "1.6.0_29"
Java(TM) SE Runtime Environment (build 1.6.0_29-b11)
Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode)
The JDK is installed successfully.
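Beyond java -version, it is worth confirming that the shell really picked up the new variables; a minimal check (a suggestion, not part of the original steps):

echo $JAVA_HOME    # should print /usr/java/jdk1.6.0_29
which java         # should resolve to $JAVA_HOME/bin/java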
II. Create the hadoop user
1. useradd hadoop
2. passwd hadoop
3. Log in as the hadoop user: su hadoop
III. Passwordless SSH
1. ssh-keygen -t rsa -P ''
2. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
3. chmod 600 ~/.ssh/authorized_keys
4. ssh localhost
If step 4 logs you in without prompting for a password, passwordless SSH is working.
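If step 4 still prompts for a password, the usual culprit is directory permissions: sshd refuses to use authorized_keys when the home directory or ~/.ssh is writable by group or others. A commonly needed fix (a suggestion beyond the original steps; adjust to your system):

chmod 755 ~
chmod 700 ~/.ssh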
IV. Install Hadoop
1. Download hadoop-0.20.205.0.tar.gz from http://labs.renren.com/apache-mirror//hadoop/common/
2. tar -zxvf hadoop-0.20.205.0.tar.gz
3. Add the following environment variables to ~/.bash_profile, then source the file (a sanity check for this step appears after step 18):
HADOOP_HOME=/home/hadoop/hadoop_home/hadoop-0.20.205.0
export HADOOP_HOME
PATH=$HADOOP_HOME/bin:$PATH
export PATH
4. Edit $HADOOP_HOME/conf/hadoop-env.sh:
# The java implementation to use.  Required.
export JAVA_HOME=/usr/java/jdk1.6.0_29
5. Edit $HADOOP_HOME/conf/hdfs-site.xml (replication is set to 1 because a pseudo-distributed cluster has only a single datanode to hold block replicas):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <final>true</final>
  </property>
</configuration>
6. Edit $HADOOP_HOME/conf/core-site.xml (fs.default.name makes the local HDFS namenode the default filesystem):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <final>true</final>
  </property>
</configuration>
7. Edit $HADOOP_HOME/conf/mapred-site.xml (mapred.job.tracker is the jobtracker address):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    <final>true</final>
  </property>
</configuration>
8. List current processes: ps x
9. Format the namenode: hadoop namenode -format
10. start-all.sh (equivalent to running start-dfs.sh and then start-mapred.sh)
11. ps x
12. jps (a representative listing is shown after step 18)
13. hadoop fs -mkdir yangkai
14. hadoop fs -ls
15. stop-all.sh
16. ps x
17. jps
18. Important paths
Directory tree:
/tmp/hadoop-hadoop
|--dfs                          // HDFS on-disk data
|  |--data                      // datanode
|  |  |--blocksBeingWritten
|  |  |--current
|  |  |  |--subdir0
|  |  |  |--subdir1
|  |  |  |--subdir10
|  |  |  |--.........
|  |  |  |--subdir63
|  |  |  |--subdir7
|  |  |  |--subdir8
|  |  |  |--subdir9
|  |  |--detach
|  |  |--tmp
|  |--name                      // namenode
|  |  |--current
|  |  |--image
|  |  |--previous.checkpoint
|  |--namesecondary
|  |  |--current
|  |  |--image
|--mapred                       // MapReduce
|  |--local
|  |  |--localRunner
|  |  |  |--tmp
|  |  |--taskTracker            // tasktracker
|  |  |--tt_log_tmp
|  |  |--ttprivate
|  |  |--userlogs
|  |--staging
|  |  |--hadoop1501661639
|  |  |  |--.staging
|  |  |--hadoop1997916211
|  |  |  |--.staging
Default storage directory: /tmp/hadoop-hadoop/dfs/name/
Hadoop log path: ${HADOOP_HOME}/logs
Job log path: ${HADOOP_HOME}/logs/userlogs/job_201112141219_0001/attempt_201112141219_0001_m_000010_0 (contains three log files: stderr, stdout, and syslog)
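Two quick sanity checks for the steps above (shown as suggestions; the output is representative of a healthy pseudo-distributed node, and the PIDs are illustrative). hadoop version confirms the PATH change from step 3 took effect; jps after step 10 should list all five Hadoop daemons:

$ hadoop version
Hadoop 0.20.205.0
......
$ jps
2049 NameNode
2155 DataNode
2271 SecondaryNameNode
2371 JobTracker
2487 TaskTracker
2598 Jps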
V. Verify the installation
1. hadoop jar ${HADOOP_HOME}/hadoop-test-0.20.205.0.jar TestDFSIO -write -nrFiles 10 -fileSize 10
[hadoop@ conf]$ hadoop jar ${HADOOP_HOME}/hadoop-test-0.20.205.0.jar TestDFSIO -write -nrFiles 10 -fileSize 10
Warning: $HADOOP_HOME is deprecated.

TestDFSIO.0.0.4
11/12/15 11:19:34 INFO fs.TestDFSIO: nrFiles = 10
11/12/15 11:19:34 INFO fs.TestDFSIO: fileSize (MB) = 10
11/12/15 11:19:34 INFO fs.TestDFSIO: bufferSize = 1000000
11/12/15 11:19:35 INFO fs.TestDFSIO: creating control file: 10 mega bytes, 10 files
11/12/15 11:19:36 INFO fs.TestDFSIO: created control files for: 10 files
11/12/15 11:19:36 INFO mapred.FileInputFormat: Total input paths to process : 10
11/12/15 11:19:36 INFO mapred.JobClient: Running job: job_201112151118_0001
11/12/15 11:19:37 INFO mapred.JobClient: map 0% reduce 0%
11/12/15 11:19:57 INFO mapred.JobClient: map 20% reduce 0%
11/12/15 11:20:03 INFO mapred.JobClient: map 30% reduce 0%
11/12/15 11:20:06 INFO mapred.JobClient: map 40% reduce 0%
11/12/15 11:20:09 INFO mapred.JobClient: map 50% reduce 0%
11/12/15 11:20:15 INFO mapred.JobClient: map 70% reduce 13%
11/12/15 11:20:21 INFO mapred.JobClient: map 70% reduce 16%
11/12/15 11:20:24 INFO mapred.JobClient: map 90% reduce 16%
11/12/15 11:20:27 INFO mapred.JobClient: map 90% reduce 23%
11/12/15 11:20:30 INFO mapred.JobClient: map 100% reduce 23%
11/12/15 11:20:39 INFO mapred.JobClient: map 100% reduce 100%
11/12/15 11:20:44 INFO mapred.JobClient: Job complete: job_201112151118_0001
11/12/15 11:20:44 INFO mapred.JobClient: Counters: 30
11/12/15 11:20:44 INFO mapred.JobClient:   Job Counters
11/12/15 11:20:44 INFO mapred.JobClient:     Launched reduce tasks=1
11/12/15 11:20:44 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=68890
11/12/15 11:20:44 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/15 11:20:44 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/12/15 11:20:44 INFO mapred.JobClient:     Launched map tasks=10
11/12/15 11:20:44 INFO mapred.JobClient:     Data-local map tasks=10
11/12/15 11:20:44 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=39780
11/12/15 11:20:44 INFO mapred.JobClient:   File Input Format Counters
11/12/15 11:20:44 INFO mapred.JobClient:     Bytes Read=1120
11/12/15 11:20:44 INFO mapred.JobClient:   File Output Format Counters
11/12/15 11:20:44 INFO mapred.JobClient:     Bytes Written=76
11/12/15 11:20:44 INFO mapred.JobClient:   FileSystemCounters
11/12/15 11:20:44 INFO mapred.JobClient:     FILE_BYTES_READ=833
11/12/15 11:20:44 INFO mapred.JobClient:     HDFS_BYTES_READ=2360
11/12/15 11:20:44 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=237551
11/12/15 11:20:44 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=104857676
11/12/15 11:20:44 INFO mapred.JobClient:   Map-Reduce Framework
11/12/15 11:20:44 INFO mapred.JobClient:     Map output materialized bytes=887
11/12/15 11:20:44 INFO mapred.JobClient:     Map input records=10
11/12/15 11:20:44 INFO mapred.JobClient:     Reduce shuffle bytes=798
11/12/15 11:20:44 INFO mapred.JobClient:     Spilled Records=100
11/12/15 11:20:44 INFO mapred.JobClient:     Map output bytes=727
11/12/15 11:20:44 INFO mapred.JobClient:     Total committed heap usage (bytes)=1929248768
11/12/15 11:20:44 INFO mapred.JobClient:     CPU time spent (ms)=10520
11/12/15 11:20:44 INFO mapred.JobClient:     Map input bytes=260
11/12/15 11:20:44 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1240
11/12/15 11:20:44 INFO mapred.JobClient:     Combine input records=0
11/12/15 11:20:44 INFO mapred.JobClient:     Reduce input records=50
11/12/15 11:20:44 INFO mapred.JobClient:     Reduce input groups=5
11/12/15 11:20:44 INFO mapred.JobClient:     Combine output records=0
11/12/15 11:20:44 INFO mapred.JobClient:     Physical memory (bytes) snapshot=2002497536
11/12/15 11:20:44 INFO mapred.JobClient:     Reduce output records=5
11/12/15 11:20:44 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=5408600064
11/12/15 11:20:44 INFO mapred.JobClient:     Map output records=50
11/12/15 11:20:44 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
11/12/15 11:20:44 INFO fs.TestDFSIO:            Date & time: Thu Dec 15 11:20:44 CST 2011
11/12/15 11:20:44 INFO fs.TestDFSIO:        Number of files: 10
11/12/15 11:20:44 INFO fs.TestDFSIO: Total MBytes processed: 100
11/12/15 11:20:44 INFO fs.TestDFSIO:      Throughput mb/sec: 24.48579823702253
11/12/15 11:20:44 INFO fs.TestDFSIO: Average IO rate mb/sec: 28.83795738220215
11/12/15 11:20:44 INFO fs.TestDFSIO:  IO rate std deviation: 8.542554732984893
11/12/15 11:20:44 INFO fs.TestDFSIO:     Test exec time sec: 68.417
11/12/15 11:20:44 INFO fs.TestDFSIO:
[hadoop@ conf]$
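TestDFSIO leaves its working files in HDFS. The matching read benchmark and the cleanup pass use the same jar (shown as a suggested follow-up, not part of the original session):

hadoop jar ${HADOOP_HOME}/hadoop-test-0.20.205.0.jar TestDFSIO -read -nrFiles 10 -fileSize 10
hadoop jar ${HADOOP_HOME}/hadoop-test-0.20.205.0.jar TestDFSIO -clean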
2. A simple Streaming test
The script:
[hadoop@ hadoopTest1]$ cat run.sh
#!/bin/sh

HADOOP_PATH="/user/hadoop/yangkai"

hadoop fs -test -d ${HADOOP_PATH}
if [ 0 -ne $? ]
then
    echo "${HADOOP_PATH} doesn't exist! We need to create it!"
    hadoop fs -mkdir ${HADOOP_PATH}
    if [ 0 -ne $? ]
    then
        echo "hadoop fs -mkdir ${HADOOP_PATH} failed!"
        exit 1
    fi
fi

PROGRAM_NAME="hadoopTest"

INPUT_FILES="${HADOOP_PATH}/data.txt"
hadoop fs -test -e ${INPUT_FILES}
if [ 0 -eq $? ]
then
    hadoop fs -rmr ${INPUT_FILES}
    if [ 0 -ne $? ]
    then
        echo "hadoop fs -rmr ${INPUT_FILES} failed!"
        exit 1
    fi
fi

hadoop fs -put data.txt ${INPUT_FILES}
if [ 0 -ne $? ]
then
    echo "hadoop fs -put data.txt ${INPUT_FILES} failed!"
    exit 1
fi

OUTPUT_DIR="${HADOOP_PATH}/test"
hadoop fs -test -d ${OUTPUT_DIR}
if [ 0 -eq $? ]
then
    echo "${OUTPUT_DIR} already exists! We need to remove it!"
    hadoop fs -rmr ${OUTPUT_DIR}
    if [ 0 -ne $? ]
    then
        echo "hadoop fs -rmr ${OUTPUT_DIR} failed!"
        exit 1
    fi
fi

hadoop jar ${HADOOP_HOME}/contrib/streaming/hadoop-streaming-0.20.205.0.jar \
    -D mapred.job.name="${PROGRAM_NAME}" \
    -input ${INPUT_FILES} \
    -output ${OUTPUT_DIR} \
    -mapper "mapper.sh" \
    -reducer "reducer.sh" \
    -file "mapper.sh" \
    -file "reducer.sh"

if [ 0 -eq $? ]
then
    echo "success!"
else
    echo "failed!"
fi

[hadoop@ hadoopTest1]$ cat mapper.sh
#!/bin/sh

awk '
BEGIN{}
{
    print $0;
}
END{}
'
[hadoop@ hadoopTest1]$ cat reducer.sh
#!/bin/sh

awk '
BEGIN{}
{
    print $0;
}
END{}
'
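Both mapper.sh and reducer.sh are identity filters: the awk program prints every input line unchanged. The job's only visible effect is therefore MapReduce's shuffle, which sorts records by key, which is why the result file below is simply the sorted input. An equivalent, even simpler mapper/reducer pair would be:

#!/bin/sh
cat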
Run output:
[hadoop@ hadoopTest1]$ sh run.sh
Warning: $HADOOP_HOME is deprecated.

Warning: $HADOOP_HOME is deprecated.

Warning: $HADOOP_HOME is deprecated.

Deleted hdfs://localhost:9002/user/hadoop/yangkai/data.txt
Warning: $HADOOP_HOME is deprecated.

Warning: $HADOOP_HOME is deprecated.

test: File does not exist: /user/hadoop/yangkai/test
Warning: $HADOOP_HOME is deprecated.

packageJobJar: [mapper.sh, reducer.sh] [/home/hadoop/hadoop_home/hadoop-0.20.205.0/contrib/streaming/hadoop-streaming-0.20.205.0.jar] /tmp/streamjob2707334478803608361.jar tmpDir=null
11/12/15 11:37:00 INFO mapred.FileInputFormat: Total input paths to process : 1
11/12/15 11:37:00 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hadoop/mapred/local]
11/12/15 11:37:00 INFO streaming.StreamJob: Running job: job_201112151118_0003
11/12/15 11:37:00 INFO streaming.StreamJob: To kill this job, run:
11/12/15 11:37:00 INFO streaming.StreamJob: /home/hadoop/hadoop_home/hadoop-0.20.205.0/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201112151118_0003
11/12/15 11:37:00 INFO streaming.StreamJob: Tracking URL: http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201112151118_0003
11/12/15 11:37:01 INFO streaming.StreamJob: map 0% reduce 0%
11/12/15 11:37:13 INFO streaming.StreamJob: map 100% reduce 0%
11/12/15 11:37:22 INFO streaming.StreamJob: map 100% reduce 33%
11/12/15 11:37:28 INFO streaming.StreamJob: map 100% reduce 100%
11/12/15 11:37:34 INFO streaming.StreamJob: Job complete: job_201112151118_0003
11/12/15 11:37:34 INFO streaming.StreamJob: Output: /user/hadoop/yangkai/test
success!
Input file:
[hadoop@ hadoopTest1]$ more data.txt
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 3 4 5 6 7 8 9
1 3 4 5 6 7 8 9
1 3 4 5 6 7 8 9
1 3 4 5 6 7 8 9
Result file:
[hadoop@ hadoopTest1]$ more part-00000
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 1 2 3 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 3 4 5 6 7 8 9
1 3 4 5 6 7 8 9
1 3 4 5 6 7 8 9
1 3 4 5 6 7 8 9
VI. Errors encountered and fixes:
1. error: unable to get address of epoll functions
[hadoop@ attempt_201112141219_0001_m_000010_0]$ cat ${HADOOP_HOME}/logs/userlogs/job_201112141219_0001/attempt_201112141219_0001_m_000010_0/stderr
Exception in thread "main" java.lang.InternalError: unable to get address of epoll functions, pre-2.6 kernel?
        at sun.nio.ch.EPollArrayWrapper.init(Native Method)
        at sun.nio.ch.EPollArrayWrapper.<clinit>(EPollArrayWrapper.java:272)
        at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:52)
        at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
        at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
        at org.apache.hadoop.ipc.Client.call(Client.java:1046)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
        at $Proxy1.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:370)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:420)
        at org.apache.hadoop.mapred.Child$1.run(Child.java:113)
        at org.apache.hadoop.mapred.Child$1.run(Child.java:110)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.Child.main(Child.java:109)
[hadoop@ attempt_201112141219_0001_m_000010_0]$
Likely cause: a mismatched JDK build (the machine is 64-bit but the installed JDK is 32-bit).
2. Warning: $HADOOP_HOME is deprecated.
The warning comes from $HADOOP_HOME/bin/hadoop (excerpt from http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security/bin/hadoop):

# The Hadoop command script
# .........
bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

if [ "$HADOOP_HOME_WARN_SUPPRESS" == "" ] && [ "$HADOOP_HOME" != "" ]; then
  echo "Warning: \$HADOOP_HOME is deprecated." 1>&2
  echo 1>&2
fi

The warning is harmless: Hadoop reads its configuration from $HADOOP_HOME/conf by default, so this is merely a reminder to check that the configuration directory is correct.
Fix: comment out lines 53-56 of $HADOOP_HOME/bin/hadoop:
#if [ "$HADOOP_HOME_WARN_SUPPRESS" == "" ] && [ "$HADOOP_HOME" != "" ]; then
#  echo "Warning: \$HADOOP_HOME is deprecated." 1>&2
#  echo 1>&2
#fi
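Alternatively, judging from the guard condition in the excerpt above, the warning can be suppressed without editing the script by setting the corresponding variable, e.g. in ~/.bash_profile:

export HADOOP_HOME_WARN_SUPPRESS=1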
3. The datanode does not start
Delete the Hadoop data under /tmp (by default /tmp/hadoop-hadoop; see the path tree in section IV). A combined recovery sequence is given after problem 4.
4. The namenode does not start
hadoop namenode -format (see the recovery sequence below)
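A combined recovery sequence for problems 3 and 4, sketched for this particular setup (warning: it destroys all HDFS data, which is acceptable here only because /tmp/hadoop-hadoop is a throwaway test tree):

stop-all.sh                  # stop any half-started daemons
rm -rf /tmp/hadoop-hadoop    # clear the on-disk state described in section IV
hadoop namenode -format      # recreate an empty namespace
start-all.sh                 # bring all daemons back up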