Hadoop Installation Lab and MapReduce Programming Lab Guide

Lab environment: Red Hat 6.5, Hadoop 2.6.5, JDK 1.7.0

This document is a supplement to the course lab manual; refer to the lab manual for the complete set of commands.

 

1. Hadoop Installation Lab

 

  1.1 Preparation
    1.1.1 Configure the host names

 

To change the host names:

Check the IP address of each of your virtual machines.

Edit /etc/hosts on every server and add the host-name entries.
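A sketch of the /etc/hosts entries, assuming one master and three slaves as used throughout this guide (the IP addresses are placeholders; use the addresses of your own VMs):

192.168.1.100   master
192.168.1.101   slave1
192.168.1.102   slave2
192.168.1.103   slave3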

    1.1.2 Install and configure the JDK

yum on Red Hat requires a subscription, so we download the JDK manually.

JDK package: jdk-7u67-linux-x64.tar.gz

Configuration: add the following to ~/.bashrc:
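A minimal sketch, assuming the archive was unpacked to /usr/java/jdk1.7.0_67 (adjust the path to your actual install location):

export JAVA_HOME=/usr/java/jdk1.7.0_67
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

Run source ~/.bashrc afterwards and verify with java -version.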

 

    1.1.3 Create the user that runs Hadoop

Add a user named hadoop for running the Hadoop programs, and set its login password to hadoop.
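For example, as root:

useradd hadoop
passwd hadoop        (enter hadoop twice when prompted)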

————————————

Granting an ordinary Linux user passwordless sudo privileges

vim /etc/sudoers

Find the line: root ALL=(ALL) ALL

Below it, add: xxx ALL=(ALL) ALL (replace xxx with your user name)

Any one of the following entries can be added to sudoers (replace user with the actual user name):
root            ALL=(ALL)                ALL
user            ALL=(ALL)                NOPASSWD: ALL

user            ALL=(root)               ALL              # grants the privileges that root has

 

————————————

    1.1.4 Configure passwordless SSH

To enable password-free SSH login, public/private key authentication must be configured on the hosts. Log in as root and edit /etc/ssh/sshd_config on every host, removing the leading "#" from the lines RSAAuthentication yes and PubkeyAuthentication yes.

 

Log back in as the hadoop user created earlier and run ssh-keygen -t rsa on every host to generate a local public/private key pair.

The same procedure must be carried out on the three slave hosts. Then copy the public keys of the three slaves to the master host with scp.
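For example, on slave1 (the .pub file name matches the naming used below; repeat on slave2 and slave3):

[hadoop@slave1 ~]$ scp ~/.ssh/id_rsa.pub master:/home/hadoop/.ssh/slave1.pub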

Here id_rsa.pub is the hadoop user's public key on the master host, while slave1.pub, slave2.pub and slave3.pub are the public keys of the respective slave hosts.

Next, append the public keys of all four hosts to the authorized_keys file on the master host [1].
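A sketch of this aggregation step on the master, using the file names above:

[hadoop@master ~]$ cd ~/.ssh
[hadoop@master .ssh]$ cat id_rsa.pub slave1.pub slave2.pub slave3.pub >> authorized_keys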

Then scp the authorized_keys file from the master host to each of the three slave hosts.

Command: scp authorized_keys slave1:/home/hadoop/.ssh/

In addition, the permissions of the relevant directories and files must be fixed; run the following on every server:
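The usual fix (sshd rejects keys whose directory or file permissions are too open):

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys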

Disable the firewall.

Set SELinux to disabled.
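On Red Hat 6.x both steps can be done as root, for example:

service iptables stop
chkconfig iptables off
setenforce 0
vim /etc/selinux/config        (set SELINUX=disabled so the change survives a reboot)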

  1.2 Install the Hadoop cluster
    1.2.1 Install on the master

Upload hadoop-2.6.5.tar.gz to the master node and unpack it:

[hadoop@master ~]$ tar zxvf hadoop-2.6.5.tar.gz

Create a symbolic link:

[hadoop@master ~]$ ln -s hadoop-2.6.5 hadoop

To simplify the configuration process, create a hadoop-config directory and point HADOOP_CONF_DIR at it (in Hadoop 2.x the configuration files live under etc/hadoop rather than conf):

[hadoop@master ~]$ mkdir hadoop-config

[hadoop@master ~]$ cp hadoop/etc/hadoop/* ./hadoop-config/

[hadoop@master ~]$ export HADOOP_CONF_DIR=/home/hadoop/hadoop-config/

Set up the environment variables:
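A sketch of the ~/.bashrc additions, following the layout created above:

export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_CONF_DIR=/home/hadoop/hadoop-config
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH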

Edit the configuration files under etc/hadoop (here, the copies in hadoop-config):

core-site.xml
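A minimal core-site.xml sketch consistent with this guide (master:9000 matches the fs.defaultFS seen in the job output below, and the tmp directory is the one created on the nodes; adjust to your environment):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/lib/cassandra/data/hadoop/tmp</value>
  </property>
</configuration>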

 

hdfs-site.xml
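Likewise, a minimal hdfs-site.xml sketch using the name and data directories created on the nodes (a replication factor of 3 matches the three slaves):

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/var/lib/cassandra/data/hadoop/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/var/lib/cassandra/data/hadoop/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>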

 

masters

slaves
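For the cluster assumed here, the masters file names the host that runs the SecondaryNameNode and the slaves file lists the DataNode hosts:

masters:
master

slaves:
slave1
slave2
slave3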

    1.2.2 Set up the slave hosts

Copy the configuration and installation files from the master host:

[hadoop@master ~]$ scp -r .bashrc hadoop-config/ hadoop-2.6.5.tar.gz slave1:/home/hadoop/

[hadoop@slave1 ~]# mkdir /var/lib/cassandra/data/hadoop

[hadoop@slave1 ~]# mkdir /var/lib/cassandra/data/hadoop/tmp

[hadoop@slave1 ~]# mkdir /var/lib/cassandra/data/hadoop/data

[hadoop@slave1 ~]# mkdir /var/lib/cassandra/data/hadoop/name

 

Unpack the installation file and create the symbolic link:

tar zxvf hadoop-2.6.5.tar.gz && ln -s hadoop-2.6.5 hadoop

 

  1.3 Start the services

When starting the services for the first time, the NameNode must be formatted first:

[hadoop@master ~]$ hadoop namenode -format

If this reports an error, the usual cause is insufficient permission to create files in the data directories; run

sudo chmod -R a+w /var/lib/

and try again.
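With the stock scripts the daemons are then started with, for example:

[hadoop@master ~]$ start-all.sh        (or start-dfs.sh followed by start-yarn.sh)
[hadoop@master ~]$ jps                 (check which daemons are running on each node)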

If jps shows that the DataNode did not start:

则命令行输入:${HADOOP_HOME}/sbin/hadoop-daemon.sh start datanode

Daemons on other nodes that failed to start can be brought up in the same way.

To shut Hadoop down: stop-all.sh

Upload files to HDFS:

[hadoop@master ~]$ hadoop fs -put hadoop-config/ config

List the files:

hadoop fs -ls /user/hadoop/

Run the example program:

[hadoop@master hadoop]$ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /usr/hadoop/config usr/hadoop/results

 

Output:

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000027_0 decomp: 41 len: 45 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 41 bytes from map-output for attempt_local2077883811_0001_m_000027_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 41, inMemoryMapOutputs.size() -> 11, commitMemory -> 25025, usedMemory ->25066

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000003_0 decomp: 3942 len: 3946 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 3942 bytes from map-output for attempt_local2077883811_0001_m_000003_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 3942, inMemoryMapOutputs.size() -> 12, commitMemory -> 25066, usedMemory ->29008

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000016_0 decomp: 1723 len: 1727 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 1723 bytes from map-output for attempt_local2077883811_0001_m_000016_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 1723, inMemoryMapOutputs.size() -> 13, commitMemory -> 29008, usedMemory ->30731

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000002_0 decomp: 4795 len: 4799 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 4795 bytes from map-output for attempt_local2077883811_0001_m_000002_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 4795, inMemoryMapOutputs.size() -> 14, commitMemory -> 30731, usedMemory ->35526

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000029_0 decomp: 15 len: 19 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 15 bytes from map-output for attempt_local2077883811_0001_m_000029_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 15, inMemoryMapOutputs.size() -> 15, commitMemory -> 35526, usedMemory ->35541

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000017_0 decomp: 1777 len: 1781 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 1777 bytes from map-output for attempt_local2077883811_0001_m_000017_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 1777, inMemoryMapOutputs.size() -> 16, commitMemory -> 35541, usedMemory ->37318

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000011_0 decomp: 2140 len: 2144 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 2140 bytes from map-output for attempt_local2077883811_0001_m_000011_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 2140, inMemoryMapOutputs.size() -> 17, commitMemory -> 37318, usedMemory ->39458

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000001_0 decomp: 4637 len: 4641 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 4637 bytes from map-output for attempt_local2077883811_0001_m_000001_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 4637, inMemoryMapOutputs.size() -> 18, commitMemory -> 39458, usedMemory ->44095

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000025_0 decomp: 938 len: 942 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 938 bytes from map-output for attempt_local2077883811_0001_m_000025_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 938, inMemoryMapOutputs.size() -> 19, commitMemory -> 44095, usedMemory ->45033

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000024_0 decomp: 1019 len: 1023 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 1019 bytes from map-output for attempt_local2077883811_0001_m_000024_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 1019, inMemoryMapOutputs.size() -> 20, commitMemory -> 45033, usedMemory ->46052

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000012_0 decomp: 2144 len: 2148 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 2144 bytes from map-output for attempt_local2077883811_0001_m_000012_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 2144, inMemoryMapOutputs.size() -> 21, commitMemory -> 46052, usedMemory ->48196

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000000_0 decomp: 12150 len: 12154 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 12150 bytes from map-output for attempt_local2077883811_0001_m_000000_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 12150, inMemoryMapOutputs.size() -> 22, commitMemory -> 48196, usedMemory ->60346

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000026_0 decomp: 386 len: 390 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 386 bytes from map-output for attempt_local2077883811_0001_m_000026_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 386, inMemoryMapOutputs.size() -> 23, commitMemory -> 60346, usedMemory ->60732

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000013_0 decomp: 2240 len: 2244 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 2240 bytes from map-output for attempt_local2077883811_0001_m_000013_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 2240, inMemoryMapOutputs.size() -> 24, commitMemory -> 60732, usedMemory ->62972

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000008_0 decomp: 2387 len: 2391 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 2387 bytes from map-output for attempt_local2077883811_0001_m_000008_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 2387, inMemoryMapOutputs.size() -> 25, commitMemory -> 62972, usedMemory ->65359

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000021_0 decomp: 1323 len: 1327 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 1323 bytes from map-output for attempt_local2077883811_0001_m_000021_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 1323, inMemoryMapOutputs.size() -> 26, commitMemory -> 65359, usedMemory ->66682

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000009_0 decomp: 2992 len: 2996 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 2992 bytes from map-output for attempt_local2077883811_0001_m_000009_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 2992, inMemoryMapOutputs.size() -> 27, commitMemory -> 66682, usedMemory ->69674

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000023_0 decomp: 1212 len: 1216 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 1212 bytes from map-output for attempt_local2077883811_0001_m_000023_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 1212, inMemoryMapOutputs.size() -> 28, commitMemory -> 69674, usedMemory ->70886

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000022_0 decomp: 1202 len: 1206 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 1202 bytes from map-output for attempt_local2077883811_0001_m_000022_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 1202, inMemoryMapOutputs.size() -> 29, commitMemory -> 70886, usedMemory ->72088

19/06/16 21:15:08 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local2077883811_0001_m_000010_0 decomp: 2111 len: 2115 to MEMORY

19/06/16 21:15:08 INFO reduce.InMemoryMapOutput: Read 2111 bytes from map-output for attempt_local2077883811_0001_m_000010_0

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 2111, inMemoryMapOutputs.size() -> 30, commitMemory -> 72088, usedMemory ->74199

19/06/16 21:15:08 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning

19/06/16 21:15:08 INFO mapred.LocalJobRunner: 30 / 30 copied.

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: finalMerge called with 30 in-memory map-outputs and 0 on-disk map-outputs

19/06/16 21:15:08 INFO mapred.Merger: Merging 30 sorted segments

19/06/16 21:15:08 INFO mapred.Merger: Down to the last merge-pass, with 30 segments left of total size: 73995 bytes

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: Merged 30 segments, 74199 bytes to disk to satisfy reduce memory limit

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: Merging 1 files, 74145 bytes from disk

19/06/16 21:15:08 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce

19/06/16 21:15:08 INFO mapred.Merger: Merging 1 sorted segments

19/06/16 21:15:08 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 74136 bytes

19/06/16 21:15:08 INFO mapred.LocalJobRunner: 30 / 30 copied.

19/06/16 21:15:08 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords

19/06/16 21:15:08 INFO mapred.Task: Task:attempt_local2077883811_0001_r_000000_0 is done. And is in the process of committing

19/06/16 21:15:08 INFO mapred.LocalJobRunner: 30 / 30 copied.

19/06/16 21:15:08 INFO mapred.Task: Task attempt_local2077883811_0001_r_000000_0 is allowed to commit now

19/06/16 21:15:08 INFO output.FileOutputCommitter: Saved output of task 'attempt_local2077883811_0001_r_000000_0' to hdfs://master:9000/user/hadoop/usr/hadoop/results/_temporary/0/task_local2077883811_0001_r_000000

19/06/16 21:15:08 INFO mapred.LocalJobRunner: reduce > reduce

19/06/16 21:15:08 INFO mapred.Task: Task 'attempt_local2077883811_0001_r_000000_0' done.

19/06/16 21:15:08 INFO mapred.LocalJobRunner: Finishing task: attempt_local2077883811_0001_r_000000_0

19/06/16 21:15:08 INFO mapred.LocalJobRunner: reduce task executor complete.

19/06/16 21:15:09 INFO mapreduce.Job:  map 100% reduce 100%

19/06/16 21:15:09 INFO mapreduce.Job: Job job_local2077883811_0001 completed successfully

19/06/16 21:15:09 INFO mapreduce.Job: Counters: 38

       File System Counters

              FILE: Number of bytes read=10542805

              FILE: Number of bytes written=19367044

              FILE: Number of read operations=0

              FILE: Number of large read operations=0

              FILE: Number of write operations=0

              HDFS: Number of bytes read=1847773

              HDFS: Number of bytes written=36492

              HDFS: Number of read operations=1117

              HDFS: Number of large read operations=0

              HDFS: Number of write operations=33

       Map-Reduce Framework

              Map input records=2087

              Map output records=7887

              Map output bytes=105178

              Map output materialized bytes=74319

              Input split bytes=3547

              Combine input records=7887

              Combine output records=3940

              Reduce input groups=1570

              Reduce shuffle bytes=74319

              Reduce input records=3940

              Reduce output records=1570

              Spilled Records=7880

              Shuffled Maps =30

              Failed Shuffles=0

              Merged Map outputs=30

              GC time elapsed (ms)=1088

              CPU time spent (ms)=0

              Physical memory (bytes) snapshot=0

              Virtual memory (bytes) snapshot=0

              Total committed heap usage (bytes)=4331405312

       Shuffle Errors

              BAD_ID=0

              CONNECTION=0

              IO_ERROR=0

              WRONG_LENGTH=0

              WRONG_MAP=0

              WRONG_REDUCE=0

       File Input Format Counters

              Bytes Read=77522

       File Output Format Counters

              Bytes Written=36492

 

 

View the contents of the part-r-00000 file directly in HDFS:

hadoop fs -cat /user/hadoop/results/part-r-00000

 

 

The results look like this:

java 3

javadoc   1

job  6

jobs 10

jobs, 1

jobs. 2

jsvc 2

jvm 3

jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext    1

jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31 1

jvm.period=10 1

jvm.servers=localhost:8649   1

key  10

key. 1

key="capacity"      1

key="user-limit"    1

keys 1

keystore  9

keytab    2

killing     1

kms 2

kms-audit 1

language 24

last  1

law  24

leaf  2

level 6

levels      2

library    1

license    12

licenses   12

like  3

limit 1

limitations      24

line  1

links 1

list   44

location  2

log  14

log4j.additivity.kms-audit=false    1

log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false      1

log4j.additivity.org.apache.hadoop.mapred.AuditLogger=false   1

log4j.additivity.org.apache.hadoop.mapred.JobInProgress$JobSummary=false  1

log4j.additivity.org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary=false       1

log4j.appender.DRFA.DatePattern=.yyyy-MM-dd 1

log4j.appender.DRFA.File=${hadoop.log.dir}/${hadoop.log.file}      1

log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} 1

log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout    1

log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender   1

log4j.appender.DRFAS.DatePattern=.yyyy-MM-dd    1

log4j.appender.DRFAS.File=${hadoop.log.dir}/${hadoop.security.log.file}      1

log4j.appender.DRFAS.layout.ConversionPattern=%d{ISO8601}      1

log4j.appender.DRFAS.layout=org.apache.log4j.PatternLayout  1

log4j.appender.DRFAS=org.apache.log4j.DailyRollingFileAppender 1

log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter     1

log4j.appender.JSA.File=${hadoop.log.dir}/${hadoop.mapreduce.jobsummary.log.file}   1

log4j.appender.JSA.MaxBackupIndex=${hadoop.mapreduce.jobsummary.log.maxbackupindex}       1

log4j.appender.JSA.MaxFileSize=${hadoop.mapreduce.jobsummary.log.maxfilesize}     1

log4j.appender.JSA.layout.ConversionPattern=%d{yy/MM/dd   1

log4j.appender.JSA.layout=org.apache.log4j.PatternLayout 1

log4j.appender.JSA=org.apache.log4j.RollingFileAppender 1

log4j.appender.MRAUDIT.File=${hadoop.log.dir}/mapred-audit.log 1

log4j.appender.MRAUDIT.MaxBackupIndex=${mapred.audit.log.maxbackupindex}      1

log4j.appender.MRAUDIT.MaxFileSize=${mapred.audit.log.maxfilesize} 1

log4j.appender.MRAUDIT.layout.ConversionPattern=%d{ISO8601} 1

log4j.appender.MRAUDIT.layout=org.apache.log4j.PatternLayout    1

log4j.appender.MRAUDIT=org.apache.log4j.RollingFileAppender    1

log4j.appender.NullAppender=org.apache.log4j.varia.NullAppender  1

log4j.appender.RFA.File=${hadoop.log.dir}/${hadoop.log.file} 1

log4j.appender.RFA.MaxBackupIndex=${hadoop.log.maxbackupindex}   1

log4j.appender.RFA.MaxFileSize=${hadoop.log.maxfilesize}    1

log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601}   1

log4j.appender.RFA.layout=org.apache.log4j.PatternLayout 1

log4j.appender.RFA=org.apache.log4j.RollingFileAppender 1

log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log    1

log4j.appender.RFAAUDIT.MaxBackupIndex=${hdfs.audit.log.maxbackupindex}   1

log4j.appender.RFAAUDIT.MaxFileSize=${hdfs.audit.log.maxfilesize}    1

log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} 1

log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout   1

log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender   1

log4j.appender.RFAS.File=${hadoop.log.dir}/${hadoop.security.log.file}  1

log4j.appender.RFAS.MaxBackupIndex=${hadoop.security.log.maxbackupindex}   1

log4j.appender.RFAS.MaxFileSize=${hadoop.security.log.maxfilesize}    1

log4j.appender.RFAS.layout.ConversionPattern=%d{ISO8601} 1

log4j.appender.RFAS.layout=org.apache.log4j.PatternLayout     1

log4j.appender.RFAS=org.apache.log4j.RollingFileAppender     1

log4j.appender.RMSUMMARY.File=${hadoop.log.dir}/${yarn.server.resourcemanager.appsummary.log.file}       1

log4j.appender.RMSUMMARY.MaxBackupIndex=20 1

log4j.appender.RMSUMMARY.MaxFileSize=256MB 1

log4j.appender.RMSUMMARY.layout.ConversionPattern=%d{ISO8601} 1

log4j.appender.RMSUMMARY.layout=org.apache.log4j.PatternLayout    1

log4j.appender.RMSUMMARY=org.apache.log4j.RollingFileAppender    1

log4j.appender.TLA.isCleanup=${hadoop.tasklog.iscleanup}     1

log4j.appender.TLA.layout.ConversionPattern=%d{ISO8601}   1

log4j.appender.TLA.layout=org.apache.log4j.PatternLayout 1

log4j.appender.TLA.taskId=${hadoop.tasklog.taskid} 1

log4j.appender.TLA.totalLogFileSize=${hadoop.tasklog.totalLogFileSize}      1

log4j.appender.TLA=org.apache.hadoop.mapred.TaskLogAppender  1

log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd     1

log4j.appender.console.layout=org.apache.log4j.PatternLayout   1

log4j.appender.console.target=System.err    1

log4j.appender.console=org.apache.log4j.ConsoleAppender 1

log4j.appender.httpfs.Append=true 1

log4j.appender.httpfs.DatePattern='.'yyyy-MM-dd      1

log4j.appender.httpfs.File=${httpfs.log.dir}/httpfs.log 1

log4j.appender.httpfs.layout.ConversionPattern=%d{ISO8601}  1

log4j.appender.httpfs.layout=org.apache.log4j.PatternLayout     1

log4j.appender.httpfs=org.apache.log4j.DailyRollingFileAppender    1

log4j.appender.httpfsaudit.Append=true 1

log4j.appender.httpfsaudit.DatePattern='.'yyyy-MM-dd      1

log4j.appender.httpfsaudit.File=${httpfs.log.dir}/httpfs-audit.log 1

log4j.appender.httpfsaudit.layout.ConversionPattern=%d{ISO8601}  1

log4j.appender.httpfsaudit.layout=org.apache.log4j.PatternLayout     1

log4j.appender.httpfsaudit=org.apache.log4j.DailyRollingFileAppender    1

log4j.appender.kms-audit.Append=true 1

log4j.appender.kms-audit.DatePattern='.'yyyy-MM-dd 1

log4j.appender.kms-audit.File=${kms.log.dir}/kms-audit.log     1

log4j.appender.kms-audit.layout.ConversionPattern=%d{ISO8601}   1

log4j.appender.kms-audit.layout=org.apache.log4j.PatternLayout      1

log4j.appender.kms-audit=org.apache.log4j.DailyRollingFileAppender     1

log4j.appender.kms.Append=true  1

log4j.appender.kms.DatePattern='.'yyyy-MM-dd 1

log4j.appender.kms.File=${kms.log.dir}/kms.log 1

log4j.appender.kms.layout.ConversionPattern=%d{ISO8601}    1

log4j.appender.kms.layout=org.apache.log4j.PatternLayout 1

log4j.appender.kms=org.apache.log4j.DailyRollingFileAppender 1

log4j.category.SecurityLogger=${hadoop.security.logger}  1

log4j.logger.com.amazonaws.http.AmazonHttpClient=ERROR  1

log4j.logger.com.amazonaws=ERROR 1

log4j.logger.com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator=OFF       1

log4j.logger.httpfsaudit=INFO,     1

log4j.logger.kms-audit=INFO,      1

log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=WARN   1

log4j.logger.org.apache.hadoop.conf=ERROR     1

log4j.logger.org.apache.hadoop.fs.http.server=INFO,  1

log4j.logger.org.apache.hadoop.fs.s3a.S3AFileSystem=WARN  1

log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}       1

log4j.logger.org.apache.hadoop.lib=INFO,   1

log4j.logger.org.apache.hadoop.mapred.AuditLogger=${mapred.audit.logger}  1

log4j.logger.org.apache.hadoop.mapred.JobInProgress$JobSummary=${hadoop.mapreduce.jobsummary.logger}       1

log4j.logger.org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary=${yarn.server.resourcemanager.appsummary.logger}       1

log4j.logger.org.apache.hadoop=INFO 1

log4j.logger.org.jets3t.service.impl.rest.httpclient.RestS3Service=ERROR 1

log4j.rootLogger=${hadoop.root.logger},    1

log4j.rootLogger=ALL, 1

log4j.threshold=ALL     1

logger     2

logger.    1

logging   4

logs 2

logs.</description> 1

loops      1

manage   1

manager  3

map 2

mapping 1

mapping]*     1

mappings 1

mappings.      1

mapred   1

mapred.audit.log.maxbackupindex=20 1

mapred.audit.log.maxfilesize=256MB  1

mapred.audit.logger=INFO,NullAppender   1

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext      1

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31  1

mapred.class=org.apache.hadoop.metrics.spi.NullContext   1

mapred.period=10  1

mapred.servers=localhost:8649     1

mapreduce.cluster.acls.enabled     3

mapreduce.cluster.administrators  2

maps      1

master    1

masters   1

match="configuration"> 1

material  3

max 3

maximum      4

may 52

means     18

message  1

messages 2

metadata 1

method="html"/>   1

metrics   3

midnight 1

might      1

milliseconds.  4

min.user.id=1000#Prevent    1

missed    1

modified. 1

modify    1

modifying      1

more       12

mradmin 1

ms)  1

multi-dimensional  1

multiple  4

must 1

name      3

name.     1

name="{name}"><xsl:value-of    1

namenode      1

namenode-metrics.out   1

namenode.     2

namenode.</description>      1

namenode:     1

names.    19

nesting    1

new 2

no   4

nodes      2

nodes.     2

non-privileged 2

normal    1

not  51

null 5

number   6

numerical 3

obtain     24

of    138

off   4

on   30

one  15

one: 2

only 9

operation.      3

operations      8

operations.     9

opportunities  1

option     6

optional. 2

options   11

options.  2

or    71

ordinary  1

org.apache.hadoop.metrics2  1

other       1

other.      6

others     2

overridden     4

override  5

overrides 4

owner     1

ownership.     12

package-info.java  1

parameters     4

parent     1

part 2

password 3

path 2

pending  1

per  1

percent   1

percentage     1

period     1

period,    1

permissions    24

picked    3

pid  3

place      1

please     1

policy     3

port 5

ports       2

ports.      2

potential 2

preferred 3

prefix.    1

present,   1

principal 4

principal. 1

printed    1

priorities. 1

priority   1

privileged      2

privileges 1

privileges.      1

properties 7

property  11

protocol  6

protocol, 2

protocol. 2

provide   3

q1.  1

q2   2

q2.  1

quashed  1

query      1

queue     12

queue).   1

queue,    1

queue.    10

queues    9

queues,   1

queues.   3

rack 1

rack-local 1

recovery. 1

reduce    2

refresh    2

regarding 12

reload     2

remote    2

representing   3

required  29

resolve    2

resources 2

response. 2

restore    1

retrieve   1

return     1

returned  2

rolling    1

rollover-key   1

root 2

rootlogger      1

rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext     1

rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31 1

rpc.class=org.apache.hadoop.metrics.spi.NullContext  1

rpc.period=10 1

rpc.servers=localhost:8649    1

run  10

running   3

running,  1

running.  1

runs 3

runtime   2

same       1

sample    1

sampling 2

scale 3

schedule 1

scheduler.      1

schedulers,     1

scheduling     2

secondary      1

seconds   1

seconds). 2

secret      3

secure     7

security   1

segment  1

select="description"/></td>   1

select="name"/></a></td>    1

select="property"> 1

select="value"/></td>   1

send 1

sending   1

separate  2

separated 20

separated.      1

server     3

service    2

service-level   2

set   60

set." 1

sets  2

setting    7

setup      1

severity   1

should    6

sign 1

signature 2

similar    3

single     1

sinks       1

site-specific    4

sizes 1

slave1     1

slave2     1

slave3     1

so    3

softlink   1

software  24

some      2

sometimes      1

source     2

space      1

space),    2

spaces     1

special    19

specific   33

specification.  1

specified 13

specified, 3

specified. 6

specifiying     1

specify    3

specifying      2

split 1

stand-by  1

start 3

started    2

starting   3

state 4

states      1

status      2

stopped,  1

store 1

stored     2

stored.    7

string      3

string,     1

submission     1

submit    2

submitting      1

such 1

summary 5

super-users     1

support   2

supported 1

supports  1

supportsparse 1

suppress  1

symlink  2

syntax     1

syntax:    1

system    3

tag   1

tags 3

target      1

tasks       2

tasktracker.    1

template  1

temporary      2

than 1

that  19

the   370

them 1

then 8

there 2

therefore 3

this  78

threads    1

time 4

timeline  2

timestamp.     1

to    163

top  1

traffic.    1

transfer   4

true. 3

turn 1

two. 3

type 1

type, 1

type="text/xsl"      6

typically 1

u:%user:%user      1

ugi.class=org.apache.hadoop.metrics.ganglia.GangliaContext     1

ugi.class=org.apache.hadoop.metrics.ganglia.GangliaContext31 1

ugi.class=org.apache.hadoop.metrics.spi.NullContext  1

ugi.period=10 1

ugi.servers=localhost:8649    1

uncommented 1

under      84

unset      1

up   2

updating 1

usage      2

use  30

use, 2

use. 4

used 36

used.      3

user 48

user. 2

user1,user2    2

user?      1

users       27

users,wheel".  18

uses 2

using      14

value      45

value="20"/>  1

value="30"/>  1

values     4

variable  4

variables 4

version   1

version="1.0" 5

version="1.0">      1

version="1.0"?>    7

via   3

view,      1

viewing  1

w/   1

want 1

warnings. 1

when      9

where     4

which     7

while      1

who 6

will 23

window  1

window,  1

with 59

within     4

without   1

work       12

writing,   24

written    2

xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 1

yarn.nodemanager.linux-container-executor.group 1

yarn.nodemanager.linux-container-executor.group=#configured  1

yarn.server.resourcemanager.appsummary.log.file 1

yarn.server.resourcemanager.appsummary.log.file=rm-appsummary.log     1

yarn.server.resourcemanager.appsummary.logger 2

yarn.server.resourcemanager.appsummary.logger=${hadoop.root.logger}  1

you  28

Download the results to the local file system:

hadoop fs -get /user/hadoop/usr/hadoop/results count-results

ls

Check the downloaded files locally; if they look correct,

then Lab 1 is complete.

 

2. MapReduce Programming Lab

Objectives

Understand how MapReduce works

Master basic MapReduce programming techniques

Learn MapReduce programming under Eclipse

Learn to design, implement and run MapReduce programs

Tasks

Note: because of version changes, the course lab manual is dated. Hadoop has since moved on to 2.x/3.x, and many classes, methods and APIs have changed relative to the Hadoop 0.20 release the manual describes. This lab therefore follows the steps below while still using the manual as a reference.

The example program WordCount counts how often each word appears in a set of text files. The complete code ships with the Hadoop distribution (in the src/examples directory of older releases; the layout has changed in Hadoop 2.x, so it is no longer found there).

WordCount code:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
 
 
import java.io.IOException;
import java.util.StringTokenizer;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
 
public class WordCount {
 
  public static class TokenizerMapper 
       extends Mapper<Object, Text, Text, IntWritable>{
    
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
      
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  
  public static class IntSumReducer 
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();
 
    public void reduce(Text key, Iterable<IntWritable> values, 
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
 
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job,
      new Path(otherArgs[otherArgs.length - 1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

 

  2.1 Install and configure Eclipse
  2.2 Install the Hadoop Eclipse Plugin

Here I use hadoop-eclipse-plugin-2.7.3.jar. Copy the compiled hadoop-eclipse-plugin jar into the plugins directory of the Eclipse installation and restart Eclipse. Open Window -> Preferences from the menu bar; if the plugin was installed successfully, the page shown in the figure below appears, and the Hadoop installation directory is selected in the area marked by the red circle.

Click Window -> Show View -> Other -> Map/Reduce Locations.

 

In the Map/Reduce Locations view at the bottom, right-click the empty area and choose New Hadoop location, or click the blue elephant icon with the plus sign on the right, then fill in the connection details.

How to fill in these settings is shown in the figure below:
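As a rough guide (the values below are assumptions consistent with the cluster configured earlier, not taken from the figure): the New Hadoop location dialog asks for a location name, the Map/Reduce master host and port, and the DFS master host and port. For this cluster the host would be master and the DFS port 9000 (matching fs.defaultFS); for the local pseudo-distributed run configuration used below, the host would be localhost with port 9000.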

  2.3 Developing and debugging in Eclipse

Hadoop parallel programs can be conveniently developed and debugged in the Eclipse environment. IBM MapReduce Tools for Eclipse is recommended: this Eclipse plugin simplifies developing and deploying Hadoop parallel programs. With it you can create a Hadoop MapReduce application in Eclipse, use wizards for writing classes based on the MapReduce framework, package the application into a JAR file, deploy it to a Hadoop server (local or remote), and inspect the status of the Hadoop server, the Hadoop distributed file system (DFS) and the currently running tasks through a dedicated perspective.

The MapReduce Tools can be downloaded from the IBM alphaWorks site or from the download list accompanying this article. Unpack the downloaded archive into your Eclipse installation directory and restart Eclipse. Then open Window -> Preferences from the Eclipse main menu, select Hadoop Home Directory on the left, and set your Hadoop home directory.

 

With the configuration done, run the example in Eclipse.

File -> New -> Project, select Map/Reduce Project, and enter a project name such as WordCount.

Add the WordCount code.

Right-click WordCount.java and choose Run As -> Run Configurations, then set the program arguments to: hdfs://localhost:9000/user/hadoop/input hdfs://localhost:9000/user/hadoop/output, corresponding to the input and output paths respectively.

After the run completes, view the results:

Method 1: use the hadoop fs commands in a terminal.

Method 2: view them directly in Eclipse under DFS Locations; double-click part-r-00000 to open it and inspect the result.

  2.4 An improved WordCount program

The WordCount program is now improved with the following goals:

  1. The original WordCount program splits words only on whitespace, so punctuation ends up mixed in with the words. The improved program should extract words correctly and should treat upper and lower case as the same.
  2. The final result should be sorted in descending order of word frequency.

WordCount code with descending-order output:

package desc;

/**
 * WordCount
 * Counts how often each word appears in the input files.
 * "Stop words" (read from a text file) are excluded from the count.
 * The output is sorted by word frequency from highest to lowest.
 *
 * Note: when importing a class, make sure it is the one from the intended package.
 * */
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;
import java.util.Map.Entry;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
        
public class WordCount {
	
	
	/**
	 * Map: converts each line of input text into <word, 1> key-value pairs
	 * */
	public static class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
		
		String regex = "[.,\"!--;:?'\\]]"; //remove all punctuation
		Text word = new Text();
		final static IntWritable one = new IntWritable(1);
		HashSet<String> stopWordSet = new HashSet<String>();
		
		/**
		 * Reads the stop words from a file into the HashSet
		 * */
		private void parseStopWordFile(String path){
			try {
				String word = null;
				BufferedReader reader = new BufferedReader(new FileReader(path));
				while((word = reader.readLine()) != null){
					stopWordSet.add(word);
				}
			} catch (IOException e) {
				e.printStackTrace();
			}	
		}
		
		/**
		 * Map initialization:
		 * reads the stop-word file(s) registered in the DistributedCache
		 * */
		public void setup(Context context) {
			try {
				// local paths of the stop-word files added via DistributedCache in main()
				Path[] patternsFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
				if (patternsFiles == null || patternsFiles.length == 0) {
					System.out.println("have no stopfile\n");
					return;
				}
				// read the stop words into the HashSet
				for (Path patternsFile : patternsFiles) {
					parseStopWordFile(patternsFile.toString());
				}
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
		
		/**
		 *  map
		 * */
		public void map(LongWritable key, Text value, Context context) 
			throws IOException, InterruptedException {
			
			String s = null;
			String line = value.toString().toLowerCase();
			line = line.replaceAll(regex, " "); //remove all punctuation
			
			//split all words of line
			StringTokenizer tokenizer = new StringTokenizer(line);
			while (tokenizer.hasMoreTokens()) {
				s = tokenizer.nextToken();
				if(!stopWordSet.contains(s)){
					word.set(s);
					context.write(word, one);
				}				
			}
		}
	}
	
	/**
	 * Reduce: add all word-counts for a key
	 * */
	public static class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
		
		int min_num = 0;
		
		/**
		 * Reads the minimum count a word must reach to appear in the output
		 * */
		public void setup(Context context) {
			min_num = Integer.parseInt(context.getConfiguration().get("min_num"));
			System.out.println(min_num);
		}
		
		/**
		 * reduce
		 * */
		public void reduce(Text key, Iterable<IntWritable> values, Context context)	
			throws IOException, InterruptedException {
			int sum = 0;
			for (IntWritable val : values) {
				sum += val.get();
			}
			if(sum < min_num) return;
			context.write(key, new IntWritable(sum));
		}
	}
	
	/**
	 * IntWritable comparator
	 * */
	private static class IntWritableDecreasingComparator extends IntWritable.Comparator {
        
	      public int compare(WritableComparable a, WritableComparable b) {
	    	  return -super.compare(a, b);
	      }
	      
	      public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
	          return -super.compare(b1, s1, l1, b2, s2, l2);
	      }
	}
	
	/**
	 * main: runs the two jobs (count, then sort)
	 * */
	public static void main(String[] args){
		
		boolean exit = false;
		String skipfile = null; //stop-file path
		int min_num = 0;
		String tempDir = "wordcount-temp-" + Integer.toString(new Random().nextInt(Integer.MAX_VALUE));
		
		Configuration conf = new Configuration();
		
		//get the path of the stop-word file and add it to the DistributedCache
	    for(int i=0;i<args.length;i++)
	    {
			if("-skip".equals(args[i]))
			{
				DistributedCache.addCacheFile(new Path(args[++i]).toUri(), conf);
				System.out.println(args[i]);
			}			
		}
	    
	    //get the minimum word frequency to display
	    for(int i=0;i<args.length;i++)
	    {
			if("-greater".equals(args[i])){
				min_num = Integer.parseInt(args[++i]);
				System.out.println(args[i]);
			}			
		}
	    
		//share the minimum frequency through the Configuration
		conf.set("min_num", String.valueOf(min_num));	//set global parameter
		
		try{
			/**
			 * run first-round to count
			 * */
			Job job = new Job(conf, "jiq-wordcountjob-1");
			job.setJarByClass(WordCount.class);
			
			//set format of input-output
			job.setInputFormatClass(TextInputFormat.class);
			job.setOutputFormatClass(SequenceFileOutputFormat.class);
			
			//set class of output's key-value of MAP
			job.setOutputKeyClass(Text.class);
		    job.setOutputValueClass(IntWritable.class);
		    
		    //set mapper and reducer
		    job.setMapperClass(WordCountMap.class);     
		    job.setReducerClass(WordCountReduce.class);
		    
		    //set path of input-output
		    FileInputFormat.addInputPath(job, new Path(args[0]));
		    FileOutputFormat.setOutputPath(job, new Path(tempDir));
		    
		    
		    
		    if(job.waitForCompletion(true)){		    
			    /**
			     * run two-round to sort
			     * */
			    //Configuration conf2 = new Configuration();
				Job job2 = new Job(conf, "jiq-wordcountjob-2");
				job2.setJarByClass(WordCount.class);
				
				//set format of input-output
				job2.setInputFormatClass(SequenceFileInputFormat.class);
				job2.setOutputFormatClass(TextOutputFormat.class);		
				
				//set class of output's key-value
				job2.setOutputKeyClass(IntWritable.class);
			    job2.setOutputValueClass(Text.class);
			    
			    //set mapper and reducer
			    //InverseMapper swaps the key and value of every pair produced by map()
			    //limit the number of reducers to 1 so that a single output file is produced
				/**
				* Note: setting the number of reducers to 1 matters here.
				* Hadoop only sorts keys locally within each reducer, not
				* globally across reducers, so the simplest way to obtain a
				* global ordering by key is to use a single reducer.
				*/
			    job2.setMapperClass(InverseMapper.class);    
			    job2.setNumReduceTasks(1); //only one reducer
			    
			    //set path of input-output
			    FileInputFormat.addInputPath(job2, new Path(tempDir));
			    FileOutputFormat.setOutputPath(job2, new Path(args[1]));
			    
			    /**
			     * Hadoop sorts IntWritable keys in ascending order by default,
			     * but we need descending order, so we implement an
			     * IntWritableDecreasingComparator and register it as the sort
			     * comparator for the output keys (the word frequencies).
			     * */
			    job2.setSortComparatorClass(IntWritableDecreasingComparator.class);
			    exit = job2.waitForCompletion(true);
		    }
		}catch(Exception e){
			e.printStackTrace();
		}finally{
		    
		    try {
		    	//delete the temporary directory, then exit with 0 on success
				FileSystem.get(conf).deleteOnExit(new Path(tempDir));
				System.exit(exit ? 0 : 1);
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
	}
         
 }
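A sketch of how this version might be packaged and run (the jar name, input/output paths and stop-word file are placeholders; -skip and -greater are the options parsed in main() above):

[hadoop@master ~]$ hadoop jar wordcount-desc.jar desc.WordCount /user/hadoop/input /user/hadoop/output -skip /user/hadoop/stopwords.txt -greater 2

Words listed in the stop-word file are skipped by the mapper, and words whose count falls below the -greater threshold are dropped by the reducer.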

 

Run results:

This completes Lab 1 and Lab 2.

 

 

 

References:

[1]https://blog.csdn.net/u010223431/article/details/51191978

[2]https://blog.csdn.net/xingyyn78/article/details/81085100

[3]https://blog.csdn.net/abcjennifer/article/details/22393197

[4]https://www.cnblogs.com/StevenSun1991/p/6931500.html

[5]https://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop2/index.html

 


 
