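To run WordCount on Hadoop, the program must be given an input path and an output path. The input path points to the text file whose word frequencies we want to count, which here is 20417.txt; the output path is where the word-frequency results are written. The figure below shows the argument configuration: WordCount.java -> right-click -> Run As -> Run Configurations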
The paths above are HDFS paths; the figure below shows where to look them up in HDFS:
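In Eclipse, whatever is entered as program arguments reaches the program as the `args` array of `main`. The following is a minimal sketch of how a driver might validate those two arguments before submitting a job. The class `WordCountArgs` is a hypothetical helper, not part of the original WordCount.java, and the host, port, and directories in the sample paths are placeholders rather than the values used in this walkthrough.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of the original WordCount.java.
final class WordCountArgs {
    final URI input;   // file (or directory) to count words in
    final URI output;  // directory the job writes its results to

    private WordCountArgs(URI input, URI output) {
        this.input = input;
        this.output = output;
    }

    // Expects exactly two program arguments: <input path> <output path>.
    static WordCountArgs parse(String[] args) {
        if (args == null || args.length != 2) {
            throw new IllegalArgumentException(
                "Usage: WordCount <input path> <output path>");
        }
        // URI.create throws IllegalArgumentException on a malformed path.
        return new WordCountArgs(URI.create(args[0]), URI.create(args[1]));
    }

    public static void main(String[] args) {
        // Placeholder host, port, and directories; substitute your cluster's values.
        WordCountArgs parsed = parse(new String[] {
            "hdfs://localhost:9000/user/hadoop/input/20417.txt",
            "hdfs://localhost:9000/user/hadoop/output"
        });
        System.out.println("input  = " + parsed.input);
        System.out.println("output = " + parsed.output);
    }
}
```

Failing fast on a wrong argument count gives a clear usage message instead of an `ArrayIndexOutOfBoundsException` later in the driver.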
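After entering the input and output paths in Figure 1, click Apply, but do not click Run at this point: Run here executes the program on the local machine, whereas we want to run it on the Hadoop cluster. Instead, do the following: WordCount.java -> right-click -> Run As -> Run on Hadoop.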
While the job runs, the console prints status messages such as the following:
11/10/09 14:07:50 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
11/10/09 14:07:50 INFO input.FileInputFormat: Total input paths to process : 1
11/10/09 14:07:50 INFO mapred.JobClient: Running job: job_201110091333_0001
11/10/09 14:07:51 INFO mapred.JobClient: map 0% reduce 0%
11/10/09 14:07:59 INFO mapred.JobClient: map 100% reduce 0%
11/10/09 14:08:12 INFO mapred.JobClient: map 100% reduce 100%
11/10/09 14:08:14 INFO mapred.JobClient: Job complete: job_201110091333_0001
11/10/09 14:08:14 INFO mapred.JobClient: Counters: 17
11/10/09 14:08:14 INFO mapred.JobClient: Job Counters
11/10/09 14:08:14 INFO mapred.JobClient: Launched reduce tasks=1
11/10/09 14:08:14 INFO mapred.JobClient: Launched map tasks=1
11/10/09 14:08:14 INFO mapred.JobClient: Data-local map tasks=1
11/10/09 14:08:14 INFO mapred.JobClient: FileSystemCounters
11/10/09 14:08:14 INFO mapred.JobClient: FILE_BYTES_READ=143076
11/10/09 14:08:14 INFO mapred.JobClient: HDFS_BYTES_READ=674762
11/10/09 14:08:14 INFO mapred.JobClient: FILE_BYTES_WRITTEN=286184
11/10/09 14:08:14 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=205265
11/10/09 14:08:14 INFO mapred.JobClient: Map-Reduce Framework
11/10/09 14:08:14 INFO mapred.JobClient: Reduce input groups=0
11/10/09 14:08:14 INFO mapred.JobClient: Combine output records=10015
11/10/09 14:08:14 INFO mapred.JobClient: Map input records=12761
11/10/09 14:08:14 INFO mapred.JobClient: Reduce shuffle bytes=0
11/10/09 14:08:14 INFO mapred.JobClient: Reduce output records=0
11/10/09 14:08:14 INFO mapred.JobClient: Spilled Records=20030
11/10/09 14:08:14 INFO mapred.JobClient: Map output bytes=1082004
11/10/09 14:08:14 INFO mapred.JobClient: Combine input records=112607
11/10/09 14:08:14 INFO mapred.JobClient: Map output records=112607
11/10/09 14:08:14 INFO mapred.JobClient: Reduce input records=10015
11/10/09 14:08:14 INFO input.FileInputFormat: Total input paths to process : 1
11/10/09 14:08:14 INFO mapred.JobClient: Running job: job_201110091333_0002
11/10/09 14:08:15 INFO mapred.JobClient: map 0% reduce 0%
11/10/09 14:08:24 INFO mapred.JobClient: map 100% reduce 0%
11/10/09 14:08:36 INFO mapred.JobClient: map 100% reduce 100%
11/10/09 14:08:38 INFO mapred.JobClient: Job complete: job_201110091333_0002
11/10/09 14:08:38 INFO mapred.JobClient: Counters: 17
11/10/09 14:08:38 INFO mapred.JobClient: Job Counters
11/10/09 14:08:38 INFO mapred.JobClient: Launched reduce tasks=1
11/10/09 14:08:38 INFO mapred.JobClient: Launched map tasks=1
11/10/09 14:08:38 INFO mapred.JobClient: Data-local map tasks=1
11/10/09 14:08:38 INFO mapred.JobClient: FileSystemCounters
11/10/09 14:08:38 INFO mapred.JobClient: FILE_BYTES_READ=143076
11/10/09 14:08:38 INFO mapred.JobClient: HDFS_BYTES_READ=205265
11/10/09 14:08:38 INFO mapred.JobClient: FILE_BYTES_WRITTEN=286184
11/10/09 14:08:38 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=104533
11/10/09 14:08:38 INFO mapred.JobClient: Map-Reduce Framework
11/10/09 14:08:38 INFO mapred.JobClient: Reduce input groups=0
11/10/09 14:08:38 INFO mapred.JobClient: Combine output records=0
11/10/09 14:08:38 INFO mapred.JobClient: Map input records=10015
11/10/09 14:08:38 INFO mapred.JobClient: Reduce shuffle bytes=0
11/10/09 14:08:38 INFO mapred.JobClient: Reduce output records=0
11/10/09 14:08:38 INFO mapred.JobClient: Spilled Records=20030
11/10/09 14:08:38 INFO mapred.JobClient: Map output bytes=123040
11/10/09 14:08:38 INFO mapred.JobClient: Combine input records=0
11/10/09 14:08:38 INFO mapred.JobClient: Map output records=10015
11/10/09 14:08:38 INFO mapred.JobClient: Reduce input records=10015
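The counter lines in the log above end in a `name=value` pair, so they are easy to pull out programmatically. The sketch below is one possible way to do that with a regular expression; `JobCounterParser` is a hypothetical class written for this illustration, and the pattern assumes the exact line layout shown in this log rather than any documented format.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; assumes the log layout shown above.
final class JobCounterParser {
    // Captures the trailing "name=value" part of a JobClient counter line.
    private static final Pattern COUNTER =
        Pattern.compile("JobClient:\\s*(.+?)=(\\d+)\\s*$");

    // Returns counters in the order they appear; non-counter lines are skipped.
    static Map<String, Long> parse(List<String> logLines) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = COUNTER.matcher(line);
            if (m.find()) {
                counters.put(m.group(1).trim(), Long.parseLong(m.group(2)));
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        // Two lines copied from the first job's counters in the log above.
        Map<String, Long> counters = parse(Arrays.asList(
            "11/10/09 14:08:14 INFO mapred.JobClient: Combine output records=10015",
            "11/10/09 14:08:14 INFO mapred.JobClient: Combine input records=112607"));
        long in = counters.get("Combine input records");
        long out = counters.get("Combine output records");
        System.out.println("combiner input records:  " + in);
        System.out.println("combiner output records: " + out);
    }
}
```

For the first job, this makes the effect of the combiner visible: 112607 map output records were collapsed into 10015 records before the reduce phase.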
Once the run finishes, the word-frequency results appear in HDFS, as shown in the figure below:
The word-frequency results are stored in the file part-r-00000.
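To inspect the results outside Hadoop, the lines of part-r-00000 can be loaded into a map. The sketch below assumes each line has the form `word<TAB>count`, which is what Hadoop's default text output produces when the key is the word and the value is its count; if this job writes the fields in a different order or uses another separator, the parsing must be adjusted. `WordCountOutputReader` is a hypothetical name, and the sample lines in `main` are made up for illustration, not actual results from 20417.txt.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; assumes "word<TAB>count" lines.
final class WordCountOutputReader {
    // Parses result lines into a map that preserves the file's line order.
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip lines that are not in key<TAB>value form
            }
            String word = line.substring(0, tab);
            long count = Long.parseLong(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines; real input would be the lines of part-r-00000.
        List<String> sample = Arrays.asList("hadoop\t3", "hdfs\t1");
        for (Map.Entry<String, Long> e : parse(sample).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

In practice the lines could come from `java.nio.file.Files.readAllLines` after copying the file out of HDFS to the local disk.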