Running the Built-in WordCount in Pseudo-Distributed Mode

qq_33890533

Category: Big Data. Tags: Big Data, MapReduce

Article link: https://blog.csdn.net/qq_33890533/article/details/91395726

1. Getting to Know Hadoop's Official Example Package

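The local directory "$HADOOP_HOME/share/hadoop/mapreduce" on the cluster server contains the example package hadoop-mapreduce-examples-2.6.5.jar. This package bundles several commonly used test programs, listed in the table below.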

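Program name	Description
multifilewc	Counts the words across multiple files
pi	Estimates the value of π using a quasi-Monte Carlo method
randomtextwriter	Writes 10 GB of random text data on each data node
wordcount	Counts how often each word occurs in the input files
wordmean	Computes the average length of the words in the input files
wordmedian	Computes the median length of the words in the input files
wordstandarddeviation	Computes the standard deviation of the word lengths in the input files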
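
The table shows only a selection. As a minimal sketch, the full list of programs can be printed by running the jar without a program name, in which case the jar's driver reports the valid names. The helper name list_examples is hypothetical, and the default path assumes the Hadoop 2.6.5 layout used in this article.

```shell
# Sketch: print the program names registered in the examples jar.
# list_examples is a hypothetical helper name, not part of Hadoop.
list_examples() {
    # Default jar path assumes the Hadoop 2.6.5 layout described above
    local jar="${1:-$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar}"
    # With no program name given, the driver lists the valid program names
    hadoop jar "$jar"
}

# Usage on the cluster:
#   list_examples
```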

2. Starting the Hadoop Services

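Start the Hadoop daemons with the command start-all.sh.

Create the input directory in HDFS with the command hadoop fs -mkdir, then create the test file test.txt with a text editor.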

[root@hadoop ~]# hadoop fs -mkdir  /input
[root@hadoop ~]# vim /usr/test.txt

Copy the file test.txt from the /usr directory of the Linux system into the /input directory in HDFS.

[root@hadoop ~]# hadoop fs -put /usr/test.txt  /input
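
The directory creation and upload steps can be combined into one repeatable helper. The following is a sketch only: stage_input is a hypothetical name, and the -p and -f options are used so that rerunning it does not fail when the directory or file already exists.

```shell
# Sketch: stage a local file into an HDFS directory and list the result.
# stage_input is a hypothetical helper name, not part of Hadoop.
stage_input() {
    local src="$1" dest_dir="$2"
    # -p: create parent directories; no error if the directory already exists
    hadoop fs -mkdir -p "$dest_dir" || return 1
    # -f: overwrite the destination file if it is already present
    hadoop fs -put -f "$src" "$dest_dir" || return 1
    # List the directory to confirm that the file arrived
    hadoop fs -ls "$dest_dir"
}

# Usage on the cluster:
#   stage_input /usr/test.txt /input
```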

3. Submitting the MapReduce Job to the Cluster

MapReduce jobs are usually submitted with the hadoop jar command. Its basic usage is as follows.

[root@hadoop Desktop]# hadoop  jar <jar>  [mainclass]  args

The command for the actual job in this article is shown next, followed by an explanation of each argument in turn.

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar \
wordcount /input/test.txt /output

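Notes:
hadoop-mapreduce-examples-2.6.5.jar: the example package provided by Hadoop, which includes the word-frequency program (wordcount).
wordcount: the name of the program to run; the package's driver maps this name to its WordCount main class.
/input/test.txt: the input file on HDFS.
/output: the output directory on HDFS. Do not create this directory beforehand; the job creates it itself.
To run the WordCount program in hadoop-mapreduce-examples-2.6.5.jar, change to the $HADOOP_HOME/share/hadoop/mapreduce directory and execute the command.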

[root@hadoop ~]# cd /usr/hadoop/hadoop-2.6.5/share/hadoop/mapreduce
[root@hadoop mapreduce]# hadoop jar hadoop-mapreduce-examples-2.6.5.jar wordcount /input/test.txt /output

Note: the /output directory must not already exist in HDFS; otherwise the job will fail to run.
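
Because a leftover output directory stops the job, rerunning it usually means deleting that directory first. The sketch below wraps both steps. run_wordcount and EXAMPLES_JAR are hypothetical names, the fallback path assumes the installation directory used in this article, and the deletion is permanent, so it should only be pointed at output that is safe to lose.

```shell
# Sketch: submit wordcount after clearing a stale output directory.
# run_wordcount and EXAMPLES_JAR are hypothetical names; the fallback path
# assumes the /usr/hadoop/hadoop-2.6.5 installation used in this article.
EXAMPLES_JAR="${HADOOP_HOME:-/usr/hadoop/hadoop-2.6.5}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar"

run_wordcount() {
    local input="$1" output="$2"
    # hadoop fs -test -d exits with status 0 when the path is an existing directory
    if hadoop fs -test -d "$output"; then
        echo "Removing existing output directory $output" >&2
        # Destructive: deletes the previous results recursively
        hadoop fs -rm -r "$output" || return 1
    fi
    hadoop jar "$EXAMPLES_JAR" wordcount "$input" "$output"
}

# Usage on the cluster:
#   run_wordcount /input/test.txt /output
```
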
After the job finishes, the console displays the following information:

18/12/24 14:08:34 INFO client.RMProxy: Connecting to ResourceManager at hadoop/192.168.142.139:8032
18/12/24 14:08:35 INFO input.FileInputFormat: Total input paths to process : 1
18/12/24 14:08:35 INFO mapreduce.JobSubmitter: number of splits:1
18/12/24 14:08:36 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1545640521748_0002
18/12/24 14:08:38 INFO impl.YarnClientImpl: Submitted application application_1545640521748_0002
18/12/24 14:08:38 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1545640521748_0002/
18/12/24 14:08:38 INFO mapreduce.Job: Running job: job_1545640521748_0002
18/12/24 14:08:56 INFO mapreduce.Job: Job job_1545640521748_0002 running in uber mode : false
18/12/24 14:08:56 INFO mapreduce.Job:  map 0% reduce 0%
18/12/24 14:09:12 INFO mapreduce.Job:  map 100% reduce 0%
18/12/24 14:09:25 INFO mapreduce.Job:  map 100% reduce 100%
18/12/24 14:09:26 INFO mapreduce.Job: Job job_1545640521748_0002 completed successfully
18/12/24 14:09:27 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=80
		FILE: Number of bytes written=214897
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=157
		HDFS: Number of bytes written=50
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=13473
		Total time spent by all reduces in occupied slots (ms)=9366
		Total time spent by all map tasks (ms)=13473
		Total time spent by all reduce tasks (ms)=9366
		Total vcore-milliseconds taken by all map tasks=13473
		Total vcore-milliseconds taken by all reduce tasks=9366
		Total megabyte-milliseconds taken by all map tasks=13796352
		Total megabyte-milliseconds taken by all reduce tasks=9590784
	Map-Reduce Framework
		Map input records=4
		Map output records=12
		Map output bytes=107
		Map output materialized bytes=80
		Input split bytes=98
		Combine input records=12
		Combine output records=6
		Reduce input groups=6
		Reduce shuffle bytes=80
		Reduce input records=6
		Reduce output records=6
		Spilled Records=12
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=148
		CPU time spent (ms)=1680
		Physical memory (bytes) snapshot=305659904
		Virtual memory (bytes) snapshot=1683111936
		Total committed heap usage (bytes)=136122368
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=59
	File Output Format Counters 
		Bytes Written=50

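After the job completes, two new files appear in the output directory /output/: _SUCCESS, a marker file indicating that the job completed successfully, and part-r-00000, the result file produced by the job.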
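Copy the part-r-00000 file from HDFS to the local Linux file system and display its contents.

The contents of part-r-00000 have two columns: the first column is the word, and the second is the number of times that word occurs.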
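
One way to sanity-check the two-column result is to reproduce the count locally with standard Unix tools and compare it with the job output. This is a sketch under assumptions, not how the Hadoop job is implemented: local_wordcount is a hypothetical helper, it assumes words are separated by whitespace, and it sorts in byte order so that the ordering is comparable with the job's output for plain ASCII input.

```shell
# Sketch: count words in a local file with standard Unix tools.
# local_wordcount is a hypothetical helper name, not a Hadoop command.
local_wordcount() {
    # 1. Put every whitespace-separated word on its own line
    # 2. Drop any empty line left by leading whitespace
    # 3. Sort in byte order so identical words are adjacent
    # 4. Count each run of identical words
    # 5. Print "word<TAB>count"
    tr -s ' \t\r\f' '\n' < "$1" \
        | grep -v '^$' \
        | LC_ALL=C sort \
        | uniq -c \
        | awk '{ print $2 "\t" $1 }'
}

# On the cluster, after the job has finished, the two results can be compared
# (diff prints nothing when they agree):
#   hadoop fs -get /output/part-r-00000 ./part-r-00000
#   local_wordcount /usr/test.txt | diff - ./part-r-00000
```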
