Writing Your Own WordCount Program in MapReduce

1. Prepare the data file and upload it to HDFS; the code and commands below read it from /input/word.txt

wordcount.txt

Hello Hadoop
Hello BigData
Hello Spark
Hello Flume
Hello Kafka

 

2. Write the WordCount code

The program takes three arguments: the job name, the input data path, and the output path.

package ls.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

public class WordCount {
	
	public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

		private final static IntWritable one = new IntWritable(1);
		private Text word = new Text();
		
		@Override
		public void map(Object key, Text value, Context context)
		        throws IOException, InterruptedException {
		    StringTokenizer itr = new StringTokenizer(value.toString());
		    while (itr.hasMoreTokens()) {
		        word.set(itr.nextToken());
		        context.write(word, one);
		    }
		}
	}
	 public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
		 private IntWritable result = new IntWritable();
		
		 @Override
		 public void reduce(Text key, Iterable<IntWritable> values,
		         Context context) throws IOException, InterruptedException {
		     int sum = 0;
		     for (IntWritable val : values) {
		         sum += val.get();
		     }
		     result.set(sum);
		     context.write(key, result);
		 }
	 }
	
	 public static void main(String[] args) throws Exception {
		// Fall back to default arguments when none are supplied;
		// assigning into a null or too-short array would throw an exception here.
		if (args == null || args.length < 3) {
			args = new String[] { "wordcount", "/input/word.txt", "/output/wordcountpara1" };
		}
	        Configuration conf = new Configuration();
	        Job job = Job.getInstance(conf, args[0]);
	        job.setJarByClass(WordCount.class);
	        job.setMapperClass(TokenizerMapper.class);
	        job.setCombinerClass(IntSumReducer.class);
	        job.setReducerClass(IntSumReducer.class);
	        job.setOutputKeyClass(Text.class);
	        job.setOutputValueClass(IntWritable.class);
	        // NLineInputFormat creates one split per input line by default,
	        // so the 5-line sample file produces 5 map tasks
	        job.setInputFormatClass(NLineInputFormat.class);
	        // input path
	        FileInputFormat.addInputPath(job, new Path(args[1]));
	        // output path (must not already exist on HDFS)
	        FileOutputFormat.setOutputPath(job, new Path(args[2]));
	        System.exit(job.waitForCompletion(true) ? 0 : 1);
	    }
}
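Before packaging, the tokenize-and-sum logic can be sanity-checked locally without a cluster. The following is a minimal plain-Java sketch (no Hadoop dependencies; the class name WordCountLocal is made up for illustration) that mirrors the Mapper's StringTokenizer split and the Reducer's sum over the sample lines:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class WordCountLocal {
    // Replicates map (tokenize, emit 1) + combine/reduce (sum) from the job above.
    static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            // Same whitespace tokenization the Mapper uses
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Hello Hadoop", "Hello BigData", "Hello Spark",
            "Hello Flume", "Hello Kafka"
        };
        // Prints each word with its count, tab-separated (like the job output)
        count(lines).forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}
```

On the sample file this yields 6 distinct words with Hello counted 5 times, consistent with the job counters shown later (Map output records=10, Reduce output records=6).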

3. Build a jar and copy it to the server's local filesystem (it does not need to be uploaded to HDFS)

Run Maven's package phase; add the following to pom.xml. The key part is declaring the main-class entry point (`<mainClass>ls.wordcount.WordCount</mainClass>`):

 <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <configuration>

                    <archive>
                        <manifest>
                            <mainClass>ls.wordcount.WordCount</mainClass>
                            <addClasspath>true</addClasspath>
                            <classpathPrefix>lib/</classpathPrefix>
                        </manifest>

                    </archive>
                    <classesDirectory>
                    </classesDirectory>
                </configuration>
            </plugin>
        </plugins>
    </build>

4. Run


[root@node1 wordcount]# hadoop jar /root/mapreduce_learn/wordcount/ls-hadoop-1.0-SNAPSHOT.jar wordcountpara /input/word.txt   /output/wordcountpara1.txt
18/09/09 10:54:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/09/09 10:54:50 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.254.101:8032
18/09/09 10:54:51 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/09/09 10:54:52 INFO input.FileInputFormat: Total input paths to process : 1
18/09/09 10:54:52 INFO mapreduce.JobSubmitter: number of splits:5
18/09/09 10:54:52 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1536504640893_0001
18/09/09 10:54:53 INFO impl.YarnClientImpl: Submitted application application_1536504640893_0001
18/09/09 10:54:53 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1536504640893_0001/
18/09/09 10:54:53 INFO mapreduce.Job: Running job: job_1536504640893_0001
18/09/09 10:55:01 INFO mapreduce.Job: Job job_1536504640893_0001 running in uber mode : false
18/09/09 10:55:01 INFO mapreduce.Job:  map 0% reduce 0%
18/09/09 10:55:13 INFO mapreduce.Job:  map 20% reduce 0%
18/09/09 10:55:14 INFO mapreduce.Job:  map 40% reduce 0%
18/09/09 10:55:21 INFO mapreduce.Job:  map 60% reduce 0%
18/09/09 10:55:23 INFO mapreduce.Job:  map 100% reduce 0%
18/09/09 10:55:24 INFO mapreduce.Job:  map 100% reduce 100%
18/09/09 10:55:25 INFO mapreduce.Job: Job job_1536504640893_0001 completed successfully
18/09/09 10:55:25 INFO mapreduce.Job: Counters: 50
	File System Counters
		FILE: Number of bytes read=129
		FILE: Number of bytes written=711389
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=669
		HDFS: Number of bytes written=51
		HDFS: Number of read operations=18
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Killed map tasks=1
		Launched map tasks=5
		Launched reduce tasks=1
		Other local map tasks=5
		Total time spent by all maps in occupied slots (ms)=74163
		Total time spent by all reduces in occupied slots (ms)=7315
		Total time spent by all map tasks (ms)=74163
		Total time spent by all reduce tasks (ms)=7315
		Total vcore-milliseconds taken by all map tasks=74163
		Total vcore-milliseconds taken by all reduce tasks=7315
		Total megabyte-milliseconds taken by all map tasks=75942912
		Total megabyte-milliseconds taken by all reduce tasks=7490560
	Map-Reduce Framework
		Map input records=5
		Map output records=10
		Map output bytes=103
		Map output materialized bytes=153
		Input split bytes=485
		Combine input records=10
		Combine output records=10
		Reduce input groups=6
		Reduce shuffle bytes=153
		Reduce input records=10
		Reduce output records=6
		Spilled Records=20
		Shuffled Maps =5
		Failed Shuffles=0
		Merged Map outputs=5
		GC time elapsed (ms)=1767
		CPU time spent (ms)=3720
		Physical memory (bytes) snapshot=886165504
		Virtual memory (bytes) snapshot=2868084736
		Total committed heap usage (bytes)=618680320
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=184
	File Output Format Counters 
		Bytes Written=51
[root@node1 wordcount]# 
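The `number of splits:5` line in the log comes from `NLineInputFormat`: by default it creates one split per line of input, so the 5-line file yields 5 map tasks (one of which was re-launched, hence `Launched map tasks=5` plus `Killed map tasks=1`). A rough sketch of that grouping rule follows; `NLineSplitSketch` is a hypothetical illustration, not Hadoop's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class NLineSplitSketch {
    // Groups input lines into splits of linesPerSplit lines each,
    // mimicking NLineInputFormat's default of 1 line per split.
    static List<List<String>> split(List<String> lines, int linesPerSplit) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerSplit) {
            splits.add(lines.subList(i, Math.min(i + linesPerSplit, lines.size())));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Hello Hadoop", "Hello BigData",
                "Hello Spark", "Hello Flume", "Hello Kafka");
        // One split per line -> 5 map tasks, as in the job log
        System.out.println(split(lines, 1).size());
    }
}
```

In a real job the lines-per-split count is controlled by the `mapreduce.input.lineinputformat.linespermap` configuration property; raising it reduces the number of map tasks for small line-oriented files.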

 

 
