Writing a MapReduce program (previous installment: http://blog.itpub.net/30066956/viewspace-2107549/)

Below is a MapReduce program that counts the occurrences of each single character a–z in the input files.
package wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

public class MyWordCountJob extends Configured implements Tool {
    Logger log = Logger.getLogger(MyWordCountJob.class);

    public static class MyWordCountMapper extends
            Mapper<LongWritable, Text, Text, IntWritable> {
        Logger log = Logger.getLogger(MyWordCountMapper.class);

        Text mapKey = new Text();
        IntWritable mapValue = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (letter, 1) for every a-z character in the line.
            for (char c : value.toString().toLowerCase().toCharArray()) {
                if (c >= 'a' && c <= 'z') {
                    mapKey.set(String.valueOf(c));
                    context.write(mapKey, mapValue);
                }
            }
        }
    }

    public static class MyWordCountReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        IntWritable rvalue = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all the 1s emitted for this letter.
            int n = 0;
            for (IntWritable value : values) {
                n += value.get();
            }
            rvalue.set(n);
            context.write(key, rvalue);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        // Validate the parameters.
        if (args.length != 2) {
            return -1;
        }

        Job job = Job.getInstance(getConf(), "MyWordCountJob");
        job.setJarByClass(MyWordCountJob.class);

        Path inPath = new Path(args[0]);
        Path outPath = new Path(args[1]);

        // Remove any previous output so the job can be rerun.
        outPath.getFileSystem(getConf()).delete(outPath, true);
        TextInputFormat.setInputPaths(job, inPath);
        TextOutputFormat.setOutputPath(job, outPath);

        job.setMapperClass(MyWordCountJob.MyWordCountMapper.class);
        job.setReducerClass(MyWordCountJob.MyWordCountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        // Default to a nonzero exit code so failures are visible to the shell.
        int result = 1;
        try {
            result = ToolRunner.run(new Configuration(), new MyWordCountJob(), args);
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(result);
    }
}
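The core of the mapper (keep only a–z after lower-casing, emit a 1 per character) and of the reducer (sum per key) can be sanity-checked outside Hadoop. The following is a plain-Java sketch of that same logic with no Hadoop dependencies; the class and method names are mine, not part of the job above:

```java
import java.util.TreeMap;

public class CharCountCheck {

    // Mirrors the mapper's filter (keep a-z after lower-casing)
    // followed by the reducer's per-key summation.
    public static TreeMap<Character, Integer> countLetters(String[] lines) {
        TreeMap<Character, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (char c : line.toLowerCase().toCharArray()) {
                if (c >= 'a' && c <= 'z') {
                    counts.merge(c, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        TreeMap<Character, Integer> counts =
                countLetters(new String[] {"hello world", "how are you"});
        System.out.println(counts);
    }
}
```

Running this on a couple of the demo lines gives, for instance, l=3 and o=4, which is a quick way to convince yourself the filter and the sum behave as intended before submitting the real job.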
Input files:

[train@sandbox MyWordCount]$ hdfs dfs -ls mrdemo
Found 3 items
-rw-r--r--   3 train hdfs         34 2016-05-11 01:41 mrdemo/demoinput1.txt
-rw-r--r--   3 train hdfs         42 2016-05-11 01:41 mrdemo/demoinput2.txt
-rw-r--r--   3 train hdfs         81 2016-05-11 01:41 mrdemo/demoinput3.txt
[train@sandbox MyWordCount]$ hdfs dfs -cat mrdemo/*input*.txt
hello world
how are you
i am hero
what is your name
where are you come from
abcdefghijklmnopqrsturwxyz
abcdefghijklmnopqrsturwxyz
abcdefghijklmnopqrsturwxyz
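The file sizes reported by `hdfs dfs -ls` (34, 42, and 81 bytes) follow directly from the lines above, assuming ASCII content and one trailing `\n` per line. A small check (class and method names are mine):

```java
public class InputSizeCheck {

    // Byte length of a file made of the given lines, with a single
    // trailing '\n' after each (ASCII, so chars == bytes).
    public static int fileSize(String... lines) {
        int total = 0;
        for (String line : lines) {
            total += line.length() + 1; // +1 for the newline
        }
        return total;
    }

    public static void main(String[] args) {
        int f1 = fileSize("hello world", "how are you", "i am hero");
        int f2 = fileSize("what is your name", "where are you come from");
        int f3 = fileSize("abcdefghijklmnopqrsturwxyz",
                          "abcdefghijklmnopqrsturwxyz",
                          "abcdefghijklmnopqrsturwxyz");
        System.out.println(f1 + " " + f2 + " " + f3 + " " + (f1 + f2 + f3));
    }
}
```

The total, 34 + 42 + 81 = 157 bytes, will reappear later in the job log as the File Input Format counter `Bytes Read=157`.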
Running the MR job

First, the result file. As expected, it contains the count of each character (note: this is not the main point):
a 8
b 3
c 4
d 4
e 11
f 4
g 3
h 8
i 5
j 3
k 3
l 6
m 7
n 4
o 12
p 3
q 3
r 13
s 4
t 4
u 6
w 7
x 3
y 6
z 3
Now look at the run log, paying attention to the highlighted parts (this is the main point):
[train@sandbox MyWordCount]$ hadoop jar mywordcount.jar mrdemo/ mrdemo/output
16/05/11 04:00:45 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/192.168.252.131:8050
16/05/11 04:00:46 INFO input.FileInputFormat: Total input paths to process : 3
16/05/11 04:00:46 INFO mapreduce.JobSubmitter: number of splits:3
16/05/11 04:00:46 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
16/05/11 04:00:46 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
16/05/11 04:00:46 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
16/05/11 04:00:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1462517728035_0048
16/05/11 04:00:47 INFO impl.YarnClientImpl: Submitted application application_1462517728035_0048 to ResourceManager at sandbox.hortonworks.com/192.168.252.131:8050
16/05/11 04:00:47 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1462517728035_0048/
16/05/11 04:00:47 INFO mapreduce.Job: Running job: job_1462517728035_0048
16/05/11 04:00:55 INFO mapreduce.Job: Job job_1462517728035_0048 running in uber mode : false
16/05/11 04:00:55 INFO mapreduce.Job:  map 0% reduce 0%
16/05/11 04:01:10 INFO mapreduce.Job:  map 33% reduce 0%
16/05/11 04:01:11 INFO mapreduce.Job:  map 100% reduce 0%
16/05/11 04:01:19 INFO mapreduce.Job:  map 100% reduce 100%
16/05/11 04:01:19 INFO mapreduce.Job: Job job_1462517728035_0048 completed successfully
16/05/11 04:01:19 INFO mapreduce.Job: Counters: 43
        File System Counters
                FILE: Number of bytes read=1102
                FILE: Number of bytes written=339257
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=556
                HDFS: Number of bytes written=103
                HDFS: Number of read operations=12
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=3
                Launched reduce tasks=1
                Data-local map tasks=3
                Total time spent by all maps in occupied slots (ms)=314904
                Total time spent by all reduces in occupied slots (ms)=34648
        Map-Reduce Framework
                Map input records=8
                Map output records=137
                Map output bytes=822
                Map output materialized bytes=1114
                Input split bytes=399
                Combine input records=0
                Combine output records=0
                Reduce input groups=25
                Reduce shuffle bytes=1114
                Reduce input records=137
                Reduce output records=25
                Spilled Records=274
                Shuffled Maps =3
                Failed Shuffles=0
                Merged Map outputs=3
                GC time elapsed (ms)=241
                CPU time spent (ms)=3340
                Physical memory (bytes) snapshot=1106452480
                Virtual memory (bytes) snapshot=3980922880
                Total committed heap usage (bytes)=884604928
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=157
        File Output Format Counters
                Bytes Written=103
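Where the 137 map output records come from can be reproduced off-line. Assuming each of the three files becomes exactly one split (consistent with the 3 splits and 3 data-local map tasks in the log), each mapper emits one record per kept a–z character; a per-mapper combiner would instead forward one record per distinct letter in that split. This plain-Java sketch (names mine) computes both totals:

```java
import java.util.HashSet;
import java.util.Set;

public class ShuffleVolume {

    // Records one mapper emits: one per a-z character in its split.
    public static int mapOutputRecords(String[] split) {
        int n = 0;
        for (String line : split) {
            for (char c : line.toLowerCase().toCharArray()) {
                if (c >= 'a' && c <= 'z') n++;
            }
        }
        return n;
    }

    // Records the same mapper would forward through a combiner:
    // one per distinct letter in its split.
    public static int combinedRecords(String[] split) {
        Set<Character> distinct = new HashSet<>();
        for (String line : split) {
            for (char c : line.toLowerCase().toCharArray()) {
                if (c >= 'a' && c <= 'z') distinct.add(c);
            }
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        String[][] splits = {
            {"hello world", "how are you", "i am hero"},
            {"what is your name", "where are you come from"},
            {"abcdefghijklmnopqrsturwxyz", "abcdefghijklmnopqrsturwxyz",
             "abcdefghijklmnopqrsturwxyz"}
        };
        int raw = 0, combined = 0;
        for (String[] split : splits) {
            raw += mapOutputRecords(split);
            combined += combinedRecords(split);
        }
        System.out.println(raw + " -> " + combined); // 137 -> 52
    }
}
```

The 137 matches the `Map output records` counter in the log above; the 52 is only an estimate of what per-mapper combining could achieve on this input, not a number taken from the log.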
From the log we can see that our job's id is job_1462517728035_0048, that the input was read as 3 splits, and that 3 map tasks and 1 reduce task were launched. The map output record count is 137, and the reduce input record count is also 137; in other words, all 137 records were shipped over the network to the reducer task. In the next article, we will use a combiner to optimize this MapReduce job.

A MapReduce Program Example, Details Determine Success or Failure (3): Combiner
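Since this job's reduce function is a plain integer sum, which is associative and commutative, the reducer class can in principle double as the combiner. As a sketch only (whether the next article does exactly this is not shown here), the change would be one extra line inside `run()`:

```java
// Inside run(), next to setMapperClass/setReducerClass:
// run partial sums on each mapper's output before the shuffle.
job.setCombinerClass(MyWordCountJob.MyWordCountReducer.class);
```

With this in place, the `Combine input records` and `Combine output records` counters, both 0 in the log above, would become nonzero, and far fewer than 137 records would cross the network.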
From "ITPUB Blog", link: http://blog.itpub.net/30066956/viewspace-2107875/. If reposting, please credit the source; otherwise legal responsibility may be pursued.