A MapReduce Program Example, Details Determine Success or Failure (2): Observing the Log and Counters

Below is a MapReduce program that counts the number of occurrences of each single character a~z in the input files.

package wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

public class MyWordCountJob extends Configured implements Tool {
        Logger log = Logger.getLogger(MyWordCountJob.class);

        public static class MyWordCountMapper extends
                        Mapper<LongWritable, Text, Text, IntWritable> {
                Logger log = Logger.getLogger(MyWordCountMapper.class);

                // Reuse a single key and value object across map() calls
                // instead of allocating new Writables for every record.
                Text mapKey = new Text();
                IntWritable mapValue = new IntWritable(1);

                @Override
                protected void map(LongWritable key, Text value, Context context)
                                throws IOException, InterruptedException {
                        // Emit (character, 1) for every letter a~z in the line.
                        for(char c : value.toString().toLowerCase().toCharArray()){
                                if(c >= 'a' && c <= 'z'){
                                        mapKey.set(String.valueOf(c));
                                        context.write(mapKey, mapValue);
                                }
                        }
                }
        }


        public static class MyWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
                IntWritable rvalue = new IntWritable();

                @Override
                protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                                throws IOException, InterruptedException {
                        // Sum the 1s emitted by the mappers for this character.
                        int n = 0;
                        for(IntWritable value : values){
                                n += value.get();
                        }
                        rvalue.set(n);
                        context.write(key, rvalue);
                }
        }

        @Override
        public int run(String[] args) throws Exception {
                // Validate the parameters: expect <input path> <output path>.
                if(args.length != 2){
                        return -1;
                }

                Job job = Job.getInstance(getConf(), "MyWordCountJob");
                job.setJarByClass(MyWordCountJob.class);

                Path inPath = new Path(args[0]);
                Path outPath = new Path(args[1]);

                // Remove any previous output so a rerun does not fail.
                outPath.getFileSystem(getConf()).delete(outPath, true);
                TextInputFormat.setInputPaths(job, inPath);
                TextOutputFormat.setOutputPath(job, outPath);

                job.setMapperClass(MyWordCountJob.MyWordCountMapper.class);
                job.setReducerClass(MyWordCountJob.MyWordCountReducer.class);
                job.setInputFormatClass(TextInputFormat.class);
                job.setOutputFormatClass(TextOutputFormat.class);

                job.setMapOutputKeyClass(Text.class);
                job.setMapOutputValueClass(IntWritable.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                return job.waitForCompletion(true) ? 0 : 1;
        }
        public static void main(String [] args){
                // Default to a non-zero exit code so an exception in
                // ToolRunner.run() is not reported as success.
                int result = 1;
                try {
                        result = ToolRunner.run(new Configuration(), new MyWordCountJob(), args);
                } catch (Exception e) {
                        e.printStackTrace();
                }
                System.exit(result);
        }

}
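
For reference, one common way to compile and package the job before running it (a hedged sketch: the classes output directory is illustrative; `hadoop classpath` prints the Hadoop client jars, and jar's e flag writes the main class into the manifest so the jar can be launched without naming the class on the command line):

javac -classpath `hadoop classpath` -d classes wordcount/MyWordCountJob.java
jar cvfe mywordcount.jar wordcount.MyWordCountJob -C classes .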

Input files:

[train@sandbox MyWordCount]$ hdfs dfs -ls mrdemo
Found 3 items
-rw-r--r-- 3 train hdfs 34 2016-05-11 01:41 mrdemo/demoinput1.txt
-rw-r--r-- 3 train hdfs 42 2016-05-11 01:41 mrdemo/demoinput2.txt
-rw-r--r-- 3 train hdfs 81 2016-05-11 01:41 mrdemo/demoinput3.txt

[train@sandbox MyWordCount]$ hdfs dfs -cat mrdemo/*input*.txt
hello world
how are you
i am hero
what is your name
where are you come from
abcdefghijklmnopqrsturwxyz
abcdefghijklmnopqrsturwxyz
abcdefghijklmnopqrsturwxyz

Running the MR job

First, take a look at the result file. The count for each character comes out as expected (ps: this is not the main point). Note in passing that there is no row for 'v': each of the three alphabet lines in the input contains an 'r' where 'v' would be, which is also why 'r' totals 13.

a 8
b 3
c 4
d 4
e 11
f 4
g 3
h 8
i 5
j 3
k 3
l 6
m 7
n 4
o 12
p 3
q 3
r 13
s 4
t 4
u 6
w 7
x 3
y 6
z 3

Now let's look at the noteworthy parts of the run log (the lines boxed with dashes are the real point).

[train@sandbox MyWordCount]$ hadoop jar mywordcount.jar mrdemo/ mrdemo/output
16/05/11 04:00:45 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/192.168.252.131:8050
-----------------------------------------------------------------------------------
--16/05/11 04:00:46 INFO input.FileInputFormat: Total input paths to process : 3 --
--16/05/11 04:00:46 INFO mapreduce.JobSubmitter: number of splits:3              --
-----------------------------------------------------------------------------------
16/05/11 04:00:46 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
16/05/11 04:00:46 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
16/05/11 04:00:46 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
16/05/11 04:00:46 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
16/05/11 04:00:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1462517728035_0048
16/05/11 04:00:47 INFO impl.YarnClientImpl: Submitted application application_1462517728035_0048 to ResourceManager at sandbox.hortonworks.com/192.168.252.131:8050
16/05/11 04:00:47 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1462517728035_0048/
16/05/11 04:00:47 INFO mapreduce.Job: Running job: job_1462517728035_0048
16/05/11 04:00:55 INFO mapreduce.Job: Job job_1462517728035_0048 running in uber mode : false
16/05/11 04:00:55 INFO mapreduce.Job: map 0% reduce 0%
16/05/11 04:01:10 INFO mapreduce.Job: map 33% reduce 0%
16/05/11 04:01:11 INFO mapreduce.Job: map 100% reduce 0%
16/05/11 04:01:19 INFO mapreduce.Job: map 100% reduce 100%
16/05/11 04:01:19 INFO mapreduce.Job: Job job_1462517728035_0048 completed successfully
16/05/11 04:01:19 INFO mapreduce.Job: Counters: 43
        File System Counters
                FILE: Number of bytes read=1102
                FILE: Number of bytes written=339257
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=556
                HDFS: Number of bytes written=103
                HDFS: Number of read operations=12
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters 
-----------------------------------------------------------------------------------
-               Launched map tasks=3                                              -
-               Launched reduce tasks=1                                           -
-----------------------------------------------------------------------------------
                Data-local map tasks=3
                Total time spent by all maps in occupied slots (ms)=314904
                Total time spent by all reduces in occupied slots (ms)=34648
        Map-Reduce Framework
                Map input records=8
                Map output records=137
                Map output bytes=822
                Map output materialized bytes=1114
                Input split bytes=399
                Combine input records=0
                Combine output records=0
                Reduce input groups=25
-----------------------------------------------------------------------------------
                Reduce shuffle bytes=1114
                Reduce input records=137
                Reduce output records=25
                Spilled Records=274
                Shuffled Maps =3
-----------------------------------------------------------------------------------
                Failed Shuffles=0
                Merged Map outputs=3
                GC time elapsed (ms)=241
                CPU time spent (ms)=3340
                Physical memory (bytes) snapshot=1106452480
                Virtual memory (bytes) snapshot=3980922880
                Total committed heap usage (bytes)=884604928
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=157
        File Output Format Counters 
                Bytes Written=103

From the log we can learn: the ID of this job is job_1462517728035_0048. There are 3 splits reading the input files, 3 mapper tasks, and 1 reducer task.
The map output record count is 137, and the reduce input record count is also 137. In other words, all 137 of these records were transferred over the network to the reducer task.
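
Beyond reading them from the console output, these counters can also be fetched programmatically once the job finishes. A minimal sketch, not part of the program above: TaskCounter and JobCounter are the standard Hadoop enums behind these log lines, and logging them through log.info() is just one illustrative choice. The last line of run() could be expanded to:

                // Wait for the job, then read back the built-in counters.
                boolean ok = job.waitForCompletion(true);
                Counters counters = job.getCounters();
                log.info("Launched map tasks = "
                                + counters.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue());
                log.info("Map output records = "
                                + counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue());
                log.info("Reduce input records = "
                                + counters.findCounter(TaskCounter.REDUCE_INPUT_RECORDS).getValue());
                return ok ? 0 : 1;

(Counters, JobCounter, and TaskCounter live in org.apache.hadoop.mapreduce and would need to be imported.)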
In the next article, we will use a combiner to optimize this MapReduce job.
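
As a quick preview: because the reduce logic here just sums IntWritables, an associative and commutative operation, the standard pattern is to reuse the reducer class as the combiner, so each mapper pre-aggregates its at most 26 distinct keys locally before anything is shuffled across the network. One added line in run() is enough:

                // Pre-aggregate map output locally before the shuffle;
                // safe here because summation is associative and commutative.
                job.setCombinerClass(MyWordCountJob.MyWordCountReducer.class);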
