Breaking Down a MapReduce Program

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/wo198711203217/article/details/80519046

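A MapReduce job runs in two processing phases: the map phase and the reduce phase. Each phase takes key-value pairs as input and produces key-value pairs as output, and the programmer chooses their types. The programmer also has to override two functions: the map function and the reduce function.
With the default text input format, the input key of the map phase is the byte offset at which each line starts in the file, and the input value is the text of that line. The output key and value are defined by the programmer.
The rest of this post demonstrates this by counting the words in a text file.
Suppose we have the following text: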

[hadoop@hadoop1 ~]$ cat 1.txt
hello world
hello abc
hello good
hello mysql
hello oracle
hello hadoop
hello hdfs
hello yarn
hello namenode
hello datanode
hello resourcemanager
hello nodemanager
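
To make the map input concrete, here is a minimal plain-Java sketch (no Hadoop involved) that computes the (offset, line) pairs a mapper would receive for the first three lines of 1.txt. It assumes `\n` line endings and the default text input format; the class and method names are made up for illustration. Under the same assumption the whole file adds up to 163 bytes, which agrees with the `0+163` split reported in the job log further down.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: mimics the (key, value) pairs fed to map().
class LineOffsets {
    // Returns byte offset -> line text, in file order.
    static Map<Long, String> offsets(String content) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.put(offset, line);
            // Advance past the line's bytes plus the '\n' terminator (assumed line ending).
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "hello world\nhello abc\nhello good\n";
        // Prints offsets 0, 12 and 22 next to their lines.
        offsets(content).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```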

The overall MapReduce data flow is shown in the figure below:
[Figure: MapReduce data flow]
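
The same data flow can be approximated in plain Java, without Hadoop: map emits one (word, 1) pair per word, the framework groups the values by key in sorted key order, and reduce sums each group. The sketch below is only an in-memory illustration under those assumptions; `WordCountFlow` and its methods are hypothetical names, and a `TreeMap` stands in for the shuffle and sort step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the map -> shuffle -> reduce pipeline.
class WordCountFlow {
    // Map step emits (word, 1); the TreeMap plays the role of shuffle/sort by grouping per key.
    static Map<String, List<Integer>> mapAndShuffle(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                groups.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        return groups;
    }

    // Reduce step: called once per key in a real job; here we loop over the groups and sum.
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            int total = 0;
            for (int v : e.getValue()) {
                total += v;
            }
            totals.put(e.getKey(), total);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world", "hello abc", "hello good");
        Map<String, List<Integer>> groups = mapAndShuffle(lines);
        System.out.println(groups);         // {abc=[1], good=[1], hello=[1, 1, 1], world=[1]}
        System.out.println(reduce(groups)); // {abc=1, good=1, hello=3, world=1}
    }
}
```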
Now for the code.
The code directory structure is as follows:
[Figure: code directory structure]
WordCountMapper.java:

package com.wc;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

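public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused across calls so that no new Writable is allocated per word.
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key: byte offset of the line in the input file; value: the line itself.
        String line = value.toString();
        // Split on runs of whitespace so that repeated spaces do not yield empty tokens.
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank line
            }
            word.set(token);
            context.write(word, ONE); // emit (word, 1)
        }
    }
}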
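
A note on tokenization: splitting a line on a single literal space yields empty-string tokens whenever two words are separated by more than one space, and a mapper that emits every token would then count the empty string as a word. The sample file has exactly one space between words, so it is not affected. The plain-Java sketch below shows the difference; the class and method names are illustrative only.

```java
import java.util.Arrays;

// Hypothetical comparison of two ways to split a line into words.
class TokenizeSketch {
    // Splits on each single space; consecutive spaces produce empty tokens.
    static String[] bySingleSpace(String line) {
        return line.split(" ");
    }

    // Trims the line, then splits on runs of whitespace.
    static String[] byWhitespace(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String line = "hello  world"; // two spaces between the words
        System.out.println(Arrays.toString(bySingleSpace(line))); // [hello, , world]
        System.out.println(Arrays.toString(byWhitespace(line)));  // [hello, world]
    }
}
```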

WordCountReducer.java:

package com.wc;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // values holds every 1 emitted by the mappers for this word; sum them.
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        context.write(key, new IntWritable(total)); // emit (word, total)
    }
}

WordCount.java:

package com.wc;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // Lets Hadoop locate the jar that contains the job classes.
        job.setJarByClass(WordCount.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Key/value types emitted by the mapper.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Key/value types emitted by the reducer (the final output).
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output directories on HDFS; the output directory must not exist yet.
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.16.2:9000/wordcount/input"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.16.2:9000/wordcount/output"));

        // Submit the job, wait for it to finish, and exit with 0 on success.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run:
[Figure: running the job]
Output:

2018-05-31 11:42:41,929 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1129)) - session.id is deprecated. Instead, use dfs.metrics.session-id
2018-05-31 11:42:41,931 INFO  [main] jvm.JvmMetrics (JvmMetrics.java:init(76)) - Initializing JVM Metrics with processName=JobTracker, sessionId=
2018-05-31 11:42:42,186 WARN  [main] mapreduce.JobResourceUploader (JobResourceUploader.java:uploadFiles(64)) - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2018-05-31 11:42:42,188 WARN  [main] mapreduce.JobResourceUploader (JobResourceUploader.java:uploadFiles(171)) - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2018-05-31 11:42:42,239 INFO  [main] input.FileInputFormat (FileInputFormat.java:listStatus(281)) - Total input paths to process : 1
2018-05-31 11:42:42,268 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(199)) - number of splits:1
2018-05-31 11:42:42,357 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(288)) - Submitting tokens for job: job_local1527889277_0001
2018-05-31 11:42:42,464 INFO  [main] mapreduce.Job (Job.java:submit(1301)) - The url to track the job: http://localhost:8080/
2018-05-31 11:42:42,465 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1346)) - Running job: job_local1527889277_0001
2018-05-31 11:42:42,467 INFO  [Thread-4] mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(471)) - OutputCommitter set in config null
2018-05-31 11:42:42,471 INFO  [Thread-4] mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(489)) - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2018-05-31 11:42:42,555 INFO  [Thread-4] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(448)) - Waiting for map tasks
2018-05-31 11:42:42,555 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:run(224)) - Starting task: attempt_local1527889277_0001_m_000000_0
2018-05-31 11:42:42,579 INFO  [LocalJobRunner Map Task Executor #0] util.ProcfsBasedProcessTree (ProcfsBasedProcessTree.java:isAvailable(181)) - ProcfsBasedProcessTree currently is supported only on Linux.
2018-05-31 11:42:42,724 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:initialize(587)) -  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@4e71af26
2018-05-31 11:42:42,727 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:runNewMapper(753)) - Processing split: hdfs://192.168.16.2:9000/wordcount/input/1.txt:0+163
2018-05-31 11:42:42,758 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:setEquator(1202)) - (EQUATOR) 0 kvi 26214396(104857584)
2018-05-31 11:42:42,758 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(995)) - mapreduce.task.io.sort.mb: 100
2018-05-31 11:42:42,758 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(996)) - soft limit at 83886080
2018-05-31 11:42:42,759 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(997)) - bufstart = 0; bufvoid = 104857600
2018-05-31 11:42:42,759 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(998)) - kvstart = 26214396; length = 6553600
2018-05-31 11:42:42,761 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:createSortingCollector(402)) - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2018-05-31 11:42:42,843 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 
2018-05-31 11:42:42,845 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1457)) - Starting flush of map output
2018-05-31 11:42:42,845 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1475)) - Spilling map output
2018-05-31 11:42:42,845 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1476)) - bufstart = 0; bufend = 259; bufvoid = 104857600
2018-05-31 11:42:42,845 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1478)) - kvstart = 26214396(104857584); kvend = 26214304(104857216); length = 93/6553600
2018-05-31 11:42:42,854 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:sortAndSpill(1660)) - Finished spill 0
2018-05-31 11:42:42,859 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:done(1001)) - Task:attempt_local1527889277_0001_m_000000_0 is done. And is in the process of committing
2018-05-31 11:42:42,867 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - map
2018-05-31 11:42:42,867 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:sendDone(1121)) - Task 'attempt_local1527889277_0001_m_000000_0' done.
2018-05-31 11:42:42,868 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:run(249)) - Finishing task: attempt_local1527889277_0001_m_000000_0
2018-05-31 11:42:42,868 INFO  [Thread-4] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - map task executor complete.
2018-05-31 11:42:42,870 INFO  [Thread-4] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(448)) - Waiting for reduce tasks
2018-05-31 11:42:42,870 INFO  [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:run(302)) - Starting task: attempt_local1527889277_0001_r_000000_0
2018-05-31 11:42:42,875 INFO  [pool-6-thread-1] util.ProcfsBasedProcessTree (ProcfsBasedProcessTree.java:isAvailable(181)) - ProcfsBasedProcessTree currently is supported only on Linux.
2018-05-31 11:42:42,948 INFO  [pool-6-thread-1] mapred.Task (Task.java:initialize(587)) -  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@55ad1062
2018-05-31 11:42:42,950 INFO  [pool-6-thread-1] mapred.ReduceTask (ReduceTask.java:run(362)) - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@666361b3
2018-05-31 11:42:42,962 INFO  [pool-6-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:<init>(197)) - MergerManager: memoryLimit=1991350656, maxSingleShuffleLimit=497837664, mergeThreshold=1314291456, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2018-05-31 11:42:42,964 INFO  [EventFetcher for fetching Map Completion Events] reduce.EventFetcher (EventFetcher.java:run(61)) - attempt_local1527889277_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2018-05-31 11:42:42,986 INFO  [localfetcher#1] reduce.LocalFetcher (LocalFetcher.java:copyMapOutput(141)) - localfetcher#1 about to shuffle output of map attempt_local1527889277_0001_m_000000_0 decomp: 309 len: 313 to MEMORY
2018-05-31 11:42:42,990 INFO  [localfetcher#1] reduce.InMemoryMapOutput (InMemoryMapOutput.java:shuffle(100)) - Read 309 bytes from map-output for attempt_local1527889277_0001_m_000000_0
2018-05-31 11:42:42,991 INFO  [localfetcher#1] reduce.MergeManagerImpl (MergeManagerImpl.java:closeInMemoryFile(315)) - closeInMemoryFile -> map-output of size: 309, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->309
2018-05-31 11:42:42,992 INFO  [EventFetcher for fetching Map Completion Events] reduce.EventFetcher (EventFetcher.java:run(76)) - EventFetcher is interrupted.. Returning
2018-05-31 11:42:42,992 INFO  [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2018-05-31 11:42:42,992 INFO  [pool-6-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(687)) - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2018-05-31 11:42:42,998 INFO  [pool-6-thread-1] mapred.Merger (Merger.java:merge(597)) - Merging 1 sorted segments
2018-05-31 11:42:42,998 INFO  [pool-6-thread-1] mapred.Merger (Merger.java:merge(696)) - Down to the last merge-pass, with 1 segments left of total size: 303 bytes
2018-05-31 11:42:42,999 INFO  [pool-6-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(754)) - Merged 1 segments, 309 bytes to disk to satisfy reduce memory limit
2018-05-31 11:42:43,000 INFO  [pool-6-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(784)) - Merging 1 files, 313 bytes from disk
2018-05-31 11:42:43,001 INFO  [pool-6-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(799)) - Merging 0 segments, 0 bytes from memory into reduce
2018-05-31 11:42:43,001 INFO  [pool-6-thread-1] mapred.Merger (Merger.java:merge(597)) - Merging 1 sorted segments
2018-05-31 11:42:43,001 INFO  [pool-6-thread-1] mapred.Merger (Merger.java:merge(696)) - Down to the last merge-pass, with 1 segments left of total size: 303 bytes
2018-05-31 11:42:43,002 INFO  [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2018-05-31 11:42:43,046 INFO  [pool-6-thread-1] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1129)) - mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2018-05-31 11:42:43,466 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1367)) - Job job_local1527889277_0001 running in uber mode : false
2018-05-31 11:42:43,467 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1374)) -  map 100% reduce 0%
2018-05-31 11:42:43,694 INFO  [pool-6-thread-1] mapred.Task (Task.java:done(1001)) - Task:attempt_local1527889277_0001_r_000000_0 is done. And is in the process of committing
2018-05-31 11:42:43,698 INFO  [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2018-05-31 11:42:43,698 INFO  [pool-6-thread-1] mapred.Task (Task.java:commit(1162)) - Task attempt_local1527889277_0001_r_000000_0 is allowed to commit now
2018-05-31 11:42:43,736 INFO  [pool-6-thread-1] output.FileOutputCommitter (FileOutputCommitter.java:commitTask(439)) - Saved output of task 'attempt_local1527889277_0001_r_000000_0' to hdfs://192.168.16.2:9000/wordcount/output/_temporary/0/task_local1527889277_0001_r_000000
2018-05-31 11:42:43,736 INFO  [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - reduce > reduce
2018-05-31 11:42:43,737 INFO  [pool-6-thread-1] mapred.Task (Task.java:sendDone(1121)) - Task 'attempt_local1527889277_0001_r_000000_0' done.
2018-05-31 11:42:43,737 INFO  [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:run(325)) - Finishing task: attempt_local1527889277_0001_r_000000_0
2018-05-31 11:42:43,737 INFO  [Thread-4] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - reduce task executor complete.
2018-05-31 11:42:44,467 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1374)) -  map 100% reduce 100%
2018-05-31 11:42:44,467 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1385)) - Job job_local1527889277_0001 completed successfully
2018-05-31 11:42:44,476 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1392)) - Counters: 38
    File System Counters
        FILE: Number of bytes read=1018
        FILE: Number of bytes written=517955
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=326
        HDFS: Number of bytes written=124
        HDFS: Number of read operations=15
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Map-Reduce Framework
        Map input records=12
        Map output records=24
        Map output bytes=259
        Map output materialized bytes=313
        Input split bytes=111
        Combine input records=0
        Combine output records=0
        Reduce input groups=13
        Reduce shuffle bytes=313
        Reduce input records=24
        Reduce output records=13
        Spilled Records=48
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=2
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=519045120
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=163
    File Output Format Counters 
        Bytes Written=124
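
Several of the framework counters above can be cross-checked against the sample input with a plain-Java sketch (no Hadoop involved). It recomputes Map input records, Map output records, Map output bytes and Reduce input groups. The byte count rests on an assumption about serialization: each `Text` key is taken to occupy a one-byte length prefix plus its UTF-8 bytes, and each `IntWritable` value four bytes. `CounterCheck` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical cross-check of the job counters against the sample input.
class CounterCheck {
    static final List<String> LINES = Arrays.asList(
        "hello world", "hello abc", "hello good", "hello mysql",
        "hello oracle", "hello hadoop", "hello hdfs", "hello yarn",
        "hello namenode", "hello datanode", "hello resourcemanager",
        "hello nodemanager");

    // Returns {map input records, map output records, map output bytes, reduce input groups}.
    static long[] counters(List<String> lines) {
        long outputRecords = 0;
        long outputBytes = 0;
        TreeSet<String> distinct = new TreeSet<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                outputRecords++;
                // Assumed size: 1-byte length + UTF-8 bytes (Text) + 4 bytes (IntWritable).
                outputBytes += 1 + word.getBytes(StandardCharsets.UTF_8).length + 4;
                distinct.add(word);
            }
        }
        return new long[] {lines.size(), outputRecords, outputBytes, distinct.size()};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(counters(LINES))); // [12, 24, 259, 13]
    }
}
```

The four values line up with `Map input records=12`, `Map output records=24`, `Map output bytes=259` and `Reduce input groups=13` in the log.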

The job completed successfully. Next, take a look at the output directory.

[hadoop@hadoop2 ~]$ hadoop fs -ls hdfs://192.168.16.2:9000/wordcount/output
18/05/31 21:16:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2018-05-31 19:42 hdfs://192.168.16.2:9000/wordcount/output/_SUCCESS
-rw-r--r--   3 hadoop supergroup        124 2018-05-31 19:42 hdfs://192.168.16.2:9000/wordcount/output/part-r-00000
[hadoop@hadoop2 ~]$ 

_SUCCESS: this empty file marks the job as having completed successfully.
part-r-00000: this file contains the output of the reduce task.

[hadoop@hadoop2 ~]$ hadoop fs -cat hdfs://192.168.16.2:9000/wordcount/output/part-r-00000
18/05/31 21:17:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
abc     1
datanode        1
good    1
hadoop  1
hdfs    1
hello   12
mysql   1
namenode        1
nodemanager     1
oracle  1
resourcemanager 1
world   1
yarn    1
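
The layout of part-r-00000 can be reproduced with a small plain-Java sketch, assuming the default text output format writes each record as key, a tab, the value and a newline, with keys in sorted order (a `TreeMap` gives the same order for these lowercase ASCII words). `OutputFormatSketch` is a hypothetical name. For the thirteen words above this comes to 124 bytes, the same figure as `Bytes Written=124` in the counters and the file size shown by `hadoop fs -ls`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical rendering of word counts in the "key<TAB>value" text layout.
class OutputFormatSketch {
    // Renders one "key\tvalue\n" line per entry, in sorted key order.
    static String render(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : new TreeMap<>(counts).entrySet()) {
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("hello", 12);
        String[] others = {"world", "abc", "good", "mysql", "oracle", "hadoop", "hdfs",
                           "yarn", "namenode", "datanode", "resourcemanager", "nodemanager"};
        for (String w : others) {
            counts.put(w, 1);
        }
        String text = render(counts);
        System.out.print(text);
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length + " bytes"); // 124 bytes
    }
}
```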