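A MapReduce job runs in two phases, a map phase and a reduce phase, and each phase takes key-value pairs as input and produces key-value pairs as output. The map function and the reduce function are listed below. (The reduce input types must match the map output types.) In this example the map phase reads weather records and emits the year as the key and the air temperature as the value; the pairs are sorted by year. The reduce phase then picks the highest temperature for each year from the map output. (This example uses hadoop-2.8.5.)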
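Before looking at the Hadoop classes, the flow described above can be sketched in plain Java. This is only an illustration under assumptions: no Hadoop classes are involved, the class name MapReduceSketch and the "year,temperature" record format are hypothetical, and the sample readings are made up.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of map -> sort/group by key -> reduce. Not Hadoop code.
public class MapReduceSketch {

    // Map phase: turn one hypothetical "year,temperature" record into a key-value pair.
    static Map.Entry<String, Integer> map(String record) {
        String[] parts = record.split(",");
        return new AbstractMap.SimpleEntry<>(parts[0], Integer.parseInt(parts[1]));
    }

    // Group the values by key; a TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: pick the maximum value seen for one key.
    static int reduce(List<Integer> values) {
        int max = Integer.MIN_VALUE;
        for (int v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Runs the three steps in order and returns year -> maximum temperature.
    static Map<String, Integer> run(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            pairs.add(map(record));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample readings, not taken from the NCDC data set.
        List<String> records = Arrays.asList("1950,0", "1950,22", "1950,-11", "1949,111", "1949,78");
        System.out.println(run(records)); // prints {1949=111, 1950=22}
    }
}
```

In the real job the grouping step is done by the framework between the map and reduce phases; only the map and reduce functions are written by hand.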
1. Mapper class implementation:
package com.hadoop.ncdc.test;

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Value that marks a missing temperature reading.
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // The new Hadoop 2 API uses Context, which unifies the roles of JobConf, OutputCollector and Reporter from the old API.
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') {
            // Skip the leading plus sign before parsing.
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        // Emit (year, temperature) only for readings that are present and have an accepted quality code.
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}
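The fixed-width offsets used by the mapper are easier to follow on a concrete line. The sketch below repeats the same substring logic in plain Java on a synthetic 93-character line; the class NcdcLineSketch and the helper syntheticLine are hypothetical, and the zero-filled line is not a real NCDC record.

```java
// Plain-Java illustration of the mapper's fixed-width parsing. Not Hadoop code.
public class NcdcLineSketch {
    private static final int MISSING = 9999;

    // Builds a synthetic 93-character line in which only the fields read by the mapper are filled in.
    static String syntheticLine(String year, String signedTemperature, char quality) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 93; i++) {
            sb.append('0');
        }
        sb.replace(15, 19, year);              // year: characters 15 to 18
        sb.replace(87, 92, signedTemperature); // sign plus four digits: characters 87 to 91
        sb.setCharAt(92, quality);             // quality code: character 92
        return sb.toString();
    }

    // Returns "year<TAB>temperature", or null when the mapper would skip the reading.
    static String parse(String line) {
        String year = line.substring(15, 19);
        int start = line.charAt(87) == '+' ? 88 : 87; // skip a leading plus sign
        int airTemperature = Integer.parseInt(line.substring(start, 92));
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            return year + "\t" + airTemperature;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(parse(syntheticLine("1950", "+0022", '1'))); // accepted reading
        System.out.println(parse(syntheticLine("1950", "-0011", '1'))); // negative reading
        System.out.println(parse(syntheticLine("1950", "+9999", '1'))); // missing value, skipped
        System.out.println(parse(syntheticLine("1950", "+0022", '2'))); // rejected quality code, skipped
    }
}
```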
2. Reducer class implementation:
package com.hadoop.ncdc.test;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // As in the mapper, Context replaces the old API's JobConf, OutputCollector and Reporter.
        int maxValue = Integer.MIN_VALUE;
        // Keep the largest temperature seen for this year.
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}
3. main class:
package com.hadoop.ncdc.test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class MaxTemperature {
    public static void main(String[] args) throws Exception {
        // The new API lives in the org.apache.hadoop.mapreduce package; the old API is in org.apache.hadoop.mapred.
        Configuration config = new Configuration();
        // The new API controls the job through the Job class. The old API used JobClient,
        // which has no counterpart in the org.apache.hadoop.mapreduce package.
        Job job = Job.getInstance(config, "Max temperature");
        job.setJarByClass(MaxTemperature.class);
        // args[0]: first command-line argument, the input path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // args[1]: second command-line argument, the output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Exit with 0 when the job succeeds and 1 when it fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
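The main class above reads args[0] and args[1] without checking that both were supplied, so a missing path fails with an ArrayIndexOutOfBoundsException. One possible guard is sketched below in plain Java; the class DriverArgsSketch, the method validate and the wording of the usage message are assumptions for illustration, not part of the original program.

```java
// Hypothetical argument check that could run at the top of main(). Not Hadoop code.
public class DriverArgsSketch {

    // Returns null when exactly two paths were given, otherwise a usage message.
    static String validate(String[] args) {
        if (args == null || args.length != 2) {
            return "Usage: MaxTemperature INPUT_PATH OUTPUT_PATH";
        }
        return null;
    }

    public static void main(String[] args) {
        // Demonstrate both outcomes with fixed sample arguments.
        String[] good = {"/Users/mymac/Desktop/NCDCData", "/Users/mymac/Desktop/output"};
        String[] bad = {"/Users/mymac/Desktop/NCDCData"};
        System.out.println(validate(good)); // prints null
        System.out.println(validate(bad));  // prints the usage message
    }
}
```

In the driver, a non-null result would typically be printed to System.err followed by System.exit with a nonzero code, before any Job is created.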
4. Package the classes into a jar file (hadoop-test.jar) and run a test.
mymacdeMac-mini:~ mymac$ hadoop jar /Users/mymac/Desktop/jjj/hadoop-test.jar com.hadoop.ncdc.test.MaxTemperature /Users/mymac/Desktop/NCDCData /Users/mymac/Desktop/output
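(com.hadoop.ncdc.test.MaxTemperature is the fully qualified name of the main class, the input is NCDCData, and the results are written to the output directory.)
Command-line form: hadoop jar JAR_PATH FULLY_QUALIFIED_MAIN_CLASS INPUT_PATH OUTPUT_PATH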
----------------------------------------------------------------------------------
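If hadoop is followed directly by the fully qualified main class name instead of by "jar", the jar must first be added to HADOOP_CLASSPATH. Run this on the command line:
export HADOOP_CLASSPATH=/Users/mymac/Desktop/jjj/hadoop-test.jar
(This is for testing only; after the terminal is restarted the variable reverts to its default value.)
Then run:
hadoop FULLY_QUALIFIED_MAIN_CLASS INPUT_PATH OUTPUT_PATH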
The test succeeded:
18/12/12 21:34:32 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
18/12/12 21:34:32 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
18/12/12 21:34:32 INFO input.FileInputFormat: Total input files to process : 2
18/12/12 21:34:32 INFO mapreduce.JobSubmitter: number of splits:2 // the job input is divided into 2 splits
18/12/12 21:34:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1917490337_0001
18/12/12 21:34:33 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
18/12/12 21:34:33 INFO mapreduce.Job: Running job: job_local1917490337_0001 // ID of job 1
18/12/12 21:34:33 INFO mapred.LocalJobRunner: OutputCommitter set in config null
18/12/12 21:34:33 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/12/12 21:34:33 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
18/12/12 21:34:33 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
18/12/12 21:34:33 INFO mapred.LocalJobRunner: Waiting for map tasks
18/12/12 21:34:33 INFO mapred.LocalJobRunner: Starting task: attempt_local1917490337_0001_m_000000_0 // first attempt of the first map task
18/12/12 21:34:33 INFO mapred.LocalJobRunner: Finishing task: attempt_local1917490337_0001_m_000000_0
18/12/12 21:34:33 INFO mapred.LocalJobRunner: Starting task: attempt_local1917490337_0001_m_000001_0 // first attempt of the second map task
18/12/12 21:34:33 INFO mapred.LocalJobRunner: Finishing task: attempt_local1917490337_0001_m_000001_0
18/12/12 21:34:33 INFO mapred.LocalJobRunner: map task executor complete. // map tasks finished
18/12/12 21:34:33 INFO mapred.LocalJobRunner: Waiting for reduce tasks
18/12/12 21:34:33 INFO mapred.LocalJobRunner: Starting task: attempt_local1917490337_0001_r_000000_0 // first attempt of the first reduce task starts
18/12/12 21:34:33 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1917490337_0001_m_000001_0 decomp: 72206 len: 72210 to MEMORY // reduce fetches the shuffled map output
18/12/12 21:34:33 INFO reduce.InMemoryMapOutput: Read 72206 bytes from map-output for attempt_local1917490337_0001_m_000001_0 // reduce reads the map output
18/12/12 21:34:33 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1917490337_0001_r_000000_0' to file:/Users/mymac/Desktop/output/_temporary/0/task_local1917490337_0001_r_000000
18/12/12 21:34:33 INFO mapred.LocalJobRunner: reduce > reduce
18/12/12 21:34:33 INFO mapred.Task: Task 'attempt_local1917490337_0001_r_000000_0' done.
// the task output has been committed and stored in the configured output directory
18/12/12 21:34:33 INFO mapred.LocalJobRunner: Finishing task: attempt_local1917490337_0001_r_000000_0
18/12/12 21:34:33 INFO mapred.LocalJobRunner: reduce task executor complete. // reduce tasks finished
18/12/12 21:34:34 INFO mapreduce.Job: Job job_local1917490337_0001 running in uber mode : false
18/12/12 21:34:34 INFO mapreduce.Job: map 100% reduce 100%
18/12/12 21:34:34 INFO mapreduce.Job: Job job_local1917490337_0001 completed successfully // job finished
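The attempt IDs in the log above follow a visible pattern: the job identifier, then m or r for map or reduce, then the task number, then the attempt number. The plain-Java sketch below splits such an ID into those parts. The class AttemptIdSketch and its regular expression are assumptions based only on the IDs shown in this log, not on a Hadoop API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Splits an attempt ID from the log into its readable parts. Not Hadoop code.
public class AttemptIdSketch {
    // Assumed layout: attempt_<identifier>_<job number>_<m|r>_<task number>_<attempt number>
    private static final Pattern ATTEMPT =
            Pattern.compile("attempt_([A-Za-z0-9]+)_(\\d+)_([mr])_(\\d+)_(\\d+)");

    static String describe(String attemptId) {
        Matcher m = ATTEMPT.matcher(attemptId);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an attempt ID: " + attemptId);
        }
        String jobId = "job_" + m.group(1) + "_" + m.group(2);
        String type = m.group(3).equals("m") ? "map" : "reduce";
        int task = Integer.parseInt(m.group(4));    // zero-based task number
        int attempt = Integer.parseInt(m.group(5)); // zero-based attempt number
        return jobId + " " + type + " task " + task + " attempt " + attempt;
    }

    public static void main(String[] args) {
        // prints: job_local1917490337_0001 map task 1 attempt 0
        System.out.println(describe("attempt_local1917490337_0001_m_000001_0"));
        // prints: job_local1917490337_0001 reduce task 0 attempt 0
        System.out.println(describe("attempt_local1917490337_0001_r_000000_0"));
    }
}
```

Both numbers start at zero, so m_000001_0 is the first attempt of the second map task, matching the annotations in the log.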
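Note: to run hadoop followed directly by the fully qualified main class name, first add the jar to HADOOP_CLASSPATH on the command line:
export HADOOP_CLASSPATH=PATH_TO_YOUR_JAR
Then run:
hadoop FULLY_QUALIFIED_MAIN_CLASS INPUT_PATH OUTPUT_PATH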