The Four Methods of the Mapper and Reducer Classes in MapReduce

The four methods of the Mapper class:
protected void setup(Mapper.Context context) throws IOException, InterruptedException
// Called once at the beginning of the task.
protected void cleanup(Mapper.Context context) throws IOException, InterruptedException
// Called once at the end of the task.
protected void map(KEYIN key, VALUEIN value, Mapper.Context context) throws IOException, InterruptedException
// Called once for each key/value pair in the input split. Most applications should override this; the default is the identity function.
public void run(Mapper.Context context) throws IOException, InterruptedException
// Expert users can override this method for more complete control over the execution of the Mapper.
Execution order: setup ---> map ---> cleanup (all driven by run)
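This order is easiest to see from the default run() implementation. The sketch below follows the structure of Hadoop's Mapper.run(), slightly simplified (the exact body varies between Hadoop versions):

```java
// Simplified sketch of the default Mapper.run(): setup once, then map for
// every record in the input split, then cleanup once.
public void run(Context context) throws IOException, InterruptedException {
    setup(context);                      // called once before any records
    try {
        while (context.nextKeyValue()) { // iterate over the input split
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);                // called once after the last record
    }
}
```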
Similarly, the Reducer class has the same four methods:
protected void setup(Reducer.Context context) throws IOException, InterruptedException
// Called once at the beginning of the task.
protected void cleanup(Reducer.Context context) throws IOException, InterruptedException
// Called once at the end of the task.
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Reducer.Context context) throws IOException, InterruptedException
// Called once for each key. Most applications will define their reduce class by overriding this method. The default implementation is an identity function.
public void run(Reducer.Context context) throws IOException, InterruptedException
// Advanced application writers can override run(org.apache.hadoop.mapreduce.Reducer.Context) to control how the reduce task works.
Execution order: setup ---> reduce ---> cleanup (all driven by run)
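To illustrate how the three overridable methods divide the work, here is a hypothetical Reducer that sums word counts and only emits words whose count reaches a threshold. The class name and the configuration key "min.count" are made up for this sketch:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer showing the setup -> reduce -> cleanup lifecycle.
public class ThresholdSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private int minCount;

    @Override
    protected void setup(Context context) {
        // Called once per task: read a (made-up) configuration key.
        minCount = context.getConfiguration().getInt("min.count", 1);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key with all of its values.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        if (sum >= minCount) {
            context.write(key, new IntWritable(sum));
        }
    }

    @Override
    protected void cleanup(Context context) {
        // Called once per task after the last key; release resources here if needed.
    }
}
```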
Below is an example that uses MapReduce for data cleaning and partitioning:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class DataCleaner {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text word = new Text();
        private final static IntWritable one = new IntWritable(1);

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // Data cleaning: strip non-alphanumeric characters and lower-case the line
            String cleanLine = line.replaceAll("[^a-zA-Z0-9\\s]", "").toLowerCase();
            // Split into words and emit (word, 1) pairs to the reducer
            StringTokenizer tokenizer = new StringTokenizer(cleanLine);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts for each word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "datacleaner");
        job.setJarByClass(DataCleaner.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Partition keys across reducers with HashPartitioner
        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(4); // run 4 reduce tasks
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In the code above we define a `Map` class and a `Reduce` class. `Map` cleans each input line and emits every cleaned word as the key with a count of 1 as the value; `Reduce` sums the counts for each word and writes the final result.

In `main` we create a `Job`, register `Map` and `Reduce` as the job's Mapper and Reducer, set the output key/value types, and set the partitioner to `HashPartitioner`. Finally we specify the input and output paths and launch the MapReduce job.

Note that `setNumReduceTasks` sets the number of reduce tasks and can be tuned to the input size and the available compute resources. If a custom partitioning scheme is needed, extend `Partitioner` and implement `getPartition` to define the routing logic, as sketched below.
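The following is one possible custom `Partitioner` for the job above; the class name and the routing rule are hypothetical, shown only to illustrate the `getPartition` contract:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: words starting with a-m go to the first half of
// the reduce tasks, everything else to the second half.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks <= 1) {
            return 0; // only one reducer, nothing to decide
        }
        String word = key.toString();
        char first = word.isEmpty() ? 'z' : word.charAt(0);
        int half = numReduceTasks / 2;
        int bucket = word.hashCode() & Integer.MAX_VALUE; // non-negative hash
        // Return a partition index in [0, numReduceTasks)
        return first <= 'm' ? bucket % half
                            : half + bucket % (numReduceTasks - half);
    }
}
```

To use it, replace the `HashPartitioner` line in `main` with `job.setPartitionerClass(FirstLetterPartitioner.class)`.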