The MapReduce Computing Framework
A parallel computing framework
A large task is split into many small tasks, and those small tasks are distributed across multiple nodes. Each node then performs its part of the computation at the same time.
The Core Idea of MapReduce
Divide and conquer, split first and merge last: a large, complex job is broken into many small tasks, the tasks are processed in parallel, and the partial results are finally merged.
MapReduce consists of two phases, Map and Reduce (illustrated by the sketch below):
Map: splits the data into independent records
Reduce: aggregates the intermediate results
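To make the split/merge idea concrete before introducing Hadoop, here is a minimal, framework-free sketch of the same pattern using Java parallel streams. The class name DivideAndConquerSketch and the hard-coded sample lines are illustrative assumptions only, not part of the Hadoop program developed below.

import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class DivideAndConquerSketch {
    public static void main(String[] args) {
        // Each line plays the role of one small task handed to a worker.
        String[] lines = { "zhangsan,lisi,wangwu", "zhaoliu,maqi" };
        Map<String, Long> counts = Arrays.stream(lines)
                .parallel()                                      // "divide": lines processed in parallel
                .flatMap(line -> Arrays.stream(line.split(","))) // "map": one record per word
                .collect(Collectors.groupingBy(w -> w,           // "reduce": merge counts per word
                        Collectors.counting()));
        System.out.println(counts); // e.g. {lisi=1, maqi=1, wangwu=1, zhangsan=1, zhaoliu=1}
    }
}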
The WordCount Example
Count the total number of times each word appears.
Input data:
zhangsan,lisi,wangwu
zhaoliu,maqi
zhangsan,zhaoliu,wangwu
lisi,wangwu
Expected final result:
zhangsan 2
lisi 2
wangwu 3
zhaoliu 2
maqi 1
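Before looking at the code, it helps to trace how the data flows between the two phases. The map phase turns each input line into (word, 1) pairs, for example:

zhangsan,lisi,wangwu  ->  (zhangsan, 1), (lisi, 1), (wangwu, 1)

The framework then sorts and groups the map output by key, so the reduce phase sees each distinct word once, together with all of its 1s, e.g. wangwu -> [1, 1, 1], which reduces to wangwu 3. The three classes below implement exactly this flow.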
The WordCountMap class
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class WordCountMap extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Convert the Text value (one input line) to a String
        String datas = value.toString();
        // 2. Split the line on ","
        String[] splits = datas.split(",");
        // 3. Emit each word once, with a count of 1
        for (String split : splits) {
            context.write(new Text(split), new LongWritable(1));
        }
    }
}
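The four type parameters of Mapper<LongWritable, Text, Text, LongWritable> describe the input and output pairs: with TextInputFormat (set in the driver below), the input key is the byte offset of the line within the file and the input value is the line's content; the output key is a word and the output value is the count 1. The framework calls map() once per input line, so the line zhangsan,lisi,wangwu produces three output pairs.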
The WordCountReduce class
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class WordCountReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        // key is one word; values are all the 1s emitted for that word by the map phase
        long sum = 0;
        // Iterate over the values and sum them up
        for (LongWritable value : values) {
            sum += value.get();
        }
        // Emit the word together with its total count
        context.write(key, new LongWritable(sum));
    }
}
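Between the two phases the framework shuffles the intermediate pairs: they are partitioned, sorted by key, and grouped, so reduce() runs exactly once per distinct word on each reducer. For this data set that means five calls; the call for wangwu receives the iterable [1, 1, 1] and writes wangwu 3.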
The WordCountDriver class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // Create the Job instance for this MapReduce program
        Job job = Job.getInstance(new Configuration(), "WordCount");
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("F:\\wordcount\\input\\words5.txt"));
        // Set the concrete Mapper and Reducer implementations for this job
        job.setMapperClass(WordCountMap.class);
        job.setReducerClass(WordCountReduce.class);
        // Set the output key/value types of the map phase
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Set the output key/value types of the reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("F:\\wordcount\\output\\7"));
        // Submit the job and block until it finishes (returns a status code)
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new WordCountDriver(), args));
    }
}
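Two practical notes on this driver. It never calls job.setJarByClass(WordCountDriver.class), which is harmless when running locally (it only triggers the "No job jar file set" warning visible in the log below) but required on a real cluster so the worker nodes can find the user classes. Also, the output directory must not exist before the job starts, or Hadoop will refuse to run it. To submit the packaged job to a cluster you would first replace the hard-coded paths with new Path(args[0]) and new Path(args[1]), then launch it along these lines (the jar name wordcount.jar and the HDFS paths are placeholders):

hadoop jar wordcount.jar WordCountDriver /wordcount/input /wordcount/output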
A successful run produces the following log output:
INFO - session.id is deprecated. Instead, use dfs.metrics.session-id
INFO - Initializing JVM Metrics with processName=JobTracker, sessionId=
WARN - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
WARN - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
INFO - Total input paths to process : 1
INFO - OutputCommitter set in config null
INFO - Running job: job_local168080594_0001
INFO - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
INFO - Waiting for map tasks
INFO - Starting task: attempt_local168080594_0001_m_000000_0
WARN - Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
INFO - Using ResourceCalculatorPlugin : null
INFO - Processing split: file:/F:/wordcount/input/words5.txt:0+72
INFO - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO - io.sort.mb = 100
INFO - data buffer = 79691776/99614720
INFO - record buffer = 262144/327680
INFO -
INFO - Starting flush of map output
INFO - Finished spill 0
INFO - Task:attempt_local168080594_0001_m_000000_0 is done. And is in the process of commiting
INFO -
INFO - Task 'attempt_local168080594_0001_m_000000_0' done.
INFO - Finishing task: attempt_local168080594_0001_m_000000_0
INFO - Map task executor complete.
WARN - Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
INFO - Using ResourceCalculatorPlugin : null
INFO -
INFO - Merging 1 sorted segments
INFO - Down to the last merge-pass, with 1 segments left of total size: 172 bytes
INFO -
INFO - Task:attempt_local168080594_0001_r_000000_0 is done. And is in the process of commiting
INFO -
INFO - Task attempt_local168080594_0001_r_000000_0 is allowed to commit now
INFO - Saved output of task 'attempt_local168080594_0001_r_000000_0' to F:/wordcount/output/7
INFO - reduce > reduce
INFO - Task 'attempt_local168080594_0001_r_000000_0' done.
INFO - map 100% reduce 100%
INFO - Job complete: job_local168080594_0001
INFO - Counters: 17
INFO - File System Counters
INFO - FILE: Number of bytes read=628
INFO - FILE: Number of bytes written=339964
INFO - FILE: Number of read operations=0
INFO - FILE: Number of large read operations=0
INFO - FILE: Number of write operations=0
INFO - Map-Reduce Framework
INFO - Map input records=4
INFO - Map output records=10
INFO - Map output bytes=150
INFO - Input split bytes=100
INFO - Combine input records=0
INFO - Combine output records=0
INFO - Reduce input groups=5
INFO - Reduce shuffle bytes=0
INFO - Reduce input records=10
INFO - Reduce output records=5
INFO - Spilled Records=20
INFO - Total committed heap usage (bytes)=506462208
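The counters match the input data: Map input records=4 is the four input lines, Map output records=10 is the ten (word, 1) pairs emitted by the mapper, Reduce input groups=5 is the five distinct words, and Reduce output records=5 is the five lines of the final result.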
Check the output file to confirm the result:
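With the default single reduce task (the log shows one reduce attempt), the result is written to a file named part-r-00000 under F:/wordcount/output/7. Each line should hold one word and its count separated by a tab, sorted by key:

lisi	2
maqi	1
wangwu	3
zhangsan	2
zhaoliu	2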