I. What is MapReduce?
1. MapReduce is a distributed computing framework
It breaks a large data-processing job into individual tasks that can be executed in parallel across a cluster of servers.
It originated at Google.
2. It suits large-scale data-processing scenarios
Each node processes the data stored on that node, moving the computation to the data rather than the data to the computation.
3. Every job consists of two parts: Map and Reduce
II. The design ideas behind MapReduce
1. Divide and conquer
This simplifies the programming model for parallel computation.
2. Build an abstraction: Map and Reduce
Developers only need to implement the Mapper and Reducer functions (see the sketch after this list).
3. Hide the system-level details
Developers can focus on implementing the business logic.
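As a minimal sketch of that abstraction (the class names and placeholder Object type parameters below are illustrative, not part of Hadoop beyond the Mapper/Reducer base classes), a job boils down to overriding two methods; the concrete WordCount versions appear in section V.

```java
import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Placeholder skeleton: the four type parameters <KEYIN, VALUEIN, KEYOUT, VALUEOUT>
// are chosen per job (at runtime they must be Hadoop Writable types).
class MyMapper extends Mapper<Object, Object, Object, Object> {
    @Override
    protected void map(Object key, Object value, Context context)
            throws IOException, InterruptedException {
        // transform one input <k, v> pair into zero or more intermediate pairs:
        // context.write(outKey, outValue);
    }
}

class MyReducer extends Reducer<Object, Object, Object, Object> {
    @Override
    protected void reduce(Object key, Iterable<Object> values, Context context)
            throws IOException, InterruptedException {
        // called once per distinct key, with all of that key's values grouped:
        // context.write(key, aggregatedValue);
    }
}
```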
III. Characteristics of MapReduce

| Strengths | Not well suited for |
| --- | --- |
| Easy to program | Real-time computation |
| Scalable | Stream processing |
| Highly fault-tolerant | DAG (directed acyclic graph) computation, i.e. chains of dependent jobs |
| High throughput | |
IV. MapReduce programming conventions
- The data the MapReduce framework processes are <k,v> key-value pairs.
- Mapper
  The map side receives <k,v> pairs and, after processing, outputs new <k,v> pairs.
  The map-side logic goes in the map() method of a Mapper subclass.
- Reducer
  The reduce side collects the <k,v> output of multiple mappers and aggregates it.
  The reduce-side logic goes in the reduce() method.
  reduce() is called once per group of identical keys, i.e. once per <k, Iterator<v>> group (a plain-Java illustration of this grouping follows the list).
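To make the grouping concrete, here is a plain-Java sketch (outside Hadoop, with a made-up two-line input) of what conceptually happens between map and reduce for word count: intermediate (word, 1) pairs are grouped by key, and each group leads to one reduce-style aggregation.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration of map -> group by key -> reduce; the real framework
// performs this shuffle/sort step across the cluster.
public class GroupingDemo {
    public static void main(String[] args) {
        String[] lines = {"hello,world", "hello,mapreduce"};   // hypothetical input

        // "Map": emit (word, 1) for every word in every line
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(",")) {
                intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // "Shuffle": group values by key, as the framework does before calling reduce()
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : intermediate) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }

        // "Reduce": one aggregation per key, summing the grouped values
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int count = 0;
            for (int one : group.getValue()) {
                count += one;
            }
            System.out.println(group.getKey() + "\t" + count);  // hello 2, mapreduce 1, world 1
        }
    }
}
```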
V. Implementing WordCount with MapReduce

1. Create a Maven project

2. Add the dependencies
```xml
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
    <hadoop.version>3.1.3</hadoop.version>
</properties>

<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-common</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>
```
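Briefly: junit is only needed for tests; hadoop-common and hadoop-hdfs supply the shared classes (Configuration, FileSystem, Path) and HDFS support; the two hadoop-mapreduce-client-* artifacts provide the Mapper/Reducer/Job API used below; hadoop-client is an aggregator for the common client-side dependencies. Keeping the version in the hadoop.version property keeps all Hadoop artifacts in sync.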
3. Mapper

```java
package cn.kgc.kb23.demo01;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Output objects are reused instead of being created for every record
    Text text = new Text();
    IntWritable intWritable = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        System.out.println("wordCountMap key:" + key + " value:" + value);
        // Each input line is expected to contain comma-separated words
        String[] words = value.toString().split(",");
        for (String word : words) {
            text.set(word);
            intWritable.set(1);
            context.write(text, intWritable);   // emit <word, 1>
        }
    }
}
```
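Note that split(",") assumes the input file uses commas between words; for ordinary whitespace-separated text you would split on "\\s+" instead. The input key (LongWritable) is the byte offset of the line supplied by the default TextInputFormat and is not used here.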
4. Reducer

```java
package cn.kgc.kb23.demo01;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountReduce extends Reducer<Text, IntWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        System.out.println("reduce stage key:" + key + " values:" + values.toString());
        // Sum the 1s emitted by the mappers for this word
        int count = 0;
        for (IntWritable intWritable : values) {
            count += intWritable.get();
        }
        LongWritable longWritable = new LongWritable(count);
        System.out.println("ReduceResult key:" + key + " resultValue:" + longWritable.get());
        context.write(key, longWritable);   // emit <word, total count>
    }
}
```
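Note the type pairing: the Reducer's input types <Text, IntWritable> must match the Mapper's output types, while its output value type here is LongWritable. That is why the Driver below calls setMapOutputKeyClass/setMapOutputValueClass for the map side and setOutputKeyClass/setOutputValueClass for the final output separately.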
5. Driver

```java
package cn.kgc.kb23.demo01;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCountDriver.class);

        // Mapper and its intermediate output key/value types
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Reducer and the job's final output key/value types
        job.setReducerClass(WordCountReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Input file and output directory; an existing output directory is deleted
        // first, because the job fails if the output path already exists
        FileInputFormat.setInputPaths(job, new Path("E:\\springboot\\hadoop01\\in\\wordcount.txt"));
        Path path = new Path("E:\\springboot\\hadoop01\\out\\out1");
        FileSystem fileSystem = FileSystem.get(path.toUri(), conf);
        if (fileSystem.exists(path)) {
            fileSystem.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);

        job.waitForCompletion(true);
    }
}
```
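Running WordCountDriver directly from the IDE, with no cluster configuration on the classpath, executes the job with the local runner against the Windows paths above. Assuming, for example, that wordcount.txt contains the two lines hello,world and hello,mapreduce, the result lands in E:\springboot\hadoop01\out\out1\part-r-00000 with key and value separated by a tab (the default TextOutputFormat separator):

```
hello	2
mapreduce	1
world	1
```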