1. A brief introduction to MapReduce
The core idea of MapReduce is "divide and conquer", and it fits workloads made up of a large number of tasks, i.e. large-scale data processing.
Map is the "divide" step: a complex job is broken into a number of simple tasks that are processed in parallel. This split is only valid when the small tasks can be computed independently, with little or no dependency between them.
Reduce is the "combine" step: it performs a global aggregation over the results of the map phase.
Together, these two phases embody the MapReduce idea.
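To make the data flow concrete before touching the Hadoop API, here is a minimal single-process sketch in plain Java (illustrative only, not Hadoop code): the inner loop plays the role of map by emitting <word, 1> pairs, grouping by key stands in for the shuffle, and the final loop is the reduce that sums each group.

import java.util.*;

public class WordCountSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello,world", "hello,hadoop");

        // "Map" + "shuffle": emit <word, 1> for every word, grouped by key
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split(",")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }

        // "Reduce": sum the values of each group
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int count = 0;
            for (int one : e.getValue()) count += one;
            System.out.println(e.getKey() + "\t" + count); // e.g. hello  2
        }
    }
}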
2. Steps for writing a MapReduce program
https://lansonli.blog.csdn.net/article/details/117376840
In outline: an InputFormat reads the input into <k1, v1> pairs; a Mapper transforms them into <k2, v2>; the framework partitions, sorts, optionally combines, and groups the map output; a Reducer aggregates each group into <k3, v3>; and an OutputFormat writes the result.
3. Writing the code
POM dependencies (Hadoop 2.7.5 is used here; the version should match your cluster):
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
</dependencies>
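Optionally (an addition, not part of the original setup), a <build> section along the following lines records the main class in the jar's manifest, so that `hadoop jar` can later be invoked without naming the class on the command line:

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <!-- optional: lets `hadoop jar` find the entry point by itself -->
                        <mainClass>mapreduce.ReduceMain</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>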
Writing the Mapper:
package mapreduce;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

// The four generic parameters:
// keyin: LongWritable (offset of the line)  valuein: Text (content of the line)
// keyout: Text (a word)                     valueout: LongWritable (its count)
public class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // The first parameter is the offset of this line within the file, the second is the
    // content of the line, and the third is the context used to emit output.
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get the line as a String
        String line = value.toString();
        // Split the line into words
        String[] words = line.split(",");
        // Emit <word, 1> for every word
        for (String word : words) {
            context.write(new Text(word), new LongWritable(1));
        }
    }
}
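The Mapper can be unit-tested in isolation before deploying. A minimal sketch using the Apache MRUnit library (an assumed extra: it is not in the POM above and would need org.apache.mrunit:mrunit:1.1.0 with the hadoop2 classifier as a test-scoped dependency):

package mapreduce;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class MyMapperTest {
    @Test
    public void mapEmitsOnePerWord() throws Exception {
        // Feed one line and assert the emitted <word, 1> pairs in emission order
        MapDriver.newMapDriver(new MyMapper())
                .withInput(new LongWritable(0), new Text("hello,world,hello"))
                .withOutput(new Text("hello"), new LongWritable(1))
                .withOutput(new Text("world"), new LongWritable(1))
                .withOutput(new Text("hello"), new LongWritable(1))
                .runTest();
    }
}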
Writing the Reducer:
package mapreduce;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

// The reduce operation: for each key, sum the grouped values into the final count
public class MyReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        // Use a long accumulator, since LongWritable.get() returns a long
        long count = 0;
        for (LongWritable value : values) {
            count += value.get();
        }
        context.write(key, new LongWritable(count));
    }
}
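With the same assumed MRUnit dependency, the Reducer can be checked the same way:

package mapreduce;

import java.util.Arrays;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class MyReduceTest {
    @Test
    public void reduceSumsCountsPerWord() throws Exception {
        // A key grouped with two 1s should come out as <hello, 2>
        ReduceDriver.newReduceDriver(new MyReduce())
                .withInput(new Text("hello"), Arrays.asList(new LongWritable(1), new LongWritable(1)))
                .withOutput(new Text("hello"), new LongWritable(2))
                .runTest();
    }
}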
Writing the Driver:
package mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ReduceMain {
    public static void main(String[] args) throws Exception {
        // Create a job object
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, "word-count");
        // Specify the jar that contains the job
        job.setJarByClass(ReduceMain.class);
        // Specify the input format class and the input path
        job.setInputFormatClass(TextInputFormat.class); // reads the input line by line
        TextInputFormat.addInputPath(job, new Path("hdfs://192.168.40.150:9000/test.txt"));
        // Specify the custom Mapper class and the k2/v2 types
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class); // k2 type
        job.setMapOutputValueClass(LongWritable.class); // v2 type
        // Specify the custom Reducer class and the k3/v3 types
        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Specify the output format class and the output path; the output directory must not already exist
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("hdfs://192.168.40.150:9000/lgy_test/res"));
        // Submit the job to the YARN cluster and wait for it to finish
        boolean bl = job.waitForCompletion(true);
        System.exit(bl ? 0 : 1);
    }
}
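One optional refinement: because this reduce function is a plain sum (associative and commutative), the same class can double as a combiner that pre-aggregates <word, 1> pairs on the map side and shrinks shuffle traffic. It is a single extra line in the driver, for example right after setReducerClass:

// Optional: pre-aggregate map output locally; safe here because
// summation is associative and commutative
job.setCombinerClass(MyReduce.class);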
4. Running the code
Package the code into a jar (for example with mvn package).
Copy the jar to any node of the Hadoop cluster and run:
hadoop jar HadoopTest-1.0-SNAPSHOT.jar mapreduce.ReduceMain
The trailing argument is the fully qualified main class.
View the result:
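For example, with a single reducer the output file is named part-r-00000 by default, so the word counts can be printed with:
hadoop fs -cat /lgy_test/res/part-r-00000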