MapReduce: a distributed computing framework
What developers have to do: implement the Map and Reduce functions
If you only use HDFS, YARN does no actual work; YARN is only invoked when a MapReduce job runs
Detailed MapReduce flow across three machines
MapReduce programming conventions
Developing a MapReduce job takes eight steps in total: the Map phase has 2 steps, the Shuffle phase has 4 steps, and the Reduce phase has 2 steps.
Map phase (2 steps):
1. Set the InputFormat class, which splits the data into Key-Value pairs (K1 and V1, i.e. <byte offset of each line, line contents>) and feeds them into step 2
2. Write custom Map logic that converts the pairs from step 1 into new Key-Value pairs <K2,V2> (<word, count>) and emits them
Shuffle phase (4 steps):
3. Partition the emitted Key-Value pairs
4. Within each partition, sort the data by Key
5. (Optional) Pre-aggregate (combine) the sorted data to cut down network copying; see the combiner sketch after this list
6. Group the data: Values with the same Key go into one collection, still <K2,V2> (<word, collection of counts>)
Reduce phase (2 steps):
7. Sort and merge the results of the Map tasks, then write a Reduce function with your own logic that turns the incoming Key-Value pairs into new Key-Value pairs (K3 and V3) and emits them
8. Set the OutputFormat class, which processes and saves the Key-Value data emitted by Reduce
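Step 5 (the combiner) is never used in the example programs below; as a minimal sketch, assuming the WordCount job from program 1, the reducer can double as the combiner because partial word counts can safely be summed again:
//Hypothetical extra line in the WCDriver below: WCReducer runs on each map task's
//local output first, so only pre-summed <word, count> pairs cross the network.
job.setCombinerClass(WCReducer.class);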
MapReduce program 1 (just a Map program, a Reduce program, and a driver)
Task: count the words in a text file
In the POM, change jdk 1.7 to jdk 1.8
and junit 4.11 to 4.12
1. Upload a.txt to HDFS
2. Create a new Maven project and add the coordinate dependencies:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-auth</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
    <version>2.6.0</version>
</dependency>
3. Write the Map program:
public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    //Mapper's four generic types correspond to <Long, String, String, Long>; Hadoop
    //considers the plain Java types inconvenient, so it uses its own wrapper types
    //Override the map method; map converts K1 and V1 into K2 and V2
    /*Parameters: key:     K1, the byte offset of the line
                  value:   V1, the text of the line
                  context: the context object linking the Map--Shuffle--Reduce stages
    */
    /*
    How K1/V1 become K2/V2:
    K1    V1
    0     hello,world,hadoop
    15    hdfs,hive,hello
    ----------------------------------
    K2      V2
    hello   1
    world   1
    hadoop  1
    hdfs    1
    hive    1
    hello   1
    */
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Text text = new Text();
        LongWritable longWritable = new LongWritable();
        // 1. Split the line of text (value is a Text, which has no split method,
        //    so convert it to a String first)
        String[] split = value.toString().split(",");
        // 2. Iterate over the array and assemble K2 and V2
        for (String word : split) {
            // 3. Write K2 and V2 into the context (context.write only accepts Text
            //    and LongWritable here, so the types must be converted)
            text.set(word);
            longWritable.set(1);
            context.write(text, longWritable);
        }
    }
}
Reduce program:
//The four generic types are the K2/V2 and K3/V3 types
public class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        //1. Iterate over the collection and sum its numbers to get V3
        for (LongWritable value : values) {
            count += value.get();
        }
        //2. Write K3 and V3 into the context
        context.write(key, new LongWritable(count));
    }
}
Driver class:
public class WCDriver {
    public static void main(String[] args) throws Exception {
        //Create and set up the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wc");
        job.setJarByClass(WCDriver.class);
        //Configure the job's eight steps:
        //1. Input settings
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("a.txt"));
        //2. Map settings
        job.setMapperClass(WCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        //3,4,5,6. Shuffle phase (partition, sort, combine, group) -- defaults used here
        //7. Reduce settings
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        //8. Output settings
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("out1"));
        //Submit the job and wait for it to finish
        boolean flag = job.waitForCompletion(true);
        System.out.println(flag ? "success" : "failure");
    }
}
Build a jar and run it on Linux: hadoop jar <jar file> <fully qualified main class>, for example hadoop jar wc.jar com.example.WCDriver (jar and package names assumed here).
(Hadoop distributes the jar to the worker nodes, gathers the results, and writes the output to files)
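For instance, with the two sample lines from the mapper comment above (hello,world,hadoop and hdfs,hive,hello), the output file out1/part-r-00000 would contain:
hadoop	1
hdfs	1
hello	2
hive	1
world	1
(keys sorted by the shuffle, key and value separated by TextOutputFormat's default tab)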
Packaging, option 1: Project Structure -> Artifacts -> + -> JAR -> From modules with dependencies -> select the main class -> OK -> Build Artifacts
Packaging, option 2: Maven Projects panel on the right -> Lifecycle -> clean -> package; the jar's location is then shown below
MapReduce program 2: Map, Shuffle, and Reduce programs (partitioning in the shuffle)
**Task: classify lottery records (rows whose 6th column is greater than 15 vs. less than or equal to 15).** From Map through Reduce, the whole line is used as the key.
Partitioning: tag each of the Map task's output pairs (<K2,V2>); pairs with the same tag go to the same Reduce task
Map code:
/*K1: line offset     LongWritable
  V1: line text       Text
  K2: line text       Text
  V2: NullWritable (we only need to classify the rows, so the whole line goes into K2;
      V2 carries no data and serves purely as a placeholder)
*/
public class PartitionMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    //map converts K1 and V1 into K2 and V2
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, NullWritable.get());
    }
}
Partitioner code:
/*Define the partitioning rule:
  return the matching partition number
*/
public class MyPartitioner extends Partitioner<Text, NullWritable> {
    @Override
    public int getPartition(Text text, NullWritable nullWritable, int numPartitions) {
        //1. Split the line of text (K2) and get the winning-number field
        String[] split = text.toString().split("\t");
        String numStr = split[5];
        //2. Compare the field with 15 and return the matching partition number
        if (Integer.parseInt(numStr) > 15) {
            return 1;
        } else {
            return 0;
        }
    }
}
Reduce code:
/*The reducer does no aggregation here;
  it simply writes each key straight through
*/
public class PartitionerReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        context.write(key, NullWritable.get());
    }
}
Driver code:
public class JobMain extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        //1. Create the job object (a Configuration object plus an arbitrary job name)
        Job job = Job.getInstance(super.getConf(), "partition_mapreduce");
        //2. Configure the job (the eight steps)
        //Step 1: set the input class and input path
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("hdfs://hadoop100:8020/input"));
        //Step 2: set the Mapper class and its output types
        job.setMapperClass(PartitionMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        //Step 3: set the partitioner class
        job.setPartitionerClass(MyPartitioner.class);
        //The partitioner returns 0 or 1, so two reduce tasks are needed (the default is 1)
        job.setNumReduceTasks(2);
        //Steps 4, 5, 6: sort, combine, group (defaults used here)
        //Step 7: set the Reducer class and its output types (K3 and V3)
        job.setReducerClass(PartitionerReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        //Step 8: set the output class and output path
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("hdfs://hadoop100:8020/out/partition_out"));
        //3. Wait for the job to finish
        boolean b1 = job.waitForCompletion(true);
        return b1 ? 0 : 1;
    }
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        //Launch the job
        int run = ToolRunner.run(configuration, new JobMain(), args);
        System.exit(run);
    }
}
MapReduce program 3: Map and driver programs -- sorting in the shuffle
**Task: read two files and transfer the rows as objects, with object serialization and deserialization (the object class must implement the WritableComparable interface and override compareTo(), write(), and readFields(): comparison, serialization, and deserialization respectively).** The fields of each row are unpacked into the object's attributes and the object is used as the key; the shuffle then sorts the keys via compareTo.
Object code
public class EmpDep implements WritableComparable<EmpDep> {
    private String name;
    private String gender;
    private int age;
    private int deptNo;
    private String deptName;
    //Comparison: sort by age
    @Override
    public int compareTo(EmpDep o) {
        if (null == o) {
            return 0;
        } else {
            return this.age - o.age;
        }
    }
    //Serialization: write the fields as a byte stream
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name == null ? "" : name);
        out.writeUTF(gender == null ? "" : gender);
        out.writeInt(age);
        out.writeInt(deptNo);
        out.writeUTF(deptName == null ? "" : deptName);
    }
    //Deserialization: read the fields back in the same order they were written
    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        gender = in.readUTF();
        age = in.readInt();
        deptNo = in.readInt();
        deptName = in.readUTF();
    }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getGender() { return gender; }
    public void setGender(String gender) { this.gender = gender; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
    public int getDeptNo() { return deptNo; }
    public void setDeptNo(int deptNo) { this.deptNo = deptNo; }
    public String getDeptName() { return deptName; }
    public void setDeptName(String deptName) { this.deptName = deptName; }
    @Override
    public String toString() {
        return "EmpDep{" +
                "name='" + name + '\'' +
                ", gender='" + gender + '\'' +
                ", age=" + age +
                ", deptNo=" + deptNo +
                ", deptName='" + deptName + '\'' +
                '}';
    }
}
Map code
public class JoinMapper extends Mapper<LongWritable, Text, EmpDep, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String s = value.toString();
        //Skip empty lines
        if (null != s && !"".equals(s)) {
            //Split on whitespace
            String[] columns = s.split("\\s");
            EmpDep ed = new EmpDep();
            if (columns.length == 2) {
                //dep.txt: deptNo deptName
                ed.setDeptNo(Integer.valueOf(columns[0]));
                ed.setDeptName(columns[1]);
            } else {
                //emp.txt: name gender age deptNo
                int num = 0;
                String[] actual = new String[4];
                //Splitting on spaces may yield empty strings, so filter them out
                for (String column : columns) {
                    if (null == column || "".equals(column)) {
                        continue;
                    }
                    actual[num] = column;
                    num++;
                }
                ed.setName(actual[0]);
                ed.setGender(actual[1]);
                ed.setAge(Integer.valueOf(actual[2]));
                ed.setDeptNo(Integer.valueOf(actual[3]));
            }
            context.write(ed, NullWritable.get());
        }
    }
}
Driver code
public class JoinDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "ww");
        job.setJarByClass(JoinDriver.class);
        //Input
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.setInputPaths(job, new Path("lcs/"));
        //Map (no reducer is set; the EmpDep keys are still sorted by the shuffle)
        job.setMapperClass(JoinMapper.class);
        job.setMapOutputKeyClass(EmpDep.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(EmpDep.class);
        job.setOutputValueClass(NullWritable.class);
        //Output
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("out3"));
        boolean b1 = job.waitForCompletion(true);
        System.out.println(b1 ? "success" : "failure");
    }
}
MapReduce program 4: a Reduce-side join, with Map, Reduce, and driver programs
**Task: read two files (emp.txt and dep.txt) and join them on the department number in the Reduce phase. The rows are again transferred as EmpDep objects, but here the department number is used as the key, so rows from both files meet in the same reduce call.**
Object code
(EmpDep is the same class as in program 3 above.)
Map code
public class JoinMapper extends Mapper<LongWritable, Text, IntWritable, EmpDep> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String s = value.toString();
        //Skip empty lines
        if (null != s && !"".equals(s)) {
            //Split on spaces
            String[] columns = s.split(" ");
            EmpDep ed = new EmpDep();
            if (columns.length == 2) {
                //dep.txt: deptNo deptName
                ed.setDeptNo(Integer.valueOf(columns[0]));
                ed.setDeptName(columns[1]);
            } else {
                //emp.txt: name gender age deptNo
                int num = 0;
                String[] actual = new String[4];
                //Splitting on spaces may yield empty strings, so filter them out
                for (String column : columns) {
                    if (null == column || "".equals(column)) {
                        continue;
                    }
                    actual[num] = column;
                    num++;
                }
                ed.setName(actual[0]);
                ed.setGender(actual[1]);
                ed.setAge(Integer.valueOf(actual[2]));
                ed.setDeptNo(Integer.valueOf(actual[3]));
            }
            System.out.println(ed); //debug print
            //The department number is K2, so both files' rows group together in reduce
            context.write(new IntWritable(ed.getDeptNo()), ed);
        }
    }
}
Reducer code
public class JoinReducer extends Reducer<IntWritable, EmpDep, EmpDep, NullWritable> {
    //The <K2,V2> pairs coming out of the shuffle look like <deptNo, <EmpDep1, EmpDep2...>>;
    //each reduce call processes one key, then moves on to the next. The shuffle's grouping
    //is what makes the join possible: rows with the same deptNo from both files arrive
    //together. Without it, a single mapper row reaching the reducer alone could never be joined.
    @Override
    protected void reduce(IntWritable key, Iterable<EmpDep> values, Context context)
            throws IOException, InterruptedException {
        EmpDep ed = new EmpDep();
        ArrayList<EmpDep> edList = new ArrayList<>();
        for (EmpDep value : values) {
            if (null == value.getDeptName() || "".equals(value.getDeptName())) {
                //An emp.txt row: Hadoop reuses the value object, so copy it before storing
                EmpDep t = new EmpDep();
                t.setName(value.getName());
                t.setGender(value.getGender());
                t.setAge(value.getAge());
                t.setDeptNo(value.getDeptNo());
                edList.add(t);
            } else {
                //The dep.txt row for this key: remember its department name
                ed.setDeptNo(value.getDeptNo());
                ed.setDeptName(value.getDeptName());
            }
        }
        //Attach the department name to every employee of this department
        for (EmpDep empDep : edList) {
            empDep.setDeptName(ed.getDeptName());
            context.write(empDep, NullWritable.get());
        }
    }
}
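The header above mentions a driver program, but none appears in the notes; a minimal sketch, assuming the same lcs/ input directory as program 3 and a hypothetical out2 output path (the map output types differ from the final output types here, so both pairs must be declared):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ReduceJoinDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "reduce_join");
        job.setJarByClass(ReduceJoinDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("lcs/"));
        //Map emits <IntWritable, EmpDep>
        job.setMapperClass(JoinMapper.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(EmpDep.class);
        //Reduce emits <EmpDep, NullWritable>
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(EmpDep.class);
        job.setOutputValueClass(NullWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("out2"));
        System.out.println(job.waitForCompletion(true) ? "success" : "failure");
    }
}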
MapReduce program 5: a Map-side join, with Map and driver programs
**Task: read two files and join them entirely in the Map phase: the small file (DEP.txt) is shipped to every mapper through the distributed cache and loaded into a HashMap in setup(), so no shuffle-side grouping is needed for the join.**
Object code
(EmpDep is the same class as in program 3 above.)
Map code
public class TestMapperJoin extends Mapper<LongWritable, Text, EmpDep, NullWritable> {
    Map<Integer, String> depMap = new HashMap<>();
    //setup runs once per map task: load the cached DEP.txt into memory
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        URI uri = context.getCacheFiles()[0];
        FileReader fis = new FileReader(uri.getPath());
        BufferedReader br = new BufferedReader(fis);
        String line = null;
        while (null != (line = br.readLine())) {
            String[] columns = line.split(" ");
            depMap.put(Integer.valueOf(columns[0]), columns[1]);
        }
        br.close();
    }
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        EmpDep ed = new EmpDep();
        String[] columns = value.toString().split(" ");
        ed.setName(columns[0]);
        ed.setGender(columns[1]);
        ed.setAge(Integer.parseInt(columns[2]));
        ed.setDeptNo(Integer.parseInt(columns[3]));
        //Look up the department name in the in-memory map: the join happens here
        Set<Integer> depNos = depMap.keySet();
        for (Integer depNo : depNos) {
            if (depNo == ed.getDeptNo()) {
                ed.setDeptName(depMap.get(depNo));
            }
        }
        context.write(ed, NullWritable.get());
    }
}
Driver code
public class MapperJoinDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(MapperJoinDriver.class);
        //Input: only the big file; the small one travels via the distributed cache
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("lcs/EMP.txt"));
        //Map settings
        job.setMapperClass(TestMapperJoin.class);
        job.setMapOutputKeyClass(EmpDep.class);
        job.setMapOutputValueClass(NullWritable.class);
        //Ship DEP.txt to every map task through the distributed cache
        URI[] uris = {new URI("lcs/DEP.txt")};
        job.setCacheFiles(uris);
        //Output
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("out4"));
        boolean b = job.waitForCompletion(true);
        System.out.println(b ? "success" : "failure");
    }
}
Counters: collect job statistics
Built-in counter groups: TaskCounter, FileSystemCounter,
FileInputFormatCounter, FileOutputFormatCounter
Custom counters: defined inside your Mapper (or Reducer) code and incremented by hand, e.g. by 1 on every map call
public class PartitionMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        //A counter identified by a group name and a counter name
        Counter counter = context.getCounter("MR_COUNT", "MyRecordCounter");
        counter.increment(1L);
        context.write(value, NullWritable.get());
    }
}
public class PartitionReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    //Counters can also be identified by an enum
    public static enum Counter {
        MY_REDUCE_INPUT_RECORDS, MY_REDUCE_INPUT_BYTES
    }
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        context.getCounter(Counter.MY_REDUCE_INPUT_RECORDS).increment(1L);
        context.write(key, NullWritable.get());
    }
}
MapReduce serialization and sorting
Serialization converts a structured object into a byte stream;
Deserialization converts a byte stream back into a structured object;
Hadoop's serialization interface is Writable; a class only needs to implement it to be serializable;
Writable's sub-interface is WritableComparable, which supports both serialization and sorting;
public class SortBean implements WritableComparable<SortBean> {
    private String word;
    private int num;
    //getters, setters, and toString() omitted
    //Sorting: only the compareTo return value matters; the framework does the actual sort
    @Override
    public int compareTo(SortBean sortBean) {
        int result = this.word.compareTo(sortBean.word);
        if (result == 0) {
            return this.num - sortBean.num;
        }
        return result;
    }
    //Serialization
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(num);
    }
    //Deserialization
    @Override
    public void readFields(DataInput in) throws IOException {
        this.word = in.readUTF();
        this.num = in.readInt();
    }
}
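The notes stop here. To show how SortBean takes part in sorting, a minimal mapper sketch follows, assuming a hypothetical tab-delimited input where each line is word<TAB>num; once SortBean is K2, the shuffle orders the records by word and then num through compareTo, with no extra sorting code:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

//Hypothetical mapper: emitting SortBean as K2 makes the shuffle sort via compareTo
public class SortMapper extends Mapper<LongWritable, Text, SortBean, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] split = value.toString().split("\t"); //assumed input format
        SortBean bean = new SortBean();
        bean.setWord(split[0]);                  //setters are among the omitted methods
        bean.setNum(Integer.parseInt(split[1]));
        context.write(bean, NullWritable.get());
    }
}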