Big Data, Part 3: MapReduce
MapReduce covers a lot of ground. The previous two installments introduced how the InputFormat, Map, and Reduce phases work, and I'm sure you've hit all kinds of bugs and questions along the way (if not, I lose 😀). Leave a comment below and we'll work through them together. Today is the final installment on MapReduce.
3.4 OutputFormat Data Output
3.4.1 OutputFormat Interface Implementations
OutputFormat is the base class for MapReduce output; every class that produces MapReduce output implements the OutputFormat interface. Below are a few common implementations.
1. TextOutputFormat (text output)
Writes each record as a line of text.
2. SequenceFileOutputFormat
Used mainly when the output of one MapReduce job will serve as the input of a subsequent MapReduce job: the format is compact and compresses easily (see the sketch after this list).
3. Custom OutputFormat
Implement the output yourself according to your needs.
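For example, to chain two jobs through a SequenceFile, only the two drivers need to change. A minimal sketch, assuming two already-configured jobs (the ChainConfig class, the stage1/stage2 names, and the d:/tmp/stage1 path are illustrative, not from the text above):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainConfig {
    // Wire the output of stage 1 into the input of stage 2 as a SequenceFile
    public static void configure(Job stage1, Job stage2) throws IOException {
        // Stage 1 writes its (key, value) pairs in the compact binary SequenceFile format
        stage1.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setOutputPath(stage1, new Path("d:/tmp/stage1"));

        // Stage 2 reads the same pairs back with no text parsing needed
        stage2.setInputFormatClass(SequenceFileInputFormat.class);
        SequenceFileInputFormat.addInputPath(stage2, new Path("d:/tmp/stage1"));
    }
}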
3.4.2 Custom OutputFormat
In general, when we want output that the built-in formats can't give us, we need to write our own implementation class.
Steps:
(1) Define a class that extends FileOutputFormat.
(2) Write a custom RecordWriter, overriding write(), the method that actually emits the output data.
3.4.3 Custom OutputFormat Hands-On Case
(1) Requirement: filter the input log; lines containing atguigu go to e:/atguigu.log, all other lines go to e:/other.log. (In the code below, the two files are created under the job's output directory.)
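For instance, with input lines such as the following (made-up sample data for illustration):

http://www.atguigu.com
http://www.sohu.com
http://cn.atguigu.org
http://www.baidu.com

the first and third lines would end up in atguigu.log and the other two in other.log.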
(2) Hands-on implementation
Custom LogOutput:
public class LogOutput extends FileOutputFormat<LongWritable, Text> {
    @Override
    public RecordWriter<LongWritable, Text> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
        return new MyRecordWriter(job);
    }
}
Custom MyRecordWriter:
public class MyRecordWriter extends RecordWriter<LongWritable, Text> {
    private FSDataOutputStream atguiguOut;
    private FSDataOutputStream otherOut;

    public MyRecordWriter(TaskAttemptContext job) throws IOException {
        Configuration configuration = job.getConfiguration();
        FileSystem fileSystem = FileSystem.get(configuration);
        // Create the two files under the job's output directory
        String output = configuration.get(FileOutputFormat.OUTDIR);
        atguiguOut = fileSystem.create(new Path(output + "/atguigu.log"));
        otherOut = fileSystem.create(new Path(output + "/other.log"));
    }

    /**
     * The overridden output method: route each line to one of the two files
     * @param key
     * @param value
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public void write(LongWritable key, Text value) throws IOException, InterruptedException {
        String line = value.toString() + "\n";
        if (line.contains("atguigu")) {
            atguiguOut.write(line.getBytes());
        } else {
            otherOut.write(line.getBytes());
        }
    }

    /**
     * Release resources
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        IOUtils.closeStreams(atguiguOut, otherOut);
    }
}
Since the data is not transformed in the map and reduce phases here, there is no need to override the map and reduce methods (the default identity Mapper and Reducer simply pass records through). Below is the driver.
public class OutputDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(OutputDriver.class);
        job.setOutputFormatClass(LogOutput.class);
        FileInputFormat.setInputPaths(job, new Path("d:/input5"));
        FileOutputFormat.setOutputPath(job, new Path("d:/output5"));
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
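Note that FileOutputFormat.setOutputPath is still required even though we write our own files: FileOutputFormat uses it to verify that the output directory does not yet exist (otherwise the job fails with a FileAlreadyExistsException) and to write the _SUCCESS marker, and MyRecordWriter reads the same setting back through FileOutputFormat.OUTDIR. After a successful run, d:/output5 should contain atguigu.log, other.log, and _SUCCESS.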
3.5 Multiple Applications of Join
Before we dig into join, let's be clear about why we're learning it: mainly as groundwork for Hive later on (and don't ask me why we're learning Hive 😄, just kidding). In short, join is how we handle table-connection problems in MapReduce.
3.5.1 Reduce Join
The main work on the Reduce side: by the time data reaches reduce(), grouping on the join field as the key is already done; within each group we only need to separate the records that came from different files (they were tagged in the map phase) and then merge them.
Hands-on case
(1) Requirement
Merge the rows of the product information table into the order table, matching on the product pid.
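For illustration, assume the order table order.txt holds tab-separated (id, pid, amount) rows and the product table pd.txt holds (pid, pname) rows, with made-up sample data like:

order.txt:
1001	01	1
1002	02	2
1003	03	3
1004	01	4

pd.txt:
01	小米
02	华为
03	格力

(Only the name order.txt matters to the code below; the mapper tags records by checking for that file name, so the product file can be named anything else.)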
Code implementation
Since the final output should be id, pname, and amount, we may as well define a bean that carries exactly these properties.
public class OrderBean implements WritableComparable<OrderBean> {
    private String id;
    private String pname;
    private String pid;
    private int amount;

    @Override
    public String toString() {
        return id + "\t" + pname + "\t" + amount;
    }

    public String getId() { return id; }

    public void setId(String id) { this.id = id; }

    public String getPname() { return pname; }

    public void setPname(String pname) { this.pname = pname; }

    public String getPid() { return pid; }

    public void setPid(String pid) { this.pid = pid; }

    public int getAmount() { return amount; }

    public void setAmount(int amount) { this.amount = amount; }

    /**
     * Sort by pid first; if the pids are equal, sort by pname in descending order,
     * so the product record (the one that carries a pname) comes first in each group
     * @param o
     * @return
     */
    public int compareTo(OrderBean o) {
        int compare = this.getPid().compareTo(o.getPid());
        if (compare == 0) {
            return o.getPname().compareTo(this.getPname());
        } else {
            return compare;
        }
    }

    /**
     * Serialization; note that serialization and deserialization must keep the same field order
     * @param out
     * @throws IOException
     */
    public void write(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeUTF(pname);
        out.writeUTF(pid);
        out.writeInt(amount);
    }

    /**
     * Deserialization
     * @param in
     * @throws IOException
     */
    public void readFields(DataInput in) throws IOException {
        this.id = in.readUTF();
        this.pname = in.readUTF();
        this.pid = in.readUTF();
        this.amount = in.readInt();
    }
}
Mapper class:
public class RJMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {
    private OrderBean orderBean = new OrderBean();
    private String filename;

    /**
     * Record the name of the file this split comes from
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        InputSplit inputSplit = context.getInputSplit();
        FileSplit fileSplit = (FileSplit) inputSplit;
        filename = fileSplit.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if ("order.txt".equals(filename)) {
            // Order record: id, pid, amount; the empty pname doubles as the file tag
            orderBean.setId(fields[0]);
            orderBean.setPid(fields[1]);
            orderBean.setAmount(Integer.parseInt(fields[2]));
            orderBean.setPname("");
        } else {
            // Product record: pid, pname
            orderBean.setId("");
            orderBean.setPid(fields[0]);
            orderBean.setPname(fields[1]);
            orderBean.setAmount(0);
        }
        context.write(orderBean, NullWritable.get());
    }
}
Custom grouping comparator OrderComparator, so that all records with the same pid reach a single reduce() call:
public class OrderComparator extends WritableComparator {
    protected OrderComparator() {
        super(OrderBean.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Group only by pid, ignoring the pname used for sorting
        OrderBean oa = (OrderBean) a;
        OrderBean ob = (OrderBean) b;
        return oa.getPid().compareTo(ob.getPid());
    }
}
Reducer class:
public class RJReducer extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {
    @Override
    protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        Iterator<NullWritable> iterator = values.iterator();
        // Because pname is sorted in descending order, the first record in each group
        // is the product record; consume it and remember its pname
        iterator.next();
        String pname = key.getPname();
        // The framework reuses the key object, so each next() advances key to the next order record
        while (iterator.hasNext()) {
            iterator.next();
            key.setPname(pname);
            context.write(key, NullWritable.get());
        }
    }
}
Driver class:
public class RJDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(RJDriver.class);
        job.setMapperClass(RJMapper.class);
        job.setReducerClass(RJReducer.class);
        job.setMapOutputKeyClass(OrderBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);
        job.setGroupingComparatorClass(OrderComparator.class);
        FileInputFormat.setInputPaths(job, new Path("d:/input6"));
        FileOutputFormat.setOutputPath(job, new Path("d:/output6"));
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
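With the made-up sample data above, the job would emit one (id, pname, amount) line per order, grouped by pid (the relative order of rows sharing a pid is not guaranteed):

1001	小米	1
1004	小米	4
1002	华为	2
1003	格力	3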
To sum up:
Drawback: with this approach the merge happens entirely in the reduce phase. The reduce side bears heavy processing pressure while the map nodes carry a very light load, so resource utilization is poor, and the reduce phase is highly prone to data skew.
3.5.2 Map Join
Map join suits scenarios where one table is very small and the other very large.
Advantage: cache the small table(s) on the Map side and apply the join logic there. This moves work to the Map side, relieves the Reduce side of data pressure, and minimizes data skew.
Hands-on case
Requirement: the same join as in 3.5.1 — merge the product table into the order table by pid — but done entirely on the Map side, with the small product table cached.
Code implementation
Mapper class:
public class MJMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private Map<String, String> pMap = new HashMap<String, String>();
    private Text result = new Text();

    /**
     * Cache the small table: load the distributed-cache file into pMap (pid -> pname)
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // 1. Get the cache file location
        URI[] cacheFiles = context.getCacheFiles();
        // 2. Open a stream to it
        FileSystem fileSystem = FileSystem.get(context.getConfiguration());
        FSDataInputStream inputStream = fileSystem.open(new Path(cacheFiles[0]));
        // 3. Wrap the byte stream in a reader and load every line
        BufferedReader br = new BufferedReader(new InputStreamReader(inputStream));
        String line;
        while (StringUtils.isNotEmpty(line = br.readLine())) {
            String[] fields = line.split("\t");
            pMap.put(fields[0], fields[1]);
        }
        IOUtils.closeStreams(br);
    }

    /**
     * Do the join: replace each order's pid with the cached pname
     * @param key
     * @param value
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        result.set(fields[0] + "\t" + pMap.get(fields[1]) + "\t" + fields[2]);
        context.write(result, NullWritable.get());
    }
}
Driver class. Compared with the Reduce join driver, we drop the reducer, the grouping comparator, and the OrderBean key types; instead we register the small table in the distributed cache and set the number of reduce tasks to 0:
public class MJDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(MJDriver.class);
        job.setMapperClass(MJMapper.class);
        // A map-side join needs no reduce phase
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Cache the small product table (the pd.txt path is illustrative; adjust to your data)
        job.addCacheFile(new URI("file:///d:/input6/pd.txt"));
        // The regular input holds only the big order table
        FileInputFormat.setInputPaths(job, new Path("d:/input6/order.txt"));
        FileOutputFormat.setOutputPath(job, new Path("d:/output7"));
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
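Because the reduce phase is disabled, each map task writes its results directly, so the output files are named part-m-00000 and so on, rather than part-r-00000.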
Summary:
For joins, the difference in emphasis between the map side and the reduce side is easy to grasp at face value, but the concrete details demand much more thought, and the pitfalls multiply: efficiency, performance, and the inner workings of MapReduce all influence the process.