Big Data (Part 3): MapReduce

MapReduce covers a lot of ground. The previous two installments walked through the mechanics of InputFormat, Map, and Reduce, and I'm sure you've hit all kinds of bugs and questions along the way (I'd be surprised if you hadn't 😀). Feel free to leave a comment below and we'll work through them together. Today we wrap up with the final MapReduce installment.

3.4 OutputFormat Data Output

3.4.1 OutputFormat Implementations

OutputFormat is the base class for MapReduce output; every output implementation in MapReduce extends OutputFormat. Several common implementations:
1. TextOutputFormat
Writes each record as a line of text.
2. SequenceFileOutputFormat
Mainly used when the output will serve as the input of a subsequent MapReduce job, since the format is compact and compresses well.
3. Custom OutputFormat
A user-defined implementation tailored to specific output requirements.
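As an illustration of the second option, a driver that feeds the next job through SequenceFiles can also turn on output compression. This is a configuration sketch using the standard Hadoop classes; it assumes an existing `Job` instance, and `SnappyCodec` is just one example codec:

```java
// Sketch: switch an existing job to compressed SequenceFile output.
job.setOutputFormatClass(SequenceFileOutputFormat.class);

// Enable output compression and pick a codec (SnappyCodec as an example).
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

// BLOCK compression compresses runs of records together, which usually
// gives the best ratio for SequenceFiles.
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
```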

3.4.2 Custom OutputFormat

In practice, whenever the built-in formats don't produce exactly the output we want, we need a custom implementation.
Steps:
(1) Define a class that extends FileOutputFormat.
(2) Write a custom RecordWriter, overriding its write() method, which controls how each record is written out.

3.4.3 Custom OutputFormat Case Study

(1) Requirement: filter an input log file so that lines containing atguigu go to e:/atguigu.log and all other lines go to e:/other.log.
(2) Implementation
Custom LogOutput:

public class LogOutput extends FileOutputFormat<LongWritable, Text> {
   @Override
   public RecordWriter<LongWritable, Text> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
       return new MyRecordWriter(job);
   }
}

Custom MyRecordWriter:

public class MyRecordWriter extends RecordWriter<LongWritable, Text> {
   private FSDataOutputStream atguiguOut;
   private FSDataOutputStream otherOut;
   public MyRecordWriter(TaskAttemptContext job) throws IOException {
       Configuration configuration = job.getConfiguration();
       FileSystem fileSystem = FileSystem.get(configuration);
       String output = configuration.get(FileOutputFormat.OUTDIR);
       atguiguOut = fileSystem.create(new Path(output + "/atguigu.log"));
       otherOut = fileSystem.create(new Path(output + "/other.log"));
   }
   /**
    * Write each record to the output file matching its content
    * @param key
    * @param value
    * @throws IOException
    * @throws InterruptedException
    */
   @Override
   public void write(LongWritable key, Text value) throws IOException, InterruptedException {
       String line =value.toString() + "\n";
       if(line.contains("atguigu")){
           atguiguOut.write(line.getBytes());
       }else {
           otherOut.write(line.getBytes());
       }
   }
   /**
    * Release the output streams
    * @param context
    * @throws IOException
    * @throws InterruptedException
    */
   @Override
   public void close(TaskAttemptContext context) throws IOException, InterruptedException {        >IOUtils.closeStreams(atguiguOut,otherOut);
   }
}

Since the records are not transformed in the map or reduce phase, we don't need to override map() or reduce(); the default identity implementations pass each line straight through. Here is the driver:

public class OutputDriver {
   public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
       Job job = Job.getInstance(new Configuration());
       job.setJarByClass(OutputDriver.class);
       job.setOutputFormatClass(LogOutput.class);
       FileInputFormat.setInputPaths(job, new Path("d:/input5"));
       FileOutputFormat.setOutputPath(job,new Path("d:/output5"));
       boolean b = job.waitForCompletion(true);
       System.exit(b ? 0 :1);
  }
}

3.5 Joins and Their Applications

Before we dig into joins, let's be clear about why we're learning them: partly to prepare for Hive later on (don't ask me why you should learn Hive, haha), but mainly because joins are how we handle problems that connect data across tables.

3.5.1 Reduce Join

The main work on the Reduce side: grouping by the join field as the key has already been done by the shuffle, so within each group we only need to separate the records that came from different source files (tagged during the map phase) and then merge them.
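The idea can be sketched in plain Java, independent of Hadoop. The "map" step below tags every record with its source table and keys it by pid; grouping by key (which the shuffle does in a real job) is simulated with a TreeMap; the "reduce" step copies pname from the product record onto each order record. Table layouts and sample values are illustrative.

```java
import java.util.*;

public class ReduceJoinSketch {

    // orders: rows of {id, pid, amount}; products: rows of {pid, pname}
    public static List<String> reduceJoin(List<String[]> orders, List<String[]> products) {
        // "map" phase: tag records by source table and group them by pid
        Map<String, List<String[]>> groups = new TreeMap<>();
        for (String[] o : orders) {
            groups.computeIfAbsent(o[1], k -> new ArrayList<>())
                  .add(new String[]{"order", o[0], o[2]});
        }
        for (String[] p : products) {
            groups.computeIfAbsent(p[0], k -> new ArrayList<>())
                  .add(new String[]{"product", p[1]});
        }

        // "reduce" phase: within each pid group, find the product record,
        // then emit every order record with its pname filled in
        List<String> out = new ArrayList<>();
        for (List<String[]> group : groups.values()) {
            String pname = "";
            for (String[] r : group) {
                if (r[0].equals("product")) pname = r[1];
            }
            for (String[] r : group) {
                if (r[0].equals("order")) out.add(r[1] + "\t" + pname + "\t" + r[2]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> orders = Arrays.asList(
                new String[]{"1001", "01", "1"},
                new String[]{"1002", "02", "2"},
                new String[]{"1003", "01", "3"});
        List<String[]> products = Arrays.asList(
                new String[]{"01", "xiaomi"},
                new String[]{"02", "huawei"});
        for (String line : reduceJoin(orders, products)) {
            System.out.println(line);
        }
    }
}
```

Note that every order record has to travel through the shuffle to reach the reducer holding its pid group; that is exactly where the data-skew problem discussed below comes from.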

Case study
(1) Requirement
Merge the data from the product table into the order table, matching on the product pid.

Code implementation
Since the final output should be id, pname, amount, we may as well define a bean class containing these fields.

public class OrderBean implements WritableComparable<OrderBean> {
   private String id;
   private String pname;
   private String pid;
   private int amount;
   @Override
   public String toString() {
       return  id + "\t" + pname + "\t" + amount;
   }
   public String getId() {
       return id;
   }
   public void setId(String id) {
       this.id = id;
   }
   public String getPname() {
       return pname;
   }
   public void setPname(String pname) {
       this.pname = pname;
   }
   public String getPid() {
       return pid;
   }
   public void setPid(String pid) {
       this.pid = pid;
   }
   public int getAmount() {
       return amount;
   }
   public void setAmount(int amount) {
       this.amount = amount;
   }
   /**
    * Sort by pid first;
    * if pid is equal, sort by pname in descending order, so the product
    * record (whose pname is non-empty) comes before the order records
    * @param o
    * @return
    */
   public int compareTo(OrderBean o) {
       int compare = this.getPid().compareTo(o.getPid());
       if(compare == 0){
           return o.getPname().compareTo(this.getPname());
       }else {
           return compare;
       }
   }
   /**
    * Serialization; note that fields must be written and read back
    * in exactly the same order
    * @param out
    * @throws IOException
    */
   public void write(DataOutput out) throws IOException {
       out.writeUTF(id);
       out.writeUTF(pname);
       out.writeUTF(pid);
       out.writeInt(amount);
   }
   /**
    * Deserialization
    * @param in
    * @throws IOException
    */
   public void readFields(DataInput in) throws IOException {
       this.id = in.readUTF();
       this.pname = in.readUTF();
       this.pid = in.readUTF();
       this.amount = in.readInt();
   }
}

Mapper class:

public class RJMapper extends Mapper<LongWritable, Text,OrderBean, NullWritable> {
   private OrderBean orderBean = new OrderBean();
   private String filename;
   /**
    * Get the name of the file this split comes from
    * @param context
    * @throws IOException
    * @throws InterruptedException
    */
   @Override
   protected void setup(Context context) throws IOException, InterruptedException {
       InputSplit inputSplit = context.getInputSplit();
       FileSplit fileSplit = (FileSplit) inputSplit;
       filename = fileSplit.getPath().getName();
   }
   @Override
   protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
       String[] fields = value.toString().split("\t");
       if ("order.txt".equals(filename)) {
           orderBean.setId(fields[0]);
           orderBean.setPid(fields[1]);
           orderBean.setAmount(Integer.parseInt(fields[2]));
           orderBean.setPname("");
       } else {
           orderBean.setId("");
           orderBean.setPid(fields[0]);
           orderBean.setPname(fields[1]);
           orderBean.setAmount(0);
       }
       context.write(orderBean, NullWritable.get());
   }
}

Custom grouping comparator OrderComparator (groups records by pid only, so all records for one product land in the same reduce call):

public class OrderComparator extends WritableComparator {
   protected OrderComparator(){
       super(OrderBean.class,true);
   }
   @Override
   public int compare(WritableComparable a, WritableComparable b) {
       OrderBean oa = (OrderBean)a;
       OrderBean ob = (OrderBean)b;
       return oa.getPid().compareTo(ob.getPid());
   }
}

Reducer class:

public class RJReducer extends Reducer<OrderBean, NullWritable,OrderBean,NullWritable> {
   @Override
   protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
       Iterator<NullWritable> iterator = values.iterator();
       // The first record in each group is the product record (its non-empty
       // pname sorts first), so advance once and read pname from the key;
       // note Hadoop reuses the key object as the iterator advances
       iterator.next();
       String pname = key.getPname();
       while (iterator.hasNext()){
           iterator.next();
           key.setPname(pname);
           context.write(key,NullWritable.get());
       }
   }
}

Driver class:

public class RJDriver {
   public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
       Job job = Job.getInstance(new Configuration());
       job.setJarByClass(RJDriver.class);
       job.setMapperClass(RJMapper.class);
       job.setReducerClass(RJReducer.class);
       job.setMapOutputKeyClass(OrderBean.class);
       job.setMapOutputValueClass(NullWritable.class);
       job.setOutputKeyClass(OrderBean.class);
       job.setOutputValueClass(NullWritable.class);
       job.setGroupingComparatorClass(OrderComparator.class);
       FileInputFormat.setInputPaths(job,new Path("d:/input6"));
       FileOutputFormat.setOutputPath(job,new Path("d:/output6"));
       boolean b = job.waitForCompletion(true);
       System.exit(b? 0 : 1);
   }
}

To summarize:
Drawback: with this approach the merge happens entirely in the reduce phase, so the reduce side carries almost all of the processing load while the map nodes stay lightly loaded; resource utilization is poor, and the reduce phase is highly prone to data skew.

3.5.2 Map Join

Map join suits the scenario where one table is very small and the other is very large.
Advantage: by caching the small table(s) on the Map side and doing the join logic there, we shift work into the map phase, relieve pressure on the reduce side, and minimize data skew.
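The contrast with the reduce join is easiest to see in a Hadoop-free sketch: the small product table is loaded into a HashMap once (what setup() does with the cached file in a real job), and each order record is then joined by a single lookup in the "map" step, with no reduce phase at all. Table layouts and sample values are illustrative.

```java
import java.util.*;

public class MapJoinSketch {

    // orders: rows of {id, pid, amount}; products: rows of {pid, pname}
    public static List<String> mapJoin(List<String[]> orders, List<String[]> products) {
        // setup(): cache the small table in memory
        Map<String, String> pMap = new HashMap<>();
        for (String[] p : products) {
            pMap.put(p[0], p[1]);
        }
        // map(): one hash lookup per order record, nothing goes through a shuffle
        List<String> out = new ArrayList<>();
        for (String[] o : orders) {
            out.add(o[0] + "\t" + pMap.getOrDefault(o[1], "") + "\t" + o[2]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> orders = Arrays.asList(
                new String[]{"1001", "01", "1"},
                new String[]{"1002", "02", "2"});
        List<String[]> products = Arrays.asList(
                new String[]{"01", "xiaomi"},
                new String[]{"02", "huawei"});
        for (String line : mapJoin(orders, products)) {
            System.out.println(line);
        }
    }
}
```

The trade-off is memory: the small table must fit in each map task's heap, which is exactly why this pattern only applies when one side of the join is small.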

Case study
Requirement: as before, merge the product info into the order table by pid, this time with the product table small enough to cache in memory.
Code implementation
Mapper class:

public class MJMapper extends Mapper<LongWritable, Text,Text, NullWritable> {
   private Map<String,String> pMap = new HashMap<String, String>();
   private Text result = new Text();
   /**
    * Cache the small table in memory before any records are processed
    * @param context
    * @throws IOException
    * @throws InterruptedException
    */
   @Override
   protected void setup(Context context) throws IOException, InterruptedException {
       //1. Get the URI of the cached file
       URI[] cacheFiles = context.getCacheFiles();
       //2. Open an input stream to it
       FileSystem fileSystem = FileSystem.get(context.getConfiguration());
       FSDataInputStream inputStream = fileSystem.open(new Path(cacheFiles[0]));
       //3. Wrap it in a character reader and load the pid -> pname pairs
       BufferedReader br = new BufferedReader(new InputStreamReader(inputStream));
       String line;
       while (StringUtils.isNotEmpty(line = br.readLine())){
           String[] fields = line.split("\t");
           pMap.put(fields[0],fields[1]);
       }
       IOUtils.closeStreams(br);
   }
   /**
    * Join each order record against the cached product table
    * @param key
    * @param value
    * @param context
    * @throws IOException
    * @throws InterruptedException
    */
   @Override
   protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
       String[] fields = value.toString().split("\t");
       result.set(fields[0] + "\t" + pMap.get(fields[1]) + "\t" + fields[2]);
       context.write(result,NullWritable.get());
   }
}

Driver class (note this is a map-only job: it registers MJMapper, sets zero reduce tasks, and adds the small table to the distributed cache so setup() can read it; the cache path below is an example, and the small table should live outside the input directory):

public class MJDriver {
   public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
       Job job = Job.getInstance(new Configuration());
       job.setJarByClass(MJDriver.class);
       job.setMapperClass(MJMapper.class);
       // map-only job: no reducer at all
       job.setNumReduceTasks(0);
       job.setOutputKeyClass(Text.class);
       job.setOutputValueClass(NullWritable.class);
       // cache the small product table so every map task can load it in setup()
       // (example path; keep pd.txt out of the input directory)
       job.addCacheFile(URI.create("file:///d:/cache/pd.txt"));
       FileInputFormat.setInputPaths(job, new Path("d:/input6"));
       FileOutputFormat.setOutputPath(job, new Path("d:/output6"));
       boolean b = job.waitForCompletion(true);
       System.exit(b ? 0 : 1);
   }
}

To summarize:
The difference between joining on the map side and the reduce side is easy to grasp at face value, but a concrete implementation has much more to weigh, and plenty of pitfalls: efficiency, performance, and the workings of the MapReduce framework itself all shape how the job behaves.
