Big Data, Part 3: MapReduce
MapReduce covers a lot of ground. The previous two installments introduced how the InputFormat, Map, and Reduce phases work, and I'm sure you've hit all kinds of bugs and questions along the way (if not, I lose 😀). Leave a comment below and we'll work through them together. Today is the final installment on MapReduce.
3.4 OutputFormat Data Output
3.4.1 OutputFormat Interface Implementations
OutputFormat is the base class for MapReduce output; every class that produces MapReduce output implements the OutputFormat interface. Below are a few common implementations.
1. TextOutputFormat (text output)
Writes each record as a line of text.
2. SequenceFileOutputFormat
Used mainly when the output of one MapReduce job will serve as the input of a subsequent MapReduce job: the format is compact and compresses easily (see the sketch after this list).
3. Custom OutputFormat
Implement the output yourself according to your needs.
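For example, to chain two jobs through a SequenceFile, only the two drivers need to change. A minimal sketch, assuming two already-configured jobs (the ChainConfig class, the stage1/stage2 names, and the d:/tmp/stage1 path are illustrative, not from the text above):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainConfig {
    // Wire the output of stage 1 into the input of stage 2 as a SequenceFile
    public static void configure(Job stage1, Job stage2) throws IOException {
        // Stage 1 writes its (key, value) pairs in the compact binary SequenceFile format
        stage1.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setOutputPath(stage1, new Path("d:/tmp/stage1"));

        // Stage 2 reads the same pairs back with no text parsing needed
        stage2.setInputFormatClass(SequenceFileInputFormat.class);
        SequenceFileInputFormat.addInputPath(stage2, new Path("d:/tmp/stage1"));
    }
}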
3.4.2 Custom OutputFormat
In general, when we want output that the built-in formats can't give us, we need to write our own implementation class.
Steps:
(1) Define a class that extends FileOutputFormat.
(2) Write a custom RecordWriter, overriding write(), the method that actually emits the output data.
3.4.3 Custom OutputFormat Hands-On Case
(1) Requirement: filter the input log; lines containing atguigu go to e:/atguigu.log, all other lines go to e:/other.log. (In the code below, the two files are created under the job's output directory.)
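For instance, with input lines such as the following (made-up sample data for illustration):

http://www.atguigu.com
http://www.sohu.com
http://cn.atguigu.org
http://www.baidu.com

the first and third lines would end up in atguigu.log and the other two in other.log.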
(2) Hands-on implementation
Custom LogOutput:
public class LogOutput extends FileOutputFormat<LongWritable, Text> {
    @Override
    public RecordWriter<LongWritable, Text> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
        return new MyRecordWriter(job);
    }
}
Custom MyRecordWriter:
public class MyRecordWriter extends RecordWriter<LongWritable, Text> {
    private FSDataOutputStream atguiguOut;
    private FSDataOutputStream otherOut;

    public MyRecordWriter(TaskAttemptContext job) throws IOException {
        Configuration configuration = job.getConfiguration();
        FileSystem fileSystem = FileSystem.get(configuration);
        // Create the two files under the job's output directory
        String output = configuration.get(FileOutputFormat.OUTDIR);
        atguiguOut = fileSystem.create(new Path(output + "/atguigu.log"));
        otherOut = fileSystem.create(new Path(output + "/other.log"));
    }

    /**
     * The overridden output method: route each line to one of the two files
     * @param key
     * @param value
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public void write(LongWritable key, Text value) throws IOException, InterruptedException {
        String line = value.toString() + "\n";
        if (line.contains("atguigu")) {
            atguiguOut.write(line.getBytes());
        } else {
            otherOut.write(line.getBytes());
        }
    }

    /**
     * Release resources
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public void close(TaskAttemptContext context) throws IOException, InterruptedException {
        IOUtils.closeStreams(atguiguOut, otherOut);
    }
}
Since the data is not transformed in the map and reduce phases here, there is no need to override the map and reduce methods (the default identity Mapper and Reducer simply pass records through). Below is the driver.
public class OutputDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(OutputDriver.class);
        job.setOutputFormatClass(LogOutput.class);
        FileInputFormat.setInputPaths(job, new Path("d:/input5"));
        FileOutputFormat.setOutputPath(job, new Path("d:/output5"));
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
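Note that FileOutputFormat.setOutputPath is still required even though we write our own files: FileOutputFormat uses it to verify that the output directory does not yet exist (otherwise the job fails with a FileAlreadyExistsException) and to write the _SUCCESS marker, and MyRecordWriter reads the same setting back through FileOutputFormat.OUTDIR. After a successful run, d:/output5 should contain atguigu.log, other.log, and _SUCCESS.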
3.5 Multiple Applications of Join
Before we dig into join, let's be clear about why we're learning it: mainly as groundwork for Hive later on (and don't ask me why we're learning Hive 😄, just kidding). In short, join is how we handle table-connection problems in MapReduce.
3.5.1 Reduce Join
The main work on the Reduce side: by the time data reaches reduce(), grouping on the join field as the key is already done; within each group we only need to separate the records that came from different files (they were tagged in the map phase) and then merge them.
Hands-on case
(1) Requirement
Merge the rows of the product information table into the order table, matching on the product pid.
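For illustration, assume the order table order.txt holds tab-separated (id, pid, amount) rows and the product table pd.txt holds (pid, pname) rows, with made-up sample data like:

order.txt:
1001	01	1
1002	02	2
1003	03	3
1004	01	4

pd.txt:
01	小米
02	华为
03	格力

(Only the name order.txt matters to the code below; the mapper tags records by checking for that file name, so the product file can be named anything else.)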
Code implementation
Since the final output should be id, pname, and amount, we may as well define a bean that carries exactly these properties.
public class OrderBean implements WritableComparable<OrderBean> {
    private String id;
    private String pname;
    private String pid;
    private int amount;

    @Override
    public String toString() {
        return id + "\t" + pname + "\t" + amount;
    }

    public String getId() { return id; }

    public void setId(String id) { this.id = id; }

    public String getPname() { return pname; }

    public void setPname(String pname) { this.pname = pname; }

    public String getPid() { return pid; }

    public void setPid(String pid) { this.pid = pid; }

    public int getAmount() { return amount; }

    public void setAmount(int amount) { this.amount = amount; }

    /**
     * Sort by pid first; if the pids are equal, sort by pname in descending order,
     * so the product record (the one that carries a pname) comes first in each group
     * @param o
     * @return
     */
    public int compareTo(OrderBean o) {
        int compare = this.getPid().compareTo(o.getPid());
        if (compare == 0) {
            return o.getPname().compareTo(this.getPname());
        } else {
            return compare;
        }
    }

    /**
     * Serialization; note that serialization and deserialization must keep the same field order
     * @param out
     * @throws IOException
     */
    public void write(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeUTF(pname);
        out.writeUTF(pid);
        out.writeInt(amount);
    }

    /**
     * Deserialization
     * @param in
     * @throws IOException
     */
    public void readFields(DataInput in) throws IOException {
        this.id = in.readUTF();
        this.pname = in.readUTF();
        this.pid = in.readUTF();
        this.amount = in.readInt();
    }
}
Mapper class:
public class RJMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {
    private OrderBean orderBean = new OrderBean();
    private String filename;

    /**
     * Record the name of the file this split comes from
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        InputSplit inputSplit = context.getInputSplit();
        FileSplit fileSplit = (FileSplit) inputSplit;
        filename = fileSplit.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if ("order.txt".equals(filename)) {
            // Order record: id, pid, amount; the empty pname doubles as the file tag
            orderBean.setId(fields[0]);
            orderBean.setPid(fields[1]);
            orderBean.setAmount(Integer.parseInt(fields[2]));
            orderBean.setPname("");
        } else {
            // Product record: pid, pname
            orderBean.setId("");
            orderBean.setPid(fields[0]);
            orderBean.setPname(fields[1]);
            orderBean.setAmount(0);
        }
        context.write(orderBean, NullWritable.get());
    }
}
Custom grouping comparator OrderComparator, so that all records with the same pid reach a single reduce() call:
public class OrderComparator extends WritableComparator {
    protected OrderComparator() {
        super(OrderBean.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Group only by pid, ignoring the pname used for sorting
        OrderBean oa = (OrderBean) a;
        OrderBean ob = (OrderBean) b;
        return oa.getPid().compareTo(ob.getPid());
    }
}
Reducer class:
public class RJReducer extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {
    @Override
    protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        Iterator<NullWritable> iterator = values.iterator();
        // Because pname is sorted in descending order, the first record in each group
        // is the product record; consume it and remember its pname
        iterator.next();
        String pname = key.getPname();
        // The framework reuses the key object, so each next() advances key to the next order record
        while (iterator.hasNext()) {
            iterator.next();
            key.setPname(pname);
            context.write(key, NullWritable.get());
        }
    }
}
Driver class:
public class RJDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(RJDriver.class);
        job.setMapperClass(RJMapper.class);
        job.setReducerClass(RJReducer.class);
        job.setMapOutputKeyClass(OrderBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);
        job.setGroupingComparatorClass(OrderComparator.class);
        FileInputFormat.setInputPaths(job, new Path("d:/input6"));
        FileOutputFormat.setOutputPath(job, new Path("d:/output6"));
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
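With the made-up sample data above, the job would emit one (id, pname, amount) line per order, grouped by pid (the relative order of rows sharing a pid is not guaranteed):

1001	小米	1
1004	小米	4
1002	华为	2
1003	格力	3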
To sum up:
Drawback: with this approach the merge happens entirely in the reduce phase. The reduce side bears heavy processing pressure while the map nodes carry a very light load, so resource utilization is poor, and the reduce phase is highly prone to data skew.
3.5.2 Map Join
Map join suits scenarios where one table is very small and the other very large.
Advantage: cache the small table(s) on the Map side and apply the join logic there. This moves work to the Map side, relieves the Reduce side of data pressure, and minimizes data skew.
Hands-on case
Requirement: the same join as in 3.5.1 — merge the product table into the order table by pid — but done entirely on the Map side, with the small product table cached.
Code implementation
Mapper class:
public class MJMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private Map<String, String> pMap = new HashMap<String, String>();
    private Text result = new Text();

    /**
     * Cache the small table: load the distributed-cache file into pMap (pid -> pname)
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // 1. Get the cache file location
        URI[] cacheFiles = context.getCacheFiles();
        // 2. Open a stream to it
        FileSystem fileSystem = FileSystem.get(context.getConfiguration());
        FSDataInputStream inputStream = fileSystem.open(new Path(cacheFiles[0]));
        // 3. Wrap the byte stream in a reader and load every line
        BufferedReader br = new BufferedReader(new InputStreamReader(inputStream));
        String line;
        while (StringUtils.isNotEmpty(line = br.readLine())) {
            String[] fields = line.split("\t");
            pMap.put(fields[0], fields[1]);
        }
        IOUtils.closeStreams(br);
    }

    /**
     * Do the join: replace each order's pid with the cached pname
     * @param key
     * @param value
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        result.set(fields[0] + "\t" + pMap.get(fields[1]) + "\t" + fields[2]);
        context.write(result, NullWritable.get());
    }
}
Driver class. Compared with the Reduce join driver, we drop the reducer, the grouping comparator, and the OrderBean key types; instead we register the small table in the distributed cache and set the number of reduce tasks to 0:
public class MJDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(MJDriver.class);
        job.setMapperClass(MJMapper.class);
        // A map-side join needs no reduce phase
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Cache the small product table (the pd.txt path is illustrative; adjust to your data)
        job.addCacheFile(new URI("file:///d:/input6/pd.txt"));
        // The regular input holds only the big order table
        FileInputFormat.setInputPaths(job, new Path("d:/input6/order.txt"));
        FileOutputFormat.setOutputPath(job, new Path("d:/output7"));
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
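Because the reduce phase is disabled, each map task writes its results directly, so the output files are named part-m-00000 and so on, rather than part-r-00000.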
Summary:
For joins, the difference in emphasis between the map side and the reduce side is easy to grasp at face value, but the concrete details demand much more thought, and the pitfalls multiply: efficiency, performance, and the inner workings of MapReduce all influence the process.