MapReduce Application Scenarios
I. A Simple Look at QQ Mutual Friends
If you have used QQ, you know it periodically recommends people you may know, showing how many mutual friends you share.
Where does that data come from? It is derived by analyzing every user's friend list.
For example, if A's friend list contains B and C, and B's friend list contains C, D, and E, then the reasonable thing for the system to do is recommend D and E to A (but not C, whom A already knows).
For QQ, the user base is enormous. Suppose we analyze each user's friend list and store it as text, one record per user, in a layout like this:

User ID | Friend List | Other Data |
---|---|---|
A | B C D … | … |
Now consider QQ's registered user count: roughly 800 million. If each user's friend list takes about 10 KB, the total comes to 8 × 10^8 × 10 KB ≈ 8 TB of data. With conventional tools, even storing that much data is a problem, let alone processing it.
This is exactly where HDFS and MapReduce come in.
1.1 Sample Data Analysis
Here is a small sample data set:
ID      Friend List
--------------------------
tom cat hadoop hello
hello mr tom world hive
cat tom hive
hive cat hadoop world hello mr
mr hive hello
hadoop tom hive world
world hadoop hive hello
Step 1: Mapper
Process each line of the data. Take the first line as an example: tom is a direct friend of everyone in his list, while any two people within that list form an indirect pair who share tom as a mutual friend.
If we use <tom-cat,0> to mark a direct relationship and <cat-hadoop,1> to mark a second-degree relationship, then the first line produces <tom-cat,0>, <tom-hadoop,0>, <tom-hello,0>, <cat-hadoop,1>, <cat-hello,1>, and <hadoop-hello,1>. Filtering for keys whose values are all 1 leaves exactly the data we need.
One issue to think about: <tom-cat,0> and <cat-tom,0> would count as two different keys even though they describe the same pair. The code solves this with a custom key class that normalizes the order of the two names.
Step 2: Reducer
Iterate over each key's values: if any value is 0, the two users are already direct friends, so discard the key; otherwise keep it and sum the 1s to get the number of mutual friends.
At this point we have extracted the data we need, but a pair like <tom-cat,1> goes both ways: the system can recommend cat to tom and tom to cat, and recommendations should be prioritized by the number of mutual friends.
So we need a second MapReduce job to expand and order the data.
Step 3: Mapper
Emit each pair in both directions, using a custom key class that implements the sort order.
Step 4: Reducer
Simply write the data out.
1.2 Code Walkthrough
MapStep01.java
package com.shsxt.mr.friend;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.hsqldb.lib.StringUtil;

public class MapStep01 extends Mapper<LongWritable, Text, FoF, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String lines = value.toString();
        // For simplicity the fields are separated by single spaces;
        // the table layout above was only for readability
        String[] friends = StringUtil.split(lines, " ");
        for (int i = 1; i < friends.length; i++) {
            String friend = friends[i];
            // The first field paired with anyone in the friend list is a
            // direct friendship, marked with value = 0
            context.write(new FoF(friend, friends[0]), new IntWritable(0));
            // Any two people within the friend list are an indirect pair,
            // marked with value = 1
            for (int j = i + 1; j < friends.length; j++) {
                String fri = friends[j];
                // The key is a custom class; its code follows below
                context.write(new FoF(fri, friend), new IntWritable(1));
            }
        }
    }
}
FoF.java
package com.shsxt.mr.friend;

import org.apache.hadoop.io.Text;

// Extends Text, which already implements WritableComparable,
// so we don't have to write the serialization code ourselves
public class FoF extends Text {

    public FoF() {
        super();
    }

    public FoF(String friend01, String friend02) {
        set(getOf(friend01, friend02));
    }

    // Normalize the order of the two names so that the same pair never
    // produces two different keys, e.g. cat-tom and tom-cat
    private String getOf(String friend01, String friend02) {
        int c = friend01.compareTo(friend02);
        if (c > 0) {
            return friend02 + "\t" + friend01;
        }
        return friend01 + "\t" + friend02;
    }
}
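To sanity-check the normalization, a throwaway local test like the one below can be used (FoFDemo is a hypothetical scratch class, not part of the job). Both constructions print the identical key:

package com.shsxt.mr.friend;

// Hypothetical scratch class, not part of the MapReduce job
public class FoFDemo {
    public static void main(String[] args) {
        // Both orderings normalize to the same key: "cat\ttom"
        System.out.println(new FoF("tom", "cat"));
        System.out.println(new FoF("cat", "tom"));
    }
}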
ReduceStep01.java
package com.shsxt.mr.friend;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.hsqldb.lib.StringUtil;

public class ReduceStep01 extends Reducer<FoF, IntWritable, Text, NullWritable> {

    @Override
    protected void reduce(FoF key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        boolean flag = true;
        // The grouped data arriving here looks like: cat hadoop {0,0,1,1}
        for (IntWritable value : values) {
            // If any value is 0, the two are direct friends,
            // so skip this key entirely
            if (value.get() == 0) {
                flag = false;
                break;
            }
            // Count the number of mutual friends
            sum += value.get();
        }
        if (flag) {
            // Pack everything into the output key to simplify the next job
            String msg = StringUtil.split(key.toString(), "\t")[0] + " "
                    + StringUtil.split(key.toString(), "\t")[1] + " " + sum;
            context.write(new Text(msg), NullWritable.get());
        }
    }
}
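With the sample input above, job 1's output, which becomes job 2's input, should contain one line per indirect pair in normalized (alphabetical) order. A few example lines, derived from the final results shown further below:

cat hadoop 2
hadoop hello 3
hive tom 3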
MapStep02.java
package com.shsxt.mr.friend;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapStep02 extends Mapper<LongWritable, Text, FriendSort, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parse a line of job 1's output: "friend01 friend02 count"
        String lines = value.toString();
        String friend01 = lines.split(" ")[0];
        String friend02 = lines.split(" ")[1];
        int hot = Integer.parseInt(lines.split(" ")[2]);
        System.out.println(friend01 + " " + friend02 + " " + hot);
        System.out.println(friend02 + " " + friend01 + " " + hot);
        // Emit the pair in both directions so each user can receive the
        // recommendation; the custom key class below defines the sort order
        context.write(new FriendSort(friend01, friend02, hot), new IntWritable(hot));
        context.write(new FriendSort(friend02, friend01, hot), new IntWritable(hot));
    }
}
FriendSort.java
package com.shsxt.mr.friend;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class FriendSort implements WritableComparable<FriendSort> {

    private String friend01;
    private String friend02;
    private int hot;

    // Custom sort order used during the shuffle
    @Override
    public int compareTo(FriendSort friend) {
        int a = friend01.compareTo(friend.getFriend01());
        int b = friend02.compareTo(friend.getFriend02());
        // Names are compared lexicographically, in ascending order
        if (a == 0) {
            if (b == 0) {
                // Only when both names match do we fall through to the
                // mutual-friend count, in descending order; since friend02
                // is compared before hot, the count acts as a tie-breaker
                return -Integer.compare(hot, friend.getHot());
            } else {
                return b;
            }
        }
        return a;
    }

    // Serialization; Hadoop's mechanism differs from plain Java serialization
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(friend01);
        out.writeUTF(friend02);
        out.writeInt(hot);
    }

    // Deserialization, reading fields in the same order they were written
    @Override
    public void readFields(DataInput in) throws IOException {
        this.friend01 = in.readUTF();
        this.friend02 = in.readUTF();
        this.hot = in.readInt();
    }

    public FriendSort() {
        super();
    }

    public FriendSort(String friend01, String friend02, int hot) {
        super();
        this.friend01 = friend01;
        this.friend02 = friend02;
        this.hot = hot;
    }

    public String getFriend01() {
        return friend01;
    }

    public void setFriend01(String friend01) {
        this.friend01 = friend01;
    }

    public String getFriend02() {
        return friend02;
    }

    public void setFriend02(String friend02) {
        this.friend02 = friend02;
    }

    public int getHot() {
        return hot;
    }

    public void setHot(int hot) {
        this.hot = hot;
    }
}
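To see what ordering this compareTo actually produces, here is a minimal local sketch (FriendSortDemo is a hypothetical scratch class, not part of the job). Note that because friend02 is compared before hot, the output ends up ordered by friend name, with the mutual-friend count only breaking ties between identical name pairs:

package com.shsxt.mr.friend;

import java.util.Arrays;
import java.util.List;

// Hypothetical scratch class, not part of the MapReduce job
public class FriendSortDemo {
    public static void main(String[] args) {
        List<FriendSort> keys = Arrays.asList(
                new FriendSort("tom", "world", 2),
                new FriendSort("tom", "hive", 3),
                new FriendSort("cat", "hello", 2),
                new FriendSort("tom", "mr", 1));
        keys.sort(null); // null = natural ordering, i.e. the compareTo above
        for (FriendSort k : keys) {
            System.out.println(k.getFriend01() + " " + k.getFriend02() + " " + k.getHot());
        }
        // Prints: cat hello 2, tom hive 3, tom mr 1, tom world 2 --
        // the same name-first ordering visible in the final results below
    }
}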
ReduceStep02.java
package com.shsxt.mr.friend;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceStep02 extends Reducer<FriendSort, IntWritable, Text, NullWritable> {

    @Override
    protected void reduce(FriendSort friend, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable i : values) {
            // Each key should be unique by this point,
            // so this loop normally runs exactly once
            sum = i.get();
        }
        String msg = friend.getFriend01() + " " + friend.getFriend02() + " " + sum;
        System.out.println(msg);
        context.write(new Text(msg), NullWritable.get());
    }
}
JobRunner.java
The settings in this class are standard boilerplate, so I won't dwell on them.
package com.shsxt.mr.friend;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobRunner {

    public static void main(String[] args) throws Exception {
        // Chain the two jobs: job 2 only runs if job 1 succeeds
        boolean a = job01();
        if (a) {
            job02();
        }
    }

    static boolean job01() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://node02:8020");
        // This property expects a bare hostname, not host:port
        conf.set("yarn.resourcemanager.hostname", "node02");

        Job job = Job.getInstance(conf);
        job.setJobName("job01..");
        job.setJarByClass(JobRunner.class);
        job.setMapperClass(MapStep01.class);
        job.setReducerClass(ReduceStep01.class);
        job.setMapOutputKeyClass(FoF.class);
        job.setMapOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/friend/input/data.txt"));
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/friend/output/01/");
        // Delete the output directory if it already exists
        if (fs.exists(out)) {
            fs.delete(out, true);
        }
        FileOutputFormat.setOutputPath(job, out);
        return job.waitForCompletion(true);
    }

    static boolean job02() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://node02:8020");
        conf.set("yarn.resourcemanager.hostname", "node02");

        Job job = Job.getInstance(conf);
        job.setJobName("job02..");
        job.setJarByClass(JobRunner.class);
        job.setMapperClass(MapStep02.class);
        job.setReducerClass(ReduceStep02.class);
        job.setMapOutputKeyClass(FriendSort.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Job 1's output directory is job 2's input
        FileInputFormat.addInputPath(job, new Path("/friend/output/01/"));
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/friend/output/02/");
        if (fs.exists(out)) {
            fs.delete(out, true);
        }
        FileOutputFormat.setOutputPath(job, out);
        return job.waitForCompletion(true);
    }
}
Run results:
cat hadoop 2
cat hello 2
cat mr 1
cat world 1
hadoop cat 2
hadoop hello 3
hadoop mr 1
hello cat 2
hello hadoop 3
hive tom 3
mr cat 1
mr hadoop 1
mr tom 1
mr world 2
tom hive 3
tom mr 1
tom world 2
world cat 1
world mr 2
world tom 2
II. Weather Statistics
The task: find the two days with the highest temperatures in each month of each year. Sample data (date-time and temperature are tab-separated):
1949-10-01 14:21:02 34c
1949-10-01 19:21:02 38c
1949-10-02 14:01:02 36c
1950-01-01 11:21:02 32c
1950-10-01 12:21:02 37c
1951-12-01 12:21:02 23c
1950-10-02 12:21:02 41c
1950-10-03 12:21:02 27c
1951-07-01 12:21:02 45c
1951-07-02 12:21:02 46c
2.1 Code Walkthrough
TqMap.java
package com.shsxt.mr.tq;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.hsqldb.lib.StringUtil;

public class TqMap extends Mapper<LongWritable, Text, TQ, Text> {

    TQ tq = new TQ();
    Text vwd = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // value: 1949-10-01 14:21:02	34c  (tab-separated)
        try {
            String[] lines = StringUtil.split(value.toString(), "\t");
            SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd");
            Date date = dateFormat.parse(lines[0]);
            Calendar calendar = Calendar.getInstance();
            calendar.setTime(date);
            tq.setYear(calendar.get(Calendar.YEAR));
            // Calendar.MONTH is zero-based, hence the +1
            tq.setMonth(calendar.get(Calendar.MONTH) + 1);
            tq.setDay(calendar.get(Calendar.DAY_OF_MONTH));
            // Strip the trailing 'c' from the temperature, e.g. "34c" -> 34
            int wd = Integer.parseInt(lines[1].substring(0, lines[1].length() - 1));
            tq.setWd(wd);
            vwd.set(wd + "");
            // Pack everything into the key object so the comparators
            // below can operate on it
            context.write(tq, vwd);
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
}
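One subtle point: the pattern "yyyy-MM-dd" still parses the full string "1949-10-01 14:21:02", because DateFormat.parse(String) only requires the pattern to match a prefix of the input and silently ignores the trailing text. A minimal sketch to confirm this (DateParseDemo is a hypothetical scratch class):

package com.shsxt.mr.tq;

import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

// Hypothetical scratch class, not part of the job
public class DateParseDemo {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd");
        // The trailing " 14:21:02" is ignored by parse()
        Date date = dateFormat.parse("1949-10-01 14:21:02");
        Calendar calendar = Calendar.getInstance();
        calendar.setTime(date);
        System.out.println(calendar.get(Calendar.YEAR));         // 1949
        System.out.println(calendar.get(Calendar.MONTH) + 1);    // 10
        System.out.println(calendar.get(Calendar.DAY_OF_MONTH)); // 1
    }
}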
TQ.java
package com.shsxt.mr.tq;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class TQ implements WritableComparable<TQ> {

    private int year;
    private int month;
    private int day;
    private int wd; // temperature

    // Serialization
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(month);
        out.writeInt(day);
        out.writeInt(wd);
    }

    // Deserialization, reading fields in the same order they were written
    @Override
    public void readFields(DataInput in) throws IOException {
        this.year = in.readInt();
        this.month = in.readInt();
        this.day = in.readInt();
        this.wd = in.readInt();
    }

    // Default comparison: year, then month, then day, all ascending
    // (the job replaces this with TqSortComparator for the shuffle sort)
    @Override
    public int compareTo(TQ o) {
        int c1 = Integer.compare(this.year, o.getYear());
        if (c1 == 0) {
            int c2 = Integer.compare(this.month, o.getMonth());
            if (c2 == 0) {
                return Integer.compare(this.getDay(), o.getDay());
            }
            return c2;
        }
        return c1;
    }

    public int getYear() {
        return year;
    }

    // remaining getters and setters omitted ...
}
TqPartitioner.java
Partitioner used in the map stage: records are routed to reducers by year.
package com.shsxt.mr.tq;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class TqPartitioner extends Partitioner<TQ, Text> {

    @Override
    public int getPartition(TQ key, Text value, int numPartitions) {
        // All records of the same year go to the same reducer
        return key.getYear() % numPartitions;
    }
}
TqGroupComparator.java
Custom grouping comparator: records are grouped for reduce() by year and month only.
package com.shsxt.mr.tq;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class TqGroupComparator extends WritableComparator {

    public TqGroupComparator() {
        // true: create TQ instances so compare() receives deserialized keys
        super(TQ.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        TQ t1 = (TQ) a;
        TQ t2 = (TQ) b;
        // Group by year and month; day and temperature are ignored
        int c1 = Integer.compare(t1.getYear(), t2.getYear());
        if (c1 == 0) {
            return Integer.compare(t1.getMonth(), t2.getMonth());
        }
        return c1;
    }
}
TqSortComparator.java
Custom sort comparator for the shuffle.
package com.shsxt.mr.tq;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class TqSortComparator extends WritableComparator {

    public TqSortComparator() {
        super(TQ.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        TQ t1 = (TQ) a;
        TQ t2 = (TQ) b;
        int c1 = Integer.compare(t1.getYear(), t2.getYear());
        if (c1 == 0) {
            int c2 = Integer.compare(t1.getMonth(), t2.getMonth());
            if (c2 == 0) {
                // Within the same year and month, sort by temperature descending
                return -Integer.compare(t1.getWd(), t2.getWd());
            }
            return c2;
        }
        return c1;
    }
}
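The difference between the two comparators is easiest to see side by side. Below is a minimal local sketch (ComparatorDemo is a hypothetical scratch class; it assumes TQ's omitted setters are plain POJO setters). The sort comparator puts the hotter record first within 1949-10, while the group comparator treats both records as equal, i.e. one reduce group:

package com.shsxt.mr.tq;

// Hypothetical scratch class; assumes TQ's omitted setters exist
public class ComparatorDemo {
    public static void main(String[] args) {
        TQ a = new TQ();
        a.setYear(1949); a.setMonth(10); a.setDay(1); a.setWd(38);
        TQ b = new TQ();
        b.setYear(1949); b.setMonth(10); b.setDay(2); b.setWd(36);
        // Sort comparator: negative, so the 38-degree record sorts first
        System.out.println(new TqSortComparator().compare(a, b));
        // Group comparator: 0, so both records land in the same reduce group
        System.out.println(new TqGroupComparator().compare(a, b));
    }
}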
TqReduce.java
package com.shsxt.mr.tq;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TqReduce extends Reducer<TQ, Text, Text, Text> {

    Text rkey = new Text();
    Text rvalue = new Text();

    @Override
    protected void reduce(TQ key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int flag = 0;
        int day = 0;
        // Thanks to the grouping comparator, one call to reduce() covers a
        // whole year-month, e.g. 1949-10 -> {38, 36, 34}, even though the
        // records carry different days. As we advance through the values
        // iterator, the framework deserializes each record's full key into
        // the same `key` object, so key.getDay() and key.getWd() always
        // reflect the current value.
        for (Text text : values) {
            // The first value is the month's highest temperature,
            // because the sort comparator ordered temperatures descending
            if (flag == 0) {
                day = key.getDay();
                rkey.set(key.getYear() + "-" + key.getMonth() + "-" + key.getDay());
                rvalue.set(key.getWd() + "");
                context.write(rkey, rvalue);
                flag++;
            }
            // The first record with a different day is the second-highest
            // temperature on another day
            if (flag != 0 && day != key.getDay()) {
                rkey.set(key.getYear() + "-" + key.getMonth() + "-" + key.getDay());
                rvalue.set(key.getWd() + "");
                context.write(rkey, rvalue);
                break;
            }
        }
    }
}
TQJobRunner.java
package com.shsxt.mr.tq;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TQJobRunner {

    public static void main(String[] args) throws Exception {
        // 1. Cluster configuration
        Configuration conf = new Configuration(true);
        conf.set("fs.defaultFS", "hdfs://node02:8020");
        // This property expects a bare hostname, not host:port
        conf.set("yarn.resourcemanager.hostname", "node02");

        // 2. Create the job and set job-level information
        Job job = Job.getInstance(conf);
        job.setJobName("tq");
        job.setJarByClass(TQJobRunner.class);

        // 3. Input location
        FileInputFormat.addInputPaths(job, "/tq/input/tq.txt");

        // 4. Delete the output directory if it already exists
        Path output = new Path("/tq/output/");
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        FileOutputFormat.setOutputPath(job, output);

        job.setMapperClass(TqMap.class);
        job.setMapOutputKeyClass(TQ.class);
        job.setMapOutputValueClass(Text.class);

        // 5. Reduce settings
        job.setReducerClass(TqReduce.class);
        // Three partitions (one per year), so three reducers can run in parallel
        job.setNumReduceTasks(3);

        // 6. Partitioner for the map stage
        job.setPartitionerClass(TqPartitioner.class);

        // 7. Secondary sort: runs during the shuffle, before reduce,
        //    ordering temperatures within a year-month in descending order
        job.setSortComparatorClass(TqSortComparator.class);
        // Group records by year and month for reduce()
        job.setGroupingComparatorClass(TqGroupComparator.class);
        // Only relevant if a combiner is set, which this job doesn't have
        job.setCombinerKeyGroupingComparatorClass(TqGroupComparator.class);

        job.waitForCompletion(true);
        System.out.println("success.....");
    }
}
Run results:
1950-1-1 32
1950-10-2 41
1950-10-1 37
1951-7-2 46
1951-7-1 45
1951-12-1 23
1949-10-1 38
1949-10-2 36
The output is split into three files by partition (year): 1950 % 3 = 0, 1951 % 3 = 1, and 1949 % 3 = 2, so the three years land in part-r-00000, part-r-00001, and part-r-00002 respectively, which matches the order shown above.