MapReduce Application Scenarios

1. A Simple Analysis of QQ Mutual Friends

If you have used QQ, you know that it periodically recommends people you may know, along with how many mutual friends you share.

Where does that data come from? It is computed from every user's friend list.

For example, if A's friend list contains B and C, and B's friend list contains C, D, and E, a reasonable scheme is for the system to recommend D and E to A: they are friends of A's friend B, but not yet friends of A.
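
Before scaling out, here is that idea as a minimal single-machine sketch (plain Java; the class and variable names are mine, not part of the MapReduce code that follows):

import java.util.*;

public class FoFIdea {
    public static void main(String[] args) {
        Map<String, Set<String>> friends = new HashMap<>();
        friends.put("A", new HashSet<>(Arrays.asList("B", "C")));
        friends.put("B", new HashSet<>(Arrays.asList("C", "D", "E")));

        // Candidates for A: friends of A's friends whom A does not already know.
        Set<String> direct = friends.get("A");
        Set<String> candidates = new TreeSet<>();
        for (String f : direct) {
            for (String fof : friends.getOrDefault(f, Collections.emptySet())) {
                if (!fof.equals("A") && !direct.contains(fof)) {
                    candidates.add(fof);
                }
            }
        }
        System.out.println(candidates); // prints [D, E]
    }
}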

QQ's user base is enormous. Suppose each user's friend list is analyzed and stored as plain text, one record per user, shaped like this:

User ID    Friend List    Other Data
-------    -----------    ----------
A          B C D …        …

Consider QQ's registered user count:
[Figure: QQ registered user count, roughly 800 million]

Eight hundred million users, at roughly 10 KB of friend-list data each, comes to about 8 × 10^8 × 10 KB ≈ 8 TB. With ordinary tooling, even storing that much data is a problem, let alone processing it.

At this scale we can turn to HDFS and MapReduce to store and process the data.

1.1 Sample Data Analysis

Here is a small sample data set:

ID       Friends
--------------------------
tom 	 cat  hadoop  hello
hello  	 mr  tom  world  hive
cat 	 tom  hive
hive 	 cat  hadoop  world  hello  mr
mr 		 hive  hello
hadoop 	 tom  hive  world
world 	 hadoop  hive  hello

Step 1: Mapper

Parse each line. Take the first line as an example: tom is a direct friend of everyone on his list, while any two people on his list form an indirect (second-degree) pair whose common friend is tom.

If <tom-cat, 0> marks a direct relationship and <cat-hadoop, 1> marks a second-degree one, then the pairs we want are exactly those whose values are all 1.

One wrinkle: <tom-cat, 0> and <cat-tom, 0> would be two distinct keys for the same pair. The code solves this with a custom key class (FoF, below) that normalizes the ordering.
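
For the first record, tom cat hadoop hello, the mapper therefore emits (with keys already normalized):

<cat-tom, 0>      <hadoop-tom, 0>     <hello-tom, 0>       direct pairs
<cat-hadoop, 1>   <cat-hello, 1>      <hadoop-hello, 1>    second-degree pairs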

Step 2: Reducer

Iterate over each pair's values: if any value is 0, the two are direct friends, so drop the pair; otherwise sum the 1s to get the number of common friends.
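
For example, after the shuffle the key cat-hadoop collects {1, 1} (one 1 from tom's line, one from hive's line), so the reducer writes cat hadoop 2; the key cat-tom collects {0, 0} (direct friends on both tom's and cat's lines) and is dropped.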

At this point we have extracted the data we need, but a pair like <tom-cat, 1> cuts both ways: the system can recommend cat to tom and also tom to cat, and recommendations should be ranked by the number of common friends. So we chain a second MapReduce job to reorder and reshape the data.

Step 3: Mapper

Emit each pair in both directions and define a custom key class that implements the desired sort order.

Step 4: Reducer

Write the sorted data out.

1.2 Code Walkthrough

MapStep01.java

package com.shsxt.mr.friend;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.hsqldb.lib.StringUtil;

public class MapStep01 extends Mapper<LongWritable, Text, FoF, IntWritable> {
	
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		
		String lines = value.toString();
        //For convenience the tokens are separated by single spaces; the listing above is padded only for readability
		String [] friends = StringUtil.split(lines, " ");
		
		for (int i = 1; i < friends.length; i++) {
			String friend = friends[i];
            //The first token on the line is a direct friend of every other token: value 0
			context.write(new FoF(friend,friends[0]), new IntWritable(0));
			//Any two of the remaining tokens form an indirect pair: value 1
			for (int j = i+1; j < friends.length; j++) {
				String fri = friends[j];
                //The key is a custom class (FoF, shown below) that normalizes pair ordering
				context.write(new FoF(fri, friend), new IntWritable(1));
			}
		}
	}
}

FoF.java

package com.shsxt.mr.friend;

import org.apache.hadoop.io.Text;
//Extends Text, which already implements WritableComparable, so serialization comes for free
public class FoF extends Text {

	public FoF() {
		super();
	}
	
	public FoF(String friend01,String friend02){
		set(getOf(friend01,friend02));
	}
	
	//Normalize the ordering so one pair always yields one key, e.g. tom-cat and cat-tom both become cat-tom
	private String getOf(String friend01, String friend02) {
		int c = friend01.compareTo(friend02);
		
		if (c>0) {
			return friend02 + "\t" + friend01;
		}
		return friend01+"\t"+friend02;
	}
}
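
A quick sanity check (a hypothetical snippet, assuming FoF and the Hadoop jars are on the classpath, not part of the job itself) shows both orderings collapse to one key:

public class FoFCheck {
    public static void main(String[] args) {
        FoF a = new FoF("tom", "cat");
        FoF b = new FoF("cat", "tom");
        System.out.println(a.equals(b)); // true -- both hold "cat\ttom"
    }
}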

ReduceStep01.java

package com.shsxt.mr.friend;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.hsqldb.lib.StringUtil;

public class ReduceStep01 extends Reducer<FoF, IntWritable, Text, NullWritable> {
	
	@Override
	protected void reduce(FoF key, Iterable<IntWritable> values,
			Context context) throws IOException, InterruptedException {
		int sum = 0;
		boolean flag = true;
        //Grouped input at this point looks like: cat hadoop -> {0,0,1,1}
		for (IntWritable value : values) {
            //Any 0 means the two are direct friends: mark the pair for discarding and stop scanning
			if (value.get()==0) {
				flag = false;
				break;
			}
            //Count the common friends
			sum+=value.get();
		}
		
		if (flag) {
            //Pack both names and the count into the key line; the value is empty
			String msg = StringUtil.split(key.toString(), "\t")[0]+" "+
					StringUtil.split(key.toString(), "\t")[1]+" "+sum;
            
			context.write(new Text(msg), NullWritable.get());
		}
	}
	
}

MapStep02.java

package com.shsxt.mr.friend;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapStep02 extends Mapper<LongWritable, Text, FriendSort, IntWritable>{
	
	@Override
	protected void map(LongWritable key, Text value,Context context)
			throws IOException, InterruptedException {
        //Parse one line of the first job's output: friend1 friend2 count
		String lines = value.toString();
		String friend01 = lines.split(" ")[0];
		String friend02 = lines.split(" ")[1];
		int hot = Integer.parseInt(lines.split(" ")[2]);
		
		System.out.println(friend01+"   "+friend02+"   "+hot);
		System.out.println(friend02+"   "+friend01+"   "+hot);
		//Emit the pair in both directions; the key class (FriendSort, below) defines the sort order
		context.write(new FriendSort(friend01, friend02, hot), new IntWritable(hot));
		context.write(new FriendSort(friend02, friend01, hot), new IntWritable(hot));
	}
}

FriendSort.java

package com.shsxt.mr.friend;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class FriendSort implements WritableComparable<FriendSort>{
	
	private String friend01;
	private String friend02;
	private int hot;
	
	//Custom ordering
	@Override
	public int compareTo(FriendSort friend) {
		int a = friend01.compareTo(friend.getFriend01());
		int b  = friend02.compareTo(friend.getFriend02());
		
        //Names compare lexicographically, which is ascending by default
		if (a==0) {
			if (b==0) {
                //When both names match, order by the number of common friends, descending
				return -Integer.compare(hot, friend.getHot());
			}else {
				return b;
			}
		}
		
		return a;
	}
	
    //Serialization: Hadoop uses its own Writable mechanism, not Java serialization
	@Override
	public void write(DataOutput out) throws IOException {
		out.writeUTF(friend01);
		out.writeUTF(friend02);
		out.writeInt(hot);
	}
    
	//Deserialization must read the fields in the same order they were written
	@Override
	public void readFields(DataInput in) throws IOException {
		this.friend01 = in.readUTF();
		this.friend02 = in.readUTF();
		this.hot = in.readInt();
	}
	
	public FriendSort() {
		super();
	}
	
	
	public FriendSort(String friend01, String friend02, int hot) {
		super();
		this.friend01 = friend01;
		this.friend02 = friend02;
		this.hot = hot;
	}
	
	
	public String getFriend01() {
		return friend01;
	}
	public void setFriend01(String friend01) {
		this.friend01 = friend01;
	}
	public String getFriend02() {
		return friend02;
	}
	public void setFriend02(String friend02) {
		this.friend02 = friend02;
	}
	public int getHot() {
		return hot;
	}
	public void setHot(int hot) {
		this.hot = hot;
	}
}
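
Note that as written, compareTo falls back to friend02's name before it ever looks at hot, so within one user the output is alphabetical by candidate name (visible in the results below, where mr world 2 sorts after mr tom 1). If you want candidates ranked strictly by mutual-friend count, compare hot before friend02; a sketch of that variant method:

	@Override
	public int compareTo(FriendSort friend) {
		int a = friend01.compareTo(friend.getFriend01());
		if (a != 0) {
			return a;
		}
		//Rank by mutual-friend count, descending; break ties by name
		int h = -Integer.compare(hot, friend.getHot());
		if (h != 0) {
			return h;
		}
		return friend02.compareTo(friend.getFriend02());
	}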

ReduceStep02.java

package com.shsxt.mr.friend;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceStep02 extends Reducer<FriendSort, IntWritable, Text, NullWritable> {
	
	@Override
	protected void reduce(FriendSort friend, Iterable<IntWritable> values,Context context)
			throws IOException, InterruptedException {
		int sum = 0;
		for (IntWritable i : values) {
            //In practice each key should arrive with exactly one value here
			sum = i.get();
		}
		String msg = friend.getFriend01()+" "+friend.getFriend02()+" "+sum;
		System.out.println(msg);
		
		context.write(new Text(msg), NullWritable.get());
	}

}

JobRunner.java

This class is standard job wiring, so only brief notes follow.

package com.shsxt.mr.friend;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobRunner {
	
	public static void main(String[] args) throws Exception {
		boolean a = job01();
		if (a) {
			job02();
		}
	}
	
	static boolean job01() throws Exception{
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://node02:8020");
		conf.set("yarn.resourcemanager.hostname", "node02:8088");
		
		Job job = Job.getInstance(conf);
		job.setJobName("job01..");
		job.setJarByClass(JobRunner.class);
		
		job.setMapperClass(MapStep01.class);
		job.setReducerClass(ReduceStep01.class);
		
		job.setMapOutputKeyClass(FoF.class);
		job.setMapOutputValueClass(IntWritable.class); //map output value; the reducer emits Text/NullWritable
		
		FileInputFormat.addInputPath(job, new Path("/friend/input/data.txt"));
		FileSystem fs = FileSystem.get(conf);
		Path out = new Path("/friend/output/01/");
		if (fs.exists(out)) {
			fs.delete(out, true);
		}
		
		FileOutputFormat.setOutputPath(job, out);
		
		return job.waitForCompletion(true);
	}
	
	
	
	static boolean job02() throws Exception{
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://node02:8020");
		conf.set("yarn.resourcemanager.hostname", "node02:8088");
		
		Job job = Job.getInstance(conf);
		job.setJobName("job02..");
		job.setJarByClass(JobRunner.class);
		
		job.setMapperClass(MapStep02.class);
		job.setReducerClass(ReduceStep02.class);
		
		job.setMapOutputKeyClass(FriendSort.class);
		job.setMapOutputValueClass(IntWritable.class); //map output value; the reducer emits Text/NullWritable
		
		FileInputFormat.addInputPath(job, new Path("/friend/output/01/"));
		FileSystem fs = FileSystem.get(conf);
		Path out = new Path("/friend/output/02/");
		if (fs.exists(out)) {
			fs.delete(out, true);
		}
		
		FileOutputFormat.setOutputPath(job, out);
		
		return job.waitForCompletion(true);
	}
}
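
To run the two chained jobs on the cluster, package the classes into a jar and submit it with the standard hadoop jar launcher (the jar name here is an assumption):

hadoop jar friend-analysis.jar com.shsxt.mr.friend.JobRunner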

Output:

cat hadoop 2
cat hello 2
cat mr 1
cat world 1
hadoop cat 2
hadoop hello 3
hadoop mr 1
hello cat 2
hello hadoop 3
hive tom 3
mr cat 1
mr hadoop 1
mr tom 1
mr world 2
tom hive 3
tom mr 1
tom world 2
world cat 1
world mr 2
world tom 2

2. Weather Statistics

Goal: for every year and month, find the two days with the highest temperature. Sample data:

1949-10-01 14:21:02	34c
1949-10-01 19:21:02	38c
1949-10-02 14:01:02	36c
1950-01-01 11:21:02	32c
1950-10-01 12:21:02	37c
1951-12-01 12:21:02	23c
1950-10-02 12:21:02	41c
1950-10-03 12:21:02	27c
1951-07-01 12:21:02	45c
1951-07-02 12:21:02	46c

2.1 Code Walkthrough

TqMap.java

package com.shsxt.mr.tq;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.hsqldb.lib.StringUtil;

public class TqMap extends Mapper<LongWritable, Text, TQ, Text> {
	
	TQ tq =  new TQ();
	Text vwd = new Text();
	
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
        
		// value :  1949-10-01 14:21:02    34c
		try {
			String [] lines = StringUtil.split(value.toString(), "\t");
			SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd");
			Date date = dateFormat.parse(lines[0]);
			Calendar calendar = Calendar.getInstance();
			calendar.setTime(date);
			
			tq.setYear(calendar.get(Calendar.YEAR));
			tq.setMonth(calendar.get(Calendar.MONTH)+1);
			tq.setDay(calendar.get(Calendar.DAY_OF_MONTH));
			
			int wd = Integer.parseInt(lines[1].substring(0, lines[1].length()-1));
			tq.setWd(wd);
			vwd.set(wd+"");
			
            //The key is a custom object so the shuffle can group and sort on (year, month, temperature)
			context.write(tq, vwd);
			
		} catch (ParseException e) {
			e.printStackTrace();
		}
		
	}
}
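
As an aside, SimpleDateFormat and Calendar are verbose for a pure date and are not thread-safe. On Java 8+ the date prefix can be parsed more directly; a sketch of the equivalent, assuming the same TQ setters:

			//LocalDate.parse wants exactly yyyy-MM-dd, so take the 10-character date prefix
			java.time.LocalDate d = java.time.LocalDate.parse(lines[0].substring(0, 10));
			tq.setYear(d.getYear());
			tq.setMonth(d.getMonthValue());
			tq.setDay(d.getDayOfMonth());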

TQ.java

package com.shsxt.mr.tq;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class TQ implements WritableComparable<TQ>{
	
	private int year;
	private int month;
	private int day;
	private int wd;
    
	//Serialization (Writable)
	@Override
	public void write(DataOutput out) throws IOException {
		out.writeInt(year);
		out.writeInt(month);
		out.writeInt(day);
		out.writeInt(wd);
	}
	
    //Deserialization: read fields in the same order they were written
	@Override
	public void readFields(DataInput in) throws IOException {
		this.year = in.readInt();
		this.month = in.readInt();
		this.day = in.readInt();
		this.wd = in.readInt();
	}
	
    //Natural key ordering: year, then month, then day
	@Override
	public int compareTo(TQ o) {
		int c1 = Integer.compare(this.year, o.getYear());
		if (c1==0) {
			int c2 = Integer.compare(this.month, o.getMonth());
			if (c2==0) {
				return Integer.compare(this.getDay(),o.getDay());
			}
			return c2;
		}
		return c1;
	}

	public int getYear() {
		return year;
	}

	//remaining getters and setters omitted...
	
}

TqPartitioner.java

Partitioner, applied on the map side to route records to reducers.

package com.shsxt.mr.tq;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class TqPartitioner extends Partitioner<TQ, Text> {

	@Override
	public int getPartition(TQ key, Text value, int numPartitions) {
		return key.getYear() % numPartitions;
	}

}
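
With numPartitions = 3, the sample years land in distinct partitions: 1949 % 3 = 2, 1950 % 3 = 0, 1951 % 3 = 1, so each year's records go to their own reducer and output file.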

TqGroupComparator.java

Custom grouping comparator, which controls how keys are grouped for each reduce call.

package com.shsxt.mr.tq;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class TqGroupComparator extends WritableComparator  {
	
	public TqGroupComparator(){
		super(TQ.class,true);
	}
	
	@Override
	public int compare(WritableComparable a, WritableComparable b) {
		
		TQ t1 = (TQ) a;
		TQ t2 = (TQ) b;
        
		//Group by year and month only; day and temperature are ignored
		int c1 = Integer.compare(t1.getYear(), t2.getYear());
		if (c1==0) {
			return Integer.compare(t1.getMonth(), t2.getMonth());
		}
		return c1;
	}
}

TqSortComparator.java

Custom sort comparator.

package com.shsxt.mr.tq;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class TqSortComparator extends WritableComparator{
	
	public TqSortComparator(){
		super(TQ.class,true);
	}
	
	@Override
	public int compare(WritableComparable a, WritableComparable b) {
		TQ t1 = (TQ) a;
		TQ t2 = (TQ) b;
		
		int c1 = Integer.compare(t1.getYear(), t2.getYear());
		if (c1==0) {
			int c2 = Integer.compare(t1.getMonth(), t2.getMonth());
			if (c2==0) {
                //Within the same year and month, sort by temperature in descending order
				return -Integer.compare(t1.getWd(), t2.getWd());
			}
			return c2;
		}
		
		return c1;
	}
}

TqReduce.java

package com.shsxt.mr.tq;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TqReduce extends Reducer<TQ, Text, Text, Text> {
	
	Text rkey = new Text();
	Text rvalue = new Text();
	
	
	//	1949-10-01         34
	//	1949-10-01         38
	//	1949-10-01         37
	//	1949-10-02         39
	@Override
	protected void reduce(TQ key, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
		int flag = 0;
		int day = 0;
		
		// key 1949-10-01  values : {38,36,34}
		// key 1949-10-02  -- all records of the same year and month arrive as one group,
		// because the grouping comparator ignores the day. E.g. group 1949-10 -> {38,36,34}.
		// Hadoop updates the key object in place as the values iterator advances,
		// so key.getDay() reflects the day of the current record.
		for (Text text : values) {
			//The first record in the group is the month's highest temperature
			if (flag==0) {
				day = key.getDay();
				rkey.set(key.getYear()+"-"+key.getMonth()+"-"+key.getDay());
				rvalue.set(key.getWd()+"");
				
				context.write(rkey,rvalue);
				flag++;
			}
			
			//The first record whose day differs from the hottest day gives the second-hottest day
			if (flag!=0 && day!=key.getDay()) {
				rkey.set(key.getYear()+"-"+key.getMonth()+"-"+key.getDay());
				rvalue.set(key.getWd()+"");
				context.write(rkey, rvalue);
				break;
			}
		}
		
	}
}
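
Tracing the 1949-10 group on the sample data: after the sort comparator runs, the values arrive as (1949-10-01, 38), (1949-10-02, 36), (1949-10-01, 34). The first iteration writes 1949-10-1 38 and remembers day 1; the next iteration sees day 2, writes 1949-10-2 36, and breaks. This matches the output below.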

TQJobRunner.java

package com.shsxt.mr.tq;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TQJobRunner {
	
	public static void main(String[] args) throws Exception {
		//1. Cluster configuration
		Configuration conf = new Configuration(true);
		conf.set("fs.defaultFS", "hdfs://node02:8020");
		conf.set("yarn.resourcemanager.hostname", "node02:8088");
		
		
		//2. Create the Job and set its basic information
		Job job = Job.getInstance(conf);
		job.setJobName("tq");
		job.setJarByClass(TQJobRunner.class);
		
		//3. Input path
		FileInputFormat.addInputPaths(job, "/tq/input/tq.txt");
		
		Path output = new Path("/tq/output/");
		
		FileSystem fs = FileSystem.get(conf);
		
		//4. If the output path already exists, delete it
		if (fs.exists(output)) {
			fs.delete(output, true);
		}
		
		FileOutputFormat.setOutputPath(job, output);
		
		job.setMapperClass(TqMap.class);
		job.setMapOutputKeyClass(TQ.class);
		job.setMapOutputValueClass(Text.class);
		
		//5. Reducer settings
		job.setReducerClass(TqReduce.class);
        //Three map-side partitions, so three reducers can work in parallel
		job.setNumReduceTasks(3);
		
		//6. Partitioner for the map phase
		job.setPartitionerClass(TqPartitioner.class);
		
		//7. Secondary sort: runs during the shuffle, before reduce; orders temperatures descending
		job.setSortComparatorClass(TqSortComparator.class);
        
		//Group records by year and month
		job.setGroupingComparatorClass(TqGroupComparator.class);
		
		//Grouping comparator for an optional combiner (no combiner is configured in this job)
		job.setCombinerKeyGroupingComparatorClass(TqGroupComparator.class);
		
		job.waitForCompletion(true);
		
		System.out.println("success.....");
		
	}
}
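
As before, the job can be submitted with hadoop jar once packaged (jar name assumed):

hadoop jar tq.jar com.shsxt.mr.tq.TQJobRunner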

Output:

1950-1-1	32
1950-10-2	41
1950-10-1	37
1951-7-2	46
1951-7-1	45
1951-12-1	23
1949-10-1	38
1949-10-2	36

The output is split into three files, one per partition: the partitioner routes each record by year % 3.
