In the previous article I walked through building an inverted index with Hadoop. In this article we continue the series and look at how to compute the common friends of QQ accounts (and, by the same approach, common fans/followers) with Hadoop.
1. Background
The database holds many QQ accounts, and every account's friend list can be queried. The data can be normalized into the following form:
# Structure --- person:friend1,friend2,friend3,friend4....
A:B,C,D,F,E,O
B:C,E,G,F,O,D
D:Q,W,B,P,T,Y
Y:S,Q,L,V,B,H,J,K,L
O:L,E,Q,R,U,S,B
P:O,L,E,L,F,Q,W,G
K:S,L,D,U,R,E,A,X
.....
Our requirement is to compute the common friends of each pair of people, for example:
A-B,C D E F O
A-D,B
A-Y,B
D-Y,Q B
......
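The idea behind the two MapReduce jobs below: first invert the relation so that, for each friend F, we get the list of people who count F as a friend; any two people on that list then share F as a common friend. Here is a minimal in-memory sketch of that logic in plain Java (illustrative only; the class name CommonFriendsSketch and the hard-coded sample lines are not part of the MapReduce code):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class CommonFriendsSketch {
    public static void main(String[] args) {
        // person -> friends (the first three lines of the sample data)
        Map<String, List<String>> personToFriends = new LinkedHashMap<>();
        personToFriends.put("A", Arrays.asList("B", "C", "D", "F", "E", "O"));
        personToFriends.put("B", Arrays.asList("C", "E", "G", "F", "O", "D"));
        personToFriends.put("D", Arrays.asList("Q", "W", "B", "P", "T", "Y"));

        // Pass 1 (what step one does): invert to friend -> people who list that friend
        Map<String, TreeSet<String>> friendToPersons = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : personToFriends.entrySet()) {
            for (String friend : e.getValue()) {
                friendToPersons.computeIfAbsent(friend, k -> new TreeSet<>()).add(e.getKey());
            }
        }

        // Pass 2 (what step two does): every pair on the same list shares that friend
        Map<String, List<String>> pairToCommonFriends = new TreeMap<>();
        for (Map.Entry<String, TreeSet<String>> e : friendToPersons.entrySet()) {
            String[] persons = e.getValue().toArray(new String[0]); // already sorted
            for (int i = 0; i < persons.length - 1; i++) {
                for (int j = i + 1; j < persons.length; j++) {
                    pairToCommonFriends
                            .computeIfAbsent(persons[i] + "-" + persons[j], k -> new ArrayList<>())
                            .add(e.getKey());
                }
            }
        }
        // Prints, among others: A-B  [C, D, E, F, O]
        pairToCommonFriends.forEach((pair, fs) -> System.out.println(pair + "\t" + fs));
    }
}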
2. Code Implementation
Step one: invert the relation into the structure {friend  person,person,person}, e.g. "B  A,O,Y,D," — the friend, a tab, then everyone who lists that friend.
package com.empire.hadoop.mr.fensi;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SharedFriendsStepOne {

    static class SharedFriendsStepOneMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Input line format: A:B,C,D,F,E,O
            String line = value.toString();
            String[] person_friends = line.split(":");
            String person = person_friends[0];
            String friends = person_friends[1];
            for (String friend : friends.split(",")) {
                // Emit <friend, person>, so everyone who lists this friend
                // is grouped into the same reduce call
                context.write(new Text(friend), new Text(person));
            }
        }
    }

    static class SharedFriendsStepOneReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text friend, Iterable<Text> persons, Context context)
                throws IOException, InterruptedException {
            // Concatenate everyone who lists this friend: friend \t p1,p2,p3,
            StringBuilder sb = new StringBuilder();
            for (Text person : persons) {
                sb.append(person).append(",");
            }
            // The trailing comma is harmless: step two's String.split(",")
            // drops trailing empty strings
            context.write(friend, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(SharedFriendsStepOne.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(SharedFriendsStepOneMapper.class);
        job.setReducerClass(SharedFriendsStepOneReducer.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
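Before submitting to the cluster, the job can be smoke-tested on a single machine in Hadoop local mode. A minimal sketch, assuming shared.txt sits in the current working directory (the class name SharedFriendsStepOneLocal and the paths are illustrative, not part of the original code):

package com.empire.hadoop.mr.fensi;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SharedFriendsStepOneLocal {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local"); // run in-process instead of on YARN
        conf.set("fs.defaultFS", "file:///");          // read and write the local filesystem
        Job job = Job.getInstance(conf);
        job.setJarByClass(SharedFriendsStepOne.class);
        job.setMapperClass(SharedFriendsStepOne.SharedFriendsStepOneMapper.class);
        job.setReducerClass(SharedFriendsStepOne.SharedFriendsStepOneReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path("shared.txt"));
        FileOutputFormat.setOutputPath(job, new Path("sharedsteponeoutput"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}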
Step two: from step one's output, compute the common friends in the structure {person-person  friend friend friend friend}, e.g. "A-B  F C D O E".
package com.empire.hadoop.mr.fensi;

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SharedFriendsStepTwo {

    static class SharedFriendsStepTwoMapper extends Mapper<LongWritable, Text, Text, Text> {

        // The input is the output of the previous step:
        // B	A,O,Y,D,
        // i.e. friend <TAB> person,person,person
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            // TextOutputFormat separates key and value with a tab by default
            String[] friend_persons = line.split("\t");
            String friend = friend_persons[0];
            String[] persons = friend_persons[1].split(",");
            // Sort so every pair is emitted in one canonical order
            // (B-C, never C-B); otherwise the two orderings would land
            // in different reduce groups
            Arrays.sort(persons);
            for (int i = 0; i < persons.length - 1; i++) {
                for (int j = i + 1; j < persons.length; j++) {
                    // Emit <person-person, friend>, so all common friends of
                    // the same "person-person" pair reach the same reduce call
                    context.write(new Text(persons[i] + "-" + persons[j]), new Text(friend));
                }
            }
        }
    }

    static class SharedFriendsStepTwoReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text person_person, Iterable<Text> friends, Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text friend : friends) {
                sb.append(friend).append(" ");
            }
            context.write(person_person, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(SharedFriendsStepTwo.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(SharedFriendsStepTwoMapper.class);
        job.setReducerClass(SharedFriendsStepTwoReducer.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
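The two jobs can also be chained in a single driver so that step one's output directory feeds straight into step two; passing the whole directory works because FileInputFormat skips files whose names start with "_" or ".", such as _SUCCESS. A minimal sketch, assuming both step classes are packaged in the same jar (the driver class name SharedFriendsDriver is illustrative, not part of the original code):

package com.empire.hadoop.mr.fensi;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SharedFriendsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);        // e.g. /shared/sharedinput
        Path intermediate = new Path(args[1]); // e.g. /shared/sharedsteponeoutput
        Path output = new Path(args[2]);       // e.g. /shared/sharedsteptwooutput

        Job stepOne = Job.getInstance(conf, "shared-friends-step-one");
        stepOne.setJarByClass(SharedFriendsStepOne.class);
        stepOne.setMapperClass(SharedFriendsStepOne.SharedFriendsStepOneMapper.class);
        stepOne.setReducerClass(SharedFriendsStepOne.SharedFriendsStepOneReducer.class);
        stepOne.setOutputKeyClass(Text.class);
        stepOne.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(stepOne, input);
        FileOutputFormat.setOutputPath(stepOne, intermediate);
        if (!stepOne.waitForCompletion(true)) {
            System.exit(1); // do not start step two if step one failed
        }

        Job stepTwo = Job.getInstance(conf, "shared-friends-step-two");
        stepTwo.setJarByClass(SharedFriendsStepTwo.class);
        stepTwo.setMapperClass(SharedFriendsStepTwo.SharedFriendsStepTwoMapper.class);
        stepTwo.setReducerClass(SharedFriendsStepTwo.SharedFriendsStepTwoReducer.class);
        stepTwo.setOutputKeyClass(Text.class);
        stepTwo.setOutputValueClass(Text.class);
        // Passing the directory is fine: FileInputFormat ignores _SUCCESS and
        // other files whose names start with "_" or "."
        FileInputFormat.setInputPaths(stepTwo, intermediate);
        FileOutputFormat.setOutputPath(stepTwo, output);
        System.exit(stepTwo.waitForCompletion(true) ? 0 : 1);
    }
}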
3. Running the Program
# Upload the jars and the data file
Alt+p
lcd d:/
put SharedStepOne.jar
put SharedStepTwo.jar
put shared.txt
# Prepare the input file on HDFS
cd /home/hadoop
hadoop fs -mkdir -p /shared/sharedinput
hdfs dfs -put shared.txt /shared/sharedinput
# Run the two jobs
hadoop jar SharedStepOne.jar com.empire.hadoop.mr.fensi.SharedFriendsStepOne /shared/sharedinput /shared/sharedsteponeoutput
hadoop jar SharedStepTwo.jar com.empire.hadoop.mr.fensi.SharedFriendsStepTwo /shared/sharedsteponeoutput/part-r-00000 /shared/sharedsteptwooutput
4. Execution Output
[hadoop@centos-aaron-h1 ~]$ hadoop jar SharedStepOne.jar com.empire.hadoop.mr.fensi.SharedFriendsStepOne /shared/sharedinput /shared/sharedsteponeoutput
18/12/23 05:08:08 INFO client.RMProxy: Connecting to ResourceManager at centos-aaron-h1/192.168.29.144:8032
18/12/23 05:08:09 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/12/23 05:08:09 INFO input.FileInputFormat: Total input files to process : 1
18/12/23 05:08:10 INFO mapreduce.JobSubmitter: number of splits:1
18/12/23 05:08:10 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/12/23 05:08:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1545512861141_0001
18/12/23 05:08:11 INFO impl.YarnClientImpl: Submitted application application_1545512861141_0001
18/12/23 05:08:11 INFO mapreduce.Job: The url to track the job: http://centos-aaron-h1:8088/proxy/application_1545512861141_0001/
18/12/23 05:08:11 INFO mapreduce.Job: Running job: job_1545512861141_0001
18/12/23 05:08:20 INFO mapreduce.Job: Job job_1545512861141_0001 running in uber mode : false
18/12/23 05:08:20 INFO mapreduce.Job: map 0% reduce 0%
18/12/23 05:08:27 INFO mapreduce.Job: map 100% reduce 0%
18/12/23 05:08:33 INFO mapreduce.Job: map 100% reduce 100%
18/12/23 05:08:33 INFO mapreduce.Job: Job job_1545512861141_0001 completed successfully
18/12/23 05:08:33 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=306
        FILE: Number of bytes written=394989
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=241
        HDFS: Number of bytes written=166
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=3967
        Total time spent by all reduces in occupied slots (ms)=3151
        Total time spent by all map tasks (ms)=3967
        Total time spent by all reduce tasks (ms)=3151
        Total vcore-milliseconds taken by all map tasks=3967
        Total vcore-milliseconds taken by all reduce tasks=3151
        Total megabyte-milliseconds taken by all map tasks=4062208
        Total megabyte-milliseconds taken by all reduce tasks=3226624
    Map-Reduce Framework
        Map input records=7
        Map output records=50
        Map output bytes=200
        Map output materialized bytes=306
        Input split bytes=122
        Combine input records=0
        Combine output records=0
        Reduce input groups=22
        Reduce shuffle bytes=306
        Reduce input records=50
        Reduce output records=22
        Spilled Records=100
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=190
        CPU time spent (ms)=1260
        Physical memory (bytes) snapshot=339103744
        Virtual memory (bytes) snapshot=1694265344
        Total committed heap usage (bytes)=137867264
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=119
    File Output Format Counters
        Bytes Written=166
[hadoop@centos-aaron-h1 ~]$
[hadoop@centos-aaron-h1 ~]$ hadoop jar SharedStepTwo.jar com.empire.hadoop.mr.fensi.SharedFriendsStepTwo /shared/sharedsteponeoutput/part-r-00000 /shared/sharedsteptwooutput
18/12/23 05:12:19 INFO client.RMProxy: Connecting to ResourceManager at centos-aaron-h1/192.168.29.144:8032
18/12/23 05:12:20 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/12/23 05:12:20 INFO input.FileInputFormat: Total input files to process : 1
18/12/23 05:12:20 INFO mapreduce.JobSubmitter: number of splits:1
18/12/23 05:12:20 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/12/23 05:12:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1545512861141_0002
18/12/23 05:12:21 INFO impl.YarnClientImpl: Submitted application application_1545512861141_0002
18/12/23 05:12:21 INFO mapreduce.Job: The url to track the job: http://centos-aaron-h1:8088/proxy/application_1545512861141_0002/
18/12/23 05:12:21 INFO mapreduce.Job: Running job: job_1545512861141_0002
18/12/23 05:12:29 INFO mapreduce.Job: Job job_1545512861141_0002 running in uber mode : false
18/12/23 05:12:29 INFO mapreduce.Job: map 0% reduce 0%
18/12/23 05:12:38 INFO mapreduce.Job: map 100% reduce 0%
18/12/23 05:12:44 INFO mapreduce.Job: map 100% reduce 100%
18/12/23 05:12:44 INFO mapreduce.Job: Job job_1545512861141_0002 completed successfully
18/12/23 05:12:44 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=438
        FILE: Number of bytes written=395295
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=298
        HDFS: Number of bytes written=208
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=5637
        Total time spent by all reduces in occupied slots (ms)=3126
        Total time spent by all map tasks (ms)=5637
        Total time spent by all reduce tasks (ms)=3126
        Total vcore-milliseconds taken by all map tasks=5637
        Total vcore-milliseconds taken by all reduce tasks=3126
        Total megabyte-milliseconds taken by all map tasks=5772288
        Total megabyte-milliseconds taken by all reduce tasks=3201024
    Map-Reduce Framework
        Map input records=22
        Map output records=54
        Map output bytes=324
        Map output materialized bytes=438
        Input split bytes=132
        Combine input records=0
        Combine output records=0
        Reduce input groups=20
        Reduce shuffle bytes=438
        Reduce input records=54
        Reduce output records=20
        Spilled Records=108
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=251
        CPU time spent (ms)=1260
        Physical memory (bytes) snapshot=338214912
        Virtual memory (bytes) snapshot=1694265344
        Total committed heap usage (bytes)=137662464
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=166
    File Output Format Counters
        Bytes Written=208
[hadoop@centos-aaron-h1 ~]$
5. Results
[hadoop@centos-aaron-h1 ~]$ hdfs dfs -cat /shared/sharedsteponeoutput/part-r-00000
A K,
B A,O,Y,D,
C A,B,
D B,A,K,
E O,K,A,P,B,
F P,A,B,
G B,P,
H Y,
J Y,
K Y,
L K,Y,P,P,Y,O,
O A,B,P,
P D,
Q P,D,Y,O,
R K,O,
S O,Y,K,
T D,
U O,K,
V Y,
W D,P,
X K,
Y D,
[hadoop@centos-aaron-h1 ~]$ hdfs dfs -cat /shared/sharedsteptwooutput/part-r-00000
A-B F C D O E
A-D B
A-K D E
A-O B E
A-P E F O
A-Y B
B-K E D
B-O E
B-P F G O E
D-O B Q
D-P W Q
D-Y B Q
K-O E U L S R
K-P E L L
K-Y L L S
O-P L L Q E
O-Y B S Q L L
P-P L
P-Y L L Q L L
Y-Y L
[hadoop@centos-aaron-h1 ~]$
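Note the pairs P-P and Y-Y and the repeated friends (e.g. "K-P E L L") in the output above: the sample data contains duplicate entries (both P and Y list L twice), so in step one's output friend L's person list contains P and Y twice each. A minimal guard, as a sketch, is to deduplicate in the step-one mapper. The variant below is a drop-in replacement for SharedFriendsStepOneMapper (same imports as SharedFriendsStepOne, plus java.util.Arrays, java.util.HashSet and java.util.Set; the class name is hypothetical, not part of the original jobs):

    static class SharedFriendsStepOneDedupMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] person_friends = value.toString().split(":");
            String person = person_friends[0];
            // A set silently drops duplicate friend entries such as the two L's in P's list
            Set<String> friends = new HashSet<>(Arrays.asList(person_friends[1].split(",")));
            // Also guard against a person listing themselves as a friend
            friends.remove(person);
            for (String friend : friends) {
                context.write(new Text(friend), new Text(person));
            }
        }
    }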
Final words: that is all for this article. If you found it helpful, please give it a like; if you are interested in my other articles on servers and big data, or in the author, please follow this blog, and feel free to reach out to me at any time to exchange ideas.