1. Requirement:
Given the friend list of each of the people A-O (the friend relation is one-directional: "A:B" means B is a friend of A, not necessarily the reverse), find every pair of people who have common friends, and list who those common friends are.
Friend list data:
A:B,C,D,F,E,O
B:A,C,E,K
C:F,A,D,I
D:A,E,F,L
E:B,C,D,M,L
F:A,B,C,D,E,O,M
G:A,C,D,E,F
H:A,C,D,E,O
I:A,O
J:B,O
K:A,C,D
L:D,E,F
M:E,F,G
O:A,H,I,J
Expected output format:
A-B: CE
A-C: DF
A-D: EF
..........
2. Approach:
The task is to find the common friends of every pair of people. The original records are <person, all of that person's friends>, so first invert the relation: emit each friend as the key and the person as the value, and let a reducer collect all the people who have that friend. Then pair those people up two at a time as the new key, with the previous key (the shared friend) as the value, and let a second reducer merge the values.
In short, I plan to split this into two steps, i.e. two MapReduce passes (a plain-Python sketch of the whole flow follows below):
1) For each person, work out which people have him/her as a common friend
2) Output each pair of those people as the key, with the shared friend as the value
mapreduce01: (data-flow diagram omitted)
mapreduce02: (data-flow diagram omitted)
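Before the actual jobs, here is a minimal pure-Python sketch of both passes (a dict stands in for Hadoop's shuffle/sort; only the first three input records are used, and all variable names are illustrative):
from itertools import combinations

# First three records of the sample data
lines = ["A:B,C,D,F,E,O", "B:A,C,E,K", "C:F,A,D,I"]

# Pass 1 (map01 + reduce01): friend -> everyone who has that friend
has_friend = {}
for line in lines:
    person, friends = line.split(':')
    for f in friends.split(','):
        has_friend.setdefault(f, set()).add(person)

# Pass 2 (map02 + reduce02): each pair sharing a friend collects that friend
common = {}
for friend, people in has_friend.items():
    for a, b in combinations(sorted(people), 2):
        common.setdefault((a, b), set()).add(friend)

for (a, b), friends in sorted(common.items()):
    print('%s-%s: %s' % (a, b, ''.join(sorted(friends))))
Running it prints A-B: CE, A-C: DF and B-C: A, which matches the expected format above.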
3. Code:
map01.py
import sys

# Invert the relation: for each input line "person:friend1,friend2,...",
# emit one "friend <TAB> person" record per friend.
for line in sys.stdin:
    ss = line.strip().split(':')
    person = ss[0]
    friends = ss[1].split(',')
    for friend in friends:
        print '%s\t%s' % (friend, person)
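A quick sanity check of the mapper (assuming inputdata.txt holds the records above); for the first record it should emit:
head -1 inputdata.txt | python map01.py
B	A
C	A
D	A
F	A
E	A
O	A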
reduce01.py
import sys

# Streaming reducer: input arrives sorted by key, so all lines with the
# same friend are adjacent. Concatenate every person who has that friend.
per_friend = None
person = ''
for line in sys.stdin:
    word, val = line.strip().split('\t')
    if per_friend is None:
        per_friend = word
    if per_friend != word:    # key changed: flush the previous group
        print '%s\t%s' % (per_friend, person)
        per_friend = word
        person = ''
    person = person + val
if per_friend is not None:    # flush the last group (guards empty input)
    print '%s\t%s' % (per_friend, person)
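After sort -k1, all the people who list the same friend are adjacent, and the reducer folds each group into one line; for friend A the grouped line should be:
cat inputdata.txt | python map01.py | sort -k1 | python reduce01.py | head -1
A	BCDFGHIKO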
map02.py
import sys
from itertools import combinations

# Input: "friend <TAB> people", where people is a string such as "BCDF"
# (names are single letters, so iterating the string yields the persons).
# Emit every unordered pair of those people as the key, with the shared
# friend as the value.
for line in sys.stdin:
    ss = line.strip().split('\t')
    key = ss[0]
    values = ss[1]
    for fri_com in combinations(values, 2):
        print '%s-%s:\t%s' % (fri_com[0], fri_com[1], key)
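Feeding that single grouped line to map02.py expands it into all C(9,2) = 36 pairs, each tagged with the shared friend A; the first few (and the last):
B-C:	A
B-D:	A
B-F:	A
...
K-O:	A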
reduce02.py
import sys

# Same grouping logic as reduce01.py, now keyed by the pair "X-Y:";
# concatenates all of that pair's common friends.
per_friend = None
person = ''
for line in sys.stdin:
    word, val = line.strip().split('\t')
    if per_friend is None:
        per_friend = word
    if per_friend != word:
        print '%s\t%s' % (per_friend, person)
        per_friend = word
        person = ''
    person = person + val
if per_friend is not None:
    print '%s\t%s' % (per_friend, person)
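One thing worth noting: the reducer concatenates values in arrival order. Locally, sort -k1 compares whole lines, so the common friends come out alphabetically (A-D: EF below); on the cluster, step 2 sorts only by key, so the letter order varies (A-D: FE). If deterministic output matters, a small tweak (my suggestion, not in the original) is to sort before printing, e.g. replace each print with:
print '%s\t%s' % (per_friend, ''.join(sorted(person)))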
run.sh
HADOOP_CMD="/usr/local/src/hadoop-2.6.5/bin/hadoop"
STREAM_JAR_PATH="/usr/local/src/hadoop-2.6.5/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar"
INPUT_FILE_PATH_1="/friends/inputdata.txt"
#INPUT_FILE_PATH_1="/data/1.data"
OUTPUT_PATH="/output/friends"
OUTPUT_PATH2="/output/friends_com"
$HADOOP_CMD fs -rmr -skipTrash $OUTPUT_PATH
# Step 1.
$HADOOP_CMD jar $STREAM_JAR_PATH \
-input $INPUT_FILE_PATH_1 \
-output $OUTPUT_PATH \
-mapper "python map01.py" \
-reducer "python reduce01.py" \
-file ./map01.py \
-file ./reduce01.py \
-jobconf stream.num.map.output.key.fields=2 # treat the first two fields of the map output as the key, so people for the same friend also arrive sorted and each pair key comes out alphabetical (A-B, never B-A)
# Step 2.
$HADOOP_CMD fs -rmr -skipTrash $OUTPUT_PATH2
$HADOOP_CMD jar $STREAM_JAR_PATH \
-input $OUTPUT_PATH \
-output $OUTPUT_PATH2 \
-mapper "python map02.py" \
-reducer "python reduce02.py" \
-file ./map02.py \
-file ./reduce02.py
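A caveat on the jobconf in step 1: the default partitioner hashes the full two-field key, so with more than one reduce task the records for a single friend could be split across reducers. Here everything lands in one part-00000, so it is safe; if you raised the reducer count you would also want to partition on the first field only, e.g. (not in the original script):
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-jobconf num.key.fields.for.partition=1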
4. Final results:
1. Local debugging:
cat inputdata.txt | python map01.py | sort -k1 | python reduce01.py | python map02.py | sort -k1 | python reduce02.py
A-B: CE
A-C: DF
A-D: EF
A-E: BCD
A-F: BCDEO
A-G: CDEF
A-H: CDEO
A-I: O
A-J: BO
A-K: CD
A-L: DEF
A-M: EF
B-C: A
B-D: AE
B-E: C
B-F: ACE
B-G: ACE
B-H: ACE
B-I: A
B-K: AC
B-L: E
B-M: E
B-O: A
C-D: AF
C-E: D
C-F: AD
C-G: ADF
C-H: AD
C-I: A
C-K: AD
C-L: DF
C-M: F
C-O: AI
D-E: L
D-F: AE
D-G: AEF
D-H: AE
D-I: A
D-K: A
D-L: EF
D-M: EF
D-O: A
E-F: BCDM
E-G: CD
E-H: CD
E-J: B
E-K: CD
E-L: D
F-G: ACDE
F-H: ACDEO
F-I: AO
F-J: BO
F-K: ACD
F-L: DE
F-M: E
F-O: A
G-H: ACDE
G-I: A
G-K: ACD
G-L: DEF
G-M: EF
G-O: A
H-I: AO
H-J: O
H-K: ACD
H-L: DE
H-M: E
H-O: A
I-J: O
I-K: A
I-O: A
K-L: D
K-O: A
L-M: EF
cat inputdata.txt | python map01.py | sort -k1 | python reduce01.py | python map02.py | sort -k1 | python reduce02.py | wc -l
74
2. MapReduce:
sh run.sh
hadoop fs -cat /output/friends_com/part-00000
A-B: CE
A-C: DF
A-D: FE
A-E: CBD
A-F: BEDCO
A-G: FCDE
A-H: CDEO
A-I: O
A-J: OB
A-K: CD
A-L: EDF
A-M: FE
B-C: A
B-D: EA
B-E: C
B-F: ECA
B-G: EAC
B-H: CEA
B-I: A
B-K: CA
B-L: E
B-M: E
B-O: A
C-D: AF
C-E: D
C-F: AD
C-G: ADF
C-H: AD
C-I: A
C-K: AD
C-L: DF
C-M: F
C-O: IA
D-E: L
D-F: AE
D-G: AEF
D-H: AE
D-I: A
D-K: A
D-L: EF
D-M: FE
D-O: A
E-F: DCBM
E-G: DC
E-H: DC
E-J: B
E-K: CD
E-L: D
F-G: AEDC
F-H: CDEAO
F-I: OA
F-J: BO
F-K: CDA
F-L: ED
F-M: E
F-O: A
G-H: CAED
G-I: A
G-K: ACD
G-L: DEF
G-M: FE
G-O: A
H-I: AO
H-J: O
H-K: CAD
H-L: DE
H-M: E
H-O: A
I-J: O
I-K: A
I-O: A
K-L: D
K-O: A
L-M: EF
hadoop fs -cat /output/friends_com/part-00000 | wc -l
74