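In big-data processing, multi-table joins are a very common operation. For a distributed system, though, they are expensive: the data is spread across nodes, so a join normally needs a shuffle first, and the resulting network I/O makes the job slow.
This post introduces the map-side join. It applies when a large table is joined with a small one, where "small" means the data is small enough to be loaded into memory. The join then runs on the map side, with no shuffle or reduce stage, which makes it very efficient.
Example code follows: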
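Independent of Spark, the idea can be sketched with plain Scala collections: keep the small table in a hash map and stream through the large table one record at a time. The names `MapSideJoinSketch` and `mapSideJoin` are made up for this illustration, and the sketch assumes inner-join semantics (records without a match are dropped).

```scala
object MapSideJoinSketch {
  // Inner join of a large keyed dataset against a small in-memory lookup table.
  // `large` is consumed lazily, one record at a time; only `small` is held in memory.
  def mapSideJoin[K, A, B](large: Iterator[(K, A)], small: Map[K, B]): Iterator[(K, (A, B))] =
    large.flatMap { case (key, left) =>
      // A key missing from the small table gives None, so flatMap drops that record
      small.get(key).map(right => (key, (left, right)))
    }

  def main(args: Array[String]): Unit = {
    val airlines = Map("AA" -> "American Airlines", "DL" -> "Delta Airlines")
    // "ZZ" is a made-up carrier code with no match in the small table
    val flights = Iterator(("DL", "418"), ("AA", "1250"), ("ZZ", "1"))
    mapSideJoin(flights, airlines).foreach(println)
  }
}
```

Each record needs only a local hash lookup, so no record of the large table has to be moved to another node to find its match.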
// Fact table
val flights = sc.parallelize(List(
  ("SEA", "JFK", "DL", "418", "7:00"),
  ("SFO", "LAX", "AA", "1250", "7:05"),
  ("SFO", "JFK", "VX", "12", "7:05"),
  ("JFK", "LAX", "DL", "424", "7:10"),
  ("LAX", "SEA", "DL", "5737", "7:10")))
// Dimension table
val airports = sc.parallelize(List(
  ("JFK", "John F. Kennedy International Airport", "New York", "NY"),
  ("LAX", "Los Angeles International Airport", "Los Angeles", "CA"),
  ("SEA", "Seattle-Tacoma International Airport", "Seattle", "WA"),
  ("SFO", "San Francisco International Airport", "San Francisco", "CA")))
// Dimension table
val airlines = sc.parallelize(List(
  ("AA", "American Airlines"),
  ("DL", "Delta Airlines"),
  ("VX", "Virgin America")))
The three tables need to be joined into the following format:
Seattle New York Delta Airlines 418 7:00
San Francisco Los Angeles American Airlines 1250 7:05
San Francisco New York Virgin America 12 7:05
New York Los Angeles Delta Airlines 424 7:10
Los Angeles Seattle Delta Airlines 5737 7:10
The fact table is very large, while the two dimension tables are fairly small, so we can load the small tables into memory and broadcast them:
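// Keep only airport code -> city, collect to the driver as a Map, then broadcast it
val airportsMap = sc.broadcast(airports.map{case(a, b, c, d) => (a, c)}.collectAsMap)
// Airline code -> airline name
val airlinesMap = sc.broadcast(airlines.collectAsMap)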
Here is the map-side join:
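// Each record is resolved against the broadcast maps inside map, so no shuffle is needed.
// Note: a code missing from a map throws NoSuchElementException.
flights.map { case (origin, dest, carrier, flightNo, time) =>
  (airportsMap.value(origin),   // origin airport code -> city
   airportsMap.value(dest),     // destination airport code -> city
   airlinesMap.value(carrier),  // carrier code -> airline name
   flightNo, time) }.collect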
The result of running it:
res: Array[(String, String, String, String, String)] = Array(
(Seattle, New York, Delta Airlines, 418, 7:00),
(San Francisco, Los Angeles, American Airlines, 1250, 7:05),
(San Francisco, New York, Virgin America, 12, 7:05),
(New York, Los Angeles, Delta Airlines, 424, 7:10),
(Los Angeles, Seattle, Delta Airlines, 5737, 7:10))
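The lookups above assume every airport and airline code is present in the dimension tables; a missing code would fail the job. If unmatched rows should simply be dropped (inner-join semantics), the lookups can be chained as `Option`s. The following is a minimal sketch using plain Scala collections as stand-ins for the RDD and the broadcast values; `SafeLookupSketch`, `resolve`, and the airport code "XXX" are hypothetical names introduced only for this illustration.

```scala
object SafeLookupSketch {
  type Flight = (String, String, String, String, String)

  // Resolve one flight against the two lookup maps; None if any code is unknown.
  def resolve(airports: collection.Map[String, String],
              airlines: collection.Map[String, String])(f: Flight): Option[Flight] = {
    val (origin, dest, carrier, number, time) = f
    for {
      from <- airports.get(origin)
      to   <- airports.get(dest)
      name <- airlines.get(carrier)
    } yield (from, to, name, number, time)
  }

  def main(args: Array[String]): Unit = {
    val airports = Map("SEA" -> "Seattle", "JFK" -> "New York")
    val airlines = Map("DL" -> "Delta Airlines")
    val flights = List(
      ("SEA", "JFK", "DL", "418", "7:00"),
      ("SEA", "XXX", "DL", "999", "8:00")) // "XXX" is a made-up code with no airport entry
    // flatMap keeps the resolved rows and silently drops the unresolved one
    flights.flatMap(f => resolve(airports, airlines)(f)).foreach(println)
  }
}
```

On the RDD from the example, the same function could presumably be applied as `flights.flatMap(f => resolve(airportsMap.value, airlinesMap.value)(f))`. Whether dropping unmatched rows is acceptable, or a placeholder such as `getOrElse(code, "unknown")` is preferable, depends on the data.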