9 . 默认情况下,join()操作会对两个RDD的主键做哈希以分区,通过网络将主键相同的元素发送到同一台机器上,然后根据相同的主键再进行连接。例子如下:
val sc = new SparkContext()
val userData = sc.sequenceFile[UserID,LinkInfo]("hdfs://...").persist
def processNewLogs(logFileName:String){
val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
//RDD of (UserID,(UserInfo,LinkInfo)) pairs
val joined = usersData.join(events)
val offTopicVisits = joined.filter {
// Expand the tuple into its components
case (userId, (userInfo, linkInfo)) =>