IP rule data
1.0.1.0|1.0.3.255|16777472|16778239|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.0.8.0|1.0.15.255|16779264|16781311|亚洲|中国|广东|广州||电信|440100|China|CN|113.280637|23.125178
1.0.32.0|1.0.63.255|16785408|16793599|亚洲|中国|广东|广州||电信|440100|China|CN|113.280637|23.125178
1.1.0.0|1.1.0.255|16842752|16843007|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.1.2.0|1.1.7.255|16843264|16844799|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.1.8.0|1.1.63.255|16844800|16859135|亚洲|中国|广东|广州||电信|440100|China|CN|113.280637|23.125178
1.2.0.0|1.2.1.255|16908288|16908799|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.2.2.0|1.2.2.255|16908800|16909055|亚洲|中国|北京|北京|海淀|北龙中网|110108|China|CN|116.29812|39.95931
1.2.4.0|1.2.4.255|16909312|16909567|亚洲|中国|北京|北京||中国互联网信息中心|110100|China|CN|116.405285|39.904989
1.2.5.0|1.2.7.255|16909568|16910335|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
…
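Each rule line is pipe-delimited. Judging from the sample rows and from the job further down, fields 2 and 3 (zero-based) hold the start and end of the range as decimal numbers, and fields 6 and 7 hold the province and city. The following is a minimal sketch of parsing one such line outside Spark; the names IpRule and IpRuleParser are hypothetical and do not come from the original code.

```scala
// Hypothetical helper types for illustration; not part of the original job.
final case class IpRule(startNum: Long, endNum: Long, province: String, city: String)

object IpRuleParser {
  // Parse one pipe-delimited rule line.
  // Returns None when the line has too few fields or non-numeric range bounds.
  def parse(line: String): Option[IpRule] = {
    // A limit of -1 keeps empty fields (for example the empty district column).
    val fields = line.split("[|]", -1)
    if (fields.length < 8) None
    else
      try Some(IpRule(fields(2).toLong, fields(3).toLong, fields(6), fields(7)))
      catch { case _: NumberFormatException => None }
  }
}
```

Returning an Option lets a caller skip placeholder or malformed lines instead of failing the whole job.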
Access log data sample
20090121000132581311000|115.120.36.118|tj.tt98.com|/tj.htm|Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;TheWorld)|http://www.tt98.com/|
…
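In the sample line, the client IP is the second pipe-delimited field. A small sketch of extracting it defensively is given here; AccessLogParser and extractIp are hypothetical names, and treating a short or empty line as "no IP" is an assumption about how malformed input should be handled.

```scala
// Hypothetical helper for illustration; not part of the original job.
object AccessLogParser {
  // Return the client IP (second pipe-delimited field),
  // or None when the line is too short or the field is empty.
  def extractIp(line: String): Option[String] = {
    val fields = line.split("[|]")
    if (fields.length > 1 && fields(1).nonEmpty) Some(fields(1)) else None
  }
}
```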
Code implementation
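// Broadcast variable example
// SparkUtils and IpUtils are helper objects that are not shown in this section
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object IpLocationDemo {
  def main(args: Array[String]): Unit = {
    val sc = SparkUtils.createContext(args(0).toBoolean)
    // Read the IP rule data with Spark
    val ipLines: RDD[String] = sc.textFile(args(1))
    val ipRules = ipLines.map(e => {
      val fields = e.split("[|]")
      val startNum = fields(2).toLong
      val endNum = fields(3).toLong
      val province = fields(6)
      val city = fields(7)
      (startNum, endNum, province, city)
    })
    // Collect the cleaned IP rules to the Driver
    val ipRulesInDriver = ipRules.collect()
    // Broadcast the Driver-side IP rules so that each Executor can hold one read-only copy;
    // sc.broadcast is a blocking call. It returns a Broadcast handle on the Driver,
    // which tasks use to locate the broadcast data
    val broadcastRef: Broadcast[Array[(Long, Long, String, String)]] = sc.broadcast(ipRulesInDriver)
    // Read the access log data
    val accessLog = sc.textFile(args(2))
    val provinceAndOne: RDD[(String, Int)] = accessLog.map(e => {
      val fields = e.split("[|]")
      val ip = fields(1)
      // Convert the IP address from the access log to its decimal (Long) form
      val ipNum: Long = IpUtils.ip2Long(ip)
      // Use the broadcast handle to get the data already broadcast to the current Executor
      val ipRulesInExecutor: Array[(Long, Long, String, String)] = broadcastRef.value
      // Binary-search for the rule containing the IP (requires rules sorted by startNum)
      val index: Int = IpUtils.binarySearch(ipRulesInExecutor, ipNum)
      // "未知" means "unknown": the IP matched no rule
      var province = "未知"
      if (index >= 0) {
        province = ipRulesInExecutor(index)._3
      }
      (province, 1)
    })
    val reduced: RDD[(String, Int)] = provinceAndOne.reduceByKey(_ + _)
    val res = reduced.collect()
    println(res.toBuffer)
    // Release the broadcast variable; unpersist(true) blocks until the
    // copies cached on the Executors have been removed
    broadcastRef.unpersist(true)
    // Trigger an action again. Unlike destroy, unpersist leaves the broadcast usable:
    // it is re-sent to an Executor if a task needs it again
    val res2 = reduced.collect()
    println(res2.toBuffer)
    sc.stop()
  }
}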
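The job calls IpUtils.ip2Long and IpUtils.binarySearch, whose source is not shown in this section. The sketch below is one possible way to write helpers with the same call shape, not the author's actual implementation. It is placed in an object named IpUtilsSketch (a hypothetical name) and assumes IPv4 dotted-quad input and rules that are sorted by start value with non-overlapping ranges.

```scala
// Hypothetical stand-in for the IpUtils helper, which the original does not show.
object IpUtilsSketch {
  // Convert a dotted-quad IPv4 string to its numeric value by shifting in
  // one octet at a time, e.g. "1.0.1.0" becomes 16777472.
  def ip2Long(ip: String): Long =
    ip.split("[.]").foldLeft(0L)((acc, part) => (acc << 8) | part.toLong)

  // Binary search over rules of the form (startNum, endNum, province, city).
  // Assumes the array is sorted by startNum and the ranges do not overlap.
  // Returns the index of the rule containing ipNum, or -1 if no rule matches.
  def binarySearch(rules: Array[(Long, Long, String, String)], ipNum: Long): Int = {
    var low = 0
    var high = rules.length - 1
    while (low <= high) {
      val mid = low + (high - low) / 2
      val (start, end, _, _) = rules(mid)
      if (ipNum < start) high = mid - 1
      else if (ipNum > end) low = mid + 1
      else return mid
    }
    -1
  }
}
```

Returning -1 for a miss matches the `index >= 0` check in the job, so an IP that falls into a gap between rules is counted under "未知".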