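The job needs a lookup table of IP ranges. Its contents look roughly like this (the file has been uploaded to CSDN; download link: click to go to the download page):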
1,3708713472,3708715007,"河南省","信阳市","联通","221.14.122.0","221.14.127.255"
2,3708649472,3708813311,"河南省",,"联通","221.13.128.0","221.15.255.255"
3,3720390656,3720391679,"河北省","邢台市","联通","221.192.168.0","221.192.171.255"
4,1038992128,1038992383,"黑龙江省","齐齐哈尔市","铁通","61.237.195.0","61.237.195.255"
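Fields are separated by commas. Column 1 is the ID, column 2 is the lower bound of the range (the IP address converted to a long), column 3 is the upper bound, column 4 is the province, column 5 is the city (it may be empty), column 6 is the carrier, column 7 is the lower-bound IP, and column 8 is the upper-bound IP.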
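To make the column layout concrete, here is a minimal sketch that parses one line of the table in plain Scala. The names `IpRange`, `IpTable`, and `parseLine` are made up for illustration and do not appear in the project; the sketch also assumes that no field contains a comma.

```scala
// Hypothetical record type; the name and fields are illustrative only.
final case class IpRange(id: Int, low: Long, high: Long,
                         province: String, city: String, carrier: String)

object IpTable {
  // Strip the double quotes that the file puts around text fields.
  private def unquote(s: String): String = s.stripPrefix("\"").stripSuffix("\"")

  // Parse one comma-separated line. Assumes no commas inside fields.
  def parseLine(line: String): IpRange = {
    val f = line.split(",", -1) // -1 keeps empty fields such as a missing city
    IpRange(f(0).toInt, f(1).toLong, f(2).toLong,
      unquote(f(3)), unquote(f(4)), unquote(f(5)))
  }
}
```

For the second sample row, `IpTable.parseLine` yields a record whose `city` is the empty string, because that row has no city.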
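A broadcast variable works much like a map-side join in Hadoop: the data is shipped to every executor node and held in memory there. In Spark, calling
sc.broadcast broadcasts a variable to all executor nodes. The project below shows how it is used.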
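Before the full project, a minimal sketch of the call on its own may help. It assumes Spark is on the classpath and runs in local mode; the object name `BroadcastSketch` and the tiny lookup map are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  // Pairs each key with the name found in the broadcast lookup map.
  def run(): Array[(Int, String)] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-sketch")
    val sc = new SparkContext(conf)
    try {
      // Small lookup table that lives on the driver (illustrative data).
      val lookup = Map(1 -> "one", 2 -> "two")
      // Cache one read-only copy per executor instead of shipping it with every task.
      val bd = sc.broadcast(lookup)
      sc.parallelize(Seq(1, 2, 3))
        .map(k => (k, bd.value.getOrElse(k, "unknown"))) // read it through .value
        .collect()
    } finally {
      sc.stop()
    }
  }
}
```

The tasks only ever touch `bd.value`, which is the same pattern the project uses with the IP table.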
The project looks like this:
What it does: look up the province for each visitor IP, then sort by number of visits.
Bootstrap:
package cn.lijie.business
import org.apache.spark.{SparkConf, SparkContext}
/**
* User: lijie
*/
object Bootstrap {
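  /**
   * Binary search over IP ranges sorted by lower bound.
   *
   * @param arr ranges as (lower bound, upper bound, province, carrier), sorted ascending by lower bound
   * @param ip  the IP address as a long
   * @return index of the range that contains ip, or -1 if none does
   */
  def binarySearch(arr: Array[(String, String, String, String)], ip: Long): Int = {
    var l = 0
    var h = arr.length - 1
    while (l <= h) {
      val m = (l + h) / 2
      if ((ip >= arr(m)._1.toLong) && (ip <= arr(m)._2.toLong)) {
        return m
      } else if (ip < arr(m)._1.toLong) {
        h = m - 1
      } else {
        l = m + 1
      }
    }
    -1
  }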
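  /**
   * Converts a dotted IPv4 address to a long.
   *
   * @param ip address such as "221.14.122.0"
   * @return the address as a 32-bit value held in a Long
   */
  def ip2Long(ip: String): Long = {
    val arr = ip.split("[.]")
    var num = 0L
    for (i <- 0 until arr.length) {
      // Shift the accumulated value left by one byte, then OR in the next octet
      num = arr(i).toLong | num << 8L
    }
    num
  }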
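  def main(args: Array[String]): Unit = {
    // Sample record:
    // 1,3708713472,3708715007,"河南省","信阳市","联通","221.14.122.0","221.14.127.255"
    // id, lower bound, upper bound, province, city, carrier, range lower-bound IP, range upper-bound IP
    val conf = new SparkConf().setMaster("local[2]").setAppName("ip")
    val sc = new SparkContext(conf)
    // Load the IP table and sort it ascending by lower bound.
    // Compare as numbers: sorting the raw strings would misplace bounds with different digit counts.
    val rdd1 = sc.textFile("src/main/file/*.txt").map(x => {
      val s = x.split(",")
      // (lower bound, upper bound, province, carrier)
      (s(1), s(2), s(3), s(5))
    }).sortBy(_._1.toLong)
    // Broadcast the sorted table to the executors
    val bd = sc.broadcast(rdd1.collect)
    // Count visits per IP and sort ascending by count
    val rdd2 = sc.textFile("src/main/file/*.info").map(x => {
      val s = x.split(",")
      // (ip, 1)
      (s(1), 1)
    }).reduceByKey(_ + _).sortBy(_._2)
    rdd2.map(x => {
      val ipLong = ip2Long(x._1)
      // Index of the matching range in the broadcast table
      val index = binarySearch(bd.value, ipLong)
      // No matching range: report unknown
      if (index == -1) {
        (ipLong, x._1, x._2, "unknown", "unknown")
      } else {
        // Province
        val p = bd.value(index)._3
        // Carrier
        val y = bd.value(index)._4
        (ipLong, x._1, x._2, p, y)
      }
      // Write a single output file. repartition shuffles, so the order from sortBy is not guaranteed to survive.
    }).repartition(1).saveAsTextFile("C:\\Users\\Administrator\\Desktop\\out")
    sc.stop()
  }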
}
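The conversion and the range lookup can be checked without starting Spark. The sketch below is a plain-Scala variation on the two helpers, not the project's code: it keeps the bounds as `Long` values instead of strings, and the names `RangeLookupCheck` and `find` are invented for illustration. It uses rows 4, 1 and 3 of the table excerpt. Row 2 is left out because its range contains row 1's range, and this sketch assumes ranges that do not overlap.

```scala
object RangeLookupCheck {
  // Same conversion as ip2Long: fold the four octets into one number.
  def ip2Long(ip: String): Long =
    ip.split("[.]").foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)

  // Binary search over (low, high) pairs sorted by low; -1 means no range matched.
  def find(ranges: Array[(Long, Long)], ip: Long): Int = {
    var l = 0
    var h = ranges.length - 1
    while (l <= h) {
      val m = (l + h) / 2
      val (low, high) = ranges(m)
      if (ip < low) h = m - 1
      else if (ip > high) l = m + 1
      else return m
    }
    -1
  }

  // Rows 4, 1 and 3 of the table excerpt, already sorted by lower bound.
  val ranges: Array[(Long, Long)] = Array(
    (1038992128L, 1038992383L), // 61.237.195.0 to 61.237.195.255
    (3708713472L, 3708715007L), // 221.14.122.0 to 221.14.127.255
    (3720390656L, 3720391679L)  // 221.192.168.0 to 221.192.171.255
  )
}
```

Here `ip2Long("221.14.122.0")` gives 3708713472, which matches column 2 of the first table row, and an address such as 8.8.8.8 falls outside all three ranges, so `find` returns -1 for it.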
ip.txt is the file I uploaded.
Download link: click to go to the download page
ip.info holds a few records that I made up for testing:
14:45:17,202.98.248.242
15:45:17,219.220.199.250
16:45:17,219.220.199.250
18:45:17,202.98.248.242
18:45:17,202.98.248.242
18:45:17,202.98.248.242
18:45:17,202.98.248.242
18:45:17,202.98.248.242
16:45:17,114.139.223.13
15:45:17,219.220.199.250
16:45:17,219.220.199.250
15:45:17,219.220.199.250
16:45:17,219.220.199.250
15:45:17,219.220.199.250
16:45:17,219.220.199.250
13:45:17,114.139.223.13
10:45:17,114.139.223.13
13:45:17,114.139.223.13
10:45:17,114.139.223.13
10:45:17,114.10.123.13
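As a cross-check of the counting step, the sketch below reproduces `reduceByKey(_ + _).sortBy(_._2)` on these 20 records with ordinary Scala collections. It is not part of the project, and the names `VisitCountCheck` and `counts` are invented for illustration.

```scala
object VisitCountCheck {
  // The 20 sample records from ip.info, in file order.
  val lines: Seq[String] = Seq(
    "14:45:17,202.98.248.242", "15:45:17,219.220.199.250",
    "16:45:17,219.220.199.250", "18:45:17,202.98.248.242",
    "18:45:17,202.98.248.242", "18:45:17,202.98.248.242",
    "18:45:17,202.98.248.242", "18:45:17,202.98.248.242",
    "16:45:17,114.139.223.13", "15:45:17,219.220.199.250",
    "16:45:17,219.220.199.250", "15:45:17,219.220.199.250",
    "16:45:17,219.220.199.250", "15:45:17,219.220.199.250",
    "16:45:17,219.220.199.250", "13:45:17,114.139.223.13",
    "10:45:17,114.139.223.13", "13:45:17,114.139.223.13",
    "10:45:17,114.139.223.13", "10:45:17,114.10.123.13"
  )

  // Count visits per IP, then sort ascending by count.
  def counts(records: Seq[String]): Seq[(String, Int)] =
    records
      .map(_.split(",")(1))   // keep only the IP column
      .groupBy(identity)
      .map { case (ip, hits) => (ip, hits.size) }
      .toSeq
      .sortBy(_._2)
}
```

On the sample data this gives 114.10.123.13 with 1 visit, 114.139.223.13 with 5, 202.98.248.242 with 6, and 219.220.199.250 with 8.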
After the job finishes: