Looking up the province of an IP address with Spark, and using broadcast variables

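This requires a lookup table that maps IP ranges to locations. Its contents look roughly like this (the file has been uploaded to CSDN for download):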

1,3708713472,3708715007,"河南省","信阳市","联通","221.14.122.0","221.14.127.255"
2,3708649472,3708813311,"河南省",,"联通","221.13.128.0","221.15.255.255"
3,3720390656,3720391679,"河北省","邢台市","联通","221.192.168.0","221.192.171.255"
4,1038992128,1038992383,"黑龙江省","齐齐哈尔市","铁通","61.237.195.0","61.237.195.255"

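Fields are separated by commas. Column 1 is the ID, column 2 is the lower bound of the range (the start IP converted to a Long), column 3 is the upper bound (the end IP converted to a Long), column 4 is the province, column 5 is the city, column 6 is the ISP, column 7 is the start IP of the range, and column 8 is the end IP.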
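To make the column layout concrete, here is a minimal sketch in plain Scala (no Spark needed) that splits one line of the table and checks that the numeric bounds are simply the dotted start and end addresses packed into a Long. The names IpTableExample, IpRange, parseLine, and ipToLong are illustrative and not part of the original project, and the naive comma split assumes that no field contains a comma.

```scala
object IpTableExample {
  // Hypothetical record type for one row of the lookup table.
  final case class IpRange(id: Int, start: Long, end: Long,
                           province: String, city: String, isp: String,
                           startIp: String, endIp: String)

  // Pack a dotted IPv4 string into a Long, most significant octet first.
  def ipToLong(ip: String): Long =
    ip.split('.').foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)

  // Split on commas (keeping empty fields) and strip the surrounding quotes.
  // Assumption: no field contains an embedded comma.
  def parseLine(line: String): IpRange = {
    val f = line.split(",", -1).map(_.replace("\"", ""))
    IpRange(f(0).toInt, f(1).toLong, f(2).toLong, f(3), f(4), f(5), f(6), f(7))
  }

  def main(args: Array[String]): Unit = {
    val row = parseLine(
      "1,3708713472,3708715007,\"河南省\",\"信阳市\",\"联通\",\"221.14.122.0\",\"221.14.127.255\"")
    // Columns 2 and 3 are columns 7 and 8 converted to Long.
    println(ipToLong(row.startIp) == row.start)
    println(ipToLong(row.endIp) == row.end)
  }
}
```

For the first sample row, 221.14.122.0 converts to 3708713472 and 221.14.127.255 converts to 3708715007, so both checks print true.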

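A broadcast variable plays the same role as a map-side join in Hadoop: the data is shipped into the memory of every executor node. In Spark, calling
sc.broadcast broadcasts a variable to all executor nodes; the project below shows how it is used.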
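As a smaller illustration of the same idea, the following sketch broadcasts a tiny driver-side Map and reads it inside a map transformation through the broadcast's value. It assumes Spark is on the classpath; the object name BroadcastSketch, the label helper, and the ISP code table are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  // Small lookup table that lives on the driver (illustrative data only).
  val ispNames: Map[Int, String] = Map(1 -> "联通", 2 -> "铁通")

  // Attach an ISP name to each code using a broadcast copy of the table.
  def label(sc: SparkContext, codes: Seq[Int]): Array[(Int, String)] = {
    // Ship one read-only copy to each executor instead of one per task.
    val bc = sc.broadcast(ispNames)
    sc.parallelize(codes)
      .map(code => (code, bc.value.getOrElse(code, "unknown")))
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-sketch")
    val sc = new SparkContext(conf)
    try {
      label(sc, Seq(1, 2, 1, 3)).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The lookup happens on the executors without a shuffle, which is what makes this comparable to a map-side join. It only suits tables small enough to fit in each executor's memory.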

The project:

Overview: look up the province for each visiting IP address, and sort the results by visit count.

[screenshot of the project layout]

The Bootstrap object:

package cn.lijie.business

import org.apache.spark.{SparkConf, SparkContext}

/**
  * User: lijie
  */
object Bootstrap {

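  /**
    * Binary search over IP ranges sorted by lower bound.
    *
    * @param arr ranges as (lower bound, upper bound, province, ISP)
    * @param ip  IP address converted to a Long
    * @return index of a range containing ip, or -1 if none matches
    */
  def binarySearch(arr: Array[(String, String, String, String)], ip: Long): Int = {
    var l = 0
    var h = arr.length - 1
    while (l <= h) {
      val m = l + (h - l) / 2
      val lower = arr(m)._1.toLong
      val upper = arr(m)._2.toLong
      if (ip >= lower && ip <= upper) {
        return m
      } else if (ip < lower) {
        h = m - 1
      } else {
        l = m + 1
      }
    }
    -1
  }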

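  /**
    * Convert a dotted IPv4 string to a Long.
    *
    * @param ip address such as "221.14.122.0"
    * @return the four octets packed into a Long, most significant first
    */
  def ip2Long(ip: String): Long = {
    val arr = ip.split("[.]")
    var num = 0L
    for (i <- 0 until arr.length) {
      // Shift the accumulated value left by one byte, then OR in the next octet
      num = arr(i).toLong | (num << 8)
    }
    num
  }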

  def main(args: Array[String]): Unit = {
    //    print(3395782400.00.toLong)
    //1,3708713472.00,3708715007.00,"河南省","信阳市","联通","221.14.122.0","221.14.127.255"
    // id, lower bound, upper bound, province, city, ISP, range start IP, range end IP
    // Sort the contents of ip.txt by lower bound, ascending
    val conf = new SparkConf().setMaster("local[2]").setAppName("ip")
    val sc = new SparkContext(conf)
    val rdd1 = sc.textFile("src/main/file/*.txt").map(x => {
      val s = x.split(",")
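      // lower bound, upper bound, province, ISP
      (s(1), s(2), s(3), s(5))
    }).sortBy(_._1.toLong) // sort numerically; sorting the raw strings would be lexicographic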

    // Broadcast the sorted table to every executor
    val bd = sc.broadcast(rdd1.collect)

    val rdd2 = sc.textFile("src/main/file/*.info").map(x => {
      val s = x.split(",")
      // (ip, 1)
      (s(1), 1)
    }).reduceByKey(_ + _)

    rdd2.map(x => {
      val ipLong = ip2Long(x._1)
      // Index of the matching range
      val index = binarySearch(bd.value, ipLong)
      // Fall back to "unknown" when no range matches
      if (index == -1) {
        (ipLong, x._1, x._2, "unknown", "unknown")
      } else {
        // Province
        val p = bd.value(index)._3
        // ISP
        val y = bd.value(index)._4
        (ipLong, x._1, x._2, p, y)
      }
    }).sortBy(_._3, numPartitions = 1) // sort by visit count last; repartition(1) after a sort would shuffle the order away
      .saveAsTextFile("C:\\Users\\Administrator\\Desktop\\out")
    sc.stop()
  }
}
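The range lookup can be tried without Spark. The sketch below runs the same kind of binary search over the four sample rows shown earlier, with the bounds already parsed to Long and sorted by lower bound. RangeSearchExample, search, and provinceOf are illustrative names, not part of the project above.

```scala
object RangeSearchExample {
  // (lower bound, upper bound, province), sorted by lower bound.
  val ranges: Array[(Long, Long, String)] = Array(
    (1038992128L, 1038992383L, "黑龙江省"),
    (3708649472L, 3708813311L, "河南省"),
    (3708713472L, 3708715007L, "河南省"),
    (3720390656L, 3720391679L, "河北省")
  )

  // Returns the index of a range containing ip, or -1 if none is found.
  def search(arr: Array[(Long, Long, String)], ip: Long): Int = {
    var lo = 0
    var hi = arr.length - 1
    while (lo <= hi) {
      val mid = lo + (hi - lo) / 2
      val (start, end, _) = arr(mid)
      if (ip < start) hi = mid - 1
      else if (ip > end) lo = mid + 1
      else return mid
    }
    -1
  }

  // Province for an IP given as a Long, or "unknown" when no range matches.
  def provinceOf(ip: Long): String = {
    val i = search(ranges, ip)
    if (i == -1) "unknown" else ranges(i)._3
  }

  def main(args: Array[String]): Unit = {
    println(provinceOf(1038992138L)) // 61.237.195.10
    println(provinceOf(3720390913L)) // 221.192.169.1
    println(provinceOf(134744072L))  // 8.8.8.8, not covered by the sample rows
  }
}
```

Note that two of the sample ranges overlap (the 信阳市 range lies inside the wider 河南省 range), so the search returns whichever matching range it reaches first. Both carry the same province, so the province lookup is unaffected here.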

The ip.txt file is the lookup table I uploaded to CSDN.

ip.info contains a few records I made up for testing:

14:45:17,202.98.248.242
15:45:17,219.220.199.250
16:45:17,219.220.199.250
18:45:17,202.98.248.242
18:45:17,202.98.248.242
18:45:17,202.98.248.242
18:45:17,202.98.248.242
18:45:17,202.98.248.242
16:45:17,114.139.223.13
15:45:17,219.220.199.250
16:45:17,219.220.199.250
15:45:17,219.220.199.250
16:45:17,219.220.199.250
15:45:17,219.220.199.250
16:45:17,219.220.199.250
13:45:17,114.139.223.13
10:45:17,114.139.223.13
13:45:17,114.139.223.13
10:45:17,114.139.223.13
10:45:17,114.10.123.13
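As a sanity check on what the reduceByKey step should produce for this sample, the counting can be reproduced with plain Scala collections. This is only a sketch: VisitCountExample and counts are illustrative names, and groupBy stands in for Spark's reduceByKey.

```scala
object VisitCountExample {
  // The 20 sample records from ip.info, one "time,ip" pair per line.
  val log: String =
    """14:45:17,202.98.248.242
      |15:45:17,219.220.199.250
      |16:45:17,219.220.199.250
      |18:45:17,202.98.248.242
      |18:45:17,202.98.248.242
      |18:45:17,202.98.248.242
      |18:45:17,202.98.248.242
      |18:45:17,202.98.248.242
      |16:45:17,114.139.223.13
      |15:45:17,219.220.199.250
      |16:45:17,219.220.199.250
      |15:45:17,219.220.199.250
      |16:45:17,219.220.199.250
      |15:45:17,219.220.199.250
      |16:45:17,219.220.199.250
      |13:45:17,114.139.223.13
      |10:45:17,114.139.223.13
      |13:45:17,114.139.223.13
      |10:45:17,114.139.223.13
      |10:45:17,114.10.123.13""".stripMargin

  // Count visits per IP, then order by count ascending.
  def counts(text: String): Seq[(String, Int)] =
    text.split("\n").toSeq
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(_.split(",")(1))
      .groupBy(identity)
      .map { case (ip, hits) => (ip, hits.size) }
      .toSeq
      .sortBy(_._2)

  def main(args: Array[String]): Unit =
    counts(log).foreach(println)
}
```

For the sample above this yields 114.10.123.13 with 1 visit, 114.139.223.13 with 5, 202.98.248.242 with 6, and 219.220.199.250 with 8.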

After the job finishes:

[two screenshots of the output]
