Big Data with Spark: Broadcast Variables and an IP Geolocation Case Study

Broadcast variables

A broadcast variable is usually used to implement a map-side join: data on the Driver is broadcast to the Executors that belong to the application, and the reference returned by the broadcast call is then used to fetch the data that was shipped to the Executors in advance.
Broadcast variables are distributed in a BitTorrent-like fashion (TorrentBroadcast); Executors can exchange pieces of the data with each other, which improves efficiency.
Broadcasting is done on the Driver with sc.broadcast, and this method is blocking (synchronous).
Once broadcast, the data cannot be changed. If the reference data has to be refreshed periodically, define a singleton object that is used inside the RDD functions and update its data with a timer (a minimal sketch follows this list).
The data broadcast to the Executors is represented on the Driver by a reference; this reference is shipped with every Task to the Executors, where it is used to look up the data that was broadcast beforehand.
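The periodic-refresh idea is not used in the case study below; a minimal sketch of it, assuming a hypothetical loadRules() helper that re-reads the rule data, could look like this:

import java.util.{Timer, TimerTask}

//Hypothetical singleton: one copy per Executor JVM, refreshed by a daemon timer
object RefreshableRules {
  @volatile private var rules: Array[(Long, Long, String, String)] = loadRules()

  private val timer = new Timer("rule-refresher", true)
  timer.schedule(new TimerTask {
    override def run(): Unit = rules = loadRules() //reload the rules periodically
  }, 10 * 60 * 1000L, 10 * 60 * 1000L)             //every 10 minutes

  //placeholder: a real job would re-read the rule file or query a database here
  private def loadRules(): Array[(Long, Long, String, String)] = Array.empty

  def get: Array[(Long, Long, String, String)] = rules
}

Referencing RefreshableRules.get inside an RDD function then always returns the most recently loaded rules on that Executor.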

Closure:

When a function refers to a variable defined outside of it, the Driver serializes that variable together with the generated tasks and sends it to the Executors when the job runs.
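A minimal sketch of the difference, assuming a SparkContext named sc is already available: a plain local value is captured by the closure and serialized into every task, whereas a broadcast value is sent to each Executor only once.

//`lookup` is a Driver-side local value; the closure captures it, so it is
//serialized into every task of this stage
val lookup = Map(1 -> "a", 2 -> "b")
val byClosure = sc.parallelize(Seq(1, 2, 3)).map(k => lookup.getOrElse(k, "?"))

//A broadcast variable ships the same data to each Executor only once
val bc = sc.broadcast(lookup)
val byBroadcast = sc.parallelize(Seq(1, 2, 3)).map(k => bc.value.getOrElse(k, "?"))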

Case study: computing the home region of an IP address

Log data

20090121000132095572000|125.213.100.123|show.51.com|/shoplist.php?phpfile=shoplist2.php&style=1&sex=137|Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Mozilla/4.0(Compatible Mozilla/4.0(Compatible-EmbeddedWB 14.59 http://bsalsa.com/ EmbeddedWB- 14.59  from: http://bsalsa.com/ )|http://show.51.com/main.php|
20090121000132124542000|117.101.215.133|www.jiayuan.com|/19245971|Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; TencentTraveler 4.0)|http://photo.jiayuan.com/index.php?uidhash=d1c3b69e9b8355a5204474c749fb76ef|__tkist=0; myloc=50%7C5008; myage=2009; PROFILE=14469674%3A%E8%8B%A6%E6%B6%A9%E5%92%96%E5%95%A1%3Am%3Aphotos2.love21cn.com%2F45%2F1b%2F388111afac8195cc5d91ea286cdd%3A1%3A%3Ahttp%3A%2F%2Fimages.love21cn.com%2Fw4%2Fglobal%2Fi%2Fhykj_m.jpg; last_login_time=1232454068; SESSION_HASH=8176b100a84c9a095315f916d7fcbcf10021e3af; RAW_HASH=008a1bc48ff9ebafa3d5b4815edd04e9e7978050; COMMON_HASH=45388111afac8195cc5d91ea286cdd1b; pop_1232093956=1232468896968; pop_time=1232466715734; pop_1232245908=1232469069390; pop_1219903726=1232477601937; LOVESESSID=98b54794575bf547ea4b55e07efa2e9e; main_search:14469674=%7C%7C%7C00; registeruid=14469674; REG_URL_COOKIE=http%3A%2F%2Fphoto.jiayuan.com%2Fshowphoto.php%3Fuid_hash%3D0319bc5e33ba35755c30a9d88aaf46dc%26total%3D6%26p%3D5; click_count=0%2C3363619
20090121000132406516000|117.101.222.68|gg.xiaonei.com|/view.jsp?p=389|Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; CIBA)|http://home.xiaonei.com/Home.do?id=229670724|_r01_=1; __utma=204579609.31669176.1231940225.1232462740.1232467011.145; __utmz=204579609.1231940225.1.1.utmccn=(direct)
20090121000132581311000|115.120.36.118|tj.tt98.com|/tj.htm|Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; TheWorld)|http://www.tt98.com/|
20090121000132864647000|123.197.64.247|cul.sohu.com|/20071227/n254338813_22.shtml|Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; TheWorld)|http://cul.sohu.com/20071227/n254338813_22.shtml|ArticleTab=visit:1; IPLOC=unknown; SUV=0901080709152121; vjuids=832dd37a1.11ebbc5d590.0.b20f858f14e918; club_chat_ircnick=JaabvxC4aaacQ; spanel=%7B%22u%22%3A%22%22%7D; vjlast=1232467312,1232467312,30
20090121000133296729000|222.55.57.176|down.chinaz.com|/|Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; iCafeMedia; TencentTraveler 4.0)||cnzz_a33219=0; vw33219=%3A18167791%3A; sin33219=http%3A//www.itxls.com/wz/wyfx/it.html; rtime=0; ltime=1232464387281; cnzz_eid=6264952-1232464379-http%3A//www.itxls.com/wz/wyfx/it.html
20090121000133331104000|123.197.66.93|www.pkwutai.cn|/down/downLoad-id-45383.html|Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 1.7)|http://www.baidu.com/s?tn=b1ank_pg&ie=gb2312&bs=%C3%C0%C6%BC%B7%FE%D7%B0%B9%DC%C0%ED%C8%ED%BC%FE&sr=&z=&cl=3&f=8&wd=%C6%C6%BD%E2%C3%C0%C6%BC%B7%FE%D7%B0%B9%DC%C0%ED%C8%ED%BC%FE&ct=0|
20090121000133446262000|115.120.12.157|v.ifeng.com|/live/|Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; CIBA)|http://www.ifeng.com/|userid=1232466610953_4339; location=186; sclocationid=10002; vjuids=22644b162.11ef4bc1624.0.63ad06717b426; vjlast=1232466614,1232467297,13
20090121000133456256000|115.120.7.240|cqbbs.soufun.com|/3110502342~-1~2118/23004348_23004348.htm|Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; CIBA)||new_historysignlist=%u534E%u6DA6%u4E8C%u5341%u56DB%u57CE%7Chttp%3A//cqbbs.soufun.com/board/3110502342/%7C%7C%u9A8F%u9038%u7B2C%u4E00%u6C5F%u5CB8%7Chttp%3A//cqbbs.soufun.com/board/3110169184/%7C%7C%u793E%u533A%u4E4B%u661F%7Chttp%3A//cqbbs.soufun.com/board/sqzx/%7C%7C; SoufunSessionID=2y5xyr45kslc0zbdooqnoo55; viewUser=1; vjuids=-870e9088.11ee89aba57.0.be9c3d988def8; vjlast=1232263101,1232380806,11; new_viewtype=1; articlecolor=#000000; usersms_pop_type=1; articlecount=186; __utma=101868291.755195653.1232450942.1232450942.1232450942.1; __utmz=101868291.1232450942.1.1.utmccn=(referral)
20090121000133586141000|117.101.219.241|12.zgwow.com|/launcher/index.htm|Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)||
20090121000133744103000|123.197.49.171|2.82yyy.com|/32/webpage/L/2.Html|Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; QQDownload 1.7; TencentTraveler ; Maxthon; .NET CLR 1.1.4322)|http://2.82yyy.com/32/webpage/L/1.Html|cnzz_a998284=3; vw998284=%3A52225577%3A68566865%3A68566789%3A68566815%3A; sin998284=none; rtime=0; ltime=1232466017187; cnzz_eid=1870962-1232464084-; cnzz_a1021073=3; vw1021073=%3A34926533%3A; sin1021073=none; 61kkk=1,1232464210281
20090121000133757842000|117.101.213.104|game.7679.com|/scroll.php|Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; CIBA)|http://game.7679.com/games/1021/|cnzz_a30008507=7; rtime=2; ltime=1232466389781; cnzz_eid=12877395-http%3A//apps.51.com/application.php%3Fapp_key%3D4a99277cca695a34ba39719399030076; 4a99277cca695a34ba39719399030076_user=tangqingqing33; 4a99277cca695a34ba39719399030076_session_key=1b203792173c71e961fd8cafdf011f9d; 4a99277cca695a34ba39719399030076_time=1232466378; 4a99277cca695a34ba39719399030076=cf9441753f0b3312fd76b18b68261287
20090121000134038848000|115.120.10.205|bf.bearcn.com|/user.asp?userid=9795|Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; TencentTraveler ; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 1.1.4322; .NET CLR 2.0.50727)|http://bf.bearcn.com/Photo.asp?page=7|ASPSESSIONIDCATABTSR=MDIFCDPCMMGJBDEBMJCPCGHF; BearCN=viewid=20944; ASPSESSIONIDCAQDCQSR=OEDPCHPCOJIKCGECBIFLAOGI
20090121000134178887000|117.101.218.147|www.baidu.com|/|test||BAIDUID=4221AC111420E40EFA125AEC596813B7:FG=1
20090121000134259104000|115.120.17.80|www.sjshu.com|/bookdown/ShowSoftDown.asp?UrlID=1&SoftID=22222|Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 1.7; .NET CLR 2.0.50727)|http://www.sjshu.com/bookdown/200803/22222.shtml|ASPSESSIONIDQQASTRAD=CEJGLDLCADCNKJOPLAMEKDJJ; AJSTAT_ok_pages=3; AJSTAT_ok_times=1; ppad_cookie_0=1
.......

IP rule data

1.0.1.0|1.0.3.255|16777472|16778239|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.0.8.0|1.0.15.255|16779264|16781311|亚洲|中国|广东|广州||电信|440100|China|CN|113.280637|23.125178
1.0.32.0|1.0.63.255|16785408|16793599|亚洲|中国|广东|广州||电信|440100|China|CN|113.280637|23.125178
1.1.0.0|1.1.0.255|16842752|16843007|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.1.2.0|1.1.7.255|16843264|16844799|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.1.8.0|1.1.63.255|16844800|16859135|亚洲|中国|广东|广州||电信|440100|China|CN|113.280637|23.125178
1.2.0.0|1.2.1.255|16908288|16908799|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.2.2.0|1.2.2.255|16908800|16909055|亚洲|中国|北京|北京|海淀|北龙中网|110108|China|CN|116.29812|39.95931
1.2.4.0|1.2.4.255|16909312|16909567|亚洲|中国|北京|北京||中国互联网信息中心|110100|China|CN|116.405285|39.904989
1.2.5.0|1.2.7.255|16909568|16910335|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.2.8.0|1.2.8.255|16910336|16910591|亚洲|中国|北京|北京||中国互联网信息中心|110100|China|CN|116.405285|39.904989
1.2.9.0|1.2.127.255|16910592|16941055|亚洲|中国|广东|广州||电信|440100|China|CN|113.280637|23.125178
1.3.0.0|1.3.255.255|16973824|17039359|亚洲|中国|广东|广州||电信|440100|China|CN|113.280637|23.125178
1.4.1.0|1.4.3.255|17039616|17040383|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.4.4.0|1.4.4.255|17040384|17040639|亚洲|中国|北京|北京|海淀|北龙中网|110108|China|CN|116.29812|39.95931
1.4.5.0|1.4.7.255|17040640|17041407|亚洲|中国|福建|福州||电信|350100|China|CN|119.306239|26.075302
1.4.8.0|1.4.127.255|17041408|17072127|亚洲|中国|广东|广州||电信|440100|China|CN|113.280637|23.125178
....

Requirement

Using the IP rule data, compute the province that each IP address in the log data belongs to.
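The approach is to convert each dotted IP address to the same decimal form used in the start/end columns of the rule data, and then binary-search the rules sorted by start value. For example, 1.0.1.0 = 1*256^3 + 0*256^2 + 1*256 + 0 = 16777472, which is exactly the start value of the first rule row above. A one-line sketch of the conversion:

//1.0.1.0 -> 16777472, matching the first rule row's start value
val ipNum = "1.0.1.0".split("[.]").foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)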

How broadcast variables are implemented

(Diagram of how a broadcast variable is distributed to the Executors; image not reproduced here.)

Implementation using a broadcast variable

package cn._51doit.spark.day07
import cn._51doit.spark.utils.IpUtils
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object IpLocation {
  def main(args: Array[String]): Unit = {
    val isLocal = args(0).toBoolean
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
    if (isLocal) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)
    //Read the IP rule data first
    val ipLines: RDD[String] = sc.textFile(args(1))
    val ipRulesInDriver: Array[(Long, Long, String, String)] = ipLines.map(line => {
      val fields = line.split("[|]")
      val startNum = fields(2).toLong
      val endNum = fields(3).toLong
      val province = fields(6)
      val city = fields(7)
      (startNum, endNum, province, city)
    }).sortBy(_._1) //sort by the rule's start value (needed for the binary search later)
      .collect() //collect all of the IP rules to the Driver
    //Broadcast all of the IP rules from the Driver to the Executors of this Application
    //broadcast is a synchronous (blocking) call: the Driver does not continue until the broadcast finishes
    val broadcastRefInDriver: Broadcast[Array[(Long, Long, String, String)]] = sc.broadcast(ipRulesInDriver)
    //Read the access log data
    val accessLines = sc.textFile(args(2))
    //Split each log line and extract the fields we need
    val reduced: RDD[(String, Int)] = accessLines.map(line => {
      val fields = line.split("[|]")
      val ip = fields(1)
      //Convert the IP string into a decimal number with the custom utility class
      val ipNum = IpUtils.ip2Long(ip)
      //On the Executor, use the broadcast reference to fetch the IP rules that were broadcast in advance
      val ipRulesInExecutor: Array[(Long, Long, String, String)] = broadcastRefInDriver.value
      //Binary search for the matching rule
      val index = IpUtils.binarySearch(ipRulesInExecutor, ipNum)
      var province = "未知"
      if(index > -1) {
        province = ipRulesInExecutor(index)._3
      }
      (province, 1)
    }).reduceByKey(_+_)
    val res = reduced.collect()
    println(res.toBuffer)
    sc.stop()
  }
}
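A usage note: the job takes three arguments - args(0) tells it whether to run locally ("true"/"false"), args(1) is the path of the IP rule file and args(2) is the path of the access log - so a local test run could be started with program arguments such as true /path/to/ip.txt /path/to/access.log (the paths are placeholders).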

package cn._51doit.spark.utils

import scala.collection.mutable.ArrayBuffer

object IpUtils {

  /**
    * Convert an IP address string to its decimal (Long) value
    *
    * @param ip
    * @return
    */
  def ip2Long(ip: String): Long = {
    val fragments = ip.split("[.]")
    var ipNum = 0L
    for (i <- 0 until fragments.length) {
      ipNum = fragments(i).toLong | ipNum << 8L
    }
    ipNum
  }

  /**
    * Binary search over the sorted IP rules (ArrayBuffer version)
    *
    * @param lines
    * @param ip
    * @return
    */
  def binarySearch(lines: ArrayBuffer[(Long, Long, String, String)], ip: Long): Int = {
    var low = 0 //lower bound
    var high = lines.length - 1 //upper bound
    while (low <= high) {
      val middle = (low + high) / 2
      if ((ip >= lines(middle)._1) && (ip <= lines(middle)._2))
        return middle
      if (ip < lines(middle)._1)
        high = middle - 1
      else {
        low = middle + 1
      }
    }
    -1 //not found
  }

  def binarySearch(lines: Array[(Long, Long, String, String)], ip: Long): Int = {
    var low = 0 //lower bound
    var high = lines.length - 1 //upper bound
    while (low <= high) {
      val middle = (low + high) / 2
      if ((ip >= lines(middle)._1) && (ip <= lines(middle)._2))
        return middle
      if (ip < lines(middle)._1)
        high = middle - 1
      else {
        low = middle + 1
      }
    }
    -1 //not found
  }
}
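A quick, hypothetical sanity check of the two helpers against the first sample rule rows shown earlier:

//start/end/province/city values taken from the sample IP rule data above
val rules = Array(
  (16777472L, 16778239L, "福建", "福州"), //1.0.1.0 - 1.0.3.255
  (16779264L, 16781311L, "广东", "广州")) //1.0.8.0 - 1.0.15.255

val ipNum = IpUtils.ip2Long("1.0.1.8")         //16777480
val idx   = IpUtils.binarySearch(rules, ipNum) //0, i.e. 福建 / 福州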

A broadcast variable essentially ships a small table to every node so that it can be matched quickly against each record of a large table. When the rule data is too large to fit in the nodes' memory, it can instead be stored in a database such as HBase or MySQL, or the required information can be fetched over the network from a publicly available API, as the next section shows.

Querying AMap (Gaode Maps) by longitude and latitude for location information

The log data looks as follows; here the location information is obtained from the AMap API rather than from local IP rules. Note that some lines are malformed JSON and will be filtered out by the parsing step below.

{"cid": 1, "money": 600.0, "longitude":116.397128,"latitude":39.916527,"oid":"o123", }
"oid":"o112", "cid": 3, "money": 200.0, "longitude":118.396128,"latitude":35.916527}
{"oid":"o124", "cid": 2, "money": 200.0, "longitude":117.397128,"latitude":38.916527}
{"oid":"o125", "cid": 3, "money": 100.0, "longitude":118.397128,"latitude":35.916527}
{"oid":"o127", "cid": 1, "money": 100.0, "longitude":116.395128,"latitude":39.916527}
{"oid":"o128", "cid": 2, "money": 200.0, "longitude":117.396128,"latitude":38.916527}
{"oid":"o129", "cid": 3, "money": 300.0, "longitude":115.398128,"latitude":35.916527}
{"oid":"o130", "cid": 2, "money": 100.0, "longitude":116.397128,"latitude":39.916527}
{"oid":"o131", "cid": 1, "money": 100.0, "longitude":117.394128,"latitude":38.916527}
{"oid":"o132", "cid": 3, "money": 200.0, "longitude":118.396128,"latitude":35.916527}
....

First, add the dependency for sending HTTP requests:

   <!-- Java HTTP client library -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.7</version>
        </dependency>
package cn._51doit.spark.day07

import java.sql.{Date, DriverManager}
import cn._51doit.spark.day03.OrderCaseClass
import com.alibaba.fastjson.{JSON, JSONException}
import org.apache.http.client.methods.HttpGet
import org.apache.http.impl.client.HttpClients
import org.apache.http.util.EntityUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import org.slf4j.LoggerFactory

object LocationIncome {

  private val logger = LoggerFactory.getLogger(this.getClass)

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("CategoryIncome").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile(args(0))
    //Parse each JSON line into an OrderCaseClass bean
    val beanRDD: RDD[OrderCaseClass] = lines.map(line => {
      var bean: OrderCaseClass = null
      try {
        bean = JSON.parseObject(line, classOf[OrderCaseClass])
      } catch {
        case e: JSONException => {
          //log the bad record
          logger.error("parse json error => " + line)
        }
      }
      bean
    })
    //Filter out records that failed to parse
    val filtered = beanRDD.filter(_ != null)

    //Query AMap by longitude and latitude to get the location information
    //i.e. join each record with this dimension data
    val res = filtered.mapPartitions(it => {
      //Create one HTTP client per partition with the HttpClients factory
      //mapPartitions processes a whole partition at a time, avoiding one client per record
      val httpclient = HttpClients.createDefault
      //Iterate over the records of this partition
      it.map(bean => {
        //Read the longitude and latitude and interpolate them into the request URL
        val longitude = bean.longitude
        val latitude = bean.latitude
        //REST conventions: GET to query, POST to create, PUT to update, DELETE to remove
        val httpGet = new HttpGet(s"https://restapi.amap.com/v3/geocode/regeo?&location=$longitude,$latitude&key=YOUR_AMAP_KEY")
        val response = httpclient.execute(httpGet)
        try {
  
          val entity1 = response.getEntity
          // do something useful with the response entity
          // and make sure it is fully consumed
          var province: String = null
          var city: String = null
          if (response.getStatusLine.getStatusCode == 200) {
            //Read the response body as a JSON string
            val result = EntityUtils.toString(entity1)
            //Parse it into a JSON object
            val jsonObj = JSON.parseObject(result)
            //Extract the reverse-geocoding result
            val regeocode = jsonObj.getJSONObject("regeocode")
            if (regeocode != null && !regeocode.isEmpty) {
              val address = regeocode.getJSONObject("addressComponent")
              //Extract the province and city
              bean.province = address.getString("province")
              bean.city = address.getString("city")
            }
          }
        } finally {
          response.close()
        }
        //If this is the last record of the partition, close the HTTP client
        if(!it.hasNext) {
          httpclient.close()
        }
        bean
      })
      //Note: the HTTP client must NOT be closed here. The client is defined inside the RDD
      //function, so it is serialized with the task and sent to the Executor. it.map is lazy:
      //when this point is reached, no data has been processed yet - only the client has been
      //created - so closing it here would close the connection before any record is handled
      //and the job would fail. (An iterator does nothing until something downstream pulls
      //data from it, i.e. until an action drives it.) That is why the client is closed inside
      //the map function, once the iterator has reached its last element.
      //httpclient.close()
      //the it.map(...) call above is the new iterator returned for this partition
    })
    println(res.collect().toBuffer)
    sc.stop()
  }
}
package cn._51doit.spark.day03

case class OrderCaseClass(
                           cid: String,
                           money: Double,
                           oid: String,
                           longitude: Double,
                           latitude: Double,
                           var province: String,
                           var city: String
                         ) 

Loading the IP rules in a singleton object and writing the results to a database

<!-- MySQL JDBC driver -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.47</version>
        </dependency>
import java.io.{BufferedReader, FileInputStream, InputStreamReader}

import scala.collection.mutable.ArrayBuffer

object IpRulesLoader {

  //Read the IP rule data with a plain IO stream and put it into an ArrayBuffer
  //Data defined in an object is static: there is exactly one copy per JVM process
  val ipRules = new ArrayBuffer[(Long, Long, String, String)]()
  //The IP rules are loaded once, when the object is initialized on the Executor
  //(the object body acts like a static initializer block)
  //To read the data from HDFS instead:
  //val fileSystem = FileSystem.get(URI.create("file://"), new Configuration())
  //val inputStream = fileSystem.open(new Path("/Users/xing/Desktop/ip.txt"))
  val bufferedReader = new BufferedReader(new InputStreamReader(new FileInputStream("data/ip.txt")))
  var line: String = null
  do {
    line = bufferedReader.readLine()
    if (line != null) {
      //Parse one IP rule line
      val fields = line.split("[|]")
      val startNum = fields(2).toLong
      val endNum = fields(3).toLong
      val province = fields(6)
      val city = fields(7)
      val t = (startNum, endNum, province, city)
      ipRules += t
    }
  } while (line != null)
  def getAllRules: ArrayBuffer[(Long, Long, String, String)] = {
    ipRules
  }
}
package cn._51doit.spark.day09

import java.sql.{Connection, Date, DriverManager, PreparedStatement}

import cn._51doit.spark.utils.IpUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import org.slf4j.{Logger, LoggerFactory}

import scala.collection.mutable.ArrayBuffer

object IpLocationV3 {
	//Logger for recording exceptions
  private val logger: Logger = LoggerFactory.getLogger(IpLocationV3.getClass)

  def main(args: Array[String]): Unit = {

    val isLocal = args(0).toBoolean

    val conf = new SparkConf().setAppName(this.getClass.getCanonicalName)

    if (isLocal) {
      conf.setMaster("local[*]")
    }
    val sc = new SparkContext(conf)

    val accessLog: RDD[String] = sc.textFile(args(1))

    val reduced = accessLog.mapPartitions(it => {
      //Referencing IpRulesLoader triggers its initialization (once per Executor JVM)
      val allRulesInExecutor: ArrayBuffer[(Long, Long, String, String)] = IpRulesLoader.getAllRules
      it.map(line => {
        val fields = line.split("[|]")
        val ip = fields(1)
        val ipNum = IpUtils.ip2Long(ip)
        val index = IpUtils.binarySearch(allRulesInExecutor, ipNum)
        var province = "未知"
        if (index != -1) {
          province = allRulesInExecutor(index)._3
        }
        (province, 1)
      })
    }).reduceByKey(_ + _)

    //The results could be collected to the Driver and then written to MySQL, Redis or HBase.
    //However, collect() must not bring back too much data: the results travel over the network,
    //which is inefficient for large volumes and risks losing data
    //val res = reduced.collect()


    //Assuming the result is large, write it out from the Executors instead. Plain foreach is a
    //poor fit, because it would open a new database connection for every single record,
    //wasting resources
    //    reduced.foreach(t => {
    //      //open a database connection for every record
    //      val connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/bigdata?characterEncoding=UTF-8", "root", "123456")
    //      val pstm = connection.prepareStatement("INSERT INTO daily_ip_count (dt, province, counts) VALUES (?, ?, ?)")
    //      pstm.setDate(1, new Date(System.currentTimeMillis()))
    //      pstm.setString(2, t._1)
    //      pstm.setInt(3, t._2)
    //      pstm.executeUpdate()
    //
    //      pstm.close()
    //      connection.close()
    //    })
	//foreachPartition is preferred: it is an action and it processes a whole partition at a time
    reduced.foreachPartition(it => {
      var connection: Connection = null
      var pstm: PreparedStatement = null
      try {
        var i = 0
        connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/bigdata?characterEncoding=UTF-8", "root", "123456")
        pstm = connection.prepareStatement("INSERT INTO daily_ip_count (dt, province, counts) VALUES (?, ?, ?)")
        it.foreach(t => {
          pstm.setDate(1, new Date(System.currentTimeMillis()))
          pstm.setString(2, t._1)
          pstm.setInt(3, t._2)
          pstm.addBatch()
          i += 1
		  //addBatch() combined with executeLargeBatch() writes the records to the database in batches,
		  //but a very large partition could still exhaust memory, so flush every 100 records (counter i)
          if (i % 100 == 0) {
            pstm.executeLargeBatch()
          }
        })
		//Flush once more at the end so that a final batch of fewer than 100 records is also written
        pstm.executeLargeBatch()
      } catch {
        case e: Exception => {
          //log the exception
          logger.error("failed to write the results to the database", e)
        }
      } finally {
        if(pstm != null) {
          pstm.close()
        }
        if(connection != null) {
          connection.close()
        }
      }
    })
    sc.stop()
  }
}
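A usage note on the assumptions baked into the code above: the MySQL database bigdata is expected to contain a table daily_ip_count with columns dt (date), province and counts; the job takes two arguments, args(0) for whether to run locally and args(1) for the access log path; and the IP rule file path data/ip.txt is hard-coded in IpRulesLoader.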
