Handling the Exception When Operating on an Oracle Database from an RDD

This post looks at the Task not serializable exception thrown when operating on an Oracle database from a Spark RDD. The problem comes from the partitioned nature of RDDs: a JDBC Connection is not serializable and cannot be shared across nodes. The fix is to process the RDD partition by partition with `foreachPartition`, so that each node manages its own connection. A second, more resource-hungry workaround is also covered.

RDD (Resilient Distributed Dataset), the resilient distributed dataset, is Spark's core abstraction for data.

Properties (from the comments in the source code):

 * Internally, each RDD is characterized by five main properties:
 *
 *  - A list of partitions
        An RDD is made up of a set of partitions.

 *  - A function for computing each split
        A function used to compute the data of each partition (split).

 *  - A list of dependencies on other RDDs
        RDDs depend on one another; dependencies are either wide or narrow.

 *  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
 *  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
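
These five properties can be inspected directly on a live RDD. The following is a minimal sketch of my own (the sample data and the assumption of an existing SparkContext named sc are not from the source comments):

//Sketch: inspect the five properties on a small key-value RDD, assuming a SparkContext named sc
val pairs = sc.parallelize(Array(("a",1),("b",2),("a",3)), 3).reduceByKey(_+_)

println(pairs.partitions.length)   //a list of partitions -> 3
println(pairs.dependencies)        //dependencies on the parent RDD (a ShuffleDependency here)
println(pairs.partitioner)         //Some(HashPartitioner) because this is a key-value RDD
println(pairs.preferredLocations(pairs.partitions(0)))  //preferred locations (empty for in-memory data)
//the "function for computing each split" is the RDD's internal compute method and is not called directly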
Ways to build an RDD:

1. Create it through the SparkContext: val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10), 3)

2. Create the RDD directly by reading an external data source:

val rdd2 = sc.textFile("hdfs://bigdata111:9000/input/data.txt")
val rdd2 = sc.textFile("/root/temp/input/data.txt")

Code: the requirement is to analyze the user access log, extract the name of the jsp page each user visited, and save those jsp names into a relational database.

package SparkContextDemo

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.Partitioner
import scala.collection.mutable.HashMap
import java.sql.DriverManager
import java.sql.Connection
import java.sql.PreparedStatement

object Tomcat {
  def main(args: Array[String]): Unit = {
    
    System.setProperty("hadoop.home.dir", "D:\\DayDayUp\\hadoop-2.4.1\\hadoop-2.4.1")
    //build the SparkConf and SparkContext, which are used to create the RDD
    val conf = new SparkConf().setAppName("Tomcat log").setMaster("local")
    val sc = new SparkContext(conf)
    //read the log data
    val rdd1 = sc.textFile("D:\\tmp\\localhost_access_log.2017-07-30.txt",2).map{
      //192.168.88.1 - - [30/Jul/2017:12:54:40 +0800] "GET /MyDemoWeb/oracle.jsp HTTP/1.1" 200 242
      line => {
        val index1 = line.indexOf("\"")
        val index2 = line.lastIndexOf("\"")
        val str1 = line.substring(index1+1, index2)
        //str1 now holds: GET /MyDemoWeb/oracle.jsp HTTP/1.1
        val index3 = str1.indexOf(" ")
        val index4 = str1.lastIndexOf(" ")
        val str2 = str1.substring(index3+1, index4)
        //str2 = /MyDemoWeb/oracle.jsp
        val nameJps = str2.substring(str2.lastIndexOf("/")+1)
        (nameJps,1)  
      }
    }
    var pst:PreparedStatement = null
    var conn:Connection = null
    conn = DriverManager.getConnection("jdbc:oracle:thin:@192.168.128.90:1521/orcl.example.com", "scott", "tiger")
    try{
      rdd1.foreach(f=>{
        pst = conn.prepareStatement("insert into result values(?,?)")
        pst.setString(1,f._1)
        pst.setInt(2, f._2)
        pst.execute()
      })
    }catch{
      case ex:Exception => ex.printStackTrace()
    }finally{
      if(pst!=null) pst.close()
      if(conn!=null) conn.close()
    }
    sc.stop()
  }
}

Running this program throws an exception: Task not serializable

Cause of the problem:

An RDD is made up of partitions, and its tasks are executed on different nodes. The JDBC Connection created in the driver would have to be used by node 1 as well as node 2, but a Connection is not a serializable object, so it cannot be shipped to and shared by those nodes. The fix is to work on the RDD partition by partition by calling the `foreachPartition` method.
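
A quick way to see why the closure fails to serialize (a sketch of my own, not from the original post): Spark serializes the `foreach` closure with Java serialization before shipping it to the executors, and `java.sql.Connection` is not a `Serializable` type, so any closure that captures `conn` cannot be serialized.

//java.sql.Connection does not extend java.io.Serializable, so a closure capturing a Connection fails to serialize
println(classOf[java.io.Serializable].isAssignableFrom(classOf[java.sql.Connection]))  //false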

The modified code:

package SparkContextDemo

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.Partitioner
import scala.collection.mutable.HashMap
import java.sql.DriverManager
import java.sql.Connection
import java.sql.PreparedStatement

object Tomcat {
  def main(args: Array[String]): Unit = {
    
    System.setProperty("hadoop.home.dir", "D:\\DayDayUp\\hadoop-2.4.1\\hadoop-2.4.1")
    //build the SparkConf and SparkContext, which are used to create the RDD
    val conf = new SparkConf().setAppName("Tomcat log").setMaster("local")
    val sc = new SparkContext(conf)
    //read the log data
    val rdd1 = sc.textFile("D:\\tmp\\localhost_access_log.2017-07-30.txt",2).map{
      //192.168.88.1 - - [30/Jul/2017:12:54:40 +0800] "GET /MyDemoWeb/oracle.jsp HTTP/1.1" 200 242
      line => {
        val index1 = line.indexOf("\"")
        val index2 = line.lastIndexOf("\"")
        val str1 = line.substring(index1+1, index2)
        //str1 now holds: GET /MyDemoWeb/oracle.jsp HTTP/1.1
        val index3 = str1.indexOf(" ")
        val index4 = str1.lastIndexOf(" ")
        val str2 = str1.substring(index3+1, index4)
        //str2 = /MyDemoWeb/oracle.jsp
        val nameJps = str2.substring(str2.lastIndexOf("/")+1)
        (nameJps,1)  
      }
    }
    rdd1.foreachPartition(saveJDBC)
    sc.stop()
  }
  //per-partition save function: rdd1 contains (String, Int) pairs, so the input is an Iterator over those pairs
  def saveJDBC(iter:Iterator[(String,Int)])={
    //JDBC work for this partition
    var pst:PreparedStatement = null
    var conn:Connection = null
    try{
      conn = DriverManager.getConnection("jdbc:oracle:thin:@192.168.128.90:1521/orcl.example.com", "scott", "tiger")
      pst = conn.prepareStatement("insert into result values(?,?)")
      iter.foreach(data=>{
        pst.setString(1, data._1)
        pst.setInt(2, data._2)
        pst.execute()
      })
    }catch{
      case ex:Exception => ex.printStackTrace()
    }finally{
      if(pst!=null) pst.close()
      if(conn!=null) conn.close()
    }
    
  }
  
}

The data is inserted successfully:

SQL> select count(*) from result;

  COUNT(*)
----------
       120

SQL> select * from result;

NAMEJPS 		  COUNT
-------------------- ----------
head.jsp		      1
java.jsp		      1
web.jsp 		      1
mysql.jsp		      1
oracle.jsp		      1
hadoop.jsp		      1
head.jsp		      1
java.jsp		      1
web.jsp 		      1
mysql.jsp		      1
oracle.jsp		      1
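
A side note of my own (not from the original post): the saveJDBC function above still executes one INSERT round trip per record. When a partition holds many records, a common refinement is JDBC batching; the following is a sketch under the same connection and table assumptions:

  //Sketch (assumption): same as saveJDBC, but rows are queued with addBatch and flushed with executeBatch
  def saveJDBCBatched(iter:Iterator[(String,Int)]) = {
    var pst:PreparedStatement = null
    var conn:Connection = null
    try{
      conn = DriverManager.getConnection("jdbc:oracle:thin:@192.168.128.90:1521/orcl.example.com", "scott", "tiger")
      pst = conn.prepareStatement("insert into result values(?,?)")
      iter.foreach(data=>{
        pst.setString(1, data._1)
        pst.setInt(2, data._2)
        pst.addBatch()      //queue the row instead of sending it immediately
      })
      pst.executeBatch()    //send all queued rows to the database in one batch
    }catch{
      case ex:Exception => ex.printStackTrace()
    }finally{
      if(pst!=null) pst.close()
      if(conn!=null) conn.close()
    }
  }

It would be used exactly like the original: rdd1.foreachPartition(saveJDBCBatched).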

Note: there is another approach that does not need the per-partition function, but it wastes system resources heavily and severely degrades performance, especially when the data volume is large:

package SparkContextDemo

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.Partitioner
import scala.collection.mutable.HashMap
import java.sql.DriverManager
import java.sql.Connection
import java.sql.PreparedStatement

object Tomcat {
  def main(args: Array[String]): Unit = {
    
    System.setProperty("hadoop.home.dir", "D:\\DayDayUp\\hadoop-2.4.1\\hadoop-2.4.1")
    //build the SparkConf and SparkContext, which are used to create the RDD
    val conf = new SparkConf().setAppName("Tomcat log").setMaster("local")
    val sc = new SparkContext(conf)
    //read the log data
    val rdd1 = sc.textFile("D:\\tmp\\localhost_access_log.2017-07-30.txt",2).map{
      //192.168.88.1 - - [30/Jul/2017:12:54:40 +0800] "GET /MyDemoWeb/oracle.jsp HTTP/1.1" 200 242
      line => {
        val index1 = line.indexOf("\"")
        val index2 = line.lastIndexOf("\"")
        val str1 = line.substring(index1+1, index2)
        //str1 now holds: GET /MyDemoWeb/oracle.jsp HTTP/1.1
        val index3 = str1.indexOf(" ")
        val index4 = str1.lastIndexOf(" ")
        val str2 = str1.substring(index3+1, index4)
        //str2 = /MyDemoWeb/oracle.jsp
        val nameJps = str2.substring(str2.lastIndexOf("/")+1)
        (nameJps,1)  
      }
    }
    var pst:PreparedStatement = null
    var conn:Connection = null
    //conn = DriverManager.getConnection("jdbc:oracle:thin:@192.168.128.90:1521/orcl.example.com", "scott", "tiger")
    try{
      rdd1.foreach(f=>{
        //A new Connection is created for every single record in the RDD. This avoids the serialization problem,
        //but with tens of millions of records it means tens of millions of connections,
        //which severely hurts system performance.
        conn = DriverManager.getConnection("jdbc:oracle:thin:@192.168.128.90:1521/orcl.example.com", "scott", "tiger")
        pst = conn.prepareStatement("insert into result values(?,?)")
        pst.setString(1,f._1)
        pst.setInt(2, f._2)
        pst.execute()
      })
    }catch{
      case ex:Exception => ex.printStackTrace()
    }finally{
      if(pst!=null) pst.close()
      if(conn!=null) conn.close()
    }
    sc.stop()
  }
}
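
One last observation of my own (not from the original post): the result table shows one row per log line, so pages such as head.jsp appear many times with a count of 1. If only a total per jsp page is needed, aggregating before writing shrinks the number of JDBC writes, whichever write strategy is used:

//Sketch: aggregate the (name, 1) pairs first, then write the much smaller result per partition
val counted = rdd1.reduceByKey(_ + _)   //one (jsp name, total count) pair per page
counted.foreachPartition(saveJDBC)      //saveJDBC is the per-partition function from the modified code above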

 
