RDD (Resilient Distributed Dataset) is Spark's core abstraction for data.
Properties (from the comments in the source code):
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
    (an RDD consists of a set of partitions)
* - A function for computing each split
    (a function used to compute the data of each partition/split)
* - A list of dependencies on other RDDs
    (RDDs depend on other RDDs; dependencies are either wide or narrow)
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
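These five properties can be checked directly in spark-shell. A minimal sketch, assuming the shell-provided sc; the exact partitioner produced by reduceByKey depends on spark.default.parallelism, so the values in the comments are illustrative:

// Inspect the five properties of an RDD in spark-shell
val rdd = sc.parallelize(1 to 100, 4)
rdd.getNumPartitions                      // 4    -- the list of partitions
rdd.partitioner                           // None -- a plain RDD carries no Partitioner
rdd.preferredLocations(rdd.partitions(0)) // empty -- no preferred locations for an in-memory collection
val pairs = rdd.map(x => (x % 10, x)).reduceByKey(_ + _)
pairs.partitioner                         // Some(HashPartitioner) -- key-value RDDs may be hash-partitioned
pairs.dependencies                        // contains a ShuffleDependency  -> wide dependency
rdd.map(_ * 2).dependencies               // contains a OneToOneDependency -> narrow dependency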
Ways to create an RDD:
1. From a collection, using SparkContext: val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10),3)
2. By reading an external data source directly:
   val rdd2 = sc.textFile("hdfs://bigdata111:9000/input/data.txt")
   val rdd2 = sc.textFile("/root/temp/input/data.txt")
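In both cases the second argument suggests how many partitions the RDD should have (for textFile it is only a minimum), which can be verified with getNumPartitions. A quick check, reusing rdd1 from above (rdd3 is just an illustrative name):

rdd1.getNumPartitions                                      // 3 -- matches the second argument to parallelize
val rdd3 = sc.textFile("hdfs://bigdata111:9000/input/data.txt", 2)
rdd3.getNumPartitions                                      // >= 2 -- textFile's argument is a minimum, not an exact count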
Code: the requirement is to analyze user access logs, extract the JSP page names that users visited, and save them to a relational database.
package SparkContextDemo

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import java.sql.DriverManager
import java.sql.Connection
import java.sql.PreparedStatement

object Tomcat {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "D:\\DayDayUp\\hadoop-2.4.1\\hadoop-2.4.1")
    // Build the SparkContext that will create the RDD
    val conf = new SparkConf().setAppName("Tomcat log").setMaster("local")
    val sc = new SparkContext(conf)

    // Read the log and map each line to (jspName, 1)
    val rdd1 = sc.textFile("D:\\tmp\\localhost_access_log.2017-07-30.txt", 2).map {
      // Sample line: 192.168.88.1 - - [30/Jul/2017:12:54:40 +0800] "GET /MyDemoWeb/oracle.jsp HTTP/1.1" 200 242
      line => {
        val index1 = line.indexOf("\"")
        val index2 = line.lastIndexOf("\"")
        val str1 = line.substring(index1 + 1, index2)
        // str1 = GET /MyDemoWeb/oracle.jsp HTTP/1.1
        val index3 = str1.indexOf(" ")
        val index4 = str1.lastIndexOf(" ")
        val str2 = str1.substring(index3 + 1, index4)
        // str2 = /MyDemoWeb/oracle.jsp
        val nameJps = str2.substring(str2.lastIndexOf("/") + 1)
        (nameJps, 1)
      }
    }

    var pst: PreparedStatement = null
    var conn: Connection = null
    // The connection is created on the driver and then used inside the foreach closure
    conn = DriverManager.getConnection("jdbc:oracle:thin:@192.168.128.90:1521/orcl.example.com", "scott", "tiger")
    try {
      rdd1.foreach(f => {
        pst = conn.prepareStatement("insert into result values(?,?)")
        pst.setString(1, f._1)
        pst.setInt(2, f._2)
        pst.execute()
      })
    } catch {
      case ex: Exception => ex.printStackTrace()
    } finally {
      if (pst != null) pst.close()
      if (conn != null) conn.close()
    }
    sc.stop()
  }
}
Running this program throws an exception: Task not serializable
Cause:
An RDD is made up of partitions, and its tasks are executed on different nodes. The JDBC Connection created on the driver would have to be shipped to every node that runs a task, but Connection is not a serializable object, so it cannot be sent to (and shared by) those nodes. The fix is to work at the partition level by calling foreachPartition and creating one connection per partition.
Revised code:
package SparkContextDemo

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import java.sql.DriverManager
import java.sql.Connection
import java.sql.PreparedStatement

object Tomcat {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "D:\\DayDayUp\\hadoop-2.4.1\\hadoop-2.4.1")
    // Build the SparkContext that will create the RDD
    val conf = new SparkConf().setAppName("Tomcat log").setMaster("local")
    val sc = new SparkContext(conf)

    // Read the log and map each line to (jspName, 1)
    val rdd1 = sc.textFile("D:\\tmp\\localhost_access_log.2017-07-30.txt", 2).map {
      // Sample line: 192.168.88.1 - - [30/Jul/2017:12:54:40 +0800] "GET /MyDemoWeb/oracle.jsp HTTP/1.1" 200 242
      line => {
        val index1 = line.indexOf("\"")
        val index2 = line.lastIndexOf("\"")
        val str1 = line.substring(index1 + 1, index2)
        // str1 = GET /MyDemoWeb/oracle.jsp HTTP/1.1
        val index3 = str1.indexOf(" ")
        val index4 = str1.lastIndexOf(" ")
        val str2 = str1.substring(index3 + 1, index4)
        // str2 = /MyDemoWeb/oracle.jsp
        val nameJps = str2.substring(str2.lastIndexOf("/") + 1)
        (nameJps, 1)
      }
    }

    // One JDBC connection per partition instead of one shared driver-side connection
    rdd1.foreachPartition(saveJDBC)
    sc.stop()
  }

  // Processes one partition; rdd1 holds (String, Int) pairs, so the input is an iterator over them
  def saveJDBC(iter: Iterator[(String, Int)]): Unit = {
    // JDBC operations
    var pst: PreparedStatement = null
    var conn: Connection = null
    try {
      conn = DriverManager.getConnection("jdbc:oracle:thin:@192.168.128.90:1521/orcl.example.com", "scott", "tiger")
      pst = conn.prepareStatement("insert into result values(?,?)")
      iter.foreach(data => {
        pst.setString(1, data._1)
        pst.setInt(2, data._2)
        pst.execute()
      })
    } catch {
      case ex: Exception => ex.printStackTrace()
    } finally {
      if (pst != null) pst.close()
      if (conn != null) conn.close()
    }
  }
}
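As a further optional refinement (not part of the code above), the per-partition function can batch the inserts with the standard JDBC addBatch/executeBatch calls instead of executing them one at a time. A sketch of such a variant, here given the hypothetical name saveJDBCBatch and meant to sit alongside saveJDBC in the same object (so the same java.sql imports apply):

  // Sketch: batched variant of saveJDBC (assumed refinement, not from the original code)
  def saveJDBCBatch(iter: Iterator[(String, Int)]): Unit = {
    var pst: PreparedStatement = null
    var conn: Connection = null
    try {
      conn = DriverManager.getConnection("jdbc:oracle:thin:@192.168.128.90:1521/orcl.example.com", "scott", "tiger")
      pst = conn.prepareStatement("insert into result values(?,?)")
      iter.foreach { case (name, count) =>
        pst.setString(1, name)
        pst.setInt(2, count)
        pst.addBatch()          // queue the insert instead of executing it immediately
      }
      pst.executeBatch()        // send all queued inserts for this partition at once
    } catch {
      case ex: Exception => ex.printStackTrace()
    } finally {
      if (pst != null) pst.close()
      if (conn != null) conn.close()
    }
  }

It is called the same way: rdd1.foreachPartition(saveJDBCBatch).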
The data is inserted successfully:
SQL> select count(*) from result;
COUNT(*)
----------
120
SQL> select * from result;
NAMEJPS COUNT
-------------------- ----------
head.jsp 1
java.jsp 1
web.jsp 1
mysql.jsp 1
oracle.jsp 1
hadoop.jsp 1
head.jsp 1
java.jsp 1
web.jsp 1
mysql.jsp 1
oracle.jsp 1
Note: there is one more approach that avoids the per-partition function altogether, but it is very wasteful of system resources and seriously hurts performance, especially when the data volume is large:
package SparkContextDemo

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import java.sql.DriverManager
import java.sql.Connection
import java.sql.PreparedStatement

object Tomcat {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "D:\\DayDayUp\\hadoop-2.4.1\\hadoop-2.4.1")
    // Build the SparkContext that will create the RDD
    val conf = new SparkConf().setAppName("Tomcat log").setMaster("local")
    val sc = new SparkContext(conf)

    // Read the log and map each line to (jspName, 1)
    val rdd1 = sc.textFile("D:\\tmp\\localhost_access_log.2017-07-30.txt", 2).map {
      // Sample line: 192.168.88.1 - - [30/Jul/2017:12:54:40 +0800] "GET /MyDemoWeb/oracle.jsp HTTP/1.1" 200 242
      line => {
        val index1 = line.indexOf("\"")
        val index2 = line.lastIndexOf("\"")
        val str1 = line.substring(index1 + 1, index2)
        // str1 = GET /MyDemoWeb/oracle.jsp HTTP/1.1
        val index3 = str1.indexOf(" ")
        val index4 = str1.lastIndexOf(" ")
        val str2 = str1.substring(index3 + 1, index4)
        // str2 = /MyDemoWeb/oracle.jsp
        val nameJps = str2.substring(str2.lastIndexOf("/") + 1)
        (nameJps, 1)
      }
    }

    var pst: PreparedStatement = null
    var conn: Connection = null
    // The connection is no longer created here on the driver; it is created per record inside foreach
    try {
      rdd1.foreach(f => {
        // A connection is created for every single record of the RDD. This sidesteps the
        // serialization problem, but with tens of millions of records it means tens of
        // millions of connections, which severely hurts performance.
        conn = DriverManager.getConnection("jdbc:oracle:thin:@192.168.128.90:1521/orcl.example.com", "scott", "tiger")
        pst = conn.prepareStatement("insert into result values(?,?)")
        pst.setString(1, f._1)
        pst.setInt(2, f._2)
        pst.execute()
      })
    } catch {
      case ex: Exception => ex.printStackTrace()
    } finally {
      if (pst != null) pst.close()
      if (conn != null) conn.close()
    }
    sc.stop()
  }
}