RDD (Resilient Distributed Dataset): the resilient distributed dataset, Spark's abstraction for data.
Properties (from the Spark source code; the sketch after this list shows how they surface in the API):
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
An RDD is made up of a set of partitions.
* - A function for computing each split
A function used to compute the data in each partition.
* - A list of dependencies on other RDDs
RDDs depend on one another; dependencies are either wide or narrow.
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
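All five properties are visible directly on the RDD API. A minimal sketch of inspecting them, assuming a local run; the object name RddProperties and the sample values are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object RddProperties {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RDD properties").setMaster("local"))
    val rdd = sc.parallelize(1 to 10, 3)
    println(rdd.partitions.length)                     // property 1: a list of partitions (3 here)
    val doubled = rdd.map(_ * 2)                       // property 2: map supplies the compute function
    println(doubled.dependencies)                      // property 3: a narrow OneToOneDependency on rdd
    val sums = doubled.map(n => (n % 3, n)).reduceByKey(_ + _)
    println(sums.partitioner)                          // property 4: Some(HashPartitioner) after the shuffle
    println(rdd.preferredLocations(rdd.partitions(0))) // property 5: empty for a parallelized collection
    sc.stop()
  }
}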
Ways to create an RDD:
1. With the SparkContext: val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10), 3)
2. By reading an external data source directly (a quick check follows this list):
val rdd2 = sc.textFile("hdfs://bigdata111:9000/input/data.txt") // from HDFS
val rdd2 = sc.textFile("/root/temp/input/data.txt")             // or from the local file system
Code: the requirement is to analyze user access logs, extract the JSP pages each user visited, and save those JSP names to a relational database:
package SparkContextDemo

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.Partitioner
import scala.collection.mutable.HashMap
// JDBC classes for saving the extracted JSP names to the relational database
import java.sql.DriverManager
import java.sql.Connection
import java.sql.PreparedStatement

object Tomcat {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "D:\\DayDayUp\\hadoop-2.4.1\\hadoop-2.4.1")
    // Build the SparkContext, which is then used to create the RDDs
    val conf = new SparkConf().setAppName("Tomcat log").setMaster("local")
    val sc = new SparkContext(conf)
    // Read the data
    val rdd1 = sc.textFile("D:\\tmp\\localhost_access_log.