1. The Difference Between SparkSession and SparkContext
First, a quick review of the difference between SparkSession and SparkContext, along with the related concepts:
Application: a Spark program written by the user. The Driver is the process that runs the Application's main() function and creates the SparkContext.
SparkContext: the context of the entire application; it controls the application's lifecycle.
RDD: an immutable collection of data, created via the SparkContext; it is Spark's basic unit of computation.
SparkSession: a new concept introduced in Spark 2.0. SparkSession provides users with a single, unified entry point. It was introduced to carry the DataFrame and Dataset APIs while retaining the functionality of the original SparkContext. To use the Hive, SQL, or Streaming APIs, a SparkSession is required as the entry point.
Creating a SparkContext object
import org.apache.spark.{SparkConf, SparkContext}

//Spark app configuration: the application name and where the master runs
val sparkConf = new SparkConf()
  .setAppName("SparkAppTemplate")
  .setMaster("local[2]")
//Create the SparkContext object: used mainly to read the data to be processed, wrap it in RDDs, and schedule job execution
val sc = new SparkContext(sparkConf)
In early versions of Spark, the SparkContext was the main entry point. Because RDDs were the primary API, we created and manipulated RDDs through the SparkContext, and for every other API we needed a different context, as sketched below.
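For illustration, here is the pre-2.0 multi-context pattern just described; a minimal sketch reusing the sc created above (SQLContext is deprecated in 2.x in favor of SparkSession):

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

//Before 2.0, each API needed its own context built on top of the SparkContext
val sqlContext = new SQLContext(sc)            //DataFrame / SQL API
val ssc = new StreamingContext(sc, Seconds(5)) //DStream streaming API, 5-second batches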
Creating a SparkSession object
import org.apache.spark.sql.SparkSession

//In Spark 2.x, reading data through a bare SparkContext is discouraged; SparkSession is recommended instead
val warehouseLocation = "file:${system:user.dir}/spark-warehouse"
val spark = SparkSession
  .builder()
  .appName("SparkSessionZipsExample")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()
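The session wraps a SparkContext, so the old entry point remains available for RDD work, and SQL runs through the same object:

//The SparkSession carries the SparkContext; no separate contexts are needed
val sc = spark.sparkContext          //underlying SparkContext for RDD work
spark.sql("show databases").show()   //SQL (and Hive, via enableHiveSupport) entry point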
2. Reading MySQL Data with Spark
Maven configuration
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <java.version>1.8</java.version>
    <scala.version>2.12.8</scala.version>
    <spark.version>2.4.3</spark.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.12</artifactId>
        <version>${spark.version}</version>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.27</version>
    </dependency>
</dependencies>
2.1 Single-partition mode
//Function:
def jdbc(url: String, table: String, properties: Properties): DataFrame
Implementation:
import java.util.Properties
import org.apache.spark.sql.SparkSession

object Testsql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sparksql").master("local").getOrCreate()
    val prop = new Properties()
    prop.put("user", "java")
    prop.put("password", "123456")
    val url = "jdbc:mysql://localhost:3306/tushu?useUnicode=true&characterEncoding=UTF-8&serverTimezone=UTC"
    //show() returns Unit, so keep the DataFrame and display it in a separate step
    val dataFrame = spark.read.jdbc(url, "book", prop)
    dataFrame.select("id", "cover").where("id >= 3").show()
    spark.stop()
  }
}
As the parameters show, all you need to pass is the JDBC URL, the table name, and a Properties object holding the username and password. But check the partition count of the resulting DataFrame and you will find that this careless approach gives a parallelism of 1:
dataFrame.rdd.partitions.size = 1
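If downstream processing needs more parallelism, one common workaround is to repartition after loading; a minimal sketch (the count 10 is arbitrary). Note that the initial JDBC read itself still runs as a single query:

//Spread subsequent stages across 10 tasks; the JDBC load stays single-partition
val repartitioned = dataFrame.repartition(10)
println(repartitioned.rdd.partitions.size)   //10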
2.2 Partitioned mode keyed on a Long-typed column
//Function:
def jdbc(
    url: String,
    table: String,
    columnName: String,
    lowerBound: Long,
    upperBound: Long,
    numPartitions: Int,
    connectionProperties: Properties): DataFrame
Implementation:
import java.util.Properties
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object Testsql {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
    Logger.getLogger("org.apache.spark").setLevel(Level.INFO)
    Logger.getLogger("org.spark_project.jetty").setLevel(Level.WARN)
    val spark = SparkSession.builder().appName("sparksql").master("local").getOrCreate()
    val prop = new Properties()
    prop.put("user", "java")
    prop.put("password", "123456")
    prop.put("driver", "com.mysql.jdbc.Driver")
    val url = "jdbc:mysql://localhost:3306/tushu?useUnicode=true&characterEncoding=UTF-8&serverTimezone=UTC"
    val columnName = "id"
    val table = "book"
    val lowerBound = 0
    val upperBound = 1000
    val numPartitions = 10
    val df = spark.read.jdbc(url, table, columnName, lowerBound, upperBound, numPartitions, prop)
    df.limit(10).show()
    spark.stop()
  }
}
As the parameters show, by naming the numeric column id as the partition key and supplying its lower bound, upper bound, and a partition count, you get a concurrent read of the table: Spark splits the [lowerBound, upperBound] range evenly across numPartitions, issues the queries in parallel, and assembles the results into an RDD.
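You can confirm the parallelism directly on the DataFrame from the example above:

//With lowerBound = 0, upperBound = 1000, numPartitions = 10 as above:
println(df.rdd.partitions.size)   //10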
How numPartitions behaves, from the source code:
if upperBound - lowerBound >= numPartitions:
    jdbcDF.rdd.partitions.size = numPartitions
else:
    jdbcDF.rdd.partitions.size = upperBound - lowerBound
With the parameters:
lowerBound=1, upperBound=1000, numPartitions=10
the generated partition predicates are:
JDBCPartition(id < 101 or id is null,0),
JDBCPartition(id >= 101 AND id < 201,1),
JDBCPartition(id >= 201 AND id < 301,2),
JDBCPartition(id >= 301 AND id < 401,3),
JDBCPartition(id >= 401 AND id < 501,4),
JDBCPartition(id >= 501 AND id < 601,5),
JDBCPartition(id >= 601 AND id < 701,6),
JDBCPartition(id >= 701 AND id < 801,7),
JDBCPartition(id >= 801 AND id < 901,8),
JDBCPartition(id >= 901,9)
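To see where these predicates come from, here is a simplified sketch of the stride logic in Spark's JDBCRelation.columnPartition (the helper name is illustrative, not a Spark API, and the clamping of edge cases such as numPartitions = 1 is omitted):

def partitionPredicates(column: String, lowerBound: Long, upperBound: Long,
                        numPartitions: Int): Seq[String] = {
  //integer division, as in the Spark source
  val stride = upperBound / numPartitions - lowerBound / numPartitions
  (0 until numPartitions).map { i =>
    val lower = lowerBound + i * stride
    val upper = lower + stride
    if (i == 0) s"$column < $upper or $column is null"    //first slice also catches NULLs
    else if (i == numPartitions - 1) s"$column >= $lower" //last slice is open-ended
    else s"$column >= $lower AND $column < $upper"
  }
}

partitionPredicates("id", 1, 1000, 10).foreach(println)
//id < 101 or id is null
//id >= 101 AND id < 201
//...
//id >= 901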
2.3 Custom option-parameter mode
Example:
val jdbcDF = sparkSession.read.format("jdbc")
  .option("url", url)
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "table")
  .option("user", "user")
  .option("password", "password")
  .option("partitionColumn", "id")
  .option("lowerBound", 1)
  .option("upperBound", 10000)
  .option("numPartitions", 10)   //required whenever partitionColumn is set
  .option("fetchsize", 100)
  .option("xxx", "xxx")          //any other supported option
  .load()
As the example shows, option mode is really an open-ended interface: every supported parameter is passed as a key-value option. For the full list, see the official JDBC data source configuration documentation.
Implementation:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object Testsql {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
    Logger.getLogger("org.apache.spark").setLevel(Level.INFO)
    Logger.getLogger("org.spark_project.jetty").setLevel(Level.WARN)
    val spark = SparkSession.builder()
      .appName("sparksql")
      .master("local")
      .getOrCreate()
    //useUnicode=true&characterEncoding=UTF-8 sets the encoding
    //serverTimezone=UTC sets the time zone
    val dataDF = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/tushu?useUnicode=true&characterEncoding=UTF-8&serverTimezone=UTC")
      .option("dbtable", "book")
      .option("user", "java")
      .option("password", "123456")
      .load()
    dataDF.createOrReplaceTempView("tmptable")
    val sql = "select * from tmptable where id >= 3"
    spark.sql(sql).show()
    spark.stop()
  }
}
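One option worth singling out is fetchsize, a documented JDBC option that controls how many rows each round trip to the database returns; a sketch reusing the same connection details:

//fetchsize sets the JDBC fetch size (rows per round trip); tune it for large result sets
val tunedDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/tushu?useUnicode=true&characterEncoding=UTF-8&serverTimezone=UTC")
  .option("dbtable", "book")
  .option("user", "java")
  .option("password", "123456")
  .option("fetchsize", "1000")
  .load()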
3. Writing Data to MySQL with Spark
3.1 spark.write.mode().jdbc(): query first, then write
import java.util.Properties
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{SaveMode, SparkSession}

object Testsql {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
    Logger.getLogger("org.apache.spark").setLevel(Level.INFO)
    Logger.getLogger("org.spark_project.jetty").setLevel(Level.WARN)
    val spark = SparkSession.builder().appName("sparksql").master("local").getOrCreate()
    val prop = new Properties()
    prop.put("user", "java")
    prop.put("password", "123456")
    val url = "jdbc:mysql://localhost:3306/tushu?useUnicode=true&characterEncoding=UTF-8&serverTimezone=UTC"
    //read the rows to copy, then append them to test01
    val dataFrame = spark.read.jdbc(url, "book", prop).where("id >= 5")
    dataFrame.write.mode(SaveMode.Append).jdbc(url, "test01", prop)
    spark.stop()
  }
}
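For larger writes, the JDBC sink also accepts tuning options; a sketch using two documented ones, truncate and batchsize, on the same dataFrame:

//truncate keeps the existing table (and its indexes) on Overwrite instead of
//dropping and recreating it; batchsize sets the rows per JDBC batch insert
dataFrame.write
  .mode(SaveMode.Overwrite)
  .option("truncate", "true")
  .option("batchsize", "1000")
  .jdbc(url, "test01", prop)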
3.2 Building a DataFrame and then writing it
import java.util.Properties
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object Testsql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("test")
      .master("local")
      .getOrCreate()
    val prop = new Properties()
    prop.put("user", "java")
    prop.put("password", "123456")
    val url = "jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=UTF-8&serverTimezone=UTC"
    //split the words, de-duplicate, and pair each word with a generated id
    val rdd = spark.sparkContext.textFile("d://data/word.txt")
    val rdd2 = rdd.flatMap(_.split(",")).distinct().zipWithIndex().map(t => Row(t._2, t._1))
    val schema = StructType(
      List(
        StructField("id", LongType, true),
        StructField("user", StringType, true)
      ))
    val dataFrame = spark.createDataFrame(rdd2, schema)
    dataFrame.write.mode(SaveMode.Overwrite).jdbc(url, "test02", prop)
    spark.stop()
  }
}
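An equivalent sketch that skips the explicit Row/StructType plumbing: encode tuples and name the columns with toDF (this reuses spark, url, and prop from the example above):

import spark.implicits._

//Tuples get encoders automatically; toDF supplies the column names
val df = spark.sparkContext.textFile("d://data/word.txt")
  .flatMap(_.split(","))
  .distinct()
  .zipWithIndex()
  .map { case (word, idx) => (idx, word) }
  .toDF("id", "user")
df.write.mode(SaveMode.Overwrite).jdbc(url, "test02", prop)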