一 Spark SQL Architecture
Spark SQL is one of Spark's core components (Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX). It can:
- access existing Hive data directly
- expose a JDBC/ODBC interface so third-party tools can process data through Spark
- provide a higher-level interface that makes data processing convenient
- support multiple ways of working: SQL and programmatic APIs
- support multiple external data sources: Parquet, JSON, RDBMS, and more
二 How It Works: the Catalyst Optimizer
1、Execution flow
The Catalyst optimizer is the core of Spark SQL; it turns the logical plan into a physical plan.
2、Logical plan
3、Optimization
A filter sitting above a projection is examined to check whether it can be pushed down (predicate pushdown).
4、Physical plan
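A quick way to see what Catalyst produces is to ask Spark for its plans. A minimal sketch (the JSON path and column names are made up for illustration): explain(true) prints the parsed and analyzed logical plans, the optimized logical plan, and the final physical plan, so a pushed-down filter becomes visible.
// Assumes an active SparkSession `spark`; path and columns are hypothetical.
val df = spark.read.json("in/people.json")
// A filter written on top of a projection; Catalyst may push the filter further down.
val query = df.select("name", "age").filter("age > 21")
// Prints parsed/analyzed/optimized logical plans and the physical plan.
query.explain(true)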
三 Spark SQL API
1、SparkSession
- SparkContext / SQLContext
- HiveContext
- SparkSession (recommended since Spark 2.x)
SparkSession merges SQLContext and HiveContext. It provides a single entry point for interacting with Spark functionality and lets you program Spark with the DataFrame and Dataset APIs.
val conf: SparkConf = new SparkConf().setAppName("spark").setMaster("local[*]")

// build a SparkSession by configuring the builder directly
val spark = SparkSession.builder
  .master("master")
  .appName("appName")
  .getOrCreate()

// or build it from an existing SparkConf
val spark = SparkSession.builder.config(conf).getOrCreate()
2、Dataset
scala> spark.createDataset(1 to 3).show
scala> spark.createDataset(List(("a", 1), ("b", 2), ("c", 3))).show
scala> spark.createDataset(sc.parallelize(List(("a", 1, 1), ("b", 2, 2)))).show
The argument to createDataset() can be a Seq, an Array, or an RDD. The three lines above produce Dataset[Int], Dataset[(String, Int)], and Dataset[(String, Int, Int)] respectively. Dataset = RDD + Schema, so a Dataset shares most of the RDD operators, such as map and filter.
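As a small illustration of those shared operators (a sketch for the spark-shell, where spark.implicits._ is already in scope):
val ds = spark.createDataset(1 to 10)        // Dataset[Int]
ds.filter(_ % 2 == 0).map(_ * 10).show()     // keep the even numbers, multiply each by 10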
3、Creating a Dataset from a case class
In Scala, adding the case keyword in front of class turns the class into a case class. A case class differs from an ordinary class in that:
(1) objects can be created without new;
(2) it implements the serialization interface by default;
(3) toString(), equals(), and hashCode() are overridden automatically.
case class Point(label: String, x: Double, y: Double)
case class Category(id: Long, name: String)
val points = Seq(Point("bar", 3.0, 5.6), Point("foo", -1.0, 3.0)).toDS
val categories = Seq(Category(1, "foo"), Category(2, "bar")).toDS
points.join(categories, points("label") === categories("name")).show
4、RDD->Dataset
case class Point(label: String, x: Double, y: Double)
case class Category(id: Long, name: String)
val pointsRDD = sc.parallelize(List(("bar", 3.0, 5.6), ("foo", -1.0, 3.0)))
val categoriesRDD = sc.parallelize(List((1, "foo"), (2, "bar")))
val points = pointsRDD.map(line => Point(line._1, line._2, line._3)).toDS
val categories = categoriesRDD.map(line => Category(line._1, line._2)).toDS
points.join(categories, points("label") === categories("name")).show
5、DataFrame
What is a DataFrame
DataFrame = Dataset[Row]; in Spark 2.x, DataFrame is simply a type alias for Dataset[Row]. It resembles a two-dimensional table in a traditional database: an RDD with a Schema (information about the data's structure) added on top. A DataFrame Schema supports nested data types, and the API offers more SQL-like operations.
Common DataFrame API operations
val df = spark.read.json("file:///software/wordcount/users.json")
df.show
Use printSchema to print the DataFrame's schema:
df.printSchema()
df.select("name").show()
Use select to pick the fields we need, adding 1 to the age field:
df.select(df("name"), df("age") + 1).show()
df.select(df.col("name"), df.col("age") + 1).show()
df.filter(df("age") > 21).show()
df.groupBy("age").count().show()
// registerTempTable is deprecated in Spark 2.x; createOrReplaceTempView is preferred
df.registerTempTable("people")
spark.sql("SELECT * FROM people").show
RDD -> DataFrame
Infer the schema inside the RDD via reflection. Any conversion from other types to a DataFrame requires the implicit conversions: import spark.implicits._
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}

case class People(name: String, age: Int)

object rddToDF {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("sparksql").master("local[*]").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    import spark.implicits._
    val df: DataFrame = sc.textFile("in/people.txt").map(x => x.split(",")).map(x => People(x(0), x(1).toInt)).toDF()
    df.printSchema()
    df.show()
  }
}
Alternatively, the schema can be specified programmatically:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
case class Person(name: String, age: Int)
val people: RDD[String] = sc.textFile("file:///home/hadoop/data/people.txt")
val schemaString = "name age"
val schema: StructType = StructType(schemaString.split(" ").map(fieldName => {
  if (fieldName.equals("name"))
    StructField(fieldName, StringType, true)
  else
    StructField(fieldName, IntegerType, true)
}))
val rowRDD: RDD[Row] = people.map(_.split(",")).map(p => Row(p(0), p(1).toInt))
val peopleDataFrame: DataFrame = spark.createDataFrame(rowRDD, schema)
peopleDataFrame.registerTempTable("people")
val results = spark.sql("SELECT name FROM people")
results.show
Seq/List -> DataFrame
case class Student(id: Int, name: String, sex: String, age: Int)
val stuDF: DataFrame = Seq(
  Student(1001, "zhangsan", "F", 20),
  Student(1002, "lisi", "M", 16),
  Student(1003, "wangwu", "M", 21),
  Student(1004, "zhaoliu", "F", 21),
  Student(1005, "zhouqi", "M", 22),
  Student(1006, "qianba", "M", 22),
  Student(1007, "liuliu", "F", 23)
).toDF()
val df: DataFrame = List((1, 20), (3, 40)).toDF("id", "age")
df.show()
DataFrame -> RDD
val rdd: RDD[Row] = peopleDataFrame.rdd
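The elements of the resulting RDD are Row objects, so individual fields can be read back by position or by name (a sketch reusing peopleDataFrame from the example above):
val names = rdd.map(row => row.getString(0))         // field by position
val ages = rdd.map(row => row.getAs[Int]("age"))     // field by name
names.collect().foreach(println)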
DataFrame -> Dataset
// requires import spark.implicits._ for the tuple Encoder; count(cn) is aliased so it can be read by name
val frame: DataFrame = sqlContext.sql("select name, count(cn) as cn from tbwordcount group by name")
val DS1: Dataset[(String, Long)] = frame.map(row => {
  val name: String = row.getAs[String]("name")
  val cn: Long = row.getAs[Long]("cn")
  (name, cn)
})
val DS2: Dataset[(String, Long)] = frame.as[(String, Long)]
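The target type can also be a case class whose field names match the columns. A sketch (the case class is made up to fit the query above; import spark.implicits._ must be in scope for the Encoder):
case class WordCount(name: String, cn: Long)
// columns name and cn match the case class fields, so the conversion is direct
val DS3: Dataset[WordCount] = frame.as[WordCount]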
四 Spark SQL and External Data Sources
External data sources supported by Spark SQL
Parquet files
A popular columnar storage format. Files are stored in binary and contain both the data and its metadata.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._

object ParDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("ParquetDemo").getOrCreate()
    import spark.implicits._
    val sc: SparkContext = spark.sparkContext
    val list = List(
      ("zhangsan", "red", Array(3, 4, 5)),
      ("lisi", "blue", Array(7, 8, 9)),
      ("wangwu", "black", Array(12, 15, 19)),
      ("zhaoliu", "orange", Array(7, 9, 6))
    )
    val rdd1: RDD[(String, String, Array[Int])] = sc.parallelize(list)
    val schema = StructType(
      Array(
        StructField("name", StringType),
        StructField("color", StringType),
        StructField("numbers", ArrayType(IntegerType))
      )
    )
    val rowRDD: RDD[Row] = rdd1.map(x => Row(x._1, x._2, x._3))
    val df: DataFrame = spark.createDataFrame(rowRDD, schema)
    df.show()
    df.write.parquet("out/color")
    val frame: DataFrame = spark.read.parquet("out/color")
    frame.printSchema()
    frame.show()
  }
}
Reading and writing Hive tables from Spark
spark-shell on the Linux VM
- Link (or copy) Hive's hive-site.xml into Spark's conf directory: ln -s /opt/hive/conf/hive-site.xml /opt/spark/conf/hive-site.xml
- Copy the MySQL driver into Spark's jars directory: cp /opt/hive/lib/mysql-connector-java-5.1.38.jar /opt/spark/jars/
- Start the Hive metastore service: nohup hive --service metastore &
- Start spark-shell
- Then just write SQL inside spark.sql("..."), e.g. scala> spark.sql("select * from stu").show()
- Likewise, Hive databases cannot be created from spark-shell this way.
Development in IDEA
On the Linux VM, run the command below; once jps shows a RunJar process, the metastore is up:
nohup hive --service metastore &
Creating a Hive database from IDEA runs into a permission problem: -chgrp: 'LAPTOP-F4OELHQ8\86187' does not match expected pattern for group Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... The table itself is created, but the data ends up under the IDEA project directory instead of being uploaded to HDFS, and the created database is just a folder without the .db suffix. No fix has been found yet; the suggestion is to create the database on the VM beforehand and then simply use it.
Add the dependencies in IDEA:
<!-- spark-sql -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
<!-- spark-hive -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
<!-- mysql-connector-java -->
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.38</version>
</dependency>
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object SparksqlOnHiveDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("sparkHive")
      .master("local[*]")
      .config("hive.metastore.uris", "thrift://192.168.198.201:9083")
      .enableHiveSupport().getOrCreate()
    spark.sql("show databases").collect().foreach(println)
    val df: DataFrame = spark.sql("select * from toronto")
    df.printSchema()
    df.show()
    val df2: Dataset[Row] = df.where(df("ssn").startsWith("158"))
    val df3: Dataset[Row] = df.filter(df("ssn").startsWith("158"))
  }
}
Working with MySQL tables
On the VM, copy mysql-connector-java-5.1.38.jar into the jars directory of the Spark installation. In IDEA, add the following dependency:
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.38</version>
</dependency>
Reading from MySQL
spark.read.format("jdbc")
  .option("url", "jdbc:mysql://192.168.198.201:3306/hive")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "root")
  .option("password", "ok")
  .option("dbtable", "TBLS")
  .load().show

spark.read.format("jdbc")
  .options(Map("url" -> "jdbc:mysql://192.168.198.201:3306/hive?user=root&password=ok",
    "dbtable" -> "TBLS", "driver" -> "com.mysql.jdbc.Driver")).load().show
import org.apache.spark.sql.{DataFrame, SparkSession}

object SparksqlOnMysqlDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[*]")
      .appName("sparksqlOnmysql")
      .getOrCreate()
    val url = "jdbc:mysql://192.168.198.201:3306/hive"
    val user = "root"
    val pwd = "ok"
    val driver = "com.mysql.jdbc.Driver"
    val prop = new java.util.Properties()
    prop.setProperty("user", user)
    prop.setProperty("password", pwd)
    prop.setProperty("driver", driver)
    val df: DataFrame = spark.read.jdbc(url, "TBLS", prop)
    df.show()
    df.where(df("CREATE_TIME").startsWith("159")).show()
    val frame: DataFrame = df.groupBy(df("DB_ID")).count()
    frame.printSchema()
    frame.orderBy(frame("count").desc).show()
  }
}
Writing to MySQL
import java.util.Properties

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object SparkSQL03_Datasource {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkSQL01_Demo")
    val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
    import spark.implicits._
    val rdd: RDD[(String, Int)] = spark.sparkContext.parallelize(List(("zs", 21), ("ls", 23), ("ww", 26)))
    val df: DataFrame = rdd.toDF("name", "age")
    // Option 1: write through the jdbc data source format
    df.write
      .format("jdbc")
      .option("url", "jdbc:mysql://192.168.198.201:3306/test")
      .option("user", "root")
      .option("password", "ok")
      .option("dbtable", "users")
      .mode(SaveMode.Append)
      .save()
    // Option 2: write through the jdbc() convenience method
    val props: Properties = new Properties()
    props.setProperty("user", "root")
    props.setProperty("password", "ok")
    df.write.mode(SaveMode.Append).jdbc("jdbc:mysql://192.168.198.201:3306/test", "users", props)
  }
}
五 Spark SQL Functions
Built-in functions
package sparkSQL

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object InnerFunctionDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("innerfunction").master("local[*]").getOrCreate()
    import spark.implicits._
    val sc: SparkContext = spark.sparkContext
    val accessLog = Array(
      "2016-12-27,001",
      "2016-12-27,001",
      "2016-12-27,002",
      "2016-12-28,003",
      "2016-12-28,004",
      "2016-12-28,002",
      "2016-12-28,002",
      "2016-12-28,001"
    )
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
    val rdd1: RDD[Row] = sc.parallelize(accessLog).map(x => x.split(",")).map(x => Row(x(0), x(1).toInt))
    val structType = StructType(Array(
      StructField("day", StringType, true),
      StructField("user_id", IntegerType, true)
    ))
    val frame: DataFrame = spark.createDataFrame(rdd1, structType)
    frame.groupBy("day").agg(countDistinct("user_id").as("pv")).select("day", "pv")
      .collect().foreach(println)
  }
}
case class
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}

object caseClass {
  case class Student(id: Int, name: String, sex: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("case").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    import spark.implicits._
    val stuDF: DataFrame = Seq(
      Student(1001, "zhangsan", "F", 20),
      Student(1002, "lisi", "M", 16),
      Student(1003, "wangwu", "M", 21),
      Student(1004, "zhaoliu", "F", 21),
      Student(1005, "zhouqi", "M", 22),
      Student(1006, "qianba", "M", 22),
      Student(1007, "liuliu", "F", 23)
    ).toDF()
    import org.apache.spark.sql.functions._
    stuDF.groupBy(stuDF("sex")).agg(count(stuDF("age")).as("num")).show()
    stuDF.groupBy(stuDF("sex")).agg(max(stuDF("age")).as("max")).show()
    stuDF.groupBy(stuDF("sex")).agg(min(stuDF("age")).as("min")).show()
    stuDF.groupBy(stuDF("sex")).agg("age" -> "max", "age" -> "min", "age" -> "avg", "id" -> "count").show()
    stuDF.groupBy("sex", "age").count().show()
  }
}
六 Spark UDF&UDAF&UDTF
UDF
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object SparkUDFDemo {
  case class Hobbies(name: String, hobbies: String)

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("innerfunction").master("local[*]").getOrCreate()
    import spark.implicits._
    val sc: SparkContext = spark.sparkContext
    val rdd: RDD[String] = sc.textFile("in/hobbies.txt")
    val df: DataFrame = rdd.map(x => x.split(" ")).map(x => Hobbies(x(0), x(1))).toDF()
    df.registerTempTable("hobbies")
    spark.udf.register("hobby_num", (v: String) => v.split(",").size)
    val frame: DataFrame = spark.sql("select name, hobbies, hobby_num(hobbies) as hobnum from hobbies")
    frame.show()
  }
}
UDAF
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

object SparkUDAFDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("udaf").master("local[*]").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    val df: DataFrame = spark.read.json("in/user.json")
    println("Input data:")
    df.show()
    val myUdaf = new MyAgeAvgFunction
    spark.udf.register("myAvgAge", myUdaf)
    df.createTempView("userinfo")
    val resultDF: DataFrame = spark.sql("select sex, myAvgAge(age) from userinfo group by sex")
    println("Result of the UDAF:")
    resultDF.show()
  }
}
class MyAgeAvgFunction extends UserDefinedAggregateFunction {
  // input: a single LongType column (the age)
  override def inputSchema: StructType = {
    new StructType().add("age", LongType)
  }

  // aggregation buffer: running sum of ages and running count
  override def bufferSchema: StructType = {
    new StructType().add("sum", LongType).add("count", LongType)
  }

  // output type of the aggregation (the average age)
  override def dataType: DataType = DoubleType

  // the same input always produces the same output
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }

  // fold one input row into the buffer
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getLong(0) + input.getLong(0)
    buffer(1) = buffer.getLong(1) + 1
  }

  // merge two partial buffers (e.g. from different partitions)
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }

  // final result: sum / count
  override def evaluate(buffer: Row): Any = {
    buffer.getLong(0).toDouble / buffer.getLong(1)
  }
}
UDTF
import java.util

import org.apache.hadoop.hive.ql.exec.UDFArgumentException
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory, PrimitiveObjectInspector, StructObjectInspector}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object SparkUDTFDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("udtf").master("local[*]")
      .enableHiveSupport()
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    import spark.implicits._
    val lines: RDD[String] = sc.textFile("in/udtf.txt")
    val stuDF: DataFrame = lines.map(_.split("//")).filter(x => x(1).equals("ls"))
      .map(x => (x(0), x(1), x(2))).toDF("id", "name", "class")
    stuDF.createOrReplaceTempView("student")
    spark.sql("CREATE TEMPORARY FUNCTION MyUDTF AS 'sparkSQL.myUDTF'")
    spark.sql("select MyUDTF(class) from student").show()
  }
}
class myUDTF extends GenericUDTF {
  override def initialize(argOIs: Array[ObjectInspector]): StructObjectInspector = {
    if (argOIs.length != 1) {
      throw new UDFArgumentException("exactly one argument is expected")
    }
    if (argOIs(0).getCategory != ObjectInspector.Category.PRIMITIVE) {
      throw new UDFArgumentException("argument type does not match")
    }
    // the UDTF outputs a single string column named "type"
    val fieldNames = new util.ArrayList[String]
    val fieldOIs = new util.ArrayList[ObjectInspector]()
    fieldNames.add("type")
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector)
    ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs)
  }

  // split the input string on spaces and emit one output row per token
  override def process(objects: Array[AnyRef]): Unit = {
    val strings: Array[String] = objects(0).toString.split(" ")
    println(strings.mkString(","))
    for (elem <- strings) {
      val tmp = new Array[String](1)
      tmp(0) = elem
      forward(tmp)
    }
  }

  override def close(): Unit = {}
}
七 Spark SQL CLI
The Spark SQL CLI is a convenient tool that runs the Hive metastore service in local mode and executes queries entered on the command line. Note that the Spark SQL CLI cannot talk to the Thrift JDBC server. It is Spark's counterpart of the Hive CLI (old CLI) and Beeline (new CLI). Copy hive-site.xml, hdfs-site.xml, and core-site.xml into $SPARK_HOME/conf, then start the Spark SQL CLI from the Spark directory with ./bin/spark-sql
$ spark-sql
spark-sql> show databases;
default
spark-sql> show tables;
default toronto false
spark-sql> select * from toronto where ssn like '111%';
John S. 111-222-333 123 Yonge Street
spark-sql> create table montreal(full_name string, ssn string, office_address string);
spark-sql> insert into montreal values('Winnie K.', '111-222-333', '62 John Street');
spark-sql> select t.full_name, m.ssn, t.office_address, m.office_address from toronto t inner join montreal m on t.ssn = m.ssn;
John S. 111-222-333 123 Yonge Street 62 John Street
八 Spark Performance Tuning
Serialization
Java serialization is Spark's default. Kryo serialization is roughly 10x faster than Java serialization, but it does not support every serializable type.
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
If the classes to be serialized are not registered, Kryo still works, but it has to store the full class name with every object, which usually wastes more space than the default Java serialization.
Optimization tips
- Use arrays of objects and primitive types instead of Java/Scala collection classes (e.g. HashMap)
- Avoid nested structures
- Prefer numeric keys over string keys
- Persist larger RDDs with MEMORY_ONLY_SER
- When loading CSV or JSON, load only the fields you need (see the sketch after this list)
- Persist intermediate results (RDD/DS/DF) only when they are actually needed
- Avoid producing unnecessary intermediate results (RDD/DS/DF)
- A DF executes roughly 3x faster than a DS (simpler structure, only Row objects)
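As a small sketch of two of these points (the file path and column names are hypothetical): load only the needed fields from a JSON source and persist a reused intermediate result in serialized form.
import org.apache.spark.storage.StorageLevel

// keep only the columns that are actually used downstream
val users = spark.read.json("in/users.json").select("name", "age")
// serialized in-memory storage for a larger dataset that is reused several times
users.persist(StorageLevel.MEMORY_ONLY_SER)
users.count()   // the first action materializes the cache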
Partition tuning
- Custom RDD partitioning and spark.default.parallelism
- Broadcast large variables instead of referencing them directly (see the sketch below)
- Try to process data locally and minimize data transfer across worker nodes
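A minimal sketch of the broadcast point (the lookup map and the RDD contents are made up): broadcast the read-only variable once per executor instead of shipping it inside every task closure.
// a read-only lookup table that every task needs
val countryByCode = Map("CN" -> "China", "CA" -> "Canada", "US" -> "United States")
val bcCountries = sc.broadcast(countryByCode)

val codes = sc.parallelize(Seq("CN", "US", "CA", "CN"))
// tasks read the broadcast value rather than capturing countryByCode directly
codes.map(code => bcCountries.value.getOrElse(code, "unknown")).collect().foreach(println)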
join operations
Put the small table on the left side of the join: it is cached in memory, and each row of the large table on the right is matched against it, which is faster. Another rule of thumb is to put the table with fewer duplicate join keys on the left, because every duplicated key in the left-hand table adds an extra round of processing underneath. Also note that a join does not stop after the first match: if a join-key value is duplicated in one table, multiple matching rows are produced.
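Besides ordering the tables, the same intent can be stated explicitly with Spark SQL's broadcast hint, which copies the small table to every executor so the large table is not shuffled (a sketch; bigDF and smallDF are hypothetical DataFrames sharing an id column):
import org.apache.spark.sql.functions.broadcast

// hint that smallDF fits in memory; Spark then plans a broadcast hash join
val joined = bigDF.join(broadcast(smallDF), Seq("id"))
joined.explain()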