Submit command line
spark2-submit --master yarn --deploy-mode client --driver-memory 1g --num-executors 2 --executor-cores 2 --executor-memory 2g --class com.songzhixiao.sellcourse.conrtroller.DwdSellCourseController --queue spark /opt/module/jars/com_songzhixiao_warehouse-1.0-SNAPSHOT-jar-with-dependencies.jar
1. Caching and cache-release methods
- DataFrame
dataFrame.cache()
dataFrame.unpersist()
- RDD
RDD.cache()
RDD.persist()
RDD.unpersist()
- Sql
sparkSession.catalog.cacheTable("tableName")
sparkSession.catalog.uncacheTable("tableName")
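The three caching APIs above can be combined into a minimal sketch; the DataFrame contents and the table name `tableName` are placeholders for illustration, and running it requires a Spark runtime:

```scala
import org.apache.spark.sql.SparkSession

// Local session just for illustration; the real job above runs on YARN.
val spark = SparkSession.builder().master("local[*]").appName("cache-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

// DataFrame: cache, trigger an action to materialize it, then release
df.cache()
df.count()
df.unpersist()

// RDD: same pattern
val rdd = spark.sparkContext.parallelize(1 to 100)
rdd.cache()
rdd.count()
rdd.unpersist()

// SQL: cache and release by table name through the catalog
df.createOrReplaceTempView("tableName")
spark.catalog.cacheTable("tableName")
spark.sql("SELECT * FROM tableName").count()
spark.catalog.uncacheTable("tableName")
```

Note that `cache()` is lazy: the data is only materialized when an action (here `count()`) runs against it.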
2. Adjusting parallelism
Spark SQL
The default shuffle parallelism is 200; adjust it via the spark.sql.shuffle.partitions parameter
RDD, DataFrame
coalesce() and repartition()
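A short sketch of both knobs; `spark`, `df`, and the partition counts are placeholders, not values from the source:

```scala
// Spark SQL: lower the shuffle parallelism from the default 200
spark.conf.set("spark.sql.shuffle.partitions", "50")

// RDD / DataFrame:
// repartition() performs a full shuffle and can increase or decrease
// the number of partitions; coalesce() avoids a shuffle and is therefore
// cheaper, but can only decrease the partition count.
val widened  = df.repartition(50)
val narrowed = df.coalesce(10)
```

Prefer `coalesce()` when shrinking partitions after a filter, since it skips the shuffle.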
3. Use Kryo serialization
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.registerKryoClasses(Array(classOf[QueryResult]))
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
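A fuller sketch of the configuration above; `QueryResult` is assumed to be a case class defined in this project:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

// Assumed shape of the project's result class, for illustration only
case class QueryResult(id: Long, score: Double)

val sparkConf = new SparkConf()
  .setAppName("kryo-demo")
  // Switch from the default Java serializer to Kryo
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write a small class ID instead of
  // the full class name with each record
  .registerKryoClasses(Array(classOf[QueryResult]))
```

`StorageLevel.MEMORY_ONLY_SER` then stores the cached partitions as serialized bytes, trading extra CPU on access for a smaller memory footprint.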
4. Broadcast join
- Default small-table threshold for broadcast join: 10 MB
spark.sql.autoBroadcastJoinThreshold
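Two ways to control it, sketched below; `spark`, `largeDf`, `smallDf`, and the 50 MB value are placeholder assumptions:

```scala
import org.apache.spark.sql.functions.broadcast

// Raise the auto-broadcast threshold from the 10 MB default to 50 MB;
// tables below this size are broadcast to every executor, avoiding a shuffle
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

// Or force a broadcast explicitly with a hint, regardless of table size
val joined = largeDf.join(broadcast(smallDf), Seq("id"))
```

Setting the threshold to -1 disables automatic broadcast joins entirely.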