Spark Core Optimization (Interview Notes)

This article covers strategies for improving Spark Core performance: using broadcast variables to reduce data transfer, understanding the caching mechanism and when to apply it, the role of checkpoints, adopting Kryo serialization for efficiency, and tuning foreachPartition usage and parallelism settings to make full use of cluster resources.

Broadcast variables
A broadcast variable avoids shipping a separate copy of the data with every task on every executor;
instead, Spark sends a single read-only copy to each worker, and all tasks on that worker share it.

Official docs:
Using the broadcast functionality available in SparkContext can greatly reduce the size of each serialized task, and the cost of launching a job over a cluster. If your tasks use any large object from the driver program inside of them (e.g. a static lookup table), consider turning it into a broadcast variable. Spark prints the serialized size of each task on the master, so you can look at that to decide whether your tasks are too large; in general tasks larger than about 20 KB are probably worth optimizing.
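
A minimal sketch of the API, assuming a live SparkContext named sc; the lookup table and its contents are hypothetical example data:

```scala
// Hypothetical lookup table that every task needs.
val countryNames = Map("CN" -> "China", "US" -> "United States")

// Ship one copy per executor instead of one copy with every task.
val bc = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("CN", "US", "CN"))
// Tasks read the shared value via bc.value; it is not re-serialized per task.
val named = codes.map(code => bc.value.getOrElse(code, "unknown"))
named.collect().foreach(println)
```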

Caching
If the cached data does not fit in memory, the excess partitions are simply not cached; when an uncached partition is needed later, it is recomputed from its lineage.
In practice, cache only data sets that are reasonably small and reused by multiple actions.
The relevant APIs are cache() and persist().
Caching does not break the lineage.
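
A short sketch, again assuming a SparkContext sc; the input path is a placeholder:

```scala
import org.apache.spark.storage.StorageLevel

val words = sc.textFile("hdfs:///input/words.txt")   // placeholder path
  .flatMap(_.split("\\s+"))

// MEMORY_ONLY: partitions that do not fit are dropped and recomputed on use.
words.cache()   // equivalent to persist(StorageLevel.MEMORY_ONLY)
// Alternatively, spill overflow to disk instead of recomputing:
// words.persist(StorageLevel.MEMORY_AND_DISK)

words.count()   // first action materializes the cache
words.count()   // served from cache; the lineage is still intact
```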

Checkpoint
A checkpoint (which essentially writes the RDD to disk) supplements lineage-based fault tolerance. When the lineage grows too long, recovering a lost partition becomes expensive, so it is cheaper to checkpoint at an intermediate stage: if a node later fails and partitions are lost, the lineage is replayed only from the checkpointed RDD, which reduces the recovery cost. The motivation is primarily the cost of fault tolerance.
Checkpointing truncates the lineage.
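
A minimal sketch, assuming sc and a writable HDFS directory (both paths are placeholders); caching before checkpointing avoids computing the RDD twice, once for the action and once for the checkpoint write:

```scala
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // placeholder directory

val cleaned = sc.textFile("hdfs:///input/logs")  // placeholder path
  .filter(_.nonEmpty)

cleaned.cache()        // avoid recomputation when the checkpoint job runs
cleaned.checkpoint()   // marks the RDD; it is written to disk on the next action

cleaned.count()        // triggers both the job and the checkpoint write
// From here on, cleaned's lineage is truncated at the checkpoint.
```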

Use a lightweight serialization mechanism
Instead of Java's built-in Serializable mechanism, use Kryo serialization.
Since Spark 2.0.0, Spark already uses Kryo internally when shuffling RDDs of simple types, arrays of simple types, or strings.
Official docs:
Java serialization: By default, Spark serializes objects using Java’s ObjectOutputStream framework, and can work with any class you create that implements java.io.Serializable. You can also control the performance of your serialization more closely by extending java.io.Externalizable. Java serialization is flexible but often quite slow, and leads to large serialized formats for many classes.
Kryo serialization: Spark can also use the Kryo library (version 4) to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you’ll use in the program in advance for best performance.
You can switch to using Kryo by initializing your job with a SparkConf and calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). This setting configures the serializer used for not only shuffling data between worker nodes but also when serializing RDDs to disk. The only reason Kryo is not the default is because of the custom registration requirement, but we recommend trying it in any network-intensive application. Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type.
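
A sketch of that configuration in Scala; the Record case class is a hypothetical stand-in for the classes you would register:

```scala
import org.apache.spark.{SparkConf, SparkContext}

case class Record(id: Long, name: String)   // hypothetical user class

val conf = new SparkConf()
  .setAppName("kryo-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write a small id instead of the full class name.
  .registerKryoClasses(Array(classOf[Record]))

val sc = new SparkContext(conf)
```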

High-performance operators
Use foreachPartition instead of foreach when writing to external stores such as MySQL or Redis,
so that a connection is established once per partition rather than once per record.
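
A sketch of the pattern, assuming a pair RDD counts of (word, count) pairs; the JDBC URL, credentials, and table are hypothetical placeholders:

```scala
import java.sql.DriverManager

counts.foreachPartition { iter =>
  // One connection per partition instead of one per record.
  val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "pass")
  val stmt = conn.prepareStatement("INSERT INTO word_count (word, cnt) VALUES (?, ?)")
  try {
    iter.foreach { case (word, cnt) =>
      stmt.setString(1, word)
      stmt.setInt(2, cnt)
      stmt.executeUpdate()
    }
  } finally {
    stmt.close()
    conn.close()
  }
}
```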

Parallelism
The default can be changed through the config property spark.default.parallelism. In general, 2-3 tasks per CPU core in the cluster are recommended.
Official docs:
Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc), and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config property spark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster.

For Spark SQL, the default shuffle parallelism (spark.sql.shuffle.partitions) is 200.
Tune it according to the actual data volume and cluster size.
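
A sketch of the two knobs mentioned above; the partition counts are illustrative, not recommendations:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("parallelism-demo")
  // Default partition count for RDD shuffles (aim for 2-3 tasks per core).
  .set("spark.default.parallelism", "48")
  // Spark SQL shuffle partitions; the default is 200.
  .set("spark.sql.shuffle.partitions", "48")

// Per-operation override: pass the partition count directly to the shuffle, e.g.
// pairs.reduceByKey(_ + _, 48)
```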
