Big Data in Practice, Lesson 17 (Part 2) - Spark-Core05

Chapter 4: Spark Monitoring

Chapter 5: Other Considerations

Chapter 6: Spark Memory Management

Chapter 4: Spark Monitoring

4.2 Determining Memory Consumption

From the official documentation:

  1. The best way to size the amount of memory consumption a dataset will require is to create an RDD, put it into cache, and look at the “Storage” page in the web UI. The page will tell you how much memory the RDD is occupying.
  • The best approach is to create an RDD, cache it, and check the “Storage” page in the web UI, which shows how much memory the RDD occupies.
  2. To estimate the memory consumption of a particular object, use SizeEstimator’s estimate method. This is useful for experimenting with different data layouts to trim memory usage, as well as determining the amount of space a broadcast variable will occupy on each executor heap.
  • The second approach is to use the SizeEstimator class; this is useful for experimenting with different data layouts to trim memory usage, and for determining how much space a broadcast variable will occupy on each executor heap.

Using SizeEstimator in practice:

In spark-shell:

1. import org.apache.spark.util.SizeEstimator

2. SizeEstimator.estimate("file:///home/hadoop/data/page_views.dat")

Note that estimate() sizes the JVM object it is given, so passing the path string above only measures that String itself, not the file's contents; to size the dataset, pass the in-memory data (see the sketch below).
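A minimal spark-shell sketch of both approaches, reusing the page_views.dat path from the steps above (any readable file works); the cached size shows up on the web UI's "Storage" page, and SizeEstimator.estimate is applied to the collected data rather than to the path string, which is only sensible for datasets small enough to collect to the driver:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.util.SizeEstimator

// Approach 1: cache the RDD, run an action, then open the "Storage" page in the web UI,
// which reports how much memory the cached RDD occupies.
val pageViews = sc.textFile("file:///home/hadoop/data/page_views.dat")
pageViews.persist(StorageLevel.MEMORY_ONLY)
pageViews.count()   // materializes the cache

// Approach 2: estimate the heap footprint of the in-memory data with SizeEstimator
// (only reasonable when the data fits on the driver).
val inMemory = pageViews.collect()
println(SizeEstimator.estimate(inMemory) + " bytes")
```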

Chapter 5: Other Considerations

5.1 Level of Parallelism

  • In SparkContext's textFile method you can set the parallelism yourself via the minPartitions parameter: textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] // defaultMinPartitions supplies the default parallelism
  1. Which class defines reduceByKey? PairRDDFunctions.scala
    reduceByKey (like groupByKey and the other shuffle operators) can also take a partition count (numPartitions) argument

  2. It is usually recommended to run 2-3 tasks per CPU core

  • This keeps the CPUs fully utilized; a short sketch of setting these values follows this list.
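A minimal spark-shell sketch of where parallelism can be set; the path, the tab-delimited key and the value 120 are hypothetical examples (roughly 2-3 tasks per core on a cluster of about 40-60 cores):

```scala
// A cluster-wide default for shuffles can also be set at submit time, e.g.
//   spark-submit --conf spark.default.parallelism=120 ...

// 1) per-input parallelism: minPartitions on textFile
val lines = sc.textFile("file:///home/hadoop/data/page_views.dat", minPartitions = 120)

// 2) per-shuffle parallelism: numPartitions on reduceByKey (defined in PairRDDFunctions)
val counts = lines
  .map(line => (line.split("\t")(0), 1))   // assumes a tab-delimited key in the first column
  .reduceByKey(_ + _, numPartitions = 120)

println(counts.getNumPartitions)   // should print 120
```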

5.2 Memory Usage of Reduce Tasks

  1. Sometimes, you will get an OutOfMemoryError not because your RDDs don’t fit in memory, but because the working set of one of your tasks, such as one of the reduce tasks in groupByKey, was too large.

  2. Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large.

  • When Spark performs shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc.), each task builds a hash table to perform the grouping, and that table can often be large.
  3. The simplest fix here is to increase the level of parallelism, so that each task’s input set is smaller. Spark can efficiently support tasks as short as 200 ms, because it reuses one executor JVM across many tasks and it has a low task launching cost, so you can safely increase the level of parallelism to more than the number of cores in your clusters.
  • Explanation: sometimes you hit an OOM error not because the RDD does not fit in memory, but because a single task's working set is too large. The simplest fix is to raise the parallelism: with more tasks, each task handles less data (see the sketch below). Note that this does not cure data skew; even with 10,000 partitions, a heavily skewed key still lands in one task.
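A minimal sketch of this fix, assuming a hypothetical tab-delimited input whose groupByKey tasks run out of memory at the default parallelism; only the partition count changes, and 1000 is an arbitrary illustrative value:

```scala
// Spark-shell sketch; the path and the key extraction are hypothetical.
val pairs = sc.textFile("file:///home/hadoop/data/page_views.dat")
  .map(line => (line.split("\t")(0), line))

// Before: at the default parallelism each groupByKey task may build a very large hash table
// val grouped = pairs.groupByKey()

// After: raise the shuffle parallelism so each task's working set is smaller;
// the level of parallelism can safely exceed the total number of cores in the cluster.
val grouped = pairs.groupByKey(numPartitions = 1000)

println(grouped.getNumPartitions)   // 1000 reduce-side tasks instead of the default
```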

5.3 Data Locality

This is mainly for awareness; in practice locality is often hard to guarantee:

  1. Data locality can have a major impact on the performance of Spark jobs. If data and the code that operates on it are together then computation tends to be fast. But if code and data are separated, one must move to the other.
  • If the data and the code are together, computation is very fast; if they are separated, one of them has to be moved to the other.
  2. There are several levels of locality based on the data’s current location. In order from closest to farthest:
  • PROCESS_LOCAL: the best level; the data is in the same JVM as the running code, for example already cached in the executor's memory. The locality level of each task can be checked on the web UI, e.g. hadoop002:4040.
  • NODE_LOCAL: data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes (e.g. between the HDFS DataNode and the YARN executor).
  • NO_PREF: data is accessed equally quickly from anywhere and has no locality preference.
  • RACK_LOCAL: data is on the same rack of servers. Data is on a different server on the same rack, so it needs to be sent over the network, typically through a single switch.
  • ANY: data is elsewhere on the network and not in the same rack.

Summary:

Spark prefers to schedule every task at the best possible locality level, but this is not always possible. When no idle executor has the data locally, Spark has two options:
a. wait until a busy CPU frees up and launch the task on a node that already has the data;
b. start the task somewhere else immediately and move the data to it.
The levels above form a fallback chain: if one level cannot be satisfied within the wait timeout, Spark falls back to the next (see the configuration sketch below).
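The wait-then-fall-back behaviour is governed by the spark.locality.wait settings (with per-level overrides); a minimal SparkConf sketch, where the 10s/5s values are hypothetical tuning choices rather than recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Application-style sketch; the default wait is 3s per level.
// spark.locality.wait is the shared default; the .process/.node/.rack keys override it.
val conf = new SparkConf()
  .setAppName("locality-wait-demo")
  .set("spark.locality.wait", "10s")          // how long to wait before giving up a locality level
  .set("spark.locality.wait.process", "10s")  // wait for PROCESS_LOCAL
  .set("spark.locality.wait.node", "5s")      // wait for NODE_LOCAL
  .set("spark.locality.wait.rack", "5s")      // wait for RACK_LOCAL
val sc = new SparkContext(conf)
```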

Chapter 6: Memory Tuning

Objects stored in memory are often larger than the raw data they came from.

From the official documentation (a small sketch follows the quote):

  1. There are three considerations in tuning memory usage: the amount of memory used by your objects (you may want your entire dataset to fit in memory), the cost of accessing those objects, and the overhead of garbage collection (if you have high turnover in terms of objects).
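A small spark-shell sketch of the first consideration, the memory used by your objects: it compares the raw character count of a few arbitrary sample strings with their estimated footprint on the JVM heap, using the SizeEstimator class from section 4.2:

```scala
import org.apache.spark.util.SizeEstimator

// JVM objects carry headers, pointers and padding, so the estimated heap size
// is noticeably larger than the number of characters of actual data.
val words = Seq("spark", "memory", "tuning", "guide")
val rawChars  = words.map(_.length).sum          // characters of raw data
val heapBytes = SizeEstimator.estimate(words)    // estimated size of the Seq on the heap
println(s"raw characters: $rawChars, estimated heap size: $heapBytes bytes")
```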

6.1 Spark Memory Management
