Big Data in Practice, Lesson 17 (Part 2) - Spark-Core05

Chapter 4: Spark Monitoring

Chapter 5: Other Considerations

Chapter 6: Spark Memory Management

Chapter 4: Spark Monitoring

4.2 Determining Memory Consumption

From the official documentation:

  1. The best way to size the amount of memory consumption a dataset will require is to create an RDD, put it into cache, and look at the “Storage” page in the web UI. The page will tell you how much memory the RDD is occupying.
  • The best approach is to create an RDD, cache it, and check the “Storage” page in the web UI, which shows how much memory the RDD occupies.
  2. To estimate the memory consumption of a particular object, use SizeEstimator’s estimate method. This is useful for experimenting with different data layouts to trim memory usage, as well as determining the amount of space a broadcast variable will occupy on each executor heap.
  • The second approach is to use the SizeEstimator class; this is useful for experimenting with different data layouts to trim memory usage, and for determining how much space a broadcast variable will occupy on each executor heap.

Using SizeEstimator in practice:

In spark-shell:

1. import org.apache.spark.util.SizeEstimator

2. SizeEstimator.estimate("file:///home/hadoop/data/page_views.dat")

Note that estimate() sizes the JVM object it is given, so passing the path string above only measures that String itself, not the file's contents; to size the dataset, pass the in-memory data (see the sketch below).
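A minimal spark-shell sketch of both approaches, reusing the page_views.dat path from the steps above (any readable file works); the cached size shows up on the web UI's "Storage" page, and SizeEstimator.estimate is applied to the collected data rather than to the path string, which is only sensible for datasets small enough to collect to the driver:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.util.SizeEstimator

// Approach 1: cache the RDD, run an action, then open the "Storage" page in the web UI,
// which reports how much memory the cached RDD occupies.
val pageViews = sc.textFile("file:///home/hadoop/data/page_views.dat")
pageViews.persist(StorageLevel.MEMORY_ONLY)
pageViews.count()   // materializes the cache

// Approach 2: estimate the heap footprint of the in-memory data with SizeEstimator
// (only reasonable when the data fits on the driver).
val inMemory = pageViews.collect()
println(SizeEstimator.estimate(inMemory) + " bytes")
```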

Chapter 5: Other Considerations

5.1 Level of Parallelism

  • In SparkContext's textFile method you can set the parallelism yourself via the minPartitions parameter: textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] // defaultMinPartitions supplies the default parallelism
  1. Which class defines reduceByKey? PairRDDFunctions.scala
    reduceByKey (like groupByKey and the other shuffle operators) can also take a partition count (numPartitions) argument

  2. It is usually recommended to run 2-3 tasks per CPU core

  • This keeps the CPUs fully utilized; a short sketch of setting these values follows this list.
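A minimal spark-shell sketch of where parallelism can be set; the path, the tab-delimited key and the value 120 are hypothetical examples (roughly 2-3 tasks per core on a cluster of about 40-60 cores):

```scala
// A cluster-wide default for shuffles can also be set at submit time, e.g.
//   spark-submit --conf spark.default.parallelism=120 ...

// 1) per-input parallelism: minPartitions on textFile
val lines = sc.textFile("file:///home/hadoop/data/page_views.dat", minPartitions = 120)

// 2) per-shuffle parallelism: numPartitions on reduceByKey (defined in PairRDDFunctions)
val counts = lines
  .map(line => (line.split("\t")(0), 1))   // assumes a tab-delimited key in the first column
  .reduceByKey(_ + _, numPartitions = 120)

println(counts.getNumPartitions)   // should print 120
```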

5.2 Memory Usage of Reduce Tasks

  1. Sometimes, you will get an OutOfMemoryError not because your RDDs don’t fit in memory, but because the working set of one of your tasks, such as one of the reduce tasks in groupByKey, was too large.

  2. Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large.

  • When Spark performs shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc.), each task builds a hash table to perform the grouping, and that table can often be large.
  3. The simplest fix here is to increase the level of parallelism, so that each task’s input set is smaller. Spark can efficiently support tasks as short as 200 ms, because it reuses one executor JVM across many tasks and it has a low task launching cost, so you can safely increase the level of parallelism to more than the number of cores in your clusters.
  • Explanation: sometimes you hit an OOM error not because the RDD does not fit in memory, but because a single task's working set is too large. The simplest fix is to raise the parallelism: with more tasks, each task handles less data (see the sketch below). Note that this does not cure data skew; even with 10,000 partitions, a heavily skewed key still lands in one task.
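A minimal sketch of this fix, assuming a hypothetical tab-delimited input whose groupByKey tasks run out of memory at the default parallelism; only the partition count changes, and 1000 is an arbitrary illustrative value:

```scala
// Spark-shell sketch; the path and the key extraction are hypothetical.
val pairs = sc.textFile("file:///home/hadoop/data/page_views.dat")
  .map(line => (line.split("\t")(0), line))

// Before: at the default parallelism each groupByKey task may build a very large hash table
// val grouped = pairs.groupByKey()

// After: raise the shuffle parallelism so each task's working set is smaller;
// the level of parallelism can safely exceed the total number of cores in the cluster.
val grouped = pairs.groupByKey(numPartitions = 1000)

println(grouped.getNumPartitions)   // 1000 reduce-side tasks instead of the default
```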

5.3 Data Locality

This is mainly for awareness; in practice locality is often hard to guarantee:

  1. Data locality can have a major impact on the performance of Spark jobs. If data and the code that operates on it are together then computation tends to be fast. But if code and data are separated, one must move to the other.
  • If the data and the code are together, computation is very fast; if they are separated, one of them has to be moved to the other.
  2. There are several levels of locality based on the data’s current location. In order from closest to farthest:
  • PROCESS_LOCAL: the best level; the data is in the same JVM as the running code, for example already cached in the executor's memory. The locality level of each task can be checked on the web UI, e.g. hadoop002:4040.
  • NODE_LOCAL: data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes (e.g. between the HDFS DataNode and the YARN executor).
  • NO_PREF: data is accessed equally quickly from anywhere and has no locality preference.
  • RACK_LOCAL: data is on the same rack of servers. Data is on a different server on the same rack, so it needs to be sent over the network, typically through a single switch.
  • ANY: data is elsewhere on the network and not in the same rack.

Summary:

Spark prefers to schedule every task at the best possible locality level, but this is not always possible. When no idle executor has the data locally, Spark has two options:
a. wait until a busy CPU frees up and launch the task on a node that already has the data;
b. start the task somewhere else immediately and move the data to it.
The levels above form a fallback chain: if one level cannot be satisfied within the wait timeout, Spark falls back to the next (see the configuration sketch below).
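The wait-then-fall-back behaviour is governed by the spark.locality.wait settings (with per-level overrides); a minimal SparkConf sketch, where the 10s/5s values are hypothetical tuning choices rather than recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Application-style sketch; the default wait is 3s per level.
// spark.locality.wait is the shared default; the .process/.node/.rack keys override it.
val conf = new SparkConf()
  .setAppName("locality-wait-demo")
  .set("spark.locality.wait", "10s")          // how long to wait before giving up a locality level
  .set("spark.locality.wait.process", "10s")  // wait for PROCESS_LOCAL
  .set("spark.locality.wait.node", "5s")      // wait for NODE_LOCAL
  .set("spark.locality.wait.rack", "5s")      // wait for RACK_LOCAL
val sc = new SparkContext(conf)
```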

Chapter 6: Memory Tuning

Objects stored in memory are often larger than the raw data they came from.

From the official documentation (a small sketch follows the quote):

  1. There are three considerations in tuning memory usage: the amount of memory used by your objects (you may want your entire dataset to fit in memory), the cost of accessing those objects, and the overhead of garbage collection (if you have high turnover in terms of objects).
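A small spark-shell sketch of the first consideration, the memory used by your objects: it compares the raw character count of a few arbitrary sample strings with their estimated footprint on the JVM heap, using the SizeEstimator class from section 4.2:

```scala
import org.apache.spark.util.SizeEstimator

// JVM objects carry headers, pointers and padding, so the estimated heap size
// is noticeably larger than the number of characters of actual data.
val words = Seq("spark", "memory", "tuning", "guide")
val rawChars  = words.map(_.length).sum          // characters of raw data
val heapBytes = SizeEstimator.estimate(words)    // estimated size of the Seq on the heap
println(s"raw characters: $rawChars, estimated heap size: $heapBytes bytes")
```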

6.1 Spark Memory Management
