Writing to Redis from Spark + Spark Resource Configuration: A Summary

1. It started with an error

19/10/16 11:22:06 ERROR YarnClusterScheduler: Lost executor 28 on **********: Container marked as failed: container_********** on host: **********. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137 Killed by external signal
19/10/16 11:32:59 ERROR YarnClusterScheduler: Lost executor 38 on 100.76.80.197: Container marked as failed: container_********** on host: **********. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137 Killed by external signal
19/10/16 11:40:27 ERROR YarnClusterScheduler: Lost executor 39 on **********: Container marked as failed: container_1567762627991_1638740_01_000343 on host: **********. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137 Killed by external signal
19/10/16 11:49:29 ERROR YarnClusterScheduler: Lost executor 40 on **********: Container marked as failed: container********** on host: **********. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137 Killed by external signal
19/10/16 11:49:29 ERROR TaskSetManager: Task 51 in stage 4.0 failed 4 times; aborting job
19/10/16 11:49:29 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 51 in stage 4.0 failed 4 times, most recent failure: Lost task 51.3 in stage 4.0 (TID 160, **********, executor 40): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container marked as failed: container_1567762627991_1638740_01_000353 on host: 100.76.26.136. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137 Killed by external signal
Driver stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 51 in stage 4.0 failed 4 times, most recent failure: Lost task 51.3 in stage 4.0 (TID 160, **********, executor 40): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container marked as failed: container_********** on host: 100.76.26.136. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137 Killed by external signal

The most likely cause of this kind of problem is an unreasonable resource configuration for the Redis write job: the data volume is too large for what the allocated resources (and Redis) can sustain. Exit code 137 corresponds to SIGKILL (128 + 9), which typically means the container was killed for exceeding its memory limit.

The key Spark resource settings are:

  • driver-memory: memory for the driver process; it must not exceed the total memory of a single machine;
  • executor-memory: memory allocated to each executor; 2–3 GB per core is a reasonable rule of thumb;
  • num-executors: how many executors to create;
  • executor-cores: the number of concurrent threads per executor, i.e. the maximum number of Tasks a single executor can run in parallel.

The error messages show that YARN lost the executors, almost certainly because they were killed, so the first thing to check is whether driver-memory and executor-memory are large enough.
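As an illustration only, a submission exercising these four flags might look like the following (the memory sizes, counts, and jar name are placeholders, not values from the failing job):

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 8g \
  --num-executors 10 \
  --executor-cores 4 \
  your-spark-redis-job.jar
```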

2. Spark-Redis

The examples below use the Scala API of Spark 2.0 with the Jedis client. The Maven dependency is:

<dependency>
  <groupId>redis.clients</groupId>
  <artifactId>jedis</artifactId>
  <version>2.9.0</version>
  <type>jar</type>
</dependency>

The code that writes data to Redis:

sampleData.repartition(500).foreachPartition { rows =>
  // One Jedis connection per partition, never one per row
  val rc = new Jedis(redisHost, redisPort)
  rc.auth(redisPassword)
  try {
    val pipe = rc.pipelined()

    rows.foreach { r =>
      val redisKey = r.getAs[String]("key")
      val redisValue = r.getAs[String]("value")
      pipe.set(redisKey, redisValue)
      pipe.expire(redisKey, expireDays * 3600 * 24) // TTL in seconds
    }

    pipe.sync() // flush the whole pipeline once per partition
  } finally {
    rc.close() // always release the connection, even on failure
  }
}

3. Summary

3.1 Limit the number of Redis clients you open

sampleData is a DataSet in which each row carries two fields: key and value. Because constructing a Jedis client has a non-trivial cost, never write rows to Redis one at a time inside map; use mapPartitions or foreachPartition instead. That way the construction cost scales with the number of partitions rather than with the total row count. Imagine sampleData had 100 million rows: a map-based write would construct 100 million Jedis objects.
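To make the cost difference concrete, here is a small sketch that uses plain Scala collections in place of a Spark DataSet and a counter in place of Jedis construction (the names and numbers are illustrative, not from the original job):

```scala
object ClientCountSketch {
  // Simulates the two write strategies. `clientsBuilt` stands in for
  // the number of `new Jedis(...)` calls each strategy would make.
  def countClientConstructions(rows: Seq[Int],
                               rowsPerPartition: Int,
                               perPartition: Boolean): Int = {
    var clientsBuilt = 0
    if (perPartition) {
      // foreachPartition style: one client per partition
      rows.grouped(rowsPerPartition).foreach { part =>
        clientsBuilt += 1        // new Jedis(...)
        part.foreach { _ => () } // pipe.set(...) per row
      }
    } else {
      // map style: one client per row
      rows.foreach { _ => clientsBuilt += 1 }
    }
    clientsBuilt
  }
}
```

With 100,000 rows in partitions of 10,000, the per-partition strategy builds 10 clients while the per-row strategy builds 100,000.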

3.2 Insert data into Redis in batches

Pipelined batch insertion is recommended; its throughput is vastly better than inserting commands one at a time. But batching has a serious pitfall. In the code above, the entire partition is flushed in a single pipeline. If a single partition happens to be very large (exceeding what the Redis pipeline can absorb, or the timeout), Redis can run out of memory (or time out), taking the service down.
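One way to bound how much a single pipeline buffers, without relying solely on repartitioning, is to sync every N commands rather than once per partition. A minimal sketch of the batching logic (the helper name and batch size are mine; with Jedis, the flush callback would issue pipe.set/pipe.expire for each row in the batch and then call pipe.sync()):

```scala
object BatchedWrite {
  // Splits an iterator into fixed-size batches and flushes each one,
  // so no single pipeline ever buffers more than `batchSize` commands.
  // Returns the number of flushes performed.
  def writeInBatches[T](rows: Iterator[T], batchSize: Int)(flush: Seq[T] => Unit): Int = {
    var flushes = 0
    rows.grouped(batchSize).foreach { batch =>
      flush(batch) // e.g. batch.foreach(r => pipe.set(...)); pipe.sync()
      flushes += 1
    }
    flushes
  }
}
```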

The fix is to repartition the whole DataSet before foreachPartition so that no single partition is too large; 1k–20k rows per partition is a reasonable target. In the code above, sampleData is split into 500 partitions of roughly 10,000 rows each, i.e. about 5 million rows in total. Note, however, that if the total volume is very large, making partitions too small drives the partition count up, which in turn requires more driver memory; otherwise the driver itself can run out of memory.
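The sizing arithmetic above can be captured in a small helper (a sketch; the function name is mine, not from the original code):

```scala
object PartitionSizing {
  // Ceiling division: the number of partitions needed so that each
  // partition holds at most `targetRowsPerPartition` rows.
  def numPartitions(totalRows: Long, targetRowsPerPartition: Long): Int =
    math.max(1L, (totalRows + targetRowsPerPartition - 1) / targetRowsPerPartition).toInt
}
```

For the numbers above, numPartitions(5000000L, 10000L) gives 500, matching the 500 partitions in the example.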

3.3 Throttle write concurrency against the online service

Redis usually serves online traffic. To avoid conflicting with frontend requests while writing, do not use too many executors; otherwise the write QPS becomes too high, hurting online response times or even bringing Redis down. The recommended practice is to first raise the partition count so each partition is small, then gradually increase the write concurrency (the number of executors), observing the Redis write QPS at each level until it settles in an acceptable range.
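As a rough, back-of-the-envelope way to reason about this: the aggregate write rate is bounded by the number of concurrently running tasks times the per-task write rate (the per-task rate is something you measure for your job, not a constant; the helper name is mine):

```scala
object QpsEstimate {
  // Upper bound on Redis write QPS: concurrent tasks (executors * cores)
  // times the measured per-task write rate.
  def estimatedQps(numExecutors: Int, executorCores: Int, perTaskWritesPerSec: Double): Double =
    numExecutors.toDouble * executorCores * perTaskWritesPerSec
}
```

For example, 10 executors with 4 cores each at around 500 writes/s per task would put roughly 20,000 QPS on Redis; if that is too much for the online service, reduce the executor count rather than shrinking partitions further.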
