[Spark][Translation] Spark Architecture: An Analysis of the Shuffle Process

Spark Architecture: Shuffle

This is my second article about Apache Spark architecture, and today I will be more specific and tell you about the shuffle, one of the most interesting topics in the overall Spark design. The previous part was mostly about general Spark architecture and its memory management; it can be accessed at https://0x0fff.com/spark-architecture/. The next one is about Spark memory management and is available at https://0x0fff.com/spark-memory-management/. The original of this article is at https://0x0fff.com/spark-architecture-shuffle/.

[Figure: splitting of the job into stages at shuffle boundaries]

What is the shuffle in general? Imagine that you have a list of phone call detail records in a table and you want to calculate the number of calls that happened each day. This way you would set the “day” as your key, and for each record (i.e. for each call) you would emit “1” as a value. After this you would sum up the values for each key, which would be the answer to your question – the total number of records for each day. But when you store the data across the cluster, how can you sum up the values for the same key stored on different machines? The only way to do so is to make all the values for the same key be on the same machine; after this you would be able to sum them up.

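To make the example concrete, here is a minimal Scala sketch of the “calls per day” computation; the input path and the CSV layout (date in the first field) are assumptions made purely for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CallsPerDay {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("calls-per-day"))

    // Hypothetical call detail records, one per line, e.g. "2016-08-01,555-1234,555-9876"
    val callRecords = sc.textFile("hdfs:///data/cdr/*.csv")

    val callsPerDay = callRecords
      .map(line => (line.split(",")(0), 1)) // key = day, value = 1 per call
      .reduceByKey(_ + _)                   // shuffle: all values for one day meet on one machine

    callsPerDay.collect().foreach(println)
    sc.stop()
  }
}
```

The reduceByKey step is exactly where the shuffle happens: records with the same “day” key, originally scattered across partitions, are brought together before being summed.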

There are many different tasks that require shuffling of the data across the cluster, for instance a table join – to join two tables on the field “id”, you must be sure that all the data for the same values of “id” for both of the tables are stored in the same chunks. Imagine tables with integer keys ranging from 1 to 1’000’000. By storing the data in the same chunks I mean that, for instance, for both tables the values of the keys 1-100 are stored in a single partition/chunk. This way, instead of going through the whole second table for each partition of the first one, we can join partition with partition directly, because we know that the key values 1-100 are stored only in these two partitions. To achieve this, both tables should have the same number of partitions; this way their join would require much less computation. So now you can understand how important shuffling is.

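As a hedged illustration of the co-partitioning idea, the sketch below builds two small pair RDDs keyed by “id” and gives them the same HashPartitioner with the same number of partitions; the table contents and the partition count of 100 are made up for the example, and `sc` is assumed to be an existing SparkContext.

```scala
import org.apache.spark.HashPartitioner

// Two hypothetical tables keyed by "id" (1 to 1'000'000); the values are just strings.
val tableA = sc.parallelize((1 to 1000000).map(id => (id, s"A-$id")))
val tableB = sc.parallelize((1 to 1000000).map(id => (id, s"B-$id")))

// Same partitioner and same number of partitions on both sides,
// so equal "id" values end up in the same partition of each table.
val partitioner = new HashPartitioner(100)
val aPart = tableA.partitionBy(partitioner).persist()
val bPart = tableB.partitionBy(partitioner).persist()

// Partition N of aPart can now be joined with partition N of bPart directly,
// without re-shuffling both tables for the join.
val joined = aPart.join(bPart)
```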

Discussing this topic, I would follow the MapReduce naming convention. In the shuffle operation, the task that emits the data in the source executor is “mapper”, the task that consumes the data into the target executor is “reducer”, and what happens between them is “shuffle”.


Shuffling in general has 2 important compression parameters: spark.shuffle.compress – whether the engine would compress shuffle outputs or not, and spark.shuffle.spill.compress – whether to compress intermediate shuffle spill files or not. Both have the value “true” by default, and both would use spark.io.compression.codec codec for compressing the data, which is snappy by default.

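A minimal sketch of how these settings can be made explicit on a SparkConf; the values shown are just the defaults described above.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shuffle-compression-example")
  .set("spark.shuffle.compress", "true")        // compress the shuffle output files
  .set("spark.shuffle.spill.compress", "true")  // compress intermediate spill files
  .set("spark.io.compression.codec", "snappy")  // codec used for both, snappy by default
```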

As you might know, there are a number of shuffle implementations available in Spark. Which implementation would be used in your particular case is determined by the value of the spark.shuffle.manager parameter. The three possible options are: hash, sort, tungsten-sort, and the “sort” option is the default starting from Spark 1.2.0.

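If you want to pin the implementation explicitly rather than rely on the default, it can be set like any other configuration parameter; a small sketch:

```scala
import org.apache.spark.SparkConf

// Valid values are "hash", "sort" and "tungsten-sort";
// "sort" is the default starting from Spark 1.2.0.
val conf = new SparkConf()
  .setAppName("shuffle-manager-example")
  .set("spark.shuffle.manager", "hash")
```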

Hash Shuffle

Prior to Spark 1.2.0 this was the default option of shuffle (spark.shuffle.manager = hash). But it has many drawbacks, mostly caused by the number of files it creates – each mapper task creates a separate file for each separate reducer, resulting in M * R total files on the cluster, where M is the number of “mappers” and R is the number of “reducers”. With a high number of mappers and reducers this causes big problems, both with the output buffer size, the number of open files on the filesystem, and the speed of creating and dropping all these files. Here’s a good example of how Yahoo faced all these problems, with 46k mappers and 46k reducers generating 2 billion files on the cluster.


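A quick back-of-the-envelope check of that file count, using the numbers from the Yahoo case above:

```scala
// Hash shuffle without consolidation: one file per (mapper, reducer) pair.
val mappers  = 46000L
val reducers = 46000L
val totalFiles = mappers * reducers  // 2,116,000,000 -- roughly the 2 billion files mentioned above
```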

The logic of this shuffler is pretty dumb: it calculates the number of “reducers” as the number of partitions on the “reduce” side, creates a separate file for each of them, and, looping through the records it needs to output, it calculates the target partition for each record and outputs the record to the corresponding file.

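The sketch below shows that loop in simplified Scala; it is not Spark’s actual code, and the `openFile` callback that produces one stream per reducer is a stand-in introduced only for this illustration.

```scala
import java.io.ObjectOutputStream

// Simplified sketch of the hash shuffle write path -- NOT Spark's real implementation.
def hashShuffleWrite[K, V](records: Iterator[(K, V)],
                           numReducers: Int,
                           openFile: Int => ObjectOutputStream): Unit = {
  // One output file (stream) per reducer.
  val writers = Array.tabulate(numReducers)(openFile)
  for ((key, value) <- records) {
    // The target partition is chosen by hashing the key, as a HashPartitioner would.
    val partition = ((key.hashCode % numReducers) + numReducers) % numReducers
    writers(partition).writeObject((key, value))
  }
  writers.foreach(_.close())
}
```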

Here is how it looks:

[Figure]

There is an optimization implemented for this shuffler, controlled by the parameter “spark.shuffle.consolidateFiles” (default is “false”). When it is set to “true”, the “mapper” output files would be consolidated. If your cluster has E executors (“--num-executors” for YARN) and each of them has C cores (“spark.executor.cores” or “--executor-cores” for YARN) and each task asks for T CPUs (“spark.task.cpus”), then the number of execution slots on the cluster would be E * C / T, and the number of files created during the shuffle would be E * C / T * R. With 100 executors, 10 cores each, 1 core allocated per task and 46000 “reducers”, it would allow you to go from 2 billion files down to 46 million files, which is much better in terms of performance.

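The arithmetic from the paragraph above, spelled out:

```scala
val executors    = 100    // --num-executors
val coresPerExec = 10     // spark.executor.cores
val cpusPerTask  = 1      // spark.task.cpus
val reducers     = 46000

val executionSlots    = executors * coresPerExec / cpusPerTask  // E * C / T = 1000
val consolidatedFiles = executionSlots.toLong * reducers        // 46,000,000 instead of ~2.1 billion
```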

This feature is implemented in a rather straightforward way: instead of creating a new file for each of the reducers, it creates a pool of output files. When a map task starts outputting the data, it requests a group of R files from this pool. When it is finished, it returns this group of R files back to the pool. As each executor can execute only C / T tasks in parallel, it creates only C / T groups of output files, each group consisting of R files. After the first C / T parallel “map” tasks have finished, each next “map” task reuses an existing group from this pool.

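As a rough sketch of the pool idea (not Spark’s actual classes; the `openFile` callback and the pool type are inventions for illustration):

```scala
import java.io.OutputStream
import scala.collection.mutable

// A "file group" is simply an array of R output streams, one per reducer.
class ShuffleFileGroupPool(reducers: Int,
                           openFile: (Int, Int) => OutputStream) {
  private val pool = mutable.Queue.empty[Array[OutputStream]]
  private var groupsCreated = 0

  // A map task asks for a group of R files; reuse one if an earlier task has returned it.
  def acquire(): Array[OutputStream] = synchronized {
    if (pool.nonEmpty) pool.dequeue()
    else {
      val groupId = groupsCreated
      groupsCreated += 1
      Array.tabulate(reducers)(reducerId => openFile(groupId, reducerId))
    }
  }

  // When the map task finishes, its group goes back for the next task to reuse,
  // so at most C / T groups ever exist per executor.
  def release(group: Array[OutputStream]): Unit = synchronized {
    pool.enqueue(group)
  }
}
```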

Here’s a general diagram of how it works:

[Figure]

Reposted from: https://my.oschina.net/u/947726/blog/741278
