Data Skew in Distributed Computing

Summary

Data skew refers to the situation in distributed computing where uneven data distribution or skewed data characteristics leave some compute nodes carrying far more load than others, degrading the performance and parallelism of the entire job.

The root causes of data skew fall into the following categories:

  1. Uneven data distribution: in distributed computing, how the data is distributed determines each node's load. If some nodes are assigned far more data than others, skew follows (a quick way to check for this is sketched after this list).
  2. Highly correlated data: if the data items in a task are strongly dependent on or correlated with one another, some nodes may end up with much more computation than others, triggering skew.
  3. Intrinsic data characteristics: properties of the data itself, such as a few exceptionally large records or attributes with a disproportionate number of values, can also cause skew.
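
Before choosing a remedy, it helps to confirm the diagnosis. The sketch below is a non-authoritative example, assuming an existing pair RDD named pairs, that estimates the key distribution from a sample; the 1% fraction and top-10 cutoff are arbitrary values.

```scala
// Assume `pairs` is an existing RDD[(String, Int)] over real data.
val topKeys = pairs
  .sample(withReplacement = false, fraction = 0.01, seed = 42L)
  .countByKey()        // Map[String, Long], collected to the driver
  .toSeq
  .sortBy(-_._2)       // heaviest keys first
  .take(10)

topKeys.foreach { case (k, n) => println(s"$k -> $n") }
```

If a handful of keys dominate the sample, the uneven key distribution described in point 1 is the likely culprit.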

To address data skew, the following approaches and design ideas can be used:

  1. Data pre-processing: before the data enters the distributed engine, pre-process it, for example by bucketing, sharding, or splitting it, so that it is spread evenly across the nodes and the likelihood of skew is reduced.
  2. Data splitting: highly correlated data can be split up and scattered across several compute nodes for parallel processing, lowering the load on any single node.
  3. Partitioning with local aggregation: in some tasks the data can be divided by some scheme, assigned to different nodes, aggregated locally on each node, and finally merged into a global result, which limits the impact of skew (see the salting sketch after this list).
  4. Dynamic load balancing: once skew has occurred, a dynamic load-balancing strategy can reassign tasks from overloaded nodes to other nodes to even out the load.
  5. Sample-based randomization: sampling and randomizing the data lowers the probability of skew while preserving its statistical characteristics.
  6. Scenario-specific algorithms and strategies: different applications may call for different processing techniques and designs, so algorithms and strategies should be tailored to the situation at hand.
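
The technique in point 3 is often implemented as two-stage ("salted") aggregation. Below is a minimal sketch in Spark, assuming an existing SparkContext sc and a pair RDD of (key, count) records; the salt range of 10 and the "#" separator are arbitrary example choices, not anything Spark mandates.

```scala
import scala.util.Random

val pairs = sc.parallelize(Seq(("hot", 1), ("hot", 1), ("cold", 1), ("hot", 1)))

val totals = pairs
  // Stage 1: prefix a random salt so one hot key becomes up to 10 partial keys,
  // spreading its records across tasks instead of funneling them into one.
  .map { case (k, v) => (s"${Random.nextInt(10)}#$k", v) }
  .reduceByKey(_ + _)
  // Stage 2: strip the salt and merge the partial sums per original key.
  .map { case (salted, v) => (salted.split("#", 2)(1), v) }
  .reduceByKey(_ + _)

totals.collect().foreach(println) // e.g. (hot,3), (cold,1)
```

The first reduceByKey does the heavy lifting in parallel; the second merges at most one partial sum per salt value, so the hot key no longer lands on a single task.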

Note that because the root causes of data skew are complex, resolving it usually requires weighing several factors together and combining different methods and strategies according to the situation.

Simply put

Data skew is a common issue in distributed computing engines: the workload is not evenly distributed across the nodes or partitions, so some of them become overloaded, leading to performance bottlenecks and slower processing times.

The root cause of data skew varies, but it is often the uneven distribution of data values or the nature of the data itself. For example, if a large share of the records relate to a single key, skew arises as soon as the data is partitioned or distributed across nodes.

To handle data skew in a distributed computing engine, there are several approaches and design principles that can be followed:

  1. Data pre-processing: One way to mitigate data skew is to perform data pre-processing before distributing or partitioning the data. This involves analyzing the data and determining any potential skew patterns. Based on this analysis, the data can be transformed or pre-processed to evenly distribute the workload across nodes. This may involve reshaping the data, aggregating or splitting it, and redistributing it to balance the workload more evenly.
  2. Key partitioning or shuffling strategies: In distributed computing engines, data is often partitioned based on a key that determines which node or partition the data will be assigned to. One approach to handle data skew is to use intelligent key partitioning or shuffling strategies. These strategies aim to distribute the data more evenly based on certain criteria, such as workload or data size. For example, a key-value pair could be sent to a specific node or partition based on the workload of that node or the size of the data already present in that node.
  3. Dynamic load balancing: Another approach to handle data skew is to implement dynamic load balancing mechanisms. These mechanisms continuously monitor the workload on each node or partition and redistribute the data or workload dynamically. This ensures that no node or partition becomes overloaded and helps in maintaining a balanced workload distribution.
  4. Skew-aware algorithms: Some distributed computing engines incorporate skew-aware algorithms that can specifically handle data skew. These algorithms consider the skew patterns in the data and adapt their processing strategy accordingly. For example, they may use different partitioning or shuffling algorithms for skewed data to ensure more even distribution and improved performance (a configuration sketch follows this list).
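
A concrete example of a skew-aware mechanism is Adaptive Query Execution (AQE) in Spark 3.x, which can detect skewed join partitions at runtime and split them into smaller tasks. A minimal sketch of enabling it follows; the factor shown is illustrative rather than a recommendation.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("skew-aware-join")
  .config("spark.sql.adaptive.enabled", "true")          // turn on AQE
  .config("spark.sql.adaptive.skewJoin.enabled", "true") // handle skewed joins
  // A partition is treated as skewed when it is this many times larger
  // than the median partition size (and above a byte threshold).
  .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
  .getOrCreate()
```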

Overall, handling data skew in a distributed computing engine requires a combination of data pre-processing techniques, intelligent partitioning or shuffling strategies, dynamic load balancing, and skew-aware algorithms. The specific approach and design will depend on the characteristics of the data and the requirements of the application.

Shuffle in a Distributed Engine

In a distributed computing job, a shuffle is the process of redistributing and merging data across compute nodes according to specified criteria.

A shuffle typically proceeds through the following steps:

  • Partitioning the data: split the original data into multiple partitions so that each can be processed by a different compute node. Partitioning is usually done by key or by hash, ensuring that records with the same key or hash value land in the same partition.

  • Local aggregation: on each compute node, perform partial aggregation over the local data to reduce the volume of shuffled data. For example, in the Map phase of MapReduce, each node does preliminary processing and stores its output locally as key-value pairs (see the sketch after this list).

  • Transferring the data: send each node's local data to its target node, typically over a network transport protocol such as TCP or UDP.

  • Sorting and merging: once a target node has received data from the other nodes, it sorts the records by key and merges those that share a key, so that all data for a given key is grouped together for subsequent processing.

  • Executing the operation: run the actual computation, such as aggregation, filtering, or other calculations, on the merged data.
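
The local-aggregation step is why, in Spark, reduceByKey usually shuffles far less data than groupByKey. A minimal sketch, assuming an existing SparkContext sc:

```scala
val words = sc.parallelize(Seq("a", "b", "a", "c", "a")).map(w => (w, 1))

// Map-side combine: each node pre-sums its local (word, 1) pairs, so at most
// one record per key per partition crosses the network.
val viaReduce = words.reduceByKey(_ + _)

// No map-side combine: every raw record is shuffled before being summed.
val viaGroup = words.groupByKey().mapValues(_.sum)

viaReduce.collect().foreach(println) // (a,3), (b,1), (c,1) in some order
```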

The shuffle algorithm in Spark

By default, Spark uses HashPartitioner to partition data during a shuffle.

The design of HashPartitioner is as follows (a simplified sketch of its partitioning logic appears after this list):

  • Hashing by key: for each record to be partitioned, Spark uses the hash of its key to decide which partition the record belongs to. Hashing the keys scatters the data roughly evenly across partitions; the choice of hash function affects how uniform the result is.

  • Assigning data to partitions: Spark places each record into the partition whose ID was computed from the hash. Hash values of different keys can map to the same partition ID, so several keys may end up grouped in one partition.

  • Data locality: when scheduling the tasks that process the partitions, Spark prefers to run them on the nodes where the data already resides, reducing transfer costs. Because records with the same key always map to the same partition, they are processed together on one node, which improves data locality.
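
A simplified stand-in for the logic above (Spark's actual HashPartitioner also handles null keys and delegates to an internal non-negative-modulo helper; this sketch mirrors that behavior without depending on Spark internals):

```scala
// Hash the key, take it modulo the partition count, and keep the result
// non-negative, since Java's % can return negative values for negative hashes.
def getPartition(key: Any, numPartitions: Int): Int = key match {
  case null => 0 // null keys go to partition 0
  case k =>
    val rawMod = k.hashCode % numPartitions
    rawMod + (if (rawMod < 0) numPartitions else 0)
}
```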

The advantages of partitioning with HashPartitioner are:

  • Uniformity: by hashing keys, HashPartitioner scatters data roughly evenly across partitions. This raises the parallelism of the job and lets tasks finish faster.

  • Data locality: HashPartitioner's assignment, combined with Spark's locality-aware task scheduling, tends to keep computation near the data. This reduces the volume of data transferred and improves efficiency.

However, HashPartitioner also has drawbacks.

For example, if the keys are unevenly distributed, hashing can leave some partitions with far more data than others, producing an unbalanced load.

To address this, consider a custom Partitioner (a minimal sketch follows) or another partitioning strategy suited to the characteristics of the data.
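
One way to build such a custom Partitioner is to reserve a dedicated partition for a known hot key and hash everything else across the rest. The sketch below is a hypothetical example; the hot key would be identified in advance, for instance with the sampling sketch earlier in this article.

```scala
import org.apache.spark.Partitioner

class SkewAwarePartitioner(override val numPartitions: Int, hotKey: String)
    extends Partitioner {
  require(numPartitions >= 2, "need at least one partition besides the hot one")

  override def getPartition(key: Any): Int = key match {
    case k if k == hotKey => 0 // dedicated partition for the hot key
    case null             => 1
    case k =>
      // Hash the remaining keys across partitions 1 .. numPartitions - 1.
      val rawMod = k.hashCode % (numPartitions - 1)
      1 + rawMod + (if (rawMod < 0) numPartitions - 1 else 0)
  }
}

// Usage: pairs.partitionBy(new SkewAwarePartitioner(10, "hot"))
```

Isolating the hot key caps the damage it can do, although that one partition's size is still governed by the hot key itself; combining this with the salting technique shown earlier spreads even the hot key's records.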

On the other hand

Once upon a time in the year 2200, humanity had achieved unprecedented progress in technological advancements. The world had become interconnected through a vast network of computers and data centers known as the Global Computational Network (GCN). The GCN enabled seamless communication, instant access to information, and above all, distributed computing power that solved complex problems with ease.

However, the GCN was not without its challenges. Over the years, an unforeseen problem had emerged, threatening to disrupt the balance of power within the system. This problem was known as “data skew” or, as some had begun to call it, “the Virtual Divide.”

At the heart of the GCN was a revolutionary algorithm known as Clusterized Data Allocation (CDA). CDA enabled the distribution of computing tasks across multiple servers, ensuring efficiency and minimizing processing time. The algorithm analyzed data patterns and allocated computing resources accordingly. But as more and more data flooded the system, CDA began to struggle with an unforeseen issue - data skew.

Data skew occurred when certain data patterns became overwhelmingly dominant in the system. This skewed distribution led to an uneven load on the servers, causing some to be overloaded while others remained idle. Consequently, processing times suffered, and delays in solving critical problems emerged.

The cause of data skew lay in the improved ability of humans to generate and access data. As technology advanced, people had become increasingly interconnected, and their actions and creations were constantly being digitized and uploaded to the GCN. These massive amounts of data posed an unprecedented challenge to the system’s ability to evenly distribute processing tasks.

The Virtual Divide was born as a result of this data skew. The divide represented a growing disparity between the powerful data clusters and the struggling servers. Those in control of the dominant clusters had considerable leverage, as they could not only solve problems faster but also manipulate the distribution of tasks in their favor, potentially leading to a shift in power dynamics.

As data skew continued to worsen, a group of scientists and engineers formed the Data Equality Foundation (DEF) to combat the Virtual Divide. The DEF aimed to develop a new algorithm, known as Equilibrium Data Balancing (EDB), to counteract the effects of data skew and restore balance to the GCN.

The plot thickened when rumors began to circulate that certain powerful entities within the GCN were intentionally exacerbating data skew for their gain. These entities believed that controlling the distribution of computing tasks could give them an unprecedented level of influence over the world.

In a race against time, the DEF worked tirelessly to develop and implement EDB. Their goal was to create a fair and transparent system where computing tasks would be distributed equitably, regardless of the dominant data patterns.

Their efforts paid off, and the EDB algorithm was successfully integrated into the GCN. Through the powerful combination of CDA and EDB, the Virtual Divide was finally closed. The system regained its ability to distribute tasks efficiently, ensuring that every cluster, no matter how dominant, had a fair share of computing power.

With the Virtual Divide eradicated, the GCN thrived once again. The world could rely on its distributed computing power to solve complex problems, protect against cyber threats, and push the boundaries of scientific discovery.

The story of the Virtual Divide and its resolution served as a reminder to humanity about the importance of fairness, collaboration, and the need to continuously adapt to the challenges posed by technological progress. It showcased the indomitable spirit of humans working together to overcome obstacles and create a harmonious future in the interconnected world of distributed computing.
