A Brief Introduction to MapReduce

MapReduce is a programming model for processing large data sets with a distributed algorithm on a cluster. It is widely used for handling big data across multiple machines and provides a way to parallelize tasks. It was introduced in the paper "MapReduce: Simplified Data Processing on Large Clusters".

Below is a step-by-step breakdown of how the MapReduce process works:

1. Input

MapReduce begins with an input dataset, typically stored in a distributed file system such as HDFS. The data can come from a variety of sources, such as logs, databases, or documents.

2. Split the Input Across Different Machines

The input data is split into fixed-size chunks (often 64MB or 128MB) and distributed to different machines (called Mappers). Each machine works independently on its own chunk, which helps scale the process and increase efficiency.
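As a rough illustration of the splitting step, here is a minimal pure-Python sketch (not HDFS's actual mechanism, which also respects record boundaries and replication), using a tiny 10-byte chunk size in place of the real 64MB/128MB:

```python
def split_input(data: bytes, chunk_size: int) -> list[bytes]:
    """Split raw input into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# 21 bytes with chunk_size=10 -> three chunks of sizes 10, 10, and 1
chunks = split_input(b"hello mapreduce world", 10)
```

Each resulting chunk would then be handed to a different Mapper.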

3. Apply the Map Function on Each Machine

Each Mapper processes its assigned chunk by applying a user-defined map function, which transforms the input data into intermediate key-value pairs. The key point is that the map function operates on each chunk independently, which is what makes the process scalable.
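For the classic word-count example, a user-defined map function might look like the following sketch, emitting one (word, 1) pair per word:

```python
def word_count_map(chunk: str):
    """User-defined map function: emit an intermediate (word, 1)
    key-value pair for every word in the assigned chunk."""
    for word in chunk.split():
        yield (word.lower(), 1)

pairs = list(word_count_map("the quick fox the"))
# pairs == [("the", 1), ("quick", 1), ("fox", 1), ("the", 1)]
```

Note that the mapper emits duplicates ("the" appears twice); it is the later shuffle and reduce phases, not the mapper, that group and aggregate them.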

4. Shuffle: Move Map Results to the Reduce Machines

After the Map phase completes, the intermediate key-value pairs need to be shuffled and grouped by key. This involves:

  1. Partition and Sort: the output of each Mapper is sorted by key and partitioned so that all values for the same key are sent to the same Reducer.

  2. Fetch and Merge: the partitioned data is then sent to the corresponding Reducer machine, ensuring that all data for a given key is processed together.
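The partitioning step is commonly implemented as a hash of the key modulo the number of Reducers. A minimal sketch (using md5 for illustration; real frameworks use their own hash partitioners):

```python
import hashlib

def partition(key: str, num_reducers: int) -> int:
    """Deterministically assign a key to a reducer index in [0, num_reducers).
    A stable hash (md5 here) is used so every mapper computes the same
    assignment; Python's built-in hash() is salted per process and would not
    be stable across machines."""
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return digest % num_reducers
```

Because the assignment depends only on the key, every occurrence of "the" lands on the same Reducer, no matter which Mapper emitted it.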

5. Apply the Reduce Function on Each Machine

Each Reducer machine receives the grouped key-value pairs and applies the reduce function, which aggregates or processes the values for each key, for example computing sums, averages, or other aggregates.
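Continuing the word-count sketch, the reduce function receives one key together with all of its shuffled values and aggregates them:

```python
def word_count_reduce(key: str, values: list[int]) -> tuple[str, int]:
    """User-defined reduce function: aggregate all values for one key.
    After the shuffle, a reducer sees every value for a key grouped together."""
    return (key, sum(values))

result = word_count_reduce("the", [1, 1, 1])
# result == ("the", 3)
```

Summing is just one choice; the same shape works for averages, maxima, or any other per-key aggregation.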

6. Output

The final output of the Reduce phase is typically written back to a distributed storage system and represents the result of the MapReduce job. These results can then be further processed, analyzed, or stored.

Common Questions About MapReduce

1. How many Mappers/Reducers should I use?

  • Answer: It depends on the size of your data and your cluster. 1,000 Mappers and Reducers is a reasonable starting point, which you can then tune based on observed performance. More Mappers help with larger datasets, and the number of Reducers should align with the number of unique keys in the data.

2. Is it always better to have more machines?

  • Advantages: More machines reduce the workload on each individual node, which can lower the overall processing time, especially for large datasets.
  • Disadvantages: However, every machine takes some time to start up. As the number of machines grows, the total startup time can become a bottleneck, especially for smaller jobs. You therefore need to balance the number of machines against the complexity of the task.

3. If we ignore startup time, will adding more Reducers make processing faster?

  • Answer: Not necessarily. The useful number of Reducers is limited by the number of unique keys produced by the Mappers. If there are fewer keys than Reducers, adding more Reducers will not improve performance, because the extra Reducers receive no keys and sit idle.

4. What if one Reducer machine receives a disproportionately large number of keys?

  • Answer: Append a random suffix to the key (key salting), the same idea as using a sharded key to resolve a hot-spot problem. This spreads a hot key's values across several Reducers; a second aggregation pass then strips the suffix and merges the partial results.
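A minimal sketch of this salting idea (NUM_SHARDS is an arbitrary tuning knob chosen here for illustration):

```python
import random

NUM_SHARDS = 10  # how many ways to spread a hot key; a tuning parameter

def salt_key(key: str) -> str:
    """Map phase: append a random shard suffix so one hot key is spread
    over up to NUM_SHARDS reducers instead of overloading a single one."""
    return f"{key}#{random.randrange(NUM_SHARDS)}"

def unsalt_key(salted: str) -> str:
    """Second aggregation pass: strip the suffix so the partial results
    for all shards of the same original key can be merged."""
    return salted.rsplit("#", 1)[0]
```

The trade-off is an extra aggregation round in exchange for eliminating the single-Reducer hot spot.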
