Network Topology and Hadoop

From Hadoop: The Definitive Guide (the "Elephant Book"), p. 64




What does it mean for two nodes in a local network to be “close” to each other?

In the context of high-volume data processing, the limiting factor is the rate at which we can transfer data between nodes: bandwidth is a scarce commodity. The idea is to use the bandwidth between two nodes as a measure of distance.


Rather than measuring bandwidth between nodes, which can be difficult to do in practice (it requires a quiet cluster, and the number of pairs of nodes in a cluster grows as the square of the number of nodes), Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor. Levels in the tree are not predefined, but it is common to have three levels that correspond to the data center, the rack, and the node that a process is running on. The idea is that the bandwidth available for each of the following scenarios becomes progressively less:



Processes on the same node


Different nodes on the same rack


Nodes on different racks in the same data center


Nodes in different data centers



For example, imagine a node n1 on rack r1 in data center d1. This can be represented as /d1/r1/n1. Using this notation, here are the distances for the four scenarios:



distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)


distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)


distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)


distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)



This is illustrated schematically in Figure 3-2. (Mathematically inclined readers will notice that this is an example of a distance metric.)
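The distance rule above can be computed directly from the /datacenter/rack/node notation. Here is a minimal, self-contained sketch; the class name NetworkDistance is hypothetical (Hadoop's real logic lives in org.apache.hadoop.net.NetworkTopology), but the calculation mirrors the four example distances:

```java
// Hypothetical sketch, not Hadoop's actual API: distance between two nodes is
// the sum of each node's hops up to their closest common ancestor in the tree.
public class NetworkDistance {
    public static int distance(String a, String b) {
        // Paths look like /d1/r1/n1; drop the leading "/" and split on "/".
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        // Count how many leading path components the two nodes share.
        int common = 0;
        while (common < pa.length && common < pb.length
                && pa[common].equals(pb[common])) {
            common++;
        }
        // Each node is (depth - common) hops from the shared ancestor.
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: same data center
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6: different data centers
    }
}
```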



Figure 3-2. Network distance in Hadoop


Finally, it is important to realize that Hadoop cannot divine your network topology for you. It needs some help; we'll cover how to configure topology in "Network Topology" on page 247. By default, though, it assumes that the network is flat (a single-level hierarchy) or, in other words, that all nodes are on a single rack in a single data center. For small clusters, this may actually be the case, and no further configuration is required.



From Hadoop: The Definitive Guide, p. 247


Network Topology



A common Hadoop cluster architecture consists of a two-level network topology, as illustrated in Figure 9-1.



 

Figure 9-1. Typical two-level network architecture for a Hadoop cluster



Typically there are 30 to 40 servers per rack, with a 1 Gb switch for the rack (only three are shown in the diagram) and an uplink to a core switch or router (which is normally 1 Gb or better). The salient point is that the aggregate bandwidth between nodes on the same rack is much greater than that between nodes on different racks.


Rack awareness



To get maximum performance out of Hadoop, it is important to configure Hadoop so that it knows the topology of your network. If your cluster runs on a single rack, there is nothing more to do, since this is the default. However, for multirack clusters, you need to map nodes to racks. By doing this, Hadoop will prefer within-rack transfers (where there is more bandwidth available) to off-rack transfers when placing MapReduce tasks on nodes. HDFS will also be able to place replicas more intelligently to trade off performance and resilience.


Network locations such as nodes and racks are represented in a tree, which reflects the network “distance” between locations. The namenode uses the network location when determining where to place block replicas (see “Network Topology and Hadoop” on page 64); the jobtracker uses network location to determine where the closest replica is as input for a map task that is scheduled to run on a tasktracker.



For the network in Figure 9-1, the rack topology is described by two network locations, say, /switch1/rack1 and /switch1/rack2. Since there is only one top-level switch in this cluster, the locations can be simplified to /rack1 and /rack2.



The Hadoop configuration must specify a map between node addresses and network locations. The map is described by a Java interface, DNSToSwitchMapping, whose signature is:


public interface DNSToSwitchMapping {
  public List<String> resolve(List<String> names);
}



The names parameter is a list of IP addresses, and the return value is a list of corresponding network location strings. The topology.node.switch.mapping.impl configuration property defines an implementation of the DNSToSwitchMapping interface that the namenode and the jobtracker use to resolve worker node network locations.



For the network in our example, we would map node1, node2, and node3 to /rack1, and node4, node5, and node6 to /rack2.
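For illustration, the mapping for this six-node example could be sketched as a static table behind a resolve() method with the same shape as the interface above. This is a hypothetical, self-contained class, not a real Hadoop implementation (a production class would implement org.apache.hadoop.net.DNSToSwitchMapping and read its table from configuration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a DNSToSwitchMapping-style resolver backed by a
// hard-coded table mapping hostnames to network locations.
public class StaticRackMapping {
    private static final Map<String, String> TABLE = new HashMap<>();
    static {
        TABLE.put("node1", "/rack1");
        TABLE.put("node2", "/rack1");
        TABLE.put("node3", "/rack1");
        TABLE.put("node4", "/rack2");
        TABLE.put("node5", "/rack2");
        TABLE.put("node6", "/rack2");
    }

    // Same shape as resolve(List<String>) in the interface above; unknown
    // hosts fall back to /default-rack, mirroring Hadoop's default behavior
    // when no topology is configured.
    public static List<String> resolve(List<String> names) {
        List<String> locations = new ArrayList<>();
        for (String name : names) {
            locations.add(TABLE.getOrDefault(name, "/default-rack"));
        }
        return locations;
    }
}
```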



Most installations don’t need to implement the interface themselves, however, since the default implementation is ScriptBasedMapping, which runs a user-defined script to determine the mapping. The script’s location is controlled by the property topology.script.file.name. The script must accept a variable number of arguments that are the hostnames or IP addresses to be mapped, and it must emit the corresponding network locations to standard output, separated by whitespace. The example code includes a script for this purpose.



If no script location is specified, the default behavior is to map all nodes to a single network location, called /default-rack.


From Hadoop: The Definitive Guide, p. 67


Replica Placement



How does the namenode choose which datanodes to store replicas on? There's a trade-off between reliability and write and read bandwidth here. For example, placing all replicas on a single node incurs the lowest write bandwidth penalty, since the replication pipeline runs on a single node, but this offers no real redundancy (if the node fails, the data for that block is lost). Also, the read bandwidth is high for off-rack reads. At the other extreme, placing replicas in different data centers may maximize redundancy, but at the cost of bandwidth. Even in the same data center (which is what all Hadoop clusters to date have run in), there are a variety of placement strategies.


Indeed, Hadoop changed its placement strategy in release 0.17.0 to one that helps keep a fairly even distribution of blocks across the cluster. (See “balancer” on page 284 for details on keeping a cluster balanced.)


Hadoop's strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second (a note in the original asks whether the source code actually places it on the same rack as the first), but on a different node chosen at random. Further replicas are placed on random nodes in the cluster, although the system tries to avoid placing too many replicas on the same rack.
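The strategy just described can be sketched in a few lines. This is a toy illustration for a replication factor of 3 under simplifying assumptions (node names like /rack1/node1, a uniform random choice, and no fullness or busyness checks); it is not HDFS's actual BlockPlacementPolicy:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy illustration of the default placement for a replication factor of 3:
// first replica on the writer's node, second on a random node in a different
// rack, third on a different node in the second replica's rack.
public class ReplicaPlacementSketch {
    // Node names look like /rack1/node1; the rack is everything before the
    // final "/" component.
    public static String rack(String node) {
        return node.substring(0, node.lastIndexOf('/'));
    }

    public static List<String> place(String client, List<String> nodes, Random rnd) {
        List<String> replicas = new ArrayList<>();
        replicas.add(client); // first replica: same node as the client

        // Second replica: a random node on a different rack from the first.
        List<String> offRack = new ArrayList<>();
        for (String n : nodes) {
            if (!rack(n).equals(rack(client))) offRack.add(n);
        }
        String second = offRack.get(rnd.nextInt(offRack.size()));
        replicas.add(second);

        // Third replica: a different random node on the second replica's rack.
        List<String> sameRack = new ArrayList<>();
        for (String n : nodes) {
            if (rack(n).equals(rack(second)) && !n.equals(second)) sameRack.add(n);
        }
        replicas.add(sameRack.get(rnd.nextInt(sameRack.size())));
        return replicas;
    }
}
```

A quick check of the invariants (rather than exact nodes, since two picks are random): the first replica equals the client, the second is off-rack, and the third shares the second's rack but not its node.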


Once the replica locations have been chosen, a pipeline is built, taking network topology into account. For a replication factor of 3, the pipeline might look like Figure 3-4.


 

Figure 3-4. A typical replica pipeline


Overall, this strategy gives a good balance among reliability (blocks are stored on two racks), write bandwidth (writes only have to traverse a single network switch), read performance (there's a choice of two racks to read from), and block distribution across the cluster (clients write only a single block on the local rack).




