Network Topology and Hadoop

From Hadoop: The Definitive Guide (the "Elephant Book"), p. 64




What does it mean for two nodes in a local network to be “close” to each other?

In the context of high-volume data processing, the limiting factor is the rate at which we can transfer data between nodes: bandwidth is a scarce commodity. The idea is to use the bandwidth between two nodes as a measure of distance.


Rather than measuring bandwidth between nodes, which can be difficult to do in practice (it requires a quiet cluster, and the number of pairs of nodes in a cluster grows as the square of the number of nodes), Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor. Levels in the tree are not predefined, but it is common to have three levels that correspond to the data center, the rack, and the node that a process is running on. The idea is that the bandwidth available for each of the following scenarios becomes progressively less:



Processes on the same node


Different nodes on the same rack


Nodes on different racks in the same data center


Nodes in different data centers



For example, imagine a node n1 on rack r1 in data center d1. This can be represented as /d1/r1/n1. Using this notation, here are the distances for the four scenarios:



distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)


distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)


distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)


distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)



This is illustrated schematically in Figure 3-2. (Mathematically inclined readers will notice that this is an example of a distance metric.)
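The distance rule above can be computed directly from the /datacenter/rack/node notation. Here is a minimal, self-contained sketch; the class name NetworkDistance is hypothetical (Hadoop's real logic lives in org.apache.hadoop.net.NetworkTopology), but the calculation mirrors the four example distances:

```java
// Hypothetical sketch, not Hadoop's actual API: distance between two nodes is
// the sum of each node's hops up to their closest common ancestor in the tree.
public class NetworkDistance {
    public static int distance(String a, String b) {
        // Paths look like /d1/r1/n1; drop the leading "/" and split on "/".
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        // Count how many leading path components the two nodes share.
        int common = 0;
        while (common < pa.length && common < pb.length
                && pa[common].equals(pb[common])) {
            common++;
        }
        // Each node is (depth - common) hops from the shared ancestor.
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: same data center
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6: different data centers
    }
}
```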



Figure 3-2. Network distance in Hadoop


Finally, it is important to realize that Hadoop cannot divine your network topology for you. It needs some help; we'll cover how to configure topology in "Network Topology" on page 247. By default, though, it assumes that the network is flat (a single-level hierarchy) or, in other words, that all nodes are on a single rack in a single data center. For small clusters, this may actually be the case, and no further configuration is required.



From Hadoop: The Definitive Guide, p. 247


Network Topology



A common Hadoop cluster architecture consists of a two-level network topology, as illustrated in Figure 9-1.



 

Figure 9-1. Typical two-level network architecture for a Hadoop cluster



Typically there are 30 to 40 servers per rack, with a 1 Gb switch for the rack (only three are shown in the diagram) and an uplink to a core switch or router (which is normally 1 Gb or better). The salient point is that the aggregate bandwidth between nodes on the same rack is much greater than that between nodes on different racks.


Rack awareness



To get maximum performance out of Hadoop, it is important to configure Hadoop so that it knows the topology of your network. If your cluster runs on a single rack, there is nothing more to do, since this is the default. However, for multirack clusters, you need to map nodes to racks. By doing this, Hadoop will prefer within-rack transfers (where there is more bandwidth available) to off-rack transfers when placing MapReduce tasks on nodes. HDFS will also be able to place replicas more intelligently to trade off performance and resilience.


Network locations such as nodes and racks are represented in a tree, which reflects the network “distance” between locations. The namenode uses the network location when determining where to place block replicas (see “Network Topology and Hadoop” on page 64); the jobtracker uses network location to determine where the closest replica is as input for a map task that is scheduled to run on a tasktracker.



For the network in Figure 9-1, the rack topology is described by two network locations, say, /switch1/rack1 and /switch1/rack2. Since there is only one top-level switch in this cluster, the locations can be simplified to /rack1 and /rack2.



The Hadoop configuration must specify a map between node addresses and network locations. The map is described by a Java interface, DNSToSwitchMapping, whose signature is:


public interface DNSToSwitchMapping {
  public List<String> resolve(List<String> names);
}



The names parameter is a list of IP addresses, and the return value is a list of corresponding network location strings. The topology.node.switch.mapping.impl configuration property defines an implementation of the DNSToSwitchMapping interface that the namenode and the jobtracker use to resolve worker node network locations.



For the network in our example, we would map node1, node2, and node3 to /rack1, and node4, node5, and node6 to /rack2.
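For illustration, the mapping for this six-node example could be sketched as a static table behind a resolve() method with the same shape as the interface above. This is a hypothetical, self-contained class, not a real Hadoop implementation (a production class would implement org.apache.hadoop.net.DNSToSwitchMapping and read its table from configuration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a DNSToSwitchMapping-style resolver backed by a
// hard-coded table mapping hostnames to network locations.
public class StaticRackMapping {
    private static final Map<String, String> TABLE = new HashMap<>();
    static {
        TABLE.put("node1", "/rack1");
        TABLE.put("node2", "/rack1");
        TABLE.put("node3", "/rack1");
        TABLE.put("node4", "/rack2");
        TABLE.put("node5", "/rack2");
        TABLE.put("node6", "/rack2");
    }

    // Same shape as resolve(List<String>) in the interface above; unknown
    // hosts fall back to /default-rack, mirroring Hadoop's default behavior
    // when no topology is configured.
    public static List<String> resolve(List<String> names) {
        List<String> locations = new ArrayList<>();
        for (String name : names) {
            locations.add(TABLE.getOrDefault(name, "/default-rack"));
        }
        return locations;
    }
}
```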



Most installations don’t need to implement the interface themselves, however, since the default implementation is ScriptBasedMapping, which runs a user-defined script to determine the mapping. The script’s location is controlled by the property topology.script.file.name. The script must accept a variable number of arguments that are the hostnames or IP addresses to be mapped, and it must emit the corresponding network locations to standard output, separated by whitespace. The example code includes a script for this purpose.



If no script location is specified, the default behavior is to map all nodes to a single network location, called /default-rack.


From Hadoop: The Definitive Guide, p. 67


Replica Placement



How does the namenode choose which datanodes to store replicas on? There's a trade-off between reliability and write and read bandwidth here. For example, placing all replicas on a single node incurs the lowest write bandwidth penalty, since the replication pipeline runs on a single node, but this offers no real redundancy (if the node fails, the data for that block is lost). Also, the read bandwidth is high for off-rack reads. At the other extreme, placing replicas in different data centers may maximize redundancy, but at the cost of bandwidth. Even in the same data center (which is what all Hadoop clusters to date have run in), there are a variety of placement strategies.


Indeed, Hadoop changed its placement strategy in release 0.17.0 to one that helps keep a fairly even distribution of blocks across the cluster. (See “balancer” on page 284 for details on keeping a cluster balanced.)


Hadoop's strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second (a note in the original asks whether the source code actually places it on the same rack as the first), but on a different node chosen at random. Further replicas are placed on random nodes in the cluster, although the system tries to avoid placing too many replicas on the same rack.
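The strategy just described can be sketched in a few lines. This is a toy illustration for a replication factor of 3 under simplifying assumptions (node names like /rack1/node1, a uniform random choice, and no fullness or busyness checks); it is not HDFS's actual BlockPlacementPolicy:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy illustration of the default placement for a replication factor of 3:
// first replica on the writer's node, second on a random node in a different
// rack, third on a different node in the second replica's rack.
public class ReplicaPlacementSketch {
    // Node names look like /rack1/node1; the rack is everything before the
    // final "/" component.
    public static String rack(String node) {
        return node.substring(0, node.lastIndexOf('/'));
    }

    public static List<String> place(String client, List<String> nodes, Random rnd) {
        List<String> replicas = new ArrayList<>();
        replicas.add(client); // first replica: same node as the client

        // Second replica: a random node on a different rack from the first.
        List<String> offRack = new ArrayList<>();
        for (String n : nodes) {
            if (!rack(n).equals(rack(client))) offRack.add(n);
        }
        String second = offRack.get(rnd.nextInt(offRack.size()));
        replicas.add(second);

        // Third replica: a different random node on the second replica's rack.
        List<String> sameRack = new ArrayList<>();
        for (String n : nodes) {
            if (rack(n).equals(rack(second)) && !n.equals(second)) sameRack.add(n);
        }
        replicas.add(sameRack.get(rnd.nextInt(sameRack.size())));
        return replicas;
    }
}
```

A quick check of the invariants (rather than exact nodes, since two picks are random): the first replica equals the client, the second is off-rack, and the third shares the second's rack but not its node.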


Once the replica locations have been chosen, a pipeline is built, taking network topology into account. For a replication factor of 3, the pipeline might look like Figure 3-4.


 

Figure 3-4. A typical replica pipeline


Overall, this strategy gives a good balance among reliability (blocks are stored on two racks), write bandwidth (writes only have to traverse a single network switch), read performance (there's a choice of two racks to read from), and block distribution across the cluster (clients write only a single block on the local rack).




