Distributed Systems 分布式系统

The field of distributed computing has witnessed an explosive expansion during the last decade. As the use of distributed computing systems for large-scale computations is growing, so is the need to increase their reliability. Nevertheless, the probability of failure of an individual processing node in multinode distributed systems is not negligible. Hence, it is necessary to develop mechanisms that prevent the waste of computations performed on distributed processing nodes if one of the nodes fails, either due to a hardware transient fault (bus error or segmentation fault) or a permanent fault (power failure or communication network malfunction).

在过去十年中，分布式计算领域出现了爆炸式的发展。随着分布式计算系统在大规模计算中的应用越来越广泛，提高可靠性的要求也越来越高。然而，在多节点分布式系统中，单个处理节点的失效概率是不容忽视的。因此，有必要开发相应的机制，以便在某个节点失效时，防止分布式处理节点上已完成的计算被浪费。节点失效的原因可能是硬件的暂时故障（总线错误或段错误），也可能是持久故障（电源故障或通信网络故障）。

Advances in communications technology and methods of work introduced at diverse workplace environments naturally led to a greater distribution of information processing. Initially, most distributed systems were homogeneous, but now many distributed environments are heterogeneous. Therefore, the distributed systems design must focus on heterogeneous environments, treating homogeneous systems as special cases in a heterogeneous world. Key issues in distributed systems design include where specific functionality should be located within the information infrastructure.

通信技术的进展以及基于不同环境下的工作方法的引入，自然会导致更广泛的信息处理分布。最初，大多数分布式系统是同构的，但现在许多分布式系统是异构的。因此，分布式系统设计必须聚焦异构环境，把同构系统当作异构领域里的特殊例子。分布式系统设计中的关键问题包括:具体的功能应该置于信息基础设施的何处。

WHAT ARE DISTRIBUTED SYSTEMS?

什么是分布式系统？

A distributed system is a collection of independent computers which appear to the users of the system as a single computer. Nearly all large software systems are by necessity distributed. For example, enterprise-wide business systems must support multiple users running common applications across different sites.

分布式系统是独立计算机的集合，但在系统用户看来，它就像一台单一的计算机。几乎所有的大型软件系统都必然是分布式的。比如，企业级的商务系统就必须支持多个用户跨站点运行通用的应用程序。

A distributed system encompasses a variety of applications, their underlying support software, the hardware they run on, and the communication links connecting the distributed hardware. The largest and best-known distributed system is the set of computers, software, and services comprising the World Wide Web, which is so pervasive that it coexists with and connects to most other existing distributed systems. The most common distributed systems are networked client/server systems. Distributed systems share the general properties described below.

一个分布式系统包括各种应用程序、它们底层的软件支持、供它们运行的硬件设施、以及连接分布式硬件设施的通信链路。最大最著名的分布式系统是由万维网上的一系列计算机、软件和服务集组成的。它与现有的大多数分布式系统共存互联，应用极为普遍。最常见的分布式系统是网络C/S系统。分布式系统一般共享下面将要描述到的几个属性。

Resource Sharing

资源共享

The most common reason for connecting a set of computers into a distributed system is to allow them to share physical and computational resources (printers, files, databases, mail services, stock quotes, and collaborative applications, for example). Distributed system components that support resource sharing play a role similar to that of operating systems and are increasingly indistinguishable from them.

将一组计算机连接到分布式系统的最直接的原因是支持它们共享物理的和计算的资源（比如，打印机、文件、数据库、邮件服务、股票报价以及各种彼此协作的应用程序）。分布式中支持资源共享的组件扮演着一个类似操作系统的角色，而且它们已经变得越来越难以区分。

Multiple Nodes

多节点

Software for the distributed system executes on nodes, that is, on multiple independent computers (not merely multiple processors on the same computer, which is the realm of parallel computing). These nodes can range from personal computers and high-performance workstations to file servers, mainframes, and supercomputers. Each can take the role of a client, which requests services from others; a server, which provides computation or resource access to others; or a peer, which does both. A distributed system may be as small as two nodes, provided software connectivity is present. This arrangement is represented in Figure 6A-1.

分布式系统的软件运行在节点上，也就是运行在多台独立的计算机上（而不仅仅是同一台计算机上的多个处理器，那属于并行计算的范畴）。这些节点可以是个人电脑、高性能工作站、文件服务器、大型机或超级计算机。其中每一个节点都可以扮演客户机的角色，用来向其他节点请求服务；也可以扮演服务器，用来为其他节点提供数据处理或资源访问的服务；或者作为对等节点，同时扮演两种角色。一个分布式系统也可以小到只有两个节点，只要它们之间具备软件层面的连接即可。图6A-1表示的就是这种布置。

Figure 6A-1: A small distributed system

图6A-1：一个小型的分布式系统
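The two-node arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: both "nodes" run on the local machine and talk over TCP, the server's job (upper-casing text) is invented for the demo, and the function names `start_server`, `serve_once`, `request`, and `demo` are hypothetical.

```python
import socket
import threading


def start_server():
    """Bind a listening socket on an OS-chosen local port; return (socket, port)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    srv.listen(1)
    return srv, srv.getsockname()[1]


def serve_once(srv):
    """Server role: accept one client, upper-case its request, send the reply."""
    conn, _ = srv.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())
    srv.close()


def request(port, text):
    """Client role: send a request to the server and wait for its reply."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        client.sendall(text.encode("utf-8"))
        return client.recv(1024).decode("utf-8")


def demo():
    """Run one server node and one client node, return the client's reply."""
    srv, port = start_server()
    server_thread = threading.Thread(target=serve_once, args=(srv,))
    server_thread.start()
    reply = request(port, "hello")
    server_thread.join()
    return reply
```

Calling `demo()` starts the server in a background thread, sends one request from the client, and returns the reply. A peer would simply run both `serve_once` and `request`.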

Concurrency

并发

Each of the nodes in a distributed system functions both independently and concurrently with all of the others. More than one process (executing program) per node and more than one thread (concurrently executing task) per process can act as components in a system. Most components are reactive, continuously responding to commands from users and messages from other components. Like operating systems, distributed systems are designed to avoid termination and so should always remain at least partially available.

每一个分布式系统的功能节点都是与其他节点独立并发执行的。每个节点的多个进程（执行程序）以及每个进程的多个线程（并发执行任务）都可以作为系统中的组件。大多数组件是被动地、持续地响应来自用户的命令或来自其他组件的消息。就像操作系统一样，分布式系统的设计就是用来规避服务中断的，所以它必须总是保持至少部分可用。
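As a rough sketch of a reactive component, the example below models each component as a thread that loops on an inbox and responds to every message it receives. The class name `Component`, the `handled` list, and the `None` stop sentinel are assumptions made for the demo; a real component would normally keep running rather than stop.

```python
import queue
import threading


class Component(threading.Thread):
    """A reactive component: waits on its inbox and responds to each message."""

    def __init__(self, name):
        super().__init__(name=name, daemon=True)
        self.inbox = queue.Queue()
        self.handled = []

    def run(self):
        while True:
            msg = self.inbox.get()      # block until a message arrives
            if msg is None:             # sentinel used only so the demo can stop
                break
            self.handled.append(f"{self.name}:{msg}")


def demo():
    """Start two concurrent components, send each three messages, collect results."""
    components = [Component("a"), Component("b")]
    for comp in components:
        comp.start()
    for i in range(3):
        for comp in components:
            comp.inbox.put(i)
    for comp in components:
        comp.inbox.put(None)
        comp.join()
    return [comp.handled for comp in components]
```

Each component processes its own inbox in order, while the two components run independently of one another.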

Heterogeneity

异构性

The nodes participating in a system can consist of diverse computing and communication hardware. The software comprising the system can include diverse programming languages and development tools. Some heterogeneity issues can be addressed with common message formats and low-level protocols that are readily implemented across different platforms (e.g., PCs, servers, and mainframes). Others may require construction of bridges that translate one set of formats and protocols to another. More thorough system integration can be attained by requiring that all nodes support a common virtual machine that processes platform-independent program instructions. The systems that use the Java programming language follow this approach.

参与系统的节点可以由不同的处理方式及通信硬件组成。包含系统的软件可以包容不同的编程语言及开发工具。一些异构性问题，可以通过公共的消息格式来解决，而且一些底层协议可以很轻松地实现跨平台（例如，电脑、服务器及大型主机）。其他问题则需要构建能自由转换不同数据格式和协议的桥梁。更彻底的系统集成可以通过让所有节点都支持通用虚拟机（可以处理与平台不相关的程序指令）来实现。使用Java语言编程的系统就是遵循了这种方法。
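One way to picture a common message format is a fixed binary layout with an agreed byte order, so that machines with different native byte orders still read the same values. The sketch below uses Python's standard `struct` module with the `!` (network byte order) prefix. The message layout (a 16-bit sensor id followed by a 64-bit float) and the function names are assumptions chosen for illustration.

```python
import struct

# Agreed wire layout: network (big-endian) byte order, no padding,
# unsigned 16-bit sensor id followed by a 64-bit IEEE 754 float.
READING_FORMAT = "!Hd"


def encode_reading(sensor_id, value):
    """Encode a reading into the platform-independent wire format."""
    return struct.pack(READING_FORMAT, sensor_id, value)


def decode_reading(frame):
    """Decode a frame produced by encode_reading on any platform."""
    sensor_id, value = struct.unpack(READING_FORMAT, frame)
    return sensor_id, value
```

Because the byte order is fixed by the format string rather than by the host CPU, `encode_reading(1, 2.5)` yields the same 10 bytes on every platform.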

Multiple Protocols

Most distributed message passing differs significantly from the kinds of invocations (such as procedure calls) used within the confines of sequential programs. The most basic form of distributed communication is asynchronous. Similar to letters mailed in a postal system, senders issue messages without relying on receipt of or reply by their recipients. Such basic distributed messages usually take much longer to reach recipients than do local invocations. They sometimes reach recipients in an order different from the one in which they were sent, and they may fail to reach them at all. To cope with these problems, more sophisticated protocols must be constructed on top of basic messaging. These may include:

大多数分布式消息传递方式与适用于顺序程序的调用类型（如过程调用）明显不同。分布式通信最基本的方式是异步的。类似于邮政系统中的信件邮递，发送方发布消息不依赖于接收方的接收或回复。这种基本的分布式消息通常比本地调用要花费更长的时间才能到达接收方处。它们有时会以跟他们发出时完全不同的顺序到达接收方处，甚至可能无法到达接收方处。为了避免这种情况，必须构造更复杂的协议。它们包括：

● Procedural messaging, in which senders wait for full replies

程序通信，即消息发出者等待完整的回复。

● Semi-synchronous messaging, in which senders wait for an acknowledgment of message receipt before proceeding

半同步通信，即消息发出者在继续发出之前等待消息的接收确认。

● Transactional protocols, in which all messages in a given session or transaction are processed in an all-or-none fashion

事务处理协议，即在一个给定的会话或事务里，所有消息要么全部被处理，要么全部不被处理。

● Callback protocols, in which receivers later issue different messages back to their senders

回调协议，即接收者接到消息后返回给发出者不同的消息。

● Time-out protocols, in which senders only wait for replies for a certain period before proceeding

定时协议，即消息发出者在继续发出之前在一定的时间内等待回复。

● Multicast protocols, in which senders simultaneously issue messages to a group of other nodes

多点传送协议，即发出者在同一时间向一组其他节点发布消息。

These and other protocols are often extended and specialized to enhance reliability, security, and efficiency.

这些和其他一些协议常常被扩展和专门化，以提高可靠性、安全性和效率。
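Two of the protocols listed above, semi-synchronous messaging and time-outs, can be combined into a simple "send, wait for an acknowledgment, retry on time-out" scheme. The following is a minimal simulation, not a real network protocol: `LossyChannel` is a hypothetical in-process link that drops a fixed number of messages, and all names and time-out values are assumptions.

```python
import queue
import threading


class LossyChannel:
    """Toy one-way link that silently drops the first `drop_first` messages."""

    def __init__(self, drop_first):
        self.drop_remaining = drop_first
        self.q = queue.Queue()

    def send(self, msg):
        if self.drop_remaining > 0:
            self.drop_remaining -= 1    # message is lost in transit
            return
        self.q.put(msg)


def receiver(channel, acks, delivered):
    """Deliver each sequence number once and acknowledge every copy received."""
    seen = set()
    while True:
        seq, payload = channel.q.get()
        if seq is None:                 # stop sentinel for the demo
            return
        if seq not in seen:             # ignore retransmitted duplicates
            seen.add(seq)
            delivered.append(payload)
        acks.put(seq)


def send_with_ack(channel, acks, seq, payload, timeout=0.05, max_tries=5):
    """Send, wait up to `timeout` seconds for the ack, retry; return tries used."""
    for attempt in range(1, max_tries + 1):
        channel.send((seq, payload))
        try:
            while acks.get(timeout=timeout) != seq:
                pass                    # skip stale acks for earlier messages
            return attempt
        except queue.Empty:
            continue                    # timed out: retransmit
    raise TimeoutError(f"no acknowledgment for message {seq}")


def demo():
    channel, acks, delivered = LossyChannel(drop_first=2), queue.Queue(), []
    worker = threading.Thread(target=receiver, args=(channel, acks, delivered))
    worker.start()
    attempts = send_with_ack(channel, acks, 0, "hello")
    channel.q.put((None, None))         # bypass the lossy path to stop the demo
    worker.join()
    return attempts, delivered
```

In `demo()` the first two transmissions are dropped, so the sender succeeds on its third attempt and the receiver delivers the payload exactly once.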

Fault Tolerance

容错

A program running on a single computer is, at best, only as reliable as that computer. Most distributed systems, on the other hand, need to remain at least partially available and functional even if some of their nodes, applications, or communication links fail or misbehave. In addition to outright failures, applications may suffer from unacceptably low quality of service due to bandwidth shortages, network contention, software overhead, or other system limitations, so fault-tolerance requirements present some of the most central, yet difficult challenges in the construction of distributed systems.

运行在某台计算机上的程序，充其量只能同这台计算机一样可靠。而大多数分布式系统却需要保持其功能至少部分可用，即使在系统中有部分节点、应用或通信链路完全中断或出现异常的时候。除彻底崩溃之外，因为带宽不足、网络拥挤、软件开销或其他系统限制，系统应用也可能发生不可接受的低质量服务。所以，高容错的要求，仍然代表了分布式架构中最核心的攻坚级挑战。
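A common way to stay partially available when a node fails is to try a replica. The sketch below simulates this with plain functions. `NodeDown`, `make_replica`, and `call_with_failover` are hypothetical names, and real systems would also need failure detection and state replication, which are not shown here.

```python
class NodeDown(Exception):
    """Raised by a simulated replica that is currently unavailable."""


def make_replica(name, alive):
    """Return a request handler that simulates one replica of a service."""
    def handle(request):
        if not alive:
            raise NodeDown(name)
        return f"{name} served {request}"
    return handle


def call_with_failover(replicas, request):
    """Try each replica in turn; succeed as long as at least one is alive."""
    failed = []
    for replica in replicas:
        try:
            return replica(request)
        except NodeDown as exc:
            failed.append(str(exc))     # remember the failure, try the next one
    raise RuntimeError(f"all replicas failed: {failed}")
```

With one dead and one live replica, the caller still gets an answer; only when every replica is down does the request fail.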

Security

安全

Only authorized users may access sensitive data or perform critical operations. Security in distributed systems is intrinsically a multilevel issue, ranging from the basic safety guarantees provided by the hardware and operating systems residing on each node; to message encryption and authentication protocols; to mechanisms supporting issues concerning privacy, appropriateness of content, and individual responsibility.

只有授权用户才能访问敏感数据或执行关键操作。分布式系统中的安全，本质上是一个多层次的问题，从硬件及隶属于每个节点上的操作系统所提供的基本安全保证到消息加密和协议校验再到支持关于隐私、内容适当性及个人责任等问题的机制。

Techniques for addressing trustworthiness include using digital certificates and preventing component code from performing potentially dangerous operations such as modifying disk files.

对于解决诚信问题的技术包括：数字证书的应用和对组件代码执行潜在危险操作的预防，比如修改磁盘文件。
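Message authentication, one of the levels mentioned in this section, can be illustrated with a keyed hash (HMAC) from Python's standard library. This sketch assumes the two nodes already share a secret key. It covers integrity and origin only, not encryption or certificates, and the function names `sign` and `verify` are illustrative.

```python
import hashlib
import hmac


def sign(key, message):
    """Compute an HMAC-SHA256 tag that the receiver can check."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key, message, tag):
    """Return True only if `tag` matches `message` under the shared key."""
    expected = sign(key, message)
    return hmac.compare_digest(expected, tag)  # constant-time comparison
```

A receiver that calls `verify` before acting on a message rejects anything altered in transit or sent by a party that does not hold the key.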

Message Passing

消息传递

Software on separate computers communicates via structured message-passing disciplines built upon a number of networking protocols (for example, TCP/IP). These, in turn, may run on any of a number of connection technologies (for example, Ethernet and modems). The nodes in most distributed systems are completely connected—any node may send a message to any other node. Delivery is mediated by underlying routing algorithms and related networking support.

分布在彼此孤立的计算机上的软件通过构建在一系列网络协议(例如TCP/IP)上的结构化消息传递规程来进行交流。接着，消息会在一些链接技术手段（如以太网和调制解调器）中运行。在大多数分布式系统中的节点都是完全互联的——任何节点都可以向任何其他节点发送消息。消息的传递是通过底层的路由算法和相关的网络支持来调解的。

Messages include commands, requests for services, event notifications, multimedia data, file contents, and even entire programs. It should be noted that most multiprocessors communicate via shared memory rather than message passing and therefore are not distributed.

消息包括：命令、服务请求、事件通知、多媒体数据、文件内容，甚至整个程序。应当指出的是，大多数多处理器（包含两台或多台功能相近的处理器，处理器之间彼此可以交换数据，所有处理器共享内存，I/O设备，控制器，及外部设备，整个硬件系统由统一的操作系统控制，在处理器和程序之间实现作业、任务、程序、数组及其元素各级的全面并行）是通过共享内存而并非消息传递来进行沟通的，因此并不是分布式的。
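Because a TCP-style stream carries bytes rather than discrete messages, a structured message-passing discipline needs some way to mark message boundaries. A common choice, sketched below as an assumption rather than as the method of any particular system, is a 4-byte length prefix followed by a JSON body. The demo uses `socket.socketpair()` in place of a real network connection, and the message fields are invented.

```python
import json
import socket
import struct


def send_msg(sock, msg):
    """Send one message as a 4-byte big-endian length followed by JSON."""
    body = json.dumps(msg).encode("utf-8")
    sock.sendall(struct.pack("!I", len(body)) + body)


def recv_exact(sock, n):
    """Read exactly n bytes, since recv may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        buf.extend(chunk)
    return bytes(buf)


def recv_msg(sock):
    """Receive one length-prefixed JSON message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode("utf-8"))


def demo():
    """Send a service request and an event notification, then read both back."""
    a, b = socket.socketpair()
    with a, b:
        send_msg(a, {"type": "request", "service": "quote", "symbol": "XYZ"})
        send_msg(a, {"type": "event", "name": "node-joined"})
        return [recv_msg(b), recv_msg(b)]
```

Even though both messages sit back to back in the stream, the length prefix lets the receiver split them apart correctly.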

Openness

开放性

Most sequential programs are considered closed because their configurations never change after execution commences. Most distributed systems are, to some degree, open, because nodes, components, and applications can be added or changed while the system is running. This provides the extensibility necessary to accommodate expansion, and the ability to evolve and cope with the changing world in which a system resides.

大部分顺序程序都被认为是封闭的，因为它们的配置在执行开始之后是永远不变的。而大部分分布式系统在某种程度上来说是开放的，因为系统运行时，节点、组件或应用可以被添加或修改。这为系统提供了必不可少的可扩展性，以及进化和适应变化中的世界的能力。

Openness requires that each component obey a certain minimal set of policies, conventions, and protocols to ensure interoperability among updated or added components. Historically, the most successful open systems have been those with the most minimal requirements. For example, the simplicity of the Hypertext Transfer Protocol (HTTP) was a major factor in the success of the World Wide Web.

开放性要求每个组件必须遵守最少的一组策略、约定和协议，以确保修改和新增的组件之间的互操作性。从历史的角度来看，最成功的开放系统已然是那些要求最少的系统。举个例子，超文本传输协议的简单易用正是万维网成功的一个主要因素。
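The idea of a minimal contract can be sketched as a registry whose only requirement is that a component be a callable taking one message. Components can then be added or replaced while the system keeps dispatching. The class `OpenSystem` and its methods are hypothetical and stand in for whatever conventions a real system defines.

```python
class OpenSystem:
    """Registry whose only convention is: a component is a callable(message)."""

    def __init__(self):
        self.components = {}

    def register(self, name, handler):
        """Add or replace a component at any time, even while others run."""
        if not callable(handler):
            raise TypeError("a component must be callable with one message")
        self.components[name] = handler

    def dispatch(self, name, message):
        """Route a message to a component; unknown names are reported, not fatal."""
        handler = self.components.get(name)
        if handler is None:
            return {"status": "unknown-component"}
        return {"status": "ok", "result": handler(message)}
```

For example, a request to a component that has not been registered yet returns `unknown-component`, and the same request succeeds once the component is added, without restarting anything.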

Standards organizations such as the International Organization for Standardization (ISO) and the American National Standards Institute (ANSI), along with industrial consortia such as the Object Management Group (OMG), establish the basic format and protocol standards underlying many interoperability guarantees. Individual distributed systems additionally rely on context-specific or domain-dependent policies and mechanisms.

标准组织如国际标准组织（ISO）、美国国家标准协会（ANSI），与诸如对象管理组（OMG）之类的工业联盟一起，在许多互操作性保障的基础上，创立了基本格式和协议标准。此外，私有分布式系统依赖于特定上下文或域相关的策略和机制。

Isolation

隔离

Each component is logically or physically autonomous, and it communicates with others only via structured message protocols. In addition, groups of components may be segregated for purposes of functionality, performance, or security. For example, while the connectivity of a corporate distributed system might extend to the entire Internet, its essential functionality could be segregated (often by a firewall) to an intranet operating only within the firewall. It would then communicate with other parts of the system via a restricted secure protocol.

每个组件在逻辑上或物理上都是自主的，并且仅通过结构化消息协议与其他组件通信。另外，出于功能、性能或安全的考量，组件群可能会被隔离。打个比方，企业级分布式系统的互联可能扩展至整个Internet，但它的基本功能却可以被隔离（通常由防火墙）到只能在防火墙内的内部网进行操作。随后它会通过受限的安全协议与系统的其他部分进行通信。

Persistence

持久化

At least some data and programs are maintained on persistent media that outlast the execution of a given application. Persistence may be arranged at the level of file systems, database systems, or programming language runtime support mechanisms.

至少一些维护在持久化媒介上的数据和程序比给定应用的执行要更加持久。持久化可以设置在文件系统、数据库系统、及编程语言运行机制的层级。
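Persistence at the file-system level can be sketched as state that is saved at the end of one run and reloaded at the start of the next. In the example below, each loop iteration stands in for a separate execution of the application. The file name, the `runs` counter, and the helper names are assumptions for the demo.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write state to a temporary file, then atomically swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())            # ask the OS to push data to the disk
    os.replace(tmp, path)               # atomic rename: never a half-written file


def load_state(path, default):
    """Load state left by an earlier run, or fall back to a default."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def demo():
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "state.json")
        for _ in range(3):              # each iteration models one program run
            state = load_state(path, {"runs": 0})
            state["runs"] += 1
            save_state(path, state)
        return load_state(path, {"runs": 0})
```

Writing to a temporary file and renaming it is a design choice that keeps the stored state readable even if the program stops partway through a save.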

Decentralized Control

分散控制

No single computer is necessarily responsible for configuration, management, or policy control for the system as a whole. Distributed systems are instead collections of autonomous agents, joined by protocols, that agree on enough common policies to provide an aggregate functionality. Some aspects of decentralization are desirable, such as fault-tolerance provisions. Others are essential because centralized control cannot accommodate the number of nodes and connections supported by contemporary systems. The tools for administering system-wide policies, however, may be restricted to particular users.

对于整个系统的配置、管理及策略控制来说，没有一台单独的计算机是必须负全责的。相反，分布式系统是由自治代理通过协议联结而成的集合，这些代理就足够多的通用策略达成一致，从而提供聚合功能。分散控制的某些方面是可取的，譬如说容错措施。其他方面则是必不可少的，因为集中控制无法容纳现代系统所支持的大量的节点和连接。当然，用于管理整个系统策略的工具，可能仅限于特定用户。
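One way autonomous nodes can agree on a common policy without a coordinator is gossip: each node repeatedly compares notes with a randomly chosen peer and keeps the newer version. The simulation below is a sketch under simplifying assumptions: a policy is reduced to an integer version number, exchanges never fail, and the function name `gossip` and its parameters are hypothetical.

```python
import random


def gossip(versions, seed=0, max_rounds=100):
    """Spread the newest policy version by random pairwise exchanges.

    `versions` maps node name to the policy version it currently holds.
    Returns (rounds_used, final_state). No node acts as a coordinator.
    """
    rng = random.Random(seed)           # seeded so the demo is repeatable
    nodes = list(versions)
    state = dict(versions)
    for round_no in range(1, max_rounds + 1):
        for node in nodes:
            peer = rng.choice([n for n in nodes if n != node])
            newest = max(state[node], state[peer])
            state[node] = state[peer] = newest   # both sides keep the newer one
        if len(set(state.values())) == 1:        # every node holds the same version
            return round_no, state
    return max_rounds, state
```

For instance, `gossip({"a": 1, "b": 3, "c": 2, "d": 7})` ends with every node holding version 7, even though no single node ever directed the others.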

ADVANTAGES AND DISADVANTAGES

优点和缺点

Advantages of Distributed Systems

分布式的优点

Distributed systems have many inherent advantages, especially over centralized systems. Some applications are inherently distributed as well. In general, distributed systems:

分布式有许多固有优势，相对于集中式系统来说尤为如此。一些应用本身就具有分布式特性。一般来说，分布式系统：

● Yield higher performance

生产更高的性能

● Provide higher reliability

提供更高的可靠性

● Allow incremental growth

支持增量增长

Distributed computing also offers a higher rate of return than individual CPUs:

分布式计算提供的回报率也远在单个CPU之上：

● It is both feasible and easy to construct systems of a large number of CPUs connected by a high-speed network.

它使得大量由高速网络连接起来的CPU的系统架构变得可行而又简单。

● It answers a need to share data scattered over these CPUs.

它满足了共享分散在这些CPU上的数据的需求。

● It provides a way to share expensive peripherals.

它提供了一种共享昂贵设备的方式

● It allows one user to run a program on many different machines.

它允许用户在不同的机器上运行同一个程序。

Disadvantages of Distributed Systems

分布式系统的缺点

In spite of their many advantages, distributed systems do create a few disadvantages. Some of these are:

尽管分布式系统有很多优点，但也有一些缺点。比如：

● The need for new operating systems to support them

需要有支持分布式系统的新的操作系统

● A reliance on network communications

依赖网络通信

● The need for enhanced security

必须提高安全性

● The lack of a clean classification of the operating systems involved

缺乏较好的操作系统分类

● The need to handle both loosely and tightly coupled systems

需要同时应对松耦合系统和紧耦合系统

CONCLUSIONS

结论

We want to fully utilize a heterogeneous computing environment where different types of processing resources and interconnection technologies are used effectively and efficiently. Using distributed resources provides the potential of maximizing performance and cost effectiveness for a wide range of scientific and distributed applications.

我们希望充分利用异构计算环境，使其中不同类型的处理资源和互联技术都能有效并高效地发挥作用。使用分布式资源，有望为广泛的科学应用和分布式应用实现性能和成本效益的最大化。

Distributed computing environments comprising networked heterogeneous workstations are becoming the standard configuration in both engineering and scientific environments. However, the lack of a unifying parallel computing model (a parallel equivalent of a von Neumann model) means that the current parallel applications are nonportable.

由网络异构工作站组成的分布式计算环境已经成为工程环境和科学环境中的标准配置。然而，缺乏统一的并行计算模型（与冯诺依曼模型等效），意味着当前并行应用程序是不可移植的。