Conventional single processor computers are classified as SISD systems. Each arithmetic instruction initiates an operation on a data item taken from a single stream of data elements. Historical supercomputers such as the Control Data Corporation 6600 and 7600 fit this category as do most contemporary microprocessors.
Vector processors such as the Cray-1 and its descendants are often classified as SIMD machines, although they are more properly regarded as SISD machines. Vector processors achieve their high performance by passing successive elements of vectors through separate pieces of hardware dedicated to independent phases of a complex operation. For example, in order to add two numbers such as $3.1 \times 10^2$ and $4.5 \times 10^3$, the numbers must have the same exponent. The processor must shift the mantissa (and increment the exponent) of the number with the smaller exponent until its exponent matches the exponent of the other number. In this example $3.1 \times 10^2$ is adjusted to $0.31 \times 10^3$ so it can be added to $4.5 \times 10^3$, and the sum is $4.81 \times 10^3$. A vector processor is specially constructed to feed a data stream into the processor at a high rate, so that as one part of the processor is adding the mantissas in the pair $(a_i, b_i)$ another part of the processor is adjusting the exponents in $(a_{i+1}, b_{i+1})$.
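The alignment step can be sketched in a few lines of Python. This is a simplified base-10 model of the first phase of a pipelined floating-point add, not any real machine's hardware; the function name `align_and_add` is invented for the illustration.

```python
def align_and_add(m1, e1, m2, e2):
    """Add two numbers given as (mantissa, exponent) pairs in base 10.

    Before the mantissas can be added, the number with the smaller
    exponent has its mantissa shifted right (divided by 10) and its
    exponent incremented until the two exponents match.
    """
    while e1 < e2:
        m1 /= 10.0
        e1 += 1
    while e2 < e1:
        m2 /= 10.0
        e2 += 1
    return m1 + m2, e1

# 3.1 x 10^2 is adjusted to 0.31 x 10^3, then added to 4.5 x 10^3,
# giving approximately 4.81 x 10^3
print(align_and_add(3.1, 2, 4.5, 3))
```

In a pipelined implementation the two loops and the final add would be separate hardware stages, each working on a different element pair of the vector at the same moment.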
The ambiguity over the classification of vector machines depends on how one views the flow of data. A static ``snapshot'' of the processor during the processing of a vector would show several pieces of data being operated on at one time, and under this view one could say one instruction (a vector add) initiates several data operations (adjust exponents, add mantissas, etc.) and the machine might be classified SIMD. A more dynamic view shows that there is just one stream of data, and elements of this stream are passed sequentially through a single pipeline (which implements addition in this example). Another argument for not including vector machines in the SIMD category will be presented when we see how SIMD machines implement vector addition.
☆SIMD Computers:
SIMD machines have one instruction processing unit, sometimes called a controller and indicated by a K in the PMS notation, and several data processing units, generally called D-units or processing elements (PEs). The first operational machine of this class was the ILLIAC-IV, a joint project by DARPA, Burroughs Corporation, and the University of Illinois Institute for Advanced Computation [5]. Later machines included the Distributed Array Processor (DAP) from the British corporation ICL, and the Goodyear MPP. Two recent machines, the Thinking Machines CM-1 and the MasPar MP-1, are discussed in detail in Section 3.1.2.
The control unit is responsible for fetching and interpreting instructions. When it encounters an arithmetic or other data processing instruction, it broadcasts the instruction to all PEs, which then all perform the same operation. For example, the instruction might be ``add R3,R0.'' Each PE would add the contents of its own internal register R3 to its own R0. To allow for needed flexibility in implementing algorithms, a PE can be deactivated. Thus on each instruction, a PE is either idle, in which case it does nothing, or it is active, in which case it performs the same operation as all other active PEs. Each PE has its own memory for storing data. A memory reference instruction, for example ``load R0,100'' directs each PE to load its internal register with the contents of memory location 100, meaning the 100th cell in its own local memory.
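The broadcast semantics can be modeled directly: one controller routine applies the same operation to every active PE's private registers. This is a minimal sketch, not any real machine's instruction set; the `PE` class and `broadcast_add` function are hypothetical names invented for the illustration.

```python
class PE:
    """One processing element: private registers plus an activity bit."""
    def __init__(self):
        self.regs = {"R0": 0, "R3": 0}
        self.active = True

def broadcast_add(pes, src, dst):
    """The controller broadcasts ``add src,dst'': every active PE
    computes dst = dst + src on its own registers; idle PEs do nothing."""
    for pe in pes:
        if pe.active:
            pe.regs[dst] += pe.regs[src]

pes = [PE() for _ in range(4)]
for i, pe in enumerate(pes):
    pe.regs["R3"] = i        # each PE holds a different data item
    pe.regs["R0"] = 10
pes[2].active = False        # deactivate one PE for this instruction

broadcast_add(pes, "R3", "R0")   # one instruction, many data items
print([pe.regs["R0"] for pe in pes])   # [10, 11, 10, 13]
```

Note that the deactivated PE keeps its old value of R0: deactivation, rather than separate instruction streams, is how an SIMD machine handles data-dependent branching.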
One of the advantages of this style of parallel machine organization is a savings in the amount of logic. Anywhere from 20% to 50% of the logic on a typical processor chip is devoted to control, namely to fetching, decoding, and scheduling instructions. The remainder is used for on-chip storage (registers and cache) and the logic required to implement the data processing (adders, multipliers, etc.). In an SIMD machine, only one control unit fetches and processes instructions, so more logic can be dedicated to arithmetic circuits and registers. For example, 32 PEs fit on one chip in the MasPar MP-1, and a 1024-processor system is built from 32 chips, all of which fit on a single board (the control unit occupies a separate board).
Vector processing is performed on an SIMD machine by distributing elements of vectors across all data memories. For example, suppose we have two vectors, $a$ and $b$, and a machine with 1024 PEs. We would store $a_i$ in location 0 of memory $i$ and $b_i$ in location 1 of memory $i$. To add $a$ and $b$, the machine would tell each PE to load the contents of location 0 into one register, the contents of location 1 into another register, add the two registers, and write the result. As long as the number of PEs is greater than the length of the vectors, vector processing on an SIMD machine is done in constant time, i.e. it does not depend on the length of the vectors. Vector operations on a pipelined SISD vector processor, however, take time that is a linear function of the length of the vectors.
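The distribution scheme can be sketched as follows. Each PE's local memory holds one element of $a$ and one element of $b$, and a single broadcast sequence (load, load, add, store) produces the entire result vector. The loop below stands in for what the hardware does in lockstep across all PEs at once; the variable names are invented for this sketch.

```python
N_PES = 1024
a = [float(i) for i in range(100)]       # vectors shorter than N_PES
b = [2.0 * i for i in range(100)]

# Local memory of PE i: location 0 holds a[i], location 1 holds b[i].
# PEs beyond the vector length hold no data and sit idle.
memories = [[a[i], b[i]] if i < len(a) else None for i in range(N_PES)]

# The controller broadcasts one instruction sequence; in hardware every
# PE executes it simultaneously, so the whole loop is one "time step".
results = []
for mem in memories:
    if mem is None:
        continue                 # deactivated PE: no vector element here
    r0 = mem[0]                  # load R0, location 0
    r1 = mem[1]                  # load R1, location 1
    mem.append(r0 + r1)          # add and store the sum in location 2
    results.append(mem[2])

print(results[:5])   # [0.0, 3.0, 6.0, 9.0, 12.0]
```

Because every PE performs its add during the same broadcast, the elapsed time is the same whether the vectors have 10 elements or 1024, which is the constant-time behavior contrasted above with the pipelined SISD case.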
☆MISD Computers:
There are few machines in this category, and none has been commercially successful or had any impact on computational science. One type of system that fits the description of an MISD computer is a systolic array, a network of small computing elements connected in a regular grid. All the elements are controlled by a global clock. On each cycle, an element reads a piece of data from one of its neighbors, performs a simple operation (e.g. adds the incoming element to a stored value), and prepares a value to be written to a neighbor on the next step.
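A one-dimensional version of this behavior can be sketched in software. In the toy model below, on every clock tick each cell takes the value arriving from its left neighbor, adds it into a local accumulator (the ``simple operation''), and forwards it rightward on the next tick; after the stream drains, every cell has accumulated the whole stream. The function `systolic_pass` is a name invented for this sketch, not a real library routine.

```python
def systolic_pass(stream, n_cells):
    """Clock a data stream through a row of n_cells systolic cells.
    Each cell adds every value that flows past it into its accumulator."""
    acc = [0] * n_cells            # each cell's stored value
    pipe = [None] * n_cells        # value held at each cell this cycle
    # extra empty ticks at the end let the last values drain through
    for x in list(stream) + [None] * n_cells:
        for i in range(n_cells - 1, 0, -1):   # values advance one cell
            pipe[i] = pipe[i - 1]
        pipe[0] = x                           # next value enters the array
        for i, v in enumerate(pipe):
            if v is not None:
                acc[i] += v                   # the cell's simple operation
    return acc

print(systolic_pass([1, 2, 3], 3))   # [6, 6, 6]
```

The MISD character shows up in the fact that a single stream of data visits every cell, while each cell could in principle apply a different stored operation to it.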
One could make a case for pipelined vector processors fitting in this category as well, since each step of the pipeline corresponds to a different operation being performed on the data as it flows past that stage in the pipe. There have been pipelined processors with programmable stages, i.e. the function applied at each location in the pipeline could vary; however, a pipeline stage does not fetch its operation from a local control memory, so it would be difficult to classify a stage as a ``processor.''
☆MIMD Computers:
The category of MIMD machines is the most diverse of the four classifications in Flynn's taxonomy. It includes machines with processors and memory units specifically designed to be components of a parallel architecture, large scale parallel machines built from ``off the shelf'' microprocessors, small scale multiprocessors made by connecting four vector processors together, and a wide variety of other designs. With the continued improvement in network communication and the development of software packages that allow programs running on one machine to communicate with programs on other machines, users are even starting to use local networks of workstations as MIMD systems.
Computer systems with two or more independent processors have been available commercially for a long time. For example, the Burroughs Corporation sold dual processor versions of its B6700 systems in the 1970s. These were rarely, if ever, used to work on the same job, however. Multiprocessors of this era were intended to be used for job level parallelism, i.e. each processor would run a separate program. Parallel processing, in the sense of using more than one processor in the execution of a single program, has been an active area in corporate and academic research labs since the early 1970s. The C.mmp and Cm* projects at Carnegie Mellon University used DEC PDP-11 minicomputers as processing elements and pioneered several important developments in parallel hardware and software. Commercial parallel processors started to become widely used in the mid 1980s. By the early 1990s these systems began to approach top of the line vector processors in computing power, and the trend for future high performance computing is clearly with parallel processing.