- Classification of parallel processor systems based on Flynn's taxonomy
- Single instruction, single data stream
-
- SISD
-
- Single processor
- Single instruction stream
- Data stored in single memory
- Uni-processor
-
- Pipeline
- Single instruction, multiple data stream
-
- SIMD
-
- Multiple instruction, single data stream
-
- MISD
-
- Never commercially implemented
- Multiple instruction, multiple data stream
-
- MIMD
-
- Organizational classification of a multiprocessor system
- Time-shared or common bus: simplest (SMP)
-
- Pros
-
- Simplicity
- Flexibility
- Reliability: failure of a device should not cause failure of the whole system
- Cons
-
- Performance limited by bus cycle time
- Each processor should have a local cache
- This leads to cache coherence problems: solved in hardware (protocol)
- Multiport memory
-
- Direct, independent access of memory modules by each processor or I/O module
- Logic required to resolve conflicts
-
- Priority assigned to each port
- Little or no modification to processors or I/O modules required
- Pros
-
- Better performance
- Can configure portions of memory as private to one or more processors
- Cons
-
- More complex control
- A write-through cache policy should be used for cache control
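The conflict-resolution logic mentioned above can be sketched as a fixed-priority arbiter: when several ports request the same memory module in one cycle, the highest-priority (here, lowest-numbered) port wins and the rest must retry. This is an illustrative sketch, not a real hardware description; the function and port-numbering scheme are assumptions.

```python
# Hypothetical sketch of fixed-priority arbitration for a multiport memory:
# each port may request one memory module per cycle; if several ports
# request the same module, the lowest-numbered (highest-priority) port wins.

def arbitrate(requests):
    """requests: dict mapping port id -> requested module id.
    Returns dict mapping module id -> winning port id."""
    winners = {}
    for port in sorted(requests):      # lower port id = higher priority
        module = requests[port]
        if module not in winners:      # first (highest-priority) requester wins
            winners[module] = port
    return winners

# Ports 0 and 2 both want module 5; port 0 wins, port 2 must retry.
print(arbitrate({0: 5, 1: 7, 2: 5}))   # {5: 0, 7: 1}
```

A real arbiter might instead rotate priorities (round-robin) to avoid starving low-priority ports.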
- Central control unit
-
- Pros
-
- Flexibility
- Simplicity of interface
- Cons
-
- Structure is complex -> the control unit can become a bottleneck
- Interconnect networks
-
- A network in which switch components are interconnected according to a certain topology and control mode
-
- Through such a network, many processors or functional units can be connected; the interconnection network is a key element in computer performance
- Types of Interconnection Networks
-
- Static interconnection networks
- Dynamic interconnection networks
- SMP: Symmetric multiprocessor
- Processors share a single memory or pool of memory
- Memory is accessed by means of a shared bus or other interconnection mechanism
- Memory access time to a given region of memory is approximately the same for each processor
- Processors share the same memory and I/O
- Pros:
-
- Greater performance: parallel work
- Availability/Fault-tolerance: failure of a single processor does not halt the system
- Incremental growth: add additional processors
- Scalable
- Existence of multiple processors is transparent to users
- NUMA: Non-uniform memory access
- Access time to different regions of memory may differ in a NUMA system
- CC-NUMA: cache-coherent NUMA
- A NUMA system without cache coherence is more or less equivalent to a cluster
- With an SMP system, as the number of processors increases, the bus may become a performance bottleneck
-
- The bus traffic increases
- Cache coherence signals further add to the burden
- So the number of processors is not infinitely scalable
-
- Typically 16~64 processors in an SMP
- With a cluster, each node has its own private main memory; applications do not see a large global memory, which limits the maximum achievable performance
- NUMA can compensate for both of the above limitations
- Cache coherence
-
- Needs a directory-based protocol
- If shared data is modified in a cache, this fact can be broadcast to the other nodes
- Pros
-
- Higher levels of parallelism than SMP, without major software changes
- The network traffic is limited, because remote accesses are not excessive
- Cons
-
- Needs a new OS to support it
- Availability concerns
- Clusters
- Collections of independent uniprocessors or SMPs
- Interconnected to form a cluster
- Communication via fixed path or network connections
- Pros
-
- Absolute scalability
- Incremental scalability
- High availability
- Superior price/performance
- Lightweight clusters
-
- Passive standby: the primary periodically sends a heartbeat to the standby
- Active secondary
- Cache coherence
- Software solutions
-
- Compiler and operating system-based
-
- The compiler analyzes source code to mark data that cannot be placed in a local cache
- The compiler may insert additional instructions to enforce cache coherence during critical periods
- Pros
-
- Overhead transferred from run time to compile time
- Design complexity transferred from hardware to software
- Cons
-
- Inefficient cache utilization
- Not transparent to compiler designers and some programmers
- Hardware solution
-
- Cache coherence protocols
-
- Dynamic recognition of potential problems at run time
- More efficient use of cache
- Transparent to programmers
- Two categories
-
- Directory protocols
-
- Collect and maintain global information about copies of data in local caches
- Directory stored in main memory
- When a processor writes to its local cache, the directory is checked to determine whether the line is present in other local caches
-
- If not, the processor simply writes its cache
- Otherwise, the controller informs all processors holding the line to invalidate their copies; after all ACKs are received, the requesting processor writes its cache
- Thereafter, if another processor tries to read that line, it misses and notifies the controller; the controller commands the processor holding the line to write it back to main memory
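The directory write flow above can be sketched in a few lines. This is an illustrative model, not a real protocol implementation: the `Directory` class, its methods, and the flat `holders` map are all assumptions, and invalidation/ACK messaging is collapsed into direct set updates.

```python
# Illustrative sketch of directory-based invalidation: the directory
# (kept in main memory) tracks which caches hold each line; on a write,
# every other holder is invalidated before the writer proceeds.

class Directory:
    def __init__(self):
        self.holders = {}   # line address -> set of cache ids holding it

    def read(self, cache_id, line):
        # A read miss fetches the line and records the cache as a holder.
        self.holders.setdefault(line, set()).add(cache_id)

    def write(self, cache_id, line):
        # Invalidate every other holder (send invalidate, await ACK),
        # then record the writer as the sole holder.
        others = self.holders.get(line, set()) - {cache_id}
        for other in others:
            self.holders[line].discard(other)
        self.holders.setdefault(line, set()).add(cache_id)
        return sorted(others)   # caches whose copies were invalidated

d = Directory()
d.read(0, 0x100)
d.read(1, 0x100)
print(d.write(0, 0x100))   # [1]: cache 1's copy was invalidated
print(d.holders[0x100])    # {0}: only the writer still holds the line
```

The write-back step described above (a later read miss forcing the modified line back to memory) would be an extra message exchange layered on the same holder bookkeeping.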
-
- Cache controller
-
- Central controller: effective in large-scale systems with complex interconnection schemes
-
- One directory stores all cache coherence information
- Distributed controller: complex
-
- Each cache has its own directory
- Snoopy protocols
-
- Updates announced to other caches by broadcast
- MESI: modified, exclusive, shared, invalid (uses 2 status bits per cache line tag)
- Designed to support cache consistency
-
- Multiprocessor
- Multi-level cache
- Status
-
- Modified:
-
- The line has been modified and is available only in this cache
- Exclusive:
-
- The line is the same as that in main memory and is not present in any other cache
- Shared:
-
- The line may also be present in other caches and is the same as in main memory
- Invalid:
-
- The line does not contain valid data; the access must go to main memory
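The four MESI states above can be summarized as a transition table for a single cache line. This is a simplified sketch covering only the common events (the function name, event names, and the assumption that a read miss with no other sharers loads in Exclusive state are all illustrative; a full protocol also distinguishes bus transactions and write-backs):

```python
# Simplified MESI state machine for one cache line, as seen by one cache.
# Events: local_read / local_write (this cache), remote_read / remote_write
# (snooped from the bus). Unlisted (state, event) pairs leave the state unchanged.

def mesi_next(state, event):
    table = {
        ('I', 'local_read'):   'E',  # no other copy found -> Exclusive (else Shared)
        ('I', 'local_write'):  'M',
        ('E', 'local_write'):  'M',  # silent upgrade: line already exclusive
        ('E', 'remote_read'):  'S',
        ('M', 'remote_read'):  'S',  # write back to memory, then share
        ('S', 'local_write'):  'M',  # broadcast invalidate to other copies first
        ('M', 'remote_write'): 'I',
        ('E', 'remote_write'): 'I',
        ('S', 'remote_write'): 'I',
    }
    return table.get((state, event), state)

state = 'I'
for ev in ['local_read', 'local_write', 'remote_read']:
    state = mesi_next(state, ev)
print(state)   # 'S': a modified line becomes shared after a remote read
```

Note how the table realizes the invariants in the list above: at most one cache can be in M or E for a line, and any remote write drives every other copy to I.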
- Vector computation
Chapter 18 Parallel Processing