Multiprocessing (parallel computer architecture)

  • Reference: Computer Architecture (6th Edition)

Multiprocessing

subprocessor: coprocessor

Classification of computer architecture

Flynn’s Taxonomy (Flynn 分类法):

  • Basic idea: a computer works by executing an instruction stream and processing a data stream. Flynn classifies computer architectures by the multiplicity of the instruction stream and the data stream
    • Instruction stream: the sequence of instructions executed by the machine
    • Data stream: the sequence of data called for by the instruction stream (including input data and intermediate results)
    • Multiplicity: the maximum number of instructions or data items that are in the same execution stage at the system's bottleneck component (i.e., how many data items one instruction can operate on at most)


  • SISD: single instruction stream, single data stream; how a uniprocessor works
    • the model used by the von Neumann architecture
  • SIMD (Data Level Parallelism): a single instruction stream runs on one machine, but each instruction can operate on multiple data items at once
    • vector processors, array processors
  • MISD: does not exist in practice
  • MIMD (Thread (Process) Level Parallelism): multiple instruction streams → multiprogramming → multiprocessors; each instruction stream processes its own data
    • Multi-computers: clusters
    • Multi-processors: multiple CPUs in one machine, multi-core
      • Reason for multicores: physical limitations can cause significant heat dissipation and data synchronization problems
      • e.g. the Intel Core 2 dual-core processor, with CPU-local Level 1 caches + a shared, on-die Level 2 cache (sharing the L2 cache between cores makes data synchronization somewhat easier to handle)

SISD v. SIMD v. MIMD


PU: processing unit; Instruction Pool: instruction cache; Data Pool: data cache

Challenges to Parallel Programming

The first challenge is the fraction of the program that is inherently sequential

  • attacked primarily via new algorithms that have better parallel performance

Example

  • Suppose we want an 80X speedup from 100 processors. What fraction of the original program can be sequential?
    a. 10% b. 5% c. 1% d. <1%
  • Amdahl’s Law answers: <1% (derivation below)
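
The derivation behind the answer (reconstructed; the figure that stood here presumably showed this calculation, which follows directly from Amdahl's Law):

```latex
\text{Speedup} = \frac{1}{F_{\text{seq}} + \frac{1-F_{\text{seq}}}{100}} = 80
\;\Longrightarrow\;
F_{\text{seq}} + \frac{1-F_{\text{seq}}}{100} = \frac{1}{80} = 0.0125
\;\Longrightarrow\;
0.99\,F_{\text{seq}} = 0.0025
\;\Longrightarrow\;
F_{\text{seq}} \approx 0.25\% < 1\%
```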

Second challenge is long latency to remote memory

  • attacked both by the architect and by the programmer. For example, reduce the frequency of remote accesses either by
    • Caching shared data (HW)
    • Restructuring the data layout to make more accesses local (SW)

Example

  • Suppose a 32-CPU MP, 2 GHz clock, 200 ns remote memory access; all local accesses hit in the memory hierarchy and the base CPI is 0.5. What is the performance impact if 0.2% of instructions involve a remote access?
    a. 1.5X b. 2.0X c. 2.6X
  • CPI Equation
    • Remote access cost = 200 ns × 2 GHz = 400 clock cycles
    • CPI = Base CPI + Remote request rate × Remote request cost = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3
    • With no communication (all references local), the MP is 1.3/0.5 = 2.6 times faster than when 0.2% of instructions involve a remote access

SIMD (Data-Level Parallelism)

Vector Processor

Why Vector Processors?

  • A single vector instruction specifies a great deal of work—it is equivalent to executing an entire loop. Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control hazards that would normally arise from the loop branch are nonexistent.
  • The computation of each result (for one element) in the vector is independent of the computation of other results in the same vector and so hardware does not have to check for data hazards within a vector instruction. Hardware need only check for data hazards between two vector instructions once per vector operand, not once for every element within the vectors.
  • Vector instructions that access memory have a known access pattern (an array’s layout in memory is fixed, so the memory accesses are fixed as well). A scalar version of the kind of loop one vector instruction sequence replaces is sketched below.
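
As a concrete illustration (my sketch, not from the source), here is the classic DAXPY kernel as a scalar C loop; on a vector machine it collapses into a handful of vector instructions:

```c
/* DAXPY: Y = a*X + Y. As a scalar loop this costs a branch per element;
   as vector code it is roughly: load X, load Y, multiply by scalar,
   add, store, with no loop-control hazards. */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* every iteration is independent */
}
```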

Basic Vector Architecture

vector-register processors

  • In a vector-register processor, all vector operations—except load and store—are among the vector registers. (The vector registers must therefore be large enough.)

memory-memory vector processors

  • In a memory-memory vector processor, all vector operations are memory to memory.

Vector Memory-Memory vs. Vector Register Machines


An instruction mnemonic with “V” appended denotes a vector instruction.

  • Vector memory-memory architectures (VMMA) require greater main memory bandwidth, why?
    • All operands must be read in and out of memory
  • VMMAs make it difficult to overlap execution of multiple vector operations, why?
    • Must check dependencies on memory addresses
  • VMMAs incur greater startup latency → all major vector machines since the Cray-1 have had vector register architectures (we ignore vector memory-memory from now on)
    • Scalar code was faster on the CDC Star-100 (memory-memory) for vectors < 100 elements
    • For the Cray-1 (vector register), the vector/scalar breakeven point was around 2 elements

Vector Supercomputers

Epitomized by Cray-1, 1976: Scalar Unit + Vector Extensions

  • Load/Store Architecture, Vector Registers, Vector Instructions, Hardwired Control, Interleaved Memory System, Highly Pipelined Functional Units (the element computations of a vector instruction are in fact not performed all in parallel; they flow through a fast, deep pipeline, which can be fast because no dependence checking is needed within a vector)

Vector Programming Model


Stride: needed, for example, when fetching a two-dimensional array column by column.

Multimedia Extensions (aka SIMD extensions)

What current CPUs typically integrate are multimedia extensions, which resemble short-vector operations.

  • Very short vectors added to existing ISAs for microprocessors. Use existing 64-bit registers split into 2×32b or 4×16b or 8×8b (newer designs have wider registers)
  • Single instruction operates on all elements within register
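As one concrete instance (an assumption of this note; the text names only "SIMD extensions" generically), x86's SSE2 extension exposes such operations as C intrinsics. The sketch below adds eight 16-bit lanes with a single instruction:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* One 128-bit register holds 8 x 16-bit elements; _mm_add_epi16 adds
   all eight lanes with one instruction. */
void add8x16(const short *a, const short *b, short *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_add_epi16(va, vb);                /* 8 adds at once */
    _mm_storeu_si128((__m128i *)out, vc);
}
```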

Multimedia Extensions versus Vectors

  • Limited instruction set: no vector length control, no strided load/store or scatter/gather, unit-stride loads must be aligned to 64/128-bit boundary
  • Limited vector register length: requires superscalar dispatch to keep multiply/add/load units busy, loop unrolling to hide latencies increases register pressure
  • Trend towards fuller vector support in microprocessors: Better support for misaligned memory accesses; Support of double-precision (64-bit floating-point); New Intel AVX spec (announced April 2008), 256b vector registers (expandable up to 1024b)

The basic structure of a vector-register architecture

VMIPS

Primary Components of VMIPS

  • Vector registers — VMIPS has eight vector registers, and each holds 64 elements. Each vector register must have at least two read ports and one write port.
  • Vector functional units — Each unit is fully pipelined and can start a new operation on every clock cycle.
    • In VMIPS, vector operations use the same names as MIPS operations, but with the letter “V” appended.
  • Vector load-store unit — The VMIPS vector loads and stores are fully pipelined, so that words can be moved between the vector registers and memory with a bandwidth of 1 word per clock cycle, after an initial latency.
  • A set of scalar registers —Scalar registers can also provide data as input to the vector functional units, as well as compute addresses to pass to the vector load-store unit.
Vector Code Example


VLR: vector length register

Automatic Code Vectorization
  • Scalar Sequential Code:
  • Vectorized Code: Vectorization is a massive compile-time reordering of operation sequencing → requires extensive loop dependence analysis
Vector Arithmetic Execution (deep pipeline + multiple independent memory banks)
  • A single (vector) instruction launches pipelined memory accesses for all elements of the vector
    • To produce results every clock, multiple memory banks are used
      • Allow multiple loads and stores per clock cycle
      • Allow for independent management of different memory addresses
      • Example: assume 8 memory banks and 6 cycles of memory bank busy time to deliver a data item – the hardware overlaps multiple data requests (see the condition worked out below)
    • Use deep pipeline (=> fast clock) to execute element operations. Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
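
The sustained-rate condition behind the bank example (my formulation of the standard interleaving argument):

```latex
\text{element } k \;\mapsto\; \text{bank } (k \bmod 8),
\qquad
\text{1 word/cycle sustainable} \iff \#\text{banks} \ge \text{bank busy time}
```

Here 8 banks ≥ 6 cycles: a bank is revisited only every 8 cycles but is busy for 6, so it is always ready.
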
Vector Stripmining

Strip mining: processing a long vector in register-sized segments.

  • Problem: Vector registers have finite length
    • The solution is to create a vector-length register (VLR), which controls the length of any vector operation. The value in the VLR, however, cannot be greater than the length of the vector registers— maximum vector length (MVL).
    • If the vector is longer than the maximum length, stripmining is used: break loops into pieces that fit into the vector registers (sketched below)
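
A minimal C sketch of a strip-mined DAXPY loop (illustrative; `MVL` = 64 matches the VMIPS registers described earlier):

```c
#define MVL 64   /* maximum vector length, e.g. VMIPS registers hold 64 */

/* Strip-mined DAXPY: a vector of arbitrary length n is processed in
   strips of at most MVL elements. On real hardware the inner loop is
   one vector instruction sequence executed with VLR = vl. */
void daxpy_stripmined(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i += MVL) {
        int vl = (n - i < MVL) ? (n - i) : MVL;  /* this strip's length */
        for (int j = i; j < i + vl; j++)         /* "one vector operation" */
            y[j] = a * x[j] + y[j];
    }
}
```
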
Vector Stride

Stride: the spacing in memory between successive vector elements.

  • Consider vectorizing the multiplication of each row of B with each column of C. When an array is allocated memory, it is linearized and must be laid out in either row-major or column-major order. This linearization means that either the elements in a row or the elements in a column are not adjacent in memory; one of the two is always non-contiguous (see the sketch at the end of this section).
  • This distance separating elements that are to be gathered into a single register is called the stride.
  • The vector stride, like the vector starting address, can be put in a general-purpose register. Then the VMIPS instruction LVWS (load vector with stride) can be used to fetch the vector into a vector register. Likewise, when a nonunit stride vector is being stored, SVWS (store vector with stride) can be used.
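
A minimal C sketch of where a non-unit stride comes from (C is row-major; the column walk below is the case that needs an LVWS-style strided load):

```c
#define N 64
double b[N][N];   /* row-major: b[i][j] and b[i][j+1] are adjacent */

/* Summing column k touches b[0][k], b[1][k], ... which are N doubles
   (N*8 bytes) apart in memory: a stride of N elements. Loading this
   column into a vector register needs LVWS in VMIPS terms. */
double sum_column(int k) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += b[i][k];   /* consecutive accesses are N elements apart */
    return sum;
}
```
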
Vector Chaining
  • Vector version of register bypassing: the concept of forwarding extended to vector registers, making a sequence of dependent vector operations run faster (because the elements of a vector are computed in a pipeline, as soon as one element of the preceding vector instruction’s result is produced, it can be fed to the next vector instruction; sketched below)
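
A small C rendering of the classic chaining example (my sketch; the two loops stand in for two dependent vector instructions):

```c
/* MULV.D V3,V1,V2 followed by ADDV.D V5,V3,V4 (the add depends on V3). */
void chained(int vl, const double *v1, const double *v2,
             double *v3, const double *v4, double *v5) {
    for (int i = 0; i < vl; i++) v3[i] = v1[i] * v2[i];  /* MULV */
    for (int i = 0; i < vl; i++) v5[i] = v3[i] + v4[i];  /* ADDV */
}
/* Without chaining, the ADDV must wait for the whole MULV result vector;
   with chaining, it consumes v3[i] as soon as the multiply pipeline
   produces it, so the two operations overlap almost completely. */
```
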
Multiple Lanes

Multiple parallel datapaths ("lanes").

  • increase the peak performance of a vector machine by adding more parallel execution units

Array Processor

Array machine (SIMD)

Basic idea:

  • A single control unit (not multiple cores) provides the signals to drive many Processing Elements, which run in lockstep (all PEs execute the same instruction on different data → SIMD)
    • PE organization: each PE’s functional units are driven by the control unit, which is in turn driven by the instruction/control bus

*GPU

GPU vs. CPU architecture

  • The GPU Devotes More Transistors to Data Processing.
  • CPU: more resources devoted to caches and control logic

The basis of CUDA’s high-speed computation

CUDA (Compute Unified Device Architecture) → a programming interface

  • Computing coherence
    • Single-program multiple-data (SPMD) execution model (not exactly SIMD): every core can execute a piece of the program’s code, which becomes a tiny thread scheduled onto an ALU (thousands of cores computing in parallel); also called SIMT, single instruction multiple threads (see the sketch after this list)
    • Massive parallel computing resources: thousands of GPU cores so far; thousands of threads on the fly
  • Hiding memory latency: raise the compute-to-communication ratio and coalesce accesses to adjacent memory addresses; fast thread switching
    • 1 cycle@GPU vs. ~1000 cycles@CPU
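
A minimal plain-C sketch of the SPMD idea (illustrative only; it emulates a kernel launch with a sequential loop and does not use the real CUDA API; `vec_add_kernel` and `launch_vec_add` are made-up names):

```c
/* The same "kernel" body runs once per thread; instances differ only
   in their thread id. On a GPU, thousands of instances run at once. */
void vec_add_kernel(int tid, int n, const float *a, const float *b, float *c) {
    if (tid < n)
        c[tid] = a[tid] + b[tid];   /* each thread handles one element */
}

/* Sequential emulation of the parallel launch. */
void launch_vec_add(int n, const float *a, const float *b, float *c) {
    for (int tid = 0; tid < n; tid++)
        vec_add_kernel(tid, n, a, b, c);
}
```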

Suitable applications

  • A GPU pays off only on highly data-parallel tasks: large volumes of data, stored in something like a regular grid, with essentially the same processing applied to every element
    • e.g., image processing, physical simulation (such as computational fluid dynamics), engineering and financial simulation and analysis, search, sorting

Unsuitable applications (these require redesigned algorithms and data structures, or batching the work)

  • Computations that need complex data structures such as trees, correlation matrices, linked lists, and spatial subdivision structures are not suited to GPU computation
  • Programs dominated by serial or transactional processing
  • Applications with very little parallelism, e.g., only a few parallel threads
  • Programs that need millisecond-scale real-time guarantees

MIMD (Thread-Level Parallelism)

  • “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
    • Parallel Architecture = Computer Architecture + Communication Architecture

Communication Models

Shared Memory Model

Multi-processors: based on Shared Memory

  • Communication occurs through a shared virtual address space (via loads and stores) ⇒ low overhead for communication
    • A single address space does not imply a single physical memory: the shared address space can be implemented with one physically shared memory (Centralized) or with distributed memories plus hardware/software support (Distributed)

Shared-memory multiprocessors are either:

  • UMA (Uniform Memory Access time) for shared-address, centralized-memory MPs
  • NUMA (Non-Uniform Memory Access time) for shared-address, distributed-memory MPs

Message Passing

Multi-computers: based on Message Passing

  • Communication occurs by explicitly passing messages among the processors. (distributed memory system)
    • Each processor has its own private local memory, which can be accessed only by that processor and not directly by any other processor
    • The address space can consist of multiple private address spaces. Each processor-memory module is essentially a separate computer

MIMD Memory Architecture: Centralized (SMP) vs. Distributed

2 classes of multiprocessors with respect to memory:

  • (1) Centralized Memory Multiprocessor (also called symmetric multiprocessors (SMPs) because the single main memory has a symmetric relationship to all processors; symmetric with respect to memory: every processor accesses memory in exactly the same way)
    • few dozen processor chips and cores
    • Small enough to share a single, centralized memory (a larger memory would make the hardware too complex to manage) ⇒ needs larger caches
    • Can scale to a few dozen processors by using a switch and by using many memory banks. Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing centralized memory increases
  • (2) Physically Distributed-Memory multiprocessor
    • Larger numbers of chips and cores
    • Distributing memory among the processors meets the bandwidth (BW) demands

The interconnection network can be a high-speed network such as Ethernet or fiber optics.

The Flynn-Johnson classification of computer systems


The horizontal axis is the memory architecture; the vertical axis is the communication model.

Exam point: for example, be able to explain why the classification is drawn this way.

Typical parallel computer architectures

Symmetric Multiprocessor (SMP)

  • Centralized Memory Multiprocessors are also called SMPs because single main memory has a symmetric relationship to all processors

Cluster of Workstations (COW)

  • A computer cluster is a group of coupled computers that work together closely so that in many respects they can be viewed as though they are a single computer.
  • The components of a cluster are commonly, but not always, connected to each other through fast local area networks.
    • MPI is a widely available communications library that enables parallel programs to be written in C, Python… (a minimal sketch follows)
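
Since MPI is named above, here is a minimal, self-contained C example of the message-passing model (standard MPI calls; the payload value is made up for illustration):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    if (rank == 0 && size > 1) {
        int msg = 42;                       /* arbitrary payload */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int msg;
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", msg);
    }
    MPI_Finalize();
    return 0;
}
```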

Cluster categorizations

  • High-availability (HA) clusters
    • operate by having redundant nodes, which are then used to provide service when system components fail.
    • e.g. failover clusters
  • Load-balancing clusters
    • operate by distributing a workload evenly over multiple back end nodes.
  • Grid/Cloud computing
    • grid clusters, a technology closely related to cluster computing. The key differences are that grids connect collections of computers which do not fully trust each other, or which are geographically dispersed. (more like a computing utility than like a single computer)
    • support more heterogeneous collections

Massively Parallel Processor (MPP)

  • An MPP system is a large-scale parallel computer built from hundreds or thousands of processors. It is used mainly in computation-dominated settings such as scientific computing and engineering simulation, and today also widely in commercial and networked applications
    • Hard to develop and expensive; a symbol of a nation’s overall technical strength (supercomputers)
    • Uses a high-performance proprietary interconnection network that can deliver messages with low latency and high bandwidth (the biggest difference from a COW)

Cluster vs. MPP

Architectural differences

  • (1) Cluster nodes are more complete computers, and may be homogeneous or heterogeneous. Every node has its own disk and hosts its own complete operating system; nodes generally have enough autonomy to keep running even when detached from the cluster. MPP nodes, by contrast, usually have no disk and host only an operating system kernel
  • (2) An MPP uses the manufacturer’s proprietary (or patented) high-speed communication network, whose interface attaches to the processing node’s memory bus (tightly coupled); a cluster generally uses commodity, off-the-shelf high-speed LANs or system-area networks, usually attached to the node’s I/O bus (loosely coupled)

Interconnection Networks

Processor-to-Memory Interconnection Networks

  • Interconnection network: a network built from switching elements arranged in a given topology under a given control scheme, connecting the nodes of a computer system to one another
  • Interconnection function: describes the permutation or mapping between the network’s inputs (Processors) and outputs (Memory banks); one classic example is sketched below
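One classic interconnection function is the perfect shuffle (my example; any fixed permutation of port indices would illustrate the concept). In C:

```c
/* Perfect shuffle on N = 2^n ports: the output index is the input
   index's n-bit binary representation rotated left by one bit. */
unsigned perfect_shuffle(unsigned i, unsigned n_bits) {
    unsigned mask = (1u << n_bits) - 1u;
    return ((i << 1) | (i >> (n_bits - 1))) & mask;
}
/* Example with 8 ports (n_bits = 3): input 3 (011) maps to output 6 (110). */
```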

Classification of interconnection networks

  • Static Networks: nodes are joined by fixed connection paths that do not change while a program executes
  • Dynamic Networks: built from switching elements, so the connection state can change dynamically as the application requires (suited to large interconnection networks). Dynamic networks mainly comprise buses, crossbar switches, and multistage switching networks

Connecting Multiple Computers

  • Shared Media vs. Switched (“point-to-point”)

Switching scheme

  • Circuit switching
  • Packet switching

Multistage Network

  • To build a large network, crossbar switches can be cascaded into a multistage interconnection network. By controlling the individual crossbar units, the network establishes the required connections between inputs and outputs, so that information at any input can be routed to any output

Cache Coherence and Coherence Protocol

Multiprocessor cache coherence (this problem can only arise under the shared-memory model)

  • Symmetric shared-memory machines usually support the caching of both shared and private data.
    • Private data are used by a single processor
      • When a private item is cached, its location is migrated to the cache, reducing the average access time as well as the memory bandwidth required. Because no other processor uses the data, the program behavior is identical to that in a uniprocessor.
    • Shared data are used by multiple processors, essentially providing communication among the processors through reads and writes of the shared data
      • When shared data are cached, the shared value may be replicated in multiple caches. In addition to the reduction in access latency and required memory bandwidth, this replication also provides a reduction in contention that may exist for shared data items that are being read by multiple processors simultaneously.
      • Caching of shared data, however, introduces a new problem: cache coherence.

What Is Multiprocessor Cache Coherence?

Cache Coherence problem

  • Because the view of memory held by two different processors is through their individual caches, the processors could end up seeing different values for the same memory location

Notice that the coherence problem exists because we have both a global state, defined primarily by the main memory, and a local state, defined by the individual caches, which are private to each processor core. Thus, in a multi-core where some level of caching may be shared (e.g., an L3), although some levels are private (e.g., L1 and L2), the coherence problem still exists and must be solved.

Coherent Memory Model

  • Definition: Reading an address should return the last value written to that address
  • This simple definition contains two different aspects:
    • (1) Coherence defines what values can be returned by a read
    • (2) Consistency determines when a written value will be returned by a read
    • Coherence and consistency are complementary: coherence defines the behavior of reads and writes to the same memory location, while consistency defines the behavior of reads and writes with respect to accesses to other memory locations.
Coherent Memory System
  • Preserve Program Order: if a processor P writes to X and then reads X, with no other processor writing X in between, the read always returns the value written
  • Coherent view of memory: if one processor writes X and another processor later reads X, with no other write in between, the read of X returns the written value (if a processor could continuously read an old data value, we would clearly say that memory was incoherent)
  • Write serialization: writes to the same location are serialized, so any two writes to the same location by any two processors are seen in the same order by all processors
    • For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.
Memory Consistency Model
  • Although the three properties just described are sufficient to ensure coherence, the question of when a written value will be seen is also important.
    • To see why, observe that we cannot require that a read of X instantaneously see the value written for X by some other processor.
    • The issue of exactly when a written value must be seen by a reader is defined by a memory consistency model

Memory Consistency Model

  • (1) A write does not complete (and allow the next write to occur) until all processors have seen the effect of that write
  • (2) The processor does not change the order of any write with respect to any other memory access
    • if a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A
  • These restrictions allow the processor to reorder reads, but force the processor to finish writes in program order

We will rely on this assumption until we reach Section 5.6, where we will see exactly the implications of this definition, as well as the alternatives.

Basic Schemes for Enforcing Coherence

  • In a coherent multiprocessor, the caches provide both migration and replication of shared data items.
    • Migration – data can be moved to a local cache and used there in a transparent fashion ⇒ reduces both the latency and the bandwidth demand of accessing remotely allocated shared data
    • Replication – for shared data that are being simultaneously read, since caches make a copy of the data in each local cache ⇒ reduces access latency and also reduces contention for read-shared data items

Cache Coherence Protocols (HW)

  • Key to implementing a cache coherence protocol is tracking the state of any sharing of a data block. The state of any cache block is kept using status bits associated with the block, similar to the valid and dirty bits kept in a uniprocessor cache
    • (1) Directory based — the sharing status of a block of physical memory is kept in just one location, the directory (a potential bottleneck)
    • (2) Snooping — Every cache with a copy of data also has a copy of sharing status of block, but no centralized state is kept
      • All caches are accessible via some broadcast medium (a bus or switch)
      • All cache controllers monitor or snoop on the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access

Snooping Coherence Protocols

Write Invalidate, Write Update

  • The cache controller “snoops” all transactions on the shared medium (bus or switch) ⇒ on any relevant transaction (one whose block this cache contains), it takes action to ensure coherence
    • (1) Write Invalidate (exclusive access): exclusive access ensures that no other readable or writable copies of an item exist when the write occurs: all other cached copies of the item are invalidated
    • (2) Write Update: update all the cached copies of a data item when that item is written ⇒ uses more broadcast-medium BW ⇒ all recent MPUs use write invalidate

The rest of this section focuses on Write Invalidate.

Basic Implementation Techniques (Write Invalidate)

Broadcast Medium Transactions (e.g., bus)
  • To perform an invalidate, the processor acquires bus access and broadcasts the address to be invalidated on the bus. All processors snoop on the bus, watching the addresses. The processors check whether the address on the bus is in their cache. If so, the corresponding data are invalidated.
  • The bus also enforces write serialization: if two processors attempt to write shared blocks at the same time, their attempts to broadcast an invalidate operation will be serialized when they arbitrate for the bus.
    • One implication of this scheme is that a write to a shared data item cannot actually complete until it obtains bus access.
Locate up-to-date copy of data

Write-through cache

  • all written data are always sent to the memory ⇒ get the up-to-date copy from memory
    • write-through is simpler if there is enough memory BW

Write-back cache

  • Can use same snooping mechanism:
    • supply value: Snoop every address placed on the bus. If a processor has dirty (newest) copy of requested cache block, it provides it in response to a read request and aborts the memory access
  • Write-back needs lower memory bandwidth ⇒ supports larger numbers of faster processors ⇒ most multiprocessors use write-back

An Example Protocol (Snoopy, Invalidate)

  • Snooping coherence protocol is usually implemented by incorporating a finite-state controller in each node

Write-through Cache Protocol

  • 2 states per block in each cache (Valid / Invalid)

Write Back Cache Protocol

  • Each cache block is in one state:
    • Shared : block can be read
    • Exclusive : cache has the only copy, it is writable, and dirty
    • Invalid : block contains no data (in uniprocessor cache too)

Write-Back State Machine - CPU

  • State machine for CPU requests for each cache block

Writes to clean blocks are treated as misses (in either case, an invalidate signal is broadcast on the bus)

Write-Back State Machine - Bus Request

  • State machine for bus requests for each cache block

Write-Back State Machine - III

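
To summarize the three state-machine figures, here is a minimal C sketch of the MSI-style write-back invalidate protocol (my simplification: transitions only, with no bus arbitration or data movement):

```c
/* States of one cache block, as listed above. */
typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;

/* CPU-side transitions: what this processor's reads/writes do. */
BlockState cpu_event(BlockState s, int is_write) {
    switch (s) {
    case INVALID:
        /* Miss: fetch the block; a write miss also broadcasts an invalidate. */
        return is_write ? EXCLUSIVE : SHARED;
    case SHARED:
        /* A write to a clean block is treated as a miss: invalidate others. */
        return is_write ? EXCLUSIVE : SHARED;
    case EXCLUSIVE:
        return EXCLUSIVE;              /* read or write hit */
    }
    return s;
}

/* Bus-side (snooped) transitions: what other processors' requests do. */
BlockState bus_event(BlockState s, int remote_is_write) {
    if (s == INVALID) return INVALID;      /* nothing cached, nothing to do */
    if (remote_is_write) return INVALID;   /* a remote write invalidates us */
    /* Remote read: a dirty EXCLUSIVE copy is written back and demoted. */
    return (s == EXCLUSIVE) ? SHARED : s;
}
```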
