Instruction Set Principles

  • Reference: Computer Architecture (6th Edition)

RISC-V vs 80x86

  • In this section, we concentrate on instruction set architecture (ISA)—the portion of the computer visible to the programmer or compiler writer.
    • Architectures similar to RISC-V, which we focus on here, have been used successfully in desktops, servers, and embedded applications.
    • One successful architecture very different from RISC is the 80x86. Surprisingly, its success does not necessarily belie the advantages of a RISC instruction set. There remain, however, serious disadvantages for a complex instruction set like the 80x86. The commercial importance of binary compatibility with PC software, combined with the abundance of transistors provided by Moore’s Law, led Intel to use a RISC instruction set internally while supporting an 80x86 instruction set externally. Recent 80x86 microprocessors use hardware to translate from 80x86 instructions to RISC-like instructions and then execute the translated operations inside the chip. They maintain the illusion of 80x86 architecture to the programmer while allowing the computer designer to implement a RISC-style processor for performance.

ISA Principles

Instruction Set Architecture

Introduction

  • ISA: a set of machine instructions
    • Each instruction is executed directly by the CPU’s hardware. It is represented in a binary format that concatenates the binary encodings of the operation, registers, constants, and memory addresses
  • Options - fixed or variable length formats
    • Fixed - each instruction encoded in same size field (typically 1 word) (Word size is typically 16, 32, 64 bits today)
    • Variable – half-word, whole-word, multiple word instructions are possible. Computers with a wide variety of flexible instruction formats reduce the number of bits required to encode the program.

Instruction Set Design

  • The instruction set influences everything
    $CPU\_Time = IC \times CPI \times Cycle\_time$
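
As a quick illustration of the equation above, the sketch below multiplies the three factors for two hypothetical designs. The function name `cpu_time` and every number in it are made up for illustration; none of them are measurements from the reference.

```python
def cpu_time(instruction_count, cpi, cycle_time_s):
    """CPU time = IC * CPI * cycle time, in seconds."""
    return instruction_count * cpi * cycle_time_s

# Hypothetical design A: fewer instructions, but more cycles per instruction.
time_a = cpu_time(instruction_count=1_000_000, cpi=4.0, cycle_time_s=1e-9)
# Hypothetical design B: 30% more instructions, but a lower CPI.
time_b = cpu_time(instruction_count=1_300_000, cpi=1.5, cycle_time_s=1e-9)

print(f"A: {time_a * 1e3:.2f} ms, B: {time_b * 1e3:.2f} ms")
```

With these assumed numbers, design B finishes sooner even though it executes more instructions, which is why the instruction set has to be judged on all three factors together.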

Classifying Instruction Set Architectures

Stack, Accumulator, GPR

Most basic differentiation

  • The type of internal storage in a processor

  • The major choices are a stack, an accumulator, or a set of registers. There are really two classes of register computers:
    • register-memory architecture: access memory as part of any instruction
    • load-store architecture: access memory only with load and store instructions
  • Operands may be named explicitly or implicitly:
    • The operands in a stack architecture are implicitly on the top of the stack, and in an accumulator architecture one operand is implicitly the accumulator. The general-purpose register (GPR) architectures have only explicit operands—either registers or memory locations.
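
To make the difference between implicit and explicit operands concrete, here is a minimal sketch that computes C = A + B on a toy stack machine and on a toy load-store machine. The tuple-based instruction format, the register names, and the memory contents are assumptions made for this example; they do not correspond to any real ISA.

```python
memory = {"A": 3, "B": 4, "C": 0}

def run_stack(program, mem):
    """Stack machine: ALU operands are implicitly on top of the stack."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(mem[args[0]])
        elif op == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "pop":
            mem[args[0]] = stack.pop()
    return mem

def run_load_store(program, mem):
    """Load-store machine: only load/store touch memory; ALU ops name registers."""
    regs = {}
    for op, *args in program:
        if op == "load":
            regs[args[0]] = mem[args[1]]
        elif op == "add":
            dest, src1, src2 = args
            regs[dest] = regs[src1] + regs[src2]
        elif op == "store":
            mem[args[1]] = regs[args[0]]
    return mem

stack_program = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
load_store_program = [("load", "R1", "A"), ("load", "R2", "B"),
                      ("add", "R3", "R1", "R2"), ("store", "R3", "C")]

stack_result = run_stack(stack_program, dict(memory))
load_store_result = run_load_store(load_store_program, dict(memory))
print(stack_result["C"], load_store_result["C"])
```

Note that the stack `add` carries no operand names at all, while the load-store `add` names all three registers explicitly.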

General-Purpose Register (GPR) Architectures

  • Compiler writers would prefer that all registers be equivalent and unreserved.
    • If the number of truly general-purpose registers is too small, trying to allocate variables to registers will not be profitable. Most compilers reserve some registers for expression evaluation, use some for parameter passing, and allow the remainder to be allocated to hold variables.
    • Modern compiler technology and its ability to effectively use larger numbers of registers has led to an increase in register counts in more recent architectures.

Instruction Set Characteristics of GPR Architectures

  • Two major instruction set characteristics divide GPR architectures.
    • (1) Whether an ALU instruction has two or three operands.
      • In the three-operand format, the instruction contains one result operand and two source operands.
      • In the two-operand format, one of the operands is both a source and a result for the operation.
    • (2) How many of the operands in an ALU instruction may be memory addresses (this varies from none to three).

Why load-store register architecture?

  • Although most early computers used stack or accumulator-style architectures, virtually every new architecture designed after 1980 uses a load-store register architecture.
  • First, registers are faster than memory
  • Second, registers are more efficient for a compiler to use than other forms of internal storage.
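    • (A * B) – (B * C) – (A * D) → the multiplications may be evaluated in any order (without registers, the location and order of loads and stores have to be taken into account)
    • Registers can hold variables, which reduces memory traffic, speeds up the program, and improves code density (a register is named with fewer bits than a memory address)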

Instruction Characteristics

  • Addressing modes, operations, and data types should be orthogonal to (independent of) each other

Memory addressing

  • (1) how memory addresses are interpreted
    • what object is accessed as a function of the address and the length: endian order (byte order) and alignment (accesses to objects larger than a byte must be aligned)
  • (2) how they are specified.
    • Addressing modes
Interpreting Memory Addresses
  • All the instruction sets discussed in this book are byte addressed and provide access for bytes (8 bits), half words (16 bits), and words (32 bits). Most of the computers also provide access for double words (64 bits).
Endian Order
  • There are two different conventions for ordering the bytes within a larger object.

  • (1) Little Endian: The low-order byte of an object is stored in memory at the lowest address, and the high-order byte at the highest address. (The little end comes first)
    • Intel processors use “Little Endian” byte order.
  • (2) Big Endian: The high-order byte of an object is stored in memory at the lowest address, and the low-order byte at the highest address. (The big end comes first)
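    • The byte order used by a given system is called the host byte order. To avoid errors caused by different kinds of hosts interpreting byte order differently when they exchange data, a byte order for network transmission, the network byte order, was introduced. Network byte order is defined to be Big Endian. (Little Endian ordering fails to match the normal ordering of words when strings are compared. Strings appear “SDRAWKCAB” (backwards) in the registers.)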
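
The two conventions above can be observed directly with Python's standard `struct` module. This is only a small sketch: the value `0x12345678` is an arbitrary example, and the format characters `<`, `>`, and `!` select little-endian, big-endian, and network byte order respectively.

```python
import struct

value = 0x12345678

little = struct.pack("<I", value)   # low-order byte at the lowest address
big = struct.pack(">I", value)      # high-order byte at the lowest address
network = struct.pack("!I", value)  # network byte order

print("little :", little.hex(" "))
print("big    :", big.hex(" "))
print("network:", network.hex(" "))

# Reading bytes back with the wrong convention yields a different number.
misread = struct.unpack(">I", little)[0]
print(hex(misread))
```

The network-order bytes match the big-endian ones, in line with the rule that network byte order is Big Endian.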

Endian Order is Also Important to File Data

  • Adobe Photoshop – Big Endian
  • BMP (Windows and OS/2 Bitmaps) – Little Endian
  • GIF – Little Endian
  • JPEG – Big Endian
Alignment restrictions
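  • Accesses to objects larger than a byte must be aligned. An access to an object of size $s$ bytes at byte address $A$ is aligned if $A \bmod s = 0$ (data of size $K$ bytes must be stored at an address that is a multiple of $K$; where this would not hold, padding bytes are inserted)
    • For example, on a machine with a 32-bit word: a double word sits at an address that is a multiple of 8 (lowest 3 bits 000); a word at a multiple of 4 (lowest 2 bits 00); a half word at a multiple of 2 (lowest bit 0); a byte may be stored at any address. Suppose an int, a short, a double, a char, and a short are stored in that order; padding is then inserted wherever the next value would otherwise be misaligned (example taken from CSDN).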
  • Why would someone design a computer with alignment restrictions?
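    • Misalignment causes hardware complications, because the memory is typically aligned on a multiple of a word or double-word boundary. A misaligned memory access may, therefore, take multiple aligned memory references: the data spans several memory units, so it takes several accesses, plus rearranging the high and low bytes, to obtain a single word. This increases the number of memory accesses and lowers instruction execution efficiency.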
      • For each misaligned example some objects require two memory accesses to complete.
      • Every aligned object can always complete in one memory access, as long as the memory is as wide as the object.
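
The alignment rule and the int / short / double / char / short example above can be sketched as follows. The sizes (int 4, short 2, double 8, char 1 bytes) and the helper names `layout` and `aligned_accesses` are assumptions for this illustration; real compilers and ABIs may choose differently.

```python
def layout(fields):
    """Place each (name, size) field at the next address that is a multiple of its size."""
    placed = []
    offset = 0
    for name, size in fields:
        padding = (-offset) % size   # bytes to skip so that offset % size == 0
        offset += padding
        placed.append((name, offset, padding))
        offset += size
    return placed, offset

def aligned_accesses(address, size, memory_width):
    """Number of memory_width-byte aligned references needed to cover the access."""
    first = address // memory_width
    last = (address + size - 1) // memory_width
    return last - first + 1

fields = [("int", 4), ("short", 2), ("double", 8), ("char", 1), ("short", 2)]
placed, total = layout(fields)
for name, offset, padding in placed:
    print(f"{name:6s} at offset {offset:2d} (after {padding} padding byte(s))")
print("total bytes:", total)

# A 4-byte access on a 4-byte-wide memory: aligned at 4, misaligned at 6.
print(aligned_accesses(4, 4, 4), aligned_accesses(6, 4, 4))
```

Under these assumptions the double is pushed from offset 6 to offset 8, and the misaligned 4-byte access needs two memory references where the aligned one needs a single reference.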
Addressing Modes

  • Addressing Modes: how architectures specify the address of an object they will access. (Constants, Registers, Locations in memory)
    • Supporting multiple addressing modes can significantly reduce instruction counts, but it adds to the complexity of building a computer and may increase the average CPI

Example for Addressing Modes


Summary of Use of Memory Addressing Mode

  • displacement, immediate, and register indirect addressing modes represent 75% to 99% of the addressing mode usage

The following sections look at these three addressing modes in more detail.


Displacement Addressing Mode

  • What’s an appropriate range for the displacements? The displacement field should be at least 12 to 16 bits, which captures 75% to 99% of the displacements

Immediate or Literal Addressing Mode

  • Does the mode need to be supported for all operations or for only a subset? - All operations
  • What’s a suitable range of values for immediates? The immediate field should be at least 8 to 16 bits, which captures 50% to 80% of the immediates.
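
To get a feel for what the field sizes quoted for displacements and immediates mean, the sketch below lists the values each width can hold. It assumes the field is a two's-complement signed number, which the text does not state explicitly; `signed_range` and `fits` are hypothetical helper names.

```python
def signed_range(bits):
    """Smallest and largest value of a two's-complement field of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

def fits(value, bits):
    """True if value can be encoded in a signed field of the given width."""
    low, high = signed_range(bits)
    return low <= value <= high

for bits in (8, 12, 16):
    low, high = signed_range(bits)
    print(f"{bits:2d} bits: {low} .. {high}")

# A displacement of 3000 does not fit in 12 bits but fits in 16.
print(fits(3000, 12), fits(3000, 16))
```

A value outside the range has to be built with extra instructions, which is the cost that a wider field avoids.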

Type and Size of Operands

How is the type of an operand designated?

  • Encoding in the opcode: For an instruction, the operation is typically specified in one field, called the opcode
  • By tag (not used currently)

Common operand types: Character, Integer, Single-precision floating point, Double-precision floating point, Vertex…


Distribution of Data Access

Operations in the instruction set

What Operations are Needed?

  • All computers provide the following operations:
    • Arithmetic and Logical, Data Transfer (Loads-stores), Control (Branch, jump, procedure call and return, trap), System (Operating system call, virtual memory management instructions)
  • The following operations are optional:
    • Floating Point, Decimal, String, Graphics

Top 10 Instructions for the 80x86

  • The top 10 instructions for the 80x86 account for 96% of instructions executed → the most widely executed instructions are the simple operations of an instruction set. Make them fast, as they are the common case

Instructions for Control Flow

  • Jump (unconditional), Branch (conditional), Procedure call, Procedure return

Addressing Modes for Control Flow Instructions

  • How to get the destination address of a control flow instruction?
    • PC-relative: Supply a displacement that is added to the program counter (PC); Position independence: Permit the code to run independently of where it is loaded
    • Register indirect: a register contains the target address, as needed for case (switch) statements, dynamically linked libraries (DLLs), virtual functions…
    • The jump may permit any addressing mode to be used to supply the target address
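
The position-independence property of PC-relative addressing can be checked with a few lines of arithmetic. This is a simplified sketch: the addresses and the displacement are hypothetical, and it ignores details such as displacement scaling or whether the PC already points at the next instruction.

```python
def branch_target(pc, displacement):
    """PC-relative addressing: the target is the PC plus a signed displacement."""
    return pc + displacement

def register_target(regs, reg):
    """Register-indirect jump: the target is whatever address the register holds."""
    return regs[reg]

BRANCH_OFFSET_IN_CODE = 0x40   # where the branch sits inside the code fragment
DISPLACEMENT = -0x10           # encoded in the branch instruction itself

# Load the same code fragment at two different (hypothetical) base addresses.
targets = {}
for base in (0x1000, 0x8000):
    pc = base + BRANCH_OFFSET_IN_CODE
    targets[base] = branch_target(pc, DISPLACEMENT)
    print(hex(base), "->", hex(targets[base]),
          "offset within code:", hex(targets[base] - base))

# A register-indirect jump takes its target from a register set at run time.
regs = {"R5": 0x2000}
print(hex(register_target(regs, "R5")))
```

The target lands at the same offset within the code wherever the fragment is loaded, so the encoded displacement never needs to be patched.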

Three Common Techniques for Specifying Branch Conditions and Their Advantages and Disadvantages

Encoding an Instruction Set

  • How are the instructions encoded into a binary representation for execution? (This affects the size of code and the CPU design)
    • The operation is typically specified in one field, called the opcode
  • How to encode the addressing mode with the operations
    • Address specifier (地址描述符)
    • Addressing modes encoded as part of the opcode

Popular Encoding Choices

  • Variable-length encoding: allows virtually all addressing modes to be used with all operations → favors code size over performance
  • Fixed-length encoding: a single size for all instructions, with few addressing modes and operations → favors performance over code size
    • Combine the operations and the addressing modes into the opcode
  • Hybrid encoding: a set of fixed formats
    • Balances the size of programs against ease of decoding in the processor

Role of Compilers

Goals of a Compiler

  • Correctness
  • Speed of the compiled code
  • Fast compilation, debugging support, interoperability among languages (components written in different languages can work together, for example Python calling a C library)

Structure of Recent Compilers


Optimization Types

  • High-level optimizations: Done on the source (the high-level language)
  • Local optimizations: Done within a basic block (a straight-line code fragment)
  • Global optimizations: Extend the local optimizations across branches and loops (procedure inlining, loop unrolling…)
  • Register allocation: Use graph coloring to allocate registers
    • NP-complete
    • Heuristic algorithms work best when there are at least 16 (and preferably more) registers
  • Processor-dependent optimizations: take advantage of specific architectural knowledge.
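
The graph-coloring idea behind register allocation can be sketched with a simple greedy heuristic. This is not the algorithm of any particular compiler: the function `color_registers`, the interference graph, and the spill handling are all simplified assumptions. Two variables interfere when they are live at the same time and therefore cannot share a register.

```python
def color_registers(interference, num_registers):
    """Greedy coloring of an interference graph; returns (assignment, spilled)."""
    assignment = {}
    spilled = []
    # Visit the most constrained variables (highest degree) first.
    order = sorted(interference, key=lambda v: (-len(interference[v]), v))
    for var in order:
        taken = {assignment[n] for n in interference[var] if n in assignment}
        free = [r for r in range(num_registers) if r not in taken]
        if free:
            assignment[var] = free[0]
        else:
            spilled.append(var)   # no register left: keep this variable in memory
    return assignment, spilled

# Hypothetical interference graph: an edge joins variables live at the same time.
interference = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

with_three, spilled_three = color_registers(interference, 3)
with_two, spilled_two = color_registers(interference, 2)
print(with_three, spilled_three)
print(with_two, spilled_two)
```

With three registers every variable gets one; with two, one variable has to be spilled. This gives a small-scale view of why the heuristics work better when more registers are available.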

Impact of Optimizations on Performance

  • Level 1: local optimizations, code scheduling, and local register allocation
  • Level 2: global optimization, loop transformation (software pipelining), global register allocation
  • Level 3: High-level procedure integration

Optimization Observations

  • Hard to reduce branches
  • Biggest reduction is often memory references
  • Some ALU operation reduction happens but it is usually a few %
  • Implication: Branch, Call, and Return become a larger relative percentage of the instruction mix. Control instructions are the hardest to speed up

Impact of Compiler Technology on the Architect’s Decisions

  • (1) How are variables allocated and addressed?
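    • Register allocation is more effective for stack-allocated objects than for global variables, and is essentially impossible for heap-allocated objects because they are accessed with pointers. (Heap data is reached through pointers: if a value were kept in a register and the data were then modified elsewhere through a pointer, the register and memory copies would disagree. In the memory hierarchy, inconsistency between cache and main memory can be handled by read and write policies, but registers have no such mechanism, and adding one would be too complex.) Some variables are impossible to allocate because they are aliased (there are multiple ways to refer to them)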
      • The stack is used to allocate local variables. Objects on the stack are addressed relative to the stack pointer and are primarily scalars (single variables) rather than arrays.
      • The global data area is used to allocate statically declared objects, such as global variables and constants. A large percentage of these objects are arrays or other aggregate data structures.
      • The heap is used to allocate dynamic objects that do not adhere to a stack discipline. Objects in the heap are accessed with pointers and are typically not scalars.
  • (2) How many registers will be needed?
    • An ISA should have at least 16 general-purpose registers (not counting separate floating-point registers) to simplify register allocation

How can Architects Help Compiler Writers

Make the frequent cases fast and the rare case correct.

  • Some instruction set properties help the compiler writer.
  • (1) Provide regularity (orthogonality): Addressing modes, operations, and data types should be orthogonal to (independent of) each other (for example, addition should be available for both integers and floating-point numbers…). Orthogonality suggests that all supported addressing modes apply to all instructions that transfer data
    • Simplifies code generation, especially in multi-pass compilers
    • Counterexample: restricting which registers can be used for certain classes of instructions
  • (2) Provide primitives, not solutions: supply the mechanism, not the policy. For example, Linux iptables offers blacklist and whitelist functionality but leaves the policy to you
    • Special features that match a HLL (high-level language) construct are often unusable
    • What works in one language may be detrimental to others
  • (3) Simplify trade-offs among alternatives
    • How to write good code? What is good code?
      • Metric: IC or code size (no longer a sufficient measure) → caches and pipelining now matter as well
    • Help compiler writers understand the costs of alternatives
  • (4) Provide instructions that bind the quantities known at compile time as constants

The MIPS architecture: MIPS64

MIPS (Microprocessor without Interlocked Pipeline Stages)

  • MIPS is a simple, streamlined, highly scalable RISC architecture that is available for licensing.
    • Three processor architectures stand out today: x86 (CISC), represented by Intel and AMD; ARM (RISC), used in phone and tablet processors; and MIPS (RISC), the architecture chosen for China’s Loongson processors

MIPS64

  • Use general-purpose registers with a load-store architecture.
  • Design for pipelining efficiency, including a fixed instruction set encoding

RISC vs. CISC

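  • CISC (Complex Instruction Set Computer): designed around strengthening the instruction set, so individual instructions are functionally complex. This makes design, verification, and testing difficult. It was later observed that most of the instructions executed by programs come from a small subset of simple instructions, which motivated the move from CISC to RISC

  • RISC (Reduced Instruction Set Computer): designed around raising instruction execution efficiency, aiming to accomplish more with fewer instructions
    • Gives priority to the most frequently used simple instructions. The number of instructions is small, and the function of a complex instruction is obtained by combining frequently used simple ones
    • Instruction length and format are fixed, there are few instruction formats, and the addressing modes are simple
    • Load-store architecture: only load and store instructions access memory; all other instructions operate between registers
    • RISC has more internal general-purpose registers than CISC
    • Reduces the number of cycles per instruction; most instructions complete in a single cycle. So although a RISC, having only simple instructions, needs more instructions than a CISC for the same program, each instruction takes fewer cycles, and because the RISC structure is simpler the cycle time is shorter too. On that basis RISC outperforms CISC
    • The control unit uses hardwired (combinational logic) control rather than microprogrammed control, which is faster
    • Relies on an optimizing compiler
    • Makes full use of pipelining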

Register for MIPS

  • 32 64-bit integer GPRs (or integer registers) (R0 ~ R31)
    • R0 = 0 always (code uses the value 0 very often, and reading a register is faster than fetching an immediate constant, so 0 is kept in R0 to speed up access)
    • By convention, each register also has a name (an alias) to make it easier to code
  • 32 FPRs (F0 ~ F31): for single (32 bits) or double precision (64 bits)
  • Extra status registers: SR (Status Register), floating-point status register
  • Other control registers

Data Types for MIPS

  • 8-bit bytes, 16-bit half words, 32-bit words, and 64-bit double words for integer data
  • 32-bit single precision and 64-bit double precision for FP
  • MIPS64 operations work on 64-bit integer and 32- or 64-bit floating point
    • Bytes, half words, and words are loaded into the GPRs with zeros or the sign bit replicated to fill the 64 bits of the GPRs
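
The two ways of filling a 64-bit register from a narrower value, zeros or copies of the sign bit, can be written out as bit operations. The helper names `zero_extend` and `sign_extend` are chosen for this sketch and are not MIPS mnemonics.

```python
MASK64 = (1 << 64) - 1

def zero_extend(value, bits):
    """Keep the low `bits` bits and fill the rest of the register with zeros."""
    return value & ((1 << bits) - 1)

def sign_extend(value, bits):
    """Replicate bit (bits - 1) into the upper bits of a 64-bit register."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return ((value ^ sign_bit) - sign_bit) & MASK64

byte = 0x80  # the sign bit is set when read as a signed 8-bit value (-128)
print(f"zero-extended: {zero_extend(byte, 8):#018x}")
print(f"sign-extended: {sign_extend(byte, 8):#018x}")
print(f"positive byte: {sign_extend(0x7F, 8):#018x}")
```

Sign extension preserves the numeric value of a signed operand, while zero extension preserves the value of an unsigned one; which one is applied depends on the load instruction used.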

Addressing Modes for MIPS Data Transfers

  • Immediate: With 16-bit field
  • Displacement
    • Load R4, 100(R1) (Regs[R4] <- Mem[100 + Regs[R1]])
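  • Register-indirect: not supported directly by MIPS, but obtainable from displacement addressing (with a displacement of 0)
    • Load R4, (R1) (Regs[R4] <- Mem[Regs[R1]])
  • Absolute addressing: not supported directly by MIPS, but obtainable from displacement addressing (using R0 as the base register)
    • Load R4, (1001) (Regs[R4] <- Mem[1001])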
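
The way displacement addressing covers the other two modes can be simulated in a few lines. This is a toy model rather than a MIPS emulator: the `load` helper, the dictionary used as memory, and the stored values are assumptions for illustration, and alignment is ignored.

```python
regs = [0] * 32                        # R0..R31; R0 always reads as 0
mem = {2100: 11, 2000: 22, 1001: 33}   # hypothetical memory contents

def load(dest, displacement, base):
    """Regs[dest] <- Mem[displacement + Regs[base]]; returns the effective address."""
    address = displacement + regs[base]
    if dest != 0:                      # writes to R0 are discarded
        regs[dest] = mem[address]
    return address

regs[1] = 2000
results = []

load(4, 100, 1)     # displacement:      Load R4, 100(R1)
results.append(regs[4])

load(4, 0, 1)       # register-indirect: Load R4, (R1), a displacement of 0
results.append(regs[4])

load(4, 1001, 0)    # absolute:          Load R4, (1001), with R0 as the base
results.append(regs[4])

print(results)
```

All three cases go through the same address calculation, which is why a single displacement mode is enough in the hardware.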

MIPS memory

  • Byte addressable with 64-bit addresses
  • Mode selection for Big Endian or Little Endian

MIPS Instruction Format

  • Each instruction is 32 bits long. The instruction word is divided into “fields”, and each field tells the computer something about the instruction
  • We could define different fields for each instruction, but MIPS is based on simplicity, so it defines 3 basic types of instruction formats:
    • I-format: for immediates, and for lw and sw (since the offset counts as an immediate) (rs: register source; rt: register target; rd: register destination)
    • R-format: for register operations (the operation is applied to rs and rt, and the result is placed in rd; shamt: shift amount)
    • J-format: for Jump, Jump and Link, Trap and return from exception (relative jump)
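
The three formats can be sketched as bit-packing helpers. The field widths used here (6-bit opcode, 5-bit register fields, 5-bit shamt, 6-bit funct, 16-bit immediate, 26-bit jump offset) are the classic MIPS layout and are an assumption, since they are not spelled out in the text above. The opcode and funct values for `add` and `lw` are likewise the classic MIPS encodings and should be treated as illustrative.

```python
def encode_r(opcode, rs, rt, rd, shamt, funct):
    """R-format: opcode | rs | rt | rd | shamt | funct."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def encode_i(opcode, rs, rt, immediate):
    """I-format: opcode | rs | rt | 16-bit immediate."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)

def encode_j(opcode, offset):
    """J-format: opcode | 26-bit offset."""
    return (opcode << 26) | (offset & 0x3FFFFFF)

def decode_fields(word):
    """Split a 32-bit word into every field; which ones apply depends on the format."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
        "immediate": word & 0xFFFF,
    }

# add: rd = R8, rs = R17, rt = R18 (R-format, opcode 0, funct 0x20)
add_word = encode_r(opcode=0, rs=17, rt=18, rd=8, shamt=0, funct=0x20)
# lw: rt = R8, base rs = R17, offset 100 (I-format, opcode 0x23)
lw_word = encode_i(opcode=0x23, rs=17, rt=8, immediate=100)

print(f"{add_word:#010x} {lw_word:#010x}")
print(decode_fields(lw_word))
```

Because every format keeps the opcode and the register fields in the same bit positions, the decoder can start reading registers before it knows which format it is looking at.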

The load and store instructions in MIPS

Examples of arithmetic/logical instructions on MIPS

  • All ALU instructions are register-register instructions.

Typical control flow instructions in MIPS

Subset of the instructions in MIPS64

LoongISA (LISA)

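  • LoongISA builds on the 500-plus instructions of the MIPS64 architecture and adds nearly 1400 new instructions across basic instructions, virtual machine instructions, binary translation instructions for x86 and ARM, vector instructions, and kernel mode, including:
    • 148 LoongEXT instructions: the Loongson general-purpose extension instruction set
    • 5 LoongVM instructions (that is, LoongVZ)
    • 213 LoongBT instructions
    • 1014 LoongSIMD instructions / LoongMMI multimedia extension instruction set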