Instruction Set Principles

Article link: https://blog.csdn.net/weixin_42437114/article/details/114880391

Reference: $Computer\ Architecture\ (6th\ Edition)$

RISC-V vs 80x86

In this section, we concentrate on instruction set architecture (ISA)—the portion of the computer visible to the programmer or compiler writer.
- Architectures similar to RISC-V, which we focus on here, have been used successfully in desktops, servers, and embedded applications.
- One successful architecture very different from RISC is the 80x86. Surprisingly, its success does not necessarily belie the advantages of a RISC instruction set. There remain, however, serious disadvantages for a complex instruction set like the 80x86. The commercial importance of binary compatibility with PC software, combined with the abundance of transistors provided by Moore’s Law, led Intel to use a RISC instruction set internally while supporting an 80x86 instruction set externally. Recent 80x86 microprocessors use hardware to translate 80x86 instructions into RISC-like instructions and then execute the translated operations inside the chip. They maintain the illusion of an 80x86 architecture for the programmer while allowing the computer designer to implement a RISC-style processor for performance.

ISA Principles

Instruction Set Architecture

Introduction

ISA: a set of machine instructions
- Each instruction is directly executed by the CPU’s hardware. It is represented in a binary format that concatenates the binary encodings of the operation, registers, constants, and memory addresses.
Options: fixed- or variable-length formats
- Fixed: each instruction is encoded in a field of the same size (typically one word; word sizes today are typically 16, 32, or 64 bits).
- Variable: half-word, whole-word, and multiple-word instructions are possible. Computers with a wide variety of flexible instruction formats reduce the number of bits required to encode the program.

Instruction Set Design

The instruction set influences everything
$CPU\_Time = IC \times CPI \times Cycle\_time$
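
As a rough illustration of how the three factors trade off, the sketch below evaluates the formula for two designs. The function name `cpu_time` and all of the numbers are hypothetical, chosen only to show that a higher instruction count can still give a lower CPU time when CPI drops enough.

```python
def cpu_time(instruction_count, cpi, cycle_time_s):
    """CPU time in seconds: IC * CPI * cycle time."""
    return instruction_count * cpi * cycle_time_s

# Hypothetical design A: fewer instructions, more cycles per instruction.
time_a = cpu_time(instruction_count=1_000_000, cpi=4.0, cycle_time_s=1e-9)

# Hypothetical design B: more instructions, fewer cycles per instruction.
time_b = cpu_time(instruction_count=1_500_000, cpi=1.5, cycle_time_s=1e-9)

print(f"A: {time_a * 1e3:.3f} ms")
print(f"B: {time_b * 1e3:.3f} ms")
print("B is faster" if time_b < time_a else "A is faster")
```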

Classifying Instruction Set Architectures

Stack, Accumulator, GPR

Most basic differentiation

The type of internal storage in a processor

The major choices are a stack, an accumulator, or a set of registers. There are really two classes of register computers:
- register-memory architecture: access memory as part of any instruction
- load-store architecture: access memory only with load and store instructions
Operands may be named explicitly or implicitly:
- The operands in a stack architecture are implicitly on the top of the stack, and in an accumulator architecture one operand is implicitly the accumulator. The general-purpose register (GPR) architectures have only explicit operands—either registers or memory locations.
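
The difference between implicit and explicit operands may be easier to see with the statement C = A + B written for three of the classes. The sketch below is a toy model in Python rather than real machine code; the function names and the dictionary standing in for memory are assumptions made for illustration, and the comments show the instruction each line stands for.

```python
def run_stack(mem):
    """Stack machine: operands are implicitly on top of the stack."""
    stack = []
    stack.append(mem["A"])                    # Push A
    stack.append(mem["B"])                    # Push B
    stack.append(stack.pop() + stack.pop())   # Add (no operands named)
    mem["C"] = stack.pop()                    # Pop C
    return mem["C"]

def run_accumulator(mem):
    """Accumulator machine: one operand is implicitly the accumulator."""
    acc = mem["A"]          # Load A
    acc = acc + mem["B"]    # Add B
    mem["C"] = acc          # Store C
    return mem["C"]

def run_load_store(mem):
    """Load-store machine: every operand is named explicitly."""
    regs = {}
    regs["R1"] = mem["A"]                   # Load R1, A
    regs["R2"] = mem["B"]                   # Load R2, B
    regs["R3"] = regs["R1"] + regs["R2"]    # Add R3, R1, R2
    mem["C"] = regs["R3"]                   # Store R3, C
    return mem["C"]

for run in (run_stack, run_accumulator, run_load_store):
    print(run.__name__, run({"A": 3, "B": 4}))
```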

General-Purpose Register (GPR) Architectures

Compiler writers would prefer that all registers be equivalent and unreserved.
- If the number of truly general-purpose registers is too small, trying to allocate variables to registers will not be profitable. Most compilers reserve some registers for expression evaluation, use some for parameter passing, and allow the remainder to be allocated to hold variables.
- Modern compiler technology and its ability to effectively use larger numbers of registers has led to an increase in register counts in more recent architectures.

Instruction Set Characteristics of GPR Architectures

Two major instruction set characteristics divide GPR architectures.
- (1) Whether an ALU instruction has two or three operands.
  - In the three-operand format, the instruction contains one result operand and two source operands.
  - In the two-operand format, one of the operands is both a source and a result for the operation.
- (2) How many of the operands may be memory addresses in ALU instructions. The number of memory operands can vary from none to three.

Why load-store register architecture?

Although most early computers used stack or accumulator-style architectures, virtually every new architecture designed after 1980 uses a load-store register architecture.
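First, registers are faster than memory.
Second, registers are more efficient for a compiler to use than other forms of internal storage.
- With registers, (A * B) - (B * C) - (A * D) can be evaluated by doing the multiplications in any order; without them, the compiler has to account for where and in what order operands are loaded and stored.
- Registers can hold variables, which reduces memory traffic, speeds up the program, and improves code density (naming a register takes fewer bits than naming a memory address).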

Instruction Characteristics

Addressing modes, operations, and data types should be orthogonal to (independent of) each other

Memory addressing

(1) How memory addresses are interpreted
- What object is accessed as a function of the address and the length: endian order (byte order) and alignment (accesses to data larger than one byte must be aligned)
(2) How they are specified
- Addressing modes

Interpreting Memory Addresses

All the instruction sets discussed in this book are byte addressed and provide access for bytes (8 bits), half words (16 bits), and words (32 bits). Most of the computers also provide access for double words (64 bits).

Endian Order

There are two different conventions for ordering the bytes within a larger object.

(1) Little Endian: The low-order byte of an object is stored in memory at the lowest address, and the high-order byte at the highest address. (The little end comes first)
- Intel processors use “Little Endian” byte order.
(2) Big Endian: The high-order byte of an object is stored in memory at the lowest address, and the low-order byte at the highest address. (The big end comes first)
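- The byte order used by a given system is called the host byte order. To avoid errors when different kinds of hosts exchange data and interpret byte order differently, a byte order for network transmission, called network byte order, was introduced; it is defined to be big-endian. (Little Endian ordering fails to match the normal ordering of words when strings are compared. Strings appear “SDRAWKCAB” (backwards) in the registers.)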
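
The two conventions can be checked directly with Python's standard library. This is only a small sketch: the 32-bit value `0x12345678` is an arbitrary example, and `struct.pack` with the `!` prefix is used to show that network byte order matches the big-endian layout.

```python
import struct

value = 0x12345678  # arbitrary 32-bit example value

little = value.to_bytes(4, byteorder="little")  # low-order byte at lowest address
big = value.to_bytes(4, byteorder="big")        # high-order byte at lowest address

print(little.hex())  # 78563412
print(big.hex())     # 12345678

# "!" selects network byte order, which is big-endian.
network = struct.pack("!I", value)
print(network == big)  # True

# Reading the same bytes with the wrong convention gives a different number.
misread = int.from_bytes(little, byteorder="big")
print(hex(misread))  # 0x78563412
```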

Endian Order is Also Important to File Data

Adobe Photoshop – Big Endian
BMP (Windows and OS/2 Bitmaps) – Little Endian
GIF – Little Endian
JPEG – Big Endian

Alignment restrictions

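Accesses to objects larger than a byte must be aligned. An access to an object of size $s$ bytes at byte address $A$ is aligned if $A\ mod\ s = 0$ (an object of $K$ bytes must be stored at an address that is a multiple of $K$; where that would not hold, padding bytes are inserted).
- For example, on a machine with a 32-bit word: a double word has an address that is a multiple of 8 (lowest 3 bits 000); a word has an address that is a multiple of 4 (lowest 2 bits 00); a half word has an address that is a multiple of 2 (lowest bit 0); a byte may be stored at any address. Suppose an int, a short, a double, a char, and a short are stored in that order; padding is then inserted so that each one stays on its boundary (example adapted from CSDN).
Why would someone design a computer with alignment restrictions?
- Misalignment causes hardware complications, because the memory is typically aligned on a multiple of a word or double-word boundary. A misaligned memory access may, therefore, take multiple aligned memory references: the data spans more than one memory unit, so several accesses are needed and the high- and low-order bytes must be rearranged before the word is available. This increases the number of memory accesses and lowers instruction efficiency.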
  - For each misaligned example some objects require two memory accesses to complete.
  - Every aligned object can always complete in one memory access, as long as the memory is as wide as the object.
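
The alignment rule and the int, short, double, char, short example can be worked through in a few lines. The sketch below assumes each type is aligned to its own size (int 4, short 2, double 8, char 1) and that the first object starts at address 0; the helper names `is_aligned` and `layout` are made up for this illustration.

```python
def is_aligned(address, size):
    """An access of `size` bytes at `address` is aligned if address mod size == 0."""
    return address % size == 0

def layout(fields, start=0):
    """Place fields in order, padding so each lands on a multiple of its size.

    Returns a list of (name, address, padding_bytes_inserted_before).
    """
    placed = []
    address = start
    for name, size in fields:
        padding = (-address) % size   # bytes needed to reach the next boundary
        address += padding
        placed.append((name, address, padding))
        address += size
    return placed

fields = [("int", 4), ("short", 2), ("double", 8), ("char", 1), ("short", 2)]
placed = layout(fields)
for name, address, padding in placed:
    print(f"{name:6s} at address {address:2d} (padding before: {padding})")

print(is_aligned(8, 8), is_aligned(6, 8))  # True False
```

Under these assumptions the double is pushed from address 6 to address 8, and the final short from 17 to 18; those are the only two places where padding is needed.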

Addressing Modes

Addressing Modes: how architectures specify the address of an object they will access. (Constants, Registers, Locations in memory)
- Supporting multiple addressing modes can significantly reduce instruction counts, but it adds to the complexity of building a computer and may increase the average CPI

Example for Addressing Modes

Summary of Use of Memory Addressing Mode

Displacement, immediate, and register indirect addressing modes represent 75% to 99% of the addressing mode usage.

The following sections focus on these addressing modes.

Displacement Addressing Mode

What’s an appropriate range for the displacements? - The displacement field should be at least 12-16 bits, which captures 75% to 99% of the displacements

Immediate or Literal Addressing Mode

Does the mode need to be supported for all operations or for only a subset? - All operations
What’s a suitable range of values for immediates? - The immediate field should be at least 8-16 bits, which captures 50% to 80% of the immediates.
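
To make the field sizes concrete, the sketch below checks whether a constant fits in an n-bit field. It assumes the field holds a two's-complement signed value, which is a common choice but not something the text above specifies; `fits_signed` is a hypothetical helper name.

```python
def fits_signed(value, bits):
    """True if `value` is representable as a `bits`-wide two's-complement number."""
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    return low <= value <= high

# Range covered by each candidate field width.
for bits in (8, 12, 16):
    print(bits, "bits:", -(1 << (bits - 1)), "to", (1 << (bits - 1)) - 1)

print(fits_signed(100, 8))     # True
print(fits_signed(200, 8))     # False: needs a wider field
print(fits_signed(200, 12))    # True
print(fits_signed(40000, 16))  # False: outside -32768..32767
```

A constant that does not fit has to be built with extra instructions or loaded from memory, which is why the field width determines how many immediates and displacements are captured.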

Type and Size of Operands

How is the type of an operand designated?

Encoding in the opcode: For an instruction, the operation is typically specified in one field, called the opcode
By tag (not used currently)

Common operand types: Character, Integer, Single-precision floating point, Double-precision floating point, Vertex…

Distribution of Data Access

Operations in the instruction set

What Operations are Needed?

All computers provide the following operations:
- Arithmetic and Logical, Data Transfer (Loads-stores), Control (Branch, jump, procedure call and return, trap), System (Operating system call, virtual memory management instructions)
The following operations are optional:
- Floating Point, Decimal, String, Graphics

Top 10 Instructions for the 80x86

The top 10 instructions for the 80x86 account for 96% of instructions executed $\rightarrow$ The most widely executed instructions are the simple operations of an instruction set. Make them fast, as they are the common case.

Instructions for Control Flow

Jump (unconditional), Branch (conditional), Procedure call, Procedure return

Addressing Modes for Control Flow Instructions

How to get the destination address of a control flow instruction?
- PC-relative: Supply a displacement that is added to the program counter (PC); Position independence: Permit the code to run independently of where it is loaded
- A register contains the target address: case, DLL, Virtual Function…
- The jump may permit any addressing mode to be used to supply the target address
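
Position independence under PC-relative addressing can be shown with a small calculation. This sketch simplifies real hardware: it adds the displacement to the address of the branch itself, whereas actual ISAs often add it to the address of the following instruction or scale it first. The addresses and the name `branch_target` are hypothetical.

```python
def branch_target(branch_address, displacement):
    """Target of a PC-relative branch: displacement added to the PC."""
    return branch_address + displacement

DISPLACEMENT = 0x28  # encoded once in the instruction, never rewritten

# Load the same code at two different base addresses.
targets = {}
for base in (0x1000, 0x8000):
    branch_at = base + 0x20   # where the branch instruction ends up
    label_at = base + 0x48    # where the branch target ends up
    targets[base] = branch_target(branch_at, DISPLACEMENT)
    print(hex(base), hex(targets[base]), targets[base] == label_at)
```

The same encoded displacement reaches the correct label at both load addresses, which is what lets the code run independently of where it is loaded.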

Three common techniques for specifying branch conditions, with their advantages and disadvantages

Encoding an Instruction Set

How are the instructions encoded into a binary representation for execution? (This affects the size of the code and the CPU design.)
- The operation is typically specified in one field, called the opcode
How to encode the addressing mode with the operations:
- Address specifier
- Addressing modes encoded as part of the opcode

Popular Encoding Choices

Variable (variable-length encoding): allows virtually all addressing modes to be used with all operations $\rightarrow$ favors code size over performance
Fixed (fixed-length encoding): a single size for all instructions, with few addressing modes and operations $\rightarrow$ favors performance over code size
- Combines the operations and the addressing modes into the opcode
Hybrid (hybrid encoding): a set of fixed formats
- Trades off program size against ease of decoding in the processor

Role of Compilers

Goals of a Compiler

Correctness
Speed of the compiled code
Fast compilation, debugging support, interoperability among languages (components written in different languages working together; for example, Python using a C library)

Structure of Recent Compilers

Optimization Types

High-level optimizations: done on the source (the high-level language)
Local optimizations: done on a basic sequential block (within a straight-line code fragment)
Global optimizations: extend the local optimizations across branches and loops (procedure inlining, loop unrolling…)
Register allocation: use graph coloring to allocate registers
- NP-complete
- Heuristic algorithms work best when there are at least 16 (and preferably more) registers
Processor-dependent optimizations: take advantage of specific architectural knowledge.
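
The graph-coloring view of register allocation mentioned above can be sketched with a simple greedy heuristic. This is not the algorithm of any particular compiler: the interference graph, the variable names, and the function `color_registers` are assumptions for illustration. Two variables interfere when they are live at the same time and therefore cannot share a register.

```python
def color_registers(interference, num_registers):
    """Greedy coloring of an interference graph.

    Returns (assignment, spilled): a register number per variable, plus the
    variables that could not be given a register and would stay in memory.
    """
    assignment = {}
    spilled = []
    # Visit the most constrained variables (most neighbors) first.
    order = sorted(interference, key=lambda v: len(interference[v]), reverse=True)
    for var in order:
        taken = {assignment[n] for n in interference[var] if n in assignment}
        free = [r for r in range(num_registers) if r not in taken]
        if free:
            assignment[var] = free[0]
        else:
            spilled.append(var)
    return assignment, spilled

# Hypothetical interference graph: a, b, c are all live together; d only with c.
interference = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

with_three = color_registers(interference, num_registers=3)
with_two = color_registers(interference, num_registers=2)
print(with_three)  # every variable gets a register
print(with_two)    # one variable has to be spilled
```

With three registers every variable is colored; with two, one of the three mutually interfering variables has to be spilled, which hints at why such heuristics do better when more registers are available.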

Impact of Optimizations on Performance

Level 1: local optimizations, code scheduling, and local register allocation
Level 2: global optimization, loop transformation (software pipelining), global register allocation
Level 3: High-level procedure integration

Optimization Observations

Hard to reduce branches
Biggest reduction is often memory references
Some ALU operation reduction happens but it is usually a few %
Implication: branches, calls, and returns become a larger relative percentage of the instruction mix. Control instructions are the hardest to speed up.

Impact of Compiler Technology on the Architect’s Decisions

(1) How are variables allocated and addressed?
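- Register allocation is more effective for stack-allocated objects than for global variables, and is essentially impossible for heap-allocated objects because they are accessed with pointers. (Heap data is reached through pointers: if a value were kept in a register and the same data were modified elsewhere through a pointer, the register and memory copies would disagree. In the memory hierarchy, inconsistency between cache and main memory is handled by the read/write policies, but registers have no such mechanism, since it would be too complex.) Some variables are impossible to allocate because they are aliased (there are multiple ways to refer to them).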
  - The stack is used to allocate local variables. Objects on the stack are addressed relative to the stack pointer and are primarily scalars (single variables) rather than arrays.
  - The global data area is used to allocate statically declared objects, such as global variables and constants. A large percentage of these objects are arrays or other aggregate data structures.
  - The heap is used to allocate dynamic objects that do not adhere to a stack discipline. Objects in the heap are accessed with pointers and are typically not scalars.
(2) How many registers will be needed?
- An ISA should have at least 16 GPRs (general-purpose registers), not counting FP registers, to simplify register allocation

How can Architects Help Compiler Writers

Make the frequent cases fast and the rare case correct.

Some instruction set properties help the compiler writer.
(1) Provide regularity (orthogonality): addressing modes, operations, and data types should be orthogonal to (independent of) each other (for example, both integers and floating-point numbers support addition…). Orthogonality suggests that all supported addressing modes apply to all instructions that transfer data
- Simplifies code generation, especially in multi-pass compilers
- Counterexample: restricting which registers can be used for a certain class of instructions
(2) Provide primitives, not solutions: supply the mechanism, not the policy. For example, Linux iptables provides the blacklist/whitelist capability, but the policy is up to you
- Special features that match a HLL (high-level language) construct are often unusable
- What works in one language may be detrimental to others
(3) Simplify trade-offs among alternatives
- How to write good code? What is good code?
  - Metric: IC or code size (no longer true) $\rightarrow$ caches and pipeline…
- Help compiler writers understand the costs of alternatives
(4) Provide instructions that bind the quantities known at compile time as constants

The MIPS architecture: MIPS64

MIPS (Microprocessor without Interlocked Pipeline Stages)

MIPS is a simple, streamlined, highly scalable RISC architecture that is available for licensing.
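- Among today's processors there are three leading architectures: the x86 architecture (CISC), represented by Intel and AMD; the ARM architecture (RISC), used in phone and tablet processors; and the MIPS architecture (RISC), which was chosen for China's Loongson processors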

MIPS64

Use general-purpose registers with a load-store architecture.
Design for pipelining efficiency, including a fixed instruction set encoding

RISC vs. CISC

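CISC (Complex Instruction Set Computer): designed to make the instruction set more powerful, so individual instructions are complex. This makes design, verification, and testing difficult. It was later observed that most instructions executed by real programs come from a small subset of simple instructions, which prompted the shift $\textsf{CISC}$ $\rightarrow$ $\textsf{RISC}$

RISC (Reduced Instruction Set Computer): designed to make instruction execution more efficient, aiming to do more with fewer instructions
- Prefers the simple instructions that are used most frequently; the number of instructions is small, and the function of a complex instruction is obtained by combining frequently used simple ones
- Fixed instruction length and format, few instruction formats, and simple addressing modes
- Load-store architecture: only load and store instructions access memory; all other instructions operate between registers
- More internal general-purpose registers than CISC
- Fewer cycles per instruction; most instructions complete in a single cycle. So although RISC, having only simple instructions, needs more instructions than CISC for the same program, each instruction takes fewer cycles, and because the RISC structure is simpler the cycle time is shorter as well; on that basis RISC outperforms CISC
- The control unit uses hardwired (combinational logic) control rather than microprogrammed control, which is faster
- Relies on an optimizing compiler
- Makes full use of pipelining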

Register for MIPS

32 64-bit integer GPRs (or integer registers) (R0 ~ R31)
- R0 = 0 always (code uses 0 frequently, and reading a register is faster than fetching an immediate constant, so 0 is kept in R0 to speed up access)
- By convention, each register also has a name (an alias) to make it easier to code
32 FPRs (F0 ~ F31): for single (32 bits) or double precision (64 bits)
Extra status registers: SR (Status Register), floating-point status register
Other control registers

Data Types for MIPS

8-bit bytes, 16-bit half words, 32-bit words, and 64-bit double words for integer data
32-bit single precision and 64-bit double precision for FP
MIPS64 operations work on 64-bit integer and 32- or 64-bit floating point
- Bytes, half words, and words are loaded into the GPRs with zeros or the sign bit replicated to fill the 64 bits of the GPRs
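
The two ways of filling a 64-bit register from a narrower value can be sketched as follows. The function names `zero_extend` and `sign_extend` are hypothetical, and Python integers are masked to 64 bits to imitate a fixed-width register.

```python
MASK64 = (1 << 64) - 1

def zero_extend(value, bits):
    """Keep the low `bits` bits; the upper bits of the register become 0."""
    return value & ((1 << bits) - 1)

def sign_extend(value, bits):
    """Replicate the sign bit of a `bits`-wide value across a 64-bit register."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    # XOR then subtract turns the bit pattern into a signed Python integer,
    # and the mask wraps a negative result into its 64-bit pattern.
    return ((value ^ sign_bit) - sign_bit) & MASK64

byte = 0x80  # bit pattern of -128 as a signed byte, 128 as an unsigned byte

print(f"{zero_extend(byte, 8):016x}")    # 0000000000000080
print(f"{sign_extend(byte, 8):016x}")    # ffffffffffffff80
print(f"{sign_extend(0x7F, 8):016x}")    # 000000000000007f
print(f"{sign_extend(0x8000, 16):016x}") # ffffffffffff8000
```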

Addressing Modes for MIPS Data Transfers

Immediate: With 16-bit field
Displacement
- Load R4, 100(R1) (Regs[R4] <- Mem[100 + Regs[R1]])
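Register-indirect: not supported directly by MIPS, but can be synthesized with displacement addressing (using 0 as the displacement)
- Load R4, (R1) (Regs[R4] <- Mem[Regs[R1]])
Absolute addressing: not supported directly by MIPS, but can be synthesized with displacement addressing (using R0 as the base register)
- Load R4, (1001) (Regs[R4] <- Mem[1001])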
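
A toy model shows how a single displacement-mode load can cover all three cases above. The register file, the memory contents, and the `load` function are hypothetical; the sketch also ignores the 16-bit limit on the displacement field, which restricts how far absolute addressing can reach in practice.

```python
NUM_REGS = 32
regs = [0] * NUM_REGS                    # regs[0] models R0, which always reads as 0
memory = {1000: 22, 1001: 33, 1100: 11}  # hypothetical memory contents

def load(dest, displacement, base):
    """Regs[dest] <- Mem[displacement + Regs[base]]."""
    value = memory[displacement + regs[base]]
    if dest != 0:                        # writes to R0 are discarded
        regs[dest] = value
    return value

regs[1] = 1000

displacement_result = load(4, 100, 1)  # Load R4, 100(R1): Mem[100 + 1000]
indirect_result = load(4, 0, 1)        # Load R4, (R1): displacement of 0
absolute_result = load(4, 1001, 0)     # Load R4, (1001): R0 as the base

print(displacement_result, indirect_result, absolute_result)  # 11 22 33
```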

MIPS memory

Byte addressable with a 64-bit address (64-bit address space)
Mode selection for Big Endian or Little Endian

MIPS Instruction Format

Each instruction is 32 bits long: the instruction word is divided into “fields”, and each field tells the computer something about the instruction
We could define different fields for each instruction, but MIPS is based on simplicity, so it defines 3 basic types of instruction formats:
- $I$ -format: for immediate, and lw and sw (since the offset counts as an immediate) ( $r s$ : register source; $r t$ : register target; $r d$ : register destination)
- $R$ -format: for register (the operation is applied to $r s$ and $r t$ and the result is placed in $r d$ ; $s h a m t$ : shift amount)
- $J$ -format: for Jump, Jump and Link, Trap and return from exception (relative jump)
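
Packing fields into a 32-bit word can be sketched with shifts and masks. The layout assumed here is the classic MIPS one, which the text above does not spell out: R-format is opcode (6 bits), rs (5), rt (5), rd (5), shamt (5), funct (6), and I-format is opcode (6), rs (5), rt (5), immediate (16). The opcode and funct values are the classic MIPS32 encodings for `add` and `addi`, used only as examples; the helper names are hypothetical.

```python
def encode_r(rs, rt, rd, shamt, funct, opcode=0):
    """R-format: opcode | rs | rt | rd | shamt | funct."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def encode_i(opcode, rs, rt, immediate):
    """I-format: opcode | rs | rt | 16-bit immediate (two's complement)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)

def decode_r(word):
    """Split a 32-bit word back into its R-format fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }

# add: register 8 <- register 17 + register 18 (funct 0x20)
add_word = encode_r(rs=17, rt=18, rd=8, shamt=0, funct=0x20)
# addi: register 8 <- register 17 + (-1) (opcode 0x08)
addi_word = encode_i(opcode=0x08, rs=17, rt=8, immediate=-1)

print(f"{add_word:#010x}")   # 0x02324020
print(f"{addi_word:#010x}")  # 0x2228ffff
print(decode_r(add_word))
```

Because every format keeps the opcode and the register fields in the same bit positions, the decoder can extract them before it knows which format it is looking at, which is part of what makes fixed encodings easy to decode.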

The load and store instructions in MIPS

Examples of arithmetic/logical instructions on MIPS

All ALU instructions are register-register instructions.

Typical control flow instructions in MIPS

Subset of the instructions in MIPS64

LoongISA (LISA)

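The LoongISA instruction set builds on the more than 500 instructions of the MIPS64 architecture and adds nearly 1400 new instructions across basic instructions, virtual machine instructions, binary translation instructions for X86 and ARM, vector instructions, kernel mode, and other areas, including:
- 148 LoongEXT instructions: the Loongson general-purpose extension instruction set
- 5 LoongVM instructions (also known as LoongVZ)
- 213 LoongBT instructions
- 1014 LoongSIMD instructions / the LoongMMI multimedia extension instruction set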