Computer Organization Review
These notes are based on the COMP5201 lecture notes and David Patterson & John Hennessy, Computer Organization and Design (5th Edition), together with some of my own understanding, organized for review. (Do not reproduce without permission.)
Chapter 1 Introduction
Eight Great Ideas in Computer Architecture
- Design for Moore's Law: design chips according to the growth that Moore's Law predicts.
- Use Abstraction to Simplify Design: use abstractions to characterize the design at different levels of representation, keeping the levels separate from one another.
- Make the Common Case Fast: making the common case fast will tend to enhance performance better than optimizing the rare case. Optimize the most frequent cases first; the exceptional cases may not be optimal, but they do not hurt overall performance.
- Performance via Parallelism: improve performance by doing work in parallel.
- Performance via Pipelining: pipelines keep the CPU's hardware fully utilized.
- Performance via Prediction: predict the data that will be needed, exploiting temporal locality and spatial locality.
- Hierarchy of Memories: multiple levels of memory: hard disk, main memory, multiple levels of caches, registers, CPU.
- Dependability via Redundancy: computers not only need to be fast; they need to be dependable. Since any physical device can fail, we make systems dependable by including redundant components that can take over when a failure occurs and help detect failures.
From Higher level language to the language of hardware:
A higher-level language (C/C++) is translated into assembly language (e.g., RISC-V, MIPS) by a compiler (gcc, g++). Assembly language is translated into a binary machine-language program by an assembler.
Five Classic Components of a Computer
They are input, output, memory, datapath, and control, with the last two sometimes combined and called the processor.
Amdahl’s Law
Consider a program with one portion that is perfectly sequential and another, perfectly parallel portion that can be made as parallel as we like. The parallel portion's execution time shrinks to (execution time / number of parallel units), while the sequential portion's execution time stays unchanged.
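As a quick numeric check, Amdahl's Law can be expressed in a few lines of Python (a sketch; the fraction and processor count below are made-up examples):

```python
def amdahl_speedup(parallel_fraction, n):
    """Overall speedup when the parallel portion runs on n units
    and the sequential portion is unchanged (Amdahl's Law)."""
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / n)

# e.g. a program that is 90% perfectly parallel, on 10 processors:
speedup = amdahl_speedup(0.9, 10)   # 1 / (0.1 + 0.9/10) ~= 5.26
```

Even with unlimited processors, the speedup is capped at 1 / sequential_fraction (here, 10x).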
Chapter 2: Instructions: Language of the Computer
Instruction Set
The vocabulary of commands understood by a given architecture. 不同计算机体系结构有不同的指令集
RISC-V, developed by UC Berkeley starting in 2010
MIPS is an elegant example of the instruction sets designed since the 1980s.
In several respects, RISC-V follows a similar design.
The Intel x86 originated in the 1970s, but still today powers both the PC and the Cloud of the post-PC era.
stored-program concept: The idea that instructions and data of many types can be stored in memory as numbers and thus be easy to change, leading to the stored-program computer
Signed and Unsigned Numbers
- Unsigned number: interpret the bits directly as a binary natural number.
- Signed number: "two's-complement number"
Most significant bit (leftmost) is 0 --> non-negative number
Most significant bit (leftmost) is 1 --> negative number
Example: In a 4-bit register, using two’s complement semantics, we have the following interpretations:
0000 = 0, 0001 = 1, 0010 = 2, 0011 = 3, 0100 = 4, 0101 = 5, 0110 = 6, 0111 = 7, 1000 = -8, 1001 = -7, 1010 = -6, 1011 = -5, 1100 = -4, 1101 = -3, 1110 = -2, and 1111 = -1.
How to negate two’s complement number, 如何转换正负数 1. flip every bit. 2. add one.
Character Set
- The ASCII system: Each character fits into a byte. ASCII only uses the lower-order 7 bits to distinguish characters. ASCII is thus able to represent 128 different characters.
- UTF-8(8-bit Unicode Transformation Format) is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes.
Floating-Point Numbers
- Binary expansion: expand the number in powers of two, using both positive and negative exponents, e.g. 0.75 = 2^-1 + 2^-2
- An (m+n)-digit radix-r fixed-point number with m whole digits represents numbers from 0 to r^m - r^-n, in increments of r^-n (m digits before the radix point, n digits after). E.g., as a (2+3)-bit binary fixed-point number,
2.375 = (1 * 2^1) + (0 * 2^0) + (0 * 2^-1) + (1 * 2^-2) + (1 * 2^-3) = (10.011)
- Blackboard notation: similar to scientific notation
- In standard computer bit patterns, we drop the leading "1."
- Three parts when representing a floating-point number: sign (one bit), exponent (two's complement, e.g. 4 bits), fractional part (the remaining bits)
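The binary-expansion and fixed-point examples above can be checked in Python (a sketch; the helper name `from_fixed` is mine, not from the course):

```python
def from_fixed(bits, m, n):
    """Value of an (m+n)-digit radix-2 fixed-point string:
    m whole bits and n fraction bits, exponents m-1 down to -n."""
    assert len(bits) == m + n
    value = 0.0
    for i, b in enumerate(bits):
        value += int(b) * 2.0 ** ((m - 1) - i)
    return value

# the examples from the notes:
assert from_fixed("11", 0, 2) == 0.75       # 0.75 = 2^-1 + 2^-2
assert from_fixed("10011", 2, 3) == 2.375   # (10.011) as (2+3)-bit fixed point
```

Powers of two are exact in binary floating point, so the equality checks here are safe.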
Instruction Formats
A machine instruction for an arithmetic/logic operation specifies an opcode, one or more source operands, and, usually, one destination register.
There are three instruction formats in MIPS.
- Register or R-type instructions operate on two registers rs, rt (the source operands) and store the result in register rd (the destination); 32 bits in total. Note: rs, rt, and rd each occupy only 5 bits of the instruction, holding the number of the corresponding register; for, say, an addition, the actual values are held in the registers themselves.
R | opcode | rs | rt | rd | \
---|---|---|---|---|---
 | 6 bits | 5 bits | 5 bits | 5 bits | 11 bits
mul.d f4, f2, f6
The contents of f2 and f6 are read, the result is placed into f4.
- Immediate or I-type instructions. Note: the operation is performed on rs and the immediate, and the result is stored in rt.

I | opcode | rs | rt | immediate
---|---|---|---|---
 | 6 bits | 5 bits | 5 bits | 16 bits

l.d f6, -24(r2)
Add the immediate byte-offset -24 to 'r2' to determine a memory address. Then load the double-precision floating-point number (64 bits) from that memory location into floating-point register 'f6'.
s.d f6, -24(r2)
Add the immediate byte-offset -24 to 'r2' to determine a memory address. Then store the double-precision floating-point number (64 bits) in floating-point register 'f6' into that memory location.
bne r1, r2, loop
Compare register 'r1' and register 'r2'. If they are not equal, add the word-offset derived from the immediate 'loop' to the current value of the PC to form the new PC.
- Jump or J-type instructions. Note: J-type instructions cause an unconditional transfer of control to the instruction at the specified address. The target is a word address (as opposed to a byte address), so two zeros are appended on the right; because the address is counted in words, one unit corresponds to 1 word = 4 bytes = 32 bits.

J | opcode | partial jump-target address
---|---|---
 | 6 bits | 26 bits

j done
Addressing Mode
Addressing mode is the method by which the location of an operand is specified within an instruction.
1. Immediate addressing: the operand is given in the instruction itself. daddui r1, r1, #-8
2. Register addressing: the operand is taken from, or the result placed into, a specified register. mul.d f4, f2, f6
3. Base addressing: the operand is in memory, and its location is computed by adding a byte-address offset (a 16-bit signed integer) to the contents of a specified base register. l.d f6, -24(r2); s.d f6, 24(r2)
4. PC-relative addressing: the same as base addressing, except that the "base" register is always the PC, and a hardware trick is used to extend the signed-integer offset to 18 bits: the 16-bit word offset is multiplied by 4 to obtain a byte-address offset, which is then sign-extended to 32 bits. beq r1, r2, found; bne r1, r2, loop
5. Absolute addressing: the addressing mode for unconditional branches is different because we don't really have a "base" register. j done. The 26-bit natural number is multiplied by 4 to give a 28-bit natural number, and the front is padded with the four leading bits of the PC, giving a 32-bit (word-aligned) address.
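The PC-relative and absolute target computations above can be sketched in Python (a simplification; all addresses and field values below are made up for illustration):

```python
def branch_target(pc, imm16):
    """PC-relative: sign-extend the 16-bit word offset, multiply by 4,
    and add to the PC. (The notes use the current PC as the base;
    real MIPS hardware uses PC + 4, but the mechanism is the same.)"""
    if imm16 & 0x8000:                 # sign-extend 16 -> 32 bits
        imm16 -= 1 << 16
    return (pc + (imm16 << 2)) & 0xFFFFFFFF

def jump_target(pc, target26):
    """Absolute (J-type): append two zeros to the 26-bit word address
    and pad the front with the four leading bits of the PC."""
    return (pc & 0xF0000000) | ((target26 << 2) & 0x0FFFFFFF)

assert branch_target(0x00400000, 0xFFFF) == 0x003FFFFC  # offset of -1 word
assert jump_target(0x10400000, 0x2) == 0x10000008
```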
Digital Logic
Notations for class COMP5201
- ~ not
- / and
- \/ or
- p --> q, conditional: if "p", then "q"; p --> q is false only when p is true and q is false.
- p <-- q, reversed conditional: if "q", then "p"; it is false only when q is true and p is false.
- p <–> q, bi-conditional, XNOR, when p == q, it is true, else, it is false.
- p + q, XOR, exclusive or, p or q, but not both, when p == q, it is false, else, it is true.
- p | q, p NAND q, equals “not and”
- p V| q, NOR, not or.
- p < q, ((not p) and q)
- p > q, (p and (not q))
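The notations above can be sanity-checked against Python's booleans over all (p, q) pairs (the helper names below are mine):

```python
def conditional(p, q): return (not p) or q        # p --> q
def xnor(p, q):        return p == q              # p <--> q  (bi-conditional)
def xor(p, q):         return p != q              # p + q  (exclusive or)
def nand(p, q):        return not (p and q)       # p | q
def nor(p, q):         return not (p or q)        # p V| q
def less(p, q):        return (not p) and q       # p < q
def greater(p, q):     return p and (not q)       # p > q

for p in (False, True):
    for q in (False, True):
        # the conditional is false only when p is true and q is false
        assert conditional(p, q) == (not (p and not q))
        # p + q is true exactly when p and q differ
        assert xor(p, q) == (not xnor(p, q))
        # '<' and '>' on booleans agree with Python's own comparisons
        assert less(p, q) == (p < q) and greater(p, q) == (p > q)
```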
Translating and Starting a Program
How C++ and Java programs work: compiling C and interpreting Java. To be continued.
Real Stuff: MIPS, x86, RISC-V Instruction Set
Chapter 3 Arithmetic for Computers
Addition and Subtraction
Similar to decimal addition and subtraction. To subtract, add the two's complement of the subtrahend; the carry out of the most significant bit is discarded, which yields the correct answer. Example:
  000111 =  7
+ 111010 = -6
1|000001 =  1   (the leading carry-out is discarded)
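The same discarded-carry addition can be replayed in Python (a sketch at a 6-bit width to match the example):

```python
BITS = 6
MASK = (1 << BITS) - 1

def to_signed(x):
    """Read a 6-bit pattern as a two's-complement value."""
    return x - (1 << BITS) if x & (1 << (BITS - 1)) else x

a, b = 0b000111, 0b111010      # 7 and -6
raw  = a + b                   # 0b1000001: a carry out of bit 5 appears
s    = raw & MASK              # discard the carry-out, keep 6 bits
assert to_signed(b) == -6
assert to_signed(s) == 1       # 7 + (-6) = 1
```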
Multiplication
Multiplication is a bit trickier. There isn’t one way to do it. The simplest to explain corresponds to what we learned in lower school (assume positive numbers):
- put the multiplier in a 32-bit register
- put the multiplicand in a 64-bit register
- initialize the 64-bit product to zero

loop: test the lsb of the multiplier
      if 1, add the multiplicand to the product
      shift the multiplicand register 1 bit left
      shift the multiplier register 1 bit right
      if not done, goto loop
Summary: multiplication hardware simply shifts and adds, as derived from the paper-and-pencil method learned in grammar school. Compilers even use shift instructions for multiplications by powers of 2.
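The shift-and-add loop above translates almost directly into Python (a sketch for unsigned operands; Python integers stand in for the 32- and 64-bit registers):

```python
def shift_add_multiply(multiplicand, multiplier):
    """Shift-and-add multiplication of unsigned integers."""
    product = 0
    while multiplier != 0:
        if multiplier & 1:            # test lsb of multiplier
            product += multiplicand   # if 1, add multiplicand to product
        multiplicand <<= 1            # shift multiplicand 1 bit left
        multiplier >>= 1              # shift multiplier 1 bit right
    return product

assert shift_add_multiply(6, 7) == 42
assert shift_add_multiply(13, 0) == 0
```

Multiplication by a power of 2 degenerates to a single shift, which is why compilers emit shifts for such cases.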
Division
To be continued… (textbook)
Floating Point
To be continued… (textbook)
Chapter 4 Processor
Core Ideas in RISC Design
- There shall be a small set of instructions, each of which can be executed in approximately the same amount of time using hardwired control (you may need several RISC instructions to do the work of one complex instruction).
- The architecture shall be a load/store architecture that confines memory-address calculation, and memory-latency delays, to a small set of load and store instructions, with all other (register-register) instructions obtaining their operands from faster, and compactly addressable, processor registers.
- There shall be a limited number of simple addressing modes that eliminate or speed up address calculations for the vast majority of cases.
- There shall be simple, uniform instruction formats that facilitate extraction/decoding of the various fields. This allows overlap between opcode interpretation and register readout.
In other words: 1. All operations on data apply to data in registers. 2. The only operators that affect memory are loads (which move data from memory to a register) and stores (which move data from a register to memory). 3. The instruction formats are few in number, with all instructions typically being one size.
RISC Instruction Execution
- Pipeline: consider a computer system that takes in operations on the left, computes them, and pushes out results on the right. In a pipeline, we may push in new operations on the left long before the results of previous operations are pushed out on the right.
- Three parameters: 1. peak input bandwidth; 2. operation latency; 3. pipeline occupancy (concurrency). When the pipeline reaches its equilibrium state, concurrency = bandwidth * latency.
- fdxmw instruction-execution pipeline. A special case: at equilibrium, input bandwidth = 1, output bandwidth = 1, latency = 5, concurrency = 5.
- Boxes and latches
Five boxes: f d x m w
Four latches, one between each pair of adjacent boxes: the f/d, d/x, x/m, m/w latches.
Instructions come from the I-cache, data from the D-cache, register contents from the register file.
- Process of execution
1. f-box:
  - reads the memory address of the next instruction from the PC register
  - fetches the instruction from memory (usually found in the I-cache; on a miss, go to memory)
  - updates the PC by adding 4
  - data flow: a 32-bit memory address travels up to the I-cache, and a 32-bit instruction travels down to the f-box
  - latches the fetched instruction in the f/d latch
2. d-box:
  - decodes the instruction, noting the operands, destination register, etc.
  - locates any register operands in the register file
  - data flow: register names travel up to the register file, and register values travel down to the d-box
  - latches the operands in the d/x latch
  - also processes conditional branches, e.g., checks the condition and updates the PC
3. x-box:
  - case statement on the instruction type: memory-reference, register-register, or register-immediate
  - performs the arithmetic operations
  - for floating-point numbers, 4 x-boxes are needed
4. m-box:
  - if the instruction is a load, reads from memory (the D-cache) and latches the result in the m/w latch
  - if the instruction is a store, writes to memory (the D-cache), taking the value to be stored from some pipeline latch
  - otherwise, does nothing
5. w-box:
  - if the instruction is a load or an ALU instruction, which produces a result, takes the value and writes it into the destination register
- Memory Wall: memory performance does not increase fast enough to keep pace with processor improvements.
- Power Wall: increases in performance cause increases in power density, resulting in high chip temperatures. High temperatures slow the chip down and can even melt it.
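The equilibrium relation concurrency = bandwidth * latency for the fdxmw pipeline can be checked with a toy simulation (plain Python; one instruction enters per cycle and each is in flight for five cycles, as described above):

```python
# Toy fdxmw pipeline: one instruction enters per cycle (bandwidth = 1),
# each instruction is in flight for 5 cycles (latency = 5).
STAGES = 5
pipeline = [None] * STAGES               # pipeline[i] = instruction in stage i

for cycle in range(10):
    pipeline = [cycle] + pipeline[:-1]   # every instruction advances one stage
    concurrency = sum(1 for slot in pipeline if slot is not None)
    if cycle >= STAGES - 1:              # equilibrium once the pipeline fills
        assert concurrency == 1 * 5      # concurrency = bandwidth * latency
```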
Pipelining
To make the pipeline work, each box is followed immediately on its right by a set of (non-ISA) pipeline registers, called a "pipeline latch". The basic requirement is this: before the end of a clock cycle, all the results from a given stage must be stored in the pipeline latch to its right, so that these values are preserved across clock cycles and can be used as inputs to the next stage at the start of the next clock cycle.
Chapter 5 Cache: Large and Fast: Exploiting Memory Hierarchy
Introduction and Terminology
- Memory hierarchy: a structure that uses multiple levels of memories; as the distance from the processor increases, the size of the memories and the access time both increase.
- Temporal locality: present in a program when code and data used in the recent past are highly likely to be reused in the near future.
- Spatial locality: present in a program when the code and data currently in use are highly likely to be followed, in the near future, by the use of code and data at nearby memory locations.
- Block or line: the minimum unit of information that can be either present or not present in a cache, e.g., 16 bytes. The copy brought from memory into the cache is called a cache line.
- Cache frame: a cache frame contains a cache line (the contents of a memory block), a tag field, and a valid bit.
- Hit rate / miss rate: the fraction of memory accesses found (or not found) in a level of the memory hierarchy.
- Hit time: the time required to access a level of the memory hierarchy, including the time needed to determine whether the access is a hit or a miss.
- Miss penalty: the time to bring a block from the lower-level cache (or memory) into the upper-level cache. In other words, the time required to fetch a block into a level of the memory hierarchy from the lower level, including the time to access the block, transmit it from one level to the other, insert it into the level that experienced the miss, and then pass the block to the requestor.
- Set: in an m-way set-associative cache, a set contains m cache frames.
Memory Technologies
- SRAM Technology
SRAM is short for Static Random Access Memory. The levels closer to the CPU (the caches) use SRAM. SRAM does not need refreshing, so its access time is very close to its cycle time. It uses six to eight transistors per bit to prevent the information from being disturbed when read. It is much more expensive per bit than DRAM.
- DRAM Technology
Dynamic RAM: the value is kept as charge in a capacitor; it cannot be kept indefinitely and must periodically be refreshed.
The fastest version is called Double Data Rate (DDR) SDRAM. A DDR4-3200 DRAM can do 3200 million transfers per second, which means it has a 1600-MHz clock.
- Flash Memory
Flash memory is a type of electrically erasable programmable read-only memory (EEPROM).
Unlike disks and DRAM, but like other EEPROM technologies, writes can wear out flash memory bits.
- Disk Memory
Cylinder, track, sector, seek time, rotational latency, ...
The Basics of Caches
- Direct-mapped cache: almost all direct-mapped caches use this decomposition of a line's (block's) memory address to find it: [tag field][frame index][offset]. The number of offset bits is determined by the number of bytes in a cache line (block); the number of frame-index bits is determined by the number of frames in the cache; the rest of the memory address is the tag field.
(byte number) = (memory address) mod (number of bytes in a block)
(line number) = (memory address) div (number of bytes in a block)
(frame number) = (line number) mod (number of frames in the cache)
(tag number) = (line number) div (number of frames in the cache)
- Handling cache misses: for a cache miss, we stall the entire processor, essentially freezing the contents of the temporary and programmer-visible registers, while we wait for memory.
- Handling Cache Writes
  - Write-through: a scheme in which writes always update both the cache and the next lower level of the memory hierarchy, ensuring that data are always consistent between the two.
  - Write buffer: a queue that holds data while the data are waiting to be written to memory.
  - Write-back: a scheme that handles writes by updating values only in the block in the cache, then writing the modified block to the lower level of the hierarchy when the block is replaced.
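The direct-mapped address decomposition above (div/mod by line size and cache size) can be sketched in Python; the line size and frame count here are hypothetical:

```python
# Hypothetical direct-mapped cache: 16-byte lines, 64 frames.
BYTES_PER_LINE = 16
FRAMES = 64

def decompose(addr):
    """Split a byte address per the div/mod rules above."""
    byte  = addr %  BYTES_PER_LINE     # offset within the line
    line  = addr // BYTES_PER_LINE     # memory line number
    frame = line %  FRAMES             # frame index within the cache
    tag   = line // FRAMES             # tag stored alongside the frame
    return tag, frame, byte

tag, frame, byte = decompose(0x1A2B3)
# Equivalently, with bit fields: 4 offset bits, 6 index bits, rest is tag.
assert (tag, frame, byte) == (0x1A2B3 >> 10, (0x1A2B3 >> 4) & 63, 0x1A2B3 & 15)
```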
Measuring and Improving Cache Performance
- Average memory access time
Avg_time = hit_time + miss_rate * miss_penalty
- Reducing cache misses with a set-associative cache
In a set-associative cache, a set contains more frames. When the processor requests a line, all the frames in the set are compared (in parallel) to see whether the data is in the set. The added placement flexibility reduces cache misses.
- Reducing the miss penalty using multi-level caches
When a miss happens in the primary cache, we go to the second-level cache to find the data. The second-level cache is usually larger and holds more data, so the miss penalty from the primary cache to the second-level cache is much smaller than from the primary cache to memory.
- Software optimization via blocking
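The average-access-time formula above can be applied to a hypothetical two-level hierarchy (all numbers below are made up for illustration):

```python
# Hypothetical one-level numbers:
hit_time, miss_rate, miss_penalty = 1, 0.05, 100   # cycles, fraction, cycles
amat_l1_only = hit_time + miss_rate * miss_penalty # = 6 cycles

# Insert a second-level cache (hypothetical 10-cycle hits, 20% local misses):
l2_hit_time, l2_local_miss_rate = 10, 0.2
amat_two_level = hit_time + miss_rate * (
    l2_hit_time + l2_local_miss_rate * miss_penalty)   # = 2.5 cycles
```

The L2 turns most 100-cycle trips to memory into 30-cycle ones, cutting the average from 6 to 2.5 cycles.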
Dependable Memory Hierarchy
When delivering information, we need to ensure that the data transfer is reliable, so we first need to define failures.
- The Hamming Single Error Correction, Double Error Detecting Code (SEC/DED)
- Redundant Arrays of Inexpensive Disks (RAID)
Implementing Cache Controllers