Chapter 3. Computer Architecture for Beginners

 

Chapter 3. Computer Architecture for Beginners

The CPU

Figure 3-1. The CPU

the computer

To greatly simplify, a computer consists of a central processing unit (CPU) attached to memory. The figure above illustrates the general principle behind all computer operations.

The CPU executes instructions read from memory. There are two categories of instructions

  1. Those that load values from memory into registers and store values from registers to memory.

  2. Those that operate on values stored in registers. For example adding, subtracting multiplying or dividing the values in two registers, performing bitwise operations (and, or, xor, etc) or performing other mathematical operations (square root, sin, cos, tan, etc).

So in the example we are simply adding 100 to a value stored in memory, and storing this new result back into memory.

Branching

Apart from loading or storing, the other important operation of a CPU is branching. Internally, the CPU keeps a record of the next instruction to be executed in the instruction pointer. Usually, the instruction pointer is incremented to point to the next instruction sequentially; the branch instruction will usually check if a specific register is zero or if a flag is set and, if so, will modify the pointer to a different address. Thus the next instruction to execute will be from a different part of program; this is how loops and decision statements work.

For example, a statement like if (x==0) might be implemented by finding the or of two registers, one holding x and the other zero; if the result is zero the comparison is true (i.e. all bits of x were zero) and the body of the statement should be taken, otherwise branch past the body code.

Cycles

We are all familiar with the speed of the computer, given in Megahertz or Gigahertz (millions or thousands of millions cycles per second). This is called the clock speed since it is the speed that an internal clock within the computer pulses.

The pulses are used within the processor to keep it internally synchronised. On each tick or pulse another operation can be started; think of the clock like the person beating the drum to keep the rower's oars in sync.

Fetch, Decode, Execute, Store

Executing a single instruction consists of a particular cycle of events; fetching, decoding, executing and storing.

For example, to do the add instruction above the CPU must

  1. Fetch : get the instruction from memory into the processor.

  2. Decode : internally decode what it has to do (in this case add).

  3. Execute : take the values from the registers, actually add them together

  4. Store : store the result back into another register. You might also see the term retiring the instruction.

Looking inside a CPU

Internally the CPU has many different sub components that perform each of the above steps, and generally they can all happen independently of each other. This is analogous to a physical production line, where there are many stations where each step has a particular task to perform. Once done it can pass the results to the next station and take a new input to work on.

Figure 3-2. Inside the CPU

the component in cpu

Above we have a very simple block diagram illustrating some of the main parts of a modern CPU.

You can see the instructions come in and are decoded by the processor. The CPU has two main types of registers, those for integer calculations and those for floating point calculations. Floating point is a way of representing numbers with a decimal place in binary form, and is handled differently within the CPU. MMX (multimedia extension) and SSE (Streaming Single Instruction Multiple Data) or Altivec registers are similar to floating point registers.

A register file is the collective name for the registers inside the CPU. Below that we have the parts of the CPU which really do all the work.

We said that processors are either loading or storing a value into a register or from a register into memory, or doing some operation on values in registers.

The Arithmetic Logic Unit (ALU) is the heart of the CPU operation. It takes values in registers and performs any of the multitude of operations the CPU is capable of. All modern processors have a number of ALUs so each can be working independently. In fact, processors such as the Pentium have both fast and slow ALUs; the fast ones are smaller (so you can fit more on the CPU) but can do only the most common operations, slow ALUs can do all operations but are bigger.

The Address Generation Unit (AGU) handles talking to cache and main memory to get values into the registers for the ALU to operate on and get values out of registers back into main memory.

Floating point registers have the same concepts, but use slightly different terminology for their components.

Pipelining

As we can see above, whilst the ALU is adding registers together is completely separate to the AGU writing values back to memory, so there is no reason why the CPU can not be doing both at once. We also have multiple ALUs in the system, each which can be working on separate instructions. Finally the CPU could be doing some floating point operations with its floating point logic whilst integer instructions are in flight too. This process is called pipelining[1], and a processor that can do this is referred to as a superscalar architecture. All modern processors are superscalar.

Another analogy might be to think of the pipeline like a hose that is being filled with marbles, except our marbles are instructions for the CPU. Ideally you will be putting your marbles in one end, one after the other (one per clock pulse), filling up the pipe. Once full, for each marble (instruction) you push in all the others will move to the next position and one will fall out the end (the result).

Branch instruction play havoc with this model however, since they may or may not cause execution to start from a different place. If you are pipelining, you will have to basically guess which way the branch will go, so you know which instructions to bring into the pipeline. If the CPU has predicted correctly, everything goes fine![2] Conversely, if the processor has predicted incorrectly it has wasted a lot of time and has to clear the pipeline and start again.

This process is usually referred to as a pipeline flush and is analogous to having to stop and empty out all your marbles from your hose!

Branch Prediction

pipeline flush, predict taken, predict not taken, branch delay slots

Reordering

This bit is crap

In fact, if the CPU is the hose, it is free to reorder the marbles within the hose, as long as they pop out the end in the same order you put them in. We call this program order since this is the order that instructions are given in the computer program.

Consider an instruction stream such as

Figure 3-3. Reorder buffer example

1: r3 = r1 * r2
2: r4 = r2 + r3
3: r7 = r5 * r6
4: r8 = r1 + r7
              

Instruction 2 needs to wait for instruction 1 to complete fully before it can start. This means that the pipeline has to stall as it waits for the value to be calculated. Similarly instructions 3 and 4 have a dependency on r7. However, instructions 2 and 3 have no dependency on each other at all; this means they operate on completely separate registers. If we swap instructions 2 and 3 we can get a much better ordering for the pipeline since the processor can be doing useful work rather than waiting for the pipeline to complete to get the result of a previous instruction.

However, when writing very low level code some instructions may require some security about how operations are ordered. We call this requirement memory semantics. If you require acquire semantics this means that for this instruction you must ensure that the results of all previous instructions have been completed. If you require release semantics you are saying that all instructions after this one must see the current result. Another even stricter semantic is a memory barrier or memory fence which requires that operations have been committed to memory before continuing.

On some architectures these semantics are guaranteed for you by the processor, whilst on others you must specify them explicitly. Most programmers do not need to worry directly about them, although you may see the terms.

CISC v RISC

A common way to divide computer architectures is into Complex Instruction Set Computer (CISC) and Reduced Instruction Set Computer (RISC).

Note in the first example, we have explicitly loaded values into registers, performed an addition and stored the result value held in another register back to memory. This is an example of a RISC approach to computing -- only performing operations on values in registers and explicitly loading and storing values to and from memory.

A CISC approach may only a single instruction taking values from memory, performing the addition internally and writing the result back. This means the instruction may take many cycles, but ultimately both approaches achieve the same goal.

All modern architectures would be considered RISC architectures[3].

There are a number of reasons for this

  • Whilst RISC makes assembly programming becomes more complex, since virtually all programmers use high level languages and leave the hard work of producing assembly code to the compiler, so the other advantages outweigh this disadvantage.

  • Because the instructions in a RISC processor are much more simple, there is more space inside the chip for registers. As we know from the memory hierarchy, registers are the fastest type of memory and ultimately all instructions must be performed on values held in registers, so all other things being equal more registers leads to higher performance.

  • Since all instructions execute in the same time, pipelining is possible. We know pipelining requires streams of instructions being constantly fed into the processor, so if some instructions take a very long time and others do not, the pipeline becomes far to complex to be effective.

EPIC

The Itanium processor, which is used in many example through this book, is an example of a modified architecture called Explicitly Parallel Instruction Computing.

We have discussed how superscaler processors have pipelines that have many instructions in flight at the same time in different parts of the processor. Obviously for this to work as well as possible instructions should be given the processor in an order that can make best use of the available elements of the CPU.

Traditionally organising the incoming instruction stream has been the job of the hardware. Instructions are issued by the program in a sequential manner; the processor must look ahead and try to make decisions about how to organise the incoming instructions.

The theory behind EPIC is that there is more information available at higher levels which can make these decisions better than the processor. Analysing a stream of assembly language instructions, as current processors do, looses a lot of information that the programmer may have provided in the original source code. Think of it as the difference between studying a Shakespeare play and reading the Cliff's Notes version of the same. Both give you the same result, but the original has all sorts of extra information that sets the scene and gives you insight into the characters.

Thus the logic of ordering instructions can be moved from the processor to the compiler. This means that compiler writers need to be smarter to try and find the best ordering of code for the processor. The processor is also significantly simplified, since a lot of its work has been moved to the compiler.[4]

Notes

[1]

In fact, any modern processor has many more than four stages it can pipeline, above we have only shown a very simplified view. The more stages that can be executed at the same time, the deeper the pipeline.

[2]

Processors such as the Pentium use a trace cache to keep a track of which way branches are going. Much of the time it can predict which way a branch will go by remembering its previous result. For example, in a loop that happens 100 times, if you remember the last result of the branch you will be right 99 times, since only the last time will you actually continue with the program.

[3]

Even the most common architecture, the Intel Pentium, whilst having an instruction set that is categorised as CISC, internally breaks down instructions to RISC style sub-instructions inside the chip before executing.

[4]

Another term often used around EPIC is Very Long Instruction World (VLIW), which is where each instruction to the processor is extended to tell the processor about where it should execute the instruction in it's internal units. The problem with this approach is that code is then completely dependent on the model of processor is has been compiled for. Companies are always making revisions to hardware, and making customers recompile their application every single time, and maintain a range of different binaries was impractical.

EPIC solves this in the usual computer science manner by adding a layer of abstraction. Rather than explicitly specifying the exact part of the processor the instructions should execute on, EPIC creates a simplified view with a few core units like memory, integer and floating point.


 

 

 

the original link:http://bottomupcs.sourceforge.net/csbu/c1453.htm

 

 

 

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
东南亚位于我国倡导推进的“一带一路”海陆交汇地带,作为当今全球发展最为迅速的地区之一,近年来区域内生产总值实现了显著且稳定的增长。根据东盟主要经济体公布的最新数据,印度尼西亚2023年国内生产总值(GDP)增长5.05%;越南2023年经济增长5.05%;马来西亚2023年经济增速为3.7%;泰国2023年经济增长1.9%;新加坡2023年经济增长1.1%;柬埔寨2023年经济增速预计为5.6%。 东盟国家在“一带一路”沿线国家中的总体GDP经济规模、贸易总额与国外直接投资均为最大,因此有着举足轻重的地位和作用。当前,东盟与中国已互相成为双方最大的交易伙伴。中国-东盟贸易总额已从2013年的443亿元增长至 2023年合计超逾6.4万亿元,占中国外贸总值的15.4%。在过去20余年中,东盟国家不断在全球多变的格局里面临挑战并寻求机遇。2023东盟国家主要经济体受到国内消费、国外投资、货币政策、旅游业复苏、和大宗商品出口价企稳等方面的提振,经济显现出稳步增长态势和强韧性的潜能。 本调研报告旨在深度挖掘东南亚市场的增长潜力与发展机会,分析东南亚市场竞争态势、销售模式、客户偏好、整体市场营商环境,为国内企业出海开展业务提供客观参考意见。 本文核心内容: 市场空间:全球行业市场空间、东南亚市场发展空间。 竞争态势:全球份额,东南亚市场企业份额。 销售模式:东南亚市场销售模式、本地代理商 客户情况:东南亚本地客户及偏好分析 营商环境:东南亚营商环境分析 本文纳入的企业包括国外及印尼本土企业,以及相关上下游企业等,部分名单 QYResearch是全球知名的大型咨询公司,行业涵盖各高科技行业产业链细分市场,横跨如半导体产业链(半导体设备及零部件、半导体材料、集成电路、制造、封测、分立器件、传感器、光电器件)、光伏产业链(设备、硅料/硅片、电池片、组件、辅料支架、逆变器、电站终端)、新能源汽车产业链(动力电池及材料、电驱电控、汽车半导体/电子、整车、充电桩)、通信产业链(通信系统设备、终端设备、电子元器件、射频前端、光模块、4G/5G/6G、宽带、IoT、数字经济、AI)、先进材料产业链(金属材料、高分子材料、陶瓷材料、纳米材料等)、机械制造产业链(数控机床、工程机械、电气机械、3C自动化、工业机器人、激光、工控、无人机)、食品药品、医疗器械、农业等。邮箱:market@qyresearch.com

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值