Why Registers Are Faster Than RAM

衾冷锦疏

Original source: www.mikeash.com/pyblog/friday-qa-2013-10-11-why-registers-are-fast-and-ram-is-slow.html

Why Registers Are Fast and RAM Is Slow

by Mike Ash

In the previous article on ARM64, I mentioned that one advantage of the new architecture is that it has twice as many registers, allowing code to load data less often from RAM, which is much slower. Reader Daniel Hooper asks the natural question: just why is RAM so much slower than registers?

Distance
Let's start with distance. It's not necessarily a big factor, but it's the most fun to analyze. RAM is farther away from the CPU than registers are, which can make it take longer to fetch data from it.

Take a 3GHz processor as an extreme example. The speed of light is roughly one foot per nanosecond, or about 30cm per nanosecond for you metric folk. Light can only travel about four inches in the time of a single clock cycle of this processor. That means a roundtrip signal can only get to a component that's two inches away or less, and that assumes that the hardware is perfect and able to transmit information at the speed of light in vacuum. For a desktop PC, that's pretty significant. However, it's much less important for an iPhone, where the clock speed is much lower (the 5S runs at 1.3GHz) and the RAM is right next to the CPU.
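The arithmetic behind those figures can be checked with a short sketch. The function names here are made up for illustration, and the calculation assumes the idealized case from the text: a signal moving at the speed of light in vacuum.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # speed of light in vacuum
CM_PER_INCH = 2.54


def light_travel_per_cycle_cm(clock_hz: float) -> float:
    """Distance light covers in vacuum during one clock cycle, in centimeters."""
    seconds_per_cycle = 1.0 / clock_hz
    return SPEED_OF_LIGHT_M_PER_S * seconds_per_cycle * 100


def max_round_trip_reach_cm(clock_hz: float) -> float:
    """Farthest component a signal could reach and return from within one cycle."""
    return light_travel_per_cycle_cm(clock_hz) / 2


if __name__ == "__main__":
    for label, hz in [("3 GHz desktop", 3e9), ("1.3 GHz iPhone 5S", 1.3e9)]:
        one_way = light_travel_per_cycle_cm(hz)
        reach = max_round_trip_reach_cm(hz)
        print(f"{label}: {one_way:.1f} cm per cycle "
              f"({one_way / CM_PER_INCH:.1f} in), "
              f"round-trip reach {reach:.1f} cm ({reach / CM_PER_INCH:.1f} in)")
```

At 3GHz this gives roughly 10cm (about four inches) per cycle and a round-trip reach of about two inches, matching the numbers above. Real signals in copper or silicon are slower than this, so the practical reach is shorter.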

Cost
Much as we might wish it wasn't, cost is always a factor. In software, when trying to make a program run fast, we don't go through the entire program and give it equal attention. Instead, we identify the hotspots that are most critical to performance, and give them the most attention. This makes the best use of our limited resources. Hardware is similar. Faster hardware is more expensive, and that expense is best spent where it'll make the most difference.

Registers get used extremely frequently, and there aren't a lot of them. There are only about 6,000 bits of register data in an A7 (32 64-bit general-purpose registers plus 32 128-bit floating-point registers, and some miscellaneous ones). There are about 8 billion bits (1GB) of RAM in an iPhone 5S. It's worthwhile to spend a bunch of money making each register bit faster. There are literally a million times more RAM bits, and those eight billion bits pretty much have to be as cheap as possible if you want a $650 phone instead of a $6,500 phone.
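As a quick sanity check on the bit counts, here is a small sketch. The helper names are hypothetical, it counts only the general-purpose and floating-point registers named above (not the miscellaneous ones), and it treats 1GB as 2^30 bytes.

```python
def register_file_bits(gp_count: int = 32, gp_width: int = 64,
                       fp_count: int = 32, fp_width: int = 128) -> int:
    """Total bits in the general-purpose plus floating-point registers."""
    return gp_count * gp_width + fp_count * fp_width


def ram_bits(gigabytes: int = 1) -> int:
    """Total bits in the given amount of RAM, taking 1 GB as 2**30 bytes."""
    return gigabytes * 2**30 * 8


if __name__ == "__main__":
    reg_bits = register_file_bits()   # 2048 + 4096 = 6144
    mem_bits = ram_bits(1)            # 8,589,934,592
    print(f"register bits: {reg_bits:,}")
    print(f"RAM bits:      {mem_bits:,}")
    print(f"ratio:         {mem_bits / reg_bits:,.0f} RAM bits per register bit")
```

The ratio comes out to about 1.4 million, so "a million times more" is, if anything, a slight understatement.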

Registers use an expensive design that can be read quickly. Reading a register bit is a matter of activating the right transistor and then waiting a short time for the register hardware to push the read line to the appropriate state.

Reading a RAM bit, on the other hand, is more involved. A bit in the DRAM found in any smartphone or PC consists of a single capacitor and a single transistor. The capacitors are extremely small, as you'd expect given that you can fit eight billion of them in your pocket. This means they carry a very small amount of charge, which makes it hard to measure. We like to think of digital circuits as dealing in ones and zeroes, but the analog world comes into play here. The read line is pre-charged to a level that's halfway between a one and a zero. Then the capacitor is connected to it, which either adds or drains a tiny amount of charge. An amplifier is used to push the charge towards zero or one. Once the charge in the line is sufficiently amplified, the result can be returned.
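To get a feel for how small the signal is, here is a rough sketch of the charge-sharing step using conservation of charge. The supply voltage and both capacitance values are illustrative assumptions, not figures from any particular DRAM part, and the `sense` function is a stand-in for the amplifier rather than a model of a real sense-amp circuit.

```python
# All electrical values below are assumptions chosen for illustration.
VDD = 1.2                  # volts; a stored "one" is taken to be VDD, a "zero" 0 V
PRECHARGE_V = VDD / 2      # read line pre-charged halfway between one and zero
CELL_CAP_F = 25e-15        # assumed cell capacitance: 25 femtofarads
BITLINE_CAP_F = 250e-15    # assumed read-line capacitance: 250 femtofarads


def bitline_voltage_after_sharing(cell_v: float,
                                  precharge_v: float = PRECHARGE_V,
                                  cell_cap_f: float = CELL_CAP_F,
                                  bitline_cap_f: float = BITLINE_CAP_F) -> float:
    """Voltage on the read line once the cell capacitor is connected to it.

    Total charge is conserved and spreads over both capacitances.
    """
    total_charge = cell_cap_f * cell_v + bitline_cap_f * precharge_v
    return total_charge / (cell_cap_f + bitline_cap_f)


def sense(bitline_v: float, precharge_v: float = PRECHARGE_V) -> int:
    """Stand-in for the amplifier: push the tiny swing to a full one or zero."""
    return 1 if bitline_v > precharge_v else 0


if __name__ == "__main__":
    for stored_bit in (0, 1):
        v = bitline_voltage_after_sharing(VDD if stored_bit else 0.0)
        swing_mv = (v - PRECHARGE_V) * 1000
        print(f"stored {stored_bit}: line at {v:.4f} V "
              f"(swing {swing_mv:+.1f} mV), sensed as {sense(v)}")
```

With these assumed values the read line moves by only about 55 millivolts in either direction, which is why an amplification step (and the time it takes) is needed before the result can be returned.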

The fact that a RAM bit is only one transistor and one tiny capacitor makes it extremely cheap to manufacture. Register bits contain more parts and thereby cost much more.

There's also a lot more complexity involved just in figuring out what hardware to talk to with RAM because there's so much more of it. Reading from a register looks like:

1. Extract the relevant bits from the instruction.
2. Put those bits onto the register file's read lines.
3. Read the result.

Reading from RAM looks like:

1. Get the pointer to the data being loaded. (Said pointer is probably in a register. This already encompasses all of the work done above!)
2. Send that pointer off to the MMU.
3. The MMU translates the virtual address in the pointer to a physical address.
4. Send the physical address to the memory controller.
5. The memory controller figures out what bank of RAM the data is in and asks the RAM.
6. The RAM figures out what particular chunk the data is in, and asks that chunk.
7. Step 6 may repeat a couple more times before narrowing it down to a single array of cells.
8. Load the data from the array.
9. Send it back to the memory controller.
10. Send it back to the CPU.
11. Use it!
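The MMU translation and the address decoding in the middle of that list can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the 4KB page size, the one-entry page table, and the way the physical address is split into bank, row, and column bits. Real memory controllers use their own, often more elaborate, mappings.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

# Hypothetical page table: virtual page number -> physical frame number.
PAGE_TABLE = {0x12345: 0x00ABC}

# Hypothetical physical address layout, low bits first:
# 10 column bits, then 3 bank bits, and the remaining high bits select the row.
COLUMN_BITS = 10
BANK_BITS = 3


def translate(virtual_address: int) -> int:
    """What the MMU does: map a virtual address to a physical address."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    frame_number = PAGE_TABLE[page_number]  # KeyError here would be a page fault
    return frame_number * PAGE_SIZE + offset


def decode_dram_address(physical_address: int) -> tuple[int, int, int]:
    """What the memory controller does: split the address into (bank, row, column)."""
    column = physical_address & ((1 << COLUMN_BITS) - 1)
    bank = (physical_address >> COLUMN_BITS) & ((1 << BANK_BITS) - 1)
    row = physical_address >> (COLUMN_BITS + BANK_BITS)
    return bank, row, column


if __name__ == "__main__":
    pointer = 0x12345678
    physical = translate(pointer)
    bank, row, column = decode_dram_address(physical)
    print(f"virtual {pointer:#x} -> physical {physical:#x}")
    print(f"bank {bank}, row {row}, column {column}")
```

Even in this toy form there are two lookups and several bit manipulations before any memory cell is touched, whereas the register read is a direct index taken from the instruction bits.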

Whew.

Dealing With Slow RAM
That sums up why RAM is so much slower. But how does the CPU deal with such slowness? A RAM load is a single CPU instruction, but it can take potentially hundreds of CPU cycles to complete. How does the CPU deal with this?

First, just how long does a CPU take to execute a single instruction? It can be tempting to just assume that a single instruction executes in a single cycle, but reality is, of course, much more complicated.

Back in the good old days, when men wore their sheep proudly and the nation was undefeated in war, this was not a difficult question to answer. It wasn't one-instruction-one-cycle, but there was at least some clear correspondence. The Intel 4004, for example, took either 8 or 16 clock cycles to execute one instruction, depending on what that instruction was. Nice and understandable. Things gradually got more complex, with a wide variety of timings for different instructions. Older CPU manuals will give a list of how long each instruction takes to execute.

Now? Not so simple.

Along with increasing clock rates, there's also been a long drive to increase the number of instructions that can be executed per clock cycle. Back in the day, that number was something like 0.1 of an instruction per clock cycle. These days, it's up around 3-4 on a good day. How does it perform this wizardry? When you have a billion or more transistors per chip, you can add in a lot of smarts. Although the CPU might be executing 3-4 instructions per clock cycle, that doesn't mean each instruction takes 1/4th of a clock cycle to execute. They still take at least one cycle, often more. What happens is that the CPU is able to maintain multiple instructions in flight at any given time. Each instruction can be broken up into pieces: load the instruction, decode it to see what it means, gather the input data, perform the computation, store the output data. Those can all happen on separate cycles.

On any given CPU cycle, the CPU is doing a bunch of stuff simultaneously:

Fetching potentially several instructions at once.
Decoding potentially a completely different set of instructions.
Fetching the data for potentially yet another different set of instructions.
Performing computations for yet more instructions.
Storing data for yet more instructions.
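The overlap described above can be pictured with a toy model. This sketch assumes an idealized pipeline that starts one instruction per cycle, has the five stages named earlier, and never stalls. A real CPU is wider and far messier, so treat it only as a picture of what "multiple instructions in flight" means.

```python
STAGES = ["fetch", "decode", "read operands", "execute", "write back"]


def pipeline_timeline(n_instructions: int) -> dict[int, dict[str, int]]:
    """Map each cycle to {stage: index of the instruction occupying that stage}.

    Instruction i enters the first stage on cycle i and advances one stage
    per cycle (idealized: one instruction started per cycle, no stalls).
    """
    timeline: dict[int, dict[str, int]] = {}
    for instr in range(n_instructions):
        for stage_index, stage in enumerate(STAGES):
            cycle = instr + stage_index
            timeline.setdefault(cycle, {})[stage] = instr
    return timeline


if __name__ == "__main__":
    timeline = pipeline_timeline(8)
    for cycle in sorted(timeline):
        busy = ", ".join(f"{stage}=i{instr}"
                         for stage, instr in timeline[cycle].items())
        print(f"cycle {cycle:2d}: {busy}")
    print(f"8 instructions finish in {len(timeline)} cycles, "
          f"not {8 * len(STAGES)}")
```

Each instruction still spends five cycles in the pipeline, but once the pipeline is full, one instruction completes every cycle because five different instructions are being worked on at once.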

But, you say, how could this possibly work? For example:

    add x1, x1, x2
    add x1, x1, x3

These can't possibly execute in parallel like that! You need to be finished with the first instruction before you start the second!

It's true, that can't possibly work. That's where the smarts come in. The CPU is able to analyze the instruction stream and figure out which instructions depend on other instructions and shuffle things around. For example, if an instruction after those two adds doesn't depend on them, the CPU could end up executing that instruction before the second add, even though it comes later in the instruction stream. The ideal of 3-4 instructions per clock cycle can only be achieved in code that has a lot of independent instructions.
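Here is a minimal sketch of that dependency analysis, under loud assumptions: every instruction takes one cycle, up to four can start per cycle, and only read-after-write dependencies matter (as if register renaming had removed the others). The `Instr` type and `schedule` function are invented for this example and are not how any specific CPU implements it.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str      # assembly text, for display only
    dest: str      # register written
    srcs: tuple    # registers read


def schedule(instrs, width: int = 4, latency: int = 1) -> list:
    """Return the earliest cycle each instruction can start executing.

    An instruction may start once all of its source registers have been
    produced, subject to at most `width` instructions starting per cycle.
    """
    ready_at = {}           # register -> cycle its newest value becomes available
    started_per_cycle = {}  # cycle -> number of instructions started
    start_cycles = []
    for ins in instrs:      # walk the stream in program order
        cycle = max((ready_at.get(reg, 0) for reg in ins.srcs), default=0)
        while started_per_cycle.get(cycle, 0) >= width:
            cycle += 1      # this cycle is full, try the next one
        started_per_cycle[cycle] = started_per_cycle.get(cycle, 0) + 1
        ready_at[ins.dest] = cycle + latency
        start_cycles.append(cycle)
    return start_cycles


if __name__ == "__main__":
    program = [
        Instr("add x1, x1, x2", "x1", ("x1", "x2")),
        Instr("add x1, x1, x3", "x1", ("x1", "x3")),  # needs the first add's x1
        Instr("add x4, x5, x6", "x4", ("x5", "x6")),  # independent of both
    ]
    for ins, cycle in zip(program, schedule(program)):
        print(f"cycle {cycle}: {ins.text}")
```

The third add starts in cycle 0 alongside the first, ahead of the second add that precedes it in the instruction stream, which has to wait a cycle for `x1`.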

What happens when you hit a memory load instruction? First of all, it is definitely going to take forever, relatively speaking. If you're really lucky and the value is in L1 cache, it'll only take a few cycles. If you're unlucky and it has to go all the way out to main RAM to find the data, it could take literally hundreds of cycles. There may be a lot of thumb-twiddling to be done.

The CPU will try not to twiddle its thumbs, because that's inefficient. First, it will try to anticipate. It may be able to spot that load instruction in advance, figure out what it's going to load, and initiate the load before it really starts executing the instruction. Second, it will keep executing other instructions while it waits, as long as it can. If there are instructions after the load instruction that don't depend on the data being loaded, they can still be executed. Finally, once it's executed everything it can and it absolutely cannot proceed any further without that data it's waiting on, it has little choice but to stall and wait for the data to come back from RAM.
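A back-of-the-envelope model shows how much of a load's latency independent work can hide. The specific latencies (4 cycles for an L1 hit, 200 for main RAM) and the instruction counts are assumptions picked to match "a few" and "hundreds"; the function is a deliberately crude estimate, not a description of real hardware behavior.

```python
def stall_cycles(load_latency: float,
                 independent_instructions: int,
                 instructions_per_cycle: float = 3.0) -> float:
    """Cycles the CPU sits idle waiting for a load.

    Independent instructions keep the CPU busy for
    independent_instructions / instructions_per_cycle cycles;
    whatever latency is left over is pure waiting.
    """
    useful_cycles = independent_instructions / instructions_per_cycle
    return max(0.0, load_latency - useful_cycles)


if __name__ == "__main__":
    # Assumed latencies, in cycles.
    for label, latency in [("L1 cache hit", 4), ("main RAM", 200)]:
        for independent in (0, 12, 60):
            idle = stall_cycles(latency, independent)
            print(f"{label:12s} latency {latency:3d}, "
                  f"{independent:2d} independent instructions -> "
                  f"{idle:.0f} cycles stalled")
```

Under these assumptions a dozen independent instructions fully cover an L1 hit, but even sixty of them hide only a small fraction of a trip to main RAM.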

Conclusion

RAM is slow because there's a ton of it.
That means you have to use designs that are cheaper, and cheaper means slower.
Modern CPUs do crazy things internally and will happily execute your instruction stream in an order that's wildly different from how it appears in the code.
That means that the first thing a CPU does while waiting for a RAM load is run other code.
If all else fails, it'll just stop and wait, and wait, and wait, and wait.
