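I have long known that it is best to keep memory aligned on 4-byte (doubleword) boundaries when writing programs, because it improves performance. This is also why many compilers insert a directive such as .align 4 into the code they generate. Today I finally came across an official document that explains the reason: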
When used in a configuration with a 32-bit bus, actual transfers of data between processor and memory take place in units of doublewords beginning at addresses evenly divisible by four; however, the processor converts requests for misaligned words or doublewords into the appropriate sequences of requests acceptable to the memory interface. Such misaligned data transfers reduce performance by requiring extra memory cycles.
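The passage says that on a 32-bit bus every actual transfer is made in units of doublewords, and the CPU then assembles the requested data from them. If an address is not doubleword-aligned (that is, not evenly divisible by 4), the CPU spends extra memory cycles, because it may need two fetches to put the requested value together. Doubleword alignment therefore helps performance noticeably. This is also why the top of the stack is usually kept on a doubleword boundary, and why each push or pop in 32-bit code moves 4 bytes at a time: every stack operation then needs only one memory access, and the stack top is still 4-byte aligned afterwards.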
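To make the "two fetches" point concrete, here is a minimal sketch in C that counts how many aligned doubleword transfers a single access would need on a 32-bit bus. The function name bus_transfers is made up for this illustration, and the model is deliberately simplified: it ignores caches and assumes addr + size does not wrap around.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of aligned 4-byte bus transfers needed to
   access `size` bytes starting at `addr`, assuming a 32-bit bus that only
   moves doublewords located at addresses divisible by four.
   Assumes addr + size - 1 does not overflow. */
unsigned bus_transfers(uint32_t addr, uint32_t size)
{
    assert(size > 0);
    uint32_t first = addr / 4;              /* doubleword holding the first byte */
    uint32_t last  = (addr + size - 1) / 4; /* doubleword holding the last byte  */
    return (unsigned)(last - first + 1);
}
```

Under this model, a doubleword at 0x1000 needs one transfer, while the same doubleword at 0x1001 straddles two aligned doublewords and needs two. A 2-byte word at 0x1002 still fits in one transfer, but at 0x1003 it needs two.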
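Not every part of a program needs 4-byte alignment, though. Code, for example, does not: instructions are prefetched and queued inside the CPU (a prefetch usually pulls in a batch of instructions at a time, and it happens while the CPU is busy with other pipeline work), so misaligned instructions do not hurt performance much. The following passage explains why: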
Due to instruction prefetching and queuing within the CPU, there is no requirement for instructions to be aligned on word or doubleword boundaries. (However, a slight increase in speed results if the target addresses of control transfers are evenly divisible by four.)
(Note: both English passages above are quoted from the Intel 80386 Programmer's Reference Manual, 1986.)
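Placing a jump target or a data item on an address divisible by four comes down to rounding the current address up to the next multiple of 4 and padding the gap. The sketch below shows that arithmetic in C. The names align_up and padding_for are invented for this example and are not taken from any assembler or from the manual.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round `addr` up to the next multiple of `alignment`.
   `alignment` must be a power of two (4 for doubleword alignment).
   Assumes addr + alignment - 1 does not overflow. */
uint32_t align_up(uint32_t addr, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical helper: number of padding bytes that would have to be
   inserted before `addr` to reach the next aligned address. */
uint32_t padding_for(uint32_t addr, uint32_t alignment)
{
    return align_up(addr, alignment) - addr;
}
```

For instance, align_up(0x1001, 4) gives 0x1004, which means 3 bytes of padding, while an address that is already aligned, such as 0x1004, is left unchanged with no padding.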