An “In-Depth” Look at Byte Alignment

tobybo

Last modified 2023-05-06 18:29:13

Tags: system architecture

First published 2023-05-05 17:14:55

Article link: https://blog.csdn.net/bo_self_effacing/article/details/130445209

Preface

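What prompted this article: while solving a problem on leetcode, I quite “naturally” made use of the spare memory that byte alignment leaves behind as padding. The code is as follows: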

/**
 * Definition for singly-linked list.
 * struct ListNode {
 *     int val;
 *     ListNode *next;
 *     ListNode(int x) : val(x), next(NULL) {}
 * };
 */
class Solution {
public:
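    ListNode *detectCycle(ListNode *head) {
        if (!head) return nullptr;
        // Assumes a 64-bit build: bytes 4..7 of each node are the padding
        // between val and next, so byte 4 can hold a "visited" flag.
        // Padding is never initialized, so this relies on it not already being 1.
        *((char*)head + 4) = 1;
        head = head->next;
        while (head) {
            // A node whose flag is already set has been seen: the cycle starts here.
            if (*((char*)head + 4) == 1) {
                break;
            }
            *((char*)head + 4) = 1;
            head = head->next;
        }
        return head;
    }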
};

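As the ListNode definition given by the problem shows, the struct contains an int member val and a pointer member next. Because of byte alignment, the offset of next within the struct must be a multiple of next's alignment requirement, which for a pointer normally equals its size, so the compiler has “probably” inserted 4 padding bytes between val and next. Why “probably”? Because I was guessing that leetcode compiles with 64-bit gcc, where an int is 4 bytes wide and a pointer takes 8 bytes. In a 32-bit build the pointer would take 4 bytes, the same width as int, and there would be no padding at all. The submission passed every test case, which is consistent with the guess.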
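The guess can also be checked without submitting anything. The sketch below is my own illustration, assuming a copy of the ListNode definition from the problem statement; kGap is a hypothetical name for the number of bytes between the end of val and the start of next.

```cpp
#include <cstddef>  // offsetof, std::size_t

// Copy of the ListNode definition from the problem statement.
struct ListNode {
    int val;
    ListNode *next;
    ListNode(int x) : val(x), next(nullptr) {}
};

// Padding bytes the compiler placed between val and next.
constexpr std::size_t kGap = offsetof(ListNode, next) - sizeof(int);

// Whatever the platform, next has to start at a multiple of its alignment.
static_assert(offsetof(ListNode, next) % alignof(ListNode *) == 0,
              "next must be aligned for a pointer");
```

With 64-bit gcc on Linux I would expect kGap to be 4 and sizeof(ListNode) to be 16; in a 32-bit build, 0 and 8. Printing both values is the quickest way to see which case applies.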

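Reflections

Although the problem passed “silently”, it stirred up several things about byte alignment that had long puzzled me:

Why does each struct member need to be aligned individually, with its offset within the struct rounded up to a multiple of the member's size?
Why may padding bytes also be needed after the last member, so that the struct's size is a multiple of the largest alignment (align) among its members?
How does the operating system guarantee that the memory handed to me starts at an aligned address? If the start address is not aligned, aligning inside the struct is pointless.
Where did byte alignment come from, why was it invented, and is it obsolete or about to become so? This question is mainly prompted by the drawback of byte alignment (extra memory overhead).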

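Research

Plenty of blog posts explain the general byte alignment mechanism in c/c++ and roughly why it exists: “aligned memory accesses are more efficient; an unaligned access may need two or more memory read cycles, plus more complex memory techniques to combine the contents of the reads”.
Here I share two articles that explain byte alignment particularly well, and then expand on some points from the second one. The second article comes straight from Wikipedia; after reading it, try looking the topic up on Baidu Baike and feel the gap -_-||.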

Byte alignment and ordering

The title links to the original. Since some readers may not be able to follow the link, I reproduce the content here.

Realtime systems consist of multiple processors communicating with each other via messages. For message communication to work correctly, the message formats should be defined unambiguously. In many systems this is achieved simply by defining C/C++ structures to implement the message format. Using C/C++ structures is a simple approach, but it has its own pitfalls. The problem is that different processors/compilers might define the same structure differently, thus causing incompatibility in the interface definition.

There are two reasons for these incompatibilities:

Byte Alignment Restrictions
Byte Ordering

Byte Alignment Restrictions

Most 16-bit and 32-bit processors do not allow words and long words to be stored at any offset. For example, the Motorola 68000 does not allow a 16 bit word to be stored at an odd address. Attempting to write a 16 bit number at an odd address results in an exception.

Why Restrict Byte Alignment?

32 bit microprocessors typically organize memory as shown below. Memory is accessed by performing 32 bit bus cycles. 32 bit bus cycles can however only be performed at addresses that are divisible by 4. (32 bit microprocessors do not use the address lines A1 and A0 for addressing memory).
(Note: A0 and A1 are instead used to generate the byte-enable lines that select which bytes within the word are transferred.)

The reasons for not permitting misaligned long word reads and writes are not difficult to see. For example, an aligned long word X would be written as X0, X1, X2 and X3. Thus the microprocessor can read the complete long word in a single bus cycle. If the same microprocessor now attempts to access a long word at address 0x100D, it will have to read bytes Y0, Y1, Y2 and Y3. Notice that this read cannot be performed in a single 32 bit bus cycle. The microprocessor will have to issue two different reads at address 0x100C and 0x1010 to read the complete long word. Thus it takes twice the time to read a misaligned long word.
[Figure: memory organized as 32-bit words, with the aligned long word X and the misaligned long word Y]
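The cycle count can be worked out with simple arithmetic. This is a minimal sketch under the model described in the passage (every bus cycle moves one 4-byte-aligned word); bus_cycles is a hypothetical helper of my own, and 0x100D stands for a long word that starts one byte past a word boundary.

```cpp
#include <cstdint>

// Number of 32-bit bus cycles needed to transfer `size` bytes starting at
// byte address `addr`, assuming each cycle moves one 4-byte-aligned word.
constexpr std::uint32_t bus_cycles(std::uint32_t addr, std::uint32_t size) {
    // index of the last word touched - index of the first word touched + 1
    return (addr + size - 1) / 4 - addr / 4 + 1;
}

static_assert(bus_cycles(0x1000, 4) == 1, "aligned long word: one cycle");
static_assert(bus_cycles(0x100D, 4) == 2, "misaligned long word: two cycles");
```

The misaligned access touches the words at 0x100C and 0x1010, hence two cycles.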

Compiler Byte Padding

Compilers have to follow the byte alignment restrictions defined by the target microprocessors. This means that compilers have to add pad bytes into user defined structures so that the structure does not violate any restrictions imposed by the target microprocessor.

The compiler padding is illustrated in the following example. Here a char is assumed to be one byte, a short is two bytes and a long is four bytes.

User Defined Structure

struct Message
{
  short opcode;
  char subfield;
  long message_length;
  char version;
  short destination_processor;
};

Actual Structure Definition Used By the Compiler

struct Message
{
  short opcode;
  char subfield;
  char pad1;            // Pad to start the long word at a 4 byte boundary
  long message_length;
  char version;
  char pad2;            // Pad to start a short at a 2 byte boundary
  short destination_processor;
  char pad3[4];         // Pad to align the complete structure to a 16 byte boundary
};

In the above example, the compiler has added pad bytes to enforce byte alignment rules of the target processor. If the above message structure was used in a different compiler/microprocessor combination, the pads inserted by that compiler might be different. Thus two applications using the same structure definition header file might be incompatible with each other.
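To see what your own compiler does with this structure, the offsets can be queried directly. A small sketch of my own, assuming the Message definition above; kLengthOffset and kDestOffset are hypothetical names. Note that long is 8 bytes under LP64 (for example 64-bit Linux), so the offsets there will differ from the 32-bit layout in the article.

```cpp
#include <cstddef>  // offsetof, std::size_t

struct Message {
    short opcode;
    char subfield;
    long message_length;
    char version;
    short destination_processor;
};

// Offsets chosen by the compiler in use; they can differ between ABIs.
constexpr std::size_t kLengthOffset = offsetof(Message, message_length);
constexpr std::size_t kDestOffset = offsetof(Message, destination_processor);

// These hold on any ABI: every member sits on a multiple of its alignment,
// and the total size is a multiple of the structure's alignment.
static_assert(kLengthOffset % alignof(long) == 0, "message_length is aligned");
static_assert(kDestOffset % alignof(short) == 0, "destination_processor is aligned");
static_assert(sizeof(Message) % alignof(Message) == 0, "size is a multiple of alignment");
```

With a 4-byte long I would expect offsets 4 and 10 and a size of 12; on 64-bit Linux with gcc, offsets 8 and 18 and a size of 24. That difference is exactly the incompatibility the article warns about.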

Thus it is a good practice to insert pad bytes explicitly in all C-structures that are shared in an interface between machines differing in either the compiler and/or microprocessor.
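A sketch of what explicit padding could look like. The fixed-width types and the static_assert checks are my own additions (the article only recommends the pad bytes), and WireMessage is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>

struct WireMessage {
    std::int16_t opcode;
    std::uint8_t subfield;
    std::uint8_t pad1;       // explicit pad: message_length starts at offset 4
    std::int32_t message_length;
    std::uint8_t version;
    std::uint8_t pad2;       // explicit pad: destination_processor starts at offset 10
    std::int16_t destination_processor;
};

// Fail the build if a compiler lays the structure out differently.
static_assert(offsetof(WireMessage, message_length) == 4, "unexpected padding");
static_assert(offsetof(WireMessage, destination_processor) == 10, "unexpected padding");
static_assert(sizeof(WireMessage) == 12, "unexpected tail padding");
```

With every gap spelled out, the compiler has nothing left to insert, and the assertions turn a silent layout mismatch into a compile error. Byte order still has to be agreed on separately.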

General Byte Alignment Rules

The following byte padding rules will generally work with most 32 bit processors. You should consult your compiler and microprocessor manuals to see if you can relax any of these rules.
(Note: these rules may not be all that “general”.)

Single byte numbers can be aligned at any address
Two byte numbers should be aligned to a two byte boundary
Four byte numbers should be aligned to a four byte boundary
Structures between 1 and 4 bytes of data should be padded so that the total structure is 4 bytes.
Structures between 5 and 8 bytes of data should be padded so that the total structure is 8 bytes.
Structures between 9 and 16 bytes of data should be padded so that the total structure is 16 bytes.
Structures greater than 16 bytes should be padded to 16 byte boundary.

Structure Alignment for Efficiency

Sometimes array indexing efficiency can also determine the pad bytes in the structure. Note that compilers index into arrays by calculating the address of the indexed entry by multiplying the index with the size of the structure. This number is then added to the array base address to obtain the final address. Since this operation involves a multiply, indexing into arrays can be expensive. The array indexing can be considerably sped up by just making sure that the structure size is a power of 2. The compiler can then replace the multiply with a simple shift operation.
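The equivalence the compiler relies on is easy to state in code. A minimal sketch, assuming a 16-byte element; both helper names are hypothetical, and in practice the compiler makes this substitution itself.

```cpp
#include <cstddef>

// Byte offset of element `index` in an array of 16-byte structures.
constexpr std::size_t offset_by_multiply(std::size_t index) { return index * 16; }

// Same offset computed with a shift, possible because 16 == 1 << 4.
constexpr std::size_t offset_by_shift(std::size_t index) { return index << 4; }

static_assert(offset_by_multiply(5) == offset_by_shift(5), "shift replaces multiply");
```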

Although this article is mainly about byte alignment, the eventhelix.com article also covers endianness, and I think the following two sections are worth sharing:

Byte Ordering

Microprocessors support big-endian and little-endian byte ordering. Big-endian is an order in which the “big end” (most significant byte) is stored first (at the lowest address). Little-endian is an order in which the “little end” (least significant byte) is stored first.

The table below shows the representation of the hexadecimal number 0x0AC0FFEE on a big-endian and little-endian machine. The contents of memory locations 0x1000 to 0x1003 are shown.
Address   Big-endian   Little-endian
0x1000    0x0A         0xEE
0x1001    0xC0         0xFF
0x1002    0xFF         0xC0
0x1003    0xEE         0x0A
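The byte order of the machine you are on can be observed by copying a value into a byte array. A minimal sketch of my own; bytes_in_memory and is_little_endian are hypothetical helper names.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Returns the bytes of `value` in the order they are laid out in memory
// (index 0 is the lowest address).
inline std::array<unsigned char, 4> bytes_in_memory(std::uint32_t value) {
    std::array<unsigned char, 4> out{};
    std::memcpy(out.data(), &value, sizeof value);
    return out;
}

// True when the least significant byte is stored at the lowest address.
inline bool is_little_endian() {
    return bytes_in_memory(0x0AC0FFEE)[0] == 0xEE;
}
```

On a little-endian machine such as x86, bytes_in_memory(0x0AC0FFEE) yields EE FF C0 0A; on a big-endian machine it yields 0A C0 FF EE.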

Why Different Byte Ordering?

This is a difficult question. There is no logical reason why different microprocessor vendors decided to use different ordering schemes. Most of the reasons are historical. For example, Intel processors have traditionally been little-endian. Motorola processors have always been big-endian.

The situation is actually quite similar to that of Lilliputians in Gulliver’s Travels. Lilliputians were divided into two groups based on the end from which the egg should be broken. The big-endians preferred to break their eggs from the larger end. The little-endians broke their eggs from the smaller end.

Now for the main event, from Wikipedia, the free encyclopedia.

Data structure alignment

The article is long, so I only excerpt the key parts here for reference.

What is data structure alignment

Data structure alignment is the way data is arranged and accessed in computer memory. It consists of three separate but related issues: data alignment, data structure padding, and packing.
(Note: packing is outside the scope of this article, so that part is not excerpted.)

The CPU in modern computer hardware performs reads and writes to memory most efficiently when the data is naturally aligned, which generally means that the data’s memory address is a multiple of the data size. For instance, in a 32-bit architecture, the data may be aligned if the data is stored in four consecutive bytes and the first byte lies on a 4-byte boundary.

Data alignment is the aligning of elements according to their natural alignment. To ensure natural alignment, it may be necessary to insert some padding between structure elements or after the last element of a structure. For example, on a 32-bit machine, a data structure containing a 16-bit value followed by a 32-bit value could have 16 bits of padding between the 16-bit value and the 32-bit value to align the 32-bit value on a 32-bit boundary. Alternatively, one can pack the structure, omitting the padding, which may lead to slower access, but uses three quarters as much memory.
(Note: “three quarters as much memory” is relative to the padded layout that follows the alignment restrictions.)
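The three-quarters figure can be reproduced with the 16-bit plus 32-bit example from the passage. A minimal sketch of my own, using #pragma pack, which GCC, Clang and MSVC all accept; Padded and Packed are hypothetical names.

```cpp
#include <cstdint>

struct Padded {
    std::uint16_t a;
    std::uint32_t b;   // the compiler normally inserts 2 bytes before b
};

#pragma pack(push, 1)  // pack members on 1-byte boundaries
struct Packed {
    std::uint16_t a;
    std::uint32_t b;   // no padding, so b may sit at a misaligned address
};
#pragma pack(pop)

static_assert(sizeof(Packed) == 6, "2 + 4 bytes with no padding");
static_assert(sizeof(Packed) <= sizeof(Padded), "packing never grows the struct");
```

On the usual ABIs sizeof(Padded) is 8, so the packed version uses 6/8, three quarters of the memory.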

Although data structure alignment is a fundamental issue for all modern computers, many computer languages and computer language implementations handle data alignment automatically. Fortran, Ada, PL/I, Pascal, certain C and C++ implementations, D, Rust, C# and assembly language allow at least partial control of data structure padding, which may be useful in certain special circumstances.

Definitions

A memory access is said to be aligned when the data being accessed is n bytes long and the datum address is n-byte aligned. When a memory access is not aligned, it is said to be misaligned. Note that by definition byte memory accesses are always aligned.

A memory pointer that refers to primitive data that is n bytes long is said to be aligned if it is only allowed to contain addresses that are n-byte aligned, otherwise it is said to be unaligned. A memory pointer that refers to a data aggregate (a data structure or array) is aligned if (and only if) each primitive datum in the aggregate is aligned.

Note that the definitions above assume that each primitive datum is a power of two bytes long. When this is not the case (as with 80-bit floating-point on x86) the context influences the conditions where the datum is considered aligned or not.

Problems

The CPU accesses memory by a single memory word at a time. As long as the memory word size is at least as large as the largest primitive data type supported by the computer, aligned accesses will always access a single memory word. This may not be true for misaligned data accesses.

If the highest and lowest bytes in a datum are not within the same memory word the computer must split the datum access into multiple memory accesses. This requires a lot of complex circuitry to generate the memory accesses and coordinate them. To handle the case where the memory words are in different memory pages the processor must either verify that both pages are present before executing the instruction or be able to handle a TLB miss or a page fault on any memory access during the instruction execution.
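Whether an access straddles a memory word or a page comes down to comparing the block index of its first and last byte. A minimal sketch of my own; crosses_boundary is a hypothetical helper, and the 4096-byte page size is an assumption.

```cpp
#include <cstdint>

// True when the `size` bytes starting at `addr` touch more than one block of
// `block` bytes (a memory word, a cache line or a page, depending on `block`).
constexpr bool crosses_boundary(std::uint64_t addr, std::uint64_t size,
                                std::uint64_t block) {
    return addr / block != (addr + size - 1) / block;
}

// Assuming 4096-byte pages:
static_assert(!crosses_boundary(0x1FFC, 4, 4096), "aligned access stays in one page");
static_assert(crosses_boundary(0x1FFE, 4, 4096), "misaligned access straddles 0x2000");
```

A naturally aligned access of a power-of-two size never crosses a larger power-of-two boundary, which is why aligned data spares the processor the two-page case.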

Some processor designs deliberately avoid introducing such complexity, and instead yield alternative behavior in the event of a misaligned memory access. For example, implementations of the ARM architecture prior to the ARMv6 ISA require mandatory aligned memory access for all multi-byte load and store instructions. Depending on which specific instruction was issued, the result of attempted misaligned access might be to round down the least significant bits of the offending address turning it into an aligned access (sometimes with additional caveats), or to throw an MMU exception (if MMU hardware is present), or to silently yield other potentially unpredictable results. The ARMv6 and later architectures support unaligned access in many circumstances, but not necessarily all.
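When code has to read a multi-byte value from an address that may be misaligned, one common portable approach (my addition, not something the quoted article describes) is to copy the bytes with std::memcpy instead of dereferencing a cast pointer. load_u32 is a hypothetical helper name.

```cpp
#include <cstdint>
#include <cstring>

// Reads a 32-bit value that may start at any byte address.
// memcpy has no alignment requirement, so this is well defined even when
// p is not a multiple of 4.
inline std::uint32_t load_u32(const unsigned char *p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}
```

The value is interpreted in the machine's native byte order.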

When a single memory word is accessed the operation is atomic, i.e. the whole memory word is read or written at once and other devices must wait until the read or write operation completes before they can access it. This may not be true for unaligned accesses to multiple memory words, e.g. the first word might be read by one device, both words written by another device and then the second word read by the first device so that the value read is neither the original value nor the updated value. Although such failures are rare, they can be very difficult to identify.

Data structure padding

Although the compiler (or interpreter) normally allocates individual data items on aligned boundaries, data structures often have members with different alignment requirements. To maintain proper alignment the translator normally inserts additional unnamed data members so that each member is properly aligned. In addition, the data structure as a whole may be padded with a final unnamed member. This allows each member of an array of structures to be properly aligned.
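The reason for the final unnamed member shows up as soon as the structure is put in an array. A minimal sketch of my own; Sample and id_offset are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

struct Sample {
    std::uint32_t id;
    char flag;          // followed by tail padding
};

// sizeof is the distance between consecutive array elements, so it has to be
// a multiple of the structure's alignment.
static_assert(sizeof(Sample) % alignof(Sample) == 0, "stride keeps elements aligned");

// Byte offset of the id member of element i, measured from the array start.
constexpr std::size_t id_offset(std::size_t i) {
    return i * sizeof(Sample) + offsetof(Sample, id);
}

static_assert(id_offset(1) % alignof(std::uint32_t) == 0, "id of element 1 is aligned");
```

Without the tail padding the structure would be 5 bytes, and id in the second element would land on a misaligned address.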

Padding is only inserted when a structure member is followed by a member with a larger alignment requirement or at the end of the structure. By changing the ordering of members in a structure, it is possible to change the amount of padding required to maintain alignment. For example, if members are sorted by descending alignment requirements a minimal amount of padding is required.
(Note: <<LINUX 系统编程>> (Linux System Programming) mentions gcc's -Wpadded option, which emits a warning whenever the compiler inserts padding implicitly. The warning helps when reordering members by hand; it does not reorder them for you.)
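The effect of member order is easy to measure. A minimal sketch of my own; Unsorted and Sorted are hypothetical names.

```cpp
// Members in an arbitrary order: padding after a and after c.
struct Unsorted {
    char a;
    double b;
    char c;
};

// Same members sorted by descending alignment requirement.
struct Sorted {
    double b;
    char a;
    char c;
};

static_assert(sizeof(Sorted) <= sizeof(Unsorted), "sorting never adds padding here");
```

With gcc on x86-64 Linux I would expect 24 bytes for Unsorted and 16 for Sorted.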

Computing padding

For example, the padding to add to offset 0x59d for a 4-byte aligned structure is 3. The structure will then start at 0x5a0, which is a multiple of 4.
(Note: this describes how the start address of an aligned structure is chosen. In practice the operating system and the c standard library already pre-align the memory they hand out.)
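The arithmetic behind this example can be written as two small helpers. A minimal sketch of my own; padding_for and align_up are hypothetical names, and align_up assumes the alignment is a power of two.

```cpp
#include <cstddef>

// Bytes to add to `offset` so that it becomes a multiple of `align`.
constexpr std::size_t padding_for(std::size_t offset, std::size_t align) {
    return (align - offset % align) % align;
}

// `offset` rounded up to the next multiple of `align`.
// Bit-mask form: only valid when `align` is a power of two.
constexpr std::size_t align_up(std::size_t offset, std::size_t align) {
    return (offset + align - 1) & ~(align - 1);
}

static_assert(padding_for(0x59d, 4) == 3, "0x59d needs 3 bytes of padding");
static_assert(align_up(0x59d, 4) == 0x5a0, "next 4-byte boundary is 0x5a0");
```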

Typical alignment of C structs on x86

The type of each member of the structure usually has a default alignment, meaning that it will, unless otherwise requested by the programmer, be aligned on a pre-determined boundary. The following typical alignments are valid for compilers from Microsoft (Visual C++), Borland/CodeGear (C++Builder), Digital Mars (DMC), and GNU (GCC) when compiling for 32-bit x86:

A char (one byte) will be 1-byte aligned.
A short (two bytes) will be 2-byte aligned.
An int (four bytes) will be 4-byte aligned.
A long (four bytes) will be 4-byte aligned.
A float (four bytes) will be 4-byte aligned.
A double (eight bytes) will be 8-byte aligned on Windows and 4-byte aligned on Linux (8-byte with -malign-double compile time option).
A long long (eight bytes) will be 8-byte aligned on Windows and 4-byte aligned on Linux (8-byte with -malign-double compile time option).
A long double (ten bytes with C++Builder and DMC, eight bytes with Visual C++, twelve bytes with GCC) will be 8-byte aligned with C++Builder, 2-byte aligned with DMC, 8-byte aligned with Visual C++, and 4-byte aligned with GCC.
Any pointer (four bytes) will be 4-byte aligned. (e.g.: char*, int*)

The only notable differences in alignment for an LP64 64-bit system when compared to a 32-bit system are:

A long (eight bytes) will be 8-byte aligned.
A double (eight bytes) will be 8-byte aligned.
A long long (eight bytes) will be 8-byte aligned.
A long double (eight bytes with Visual C++, sixteen bytes with GCC) will be 8-byte aligned with Visual C++ and 16-byte aligned with GCC.
Any pointer (eight bytes) will be 8-byte aligned.

Some data types are dependent on the implementation.
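Since several of these values depend on the implementation, it is worth asking the compiler directly. A minimal sketch of my own; TypeLayout, LAYOUT_OF and kLayouts are hypothetical names.

```cpp
#include <cstddef>

struct TypeLayout {
    const char *name;
    std::size_t size;
    std::size_t align;
};

// Builds one table entry from a type name.
#define LAYOUT_OF(T) TypeLayout{#T, sizeof(T), alignof(T)}

// Size and alignment of the types listed above, as seen by this compiler.
constexpr TypeLayout kLayouts[] = {
    LAYOUT_OF(char),      LAYOUT_OF(short),  LAYOUT_OF(int),
    LAYOUT_OF(long),      LAYOUT_OF(float),  LAYOUT_OF(double),
    LAYOUT_OF(long long), LAYOUT_OF(long double), LAYOUT_OF(void *),
};
```

Looping over kLayouts and printing name, size and align gives the table for the current platform. With gcc on x86-64 Linux I would expect long and void * at 8/8 and long double at 16/16.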

Reference material

"32 bit microprocessors do not use the address lines A1 and A0 for addressing memory" <<32-Bit Microprocessor>>

The CPU uses the byte enable outputs with the address bus to select one or more bytes for
data transfer. Address lines A2 through A31 are used to select a 32-bit memory location. A0 and A1 are
used to internally generate the four byte-enable lines. The byte-enable lines are used to select
one or more bytes within that location. The data in the selected areas is then transferred onto
the data bus. If more than one byte is to be transferred, the bytes must be contiguous.
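The split between word selection and byte enables can be modeled in a few lines. This is a simplified sketch of my own: the mask convention (bit i set means byte lane i is active) is an assumption made for illustration, not the electrical definition of the byte-enable pins, and both function names are hypothetical.

```cpp
#include <cstdint>

// Part of the address carried on lines A2..A31: selects the 32-bit location.
constexpr std::uint32_t word_address(std::uint32_t addr) {
    return addr & ~std::uint32_t{3};
}

// Byte lanes used in the first bus cycle for `size` contiguous bytes starting
// at `addr`; A0 and A1 decide where the run of active lanes begins.
constexpr unsigned byte_enables(std::uint32_t addr, unsigned size) {
    return (((1u << size) - 1u) << (addr & 3u)) & 0xFu;
}

static_assert(word_address(0x100D) == 0x100C, "A0 and A1 are dropped");
static_assert(byte_enables(0x1000, 4) == 0xF, "aligned long word uses all four lanes");
static_assert(byte_enables(0x100D, 4) == 0xE, "first cycle carries only three bytes");
```

For the misaligned long word at 0x100D, the first cycle can carry only three bytes, so a second cycle at 0x1010 is needed for the fourth.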

"Memory word" <<计算机体系结构精髓>> (Essentials of Computer Architecture)

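11.11 Words, physical addresses, and memory transfers

The parallel connections of a memory bus matter to programmers as well as to computer architects. From an architectural standpoint, parallel connections improve performance. From a programming point of view, they define the memory transfer size (i.e., the amount of data that can be read from or written to memory in a single operation). We will see that the transfer size is a key aspect of memory organization. To permit parallel access, the bits that make up physical memory are divided into blocks of N bits each, where N is the memory transfer size. A block of N bits is sometimes called a word, and the transfer size is called the word size or the width of a word. We can think of memory as organized into an array. Each entry in the array is assigned a unique index, called a physical memory address; this scheme is called word addressing. Figure 11.6 illustrates the idea and shows that a physical memory address is exactly like an array index.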
[Figure 11.6: physical memory organized as an array of words]

"Natural alignment" <<LINUX 系统编程>> (Linux System Programming)

9.2.4 Alignment

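A variable is said to be “naturally aligned” when its memory address is an integer multiple of its size. For example, a 32-bit variable is naturally aligned if its address is a multiple of 4 (bytes), that is, if the low two bits of the address are 0. Thus, if a type's size is 2^n bytes, at least the low n bits of its address are 0.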
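The low-bits rule translates directly into a mask test. A minimal sketch of my own; is_naturally_aligned is a hypothetical name, and the size is assumed to be a power of two.

```cpp
#include <cstdint>

// True when `addr` is a multiple of `size`. For a size of 2^n this is the same
// as checking that the low n bits of the address are zero.
constexpr bool is_naturally_aligned(std::uint64_t addr, std::uint64_t size) {
    return (addr & (size - 1)) == 0;
}

static_assert(is_naturally_aligned(0x1000, 4), "low two bits are 0");
static_assert(!is_naturally_aligned(0x1002, 4), "bit 1 is set");
static_assert(is_naturally_aligned(0x1002, 2), "fine for a 2-byte variable");
```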

"TLB (translation lookaside buffer)" <<计算机体系结构精髓>> (Essentials of Computer Architecture)

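13.23 Paging efficiency and the translation lookaside buffer, and 13.25 The relationship between virtual memory and caching

The book translates TLB as “转换后备缓冲区”, gpt translates it as “翻译旁路缓存”, and Google Translate as “翻译后备缓冲器”. I think gpt's “旁路” (lookaside) is better than “后备” (backup), but “转换” (conversion) fits better than “翻译” (translation of language) and “缓存” (cache) fits better than “缓冲区” (buffer), so putting these together one could try “旁路转换缓存”. The key points of these sections can be summarized as follows:

Memory pages in brief: physical memory is divided into n pages of a fixed page size (typically 4k), each with its own page number. The memory management unit (MMU) uses a page table, an array indexed by virtual page number, to find the physical page for a virtual address, and then uses the offset within the page to locate the actual physical address.
The TLB is a cache of address translations for recently accessed pages (LRU-style policies show up everywhere, from low-level hardware components to operating system design): TLB[virtual address] = physical address. At first glance the benefit of this cache looks limited, because translating a virtual address to a physical one is not complicated, and thanks to page alignment the virtual address splits into parts that can be sent over the bus in parallel, as Figure 13.10 shows. But keep in mind that address translation must be performed on every memory reference: every instruction fetch, every operand that references memory, and every store of a result. Because memory is used so heavily, the translation mechanism must be extremely efficient, or it becomes a bottleneck.
To understand why a TLB improves performance, consider the fetch-execute cycle. A processor tends to fetch instructions from successive locations in memory. Furthermore, if the program contains a branch, the destination is very likely nearby, often on the same page. Thus, rather than accessing pages at random, a processor tends to fetch successive instructions from the same page. A TLB improves performance because it optimizes successive lookups by avoiding indexing into the page table. The difference is especially dramatic for architectures that store the page table in memory; without a TLB, such systems run too slowly.
A TLB contains digital circuits that move values at high speed into a special-purpose content addressable memory (CAM).
A brief introduction to CAM (content addressable memory): a CAM does more than store data items, it also includes hardware for high-speed searching. The parallel search hardware makes CAM very expensive, so architects use CAM only when lookup speed matters more than cost and power consumption. For example, in a high-speed Internet router, some designs use a CAM to store a list of source identifiers in order to handle high-speed connections.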
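The interplay between page table and TLB can be illustrated with a toy model. Everything below is my own simplification: ToyMmu is a hypothetical class, pages are assumed to be 4 KiB, the TLB is an ordinary hash map with no size limit or eviction (a real TLB is a small CAM), and unmapped pages are not handled.

```cpp
#include <cstdint>
#include <unordered_map>

constexpr std::uint64_t kPageShift = 12;  // assumes 4 KiB pages

constexpr std::uint64_t page_number(std::uint64_t vaddr) { return vaddr >> kPageShift; }
constexpr std::uint64_t page_offset(std::uint64_t vaddr) {
    return vaddr & ((std::uint64_t{1} << kPageShift) - 1);
}

class ToyMmu {
public:
    // Installs a mapping from virtual page number to physical frame number.
    void map(std::uint64_t vpn, std::uint64_t frame) { page_table_[vpn] = frame; }

    // Translates a virtual address. The page table is consulted only on a TLB
    // miss; later accesses to the same page are served from the TLB.
    std::uint64_t translate(std::uint64_t vaddr) {
        const std::uint64_t vpn = page_number(vaddr);
        auto hit = tlb_.find(vpn);
        if (hit == tlb_.end()) {
            ++walks_;  // count the slow path
            hit = tlb_.emplace(vpn, page_table_.at(vpn)).first;
        }
        return (hit->second << kPageShift) | page_offset(vaddr);
    }

    unsigned walks() const { return walks_; }

private:
    std::unordered_map<std::uint64_t, std::uint64_t> page_table_;
    std::unordered_map<std::uint64_t, std::uint64_t> tlb_;
    unsigned walks_ = 0;
};
```

Translating 0x1004 and then 0x1008 after mapping page 1 walks the page table once; the second access hits the TLB, which is the locality effect the book describes.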

"Pre-aligned memory allocation" <<LINUX 系统编程>> (Linux System Programming)

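In most cases the compiler and the C library handle alignment automatically. POSIX specifies that memory returned by malloc, calloc and realloc is suitably aligned for any standard C type. On Linux, these functions return addresses aligned to an 8-byte boundary on 32-bit systems and to a 16-byte boundary on 64-bit systems. (In my own, far from exhaustive, tests on x86_64 GNU/Linux with gcc (GCC) 9.3.1 and std=c99, the smallest block malloc actually allocates is 32 bytes; above 32 bytes, block sizes are rounded up to a 16-byte boundary, with a 4-byte cookie above and below. Here is a screenshot of one of Hou Jie's slides, in which the 48-byte block was built with a debug option; the 16-byte block in the slide would take 32 bytes on my machine, which is explained by a comment in the malloc source.)

[Images: Hou Jie's slide and the comment from the malloc source]
A rough explanation of that source comment (not guaranteed to be fully correct, since it is based on my incomplete tests): first determine the size of a pointer and of size_t on your machine; both are 8 bytes on mine. Then the minimum malloc allocation works out as minimum = 8 bytes (a 4-byte cookie above and below, recording the block's size and whether it is in use) + 16 bytes (the two forward and backward pointers used to link the block into the free_list) = 24, and 24 must be rounded up to a multiple of 16, so the minimum allocation becomes 32 bytes.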
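The alignment guarantee for malloc can be checked with standard facilities only. A minimal sketch of my own; malloc_is_max_aligned is a hypothetical helper, and std::max_align_t stands in for “the most strictly aligned standard scalar type”.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// True when malloc returned a block whose address is a multiple of the
// strictest fundamental alignment on this platform.
inline bool malloc_is_max_aligned(std::size_t bytes) {
    void *p = std::malloc(bytes);
    if (p == nullptr) return false;
    const bool ok =
        reinterpret_cast<std::uintptr_t>(p) % alignof(std::max_align_t) == 0;
    std::free(p);
    return ok;
}
```

On x86_64 GNU/Linux alignof(std::max_align_t) is 16, which matches the 16-byte boundary quoted above. Measuring the actual block size would need the glibc-specific malloc_usable_size, which I leave out to keep the sketch portable.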