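Unaligned Access: ARM as an Example

What is an unaligned access

At the machine-instruction level, an unaligned memory access occurs when N bytes are read from a starting address that is not evenly divisible by N (addr % N != 0). For example, reading 4 bytes from address 0x10004 is fine, whereas reading 4 bytes from address 0x10005 is an unaligned memory access. Here N is the natural alignment of the data.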
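The divisibility test above can be written directly in C. The following is a minimal sketch; the helper name is_naturally_aligned is made up for this illustration and is not part of any library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true when addr is a multiple of n,
 * i.e. an n-byte access at addr would be naturally aligned. */
static bool is_naturally_aligned(uintptr_t addr, size_t n)
{
    return n != 0 && addr % n == 0;
}
```

With the addresses from the text, is_naturally_aligned(0x10004, 4) is true and is_naturally_aligned(0x10005, 4) is false.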

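Why unaligned access should be avoided

How an unaligned access behaves, and what it costs, depends on the processor.

  • Some processors support unaligned access in hardware, at some cost in performance.
  • Some processors do not support it and raise an exception. The operating system can catch the exception and complete the access, but the performance penalty is large.
  • Others silently perform the wrong memory access, causing subtle bugs that are hard to track down.

For the sake of portability, robustness and performance, unaligned access should therefore be avoided.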

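The compiler's role in unaligned access

All of this may seem hard to relate to everyday coding, since we have little direct control over the addresses of variables. In most cases the compiler takes care of it: it places variables at naturally aligned addresses so that the alignment requirements are met.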

Structure member layout

For more on structure alignment, see the article "Structure size and member alignment".

struct foo_s {
	u16 field1;
	u32 field2;
	u8 field3;
};

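At first glance field2 looks as if it would be accessed unaligned. Fortunately, the compiler honours the alignment constraints and inserts 2 bytes of padding between field1 and field2. As long as you do not cast to a different, wider type, there is no need to worry about unaligned access.

After padding, the structure above occupies 12 bytes (2 + 2 padding + 4 + 1 + 3 trailing padding). A better layout is shown below: the compiler then adds only one byte of padding and the structure is 8 bytes, which reduces resident memory.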

struct foo_s {
	u32 field2;
	u16 field1;
	u8 field3;
};
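The two layouts can be checked with sizeof and offsetof. This sketch uses the <stdint.h> types in place of u8/u16/u32 and renames the structures to foo_padded_s and foo_reordered_s (hypothetical names) so that both can coexist. The numbers assume a typical ABI in which uint32_t is 4-byte aligned and uint16_t is 2-byte aligned, as on ARM EABI or x86-64.

```c
#include <stddef.h>
#include <stdint.h>

struct foo_padded_s {      /* original member order */
    uint16_t field1;       /* offset 0, then 2 bytes of padding */
    uint32_t field2;       /* offset 4 */
    uint8_t  field3;       /* offset 8, then 3 bytes of tail padding */
};

struct foo_reordered_s {   /* widest member first */
    uint32_t field2;       /* offset 0 */
    uint16_t field1;       /* offset 4 */
    uint8_t  field3;       /* offset 6, then 1 byte of tail padding */
};

/* Compile-time checks; they hold only under the ABI assumption above. */
_Static_assert(offsetof(struct foo_padded_s, field2) == 4, "field2 is padded to offset 4");
_Static_assert(sizeof(struct foo_padded_s) == 12, "padded layout is 12 bytes");
_Static_assert(sizeof(struct foo_reordered_s) == 8, "reordered layout is 8 bytes");
```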

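The __attribute__((packed)) attribute

With a packed structure such as the one below, you might expect field2 to be accessed unaligned. In practice the compiler knows that field2 may be misaligned and generates extra instructions to read the data in a way that does not trigger an unaligned access. That costs performance compared with the non-packed layout, so use this attribute sparingly unless you have a specific need for it.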

struct foo_pack_s {
	u16 field1;
	u32 field2;
	u8 field3;
} __attribute__((packed));

int main()
{
	struct foo_pack_s foo_pack;
	int val = foo_pack.field2;
	return val;
}

[Figure: compiler-generated assembly for the code above]

Enabling the -mno-unaligned-access compiler option on ARM produces the following code.
[Figure: assembly generated with -mno-unaligned-access]
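Without hardware unaligned support, the extra instructions amount to loading the value a byte at a time and combining the pieces. The sketch below expresses that idea in C rather than reproducing the compiler's actual output. The function name load_u32_bytewise is hypothetical, and the code assumes little-endian byte order.

```c
#include <stdint.h>

/* Hypothetical helper: read a 32-bit little-endian value from any address
 * using only byte loads, so no individual access is ever unaligned. */
static uint32_t load_u32_bytewise(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```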

Where unaligned accesses do occur

Pointer casts in C make unaligned accesses possible. For example:

void myfunc(u8 *data, u32 value)
{
	[...]
	*((u32 *) data) = value;
	[...]
}

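In this case memcpy can be used instead, provided the memcpy implementation itself handles unaligned addresses correctly.
Alternatively, the cast is acceptable when the caller guarantees that data is suitably aligned.
Another example: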

uint8_t tmp;
uint32_t* pMyPointer = (uint32_t*)(&tmp);

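Here the compiler still assumes that pMyPointer is 4-byte aligned, so an unaligned access may occur at run time.

Programmers should keep their code free of unaligned accesses as far as possible, for both performance and portability.
Be careful with pointer casts, especially from a narrower type to a wider one.
Such a cast is acceptable only when every call site is known to pass a suitably aligned address; otherwise memcpy can be used to avoid the unaligned access.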
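A memcpy-based replacement for the cast in myfunc might look like the following. This is a sketch: store_u32 and load_u32 are hypothetical names, and the value is copied in the host's native byte order. Compilers can often reduce a fixed-size memcpy like this to a single load or store when the target allows it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write value at data, whatever data's alignment. */
static void store_u32(uint8_t *data, uint32_t value)
{
    memcpy(data, &value, sizeof value);
}

/* Hypothetical helper: read a uint32_t back from a possibly unaligned address. */
static uint32_t load_u32(const uint8_t *data)
{
    uint32_t value;
    memcpy(&value, data, sizeof value);
    return value;
}
```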

Aligned and unaligned data access on ARM processors [1][2]

Older ARM processors require data load and stores to be to/from architecturally aligned addresses.  This means:

LDRB/STRB          - address must be byte aligned
LDRH/STRH          - address must be 2-byte aligned
LDR/STR            - address must be 4-byte aligned

On older processors, such as ARM9 family based processors, an unaligned load had to be synthesised in software.  
Typically by doing a series of small accesses, and combining the results.

The ARMv6 architecture introduced the first hardware support for unaligned accesses.
ARM11 and Cortex-A/R processors can deal with unaligned accesses in hardware, removing the need
for software routines.

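As the text above shows, ARMv6 was the first architecture version to add hardware support for unaligned access. Earlier processors have to handle unaligned access in software, usually by assembling the value from smaller accesses; otherwise a data abort exception is raised. Even on 32-bit processors that have this support, it applies only to certain instructions, such as LDRB / LDRH / LDR. Atomic instructions do not support unaligned access: if the access straddled a page boundary and one of the pages were unmapped, a page fault would be raised and the OS would step in to handle it, at which point atomicity is lost. Unaligned access is also limited to Normal memory. Hardware unaligned access is more efficient than software emulation, but still slower than aligned access.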
How processors handle unaligned memory access

Device Memory

As noted above, even where unaligned access is supported, it is limited to Normal memory. The other memory type is Device memory.

Address regions that are used to access peripherals rather than memory should be marked as Device memory. Depending upon the processor, this may be configured in the Memory Protection Unit (MPU) or the Memory Management Unit (MMU). Unaligned accesses are not permitted to these regions even when unaligned access support is enabled. If an unaligned access is attempted, the processor will take an abort.

The compiler does not have any information on which address ranges are device memory, and it is therefore the responsibility of the person writing the code to ensure that accesses to devices are aligned. In practice, this usually is the case simply because peripheral registers are at aligned addresses. It is also usual to access peripheral registers through volatile variables or pointers, which restricts the compiler to accessing the data with the size of access specified where possible.

It is also necessary to avoid using C library functions such as memcpy() to access Device memory, as there is no guarantee of the type of accesses these functions will use. If it is necessary to copy a buffer of memory to a Device memory, you should provide a suitable copying routine and call this instead of memcpy().
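One possible shape for such a copying routine is sketched below. The name copy_to_device32 is hypothetical, and the sketch assumes the device region accepts aligned 32-bit writes and that the length is a whole number of words; real peripherals may impose other access sizes or ordering requirements.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical routine: copy `words` 32-bit words to a device region.
 * dst must be 4-byte aligned. The volatile qualifier stops the compiler
 * from eliding or merging the stores. */
static void copy_to_device32(volatile uint32_t *dst, const uint32_t *src, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        dst[i] = src[i];   /* one aligned word store per element */
    }
}
```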

Performance

If code frequently accesses unaligned data, there may be a performance advantage in enabling unaligned accesses. However, the extent of this advantage will be dependent on many factors. Even though this support allows a single instruction to access unaligned data, this will often require multiple bus accesses to occur. Therefore the bus transactions performed by an unaligned access may be similar to those performed by the multiple instructions used when unaligned access support is disabled. The code without unaligned access support will have to perform various shift and logical operations, but on a multi-issue processor the execution time of these may be hidden by executing them in parallel with the memory accesses. There will also be a function call overhead when functions such as __aeabi_uread4() are used, though the impact of these may be reduced by branch prediction.

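How to measure access performance

The example below is simple: read a value, negate it, and write it back. Each function takes only the address and length to operate on. Pass data pointers that are aligned and unaligned,
and record the time each variant takes in each case.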

#include <stdint.h>

// Munging data one byte at a time
void Munge8(void *data, uint32_t size)
{
    uint8_t *data8 = (uint8_t *)data;
    uint8_t *data8End = data8 + size;

    while (data8 != data8End) {
        *data8 = -*data8;   /* negate in place, then advance separately */
        data8++;
    }
}

// Munging data two bytes at a time
void Munge16(void *data, uint32_t size)
{
    uint16_t *data16 = (uint16_t *)data;
    uint16_t *data16End = data16 + (size >> 1); /* Divide size by 2. */
    uint8_t *data8 = (uint8_t *)data16End;
    uint8_t *data8End = data8 + (size & 0x00000001); /* Strip upper 31 bits. */

    while (data16 != data16End) {
        *data16 = -*data16;
        data16++;
    }
    while (data8 != data8End) {   /* leftover tail byte, if any */
        *data8 = -*data8;
        data8++;
    }
}

// Munging data four bytes at a time
void Munge32(void *data, uint32_t size)
{
    uint32_t *data32 = (uint32_t *)data;
    uint32_t *data32End = data32 + (size >> 2); /* Divide size by 4. */
    uint8_t *data8 = (uint8_t *)data32End;
    uint8_t *data8End = data8 + (size & 0x00000003); /* Strip upper 30 bits. */

    while (data32 != data32End) {
        *data32 = -*data32;
        data32++;
    }
    while (data8 != data8End) {   /* leftover tail bytes, if any */
        *data8 = -*data8;
        data8++;
    }
}

// Munging data eight bytes at a time
void Munge64(void *data, uint32_t size) {
    double *data64 = (double *)data;
    double *data64End = data64 + (size >> 3); /* Divide size by 8. */
    uint8_t *data8 = (uint8_t *)data64End;
    uint8_t *data8End = data8 + (size & 0x00000007); /* Strip upper 29 bits. */

    while (data64 != data64End) {
        *data64 = -*data64;
        data64++;
    }
    while (data8 != data8End) {   /* leftover tail bytes, if any */
        *data8 = -*data8;
        data8++;
    }
}
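
One way to collect the timings is a small harness built on the standard clock() function. The sketch below is an assumption about how the measurement might be set up, not the method behind any published figures. The names munge_fn, time_munge and negate_bytes are hypothetical, and the caller chooses the buffer offset that makes data aligned or unaligned.

```c
#include <stdint.h>
#include <time.h>

/* Signature shared by Munge8/16/32/64. */
typedef void (*munge_fn)(void *data, uint32_t size);

/* Stand-in workload so the sketch compiles on its own; any of the
 * Munge functions above can be passed instead. */
static void negate_bytes(void *data, uint32_t size)
{
    uint8_t *p = data;
    for (uint32_t i = 0; i < size; i++) {
        p[i] = (uint8_t)-p[i];
    }
}

/* Hypothetical harness: run fn over the buffer `iterations` times and
 * return the elapsed CPU time in seconds as reported by clock(). */
static double time_munge(munge_fn fn, void *data, uint32_t size, int iterations)
{
    clock_t start = clock();
    for (int i = 0; i < iterations; i++) {
        fn(data, size);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

For an unaligned run, pass an address offset from an aligned buffer, for example time_munge(Munge32, buffer + 1, size, 1000) where buffer is a uint8_t array. On processors without unaligned support that call may fault or be handled slowly, which is what the test is probing.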

  1. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka15414.html

  2. https://developer.ibm.com/articles/pa-dalign/
