A very good article on byte alignment

http://www.ibm.com/developerworks/library/pa-dalign/

Programmers are conditioned to think of memory as a simple array of bytes. Among C and its descendants, char* is ubiquitous as meaning "a block of memory", and even Java™ has its byte[] type to represent raw memory.


Figure 1. How programmers see memory

However, your computer's processor does not read from and write to memory in byte-sized chunks. Instead, it accesses memory in two-, four-, eight-, 16-, or even 32-byte chunks. We'll call the size in which a processor accesses memory its memory access granularity.



Figure 2. How processors see memory

The difference between how high-level programmers think of memory and how modern processors actually work with memory raises interesting issues that this article explores.

If you don't understand and address alignment issues in your software, the following scenarios, in increasing order of severity, are all possible:

  • Your software will run slower.
  • Your application will lock up.
  • Your operating system will crash.
  • Your software will silently fail, yielding incorrect results.

Alignment fundamentals

To illustrate the principles behind alignment, examine a constant task, and how it's affected by a processor's memory access granularity. The task is simple: first read four bytes from address 0 into the processor's register. Then read four bytes from address 1 into the same register.

First examine what would happen on a processor with a one-byte memory access granularity:


Figure 3. Single-byte memory access granularity

This fits in with the naive programmer's model of how memory works: it takes the same four memory accesses to read from address 0 as it does from address 1. Now see what would happen on a processor with two-byte granularity, like the original 68000:


Figure 4. Double-byte memory access granularity

When reading from address 0, a processor with two-byte granularity needs half as many memory accesses as a processor with one-byte granularity. Because each memory access entails a fixed amount of overhead, minimizing the number of accesses can really help performance.

However, notice what happens when reading from address 1. Because the address doesn't fall evenly on the processor's memory access boundary, the processor has extra work to do. Such an address is known as an unaligned address. Because address 1 is unaligned, a processor with two-byte granularity must perform an extra memory access, slowing down the operation.
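
Whether an address falls on an access boundary can be checked with a simple mask. The following is a minimal sketch, assuming the granularity is a power of two; the helper name is_aligned is hypothetical and used only for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when addr is a multiple of granularity.
   Assumes granularity is a power of two (1, 2, 4, 8, ...). */
bool is_aligned(uintptr_t addr, uintptr_t granularity) {
    return (addr & (granularity - 1)) == 0;
}
```

With two-byte granularity, is_aligned(0, 2) holds while is_aligned(1, 2) does not, which matches the two cases described above.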

Finally, examine what would happen on a processor with four-byte memory access granularity, like the 68030 or PowerPC® 601:


Figure 5. Quad-byte memory access granularity

A processor with four-byte granularity can slurp up four bytes from an aligned address with one read. Also note that reading from an unaligned address doubles the access count.
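
The access counts in these three cases can be reproduced with a little arithmetic. This is a sketch of the idealized model used in the figures, not a description of any particular processor; the function name accesses_needed is an assumption made for illustration:

```c
#include <stdint.h>

/* Number of granularity-sized chunks that must be read to cover
   `size` bytes starting at `addr`.
   Assumes size > 0 and granularity > 0. */
uint32_t accesses_needed(uint32_t addr, uint32_t size, uint32_t granularity) {
    uint32_t first_chunk = addr / granularity;
    uint32_t last_chunk  = (addr + size - 1) / granularity;
    return last_chunk - first_chunk + 1;
}
```

For the four-byte read from addresses 0 and 1, this gives 4 and 4 accesses at one-byte granularity, 2 and 3 at two-byte granularity, and 1 and 2 at four-byte granularity.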

Now that you understand the fundamentals behind aligned data access, you can explore some of the issues related to alignment.


Lazy processors

A processor has to perform some tricks when instructed to access an unaligned address. Going back to the example of reading four bytes from address 1 on a processor with four-byte granularity, you can work out exactly what needs to be done:


Figure 6. How processors handle unaligned memory access

The processor needs to read the first chunk of the unaligned address and shift out the "unwanted" bytes from the first chunk. Then it needs to read the second chunk of the unaligned address and shift out some of its information. Finally, the two are merged together for placement in the register. It's a lot of work.
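
The read, shift, and merge steps can be imitated in software. The sketch below is only a model of the idea, assuming big-endian byte order (as on PowerPC) and a byte array standing in for memory; the names read_chunk and load_unaligned32 are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Reads one aligned four-byte chunk as a big-endian word. */
static uint32_t read_chunk(const uint8_t *mem, size_t aligned_addr) {
    return ((uint32_t)mem[aligned_addr]     << 24) |
           ((uint32_t)mem[aligned_addr + 1] << 16) |
           ((uint32_t)mem[aligned_addr + 2] <<  8) |
            (uint32_t)mem[aligned_addr + 3];
}

/* Loads four bytes from any address using only aligned chunk reads.
   The caller must make sure the chunk after `addr` is readable. */
uint32_t load_unaligned32(const uint8_t *mem, size_t addr) {
    size_t   base   = addr & ~(size_t)3;     /* start of the first chunk  */
    unsigned offset = (unsigned)(addr & 3);  /* unwanted leading bytes    */
    uint32_t first  = read_chunk(mem, base);

    if (offset == 0) {
        return first;                        /* aligned: one read is enough */
    }

    uint32_t second = read_chunk(mem, base + 4);
    unsigned shift  = offset * 8;

    /* Shift the unwanted bytes out of each chunk, then merge. */
    return (first << shift) | (second >> (32 - shift));
}
```

Reading from address 1 of the bytes 00 11 22 33 44 55 66 77 takes two chunk reads and yields 0x11223344, while reading from address 0 takes one.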

Some processors just aren't willing to do all of that work for you.

The original 68000 was a processor with two-byte granularity and lacked the circuitry to cope with unaligned addresses. When presented with such an address, the processor would throw an exception. The original Mac OS didn't take very kindly to this exception, and would usually demand the user restart the machine. Ouch.

Later processors in the 680x0 series, such as the 68020, lifted this restriction and performed the necessary work for you. This explains why some old software that works on the 68020 crashes on the 68000. It also explains why, way back when, some old Mac coders initialized pointers with odd addresses. On the original Mac, if the pointer was accessed without being reassigned to a valid address, the Mac would immediately drop into the debugger. Often they could then examine the call stack and figure out where the mistake was.

All processors have a finite number of transistors to get work done. Adding unaligned address access support cuts into this "transistor budget." These transistors could otherwise be used to make other portions of the processor work faster, or add new functionality altogether.

An example of a processor that sacrifices unaligned address access support in the name of speed is MIPS. MIPS is a great example of a processor that does away with almost all frivolity in the name of getting real work done faster.

The PowerPC takes a hybrid approach. Every PowerPC processor to date has hardware support for unaligned 32-bit integer access. While you still pay a performance penalty for unaligned access, it tends to be small.

On the other hand, modern PowerPC processors lack hardware support for unaligned 64-bit floating-point access. When asked to load an unaligned floating-point number from memory, modern PowerPC processors will throw an exception and have the operating system perform the alignment chores in software. Performing alignment in software is much slower than performing it in hardware.
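
In portable C, one common way to avoid relying on hardware or operating-system fix-ups is to copy the bytes into a properly aligned local variable with memcpy. This is a sketch of that general technique, not something the measurements here used; the helper names are hypothetical, and how efficiently a compiler handles the copy depends on the target:

```c
#include <string.h>

/* Loads a double from a possibly unaligned address.
   memcpy makes no alignment assumption about src. */
double load_double_unaligned(const void *src) {
    double value;
    memcpy(&value, src, sizeof value);
    return value;
}

/* Stores a double to a possibly unaligned address. */
void store_double_unaligned(void *dst, double value) {
    memcpy(dst, &value, sizeof value);
}
```

The local variable is always aligned for its type, so code that reads or writes it never presents the processor with an unaligned floating-point address.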


Speed

Writing some tests illustrates the performance penalties of unaligned memory access. The test is simple: you read, negate, and write back the numbers in a ten-megabyte buffer. These tests have two variables:

  1. The size, in bytes, in which you process the buffer. First you'll process the buffer one byte at a time. Then you'll move on to two, four, and eight bytes at a time.
  2. The alignment of the buffer. You'll stagger the alignment of the buffer by incrementing the pointer to the buffer and running each test again.
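
A harness for the second variable might look like the following sketch. It assumes the standard clock() function for timing (the original tests used a different timing library), and the names MungeFn, NegateBytes, and AverageMicroseconds are hypothetical:

```c
#include <stdint.h>
#include <time.h>

typedef void (*MungeFn)(void *data, uint32_t size);

/* Stand-in workload: negates the buffer one byte at a time. */
void NegateBytes(void *data, uint32_t size) {
    uint8_t *p = (uint8_t *)data;
    for (uint32_t i = 0; i < size; i++) {
        p[i] = (uint8_t)-p[i];
    }
}

/* Average time, in microseconds, of `runs` calls to fn on a buffer
   that starts `offset` bytes past `base`. The caller must allocate
   at least size + offset bytes at `base`. */
double AverageMicroseconds(MungeFn fn, uint8_t *base, uint32_t size,
                           uint32_t offset, int runs) {
    clock_t total = 0;
    for (int i = 0; i < runs; i++) {
        clock_t start = clock();
        fn(base + offset, size);     /* staggered start address */
        total += clock() - start;
    }
    return ((double)total * 1e6 / CLOCKS_PER_SEC) / runs;
}
```

Calling AverageMicroseconds with offsets 0, 1, 2, and so on exercises each alignment in turn. Note that clock() measures processor time at a coarse resolution, so small buffers will not produce meaningful numbers.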

These tests were performed on an 800 MHz PowerBook G4. To help normalize performance fluctuations from interrupt processing, each test was run ten times, keeping the average of the runs. First up is the test that operates on a single byte at a time:


Listing 1. Munging data one byte at a time
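#include <stdint.h>

void Munge8( void *data, uint32_t size ) {
    uint8_t *data8 = (uint8_t*) data;
    uint8_t *data8End = data8 + size;
    
    while( data8 != data8End ) {
        /* Negate in place, then advance. Writing "*p++ = -*p" leaves
           the order of the read and the increment undefined in C. */
        *data8 = -*data8;
        data8++;
    }
}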

It took an average of 67,364 microseconds to execute this function. Now modify it to work on two bytes at a time instead of one byte at a time -- which will halve the number of memory accesses:


Listing 2. Munging data two bytes at a time
void Munge16( void *data, uint32_t size ) {
    uint16_t *data16 = (uint16_t*) data;
    uint16_t *data16End = data16 + (size >> 1); /* Divide size by 2. */
    uint8_t *data8 = (uint8_t*) data16End;
    uint8_t *data8End = data8 + (size & 0x00000001); /* Strip upper 31 bits. */
    
    /* Process whole two-byte units first. */
    while( data16 != data16End ) {
        *data16 = -*data16;
        data16++;
    }
    /* Then handle the leftover byte, if any. */
    while( data8 != data8End ) {
        *data8 = -*data8;
        data8++;
    }
}

This function took 48,765 microseconds to process the same ten-megabyte buffer -- 38% faster than Munge8. However, that buffer was aligned. If the buffer is unaligned, the time required increases to 66,385 microseconds -- about a 27% speed penalty. The following chart illustrates the performance pattern of aligned memory accesses versus unaligned accesses:


Figure 7. Single-byte access versus double-byte access

The first thing you notice is that accessing memory one byte at a time is uniformly slow. The second item of interest is that when accessing memory two bytes at a time, whenever the address is not evenly divisible by two, that 27% speed penalty rears its ugly head.

Now up the ante, and process the buffer four bytes at a time:


Listing 3. Munging data four bytes at a time
void Munge32( void *data, uint32_t size ) {
    uint32_t *data32 = (uint32_t*) data;
    uint32_t *data32End = data32 + (size >> 2); /* Divide size by 4. */
    uint8_t *data8 = (uint8_t*) data32End;
    uint8_t *data8End = data8 + (size & 0x00000003); /* Strip upper 30 bits. */
    
    /* Process whole four-byte units first. */
    while( data32 != data32End ) {
        *data32 = -*data32;
        data32++;
    }
    /* Then handle the leftover bytes, if any. */
    while( data8 != data8End ) {
        *data8 = -*data8;
        data8++;
    }
}

This function processes an aligned buffer in 43,043 microseconds and an unaligned buffer in 55,775 microseconds. Thus, on this test machine, accessing unaligned memory four bytes at a time is slower than accessing aligned memory two bytes at a time:


Figure 8. Single- versus double- versus quad-byte access

Now for the horror story: processing the buffer eight bytes at a time.


Listing 4. Munging data eight bytes at a time
void Munge64( void *data, uint32_t size ) {
    double *data64 = (double*) data;
    double *data64End = data64 + (size >> 3); /* Divide size by 8. */
    uint8_t *data8 = (uint8_t*) data64End;
    uint8_t *data8End = data8 + (size & 0x00000007); /* Strip upper 29 bits. */
    
    /* Process whole eight-byte doubles first. */
    while( data64 != data64End ) {
        *data64 = -*data64;
        data64++;
    }
    /* Then handle the leftover bytes, if any. */
    while( data8 != data8End ) {
        *data8 = -*data8;
        data8++;
    }
}

Munge64 processes an aligned buffer in 39,085 microseconds -- about 10% faster than processing the buffer four bytes at a time. However, processing an unaligned buffer takes an amazing 1,841,155 microseconds -- about 47 times slower than aligned access, an outstanding 4,610% performance penalty!

What happened? Because modern PowerPC processors lack hardware support for unaligned floating-point access, the processor throws an exception for each unaligned access. The operating system catches this exception and performs the alignment in software. Here's a chart illustrating the penalty, and when it occurs:


Figure 9. Multiple-byte access comparison

The penalties for one-, two- and four-byte unaligned access are dwarfed by the horrendous unaligned eight-byte penalty. Maybe this chart, removing the top (and thus the tremendous gulf between the two numbers), will be clearer:


Figure 10. Multiple-byte access comparison #2

There's another subtle insight hidden in this data. Compare eight-byte access speeds on four-byte boundaries:


Figure 11. Multiple-byte access comparison #3

Notice that accessing memory eight bytes at a time on four- and twelve-byte boundaries is slower than reading the same memory four or even two bytes at a time. While PowerPCs have hardware support for four-byte aligned eight-byte doubles, you still pay a performance penalty if you use that support. Granted, it's nowhere near the 4,610% penalty, but it's certainly noticeable. Moral of the story: accessing memory in large chunks can be slower than accessing memory in small chunks, if that access is not aligned.
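
One way to act on that moral is to process any unaligned leading bytes one at a time and switch to large chunks only once the pointer is aligned. The sketch below illustrates this head, body, and tail pattern; it is not one of the functions that was timed, and the name Munge32Aligned is hypothetical. As with the listings above, the numeric result of negating by bytes versus by words differs, which does not matter for a memory-access test:

```c
#include <stdint.h>
#include <string.h>

/* Negates `size` bytes of data, using four-byte operations only on
   four-byte-aligned addresses. */
void Munge32Aligned(void *data, uint32_t size) {
    uint8_t *p = (uint8_t *)data;
    uint8_t *end = p + size;

    /* Head: step one byte at a time until p is four-byte aligned. */
    while (p != end && ((uintptr_t)p & 3) != 0) {
        *p = (uint8_t)-*p;
        p++;
    }
    /* Body: whole aligned four-byte words. memcpy keeps the access
       within C's aliasing rules; compilers typically turn it into a
       plain load and store. */
    while ((size_t)(end - p) >= 4) {
        uint32_t word;
        memcpy(&word, p, sizeof word);
        word = 0u - word;
        memcpy(p, &word, sizeof word);
        p += 4;
    }
    /* Tail: any bytes left over after the last whole word. */
    while (p != end) {
        *p = (uint8_t)-*p;
        p++;
    }
}
```

The cost is a few extra branches per call, which is usually small next to the per-access penalty measured above.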


Atomicity

All modern processors offer atomic instructions. These special instructions are crucial for synchronizing two or more concurrent tasks. As the name implies, atomic instructions must be indivisible -- that's why they're so handy for synchronization: they can't be preempted.

It turns out that in order for atomic instructions to perform correctly, the addresses you pass them must be at least four-byte aligned. This is because of a subtle interaction between atomic instructions and virtual memory.

If an address is unaligned, it requires at least two memory accesses. But what happens if the desired data spans two pages of virtual memory? This could lead to a situation where the first page is resident while the last page is not. Upon access, in the middle of the instruction, a page fault would be generated, executing the virtual memory management swap-in code, destroying the atomicity of the instruction. To keep things simple and correct, both the 68K and PowerPC require that atomically manipulated addresses always be at least four-byte aligned.

Unfortunately, the PowerPC does not throw an exception when atomically storing to an unaligned address. Instead, the store simply always fails. This is bad because most atomic functions are written to retry upon a failed store, under the assumption they were preempted. Together, these two circumstances mean your program will go into an infinite loop if you attempt to atomically store to an unaligned address. Oops.
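
The retry-on-failure pattern looks roughly like the sketch below. It uses C11's stdatomic.h rather than the PowerPC's own load-and-reserve and store-conditional instructions, so it only illustrates the shape of the loop; the function name AtomicAdd32 is hypothetical:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically adds `delta` to *target and returns the new value.
   The loop retries whenever the conditional store fails, on the
   assumption that another task intervened. */
int32_t AtomicAdd32(_Atomic int32_t *target, int32_t delta) {
    int32_t old_value = atomic_load(target);
    int32_t new_value;
    do {
        new_value = old_value + delta;
        /* On failure, old_value is refreshed with the current contents
           of *target and the loop tries again. */
    } while (!atomic_compare_exchange_weak(target, &old_value, new_value));
    return new_value;
}
```

If the store could never succeed, as described above for an unaligned address, nothing in this loop would ever let it exit. Declaring the variable as an _Atomic type lets the compiler choose a suitable alignment for it.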


Altivec

Altivec is all about speed. Unaligned memory access slows down the processor and costs precious transistors. Thus, the Altivec engineers took a page from the MIPS playbook and simply don't support unaligned memory access. Because Altivec works with sixteen-byte chunks at a time, all addresses passed to Altivec must be sixteen-byte aligned. What's scary is what happens if your address is not aligned.

Altivec won't throw an exception to warn you about the unaligned address. Instead, Altivec simply ignores the lower four bits of the address and charges ahead, operating on the wrong address. This means your program may silently corrupt memory or return incorrect results if you don't explicitly make sure all your data is aligned.

There is an advantage to Altivec's bit-stripping ways. Because you don't need to explicitly truncate (align-down) an address, this behavior can save you an instruction or two when handing addresses to the processor.
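
Ignoring the lower four bits is the same as masking the address down to the previous sixteen-byte boundary. A minimal sketch of that arithmetic in plain C (no Altivec intrinsics are used, and the function name VectorEffectiveAddress is hypothetical):

```c
#include <stdint.h>

/* Address a sixteen-byte vector load or store would actually use
   when the low four address bits are ignored. */
uintptr_t VectorEffectiveAddress(uintptr_t addr) {
    return addr & ~(uintptr_t)0xF;
}
```

An address of 0x1007 is treated as 0x1000, so the operation touches bytes 0x1000 through 0x100F rather than 0x1007 through 0x1016.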

This is not to say Altivec can't process unaligned memory. You can find detailed instructions on how to do so in the Altivec Programming Environments Manual (see Resources). It requires more work, but because memory is so slow compared to the processor, the overhead for such shenanigans is surprisingly low.


Structure alignment

Examine the following structure:


Listing 5. An innocent structure
typedef struct {
    char    a;
    long    b;
    char    c;
}   Struct;

What is the size of this structure in bytes? Many programmers will answer "6 bytes." It makes sense: one byte for a, four bytes for b (a long is four bytes on the 32-bit processors discussed here) and another byte for c. 1 + 4 + 1 equals 6. Here's how it would lay out in memory:

Field Type    Field Name    Field Offset    Field Size    Field End
char          a             0               1             1
long          b             1               4             5
char          c             5               1             6
Total Size in Bytes: 6

However, if you were to ask your compiler for sizeof( Struct ), chances are the answer you'd get back would be greater than six, perhaps eight or even twenty-four. There are two reasons for this: backwards compatibility and efficiency.

First, backwards compatibility. Remember the 68000 was a processor with two-byte memory access granularity, and would throw an exception upon encountering an odd address. If you were to read from or write to field b, you'd attempt to access an odd address. If a debugger weren't installed, the old Mac OS would throw up a System Error dialog box with one button: Restart. Yikes!

So, instead of laying out your fields just the way you wrote them, the compiler padded the structure so that b and c would reside at even addresses:

Field Type    Field Name    Field Offset    Field Size    Field End
char          a             0               1             1
padding       -             1               1             2
long          b             2               4             6
char          c             6               1             7
padding       -             7               1             8
Total Size in Bytes: 8

Padding is the act of adding otherwise unused space to a structure to make fields line up in a desired way. Now, when the 68020 came out with built-in hardware support for unaligned memory access, this padding was unnecessary. However, it didn't hurt anything, and it even helped a little in performance.
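
You can ask the compiler where it actually placed each field by using offsetof and sizeof. The sketch below prints the layout of the same structure; the function name PrintStructLayout is hypothetical, and the numbers it prints depend on your compiler and target:

```c
#include <stddef.h>
#include <stdio.h>

typedef struct {
    char    a;
    long    b;
    char    c;
}   Struct;

/* Prints the offset and size of each field, plus the total size,
   as chosen by the compiler for the current target. */
void PrintStructLayout(void) {
    printf("a: offset %zu, size %zu\n", offsetof(Struct, a), sizeof(char));
    printf("b: offset %zu, size %zu\n", offsetof(Struct, b), sizeof(long));
    printf("c: offset %zu, size %zu\n", offsetof(Struct, c), sizeof(char));
    printf("total size: %zu\n", sizeof(Struct));
}
```

Any gap between the end of one field and the offset of the next, or between the end of c and the total size, is padding.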

The second reason is efficiency. Nowadays, on PowerPC machines, two-byte alignment is nice, but four-byte or eight-byte is better. You probably don't care anymore that the original 68000 choked on unaligned structures, but you probably care about potential 4,610% performance penalties, which can happen if a double field doesn't sit aligned in a structure of your devising.
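
When you leave layout to the compiler, the order in which you declare fields determines how much padding is needed to keep a double aligned. The sketch below compares two orderings of the same fields; the type names Loose and Tight and the function PaddingSaved are hypothetical, and the exact sizes depend on your compiler and target:

```c
#include <stddef.h>

/* Declaration order that forces padding on both sides of the double. */
typedef struct {
    char    a;
    double  b;
    char    c;
} Loose;

/* Same fields, widest first, so the two chars sit together at the end. */
typedef struct {
    double  b;
    char    a;
    char    c;
} Tight;

/* Bytes saved by the second ordering on the current target. */
size_t PaddingSaved(void) {
    return sizeof(Loose) - sizeof(Tight);
}
```

Placing the widest fields first is a common rule of thumb for keeping every field aligned without wasting space.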


Conclusion

If you don't understand and explicitly code for data alignment:

  • Your software may hit performance-killing unaligned memory access exceptions, which invoke very expensive alignment exception handlers.
  • Your application may attempt to atomically store to an unaligned address, causing your application to lock up.
  • Your application may attempt to pass an unaligned address to Altivec, resulting in Altivec reading from and/or writing to the wrong part of memory, silently corrupting data or yielding incorrect results.

Credits

Thanks to Alex Rosenberg and Ian Ollmann for feedback, Matt Slot for his FastTimes timing library, and Duane Hayes for providing a bevy of testing machines.


Resources
