Memory Performance
Motion estimation algorithms as presented here are computationally and memory intensive. In many implementations these algorithms are known to be load-bound: multiple frame positions are explored, which requires high bandwidth between the execution units and the cache. Therefore, in the next section I describe the general operation of the caches in a typical IA processor in an attempt to map performance issues to the micro-architecture.
Figure 5. L1 Cache Operation
Intel® Architecture Caches
The Intel® micro-architecture can support up to three levels of on-chip cache. Only two levels of on-chip caches are implemented in the Pentium® 4 processor. The following table provides the parameters for all cache levels for a Northwood processor with Hyper-Threading Technology.
Table 1. Pentium® 4 Processor Cache Parameters
| Level | Capacity | Ways | Line Size (bytes) | Access Latency, Integer/Floating-point (clocks) |
|---|---|---|---|---|
| First | 8 KB | 4 | 64 | 2/6 |
| TC | 12K µops | 8 | N/A | N/A |
| Second | 512 KB | 8 | 128 | 7/7 |
1st Level Cache Operation
The 8 KB first level cache is arranged into four ways, each 2 KB in size with 32 lines of 64 bytes each (see Figure 5). A tag per cache line per way is also maintained in the L1 cache. This tag contains the state of the cache line and a page tag that indicates to which page directory the cache line belongs. Two problems arise from current L1 cache operation:
- Lack of cache associativity: as a result of the 1st level cache size and number of ways, several memory references might correspond to the same cache line position and compete for the available ways. The resulting cache thrashing and conflicts between memory references in the 1st level cache incur a performance penalty due to 2nd level cache latency.
- Cache aliasing conflicts: related to the use of short tags in the cache look-up operation. A conflict occurs when an L1 cache reference (load or store) has bits 0-N of its linear address identical to those of a reference that is already under way. The second reference cannot begin until the first one is retired from the cache, which adds penalty cycles while waiting for the first reference to complete or for the cache conflict to be resolved.
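The aliasing condition above can be illustrated with a minimal sketch. It assumes the aliasing window is a power of two and uses the 64 KB figure quoted later in this article for Pentium 4 processors as the default; the function name `aliases` and its parameters are hypothetical, chosen only for illustration.

```python
def aliases(addr_a, addr_b, window_bytes=64 * 1024):
    """Return True if two linear addresses share the same low-order bits.

    Assumption: bits 0-N of the address are compared, where
    2**(N + 1) == window_bytes (64 KB gives N = 15).
    """
    mask = window_bytes - 1  # valid only for power-of-two windows
    return (addr_a & mask) == (addr_b & mask)


# Two buffers placed exactly 64 KB apart collide under this model ...
print(aliases(0x10000, 0x20000))
# ... but not when the window grows to 4 MB.
print(aliases(0x10000, 0x20000, window_bytes=4 * 1024 * 1024))
```

Under this model, whether a current-frame load and a reference-frame load collide depends only on the distance between the two buffers modulo the window size.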
Motion Estimation Memory Access Model
As mentioned earlier, the basic building block in ME is the SAD computation. For a 16x16 block, 16 rows of 16 bytes from both the current and reference frames need to be loaded from the 1st level cache to do the computation. For each row in a macro-block there are 2 ALU computations: 1 PSADBW and 1 PADD. Therefore, there are a total of 2 loads (assuming no cache line split or unaligned access penalties) and 2 ALU operations per row. This becomes readily unbalanced if the latency due to 1st level cache access is large. This latency is difficult to mask, as all operations depend on the data retrieved. With today’s architectures the performance of the operation depends on 1st level cache latency (load bound). Furthermore, by analyzing the operations of a motion estimation algorithm one can make the following observations:
- While loads from the current frame can be guaranteed to be 16-byte aligned, the positions in the reference frame will be unaligned more than 99% of the time in a full search algorithm and 67% of the time in a diamond search algorithm (assuming the first diamond only). This adds more load pressure to the 1st level cache.
- Depending on the width of the frame (the frame stride), each row to be loaded from memory may incur a cache line split (see Cache Line Splits). Cache line splits can incur larger load latencies, as they require loading from two cache lines and merging the results.
- Again, depending on the width of the frame, you may incur cache thrashing as a result of cache-associativity issues. While common video frame widths will not incur this problem (see section 1.1.1), it is still considered.
- Address aliasing conflicts between current and reference frames are inevitable and non-deterministic. They incur large penalties on Pentium 4 processors at 64 KB strides. On Prescott, aliasing conflicts are reduced to 4 MB strides.
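The per-row structure of the SAD computation described above (two loads, one sum of absolute differences, one accumulate) can be sketched in plain Python. This is an illustrative scalar model, not the SIMD implementation: the function name `sad_16x16` and the flat `bytes` frame layout are assumptions made for the example.

```python
def sad_16x16(cur, cur_off, ref, ref_off, stride):
    """Sum of absolute differences over a 16x16 block.

    cur, ref : flat byte buffers holding the frames (assumed layout)
    *_off    : byte offset of the block's top-left pixel in each buffer
    stride   : frame width in bytes
    """
    total = 0
    for row in range(16):
        c = cur[cur_off + row * stride: cur_off + row * stride + 16]  # load 1
        r = ref[ref_off + row * stride: ref_off + row * stride + 16]  # load 2
        row_sad = sum(abs(a - b) for a, b in zip(c, r))  # role of PSADBW
        total += row_sad                                 # role of PADD
    return total


# Usage: two constant frames that differ by 3 at every pixel.
stride = 32
cur = bytes([10]) * (stride * 16)
ref = bytes([13]) * (stride * 16)
print(sad_16x16(cur, 0, ref, 0, stride))  # 16 * 16 * 3 = 768
```

Each loop iteration depends on both loads, which is why the load latency is hard to hide in the real kernel.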
Cache Line Splits
Depending on the width of a frame, the macro-block rows accessed during SAD computations may incur cache line splits. Let’s assume a 64-byte cache line size and a 16x16-byte macro-block. Since 64 is a multiple of 16 (4 side-by-side macro-block rows per cache line), as long as the width of the frame is a multiple of 16 and the first byte of the macro-block is 64-byte cache aligned, each row is guaranteed to fall within just one cache line, avoiding the cache split penalty. This is true for typical frame sizes (standard definition is 720x480 and HD is 1280x720, both widths being multiples of 16 bytes). This aligned access can only be guaranteed on accesses to the current frame. However, say the current frame macro-block is 16-byte aligned; only the access to the co-located macro-block in the reference frame will share this alignment. Once the block matching function moves around a search support area, the macro-block accesses may incur cache line splits.
As an example, take the case of the standard definition width of 720. Each stride spans exactly 11.25 cache lines. This means that if the first row of a macro-block starts at cache line 0, the second row starts at cache line 11.25, the third row at 22.5, etc. Every fourth row will be cache-aligned. In terms of byte offsets from the start of the cache line, the second row will have an offset of 16 bytes, the third 32, the fourth 48, and finally the fifth will be cache-aligned again (64, that is, offset 0 of the next line). Table 2 presents the access pattern for the macro-block in the current frame and the co-located macro-block in the reference frame. Note that all the accesses are within one cache line; thus, no cache-line splits occur.
Table 2. Aligned SAD accesses in the current frame
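The offset arithmetic for the 720-byte stride can be checked with a short sketch. It assumes the macro-block starts at a 64-byte-aligned address (offset 0); the helper names `row_offsets` and `crosses_line` are hypothetical.

```python
CACHE_LINE = 64   # bytes per cache line
ROW_BYTES = 16    # bytes loaded per macro-block row


def row_offsets(width, rows=16, start=0, line=CACHE_LINE):
    """Byte offset within its cache line of each macro-block row."""
    return [(start + r * width) % line for r in range(rows)]


def crosses_line(offset, size=ROW_BYTES, line=CACHE_LINE):
    """True if a load of `size` bytes at `offset` spans two cache lines."""
    return offset + size > line


offsets = row_offsets(720)
print(offsets[:5])                            # [0, 16, 32, 48, 0]
print(any(crosses_line(o) for o in offsets))  # False: no splits
```

Because every offset is a multiple of 16 and the load is 16 bytes wide, no row of the aligned block can straddle a 64-byte boundary.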
Now let’s consider accesses on the reference frame. For diamond search, let’s label points 1-9 as shown in Figure 6, where point 5 corresponds to the co-located macro-block in the reference frame. Points 1, 5 and 9 share the cache alignment of the current frame macro-block, which is an offset of 0. Points 2 and 7 are displaced by -1 byte; 3 and 8 are displaced by +1 byte; point 4 is displaced by -2 bytes; and point 6 is displaced by +2 bytes. We can now extend Table 2 to show the access patterns for a diamond search on the reference frame. Table 3 summarizes this. The shaded cells indicate patterns that will incur cache line splits. Thus, in a best case scenario 24 out of 160 (9 points * 16 ref. rows + 16 cur. rows) memory accesses (15% of the loads) will incur a cache line split. If more than one diamond search is required, this statistic will increase.
Table 3. Alignment of SAD accesses for one diamond pattern
Table 4. Alignment of SAD accesses for one diamond pattern
Figure 6. Diamond points
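The 24-out-of-160 figure can be reproduced with a small sketch. The horizontal displacements come from the text above; the vertical displacements assigned to each point are an assumption about the layout in Figure 6 (they do not change the count for a 720-byte stride, since each row shift moves the offset by a multiple of 16 bytes). The block position and helper names are hypothetical.

```python
CACHE_LINE = 64
BLOCK = 16  # 16x16 macro-block, one byte per pixel

# (dx, dy) for points 1-9; dx from the text, dy assumed from the diamond shape.
DIAMOND = [(0, -2), (-1, -1), (1, -1), (-2, 0), (0, 0),
           (2, 0), (-1, 1), (1, 1), (0, 2)]


def is_split(addr, size=BLOCK, line=CACHE_LINE):
    """True if a `size`-byte load at `addr` straddles a cache-line boundary."""
    return addr % line + size > line


def count_splits(width, mb_x, mb_y, points):
    """Count row loads that split a cache line over all search points."""
    splits = 0
    for dx, dy in points:
        for row in range(BLOCK):
            addr = (mb_y + dy + row) * width + (mb_x + dx)
            splits += is_split(addr)
    return splits


# Block at x=64, y=16 in a 720-wide frame: its first byte is 64-byte aligned.
ref_splits = count_splits(720, 64, 16, DIAMOND)
total_loads = len(DIAMOND) * BLOCK + BLOCK  # reference rows + current rows
print(ref_splits, total_loads, ref_splits / total_loads)  # 24 160 0.15
```

Six of the nine points are horizontally displaced, and each of them splits on one row out of every four, which gives 6 * 4 = 24 split loads.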
1.1.1 Cache Associativity
With current cache architectures as described in Section 1.3, cache associativity may become an issue if the frame width is a power of 2. Let’s take for example a frame width of 1024. There are 32 possible cache sets in the 1st level cache. Assuming cache-aligned frames, each new row in a macro-block will be 1024 bytes apart from the last one, which means that the mapping of cache lines to the 1st level cache is as presented in Table 5. As can be observed, due to the lack of ways, row 8 will swap row 0 out of the 1st level cache. This means that on every access to the current macro-block there will be an added penalty, since the data will come from the 2nd level cache. Although for typical video frame widths this will not be a problem, it wouldn’t be surprising to encounter implementations where the frames get padded to powers of 2 to allow efficient computation of other transforms, e.g. computing FFTs or other efficient power-of-2 transforms. In general, the cache position (set and way) can be computed as:
where [x] is the closest integer that is less than or equal to x (floor operator), Width is the frame width, CLS is the cache line size, Num_Ways is the number of ways, Num_Sets is the number of sets, which is equal to (cache size)/(Num_Ways), and Row is the row being indexed. Using the above equations and a cache size of 8 KB, any power-of-2 frame width greater than or equal to 1024 will incur cache thrashing within one macro-block.
Table 5. Cache utilization on frame width of size 1024
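The thrashing condition can be explored with a minimal sketch. It assumes the conventional set-index mapping, set = floor(address / CLS) mod (number of sets), with the number of sets taken as 8 KB / (4 ways * 64 bytes) = 32 to match the 32 sets mentioned above. It does not model which way is chosen within a set (that depends on the replacement policy); it only counts how many macro-block rows compete for the same set. All names are hypothetical.

```python
from collections import Counter

CACHE_SIZE = 8 * 1024                       # 1st level cache, bytes
NUM_WAYS = 4
CLS = 64                                    # cache line size, bytes
NUM_SETS = CACHE_SIZE // (NUM_WAYS * CLS)   # 32 sets (assumed definition)


def set_index(row, width, base=0):
    """Set that macro-block row `row` maps to, for a cache-aligned frame."""
    return ((base + row * width) // CLS) % NUM_SETS


def max_rows_per_set(width, rows=16):
    """Largest number of macro-block rows that land in a single set."""
    counts = Counter(set_index(r, width) for r in range(rows))
    return max(counts.values())


def thrashes(width, rows=16):
    """True if some set receives more rows than there are ways."""
    return max_rows_per_set(width, rows) > NUM_WAYS


for width in (512, 720, 1024, 2048):
    print(width, max_rows_per_set(width), thrashes(width))
```

For a width of 1024, even rows all map to set 0, so rows 0, 2, 4 and 6 fill its four ways and row 8 evicts row 0, in line with the observation above. A width of 512 places exactly four rows in each used set, which just fits.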