Understand memory access characteristics of motion estimation algorithms

Introduction

By Alex A. Lopez-Estrada

Applications Engineer, Intel Software and Solutions Group

Today's processor computing power makes real-time video encoding and decoding on General Purpose Processors a reality. Still, achieving real-time encoding/decoding with minimum CPU utilization remains a challenge for system architects, since most of the processor's cycles must be reserved for richer and more extended user experiences. Currently, performance of close to 2X real time, or 50% CPU utilization, is achievable. Thus, due to its computational and memory characteristics, video encoding is still considered a "killer application" from a processor micro-architecture design perspective.

This paper presents an analysis of the memory access patterns of typical motion estimation algorithms in order to better understand processor micro-architecture issues that affect the overall performance of video compression algorithms.

A Typical Video Encoder

Figure 1 presents a high-level picture of a video encoder, applicable to the MPEG-2, MPEG-4 or H.264 standards. As digital "raw" video arrives at the encoding pipeline, a series of transforms is applied with the purpose of removing redundant information (lossless) or irrelevant information (lossy), exploiting characteristics of the human visual system (HVS). The idea is to achieve bit rate reduction without sacrificing the quality of the video signal.

Figure 1. Generic MPEG source coder

One important aspect of a video encoder is the removal of temporal redundancy, which is achieved by the Motion Estimation (ME) block in the encoder. The objective is to represent each video frame as a difference in pixel values from a reference frame. The reference frame can be one or more past frames in the video sequence, or frames in the future. Frames (or frame portions) not predicted from a reference are designated INTRA, while frames (or frame portions) predicted from a reference are known as INTER. The selection of whether to encode as INTRA or INTER is based on the temporal predictability of the frame (or frame portion) with respect to the reference.

Typical motion estimation algorithms account for a substantial portion of an encoder's CPU cycles; in many cases (depending on the complexity of the algorithms) motion estimation has been known to account for as much as 60% of total CPU cycles. Therefore, this paper targets the analysis of motion estimation algorithms on typical processor micro-architectures with on-chip memory, such as IA-32 architectures.

Motion Estimation Algorithms

A wide variety of ME algorithms exists, offering tradeoffs between encoding speed and video quality. In the ME stage, current and reference frames are broken into macro-blocks (MB), each typically 16x16 pixels. A search is performed to locate the macro-block in the reference frame that satisfies some minimum error criterion with respect to the current frame. This is commonly known as block matching. A popular error criterion used in ME is the Sum of Absolute Differences, or SAD, defined as:

$$\mathrm{SAD} = \sum_{i=0}^{15} \sum_{j=0}^{15} \left| x_{ij} - y_{ij} \right|$$

where x is the current frame macro-block, y is the reference frame macro-block, and the subscripts i, j denote row i, column j of the block. Processor instructions have been introduced to compute the SAD optimally, with the purpose of improving ME performance. In SSE the instruction is called PSADBW, which computes one row of the SAD [1].
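
As an illustration, a minimal 16x16 SAD kernel built on PSADBW might look like the sketch below, using the SSE2 intrinsic `_mm_sad_epu8`. The function name and parameters are illustrative, not taken from an actual encoder:

```c
#include <emmintrin.h>  /* SSE2: _mm_sad_epu8 maps to PSADBW */
#include <stdint.h>

/* Illustrative sketch: SAD of a 16x16 macro-block. The current-frame
 * pointer is assumed 16-byte aligned; the reference pointer generally
 * is not (hence the unaligned load). Strides are frame widths in bytes. */
static int sad_16x16(const uint8_t *cur, int cur_stride,
                     const uint8_t *ref, int ref_stride)
{
    __m128i acc = _mm_setzero_si128();
    for (int row = 0; row < 16; ++row) {
        /* Per row: 2 loads, 1 PSADBW and 1 PADD, as discussed later. */
        __m128i c = _mm_load_si128((const __m128i *)(cur + row * cur_stride));
        __m128i r = _mm_loadu_si128((const __m128i *)(ref + row * ref_stride));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(c, r)); /* two 64-bit partials */
    }
    /* Fold the low and high 64-bit partial sums into one result. */
    return _mm_cvtsi128_si32(acc) +
           _mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
}
```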

Once the best matching macro-block is located, a motion vector (the per-macro-block vectors collectively forming the motion vector field, MVF) is encoded, corresponding to the displacement in pixel coordinates (x, y) from the current frame macro-block. The efficient search for the macro-block corresponding to the minimum SAD is the subject of ongoing research in video encoding. We explore three popular methods for performing the motion estimation stage: full search, diamond search and PMVFAST.

Full Search Block Matching

In Full Search (FS) ME, all possible overlapping locations (in a sliding fashion) around the co-located macro-block in the reference frame (see Figure 2) are evaluated, and the position that results in the minimum SAD is used as the predictor. The search support area in the reference frame is denoted by X_SEARCH and Y_SEARCH, so the number of SADs to be computed grows proportionally with the search support area, i.e. N = X_SEARCH x Y_SEARCH. This method is the most computationally expensive but provides the best coverage of the search area. Even so, the coverage depends on the selection of X_SEARCH and Y_SEARCH, and the algorithm may fail to find the optimum reference macro-block if it falls outside the search support.
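
A minimal FS sketch follows, reusing the hypothetical `sad_16x16` helper above; here the window spans ±x_search and ±y_search around the co-located macro-block, and frame-boundary clamping is omitted for brevity:

```c
#include <limits.h>   /* INT_MAX */

typedef struct { int dx, dy, sad; } MotionVector;

/* Illustrative full-search sketch around the co-located macro-block.
 * Every displacement in the search support is evaluated. */
static MotionVector full_search(const uint8_t *cur, const uint8_t *ref,
                                int stride, int mb_x, int mb_y,
                                int x_search, int y_search)
{
    const uint8_t *cur_mb = cur + mb_y * stride + mb_x;
    MotionVector best = { 0, 0, INT_MAX };
    for (int dy = -y_search; dy <= y_search; ++dy) {
        for (int dx = -x_search; dx <= x_search; ++dx) {
            const uint8_t *ref_mb = ref + (mb_y + dy) * stride + (mb_x + dx);
            int sad = sad_16x16(cur_mb, stride, ref_mb, stride);
            if (sad < best.sad) {
                best.dx = dx; best.dy = dy; best.sad = sad;
            }
        }
    }
    return best;
}
```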

Figure 2. Full-Search Motion Estimation

Diamond Search

The Diamond Search (DS) algorithm was first proposed by Shan Zhu and Kai-Kuang Ma [2] and is based on the study of motion vector field distributions of a large population of video sequences. The diamond pattern presented in Figure 3 is derived from the probability distribution function, as the locations with the highest probability of containing the matching block in the reference frame. The algorithm starts at the co-located macro-block in the reference frame and performs eight additional SADs around the diamond center, in contrast with FS, which computes all possible overlapping SADs. Once the minimum SAD location is found, the diamond center is displaced to that location and a new diamond search is executed. The new search requires fewer SAD computations, since points shared with the previous diamond have already been evaluated. The search stops once the position of the minimum SAD is located at the center of the diamond. On average the best matching block is found within a few diamond iterations, thus requiring fewer SADs per macro-block than FS. The algorithm offers the advantage of extending the search support area, allowing more reference frame coverage with fewer computations. However, due to the sparse nature of the diamond, an optimum matching block near the center may be missed. Various algorithms based on DS have been proposed to do a hierarchical search, so that once the best matching macro-block is found, an inner diamond search is performed to cover points interior to the diamond.
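
A minimal DS sketch follows, again reusing the hypothetical `sad_16x16` and `MotionVector` definitions from the earlier sketches. It implements only the basic large-diamond iteration described above, and for simplicity recomputes, rather than caches, SADs at points shared between consecutive diamonds:

```c
/* Large diamond pattern {dx, dy}; index 4 is the center (point 5 in the
 * labeling later used in Figure 6). */
static const int DIAMOND[9][2] = {
    { 0, -2}, {-1, -1}, {+1, -1},
    {-2,  0}, { 0,  0}, {+2,  0},
    {-1, +1}, {+1, +1}, { 0, +2},
};

static MotionVector diamond_search(const uint8_t *cur, const uint8_t *ref,
                                   int stride, int mb_x, int mb_y)
{
    const uint8_t *cur_mb = cur + mb_y * stride + mb_x;
    const int center = 4;
    MotionVector best = { 0, 0, INT_MAX };
    for (;;) {
        int best_i = center;
        for (int i = 0; i < 9; ++i) {
            int dx = best.dx + DIAMOND[i][0];
            int dy = best.dy + DIAMOND[i][1];
            const uint8_t *ref_mb = ref + (mb_y + dy) * stride + (mb_x + dx);
            int sad = sad_16x16(cur_mb, stride, ref_mb, stride);
            if (sad < best.sad) { best.sad = sad; best_i = i; }
        }
        if (best_i == center)      /* minimum at the diamond center: stop */
            return best;
        best.dx += DIAMOND[best_i][0];   /* recenter the diamond and retry */
        best.dy += DIAMOND[best_i][1];
    }
}
```

Because `best.sad` strictly decreases on every recentering, the loop is guaranteed to terminate.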

Figure 3. Diamond Search Motion Estimation Algorithm

Predictive Algorithms

Predictive algorithms use neighboring blocks, and blocks in adjacent frames with already computed motion vectors, to estimate the initial position of the search. Usually these algorithms use the prediction to seed the first search, and consecutive searches follow a prescribed pattern, e.g. a diamond search pattern. A popular predictive algorithm is PMVFAST, the Predictive Motion Vector Field Adaptive Search Technique [3]. Figure 4 presents an instance of PMVFAST using a diamond search pattern as the underlying technique after the first search. The algorithm uses the motion vectors estimated for the following 4 macro-blocks: the one directly on top (Top MV), the one to the left (Left MV), the one to the upper right (Top-Right MV), and the co-located macro-block in the previous frame (MV from frame n-1). Additionally, a median is computed from these 4 motion vectors to provide a 5th position. A SAD is computed at each position, and the minimum SAD is taken as the center for the next search. In Figure 4 the Top-Right MV is assumed to produce the minimum SAD, and the bottom diagram shows the position of the diamond with respect to search 1. A regular diamond search is then executed. The general goal of this algorithm is faster convergence than regular DS.
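
The predictor-seeding step might be sketched as follows, continuing the same hypothetical helpers. The component-wise middle-two-average median is an assumption here, as the exact median definition is not spelled out in the text:

```c
/* Component-wise median of four values (average of the middle two);
 * the precise median definition used by PMVFAST is an assumption. */
static int median4(int a, int b, int c, int d)
{
    int lo = a, hi = a, sum = a + b + c + d;
    if (b < lo) lo = b; if (b > hi) hi = b;
    if (c < lo) lo = c; if (c > hi) hi = c;
    if (d < lo) lo = d; if (d > hi) hi = d;
    return (sum - lo - hi) / 2;
}

/* Illustrative PMVFAST seeding: evaluate the four neighboring predictors
 * plus their median, then hand the winner to a regular diamond search. */
static MotionVector pmvfast_seed(const uint8_t *cur, const uint8_t *ref,
                                 int stride, int mb_x, int mb_y,
                                 MotionVector top, MotionVector left,
                                 MotionVector top_right, MotionVector prev)
{
    MotionVector cand[5] = { top, left, top_right, prev };
    cand[4].dx = median4(top.dx, left.dx, top_right.dx, prev.dx);
    cand[4].dy = median4(top.dy, left.dy, top_right.dy, prev.dy);

    const uint8_t *cur_mb = cur + mb_y * stride + mb_x;
    MotionVector best = { 0, 0, INT_MAX };
    for (int i = 0; i < 5; ++i) {
        const uint8_t *ref_mb = ref + (mb_y + cand[i].dy) * stride
                                    + (mb_x + cand[i].dx);
        int sad = sad_16x16(cur_mb, stride, ref_mb, stride);
        if (sad < best.sad) { best = cand[i]; best.sad = sad; }
    }
    /* A diamond search (as sketched earlier) would now start from `best`
     * instead of the co-located position. */
    return best;
}
```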

Figure 4. Predictive Motion Vector Field Adaptive Search Technique (PMVFAST) using a Diamond Search pattern.

Memory Performance

The motion estimation algorithms presented here are computationally and memory intensive. In many implementations these algorithms are known to be load-bound, as multiple frame positions are explored, requiring high execution-to-cache bandwidth. Therefore, the next section describes the general operation of the caches in a typical IA processor, in an attempt to map performance issues to the micro-architecture.

Figure 5. L1 Cache Operation

Intel® Architecture Caches

The Intel® micro-architecture can support up to three levels of on-chip cache; only two levels are implemented in the Pentium® 4 processor. The following table provides the parameters of all cache levels for a Northwood processor with Hyper-Threading Technology.

Table 1. Pentium® 4 Processor Cache Parameters

Level     Capacity    Ways    Line Size (bytes)    Access Latency, Integer/Floating-Point (clocks)
First     8 KB        4       64                   2/6
TC        12K µops    8       N/A                  N/A
Second    512 KB      8       128                  7/7

1st Level Cache Operation

The 8 KB first level cache is arranged into four ways, each 2 KB in size with 32 lines of 64 bytes each (see Figure 5). A tag per cache line per way is also maintained in L1 cache. This tag contains the state of the cache line and a page tag that indicates to which page directory the cache line belongs.

Two problems arise from current L1 cache operation:

  1. Lack of cache associativity – as a result of the 1st level cache size and number of ways, several memory references might map to the same set and cache line. Cache thrashing and conflicts between memory references in the 1st level cache result in a performance penalty due to 2nd level cache latency.

  2. Cache aliasing conflicts – related to the use of short tags in the cache look-up operation. A conflict occurs when an L1 cache reference (load or store) has bits 0–N of its linear address identical to those of a reference that is still under way. The second reference then cannot begin until the first one is retired from the cache, adding penalty cycles spent waiting for the first reference to complete or for the cache conflict to be resolved.

Motion Estimation Memory Access Model

As mentioned earlier, the basic building block in ME is the SAD computation. For a 16x16 block, 16 rows of 16 bytes from both the current and reference frames must be loaded from the 1st level cache to perform the computation. For each row in a macro-block there are 2 ALU operations: 1 PSADBW and 1 PADD. Therefore, there are a total of 2 loads (assuming no cache line split or unaligned access penalties) and 2 ALU operations per row. This balance is easily upset if the 1st level cache access latency is large, and the latency is difficult to mask because all operations depend on the data retrieved. With today's architectures the performance of the operation is dominated by 1st level cache latency (load bound).

Furthermore, by analyzing the operations of a motion estimation algorithm one can make the following observations:

  • While loads from the current frame can be guaranteed to be 16-byte aligned, the positions in the reference frame will be unaligned more than 99% of the time in a full search algorithm and 67% of the time in a diamond search algorithm (assuming the first diamond only). This adds more load pressure to the 1st level cache.

  • Depending on the width of the frame (the frame stride), each row to be loaded from memory may incur a cache line split (see the Cache Line Splits section). Cache line splits can incur larger load latencies, as they require loading from two cache lines and merging the results.

  • Again, depending on the width of the frame, cache thrashing may occur as a result of cache-associativity issues. While common video frame widths will not trigger this problem (see the Cache Associativity section), it is still worth considering.

  • Address aliasing conflicts between current and reference frames are inevitable and non-deterministic. These incur large penalties on Pentium 4 processors for 64 KB strides; on Prescott, aliasing conflicts are reduced to 4 MB strides.

Cache Line Splits

Depending on the width of a frame, the macro-block rows accessed during SAD computations may incur cache line splits. Let's assume a 64-byte cache line size and a 16x16-byte macro-block. Since 64 is a multiple of 16 (four side-by-side macro-block rows per cache line), as long as the width of the frame is a multiple of 16 and the first byte of the macro-block is 16-byte aligned, each row is guaranteed to fall within a single cache line, avoiding the cache split penalty. This holds for typical frame sizes (standard definition is 720x480 and HD uses 1280x720, both widths being multiples of 16 bytes). This aligned access can only be guaranteed on accesses to the current frame: even if the current frame macro-block is 16-byte aligned, only the access to the co-located macro-block in the reference frame shares this alignment. Once the block matching function moves around the search support area, the macro-block accesses may incur cache line splits.

As an example, take the standard definition width of 720. Each stride spans exactly 11.25 cache lines. This means that if the first row of a macro-block starts at cache line 0, the second row starts at cache line 11.25, the third row at 22.5, and so on; every fourth row is cache-line aligned. In terms of byte offsets within a cache line, the second row has an offset of 16 bytes, the third 32, the fourth 48, and the fifth is again cache-aligned (offset 0). Table 2 presents the access pattern for the macro-block in the current frame and the co-located macro-block in the reference frame. Note that all the accesses are within one cache line; thus, no cache-line splits occur.

Table 2. Aligned SAD accesses in the current frame
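
The offset pattern is simple modular arithmetic; the following hypothetical snippet (not encoder code) reproduces the Table 2 pattern:

```c
#include <stdio.h>

/* Hypothetical check: byte offset of each macro-block row within its
 * 64-byte cache line for a 720-byte stride (the Table 2 pattern). */
int main(void)
{
    const int stride = 720, cls = 64;
    for (int row = 0; row < 16; ++row)
        printf("row %2d: offset %2d\n", row, (row * stride) % cls);
    return 0;   /* prints 0, 16, 32, 48, 0, 16, ... */
}
```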

Now let's consider accesses to the reference frame. For diamond search, label points 1–9 as shown in Figure 6, where point 5 corresponds to the co-located macro-block in the reference frame. Points 1, 5 and 9 share the cache alignment of the current frame macro-block, which is an offset of 0. Points 2 and 7 are displaced by -1 byte; points 3 and 8 by +1 byte; point 4 by -2 bytes; and point 6 by +2 bytes. We can now extend Table 2 to show the access patterns for a diamond search on the reference frame; Table 3 summarizes this. The shaded cells indicate patterns that incur cache line splits. Thus, in a best-case scenario, 24 out of 160 memory accesses (9 points × 16 reference rows + 16 current-frame rows), or 15% of the loads, will incur a cache line split. If more than one diamond search is required, this statistic increases.

Table 3. Alignment of SAD accesses for one diamond pattern
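
The 24-out-of-160 count can be reproduced by enumerating the nine diamond displacements against the four row offsets; this is a hypothetical verification snippet, not encoder code:

```c
#include <stdio.h>

/* Count the cache-line splits for one diamond pattern at a 720-byte
 * stride: a 16-byte load starting at byte offset o within a 64-byte
 * line splits when o + 16 > 64, i.e. when o mod 64 > 48. */
int main(void)
{
    /* Byte displacement of diamond points 1..9 relative to the aligned
     * co-located macro-block (see Figure 6). */
    const int disp[9] = { 0, -1, +1, -2, 0, +2, -1, +1, 0 };
    int splits = 0;
    for (int p = 0; p < 9; ++p) {
        for (int row = 0; row < 16; ++row) {
            int off = ((row * 720 + disp[p]) % 64 + 64) % 64; /* wrap < 0 */
            if (off > 48)
                ++splits;
        }
    }
    /* The 16 current-frame loads never split, so they only add to the
     * denominator. Prints: 24 of 160 loads split. */
    printf("%d of %d loads split\n", splits, 9 * 16 + 16);
    return 0;
}
```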

Figure 6. Diamond points


Cache Associativity

With the current cache architecture described above, cache associativity may become an issue if the frame width is a power of 2. Take for example a frame width of 1024. There are 32 possible cache sets in the 1st level cache. Assuming cache-aligned frames, each new row in a macro-block will be 1024 bytes apart from the previous one, which means that the mapping of cache lines to the 1st level cache is as presented in Table 4.

As can be observed, due to the lack of ways, row 8 will evict row 0 from the 1st level cache. This means that on every access to the current macro-block there will be an added penalty, since the data will come from the 2nd level cache. Although this is not a problem for typical video frame widths, it would not be surprising to encounter implementations where the frames are padded to powers of 2 to allow efficient computation of other transforms, e.g. FFTs or other power-of-2 transforms.

In general, the cache position (set and way) can be computed as:

$$\mathrm{Set}(Row) = \left\lfloor \frac{Row \times Width}{CLS} \right\rfloor \bmod Num\_Sets$$

$$\mathrm{Way}(Row) = \left\lfloor \frac{1}{Num\_Sets} \left\lfloor \frac{Row \times Width}{CLS} \right\rfloor \right\rfloor \bmod Num\_Ways$$

where ⌊x⌋ is the closest integer less than or equal to x (the floor operator), Width is the frame width, CLS is the cache line size, Num_Ways is the number of ways, Num_Sets is the number of sets, equal to (cache size)/(Num_Ways × CLS), and Row is the row being indexed. Using the above equations with a cache size of 8 KB, any power-of-2 frame width greater than or equal to 1024 will incur cache thrashing within one macro-block.
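
As a hypothetical numerical check of these equations, using the Table 1 parameters (8 KB, 4 ways, 64-byte lines, hence 8192 / (4 × 64) = 32 sets):

```c
#include <stdio.h>

/* Map each macro-block row to its L1 set for a 1024-byte stride. */
int main(void)
{
    const int width = 1024, cls = 64, num_sets = 32, num_ways = 4;
    for (int row = 0; row < 16; ++row) {
        int line = (row * width) / cls;    /* cache line index of the row */
        /* The "way slot" assumes ways fill in arrival order (LRU-like). */
        printf("row %2d -> set %2d, way slot %d\n",
               row, line % num_sets, (line / num_sets) % num_ways);
    }
    /* Rows 0, 2, 4 and 6 all map to set 0; row 8 wraps around and evicts
     * row 0, exactly the thrashing described in the text. */
    return 0;
}
```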

Table 4. Cache utilization for a frame width of 1024

Proposed Solutions

One solution can be implemented as a software workaround, yielding significant speed-ups in motion estimation performance. The same solution can be implemented in hardware, with expected superior performance. The method is named Search Support Copy (SSC) and is described in detail below.

Search Support Copy – Software Solution

In the proposed method, intermediate software buffers are used to store a copy of one or more of the current and reference frame macro-blocks. The copy into the buffers occurs prior to executing the Motion Estimation step (see Figure 7). The characteristics of the buffers are:

Current frame buffer

  • Must be of a size such that its stride is small in comparison to the cache design, in order to remove cache associativity problems. That is, the buffer should be sized so that the cache lines associated with it occupy only one way of the 1st level cache, minimizing cache thrashing issues. Since the buffer is stored contiguously in memory, the buffer stride determines its location in the 1st level cache.

  • The size must be selected to take full advantage of the cache line size. For example, on the Pentium 4 processor each 64-byte cache line can hold 4 macro-block rows contiguously. The algorithm may take advantage of this and do just one copy for every 4 macro-blocks.

  • Must be cache aligned, to prevent cache-line splits during the ME search.

Reference frame buffer

The same characteristics as the current frame buffer, plus:

  • It must cover an area of the reference frame large enough that the probability of the ME search going outside the bounds of the buffer is minimized. This characteristic is algorithm and content dependent, and it could be adaptively modified accordingly. For example, for a DS algorithm executing on relatively smooth sequences, most MVs will be found within one or two diamond positions. For high motion content the area might be increased, as MV displacement is larger. This is a tradeoff between the amount of copying necessary (adding computational overhead) and the search support coverage. Most times a sufficiently small search support area is preferable to the overhead of unnecessary copies.

  • It should be aligned such that it benefits the search algorithm. For diamond search, 3 of the 9 points share the same alignment (points 1, 5 and 9 in Figure 6). Thus, it is beneficial to align the buffer such that the co-located macro-block (point 5) is 16-byte aligned while the leftmost point (point 4) still falls within the same cache line.

The intermediate buffers offer several advantages: they alleviate the cache-line split penalty and the cache thrashing issues discussed in the Cache Line Splits and Cache Associativity sections, since the penalties are incurred only once, during the copy. This method also eliminates the occurrence of aliasing conflicts, as the buffers can be guaranteed to be allocated in areas of memory that are not separated by the address aliasing strides.

Figure 7. High level overview of Search Support Copy

Figure 8. Search support buffer size. The diagram shows the center diamond and 4 edge diamonds; thus rows -6 through 6+16 and columns -6 through 6+16 are required, i.e. 28 rows and 28 columns. A 28-row x 32-column buffer is used to ensure 16-byte alignment.

Using this guidance for the selection of the buffers, a prototype implementation was made. The current frame buffer was selected to be of size 32x16 (2 side-by-side macro-blocks), as copying more than two at once did not show any benefit. The reference frame buffer was selected to benefit the Diamond Search algorithm. In this case we assume that most times the algorithm will converge within two diamond iterations, so we allow movement of the diamond twice in any direction. This translates into a buffer of 28x28, but to ensure 16-byte alignment we copy an area of 32 columns x 28 rows (Figure 8). Pointers are updated to keep track of the diamond movement within the buffer. Only when a diamond moves outside the buffer is a new copy made into the reference buffer.
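
As a rough illustration of the copy step, the following sketch (hypothetical names, not the prototype's code) fills a 28-row x 32-column reference buffer covering two diamond moves in every direction. Frame-boundary clamping and the exact column placement used to keep point 5 16-byte aligned are simplifying assumptions here:

```c
#include <stdint.h>
#include <string.h>

#define REF_ROWS 28   /* rows -6 .. 21 relative to the co-located MB */
#define REF_COLS 32   /* 28 needed; padded to 32 for 16-byte alignment */

/* Illustrative SSC sketch: copy the reference-frame search support into
 * a small contiguous buffer before running diamond search. `buf` is
 * assumed 16-byte aligned (e.g. allocated with _mm_malloc). */
static void ssc_copy_reference(uint8_t *buf, const uint8_t *ref,
                               int stride, int mb_x, int mb_y)
{
    const uint8_t *src = ref + (mb_y - 6) * stride + (mb_x - 6);
    for (int row = 0; row < REF_ROWS; ++row)
        memcpy(buf + row * REF_COLS, src + row * stride, REF_COLS);
}
```

Subsequent SAD loads then hit the same few cache lines, so split and thrashing penalties are paid only once, during the copy.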

Streaming Buffers – Hardware Solution

The general idea of intermediate buffers can be extended to hardware. Implementing small streaming buffers, or addressable caches consisting of a few consecutive cache lines, would accomplish the same goals. These buffers could be addressed through explicit registers when this mode is enabled; copies to the registers would then implicitly populate the buffers. This avoids the overhead of the software copy, which may otherwise overshadow the performance benefits, especially in high motion sequences.

Other Micro-architecture Considerations

Aside from the SSC solution, various other micro-architecture modifications result in more generic and transparent improvements: improving cache line split performance, improving unaligned access performance, reducing 1st level cache latency, and adding multiple load ports. These modifications are transparent to the programmer and benefit all kinds of software. Other capacity improvements, such as increasing the cache size or the number of ways, are less likely to improve motion estimation performance.

Performance

The method described herein was implemented using a DS ME kernel. Data was collected on Northwood and Prescott processors using 4 video sequences ranging from low to high motion, at different resolutions (Standard Definition and High Definition).

Figure 9 and Figure 10 depict the results, illustrating the two options described in this paper: copying just the current frame macro-block (labeled as Current Only in the figures) and copying both the current and reference frame macro-blocks. On average, Northwood shows a 29% cycle improvement, with up to 63% in low motion sequences (sequence 1). Prescott shows an average improvement of 11%, and up to 40% in low motion sequences (sequence 1). In some cases the improvement is minimal, and in others the Current Only version works better. A key observation explaining the relatively larger improvements on Northwood compared to Prescott is the method's ability to reduce 64 KB aliasing conflicts, which are not present on Prescott.

Summary

This paper examined the memory access characteristics of typical motion estimation algorithms. The diamond search algorithm was used as an example, and it was shown that 15% of the SAD loads can incur cache line splits, resulting in substantial performance degradation. Also, full search algorithms incur unaligned accesses in the reference frame 99% of the time, while diamond search can incur more than 67% unaligned accesses.

A software mechanism that can result in significant speed-ups was presented to overcome some of these micro-architecture issues. However, more generic micro-architecture improvements, such as improving cache-line split and unaligned access latencies, are preferred.

Figure 9. Results of the disclosed method gathered on a Pentium® 4 Northwood processor.

Figure 10. Results of the disclosed method gathered on a Pentium® 4 Prescott processor.

References

[1] Abel, J. et al. Applications Tuning for Streaming SIMD Extensions. Intel Technology Journal, 1999.

[2] Zhu, S. and Ma, K.-K. A New Diamond Search Algorithm for Fast Block Matching Motion Estimation. ICICS, IEEE, Sept. 1997.

[3] Tourapis, A. et al. Predictive Motion Vector Field Adaptive Search Technique (PMVFAST). Dept. of Electrical Engineering, The Hong Kong University of Science and Technology.

About the Author


Alex A. Lopez-Estrada is a Sr. Software Engineer in Intel's Consumer Electronics Group. He holds an MS in EE with a concentration in DSP algorithm development. Alex worked for the Eastman Kodak research labs before joining Intel, where he has developed A/V codecs and media infrastructure components and worked on performance optimization and architecture analysis. He holds 2 patents, has 9 patents pending, and has published several papers in the fields of imaging, audio and video technologies.
