Reading notes on "The H.264 Advanced Video Compression Standard"
3.2 Video CODEC
This chapter gives a high-level overview of how an H.264 encoder works and introduces some basic concepts. The original English edition explains things very clearly, far better than books that merely list one concept after another. I recommend reading the English original and working through it slowly yourself; that is much more effective than reading a translated edition.
Below is my own understanding of the text:
A video encoder consists of three main functional units: a prediction model, a spatial model and an entropy encoder.
prediction model: its input is the raw video data. Given the current frame, it forms a prediction of that frame using intra-frame data and/or inter-frame data, i.e. neighbouring previous frames. (More complex B-frames also use later neighbouring frames. "Later" here refers only to display order on the time axis; the actual coding order need not follow display order, so a later frame can be encoded, stored and transmitted first. Either way, the decoder must obtain the reference frames before it can use them; for simplicity, ignore this detail for now.) The prediction is then subtracted from the current frame, and the difference is called the residual frame. The prediction model therefore outputs a residual frame together with a set of model parameters (e.g. the intra prediction type, or a description of how motion was compensated), which the decoder later relies on to decode the frame.
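The subtraction step can be sketched in a few lines. This is a minimal illustration with made-up names, treating a frame as a flat list of pixel values:

```python
# Toy sketch: a "frame" is a flat list of pixel values (assumed representation).
def residual(current, prediction):
    """Residual frame = actual current frame minus predicted frame."""
    return [c - p for c, p in zip(current, prediction)]

current    = [10, 12, 14, 13]
prediction = [10, 11, 13, 14]   # e.g. copied from a neighbouring frame
print(residual(current, prediction))  # → [0, 1, 1, -1]
```

Because neighbouring frames are similar, the residual values cluster around zero, which is exactly what makes the later stages effective.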
spatial model: storing the residual frame directly already gives some compression, but nowhere near enough. The spatial model transforms the residual samples into another domain that is more convenient to quantize (if you have studied signals and systems, think of frequency-domain analysis), then quantizes the result to discard insignificant data. The representation of the data in that domain is a set of transform coefficients, and the quantized coefficients are the spatial model's output.
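The transform-then-quantize idea can be sketched with a toy 1-D DCT. This is purely illustrative (H.264 actually uses a small integer transform, and the function names here are invented):

```python
import math

# Toy 1-D DCT-II: converts residual samples into "frequency domain" coefficients.
def dct(block):
    N = len(block)
    return [sum(x * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n, x in enumerate(block))
            for k in range(N)]

def quantize(coeffs, step=4.0):
    # Small coefficients round to zero and drop out of the bit stream.
    return [round(c / step) for c in coeffs]

res_block = [5, 4, 4, 3]                 # a smooth residual block
print(quantize(dct(res_block)))          # → [4, 0, 0, 0]
```

For a smooth block the energy concentrates in the first (low-frequency) coefficient; quantization zeroes the rest, leaving very few significant values to store.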
entropy encoder: after the two stages above, what remains to be stored are the coefficients and the prediction parameters, and entropy coding compresses this final data further. (Entropy is tied to probability; put simply, entropy coding assigns the shortest bit codes to the symbols that occur most often, so the total storage shrinks — see any information theory and coding textbook.)
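The "short codes for frequent symbols" idea can be shown with a toy prefix code (a simple unary-style code, not the CAVLC/CABAC that H.264 actually uses):

```python
from collections import Counter

# Quantized coefficients are mostly zero after the spatial model.
symbols = [0, 0, 0, 0, 1, 0, 0, 2, 0, 1]

# Rank symbols by frequency; the most common one gets the shortest code.
freq = Counter(symbols)
codes = {s: '1' * i + '0' for i, (s, _) in enumerate(freq.most_common())}
# e.g. {0: '0', 1: '10', 2: '110'} — a prefix-free code.

bits  = ''.join(codes[s] for s in symbols)
fixed = len(symbols) * 2        # fixed-length baseline: 2 bits per symbol
print(len(bits), fixed)         # 14 bits vs 20 bits
```

Because the symbol distribution is heavily skewed toward zero, the variable-length code beats the fixed-length baseline; the more skewed the distribution, the bigger the saving.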
Decoding runs in reverse. The entropy decoder first recovers the prediction model parameters and the spatial model's coefficients; the spatial model uses the coefficients to rebuild the residual frame; meanwhile, the decoder uses the prediction parameters together with previously decoded frame data to form a prediction of the current frame (an intermediate product, just as on the encoder side), and adds the residual frame to it to obtain the current frame.
Both the encoder and the decoder generate a prediction from data already available, and the two predictions must be identical. The encoder stores the residual obtained by subtracting the prediction from the current frame; the decoder adds that residual to the prediction it generates itself, and the current frame is restored. To guarantee the two predictions match, the encoder must build its prediction only from source data the decoder will also have when decoding this frame, e.g. already-decoded data, and the decoder must also receive the same parameters the encoder used to generate the prediction.
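This encoder/decoder symmetry can be sketched end to end. A key consequence of the rule above is that the encoder should predict from the *reconstructed* previous frame, not the raw one, because the reconstructed version is all the decoder will ever have. A toy 1-D roundtrip, with invented names and a trivial "copy the previous frame" predictor:

```python
# Toy quantizer pair (assumed names; real H.264 quantization is more involved).
def quantize(res, step=2):
    return [round(r / step) for r in res]

def dequantize(q, step=2):
    return [v * step for v in q]

# Encoder side: predict from the RECONSTRUCTED previous frame.
prev_reconstructed = [8, 10, 12, 12]
current            = [9, 12, 15, 13]
prediction = prev_reconstructed                  # trivial predictor
res   = [c - p for c, p in zip(current, prediction)]
coded = quantize(res)                            # this is what gets transmitted

# Decoder side: forms the identical prediction, adds the decoded residual.
decoded = [p + r for p, r in zip(prediction, dequantize(coded))]
print(decoded)   # close to `current`; quantization introduces small errors
```

Because both sides start from the same reconstructed reference and the same parameters, their predictions agree exactly, and the only loss comes from quantizing the residual.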
The original text:
A video encoder (Figure 3.3) consists of three main functional units: a prediction model, a spatial model and an entropy encoder. The input to the prediction model is an uncompressed ‘raw’ video sequence. The prediction model attempts to reduce redundancy by exploiting the similarities between neighbouring video frames and/or neighbouring image samples, typically by constructing a prediction of the current video frame or block of video data. In H.264/AVC, the prediction is formed from data in the current frame or in one or more previous and/or future frames. It is created by spatial extrapolation from neighbouring image samples, intra prediction, or by compensating for differences between the frames, inter or motion compensated prediction. The output of the prediction model is a residual frame, created by subtracting the prediction from the actual current frame, and a set of model parameters indicating the intra prediction type or describing how the motion was compensated.
The residual frame forms the input to the spatial model which makes use of similarities between local samples in the residual frame to reduce spatial redundancy. In H.264/AVC this is carried out by applying a transform to the residual samples and quantizing the results. The transform converts the samples into another domain in which they are represented by transform coefficients. The coefficients are quantized to remove insignificant values, leaving a small number of significant coefficients that provide a more compact representation of the residual frame. The output of the spatial model is a set of quantized transform coefficients.
The parameters of the prediction model, i.e. intra prediction mode(s) or inter prediction mode(s) and motion vectors, and the spatial model, i.e. coefficients, are compressed by the entropy encoder. This removes statistical redundancy in the data, for example representing commonly occurring vectors and coefficients by short binary codes. The entropy encoder produces a compressed bit stream or file that may be transmitted and/or stored. A compressed sequence consists of coded prediction parameters, coded residual coefficients and header information.
The video decoder reconstructs a video frame from the compressed bit stream. The coefficients and prediction parameters are decoded by an entropy decoder after which the spatial model is decoded to reconstruct a version of the residual frame. The decoder uses the prediction parameters, together with previously decoded image pixels, to create a prediction of the current frame and the frame itself is reconstructed by adding the residual frame to this prediction.