目录
x265相关:
【x265】x265编码器参数配置
【x265】预测模块的简单分析—帧内预测
【x265】预测模块的简单分析—帧间预测
【x265】码率控制模块的简单分析—块级码控工具(AQ和cuTree)
【x265】码率控制模块的简单分析—帧级码控模式(CQP、CRF和ABR)
面向x265编码器,本文从帧内预测入手,结合编码结构做一些简单记录,并且只记录luma分量的实际编码流程。chroma分量和luma的编码流程很相似,chroma分量的编码过程中,会继承luma分量的一些编码信息作为参考。之所以选择帧内预测开始,是因为编码器的框架为块划分,预测,变换量化,熵编码,而块划分这一操作通常又是在预测模块当中实现的,所以实际上预测模块才是代码中最早执行的模块
1. h265标准帧内预测概述
1.1 帧内预测编码结构
在h265标准中,CU的尺寸最大为64x64,最小为8x8,以四叉树(QuadTree,QT)向下逐层划分,同时引入了预测单元(Prediction Unit,PU)这一概念,用于预测编码。PU由CU划分成为多个预测区域而来,对于帧内预测而言,PU的尺寸可以分为两类:
(1)2Nx2N
2Nx2N表示当前PU和该PU所属的CU具有相同的大小,并且以这个大小的区域进行预测编码
(2)NxN
NxN表示当前CU会划分成4大小相同的子PU,分别进行预测编码
// 2Nx2N
+----+----+
| |
+ +
| |
+----+----+
// NxN
+----+----+
| | |
+----+----+
| | |
+----+----+
1.2 帧内预测模式
1.2.1 角度模式(模式2~34)
在h265标准中,帧内预测共有35种模式,33种为角度预测模式,其余2种分别为DC和Planar模式;角度预测模式图如下所示
从图中可以看出,编号为2~17的模式为偏水平的模式,编号为18~34的模式为偏垂直的模式,并且模式26为垂直向下,模式10为水平向右。此外,从图中也能看出参考像素一共来自几部分,分别是:
(1)左上角参考像素
(2)上方参考像素
(3)右上方参考像素
(4)左侧参考像素
(5)左下参考像素
1.2.2 Planar模式(模式0)
Planar模式记为模式0,通过上方和左侧参考像素差值来获得当前预测块
1.2.3 DC模式(模式1)
DC模式用于图像比较平坦的区域,通过将上方和左侧参考像素求平均值获得当前的预测块
1.2 亮度模式的编码
相对比于H264标准,H265中使用的帧内预测模式达到了35种,为了提高实际编码速度,引入了最可能模式(Most Probable Mode,MPM)概念,通过考虑空间域信息,即相邻已编码块的信息来提高编码速度。这样做的依据是,相邻块之间往往具有类似的纹理特征,其编码模式很有可能相同或者接近。具体来说,mpm包含了3个候选模式,这3个候选模式来自于相邻参考块,分别来自于左侧和上方参考块,如下所示,其中c为当前待编码块,a和b为已编码块
+-----+-----+
| | b |
+-----+-----+
| a | c |
+-----+-----+
mpm的构建方式如下
(1)如果a和b具有相同模式
(i)a和b都为Planar或DC模式,则mpm = { Planar, DC, 26 }
(ii)a和b都为角度模式,则mpm = { ModeA, ModeA - 1, ModeA + 1};这里需要注意模式2与模式3和模式33相邻,模式34与模式33和模式3相邻
(2)如果a和b模式不同,则mpm = { modeA, modeB, X },其中X分几种情况决定
(i)modeA和modeB都不是Planar模式,则X=Planar
(ii)当(i)不满足时,如果modeA和modeB都不是DC模式,则X=DC
(iii)当前两条都不满足时,X=26(垂直模式)
基于mpm进行编码的流程如下
(1)如果当前PU的最佳模式modeC位于mpm中,则只需要编码modeC在mpm中的位置
(2)如果modeC不在mpm中
(i)将mpm中的候选模式从小到大重新排列
(ii)遍历mpm中的候选模式,分别与modeC进行比较,如果modeC >= mpm[i],则将modeC自减1,随后对modeC最终的值进行编码
1.3 色度模式编码
h265中的色度分量帧内预测一共有5种模式:Planar模式、垂直模式、水平模式、DC模式以及对应亮度分量的预测模式(因为是先进行的亮度编码)。如果对应亮度预测模式为前4种模式中的一种,则将数组中最后一个数据替换为角度预测中的模式34
具体实现如下:
(1)如果亮度预测模式modeLuma不是前4种中的一种,则直接对模式编号进行编码,此时色度模式参考队列为 modeChroma = { Planar, 26, 10, DC, modeLuma },其中Planar模式对应的是编码模式0号,角度模式26对应的是编码模式2号
(2)如果modeLuma是前4种模式中的一种,则分两种情况进行
(i)如果最优色度模式modeChroma与modeLuma相同,则modeChroma为模式4号
(ii)如果不满足前一条,则按照下列方式推断出色度模式编号
色度模式编号 | 亮度模式0(Planar) | 亮度模式26(垂直) | 亮度模式10(水平) | 亮度模式1(DC) |
---|---|---|---|---|
0 | 34 | 0 | 0 | 0 |
1 | 26 | 34 | 26 | 26 |
2 | 10 | 10 | 34 | 10 |
3 | 1 | 1 | 1 | 34 |
参考书上的解释,如果modeChroma为垂直模式(模式26),modeLuma为Planar模式(模式0),当前色度模式编号为1;如果modeChroma为模式34,modeLuma为Planar模式(模式0),当前色度模式编号为0
2. 帧内预测入口函数(compressIntraCU)
帧内预测入口函数compressIntraCU()位于encoder/analysis.cpp中,实现了对一个CU进行帧内预测的功能,主要的执行步骤为:
(1)检查当前CU是否可能继续向下划分
(2)检查当前CU是否有确定的intra模式和depth(前两步具有early termination的思想)
(3)帧内预测执行入口(checkIntra)
(4)检查最佳模式(checkBestMode)
(5)根据情况划分成为子块进行预测编码(递归调用compressIntraCU)
(6)存储最佳数据
其中,核心编码函数为checkIntra(),执行具体的帧内预测流程,随后使用checkBestMode()将预测的最佳模式进行存储
uint64_t Analysis::compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp)
{
uint32_t depth = cuGeom.depth;
ModeDepth& md = m_modeDepth[depth];
md.bestMode = NULL;
/*
1. 检查当前CU是否有可能继续划分
(1)如果cuGeom是叶节点(leaf),则不能继续划分
(2)如果cuGeom位于编码帧框架内(没有超出图像右侧边界,因为实际编码时可能会在图像右侧进行padding),
则可能继续划分
*/
bool mightSplit = !(cuGeom.flags & CUGeom::LEAF);
bool mightNotSplit = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY);
/*
2. 检查当前CU是否有确定的dir和depth
(1)如果CU已有对应的intra dir,并且分析类型不为HEVC_INFO,表示已确定了intra dir
(2)如果CU已有对应的depth,并且就是当前的depth,表示已确定了depth
PS: intraRefine默认为0
*/
bool bAlreadyDecided = m_param->intraRefine != 4 && parentCTU.m_lumaIntraDir[cuGeom.absPartIdx] != (uint8_t)ALL_IDX && !(m_param->bAnalysisType == HEVC_INFO);
bool bDecidedDepth = m_param->intraRefine != 4 && parentCTU.m_cuDepth[cuGeom.absPartIdx] == depth;
int split = 0;
if (m_param->intraRefine && m_param->intraRefine != 4)
{
split = m_param->scaleFactor && bDecidedDepth && (!mightNotSplit ||
((cuGeom.log2CUSize == (uint32_t)(g_log2Size[m_param->minCUSize] + 1))));
if (cuGeom.log2CUSize == (uint32_t)(g_log2Size[m_param->minCUSize]) && !bDecidedDepth)
bAlreadyDecided = false;
}
if (bAlreadyDecided)
{
if (bDecidedDepth && mightNotSplit)
{
Mode& mode = md.pred[0];
md.bestMode = &mode;
mode.cu.initSubCU(parentCTU, cuGeom, qp);
bool reuseModes = !((m_param->intraRefine == 3) ||
(m_param->intraRefine == 2 && parentCTU.m_lumaIntraDir[cuGeom.absPartIdx] > DC_IDX));
if (reuseModes)
{
memcpy(mode.cu.m_lumaIntraDir, parentCTU.m_lumaIntraDir + cuGeom.absPartIdx, cuGeom.numPartitions);
memcpy(mode.cu.m_chromaIntraDir, parentCTU.m_chromaIntraDir + cuGeom.absPartIdx, cuGeom.numPartitions);
}
checkIntra(mode, cuGeom, (PartSize)parentCTU.m_partSize[cuGeom.absPartIdx]);
// 尝试无损
if (m_bTryLossless)
tryLossless(cuGeom);
if (mightSplit)
addSplitFlagCost(*md.bestMode, cuGeom.depth);
}
}
else if (cuGeom.log2CUSize != MAX_LOG2_CU_SIZE && mightNotSplit)
{
md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
// 3. 帧内预测执行入口
checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N);
// 4. 检查最佳模式
checkBestMode(md.pred[PRED_INTRA], depth);
// 如果当前CU尺寸为8x8,PU可以拆分为4x4进行分析
if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3)
{
md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom, qp);
checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN); // 分析的partSize设置为SIZE_NxN
checkBestMode(md.pred[PRED_INTRA_NxN], depth);
}
if (m_bTryLossless)
tryLossless(cuGeom);
if (mightSplit)
addSplitFlagCost(*md.bestMode, cuGeom.depth);
}
// 检查当前的split depth是否已经达到之前决定的split depth(这里包含了early termination思想)
// stop recursion if we reach the depth of previous analysis decision
mightSplit &= !(bAlreadyDecided && bDecidedDepth) || split;
// 5. 根据情况划分成为子块进行预测编码
if (mightSplit)
{
Mode* splitPred = &md.pred[PRED_SPLIT];
splitPred->initCosts();
CUData* splitCU = &splitPred->cu;
// 初始化子块
splitCU->initSubCU(parentCTU, cuGeom, qp);
uint32_t nextDepth = depth + 1; // 子块depth + 1
ModeDepth& nd = m_modeDepth[nextDepth];
invalidateContexts(nextDepth); // 将nextDepth的上下文设置为invalid
Entropy* nextContext = &m_rqt[depth].cur;
int32_t nextQP = qp;
uint64_t curCost = 0;
int skipSplitCheck = 0;
// 分成4个子块进行编码
for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++)
{
// 获取child的CU信息
const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + subPartIdx);
if (childGeom.flags & CUGeom::PRESENT)
{
// 将sub-CU的yuv信息拷贝到nd.fencYuv,用于后续分析
m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx);
m_rqt[nextDepth].cur.load(*nextContext);
// 检查是否需要为sub-CU调整QP
if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth)
nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom));
if (m_param->bEnableSplitRdSkip)
{
curCost += compressIntraCU(parentCTU, childGeom, nextQP);
if (m_modeDepth[depth].bestMode && curCost > m_modeDepth[depth].bestMode->rdCost)
{
skipSplitCheck = 1;
break;
}
}
else // 开始执行sub-CU的分析
compressIntraCU(parentCTU, childGeom, nextQP);
// Save best CU and pred data for this sub CU
splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx);
splitPred->addSubCosts(*nd.bestMode);
nd.bestMode->reconYuv.copyToPartYuv(splitPred->reconYuv, childGeom.numPartitions * subPartIdx);
nextContext = &nd.bestMode->contexts;
}
else
{
/* record the depth of this non-present sub-CU */
splitCU->setEmptyPart(childGeom, subPartIdx);
/* Set depth of non-present CU to 0 to ensure that correct CU is fetched as reference to code deltaQP */
if (bAlreadyDecided)
memset(parentCTU.m_cuDepth + childGeom.absPartIdx, 0, childGeom.numPartitions);
}
}
if (!skipSplitCheck)
{
nextContext->store(splitPred->contexts);
if (mightNotSplit)
addSplitFlagCost(*splitPred, cuGeom.depth);
else
updateModeCost(*splitPred);
checkDQPForSplitPred(*splitPred, cuGeom);
checkBestMode(*splitPred, depth);
}
}
if (m_param->bEnableRdRefine && depth <= m_slice->m_pps->maxCuDQPDepth)
{
int cuIdx = (cuGeom.childOffset - 1) / 3;
cacheCost[cuIdx] = md.bestMode->rdCost;
}
if ((m_limitTU & X265_TU_LIMIT_NEIGH) && cuGeom.log2CUSize >= 4)
{
CUData* ctu = md.bestMode->cu.m_encData->getPicCTU(parentCTU.m_cuAddr);
int8_t maxTUDepth = -1;
for (uint32_t i = 0; i < cuGeom.numPartitions; i++)
maxTUDepth = X265_MAX(maxTUDepth, md.bestMode->cu.m_tuDepth[i]);
ctu->m_refTuDepth[cuGeom.geomRecurId] = maxTUDepth;
}
// 6. 存储最佳数据
/* Copy best data to encData CTU and recon */
md.bestMode->cu.copyToPic(depth);
if (md.bestMode != &md.pred[PRED_SPLIT])
md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, parentCTU.m_cuAddr, cuGeom.absPartIdx);
return md.bestMode->rdCost;
}
2.1 帧内预测入口函数(checkIntra)
帧内预测入口函数checkIntra()的定义位于encoder/search.cpp中,完成了对一个CU进行帧内预测模式分析的功能,具体实现步骤为:
(1)一些信息的初始化(partSize、intra mode和costs)
(2)计算luma分量的帧内预测损失(estIntraPredQT)
(3)计算chroma分量的帧内预测损失(estIntraPredChromaQT)
(4)对一些信息进行编码(.codeXXX)
(5)计算psy的开销(没有研究)
(6)检查DQP
其中,estIntraPredQT()用于计算luma分量的帧内预测信息
void Search::checkIntra(Mode& intraMode, const CUGeom& cuGeom, PartSize partSize)
{
CUData& cu = intraMode.cu;
// 1. 一些信息的初始化
cu.setPartSizeSubParts(partSize);
cu.setPredModeSubParts(MODE_INTRA);
uint32_t tuDepthRange[2];
cu.getIntraTUQtDepthRange(tuDepthRange, 0);
intraMode.initCosts();
// 2. 计算luma分量的帧内预测损失
intraMode.lumaDistortion += estIntraPredQT(intraMode, cuGeom, tuDepthRange);
if (m_csp != X265_CSP_I400) // 如果不是400格式,即还有色度分量,计算度分量的帧内预测损失
{
// 3. 计算chroma分量的帧内预测损失
intraMode.chromaDistortion += estIntraPredChromaQT(intraMode, cuGeom);
// 将luma和chroma分量的损失相加,得到总体的distortion
intraMode.distortion += intraMode.lumaDistortion + intraMode.chromaDistortion;
}
else
intraMode.distortion += intraMode.lumaDistortion;
// 4. 对一些信息进行编码
// 索引号为0表示当前树结构的最顶层CU损失
cu.m_distortion[0] = intraMode.distortion;
m_entropyCoder.resetBits();
// pps->bTransquantBypassEnabled = m_param->bCULossless || m_param->bLossless;
// bTransquantBypassEnabled由Lossless(无损)参数决定是否启用,默认不启用
if (m_slice->m_pps->bTransquantBypassEnabled)
m_entropyCoder.codeCUTransquantBypassFlag(cu.m_tqBypass[0]);
int skipFlagBits = 0;
if (!m_slice->isIntra()) // 当前slice不是intra,因为有的非slice块当中可能有些CU进行intra编码
{
m_entropyCoder.codeSkipFlag(cu, 0); // 编码skip flag
skipFlagBits = m_entropyCoder.getNumberOfWrittenBits(); // 获取skip flag对应的bits
m_entropyCoder.codePredMode(cu.m_predMode[0]); // 编码预测模式
}
// 编码划分size
m_entropyCoder.codePartSize(cu, 0, cuGeom.depth);
// 编码预测信息
m_entropyCoder.codePredInfo(cu, 0);
intraMode.mvBits = m_entropyCoder.getNumberOfWrittenBits() - skipFlagBits;
bool bCodeDQP = m_slice->m_pps->bUseDQP;
// 编码残差系数
m_entropyCoder.codeCoeff(cu, 0, bCodeDQP, tuDepthRange);
m_entropyCoder.store(intraMode.contexts);
// 计算总比特
intraMode.totalBits = m_entropyCoder.getNumberOfWrittenBits();
// 计算残差系数比特
intraMode.coeffBits = intraMode.totalBits - intraMode.mvBits - skipFlagBits;
const Yuv* fencYuv = intraMode.fencYuv;
// 4. 计算psy的开销
// 基于心理视觉的Rd优化
if (m_rdCost.m_psyRd)
intraMode.psyEnergy = m_rdCost.psyCost(cuGeom.log2CUSize - 2, fencYuv->m_buf[0], fencYuv->m_size, intraMode.reconYuv.m_buf[0], intraMode.reconYuv.m_size);
else if(m_rdCost.m_ssimRd) // 基于心理视觉的ssimRd优化
intraMode.ssimEnergy = m_quant.ssimDistortion(cu, fencYuv->m_buf[0], fencYuv->m_size, intraMode.reconYuv.m_buf[0], intraMode.reconYuv.m_size, cuGeom.log2CUSize, TEXT_LUMA, 0);
// 计算SSE,获得运动预测后的残差能量之和
intraMode.resEnergy = primitives.cu[cuGeom.log2CUSize - 2].sse_pp(intraMode.fencYuv->m_buf[0], intraMode.fencYuv->m_size, intraMode.predYuv.m_buf[0], intraMode.predYuv.m_size);
// 更新RdCost
updateModeCost(intraMode);
// 5. 检查DQP
checkDQP(intraMode, cuGeom);
}
2.1.1 计算亮度分量的帧内预测损失(estIntraPredQT)
帧内预测损失计算函数estIntraPredQT()的定义位于encoder/search.cpp中,实现了帧内预测模式的开销计算,其中包括每个子CU的预测,其主要的步骤为:
(1)初始化intra的相邻CU(initIntraNeighbors)
(2)填充相邻像素并平滑滤波(initAdiPattern)
(3)获取mpm
(4)选择最佳的帧内预测模式
(a)先进行基于SAD的模式粗选,用于筛选一些可能为最佳的模式(primitives.cu[sizeIdx].intra_pred)
(b)随后再进行精确的模式选择(基于SSE),确定最佳模式(codeIntraLumaQT)
sse_t Search::estIntraPredQT(Mode &intraMode, const CUGeom& cuGeom, const uint32_t depthRange[2])
{
CUData& cu = intraMode.cu;
Yuv* reconYuv = &intraMode.reconYuv;
Yuv* predYuv = &intraMode.predYuv;
const Yuv* fencYuv = intraMode.fencYuv;
uint32_t depth = cuGeom.depth;
// 如果partSize为SIZE_2Nx2N,即PU尺寸与当前CU尺寸相同,所以PU数量numPU为1
uint32_t initTuDepth = cu.m_partSize[0] != SIZE_2Nx2N;
uint32_t numPU = 1 << (2 * initTuDepth);
uint32_t log2TrSize = cuGeom.log2CUSize - initTuDepth;
uint32_t tuSize = 1 << log2TrSize;
uint32_t qNumParts = cuGeom.numPartitions >> 2;
uint32_t sizeIdx = log2TrSize - 2;
uint32_t absPartIdx = 0;
sse_t totalDistortion = 0;
// 是否跳过变换过程,只有无损模式下为true
int checkTransformSkip = m_slice->m_pps->bTransformSkipEnabled && !cu.m_tqBypass[0] && cu.m_partSize[0] != SIZE_2Nx2N;
// 开始进行PU级别的预测划分
// loop over partitions
for (uint32_t puIdx = 0; puIdx < numPU; puIdx++, absPartIdx += qNumParts)
{
uint32_t bmode = 0;
// 检查是否已经指定了帧内预测模式
if (intraMode.cu.m_lumaIntraDir[puIdx] != (uint8_t)ALL_IDX)
bmode = intraMode.cu.m_lumaIntraDir[puIdx];
else
{ // 如果没有指定intra dir,则检查最佳的帧内预测模式
uint64_t candCostList[MAX_RD_INTRA_MODES];
uint32_t rdModeList[MAX_RD_INTRA_MODES];
uint64_t bcost;
// rdLevel表示使用什么水平的RDO,默认为3
int maxCandCount = 2 + m_param->rdLevel + ((depth + initTuDepth) >> 1);
{
ProfileCUScope(intraMode.cu, intraAnalysisElapsedTime, countIntraAnalysis);
// Reference sample smoothing
IntraNeighbors intraNeighbors;
// 1. 初始化intra的相邻CU
initIntraNeighbors(cu, absPartIdx, initTuDepth, true, &intraNeighbors);
// 2. 填充相邻像素并平滑滤波
initAdiPattern(cu, cuGeom, absPartIdx, intraNeighbors, ALL_IDX);
// determine set of modes to be tested (using prediction signal only)
// 获取fenc和stride
const pixel* fenc = fencYuv->getLumaAddr(absPartIdx);
uint32_t stride = predYuv->m_size;
int scaleTuSize = tuSize;
int scaleStride = stride;
int costShift = 0;
// 从先前编码的信息中加载亮度分量(Luma)的帧内预测方向(Intra direction mode)
m_entropyCoder.loadIntraDirModeLuma(m_rqt[depth].cur);
/* there are three cost tiers for intra modes:
* pred[0] - mode probable, least cost
* pred[1], pred[2] - less probable, slightly more cost
* non-mpm modes - all cost the same (rbits) */
/*
3. 获取mpm(most probable mode)
(1)mpms:按照每一位来标记该模式是否位于mpm中
例如 67,108,867,即 0100 0000 0000 0000 0000 0000 0011
其中,低2位分别表示DC(第1位)和Planar(第0位),最高的 0100 表示第26位,即角度编号为26的模式位于mpm中
(2)mpmModes[3]:存储具体的mpm的模式
*/
uint64_t mpms;
uint32_t mpmModes[3];
// 获得mpms,如果没有得到mpm,会返回bits开销
uint32_t rbits = getIntraRemModeBits(cu, absPartIdx, mpmModes, mpms);
// 获得计算sad的函数指针
pixelcmp_t sa8d = primitives.cu[sizeIdx].sa8d;
uint64_t modeCosts[35];
/*
4. 选择最佳的帧内预测模式
(a)先进行基于SAD的模式粗选,用于筛选一些可能为最佳的模式
(b)随后再进行精确的模式选择(基于SSE),确定最佳模式
*/
// DC
// 进行DC模式的预测
primitives.cu[sizeIdx].intra_pred[DC_IDX](m_intraPred, scaleStride, intraNeighbourBuf[0], 0, (scaleTuSize <= 16));
// 如果mpm中有DC模式,则获取DC模式对应的比特开销
uint32_t bits = (mpms & ((uint64_t)1 << DC_IDX)) ? m_entropyCoder.bitsIntraModeMPM(mpmModes, DC_IDX) : rbits;
uint32_t sad = sa8d(fenc, scaleStride, m_intraPred, scaleStride) << costShift;
// 计算SAD Cost
modeCosts[DC_IDX] = bcost = m_rdCost.calcRdSADCost(sad, bits);
// PLANAR
// intraNeighbourBuf[0] 表示未经过平滑的像素
pixel* planar = intraNeighbourBuf[0];
if (tuSize >= 8 && tuSize <= 32)
planar = intraNeighbourBuf[1]; // intraNeighbourBuf[1] 表示经过平滑的像素
// 进行Planar模式的预测
primitives.cu[sizeIdx].intra_pred[PLANAR_IDX](m_intraPred, scaleStride, planar, 0, 0);
bits = (mpms & ((uint64_t)1 << PLANAR_IDX)) ? m_entropyCoder.bitsIntraModeMPM(mpmModes, PLANAR_IDX) : rbits;
sad = sa8d(fenc, scaleStride, m_intraPred, scaleStride) << costShift;
modeCosts[PLANAR_IDX] = m_rdCost.calcRdSADCost(sad, bits);
COPY1_IF_LT(bcost, modeCosts[PLANAR_IDX]);
// angular predictions
/*
角度模式的预测,asm函数的初始化位于asm-primitives.cpp中,默认启用asm
初始化函数为ALL_LUMA_TU(intra_pred_allangs, all_angs_pred, sse4);
*/
if (primitives.cu[sizeIdx].intra_pred_allangs)
{
/*
2-17的角度和19-33的角度是互相转置的,先将fenc转置
初始化函数为ALL_LUMA_CU_S(transpose, transpose, avx2);
*/
primitives.cu[sizeIdx].transpose(m_fencTransposed, fenc, scaleStride);
// 角度模式的预测
primitives.cu[sizeIdx].intra_pred_allangs(m_intraPredAngs, intraNeighbourBuf[0], intraNeighbourBuf[1], (scaleTuSize <= 16));
for (int mode = 2; mode < 35; mode++)
{
bits = (mpms & ((uint64_t)1 << mode)) ? m_entropyCoder.bitsIntraModeMPM(mpmModes, mode) : rbits;
// mode < 18,预测方向偏水平
if (mode < 18)
sad = sa8d(m_fencTransposed, scaleTuSize, &m_intraPredAngs[(mode - 2) * (scaleTuSize * scaleTuSize)], scaleTuSize) << costShift;
else // mode > 18,预测方向偏垂直
sad = sa8d(fenc, scaleStride, &m_intraPredAngs[(mode - 2) * (scaleTuSize * scaleTuSize)], scaleTuSize) << costShift;
// 计算Rdcost
modeCosts[mode] = m_rdCost.calcRdSADCost(sad, bits);
COPY1_IF_LT(bcost, modeCosts[mode]);
}
}
else
{ // 单独进行每种模式的预测
for (int mode = 2; mode < 35; mode++)
{
bits = (mpms & ((uint64_t)1 << mode)) ? m_entropyCoder.bitsIntraModeMPM(mpmModes, mode) : rbits;
int filter = !!(g_intraFilterFlags[mode] & scaleTuSize);
primitives.cu[sizeIdx].intra_pred[mode](m_intraPred, scaleTuSize, intraNeighbourBuf[filter], mode, scaleTuSize <= 16);
sad = sa8d(fenc, scaleStride, m_intraPred, scaleTuSize) << costShift;
modeCosts[mode] = m_rdCost.calcRdSADCost(sad, bits);
COPY1_IF_LT(bcost, modeCosts[mode]);
}
}
/* Find the top maxCandCount candidate modes with cost within 25% of best
* or among the most probable modes. maxCandCount is derived from the
* rdLevel and depth. In general we want to try more modes at slower RD
* levels and at higher depths */
/*
下面执行筛选出后续用于精细模式选择的模式队列(candCostLists)
(1)从前面已经计算的所有模式当中选取出数量为maxCandCount的模式
(2)选出的模式对应的开销必须在最佳模式的25%以内
(3)或者选出的模式必须在mpm中
(4)maxCandCount由rdLevel和depth决定
(5)如果rdLevel比较高或depth比较高,可能会去尝试更多的模式
e.g. CUSize=32, rdLevel=3, maxCandCount=5
*/
for (int i = 0; i < maxCandCount; i++)
candCostList[i] = MAX_INT64;
uint64_t paddedBcost = bcost + (bcost >> 2); // 1.25%
for (int mode = 0; mode < 35; mode++)
if ((modeCosts[mode] < paddedBcost) || ((uint32_t)mode == mpmModes[0]))
/* choose for R-D analysis only if this mode passes cost threshold or matches MPM[0] */
updateCandList(mode, modeCosts[mode], maxCandCount, rdModeList, candCostList);
}
/* measure best candidates using simple RDO (no TU splits) */
// 进行RDO来选择最佳的候选模式
bcost = MAX_INT64;
for (int i = 0; i < maxCandCount; i++)
{
if (candCostList[i] == MAX_INT64)
break;
ProfileCUScope(intraMode.cu, intraRDOElapsedTime[cuGeom.depth], countIntraRDO[cuGeom.depth]);
m_entropyCoder.load(m_rqt[depth].cur);
cu.setLumaIntraDirSubParts(rdModeList[i], absPartIdx, depth + initTuDepth);
Cost icosts;
// 检查是否跳过transform
if (checkTransformSkip)
codeIntraLumaTSkip(intraMode, cuGeom, initTuDepth, absPartIdx, icosts);
else // 不跳过,进行基于SSE的模式选择
codeIntraLumaQT(intraMode, cuGeom, initTuDepth, absPartIdx, false, icosts, depthRange);
COPY2_IF_LT(bcost, icosts.rdcost, bmode, rdModeList[i]);
}
}
ProfileCUScope(intraMode.cu, intraRDOElapsedTime[cuGeom.depth], countIntraRDO[cuGeom.depth]);
/* remeasure best mode, allowing TU splits */
// 重新评估最佳模式,这时允许TU向下划分
// 将最佳模式设置到各个子块中
cu.setLumaIntraDirSubParts(bmode, absPartIdx, depth + initTuDepth);
m_entropyCoder.load(m_rqt[depth].cur);
Cost icosts;
if (checkTransformSkip)
codeIntraLumaTSkip(intraMode, cuGeom, initTuDepth, absPartIdx, icosts);
else // TU split enable设置为true
codeIntraLumaQT(intraMode, cuGeom, initTuDepth, absPartIdx, true, icosts, depthRange);
totalDistortion += icosts.distortion;
// 存储系数和重建帧
extractIntraResultQT(cu, *reconYuv, initTuDepth, absPartIdx);
// 存储重建帧,用于后续block进行帧内预测
// set reconstruction for next intra prediction blocks
if (puIdx != numPU - 1)
{
/* This has important implications for parallelism and RDO. It is writing intermediate results into the
* output recon picture, so it cannot proceed in parallel with anything else when doing INTRA_NXN. Also
* it is not updating m_rdContexts[depth].cur for the later PUs which I suspect is slightly wrong. I think
* that the contexts should be tracked through each PU */
PicYuv* reconPic = m_frame->m_reconPic;
pixel* dst = reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx);
uint32_t dststride = reconPic->m_stride;
const pixel* src = reconYuv->getLumaAddr(absPartIdx);
uint32_t srcstride = reconYuv->m_size;
primitives.cu[log2TrSize - 2].copy_pp(dst, dststride, src, srcstride);
}
}
if (numPU > 1)
{
uint32_t combCbfY = 0;
for (uint32_t qIdx = 0, qPartIdx = 0; qIdx < 4; ++qIdx, qPartIdx += qNumParts)
combCbfY |= cu.getCbf(qPartIdx, TEXT_LUMA, 1);
cu.m_cbf[0][0] |= combCbfY;
}
// TODO: remove this
m_entropyCoder.load(m_rqt[depth].cur);
return totalDistortion;
}
2.1.1.1 初始化相邻块(initNeighbors)
初始化当前CU的相邻块时,由于涉及到了其他块,这里需要考虑CU的寻址,在x265编码过程中使用了树结构,所以使用了Zig-Zag这种扫描方式。为了良好的实现寻址,x265定义了两个寻址表,分别是g_zcsanToRaster和g_rasterToZscan
(1)g_zscanToRaster:将zig-zag顺序转换成raster顺序
(2)g_rasterToZscan:将raster顺序转换成zig-zag顺序
其中,zig-zag顺序利于CU的寻址,而raster顺序利于pixel的寻址,在寻找相邻块时有至关重要的作用,这两个寻址表用int类型打印出来的结果如下。
下面举例说明这个寻址表是如何实现寻址的:
下面为8个4x4的PU块,其中a0,a1,a2,a3属于一个CU,a块(b块同理)
// block
+----+----+----+----+
| a0 | a1 | b0 | b1 |
+----+----+----+----+
| a2 | a3 | b2 | b3 |
+----+----+----+----+
// zscan number
+---+---+---+---+
| 0 | 1 | 4 | 5 |
+---+---+---+---+
| 2 | 3 | 6 | 7 |
+---+---+---+---+
假设现在正在编码b2块,首先需要先将b2块的zscan编号转换成raster编号(在代码链路上传输的编号通常是zscan,因为更利于树结构的CU寻址)。按照zscan顺序,b2对应的zscan编号应该为6,查找g_zscanToRaster当中对应的raster编号为18,这表明b2这个块按照raster扫描是第19个块。
在一个CU大小为64x64的块中,具有256个最小单元(因为最小单元尺寸为4x4),每一行有16个,b2就对应第2行的第3个块,raster扫描第1行是16个块,所以b2块会在第19个块被扫描到,这样就完成了寻址表的对应
初始化相邻块的代码如下
void Predict::initIntraNeighbors(const CUData& cu, uint32_t absPartIdx, uint32_t tuDepth, bool isLuma, IntraNeighbors *intraNeighbors)
{
uint32_t log2TrSize = cu.m_log2CUSize[0] - tuDepth;
int log2UnitWidth = LOG2_UNIT_SIZE;
int log2UnitHeight = LOG2_UNIT_SIZE;
// chroma分量
if (!isLuma)
{
log2TrSize -= cu.m_hChromaShift;
log2UnitWidth -= cu.m_hChromaShift;
log2UnitHeight -= cu.m_vChromaShift;
}
int numIntraNeighbor; // 可用neighbor数量
// bNeighborFlags表示相邻块是否可用
bool* bNeighborFlags = intraNeighbors->bNeighborFlags;
uint32_t tuSize = 1 << log2TrSize;
int tuWidthInUnits = tuSize >> log2UnitWidth;
int tuHeightInUnits = tuSize >> log2UnitHeight;
/*
aboveUnits = 2 x tuWidthInUnits
leftUnits = 2 x tuHeightInUnits
上方包括了正上方和右上方;左侧包括了左侧和左下角
(1)正上方 + 右上方 = 8 + 8 = 16
(2)左侧 + 左下角 = 8 + 8 = 16
*/
int aboveUnits = tuWidthInUnits << 1;
int leftUnits = tuHeightInUnits << 1;
/*
g_rasterToZscan表示将光栅顺序转换成为zig-zag顺序
g_zscanToRaster表示将zig-zag顺序转换成为光栅顺序
*/
// left top
uint32_t partIdxLT = cu.m_absIdxInCTU + absPartIdx;
// right top
// 先将zig-zag转换成raster,再将raster转换成zig-zag
uint32_t partIdxRT = g_rasterToZscan[g_zscanToRaster[partIdxLT] + tuWidthInUnits - 1];
// left bottom
uint32_t partIdxLB = g_rasterToZscan[g_zscanToRaster[partIdxLT] + ((tuHeightInUnits - 1) << LOG2_RASTER_SIZE)];
/*
bConstrainedIntraPred表示对intra pred进行的限制
(1)bConstrainedIntraPred = 1,表示只有相邻块也使用intra prediction时,才会被用于参考
(2)bConstrainedIntraPred = 0,不添加限制
如果bConstrainedIntraPred = 1,对应下面的函数中使用isAboveAvailable<false>,否则对应isAboveAvailable<true>
*/
if (cu.m_slice->isIntra() || !cu.m_slice->m_pps->bConstrainedIntraPred)
{
// 检查左上角块是否可用
bNeighborFlags[leftUnits] = isAboveLeftAvailable<false>(cu, partIdxLT);
numIntraNeighbor = (int)(bNeighborFlags[leftUnits]);
/*
检查各方向角度是否可用
(1)上方(isAboveAvailable)
(2)右上(isAboveRightAvailable)
(3)左侧(isLeftAvailable)
(4)左下(isBelowLeftAvailable)
bNeighborFlags标记对应的相邻块(4x4尺寸)是否可用,长度为65
假如当前PU为32x32,存储的顺序为
(1)左下:8(8个4x4子块)
(2)左侧:8
(3)左上:1
(4)上方:8
(5)右上:8
在检查上方是否可用时,bNeighborFlags的偏移量为
leftUnits(16) + 1 = 左下块数量(8) + 左侧块数量(8) + 左上角块(1) = 17
*/
numIntraNeighbor += isAboveAvailable<false>(cu, partIdxLT, partIdxRT, bNeighborFlags + leftUnits + 1);
numIntraNeighbor += isAboveRightAvailable<false>(cu, partIdxRT, bNeighborFlags + leftUnits + 1 + tuWidthInUnits, tuWidthInUnits);
numIntraNeighbor += isLeftAvailable<false>(cu, partIdxLT, partIdxLB, bNeighborFlags + leftUnits - 1);
numIntraNeighbor += isBelowLeftAvailable<false>(cu, partIdxLB, bNeighborFlags + tuHeightInUnits - 1, tuHeightInUnits);
}
else
{ // 相邻块必须为intra模式才能够用于参考
bNeighborFlags[leftUnits] = isAboveLeftAvailable<true>(cu, partIdxLT);
numIntraNeighbor = (int)(bNeighborFlags[leftUnits]);
numIntraNeighbor += isAboveAvailable<true>(cu, partIdxLT, partIdxRT, bNeighborFlags + leftUnits + 1);
numIntraNeighbor += isAboveRightAvailable<true>(cu, partIdxRT, bNeighborFlags + leftUnits + 1 + tuWidthInUnits, tuWidthInUnits);
numIntraNeighbor += isLeftAvailable<true>(cu, partIdxLT, partIdxLB, bNeighborFlags + leftUnits - 1);
numIntraNeighbor += isBelowLeftAvailable<true>(cu, partIdxLB, bNeighborFlags + tuHeightInUnits - 1, tuHeightInUnits);
}
intraNeighbors->numIntraNeighbor = numIntraNeighbor; // 可用相邻块总量
intraNeighbors->totalUnits = aboveUnits + leftUnits + 1; // 总共unit数量(unit一定存在,但这个unit不一定可用)
intraNeighbors->aboveUnits = aboveUnits; // 上方unit数量
intraNeighbors->leftUnits = leftUnits; // 左侧unit数量
intraNeighbors->unitWidth = 1 << log2UnitWidth; // unit宽度
intraNeighbors->unitHeight = 1 << log2UnitHeight; // unit高度
intraNeighbors->log2TrSize = log2TrSize; // 以log2为底的TU Size
}
2.1.1.1.1 检查上方块是否可用(isAboveAvailable)
template<bool cip>
int Predict::isAboveAvailable(const CUData& cu, uint32_t partIdxLT, uint32_t partIdxRT, bool* bValidFlags)
{
// 先将zig-zag顺序换成raster顺序
const uint32_t rasterPartBegin = g_zscanToRaster[partIdxLT];
const uint32_t rasterPartEnd = g_zscanToRaster[partIdxRT];
const uint32_t idxStep = 1;
int numIntra = 0;
// 依次检查上方各个块是否可用
for (uint32_t rasterPart = rasterPartBegin; rasterPart <= rasterPartEnd; rasterPart += idxStep, bValidFlags++)
{
uint32_t partAbove;
const CUData* cuAbove = cu.getPUAbove(partAbove, g_rasterToZscan[rasterPart]);
// 找到可用块则将该位置标记为true,否则false
if (cuAbove && (!cip || cuAbove->isIntra(partAbove)))
{
numIntra++;
*bValidFlags = true;
}
else
*bValidFlags = false;
}
// 上方可用块的数量
return numIntra;
}
getPUAbove()函数的实现方式为
const CUData* CUData::getPUAbove(uint32_t& aPartUnitIdx, uint32_t curPartUnitIdx) const
{
uint32_t absPartIdx = g_zscanToRaster[curPartUnitIdx];
// 检查PU是否位于第1行,如果是第1行,上方不会存在可用块
if (!isZeroRow(absPartIdx))
{
uint32_t absZorderCUIdx = g_zscanToRaster[m_absIdxInCTU];
aPartUnitIdx = g_rasterToZscan[absPartIdx - RASTER_SIZE];
if (isEqualRow(absPartIdx, absZorderCUIdx))
return m_encData->getPicCTU(m_cuAddr);
else
aPartUnitIdx -= m_absIdxInCTU;
return this;
}
aPartUnitIdx = g_rasterToZscan[absPartIdx + ((s_numPartInCUSize - 1) << LOG2_RASTER_SIZE)];
return m_cuAbove;
}
2.1.1.1.2 检查右上方块是否可用(isAboveRightAvailable)
template<bool cip>
int Predict::isAboveRightAvailable(const CUData& cu, uint32_t partIdxRT, bool* bValidFlags, uint32_t numUnits)
{
int numIntra = 0;
// 依次检查右上方的块是否可用
for (uint32_t offset = 1; offset <= numUnits; offset++, bValidFlags++)
{
uint32_t partAboveRight;
const CUData* cuAboveRight = cu.getPUAboveRightAdi(partAboveRight, partIdxRT, offset);
// 如果检查出来块可用,则标记为true,否则false
if (cuAboveRight && (!cip || cuAboveRight->isIntra(partAboveRight)))
{
numIntra++;
*bValidFlags = true;
}
else
*bValidFlags = false;
}
return numIntra;
}
getPUAboveRightAdi()实现方式为
const CUData* CUData::getPUAboveRightAdi(uint32_t& arPartUnitIdx, uint32_t curPartUnitIdx, uint32_t partUnitOffset) const
{
// 检查右上角所对应的位置是否可能超出整张图片的边界(例如1280、1920)
if ((m_encData->getPicCTU(m_cuAddr)->m_cuPelX + g_zscanToPelX[curPartUnitIdx] + (partUnitOffset << LOG2_UNIT_SIZE)) >= m_slice->m_sps->picWidthInLumaSamples)
return NULL;
uint32_t absPartIdxRT = g_zscanToRaster[curPartUnitIdx];
/*
检查absPartIdxRT所在一列是否小于s_numPartInCUSize - partUnitOffset对应一列
(1)如果返回true,说明右上角PU可能存在
(2)如果返回false,说明右上角PU不存在
对于取值
(1)如果CU最大尺寸为64x64,则s_numPartInCUSize恒定为16,表示了每行(或每列)最小单元的数量
(2)partUnitOffset表示要搜索的右上角的第几个最小单元(从1开始)
*/
if (lessThanCol(absPartIdxRT, s_numPartInCUSize - partUnitOffset))
{
// 如果是第0行,右上没有可用的块
if (!isZeroRow(absPartIdxRT))
{
if (curPartUnitIdx > g_rasterToZscan[absPartIdxRT - RASTER_SIZE + partUnitOffset])
{
// 计算当前PU右上角PU位置(以raster顺序呈现)
uint32_t absZorderCUIdx = g_zscanToRaster[m_absIdxInCTU] + (1 << (m_log2CUSize[0] - LOG2_UNIT_SIZE)) - 1;
// 计算当前PU右上角PU位置(以zscan顺序呈现)
arPartUnitIdx = g_rasterToZscan[absPartIdxRT - RASTER_SIZE + partUnitOffset];
/*
检查当前PU和右上角PU的相对位置(什么情况下会出现?)
(1)如果当前PU和右上角PU位于同一行或同一列,则返回CTU
(2)如果当前PU和右上角PU不是位于同一行,也不是位于同一列,减去m_absIdxInCTU,
计算的idx表示右上角PU位于当前CU的相对位置(例如8x8 CU的右上角PU的idx为1)
*/
if (isEqualRowOrCol(absPartIdxRT, absZorderCUIdx))
return m_encData->getPicCTU(m_cuAddr);
else
{
arPartUnitIdx -= m_absIdxInCTU;
return this;
}
}
return NULL;
}
arPartUnitIdx = g_rasterToZscan[absPartIdxRT + ((s_numPartInCUSize - 1) << LOG2_RASTER_SIZE) + partUnitOffset];
return m_cuAbove;
}
if (!isZeroRow(absPartIdxRT))
return NULL;
arPartUnitIdx = g_rasterToZscan[((s_numPartInCUSize - 1) << LOG2_RASTER_SIZE) + partUnitOffset - 1];
return m_cuAboveRight;
}
2.1.1.1.3 检查左侧块是否可用(isLeftAvailable)
template<bool cip>
int Predict::isLeftAvailable(const CUData& cu, uint32_t partIdxLT, uint32_t partIdxLB, bool* bValidFlags)
{
const uint32_t rasterPartBegin = g_zscanToRaster[partIdxLT];
const uint32_t rasterPartEnd = g_zscanToRaster[partIdxLB];
const uint32_t idxStep = RASTER_SIZE; // idxStep = 16
int numIntra = 0;
// 依次检查左侧块是否可用,这里flags的索引号是从左上角开始向左下角移动的
for (uint32_t rasterPart = rasterPartBegin; rasterPart <= rasterPartEnd; rasterPart += idxStep, bValidFlags--) // opposite direction
{
uint32_t partLeft;
const CUData* cuLeft = cu.getPULeft(partLeft, g_rasterToZscan[rasterPart]);
// 找到左侧可用块,则置为true,否则false
if (cuLeft && (!cip || cuLeft->isIntra(partLeft)))
{
numIntra++;
*bValidFlags = true;
}
else
*bValidFlags = false;
}
return numIntra;
}
getPULeft()的实现方式为
const CUData* CUData::getPULeft(uint32_t& lPartUnitIdx, uint32_t curPartUnitIdx) const
{
uint32_t absPartIdx = g_zscanToRaster[curPartUnitIdx];
// 检查当前PU是否位于当前CU的第一列
if (!isZeroCol(absPartIdx))
{
// 获取当前CU的位置(或者说当前CU的第一个PU地址)
uint32_t absZorderCUIdx = g_zscanToRaster[m_absIdxInCTU];
// 获取当前PU左侧的块的idx,并转换成zscan顺序
lPartUnitIdx = g_rasterToZscan[absPartIdx - 1];
/*
检查当前PU是否位于CU的第一列
(1)如果是同一列,返回CTU
(2)如果不是同一列,减去m_absIdxInCTU,计算的结果是当前CU中PU的位置(而不是整个CTU的位置),
例如lPartUnitIdx为3,表示当前CU中编号为3的PU(一共4个,编号分别为={0, 1, 2, 3})
*/
if (isEqualCol(absPartIdx, absZorderCUIdx))
return m_encData->getPicCTU(m_cuAddr);
else
{
lPartUnitIdx -= m_absIdxInCTU;
return this;
}
}
// 返回左侧CTU的idx
lPartUnitIdx = g_rasterToZscan[absPartIdx + s_numPartInCUSize - 1];
return m_cuLeft;
}
2.1.1.1.4 检查左下块是否可用(isBelowLeftAvailable)
template<bool cip>
int Predict::isBelowLeftAvailable(const CUData& cu, uint32_t partIdxLB, bool* bValidFlags, uint32_t numUnits)
{
int numIntra = 0;
//依次检查左下角块是否可用
for (uint32_t offset = 1; offset <= numUnits; offset++, bValidFlags--) // opposite direction
{
uint32_t partBelowLeft;
const CUData* cuBelowLeft = cu.getPUBelowLeftAdi(partBelowLeft, partIdxLB, offset);
if (cuBelowLeft && (!cip || cuBelowLeft->isIntra(partBelowLeft)))
{
numIntra++;
*bValidFlags = true;
}
else
*bValidFlags = false;
}
return numIntra;
}
getPUBelowLeftAdi()的实现方式为
const CUData* CUData::getPUBelowLeftAdi(uint32_t& blPartUnitIdx, uint32_t curPartUnitIdx, uint32_t partUnitOffset) const
{
// 检查是否超出了图像边界
if ((m_encData->getPicCTU(m_cuAddr)->m_cuPelY + g_zscanToPelY[curPartUnitIdx] + (partUnitOffset << LOG2_UNIT_SIZE)) >= m_slice->m_sps->picHeightInLumaSamples)
return NULL;
uint32_t absPartIdxLB = g_zscanToRaster[curPartUnitIdx];
/*
检查absPartIdxLB所对应的row,是否小于s_numPartInCUSize - partUnitOffset
(1)如果返回true,说明当前PU对应的行小一些,左下角PU可能存在
(2)如果返回false,左下角PU不存在
对于取值
(1)如果CU最大尺寸为64x64,则s_numPartInCUSize恒定为16,表示了每行(或每列)最小单元的数量
(2)partUnitOffset表示要搜索的右上角的第几个最小单元(从1开始)
*/
if (lessThanRow(absPartIdxLB, s_numPartInCUSize - partUnitOffset))
{
// 检查是否是第0列
if (!isZeroCol(absPartIdxLB))
{
if (curPartUnitIdx > g_rasterToZscan[absPartIdxLB + (partUnitOffset << LOG2_RASTER_SIZE) - 1])
{
uint32_t absZorderCUIdxLB = g_zscanToRaster[m_absIdxInCTU] + (((1 << (m_log2CUSize[0] - LOG2_UNIT_SIZE)) - 1) << LOG2_RASTER_SIZE);
blPartUnitIdx = g_rasterToZscan[absPartIdxLB + (partUnitOffset << LOG2_RASTER_SIZE) - 1];
if (isEqualRowOrCol(absPartIdxLB, absZorderCUIdxLB))
return m_encData->getPicCTU(m_cuAddr);
else
{
blPartUnitIdx -= m_absIdxInCTU;
return this;
}
}
return NULL;
}
blPartUnitIdx = g_rasterToZscan[absPartIdxLB + (partUnitOffset << LOG2_RASTER_SIZE) + s_numPartInCUSize - 1];
return m_cuLeft;
}
// 如果给定的索引号超出了CTU的范围,设置为NULL
return NULL;
}
2.1.1.2 填充相邻块及平滑滤波(initNeighbors)
void Predict::initAdiPattern(const CUData& cu, const CUGeom& cuGeom, uint32_t puAbsPartIdx, const IntraNeighbors& intraNeighbors, int dirMode)
{
int tuSize = 1 << intraNeighbors.log2TrSize;
int tuSize2 = tuSize << 1;
PicYuv* reconPic = cu.m_encData->m_reconPic;
// 获取PU的
pixel* adiOrigin = reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + puAbsPartIdx);
intptr_t picStride = reconPic->m_stride;
// 填充参考像素
fillReferenceSamples(adiOrigin, picStride, intraNeighbors, intraNeighbourBuf[0]);
pixel* refBuf = intraNeighbourBuf[0]; // 填充的像素
pixel* fltBuf = intraNeighbourBuf[1]; // 滤波的像素
/*
refBuf存储的顺序为
(1)topLeft
(2)top(从左到右)
(3)left(从上向下)
*/
pixel topLeft = refBuf[0], topLast = refBuf[tuSize2], leftLast = refBuf[tuSize2 + tuSize2];
// 检查是否进行强滤波
if (dirMode == ALL_IDX ? (8 | 16 | 32) & tuSize : g_intraFilterFlags[dirMode] & tuSize)
{
// generate filtered intra prediction samples
// 检查是否使用strong intra smooth,tu尺寸必须为32
if (cu.m_slice->m_sps->bUseStrongIntraSmoothing && tuSize == 32)
{
const int threshold = 1 << (X265_DEPTH - 5); // threshold = 8
/*
(1)topMiddle位于top和top-right的中间
(2)leftMiddle位于left和below-left中间
*/
pixel topMiddle = refBuf[32], leftMiddle = refBuf[tuSize2 + 32];
// 检查两侧像素值变化是否小于阈值
if (abs(topLeft + topLast - (topMiddle << 1)) < threshold &&
abs(topLeft + leftLast - (leftMiddle << 1)) < threshold)
{
// "strong" bilinear interpolation
// 进行强双线性插值
const int shift = 5 + 1;
int init = (topLeft << shift) + tuSize;
int deltaL, deltaR;
deltaL = leftLast - topLeft; deltaR = topLast - topLeft;
fltBuf[0] = topLeft;
for (int i = 1; i < tuSize2; i++)
{
/*
left = a + 1/2 + [(e - a) * i] / 64
top = a + 1/2 + [(c - a) * i] / 64
*/
fltBuf[i + tuSize2] = (pixel)((init + deltaL * i) >> shift); // Left Filtering
fltBuf[i] = (pixel)((init + deltaR * i) >> shift); // Above Filtering
}
fltBuf[tuSize2] = topLast;
fltBuf[tuSize2 + tuSize2] = leftLast;
return;
}
}
// 执行普通的滤波
primitives.cu[intraNeighbors.log2TrSize - 2].intra_filter(refBuf, fltBuf);
}
}
2.1.1.2.1 填充参考像素(fillReferenceSamples)
void Predict::fillReferenceSamples(const pixel* adiOrigin, intptr_t picStride, const IntraNeighbors& intraNeighbors, pixel dst[258])
{
const pixel dcValue = (pixel)(1 << (X265_DEPTH - 1));
int numIntraNeighbor = intraNeighbors.numIntraNeighbor;
int totalUnits = intraNeighbors.totalUnits;
uint32_t tuSize = 1 << intraNeighbors.log2TrSize;
uint32_t refSize = tuSize * 2 + 1;
/*
检查是否有可用的相邻块
(1)如果没有可用相邻块,进行DC预测
(2)如果有可用的相邻块,将相邻块中的像素值copy到dst中
dst存储的顺序为
(1)top-left
(2)top
(3)left
*/
// Nothing is available, perform DC prediction.
if (numIntraNeighbor == 0)
{
// Fill top border with DC value
for (uint32_t i = 0; i < refSize; i++)
dst[i] = dcValue;
// Fill left border with DC value
for (uint32_t i = 0; i < refSize - 1; i++)
dst[i + refSize] = dcValue;
}
else if (numIntraNeighbor == totalUnits) // 所有块都可用,使用重建帧像素填充border
{
// Fill top border with rec. samples
const pixel* adiTemp = adiOrigin - picStride - 1;
memcpy(dst, adiTemp, refSize * sizeof(pixel));
// Fill left border with rec. samples
adiTemp = adiOrigin - 1;
for (uint32_t i = 0; i < refSize - 1; i++)
{
dst[i + refSize] = adiTemp[0];
adiTemp += picStride;
}
}
else // reference samples are partially available
{
// 部分块可用
const bool *bNeighborFlags = intraNeighbors.bNeighborFlags;
const bool *pNeighborFlags;
int aboveUnits = intraNeighbors.aboveUnits;
int leftUnits = intraNeighbors.leftUnits;
int unitWidth = intraNeighbors.unitWidth;
int unitHeight = intraNeighbors.unitHeight;
int totalSamples = (leftUnits * unitHeight) + ((aboveUnits + 1) * unitWidth);
pixel adiLineBuffer[5 * MAX_CU_SIZE]; // 5 * 64 = 320
pixel *adi;
// Initialize
for (int i = 0; i < totalSamples; i++)
adiLineBuffer[i] = dcValue; // 全部初始化为dcValue
// Fill top-left sample
/*
adiOrigin指向当前PU左上角像素,adiTemp为adiOrigin再向左上角偏移一个单位的位置
+---------+----------+
|adiOrigin| | ...
+---------+----------+
| | adiTemp | ...
+---------+----------+
...
adi指向adiLineBuffer数组的中间,记录topLeftVal
+---+---+---+---+---+---+---+
|leftUnits * unitHeight |adi| ...
+---+---+---+---+---+---+---+
*/
const pixel* adiTemp = adiOrigin - picStride - 1;
adi = adiLineBuffer + (leftUnits * unitHeight);
pNeighborFlags = bNeighborFlags + leftUnits;
if (*pNeighborFlags)
{
pixel topLeftVal = adiTemp[0];
for (int i = 0; i < unitWidth; i++)
adi[i] = topLeftVal; // 写入4个相同的值
}
// Fill left & below-left samples
/*
移动指针,adiTemp指向adiTemp左侧位置
+---------+----------+
| | | ...
+---------+----------+
|adiOrigin| adiTemp | ...
+---------+----------+
...
adi向左移动一个单位,准备填充left和below-left像素
+---+---+---+---+---+---+---+
|leftUnits * unitHeight | | ...
+---+---+---+---+---+---+---+
adi--
*/
adiTemp += picStride;
adi--;
// NOTE: over copy here, but reduce condition operators
for (int j = 0; j < leftUnits * unitHeight; j++)
{
adi[-j] = adiTemp[j * picStride];
}
// Fill above & above-right samples
/*
移动指针,adiTemp指向adiTemp上方位置
+---------+----------+
| |adiOrigin | ...
+---------+----------+
| | adiTemp | ...
+---------+----------+
...
adi指向aboveUnits存储的位置,准备填充above和above-right像素
+---+---+---+---+---+---+---+---+---+---+---+---+---+
|leftUnits * unitHeight | |aboveUnits * unitWidth |
+---+---+---+---+---+---+---+---+---+---+---+---+---+
adi--
*/
adiTemp = adiOrigin - picStride;
adi = adiLineBuffer + (leftUnits * unitHeight) + unitWidth;
// NOTE: over copy here, but reduce condition operators
memcpy(adi, adiTemp, aboveUnits * unitWidth * sizeof(*adiTemp));
// Pad reference samples when necessary
// 检查是否有必要填充参考像素
int curr = 0;
int next = 1;
adi = adiLineBuffer;
int pAdiLineTopRowOffset = leftUnits * (unitHeight - unitWidth);
// 如果最左下角的块不可用,需要找到邻近块的像素值去进行填充
if (!bNeighborFlags[0])
{
// very bottom unit of bottom-left; at least one unit will be valid.
// 检查至少到哪一个块开始是可用的
while (next < totalUnits && !bNeighborFlags[next])
next++;
/*
检查要开始填充的位置
(1)next < leftUnits,说明需要填充的位置位于左侧(或左下)
(2)next >= leftUnits,说明需要填充的位置位于上方(或右上)
*/
pixel* pAdiLineNext = adiLineBuffer + ((next < leftUnits) ? (next * unitHeight) : (pAdiLineTopRowOffset + (next * unitWidth)));
const pixel refSample = *pAdiLineNext; // 将最邻近块的像素值作为填充值
// Pad unavailable samples with new value
int nextOrTop = X265_MIN(next, leftUnits); // 检查当前样本值位于左侧还是上方
// fill left column
#if HIGH_BIT_DEPTH
while (curr < nextOrTop)
{
for (int i = 0; i < unitHeight; i++)
adi[i] = refSample;
adi += unitHeight;
curr++;
}
// fill top row
while (curr < next)
{
for (int i = 0; i < unitWidth; i++)
adi[i] = refSample;
adi += unitWidth;
curr++;
}
#else
X265_CHECK(curr <= nextOrTop, "curr must be less than or equal to nextOrTop\n");
// 填充左侧像素
if (curr < nextOrTop)
{
const int fillSize = unitHeight * (nextOrTop - curr);
memset(adi, refSample, fillSize * sizeof(pixel));
curr = nextOrTop;
adi += fillSize;
}
// 填充上方像素
if (curr < next)
{
const int fillSize = unitWidth * (next - curr);
memset(adi, refSample, fillSize * sizeof(pixel));
curr = next;
adi += fillSize;
}
#endif
}
// pad all other reference samples.
// 填充其他位置的像素值
while (curr < totalUnits)
{
if (!bNeighborFlags[curr]) // samples not available
{
int numSamplesInCurrUnit = (curr >= leftUnits) ? unitWidth : unitHeight;
const pixel refSample = *(adi - 1);
for (int i = 0; i < numSamplesInCurrUnit; i++)
adi[i] = refSample;
adi += numSamplesInCurrUnit;
curr++;
}
else
{
adi += (curr >= leftUnits) ? unitWidth : unitHeight;
curr++;
}
}
// Copy processed samples
adi = adiLineBuffer + refSize + unitWidth - 2;
memcpy(dst, adi, refSize * sizeof(pixel));
adi = adiLineBuffer + refSize - 1;
for (int i = 0; i < (int)refSize - 1; i++)
dst[i + refSize] = adi[-(i + 1)];
}
}
2.1.1.4 基于不同模式的帧内预测(primitives.cu[sizeIdx].intra_pred)
该函数用于获取预测块,根据前面获取的可用参考模式和可用参考像素来计算当前PU对应的预测块,这类函数通常使用汇编进行加速(asm),在x265中使用asm-primitives.cpp中定义,截取一部分代码如下
void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) // Main
{
#if X86_64
p.scanPosLast = PFX(scanPosLast_x64);
#endif
// 检查当前CPU支持SSE2的汇编加速
if (cpuMask & X265_CPU_SSE2)
{
/* We do not differentiate CPUs which support MMX and not SSE2. We only check
* for SSE2 and then use both MMX and SSE2 functions */
// 进行sad函数的初始化
AVC_LUMA_PU(sad, mmx2);
AVC_LUMA_PU(sad_x3, mmx2);
AVC_LUMA_PU(sad_x4, mmx2);
p.pu[LUMA_16x16].sad = PFX(pixel_sad_16x16_sse2);
p.pu[LUMA_16x16].sad_x3 = PFX(pixel_sad_x3_16x16_sse2);
p.pu[LUMA_16x16].sad_x4 = PFX(pixel_sad_x4_16x16_sse2);
p.pu[LUMA_16x8].sad = PFX(pixel_sad_16x8_sse2);
p.pu[LUMA_16x8].sad_x3 = PFX(pixel_sad_x3_16x8_sse2);
p.pu[LUMA_16x8].sad_x4 = PFX(pixel_sad_x4_16x8_sse2);
HEVC_SAD(sse2);
// ...
}
// ...
// 检查CPU是否支持SSE4指令集
if (cpuMask & X265_CPU_SSE4)
{
// ...
/*
定义Planar、DC和all_angs的基于SSE4实现的intra pred函数
(1)实际使用时,一般会先使用Planar和DC模式进行预测
(2)随后intra_pred_all_angs函数对所有角度模式进行统一预测
*/
ALL_LUMA_TU_S(intra_pred[PLANAR_IDX], intra_pred_planar, sse4);
ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse4);
ALL_LUMA_TU(intra_pred_allangs, all_angs_pred, sse4);
/*
下面是ALL_LUMA_TU宏的定义
#define ALL_LUMA_TU_TYPED(prim, fncdef, fname, cpu) \
p.cu[BLOCK_4x4].prim = fncdef PFX(fname ## _4x4_ ## cpu); \
p.cu[BLOCK_8x8].prim = fncdef PFX(fname ## _8x8_ ## cpu); \
p.cu[BLOCK_16x16].prim = fncdef PFX(fname ## _16x16_ ## cpu); \
p.cu[BLOCK_32x32].prim = fncdef PFX(fname ## _32x32_ ## cpu)
#define PFX3(prefix, name) prefix ## _ ## name
#define PFX2(prefix, name) PFX3(prefix, name)
#define PFX(name) PFX2(X265_NS, name)
扩展的结果是
p.cu[BLOCK_4x4].intra_pred_allangs = x265_all_angs_pred_4x4_sse4;
p.cu[BLOCK_8x8].intra_pred_allangs = x265_all_angs_pred_8x8_sse4;
p.cu[BLOCK_16x16].intra_pred_allangs = x265_all_angs_pred_16x16_sse4;
p.cu[BLOCK_32x32].intra_pred_allangs = x265_all_angs_pred_32x32_sse4;
*/
// ...
}
截取intrapred8_allangs.asm中的一部分汇编加速的代码,选取4x4大小
;-----------------------------------------------------------------------------
; void all_angs_pred_4x4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma)
;-----------------------------------------------------------------------------
INIT_XMM sse4
cglobal all_angs_pred_4x4, 4, 4, 8
; mode 2 // 模式2
movh m0, [r1 + 10]
movd [r0], m0
palignr m1, m0, 1
movd [r0 + 4], m1
palignr m1, m0, 2
movd [r0 + 8], m1
palignr m1, m0, 3
movd [r0 + 12], m1
; mode 3 // 模式3
mova m2, [pw_1024]
pslldq m1, m0, 1
pinsrb m1, [r1 + 9], 0
punpcklbw m1, m0
lea r3, [ang_table]
pmaddubsw m6, m1, [r3 + 26 * 16]
pmulhrsw m6, m2
packuswb m6, m6
movd [r0 + 16], m6
palignr m0, m1, 2
mova m7, [r3 + 20 * 16]
pmaddubsw m3, m0, m7
pmulhrsw m3, m2
packuswb m3, m3
movd [r0 + 20], m3
; mode 6 [row 3]
movd [r0 + 76], m3
palignr m3, m1, 4
pmaddubsw m4, m3, [r3 + 14 * 16]
pmulhrsw m4, m2
packuswb m4, m4
movd [r0 + 24], m4
palignr m4, m1, 6
pmaddubsw m4, [r3 + 8 * 16]
pmulhrsw m4, m2
packuswb m4, m4
movd [r0 + 28], m4
// ...
当然,也可以调试C语言版本实现的模式预测函数,位于primitives.cpp中,由setupIntraPrimitives_c()初始化
void setupIntraPrimitives_c(EncoderPrimitives& p)
{
p.cu[BLOCK_4x4].intra_filter = intraFilter<4>;
p.cu[BLOCK_8x8].intra_filter = intraFilter<8>;
p.cu[BLOCK_16x16].intra_filter = intraFilter<16>;
p.cu[BLOCK_32x32].intra_filter = intraFilter<32>;
p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = planar_pred_c<2>;
p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = planar_pred_c<3>;
p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = planar_pred_c<4>;
p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = planar_pred_c<5>;
p.cu[BLOCK_4x4].intra_pred[DC_IDX] = intra_pred_dc_c<4>;
p.cu[BLOCK_8x8].intra_pred[DC_IDX] = intra_pred_dc_c<8>;
p.cu[BLOCK_16x16].intra_pred[DC_IDX] = intra_pred_dc_c<16>;
p.cu[BLOCK_32x32].intra_pred[DC_IDX] = intra_pred_dc_c<32>;
// single angs初始化
for (int i = 2; i < NUM_INTRA_MODE; i++)
{
p.cu[BLOCK_4x4].intra_pred[i] = intra_pred_ang_c<4>;
p.cu[BLOCK_8x8].intra_pred[i] = intra_pred_ang_c<8>;
p.cu[BLOCK_16x16].intra_pred[i] = intra_pred_ang_c<16>;
p.cu[BLOCK_32x32].intra_pred[i] = intra_pred_ang_c<32>;
}
// all angs初始化,一般不使用或禁用
p.cu[BLOCK_4x4].intra_pred_allangs = all_angs_pred_c<2>;
p.cu[BLOCK_8x8].intra_pred_allangs = all_angs_pred_c<3>;
p.cu[BLOCK_16x16].intra_pred_allangs = all_angs_pred_c<4>;
p.cu[BLOCK_32x32].intra_pred_allangs = all_angs_pred_c<5>;
}
但是在实际使用时,如果使用C语言的intra pred的代码,会禁用all_angs_pred
void x265_setup_primitives(x265_param *param)
{
if (!primitives.pu[0].sad)
{
// 设置C语言实现的基本函数
setupCPrimitives(primitives);
/* We do not want the encoder to use the un-optimized intra all-angles
* C references. It is better to call the individual angle functions
* instead. We must check for NULL before using this primitive */
// 会把前面C实现的all angs禁用
for (int i = 0; i < NUM_TR_SIZE; i++)
primitives.cu[i].intra_pred_allangs = NULL;
#if ENABLE_ASSEMBLY
#if X265_ARCH_X86
setupInstrinsicPrimitives(primitives, param->cpuid);
#endif
// 设置汇编加速的函数
setupAssemblyPrimitives(primitives, param->cpuid);
#endif
#if HAVE_ALTIVEC
if (param->cpuid & X265_CPU_ALTIVEC)
{
setupPixelPrimitives_altivec(primitives); // pixel_altivec.cpp, overwrite the initialization for altivec optimizated functions
setupDCTPrimitives_altivec(primitives); // dct_altivec.cpp, overwrite the initialization for altivec optimizated functions
setupFilterPrimitives_altivec(primitives); // ipfilter.cpp, overwrite the initialization for altivec optimizated functions
setupIntraPrimitives_altivec(primitives); // intrapred_altivec.cpp, overwrite the initialization for altivec optimizated functions
}
#endif
setupAliasPrimitives(primitives);
if (param->bLowPassDct)
{
enableLowpassDCTPrimitives(primitives);
}
}
x265_report_simd(param);
}
C语言实现的单个角度预测模式实现的方式如下,基本的一些运算,不过在实现过程中为了简便增加了一些映射表
template<int width>
void intra_pred_ang_c(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int dirMode, int bFilter)
{
int width2 = width << 1;
// Flip the neighbours in the horizontal case.
int horMode = dirMode < 18; // 小于18,说明当前的模式为偏向水平的模式
pixel neighbourBuf[129];
const pixel *srcPix = srcPix0;
// 如果是水平模式,需要转置
if (horMode)
{
neighbourBuf[0] = srcPix[0];
for (int i = 0; i < width << 1; i++)
{
neighbourBuf[1 + i] = srcPix[width2 + 1 + i];
neighbourBuf[width2 + 1 + i] = srcPix[1 + i];
}
srcPix = neighbourBuf;
}
// Intra prediction angle and inverse angle tables.
const int8_t angleTable[17] = { -32, -26, -21, -17, -13, -9, -5, -2, 0, 2, 5, 9, 13, 17, 21, 26, 32 };
const int16_t invAngleTable[8] = { 4096, 1638, 910, 630, 482, 390, 315, 256 };
// Get the prediction angle.
int angleOffset = horMode ? 10 - dirMode : dirMode - 26; // 计算angle偏移量
int angle = angleTable[8 + angleOffset];
// Vertical Prediction.
if (!angle)
{
for (int y = 0; y < width; y++)
for (int x = 0; x < width; x++)
dst[y * dstStride + x] = srcPix[1 + x];
if (bFilter)
{
int topLeft = srcPix[0], top = srcPix[1];
for (int y = 0; y < width; y++)
dst[y * dstStride] = x265_clip((int16_t)(top + ((srcPix[width2 + 1 + y] - topLeft) >> 1)));
}
}
else // Angular prediction.
{
// Get the reference pixels. The reference base is the first pixel to the top (neighbourBuf[1]).
pixel refBuf[64];
const pixel *ref;
// Use the projected left neighbours and the top neighbours.
if (angle < 0)
{
// Number of neighbours projected.
int nbProjected = -((width * angle) >> 5) - 1;
pixel *ref_pix = refBuf + nbProjected + 1;
// Project the neighbours.
int invAngle = invAngleTable[- angleOffset - 1];
int invAngleSum = 128;
for (int i = 0; i < nbProjected; i++)
{
invAngleSum += invAngle;
ref_pix[- 2 - i] = srcPix[width2 + (invAngleSum >> 8)];
}
// Copy the top-left and top pixels.
for (int i = 0; i < width + 1; i++)
ref_pix[-1 + i] = srcPix[i];
ref = ref_pix;
}
else // Use the top and top-right neighbours.
ref = srcPix + 1;
// Pass every row.
int angleSum = 0;
for (int y = 0; y < width; y++)
{
angleSum += angle;
int offset = angleSum >> 5;
int fraction = angleSum & 31;
if (fraction) // Interpolate
for (int x = 0; x < width; x++)
dst[y * dstStride + x] = (pixel)(((32 - fraction) * ref[offset + x] + fraction * ref[offset + x + 1] + 16) >> 5);
else // Copy.
for (int x = 0; x < width; x++)
dst[y * dstStride + x] = ref[offset + x];
}
}
// Flip for horizontal.
if (horMode)
{
for (int y = 0; y < width - 1; y++)
{
for (int x = y + 1; x < width; x++)
{
pixel tmp = dst[y * dstStride + x];
dst[y * dstStride + x] = dst[x * dstStride + y];
dst[x * dstStride + y] = tmp;
}
}
}
}
2.1.1.3 编码IntraLuma块(codeIntraLumaQT)
前面实现的是预测模式的粗选,使用基于SAD的方法从许多模式中选一些性能还不错的模式,而codeIntraLumaQT()函数之中会使用SSE计算出开销,并且结合所消耗的比特确定最佳模式,其中
void Search::codeIntraLumaQT(Mode& mode, const CUGeom& cuGeom, uint32_t tuDepth, uint32_t absPartIdx, bool bAllowSplit, Cost& outCost, const uint32_t depthRange[2])
{
CUData& cu = mode.cu;
uint32_t fullDepth = cuGeom.depth + tuDepth;
uint32_t log2TrSize = cuGeom.log2CUSize - tuDepth;
uint32_t qtLayer = log2TrSize - 2;
uint32_t sizeIdx = log2TrSize - 2;
/*
检查当前CU是否还可以继续split
(1)如果TUSize <= depth上限,可以不继续划分
(2)如果TUSize > depth下限,并且参数允许划分,可以继续划分
PS: 感觉正常的逻辑应该是,只要位于depth范围之内都有可能继续划分?
*/
bool mightNotSplit = log2TrSize <= depthRange[1];
bool mightSplit = (log2TrSize > depthRange[0]) && (bAllowSplit || !mightNotSplit);
bool bEnableRDOQ = !!m_param->rdoqLevel;
/* If maximum RD penalty, force spits at TU size 32x32 if SPS allows TUs of 16x16 */
// 如果使用最大RD penalty,强制32x32的TU划分为16x16
if (m_param->rdPenalty == 2 && m_slice->m_sliceType != I_SLICE && log2TrSize == 5 && depthRange[0] <= 4)
{
mightNotSplit = false;
mightSplit = true;
}
Cost fullCost;
uint32_t bCBF = 0;
pixel* reconQt = m_rqt[qtLayer].reconQtYuv.getLumaAddr(absPartIdx);
uint32_t reconQtStride = m_rqt[qtLayer].reconQtYuv.m_size;
if (mightNotSplit)
{
if (mightSplit)
m_entropyCoder.store(m_rqt[fullDepth].rqtRoot);
const pixel* fenc = mode.fencYuv->getLumaAddr(absPartIdx);
pixel* pred = mode.predYuv.getLumaAddr(absPartIdx);
int16_t* residual = m_rqt[cuGeom.depth].tmpResiYuv.getLumaAddr(absPartIdx);
uint32_t stride = mode.fencYuv->m_size;
// init availability pattern
uint32_t lumaPredMode = cu.m_lumaIntraDir[absPartIdx];
IntraNeighbors intraNeighbors;
initIntraNeighbors(cu, absPartIdx, tuDepth, true, &intraNeighbors);
initAdiPattern(cu, cuGeom, absPartIdx, intraNeighbors, lumaPredMode);
// get prediction signal
// 预测
predIntraLumaAng(lumaPredMode, pred, stride, log2TrSize);
cu.setTransformSkipSubParts(0, TEXT_LUMA, absPartIdx, fullDepth);
cu.setTUDepthSubParts(tuDepth, absPartIdx, fullDepth);
uint32_t coeffOffsetY = absPartIdx << (LOG2_UNIT_SIZE * 2);
coeff_t* coeffY = m_rqt[qtLayer].coeffRQT[0] + coeffOffsetY;
// store original entropy coding status
if (bEnableRDOQ)
m_entropyCoder.estBit(m_entropyCoder.m_estBitsSbac, log2TrSize, true);
// 计算残差
primitives.cu[sizeIdx].calcresidual[stride % 64 == 0](fenc, pred, residual, stride);
// 进行变换
uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeffY, log2TrSize, TEXT_LUMA, absPartIdx, false);
if (numSig)
{
m_quant.invtransformNxN(cu, residual, stride, coeffY, log2TrSize, TEXT_LUMA, true, false, numSig);
bool reconQtYuvAlign = m_rqt[qtLayer].reconQtYuv.getAddrOffset(absPartIdx, mode.predYuv.m_size) % 64 == 0;
bool predAlign = mode.predYuv.getAddrOffset(absPartIdx, mode.predYuv.m_size) % 64 == 0;
bool residualAlign = m_rqt[cuGeom.depth].tmpResiYuv.getAddrOffset(absPartIdx, mode.predYuv.m_size) % 64 == 0;
bool bufferAlignCheck = (reconQtStride % 64 == 0) && (stride % 64 == 0) && reconQtYuvAlign && predAlign && residualAlign;
primitives.cu[sizeIdx].add_ps[bufferAlignCheck](reconQt, reconQtStride, pred, residual, stride, stride);
}
else
// no coded residual, recon = pred
primitives.cu[sizeIdx].copy_pp(reconQt, reconQtStride, pred, stride);
bCBF = !!numSig << tuDepth;
cu.setCbfSubParts(bCBF, TEXT_LUMA, absPartIdx, fullDepth);
// 计算sse损失
/*
p.cu[BLOCK_4x4].sse_pp = PFX(pixel_ssd_4x4_ssse3);
*/
fullCost.distortion = primitives.cu[sizeIdx].sse_pp(reconQt, reconQtStride, fenc, stride);
m_entropyCoder.resetBits();
if (!absPartIdx)
{
if (!cu.m_slice->isIntra())
{
if (cu.m_slice->m_pps->bTransquantBypassEnabled)
m_entropyCoder.codeCUTransquantBypassFlag(cu.m_tqBypass[0]);
m_entropyCoder.codeSkipFlag(cu, 0);
m_entropyCoder.codePredMode(cu.m_predMode[0]);
}
// 编码partSize
m_entropyCoder.codePartSize(cu, 0, cuGeom.depth);
}
// 编码luma的ang
if (cu.m_partSize[0] == SIZE_2Nx2N)
{
if (!absPartIdx)
m_entropyCoder.codeIntraDirLumaAng(cu, 0, false);
}
else
{
uint32_t qNumParts = cuGeom.numPartitions >> 2;
if (!tuDepth)
{
for (uint32_t qIdx = 0; qIdx < 4; ++qIdx)
m_entropyCoder.codeIntraDirLumaAng(cu, qIdx * qNumParts, false);
}
else if (!(absPartIdx & (qNumParts - 1)))
m_entropyCoder.codeIntraDirLumaAng(cu, absPartIdx, false);
}
if (log2TrSize != depthRange[0])
m_entropyCoder.codeTransformSubdivFlag(0, 5 - log2TrSize);
// 编码cbf
m_entropyCoder.codeQtCbfLuma(!!numSig, tuDepth);
// 对NxN变换块的系数进行编码
if (cu.getCbf(absPartIdx, TEXT_LUMA, tuDepth))
m_entropyCoder.codeCoeffNxN(cu, coeffY, absPartIdx, log2TrSize, TEXT_LUMA);
// 计算一共消耗的编码比特数
fullCost.bits = m_entropyCoder.getNumberOfWrittenBits();
if (m_param->rdPenalty && log2TrSize == 5 && m_slice->m_sliceType != I_SLICE)
fullCost.bits *= 4;
// 根据不同配置计算rdcost
if (m_rdCost.m_psyRd)
{
fullCost.energy = m_rdCost.psyCost(sizeIdx, fenc, mode.fencYuv->m_size, reconQt, reconQtStride);
fullCost.rdcost = m_rdCost.calcPsyRdCost(fullCost.distortion, fullCost.bits, fullCost.energy);
}
else if(m_rdCost.m_ssimRd)
{
fullCost.energy = m_quant.ssimDistortion(cu, fenc, stride, reconQt, reconQtStride, log2TrSize, TEXT_LUMA, absPartIdx);
fullCost.rdcost = m_rdCost.calcSsimRdCost(fullCost.distortion, fullCost.bits, fullCost.energy);
}
else
fullCost.rdcost = m_rdCost.calcRdCost(fullCost.distortion, fullCost.bits);
}
else
fullCost.rdcost = MAX_INT64;
// 如果允许继续划分,会分成子块去进行编码
if (mightSplit)
{
if (mightNotSplit)
{
m_entropyCoder.store(m_rqt[fullDepth].rqtTest); // save state after full TU encode
m_entropyCoder.load(m_rqt[fullDepth].rqtRoot); // prep state of split encode
}
/* code split block */
uint32_t qNumParts = 1 << (log2TrSize - 1 - LOG2_UNIT_SIZE) * 2;
int checkTransformSkip = m_slice->m_pps->bTransformSkipEnabled && (log2TrSize - 1) <= MAX_LOG2_TS_SIZE && !cu.m_tqBypass[0];
if (m_param->bEnableTSkipFast)
checkTransformSkip &= cu.m_partSize[0] != SIZE_2Nx2N;
Cost splitCost;
uint32_t cbf = 0;
for (uint32_t qIdx = 0, qPartIdx = absPartIdx; qIdx < 4; ++qIdx, qPartIdx += qNumParts)
{
if (checkTransformSkip)
codeIntraLumaTSkip(mode, cuGeom, tuDepth + 1, qPartIdx, splitCost);
else
codeIntraLumaQT(mode, cuGeom, tuDepth + 1, qPartIdx, bAllowSplit, splitCost, depthRange);
cbf |= cu.getCbf(qPartIdx, TEXT_LUMA, tuDepth + 1);
}
cu.m_cbf[0][absPartIdx] |= (cbf << tuDepth);
if (mightNotSplit && log2TrSize != depthRange[0])
{
/* If we could have coded this TU depth, include cost of subdiv flag */
m_entropyCoder.resetBits();
m_entropyCoder.codeTransformSubdivFlag(1, 5 - log2TrSize);
splitCost.bits += m_entropyCoder.getNumberOfWrittenBits();
if (m_rdCost.m_psyRd)
splitCost.rdcost = m_rdCost.calcPsyRdCost(splitCost.distortion, splitCost.bits, splitCost.energy);
else if(m_rdCost.m_ssimRd)
splitCost.rdcost = m_rdCost.calcSsimRdCost(splitCost.distortion, splitCost.bits, splitCost.energy);
else
splitCost.rdcost = m_rdCost.calcRdCost(splitCost.distortion, splitCost.bits);
}
if (splitCost.rdcost < fullCost.rdcost)
{
outCost.rdcost += splitCost.rdcost;
outCost.distortion += splitCost.distortion;
outCost.bits += splitCost.bits;
outCost.energy += splitCost.energy;
return;
}
else
{
// recover entropy state of full-size TU encode
m_entropyCoder.load(m_rqt[fullDepth].rqtTest);
// recover transform index and Cbf values
cu.setTUDepthSubParts(tuDepth, absPartIdx, fullDepth);
cu.setCbfSubParts(bCBF, TEXT_LUMA, absPartIdx, fullDepth);
cu.setTransformSkipSubParts(0, TEXT_LUMA, absPartIdx, fullDepth);
}
}
// set reconstruction for next intra prediction blocks if full TU prediction won
// 存储重建帧信息,用于intra prediction
PicYuv* reconPic = m_frame->m_reconPic;
pixel* picReconY = reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx);
intptr_t picStride = reconPic->m_stride;
primitives.cu[sizeIdx].copy_pp(picReconY, picStride, reconQt, reconQtStride);
// 存储开销
outCost.rdcost += fullCost.rdcost;
outCost.distortion += fullCost.distortion;
outCost.bits += fullCost.bits;
outCost.energy += fullCost.energy;
}
2.2 确认最佳模式(checkBestMode)
确认最佳模式使用的是一个内联函数,依据rdCost来评估是否是最佳模式
/* check whether current mode is the new best */
inline void checkBestMode(Mode& mode, uint32_t depth)
{
ModeDepth& md = m_modeDepth[depth];
if (md.bestMode)
{
if (mode.rdCost < md.bestMode->rdCost)
md.bestMode = &mode;
}
else
md.bestMode = &mode;
}
在上述的代码之中,除了一些标准定义的实现方式,其中需要注意的部分包括寻址和类的嵌套,这样x265的帧内预测模块简单分析就结束了