[x265] A Brief Analysis of the Prediction Module: Inter Prediction

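Related x265 articles:
[x265] x265 encoder parameter configuration
[x265] A brief analysis of the prediction module: intra prediction
[x265] A brief analysis of the prediction module: inter prediction
[x265] A brief analysis of the rate control module: block-level tools (AQ and cuTree)
[x265] A brief analysis of the rate control module: frame-level modes (CQP, CRF and ABR)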

1. Overview of Inter Prediction

1.1 Coding Block Structure

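Inter prediction is one of the most effective tools an encoder has for reducing bitrate: by referencing temporally adjacent pictures it removes a large share of the temporal redundancy, which in turn saves network bandwidth. In x265, inter prediction (Inter mode below) operates on PUs: a CU is divided into several sub-regions, and each is predicted separately. Unlike intra prediction (Intra mode below), Inter mode can also split a CU into asymmetric PUs. There are 8 partition modes in total: 2Nx2N, 2NxN, Nx2N, NxN, 2NxnU, 2NxnD, nLx2N and nRx2N.
[Figure: the 8 PU partition modes of an inter CU]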

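1.2 Motion Estimation

Consecutive frames of a video are usually strongly correlated. A natural idea is to find two very similar blocks in neighbouring pictures and describe their positional difference with a motion offset: the earlier frame codes the image block itself, and the later frame only needs to code the offset (plus whatever prediction residual remains), which saves bits. The process of finding this offset is called motion estimation (ME). Finding the two similar blocks raises two questions:

  1. How to measure how different the two blocks are
  2. How to find the two blocks efficiently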

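1.2.1 Motion Estimation Criteria

In Inter mode, the difference between the reference block (refBlock below) and the current block (curBlock below) is measured mainly with SAD and SATD. The bit cost of the corresponding MV is added on top, giving the rate-distortion cost J = D + lambda * R, where D is the distortion (SAD or SATD), R is the number of bits spent on the MV, and lambda is the Lagrange multiplier.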
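The cost J = D + lambda * R can be illustrated with a minimal sketch. The names `sad`, `seGolombBits` and `motionCost` are hypothetical and are not x265 functions; the bit estimate is a plain signed Exp-Golomb code length used only as a stand-in for R, since x265 has its own MV bit-cost estimation that is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Sum of absolute differences between two equally sized pixel blocks (the D term).
uint32_t sad(const std::vector<uint8_t>& cur, const std::vector<uint8_t>& ref)
{
    assert(cur.size() == ref.size());
    uint32_t sum = 0;
    for (std::size_t i = 0; i < cur.size(); i++)
        sum += static_cast<uint32_t>(std::abs(int(cur[i]) - int(ref[i])));
    return sum;
}

// Toy rate estimate: length in bits of a signed Exp-Golomb code for one MV component.
// Assumption for illustration only; this is not how x265 estimates MV bits.
uint32_t seGolombBits(int v)
{
    uint32_t codeNum = v > 0 ? 2u * v - 1 : 2u * (-v); // signed -> unsigned mapping
    uint32_t bits = 1;
    for (uint32_t x = codeNum + 1; x > 1; x >>= 1)
        bits += 2;
    return bits;
}

// J = D + lambda * R
double motionCost(uint32_t distortion, double lambda, uint32_t bits)
{
    return distortion + lambda * bits;
}
```

For a candidate MV, the encoder would compute `sad` against the displaced reference block, estimate the bits of the MV difference, and keep the candidate with the smallest `motionCost`.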

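1.2.2 Motion Search

In x265, motion search (MS) proceeds in several steps:

  1. Integer-pel search
    (1) Diamond search (X265_DIA_SEARCH)
    (2) Hexagon search (X265_HEX_SEARCH)
    These two search patterns are similar to their x264 counterparts; see Lei Xiaohua's article "x264源代码简单分析:宏块分析(Analysis)部分-帧间宏块(Inter)" (a brief analysis of the x264 source code: macroblock analysis, inter macroblocks). The difference is that with HEX search x265 runs some additional iterations, using a fast half-hexagon search to widen the search range, because CUs in x265 are larger.
  2. 1/2-pel search
  3. 1/4-pel search

PS: The integer-pel search measures the cost with SAD, while the 1/2-pel and 1/4-pel searches use SATD. A 1/8-pel search is not used because the gain it would bring is not significant (HEVC luma MVs stop at quarter-pel precision).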
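To make the integer-pel step concrete, here is a minimal sketch of a small diamond search that uses SAD only. It is a simplification under stated assumptions: the `Plane`, `MV`, `blockSad` and `diamondSearch` names are hypothetical, the search starts at the zero MV rather than at predicted candidates, and the MV bit cost is left out, so it does not mirror the x265 implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct Plane {
    int width, height;
    std::vector<uint8_t> pix;
    uint8_t at(int x, int y) const { return pix[y * width + x]; }
};

struct MV { int x, y; };

// SAD between the size x size block at (cx, cy) in cur and the block displaced by mv in ref.
// Returns UINT32_MAX when the displaced block would leave the reference picture.
uint32_t blockSad(const Plane& cur, const Plane& ref, int cx, int cy, int size, MV mv)
{
    int rx = cx + mv.x, ry = cy + mv.y;
    if (rx < 0 || ry < 0 || rx + size > ref.width || ry + size > ref.height)
        return UINT32_MAX;
    uint32_t sum = 0;
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            sum += std::abs(int(cur.at(cx + x, cy + y)) - int(ref.at(rx + x, ry + y)));
    return sum;
}

// Small diamond search: test the 4 neighbours of the current best MV, move to the
// best of them, and stop when the centre wins or maxSteps is reached.
MV diamondSearch(const Plane& cur, const Plane& ref, int cx, int cy, int size, int maxSteps)
{
    static const MV offsets[4] = { {0, -1}, {-1, 0}, {1, 0}, {0, 1} };
    MV best = {0, 0};
    uint32_t bestCost = blockSad(cur, ref, cx, cy, size, best);
    for (int step = 0; step < maxSteps; step++)
    {
        MV center = best;
        for (const MV& o : offsets)
        {
            MV cand = { center.x + o.x, center.y + o.y };
            uint32_t cost = blockSad(cur, ref, cx, cy, size, cand);
            if (cost < bestCost)
            {
                bestCost = cost;
                best = cand;
            }
        }
        if (best.x == center.x && best.y == center.y)
            break; // no neighbour improved on the centre
    }
    return best;
}
```

A sub-pel refinement would continue from the returned integer MV on interpolated samples, switching the metric to SATD as described above.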

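1.3 MV Prediction

Inter mode uses two techniques, Merge and AMVP, to code motion information more efficiently. Merge can be regarded as a coding mode: x265 has dedicated definitions for it (such as PRED_MERGE), and the merge information is written to the bitstream during actual encoding (for example m_entropyCoder.codeMergeIndex(cu, 0)); no MVD (MV difference) is coded. AMVP can be regarded as an MV prediction technique: the encoder only codes the difference between the actual MV and the predicted MV, so an MVD is present.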

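1.3.1 Merge Mode

Merge mode builds an MV candidate list for the current PU; the list holds up to 5 candidates. The encoder walks the list and picks the best candidate as the Merge MV. The merge result also guides the rest of the inter analysis, for example the early-skip decisions in the entry function analysed in section 2.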

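The Merge list is built from two parts, spatial candidates and temporal candidates:

  1. Building the spatial candidates
    The spatial candidates are checked in the order { A1, B1, B0, A0, B2 }, from left to right, and contribute at most 4 candidate MVs.
    [Figure: positions of the spatial candidates A0, A1, B0, B1 and B2 around the current PU]
    The second PU (PU2) of a rectangular partition needs special handling. In case (a) of the figure below, PU2's list must not contain the motion information of A1: A1 belongs to PU1, so if PU2 reused it, PU1 and PU2 would have identical MVs and the partition would be no different from 2Nx2N. Likewise, in case (b), PU2's list must not contain the motion information of B1.
    [Figure: (a) PU2 whose left neighbour A1 lies in PU1; (b) PU2 whose above neighbour B1 lies in PU1]
  2. Building the temporal candidate
    The temporal candidate uses the motion information of the PU at the corresponding position (the co-located PU) in a nearby already-coded picture. It is not used directly; it is scaled according to the relative temporal positions of the current picture and the reference pictures. In the figure below, cur_PU is the PU being predicted, col_PU is the co-located PU in the neighbouring coded picture, cur_ref is the reference picture of the current picture, and col_ref is the reference picture of the co-located picture.
    [Figure: temporal MV scaling between cur_PU, col_PU, cur_ref and col_ref]
    The temporal candidate MV of the current PU is computed as
    $$ curMV = \frac{td}{tb} \cdot colMV $$
    Here td and tb denote the temporal distance from the current picture to cur_ref and from the co-located picture to col_ref, respectively.
    The co-located block is taken at the bottom-right position H; if H is unavailable, C3 is used instead. The temporal part provides at most 1 candidate MV.
    [Figure: positions H and C3 of the co-located block]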

PS: If the Merge list still has fewer than 5 candidates after the two steps above, it is padded with (0, 0).
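The list construction above can be sketched as follows. This is a simplified illustration, not the x265 or HEVC derivation: candidates carry only an MV (no reference index or prediction direction), duplicate removal compares against every earlier entry, scaling uses floating-point rounding instead of the standard's fixed-point arithmetic, and the names `scaleTemporalMv` and `buildMergeList` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct MV { int x, y; };

// curMV = (td / tb) * colMV, rounded to the nearest integer.
// td: distance from the current picture to cur_ref; tb: distance from the
// co-located picture to col_ref (naming as in the formula above).
MV scaleTemporalMv(MV colMv, int td, int tb)
{
    assert(tb != 0);
    return { static_cast<int>(std::lround(double(colMv.x) * td / tb)),
             static_cast<int>(std::lround(double(colMv.y) * td / tb)) };
}

// spatial: available spatial MVs, already in the order A1, B1, B0, A0, B2.
// temporal: scaled temporal MV, or nullptr when the co-located block is unavailable.
std::vector<MV> buildMergeList(const std::vector<MV>& spatial, const MV* temporal,
                               std::size_t maxCand = 5)
{
    std::vector<MV> list;
    auto contains = [&](MV mv) {
        for (const MV& c : list)
            if (c.x == mv.x && c.y == mv.y)
                return true;
        return false;
    };

    // Step 1: at most 4 spatial candidates, duplicates dropped.
    for (const MV& mv : spatial)
        if (list.size() < 4 && list.size() < maxCand && !contains(mv))
            list.push_back(mv);

    // Step 2: at most 1 temporal candidate.
    if (temporal && list.size() < maxCand)
        list.push_back(*temporal);

    // Padding: fill the remaining slots with (0, 0).
    while (list.size() < maxCand)
        list.push_back({0, 0});
    return list;
}
```

For example, a co-located MV of (8, -4) with td = 1 and tb = 2 scales to (4, -2), and a list built from three spatial MVs with one duplicate plus that temporal MV ends with two zero entries.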

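1.3.2 AMVP

AMVP is similar to Merge in that it also exploits the spatial and temporal correlation of motion vectors.

  1. Building the spatial candidates
    Using the same neighbour labels as Merge, AMVP derives one candidate predictor from the left neighbours and one from the above neighbours. The left order is { A0, A1, scaled A0, scaled A1 } and the above order is { B0, B1, B2, (scaled B0, scaled B1, scaled B2) }. Here "scaled" refers to the same scaling that Merge applies when deriving the current MV from the co-located block. For the above neighbours, scaling is performed only when both left PUs are unavailable or intra coded. A neighbour's MV can be used directly only if it points to the same reference picture as the current PU; otherwise it has to be scaled.

    The AMVP spatial candidates contribute at most 2 MVs (Merge: at most 4).

  2. Building the temporal candidate
    Same as in Merge mode.

PS: If fewer than 2 candidates remain after the two steps above, the AMVP list is padded with (0, 0). In actual encoding AMVP codes the MV differentially, i.e. only the MVD is coded, together with the index of the chosen predictor.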
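The differential coding in AMVP can be shown with a small sketch. It assumes a two-entry candidate list is already available and uses |MVD.x| + |MVD.y| as a hypothetical stand-in for the real bit cost; `chooseMvp` and `reconstructMv` are illustrative names, not x265 functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct MV { int x, y; };

struct AmvpChoice {
    int mvpIdx; // index of the chosen predictor in the candidate list
    MV mvd;     // MV difference that would be written to the bitstream
};

// Encoder side: pick the predictor that gives the cheapest MVD.
AmvpChoice chooseMvp(MV mv, const std::vector<MV>& candidates)
{
    assert(!candidates.empty());
    AmvpChoice best{0, {mv.x - candidates[0].x, mv.y - candidates[0].y}};
    int bestCost = std::abs(best.mvd.x) + std::abs(best.mvd.y);
    for (std::size_t i = 1; i < candidates.size(); i++)
    {
        MV mvd = { mv.x - candidates[i].x, mv.y - candidates[i].y };
        int cost = std::abs(mvd.x) + std::abs(mvd.y);
        if (cost < bestCost)
        {
            bestCost = cost;
            best = { static_cast<int>(i), mvd };
        }
    }
    return best;
}

// Decoder side: MV = MVP + MVD.
MV reconstructMv(const AmvpChoice& c, const std::vector<MV>& candidates)
{
    const MV& p = candidates[c.mvpIdx];
    return { p.x + c.mvd.x, p.y + c.mvd.y };
}
```

With an actual MV of (5, -3) and candidates (0, 0) and (4, -2), the second predictor is chosen and the MVD is (1, -1); adding it back to the predictor recovers the original MV.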

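2. Inter Prediction Entry Function (compressInterCU_rd0_4)

Of the inter prediction entry functions in x265, only compressInterCU_rd0_4() is analysed here. The 0 and 4 in the name mean that the function is used when rdLevel is between 0 and 4; since the default configuration has rdLevel = 3, this is the function used for inter prediction by default.

The function is defined in encoder\analysis.cpp. Its main workflow is:
(1) Evaluate the cost of merge and skip modes (checkMerge2Nx2N_rd0_4)
(2) Evaluate the cost of splitting into 4 sub-blocks (recursive call to compressInterCU_rd0_4)
(3) Evaluate the cost of the partition modes and of Intra mode at the current depth (checkInter_rd0_4, checkIntraInInter)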
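The recursive structure of this workflow can be reduced to a toy sketch. It only shows the split-versus-no-split comparison: steps (1) and (3) are folded into a single caller-supplied cost function, early termination (skipModes, skipRecursion) is not modelled, and `compressCU`, `CostFn` and `splitFlagCost` are hypothetical names rather than x265 code.

```cpp
#include <cassert>
#include <functional>

// Hypothetical cost of coding the CU at (x, y) of the given size without splitting.
using CostFn = std::function<double(int x, int y, int size)>;

// Returns the best cost for the CU: either coded as a whole, or split into
// 4 sub-CUs that are each analysed recursively.
double compressCU(int x, int y, int size, int minSize,
                  const CostFn& unsplitCost, double splitFlagCost)
{
    double best = unsplitCost(x, y, size); // merge/skip, inter and intra folded together

    if (size > minSize)
    {
        double split = splitFlagCost;
        int half = size / 2;
        for (int i = 0; i < 4; i++) // sub-CU order: 0 1 / 2 3
            split += compressCU(x + (i & 1) * half, y + (i >> 1) * half,
                                half, minSize, unsplitCost, splitFlagCost);
        if (split < best)
            best = split;
    }
    return best;
}
```

In the real function the order differs slightly: merge/skip is evaluated first so that the recursion and the remaining modes can be skipped early, and the split result is kept in md.pred[PRED_SPLIT] for the later comparison.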

SplitData Analysis::compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp)
{
    if (parentCTU.m_vbvAffected && calculateQpforCuSize(parentCTU, cuGeom, 1))
        return compressInterCU_rd5_6(parentCTU, cuGeom, qp);

    uint32_t depth = cuGeom.depth;
    uint32_t cuAddr = parentCTU.m_cuAddr;
    ModeDepth& md = m_modeDepth[depth];

    // searchMethod defaults to X265_HEX_SEARCH
    if (m_param->searchMethod == X265_SEA)
    {
        int numPredDir = m_slice->isInterP() ? 1 : 2;
        int offset = (int)(m_frame->m_reconPic->m_cuOffsetY[parentCTU.m_cuAddr] + m_frame->m_reconPic->m_buOffsetY[cuGeom.absPartIdx]);
        for (int list = 0; list < numPredDir; list++)
            for (int i = 0; i < m_frame->m_encData->m_slice->m_numRefIdx[list]; i++)
                for (int planes = 0; planes < INTEGRAL_PLANE_NUM; planes++)
                    m_modeDepth[depth].fencYuv.m_integral[list][i][planes] = m_frame->m_encData->m_slice->m_refFrameList[list][i]->m_encData->m_meIntegral[planes] + offset;
    }

    PicYuv& reconPic = *m_frame->m_reconPic;
    SplitData splitCUData;
    // Whether to run the HEVC block analysis (x265 appears to support reusing AVC analysis info)
    bool bHEVCBlockAnalysis = (m_param->bAnalysisType == AVC_INFO && cuGeom.numPartitions > 16);
    // Whether to refine the loaded AVC analysis
    bool bRefineAVCAnalysis = (m_param->analysisLoadReuseLevel == 7 && (m_modeFlag[0] || m_modeFlag[1]));
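    // No offloading: true when bAnalysisType is not AVC_INFO, i.e. the analysis is presumably done by x265 itself rather than taken from external AVC analysis info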
    bool bNooffloading = !(m_param->bAnalysisType == AVC_INFO);

    if (bHEVCBlockAnalysis || bRefineAVCAnalysis || bNooffloading)
    {
        md.bestMode = NULL;
        bool mightSplit = !(cuGeom.flags & CUGeom::LEAF);
        bool mightNotSplit = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY);
        uint32_t minDepth = topSkipMinDepth(parentCTU, cuGeom);
        bool bDecidedDepth = parentCTU.m_cuDepth[cuGeom.absPartIdx] == depth;
        bool skipModes = false; /* Skip any remaining mode analyses at current depth */
        bool skipRecursion = false; /* Skip recursion */
        bool splitIntra = true;
        bool skipRectAmp = false;
        bool chooseMerge = false;
        bool bCtuInfoCheck = false;
        int sameContentRef = 0;

        if (m_evaluateInter)
        {
            if (m_refineLevel == 2)
            {
                if (parentCTU.m_predMode[cuGeom.absPartIdx] == MODE_SKIP)
                    skipModes = true;
                if (parentCTU.m_partSize[cuGeom.absPartIdx] == SIZE_2Nx2N)
                    skipRectAmp = true;
            }
            mightSplit &= false;
            minDepth = depth;
        }

        if ((m_limitTU & X265_TU_LIMIT_NEIGH) && cuGeom.log2CUSize >= 4)
            m_maxTUDepth = loadTUDepth(cuGeom, parentCTU);

        SplitData splitData[4];
        splitData[0].initSplitCUData();
        splitData[1].initSplitCUData();
        splitData[2].initSplitCUData();
        splitData[3].initSplitCUData();

        // avoid uninitialize value in below reference
        if (m_param->limitModes)
        {
            md.pred[PRED_2Nx2N].bestME[0][0].mvCost = 0; // L0
            md.pred[PRED_2Nx2N].bestME[0][1].mvCost = 0; // L1
            md.pred[PRED_2Nx2N].sa8dCost = 0;
        }

        if (m_param->bCTUInfo && depth <= parentCTU.m_cuDepth[cuGeom.absPartIdx])
        {
            if (bDecidedDepth && m_additionalCtuInfo[cuGeom.absPartIdx])
                sameContentRef = findSameContentRefCount(parentCTU, cuGeom);
            if (depth < parentCTU.m_cuDepth[cuGeom.absPartIdx])
            {
                mightNotSplit &= bDecidedDepth;
                bCtuInfoCheck = skipRecursion = false;
                skipModes = true;
            }
            else if (mightNotSplit && bDecidedDepth)
            {
                if (m_additionalCtuInfo[cuGeom.absPartIdx])
                {
                    bCtuInfoCheck = skipRecursion = true;
                    md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
                    md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
                    checkMerge2Nx2N_rd0_4(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom);
                    if (!sameContentRef)
                    {
                        if ((m_param->bCTUInfo & 2) && (m_slice->m_pps->bUseDQP && depth <= m_slice->m_pps->maxCuDQPDepth))
                        {
                            qp -= int32_t(0.04 * qp);
                            setLambdaFromQP(parentCTU, qp);
                        }
                        if (m_param->bCTUInfo & 4)
                            skipModes = false;
                    }
                    if (sameContentRef || (!sameContentRef && !(m_param->bCTUInfo & 4)))
                    {
                        if (m_param->rdLevel)
                            skipModes = m_param->bEnableEarlySkip && md.bestMode && md.bestMode->cu.isSkipped(0);
                        if ((m_param->bCTUInfo & 4) && sameContentRef)
                            skipModes = md.bestMode && true;
                    }
                }
                else
                {
                    md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
                    md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
                    checkMerge2Nx2N_rd0_4(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom);
                    if (m_param->rdLevel)
                        skipModes = m_param->bEnableEarlySkip && md.bestMode && md.bestMode->cu.isSkipped(0);
                }
                mightSplit &= !bDecidedDepth;
            }
        }
        if ((m_param->analysisLoadReuseLevel > 1 && m_param->analysisLoadReuseLevel != 10))
        {
            if (mightNotSplit && depth == m_reuseDepth[cuGeom.absPartIdx])
            {
                if (m_reuseModes[cuGeom.absPartIdx] == MODE_SKIP)
                {
                    md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
                    md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
                    checkMerge2Nx2N_rd0_4(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom);

                    skipRecursion = !!m_param->recursionSkipMode && md.bestMode;
                    if (m_param->rdLevel)
                        skipModes = m_param->bEnableEarlySkip && md.bestMode;
                }
                if (m_param->analysisLoadReuseLevel > 4 && m_reusePartSize[cuGeom.absPartIdx] == SIZE_2Nx2N)
                {
                    if (m_reuseModes[cuGeom.absPartIdx] != MODE_INTRA  && m_reuseModes[cuGeom.absPartIdx] != 4)
                    {
                        skipRectAmp = true && !!md.bestMode;
                        chooseMerge = !!m_reuseMergeFlag[cuGeom.absPartIdx] && !!md.bestMode;
                    }
                }
            }
        }
        if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead && m_reuseInterDataCTU) 
        {
            if (mightNotSplit && depth == m_reuseDepth[cuGeom.absPartIdx])
            {
                if (m_reuseModes[cuGeom.absPartIdx] == MODE_SKIP)
                {
                    md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
                    md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
                    checkMerge2Nx2N_rd0_4(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom);

                    skipRecursion = !!m_param->recursionSkipMode && md.bestMode;
                    if (m_param->rdLevel)
                        skipModes = m_param->bEnableEarlySkip && md.bestMode;
                }
            }
        }
        /* Step 1. Evaluate Merge/Skip candidates for likely early-outs, if skip mode was not set above */
        // 1. Evaluate the Merge/Skip candidates first so that later analysis can be terminated early (if skip was not already decided above)
        if ((mightNotSplit && depth >= minDepth && !md.bestMode && !bCtuInfoCheck) || (m_param->bAnalysisType == AVC_INFO && m_param->analysisLoadReuseLevel == 7 && (m_modeFlag[0] || m_modeFlag[1])))
            /* TODO: Re-evaluate if analysis load/save still works */
        {
            /* Compute Merge Cost */
            // Initialize the CUs for merge and skip modes
            md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
            md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
            // Run inter prediction for merge and skip modes
            checkMerge2Nx2N_rd0_4(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom);
            if (m_param->rdLevel)
                skipModes = (m_param->bEnableEarlySkip || m_refineLevel == 2)
                && md.bestMode && md.bestMode->cu.isSkipped(0); // TODO: sa8d threshold per depth
        }
        if (md.bestMode && m_param->recursionSkipMode && !bCtuInfoCheck && !(m_param->bAnalysisType == AVC_INFO && m_param->analysisLoadReuseLevel == 7 && (m_modeFlag[0] || m_modeFlag[1])))
        {
            skipRecursion = md.bestMode->cu.isSkipped(0);
            if (mightSplit && !skipRecursion)
            {
                if (depth >= minDepth && m_param->recursionSkipMode == RDCOST_BASED_RSKIP)
                {
                    if (depth)
                        skipRecursion = recursionDepthCheck(parentCTU, cuGeom, *md.bestMode);
                    if (m_bHD && !skipRecursion && m_param->rdLevel == 2 && md.fencYuv.m_size != MAX_CU_SIZE)
                        skipRecursion = complexityCheckCU(*md.bestMode);
                }
                else if (cuGeom.log2CUSize >= MAX_LOG2_CU_SIZE - 1 && m_param->recursionSkipMode == EDGE_BASED_RSKIP)
                {
                    skipRecursion = complexityCheckCU(*md.bestMode);
                }

            }
        }
        // Check whether the recursive split must be skipped
        if (m_param->bAnalysisType == AVC_INFO && md.bestMode && cuGeom.numPartitions <= 16 && m_param->analysisLoadReuseLevel == 7)
            skipRecursion = true;
        /* Step 2. Evaluate each of the 4 split sub-blocks in series */
        // Evaluate Inter mode for the 4 sub-blocks
        if (mightSplit && !skipRecursion)
        {
            if (bCtuInfoCheck && m_param->bCTUInfo & 2)
                qp = int((1 / 0.96) * qp + 0.5);
            Mode* splitPred = &md.pred[PRED_SPLIT];
            splitPred->initCosts();
            CUData* splitCU = &splitPred->cu;
            splitCU->initSubCU(parentCTU, cuGeom, qp);

            uint32_t nextDepth = depth + 1;
            ModeDepth& nd = m_modeDepth[nextDepth];
            invalidateContexts(nextDepth);
            Entropy* nextContext = &m_rqt[depth].cur;
            int nextQP = qp;
            splitIntra = false;

            for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++)
            {
                const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + subPartIdx);
                if (childGeom.flags & CUGeom::PRESENT)
                {
                    m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx);
                    m_rqt[nextDepth].cur.load(*nextContext);

                    if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth)
                        nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom));
                    // Recurse: inter analysis of this sub-block
                    splitData[subPartIdx] = compressInterCU_rd0_4(parentCTU, childGeom, nextQP);

                    // Save best CU and pred data for this sub CU
                    splitIntra |= nd.bestMode->cu.isIntra(0);
                    splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx);
                    splitPred->addSubCosts(*nd.bestMode);

                    if (m_param->rdLevel)
                        nd.bestMode->reconYuv.copyToPartYuv(splitPred->reconYuv, childGeom.numPartitions * subPartIdx);
                    else
                        nd.bestMode->predYuv.copyToPartYuv(splitPred->predYuv, childGeom.numPartitions * subPartIdx);
                    if (m_param->rdLevel > 1)
                        nextContext = &nd.bestMode->contexts;
                }
                else
                    splitCU->setEmptyPart(childGeom, subPartIdx);
            }
            nextContext->store(splitPred->contexts);

            if (mightNotSplit)
                addSplitFlagCost(*splitPred, cuGeom.depth);
            else if (m_param->rdLevel > 1)
                updateModeCost(*splitPred);
            else
                splitPred->sa8dCost = m_rdCost.calcRdSADCost((uint32_t)splitPred->distortion, splitPred->sa8dBits);
        }
        /* If analysis mode is simple do not Evaluate other modes */
        if (m_param->bAnalysisType == AVC_INFO && m_param->analysisLoadReuseLevel == 7)
        {
            if (m_slice->m_sliceType == P_SLICE)
            {
                if (m_checkMergeAndSkipOnly[0])
                    skipModes = true;
            }
            else
            {
                if (m_checkMergeAndSkipOnly[0] && m_checkMergeAndSkipOnly[1])
                    skipModes = true;
            }
        }
        /* Split CUs
         *   0  1
         *   2  3 */
        uint32_t allSplitRefs = splitData[0].splitRefs | splitData[1].splitRefs | splitData[2].splitRefs | splitData[3].splitRefs;
        /* Step 3. Evaluate ME (2Nx2N, rect, amp) and intra modes at current depth */
        // Evaluate ME and intra modes at the current depth
        if (mightNotSplit && (depth >= minDepth || (m_param->bCTUInfo && !md.bestMode)))
        {
            if (m_slice->m_pps->bUseDQP && depth <= m_slice->m_pps->maxCuDQPDepth && m_slice->m_pps->maxCuDQPDepth != 0)
                setLambdaFromQP(parentCTU, qp);
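            /*
                Check whether skip mode was chosen
                (1) If so, skip inter prediction at the current depth
                (2) If not, run the inter prediction below, which checks the partition modes in order
                    (a) 2Nx2N
                    (b) rectangular partitions
                        (i)   2NxN, Nx2N
                    (c) asymmetric partitions (AMP)
                        (i)   2NxnD, 2NxnU
                        (ii)  nRx2N, nLx2N
            */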
            if (!skipModes)
            {
                uint32_t refMasks[2];
                refMasks[0] = allSplitRefs;
                md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp);
                // 2Nx2N inter prediction
                checkInter_rd0_4(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N, refMasks);

                if (m_param->limitReferences & X265_REF_LIMIT_CU)
                {
                    CUData& cu = md.pred[PRED_2Nx2N].cu;
                    uint32_t refMask = cu.getBestRefIdx(0);
                    allSplitRefs = splitData[0].splitRefs = splitData[1].splitRefs = splitData[2].splitRefs = splitData[3].splitRefs = refMask;
                }
                // Bi-directional 2Nx2N prediction for B slices (not studied here)
                if (m_slice->m_sliceType == B_SLICE)
                {
                    md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom, qp);
                    checkBidir2Nx2N(md.pred[PRED_2Nx2N], md.pred[PRED_BIDIR], cuGeom);
                }

                Mode *bestInter = &md.pred[PRED_2Nx2N];
                // Check whether the rectangular and AMP partitions should be evaluated
                if (!skipRectAmp)
                {
                    /*
                         2NxN partition     Nx2N partition
                          +---+---+          +---+---+
                          |       |          |   |   |
                          +---+---+          +   +   +
                          |       |          |   |   |
                          +---+---+          +---+---+
                    */
                    // Check whether rectangular (non-square) partitions are enabled
                    if (m_param->bEnableRectInter)
                    {
                        // Total cost of splitting into 4 sub-blocks
                        uint64_t splitCost = splitData[0].sa8dCost + splitData[1].sa8dCost + splitData[2].sa8dCost + splitData[3].sa8dCost;
                        uint32_t threshold_2NxN, threshold_Nx2N;
                        /*
                            (1) P slice: use the forward (L0) MV cost
                            (2) B slice: use the average of the forward and backward MV costs
                        */
                        if (m_slice->m_sliceType == P_SLICE)
                        {
                            threshold_2NxN = splitData[0].mvCost[0] + splitData[1].mvCost[0];
                            threshold_Nx2N = splitData[0].mvCost[0] + splitData[2].mvCost[0];
                        }
                        else
                        {
                            threshold_2NxN = (splitData[0].mvCost[0] + splitData[1].mvCost[0]
                                + splitData[0].mvCost[1] + splitData[1].mvCost[1] + 1) >> 1;
                            threshold_Nx2N = (splitData[0].mvCost[0] + splitData[2].mvCost[0]
                                + splitData[0].mvCost[1] + splitData[2].mvCost[1] + 1) >> 1;
                        }
                        /*
                            Logic of the code below:
                            (1) try_2NxN_first == true:  checks 1 and 2 may run (2NxN, then Nx2N)
                            (2) try_2NxN_first == false: checks 2 and 3 may run (Nx2N, then 2NxN)
                        */
                        int try_2NxN_first = threshold_2NxN < threshold_Nx2N;
                        /*
                            Check 1
                            splitCost: cost of splitting into 4 sub-blocks
                            md.pred[PRED_2Nx2N].sa8dCost: cost of predicting with 2Nx2N
                            threshold_2NxN: threshold for trying 2NxN

                            If the inequality below holds, 2NxN may give a lower cost
                        */
                        if (try_2NxN_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxN)
                        {
                            // top half
                            refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* top */
                            // bottom half
                            refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* bot */
                            md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp);
                            // Evaluate the 2NxN inter prediction cost
                            checkInter_rd0_4(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks);
                            if (md.pred[PRED_2NxN].sa8dCost < bestInter->sa8dCost)
                                bestInter = &md.pred[PRED_2NxN];
                        }

                        /*
                            Check 2
                            splitCost: cost of splitting into 4 sub-blocks
                            md.pred[PRED_2Nx2N].sa8dCost: cost of predicting with 2Nx2N
                            threshold_Nx2N: threshold for trying Nx2N

                            If the inequality below holds, Nx2N may give a lower cost
                        */
                        if (splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_Nx2N)
                        {
                            refMasks[0] = splitData[0].splitRefs | splitData[2].splitRefs; /* left */
                            refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* right */
                            md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp);
                            checkInter_rd0_4(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N, refMasks);
                            if (md.pred[PRED_Nx2N].sa8dCost < bestInter->sa8dCost)
                                bestInter = &md.pred[PRED_Nx2N];
                        }

                        // Check 3
                        if (!try_2NxN_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxN)
                        {
                            refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* top */
                            refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* bot */
                            md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp);
                            checkInter_rd0_4(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks);
                            if (md.pred[PRED_2NxN].sa8dCost < bestInter->sa8dCost)
                                bestInter = &md.pred[PRED_2NxN];
                        }
                    }
                    // Check the AMP modes (SIZE_2NxnU, SIZE_2NxnD, SIZE_nLx2N, SIZE_nRx2N)
                    if (m_slice->m_sps->maxAMPDepth > depth)
                    {
                        uint64_t splitCost = splitData[0].sa8dCost + splitData[1].sa8dCost + splitData[2].sa8dCost + splitData[3].sa8dCost;
                        uint32_t threshold_2NxnU, threshold_2NxnD, threshold_nLx2N, threshold_nRx2N;
                        // Derive the thresholds according to the slice type
                        if (m_slice->m_sliceType == P_SLICE)
                        {
                            threshold_2NxnU = splitData[0].mvCost[0] + splitData[1].mvCost[0];
                            threshold_2NxnD = splitData[2].mvCost[0] + splitData[3].mvCost[0];

                            threshold_nLx2N = splitData[0].mvCost[0] + splitData[2].mvCost[0];
                            threshold_nRx2N = splitData[1].mvCost[0] + splitData[3].mvCost[0];
                        }
                        else
                        {
                            threshold_2NxnU = (splitData[0].mvCost[0] + splitData[1].mvCost[0]
                                + splitData[0].mvCost[1] + splitData[1].mvCost[1] + 1) >> 1;
                            threshold_2NxnD = (splitData[2].mvCost[0] + splitData[3].mvCost[0]
                                + splitData[2].mvCost[1] + splitData[3].mvCost[1] + 1) >> 1;

                            threshold_nLx2N = (splitData[0].mvCost[0] + splitData[2].mvCost[0]
                                + splitData[0].mvCost[1] + splitData[2].mvCost[1] + 1) >> 1;
                            threshold_nRx2N = (splitData[1].mvCost[0] + splitData[3].mvCost[0]
                                + splitData[1].mvCost[1] + splitData[3].mvCost[1] + 1) >> 1;
                        }
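                        /*
                            Decide whether to try horizontal and/or vertical AMP partitions:
                            (1) best partSize is 2NxN: try the horizontal AMP modes
                            (2) best partSize is Nx2N: try the vertical AMP modes
                            (3) best partSize is 2Nx2N and the quadtree root has non-zero coefficients: try both
                        */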
                        bool bHor = false, bVer = false;
                        if (bestInter->cu.m_partSize[0] == SIZE_2NxN)
                            bHor = true;
                        else if (bestInter->cu.m_partSize[0] == SIZE_Nx2N)
                            bVer = true;
                        else if (bestInter->cu.m_partSize[0] == SIZE_2Nx2N &&
                            md.bestMode && md.bestMode->cu.getQtRootCbf(0))
                        {
                            bHor = true;
                            bVer = true;
                        }
                        // Try the horizontal AMP partitions
                        if (bHor)
                        {
                            // Check whether 2NxnD goes first, which fixes the evaluation order
                            /*
                                2NxnD                       2NxnU
                                +--+--+--+--+               +--+--+--+--+
                                |           |               |           |   25% top
                                +           +               +--+--+--+--+
                                |           |   75% top     |           |
                                +           +               +           +
                                |           |               |           |   75% bottom
                                +--+--+--+--+               +           +
                                |           |   25% bottom  |           |
                                +--+--+--+--+               +--+--+--+--+
                            */
                            int try_2NxnD_first = threshold_2NxnD < threshold_2NxnU;
                            if (try_2NxnD_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxnD)
                            {
                                refMasks[0] = allSplitRefs;                                    /* 75% top */
                                refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* 25% bot */
                                md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp);
                                // Evaluate 2NxnD
                                checkInter_rd0_4(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks);
                                if (md.pred[PRED_2NxnD].sa8dCost < bestInter->sa8dCost)
                                    bestInter = &md.pred[PRED_2NxnD];
                            }

                            if (splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxnU)
                            {
                                refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* 25% top */
                                refMasks[1] = allSplitRefs;                                    /* 75% bot */
                                md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp);
                                // Evaluate 2NxnU
                                checkInter_rd0_4(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU, refMasks);
                                if (md.pred[PRED_2NxnU].sa8dCost < bestInter->sa8dCost)
                                    bestInter = &md.pred[PRED_2NxnU];
                            }

                            if (!try_2NxnD_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxnD)
                            {
                                refMasks[0] = allSplitRefs;                                    /* 75% top */
                                refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* 25% bot */
                                md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp);
                                checkInter_rd0_4(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks);
                                if (md.pred[PRED_2NxnD].sa8dCost < bestInter->sa8dCost)
                                    bestInter = &md.pred[PRED_2NxnD];
                            }
                        }
                        // Try the vertical AMP partitions
                        if (bVer)
                        {	
                            /*
                                nRx2N            nLx2N
                                 75% left         25% left
                                +--+--+--+--+    +--+--+--+--+
                                |        |  |    |  |        |
                                +        +  +    +  +        +
                                |        |  |    |  |        |
                                +        +  +    +  +        +
                                |        |  |    |  |        |
                                +        +  +    +  +        +
                                |        |  |    |  |        |
                                +--+--+--+--+    +--+--+--+--+
                                      25% right     75% right
                            */
                            int try_nRx2N_first = threshold_nRx2N < threshold_nLx2N;
                            if (try_nRx2N_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_nRx2N)
                            {
                                refMasks[0] = allSplitRefs;                                    /* 75% left  */
                                refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* 25% right */
                                md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp);
                                checkInter_rd0_4(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks);
                                if (md.pred[PRED_nRx2N].sa8dCost < bestInter->sa8dCost)
                                    bestInter = &md.pred[PRED_nRx2N];
                            }

                            if (splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_nLx2N)
                            {
                                refMasks[0] = splitData[0].splitRefs | splitData[2].splitRefs; /* 25% left  */
                                refMasks[1] = allSplitRefs;                                    /* 75% right */
                                md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp);
                                checkInter_rd0_4(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N, refMasks);
                                if (md.pred[PRED_nLx2N].sa8dCost < bestInter->sa8dCost)
                                    bestInter = &md.pred[PRED_nLx2N];
                            }

                            if (!try_nRx2N_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_nRx2N)
                            {
                                refMasks[0] = allSplitRefs;                                    /* 75% left  */
                                refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* 25% right */
                                md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp);
                                checkInter_rd0_4(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks);
                                if (md.pred[PRED_nRx2N].sa8dCost < bestInter->sa8dCost)
                                    bestInter = &md.pred[PRED_nRx2N];
                            }
                        }
                    }
                }
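                /*
                    Decide whether to try Intra mode; all of the following must hold:
                    (1) the slice is not a B slice, or intra is allowed in B slices (bIntraInBFrames)
                    (2) the CU size is not 64x64 (log2CUSize != MAX_LOG2_CU_SIZE)
                    (3) it is not the case that bit 2 of bCTUInfo (value 4) is set while bCtuInfoCheck is true
                        (not studied here; bCTUInfo defaults to 0). bCtuInfoCheck indicates whether the
                        CTU-info-based adjustment of the analysis is active for this CU
                */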
                bool bTryIntra = (m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames) && cuGeom.log2CUSize != MAX_LOG2_CU_SIZE && !((m_param->bCTUInfo & 4) && bCtuInfoCheck);
                // rdLevel defaults to 3
                if (m_param->rdLevel >= 3)
                {
                    /* Calculate RD cost of best inter option */
                    if ((!m_bChromaSa8d && (m_csp != X265_CSP_I400)) || (m_frame->m_fencPic->m_picCsp == X265_CSP_I400 && m_csp != X265_CSP_I400)) /* When m_bChromaSa8d is enabled, chroma MC has already been done */
                    {
                        uint32_t numPU = bestInter->cu.getNumPartInter(0);
                        for (uint32_t puIdx = 0; puIdx < numPU; puIdx++)
                        {
                            PredictionUnit pu(bestInter->cu, cuGeom, puIdx);
                            motionCompensation(bestInter->cu, pu, bestInter->predYuv, false, true);
                        }
                    }
                    // Merge was not preselected (chooseMerge is false)
                    if (!chooseMerge)
                    {
                        // Encode the mode chosen above and compute its RD cost
                        encodeResAndCalcRdInterCU(*bestInter, cuGeom);
                        checkBestMode(*bestInter, depth);

                        /* If BIDIR is available and within 17/16 of best inter option, choose by RDO */
                        // BIDIR's sa8d cost is at most 17/16 of the best inter mode's (presumably an empirical factor)
                        if (m_slice->m_sliceType == B_SLICE && md.pred[PRED_BIDIR].sa8dCost != MAX_INT64 &&
                            md.pred[PRED_BIDIR].sa8dCost * 16 <= bestInter->sa8dCost * 17)
                        {
                            uint32_t numPU = md.pred[PRED_BIDIR].cu.getNumPartInter(0);
                            if (m_frame->m_fencPic->m_picCsp == X265_CSP_I400 && m_csp != X265_CSP_I400)
                                for (uint32_t puIdx = 0; puIdx < numPU; puIdx++)
                                {
                                    PredictionUnit pu(md.pred[PRED_BIDIR].cu, cuGeom, puIdx);
                                    // Motion compensation for the BIDIR mode
                                    motionCompensation(md.pred[PRED_BIDIR].cu, pu, md.pred[PRED_BIDIR].predYuv, true, true);
                                }
                            // Compute the RD cost of the BIDIR mode
                            encodeResAndCalcRdInterCU(md.pred[PRED_BIDIR], cuGeom);
                            checkBestMode(md.pred[PRED_BIDIR], depth);
                        }
                    }
                    // Try intra mode
                    if ((bTryIntra && md.bestMode->cu.getQtRootCbf(0)) ||
                        md.bestMode->sa8dCost == MAX_INT64)
                    {
                        if (!m_param->limitReferences || splitIntra)
                        {
                            ProfileCounter(parentCTU, totalIntraCU[cuGeom.depth]);
                            md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
                            checkIntraInInter(md.pred[PRED_INTRA], cuGeom);
                            encodeIntraInInter(md.pred[PRED_INTRA], cuGeom);
                            checkBestMode(md.pred[PRED_INTRA], depth);
                        }
                        else
                        {
                            ProfileCounter(parentCTU, skippedIntraCU[cuGeom.depth]);
                        }
                    }
                }
                else
                {
                    /* SA8D choice between merge/skip, inter, bidir, and intra */
                    if (!md.bestMode || bestInter->sa8dCost < md.bestMode->sa8dCost)
                        md.bestMode = bestInter;

                    if (m_slice->m_sliceType == B_SLICE &&
                        md.pred[PRED_BIDIR].sa8dCost < md.bestMode->sa8dCost)
                        md.bestMode = &md.pred[PRED_BIDIR];

                    if (bTryIntra || md.bestMode->sa8dCost == MAX_INT64)
                    {
                        if (!m_param->limitReferences || splitIntra)
                        {
                            ProfileCounter(parentCTU, totalIntraCU[cuGeom.depth]);
                            md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
                            checkIntraInInter(md.pred[PRED_INTRA], cuGeom);
                            if (md.pred[PRED_INTRA].sa8dCost < md.bestMode->sa8dCost)
                                md.bestMode = &md.pred[PRED_INTRA];
                        }
                        else
                        {
                            ProfileCounter(parentCTU, skippedIntraCU[cuGeom.depth]);
                        }
                    }

                    /* finally code the best mode selected by SA8D costs:
                     * RD level 2 - fully encode the best mode
                     * RD level 1 - generate recon pixels
                     * RD level 0 - generate chroma prediction */
                    if (md.bestMode->cu.m_mergeFlag[0] && md.bestMode->cu.m_partSize[0] == SIZE_2Nx2N)
                    {
                        /* prediction already generated for this CU, and if rd level
                         * is not 0, it is already fully encoded */
                    }
                    else if (md.bestMode->cu.isInter(0))
                    {
                        uint32_t numPU = md.bestMode->cu.getNumPartInter(0);
                        if (m_csp != X265_CSP_I400)
                        {
                            for (uint32_t puIdx = 0; puIdx < numPU; puIdx++)
                            {
                                PredictionUnit pu(md.bestMode->cu, cuGeom, puIdx);
                                motionCompensation(md.bestMode->cu, pu, md.bestMode->predYuv, false, true);
                            }
                        }
                        if (m_param->rdLevel == 2)
                            encodeResAndCalcRdInterCU(*md.bestMode, cuGeom);
                        else if (m_param->rdLevel == 1)
                        {
                            /* generate recon pixels with no rate distortion considerations */
                            CUData& cu = md.bestMode->cu;

                            uint32_t tuDepthRange[2];
                            cu.getInterTUQtDepthRange(tuDepthRange, 0);
                            m_rqt[cuGeom.depth].tmpResiYuv.subtract(*md.bestMode->fencYuv, md.bestMode->predYuv, cuGeom.log2CUSize, m_frame->m_fencPic->m_picCsp);
                            residualTransformQuantInter(*md.bestMode, cuGeom, 0, 0, tuDepthRange);
                            if (cu.getQtRootCbf(0))
                                md.bestMode->reconYuv.addClip(md.bestMode->predYuv, m_rqt[cuGeom.depth].tmpResiYuv, cu.m_log2CUSize[0], m_frame->m_fencPic->m_picCsp);
                            else
                            {
                                md.bestMode->reconYuv.copyFromYuv(md.bestMode->predYuv);
                                if (cu.m_mergeFlag[0] && cu.m_partSize[0] == SIZE_2Nx2N)
                                    cu.setPredModeSubParts(MODE_SKIP);
                            }
                        }
                    }
                    else
                    {
                        if (m_param->rdLevel == 2)
                            encodeIntraInInter(*md.bestMode, cuGeom);
                        else if (m_param->rdLevel == 1)
                        {
                            /* generate recon pixels with no rate distortion considerations */
                            CUData& cu = md.bestMode->cu;

                            uint32_t tuDepthRange[2];
                            cu.getIntraTUQtDepthRange(tuDepthRange, 0);

                            residualTransformQuantIntra(*md.bestMode, cuGeom, 0, 0, tuDepthRange);
                            if (m_csp != X265_CSP_I400)
                            {
                                getBestIntraModeChroma(*md.bestMode, cuGeom);
                                residualQTIntraChroma(*md.bestMode, cuGeom, 0, 0);
                            }
                            md.bestMode->reconYuv.copyFromPicYuv(reconPic, cu.m_cuAddr, cuGeom.absPartIdx); // TODO:
                        }
                    }
                }
            } // !earlyskip

            if (m_bTryLossless)
                tryLossless(cuGeom);

            if (mightSplit)
                addSplitFlagCost(*md.bestMode, cuGeom.depth);
        }

        if (mightSplit && !skipRecursion)
        {
            Mode* splitPred = &md.pred[PRED_SPLIT];
            if (!md.bestMode)
                md.bestMode = splitPred;
            else if (m_param->rdLevel > 1)
                checkBestMode(*splitPred, cuGeom.depth);
            else if (splitPred->sa8dCost < md.bestMode->sa8dCost)
                md.bestMode = splitPred;

            checkDQPForSplitPred(*md.bestMode, cuGeom);
        }

        /* determine which motion references the parent CU should search */
        splitCUData.initSplitCUData();

        if (m_param->limitReferences & X265_REF_LIMIT_DEPTH)
        {
            if (md.bestMode == &md.pred[PRED_SPLIT])
                splitCUData.splitRefs = allSplitRefs;
            else
            {
                /* use best merge/inter mode, in case of intra use 2Nx2N inter references */
                CUData& cu = md.bestMode->cu.isIntra(0) ? md.pred[PRED_2Nx2N].cu : md.bestMode->cu;
                uint32_t numPU = cu.getNumPartInter(0);
                for (uint32_t puIdx = 0, subPartIdx = 0; puIdx < numPU; puIdx++, subPartIdx += cu.getPUOffset(puIdx, 0))
                    splitCUData.splitRefs |= cu.getBestRefIdx(subPartIdx);
            }
        }

        if (m_param->limitModes)
        {
            splitCUData.mvCost[0] = md.pred[PRED_2Nx2N].bestME[0][0].mvCost; // L0
            splitCUData.mvCost[1] = md.pred[PRED_2Nx2N].bestME[0][1].mvCost; // L1
            splitCUData.sa8dCost = md.pred[PRED_2Nx2N].sa8dCost;
        }
        // The best mode is skip: update the running average cost in the per-CTU statistics
        if (mightNotSplit && md.bestMode->cu.isSkipped(0))
        {
            FrameData& curEncData = *m_frame->m_encData;
            FrameData::RCStatCU& cuStat = curEncData.m_cuStat[parentCTU.m_cuAddr];
            uint64_t temp = cuStat.avgCost[depth] * cuStat.count[depth];
            cuStat.count[depth] += 1;
            cuStat.avgCost[depth] = (temp + md.bestMode->rdCost) / cuStat.count[depth];
        }

        /* Copy best data to encData CTU and recon */
        // Copy the best mode's CU data to the picture and its reconstruction to the recon buffer
        md.bestMode->cu.copyToPic(depth);
        if (m_param->rdLevel)
            md.bestMode->reconYuv.copyToPicYuv(reconPic, cuAddr, cuGeom.absPartIdx);

        if ((m_limitTU & X265_TU_LIMIT_NEIGH) && cuGeom.log2CUSize >= 4)
        {
            if (mightNotSplit)
            {
                CUData* ctu = md.bestMode->cu.m_encData->getPicCTU(parentCTU.m_cuAddr);
                int8_t maxTUDepth = -1;
                for (uint32_t i = 0; i < cuGeom.numPartitions; i++)
                    maxTUDepth = X265_MAX(maxTUDepth, md.bestMode->cu.m_tuDepth[i]);
                ctu->m_refTuDepth[cuGeom.geomRecurId] = maxTUDepth;
            }
        }
    }
    else
    {
        // ...
    }

    return splitCUData;
}
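
The BIDIR check in the listing above compares two sa8d costs against a 17/16 ratio using integer multiplications only. A minimal sketch of that comparison follows; `withinSeventeenSixteenths` is a hypothetical helper name used for illustration and is not part of x265.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when candidateCost <= bestCost * 17 / 16.
// Cross-multiplying avoids division and floating point, mirroring the
// condition `sa8dCost * 16 <= bestInter->sa8dCost * 17` in the listing.
// Assumes both costs are small enough that the products do not overflow.
bool withinSeventeenSixteenths(uint64_t candidateCost, uint64_t bestCost)
{
    return candidateCost * 16 <= bestCost * 17;
}
```

For example, with a best inter cost of 1600, a BIDIR cost of 1700 still qualifies for the full RD check (1700 * 16 = 27200 = 1600 * 17), while 1701 does not.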

2.1 Checking Merge/Skip Mode (checkMerge2Nx2N_rd0_4)

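This function evaluates the cost of the merge and skip modes. Its main workflow is:
(1) Build the merge candidate list (getInterMergeCandidates)
(2) Evaluate every merge candidate and pick the best one (motionCompensation is run for each candidate, and the candidates are compared by their sa8d distortion plus merge-index bits)
(3) For the best merge candidate, compute the cost of not coding the residual (encodeResAndCalcRdSkipCU, SSE-based)
(4) For the best merge candidate, compute the cost of coding the residual (encodeResAndCalcRdInterCU, SSE-based)

PS: Note that "Skip mode" here means taking the best merge candidate and not coding its residual.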
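
The four steps can be modeled in a few lines. The sketch below is a simplified stand-in rather than x265 code: `CandidateCost`, `Decision`, `chooseMergeOrSkip`, and the integer `lambda` are hypothetical names, the full RD costs are stored per candidate only for brevity (x265 computes them just for the best candidate), and the merge index is assumed to be costed as a truncated-unary code, which is what the name getTUBits suggests.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified measurements for one merge candidate.
struct CandidateCost
{
    uint32_t sa8d;        // sa8d distortion of the motion-compensated prediction
    uint64_t rdCostSkip;  // full RD cost if the residual is not coded
    uint64_t rdCostCoded; // full RD cost if the residual is coded
};

// Truncated-unary bit count for index idx out of numIdx candidates:
// idx + 1 bits, except for the last index, which needs no terminating bit.
uint32_t truncatedUnaryBits(uint32_t idx, uint32_t numIdx)
{
    return idx + (idx < numIdx - 1 ? 1 : 0);
}

struct Decision
{
    int  bestCand; // index of the chosen merge candidate, -1 if none
    bool skip;     // true if the residual is left uncoded
};

// Steps (2)-(4): pick the candidate with the lowest sa8d + lambda * bits,
// then choose between skip and coded residual using the full RD costs.
// `lambda` is an illustrative integer weight, not x265's fixed-point lambda.
Decision chooseMergeOrSkip(const std::vector<CandidateCost>& cands, uint32_t lambda)
{
    Decision d{ -1, false };
    uint64_t bestCost = UINT64_MAX;
    const uint32_t n = static_cast<uint32_t>(cands.size());
    for (uint32_t i = 0; i < n; ++i)
    {
        const uint64_t cost = cands[i].sa8d + uint64_t(lambda) * truncatedUnaryBits(i, n);
        if (cost < bestCost)
        {
            bestCost = cost;
            d.bestCand = static_cast<int>(i);
        }
    }
    // The coded variant wins only if it is strictly cheaper, as in
    // `tempPred->rdCost < bestPred->rdCost ? tempPred : bestPred`.
    if (d.bestCand >= 0)
        d.skip = cands[d.bestCand].rdCostSkip <= cands[d.bestCand].rdCostCoded;
    return d;
}
```

The cheap sa8d pass narrows the list down to one candidate, so the expensive transform and entropy-coding work of the two RD evaluations is spent only once.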

/* sets md.bestMode if a valid merge candidate is found, else leaves it NULL */
void Analysis::checkMerge2Nx2N_rd0_4(Mode& skip, Mode& merge, const CUGeom& cuGeom)
{
    uint32_t depth = cuGeom.depth;
    ModeDepth& md = m_modeDepth[depth];
    Yuv *fencYuv = &md.fencYuv;

    /* Note that these two Mode instances are named MERGE and SKIP but they may
     * hold the reverse when the function returns. We toggle between the two modes */
    Mode* tempPred = &merge;
    Mode* bestPred = &skip;

    X265_CHECK(m_slice->m_sliceType != I_SLICE, "Evaluating merge in I slice\n");

    tempPred->initCosts();
    tempPred->cu.setPartSizeSubParts(SIZE_2Nx2N);
    tempPred->cu.setPredModeSubParts(MODE_INTER);
    tempPred->cu.m_mergeFlag[0] = true;

    bestPred->initCosts();
    bestPred->cu.setPartSizeSubParts(SIZE_2Nx2N);
    bestPred->cu.setPredModeSubParts(MODE_INTER);
    bestPred->cu.m_mergeFlag[0] = true;

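    MVField candMvField[MRG_MAX_NUM_CANDS][2]; // double length for mv of both lists; holds the candidate MVs
    uint8_t candDir[MRG_MAX_NUM_CANDS]; // prediction direction of each candidate (L0, L1, or both)
    // 1. Build the merge candidate list. MRG_MAX_NUM_CANDS = 5, but the number actually used may be smaller (e.g. 3), depending on the encoder configuration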
    uint32_t numMergeCand = tempPred->cu.getInterMergeCandidates(0, 0, candMvField, candDir);
    PredictionUnit pu(merge.cu, cuGeom, 0);

    bestPred->sa8dCost = MAX_INT64;
    int bestSadCand = -1;
    int sizeIdx = cuGeom.log2CUSize - 2;
    int safeX, maxSafeMv;
    if (m_param->bIntraRefresh && m_slice->m_sliceType == P_SLICE)
    {
        safeX = m_slice->m_refFrameList[0][0]->m_encData->m_pir.pirEndCol * m_param->maxCUSize - 3;
        maxSafeMv = (safeX - tempPred->cu.m_cuPelX) * 4;
    }
    // 2. Evaluate each merge candidate
    for (uint32_t i = 0; i < numMergeCand; ++i)
    {
        // Whether frame-level parallelism is enabled
        if (m_bFrameParallel)
        {
            // Parallel slices bound check
            if (m_param->maxSlices > 1)
            {
                // NOTE: First row in slice can't negative
                if (X265_MIN(candMvField[i][0].mv.y, candMvField[i][1].mv.y) < m_sliceMinY)
                    continue;

                // Last row in slice can't reference beyond bound since it is another slice area
                // TODO: we may beyond bound in future since these area have a chance to finish because we use parallel slices. Necessary prepare research on load balance
                if (X265_MAX(candMvField[i][0].mv.y, candMvField[i][1].mv.y) > m_sliceMaxY)
                    continue;
            }

            if (candMvField[i][0].mv.y >= (m_param->searchRange + 1) * 4 ||
                candMvField[i][1].mv.y >= (m_param->searchRange + 1) * 4)
                continue;
        }

        if (m_param->bIntraRefresh && m_slice->m_sliceType == P_SLICE &&
            tempPred->cu.m_cuPelX / m_param->maxCUSize < m_frame->m_encData->m_pir.pirEndCol &&
            candMvField[i][0].mv.x > maxSafeMv)
            // skip merge candidates which reference beyond safe reference area
            continue;
        // The merge candidate index is stored in the L0 MVP index field
        tempPred->cu.m_mvpIdx[0][0] = (uint8_t)i; // merge candidate ID is stored in L0 MVP idx
        X265_CHECK(m_slice->m_sliceType == B_SLICE || !(candDir[i] & 0x10), " invalid merge for P slice\n");
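        tempPred->cu.m_interDir[0] = candDir[i]; // prediction direction of this candidate
        tempPred->cu.m_mv[0][0] = candMvField[i][0].mv; // L0 (forward) MV; in candMvField the second index selects the list: 0 = L0, 1 = L1
        tempPred->cu.m_mv[1][0] = candMvField[i][1].mv; // L1 (backward) MV
        tempPred->cu.m_refIdx[0][0] = (int8_t)candMvField[i][0].refIdx; // L0 reference index
        tempPred->cu.m_refIdx[1][0] = (int8_t)candMvField[i][1].refIdx; // L1 reference index
        // Motion compensation (fetch the prediction block addressed by the MV)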
        motionCompensation(tempPred->cu, pu, tempPred->predYuv, true, m_bChromaSa8d && (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400));

        tempPred->sa8dBits = getTUBits(i, numMergeCand);
        // Compute the sa8d distortion of the prediction block produced by motion compensation
        tempPred->distortion = primitives.cu[sizeIdx].sa8d(fencYuv->m_buf[0], fencYuv->m_size, tempPred->predYuv.m_buf[0], tempPred->predYuv.m_size);
        if (m_bChromaSa8d && (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400))
        {
            tempPred->distortion += primitives.chroma[m_csp].cu[sizeIdx].sa8d(fencYuv->m_buf[1], fencYuv->m_csize, tempPred->predYuv.m_buf[1], tempPred->predYuv.m_csize);
            tempPred->distortion += primitives.chroma[m_csp].cu[sizeIdx].sa8d(fencYuv->m_buf[2], fencYuv->m_csize, tempPred->predYuv.m_buf[2], tempPred->predYuv.m_csize);
        }
        // Combine distortion and merge-index bits into the sa8d-based cost
        tempPred->sa8dCost = m_rdCost.calcRdSADCost((uint32_t)tempPred->distortion, tempPred->sa8dBits);
        // Check whether this candidate has the lowest cost so far
        if (tempPred->sa8dCost < bestPred->sa8dCost)
        {
            bestSadCand = i;
            std::swap(tempPred, bestPred);
        }
    }

    /* force mode decision to take inter or intra */
    if (bestSadCand < 0)
        return;

    /* calculate the motion compensation for chroma for the best mode selected */
    // Chroma motion compensation for the best candidate
    if ((!m_bChromaSa8d && (m_csp != X265_CSP_I400)) || (m_frame->m_fencPic->m_picCsp == X265_CSP_I400 && m_csp != X265_CSP_I400)) /* Chroma MC was done above */
        motionCompensation(bestPred->cu, pu, bestPred->predYuv, false, true);

    if (m_param->rdLevel)
    {
        if (m_param->bLossless)
            bestPred->rdCost = MAX_INT64;
        else // 3. For the best merge candidate, compute the cost of skipping (SSE-based); skip does not code the residual
            encodeResAndCalcRdSkipCU(*bestPred);

        /* Encode with residual */
        tempPred->cu.m_mvpIdx[0][0] = (uint8_t)bestSadCand;
        tempPred->cu.setPUInterDir(candDir[bestSadCand], 0, 0);
        tempPred->cu.setPUMv(0, candMvField[bestSadCand][0].mv, 0, 0);
        tempPred->cu.setPUMv(1, candMvField[bestSadCand][1].mv, 0, 0);
        tempPred->cu.setPURefIdx(0, (int8_t)candMvField[bestSadCand][0].refIdx, 0, 0);
        tempPred->cu.setPURefIdx(1, (int8_t)candMvField[bestSadCand][1].refIdx, 0, 0);
        tempPred->sa8dCost = bestPred->sa8dCost;
        tempPred->sa8dBits = bestPred->sa8dBits;
        tempPred->predYuv.copyFromYuv(bestPred->predYuv);
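        // 4. Re-evaluate bestSadCand with SSE-based distortion, this time actually coding the residual
        encodeResAndCalcRdInterCU(*tempPred, cuGeom);
        /*
            Pick the better of the two:
            (1) bestPred holds the skip mode
            (2) tempPred holds the bestSadCand merge candidate (after re-evaluation with SSE)
        */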
        md.bestMode = tempPred->rdCost < bestPred->rdCost ? tempPred : bestPred;
    }
    else
        md.bestMode = bestPred;

    /* broadcast sets of MV field data */
    // Store the motion data of the best mode
    md.bestMode->cu.setPUInterDir(candDir[bestSadCand], 0, 0);
    md.bestMode->cu.setPUMv(0, candMvField[bestSadCand][0].mv, 0, 0);
    md.bestMode->cu.setPUMv(1, candMvField[bestSadCand][1].mv, 0, 0);
    md.bestMode->cu.setPURefIdx(0, (int8_t)candMvField[bestSadCand][0].refIdx, 0, 0);
    md.bestMode->cu.setPURefIdx(1, (int8_t)candMvField[bestSadCand][1].refIdx, 0, 0);
    checkDQP(*md.bestMode, cuGeom);
}

2.1.1 Building the Merge Candidate List (getInterMergeCandidates)
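
Before reading the function, it may help to see how the nibble-packed partTable entries it relies on turn into a PU rectangle. The sketch below reproduces the same shifts and masks in isolation; `PuRect` and `decodePartEntry` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type; x265 keeps these values in plain ints
// (xP, yP, nPSW, nPSH).
struct PuRect { int x, y, width, height; };

// Decode one (size, offset) pair packed as in partTable: the high nibble is
// the X component and the low nibble the Y component, both expressed in
// units of cuSize / 4.
PuRect decodePartEntry(uint32_t sizeCode, uint32_t offsetCode, int cuSize)
{
    PuRect r;
    r.width  = (static_cast<int>(sizeCode >> 4)    * cuSize) >> 2;
    r.height = (static_cast<int>(sizeCode & 0xF)   * cuSize) >> 2;
    r.x      = (static_cast<int>(offsetCode >> 4)  * cuSize) >> 2;
    r.y      = (static_cast<int>(offsetCode & 0xF) * cuSize) >> 2;
    return r;
}
```

For a 32x32 CU, the second PU of SIZE_2NxnU is stored as { 0x43, 0x01 }, which decodes to a 32x24 rectangle at offset (0, 8): the lower three quarters of the CU.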

/* Construct list of merging candidates, returns count */
uint32_t CUData::getInterMergeCandidates(uint32_t absPartIdx, uint32_t puIdx, MVField(*candMvField)[2], uint8_t* candDir) const
{
    uint32_t absPartAddr = m_absIdxInCTU + absPartIdx;
    const bool isInterB = m_slice->isInterB();

    const uint32_t maxNumMergeCand = m_slice->m_maxNumMergeCand;

    for (uint32_t i = 0; i < maxNumMergeCand; ++i)
    {
        candMvField[i][0].mv = 0;
        candMvField[i][1].mv = 0;
        candMvField[i][0].refIdx = REF_NOT_VALID;
        candMvField[i][1].refIdx = REF_NOT_VALID;
    }

    /* calculate the location of upper-left corner pixel and size of the current PU */
    int xP, yP, nPSW, nPSH;

    int cuSize = 1 << m_log2CUSize[0];
    int partMode = m_partSize[0];
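    /*
        // Partition table.
        It has 3 dimensions:
        (1) The 1st dimension is the partition mode, e.g. SIZE_2Nx2N; its length is 8
        (2) The 2nd dimension is the PU index within that partition mode; its length is 4
        (3) The 3rd dimension has length 2: the first entry is the PU size, the second is the PU offset.
            Both pack X in the high nibble and Y in the low nibble, in units of 1/4 of the CU size

        Examples:
        (1) partTable[0][0][0] = 0x44 means the PU spans 4/4 of the CU in both directions (the whole CU);
            partTable[0][0][1] = 0x00 means it has no offset
        (2) partTable[3][0][0] = 0x22 is the case of four sub-blocks and describes the first 2/4 x 2/4 block;
            partTable[3][1][1] = 0x20 says the second block has a horizontal offset of 2 and a vertical offset of 0
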
		const uint32_t partTable[8][4][2] =
		{
		    //        XY
		    { { 0x44, 0x00 }, { 0x00, 0x00 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2Nx2N.
		    { { 0x42, 0x00 }, { 0x42, 0x02 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2NxN.
		    { { 0x24, 0x00 }, { 0x24, 0x20 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_Nx2N.
		    { { 0x22, 0x00 }, { 0x22, 0x20 }, { 0x22, 0x02 }, { 0x22, 0x22 } }, // SIZE_NxN.
		    { { 0x41, 0x00 }, { 0x43, 0x01 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2NxnU.
		    { { 0x43, 0x00 }, { 0x41, 0x03 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2NxnD.
		    { { 0x14, 0x00 }, { 0x34, 0x10 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_nLx2N.
		    { { 0x34, 0x00 }, { 0x14, 0x30 }, { 0x00, 0x00 }, { 0x00, 0x00 } }  // SIZE_nRx2N.
		};
	*/
    int tmp = partTable[partMode][puIdx][0]; // size
    nPSW = ((tmp >> 4) * cuSize) >> 2;  // PU width
    nPSH = ((tmp & 0xF) * cuSize) >> 2; // PU height

    tmp = partTable[partMode][puIdx][1]; // offset
    xP = ((tmp >> 4) * cuSize) >> 2;     // x offset (position relative to the CU)
    yP = ((tmp & 0xF) * cuSize) >> 2;    // y offset

    uint32_t count = 0;
    // Derive the index of the bottom-left partition from the partition size
    uint32_t partIdxLT, partIdxRT, partIdxLB = deriveLeftBottomIdx(puIdx);
    PartSize curPS = (PartSize)m_partSize[absPartIdx];
    
	/*
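        Building the merge candidate list (PU sizes in the figures are not fixed; only relative positions are shown). The list holds at most 5 candidates
        (1) Spatial candidates = { A1, B1, B0, A0, B2 }, checked in this order; at most 4 are selected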
		+--+           +--+--+
		|B2|		   |B1|B0|
		+--+--+--+--+--+--+--+
		   |           |
		   +           +
		   |  Current  |
		   +    PU     +
		   |           |
		+--+           +
		|A1|           |
		+--+--+--+--+--+
		|A0|
		+--+

        Cases that need special handling, for PU 2 below
        Case 1: A1's motion info must be excluded        Case 2: B1's motion info must be excluded
		            +--+        +--+--+            +--+--+--+--+
		  		    |B2|        |B1|B0|		       |           |
		   +--+--+--+--+--+--+--+--+--+         +--+    PU 1   +--+
		   |           |           |            |B2|           |B0|
		   +           +           +			+--+--+--+--+--+--+
		   |           |           |               |           |
		   +    PU 1   +    PU 2   +			+--+    PU 2   +
		   |           |           |			|A1|           |
		   +           +           +			+--+--+--+--+--+
		   |           |           |            |A0|
		   +--+--+--+--+--+--+--+--+            +--+
		            |A0|
					+--+
		
        (2) Temporal candidate = { H, or C3 if H is not available }
		     Current PU
		+----+----+----+----+
		|         |         |
		+         +         +
		|         |         |
		+----+----+----+----+
		|         | C3 |    |
		+         +----+    +
		|         |         |
		+----+----+----+----+----+
						    |  H |
							+----+
	*/
    // left
    uint32_t leftPartIdx = 0;
    const CUData* cuLeft = getPULeft(leftPartIdx, partIdxLB);
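    // Check whether A1 is available
    bool isAvailableA1 = cuLeft &&
        /*
            isDiffMER() checks whether the current PU and the spatial neighbor lie in the same merge region
            (the neighbor is usable only when the regions differ)
            (1) neighbor x = xP - 1, neighbor y = yP + nPSH - 1
            (2) current x = xP, current y = yP
            As I understand it, x = xP - 1 is not the top-left x of the neighboring block but a position inside the neighboring region (otherwise the coordinates do not seem to match up)
        */
        cuLeft->isDiffMER(xP - 1, yP + nPSH - 1, xP, yP) &&
        // Check for case 1; in that case A1's motion info must not be used
        !(puIdx == 1 && (curPS == SIZE_Nx2N || curPS == SIZE_nLx2N || curPS == SIZE_nRx2N)) &&
        cuLeft->isInter(leftPartIdx);
    // If A1 is available, fetch its direction and MVs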
    if (isAvailableA1)
    {
        // get Inter Dir
        candDir[count] = cuLeft->m_interDir[leftPartIdx];
        // get Mv from Left
        cuLeft->getMvField(cuLeft, leftPartIdx, 0, candMvField[count][0]);
        if (isInterB)
            cuLeft->getMvField(cuLeft, leftPartIdx, 1, candMvField[count][1]);

        if (++count == maxNumMergeCand)
            return maxNumMergeCand;
    }
    // Derive partIdxLT and partIdxRT
    deriveLeftRightTopIdx(puIdx, partIdxLT, partIdxRT);

    // above
    uint32_t abovePartIdx = 0;
    const CUData* cuAbove = getPUAbove(abovePartIdx, partIdxRT);
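    // Check whether B1 is available
    bool isAvailableB1 = cuAbove &&
        /*
            Check whether the current PU and the spatial neighbor lie in the same merge region
            (1) neighbor x = xP + nPSW - 1, neighbor y = yP - 1
            (2) current x = xP, current y = yP
        */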
        cuAbove->isDiffMER(xP + nPSW - 1, yP - 1, xP, yP) &&
        !(puIdx == 1 && (curPS == SIZE_2NxN || curPS == SIZE_2NxnU || curPS == SIZE_2NxnD)) &&
        cuAbove->isInter(abovePartIdx);
    if (isAvailableB1 && (!isAvailableA1 || !cuLeft->hasEqualMotion(leftPartIdx, *cuAbove, abovePartIdx)))
    {
        // get Inter Dir
        candDir[count] = cuAbove->m_interDir[abovePartIdx];
        // get Mv from Above
        cuAbove->getMvField(cuAbove, abovePartIdx, 0, candMvField[count][0]);
        if (isInterB)
            cuAbove->getMvField(cuAbove, abovePartIdx, 1, candMvField[count][1]);

        if (++count == maxNumMergeCand)
            return maxNumMergeCand;
    }

    // above right
    uint32_t aboveRightPartIdx = 0;
    const CUData* cuAboveRight = getPUAboveRight(aboveRightPartIdx, partIdxRT);
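    // Check whether B0 is available
    bool isAvailableB0 = cuAboveRight &&
        /*
            Check whether the current PU and the spatial neighbor lie in the same merge region
            (1) neighbor x = xP + nPSW, neighbor y = yP - 1
            (2) current x = xP, current y = yP
        */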
        cuAboveRight->isDiffMER(xP + nPSW, yP - 1, xP, yP) &&
        cuAboveRight->isInter(aboveRightPartIdx);
    if (isAvailableB0 && (!isAvailableB1 || !cuAbove->hasEqualMotion(abovePartIdx, *cuAboveRight, aboveRightPartIdx)))
    {
        // get Inter Dir
        candDir[count] = cuAboveRight->m_interDir[aboveRightPartIdx];
        // get Mv from Above Right
        cuAboveRight->getMvField(cuAboveRight, aboveRightPartIdx, 0, candMvField[count][0]);
        if (isInterB)
            cuAboveRight->getMvField(cuAboveRight, aboveRightPartIdx, 1, candMvField[count][1]);

        if (++count == maxNumMergeCand)
            return maxNumMergeCand;
    }

    // left bottom
    uint32_t leftBottomPartIdx = 0;
    const CUData* cuLeftBottom = this->getPUBelowLeft(leftBottomPartIdx, partIdxLB);
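    // Check whether A0 is available
    bool isAvailableA0 = cuLeftBottom &&
        /*
            Check whether the current PU and the spatial neighbor lie in the same merge region
            (1) neighbor x = xP - 1, neighbor y = yP + nPSH
            (2) current x = xP, current y = yP
        */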
        cuLeftBottom->isDiffMER(xP - 1, yP + nPSH, xP, yP) &&
        cuLeftBottom->isInter(leftBottomPartIdx);
    if (isAvailableA0 && (!isAvailableA1 || !cuLeft->hasEqualMotion(leftPartIdx, *cuLeftBottom, leftBottomPartIdx)))
    {
        // get Inter Dir
        candDir[count] = cuLeftBottom->m_interDir[leftBottomPartIdx];
        // get Mv from Below Left
        cuLeftBottom->getMvField(cuLeftBottom, leftBottomPartIdx, 0, candMvField[count][0]);
        if (isInterB)
            cuLeftBottom->getMvField(cuLeftBottom, leftBottomPartIdx, 1, candMvField[count][1]);

        if (++count == maxNumMergeCand)
            return maxNumMergeCand;
    }

    // above left
    // If fewer than 4 candidates have been collected so far, also check the above-left block, i.e. B2
    if (count < 4)
    {
        uint32_t aboveLeftPartIdx = 0;
        const CUData* cuAboveLeft = getPUAboveLeft(aboveLeftPartIdx, absPartAddr);
        // Check whether B2 is available
        bool isAvailableB2 = cuAboveLeft &&
            cuAboveLeft->isDiffMER(xP - 1, yP - 1, xP, yP) &&
            cuAboveLeft->isInter(aboveLeftPartIdx);
        if (isAvailableB2 && (!isAvailableA1 || !cuLeft->hasEqualMotion(leftPartIdx, *cuAboveLeft, aboveLeftPartIdx))
            && (!isAvailableB1 || !cuAbove->hasEqualMotion(abovePartIdx, *cuAboveLeft, aboveLeftPartIdx)))
        {
            // get Inter Dir
            candDir[count] = cuAboveLeft->m_interDir[aboveLeftPartIdx];
            // get Mv from Above Left
            cuAboveLeft->getMvField(cuAboveLeft, aboveLeftPartIdx, 0, candMvField[count][0]);
            if (isInterB)
                cuAboveLeft->getMvField(cuAboveLeft, aboveLeftPartIdx, 1, candMvField[count][1]);

            if (++count == maxNumMergeCand)
                return maxNumMergeCand;
        }
    }
    /*
        If temporal MVP is enabled, derive the temporal candidate
    */
    if (m_slice->m_sps->bTemporalMVPEnabled)
    {
        // Index of the bottom-right partition of the PU
        uint32_t partIdxRB = deriveRightBottomIdx(puIdx);
        MV colmv;
        int ctuIdx = -1;

        // image boundary check
        if (m_encData->getPicCTU(m_cuAddr)->m_cuPelX + g_zscanToPelX[partIdxRB] + UNIT_SIZE < m_slice->m_sps->picWidthInLumaSamples &&
            m_encData->getPicCTU(m_cuAddr)->m_cuPelY + g_zscanToPelY[partIdxRB] + UNIT_SIZE < m_slice->m_sps->picHeightInLumaSamples)
        {
            uint32_t absPartIdxRB = g_zscanToRaster[partIdxRB];
            uint32_t numUnits = s_numPartInCUSize;
            // Check whether absPartIdxRB lies in the last column or the last row of the CTU
            bool bNotLastCol = lessThanCol(absPartIdxRB, numUnits - 1); // is not at the last column of CTU
            bool bNotLastRow = lessThanRow(absPartIdxRB, numUnits - 1); // is not at the last row    of CTU
            // Determine the position of the co-located PU for the temporal candidate
            if (bNotLastCol && bNotLastRow)
            {
                absPartAddr = g_rasterToZscan[absPartIdxRB + RASTER_SIZE + 1];
                ctuIdx = m_cuAddr;
            }
            else if (bNotLastCol)
                absPartAddr = g_rasterToZscan[(absPartIdxRB + 1) & (numUnits - 1)];
            else if (bNotLastRow)
            {
                absPartAddr = g_rasterToZscan[absPartIdxRB + RASTER_SIZE - numUnits + 1];
                ctuIdx = m_cuAddr + 1;
            }
            else // is the right bottom corner of CTU
                absPartAddr = 0;
        }
        // B slices look up a co-located MV for both lists (L0 and L1); P slices only for L0
        int maxList = isInterB ? 2 : 1;
        int dir = 0, refIdx = 0;
        for (int list = 0; list < maxList; list++)
        {
            // Fetch the co-located MV
            bool bExistMV = ctuIdx >= 0 && getColMVP(colmv, refIdx, list, ctuIdx, absPartAddr);
            if (!bExistMV)
            {
                // If the bottom-right position (H) has no usable MV, fall back to the MV at the center position C3
                uint32_t partIdxCenter = deriveCenterIdx(puIdx);
                bExistMV = getColMVP(colmv, refIdx, list, m_cuAddr, partIdxCenter);
            }
            // If a usable MV was found, record it in the candidate
            if (bExistMV)
            {
                dir |= (1 << list);
                candMvField[count][list].mv = colmv;
                candMvField[count][list].refIdx = refIdx;
            }
        }

        if (dir != 0)
        {
            candDir[count] = (uint8_t)dir;

            if (++count == maxNumMergeCand)
                return maxNumMergeCand;
        }
    }
    // B slices: build combined bi-predictive candidates (not studied in detail here)
    if (isInterB)
    {
        const uint32_t cutoff = count * (count - 1);
        uint32_t priorityList0 = 0xEDC984; // { 0, 1, 0, 2, 1, 2, 0, 3, 1, 3, 2, 3 }
        uint32_t priorityList1 = 0xB73621; // { 1, 0, 2, 0, 2, 1, 3, 0, 3, 1, 3, 2 }

        for (uint32_t idx = 0; idx < cutoff; idx++, priorityList0 >>= 2, priorityList1 >>= 2)
        {
            int i = priorityList0 & 3;
            int j = priorityList1 & 3;

            if ((candDir[i] & 0x1) && (candDir[j] & 0x2))
            {
                // get Mv from cand[i] and cand[j]
                int refIdxL0 = candMvField[i][0].refIdx;
                int refIdxL1 = candMvField[j][1].refIdx;
                int refPOCL0 = m_slice->m_refPOCList[0][refIdxL0];
                int refPOCL1 = m_slice->m_refPOCList[1][refIdxL1];
                if (!(refPOCL0 == refPOCL1 && candMvField[i][0].mv == candMvField[j][1].mv))
                {
                    candMvField[count][0].mv = candMvField[i][0].mv;
                    candMvField[count][0].refIdx = refIdxL0;
                    candMvField[count][1].mv = candMvField[j][1].mv;
                    candMvField[count][1].refIdx = refIdxL1;
                    candDir[count] = 3;

                    if (++count == maxNumMergeCand)
                        return maxNumMergeCand;
                }
            }
        }
    }
    int numRefIdx = (isInterB) ? X265_MIN(m_slice->m_numRefIdx[0], m_slice->m_numRefIdx[1]) : m_slice->m_numRefIdx[0];
    int r = 0;
    int refcnt = 0;
    // If the list still holds fewer than maxNumMergeCand candidates, pad it with zero MVs (0,0)
    while (count < maxNumMergeCand)
    {
        candDir[count] = 1;
        candMvField[count][0].mv.word = 0;
        candMvField[count][0].refIdx = r;

        if (isInterB)
        {
            candDir[count] = 3;
            candMvField[count][1].mv.word = 0;
            candMvField[count][1].refIdx = r;
        }

        count++;

        if (refcnt == numRefIdx - 1)
            r = 0;
        else
        {
            ++r;
            ++refcnt;
        }
    }

    return count;
}
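
The combined bi-predictive step near the end of the function walks through candidate pairs (i, j) that are packed two bits at a time into the constants 0xEDC984 and 0xB73621. The sketch below unpacks them so the visiting order is explicit; `combinedPairs` is a hypothetical helper written for this illustration, and it assumes count is at most 4, as it is when this step is reached with the default list size of 5.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Unpack the (i, j) index pairs visited by the combined bi-predictive step.
// Each pair takes the L0 motion of candidate i and the L1 motion of
// candidate j. Entries are 2 bits wide, least significant bits first.
std::vector<std::pair<int, int>> combinedPairs(uint32_t count)
{
    uint32_t priorityList0 = 0xEDC984; // L0 source index per step
    uint32_t priorityList1 = 0xB73621; // L1 source index per step
    const uint32_t cutoff = count * (count - 1); // number of ordered pairs

    std::vector<std::pair<int, int>> pairs;
    for (uint32_t idx = 0; idx < cutoff; ++idx, priorityList0 >>= 2, priorityList1 >>= 2)
        pairs.emplace_back(static_cast<int>(priorityList0 & 3),
                           static_cast<int>(priorityList1 & 3));
    return pairs;
}
```

With three existing candidates the order is (0,1), (1,0), (0,2), (2,0), (1,2), (2,1), so pairs built from the earliest candidates are tried first.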

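2.1.2 Motion Compensation (motionCompensation)

Motion compensation uses the MV obtained earlier to fetch the reference block from the reference frame as the prediction. In x265, luma motion compensation is done mainly through predInterLumaPixel().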

void Predict::motionCompensation(const CUData& cu, const PredictionUnit& pu, Yuv& predYuv, bool bLuma, bool bChroma)
{
    int refIdx0 = cu.m_refIdx[0][pu.puAbsPartIdx];
    int refIdx1 = cu.m_refIdx[1][pu.puAbsPartIdx];
    // Is this a P slice?
    if (cu.m_slice->isInterP())
    {
        /* P Slice */
        WeightValues wv0[3];

        X265_CHECK(refIdx0 >= 0, "invalid P refidx\n");
        X265_CHECK(refIdx0 < cu.m_slice->m_numRefIdx[0], "P refidx out of range\n");
        const WeightParam *wp0 = cu.m_slice->m_weightPredTable[0][refIdx0]; // weighted prediction parameters (not studied here)

        MV mv0 = cu.m_mv[0][pu.puAbsPartIdx];
        cu.clipMv(mv0);

        if (cu.m_slice->m_pps->bUseWeightPred && wp0->wtPresent)
        {
            for (int plane = 0; plane < (bChroma ? 3 : 1); plane++)
            {
                wv0[plane].w      = wp0[plane].inputWeight;
                wv0[plane].offset = wp0[plane].inputOffset * (1 << (X265_DEPTH - 8));
                wv0[plane].shift  = wp0[plane].log2WeightDenom;
                wv0[plane].round  = wp0[plane].log2WeightDenom >= 1 ? 1 << (wp0[plane].log2WeightDenom - 1) : 0;
            }

            ShortYuv& shortYuv = m_predShortYuv[0];

            if (bLuma)
                predInterLumaShort(pu, shortYuv, *cu.m_slice->m_refReconPicList[0][refIdx0], mv0);
            if (bChroma)
                predInterChromaShort(pu, shortYuv, *cu.m_slice->m_refReconPicList[0][refIdx0], mv0);

            addWeightUni(pu, predYuv, shortYuv, wv0, bLuma, bChroma);
        }
        else
        {	
            // Luma motion compensation
            if (bLuma)
                predInterLumaPixel(pu, predYuv, *cu.m_slice->m_refReconPicList[0][refIdx0], mv0);
            // Chroma motion compensation
            if (bChroma)
                predInterChromaPixel(pu, predYuv, *cu.m_slice->m_refReconPicList[0][refIdx0], mv0);
        }
    }
    else // B slice (not studied here)
    {	
        // ...
    }
}

2.1.2.1 Fetching the Prediction Block (predInterLumaPixel)

This function fetches the corresponding reference block from the reference frame.

void Predict::predInterLumaPixel(const PredictionUnit& pu, Yuv& dstYuv, const PicYuv& refPic, const MV& mv) const
{
    pixel* dst = dstYuv.getLumaAddr(pu.puAbsPartIdx);
    intptr_t dstStride = dstYuv.m_size;

    intptr_t srcStride = refPic.m_stride;
    intptr_t srcOffset = (mv.x >> 2) + (mv.y >> 2) * srcStride;
    int partEnum = partitionFromSizes(pu.width, pu.height);
    const pixel* src = refPic.getLumaAddr(pu.ctuAddr, pu.cuAbsPartIdx + pu.puAbsPartIdx) + srcOffset;
	
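    int xFrac = mv.x & 3; // horizontal quarter-pel fraction (0..3)
    int yFrac = mv.y & 3; // vertical quarter-pel fraction (0..3)
    /*
        Pick the primitive from the fractional part of the MV:
        (1) xFrac == 0 and yFrac == 0: integer-pel MV, the block is copied with copy_pp()
            Otherwise the prediction is interpolated at the sub-pel position ("8tap" below means an 8-tap filter)
        (2) only xFrac != 0: horizontal sub-pel interpolation with luma_hpp()
        (3) only yFrac != 0: vertical sub-pel interpolation with luma_vpp()
        (4) both non-zero: interpolation in both directions with luma_hvpp()
    */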
    if (!(yFrac | xFrac))
        /*
            copy functions observed while debugging (non-square sizes have their own, e.g. blockcopy_pp_32x16_avx)
            p.pu[LUMA_64x64].copy_pp  = PFX(blockcopy_pp_64x64_avx);
            p.pu[LUMA_32x32].copy_pp  = PFX(blockcopy_pp_32x32_avx);
            p.pu[LUMA_16x16].copy_pp  = x265_blockcopy_pp_16x16_sse2;
            p.pu[LUMA_8x8].copy_pp  = x265_blockcopy_pp_8x8_sse2;
        */
        primitives.pu[partEnum].copy_pp(dst, dstStride, src, srcStride);
    else if (!yFrac)
        /*
            hpp functions observed while debugging
            p.pu[LUMA_8x8].luma_hpp = PFX(interp_8tap_horiz_pp_8x8_avx2);
            p.pu[LUMA_16x16].luma_hpp = PFX(interp_8tap_horiz_pp_16x16_avx2);
            p.pu[LUMA_32x32].luma_hpp = PFX(interp_8tap_horiz_pp_32x32_avx2);
        */
        primitives.pu[partEnum].luma_hpp(src, srcStride, dst, dstStride, xFrac);
    else if (!xFrac)
        /*
            vpp functions observed while debugging
            p.pu[LUMA_8x8].luma_vpp = PFX(interp_8tap_vert_pp_8x8_avx2);
            p.pu[LUMA_16x16].luma_vpp = PFX(interp_8tap_vert_pp_16x16_avx2);
            p.pu[LUMA_32x32].luma_vpp = PFX(interp_8tap_vert_pp_32x32_avx2);
        */
        primitives.pu[partEnum].luma_vpp(src, srcStride, dst, dstStride, yFrac);
    else
        /*
            hvpp function observed while debugging
            ALL_LUMA_PU_T(luma_hvpp, interp_8tap_hv_pp_cpu);
            interp_8tap_hv_pp_cpu<size> is a function template with template parameter size
            size = 1 handles 8x8 blocks
            size = 2 handles 16x16 blocks
            size = 3 handles 32x32 blocks
        */
        primitives.pu[partEnum].luma_hvpp(src, srcStride, dst, dstStride, xFrac, yFrac);
}
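
To make the `>> 2` / `& 3` split and the "8-tap" wording concrete, here is a small sketch that splits a quarter-pel MV component and interpolates one sample horizontally. The coefficient table is the standard HEVC luma filter. `splitQuarterPel` and `interpHoriz8Tap` are hypothetical helpers for illustration, they operate on 8-bit samples only, and they are not the x265 primitives themselves.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// HEVC 8-tap luma interpolation coefficients, indexed by the quarter-pel fraction (0..3).
static const int kLumaFilter[4][8] = {
    {  0, 0,   0, 64,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

struct MvParts { int intPel; int frac; };

// Split one quarter-pel MV component the same way the listing does.
// Assumes an arithmetic right shift for negative values, as the listing does.
inline MvParts splitQuarterPel(int mvComponent)
{
    return { mvComponent >> 2, mvComponent & 3 };
}

// Interpolate the sample at position x + frac/4 in a row of 8-bit pixels.
// The row must hold 3 valid samples to the left of x and 4 to the right.
inline int interpHoriz8Tap(const std::vector<uint8_t>& row, int x, int frac)
{
    int sum = 0;
    for (int k = 0; k < 8; k++)
        sum += kLumaFilter[frac][k] * row[x + k - 3];
    int v = (sum + 32) / 64; // every filter row sums to 64
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}
```

For example, an MV component of -5 splits into an integer offset of -2 and a fraction of 3, since -2 * 4 + 3 = -5. On a linear ramp the half-pel filter returns the midpoint of the two neighbouring samples.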

2.1.3 Cost without coding the residual (encodeResAndCalcRdSkipCU)

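The best merge candidate has already been chosen with a cheap sa8d-based cost. This function computes the cost of coding the CU as skip, i.e. without coding any residual, so that the caller can check whether skip can be used directly. Distortion here is measured with SSE rather than with the cheaper estimate used for pre-selection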

/* Note: this function overwrites the RD cost variables of interMode, but leaves the sa8d cost unharmed */
void Search::encodeResAndCalcRdSkipCU(Mode& interMode)
{
    CUData& cu = interMode.cu;
    Yuv* reconYuv = &interMode.reconYuv;
    const Yuv* fencYuv = interMode.fencYuv;
    Yuv* predYuv = &interMode.predYuv;
    X265_CHECK(!cu.isIntra(0), "intra CU not expected\n");
    uint32_t depth  = cu.m_cuDepth[0];

    // No residual coding : SKIP mode
    cu.setPredModeSubParts(MODE_SKIP);
    cu.clearCbf();
    cu.setTUDepthSubParts(0, 0, depth);

    reconYuv->copyFromYuv(interMode.predYuv);

    // RD cost based on SSE
    // Luma
    int part = partitionFromLog2Size(cu.m_log2CUSize[0]);
    // SSE is measured between the original block (fenc) and the reconstructed block
    interMode.lumaDistortion = primitives.cu[part].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size);
    interMode.distortion = interMode.lumaDistortion;
    // Chroma
    if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400)
    {
        interMode.chromaDistortion = m_rdCost.scaleChromaDist(1, primitives.chroma[m_csp].cu[part].sse_pp(fencYuv->m_buf[1], fencYuv->m_csize, reconYuv->m_buf[1], reconYuv->m_csize));
        interMode.chromaDistortion += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[part].sse_pp(fencYuv->m_buf[2], fencYuv->m_csize, reconYuv->m_buf[2], reconYuv->m_csize));
        interMode.distortion += interMode.chromaDistortion;
    }
    cu.m_distortion[0] = interMode.distortion;
    m_entropyCoder.load(m_rqt[depth].cur); // restore the entropy-coder context saved for this depth
    m_entropyCoder.resetBits(); // reset the bit counter
    if (m_slice->m_pps->bTransquantBypassEnabled)
        m_entropyCoder.codeCUTransquantBypassFlag(cu.m_tqBypass[0]);
    m_entropyCoder.codeSkipFlag(cu, 0); // code the skip flag
    int skipFlagBits = m_entropyCoder.getNumberOfWrittenBits();
    m_entropyCoder.codeMergeIndex(cu, 0); // code the merge index
    interMode.mvBits = m_entropyCoder.getNumberOfWrittenBits() - skipFlagBits;
    interMode.coeffBits = 0;
    interMode.totalBits = interMode.mvBits + skipFlagBits;
    if (m_rdCost.m_psyRd)
        interMode.psyEnergy = m_rdCost.psyCost(part, fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size);
    else if(m_rdCost.m_ssimRd)
        interMode.ssimEnergy = m_quant.ssimDistortion(cu, fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size, cu.m_log2CUSize[0], TEXT_LUMA, 0);

    interMode.resEnergy = primitives.cu[part].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, predYuv->m_buf[0], predYuv->m_size);
    // update the cost of this mode
    updateModeCost(interMode);
    // save the entropy-coder contexts reached after coding this mode
    m_entropyCoder.store(interMode.contexts);
}

2.1.4 Cost with the residual coded (encodeResAndCalcRdInterCU)

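Continuing with the best merge mode from above, this function computes the SSE-based distortion for that mode and uses estimateResidualQT() to estimate the cost of coding the residual. It also evaluates the case where the root cbf is 0, i.e. the residual is neither coded nor transmitted, and switches to it when it is cheaper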

/* encode residual and calculate rate-distortion for a CU block.
 * Note: this function overwrites the RD cost variables of interMode, but leaves the sa8d cost unharmed */
void Search::encodeResAndCalcRdInterCU(Mode& interMode, const CUGeom& cuGeom)
{
    ProfileCUScope(interMode.cu, interRDOElapsedTime[cuGeom.depth], countInterRDO[cuGeom.depth]);

    CUData& cu = interMode.cu;
    Yuv* reconYuv = &interMode.reconYuv;
    Yuv* predYuv = &interMode.predYuv;
    uint32_t depth = cuGeom.depth;
    ShortYuv* resiYuv = &m_rqt[depth].tmpResiYuv;
    const Yuv* fencYuv = interMode.fencYuv;

    X265_CHECK(!cu.isIntra(0), "intra CU not expected\n");

    uint32_t log2CUSize = cuGeom.log2CUSize;
    int sizeIdx = log2CUSize - 2;
    // residual = source block (fenc) - prediction block (pred)
    resiYuv->subtract(*fencYuv, *predYuv, log2CUSize, m_frame->m_fencPic->m_picCsp);

    uint32_t tuDepthRange[2];
    cu.getInterTUQtDepthRange(tuDepthRange, 0);

    m_entropyCoder.load(m_rqt[depth].cur);

    if ((m_limitTU & X265_TU_LIMIT_DFS) && !(m_limitTU & X265_TU_LIMIT_NEIGH))
        m_maxTUDepth = -1;
    else if (m_limitTU & X265_TU_LIMIT_BFS)
        memset(&m_cacheTU, 0, sizeof(TUInfoCache));

    Cost costs;
    if (m_limitTU & X265_TU_LIMIT_NEIGH)
    {
        /* Save and reload maxTUDepth to avoid changing of maxTUDepth between modes */
        int32_t tempDepth = m_maxTUDepth;
        if (m_maxTUDepth != -1)
        {
            uint32_t splitFlag = interMode.cu.m_partSize[0] != SIZE_2Nx2N;
            uint32_t minSize = tuDepthRange[0];
            uint32_t maxSize = tuDepthRange[1];
            maxSize = X265_MIN(maxSize, cuGeom.log2CUSize - splitFlag);
            m_maxTUDepth = x265_clip3(cuGeom.log2CUSize - maxSize, cuGeom.log2CUSize - minSize, (uint32_t)m_maxTUDepth);
        }
        estimateResidualQT(interMode, cuGeom, 0, 0, *resiYuv, costs, tuDepthRange);
        m_maxTUDepth = tempDepth;
    }
    else // estimate the coded residual and compute its rdcost
        estimateResidualQT(interMode, cuGeom, 0, 0, *resiYuv, costs, tuDepthRange);
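    /*
        Check whether the CU uses transquant bypass (cu_transquant_bypass_flag).
        With bypass, transform and quantisation are skipped and the residual is coded losslessly,
        so the cbf = 0 shortcut below is only evaluated when bypass is off.
        Note: this is unrelated to the bypass (equiprobable) bins of CABAC.
    */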
    uint32_t tqBypass = cu.m_tqBypass[0];
    if (!tqBypass)
    {
        // distortion when cbf is 0 (residual neither coded nor transmitted); compared with costs below
        sse_t cbf0Dist = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, predYuv->m_buf[0], predYuv->m_size);
        if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400)
        {
            cbf0Dist += m_rdCost.scaleChromaDist(1, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[1], predYuv->m_csize, predYuv->m_buf[1], predYuv->m_csize));
            cbf0Dist += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[2], predYuv->m_csize, predYuv->m_buf[2], predYuv->m_csize));
        }

        /* Consider the RD cost of not signaling any residual */
        m_entropyCoder.load(m_rqt[depth].cur);
        m_entropyCoder.resetBits();
        m_entropyCoder.codeQtRootCbfZero();
        uint32_t cbf0Bits = m_entropyCoder.getNumberOfWrittenBits();

        uint32_t cbf0Energy; uint64_t cbf0Cost;
        if (m_rdCost.m_psyRd)
        {
            cbf0Energy = m_rdCost.psyCost(log2CUSize - 2, fencYuv->m_buf[0], fencYuv->m_size, predYuv->m_buf[0], predYuv->m_size);
            cbf0Cost = m_rdCost.calcPsyRdCost(cbf0Dist, cbf0Bits, cbf0Energy);
        }
        else if(m_rdCost.m_ssimRd)
        {
            cbf0Energy = m_quant.ssimDistortion(cu, fencYuv->m_buf[0], fencYuv->m_size, predYuv->m_buf[0], predYuv->m_size, log2CUSize, TEXT_LUMA, 0);
            cbf0Cost = m_rdCost.calcSsimRdCost(cbf0Dist, cbf0Bits, cbf0Energy);
        }
        else
            cbf0Cost = m_rdCost.calcRdCost(cbf0Dist, cbf0Bits);
        // compare the cbf = 0 cost with the cost of coding the residual
        if (cbf0Cost < costs.rdcost)
        {
            cu.clearCbf();
            cu.setTUDepthSubParts(0, 0, depth);
        }
    }

    if (cu.getQtRootCbf(0))
        saveResidualQTData(cu, *resiYuv, 0, 0);

    /* calculate signal bits for inter/merge/skip coded CU */
    m_entropyCoder.load(m_rqt[depth].cur);

    m_entropyCoder.resetBits();
    if (m_slice->m_pps->bTransquantBypassEnabled)
        m_entropyCoder.codeCUTransquantBypassFlag(tqBypass);

    uint32_t coeffBits, bits, mvBits;
    // merge is used && partition is 2Nx2N && root cbf is 0
    if (cu.m_mergeFlag[0] && cu.m_partSize[0] == SIZE_2Nx2N && !cu.getQtRootCbf(0))
    {
        // root cbf is 0: a 2Nx2N merge CU without residual is signalled as skip
        cu.setPredModeSubParts(MODE_SKIP);

        /* Merge/Skip */
        coeffBits = mvBits = 0;
        m_entropyCoder.codeSkipFlag(cu, 0); // code the skip flag
        int skipFlagBits = m_entropyCoder.getNumberOfWrittenBits();
        m_entropyCoder.codeMergeIndex(cu, 0); // code the merge index
        mvBits = m_entropyCoder.getNumberOfWrittenBits() - skipFlagBits;
        bits = mvBits + skipFlagBits;
    }
    else
    {
        m_entropyCoder.codeSkipFlag(cu, 0);
        int skipFlagBits = m_entropyCoder.getNumberOfWrittenBits();
        m_entropyCoder.codePredMode(cu.m_predMode[0]);
        m_entropyCoder.codePartSize(cu, 0, cuGeom.depth);
        m_entropyCoder.codePredInfo(cu, 0);
        mvBits = m_entropyCoder.getNumberOfWrittenBits() - skipFlagBits;

        bool bCodeDQP = m_slice->m_pps->bUseDQP;
        m_entropyCoder.codeCoeff(cu, 0, bCodeDQP, tuDepthRange);
        bits = m_entropyCoder.getNumberOfWrittenBits();

        coeffBits = bits - mvBits - skipFlagBits;
    }

    m_entropyCoder.store(interMode.contexts);

    if (cu.getQtRootCbf(0))
        reconYuv->addClip(*predYuv, *resiYuv, log2CUSize, m_frame->m_fencPic->m_picCsp);
    else
        reconYuv->copyFromYuv(*predYuv);

    // update with clipped distortion and cost (qp estimation loop uses unclipped values)
    // final SSE against the chosen reconstruction
    sse_t bestLumaDist = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size);
    interMode.distortion = bestLumaDist;
    // chroma planes
    if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400)
    {
        sse_t bestChromaDist = m_rdCost.scaleChromaDist(1, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[1], fencYuv->m_csize, reconYuv->m_buf[1], reconYuv->m_csize));
        bestChromaDist += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[2], fencYuv->m_csize, reconYuv->m_buf[2], reconYuv->m_csize));
        interMode.chromaDistortion = bestChromaDist;
        interMode.distortion += bestChromaDist;
    }
    if (m_rdCost.m_psyRd) // psycho-visual energy term used by psy-rd
        interMode.psyEnergy = m_rdCost.psyCost(sizeIdx, fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size);
    else if(m_rdCost.m_ssimRd)
        interMode.ssimEnergy = m_quant.ssimDistortion(cu, fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size, cu.m_log2CUSize[0], TEXT_LUMA, 0);

    interMode.resEnergy = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, predYuv->m_buf[0], predYuv->m_size);
    interMode.totalBits = bits;
    interMode.lumaDistortion = bestLumaDist;
    interMode.coeffBits = coeffBits;
    interMode.mvBits = mvBits;
    cu.m_distortion[0] = interMode.distortion;
    // update the cost
    updateModeCost(interMode);
    checkDQP(interMode, cuGeom);
}
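
Two steps of this function are easy to isolate: the reconstruction `recon = clip(pred + resi)` done by addClip(), and the strict comparison that decides whether the residual is dropped. The sketch below shows both for 8-bit samples. `addClip8`, `OptionCost` and `preferCbf0` are hypothetical names, and the plain integer `lambda` is a simplification of the real cost functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// recon = clip(pred + resi), a pixel-domain equivalent of addClip() for 8-bit video.
inline std::vector<uint8_t> addClip8(const std::vector<uint8_t>& pred, const std::vector<int16_t>& resi)
{
    std::vector<uint8_t> recon(pred.size());
    for (size_t i = 0; i < pred.size(); i++)
        recon[i] = uint8_t(std::min(std::max(int(pred[i]) + int(resi[i]), 0), 255));
    return recon;
}

// Hypothetical summary of one coding option: distortion (SSE) and bits.
struct OptionCost { uint64_t dist; uint32_t bits; };

inline uint64_t rdCost(const OptionCost& c, uint64_t lambda)
{
    return c.dist + lambda * c.bits;
}

// True when dropping the residual (root cbf = 0) is strictly cheaper than coding it,
// mirroring the `cbf0Cost < costs.rdcost` test in the listing.
inline bool preferCbf0(const OptionCost& coded, const OptionCost& cbf0, uint64_t lambda)
{
    return rdCost(cbf0, lambda) < rdCost(coded, lambda);
}
```

Dropping the residual raises the distortion but saves nearly all coefficient bits, so it tends to win at high lambda (high QP) and lose at low lambda.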

2.2 Regular inter prediction (checkInter_rd0_4)

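This function performs regular inter prediction. It mainly calls predInterSearch() to run the inter search and then scores the mode with sa8d (an 8x8 Hadamard-transform based SATD, not plain SAD), combined with the bit estimate through calcRdSADCost()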

void Analysis::checkInter_rd0_4(Mode& interMode, const CUGeom& cuGeom, PartSize partSize, uint32_t refMask[2])
{
    interMode.initCosts();
    interMode.cu.setPartSizeSubParts(partSize);
    interMode.cu.setPredModeSubParts(MODE_INTER);
    int numPredDir = m_slice->isInterP() ? 1 : 2;
    // analysis reuse: load the reference indices saved by a previous encode
    if (m_param->analysisLoadReuseLevel > 1 && m_param->analysisLoadReuseLevel != 10 && m_reuseInterDataCTU)
    {
        int refOffset = cuGeom.geomRecurId * 16 * numPredDir + partSize * numPredDir * 2;
        int index = 0;

        uint32_t numPU = interMode.cu.getNumPartInter(0);
        for (uint32_t part = 0; part < numPU; part++)
        {
            MotionData* bestME = interMode.bestME[part];
            for (int32_t i = 0; i < numPredDir; i++)
                bestME[i].ref = m_reuseRef[refOffset + index++];
        }
    }
    // multi-pass refinement: reuse ref, MV and MVP index from the previous pass
    if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead && m_reuseInterDataCTU)
    {
        uint32_t numPU = interMode.cu.getNumPartInter(0);
        for (uint32_t part = 0; part < numPU; part++)
        {
            MotionData* bestME = interMode.bestME[part];
            for (int32_t i = 0; i < numPredDir; i++)
            {
                int* ref = &m_reuseRef[i * m_frame->m_analysisData.numPartitions * m_frame->m_analysisData.numCUsInFrame];
                bestME[i].ref = ref[cuGeom.absPartIdx];
                bestME[i].mv = m_reuseMv[i][cuGeom.absPartIdx].word;
                bestME[i].mvpIdx = m_reuseMvpIdx[i][cuGeom.absPartIdx];
            }
        }
    }
    // run the inter search
    predInterSearch(interMode, cuGeom, m_bChromaSa8d && (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400), refMask);

    /* predInterSearch sets interMode.sa8dBits */
    // the mode found by the inter search is scored with sa8d (Hadamard-based)
    const Yuv& fencYuv = *interMode.fencYuv;
    Yuv& predYuv = interMode.predYuv;
    int part = partitionFromLog2Size(cuGeom.log2CUSize);
    // compute sa8d
    interMode.distortion = primitives.cu[part].sa8d(fencYuv.m_buf[0], fencYuv.m_size, predYuv.m_buf[0], predYuv.m_size);
    if (m_bChromaSa8d && (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400))
    {
        interMode.distortion += primitives.chroma[m_csp].cu[part].sa8d(fencYuv.m_buf[1], fencYuv.m_csize, predYuv.m_buf[1], predYuv.m_csize);
        interMode.distortion += primitives.chroma[m_csp].cu[part].sa8d(fencYuv.m_buf[2], fencYuv.m_csize, predYuv.m_buf[2], predYuv.m_csize);
    }
    interMode.sa8dCost = m_rdCost.calcRdSADCost((uint32_t)interMode.distortion, interMode.sa8dBits);

    if (m_param->analysisSaveReuseLevel > 1 && m_reuseInterDataCTU)
    {
        int refOffset = cuGeom.geomRecurId * 16 * numPredDir + partSize * numPredDir * 2;
        int index = 0;

        uint32_t numPU = interMode.cu.getNumPartInter(0);
        for (uint32_t puIdx = 0; puIdx < numPU; puIdx++)
        {
            MotionData* bestME = interMode.bestME[puIdx];
            for (int32_t i = 0; i < numPredDir; i++)
                m_reuseRef[refOffset + index++] = bestME[i].ref;
        }
    }
}
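
Since sa8d is a Hadamard-based measure, it helps to see how such a measure differs from plain SAD. The sketch below uses a 4x4 block for brevity (sa8d works on 8x8). `sad4x4` and `satd4x4` are hypothetical helpers, and the final halving is a common normalisation that should be read as an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Plain SAD over a 4x4 block stored row by row.
inline int sad4x4(const uint8_t a[16], const uint8_t b[16])
{
    int sum = 0;
    for (int i = 0; i < 16; i++)
        sum += std::abs(int(a[i]) - int(b[i]));
    return sum;
}

// SATD over a 4x4 block: Hadamard-transform the difference, then sum |coefficients|.
inline int satd4x4(const uint8_t a[16], const uint8_t b[16])
{
    int d[16], t[16];
    for (int i = 0; i < 16; i++)
        d[i] = int(a[i]) - int(b[i]);

    // Horizontal 4-point Hadamard on each row.
    for (int r = 0; r < 4; r++)
    {
        int s0 = d[4 * r + 0] + d[4 * r + 1], s1 = d[4 * r + 0] - d[4 * r + 1];
        int s2 = d[4 * r + 2] + d[4 * r + 3], s3 = d[4 * r + 2] - d[4 * r + 3];
        t[4 * r + 0] = s0 + s2; t[4 * r + 1] = s1 + s3;
        t[4 * r + 2] = s0 - s2; t[4 * r + 3] = s1 - s3;
    }
    // Vertical 4-point Hadamard on each column, accumulating |coefficient|.
    int sum = 0;
    for (int c = 0; c < 4; c++)
    {
        int s0 = t[c] + t[4 + c],      s1 = t[c] - t[4 + c];
        int s2 = t[8 + c] + t[12 + c], s3 = t[8 + c] - t[12 + c];
        sum += std::abs(s0 + s2) + std::abs(s1 + s3) + std::abs(s0 - s2) + std::abs(s1 - s3);
    }
    return sum / 2;
}
```

A flat difference concentrates in the DC coefficient and scores low, while a single-pixel spike spreads over all coefficients and scores high. This is closer to what the transform and entropy coder will actually pay for the residual than SAD is.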

2.2.1 Inter prediction search (predInterSearch)

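This function finds the best inter mode for each PU of the current CU. Before the motion search itself, the candidate MVs have to be determined
(1) If the CU is split into several PUs (partSize is not 2Nx2N), the merge mode is evaluated for the PU (mergeEstimation)
(2) The AMVP list is built and the best predictor is chosen from it (getPMV, selectMVP)
(3) Motion estimation is run (motionEstimate)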

/* find the best inter prediction for each PU of specified mode */
void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChromaMC, uint32_t refMasks[2])
{
    ProfileCUScope(interMode.cu, motionEstimationElapsedTime, countMotionEstimate);

    CUData& cu = interMode.cu;
    Yuv* predYuv = &interMode.predYuv;

    // 12 mv candidates including lowresMV
    MV mvc[(MD_ABOVE_LEFT + 1) * 2 + 2]; // motion vector candidates

    const Slice *slice = m_slice;
    int numPart     = cu.getNumPartInter(0);
    int numPredDir  = slice->isInterP() ? 1 : 2;
    const int* numRefIdx = slice->m_numRefIdx;
    uint32_t lastMode = 0;
    int      totalmebits = 0;
    MV       mvzero(0, 0);
    Yuv&     tmpPredYuv = m_rqt[cuGeom.depth].tmpPredYuv;
    MergeData merge;
    memset(&merge, 0, sizeof(merge));
    bool useAsMVP = false;
    // run inter prediction for each PU of the CU
    for (int puIdx = 0; puIdx < numPart; puIdx++)
    {
        MotionData* bestME = interMode.bestME[puIdx];
        PredictionUnit pu(cu, cuGeom, puIdx);
        // set up the primitive functions and variables that motion estimation uses for this PU
        m_me.setSourcePU(*interMode.fencYuv, pu.ctuAddr, pu.cuAbsPartIdx, pu.puAbsPartIdx, pu.width, pu.height, m_param->searchMethod, m_param->subpelRefine, bChromaMC);
        useAsMVP = false;
        x265_analysis_inter_data* interDataCTU = NULL;
        int cuIdx;
        cuIdx = (interMode.cu.m_cuAddr * m_param->num4x4Partitions) + cuGeom.absPartIdx;
        if (m_param->analysisLoadReuseLevel == 10 && m_param->interRefine > 1)
        {
            interDataCTU = m_frame->m_analysisData.interData;
            if ((cu.m_predMode[pu.puAbsPartIdx] == interDataCTU->modes[cuIdx + pu.puAbsPartIdx])
                && (cu.m_partSize[pu.puAbsPartIdx] == interDataCTU->partSize[cuIdx + pu.puAbsPartIdx])
                && !(interDataCTU->mergeFlag[cuIdx + puIdx])
                && (cu.m_cuDepth[0] == interDataCTU->depth[cuIdx]))
                useAsMVP = true;
        }
        /* find best cost merge candidate. note: 2Nx2N merge and bidir are handled as separate modes */
        // 1. 2Nx2N merge was already checked in checkMerge_2Nx2N_rd0_4; here merge is evaluated for PUs of non-2Nx2N partitions
        uint32_t mrgCost = numPart == 1 ? MAX_UINT : mergeEstimation(cu, cuGeom, pu, puIdx, merge);
        bestME[0].cost = MAX_UINT;
        bestME[1].cost = MAX_UINT;
        // fixed signalling bits for this block, derived from the partition information
        getBlkBits((PartSize)cu.m_partSize[0], slice->isInterP(), puIdx, lastMode, m_listSelBits);
        bool bDoUnidir = true;
        // collect the MVs of neighbouring blocks for the AMVP list built below
        cu.getNeighbourMV(puIdx, pu.puAbsPartIdx, interMode.interNeighbours);
        /* Uni-directional prediction */
        if ((m_param->analysisLoadReuseLevel > 1 && m_param->analysisLoadReuseLevel != 10)
            || (m_param->analysisMultiPassRefine && m_param->rc.bStatRead) || (m_param->bAnalysisType == AVC_INFO) || (useAsMVP))
        {
            // analysis-reuse path (not examined here)
            // ...        
        }
        else if (m_param->bDistributeMotionEstimation) // distributed motion estimation, related to multi-threading (not examined here)
        {
            // ...
        }
        if (bDoUnidir) // uni-directional search over every list and reference
        {
            interMode.bestME[puIdx][0].ref = interMode.bestME[puIdx][1].ref = -1;
            uint32_t refMask = refMasks[puIdx] ? refMasks[puIdx] : (uint32_t)-1;

            for (int list = 0; list < numPredDir; list++)
            {
                for (int ref = 0; ref < numRefIdx[list]; ref++)
                {
                    ProfileCounter(interMode.cu, totalMotionReferences[cuGeom.depth]);

                    if (!(refMask & (1 << ref)))
                    {
                        ProfileCounter(interMode.cu, skippedMotionReferences[cuGeom.depth]);
                        continue;
                    }

                    uint32_t bits = m_listSelBits[list] + MVP_IDX_BITS;
                    bits += getTUBits(ref, numRefIdx[list]);
                    // 2. build the AMVP list (2 entries) from interNeighbours
                    int numMvc = cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc);
					
                    const MV* amvp = interMode.amvpCand[list][ref];
                    // choose the better of the two AMVP candidates
                    int mvpIdx = selectMVP(cu, pu, amvp, list, ref);
                    MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx], mvp_lowres;
                    bool bLowresMVP = false;

                    if (!m_param->analysisSave && !m_param->analysisLoad) /* Prevents load/save outputs from diverging when lowresMV is not available */
                    {
                        // MV from the low-resolution frame
                        MV lmv = getLowresMV(cu, pu, list, ref);
                        if (lmv.notZero())
                            mvc[numMvc++] = lmv;
                        if (m_param->bEnableHME)
                            mvp_lowres = lmv;
                    }
                    if (m_param->searchMethod == X265_SEA)
                    {
                        int puX = puIdx & 1;
                        int puY = puIdx >> 1;
                        for (int planes = 0; planes < INTEGRAL_PLANE_NUM; planes++)
                            m_me.integral[planes] = interMode.fencYuv->m_integral[list][ref][planes] + puX * pu.width + puY * pu.height * m_slice->m_refFrameList[list][ref]->m_reconPic->m_stride;
                    }
                    // set the search window (searchRange defaults to 57)
                    setSearchRange(cu, mvp, m_param->searchRange, mvmin, mvmax);
                    // 3. run motion estimation
                    int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv, m_param->maxSlices, 
                      m_param->bSourceReferenceEstimation ? m_slice->m_refFrameList[list][ref]->m_fencPic->getLumaAddr(0) : 0);
                    // HME is off by default
                    if (m_param->bEnableHME && mvp_lowres.notZero() && mvp_lowres != mvp)
                    {
                        MV outmv_lowres;
                        setSearchRange(cu, mvp_lowres, m_param->searchRange, mvmin, mvmax);
                        int lowresMvCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp_lowres, numMvc, mvc, m_param->searchRange, outmv_lowres, m_param->maxSlices,
                            m_param->bSourceReferenceEstimation ? m_slice->m_refFrameList[list][ref]->m_fencPic->getLumaAddr(0) : 0);
                        if (lowresMvCost < satdCost)
                        {
                            outmv = outmv_lowres;
                            satdCost = lowresMvCost;
                            bLowresMVP = true;
                        }
                    }

                    /* Get total cost of partition, but only include MV bit cost once */
                    // cost of the resulting MV
                    bits += m_me.bitcost(outmv);
                    uint32_t mvCost = m_me.mvcost(outmv);
                    uint32_t cost = (satdCost - mvCost) + m_rdCost.getCost(bits);
                    /* Update LowresMVP to best AMVP cand*/
                    if (bLowresMVP)
                        updateMVP(amvp[mvpIdx], outmv, bits, cost, mvp_lowres);

                    /* Refine MVP selection, updates: mvpIdx, bits, cost */
                    mvp = checkBestMVP(amvp, outmv, mvpIdx, bits, cost);
                    // keep the best result for this list
                    if (cost < bestME[list].cost)
                    {
                        bestME[list].mv      = outmv;
                        bestME[list].mvp     = mvp;
                        bestME[list].mvpIdx  = mvpIdx;
                        bestME[list].ref     = ref;
                        bestME[list].cost    = cost;
                        bestME[list].bits    = bits;
                        bestME[list].mvCost  = mvCost;
                    }
                }
                /* the second list ref bits start at bit 16 */
                refMask >>= 16;
            }
        }

        /* Bi-directional prediction */
        MotionData bidir[2];
        uint32_t bidirCost = MAX_UINT;
        int bidirBits = 0;

        if (slice->isInterB() && !cu.isBipredRestriction() &&  /* biprediction is possible for this PU */
            cu.m_partSize[pu.puAbsPartIdx] != SIZE_2Nx2N &&    /* 2Nx2N biprediction is handled elsewhere */
            bestME[0].cost != MAX_UINT && bestME[1].cost != MAX_UINT)
        {
            // bi-directional prediction for B slices (not examined here)
            // ...
        }

        /* select best option and store into CU */
        // pick the cheapest option
        if (mrgCost < bidirCost && mrgCost < bestME[0].cost && mrgCost < bestME[1].cost)
        {
            cu.m_mergeFlag[pu.puAbsPartIdx] = true;
            cu.m_mvpIdx[0][pu.puAbsPartIdx] = merge.index; /* merge candidate ID is stored in L0 MVP idx */
            cu.setPUInterDir(merge.dir, pu.puAbsPartIdx, puIdx);
            cu.setPUMv(0, merge.mvField[0].mv, pu.puAbsPartIdx, puIdx);
            cu.setPURefIdx(0, merge.mvField[0].refIdx, pu.puAbsPartIdx, puIdx);
            cu.setPUMv(1, merge.mvField[1].mv, pu.puAbsPartIdx, puIdx);
            cu.setPURefIdx(1, merge.mvField[1].refIdx, pu.puAbsPartIdx, puIdx);

            totalmebits += merge.bits;
        }
        else if (bidirCost < bestME[0].cost && bidirCost < bestME[1].cost)
        {
            lastMode = 2;

            cu.m_mergeFlag[pu.puAbsPartIdx] = false;
            cu.setPUInterDir(3, pu.puAbsPartIdx, puIdx);
            cu.setPUMv(0, bidir[0].mv, pu.puAbsPartIdx, puIdx);
            cu.setPURefIdx(0, bestME[0].ref, pu.puAbsPartIdx, puIdx);
            cu.m_mvd[0][pu.puAbsPartIdx] = bidir[0].mv - bidir[0].mvp;
            cu.m_mvpIdx[0][pu.puAbsPartIdx] = bidir[0].mvpIdx;

            cu.setPUMv(1, bidir[1].mv, pu.puAbsPartIdx, puIdx);
            cu.setPURefIdx(1, bestME[1].ref, pu.puAbsPartIdx, puIdx);
            cu.m_mvd[1][pu.puAbsPartIdx] = bidir[1].mv - bidir[1].mvp;
            cu.m_mvpIdx[1][pu.puAbsPartIdx] = bidir[1].mvpIdx;

            totalmebits += bidirBits;
        }
        else if (bestME[0].cost <= bestME[1].cost)
        {
            lastMode = 0;

            cu.m_mergeFlag[pu.puAbsPartIdx] = false;
            cu.setPUInterDir(1, pu.puAbsPartIdx, puIdx);
            cu.setPUMv(0, bestME[0].mv, pu.puAbsPartIdx, puIdx);
            cu.setPURefIdx(0, bestME[0].ref, pu.puAbsPartIdx, puIdx);
            cu.m_mvd[0][pu.puAbsPartIdx] = bestME[0].mv - bestME[0].mvp;
            cu.m_mvpIdx[0][pu.puAbsPartIdx] = bestME[0].mvpIdx;

            cu.setPURefIdx(1, REF_NOT_VALID, pu.puAbsPartIdx, puIdx);
            cu.setPUMv(1, mvzero, pu.puAbsPartIdx, puIdx);

            totalmebits += bestME[0].bits;
        }
        else
        {   // list 1 uni-prediction is the best option; store its motion data
            lastMode = 1;

            cu.m_mergeFlag[pu.puAbsPartIdx] = false;
            cu.setPUInterDir(2, pu.puAbsPartIdx, puIdx);
            cu.setPUMv(1, bestME[1].mv, pu.puAbsPartIdx, puIdx);
            cu.setPURefIdx(1, bestME[1].ref, pu.puAbsPartIdx, puIdx);
            cu.m_mvd[1][pu.puAbsPartIdx] = bestME[1].mv - bestME[1].mvp;
            cu.m_mvpIdx[1][pu.puAbsPartIdx] = bestME[1].mvpIdx;

            cu.setPURefIdx(0, REF_NOT_VALID, pu.puAbsPartIdx, puIdx);
            cu.setPUMv(0, mvzero, pu.puAbsPartIdx, puIdx);

            totalmebits += bestME[1].bits;
        }
        // motion-compensate with the chosen mode to get this PU's prediction, which the caller uses for cost evaluation
        motionCompensation(cu, pu, *predYuv, true, bChromaMC);
    }
    interMode.sa8dBits += totalmebits;
}
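
Two small pieces of logic in this function are worth isolating: the reference mask layout (list 1 bits start at bit 16, per the comment in the listing) and the order of comparisons in the final if/else chain. The sketch below reproduces both. `refAllowed`, `PuChoice` and `choosePuMode` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Reference mask layout used by the search loop: bits 0..15 select list 0
// references, bits 16..31 select list 1 references.
inline bool refAllowed(uint32_t refMask, int list, int ref)
{
    uint32_t listMask = list == 0 ? refMask : (refMask >> 16);
    return (listMask & (1u << ref)) != 0;
}

enum class PuChoice { Merge, Bidir, UniL0, UniL1 };

// Same comparison order as the if/else chain at the end of the per-PU loop:
// merge must be strictly cheaper than everything else, then bidir must be
// strictly cheaper than both uni-directional results, and list 0 wins ties
// against list 1.
inline PuChoice choosePuMode(uint32_t mrgCost, uint32_t bidirCost, uint32_t l0Cost, uint32_t l1Cost)
{
    if (mrgCost < bidirCost && mrgCost < l0Cost && mrgCost < l1Cost)
        return PuChoice::Merge;
    if (bidirCost < l0Cost && bidirCost < l1Cost)
        return PuChoice::Bidir;
    if (l0Cost <= l1Cost)
        return PuChoice::UniL0;
    return PuChoice::UniL1;
}
```

Options that were never evaluated keep the cost MAX_UINT, so they can never win a strict comparison. In a P slice, for example, only merge and list 0 compete.
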

2.2.1.1 Evaluating merge for sub-PUs (mergeEstimation)

There is not much to annotate here. The main difference is that the cost is computed with SATD

/* estimation of best merge coding of an inter PU (2Nx2N merge PUs are evaluated as their own mode) */
uint32_t Search::mergeEstimation(CUData& cu, const CUGeom& cuGeom, const PredictionUnit& pu, int puIdx, MergeData& m)
{
    // this function is never called for 2Nx2N blocks
    X265_CHECK(cu.m_partSize[0] != SIZE_2Nx2N, "mergeEstimation() called for 2Nx2N\n");

    MVField  candMvField[MRG_MAX_NUM_CANDS][2];
    uint8_t  candDir[MRG_MAX_NUM_CANDS];
    uint32_t numMergeCand = cu.getInterMergeCandidates(pu.puAbsPartIdx, puIdx, candMvField, candDir);

    if (cu.isBipredRestriction())
    {
        /* do not allow bidir merge candidates if PU is smaller than 8x8, drop L1 reference */
        for (uint32_t mergeCand = 0; mergeCand < numMergeCand; ++mergeCand)
        {
            if (candDir[mergeCand] == 3)
            {
                candDir[mergeCand] = 1;
                candMvField[mergeCand][1].refIdx = REF_NOT_VALID;
            }
        }
    }

    Yuv& tempYuv = m_rqt[cuGeom.depth].tmpPredYuv;

    uint32_t outCost = MAX_UINT;
    // walk the merge candidate list and keep the cheapest candidate
    for (uint32_t mergeCand = 0; mergeCand < numMergeCand; ++mergeCand)
    {
        /* Prevent TMVP candidates from using unavailable reference pixels */
        if (m_bFrameParallel) // frame-level parallelism is enabled
        {
            // Parallel slices bound check
            if (m_param->maxSlices > 1)
            {
                if (cu.m_bFirstRowInSlice &
                    ((candMvField[mergeCand][0].mv.y < (2 * 4)) | (candMvField[mergeCand][1].mv.y < (2 * 4))))
                    continue;

                // Last row in slice can't reference beyond bound since it is another slice area
                // TODO: we may beyond bound in future since these area have a chance to finish because we use parallel slices. Necessary prepare research on load balance
                if (cu.m_bLastRowInSlice &&
                    ((candMvField[mergeCand][0].mv.y > -3 * 4) | (candMvField[mergeCand][1].mv.y > -3 * 4)))
                    continue;
            }

            if (candMvField[mergeCand][0].mv.y >= (m_param->searchRange + 1) * 4 ||
                candMvField[mergeCand][1].mv.y >= (m_param->searchRange + 1) * 4)
                continue;
        }

        cu.m_mv[0][pu.puAbsPartIdx] = candMvField[mergeCand][0].mv;
        cu.m_refIdx[0][pu.puAbsPartIdx] = (int8_t)candMvField[mergeCand][0].refIdx;
        cu.m_mv[1][pu.puAbsPartIdx] = candMvField[mergeCand][1].mv;
        cu.m_refIdx[1][pu.puAbsPartIdx] = (int8_t)candMvField[mergeCand][1].refIdx;
        // motion compensation to get the prediction block
        motionCompensation(cu, pu, tempYuv, true, m_me.bChromaSATD);
        // the distortion is SATD
        uint32_t costCand = m_me.bufSATD(tempYuv.getLumaAddr(pu.puAbsPartIdx), tempYuv.m_size);
        if (m_me.bChromaSATD)
            costCand += m_me.bufChromaSATD(tempYuv, pu.puAbsPartIdx);

        uint32_t bitsCand = getTUBits(mergeCand, numMergeCand);
        costCand = costCand + m_rdCost.getCost(bitsCand);
        if (costCand < outCost)
        {
            outCost = costCand;
            m.bits = bitsCand;
            m.index = mergeCand;
        }
    }

    m.mvField[0] = candMvField[m.index][0];
    m.mvField[1] = candMvField[m.index][1];
    m.dir = candDir[m.index];

    return outCost;
}

2.2.1.2 AMVP implementation

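Like Merge mode, AMVP takes its MVs from the available spatial neighbours and from the temporal reference block. The steps are roughly
(1) Collect the available neighbouring MVs (getNeighbourMV)
(2) Build the AMVP list (getPMV)
(3) Choose the best candidate from the AMVP list (selectMVP)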

/* Constructs a list of candidates for AMVP, and a larger list of motion candidates */
void CUData::getNeighbourMV(uint32_t puIdx, uint32_t absPartIdx, InterNeighbourMV* neighbours) const
{
    // Set the temporal neighbour to unavailable by default.
    neighbours[MD_COLLOCATED].unifiedRef = -1;

    uint32_t partIdxLT, partIdxRT, partIdxLB = deriveLeftBottomIdx(puIdx);
    deriveLeftRightTopIdx(puIdx, partIdxLT, partIdxRT);

    // Load the spatial MVs.
    // read the MVs of the available spatial neighbours
    getInterNeighbourMV(neighbours + MD_BELOW_LEFT, partIdxLB, MD_BELOW_LEFT);
    getInterNeighbourMV(neighbours + MD_LEFT,       partIdxLB, MD_LEFT);
    getInterNeighbourMV(neighbours + MD_ABOVE_RIGHT,partIdxRT, MD_ABOVE_RIGHT);
    getInterNeighbourMV(neighbours + MD_ABOVE,      partIdxRT, MD_ABOVE);
    getInterNeighbourMV(neighbours + MD_ABOVE_LEFT, partIdxLT, MD_ABOVE_LEFT);
    // look for an available temporal (collocated) MV
    if (m_slice->m_sps->bTemporalMVPEnabled)
    {
        uint32_t absPartAddr = m_absIdxInCTU + absPartIdx;
        uint32_t partIdxRB = deriveRightBottomIdx(puIdx);

        // co-located RightBottom temporal predictor (H)
        int ctuIdx = -1;

        // image boundary check
        if (m_encData->getPicCTU(m_cuAddr)->m_cuPelX + g_zscanToPelX[partIdxRB] + UNIT_SIZE < m_slice->m_sps->picWidthInLumaSamples &&
            m_encData->getPicCTU(m_cuAddr)->m_cuPelY + g_zscanToPelY[partIdxRB] + UNIT_SIZE < m_slice->m_sps->picHeightInLumaSamples)
        {
            uint32_t absPartIdxRB = g_zscanToRaster[partIdxRB];
            uint32_t numUnits = s_numPartInCUSize;
            bool bNotLastCol = lessThanCol(absPartIdxRB, numUnits - 1); // is not at the last column of CTU
            bool bNotLastRow = lessThanRow(absPartIdxRB, numUnits - 1); // is not at the last row    of CTU

            if (bNotLastCol && bNotLastRow)
            {
                absPartAddr = g_rasterToZscan[absPartIdxRB + RASTER_SIZE + 1];
                ctuIdx = m_cuAddr;
            }
            else if (bNotLastCol)
                absPartAddr = g_rasterToZscan[(absPartIdxRB + 1) & (numUnits - 1)];
            else if (bNotLastRow)
            {
                absPartAddr = g_rasterToZscan[absPartIdxRB + RASTER_SIZE - numUnits + 1];
                ctuIdx = m_cuAddr + 1;
            }
            else // is the right bottom corner of CTU
                absPartAddr = 0;
        }

        if (!(ctuIdx >= 0 && getCollocatedMV(ctuIdx, absPartAddr, neighbours + MD_COLLOCATED)))
        {
            uint32_t partIdxCenter =  deriveCenterIdx(puIdx);
            uint32_t curCTUIdx = m_cuAddr;
            // Fall back to the co-located block at the centre of the PU
            getCollocatedMV(curCTUIdx, partIdxCenter, neighbours + MD_COLLOCATED);
        }
    }
}
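The co-located MV loaded here is not used as-is: getPMV later rescales it with scaleMvByPOCDist according to the POC distances involved. The sketch below follows the fixed-point scaling formula of the HEVC specification; the names `Mv16`, `clip3`, `scaleComponent` and `scaleByPocDistance` are hypothetical, and x265's own helper may differ in detail. It assumes the co-located picture and its reference have different POCs (td is non-zero).

```cpp
#include <algorithm>
#include <cstdlib>

struct Mv16 { int x, y; };   // MV in quarter-pel units (hypothetical type)

inline int clip3(int lo, int hi, int v) { return std::min(hi, std::max(lo, v)); }

// Scale one MV component by a Q8 fixed-point factor, rounding to nearest.
inline int scaleComponent(int v, int scale)
{
    const int p = scale * v;
    const int mag = (std::abs(p) + 127) >> 8;
    return clip3(-32768, 32767, p < 0 ? -mag : mag);
}

// tb: POC distance from the current picture to its reference
// td: POC distance from the co-located picture to its reference (non-zero)
Mv16 scaleByPocDistance(Mv16 colMv, int curPOC, int curRefPOC, int colPOC, int colRefPOC)
{
    const int tb = clip3(-128, 127, curPOC - curRefPOC);
    const int td = clip3(-128, 127, colPOC - colRefPOC);
    if (tb == td)
        return colMv;                                  // equal distances: nothing to scale

    const int tx = (16384 + std::abs(td) / 2) / td;    // roughly 2^14 / td
    // Floor division by 64, written out so it does not depend on >> of negatives.
    const int num = tb * tx + 32;
    const int q = num >= 0 ? num / 64 : -((-num + 63) / 64);
    const int scale = clip3(-4096, 4095, q);           // roughly 256 * tb / td
    return { scaleComponent(colMv.x, scale), scaleComponent(colMv.y, scale) };
}
```

For example, a co-located MV of (10, -4) whose reference is one picture away scales to (20, -8) when the current PU's reference is two pictures away, and flips sign when the current reference lies on the other side in display order.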

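The AMVP list is built as follows. In short: fill in the spatial candidates first, then the temporal candidate, and finally pad with zero vectors if the list holds fewer than 2 entries.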

// Create the PMV list. Called for each reference index.
int CUData::getPMV(InterNeighbourMV *neighbours, uint32_t picList, uint32_t refIdx, MV* amvpCand, MV* pmv) const
{
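    // Direct MVP: a neighbour's MV taken as-is, because it points to the same reference picture as the current PU
    MV directMV[MD_ABOVE_LEFT + 1];
    // Indirect MVP: a scaled MV; the neighbour's MV points to a different reference picture than the current PU, so it must be rescaled
    MV indirectMV[MD_ABOVE_LEFT + 1];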
    bool validDirect[MD_ABOVE_LEFT + 1];
    bool validIndirect[MD_ABOVE_LEFT + 1];

    // Left candidate.
    validDirect[MD_BELOW_LEFT]  = getDirectPMV(directMV[MD_BELOW_LEFT], neighbours + MD_BELOW_LEFT, picList, refIdx);
    validDirect[MD_LEFT]        = getDirectPMV(directMV[MD_LEFT], neighbours + MD_LEFT, picList, refIdx);
    // Top candidate.
    validDirect[MD_ABOVE_RIGHT] = getDirectPMV(directMV[MD_ABOVE_RIGHT], neighbours + MD_ABOVE_RIGHT, picList, refIdx);
    validDirect[MD_ABOVE]       = getDirectPMV(directMV[MD_ABOVE], neighbours + MD_ABOVE, picList, refIdx);
    validDirect[MD_ABOVE_LEFT]  = getDirectPMV(directMV[MD_ABOVE_LEFT], neighbours + MD_ABOVE_LEFT, picList, refIdx);

    // Left candidate.
    validIndirect[MD_BELOW_LEFT]  = getIndirectPMV(indirectMV[MD_BELOW_LEFT], neighbours + MD_BELOW_LEFT, picList, refIdx);
    validIndirect[MD_LEFT]        = getIndirectPMV(indirectMV[MD_LEFT], neighbours + MD_LEFT, picList, refIdx);
    // Top candidate.
    validIndirect[MD_ABOVE_RIGHT] = getIndirectPMV(indirectMV[MD_ABOVE_RIGHT], neighbours + MD_ABOVE_RIGHT, picList, refIdx);
    validIndirect[MD_ABOVE]       = getIndirectPMV(indirectMV[MD_ABOVE], neighbours + MD_ABOVE, picList, refIdx);
    validIndirect[MD_ABOVE_LEFT]  = getIndirectPMV(indirectMV[MD_ABOVE_LEFT], neighbours + MD_ABOVE_LEFT, picList, refIdx);

	/*
        1. Fill in the MVs of the available spatial neighbours, read in the order A0 -> A1 -> B0 -> B1 -> B2
		+--+           +--+--+
		|B2|		   |B1|B0|
		+--+--+--+--+--+--+--+
		   |           |
		   +           +
		   |  Current  |
		   +    PU     +
		   |           |
		+--+           +
		|A1|           |
		+--+--+--+--+--+
		|A0|
		+--+
	*/

    int num = 0;
    // Left predictor search
    if (validDirect[MD_BELOW_LEFT])
        amvpCand[num++] = directMV[MD_BELOW_LEFT];
    else if (validDirect[MD_LEFT])
        amvpCand[num++] = directMV[MD_LEFT];
    else if (validIndirect[MD_BELOW_LEFT])
        amvpCand[num++] = indirectMV[MD_BELOW_LEFT];
    else if (validIndirect[MD_LEFT])
        amvpCand[num++] = indirectMV[MD_LEFT];

    bool bAddedSmvp = num > 0;

    // Above predictor search
    if (validDirect[MD_ABOVE_RIGHT])
        amvpCand[num++] = directMV[MD_ABOVE_RIGHT];
    else if (validDirect[MD_ABOVE])
        amvpCand[num++] = directMV[MD_ABOVE];
    else if (validDirect[MD_ABOVE_LEFT])
        amvpCand[num++] = directMV[MD_ABOVE_LEFT];

    if (!bAddedSmvp)
    {
        if (validIndirect[MD_ABOVE_RIGHT])
            amvpCand[num++] = indirectMV[MD_ABOVE_RIGHT];
        else if (validIndirect[MD_ABOVE])
            amvpCand[num++] = indirectMV[MD_ABOVE];
        else if (validIndirect[MD_ABOVE_LEFT])
            amvpCand[num++] = indirectMV[MD_ABOVE_LEFT];
    }

    int numMvc = 0;
    for (int dir = MD_LEFT; dir <= MD_ABOVE_LEFT; dir++)
    {
        if (validDirect[dir] && directMV[dir].notZero())
            pmv[numMvc++] = directMV[dir];

        if (validIndirect[dir] && indirectMV[dir].notZero())
            pmv[numMvc++] = indirectMV[dir];
    }

    if (num == 2)
        num -= amvpCand[0] == amvpCand[1];

    // Get the collocated candidate. At this step, either the first candidate
    // was found or its value is 0.
    // 2. Add the temporal candidate if there is still room
    if (m_slice->m_sps->bTemporalMVPEnabled && num < 2)
    {
        int tempRefIdx = neighbours[MD_COLLOCATED].refIdx[picList];
        if (tempRefIdx != -1)
        {
            uint32_t cuAddr = neighbours[MD_COLLOCATED].cuAddr[picList];
            const Frame* colPic = m_slice->m_refFrameList[m_slice->isInterB() && !m_slice->m_colFromL0Flag][m_slice->m_colRefIdx];
            const CUData* colCU = colPic->m_encData->getPicCTU(cuAddr);

            // Scale the vector
            int colRefPOC = colCU->m_slice->m_refPOCList[tempRefIdx >> 4][tempRefIdx & 0xf];
            int colPOC = colCU->m_slice->m_poc;

            int curRefPOC = m_slice->m_refPOCList[picList][refIdx];
            int curPOC = m_slice->m_poc;
            pmv[numMvc++] = amvpCand[num++] = scaleMvByPOCDist(neighbours[MD_COLLOCATED].mv[picList], curPOC, curRefPOC, colPOC, colRefPOC);
        }
    }
    // 3. Pad with zero vectors until the list holds 2 entries
    while (num < AMVP_NUM_CANDS)
        amvpCand[num++] = 0;

    return numMvc;
}
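The fill order used by getPMV can be condensed into a small standalone sketch. `Mv`, `Neighbour` and `buildAmvpList` below are hypothetical names rather than x265 API; each neighbour is assumed to carry at most one direct MV (same reference picture) and one indirect (already scaled) MV, and the temporal candidate is passed in already scaled. The extra `pmv` motion-candidate list that the real function also fills is left out.

```cpp
#include <array>
#include <optional>

struct Mv {
    int x = 0, y = 0;
    bool operator==(const Mv& o) const { return x == o.x && y == o.y; }
};

// Hypothetical per-neighbour record: "direct" points to the same reference
// picture as the current PU, "indirect" was scaled from another reference.
struct Neighbour {
    std::optional<Mv> direct;
    std::optional<Mv> indirect;
};

enum { A0, A1, B0, B1, B2, NUM_SPATIAL };  // below-left, left, above-right, above, above-left
constexpr int kNumCands = 2;               // AMVP keeps two candidates

std::array<Mv, kNumCands> buildAmvpList(const std::array<Neighbour, NUM_SPATIAL>& nb,
                                        const std::optional<Mv>& temporal)
{
    std::array<Mv, kNumCands> cand{};
    int num = 0;

    // 1a. Left group: direct A0, direct A1, then scaled A0, scaled A1.
    if      (nb[A0].direct)   cand[num++] = *nb[A0].direct;
    else if (nb[A1].direct)   cand[num++] = *nb[A1].direct;
    else if (nb[A0].indirect) cand[num++] = *nb[A0].indirect;
    else if (nb[A1].indirect) cand[num++] = *nb[A1].indirect;
    const bool leftFound = num > 0;

    // 1b. Above group: direct B0, B1, B2.
    if      (nb[B0].direct) cand[num++] = *nb[B0].direct;
    else if (nb[B1].direct) cand[num++] = *nb[B1].direct;
    else if (nb[B2].direct) cand[num++] = *nb[B2].direct;

    // Scaled above candidates are tried only when the left group was empty.
    if (!leftFound) {
        if      (nb[B0].indirect) cand[num++] = *nb[B0].indirect;
        else if (nb[B1].indirect) cand[num++] = *nb[B1].indirect;
        else if (nb[B2].indirect) cand[num++] = *nb[B2].indirect;
    }

    // Drop the second candidate if it duplicates the first.
    if (num == 2 && cand[0] == cand[1])
        --num;

    // 2. Temporal candidate, only if there is still room.
    if (num < kNumCands && temporal)
        cand[num++] = *temporal;

    // 3. Pad with zero vectors.
    while (num < kNumCands)
        cand[num++] = Mv{};

    return cand;
}
```

Note that the temporal candidate only gets a slot when the spatial stage produced fewer than two distinct MVs, which is why the duplicate check runs before it.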

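Next, the best candidate is picked from the AMVP list built above, i.e. a 2-way choice. Each candidate is motion-compensated with predInterLumaPixel, and the resulting SAD costs are compared.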

/* Pick between the two AMVP candidates which is the best one to use as
 * MVP for the motion search, based on SAD cost */
int Search::selectMVP(const CUData& cu, const PredictionUnit& pu, const MV amvp[AMVP_NUM_CANDS], int list, int ref)
{
    if (amvp[0] == amvp[1])
        return 0;

    Yuv& tmpPredYuv = m_rqt[cu.m_cuDepth[0]].tmpPredYuv;
    uint32_t costs[AMVP_NUM_CANDS];

    for (int i = 0; i < AMVP_NUM_CANDS; i++)
    {
        MV mvCand = amvp[i];

        // NOTE: skip mvCand if Y is > merange and -FN>1
        if (m_bFrameParallel)
        {
            costs[i] = m_me.COST_MAX;

            if (mvCand.y >= (m_param->searchRange + 1) * 4)
                continue;

            if ((m_param->maxSlices > 1) &
                ((mvCand.y < m_sliceMinY)
              |  (mvCand.y > m_sliceMaxY)))
                continue;
        }
        cu.clipMv(mvCand);
        // Motion-compensate with this candidate and measure the resulting SAD cost
        predInterLumaPixel(pu, tmpPredYuv, *m_slice->m_refReconPicList[list][ref], mvCand);
        costs[i] = m_me.bufSAD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size);
    }

    return (costs[0] <= costs[1]) ? 0 : 1;
}
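The 2-way choice made by selectMVP can be illustrated with a simplified, integer-pel sketch. `Plane`, `blockSad`, `IntMv` and `pickPredictor` are hypothetical names; unlike the real function, this version does no sub-pel interpolation, no MV clipping and no frame-parallel range checks, and the caller must keep the displaced block inside the picture.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

struct Plane {
    int width, height;
    std::vector<uint8_t> pix;   // row-major luma samples
    uint8_t at(int x, int y) const { return pix[y * width + x]; }
};

// SAD between the size x size block of `cur` at (bx, by) and the block of
// `ref` displaced by the integer vector (mvx, mvy).
uint32_t blockSad(const Plane& cur, const Plane& ref,
                  int bx, int by, int size, int mvx, int mvy)
{
    uint32_t sum = 0;
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            sum += std::abs(int(cur.at(bx + x, by + y)) -
                            int(ref.at(bx + x + mvx, by + y + mvy)));
    return sum;
}

struct IntMv { int x, y; };

// Returns 0 or 1: the index of the candidate with the lower SAD (ties go to 0).
int pickPredictor(const Plane& cur, const Plane& ref, int bx, int by, int size,
                  const IntMv cand[2])
{
    if (cand[0].x == cand[1].x && cand[0].y == cand[1].y)
        return 0;   // identical candidates: nothing to compare
    const uint32_t c0 = blockSad(cur, ref, bx, by, size, cand[0].x, cand[0].y);
    const uint32_t c1 = blockSad(cur, ref, bx, by, size, cand[1].x, cand[1].y);
    return c0 <= c1 ? 0 : 1;
}
```

The chosen predictor only seeds the search and determines the MVD, so a plain SAD comparison is cheap enough here; the finer cost measures come later in motion estimation.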
2.2.1.3 Motion estimation (motionEstimate)

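Motion estimation (ME below) is the core of inter prediction: it determines the best MV and its cost. The main steps are:
(1) Evaluate each neighbouring MV candidate collected earlier; the cost is a SAD measured at sub-pel precision (subpelCompare)
(2) Integer-pel ME
  ME mainly uses diamond search or hexagon search. Once the best full-pel MV is found, its cost is compared with the best neighbour-candidate cost. If the searched MV is cheaper, sub-pel ME continues from it; otherwise the best neighbour candidate replaces it as the starting point of the sub-pel stage
(3) Sub-pel ME
  Sub-pel ME normally uses SATD (the half-pel stage may use SAD, depending on the subpelRefine workload): half-pel ME first, then quarter-pel ME

PS: in integer-pel ME the source block comes from fenc and is compared against integer-pel reference samples; in sub-pel ME it is compared against interpolated reference samples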
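The half-pel then quarter-pel stage of step (3) is a small pattern search in quarter-pel units. The sketch below shows the idea with a caller-supplied cost function; `QMv`, `refineStage` and `subpelRefine` are hypothetical names, the cost callback stands in for subpelCompare plus mvcost, and the slice-boundary range checks of the real code are omitted.

```cpp
#include <functional>

struct QMv { int x, y; };   // MV in quarter-pel units

// Same neighbour order as the square1 table quoted in the x265 listing.
static const QMv kSquare[9] = {
    {0, 0}, {0, -1}, {0, 1}, {-1, 0}, {1, 0}, {-1, -1}, {-1, 1}, {1, -1}, {1, 1}
};

// One refinement stage: test the 8 neighbours at distance `step`, move to the
// best one, and repeat until the centre wins or `iters` rounds have been used.
int refineStage(QMv& best, int bestCost, int step, int iters,
                const std::function<int(const QMv&)>& cost)
{
    for (int it = 0; it < iters; it++) {
        int bdir = 0;
        for (int i = 1; i <= 8; i++) {
            const QMv q{best.x + kSquare[i].x * step, best.y + kSquare[i].y * step};
            const int c = cost(q);
            if (c < bestCost) { bestCost = c; bdir = i; }
        }
        if (!bdir)
            break;                      // centre is still the best point
        best.x += kSquare[bdir].x * step;
        best.y += kSquare[bdir].y * step;
    }
    return bestCost;
}

// Half-pel stage (step 2 in quarter-pel units), then quarter-pel stage (step 1).
int subpelRefine(QMv& mv, int hpelIters, int qpelIters,
                 const std::function<int(const QMv&)>& cost)
{
    int c = cost(mv);
    c = refineStage(mv, c, 2, hpelIters, cost);
    return refineStage(mv, c, 1, qpelIters, cost);
}
```

With a toy cost whose minimum sits at (7, -5) and a start of (8, -4), the half-pel stage finds no improvement and the quarter-pel stage lands on the minimum. The iteration limits matter: with few rounds the search can stop short of the true optimum, which is the trade-off the workload table controls.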

int MotionEstimate::motionEstimate(ReferencePlanes *ref,
                                   const MV &       mvmin,
                                   const MV &       mvmax,
                                   const MV &       qmvp,
                                   int              numCandidates,
                                   const MV *       mvc,
                                   int              merange,
                                   MV &             outQMv,
                                   uint32_t         maxSlices,
                                   pixel *          srcReferencePlane)
{
    ALIGN_VAR_16(int, costs[16]);
    bool hme = srcReferencePlane && srcReferencePlane == ref->fpelLowerResPlane[0];
    if (ctuAddr >= 0)
        blockOffset = ref->reconPic->getLumaAddr(ctuAddr, absPartIdx) - ref->reconPic->getLumaAddr(0);
    intptr_t stride = hme ? ref->lumaStride / 2 : ref->lumaStride;
    pixel* fenc = fencPUYuv.m_buf[0];
    pixel* fref = srcReferencePlane == 0 ? ref->fpelPlane[0] + blockOffset : srcReferencePlane + blockOffset;
    // qmvp is the best AMVP candidate chosen earlier; it is set as the MV predictor and serves as the starting point of the search
    setMVP(qmvp);

    MV qmvmin = mvmin.toQPel(); // convert to quarter-pel units
    MV qmvmax = mvmax.toQPel();

    /* The term cost used here means satd/sad values for that particular search.
     * The costs used in ME integer search only includes the SAD cost of motion
     * residual and sqrtLambda times MVD bits.  The subpel refine steps use SATD
     * cost of residual and sqrtLambda * MVD bits.  Mode decision will be based
     * on video distortion cost (SSE/PSNR) plus lambda times all signaling bits
     * (mode + MVD bits). */

    // measure SAD cost at clipped QPEL MVP
    // Clip the predictor to the [min, max] range
    MV pmv = qmvp.clipped(qmvmin, qmvmax);
    MV bestpre = pmv;
    int bprecost;

    if (ref->isLowres)
        bprecost = ref->lowresQPelCost(fenc, blockOffset, pmv, sad, hme);
    else // measure the initial sub-pel SAD cost at the clipped AMVP predictor
        bprecost = subpelCompare(ref, pmv, sad);

    /* re-measure full pel rounded MVP with SAD as search start point */
    MV bmv = pmv.roundToFPel();
    int bcost = bprecost;
    if (pmv.isSubpel())
        bcost = sad(fenc, FENC_STRIDE, fref + bmv.x + bmv.y * stride, stride) + mvcost(bmv << 2);

    // measure SAD cost at MV(0) if MVP is not zero
    if (pmv.notZero()) 
    {
        // If the MVP is non-zero, also measure the zero vector and compare it with the cost at the current pmv
        int cost = sad(fenc, FENC_STRIDE, fref, stride) + mvcost(MV(0, 0));
        if (cost < bcost)
        {
            bcost = cost;
            bmv = 0;
            bmv.y = X265_MAX(X265_MIN(0, mvmax.y), mvmin.y);
        }
    }

    X265_CHECK(!(ref->isLowres && numCandidates), "lowres motion candidates not allowed\n")
    // measure SAD cost at each QPEL motion vector candidate
    // 1. Walk the MV candidate list (mvc) and measure the cost of each candidate
    for (int i = 0; i < numCandidates; i++)
    {
        MV m = mvc[i].clipped(qmvmin, qmvmax);
        if (m.notZero() & (m != pmv ? 1 : 0) & (m != bestpre ? 1 : 0)) // check already measured
        {
            // mvcost returns the bit cost of the MVD, already multiplied by lambda
            int cost = subpelCompare(ref, m, sad) + mvcost(m);
            if (cost < bprecost)
            {
                bprecost = cost;
                bestpre = m;
            }
        }
    }

    pmv = pmv.roundToFPel();
    MV omv = bmv;  // current search origin or starting point
    // 2. Integer-pel motion search
    int search = ref->isHMELowres ? (hme ? searchMethodL0 : searchMethodL1) : searchMethod;
    switch (search)
    {
    case X265_DIA_SEARCH:
    {
        /* diamond search, radius 1 */
		/*
            Diamond search with radius 1; points are evaluated in the order below, where 0 is the starting point
			  1
			3 0 4
			  2
		*/
        bcost <<= 4;
        int i = merange;
        do
        {
			/*
                COST_MV_X4_DIR is defined as
				#define COST_MV_X4_DIR(m0x, m0y, m1x, m1y, m2x, m2y, m3x, m3y, costs) \
				{ \
					pixel *pix_base = fref + bmv.x + bmv.y * stride; \
					sad_x4(fenc, \
						   pix_base + (m0x) + (m0y) * stride, \
						   pix_base + (m1x) + (m1y) * stride, \
						   pix_base + (m2x) + (m2y) * stride, \
						   pix_base + (m3x) + (m3y) * stride, \
						   stride, costs); \
                    (costs)[0] += mvcost((bmv + MV(m0x, m0y)) << 2); \  // up,    MV(0, -1), lambda * R0
                    (costs)[1] += mvcost((bmv + MV(m1x, m1y)) << 2); \  // down,  MV(0,  1), lambda * R1
                    (costs)[2] += mvcost((bmv + MV(m2x, m2y)) << 2); \  // left,  MV(-1, 0), lambda * R2
                    (costs)[3] += mvcost((bmv + MV(m3x, m3y)) << 2); \  // right, MV(1,  0), lambda * R3
				}

                Each cost is therefore SAD + lambda * R, where mvcost returns the bits spent on the current MVD, already multiplied by lambda
			*/
            COST_MV_X4_DIR(0, -1, 0, 1, -1, 0, 1, 0, costs);
			/*
				#define COPY1_IF_LT(x, y) {if ((y) < (x)) (x) = (y);}
				
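                The code below tracks the best cost together with the position it came from
                (1) each cost is shifted left by 4, freeing the low 4 bits to hold a tag for the search point
                (2) bcost was already shifted left by 4 above, so a plain comparison finds the best cost
                (3) once the best cost is known, the MV step is recovered from the tag

                COPY1_IF_LT(bcost, (costs[0] << 4) + 1);    //  1 -> 0001, up
                COPY1_IF_LT(bcost, (costs[1] << 4) + 3);    //  3 -> 0011, down
                COPY1_IF_LT(bcost, (costs[2] << 4) + 4);    //  4 -> 0100, left
                COPY1_IF_LT(bcost, (costs[3] << 4) + 12);   // 12 -> 1100, right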
			*/
            if ((bmv.y - 1 >= mvmin.y) & (bmv.y - 1 <= mvmax.y))    // skip the upper point if it is out of range
                COPY1_IF_LT(bcost, (costs[0] << 4) + 1);
            if ((bmv.y + 1 >= mvmin.y) & (bmv.y + 1 <= mvmax.y))    // skip the lower point if it is out of range
                COPY1_IF_LT(bcost, (costs[1] << 4) + 3);
            COPY1_IF_LT(bcost, (costs[2] << 4) + 4);
            COPY1_IF_LT(bcost, (costs[3] << 4) + 12);
            // If the low 4 bits are 0, the current centre beat all four neighbours, so the search can stop
            if (!(bcost & 15))
                break;
			/*
                Example: if the best point is the right neighbour, MV(1, 0), the low 4 bits are 1100
                (bcost << 28) >> 30 = 11 (binary, i.e. -1), so bmv.x -= -1 moves one unit to the right
                (bcost << 30) >> 30 = 00, so bmv.y -= 0 leaves the vertical position unchanged
			*/
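            bmv.x -= (bcost << 28) >> 30; // bcost is a 32-bit int: shifting left by 28 keeps only the low 4 bits, then the arithmetic shift right by 30 extracts the signed x step
            bmv.y -= (bcost << 30) >> 30; // shifting left by 30 keeps the low 2 bits, then shifting right by 30 extracts the signed y step
            // Clear the low 4 bits to recover the best cost (still scaled by 16; shift right by 4 for the raw cost)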
            bcost &= ~15;
        }
        while (--i && bmv.checkRange(mvmin, mvmax)); // stop when the merange iteration budget is used up or bmv leaves the allowed MV range
        bcost >>= 4; // shift right by 4 to get the real best cost
        break;
    }

    case X265_HEX_SEARCH: // hexagon search, radius 2
    {
me_hex2:
        /* hexagon search, radius 2 */
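        /*
            Hexagon points numbered in evaluation order, with 0 as the starting point
            (image coordinates: +y points down, so MV(-1, 2) is below-left)
               6   5
            1    0    4
               2   3
        */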
#if 0
        for (int i = 0; i < merange / 2; i++)
        {
            omv = bmv;
            COST_MV(omv.x - 2, omv.y);
            COST_MV(omv.x - 1, omv.y + 2);
            COST_MV(omv.x + 1, omv.y + 2);
            COST_MV(omv.x + 2, omv.y);
            COST_MV(omv.x + 1, omv.y - 2);
            COST_MV(omv.x - 1, omv.y - 2);
            if (omv == bmv)
                break;
            if (!bmv.checkRange(mvmin, mvmax))
                break;
        }

#else // if 0
        /* equivalent to the above, but eliminates duplicate candidates */
		/*
            COST_MV_X3_DIR is defined as follows; it is much like the macro above, except that it evaluates 3 points per call
			#define COST_MV_X3_DIR(m0x, m0y, m1x, m1y, m2x, m2y, costs) \
			{ \
				pixel *pix_base = fref + bmv.x + bmv.y * stride; \
				sad_x3(fenc, \
					   pix_base + (m0x) + (m0y) * stride, \
					   pix_base + (m1x) + (m1y) * stride, \
					   pix_base + (m2x) + (m2y) * stride, \
					   stride, costs); \
				(costs)[0] += mvcost((bmv + MV(m0x, m0y)) << 2); \
				(costs)[1] += mvcost((bmv + MV(m1x, m1y)) << 2); \
				(costs)[2] += mvcost((bmv + MV(m2x, m2y)) << 2); \
			}
		*/
        COST_MV_X3_DIR(-2, 0, -1, 2,  1, 2, costs);
        bcost <<= 3;
        if ((bmv.y >= mvmin.y) & (bmv.y <= mvmax.y))
            COPY1_IF_LT(bcost, (costs[0] << 3) + 2);    // position 1: MV(-2, 0)
        if ((bmv.y + 2 >= mvmin.y) & (bmv.y + 2 <= mvmax.y))
        {
            COPY1_IF_LT(bcost, (costs[1] << 3) + 3);    // position 2: MV(-1, 2)
            COPY1_IF_LT(bcost, (costs[2] << 3) + 4);    // position 3: MV(1, 2)
        }

        COST_MV_X3_DIR(2, 0,  1, -2, -1, -2, costs);
        if ((bmv.y >= mvmin.y) & (bmv.y <= mvmax.y))
            COPY1_IF_LT(bcost, (costs[0] << 3) + 5);    // position 4: MV(2, 0)
        if ((bmv.y - 2 >= mvmin.y) & (bmv.y - 2 <= mvmax.y))
        {
            COPY1_IF_LT(bcost, (costs[1] << 3) + 6);    // position 5: MV(1, -2)
            COPY1_IF_LT(bcost, (costs[2] << 3) + 7);    // position 6: MV(-1, -2)
        }
        // Non-zero low 3 bits mean one of the 6 points above beat the centre
        if (bcost & 7)
        {
            int dir = (bcost & 7) - 2; // index of the best position
			// const MV hex2[8] = { MV(-1, -2), MV(-2, 0), MV(-1, 2), MV(1, 2), MV(2, 0), MV(1, -2), MV(-1, -2), MV(-2, 0) };

            if ((bmv.y + hex2[dir + 1].y >= mvmin.y) & (bmv.y + hex2[dir + 1].y <= mvmax.y))
            {
                bmv += hex2[dir + 1]; // move bmv to the best position

                /* half hexagon, not overlapping the previous iteration */
                // Starting from the best direction dir found above, keep running half-hexagon searches that test only the 3 new points
                for (int i = (merange >> 1) - 1; i > 0 && bmv.checkRange(mvmin, mvmax); i--)
                {
					/*
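                        Suppose the best position so far was MV(-2, 0), i.e. dir = 0. The three points tested are then
                        (1) dir + 0 => hex2[0] = MV(-1, -2)
                        (2) dir + 1 => hex2[1] = MV(-2,  0)
                        (3) dir + 2 => hex2[2] = MV(-1,  2)
                        i.e. the previous direction and its two neighbours on the hexagon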
					*/
                    COST_MV_X3_DIR(hex2[dir + 0].x, hex2[dir + 0].y,
                        hex2[dir + 1].x, hex2[dir + 1].y,
                        hex2[dir + 2].x, hex2[dir + 2].y,
                        costs);
                    bcost &= ~7;

                    if ((bmv.y + hex2[dir + 0].y >= mvmin.y) & (bmv.y + hex2[dir + 0].y <= mvmax.y))
                        COPY1_IF_LT(bcost, (costs[0] << 3) + 1);

                    if ((bmv.y + hex2[dir + 1].y >= mvmin.y) & (bmv.y + hex2[dir + 1].y <= mvmax.y))
                        COPY1_IF_LT(bcost, (costs[1] << 3) + 2);

                    if ((bmv.y + hex2[dir + 2].y >= mvmin.y) & (bmv.y + hex2[dir + 2].y <= mvmax.y))
                        COPY1_IF_LT(bcost, (costs[2] << 3) + 3);

                    if (!(bcost & 7))
                        break;

                    dir += (bcost & 7) - 2;
                    dir = mod6m1[dir + 1];
                    bmv += hex2[dir + 1];
                }
            } // if ((bmv.y + hex2[dir + 1].y >= mvmin.y) & (bmv.y + hex2[dir + 1].y <= mvmax.y))
        }
        bcost >>= 3; // recover the real best cost
#endif // if 0

        /* square refine */
        // Square search around the hexagon result to refine the MV
        /*
            Square points numbered by their index in square1 (+y points down)
            5 1 7
            3 0 4
            6 2 8
        */
        int dir = 0;
        COST_MV_X4_DIR(0, -1,  0, 1, -1, 0, 1, 0, costs);
        if ((bmv.y - 1 >= mvmin.y) & (bmv.y - 1 <= mvmax.y))
            COPY2_IF_LT(bcost, costs[0], dir, 1);
        if ((bmv.y + 1 >= mvmin.y) & (bmv.y + 1 <= mvmax.y))
            COPY2_IF_LT(bcost, costs[1], dir, 2);
        COPY2_IF_LT(bcost, costs[2], dir, 3);
        COPY2_IF_LT(bcost, costs[3], dir, 4);
        COST_MV_X4_DIR(-1, -1, -1, 1, 1, -1, 1, 1, costs);
        if ((bmv.y - 1 >= mvmin.y) & (bmv.y - 1 <= mvmax.y))
            COPY2_IF_LT(bcost, costs[0], dir, 5);
        if ((bmv.y + 1 >= mvmin.y) & (bmv.y + 1 <= mvmax.y))
            COPY2_IF_LT(bcost, costs[1], dir, 6);
        if ((bmv.y - 1 >= mvmin.y) & (bmv.y - 1 <= mvmax.y))
            COPY2_IF_LT(bcost, costs[2], dir, 7);
        if ((bmv.y + 1 >= mvmin.y) & (bmv.y + 1 <= mvmax.y))
            COPY2_IF_LT(bcost, costs[3], dir, 8);
		// const MV square1[9] = { MV(0, 0), MV(0, -1), MV(0, 1), MV(-1, 0), MV(1, 0), MV(-1, -1), MV(-1, 1), MV(1, -1), MV(1, 1) };
        bmv += square1[dir];
        break;
    }

    case X265_UMH_SEARCH: // uneven multi-hexagon search (more involved; not studied here)
    {
        // ...
    }
		
    case X265_STAR_SEARCH: // Adapted from HM ME
    {   // star search (used by the slow and slower presets; not studied here)
        // ...
    }

    case X265_SEA: // Successive Elimination Algorithm
    {
       // ...
    }

    case X265_FULL_SEARCH: // exhaustive (full) search
    {
        // ...
    }

    default:
        X265_CHECK(0, "invalid motion estimate mode\n");
        break;
    }
	/*
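        3. Sub-pel search
        Compare the best neighbour-candidate cost with the best cost from the integer search
        (1) if the candidate is better, i.e. bprecost < bcost, discard the searched MV and continue from the candidate (already in quarter-pel units)
        (2) otherwise promote the searched MV to quarter-pel units and refine it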
	*/
    if (bprecost < bcost)
    {
        bmv = bestpre;
        bcost = bprecost;
    }
    else
        bmv = bmv.toQPel(); // promote search bmv to qpel

    const SubpelWorkload& wl = workload[this->subpelRefine];

    // check mv range for slice bound
    // Check whether the MV crosses the slice boundary; with the usual one-slice-per-frame configuration this should rarely trigger
    if ((maxSlices > 1) & ((bmv.y < qmvmin.y) | (bmv.y > qmvmax.y)))
    {
        bmv.y = x265_min(x265_max(bmv.y, qmvmin.y), qmvmax.y);
        bcost = subpelCompare(ref, bmv, satd) + mvcost(bmv);
    }

    if (!bcost) // zero cost: skip the sub-pel search; the returned cost then contains only the MV bit cost
    {
        /* if there was zero residual at the clipped MVP, we can skip subpel
         * refine, but we do need to include the mvcost in the returned cost */
        bcost = mvcost(bmv);
    }
    else if (ref->isLowres) // low-resolution picture
    {
        // ..
    }
    else
    {	
        pixelcmp_t hpelcomp;
        // Choose SATD or SAD as the half-pel cost measure (SATD is presumably the default)
        if (wl.hpel_satd)
        {
            bcost = subpelCompare(ref, bmv, satd) + mvcost(bmv);
            hpelcomp = satd;
        }
        else
            hpelcomp = sad;
        // Half-pel motion search
        for (int iter = 0; iter < wl.hpel_iters; iter++)
        {
            int bdir = 0;
            for (int i = 1; i <= wl.hpel_dirs; i++)
            {
                // Step through the square pattern (x2 because bmv is in quarter-pel units)
                MV qmv = bmv + square1[i] * 2;

                // check mv range for slice bound
                if ((qmv.y < qmvmin.y) | (qmv.y > qmvmax.y))
                    continue;
                // Measure the cost and keep the best MV
                int cost = subpelCompare(ref, qmv, hpelcomp) + mvcost(qmv);
                COPY2_IF_LT(bcost, cost, bdir, i);
            }

            if (bdir)
                bmv += square1[bdir] * 2;
            else
                break;
        }

        /* if HPEL search used SAD, remeasure with SATD before QPEL */
        // If the half-pel search used SAD, re-measure with SATD before the quarter-pel stage, because the quarter-pel search uses SATD
        if (!wl.hpel_satd)
            bcost = subpelCompare(ref, bmv, satd) + mvcost(bmv);
        // Quarter-pel motion search
        for (int iter = 0; iter < wl.qpel_iters; iter++)
        {
            int bdir = 0;
            for (int i = 1; i <= wl.qpel_dirs; i++)
            {
                MV qmv = bmv + square1[i];

                // check mv range for slice bound
                if ((qmv.y < qmvmin.y) | (qmv.y > qmvmax.y))
                    continue;

                int cost = subpelCompare(ref, qmv, satd) + mvcost(qmv);
                COPY2_IF_LT(bcost, cost, bdir, i);
            }

            if (bdir)
                bmv += square1[bdir];
            else
                break;
        }
    }

    // check mv range for slice bound
    X265_CHECK(((bmv.y >= qmvmin.y) & (bmv.y <= qmvmax.y)), "mv beyond range!");

    x265_emms();
    outQMv = bmv;
    return bcost;
}
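The packed-cost trick used in the diamond search is easier to see in isolation. In the sketch below, `IMv`, `signExtend2` and `diamondSearch` are hypothetical names and the cost callback stands in for SAD plus mvcost. The tags 1, 3, 4 and 12 are the ones from the listing, but the two-bit fields are decoded with an explicit sign extension instead of the `(bcost << 28) >> 30` shift pair, and the MV range checks are left out.

```cpp
#include <functional>

struct IMv { int x, y; };   // MV in integer-pel units

// Sign-extend a 2-bit two's-complement field: 0 -> 0, 1 -> 1, 3 -> -1.
inline int signExtend2(int v) { return (v ^ 2) - 2; }

// Radius-1 diamond search. Costs must stay below 2^27 so that `cost << 4`
// still fits in a signed 32-bit int.
int diamondSearch(IMv& bmv, int maxIters, const std::function<int(const IMv&)>& cost)
{
    int bcost = cost(bmv) << 4;            // low 4 bits == 0 means "centre is best"
    for (int i = 0; i < maxIters; i++) {
        // Keep the minimum of (cost << 4) + tag; the tag rides along in the low bits.
        auto keepMin = [&bcost](int c, int tag) {
            const int packed = (c << 4) + tag;
            if (packed < bcost) bcost = packed;
        };
        keepMin(cost(IMv{bmv.x, bmv.y - 1}), 1);    // 0001: up
        keepMin(cost(IMv{bmv.x, bmv.y + 1}), 3);    // 0011: down
        keepMin(cost(IMv{bmv.x - 1, bmv.y}), 4);    // 0100: left
        keepMin(cost(IMv{bmv.x + 1, bmv.y}), 12);   // 1100: right

        const int tag = bcost & 15;
        if (!tag)
            break;                                   // no neighbour beat the centre
        bmv.x -= signExtend2(tag >> 2);              // bits 3..2 hold the x step
        bmv.y -= signExtend2(tag & 3);               // bits 1..0 hold the y step
        bcost &= ~15;                                // drop the tag, keep cost << 4
    }
    return bcost >> 4;
}
```

Because the tag sits below the cost, a single integer comparison orders candidates by cost first and by tag second, so the centre (tag 0) wins every tie and no separate "best direction" variable is needed.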

2.3 Intra mode in P frames (checkIntraInInter)

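If the cost of inter prediction is high, some blocks in a P frame may be coded in intra mode instead. The overall flow is essentially the same as regular intra prediction.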

/* Note that this function does not save the best intra prediction, it must
 * be generated later. It records the best mode in the cu */
void Search::checkIntraInInter(Mode& intraMode, const CUGeom& cuGeom)
{
    ProfileCUScope(intraMode.cu, intraAnalysisElapsedTime, countIntraAnalysis);

    CUData& cu = intraMode.cu;
    uint32_t depth = cuGeom.depth;

    cu.setPartSizeSubParts(SIZE_2Nx2N);
    cu.setPredModeSubParts(MODE_INTRA);

    const uint32_t initTuDepth = 0;
    uint32_t log2TrSize = cuGeom.log2CUSize - initTuDepth;
    uint32_t tuSize = 1 << log2TrSize;
    const uint32_t absPartIdx = 0;

    // Reference sample smoothing
    IntraNeighbors intraNeighbors;
    initIntraNeighbors(cu, absPartIdx, initTuDepth, true, &intraNeighbors);
    initAdiPattern(cu, cuGeom, absPartIdx, intraNeighbors, ALL_IDX);

    const pixel* fenc = intraMode.fencYuv->m_buf[0];
    uint32_t stride = intraMode.fencYuv->m_size;

    int sad, bsad;
    uint32_t bits, bbits, mode, bmode;
    uint64_t cost, bcost;

    // 33 Angle modes once
    int scaleTuSize = tuSize;
    int scaleStride = stride;
    int costShift = 0;
    int sizeIdx = log2TrSize - 2;

    if (tuSize > 32) // is the CU 64x64?
    {
        // CU is 64x64, we scale to 32x32 and adjust required parameters
        primitives.scale2D_64to32(m_fencScaled, fenc, stride);
        fenc = m_fencScaled;

        pixel nScale[129];
        intraNeighbourBuf[1][0] = intraNeighbourBuf[0][0];
        primitives.scale1D_128to64[NONALIGNED](nScale + 1, intraNeighbourBuf[0] + 1);

        // we do not estimate filtering for downscaled samples
        memcpy(&intraNeighbourBuf[0][1], &nScale[1], 2 * 64 * sizeof(pixel));   // Top & Left pixels
        memcpy(&intraNeighbourBuf[1][1], &nScale[1], 2 * 64 * sizeof(pixel));

        scaleTuSize = 32;
        scaleStride = 32;
        costShift = 2;
        sizeIdx = 5 - 2; // log2(scaleTuSize) - 2
    }

    pixelcmp_t sa8d = primitives.cu[sizeIdx].sa8d;
    int predsize = scaleTuSize * scaleTuSize;

    m_entropyCoder.loadIntraDirModeLuma(m_rqt[depth].cur);

    /* there are three cost tiers for intra modes:
     *  pred[0]          - mode probable, least cost
     *  pred[1], pred[2] - less probable, slightly more cost
     *  non-mpm modes    - all cost the same (rbits) */
    // Initialise the MPMs (most probable modes)
    uint64_t mpms;
    uint32_t mpmModes[3];
    uint32_t rbits = getIntraRemModeBits(cu, absPartIdx, mpmModes, mpms);

    // DC
    // DC mode prediction
    primitives.cu[sizeIdx].intra_pred[DC_IDX](m_intraPredAngs, scaleStride, intraNeighbourBuf[0], 0, (scaleTuSize <= 16));
    bsad = sa8d(fenc, scaleStride, m_intraPredAngs, scaleStride) << costShift;
    bmode = mode = DC_IDX;
    bbits = (mpms & ((uint64_t)1 << mode)) ? m_entropyCoder.bitsIntraModeMPM(mpmModes, mode) : rbits;
    bcost = m_rdCost.calcRdSADCost(bsad, bbits);

    // PLANAR
    // Planar mode prediction
    pixel* planar = intraNeighbourBuf[0];
    if (tuSize & (8 | 16 | 32))
        planar = intraNeighbourBuf[1];

    primitives.cu[sizeIdx].intra_pred[PLANAR_IDX](m_intraPredAngs, scaleStride, planar, 0, 0);
    sad = sa8d(fenc, scaleStride, m_intraPredAngs, scaleStride) << costShift;
    mode = PLANAR_IDX;
    bits = (mpms & ((uint64_t)1 << mode)) ? m_entropyCoder.bitsIntraModeMPM(mpmModes, mode) : rbits;
    cost = m_rdCost.calcRdSADCost(sad, bits);
    COPY4_IF_LT(bcost, cost, bmode, mode, bsad, sad, bbits, bits);

    bool allangs = true;
    if (primitives.cu[sizeIdx].intra_pred_allangs)
    {
        primitives.cu[sizeIdx].transpose(m_fencTransposed, fenc, scaleStride);
        primitives.cu[sizeIdx].intra_pred_allangs(m_intraPredAngs, intraNeighbourBuf[0], intraNeighbourBuf[1], (scaleTuSize <= 16)); 
    }
    else
        allangs = false;
    // Macro that evaluates a single angular mode
#define TRY_ANGLE(angle) \
    if (allangs) { \
        if (angle < 18) \
            sad = sa8d(m_fencTransposed, scaleTuSize, &m_intraPredAngs[(angle - 2) * predsize], scaleTuSize) << costShift; \
        else \
            sad = sa8d(fenc, scaleStride, &m_intraPredAngs[(angle - 2) * predsize], scaleTuSize) << costShift; \
        bits = (mpms & ((uint64_t)1 << angle)) ? m_entropyCoder.bitsIntraModeMPM(mpmModes, angle) : rbits; \
        cost = m_rdCost.calcRdSADCost(sad, bits); \
    } else { \
        int filter = !!(g_intraFilterFlags[angle] & scaleTuSize); \
        primitives.cu[sizeIdx].intra_pred[angle](m_intraPredAngs, scaleTuSize, intraNeighbourBuf[filter], angle, scaleTuSize <= 16); \
        sad = sa8d(fenc, scaleStride, m_intraPredAngs, scaleTuSize) << costShift; \
        bits = (mpms & ((uint64_t)1 << angle)) ? m_entropyCoder.bitsIntraModeMPM(mpmModes, angle) : rbits; \
        cost = m_rdCost.calcRdSADCost(sad, bits); \
    }
    // Is fast intra search enabled?
    if (m_param->bEnableFastIntra)
    {
        int asad = 0;
        uint32_t lowmode, highmode, amode = 5, abits = 0;
        uint64_t acost = MAX_INT64;

        /* pick the best angle, sampling at distance of 5 */
        for (mode = 5; mode < 35; mode += 5)
        {
            TRY_ANGLE(mode);
            COPY4_IF_LT(acost, cost, amode, mode, asad, sad, abits, bits);
        }

        /* refine best angle at distance 2, then distance 1 */
        for (uint32_t dist = 2; dist >= 1; dist--)
        {
            lowmode = amode - dist;
            highmode = amode + dist;

            X265_CHECK(lowmode >= 2 && lowmode <= 34, "low intra mode out of range\n");
            TRY_ANGLE(lowmode);
            COPY4_IF_LT(acost, cost, amode, lowmode, asad, sad, abits, bits);

            X265_CHECK(highmode >= 2 && highmode <= 34, "high intra mode out of range\n");
            TRY_ANGLE(highmode);
            COPY4_IF_LT(acost, cost, amode, highmode, asad, sad, abits, bits);
        }

        if (amode == 33)
        {
            TRY_ANGLE(34);
            COPY4_IF_LT(acost, cost, amode, 34, asad, sad, abits, bits);
        }

        COPY4_IF_LT(bcost, acost, bmode, amode, bsad, asad, bbits, abits);
    }
    else // calculate and search all intra prediction angles for lowest cost
    {
        // Try all 33 angular modes (2..34); DC and planar were already evaluated above
        for (mode = 2; mode < 35; mode++)
        {
            TRY_ANGLE(mode);
            COPY4_IF_LT(bcost, cost, bmode, mode, bsad, sad, bbits, bits);
        }
    }

    cu.setLumaIntraDirSubParts((uint8_t)bmode, absPartIdx, depth + initTuDepth);
    intraMode.initCosts();
    intraMode.totalBits = bbits;
    intraMode.distortion = bsad;
    intraMode.sa8dCost = bcost;
    intraMode.sa8dBits = bbits;
}
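The bEnableFastIntra branch is a coarse-to-fine search over the angular modes. A minimal sketch, assuming a hypothetical cost callback in place of the TRY_ANGLE macro (`fastAngleSearch` is not an x265 function):

```cpp
#include <cstdint>
#include <functional>

// Coarse-to-fine search over the HEVC angular modes 2..34.
int fastAngleSearch(const std::function<uint64_t(int)>& cost)
{
    int best = 5;
    uint64_t bestCost = UINT64_MAX;
    auto consider = [&](int mode) {
        const uint64_t c = cost(mode);
        if (c < bestCost) { bestCost = c; best = mode; }
    };

    // Coarse pass: sample every 5th mode (5, 10, 15, 20, 25, 30).
    for (int mode = 5; mode < 35; mode += 5)
        consider(mode);

    // Refine around the current best at distance 2, then at distance 1.
    for (int dist = 2; dist >= 1; dist--) {
        const int centre = best;   // both probes are relative to the same centre
        consider(centre - dist);
        consider(centre + dist);
    }

    // Mode 34 cannot be reached by the steps above unless it is probed explicitly.
    if (best == 33)
        consider(34);
    return best;
}
```

This evaluates 10 or 11 angles instead of all 33. It finds the true minimum when the cost is reasonably smooth over the angle index, and may settle on a nearby mode otherwise, which is the speed versus accuracy trade-off of the fast path.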

This concludes the brief walkthrough of inter prediction in x265.
