图形图像处理－之－高质量的快速的图像缩放中篇二次线性插值和三次卷积插值

最新推荐文章于 2023-01-28 08:44:31 发布

yangdelong

最新推荐文章于 2023-01-28 08:44:31 发布

阅读量3.2k

点赞数

文章标签：图像处理图形 border table dst float

图形图像处理－之－高质量的快速的图像缩放中篇二次线性插值和三次卷积插值
HouSisong@263.net 2006.12.13

(2007.03.06更新)

tag:图像缩放、速度优化、定点数优化、近邻取样插值、二次线性插值、三次线性插值、
MipMap链、三次卷积插值、MMX/SSE优化、CPU缓存优化

摘要:首先给出一个基本的图像缩放算法，然后一步一步的优化其速度和缩放质量；

高质量的快速的图像缩放全文分为:
     上篇近邻取样插值和其速度优化
     中篇二次线性插值和三次卷积插值
     下篇三次线性插值和MipMap链

正文：
为了便于讨论，这里只处理32bit的ARGB颜色；
代码使用C++;涉及到汇编优化的时候假定为x86平台;使用的编译器为vc6;
为了代码的可读性,没有加入异常处理代码;
测试使用的CPU为赛扬2G;

速度测试说明:
只测试内存数据到内存数据的缩放
测试图片都是800*600缩放到1024*768; fps表示每秒钟的帧数,值越大表示函数越快

A:近邻取样插值、二次线性插值、三次卷积插值缩放效果对比

原图近邻取样缩放到0.6倍近邻取样缩放到1.6倍

二次线性插值缩放到0.6倍二次线性插值缩放到1.6倍

三次卷积插值缩放到0.6倍三次卷积插值缩放到1.6倍

原图近邻取样缩放到8倍二次线性插值缩放到8倍三次卷积插值缩放到8倍

近邻取样插值缩放简单、速度快，但是缩放出的图片质量比较差，特别是对于人物、景色等
图片的缩放有比较明显的锯齿；使用二次或更高次插值有利于改善缩放效果；

B: 首先定义图像数据结构:

#define asm __asm

typedef unsigned char TUInt8; // [0..255]
struct TARGB32      //32 bit color
{
    TUInt8  b,g,r,a;          //a is alpha
};

struct TPicRegion  //一块颜色数据区的描述，便于参数传递
{
    TARGB32*    pdata;         //颜色数据首地址
    long        byte_width;    // 一行数据的物理宽度(字节宽度)；
                //abs(byte_width)有可能大于等于width*sizeof(TARGB32);
    long        width;         //像素宽度
    long        height;        //像素高度
};

//那么访问一个点的函数可以写为：
inline TARGB32& Pixels(const TPicRegion& pic,const long x,const long y)
{
    return ( (TARGB32*)((TUInt8*)pic.pdata+pic.byte_width* y) )[x];
}

二次线性插值缩放：

C: 二次线性插值缩放原理和公式图示:

缩放后图片原图片
(宽DW,高DH) (宽SW,高SH)

缩放映射原理:
(Sx-0)/(SW-0)=(Dx-0)/(DW-0) (Sy-0)/(SH-0)=(Dy-0)/(DH-0)
=> Sx=Dx*SW/DW Sy=Dy*SH/DH

聚焦看看(Sx,Sy)坐标点(Sx,Sy为浮点数)附近的情况；

对于近邻取样插值的缩放算法，直接取Color0颜色作为缩放后点的颜色；
二次线性插值需要考虑(Sx,Sy)坐标点周围的4个颜色值Color0/Color1/Color2/Color3，
把(Sx,Sy)到A/B/C/D坐标点的距离作为系数来把4个颜色混合出缩放后点的颜色；
( u=Sx-floor(Sx); v=Sy-floor(Sy); 说明:floor函数的返回值为小于等于参数的最大整数 )
二次线性插值公式为:
tmpColor0=Color0*(1-u) + Color2*u;
tmpColor1=Color1*(1-u) + Color3*u;
DstColor =tmpColor0*(1-v) + tmpColor2*v;

展开公式为:
        pm0=(1-u)*(1-v);
        pm1=v*(1-u);
        pm2=u*(1-v);
        pm3=u*v;
则颜色混合公式为:
        DstColor = Color0*pm0 + Color1*pm1 + Color2*pm2 + Color3*pm3;

对于上面的公式，它将图片向右下各移动了半个像素，需要对此做一个修正；
=> Sx=Dx*SW/DW-0.5; Sy=Dy*SH/DH-0.5;
而实际的程序,还需要考虑到边界(访问源图片可能超界)对于算法的影响,边界的处理可能有各种
方案(不处理边界或边界回绕或边界饱和或边界映射或用背景颜色混合等;文章中默认使用边界饱和来处理超界);
比如:边界饱和函数

//访问一个点的函数，(x,y)坐标可能超出图片边界,进行边界饱和
inline TARGB32& Pixels_Bound(const TPicRegion& pic,long x,long y,bool& IsInPic)
{
    //assert((pic.width>0)&&(pic.height>0));
    IsInPic=true ;
    if (x<0) {x=0; IsInPic=false; } else if (x>=pic.width ) {x=pic.width -1; IsInPic=false ; }
    if (y<0) {y=0; IsInPic=false; } else if (y>=pic.height) {y=pic.height-1; IsInPic=false ; }
    return Pixels(pic,x,y);
}

D: 二次线性插值缩放算法的一个参考实现：PicZoom_BilInear0
该函数并没有做什么优化，只是一个简单的浮点实现版本；

    inline TARGB32 BilInear0(const TPicRegion& pic,float fx,float fy)
    {
        long x=(long)fx; if (x>fx) --x; //x=floor(fx);
        long y=(long)fy; if (y>fy) --y; //y=floor(fy);
        bool IsInPic;
        TARGB32 Color0= Pixels_Bound(pic,x,y,IsInPic);
        if (!IsInPic) Color0.a=0 ;
        TARGB32 Color2=Pixels_Bound(pic,x+1 ,y,IsInPic);
        if (!IsInPic) Color2.a=0 ;
        TARGB32 Color1=Pixels_Bound(pic,x,y+1 ,IsInPic);
        if (!IsInPic) Color1.a=0 ;
        TARGB32 Color3=Pixels_Bound(pic,x+1,y+1 ,IsInPic);
        if (!IsInPic) Color3.a=0 ;

        float u=fx- x;
        float v=fy- y;
        float pm3=u* v;
        float pm2=u*(1- v);
        float pm1=v*(1- u);
        float pm0=(1-u)*(1- v);

        TARGB32 result;
        result.a=(pm0*Color0.a+pm1*Color1.a+pm2*Color2.a+pm3* Color3.a);
        result.r=(pm0*Color0.r+pm1*Color1.r+pm2*Color2.r+pm3* Color3.r);
        result.g=(pm0*Color0.g+pm1*Color1.g+pm2*Color2.g+pm3* Color3.g);
        result.b=(pm0*Color0.b+pm1*Color1.b+pm2*Color2.b+pm3* Color3.b);

        return result;
    }

void PicZoom_BilInear0(const TPicRegion& Dst,const TPicRegion& Src)
{
    if (  (0==Dst.width)||(0== Dst.height)
        ||(0==Src.width)||(0==Src.height)) return ;

    unsigned long dst_width= Dst.width;
    TARGB32* pDstLine= Dst.pdata;
    for (unsigned long y=0;y<Dst.height;++ y)
    {
        float srcy=(float)y*Src.height/Dst.height-0.5 ;
        for (unsigned long x=0;x<dst_width;++ x)
        {
            float srcx=(float)x*Src.width/Dst.width-0.5 ;
            pDstLine[x]= BilInear0(Src,srcx,srcy);
        }
        ((TUInt8*&)pDstLine)+= Dst.byte_width;
    }
}

//速度测试:
//==============================================================================
// PicZoom_BilInear0 5.4 fps

(仔细优化的浮点算法不见得会比定点数实现的速度慢，但这里不再继续优化浮点函数版本，
因为重点是MMX的优化)

E: 把PicZoom_BilInear0的浮点计算改写为定点数实现：PicZoom_BilInear1
我使用一个近似的缩放映射修正公式，它没有平移的问题，也没有边界超界的麻烦，而且边界处理的效果也
很不错； => Sx=Dx*(SW-1)/DW; Sy=Dy*(SH-1)/DH;
当然,如果你有需要也可以借鉴后面的三次卷积的边界处理方法来改进二次线性插值;

    inline TARGB32 BilInear1(const TPicRegion& pic,const long x_16,const long y_16)
    {
        unsigned long x=x_16>>16 ;
        unsigned long y=y_16>>16 ;
        TARGB32 Color0= Pixels(pic,x,y);
        TARGB32 Color2=Pixels(pic,x+1 ,y);
        TARGB32 Color1=Pixels(pic,x,y+1 );
        TARGB32 Color3=Pixels(pic,x+1,y+1 );

        unsigned long u_8=(x_16 & 0xFFFF)>>8 ;
        unsigned long v_8=(y_16 & 0xFFFF)>>8 ;
        unsigned long pm3_16=(u_8* v_8);
        unsigned long pm2_16=(u_8*(unsigned long)(255- v_8));
        unsigned long pm1_16=(v_8*(unsigned long)(255- u_8));
        unsigned long pm0_16=((255-u_8)*(255- v_8));

        TARGB32 result;
        result.a=((pm0_16*Color0.a+pm1_16*Color1.a+pm2_16*Color2.a+pm3_16*Color3.a)>>16 );
        result.r=((pm0_16*Color0.r+pm1_16*Color1.r+pm2_16*Color2.r+pm3_16*Color3.r)>>16 );
        result.g=((pm0_16*Color0.g+pm1_16*Color1.g+pm2_16*Color2.g+pm3_16*Color3.g)>>16 );
        result.b=((pm0_16*Color0.b+pm1_16*Color1.b+pm2_16*Color2.b+pm3_16*Color3.b)>>16 );

        return result;
    }

void PicZoom_BilInear1(const TPicRegion& Dst,const TPicRegion& Src)
{
    if (  (0==Dst.width)||(0== Dst.height)
        ||(0==Src.width)||(0==Src.height)) return ;

    unsigned long xrIntFloat_16=((Src.width-1)<<16)/Dst.width+1 ;
    unsigned long yrIntFloat_16=((Src.height-1)<<16)/Dst.height+1 ;

    unsigned long dst_width= Dst.width;
    TARGB32* pDstLine= Dst.pdata;
    unsigned long srcy_16=0 ;
    for (unsigned long y=0;y<Dst.height;++ y)
    {
        unsigned long srcx_16=0 ;
        for (unsigned long x=0;x<dst_width;++ x)
        {
            pDstLine[x]= BilInear1(Src,srcx_16,srcy_16);
            srcx_16+= xrIntFloat_16;
        }
        srcy_16+= yrIntFloat_16;
        ((TUInt8*&)pDstLine)+= Dst.byte_width;
    }
}

//速度测试:
//==============================================================================
// PicZoom_BilInear1 12.1 fps

F: 简单优化一下PicZoom_BilInear1调用的BilInear1函数得到:PicZoom_BilInear2
减少颜色数据复制、减少一些不必要的重复坐标计算等

    inline TARGB32 BilInear2(TARGB32* PColor0,TARGB32* PColor1,unsigned long u_8,unsigned long v_8)
    {
        unsigned long pm3_16=u_8* v_8;
        unsigned long pm2_16=(u_8<<8)- pm3_16;
        unsigned long pm1_16=(v_8<<8)- pm3_16;
        unsigned long pm0_16=(1<<16)-pm1_16-pm2_16- pm3_16;

        TARGB32 result;

        result.r=((pm0_16*PColor0[0].r+pm2_16*PColor0[1].r+pm1_16*PColor1[0].r+pm3_16*PColor1[1].r)>>16 );
        result.a=((pm0_16*PColor0[0].a+pm2_16*PColor0[1].a+pm1_16*PColor1[0].a+pm3_16*PColor1[1].a)>>16 );
        result.g=((pm0_16*PColor0[0].g+pm2_16*PColor0[1].g+pm1_16*PColor1[0].g+pm3_16*PColor1[1].g)>>16 );
        result.b=((pm0_16*PColor0[0].b+pm2_16*PColor0[1].b+pm1_16*PColor1[0].b+pm3_16*PColor1[1].b)>>16 );

        return result;
    }

void PicZoom_BilInear2(const TPicRegion& Dst,const TPicRegion& Src)
{
    if (  (0==Dst.width)||(0== Dst.height)
        ||(0==Src.width)||(0==Src.height)) return ;

    unsigned long xrIntFloat_16=((Src.width-1)<<16)/Dst.width+1 ;
    unsigned long yrIntFloat_16=((Src.height-1)<<16)/Dst.height+1 ;

    unsigned long dst_width= Dst.width;
    long Src_byte_width= Src.byte_width;
    TARGB32* pDstLine= Dst.pdata;
    unsigned long srcy_16=0 ;
    for (unsigned long y=0;y<Dst.height;++ y)
    {
        unsigned long v_8=(srcy_16 & 0xFFFF)>>8 ;
        TARGB32* PSrcLineColor= (TARGB32*)((TUInt8*)(Src.pdata)+Src_byte_width*(srcy_16>>16 )) ;
        unsigned long srcx_16=0 ;
        for (unsigned long x=0;x<dst_width;++ x)
        {
            TARGB32* PColor0=&PSrcLineColor[srcx_16>>16 ];
            pDstLine[x]=BilInear2(PColor0,(TARGB32*)((TUInt8*)(PColor0)+Src_byte_width),(srcx_16 & 0xFFFF)>>8 ,v_8);
            srcx_16+= xrIntFloat_16;
        }
        srcy_16+= yrIntFloat_16;
        ((TUInt8*&)pDstLine)+= Dst.byte_width;
    }
}

//速度测试:
//==============================================================================
// PicZoom_BilInear2 22.5 fps

G:利用单指令多数据处理的MMX指令一般都可以加快颜色的运算；在使用MMX改写之前，利用
32bit寄存器(或变量)来模拟单指令多数据处理；
数据储存原理:一个颜色数据分量只有一个字节,用2个字节来储存单个颜色分量的计算结果，
对于很多颜色计算来说精度就够了；那么一个32bit寄存器(或变量)就可以储存2个计算出的
临时颜色分量；从而达到了单个指令两路数据处理的目的;
单个指令两路数据处理的计算:
乘法： ((0x00AA*a)<<16) | (0x00BB*a) = 0x00AA00BB * a
可见只要保证0x00AA*a和0x00BB*a都小于(1<<16)那么乘法可以直接使用无符号数乘法了
加法: ((0x00AA+0x00CC)<<16) | (0x00BB+0x00DD) = 0x00AA00BB + 0x00CC00DD
可见只要0x00AA+0x00CC和0x00BB+0x00DD小于(1<<16)那么加法可以直接使用无符号数加法了
(移位、减法等稍微复杂一点,因为这里没有用到就不推倒运算公式了)

    inline TARGB32 BilInear3(TARGB32* PColor0,TARGB32* PColor1,unsigned long u_8,unsigned long v_8)
    {
        unsigned long pm3_8=(u_8*v_8)>>8 ;
        unsigned long pm2_8=u_8- pm3_8;
        unsigned long pm1_8=v_8- pm3_8;
        unsigned long pm0_8=256-pm1_8-pm2_8- pm3_8;

        unsigned long Color=*(unsigned long* )(PColor0);
        unsigned long BR=(Color & 0x00FF00FF)* pm0_8;
        unsigned long GA=((Color & 0xFF00FF00)>>8)* pm0_8;
                      Color=((unsigned long*)(PColor0))[1 ];
                      GA+=((Color & 0xFF00FF00)>>8)* pm2_8;
                      BR+=(Color & 0x00FF00FF)* pm2_8;
                      Color=*(unsigned long* )(PColor1);
                      GA+=((Color & 0xFF00FF00)>>8)* pm1_8;
                      BR+=(Color & 0x00FF00FF)* pm1_8;
                      Color=((unsigned long*)(PColor1))[1 ];
                      GA+=((Color & 0xFF00FF00)>>8)* pm3_8;
                      BR+=(Color & 0x00FF00FF)* pm3_8;

        TARGB32 result;
        *(unsigned long*)(&result)=(GA & 0xFF00FF00)|((BR & 0xFF00FF00)>>8 );
        return result;
    }

void PicZoom_BilInear3(const TPicRegion& Dst,const TPicRegion& Src)
{
    if (  (0==Dst.width)||(0== Dst.height)
        ||(0==Src.width)||(0==Src.height)) return ;

    unsigned long xrIntFloat_16=((Src.width-1)<<16)/Dst.width+1 ;
    unsigned long yrIntFloat_16=((Src.height-1)<<16)/Dst.height+1 ;

    unsigned long dst_width= Dst.width;
    long Src_byte_width= Src.byte_width;
    TARGB32* pDstLine= Dst.pdata;
    unsigned long srcy_16=0 ;
    for (unsigned long y=0;y<Dst.height;++ y)
    {
        unsigned long v_8=(srcy_16 & 0xFFFF)>>8 ;
        TARGB32* PSrcLineColor= (TARGB32*)((TUInt8*)(Src.pdata)+Src_byte_width*(srcy_16>>16 )) ;
        unsigned long srcx_16=0 ;
        for (unsigned long x=0;x<dst_width;++ x)
        {
            TARGB32* PColor0=&PSrcLineColor[srcx_16>>16 ];
            pDstLine[x]=BilInear3(PColor0,(TARGB32*)((TUInt8*)(PColor0)+Src_byte_width),(srcx_16 & 0xFFFF)>>8 ,v_8);
            srcx_16+= xrIntFloat_16;
        }
        srcy_16+= yrIntFloat_16;
        ((TUInt8*&)pDstLine)+= Dst.byte_width;
    }
}

//速度测试:
//==============================================================================
// PicZoom_BilInear3 33.0 fps

H:使用MMX指令改写；(速度的瓶颈在于计算，内存读写的优化意义不大)

    //BilInear_MMX(out [edi+ebx*4];mm6=v_8,mm7=0,edx=srcx_16,esi=PSrcLineColor,ecx=PSrcLineColorNext)
    void BilInear_MMX()
    {
        asm
        {
              movzx     eax,dh //eax=u_8
              MOVD         MM5,eax
              mov       eax,edx
              PUNPCKLWD    MM5,MM5
              shr       eax,16     //srcx_16>>16
              PUNPCKLDQ    MM5,MM5 //mm5=u_8

              MOVD         MM0,[ecx +eax*4+4]//MM0=Color0
              MOVD         MM2,[ecx+eax*4]  //MM2=Color2
              MOVD         MM1,[esi+eax*4+4]//MM1=Color1
              MOVD         MM3,[esi+eax*4]  //MM3=Color3
              PUNPCKLBW    MM0,MM7
              PUNPCKLBW    MM1,MM7
              PUNPCKLBW    MM2,MM7
              PUNPCKLBW    MM3,MM7
              PSUBw        MM0,MM2
              PSUBw        MM1,MM3
              PSLLw        MM2,8
              PSLLw        MM3, 8
              PMULlw       MM0,MM5
              PMULlw       MM1,MM5
              PADDw        MM0,MM2
              PADDw        MM1,MM3

              PSRLw        MM0, 8
              PSRLw        MM1, 8
              PSubw        MM0,MM1
              PSLLw        MM1, 8
              PMULlw       MM0,MM6
              PADDw        MM0,MM1

              PSRLW     MM0, 8
              PACKUSWB  MM0,MM7
              MOVd      [edi +ebx*4],MM0 //write DstColor
        }
    }


void PicZoom_BilInear_MMX(const TPicRegion& Dst,const TPicRegion& Src)
{
    if (  (0==Dst.width)||(0== Dst.height)
        ||(0==Src.width)||(0==Src.height)) return ;

    unsigned long xrIntFloat_16=((Src.width-1)<<16)/Dst.width+1 ;
    unsigned long yrIntFloat_16=((Src.height-1)<<16)/Dst.height+1 ;

    unsigned long dst_width= Dst.width;
    long Src_byte_width= Src.byte_width;
    TARGB32* pDstLine= Dst.pdata;
    unsigned long srcy_16=0 ;
    asm pxor  mm7,mm7 //mm7=0
    for (unsigned long y=0;y<Dst.height;++ y)
    {
        unsigned long v_8=(srcy_16 & 0xFFFF)>>8 ;
        TARGB32* PSrcLineColor= (TARGB32*)((TUInt8*)(Src.pdata)+Src_byte_width*(srcy_16>>16 )) ;
        TARGB32* PSrcLineColorNext= (TARGB32*)((TUInt8*)(PSrcLineColor)+ Src_byte_width) ;
        asm
        {
              movd         mm6,v_8
              PUNPCKLWD    MM6,MM6
              PUNPCKLDQ    MM6,MM6//mm6=v_8

              mov       esi,PSrcLineColor
              mov       ecx,PSrcLineColorNext
              xor       edx,edx   //srcx_16=0
              mov       ebx,dst_width
              push      ebp
              mov       edi,pDstLine
              mov       ebp,xrIntFloat_16
              lea       edi,[edi+ebx*4 ]
              neg       ebx

        loop_start:
              call BilInear_MMX

              add       edx,ebp //srcx_16+=xrIntFloat_16
              inc       ebx
              jnz       loop_start
              pop       ebp
        }
        srcy_16+= yrIntFloat_16;
        ((TUInt8*&)pDstLine)+= Dst.byte_width;
    }
    asm emms  //结束MMX的使用
}

//=====================================================================
//鉴于有读者反映汇编代码阅读困难，这里给出一个使用intel提供的函数调用方式的实现，
//读者可以相互对照来阅读代码
//要编译PicZoom_BilInear_MMX_mmh,需要#include <mmintrin.h>
//并且需要编译器支持
//函数PicZoom_BilInear_MMX_mmh速度比PicZoom_BilInear_MMX稍慢,有55.5 fps
void PicZoom_BilInear_MMX_mmh(const TPicRegion& Dst,const TPicRegion& Src)
{
if ( (0==Dst.width)||(0==Dst.height)
||(0==Src.width)||(0==Src.height)) return;

unsigned long xrIntFloat_16=((Src.width-1)<<16)/Dst.width+1;
unsigned long yrIntFloat_16=((Src.height-1)<<16)/Dst.height+1;

    unsigned long dst_width=Dst.width;
    long Src_byte_width=Src.byte_width;
    TARGB32* pDstLine=Dst.pdata;
    unsigned long srcy_16=0;
    const __m64 m_0=_mm_setzero_si64();

    for (unsigned long y=0;y<Dst.height;++y)
    {
        unsigned long v_8=(srcy_16 & 0xFFFF)>>8;
        //__m64 m_v =_mm_set_pi16(v_8,v_8,v_8,v_8);
        __m64 m_v=_m_from_int(v_8);
        m_v=_m_punpcklwd(m_v,m_v);
        m_v=_m_punpckldq(m_v,m_v);

        TARGB32* PSrcLineColor= (TARGB32*)((TUInt8*)(Src.pdata)+Src_byte_width*(srcy_16>>16)) ;
        unsigned long srcx_16=0;
        for (unsigned long x=0;x<dst_width;++x)
        {
            TARGB32* PColor0=&PSrcLineColor[srcx_16>>16];
             TARGB32* PColor1=(TARGB32*)((TUInt8*)(PColor0)+Src_byte_width);
            unsigned long u_8=(srcx_16 & 0xFFFF)>>8;
            //BilInear_MMX_mmh
            {
                //__m64 m_u =_mm_set_pi16(u_8,u_8,u_8,u_8);
                __m64 m_u=_m_from_int(u_8);
                m_u=_m_punpcklwd(m_u,m_u);
                m_u=_m_punpckldq(m_u,m_u);

                __m64 m_color0 =_m_from_int(((int*)PColor1)[1]);
                __m64 m_color2 =_m_from_int(((int*)PColor1)[0]);
                __m64 m_color1 =_m_from_int(((int*)PColor0)[1]);
                __m64 m_color3 =_m_from_int(((int*)PColor0)[0]);
                m_color0=_m_punpcklbw(m_color0,m_0);
                m_color1=_m_punpcklbw(m_color1,m_0);
                m_color2=_m_punpcklbw(m_color2,m_0);
                m_color3=_m_punpcklbw(m_color3,m_0);

                m_color0=_m_psubw(m_color0,m_color2);
                m_color1=_m_psubw(m_color1,m_color3);
                m_color2=_m_psllwi(m_color2,8);
                m_color3=_m_psllwi(m_color3,8);
                m_color0=_m_pmullw(m_color0,m_u);
                m_color1=_m_pmullw(m_color1,m_u);
                m_color0=_m_paddw(m_color0,m_color2);
                m_color1=_m_paddw(m_color1,m_color3);
                m_color0=_m_psrlwi(m_color0,8);
                m_color1=_m_psrlwi(m_color1,8);

                m_color0=_m_psubw(m_color0,m_color1);
                m_color1=_m_psllwi(m_color1,8);
                m_color0=_m_pmullw(m_color0,m_v);
                m_color0=_m_paddw(m_color0,m_color1);
                m_color0=_m_psrlwi(m_color0,8);

m_color0=_m_packuswb(m_color0,m_0);

                *(int*)(&pDstLine[x])=_m_to_int(m_color0);
            }
            srcx_16+=xrIntFloat_16;
        }
        srcy_16+=yrIntFloat_16;
        ((TUInt8*&)pDstLine)+=Dst.byte_width;
    }
    _m_empty();
}

//速度测试:
//==============================================================================
// PicZoom_BilInear_MMX 70.8 fps

I: 把测试成绩放在一起：

//zoom 800*600 to 1024*1024
//==============================================================================
// StretchBlt           169 fps
// PicZoom3_SSE      301--330 fps
//
// PicZoom_BilInear0      5.4 fps
// PicZoom_BilInear1     12.1 fps
// PicZoom_BilInear2     22.5 fps
// PicZoom_BilInear3     33.0 fps
// PicZoom_BilInear_MMX 70.8 fps

(todo: 测试使用SSE、SSE2、SSE3来实现BilInear)

三次卷积插值:

J: 二次线性插值缩放出的图片很多时候让人感觉变得模糊(术语叫低通滤波)，特别是在放大
的时候；使用三次卷积插值来改善插值结果；三次卷积插值考虑映射点周围16个点(4x4)的颜色来
计算最终的混合颜色，如图；

P(0,0)所在像素为映射的点，加上它周围的15个点，按一定系数混合得到最终输出结果；

混合公式参见PicZoom_ThreeOrder0的实现；

插值曲线公式sin(x*PI)/(x*PI),如图:

三次卷积插值曲线sin(x*PI)/(x*PI) （其中PI=3.1415926...）

K:三次卷积插值缩放算法的一个参考实现：PicZoom_ThreeOrder0
该函数并没有做过多的优化，只是一个简单的浮点实现版本;

        inline double SinXDivX(double x)
        {
            //该函数计算插值曲线sin(x*PI)/(x*PI)的值 //PI=3.1415926535897932385;
            //下面是它的近似拟合表达式
            const float a = -1; //a还可以取 a=-2,-1,-0.75,-0.5等等，起到调节锐化或模糊程度的作用

            if (x<0) x=-x; //x=abs(x);
            double x2=x*x;
            double x3=x2*x;
            if (x<=1)
              return (a+2)*x3 - (a+3)*x2 + 1;
            else if (x<=2)
              return a*x3 - (5*a)*x2 + (8*a)*x - (4*a);
            else
              return 0;
        }

        inline TUInt8 ColorBound(long Color)
        {
            if (Color<=0)
                return 0;
            else if (Color>=255)
                return 255;
            else
                return Color;
        }
    TARGB32 ThreeOrder0(const TPicRegion& pic,const float fx,const float fy)
    {
        long x0=(long)fx; if (x0>fx) --x0; //x0=floor(fx);
        long y0=(long)fy; if (y0>fy) --y0; //y0=floor(fy);
        float fu=fx-x0;
        float fv=fy-y0;

TARGB32 pixel[16];
long i,j;

        for (i=0;i<4;++i)
        {
            for (j=0;j<4;++j)
            {
                long x=x0-1+j;
                long y=y0-1+i;
                bool IsInPic;
                pixel[i*4+j]=Pixels_Bound(pic,x,y,IsInPic);
                if (!IsInPic) pixel[i*4+j].a=0;
            }
        }

        float afu[4],afv[4];
        //
        afu[0]=SinXDivX(1+fu);
        afu[1]=SinXDivX(fu);
        afu[2]=SinXDivX(1-fu);
        afu[3]=SinXDivX(2-fu);
        afv[0]=SinXDivX(1+fv);
        afv[1]=SinXDivX(fv);
        afv[2]=SinXDivX(1-fv);
        afv[3]=SinXDivX(2-fv);

        float sR=0,sG=0,sB=0,sA=0;
        for (i=0;i<4;++i)
        {
            float aR=0,aG=0,aB=0,aA=0;
            for (long j=0;j<4;++j)
            {
                aA+=afu[j]*pixel[i*4+j].a;
                aR+=afu[j]*pixel[i*4+j].r;
                aG+=afu[j]*pixel[i*4+j].g;
                aB+=afu[j]*pixel[i*4+j].b;
            }
            sA+=aA*afv[i];
            sR+=aR*afv[i];
            sG+=aG*afv[i];
            sB+=aB*afv[i];
        }

        TARGB32 result;
        result.a=ColorBound((long)(sA+0.5));
        result.r=ColorBound((long)(sR+0.5));
        result.g=ColorBound((long)(sG+0.5));
        result.b=ColorBound((long)(sB+0.5));

return result;
}

void PicZoom_ThreeOrder0(const TPicRegion& Dst,const TPicRegion& Src)
{
if ( (0==Dst.width)||(0==Dst.height)
||(0==Src.width)||(0==Src.height)) return;

    unsigned long dst_width=Dst.width;
    TARGB32* pDstLine=Dst.pdata;
    for (unsigned long y=0;y<Dst.height;++y)
    {
        float srcy=(float)y*Src.height/Dst.height-0.5;
        for (unsigned long x=0;x<dst_width;++x)
        {
            float srcx=(float)x*Src.width/Dst.width-0.5;
            pDstLine[x]=ThreeOrder0(Src,srcx,srcy);
        }
        ((TUInt8*&)pDstLine)+=Dst.byte_width;
    }
}

//速度测试:
//==============================================================================
// PicZoom_ThreeOrder0 1.6 fps

L: 使用定点数来优化缩放函数；对SinXDivX做一个查找表；
//速度测试:
//==============================================================================
// PicZoom_ThreeOrder1 2.7 fps

     static long SinXDivX_Table_8[( 2 << 8 ) + 1 ];
     class _CAutoInti_SinXDivX_Table {
     private :
         void _Inti_SinXDivX_Table()
        {
             for ( long i = 0 ;i <= ( 2 << 8 ); ++ i)
                SinXDivX_Table_8[i] = long ( 0.5 + 256 * SinXDivX(i * ( 1.0 / ( 256 ))));
        };
     public :
        _CAutoInti_SinXDivX_Table() { _Inti_SinXDivX_Table(); }
    };
     static _CAutoInti_SinXDivX_Table __tmp_CAutoInti_SinXDivX_Table;

    TARGB32 ThreeOrder1( const TPicRegion & pic, const long x_16, const long y_16)
    {
        unsigned long x0_sub1 = (x_16 >> 16 ) - 1 ;
        unsigned long y0_sub1 = (y_16 >> 16 ) - 1 ;
         long u_8 = (unsigned char )(x_16 >> 8 );
         long v_8 = (unsigned char )(y_16 >> 8 );

        TARGB32 pixel[ 16 ];
         long i,j;

         for (i = 0 ;i < 4 ; ++ i)
        {
             long y = y0_sub1 + i;
             for (j = 0 ;j < 4 ; ++ j)
            {
                 long x = x0_sub1 + j;
                 bool IsInPic;
                pixel[i * 4 + j] = Pixels_Bound(pic,x,y,IsInPic);
                 if ( ! IsInPic) pixel[i * 4 + j].a = 0 ;
            }
        }

         long au_8[ 4 ],av_8[ 4 ];
         //
        au_8[ 0 ] = SinXDivX_Table_8[( 1 << 8 ) + u_8];
        au_8[ 1 ] = SinXDivX_Table_8[u_8];
        au_8[ 2 ] = SinXDivX_Table_8[( 1 << 8 ) - u_8];
        au_8[ 3 ] = SinXDivX_Table_8[( 2 << 8 ) - u_8];
        av_8[ 0 ] = SinXDivX_Table_8[( 1 << 8 ) + v_8];
        av_8[ 1 ] = SinXDivX_Table_8[v_8];
        av_8[ 2 ] = SinXDivX_Table_8[( 1 << 8 ) - v_8];
        av_8[ 3 ] = SinXDivX_Table_8[( 2 << 8 ) - v_8];

         long sR = 0 ,sG = 0 ,sB = 0 ,sA = 0 ;
         for (i = 0 ;i < 4 ; ++ i)
        {
             long aR = 0 ,aG = 0 ,aB = 0 ,aA = 0 ;
             for (j = 0 ;j < 4 ; ++ j)
            {
                aA += au_8[j] * pixel[i * 4 + j].a;
                aR += au_8[j] * pixel[i * 4 + j].r;
                aG += au_8[j] * pixel[i * 4 + j].g;
                aB += au_8[j] * pixel[i * 4 + j].b;
            }
            sA += aA * av_8[i];
            sR += aR * av_8[i];
            sG += aG * av_8[i];
            sB += aB * av_8[i];
        }

        TARGB32 result;
        result.a = ColorBound(sA >> 16 );
        result.r = ColorBound(sR >> 16 );
        result.g = ColorBound(sG >> 16 );
        result.b = ColorBound(sB >> 16 );
         return result;
    }

void PicZoom_ThreeOrder1( const TPicRegion & Dst, const TPicRegion & Src)
{
     if (  ( 0 == Dst.width) || ( 0 == Dst.height)
         || ( 0 == Src.width) || ( 0 == Src.height)) return ;

    unsigned long xrIntFloat_16 = ((Src.width - 1 ) << 16 ) / Dst.width + 1 ;
    unsigned long yrIntFloat_16 = ((Src.height - 1 ) << 16 ) / Dst.height + 1 ;

    unsigned long dst_width = Dst.width;
    TARGB32 * pDstLine = Dst.pdata;
    unsigned long srcy_16 = 0 ;
     for (unsigned long y = 0 ;y < Dst.height; ++ y)
    {
        unsigned long srcx_16 = 0 ;
         for (unsigned long x = 0 ;x < dst_width; ++ x)
        {
            pDstLine[x] = ThreeOrder1(Src,srcx_16,srcy_16);
            srcx_16 += xrIntFloat_16;
        }
        srcy_16 += yrIntFloat_16;
        ((TUInt8 *& )pDstLine) += Dst.byte_width;
    }
}

M: ThreeOrder0和ThreeOrder1 函数考虑了边界访问的情况，并拷贝了插值的16个颜色；所以将边界和内部分别处理；处理内部区域的函数就可以做一些优化了；
//速度测试:
//==============================================================================
// PicZoom_ThreeOrder2 4.2 fps

    TARGB32 ThreeOrder2_Fast( const TPicRegion & pic, const long x_16, const long y_16)
    {
         long u_8 = (unsigned char )((x_16) >> 8 );
         long v_8 = (unsigned char )((y_16) >> 8 );
         const TARGB32 * pixel =& Pixels(pic,(x_16 >> 16 ) - 1 ,(y_16 >> 16 ) - 1 );
         long pic_byte_width = pic.byte_width;

         long au_8[ 4 ],av_8[ 4 ];
         //
        au_8[ 0 ] = SinXDivX_Table_8[( 1 << 8 ) + u_8];
        au_8[ 1 ] = SinXDivX_Table_8[u_8];
        au_8[ 2 ] = SinXDivX_Table_8[( 1 << 8 ) - u_8];
        au_8[ 3 ] = SinXDivX_Table_8[( 2 << 8 ) - u_8];
        av_8[ 0 ] = SinXDivX_Table_8[( 1 << 8 ) + v_8];
        av_8[ 1 ] = SinXDivX_Table_8[v_8];
        av_8[ 2 ] = SinXDivX_Table_8[( 1 << 8 ) - v_8];
        av_8[ 3 ] = SinXDivX_Table_8[( 2 << 8 ) - v_8];

         long sR = 0 ,sG = 0 ,sB = 0 ,sA = 0 ;
         for ( long i = 0 ;i < 4 ; ++ i)
        {
             long aR = 0 ,aG = 0 ,aB = 0 ,aA = 0 ;
             for ( long j = 0 ;j < 4 ; ++ j)
            {
                aA += au_8[j] * pixel[j].a;
                aR += au_8[j] * pixel[j].r;
                aG += au_8[j] * pixel[j].g;
                aB += au_8[j] * pixel[j].b;
            }
            sA += aA * av_8[i];
            sR += aR * av_8[i];
            sG += aG * av_8[i];
            sB += aB * av_8[i];
            ((TUInt8 *& )pixel) += pic_byte_width;
        }

        TARGB32 result;
        result.a = ColorBound(sA >> 16 );
        result.r = ColorBound(sR >> 16 );
        result.g = ColorBound(sG >> 16 );
        result.b = ColorBound(sB >> 16 );
         return result;
    }

    TARGB32 ThreeOrder2_Border( const TPicRegion & pic, const long x_16, const long y_16)
    {
        unsigned long x0_sub1 = (x_16 >> 16 ) - 1 ;
        unsigned long y0_sub1 = (y_16 >> 16 ) - 1 ;
         long u_16_add1 = ((unsigned short )(x_16)) + ( 1 << 16 );
         long v_16_add1 = ((unsigned short )(y_16)) + ( 1 << 16 );

        TARGB32 pixel[ 16 ];
         long i,j;

         for (i = 0 ;i < 4 ; ++ i)
        {
             long y = y0_sub1 + i;
             for (j = 0 ;j < 4 ; ++ j)
            {
                 long x = x0_sub1 + j;
                 bool IsInPic;
                pixel[i * 4 + j] = Pixels_Bound(pic,x,y,IsInPic);
                 if ( ! IsInPic) pixel[i * 4 + j].a = 0 ;
            }
        }

        TPicRegion npic;
        npic.pdata      =& pixel[ 0 ];
        npic.byte_width = 4 * sizeof (TARGB32);
         // npic.width     =4;
         // npic.height    =4;
         return ThreeOrder2_Fast(npic,u_16_add1,v_16_add1);
    }

void PicZoom_ThreeOrder2( const TPicRegion & Dst, const TPicRegion & Src)
{
     if (  ( 0 == Dst.width) || ( 0 == Dst.height)
         || ( 0 == Src.width) || ( 0 == Src.height)) return ;

    unsigned long xrIntFloat_16 = ((Src.width - 1 ) << 16 ) / Dst.width + 1 ;
    unsigned long yrIntFloat_16 = ((Src.height - 1 ) << 16 ) / Dst.height + 1 ;

    unsigned long dst_width = Dst.width;

     // 计算出需要特殊处理的边界
     long border_y0 = (( 1 << 16 ) + yrIntFloat_16 - 1 ) / yrIntFloat_16;
     if (border_y0 >= Dst.height) border_y0 = Dst.height - 1 ;
     long border_x0 = (( 1 << 16 ) + xrIntFloat_16 - 1 ) / xrIntFloat_16;
     if (border_x0 >= Dst.width ) border_x0 = Dst.width - 1 ;
     long border_y1 = (( 1 << 16 ) * (Src.height - 1 - 2 )) / yrIntFloat_16;
     if (border_y1 < border_y0) border_y1 = border_y0;
     long border_x1 = (( 1 << 16 ) * (Src.width - 1 - 2 )) / xrIntFloat_16;
     if (border_x1 < border_x0) border_x1 = border_x0;

    TARGB32 * pDstLine = Dst.pdata;
    unsigned long srcy_16 = 0 ;
     long y;
     for (y = 0 ;y < border_y0; ++ y)
    {
        unsigned long srcx_16 = 0 ;
         for (unsigned long x = 0 ;x < dst_width; ++ x)
        {
            pDstLine[x] = ThreeOrder2_Border(Src,srcx_16,srcy_16); // border
            srcx_16 += xrIntFloat_16;
        }
        srcy_16 += yrIntFloat_16;
        ((TUInt8 *& )pDstLine) += Dst.byte_width;
    }
     for (y = border_y0;y < border_y1; ++ y)
    {
        unsigned long srcx_16 = 0 ;
         long x;
         for (x = 0 ;x < border_x0; ++ x)
        {
            pDstLine[x] = ThreeOrder2_Border(Src,srcx_16,srcy_16); // border
            srcx_16 += xrIntFloat_16;
        }
         for (x = border_x0;x < border_x1; ++ x)
        {
            pDstLine[x] = ThreeOrder2_Fast(Src,srcx_16,srcy_16); // fast  !
            srcx_16 += xrIntFloat_16;
        }
         for (x = border_x1;x < dst_width; ++ x)
        {
            pDstLine[x] = ThreeOrder2_Border(Src,srcx_16,srcy_16); // border
            srcx_16 += xrIntFloat_16;
        }
        srcy_16 += yrIntFloat_16;
        ((TUInt8 *& )pDstLine) += Dst.byte_width;
    }
     for (y = border_y1;y < Dst.height; ++ y)
    {
        unsigned long srcx_16 = 0 ;
         for (unsigned long x = 0 ;x < dst_width; ++ x)
        {
            pDstLine[x] = ThreeOrder2_Border(Src,srcx_16,srcy_16); // border
            srcx_16 += xrIntFloat_16;
        }
        srcy_16 += yrIntFloat_16;
        ((TUInt8 *& )pDstLine) += Dst.byte_width;
    }
}

N: 用MMX来优化ThreeOrder2_Fast函数了；
//速度测试:
//==============================================================================
// PicZoom_ThreeOrder_MMX 14.5 fps

    typedef   unsigned long TMMXData32;
     static TMMXData32 SinXDivX_Table_7_MMX[( 2 << 8 ) + 1 ];
     class _CAutoInti_SinXDivX_Table_MMX {
     private :
         void _Inti_SinXDivX_Table_MMX()
        {
             for ( long i = 0 ;i <= ( 2 << 8 ); ++ i)
            {
                unsigned short t = long ( 0.5 + 256 / 2 * SinXDivX(i * ( 1.0 / ( 256 )))); // ==SinXDivX_Table_8[i]/2
                unsigned long tl = t | (((unsigned long )t) << 16 );
                SinXDivX_Table_7_MMX[i] = tl;
            }
        };
     public :
        _CAutoInti_SinXDivX_Table_MMX() { _Inti_SinXDivX_Table_MMX(); }
    };
     static _CAutoInti_SinXDivX_Table_MMX __tmp_CAutoInti_SinXDivX_Table_MMX;

    TARGB32 ThreeOrder_Fast_MMX( const TPicRegion & pic, const long x_16, const long y_16)
    {
        TMMXData32 auv_7[ 4 + 4 ];
        Color32 result;
        asm
        {
            mov     ecx,pic
            mov     eax,y_16
            mov     ebx,x_16
            movzx   edi,ah // v_8
            mov     edx,[ecx + TPicRegion.pdata]
            shr     eax, 16
            mov     esi,[ecx + TPicRegion.byte_width]
            dec     eax
            movzx   ecx,bh // u_8
            shr     ebx, 16
            imul    eax,esi
            lea     edx,[edx + ebx * 4 - 4 ]
            add     edx,eax // pixel

            lea     eax,auv_7
            mov     ebx,[SinXDivX_Table_7_MMX + 256 * 4 + ecx * 4 ]
            mov     [eax + 0 * 4 ],ebx
            mov     ebx,[SinXDivX_Table_7_MMX + ecx * 4 ]
            neg     ecx
            mov     [eax + 1 * 4 ],ebx
            mov     ebx,[SinXDivX_Table_7_MMX + 256 * 4 + ecx * 4 ]
            mov     [eax + 2 * 4 ],ebx
            mov     ebx,[SinXDivX_Table_7_MMX + 512 * 4 + ecx * 4 ]
            mov     [eax + 3 * 4 ],ebx

            mov     ecx,edi
            mov     ebx,[SinXDivX_Table_7_MMX + 256 * 4 + ecx * 4 ]
            mov     [ 4 * 4 + eax + 0 * 4 ],ebx
            mov     ebx,[SinXDivX_Table_7_MMX + ecx * 4 ]
            neg     ecx
            mov     [ 4 * 4 + eax + 1 * 4 ],ebx
            mov     ebx,[SinXDivX_Table_7_MMX + 256 * 4 + ecx * 4 ]
            mov     [ 4 * 4 + eax + 2 * 4 ],ebx
            mov     ebx,[SinXDivX_Table_7_MMX + 512 * 4 + ecx * 4 ]
            mov     [ 4 * 4 + eax + 3 * 4 ],ebx


            pxor    mm7,mm7   // 0
             // mov     edx,pixel
            pxor    mm0,mm0   // result=0
             // lea     eax,auv_7
            mov     ecx, - 4

          for4_begin:
                movd        mm1,dword ptr [edx]
                movd        mm2,dword ptr [edx + 4 ]
                movd        mm3,dword ptr [edx + 8 ]
                movd        mm4,dword ptr [edx + 12 ]
                movd        mm5,dword ptr [eax]
                movd        mm6,dword ptr [eax + 4 ]
                punpcklbw   mm1,mm7
                punpcklbw   mm2,mm7
                punpcklwd   mm5,mm5
                punpcklwd   mm6,mm6
                pmullw      mm1,mm5
                pmullw      mm2,mm6
                punpcklbw   mm3,mm7
                punpcklbw   mm4,mm7
                movd        mm5,dword ptr [eax + 8 ]
                movd        mm6,dword ptr [eax + 12 ]
                punpcklwd   mm5,mm5
                punpcklwd   mm6,mm6
                pmullw      mm3,mm5
                pmullw      mm4,mm6
                psraw       mm1, 7
                psraw       mm2, 7
                psraw       mm3, 7
                psraw       mm4, 7
                movd        mm6,dword ptr [ 4 * 8 + eax + ecx * 4 ] // v
                paddsw      mm1,mm2
                paddsw      mm3,mm4
                punpcklwd   mm6,mm6
                paddsw      mm1,mm3

                add     edx,esi   // +pic.byte_width
                psraw       mm1, 1
                pmullw      mm1,mm6
                inc     ecx
                paddsw      mm0,mm1
            jnz     for4_begin

            psraw     mm0, 6
            packuswb  mm0,mm7
            movd      result,mm0

            emms
        }
         return result;
    }

    TARGB32 ThreeOrder_Border_MMX( const TPicRegion & pic, const long x_16, const long y_16)
    {
        unsigned long x0_sub1 = (x_16 >> 16 ) - 1 ;
        unsigned long y0_sub1 = (y_16 >> 16 ) - 1 ;
         long u_16_add1 = ((unsigned short )(x_16)) + ( 1 << 16 );
         long v_16_add1 = ((unsigned short )(y_16)) + ( 1 << 16 );

        TARGB32 pixel[ 16 ];
         long i,j;

         for (i = 0 ;i < 4 ; ++ i)
        {
             long y = y0_sub1 + i;
             for (j = 0 ;j < 4 ; ++ j)
            {
                 long x = x0_sub1 + j;
                 bool IsInPic;
                pixel[i * 4 + j] = Pixels_Bound(pic,x,y,IsInPic);
                 if ( ! IsInPic) pixel[i * 4 + j].a = 0 ;
            }
        }

        TPicRegion npic;
        npic.pdata      =& pixel[ 0 ];
        npic.byte_width = 4 * sizeof (TARGB32);
         // npic.width     =4;
         // npic.height    =4;
         return ThreeOrder_Fast_MMX(npic,u_16_add1,v_16_add1);
    }

void PicZoom_ThreeOrder_MMX( const TPicRegion & Dst, const TPicRegion & Src)
{
     if (  ( 0 == Dst.width) || ( 0 == Dst.height)
         || ( 0 == Src.width) || ( 0 == Src.height)) return ;

    unsigned long xrIntFloat_16 = ((Src.width - 1 ) << 16 ) / Dst.width + 1 ;
    unsigned long yrIntFloat_16 = ((Src.height - 1 ) << 16 ) / Dst.height + 1 ;

    unsigned long dst_width = Dst.width;

     // 计算出需要特殊处理的边界
     long border_y0 = (( 1 << 16 ) + yrIntFloat_16 - 1 ) / yrIntFloat_16;
     if (border_y0 >= Dst.height) border_y0 = Dst.height - 1 ;
     long border_x0 = (( 1 << 16 ) + xrIntFloat_16 - 1 ) / xrIntFloat_16;
     if (border_x0 >= Dst.width ) border_x0 = Dst.width - 1 ;
     long border_y1 = (( 1 << 16 ) * (Src.height - 1 - 2 )) / yrIntFloat_16;
     if (border_y1 < border_y0) border_y1 = border_y0;
     long border_x1 = (( 1 << 16 ) * (Src.width - 1 - 2 )) / xrIntFloat_16;
     if (border_x1 < border_x0) border_x1 = border_x0;

    TARGB32 * pDstLine = Dst.pdata;
    unsigned long srcy_16 = 0 ;
     long y;
     for (y = 0 ;y < border_y0; ++ y)
    {
        unsigned long srcx_16 = 0 ;
         for (unsigned long x = 0 ;x < dst_width; ++ x)
        {
            pDstLine[x] = ThreeOrder_Border_MMX(Src,srcx_16,srcy_16); // border
            srcx_16 += xrIntFloat_16;
        }
        srcy_16 += yrIntFloat_16;
        ((TUInt8 *& )pDstLine) += Dst.byte_width;
    }
     for (y = border_y0;y < border_y1; ++ y)
    {
        unsigned long srcx_16 = 0 ;
         long x;
         for (x = 0 ;x < border_x0; ++ x)
        {
            pDstLine[x] = ThreeOrder_Border_MMX(Src,srcx_16,srcy_16); // border
            srcx_16 += xrIntFloat_16;
        }
         for (x = border_x0;x < border_x1; ++ x)
        {
            pDstLine[x] = ThreeOrder_Fast_MMX(Src,srcx_16,srcy_16); // fast MMX !
            srcx_16 += xrIntFloat_16;
        }
         for (x = border_x1;x < dst_width; ++ x)
        {
            pDstLine[x] = ThreeOrder_Border_MMX(Src,srcx_16,srcy_16); // border
            srcx_16 += xrIntFloat_16;
        }
        srcy_16 += yrIntFloat_16;
        ((TUInt8 *& )pDstLine) += Dst.byte_width;
    }
     for (y = border_y1;y < Dst.height; ++ y)
    {
        unsigned long srcx_16 = 0 ;
         for (unsigned long x = 0 ;x < dst_width; ++ x)
        {
            pDstLine[x] = ThreeOrder_Border_MMX(Src,srcx_16,srcy_16); // border
            srcx_16 += xrIntFloat_16;
        }
        srcy_16 += yrIntFloat_16;
        ((TUInt8 *& )pDstLine) += Dst.byte_width;
    }
}

O：将测试结果放到一起：

//zoom 800*600 to 1024*1024
//==============================================================================
// StretchBlt          169   fps
// PicZoom3_SSE      301--330   fps
// PicZoom_BilInear_MMX     70.8 fps
//
// PicZoom_ThreeOrder0       1.6 fps
// PicZoom_ThreeOrder1       2.7 fps
// PicZoom_ThreeOrder2       4.2 fps
// PicZoom_ThreeOrder_MMX   14.5 fps

yangdelong

关注

0
点赞
踩
7

收藏

觉得还不错? 一键收藏
1
评论
图形图像处理－之－高质量的快速的图像缩放中篇二次线性插值和三次卷积插值

图形图像处理－之－高质量的快速的图像缩放中篇二次线性插值和三次卷积插值 HouSisong@263.net 2006.12.13(2007.03.06更新)tag:图像缩放、速度优化、定点数优化、近邻取样插值、二次线性插值、三次线性插值、 MipMap链、三次卷积插值、MMX/SSE优化、CPU缓存优化摘要:首先给出一个基本的图像缩放
复制链接

扫一扫