Inverse Cosine Transform (IDCT) 8x8: Fast Algorithms and SIMD Optimization (Part 2: Optimizing the C Program)

8x8 IDCT Programs

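Earlier we studied the discussion of fast IDCT algorithms by Feig et al. From an algorithmic point of view, the 2D IDCT is clearly the fastest. In practice, however, the two-dimensional IDCT is rarely used because it is complex to implement; the one-dimensional IDCT is what you normally see. This article presents several IDCT implementations: a few AAN variants written by the author, the common CHENG_WANG, the well-known JREV, and several MMX IDCTs. To make testing easier, every IDCT function takes two address parameters, one for input and one for output. In real use the transform is usually done in place, with the same address for input and output. This has very little effect on performance, and changing the code to use a single address is easy.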

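AAN is already fast, but of course we want a more efficient implementation. The starting point is that many of the DCT coefficients are zero. Let us analyze how to exploit this property.

For the 1D AAN IDCT, once more than two coefficients are nonzero the amount of computation hardly changes, so we only consider the cases with one or two nonzero coefficients. We will call these special cases fast paths.

For convenience, we call the case with many zero coefficients high load, and the opposite case low load.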

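1) Only one coefficient f(x,y) in the whole 8x8 matrix is nonzero.

In this case the 1D IDCT

S8(u) = Cu/2 * ∑ f(x) cos((2x+1)*pi*u/16)

degenerates to

S8(u) = Cu/2 * f(x) cos((2x+1)*pi*u/16)

which means that each 1D IDCT can be completed with 8 multiplications. Because the factors depend on the position x, the lookup table is 8 x 8.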
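The degeneration can be checked with a small sketch. This is not the author's code: the function names idct1d_full and idct1d_single are hypothetical, the table values are copied from the IDCT_TAB listing in this article, and the outputs are left in unshifted fixed point so that the comparison is exact.

```c
#include <stdint.h>

/* Row k holds the 8 output weights of input coefficient k (12-bit fixed
   point), copied from the article's IDCT_TAB. */
static const int32_t kBasis[8][8] = {
    {2048,  2048,  2048,  2048,  2048,  2048,  2048,  2048},
    {2841,  2408,  1609,   565,  -565, -1609, -2408, -2841},
    {2676,  1108, -1108, -2676, -2676, -1108,  1108,  2676},
    {2408,  -565, -2841, -1609,  1609,  2841,   565, -2408},
    {2048, -2048, -2048,  2048,  2048, -2048, -2048,  2048},
    {1609, -2841,   565,  2408, -2408,  -565,  2841, -1609},
    {1108, -2676,  2676, -1108, -1108,  2676, -2676,  1108},
    { 565, -1609,  2408, -2841,  2841, -2408,  1609,  -565},
};

/* Full 1D transform: every output sums over all 8 inputs
   (64 multiplications). */
void idct1d_full(const int16_t in[8], int32_t out[8])
{
    for (int n = 0; n < 8; n++) {
        int32_t acc = 0;
        for (int k = 0; k < 8; k++)
            acc += in[k] * kBasis[k][n];
        out[n] = acc;
    }
}

/* Degenerate transform: only in[k] is nonzero, so each sum collapses to a
   single term (8 multiplications in total). */
void idct1d_single(int16_t value, int k, int32_t out[8])
{
    for (int n = 0; n < 8; n++)
        out[n] = value * kBasis[k][n];
}
```

When exactly one input is nonzero, both functions should produce the same eight outputs; the second one simply reads one row of the 8 x 8 table.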

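a) f(x,y) is at row 0, column 0. The two 1D IDCTs can be merged directly: S8(u) S8(v) = f(0,0)/8. Only 64 shifts are needed.

b) x > 0, y = 0: f(x,y) is at row 0, column x. First do 8 multiplications in row 0, multiplying each coefficient by 2√2; then do 8 column transforms, S8(u) = f(0)/8. That is 8 shifts and 8 multiplications.

c) y > 0, x = 0: f(x,y) is at row y, column 0. First do 8 multiplications in column 0, multiplying each coefficient by 2√2; then do 8 row transforms, S8(u) = f(0)/8. Again 8 shifts and 8 multiplications.

d) x, y > 0: f(x,y) is at row y, column x. First do 8 multiplications in row y; then do 8 column transforms, each of which also takes 8 multiplications. That is 72 multiplications in total.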
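As a rough illustration of case d), here is a compact two-pass sketch for a block with a single nonzero coefficient. It is a simplification, not the article's routine: the name idct2d_single is hypothetical, the table and the two shift amounts (8 and 17) are taken from the IDCT_TAB definitions in this article, clipping is omitted, and right shifts of negative values are assumed to be arithmetic, as the article's own code also assumes.

```c
#include <stdint.h>

/* 12-bit fixed-point basis, copied from the article's IDCT_TAB. */
static const int32_t kBasis[8][8] = {
    {2048,  2048,  2048,  2048,  2048,  2048,  2048,  2048},
    {2841,  2408,  1609,   565,  -565, -1609, -2408, -2841},
    {2676,  1108, -1108, -2676, -2676, -1108,  1108,  2676},
    {2408,  -565, -2841, -1609,  1609,  2841,   565, -2408},
    {2048, -2048, -2048,  2048,  2048, -2048, -2048,  2048},
    {1609, -2841,   565,  2408, -2408,  -565,  2841, -1609},
    {1108, -2676,  2676, -1108, -1108,  2676, -2676,  1108},
    { 565, -1609,  2408, -2841,  2841, -2408,  1609,  -565},
};

/* Same values as the article's IDCT_TAB_RC0 and IDCT_TAB_RC1. */
enum { SHIFT0 = 8, SHIFT1 = 17 };

/* 2D IDCT of a block whose only nonzero coefficient is `value`, located at
   horizontal frequency x and vertical frequency y (hypothetical helper). */
void idct2d_single(int value, int x, int y, int16_t dst[64])
{
    int32_t tmp[8];

    /* First pass: 8 multiplications along the row. */
    for (int k = 0; k < 8; k++)
        tmp[k] = (value * kBasis[x][k] + (1 << (SHIFT0 - 1))) >> SHIFT0;

    /* Second pass: 8 column transforms of 8 multiplications each. */
    for (int i = 0; i < 8; i++)
        for (int k = 0; k < 8; k++)
            dst[i * 8 + k] = (int16_t)((tmp[k] * kBasis[y][i]
                                        + (1 << (SHIFT1 - 1))) >> SHIFT1);
}
```

The two loops perform 8 + 64 = 72 multiplications, matching the count above. With x = y = 0 every output reduces to f(0,0)/8, which is case a).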

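When the coefficient is not at row 0, column 0, there is also another approach.

S8(u) S8(v) = Cu/2 * Cv/2 * ∑∑ f(x,y) cos((2x+1)*pi*u/16) * cos((2y+1)*pi*v/16)

degenerates to

S8(u) S8(v) = Cu/2 * Cv/2 * f(x,y) cos((2x+1)*pi*u/16) * cos((2y+1)*pi*v/16)

This amounts to expanding the 8 x 8 table above into one huge 64 x 64 table and doing the job directly with 64 multiplications. See the JREV IDCT implementation file for details. Oversized tables are always a worry, though, so we can do without this one. With two passes of 1D IDCT the table is small and the result is faster.

We call the case where only f(0,0) is nonzero fast path 0, and the other single-coefficient cases fast path 1.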

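2) More than one coefficient in the 8x8 matrix is nonzero, but all of them lie in one row or one column.

When they all lie in one row, first apply a standard 1D row transform to that row, then do 8 degenerate 1D column transforms.

When they all lie in one column, first apply a standard 1D column transform to that column, then do 8 degenerate 1D row transforms.

3) More than one coefficient in the 8x8 matrix is nonzero, but all of them lie in two rows or two columns.

With 2 nonzero coefficients, each 1D IDCT needs 16 multiplications, but because there are no other operations it is still somewhat faster.

When they all lie in two rows, first do 2 standard 1D row transforms, then do 8 degenerate 1D column transforms.

When they all lie in two columns, first do 2 standard 1D column transforms, then do 8 degenerate 1D row transforms.

We call cases 2 and 3 fast path 2.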
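The one-row case can be sketched as follows. This is an illustration under simplifying assumptions, not the article's implementation: it uses a direct table-based 1D transform instead of the AAN butterfly, keeps results in unshifted 64-bit fixed point so the comparison is exact, and the names idct2d_full and idct2d_one_row are hypothetical.

```c
#include <stdint.h>

/* 12-bit fixed-point basis, copied from the article's IDCT_TAB. */
static const int32_t kBasis[8][8] = {
    {2048,  2048,  2048,  2048,  2048,  2048,  2048,  2048},
    {2841,  2408,  1609,   565,  -565, -1609, -2408, -2841},
    {2676,  1108, -1108, -2676, -2676, -1108,  1108,  2676},
    {2408,  -565, -2841, -1609,  1609,  2841,   565, -2408},
    {2048, -2048, -2048,  2048,  2048, -2048, -2048,  2048},
    {1609, -2841,   565,  2408, -2408,  -565,  2841, -1609},
    {1108, -2676,  2676, -1108, -1108,  2676, -2676,  1108},
    { 565, -1609,  2408, -2841,  2841, -2408,  1609,  -565},
};

/* Reference: separable 2D transform, 8 row passes then 8 column passes. */
void idct2d_full(const int16_t in[64], int64_t out[64])
{
    int64_t tmp[64];
    for (int r = 0; r < 8; r++)
        for (int n = 0; n < 8; n++) {
            int64_t acc = 0;
            for (int k = 0; k < 8; k++)
                acc += (int64_t)in[r * 8 + k] * kBasis[k][n];
            tmp[r * 8 + n] = acc;
        }
    for (int n = 0; n < 8; n++)
        for (int m = 0; m < 8; m++) {
            int64_t acc = 0;
            for (int r = 0; r < 8; r++)
                acc += tmp[r * 8 + n] * kBasis[r][m];
            out[m * 8 + n] = acc;
        }
}

/* Fast path: every nonzero coefficient sits in row `row`. */
void idct2d_one_row(const int16_t in[64], int row, int64_t out[64])
{
    int64_t line[8];
    for (int n = 0; n < 8; n++) {          /* one standard row transform */
        int64_t acc = 0;
        for (int k = 0; k < 8; k++)
            acc += (int64_t)in[row * 8 + k] * kBasis[k][n];
        line[n] = acc;
    }
    for (int n = 0; n < 8; n++)            /* 8 degenerate column transforms */
        for (int m = 0; m < 8; m++)
            out[m * 8 + n] = line[n] * kBasis[row][m];
}
```

Because the other seven rows contribute nothing, each column transform has a single input, so it costs 8 multiplications instead of a full 1D pass.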

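4) A single row or column has only one nonzero coefficient. Apply the degenerate 1D transform to that row or column. A nonzero DC coefficient is called fast path 3, and a nonzero AC coefficient is called fast path 4. Inside the 1D IDCT, conditional tests and branches are very expensive, so the case with 2 nonzero coefficients will not run any faster unless something is done to reduce the cost of those branches.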

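Summary of the fast paths:

Fast path | Condition
0 | 8x8 matrix, only f(0,0) != 0
1 | 8x8 matrix, only one f(x,y) != 0, with (x,y) != (0,0)
2 | 8x8 matrix, all f(x,y) != 0 lie in one row or one column, or in two rows or two columns
3 | 1x8 or 8x1 vector, only f(0) != 0
4 | 1x8 or 8x1 vector, only f(x) != 0 with x > 0, or only f(x1) != 0 and f(x2) != 0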
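The block-level part of this summary (fast paths 0, 1 and 2) can be expressed as a small classifier. This is a sketch under assumptions: classify_block and count_bits are hypothetical names, row_mask has bit r set when row r contains a nonzero coefficient, and col_mask is defined the same way for columns. Fast paths 3 and 4 are decided per row or column, so the sketch returns -1 for blocks that need those checks.

```c
#include <stdint.h>

/* Number of set bits in an 8-bit mask. */
static int count_bits(uint8_t m)
{
    int n = 0;
    while (m) {
        n += m & 1;
        m >>= 1;
    }
    return n;
}

/* Returns 0, 1 or 2 for the block-level fast paths, or -1 when the block
   has to go through the per-row checks (fast paths 3 and 4). */
int classify_block(uint8_t row_mask, uint8_t col_mask)
{
    int rows = count_bits(row_mask);
    int cols = count_bits(col_mask);

    if (rows == 1 && cols == 1)        /* exactly one nonzero coefficient */
        return (row_mask == 1 && col_mask == 1) ? 0 : 1;
    if (rows == 1 || cols == 1 || rows == 2 || cols == 2)
        return 2;                      /* one or two rows, or columns */
    return -1;
}
```

For example, a block whose nonzero coefficients all sit in row 0 has row_mask 0x01 and is routed to fast path 2 regardless of how many columns are occupied.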

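We implement the IDCT programs based on the analysis above. For reasons of space only the AAN_EX code is given here; the rest of the code can be found in the test project.

First, let us look at the position information we need to collect about the DCT coefficients.

1) The total number of nonzero coefficients.

2) When there is only one nonzero coefficient, its position.

3) The total number of nonzero rows, and their positions.

4) The total number of nonzero columns, and their positions.

5) The number and positions of the nonzero coefficients in each row.

Recording all this looks tedious, but note that every item can be obtained quickly by table lookup.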
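The five items above can be accumulated with a few bit operations per nonzero coefficient. The sketch below computes the bits directly instead of packing them into DWORDs through a lookup table as the article's code does; the BlockInfo type and the two function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t count;        /* total number of nonzero coefficients        */
    uint8_t last_pos;     /* raster index (0..63) of the most recent one */
    uint8_t row_mask;     /* bit r set: row r has a nonzero coefficient  */
    uint8_t col_mask;     /* bit c set: column c has one                 */
    uint8_t row_bits[8];  /* per row: bit c set for a nonzero at (r, c)  */
} BlockInfo;

void block_info_reset(BlockInfo *info)
{
    memset(info, 0, sizeof *info);
}

/* Call once for each nonzero coefficient at raster index j
   (row = j >> 3, column = j & 7), for example as the decoder emits it. */
void block_info_add(BlockInfo *info, int j)
{
    int row = j >> 3;
    int col = j & 7;

    info->count++;
    info->last_pos = (uint8_t)j;
    info->row_mask |= (uint8_t)(1u << row);
    info->col_mask |= (uint8_t)(1u << col);
    info->row_bits[row] |= (uint8_t)(1u << col);
}
```

The number of nonzero rows or columns is then the bit count of the corresponding mask, which can itself come from a 256-entry table.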

// LxIDCT.h: interface for the LxIDCT.

#define PI 3.14159265359

// Note: the operands after "1<<" in the three ROUND macros were lost in the
// source; they are reconstructed here as half of the matching shift step.
#define IDCT_TAB_RC         12
#define IDCT_TAB_RC_DELTA   4
#define IDCT_TAB_RC0        (IDCT_TAB_RC - IDCT_TAB_RC_DELTA)
#define IDCT_TAB_ROUND0     (1 << (IDCT_TAB_RC0 - 1))
#define IDCT_TAB_RC1        (IDCT_TAB_RC + IDCT_TAB_RC_DELTA + 1)
#define IDCT_TAB_ROUND1     (1 << (IDCT_TAB_RC1 - 1))
#define IDCT_TAB_DC_RC0     (IDCT_TAB_RC_DELTA - 1)   // 3
#define IDCT_TAB_DC_RC1     (IDCT_TAB_RC_DELTA + 2)   // 6
#define IDCT_TAB_DC_ROUND1  (1 << (IDCT_TAB_DC_RC1 - 1))

#define SET_NZ_MASK(A) (A)
#define SET_NZ_INF(A)  ((A) & 0x00ffff00)

typedef struct LxNzInf{
    BYTE byColMask;    // 1 << (j&7);
    BYTE byBlkNz;      // byBlkNz ++;
    BYTE byBlkNzPos;   // byBlkNzPos = j;
    BYTE byRowMask;    // 1 << (j>>3);
}LxNzInf;

typedef struct LxIdctInf{
    DWORD dwNzInf;     // += SET_NZ_INF
    DWORD dwColMask;   // |= SET_NZ_MASK
    BYTE byRowPos[8];  // byRowPos[j>>3] |= msk[j]; msk[j] = 1 << (j&7);
}LxIdctInf;

__declspec (align(16)) const int IDCT_TAB[8][8] =
{
    {2048,  2048,  2048,  2048,  2048,  2048,  2048,  2048, },
    {2841,  2408,  1609,   565,  -565, -1609, -2408, -2841, },
    {2676,  1108, -1108, -2676, -2676, -1108,  1108,  2676, },
    {2408,  -565, -2841, -1609,  1609,  2841,   565, -2408, },
    {2048, -2048, -2048,  2048,  2048, -2048, -2048,  2048, },
    {1609, -2841,   565,  2408, -2408,  -565,  2841, -1609, },
    {1108, -2676,  2676, -1108, -1108,  2676, -2676,  1108, },
    { 565, -1609,  2408, -2841,  2841, -2408,  1609,  -565, },
};

extern DWORD g_dwBlockInf[64];

void Initialize_Fast_IDCT();
void idct_aan_bridge_ex(short* block, short* dst, LxIdctInf* pos);
void idct_sparse_dc(short* block, short* dst, LxIdctInf* pos);
void idct_sparse_ac(short* block, short* dst, LxIdctInf* pos);

// end of LxIDCT.h.

// start of LxIDCT.cpp: implementation of the LxIDCT.

static short iclip[1024];
short *iclp;

BYTE  g_RowNz[256];
DWORD g_RowPos[256];
DWORD g_dwColMsk[8];
DWORD g_dwBlockInf[64];

#define GET_COL_MASK(A) ( A & 0xff )
#define GET_ROW_MASK(A) ( A >> 24 )
#define GET_NZ_NUM(A)   ( (A >> 8) & 0xff )
#define GET_NZ_POS(A)   ( A >> 16 )

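void Initialize_Fast_IDCT()
{
    int i;

    // Fill the clipping table.
    iclp = iclip+512;
    for (i = -512; i < 512; i++) {
        iclp[i] = (i < -256) ? -256 : ((i > 255) ? 255 : i);
    }

#if 0
    // Disabled: regenerates IDCT_TAB, which is declared const in the header.
    for( i = 0; i < 8; i++ ) {
        double scale = (i == 0) ? sqrt(0.125) : 0.5;
        for( int j = 0; j < 8; j++ ) {
            double s = scale * cos((PI/8.0)*i*(j + 0.5));
            s = s * sqrt(2);
            double delt = s >= 0 ? 0.5 : -0.5;
            // The source is truncated after "(1<"; the scale below is
            // reconstructed from the values in IDCT_TAB.
            IDCT_TAB[i][j] = (short)( (1 << IDCT_TAB_RC) * s + delt );
        }
    }
#endif

    // For every 8-bit mask: the number of set bits, and their positions
    // packed 3 bits each (callers unpack them with "& 7" and ">> 3").
    for( i = 0; i < 256; i++ ) {
        int nz = 0;
        g_RowPos[i] = 0;
        for( int j = 0; j < 8; j++ ) {
            if( i & (1 << j) ) {
                // Shift count reconstructed; the source is truncated after "j<".
                g_RowPos[i] |= j << (nz*3);
                nz ++;
            }
        }
        g_RowNz[i] = nz;
    }

    for( i = 0; i < 8; i++ ) {
        // g_dwColMsk[i] is assigned the OR of four "(1<..." terms here, but
        // the shift counts are truncated in the source and are left
        // unreconstructed. g_dwColMsk is not used by the code in this article.
    }

    // Per-position information table for the DCT coefficients.
    for( i = 0; i < 8; i++ ) {
        for( int j = 0; j < 8; j++ ) {
            LxNzInf inf;
            // Shift counts reconstructed from the comments in LxNzInf.
            inf.byColMask = 1 << j;
            inf.byBlkNz = 1;
            inf.byBlkNzPos = i*8+j;
            inf.byRowMask = 1 << i;
            g_dwBlockInf[i*8+j] = *(DWORD*)&inf;
        }
    }
}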

void idct_sparse_dc(short* block, short* dst, LxIdctInf* pos)
{
    int val32;
    int val = block[0];
    val = ( val + 4 ) >> 3;
    // Pack two copies of the 16-bit result into one 32-bit word.
    val32 = (val << 16) | ( val & 0xffff );
    int* b32 = (int*)dst;
    for( int i = 0; i < 32; i += 8 ){
        b32[i+0] = val32;
        b32[i+1] = val32;
        b32[i+2] = val32;
        b32[i+3] = val32;
        b32[i+4] = val32;
        b32[i+5] = val32;
        b32[i+6] = val32;
        b32[i+7] = val32;
    }
}

void idct_sparse_ac( short* block, short* dst, LxIdctInf* pos )
{
    int tmp[8];
    int i;
    DWORD const dwRowMask = GET_ROW_MASK(pos->dwColMask);
    DWORD const dwColMask = GET_COL_MASK(pos->dwColMask);
    int hnc = g_RowPos[dwColMask];
    int vnc = g_RowPos[dwRowMask];
    int np = hnc*8 + vnc ;
    if( 0 == np ) {
        idct_sparse_dc(block,dst,pos);
        return;
    }
    int ac = block[ np ];
    int nc = hnc ? hnc : vnc;
    tmp[0] = ( ( ac * IDCT_TAB[nc][0] + IDCT_TAB_ROUND0 ) >> IDCT_TAB_RC0 );
    tmp[1] = ( ( ac * IDCT_TAB[nc][1] + IDCT_TAB_ROUND0 ) >> IDCT_TAB_RC0 );
    tmp[2] = ( ( ac * IDCT_TAB[nc][2] + IDCT_TAB_ROUND0 ) >> IDCT_TAB_RC0 );
    tmp[3] = ( ( ac * IDCT_TAB[nc][3] + IDCT_TAB_ROUND0 ) >> IDCT_TAB_RC0 );
    tmp[4] = ( ( ac * IDCT_TAB[nc][4] + IDCT_TAB_ROUND0 ) >> IDCT_TAB_RC0 );
    tmp[5] = ( ( ac * IDCT_TAB[nc][5] + IDCT_TAB_ROUND0 ) >> IDCT_TAB_RC0 );
    tmp[6] = ( ( ac * IDCT_TAB[nc][6] + IDCT_TAB_ROUND0 ) >> IDCT_TAB_RC0 );
    tmp[7] = ( ( ac * IDCT_TAB[nc][7] + IDCT_TAB_ROUND0 ) >> IDCT_TAB_RC0 );
    if( hnc ) {
        // horizontal ac;
        if( vnc ){
            // vertical ac;
            for( i = 0; i < 8; i++ ) {
                ac = IDCT_TAB[vnc][i];
                dst[0+i*8] = iclp[ ( tmp[0] * ac + IDCT_TAB_ROUND1 ) >> IDCT_TAB_RC1 ];
                dst[1+i*8] = iclp[ ( tmp[1] * ac + IDCT_TAB_ROUND1 ) >> IDCT_TAB_RC1 ];
                dst[2+i*8] = iclp[ ( tmp[2] * ac + IDCT_TAB_ROUND1 ) >> IDCT_TAB_RC1 ];
                dst[3+i*8] = iclp[ ( tmp[3] * ac + IDCT_TAB_ROUND1 ) >> IDCT_TAB_RC1 ];
                dst[4+i*8] = iclp[ ( tmp[4] * ac + IDCT_TAB_ROUND1 ) >> IDCT_TAB_RC1 ];
                dst[5+i*8] = iclp[ ( tmp[5] * ac + IDCT_TAB_ROUND1 ) >> IDCT_TAB_RC1 ];
                dst[6+i*8] = iclp[ ( tmp[6] * ac + IDCT_TAB_ROUND1 ) >> IDCT_TAB_RC1 ];
                dst[7+i*8] = iclp[ ( tmp[7] * ac + IDCT_TAB_ROUND1 ) >> IDCT_TAB_RC1 ];
            }
        }
        else {
            // vertical dc;
            short tmp0[8];
            // The line continuations below appear as "/" in the source; "\" is restored.
#           define LX_DC(i) \
                tmp0[i] = iclp[( tmp[i] + IDCT_TAB_DC_ROUND1 ) >> IDCT_TAB_DC_RC1 ]
            LX_DC(0);LX_DC(1);LX_DC(2);LX_DC(3);
            LX_DC(4);LX_DC(5);LX_DC(6);LX_DC(7);
#           undef LX_DC
#           define LX_DC(i) \
                *(int*)(dst+i*8)   = *(int*)(tmp0);   \
                *(int*)(dst+i*8+2) = *(int*)(tmp0+2); \
                *(int*)(dst+i*8+4) = *(int*)(tmp0+4); \
                *(int*)(dst+i*8+6) = *(int*)(tmp0+6);
            LX_DC(0);LX_DC(1);LX_DC(2);LX_DC(3);
            LX_DC(4);LX_DC(5);LX_DC(6);LX_DC(7);
#           undef LX_DC
        }
    }
    else { // pos is at x = 0;
        // vertical ac; horizontal dc;
        int val32;
#       define LX_DC(i) \
            val32 = iclp[( tmp[i] + IDCT_TAB_DC_ROUND1 ) >> IDCT_TAB_DC_RC1]; \
            val32 = (val32<<16)|(val32&0xffff); \
            *(int*)(dst+i*8)   = val32; \
            *(int*)(dst+i*8+2) = val32; \
            *(int*)(dst+i*8+4) = val32; \
            *(int*)(dst+i*8+6) = val32;
        LX_DC(0);LX_DC(1);LX_DC(2);LX_DC(3);
        LX_DC(4);LX_DC(5);LX_DC(6);LX_DC(7);
#       undef LX_DC
    }
}

static short preSC[64] = {
    16384, 22725, 21407, 19266, 16384, 12873,  8867, 4520,
    22725, 31521, 29692, 26722, 22725, 17855, 12299, 6270,
    21407, 29692, 27969, 25172, 21407, 16819, 11585, 5906,
    19266, 26722, 25172, 22654, 19266, 15137, 10426, 5315,
    16384, 22725, 21407, 19266, 16384, 12873,  8867, 4520,
    12873, 17855, 16819, 15137, 12873, 10114,  6967, 3552,
     8867, 12299, 11585, 10426,  8867,  6967,  4799, 2446,
     4520,  6270,  5906,  5315,  4520,  3552,  2446, 1247
};

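// Note: the operands after "1<<" in ROUND0, ROUND1, ROUNDAAN and MAKE_SCALE
// were lost in the source. They are reconstructed here: each ROUND is half of
// the matching shift step, and MAKE_SCALE uses 1 << RC0 so that C1 == 4096,
// which agrees with the numeric IDCT_TAB_AAN table.
#define RC0     12
#define RC00    10
#define ROUND0  (1 << (RC00 - 1))
#define RC1     (RC0 + 5 + (RC0-RC00) )
#define ROUND1  (1 << (RC1 - 1))

#define MAKE_SCALE(A) ( (int) ( (A)*(1 << RC0) + ( (A) >= 0 ? 0.5 : -0.5 ) ) )

#define C1     MAKE_SCALE(1)         // 1
#define C2     MAKE_SCALE(1.84776)   // 2 COS (PI/8) = 1.84776
#define C4     MAKE_SCALE(1.41421)   // SQRT(2) = 1.41421
#define C6     MAKE_SCALE(0.76537)   // 2 SIN (PI/8) = 0.76537
#define C6pC2  MAKE_SCALE(2.61313)   // C6+C2 = 2.61313
#define C6sC2  MAKE_SCALE(-1.08239)  // C6-C2 = -1.08239

#define ROLAAN    12
#define ROUNDAAN  (1 << (ROLAAN - 1))
#define SCALEAAN(x0,i,j) x0 = ( x0 * (int)preSC[i*8+j] + ROUNDAAN ) >> ROLAAN ;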

__declspec (align(16)) int IDCT_TAB_AAN[8*8] = {
    C1, C1, C1, C1, C1, C1, C1, C1,
    C1, C2-C1, C1-C2+C4, C6-C4+C2-C1, C1-C6+C4-C2, -C1-C4+C2, C1-C2, -C1,
    C1, -C1+C4, -C4+C1, -C1, -C1, -C4+C1, -C1+C4, C1,
    C1, C6-C1, C1-C6-C4, -C2+C4+C6-C1, C1+C2-C4-C6, -C1+C4+C6, C1-C6, -C1,
    C1, -C1, -C1, C1, C1, -C1, -C1, C1,
    C1, -C6-C1, C1+C6-C4, C2+C4-C6-C1, C1-C2-C4+C6, -C1+C4-C6, C1+C6, -C1,
    C1, -C1-C4, C4+C1, -C1, -C1, C4+C1, -C1-C4, C1,
    C1, -C2-C1, C1+C2+C4, -C6-C4-C2-C1, C1+C6+C4+C2, -C1-C4-C2, C1+C2, -C1
};

#if 0
// For readability, the table above uses the same constant definitions as the
// matrix form given earlier.
// For more precise values, use the IDCT_TAB_AAN below.
__declspec (align(16)) const int IDCT_TAB_AAN[8*8] =
{
    4096,   4096,  4096,   4096,  4096,   4096,  4096,  4096,
    4096,   3472,  2320,    815,  -815,  -2320, -3472, -4096,
    4096,   1697, -1697,  -4096, -4096,  -1697,  1697,  4096,
    4096,   -961, -4832,  -2737,  2737,   4832,   961, -4096,
    4096,  -4096, -4096,   4096,  4096,  -4096, -4096,  4096,
    4096,  -7231,  1438,   6130, -6130,  -1438,  7231, -4096,
    4096,  -9889,  9889,  -4096, -4096,   9889, -9889,  4096,
    4096, -11664, 17457, -20592, 20592, -17457, 11664, -4096,
};
#endif

static void __inline idct_aan_row(short* blk, int* blk32, int i )
{
    int x0,x1,x2,x3,x4,x5,x6,x7,xa,xb;
    int t0,t1;

    x0 = blk[0];
    x1 = blk[1];
    x2 = blk[2];
    x3 = blk[3];
    x4 = blk[4];
    x5 = blk[5];
    x6 = blk[6];
    x7 = blk[7];

    SCALEAAN(x0,i,0);
    SCALEAAN(x4,i,4);
    SCALEAAN(x3,i,3);
    SCALEAAN(x5,i,5);
    SCALEAAN(x2,i,2);
    SCALEAAN(x6,i,6);
    SCALEAAN(x1,i,1);
    SCALEAAN(x7,i,7);

    // The shift counts after "<<" are truncated in the source; RC0 is
    // assumed, so that the shifted terms share the 12-bit scale of C4 and C6.
    xa = x0;
    xb = x4;
    x4 = x5 - x3;
    t1 = x5 + x3;
    x3 = (x2+x6) << RC0;
    x2 = (x2-x6)*C4 - x3;
    t0 = x1 + x7;
    x6 = x1 - x7;
    x7 = (t0 + t1) << RC0;
    x5 = (t0 - t1)*C4;
    t0 = C6*(x4+x6);
    x4 = C6sC2*x4 - t0;
    x6 = C6pC2*x6 - t0;
    t0 = x6 - x7;
    x1 = (xa-xb) << RC0;
    t1 = t0 - x5;
    x6 = (xa+xb) << RC0;
    x0 = x4 - t1;
    x4 = x3 + x6;
    x6 -= x3;
    x3 = x1 + x2;
    x5 = x1 - x2;

    blk32[0] = ( (x4+x7+ROUND0) >> RC00 );
    blk32[1] = ( (x3+t0+ROUND0) >> RC00 );
    blk32[2] = ( (x5-t1+ROUND0) >> RC00 );
    blk32[3] = ( (x6-x0+ROUND0) >> RC00 );
    blk32[4] = ( (x6+x0+ROUND0) >> RC00 );
    blk32[5] = ( (x5+t1+ROUND0) >> RC00 );
    blk32[6] = ( (x3-t0+ROUND0) >> RC00 );
    blk32[7] = ( (x4-x7+ROUND0) >> RC00 );
}

static void __inline idct_aan_col( int* blk32, short* blk )
{
    int x0,x1,x2,x3,x4,x5,x6,x7,xa,xb;
    int t0,t1;

    x0 = blk32[0];
    x1 = blk32[8];
    x2 = blk32[16];
    x3 = blk32[24];
    x4 = blk32[32];
    x5 = blk32[40];
    x6 = blk32[48];
    x7 = blk32[56];

    // The shift counts after "<<" are truncated in the source; RC0 is
    // assumed, so that the shifted terms share the 12-bit scale of C4 and C6.
    xa = x0;
    xb = x4;
    x4 = x5 - x3;
    t1 = x5 + x3;
    x3 = (x2+x6) << RC0;
    x2 = (x2-x6)*C4 - x3;
    t0 = x1 + x7;
    x6 = x1 - x7;
    x7 = (t0 + t1) << RC0;
    x5 = (t0 - t1)*C4;
    t0 = C6*(x4+x6);
    x4 = C6sC2*x4 - t0;
    x6 = C6pC2*x6 - t0;
    t0 = x6 - x7;
    x1 = (xa-xb) << RC0;
    t1 = t0 - x5;
    x6 = (xa+xb) << RC0;
    x0 = x4 - t1;
    x4 = x3 + x6;
    x6 -= x3;
    x3 = x1 + x2;
    x5 = x1 - x2;

    blk[0*8] = iclp[ (x4+x7+ROUND1) >> RC1 ];
    blk[1*8] = iclp[ (x3+t0+ROUND1) >> RC1 ];
    blk[2*8] = iclp[ (x5-t1+ROUND1) >> RC1 ];
    blk[3*8] = iclp[ (x6-x0+ROUND1) >> RC1 ];
    blk[4*8] = iclp[ (x6+x0+ROUND1) >> RC1 ];
    blk[5*8] = iclp[ (x5+t1+ROUND1) >> RC1 ];
    blk[6*8] = iclp[ (x3-t0+ROUND1) >> RC1 ];
    blk[7*8] = iclp[ (x4-x7+ROUND1) >> RC1 ];
}

// Degenerate column transform: only the coefficient in row p is nonzero.
static void __inline idct_aan_bridge_ex_col_short(int* blk32, short* blk, int p)
{
    int x0;
    x0 = blk32[p*8];
    blk[0*8] = iclp[( x0 * IDCT_TAB_AAN[p*8+0] + ROUND1 ) >> RC1 ];
    blk[1*8] = iclp[( x0 * IDCT_TAB_AAN[p*8+1] + ROUND1 ) >> RC1 ];
    blk[2*8] = iclp[( x0 * IDCT_TAB_AAN[p*8+2] + ROUND1 ) >> RC1 ];
    blk[3*8] = iclp[( x0 * IDCT_TAB_AAN[p*8+3] + ROUND1 ) >> RC1 ];
    blk[4*8] = iclp[( x0 * IDCT_TAB_AAN[p*8+4] + ROUND1 ) >> RC1 ];
    blk[5*8] = iclp[( x0 * IDCT_TAB_AAN[p*8+5] + ROUND1 ) >> RC1 ];
    blk[6*8] = iclp[( x0 * IDCT_TAB_AAN[p*8+6] + ROUND1 ) >> RC1 ];
    blk[7*8] = iclp[( x0 * IDCT_TAB_AAN[p*8+7] + ROUND1 ) >> RC1 ];
}

#define SCALEAAN_EX(x0,i,j) x0 = ( x0 * (int)preSC[i+j*8] + ROUNDAAN ) >> ROLAAN ;

// Standard column transform used as the first pass (column i of the block).
static void __inline idct_aan_bridge_ex_col_first( short* blk, int* blk32, int i )
{
    int x0,x1,x2,x3,x4,x5,x6,x7,xa,xb;
    int t0,t1;

    x0 = blk[0*8];
    SCALEAAN_EX(x0,i,0);
    x1 = blk[1*8];
    SCALEAAN_EX(x1,i,1);
    x2 = blk[2*8];
    SCALEAAN_EX(x2,i,2);
    x3 = blk[3*8];
    SCALEAAN_EX(x3,i,3);
    x4 = blk[4*8];
    SCALEAAN_EX(x4,i,4);
    x5 = blk[5*8];
    SCALEAAN_EX(x5,i,5);
    x6 = blk[6*8];
    SCALEAAN_EX(x6,i,6);
    x7 = blk[7*8];
    SCALEAAN_EX(x7,i,7);

    // The shift counts after "<<" are truncated in the source; RC0 is
    // assumed, so that the shifted terms share the 12-bit scale of C4 and C6.
    xa = x0;
    xb = x4;
    x4 = x5 - x3;
    t1 = x5 + x3;
    x3 = (x2+x6) << RC0;
    x2 = (x2-x6)*C4 - x3;
    t0 = x1 + x7;
    x6 = x1 - x7;
    x7 = (t0 + t1) << RC0;
    x5 = (t0 - t1)*C4;
    t0 = C6*(x4+x6);
    x4 = C6sC2*x4 - t0;
    x6 = C6pC2*x6 - t0;
    t0 = x6 - x7;
    x1 = (xa-xb) << RC0;
    t1 = t0 - x5;
    x6 = (xa+xb) << RC0;
    x0 = x4 - t1;
    x4 = x3 + x6;
    x6 -= x3;
    x3 = x1 + x2;
    x5 = x1 - x2;

    blk32[0*8] = ( (x4+x7+ROUND0) >> RC00 );
    blk32[1*8] = ( (x3+t0+ROUND0) >> RC00 );
    blk32[2*8] = ( (x5-t1+ROUND0) >> RC00 );
    blk32[3*8] = ( (x6-x0+ROUND0) >> RC00 );
    blk32[4*8] = ( (x6+x0+ROUND0) >> RC00 );
    blk32[5*8] = ( (x5+t1+ROUND0) >> RC00 );
    blk32[6*8] = ( (x3-t0+ROUND0) >> RC00 );
    blk32[7*8] = ( (x4-x7+ROUND0) >> RC00 );
}

// Degenerate row transform: only the coefficient in column p is nonzero.
static void __inline idct_aan_bridge_ex_row_short(int* blk32, short* blk, int p)
{
    int x0;
    x0 = blk32[p];
    blk[0] = iclp[( x0 * IDCT_TAB_AAN[p*8+0] + ROUND1 ) >> RC1 ];
    blk[1] = iclp[( x0 * IDCT_TAB_AAN[p*8+1] + ROUND1 ) >> RC1 ];
    blk[2] = iclp[( x0 * IDCT_TAB_AAN[p*8+2] + ROUND1 ) >> RC1 ];
    blk[3] = iclp[( x0 * IDCT_TAB_AAN[p*8+3] + ROUND1 ) >> RC1 ];
    blk[4] = iclp[( x0 * IDCT_TAB_AAN[p*8+4] + ROUND1 ) >> RC1 ];
    blk[5] = iclp[( x0 * IDCT_TAB_AAN[p*8+5] + ROUND1 ) >> RC1 ];
    blk[6] = iclp[( x0 * IDCT_TAB_AAN[p*8+6] + ROUND1 ) >> RC1 ];
    blk[7] = iclp[( x0 * IDCT_TAB_AAN[p*8+7] + ROUND1 ) >> RC1 ];
}

static int __inline idct_aan_bridge_ex_row(short* blk, int* blk32, int i, int np)
{
    int x0,x1,x2,x3,x4,x5,x6,x7,xa,xb;
    int t0,t1;
    int const nc = g_RowNz[np];

    if( !nc ) {
        // The whole row is zero.
        blk32[0] = blk32[1] = blk32[2] = blk32[3] =
        blk32[4] = blk32[5] = blk32[6] = blk32[7] = 0;
        return 0;
    }
    if( nc == 1 ) {
        int const p = g_RowPos[np];
        if( p == 0 ) {
            // Only the DC coefficient of this row is nonzero.
            x0 = blk[0];
            SCALEAAN(x0,i,0);
            blk32[0] = blk32[1] = blk32[2] = blk32[3] =
            blk32[4] = blk32[5] = blk32[6] = blk32[7] = x0 << 2;
            return -1;
        }
#if 1
        // Only one AC coefficient of this row is nonzero.
        x0 = blk[p];
        SCALEAAN(x0,i,p);
        // blk32[0] = ( x0 * IDCT_TAB_AAN[p*8+0] + ROUND0 ) >> RC00;
        blk32[0] = x0 << 2;
        blk32[1] = ( x0 * IDCT_TAB_AAN[p*8+1] + ROUND0 ) >> RC00;
        blk32[2] = ( x0 * IDCT_TAB_AAN[p*8+2] + ROUND0 ) >> RC00;
        blk32[3] = ( x0 * IDCT_TAB_AAN[p*8+3] + ROUND0 ) >> RC00;
        blk32[4] = ( x0 * IDCT_TAB_AAN[p*8+4] + ROUND0 ) >> RC00;
        blk32[5] = ( x0 * IDCT_TAB_AAN[p*8+5] + ROUND0 ) >> RC00;
        blk32[6] = ( x0 * IDCT_TAB_AAN[p*8+6] + ROUND0 ) >> RC00;
        blk32[7] = ( x0 * IDCT_TAB_AAN[p*8+7] + ROUND0 ) >> RC00;
        return -1;
#endif
    }

    x0 = blk[0];
    SCALEAAN(x0,i,0);
    x1 = blk[1];
    SCALEAAN(x1,i,1);
    x2 = blk[2];
    SCALEAAN(x2,i,2);
    x3 = blk[3];
    SCALEAAN(x3,i,3);
    x4 = blk[4];
    SCALEAAN(x4,i,4);
    x5 = blk[5];
    SCALEAAN(x5,i,5);
    x6 = blk[6];
    SCALEAAN(x6,i,6);
    x7 = blk[7];
    SCALEAAN(x7,i,7);

    // The shift counts after "<<" are truncated in the source; RC0 is
    // assumed, so that the shifted terms share the 12-bit scale of C4 and C6.
    xa = x0;
    xb = x4;
    x4 = x5 - x3;
    t1 = x5 + x3;
    x3 = (x2+x6) << RC0;
    x2 = (x2-x6)*C4 - x3;
    t0 = x1 + x7;
    x6 = x1 - x7;
    x7 = (t0 + t1) << RC0;
    x5 = (t0 - t1)*C4;
    t0 = C6*(x4+x6);
    x4 = C6sC2*x4 - t0;
    x6 = C6pC2*x6 - t0;
    t0 = x6 - x7;
    x1 = (xa-xb) << RC0;
    t1 = t0 - x5;
    x6 = (xa+xb) << RC0;
    x0 = x4 - t1;
    x4 = x3 + x6;
    x6 -= x3;
    x3 = x1 + x2;
    x5 = x1 - x2;

    blk32[0] = ( (x4+x7+ROUND0) >> RC00 );
    blk32[1] = ( (x3+t0+ROUND0) >> RC00 );
    blk32[2] = ( (x5-t1+ROUND0) >> RC00 );
    blk32[3] = ( (x6-x0+ROUND0) >> RC00 );
    blk32[4] = ( (x6+x0+ROUND0) >> RC00 );
    blk32[5] = ( (x5+t1+ROUND0) >> RC00 );
    blk32[6] = ( (x3-t0+ROUND0) >> RC00 );
    blk32[7] = ( (x4-x7+ROUND0) >> RC00 );
    return -1;
}

// Degenerate row transform with two nonzero inputs, at columns p0 and p1.
static void __inline idct_aan_bridge_ex_row_short2(
    int* blk32, short* blk, int p0, int p1)
{
    int x0,x1;
    x0 = blk32[p0];
    x1 = blk32[p1];
    blk[0] = iclp[( x0 * IDCT_TAB_AAN[p0*8+0] + x1 * IDCT_TAB_AAN[p1*8+0] + ROUND1 ) >> RC1 ];
    blk[1] = iclp[( x0 * IDCT_TAB_AAN[p0*8+1] + x1 * IDCT_TAB_AAN[p1*8+1] + ROUND1 ) >> RC1 ];
    blk[2] = iclp[( x0 * IDCT_TAB_AAN[p0*8+2] + x1 * IDCT_TAB_AAN[p1*8+2] + ROUND1 ) >> RC1 ];
    blk[3] = iclp[( x0 * IDCT_TAB_AAN[p0*8+3] + x1 * IDCT_TAB_AAN[p1*8+3] + ROUND1 ) >> RC1 ];
    blk[4] = iclp[( x0 * IDCT_TAB_AAN[p0*8+4] + x1 * IDCT_TAB_AAN[p1*8+4] + ROUND1 ) >> RC1 ];
    blk[5] = iclp[( x0 * IDCT_TAB_AAN[p0*8+5] + x1 * IDCT_TAB_AAN[p1*8+5] + ROUND1 ) >> RC1 ];
    blk[6] = iclp[( x0 * IDCT_TAB_AAN[p0*8+6] + x1 * IDCT_TAB_AAN[p1*8+6] + ROUND1 ) >> RC1 ];
    blk[7] = iclp[( x0 * IDCT_TAB_AAN[p0*8+7] + x1 * IDCT_TAB_AAN[p1*8+7] + ROUND1 ) >> RC1 ];
}

// Degenerate column transform with two nonzero inputs, at rows p0 and p1.
static void __inline idct_aan_bridge_ex_col_short2(
    int* blk32, short* blk, int p0, int p1)
{
    int x0,x1;
    x0 = blk32[p0*8];
    x1 = blk32[p1*8];
    blk[0*8] = iclp[( x0 * IDCT_TAB_AAN[p0*8+0] + x1 * IDCT_TAB_AAN[p1*8+0] + ROUND1 ) >> RC1 ];
    blk[1*8] = iclp[( x0 * IDCT_TAB_AAN[p0*8+1] + x1 * IDCT_TAB_AAN[p1*8+1] + ROUND1 ) >> RC1 ];
    blk[2*8] = iclp[( x0 * IDCT_TAB_AAN[p0*8+2] + x1 * IDCT_TAB_AAN[p1*8+2] + ROUND1 ) >> RC1 ];
    blk[3*8] = iclp[( x0 * IDCT_TAB_AAN[p0*8+3] + x1 * IDCT_TAB_AAN[p1*8+3] + ROUND1 ) >> RC1 ];
    blk[4*8] = iclp[( x0 * IDCT_TAB_AAN[p0*8+4] + x1 * IDCT_TAB_AAN[p1*8+4] + ROUND1 ) >> RC1 ];
    blk[5*8] = iclp[( x0 * IDCT_TAB_AAN[p0*8+5] + x1 * IDCT_TAB_AAN[p1*8+5] + ROUND1 ) >> RC1 ];
    blk[6*8] = iclp[( x0 * IDCT_TAB_AAN[p0*8+6] + x1 * IDCT_TAB_AAN[p1*8+6] + ROUND1 ) >> RC1 ];
    blk[7*8] = iclp[( x0 * IDCT_TAB_AAN[p0*8+7] + x1 * IDCT_TAB_AAN[p1*8+7] + ROUND1 ) >> RC1 ];
}

void idct_aan_bridge_ex(short* block, short* dst, LxIdctInf* pos )
{
    DWORD const dwRowMask = GET_ROW_MASK(pos->dwColMask);
    DWORD const dwRowNum = g_RowNz[dwRowMask];
    DWORD const dwColMask = GET_COL_MASK(pos->dwColMask);
    DWORD const dwColNum = g_RowNz[dwColMask];

    if( dwRowNum == 1 && dwColNum == 1 ) {
        idct_sparse_ac(block,dst,pos ); // fast paths 0 and 1;
        return;
    }

    int i;
    int block32[64];

    // Check fast path 2;
    if( 1 == dwRowNum )
    { // all the nz data in one row;
        int const nRow = g_RowPos[dwRowMask];
        idct_aan_row(block+nRow*8,block32+nRow*8,nRow);
        if( 0 == nRow ) {
            // vertical dc;
            short tmp0[8];
#           define LX_DC(i) tmp0[i] = iclp[( block32[i] + 64 ) >> 7 ]
            LX_DC(0);LX_DC(1);LX_DC(2);LX_DC(3);
            LX_DC(4);LX_DC(5);LX_DC(6);LX_DC(7);
#           undef LX_DC
            // The line continuations below appear as "/" in the source; "\" is restored.
#           define LX_DC(i) \
                *(int*)(dst+i*8)   = *(int*)(tmp0);   \
                *(int*)(dst+i*8+2) = *(int*)(tmp0+2); \
                *(int*)(dst+i*8+4) = *(int*)(tmp0+4); \
                *(int*)(dst+i*8+6) = *(int*)(tmp0+6);
            LX_DC(0);LX_DC(1);LX_DC(2);LX_DC(3);
            LX_DC(4);LX_DC(5);LX_DC(6);LX_DC(7);
#           undef LX_DC
            return;
        }
        for( i = 0; i < 8; i++ ) {
            idct_aan_bridge_ex_col_short(block32 + i, dst + i, nRow );
        }
        return;
    }

    if( 1 == dwColNum )
    { // all the nz data in one col;
        int const nCol = g_RowPos[dwColMask];
        idct_aan_bridge_ex_col_first( block + nCol, block32 + nCol,nCol);
        if( 0 == nCol ) {
            int val32;
#           define LX_DC(i) \
                val32 = iclp[ ( block32[i*8] + 64 ) >> 7 ]; \
                val32 = (val32<<16)|(val32&0xffff); \
                *(int*)(dst+i*8)   = val32; \
                *(int*)(dst+i*8+2) = val32; \
                *(int*)(dst+i*8+4) = val32; \
                *(int*)(dst+i*8+6) = val32;
            LX_DC(0);LX_DC(1);LX_DC(2);LX_DC(3);
            LX_DC(4);LX_DC(5);LX_DC(6);LX_DC(7);
#           undef LX_DC
            return;
        }
        for( i = 0; i < 8; i++ ) {
            idct_aan_bridge_ex_row_short( block32 + i*8, dst + i*8, nCol);
        }
        return;
    }

    if( 2 == dwRowNum ) { // all nonzero coefficients lie in 2 rows.
        DWORD dwPos = g_RowPos[dwRowMask];
        int nRow0 = dwPos & 7;
        int nRow1 = dwPos >> 3;
        // Row r starts at offset r*8. The source passes block + nRow0 here,
        // which does not match the one-row case above or the column pass
        // below, so the offsets are corrected to nRow*8.
        idct_aan_row( block + nRow0*8, block32 + nRow0*8, nRow0);
        idct_aan_row( block + nRow1*8, block32 + nRow1*8, nRow1);
        for( i = 0; i < 8; i++ ) {
            idct_aan_bridge_ex_col_short2( block32 + i, dst + i, nRow0, nRow1 );
        }
        return;
    }

    if( 2 == dwColNum ) { // all nonzero coefficients lie in 2 columns.
        DWORD dwPos = g_RowPos[dwColMask];
        int nCol0 = dwPos & 7;
        int nCol1 = dwPos >> 3;
        idct_aan_bridge_ex_col_first( block + nCol0, block32 + nCol0,nCol0);
        idct_aan_bridge_ex_col_first( block + nCol1, block32 + nCol1,nCol1);
        for( i = 0; i < 8; i++ ) {
            idct_aan_bridge_ex_row_short2( block32 + i*8, dst + i*8, nCol0, nCol1 );
        }
        return;
    }

    // First pass: check fast paths 3 and 4 row by row;
    for( i = 0; i < 8; i++ ) {
        idct_aan_bridge_ex_row( block + i*8, block32 + i*8, i, pos->byRowPos[i]);
    }

    // After the filtering by fast paths 0, 1, 2, 3 and 4 above, the second
    // pass has no fast path left to use.
    for( i = 0; i < 8; i++ ) {
        idct_aan_col(block32+i,dst+i );
    }
}

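As we can see, the function idct_aan_bridge_ex_row does not include a fast path for two nonzero AC coefficients. The reason is that the conditional branches inside this function cost as much as 40%, which cancels out the performance gain of the two-nonzero case.

Below we replace this function with a jump table to eliminate the loss caused by conditional branches.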

typedef void (*aan_row_entry) ( short*, int*, int, int);

static void row_all_zero(short* blk, int* blk32, int i, int np)
{
    blk32[0] = blk32[1] = blk32[2] = blk32[3] =
    blk32[4] = blk32[5] = blk32[6] = blk32[7] = 0;
}

static void row_short_1(short* blk, int* blk32, int i, int np )
{
    int x0;
    int const p = g_RowPos[np];
    if( p == 0 ) {
        x0 = blk[0];
        SCALEAAN(x0,i,0);
        blk32[0] = blk32[1] = blk32[2] = blk32[3] =
        blk32[4] = blk32[5] = blk32[6] = blk32[7] = x0 << 2;
        return ;
    }
    x0 = blk[p];
    SCALEAAN(x0,i,p);
    // blk32[0] = ( x0 * IDCT_TAB_AAN[p*8+0] + ROUND0 ) >> RC00;
    blk32[0] = x0 << 2;
#define SHORT_CUT(i) blk32[i] = ( x0 * IDCT_TAB_AAN[p*8+i] + ROUND0 ) >> RC00
    SHORT_CUT(1);
    SHORT_CUT(2);
    SHORT_CUT(3);
    SHORT_CUT(4);
    SHORT_CUT(5);
    SHORT_CUT(6);
    SHORT_CUT(7);
#undef SHORT_CUT
}

static void row_short_2(short* blk, int* blk32, int i, int np )
{
    DWORD dwPos = g_RowPos[np];
    int p0 = dwPos & 7;
    int p1 = dwPos >> 3;
    int x0 = blk[p0];
    int x1 = blk[p1];
    SCALEAAN(x0,i,p0);
    SCALEAAN(x1,i,p1);
    blk32[0] = ( x0 + x1 ) << 2 ;
#define SHORT_CUT(i) blk32[i] = ( x0 * IDCT_TAB_AAN[p0*8+i] + x1 * IDCT_TAB_AAN[p1*8+i] + ROUND0 ) >> RC00
    SHORT_CUT(1);
    SHORT_CUT(2);
    SHORT_CUT(3);
    SHORT_CUT(4);
    SHORT_CUT(5);
    SHORT_CUT(6);
    SHORT_CUT(7);
#undef SHORT_CUT
}

static void row_normal(short* blk, int* blk32, int i, int np )
{
    int x0,x1,x2,x3,x4,x5,x6,x7,xa,xb;
    int t0,t1;

    x0 = blk[0];
    SCALEAAN(x0,i,0);
    x1 = blk[1];
    SCALEAAN(x1,i,1);
    x2 = blk[2];
    SCALEAAN(x2,i,2);
    x3 = blk[3];
    SCALEAAN(x3,i,3);
    x4 = blk[4];
    SCALEAAN(x4,i,4);
    x5 = blk[5];
    SCALEAAN(x5,i,5);
    x6 = blk[6];
    SCALEAAN(x6,i,6);
    x7 = blk[7];
    SCALEAAN(x7,i,7);

    // The shift counts after "<<" are truncated in the source; RC0 is
    // assumed, so that the shifted terms share the 12-bit scale of C4 and C6.
    xa = x0;
    xb = x4;
    x4 = x5 - x3;
    t1 = x5 + x3;
    x3 = (x2+x6) << RC0;
    x2 = (x2-x6)*C4 - x3;
    t0 = x1 + x7;
    x6 = x1 - x7;
    x7 = (t0 + t1) << RC0;
    x5 = (t0 - t1)*C4;
    t0 = C6*(x4+x6);
    x4 = C6sC2*x4 - t0;
    x6 = C6pC2*x6 - t0;
    t0 = x6 - x7;
    x1 = (xa-xb) << RC0;
    t1 = t0 - x5;
    x6 = (xa+xb) << RC0;
    x0 = x4 - t1;
    x4 = x3 + x6;
    x6 -= x3;
    x3 = x1 + x2;
    x5 = x1 - x2;

    blk32[0] = ( (x4+x7+ROUND0) >> RC00 );
    blk32[1] = ( (x3+t0+ROUND0) >> RC00 );
    blk32[2] = ( (x5-t1+ROUND0) >> RC00 );
    blk32[3] = ( (x6-x0+ROUND0) >> RC00 );
    blk32[4] = ( (x6+x0+ROUND0) >> RC00 );
    blk32[5] = ( (x5+t1+ROUND0) >> RC00 );
    blk32[6] = ( (x3-t0+ROUND0) >> RC00 );
    blk32[7] = ( (x4-x7+ROUND0) >> RC00 );
    return;
}

// Indexed by the number of nonzero coefficients in the row (0 to 8).
static aan_row_entry row_entry[9] = {
    row_all_zero,
    row_short_1,
    row_short_2,
    row_normal, row_normal, row_normal,
    row_normal, row_normal, row_normal,
};

Then change the call site, replacing

    for( i = 0; i < 8; i++ ) {
        idct_aan_bridge_ex_row( block + i*8, block32 + i*8, i, pos->byRowPos[i]);
    }

with:

    for( i = 0; i < 8; i++ ) {
        DWORD const dwRowPos1 = pos->byRowPos[i];
        row_entry[ g_RowNz[dwRowPos1] ]( block + i*8, block32 + i*8, i, dwRowPos1 );
    }

With this change, the call overhead of each row transform is about 7%, and the two-nonzero case brings a gain of about 24%.
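The dispatch pattern itself can be shown in isolation. The sketch below is a stand-alone illustration, not the article's code: all names are hypothetical, and the handlers only return an identifier so that the routing is observable, where real handlers would perform the row transform.

```c
#include <stdint.h>

typedef int (*row_handler)(uint8_t mask);

/* Stand-in handlers; each returns an id instead of transforming a row. */
static int handle_zero(uint8_t mask)    { (void)mask; return 0; }
static int handle_one(uint8_t mask)     { (void)mask; return 1; }
static int handle_two(uint8_t mask)     { (void)mask; return 2; }
static int handle_general(uint8_t mask) { (void)mask; return 3; }

/* Indexed by the number of nonzero coefficients in the row (0..8). */
static const row_handler kHandlers[9] = {
    handle_zero, handle_one, handle_two,
    handle_general, handle_general, handle_general,
    handle_general, handle_general, handle_general,
};

static uint8_t g_bit_count[256];

/* Fill the bit-count table once at start-up. */
void init_bit_count(void)
{
    for (int m = 0; m < 256; m++) {
        int n = 0;
        for (int b = 0; b < 8; b++)
            n += (m >> b) & 1;
        g_bit_count[m] = (uint8_t)n;
    }
}

/* One table lookup plus one indirect call replaces the if/else chain. */
int dispatch_row(uint8_t mask)
{
    return kHandlers[g_bit_count[mask]](mask);
}
```

The design choice is to trade several data-dependent conditional branches for a single indirect call whose target depends only on the bit count of the row mask. Whether that pays off depends on the processor, so figures such as the ones quoted above should be measured on the target machine.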
