I Vertex Shader Basics


Baesky


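Disclaimer: this is a condensed translation of ShaderX; for the details, see Wolfgang Engel's ShaderX series. If the original author objects to this post, please contact me and I will take it down. Please do not repost.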

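What will you learn?

Writing and compiling a vertex shader program.
Lighting with vertex shaders.
Writing and compiling a pixel shader program.
Texture mapping with pixel shaders.
Texture effects.
Per-pixel lighting with pixel shaders.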

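Vertex Shaders in the Graphics Pipeline

The pipeline diagram shows several stages of the Direct3D rendering pipeline, such as source data operations, vertex operations and pixel operations.

In the source data stage, vertices are assembled and tessellated.

In the next stage, the Direct3D pipeline offers two different ways of processing vertices:

1. The "fixed-function" pipeline. It handles standard transformation and lighting and is controlled through functions such as SetRenderState().

2. Vertex shaders, a new mechanism introduced in DX8. Instead of configuring fixed-function operations, you write a vertex shader program that replaces them.

Our goal for now is to learn vertex shading, so keep in mind that the stages further down the pipeline are outside a vertex shader's control.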

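Vertex Shader Architecture

All data in a vertex shader is represented as 128-bit quantities, that is, four 32-bit floats.

A hardware vertex shader can be seen as a SIMD (Single Instruction, Multiple Data) processor: a single instruction operates on up to four 32-bit values. This data format is very convenient because most transformation and lighting calculations use 4x4 matrices or quaternions. The instructions are simple and easy to understand. A vertex shader does not allow loops, jumps or conditional branches, so a program always executes linearly. In DX8, a vertex shader program is limited to 128 instructions. Combining one shader that handles transformation with another that handles lighting is not possible, because only one shader can be active at a time, and the active shader must compute all the output data required for each vertex.

A vertex shader reads incoming vertex data through up to 16 input registers (v0 to v15, each 128 bits wide). They conveniently hold the usual kinds of vertex data: position, normal, diffuse and specular color, fog, and point size.

Before the vertex shader starts executing, the CPU loads the constant registers with parameters defined by the programmer. The vertex shader cannot write to constant registers. They hold data such as light positions, matrices, procedural data for special animation effects, and vertex data for morphing or key-frame interpolation. Constant registers can be addressed indirectly through the address register a0.x, but an instruction can use only one constant register. There are 96 constant registers (c0 to c95); the ATI RADEON 8500 has 192 (c0 to c191).

There are up to 13 output registers, depending on the hardware. Their names begin with the letter o, for "output". The output registers are available to the rasterizer, and the vertex shader has write-only access to them. The final result is another vertex: a vertex transformed into homogeneous clip space.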

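Shader Programming

Only one vertex shader can be active at a time, so it is a good idea to write vertex shaders on a per-task basis. For example, suppose your spaceship has crashed on an unknown planet. Wearing ordinary armor and carrying a saw, you walk through a candle-lit cellar. A monster appears and you hide behind a crate. Besides wondering whether to play the world-saving hero with your saw, you should think about how many vertex shaders this scene needs.

One shader is needed to animate the monster, and possibly to make its skin reflect light from the surroundings. Other vertex shaders handle the floor, the walls, the camera and the candlelight. The floor and walls can share one shader, while the candlelight and the camera each need one of their own. It all depends on your design and on how well you understand the hardware.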

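Steps for using a vertex shader:

Check for vertex shader support via D3DCAPS8::VertexShaderVersion.
Declare the vertex shader with the D3DVSD_* macros to map vertex buffer streams to input registers.
Set the vertex shader constant registers with SetVertexShaderConstant().
Compile the vertex shader you have written with D3DXAssembleShader*() (or precompile it with a shader assembler).
Create a vertex shader handle with CreateVertexShader().
Set the vertex shader for a particular object with SetVertexShader().
Delete the vertex shader with DeleteVertexShader().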

Checking for vertex shader support:

// Check whether vertex shader version 1.1 is supported
if( pCaps->VertexShaderVersion < D3DVS_VERSION(1,1) )
    return E_FAIL;

Version:	Functionality:
0.0	DirectX 7
1.0	DirectX 8 without address register A0
1.1	DirectX 8 and DirectX 8.1 with one address register A0
2.0	DirectX 9
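The comparison above works because the version value orders naturally as an integer. The sketch below reproduces that check without the Direct3D headers. It assumes the encoding 0xFFFE0000 | (major << 8) | minor for D3DVS_VERSION; the helper names vsVersion, supportsVertexShader, vsMajor and vsMinor are hypothetical and not part of the SDK.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for D3DVS_VERSION so the sketch builds without d3d8.h.
// Assumption: the encoding is 0xFFFE0000 | (major << 8) | minor.
constexpr std::uint32_t vsVersion(std::uint32_t major, std::uint32_t minor) {
    return 0xFFFE0000u | (major << 8) | minor;
}

// True when the reported caps value is at least the requested version.
constexpr bool supportsVertexShader(std::uint32_t capsVersion,
                                    std::uint32_t major, std::uint32_t minor) {
    return capsVersion >= vsVersion(major, minor);
}

// Decode helpers, handy for logging what the device reports.
constexpr std::uint32_t vsMajor(std::uint32_t v) { return (v >> 8) & 0xFFu; }
constexpr std::uint32_t vsMinor(std::uint32_t v) { return v & 0xFFu; }
```

With these helpers, a device reporting version 1.0 fails a 1.1 check, while one reporting 2.0 passes it.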

Vertex shader declaration:

A vertex shader must be declared before it is used:

float c[4] = {0.0f,0.5f,1.0f,2.0f};
DWORD dwDecl0[] = {
  D3DVSD_STREAM(0),
  D3DVSD_REG(0, D3DVSDT_FLOAT3 ),    // input register v0
  D3DVSD_REG(5, D3DVSDT_D3DCOLOR ),  // input register v5
                                     // set a few constants
  D3DVSD_CONST(0,1),*(DWORD*)&c[0],*(DWORD*)&c[1],*(DWORD*)&c[2],*(DWORD*)&c[3],
  D3DVSD_END()
};

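The declaration uses D3DVSD_STREAM(0) to declare data stream 0. In a later step, SetStreamSource() binds a vertex buffer to the device data stream declared here. In this way you can feed different data streams to the Direct3D rendering engine.

You must declare which input register each piece of vertex data is mapped to. D3DVSD_REG binds a vertex register to an element of the vertex data stream. In our example, a D3DVSDT_FLOAT3 value is placed in the first input register (v0) and a D3DVSDT_D3DCOLOR value in the sixth input register (v5). For example, declaring D3DVSD_REG(0, D3DVSDT_FLOAT3) lets the shader treat the value in register 0 as position data, and declaring D3DVSD_REG(3, D3DVSDT_FLOAT3) lets it treat the value in register 3 as a normal.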
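The declaration only makes sense together with the memory layout of the vertices in the stream. The following is a minimal sketch of a vertex structure that would match dwDecl0, assuming tightly packed members in declaration order. The names Vertex and packArgb are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical vertex layout for stream 0 matching dwDecl0:
// a FLOAT3 position for v0 followed by a packed D3DCOLOR for v5.
struct Vertex {
    float x, y, z;          // D3DVSDT_FLOAT3   -> v0
    std::uint32_t diffuse;  // D3DVSDT_D3DCOLOR -> v5 (0xAARRGGBB)
};

// Pack 8-bit channels into the ARGB order used by D3DCOLOR.
constexpr std::uint32_t packArgb(std::uint32_t a, std::uint32_t r,
                                 std::uint32_t g, std::uint32_t b) {
    return (a << 24) | (r << 16) | (g << 8) | b;
}

static_assert(sizeof(Vertex) == 16, "stride handed to SetStreamSource()");
static_assert(offsetof(Vertex, diffuse) == 12, "color follows the position");
```

The stride of 16 bytes is the value that would be passed along with the vertex buffer when the stream is bound.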

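How the programmer maps the input vertex attributes to registers matters: if you want to use N-Patches, the N-Patch tessellator needs the position in register 0 and the normal in register 3. Otherwise, the programmer is free to define any suitable mapping.

The default mapping of vertex data to input registers, used by the fixed-function pipeline, can be found in the d3d8types.h header: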

#define D3DVSDE_POSITION 0

#define D3DVSDE_BLENDWEIGHT 1

#define D3DVSDE_BLENDINDICES 2

#define D3DVSDE_NORMAL 3

#define D3DVSDE_PSIZE 4

#define D3DVSDE_DIFFUSE 5

#define D3DVSDE_SPECULAR 6

#define D3DVSDE_TEXCOORD0 7

#define D3DVSDE_TEXCOORD1 8

#define D3DVSDE_TEXCOORD2 9

#define D3DVSDE_TEXCOORD3 10

#define D3DVSDE_TEXCOORD4 11

#define D3DVSDE_TEXCOORD5 12

#define D3DVSDE_TEXCOORD6 13

#define D3DVSDE_TEXCOORD7 14

#define D3DVSDE_POSITION2 15

#define D3DVSDE_NORMAL2 16

The second parameter of D3DVSD_REG specifies the dimensionality and arithmetic data type. The following values are defined in d3d8types.h:

// bit declarations for _Type fields

#define D3DVSDT_FLOAT1 0x00 // 1D float expanded to (value, 0., 0., 1.)

#define D3DVSDT_FLOAT2 0x01 // 2D float expanded to (value, value, 0., 1.)

#define D3DVSDT_FLOAT3 0x02 // 3D float expanded to (value, value, value, 1.)

#define D3DVSDT_FLOAT4 0x03 // 4D float

// 4D packed unsigned bytes mapped to 0. to 1. range
// Input is in D3DCOLOR format (ARGB) expanded to (R, G, B, A)
#define D3DVSDT_D3DCOLOR 0x04

#define D3DVSDT_UBYTE4 0x05 // 4D unsigned byte

// 2D signed short expanded to (value, value, 0., 1.)
#define D3DVSDT_SHORT2 0x06

#define D3DVSDT_SHORT4 0x07 // 4D signed short
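The comments in these definitions describe how each stream type is expanded into a full four-component register. As an illustration only, the sketch below mimics that expansion on the CPU for the float types and for D3DCOLOR. The names Float4, expandFloats and expandD3dColor are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Float4 = std::array<float, 4>;

// Expand 1 to 4 floats into a full register; missing components take the
// defaults (0, 0, 1) for (y, z, w), as the D3DVSDT_FLOAT* comments state.
inline Float4 expandFloats(const float* values, std::size_t count) {
    Float4 reg = {0.0f, 0.0f, 0.0f, 1.0f};
    for (std::size_t i = 0; i < count && i < 4; ++i) reg[i] = values[i];
    return reg;
}

// Expand a packed ARGB color into (R, G, B, A), each mapped to 0..1.
inline Float4 expandD3dColor(std::uint32_t argb) {
    return {((argb >> 16) & 0xFFu) / 255.0f, ((argb >> 8) & 0xFFu) / 255.0f,
            (argb & 0xFFu) / 255.0f, ((argb >> 24) & 0xFFu) / 255.0f};
}
```

A FLOAT3 position therefore arrives in the shader with w = 1, which is what lets a single dp4 per row apply the translation part of a matrix.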

D3DVSD_CONST loads constant values into the vertex shader's constant registers. The first parameter is the index of the first constant register to fill; here it is 0. The second parameter is the number of constant vectors to load.

A vector is 128 bits wide, so four 32-bit FLOAT values are loaded at a time. To load a 4x4 matrix, you could use the following:

float c[16] = {0.0f, 0.5f, 1.0f, 2.0f,
               0.0f, 0.5f, 1.0f, 2.0f,
               0.0f, 0.5f, 1.0f, 2.0f,
               0.0f, 0.5f, 1.0f, 2.0f};

D3DVSD_CONST(0, 4), *(DWORD*)&c[0],*(DWORD*)&c[1],*(DWORD*)&c[2],*(DWORD*)&c[3],
                    *(DWORD*)&c[4],*(DWORD*)&c[5],*(DWORD*)&c[6],*(DWORD*)&c[7],
                    *(DWORD*)&c[8],*(DWORD*)&c[9],*(DWORD*)&c[10],*(DWORD*)&c[11],
                    *(DWORD*)&c[12],*(DWORD*)&c[13],*(DWORD*)&c[14],*(DWORD*)&c[15],

D3DVSD_END generates an END token that marks the end of the vertex shader declaration. Another example:

    	float	c[4] = {0.0f,0.5f,1.0f,2.0f};

	DWORD dwDecl[] = {
 	 D3DVSD_STREAM(0),
  	 D3DVSD_REG(0, D3DVSDT_FLOAT3 ), //input register v0
 	 D3DVSD_REG(3, D3DVSDT_FLOAT3 ), // input register v3
 	 D3DVSD_REG(5, D3DVSDT_D3DCOLOR ), // input register v5
 	 D3DVSD_REG(7, D3DVSDT_FLOAT2 ), // input register v7
 	 D3DVSD_CONST(0,1),*(DWORD*)&c[0],*(DWORD*)&c[1],*(DWORD*)&c[2],*(DWORD*)&c[3],
 	 D3DVSD_END()
	};

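In this example, data stream 0 is set up: the position is bound to v0, the normal to v3, the diffuse color to v5 and the texture coordinates to v7. Constant register c0 is loaded with one 128-bit value.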

Setting the Vertex Shader Constant Registers

Values are written to the constant registers with SetVertexShaderConstant() and read back with GetVertexShaderConstant():

// Set the vertex shader constants
m_pd3dDevice->SetVertexShaderConstant( 0, &vZero, 1 );
m_pd3dDevice->SetVertexShaderConstant( 1, &vOne, 1 );
m_pd3dDevice->SetVertexShaderConstant( 2, &vWeight, 1 );
m_pd3dDevice->SetVertexShaderConstant( 4, &matTranspose, 4 );
m_pd3dDevice->SetVertexShaderConstant( 8, &matCameraTranspose, 4 );
m_pd3dDevice->SetVertexShaderConstant( 12, &matViewTranspose, 4 );
m_pd3dDevice->SetVertexShaderConstant( 20, &fLight, 1 );
m_pd3dDevice->SetVertexShaderConstant( 21, &fDiffuse, 1 );
m_pd3dDevice->SetVertexShaderConstant( 22, &fAmbient, 1 );
m_pd3dDevice->SetVertexShaderConstant( 23, &fFog, 1 );
m_pd3dDevice->SetVertexShaderConstant( 24, &fCaustics, 1 );
m_pd3dDevice->SetVertexShaderConstant( 28, &matProjTranspose, 4 );
The prototype of SetVertexShaderConstant() is:
HRESULT SetVertexShaderConstant(
  DWORD Register,
  CONST void* pConstantData,
  DWORD ConstantCount);


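In the initial shader versions, up to 96 constant registers are available. The first parameter is the index of the first constant register to load, and the last parameter is the number of constant vectors to load. In the first line of the example above, vZero is loaded into register 0, and matTranspose is loaded into registers 4, 5, 6 and 7.
What is the difference between SetVertexShaderConstant() and D3DVSD_CONST? The macro can be used only once, in the declaration, whereas the function can be called before every DrawPrimitive*() call.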
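To make the register arithmetic concrete, here is a small CPU-side model of the constant register file. It is a sketch only: ConstantFile, set and at are hypothetical names that imitate the (start register, data, vector count) argument order of SetVertexShaderConstant(), and the 96-register size is taken from the text.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kRegisterCount = 96;  // c0..c95

// Hypothetical CPU-side model of the constant registers, four floats each.
struct ConstantFile {
    std::array<float, kRegisterCount * 4> data{};

    // Copy `count` four-float vectors starting at register `startRegister`.
    // Returns false instead of writing past c95.
    bool set(std::size_t startRegister, const void* src, std::size_t count) {
        if (startRegister > kRegisterCount) return false;
        if (count > kRegisterCount - startRegister) return false;
        std::memcpy(data.data() + startRegister * 4, src,
                    count * 4 * sizeof(float));
        return true;
    }

    // Read one component (0 = x ... 3 = w) of register `reg`.
    float at(std::size_t reg, std::size_t component) const {
        return data[reg * 4 + component];
    }
};
```

Loading a 16-float matrix at register 4 with a count of 4 fills registers 4 through 7, one row per register, which matches the matTranspose line in the listing.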


Writing and Compiling a Vertex Shader
The syntax is as follows:


OpName dest, [-]s1 [,[-]s2 [,[-]s3]] ;comment
e.g.
mov r1, r2
mad r1, r2, -r3, r4 ; contents of r3 are negated
       
Instruction Parameters Action
add dest, src1, src2 add src1 to src2 (and the optional negation creates subtraction)
dp3 dest, src1, src2 three-component dot product
dest.x = dest.y = dest.z = dest.w = 
(src1.x * src2.x) + (src1.y * src2.y) + (src1.z * src2.z)
dp4 dest, src1, src2 four-component dot product
dest.w = (src1.x * src2.x) + (src1.y * src2.y) + (src1.z * src2.z) + (src1.w * src2.w);
dest.x = dest.y = dest.z = unused
What is the difference between dp4 and mul ? dp4 produces a scalar product and mul is a component by component vector product.
dst dest, src1, src2 The dst instruction works like this: The first source operand (src1) is assumed to be the vector (ignored, d*d, d*d, ignored) and the second source operand (src2) is assumed to be the vector (ignored, 1/d, ignored, 1/d).
Calculate distance vector:
dest.x = 1;
dest.y = src1.y * src2.y
dest.z = src1.z
dest.w = src2.w
dst is useful to calculate standard attenuation. Here is a code snippet that might calculate the attenuation for a point light:
; r7.w = distance * distance = (x*x) + (y*y) + (z*z)
dp3 r7.w, VECTOR_VERTEXTOLIGHT, VECTOR_VERTEXTOLIGHT

; VECTOR_VERTEXTOLIGHT.w = 1/sqrt(r7.w)
; = 1/||V|| = 1/distance
rsq VECTOR_VERTEXTOLIGHT.w, r7.w
...
; Get the attenuation
; d = distance
; Parameters for dst:
; src1 = (ignored, d * d, d * d, ignored)
; src2 = (ignored, 1/d, ignored, 1/d)
;
; r7.w = d * d
; VECTOR_VERTEXTOLIGHT.w = 1/d
dst r7, r7.wwww, VECTOR_VERTEXTOLIGHT.wwww 
; dest.x = 1
; dest.y = src1.y * src2.y
; dest.z = src1.z
; dest.w = src2.w
; r7 = (1, d * d * 1/d, d * d, 1/d) = (1, d, d * d, 1/d)

; c[LIGHT_ATTENUATION].x = a0
; c[LIGHT_ATTENUATION].y = a1
; c[LIGHT_ATTENUATION].z = a2
; (a0 + a1*d + a2* (d * d)) 
dp3 r7.w, r7, c[LIGHT_ATTENUATION] 
; 1 / (a0 + a1*d + a2* (d * d)) 
rcp ATTENUATION.w, r7.w 
...
; Scale the light factors by the attenuation
mul r6, r5, ATTENUATION.w
expp dest, src.w Exponential 10-bit precision
------------------------------------------
float w = src.w; 
float v = (float)floor(src.w);

dest.x = (float)pow(2, v); 
dest.y = w - v;

// Reduced precision exponent 
float tmp = (float)pow(2, w); 
DWORD tmpd = *(DWORD*)&tmp & 0xffffff00; 

dest.z = *(float*)&tmpd; 
dest.w = 1; 
--------------------------------------------
Shortcut:

dest.x = 2 **(int) src.w
dest.y = mantissa(src.w)
dest.z = expp(src.w)
dest.w = 1.0
lit dest, src Calculates lighting coefficients from two dot products and a power. 
---------------------------------------------
To calculate the lighting coefficients, set up the registers as shown:
src.x = N*L ; The dot product between normal and direction to light
src.y = N*H ; The dot product between normal and half vector 
src.z = ignored ; This value is ignored 
src.w = specular power ; The value must be between -128.0 and 128.0
----------------------------------------------
usage:
dp3 r0.x, rn, c[LIGHT_POSITION]
dp3 r0.y, rn, c[LIGHT_HALF_ANGLE]
mov r0.w, c[SPECULAR_POWER]
lit r0, r0
------------------------------------------------
dest.x = 1.0;
dest.y = max(src.x, 0.0);
dest.z= 0.0;
if (src.x > 0.0 && src.w == 0.0)
  dest.z = 1.0;
else if (src.x > 0.0 && src.y > 0.0)
  dest.z = (src.y)^src.w
dest.w = 1.0;
logp dest, src.w Logarithm 10-bit precision
---------------------------------------------------
float v = ABSF(src.w); 
if (v != 0) 
{ 
  int p = (int)(*(DWORD*)&v >> 23) - 127;
  dest.x = (float)p;  // exponent

  p = (*(DWORD*)&v & 0x7FFFFF) | 0x3f800000; 
  dest.y = *(float*)&p; // mantissa;

  float tmp = (float)(log(v)/log(2)); 
  DWORD tmpd = *(DWORD*)&tmp & 0xffffff00; 
  dest.z = *(float*)&tmpd;

  dest.w = 1; 
} 
else 
{ 
  dest.x = MINUS_MAX(); 
  dest.y = 1.0f; 
  dest.z = MINUS_MAX(); 
  dest.w = 1.0f; 
} 
-----------------------------------------------------
Shortcut:
dest.x = exponent((int)src.w)
dest.y = mantissa(src.w)
dest.z = log2(src.w)
dest.w = 1.0
mad dest, src1, src2, src3 dest = (src1 * src2) + src3
max dest, src1, src2 dest = (src1 >= src2)?src1:src2
min dest, src1, src2 dest = (src1 < src2)?src1:src2
mov dest, src move
Optimization tip: question every use of mov (try to rap that !), because there might be methods that perform the desired operation directly from the source register or accept the required output register as the destination.
mul dest, src1, src2 set dest to the component by component product of src1 and src2; To calculate the Cross Product (r5 = r7 X r8),
; r0 used as a temp
; r0 = r7.zxy * r8.yzx, then r5 = r7.yzx * r8.zxy - r0
mul r0, r7.zxyw, r8.yzxw
mad r5, r7.yzxw, r8.zxyw, -r0
nop   do nothing
rcp dest, src.w if(src.w == 1.0f)
{
  dest.x = dest.y = dest.z = dest.w = 1.0f;
}
else if(src.w == 0)
{
  dest.x = dest.y = dest.z = dest.w = PLUS_INFINITY();
}
else
{
  dest.x = dest.y = dest.z = dest.w = 1.0f/src.w;
}

Division:
; scalar r0.x = r1.x/r2.x
RCP r0.x, r2.x
MUL r0.x, r1.x, r0.x
rsq dest, src 
reciprocal square root of src
(much more useful than straight 'square root'):
float v = ABSF(src.w);
if(v == 1.0f)
{
  dest.x = dest.y = dest.z = dest.w = 1.0f;
}
else if(v == 0)
{
  dest.x = dest.y = dest.z = dest.w = PLUS_INFINITY();
}
else
{
  v = (float)(1.0f / sqrt(v));
  dest.x = dest.y = dest.z = dest.w = v;
}

Square root:
; scalar r0.x = sqrt(r1.x)
RSQ r0.x, r1.x
MUL r0.x, r0.x, r1.x
sge dest, src1, src2 
dest = (src1 >=src2) ? 1 : 0
useful to mimic conditional statements:
; compute r0 = (r1 >= r2) ? r3 : r4
; one if (r1 >= r2) holds, zero otherwise
SGE r0, r1, r2 
ADD r1, r3, -r4 
; r0 = r0*(r3-r4) + r4 = r0*r3 + (1-r0)*r4
; effectively, LERP between extremes of r3 and r4
MAD r0, r0, r1, r4
slt dest, src1, src2 dest = (src1 < src2) ? 1 : 0
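The dst entry in the table is the hardest one to read, so here is a CPU-side sketch of the point light attenuation listing built from dp3, rsq, dst and rcp. The function names and the Float4 alias are hypothetical; coeffs stands in for c[LIGHT_ATTENUATION] with (a0, a1, a2) in x, y, z.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Float4 = std::array<float, 4>;

// dp3: three-component dot product.
inline float dp3(const Float4& a, const Float4& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// dst: src1 = (ignored, d*d, d*d, ignored), src2 = (ignored, 1/d, ignored, 1/d)
inline Float4 dst(const Float4& src1, const Float4& src2) {
    return {1.0f, src1[1] * src2[1], src1[2], src2[3]};
}

// Attenuation 1 / (a0 + a1*d + a2*d*d) for a vertex-to-light vector.
inline float attenuation(const Float4& toLight, const Float4& coeffs) {
    const float distSq = dp3(toLight, toLight);      // dp3: d*d
    const float invDist = 1.0f / std::sqrt(distSq);  // rsq: 1/d
    const Float4 squared = {distSq, distSq, distSq, distSq};      // r7.wwww
    const Float4 inverse = {invDist, invDist, invDist, invDist};  // .wwww
    const Float4 d = dst(squared, inverse);          // (1, d, d*d, 1/d)
    return 1.0f / dp3(d, coeffs);                    // dp3, then rcp
}
```

For a light 5 units away, coefficients (0, 1, 0) give roughly 1/5 and (0, 0, 1) give roughly 1/25. A zero-length vector is not handled here; the table above states that rsq returns infinity in that case.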

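The arithmetic logic unit (ALU) of the vertex shader is a multi-threaded vector processor with two functional units. The SIMD vector unit is responsible for the mov, mul, mad, dp3, dp4, dst, min, max, slt and sge instructions.
The special function unit is responsible for the rcp, rsq, log, exp and lit instructions. Most instructions take one cycle; rcp and rsq can take more than one cycle in specific circumstances.
Application hints
rsq is used to normalize vectors in lighting equations. The exponential instruction expp can be used for fog effects, procedural noise generation (see the NVIDIA Perlin noise example), particle behavior in particle systems, damage to objects in a game, and so on. You will use it wherever a fast-changing function is needed. If you need very slow growth instead, use the logarithm instruction logp; the log function is the inverse of the exponential function. The lit instruction mainly handles directional lighting: it calculates the diffuse and specular factors from N*L, N*H and the specular power. Attenuation is not included, but it can be added by combining the result of lit with an attenuation factor computed with dst. The dst instruction is useful for building attenuation factors for point lights and spotlights. min and max can be used to compute absolute values.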
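Two of these hints are easy to check on the CPU. The sketch below imitates the absolute value computed with max and the normalization computed with dp3, rsq and mul. The function names are hypothetical, and the register names in the comments are only one possible assignment.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Float4 = std::array<float, 4>;

// Component-wise |v| the way a shader can do it: max r0, r1, -r1
inline Float4 absViaMax(const Float4& v) {
    Float4 out{};
    for (std::size_t i = 0; i < 4; ++i) out[i] = std::max(v[i], -v[i]);
    return out;
}

// Normalize xyz the way dp3 / rsq / mul would; w is scaled as well
// because the final mul applies to the whole register.
inline Float4 normalize3(const Float4& v) {
    const float lenSq = v[0] * v[0] + v[1] * v[1] + v[2] * v[2];  // dp3 r0.w, r1, r1
    const float invLen = 1.0f / std::sqrt(lenSq);                 // rsq r0.w, r0.w
    return {v[0] * invLen, v[1] * invLen,
            v[2] * invLen, v[3] * invLen};                        // mul r1, r1, r0.w
}
```

Normalizing (3, 0, 4) yields approximately (0.6, 0, 0.8).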

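Complex Instructions in the Vertex Shader

Vertex shaders support complex instructions. The term "macro" is not quite appropriate for them, because they are not simple substitutions like C preprocessor macros. Use these instructions with care: if you use them, you may lose track of the 128-instruction limit and miss some possible optimizations. On the other hand, in software emulation mode Intel and AMD processors may optimize an m4x4 complex instruction, so if you need four dp4 calls, a single m4x4 may be a good idea. If you decide to use m4x4, do not also use dp4 on the same data, because the results can differ slightly.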



      
Macro	Parameters	Action	Clocks
exp	dest, src1	provides exponential with full precision to at least 1/2²⁰	12
frc	dest, src1	returns fractional portion of each input component	3
log	dest, src1	provides log2(x) with full float precision of at least 1/2²⁰	12
m3x2	dest, src1, src2	computes the product of the input vector and a 3x2 matrix	2
m3x3	dest, src1, src2	computes the product of the input vector and a 3x3 matrix	3
m3x4	dest, src1, src2	computes the product of the input vector and a 3x4 matrix	4
m4x3	dest, src1, src2	computes the product of the input vector and a 4x3 matrix	3
m4x4	dest, src1, src2	computes the product of the input vector and a 4x4 matrix	4
 

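These instructions are enough to implement all transformation and lighting operations. You could even use them to write your own fixed-function pipeline.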
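Because macros expand into several instructions, a program that looks short can still exceed the 128-instruction limit. The following sketch estimates the size of a program from its opcodes. It rests on an assumption: that each macro consumes as many instruction slots as the Clocks column of the macro table lists, and that every other opcode takes one. All names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Slots consumed by one opcode. Assumption: a macro expands to as many
// instruction slots as its Clocks value; everything else takes one slot.
inline int slotsFor(const std::string& op) {
    if (op == "exp" || op == "log") return 12;
    if (op == "m3x4" || op == "m4x4") return 4;
    if (op == "frc" || op == "m3x3" || op == "m4x3") return 3;
    if (op == "m3x2") return 2;
    return 1;
}

// Total slots of a program given as a list of opcodes.
inline int countSlots(const std::vector<std::string>& ops) {
    int total = 0;
    for (const std::string& op : ops) total += slotsFor(op);
    return total;
}

// True when the program fits the 128-instruction limit.
inline bool fitsInstructionLimit(const std::vector<std::string>& ops) {
    return countSlots(ops) <= 128;
}
```

Under that assumption, eleven exp macros alone would already need 132 slots and no longer fit.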
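Putting It All Together
Now let us see how the registers and instructions work together in the vertex shader ALU.
In vertex shader version 1.1 there are up to 16 input registers, 96 constant registers, 12 temporary registers, 1 address register and up to 13 output registers per rasterizer. Each register holds 4 x 32-bit values, and each 32-bit value is accessible through an x, y, z or w subscript. To access one component of such a 128-bit value, append "." and x, y, z or w to the register name.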


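Using the Input Registers
There are 16 input registers, v0 to v15. Typical values they hold are:
• Position (x, y, z, w)
• Diffuse color (r, g, b, a) -> 0.0 to 1.0
• Specular color (r, g, b, a) -> 0.0 to 1.0
• Up to 8 texture coordinates (s, t, r, q or u, v, w, q), usually 4 or 6 depending on hardware support
• Fog (f, *, *, *) -> value used in the fog equation
• Point size (p, *, *, *)


You can read the x component of the position with v0.x and the y component with v0.y. If the diffuse color is mapped to v1 and you need its green component, check v1.y. If the fog value is mapped to v7, read it from v7.x; the other components v7.y, v7.z and v7.w are unused. Input registers are read-only, and each instruction may access only one vertex input register. Unspecified components default to 0.0 for x, y and z and to 1.0 for w, as the D3DVSDT_FLOAT* expansions show. The following example computes the dot products of v0 with c0 to c3 and stores the results in oPos:
dp4 oPos.x , v0 , c0
dp4 oPos.y , v0 , c1
dp4 oPos.z , v0 , c2
dp4 oPos.w , v0 , c3
A code fragment like this typically maps a vertex into clip space using the already concatenated world, view and projection matrices. Each dot product performs the following operation:
oPos.x = (v0.x * c0.x) + (v0.y * c0.y) + (v0.z * c0.z) + (v0.w * c0.w)
If both vectors are of unit length, their dot product always lies in the range [-1,1], so oPos stays within that range. You can also use the following instruction instead:
m4x4 oPos, v0 , c0
Remember that an instruction may use only one input register.
The values in the input registers persist after the shader has run, which means the next vertex shader program can reuse them.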
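The four dp4 instructions amount to multiplying the vertex by a matrix whose rows live in c0 to c3. As a rough CPU-side illustration, with hypothetical names dp4 and transform:

```cpp
#include <array>
#include <cassert>

using Float4 = std::array<float, 4>;

// dp4: four-component dot product.
inline float dp4(const Float4& a, const Float4& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// rows[0..3] play the role of c0..c3: each constant register holds one
// row, and each output component is one dot product with the vertex.
inline Float4 transform(const Float4& v0, const std::array<Float4, 4>& rows) {
    return {dp4(v0, rows[0]), dp4(v0, rows[1]),
            dp4(v0, rows[2]), dp4(v0, rows[3])};
}
```

Since every register is dotted with the vertex, the registers must hold the rows of the matrix in this "matrix times column vector" form. That is presumably why the earlier SetVertexShaderConstant() listing uploads transposed matrices, although the text does not state the reason.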


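Using the Constant Registers
Typical uses of the constant registers include:
• Matrix data: 4x4 matrices
• Light properties (position, attenuation, and so on)
• Current time
• Vertex interpolation data
• Procedural data
Constant registers are read-only for the vertex shader, but the application can both read and write them. Their values persist for as long as those of the input registers, so they can be reused as well. This avoids frequent calls to SetVertexShaderConstant(). Reading an out-of-range constant register returns (0.0, 0.0, 0.0, 0.0).
An instruction may use only one constant register, but it may use that register more than once. For example:
; legal
mul r5, c11, c11 ; The product of c11 and c11 is stored in r5
; illegal: c4 and c3 are two different constant registers (and v0 is read-only)
add v0, c4, c3
A legal but more complex use:
; dest = (src1 * src2) + src3
mad r0, r0, c20, c20 ; multiplies r0 with c20 and adds c20 to the result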
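The one-constant rule is simple enough to check mechanically before handing a shader to the assembler. Below is a small sketch of such a check; it is not part of any SDK tool, the function names are hypothetical, and it recognizes only the plain cN form, not c[a0.x + n].

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Count the distinct constant registers (c0, c1, ...) named by one
// instruction line. Text after ';' is treated as a comment and ignored.
inline std::size_t distinctConstants(const std::string& line) {
    const std::string code = line.substr(0, line.find(';'));
    std::set<std::string> seen;
    std::size_t i = 0;
    while (i < code.size()) {
        // A register name starts where the previous character is not alphanumeric.
        const bool atStart =
            i == 0 || !std::isalnum(static_cast<unsigned char>(code[i - 1]));
        if (atStart && code[i] == 'c' && i + 1 < code.size() &&
            std::isdigit(static_cast<unsigned char>(code[i + 1]))) {
            std::size_t j = i + 1;
            while (j < code.size() &&
                   std::isdigit(static_cast<unsigned char>(code[j]))) {
                ++j;
            }
            seen.insert(code.substr(i, j - i));
            i = j;
        } else {
            ++i;
        }
    }
    return seen.size();
}

// True when the line names at most one distinct constant register.
inline bool obeysConstantRule(const std::string& line) {
    return distinctConstants(line) <= 1;
}
```

Running it over the three instructions above accepts the mul and mad lines and rejects the add line.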
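Using the Address Register
The address registers are accessed as a0 to aN (versions above vs.1.1 may offer more than one address register). In vs.1.1 the only use of a0 is as an offset into the constant registers:
c[a0.x + n] ; supported only in vs.1.1 or higher
            ; n is the base address, a0.x is the offset
Here is an example that uses the address register:
//Set 1
mov a0.x,r1.x
m4x3 r4,v0,c[a0.x + 9];
m3x3 r5,v3,c[a0.x + 9];
Depending on the value in the temporary register r1.x, these instructions use different constant registers. Note that a0 stores only whole numbers (no fractional part), that a0.x is the only usable component of a0, and that a0.x can be written only with the mov instruction.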
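Relative addressing is what makes it possible to pick one of several matrices per vertex. The sketch below emulates c[a0.x + n] on the CPU. The names are hypothetical, and the packing of three-row matrices starting at c9 is an assumption chosen to match the m4x3 line above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Float4 = std::array<float, 4>;
using ConstantFile = std::array<Float4, 96>;  // c0..c95

// Emulates c[a0.x + n]; reads outside c0..c95 return zeros, matching the
// documented behavior for out-of-range constant reads.
inline Float4 readRelative(const ConstantFile& c, int a0x, int n) {
    const int index = a0x + n;
    if (index < 0 || index >= static_cast<int>(c.size())) {
        return {0.0f, 0.0f, 0.0f, 0.0f};
    }
    return c[static_cast<std::size_t>(index)];
}

// Hypothetical helper: value to place in a0.x for matrix number `matrix`
// when each matrix occupies three consecutive registers.
inline int matrixOffset(int matrix) { return matrix * 3; }
```

With this layout, matrix 0 starts at c9 and matrix 1 at c12, so the same m4x3 instruction transforms a vertex by whichever matrix its index selects.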
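Using the Temporary Registers
The 12 temporary registers are accessed as r0 to r11. Some examples:
dp3 r2, r1, -c4 ; A three-component dot product: dest.x = dest.y = dest.z
; = dest.w = (r1.x * -c4.x) + (r1.y * -c4.y) + (r1.z * -c4.z)
            ...
mov r0.x, v0.x
mov r0.y, c4.w
mov r0.z, v0.y
mov r0.w, c4.w


If you try to read a temporary register that has not been written, an error is reported when the vertex shader is created.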


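Using the Output Registers
There are up to 13 write-only output registers. They are the input to the rasterizer. Each output register name starts with a lowercase "o" and reflects the register's purpose in vertex shading: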

      
Name	Value	Description
oDn	2 quad-floats	Output color data directly to the pixel shader. Required for diffuse color (oD0) and specular color (oD1).
oPos	1 quad-float	Output position in homogeneous clipping space. Must be written by the vertex shader.
oTn	up to 8 quad-floats (GeForce 3: 4, RADEON 8500: 6)	Output texture coordinates. Required for maximum number of textures simultaneously bound to the texture blending stage.
oPts.x	1 scalar float	Output point-size registers. Only the scalar x-component of the point size is functional.
oFog.x	1 scalar float	The fog factor to be interpolated and then routed to the fog table. Only the scalar x-component is functional.
 
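The following example shows how to use the oPos, oD0 and oT0 registers:
dp4 oPos.x , v0 , c4 ; projected x
dp4 oPos.y , v0 , c5 ; projected y
dp4 oPos.z , v0 , c6 ; projected z
dp4 oPos.w , v0 , c7 ; projected w
mov oD0, v5          ; set the diffuse color
mov oT0, v2          ; output texture coordinates from input register v2 to oT0
The listing uses four dp4 instructions to map the vertex into clip space. The first mov copies the contents of v5 to the color output register, and the second mov copies the contents of v2 to the texture coordinate output register.
An example that uses the oFog.x output register:
; Scale by fog parameters :
; c5.x = fog start
; c5.y = fog end
; c5.z = 1/range
; c5.w = fog max
dp4 r2, v0, c2 ; r2 = distance to camera
sge r3, c0, c0 ; r3 = 1
add r2, r2, -c5.x           ; camera space depth (z) - fog start
mad r3.x, -r2.x, c5.z, r3.x ; 1.0 - (z - fog start) * 1/range
                            ; fog = 1.0 means no fog,
                            ; fog = 0.0 means full fog
max oFog.x, c5.w, r3.x      ; clamp the fog with our custom max value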
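The fog listing boils down to one linear ramp followed by a clamp. The sketch below restates it on the CPU; FogParams and fogFactor are hypothetical names, and the struct simply mirrors the c5 layout given in the comments.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical struct mirroring the c5 layout from the listing.
struct FogParams {
    float start;     // c5.x
    float end;       // c5.y (not read by the computation itself)
    float invRange;  // c5.z = 1 / (end - start)
    float maxFog;    // c5.w
};

// 1.0 means no fog and smaller values mean more fog. The result never
// drops below maxFog because of the final max.
inline float fogFactor(float cameraDepth, const FogParams& p) {
    const float z = cameraDepth - p.start;       // add r2, r2, -c5.x
    const float linear = 1.0f - z * p.invRange;  // mad r3.x, -r2.x, c5.z, r3.x
    return std::max(p.maxFog, linear);           // max oFog.x, c5.w, r3.x
}
```

With fog starting at depth 16 and ending at depth 80, a vertex at depth 48 is halfway through the range and gets a fog factor of 0.5. Note that a vertex in front of the fog start yields a value above 1.0 in this sketch, since only the lower bound is clamped.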
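Every vertex shader must write to at least one component of oPos (.x/.y/.z/.w); otherwise the assembler reports an error.


When vertex shaders are used, the D3DTSS_TCI_* flags of D3DTSS_TEXCOORDINDEX are ignored, and all texture coordinates are mapped in numerical order. Optimization tip: emit oPos as early as possible to trigger parallel execution in the pixel shader. Try to reorder the instructions of the vertex shader to achieve this.


All values output from the vertex shader are clamped to the range [0,1]. If signed values are needed in the pixel shader, they must be packed into that range in the vertex shader and expanded again in the pixel shader with _bx2.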
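The packing and unpacking mentioned here is a simple scale and bias. As a sketch with hypothetical function names, where expandBx2 reproduces what the _bx2 modifier computes, namely 2 * (x - 0.5):

```cpp
#include <cassert>

// Vertex shader side: map a signed value in [-1, 1] to [0, 1],
// which a single mad (v * 0.5 + 0.5) can do.
inline float packSigned(float v) { return v * 0.5f + 0.5f; }

// Pixel shader side: the _bx2 modifier maps [0, 1] back to [-1, 1].
inline float expandBx2(float x) { return (x - 0.5f) * 2.0f; }
```

A round trip through both functions returns the original value, up to the precision of the interpolated output.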


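Swizzling and Masking
When input, constant or temporary registers are used as source registers, you can swizzle their components. When output or temporary registers are used as destination registers, you can apply a write mask.
Swizzling is very efficient. For example, it can quickly turn (0.5, 0.0, 1.0, 0.6) into (0.0, 0.0, 1.0, 0.0) or (0.6, 1.0, -0.5, 0.6).
Any register used as a source register in an instruction can be swizzled:

mov R1, R2.wxyz ; R1 = (R2.w, R2.x, R2.y, R2.z)

mov R1, -R2.xyyz ; R1 = (-R2.x, -R2.y, -R2.y, -R2.z)

Masking
Some examples:
mov  R1.x, R2
Only the x component is written to R1.

mov  R1.xw, R2
Only the x and w components of R2 are written to R1. Destination registers support neither swizzling nor negation.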
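Swizzles and write masks can be mimicked with a few lines of CPU code, which helps when checking by hand what a shader line produces. This is a sketch with hypothetical names; patterns are given as strings such as "wxyz" and must consist of the letters x, y, z and w.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

using Float4 = std::array<float, 4>;

// Map a component letter to its index; anything but x, y, z counts as w.
inline std::size_t componentIndex(char name) {
    switch (name) {
        case 'x': return 0;
        case 'y': return 1;
        case 'z': return 2;
        default:  return 3;
    }
}

// Source swizzle with optional negation, e.g. swizzle(r2, "xyyz", true)
// models -R2.xyyz. The pattern must name exactly four components.
inline Float4 swizzle(const Float4& src, const std::string& pattern,
                      bool negate = false) {
    Float4 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        const float value = src[componentIndex(pattern[i])];
        out[i] = negate ? -value : value;
    }
    return out;
}

// Destination write mask, e.g. maskedWrite(r1, r2, "xw") models mov R1.xw, R2.
inline void maskedWrite(Float4& dest, const Float4& src, const std::string& mask) {
    for (char name : mask) {
        dest[componentIndex(name)] = src[componentIndex(name)];
    }
}
```

Applying the pattern "yyzy" to (0.5, 0.0, 1.0, 0.6) gives (0.0, 0.0, 1.0, 0.0), the first result mentioned in the text.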


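Guidelines for Writing Vertex Shaders
The most important restrictions are:
• At least one component of the output register oPos must be written.
• A program may contain at most 128 instructions.
• Each instruction may use only one constant register.
• Each instruction may use only one input register.
• There are no C-style conditional statements, but the sge instruction can mimic r0 = (r1 >= r2) ? r3 : r4.
• All values output from the vertex shader are clamped to the range [0,1].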
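The sge trick from the list replaces a branch with arithmetic: the comparison yields 1.0 or 0.0, which then blends the two candidates. Here is a CPU-side sketch of the SGE, ADD, MAD sequence from the instruction table, with hypothetical function names:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Float4 = std::array<float, 4>;

// sge: 1.0 where a >= b, otherwise 0.0, per component.
inline Float4 sge(const Float4& a, const Float4& b) {
    Float4 out{};
    for (std::size_t i = 0; i < 4; ++i) out[i] = (a[i] >= b[i]) ? 1.0f : 0.0f;
    return out;
}

// r0 = (r1 >= r2) ? r3 : r4 without a branch:
//   SGE r0, r1, r2 / ADD r1, r3, -r4 / MAD r0, r0, r1, r4
inline Float4 selectGe(const Float4& r1, const Float4& r2,
                       const Float4& r3, const Float4& r4) {
    const Float4 mask = sge(r1, r2);
    Float4 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        const float diff = r3[i] - r4[i];  // add
        out[i] = mask[i] * diff + r4[i];   // mad
    }
    return out;
}
```

The selection happens independently per component, so one register can hold a mix of values from r3 and r4.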


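Ways to Optimize Vertex Shaders

Read Kim Pallister's article on optimizing vertex shader programs.
Use SetVertexShaderConstant() to set the vertex shader constant registers.
Avoid the mov instruction wherever possible.
Collapse instructions where you can. Provided r4 is not needed afterwards,
mad   r4,r3,c9,r4
mov   oD0,r4
is equivalent to
mad   oD0,r3,c9,r4
Expand complex instructions into their individual instructions before optimizing.


A rule of thumb for balancing the load between GPU and CPU: many shading calculations can be done per object instead of per vertex, with the result placed in constant registers. Computing a per-object value on the CPU and passing the result to the GPU as a constant is much faster than computing it on the GPU for every vertex.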
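A typical per-object computation is the concatenation of the world, view and projection matrices. The sketch below does it once on the CPU, so that the shader only needs four dp4 instructions per vertex. It assumes the row-vector convention (vertex times world times view times projection) and a transpose before upload; the names Matrix, multiply, transpose and buildClipMatrix are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Matrix = std::array<std::array<float, 4>, 4>;

// Plain 4x4 matrix product a * b.
inline Matrix multiply(const Matrix& a, const Matrix& b) {
    Matrix r{};
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
            for (std::size_t k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

inline Matrix transpose(const Matrix& m) {
    Matrix t{};
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
            t[i][j] = m[j][i];
    return t;
}

// Done once per object on the CPU. The four rows of the result are what
// would be uploaded to four consecutive constant registers.
inline Matrix buildClipMatrix(const Matrix& world, const Matrix& view,
                              const Matrix& proj) {
    return transpose(multiply(multiply(world, view), proj));
}
```

Doing the two matrix products in the shader instead would repeat the same work for every vertex of the object.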


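Compiling a Vertex Shader


Direct3D uses byte code, whereas OpenGL parses a string. A Direct3D developer therefore has to assemble the vertex shader source with an assembler.
There are three ways to compile a vertex shader:

• Write the vertex shader source in an ASCII file, for example text.vsh, and compile it into a binary file, for example text.vso, with a vertex shader assembler. The binary file is opened and read when the game starts.
• Write the vertex shader source as ASCII text, store it in a .cpp file, and compile it with D3DXAssembleShader*() when the program starts.
• Write the vertex shader source in an effect file and open that file when the program starts. D3DXCreateEffectFromFile() opens an effect file. Effect files can also be precompiled.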


Creating a Vertex Shader
HRESULT CreateVertexShader(
CONST DWORD* pDeclaration,
CONST DWORD* pFunction,
DWORD* pHandle,
DWORD Usage);
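This function creates a vertex shader and makes it ready for use. pDeclaration points to the shader declaration, pFunction points to the compiled shader program, and pHandle receives the shader handle if the call succeeds. Usage can force software vertex processing with D3DUSAGE_SOFTWAREPROCESSING; this applies when the D3DRS_SOFTWAREVERTEXPROCESSING render state is set to true. Software vertex processing is slower than hardware processing.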


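Setting a Vertex Shader
To use a vertex shader for a particular object, call SetVertexShader() before that object's DrawPrimitive*() call. This lets you switch vertex shaders dynamically between primitives.
// set the vertex shader
m_pd3dDevice->SetVertexShader( m_dwVertexShader );
The parameter is the handle returned by CreateVertexShader(). The vertex shader program is executed once for every vertex. For example, if you display a rotating quad built from 4 vertices and an index list, the NVIDIA Shader Debugger shows the vertex shader program running 4 times before DrawPrimitive*() is called.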


Freeing Vertex Shader Resources
When the game shuts down, the resources held by the vertex shader must be released.
// delete the vertex shader
if (m_dwVertexShader != 0xffffffff)
{
    m_pd3dDevice->DeleteVertexShader( m_dwVertexShader );
    m_dwVertexShader = 0xffffffff;
}


 


