通用着色器内部结构

7 Common Shader Internals

7 通用着色器内部结构

Full details of the Shader models for each shader stage are provided in dedicated sections elsewhere in the spec. What follows is a discussion of a few general items (not an exhaustive list) that are common to all of the Shader models.

每个着色器阶段的着色器模型的完整细节都在规范的专门部分中提供。接下来是对一些通用项目(并非详尽无遗的列表)的讨论,这些项目是所有着色器模型的共有的。

7.1 Instruction Counts

7.1 指令数量

There are no limits on total shader program length or execution time (accounting for loops and subroutines), aside from any limitations in what may be expressed in the shader token format. Clearly longer programs will degrade in performance, but D3D11.3 currently does not specify how steeply performance will degrade relative to program length or execution time given that there are so many variables that might affect performance.
除了对着色器token 格式表示的内容有一些限制外,对整个着色器程序长度或执行时间(包括循环和子例程)没有限制。显然,较长的程序会降低性能,但 D3D11.3 目前没有指定性能下降有多严重,相对于程序长度或执行时间,因为有太多的变量可能会影响性能。

7.2 Common Instruction Set

7.2 通用指令集

Aside from a few exceptions, the instruction set is identical for all the shader stages. The exceptions are confined to instructions that only make sense in a given Shader unit.

除了少数例外情况外,所有着色器阶段的指令集都是相同的。例外情况仅限于只在给定的着色器单元中有意义的指令。

For example, the sample instruction computes LOD based on derivatives, so sample and sample_b (sample with LOD bias) are only relevant in the Pixel Shader, where derivatives are present, while sample_l (sample at selected LOD) and sample_d (sample with application-provided derivatives) are available in all stages.

例如,sample 指令基于导数计算 LOD,因此sample 和sample_b(具有 LOD 偏差的sample)仅与存在导数的像素着色器中相关,而 sample_l(选定 LOD 的采样)和 sample_d(具有应用程序提供的导数的采样)在所有阶段都可用。

7.3 Temporary Storage

7.3 临时存储

Temporary storage is composed of a single Element type, which is a 4-tuple of untyped 32-bit quantities. Temporary storage consists of two classes of storage: registers, which are non-indexed single elements; and arrays, which are indexable 1D arrays of elements. Temporary storage is read/write, and is uninitialized at the start of a Shader execution instance.

临时存储由单个元素类型组成,该类型是非类型化 32 位数量的 4 元组。临时存储由两类存储组成:

寄存器,它们是非索引的单个元素; 

数组,它们是元素的可索引一维数组。

临时存储是读/写的,在着色器执行实例开始时未初始化。

Reads of temporary storage that has not been previously written within a Shader execution instance return undefined values, but cannot return data outside of the address space of the device context.

在着色器执行实例内部,对于尚未在临时存储中写入的数据,读取操作会返回未定义的值,但不能返回设备上下文地址空间之外的数据。

Temporary registers are declared(22.3.35) as r#, and can be used as a temporary operand in D3D11.3 instructions.

临时寄存器声明 (22.3.35) 为 r#,可用作 D3D11.3 指令中的临时操作数。

Here # is the register number; for example, r0, r1 and r2 are all temporary registers. Example:

mov o0.xyz, r0.xyzx

Temporary arrays are declared(22.3.36) as x#[n], where “n” is the array length (indexed with 0..n-1). Temporary arrays must be indexed by an r# scalar, a statically indexed x# scalar, and/or an optional immediate constant (literal), and can have only one level of index nesting (e.g. x0[x1[r0.x+1].x+1] is not legal, but x0[x1[1].x+1] is legal). A temporary array reference, x#[?], can be used as a temporary operand in D3D11.3 instructions (i.e. anywhere an r# can be used). Out of bounds access to x#[?] is undefined, except that data outside the GPU process context is never visible.

临时数组声明 (22.3.36) 为 x#[n],其中“n”是数组长度(以 0..n-1 为索引)。临时数组必须由 r# 标量、静态索引的 x# 标量和/或可选的立即数常量(文字)进行索引,并且只能有一个级别的索引嵌套(例如x0[x1[r0.x+1].x+1] 不合法,但 x0[x1[1].x+1] 合法)。临时数组引用 x#[?] 可用作 D3D11.3 指令中的临时操作数(即 r# 的任何地方)。对 x#[?] 的越界访问是未定义的,但 GPU 进程上下文之外的数据永远不会可见。

Here # is the array number and n is the array length; for example, x0[3] is a temporary array of length 3. Example:

mul r0.xz, x0[r0.w].xww, float4(0.5f,0,0.1f,0)

The total quantity of temporary storage per Shader execution instance is 4096 elements, which can be utilized in any combination of registers and arrays, i.e. the number of r# registers declared plus the total length of all x# arrays declared must be <= 4096.

每个 Shader 执行实例的临时存储总量为 4096 个元素,可用于寄存器和数组的任意组合。即声明的 R# 和 X# 总数必须为 <= 4096。

Note that the namespaces for r# and x# (the #) are independent. For example, suppose r2 and x2[5] are declared. They are independent, but together they count as 6 units of storage against the limit of 4096 temporary elements.

请注意,r# 和 x# (#) 的命名空间是独立的。例如,假设声明了 r2 和 x2[5]。它们是独立的,但两者加在一起算作 6 个存储单元,而临时寄存器的限制为 4096 个。
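The counting rule above can be sketched on the host side. This is only an illustrative model, not part of the spec or of any D3D API: the `TempDecls` type and the function names are hypothetical, and the only number taken from the text is the 4096-element limit.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical summary of one shader's temporary storage declarations.
struct TempDecls {
    std::uint32_t registerCount;              // number of r# registers declared
    std::vector<std::uint32_t> arrayLengths;  // n for each declared x#[n]
};

constexpr std::uint32_t kMaxTempElements = 4096;  // limit stated in the text

// One element per r#, plus n elements per x#[n].
inline std::uint64_t TempElementsUsed(const TempDecls& d) {
    std::uint64_t total = d.registerCount;
    for (std::uint32_t n : d.arrayLengths) total += n;
    return total;
}

inline bool FitsTempBudget(const TempDecls& d) {
    return TempElementsUsed(d) <= kMaxTempElements;
}

// The example from the text: one register (r2) and one array x2[5] use 6 elements.
inline void ExampleTempBudget() {
    const TempDecls decls{1, {5}};
    assert(TempElementsUsed(decls) == 6);
    assert(FitsTempBudget(decls));
}
```

The register number itself (the 2 in r2 or x2) plays no role in the count, which reflects the independence of the two namespaces.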

To provide a run-time stack, a program allocates a temporary array of a fixed size. The program should provide its own stack bounds checking, e.g., skip calls if the stack push would exceed the array bounds.

为了提供运行时堆栈,程序会分配一个固定大小的临时数组。程序应提供自己的堆栈边界检查,例如,如果堆栈推送超过数组边界,则跳过调用。
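A minimal sketch of such a program-managed stack, written in C++ rather than shader code so it can be run on the host. The `FixedStack` type is hypothetical; it only models a fixed-size array (standing in for an x#[N] temporary array) with the bounds check that the text says the program should supply itself.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fixed-size stack modeled on a temporary array x#[N]; N is chosen up front.
template <std::size_t N>
struct FixedStack {
    std::array<std::uint32_t, N> slots{};
    std::size_t top = 0;  // number of live entries

    // Returns false (and changes nothing) if the push would exceed the bounds.
    bool Push(std::uint32_t value) {
        if (top >= N) return false;
        slots[top++] = value;
        return true;
    }

    // Returns false if the stack is empty.
    bool Pop(std::uint32_t& out) {
        if (top == 0) return false;
        out = slots[--top];
        return true;
    }
};

// A caller would skip the subroutine call whenever Push reports failure.
inline void ExampleFixedStack() {
    FixedStack<2> stack;
    assert(stack.Push(10));
    assert(stack.Push(20));
    assert(!stack.Push(30));  // would overflow, so the push is refused
}
```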

There is no limit on the total number of times temp registers (the same one or different ones) can appear in a single instruction or in a shader.

临时寄存器(相同的寄存器或不同的寄存器)可以在单个指令或着色器中出现的总次数没有限制。

7.4 Immediate Constants

7.4 立即数常量

For any instruction source argument that is capable of taking a temporary register, it is also permitted to supply a 32-bit immediate scalar or a 32-bit immediate 4-vector in the Shader code. At most one source operand per instruction may be specified using an immediate value (having up to 4 components). Immediate scalar values used in indexing of registers can only be used once per indexed operand in an instruction, but these immediate values do not count against the limit of one immediate as a raw source operand. For example, "add r0, v[1 + r0.x], float4(1.0f,2.0f,3.0f,4.0f)" is valid, since there is only one immediate source operand present (the float4), with the value 1 in the indexing of v[] not counting against the limit.

对于任何能够采用临时寄存器的指令源参数,还允许在着色器代码中提供32位立即数标量或32位立即数4维向量。每条指令最多只能使用立即数值(最多 4 个组件)指定一个源操作数。寄存器索引中使用的立即数标量值只能在指令中每个索引操作数使用一次,但这些立即数值不计入原始源操作数的一个立即值的限制。例如:“add r0, v[1 + r0.x], float4(1.0f,2.0f,3.0f,4.0f)”是有效的,因为只有一个立即数的源操作数(float4),v[] 索引中的值 1 不计入限制。

If a source operand is a Constant Buffer reference (see Constant Buffers below), the reference to a Constant Buffer DOES count against the same limit as immediate values. This allows implementations to provide immediate values through the same hardware path as Constant Buffers if desired. For example, "add r0, cb0[r1.x], float4(1.0f,2.0f,3.0f,4.0f)" is invalid, since the same instruction uses both an immediate value and a Constant Buffer read.

如果源操作数是常量缓冲区引用(请参阅下面的常量缓冲区),则对常量缓冲区的引用与立即数值的限制相同。这允许实现在需要时通过与常量缓冲区相同的硬件路径提供立即数值。例如:“add r0,cb0[r1.x],float4(1.0f,2.0f,3.0f,4.0f)”是无效的,因为在同一条指令中同时使用了立即数值和常量缓冲区读取。

There is no limit on the total number of times immediate constants can appear in a single instruction or in a shader.

立即数常量在单个指令或一个着色器中出现的总次数没有限制。

7.5 Constant Buffers

7.5 常量缓冲

There are 15 slots for ConstantBuffers that can be active per Pipeline stage. Indexing across ConstantBuffers is not permitted. A given ConstantBuffer is accessed as an operand to any Shader operation as if it is an indexable read-only register in the Shader. Unlike other Buffer binding locations in the pipeline, Constant Buffers allow neither Buffer offsets nor custom strides. The stride of the Buffer is assumed to be the Element width of R32G32B32A32_TYPELESS; and the first Element in the Buffer (at Buffer offset zero) is assumed to be constant #[ 0 ], when referenced from the Shader.

每个管线阶段最多可以有15个活动的常量缓冲区槽位。不允许跨常量缓冲区建立索引。在着色器中,一个给定的常量缓冲区被视为可索引的只读寄存器,并作为一些着色器操作的操作数来访问。与管道中的其他缓冲区绑定位置不同,常量缓冲区不允许缓冲区偏移或自定义步幅。缓冲区的步幅假定为 R32G32B32A32_TYPELESS 的元素宽度;当从着色器引用时,缓冲区中的第一个元素(缓冲区偏移量为零)被假定为常量 #[ 0 ]。

cbuffer CData : register(b0) { float4 a; };   // a is read as cb0[0]

cbuffer CData : register(b1) { float4 b; };   // b is read as cb1[0]

Bit-field operations are not supported.

In Shader code, just as a t# register is a placeholder for a Texture, a cb# register is a placeholder for a ConstantBuffer at "slot" #. A ConstantBuffer is accessed in a Shader using: cb#[index] as an operand to Shader instructions, where 'index' can be either an r# or statically indexed x# containing a 32-bit unsigned integer, an immediate 32-bit unsigned integer constant, or a combination of the two, added together. e.g. "mov r0, cb3[x3[0].x+6]" represents moving Element 7 from the ConstantBuffer assigned to slot 3 into r0, assuming x3[0].x contains 1.

在着色器代码中,就像 t# 寄存器是纹理的占位符一样,cb# 寄存器是“槽”#中常量缓冲区(ConstantBuffer)的占位符。在着色器中使用以下命令访问常量缓冲区:cb#[index] 作为着色器指令的操作数,其中“index”可以是 r# 或静态索引的 x#,其中包含一个 32 位无符号整数、一个直接的 32 位无符号整数常量或两者的组合,相加在一起。例如,假设 x3[0].x 包含 1,“mov r0, cb3[x3[0].x+6]” 表示将元素 7 从分配给插槽 3 的 常量缓冲区移动到 r0。

There is no limit on the total number of constant buffer reads (from any buffer and any location in the buffer) that can appear in a single instruction or in a shader.

常量缓冲区读取(从缓冲区中的任何缓冲区和缓冲区中的位置)的总次数没有限制,这些读取可以出现在单个指令或着色器中。

The declaration of a ConstantBuffer (cb# register) in a Shader includes the following information:

着色器中常量缓冲区 (cb# 寄存器) 的声明包括以下信息:

1. The size of the ConstantBuffer can be declared (a special flag will allow for unknown-length).

1.可以声明常量缓冲区的大小(特殊标志将允许未知长度)。

2. The Shader must indicate whether the ConstantBuffer will be accessed via Shader-computed offset values or only by literal offsets.

2.着色器必须指示常量缓冲区是通过着色器计算的偏移值访问的,还是仅通过文本偏移量访问的。

3. The order that the declaration of a cb# appears in a Shader, relative to other cb# declarations, defines the priority of that ConstantBuffer, starting at highest priority.

3.相对于其他 cb# 声明,cb# 的声明在着色器中出现的顺序定义了该常量缓冲区的优先级,开始在最高优先级。

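When several constant buffers are declared, the one declared first has the highest priority. For example, in HLSL:

cbuffer cbPerFrame
{
    DirectionalLight gDirLight;
    PointLight gPointLight;
    SpotLight gSpotLight;
    float3 gEyePosW;
};

cbuffer cbPerObject
{
    float4x4 gWorld;
    float4x4 gWorldInvTranspose;
    float4x4 gWorldViewProj;
    Material gMaterial;
};

On the C++ side, the matrix variables are looked up by name through the effect interface (mFX):

mfxWorldViewProj = mFX->GetVariableByName("gWorldViewProj")->AsMatrix();
mfxWorld         = mFX->GetVariableByName("gWorld")->AsMatrix();

Because cbPerFrame is declared first, it has the higher priority of the two. The priority is only a hint that helps the implementation decide which constant access path to use for each buffer; it does not change the order in which the shader reads its data.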

Out of bounds access to ConstantBuffers returns 0 in all components. Out of bounds behavior is always with respect to the size of the buffer bound at that slot.

对常量缓冲区的越界访问在所有组件中返回 0。越界行为始终与该槽绑定的缓冲区的大小有关。
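The indexing and out of bounds rules can be modeled on the host. The sketch below is not a D3D API: `Element` and `ReadConstant` are hypothetical names, an empty vector stands in for a slot with no buffer bound, and computing the index in 64 bits (so the sum cannot wrap around) is an assumption of this model rather than something the text specifies.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Element = std::array<std::uint32_t, 4>;  // one untyped 32-bit x 4 element

// Models cb#[reg + imm] against the buffer bound at that slot.
// Out of bounds reads, relative to the bound buffer, return 0 in all components.
inline Element ReadConstant(const std::vector<Element>& bound,
                            std::uint32_t reg, std::uint32_t imm) {
    const std::uint64_t index = std::uint64_t{reg} + imm;  // model assumption: no wraparound
    if (index >= bound.size()) return Element{0, 0, 0, 0};
    return bound[static_cast<std::size_t>(index)];
}

// Mirrors "mov r0, cb3[x3[0].x+6]" with x3[0].x == 1: element 7 is read.
inline void ExampleReadConstant() {
    std::vector<Element> cb3(8);
    cb3[7] = Element{1, 2, 3, 4};
    const Element inBounds = ReadConstant(cb3, 1, 6);
    const Element outOfBounds = ReadConstant(cb3, 2, 6);  // index 8 of an 8-element buffer
    assert(inBounds[3] == 4);
    assert(outOfBounds[0] == 0 && outOfBounds[3] == 0);
}
```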

If the constant buffer bound to a slot is larger than the size declared in the shader for that slot, implementations are allowed to return incorrect data (not necessarily 0) for indices that are larger than the declared size but smaller than the buffer size.

如果绑定到槽的常量缓冲区大于着色器中为该槽声明的大小,那么对于大于声明大小但小于缓冲区大小的索引,实现被允许返回不正确的数据(不一定是0)。

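// Declare a constant buffer of size 2 (two float4 elements) in the shader
cbuffer ConstantBuffer : register(b0)
{
    float4 LightDirection;   // 4-component direction vector
    float4 LightColor;       // 4-component color vector
};

// Texture and sampler read by the pixel shader (illustrative names)
Texture2D    gTexture : register(t0);
SamplerState gSampler : register(s0);

// Pixel Shader
float4 PS_Main(float2 TexCoords : TEXCOORD0) : SV_Target
{
    // Use data from the constant buffer to tint the texel sampled at LOD 0
    float4 color = gTexture.SampleLevel(gSampler, TexCoords, 0) * LightColor;
    return color;
}

If a constant buffer larger than 2 elements is bound to slot cb0 at run time, then shader accesses beyond the declared size (that is, to data after LightColor) may return undefined data.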

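ID3D11Buffer* pConstantBuffer;   // your constant buffer
ID3D11DeviceContext* pContext;   // your device context

// Bind one constant buffer to slot cb0 of the pixel shader stage
pContext->PSSetConstantBuffers(0, 1, &pConstantBuffer);

In this example, pConstantBuffer is your constant buffer and pContext is your device context. The first argument of PSSetConstantBuffers is the first slot to bind (0 here, corresponding to cb0), the second is the number of buffers to bind (1 here, because a single buffer pointer is passed), and the third is a pointer to an array of constant buffer pointers.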

Fetching from a ConstantBuffer slot with no Buffer present always returns 0 in all components for all indices.

从不存在 Buffer 的常量缓冲区槽中获取时,所有索引的所有组件中始终返回 0。

With this set of information, different hardware implementations sporting varying degrees of optimization for ConstantBuffer access may make informed decisions about how to compile access to the ConstantBuffer into Shader code. Compiled shaders must never have to recompile just because different ConstantBuffers get bound to the Shader, as the necessary characteristics have been statically declared. Runtime validation (at least in debug) will ensure that the Shader code and the sizes of bound ConstantBuffers satisfy the declarations.

有了这些信息,不同的硬件实现可以根据对常量缓冲区访问的优化程度做出明智的决策,如何将对常量缓冲区的访问编译到Shader代码中。编译后的着色器绝不应该因为不同的常量缓冲区被绑定到Shader而需要重新编译,因为必要的特性已经被静态声明了。运行时验证(至少在调试中)将确保着色器代码和绑定的常量缓冲区的大小满足声明。

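// C++ side: update gEyePosW in cbPerFrame through the effect variable
XMFLOAT3 mEyePosW;
mfxEyePosW = mFX->GetVariableByName("gEyePosW")->AsVector();
mfxEyePosW->SetRawValue(&mEyePosW, 0, sizeof(mEyePosW));

// HLSL side: the constant buffer that declares gEyePosW
cbuffer cbPerFrame
{
    DirectionalLight gDirLight;
    PointLight gPointLight;
    SpotLight gSpotLight;
    float3 gEyePosW;
};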

The priorities assigned to ConstantBuffers assist hardware in best utilizing any dedicated constant data access paths/mechanisms, if present. There is no guarantee, however, that accesses to ConstantBuffers with higher priority will always be faster than lower priority ConstantBuffers. It is possible that a higher priority ConstantBuffer could produce slower performance than a lower priority ConstantBuffer, depending on the declared characteristics of the buffers involved. For example, an implementation may have some arbitrary sized fast constant RAM not large enough for a couple of high priority ConstantBuffers that a Shader has declared, but large enough to fit a declared low priority ConstantBuffer. Such an implementation may have no choice but to use the standard (assumed slow) texture load path for large high priority ConstantBuffers (perhaps tweaking the cache behavior at least), while placing the lowest priority ConstantBuffer into the (assumed fast) constant RAM.

分配给常量缓冲区的优先级可帮助硬件最好地利用任何专用的常量数据访问路径/机制(如果存在)。但是,不能保证对具有较高优先级的常量缓冲区的访问始终比对较低优先级的常量缓冲区的访问更快。优先级较高的常量缓冲区可能比优先级较低的常量缓冲区产生更慢的性能,具体取决于所涉及的缓冲区的声明特征。例如,一个实现可能具有一些任意大小的快速常量 RAM,这些 RAM 不足以容纳 Shader 声明的几个高优先级常量缓冲区,但足够大以容纳声明的低优先级常量缓冲区。这样的实现可能别无选择,只能对大型高优先级常量缓冲区使用标准(假定慢速)纹理加载路径(至少可以调整缓存行为),同时将最低优先级的常量缓冲区放入(假定的快速)常量 RAM 中。

Applications are able to write Shader code that reads constants in whatever pattern and quantity desired, while still allowing different hardware to easily achieve the best performance possible.

应用程序能够编写读取所需模式和数量的常量的着色器代码,同时仍然允许不同的硬件轻松实现最佳性能。

7.5.1 Immediate Constant Buffer

7.5.1 立即数常量缓冲区

In addition to the aforementioned 15 slots for Constant Buffers, every shader program can declare(22.3.4) a single Immediate Constant Buffer with up to 4096 4-vector values. The data is tied to the shader program permanently, but is otherwise accessed by the shader in exactly the same way as Constant Buffers.

除了前面提到的 15 个常量缓冲区插槽外,每个着色器程序还可以声明 (22.3.4) 一个具有最多 4096 个4-vector的立即数常量缓冲区。这些数据永久地与Shader程序绑定,但在其他方面(被Shader访问的方式)与常量缓冲区的行为完全相同。

There is no limit on the total number of times immediate constant buffer reads (from any location in the buffer) can appear in a single instruction or in a shader.

立即数常量缓冲区读取(从缓冲区的任何位置)可以在单个指令或着色器中出现的总次数没有限制。

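// Immediate constant buffer (a sketch). HLSL has no cbuffer syntax for it; a
// static const array that is indexed at run time is typically emitted by the
// compiler as dcl_immediateConstantBuffer and read through icb[...].
static const float4 PreDefinedColors[2] =
{
    float4(1.0, 0.0, 0.0, 1.0),   // predefined color 0
    float4(0.0, 0.0, 1.0, 1.0)    // predefined color 1
};

float4 PickColor(uint i) { return PreDefinedColors[i & 1]; }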

7.6 Shader Output Type Interpretation

7.6 着色器输出类型解释

D3D11_INPUT_ELEMENT_DESC vertexDesc[] =

{

{"POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0},

{"COLOR",    0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, 12, D3D11_INPUT_PER_VERTEX_DATA, 0}

};

// Create the input layout

D3DX11_PASS_DESC passDesc;

mTech->GetPassByIndex(0)->GetDesc(&passDesc);

HR(md3dDevice->CreateInputLayout(vertexDesc, 2, passDesc.pIAInputSignature, passDesc.IAInputSignatureSize, &mInputLayout));

Shader:

struct VertexOut

{

float4 PosH  : SV_POSITION;

    float4 Color : COLOR;

};

float4 PS(VertexOut pin) : SV_Target

{

    return pin.Color;

}

The application is given control over the data type interpretation for Shader outputs (i.e. writing raw integer values vs. writing normalized float values) by simply choosing an appropriate format to interpret the output resource's contents as. See the Formats(19.1) section for detail.

应用程序只需选择适当的格式来解释输出资源的内容,即可控制着色器输出的数据类型解释(即写入原始整数值与写入规范化浮点值)。有关详细信息,请参阅“格式 (19.1) ”部分。

Format list

7.7 Shader Input/Output

7.7 着色器输入/输出

Details on Shader input/output registers (indeed all registers) are provided in the sections dedicated to each Shader unit elsewhere in the spec.

着色器输入/输出寄存器(实际上所有寄存器)的详细信息在规范中其他地方的每个着色器单元的章节中提供。

struct VertexIn

{

float3 PosL  : POSITION;

    float4 Color : COLOR;

};

struct VertexOut

{

float4 PosH  : SV_POSITION;

    float4 Color : COLOR;

};

VertexOut VS(VertexIn vin)

{

VertexOut vout;

// Transform to homogeneous clip space.

vout.PosH = mul(float4(vin.PosL, 1.0f), gWorldViewProj);

// Just pass vertex color into the pixel shader.

    vout.Color = vin.Color;

    

    return vout;

}

float4 PS(VertexOut pin) : SV_Target

{

    return pin.Color;

}

One thing in common about input/output registers for all shaders is that if they are declared(22.3.30) to be dynamically indexable from the shader, and the shader indexes them out of the declared range, results are undefined, although data from outside the GPU process context is never visible.

所有着色器的输入/输出寄存器的一个共同点是,如果它们被声明(22.3.30)为从着色器动态索引,并且着色器将它们索引到声明的范围之外,则结果是未定义的,尽管从GPU进程上下文之外永远不会看到任何数据。

Instruction:    dcl_indexRange minReg, maxReg

Stage(s):       All(22.1.1)

Description:    Declare a range of input or output registers that are to be indexed in the Shader code. The range is specified by indicating the minimum register and maximum register (minReg and maxReg).

声明一个在Shader代码中索引的输入或输出寄存器的范围。该范围通过指示最小寄存器和最大寄存器(minReg和maxReg)来指定。

dcl_indexRange v1, v3

dcl_indexRange v4, v9

dcl_indexRange o0, o4 // this line can't be used in PS

7.8 Integer Instructions

7.8 整数指令

7.8.1 Overview

7.8.1 概述

There is a collection of instructions available to Shaders which are dedicated to performing integer arithmetic and bitwise operations. Operands and output registers for integer instructions can be any of the register classes available to the floating point instructions. There is no data type associated with registers; Shader instructions determine how the data stored in registers is interpreted. Integer instructions simply assume that the data being read from operands and written to the destination are all 32-bit values (unsigned or signed 2's complement, depending on the instruction).

着色器(Shaders)中有一组指令专门用于执行整数算术和位运算。整数指令的操作数和输出寄存器可以是浮点指令可用的任何寄存器类别。寄存器没有关联的数据类型;Shader指令决定了寄存器中存储的数据如何被解释。整数指令简单地假设从操作数读取的数据和写入目的地的数据都是32位值(无符号或有符号的2的补码,具体取决于指令)。

iadd dest[.mask], [-]src0[.swizzle], [-]src1[.swizzle]

add[_sat] dest[.mask], [-]src0[_abs][.swizzle], [-]src1[_abs][.swizzle]

7.8.2 Implementation Notes

7.8.2 实施说明

Shader register storage is made up of 32-bit*4-component quantities, and integer arithmetic on these registers is required to be performed at full 32 bit in all cases.

着色器寄存器存储由 32 位 * 4 分量组成,在所有情况下,这些寄存器上的整数运算都需要以完整的 32 位执行。

7.8.3 Bitwise Operations

7.8.3 按位运算

The bitwise instructions are listed in the Bitwise Instructions(22.11) sub-section of the full instruction listing.

按位指令列在完整指令列表的“按位指令 (22.11) ”子部分中。

22.11 Bitwise Instructions

Section Contents

22.11.1 and

22.11.2 bfi

22.11.3 bfrev

22.11.4 countbits

22.11.5 firstbit

22.11.6 ibfe

22.11.7 ishl

22.11.8 ishr

22.11.9 not

22.11.10 or

22.11.11 ubfe

22.11.12 ushr

22.11.13 xor

7.8.4 Integer Arithmetic Operations

7.8.4 整数算术运算

See the Integer Arithmetic Instructions(22.12) sub-section of the full instruction listing.

请参阅完整说明列表的整数算术说明 (22.12) 子部分。

22.12 Integer Arithmetic Instructions

Section Contents

22.12.1 iadd

22.12.2 iaddcb

22.12.3 imad

22.12.4 imax

22.12.5 imin

22.12.6 imul

22.12.7 ineg

22.12.8 uaddc

22.12.9 udiv

22.12.10 umad

22.12.11 umax

22.12.12 umin

22.12.13 umul

22.12.14 usubb

22.12.15 msad

7.8.5 Integer/Float Conversion Operations

7.8.5 整数/浮点转换操作

There is no implicit conversion between floating-point and integer values. Contents of registers are interpreted as floats or ints by the particular instruction being executed. Explicit conversions are performed by dedicated instructions (such as ftoi, ftou, itof and utof), listed in the Type Conversion Instructions(22.13) sub-section of the full instruction listing.

浮点值和整数值之间没有隐式转换。寄存器的内容被正在执行的特定指令解释为浮点数或整数。存在两个允许执行显式转换的指令,列在完整指令列表的“类型转换指令 (22.13) ”子部分中。

22.13 Type Conversion Instructions

Section Contents
22.13.1 f16tof32
22.13.2 f32tof16
22.13.3 ftoi
22.13.4 ftou
22.13.5 itof
22.13.6 utof

7.8.6 Integer Addressing of Register Banks

7.8.6 寄存器库的整数寻址

Integer offsets for reads from register banks are available. These offsets must be scalar values (i.e. a select swizzle must be used to select one component of any vector-valued register used as an index) and are considered to be unsigned 32 bit values.

提供从寄存器组读取的整数偏移。这些偏移量必须是标量值(即,必须使用选择分量来选择用作索引的任何向量值寄存器的一个分量),并被视为无符号 32 位值。

This indexing mechanism applied to indexable x# registers allows compilers to generate stack-like behavior for Shader subroutines.

这种应用于可索引 x# 寄存器的索引机制允许编译器为 Shader 子例程生成类似堆栈的行为。

An example syntax for indexing is:

索引的示例语法如下:

mov r1, cb7[3+r2.x]

This instruction assumes that an unsigned 32-bit integer value exists in r2.x, and uses that value to offset into ConstantBuffer 7, starting from location 3 in the ConstantBuffer. Thus, if r2.x contains integer value 2, entry 5 of ConstantBuffer 7 would be referenced.

这条指令假设在r2.x中存在一个无符号的32位整数值,并使用该值偏移到ConstantBuffer 7中,从ConstantBuffer的位置3开始。因此,如果r2.x包含整数值2,那么将会引用ConstantBuffer 7的第5个条目。

7.9 Floating Point Instructions

7.9 浮点指令

Floating point instructions must follow the D3D11.3 Floating Point Rules(3.1).

浮点指令必须遵循 D3D11.3 浮点规则 (3.1) 。

A listing of all floating point instructions can be found here(22.10).

可以在此处 (22.10) 找到所有浮点指令的列表。

22.10 Floating Point Arithmetic Instructions

Section Contents

22.10.1 add

22.10.2 div

22.10.3 dp2

22.10.4 dp3

22.10.5 dp4

22.10.6 exp

22.10.7 frc

22.10.8 log

22.10.9 mad

22.10.10 max

22.10.11 min

22.10.12 mul

22.10.13 nop

22.10.14 round_ne

22.10.15 round_ni

22.10.16 round_pi

22.10.17 round_z

22.10.18 rcp

22.10.19 rsq

22.10.20 sincos

22.10.21 sqrt

7.9.1 Float Rounding

7.9.1 浮点四舍五入

Instructions are provided for rounding floating point values to integral floating point values:

提供了将浮点值舍入为整数浮点值的说明:

round_ne(22.10.14) (nearest-even)

round_ne (22.10.14) (最接近偶数)

1.5 rounds to 2 (even)

2.5 rounds to 2 (even)

round_ni(22.10.15) (negative-infinity)

round_ni (22.10.15) (负无穷大)

2.5 rounds to 2

3.5 rounds to 3

round_pi(22.10.16) (positive-infinity)

round_pi (22.10.16) (正无穷大)

2.5 rounds to 3

3.5 rounds to 4

round_z(22.10.17) (towards zero)

round_z (22.10.17) (趋向于零)

2.5 rounds to 2

-2.5 rounds to -2
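The four rounding modes have close host-side counterparts in the C++ standard library, which makes the examples above easy to check. The function names below are hypothetical wrappers, and `RoundNe` assumes the default floating-point rounding mode (round to nearest, ties to even) is in effect.

```cpp
#include <cassert>
#include <cmath>

// Host-side stand-ins for the four rounding instructions (a sketch).
inline float RoundNe(float x) { return std::nearbyint(x); }  // nearest, ties to even
inline float RoundNi(float x) { return std::floor(x); }      // toward negative infinity
inline float RoundPi(float x) { return std::ceil(x); }       // toward positive infinity
inline float RoundZ(float x)  { return std::trunc(x); }      // toward zero

inline void ExampleRounding() {
    assert(RoundNe(1.5f) == 2.0f && RoundNe(2.5f) == 2.0f);
    assert(RoundNi(2.5f) == 2.0f && RoundNi(3.5f) == 3.0f);
    assert(RoundPi(2.5f) == 3.0f && RoundPi(3.5f) == 4.0f);
    assert(RoundZ(2.5f) == 2.0f && RoundZ(-2.5f) == -2.0f);
}
```

The modes differ only on non-integral inputs, and round_ni and round_z differ only for negative values: -2.5 goes to -3 under round_ni but to -2 under round_z.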

7.10 Vector Vs Scalar Instruction Set

7.10 向量与标量指令集

The D3D intermediate language (IL) and register model are 4-vec oriented. Since this does not constrain hardware implementation (vector vs scalar) too much, this convention will carry forward until a good reason to switch paradigms surfaces. It is known that many implementations actually happen to operate on scalars or combinations of layouts even now.

D3D中间语言(IL)和寄存器模型是面向4-vec的。由于这不会过多地约束硬件实现(向量vs标量),因此这种惯例将一直延续下去,直到有充分的理由切换范式。众所周知,即使现在,许多实现实际上都是在标量或布局组合上操作的。

One area where the vector assumption seems to materially impact data organization is the indexing of registers such as inputs or outputs – the indexing happens across registers. If it is important to be able to express cleanly how to index through an array of scalars, this could be an example of an argument for switching the IL to be completely scalar.

向量假设在数据组织上产生实质性影响的一个领域是寄存器(如输入或输出)的索引 - 索引是跨寄存器进行的。如果能够清晰地表达如何通过标量数组进行索引非常重要,那么这可能是一个将中间语言(IL)完全切换为标量的论据示例。

7.11 Uniform Indexing Of Resources And Samplers

7.11 资源和采样器的统一索引

7.11.1 Overview

7.11.1 概述

Shaders have bindpoint arrays for various classes of read-only input resources: Constant Buffers (cb), Texture/Buffers (t), Samplers (s).

Shader具有用于各种类别的只读输入资源的绑定点数组:常量缓冲区(cb)、纹理/缓冲区(t)、采样器(s)。

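Constant Buffers (cb): typically hold data that does not change while rendering, such as light positions and camera parameters.

Texture/Buffers (t): hold image data that shaders read to produce complex visual effects.

Samplers (s): define how data is sampled from a texture, including filtering and addressing (wrap) modes.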

D3D11 allows all of these to be dynamically but uniformly indexed from a shader, whereas previously none of them were indexable.

D3D11允许所有这些资源从Shader动态但统一地索引,而以前它们都是不可索引的。

Suppose a shader needs a set of ten constants. Without array indexing, each value needs its own shader variable, as follows:

cbuffer ConstantBuffer : register(b0)
{
    float4 value0;
    float4 value1;
    // ...
    float4 value9;
}

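The shader then has to select the right variable explicitly, which leads to a lot of repeated code and makes it hard to change the accessed variable at run time.

With an array, a single index selects the value dynamically, as follows. Note that this example indexes within one constant buffer, which was already supported before D3D11; the D3D11 addition is uniform indexing of which cb# slot to read from, as in section 7.11.3.

cbuffer ConstantBuffer : register(b0)
{
    float4 values[10];
}

float4 main(uint valueIndex : VALUEINDEX) : SV_Target
{
    // Dynamically index the array inside the constant buffer
    return values[valueIndex];
}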

As with indexing of other types, such as indexable temps (x#), the dynamic index can be either an r# or statically indexed x# containing a 32-bit unsigned integer, an immediate 32-bit unsigned integer constant, or the combination of the two, added together.

与其他类型的索引(如可索引临时索引 (x#))一样,动态索引可以是 r# 或静态索引 x#,其中包含一个 32 位无符号整数、一个即时的 32 位无符号整数常量或两者的组合,相加在一起。

The constraint on the indexing of resources or samplers is that the index must be uniform. That is, the computed index must be the same at that point in the lockstep execution of the program for all invocations of the shader within the Draw*() call.

对资源或采样器进行索引的约束是索引必须是uniform。也就是说,在程序的锁步执行中,对于 Draw*() 调用中着色器的所有调用,计算索引必须相同。

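In a shader, a uniform index (uniformIndex here) can be used to dynamically index an array of textures, as follows:

Texture2D textures[10] : register(t0);
SamplerState sam : register(s0);

float4 main(uniform int uniformIndex : UNIFORMINDEX) : SV_Target
{
    // Dynamically but uniformly index the texture array
    return textures[uniformIndex].Sample(sam, float2(0.5, 0.5));
}

Note: the statement that the index used for resources or samplers must be uniform is questionable as written. The following variant takes the index as an ordinary shader input instead of a uniform parameter:

Texture2D textures[10] : register(t0);
SamplerState sam : register(s0);

float4 main(int uniformIndex : UNIFORMIN) : SV_Target
{
    // Dynamically index the texture array
    return textures[uniformIndex].Sample(sam, float2(0.5, 0.5));
}

Both versions compile with Shader Model 5.1 and later; with earlier shader models the compiler reports the error: sampler array index must be a literal expression.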

If due to flow control, some of the lockstep shader invocations are inactive, the computed index in those shaders is ignored and therefore cannot cause a violation of the uniform indexing constraint on all the active invocations.

如果由于流控制,一些同步着色器调用处于非活动状态,那么这些着色器中计算的索引将被忽略,因此不能导致所有活动调用的uniform索引约束被违反。

The HLSL compiler will enforce this behavior and driver compilers must not break it either. Violations of the uniform indexing constraint would be a result of an HLSL compiler bug or a driver compiler bug only, and in such cases the indexing results are undefined.

HLSL编译器将强制执行这种行为,驱动程序编译器也不得破坏它。违反uniform索引约束只能是HLSL编译器错误或驱动程序编译器错误的结果,在这种情况下,索引结果是未定义的。
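The uniformity constraint, including the exemption for inactive invocations, can be stated as a small predicate. This is a host-side sketch with hypothetical names (`Invocation`, `IndexIsUniform`); it models a group of lockstep invocations at one point in the program and is not how a compiler or driver actually enforces the rule.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Invocation {
    bool active;          // false if masked off by flow control at this point
    std::uint32_t index;  // resource or sampler index this invocation computed
};

// True when every active invocation computed the same index.
// Indices computed by inactive invocations are ignored, as described in the text.
inline bool IndexIsUniform(const std::vector<Invocation>& lanes) {
    bool seen = false;
    std::uint32_t first = 0;
    for (const Invocation& lane : lanes) {
        if (!lane.active) continue;
        if (!seen) {
            first = lane.index;
            seen = true;
        } else if (lane.index != first) {
            return false;
        }
    }
    return true;
}

inline void ExampleUniformIndex() {
    // The inactive lane computed 9, but that value does not matter.
    assert(IndexIsUniform({{true, 3}, {false, 9}, {true, 3}}));
    // Two active lanes disagree: the constraint is violated.
    assert(!IndexIsUniform({{true, 3}, {true, 4}}));
}
```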

7.11.2 Index Range

7.11.2 索引范围

Out of bounds resource indexing produces the same result as if accessing a slot with no resource bound.

越界资源索引产生的结果与访问未绑定资源的插槽的结果相同。

In particular note that with Constant Buffers, there are 14 API-visible Constant Buffer slots (a couple of other slots are reserved for various purposes). The valid indexing range for Constant Buffers is therefore [0..13], and accesses out of that range behave as if accessing a slot with no Constant Buffer bound.

特别注意,对于常量缓冲区,有14个API可见的常量缓冲区插槽(其他几个插槽保留用于各种目的)。因此,常量缓冲区的有效索引范围是[0…13],超出该范围的访问行为就像访问没有绑定常量缓冲区的插槽一样。

Out of bounds indexing of the Samplers (s#) results in undefined behavior.

采样器 (s#) 的越界索引会导致未定义的行为。

7.11.3 Constant Buffer Indexing Example

7.11.3 常量缓冲区索引示例

Suppose x3[0].x contains 4 and x4[2].y contains 5. The following mov instruction:

假设 x3[0].x 包含 4,x4[2].y 包含 5。以下 mov 指令:

mov r0, cb[x3[0].x+6][x4[2].y+9]

is therefore equivalent to:

因此等同于:

mov r0, cb[10][14]

which means read a 32-bit * 4-vector from location [14] in the ConstantBuffer, at ConstantBuffer bind point [10] (0-based counting).

这意味着从常量缓冲区的位置[14]读取一个32位*4-vector,在常量缓冲区绑定点[10](从0开始计数)。

The uniform dynamic indexing of which Constant Buffer to read from is what was not supported previously. Dynamic indexing within the Constant Buffer itself has always been supported.

从哪个常量缓冲区读取的uniform动态索引以前是不支持的。常量缓冲区本身的动态索引一直都是支持的。

7.11.4 Resource/Buffer Indexing Example

7.11.4 资源/缓冲区索引示例

Suppose x3[0].x contains 4. The following ld instruction:

假设 x3[0].x 包含 4。以下 ld 指令:

ld r0, r1, t[x3[0].x+6], texture2D

is equivalent to:

相当于:

ld r0, r1, t[10], texture2D

Note the "texture2D" at the end is also a new requirement, whereby all ld/sample instructions will indicate which Shader Resource View type is to be sampled.

请注意,末尾的“texture2D”也是一个新要求,所有 ld/sample 指令都将指示要采样的着色器资源视图类型。

7.11.5 Sampler Indexing Example

7.11.5 采样器索引示例

Suppose x3[0].x contains 4 and x4[2].y contains 5. The following sample instruction:

假设 x3[0].x 包含 4,x4[2].y 包含 5。以下 sample 指令:

sample r0, r1, t[x3[0].x+6], s[x4[2].y+9], textureCubeArray

is equivalent to:

相当于:

sample r0, r1, t[10], s[14], textureCubeArray

7.11.6 Resource Indexing Declarations

7.11.6 资源索引声明

Shader declarations from Shader Model 4.x for individual resources, constant buffers and samplers remain the same in Shader Model 5.0. These are particularly informative for parts of shader code that reference these objects directly, just as before.

Shader Model 4.x 中针对单个资源、常量缓冲区和采样器的着色器声明在 Shader Model 5.0 中保持不变。与以前一样,这些内容对于直接引用这些对象的着色器代码部分特别有用。

However, all instructions that reference texture objects (t#) now specify the view dimension (e.g. textureCubeArray) as a literal parameter. This is redundant when indexing is not used, since the up-front declaration of each t# has a view dimension, but useful when indexing is used.

但是,所有引用纹理对象 (t#) 的指令现在都指定了视图维度(例如textureCubeArray) 作为文本参数。当不使用索引时,这是多余的,因为每个 t# 的预先声明都有一个视图维度,但在使用索引时很有用。

7.12 Limitations On Flow Control And Subroutine Nesting

7.12 流程控制和子程序嵌套的限制

A flow control block is defined as an if(22.7.1) block, loop(22.7.4) block, or switch(22.7.18) block. Flow control blocks can nest up to 64 deep per subroutine (and main). Behavior of flow control instructions beyond this nesting limit is undefined.

流控制块被定义为if(22.7.1)块、loop(22.7.4)块或switch(22.7.18)块。流控制块可以在每个子程序(以及主程序)中嵌套最多64层。超出这个嵌套限制的流控制指令的行为是未定义的。

In practice (as tested), the compiler reports neither an error nor a warning for this.

Subroutines can nest up to 32 deep. If there are already 32 entries on the return address stack and a "call" is issued, the call is skipped over.

子程序可以嵌套最多32层。如果返回地址栈上已经有32个条目,并且发出了一个“调用”,则会跳过这个调用。
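The call-skipping rule can be illustrated with a small host-side model of the return address stack. The `ReturnStack` type is hypothetical; the only behavior taken from the text is that a call issued with 32 entries already on the stack is skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kMaxCallDepth = 32;  // limit stated in the text

struct ReturnStack {
    std::vector<std::uint32_t> addresses;

    // Returns true if the call is taken, false if it is skipped because
    // 32 return addresses are already on the stack.
    bool Call(std::uint32_t returnAddress) {
        if (addresses.size() >= kMaxCallDepth) return false;
        addresses.push_back(returnAddress);
        return true;
    }

    // Pops one return address; returns false when there is nothing to return to.
    bool Ret() {
        if (addresses.empty()) return false;
        addresses.pop_back();
        return true;
    }
};

inline void ExampleCallDepth() {
    ReturnStack stack;
    for (std::uint32_t i = 0; i < 32; ++i) assert(stack.Call(i));
    assert(!stack.Call(32));  // the 33rd nested call is skipped
}
```

Because a skipped call simply does not execute, a shader that recurses past the limit continues running with the work of the deepest calls missing, which is worth keeping in mind when results look incomplete rather than wrong.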

7.13 Memory Addressing And Alignment Issues

7.13 内存寻址和对齐问题

Terms used below: UAV = Unordered Access View, SRV = Shader Resource View; Thread Group Shared Memory is also covered.

For Typed memory views, the number of components in an address when accessed by a shader instruction is determined by the number of components in the resource dimension. Each address component is an unsigned 32-bit integer element index.

对于类型化内存视图,当Shader指令访问地址时,地址中的组件数量由资源维度中的组件数量决定。每个地址组件都是一个无符号的32位整数元素索引。

For Raw memory views, the address is a single component unsigned 32-bit integer byte offset from the beginning of the view. The addresses must be 32-bit aligned. If an unaligned address is specified for an operation involving a write, the entire contents of the UAV(5.3.9) being written, or all of Thread Group Shared Memory (in the Compute Shader(18)) - whichever is being accessed - becomes undefined. If an unaligned address is specified for an operation involving a read, an undefined result is returned to the shader. It is invalid for implementations to perform the access as if there were no 32-bit alignment constraints.

对于原始内存视图,地址是从视图开始的单个组件无符号32位整数字节偏移。地址必须是32位对齐的。

如果为涉及写入的操作指定了未对齐的地址,那么正在写入的UAV(5.3.9)的全部内容,或者所有的线程组共享内存(在计算着色器(18)中)(无论访问的是哪个)都会变得未定义。如果为涉及读操作的操作指定了未对齐的地址,则返回给着色器的结果是未定义的。实现不得像不存在32位对齐限制那样执行访问。

For Structured memory views, the address is two unsigned 32-bit integer values. The first value is the struct index, and the second value is a byte offset into the struct. The byte offset must be aligned to 32-bits, otherwise the same behavior described for misaligned raw memory access above applies.

对于结构化内存视图,地址是两个无符号的 32 位整数值。第一个值是结构索引,第二个值是结构的字节偏移量。字节偏移量必须 32 位对齐,否则将适用上述未对齐的原始内存访问所描述的相同行为。
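The addressing rules for raw and structured views come down to simple integer arithmetic, sketched below. The function names are hypothetical, and `structStride` (the size in bytes of one struct) is an assumption of this sketch: the text only says the address is a struct index plus a byte offset into the struct.

```cpp
#include <cassert>
#include <cstdint>

// Raw views: the address is a byte offset that must be 32-bit (4-byte) aligned.
constexpr bool IsDwordAligned(std::uint32_t byteOffset) {
    return (byteOffset & 3u) == 0;
}

// Structured views: (struct index, byte offset into the struct).
// structStride is an assumption of this sketch: the byte size of one struct.
constexpr std::uint64_t StructuredByteAddress(std::uint32_t structIndex,
                                              std::uint32_t byteOffset,
                                              std::uint32_t structStride) {
    return std::uint64_t{structIndex} * structStride + byteOffset;
}

static_assert(IsDwordAligned(8), "8 is a multiple of 4");
static_assert(!IsDwordAligned(6), "6 is not 32-bit aligned");
static_assert(StructuredByteAddress(2, 4, 16) == 36, "2 * 16 + 4");
```

A caller would check `IsDwordAligned` on the raw byte offset, or on the offset within the struct, before issuing the access, since a misaligned write can leave the whole UAV or all of Thread Group Shared Memory undefined.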

Each memory access instruction defines its behavior for out of bounds accesses, with distinctions for the memory location being accessed (UAV vs SRV vs Thread Group Shared Memory), and the layout (raw vs structured vs typed). See the documentation of individual instructions for details. The behaviors are similar for similar classes of instructions; e.g. all atomics have the same out of bounds behavior, all immediate atomics (which return a value to a shader) have their own consistent out of bounds access behavior, etc.

每个内存访问指令都定义了其越界访问的行为,并区分了所访问的内存位置(UAV vs SRV vs 线程组共享内存)和布局(原始 vs 结构化 vs 类型化)。有关详细信息,请参阅各个说明的文档。对于类似指令的类别,行为是相似的,例如,所有 Atomics 都具有相同的越界行为,所有立即原子操作(将值返回给着色器)都有自己一致的越界访问行为等。

7.14 Shader Memory Consistency Model

7.14 着色器内存一致性模型

7.14.1 Intro

7.14.1 简介

The memory accesses in the scope of this chapter are accesses to Unordered Access Views(5.3.9) (UAVs, u#), available to the Compute Shader(18) and Pixel Shader(16), and accesses to Thread Group Shared Memory (g#), available to the Compute Shader.

本章范围中包含的内存访问类型包括:

无序访问视图 (5.3.9) (UAV,u#),可用于计算着色器和像素着色器,

以及线程组共享内存(g#),可用于计算着色器。

The D3D11 Shader Memory Consistency Model is weak/relaxed, as generally understood in existing architectures and literature. Loosely, this means the program author and/or compiler are responsible for identifying all memory and thread synchronization points via some appropriately expressive labeling.

D3D11着色器内存一致性模型是弱/松散的,这在现有的架构和文献中通常是这样理解的。简而言之,这意味着程序作者和/或编译器负责通过一些适当的表达方式来标识所有的内存和线程同步点。

This section outlines how this weak/relaxed Memory Consistency Model appears to function from the point of view of D3D software.

本节概述了从 D3D 软件的角度来看,这种弱/松弛内存一致性模型的功能。

7.14.2 Atomicity

7.14.2 原子性

An atomic operation may involve both reading from and then writing to a memory location. Atomic operations apply only to either u# (Unordered Access Views) or g# (Thread Group Shared Memory).

原子操作可能涉及读取内存位置,然后写入内存位置。原子操作仅适用于 u#(无序访问视图)或 g#(线程组共享内存)。

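An atomic operation guarantees that, when several threads run concurrently, its read and write of a given memory location are not interleaved with writes from other threads, which avoids inconsistent data.

Unordered Access Views give shaders random read/write access to a resource, while Thread Group Shared Memory is memory shared by the threads within one thread group. Atomic operations can be applied to both.

Note that atomics often cost parallelism, because operations that target the same address are serialized. Graphics programs should therefore use atomic operations sparingly where performance matters.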

It is guaranteed that when a thread issues an atomic operation on a memory address, no write to the same address from outside the current atomic operation by any thread can occur between the atomic read and write.

可以保证,当线程对内存地址发出原子操作时,在原子读取和写入之间不会发生任何线程从当前原子操作外部写入同一地址的情况。

If multiple atomic operations from different threads target the same address, the operations are serialized in an undefined order.

如果来自不同线程的多个原子操作以同一地址为目标,则这些操作将按未定义的顺序序列化。

Atomic operations do not imply a memory or thread fence. Fence operations (dubbed "sync") are introduced below. If the program author/compiler does not make appropriate use of fences, it is not guaranteed that all threads see the result of any given memory operation at the same time, or in any particular order with respect to updates to other memory addresses.

原子操作并不意味着内存或线程围栏(fence)。下面介绍了围栏(fence)操作(称为“同步”)。如果程序作者/编译器没有适当地使用围栏(fence),则不能保证所有线程都同时看到任何给定内存操作的结果,或者以任何特定顺序看到其他内存地址的更新。

Atomicity is implemented at 32-bit granularity. If a load or store operation spans more than 32 bits, the individual 32-bit operations are atomic, but the operation as a whole is not.

原子性以 32 位粒度实现。如果加载或存储操作跨度超过 32 位,则单个 32 位操作是原子操作,但不是全部操作。

Limitation: Atomic operations on Thread Group Shared Memory are atomic with respect to other atomic operations, as well as operations that only perform reads ("load"s). However, atomic operations on Thread Group Shared Memory are NOT atomic with respect to operations that perform only writes ("store"s) to memory. Mixing atomics and stores on the same Thread Group Shared Memory address without thread synchronization and memory fencing between them produces undefined results at the address involved. This limitation arises because some implementations of loads and stores do not honor the locking semantics used to implement atomics. This has no impact on loads, since they are guaranteed to retrieve a value either before or after an atomic (they will not retrieve partially updated values, given they are all defined at 32-bit quanta). However, store operations could land in the middle of an atomic operation and thus have their effect possibly lost.

限制:线程组共享内存的原子操作相对于其他原子操作以及仅执行读取(“load”)的操作来说是原子的。然而,对线程组共享内存的原子操作相对于仅执行写入(“store”)到内存的操作并不是原子的。在同一线程组共享内存地址上混合原子操作和存储操作,而没有线程同步和内存围栏,会产生未定义的结果。这种限制是因为一些加载和存储的实现并不遵守实现原子操作的锁定语义。事实证明,这对加载没有影响,因为它们保证在原子操作之前或之后检索值(它们不会检索部分更新的值,因为它们都定义为32位量子)。然而,存储操作可能会找到进入原子操作中间的方式,从而可能丧失其效果。

Note that there is no such limitation on atomics to UAV memory; atomic operations on UAV memory are atomic with respect to other atomic operations as well as loads and stores.

请注意,对于 UAV 内存的原子操作,不存在此类限制;UAV 内存上的原子操作既对其他原子操作具有原子性,也对加载和存储操作具有原子性

7.14.3 Sync

7.14.3 同步

A sync(22.17.7) instruction is included in the Shader IL for the Pixel Shader and the Compute Shader.

同步 (22.17.7) 指令包含在像素着色器和计算着色器的着色器 IL 中。

This provides memory fence semantics at various scopes, and optional thread group synchronization semantics (the latter only applies to the Compute Shader). For details, including some discussion of the implications, see the description of the sync(22.17.7) instruction.

这提供了在不同范围内的内存屏障语义,以及可选的线程组同步语义(后者仅适用于计算着色器)。有关详细信息,包括一些关于影响的讨论,请参阅sync(22.17.7)指令的描述。

7.14.4 Global vs Group/Local Coherency on Non-Atomic UAV Reads

7.14.4 全局与组/局部一致性在非原子UAV读取

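Global coherency: under a globally coherent model, all threads see a consistent view of memory operations. When one thread modifies a variable, every other thread can observe the modification. This model can limit parallelism and may reduce performance.

Group/local coherency: under a group/local coherent model, only threads in the same group or local scope are guaranteed to see each other's memory operations. This improves parallelism, because threads need not wait for global agreement, but it can lead to more complex synchronization problems.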

Typical implementations will have a cache hierarchy to improve read access performance on UAV(5.3.9) accesses. A constraint that some implementations have with the first stage in this cache hierarchy is that, in addition to operating at per-thread-group scope only, the cache does not have an efficient way of being synchronized with writes or atomics performed by other thread groups. Such behavior only surfaces as an issue for applications when cross-thread-group communication involving data loads needs to be performed. In this case, the hardware needs to know that it must bypass the first stage of caches on loads, reaching out to a more global memory so that the cross-thread-group communication can function. D3D allows applications to specify this cross-thread-group communication intent as follows.

典型的实现将具有缓存层次结构,以提高UAV(5.3.9)访问的读取性能。这个缓存层次结构的第一阶段在某些实现中存在的一个限制是,除了只在每个线程组范围内操作外,缓存没有有效的方式与其他线程组发生的写入或原子操作进行同步。只有当需要执行涉及数据加载的跨线程组通信时,此类行为才会成为应用程序的问题。在这种情况下,硬件基本上需要知道它必须绕过loads缓存的第一阶段,到达更全局的内存,以便跨线程组通信可以正常工作。D3D 允许应用程序指定此跨线程组通信意图,如下所示。

If a Compute Shader(18) thread in a given thread group needs to perform loads of data that was written by atomics or stores in another thread group, the UAV slot where the data resides must be tagged upon declaration in the shader as "globally coherent", so the implementation can ignore the local cache. Otherwise, this form of cross-thread group data sharing will produce undefined results.

如果给定线程组中的计算着色器(18)线程需要执行由其他线程组中的原子或存储写入的数据的加载,那么数据所在的UAV插槽必须在着色器中声明时标记为“全局一致”,这样实现就可以忽略本地缓存。否则,这种形式的跨线程组数据共享将产生未定义的结果。

Atomic read-modify-write operations do not have this constraint (even though a part of the operation is a read/load), because a byproduct of the hardware honoring atomicity is that the entire system sees the operation, whereas simple loads on some implementations may only go to a local cache that has no knowledge of external updates.

原子读-修改-写操作没有此约束(即使操作的一部分是读/加载),因为硬件尊重原子性的副作用是整个系统都可以看到该操作,而某些实现中,简单加载可能只会去到一个没有外部更新信息的本地缓存。

If a UAV is not declared as "globally coherent", it is only "group coherent", which means loads can only see data written by stores and atomics in other threads in the same thread group. The affected hardware knows it can make use of its thread-group specific caching for loads, since writes to the memory only came from the current thread group. A UAV tagged as "globally coherent" is inherently also "group coherent", although the affected hardware would not use its local cache. As such, the "globally coherent" flag should only be specified when necessary.

如果一个UAV没有被声明为“全局一致”,那么它只是“组一致”,这意味着加载只能看到由同一线程组中的其他线程的存储和原子写入的数据。受影响的硬件知道它可以利用其线程组特定的缓存进行加载,因为写入到内存的只来自当前的线程组。一个被标记为“全局一致”的UAV也显然是“组一致”的,尽管受影响的硬件不会使用其本地缓存。因此,只应在必要时指定“全局一致性”标志。

As a reminder though, to guarantee coherency on UAV accesses on all implementations, not only must shaders make the global vs group scope distinction discussed here upon UAV declaration, but they must also make appropriate use of memory and/or thread barriers ("sync_*" in the IL) as needed within the shader to enforce proper ordering of operations by individual threads as seen by others. In addition, the "sync" operation has options for memory barriers that also distinguish between global vs group scope, but that control is separate from the topic of this section, and may not be exposed until a later time, as discussed in the sync instruction definition.

作为提醒,为了在所有实现上保证UAV访问的一致性,着色器不仅必须在UAV声明时做出这里讨论的全局与组范围的区分,而且还必须在着色器内部根据需要适当地使用内存和/或线程屏障(barriers)(在IL中为"sync_"),以强制执行由其他线程看到的单个线程的操作的正确排序。

此外,“sync”操作还具有内存屏障(barriers)选项,这些选项也区分了全局范围与组范围,但该控制与本节的主题是分开的,并且可能要等到以后才会公开,如同步指令定义中所述。

Instruction:    sync[_uglobal|_ugroup][_g][_t]

Stage(s):       All(22.1.1)

Description:    Thread group sync and/or memory barrier.

Back to the issue of global vs group coherency on non-atomic UAV reads. Importantly, for many scenarios where cross-thread-group communication or reduction (such as histograms) can be accomplished using only atomic operations (no cross-thread-group loads involved), there is no problem, since atomic operations are implemented by all hardware in a globally coherent way, regardless of whether the UAV has been tagged as "globally coherent" or not.

回到非原子UAV读取的全局与组一致性问题。重要的是,对于许多可以仅通过原子操作(不涉及跨线程组加载)实现的跨线程组通信或减少(如直方图)的场景,由于所有硬件都以全局一致的方式实现了原子操作,无论UAV是否被标记为“全局一致”,都不会有问题。
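The histogram case mentioned above can be illustrated with a CPU analogy. The sketch below uses C11 `<stdatomic.h>` rather than shader code, so it only models the idea: every update is an atomic read-modify-write, and no plain load of another writer's data is needed while the histogram is being built. The names `g_bins`, `histogram_add` and `histogram_count` are hypothetical, and the bin array stands in for a UAV of 32-bit counters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NUM_BINS 4u

/* Stand-in for a UAV holding one 32-bit counter per histogram bin.
 * Static storage starts at zero. */
static atomic_uint g_bins[NUM_BINS];

/* Adds one sample to the histogram. atomic_fetch_add is a read-modify-write,
 * so concurrent callers never lose an increment; it returns the value held
 * before the add, much like an "immediate" atomic returns a value. */
static uint32_t histogram_add(uint32_t value)
{
    return (uint32_t)atomic_fetch_add(&g_bins[value % NUM_BINS], 1u);
}

/* Reads back one bin once all updates are known to be complete. */
static uint32_t histogram_count(uint32_t bin)
{
    assert(bin < NUM_BINS);
    return (uint32_t)atomic_load(&g_bins[bin]);
}
```

In a shader, the final read-back would typically happen in a later pass or on the CPU, which is why the accumulation itself does not require the "globally coherent" tag.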

In the Pixel Shader(16), if a UAV is not declared as "globally coherent", it is only "locally coherent". "Local coherency" is the Pixel Shader's equivalent of the Compute Shader's "group coherency", except that its scope is limited to a single Pixel Shader invocation. This indicates that the Pixel Shader is not doing any cross-PS-invocation communication involving simple load operations. Note, however, that in the Pixel Shader, just like in the Compute Shader, atomic read-modify-write operations are always globally coherent. Indeed, it is likely to be rare for a Pixel Shader or perhaps even the Compute Shader to need to declare a UAV as "globally coherent", given that atomic operations, which are always globally coherent, might provide the most practical mechanism for cross-PS-invocation or cross-group operations.

在像素着色器(16)中,如果一个UAV没有被声明为“全局一致”,那么它只是“局部一致”。 “局部一致性”是像素着色器的计算着色器的“组一致性”的等价物,只是范围仅限于单个像素着色器调用。这表示像素着色器未执行任何涉及简单加载操作的跨 PS 调用通信。但请注意,在像素着色器中,就像在计算着色器中一样,原子读-修改-写操作始终是全局一致的。事实上,像素着色器甚至计算着色器可能需要将UAV声明为“全局一致”的情况可能很少见,因为原子操作总是全局一致的,可能提供了跨PS调用或跨组操作最实用的机制。

7.15 Shader-Internal Cycle Counter (Debug Only)

7.15 着色器内部循环计数器(仅限调试)

7.15.1 Basic Semantics

7.15.1 基本语义

To assist comparisons of algorithms running on GPUs during application development, a cycle counter can be read into shaders. The cycle counter is a 64-bit unsigned integer.

为了帮助比较应用程序开发期间在 GPU 上运行的算法,可以将循环计数器读入着色器。循环计数器是一个 64 位无符号整数。

The cycle counter appears as an additional 2*32-bit (64-bit total) input register type that can be declared in any version 5.0+ shader. There are currently no native 64-bit integer arithmetic operations in shaders, although it is simple enough to emulate them. It may be fine for shaders to just look at the low 32 bits of the counter; this can be requested in the shader. Applications may also export the measurements using standard shader outputs for later analysis, such as on the CPU.

循环计数器表现为一个额外的2*32位(总共64位)输入寄存器类型,可以在任何5.0+版本的着色器中声明。目前在着色器中还没有原生的64位整数算术操作,尽管模拟这个操作足够简单。对于着色器来说,只看计数器的低32位可能就足够了 - 这可以在着色器中请求。应用程序也可以使用标准的着色器输出来导出测量结果,以便稍后在CPU上进行分析。

The counter is an implementation-dependent measure of cycles in the GPU engine, requiring care to interpret it usefully.

计数器是 GPU 引擎中周期实现相关度量,需要小心的有效的解释它。

7.15.2 Interpreting Cycle Counts

7.15.2 解释周期计数

For this discussion, consider a shader "invocation" to be a single execution of one shader program from beginning to end. For the Compute Shader however, an "invocation" is a single thread-group’s execution – e.g. the lifespan of the contents of thread-group shared memory.

在这个讨论中,我们将一个着色器的"调用"视为一个着色器程序从开始到结束的单次执行。然而,对于计算着色器来说,一个"调用"是一个单个线程组的执行 - 例如,线程组共享内存内容的生命周期。

The initial value of the counter is undefined.

计数器的初始值未定义。

A single reading of the cycle counter is meaningless. But any shader invocation can poll the counter value any number of times.

单独读取周期计数器的值是没有意义的。但是,任何着色器调用都可以任意次数地轮询计数器的值。

Computing a delta from cycle counter readings within a shader invocation is meaningful.

在着色器调用中计算周期计数器读数的差值是有意义的。

Computing a delta from cycle counter readings across separate shader invocations is not meaningful on all hardware. Developers must obtain information directly from IHVs about whether this is meaningful.

在所有硬件上,跨不同着色器调用计算周期计数器读数的差值可能并没有意义。开发者必须直接从独立硬件供应商(IHVs)获取信息,以确定这是否有意义。

The only IHV agnostic approach to interpreting the counters is to limit calculation of deltas to within a given shader invocation, and only make comparisons of deltas within or between shader invocations.

解释计数器的唯一与 IHV 无关的方法是将增量的计算限制在给定的着色器调用内,并且仅对着色器调用内或着色器调用之间的增量进行比较。

There are plenty of reasons why test runs will execute differently. The obvious one is that execution of a shader can be interrupted by thread switching, so delta measurements will be arbitrarily larger than the number of cycles spent executing instructions in a given thread.

测试运行执行的不同有很多原因。显而易见的一个是,着色器的执行可以被线程切换中断,所以增量测量将会比在给定线程中执行指令所花费的周期数任意地大。

There is no supported way to find out the frequency of the counter. There is no way to correlate this shader internal counter with external timers such as asynchronous time queries.

没有支持的方法来找出计数器的频率。无法将此着色器内部计数器与外部计时器(如异步时间查询)相关联。

The counter measurements cannot be correlated with measurements on different hardware by other hardware vendors or even necessarily the same vendor.

计数器测量值不能与其他硬件供应商(甚至不一定是同一供应商)在不同硬件上的测量值相关联。

If a GPU’s speed changes, such as for power saving, there is no way to know this happened, or its effect on cycle measurements.

如果 GPU 的速度发生变化,例如为了省电,则无法知道发生了什么,也没有办法知道它对周期测量的影响。

Beyond these hints about the care needed to interpret the counter, the onus is on developers to research the properties of new hardware designs that may affect measurements.

除了这些关于解释计数器需要注意的提示之外,开发人员还有责任研究可能影响测量的新硬件设计的特性。

7.15.3 Shader Compiler Constraints

7.15.3 着色器编译器约束

The HLSL shader compiler and driver compilers must treat reads of the cycle counter as barriers. Instructions can’t be moved across a counter read, and counter reads can’t be merged.

HLSL 着色器编译器和驱动程序编译器必须将循环计数器的读取视为屏障。指令不能在计数器读取之间移动,计数器读取也不能合并。

7.15.4 Feature Availability

7.15.4 功能可用性

The runtime enforces that shaders using this feature can only be created on a system with the debug layer enabled. The debug layer is not allowed to be redistributed to end-user machines. The point is that shaders that use this counter are not intended to be shipped.

运行时强制要求,着色器使用这个功能,只有在启用了调试层的系统。调试层不允许分发到最终用户的机器。关键是,使用此计数器的着色器并非用于发布。

7.15.5 Conformance

7.15.5 一致性

This feature will not be tested on hardware by WHQL, except perhaps simply checking that drivers do not crash. Microsoft will test that the HLSL compiler output is correct.

WHQL(Windows Hardware Quality Lab - Windows系统硬件质量实验室) 不会在硬件上测试此功能,除非只是检查驱动程序不会崩溃。Microsoft 将测试 HLSL 编译器输出是否正确。

7.15.6 Shader Bytecode Details

7.15.6 着色器字节码详细信息

A new input register, vCycleCounter(22.3.29), can be declared in any version 5_0 (and beyond) shader:

可以在任何版本 5_0(及更高版本)着色器中声明新的输入寄存器 vCycleCounter (22.3.29) :

dcl_input vCycleCounter.{x|xy}.  

Reading x yields the 32 LSBs of the 64-bit count, and reading y yields the 32 MSBs.

读取 x 得到 64 位计数的 32 LSB,读取 y 得到 32 个 MSB。

This register can only be used as the source to a mov instruction, e.g. mov r0.w, vCycleCounter.x.

此寄存器只能用作 mov 指令的源,例如 mov r0.w, vCycleCounter.x。
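Since shaders have no native 64-bit integer arithmetic, a delta between two counter readings has to be formed from the x (low) and y (high) halves. The following is a minimal sketch of that emulation, written in C so it can be checked on the CPU; the `CycleReading` type and the function names are hypothetical and are not part of the shader bytecode.

```c
#include <stdint.h>

/* One reading of vCycleCounter: x holds the 32 LSBs, y the 32 MSBs. */
typedef struct CycleReading {
    uint32_t x;
    uint32_t y;
} CycleReading;

/* end - start using only 32-bit operations, as a shader without native
 * 64-bit integers would: subtract the low words, then propagate the borrow
 * into the high words. Unsigned wraparound keeps the result correct even
 * when the low word rolled over between the two readings. */
static CycleReading cycle_delta(CycleReading start, CycleReading end)
{
    CycleReading d;
    uint32_t borrow = (end.x < start.x) ? 1u : 0u;
    d.x = end.x - start.x;
    d.y = end.y - start.y - borrow;
    return d;
}

/* CPU-side convenience for analysis after the readings are exported. */
static uint64_t cycle_to_u64(CycleReading r)
{
    return ((uint64_t)r.y << 32) | r.x;
}
```

When only the low 32 bits are read, `end.x - start.x` alone gives a usable delta as long as fewer than 2^32 cycles elapsed between the two readings.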

7.16 Textures And Resource Loading

7.16 纹理和资源加载

Up to 128 Resources (e.g. Buffer, Texture1D/2D/3D/Cube) can be active per Pipeline stage. A Resource binding is a representation of a Resource's base pointer (and other data such as size and pixel layout) and is independent of the samplers.

每个 Pipeline 阶段最多可以激活 128 个资源(例如 Buffer、Texture1D/2D/3D/Cube)。资源绑定是资源基指针(以及其他数据,如大小和像素布局)的表示形式,并且独立于采样器。

A texture out of a set of bound textures cannot be selected via Shader indexing, however Texture1D/2D/3D resources with an Array dimension > 1, or TextureCube (which has an Array dimension of 6), allow indexing along the array axis from within Shader code.

一组绑定纹理中的纹理不能通过Shader索引来选择,但是具有大于1的数组维度的Texture1D/2D/3D资源,或者TextureCube(其数组维度为6),允许从Shader代码内沿数组轴进行索引。

Textures can only have a single Element format. Likewise, Buffers used as input to Shaders can also only have a single Element format, and have an implied data stride equal to the Element size. A single Buffer (or Texture) could be set to multiple input slots simultaneously, with different Element formats and/or offsets; however, because Buffers bound as Shader inputs have their data stride implied by the Element format, it is not possible to describe "Array-of-Structures" style layouts in Buffers bound at Shader input. This is unlike the Input Assembler Stage, where multiple-element Buffers are permitted, and Element offsets and strides can be defined freely.

纹理只能具有一种元素格式。同样,用作着色器输入的缓冲区也只能具有一种元素格式,并且隐含的数据步幅等于 Element 大小。

单个缓冲区(或纹理)可以同时设置为多个输入槽,具有不同的元素格式和/或偏移量,但是由于作为着色器输入的缓冲区具有元素格式隐含的数据步长,因此无法在着色器输入绑定的缓冲区中描述“结构数组”样式布局。这与输入汇编阶段不同,在输入汇编阶段,允许多元素缓冲区,并且可以自由定义元素偏移和步长。

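Suppose we have an RGBA texture and a buffer of floats. They could be declared and used in shader code like this:

// Declare an RGBA texture

Texture2D gTexture : register(t0);

// Declare a buffer of floats (a Buffer is a shader resource, so it binds to a t# register)

Buffer<float> gBuffer : register(t1);

// Declare the sampler that controls how the texture is sampled

SamplerState gSampler : register(s0);

// In shader code, sample a pixel from the texture through the sampler

float4 pixel = gTexture.Sample(gSampler, uv);

// Buffer elements can be accessed by index

float value = gBuffer[index];

In this example, uv is a two-component vector giving the texture coordinate to sample, index is an integer selecting the buffer element to read, and gSampler is a sampler object specifying how the texture is sampled.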

Data from textures is accessed in shaders via the load (ld) and sample instructions. The ld instruction provides a simple read and (optional) float32 conversion of texture data using integral addresses, while the sample instructions use normalized floating point addressing and perform filtering in addition to the format conversion.

在着色器中通过load(ld) 和sample指令访问来自纹理的数据。ld 指令使用整数地址提供纹理数据的简单读取和(可选)float32 转换,而sample指令使用规范化浮点寻址,并在格式转换之外执行过滤。

7.17 Texture Load

7.17 纹理加载

The load operation performs a non-filtered read of resource data. See the ld(22.4.6) instruction definition for details.

加载操作对资源数据执行不筛选读取。有关详细信息,请参阅 ld (22.4.6) 指令定义。

ld[_aoffimmi(u,v,w)][_s]

                    dest[.mask],

                    srcAddress[.swizzle],

                    srcResource[.swizzle]

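ld: the instruction name, short for "load".

[_aoffimmi(u,v,w)]: an optional suffix specifying an address offset. It indicates that the texel address is offset by a set of immediate integer constants given in texel space.

dest[.mask]: the destination register that receives the result of the operation.

srcAddress[.swizzle]: the integer texel address to read from.

srcResource[.swizzle]: a texture register (t#), which must have been declared, identifying the texture or buffer to fetch data from.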

7.17.1 Multisample Resource Load

7.17.1 多重采样资源加载

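A multisample resource load lets a shader read the individual samples stored for each pixel of a multisample texture, instead of a single resolved color.

In HLSL, the Load method reads from a multisample texture. It takes an integer coordinate and a sample index, and returns the value of that sample at that coordinate. For example:

// Declare a 2D multisample texture

Texture2DMS<float4, 128> tex : register(t0);   // or, for an array: Texture2DMSArray<float4, 128> tex : register(t0);

// Use Load to read one sample of one pixel

int2 coord = int2(10, 20);

float4 pixel = tex.Load(coord, sampleIndex);

In this example, coord is a two-component integer vector giving the coordinate of the pixel to access, and sampleIndex is an integer (assumed to be defined elsewhere) selecting which sample to read. pixel then holds the value of that sample.

Note that reading every sample of a pixel costs more than a single load, because more data has to be fetched and processed. It is useful for operations such as custom resolves, which can improve image quality along aliased edges.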

Multisample resources can be set as shader inputs, which allows individual samples to be read by the shader. Support for multisample shader reads has the following restrictions:

多重采样资源可以被设置为着色器输入,从而允许着色器读取单个样本。多重采样着色器读取的支持具有以下限制:

1. Pixel Shader only (not supported for other shader stages)

1. 仅限像素着色器(其他着色器阶段不支持)

2. load instruction only (no use of sample instructions)

2. 仅加载指令(不使用sample指令)

3. Texture2D and Texture2DArray resources only

3. 仅限 Texture2D 和 Texture2DArray 资源

4. number of samples in bound resource must be declared in shader

4. 绑定资源中的采样数必须在着色器中声明

5. sample index for load instruction must be a literal

5. 加载指令的采样索引必须是字面量

See ld(22.4.6) and dcl_resource(22.3.12) definitions for details.

有关详细信息,请参阅 ld (22.4.6) 和 dcl_resource (22.3.12) 定义。

7.18 Texture Sampling

7.18 纹理采样

7.18.1 Overview

7.18.1 概述

This section describes the mechanics of sampling Texture1D/2D/3D/Cube resources using filtering. The simplest form of sampling a texture is point sampling, which is supported for all data formats; more complex filtering operations are only available for some formats, as indicated in the format list in the Formats(19.1) section.

本节介绍使用过滤对 Texture1D/2D/3D/Cube 资源进行采样的机制。对纹理进行采样的最简单形式是点采样,支持所有数据格式,但更复杂的过滤操作仅适用于某些格式,如“格式 (19.1) ”部分的格式列表中所示。

The behaviors described here are obtained via the various sample* instructions, such as sample(22.4.15). See the specs for those instructions for further details that complement this section.

此处描述的行为是通过各种 sample* 指令获得的,例如 sample (22.4.15) 。请参阅这些说明的规格,了解补充本节的更多详细信息。

sample[_aoffimmi(u,v,w)][_cl][_s]

dest[.mask],

srcAddress[.swizzle],

srcResource[.swizzle],

srcSampler

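The syntax above is the sample instruction, available in Shader Model 4 and later, which samples data from the specified texture. Its parts are as follows:

sample: the instruction name.

[_aoffimmi(u,v,w)]: an optional suffix specifying an address offset. It indicates that the texture coordinates are offset by a set of immediate integer constants given in texel space.

dest[.mask]: the destination register that receives the result of the operation.

srcAddress[.swizzle]: the texture coordinates needed to perform the sample.

srcResource[.swizzle]: a texture register (t#), which must have been declared, identifying the texture to fetch data from.

srcSampler: a sampler register (s#) specifying how the texture is sampled.

This instruction is the more complex counterpart of ld. Unlike ld, sample also filters the data it fetches, smoothing the texel values read from the texture.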

Unless otherwise noted, all texture sampling address operations are performed according to the arithmetic processing rules described in the Basics(3) section.

除非另有说明,否则所有纹理采样地址操作均根据“基本信息 (3) ”部分中描述的算术处理规则执行。

Texture filtering theory or historical background is NOT provided in this spec.

本规范中未提供纹理过滤理论或历史背景。

Note that details of all required texture filtering algorithms are not fully/exactly specified for this version of D3D11.3; the specs below only explicitly define a subset of all filtering features available in D3D11.3.

请注意,所有必需的纹理过滤算法的详细信息并未在D3D11.3的这个版本中完全/精确地指定。以下规范仅显式定义 D3D11.3 中可用的所有过滤功能的子集。

7.18.2 Samplers

7.18.2 采样器

Samplers identify filtering modes and other sampler state, described below. Samplers are not indexable from within shaders. There are 16 sampler "slots" per Pipeline stage, to which "Sampler Objects" can be arbitrarily assigned/reassigned.

采样器可识别过滤模式和其他采样器状态,如下所述。采样器不能从着色器中建立索引。每个 Pipeline 阶段有 16 个采样器“插槽”,可以任意分配/重新分配“采样器对象”。

The state for a sampler is encapsulated in a "sampler object", up to 4096 of which can be created through the API. At the time a sampler object is created, all of its state must be chosen permanently, and can never be changed. These sampler objects can be arbitrarily assigned to any of the 16 "sampler slots" at each of the Shader stages (a single sampler object is allowed to be assigned to multiple sampler slots, even on multiple pipeline stages simultaneously, if desired).

采样器的状态封装在“采样器对象”中,其中最多可通过 API 创建 4096 个。在创建采样器对象时,必须永久选择其所有状态,并且永远不能更改。这些采样器对象可以任意分配给每个着色器阶段的 16 个“采样器插槽”中的任何一个(如果需要,允许将单个采样器对象分配给多个采样器插槽,甚至可以同时在多个流水线阶段上分配)。

The reason Sampler Objects are statically created, and there is a limit on the number that can be created, is to enable hardware to maintain references to multiple samplers in flight in the Pipeline, without having to track changes or flush the Pipeline, which would be necessary if Sampler Objects were allowed to be edited.

采样器对象是静态创建的,且存在创建数量的限制,这是为了使硬件能够在Pipeline中维护对多个采样器的引用,而无需跟踪更改或刷新Pipeline,如果允许编辑采样器对象,那么就需要进行这些操作。

7.18.3 Sampler State

7.18.3 采样器状态

typedef enum D3D11_FILTER

{

    // Bits used in defining enumeration of valid filters:

    // bits [1:0] - mip: 0 == point, 1 == linear, 2,3 unused

    // bits [3:2] - mag: 0 == point, 1 == linear, 2,3 unused

    // bits [5:4] - min: 0 == point, 1 == linear, 2,3 unused

    // bit  [6]   - aniso

    // bit  [7]   - comparison

    // bits [8:7] - reduction type:

    //                0 == standard filtering

    //                1 == comparison

    //                2 == min

    //                3 == max

    // bit  [31]  - mono 1-bit (narrow-purpose filter) [no longer supported in D3D11]

    D3D11_FILTER_MIN_MAG_MIP_POINT                              = 0x00000000,

    D3D11_FILTER_MIN_MAG_POINT_MIP_LINEAR                       = 0x00000001,

    D3D11_FILTER_MIN_POINT_MAG_LINEAR_MIP_POINT                 = 0x00000004,

    D3D11_FILTER_MIN_POINT_MAG_MIP_LINEAR                       = 0x00000005,

    D3D11_FILTER_MIN_LINEAR_MAG_MIP_POINT                       = 0x00000010,

    D3D11_FILTER_MIN_LINEAR_MAG_POINT_MIP_LINEAR                = 0x00000011,

    D3D11_FILTER_MIN_MAG_LINEAR_MIP_POINT                       = 0x00000014,

    D3D11_FILTER_MIN_MAG_MIP_LINEAR                             = 0x00000015,

    D3D11_FILTER_ANISOTROPIC                                    = 0x00000055,

    D3D11_FILTER_COMPARISON_MIN_MAG_MIP_POINT                   = 0x00000080,

    D3D11_FILTER_COMPARISON_MIN_MAG_POINT_MIP_LINEAR            = 0x00000081,

    D3D11_FILTER_COMPARISON_MIN_POINT_MAG_LINEAR_MIP_POINT      = 0x00000084,

    D3D11_FILTER_COMPARISON_MIN_POINT_MAG_MIP_LINEAR            = 0x00000085,

    D3D11_FILTER_COMPARISON_MIN_LINEAR_MAG_MIP_POINT            = 0x00000090,

    D3D11_FILTER_COMPARISON_MIN_LINEAR_MAG_POINT_MIP_LINEAR     = 0x00000091,

    D3D11_FILTER_COMPARISON_MIN_MAG_LINEAR_MIP_POINT            = 0x00000094,

    D3D11_FILTER_COMPARISON_MIN_MAG_MIP_LINEAR                  = 0x00000095,

    D3D11_FILTER_COMPARISON_ANISOTROPIC                         = 0x000000d5,

    D3D11_FILTER_MINIMUM_MIN_MAG_MIP_POINT                      = 0x00000100,

    D3D11_FILTER_MINIMUM_MIN_MAG_POINT_MIP_LINEAR               = 0x00000101,

    D3D11_FILTER_MINIMUM_MIN_POINT_MAG_LINEAR_MIP_POINT         = 0x00000104,

    D3D11_FILTER_MINIMUM_MIN_POINT_MAG_MIP_LINEAR               = 0x00000105,

    D3D11_FILTER_MINIMUM_MIN_LINEAR_MAG_MIP_POINT               = 0x00000110,

    D3D11_FILTER_MINIMUM_MIN_LINEAR_MAG_POINT_MIP_LINEAR        = 0x00000111,

    D3D11_FILTER_MINIMUM_MIN_MAG_LINEAR_MIP_POINT               = 0x00000114,

    D3D11_FILTER_MINIMUM_MIN_MAG_MIP_LINEAR                     = 0x00000115,

    D3D11_FILTER_MINIMUM_ANISOTROPIC                            = 0x00000155,

    D3D11_FILTER_MAXIMUM_MIN_MAG_MIP_POINT                      = 0x00000180,

    D3D11_FILTER_MAXIMUM_MIN_MAG_POINT_MIP_LINEAR               = 0x00000181,

    D3D11_FILTER_MAXIMUM_MIN_POINT_MAG_LINEAR_MIP_POINT         = 0x00000184,

    D3D11_FILTER_MAXIMUM_MIN_POINT_MAG_MIP_LINEAR               = 0x00000185,

    D3D11_FILTER_MAXIMUM_MIN_LINEAR_MAG_MIP_POINT               = 0x00000190,

    D3D11_FILTER_MAXIMUM_MIN_LINEAR_MAG_POINT_MIP_LINEAR        = 0x00000191,

    D3D11_FILTER_MAXIMUM_MIN_MAG_LINEAR_MIP_POINT               = 0x00000194,

    D3D11_FILTER_MAXIMUM_MIN_MAG_MIP_LINEAR                     = 0x00000195,

    D3D11_FILTER_MAXIMUM_ANISOTROPIC                            = 0x000001d5

} D3D11_FILTER;

typedef enum D3D11_TEXTURE_ADDRESS_MODE

{

    D3D11_TEXADDRESS_WRAP         = 1,

    D3D11_TEXADDRESS_MIRROR       = 2,

    D3D11_TEXADDRESS_CLAMP        = 3,

    D3D11_TEXADDRESS_BORDER       = 4,

    D3D11_TEXADDRESS_MIRRORONCE   = 5

} D3D11_TEXTURE_ADDRESS_MODE;

typedef struct D3D11_SAMPLER_STATE

{

    D3D11_FILTER                  Filter;

    D3D11_TEXTURE_ADDRESS_MODE    AddressU; // U coordinate address mode

    D3D11_TEXTURE_ADDRESS_MODE    AddressV; // V coordinate address mode

    D3D11_TEXTURE_ADDRESS_MODE    AddressW; // W coordinate address mode

    float                         MinLOD;

    float                         MaxLOD;

    float                         MipLODBias; // (-16.0f..15.99f)

    DWORD                         MaxAnisotropy;  // (0 - 16)

    D3D11_COMPARISON_FUNC         ComparisonFunction; // for Percentage-Closer filter

    float                         BorderColor[4]; // R,G,B,A

} D3D11_SAMPLER_STATE;
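The bit layout documented in the D3D11_FILTER comments can be unpacked mechanically. The sketch below decodes a filter value into its fields; the `FilterFields` struct, the `REDUCTION_*` constants and `decode_filter` are hypothetical names introduced for illustration and are not part of the D3D11 headers. It follows the "bits [8:7] - reduction type" comment, under which the older "bit [7] - comparison" line is simply reduction type 1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reduction types stored in bits [8:7] of a D3D11_FILTER value. */
enum {
    REDUCTION_STANDARD   = 0,
    REDUCTION_COMPARISON = 1,
    REDUCTION_MIN        = 2,
    REDUCTION_MAX        = 3
};

typedef struct FilterFields {
    unsigned mip_filter;  /* bits [1:0]: 0 == point, 1 == linear */
    unsigned mag_filter;  /* bits [3:2]: 0 == point, 1 == linear */
    unsigned min_filter;  /* bits [5:4]: 0 == point, 1 == linear */
    bool     aniso;       /* bit  [6] */
    unsigned reduction;   /* bits [8:7]: one of REDUCTION_* */
} FilterFields;

/* Splits a raw filter value into the fields named in the enum comments. */
static FilterFields decode_filter(uint32_t filter)
{
    FilterFields f;
    f.mip_filter = filter & 0x3u;
    f.mag_filter = (filter >> 2) & 0x3u;
    f.min_filter = (filter >> 4) & 0x3u;
    f.aniso      = ((filter >> 6) & 0x1u) != 0u;
    f.reduction  = (filter >> 7) & 0x3u;
    return f;
}
```

For example, D3D11_FILTER_COMPARISON_ANISOTROPIC (0x000000d5) decodes to linear min/mag/mip with the aniso bit set and the comparison reduction type.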

// sampler state

AddressU

AddressV

AddressW

BorderColor

Filter

MaxAnisotropy

MaxLOD

MinLOD

MipLODBias

// sampler-comparison state

ComparisonFunc

Example:

Texture2D gDiffuseMap;

SamplerState samAnisotropic

{

Filter = ANISOTROPIC;

MaxAnisotropy = 4;

AddressU = WRAP;

AddressV = WRAP;

};

texColor = gDiffuseMap.Sample( samAnisotropic, pin.Tex );

7.18.4 Normalized-Space Texture Coordinate Magnitude vs. Maximum Texture Size

7.18.4 归一化空间纹理坐标大小与最大纹理大小的关系

The magnitude of normalized-space texture coordinates (allowing for texture tiling) has no effect on the maximum supportable texture dimensions that can be sampled. The only catch is that as the absolute magnitude of a normalized-space texture coordinate gets larger (e.g. large amounts of tiling), floating point dictates that less precision will be available to resolve individual texels in a given tiling of the texture being sampled. Large amounts of tiling of large dimension textures will yield sampling artifacts where float32 precision becomes inadequate. Separate from this tradeoff, in order to decouple the magnitude of normalized-space texture coordinates from the maximum texture dimension that can be sampled given float32 normalized-space addressing, a range reduction to about [-10...10], depending on the scenario, is applied to the texture coordinates.

归一化空间纹理坐标的大小(允许纹理平铺)对可采样的最大可支持纹理尺寸没有影响。唯一的问题是,随着归一化空间纹理坐标的绝对量级变大(例如大量平铺),浮点数决定了在给定纹理的平铺中解析单个纹素的精度会减小。大量使用大尺寸纹理的平铺会导致采样伪影,其中 float32 精度变得不足。但是,除了这种权衡,为了实现归一化空间纹理坐标大小的解耦,而不会对给定 float32 归一化空间寻址可采样的最大纹理维度产生一些影响,根据场景,对纹理坐标应用范围缩小到约 [-10...10]。
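The precision loss described above can be made concrete by measuring how many distinct float32 values fall within one texel at a given coordinate magnitude. The sketch below is an illustration only: `float_spacing` and `coords_per_texel` are hypothetical helpers, and the 16384-texel width is used because it is the filterable dimension limit quoted in this section.

```c
#include <stdint.h>
#include <string.h>

/* Spacing between x and the next larger float32.
 * x must be positive and finite. */
static float float_spacing(float x)
{
    uint32_t bits;
    float next;
    memcpy(&bits, &x, sizeof bits);    /* reinterpret the IEEE 754 encoding */
    bits += 1u;                        /* next representable value above x */
    memcpy(&next, &bits, sizeof next);
    return next - x;
}

/* Number of distinct float32 coordinate values available per texel when a
 * texture of num_texels is addressed near normalized coordinate u. */
static float coords_per_texel(float u, uint32_t num_texels)
{
    float texel_width = 1.0f / (float)num_texels;  /* in normalized space */
    return texel_width / float_spacing(u);
}
```

Near u = 1 a 16384-texel texture still has 512 representable coordinates per texel, enough for 8 bits of subtexel precision. Near u = 1024 (about a thousand tiles out) the figure drops to 0.5, so adjacent texels can no longer be addressed individually.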

Details of this range reduction are described later(7.18.6). The reduction happens before scaling texture coordinates by texture size, conversion to fixed point, and final application of Texture Address modes (CLAMP/MIRROR/WRAP etc.) on texel addresses. The range reduction allows the fixed point representation to not have to dedicate storage for the texture tiling. It is important to note that range reduction is a separate step from applying Texture Address mode (although the particular Texture Address mode affects what type of reduction gets used).

此范围减少的详细信息将在后面 (7.18.6) 介绍。减少发生在按纹理大小缩放纹理坐标、转换为定点数,以及在纹素地址上最终应用纹理地址模式(CLAMP/MIRROR/WRAP等)之前。范围缩小允许定点表示不必为纹理平铺专门存储。需要注意的是,范围减少与应用纹理地址模式是不同的步骤(尽管特定的纹理地址模式影响使用哪种类型的减少)。

Using range reduction to decouple texture coordinate magnitude from supportable texture size has the following implication: The maximum texture dimension possible to be sampled in D3D11.3 is 2^17. This limit is derived starting with 24 bits of float32 fractional precision for the original texture coordinate, subtracting required subtexel precision (8 bits), and subtracting 1 more bit due to the factor of 2 scaling in the reduced range. Of course, the minimum upper limit for filterable texture dimension required to be exposed by all D3D11.3 implementations is far smaller, at only 16384 (see System Limits(21)).

使用范围减少将纹理坐标幅度与可支持的纹理大小分离具有以下含义:D3D11.3 中可能采样的最大纹理维度为 2^17。此限制是从原始纹理坐标的 24 位 float32 小数精度开始推导的,减去所需的 subtexel 精度(8 位),然后再减去 1 位,因为在缩小的范围内,比例为 2。当然,所有D3D11.3实现需要暴露的可过滤纹理尺寸的最小上限要小得多,只有16384(见系统限制(21))。

7.18.5 Processing Normalized Texture Coordinates

7.18.5 处理归一化纹理坐标

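In computer graphics, texture coordinates are usually normalized: whatever the actual dimensions of the texture, the coordinates span [0, 1]. This makes texture coordinates independent of any particular texture size.

For example, on a 256x256 texture the coordinate (0.5, 0.5) addresses the center of the texture. If the texture is resized to 512x512, (0.5, 0.5) still addresses the center, because normalized coordinates do not depend on the actual dimensions.

When a shader samples the texture, however, the actual size matters, because the GPU takes it into account when filtering. With linear filtering, for instance, the GPU reads several texels and computes the final color from their values and from the exact position of the texture coordinate among them.

In short, although texture coordinates are normalized, the texture's actual size still comes into play when the texture is sampled.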

This section describes in general how to convert a normalized texture coordinate to a texture address. The description is based on sampling a Texture1D, but applies equally to Texture2D and Texture3D (and not TextureCubes).

本节介绍如何将归一化纹理坐标转换为纹理地址。该描述基于对 Texture1D 的采样,但同样适用于 Texture2D 和 Texture3D(不适用于 TextureCubes)。

A normalized texture coordinate (U) maps the range [0, 1] to the range [0, numTexelsU], where numTexelsU is the size of a 1D texture in texels. The process of computing a texture address is as follows:

归一化纹理坐标(U)将范围 [0,1] 映射到范围 [0,numTexelsU],其中 numTexelsU 是纹素中一维纹理的大小。纹理地址的计算过程如下:

1. Reducing the normalized texture coordinate range based on the texture address mode.

1. 根据纹理地址模式缩小归一化纹理坐标范围

2. Performing point sample or linear sample addressing (scaling the normalized texture coordinate by the texture size and snapping the value to a fixed point number with 8 bits fraction).

2. 执行点采样或线性采样寻址(按纹理大小缩放归一化纹理坐标,并将值捕捉到具有 8 位小数的定点数)。

3. Applying the texture address mode.

3. 应用纹理地址模式

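Example:

Suppose we have a 1D texture with a size of 4 texels, indexed 0, 1, 2, 3. Given a normalized texture coordinate U = 0.5, how is the corresponding texture address found?

1. Reduce the range of the normalized coordinate based on the texture address mode: with CLAMP the reduced range is [-10, 10] and with WRAP it is [0, 1) (see 7.18.6), so U = 0.5 is unchanged in either mode.

2. Perform point sample or linear sample addressing: the coordinate is scaled by the texture size and snapped to fixed point. For point sampling (7.18.7), U = 0.5 scaled by 4 gives 2.0, and the integer part, 2, is the chosen texel (the integer part is taken; the value is not rounded to the nearest integer). For linear sampling (7.18.8), 0.5 is subtracted after scaling, giving 1.5, so texels 1 and 2 are chosen, each with weight 0.5.

3. Apply the texture address mode: texels 1 and 2 are both within [0, 3], so neither CLAMP nor WRAP changes them.

So for point sampling, the normalized coordinate U = 0.5 maps to texel address 2, and sampling returns the value of the texel with index 2. For linear sampling, the result is an equal blend of texels 1 and 2.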

7.18.6 Reducing Texture Coordinate Range

7.18.6 缩小纹理坐标范围

To limit the number of bits needed to store the texture coordinate in fixed point after conversion from floating point, the range of the normalized texture coordinate is reduced to be within [-10,10], depending on the Address mode. This removes the magnitude of texture tiling from the texture coordinate, while not affecting the behavior of texture address wrap modes. The same address mode handling can be applied to the range reduced texture coordinate as the original, producing the same result. The benefit is that the magnitude of texture tiling is not stored in the coordinate at the same time that texture size scaling is performed on the coordinate. This enables far larger texture coordinate range to be handled cleanly than would otherwise be possible without reduction.

为了限制从浮点转换后将纹理坐标存储在定点所需的位数,归一化纹理坐标的范围将缩小到 [-10,10] 以内,具体取决于地址模式。这将从纹理坐标中移除纹理平铺的大小,同时不影响纹理地址wrap模式的行为。相同的地址模式处理可以应用于与原始的范围缩小的纹理坐标,产生相同的结果。好处在于,纹理平铺的大小不会存储在坐标中,同时纹理大小缩放是在坐标上执行的。这使得更大的纹理坐标范围能够被干净地处理,而不是在没有减少的情况下。

Note that the range reductions applied here in some cases leave a bit of extra padding (up to the [-10,10] mentioned). This padding allows for the fact that after scaling by texture size, the selection of texels for point or linear sample kernels involves picking texel(s) to the left and/or right of the sample location, so coordinates that are not near the boundaries of the addressing mode must not appear as if they are on the boundary. For example, consider linear sampling a coordinate that straddles a border when in BORDER mode: this needs to pick up the Border Color for 1/2 of the samples and the interior edge of the texture for the other 1/2. However, range reduction cannot just clamp to [0..1) for BORDER mode, because it would make coordinates that fall completely into BORDER territory incorrectly behave as if they straddle the border (picking up some contribution of Border Color and interior). Range reduction also has to allow for the immediate texel offsets permitted in shader code. Range reduction does not change expected texture sampling behavior; it just helps keep the sequence of floating point operations on texture coordinates within manageable range.

请注意,在某些情况下,此处应用的范围缩小会留下一些额外的填充(最多提到 [-10,10])。这种填充允许在纹理尺寸缩放后,为点采样或线性采样内核选择 texel 时,选择位于采样位置左侧和/或右侧的 texel。因此,不靠近 addressing 模式边界的坐标不能表现得像它们在边界上一样。例如,考虑在BORDER模式下对跨越边界的坐标进行线性采样:这需要为一半的样本获取边界颜色,为另一半获取纹理的内部边缘。然而,范围缩减不能仅仅将BORDER模式下的值限制在[0…1)内,因为这会使完全落在BORDER区域内的坐标错误地表现为它们跨越了边界(获取了一些边界颜色和内部的贡献)。范围缩小还必须允许着色器代码中允许的即时纹素偏移:范围缩小不会改变预期的纹理采样行为;它只是有助于将纹理坐标上的浮点运算序列保持在可管理的范围内。

The following logic describes how normalized texture coordinate range reduction is performed. (This is different from final Texture Address Processing(7.18.9), which happens a couple of steps later, on scaled coordinates that identify texels.)

以下逻辑描述了如何执行归一化纹理坐标范围缩小。(这与最终的纹理地址处理(7.18.9)不同,它发生在几个步骤之后,在标识纹素的缩放坐标上。)

Given:

float signedFrac(float f) returns (f - round_z(f)) // round_z : "round towards zero"

float frac(float f) returns (f - round_ni(f))      // round_ni : "round towards negative infinity"

We have:

float ReduceRange(float U, D3D11_TEXTURE_ADDRESS_MODE AddressMode)

{

    switch (AddressMode)

    {

    case D3D11_TEXTURE_ADDRESS_WRAP:

        // The reduced range is [0, 1)

        return frac(U);  // take the fractional part

    case D3D11_TEXTURE_ADDRESS_MIRROR:

        // The reduced range is (-2, 2)

        return signedFrac(U/2) * 2;  // signed offset of U from the nearest even integer toward zero

    case D3D11_TEXTURE_ADDRESS_MIRRORONCE:

    case D3D11_TEXTURE_ADDRESS_CLAMP:

    case D3D11_TEXTURE_ADDRESS_BORDER:

        // The reduced range is [-10, 10].

        // Each of these modes might use different tightnesses of reduced range,

        // but since there really is no benefit in that, a one-size-fits-all

        // approach is taken here.

        // Note that the range leaves room for immediate texel-space offsets

        // supported by sample instructions, [-8...7],

        // preventing these offsets from causing texcoords that clearly should

        // be out of range (i.e. in border/clamp region) from falling within

        // range after range reduction.  The point is that range reduction does

        // not have an effect on the texels that are supposed to be chosen.

        if(U <= -10)

            return -10;

        else if(U >= 10)

            return 10;

        else return U;

    }

    return 0;

}

Note that the amount of padding supported here for mirroronce/clamp/border is only feasible for use with point or linear filtering of a texture (a larger kernel becomes more likely to expose the reduced range boundary), including with immediate texel offsets from the shader. Furthermore, complex filters which use point or linear filter taps as building blocks (key example being Anisotropic Texture Filtering) are perfectly compatible with the specified range reduction. The reason is that such filters choose their "taps" by perturbing normalized texture coordinates (e.g. walking the line of anisotropy in Anisotropic Texture Filtering), and thus each perturbed "tap" individually goes through the range reduction described here before application of the usual Point/Linear Sample Addressing logic and Texture Address Processing described below.

请注意,这里支持的mirroronce/clamp/border的填充量只适用于对纹理进行点或线性过滤(更大的内核更有可能暴露出减小的范围边界),包括来自着色器的立即纹素偏移。此外,使用点采样或线性采样作为构建块的复杂过滤器(其中关键示例是各向异性纹理过滤)与指定的范围缩减完全兼容。原因是这些过滤器通过干扰归一化的纹理坐标来选择它们的“采样点(taps)”(例如,在各向异性纹理过滤中沿着各向异性的线进行),因此,每个被干扰的"tap"在应用通常的点/线性采样寻址逻辑和下面描述的纹理地址处理之前,都会单独经过这里描述的范围减小。
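The ReduceRange logic above can be sketched in Python as follows. This is an illustrative sketch only: the string mode names stand in for the D3D11_TEXTURE_ADDRESS_* values, the function names are hypothetical, and finite inputs are assumed (inf and NaN are not handled).

```python
import math

def frac(f):
    """f - round_ni(f): fractional part, rounding toward negative infinity."""
    return f - math.floor(f)

def signed_frac(f):
    """f - round_z(f): fractional part, rounding toward zero (keeps the sign of f)."""
    return f - math.trunc(f)

def reduce_range(u, address_mode):
    """Range-reduce a normalized texture coordinate for the given address mode."""
    if address_mode == "WRAP":
        return frac(u)                   # reduced range is [0, 1)
    if address_mode == "MIRROR":
        return signed_frac(u / 2) * 2    # reduced range is (-2, 2)
    if address_mode in ("MIRRORONCE", "CLAMP", "BORDER"):
        return max(-10.0, min(10.0, u))  # reduced range is [-10, 10]
    return 0.0
```

For instance, reduce_range(-0.25, "WRAP") yields 0.75, the same position within a tile, while reduce_range(12.0, "CLAMP") yields 10.0, which is still far enough outside [0, 1] to remain in the clamp region after immediate texel offsets are applied.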

7.18.7 Point Sample Addressing

7.18.7 点采样寻址

Setting aside how sampler state is configured and how mipmap LOD is chosen, consider simply the task of point sampling an Element from a particular miplevel of a Texture1D, given a scalar floating point texture coordinate in normalized space. In the Texture Coordinate Interpretation(3.3.3) section, there is a diagram illustrating generally how a 1D texture coordinate maps to a texel (not accounting for wrapping). Note from the "Texture Coordinate System" diagram shown that texel corners have integral coordinates in texel-space, and so texel centers are at half-units away from the corners. Point sampling selects the "nearest" texel based on the proximity of texel centers to the texture coordinate (keeping in mind that texel centers are at half-units):

不考虑采样器状态如何配置以及如何选择mipmap LOD,只考虑从Texture1D的特定miplevel中点采样一个元素的任务,给定一个在归一化空间中的标量浮点纹理坐标。在纹理坐标解释(3.3.3)部分,有一个图示说明了1D纹理坐标如何映射到纹素(不考虑环绕)。从显示的"纹理坐标系统"图示中注意到,纹素角在纹素空间中具有整数坐标,因此纹素中心距离角点有半个单位。点采样根据纹素中心到纹理坐标的接近度选择"最近"的纹素(请记住,纹素中心在半个单位处):

1.Given a 1D texture coordinate in normalized space U, assumed to be any float32 value.

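1.给定在归一化空间中的一维纹理坐标 U,假设为任何 float32 值。  // Note: U is not limited to [0, 1]; values outside that range are handled by range reduction (7.18.6) and the address mode (7.18.9).

2.U is scaled by the Texture1D size. Call this scaledU.

2.U 按 Texture1D 大小缩放。将此名称称为 scaledU.   // For U in [0, 1], this yields a value in [0, numTexels], where numTexels is the texture size.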

3.scaledU is converted to at least 16.8 Fixed Point(3.2.4.1). Call this fxpScaledU.

3.scaledU 转换为至少 16.8 的定点 (3.2.4.1) 。将此称为 fxpScaledU。

4.The integer part of fxpScaledU is the chosen texel. Call this t. Note that the conversion to Fixed Point(3.2.4.1) basically accomplished: t = floor(scaledU).

4.fxpScaledU 的整数部分是所选的纹素。称之为 t。请注意,到定点 (3.2.4.1) 的转换基本完成:t = floor(scaledU)。

5.If t is outside [0...numTexels-1] range, D3D11_SAMPLER_STATE's AddressU mode is applied(7.18.9).

5.如果 t 超出 [0...numTexels-1] 范围,则应用 (7.18.9) D3D11_SAMPLER_STATE 的 AddressU 模式。

For Texture2D and Texture3D Resources, the same rules apply independently on the other dimensions.

对于 Texture2D 和 Texture3D 资源,相同的规则独立应用于其他维度。
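As a rough illustration of steps 1 to 4 above for a Texture1D, the following Python sketch computes the chosen texel before any address mode is applied. The helper names are hypothetical, and the rounding used for the fixed-point conversion (round to the nearest 1/256) is an assumption of this sketch; the actual rule is the one defined in Fixed Point(3.2.4.1).

```python
FRAC_BITS = 8  # fractional bits of the 16.8 fixed-point format

def to_fixed(value):
    """Convert to fixed point with 8 fractional bits.
    Assumption: round to the nearest 1/256 (see 3.2.4.1 for the real rule)."""
    return round(value * (1 << FRAC_BITS))

def point_sample_texel(u, num_texels):
    """Return the texel index t chosen by point sampling, before address mode."""
    scaled_u = u * num_texels   # step 2: scale by the texture size
    fxp_scaled_u = to_fixed(scaled_u)  # step 3: convert to fixed point
    return fxp_scaled_u >> FRAC_BITS   # step 4: integer part (floor, also for negatives)
```

The returned index may lie outside [0...numTexels-1], for example point_sample_texel(1.0, 4) gives 4 and point_sample_texel(-0.1, 4) gives -1; step 5 then applies the AddressU mode. Under the rounding assumed here, a scaled coordinate within 1/512 of the next texel boundary snaps up to that boundary, which may be why the text says the conversion "basically" accomplishes floor(scaledU).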

For TextureCube Resources, the following occurs:

对于 TextureCube 资源,将发生以下情况: // Point sample addressing procedure for TextureCube resources.

1.Choose the largest magnitude component of the input vector. Call the magnitude of this value AxisMajor. In the case of a tie, the following precedence should occur: Z, Y, X.

1.选择输入向量的最大的分量。将此值的大小称为 AxisMajor。如果分量都相等,应按照以下优先级进行选择:Z、Y、X。

2.Select and mirror the minor axes as defined by the TextureCube(5.3.8) coordinate space. Call this new 2d coordinate Position.

2.根据TextureCube (5.3.8) 坐标空间定义,选择并镜像小轴。将此新 2d 坐标称为 Position。

3.Project the coordinate onto the cube by dividing the components Position by AxisMajor.

3.通过将分量 Position 除以 AxisMajor,将坐标投影到立方体上。

4.Transform to 2d Texture space as follows: Position = Position * 0.5f + 0.5f;

4.转换为 2d 纹理空间,如下所示: Position = Position * 0.5f + 0.5f;

5.Convert the coordinate to fixed point as for a Texture2D.

5.将坐标转换为 Texture2D 的定点。

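Suppose the input vector is (0.1, 0.2, 0.3). The steps above proceed as follows:

1.Choose the largest magnitude component of the input vector. Here the Z component is largest, so AxisMajor = 0.3.

2.Select and mirror the minor axes. Here the minor axes are X and Y, so Position = (0.1, 0.2), ignoring any mirroring that the TextureCube(5.3.8) coordinate space requires for this face.

3.Project the coordinate onto the cube by dividing the components of Position by AxisMajor: Position = (0.1/0.3, 0.2/0.3) = (0.333, 0.667).

4.Transform to 2d Texture space: Position = Position * 0.5 + 0.5 = (0.333 * 0.5 + 0.5, 0.667 * 0.5 + 0.5) = (0.667, 0.833).

5.Convert the coordinate to fixed point as for a Texture2D. The result depends on the size of the cube face; for example, for a 256x256 face the scaled coordinate is about (170.67, 213.33), so point sampling selects texel (170, 213).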
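The TextureCube steps 1 to 4 can be sketched as below. This is a simplified illustration with hypothetical names: the per-face mirroring defined by the TextureCube(5.3.8) coordinate space is deliberately omitted, the ordering of the two minor axes for each face is an assumption of the sketch, and a nonzero input vector is assumed.

```python
def cube_point_sample_position(x, y, z):
    """Return (major axis name, 2D position in [0, 1] texture space).
    Per-face mirroring of the minor axes (5.3.8) is NOT applied here."""
    ax, ay, az = abs(x), abs(y), abs(z)
    # Step 1: largest magnitude component, tie precedence Z, then Y, then X.
    if az >= ax and az >= ay:
        major, axis_major, position = "Z", az, (x, y)
    elif ay >= ax:
        major, axis_major, position = "Y", ay, (x, z)
    else:
        major, axis_major, position = "X", ax, (z, y)
    # Step 3: project onto the cube face.
    projected = tuple(p / axis_major for p in position)
    # Step 4: transform from [-1, 1] to [0, 1] texture space.
    return major, tuple(p * 0.5 + 0.5 for p in projected)
```

For the input (0.1, 0.2, 0.3) this returns the Z axis and a position of roughly (0.667, 0.833), matching the hand calculation; a real implementation would additionally record the sign of the major component to pick the face and apply the corresponding mirroring.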

7.18.8 Linear Sample Addressing

7.18.8 线性样本寻址

Similar to the previous section, set aside how sampler state is configured and how mipmap LOD is chosen for now, and consider simply the task of linear sampling an Element from a particular miplevel of a Texture1D, given a scalar floating point texture coordinate in normalized space. Linear sampling in 1D selects the nearest two texels to the sample location and weights the texels based on the proximity of the sample location to them.

与前一节类似,暂时不考虑采样器状态如何配置以及如何选择mipmap LOD,只考虑从Texture1D的特定miplevel中线性采样一个元素的任务,给定一个在归一化空间中的标量浮点纹理坐标。在线性采样中,1D 纹理会选择最接近采样位置的两个 texel,并根据采样位置与它们的接近程度对这些 texel 进行加权。

1.Given a 1D texture coordinate in normalized space U, assumed to be any float32 value.

1.给定归一化空间 U 中的一维纹理坐标,假设为任何 float32 值。

2.U is scaled by the Texture1D size, and 0.5f is subtracted. Call this scaledU.

2.U 按 Texture1D 大小缩放,并减去 0.5f。将此称为 scaledU。

3.scaledU is converted to at least 16.8 Fixed Point(3.2.4.1). Call this fxpScaledU.

3.scaledU 转换为至少 16.8 的定点 (3.2.4.1) 。将此称为 fxpScaledU。

4.The integer part of fxpScaledU is the chosen left texel. Call this tFloorU. Note that the conversion to Fixed Point(3.2.4.1) basically accomplished: tFloorU = floor(scaledU).

4.fxpScaledU 的整数部分是所选的左纹素。将此称为 tFloorU。请注意,到定点 (3.2.4.1) 的转换基本完成:tFloorU = floor(scaledU)。

5.The right texel, tCeilU is simply tFloorU + 1.

5.右纹素,tCeilU 只是 tFloorU + 1。

6.The weight value wCeilU is assigned the fractional part of fxpScaledU, converted to float(3.2.4.2) (although using less than full float32 precision for computing and processing wCeilU and wFloorU is permitted).

6.权重值 wCeilU 被分配为 fxpScaledU 的小数部分,转换为浮点数 (3.2.4.2) (尽管允许使用小于全 float32 精度来计算和处理 wCeilU 和 wFloorU)。

7.The weight value wFloorU is 1.0f - wCeilU.

7.权重值 wFloorU 为 1.0f - wCeilU。

8.If tFloorU or tCeilU are out of range of the texture, D3D11_SAMPLER_STATE's AddressU mode is applied(7.18.9) to each individually.

8.如果 tFloorU 或 tCeilU 超出纹理范围,则D3D11_SAMPLER_STATE的 AddressU 模式将单独应用于 (7.18.9) 每个纹理。

9.Since more than one texel is chosen, the single sample result is computed as:

9.由于选择了多个纹素,因此单个样本结果的计算公式为:

texelFetch(tFloorU) * wFloorU + texelFetch( tCeilU) *  wCeilU

The procedure described above applies to linear sampling of a given miplevel of a Texture2D as well:

上述过程也适用于 Texture2D 的给定 miplevel 的线性采样:

1.Perform the texel selection in both the U and V directions independently, producing 2 U texel locations and 2 V texel locations. Combined, these select 4 texels: (tFloorU,tFloorV), (tFloorU,tCeilV), (tCeilU,tFloorV), (tCeilU,tCeilV).

1.将纹素选择独立地固定到 U 和 V 方向,产生 2 个 U 纹素位置和 2 个 V 纹素位置。组合起来,这些选择 4 个纹素:(tFloorU,tFloorV)、(tFloorU,tCeilV)、(tCeilU,tFloorV)、(tCeilU,tCeilV)。

2.There are also 4 weight values produced: wFloorU, wCeilU, wFloorV, wCeilV.

2.还产生了 4 个重量值:wFloorU、wCeilU、wFloorV、wCeilV。

3.The linear sample result is:

3.线性样本结果为:

texelFetch(tFloorU,tFloorV) * wFloorU * wFloorV +

texelFetch(tFloorU, tCeilV) * wFloorU *  wCeilV +

texelFetch( tCeilU,tFloorV) *  wCeilU * wFloorV +

texelFetch( tCeilU, tCeilV) *  wCeilU *  wCeilV

Performing linear sampling of a miplevel of a Texture3D Resource extends the concepts described above to fetching of 8 texels.

对 Texture3D 资源的 miplevel 执行线性采样将上述概念扩展到获取 8 个纹素。

In the case of a TextureCube, see the section regarding TextureCube Edge and Corner Handling(7.18.12)

对于 TextureCube,请参阅有关 TextureCube 边缘和角落处理 (7.18.12) 的部分
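The 1D linear sampling steps can be sketched in Python as follows. The function name is hypothetical, texels are plain floats, the fixed-point conversion is assumed to round to the nearest 1/256, and CLAMP addressing is hard-coded purely to keep the sketch short; any mode from Texture Address Processing(7.18.9) could be substituted.

```python
def linear_sample_1d(texels, u):
    """Linearly sample a list of scalar texels at normalized coordinate u."""
    num_texels = len(texels)
    scaled_u = u * num_texels - 0.5     # step 2: scale and subtract 0.5
    fxp = round(scaled_u * 256)         # step 3: 16.8 fixed point (rounding assumed)
    t_floor = fxp >> 8                  # step 4: left texel (floor)
    t_ceil = t_floor + 1                # step 5: right texel
    w_ceil = (fxp & 0xFF) / 256.0       # step 6: fractional part as a weight
    w_floor = 1.0 - w_ceil              # step 7

    def clamp(t):                       # step 8, CLAMP mode only in this sketch
        return max(0, min(t, num_texels - 1))

    # step 9: weighted blend of the two texels
    return texels[clamp(t_floor)] * w_floor + texels[clamp(t_ceil)] * w_ceil
```

With texels [0.0, 10.0, 20.0, 30.0], sampling at u = 0.5 lands halfway between the centers of texels 1 and 2 and returns 15.0, while u = 0.125 lands exactly on the center of texel 0 and returns 0.0.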

7.18.9 Texture Address Processing

7.18.9 纹理地址处理

The sample* instructions provide texture coordinates in normalized floating point form, such that values in [0..1] range span a given dimension of a texture, and values outside this range fall off the borders of the texture. Later in the filtering process, when individual texels are fetched, if the address is outside the extents of the texture, either the address gets mapped back into range by the texture address mode for each component, or the border-color is used. The texture address mode is defined by the AddressU, AddressV, and AddressW members of D3D11_SAMPLER_STATE.

sample*指令以归一化的浮点形式提供纹理坐标,使得[0…1]范围内的值跨越纹理的给定维度,而超出此范围的值则落在纹理的边界之外。在过滤过程的后期,当获取单个纹素时,如果地址超出纹理范围,则每个组件的纹理地址模式会将地址映射回范围,或者使用边框颜色。纹理地址模式由 D3D11_SAMPLER_STATE 的 AddressU、AddressV 和 AddressW 成员定义。

Consider the moment in the process of sampling of a Texture1D just after picking a particular integer address scaledU to fetch a texel from (details on choosing sample locations described elsewhere for various filter modes). Suppose the texel address scaledU falls off the Texture1D, meaning either (scaledU < 0), or (scaledU > numTexelsU - 1), where numTexelsU is the count of texels in the U dimension of the Texture1D. The following pseudocode describes how the setting on D3D11_SAMPLER_STATE member AddressU gets applied on scaledU:

考虑在对Texture1D进行采样的过程中,在选择特定的整数地址scaledU以从中获取texel的那一刻(关于如何选择各种滤波模式下的采样位置的详细信息在其他地方有描述)。假设texel地址scaledU超出了Texture1D的范围,意味着(scaledU < 0)或者(scaledU > numTexelsU - 1),其中numTexelsU是Texture1D的U维度中的texel数量。以下伪代码描述了如何在scaledU上应用D3D11_SAMPLER_STATE成员AddressU的设置:

if ((scaledU < 0) || (scaledU > numTexelsU-1))

{

    switch (AddressU)

    {

    case D3D11_TEXADDRESS_WRAP:

        scaledU = scaledU % numTexelsU;

        if(scaledU < 0)

            scaledU += numTexelsU;

        break;

    case D3D11_TEXADDRESS_MIRROR:

        {

            if(scaledU < 0)

                scaledU = -scaledU - 1;

            bool Flip = (scaledU/numTexelsU) & 1;

            scaledU %= numTexelsU;

            if( Flip ) // Odd tile

                scaledU = numTexelsU - scaledU - 1;

            break;

        }

    case D3D11_TEXADDRESS_CLAMP:

        scaledU = max( 0, min( scaledU, numTexelsU - 1 ) );

        break;

    case D3D11_TEXADDRESS_MIRRORONCE:

        if(scaledU < 0)

            scaledU = -scaledU - 1;

        scaledU = max( 0, min( scaledU, numTexelsU - 1 ) );

        break;

    case D3D11_TEXADDRESS_BORDER:

        // Special case: Instead of fetching from the texture,

        // use the Border Color(7.18.9.1).

        bUseBorderColor = true;

        break;

    default:

        scaledU = 0;

    }

}

For Texture2D and Texture3D, all of the above modes apply to the V and W dimensions independently, based on AddressV and AddressW. If any single dimension selects Border Color, then the Border Color(7.18.9.1) is applied.

对于 Texture2D 和 Texture3D,上述所有模式都基于 AddressV 和 AddressW 分别应用于 V 和 W 维度。如果任何单个尺寸选取了“边框颜色”(Border Color),则应用“边框颜色”(Border Color (7.18.9.1) )。
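A Python rendering of the address mode pseudocode for one dimension might look like this. The function name and the string mode names are hypothetical stand-ins, and the border case is reported through a boolean instead of a shared bUseBorderColor flag.

```python
def apply_address_mode(t, num_texels, mode):
    """Map an integer texel address into range.
    Returns (texel_index, use_border_color)."""
    if 0 <= t <= num_texels - 1:
        return t, False                      # in range: nothing to do
    if mode == "WRAP":
        return t % num_texels, False         # Python's % is already non-negative here
    if mode == "MIRROR":
        if t < 0:
            t = -t - 1                       # reflect about the left edge
        flip = (t // num_texels) & 1         # odd tiles are mirrored
        t %= num_texels
        return (num_texels - t - 1 if flip else t), False
    if mode == "CLAMP":
        return max(0, min(t, num_texels - 1)), False
    if mode == "MIRRORONCE":
        if t < 0:
            t = -t - 1                       # mirror once, then clamp
        return max(0, min(t, num_texels - 1)), False
    if mode == "BORDER":
        return t, True                       # caller uses the Border Color instead
    return 0, False
```

Unlike the C-style pseudocode, the WRAP branch needs no correction for negative remainders, because Python's % operator returns a non-negative result for a positive modulus. For a 2D or 3D texture the function would be called once per dimension, and the Border Color is used if any dimension reports it.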

7.18.9.1 Border Color

7.18.9.1 边框颜色

Border Color values are defined in the DDI via 4 floating point values (RGBA), in linear space. The Border Color used in filtering is snapped to the precision the hardware performs filtering at for the format.

边框颜色值在 DDI 中通过线性空间中的 4 个浮点值 (RGBA) 定义。在过滤中使用的边界颜色会根据硬件对格式执行过滤的精度进行调整。

Note that the only components of the BorderColor used by filtering hardware are the ones present in the resource format description.

请注意,过滤硬件使用的 BorderColor 的唯一组件是资源格式描述中存在的组件。

For example, suppose the resource format is DXGI_FORMAT_R8_SNORM, and BorderColor is needed during a sample operation. In this case only the RED component of BorderColor is used, along with the appropriate format-specific defaults for the other components. The BorderColor (the red part in this case) is taken as floating-point data and clamped into the range of the format before filtering. In this case, the red part of the BorderColor is clamped to [-1.0f,1.0f] range before being used by the filtering hardware. From this point (entering the filtering hardware) onward, the fact that BorderColor is being used has no more behavioral effect.

例如,假设资源格式为 DXGI_FORMAT_R8_SNORM,并且在采样操作期间需要 BorderColor。在这种情况下,仅使用 BorderColor 的 RED 组件,以及其他组件的适当的特定于格式的默认值。将 BorderColor(在本例中为红色部分)作为浮点数据,并在过滤之前将其限制在格式范围内。在这种情况下,在被过滤硬件使用之前,BorderColor的红色部分被限制在[-1.0f,1.0f]范围内。从这一点(进入过滤硬件)开始,使用BorderColor的事实没有更多行为上的影响。
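The DXGI_FORMAT_R8_SNORM example can be mimicked with a small Python sketch. Everything here is illustrative: the function and parameter names are hypothetical, and the default values used for components missing from the format (0 for color, 1 for alpha) are an assumption of the sketch, since the text above only says "format-specific defaults".

```python
def effective_border_color(border_rgba, present, lo, hi,
                           defaults=(0.0, 0.0, 0.0, 1.0)):
    """Clamp the components present in the format to [lo, hi];
    replace the others with defaults (assumed values, see lead-in)."""
    return tuple(
        max(lo, min(hi, value)) if is_present else default
        for value, is_present, default in zip(border_rgba, present, defaults)
    )
```

For an R8_SNORM resource, calling effective_border_color((2.5, 0.3, 0.3, 0.3), (True, False, False, False), -1.0, 1.0) clamps red to 1.0 and ignores the supplied green, blue and alpha values.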

7.18.10 Mipmap Selection

7.18.10 Mipmap选择

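What is a mipmap?

*.A mipmap chain is a set of progressively smaller versions of a texture; each level has half the width and height of the previous level.

*.The highest-resolution (original) texture is the base level (level 0); subsequent levels are generated by downsampling.

*.The levels are stored in memory as a chain of images.

*.The purpose of mipmaps is to give better texture quality when a texture is viewed at a distance or at a glancing angle, reducing distortion and aliasing.

Why use mipmaps?

*.Performance: using lower-resolution levels for distant or small surfaces avoids unnecessary sampling of the high-resolution texture.

*.Less aliasing: mipmaps reduce jagged artifacts on distant or obliquely viewed surfaces.

*.Fewer filtering artifacts: when a texture is minified, each texel of the chosen level already represents a prefiltered region of the original texture.

How is a mipmap level chosen?

*.Automatic selection: most graphics APIs and renderers choose a mipmap level automatically to match the size of the texture on screen (see LOD Calculations(7.18.11)).

*.Manual selection: a shader can also supply the LOD explicitly, for example with sample_l.

(Figure omitted: an illustration of mipmaps from the OpenGL documentation.)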

Suppose the task at hand is to choose a mipmap level from a Resource, given a floating point LOD value. The choice of mipmap level is based on the particular choice of filter mode in the Sampler State(7.18.3); in which the possible choices are POINT and LINEAR. Anisotropic texture filtering uses LINEAR mipmap selection.

假设手头的任务是从给定浮点LOD值的资源中选择一个mipmap级别。mipmap 级别的选择基于采样器状态 (7.18.3) 中过滤器模式的特定选择;其中可能的选择是 POINT 和 LINEAR。各向异性纹理过滤使用 LINEAR mipmap 选择。

SamplerState samLinear

{

Filter   = MIN_MAG_MIP_LINEAR;

AddressU = CLAMP;

AddressV = CLAMP;

};

  1. If the Sampler defines a Filter for which MIP is set to POINT (otherwise known as 'nearest'), the LOD is first converted to at least 8.8 fixed point (if not already in fixed point form), 0.5 is added, and then the integer part of the LOD is taken as the mipmap level (clamped to available miplevels or any settings for clamping miplevels). This selects the "nearest" miplevel.
  2. 如果采样器定义了一个过滤器,其中 MIP 被设置为 POINT(也被称为 ‘最近’),那么 LOD 首先被转换为至少 8.8 定点(如果还没有以定点形式存在),加 0.5,然后取 LOD 的整数部分作为 mipmap 级别(限制到可用的 miplevels 或任何用于限制 miplevels 的设置)。这选择了 “最近” 的 miplevel。//D3D11_FILTER_MIN_MAG_MIP_POINT
  3. If the Sampler defines a Filter for which MIP is set to LINEAR:
  4. 如果采样器定义了 MIP 设置为 LINEAR 的过滤器://D3D11_FILTER_MIN_MAG_POINT_MIP_LINEAR

1.The two nearest mipmaps are selected as follows.

1.选择两个最近的 mipmap,如下所示。

2.First, the LOD is converted to at least 8.8 fixed point (if not already in fixed point form). Call this fxpLOD.

2.首先,将 LOD 转换为至少 8.8 个定点(如果尚未采用定点形式)。将此称为 fxpLOD。

3.The integer part of the fxpLOD is the first miplevel. Call this mipFloor.

3.fxpLOD 的整数部分是第一个 miplevel。将此称为 mipFloor。

4.The second miplevel, call it mipCeil, is mipFloor+1.

4.第二个 miplevel,称为 mipCeil,是 mipFloor+1。

5.The selected miplevels are clamped to the range of mipmaps available, plus any other settings for clamping miplevels.

5.选定的 miplevel 被限制到可用的 mipmap 范围,以及用于限制 miplevels 的任何其他设置。

6.The weight for mipCeil, call it wMipCeil, is the fractional component of fxpLOD, converted to float.

6.mipCeil 的权重,称为 wMipCeil,是 fxpLOD 的小数分量,转换为浮点数。

7.The weight for mipFloor, call it wMipFloor, is 1.0f - wMipCeil.

7.mipFloor(称为 wMipFloor)的权重为 1.0f - wMipCeil。

In the past multiple IHVs have cheated here (weight selection) with tactics such as snapping LOD values loosely "around" a given mipmap level to that level in order to avoid performing fetches from multiple mipmap levels. Such practices were always in violation of spec, and will continue to be violations in D3D11.3.

过去,多个 IHV 在这里作弊(权重选择),其策略包括将 LOD 值松散地“围绕”给定的 mipmap 级别捕捉到该级别,以避免从多个 mipmap 级别执行提取。这种做法始终违反规范,并将继续违反 D3D11.3 中的规范。

8.Finally, the texture filtering operation receives the pair of chosen miplevels and weights. The filter can perform some sampling operation at each miplevel and combine the results using the weights: sampleAt(mipCeil) * wMipCeil + sampleAt(mipFloor) * wMipFloor, where the particular sample operation performed depends on the filtering mode (and multiple such operations involving LINEAR mipmap selection could be involved in a complicated filtering process, e.g. in anisotropic filtering).

8.最后,纹理过滤操作接收所选的 miplevel 和权重对。过滤器可以在每个 miplevel 执行一些采样操作,使用权重将它们组合在一起:sampleAt(mipCeil)* wMipCeil + sampleAt(mipFloor)* wMipFloor,其中执行的特定采样操作取决于过滤模式(复杂的过滤过程可能涉及多个涉及线性 mipmap 选择的此类操作,例如在各向异性滤波中)。
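The POINT and LINEAR mipmap selection rules can be sketched as follows. The function names are hypothetical, the conversion to 8.8 fixed point is assumed to round to the nearest 1/256, and only clamping to the available miplevels [0, num_mips - 1] is modeled (other miplevel clamps are left out).

```python
def to_fixed_8_8(lod):
    """Convert an LOD to fixed point with 8 fractional bits (rounding assumed)."""
    return round(lod * 256)

def select_mip_point(lod, num_mips):
    """MIP = POINT: add 0.5, take the integer part, clamp to available levels."""
    fxp = to_fixed_8_8(lod) + 128            # 128 is 0.5 in 8.8 fixed point
    return max(0, min(fxp >> 8, num_mips - 1))

def select_mips_linear(lod, num_mips):
    """MIP = LINEAR: return (mipFloor, mipCeil, wMipFloor, wMipCeil)."""
    fxp_lod = to_fixed_8_8(lod)
    mip_floor = fxp_lod >> 8                 # integer part
    mip_ceil = mip_floor + 1
    w_mip_ceil = (fxp_lod & 0xFF) / 256.0    # fractional part as a weight
    w_mip_floor = 1.0 - w_mip_ceil

    def clamp(m):
        return max(0, min(m, num_mips - 1))

    return clamp(mip_floor), clamp(mip_ceil), w_mip_floor, w_mip_ceil
```

With five miplevels, an LOD of 1.25 selects levels 1 and 2 with weights 0.75 and 0.25, and an LOD of 4.5 clamps both levels to 4, so the blend collapses to a single level.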

7.18.11 LOD Calculations

7.18.11 LOD计算

This section describes how LOD is computed as part of sample* instructions involving filtering.

本节介绍如何将 计算LOD,作为涉及过滤的 sample* 指令的一部分。

  1. The following determines whether LOD will be computed by a sample instruction, either in an isotropic formulation or in anisotropic formulation:
  2. 以下内容确定是按各向同性公式还是各向异性公式的示例指令计算 LOD:

bool ComputeAnisotropicLOD =

        (SamplerState.Filter == D3D11_FILTER_ANISOTROPIC) &&

        IsTexture2D // Includes 2D arrays.

                    // Note: Implementations may choose to perform anisotropic texture

                    // filtering for TextureCubes as well, however D3D11.3 does not require(7.18.13)

                    // filtering of TextureCubes to behave any better than tri-linear filtering.

    bool ComputeIsotropicLOD = !ComputeAnisotropicLOD

    bool Magnifying = (clampedLOD <= 0)

  1. Given a texture coordinate vector (1D, 2D or 3D), let it be referred to here as:
  2. 给定纹理坐标向量(1D、2D 或 3D),此处将其称为:

    float3 TC.uvw

  1. If the Shader is a Pixel Shader, compute the partial derivative vectors in the RenderTarget x and y directions for TC.uvw. Let the derivatives be referred to here as:
  2. 如果着色器是像素着色器,则计算 TC.uvw 的 RenderTarget x 和 y 方向上的导数向量。这里将导数称为:

    float3 dX.uvw

    float3 dY.uvw

  1. See the deriv_rtx_coarse(22.5.2) and deriv_rty_coarse(22.5.3) instructions for details on how to compute these quantities.
  2. 有关如何计算这些数量的详细信息,请参阅deriv_rtx_coarse (22.5.2) 和deriv_rty_coarse (22.5.3) 说明。

  1. A couple of variants of the sampling instructions allow the Shader to provide derivatives directly or specify LOD directly (and are available in all Shader stages, not just the Pixel Shader). The sample_d(22.4.17) instruction provides derivatives directly, and the sample_l(22.4.18) instruction allows the LOD to be provided directly. When anisotropic filtering, the ratio of anisotropy with sample_l(22.4.18) is 1 (isotropic).
  2. 采样指令的几个变体允许着色器直接提供导数或直接指定 LOD(并且可用于所有着色器阶段,而不仅仅是像素着色器)。sample_d (22.4.17) 指令直接提供导数,sample_l (22.4.18) 指令允许直接提供LOD。当各向异性过滤时,各向异性与sample_l的比 (22.4.18) 值为1(各向同性)。

  1. If the current texture is a TextureCube, transform the partial derivative vectors into the space of the primary TextureCube face as follows:
  2. 如果当前纹理是 TextureCube,则将导数向量转换为主 TextureCube 面的空间,如下所示:

1.Using TC, determine which component is of the largest magnitude, as when calculating the texel location(7.18.7). If any of the components are equivalent, precedence is as follows: Z, Y, X. The absolute value of this will be referred to as AxisMajor.

1.使用 TC 确定哪个分量的大小最大,就像计算纹素位置 (7.18.7) 时一样。如果所有分量都是相等的,则优先级如下:Z、Y、X。其绝对值将称为 AxisMajor。

2.select and mirror the minor axes of TC as defined by the TextureCube coordinate space to generate TC'.uv

2.选择并镜像 TextureCube 坐标空间定义的 TC 短轴,以生成 TC'.uv

3.select and mirror the minor axes of the partial derivative vectors as defined by the TextureCube coordinate space, generating 2 new partial derivative vectors dX'.uv & dY'.uv.

3.选择并镜像由 TextureCube 坐标空间定义的导数向量的短轴,生成 2 个新的导数向量 dX'.uv 和 dY'.uv。

4.Suppose DerivativeMajorX and DerivativeMajorY are the major axis component of the original partial derivative vectors.

4.假设 DerivativeMajorX 和 DerivativeMajorY 是原始导数向量的主轴分量。

5.Calculate 2 new dX and dY vectors for future calculations as follows:

5.计算 2 个新的 dX 和 dY 向量以供将来的计算,如下所示:

    dX.uv = (AxisMajor*dX'.uv - TC'.uv*DerivativeMajorX)/(AxisMajor*AxisMajor)

    dY.uv = (AxisMajor*dY'.uv - TC'.uv*DerivativeMajorY)/(AxisMajor*AxisMajor)
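The two formulas above have the form of the quotient rule applied to TC'.uv / AxisMajor. A minimal Python sketch, with hypothetical function and parameter names, is:

```python
def cube_face_derivative(axis_major, d_minor, tc_minor, derivative_major):
    """Transform one partial derivative vector into cube-face space.
    axis_major:       AxisMajor (absolute value of the major component of TC)
    d_minor:          dX'.uv or dY'.uv (selected/mirrored minor-axis derivatives)
    tc_minor:         TC'.uv (selected/mirrored minor axes of TC)
    derivative_major: DerivativeMajorX or DerivativeMajorY
    """
    denom = axis_major * axis_major
    return tuple(
        (axis_major * d - tc * derivative_major) / denom
        for d, tc in zip(d_minor, tc_minor)
    )
```

The same function is applied once with the X derivatives and once with the Y derivatives. It assumes AxisMajor is nonzero, which holds for any nonzero TC.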

  1. Scale the derivatives by the texture size at largest mipmap:
  2. 按最大 mipmap 的纹理大小缩放导数:

if (IsTextureCube)

    {

        // multiplying by 0.5f to adjust for TextureCube coordinate system

        dX.uvw = 0.5f * dX.uvw * [NumTexelsAlongCubeSide,NumTexelsAlongCubeSide,0];

        dY.uvw = 0.5f * dY.uvw * [NumTexelsAlongCubeSide,NumTexelsAlongCubeSide,0];

    }

    else

    {

        dX.uvw = dX.uvw * [NumTexelsInUDimension,NumTexelsInVDimension,NumTexelsInWDimension];

        dY.uvw = dY.uvw * [NumTexelsInUDimension,NumTexelsInVDimension,NumTexelsInWDimension];

    }

  1. Given a pair of partial derivative vectors representing an elliptical transform, it is important to calculate LOD using a proper orthogonal Jacobian matrix, as described by [Heckbert 89]. When performing anisotropic filtering, it is also important to use these modified vectors to calculate the proper filtering footprint. D3D11.3 will allow approximations to this effect. The following describes the ideal transformation, given 2 dimensional vectors:
  2. 给定一对表示椭圆变换的偏导数向量,使用合适的正交雅可比矩阵计算 LOD 是很重要的,如 [Heckbert 89] 所述。在执行各向异性过滤时,使用这些修改的向量来计算适当的过滤占用也很重要。D3D11.3 将允许对此效果进行近似。下面描述了给定二维向量的理想变换:

    Implicit ellipse coefficients:

隐式椭圆系数:

    A = dX.v ^ 2 + dY.v ^ 2

    B = -2 * (dX.u * dX.v + dY.u * dY.v)

    C = dX.u ^ 2 + dY.u ^ 2

    F = (dX.u * dY.v - dY.u * dX.v) ^ 2

Defining the following variables:

定义以下变量:

    p = A - C

    q = A + C

    t = sqrt(p ^ 2 + B ^ 2)

The new vectors may be then calculated as:

然后,新向量可以计算为:

    new_dX.u = sqrt(F * (t+p) / ( t * (q+t)))

    new_dX.v = sqrt(F * (t-p) / ( t * (q+t)))*sgn(B) // The paper says sgn(B*p), which appears to be incorrect.

    new_dY.u = sqrt(F * (t-p) / ( t * (q-t)))*-sgn(B)

    new_dY.v = sqrt(F * (t+p) / ( t * (q-t)))

If w is nonzero, as when calculating LOD for a volume map, an orthogonal transformation must be used to calculate a pair of 2 dimensional vectors with the same lengths and inner angle prior to computing the correct Jacobian matrix. The following is the transformation implemented by the reference rasterizer:

如果 w 不为零,例如在计算体积图的 LOD 时,必须使用正交变换来计算一对具有相同长度和内角的二维向量,然后再计算正确的雅可比矩阵。以下是参考光栅器实现的变换:

    orthovec = dX x (dX x dY)

    dX' = (|dX|, 0, 0)

    dY' = (dot(dY,dX) / |dX|, dot(dY,orthovec) / |orthovec|, 0)

The following caveats also apply:

以下注意事项也适用:

1.if either of dX or dY are of zero length, an implementation should skip these transformations.

1.如果 dX 或 dY 中的任何一个长度为零,则实现应跳过这些转换。

2.if dX and dY are parallel, an implementation should skip these transformations.

2.如果 dX 和 dY 是平行的,则实现应跳过这些转换。

3.if dX and dY are perpendicular, an implementation should skip these transformations.

3.如果 dX 和 dY 是垂直的,则实现应跳过这些转换。

4.if any component of dX or dY is inf or NaN, an implementation should skip these transformations.

4.如果 dX 或 dY 的任何组件是 inf 或 NaN,则实现应跳过这些转换。

5.if components of dX and dY are large or small enough to cause NaNs in these calculations, an implementation should skip these transformations.

5.如果 dX 和 dY 的分量足够大或小到足以在这些计算中产生 NaN,则实现应跳过这些转换。
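For the 2D case, the ellipse-axis computation and its caveats might be sketched as follows. This is an illustration only, with hypothetical names: when a caveat applies the inputs are returned unchanged, and sgn(0) is treated as +1 (an assumption of this sketch, since the formulas do not say what to do when B is exactly zero).

```python
import math

def orthogonalize_derivatives(dx, dy):
    """dx, dy: (u, v) derivative vectors in texel units.
    Returns (new_dX, new_dY), or the inputs unchanged when a caveat applies."""
    if not all(math.isfinite(c) for c in dx + dy):
        return dx, dy                            # inf or NaN component
    dot = dx[0] * dy[0] + dx[1] * dy[1]
    A = dx[1] ** 2 + dy[1] ** 2
    B = -2.0 * (dx[0] * dx[1] + dy[0] * dy[1])
    C = dx[0] ** 2 + dy[0] ** 2
    F = (dx[0] * dy[1] - dy[0] * dx[1]) ** 2
    p, q = A - C, A + C
    t = math.sqrt(p * p + B * B)
    # F == 0 covers zero-length and parallel vectors; dot == 0 is perpendicular.
    if F == 0.0 or dot == 0.0 or t == 0.0 or q - t <= 0.0:
        return dx, dy
    s = 1.0 if B >= 0.0 else -1.0                # sgn(B), with sgn(0) assumed +1
    new_dx = (math.sqrt(F * (t + p) / (t * (q + t))),
              math.sqrt(F * (t - p) / (t * (q + t))) * s)
    new_dy = (math.sqrt(F * (t - p) / (t * (q - t))) * -s,
              math.sqrt(F * (t + p) / (t * (q - t))))
    return new_dx, new_dy
```

Two properties make a convenient sanity check: the returned vectors are perpendicular, and the magnitude of their 2x2 determinant equals that of the inputs, so the footprint area is preserved.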

  1. if(ComputeIsotropicLOD), the LOD calculation is:
  2. if(ComputeIsotropicLOD),则 LOD 计算为:

    float lengthX = sqrt(dX.u*dX.u + dX.v*dX.v + dX.w*dX.w)

    float lengthY = sqrt(dY.u*dY.u + dY.v*dY.v + dY.w*dY.w)

output.LOD = log2(max(lengthX,lengthY))
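Combining the derivative scaling for non-cube textures with the isotropic formula gives a sketch like the one below. The function name is hypothetical, and returning negative infinity for zero-length derivatives is a choice made for this sketch only (log2 of zero is otherwise undefined in Python).

```python
import math

def isotropic_lod(dx, dy, tex_size):
    """dx, dy: (u, v, w) derivatives in normalized space.
    tex_size: texel counts (U, V, W) of the most detailed mipmap."""
    sx = [d * n for d, n in zip(dx, tex_size)]   # scale to texel units
    sy = [d * n for d, n in zip(dy, tex_size)]
    length_x = math.sqrt(sum(c * c for c in sx))
    length_y = math.sqrt(sum(c * c for c in sy))
    longest = max(length_x, length_y)
    return math.log2(longest) if longest > 0.0 else -math.inf
```

For a 256x256 texture, derivatives of (1/64, 0, 0) and (0, 1/128, 0) cover 4 and 2 texels per pixel respectively, so the LOD is log2(4) = 2.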

  1. if(ComputeAnisotropicLOD), the LOD calculation is:
  2. if(ComputeAnisotropicLOD),LOD计算为:

// Compute outputs:

    // (1) float ratioOfAnisotropy

    // (2) float anisoLineDirection

    // (3) float LOD

    // (For 1D Textures, dX.v and dY.v are 0, so all the

    // math below can be simplified)

    float squaredLengthX = dX.u*dX.u + dX.v*dX.v

    float squaredLengthY = dY.u*dY.u + dY.v*dY.v

    float determinant = abs(dX.u*dY.v - dX.v*dY.u)

    bool isMajorX = squaredLengthX > squaredLengthY

    float squaredLengthMajor = isMajorX ? squaredLengthX : squaredLengthY

    float lengthMajor = sqrt(squaredLengthMajor)

    float normMajor = 1.f/lengthMajor

    output.anisoLineDirection.u = (isMajorX ? dX.u : dY.u) * normMajor

    output.anisoLineDirection.v = (isMajorX ? dX.v : dY.v) * normMajor

    output.ratioOfAnisotropy = squaredLengthMajor/determinant

    // clamp ratio and compute LOD

    float lengthMinor

    if ( output.ratioOfAnisotropy > input.maxAniso ) // maxAniso comes from a Sampler state.

    {

        // ratio is clamped - LOD is based on ratio (preserves area)

        output.ratioOfAnisotropy = input.maxAniso

        lengthMinor = lengthMajor/output.ratioOfAnisotropy

    }

    else

    {

        // ratio not clamped - LOD is based on area

        lengthMinor = determinant/lengthMajor

    }

    // clamp to top LOD

    if (lengthMinor < 1.0)

    {

        output.ratioOfAnisotropy = MAX( 1.0, output.ratioOfAnisotropy*lengthMinor )

        // lengthMinor = 1.0 // This line is no longer recommended for future hardware

        //

        // The commented out line above was part of the D3D10 spec until 8/17/2009,

        // when it was finally noticed that it was undesirable.

        //

        // Consider the case when the LOD is negative (lengthMinor less than 1),

        // but a positive LOD bias will be applied later on due to

        // sampler / instruction settings.

        //

        // With the clamp of lengthMinor above, the log2() below would make a

        // negative LOD become 0, after which any LOD biasing would apply later.

        // That means with biasing, LOD values less than the bias amount are

        // unavailable.  This would look blurrier than isotropic filtering,

        // which is obviously incorrect.  The output of this routine must allow

        // negative LOD values, so that LOD bias (if used) can still result in

        // hitting the most detailed mip levels.

        //

        // Because this issue was only noticed years after the D3D10 spec was originally

        // authored, many implementations will include a clamp such as commented out

        // above.  WHQL must therefore allow implementations that support either

        // behavior - clamping or not.  It is recommended that future hardware

        // does not do the clamp to 1.0 (thus allowing negative LOD).

        // The same applies for D3D11 hardware as well, since even the D3D11 specs

        // had already been locked down for a long time before this issue was uncovered.

    }

    output.LOD = log2(lengthMinor);
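The anisotropic pseudocode translates to Python roughly as follows, using the recommended behavior of not clamping lengthMinor to 1.0. The function name is hypothetical, and the handling of degenerate inputs (zero-length major axis, zero determinant) is an assumption added so the sketch does not divide by zero.

```python
import math

def anisotropic_lod(dx, dy, max_aniso):
    """dx, dy: (u, v) derivatives in texel units.
    Returns (ratioOfAnisotropy, anisoLineDirection, LOD)."""
    sq_x = dx[0] * dx[0] + dx[1] * dx[1]
    sq_y = dy[0] * dy[0] + dy[1] * dy[1]
    determinant = abs(dx[0] * dy[1] - dx[1] * dy[0])
    is_major_x = sq_x > sq_y
    sq_major = sq_x if is_major_x else sq_y
    length_major = math.sqrt(sq_major)
    if length_major == 0.0:
        return 1.0, (0.0, 0.0), -math.inf        # assumption: degenerate footprint
    major = dx if is_major_x else dy
    direction = (major[0] / length_major, major[1] / length_major)
    # Assumption: a zero determinant is treated as an unbounded ratio.
    ratio = sq_major / determinant if determinant > 0.0 else math.inf
    if ratio > max_aniso:
        ratio = max_aniso                        # clamped: LOD based on ratio
        length_minor = length_major / ratio
    else:
        length_minor = determinant / length_major  # not clamped: LOD based on area
    if length_minor < 1.0:
        ratio = max(1.0, ratio * length_minor)   # clamp to top LOD
    return ratio, direction, math.log2(length_minor)
```

For dX = (8, 0) and dY = (0, 2) with maxAniso = 16, the ratio is 4 and the LOD is 1; lowering maxAniso to 2 clamps the ratio and raises the LOD to 2. A magnified footprint such as dX = (0.5, 0), dY = (0, 0.25) yields a negative LOD of -2, which later biasing can still act on.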

  1. Given an LOD specified either from the shader or calculated from derivatives, MipLODBias, srcLODBias (sample_b(22.4.16) only), and MinLOD and MaxLOD clamps are applied to it:
  2. 给定从着色器指定或从导数计算的 LOD,将对其应用 MipLODBias、srcLODBias( (22.4.16) 仅限sample_b)以及 MinLOD 和 MaxLOD 限制:

    biasedLOD = output.LOD + MipLODBias;

    biasedLOD = biasedLOD + srcLODBias;  // for sample_b only; must be done per pixel

    clampedLOD = max(MinLOD,(min(MaxLOD, biasedLOD)));

The ordering of min/max guarantees that if MinLOD > MaxLOD, then MinLOD takes precedence. These min and max operations follow the Floating Point Rules(3.1), so NaN never gets propagated. A sampler state that specifies NaN for MinLOD or MaxLOD is invalid.

最小/最大值的排序保证了如果 MinLOD > MaxLOD,则 MinLOD 优先。这些最小值和最大值运算遵循浮点规则 (3.1) ,因此 NaN 永远不会传播。为 MinLOD 或 MaxLOD 指定 NaN 的采样器状态无效。
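
The bias-and-clamp sequence above can be sketched in C++ as follows. This is a minimal illustration rather than the reference implementation: the names `SamplerLODState` and `ClampLOD` are hypothetical, and `std::fmin`/`std::fmax` are assumed here as stand-ins for the D3D min/max rules because they return the non-NaN operand when exactly one operand is NaN.

```cpp
#include <cmath>

// Hypothetical container for the sampler-side LOD controls named in the text.
struct SamplerLODState {
    float mipLODBias;
    float minLOD;
    float maxLOD;
};

// Applies MipLODBias, the optional sample_b bias, then the MinLOD/MaxLOD clamp.
// The outer max() on MinLOD is what makes MinLOD win when MinLOD > MaxLOD.
inline float ClampLOD(float lod, const SamplerLODState& s, float srcLODBias = 0.0f) {
    float biasedLOD = lod + s.mipLODBias;
    biasedLOD = biasedLOD + srcLODBias;   // sample_b only; zero otherwise
    return std::fmax(s.minLOD, std::fmin(s.maxLOD, biasedLOD));
}
```

With `MinLOD = 2` and `MaxLOD = 1`, any input LOD resolves to 2, which matches the precedence rule stated above.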

Note that the naming for MinLOD and MaxLOD is different/opposing from the D3DSAMP_MAXMIPLEVEL sampler state present in Direct3D9.

请注意,MinLOD 和 MaxLOD 的命名与 Direct3D9 中存在的D3DSAMP_MAXMIPLEVEL采样器状态不同/相反。

The selection of minification vs magnification occurs after LOD clamping.

缩小与放大的选择发生在LOD限制之后。

Prior to feature level 11.0, it was undefined whether magnification selection occurred before or after LOD clamping.

在特征级别11.0之前,放大率选择是发生在LOD限制之前还是之后是未定义的。

Also note the independent Per-Resource Mipmap Clamping(5.8) feature, which is an optional additional clamp on the LOD like MinLOD above but specified at a resource level as opposed to a sample+shader-resource view level.

另请注意独立的 Per-Resource Mipmap Clamping (5.8) 功能,它是 LOD 上的可选附加 Clamp,如上面的 MinLOD,但在资源级别指定,而不是 sample+shader-resource 视图级别。

In some future D3D version, a better definition of magnification should be considered. For one, filtering should take into account the available mipmaps after clamping. Further, perhaps whenever the most detailed available mipmap is read, it should receive magnification filtering, while minification filtering would always be applied to any less detailed mips read in a given filter operation. Thus a given trilinear filter operation could be applying both magnification on one of the mips referenced simultaneously with minification filtering on the other before blending the mips together. This distinction becomes interesting if more compelling magnification filter types are ever introduced, particularly in avoiding discontinuities transitioning between minification and magnification.

在未来的某个 D3D 版本中,应该考虑更好的放大定义。首先,过滤应考虑在限制后可用的 mipmap。此外,也许每当读取最详细的可用 mipmap 时,它都应该接受放大过滤,而缩小过滤将始终应用于给定过滤操作中读取的一些不太详细的 mips。因此,给定的三线性过滤操作可以同时在其中一个引用的 mips 上应用放大和在另一个上应用缩小过滤,然后再将 mips 混合在一起。如果引入了更有吸引力的放大过滤类型,这种区别变得有趣,特别是在避免在缩小和放大之间过渡时的不连续性。

Regarding MipLODBias: The valid range for both MipLODBias in the sampler and srcLODBias in the sample_b(22.4.16) instruction is (-16.0f...15.99f). An implementation must support sufficient range for the LOD value before the application-defined MinLOD/MaxLOD/MipLODBias/srcLODBias equation above, such that if the calculated LOD before this equation is outside of the internally supported range and gets clamped (prior to applying application-defined MinLOD/MaxLOD), then the MipLODBias part of the equation (given any valid MipLODBias and srcLODBias value) must not cause the LOD to come back into the range that affects mip selection.

关于 MipLODBias:在采样器中的 MipLODBias 和 sample_b(22.4.16) 指令中的 srcLODBias 的有效范围是 (-16.0f…15.99f)。实现必须在上述应用程序定义的 MinLOD/MaxLOD/MipLODBias/srcLODBias 方程之前,支持足够的LOD 值范围,这样如果在此方程之前计算的 LOD 超出内部支持的范围并被限制(在应用程序定义的 MinLOD/MaxLOD 之前),那么方程的 MipLODBias 部分(给定任何有效的 MipLODBias 和 srcLODBias 值)不应导致 LOD 回到影响 mip 选择的范围。

7.18.12 TextureCube Edge and Corner Handling

7.18.12 纹理立方体边缘和角落处理

When filtering a TextureCube near Cube edges, 2x2 (bilinear) filter taps that would fall off a face are required to spill over by one texel row/column to the appropriate adjacent map.

在立方体边缘附近对TextureCube进行过滤时,需要将2x2(双线性)滤波采样点溢出一个纹素行/列到相应的相邻贴图中。

At TextureCube corners, a linear combination of the three relevant samples is required. The ideal (reference) linear combination of the three samples in the corner case is as follows: Imagine flattening out the Cube faces at the corner, yielding 3 texels and a missing one. Apply bilinear weights on this virtual grid of 4 texels, and then divide the weight for the missing texel evenly amongst the 3 other texels. It is alternatively permissible for an implementation to, instead of dividing the weight evenly amongst the 3 other texels, just split the weight of the missing texel across the 2 adjacent texels. However in future versions of D3D, only the reference behavior will be permitted.

在TextureCube的角落,需要三个相关样本的线性组合。

在角落情况下,三个样本的理想(参考)线性组合如下:想象一下,在角落处展平立方体面,产生 3 个纹素和一个缺失的纹素。在这个虚拟的 4纹素网格上应用双线性权重,然后将缺失的纹素的权重均匀地分配给其他3个纹素。

另一种允许的实现方式是,而不是将权重均匀地分配给其他 3个纹素,只是将缺失的纹素的权重分配给 2 个相邻的纹素。

然而,在未来的 D3D 版本中,只有这个参考行为将被允许。

7.18.13 Anisotropic Filtering of TextureCubes

7.18.13 纹理立方体的各向异性滤波

Anisotropic texture filtering on a TextureCube does not have specified/required behavior except that it must at least behave no "worse" than tri-linear filtering would.

TextureCube 上的各向异性纹理过滤没有指定/必需的行为,除非它至少必须表现得不比三线性过滤“差”。

7.18.14 Sample Return Value Type Interpretation

7.18.14 采样返回值类型解释

The application is given control over the return type of texture load instructions (i.e. reading raw integer values vs. reading normalized float values) by simply choosing an appropriate format to interpret the resource's contents as. See the Formats(19.1) section for detail.

应用程序可以通过简单地选择适当的格式来解释资源的内容,从而控制纹理加载指令的返回类型(即读取原始整数值与读取规范化浮点值)。有关详细信息,请参阅“格式 (19.1) ”部分。

7.18.15 Comparison Filtering

7.18.15 比较过滤

For details on comparison filtering, see the sample_c(22.4.19) and sample_c_lz(22.4.20) instructions.

有关比较过滤的详细信息,请参阅sample_c (22.4.19) 和sample_c_lz (22.4.20) 说明。

Comparison Filtering is an attempt by D3D11.3 to define a basic building-block filtering operation that is useful for Percentage Closer Depth Filtering.

比较过滤是 D3D11.3 定义基本构建块过滤操作的尝试,该操作对Percentage Closer Depth Filtering很有用。

7.18.15.1 Shadow Buffer Exposure on Feature Level 9.x

7.18.15.1 功能级别 9.x 上的阴影缓冲区曝光

D3D9 never officially provided dedicated hardware support for shadow map scenarios. Namely, D3D9 does not spec the ability to bind a depth buffer as a shader input and to sample from it using comparison filtering (also known as "Percentage Closer Filtering"). Even though this never made it into the D3D9 spec, the D3D9 runtime intentionally used loose validation to enable IHVs to align on a convention for how to make the feature work.

D3D9 从未正式支持对阴影贴图方案的专用硬件支持。也就是说,D3D9 没有指定将深度缓冲区绑定为着色器输入并使用比较过滤(也称为“Percentage Closer Filtering”)从中采样的功能。尽管这从未进入 D3D9 规范,但 D3D9 运行时有意使用松散验证,使 IHV 能够就如何使功能工作的约定保持一致。

In the meantime, the D3D10+ hardware spec added a requirement for supporting binding depth as a texture and for comparison filtering.

同时,D3D10+ 硬件规范增加了支持绑定深度作为纹理和比较过滤的要求。

As more scenarios arise involving the D3D11+ APIs running on Feature Level 9.x, it finally makes sense to expose the D3D9 shadow buffer support. It turns out this is possible simply by loosening validation on existing API constructs in the D3D11.1+ API for depth buffers and comparison filtering, mapping to the equivalent in the D3D9 convention that IHVs had aligned on where applicable.

随着涉及在功能级别 9.x 上运行的 D3D11+ API 的更多方案出现,最终公开 D3D9 阴影缓冲区支持是有意义的。事实证明,这可以通过放宽对 D3D11.1+ API 中现有 API 构造的验证来实现,用于深度缓冲区和比较筛选,映射到 IHV 在适用的情况下对齐的 D3D9 约定上的等效项。

When Feature Level 9.x is used at the D3D11.1+ API (meaning the D3D9 DDI is used) on a Win8+ driver, regardless of hardware feature level, applications can do the following:

在 Win8+ 驱动程序上的 D3D11.1+ API(意味着使用 D3D9 DDI)使用功能级别 9.x 时,无论硬件功能级别如何,应用程序都可以执行以下操作:

1.Create Texture2D surfaces with the format DXGI_FORMAT_R16_TYPELESS or DXGI_FORMAT_R24G8_TYPELESS and set BindFlags to both D3D11_BIND_SHADER_RESOURCE and D3D11_BIND_DEPTH_STENCIL together.

1.使用 DXGI_FORMAT_R16_TYPELESS 或 DXGI_FORMAT_R24G8_TYPELESS 格式创建 Texture2D 曲面,并将 BindFlags 设置为同时D3D11_BIND_SHADER_RESOURCE和D3D11_BIND_DEPTH_STENCIL。

2.Create Sampler State objects with a comparison filter chosen and comparison mode lessEqual.

2.创建采样器状态对象,并选择比较过滤器和比较模式 lessEqual。

3.Use BorderColor addressing if desired on these samplers, even though border color is otherwise not normally allowed on Feature Levels 9.1 and 9.2. This is useful to allow applications to choose what happens when sampling off the bounds of Depth Buffer. A typical choice would be using a depth value (placed in the R component of the border color) that would result in the depth comparison always passing or always failing.

3.如果需要,在这些采样器上使用 BorderColor 寻址,即使功能级别 9.1 和 9.2 上通常不允许使用边框颜色。这对于允许应用程序选择在深度缓冲区边界之外采样时发生的情况非常有用。典型的选择是使用深度值(放置在边框颜色的 R 分量中),这将导致深度比较始终通过或始终失败。

4.The Mag and Min filter settings in the comparison filter choose between linear or point filtering (using different choices for Mag/Min filter is undefined). Anisotropic filtering is not allowed. The Mip filter choice is meaningless since Feature Level 9.x does not allow mipmapped depth buffers.

4.比较滤波器中的Mag 和Min 过滤器设置在线性滤波或点滤波之间进行选择(未定义对磁力/最小滤波器使用不同的选择)。不允许各向异性滤波。Mip 筛选器的选择毫无意义,因为功能级别 9.x 不允许 mipmapped 深度缓冲区。

5.Create a DepthStencil View of the typeless Texture2D resource with format DXGI_FORMAT_D16_UNORM / DXGI_FORMAT_D24_UNORM_S8_UINT and render depth to it.

5.创建格式为 DXGI_FORMAT_D16_UNORM / DXGI_FORMAT_D24_UNORM_S8_UINT 的无类型 Texture2D 资源的 DepthStencil 视图,并向其渲染深度。

6.Create a Shader Resource View of the typeless Texture2D resource with format DXGI_FORMAT_R16_UNORM / DXGI_FORMAT_R24_UNORM_X8_TYPELESS and bind it with the comparison sampler described above.

6.创建格式为 DXGI_FORMAT_R16_UNORM/DXGI_FORMAT_R24_UNORM_X8_TYPELESS的无类型 Texture2D 资源的着色器资源视图,并将其与上述比较采样器绑定。

7.Use the SampleCmp/SampleCmpLevelZero texture2D methods in ps_4_0_level_9_* shaders to sample from the Shader Resource View above.

7.使用 ps_4_0_level_9_* 着色器中的 SampleCmp/SampleCmpLevelZero texture2D 方法从上面的着色器资源视图中采样。

7.1 Note these methods already exist for ps_4_0+ (sample_c/sample_c_lz in the bytecode). They are simply now also available for ps_4_0_level_9_*, where the D3D9 texld operation is repurposed for comparison filtering, described further below.

7.1请注意,这些方法已存在于 ps_4_0+ (字节码中的 sample_c/sample_c_lz) 中。它们现在也可用于 ps_4_0_level_9_*,其中 D3D9 texld 操作被重新用于比较过滤,如下所述。

7.2 sample_c/sample_c_lz (the latter forcing mip level 0) behave identically since depth textures cannot be mipmapped on Feature Level 9.x.

7.2 sample_c/sample_c_lz(后者强制 mip 级别为 0)的行为相同,因为深度纹理无法在功能级别 9.x 上mipmapped 。

8.Passing these shaders to CreatePixelShader (or using any of these features) on an old runtime will fail.

8.在旧的运行时上将这些着色器传递给 CreatePixelShader(或使用其中任何功能)将失败。

9.If any state is configured incorrectly by the application, either the runtime will fail state creation or, if the mismatch is only visible at Draw-time, the Draw call gets dropped by the runtime. Basically the runtime drops the Draw call if a texture is bound and either it is depth but the sampler is not a comparison sampler, or it is not depth but the sampler is a comparison sampler. This validation does not check whether the current shader even uses the texture at all, so in that sense it is stricter than necessary (for simplicity of implementation).

9.如果应用程序未正确配置一些状态,则运行时将无法创建状态,否则,如果不匹配只在绘制时可见,则运行时会放弃绘制调用。基本上,如果纹理已绑定并且它是深度但采样器不是比较采样器,或者纹理不是深度并且采样器是比较采样器,则运行时会放弃 Draw 调用。此验证不会检查当前着色器是否使用了纹理,因此从这个意义上说,它比必要的更严格(为了实现的简单性)。
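
The Draw-time rule in item 9 can be sketched as a simple predicate. This is a hedged illustration only: `BoundSlot` and `ShouldDropDraw` are hypothetical names (not D3D11 runtime types), and pairing each texture slot with the sampler at the same slot index is an assumption made for the sketch.

```cpp
#include <vector>

// Hypothetical per-slot description; these are not D3D11 types.
struct BoundSlot {
    bool textureBound;        // a shader resource view is bound at this slot
    bool textureIsDepth;      // the bound texture is a depth format
    bool samplerIsComparison; // the sampler at this slot uses a comparison filter
};

// Returns true when the Draw call would be dropped: some bound texture's
// "depth-ness" disagrees with its sampler's "comparison-ness".
// Whether the shader actually reads the slot is deliberately not consulted.
inline bool ShouldDropDraw(const std::vector<BoundSlot>& slots) {
    for (const BoundSlot& s : slots) {
        if (s.textureBound && s.textureIsDepth != s.samplerIsComparison) {
            return true;
        }
    }
    return false;
}
```

Because shader usage is ignored, a mismatched slot that the shader never samples still causes the Draw to be dropped, which is the "stricter than necessary" behavior described above.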

The overbearing validation described above (dropping Draw calls when state is invalid) helps ensure that an application that can get shadows working at Feature Level 9.x will behave the same if the Feature Level is bumped up to 10+ with no code change required.

上面描述的强制验证(当状态无效时删除 Draw 调用)有助于确保在 Feature Level 9.x 上可以获得阴影效果的应用程序在 Feature Level 提升到 10+ 时仍然可以获得相同的效果,而无需进行代码更改。

The reason this feature is limited to Win8+ drivers (regardless of hardware feature level) is to avoid having to test on any old D3D9 hardware that is unlikely to be driven by the D3D11.1 APIs in the first place.

此功能仅限于 Win8+ 驱动程序(无论硬件功能级别如何)的原因是为了避免在任何旧的 D3D9 硬件上进行测试,这些硬件首先不太可能由 D3D11.1 API 驱动。

7.18.15.1.1 Mapping the Shadow Buffer Scenario to the D3D9 DDI

7.18.15.1.1 将阴影缓冲区方案映射到 D3D9 DDI

The D3D11.1 runtime maps this shadow scenario to the D3D9 DDI (regardless of hardware feature level) as follows.

D3D11.1 运行时将此阴影方案映射到 D3D9 DDI(无论硬件功能级别如何),如下所示。

1.Surfaces can be created with both depth and texture flags as long as the format is either D3DDDIFMT_S8D24 or D3DDDIFMT_D16.

1.只要格式为D3DDDIFMT_S824或D3DDIFMT_D16,就可以使用深度和纹理标志创建表面。

2.If a depth texture is bound as a texture input to the Pixel Shader, comparison filtering with less-equal comparison is always assumed. (There is no DDI in D3D9 for explicitly turning on or off comparison filtering.)

2.如果深度纹理被绑定为像素着色器的纹理输入,总是假定比较过滤器是小于等于比较的。(在D3D9中没有DDI可以显式地打开或关闭比较过滤器。)

3.The Mag and Min filter settings in the comparison filter choose between linear or point filtering (using different choices for Mag/Min filter is undefined). Anisotropic filtering is not allowed. The Mip filter choice is meaningless since Feature Level 9.x does not allow mipmapped depth buffers.

3.比较过滤器中的 Mag 和 Min 过滤器设置在线性过滤和点过滤之间进行选择(对于 Mag/Min 过滤器使用不同的选择是未定义的)。不允许各向异性过滤。Mip 过滤器的选择毫无意义,因为功能级别 9.x 不允许 mipmapped 深度缓冲区。

4.BorderColor addressing is allowed to be requested by the application when a depth buffer is set as a texture. For all other cases border addressing is not allowed on Feature Level 9.1 and 9.2.

4.当深度缓冲区设置为纹理时,应用程序允许请求 BorderColor 寻址。对于所有其他情况,功能级别 9.1 和 9.2 上不允许边界寻址。

5.In the Pixel Shader, the 3rd component of the texture coordinate input to the texld instruction specifies the reference z value to use during comparison filtering. For a description of comparison filtering, refer to the D3D10+ sample_c shader instruction. The difference for the (repurposed) D3D9 texld instruction is that the z value is packed with the texture coordinate rather than passed as a separate argument.

5.在像素着色器中,输入到 texld 指令的纹理坐标的第三个组件指定了在比较过滤期间使用的参考 z 值。有关比较过滤的描述,请参阅 D3D10+ sample_c 着色器指令。(重新调整用途的)D3D9 texld 指令的区别在于,z 值包含纹理坐标,而不是单独的参数。
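
As a rough sketch of what less-equal comparison filtering computes on a 2x2 footprint, and of how the packed texld coordinate differs from a separate reference argument, consider the following. All names here (`CompareFilterLessEqual`, `TexldCoord`, `TexldShadow`) are hypothetical, and the operand order (reference on the left, texel on the right) is an assumption of this sketch; sample_c(22.4.19) remains the authoritative description.

```cpp
// Less-equal comparison filtering over a 2x2 footprint: compare each texel
// against the reference first, then bilinearly filter the 0/1 results.
// depth[0..3] = top-left, top-right, bottom-left, bottom-right.
inline float CompareFilterLessEqual(const float depth[4], float fu, float fv, float refZ) {
    float pass[4];
    for (int i = 0; i < 4; ++i) {
        // Assumed operand order: reference on the left, texel on the right.
        pass[i] = (refZ <= depth[i]) ? 1.0f : 0.0f;
    }
    float top    = pass[0] + fu * (pass[1] - pass[0]);
    float bottom = pass[2] + fu * (pass[3] - pass[2]);
    return top + fv * (bottom - top);
}

// Hypothetical stand-in for the repurposed texld coordinate: u, v, then the
// reference z packed into the 3rd component instead of a separate argument.
struct TexldCoord { float u, v, z; };

inline float TexldShadow(const float depth[4], float fu, float fv, const TexldCoord& c) {
    return CompareFilterLessEqual(depth, fu, fv, c.z);
}
```

The result is the filtered fraction of comparisons that passed, so values between 0 and 1 appear only when linear filtering straddles texels with differing comparison outcomes.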

This feature was added too late to enforce via hardware conformance kit testing. However all hardware vendors at the time of shipping agreed to support it, and tests are being authored to assist with basic verification (even if not enforced for now).

添加此功能为时已晚,无法通过硬件一致性工具包测试强制执行。但是,所有硬件供应商在发货时都同意支持它,并且正在编写测试以协助基本验证(即使目前尚未强制执行)。

7.18.15.1.2 Checking for Shadow Support on Feature Level 9.x

7.18.15.1.2 检查功能级别 9.x 上的阴影支持

The D3D11 CheckFeatureSupport() API has a new capability that can be checked: D3D11_FEATURE_D3D9_SHADOW_SUPPORT. This is set to true if the driver is Win8+ (no need to ask the driver anything else).

D3D11 CheckFeatureSupport() API 有一个新的可以检查的功能:D3D11_FEATURE_D3D9_SHADOW_SUPPORT。如果驱动程序是 Win8+,则此项设置为 true(无需向驱动程序询问任何其他问题)。

typedef enum D3D11_FEATURE {

  D3D11_FEATURE_THREADING = 0,

  ...

  D3D11_FEATURE_D3D9_SHADOW_SUPPORT,

  ...

  D3D11_FEATURE_DISPLAYABLE

} ;

HRESULT CheckFeatureSupport(

            D3D11_FEATURE Feature,

  [in, out] void          *pFeatureSupportData,

            UINT          FeatureSupportDataSize

);

On the other hand, if the D3D11 CheckFeatureSupport() / CheckFormatSupport() APIs are used to query format support on the individual DXGI_FORMAT_* names described here, the runtime will NOT report support for any capabilities specific to the shadow buffer scenario. For example, support for using DXGI_FORMAT_R16_UNORM as a texture is not reported on Feature Level 9.1/9.2 (though it is supported on 9.3, independent of the shadow scenario).

另一方面,如果 D3D11 CheckFeatureSupport() / CheckFormatSupport() API 用于查询此处所述的各个 DXGI_FORMAT_* 名称的格式支持,则运行时将不会报告于阴影缓冲区方案的任何功能的支持。例如,功能级别 9.1/9.2 上未报告对使用DXGI_FORMAT_R16_UNORM作为纹理的支持(尽管 9.3 支持它,但与阴影方案无关)。

HRESULT CheckFormatSupport(

  [in]  DXGI_FORMAT Format,

  [out] UINT        *pFormatSupport

);

Not reporting shadow support on format caps queries was a simplification. It avoids conflicts where this depth scenario allows operations with format names that are not allowed in non-shadow cases, particularly for DXGI_FORMAT_R16_UNORM. It was not worth disambiguating the format caps reporting for this unique case. The bottom line is all an application needs to do is check the D3D11_FEATURE_D3D9_SHADOW_SUPPORT cap described above to know if the entire scenario will work.

在格式能力查询中不报告阴影支持是一种简化。它避免了这种深度场景允许使用在非阴影情况下不允许的格式名称的操作,特别是对于 DXGI_FORMAT_R16_UNORM。为这个独特的情况消除格式能力报告的歧义是不值得的。最重要的是,应用程序需要做的所有事情就是检查上述的 D3D11_FEATURE_D3D9_SHADOW_SUPPORT 能力,以了解整个场景是否会工作。

7.18.16 Texture Sampling Precision

7.18.16 纹理采样精度

7.18.16.1 Texture Addressing and LOD Precision

7.18.16.1 纹理寻址和LOD精度

During Texture Sampling(7.18), the amount of range required for selecting texels (after scaling normalized texture coordinates by texture size) is at least 2^16. This range is centered around 0.

在纹理采样 (7.18) 期间,选择纹素所需的范围量(按纹理大小缩放归一化纹理坐标后)至少为 216。此范围以 0 为中心。

The amount of subtexel precision required (after scaling texture coordinates by texture size) is at least 8-bits of fractional precision (2^8 subdivisions).

所需的子纹素精度量(按纹理大小缩放纹理坐标后)至少为 8 位小数精度(28 个 细分)。

In mipmap selection, after conversion from float, at least 8-bits must represent the integer component of the LOD, and at least 8-bits must represent the fractional component of an LOD (2^8 subdivisions).

在 mipmap 选择中,从浮点数转换后,至少有 8 位必须表示 LOD 的整数分量,并且至少 8 位必须表示 LOD 的小数分量(28个细分)。

See the discussion in the Fixed Point Integers(3.2.4) section on how fixed point numbers should be defined and how it relates to texture coordinate precision.

请参阅“定点整数” (3.2.4) 部分中的讨论,了解应如何定义定点数以及它与纹理坐标精度的关系。
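
To make the 8-bit subtexel requirement concrete, here is a small sketch that converts a normalized coordinate into an integer texel index plus an 8-bit fraction. It is an illustration under assumptions, not the spec's conversion: the names `TexelAddress` and `ToTexelAddress` are hypothetical, the half-texel shift assumes texel centers sit at half-integer positions, and round-half-up via `floor(x + 0.5)` is used purely for simplicity (see Fixed Point Integers(3.2.4) for the actual rules).

```cpp
#include <cmath>
#include <cstdint>

struct TexelAddress {
    int32_t  texel;    // integer texel index, before any addressing mode is applied
    uint32_t fraction; // subtexel position in 1/256 steps, 0..255
};

inline TexelAddress ToTexelAddress(float u, int32_t textureSize) {
    // Scale to texel space and shift so integers line up with texel centers.
    float scaled = u * static_cast<float>(textureSize) - 0.5f;
    // Quantize to 8 fractional bits (assumed rounding: half up).
    int32_t fixed = static_cast<int32_t>(std::floor(scaled * 256.0f + 0.5f));
    // Floor division keeps the fraction non-negative for negative coordinates.
    int32_t texel = static_cast<int32_t>(std::floor(static_cast<float>(fixed) / 256.0f));
    return { texel, static_cast<uint32_t>(fixed - texel * 256) };
}
```

For a 4-texel-wide texture, `u = 0.5` lands exactly halfway between texels 1 and 2, so the sketch yields texel 1 with fraction 128.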

7.18.16.2 Texture Filtering Arithmetic Precision

7.18.16.2 纹理过滤算精度

All of the texture filtering operations in D3D11.3, when being performed on floating point formats (regardless of format width), are required to follow the D3D11.3 Floating Point Rules(3.1), with one exception: When a filter weight of 0.0 is encountered, NaN's or signed zeros may or may not be propagated from the source texture.

D3D11.3 中的所有纹理过滤操作在浮点格式(无论格式宽度如何)上执行时,都需要遵循 D3D11.3 浮点规则 (3.1) ,但有一个例外:当遇到过滤器权重为 0.0 时,NaN 或有符号零可能会也可能不会从源纹理传播。

Texture filtering operations performed on fixed point formats must be done with at least as much precision as the format.

对定点格式执行的纹理过滤操作必须至少具有与格式相同的精度。

7.18.16.3 General Texture Sampling Invariants

7.18.16.3 常规纹理采样不变量

Here are some general observations about things that can be expected of texture filtering operations.

以下是关于可以预期的纹理过滤操作的一些一般性观察。

1.Point sampling always yields a single texel, exactly.

1.点采样总是精确地产生一个纹素。

2.On a filter that samples from multiple texels, the output must fall between the min and the max values of the texels accessed. Consequently, filtering a constant-color texture always yields that color. The one exception, as stated here(22.4.15), is that when point sampling a denormalized 32-bit float value, the result may or may not be flushed.

2.在从多个纹素采样的过滤器上,输出必须介于所访问纹素的最小值和最大值之间。因此,过滤恒定颜色的纹理始终会产生该颜色。唯一的例外是,如本文 (22.4.15) 所述,对非规范化的 32 位浮点值进行点采样,结果可能会刷新,也可能不会刷新。

3.In a bilinear filter operation, colors must increase or decrease monotonically as a function of the U or V filter weights.

3.在双线性过滤器操作中,颜色作为 U 或 V 过滤器权重的函数,必须单调增加或减少。
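
The invariants above can be observed on a simple software bilinear filter. The sketch below is only an illustration: `Lerp` and `Bilinear` are hypothetical helper names, and the nested-lerp formulation is one possible arrangement of the arithmetic, not a statement of how hardware must compute it.

```cpp
// Linear interpolation written so that equal endpoints return that value exactly.
inline float Lerp(float a, float b, float t) { return a + t * (b - a); }

// Bilinear filter over a 2x2 footprint: t00/t10 are the top row, t01/t11 the
// bottom row; wu and wv are the U and V filter weights in [0, 1].
inline float Bilinear(float t00, float t10, float t01, float t11, float wu, float wv) {
    return Lerp(Lerp(t00, t10, wu), Lerp(t01, t11, wu), wv);
}
```

Feeding four identical texels returns that same value for any weights, and sweeping `wu` from 0 to 1 across a 0-to-1 edge produces a non-decreasing result, in line with invariants 2 and 3.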

7.18.17 Sampling Unbound Data

7.18.17 采样未绑定数据

Sampling from a slot with no texture bound returns 0 in all components.

从没有纹理绑定的插槽中采样时,所有组件均返回 0。

7.19 Subroutines / Interfaces

7.19 子程序/接口

7.19.1 Overview

7.19.1 概述

The programmable graphics pipeline has given software developers greatly enhanced flexibility and power. As a result, shader programming has evolved to the point where programmers need to combine multiple code building blocks (i.e. subroutines) on the fly. Current approaches generally cause the static creation of thousands of one-off shaders, each using a particular combination of subroutines to realize a specific effect. The use of flow control and looping can reduce the number of these precompiled combinations, but these techniques have a dramatic effect on the runtime performance of the shader code, and applications are still sensitive to the extra instructions and registers used in common shaders. Furthermore, since the shader programs are "kernels" or inner loops, any extra overhead for trying to reuse the same instruction stream to represent multiple combinations is more noticeable than in more traditional CPU code. The application developer has no way of knowing when it is safe, in regards to performance, to use flow control to mitigate code complexity. This leads to a different performance problem: dealing with thousands of shaders.

可编程图形流水线大大增强了软件开发人员的灵活性和功能。因此,着色器编程已经发展到程序员需要动态组合多个代码构建块(即子例程)的地步。当前的方法通常会导致静态创建数千个一次性着色器,每个着色器都使用特定的子程序组合来实现特定的效果。使用流控制和循环可以减少这些预编译组合的数量,但这些技术对着色器代码的运行时性能有巨大影响,并且应用程序仍然对常见着色器中使用的额外指令和寄存器敏感。此外,由于着色器程序是’内核’或内部循环,尝试重用相同的指令流来表示多种组合的一些额外开销,在更传统的CPU代码中会更加明显。

应用程序开发者无法知道何时可以安全地使用流控制来减轻代码复杂性,以便不影响性能。这导致了一个不同的性能问题:处理数千个着色器。

The goal of this feature is to allow applications to have a simple, expressive programming model that abstracts away this combinatoric complexity while still achieving the performance of the custom precompiled shaders. To achieve this goal, we move the complexity from the application level to the driver level where hardware-specific knowledge can be utilized to reduce program size and complexity.

此功能的目标是允许应用程序拥有简单、富有表现力的编程模型,该模型抽象出这种组合复杂性,同时仍能实现自定义预编译着色器的性能。为了实现这一目标,我们将复杂性从应用程序级别转移到驱动程序级别,从而可以利用特定于硬件的知识来减小程序大小和复杂性。

To satisfy the performance requirements of inner loop code, the overhead of calling conventions and lost optimizations needs to be addressed. Our method avoids the overhead by using a subroutine model that virtually "inlines" the functions that can be called. This is done by compiling code normally up to a call site, and then compiling all possible callees with the current state of the caller. The functions called would then be optimized for the current register state by mapping inputs and outputs to their current register locations. While this approach increases overall program size, it avoids the cost of both parameter passing and stack save/restore, thereby avoiding the overhead of traditional function calls while preserving runtime flexibility.

为了满足内部循环代码的性能要求,需要解决调用约定和丢失优化的开销。我们的方法通过使用子例程模型来避免开销,该模型实际上“内联”了可以调用的函数。这是通过正常编译代码直到调用点,然后使用调用者的当前状态编译所有可能的被调用函数来完成的。然后,被调用的函数将根据当前寄存器状态进行优化,将输入和输出映射到它们当前的寄存器位置。虽然这种方法增加了整体程序大小,但它避免了参数传递和堆栈保存/恢复的成本,从而避免了传统函数调用的开销,同时保持了运行时的灵活性。

The IL ASM has code blocks that act and look like subroutines; there are defined in/out parameters, and registers are all local (in/out/temp/scratch). Some global references remain: textures, constant buffers, and samplers. The main difference from normal subroutines is that each location that can call a subroutine has a declaration describing the call destinations that are possible.

IL(Intermediate Language)汇编代码块类似于子程序,具有定义的输入/输出参数,而寄存器都是局部的(输入/输出/临时/临时存储)。一些全局引用仍然存在:纹理、常量缓冲区和采样器。与普通子例程的主要区别在于,每个可以调用子例程的位置都有一个声明,用于描述可能的调用目标。

The set of functions to call when executing a given shader program can be changed between draw calls when calling SetShader. When binding the shader program to the pipeline, the list of functions to use is specified. Selecting the set of functions to use between draw calls allows the driver to recalculate the hardware requirements for a specified set of functions. Calculating the true number of registers required for a given "specialization" of a shader provides the combined flexibility of choice at runtime and the performance of a specialized shader.

在调用SetShader时,执行给定着色器程序时要调用的函数集可以在绘制调用之间更改。将着色器程序绑定到管道时,将指定要使用的函数列表。在绘制调用之间选择要使用的函数集,可以让驱动程序重新计算指定函数集的硬件需求。计算给定着色器的’专门化’所需的真正寄存器数量,提供了在运行时选择的灵活性和专门化着色器的性能的组合。
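
As a rough illustration of why knowing the selected function set matters, the sketch below estimates the register requirement of one specialization. It rests on simplifying assumptions that are not taken from the spec: a single, non-nested call site per selected function, callee locals placed directly after the caller's registers, and a hypothetical helper name `RegistersForSpecialization`.

```cpp
#include <algorithm>
#include <vector>

// callerRegisters: registers live in the caller at the call sites.
// selectedCalleeLocals: local register counts of the functions actually bound
// for this draw (one entry per call site in this simplified model).
inline int RegistersForSpecialization(int callerRegisters,
                                      const std::vector<int>& selectedCalleeLocals) {
    int worst = 0;
    for (int locals : selectedCalleeLocals) {
        worst = std::max(worst, locals);   // only the largest selected callee matters
    }
    return callerRegisters + worst;
}
```

If a caller holds 4 registers and the bound function needs 2 locals, this model gives 6 registers, whereas budgeting for an unselected 5-local alternative would have required 9.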

7.19.2 Differences from 'Real' Subroutines

7.19.2 与“真实”子例程的区别

The primary difference of this approach from "real" subroutines is that at runtime no calling convention is used. Each time a function could be called, a version of the function is emitted to match the caller’s register and other state. Since a new version of the callee is emitted for each location in the caller code that the function is called from, all optimizations used when inlining apply, except that callee code must remain functionally separate from caller code.

此方法与“实际”子例程的主要区别在于,在运行时不使用调用约定。每次可以调用函数时,都会发出一个版本的函数以匹配调用方的寄存器和其他状态。由于调用函数的调用方代码中的每个位置都会发出新版本的被调用方,因此内联时使用的所有优化都适用,但被调用方代码在功能上必须与调用方代码保持独立。

Take an example: The main function has an fcall(22.7.19) instruction and that fcall instruction has two function implementations that could be called. When generating the microcode for the program to execute, the code is generated up to the fcall routine and the current state of the registers and other shader state is stored off in "StateBeforeCall". Then code is generated for the first function that can be called starting with the current state of register allocation, scratch registers, etc. Next the current state is restored to StateBeforeCall and the code for the second function is generated. Finally the current state is restored to StateBeforeCall again and the impacts of the outputs of the fcall are applied to the current state, and code generation continues after the fcall.

举个例子:main 函数有一个 fcall 指令,该 fcall (22.7.19) 指令有两个可以调用的函数实现。在生成要执行的程序的微码时,代码将生成到 fcall 例程,并且寄存器的当前状态和其他着色器状态存储在“StateBeforeCall”中。然后为第一个函数生成代码,该函数可以从寄存器分配、暂存寄存器等的当前状态开始调用。接下来,将当前状态还原为 StateBeforeCall,并生成第二个函数的代码。最后,将当前状态再次恢复到 StateBeforeCall,并将 fcall 输出的影响应用于当前状态,并在 fcall 之后继续生成代码。
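
The save/emit/restore sequence in this example can be sketched as follows. This is a toy model and not a driver implementation: `CodeGenState`, `Callee`, `Emitted`, and `EmitCallSite` are hypothetical names, the compiler state is reduced to a single "next free register" counter, and the effect of the fcall's outputs is modeled simply as reserving `outputRegisters` registers.

```cpp
#include <string>
#include <vector>

// Hypothetical compiler state: only the next free register in this sketch.
struct CodeGenState { int nextFreeRegister; };

struct Callee  { std::string name; int localRegisters; };
struct Emitted { std::string name; int firstRegister; int endRegister; };

// Emits one specialized body per possible callee for a single fcall site.
// Every body starts from the same saved state, so callee locals are placed
// directly after the caller's registers and nothing needs saving or restoring.
inline std::vector<Emitted> EmitCallSite(CodeGenState& state,
                                         const std::vector<Callee>& callees,
                                         int outputRegisters) {
    const CodeGenState stateBeforeCall = state;      // "StateBeforeCall"
    std::vector<Emitted> bodies;
    for (const Callee& c : callees) {
        state = stateBeforeCall;                     // restore before each body
        int first = state.nextFreeRegister;
        state.nextFreeRegister += c.localRegisters;  // stand-in for generating the body
        bodies.push_back({ c.name, first, state.nextFreeRegister });
    }
    state = stateBeforeCall;                         // restore one final time
    state.nextFreeRegister += outputRegisters;       // apply the fcall's outputs
    return bodies;
}
```

Both emitted bodies begin at the same register, and the state seen by the code after the fcall depends only on the call's outputs, not on which callee was generated last.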

Limitations are present in the IL that allow for the calling destination to have a version of a function’s microcode emitted using the current register knowledge of the caller to allocate the callee’s local registers after the caller’s registers so that no saving/restoring of data is required when crossing the function boundary.

在IL中存在限制,允许调用目标使用调用者当前的寄存器知识发出函数的微代码版本,以在调用者的寄存器之后分配被调用者的局部寄存器,从而在跨越函数边界时无需保存/恢复数据。

The downside compared with "real" subroutines is that the amount of code to represent the program can become quite large. No code sharing is done between multiple call sites. If code is larger than the code cache, and the miss latency is not hidden by some other mechanism, then "real" subroutines are very useful. Assuming that the code bloat size is minimal (i.e. each function is only ever called from one location), then performance will be better with the new method: there is no parameter passing overhead, inlining optimizations apply, etc.

“真实”子例程的缺点是表示程序的代码量可能会变得非常大。多个调用位置之间不会进行代码共享。如果代码大于代码缓存,并且失效延迟没有被其他机制隐藏,那么“真实”子例程非常有用。假设代码膨胀大小最小(即每个函数只从一个位置调用),那么使用新方法的性能会更好——没有参数传递开销、内联优化等。

Another problem with the new method is that all destinations must be known at compile time. Due to validation that is currently done, all calls will need to be known. As that requirement is relaxed, "real" subroutines are a better way of handling late binding destinations.

新方法的另一个问题是,在编译时必须知道所有目标。由于当前已完成的验证,因此需要知道所有调用。随着该要求的放宽,“真实”子例程是处理后期绑定目标的更好方法。

HLSL requires that all texture and sampler parameters be rooted in some well-known global object so that the compiler can determine which texture or sampler index to use for a particular texture or sampler variable throughout the entire program. As fcalls constitute a late-binding boundary, the compiler cannot easily track parameter identity, and thus texture and sampler arguments to fcalls are not allowed. Note that when only concrete classes are used this isn’t a problem. Additionally, texture and sampler members of classes should be allowed; this limitation only applies to parameters to interface methods that are used with full fcall dispatch.

HLSL 要求所有纹理和采样器参数都植根于某个已知的全局对象,以便编译器可以确定在整个程序中对特定纹理或采样器变量使用哪个纹理或采样器索引。由于 fcalls 构成了后期绑定边界,编译器无法轻松跟踪参数标识,因此不允许对 fcalls 使用纹理和采样器参数。请注意,当仅使用具体类时,这不是一个问题。此外,应允许类的纹理和采样器成员,此限制仅适用于完全 fcall 调用使用的接口方法的参数。

Also see the related topics Uniform Indexing of Resources and Samplers(7.11) as well as the this[](22.7.20) register.

另请参阅相关主题 Unified Indexing of Resources and Samplers (7.11) 以及 this[] (22.7.20) 寄存器。

7.19.3 Subroutines: Non-goals

7.19.3 子程序:非目标

  1. Fast linking
  2. 快速链接

Not intended for improving compilation time 

不用于缩短编译时间

  1. DLL support
  2. DLL 支持

Not intended for reuse of microcode for standard libraries 

不适用于标准库的微码重用

  1. Dynamic virtual functions 
  2. 动态虚函数

Changes to the functions called occur between draw calls – relatively low frequency

对调用的函数的更改发生在绘制调用之间 - 频率相对较低

7.19.4 Subroutines - Instruction Reference

7.19.4 子程序 - 指令参考

1.dcl_function_body (Function Body Declaration)(22.3.49)

2.dcl_function_table (Function Table Declaration)(22.3.50)

3.dcl_interface/dcl_interface_dynamicindexed (Interface Declaration)(22.3.51)

4.fcall fp#[arrayIndex][callSite](22.7.19)

5."this" Register(22.7.20)

7.19.5 Simple Example

7.19.5 简单示例

7.19.5.1 HLSL - Simple Example

7.19.5.1 HLSL - 简单示例

interface Light

    {

        float3 Calculate(float3 Position, float3 Normal);

    };

    class AmbientLight : Light

    {

        float3 Calculate(float3 Position, float3 Normal)

        {

            return AmbientValue;

        }

        float3 AmbientValue;

    };

    class DirectionalLight : Light

    {

        float3 Calculate(float3 Position, float3 Normal)

        {

            float3 LightDir = normalize(Position - LightPosition);

            float LightContrib = saturate( dot( Normal, -LightDir) );

            return LightColor * LightContrib;

        }

        float3 LightPosition;

        float3 LightColor;

    };

    AmbientLight MyAmbient;

    DirectionalLight MyDirectional;

    float4 main (Light MyInstance, float3 CurPos: CurPosition,

                 float3 Normal : Normal) : SV_Target

    {

        float4 Ret;

        Ret.xyz = MyInstance.Calculate(CurPos, Normal);

        Ret.w = 1.0;

        return Ret;

    }

7.19.5.2 IL - Simple Example

7.19.5.2 IL - 简单示例

Register:       this[]

Stage(s):       All(22.1.1)

Description:    Register that refers to 'this' data.

Operation:      'this' data associated with interface object instances is set at the API

                when any given shader is bound to the pipeline.  There are at most 253 slots

                for 'this' data.

                this’ 数据与接口对象实例关联,当任何给定的着色器绑定到管线时,API 设置了这些数据。最多有 253 个槽位用于 ‘this’ 数据。

                The number was chosen to put a bound on the size of the DDI for passing the data to the driver.

                This data can be considered from the point of view of a shader as a 253

                entry array of 32-bit per component 4 component read only registers.

选择这个数字是为了限制传递数据给驱动程序的 DDI 大小。从着色器的角度来看,这些数据可以视为一个 253 个条目的数组,每个条目都由 32 位每组件 4 组件的只读寄存器组成。

                The 4 components of a this[] register contain:

                x: UINT32 index for which constant buffer holds the instance data

                y: UINT32 base element offset of the instance data in the instance constant buffer.

                z: UINT32 base texture index

                w: UINT32 base sampler index

x: UINT32 表示常量缓冲区中保存实例数据的索引

y: UINT32 表示实例数据在实例常量缓冲区中的基本元素偏移量。

z: UINT32 表示基本纹理索引

w: UINT32 表示基本采样器索引
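
The four components of a this[] entry, and the way the IL uses them to locate instance data, can be modeled with the sketch below. The type and function names (`ThisData`, `Element`, `ConstantBuffer`, `LoadInstanceElement`) are hypothetical; only the x/y/z/w meanings come from the description above.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// One this[] entry, component by component.
struct ThisData {
    uint32_t constantBufferIndex; // x: which constant buffer holds the instance data
    uint32_t baseElementOffset;   // y: base element offset within that buffer
    uint32_t baseTextureIndex;    // z: base texture index
    uint32_t baseSamplerIndex;    // w: base sampler index
};

using Element        = std::array<float, 4>;  // one 4-component 32-bit element
using ConstantBuffer = std::vector<Element>;

// Models the addressing pattern cb[this.x][this.y + memberOffset].
inline const Element& LoadInstanceElement(const std::vector<ConstantBuffer>& cbs,
                                          const ThisData& self,
                                          uint32_t memberOffset) {
    return cbs[self.constantBufferIndex][self.baseElementOffset + memberOffset];
}
```

Using the instance table from the listing that follows (MyAmbient at CB 0, offset 0; MyDirectional at CB 0, offset 1), an entry of `{0, 1, 0, 0}` with `memberOffset = 1` would address element 2 of cb0, which is where LightColor is stored.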

//

// Generated by Microsoft (R) HLSL Shader Compiler 10.1

//

//

// Buffer Definitions:

//

// cbuffer $Globals

// {

//

//   struct AmbientLight

//   {

//       

//       float3 AmbientValue;           // Offset:    0

//

//   } MyAmbient;                       // Offset:    0 Size:    12

//   

//   struct DirectionalLight

//   {

//       

//       float3 LightPosition;          // Offset:   16

//       float3 LightColor;             // Offset:   32

//

//   } MyDirectional;                   // Offset:   16 Size:    28

//

// }

//

// interfaces $ThisPointer

// {

//

//   interface Light MyInstance;        // Offset:    0 Size:     1

//

// }

//

//

// Resource Bindings:

//

// Name                                 Type  Format         Dim      HLSL Bind  Count

// ------------------------------ ---------- ------- ----------- -------------- ------

// $Globals                          cbuffer      NA          NA            cb0      1

//

//

//

// Input signature:

//

// Name                 Index   Mask Register SysValue  Format   Used

// -------------------- ----- ------ -------- -------- ------- ------

// CurPosition              0   xyz         0     NONE   float   xyz

// Normal                   0   xyz         1     NONE   float   xyz

//

//

// Output signature:

//

// Name                 Index   Mask Register SysValue  Format   Used

// -------------------- ----- ------ -------- -------- ------- ------

// SV_Target                0   xyzw        0   TARGET   float   xyzw

//

//

// Available Class Types:

//

// Name                             ID CB Stride Texture Sampler

// ------------------------------ ---- --------- ------- -------

// DirectionalLight                  0         2       0       0

// AmbientLight                      1         1       0       0

//

// Available Class Instances:

//

// Name                        Type CB CB Offset Texture Sampler

// --------------------------- ---- -- --------- ------- -------

// MyAmbient                      1  0         0       -       -

// MyDirectional                  0  0         1       -       -

//

// Interface slots, 1 total:

//

//             Slots

// +----------+---------+---------------------------------------

// | Type ID  |   0     |0    1    

// | Table ID |         |0    1    

// +----------+---------+---------------------------------------

ps_5_0

dcl_globalFlags refactoringAllowed

dcl_constantbuffer CB0[3], immediateIndexed

// Function table for AmbientLight.

    dcl_function_body fb0

    dcl_function_table ft0 = { fb0 }

    // Function table for DirectionalLight.

    dcl_function_body fb1

    dcl_function_table ft1 = { fb1 }

    // main's MyInstance parameter.

    dcl_interface fp0[1][1] = { ft0, ft1 };

    // main shader code

// call AmbientLight or DirectionalLight based on function pointer bound

//float4 main (Light MyInstance, float3 CurPos: CurPosition,

//                 float3 Normal : Normal) : SV_Target

//{

//        float4 Ret;

//        Ret.xyz = MyInstance.Calculate(CurPos, Normal);

//        Ret.w = 1.0;

//        return Ret;

//}

    fcall fp0[0][0]

    mov o0.xyz, r0.xyzx

    mov o0.w, l(1.000000)

    ret

// AmbientLight::Calculate

//float3 Calculate(float3 Position, float3 Normal)

//{

//  return AmbientValue;

//}

    label fb0

    mov r0.w, this[0].y  // base element offset of the instance data in the instance constant buffer

    mov r1.x, this[0].x  // index of the constant buffer that holds the instance data

    mov r0.xyz, cb[r1.x + 0][r0.w + 0].xyzx

    ret

// DirectionalLight::Calculate

//float3 Calculate(float3 Position, float3 Normal)

//{

//  float3 LightDir = normalize(Position - LightPosition);

//  float LightContrib = saturate( dot( Normal, -LightDir) );

//  return LightColor * LightContrib;

//}

    label fb1

    mov r0.w, this[0].y

    mov r1.xyz, this[0].xyxx

    add r1.yzw, v0.xxyz, -cb[r1.z + 0][r1.y + 0].xxyz

    dp3 r2.x, r1.yzwy, r1.yzwy

    rsq r2.x, r2.x  //dest = 1.0f / sqrt(src0)

    mul r1.yzw, r1.yyzw, r2.xxxx

    dp3_sat r1.y, v1.xyzx, -r1.yzwy

    mul r1.xyz, r1.yyyy, cb[r1.x + 0][r0.w + 1].xyzx

    mov r0.xyz, r1.xyzx

    ret
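The addressing form used by both function bodies above, cb[this.x][this.y + member_offset], can be modeled on the CPU. The following is only a minimal sketch: the names ThisPointer, LoadMember and MakeSimpleExampleBuffers are hypothetical, the element values are made up for illustration, and the layout assumes the $Globals cbuffer shown in the listing (AmbientValue in element 0, LightPosition in element 1, LightColor in element 2).

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// One constant buffer element: a 4-tuple of 32-bit values (floats here).
using Element = std::array<float, 4>;

// Hypothetical model of a "this" pointer:
// x = constant buffer index, y = base element offset,
// z = base texture index,    w = base sampler index.
struct ThisPointer {
    std::uint32_t cbIndex;
    std::uint32_t cbOffset;
    std::uint32_t textureBase;
    std::uint32_t samplerBase;
};

// Mirrors the addressing form cb[this.x][this.y + memberOffset].
inline const Element& LoadMember(const std::vector<std::vector<Element>>& cbs,
                                 const ThisPointer& self,
                                 std::uint32_t memberOffset) {
    return cbs.at(self.cbIndex).at(self.cbOffset + memberOffset);
}

// Builds one constant buffer laid out like $Globals in the listing.
// The numeric values are illustrative only.
inline std::vector<std::vector<Element>> MakeSimpleExampleBuffers() {
    std::vector<Element> cb0;
    cb0.push_back(Element{0.1f, 0.1f, 0.1f, 0.0f});   // MyAmbient.AmbientValue
    cb0.push_back(Element{5.0f, 6.0f, 7.0f, 0.0f});   // MyDirectional.LightPosition
    cb0.push_back(Element{1.0f, 0.5f, 0.25f, 0.0f});  // MyDirectional.LightColor
    std::vector<std::vector<Element>> cbs;
    cbs.push_back(cb0);
    return cbs;
}
```

With MyAmbient bound as { 0, 0, -, - } and MyDirectional as { 0, 1, -, - }, the same member offset resolves to different elements, which is why one function body can serve any instance of its class.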

7.19.5.3 API - Simple Example

7.19.5.3 API - 简单示例

//create the shader

    //    and specify the class library to load class instance info into

    pDevice->CreatePixelShader(pShaderCode, pMyClassLinkage, &pMyPS);

    //get a handle to the MyDirectional and MyAmbient class instances

    //    from the class library

    //the zero is an array index for when the variable is an array.

    pMyClassLinkage->GetClassInstance(L"MyDirectional", 0, &pMyDirectionalLight);

    pMyClassLinkage->GetClassInstance(L"MyAmbient", 0, &pMyAmbientLight);

    while (true)

    {

        // select either the MyDirectional or MyAmbient class instance

        if (DirectionalLighting)

            pDevice->PSSetShader(pMyPS, &pMyDirectionalLight, 1);

        else

            pDevice->PSSetShader(pMyPS, &pMyAmbientLight, 1);

        RenderScene();

    }

7.19.6 Runtime API for Interfaces

7.19.6 接口的运行时 API

7.19.6.1 Overview

7.19.6.1概述

The programming model for subroutines is an interface-driven model. The interface provides the definition of the function tables that can be switched between efficiently. A level of data abstraction is also present to allow for swapping of both data and function pointers during SetShader calls. At SetShader time, an array of class instances is specified that corresponds to the interfaces used by the shader. The shader reflection system specifies information for each entry in the required interface array. A runtime reflection API is required so that class instances can be specified in a way that the runtime can efficiently map to function pointers for the driver calls to consume. The runtime API does not need to be complex; it only needs to provide handles to class instances.

子例程的编程模型是接口驱动的模型。该接口提供了函数表的定义,该函数表可以高效切换。还存在数据抽象级别,以允许在 SetShader 调用期间交换数据和函数指针。在 SetShader 时,指定了与着色器使用的接口相对应的类实例化数组。着色器反射系统为所需接口数组中的每个条目指定信息。运行时反射 API 需要能够指定类实例,以便运行时可以有效地将该实例映射到驱动程序调用的函数指针。运行时 API 不需要很复杂,只需为类实例提供句柄的方法即可。

The runtime API has only one goal: provide a handle to SetShader that can be used efficiently to tell the driver which functions should be executed for a given shader bind. To achieve this goal, a collection of class information is required if the class instance handles are to be shared across multiple shaders, i.e. between all shaders within an effect. When a shader is created, an ID3D11ClassLinkage is passed as a new parameter that specifies where to add the class metadata. If the same class library is specified for two shaders, then the same class instance handles are used when binding either shader. The collection of class metadata could be global to a given device, but that could become cumbersome when mixing large collections of shaders (i.e. keeping one middleware solution separate from another).

运行时 API 只有一个目标:为 SetShader 提供句柄,该句柄可用于有效地向驱动程序指定应为给定着色器绑定执行哪些函数。为了实现此目标,如果要在多个着色器之间共享类实例句柄,即在效果中的所有着色器之间共享类信息,则需要类信息的集合。创建着色器时,ID3D11ClassLinkage 是一个新参数,用于指定指定将类元数据添加到何处。如果为两个着色器指定了相同的类库,则在绑定任一着色器时将使用相同的类实例句柄。

类元数据的集合对于给定设备可能是全局的,但在混合大量着色器时,这可能会变得很麻烦(即将中间件解决方案与另一个中间件解决方案分开)。
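The sharing behavior described above can be pictured as a name-to-handle collection that shaders add to when they are created. The sketch below is a CPU-side model only, not the runtime's implementation: ClassInstance, ClassLinkage and AddShaderMetadata are hypothetical names, and it assumes that class instances are identified purely by name.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a class instance handle.
struct ClassInstance {
    std::string name;
};

// Hypothetical stand-in for a class linkage: a collection of class
// metadata that several shaders contribute to and share.
class ClassLinkage {
public:
    // Models shader creation: record the instances the shader declares.
    // An instance that is already known keeps its existing handle.
    void AddShaderMetadata(const std::vector<std::string>& instanceNames) {
        for (const std::string& n : instanceNames) {
            if (instances_.find(n) == instances_.end()) {
                instances_[n] = std::make_shared<ClassInstance>(ClassInstance{n});
            }
        }
    }

    // Returns the shared handle, or nullptr if no shader declared the name.
    std::shared_ptr<ClassInstance> GetClassInstance(const std::string& name) const {
        auto it = instances_.find(name);
        if (it == instances_.end()) {
            return nullptr;
        }
        return it->second;
    }

private:
    std::map<std::string, std::shared_ptr<ClassInstance>> instances_;
};
```

Keeping the collection in an object the application creates, rather than in the device, is what lets two unrelated sets of shaders keep their class names apart: each set simply uses its own linkage.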

7.19.6.2 Prototype of changes

7.19.6.2 变更原型

interface ID3D11ClassLinkage : IUnknown

    {

    // PRIMARY FUNCTION - get a reference to an instance of a class

    //    that exists in a shader.  The common scenario is to refer to

    //    variables declared in shaders, which means that a reference is

    //    acquired with this function and then passed in on SetShader

        HRESULT GetClassInstance(

            WCHAR *pszClassInstanceName,

            UINT uInstanceIndex,

            ID3D11ClassInstance **pClassInstance);

    //  Create a class instance reference that is the combination of a class

    //    type and the location of the data to use for the class instance

    //      - not the common scenario, but useful in case the data location

    //        for a class is dynamic or not known until runtime

        HRESULT CreateClassInstance(

            WCHAR *pszClassTypeName,

            UINT ConstantBufferOffset,

            UINT ConstantVectorOffset,

            UINT TextureOffset,

            UINT SamplerOffset,

            ID3D11ClassInstance **pClassInstance);

    }

    //  Specifying the calls in "10 speak".  Use the following as an example

    //    of how one could retrofit D3D10 and then put that into the D3D11 API

    //    i.e. ignoring split of Creates off of device, new stages, etc.

    interface ID3D11Device

    {

        [ … Existing calls … ]

    //  Shader create calls take a parameter to specify the class library

    //     to append the class symbol information from the shader into

    //     this is a NON-OPTIONAL parameter.  A shader is unusable without

    //     the function table information being used (assuming it has any)

        HRESULT CreateVertexShader(

            void *pShaderBytecode,

            SIZE_T BytecodeLength,

            ID3D11ClassLinkage *pClassLinkage,

            ID3D11VertexShader **ppVertexShader);

        HRESULT CreateGeometryShader(

            void *pShaderBytecode,

            SIZE_T BytecodeLength,

            ID3D11ClassLinkage *pClassLinkage,

            ID3D11GeometryShader **ppGeometryShader);

        HRESULT CreatePixelShader(

            void *pShaderBytecode,

            SIZE_T BytecodeLength,

            ID3D11ClassLinkage *pClassLinkage,

            ID3D11PixelShader **ppPixelShader);

    // Not shown: Similar to above for Hull Shader, Domain Shader and Compute Shader

        HRESULT CreateClassLinkage(

            ID3D11ClassLinkage **ppClassLinkage);

    //  Shader bind calls take an extra array to specify the function tables

    //      to use until the next bind shader call

        void VSSetShader(

            ID3D11VertexShader *pShader,

            ID3D11ClassInstance **ppClassInstances,

            UINT NumInstances);

        void GSSetShader(

            ID3D11GeometryShader *pShader,

            ID3D11ClassInstance **ppClassInstances,

            UINT NumInstances);

        void PSSetShader(

            ID3D11PixelShader *pShader,

            ID3D11ClassInstance **ppClassInstances,

            UINT NumInstances);

        // Not shown: Similar to above for Hull Shader, Domain Shader and Compute Shader

    }

7.19.7 Complex Example

7.19.7 复杂示例

7.19.7.1 HLSL - Complex Example

7.19.7.1 HLSL - 复杂示例

interface Light

    {

        float3 Calculate(float3 Position, float3 Normal);

    };

    class AmbientLight : Light

    {

        float3 m_AmbientValue;

        float3 Calculate(float3 Position, float3 Normal)

        {

            return m_AmbientValue;

        }

    };

    class DirectionalLight : Light

    {

        float3 m_LightDir;

        float3 m_LightColor;

        float3 Calculate(float3 Position, float3 Normal)

        {

            float LightContrib = saturate( dot( Normal, -m_LightDir) );

            return m_LightColor * LightContrib;

        }

    };

    uint g_NumLights;

    uint g_LightsInUse[4];

    Light g_Lights[9];

    float3 AccumulateLighting(float3 Position, float3 Normal)

    {

        float3 Color = 0;

        for (uint i = 0; i < g_NumLights; i++)

        {

            Color += g_Lights[g_LightsInUse[i]].Calculate(Position, Normal);

        }

        return Color;

    }

    interface Material

    {

        void Perturb(in out float3 Position, in out float3 Normal, in out float2 TexCoord);

        float3 CalculateLitColor(float3 Position, float3 Normal, float2 TexCoord);

    };

    class FlatMaterial : Material

    {

        float3 m_Color;

        void Perturb(in out float3 Position, in out float3 Normal, in out float2 TexCoord)

        {

        }

        float3 CalculateLitColor(float3 Position, float3 Normal, float2 TexCoord)

        {

            return m_Color * AccumulateLighting(Position, Normal);

        }

    };

    class TexturedMaterial : Material

    {

        float3 m_Color;

        Texture2D<float3> m_Tex;

        sampler m_Sampler;

        void Perturb(in out float3 Position, in out float3 Normal, in out float2 TexCoord)

        {

        }

        float3 CalculateLitColor(float3 Position, float3 Normal, float2 TexCoord)

        {

            float3 Color = m_Color;

            Color *= m_Tex.Sample(m_Sampler, TexCoord) * 0.1234;

            Color *= AccumulateLighting(Position, Normal);

            return Color;

        }

    };

    class StrangeMaterial : Material

    {

        void Perturb(in out float3 Position, in out float3 Normal, in out float2 TexCoord)

        {

            Position += Normal * 0.1;

        }

        float3 CalculateLitColor(float3 Position, float3 Normal, float2 TexCoord)

        {

            return AccumulateLighting(Position, Normal);

        }

    };

    float TestValueFromLight(Light Obj, float3 Position, float3 Normal)

    {

        float3 Calc = Obj.Calculate(Position, Normal);

        return saturate(Calc.x + Calc.y + Calc.z);

    }

    AmbientLight g_Ambient0;

    DirectionalLight g_DirLight0;

    DirectionalLight g_DirLight1;

    DirectionalLight g_DirLight2;

    DirectionalLight g_DirLight3;

    DirectionalLight g_DirLight4;

    DirectionalLight g_DirLight5;

    DirectionalLight g_DirLight6;

    DirectionalLight g_DirLight7;

    FlatMaterial g_FlatMat0;

    TexturedMaterial g_TexMat0;

    StrangeMaterial g_StrangeMat0;

    float4 main (

        Material MyMaterial,

        float3 CurPos: CurPosition,

        float3 Normal : Normal,

        float2 TexCoord : TexCoord0) : SV_Target

    {

        float4 Ret;

        if (TestValueFromLight(g_DirLight0, CurPos, Normal) > 0.5)

        {

            MyMaterial.Perturb(CurPos, Normal, TexCoord);

        }

        Ret.xyz = MyMaterial.CalculateLitColor(CurPos, Normal, TexCoord);

        Ret.w = 1;

        return Ret;

    }

7.19.7.2 IL - Complex Example

7.19.7.2 IL - 复杂示例

   // This pointers are a four-element vector with indices for

    // which constant buffer holds the instance data (.x element),

    // the base offset of the instance data in the instance constant

    // buffer, the base texture index and the base sampler index.

    // Basic instance members will therefore be referenced with

    // cb[r0.x][r0.y + member_offset].

    // This pointers can be in arrays so the first [] index

    // can also have a register to indicate array access.

    //

    //

    // For this example assume that globals are put in cbuffers

    // in the following order.  Entries are offset:size in

    // register (four-component) units.

    //

    // cb0:

    //     0:1 - g_NumLights.

    //     1:4 - g_LightsInUse.

    //     5:1 - g_Ambient0.

    //     6:2 - g_DirLight0.

    //     8:2 - g_DirLight1.

    //    10:2 - g_DirLight2.

    //    12:2 - g_DirLight3.

    //    14:2 - g_DirLight4.

    //    16:2 - g_DirLight5.

    //    18:2 - g_DirLight6.

    //    20:2 - g_DirLight7.

    //    22:1 - g_FlatMat0.

    //    23:1 - g_TexMat0.

    //

    // g_StrangeMat0 takes no space.

    //

    // interfaces:

    //     0:1 - MyMaterial.

    //     1:9 - g_Lights.

    //

    // textures:

    //     0:1 - g_TexMat0.

    //

    // samplers:

    //     0:1 - g_TexMat0.

    //

    // The this pointers for the concrete objects would then be:

    // g_Ambient0:    { 0,  5, -, - }

    // g_DirLight0:   { 0,  6, -, - }

    // g_DirLight1:   { 0,  8, -, - }

    // g_DirLight2:   { 0, 10, -, - }

    // g_DirLight3:   { 0, 12, -, - }

    // g_DirLight4:   { 0, 14, -, - }

    // g_DirLight5:   { 0, 16, -, - }

    // g_DirLight6:   { 0, 18, -, - }

    // g_DirLight7:   { 0, 20, -, - }

    // g_FlatMat0:    { 0, 22, -, - }

    // g_TexMat0:     { 0, 23, 0, 0 }

    // g_StrangeMat0: { -,  -, -, - }

    //

    //

    // Function bodies are declared explicitly so

    // that it’s known in advance which bodies exist

    // and how many bodies there are overall.

    //

    dcl_function_body fb0

    dcl_function_body fb1

    dcl_function_body fb2

    dcl_function_body fb3

    dcl_function_body fb4

    dcl_function_body fb5

    dcl_function_body fb6

    dcl_function_body fb7

    dcl_function_body fb8

    dcl_function_body fb9

    dcl_function_body fb10

    dcl_function_body fb11

    //

    // Function tables work similarly to vtables for C++ except

    // that a table has an entry per call site for an interface

    // instead of per method.

    //

    // Function table for AmbientLight.

    // One call site in AccumulateLighting multiplied by three calls of

    // AccumulateLighting from CalculateLitColor.

    dcl_function_table ft0 { fb3, fb6, fb9 }

    // Function table for DirectionalLight.

    // One call site in AccumulateLighting multiplied by three calls of

    // AccumulateLighting from CalculateLitColor.

    dcl_function_table ft1 { fb4, fb7, fb10 }

    // Function table for FlatMaterial.

    // One call to Perturb in main and one call to CalculateLitColor in main.

    dcl_function_table ft2 { fb0, fb5 }

    // Function table for TexturedMaterial.

    // One call to Perturb in main and one call to CalculateLitColor in main.

    dcl_function_table ft3 { fb1, fb8 }

    // Function table for StrangeMaterial.

    // One call to Perturb in main and one call to CalculateLitColor in main.

    dcl_function_table ft4 { fb2, fb11 }

    //

    // Function table pointers.  Each of these needs to be bound before

    // the shader is usable.  The idea is that binding gives

    // a reference to one of the function tables above so that

    // the method slots can be filled in.

    // The compiler will not generate pointers for unreferenced objects.

    //

    // A function table pointer has a full set of method slots to

    // avoid the extra level of indirection that a C++ pointer-to-

    // pointer-to-vtable representation would require (that would also

    // require that this pointers be 5-tuples).  In the HLSL virtual

    // inlining model it's always known what global variable/input is

    // used for a call so we can set up tables per root object.

    //

    // Function pointer decls indicate which function tables are

    // legal to use with them.  This also allows derivation of

    // method correlation information.

    //

    // The first [] of an interface decl is the array size.

    // If dynamic indexing is used the decl will indicate

    // that, as shown below.  An array of interface pointers can

    // be indexed statically also, it isn’t required that

    // arrays of interface pointers mean dynamic indexing.

    //

    // Numbering of interface pointers takes array size into

    // account, so the first pointer after a four entry

    // array fp6[4][1] would be fp10.

    //

    // The second [] of an interface decl is the number

    // of call sites, which must match the number of bodies in

    // each table referenced in the decl.

    //

    // main's MyMaterial parameter.

    dcl_interface fp0[1][2] = { ft2, ft3, ft4 };

    // g_Lights entries.

    dcl_interface_dynamicindexed fp1[9][3] = { ft0, ft1 };

    // main routine.

    // TestValueFromLight is a regular routine and is inlined.

    // The Calculate reference inside of it is passed the concrete

    // instance DirLight0 so it is devirtualized and inlined.

    dp3_sat r0.x, v1.xyzx, -cb0[6].xyzx

    mul r0.yz, r0.xxxx, cb0[7].xxyx

    add r0.y, r0.z, r0.y

    mad_sat r0.x, cb0[7].z, r0.x, r0.y

    // The return of TestValueFromLight is tested.

    lt r0.x, l(0.500000), r0.x

    if_nz r0.x

      // The call to Perturb is a full fcall

      fcall fp0[0][0]

      mov r2.xyz, r0.xyzx

      mov r0.x, r0.w

      mov r0.y, r1.x

    else

      mov r2.xyz, v1.xyzx

      mov r0.xy, v2.xyxx

    endif

    // The call to CalculateLitColor is a full fcall.

    fcall fp0[0][1]

    mov o0.xyz, r1.xyzx

    mov o0.w, l(1.000000)

    ret

    //

    // Function bodies.

    //

    // FlatMaterial version of main's call to Perturb.

    label fb0

    mov r0.xyz, v1.xyzx

    mov r0.w, v2.y

    mov r1.x, v2.x

    ret

    // TexturedMaterial version of main's call to Perturb.

    label fb1

    mov r0.xyz, v1.xyzx

    mov r0.w, v2.x

    mov r1.x, v2.y

    ret

    // StrangeMaterial version of main's call to Perturb.

    // NOTE: Position is not used later so the compiler has killed

    // the update to Position from this body.

    label fb2

    mov r0.xyz, v1.xyzx

    mov r0.w, v2.x

    mov r1.x, v2.y

    ret

    // AmbientLight version of FlatMaterial.CalculateLitColor-calls-

    // AccumulateLighting's call to Calculate.

    // NOTE: the Calculate bodies all look superficially

    // identical but all are different.  In one case

    // the array index is r1 and the return value is r4,

    // in one case the array index is r1 and the return value

    // is r5 and in the last case the array index is in r0

    // and the return is in r5.  Bodies are not interchangeable.

    label fb3

    // Array index is r1, return is r4.

    mov r2.w, this[r1.w + 1].y

    mov r1.w, this[r1.w + 1].x

    mov r4.xyz, cb[r1.w + 0][r2.w + 0].xyzx

    ret

    // DirectionalLight version of FlatMaterial.CalculateLitColor-calls-

    // AccumulateLighting's call to Calculate.

    label fb4

    // Array index is r1, return is r4.

    mov r2.w, this[r1.w + 1].y

    mov r3.w, this[r1.w + 1].x

    mov r4.w, this[r1.w + 1].y

    mov r5.x, this[r1.w + 1].x

    dp3_sat r4.w, r2.xyzx, -cb[r5.x + 0][r4.w + 0].xyzx

    mul r5.xyz, r4.wwww, cb[r3.w + 0][r2.w + 1].xyzx

    mov r4.xyz, r5.xyzx

    ret

    // FlatMaterial version of main's call to CalculateLitColor.

    label fb5

    // AccumulateLighting is inlined.

    mov r3.xyz, l(0,0,0,0)

    mov r0.w, l(0)

    loop

      // g_NumLights is cb0[0].

      uge r1.w, r0.w, cb0[0].x

      breakc_nz r1.w

      // Get g_Lights[g_LightsInUse[i]].

      // g_LightsInUse is cb0[1-4].

      // g_Lights is cb0[5-13].

      mov r1.w, cb0[r0.w + 1].x

      // Call Calculate.  Array index is r1.

      fcall fp1[r1.w + 0][0]

      // Return is expected in r4.

      mov r0.xyz, r4.xyzx

      add r3.xyz, r3.xyzx, r0.xyzx

      iadd r0.w, r0.w, l(1)

    endloop

    // Multiply times color.

    mov r0.xy, this[0].yxyy

    mul r0.xyz, r3.xyzx, cb[r0.y + 0][r0.x + 0].xyzx

    mov r1.xyz, r0.xyzx

    ret

    // AmbientLight version of TexturedMaterial.CalculateLitColor-calls-

    // AccumulateLighting's call to Calculate.

    label fb6

    // Array index is r1, return is r5.

    mov r2.w, this[r1.w + 1].y

    mov r1.w, this[r1.w + 1].x

    mov r5.xyz, cb[r1.w + 0][r2.w + 0].xyzx

    ret

    // DirectionalLight version of TexturedMaterial.CalculateLitColor-calls-

    // AccumulateLighting's call to Calculate.

    label fb7

    // Array index is r1, return is r5.

    mov r2.w, this[r1.w + 1].y

    mov r3.w, this[r1.w + 1].x

    mov r4.w, this[r1.w + 1].y

    mov r5.w, this[r1.w + 1].x

    dp3_sat r4.w, r2.xyzx, -cb[r5.w + 0][r4.w + 0].xyzx

    mul r6.xyz, r4.wwww, cb[r3.w + 0][r2.w + 1].xyzx

    mov r5.xyz, r6.xyzx

    ret

    // TexturedMaterial version of main's call to CalculateLitColor.

    label fb8

    // Texture sample.

    mov r4.xy, this[0].zw

    sample r0.xyz, v2.xy, t[r4.x].xyz, s[r4.y]

    mul r0.xyz, r0.xyzx, l(0.123400, 0.123400, 0.123400, 0.000000)

    // m_Color multiplied by texture sample.

    mov r0.w, this[0].y

    mov r1.w, this[0].x

    mul r0.xyz, r0.xyzx, cb[r1.w + 0][r0.w + 0].xyzx

    // AccumulateLighting is inlined.

    mov r4.xyz, l(0,0,0,0)

    mov r0.w, l(0)

    loop

      // g_NumLights is cb0[0].

      uge r1.w, r0.w, cb0[0].x

      breakc_nz r1.w

      // Get g_Lights[g_LightsInUse[i]].

      // g_LightsInUse is cb0[1-4].

      // g_Lights is cb0[5-13].

      mov r1.w, cb0[r0.w + 1].x

      // Call Calculate.  Array index is in r1.

      fcall fp1[r1.w + 0][1]

      // Return is expected in r5.

      mov r3.xyz, r5.xyzx

      add r4.xyz, r4.xyzx, r3.xyzx

      iadd r0.w, r0.w, l(1)

    endloop

    // Multiply accumulated color times texture color.

    mul r0.xyz, r0.xyzx, r4.xyzx

    mov r1.xyz, r0.xyzx

    ret

    // AmbientLight version of StrangeMaterial.CalculateLitColor-calls-

    // AccumulateLighting's call to Calculate.

    label fb9

    // Array index is r0, return is r5.

    mov r1.w, this[r0.w + 1].y

    mov r0.w, this[r0.w + 1].x

    mov r5.xyz, cb[r0.w + 0][r1.w + 0].xyzx

    ret

    // DirectionalLight version of StrangeMaterial.CalculateLitColor-calls-

    // AccumulateLighting's call to Calculate.

    label fb10

    // Array index is r0, return is r5.

    mov r1.w, this[r0.w + 1].y

    mov r2.w, this[r0.w + 1].x

    mov r3.w, this[r0.w + 1].y

    mov r4.w, this[r0.w + 1].x

    dp3_sat r3.w, r2.xyzx, -cb[r4.w + 0][r3.w + 0].xyzx

    mul r6.xyz, r3.wwww, cb[r2.w + 0][r1.w + 1].xyzx

    mov r5.xyz, r6.xyzx

    ret

    // StrangeMaterial version of main's call to CalculateLitColor.

    label fb11

    // AccumulateLighting is inlined.

    mov r4.xyz, l(0,0,0,0)

    mov r0.z, l(0)

    loop

      // g_NumLights is cb0[0].x.

      uge r0.w, r0.z, cb0[0].x

      breakc_nz r0.w

      // Get g_Lights[g_LightsInUse[i]].

      // g_LightsInUse is cb0[1-4].

      // g_Lights is cb0[5-13].

      mov r0.w, cb0[r0.z + 1].x

      // Call Calculate.  Array index is in r0.

      fcall fp1[r0.w + 0][2]

      // Return is in r5.

      mov r3.xyz, r5.xyzx

      add r4.xyz, r4.xyzx, r3.xyzx

      iadd r0.z, r0.z, l(1)

    endloop

    mov r1.xyz, r4.xyzx

    ret
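Two rules stated in the listing's comments are easy to get wrong: a function table has one entry per call site rather than per method, and interface pointer numbering advances by the array size of each declaration. The sketch below models both on the CPU. InterfaceDecl, NextPointerNumber, TableMatchesDecl and ResolveCall are hypothetical names, and function bodies are represented only by their fb numbers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A function body is identified by its fb number in this model.
using BodyId = int;

// A function table: one body per call site (not per method).
using FunctionTable = std::vector<BodyId>;

// Hypothetical model of a dcl_interface declaration fpN[arraySize][callSites].
struct InterfaceDecl {
    std::size_t firstPointer;  // N, the number of the first pointer
    std::size_t arraySize;     // first [] of the declaration
    std::size_t callSites;     // second [] of the declaration
};

// Numbering takes array size into account: the pointer that follows
// fp6[4][1] is fp10.
inline std::size_t NextPointerNumber(const InterfaceDecl& d) {
    return d.firstPointer + d.arraySize;
}

// A table is only usable with a declaration when its body count equals
// the declaration's call site count.
inline bool TableMatchesDecl(const FunctionTable& t, const InterfaceDecl& d) {
    return t.size() == d.callSites;
}

// Models fcall fp[...][callSite] after a table has been bound:
// the call site index selects the body within the bound table.
inline BodyId ResolveCall(const FunctionTable& bound, std::size_t callSite) {
    return bound.at(callSite);
}
```

For instance, with TexturedMaterial's table { fb1, fb8 } bound to fp0, call site 1 (the CalculateLitColor call in main) resolves to fb8, matching the listing.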

7.19.7.3 API - Complex Example

7.19.7.3 API - 复杂示例

// create a class library to hold class instance data

    pDevice->CreateClassLinkage(&pMyClassTable);

    // create the shader and supply a class library to add class instance data

    pDevice->

        CreatePixelShader(pMyCompiledPixelShader, pMyClassTable, &pMyPS);

    // use reflection to get where data should be stored in interface array

    NumInterfaces = pMyPSReflection->GetNumInterfaces();

    pMyLightsVar = pMyPSReflection->GetVariableByName("g_Lights");

    iLightOffset = pMyLightsVar->GetInterfaceSlot(0);

    pMyMaterialVar = pMyPSReflection->GetVariableByName("$MyMaterial");

    iMatOffset = pMyMaterialVar->GetInterfaceSlot(0);

    // Use class library to get references to all class instances

    //   needed in the shader.

    pMyClassTable->GetClassInstance("g_Ambient0", 0, &pAmbient0);

    pMyClassTable->GetClassInstance("g_DirLight0", 0, &pDirLight[0]);

    pMyClassTable->GetClassInstance("g_DirLight1", 0, &pDirLight[1]);

    pMyClassTable->GetClassInstance("g_DirLight2", 0, &pDirLight[2]);

    pMyClassTable->GetClassInstance("g_DirLight3", 0, &pDirLight[3]);

    pMyClassTable->GetClassInstance("g_DirLight4", 0, &pDirLight[4]);

    pMyClassTable->GetClassInstance("g_DirLight5", 0, &pDirLight[5]);

    pMyClassTable->GetClassInstance("g_DirLight6", 0, &pDirLight[6]);

    pMyClassTable->GetClassInstance("g_DirLight7", 0, &pDirLight[7]);

    pMyClassTable->GetClassInstance("g_FlatMat0", 0, &pFlatMat0);

    pMyClassTable->GetClassInstance("g_TexMat0", 0, &pTexMat0);

    pMyClassTable->GetClassInstance("g_StrangeMat0", 0, &pStrangeMat0);

    // sets lights in array - they do not change only indices to them do

    pMyInterfaceArray[iLightOffset] = pAmbient0;

    for (uint i = 0; i < 8; i++)

    {

        pMyInterfaceArray[iLightOffset + i + 1] = pDirLight[i];

    }

    while (true)

    {

        if (bFlatSunlightOnly)

        {

            // Set g_NumLights to 1 in constant buffer.

            // Set g_LightsInUse[0] to 1 in constant buffer.

            pMyInterfaceArray[iMatOffset] = pFlatMat0;

        }

        else if (bStrangeMaterials)

        {

            // Set g_NumLights and fill out g_LightsInUse.

            pMyInterfaceArray[iMatOffset] = pStrangeMat0;

        }

        else

        {

            // Set g_NumLights and fill out g_LightsInUse.

            pMyInterfaceArray[iMatOffset] = pTexMat0;

        }

        // Set the pixel shader and the interfaces to use until the next bind call

        pDevice->PSSetShader(pMyPS, pMyInterfaceArray, NumInterfaces);

        // Use the shader that was just bound to draw something

        RenderScene();

    }

7.20 Low Precision Shader Support In D3D

7.20 D3D 中的低精度着色器支持

7.20.1 Overview

7.20.1 概述

This adds support for 10-bit (2.8 fixed point) and 16-bit precision float and, in some cases, limited integer arithmetic to shader model 2.0+.

增加了对 10 位(2.8 定点)和 16 位精度浮点数的支持,在某些情况下,将有限的整数算术应用到着色器模型2.0及以上版本。

"2.8 fixed point" is a fixed-point number representation in which a value is split into an integer part and a fractional part: the "2.8" means 2 integer bits and 8 fractional bits.

  • min16float - minimum 16-bit floating point value.
    min16float - 最小 16 位浮点值。
  • min10float - minimum 10-bit floating point value.
    min10float - 最小 10 位浮点值。
  • min16int - minimum 16-bit signed integer.
    min16int - 最小 16 位有符号整数。
  • min12int - minimum 12-bit signed integer.
    min12int - 最小 12 位有符号整数。
  • min16uint - minimum 16-bit unsigned integer.
    min16uint - 最小 16 位无符号整数。

Shader<->memory I/O operations are unchanged for simplicity, e.g. shader constants continue to be defined as 32-bit per component.

为简单起见,Shader<->memory I/O 操作保持不变,例如,着色器常量继续被定义为每个组件32位。

Implementations are allowed to execute low precision operations at higher precision. So 10-bit arithmetic could be done at 10-bits or more (say 32-bit) precision.

允许实现以更高的精度执行低精度操作。因此,10 位算术可以以 10 位或更高(比如 32 位)的精度完成。

min10float a = 0.2;

min10float b = 1.1;

min10float c = a + b; // approximately 1.3

This may be implemented at 32-bit precision:

float a = 0.2;

float b = 1.1;

float c = a + b; // approximately 1.3
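To see what the minimum precision means numerically, the addition above can be evaluated on the CPU with every value snapped to a 2.8 fixed-point grid. This is a rough sketch under stated assumptions: the names QuantizeFixed2_8 and AddAtMin10 are hypothetical, and round-to-nearest with clamping to [-2, 2) is assumed, which real hardware is not required to match.

```cpp
#include <cassert>
#include <cmath>

// Assumed model of a 2.8 fixed-point value: 8 fractional bits,
// range [-2, 2), round-to-nearest, clamped at the range limits.
inline double QuantizeFixed2_8(double value) {
    const double scale = 256.0;              // 2^8 steps per unit
    double steps = std::round(value * scale);
    if (steps < -512.0) steps = -512.0;      // -2.0
    if (steps > 511.0) steps = 511.0;        // largest value below 2.0
    return steps / scale;
}

// a + b with the inputs and the result all held at the minimum
// 10-bit precision.
inline double AddAtMin10(double a, double b) {
    return QuantizeFixed2_8(QuantizeFixed2_8(a) + QuantizeFixed2_8(b));
}
```

Under these assumptions, 0.2 + 1.1 evaluates to 333/256 = 1.30078125 rather than 1.3. An implementation running the same shader at 32-bit precision would land closer to 1.3, so a shader that uses min10float has to tolerate any result in that band.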

7.20.1.1 Design Goals / Assumptions

7.20.1.1 设计目标/假设

1. Enable D3D applications to take advantage of hardware that implements low precision shader arithmetic

1.使 D3D 应用程序能够利用硬件实现低精度着色器运算。

2. Shaders authored for low precision work unmodified on hardware that operates at higher precision

2.在硬件上,低精度着色器的编写通常不需要进行修改,即可在支持更高精度的硬件上运行。

  • Application does not have to author multiple versions of a shader, but has to be careful that the shader will operate at variable precision as low as the minimum precision it chooses
  • 应用程序不必创作多个版本的着色器,但必须注意着色器将操作在可变精度低至它选择的最小精度

3. Shaders authored for low precision can trivially be cleaned up by the runtime to be in a format that old drivers understand

3.为低精度编写的着色器可以很容易地由运行时清理,使其采用旧驱动程序可以理解的格式

4. Minimal driver work to either support low precision processing or not support it

4.最小的驱动程序工作,要么支持低精度处理,要么不支持它

  • E.g. Drivers can compile shaders once when they are initially submitted
  • 例如,驱动程序可以在最初提交着色器时编译一次
  • Ideally, Constant Buffers also don’t require any special processing by drivers to account for the contents being referenced at various precisions (IHV can choose to build downconverting hardware for this)
  • 理想情况下,常量缓冲区也不需要驱动程序进行任何特殊处理来考虑以各种精度引用的内容 (IHV 可以选择为此生成下变频硬件)
  • Drivers that don’t support the feature can simply ignore the precision hints.
  • 不支持该功能的驱动程序可以简单地忽略精度提示。
  • To understand the precision level a given shader instruction in the bytecode can operate (including converting precisions on operands if necessary), drivers will not have to do any complex far reaching analysis – just looking at the current instruction should be informative enough, possibly with the help of shader declarations.
  • 要了解字节码中给定着色器指令可以操作的精度级别(包括在必要时转换操作数上的精度),驱动程序将不必进行任何复杂的深远分析 - 只需查看当前指令就应该有足够的信息,可能需要着色器声明的帮助。

struct VertexIn

{

    float3 PosL : POSITION;

    float4 Color : COLOR;

};

struct VertexOut

{

    float4 PosH  : SV_POSITION;

    float4 Color : COLOR;

};

min10float4 a = min10float4(1.0,2.0,3.0,4.0);

VertexOut VS(VertexIn vin)

{

    VertexOut vout;

    vout.PosH = a;

    return vout;

}

// cbuffer $Globals

// {

//   min10float4 a;                     // Offset:    0 Size:    16

//      = 0x3f800000 0x40000000 0x40400000 0x40800000

// }

vs_5_0

dcl_globalFlags refactoringAllowed | enableMinimumPrecision

dcl_constantbuffer CB0[1], immediateIndexed

dcl_output_siv o0.xyzw, position

mov o0.xyzw, cb0[0].xyzw {min2_8f as def32}

ret

5. Application codebase does not need to change at all to use low precision shaders

5.应用程序代码库不需要更改即可使用低精度着色器

  • Shaders can be dropped in with no other codebase change
  • 着色器可以放入其中,而无需更改其他代码库

struct PSInput

{

  float4 position : SV_POSITION;

  float4 color : COLOR;

};

PSInput VSMain(float4 position : POSITION, float4 color : COLOR)

{

  PSInput  result;

  result.position = position;

  result.color = color;

  return result;

}

Changed to a low precision shader:

struct PSInput

{

  min16float4 position : SV_POSITION;

  min16float4 color : COLOR;

};

PSInput VSMain(min16float4 position : POSITION, min16float4 color : COLOR)

{

  PSInput  result;

  result.position = position;

  result.color = color;

  return result;

}

6. Low precision support is added to all interesting shader models (2.x-5.0) as opposed to limiting it to the bottom end or adding a new shader model.

6.低精度支持被添加到所有有趣的着色器模型 (2.x-5.0) 中,而不是将其限制在底端或添加新的着色器模型。

  • Applications don’t have to make a choice between choosing low precision vs using other features if the hardware supports it all.
  • 如果硬件支持所有功能,则应用程序不必在选择低精度和使用其他功能之间做出选择。
  • Similarly, hardware vendors implementing any shader level can choose to exploit low precision (independent decisions).
  • 同样,实现任何着色器级别的硬件供应商都可以选择利用低精度(独立决策)。

7. Data format for the various low precisions is well defined, though it is not directly visible to applications

7.各种低精度的数据格式已明确定义,但对应用程序不直接可见

  • During shader execution, implementations can use equal or any amount of additional precision.
  • 在着色器执行期间,实现可以使用相等或任意数量的额外精度。

7.20.2 Precision Levels

7.20.2 精度级别

The new 10 and 16 bit precision levels for shaders are inspired by their existence in some real hardware and their presence in OpenGL ES. (8 bit was considered but cut due to its limitations versus the value it seemed to provide at the time.)

| | Default Precision | Min 10-bit fixed point (2.8) | Min 16-bit int / float | 32-bit int/float | 64-bit float |
|---|---|---|---|---|---|
| Executing at higher precision allowed? | - | Y | Y | N | N |
| Shader Constants | - | N | N | Y | Y |
| SM 2.x | VS: fp32 / int23; PS: fp24 (s16e7) / int16 | opt | opt | N | N |
| SM 3.0 | fp32 | N | N | Y | N |
| SM 4.x | fp32 / int32 | opt | opt | Y | opt |
| SM 5.0 | fp32 / int32 | opt | opt | Y | opt |
| Float range | - | [-2,2) | [-2^14,2^14] | Full IEEE 754 | Full IEEE 754 |
| Float magnitude range | - | 2^-8...2 | On SM 4+, includes INF (infinity)/NAN | Full IEEE 754 | Full IEEE 754 |
| Int range | - | - | (-2^11,2^11); full range signed and unsigned on SM 4+ | full | - |

7.20.2.1 10-bit min precision level

7.20.2.1 10位最小精度级别

This is a 2.8 fixed point value, though the fixed point semantics may not be identical to the general fixed point semantics defined in the D3D10+ specs. Following the D3D10+ fixed point semantics is recommended for future hardware that may choose to implement the 10-bit precision level.

这是一个 2.8 定点值,但定点语义可能与 D3D10+ 规范中定义的一般定点语义不同。对于可能选择实现 10 位精度级别的未来硬件,建议遵循 D3D10+ 定点语义。

In Direct3D 10, the following types are modifiers to the float type:
在 Direct3D 10 中,以下类型是浮点类型的修饰符:

  • snorm float - IEEE 32-bit signed-normalized float in range -1 to 1 inclusive.
    snorm float - IEEE 32 位有符号规范化浮点数,范围为 -1 到 1(含)。
  • unorm float - IEEE 32-bit unsigned-normalized float in range 0 to 1 inclusive.
    unorm float - IEEE 32 位无符号规范化浮点数,范围为 0 到 1(含)。

8-bit UNORM data is invertible when passed through 10-bit min-precision storage. For example: Suppose UNORM 8-bit data that is point sampled from the texture format DXGI_FORMAT_R8G8B8A8_UNORM gets read into a shader and is stored and passed around in the 10-bit representation. If that data is subsequently written unchanged out to a UNORM 8-bit output (such as a DXGI_FORMAT_R8G8B8A8_UNORM rendertarget), the output UNORM value matches the input UNORM value. This guarantee does not (cannot) apply for other formats passing through 10-bit, such as 8-bit UNORM_SRGB or higher precision UNORM values like 16-bit UNORM.

当通过10位最小精度存储传递时,8位UNORM数据是可逆的。例如:假设从纹理格式DXGI_FORMAT_R8G8B8A8_UNORM中点采样的UNORM 8位数据被读入着色器,并以10位表示形式存储和传递。如果该数据随后不变地写入到UNORM 8位输出(例如,DXGI_FORMAT_R8G8B8A8_UNORM渲染目标),则输出的UNORM值将与输入的UNORM值匹配。这个保证不适用于通过10位传递的其他格式,例如8位的UNORM_SRGB或更高精度的UNORM值,如16位的UNORM。
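
The round-trip guarantee can be checked numerically. The following is a minimal sketch, assuming round-to-nearest when converting a float to the 2.8 code and the usual `f * 255 + 0.5` truncation when writing UNORM output; both rounding rules are assumptions here, and every helper name is hypothetical rather than taken from a D3D header.

```c
#include <stdint.h>

/* Hypothetical helpers; the rounding rules are assumptions of this sketch. */

/* Float -> signed 2.8 fixed-point code (10 bits), round to nearest, clamped. */
int float_to_fixed_2_8(float f)
{
    int code = (int)(f * 256.0f + (f >= 0.0f ? 0.5f : -0.5f));
    if (code < -512) code = -512;   /* -2.0 */
    if (code >  511) code =  511;   /* 2.0 - 1/256 */
    return code;
}

float fixed_2_8_to_float(int code) { return (float)code / 256.0f; }

/* Store a value in 2.8 form and read it back as a float. */
float through_2_8(float f) { return fixed_2_8_to_float(float_to_fixed_2_8(f)); }

/* Number of 8-bit UNORM inputs that do NOT survive the 2.8 round trip. */
int count_unorm8_mismatches(void)
{
    int mismatches = 0;
    for (int v = 0; v <= 255; ++v) {
        float stored = through_2_8((float)v / 255.0f);
        int out = (int)(stored * 255.0f + 0.5f);
        if (out != v) ++mismatches;
    }
    return mismatches;
}

/* Same trip for one 16-bit UNORM value; returns 1 if it comes back unchanged. */
int unorm16_survives(uint16_t v)
{
    float stored = through_2_8((float)v / 65535.0f);
    return (uint16_t)(stored * 65535.0f + 0.5f) == v;
}
```

Under these assumptions every one of the 256 UNORM 8-bit values comes back unchanged, while a 16-bit UNORM value such as 1 collapses to 0, which is consistent with the statement that the guarantee cannot extend to higher precision formats.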

From the shader's point of view, the 10-bit min-precision level appears as a float value with at minimum [-2,2) range.

从着色器的角度来看,10位最小精度级别表现为一个浮点值,其范围至少为[-2,2)。

Hardware that supports 10-bit precision must also support 16-bit precision.

支持 10 位精度的硬件也必须支持 16 位精度。

7.20.2.2 16-bit min-precision level

7.20.2.2 16位最小精度级别

7.20.2.2.1 float16

7.20.2.2.1 浮点数16

For float values, this is float16 as defined in the D3D10+ specs. The exception is that for Shader Model 2, the max exponent encodings (normally defining NaN/INF) are unused (undefined).

对于浮点值,这是 D3D10+ 规范中定义的浮点 16。例外情况是,对于着色器模型 2,最大指数编码(通常定义 NaN/INF)未使用(未定义)。

Conversion from float32 (e.g. from shader constants) to float16 may or may not flush float16 denorm to 0, and round to zero is used, per D3D spec for high to low precision float. Float16 arithmetic operations within the shader may or may not flush float16 denorm to 0, and may either round to nearest even or truncate to a representable number. Out of range values in conversion from float32 or arithmetic may produce +/-MAX_FLOAT16 or +/-INF.

从float32(例如,来自着色器常量)到float16的转换可能会也可能不会将float16 denorm为0,并且根据D3D规范,用于高精度到低精度浮点数的转换时,将使用向零舍入。着色器内的float16算术运算可能会也可能不会将float16 denorm为0,并且可能向最近的偶数舍入或截断为可表示的数字。从float32转换或算术运算中得到的超出范围的值可能会产生+/-MAX_FLOAT16或+/- INF。

(Note: denorm, i.e. denormalized or subnormal, values are the IEEE 754 values of very small magnitude, close to 0.)
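
As an illustration of the float32 to float16 conversion rules above, here is a minimal sketch of a round-toward-zero conversion. The function name and its two switches are hypothetical: `flush_denorms` and `overflow_to_inf` simply model the two behaviors the text leaves up to the implementation (denorm flush or not, +/-MAX_FLOAT16 or +/-INF on overflow). NaN/INF inputs are encoded the SM 4+ way; on Shader Model 2 those encodings are undefined.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: float32 -> float16 bit pattern, rounding toward zero. */
uint16_t float32_to_float16_rtz(float f, int flush_denorms, int overflow_to_inf)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);               /* reinterpret without aliasing issues */

    uint16_t sign  = (uint16_t)((bits >> 16) & 0x8000u);
    uint32_t exp32 = (bits >> 23) & 0xFFu;
    uint32_t man32 = bits & 0x007FFFFFu;

    if (exp32 == 0xFFu)                           /* INF or NaN input */
        return (uint16_t)(sign | 0x7C00u | (man32 ? 0x0200u : 0u));

    int exp16 = (int)exp32 - 127 + 15;            /* rebias the exponent */

    if (exp16 >= 31)                              /* beyond float16 range */
        return (uint16_t)(sign | (overflow_to_inf ? 0x7C00u : 0x7BFFu));

    if (exp16 <= 0) {                             /* float16 denorm or zero */
        if (flush_denorms || exp16 < -10)
            return sign;
        uint32_t man = man32 | 0x00800000u;       /* restore the implicit leading 1 */
        return (uint16_t)(sign | (man >> (14 - exp16)));   /* truncate */
    }

    /* Normal number: drop the low 13 mantissa bits (round toward zero). */
    return (uint16_t)(sign | ((uint32_t)exp16 << 10) | (man32 >> 13));
}
```

For example, 65520.0f truncates to 0x7BFF (65504, MAX_FLOAT16) here, whereas a round-to-nearest-even conversion would overflow to INF; and 2^-15 becomes the denorm 0x0200 unless `flush_denorms` is set, in which case it becomes 0.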

16-bit integer min-precision is available as well in HLSL. For Shader Model 2, this is constrained to be representable as integral floats (1.0f, 2.0f, etc.) in a float16 encoding. In the shader bytecode these appear simply as float16, so native integer operations are not available. (It may not be worth bothering to expose this constrained form of int16 for SM 2/3.)

在HLSL中,也提供了16位整数的最小精度。对于着色器模型2,这被限制为能够以float16编码表示为整数浮点数(1.0f,2.0f等)。在着色器字节码中,这些只是以float16的形式出现,因此不可用原生的整数运算。(对于SM 2/3,可能没有必要去暴露这种受限制的int16形式。)

7.20.2.2.2 int16/uint16

For shader model 4+, native integer ops can be used on 16-bit min-precision values. However, applications must beware that the device could choose to simply use larger-than-16-bit (e.g. 32-bit) integer ops without any clamping to maintain the illusion that there are not more than 16 bits present.

对于着色器模型4+,可以在16位最小精度值上使用本地整数操作,但是应用程序必须注意,设备可能会选择简单地使用大于16位(例如32位)的整数操作,而不进行任何截断,以维持不存在超过16位的假象。

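(Note: in short, this reminds developers to be careful with integer operations in shaders, because the device may silently use a wider integer width without any clamping.)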
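
A small sketch of why this matters, written in C with hypothetical function names: the same addition gives different results depending on whether the device really keeps 16 bits or quietly runs the op at 32 bits. Masking explicitly is one possible way (an assumption of this sketch, not something the spec prescribes) to get the same answer on both kinds of device.

```c
#include <stdint.h>

/* What a shader author might expect if registers were exactly 16 bits wide. */
uint32_t add_as_exact_uint16(uint32_t a, uint32_t b)
{
    return (uint16_t)(a + b);          /* wraps modulo 2^16 */
}

/* What a device is allowed to do: run the op at 32 bits with no clamping. */
uint32_t add_as_promoted_uint32(uint32_t a, uint32_t b)
{
    return a + b;                      /* bit 16 and above survive */
}

/* Width-independent variant: mask the result explicitly. */
uint32_t add_wrapped_portably(uint32_t a, uint32_t b)
{
    return (a + b) & 0xFFFFu;
}
```

With a = 0xFFFF and b = 1, the first function yields 0, the second yields 0x10000, and the third yields 0 regardless of the width the device chose.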

Shader Constants feeding 16-bit shader arithmetic are always fp32 encoded for Shader Model 2. For Shader Models 4+, Shader Constants feeding 16-bit arithmetic in the shader are specified as float32 or UINT32/INT32 as appropriate (i.e. unchanged from the way constants feed into float32 arithmetic).

为Shader Model 2,输入16位着色器运算的着色器常量总是以fp32编码。对于Shader Models 4+,根据具体情况而定,在着色器中输16位的着色器常量被指定为float32或UINT32/INT32(即,与常量输入到float32运算的方式相同)。

7.20.3 Low Precision Shader Bytecode

7.20.3 低精度着色器字节码

7.20.3.1 D3D9

A new MIN_PRECISION enum is added to the source and dest parameter token, definition below. This specifies the minimum precision level for the entire operation – implementations can use equal or greater precision.

一个新的MIN_PRECISION枚举被添加到源和目的参数标记中,定义如下。这指定了整个操作的最小精度级别 - 实现可以使用相同或更高的精度。

This new enum co-exists with the PARTIALPRECISION flag that is already in the same dest parameter token – see the comment below.

这个新的枚举与已位于同一目的参数标记中的 PARTIALPRECISION 标志共存 - 请参阅下面的注释。

7.20.3.1.1 Token Format

7.20.3.1.1 Token格式

// Source or dest token bits [15:14]:

#define D3D11_SB_OPERAND_MIN_PRECISION_MASK  0x0001C000

#define D3D11_SB_OPERAND_MIN_PRECISION_SHIFT 14

typedef enum _D3DSHADER_MIN_PRECISION

{

    D3DMP_DEFAULT   = 0, // Default precision for the shader model

    D3DMP_16        = 1, // Min 16 bit per component

    D3DMP_2_8       = 2, // Min 10 bits (2.8) per component

} D3DSHADER_MIN_PRECISION;

// When MIN_PRECISION is nonzero on a dest token, the dest modifier

// D3DSPDM_PARTIALPRECISION must also be set for consistency

//

// If D3DSPDM_PARTIALPRECISION is set but

// D3DSHADER_MIN_PRECISION is D3DMP_DEFAULT(0),

// it is equivalent to D3DSPDM_PARTIALPRECISION + D3DMP_16

// (PARTIALPRECISION existed before MIN_PRECISION was

// added, so this defines how the two can coexist without changing

// meaning for old shaders)
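
The coexistence rule in the comment above can be sketched as a small resolver. This is a hedged illustration only: the `SKETCH_` names are hypothetical, the field is assumed to be the two bits [15:14] named in the comment (mask 0x0000C000), and the PARTIALPRECISION modifier is passed in as a plain flag instead of being decoded from the token.

```c
#include <stdint.h>

/* Values mirror the D3DMP_* enum listed above; names are hypothetical. */
typedef enum SketchMinPrecision {
    SKETCH_MP_DEFAULT = 0,
    SKETCH_MP_16      = 1,
    SKETCH_MP_2_8     = 2
} SketchMinPrecision;

#define SKETCH_MIN_PRECISION_SHIFT 14
#define SKETCH_MIN_PRECISION_MASK  0x0000C000u   /* bits [15:14], an assumption */

/* Resolve the minimum precision a dest token asks for, given whether the
   PARTIALPRECISION dest modifier is set on the same token. */
SketchMinPrecision effective_dest_min_precision(uint32_t destToken,
                                                int partialPrecisionSet)
{
    SketchMinPrecision mp = (SketchMinPrecision)
        ((destToken & SKETCH_MIN_PRECISION_MASK) >> SKETCH_MIN_PRECISION_SHIFT);

    /* Old shaders: PARTIALPRECISION alone means the same as D3DMP_16. */
    if (mp == SKETCH_MP_DEFAULT && partialPrecisionSet)
        return SKETCH_MP_16;
    return mp;
}
```

A token carrying the 2.8 level keeps it even with the modifier set, while a legacy token that only has PARTIALPRECISION resolves to the 16-bit level, so old shaders keep their meaning.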

7.20.3.1.2 Usage Cases

7.20.3.1.2 用例

The src/dest token for instructions in PS/VS 2.x can use the MIN_PRECISION enum in the following circumstances:

在以下情况下,PS/VS 2.x中的指令的src/dest token可以使用MIN_PRECISION枚举:

1.Any shader instruction with an output (e.g. arithmetic, texture fetch instructions )

1.任何带有输出的着色器指令(例如算术、纹理获取指令)

2.PS 2.x input texcoord (t#) declarations (allowing lower precision interpolation)

2.PS 2.x 输入texcoord(t#) 声明(允许较低精度的插值)

  • Does not apply to PS 2.x input color (v#) declarations, as these were already by definition called out to be as low as 8 bit.
  • 不适用于 PS 2.x 输入颜色 (v#) 声明,因为根据定义,这些声明已被列为低至 8 位)

3.PS 3.0 input attribute (v#) declarations (allowing lower precision interpolation)

3.PS 3.0 输入属性 (v#) 声明(允许较低精度的插值)

4.Constant references (discussed further in 7.20.3.5)

4.常量引用(此处 (7.20.3.5) 详细讨论)

5.(Shader Model 3.0 is not affected since the D3D11 runtime does not expose it)

5.(着色器模型 3.0 不受影响,因为 D3D11 运行时不会公开它)

7.20.3.1.3 Interpreting Minimum Precision

7.20.3.1.3 解释最小精度

See 7.20.3.4; this is common across D3D9 and D3D10+.

看这里 (7.20.3.4) ; 这在 D3D9 和 D3D10+ 中很常见。

7.20.3.2 D3D10+

A new MIN_PRECISION enum is added to the dest parameter token, definition below. This specifies the minimum precision level for the entire operation – implementations can use equal or greater precision.

一个新的MIN_PRECISION枚举被添加到目的参数标记中,定义如下。这指定了整个操作的最小精度级别 - 实现可以使用相同或更高的精度。

The encoding distinguishes type (e.g. float vs. sint vs. uint), in addition to precision level, to disambiguate instructions like “mov” that don’t already imply a type. This makes a difference when there is a size change involved in the instruction. E.g. moving a 32 bit float to a min. 16 bit float is a different task for hardware than moving a 32 bit uint to a min. 16 bit uint. This type distinction is not needed for the D3D9 shader bytecode because all arithmetic is “float” there.

编码还区分类型(例如 float 和 sint 和 uint),除了精度级别之外,为了消除歧义指令诸如“mov”,这些指令尚未暗示类型。当指令中涉及大小更改时,这会有所不同。例如,对于硬件来说,将 32 位浮点数移动到最小 16 位浮点数与将 32 位 uint 移动到最小 16 位 uint 的任务是不同的。D3D9 着色器字节码不需要此类型区分,因为在那里所有运算都是“浮点”。

7.20.3.2.1 Token Format

7.20.3.2.1 Token格式

// Min precision specifier for source/dest operands.  This

// fits in the extended operand token field. Implementations are free to

// execute at higher precision than the min – details spec’d elsewhere.

// This is part of the opcode specific control range.

typedef enum D3D11_SB_OPERAND_MIN_PRECISION

{

    D3D11_SB_OPERAND_MIN_PRECISION_DEFAULT    = 0, // Default precision

                                                       // for the shader model

    D3D11_SB_OPERAND_MIN_PRECISION_FLOAT_16   = 1, // Min 16 bit/component float

    D3D11_SB_OPERAND_MIN_PRECISION_FLOAT_2_8  = 2, // Min 10(2.8)bit/comp. float

    D3D11_SB_OPERAND_MIN_PRECISION_SINT_16    = 4, // Min 16 bit/comp. signed integer

    D3D11_SB_OPERAND_MIN_PRECISION_UINT_16    = 5, // Min 16 bit/comp. unsigned integer

} D3D11_SB_OPERAND_MIN_PRECISION;

#define D3D11_SB_OPERAND_MIN_PRECISION_MASK  0x0001C000

#define D3D11_SB_OPERAND_MIN_PRECISION_SHIFT 14

// DECODER MACRO: For an OperandToken1 that can specify

// a minimum precision for execution, find out what it is.

#define DECODE_D3D11_SB_OPERAND_MIN_PRECISION(OperandToken1) ((D3D11_SB_OPERAND_MIN_PRECISION)(((OperandToken1)& D3D11_SB_OPERAND_MIN_PRECISION_MASK)>> D3D11_SB_OPERAND_MIN_PRECISION_SHIFT))

// ENCODER MACRO: Encode minimum precision for execution

// into the extended operand token, OperandToken1

#define ENCODE_D3D11_SB_OPERAND_MIN_PRECISION(MinPrecision) (((MinPrecision)<< D3D11_SB_OPERAND_MIN_PRECISION_SHIFT)& D3D11_SB_OPERAND_MIN_PRECISION_MASK)

// ----------------------------------------------------------------------------

// Global Flags Declaration

//

// OpcodeToken0:

//

... snip ...

// [16:16] Enable minimum-precision data types

... snip ...

//

// OpcodeToken0 is followed by no operands.

//

// ----------------------------------------------------------------------------

... snip ...

#define D3D11_1_SB_GLOBAL_FLAG_ENABLE_MINIMUM_PRECISION        (1<<16)

... snip ...

// DECODER MACRO: Get global flags

#define DECODE_D3D10_SB_GLOBAL_FLAGS(OpcodeToken0) ((OpcodeToken0)&D3D10_SB_GLOBAL_FLAGS_MASK)

// ENCODER MACRO: Encode global flags

#define ENCODE_D3D10_SB_GLOBAL_FLAGS(Flags) ((Flags)&D3D10_SB_GLOBAL_FLAGS_MASK)
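
To see the operand encoding in action, the sketch below copies the enum and the two operand macros from the listing above so that it compiles on its own, then adds two hypothetical helpers, `with_min_precision` and `min_precision_is_integer`, that are not part of any D3D header.

```c
#include <stdint.h>

/* Copied from the listing above so the sketch is self-contained. */
typedef enum D3D11_SB_OPERAND_MIN_PRECISION
{
    D3D11_SB_OPERAND_MIN_PRECISION_DEFAULT   = 0,
    D3D11_SB_OPERAND_MIN_PRECISION_FLOAT_16  = 1,
    D3D11_SB_OPERAND_MIN_PRECISION_FLOAT_2_8 = 2,
    D3D11_SB_OPERAND_MIN_PRECISION_SINT_16   = 4,
    D3D11_SB_OPERAND_MIN_PRECISION_UINT_16   = 5
} D3D11_SB_OPERAND_MIN_PRECISION;

#define D3D11_SB_OPERAND_MIN_PRECISION_MASK  0x0001C000
#define D3D11_SB_OPERAND_MIN_PRECISION_SHIFT 14

#define DECODE_D3D11_SB_OPERAND_MIN_PRECISION(OperandToken1) \
    ((D3D11_SB_OPERAND_MIN_PRECISION)(((OperandToken1) & D3D11_SB_OPERAND_MIN_PRECISION_MASK) >> D3D11_SB_OPERAND_MIN_PRECISION_SHIFT))
#define ENCODE_D3D11_SB_OPERAND_MIN_PRECISION(MinPrecision) \
    (((MinPrecision) << D3D11_SB_OPERAND_MIN_PRECISION_SHIFT) & D3D11_SB_OPERAND_MIN_PRECISION_MASK)

/* Hypothetical helper: replace the min-precision field of an extended
   operand token, leaving all other bits untouched. */
uint32_t with_min_precision(uint32_t operandToken1, D3D11_SB_OPERAND_MIN_PRECISION p)
{
    return (operandToken1 & ~(uint32_t)D3D11_SB_OPERAND_MIN_PRECISION_MASK)
         | (uint32_t)ENCODE_D3D11_SB_OPERAND_MIN_PRECISION(p);
}

/* Hypothetical helper: the encoding carries type as well as width. */
int min_precision_is_integer(D3D11_SB_OPERAND_MIN_PRECISION p)
{
    return p == D3D11_SB_OPERAND_MIN_PRECISION_SINT_16
        || p == D3D11_SB_OPERAND_MIN_PRECISION_UINT_16;
}
```

The field is three bits wide (bits [16:14]), which is what leaves room for the integer variants 4 and 5 alongside the two float levels.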

7.20.3.3 Usage Cases

7.20.3.3 用例

The dest and source operand tokens in SM 4.0+ can use the MIN_PRECISION enum in the following circumstances:

SM 4.0+ 中的 dest 和 source 操作数tokens可以在以下情况下使用 MIN_PRECISION 枚举:

1.Any instruction that returns values to the shader

1.向着色器返回值的任何指令

  • e.g. mul
  • 例如:mul

2.Any memory fetch (incl texture sampling) or data move

2.任何内存提取(包括纹理采样)或数据移动

3.Type conversion instructions

3.类型转换指令

  • Those involving doubles, e.g. ftod or dtof only allow precision lowering on the float32 side of the operation.
  • 对于涉及到双精度浮点数的操作,例如 ftod 或 dtof,只允许在 float32 的一侧降低精度。
  • Other conversions, such as f32tof16, allow precision lowering on either side of the operation.
  • 其他转换,如 f32tof16,允许在操作的任一侧精确降低。

4.Exceptions (precision lowering not allowed)

4.例外情况(不允许降低精度)

  • double precision arithmetic
  • 双精度运算
  • atomic operations
  • 原子操作
  • load/store to non-Typed Unordered Access Views (Typed UAVs ok, since that involves format conv.)
  • 加载/存储到非类型化的无序访问视图(类型化的UAVs可以,因为这涉及格式转换。)
  • Geometry Shader stream output
  • 几何着色器流输出

5.Input and output attribute declarations

5.输入和输出属性声明

  1. At VS input, the input data types continue to be defined externally (Input Layout), but MIN_PRECISION can still be part of the shader input declaration, indicating how the shader will expect to see the data after it has been read in (post format conversion).
  2. 在 VS 输入时,输入数据类型继续在外部定义 (输入布局) ,但MIN_PRECISION仍然可以是着色器输入声明的一部分,指示着色器在读入数据后将如何查看数据(格式转换后)。

7.20.3.4 Interpreting Precision (same for D3D9 and D3D10+)

7.20.3.4 解释精度(D3D9 和 D3D10+ 相同)

1.Source operands arrive stored at the (minimum) precision indicated on the operand. If no minimum precision is specified (default), the operand precision is 32-bit.

1.源操作数按操作数上指示的(最小)精度存储。如果没有指定最小精度,(默认值)则操作数精度为 32 位。

2.The precision specified on the output operand determines the minimum storage needed for the output as well as the minimum precision for the operation.

2.在输出操作数上指定的精度决定了输出所需的最小存储以及操作的最小精度。

3.Mixing precisions across operands and the instruction is valid, but should be rare. Drivers may need to expand format changes into separate individual type conversions to the instruction’s precision unless the conversion is supported natively.

3.在操作数和指令之间混合精度是有效的,但应该很少见。驱动程序可能需要将格式更改扩展到单独的单个类型转换,以达到指令的精度,除非本机支持转换。

4.Precisions on the index value in dynamic indexing scenarios or other addressing (such as texture coordinates for a texture fetch) just follow the precision indicated on the value, unaffected by the instruction precision.

4.在动态索引方案或其他寻址(例如纹理提取的纹理坐标)中,索引值的精度仅遵循值上指示的精度,不受指令精度的影响。

The same applies for condition parameters in conditional instructions (like movc).

这同样适用于条件指令(如 movc)中的条件参数。

movc[_sat]    dest[.mask], src0[.swizzle], [-]src1[_abs][.swizzle], [-]src2[_abs][.swizzle]

(per component: if src0 has any bit set, then dest = src1, else dest = src2)

5.See 7.20.3.5 below for a discussion about shader constants.

5.有关着色器常量的讨论,请参阅下文 (7.20.3.5) 。

7.20.3.5 Shader Constants

7.20.3.5 着色器常量

Shader constants are defined at full 32-bit per component. New hardware implementing low precision is encouraged to design efficient downconversion support upon constant access; otherwise the driver will need to do extra work, or add extra conversion instructions, in shaders that read 32-bit per component constants into lower precision shader operations.

着色器常量被定义每个分量完整的32 位。鼓励实现低精度的新硬件在常量访问时设计高效的下转换支持,否则驱动程序需要进行一些工作或者驱动程序需要在读取32位每组件常量到低精度着色器操作的着色器中添加额外的转换指令。

Alternative approaches were considered where low precision constants are exposed all the way to the application (freeing driver/hardware from having to convert constants), but the added complexity in the programming model vs the benefit didn’t hold up at least at this time.

考虑了一些替代方法,其中低精度常数被暴露给应用程序(使驱动程序/硬件无需转换常数),但是相对于收益,编程模型中增加的复杂性至少在目前并不成立。

7.20.3.6 Referencing Shader Constants within Shaders

7.20.3.6 在着色器中引用着色器常量

When referencing a shader constant from a low precision instruction, if the constant value is out of the range of the instruction’s precision level, the value read is undefined. For constant values within range of a low precision instruction reference, the precision of the value may still get quantized down from full 32 bits.

当从低精度指令中引用着色器常量时,如果常量值超出了指令精度级别的范围,读取的值是未定义的。对于低精度指令参考范围内的常量值,该值的精度仍可能从完整的 32 位向下量化。
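
The two cases above (out of range means undefined, in range may still be quantized) can be made concrete for the 2.8 level. This is a sketch under assumptions: the function name is hypothetical, the range is taken as [-2,2), and quantization is modeled as round-to-nearest to a multiple of 1/256, which the text does not mandate.

```c
/* Hypothetical sketch: model reading a 32-bit float constant from a
   min 2.8 instruction. Returns 0 if the constant is out of range (the
   value read would be undefined); otherwise stores the quantized value
   in *valueOut and returns 1. */
int read_constant_as_2_8(float c, float *valueOut)
{
    if (!(c >= -2.0f && c < 2.0f))      /* also rejects NaN */
        return 0;

    int code = (int)(c * 256.0f + (c >= 0.0f ? 0.5f : -0.5f));
    if (code > 511)
        code = 511;                     /* largest 2.8 value: 2 - 1/256 */

    *valueOut = (float)code / 256.0f;
    return 1;
}
```

For instance, 0.3f is in range but reads back as 0.30078125 (77/256) in this model, while 4.0f is out of range, so a shader that reads it at the 2.8 level cannot rely on any particular result.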

Shader constants referenced in shader source operands will be marked at the precision they are to be referenced at, even though they come down the API/DDI at 32-bit per component.

在着色器源操作数中引用的着色器常量将被标记为它们应被引用的精度,即使它们通过API/DDI以每个组件32位的形式传递。

1.The constant buffer precision indicated on reference may be different than the precision of a given instruction, since multiple instructions in the shader at different precisions may read the same constant.

1.常量缓冲区精度在引用上的显示可能与给定指令的精度不同,因为着色器中不同精度的多个指令可能会读取相同的常量。

2.The HLSL compiler guarantees that all accesses of a given constant are marked with the same precision, indicating how much storage is needed for them regardless of what precision operations that reference them are using.

2.HLSL 编译器保证给定常量的所有访问都以相同的精度进行标记,从而指示它们需要多少存储空间,而不管引用它们使用的精度操作如何。

3.Implementations that may need to downconvert constants ahead of shader invocation (likely not ideal) can easily determine the required precision/storage for constants within a shader just by observing how they are tagged on first reference in the shader.

3.可能需要在着色器调用之前降低常量的实现(可能并非理想),只需通过观察着色器中首次引用时如何标记它们,就可以轻松确定常量所需的精度/存储。

4.In cases of dynamic indexing of constants, there is no way to know which parts of a constant buffer will be referenced at what precision ahead of time. Adding declarations that indicate this information was not deemed worth it at this time.

4.在常量动态索引的情况下,无法提前知道常量缓冲区的哪些部分将以什么精度被引用。添加声明此信息的声明目前被认为不值得。
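
Points 2 to 4 suggest a simple pass a driver could run when it wants to know, ahead of shader invocation, how each constant is tagged. The sketch below is purely illustrative: `ConstRef` is a hypothetical, already-decoded view of constant references, not a real bytecode or DDI structure.

```c
#include <stddef.h>

#define PRECISION_UNTAGGED (-1)

/* Hypothetical decoded view of one constant reference in a shader. */
typedef struct ConstRef {
    size_t slot;          /* constant buffer element being read */
    int    dynamicIndex;  /* nonzero if the element is only known at run time */
    int    minPrecision;  /* tag on the operand, e.g. 0 default, 1 16-bit, 2 2.8 */
} ConstRef;

/* Record, per slot, the precision seen on its first static reference.
   Because the compiler tags every access of a given constant identically,
   the first reference is enough. Returns nonzero if any dynamically
   indexed reference was seen, in which case the table is incomplete. */
int collect_constant_precisions(const ConstRef *refs, size_t numRefs,
                                int *slotPrecision, size_t numSlots)
{
    int sawDynamicIndexing = 0;

    for (size_t s = 0; s < numSlots; ++s)
        slotPrecision[s] = PRECISION_UNTAGGED;

    for (size_t i = 0; i < numRefs; ++i) {
        if (refs[i].dynamicIndex) {
            sawDynamicIndexing = 1;
            continue;
        }
        if (refs[i].slot < numSlots &&
            slotPrecision[refs[i].slot] == PRECISION_UNTAGGED)
            slotPrecision[refs[i].slot] = refs[i].minPrecision;
    }
    return sawDynamicIndexing;
}
```

A nonzero return value corresponds to the dynamic indexing case in point 4, where the precision of some parts of the constant buffer cannot be known ahead of time.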

7.20.3.7 Component Swizzling

7.20.3.7 组件置换

Low precision data is referenced by component in masks and swizzles (xyzw), just like default precision data. It is as though the registers do have a smaller number of bits (for hardware that supports lower precision). This is unlike the way double precision is mapped, where xy contains one double and zw contains another. Low precision doesn’t yield sub-fields within .x, for example.

低精度数据在掩码和swizzles -xyzw中通过组件引用,就像默认精度数据一样。这就好像寄存器确实具有较少的位数(适用于支持较低精度的硬件)。这与双精度的映射方式不同,其中 xy 包含一个双精度,zw 包含另一个双精度。例如,低精度不会在 .x 中生成子字段。

(Note: at default precision a component is 32 bits and a register has 4 components, 128 bits in total.)

The HLSL compiler will not generate code that mixes precisions in different components of any xyzw register (mostly for simplicity, even though this may not matter for hardware).

HLSL 编译器不会生成在任何 xyzw 寄存器的不同组件中混合精度的代码(主要是为了简单起见,即使这对硬件可能无关紧要)。

7.20.3.8 Low Precision Shader Limits

7.20.3.8 低精度着色器限制

The use of min / low precision specifiers never increases the maximum amount of resources available to a shader (such as limits on inputs, outputs or temp storage), since the shader must always be able to function on hardware that does not operate at low precision.

使用最小/低精度说明符永远不会增加着色器可用的最大资源量(例如对输入、输出或临时存储的限制),因为着色器必须始终能够在不以低精度运行的硬件上运行。

7.20.4 Feature Exposure

7.20.4 功能曝光

In the D3D system, HLSL shaders are compiled independent of any given device – e.g. they should typically be compiled offline. This compilation step produces device-agnostic bytecode, apart from the choice of shader target, e.g. vs_4_0.

在 D3D 系统中,HLSL 着色器的编译独立于任何给定设备,例如,它们通常应脱机编译。此编译步骤生成与设备无关的字节码,除了选择着色器目标(例如vs_4_0)。

The minimum precision facility described above can be optionally used within any 4_0+ shader, including 4_0_level_9_1 to 4_0_level_9_3. These shader targets are all available through the D3D11 runtime, exposing D3D9+ hardware via Shader Model 2_x+. The D3D9 runtime will not expose the low precision modes; updating that runtime is out of scope.

上述最小精度工具可以选择在任何 4_0+ 着色器中使用,包括4_0_level_9_1到4_0_level9_3。这些着色器目标都可通过 D3D11 运行时使用,并通过着色器模型 2_x+ 公开 D3D9+ 硬件。D3D9 运行时不会公开低精度模式 - 更新该运行时超出了范围。

7.20.4.1 Discoverability

7.20.4.1 可发现性

There is a mechanism at the API to discover the precision levels supported by the current device. Note that in Windows 8 the OS did not allow drivers to expose only 10 bit without also exposing 16 bit, but subsequent operating systems relax that requirement (so an implementation may expose 10 bit min precision but not 16 bit min precision).

API 中有一种机制可以发现当前设备支持的精度级别。请注意,在 Windows 8 中,操作系统不允许驱动程序只公开 10 位而不公开 16 位,但后续操作系统放宽了该要求(因此实现可能会公开 10 位最小精度,但不会公开 16 位最小精度)。

Even though the hardware’s precision support is visible to applications, applications do not have to adjust their shaders for the hardware’s precision level given that by definition operations defined with a min precision run at higher precision on hardware that doesn’t support the min precision.

即使硬件的精度支持对应用程序可见,应用程序也不必针对硬件的精度级别调整其着色器,因为根据定义,使用最小精度定义的操作在不支持最小精度的硬件上以更高的精度运行。

It is fine for hardware to not support low precision processing at all, by simply reporting “DEFAULT” as its precision support. It is called “DEFAULT” rather than some numerical precision because, depending on the shader model, there may not be a standard value to express. E.g. the default precision in SM 2.x is fp24 (or greater) within the shader, even though there is no API visible fp24 format. If the device reports “DEFAULT” precision, all min-precision specifiers in shaders are ignored.

硬件完全不支持低精度处理是可以的 - 只需报告“DEFAULT”作为其精度支持即可。之所以将其称为“DEFAULT”而不是一些数值精度,是因为取决于着色器模型,可能没有标准值可以表示。例如,SM 2.x 中的默认精度在着色器中是 fp24(或更高),即使没有 API 可见的 fp24 格式。如果设备报告“DEFAULT”精度,则忽略着色器中的所有最小精度说明符。

D3D9 devices are permitted to report a min-precision level that is lower for the Pixel Shader than for the Vertex Shader (all reported via the Windows Next D3D9 DDI). D3D10+ devices can only report a single min-precision level that applies to all shader stages (reported via the Windows Next D3D11.1 DDI), since it does not seem to make sense to single out the VS any more.

允许 D3D9 设备报告像素着色器的最小精度级别低于顶点着色器(全部通过 Windows Next D3D9 DDI 报告)。D3D10+设备只能报告适用于所有着色器阶段的单个最小精度级别(通过Windows Next D3D11.1 DDI报告)——因为似乎不再有必要单独指定顶点着色器(VS)的精度了。

Note that if the application uses Feature Level 9_x on D3D10+ hardware, the D3D9 DDIs are still used, so the min-precision levels can be reported differently there between VS and PS, as mentioned for D3D9, even though via the D3D11.1 DDI only a single precision can be reported.

请注意,如果应用程序在 D3D10+ 硬件上使用功能级别9_x,则仍使用 D3D9 DDI,因此 VS 和 PS 之间的最小精度级别可能会有所不同,如 D3D9 所述,即使通过 D3D11.1 DDI 只能报告单个精度。

(Note: DDI stands for device driver interface.)
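
To illustrate how an application might reason about the reported support, here is a minimal sketch. It does not call any D3D API; it only interprets a support bitmask. The two flag values are assumed to mirror `D3D11_SHADER_MIN_PRECISION_10_BIT` (0x1) and `D3D11_SHADER_MIN_PRECISION_16_BIT` (0x2), which, to my knowledge, are what `ID3D11Device::CheckFeatureSupport` returns for `D3D11_FEATURE_SHADER_MIN_PRECISION_SUPPORT`; the `SKETCH_` names and the helper itself are hypothetical.

```c
/* Assumed to mirror the D3D11 support flags; local names are hypothetical. */
enum {
    SKETCH_MIN_PRECISION_10_BIT = 0x1,
    SKETCH_MIN_PRECISION_16_BIT = 0x2
};

/* Given the min precision a shader variable asks for (10 or 16 bits) and
   the device's support flags (0 meaning "DEFAULT"), return the lowest
   width the device could use while still honoring the request. The device
   remains free to pick any higher precision. */
unsigned lowest_usable_bits(unsigned requestedMinBits, unsigned supportFlags)
{
    if (requestedMinBits <= 10 && (supportFlags & SKETCH_MIN_PRECISION_10_BIT))
        return 10;
    if (requestedMinBits <= 16 && (supportFlags & SKETCH_MIN_PRECISION_16_BIT))
        return 16;
    return 32;   /* specifier is effectively ignored */
}
```

A device reporting no flags runs everything at 32 bits, and a device exposing only the 10-bit level still runs a min 16-bit variable at 32 bits in this model, which is why shaders need no changes per device.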

7.20.4.2 Shader Management

7.20.4.2 着色器管理

Regardless of the min precision level supported by a given device, it is always valid to use a shader that was compiled using any combination of the low precision levels on it. For example, if a device’s min precision level is 32-bit, it is fine to use a shader compiled with some variables that have a min precision of 10 bit. The device is free to implement the low precision operations at any equal or higher precision level (including precision levels not available at the API).

无论给定设备支持的最小精度级别是什么,使用在其上编译的任何低精度级别组合的着色器始终是有效的。例如,如果设备的最小精度级别为 32 位,可以使用着色器编译一些最小精度为 10 位的变量。该设备可以自由地实现低精度操作在一些同等或更高的精度级别(包括 API 上不可用的精度级别)。

For old drivers (pre-D3D11.1 DDI) that are not aware of the low precision feature, the D3D runtime will patch the shader bytecode on shader creation to remove the min-precision markings. This preserves the intent of the shader, since it is valid for the device to execute operations tagged with a min precision level at a higher precision.

对于不知道低精度功能的旧驱动程序(D3D11.1 之前的 DDI),在着色器创建上,D3D 运行时将修补着色器字节码以将其删除。这保留了着色器的意图,因为设备执行用最小精度级别标记的操作在更高的精度下是有效的。
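
At the bit level, removing the feature amounts to clearing the fields described in 7.20.3.2.1. The sketch below shows only that bit manipulation, using the mask and flag values from the earlier listing under hypothetical `SKETCH_` names; how the runtime actually walks and patches the token stream is not described here and is not modeled.

```c
#include <stdint.h>

#define SKETCH_OPERAND_MIN_PRECISION_MASK   0x0001C000u  /* operand field, bits [16:14] */
#define SKETCH_GLOBAL_FLAG_ENABLE_MIN_PREC  (1u << 16)   /* dcl_globalFlags bit 16 */

/* Reset the min-precision field of an extended operand token to DEFAULT (0). */
uint32_t strip_operand_min_precision(uint32_t extendedOperandToken)
{
    return extendedOperandToken & ~SKETCH_OPERAND_MIN_PRECISION_MASK;
}

/* Clear the "enable minimum-precision data types" global flag. */
uint32_t strip_min_precision_global_flag(uint32_t globalFlagsOpcodeToken0)
{
    return globalFlagsOpcodeToken0 & ~SKETCH_GLOBAL_FLAG_ENABLE_MIN_PREC;
}
```

After stripping, every operand reads as default precision, which is a legal way to execute the original shader because running at higher precision than the stated minimum is always allowed.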

7.20.4.3 APIs/DDIs

7.20.4.3 API/DDIs

An API is added for reporting device precision support; no other D3D11 API surface area changes apply.

用于报告设备精度支持的 API,不适用其他 D3D11 API 外围应用更改。

As for DDI additions, there is device precision reporting, the shader bytecode additions detailed earlier, and finally a variant of the existing shader stage I/O signature DDI:

就其他 DDI 的添加而言,有设备精度报告,前面详细介绍的着色器字节码添加,最后是现有着色器阶段 I/O 签名 DDI 的一个变体:

The I/O signature DDI includes MinPrecision in the signature entry. This shows up as D3D11_SB_INSTRUCTION_MIN_PRECISION_DEFAULT if the shader didn’t specify a min-precision:

I/O 签名 DDI 在条目中包括 MinPrecision。如果着色器未指定最小精度,则显示为D3D11_SB_INSTRUCTION_MIN_PRECISION_DEFAULT:

typedef struct D3D11_1DDIARG_SIGNATURE_ENTRY

{

    D3D10_SB_NAME SystemValue; // D3D10_SB_NAME_UNDEFINED if the particular entry doesn't have a system name.

    UINT Register;

    BYTE Mask;// (D3D10_SB_OPERAND_4_COMPONENT_MASK >> 4), meaning 4 LSBs are xyzw respectively

    D3D11_SB_INSTRUCTION_MIN_PRECISION MinPrecision;

} D3D11_1DDIARG_SIGNATURE_ENTRY;

typedef struct D3D11_1DDIARG_STAGE_IO_SIGNATURES

{

    D3D11_1DDIARG_SIGNATURE_ENTRY*  pInputSignature;

    UINT                            NumInputSignatureEntries;

    D3D11_1DDIARG_SIGNATURE_ENTRY*  pOutputSignature;

    UINT                            NumOutputSignatureEntries;

} D3D11_1DDIARG_STAGE_IO_SIGNATURES;

Motivation: Recall that this DDI exists to complement the shader creation DDIs by providing a more complete picture of the shader stage<->stage I/O layout than may be visible just from an individual shader’s bytecode.

动机:回想一下,此 DDI 的存在是为了补充着色器创建 DDI,通过提供比单个着色器的字节码可能可见的更完整的着色器阶段<->阶段I/O布局的图片。

For example sometimes an upstream stage provides data not consumed by a downstream shader, but it should be possible for a driver to compile a shader on its own without having to wait and see what other shaders it gets used with.

例如,有时上游阶段提供的数据并未被下游着色器使用,但驱动程序应该可以自行编译着色器,而不必等待并查看它会与哪些其他着色器一起使用。

MinPrecision is added in case that affects how the driver shader compiler would want to pack the inter-stage I/O data.

添加MinPrecision是为了考虑这可能会影响驱动程序着色器编译器如何打包阶段间I/O数据。

7.20.4.4 HLSL Exposure

7.20.4.4 HLSL暴露

Out of scope for this spec.

超出此规范的范围。
