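DirectX 11 Advanced Tutorial: Tile-Based Deferred Shading

Article link: https://blog.csdn.net/qq_29523119/article/details/115837278

Preface

Many games contain a large number of point lights (PointLight). To give a scene a realistic atmosphere, environment artists may place more than a thousand point lights in a single scene, and that is not unusual.

This article first looks at how the traditional rendering pipelines cope with large numbers of point lights.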

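Point light calculation in traditional forward rendering

The general idea is that each object is drawn in its own render pass, and the point lights that affect the object are passed to the shader as an array and evaluated there.

Summary: in traditional forward rendering a single pixel may be covered by many objects, so overdraw is high and much of the shading work is wasted. For opaque objects only the frontmost fragment ends up on screen, so shading the fragments behind it is unnecessary. This is what motivates deferred rendering.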

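Traditional deferred rendering

Traditional deferred rendering is simple: render all objects in the scene into several geometry textures (the G-buffer), then use those textures to compute shading in screen space. A popular way to handle point lights is to treat each one as a sphere (LightSphereVolume) and rasterize it to the screen, so that shading runs only for the pixels inside that light's radius instead of every light being evaluated for every pixel on screen. The render target is set to additive blending, so N lights are accumulated N times, and the result is the final image lit by all point lights.

Summary: compared with forward rendering, only the frontmost layer of pixels is shaded, so overdraw and wasted computation drop sharply. However, N point lights mean N light-sphere render passes, and every pass reads the G-buffer textures and writes a shading result for each pixel it covers, which wastes a great deal of GPU bandwidth, as shown below:

Graphics engineers therefore proposed a render pipeline that evaluates point lights more efficiently on top of deferred rendering: Tiled Deferred Shading.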

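Tile-Based Deferred Shading

The diagram of traditional deferred rendering above shows its main weakness, high GPU bandwidth usage. The ideal improved model is as follows:

In the ideal case, shading a pixel reads the G-buffer only once and writes the shading result only once.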
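
To make the difference concrete, here is a rough C++ sketch that counts G-buffer fetches and output writes under both models. The struct and function names are hypothetical, and the model deliberately ignores caches, compression and the cost of reading light data, so treat the numbers as an illustration rather than a measurement.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cost model: counts only G-buffer texel fetches and
// render-target writes, ignoring caches, compression and light data.
struct Traffic {
    std::uint64_t gbufferReads;
    std::uint64_t outputWrites;
};

// Classic deferred: each light volume re-reads every G-buffer target and
// blends into the output once per pixel it covers.
inline Traffic classicDeferredTraffic(std::uint64_t pixelsPerLight,
                                      std::uint64_t lightCount,
                                      std::uint64_t gbufferTargets) {
    assert(gbufferTargets > 0);
    const std::uint64_t shadedPixels = pixelsPerLight * lightCount;
    return Traffic{shadedPixels * gbufferTargets, shadedPixels};
}

// Ideal model: every screen pixel reads each G-buffer target once and
// writes its shading result once, regardless of the number of lights.
inline Traffic idealTraffic(std::uint64_t screenPixels,
                            std::uint64_t gbufferTargets) {
    assert(gbufferTargets > 0);
    return Traffic{screenPixels * gbufferTargets, screenPixels};
}
```

With illustrative inputs of 1000 lights that each cover 20,000 pixels and 4 G-buffer targets, classicDeferredTraffic reports 80,000,000 fetches, while idealTraffic for a 1920x1080 screen reports 8,294,400.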

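To approach this ideal model, graphics engineers proposed tiling: on top of deferred rendering, divide the screen into a grid of N x N tiles, each with a resolution of 16x16 pixels, use a compute shader, which is well suited to this kind of parallel work, to determine which lights affect which tiles, and then shade the pixels of each tile with only the point lights that affect it.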
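
As a small illustration of the tiling itself, the following C++ sketch computes how many tiles cover a screen and which tile owns a pixel. The helper names are hypothetical; the tile size of 16 comes from the text above, and rounding up is an assumption so that a partially covered last row or column still gets a tile.

```cpp
#include <cassert>
#include <cstdint>

// Tile edge length in pixels, as described in the text.
constexpr std::uint32_t kTileSize = 16;

// Number of tiles needed to cover `pixels` along one axis. Rounds up so a
// partially covered last tile is still counted.
inline std::uint32_t tileCount(std::uint32_t pixels,
                               std::uint32_t tileSize = kTileSize) {
    assert(tileSize > 0);
    return (pixels + tileSize - 1) / tileSize;
}

struct TileCoord {
    std::uint32_t x;
    std::uint32_t y;
};

// Tile that owns a given pixel. With one thread group per tile this is the
// value the compute shader receives as SV_GroupID.
inline TileCoord tileOfPixel(std::uint32_t px, std::uint32_t py,
                             std::uint32_t tileSize = kTileSize) {
    assert(tileSize > 0);
    return TileCoord{px / tileSize, py / tileSize};
}
```

For a 1920x1080 target this gives 120 x 68 tiles. Since 1080 = 67 * 16 + 8, the last row of tiles is only half covered by real pixels.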

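Below, Tile-Based Deferred Shading is abbreviated as TBDS.

The TBDS rendering flow:

(1) Render the G-buffer for the whole scene.

(2) In the compute shader, split the screen into tiles (a tile is usually 16x16 or 32x32 pixels) and compute the maximum and minimum Z over all pixels of each tile (camera space is usually the most convenient).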

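// G-buffer inputs written by step (1)
Texture2D<float4> DepthTex:register(t0);
Texture2D<float4> WorldPosTex:register(t1);
Texture2D<float4> WorldNormalTex:register(t2);
Texture2D<float4> SpecularRoughMetalTex:register(t3);
Texture2D<float4> AlbedoTex:register(t4);
SamplerState clampLinearSample:register(s0);
// PointLight and the constants used below (View, cameraPos, lightCount,
// ScreenWidth, ScreenHeight) are not shown in this listing.
StructuredBuffer<PointLight> PointLights : register(t5);
RWTexture2D<float4> OutputTexture : register(u0);

// Per-tile state shared by all threads of one thread group.
// groupshared variables cannot have initializers, so they are reset in CS.
groupshared uint minDepthInt;
groupshared uint maxDepthInt;
groupshared uint visibleLightCount;
groupshared uint visibleLightIndices[1024];

[numthreads(GroundThreadSize, GroundThreadSize, 1)]
void CS(
	uint3 groupId :  SV_GroupID,
	uint3 groupThreadId : SV_GroupThreadID,
	uint groupIndex : SV_GroupIndex,
	uint3 dispatchThreadId : SV_DispatchThreadID)
{
	// (2) Compute the camera-space MaxZ and MinZ of each tile
	float depth = DepthTex[dispatchThreadId.xy].r;
	float viewZ = DepthBufferConvertToLinear(depth);
	// Non-negative floats keep their ordering when reinterpreted as uint,
	// which lets the integer atomics below act as a float min/max.
	uint depthInt = asuint(viewZ);

	// One thread resets the shared state of the tile
	if (groupIndex == 0)
	{
		minDepthInt = 0xFFFFFFFF;
		maxDepthInt = 0;
		visibleLightCount = 0;
	}
	GroupMemoryBarrierWithGroupSync();

	// Pixels with depth 0.0 are excluded from the tile's depth range
	if (depth != 0.0)
	{
		InterlockedMin(minDepthInt, depthInt);
		InterlockedMax(maxDepthInt, depthInt);
	}

	// Wait until every thread has contributed its depth
	GroupMemoryBarrierWithGroupSync();

	float minViewZ = asfloat(minDepthInt);
	float maxViewZ = asfloat(maxDepthInt);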
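
The shader above reinterprets the view-space depth with asuint so that InterlockedMin and InterlockedMax, which work on integers, can compute a floating-point range. The C++ sketch below shows why that works on the CPU side: for non-negative IEEE 754 floats, comparing the raw bit patterns as unsigned integers gives the same ordering as comparing the floats. The function names are hypothetical, and a plain loop stands in for the atomics.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float as an unsigned integer (like HLSL asuint).
inline std::uint32_t floatBits(float v) {
    std::uint32_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    return bits;
}

// Reinterpret an unsigned integer as a float (like HLSL asfloat).
inline float bitsToFloat(std::uint32_t bits) {
    float v;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}

struct DepthRange {
    float minZ;
    float maxZ;
};

// Min/max over a tile's view-space depths using only integer comparisons.
inline DepthRange tileDepthRange(const float* viewZ, int count) {
    assert(count > 0);
    std::uint32_t minBits = 0xFFFFFFFFu;
    std::uint32_t maxBits = 0u;
    for (int i = 0; i < count; ++i) {
        // The ordering trick only holds for non-negative floats.
        assert(viewZ[i] >= 0.0f);
        const std::uint32_t b = floatBits(viewZ[i]);
        if (b < minBits) minBits = b;
        if (b > maxBits) maxBits = b;
    }
    return DepthRange{bitsToFloat(minBits), bitsToFloat(maxBits)};
}
```

Negative values would break the ordering because the sign bit makes their bit patterns compare as very large unsigned integers, so this approach relies on view-space Z being non-negative.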

(3) Compute the frustum of each tile (a view frustum in camera space)

	float3 frustumEqn0, frustumEqn1, frustumEqn2, frustumEqn3;
	uint tileResWidth = GroundThreadSize * GetNumTilesX();
	uint tileResHeight = GroundThreadSize * GetNumTilesY();
	uint pxm = GroundThreadSize * groupId.x;
	uint pym = GroundThreadSize * groupId.y;
	uint pxp = GroundThreadSize * (groupId.x + 1);
	uint pyp = GroundThreadSize * (groupId.y + 1);

	// four corners of the tile, clockwise from top-left
	float3 frustum0 = ConvertProjToView(float4(pxm / (float)tileResWidth*2.f - 1.f, (tileResHeight - pym) / (float)tileResHeight*2.f - 1.f, 1.f, 1.f)).xyz;
	float3 frustum1 = ConvertProjToView(float4(pxp / (float)tileResWidth*2.f - 1.f, (tileResHeight - pym) / (float)tileResHeight*2.f - 1.f, 1.f, 1.f)).xyz;
	float3 frustum2 = ConvertProjToView(float4(pxp / (float)tileResWidth*2.f - 1.f, (tileResHeight - pyp) / (float)tileResHeight*2.f - 1.f, 1.f, 1.f)).xyz;
	float3 frustum3 = ConvertProjToView(float4(pxm / (float)tileResWidth*2.f - 1.f, (tileResHeight - pyp) / (float)tileResHeight*2.f - 1.f, 1.f, 1.f)).xyz;
	frustumEqn0 = CreatePlaneEquation(frustum0, frustum1);
	frustumEqn1 = CreatePlaneEquation(frustum1, frustum2);
	frustumEqn2 = CreatePlaneEquation(frustum2, frustum3);
	frustumEqn3 = CreatePlaneEquation(frustum3, frustum0);
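
The corner expressions above map pixel coordinates to normalized device coordinates (NDC) before ConvertProjToView takes them into camera space. The following C++ sketch reproduces only that pixel-to-NDC mapping so it can be checked in isolation. The type and function names are hypothetical, and numTilesX and numTilesY stand in for GetNumTilesX() and GetNumTilesY().

```cpp
#include <cassert>
#include <cstdint>

struct Ndc {
    float x;
    float y;
};

// Map a pixel corner to NDC. paddedWidth/paddedHeight are the screen size
// rounded up to a whole number of tiles. Y is flipped because pixel rows
// grow downward while NDC y grows upward.
inline Ndc pixelToNdc(std::uint32_t px, std::uint32_t py,
                      std::uint32_t paddedWidth, std::uint32_t paddedHeight) {
    assert(paddedWidth > 0 && paddedHeight > 0);
    assert(px <= paddedWidth && py <= paddedHeight);
    return Ndc{px / float(paddedWidth) * 2.0f - 1.0f,
               (paddedHeight - py) / float(paddedHeight) * 2.0f - 1.0f};
}

struct TileCorners {
    Ndc topLeft, topRight, bottomRight, bottomLeft;
};

// Four corners of a tile, clockwise from top-left, matching the order of
// frustum0..frustum3 in the shader.
inline TileCorners tileCornersNdc(std::uint32_t tileX, std::uint32_t tileY,
                                  std::uint32_t tileSize,
                                  std::uint32_t numTilesX,
                                  std::uint32_t numTilesY) {
    const std::uint32_t w = tileSize * numTilesX;
    const std::uint32_t h = tileSize * numTilesY;
    const std::uint32_t pxm = tileSize * tileX;
    const std::uint32_t pym = tileSize * tileY;
    const std::uint32_t pxp = tileSize * (tileX + 1);
    const std::uint32_t pyp = tileSize * (tileY + 1);
    return TileCorners{pixelToNdc(pxm, pym, w, h), pixelToNdc(pxp, pym, w, h),
                       pixelToNdc(pxp, pyp, w, h), pixelToNdc(pxm, pyp, w, h)};
}
```

For a 120 x 68 grid of 16-pixel tiles, the top-left corner of tile (0, 0) maps to (-1, 1) and the bottom-right corner of tile (119, 67) maps to (1, -1), so the tiles together span the full NDC square.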

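(4) For each tile, loop over all point lights, cull them with both the tile frustum and the tile depth range, and add the global index of each light that affects the tile to the tile's visible-light list

	// (4) Find the point lights that intersect each tile and record their indices
	uint threadCount = GroundThreadSize * GroundThreadSize;
	// Each thread tests one light per pass, so the whole group covers
	// threadCount lights per pass
	uint passCount = (int(lightCount) + threadCount - 1) / threadCount;

	for (uint i = 0; i < passCount; ++i)
	{
		uint lightIndex = i * threadCount + groupIndex;
		if (lightIndex >= lightCount)
			continue;

		PointLight light = PointLights[lightIndex];
		float3 viewLightPos = mul(float4(light.pos, 1.0), View).xyz;
		// Frustum culling: the light sphere against the four side planes of the tile
		if(TestFrustumSides(viewLightPos, light.radius, frustumEqn0, frustumEqn1, frustumEqn2, frustumEqn3))
		{
			// Depth culling: the light sphere against the tile's [minViewZ, maxViewZ] range
			if (minViewZ - viewLightPos.z < light.radius && viewLightPos.z - maxViewZ < light.radius)
			{
				uint offset;
				InterlockedAdd(visibleLightCount, 1, offset);
				// Do not write past the end of visibleLightIndices (1024 entries)
				if (offset < 1024)
					visibleLightIndices[offset] = lightIndex;
			}
		}
	}

	// Make the complete light list visible to every thread in the tile
	GroupMemoryBarrierWithGroupSync();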
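
The listing calls CreatePlaneEquation and TestFrustumSides without showing them. The C++ sketch below is one plausible form of these helpers, not necessarily the author's: it assumes each side plane passes through the camera origin, that its normal is the normalized cross product of two adjacent corners (which points out of the tile frustum for corners given clockwise from top-left), and that a sphere passes when its signed distance to every plane is less than its radius. The depth test is copied from the condition in the shader.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 {
    float x, y, z;
};

inline Vec3 cross(Vec3 a, Vec3 b) {
    return Vec3{a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x};
}

inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

inline Vec3 normalize(Vec3 v) {
    const float len = std::sqrt(dot(v, v));
    assert(len > 0.0f);
    return Vec3{v.x / len, v.y / len, v.z / len};
}

// Plane through the camera origin and two tile corners (assumed form).
// Only the normal is stored because the plane's distance term is zero.
inline Vec3 createPlaneEquation(Vec3 b, Vec3 c) {
    return normalize(cross(b, c));
}

// The sphere touches the tile frustum if it is not entirely on the outer
// side of any of the four side planes.
inline bool testFrustumSides(Vec3 center, float radius, const Vec3 planes[4]) {
    for (int i = 0; i < 4; ++i) {
        if (dot(planes[i], center) >= radius) return false;
    }
    return true;
}

// Same condition as the shader: the sphere overlaps [minViewZ, maxViewZ].
inline bool testDepthRange(float lightZ, float radius, float minViewZ,
                           float maxViewZ) {
    return (minViewZ - lightZ < radius) && (lightZ - maxViewZ < radius);
}
```

A light passes culling only when both tests succeed. The frustum test rejects lights to the side of the tile, and the depth test rejects lights that float in front of or behind all geometry visible in the tile.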

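(5) Loop over the lights in the tile's visible-light list and shade every pixel in the tile. This way each G-buffer render target is read only once and the shading result is written only once, so GPU bandwidth usage is low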


	// Accumulated lighting for this pixel
	float4 color = float4(0.0, 0.0, 0.0, 1.0);

	if (visibleLightCount > 0)
	{
		// Sample at the texel center so the linear sampler does not blend
		// neighbouring G-buffer texels
		float2 uv = (float2(dispatchThreadId.xy) + 0.5) / float2(ScreenWidth, ScreenHeight);

		// G-buffer position (1 float unused)
		float3 worldPos = WorldPosTex.SampleLevel(clampLinearSample, uv, 0).xyz;

		// G-buffer normal (1 float unused)
		float3 worldNormal = WorldNormalTex.SampleLevel(clampLinearSample, uv, 0).xyz;
		worldNormal = normalize(worldNormal);

		float3 albedo = AlbedoTex.SampleLevel(clampLinearSample, uv, 0).xyz;

		// G-buffer specular / roughness / metalness (1 float unused)
		float3 gBufferAttrbite = SpecularRoughMetalTex.SampleLevel(clampLinearSample, uv, 0).xyz;
		float specular = gBufferAttrbite.x;
		float roughness = gBufferAttrbite.y;
		float metal = gBufferAttrbite.z;

		// Never read past the end of visibleLightIndices (1024 entries)
		uint tileLightCount = min(visibleLightCount, 1024);

		for (uint index = 0; index < tileLightCount; ++index)
		{
			uint lightIndex = visibleLightIndices[index];
			PointLight light = PointLights[lightIndex];
			float3 pixelToLightDir = light.pos - worldPos;
			float distance = length(pixelToLightDir);
			float3 L = normalize(pixelToLightDir);
			float3 V = normalize(cameraPos - worldPos);
			float3 H = normalize(L + V);
			// Constant / linear / quadratic distance attenuation
			float4 attenuation = light.attenuation;
			float attenua = 1.0 / (attenuation.x + attenuation.y * distance + distance * distance * attenuation.z);
			float3 radiance = light.color * attenua;

			// f(cook_torrance) = D * F * G / (4 * (wo.n) * (wi.n))
			float D = DistributionGGX(worldNormal, H, roughness);
			float G = GeometrySmith(worldNormal, V, L, roughness);
			float3 fo = GetFresnelF0(albedo, metal);
			float cosTheta = max(dot(V, H), 0.0);
			float3 F = FresnelSchlick(cosTheta, fo);
			// Energy not reflected specularly is available for diffuse;
			// metals have no diffuse term
			float3 ks = F;
			float3 kd = float3(1.0, 1.0, 1.0) - ks;
			kd *= 1.0 - metal;

			float3 dfg = D * G * F;
			float nDotl = max(dot(worldNormal, L), 0.0);
			float nDotv = max(dot(worldNormal, V), 0.0);
			float denominator = 4.0 * nDotv * nDotl;
			// Clamp the denominator to avoid division by zero at grazing angles
			float3 specularFactor = dfg / max(denominator, 0.001);

			color.xyz += (kd * albedo / PI + specularFactor * specular) * radiance * nDotl * 2.2;
		}
	}

	// Single write of the final shading result for this pixel
	OutputTexture[dispatchThreadId.xy] = color;
}
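
The shading loop relies on helper functions (DistributionGGX, GeometrySmith, FresnelSchlick, GetFresnelF0) that the article does not list. The C++ sketch below shows commonly used forms of these terms together with the attenuation formula from the loop. These are assumptions for illustration and may differ from the author's shader: roughness is remapped as alpha = roughness^2, the geometry term uses k = (roughness + 1)^2 / 8, the base reflectance of dielectrics is taken as 0.04, and everything is written per scalar channel with precomputed dot products.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

constexpr float kPi = 3.14159265358979f;

// Distance attenuation as used in the loop: 1 / (c + l*d + q*d^2).
inline float attenuation(float constant, float linear, float quadratic,
                         float distance) {
    const float denom =
        constant + linear * distance + quadratic * distance * distance;
    assert(denom > 0.0f);
    return 1.0f / denom;
}

// GGX (Trowbridge-Reitz) normal distribution with alpha = roughness^2.
inline float distributionGGX(float nDotH, float roughness) {
    const float a = roughness * roughness;
    const float a2 = a * a;
    const float d = nDotH * nDotH * (a2 - 1.0f) + 1.0f;
    return a2 / (kPi * d * d);
}

// Schlick-GGX geometry term for one direction (direct-lighting remap of k).
inline float geometrySchlickGGX(float nDotX, float roughness) {
    const float r = roughness + 1.0f;
    const float k = r * r / 8.0f;
    return nDotX / (nDotX * (1.0f - k) + k);
}

// Smith geometry term: shadowing for the light times masking for the view.
inline float geometrySmith(float nDotV, float nDotL, float roughness) {
    return geometrySchlickGGX(nDotV, roughness) *
           geometrySchlickGGX(nDotL, roughness);
}

// Schlick's Fresnel approximation for one colour channel.
inline float fresnelSchlick(float cosTheta, float f0) {
    const float c = std::min(std::max(1.0f - cosTheta, 0.0f), 1.0f);
    return f0 + (1.0f - f0) * std::pow(c, 5.0f);
}

// Base reflectance: 0.04 for dielectrics, the albedo for pure metals.
inline float fresnelF0(float albedo, float metal) {
    return 0.04f * (1.0f - metal) + albedo * metal;
}
```

At normal incidence (cosTheta = 1) fresnelSchlick returns f0 unchanged, and at grazing incidence (cosTheta = 0) it approaches 1, which is why ks grows and kd shrinks toward the silhouette of an object.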