OpenGL 4 Shading Language Cookbook chapter 10——Using Compute Shaders

Compute shaders were introduced into OpenGL with Version 4.3. A compute shader is a
shader stage that can be used for arbitrary computation. It provides the ability to leverage 利用 the GPU and its inherent parallelism for general computing tasks that might have previously been implemented in serial on the CPU. The compute shader is most useful for tasks that are not directly related to rendering, such as physical simulation. Although APIs such as OpenCL and CUDA are already available for general purpose computation on the GPU, they are completely separate from OpenGL. Compute shaders are integrated directly within OpenGL, and therefore are more suitable for general computing tasks that are more closely related to graphics rendering.

The compute shader is not a traditional shader stage in the same sense as the fragment or vertex shader. It is not executed in response to rendering commands. In fact, when a compute shader is linked with vertex, fragment, or other shader stages, it is effectively inert when drawing commands are executed. The only way to execute the compute shader is via the OpenGL commands glDispatchCompute or glDispatchComputeIndirect.

Compute shaders do not have any direct user-defined inputs and no outputs at all. It gets its work by fetching data directly from memory using image access functions such as the image load/store operations, or via shader storage buffer objects. Similarly, it provides its results by writing to the same or other objects. The only non-user-defined inputs to a compute shader are a set of variables that determine where the shader invocation is within its “space” of execution.

The number of invocations of the compute shader is completely user defined. It is not tied in
any way to the number of vertices or fragments being rendered. We specify the number of
invocations by defining the number of work groups, and the number of invocations within each
work group.

Compute space and work groups
The number of invocations of a compute shader is governed by the user-defined compute
space. This space is divided into a number of work groups. Each work group is then broken
down into a number of invocations. We think of this in terms of the global compute space (all
shader invocations) and the local work group space (the invocations within a particular work
group). The compute space can be defined as a 1, 2 or 3 dimensional space.

technically, it is always defined as a three-dimensional space, but any of the three dimensions can be defined with a size of one(1), which effectively removes the dimension.

For example, a one-dimensional compute space with five work groups and three invocations
per work group could be represented as the following figure. The thicker lines represent the
work groups, and the thinner lines represent the invocations within each work group.
在这里插入图片描述

in this case, we have 5*3 = 15 shader invocations.
the grey shaded invocation is in work group 2, and within that work group, it is invocation 1 (the invocations are indexed starting at zero).
we can also refer to that invocation with a global index o 7, by indexing the total number of invocations starting at zero.
the global index determines an invocations’s location within the global compute space, rather than just within the work group.
It is determined by
taking the product of work group (2) and index number of invocations per work group (3), plus
the local invocation index (1) that is 2 * 3 + 1 = 7. The global index is simply the index of each
invocation in the global compute space, starting at zero on the left and counting from there.

The following figure shows a representation of a two-dimensional compute space where
the space is divided into 20 work groups, four in the x direction and five in the y direction.
Each work group is then divided into nine invocations, three in the x direction and three in
the y direction.

在这里插入图片描述

The cell that is shaded in gray represents invocation (0, 1) within work group (2, 0). The total
number of compute shader invocations in this example is then 20 * 9 = 180. The global index
of this shaded invocation is (6, 1). As with the one-dimensional case, this index is determined
by thinking of it as a global compute space (without the work groups), and can be computed
(for each dimension) by the number of invocations per work group times the work group index,
plus the local invocation index. For the x dimension, this would be 3 * 2 + 0 = 6, and for the y
dimension it is 3 * 0 + 1 = 1.

The same idea can extend in a straightforward manner to a three-dimensional compute
space. In general, we choose the dimensionality based on the data to be processed. For
example, if I’m working on the physics of a particle simulation, I would just have a list of
particles to process, so a one-dimensional compute space might make sense. On the other
hand, if I’m processing a cloth simulation, the data has a grid structure, so a two-dimensional
compute space would be appropriate.

There are limits on the total number of work groups and local
shader invocations. These can be queried (via glGetInteger*)
using the parameters GL_MAX_COMPUTE_WORK_GROUP_
COUNT, GL_MAX_COMPUTE_WORK_GROUP_SIZE, and GL_
MAX_COMPUTE_WORK_GROUP_INVOCATIONS.

The order of execution of the work groups and thereby the individual shader invocations is
unspecified and the system can execute them in any order. Therefore, we shouldn’t rely on
any particular ordering of the work groups. Local invocations within a particular work group
will be executed in parallel (if possible). Therefore, any communication between invocations
should be done with great care. Invocations within a work group can communicate via shared local data, but invocations should not (in general) communicate with invocations in other work groups without consideration of the various pitfalls involved such as deadlock and data races. In fact, those can also be issues for local shared data within a work group as well, and care must be taken to avoid these problems. In general, for reasons of efficiency, it is best to only attempt communication within a work group. As with any kind of parallel programming, “there be dragons here”. OpenGL provides a number of atomic operations and memory barriers that can help with the communication between invocations. We’ll see some examples in the recipes that follow.

Executing the Compute Shader

When we execute the compute shader, we define the compute space. The number of work groups are determined by the parameters to glDispatchCompute. For example, to execute the compute shader with a two-dimensional compute space with 4 work groups in the x dimension and 5 work groups in the y dimension (matching the preceding figure), we’d use the following call:

glDispatchCompute( 4, 5, 1 );

The number of local invocations within each work group is not specified on the OpenGL side. Instead, it is specified within the compute shader itself with a layout specifier. For example, here we specify nine local invocations per work group, 3 in the x direction and 3 in the y direction.

layout (local_size_x = 3, local_size_y = 3) in;

The size in the z dimension can be left out (the default is one).
When a particular invocation of the compute shader is executing, it usually needs to determine where it is within the global compute space. GLSL provides a number of built-in input variables that help with this. Most of them are listed in the following table:
在这里插入图片描述
the last one in the preceding table, gl_GlobalInvocationID is computed in the following way (each operation is component-wise):

gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID

this helps us to locate the current invocation within the global compute space (refer to the preceding examples).

GLSL also defines LocalInvocationIndex, which s a flattened form of gl_LocalInvocationID. It can help when multidimensional data is provided in a linear buffer, but is not used in any of the examples that follow.

implementing a particle simulation with the compute shader

in this recipe we will implement a simple particle simulation. we will have the compute shader handle the physics computations and upate the particle positions directly.
then, we will just render the particles as points. without the compute shader, we would need to udpate the positions on the CPU by stepping through the array of particles and updating each position in a serial fashion, or by making use of transform feedback as shown in the creating a particle system using transform feedback recipe in chapter 9, particle systems and animation.
doing such animations with vertex shaders is sometimes counterintuitive and requries some additional work (such as transform feedback setup). with the compute shader, we can do the particle physics in parallel on the gpu, and customize our compute space to get the most ‘bang for the buck’ out of our gpu.

The following figure shows our particle simulation running with one million particles. Each
particle is rendered as a 1 x 1 point. The particles are partially transparent, and the particle
attractors are rendered as small 5 x 5 squares (barely visible).

在这里插入图片描述

These simulations can create beautiful, abstract figures, and are a lot of fun to produce.

For our simulation, we’ll define a set of attractors (two in this case, but you can create more),
which I’ll call the black holes. They will be the only objects that affect our particles and they’ll
apply a force on each particle that is inversely proportional to the distance between the
particle and the black hole. More formally, the force on each particle will be determined by the
following equation:
在这里插入图片描述
Where N is the number of black holes (attractors), ri is the vector between the ith attractor
and the particle (determined by the position of the attractor minus the particle position), and
Gi is the strength of the ith attractor.

To implement the simulation, we compute the force on each particle and then update the
position by integrating the Newtonian equations of motion. There are a number of well studied
numerical techniques for integrating the equations of motion. For this simulation, the simple
Euler method is sufficient. With the Euler method, the position of the particle at time t + Δt is
given by the following equation:

在这里插入图片描述

Where P is the position of the particle, v is the velocity, and a is the acceleration. Similarly, the
updated velocity is determined by the following equation:
在这里插入图片描述

These equations are derived from a Taylor expansion of the position function about time t. The
result is dependent upon the size of the time step (Δt), and is more accurate when the time
step is very small.
The acceleration is directly proportional to the force on the particle, so by calculating the
force on the particle (using the preceding equation), we essentially have a value for the
acceleration. To simulate the particle’s motion, we track its position and velocity, determine
the force on the particle due to the black holes, and then update the position and velocity
using the equations.
We’ll use the compute shader to implement the physics here. Since we’re just working with a
list of particles, we’ll use a one-dimensional compute space, and work groups of about 1000
particles each. Each invocation of the compute shader will be responsible for updating the
position of a single particle.
We’ll use shader storage buffer objects to track the positions and velocities, and when
rendering the particles themselves, we can just render directly from the position buffer.

in the opengl side, we need a buffer for the position of the particles and a buffer for the velocity.
create a buffer containing the initial positions of the particles and a buffer with zeroes for the initial velocities.
we will use four-component positions and velocities for this example in order to avoid issues with data layouts.
for example, to create the buffer for the positions, we might do sth. as follows:

vector<GLfloat> initPos;// Set initial positions
GLuint bufSize = totalParticles * 4 * sizeof(GLfloat);
GLuint posBuf;
glGenBuffers(1, &posBuf);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, posBuf);
glBufferData(GL_SHADER_STORAGE_BUFFER, bufSize, &initPos[0], GL_DYNAMIC_DRAW);

use a similar process for the velocity data, but bind it to index one of the GL_SHADER_STORAGE_BUFFER binding location.

glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, velBuf);

set up a vertex array object that uses the same position buffer as its data source for the vertex position.

to render the points, set up a vertex and fragment shader pair that just produces a solid color.
enable blending and set up a standard blending function.

Use the following steps:

  1. We’ll use the compute shader for updating the positions of the particles.
layout( local_size_x = 1000 ) in;
uniform float Gravity1 = 1000.0;
uniform vec3 BlackHolePos1;
uniform float Gravity2 = 1000.0;
uniform vec3 BlackHolePos2;
uniform float ParticleInvMass = 1.0 / 0.1;
uniform float DeltaT = 0.0005;

layout(std430, binding=0) buffer Pos 
{
	vec4 Position[];
};
layout(std430, binding=1) buffer Vel 
{
	vec4 Velocity[];
};
void main() 
{
	uint idx = gl_GlobalInvocationID.x;
	vec3 p = Position[idx].xyz;
	vec3 v = Velocity[idx].xyz;
	// Force from black hole #1
	vec3 d = BlackHolePos1 - p;
	vec3 force = (Gravity1 / length(d)) * normalize(d);
	// Force from black hole #2
	d = BlackHolePos2 - p;
	force += (Gravity2 / length(d)) * normalize(d);
	// Apply simple Euler integrator
	vec3 a = force * ParticleInvMass;
	Position[idx] = vec4(
	p + v * DeltaT + 0.5 * a * DeltaT * DeltaT, 1.0);
	Velocity[idx] = vec4( v + a * DeltaT, 0.0);
}
  1. in the render routine, invoke the compute shader to update the particle positions.
glDispatchCompute(totalParticles / 1000, 1, 1);
  1. Then make sure that all data has been written out to the buffer, by invoking a
    memory barrier.
glMemoryBarrier( GL_SHADER_STORAGE_BARRIER_BIT );
  1. Finally, render the particles using data in the position buffer.

The compute shader starts by defining the number of invocations per work group using the layout specifier.

layout( local_size_x = 1000 ) in;

this specifies 1000 invocations per work group in the x dimension. u can choose a value for this that makes the most sense for the hardware on which u are running. just make sure to adjust the number of work groups appropriately. the default size for each dimension is one so we do not need to specify the size of the y and z directions.

Then, we have a set of uniform variables that define the simulation parameters. Gravity1
and Gravity2 are the strengths of the two black holes (G in the above equation), and
BlackHolePos1 and BlackHolePos2 are their positions. ParticleInvMass is the inverse
of the mass of each particle, which is used to convert force to acceleration. Finally, DeltaT is
the time-step size, used in the Euler method for integration of the equations of motion.
The buffers for position and velocity are declared next. Note that the binding values here
match those that we used on the OpenGL side when initializing the buffers.
Within the main function, we start by determining the index of the particle for which this
invocation is responsible. Since we’re working with a linear list of particles, and the number
of particles is the same as the number of shader invocations, what we want is the index within
the global range of invocations. This index is available via the built-in input variable
gl_GlobalInvocationID.x. We use the global index here because it is the index within
the entire buffer that we need, not the index within our work group, which would only
reference a portion of the entire array.
Next we retrieve the position and velocity from their buffers, and compute the force due to each
black hole, storing the sum in the variable force. Then we convert the force to acceleration
and update the particle’s position and velocity using the Euler method. We write to the same
location from which we read previously. Since invocations do not share data, this is safe.
In the render routine, we invoke the compute shader (step 2 in How to do it…), defining the
number of work groups per dimension. In the compute shader, we specified a work group size
of 1000. Since we want one invocation per particle, we divide the total number of particles by
1000 to determine the number of work groups. Finally, in step 3, before rendering the particles, we need to invoke a memory barrier to ensure that all compute shader writes have fully executed.

using the compute shader for cloth simulation
The compute shader is well suited for harnessing the GPU for physical simulation. Cloth
simulation is a prime example. In this recipe, we’ll implement a simple particle-spring based
cloth simulation using the compute shader. The following is a screenshot of the simulation of
a cloth hanging by five pins. (You’ll have to imagine it animating.)
在这里插入图片描述
a common way to represent cloth is with a particle-spring lattice. the cloth is composed of a 2D grid of point masses, each connected to its neighbouring masses with idealized springs. 弹簧 The following figure represents one of the point masses (center) connected to its
neighboring masses. The lines represent the springs. The dark lines are the horizontal/vertical
springs and the dashed lines are the diagonal springs.
在这里插入图片描述
the total force on a particle is the sum of the forces produced by the eight springs to which it is connected. The force for a single spring is given by the following equation:

在这里插入图片描述
Where K is the stiffness of the spring, R is the rest-length of the spring (the length where the spring applies zero force), and r is the vector between the neighboring particle and the particle (the neighbor’s position minus the particle’s position).

Similar to the previous recipe, the process is simply to compute the total force on each particle and then integrate Newton’s equations of motion using our favorite integration. Again, we’ll use the Euler method for this example. For details on the Euler method, refer to the previous recipe, Implementing a particle simulation with the compute shader. This particle-spring lattice is obviously a two-dimensional structure, so it makes sense to map it to a two-dimensional compute space. We’ll define rectangular work groups and use one shader invocation per particle. Each invocation needs to read the positions of its eight neighbors, compute the force on the particle, and update the particle’s position and velocity. Note that in this case, each invocation needs to read the positions of the neighboring particles. Those neighboring particles will be updated by other shader invocations. Since we can’t rely on any execution order for the shader invocations, we can’t read and write directly to the same buffer. If we were to do so, we wouldn’t know for sure whether we are reading the original positions of the neighbors or their updated positions. To avoid this problem, we’ll use pairs of buffers. For each simulation step, one buffer will be designated for reading and the other for writing, then we’ll swap them for the next step, and repeat.

it might be possible to read/write to the same buffer with careful use of local shared memory; however, there is still the issue of the particles along the edges of the work group their neighbor’s positions are managed by another work group, and again, we have the same problem.

this simulation tends to be quite sensitive to numerical noise, so we need to use a very small intergration time step. a value of around 0.000005 works well. additionally, the simulation looks better when we apply a damping force 阻尼 to simulate air resistance.
a good way to simulate air resistance is to add a force that is proportional to and in the opposite direction to the velocity, as in the following equation:
在这里插入图片描述
where D is the strength of the damping force and v is the velocity of the particle.

starting by setting up two buffers for the particle position and two for the particle velocity. we will bind them to the GL_SHADER_STORAGE_BUFFER indexed binding point at indices 0 and 1 for the position buffers and 2 and 3 for the velocity buffers.
the data layout in these buffers is important. we will lay out the particle positions/velocities in row-major order starting at the lower left and proceeding to the upper right of the lattice.

we will also set up a vertex array object for drawing the cloth using the particle positions as triangle vertices. we may also need buffers for normal vectors and texture coordinates. for brevity, i will omit them from this discussion, but the exmaple code for this book includes them.

use the following steps:

  1. in the compute shader, we start by defining the number of invocations per work group.
layout( local_size_x = 10, local_size_y = 10 ) in;
  1. then, define a set of uniform variables for the simulation parameters.
uniform vec3 Gravity = vec3(0,-10,0);
uniform float ParticleMass = 0.1;
uniform float ParticleInvMass = 1.0 / 0.1;
uniform float SpringK = 2000.0;
uniform float RestLengthHoriz;
uniform float RestLengthVert;
uniform float RestLengthDiag;
uniform float DeltaT = 0.000005;
uniform float DampingConst = 0.1;
  1. next, declare the shader storage buffer pairs for the position and velocity.
layout(std430, binding=0) buffer PosIn {
vec4 PositionIn[];
};
layout(std430, binding=1) buffer PosOut {
vec4 PositionOut[];
};
layout(std430, binding=2) buffer VelIn {
vec4 VelocityIn[];
};
layout(std430, binding=3) buffer VelOut {
vec4 VelocityOut[];
}
  1. in the main function, we get the position of the particle for which this invocation is responsible.
void main() {
uvec3 nParticles = gl_NumWorkGroups * gl_WorkGroupSize;
uint idx = gl_GlobalInvocationID.y * nParticles.x +
gl_GlobalInvocationID.x;
vec3 p = vec3(PositionIn[idx]);
vec3 v = vec3(VelocityIn[idx]), r;
  1. initialize our force with the force due to gravity.
vec3 force = Gravity * ParticleMass;
  1. add the force due to the particle above this one.
if( gl_GlobalInvocationID.y < nParticles.y - 1 ) 
{
	r = PositionIn[idx + nParticles.x].xyz - p;
	force += normalize(r)*SpringK*(length(r) - RestLengthVert);
}
  1. repeat the preceding steps for the particles below and to the left and right. then add the force due to the particle that is diagonally above and to the left.
if( gl_GlobalInvocationID.x > 0 &&
gl_GlobalInvocationID.y < nParticles.y - 1 ) {
r = PositionIn[idx + nParticles.x - 1].xyz - p;
force += normalize(r)*SpringK*(length(r) - RestLengthDiag);
}

6/7 不懂

  1. repeat the above for the other three diagonally connected particles. then add the damping force.
force += -DampingConst * v;
  1. next, we integrate the equations of motion using the euler method.
vec3 a = force * ParticleInvMass;
PositionOut[idx] = vec4(p + v * DeltaT + 0.5 * a * DeltaT * DeltaT, 1.0);
VelocityOut[idx] = vec4( v + a * DeltaT, 0.0);
  1. finally, we pin some of the top verts so that they do not move.
if( gl_GlobalInvocationID.y == nParticles.y - 1 &&
(gl_GlobalInvocationID.x == 0 ||
gl_GlobalInvocationID.x == nParticles.x / 4 ||
gl_GlobalInvocationID.x == nParticles.x * 2 / 4 ||gl_GlobalInvocationID.x == nParticles.x * 3 / 4 ||
gl_GlobalInvocationID.x == nParticles.x - 1)) {
PositionOut[idx] = vec4(p, 1.0);
VelocityOut[idx] = vec4(0,0,0,0);
}
}
  1. within the opengl render function, we invoke the compute shader such that each work group is responsible for 100 particles. since the time step size is so small, we need to execute the process many time (1000), each time swapping the input and output buffers.
for( int i = 0; i < 1000; i++ ) {
glDispatchCompute(nParticles.x/10, nParticles.y/10, 1);
glMemoryBarrier( GL_SHADER_STORAGE_BARRIER_BIT );
// Swap buffers
readBuf = 1 - readBuf;
glBindBufferBase(GL_SHADER_STORAGE_BUFFER,0, posBufs[readBuf]);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER,1, posBufs[1-readBuf]);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER,2, velBufs[readBuf]);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER,3, velBufs[1-readBuf]);
}
  1. finally, we render the cloth using the position data form the position buffer.

how it works???

we use 100 invocations per work group, 10 in each dimension. the first statement in the compute shader defines the number of invocations per work group.

layout( local_size_x = 10, local_size_y = 10 ) in;

the uniform varibles that follow define the constants in the force equations and the rest lengths for each of the horizontal, vertical, and diagonal springs. the time step size is DeltaT. the position and velocity buffers are declared next.
we define the the position buffers at binding indexes 0 and 1, and the velocity buffers at indexes 2 and 3.

in the main function (step 4), we start by determing the number of particles in each dimension.

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值