Introduction
The n-body problem is a famous physics problem. For $N=2$ (the two-body problem) it has been completely solved. $N=3$ gives the famous three-body problem: apart from some special configurations that admit analytical solutions, the general three-body problem still has no analytical solution.
Liu Cixin's novel The Three-Body Problem is set against exactly this backdrop: a highly advanced civilization lives on a planet orbiting three suns, whose irregular rises and settings make the environment extremely hostile; the aliens discover the distant Earth, which has only one sun, and set out to conquer it…
Returning to the concrete N-body problem: given the initial positions, velocities, and masses of N bodies, it studies their motion under Newtonian classical mechanics, including trajectory prediction.
Theoretical Background
By the law of universal gravitation, the force $f_{ij}$ exerted on body $i$ by body $j$ is given by:

$$f_{ij}=G \frac {m_im_j}{{\parallel r_{ij} \parallel}^2} \cdot \frac {r_{ij}}{\parallel r_{ij} \parallel}$$
where $G$ is the gravitational constant, $m_i$ and $m_j$ are the masses of bodies $i$ and $j$, $r_{ij}$ is the displacement vector from body $i$ to body $j$, and ${\parallel r_{ij} \parallel}^2$ is the squared distance, a scalar. Multiplying by the unit vector $\frac {r_{ij}}{\parallel r_{ij} \parallel}$ yields the force vector.
For N bodies, given each body's mass $m_i$, initial position $x_i$, and velocity $v_i$ ($1 \leq i \leq N$), the net force $F_i$ acting on body $i$ is:
$$F_i=\sum_{\substack{1 \leq j \leq N \\ j \neq i}} f_{ij}=Gm_i\cdot\sum_{\substack{1 \leq j \leq N \\ j \neq i}}\frac {m_jr_{ij}}{{\parallel r_{ij} \parallel}^3}$$
Since the force between two bodies diverges as their distance goes to zero, the constraint $j \neq i$ is required above. This is inconvenient for numerical simulation, however: CUDA executes in SIMT fashion, and the extra branch would noticeably degrade performance. Following the treatment in "Fast N-Body Simulation with CUDA", a softening factor $\epsilon$ is introduced, turning the expression into:
$$F_i \thickapprox Gm_i\cdot\sum_{1 \leq j \leq N}\frac {m_jr_{ij}}{({{\parallel r_{ij} \parallel}^2 + \epsilon^2)}^\frac{3}{2}}$$
The acceleration $a_i$ of body $i$ at a given instant then follows:

$$a_i=\frac{F_i}{m_i}\thickapprox G\cdot\sum_{1 \leq j \leq N}\frac {m_jr_{ij}}{({{\parallel r_{ij} \parallel}^2 + \epsilon^2)}^\frac{3}{2}}$$
Thus, given the masses of all bodies and their velocities and positions at one instant, the accelerations yield their velocities and positions at the next instant.
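The update scheme just described (compute softened accelerations, then advance velocities and positions) can be sketched in plain C++. This is an illustrative toy, not the sample's code; `Body`, `eulerStep`, and the values of `G`, `eps2`, and `dt` are all assumptions of this sketch:

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

struct Body { double pos[3]; double vel[3]; double mass; };

// One explicit Euler step of the scheme above:
//   a_i = G * sum_j m_j * r_ij / (||r_ij||^2 + eps^2)^(3/2)
// The softening term eps2 removes the singularity at r_ij = 0, so the
// j == i term contributes a zero vector and needs no branch.
void eulerStep(std::vector<Body>& bodies, double G, double eps2, double dt) {
    std::vector<std::array<double, 3>> acc(bodies.size()); // value-initialized to zero
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        for (std::size_t j = 0; j < bodies.size(); ++j) {
            const double r[3] = {bodies[j].pos[0] - bodies[i].pos[0],
                                 bodies[j].pos[1] - bodies[i].pos[1],
                                 bodies[j].pos[2] - bodies[i].pos[2]};
            const double distSqr = r[0]*r[0] + r[1]*r[1] + r[2]*r[2] + eps2;
            const double invDist = 1.0 / std::sqrt(distSqr);
            const double s = G * bodies[j].mass * invDist * invDist * invDist;
            for (int k = 0; k < 3; ++k) acc[i][k] += r[k] * s;
        }
    }
    // new velocity = old velocity + a * dt; new position = old position + v * dt
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        for (int k = 0; k < 3; ++k) {
            bodies[i].vel[k] += acc[i][k] * dt;
            bodies[i].pos[k] += bodies[i].vel[k] * dt;
        }
    }
}
```

For two equal masses placed symmetrically about the origin, one step attracts them toward each other with equal and opposite velocities, as the symmetry of the formula demands.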
Source Code Analysis
The NVIDIA CUDA toolkit ships a sample on exactly this topic, nbody; its source layout is as follows:
Before analyzing the nbody source, let us first see what it actually does. Its command-line options are:
From these options, the nbody sample supports the following features:
- -fullscreen: run in full-screen mode
- -fp64: run the simulation in double precision (single precision is the default)
- -hostmem: use host-side memory
- -benchmark: benchmark mode (reports average performance over several iterations in GPU or CPU mode)
- -numbodies: set the number of bodies
- -compare: comparison mode (runs the same body count on both GPU and CPU and compares the results)
- -cpu: run on the CPU (GPU mode is the default)
- -tipsy: load a file specifying the bodies' initial positions, velocities, and masses (galaxy.bin contains initial data for the Milky Way)
Here is nbody simulating a galaxy:
Smooth and spectacular. How is it implemented? Below is the main class diagram of the nbody system:
Read together with the command-line options, the design turns out to be quite elegant:
- BodySystem is the base class that abstracts the interface shared by the GPU and CPU paths; BodySystemCUDA and BodySystemCPU derive from BodySystem and implement the GPU and CPU simulations respectively
- Making the base class a template, template<typename T> BodySystem, with template subclasses BodySystemCUDA and BodySystemCPU, supports both single- and double-precision simulation
- Separating out ParticleRender decouples model from view: ParticleRender displays the simulation via OpenGL, and because benchmarking needs no graphics, the model side (i.e. raw GPU/CPU compute) can be benchmarked independently
- NBodyDemo is designed as a singleton and acts much like a controller: it drives BodySystemCUDA or BodySystemCPU through a BodySystem base-class pointer to prepare the data, and hands the results to the ParticleRender component when display is needed
Code Framework
Concretely, the NBodyDemo code framework works as follows:
- An NBodyDemo singleton is created by default and initialized from the command-line arguments (or their defaults)
- reset sets the bodies' initial positions, velocities, and masses
- glutDisplayFunc(display) registers display as the callback that renders the results into the OpenGL output window
The corresponding code can be found here:
// Create the demo -- either double (fp64) or float (fp32, default) implementation
NBodyDemo<float>::Create();
NBodyDemo<float>::init(numBodies, numDevsRequested, blockSize, !(benchmark || compareToCPU || useHostMem), useHostMem, useCpu);
NBodyDemo<float>::reset(numBodies, NBODY_CONFIG_SHELL);

if (bSupportDouble)
{
    NBodyDemo<double>::Create();
    NBodyDemo<double>::init(numBodies, numDevsRequested, blockSize, !(benchmark || compareToCPU || useHostMem), useHostMem, useCpu);
    NBodyDemo<double>::reset(numBodies, NBODY_CONFIG_SHELL);
}

if (fp64)
{
    if (benchmark)
    {
        if (numIterations <= 0)
        {
            numIterations = 10;
        }
        else if (numIterations > 10)
        {
            printf("Advisory: setting a high number of iterations\n");
            printf("in benchmark mode may cause failure on Windows\n");
            printf("Vista and Win7. On these OSes, set iterations <= 10\n");
        }

        NBodyDemo<double>::runBenchmark(numIterations);
    }
    else if (compareToCPU)
    {
        bTestResults = NBodyDemo<double>::compareResults(numBodies);
    }
    else
    {
        glutDisplayFunc(display);
        glutReshapeFunc(reshape);
        glutMouseFunc(mouse);
        glutMotionFunc(motion);
        glutKeyboardFunc(key);
        glutSpecialFunc(special);
        glutIdleFunc(idle);

        if (!useCpu)
        {
            checkCudaErrors(cudaEventRecord(startEvent, 0));
        }

        glutMainLoop();
    }
}
else
{
    if (benchmark)
    {
        if (numIterations <= 0)
        {
            numIterations = 10;
        }

        NBodyDemo<float>::runBenchmark(numIterations);
    }
    else if (compareToCPU)
    {
        bTestResults = NBodyDemo<float>::compareResults(numBodies);
    }
    else
    {
        glutDisplayFunc(display);
        glutReshapeFunc(reshape);
        glutMouseFunc(mouse);
        glutMotionFunc(motion);
        glutKeyboardFunc(key);
        glutSpecialFunc(special);
        glutIdleFunc(idle);

        if (!useCpu)
        {
            checkCudaErrors(cudaEventRecord(startEvent, 0));
        }

        glutMainLoop();
    }
}
The crux is the display() function, implemented as follows:
void display()
{
    static double gflops = 0;
    static double ifps = 0;
    static double interactionsPerSecond = 0;

    // update the simulation
    if (!bPause)
    {
        if (cycleDemo && (sdkGetTimerValue(&demoTimer) > demoTime))
        {
            activeDemo = (activeDemo + 1) % numDemos;
            selectDemo(activeDemo);
        }

        updateSimulation();

        if (!useCpu)
        {
            cudaEventRecord(hostMemSyncEvent, 0); // insert an event to wait on before rendering
        }
    }

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    if (displayEnabled)
    {
        // view transform
        {
            glMatrixMode(GL_MODELVIEW);
            glLoadIdentity();

            for (int c = 0; c < 3; ++c)
            {
                camera_trans_lag[c] += (camera_trans[c] - camera_trans_lag[c]) * inertia;
                camera_rot_lag[c] += (camera_rot[c] - camera_rot_lag[c]) * inertia;
            }

            glTranslatef(camera_trans_lag[0], camera_trans_lag[1], camera_trans_lag[2]);
            glRotatef(camera_rot_lag[0], 1.0, 0.0, 0.0);
            glRotatef(camera_rot_lag[1], 0.0, 1.0, 0.0);
        }

        displayNBodySystem();

        // display user interface
        if (bShowSliders)
        {
            glBlendFunc(GL_ONE_MINUS_DST_COLOR, GL_ZERO); // invert color
            glEnable(GL_BLEND);
            paramlist->Render(0, 0);
            glDisable(GL_BLEND);
        }

        if (bFullscreen)
        {
            beginWinCoords();
            char msg0[256], msg1[256], msg2[256];

            if (bDispInteractions)
            {
                sprintf(msg1, "%0.2f billion interactions per second", interactionsPerSecond);
            }
            else
            {
                sprintf(msg1, "%0.2f GFLOP/s", gflops);
            }

            sprintf(msg0, "%s", deviceName);
            sprintf(msg2, "%0.2f FPS [%s | %d bodies]",
                    ifps, fp64 ? "double precision" : "single precision", numBodies);

            glBlendFunc(GL_ONE_MINUS_DST_COLOR, GL_ZERO); // invert color
            glEnable(GL_BLEND);
            glColor3f(0.46f, 0.73f, 0.0f);
            glPrint(80, glutGet(GLUT_WINDOW_HEIGHT) - 122, msg0, GLUT_BITMAP_TIMES_ROMAN_24);
            glColor3f(1.0f, 1.0f, 1.0f);
            glPrint(80, glutGet(GLUT_WINDOW_HEIGHT) - 96, msg2, GLUT_BITMAP_TIMES_ROMAN_24);
            glColor3f(1.0f, 1.0f, 1.0f);
            glPrint(80, glutGet(GLUT_WINDOW_HEIGHT) - 70, msg1, GLUT_BITMAP_TIMES_ROMAN_24);
            glDisable(GL_BLEND);

            endWinCoords();
        }

        glutSwapBuffers();
    }

    fpsCount++;

    // this displays the frame rate updated every second (independent of frame rate)
    if (fpsCount >= fpsLimit)
    {
        char fps[256];
        float milliseconds = 1;

        // stop timer
        if (useCpu)
        {
            milliseconds = sdkGetTimerValue(&timer);
            sdkResetTimer(&timer);
        }
        else
        {
            checkCudaErrors(cudaEventRecord(stopEvent, 0));
            checkCudaErrors(cudaEventSynchronize(stopEvent));
            checkCudaErrors(cudaEventElapsedTime(&milliseconds, startEvent, stopEvent));
        }

        milliseconds /= (float)fpsCount;
        computePerfStats(interactionsPerSecond, gflops, milliseconds, 1);

        ifps = 1.f / (milliseconds / 1000.f);
        sprintf(fps,
                "CUDA N-Body (%d bodies): "
                "%0.1f fps | %0.1f BIPS | %0.1f GFLOP/s | %s",
                numBodies, ifps, interactionsPerSecond, gflops,
                fp64 ? "double precision" : "single precision");

        glutSetWindowTitle(fps);
        fpsCount = 0;
        fpsLimit = (ifps > 1.f) ? (int)ifps : 1;

        if (bPause)
        {
            fpsLimit = 0;
        }

        // restart timer
        if (!useCpu)
        {
            checkCudaErrors(cudaEventRecord(startEvent, 0));
        }
    }

    glutReportErrors();
}
Two functions inside display matter most:
- updateSimulation(): invokes the GPU or CPU to compute the bodies' positions and velocities at the next time step
- displayNBodySystem(): renders the computed results into the output window via OpenGL
They are implemented as follows:
void updateSimulation()
{
    if (fp64)
    {
        NBodyDemo<double>::updateSimulation();
    }
    else
    {
        NBodyDemo<float>::updateSimulation();
    }
}

void displayNBodySystem()
{
    if (fp64)
    {
        NBodyDemo<double>::display();
    }
    else
    {
        NBodyDemo<float>::display();
    }
}
These delegate to the NBodyDemo singleton's updateSimulation and display functions, implemented as:
static void updateSimulation()
{
    m_singleton->m_nbody->update(activeParams.m_timestep);
}

static void display()
{
    m_singleton->m_renderer->setSpriteSize(activeParams.m_pointSize);

    if (useHostMem)
    {
        // This event sync is required because we are rendering from the host memory that CUDA is
        // writing. If we don't wait until CUDA is done updating it, we will render partially
        // updated data, resulting in a jerky frame rate.
        if (!useCpu)
        {
            cudaEventSynchronize(hostMemSyncEvent);
        }

        m_singleton->m_renderer->setPositions(
            m_singleton->m_nbody->getArray(BODYSYSTEM_POSITION),
            m_singleton->m_nbody->getNumBodies());
    }
    else
    {
        m_singleton->m_renderer->setPBO(m_singleton->m_nbody->getCurrentReadBuffer(),
                                        m_singleton->m_nbody->getNumBodies(),
                                        (sizeof(T) > 4));
    }

    // display particles
    m_singleton->m_renderer->display(displayMode);
}
Why can the code call m_singleton->m_nbody->update and m_singleton->m_nbody->getArray directly without distinguishing GPU from CPU? Because NBodyDemo does the following preparation during init:
void _init(int numBodies, int numDevices, int blockSize,
           bool bUsePBO, bool useHostMem, bool useCpu)
{
    if (useCpu)
    {
        m_nbodyCpu = new BodySystemCPU<T>(numBodies);
        m_nbody = m_nbodyCpu;
        m_nbodyCuda = 0;
    }
    else
    {
        m_nbodyCuda = new BodySystemCUDA<T>(numBodies, numDevices, blockSize, bUsePBO, useHostMem);
        m_nbody = m_nbodyCuda;
        m_nbodyCpu = 0;
    }
    .....
}
Runtime polymorphism avoids duplicated code and keeps the codebase compact.
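The pattern can be illustrated with a stripped-down sketch; the names below (System, CpuSystem, CudaSystem, makeSystem) are hypothetical stand-ins for the sample's BodySystem hierarchy, not its actual classes:

```cpp
#include <memory>
#include <string>

// Minimal analogue of the BodySystem hierarchy: the caller holds a
// base-class pointer and never branches on GPU vs. CPU at the call site.
template <typename T>
struct System {                   // stands in for BodySystem<T>
    virtual ~System() = default;
    virtual std::string update(T dt) = 0;
};

template <typename T>
struct CpuSystem : System<T> {    // stands in for BodySystemCPU<T>
    std::string update(T) override { return "cpu"; }
};

template <typename T>
struct CudaSystem : System<T> {   // stands in for BodySystemCUDA<T>
    std::string update(T) override { return "cuda"; }
};

// The one place that branches, mirroring NBodyDemo::_init.
template <typename T>
std::unique_ptr<System<T>> makeSystem(bool useCpu) {
    if (useCpu) return std::make_unique<CpuSystem<T>>();
    return std::make_unique<CudaSystem<T>>();
}
```

After construction, every call site works through the base-class interface, which is exactly why updateSimulation and display above need no GPU/CPU branches.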
GPU Acceleration Analysis
Running nbody with the benchmark option lets us compare GPU and CPU compute performance:
By exploiting its many cores, the GPU simulates 57433 bodies in less than 1/30 of the time the CPU needs for just 4096. Single-precision throughput is 5142.2 GFLOP/s on the GPU versus 0.899 GFLOP/s on the CPU.
How does the GPU achieve this? From the analysis above, the only difference between the two paths is the implementation of m_nbody->update(timestep).
CPU update Function
The CPU's update function shows the computation clearly:
template <typename T>
void BodySystemCPU<T>::update(T deltaTime)
{
    assert(m_bInitialized);

    _integrateNBodySystem(deltaTime);

    //std::swap(m_currentRead, m_currentWrite);
}
The main work happens in _integrateNBodySystem:
template <typename T>
void BodySystemCPU<T>::_integrateNBodySystem(T deltaTime)
{
    _computeNBodyGravitation();

#ifdef OPENMP
    #pragma omp parallel for
#endif
    for (int i = 0; i < m_numBodies; ++i)
    {
        int index = 4*i;
        int indexForce = 3*i;

        T pos[3], vel[3], force[3];
        pos[0] = m_pos[index+0];
        pos[1] = m_pos[index+1];
        pos[2] = m_pos[index+2];
        T invMass = m_pos[index+3];

        vel[0] = m_vel[index+0];
        vel[1] = m_vel[index+1];
        vel[2] = m_vel[index+2];

        force[0] = m_force[indexForce+0];
        force[1] = m_force[indexForce+1];
        force[2] = m_force[indexForce+2];

        // acceleration = force / mass;
        // new velocity = old velocity + acceleration * deltaTime
        vel[0] += (force[0] * invMass) * deltaTime;
        vel[1] += (force[1] * invMass) * deltaTime;
        vel[2] += (force[2] * invMass) * deltaTime;

        vel[0] *= m_damping;
        vel[1] *= m_damping;
        vel[2] *= m_damping;

        // new position = old position + velocity * deltaTime
        pos[0] += vel[0] * deltaTime;
        pos[1] += vel[1] * deltaTime;
        pos[2] += vel[2] * deltaTime;

        m_pos[index+0] = pos[0];
        m_pos[index+1] = pos[1];
        m_pos[index+2] = pos[2];

        m_vel[index+0] = vel[0];
        m_vel[index+1] = vel[1];
        m_vel[index+2] = vel[2];
    }
}
The computation proceeds in steps: first compute each body's gravitational interaction with all the others into m_force, then apply Newton's second law to obtain the acceleration, and finally compute the new position m_pos and velocity m_vel. The pairwise-gravity step looks like this:
template <typename T>
void BodySystemCPU<T>::_computeNBodyGravitation()
{
#ifdef OPENMP
    #pragma omp parallel for
#endif
    for (int i = 0; i < m_numBodies; i++)
    {
        int indexForce = 3*i;

        T acc[3] = {0, 0, 0};

        // We unroll this loop 4X for a small performance boost.
        int j = 0;

        while (j < m_numBodies)
        {
            bodyBodyInteraction<T>(acc, &m_pos[4*i], &m_pos[4*j], m_softeningSquared);
            j++;
            bodyBodyInteraction<T>(acc, &m_pos[4*i], &m_pos[4*j], m_softeningSquared);
            j++;
            bodyBodyInteraction<T>(acc, &m_pos[4*i], &m_pos[4*j], m_softeningSquared);
            j++;
            bodyBodyInteraction<T>(acc, &m_pos[4*i], &m_pos[4*j], m_softeningSquared);
            j++;
        }

        m_force[indexForce  ] = acc[0];
        m_force[indexForce+1] = acc[1];
        m_force[indexForce+2] = acc[2];
    }
}
A few points are worth noting:
- The outer for loop computes, for each body i, its gravitational interaction with all other bodies; the inner (unrolled) loop is an implementation detail.
- The input arguments guarantee m_numBodies >= 256 and a multiple of 256, so the small unrolling trick above is legal.
- nbody simulates in 3D space, so velocity, position, and force are each decomposed into x, y, and z components for convenient computation (the first three components of m_pos hold the x, y, z position; the fourth holds the body's mass).
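Why the multiple-of-256 guarantee makes the unrolling legal: a 4x-unrolled loop only stays within bounds when the trip count is a multiple of 4. A minimal illustration with a generic accumulation (not the sample's code):

```cpp
#include <cstddef>
#include <vector>

// Plain accumulation, one element per iteration.
double sumPlain(const std::vector<double>& v) {
    double s = 0.0;
    for (std::size_t j = 0; j < v.size(); ++j) s += v[j];
    return s;
}

// 4x-unrolled accumulation, mirroring the structure of
// _computeNBodyGravitation's inner loop. Only valid when
// v.size() % 4 == 0 (the m_numBodies-is-a-multiple-of-256
// precondition guarantees this in the sample).
double sumUnrolled4(const std::vector<double>& v) {
    double s = 0.0;
    std::size_t j = 0;
    while (j < v.size()) {
        s += v[j]; ++j;
        s += v[j]; ++j;
        s += v[j]; ++j;
        s += v[j]; ++j;
    }
    return s;
}
```

Without the precondition, the unrolled body could read past the end of the array between bound checks; with it, the two loops compute the same result.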
The body-to-body interaction is computed as follows, matching the theoretical background above:
template <typename T>
void bodyBodyInteraction(T accel[3], T posMass0[4], T posMass1[4], T softeningSquared)
{
    T r[3];

    // r_01  [3 FLOPS]
    r[0] = posMass1[0] - posMass0[0];
    r[1] = posMass1[1] - posMass0[1];
    r[2] = posMass1[2] - posMass0[2];

    // d^2 + e^2 [6 FLOPS]
    T distSqr = r[0] * r[0] + r[1] * r[1] + r[2] * r[2];
    distSqr += softeningSquared;

    // invDistCube = 1/distSqr^(3/2)  [4 FLOPS (2 mul, 1 sqrt, 1 inv)]
    T invDist = (T)1.0 / (T)sqrt((double)distSqr);
    T invDistCube = invDist * invDist * invDist;

    // s = m_j * invDistCube [1 FLOP]
    T s = posMass1[3] * invDistCube;

    // (m_1 * r_01) / (d^2 + e^2)^(3/2)  [6 FLOPS]
    accel[0] += r[0] * s;
    accel[1] += r[1] * s;
    accel[2] += r[2] * s;
}
Note that accel here is an acceleration, not a force: in BodySystemCPU<T>::_computeNBodyGravitation, m_force actually holds accelerations. Yet BodySystemCPU<T>::_integrateNBodySystem treats m_force as a force, computing the acceleration as $Force/Mass$ via force[0] * invMass, which is in fact incorrect.
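To confirm that the pairwise computation matches the softened formula from the theoretical background, here is a plain-C++ re-implementation of the same arithmetic (a test stand-in, hypothetical name pairAccel, with G folded out as in the sample):

```cpp
#include <cmath>

// Contribution of body 1 to body 0's acceleration, following the softened
// formula: m_1 * r_01 / (||r_01||^2 + eps^2)^(3/2). G is applied elsewhere.
void pairAccel(double accel[3], const double posMass0[4],
               const double posMass1[4], double softeningSquared) {
    const double r[3] = {posMass1[0] - posMass0[0],
                         posMass1[1] - posMass0[1],
                         posMass1[2] - posMass0[2]};
    double distSqr = r[0]*r[0] + r[1]*r[1] + r[2]*r[2] + softeningSquared;
    double invDist = 1.0 / std::sqrt(distSqr);
    double s = posMass1[3] * invDist * invDist * invDist; // m_1 / d^3
    accel[0] += r[0] * s;
    accel[1] += r[1] * s;
    accel[2] += r[2] * s;
}
```

With body 1 of mass 3 at distance 2 along the x axis and zero softening, the contribution is 3 * 2 / 8 = 0.75 along x, exactly what the formula predicts.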
GPU update Function
In bodysystemcuda_impl.h we find the GPU's update function:
template<typename T>
void BodySystemCUDA<T>::update(T deltaTime)
{
    assert(m_bInitialized);

    integrateNbodySystem<T>(m_deviceData, m_pGRes, m_currentRead,
                            (float)deltaTime, (float)m_damping,
                            m_numBodies, m_numDevices,
                            m_blockSize, m_bUsePBO);

    std::swap(m_currentRead, m_currentWrite);
}
integrateNbodySystem is implemented below; because it launches a kernel function, it lives in bodysystemcuda.cu (note the .cu suffix):
template <typename T>
void integrateNbodySystem(DeviceData<T> *deviceData,
                          cudaGraphicsResource **pgres,
                          unsigned int currentRead,
                          float deltaTime,
                          float damping,
                          unsigned int numBodies,
                          unsigned int numDevices,
                          int blockSize,
                          bool bUsePBO)
{
    if (bUsePBO)
    {
        checkCudaErrors(cudaGraphicsResourceSetMapFlags(pgres[currentRead], cudaGraphicsMapFlagsReadOnly));
        checkCudaErrors(cudaGraphicsResourceSetMapFlags(pgres[1-currentRead], cudaGraphicsMapFlagsWriteDiscard));
        checkCudaErrors(cudaGraphicsMapResources(2, pgres, 0));
        size_t bytes;
        checkCudaErrors(cudaGraphicsResourceGetMappedPointer((void **)&(deviceData[0].dPos[currentRead]), &bytes, pgres[currentRead]));
        checkCudaErrors(cudaGraphicsResourceGetMappedPointer((void **)&(deviceData[0].dPos[1-currentRead]), &bytes, pgres[1-currentRead]));
    }

    for (unsigned int dev = 0; dev != numDevices; dev++)
    {
        if (numDevices > 1)
        {
            cudaSetDevice(dev);
        }

        int numBlocks = (deviceData[dev].numBodies + blockSize-1) / blockSize;
        int numTiles = (numBodies + blockSize - 1) / blockSize;
        int sharedMemSize = blockSize * 4 * sizeof(T); // 4 floats for pos

        integrateBodies<T><<< numBlocks, blockSize, sharedMemSize >>>
        ((typename vec4<T>::Type *)deviceData[dev].dPos[1-currentRead],
         (typename vec4<T>::Type *)deviceData[dev].dPos[currentRead],
         (typename vec4<T>::Type *)deviceData[dev].dVel,
         deviceData[dev].offset, deviceData[dev].numBodies,
         deltaTime, damping, numTiles);

        if (numDevices > 1)
        {
            checkCudaErrors(cudaEventRecord(deviceData[dev].event));
            // MJH: Hack on older driver versions to force kernel launches to flush!
            cudaStreamQuery(0);
        }

        // check if kernel invocation generated an error
        getLastCudaError("Kernel execution failed");
    }

    if (numDevices > 1)
    {
        for (unsigned int dev = 0; dev < numDevices; dev++)
        {
            checkCudaErrors(cudaEventSynchronize(deviceData[dev].event));
        }
    }

    if (bUsePBO)
    {
        checkCudaErrors(cudaGraphicsUnmapResources(2, pgres, 0));
    }
}
blockSize defaults to 256, i.e. 256 threads per thread block. Note that both the grid and the block are one-dimensional here: the N bodies are laid out linearly across numBlocks blocks, each containing blockSize threads.
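The launch geometry above is plain ceiling division; for instance, the benchmark's 57433 bodies with the default block size of 256 need 225 blocks:

```cpp
// Ceiling division, as used for both numBlocks and numTiles in
// integrateNbodySystem: (n + blockSize - 1) / blockSize.
int divUp(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}
```

The last block is partially full, which is exactly why the kernel starts with the `index >= deviceNumBodies` guard.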
The kernel function integrateBodies is implemented as follows:
template<typename T>
__global__ void
integrateBodies(typename vec4<T>::Type *__restrict__ newPos,
                typename vec4<T>::Type *__restrict__ oldPos,
                typename vec4<T>::Type *vel,
                unsigned int deviceOffset, unsigned int deviceNumBodies,
                float deltaTime, float damping, int numTiles)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    if (index >= deviceNumBodies)
    {
        return;
    }

    typename vec4<T>::Type position = oldPos[deviceOffset + index];

    typename vec3<T>::Type accel = computeBodyAccel<T>(position,
                                                       oldPos,
                                                       numTiles);

    // acceleration = force / mass;
    // new velocity = old velocity + acceleration * deltaTime
    // note we factor out the body's mass from the equation, here and in bodyBodyInteraction
    // (because they cancel out). Thus here force == acceleration
    typename vec4<T>::Type velocity = vel[deviceOffset + index];

    velocity.x += accel.x * deltaTime;
    velocity.y += accel.y * deltaTime;
    velocity.z += accel.z * deltaTime;

    velocity.x *= damping;
    velocity.y *= damping;
    velocity.z *= damping;

    // new position = old position + velocity * deltaTime
    position.x += velocity.x * deltaTime;
    position.y += velocity.y * deltaTime;
    position.z += velocity.z * deltaTime;

    // store new position and velocity
    newPos[deviceOffset + index] = position;
    vel[deviceOffset + index] = velocity;
}
The key piece is the implementation of computeBodyAccel:
template <typename T>
__device__ typename vec3<T>::Type
computeBodyAccel(typename vec4<T>::Type bodyPos,
                 typename vec4<T>::Type *positions,
                 int numTiles)
{
    typename vec4<T>::Type *sharedPos = SharedMemory<typename vec4<T>::Type>();

    typename vec3<T>::Type acc = {0.0f, 0.0f, 0.0f};

    for (int tile = 0; tile < numTiles; tile++)
    {
        sharedPos[threadIdx.x] = positions[tile * blockDim.x + threadIdx.x];

        __syncthreads();

        // This is the "tile_calculation" from the GPUG3 article.
#pragma unroll 128
        for (unsigned int counter = 0; counter < blockDim.x; counter++)
        {
            acc = bodyBodyInteraction<T>(acc, bodyPos, sharedPos[counter]);
        }

        __syncthreads();
    }

    return acc;
}
Suppose the block size is $p$. In the usual scheme these $p$ threads form a scheduling unit (in practice, each warp within the $p$ threads is), and each thread computes the gravitational pull on one body from the other $N$ bodies; it therefore needs the positions of all $N$ bodies at time $t$ to compute the acceleration and hence the position at time $t+1$. Even though a block can fetch the $N$ positions just once via shared memory, when $N$ is large the GPU's memory and memory bandwidth still become bottlenecks. Hence the $N$ bodies are split into $N/p$ tiles; accumulating the acceleration tile by tile still yields the final acceleration, and this is exactly the role of numTiles. The figure below, taken from GPU Gems 3, illustrates the acceleration scheme well:
In summary, with block size $p$, the GPU gains its speed and memory efficiency on several fronts:
- $N/p$ blocks run in parallel (vertically)
- $p$ threads run in parallel within each block (horizontal rows)
- shared memory within a block, processed as $N/p$ tiles, saves memory traffic and accelerates the computation (each row split into segments)
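The tiling trick works because the acceleration is a plain sum and can therefore be accumulated chunk by chunk. A CPU sketch (hypothetical helpers, not the sample's code) showing that tile-by-tile accumulation gives the same result as a direct sum:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Direct sum of per-body contributions to one body's acceleration
// (each contribution standing in for one bodyBodyInteraction result).
double accumulateDirect(const std::vector<double>& contrib) {
    double acc = 0.0;
    for (double c : contrib) acc += c;
    return acc;
}

// Same sum processed tile by tile, as computeBodyAccel does: each tile
// of `tileSize` contributions is (conceptually) staged in shared memory,
// accumulated, then the next tile is loaded.
double accumulateTiled(const std::vector<double>& contrib, std::size_t tileSize) {
    double acc = 0.0;
    for (std::size_t tile = 0; tile * tileSize < contrib.size(); ++tile) {
        std::size_t begin = tile * tileSize;
        std::size_t end = std::min(begin + tileSize, contrib.size());
        for (std::size_t j = begin; j < end; ++j) {
            acc += contrib[j]; // the "tile_calculation" step
        }
    }
    return acc;
}
```

Because the additions happen in the same order, the two routines agree exactly; on the GPU, the tiled form trades one big global-memory sweep for numTiles small shared-memory sweeps.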
bodyBodyInteraction is implemented as follows, analogous to the CPU version:
template <typename T>
__device__ typename vec3<T>::Type
bodyBodyInteraction(typename vec3<T>::Type ai,
                    typename vec4<T>::Type bi,
                    typename vec4<T>::Type bj)
{
    typename vec3<T>::Type r;

    // r_ij  [3 FLOPS]
    r.x = bj.x - bi.x;
    r.y = bj.y - bi.y;
    r.z = bj.z - bi.z;

    // distSqr = dot(r_ij, r_ij) + EPS^2  [6 FLOPS]
    T distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
    distSqr += getSofteningSquared<T>();

    // invDistCube = 1/distSqr^(3/2)  [4 FLOPS (2 mul, 1 sqrt, 1 inv)]
    T invDist = rsqrt_T(distSqr);
    T invDistCube = invDist * invDist * invDist;

    // s = m_j * invDistCube [1 FLOP]
    T s = bj.w * invDistCube;

    // a_i = a_i + s * r_ij [6 FLOPS]
    ai.x += r.x * s;
    ai.y += r.y * s;
    ai.z += r.z * s;

    return ai;
}
That concludes the source-level analysis of the GPU and CPU implementations of the n-body problem.