Using CUDA/GPU in Fortran
There are several ways to use GPU acceleration from Fortran. The main approaches are:
1. CUDA Fortran (NVIDIA’s Proprietary Extension)
CUDA Fortran is a set of Fortran language extensions from NVIDIA, supported by the nvfortran compiler (formerly the PGI Fortran compiler).
Setup:
- Install the NVIDIA HPC SDK (includes the nvfortran compiler)
- Requires an NVIDIA GPU with CUDA support
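Before writing kernels, it can help to confirm that the toolchain actually sees a GPU. The following is a minimal sketch, assuming nvfortran and its cudafor module; the program name query_device is only illustrative.

```fortran
! Minimal device query (sketch; assumes nvfortran and the cudafor module).
program query_device
  use cudafor
  implicit none
  integer :: istat, ndev
  type(cudaDeviceProp) :: prop

  ! Ask the CUDA runtime how many devices are visible
  istat = cudaGetDeviceCount(ndev)
  if (istat /= cudaSuccess) then
    print *, 'CUDA error: ', trim(cudaGetErrorString(istat))
    stop 1
  end if
  if (ndev == 0) then
    print *, 'No CUDA-capable device found'
    stop 1
  end if

  ! Report the name of device 0
  istat = cudaGetDeviceProperties(prop, 0)
  if (istat /= cudaSuccess) then
    print *, 'CUDA error: ', trim(cudaGetErrorString(istat))
    stop 1
  end if
  print *, 'Devices found: ', ndev
  print *, 'Device 0:      ', trim(prop%name)
end program query_device
```

If this builds and prints a device name, the SDK and driver are most likely set up well enough for the examples that follow.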
Basic Example:
! Vector addition on GPU
module vecadd
  implicit none
contains
  ! Kernel: each thread adds one element
  attributes(global) subroutine add(a, b, c, n)
    real :: a(*), b(*), c(*)
    integer, value :: n
    integer :: i
    ! Global 1-based index of this thread
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    ! Guard: the last block may extend past n
    if (i <= n) c(i) = a(i) + b(i)
  end subroutine add
end module vecadd

program main
  use cudafor
  use vecadd
  implicit none
  integer, parameter :: n = 10000
  real :: a(n), b(n), c(n)
  real, device :: a_d(n), b_d(n), c_d(n)
  type(dim3) :: grid, tBlock
  ! Initialize arrays
  a = 1.0; b = 2.0
  ! Copy to device
  a_d = a; b_d = b
  ! Configure kernel launch: enough 256-thread blocks to cover n
  tBlock = dim3(256,1,1)
  grid = dim3((n + tBlock%x - 1)/tBlock%x,1,1)
  ! Launch kernel
  call add<<<grid,tBlock>>>(a_d, b_d, c_d, n)
  ! Copy back result (waits for the kernel to finish)
  c = c_d
  ! Verify: the maximum error should be zero
  print *, maxval(abs(c-3.0))
end program main
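Kernel launches do not report failures on their own, so a program like the one above can silently produce wrong results if the launch is rejected. The following is a minimal sketch of launch error checking, assuming nvfortran and the cudafor module; the module scale_mod and the kernel scale_vec are hypothetical names introduced for illustration.

```fortran
! Sketch: checking a kernel launch for errors (assumes nvfortran + cudafor).
module scale_mod
  implicit none
contains
  ! Hypothetical kernel: multiply each element of x by s
  attributes(global) subroutine scale_vec(x, s, n)
    real :: x(*)
    real, value :: s
    integer, value :: n
    integer :: i
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    if (i <= n) x(i) = s*x(i)
  end subroutine scale_vec
end module scale_mod

program checked_launch
  use cudafor
  use scale_mod
  implicit none
  integer, parameter :: n = 10000
  real :: x(n)
  real, device :: x_d(n)
  integer :: istat

  x = 1.0
  x_d = x

  ! Integer launch configuration: (blocks, threads per block)
  call scale_vec<<<(n + 255)/256, 256>>>(x_d, 2.0, n)

  ! Errors detected at launch time (for example, an invalid configuration)
  istat = cudaGetLastError()
  if (istat /= cudaSuccess) then
    print *, 'Launch failed: ', trim(cudaGetErrorString(istat))
    stop 1
  end if

  ! Errors raised while the kernel was running
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) then
    print *, 'Kernel failed: ', trim(cudaGetErrorString(istat))
    stop 1
  end if

  x = x_d
  print *, maxval(abs(x - 2.0))
end program checked_launch
```

Checking both calls separates configuration problems from faults that occur during execution, which tends to make failures easier to locate.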
2. OpenACC (Directive-Based Approach)
OpenACC uses compiler directives to offload computation to GPUs.
Example:
program vecadd
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), b(n), c(n)
  integer :: i
  ! Initialize arrays
  !$acc parallel loop
  do i = 1, n
    a(i) = 1.0
    b(i) = 2.0
  end do
  ! Vector addition on GPU
  !$acc parallel loop
  do i = 1, n
    c(i) = a(i) + b(i)
  end do
  ! Verify
  print *, maxval(abs(c-3.0))
end program vecadd
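In the example above, each parallel loop is its own compute region, so the implicit data clauses may copy the arrays between host and device around every loop. A structured data region is one common way to keep the arrays on the GPU across both loops. The following is a minimal sketch of that idea; the program name vecadd_data is illustrative.

```fortran
! Sketch: keep arrays resident on the device across two loops.
program vecadd_data
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), b(n), c(n)
  integer :: i

  ! a and b live only on the device; c is copied back at "end data"
  !$acc data create(a, b) copyout(c)

  !$acc parallel loop
  do i = 1, n
    a(i) = 1.0
    b(i) = 2.0
  end do

  !$acc parallel loop
  do i = 1, n
    c(i) = a(i) + b(i)
  end do

  !$acc end data

  ! Only c is valid on the host here; a and b were never copied back
  print *, maxval(abs(c-3.0))
end program vecadd_data
```

Because a and b use create, their host copies stay uninitialized. If the host needs them after the region, copyout would be the clause to use instead.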
3. OpenMP Offload (Modern Alternative)
OpenMP 4.0 and later support GPU offloading through the target construct.
Example:
program vecadd
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), b(n), c(n)
  integer :: i
  ! Initialize arrays on the GPU (a and b are only written, so map(from:) suffices)
  !$omp target teams distribute parallel do map(from: a, b)
  do i = 1, n
    a(i) = 1.0
    b(i) = 2.0
  end do
  ! Vector addition on GPU
  !$omp target teams distribute parallel do map(to: a, b) map(from: c)
  do i = 1, n
    c(i) = a(i) + b(i)
  end do
  ! Verify
  print *, maxval(abs(c-3.0))
end program vecadd
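Each target construct above maps its arrays separately, so a and b travel back to the host after the first loop and to the device again before the second. A target data region is one way to avoid that round trip. The following is a minimal sketch; the program name vecadd_target_data is illustrative.

```fortran
! Sketch: one device data environment shared by two target regions.
program vecadd_target_data
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), b(n), c(n)
  integer :: i

  ! Allocate a and b on the device without copying; copy c back at the end
  !$omp target data map(alloc: a, b) map(from: c)

  !$omp target teams distribute parallel do
  do i = 1, n
    a(i) = 1.0
    b(i) = 2.0
  end do

  !$omp target teams distribute parallel do
  do i = 1, n
    c(i) = a(i) + b(i)
  end do

  !$omp end target data

  ! Only c is valid on the host here
  print *, maxval(abs(c-3.0))
end program vecadd_target_data
```

Inside the region the arrays are already present on the device, so the inner target constructs reuse them instead of transferring them again.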
4. ISO_C_BINDING with CUDA C
You can write the kernels in CUDA C, wrap each launch in an extern "C" function, and call that function from Fortran through ISO_C_BINDING.
CUDA C kernel and launcher (vecadd.cu):
// Kernel: each thread adds one element
__global__ void vecadd_kernel(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // 0-based global index
    if (i < n) c[i] = a[i] + b[i];
}

// Host-side launcher with C linkage so Fortran can bind to it
extern "C" void launch_vecadd(float *a_d, float *b_d, float *c_d, int n) {
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);  // round up to cover n
    vecadd_kernel<<<grid, block>>>(a_d, b_d, c_d, n);
}
Fortran caller:
program main
  use iso_c_binding
  implicit none
  interface
    subroutine launch_vecadd(a, b, c, n) bind(C)
      use iso_c_binding
      type(c_ptr), value :: a, b, c
      integer(c_int), value :: n
    end subroutine
  end interface
  ! ... rest of the code similar to CUDA Fortran example ...
end program
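One possible way to fill in the caller, without any CUDA Fortran extensions, is to bind the CUDA runtime functions cudaMalloc, cudaMemcpy and cudaFree through ISO_C_BINDING as well. The following is a sketch under that assumption, not necessarily what the elided code was meant to be. The helper check and the dummy-argument names are illustrative, and the integer values 1 and 2 are the host-to-device and device-to-host members of the runtime's cudaMemcpyKind enumeration.

```fortran
! Sketch: standard Fortran caller that manages device memory via the CUDA runtime.
program main
  use iso_c_binding
  implicit none

  ! cudaMemcpyKind values from the CUDA runtime headers
  integer(c_int), parameter :: cudaMemcpyHostToDevice = 1
  integer(c_int), parameter :: cudaMemcpyDeviceToHost = 2

  interface
    ! Launcher defined in vecadd.cu
    subroutine launch_vecadd(a, b, c, n) bind(C, name="launch_vecadd")
      import :: c_ptr, c_int
      type(c_ptr), value :: a, b, c
      integer(c_int), value :: n
    end subroutine launch_vecadd
    ! cudaError_t cudaMalloc(void **devPtr, size_t size)
    function cudaMalloc(devPtr, nbytes) bind(C, name="cudaMalloc") result(ierr)
      import :: c_ptr, c_size_t, c_int
      type(c_ptr), intent(out) :: devPtr
      integer(c_size_t), value :: nbytes
      integer(c_int) :: ierr
    end function cudaMalloc
    ! cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, kind)
    function cudaMemcpy(dst, src, nbytes, dir) bind(C, name="cudaMemcpy") result(ierr)
      import :: c_ptr, c_size_t, c_int
      type(c_ptr), value :: dst, src
      integer(c_size_t), value :: nbytes
      integer(c_int), value :: dir
      integer(c_int) :: ierr
    end function cudaMemcpy
    ! cudaError_t cudaFree(void *devPtr)
    function cudaFree(devPtr) bind(C, name="cudaFree") result(ierr)
      import :: c_ptr, c_int
      type(c_ptr), value :: devPtr
      integer(c_int) :: ierr
    end function cudaFree
  end interface

  integer(c_int), parameter :: n = 10000
  real(c_float), target :: a(n), b(n), c(n)
  type(c_ptr) :: a_d, b_d, c_d
  integer(c_size_t) :: nbytes

  a = 1.0; b = 2.0
  nbytes = int(n, c_size_t) * c_sizeof(a(1))

  ! Allocate device buffers
  call check(cudaMalloc(a_d, nbytes), 'cudaMalloc a_d')
  call check(cudaMalloc(b_d, nbytes), 'cudaMalloc b_d')
  call check(cudaMalloc(c_d, nbytes), 'cudaMalloc c_d')

  ! Copy inputs to the device
  call check(cudaMemcpy(a_d, c_loc(a), nbytes, cudaMemcpyHostToDevice), 'copy a')
  call check(cudaMemcpy(b_d, c_loc(b), nbytes, cudaMemcpyHostToDevice), 'copy b')

  ! Run the kernel through the C launcher
  call launch_vecadd(a_d, b_d, c_d, n)

  ! Copy the result back; this copy waits for the kernel to finish
  call check(cudaMemcpy(c_loc(c), c_d, nbytes, cudaMemcpyDeviceToHost), 'copy c')
  print *, maxval(abs(c - 3.0))

  call check(cudaFree(a_d), 'cudaFree a_d')
  call check(cudaFree(b_d), 'cudaFree b_d')
  call check(cudaFree(c_d), 'cudaFree c_d')

contains
  ! Stop with a message if a runtime call did not return cudaSuccess (0)
  subroutine check(ierr, what)
    integer(c_int), intent(in) :: ierr
    character(*), intent(in) :: what
    if (ierr /= 0) then
      print *, 'CUDA error ', ierr, ' in ', what
      error stop
    end if
  end subroutine check
end program main
```

A typical build would compile vecadd.cu to an object file with nvcc -c, then compile and link the Fortran source together with that object, the CUDA runtime library (cudart) and the C++ standard library. The exact library paths depend on the local CUDA installation.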
Compilation:
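For CUDA Fortran (recent NVIDIA HPC SDK releases use -cuda; -Mcuda is the older spelling):
nvfortran -cuda myprogram.f90
For OpenACC:
nvfortran -acc myprogram.f90
For OpenMP Offload (with nvfortran):
nvfortran -mp=gpu myprogram.f90
For OpenMP Offload (with a GCC build that includes nvptx offload support):
gfortran -fopenmp -foffload=nvptx-none myprogram.f90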
Considerations:
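- Portability: OpenACC and OpenMP are vendor-neutral directive standards, so the same source can target other accelerators where the compiler supports them; CUDA Fortran runs only on NVIDIA GPUs
- Performance: CUDA Fortran exposes the most low-level control (launch configuration, shared memory, streams), which often translates into the best performance
- Compiler Support: Check which of these models your compiler supports; nvfortran covers CUDA Fortran, OpenACC, and OpenMP offload, while GCC handles OpenACC and OpenMP offload only when built with offload support
- Debugging: cuda-gdb can debug GPU Fortran code, and the NVIDIA Nsight tools (Nsight Systems, Nsight Compute) help with profiling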
Would you like more details on any specific approach?