Using CUDA/GPU in Fortran

东北豆子哥

Published 2025-05-20 14:07:28

Categories: Fortran, CUDA   Tags: CUDA, HPC

Article link: https://blog.csdn.net/weixin_42849849/article/details/148086735

To leverage GPU acceleration in Fortran, you have several options. Here are the main approaches:

1. CUDA Fortran (NVIDIA’s Proprietary Extension)

CUDA Fortran is a set of Fortran language extensions for programming NVIDIA GPUs directly. It is supported by NVIDIA's nvfortran compiler (formerly the PGI compiler).

Setup:

Install NVIDIA HPC SDK (includes the nvfortran compiler)
Requires NVIDIA GPU with CUDA support

Basic Example:

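! Vector addition on GPU
module vecadd
  implicit none
contains
  ! Kernel: each thread adds one element
  attributes(global) subroutine add(a, b, c, n)
    real :: a(*), b(*), c(*)
    integer, value :: n
    integer :: i

    ! blockIdx and threadIdx are 1-based in CUDA Fortran
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    ! Guard: the last block may extend past n
    if (i <= n) c(i) = a(i) + b(i)
  end subroutine add
end module vecadd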

program main
  use cudafor
  use vecadd
  implicit none
  
  integer, parameter :: n = 10000
  real :: a(n), b(n), c(n)
  real, device :: a_d(n), b_d(n), c_d(n)
  type(dim3) :: grid, block
  
  ! Initialize arrays
  a = 1.0; b = 2.0
  
  ! Copy to device
  a_d = a; b_d = b
  
  ! Configure kernel launch
  block = dim3(256,1,1)
  grid = dim3(ceiling(real(n)/block%x),1,1)
  
  ! Launch kernel
  call add<<<grid,block>>>(a_d, b_d, c_d, n)
  
  ! Copy back result
  c = c_d
  
  ! Verify
  print *, maxval(abs(c-3.0))
end program main
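
The launch configuration above can be sanity-checked on the host. The following is a minimal sketch in plain C (no GPU required), not part of the original program: it replays the kernel's 1-based index formula for every block and thread and counts how many threads pass the `i <= n` guard. The function names `grid_size` and `count_active_threads` are hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>

/* Number of blocks needed to cover n elements: the integer form of
   ceiling(real(n)/block%x) from the Fortran host code. */
int grid_size(int n, int block_dim) {
    assert(n >= 0 && block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}

/* Replay the CUDA Fortran index formula on the host.
   blockIdx%x and threadIdx%x are 1-based in CUDA Fortran, so both
   loops start at 1. Returns how many threads pass the i <= n guard
   and writes the number of masked-out (idle) threads to *idle. */
int count_active_threads(int n, int block_dim, int *idle) {
    int grid = grid_size(n, block_dim);
    int active = 0;
    *idle = 0;
    for (int block_idx = 1; block_idx <= grid; ++block_idx) {
        for (int thread_idx = 1; thread_idx <= block_dim; ++thread_idx) {
            int i = (block_idx - 1) * block_dim + thread_idx;
            if (i <= n) {
                ++active;   /* this thread would update c(i) */
            } else {
                ++*idle;    /* this thread is masked out by the guard */
            }
        }
    }
    return active;
}
```

For the values used in the example (n = 10000 and 256 threads per block), `grid_size` returns 40, so 10240 threads are launched and 240 of them are masked out by the guard.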

2. OpenACC (Directive-Based Approach)

OpenACC uses compiler directives to offload computation to GPUs.

Example:

program vecadd
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), b(n), c(n)
  integer :: i
  
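  ! Initialize arrays on the GPU (a and b are only written, so copyout suffices)
  !$acc parallel loop copyout(a, b)
  do i = 1, n
    a(i) = 1.0
    b(i) = 2.0
  end do
  
  ! Vector addition on GPU (a and b are inputs, c is the output)
  !$acc parallel loop copyin(a, b) copyout(c)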
  do i = 1, n
    c(i) = a(i) + b(i)
  end do
  
  ! Verify
  print *, maxval(abs(c-3.0))
end program vecadd

3. OpenMP Offload (Modern Alternative)

OpenMP 4.0 and later can offload loops to GPUs through the target construct.

Example:

program vecadd
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), b(n), c(n)
  integer :: i
  
  ! Initialize arrays on the GPU (a and b are only written, so map(from:) suffices)
  !$omp target teams distribute parallel do map(from: a, b)
  do i = 1, n
    a(i) = 1.0
    b(i) = 2.0
  end do
  
  ! Vector addition on GPU
  !$omp target teams distribute parallel do map(to: a, b) map(from: c)
  do i = 1, n
    c(i) = a(i) + b(i)
  end do
  
  ! Verify
  print *, maxval(abs(c-3.0))
end program vecadd

4. ISO_C_BINDING with CUDA C

You can write the kernel in CUDA C, wrap the kernel launch in an extern "C" function, and call that wrapper from Fortran through ISO_C_BINDING.

C kernel (vecadd.cu):

__global__ void vecadd_kernel(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

extern "C" void launch_vecadd(float *a_d, float *b_d, float *c_d, int n) {
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    vecadd_kernel<<<grid, block>>>(a_d, b_d, c_d, n);
}

Fortran caller:

program main
  use iso_c_binding
  implicit none
  
  interface
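    ! Binds to the extern "C" wrapper in vecadd.cu
    subroutine launch_vecadd(a, b, c, n) bind(C, name="launch_vecadd")
      use iso_c_binding
      ! a, b, c must hold device addresses, not addresses of host arrays
      type(c_ptr), value :: a, b, c
      integer(c_int), value :: n
    end subroutine launch_vecadd
  end interface
  
  ! ... allocate device memory, copy a and b to the device, call launch_vecadd,
  ! and copy c back, following the same steps as the CUDA Fortran example ...
end program main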
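
Before involving the GPU, the binding itself can be exercised with a host-only stand-in. The sketch below is an illustration under stated assumptions, not part of the original article: it defines a plain C function with the same name and argument list as `launch_vecadd`, but performs the addition in a CPU loop. Linked in place of the object built from vecadd.cu, it lets you check that the Fortran interface passes the pointers and `n` correctly. In that case the `c_ptr` arguments hold host addresses (for example from `c_loc`), whereas the CUDA version needs device addresses.

```c
#include <assert.h>
#include <stddef.h>

/* Host-only stand-in for launch_vecadd (a hypothetical test double).
   It has the same C signature as the extern "C" wrapper in vecadd.cu,
   so the Fortran interface block can call it unchanged. Here the
   pointers are ordinary host addresses, not device addresses. */
void launch_vecadd(float *a, float *b, float *c, int n) {
    assert(a != NULL && b != NULL && c != NULL);
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i]; /* same element-wise update as vecadd_kernel */
    }
}
```

Because the name and signature match, switching back to the GPU version is a link-time change only; the Fortran source stays the same apart from how the buffers are allocated.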

Compilation:

For CUDA Fortran (current NVIDIA HPC SDK releases use -cuda; -Mcuda is the older, deprecated spelling):

nvfortran -cuda myprogram.f90

For OpenACC:

nvfortran -acc myprogram.f90

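For OpenMP Offload (with a GCC build that includes nvptx offloading support; with the NVIDIA HPC SDK, use nvfortran -mp=gpu instead):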

gfortran -fopenmp -foffload=nvptx-none myprogram.f90

Considerations:

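Portability: OpenACC and OpenMP directives are more portable than CUDA Fortran, which targets NVIDIA GPUs only
Performance: CUDA Fortran gives the most direct control over kernels and memory, which often translates into the best performance
Compiler Support: Not every compiler implements every approach, so check what yours supports before choosing
Debugging: NVIDIA tools such as cuda-gdb and Nsight can help debug and profile GPU Fortran code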
