CUDA Samples

Preface

This document contains a complete listing of the code samples that are included with the NVIDIA CUDA Toolkit. It describes each code sample, lists the minimum GPU specification, and provides links to the source code and white papers if available.

The code samples are divided into the following categories:                                    
Simple
Basic CUDA samples for beginners that illustrate key concepts of using CUDA and the CUDA runtime APIs.
Utilities
Utility samples that demonstrate how to query device capabilities and measure GPU/CPU bandwidth.
Graphics
Graphical samples that demonstrate interoperability between CUDA and OpenGL or DirectX.
Imaging
Samples that demonstrate image processing, compression, and data analysis.
Finance
Samples that demonstrate parallel algorithms for financial computing.
Simulations
Samples that illustrate a number of simulation algorithms implemented with CUDA.
Advanced
Samples that illustrate advanced algorithms implemented with CUDA.
CUDALibraries
Samples that illustrate how to use CUDA platform libraries (NPP, CUBLAS, CUFFT, CUSPARSE, and CURAND).

Simple

asyncAPI

This sample uses CUDA streams and events to overlap execution on CPU and GPU.

Minimum Required GPU SM 1.0
Windows Source asyncAPI.zip
Mac/Linux Source asyncAPI.tar.gz

C++ Integration

This example demonstrates how to integrate CUDA into an existing C++ application: the CUDA entry point on the host side is only a function called from C++ code, and only the file containing this function is compiled with nvcc. It also demonstrates that vector types can be used from .cpp files.

Minimum Required GPU SM 1.0
Windows Source cppIntegration.zip
Mac/Linux Source cppIntegration.tar.gz

Clock

This example shows how to use the clock function to accurately measure the performance of a kernel.

Minimum Required GPU SM 1.0
Windows Source clock.zip
Mac/Linux Source clock.tar.gz

cudaOpenMP

This sample demonstrates how to use the OpenMP API to write an application for multiple GPUs. This executable is not pre-built with the SDK installer.

Minimum Required GPU SM 1.0
Windows Source cudaOpenMP.zip
Mac/Linux Source cudaOpenMP.tar.gz

Matrix Multiplication (CUBLAS)

This sample implements matrix multiplication from Chapter 3 of the programming guide. To illustrate GPU performance for matrix multiply, this sample also shows how to use the new CUDA 4.0 interface for CUBLAS to demonstrate high-performance matrix multiplication.

Minimum Required GPU SM 1.0
Windows Source matrixMulCUBLAS.zip
Mac/Linux Source matrixMulCUBLAS.tar.gz

Matrix Multiplication (CUDA Driver API Version)

This sample implements matrix multiplication and uses the new CUDA 4.0 kernel launch Driver API. It has been written for clarity of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant generic kernel for matrix multiplication. CUBLAS provides high-performance matrix multiplication.

Minimum Required GPU SM 1.0
Windows Source matrixMulDrv.zip
Mac/Linux Source matrixMulDrv.tar.gz

Matrix Multiplication (CUDA Driver API Version with Dynamic Linking)

This sample revisits matrix multiplication using the CUDA driver API. It demonstrates how to link to the CUDA driver at runtime and how to use JIT (just-in-time) compilation from PTX code. It has been written for clarity of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant generic kernel for matrix multiplication. CUBLAS provides high-performance matrix multiplication.

Minimum Required GPU SM 1.0
Windows Source matrixMulDynlinkJIT.zip
Mac/Linux Source matrixMulDynlinkJIT.tar.gz

Matrix Multiplication (CUDA Runtime API Version)

This sample implements matrix multiplication and is exactly the same as the example in Chapter 6 of the programming guide. It has been written for clarity of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant generic kernel for matrix multiplication. To illustrate GPU performance for matrix multiply, this sample also shows how to use the new CUDA 4.0 interface for CUBLAS to demonstrate high-performance matrix multiplication.

Minimum Required GPU SM 1.0
Windows Source matrixMul.zip
Mac/Linux Source matrixMul.tar.gz
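
As a rough illustration of the computation the matrix multiplication samples perform, here is a minimal CPU reference sketch in C++. The function name matmul_ref and the row-major layout are assumptions made for this example and are not taken from the sample source. A GPU kernel would typically assign each output element to its own thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the sample): C = A * B for row-major
// matrices, where A is m x k, B is k x n, and C is m x n.
std::vector<float> matmul_ref(const std::vector<float>& a,
                              const std::vector<float>& b,
                              std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t row = 0; row < m; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            // Dot product of one row of A with one column of B.
            float sum = 0.0f;
            for (std::size_t i = 0; i < k; ++i) {
                sum += a[row * k + i] * b[i * n + col];
            }
            c[row * n + col] = sum;
        }
    }
    return c;
}
```

A reference like this is mainly useful for checking GPU results on small inputs. It makes no attempt at blocking or cache reuse.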

Pitch Linear Texture

Use of Pitch Linear Textures

Minimum Required GPU SM 1.0
Windows Source simplePitchLinearTexture.zip
Mac/Linux Source simplePitchLinearTexture.tar.gz

Simple Atomic Intrinsics

A simple demonstration of global memory atomic instructions. Requires Compute Capability 1.1 or higher.

Minimum Required GPU SM 1.0
Windows Source simpleAtomicIntrinsics.zip
Mac/Linux Source simpleAtomicIntrinsics.tar.gz

Simple Cubemap Texture

Simple example that demonstrates how to use a new CUDA 4.1 feature to support cubemap Textures in CUDA C.

Minimum Required GPU SM 2.0
Windows Source simpleCubemapTexture.zip
Mac/Linux Source simpleCubemapTexture.tar.gz

Simple CUDA Callbacks

This sample implements multi-threaded heterogeneous computing workloads with the new CPU callbacks for CUDA streams and events introduced with CUDA 5.0.

Minimum Required GPU SM 1.0
Windows Source simpleCallback.zip
Mac/Linux Source simpleCallback.tar.gz

Simple Layered Texture

Simple example that demonstrates how to use a new CUDA 4.0 feature to support layered Textures in CUDA C.

Minimum Required GPU SM 2.0
Windows Source simpleLayeredTexture.zip
Mac/Linux Source simpleLayeredTexture.tar.gz

Simple Multi Copy and Compute

On GPUs with Compute Capability 1.1, overlapping compute with one memcopy from the host system is possible. For Quadro and Tesla GPUs with Compute Capability 2.0, a second overlapped copy operation in either direction at full speed is possible (PCI-e is symmetric). This sample illustrates the usage of CUDA streams to achieve overlapping of kernel execution with data copies to and from the device.

Minimum Required GPU SM 1.0
Windows Source simpleMultiCopy.zip
Mac/Linux Source simpleMultiCopy.tar.gz

Simple Multi-GPU

This application demonstrates how to use the new CUDA 4.0 API for CUDA context management and multi-threaded access to run CUDA kernels on multiple GPUs.

Minimum Required GPU SM 1.0
Windows Source simpleMultiGPU.zip
Mac/Linux Source simpleMultiGPU.tar.gz

Simple Peer-to-Peer Transfers with Multi-GPU

This application demonstrates the new CUDA 4.0 APIs that support Peer-To-Peer (P2P) copies, Peer-To-Peer (P2P) addressing, and UVA (Unified Virtual Addressing) between multiple Tesla GPUs.

Minimum Required GPU SM 2.0
Windows Source simpleP2P.zip
Mac/Linux Source simpleP2P.tar.gz

Simple Print (CUDA Dynamic Parallelism)

This sample demonstrates simple printf implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.

Minimum Required GPU SM 3.5
Windows Source cdpSimplePrint.zip
Mac/Linux Source cdpSimplePrint.tar.gz

Simple Quicksort (CUDA Dynamic Parallelism)

This sample demonstrates simple quicksort implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.

Minimum Required GPU SM 3.5
Windows Source cdpSimpleQuicksort.zip
Mac/Linux Source cdpSimpleQuicksort.tar.gz

Simple Static GPU Device Library

This sample demonstrates a CUDA 5.0 feature, the ability to create a GPU device static library and use it within another CUDA kernel. This example demonstrates how to pass in a GPU device function (from the GPU device static library) as a function pointer to be called. This sample requires devices with compute capability 2.0 or higher.

Minimum Required GPU SM 2.0
Windows Source simpleSeparateCompilation.zip
Mac/Linux Source simpleSeparateCompilation.tar.gz

Simple Surface Write

Simple example that demonstrates the use of 2D surface references (Write-to-Texture)

Minimum Required GPU SM 2.0
Windows Source simpleSurfaceWrite.zip
Mac/Linux Source simpleSurfaceWrite.tar.gz

Simple Templates

This sample is a templatized version of the template project. It also shows how to correctly templatize dynamically allocated shared memory arrays.

Minimum Required GPU SM 1.0
Windows Source simpleTemplates.zip
Mac/Linux Source simpleTemplates.tar.gz

Simple Texture

Simple example that demonstrates use of Textures in CUDA.

Minimum Required GPU SM 1.0
Windows Source simpleTexture.zip
Mac/Linux Source simpleTexture.tar.gz

Simple Texture (Driver Version)

Simple example that demonstrates use of Textures in CUDA.  This sample uses the new CUDA 4.0 kernel launch Driver API.

Minimum Required GPU SM 1.0
Windows Source simpleTextureDrv.zip
Mac/Linux Source simpleTextureDrv.tar.gz

Simple Vote Intrinsics

Simple program which demonstrates how to use the Vote (any, all) intrinsic instruction in a CUDA kernel. Requires Compute Capability 1.2 or higher.

Minimum Required GPU SM 1.0
Windows Source simpleVoteIntrinsics.zip
Mac/Linux Source simpleVoteIntrinsics.tar.gz

simpleAssert

This CUDA Runtime API sample is a very basic sample that shows how to use the assert function in device code. Requires Compute Capability 2.0.

Minimum Required GPU SM 2.0
Windows Source simpleAssert.zip
Mac/Linux Source simpleAssert.tar.gz

simpleIPC

This CUDA Runtime API sample is a very basic sample that demonstrates inter-process communication (IPC) with one process per GPU for computation. Requires Compute Capability 2.0 or higher and a Linux operating system.

Minimum Required GPU SM 2.0
Windows Source simpleIPC.zip
Mac/Linux Source simpleIPC.tar.gz

simpleMPI

Simple example demonstrating how to use MPI in combination with CUDA.  This executable is not pre-built with the SDK installer.

Minimum Required GPU SM 1.0
Windows Source simpleMPI.zip
Mac/Linux Source simpleMPI.tar.gz

simplePrintf

This CUDA Runtime API sample is a very basic sample that shows how to use the printf function in device code. Specifically, for devices with compute capability less than 2.0, the function cuPrintf is called; otherwise, printf can be used directly.

Minimum Required GPU SM 1.0
Windows Source simplePrintf.zip
Mac/Linux Source simplePrintf.tar.gz

simpleStreams

This sample uses CUDA streams to overlap kernel executions with memory copies between the host and a GPU device. This sample uses a new CUDA 4.0 feature that supports pinning of generic host memory. Requires Compute Capability 1.1 or higher.

Minimum Required GPU SM 1.0
Windows Source simpleStreams.zip
Mac/Linux Source simpleStreams.tar.gz

simpleZeroCopy

This sample illustrates how to use Zero MemCopy, where kernels can read and write directly to pinned system memory. This sample requires GPUs that support this feature (MCP79 and GT200).

Minimum Required GPU SM 1.3
White Paper CUDA2.2PinnedMemoryAPIs.pdf
Windows Source simpleZeroCopy.zip
Mac/Linux Source simpleZeroCopy.tar.gz

Template

A trivial template project that can be used as a starting point to create new CUDA projects.

Minimum Required GPU SM 1.0
Windows Source template.zip
Mac/Linux Source template.tar.gz

Template using CUDA Runtime

A trivial template project that can be used as a starting point to create new CUDA Runtime API projects.

Minimum Required GPU SM 1.0
Windows Source template_runtime.zip
Mac/Linux Source template_runtime.tar.gz

Using Inline PTX

A simple test application that demonstrates a new CUDA 4.0 ability to embed PTX in a CUDA kernel.

Minimum Required GPU SM 1.0
Windows Source inlinePTX.zip
Mac/Linux Source inlinePTX.tar.gz

Vector Addition

This CUDA Runtime API sample is a very basic sample that implements element-by-element vector addition. It is the same as the sample illustrating Chapter 3 of the programming guide with some additions like error checking.

Minimum Required GPU SM 1.0
Windows Source vectorAdd.zip
Mac/Linux Source vectorAdd.tar.gz
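
The index arithmetic behind an element-by-element kernel can be sketched on the CPU. The following C++ snippet is an illustrative emulation, not code from the sample: the function name vector_add_emulated and the threads_per_block parameter are assumptions. Two nested loops stand in for the grid of blocks and the threads in each block, to show why a bounds check is needed when the vector length is not a multiple of the block size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU emulation of a 1D kernel launch: each (block, thread) pair
// computes at most one output element, as one GPU thread would.
std::vector<float> vector_add_emulated(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       std::size_t threads_per_block) {
    assert(a.size() == b.size() && threads_per_block > 0);
    const std::size_t n = a.size();
    // Round up so that a partially filled last block covers the tail.
    const std::size_t blocks = (n + threads_per_block - 1) / threads_per_block;
    std::vector<float> c(n);
    for (std::size_t block = 0; block < blocks; ++block) {
        for (std::size_t thread = 0; thread < threads_per_block; ++thread) {
            const std::size_t i = block * threads_per_block + thread;
            if (i < n) {  // bounds check: the last block may overhang n
                c[i] = a[i] + b[i];
            }
        }
    }
    return c;
}
```

With three elements and two threads per block, the emulation runs two blocks and the fourth (block, thread) pair is skipped by the bounds check.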

Vector Addition Driver API

This Vector Addition sample is a basic sample that implements element-by-element vector addition. It is the same as the sample illustrating Chapter 3 of the programming guide with some additions like error checking. This sample also uses the new CUDA 4.0 kernel launch Driver API.

Minimum Required GPU SM 1.0
Windows Source vectorAddDrv.zip
Mac/Linux Source vectorAddDrv.tar.gz

Utilities

Bandwidth Test

This is a simple test program to measure the memcpy bandwidth of the GPU and the memcpy bandwidth across PCI-e. This test application is capable of measuring device to device copy bandwidth, host to device copy bandwidth for pageable and page-locked memory, and device to host copy bandwidth for pageable and page-locked memory.

Minimum Required GPU SM 1.0
Windows Source bandwidthTest.zip
Mac/Linux Source bandwidthTest.tar.gz

Device Query

This sample enumerates the properties of the CUDA devices present in the system.

Minimum Required GPU SM 1.0
Windows Source deviceQuery.zip
Mac/Linux Source deviceQuery.tar.gz

Device Query Driver API

This sample enumerates the properties of the CUDA devices present in the system using CUDA Driver API calls.

Minimum Required GPU SM 1.0
Windows Source deviceQueryDrv.zip
Mac/Linux Source deviceQueryDrv.tar.gz

Graphics

Bindless Texture

This example demonstrates use of cudaSurfaceObject, cudaTextureObject, and MipMap support in CUDA. A GPU with Compute Capability SM 3.0 is required to run the sample.

Minimum Required GPU KEPLER SM 3.0
Windows Source bindlessTexture.zip
Mac/Linux Source bindlessTexture.tar.gz

Mandelbrot

This sample uses CUDA to compute and display the Mandelbrot or Julia sets interactively. It also illustrates the use of "double single" arithmetic to improve precision when zooming a long way into the pattern. This sample uses double precision hardware if a GT200 class GPU is present. Thanks to Mark Granger of NewTek who submitted this sample to the SDK!

Minimum Required GPU SM 1.0
Windows Source Mandelbrot.zip
Mac/Linux Source Mandelbrot.tar.gz

Marching Cubes Isosurfaces

This sample extracts a geometric isosurface from a volume dataset using the marching cubes algorithm. It uses the scan (prefix sum) function from the Thrust library to perform stream compaction.

Minimum Required GPU SM 1.0
Windows Source marchingCubes.zip
Mac/Linux Source marchingCubes.tar.gz
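
Stream compaction with a prefix sum can be sketched sequentially. The C++ snippet below is a CPU illustration with hypothetical names (exclusive_scan_ref, compact_ref); it does not use Thrust and is not taken from the sample. The exclusive scan of the keep/discard flags gives each surviving item its output position, which is what lets a parallel version write all survivors independently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum over flags: out[i] is the number of set flags
// strictly before position i.
std::vector<std::size_t> exclusive_scan_ref(const std::vector<int>& flags) {
    std::vector<std::size_t> out(flags.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < flags.size(); ++i) {
        out[i] = running;
        running += flags[i] ? 1 : 0;
    }
    return out;
}

// Keep only the items whose flag is set, preserving their order.
std::vector<int> compact_ref(const std::vector<int>& items,
                             const std::vector<int>& flags) {
    assert(items.size() == flags.size());
    const std::vector<std::size_t> offsets = exclusive_scan_ref(flags);
    std::size_t kept = 0;
    if (!flags.empty()) {
        kept = offsets.back() + (flags.back() ? 1 : 0);
    }
    std::vector<int> out(kept);
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (flags[i]) {
            out[offsets[i]] = items[i];  // scatter to the scanned position
        }
    }
    return out;
}
```

In a marching cubes setting the flags would presumably mark voxels that produce geometry, so that later stages only process occupied voxels.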

Simple D3D10 Texture

Simple program which demonstrates how to interoperate CUDA with Direct3D10 Texture. The program creates a number of D3D10 Textures (2D, 3D, and CubeMap) which are generated from CUDA kernels. Direct3D then renders the results on the screen. A Direct3D10 Capable device is required.

Minimum Required GPU SM 1.0
Windows Source simpleD3D10Texture.zip
Mac/Linux Source simpleD3D10Texture.tar.gz

Simple D3D11 Texture

Simple program which demonstrates Direct3D11 Texture interoperability with CUDA. The program creates a number of D3D11 Textures (2D, 3D, and CubeMap) which are written to from CUDA kernels. Direct3D then renders the results on the screen. A Direct3D Capable device is required.

Minimum Required GPU SM 1.0
Windows Source simpleD3D11Texture.zip
Mac/Linux Source simpleD3D11Texture.tar.gz

Simple D3D9 Texture

Simple program which demonstrates Direct3D9 Texture interoperability with CUDA. The program creates a number of D3D9 Textures (2D, 3D, and CubeMap) which are written to from CUDA kernels. Direct3D then renders the results on the screen. A Direct3D capable device is required.

Minimum Required GPU SM 1.0
Windows Source simpleD3D9Texture.zip
Mac/Linux Source simpleD3D9Texture.tar.gz

Simple Direct3D10 (Vertex Array)

Simple program which demonstrates interoperability between CUDA and Direct3D10. The program generates a vertex array with CUDA and uses Direct3D10 to render the geometry. A Direct3D Capable device is required.

Minimum Required GPU SM 1.0
Windows Source simpleD3D10.zip
Mac/Linux Source simpleD3D10.tar.gz

Simple Direct3D10 Render Target

Simple program which demonstrates interoperability between CUDA and Direct3D10. The program takes RenderTarget positions with CUDA and generates a histogram with visualization. A Direct3D Capable device is required.

Minimum Required GPU SM 1.0
Windows Source simpleD3D10RenderTarget.zip
Mac/Linux Source simpleD3D10RenderTarget.tar.gz

Simple Direct3D9 (Vertex Arrays)

Simple program which demonstrates interoperability between CUDA and Direct3D9. The program generates a vertex array with CUDA and uses Direct3D9 to render the geometry. A Direct3D capable device is required.

Minimum Required GPU SM 1.0
Windows Source simpleD3D9.zip
Mac/Linux Source simpleD3D9.tar.gz

Simple OpenGL

Simple program which demonstrates interoperability between CUDA and OpenGL. The program modifies vertex positions with CUDA and uses OpenGL to render the geometry.

Minimum Required GPU SM 1.0
Windows Source simpleGL.zip
Mac/Linux Source simpleGL.tar.gz

Simple Texture 3D

Simple example that demonstrates use of 3D Textures in CUDA.

Minimum Required GPU SM 1.0
Windows Source simpleTexture3D.zip
Mac/Linux Source simpleTexture3D.tar.gz

SLI D3D10 Texture

Simple program which demonstrates SLI with Direct3D10 Texture interoperability with CUDA. The program creates a D3D10 Texture which is written to from a CUDA kernel. Direct3D then renders the results on the screen. A Direct3D Capable device is required.

Minimum Required GPU SM 1.0
Windows Source SLID3D10Texture.zip
Mac/Linux Source SLID3D10Texture.tar.gz

Volume Rendering with 3D Textures

This sample demonstrates basic volume rendering using 3D Textures.

Minimum Required GPU SM 1.0
Windows Source volumeRender.zip
Mac/Linux Source volumeRender.tar.gz

Volumetric Filtering with 3D Textures and Surface Writes

This sample demonstrates 3D Volumetric Filtering using 3D Textures and 3D Surface Writes.

Minimum Required GPU SM 2.0
Windows Source volumeFiltering.zip
Mac/Linux Source volumeFiltering.tar.gz

Imaging

1D Discrete Haar Wavelet Decomposition

Discrete Haar wavelet decomposition for 1D signals with a length which is a power of 2.

Minimum Required GPU SM 1.0
Windows Source dwtHaar1D.zip
Mac/Linux Source dwtHaar1D.tar.gz
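
For readers unfamiliar with the transform, here is a minimal sequential C++ sketch of a full 1D Haar decomposition for a power-of-2 length. The function name haar_decompose_ref, the orthonormal 1/sqrt(2) scaling, and the output layout are assumptions made for illustration; the sample's own normalization and layout may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference. Output layout (an assumption): index 0 holds
// the overall approximation, followed by detail coefficients from the
// coarsest level to the finest.
std::vector<float> haar_decompose_ref(std::vector<float> signal) {
    const std::size_t n = signal.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // length must be a power of 2
    const float inv_sqrt2 = 1.0f / std::sqrt(2.0f);
    std::vector<float> tmp(n);
    // Each level halves the approximation part and leaves details in place.
    for (std::size_t len = n; len > 1; len /= 2) {
        const std::size_t half = len / 2;
        for (std::size_t i = 0; i < half; ++i) {
            const float left = signal[2 * i];
            const float right = signal[2 * i + 1];
            tmp[i] = (left + right) * inv_sqrt2;         // approximation
            tmp[half + i] = (left - right) * inv_sqrt2;  // detail
        }
        for (std::size_t i = 0; i < len; ++i) {
            signal[i] = tmp[i];
        }
    }
    return signal;
}
```

A constant signal yields zero detail coefficients at every level, which makes a convenient sanity check.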

Bicubic Texture Filtering

This sample demonstrates how to efficiently implement bicubic Texture filtering in CUDA.

Minimum Required GPU SM 1.0
Windows Source bicubicTexture.zip
Mac/Linux Source bicubicTexture.tar.gz

Bilateral Filter

The bilateral filter is an edge-preserving non-linear smoothing filter; this sample implements it with CUDA and OpenGL rendering. It can be used in image recovery and denoising. Each pixel is weighted by considering both the spatial distance and the color distance between its neighbors. Reference: C. Tomasi, R. Manduchi, "Bilateral Filtering for Gray and Color Images", Proceedings of the ICCV, 1998, http://users.soe.ucsc.edu/~manduchi/Papers/ICCV98.pdf

Minimum Required GPU SM 1.0
Windows Source bilateralFilter.zip
Mac/Linux Source bilateralFilter.tar.gz

Box Filter

Fast image box filter using CUDA with OpenGL rendering.

Minimum Required GPU SM 1.0
Windows Source boxFilter.zip
Mac/Linux Source boxFilter.tar.gz

CUDA Histogram

This sample demonstrates efficient implementations of 64-bin and 256-bin histograms.

Minimum Required GPU SM 1.0
White Paper histogram.pdf
Windows Source histogram.zip
Mac/Linux Source histogram.tar.gz
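
A 256-bin histogram can be sketched on the CPU as partial histograms that are merged at the end. The C++ snippet below uses hypothetical names (histogram256_ref, chunk) and only illustrates one common way to structure the computation, with each chunk standing in for the work of one thread block. Whether this matches the sample's kernels is an assumption.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference: count byte values in fixed-size chunks, then
// merge each partial histogram into the total.
std::array<unsigned, 256> histogram256_ref(const std::vector<std::uint8_t>& data,
                                           std::size_t chunk) {
    assert(chunk > 0);
    std::array<unsigned, 256> total{};  // zero-initialized bins
    for (std::size_t start = 0; start < data.size(); start += chunk) {
        std::array<unsigned, 256> partial{};
        const std::size_t end = std::min(start + chunk, data.size());
        for (std::size_t i = start; i < end; ++i) {
            ++partial[data[i]];
        }
        // Merge step: one addition per bin per chunk.
        for (std::size_t bin = 0; bin < total.size(); ++bin) {
            total[bin] += partial[bin];
        }
    }
    return total;
}
```

Keeping per-chunk counts separate avoids contention on the shared bins until the merge, which is the usual motivation for this structure on parallel hardware.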

CUDA Separable Convolution

This sample implements a separable convolution filter of a 2D signal with a Gaussian kernel.

Minimum Required GPU SM 1.0
White Paper convolutionSeparable.pdf
Windows Source convolutionSeparable.zip
Mac/Linux Source convolutionSeparable.tar.gz
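
Separability means a 2D convolution can be done as a row pass followed by a column pass with the same 1D kernel. The following C++ sketch illustrates this on the CPU. The function names and the zero-padding at the image borders are assumptions made for this example, not details of the sample.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One 1D convolution pass along rows (horizontal = true) or columns
// (horizontal = false). Pixels outside the image count as zero (an assumption).
std::vector<float> convolve_1d_pass(const std::vector<float>& img, int w, int h,
                                    const std::vector<float>& kernel,
                                    bool horizontal) {
    assert(w > 0 && h > 0);
    assert(img.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    assert(kernel.size() % 2 == 1);  // odd length so the kernel has a center tap
    const int radius = static_cast<int>(kernel.size() / 2);
    std::vector<float> out(img.size(), 0.0f);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                const int sx = horizontal ? x + k : x;
                const int sy = horizontal ? y : y + k;
                if (sx >= 0 && sx < w && sy >= 0 && sy < h) {
                    // Index radius - k flips the kernel, as convolution requires.
                    sum += img[sy * w + sx] * kernel[radius - k];
                }
            }
            out[y * w + x] = sum;
        }
    }
    return out;
}

// Row pass followed by column pass with the same 1D kernel.
std::vector<float> convolve_separable_ref(const std::vector<float>& img, int w,
                                          int h, const std::vector<float>& kernel) {
    return convolve_1d_pass(convolve_1d_pass(img, w, h, kernel, true),
                            w, h, kernel, false);
}
```

For a kernel of length K, the two passes cost about 2K multiply-adds per pixel instead of K*K for the equivalent full 2D kernel.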

CUDA Video Decoder D3D9 API

This sample demonstrates how to efficiently use the CUDA Video Decoder API to decode MPEG-2, VC-1, or H.264 sources. YUV to RGB conversion of video is accomplished with a CUDA kernel. The output result is rendered to a D3D9 surface. By default the decoded video is not displayed on the screen; adding -displayvideo to the command line makes the video output visible. Requires a Direct3D capable device and Compute Capability 1.1 or higher.

Minimum Required GPU SM 1.0
White Paper nvcuvid.pdf
Windows Source cudaDecodeD3D9.zip
Mac/Linux Source cudaDecodeD3D9.tar.gz

CUDA Video Decoder GL API

This sample demonstrates how to efficiently use the CUDA Video Decoder API to decode video sources based on MPEG-2, VC-1, and H.264. YUV to RGB conversion of video is accomplished with a CUDA kernel. The output result is rendered to an OpenGL surface. By default the decoded video is shown as black; display can be enabled by adding -displayvideo to the command line. Requires Compute Capability 1.1 or higher.

Minimum Required GPU SM 1.0
White Paper nvcuvid.pdf
Windows Source cudaDecodeGL.zip
Mac/Linux Source cudaDecodeGL.tar.gz

CUDA Video Encode (C Library) API

This sample demonstrates how to effectively use the CUDA Video Encoder API to encode H.264 video. Video frames in YUV formats are taken as input (from either CPU system memory or GPU memory), and video output frames are encoded to an H.264 file.

Minimum Required GPU SM 1.0
White Paper nvcuvenc.pdf
Windows Source cudaEncode.zip
Mac/Linux Source cudaEncode.tar.gz

DCT8x8

This sample demonstrates how Discrete Cosine Transform (DCT) for blocks of 8 by 8 pixels can be performed using CUDA: a naive implementation by definition and a more traditional approach used in many libraries. As opposed to implementing DCT in a fragment shader, CUDA allows for an easier and more efficient implementation.

Minimum Required GPU SM 1.0
White Paper dct8x8.pdf
Windows Source dct8x8.zip
Mac/Linux Source dct8x8.tar.gz

DirectX Texture Compressor (DXTC)

High Quality DXT Compression using CUDA. This example shows how to implement an existing computationally-intensive CPU compression algorithm in parallel on the GPU, and obtain an order of magnitude performance improvement.

Minimum Required GPU SM 1.0
White Paper cuda_dxtc.pdf
Windows Source dxtc.zip
Mac/Linux Source dxtc.tar.gz

FFT-Based 2D Convolution

This sample demonstrates how 2D convolutions with very large kernel sizes can be efficiently implemented using FFT transformations.

Minimum Required GPU SM 1.0
Windows Source convolutionFFT2D.zip
Mac/Linux Source convolutionFFT2D.tar.gz

Image denoising

This sample demonstrates two adaptive image denoising techniques, KNN and NLM, based on computation of both geometric and color distance between texels. While both techniques are implemented in the DirectX SDK using shaders, a massively sped-up variation of the latter technique, taking advantage of shared memory, is implemented in addition to the DirectX counterparts.

Minimum Required GPU SM 1.0
White Paper imageDenoising.pdf
Windows Source imageDenoising.zip
Mac/Linux Source imageDenoising.tar.gz

Optical Flow

Variational optical flow estimation example. Uses textures for image operations. Shows how a simple PDE solver can be accelerated with CUDA.

Minimum Required GPU SM 1.0
White Paper OpticalFlow.pdf
Windows Source HSOpticalFlow.zip
Mac/Linux Source HSOpticalFlow.tar.gz

Post-Process in OpenGL

This sample shows how to post-process an image rendered in OpenGL using CUDA.

Minimum Required GPU SM 1.0
Windows Source postProcessGL.zip
Mac/Linux Source postProcessGL.tar.gz

Recursive Gaussian Filter

This sample implements a Gaussian blur using Deriche's recursive method. The advantage of this method is that the execution time is independent of the filter width.

Minimum Required GPU SM 1.0
Windows Source recursiveGaussian.zip
Mac/Linux Source recursiveGaussian.tar.gz

Sobel Filter

This sample implements the Sobel edge detection filter for 8-bit monochrome images.

Minimum Required GPU SM 1.0
Windows Source SobelFilter.zip
Mac/Linux Source SobelFilter.tar.gz
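
A Sobel edge detector for 8-bit monochrome images can be sketched on the CPU as follows. The function name sobel_ref, the |gx| + |gy| magnitude approximation, the clamp to 255, and the zeroed border are assumptions made for this illustration and may not match the sample's kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical CPU reference for a 3x3 Sobel filter on a w x h 8-bit image.
// Border pixels are left at 0 because their 3x3 neighborhood is incomplete.
std::vector<std::uint8_t> sobel_ref(const std::vector<std::uint8_t>& img,
                                    int w, int h) {
    assert(w > 0 && h > 0);
    assert(img.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    std::vector<std::uint8_t> out(img.size(), 0);
    for (int y = 1; y + 1 < h; ++y) {
        for (int x = 1; x + 1 < w; ++x) {
            // Pixel at offset (dx, dy) from the current position.
            auto p = [&](int dx, int dy) {
                return static_cast<int>(img[(y + dy) * w + (x + dx)]);
            };
            const int gx = -p(-1, -1) + p(1, -1)
                           - 2 * p(-1, 0) + 2 * p(1, 0)
                           - p(-1, 1) + p(1, 1);
            const int gy = -p(-1, -1) - 2 * p(0, -1) - p(1, -1)
                           + p(-1, 1) + 2 * p(0, 1) + p(1, 1);
            // Cheap magnitude approximation, clamped to the 8-bit range.
            const int mag = std::abs(gx) + std::abs(gy);
            out[y * w + x] = static_cast<std::uint8_t>(std::min(mag, 255));
        }
    }
    return out;
}
```

On a uniform image every interior response is zero, while a vertical step edge produces a horizontal gradient only.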

Stereo Disparity Computation (SAD SIMD Intrinsics)

A CUDA program that demonstrates how to compute a stereo disparity map using SIMD SAD (Sum of Absolute Difference) intrinsics. Requires Compute Capability 2.0 or higher.

Minimum Required GPU SM 2.0
Windows Source stereoDisparity.zip
Mac/Linux Source stereoDisparity.tar.gz

Texture-based Separable Convolution

Texture-based implementation of a separable 2D convolution with a Gaussian kernel. Used for performance comparison against convolutionSeparable.

Minimum Required GPU SM 1.0
Windows Source convolutionTexture.zip
Mac/Linux Source convolutionTexture.tar.gz

Finance

Binomial Option Pricing

This sample evaluates the fair call price for a given set of European options under the binomial model. This sample will also take advantage of double precision if a GTX 200 class GPU is present.

Minimum Required GPU SM 1.0
White Paper binomialOptions.pdf
Windows Source binomialOptions.zip
Mac/Linux Source binomialOptions.tar.gz

Black-Scholes Option Pricing

This sample evaluates fair call and put prices for a given set of European options using the Black-Scholes formula.

Minimum Required GPU SM 1.0
White Paper BlackScholes.pdf
Windows Source BlackScholes.zip
Mac/Linux Source BlackScholes.tar.gz
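
For reference, the closed-form Black-Scholes prices for a single European option can be sketched in C++ as below. The names OptionPrices, normal_cdf, and black_scholes_ref are hypothetical, and the normal CDF is computed with std::erfc, which is an assumption of this sketch rather than the method the sample uses. On the GPU, each option in the set would typically be priced by its own thread.

```cpp
#include <cassert>
#include <cmath>

struct OptionPrices {
    double call;
    double put;
};

// Standard normal cumulative distribution function, via the complementary
// error function.
double normal_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Hypothetical CPU reference. s: spot price, k: strike, t: years to expiry,
// r: risk-free rate, v: volatility.
OptionPrices black_scholes_ref(double s, double k, double t, double r, double v) {
    assert(s > 0.0 && k > 0.0 && t > 0.0 && v > 0.0);
    const double sqrt_t = std::sqrt(t);
    const double d1 = (std::log(s / k) + (r + 0.5 * v * v) * t) / (v * sqrt_t);
    const double d2 = d1 - v * sqrt_t;
    const double discounted_k = k * std::exp(-r * t);
    const double call = s * normal_cdf(d1) - discounted_k * normal_cdf(d2);
    const double put = discounted_k * normal_cdf(-d2) - s * normal_cdf(-d1);
    return {call, put};
}
```

A quick consistency check is put-call parity: call minus put should equal s minus k * exp(-r * t) up to rounding.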

Excel 2007 CUDA Integration Example

This sample demonstrates how to integrate Excel 2007 with CUDA using array formulas. This plug-in depends on the Microsoft Excel Developer Kit. This sample is not pre-built with the CUDA SDK.

Minimum Required GPU GeForce 8
Windows Source ExcelCUDA2007.zip
Mac/Linux Source ExcelCUDA2007.tar.gz

Excel 2010 CUDA Integration Example

This sample demonstrates how to integrate Excel 2010 with CUDA using array formulas. This plug-in depends on the Microsoft Excel 2010 Developer Kit, which can be downloaded from the Microsoft Developer website. This sample is not pre-built with the CUDA SDK.

Minimum Required GPU GeForce 8
Windows Source ExcelCUDA2010.zip
Mac/Linux Source ExcelCUDA2010.tar.gz

Excel CUDA Integration Example

This sample demonstrates how one could integrate Excel with CUDA using array formulas. This plug-in is not pre-built with the SDK installer.

Minimum Required GPU SM 1.0
Windows Source ExcelCUDA.zip
Mac/Linux Source ExcelCUDA.tar.gz

Monte Carlo Option Pricing with Multi-GPU support

This sample evaluates the fair call price for a given set of European options using the Monte Carlo approach, taking advantage of all CUDA-capable GPUs installed in the system. This sample uses double precision hardware if a GTX 200 class GPU is present. The sample also takes advantage of the CUDA 4.0 capability of using a single CPU thread to control multiple GPUs.

Minimum Required GPU SM 1.0
White Paper MonteCarlo.pdf
Windows Source MonteCarloMultiGPU.zip
Mac/Linux Source MonteCarloMultiGPU.tar.gz

Niederreiter Quasirandom Sequence Generator

This sample implements the Niederreiter Quasirandom Sequence Generator and the Inverse Cumulative Normal Distribution function for Standard Normal Distribution generation.

Minimum Required GPU SM 1.0
Windows Source quasirandomGenerator.zip
Mac/Linux Source quasirandomGenerator.tar.gz

Sobol Quasirandom Number Generator

This sample implements the Sobol Quasirandom Sequence Generator.

Minimum Required GPU SM 1.0
Windows Source SobolQRNG.zip
Mac/Linux Source SobolQRNG.tar.gz

Simulations

CUDA FFT Ocean Simulation

This sample simulates an Ocean heightfield using CUFFT and renders the result using OpenGL.

Minimum Required GPU SM 1.0
Windows Source oceanFFT.zip
Mac/Linux Source oceanFFT.tar.gz

CUDA N-Body Simulation

This sample demonstrates efficient all-pairs simulation of a gravitational n-body simulation in CUDA. This sample accompanies the GPU Gems 3 chapter "Fast N-Body Simulation with CUDA". Starting in CUDA 4.0, the nBody sample has been updated to take advantage of new features to easily scale the n-body simulation across multiple GPUs in a single PC. Adding "-numbodies=<bodies>" to the command line will allow users to set the number of bodies for simulation. Adding "-numdevices=<N>" to the command line will cause the sample to use N devices (if available) for simulation. In this mode, the position and velocity data for all bodies are read from system memory using "zero copy" rather than from device memory. For a small number of devices (4 or fewer) and a large enough number of bodies, bandwidth is not a bottleneck so we can achieve strong scaling across these devices.

Minimum Required GPU SM 1.0
White Paper nbody_gems3_ch31.pdf
Windows Source nbody.zip
Mac/Linux Source nbody.tar.gz
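
The all-pairs force evaluation at the heart of an n-body step can be sketched on the CPU as a double loop. In the C++ snippet below, the names Body, Vec3, and all_pairs_accel_ref are hypothetical, the gravitational constant is taken as 1, and a softening term is added to the squared distance. These are assumptions for illustration, not details read from the sample.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Body {
    float x, y, z, mass;
};

struct Vec3 {
    float x, y, z;
};

// Hypothetical CPU reference: acceleration on every body from every body.
// The softening term keeps the self-interaction (i == j) finite, and since
// the separation is zero in that case it contributes nothing.
std::vector<Vec3> all_pairs_accel_ref(const std::vector<Body>& bodies,
                                      float softening_sq) {
    assert(softening_sq > 0.0f);
    std::vector<Vec3> acc(bodies.size(), Vec3{0.0f, 0.0f, 0.0f});
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        for (std::size_t j = 0; j < bodies.size(); ++j) {
            const float dx = bodies[j].x - bodies[i].x;
            const float dy = bodies[j].y - bodies[i].y;
            const float dz = bodies[j].z - bodies[i].z;
            const float dist_sq = dx * dx + dy * dy + dz * dz + softening_sq;
            const float inv_dist = 1.0f / std::sqrt(dist_sq);
            // mass_j / dist^3, with G assumed to be 1.
            const float s = bodies[j].mass * inv_dist * inv_dist * inv_dist;
            acc[i].x += dx * s;
            acc[i].y += dy * s;
            acc[i].z += dz * s;
        }
    }
    return acc;
}
```

The work grows with the square of the number of bodies, which is why the description above notes that a large enough body count keeps the devices busy relative to the data they must read.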

Fluids (Direct3D Version)

An example of fluid simulation using CUDA and CUFFT, with Direct3D 9 rendering.  A Direct3D Capable device is required.

Minimum Required GPU SM 1.0
Windows Source fluidsD3D9.zip
Mac/Linux Source fluidsD3D9.tar.gz

Fluids (OpenGL Version)

An example of fluid simulation using CUDA and CUFFT, with OpenGL rendering.

Minimum Required GPU SM 1.0
White Paper fluidsGL.pdf
Windows Source fluidsGL.zip
Mac/Linux Source fluidsGL.tar.gz

Particles

This sample uses CUDA to simulate and visualize a large set of particles and their physical interaction. Adding "-particles=<N>" to the command line will allow users to set the number of particles for simulation. This example implements a uniform grid data structure using either atomic operations or a fast radix sort from the Thrust library.

Minimum Required GPU SM 1.0
White Paper particles.pdf
Windows Source particles.zip
Mac/Linux Source particles.tar.gz

Smoke Particles

Smoke simulation with volumetric shadows using the half-angle slicing technique. Uses CUDA for procedural simulation, the Thrust library for sorting algorithms, and OpenGL for graphics rendering.

Minimum Required GPU SM 1.0
White Paper smokeParticles.pdf
Windows Source smokeParticles.zip
Mac/Linux Source smokeParticles.tar.gz

VFlockingD3D10

This sample demonstrates a CUDA mathematical simulation of the behavior of a group of birds in flight.

Minimum Required GPU SM 1.0
Windows Source VFlockingD3D10.zip
Mac/Linux Source VFlockingD3D10.tar.gz

Advanced

Advanced Quicksort (CUDA Dynamic Parallelism)

This sample demonstrates an advanced quicksort implemented using CUDA Dynamic Parallelism.  This sample requires devices with                     compute capability 3.5 or higher.                  

Minimum Required GPU SM 3.5
Windows Source cdpAdvancedQuicksort.zip
Mac/Linux Source cdpAdvancedQuicksort.tar.gz

Aligned Types

A simple test showing the large gap in access speed between aligned and misaligned structures.

Minimum Required GPU SM 1.0
Windows Source alignedTypes.zip
Mac/Linux Source alignedTypes.tar.gz

Concurrent Kernels

This sample demonstrates the use of CUDA streams for concurrent execution of several kernels on devices of compute capability 2.0 or higher.  Devices of compute capability 1.x will run the kernels sequentially.  It also illustrates how to introduce dependencies between CUDA streams with the cudaStreamWaitEvent function introduced in CUDA 3.2.

Minimum Required GPU SM 1.0
Windows Source concurrentKernels.zip
Mac/Linux Source concurrentKernels.tar.gz

CUDA C 3D FDTD

This sample applies a finite-difference time-domain (FDTD) progression stencil on a 3D surface.

Minimum Required GPU SM 1.0
Windows Source FDTD3d.zip
Mac/Linux Source FDTD3d.tar.gz

CUDA Context Thread Management

Simple program illustrating how to use the CUDA Context Management API, together with the new CUDA 4.0 parameter passing and CUDA launch API.  CUDA contexts can be created separately and attached independently to different threads.

Minimum Required GPU SM 1.0
Windows Source threadMigration.zip
Mac/Linux Source threadMigration.tar.gz

CUDA Parallel Prefix Sum (Scan)

This example demonstrates an efficient CUDA implementation of parallel prefix sum, also known as "scan".  Given an array of                     numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array.                  

Minimum Required GPU SM 1.0
Windows Source scan.zip
Mac/Linux Source scan.tar.gz
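
The definition of scan given above (each output element is the sum of all input elements before it, i.e. an exclusive scan) can be written as a short sequential CPU reference. The function name exclusiveScan is hypothetical and is not taken from the sample; the sketch only pins down the expected result, not the parallel algorithm.

```cpp
#include <cstddef>
#include <vector>

// CPU reference for an exclusive prefix sum ("scan"):
// out[i] = in[0] + in[1] + ... + in[i-1], with out[0] = 0.
// exclusiveScan is a hypothetical helper name used for illustration.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // sum of everything before position i
        running += in[i];
    }
    return out;
}
```

For the input {3, 1, 7, 0, 4} this yields {0, 3, 4, 11, 11}.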

CUDA Parallel Prefix Sum with Shuffle Intrinsics (SHFL_Scan)

This example demonstrates how to use the shuffle intrinsic __shfl_up to perform a scan operation across a thread block.  A GPU with compute capability 3.0 or higher is required to run the sample.

Minimum Required GPU KEPLER SM 3.0
Windows Source shfl_scan.zip
Mac/Linux Source shfl_scan.tar.gz

CUDA Parallel Reduction

A parallel sum reduction that computes the sum of a large array of values.  This sample demonstrates several important optimization strategies for data-parallel algorithms like reduction.

Minimum Required GPU SM 1.0
White Paper reduction.pdf
Windows Source reduction.zip
Mac/Linux Source reduction.tar.gz
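
As a rough illustration of the tree-shaped access pattern a parallel sum reduction follows, here is a sequential C++ sketch. The function name treeReduceSum is hypothetical, and the sketch does not reproduce the optimizations of the sample; it only shows the halving structure.

```cpp
#include <cstddef>
#include <vector>

// Sequential sketch of a tree-based sum reduction.
// In each pass the upper half of the active range is folded onto the
// lower half; the iterations of the inner loop are independent of each
// other, so on a GPU each could be handled by a separate thread.
float treeReduceSum(std::vector<float> data) {
    if (data.empty()) return 0.0f;
    std::size_t n = data.size();
    while (n > 1) {
        const std::size_t half = (n + 1) / 2;  // elements that survive this pass
        for (std::size_t i = 0; i + half < n; ++i) {
            data[i] += data[i + half];
        }
        n = half;
    }
    return data[0];
}
```

The number of passes grows with the logarithm of the array length, which is the property the parallel versions exploit.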

CUDA Radix Sort using the Thrust Library

This sample demonstrates a very fast and efficient parallel radix sort that uses the Thrust library (http://code.google.com/p/thrust/).  The included RadixSort class can sort either key-value pairs (with float or unsigned integer keys) or keys only.

Minimum Required GPU SM 1.0
White Paper readme.txt
Windows Source radixSortThrust.zip
Mac/Linux Source radixSortThrust.tar.gz

CUDA Segmentation Tree Thrust Library

This sample demonstrates an approach to constructing image segmentation trees.  The method is based on Boruvka's MST algorithm.

Minimum Required GPU SM 1.3
Windows Source segmentationTreeThrust.zip
Mac/Linux Source segmentationTreeThrust.tar.gz

CUDA Sorting Networks

This sample implements bitonic sort and odd-even merge sort (also known as Batcher's sort), algorithms belonging to the class of sorting networks. While generally less efficient on large sequences than algorithms with better asymptotic complexity (e.g. merge sort or radix sort), they may be the algorithms of choice for sorting batches of short- to mid-sized (key, value) array pairs. Refer to the excellent tutorial by H. W. Lang: http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/indexen.htm

Minimum Required GPU SM 1.0
Windows Source sortingNetworks.zip
Mac/Linux Source sortingNetworks.tar.gz
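
To make the idea of a sorting network concrete, the following is a minimal sequential sketch of bitonic sort. It assumes the array length is a power of two and sorts keys only; the function name bitonicSort is hypothetical and the code is not taken from the sample.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sequential sketch of a bitonic sorting network (ascending order).
// Assumes a.size() is a power of two; returns false otherwise.
// The compare-exchange pattern is fixed in advance and does not depend
// on the data, which is the defining property of a sorting network.
bool bitonicSort(std::vector<int>& a) {
    const std::size_t n = a.size();
    if ((n & (n - 1)) != 0) return false;          // not a power of two
    for (std::size_t k = 2; k <= n; k <<= 1) {     // size of bitonic runs
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {  // compare distance
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner <= i) continue;        // handle each pair once
                const bool ascending = (i & k) == 0;
                if ((ascending && a[i] > a[partner]) ||
                    (!ascending && a[i] < a[partner])) {
                    std::swap(a[i], a[partner]);
                }
            }
        }
    }
    return true;
}
```

All compare-exchange operations for a given (k, j) pair touch disjoint elements, so they can run concurrently.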

Eigenvalues

The computation of all or a subset of all eigenvalues is an important problem in linear algebra, statistics, physics, and many other fields. This sample demonstrates a parallel implementation of a bisection algorithm for the computation of all eigenvalues of a tridiagonal symmetric matrix of arbitrary size with CUDA.

Minimum Required GPU SM 1.0
White Paper eigenvalues.pdf
Windows Source eigenvalues.zip
Mac/Linux Source eigenvalues.tar.gz
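
The bisection idea for a symmetric tridiagonal matrix can be sketched sequentially as below. This is a minimal sketch under stated assumptions, not the sample's parallel implementation: the function names are hypothetical, diag holds the n diagonal entries (n >= 1), off holds the n-1 off-diagonal entries, k is a zero-based index smaller than n, and the iteration count of 100 is an arbitrary choice.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Counts the eigenvalues below x using a Sturm sequence.
std::size_t countEigenvaluesBelow(const std::vector<double>& diag,
                                  const std::vector<double>& off, double x) {
    std::size_t count = 0;
    double q = 1.0;
    for (std::size_t i = 0; i < diag.size(); ++i) {
        const double e2 = (i == 0) ? 0.0 : off[i - 1] * off[i - 1];
        q = diag[i] - x - e2 / q;
        if (std::fabs(q) < 1e-30) q = -1e-30;  // avoid dividing by zero
        if (q < 0.0) ++count;
    }
    return count;
}

// Returns the k-th smallest eigenvalue (k is zero-based) by bisection
// on an interval obtained from Gershgorin bounds.
double kthEigenvalue(const std::vector<double>& diag,
                     const std::vector<double>& off, std::size_t k) {
    const std::size_t n = diag.size();
    double lo = diag[0], hi = diag[0];
    for (std::size_t i = 0; i < n; ++i) {
        const double radius = (i > 0 ? std::fabs(off[i - 1]) : 0.0) +
                              (i + 1 < n ? std::fabs(off[i]) : 0.0);
        lo = std::min(lo, diag[i] - radius);
        hi = std::max(hi, diag[i] + radius);
    }
    for (int iter = 0; iter < 100; ++iter) {
        const double mid = 0.5 * (lo + hi);
        if (countEigenvaluesBelow(diag, off, mid) > k) hi = mid;
        else lo = mid;
    }
    return 0.5 * (lo + hi);
}
```

For the 2x2 matrix with diagonal {2, 2} and off-diagonal {1}, the eigenvalues are 1 and 3. Because each eigenvalue (or interval) can be refined independently, the method lends itself to parallel execution.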

Fast Walsh Transform

Naturally (Hadamard)-ordered Fast Walsh Transform for batched vectors of arbitrary eligible (power of two) lengths.

Minimum Required GPU SM 1.0
Windows Source fastWalshTransform.zip
Mac/Linux Source fastWalshTransform.tar.gz
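
A minimal sequential sketch of the naturally (Hadamard) ordered transform for a single vector is shown below. The function name fwht is hypothetical, the length is assumed to be a power of two, and no normalization is applied; batching and the GPU mapping used by the sample are not reproduced.

```cpp
#include <cstddef>
#include <vector>

// In-place, unnormalized Fast Walsh-Hadamard Transform in natural
// (Hadamard) order. Assumes v.size() is a power of two.
void fwht(std::vector<float>& v) {
    const std::size_t n = v.size();
    for (std::size_t len = 1; len < n; len <<= 1) {
        for (std::size_t base = 0; base < n; base += 2 * len) {
            for (std::size_t i = base; i < base + len; ++i) {
                const float a = v[i];
                const float b = v[i + len];
                v[i] = a + b;        // butterfly: sum
                v[i + len] = a - b;  // butterfly: difference
            }
        }
    }
}
```

Applying the transform twice returns the original vector scaled by its length, which is a convenient self-check.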

Function Pointers

This sample illustrates how to use function pointers and implements the Sobel Edge Detection filter for 8-bit monochrome images.

Minimum Required GPU SM 2.0
Windows Source FunctionPointers.zip
Mac/Linux Source FunctionPointers.tar.gz

Line of Sight

This sample is an implementation of a simple line-of-sight algorithm: Given a height map and a ray originating at some observation                     point, it computes all the points along the ray that are visible from the observation point. The implementation is based on                     the Thrust library (http://code.google.com/p/thrust/).                  

Minimum Required GPU SM 1.0
Windows Source lineOfSight.zip
Mac/Linux Source lineOfSight.tar.gz
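
One way to picture the line-of-sight test is as a running maximum over the slopes seen from the observation point, which is the kind of prefix operation a scan primitive can compute in parallel. The sketch below is a sequential illustration only: the function name, the observerHeight and spacing parameters, and the convention that heights[i] lies at distance (i + 1) * spacing along the ray are all assumptions, not details taken from the sample.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Marks which points along a ray are visible from the observer.
// A point is treated as visible when its slope from the observer is at
// least as steep as every slope encountered before it.
std::vector<bool> visibleAlongRay(const std::vector<float>& heights,
                                  float observerHeight, float spacing) {
    std::vector<bool> visible(heights.size());
    float maxSlope = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < heights.size(); ++i) {
        const float distance = spacing * static_cast<float>(i + 1);
        const float slope = (heights[i] - observerHeight) / distance;
        visible[i] = slope >= maxSlope;
        if (slope > maxSlope) maxSlope = slope;  // running maximum
    }
    return visible;
}
```

With heights {1, 3, 2, 10}, an observer at height 0 and unit spacing, the third point is hidden behind the second and the others are visible.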

LU Decomposition (CUDA Dynamic Parallelism)

This sample demonstrates LU Decomposition implemented using CUDA Dynamic Parallelism.  This sample requires devices with compute                     capability 3.5 or higher.                  

Minimum Required GPU SM 3.5
Windows Source cdpLUDecomposition.zip
Mac/Linux Source cdpLUDecomposition.tar.gz

Matrix Transpose

This sample demonstrates matrix transpose.  Different approaches are shown to achieve high performance.

Minimum Required GPU SM 1.0
White Paper MatrixTranspose.pdf
Windows Source transpose.zip
Mac/Linux Source transpose.tar.gz

Merge Sort

This sample implements a merge sort (also known as Batcher's sort), an algorithm belonging to the class of sorting networks. While generally less efficient on large sequences than algorithms with better asymptotic complexity (e.g. radix sort), it may be the algorithm of choice for sorting batches of short- to mid-sized (key, value) array pairs. Refer to the excellent tutorial by H. W. Lang: http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/indexen.htm

Minimum Required GPU SM 1.0
Windows Source mergeSort.zip
Mac/Linux Source mergeSort.tar.gz

NewDelete

This sample demonstrates dynamic global memory allocation through device C++ new and delete operators and virtual function declarations available with CUDA 4.0.

Minimum Required GPU SM 2.0
Windows Source newdelete.zip
Mac/Linux Source newdelete.tar.gz

PTX Just-in-Time compilation

This sample uses the Driver API to just-in-time (JIT) compile a kernel from PTX code. Additionally, this sample demonstrates the seamless interoperability of CUDA Runtime and CUDA Driver API calls.

Minimum Required GPU SM 1.0
Windows Source ptxjit.zip
Mac/Linux Source ptxjit.tar.gz

Quad Tree (CUDA Dynamic Parallelism)

This sample demonstrates Quad Trees implemented using CUDA Dynamic Parallelism.  This sample requires devices with compute                     capability 3.5 or higher.                  

Minimum Required GPU SM 3.5
Windows Source cdpQuadtree.zip
Mac/Linux Source cdpQuadtree.tar.gz

Scalar Product

This sample calculates scalar products of a given set of input vector pairs.

Minimum Required GPU SM 1.0
Windows Source scalarProd.zip
Mac/Linux Source scalarProd.tar.gz

simpleHyperQ

This sample demonstrates the use of CUDA streams for concurrent execution of several kernels on devices which provide HyperQ (SM 3.5).  Devices without HyperQ (SM 2.0 and SM 3.0) will run a maximum of two kernels concurrently.

Minimum Required GPU SM 1.3
White Paper HyperQ.pdf
Windows Source simpleHyperQ.zip
Mac/Linux Source simpleHyperQ.tar.gz

threadFenceReduction

This sample shows how to perform a reduction operation on an array of values using the thread fence intrinsic to produce a single value in a single kernel (as opposed to two or more kernel calls as shown in the "reduction" SDK sample).  Single-pass reduction requires global atomic instructions (Compute Capability 1.1 or later) and the __threadfence() intrinsic (CUDA 2.2 or later).

Minimum Required GPU SM 1.0
Windows Source threadFenceReduction.zip
Mac/Linux Source threadFenceReduction.tar.gz

CUDALibraries

batchCUBLAS

An SDK sample that demonstrates how to use batched CUBLAS API calls to improve overall performance.

Minimum Required GPU SM 1.0
Windows Source batchCUBLAS.zip
Mac/Linux Source batchCUBLAS.tar.gz

Box Filter with NPP

An NPP SDK sample that demonstrates how to use the NPP FilterBox function to perform a box filter.

Minimum Required GPU SM 1.0
Windows Source boxFilterNPP.zip
Mac/Linux Source boxFilterNPP.tar.gz

ConjugateGradient

This sample implements a conjugate gradient solver on the GPU using the CUBLAS and CUSPARSE libraries.

Minimum Required GPU SM 1.0
Windows Source conjugateGradient.zip
Mac/Linux Source conjugateGradient.tar.gz

FreeImage and NPP Interoperability

A simple SDK sample that demonstrates how to use the FreeImage library with NPP.

Minimum Required GPU SM 1.0
Windows Source freeImageInteropNPP.zip
Mac/Linux Source freeImageInteropNPP.tar.gz

GrabCut with NPP

CUDA implementation of the GrabCut approach of Rother et al., using the 8-neighborhood NPP Graphcut primitive introduced in CUDA 4.1. (C. Rother, V. Kolmogorov, A. Blake. GrabCut: Interactive Foreground Extraction using Iterated Graph Cuts. ACM Transactions on Graphics (SIGGRAPH'04), 2004)

Minimum Required GPU SM 1.0
Windows Source grabcutNPP.zip
Mac/Linux Source grabcutNPP.tar.gz

Histogram Equalization with NPP

This SDK sample demonstrates how to use NPP to perform histogram equalization on image data.

Minimum Required GPU SM 1.0
Windows Source histEqualizationNPP.zip
Mac/Linux Source histEqualizationNPP.tar.gz

Image Segmentation using Graphcuts with NPP

This sample demonstrates how to perform image segmentation using the NPP GraphCut function.

Minimum Required GPU SM 1.0
Windows Source imageSegmentationNPP.zip
Mac/Linux Source imageSegmentationNPP.tar.gz

MersenneTwisterGP11213

This sample demonstrates the Mersenne Twister random number generator GP11213 in cuRAND.

Minimum Required GPU SM 1.0
Windows Source MersenneTwisterGP11213.zip
Mac/Linux Source MersenneTwisterGP11213.tar.gz

Monte Carlo Estimation of Pi (batch inline QRNG)

This sample uses Monte Carlo simulation for Estimation of Pi (using batch inline QRNG).  This sample also uses the NVIDIA                     CURAND library.                  

Minimum Required GPU SM 1.0
Windows Source MC_EstimatePiInlineQ.zip
Mac/Linux Source MC_EstimatePiInlineQ.tar.gz

Monte Carlo Estimation of Pi (batch PRNG)

This sample uses Monte Carlo simulation for Estimation of Pi (using batch PRNG).  This sample also uses the NVIDIA CURAND                     library.                  

Minimum Required GPU SM 1.0
Windows Source MC_EstimatePiP.zip
Mac/Linux Source MC_EstimatePiP.tar.gz
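
The Monte Carlo estimate of Pi behind these samples can be illustrated with a short CPU sketch. Here std::mt19937 from the C++ standard library stands in for CURAND, and the function name estimatePi and its parameters are hypothetical; the sketch shows only the estimator, not the batched or inline GPU generation.

```cpp
#include <cstddef>
#include <random>

// Estimates Pi by sampling points uniformly in the unit square and
// counting the fraction that falls inside the quarter unit circle.
// std::mt19937 is a CPU stand-in for CURAND (an assumption of this sketch).
double estimatePi(std::size_t samples, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::size_t inside = 0;
    for (std::size_t i = 0; i < samples; ++i) {
        const double x = dist(gen);
        const double y = dist(gen);
        if (x * x + y * y <= 1.0) ++inside;
    }
    // Area ratio (quarter circle / square) is Pi / 4.
    return 4.0 * static_cast<double>(inside) / static_cast<double>(samples);
}
```

The error of the estimate shrinks roughly with the inverse square root of the sample count, which is why such estimators benefit from generating very large batches of random numbers.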

Monte Carlo Estimation of Pi (batch QRNG)

This sample uses Monte Carlo simulation for Estimation of Pi (using batch QRNG).  This sample also uses the NVIDIA CURAND                     library.                  

Minimum Required GPU SM 1.0
Windows Source MC_EstimatePiQ.zip
Mac/Linux Source MC_EstimatePiQ.tar.gz

Monte Carlo Estimation of Pi (inline PRNG)

This sample uses Monte Carlo simulation for Estimation of Pi (using inline PRNG).  This sample also uses the NVIDIA CURAND                     library.                  

Minimum Required GPU SM 1.0
Windows Source MC_EstimatePiInlineP.zip
Mac/Linux Source MC_EstimatePiInlineP.tar.gz

Monte Carlo Single Asian Option

This sample uses Monte Carlo to simulate Single Asian Options using the NVIDIA CURAND library.

Minimum Required GPU SM 1.0
Windows Source MC_SingleAsianOptionP.zip
Mac/Linux Source MC_SingleAsianOptionP.tar.gz

Preconditioned Conjugate Gradient

This sample implements a preconditioned conjugate gradient solver on the GPU using the CUBLAS and CUSPARSE libraries.

Minimum Required GPU SM 1.0
Windows Source conjugateGradientPrecond.zip
Mac/Linux Source conjugateGradientPrecond.tar.gz

Random Fog

This sample illustrates pseudo- and quasi-random numbers produced by CURAND.

Minimum Required GPU SM 1.0
Windows Source randomFog.zip
Mac/Linux Source randomFog.tar.gz

Simple CUBLAS

Example of using CUBLAS through the new CUBLAS API available in CUDA 4.0.

Minimum Required GPU SM 1.0
Windows Source simpleCUBLAS.zip
Mac/Linux Source simpleCUBLAS.tar.gz

Simple CUFFT

Example of using CUFFT. In this example, CUFFT is used to compute the 1D convolution of a signal with a filter by transforming both into the frequency domain, multiplying them together, and transforming the result back to the time domain.

Minimum Required GPU SM 1.0
Windows Source simpleCUFFT.zip
Mac/Linux Source simpleCUFFT.tar.gz
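
The transform, multiply, inverse-transform procedure described for the Simple CUFFT sample can be sketched on the CPU as follows. For brevity the sketch uses a naive O(n^2) DFT instead of an FFT, computes a circular convolution of two equal-length arrays, and omits any zero padding; the function names dft and circularConvolve are hypothetical and nothing here calls CUFFT.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Naive unnormalized DFT; sign = -1 for forward, +1 for inverse.
std::vector<Complex> dft(const std::vector<Complex>& x, int sign) {
    const std::size_t n = x.size();
    const double twoPi = 2.0 * std::acos(-1.0);
    std::vector<Complex> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = sign * twoPi *
                static_cast<double>((k * t) % n) / static_cast<double>(n);
            out[k] += x[t] * std::polar(1.0, angle);
        }
    }
    return out;
}

// Circular convolution via the convolution theorem.
// Assumes signal and filter have the same length.
std::vector<double> circularConvolve(const std::vector<double>& signal,
                                     const std::vector<double>& filter) {
    const std::size_t n = signal.size();
    const std::vector<Complex> s(signal.begin(), signal.end());
    const std::vector<Complex> f(filter.begin(), filter.end());
    const std::vector<Complex> S = dft(s, -1);
    const std::vector<Complex> F = dft(f, -1);
    std::vector<Complex> product(n);
    for (std::size_t i = 0; i < n; ++i) product[i] = S[i] * F[i];
    const std::vector<Complex> y = dft(product, +1);
    std::vector<double> result(n);
    for (std::size_t i = 0; i < n; ++i) {
        result[i] = y[i].real() / static_cast<double>(n);  // normalize inverse
    }
    return result;
}
```

Convolving {1, 2, 3, 4} with {1, 1, 0, 0} this way gives {5, 3, 5, 7} up to rounding error.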

simpleDevLibCUBLAS GPU Device API Library Functions (CUDA Dynamic Parallelism)

This sample implements simple CUBLAS function calls that invoke the GPU device API library to run CUBLAS functions.  This sample requires an SM 3.5 capable device.

Minimum Required GPU SM 3.5
Windows Source simpleDevLibCUBLAS.zip
Mac/Linux Source simpleDevLibCUBLAS.tar.gz

Notices

Notice

ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND                        SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE                        WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS                        FOR A PARTICULAR PURPOSE.                      

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks

NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation                        in the U.S. and other countries.  Other company and product names may be trademarks of                        the respective companies with which they are associated.                     
