Intel oneAPI Base Toolkit: Hands-On Practice
Introduction to Intel oneAPI
What is Intel oneAPI? oneAPI is an open, cross-industry, standards-based, unified, multi-architecture, multi-vendor programming model that delivers a common developer experience across accelerator architectures, for faster application performance, more productivity, and greater innovation. The oneAPI initiative encourages collaboration on the oneAPI specification and compatible oneAPI implementations across the ecosystem.
Installing the Intel oneAPI Base Toolkit
- OS: Ubuntu 23.04 x86_64
- Linux kernel: 6.2.0-20-generic
- CPU: AMD Ryzen 7 4700U with Radeon Graphics (8) @ 2.000GHz
- GPU: AMD ATI Renoir (04:00.0)
Installation Steps
- Add Intel's official package repository and download its signing key into the system keyring:

```shell
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
  | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
```
- Add the signed entry to the APT sources and configure APT to use the Intel repository:

```shell
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
```
- Update the package lists and install the Intel oneAPI Base Toolkit:

```shell
sudo apt update && sudo apt install -y intel-basekit
```
The installation then runs to completion.
Configuring the Intel oneAPI Toolkit
- Install the build dependencies:

```shell
sudo apt update
sudo apt -y install cmake pkg-config build-essential
```
- Configure the shell environment variables

Each time you open a new shell, run setvars.sh to set up the environment. For a system-wide installation the script is located at /opt/intel/oneapi/setvars.sh; for a per-user installation it is located at ~/intel/oneapi/setvars.sh.

- For a system-wide installation:

```shell
. /opt/intel/oneapi/setvars.sh
```

- For a per-user installation:

```shell
. ~/intel/oneapi/setvars.sh
```
The script prints its environment initialization output.
Verifying the Installation
Run the following in a shell:

```shell
oneapi-cli
```

If an interface like the one below appears, the toolkit is configured correctly:
A First oneAPI Project
Generating the Project Template
- In a shell, run:

```shell
. /opt/intel/oneapi/setvars.sh
oneapi-cli
```
- Project creation steps:
  - Select "Create a project"
  - Select "cpp"
  - Select "Vector Add"
- When creation finishes, the directory structure looks like this:
Compiling and Running the Project Code
The source directory contains the following code:
- vector-add-buffers.cpp:
```cpp
//==============================================================
// Vector Add is the equivalent of a Hello, World! sample for data parallel
// programs. Building and running the sample verifies that your development
// environment is setup correctly and demonstrates the use of the core features
// of SYCL. This sample runs on both CPU and GPU (or FPGA). When run, it
// computes on both the CPU and offload device, then compares results. If the
// code executes on both CPU and offload device, the device name and a success
// message are displayed. And, your development environment is setup correctly!
//
// For comprehensive instructions regarding SYCL Programming, go to
// https://software.intel.com/en-us/oneapi-programming-guide and search based on
// relevant terms noted in the comments.
//
// SYCL material used in the code sample:
// •  A one dimensional array of data.
// •  A device queue, buffer, accessor, and kernel.
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>
#include <string>
#if FPGA_HARDWARE || FPGA_EMULATOR || FPGA_SIMULATOR
#include <sycl/ext/intel/fpga_extensions.hpp>
#endif

using namespace sycl;

// num_repetitions: How many times to repeat the kernel invocation
size_t num_repetitions = 1;
// Vector type and data size for this example.
size_t vector_size = 10000;
typedef std::vector<int> IntVector;

// Create an exception handler for asynchronous SYCL exceptions
static auto exception_handler = [](sycl::exception_list e_list) {
  for (std::exception_ptr const &e : e_list) {
    try {
      std::rethrow_exception(e);
    }
    catch (std::exception const &e) {
#if _DEBUG
      std::cout << "Failure" << std::endl;
#endif
      std::terminate();
    }
  }
};

//************************************
// Vector add in SYCL on device: returns sum in 4th parameter "sum_parallel".
//************************************
void VectorAdd(queue &q, const IntVector &a_vector, const IntVector &b_vector,
               IntVector &sum_parallel) {
  // Create the range object for the vectors managed by the buffer.
  range<1> num_items{a_vector.size()};

  // Create buffers that hold the data shared between the host and the devices.
  // The buffer destructor is responsible to copy the data back to host when it
  // goes out of scope.
  buffer a_buf(a_vector);
  buffer b_buf(b_vector);
  buffer sum_buf(sum_parallel.data(), num_items);

  for (size_t i = 0; i < num_repetitions; i++) {

    // Submit a command group to the queue by a lambda function that contains the
    // data access permission and device computation (kernel).
    q.submit([&](handler &h) {
      // Create an accessor for each buffer with access permission: read, write or
      // read/write. The accessor is a mean to access the memory in the buffer.
      accessor a(a_buf, h, read_only);
      accessor b(b_buf, h, read_only);

      // The sum_accessor is used to store (with write permission) the sum data.
      accessor sum(sum_buf, h, write_only, no_init);

      // Use parallel_for to run vector addition in parallel on device. This
      // executes the kernel.
      //    1st parameter is the number of work items.
      //    2nd parameter is the kernel, a lambda that specifies what to do per
      //    work item. The parameter of the lambda is the work item id.
      // SYCL supports unnamed lambda kernel by default.
      h.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });
    });
  };
  // Wait until compute tasks on GPU done
  q.wait();
}

//************************************
// Initialize the vector from 0 to vector_size - 1
//************************************
void InitializeVector(IntVector &a) {
  for (size_t i = 0; i < a.size(); i++) a.at(i) = i;
}

//************************************
// Demonstrate vector add both in sequential on CPU and in parallel on device.
//************************************
int main(int argc, char* argv[]) {
  // Change num_repetitions if it was passed as argument
  if (argc > 2) num_repetitions = std::stoi(argv[2]);
  // Change vector_size if it was passed as argument
  if (argc > 1) vector_size = std::stoi(argv[1]);

  // Create device selector for the device of your interest.
#if FPGA_EMULATOR
  // Intel extension: FPGA emulator selector on systems without FPGA card.
  auto selector = sycl::ext::intel::fpga_emulator_selector_v;
#elif FPGA_SIMULATOR
  // Intel extension: FPGA simulator selector on systems without FPGA card.
  auto selector = sycl::ext::intel::fpga_simulator_selector_v;
#elif FPGA_HARDWARE
  // Intel extension: FPGA selector on systems with FPGA card.
  auto selector = sycl::ext::intel::fpga_selector_v;
#else
  // The default device selector will select the most performant device.
  auto selector = default_selector_v;
#endif

  // Create vector objects with "vector_size" to store the input and output data.
  IntVector a, b, sum_sequential, sum_parallel;
  a.resize(vector_size);
  b.resize(vector_size);
  sum_sequential.resize(vector_size);
  sum_parallel.resize(vector_size);

  // Initialize input vectors with values from 0 to vector_size - 1
  InitializeVector(a);
  InitializeVector(b);

  try {
    queue q(selector, exception_handler);

    // Print out the device information used for the kernel code.
    std::cout << "Running on device: "
              << q.get_device().get_info<info::device::name>() << "\n";
    std::cout << "Vector size: " << a.size() << "\n";

    // Vector addition in SYCL
    VectorAdd(q, a, b, sum_parallel);
  } catch (exception const &e) {
    std::cout << "An exception is caught for vector add.\n";
    std::terminate();
  }

  // Compute the sum of two vectors in sequential for validation.
  for (size_t i = 0; i < sum_sequential.size(); i++)
    sum_sequential.at(i) = a.at(i) + b.at(i);

  // Verify that the two vectors are equal.
  for (size_t i = 0; i < sum_sequential.size(); i++) {
    if (sum_parallel.at(i) != sum_sequential.at(i)) {
      std::cout << "Vector add failed on device.\n";
      return -1;
    }
  }

  int indices[]{0, 1, 2, (static_cast<int>(a.size()) - 1)};
  constexpr size_t indices_size = sizeof(indices) / sizeof(int);

  // Print out the result of vector add.
  for (int i = 0; i < indices_size; i++) {
    int j = indices[i];
    if (i == indices_size - 1) std::cout << "...\n";
    std::cout << "[" << j << "]: " << a[j] << " + " << b[j] << " = "
              << sum_parallel[j] << "\n";
  }

  a.clear();
  b.clear();
  sum_sequential.clear();
  sum_parallel.clear();

  std::cout << "Vector add successfully completed on device.\n";
  return 0;
}
```
- vector-add-usm.cpp:
```cpp
//==============================================================
// Vector Add is the equivalent of a Hello, World! sample for data parallel
// programs. Building and running the sample verifies that your development
// environment is setup correctly and demonstrates the use of the core features
// of SYCL. This sample runs on both CPU and GPU (or FPGA). When run, it
// computes on both the CPU and offload device, then compares results. If the
// code executes on both CPU and offload device, the device name and a success
// message are displayed. And, your development environment is setup correctly!
//
// For comprehensive instructions regarding SYCL Programming, go to
// https://software.intel.com/en-us/oneapi-programming-guide and search based on
// relevant terms noted in the comments.
//
// SYCL material used in the code sample:
// •  A one dimensional array of data shared between CPU and offload device.
// •  A device queue and kernel.
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <array>
#include <iostream>
#include <string>
#if FPGA_HARDWARE || FPGA_EMULATOR || FPGA_SIMULATOR
#include <sycl/ext/intel/fpga_extensions.hpp>
#endif

using namespace sycl;

// Array size for this example.
size_t array_size = 10000;

// Create an exception handler for asynchronous SYCL exceptions
static auto exception_handler = [](sycl::exception_list e_list) {
  for (std::exception_ptr const &e : e_list) {
    try {
      std::rethrow_exception(e);
    }
    catch (std::exception const &e) {
#if _DEBUG
      std::cout << "Failure" << std::endl;
#endif
      std::terminate();
    }
  }
};

//************************************
// Vector add in SYCL on device: returns sum in 4th parameter "sum".
//************************************
void VectorAdd(queue &q, const int *a, const int *b, int *sum, size_t size) {
  // Create the range object for the arrays.
  range<1> num_items{size};

  // Use parallel_for to run vector addition in parallel on device. This
  // executes the kernel.
  //    1st parameter is the number of work items.
  //    2nd parameter is the kernel, a lambda that specifies what to do per
  //    work item. the parameter of the lambda is the work item id.
  // SYCL supports unnamed lambda kernel by default.
  auto e = q.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });

  // q.parallel_for() is an asynchronous call. SYCL runtime enqueues and runs
  // the kernel asynchronously. Wait for the asynchronous call to complete.
  e.wait();
}

//************************************
// Initialize the array from 0 to array_size - 1
//************************************
void InitializeArray(int *a, size_t size) {
  for (size_t i = 0; i < size; i++) a[i] = i;
}

//************************************
// Demonstrate vector add both in sequential on CPU and in parallel on device.
//************************************
int main(int argc, char* argv[]) {
  // Change array_size if it was passed as argument
  if (argc > 1) array_size = std::stoi(argv[1]);

  // Create device selector for the device of your interest.
#if FPGA_EMULATOR
  // Intel extension: FPGA emulator selector on systems without FPGA card.
  auto selector = sycl::ext::intel::fpga_emulator_selector_v;
#elif FPGA_SIMULATOR
  // Intel extension: FPGA simulator selector on systems without FPGA card.
  auto selector = sycl::ext::intel::fpga_simulator_selector_v;
#elif FPGA_HARDWARE
  // Intel extension: FPGA selector on systems with FPGA card.
  auto selector = sycl::ext::intel::fpga_selector_v;
#else
  // The default device selector will select the most performant device.
  auto selector = default_selector_v;
#endif

  try {
    queue q(selector, exception_handler);

    // Print out the device information used for the kernel code.
    std::cout << "Running on device: "
              << q.get_device().get_info<info::device::name>() << "\n";
    std::cout << "Vector size: " << array_size << "\n";

    // Create arrays with "array_size" to store input and output data. Allocate
    // unified shared memory so that both CPU and device can access them.
    int *a = malloc_shared<int>(array_size, q);
    int *b = malloc_shared<int>(array_size, q);
    int *sum_sequential = malloc_shared<int>(array_size, q);
    int *sum_parallel = malloc_shared<int>(array_size, q);

    if ((a == nullptr) || (b == nullptr) || (sum_sequential == nullptr) ||
        (sum_parallel == nullptr)) {
      if (a != nullptr) free(a, q);
      if (b != nullptr) free(b, q);
      if (sum_sequential != nullptr) free(sum_sequential, q);
      if (sum_parallel != nullptr) free(sum_parallel, q);

      std::cout << "Shared memory allocation failure.\n";
      return -1;
    }

    // Initialize input arrays with values from 0 to array_size - 1
    InitializeArray(a, array_size);
    InitializeArray(b, array_size);

    // Compute the sum of two arrays in sequential for validation.
    for (size_t i = 0; i < array_size; i++) sum_sequential[i] = a[i] + b[i];

    // Vector addition in SYCL.
    VectorAdd(q, a, b, sum_parallel, array_size);

    // Verify that the two arrays are equal.
    for (size_t i = 0; i < array_size; i++) {
      if (sum_parallel[i] != sum_sequential[i]) {
        std::cout << "Vector add failed on device.\n";
        return -1;
      }
    }

    int indices[]{0, 1, 2, (static_cast<int>(array_size) - 1)};
    constexpr size_t indices_size = sizeof(indices) / sizeof(int);

    // Print out the result of vector add.
    for (int i = 0; i < indices_size; i++) {
      int j = indices[i];
      if (i == indices_size - 1) std::cout << "...\n";
      std::cout << "[" << j << "]: " << j << " + " << j << " = "
                << sum_sequential[j] << "\n";
    }

    free(a, q);
    free(b, q);
    free(sum_sequential, q);
    free(sum_parallel, q);
  } catch (exception const &e) {
    std::cout << "An exception is caught while adding two vectors.\n";
    std::terminate();
  }

  std::cout << "Vector add successfully completed on device.\n";
  return 0;
}
```
Compiling and Running
- Configure the project for the buffer-based implementation (first block) or the unified shared memory (USM) based implementation (second block):

```shell
cd vector-add
mkdir build && cd build
cmake ..
```

```shell
cd vector-add
mkdir build && cd build
cmake .. -DUSM=1
```
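The -DUSM=1 switch works because the sample's CMakeLists.txt branches on that cache variable to pick the source file. A simplified sketch of that pattern (my own reconstruction, not the sample's exact file) looks like:

```cmake
# Simplified sketch: a USM cache variable selects which source file is built.
# The sample's actual CMakeLists.txt is more elaborate (FPGA targets, flags).
if(USM)
    set(SOURCE_FILE vector-add-usm.cpp)
    set(TARGET_NAME vector-add-usm)
else()
    set(SOURCE_FILE vector-add-buffers.cpp)
    set(TARGET_NAME vector-add-buffers)
endif()
add_executable(${TARGET_NAME} ${SOURCE_FILE})
```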
- Build the project:

```shell
make cpu-gpu
```
- Run the program:

```shell
./vector-add-buffers
./vector-add-usm
```
- Clean up the build artifacts:

```shell
make clean
```