Intel oneAPI Base Toolkit: Learning and Practice

Intel oneAPI Overview

What is Intel oneAPI? oneAPI is an open, cross-industry, standards-based, unified, multi-architecture, multi-vendor programming model that provides a common developer experience across accelerator architectures, aiming for faster application performance, higher productivity, and greater innovation. The oneAPI initiative encourages ecosystem-wide collaboration on the oneAPI specification and on compatible oneAPI implementations.

Installing the Intel oneAPI Base Toolkit

  • OS: Ubuntu 23.04 x86_64
  • Linux kernel: 6.2.0-20-generic
  • CPU: AMD Ryzen 7 4700U with Radeon Graphics (8) @ 2.000GHz
  • GPU: AMD ATI 04:00.0 Renoir

Installation Steps

  1. Add Intel's official package repository, downloading its GPG key into the system keyring:

    wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
    
  2. Add the signed entry to the APT sources and configure APT to use the Intel repository:

    echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
    
  3. Update the package index and install the Intel oneAPI Base Toolkit:

    sudo apt update && sudo apt install -y intel-basekit
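After these steps, a quick sanity check can confirm that the keyring and the repository entry are in place. This is a sketch; the two paths are exactly the ones used in steps 1 and 2 above.

```shell
# Check the dearmored Intel keyring and the APT source entry created above.
if [ -s /usr/share/keyrings/oneapi-archive-keyring.gpg ]; then
  keyring_status=present
else
  keyring_status="missing (re-run the wget | gpg --dearmor step)"
fi
if grep -qs 'apt.repos.intel.com/oneapi' /etc/apt/sources.list.d/oneAPI.list; then
  repo_status=present
else
  repo_status="missing (re-run the tee step)"
fi
echo "keyring:    $keyring_status"
echo "repo entry: $repo_status"
```

Both checks read files only, so the script is safe to run repeatedly and needs no root privileges.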
    

Installation Process

(Screenshot: installation in progress)

Configuring the Intel oneAPI Toolkit

  1. Install the required dependencies:

    sudo apt update
    sudo apt -y install cmake pkg-config build-essential
    
  2. Set up the command-line environment

    Each time you open a terminal, run setvars.sh to set the environment variables. For a system-wide install, the script is located at /opt/intel/oneapi/setvars.sh; for a per-user install, at ~/intel/oneapi/setvars.sh.

    • For a system-wide install

      . /opt/intel/oneapi/setvars.sh
      
    • For a per-user install

      . ~/intel/oneapi/setvars.sh
      

    The script output looks like this:

    (screenshot)
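To avoid running setvars.sh by hand in every new terminal, you can source it from your shell startup file. The following is a minimal sketch, assuming the system-wide path /opt/intel/oneapi/setvars.sh (substitute ~/intel/oneapi/setvars.sh for a per-user install); the file-existence guard keeps ~/.bashrc harmless on machines where oneAPI is absent.

```shell
# Append a guarded "source setvars.sh" line to ~/.bashrc, only if not already there.
SETVARS=/opt/intel/oneapi/setvars.sh   # per-user installs: $HOME/intel/oneapi/setvars.sh
LINE="[ -f $SETVARS ] && . $SETVARS"
touch "$HOME/.bashrc"
if ! grep -qxF "$LINE" "$HOME/.bashrc"; then
  echo "$LINE" >> "$HOME/.bashrc"
fi
```

The `grep -qxF` check makes the snippet idempotent, so re-running it never duplicates the line.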

  3. Verify the installation

    Run in a terminal:

    oneapi-cli
    

    If the following interface appears, the setup succeeded:

    (screenshot)

    Your First oneAPI Project

    Create a Project from a Template

    • Run in a terminal:

      . /opt/intel/oneapi/setvars.sh
      oneapi-cli
      
    • Steps:

      1. Choose "Create a project"

        (screenshot)

      2. Select cpp

        (screenshot)

      3. Select Vector Add

        (screenshot)

      4. Creation complete; the resulting directory structure is:

        (screenshot)

    Build and Run the Project

    The source directory contains the following code:

    • vector-add-buffers.cpp

      //==============================================================
      // Vector Add is the equivalent of a Hello, World! sample for data parallel
      // programs. Building and running the sample verifies that your development
      // environment is setup correctly and demonstrates the use of the core features
      // of SYCL. This sample runs on both CPU and GPU (or FPGA). When run, it
      // computes on both the CPU and offload device, then compares results. If the
      // code executes on both CPU and offload device, the device name and a success
      // message are displayed. And, your development environment is setup correctly!
      //
      // For comprehensive instructions regarding SYCL Programming, go to
      // https://software.intel.com/en-us/oneapi-programming-guide and search based on
      // relevant terms noted in the comments.
      //
      // SYCL material used in the code sample:
      // •	A one dimensional array of data.
      // •	A device queue, buffer, accessor, and kernel.
      //==============================================================
      // Copyright © Intel Corporation
      //
      // SPDX-License-Identifier: MIT
      // =============================================================
      #include <sycl/sycl.hpp>
      #include <vector>
      #include <iostream>
      #include <string>
      #if FPGA_HARDWARE || FPGA_EMULATOR || FPGA_SIMULATOR
      #include <sycl/ext/intel/fpga_extensions.hpp>
      #endif
      
      using namespace sycl;
      
      // num_repetitions: How many times to repeat the kernel invocation
      size_t num_repetitions = 1;
      // Vector type and data size for this example.
      size_t vector_size = 10000;
      typedef std::vector<int> IntVector; 
      
      // Create an exception handler for asynchronous SYCL exceptions
      static auto exception_handler = [](sycl::exception_list e_list) {
        for (std::exception_ptr const &e : e_list) {
          try {
            std::rethrow_exception(e);
          }
          catch (std::exception const &e) {
      #if _DEBUG
            std::cout << "Failure" << std::endl;
      #endif
            std::terminate();
          }
        }
      };
      
      //************************************
      // Vector add in SYCL on device: returns sum in 4th parameter "sum_parallel".
      //************************************
      void VectorAdd(queue &q, const IntVector &a_vector, const IntVector &b_vector,
                     IntVector &sum_parallel) {
        // Create the range object for the vectors managed by the buffer.
        range<1> num_items{a_vector.size()};
      
        // Create buffers that hold the data shared between the host and the devices.
        // The buffer destructor is responsible to copy the data back to host when it
        // goes out of scope.
        buffer a_buf(a_vector);
        buffer b_buf(b_vector);
        buffer sum_buf(sum_parallel.data(), num_items);
      
        for (size_t i = 0; i < num_repetitions; i++ ) {
      
          // Submit a command group to the queue by a lambda function that contains the
          // data access permission and device computation (kernel).
          q.submit([&](handler &h) {
            // Create an accessor for each buffer with access permission: read, write or
            // read/write. The accessor is a mean to access the memory in the buffer.
            accessor a(a_buf, h, read_only);
            accessor b(b_buf, h, read_only);
        
            // The sum_accessor is used to store (with write permission) the sum data.
            accessor sum(sum_buf, h, write_only, no_init);
        
            // Use parallel_for to run vector addition in parallel on device. This
            // executes the kernel.
            //    1st parameter is the number of work items.
            //    2nd parameter is the kernel, a lambda that specifies what to do per
            //    work item. The parameter of the lambda is the work item id.
            // SYCL supports unnamed lambda kernel by default.
            h.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });
          });
        };
        // Wait until compute tasks on GPU done
        q.wait();
      }
      
      //************************************
      // Initialize the vector from 0 to vector_size - 1
      //************************************
      void InitializeVector(IntVector &a) {
        for (size_t i = 0; i < a.size(); i++) a.at(i) = i;
      }
      
      //************************************
      // Demonstrate vector add both in sequential on CPU and in parallel on device.
      //************************************
      int main(int argc, char* argv[]) {
        // Change num_repetitions if it was passed as argument
        if (argc > 2) num_repetitions = std::stoi(argv[2]);
        // Change vector_size if it was passed as argument
        if (argc > 1) vector_size = std::stoi(argv[1]);
        // Create device selector for the device of your interest.
      #if FPGA_EMULATOR
        // Intel extension: FPGA emulator selector on systems without FPGA card.
        auto selector = sycl::ext::intel::fpga_emulator_selector_v;
      #elif FPGA_SIMULATOR
        // Intel extension: FPGA simulator selector on systems without FPGA card.
        auto selector = sycl::ext::intel::fpga_simulator_selector_v;
      #elif FPGA_HARDWARE
        // Intel extension: FPGA selector on systems with FPGA card.
        auto selector = sycl::ext::intel::fpga_selector_v;
      #else
        // The default device selector will select the most performant device.
        auto selector = default_selector_v;
      #endif
      
        // Create vector objects with "vector_size" to store the input and output data.
        IntVector a, b, sum_sequential, sum_parallel;
        a.resize(vector_size);
        b.resize(vector_size);
        sum_sequential.resize(vector_size);
        sum_parallel.resize(vector_size);
      
        // Initialize input vectors with values from 0 to vector_size - 1
        InitializeVector(a);
        InitializeVector(b);
      
        try {
          queue q(selector, exception_handler);
      
          // Print out the device information used for the kernel code.
          std::cout << "Running on device: "
                    << q.get_device().get_info<info::device::name>() << "\n";
          std::cout << "Vector size: " << a.size() << "\n";
      
          // Vector addition in SYCL
          VectorAdd(q, a, b, sum_parallel);
        } catch (exception const &e) {
          std::cout << "An exception is caught for vector add.\n";
          std::terminate();
        }
      
        // Compute the sum of two vectors in sequential for validation.
        for (size_t i = 0; i < sum_sequential.size(); i++)
          sum_sequential.at(i) = a.at(i) + b.at(i);
      
        // Verify that the two vectors are equal.  
        for (size_t i = 0; i < sum_sequential.size(); i++) {
          if (sum_parallel.at(i) != sum_sequential.at(i)) {
            std::cout << "Vector add failed on device.\n";
            return -1;
          }
        }
      
        int indices[]{0, 1, 2, (static_cast<int>(a.size()) - 1)};
        constexpr size_t indices_size = sizeof(indices) / sizeof(int);
      
        // Print out the result of vector add.
        for (int i = 0; i < indices_size; i++) {
          int j = indices[i];
          if (i == indices_size - 1) std::cout << "...\n";
          std::cout << "[" << j << "]: " << a[j] << " + " << b[j] << " = "
                    << sum_parallel[j] << "\n";
        }
      
        a.clear();
        b.clear();
        sum_sequential.clear();
        sum_parallel.clear();
      
        std::cout << "Vector add successfully completed on device.\n";
        return 0;
      }
      
    • vector-add-usm.cpp

      //==============================================================
      // Vector Add is the equivalent of a Hello, World! sample for data parallel
      // programs. Building and running the sample verifies that your development
      // environment is setup correctly and demonstrates the use of the core features
      // of SYCL. This sample runs on both CPU and GPU (or FPGA). When run, it
      // computes on both the CPU and offload device, then compares results. If the
      // code executes on both CPU and offload device, the device name and a success
      // message are displayed. And, your development environment is setup correctly!
      //
      // For comprehensive instructions regarding SYCL Programming, go to
      // https://software.intel.com/en-us/oneapi-programming-guide and search based on
      // relevant terms noted in the comments.
      //
      // SYCL material used in the code sample:
      // •	A one dimensional array of data shared between CPU and offload device.
      // •	A device queue and kernel.
      //==============================================================
      // Copyright © Intel Corporation
      //
      // SPDX-License-Identifier: MIT
      // =============================================================
      #include <sycl/sycl.hpp>
      #include <array>
      #include <iostream>
      #include <string>
      #if FPGA_HARDWARE || FPGA_EMULATOR || FPGA_SIMULATOR
      #include <sycl/ext/intel/fpga_extensions.hpp>
      #endif
      
      using namespace sycl;
      
      // Array size for this example.
      size_t array_size = 10000;
      
      // Create an exception handler for asynchronous SYCL exceptions
      static auto exception_handler = [](sycl::exception_list e_list) {
        for (std::exception_ptr const &e : e_list) {
          try {
            std::rethrow_exception(e);
          }
          catch (std::exception const &e) {
      #if _DEBUG
            std::cout << "Failure" << std::endl;
      #endif
            std::terminate();
          }
        }
      };
      
      //************************************
      // Vector add in SYCL on device: returns sum in 4th parameter "sum".
      //************************************
      void VectorAdd(queue &q, const int *a, const int *b, int *sum, size_t size) {
        // Create the range object for the arrays.
        range<1> num_items{size};
      
        // Use parallel_for to run vector addition in parallel on device. This
        // executes the kernel.
        //    1st parameter is the number of work items.
        //    2nd parameter is the kernel, a lambda that specifies what to do per
        //    work item. the parameter of the lambda is the work item id.
        // SYCL supports unnamed lambda kernel by default.
        auto e = q.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });
      
        // q.parallel_for() is an asynchronous call. SYCL runtime enqueues and runs
        // the kernel asynchronously. Wait for the asynchronous call to complete.
        e.wait();
      }
      
      //************************************
      // Initialize the array from 0 to array_size - 1
      //************************************
      void InitializeArray(int *a, size_t size) {
        for (size_t i = 0; i < size; i++) a[i] = i;
      }
      
      //************************************
      // Demonstrate vector add both in sequential on CPU and in parallel on device.
      //************************************
      int main(int argc, char* argv[]) {
        // Change array_size if it was passed as argument
        if (argc > 1) array_size = std::stoi(argv[1]);
        // Create device selector for the device of your interest.
      #if FPGA_EMULATOR
        // Intel extension: FPGA emulator selector on systems without FPGA card.
        auto selector = sycl::ext::intel::fpga_emulator_selector_v;
      #elif FPGA_SIMULATOR
        // Intel extension: FPGA simulator selector on systems without FPGA card.
        auto selector = sycl::ext::intel::fpga_simulator_selector_v;
      #elif FPGA_HARDWARE
        // Intel extension: FPGA selector on systems with FPGA card.
        auto selector = sycl::ext::intel::fpga_selector_v;
      #else
        // The default device selector will select the most performant device.
        auto selector = default_selector_v;
      #endif
      
        try {
          queue q(selector, exception_handler);
      
          // Print out the device information used for the kernel code.
          std::cout << "Running on device: "
                    << q.get_device().get_info<info::device::name>() << "\n";
          std::cout << "Vector size: " << array_size << "\n";
      
          // Create arrays with "array_size" to store input and output data. Allocate
          // unified shared memory so that both CPU and device can access them.
          int *a = malloc_shared<int>(array_size, q);
          int *b = malloc_shared<int>(array_size, q);
          int *sum_sequential = malloc_shared<int>(array_size, q);
          int *sum_parallel = malloc_shared<int>(array_size, q);
      
          if ((a == nullptr) || (b == nullptr) || (sum_sequential == nullptr) ||
              (sum_parallel == nullptr)) {
            if (a != nullptr) free(a, q);
            if (b != nullptr) free(b, q);
            if (sum_sequential != nullptr) free(sum_sequential, q);
            if (sum_parallel != nullptr) free(sum_parallel, q);
      
            std::cout << "Shared memory allocation failure.\n";
            return -1;
          }
      
          // Initialize input arrays with values from 0 to array_size - 1
          InitializeArray(a, array_size);
          InitializeArray(b, array_size);
      
          // Compute the sum of two arrays in sequential for validation.
          for (size_t i = 0; i < array_size; i++) sum_sequential[i] = a[i] + b[i];
      
          // Vector addition in SYCL.
          VectorAdd(q, a, b, sum_parallel, array_size);
      
          // Verify that the two arrays are equal.
          for (size_t i = 0; i < array_size; i++) {
            if (sum_parallel[i] != sum_sequential[i]) {
              std::cout << "Vector add failed on device.\n";
              return -1;
            }
          }
      
          int indices[]{0, 1, 2, (static_cast<int>(array_size) - 1)};
          constexpr size_t indices_size = sizeof(indices) / sizeof(int);
      
          // Print out the result of vector add.
          for (int i = 0; i < indices_size; i++) {
            int j = indices[i];
            if (i == indices_size - 1) std::cout << "...\n";
            std::cout << "[" << j << "]: " << j << " + " << j << " = "
                      << sum_sequential[j] << "\n";
          }
      
          free(a, q);
          free(b, q);
          free(sum_sequential, q);
          free(sum_parallel, q);
        } catch (exception const &e) {
          std::cout << "An exception is caught while adding two vectors.\n";
          std::terminate();
        }
      
        std::cout << "Vector add successfully completed on device.\n";
        return 0;
      }
      
    • Build and run

      1. Configure the project to use the buffer-based implementation (first command sequence) or the unified shared memory (USM) based implementation (second sequence):

        cd vector-add
        mkdir build && cd build
        cmake ..
        

        (screenshot)

        cd vector-add
        mkdir build && cd build
        cmake .. -DUSM=1
        

        (screenshot)

      2. Build the project

        make cpu-gpu
        

        (screenshot)

      3. Run the programs

        ./vector-add-buffers
        ./vector-add-usm
        

        (screenshot)
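As the `main` functions above show, both samples read the vector size from the first command-line argument (and vector-add-buffers additionally reads a kernel repetition count from the second), so larger problem sizes can be tried without recompiling. A sketch, guarded so it is a no-op when the binaries have not been built in the current directory:

```shell
# Run each sample with a one-million-element vector, if it has been built.
# vector-add-buffers also accepts a second argument: the kernel repetition count.
for bin in ./vector-add-buffers ./vector-add-usm; do
  if [ -x "$bin" ]; then
    "$bin" 1000000
  else
    echo "$bin not built in this directory; skipping"
  fi
done
```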

      4. Clean up the build

        make clean
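As an alternative to the CMake flow above, a single source file can also be compiled directly with icpx, oneAPI's DPC++/C++ compiler; the -fsycl flag enables SYCL offload compilation. A sketch, guarded in case the oneAPI environment has not been sourced in the current shell:

```shell
# Compile the buffers variant directly, without CMake.
# Requires the oneAPI environment so that icpx is on PATH.
if command -v icpx >/dev/null 2>&1; then
  icpx -fsycl vector-add-buffers.cpp -o vector-add-buffers
else
  echo "icpx not found on PATH; run '. /opt/intel/oneapi/setvars.sh' first"
fi
```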
        