Image Convolution in C++/SYCL with oneAPI

玄学小明

Tags: oneapi c++ algorithm

Article link: https://blog.csdn.net/d2624039931/article/details/134719541

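1. Problem Description

Description

Implement an image convolution in C++/SYCL on oneAPI. The input is an image matrix and a convolution kernel matrix; the output is the convolved image.

Analysis

Image convolution is a common image-processing operation used to apply filters and feature detectors. A small matrix, called the convolution kernel or filter, is placed over each pixel of the image; each kernel element is multiplied by the pixel value beneath it, and the sum of these products becomes the output value at that position. Depending on the kernel, this smooths the image, sharpens it, or detects its edges.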
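
To make the multiply-and-sum step concrete, here is a minimal plain C++ sketch (no SYCL) that computes a single output value. The function name `applyKernelAt`, the row-major `std::vector<float>` layout, and the sample values in the note below are assumptions made for illustration; they are not part of the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: computes one output value by laying the kernel over the
// image window whose top-left corner is (row, col). Both matrices are row-major.
float applyKernelAt(const std::vector<float>& image, std::size_t imageRows, std::size_t imageCols,
                    const std::vector<float>& kernel, std::size_t kernelRows, std::size_t kernelCols,
                    std::size_t row, std::size_t col) {
    // The sizes must match the stated dimensions, and the window must lie inside the image.
    assert(image.size() == imageRows * imageCols);
    assert(kernel.size() == kernelRows * kernelCols);
    assert(row + kernelRows <= imageRows && col + kernelCols <= imageCols);

    float sum = 0.0f;
    for (std::size_t i = 0; i < kernelRows; ++i) {
        for (std::size_t j = 0; j < kernelCols; ++j) {
            // Multiply the kernel element by the pixel beneath it and accumulate.
            sum += image[(row + i) * imageCols + (col + j)] * kernel[i * kernelCols + j];
        }
    }
    return sum;
}
```

For a 3*3 image holding the values 1 to 9 and the kernel {1, 0, -1; 1, 0, -1; 1, 0, -1}, `applyKernelAt(image, 3, 3, kernel, 3, 3, 0, 0)` evaluates to (1 + 4 + 7) - (3 + 6 + 9) = -6, so this kernel responds to the left-to-right increase in pixel values.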

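In heterogeneous programming, image convolution can be accelerated with parallel computation. The image is split into small blocks that the GPU processes in parallel; a suitable block size and thread organization let the computation make good use of the GPU's parallelism.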
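
As a rough illustration of the block decomposition, the sketch below computes how many tiles cover one dimension of the output and what padded size a tiled launch would need. A SYCL `nd_range` requires the global size to be a multiple of the work-group size in each dimension, which is why rounding up matters. The tile size of 16 and the helper names are assumptions; the program in this article uses a plain `range` and leaves the grouping to the runtime.

```cpp
#include <cstddef>

// Hypothetical helper: number of tiles of width `tile` needed to cover `extent` elements.
constexpr std::size_t tileCount(std::size_t extent, std::size_t tile) {
    return (extent + tile - 1) / tile;  // ceiling division
}

// Hypothetical helper: smallest multiple of `tile` that is >= `extent`.
constexpr std::size_t paddedExtent(std::size_t extent, std::size_t tile) {
    return tileCount(extent, tile) * tile;
}

// A 96 x 96 output splits evenly into 6 x 6 tiles of 16 x 16 work-items.
static_assert(tileCount(96, 16) == 6, "96 is a multiple of 16");
static_assert(paddedExtent(96, 16) == 96, "no padding needed");

// A 100 x 100 output would need 7 tiles per dimension and a padded size of 112;
// work-items whose index is >= 100 would then have to return without writing.
static_assert(tileCount(100, 16) == 7, "partial last tile");
static_assert(paddedExtent(100, 16) == 112, "rounded up to a multiple of 16");
```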

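GPU-based image convolution relies on parallel processing and element-wise multiply-accumulate: the image and kernel data are distributed across work-groups (thread blocks) and work-items (threads), so that many output pixels are computed at the same time.

2. Approach

Let the input matrix have dimensions M1*N1 and the kernel M2*N2. The output matrix then has dimensions M3*N3, where M3 = M1 - M2 + 1 and N3 = N1 - N2 + 1 (no padding, stride 1).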
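
The size formula can be checked at compile time. The following is a small sketch; the helper name `validOutputExtent` and the choice to return 0 when the kernel does not fit are assumptions for illustration.

```cpp
#include <cstddef>

// Hypothetical helper: output extent along one dimension for a convolution
// with no padding and stride 1. Returns 0 when the kernel does not fit.
constexpr std::size_t validOutputExtent(std::size_t inputExtent, std::size_t kernelExtent) {
    return (kernelExtent == 0 || kernelExtent > inputExtent)
               ? 0
               : inputExtent - kernelExtent + 1;
}

// Dimensions used in this article: 100 x 100 image, 5 x 5 kernel.
static_assert(validOutputExtent(100, 5) == 96, "M3 = 100 - 5 + 1");
// A kernel as large as the image has exactly one window position.
static_assert(validOutputExtent(5, 5) == 1, "single output value");
```

Deriving M3 and N3 this way, instead of hard-coding them, keeps the output size correct if the image or kernel dimensions change.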

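Launch M3*N3 work-items (threads) in parallel. Work-item (i, j) handles the region of the input image whose top-left corner is at (i, j) and whose size equals that of the kernel. It multiplies each element of the region by the kernel element at the same position, sums the products, and stores the sum at position (i, j) of the result matrix. The kernel is not flipped, so strictly speaking this is a cross-correlation, which image-processing code commonly calls convolution.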
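
The per-work-item computation described above can be written as a sequential CPU reference, which is handy for checking the GPU output. This is a sketch under assumptions: the name `convolveReference` and the row-major `std::vector<float>` layout are illustrative and do not appear in the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: convolution without padding and without kernel flipping.
// image is M1 x N1, kernel is M2 x N2, both row-major; returns an M3 x N3 matrix.
std::vector<float> convolveReference(const std::vector<float>& image, std::size_t M1, std::size_t N1,
                                     const std::vector<float>& kernel, std::size_t M2, std::size_t N2) {
    assert(image.size() == M1 * N1 && kernel.size() == M2 * N2);
    assert(M2 >= 1 && N2 >= 1 && M2 <= M1 && N2 <= N1);

    const std::size_t M3 = M1 - M2 + 1;
    const std::size_t N3 = N1 - N2 + 1;
    std::vector<float> result(M3 * N3);

    // Each (i, j) iteration is independent of the others; on the GPU it becomes one work-item.
    for (std::size_t i = 0; i < M3; ++i) {
        for (std::size_t j = 0; j < N3; ++j) {
            float sum = 0.0f;
            for (std::size_t ki = 0; ki < M2; ++ki) {
                for (std::size_t kj = 0; kj < N2; ++kj) {
                    sum += image[(i + ki) * N1 + (j + kj)] * kernel[ki * N2 + kj];
                }
            }
            result[i * N3 + j] = sum;
        }
    }
    return result;
}
```

For a 3*3 image holding the values 1 to 9 and the 2*2 kernel {1, 0; 0, 1}, the result is the 2*2 matrix {6, 8; 12, 14}. When comparing against device output, a small tolerance is safer than exact equality, because the device may evaluate the floating-point sum slightly differently (for example with fused multiply-add).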

3. Code

#include <sycl/sycl.hpp>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

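// Matrix dimensions
constexpr size_t M1 = 100;          // input image rows
constexpr size_t N1 = 100;          // input image columns
constexpr size_t M2 = 5;            // kernel rows
constexpr size_t N2 = 5;            // kernel columns
constexpr size_t M3 = M1 - M2 + 1;  // output rows
constexpr size_t N3 = N1 - N2 + 1;  // output columns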

using namespace std;
using namespace sycl;

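// Helper function to read a rows x cols matrix from a text file, one matrix row per line
void initializeMatrix(float* matrix, size_t rows, size_t cols, ifstream& infile) {
    string line;
    size_t index = 0;
    for (size_t i = 0; i < rows; i++) {
        // Read the file line by line
        getline(infile, line);
        istringstream iss(line);
        double value;
        // Stop at the end of the array even if a line holds extra values
        while (index < rows * cols && iss >> value) {
            matrix[index++] = static_cast<float>(value);
        }
    }
}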

int main() {
    // Allocate host memory for matrices
    float *matrix = new float[M1 * N1];
    float *kernel = new float[M2 * N2];
    float *result = new float[M3 * N3];

    string filename = "problem-3.txt";

    ifstream infile(filename);
    if (!infile) {
        cerr << "Cannot open " << filename << "\n";
        return 1;
    }
    string line;
    getline(infile, line);  // skip the line that precedes the image matrix
    initializeMatrix(matrix, M1, N1, infile);
    getline(infile, line);  // skip the two lines between the image and the kernel
    getline(infile, line);
    initializeMatrix(kernel, M2, N2, infile);

    sycl::queue q;

    // Wrap the host arrays in 2D buffers; the runtime moves the data to and from the device as needed
    sycl::buffer<float, 2> bufferA(matrix, sycl::range<2>{M1, N1});
    sycl::buffer<float, 2> bufferB(kernel, sycl::range<2>{M2, N2});
    sycl::buffer<float, 2> bufferResult(result, sycl::range<2>{M3, N3});

    // Submit a SYCL command group for the convolution
    q.submit([&](sycl::handler &h) {
        // Accessors to the image, the kernel, and the result
        auto accessorA = bufferA.get_access<sycl::access::mode::read>(h);
        auto accessorB = bufferB.get_access<sycl::access::mode::read>(h);
        auto accessorResult = bufferResult.get_access<sycl::access::mode::write>(h);

        // One work-item per output element, arranged in a 2D grid
        sycl::range<2> globalRange{M3, N3};

        // Execute the kernel
        h.parallel_for<class ImageConvolution>(globalRange, [=](sycl::id<2> idx) {
            // Sum of products over the window whose top-left corner is idx
            float sum = 0.0f;
            for (size_t i = 0; i < M2; i++) {
                for (size_t j = 0; j < N2; j++) {
                    sum += accessorA[idx[0] + i][idx[1] + j] * accessorB[i][j];
                }
            }
            accessorResult[idx] = sum;
        });
    }).wait(); // Wait for the kernel to finish

    // Make the results available on the host through a host accessor
    sycl::host_accessor resultAccessor(bufferResult, sycl::read_only);

    std::ofstream outputFile("problem-3-result.txt");
    outputFile << std::fixed << std::setprecision(2);
    for (size_t i = 0; i < M3; ++i) {
        for (size_t j = 0; j < N3; ++j) {
            // Read through the accessor; the raw host pointer should not be
            // accessed directly while the buffer that wraps it is alive
            outputFile << resultAccessor[i][j] << " ";
        }
        outputFile << "\n";
    }
    outputFile.close();
    return 0;
}