SIMD学习笔记2:高斯卷积计算优化

https://github.com/gredx/simd-parallel-conv
https://zhuanlan.zhihu.com/p/419806079
https://www.cnblogs.com/Imageshop/p/9069650.html
https://zhuanlan.zhihu.com/p/308004749
https://zhuanlan.zhihu.com/p/83694328

SSE图像算法优化系列十八:三次卷积插值的进一步SSE优化。
基于CPU SIMD和winograd的卷积计算加速技术_
如何学习SIMD(单指令多数据流)并应用?
SSE图像算法优化系列九:灵活运用SIMD指令16倍提升Sobel边缘检测的速度(4000*3000的24位图像时间由480ms降低到30ms)。
SSE图像算法优化系列二:高斯模糊算法的全面优化过程分享(一)。
数字图像处理之高斯滤波加速优化

Opencv findcontours函数原理,以及python numpy实现
AVX256加速矩阵乘法

microsoft/ DirectXMath github SIMD

我要实现循环卷积sse,暂时没有找到比较好的写法:

优化前

void gaussianConvolution(Matrix<double>& srcIamge, Matrix<double>& desImage, Matrix<double>& kernel)
{
    int kernelSize = kernel.numCols();

    //卷积填充
    int startOffset = -1 * int(kernelSize / 2);

    for (int i = 0; i < srcIamge.numRows(); i++)
    {
        for (int j = 0; j < srcIamge.numCols(); j++)
        {
            double blurredPixel = 0.0;
            for (int kx = 0; kx < kernelSize; kx++)
            {
                for (int ky = 0; ky < kernelSize; ky++)
                {
                    int x = i + startOffset + kx, y = j + startOffset + ky;
                    GetPixelWrapAround(srcIamge, x, y);
                    blurredPixel += kernel.get(kx, ky)* srcIamge.get(x, y);
                }
            }
            desImage.set(i, j, blurredPixel);
        }
    }
}

void  GetPixelWrapAround(const Matrix<double>& image, int& x, int& y)
{
    
    int w = image.numRows();
    int h = image.numCols();

    x = (x % w + w) % w;
    y = (y % h + h) % h;
}

sse优化后:

void greenNoise::gaussianConvolutionSSE(Matrix<double>& srcImage, Matrix<double>& desImage, Matrix<double>& kernel)
{
    int kernelSize = kernel.numCols();
    int width = srcImage.numRows();
    int height = srcImage.numCols();
    int startOffset = -1 * static_cast<int>(kernelSize / 2);
    double temp[4];
    for (int i = 0; i < width; i++)
    {
        for (int j = 0; j < height; j++)
        {
            double blurredPixel = 0.0;

            for (int kx = 0; kx < kernelSize; kx++)
            {
                int x = (i + startOffset + kx + width) % width;
                
                for (int ky = 0; ky < kernelSize-3; ky+=4)
                {
                    //int y = (j + startOffset + ky + height) % height;
                    int y0 = j + startOffset + ky + height;
                    int y1 = (y0 + 1)% height;
                    int y2 = (y0 + 2) % height;
                    int y3 = (y0 + 3) % height;
                    y0 = y0 % height;

                    __m256d srcValues = _mm256_set_pd(srcImage.get(x, y0), srcImage.get(x, y1), srcImage.get(x, y2), srcImage.get(x, y3));
                    __m256d kernelValues = _mm256_set_pd(kernel.get(kx, ky), kernel.get(kx, ky+1), kernel.get(kx, ky+2), kernel.get(kx, ky+3));
                    __m256d resultVec = _mm256_mul_pd(srcValues, kernelValues);

                    _mm256_storeu_pd(temp, resultVec);


                    blurredPixel += temp[0]+ temp[1] + temp[2] + temp[3] ;
                }

                // Process the remaining elements (if any) without SSE
                for (int ky = kernelSize - kernelSize % 4; ky < kernelSize; ++ky)
                {
                    int y = (j + startOffset + ky + height) % height;
                    blurredPixel += kernel.get(kx, ky) * srcImage.get(x, y);
                }

            }

            desImage.set(i, j, blurredPixel);
        }
    }
}

加入多线程:

void greenNoise::parallelGaussianConvolutionSSE(Matrix<double>& srcImage, Matrix<double>& desImage, Matrix<double>& kernel)
{
    int kernelSize = kernel.numCols();
    int width = srcImage.numRows();
    int height = srcImage.numCols();
    int startOffset = -1 * static_cast<int>(kernelSize / 2);

    std::vector<std::thread> threads;
    //std::mutex mutex; // Mutex to control access to the result matrix

    const int numThreads = std::thread::hardware_concurrency(); // Number of available threads
    const int rowsPerThread = (width + numThreads - 1) / numThreads; // Rows per thread
    for (int t = 0; t < numThreads; ++t)
    {
        threads.emplace_back([&srcImage, &desImage, &kernel, t, rowsPerThread,
            kernelSize, width, height, startOffset]()
            {
                for (int i = t* rowsPerThread; i < std::min(width, (t +1)* rowsPerThread); i++)
                {
                    for (int j = 0; j < height; j++)
                    {
                        double temp[4];
                        double blurredPixel = 0.0;

                        for (int kx = 0; kx < kernelSize; kx++)
                        {
                            int x = (i + startOffset + kx + width) % width;

                            for (int ky = 0; ky < kernelSize - 3; ky += 4)
                            {
                                //int y = (j + startOffset + ky + height) % height;
                                int y0 = j + startOffset + ky + height;
                                int y1 = (y0 + 1) % height;
                                int y2 = (y0 + 2) % height;
                                int y3 = (y0 + 3) % height;
                                y0 = y0 % height;

                                __m256d srcValues = _mm256_set_pd(srcImage.get(x, y0), srcImage.get(x, y1), srcImage.get(x, y2), srcImage.get(x, y3));
                                __m256d kernelValues = _mm256_set_pd(kernel.get(kx, ky), kernel.get(kx, ky + 1), kernel.get(kx, ky + 2), kernel.get(kx, ky + 3));
                                __m256d resultVec = _mm256_mul_pd(srcValues, kernelValues);

                                _mm256_storeu_pd(temp, resultVec);


                                blurredPixel += temp[0] + temp[1] + temp[2] + temp[3];
                            }

                            // Process the remaining elements (if any) without SSE
                            for (int ky = kernelSize - kernelSize % 4; ky < kernelSize; ++ky)
                            {
                                int y = (j + startOffset + ky + height) % height;
                                blurredPixel += kernel.get(kx, ky) * srcImage.get(x, y);
                            }

                        }

                        desImage.set(i, j, blurredPixel);
                    }
                }
            }
        );
    }

    for (auto& thread : threads)
    {
        thread.join();
    }
   
}
好的,我来通过实例说明这些函数的用法。 1. tf()函数:tf()函数用于将分子和分母多项式的系数转化为传递函数形式。例如,假设有一个传递函数H(s) = (s+1)/(s^2+2s+1),可以使用tf()函数将其转化为传递函数形式: ``` num = [1 1]; den = [1 2 1]; H = tf(num, den) ``` 输出结果为:H = s + 1 ------------- s^2 + 2 s + 1 2. conv()函数:conv()函数用于计算两个多项式的卷积。例如,假设有两个多项式f(x) = 2x^2 + 3x + 1和g(x) = x + 1,可以使用conv()函数计算它们的卷积: ``` f = [2 3 1]; g = [1 1]; conv(f, g) ``` 输出结果为:ans = 2 5 4 1 3. zpk()函数:zpk()函数用于将传递函数的零点和极点转化为传递函数形式。例如,假设有一个传递函数H(s) = (s+1)/(s+2)(s+3),可以使用zpk()函数将其转化为传递函数形式: ``` z = [-1]; p = [-2 -3]; K = 1; H = zpk(z, p, K) ``` 输出结果为:H = s + 1 ---------------------- (s+2) * (s+3) 4. series()函数:series()函数用于计算两个传递函数的级联组合。例如,假设有两个传递函数H1(s) = 1/(s+1)和H2(s) = 1/(s+2),可以使用series()函数计算它们的级联组合: ``` num1 = 1; den1 = [1 1]; num2 = 1; den2 = [1 2]; H1 = tf(num1, den1); H2 = tf(num2, den2); H = series(H1, H2) ``` 输出结果为:H = 1 ------------------ s^2 + 3 s + 2 5. parallel()函数:parallel()函数用于计算两个传递函数的并联组合。例如,假设有两个传递函数H1(s) = 1/(s+1)和H2(s) = 1/(s+2),可以使用parallel()函数计算它们的并联组合: ``` H = parallel(H1, H2) ``` 输出结果为:H = s + 2 ------------------- (s+1) * (s+2) + 1 6. feedback()函数:feedback()函数用于计算闭环系统的传递函数。例如,假设有一个开环传递函数G(s) = 1/(s+1),可以使用feedback()函数计算其闭环传递函数: ``` num = 1; den = [1 1]; G = tf(num, den); H = 1; T = feedback(G, H) ``` 输出结果为:T = 1 ------------- s^2 + 2 s + 2 7. cloop()函数:cloop()函数用于计算闭环系统的传递函数,并且可以指定反馈方式。例如,假设有一个开环传递函数G(s) = 1/(s+1),可以使用cloop()函数计算其闭环传递函数,并指定反馈方式为正反馈: ``` T = cloop(G, 1, 1) ``` 输出结果为:T = 1 ------------- s^2 + 2 s + 2
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值