Sobel图像滤波器是一种简单的卷积滤波器,主要用于边缘检测算法。
样例结果
输入图像
输出图像
算法
一种做图像边缘检测的技术是,找出图像的梯度。大梯度值的区域对应图像色彩或密度的剧变区域。典型地,这些区域是边缘。
如果你对于幅图像卷积两个Sobel算子,你会得到两个输出:
> X方向的梯度(dX)
> Y方向的梯度(dY)
通常梯度被合并,给出一个总的梯度图。
Sobel算子:dX与dY:
-1 0 1
-2 0 2
-1 0 1
1 2 1
0 0 0
-1 -2 -1
实现
填充
值得注意的是,我们在此并不考虑填充。在两个维度,输出图像相比小两个像素。因为,每个输出需要围绕它的像素,计算边沿像素的输出是不可能的。在此样例中,我们简单地把边缘输出像素用其初始值。
有时需要输出信号的大小与输入信号的大小一样,这种种情况下,"填充"必须应用到输入中,考虑到滤波应用的实际本性上是减少大小的。填充策略会有不同,但是对于图像,通常的选择是在所有边上重复边缘值(例如,最外层的像素集),或者在一些边上(在某些情况下)。
实现细节
简单起见,我们在一个8比特通道上实现Sobel滤波器。为了在RGB图像上做Sobel滤波,你可以独立地在每个通道上运行Sobel滤波,然后组合结果。在这个例子中,我们取了一张RGB图像,转换成8位辉度图,然后发送到GPU。
每个Sobel计算给出一个中心掩码像素的输出。中心像素的输出值是围绕像素的3x3网格上的像素值乘以Sobel掩码的和。这可以被分成三个阶段,通过为网格上的每一行独立地做乘法和累加(这一方法被使用在这个样例中)。
代码
注意:我们考虑矢量化是在本教程中使用的主要优化技术。为了显示使用适量的重要性,一个不包含矢量的Sobel滤波器的OpenCL实现被包含在sobel_no_vectors sample(sobel_no_vectors.cpp和sobel_no_vectors.cl)。在Mali-T600系列GPU上,矢量化的版本快很多。
除非另作说明,否则所有的代码片段来自sobel.cl中的OpenCL内核。
1. 选择内核大小
我们在内核中使用向量类型,因此我们实际上每个内核输出16个结果。看下面更多的向量化细节。内核使用3x3的Sobel滤波器到一个输入图像中的18x3的窗,在两个输出图像中来生成16x1结果,代表梯度的dx和dy。
/*
* Each kernel calculates 16 output pixels in the same row (hence the '* 16').
* column is in the range [0, width] in steps of 16.
* row is in the range [0, height].
*/
const int column = get_global_id(0) * 16;
const int row = get_global_id(1) * 1;
/* Offset calculates the position in the linear data for the row and the column. */
const int offset = row * width + column;
当我们在sobel.cpp中排队内核时,相应地减少工作大小:
/*
* Each instance of the kernel operates on a 16 * 1 portion of the image.
* Therefore, the global work size must be width / 16 by height / 1 work items.
*/
size_t globalWorksize[2] = {width / 16, height / 1};
2. 加载输入数据
Mali-T600系列GPU具有128位矢量寄存器,可以在矢量类型上做运算。因此,我们使用OpenCL矢量,来使得更加有效地使用硬件,从而有更高的性能。
这里我们从一行数据中做矢量加载:
/*
* First row of input.
* In a scalar Sobel calculation you would load 1 value for leftLoad, middleLoad and rightLoad.
* In the vector case we load 16 values for each.
* leftLoad, middleLoad and rightLoad load 16-bytes of data from the first row.
* The data they load overlaps. e.g. for the first column and row, leftLoad is 0->15, middleLoad is 1->16 and rightLoad is 2->17.
* So we're actually loading 18-bytes of data from the first row.
*/
uchar16 leftLoad = vload16(0, inputImage + (offset + 0));
uchar16 middleLoad = vload16(0, inputImage + (offset + 1));
uchar16 rightLoad = vload16(0, inputImage + (offset + 2));
3. 转换数据
在Mali-T600系列GPU上,扩展和收缩数据类型是一个很容易地操作。在此,我们将8比特每像素数据转换成16比特每像素。
/*
* Convert the data from unsigned chars to shorts (8-bit unsigned to 16-bit signed).
* The calculations can overflow 8-bits so we require larger intermediate storage.
* Additionally, the values can become negative so we need a signed type.
*/
short16 leftData = convert_short16(leftLoad);
short16 middleData = convert_short16(middleLoad);
short16 rightData = convert_short16(rightLoad);
4. 做计算
然后,我们在计算16个像素的结果。在Mali-T600系列GPU上,每个向量计算可以作为一个单一操作完成。
/*
* Calculate the results for the first row.
* Looking at the Sobel masks above for the first line of input,
* the dX calculation is the sum of 1 * leftData, 0 * middleData, and -1 * rightData.
* The dY calculation is the sum of 1 * leftData, 2 * middleData, and 1 * rightData.
* This is what is being calculated below, except we have removed the
* unnecessary calculations (multiplications by 1 or 0) and we are calculating 16 values at once.
* This pattern repeats for the other 2 rows of data.
*/
short16 dx = rightData - leftData;
short16 dy = rightData + leftData + middleData * (short)2;
5. 排列结果
最后,我们收缩和存储数据。我们使用一个向量存储一次写16个结果:
/*
* Store the results.
* The range of outputs from our Sobel calculations is [-1020, 1020].
* In order to output this as an 8-bit signed char we must divide it by 8 (or shift right 3 times).
* This gives the range [-128, 128]. Depending on what type of output you require,
* (signed/unsigned, seperate/combined gradients) it is possible to do more of the calculations on the GPU using OpenCL.
* In this sample we're assuming that the application requires signed uncombined gradient outputs.
*/
vstore16(convert_char16(dx >> 3), 0, outputImageDX + offset + width + 1);
vstore16(convert_char16(dy >> 3), 0, outputImageDY + offset + width + 1);
因为数据作为两个分开的梯度图返回,我们在CPU上将它们先组合,然后将它们作为一个位图写入。
运行样例
参见样例1。
附录1:内核源码
/*
* This confidential and proprietary software may be used only as
* authorised by a licensing agreement from ARM Limited
* (C) COPYRIGHT 2013 ARM Limited
* ALL RIGHTS RESERVED
* The entire notice above must be reproduced on all authorised
* copies and copies may only be made to the extent permitted
* by a licensing agreement from ARM Limited.
*/
/**
* \brief Sobel filter kernel function.
* \param[in] inputImage Input image data in row-major format.
* \param[in] width Width of the image passed in as inputImage.
* \param[out] outputImageDX Output image of the calculated gradient in the X direction.
* \param[out] outputImageDY Output image of the calculated gradient in the Y direction.
*/
__kernel void sobel(__global const uchar* restrict inputImage,
const int width,
__global char* restrict outputImageDX,
__global char* restrict outputImageDY)
{
/* [Kernel size] */
/*
* Each kernel calculates 16 output pixels in the same row (hence the '* 16').
* column is in the range [0, width] in steps of 16.
* row is in the range [0, height].
*/
const int column = get_global_id(0) * 16;
const int row = get_global_id(1) * 1;
/* Offset calculates the position in the linear data for the row and the column. */
const int offset = row * width + column;
/* [Kernel size] */
/* [Load row] */
/*
* First row of input.
* In a scalar Sobel calculation you would load 1 value for leftLoad, middleLoad and rightLoad.
* In the vector case we load 16 values for each.
* leftLoad, middleLoad and rightLoad load 16-bytes of data from the first row.
* The data they load overlaps. e.g. for the first column and row, leftLoad is 0->15, middleLoad is 1->16 and rightLoad is 2->17.
* So we're actually loading 18-bytes of data from the first row.
*/
uchar16 leftLoad = vload16(0, inputImage + (offset + 0));
uchar16 middleLoad = vload16(0, inputImage + (offset + 1));
uchar16 rightLoad = vload16(0, inputImage + (offset + 2));
/* [Load row] */
/* [Convert data] */
/*
* Convert the data from unsigned chars to shorts (8-bit unsigned to 16-bit signed).
* The calculations can overflow 8-bits so we require larger intermediate storage.
* Additionally, the values can become negative so we need a signed type.
*/
short16 leftData = convert_short16(leftLoad);
short16 middleData = convert_short16(middleLoad);
short16 rightData = convert_short16(rightLoad);
/* [Convert data] */
/* [Calculation] */
/*
* Calculate the results for the first row.
* Looking at the Sobel masks above for the first line of input,
* the dX calculation is the sum of 1 * leftData, 0 * middleData, and -1 * rightData.
* The dY calculation is the sum of 1 * leftData, 2 * middleData, and 1 * rightData.
* This is what is being calculated below, except we have removed the
* unnecessary calculations (multiplications by 1 or 0) and we are calculating 16 values at once.
* This pattern repeats for the other 2 rows of data.
*/
short16 dx = rightData - leftData;
short16 dy = rightData + leftData + middleData * (short)2;
/* [Calculation] */
/*
* Second row of input.
* By adding the 'width * 1' to the offset we get the next row of data at the same column position.
* middleData is not loaded because it is not used in any of the calculations.
*/
leftLoad = vload16(0, inputImage + (offset + width * 1 + 0));
rightLoad = vload16(0, inputImage + (offset + width * 1 + 2));
leftData = convert_short16(leftLoad);
rightData = convert_short16(rightLoad);
/*
* Calculate the results for the second row.
* The dX calculation is the sum of -2 * leftData, 0 * middleData, and -2 * rightData.
* There is no dY calculation to do: sum of 0 * leftData, 0 * middleData, and 0 * rightData.
*/
dx += (rightData - leftData) * (short)2;
/* Third row of input. */
leftLoad = vload16(0, inputImage + (offset + width * 2 + 0));
middleLoad = vload16(0, inputImage + (offset + width * 2 + 1));
rightLoad = vload16(0, inputImage + (offset + width * 2 + 2));
leftData = convert_short16(leftLoad);
middleData = convert_short16(middleLoad);
rightData = convert_short16(rightLoad);
/*
* Calculate the results for the third row.
* The dX calculation is the sum of -1 * leftData, 0 * middleData, and -1 * rightData.
* The dY calculation is the sum of -1 * leftData, -2 * middleData, and -1 * rightData.
*/
dx += rightData - leftData;
dy -= rightData + leftData + middleData * (short)2;
/* [Store] */
/*
* Store the results.
* The range of outputs from our Sobel calculations is [-1020, 1020].
* In order to output this as an 8-bit signed char we must divide it by 8 (or shift right 3 times).
* This gives the range [-128, 128]. Depending on what type of output you require,
* (signed/unsigned, seperate/combined gradients) it is possible to do more of the calculations on the GPU using OpenCL.
* In this sample we're assuming that the application requires signed uncombined gradient outputs.
*/
vstore16(convert_char16(dx >> 3), 0, outputImageDX + offset + width + 1);
vstore16(convert_char16(dy >> 3), 0, outputImageDY + offset + width + 1);
/* [Store] */
}
附录2:宿主机源码
/*
* This confidential and proprietary software may be used only as
* authorised by a licensing agreement from ARM Limited
* (C) COPYRIGHT 2013 ARM Limited
* ALL RIGHTS RESERVED
* The entire notice above must be reproduced on all authorised
* copies and copies may only be made to the extent permitted
* by a licensing agreement from ARM Limited.
*/
#include "common.h"
#include "image.h"
#include <CL/cl.h>
#include <iostream>
#include <fstream>
#include <sstream>
#include <cstddef>
#include <cmath>
using namespace std;
/**
* \brief Simple Sobel filter OpenCL sample.
* \details A sample which loads a bitmap and then passes it to the GPU.
* An OpenCL kernel which does Sobel filtering is then run on the data.
* The gradients of the image in x and y directions are returned by the GPU
* and are combined on the CPU to form the filtered data.
* The input image is loaded from assets/input.bmp. The output gradients in X and Y,
* as well as the combined gradient image are stored in output-dX.bmp, output-dY.bmp
* and output.bmp respectively.
* \return The exit code of the application, non-zero if a problem occurred.
*/
int main(void)
{
/*
* Name of the bitmap to load and run the sobel filter on.
* It's width must be divisible by 16.
*/
string filename = "assets/input.bmp";
cl_context context = 0;
cl_command_queue commandQueue = 0;
cl_program program = 0;
cl_device_id device = 0;
cl_kernel kernel = 0;
const unsigned int numberOfMemoryObjects = 3;
cl_mem memoryObjects[numberOfMemoryObjects] = {0, 0, 0};
cl_int errorNumber;
if (!createContext(&context))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to create an OpenCL context. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
if (!createCommandQueue(context, &commandQueue, &device))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to create the OpenCL command queue. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
if (!createProgram(context, device, "assets/sobel.cl", &program))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to create OpenCL program." << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
kernel = clCreateKernel(program, "sobel", &errorNumber);
if (!checkSuccess(errorNumber))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to create OpenCL kernel. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Load 24-bits per pixel RGB data from a bitmap. */
cl_int width;
cl_int height;
unsigned char* imageData = NULL;
if (!loadFromBitmap(filename, &width, &height, &imageData))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed loading bitmap. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* All buffers are the size of the image data. */
size_t bufferSize = width * height * sizeof(cl_uchar);
bool createMemoryObjectsSuccess = true;
/* Create one input buffer for the image data, and two output buffers for the gradients in X and Y respectively. */
memoryObjects[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
createMemoryObjectsSuccess &= checkSuccess(errorNumber);
memoryObjects[1] = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
createMemoryObjectsSuccess &= checkSuccess(errorNumber);
memoryObjects[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
createMemoryObjectsSuccess &= checkSuccess(errorNumber);
if (!createMemoryObjectsSuccess)
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed to create OpenCL buffers. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Map the input luminance memory object to a host side pointer. */
cl_uchar* luminance = (cl_uchar*)clEnqueueMapBuffer(commandQueue, memoryObjects[0], CL_TRUE, CL_MAP_WRITE, 0, bufferSize, 0, NULL, NULL, &errorNumber);
if (!checkSuccess(errorNumber))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Mapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Convert 24-bits per pixel RGB into 8-bits per pixel luminance data. */
RGBToLuminance(imageData, luminance, width, height);
delete [] imageData;
/* Unmap the memory so we can pass it to the kernel. */
if (!checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[0], luminance, 0, NULL, NULL)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Unmapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
bool setKernelArgumentsSuccess = true;
/* Setup the kernel arguments. */
setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 0, sizeof(cl_mem), &memoryObjects[0]));
setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 1, sizeof(cl_int), &width));
setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 2, sizeof(cl_mem), &memoryObjects[1]));
setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 3, sizeof(cl_mem), &memoryObjects[2]));
if (!setKernelArgumentsSuccess)
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed setting OpenCL kernel arguments. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* An event to associate with the Kernel. Allows us to retreive profiling information later. */
cl_event event = 0;
/* [Kernel size] */
/*
* Each instance of the kernel operates on a 16 * 1 portion of the image.
* Therefore, the global work size must be width / 16 by height / 1 work items.
*/
size_t globalWorksize[2] = {width / 16, height / 1};
/* [Kernel size] */
/* Enqueue the kernel */
if (!checkSuccess(clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, globalWorksize, NULL, 0, NULL, &event)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed enqueuing the kernel. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Wait for completion */
if (!checkSuccess(clFinish(commandQueue)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed waiting for kernel execution to finish. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Print the profiling information for the event. */
printProfilingInfo(event);
/* Release the event object. */
if (!checkSuccess(clReleaseEvent(event)))
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Failed releasing the event object. " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Map the arrays holding the output gradients. */
bool mapMemoryObjectsSuccess = true;
cl_char* outputDx = (cl_char*)clEnqueueMapBuffer(commandQueue, memoryObjects[1], CL_TRUE, CL_MAP_READ, 0, bufferSize, 0, NULL, NULL, &errorNumber);
mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
cl_char* outputDy = (cl_char*)clEnqueueMapBuffer(commandQueue, memoryObjects[2], CL_TRUE, CL_MAP_READ, 0, bufferSize, 0, NULL, NULL, &errorNumber);
mapMemoryObjectsSuccess &= checkSuccess(errorNumber);
if (!mapMemoryObjectsSuccess)
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Mapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/*
* In order to better visualise the data, we take the absolute
* value of the gradients and then combine them together.
* This is work which could be done in the OpenCL kernel, it all depends
* on what type of output you require.
*/
/* To visualise the data we take the absolute values of the gradients. */
unsigned char *absDX = new unsigned char[width * height];
unsigned char *absDY = new unsigned char[width * height];
for (int i = 0; i < width * height; i++)
{
absDX[i] = abs(outputDx[i]);
absDY[i] = abs(outputDy[i]);
}
/* Unmap the memory. */
bool unmapMemoryObjectsSuccess = true;
unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[1], outputDx, 0, NULL, NULL));
unmapMemoryObjectsSuccess &= checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[2], outputDy, 0, NULL, NULL));
if (!unmapMemoryObjectsSuccess)
{
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
cerr << "Unmapping memory objects failed " << __FILE__ << ":"<< __LINE__ << endl;
return 1;
}
/* Release OpenCL objects. */
cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
/* Convert the two output luminance arrays to RGB and save them out to files. */
unsigned char* rgbOut = new unsigned char[width * height * 3];
luminanceToRGB(absDX, rgbOut, width, height);
saveToBitmap("output-dX.bmp", width, height, rgbOut);
luminanceToRGB(absDY, rgbOut, width, height);
saveToBitmap("output-dY.bmp", width, height, rgbOut);
/* Calculate the total gradient of the image, convert it to RGB and store it out to a file. */
unsigned char* totalOutput = new unsigned char[width * height];
for (int index = 0; index < width * height; index++)
{
totalOutput[index] = sqrt(pow(absDX[index], 2) + pow(absDY[index], 2));
}
luminanceToRGB(totalOutput, rgbOut, width, height);
saveToBitmap("output.bmp", width, height, rgbOut);
delete [] absDX;
delete [] absDY;
delete [] rgbOut;
delete [] totalOutput;
return 0;
}