Why can the FFT accelerate convolution?

I'll assume this is being done on a conventional CPU, one core, executing one simple thread, no fancy hardware. If there is more than that going on, it can probably be accounted for with adjustments to the reasoning for a simpler system. Not much more can be said without either a specific system to discuss, or a whole textbook or research paper to cover a range of possibilities.

I wouldn't worry about power-of-two sizes. It doesn't matter. FFT algorithms with the butterfly units and all that exist for factors of 3, or any small number, not just 2. There are clever algorithms for prime-sized data series, too. I don't like quoting Wikipedia on this due to its impermanent nature, but anyway:

there are FFTs with O(N log N) complexity for all N, even for prime N

Implementations of FFTs for arbitrary N can be found in the GPL'd library FFTW.

The only trustworthy way in terms of serious engineering is to build and measure, but we certainly can get an idea from theory, to see relationships between variables. We need estimates of how many arithmetic operations are involved for each method.

Multiplication is still slower than addition on most CPUs, even if the difference has shrunk tremendously over the years, so let's just count multiplications. Accounting for additions as well would take more bookkeeping.

A straightforward convolution, actually multiplying and adding using the convolution kernel, repeating for each output pixel, needs W²·K² multiplications, where W is the number of pixels along one side of the image (assuming square for simplicity), and K is the size of the convolution kernel, as pixels along one side. It takes K² multiplications to compute one output pixel using the kernel and same-size portion of the input image. Repeat for all output pixels, which number the same as in the input image.

(Nmults)direct = W²·K²
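As a sanity check, that count can be reproduced with a short sketch. The `direct_convolve_same` helper below is hypothetical (not from the original text) and assumes a square image, a square kernel, and a "same"-size zero-padded output, so there is one result pixel per input pixel:

```python
import numpy as np

def direct_convolve_same(image, kernel):
    """Naive "same"-size convolution that counts its own multiplications."""
    W = image.shape[0]
    K = kernel.shape[0]
    flipped = kernel[::-1, ::-1]       # convolution flips the kernel
    padded = np.pad(image, K // 2)     # zero-pad so the output matches the input size
    out = np.empty((W, W))
    mults = 0
    for i in range(W):
        for j in range(W):
            # K^2 multiplications for each of the W^2 output pixels
            out[i, j] = np.sum(padded[i:i + K, j:j + K] * flipped)
            mults += K * K
    return out, mults

img = np.random.rand(32, 32)
ker = np.random.rand(5, 5)
_, mults = direct_convolve_same(img, ker)
assert mults == 32**2 * 5**2    # W^2 * K^2 = 25,600
```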

To do the job in Fourier space, we must Fourier transform the image. This is done by applying an FFT to each column separately, and then to each row. The FFT for N data points takes about 2N·log(N) multiplications; we want N to be W, the length of one column or row. All logarithms here are base two.

There are W rows and W columns, so after all the FFTs are done, we have done 2W·(2W·log(W)) multiplications. Double that, because after we multiply by the Fourier transform of the kernel, we have to inverse-transform the data to get back to a sensible image. That's 8W²·log(W). Of course, multiplying by the Fourier transform of the kernel must also be done: another W² multiplications. (Done once, not once per output pixel, row, or anything.) These are complex multiplications, so that's 4W² real multiplications.

So, unless I goofed up (and I probably did), we have

(Nmults)Fourier = 4W²·(2·log(W) + 1)
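Under these assumptions the frequency-space route is just the forward FFTs, one pointwise product, and one inverse FFT. A minimal NumPy sketch (this computes a circular convolution, so periodic boundary conditions are assumed; the reference value checks one pixel against the direct definition):

```python
import numpy as np

W = 64
img = np.random.rand(W, W)
ker = np.zeros((W, W))
ker[:5, :5] = np.random.rand(5, 5)   # small kernel zero-padded to W x W

# forward FFTs, one pointwise product in frequency space, one inverse FFT
out_fft = np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(ker)).real

# reference: the direct circular-convolution sum for a single output pixel
i, j = 10, 20
ref = sum(img[(i - a) % W, (j - b) % W] * ker[a, b]
          for a in range(5) for b in range(5))
assert np.allclose(out_fft[i, j], ref)
```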

When do we want to do things the direct way? When K is small enough to make W²·K² smaller than 4W²·(2·log(W) + 1). Factoring out the common W² leaves K² < 8·log(W) + 4. We can probably drop the constant, since we're dealing with idealized estimates: it is likely lost in the errors relative to an actual implementation, from not counting additions, loop overheads and so on. That leaves:

K² < 8·log(W)

This is the approximate condition for choosing a direct approach over a frequency space approach.
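Plugging numbers into the two estimates above makes the cutoff concrete (approximate, of course, for all the reasons already mentioned):

```python
import math

def direct_mults(W, K):
    return W**2 * K**2                        # (Nmults)direct

def fourier_mults(W):
    return 4 * W**2 * (2 * math.log2(W) + 1)  # (Nmults)Fourier

# For a 1024 x 1024 image, 8*log2(W) = 80: direct wins for small kernels,
# Fourier wins once K^2 passes roughly that bound.
assert direct_mults(1024, 3) < fourier_mults(1024)    # 3x3 kernel: go direct
assert direct_mults(1024, 15) > fourier_mults(1024)   # 15x15 kernel: go Fourier
```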

Note that correlation of two same-size images is just like convolving with a kernel of size K = W. Fourier space is always the way to do it.
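For completeness, same-size cross-correlation through the frequency domain uses the conjugate of one transform. A sketch, again assuming periodic boundaries (circular correlation):

```python
import numpy as np

W = 32
f = np.random.rand(W, W)
g = np.random.rand(W, W)

# circular cross-correlation: ifft(fft(f) * conj(fft(g)))
corr = np.fft.ifft2(np.fft.fft2(f) * np.conj(np.fft.fft2(g))).real

# reference value at one shift (di, dj), computed directly:
# O(W^2) per shift, O(W^4) for all shifts, vs. O(W^2 log W) via the FFT
di, dj = 3, 7
ref = sum(f[(i + di) % W, (j + dj) % W] * g[i, j]
          for i in range(W) for j in range(W))
assert np.allclose(corr[di, dj], ref)
```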

This can be refined and argued over to account for overhead, opcode pipelining, and float vs. fixed-point arithmetic, and thrown out the window entirely once GPGPU and specialized hardware enter the picture.
