Template Metaprograms

Todd Veldhuizen

Introduction

Compile-time programs

The introduction of templates to C++ added a facility whereby the compiler can act as an interpreter. This makes it possible to write programs in a subset of C++ which are interpreted at compile time. Language features such as for loops and if statements can be replaced by template specialization and recursion. The first examples of these techniques were written by Erwin Unruh and circulated among members of the ANSI/ISO C++ standardization committee [1]. These programs didn't have to be executed -- they generated their output at compile time as warning messages. For example, one program generated warning messages containing prime numbers when compiled.

Here's a simple example which generates factorials at compile time:

 

template<int N>
class Factorial {
public:
    enum { value = N * Factorial<N-1>::value };
};

template<>
class Factorial<1> {
public:
    enum { value = 1 };
};
Using this class, the value N! is accessible at compile time as Factorial<N>::value. How does this work? When Factorial<N> is instantiated, the compiler needs Factorial<N-1> in order to assign the enum 'value'. So it instantiates Factorial<N-1>, which in turn requires Factorial<N-2>, requiring Factorial<N-3>, and so on until Factorial<1> is reached, where template specialization is used to end the recursion. The compiler effectively performs a for loop to evaluate N! at compile time.

Although this technique might seem like just a cute C++ trick, it becomes powerful when combined with normal C++ code. In this hybrid approach, source code contains two programs: the normal C++ run-time program, and a template metaprogram which runs at compile time. Template metaprograms can generate useful code when interpreted by the compiler, such as a massively inlined algorithm -- that is, an implementation of an algorithm which works for a specific input size, and has its loops unrolled. This results in large speed increases for many applications.

Here is a simple template metaprogram "recipe" for a bubble sort algorithm.

An example: Bubble Sort

Although bubble sort is a very inefficient algorithm for large arrays, it's quite reasonable for small N. It's also the simplest sorting algorithm. This makes it a good example to illustrate how template metaprograms can be used to generate specialized algorithms. Here's a typical bubble sort implementation, to sort an array of ints:

 

inline void swap(int& a, int& b)
{
    int temp = a;
    a = b;
    b = temp;
}


void bubbleSort(int* data, int N)
{
    for (int i = N - 1; i > 0; --i)
    {
        for (int j = 0; j < i; ++j)
        {
            if (data[j] > data[j+1])
                swap(data[j], data[j+1]);
        }
    }
}
A specialized version of bubble sort for N=3 might look like this:

 

inline void bubbleSort3(int* data)
{
    int temp;

    if (data[0] > data[1])
    { temp = data[0]; data[0] = data[1]; data[1] = temp; }
    if (data[1] > data[2])
    { temp = data[1]; data[1] = data[2]; data[2] = temp; }
    if (data[0] > data[1])
    { temp = data[0]; data[0] = data[1]; data[1] = temp; }
}

In order to generate an inlined bubble sort such as the one above, it seems we'll have to unwind two loops. We can reduce this to a single loop by using a recursive version of bubble sort:

 

void bubbleSort(int* data, int N)
{
    for (int j = 0; j < N - 1; ++j)
    {
        if (data[j] > data[j+1])
            swap(data[j], data[j+1]);
    }

    if (N > 2)
        bubbleSort(data, N-1);
}
Now the sort consists of a loop, and a recursive call to itself. This structure is simple to implement using some template classes:

 

template<int N>
class IntBubbleSort {
public:
    static inline void sort(int* data)
    {
        IntBubbleSortLoop<N-1,0>::loop(data);
        IntBubbleSort<N-1>::sort(data);
    }
};


template<>
class IntBubbleSort<1> {
public:
    static inline void sort(int* data)
    { }
};
To sort an array of N integers, we invoke IntBubbleSort<N>::sort(data). This routine calls IntBubbleSortLoop<N-1,0>::loop(data), which replaces the for loop in j, and then makes a recursive call to itself. A template specialization for N=1 is provided to end the recursive calls. We can manually expand this for N=4 to see the effect:

 

static inline void IntBubbleSort<4>::sort(int* data)
{
    IntBubbleSortLoop<3,0>::loop(data);
    IntBubbleSortLoop<2,0>::loop(data);
    IntBubbleSortLoop<1,0>::loop(data);
}
The first template argument in the IntBubbleSortLoop classes (3,2,1) is the value of i in the original version of bubbleSort(), so it makes sense to call this argument I. The second template parameter will take the role of j in the loop, so we'll call it J. Now we need to write the IntBubbleSortLoop class. It needs to loop from J=0 to J=I-2, comparing elements data[J] and data[J+1] and swapping them if necessary:

 

template<int I, int J>
class IntBubbleSortLoop {

private:
    enum { go = (J <= I-2) };

public:
    static inline void loop(int* data)
    {
        IntSwap<J,J+1>::compareAndSwap(data);
        IntBubbleSortLoop<go ? I : 0, go ? (J+1) : 0>::loop(data);
    }
};


template<>
class IntBubbleSortLoop<0,0> {

public:
    static inline void loop(int*)
    { }
};

Writing a base case for this recursion is a little more difficult, since we have two variables. The solution is to make both template parameters revert to 0 when the base case is reached. This is accomplished by storing the loop flag in an enumerative type (go), and using a conditional expression operator (?:) to force the template parameters to 0 when 'go' is false. The class IntSwap<I,J> will perform the task of swapping data[I] and data[J] if necessary. Once again, we can manually expand IntBubbleSort<4>::sort(data) to see what is being generated:

 

static inline void IntBubbleSort<4>::sort(int* data)
{
    IntSwap<0,1>::compareAndSwap(data);
    IntSwap<1,2>::compareAndSwap(data);
    IntSwap<2,3>::compareAndSwap(data);
    IntSwap<0,1>::compareAndSwap(data);
    IntSwap<1,2>::compareAndSwap(data);
    IntSwap<0,1>::compareAndSwap(data);
}

The last remaining definition is the IntSwap<I,J>::compareAndSwap() routine:

 

template<int I, int J>
class IntSwap {

public:
    static inline void compareAndSwap(int* data)
    {
        if (data[I] > data[J])
            swap(data[I], data[J]);
    }
};

The swap() routine is the same as before:

 

inline void swap(int& a, int& b)
{
    int temp = a;
    a = b;
    b = temp;
}
It's easy to see that this results in code equivalent to

 

static inline void IntBubbleSort<4>::sort(int* data)
{
    if (data[0] > data[1]) swap(data[0], data[1]);
    if (data[1] > data[2]) swap(data[1], data[2]);
    if (data[2] > data[3]) swap(data[2], data[3]);
    if (data[0] > data[1]) swap(data[0], data[1]);
    if (data[1] > data[2]) swap(data[1], data[2]);
    if (data[0] > data[1]) swap(data[0], data[1]);
}

In the next section, the speed of this version is compared to the original version of bubbleSort().

 

Practical Issues

Performance

The goal of algorithm specialization is to improve performance. Figure 3 shows benchmarks comparing the time required to sort arrays of different sizes, using the specialized bubble sort versus a call to the bubbleSort() routine. These benchmarks were obtained on an 80486/66 MHz machine running Borland C++ V4.0. For small arrays, the inlined version is substantially faster than the bubbleSort() routine: roughly 7 times faster for 3 elements, declining to 2.5 times faster for 5 elements. As the array size gets larger, the advantage levels off to roughly 1.6.

 

Figure 3. Performance increase achieved by specializing the bubble sort algorithm.
The curve in Figure 3 shows the general shape of the performance increase for algorithm specialization. For very small input sizes, the advantage is greatest, since the overhead of function calls and setting up stack variables is avoided. As the input size gets larger, the performance increase settles out to a constant value which depends on the nature of the algorithm and the processor for which code is generated. In most situations, this value will be greater than 1, since the compiler can generate faster code if array indices are known at compile time. Also, on superscalar architectures, the long code bursts generated by inlining can run faster than code containing loops, since the cost of failed branch predictions is avoided.

In theory, unwinding loops can result in a performance loss for processors with instruction caches. If an algorithm contains loops, the cache hit rate may be much higher for the non-specialized version the first time it is invoked. However, in practice, most algorithms for which specialization is useful are invoked many times, and hence are loaded into the cache.

The price of specialized code is that it takes longer to compile, and generates larger programs. There are less tangible costs for the developer: the explosion of a simple algorithm into scores of cryptic template classes can cause serious debugging and maintenance problems. This leads to the question: is there a better way of working with template metaprograms? The next section presents a better representation.

 

A condensed representation

In essence, template metaprograms encode an algorithm as a set of production rules. For example, the specialized bubble sort can be summarized by this grammar, where the template classes are non-terminal symbols, and C++ code blocks are terminal symbols:

 

IntBubbleSort<N>:
	IntBubbleSortLoop<N-1,0>  IntBubbleSort<N-1>
IntBubbleSort<1>:
	(nil)
IntBubbleSortLoop<I,J>:
	IntSwap<J,J+1> IntBubbleSortLoop<(J+2<=I) ? I : 0, (J+2<=I) ? (J+1) : 0>
IntBubbleSortLoop<0,0>:
	(nil)
IntSwap<I,J>:
	if (data[I] > data[J]) swap(data[I],data[J]);
Although this representation is far from elegant, it is more compact and easier to work with than the corresponding class implementations. When designing template metaprograms, it's useful to avoid thinking about details such as class declarations, and just concentrate on the underlying production rules.

 

Implementing control flow structures

Writing algorithms as production rules imposes annoying constraints, such as requiring the use of recursion to implement loops. Though primitive, production rules are powerful enough to implement all C++ control flow structures, with the exception of goto. This section provides some sample implementations.

 

If/if-else statements
An easy way to code an if-else statement is to use a template class with a bool parameter, which can be specialized for the true and false cases. If your compiler does not yet support the ANSI C++ bool type, then an integer template parameter works just as well.

// C++ version:
if (condition)
    statement1;
else
    statement2;

// Template metaprogram version -- class declarations:
template<bool C>
class _name { };

template<>
class _name<true> {
public:
    static inline void f()
    { statement1; }         // true case
};

template<>
class _name<false> {
public:
    static inline void f()
    { statement2; }         // false case
};

// Replacement for 'if/else' statement:
_name<condition>::f();
The template metaprogram version generates either statement1 or statement2, depending on whether condition is true. Note that since condition is used as a template parameter, it must be known at compile time. Arguments can be passed to the f() function as either template parameters of _name, or function arguments of f().

 

Switch statements

Switch statements can be implemented using specialization, if the cases are selected by the value of an integral type:

// C++ version:
int i;

switch(i)
{
    case value1:
        statement1;
        break;

    case value2:
        statement2;
        break;

    default:
        default-statement;
        break;
}
// Template metaprogram version -- class declarations:
template<int I>
class _name {
public:
    static inline void f()
    { default-statement; }
};

template<>
class _name<value1> {
public:
    static inline void f()
    { statement1; }
};

template<>
class _name<value2> {
public:
    static inline void f()
    { statement2; }
};

// Replacement for switch(i) statement
_name<I>::f();

The template metaprogram version generates one of statement1, statement2, or default- statement depending on the value of I. Again, since I is used as a template argument, its value must be known at compile time.

 

Loops

Loops are implemented with template recursion. The use of the 'go' enumerative value avoids repeating the condition if there are several template parameters:

// C++ version:
int i = N;

do {
    statement;
} while (--i > 0);
// Template metaprogram version -- class declarations:
template<int I>
class _name {
private:
    enum { go = (I-1) != 0 };

public:
    static inline void f()
    {
         statement;
         _name<go ? (I-1) : 0>::f();
    }
};

// Specialization provides base case for
// recursion
template<>
class _name<0> {
public:
    static inline void f()
    { }
};

// Equivalent loop code
_name<N>::f();
The template metaprogram generates statement N times. Of course, the code generated by statement can vary depending on the loop index I. Similar implementations are possible for while and for loops. The conditional expression operator (?:) can be replaced by multiplication, which is trickier but visually cleaner (i.e. _name<(I-1)*go> rather than _name<go ? (I-1) : 0>).

 

Temporary variables

In C++, temporary variables are used to store the result of a computation which one wishes to reuse, or to simplify a complex expression by naming subexpressions. Enumerative types can be used in an analogous way for template metaprograms, although they are limited to integral values. For example, to count the number of bits set in the lowest nibble of an integer, a C++ implementation might be:

 

int countBits(int N)
{
    int bit3 = (N & 0x08) ? 1 : 0,
        bit2 = (N & 0x04) ? 1 : 0,
        bit1 = (N & 0x02) ? 1 : 0,
        bit0 = (N & 0x01) ? 1 : 0;

    return bit0+bit1+bit2+bit3;
}

int i = countBits(13);
Writing a template metaprogram version of this function, the argument N is passed as a template parameter, and the four temporary variables (bit0,bit1,bit2,bit3) are replaced by enumerative types:

 

template<int N>
class countBits {
    enum {
           bit3 = (N & 0x08) ? 1 : 0,
           bit2 = (N & 0x04) ? 1 : 0,
           bit1 = (N & 0x02) ? 1 : 0,
           bit0 = (N & 0x01) ? 1 : 0 };

public:
    enum { nbits = bit0+bit1+bit2+bit3 };    
};

int i = countBits<13>::nbits;

Compile-time functions

Many C++ optimizers will use partial evaluation to simplify expressions containing known values to a single result. This makes it possible to write compile-time versions of library functions such as sin(), cos(), log(), etc. For example, to write a sine function which will be evaluated at compile time, we can use a series expansion:

sin x = x - x^3/3! + x^5/5! - x^7/7! + ...

A C implementation of this series might be

 

// Calculate sin(x) using j terms
float sine(float x, int j)
{
    float val = 1;

    for (int k = j - 1; k >= 0; --k)
        val = 1 - x*x/(2*k+2)/(2*k+3)*val;

    return x * val;
}
Using our production rule grammar analogy, we can design the corresponding template classes needed to evaluate sin x using J=10 terms of the series. Letting x = 2*Pi*I/N, which permits us to pass x using two integer template parameters (I and N), we can write

 

Sine<N,I>:
	x * SineSeries<N,I,10,0>
SineSeries<N,I,J,K>:
	1-x*x/(2*K+2)/(2*K+3) *  SineSeries<N*go, I*go,J*go, (K+1)*go>
SineSeries<0,0,0,0>:
	1;
where 'go' is the result of (K+1 != J). Here are the corresponding class implementations:

 

template<int N, int I>
class Sine {
public:
    static inline float sin()
    {
        return (I*2*M_PI/N) * SineSeries<N,I,10,0>::accumulate();
    }
};


// Compute J terms in the series expansion.  K is the loop variable.
template<int N, int I, int J, int K>
class SineSeries {
public:
    enum { go = (K+1 != J) };

    static inline float accumulate()
    {
        return 1-(I*2*M_PI/N)*(I*2*M_PI/N)/(2*K+2)/(2*K+3) *
            SineSeries<N*go,I*go,J*go,(K+1)*go>::accumulate();
    }
};


// Specialization to terminate loop
template<>
class SineSeries<0,0,0,0> {
public:
    static inline float accumulate()
    { return 1; }
};

The redeeming quality of that cryptic implementation is that a line of code such as

 

float f = Sine<32,5>::sin();
gets compiled into this single 80486 assembly statement by Borland C++:

 

mov	dword ptr [bp-4],large 03F7A2DDEh
The literal 03F7A2DDEh represents the floating point value 0.83147, which is the sine of 2*Pi*I/N. The sine function is evaluated at compile time, and the result is stored as part of the processor instruction. This is very useful in applications where sine and cosine values are needed, such as computing a Fast Fourier Transform [2]. Rather than retrieving sine and cosine values from lookup tables, they can be evaluated using compile-time functions and stored in the instruction stream.

Although the implementation of an inlined FFT server is too complex to describe in detail here, it serves as an illustration that template metaprograms are not limited to simple applications such as sorting. An FFT server implemented with template metaprograms does much of its work at compile time, such as computing sine and cosine values, and determining the reordering needed for the input array. It can use template specialization to squeeze out optimization for special cases, such as replacing

 

y = a * cos(0) + b * sin(0);
with

 

y = a;
The result is a massively inlined FFT server containing no loops or function calls, which runs at roughly three times the speed of an optimized non-specialized routine.

 

References

[1] E. Unruh, "Prime number computation," ANSI X3J16-94-0075/ISO WG21-462.

[2] W. Press et al., Numerical Recipes in C, Cambridge University Press, 1992.

 

Web Site

More information on these techniques is available from the Blitz++ Project Home Page, at "http://oonumerics.org/blitz/". 
 