

No Free Lunch Theorem: A Review


Abstract The “No Free Lunch” theorem states that, averaged over all optimization problems, without re-sampling, all optimization algorithms perform equally well.

Optimization, search, and supervised learning are the areas that have benefited most from this important theoretical concept. The formulation of the initial No Free Lunch theorem quickly gave rise to a number of research works, which resulted in a suite of theorems that define an entire research field, with significant results in other scientific areas where successfully exploring a search space is an essential and critical task. The objective of this paper is to go through the main research efforts that contributed to this field, reveal the main issues, and disclose those points that are helpful in understanding the hypotheses, the restrictions, or even the inapplicability of the No Free Lunch theorems.



1 Introduction


        Optimization problems occurring in various fields of science, computing, and engineering depend on the number of parameters, the size of the solution space and, mainly, on the objective function, whose definition is critical as it largely determines the level of difficulty of the problem. Hence, defining and solving an optimization problem is sometimes an extremely difficult and demanding task. Researchers from various fields have been involved in solving optimization problems, either because this constitutes part of their main research or because the problem they face can be cast as an optimization problem. The research efforts on this matter have permitted the elaboration of numerous methods and techniques, built on solid mathematical concepts, whose application has produced significantly good results.


        However, contrary to claims otherwise, none of these methods has proven successful on all types of problems to which it was applied. This argument has been the subject of important theoretical work carried out by David Wolpert, which gave rise to the well-known No Free Lunch (NFL) theorem. Briefly, the NFL theorem states that: “averaged over all optimization problems, without re-sampling, all optimization algorithms perform equally well.” Besides optimization, the NFL theorem has been successfully used to tackle important theoretical issues pertaining to supervised learning in machine learning systems. In fact, the NFL theorem has grown into a suite of theorems which has given significant results in various scientific fields where searching for some optimal solution is an important issue.


        The NFL theorems constitute an important theoretical development which marked the limits of the range of successful application for a number of search, optimization, and supervised learning algorithms. At the same time, the formulation of these theorems has provoked controversial discussions [4, 36, 44, 45] regarding the possibility of inventing and effectively using general-purpose algorithms in various fields where only a limited view of the real-world problem exists.


        In this paper we aim to review the soundest research work published by several researchers on this matter, including its impact on the most important fields, that is, optimization and supervised learning. Other fields of interest, such as user interface design [24] and network calculus [8], merit attention but are outside the scope of this review. The emphasis of this review will be mainly on the critical questions which prompted the development of NFL theorems, as well as on the issues that proved to be important for (a) optimization, (b) searching, and (c) supervised learning.


        The rest of this paper is structured as follows. Section 2 provides a review of the early concepts and constructs that underpinned the definition of the NFL theorems. Section 3 covers the main research efforts of Wolpert establishing NFL for optimization and search. In Section 4 we survey the more recent work of Wolpert which clarifies older concepts while offering some new results on this field. Next, Section 5 is dedicated to the main research carried out by several researchers on NFL for optimization and evolutionary algorithms. Part of the research surveyed concerns the cases where NFL theorems do not apply and researchers have proved the existence of “Free Lunches.” In Section 6 we describe the main research efforts on NFL theorems for supervised learning. The paper ends in Section 7 with a synopsis and some concluding remarks.


2 Early Developments


        As noted by David Wolpert [56], the first attempt to point out the limits of inductive inference was made by the Scottish philosopher David Hume in 1740 in his seminal work “A Treatise of Human Nature” [26, 27]. Hume wrote that:


        Even after the observation of the frequent conjunction of objects, we have no reason to draw any inference concerning any object beyond those of which we have had experience.


 In the machine learning context this can be stated as follows:


        It is not reasonable to believe that the generalization error of a classifier-generalizer on test data drawn from outside the training set correlates with its performance on the training set itself by simply considering a priori information on the real world.


        Wolpert based his theoretical work on earlier developments elaborated in his paper “On the connection between in-sample testing and generalization error” [55].

        In that paper the generalization error is taken as the off-training set (OTS) error, and the question addressed concerns its correlation with the error produced using in-sample testing. Moreover, Wolpert tackles the question of how “. . . to take into account the probability distribution of target functions in the real world,” since any theory of generalization that does not address this question is irrelevant to real-world problems. Some, but not all, of the important issues arising in that paper are:



(a) “Can one prove inductive inference from first principles?” In other words, given the performance of a learning algorithm on the training data set, is it possible to obtain information on its ability to provide an exact representation of the target function for examples outside the data set?
(b) If one cannot answer the previous question, then which assumptions on the distribution of real-world data (the target function) can help with generalization for training algorithms, such as back-propagation, which aim to minimize the error on the training data?
(c) Is there a mathematical basis for estimating when over-training occurs and for modifying the learning algorithm in order to bound the effects of such over-training?
(d) Is it possible to express in mathematical terms the ability of a training set to faithfully represent the distribution over the entire data space?
(e) What are the hypotheses under which non-parametric statistical techniques, such as cross-validation, which are designed to choose between learning algorithms, succeed in reducing the generalization error?


        In addressing these matters, the proposed formalism extends the classical Bayesian formalism using the hypothesis function, i.e., the distribution of the data set as learned by the generalizer. The formalism provides a way to measure the degree to which the distribution derived by the learning algorithm matches the distribution of the training data, and it can be used to tackle various generalization issues such as over-training and the minimum number of parameters for the model. From another point of view, this formalism aims to express in mathematical terms the assumptions made by a generalizer so that the model used best fits the training set representing the real world. As a result, the elaboration of important theoretical proofs provides a solid basis for tackling several issues in machine learning and gives rise to the development of concepts such as the NFL theorems.


        Wolpert's first and foremost contributions concerning NFL theorems were presented in a set of two papers [56, 57], namely (i) “The lack of a priori distinctions between learning algorithms” and (ii) “The existence of a priori distinctions between learning algorithms,” in which he develops his theory and formulates the NFL theorems. In the former, he discusses the hypothesis that, given any two learning algorithms, one cannot claim any prior information that distinguishes them as far as their performance on a specific class of problems is concerned. In the latter, Wolpert unfolds the arguments for the inverse assumption, i.e., that there are prior distinctions regarding the performance of any two algorithms. These two papers deal with supervised learning, but the theoretical constructs have been applied to multiple domains where two different algorithms compete over which performs better for a class of problems and associated error functions.
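        The lack of a priori distinctions can be illustrated on a toy problem. The sketch below is not taken from [56]; it assumes a hypothetical input space of all 3-bit tuples, a fixed set of five training inputs, zero-one loss, and a uniform average over all Boolean target functions. The two learners (`majority_learner` and `minority_learner`) are invented for illustration and make opposite predictions, yet their average off-training-set error coincides.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy setup: inputs are all 3-bit tuples, targets are all
# Boolean functions on them, and the training inputs are fixed.
INPUTS = list(product([0, 1], repeat=3))
TRAIN_X = INPUTS[:5]
TEST_X = INPUTS[5:]  # off-training-set (OTS) inputs


def majority_learner(train):
    """Predict the most common training label everywhere (ties -> 1)."""
    ones = sum(y for _, y in train)
    label = 1 if 2 * ones >= len(train) else 0
    return lambda x: label


def minority_learner(train):
    """Predict the opposite of the majority label everywhere."""
    ones = sum(y for _, y in train)
    label = 0 if 2 * ones >= len(train) else 1
    return lambda x: label


def mean_ots_error(learner):
    """Average zero-one OTS error over all 2**8 target functions."""
    total = Fraction(0)
    count = 0
    for labels in product([0, 1], repeat=len(INPUTS)):
        target = dict(zip(INPUTS, labels))
        train = [(x, target[x]) for x in TRAIN_X]
        h = learner(train)
        errors = sum(h(x) != target[x] for x in TEST_X)
        total += Fraction(errors, len(TEST_X))
        count += 1
    return total / count


print(mean_ots_error(majority_learner))  # 1/2
print(mean_ots_error(minority_learner))  # 1/2
```

        Both averages equal 1/2 because, for any fixed training labels, the labels on the unseen inputs range uniformly over all possibilities. The equality depends on the uniform average over targets; under a non-uniform distribution of target functions the two learners can differ.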


        Focusing on supervised learning, in the first of the previously mentioned papers the concept of “off-training set” (OTS) is defined and the associated performance measure of the supervised learning algorithm is proposed. The mathematical formalism used is based on the so-called extended Bayesian formalism and is refined in order to take into account the generalization error, the cost function, and their relation to the learning algorithm while providing the necessary hypotheses for the training sets and the targets. In the sequel the probability of some cost “c” of the learning algorithm associated with the loss function is proposed as follows:


3 No Free Lunch for Optimization and Search


        Another direction of research applying the ideas of the NFL theorems, as presented above, concerns the domain of optimization. The work “No free lunch theorems for optimization” [62], published by Wolpert and Macready, deals with this matter based on two technical reports produced by the authors at the Santa Fe Institute. The first technical report [35], entitled “What makes an optimization problem hard?”, raises the question: “Are some classes of combinatorial optimization problems intrinsically harder than others, without regard to the algorithm one uses, or can difficulty be assessed only relative to a particular algorithm?” The second technical report [61], entitled “No free lunch theorems for search,” focuses on proving that all algorithms searching for an optimum of an optimization problem, i.e., an extremum of an objective function, perform exactly the same, no matter the performance measure used, when averaged over all possible objective functions.
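        The averaging claim above can be checked by brute force on a very small problem. The following sketch is an illustration under stated assumptions, not the construction used in [61]: the search space `X`, the value set `Y`, the budget `M`, and the two algorithms `ascending` and `adaptive` are all hypothetical, and both algorithms are deterministic and never revisit a point.

```python
from collections import Counter
from itertools import product

X = [0, 1, 2, 3]   # hypothetical search space
Y = [0, 1, 2]      # hypothetical set of objective values
M = 2              # number of distinct points each algorithm may evaluate


def ascending(history):
    """Visit points in increasing order, ignoring observed values."""
    visited = {x for x, _ in history}
    return next(x for x in X if x not in visited)


def adaptive(history):
    """Start at the last point; then go low after a top value, else high."""
    if not history:
        return X[-1]
    visited = {x for x, _ in history}
    unvisited = [x for x in X if x not in visited]
    _, last_y = history[-1]
    return unvisited[0] if last_y == max(Y) else unvisited[-1]


def run(algorithm, f, m):
    """Return the sequence of objective values seen in m evaluations."""
    history = []
    for _ in range(m):
        x = algorithm(history)
        history.append((x, f[x]))
    return tuple(y for _, y in history)


def value_histogram(algorithm, m=M):
    """Count observed value sequences over all |Y|**|X| objective functions."""
    counts = Counter()
    for values in product(Y, repeat=len(X)):
        f = dict(zip(X, values))
        counts[run(algorithm, f, m)] += 1
    return counts


print(value_histogram(ascending) == value_histogram(adaptive))  # True
```

        Each of the nine possible value sequences is produced by exactly nine of the 81 objective functions for either algorithm, so any performance measure computed from the observed values (for instance the best value found) has the same average for both.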


        The work of Wolpert and Macready, “No free lunch theorems for optimization” [62], sets up a formalism for investigating the relation between the effectiveness of optimization algorithms and the problems they are solving. The NFL theorems developed in the paper establish that the successful performance of any optimization algorithm on one class of problems is counterbalanced by its degraded performance on another class of problems. A geometric interpretation is provided of what it means for an algorithm to be well suited to an optimization problem.
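        For orientation, the first NFL theorem of [62] is usually written in the form below. The notation is an assumption here, since this review does not reproduce it: f ranges over all objective functions from a finite search space to a finite set of cost values, a_1 and a_2 are any two algorithms that do not revisit points, m is the number of distinct evaluations, and d_m^y is the ordered sequence of the m cost values observed.

```latex
\sum_{f} P\left(d_m^y \mid f, m, a_1\right) = \sum_{f} P\left(d_m^y \mid f, m, a_2\right)
```

        Read this way, any performance measure that depends only on d_m^y has the same sum over all f for both algorithms, which is the counterbalancing described above.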

        Moreover, as mentioned in the previous technical reports the authors examine applications of NFL theorems to information-theoretic aspects of optimization as well as to defining measures of performance for optimization benchmarks.



        Given the multitude of black-box optimization techniques available, the authors try to provide the formalism for tackling the following problem: “is there a relationship between how well an algorithm performs and the optimization problem on which it is run?” This problem can be cast as several other questions, such as: (a) What are the mathematical constituents of optimization theory one needs to know before deciding on the necessary probability distributions to be applied? (b) Are information theory and Bayesian analysis suitable for understanding the previous issues? (c) Given the performance results of a certain algorithm on a certain class of problems, can one generalize these results a priori to other classes of problems? (d) Is there a suitable measure of such generalization? Can one evaluate the performance of algorithms on problems in a way that allows those algorithms to be compared?







