考察数据结构

 
介绍:
本文是介绍在.Net平台下使用数据结构的系列文章,共分为六部分,这是本文的第一部分.本文试图考察几种数据结构,其中有的包含在.Net Framework的基类库中,有的是我们自己创建的.如果你对这些名词不太熟悉,那么我们可以把数据结构看作是一种抽象结构或是类,它通常用来组织数据,并提供对数据的操作.最常见并为我们所熟知的数据结构就是数组array,它包含了一组连续的数据,并通过索引进行访问.
在阅读本文内容之前,让我们先看看这六部分的主要内容.如果你有什么想法,或觉得本文有什么遗漏之处,希望你通过e-mail( mitchell@4guysfromrolla.com )和我联系,共同分享你的思想.假如有时间的话,我很高兴将你的建议放到合适的部分,如有必要,可以在这篇系列文章中加上第七部分.
第一部分:首先介绍数据结构在算法设计中的重要性.决定数据结构优劣的标准在于其性能.我们将严格地分析数据结构的各种性能.此部分还将介绍.Net Framework下两种常用的数据结构:Array和ArrayList.我们将考察其结构的操作方式及其效率.
第二部分:我们将继续从更多细节上分析ArrayList结构,同时还将介绍Queue类和Stack类.和ArrayList一样,Queue和Stack存放的都是一组连续的数据集合,都属于.Net Framework基类库.与ArrayList不同的是,Stack和Queue只能以预先规定的序列顺序读取其数据(先进先出和先进后出),而ArrayList可以任意获取数据项.我们将通过示例程序来考察Queue,Stack,并通过扩展ArrayList类来实现它们.之后,我们还要分析哈希表HashTable,它象ArrayList一样可以直接访问数据,不同的是它以key(字符串)为索引.
ArrayList对数据直接读取和存储是一种理想的数据结构,同时,它也是支持数据搜索的候选方案.在第三部分,我们将考察二叉树结构,对于数据搜索而言,它比ArrayList更加有效. .Net Framework并不包含此种内置数据结构,因此需要我们自己创建.
二叉树搜索的效率受制于插入到树中的数据的顺序.如果我们插入的是有序或近似有序的数据,实际上,它的效率不如ArrayList.为了将这两种结构的优势结合起来,在第四部分,我们将考察一种有趣的随机数据结构——SkipList. SkipList既保留了二叉树搜索的高效率,同时输入数据的顺序对其效率影响甚微.
第五部分我们将注意力转向通常用来表现图形的数据结构.图(graph)是众多节点以及节点之间边的集合.举例来说,地图就可以图的形式来表现.城市是节点,公路则是连接节点之间的边.许多现实问题都可以抽象成图的形式,因此,图也是我们经常要用到的数据结构.
最后,第六部分我们将谈到集合(set)与非关联集(disjoint set,即互不相交的集合).集合是一种无序数据的聚集.非关联集是指它和另外一个集合没有共同的元素.我们在编写程序时会经常用到集合和非关联集,这一部分将对其详细描述.

数据结构性能分析
当我们在思考一个特定的应用程序或编程问题时,多数开发人员(包括我自己)都将兴趣集中在解决手头难题的算法上,或者为应用程序加上一个很酷的特色以丰富用户的体验.我们似乎很少听到有人会为他所使用的数据结构而激动不已,啧啧赞叹.然而,用在一个特定算法中的数据结构能够很大程度上影响其性能.最常见的例子就是在数据结构中查找一个元素.在数组中,查找过程所耗时间与数组中元素的个数成正比.采用二叉搜索树或者SkipList,耗时则呈亚线性(sub-linear)增长.当我们要搜索大量的数据时,数据结构的选择对程序的性能尤其重要,其差别甚至达到数秒,乃至数分钟.
既然在算法中使用的数据结构影响了算法的效率,因此比较各种数据结构的效率并从中选择一种更佳的方法就显得尤为重要.作为开发者而言,我们首先要关注的是:随着存储的数据量的增长,数据结构的性能是怎样随之改变的?也就是说,每当数据结构中添加一个新元素时,它将怎样影响数据结构的运行时间?
考虑这样一种情形,我们在程序中使用了System.IO.Directory.GetFiles(路径)方法以返回文件的列表,存放到一个特定的字符串数组directory中.假设你需要搜索这个数组以判断在文件列表中是否存在XML文件(即扩展名为.xml的文件),一种方法是扫描(scan,或者是遍历)整个数组,当找到XML文件时,就设置一个标识.代码可能是这样:
using System;
using System.Collections;
using System.IO;
public class MyClass
{
   public static void Main()
   {
      string [] fs = Directory.GetFiles(@"C:/Inetpub/wwwroot");
      bool foundXML = false;
      int i = 0;
      for (i = 0; i < fs.Length; i++)
         if (String.Compare(Path.GetExtension(fs[i]), ".xml", true) == 0)
         {
            foundXML = true;
            break;
         }
  
     if (foundXML)
        Console.WriteLine("XML file found - " + fs[i]);
     else
        Console.WriteLine("No XML files found.");
     
   }
}

现在我们来看看最糟糕的一种情况,当这个列表中不存在XML文件或者XML文件是在列表的最后,我们将会搜索完这个数组的所有元素.再来分析一下数组的效率,我们必须问问自己,"假设数组中现有n个元素,如果我添加一个新元素,增长为n+1个元素,那么新的运行时间是多少?"(术语"运行时间"——running time,不能顾名思义地认为是程序运行所消耗的绝对时间,而指的是程序完成该任务所必须执行的步骤数.以数组而言,运行时间通常被认为是访问数组元素所需执行的步骤数。)要搜索数组中的一个值,潜在的可能是访问数组的每一个元素,如果数组中有n+1个元素,就将执行n+1次检查。那就是说,搜索数组耗费的时间与数组元素的个数成线性比例。
当数据结构的长度趋于无穷大时,分析其结构的效率,我们把这种分析方法称为渐进分析(asymptotic analysis)。渐进分析中常用的符号是大写的O(big-Oh),以O(n)的形式描述遍历数组的性能。O是术语学中big-Oh符号的表示,n则代表遍历数组时随长度增长而与之线形增长的程序执行步数。
计算代码块中算法的运行时间的一种系统方法应遵循以下步骤:
1、判断组成算法运行时间的步骤。如前所述,对于数组而言,典型的步骤应是对数组进行读写访问的操作。而对于其他数据结构则不尽然。特别地,你应该考虑的是数据结构自身的步骤,而与计算机内部的操作无关。以上面的代码块为例,运行时间应该只计算访问数组的次数,而不用考虑创建和初始化变量以及比较两个字符串是否相等的时间。
2、找到符合计算运行时间条件的代码行。在这些行上面置1。
3、判断这些置1的行是否包含在循环中,如果是,则将1改为1乘上循环执行的最大次数。如果嵌套两重或多重循环,继续对循环做相同的乘法。
4、找到对每行写下的最大值,它就是运行时间。
现在我们按照这种步骤来标记上面的代码块。首先我们已经能够确定与计算运行时间有关的代码行,再根据步骤2,在数组fs被访问的两行代码作上标记,一行是数组元素作为String.Compare()方法的参数,一行是在Console.WriteLine()方法中。我们将这两行标记为1。然后根据步骤3,String.Compare()方法是在循环中,最大循环次数为n(因为数组长度为n)。因此将该行的标记1改为n。最后,我们得到的运行时间就是标记的最大值n,记为O(n)。(译注:即为数据结构中通常所说的时间复杂度)
O(n),或者说线性时间(linear-time),只是多种算法运行时间中的一种。其他还有O(log₂n),O(n log₂n),O(n²),O(2ⁿ)等等。我们无须关心这些繁杂的big-Oh记号的细节,只需要知道括号中的值增长越慢,数据结构的性能就越好。举例来说,运行时间为O(log₂n)的算法远比O(n)的更有效率,因为log₂n < n。

注:
我们需要温习一下数学知识。在这里,log<sub>a</sub>b = y 的含义是 aʸ = b。因此,log₂4 = 2,因为2² = 4。log₂n的增长速度比n要慢得多,在第三部分我们将考察运行时间为O(log₂n)的二叉搜索树结构。
在这篇系列文章中,我们将计算每一种新的数据结构和它们的渐进操作运行时间,并通过相似的操作比较其他数据结构在运行时间上的区别。
数组:一种线形的,可以直接访问的,单一数据结构
在程序编写中,数组是最简单也是最广泛使用的数据结构。在所有的程序语言中数组都具备以下共同的属性:
1.数组的数据存储在一段连续的内存之中;
2.数组的所有元素都必须是同一种数据类型,因此数组又被认为是单一数据结构(homogeneous data structures);
3.数组元素可以直接访问。(在很多数据结构中,这一特点是不必要的。例如,文章第四部分介绍的数据结构SkipList。要访问SkipList中的特定元素,你必须根据搜索其他元素直到找到搜索对象为止。然而对于数组而言,如果你知道你要查找第i个元素,就可以通过arrayName[i]来访问它。)(译注:很多语言都规定数组的下标从0开始,因此访问第i个元素,应为arrayName[i-1])
以下是数组常用的操作:
1.分配空间
2.数据访问
3.数组空间重分配(Redimensioning)
在C#里声明数组时,数组为空值(null)。下面的代码创建了一个名为booleanArray的数组变量,其值为空(null):
bool [] booleanArray;
在使用该数组时,必须用一个特定数字给它分配空间,如下所示:
booleanArray = new bool[10];
通用的表述为:
arrayName = new arrayType[allocationSize];
它将在CLR托管堆里分配一块连续的内存空间,足以容纳数据类型为arrayTypes、个数为allocationSize的数组元素。如果arrayType为值类型(译注:如int类型),则有allocationSize个未封箱(unboxed)的arrayType值被创建。如果arrayType为引用类型(译注:如string类型),则有allocationSize个arrayType引用类型值被创建。(如果你对值类型和引用类型、托管堆和栈之间的区别不熟悉,请查阅“理解.Net公共类型系统Common Type System”)
为帮助理解.Net Framework中数组的内部存储机制,请看下面的例子:
bool [] booleanArray;
FileInfo [] files;
booleanArray = new bool[10];
files = new FileInfo[10];
这里,booleanArray是值类型System.Boolean数组,而files数组则是引用类型System.IO.FileInfo数组。图一显示了执行这四行代码后CLR托管堆的情况。
 
 
图一:在托管堆中顺序存放数组元素
请记住在files数组中存放的十个元素指向的是FileInfo实例。图二强调了这一点,显示了如果我们为files数组中的FileInfo实例赋值后内存的分布情况。
 

图二:在托管堆中顺序存放数组元素

.Net中所有数组都支持对元素的读写操作。访问数组元素的语法格式如下:
// 读一个数组元素
bool b = booleanArray[7];
// 写一个数组元素,即赋值
booleanArray[0] = false;
访问一个数组元素的运行时间表示为O(1),因为对它的访问时间是不变的。那就是说,不管数组存储了多少元素,查找一个元素所花的时间都是相同的。运行时间之所以不变,是因为数组元素是连续存放的,查找定位的时候只需要知道数组在内存中的起始位置,每个元素的大小,以及元素的索引值。
在托管代码中,数组的查找比实际的实现稍微复杂一些,因为在CLR中访问每个数组,都要确保索引值在其边界之内。如果数组索引超出边界,会抛出IndexOutOfRangeException异常。这种边界检查有助于确保我们在访问数组不至于意外地超出数组边界而进入另外一块内存区。而且它不会影响数组访问的时间,因为执行边界检查所需时间并不随数组元素的增加而增加。
注:如果数组元素特别多,索引边界检查会对应用程序的执行性能有稍许影响。而对于非托管代码,这种边界检查就被忽略了。要了解更多信息,请参考Jeffrey Richter所著的Applied Microsoft .NET Framework Programming第14章。
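下面用一小段C#代码演示CLR的边界检查行为(示例代码,非原文所有,TryRead为演示用的假想方法名):

```csharp
using System;

public class BoundsCheckDemo
{
    // 尝试读取指定索引处的元素;越界时CLR抛出IndexOutOfRangeException,
    // 此处捕获该异常并返回false,而不使程序崩溃
    public static bool TryRead(int[] arr, int index, out int value)
    {
        try
        {
            value = arr[index];   // CLR在此处执行索引边界检查
            return true;
        }
        catch (IndexOutOfRangeException)
        {
            value = 0;
            return false;
        }
    }

    public static void Main()
    {
        int[] numbers = { 10, 20, 30 };
        int v;
        Console.WriteLine(TryRead(numbers, 1, out v));   // True,v为20
        Console.WriteLine(TryRead(numbers, 3, out v));   // False:合法索引只有0到2
    }
}
```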
使用数组时,你也许需要改变数组大小。可以通过根据特定的长度大小创建一个新数组实例,并将旧数组的内容拷贝到新数组,来实现该操作。我们称这一过程为数组空间重分配(redimensioning),如下代码:
using System;
using System.Collections;
public class MyClass
{
   public static void Main()
   {
      // 创建包含3个元素的int类型数组
      int [] fib = new int[3];
      fib[0] = 1;
      fib[1] = 1;
      fib[2] = 2;
     
      // 重新分配数组,长度为10
      int [] temp = new int[10];
// 将fib数组内容拷贝到临时数组
      fib.CopyTo(temp, 0);
     
      // 将临时数组赋给fib
      fib = temp;  
   }
}
在代码的最后一行,fib指向包含10个元素的Int32类型数组。fib数组中索引3到9(译注:注意下标从0开始)的元素值默认为0(Int32类型的默认值)。
当我们要存储同种类型的数据并仅需要直接访问数据时,数组是较好的数据结构。搜索未排序的数组耗费线性时间。当我们对小型数组进行操作,或很少对它进行查询操作时,数组这种结构是可以接受的。但当你的应用程序需要存储大量数据,且频繁进行查询操作时,有很多其他数据结构更能适应你的工作。我们来看看本文接下来将要介绍的一些数据结构。(如果你要根据某个属性查找数组,且数组是根据该属性进行排序的,你可以使用二分搜索(binary search),它的运行时间为O(log₂n),与在二叉搜索树中搜索的时间复杂度相同。事实上,Array类中包含了一个静态方法BinarySearch()。如要了解该方法的更多信息,请参考我早期的一篇文章“有效地搜索有序数组”。)
注:.Net Framework同样支持多维数组。与一维数组一样,多维数组对数据元素的访问运行时间仍然是不变的。回想一下我们前面介绍的在n个元素的一维数组中查询操作的运行时间为O(n)。对于一个n×n的二维数组,运行时间为O(n²),因为每次搜索最多要检查n²个元素。以此类推,k维数组搜索的运行时间为O(nᵏ)。
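可以用一小段代码验证这一点:统计在n×n二维数组中线性查找时实际检查的元素个数(示例代码,非原文所有,CountChecks为演示用的假想方法名):

```csharp
using System;

public class MultiDimDemo
{
    // 在n×n的二维数组中线性查找target,返回检查过的元素个数
    public static int CountChecks(int[,] grid, int target)
    {
        int checks = 0;
        for (int i = 0; i < grid.GetLength(0); i++)
            for (int j = 0; j < grid.GetLength(1); j++)
            {
                checks++;
                if (grid[i, j] == target)
                    return checks;
            }
        return checks;   // 未找到时,共检查了n*n个元素
    }

    public static void Main()
    {
        int[,] grid = new int[3, 3];   // 9个元素,默认值均为0
        Console.WriteLine(CountChecks(grid, 42));   // 最坏情况:检查全部9个元素
    }
}
```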
ArrayList:可存储不同类型数据、自增长的数组
明确地,数组在设计时受到一些限制,因为一维数组只能存储相同类型的数据,而且在使用数组时,必须为数组定义特定的长度。很多时候,开发人员要求数组更加灵活,它可以存储不同类型的数据,也不用去关心数组空间的分配。在.Net Framework基类库中提供了满足这样条件的数据结构——System.Collections.ArrayList。
如下的一小段代码是ArrayList的示例。注意到使用ArrayList时可以添加任意类型的数据,且不需要分配空间。所有的这些都由系统控制。
ArrayList countDown = new ArrayList();
countDown.Add(5);
countDown.Add(4);
countDown.Add(3);
countDown.Add(2);
countDown.Add(1);
countDown.Add("blast off!");
countDown.Add(new ArrayList());
从深层次来讲,ArrayList内部使用的是存放object类型的System.Array对象。既然所有类型都直接或间接从object派生,一个object类型的数组自然可以存放任何类型的元素。ArrayList默认创建16个object类型元素的数组,当然我们也可以通过构造函数的参数或设置Capacity属性来定制ArrayList大小。通过Add()方法添加新元素时,内部会自动检查其容量。如果添加新元素导致越界,则容量自动成倍增加,我们称之为自增长。
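下面的小例子可以观察到这种成倍增长(示例代码,非原文所有;为便于观察,这里显式指定初始容量为4,实际默认容量随.Net版本而异,原文所述1.x版本为16):

```csharp
using System;
using System.Collections;

public class CapacityDemo
{
    // 依次添加count个元素,记录Capacity每次发生变化后的值
    public static ArrayList TrackCapacity(int count)
    {
        ArrayList list = new ArrayList(4);        // 显式指定初始容量为4
        ArrayList capacities = new ArrayList();
        for (int i = 0; i < count; i++)
        {
            list.Add(i);
            // 仅在容量与上次记录不同时才记录,得到容量的变化序列
            if (capacities.Count == 0 ||
                (int) capacities[capacities.Count - 1] != list.Capacity)
                capacities.Add(list.Capacity);
        }
        return capacities;
    }

    public static void Main()
    {
        foreach (int c in TrackCapacity(10))
            Console.Write(c + " ");   // 容量成倍增长:4 8 16
    }
}
```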
ArrayList和Array一样,也可以通过索引直接访问:
// Read access
int x = (int) countDown[0];
string y = (string) countDown[5];
// Write access
countDown[1] = 5;
// 会产生ArgumentOutOfRange 异常
countDown[7] = 5;
既然ArrayList存储的是object类型的元素,因此从ArrayList中读元素时应该显式地进行类型转换。同时要注意的是,如果你访问的数组元素超过ArrayList的长度,系统会抛出System.ArgumentOutOfRangeException异常。
ArrayList提供了标准数组所不具备的自增长灵活性,但这种灵活性是以牺牲性能为代价的,尤其是当我们存储的是值类型——例如System.Int32,System.Double,System.Boolean等。它们在托管堆中是以未封箱形式(unboxed form)连续存放的。然而,ArrayList的内部机制是一个引用的object对象数组;因此,即使ArrayList中只存放了值类型,这些元素仍然会通过封箱(boxing)转换为引用类型。如图三所示:
 
图三:存储连续块的object引用的ArrayList
在ArrayList中使用值类型,将额外进行封箱(boxing)和撤箱(unboxing)操作,当你的应用程序使用很大的ArrayList并频繁进行读写操作时,会很大程度上影响程序性能。如图三所示,对于引用类型而言,ArrayList和数组的内存分配是相同的。
比较数组而言,ArrayList的自增长并不会导致任何性能的下降。如果你知道存储到ArrayList的元素的准确数量,可以通过ArrayList构造函数初始化容量以关闭其自增长功能。而对于数组,当你不知道具体容量时,不得不在插入的数据元素超过数组长度的时候,手动改变数组的大小。
一个经典的计算机科学问题是:当程序运行超出了缓冲空间时,应该分配多少新空间为最佳。一种方案是在原来分配空间的基础上每次加1。例如数组最初分配了5个元素,那么在插入第6个元素之前,将其长度增加为6。显然,这种方案最大程度上节约了内存空间,但代价太大,因为每插入一个新元素都要进行一次再分配操作。
另一种方案刚好相反,也就是每次分配都在原来大小的基础上增加100倍。如果数组最初分配了5个元素,那么在插入第6个元素之前,数组空间增长为500。显然,该方案大大地减少了再分配操作的次数,但仅当插入极少的数据元素时,就会有上百的元素空间未使用,实在太浪费空间了!
ArrayList的渐近运行时间和标准数组一样。即使对ArrayList的操作是高开销的,尤其是存储值类型,其元素个数和每次操作的代价之间的关系与标准数组相同。

本文是"考察数据结构"系列文章的第二部分,考察了三种研究得最多的数据结构:队列(Queue),堆栈(Stack)和哈希表(Hashtable)。正如我们所知,Queue和Stack其实是一种特殊的ArrayList,提供大量不同类型的数据对象的存储,只不过访问这些元素的顺序受到了限制。Hashtable则提供了一种类数组(array-like)的数据抽象,它具有更灵活的索引访问。数组需要通过序数进行索引,而Hashtable允许通过任何一种对象索引数据项。

目录:

简介

“排队顺序”的工作进程

“反排队顺序”——堆栈数据结构

序数索引限制

System.Collections.Hashtable类

结论

 

简介

在第一部分中,我们了解了什么是数据结构,评估了它们各自的性能,并了解了选择何种数据结构对特定算法的影响。另外我们还了解并分析了数据结构的基础知识,介绍了一种最常用的数据结构:数组。

数组存储了同一类型的数据,并通过序数进行索引。数组实际的值是存储在一段连续的内存空间中,因此读写数组中特定的元素非常迅速。

针对数组同构性与定长性的限制,.Net Framework基类库提供了ArrayList数据结构,它可以存储不同类型的数据,并且不需要显式地指定长度。前文所述,ArrayList本质上是存储object类型的数组,每次调用Add()方法增加元素,内部的object数组都要检查边界,如果超出,数组会自动以倍数增加其长度。

第二部分,我们将继续考察两种类数组结构:Queue和Stack。和ArrayList相似,他们也是一段相邻的内存块以存储不同类型的元素,然而在访问数据时,会受到一定的限制。

之后,我们还将深入了解Hashtable数据结构。有时候,我们可以把Hashtable看作是一种关联数组(associative array),它同样是存储不同类型元素的集合,但它可通过任意对象(例如string)来进行索引,而非固定的序数。

“排队顺序”的工作进程

如果你要创建一种服务,即通过多种资源以响应多种请求的程序,那么当处理这些请求时,如何决定其响应的顺序就成了创建服务的一大难题。通常的解决方案有两种:

“排队顺序”原则

“基于优先等级”的处理原则

当你在商店购物、银行取款的时候,你需要排队等待服务。“排队顺序”原则规定排在前面的比后面的更早享受服务。而“基于优先等级”原则,则根据其优先等级的高低决定服务顺序。例如在医院的急诊室,生命垂危的病人会比病情轻的更先接受医生的诊断,而不用管是谁先到的。

设想你需要构建一个服务来处理计算机所接收到的请求,由于收到的请求远远超过计算机处理的速度,因此你需要将这些请求按照它们递交的顺序依次放入到缓冲区中。

一种方案是使用ArrayList,通过名为nextJobPos的整型变量来指定将要执行的任务在数组中的位置。当新的工作请求进入,就简单地使用ArrayList的Add()方法将其添加到ArrayList的末端。当你准备处理缓冲区的任务时,就通过nextJobPos得到该任务在ArrayList中的位置以获取该任务,同时将nextJobPos加1。下面的程序实现了该算法:

using System;
using System.Collections;
public class JobProcessing

{

   private static ArrayList jobs = new ArrayList();
   private static int nextJobPos = 0;
   public static void AddJob(string jobName)

   {
      jobs.Add(jobName);

   }  

   public static string GetNextJob()

   {

      if (nextJobPos > jobs.Count - 1)

         return "NO JOBS IN BUFFER";

      else

      {

         string jobName = (string) jobs[nextJobPos];

         nextJobPos++;

         return jobName;

      }

   }

  

   public static void Main()

   {

      AddJob("1");

      AddJob("2");

      Console.WriteLine(GetNextJob());

      AddJob("3");

Console.WriteLine(GetNextJob());

      Console.WriteLine(GetNextJob());

      Console.WriteLine(GetNextJob());

      Console.WriteLine(GetNextJob());

      AddJob("4");

      AddJob("5");

      Console.WriteLine(GetNextJob());

   }

}

 

输出结果如下:

1

2

3

NO JOBS IN BUFFER

NO JOBS IN BUFFER

4

这种方法简单易懂,但效率却可怕得难以接受。因为,即使是任务被添加到buffer中后立即被处理,ArrayList的长度仍然会随着添加到buffer中的任务而不断增加。假设我们从缓冲区添加并移除一个任务需要一秒钟,这意味一秒钟内每调用AddJob()方法,就要调用一次ArrayList的Add()方法。随着Add()方法持续不断的被调用,ArrayList内部数组长度就会根据需求持续不断的成倍增长。五分钟后,ArrayList的内部数组增加到了512个元素的长度,这时缓冲区中却只有不到一个任务而已。照这样的趋势发展,只要程序继续运行,工作任务继续进入,ArrayList的长度自然会继续增长。

出现如此荒谬可笑的结果,原因是已被处理过的旧任务在缓冲区中的空间没有被回收。也即是说,当第一个任务被添加到缓冲区并被处理后,此时ArrayList的第一元素空间应该被再利用。想想上述代码的工作流程,当插入两个工作——AddJob("1")和AddJob("2")后——ArrayList的空间如图一所示:
图一:执行前两行代码后的ArrayList

注意这里的ArrayList共有16个元素,因为ArrayList初始化时默认的长度为16。接下来,调用GetNextJob()方法,移走第一个任务,结果如图二:


图二:调用GetNextJob()方法后的ArrayList

当执行AddJob(“3”)时,我们需要添加新任务到缓冲区。显然,ArrayList的第一元素空间(索引为0)被重新使用,此时在0索引处放入了第三个任务。不过别忘了,当我们执行了AddJob(“3”)后还执行了AddJob(“4”),紧接着用调用了两次GetNextJob()方法。如果我们把第三个任务放到0索引处,则第四个任务会被放到索引2处,问题发生了。如图三:
图三:将任务放到0索引时,问题发生

现在调用GetNextJob(),第二个任务从缓冲区中移走,nextJobPos指针指向索引2。因此,当再一次调用GetNextJob()时,第四个任务会先于第三个被移走,这就有悖于我们的“排队顺序”原则。

问题发生的症结在于ArrayList是以线性顺序存放任务列表的。因此我们需要将新任务添加到旧任务的右侧,以保证处理顺序的正确。而不管何时到达ArrayList的末端,ArrayList都会成倍增长,即使数组前端还存在因调用GetNextJob()方法而腾出的未被使用的元素空间。

解决之道是使我们的ArrayList成环形。环形数组没有固定的起点和终点。在数组中,我们用变量来维护数组的起止点。环形数组如图四所示:


图四:环形数组图示

在环形数组中,AddJob()方法添加新任务到索引endPos处(译注:endPos一般称为尾指针),之后“递增”endPos值。GetNextJob()方法则根据头指针startPos获取任务,并将该位置置为null,且“递增”startPos值。我之所以把“递增”两字加上引号,是因为这里所说的“递增”不仅仅是将变量值加1那么简单。为什么我们不能简单地加1呢?请考虑这个例子:当endPos等于15时,如果endPos加1,则endPos等于16。此时调用AddJob(),它试图去访问索引为16的元素,结果出现异常IndexOutOfRangeException。

事实上,当endPos等于15时,应将endPos重置为0。可以把这一检查封装成一个递增(increment)函数:如果加1后的值等于数组长度,则回绕为0。解决方案是将变量值对数组长度求模(取余),increment()方法的代码如下:

int increment(int variable)

{

  return (variable + 1) % theArray.Length;

}

注:取模操作符,如x % y,得到的是x 除以 y后的余数。余数总是在0 到 y-1之间。
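把上面的取模递增写成独立的小例子来验证回绕行为(示例代码,非原文所有,Increment为演示用的假想方法名):

```csharp
using System;

public class CircularIndexDemo
{
    // 环形数组中的“递增”:索引到达末端后回绕到0
    public static int Increment(int index, int length)
    {
        return (index + 1) % length;
    }

    public static void Main()
    {
        Console.WriteLine(Increment(14, 16));   // 15:正常加1
        Console.WriteLine(Increment(15, 16));   // 0:回绕到数组起点
    }
}
```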

这种方法好处就是缓冲区永远不会超过16个元素空间。但是如果我们要添加超过16个元素空间的新任务呢?就象ArrayList的Add()方法一样,我们需要提供环形数组自增长的能力,以倍数增长数组的长度。

System.Collection.Queue类

就象我们刚才描述的那样,我们需要提供一种数据结构,能够按照“排队顺序”的原则插入和移除元素项,并能最大化的利用内存空间,答案就是使用数据结构Queue。在.Net Framework基类库中已经内建了该类——System.Collections.Queue类。就象我们代码中的AddJob()和GetNextJob()方法,Queue类提供了Enqueue()和Dequeue()方法分别实现同样的功能。

Queue类在内部建立了一个存放object对象的环形数组,并通过head和tail变量指向该数组的头和尾。默认状态下,Queue初始化的容量为32,我们也可以通过其构造函数自定义容量。既然Queue内建的是object数组,因此可以将任何类型的元素放入队列中。

Enqueue()方法首先判断queue中是否有足够容量存放新元素。如果有,则直接添加元素,并使索引tail递增。在这里tail使用求模操作以保证tail不会超过数组长度。如果空间不够,则queue根据特定的增长因子扩充数组容量。增长因子默认值为2.0,所以内部数组的长度会增加一倍。当然你也可以在构造函数中自定义该增长因子。

Dequeue()方法根据head索引返回当前元素,之后将该位置置为null,再“递增”head的值。如果你只想知道队首元素的值,而不使其出列(dequeue),Queue类提供了Peek()方法。

Queue并不象ArrayList那样可以随机访问,这一点非常重要。也就是说,在没有使前两个元素出列之前,我们不能直接访问第三个元素。(当然,Queue类提供了Contains()方法,它可以使你判断特定的值是否存在队列中。)如果你想随机的访问数据,那么你就不能使用Queue这种数据结构,而只能用ArrayList。Queue最适合这种情况,就是你只需要处理按照接收时的准确顺序存放的元素项。

注:你可以将Queues称为FIFO数据结构。FIFO意为先进先出(First In, First Out),其意等同于“排队顺序(First come, first served)”。
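下面用一小段代码演示Queue类的Enqueue()、Dequeue()和Peek()方法的先进先出行为(示例代码,非原文所有,ProcessJobs为演示用的假想方法名):

```csharp
using System;
using System.Collections;

public class QueueDemo
{
    public static string ProcessJobs()
    {
        Queue jobs = new Queue();
        jobs.Enqueue("1");
        jobs.Enqueue("2");
        jobs.Enqueue("3");

        string first = (string) jobs.Peek();      // 查看队首元素但不出列
        string result = "";
        while (jobs.Count > 0)
            result += (string) jobs.Dequeue();    // 按先进先出的顺序移除
        return first + ":" + result;
    }

    public static void Main()
    {
        Console.WriteLine(ProcessJobs());   // 输出 1:123
    }
}
```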


“反排队顺序”——堆栈数据结构

Queue数据结构通过使用内部存储object类型的环形数组以实现“排队顺序”的机制。Queue提供了Enqueue()和Dequeue()方法实现数据访问。“排队顺序”在处理现实问题时经常用到,尤其是提供服务的程序,例如web服务器,打印队列,以及其他处理多请求的程序。

在程序设计中另外一个经常使用的方式是“反排队顺序(first come, last served)”。堆栈就是这样一种数据结构。在.Net Framework基类库中包含了System.Collections.Stack类,和Queue一样,Stack也是通过存储object类型数据对象的内部数组来实现。Stack通过两个方法访问数据——Push(item),将数据压入堆栈;Pop()则是将数据弹出堆栈,并返回其值。

一个Stack可以通过一个垂直的数据元素集合来形象地表示。当元素压入堆栈时,新元素被放到所有其他元素的顶端,弹出时则从堆栈顶端移除该项。下面两幅图演示了堆栈的压栈和出栈过程。首先按照顺序将数据1、2、3压入堆栈,然后弹出:
 
图五:向堆栈压入三个元素
 
图六:弹出所有元素后的Stack

注意Stack类的缺省容量是10个元素,而非Queue的32个元素。和Queue、ArrayList一样,Stack的容量也可以通过构造函数定制。如同ArrayList,Stack的容量也会自动成倍增长。(回忆一下:Queue的增长因子可以在构造函数中设置。)

注:Stack通常被称为LIFO(Last In, First Out,后进先出)数据结构,与Queue的FIFO相对应。
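与前面的Queue示例对应,下面的代码演示Stack的Push()和Pop()的后进先出行为(示例代码,非原文所有,PushThenPopAll为演示用的假想方法名):

```csharp
using System;
using System.Collections;

public class StackDemo
{
    public static string PushThenPopAll()
    {
        Stack s = new Stack();
        s.Push(1);
        s.Push(2);
        s.Push(3);                          // 栈顶此时为3

        string result = "";
        while (s.Count > 0)
            result += s.Pop().ToString();   // 按后进先出的顺序弹出
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(PushThenPopAll());   // 输出 321
    }
}
```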
堆栈:计算机科学中常见的隐喻
现实生活中有很多同Queue相似的例子:在DMV(美国机动车辆管理局)办事时的排队、打印任务处理等。然而在现实生活中很难找到和Stack近似的范例,但它在各种应用程序中却是一种非常重要的数据结构。

设想一下我们用以编程的计算机语言,例如C#。当执行C#程序时,CLR(公共语言运行时)使用堆栈以跟踪函数调用的执行情况。每当调用一个函数,相关信息就会压入堆栈;调用结束则弹出堆栈。堆栈顶端的数据为当前调用函数的信息。(如要查看函数调用堆栈的执行情况,可以在Visual Studio .NET下创建一个项目,设置断点(breakpoint),再执行调试。当执行到断点时,会在调试窗口(Debug/Windows/Call Stack)下显示堆栈信息。)

序数索引的限制

我们在第一部分中讲到数组的特点是同种类型数据的集合,并通过序数进行索引。即:访问第i个元素的时间为定值。(请记住此种定量时间被标记为O(1)。)

也许我们并没有意识到,其实我们对有序数据总是“情有独钟”。例如员工数据库。每个员工以社保号(social security number)为其唯一标识。社保号的格式为DDD-DD-DDDD(D的范围为数字0——9)。如果我们有一个随机排列存储所有员工信息的数组,要查找社保号为111-22-3333的员工,可能会遍历数组的所有元素——即执行O(n)次操作。更好的办法是根据社保号进行排序,可将其查找时间缩减为O(log n)。

理想状态下,我们更愿意执行O(1)次时间就能查找到某员工的信息。一种方案是建立一个巨型的数组,以实际的社保号值为其入口。这样数组的起止点为000-00-0000到999-99-9999,如下图所示:
 
图七:存储所有9位数数字的巨型数组

如图所示,每个员工的信息都包括姓名、电话、薪水等,并以其社保号为索引。在这种方式下,访问任意一个员工信息的时间均为定值。这种方案的缺点就是空间极度的浪费——共有10⁹,即10亿个不同的社保号。如果公司只有1000名员工,那么这个数组只利用了0.0001%的空间。(换个角度来看,如果你要让这个数组充分利用,也许你的公司不得不雇佣全世界人口的六分之一。)

用哈希函数压缩序数索引

显而易见,创建10亿个元素数组来存储1000名员工的信息是无法接受的。然而我们又迫切需要提高数据访问速度以达到一个常量时间。一种选择是使用员工社保号的最后四位来减少社保号的跨度。这样一来,数组的跨度只需要从0000到9999。图八显示了压缩后的数组。
 
图八:压缩后的数组

此方案既保证了访问耗时为常量值,又充分利用了存储空间。选择社保号的后四位是随机的,我们也可以任意的使用中间四位,或者选择第1、3、8、9位。

在数学上,将这种9位数转换为4位数的做法称为哈希转换(hashing)。经哈希转换压缩后用来存放数据的数组称为哈希表(hash table)。

哈希函数实现哈希转换。以社保号的例子来说,哈希函数H()表示为:
H(x) = x 的后四位

哈希函数的输入可以是任意的九位社保号,而结果则是社保号的后四位数字。数学术语中,这种将九位数转换为四位数的方法称为哈希元素映射,如图九所示:
 
图九:哈希函数图示

图九阐明了在哈希函数中会出现的一种行为——冲突(collisions)。即我们将一个相对大的集合的元素映射到相对小的集合中时,可能会出现相同的值。例如社保号中所有后四位为0000的均被映射为0000。那么000-99-0000,113-14-0000,933-66-0000,还有其他很多都将是0000。

看看之前的例子,如果我们要添加一个社保号为123-00-0191的新员工,会发生什么情况?显然试图添加该员工会发生冲突,因为在0191位置上已经存在一个员工。
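“取后四位”这一哈希函数及其冲突可以用一小段代码来演示(示例代码,非原文所有,Hash为演示用的假想方法名):

```csharp
using System;

public class SsnHashDemo
{
    // 假想的哈希函数:取社保号(去掉连字符后)的后四位作为哈希值
    public static string Hash(string ssn)
    {
        string digits = ssn.Replace("-", "");
        return digits.Substring(digits.Length - 4);
    }

    public static void Main()
    {
        Console.WriteLine(Hash("111-22-3333"));   // 3333
        // 两个不同的社保号映射到同一位置——发生冲突
        Console.WriteLine(Hash("000-99-0000") == Hash("113-14-0000"));   // True
    }
}
```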

数学标注:哈希函数在数学术语上更多地被描述为f:A->B。其中|A|>|B|,函数f不是一一映射关系,所以之间会有冲突。

显然冲突的发生会产生一些问题。在下一节,我们会看看哈希函数与冲突发生之间的关系,然后简单地看一下处理冲突的几种机制。接下来,我们会将注意力放在System.Collections.Hashtable类,并提供一个哈希表的实现。我们会了解有关Hashtable类的哈希函数,冲突解决机制,以及一些使用Hashtable的例子。

避免和解决冲突

当我们添加数据到哈希表中,冲突是导致整个操作被破坏的一个因素。如果没有冲突,则插入元素操作成功,如果发生了冲突,就需要判断发生的原因。由于冲突产生提高了代价,我们的目标就是要尽可能将冲突压至最低。

哈希函数中冲突发生的频率与传递到哈希函数中的数据分布有关。在我们的例子中,假定社保号是随机分配的,那么使用最后四位数字是一个不错的选择。但如果社保号是以员工的出生年份或出生地址来分配,因为员工的出生年份和地址显然都不是均匀分配的,那么选用后四位数就会因为大量的重复而导致更大的冲突。

注:对哈希函数值分布的分析需要具备一定的统计学知识,这超出了本文讨论的范围。简单地说,对于一个具有k个位置(k slots)的哈希表,理想的哈希函数应将定义域中任意一个随机值映射到k个位置中每一个的概率都接近1/k。(如果这让你更加糊涂,千万别担心!)

我们将选择合适的哈希函数的方法称为冲突避免机制(collision avoidance)。已有许多研究涉及这一领域,因为哈希函数的选择直接影响了哈希表的整体性能。在下一节,我们会介绍.Net Framework的Hashtable类中对哈希函数的使用。

有很多方法处理冲突问题。最直接的方法,我们称为“冲突解决机制”(collision resolution),是将要插入到哈希表中的对象放到另外一块空间中,因为实际的空间已经被占用了。其中一种最简单的方法称为“线性挖掘”(linear probing),实现步骤如下:
1. 当要插入一个新的元素时,用哈希函数在哈希表中定位;
2. 检查表中该位置是否已经存在元素,如果该位置内容为空,则插入并返回,否则转向步骤3。
3. 如果该位置为i,则检查i+1是否为空,如果已被占用,则检查i+2,依此类推,直到找到一个内容为空的位置。

例如:如果我们要将五个员工的信息插入到哈希表中:Alice(333-33-1234),Bob(444-44-1234), Cal (555-55-1237), Danny (000-00-1235), and Edward (111-00-1235)。当添加完信息后,如图十所示:
 
图十:有相似社保号的五位员工

Alice的社保号被“哈希”(这里做动词用,译注)为1234,因此存放位置为1234。接下来,Bob的社保号也被“哈希”为1234,但由于位置1234处已经存在Alice的信息,所以Bob的信息就被放到下一个位置——1235。之后,添加Cal,哈希值为1237,1237位置为空,所以Cal就放到1237处。下一个是Danny,哈希值为1235。1235已被占用,则检查1236位置是否为空。既然为空,Danny就被放到那儿。最后,添加Edward的信息。同样他的哈希值为1235。1235已被占用,检查1236,也被占用了,再检查1237,直到检查到1238时,该位置为空,于是Edward被放到了1238位置。

搜索哈希表时,冲突仍然存在。例如,如上所示的哈希表,我们要访问Edward的信息。因此我们将Edward的社保号111-00-1235哈希为1235,并开始搜索。然而我们在1235位置找到的是Bob,而非Edward。所以我们再搜索1236,找到的却是Danny。我们的线性搜索继续查找,直到找到Edward,或者遇到内容为空的位置——后一种情况即可断定社保号为111-00-1235的员工并不存在。

线性挖掘虽然简单,但并不是解决冲突的好策略,因为它会导致同类聚合(clustering)。如果我们要添加10个员工,他们的社保号后四位均为3344。那么有10个连续空间,从3344到3353均被占用。查找这10个员工中的任一员工都要搜索这一簇位置空间。而且,添加任何一个哈希值在3344到3353范围内的员工都将增加这一簇空间的长度。要快速查询,我们应该让数据均匀分布,而不是集中在某几个地方形成一簇。

更好的挖掘技术是“二次挖掘”(quadratic probing),每次检查位置空间的步长以平方倍增加。也就是说,如果位置s被占用,则依次检查s+1²、s-1²、s+2²、s-2²、s+3²,依此类推,而不是像线性挖掘那样按s+1,s+2……线性增长。当然,二次挖掘同样会导致同类聚合。

下一节我们将介绍第三种冲突解决机制——二度哈希,它被应用在.Net Framework的哈希表类中。

System.Collections.Hashtable 类
.Net Framework 基类库包括了Hashtable类的实现。当我们要添加元素到哈希表中时,我们不仅要提供元素(item),还要为该元素提供关键字(key)。Key和item可以是任意类型。在员工例子中,key为员工的社保号,item则通过Add()方法被添加到哈希表中。

要获得哈希表中的元素(item),你可以通过key作为索引访问,就象在数组中用序数作为索引那样。下面的C#小程序演示了这一概念。它以字符串值作为key添加了一些元素到哈希表中。并通过key访问特定的元素。

using System;
using System.Collections;

public class HashtableDemo
{
   private static Hashtable ages = new Hashtable();

   public static void Main()
   {
        // Add some values to the Hashtable, indexed by a string key
        ages.Add("Scott", 25);
        ages.Add("Sam", 6);
        ages.Add("Jisun", 25);
       
        // Access a particular key
        if (ages.ContainsKey("Scott"))
        {
            int scottsAge = (int) ages["Scott"];
            Console.WriteLine("Scott is " + scottsAge.ToString());
        }
        else
            Console.WriteLine("Scott is not in the hash table...");
   }
}
程序中的ContainsKey()方法,是根据特定的key判断是否存在符合条件的元素,返回布尔值。Hashtable类中包含keys属性(property),返回哈希表中使用的所有关键字的集合。这个属性可以通过遍历访问,如下:

// Step through all items in the Hashtable
foreach(string key in ages.Keys)
Console.WriteLine("Value at ages[/"" + key + "/"] = " + ages[key].ToString());

要认识到,插入元素的顺序和关键字集合中key的顺序并不一定相同。关键字集合的顺序取决于各元素在哈希表中实际存储的位置。上面程序的运行结果是:

Value at ages["Jisun"] = 25
Value at ages["Scott"] = 25
Value at ages["Sam"] = 6

即使插入到哈希表中的顺序是:Scott,Sam, Jisun。

Hashtable类的哈希函数

Hashtable类中的哈希函数比我们前面介绍的社保号的哈希函数更加复杂。首先,要记住的是哈希函数返回的值是序数。对于社保号的例子来说很容易办到,因为社保号本身就是数字。我们只需要截取其最后四位数,就可以得到合适的哈希值。然而Hashtable类可以接受任何类型的值作为key。就像上面的例子,key是字符串类型,如“Scott”或“Sam”。在这种情况下,我们自然想知道哈希函数是怎样将string转换为数字的。

这种奇妙的转换归功于GetHashCode()方法,它定义在System.Object类中。GetHashCode()的默认实现是返回一个唯一的整数值,并保证在object的生命期内不被修改。既然每种类型都直接或间接从Object派生,所有object都可以访问该方法。自然,字符串或其他类型都能以唯一的数字值来表示。

Hashtable类中的对于哈希函数的定义如下:

H(key) = [GetHash(key) + 1 + (((GetHash(key) >> 5) + 1) % (hashsize – 1))] % hashsize

这里的GetHash(key)默认为对key调用GetHashCode()方法的返回值(在使用Hashtable时,也可以自定义GetHash()函数)。GetHash(key) >> 5表示将key的哈希值向右移动5位,相当于将哈希值除以32。%操作符就是之前介绍的求模运算符。hashsize指的是哈希表的总位置数。由于进行了求模,最后的结果H(key)在0到hashsize-1之间,总是落在哈希表的有效范围内。
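按上述公式做一次演算(示例代码,非原文所有;其中getHash的取值1234与hashsize的取值11均为演示用的假设数据):

```csharp
using System;

public class HashFormulaDemo
{
    // 按正文公式计算key在哈希表中的位置
    public static int H(int getHash, int hashsize)
    {
        return (getHash + 1 + (((getHash >> 5) + 1) % (hashsize - 1))) % hashsize;
    }

    public static void Main()
    {
        int pos = H(1234, 11);
        Console.WriteLine(pos);                    // 1234>>5=38;(38+1)%10=9;(1234+1+9)%11=1
        Console.WriteLine(pos >= 0 && pos < 11);   // True:结果必然落在0到hashsize-1之间
    }
}
```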

Hashtable类中的冲突解决方案

当我们在哈希表中添加或获取一个元素时,会发生冲突。插入元素时,必须查找内容为空的位置,而获取元素时,即使不在预期的位置处,也必须找到该元素。前面我们简单地介绍了两种解决冲突的机制——线性和二次挖掘。在Hashtable类中使用的是一种完全不同的技术,称为二度哈希(rehashing)(有的资料也将其称为双重哈希,double hashing)。

二度哈希的工作原理如下:有一个包含多个哈希函数(H1……Hn)的集合。当我们要从哈希表中添加或获取元素时,首先使用哈希函数H1。如果导致冲突,则尝试使用H2,一直到Hn。各个哈希函数极其相似,不同的是它们选用的乘法因子。通常,哈希函数Hk的定义如下:
Hk(key) = [GetHash(key) + k * (1 + (((GetHash(key) >> 5) + 1) % (hashsize – 1)))] % hashsize

注:运用二度哈希,重要的是在执行了hashsize次挖掘后,哈希表中的每一个位置都恰好被访问一次。也就是说,对于给定的key,哈希表中的同一位置不会被Hi和Hj重复挖掘。Hashtable类使用的二度哈希公式保证了:(1 + (((GetHash(key) >> 5) + 1) % (hashsize – 1))与hashsize互质。(两数互质表示两者没有公共的质因子。)如果hashsize是一个素数,则可以保证这两个数互质。

二度哈希较前两种机制较好地避免了冲突。

装载因子(load factor)和扩充哈希表

Hashtable类中包含一个私有成员变量loadFactor,它指定了哈希表中元素个数与表位置总数之间的最大比例。例如:loadFactor等于0.5,则说明哈希表中最多只有一半的位置存放了元素值,其余一半皆为空。

Hashtable的构造函数以重载的方式,允许用户指定loadFactor值,范围为0.1到1.0。要注意的是,不管你提供的值是多少,内部都会自动将其乘以72%。即使你传递的值为1.0,Hashtable类实际使用的loadFactor值也只是0.72。微软认为0.72是性能的最佳平衡值,因此虽然默认的loadFactor为1.0,系统内部却自动将其调整为0.72。所以,建议你使用缺省值1.0(事实上是0.72,有些令人迷惑,不是吗?)

注:我花了好几天时间去咨询微软的开发人员为什么要使用这种自动转换。我弄不明白,为什么他们不直接规定取值为0.072到0.72之间。最后我从编写Hashtable类的开发团队得到了答案,他们很乐意将缘由公诸于众。这个团队经过测试发现,如果loadFactor超过了0.72,将会严重地影响哈希表的性能。他们希望开发人员能够更好地使用哈希表,但开发人员可能记不住0.72这个无规律的数;相反,如果规定1.0为最佳值,开发者会更容易记住。于是就形成了现在的结果:虽然在功能上有少许牺牲,但却使我们能更加方便地使用这种数据结构,而不用感到头疼。

向Hashtable类添加新元素时,都要进行检查以保证元素与空间大小的比例不会超过最大比例。如果超过了,哈希表空间将被扩充。步骤如下:
1. 哈希表的位置空间近似地成倍增加。准确地说,位置空间值从当前的素数增加到大于其两倍的下一个素数。(回想一下前面讲到的二度哈希的工作原理,哈希表的位置空间值必须是素数。)
2. 既然二度哈希得到的位置依赖于哈希表的位置空间值,所以表中所有元素也需要重新哈希(因为第一步中位置空间值增加了)。

幸运的是,Hashtable类中的Add()方法隐藏了这些复杂的步骤,你不需要关心它的实现细节。

装载因子(load factor)对冲突的影响决定于哈希表的总体密度。load factor越大,哈希表越密集,空闲空间就越少;相比于相对稀疏的哈希表,发生冲突时进行挖掘操作的次数就越多。如果不作精确的分析,当冲突发生时挖掘操作的预期次数大约为1/(1-lf),这里lf指的是load factor。

如前所述,微软将哈希表的缺省装载因子设定为0.72。因此对于每次冲突,平均挖掘次数约为3.5次。既然该数字与哈希表中实际元素个数无关,哈希表的渐进访问时间为O(1),显然远远好于数组的O(n)。

最后,我们要认识到对哈希表的扩充将以性能损耗为代价。因此,你应该预先估计你的哈希表中最后可能会容纳的元素总数,在初始化哈希表时以合适的值进行构造,以避免不必要的扩充。

本文是"考察数据结构"系列文章的第三部分,讨论的是.Net Framework基类库没有包括的常用数据结构:
二叉树。就像以线性方式排列数据的数组一样,我们可以将二叉树想象为以二维方式来存储数据。其中一种特殊的二叉树,我们称为二叉搜索树(binary search tree),简称BST,它的数据搜索能力比一般数组更加优化。
 
目录:
简介
在树中排列数据
理解二叉树
用BSTs改善数据搜索时间
现实世界的二叉搜索树
 
简介:
 
在本系列的第一部分,我们讲述了什么是数据结构,怎么评估它们的性能,以及怎样根据其性能选择具体的数据结构来处理特定的算法。另外,我们复习了数据结构的基础知识,了解了最常用的数据结构——数组及与其相关的ArrayList。在第二部分,我们讲述了ArrayList的两个兄弟——堆栈和队列。它们存储数据的方式与ArrayList非常相似,但它们访问数据的方式受到了限制。我们还讨论了哈希表,它可以以任意对象作为其索引,而非一般所用的序数。
 
ArrayList,堆栈,队列和哈希表从存储数据的表现形式看,都可以认为是一种数组结构。这意味着,这四种数据结构都将受到数组边界的限制。回想第一部分所讲的,数组在内存中以线性方式存储数据,当数组容量到达最大值时需要显式地改变其容量,并且对数组的搜索耗费线性时间。
 
本部分,我们将考察一种全新的数据结构——二叉树。它以一种非线性的方式存储数据。之后,我们还将介绍一种更具特色的二叉树——二叉搜索树(BST)。BST规定了排列树中每个元素项的一些规则。这些规则保证了BST能够以一种低于线性的搜索时间来搜索数据。
 
 
在树中排列数据
 
如果我们看过家谱,或者是一家公司的组织结构图,那么事实上你已经明白在树中数据的排列方式了。树由许多节点的集合组成,这些节点又有许多相关联的数据和“孩子”。子节点就是直接处于节点之下的节点。父节点则位于与节点直接关联的上方。树的根是一个不包含父节点的单节点。
 
图1显示了公司职员的组织结构图。
图一
 
例中,树的根为Bob Smith,是公司的CEO。这个节点为根节点是因为其上没有父亲。Bob Smith节点有一个孩子Tina Jones,公司总裁。其父节点为Bob Smith。Tina Jones有三个孩子:
Jisun Lee, CIO
Frank Mitchell, CFO
Davis Johnson, VP of Sales
这三个节点的父亲都是Tina Jones节点。
 
所有的树都有如下共同的特性:
1、只有一个根;
2、除了根节点,其他所有节点有且只有一个父节点;
3、没有环结构。从任意一个节点开始,都没有回到起始节点的路径。正是前两个特性保证没有环结构的存在。
 
对于有层次关系的数据而言,树非常有用。后面我们会讲到,当我们有技巧地以层次关系排列数据时,搜索每个元素的时间会显著减少。在此之前,我们首先需要讨论的是一种特殊的树:二叉树。
 
理解二叉树
 
二叉树是一种特殊的树,因为它的所有节点最多只能有两个子节点。并且,对于二叉树中指定的节点,第一个子节点必须指向左孩子,第二个节点指向右孩子。如图二所示:
图二
 
二叉树(a)共有8个节点,节点1为根。节点1的左孩子为节点2,右孩子为节点3。注意,节点并不要求同时具有左孩子和右孩子。例如,二叉树(a)中,节点4就只有一个右孩子。甚至于,节点也可以没有孩子。如二叉树(b),节点4、5、6都没有孩子。
 
没有孩子的节点称为叶节点。有孩子的节点称为内节点。如图二,二叉树(a)中节点6、8为叶节点,节点1、2、3、4、5、7为内节点。
 
不幸的是,.Net Framework中并不包含二叉树类,为了更好地理解二叉树,我们需要自己来创建这个类。
 
第一步:创建节点类Node
 
节点类Node抽象地表示了树中的一个节点。认识到二叉树中节点应包括两个内容:
1、 数据;
2、 子节点:0个、1个、2个;
 
节点存储的数据依赖于你的实际需要。就像数组可以存储整型、字符串和其他类类型的实例一样,节点也应该如此。因此我们应该将节点类存储的数据类型设为object。
 
注意:在C# 2.0版中可以用泛型来创建强类型的节点类,这样比使用object类型更好。要了解更多使用泛型的信息,请阅读Juval Lowy的文章: An Introduction to C# Generics
 
下面是节点类的代码:
 
public class Node
{
   private object data;
   private Node left, right;
 
   #region Constructors
   public Node() : this(null) {}
   public Node(object data) : this(data, null, null) {}
   public Node(object data, Node left, Node right)
   {
      this.data = data;
      this.left = left;
      this.right = right;
   }
   #endregion
 
   #region Public Properties
   public object Value
   {
      get
      {
         return data;
      }
      set
      {
         data = value;
      }
   }
 
   public Node Left
   {
      get
      {
         return left;
      }
      set
      {
         left = value;
      }
   }
 
   public Node Right
   {
      get
      {
         return right;
      }
      set
      {
         right = value;
      }
   }
   #endregion
}
 
注意类Node有三个私有成员:
1、 data,object类型:为节点存储的数据;
2、 left,Node类型:指向Node的左孩子;
3、 right,Node类型:指向Node的右孩子;
类的其余部分为构造函数和公共属性,用于访问这三个私有成员变量。注意,left和right私有变量为Node类型,就是说Node类的成员中包含Node类的实例本身。
 
创建二叉树类BinaryTree
 
创建好Node类后,紧接着创建BinaryTree类。BinaryTree类包含了一个私有字段——root——它是Node类型,表示二叉树的根。这个私有字段以公有字段的方式暴露出来。
 
BinaryTree类只有一个公共方法Clear(),它用来清除树中所有元素。Clear()方法只是简单地将根节点置为空null。代码如下:
public class BinaryTree
{
   private Node root;
 
   public BinaryTree()
   {
      root = null;
   }
 
   #region Public Methods
   public virtual void Clear()
   {
      root = null;
   }
   #endregion
 
   #region Public Properties
   public Node Root
   {
      get
      {
         return root;
      }
      set
      {
         root = value;
      }
   }
   #endregion
}
 
下面的代码演示了怎样使用BinaryTree类来生成与图二所示的二叉树(a)相同的数据结构:
BinaryTree btree = new BinaryTree();
btree.Root = new Node(1);
btree.Root.Left = new Node(2);
btree.Root.Right = new Node(3);
 
btree.Root.Left.Left = new Node(4);
btree.Root.Right.Right = new Node(5);
 
btree.Root.Left.Left.Right = new Node(6);
btree.Root.Right.Right.Right = new Node(7);
 
btree.Root.Right.Right.Right.Right = new Node(8);
 
注意,我们创建BinaryTree类的实例后,要创建根节点(root)。我们必须人工地为相应的左、右孩子添加新节点类Node的实例。例如,添加节点4,它是根节点的左节点的左节点,我们的代码是:
btree.Root.Left.Left = new Node(4);
 
回想一下我们在第一部分中提到的,数组元素是存放在连续的内存块中,因此定位时间为常量。也就是说,访问特定元素所耗费的时间与数组元素的个数无关。
 
然而,二叉树却不是连续地存放在内存中,如图三所示。事实上,BinaryTree类的实例指向root Node类实例。而root Node类实例又分别指向它的左右孩子节点实例,以此类推。关键在于,组成二叉树的不同的Node实例是分散地放在CLR托管堆中。他们没有必要像数组元素那样连续存放。
图三
 
如果我们要访问二叉树中的特定节点,我们需要搜索二叉树的每个节点。它不能象数组那样根据指定的节点直接访问。搜索二叉树要耗费线性时间,最坏情况是查询所有的节点。也就是说,当二叉树节点个数增加时,查找任意节点的步骤数也将相应地增加。
 
因此,如果二叉树的定位时间为线性,查询时间也为线性,那怎么说二叉树比数组更好呢?因为数组的查询时间虽然也是线性,但定位时间却是常量啊?是的,一般的二叉树确实不能提供比数组更好的性能。然而当我们有技巧地排列二叉树中的元素时,我们就能很大程度改善查询时间(当然,定位时间也会得到改善)。
 
用BSTs改善数据搜索时间
 
二叉搜索树是一种特殊的二叉树,它改善了二叉树数据搜索的效率。二叉搜索树有以下属性:对于任意一个节点n,其左子树下的每个后代节点的值都小于节点n的值;而其右子树下的每个后代节点的值都大于节点n的值。
 
所谓节点n的子树,可以将其看作是以节点n为根节点的树。因此,子树的所有节点都是节点n的后代,而子树的根则是节点n本身。图四演示了子树的概念和二叉搜索树的属性。
图四
 
图五显示了二叉树的两个例子。图右,二叉树(b),是一个二叉搜索树(BST),因为它符合二叉搜索树的属性。而二叉树(a)则不是二叉搜索树:节点10的右孩子节点8小于节点10,却出现在节点10的右子树中;同样,节点8的右孩子节点4小于节点8,也出现在了它的右子树中。不管是在哪个位置,只要不符合二叉搜索树的属性,就不是二叉搜索树。例如,节点9的右子树就不应包含小于节点9的节点(8和4)。
图五
 
从二叉搜索树的属性可知,BST各节点存储的数据必须和另外的节点进行比较。给出任意两个节点,BST必须能够判断这两个节点的值是小于、大于还是等于。
 
现在,设想一下,我们要查找BST的某个特定的节点。例如图五中的二叉搜索树(b),我们要查找节点10。BST和一般的二叉树一样,都只有一个根节点。那么如果节点10存在于树中,搜索这棵树的最佳方案是什么?有没有比搜索整棵树更好的方法?
 
如果节点10存在于树中,我们从根开始。可以看到,根节点的值为7,小于我们要查找的节点值。因此,一旦节点10存在,必然存在其右子树。所以应该跳到节点11继续查找。此时,节点10小于节点11的值,必然存在于节点11的左子树中。移到节点11的左孩子,此时我们已经找到了目标节点,定位于此。
 
如果我们要查找的节点在树中不存在,会发生什么?例如我们查找节点9。重复上述操作,直到到达节点10,它大于节点9,那么如果节点9存在,必然是在节点10的左子树中。然而我们看到节点10根本就没有左孩子,因此节点9在树中不存在。
 
正式地,我们的搜索算法如下所示。假定我们要查找节点n。算法不断地比较数值的大小,直到找到该节点,或遇到空值。每一步我们都要处理两个节点:树中的当前节点c和要查找的节点n,并比较c和n的值。c的初始值为BST的根节点。然后执行以下步骤:
1、 如果c值为null,则n不在BST中;
2、 比较c和n的值;
3、 如果值相同,则找到了指定节点n;
4、 如果n的值小于c,那么如果n存在,必然在c的左子树中。因此回到第一步,将c的左孩子作为c;
5、 如果n的值大于c,那么如果n存在,必然在c的右子树中。因此回到第一步,将c的右孩子作为c;
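上述搜索步骤可以草拟成如下代码(示例实现,非原文所有;为便于比较,这里使用int值,Node为简化版的节点类,Contains为演示用的假想方法名):

```csharp
using System;

public class Node
{
    public int Value;
    public Node Left, Right;
    public Node(int value) { Value = value; }
}

public class BstSearchDemo
{
    // 按正文的算法在BST中查找value,找到返回true
    public static bool Contains(Node root, int value)
    {
        Node c = root;
        while (c != null)              // 步骤1:c为null则n不在BST中
        {
            if (value == c.Value)      // 步骤2、3:值相同,找到了
                return true;
            else if (value < c.Value)  // 步骤4:转向左子树
                c = c.Left;
            else                       // 步骤5:转向右子树
                c = c.Right;
        }
        return false;
    }

    public static void Main()
    {
        // 构建与正文例子类似的小BST:根为7,右子树为11,11的左孩子为10
        Node root = new Node(7);
        root.Left = new Node(3);
        root.Right = new Node(11);
        root.Right.Left = new Node(10);

        Console.WriteLine(Contains(root, 10));   // True
        Console.WriteLine(Contains(root, 9));    // False:10没有左孩子
    }
}
```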
 
分析BST搜索算法
 
通过BST查找节点,理想情况下我们需要检查的节点数可以减半。如图六的BST树,包含了15个节点。从根节点开始执行搜索算法,第一次比较决定我们是移向左子树还是右子树。对于任意一种情况,一旦执行这一步,我们需要访问的节点数就减少了一半,从15降到了7。同样,下一步访问的节点也减少了一半,从7降到了3,以此类推。
图六
 
这里一个重要概念就是:算法的每一步在理想状态下都将使被访问的节点数减少一半。比较一下数组的搜索算法:搜索数组时,最坏情况要检查所有元素,每个元素检查一次。也就是说,搜索有n个元素的数组,从第一个元素开始,最多要访问n个元素。而有n个节点的二叉搜索树,在访问了根节点后,只需要再搜索其中约n/2个节点。
 
搜索二叉树与搜索排序数组相似。例如,你要在电话薄中查找是否有John King。你可以从电话薄的中间开始查找,即从以M开头的姓氏开始。按照字母顺序,K在M之前,那么你可以将M之前的部分再折半,此时可能到了字母H。因为K在H之后,那么再将H到M这部分折半。这次你找到了字母K,可以马上看到电话薄里有没有John King。
 
搜索BST与之相似。BST的中点是根节点。然后从上到下,转向你所需要的左孩子或右孩子,每一步都将排除约一半的节点。根据这一特点,这个算法的运行时间为log₂n,简写为log n。回想我们在第一部分讨论的数学知识,log₂n = y 相当于 2ʸ = n。即,节点数增加到n,搜索时间只缓慢地增加到log₂n。图七表示了log₂n和线性增长的增长率之间的区别。运行时间为log₂n的算法对应下面那条曲线。
 
图七
 
可以看出,这条对数曲线几乎是水平线,随着n值的增加,曲线增长缓慢。举例来说,搜索一个具有1000个元素的数组,最坏要检查1000个元素,而搜索一个具有1000个元素的BST树,仅需要检查不到10个节点(log₂1024 = 10)。
 
在分析BST搜索算法中,我不断地重复“理想地(ideally)”这个字眼儿。这是因为BST实际的搜索时间要依赖于节点的拓扑结构,也就是说节点之间的布局关系。象图六中所示的二叉树,每一步比较操作都可以使搜索时间减半。然而,我们来看看图八所示的BST树,它的拓扑结构是与数组的排列方式是同构的。
图八
 
搜索图八中的BST树,仍然要耗费线性时间,因为每比较一步,仅仅减少了一个节点,而非像图六中那样减半。
 
因此,搜索BST所耗费的时间要依赖于它的拓扑结构。最佳情况下,耗费时间为log₂n,最坏情况则要耗费线性时间。在下一节我们将看到,BST的拓扑结构与插入节点的顺序有关。因此,插入节点的顺序将直接影响BST搜索算法的耗时。
 
插入节点到BST中
 
我们已经知道了在BST中查询一个特定节点的方法,但是我们还应该掌握插入一个新节点的方法。向二叉搜索树插入一个新节点,不能任意而为,必须遵循二叉搜索树的特性。
 
通常我们插入的新节点都是作为叶节点。唯一的问题是,怎样查找合适的节点,使其成为这个新节点的父节点。与搜索算法相似,我们首先应该比较节点c和要插入的新节点n。我们还需要跟踪节点c的父节点。初始状态下,c节点为树的根节点,父节点为null。定位一个新的父节点遵循如下算法:
1、 如果c指向null,则parent作为n的父节点。如果n的值小于parent的值,则n作为parent新的左孩子;否则n作为parent新的右孩子;
2、 比较n和c的值;
3、 如果c等于n,则说明试图插入一个重复的节点。此时要么直接抛弃该节点,要么抛出异常。(注意,在BST中节点的值必须是唯一的。)
4、 如果n小于c,则n必然在c的左子树中。让父节点等于c,c等于c的左孩子,返回到第一步。
5、 如果n大于c,则n必然在c的右子树中。让父节点等于c,c等于c的右孩子,返回到第一步。
当合适的叶节点找到后,算法结束。将新节点放到BST中使其成为父节点合适的孩子节点。插入算法中有种特例需要考虑。如果BST树中没有根节点,则父节点为空,那么添加新节点作为父节点的孩子这一步就忽略。而且在这种情况下,BST的根节点必须分配为新节点。
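上述插入算法(包括空树这一特例)可以草拟成如下代码(示例实现,非原文所有;BstNode为简化版的节点类,Insert为演示用的假想方法名):

```csharp
using System;

public class BstNode
{
    public int Value;
    public BstNode Left, Right;
    public BstNode(int value) { Value = value; }
}

public class BstInsertDemo
{
    // 按正文算法插入新值,返回(可能更新的)根节点;重复值被直接抛弃
    public static BstNode Insert(BstNode root, int value)
    {
        BstNode parent = null, c = root;
        while (c != null)
        {
            if (value == c.Value) return root;    // 重复值,直接抛弃
            parent = c;                           // 跟踪c的父节点
            c = (value < c.Value) ? c.Left : c.Right;
        }
        BstNode n = new BstNode(value);
        if (parent == null) return n;             // 特例:空树,新节点成为根
        if (value < parent.Value) parent.Left = n;
        else parent.Right = n;
        return root;
    }

    public static void Main()
    {
        BstNode root = null;
        foreach (int v in new int[] { 4, 2, 5, 1, 3, 6 })   // 正文建议的插入顺序
            root = Insert(root, v);
        Console.WriteLine(root.Value);          // 4作为根
        Console.WriteLine(root.Left.Value);     // 2作为4的左孩子
        Console.WriteLine(root.Right.Value);    // 5作为4的右孩子
    }
}
```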
 
图九描述了BST插入算法:
图九
BST插入算法和搜索算法的运行时间相同:最佳情况为log₂n,最坏情况为线性时间。之所以相同,是因为它为新节点定位插入位置所采取的策略与搜索是一致的。
 
节点插入顺序决定BST的拓扑结构
 
既然新插入的节点是作为叶节点插入的,则插入的顺序将直接影响BST自身的拓扑结构。例如,我们依次插入节点:1,2,3,4,5,6。当插入节点1时,作为根节点。接着插入2作为1的右孩子,插入3作为2的右孩子,4作为3的右孩子,以此类推。结果BST就形成如图八那样的结构。
 
如果我们有技巧地排列插入值1,2,3,4,5,6的顺序,则BST树将伸展得更宽,看起来更像图六所示的结构。理想的插入顺序是:4,2,5,1,3,6。这样将4作为根节点,2作为4的左孩子,5作为4的右孩子,1和3分别作为2的左孩子和右孩子,而6则作为5的右孩子。
 
既然BST的拓扑结构将影响搜索、插入和删除(下一节介绍)操作的时间复杂度,那么以升序或降序(或近似升序降序)的方式插入数据,会极大地破坏BST的效率。在本文的后面将详细地讨论。
 
从BST中删除节点
 
从BST中删除节点比之插入节点难度更大。因为删除一个非叶节点,就必须选择其他节点来填补因删除节点所造成的树的断裂。如果不选择节点来填补这个断裂,那么二叉搜索树就违背了它的特性。例如,图六中的二叉搜索树。如果删除节点150,就需要某些节点来填补删除造成的断裂。如果我们随意地选择,比如选择92,那么就违背了BST的特性,因为这个时候节点95和111出现在了92的左子树中,而它们的值是大于92的。
 
删除节点算法的第一步是定位要删除的节点。这可以使用前面介绍的搜索算法,因此运行时间为log₂n。接着应该选择合适的节点来代替被删除节点的位置,共有三种情况需要考虑,后面的图十有图例说明。
 
情况1:如果删除的节点没有右孩子,那么就选择它的左孩子来代替原来的节点。二叉搜索树的特性保证了被删除节点的左子树必然符合二叉搜索树的特性。因此左子树的值要么都大于,要么都小于被删除节点的父节点的值,这取决于被删除节点是左孩子还是右孩子。因此用被删除节点的左子树来替代被删除节点,是完全符合二叉搜索树的特性的。
 
情况2:如果被删除节点的右孩子没有左孩子,那么这个右孩子被用来替换被删除节点。因为被删除节点的右孩子都大于被删除节点左子树的所有节点。同时也大于或小于被删除节点的父节点,这同样取决于被删除节点是左孩子还是右孩子。因此,用右孩子来替换被删除节点,符合二叉搜索树的特性。
 
情况3:最后,如果被删除节点的右孩子有左孩子,就需要用被删除节点右孩子的左子树中的最下面的节点来替代它,就是说,我们用被删除节点的右子树中最小值的节点来替换。
注意:我们要认识到,在BST中,最小值的节点总是在最左边,最大值的节点总是在最右边。
因为替换选择了被删除节点右子树中最小的一个节点,这就保证了该节点一定大于被删除节点左子树的所有节点,同时,也保证它替代了被删除节点的位置后,它的右子树的所有节点值都大于它。因此这种选择策略符合二叉搜索树的特性。
 
图十描述了三种情况的替换选择方案
 
 
  图十
和搜索、插入算法一样,删除算法的运行时间与BST的拓扑结构有关。理想状态下,时间复杂度为log 2 n,最坏情况下,耗费的为线性时间。
 
BST节点的遍历
 
对于线性的连续的数组元素,采用的是单向的迭代法。从第一个元素开始,依次向后迭代每个元素。而BST则有三种常用的遍历方式:
1、 前序遍历(Preorder traversal)
2、 中序遍历(Inorder traversal)
3、 后序遍历(Postorder traversal)
 
当然,这三种遍历的工作原理相似。它们都是从根节点开始,然后访问其子节点。区别在于遍历时,访问节点本身和其子节点的顺序不同。为帮助理解,我们看看图十一所示的BST树。(注意图六和图十一所示的BST树完全相同。)

图十一
 
前序遍历
 
前序遍历从当前节点(节点c)开始,然后访问其左孩子,再访问右孩子。如果从BST树的根节点c开始,算法如下:
1、 访问c;(这里所谓访问,是指输出节点的值,或将节点添加到ArrayList中,等等,这取决于你遍历BST的目的。)
2、 对c的左孩子重复第一步;
3、 对c的右孩子重复第一步;
 
设想算法的第一步打印出c的值。以图十一所示的BST树为例,以前序遍历的方法输出的值是什么?是的,我们在第一步首先输出根节点的值。然后对根的左孩子执行第一步,输出50。因为第二步是反复执行第一步操作,因此是对根节点的左孩子的左孩子访问,输出20。如此重复直到树的最左边底层。当到达节点5时,输出其值。既然5没有左、右孩子,我们又回到节点20,执行第三步。此时是对节点20的右孩子反复执行第一步,即输出25。25没有孩子节点,又回到20。但我们对20已经做完了三步操作,所以回到节点50。再对50执行第三步操作,即对50的右孩子重复执行第一步。这个过程不断进行,直到遍历完树的所有节点。最后通过前序遍历输出的结果如下:
90, 50, 20, 5, 25, 75, 66, 80, 150, 95, 92, 111, 175, 166, 200
 
可以理解,这个算法确实有点让人糊涂。或许我们来看看算法的代码可以理清思路。下面的代码为BST类的PreorderTraversal()方法,这个类在文章后面会构建。注意这个方法接收一个Node类的实例作为输入参数,传入的节点就是算法步骤中所提到的节点c。执行前序遍历,就是从BST的根节点开始调用PreorderTraversal()方法。
 
protected virtual string PreorderTraversal(Node current, string separator)
{
   if (current != null)
   {
      StringBuilder sb = new StringBuilder();
      sb.Append(current.Value.ToString());
      sb.Append(separator);
 
      sb.Append(PreorderTraversal(current.Left, separator));
      sb.Append(PreorderTraversal(current.Right, separator));
      return sb.ToString();
   }
   else
      return String.Empty;
}
 
(译注:实际上本方法就是一个递归调用)
注意遍历的结果放到一个字符串中,这个字符串是通过StringBuilder创建的。首先将当前节点的值追加到字符串中,然后再对当前节点的左、右孩子进行遍历,并将结果追加到字符串中。
 
中序遍历
 
中序遍历先访问当前节点的左孩子,再访问当前节点本身,最后访问其右孩子。假定当前节点为c(从BST的根节点开始),算法如下:
1、 对c的左孩子重复执行本算法;
2、 访问c;(这里所谓访问,是指输出节点的值,或将节点添加到ArrayList中,等等,这取决于你遍历BST的目的。)
3、 对c的右孩子重复执行本算法。
 
InorderTraversal()方法的代码和PreorderTraversal()相似,只是在把当前节点的值添加到StringBuilder之前,先递归调用方法本身,并将当前节点的左孩子作为参数传递。
 
protected virtual string InorderTraversal
                (Node current, string separator)
{
   if (current != null)
   {
      StringBuilder sb = new StringBuilder();
      sb.Append(InorderTraversal(current.Left, separator));
 
      sb.Append(current.Value.ToString());
      sb.Append(separator);
 
      sb.Append(InorderTraversal(current.Right, separator));
      return sb.ToString();
   }
   else
      return String.Empty;
}
 
对图十一所示BST树执行中序遍历,输出结果如下:
5, 20, 25, 50, 66, 75, 80, 90, 92, 95, 111, 150, 166, 175, 200
 
可以看到返回的结果正好是升序排列。
 
后序遍历
 
后序遍历先访问当前节点的左孩子,然后是右孩子,最后才访问当前节点本身。假定当前节点为c(从BST的根节点开始),算法如下:
1、 对c的左孩子重复执行本算法;
2、 对c的右孩子重复执行本算法;
3、 访问c。(这里所谓访问,是指输出节点的值,或将节点添加到ArrayList中,等等,这取决于你遍历BST的目的。)
 
图十一所示的BST树经后序遍历输出的结果为:
5, 25, 20, 66, 80, 75, 50, 92, 111, 95, 166, 200, 175, 150, 90
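正文给出了前序和中序遍历的代码,但没有给出PostorderTraversal()。按照前面两个方法的模式,可以推测它的一个实现大致如下(这里的Node为简化版,仅供演示;完整实现以文章附带的下载源代码为准):

```csharp
using System;
using System.Text;

// 按正文 Preorder/Inorder 的模式推测的后序遍历示意
public class PostorderDemo
{
    public class Node
    {
        public IComparable Value;
        public Node Left, Right;
        public Node(IComparable v, Node l = null, Node r = null)
        { Value = v; Left = l; Right = r; }
    }

    public static string PostorderTraversal(Node current, string separator)
    {
        if (current == null) return String.Empty;

        StringBuilder sb = new StringBuilder();
        sb.Append(PostorderTraversal(current.Left, separator));
        sb.Append(PostorderTraversal(current.Right, separator));

        // 与前两种遍历不同:当前节点的值最后追加
        sb.Append(current.Value.ToString());
        sb.Append(separator);
        return sb.ToString();
    }

    public static void Main()
    {
        // 一棵小树:根50,左20(孩子5、25),右75
        var root = new Node(50,
            new Node(20, new Node(5), new Node(25)),
            new Node(75));
        Console.WriteLine(PostorderTraversal(root, ", "));
        // 输出: 5, 25, 20, 75, 50,
    }
}
```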
 
注意:本文提供的下载内容包括BST和BinaryTree类的完整源代码,同时还包括对BST类的windows窗体的测试应用程序。尤其有用的是,通过Windows应用程序,你可以看到对BST进行前序、中序、后序遍历输出的结果。
 
这三种遍历的运行时间都是线性的。因为每种遍历都将访问树的每一个节点,而其对每个节点正好访问一次。因此,BST树的节点数成倍增加,则遍历的时间也将倍增。
 
实现BST类
 
虽然Java的SDK包括了BST类(称为TreeMap),但.Net Framework基类库却不包括,因此我们必须自己创建。和二叉树一样,首先要创建Node类。我们不能简单地重用普通二叉树中的Node类,因为BST的节点之间必须可以比较大小。因此,节点数据不再是object类型,而要求是实现了IComparable接口的类型。
 
另外,BST的Node类需要实现ICloneable接口,因为我们要允许开发者对BST进行克隆clone(即深拷贝)。使Node类可克隆,我们就可以通过返回根节点的克隆达到克隆整个BST的目的。Node类如下:
 
public class Node : ICloneable
{
   private IComparable data;
   private Node left, right;
 
   #region Constructors
   public Node() : this(null) {}
   public Node(IComparable data) : this(data, null, null) {}
   public Node(IComparable data, Node left, Node right)
   {
      this.data = data;
      this.left = left;
      this.right = right;
   }
   #endregion
 
   #region Public Methods
   public object Clone()
   {
      Node clone = new Node();
      if (data is ICloneable)
         clone.Value = (IComparable) ((ICloneable) data).Clone();
      else
         clone.Value = data;
 
      if (left != null)
         clone.Left = (Node) left.Clone();
     
      if (right != null)
         clone.Right = (Node) right.Clone();
 
      return clone;
   }
   #endregion
 
   #region Public Properties
   public IComparable Value
   {
      get
      {
         return data;
      }
      set
      {
         data = value;
      }
   }
 
   public Node Left
   {
      get
      {
         return left;
      }
      set
      {
         left = value;
      }
   }
 
   public Node Right
   {
      get
      {
         return right;
      }
      set
      {
         right = value;
      }
   }
   #endregion
}
 
注意BST的Node类与二叉树的Node类有很多相似之处。区别在于data的类型为IComparable而非object,而且Node类实现了ICloneable接口,因此可以调用Clone()方法。
 
现在将重心放到创建BST类上,它实现了二叉搜索树。在下面的几节中,我们会介绍这个类的每个主要方法。至于类的完整代码,可以点击 Download the BinaryTrees.msi sample file 下载源代码,以及测试BST类的Windows应用程序。
 
搜索节点
 
BST之所以重要,就是因为它提供的搜索算法时间复杂度远低于线性时间,因此了解Search()方法是非常有意义的。Search()方法接收一个IComparable类型的输入参数,并调用辅助方法SearchHelper(),传入BST的根节点和所要搜索的数据。
 
SearchHelper()对树进行递归调用,如果没有找到指定值,返回null值,否则返回目标节点。Search()方法的返回结果如果为空,说明要查找的数据不在BST中,否则就指向等于data值的节点。
 
public virtual Node Search(IComparable data)
{
   return SearchHelper(root, data);
}
 
protected virtual Node SearchHelper(Node current, IComparable data)
{
   if (current == null)
      return null;   // node was not found
   else
   {
      int result = current.Value.CompareTo(data);
      if (result == 0)
         // they are equal - we found the data
         return current;
      else if (result > 0)
      {
         // current.Value > data
         // therefore, if the data exists it is in current's left subtree
         return SearchHelper(current.Left, data);
      }
      else // result < 0
      {
         // current.Value < data
         // therefore, if the data exists it is in current's right subtree
         return SearchHelper(current.Right, data);
      }
   }
}
 
添加节点到BST
 
和前面创建的BinaryTree类不同,BST类并不提供对根节点的直接访问。通过BST的Add()方法可以添加节点到BST。Add()接收一个实现IComparable接口的对象作为新节点的值,然后沿树向下查找新节点应有的父节点。(回想前面提到的,新节点总是作为叶节点插入。)一旦找到父节点,则比较新节点与父节点值的大小,以决定新节点作为父节点的左孩子还是右孩子。
 
public virtual void Add(IComparable data)
{
   // first, create a new Node
   Node n = new Node(data);
   int result;
 
   // now, insert n into the tree
   // trace down the tree until we hit a NULL
   Node current = root, parent = null;
   while (current != null)
   {
      result = current.Value.CompareTo(n.Value);
      if (result == 0)
         // they are equal - inserting a duplicate - do nothing
         return;
      else if (result > 0)
      {
         // current.Value > n.Value
         // therefore, n must be added to current's left subtree
         parent = current;
         current = current.Left;
      }
      else if (result < 0)
      {
         // current.Value < n.Value
         // therefore, n must be added to current's right subtree
         parent = current;
         current = current.Right;
      }
   }
 
   // ok, at this point we have reached the end of the tree
   count++;
   if (parent == null)
      // the tree was empty
      root = n;
   else
   {
      result = parent.Value.CompareTo(n.Value);
      if (result > 0)
         // parent.Value > n.Value
         // therefore, n must be added to parent's left subtree
         parent.Left = n;
      else if (result < 0)
         // parent.Value < n.Value
         // therefore, n must be added to parent's right subtree
         parent.Right = n;
   }
}
 
Search()方法是对BST从上到下进行递归操作,而Add()方法则是使用一个简单的循环。两种方式殊途同归,但使用while循环在性能上比之递归更有效。所以我们应该认识到BST的方法都可以用这两种方法——递归或循环——其中任意一种来重写。(个人认为递归算法更易于理解。)
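正如上面所说,Add()也可以改写成递归形式。下面是一个示意性的递归版本(类名和方法名为本示例假设,重复值被直接忽略;仅为说明两种写法殊途同归):

```csharp
using System;

// Add() 的递归写法示意(名称为本示例假设,非正文BST类)
public class RecursiveAddDemo
{
    public class Node
    {
        public IComparable Value;
        public Node Left, Right;
        public Node(IComparable v) { Value = v; }
    }

    public Node Root;

    public void Add(IComparable data)
    {
        Root = AddHelper(Root, new Node(data));
    }

    // 递归地为 n 寻找落脚点:下行到 null 即为插入位置
    private Node AddHelper(Node current, Node n)
    {
        if (current == null)
            return n;

        int result = current.Value.CompareTo(n.Value);
        if (result > 0)
            current.Left = AddHelper(current.Left, n);   // n 属于左子树
        else if (result < 0)
            current.Right = AddHelper(current.Right, n); // n 属于右子树
        // result == 0:重复值,直接丢弃

        return current;
    }

    public static void Main()
    {
        var bst = new RecursiveAddDemo();
        foreach (int v in new[] { 4, 2, 5, 1, 3, 6, 4 }) bst.Add(v);
        Console.WriteLine(bst.Root.Value);        // 4
        Console.WriteLine(bst.Root.Left.Value);   // 2
        Console.WriteLine(bst.Root.Right.Value);  // 5
    }
}
```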
 
注意:当用户试图插入一个重复节点时,Add()方法的处理方式是放弃该插入操作,你也可以根据需要修改代码使之抛出一个异常。
 
从BST中删除节点
 
在BST的所有操作中,删除一个节点是最复杂的。复杂度在于删除一个节点必须选择一个合适的节点来替代因删除节点造成的断裂。注意选择替代节点必须符合二叉搜索树的特性。
 
在前面“从BST中删除节点”一节中,我们提到选择节点来替代被删除节点共有三种情形,这些情形在图十中已经有了总结。下面我们来看看Delete()方法是怎样来确定这三种情形的。
 
public void Delete(IComparable data)
{
   // find the node to delete
   // trace down the tree until we find it (or fall off the tree)
   Node current = root, parent = null;
   int result;
   while (current != null)
   {
      result = current.Value.CompareTo(data);
      if (result == 0)
         // found the node to delete
         break;
      else if (result > 0)
      {
         // current.Value > data
         // therefore, if data exists it is in current's left subtree
         parent = current;
         current = current.Left;
      }
      else // result < 0
      {
         // current.Value < data
         // therefore, if data exists it is in current's right subtree
         parent = current;
         current = current.Right;
      }
   }
 
   // if current == null, then we did not find the item to delete
   if (current == null)
      throw new Exception("Item to be deleted does not exist in the BST.");
 
 
   // at this point current is the node to delete, and parent is its parent
   count--;
  
   // CASE 1: If current has no right child, then current's left child becomes the
   // node pointed to by the parent
   if (current.Right == null)
   {
      if (parent == null)
         root = current.Left;
      else
      {
         result = parent.Value.CompareTo(current.Value);
         if (result > 0)
            // parent.Value > current
            // therefore, the parent's left subtree is now current's Left subtree
            parent.Left = current.Left;
         else if (result < 0)
            // parent.Value < current.Value
            // therefore, the parent's right subtree is now current's left subtree
            parent.Right = current.Left;
      }
   }
   // CASE 2: If current's right child has no left child, then current's right child replaces
   // current in the tree
   else if (current.Right.Left == null)
   {
      if (parent == null)
         root = current.Right;
      else
      {
         result = parent.Value.CompareTo(current.Value);
         if (result > 0)
            // parent.Value > current
            // therefore, the parent's left subtree is now current's right subtree
            parent.Left = current.Right;
         else if (result < 0)
            // parent.Value < current.Value
            // therefore, the parent's right subtree is now current's right subtree
            parent.Right = current.Right;
      }
   }  
   // CASE 3: If current's right child has a left child, replace current with current's
   // right child's left-most node.
   else
   {
      // we need to find the right node's left-most child
      Node leftmost = current.Right.Left, lmParent = current.Right;
      while (leftmost.Left != null)
      {
         lmParent = leftmost;
         leftmost = leftmost.Left;
      }
 
       // lmParent's left child becomes leftmost's right subtree
       lmParent.Left = leftmost.Right;
     
      // assign leftmost's left and right to current's left and right
      leftmost.Left = current.Left;
      leftmost.Right = current.Right;
 
      if (parent == null)
         root = leftmost;
      else
      {
         result = parent.Value.CompareTo(current.Value);
          if (result > 0)
             // parent.Value > current.Value
             // therefore, leftmost becomes parent's new left child
             parent.Left = leftmost;
          else if (result < 0)
             // parent.Value < current.Value
             // therefore, leftmost becomes parent's new right child
             parent.Right = leftmost;
      }
   }
}
 
注意:当没有找到指定被删除的节点时,Delete()方法抛出一个异常。
 
其他的BST方法和属性
 
还有其他的BST方法和属性在本文中没有介绍。我们可以下载本文附带的完整的源代码来仔细分析BST类。其余的方法包括:
Clear():移除BST的所有节点。
Clone():克隆BST(创建一个深拷贝)。
Contains(IComparable):返回一个布尔值,指示BST中是否存在值等于指定数据的节点。
GetEnumerator():以中序遍历返回BST节点的枚举器。这个方法使BST可以通过foreach循环迭代其节点。
PreorderTraversal()/InorderTraversal()/PostorderTraversal():已在"BST节点的遍历"一节中介绍。
ToString():使用指定的遍历算法返回BST的字符串表示。
Count:公共的只读属性,返回BST的节点数。
 
现实世界的二叉搜索树
 
二叉搜索树理想地展示了插入、搜索、删除操作低于线性时间的时间复杂度,而这种时间复杂度与BST的拓扑结构有关。在"节点插入顺序决定BST的拓扑结构"一节中,我们提到拓扑结构与插入节点的顺序有关。如果插入的数据是有序的,或者近似有序的,都将导致BST成为一棵深而窄,而非浅而宽的树。而在很多现实情况下,数据都处于有序或近似有序的状态。
 
BST的问题是它很容易失去均衡。均衡的二叉树是指高度与宽度之比处于优化状态的树。在本系列文章的下一部分,会介绍一类自我均衡的特殊BST。也就是说,不管是添加新节点还是删除已有节点,BST都会自动调整其拓扑结构来保持均衡,使插入、搜索和删除的时间复杂度在最坏情况下也为log 2 n。我在前面提到过Java SDK中有一个名为TreeMap的BST类,这个类实际上就是基于一种智能的、自平衡的BST——红黑树(the red-black tree)实现的。
 
在本系列文章的下一部分,我们将简单介绍这类可自我均衡的BST,包括红黑树,并重点介绍一种称为SkipList的数据结构。这种结构具有自我均衡二叉树的性能,同时并不需要对自身拓扑结构进行复杂的调整。
 

先到此为止,好好享受编程的乐趣吧!

An Extensive Examination of Data Structures
 

Scott Mitchell
4GuysFromRolla.com

February 9, 2004

Summary: This article, the fourth in the series, begins with a quick examination of AVL trees and red-black trees, which are two different self-balancing, binary search tree data structures. The remainder of the article examines the skip list data structure, an ingenious data structure that turns a linked list into a data structure that offers the same running time as the more complex self-balancing tree data structures. (31 printed pages)

Note   This article assumes the reader is familiar with C# and the data structure topics discussed previously in this article series.

Download the BuildingBetterBST.msi sample file.

Contents

Introduction
Self-Balancing Binary Search Trees
A Quick Primer on Linked Lists
Skip Lists: A Linked List with Self-Balancing BST-Like Properties
Conclusion
Books

Introduction

In Part 3 of this article series, we looked at the general tree data structure. A tree is a data structure that consists of nodes, where each node has some value and an arbitrary number of children nodes. Trees are common data structures because many real-world problems exhibit tree-like behavior. For example, any sort of hierarchical relationship among people, things, or objects can be modeled as a tree.

A binary tree is a special kind of tree, which limits each node to no more than two children. A binary search tree, or BST, is a binary tree whose nodes are arranged such that for every node n, all of the nodes in n's left subtree have a value less than n, and all nodes in n's right subtree have a value greater than n. As we discussed, in the average case BSTs offer log2 n asymptotic time for inserts, deletes, and searches. (log2 n is often referred to as sublinear because it outperforms linear asymptotic times.)

The disadvantage of BSTs is that in the worst-case their asymptotic running time is reduced to linear time. This happens if the items inserted into the BST are inserted in order or in near-order. In such a case, a BST performs no better than an array. As we discussed at the end of Part 3, there exist self-balancing binary search trees that ensure that regardless of the order of the data inserted, the tree maintains a log2 n running time. In this article, we'll briefly discuss two self-balancing binary search trees—AVL trees and red-black trees. Following that, we'll take an in-depth look at skip lists. Skip lists are a really cool data structure that is much easier to implement than AVL trees or red-black trees, yet still guarantees a running time of log2 n.

Self-Balancing Binary Search Trees

Recall that new nodes are inserted into a binary search tree at the leaves. That is, adding a node to a binary search tree involves tracing down a path of the binary search tree, taking lefts and rights based on the comparison of the value of the current node, and the node being inserted, until the path reaches a dead end. At this point, the newly inserted node is plugged into the tree at this reached dead end. Figure 1 illustrates the process of inserting a new node into a BST.

Figure 1. Inserting a new node into a BST

As Figure 1 shows, when making the comparison at the current node, the node to be inserted travels down the left path if its value is less than the current node, and down the right if its value is greater than the current node's value. Therefore, the structure of the BST is relative to the order with which the nodes are inserted. Figure 2 depicts a BST after nodes with values of 20, 50, 90, 150, 175, and 200 have been added. Specifically, these nodes have been added in ascending order. The result is a BST with no breadth. That is, its topology consists of a single line of nodes rather than having the nodes fanned out.

Figure 2. A BST after nodes with values of 20, 50, 90, 150, 175, and 200 have been added

BSTs—which offer sublinear running time for insertions, deletions, and searches—perform optimally when their nodes are arranged in a fanned-out manner. This is because when searching for a node in a BST, each single step down the tree reduces the number of nodes that need to be potentially checked by one half. However, when a BST has a topology similar to the one in Figure 2, the running time for the BST's operations is much closer to linear time. To see why, consider what must happen when searching for a particular value, such as 175. Starting at the root, 20, we must navigate down through each right child until we hit 175. That is, there are no savings in the number of nodes that need to be checked at each step. Searching a BST like the one in Figure 2 is identical to searching an array. Each element must be checked one at a time. Therefore, such a structured BST exhibits a linear search time.

What is important to realize is that the running time of a BST's operations is related to the BST's height. The height of a tree is defined as the length of the longest path starting at the root. The height of a tree can be defined recursively as follows:

  • The height of a node with no children is 0
  • The height of a node with one child is the height of that child plus one
  • The height of a node with two children is one plus the greater height of the two children

To compute the height of a tree, start at its leaf nodes and assign them a height of 0. Then move up the tree using the three rules outlined to compute the height of each leaf node's parent. Continue in this manner until every node of the tree has been labeled. The height of the tree, then, is the height of the root node. Figure 3 shows a number of binary trees with their height computed at each node. For practice, take a second to compute the heights of the trees yourself to make sure your numbers match up with the numbers presented in the figure below.
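The recursive definition above translates directly into code. Here is a rough sketch (the class and member names are mine, not from the article's download):

```csharp
using System;

// Computing a binary tree's height from the three rules above.
public class HeightDemo
{
    public class Node
    {
        public Node Left, Right;
    }

    // A node with no children has height 0; otherwise the height is
    // one plus the greater height of its children. Treating an absent
    // child as height -1 collapses the three rules into one expression.
    public static int Height(Node n)
    {
        if (n == null) return -1;
        return 1 + Math.Max(Height(n.Left), Height(n.Right));
    }

    public static void Main()
    {
        // A root with a single leaf child has height 1.
        var root = new Node { Left = new Node() };
        Console.WriteLine(Height(root));   // 1
    }
}
```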

Figure 3. Example binary trees with their height computed at each node

A BST exhibits log2 n running times when its height, when defined in terms of the number of nodes, n, in the tree, is near the floor of log2 n. (The floor of a number x is the greatest integer less than x. So the floor of 5.38 would be 5; the floor of 3.14159 would be 3. For positive numbers x, the floor of x can be found by simply truncating the decimal part of x, if any.) Of the three trees in Figure 3, tree (b) has the best height to number of nodes ratio, as the height is 3 and the number of nodes present in the tree is 8. As we discussed in Part 1 of this article series, loga b = y is another way of writing ay = b. log2 8, then, equals 3 since 23 = 8. Tree (a) has 10 nodes and a height of 4. log2 10 equals 3.3219 and change, the floor of that being 3. So, 4 is not the ideal height. Notice that by rearranging the topology of tree (a)—by moving the far-bottom right node to the child of one of the non-leaf nodes with only one child—we could reduce the tree's height by one, thereby giving the tree an optimal height to node ratio. Finally, tree (c) has the worst height to node ratio. With its 5 nodes it could have an optimal height of 2, but due to its linear topology is has a height of 4.

The challenge we are faced with, then, is ensuring that the topology of the resulting BST exhibits an optimal ratio of height to the number of nodes. Since the topology of a BST is based upon the order with which the nodes are inserted, intuitively you might opt to solve this problem by ensuring that the data that's added to a BST is not added in near-sorted order. While this is possible if you know the data that will be added to the BST beforehand, it might not be practical. If you are not aware of the data that will be added—like if it's added based on user input, or added as it's read from a sensor—then there is no hope of guaranteeing the data is not inserted in near-sorted order. The solution, then, is not to try to dictate the order with which the data is inserted, but to ensure that after each insertion the BST remains balanced. Data structures that are designed to maintain balance are referred to as self-balancing binary search trees.

A balanced tree is a tree that maintains some predefined ratio between its height and breadth. Different data structures define their own ratios for balance, but all have it close to log2 n. A self-balancing BST, then, exhibits log2 n running time. There are numerous self-balancing BST data structures in existence, such as AVL trees, red-black trees, 2-3 trees, 2-3-4 trees, splay trees, B-trees, and others. In the next two sections, we'll take a brief look at two of these self-balancing trees—AVL trees and red-black trees.

Examining AVL Trees

In 1962 Russian mathematicians G. M. Adel'son-Vel'skii and E. M. Landis invented the first self-balancing BST, called an AVL tree. AVL trees must maintain the following balance property: for every node n, the height of n's left and right subtrees can differ by at most 1. The height of a node's left or right subtree is the height computed for its left or right child using the technique discussed in the previous section. If a node has only one child, say a left child but no right child, then the height of the right subtree is -1.

Figure 4 shows, conceptually, the height-relationship each node in an AVL tree must maintain. Figure 5 provides three examples of BSTs. The numbers in the nodes represent the nodes' values; the numbers to the right and left of each node represent the height of the nodes' left and right subtrees. In Figure 5, trees (a) and (b) are valid AVL trees, but trees (c) and (d) are not because not all nodes adhere to the AVL balance property.

Figure 4. The height of left and right subtrees in an AVL tree cannot differ by more than one.


Figure 5. Example trees, where (a) and (b) are valid AVL trees, but (c) and (d) are not.

Note   Realize that AVL trees are binary search trees, so in addition to maintaining a balance property, an AVL tree must also maintain the binary search tree property.
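To make the balance property concrete, here is a sketch (names are my own) that checks whether a binary tree satisfies the AVL condition, computing each node's subtree heights along the way:

```csharp
using System;

// Checks the AVL balance property: for every node, the heights of the
// left and right subtrees differ by at most 1.
public class AvlCheckDemo
{
    public class Node
    {
        public Node Left, Right;
    }

    // Returns the height of n, or -2 if the subtree violates the AVL
    // property. (An empty subtree has height -1, per the convention above.)
    public static int CheckedHeight(Node n)
    {
        if (n == null) return -1;
        int left = CheckedHeight(n.Left);
        int right = CheckedHeight(n.Right);
        if (left == -2 || right == -2 || Math.Abs(left - right) > 1)
            return -2;   // violation somewhere below, or right here
        return 1 + Math.Max(left, right);
    }

    public static bool IsAvl(Node root)
    {
        return CheckedHeight(root) != -2;
    }

    public static void Main()
    {
        // A three-node chain violates the property at its root...
        var chain = new Node { Left = new Node { Left = new Node() } };
        Console.WriteLine(IsAvl(chain));   // False

        // ...while a balanced three-node tree satisfies it.
        var fanned = new Node { Left = new Node(), Right = new Node() };
        Console.WriteLine(IsAvl(fanned));  // True
    }
}
```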

When creating an AVL tree data structure, the challenge is to ensure that the AVL balance remains regardless of the operations performed on the tree. That is, as nodes are added or deleted, it is vital that the balance property remains. AVL trees maintain the balance through rotations. A rotation slightly reshapes the tree's topology such that the AVL balance property is restored and, equally important, the binary search tree property is maintained as well.

Inserting a new node into an AVL tree is a two-stage process. First, the node is inserted into the tree using the same algorithm for adding a new node to a BST. That is, the new node is added as a leaf node in the appropriate location to maintain the BST property. After adding a new node, it might be the case that adding this new node caused the AVL balance property to be violated at some node along the path traveled down from the root to where the newly inserted node was added. To fix any violations, stage two involves traversing back up the access path, checking the height of the left and right subtree for each node along this return path. If the heights of the subtrees differ by more than 1, a rotation is performed to fix the anomaly.

Figure 6 illustrates the steps for a rotation on node 3. Notice that after stage 1 of the insertion routine, the AVL tree property was violated at node 5 because node 5's left subtree's height was two greater than its right subtree's height. To remedy this violation, a rotation was performed on node 3, the root of node 5's left subtree. This rotation fixed the balance inconsistency and also maintained the BST property.


Figure 6. AVL trees stay balanced through rotations

In addition to the simple, single rotation shown in Figure 6, there are more involved rotations that are sometimes required. A thorough discussion of the set of rotations potentially needed by an AVL tree is beyond the scope of this article. What is important to realize is that both insertions and deletions can disturb the balance property to which AVL trees must adhere. To fix any perturbations, rotations are used.
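The single rotation of Figure 6 can be sketched as follows. This is a right rotation about a node; the mirrored left rotation is analogous, and a complete AVL implementation also needs double rotations, which are not shown. The names below are illustrative, not the article's code.

```csharp
// A single right rotation: the node's left child is lifted into the
// node's place, and the left child's old right subtree is reattached as
// the demoted node's new left subtree. The BST ordering survives because
// everything in that middle subtree is greater than the lifted child but
// less than the demoted node.
public class RotationDemo
{
    public class Node
    {
        public int Value;
        public Node Left, Right;
        public Node(int v) { Value = v; }
    }

    public static Node RotateRight(Node node)
    {
        Node leftChild = node.Left;
        node.Left = leftChild.Right;   // reattach the middle subtree
        leftChild.Right = node;        // demote node under its old left child
        return leftChild;              // leftChild is the subtree's new root
    }

    public static void Main()
    {
        // The chain 5 -> 3 -> 1 (all left children) becomes balanced at 3.
        var root = new Node(5) { Left = new Node(3) { Left = new Node(1) } };
        root = RotateRight(root);
        System.Console.WriteLine(root.Value);        // 3
        System.Console.WriteLine(root.Left.Value);   // 1
        System.Console.WriteLine(root.Right.Value);  // 5
    }
}
```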

Note   To familiarize yourself with insertions, deletions, and rotations from an AVL tree, check out the AVL tree applet at http://www.seanet.com/users/arsen/avltree.html. This Java applet illustrates how the topology of an AVL tree changes with additions and deletions.

By ensuring that all nodes' subtrees' heights differ by at most 1, AVL trees guarantee that insertions, deletions, and searches will always have an asymptotic running time of log2 n, regardless of the order of insertions into the tree.

A Look at Red-Black Trees

Rudolf Bayer, a computer science professor at the Technical University of Munich, invented the red-black tree data structure in 1972. In addition to its data and left and right children, the nodes of a red-black tree contain an extra bit of information—a color, which can be either red or black. Red-black trees are complicated further by the concept of a specialized class of node referred to as NIL nodes. NIL nodes are pseudo-nodes that exist as the leaves of the red-black tree. That is, all regular nodes—those with some data associated with them—are internal nodes. Rather than having a NULL pointer for a childless regular node, the node is assumed to have a NIL node in place of that NULL value. This concept can be understandably confusing, and hopefully the diagram in Figure 7 clears up any confusion.

Figure 7. Red-black trees add the concept of a NIL node.

Red-black trees are trees that have the following four properties:

  1. Every node is colored either red or black.
  2. Every NIL node is black.
  3. If a node is red, then both of its children are black.
  4. Every path from a node to a descendant leaf contains the same number of black nodes.

The first three properties are self-explanatory. The fourth property, which is the most important of the four, simply states that starting from any node in the tree, the number of black nodes from that node to any leaf (NIL), must be the same. In Figure 7 take the root node as an example. Starting from 41 and going to any NIL, you encounter the same number of black nodes—3. For example, taking a path from 41 to the left-most NIL node, we start on 41, a black node. We then travel down to node 9, then node 2, which is also black, then node 1, and finally the left-most NIL node. In this journey we encountered three black nodes—41, 2, and the final NIL node. In fact, if we travel from 41 to any NIL node, we'll always encounter precisely three black nodes.
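The fourth property, equal black counts on every root-to-leaf path, can be checked mechanically. A sketch, with types of my own (NIL leaves are represented here by null references and counted as one black node):

```csharp
using System;

public class RedBlackDemo
{
    public class Node
    {
        public bool IsRed;
        public Node Left, Right;
    }

    // Returns the black-height of the subtree (counting the NIL leaf as
    // one black node), or -1 if two paths disagree on their black count.
    public static int BlackHeight(Node n)
    {
        if (n == null) return 1;   // NIL pseudo-node: always black
        int left = BlackHeight(n.Left);
        int right = BlackHeight(n.Right);
        if (left == -1 || right == -1 || left != right)
            return -1;   // property 4 violated below this node
        return left + (n.IsRed ? 0 : 1);
    }

    public static void Main()
    {
        // A black root with one red child: every path crosses 2 black nodes
        // (the root and the NIL leaf).
        var root = new Node { Left = new Node { IsRed = true } };
        Console.WriteLine(BlackHeight(root));   // 2
    }
}
```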

Like the AVL tree, red-black trees are another form of self-balancing binary search tree. Whereas the balance property of an AVL tree was explicitly stated as a relationship between the heights of each node's left and right subtrees, red-black trees guarantee their balance in a more conspicuous manner. It can be shown that a tree that implements the four red-black tree properties has a height that is always less than 2 * log2 (n+1), where n is the total number of nodes in the tree. For this reason, red-black trees ensure that all operations can be performed within an asymptotic running time of log2 n.

Like AVL trees, any time a red-black tree has nodes inserted or deleted, it is important to verify that the red-black tree properties have not been violated. With AVL trees, the balance property was restored through rotations. With red-black trees, the red-black tree properties are restored through re-coloring and rotations. Red-black trees are notoriously complex in their re-coloring and rotation rules, requiring the nodes along the access path to make decisions based upon their color in contrast to the color of their parents and uncles. (An uncle of a node n is the node that is n's parent's sibling node.) A thorough discussion of re-coloring and rotation rules is far beyond the scope of this article.

To view the re-coloring and rotations of a red-black tree as nodes are added and deleted, check out the red-black tree applet, which can also be viewed at http://www.seanet.com/users/arsen/avltree.html.

A Quick Primer on Linked Lists

One common data structure we've yet to discuss is the linked list. Since the skip list data structure we'll be examining next is the mutation of a linked list into a data structure with self-balanced binary tree running times, it is important that before diving into the specifics of skip lists we take a moment to discuss linked lists.

Recall that with a binary tree, each node in the tree contains some bit of data and a reference to its left and right children. A linked list can be thought of as a unary tree. That is, each element in a linked list has some data associated with it, and a single reference to its neighbor. As Figure 8 illustrates, each element in a linked list forms a link in the chain. Each link is tied to its neighboring node, which is the node on its right.

Figure 8. A four-element linked list

When we created a binary tree data structure in Part 3, the binary tree data structure only needed to contain a reference to the root of the tree. The root itself contained references to its children, and those children contained references to their children, and so on. Similarly, with the linked list data structure, when implementing a structure we only need to keep a reference to the head of the list since each element in the list maintains a reference to the next item in the list.

Linked lists have the same linear running time for searches as arrays. That is, to determine if the element Sam is in the linked list in Figure 8, we have to start at the head and check each element one by one. There are no shortcuts as with binary trees or hashtables. Similarly, deleting from a linked list takes linear time because the linked list must first be searched for the item to be deleted. Once the item is found, removing it from the linked list involves reassigning the deleted item's left neighbor's neighbor reference to the deleted item's neighbor. Figure 9 illustrates the pointer reassignment that must occur when deleting an item from a linked list.
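The pointer reassignment just described can be sketched as follows (the element and member names are mine):

```csharp
using System;

// Deleting from a singly linked list: walk to the item, then point its
// predecessor's Neighbor reference past it.
public class LinkedListDemo
{
    public class Element
    {
        public string Data;
        public Element Neighbor;
        public Element(string data) { Data = data; }
    }

    public static Element Head;

    // O(n): the list must be searched before the item can be unlinked.
    public static void Delete(string data)
    {
        Element current = Head, previous = null;
        while (current != null && current.Data != data)
        {
            previous = current;
            current = current.Neighbor;
        }
        if (current == null) return;           // item not in the list

        if (previous == null)
            Head = current.Neighbor;           // deleting the head itself
        else
            previous.Neighbor = current.Neighbor;
    }

    public static void Main()
    {
        Head = new Element("Alice")
        { Neighbor = new Element("Bob") { Neighbor = new Element("Sam") } };
        Delete("Bob");
        Console.WriteLine(Head.Data);            // Alice
        Console.WriteLine(Head.Neighbor.Data);   // Sam
    }
}
```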

Figure 9. Deleting an element from a linked list

The asymptotic time required to insert a new element into a linked list depends on whether or not the linked list is a sorted list. If the list's elements need not be sorted, insertion can occur in constant time because we can add the element to the front of the list. This involves creating a new element, having its neighbor reference point to the current linked list head, and, finally, reassigning the linked list's head to the newly inserted element.

If the linked list elements need to be maintained in sorted order, then when adding a new element the first step is to locate where it belongs in the list. This is accomplished by iterating from the beginning of the list until reaching the spot where the new element belongs. Let e be the element immediately before the location where the new element will be added. To insert the new element, e's neighbor reference must now point to the newly inserted element, and the new element's neighbor reference needs to be assigned to e's old neighbor. Figure 10 illustrates this concept graphically.

Figure 10. Inserting elements into a sorted linked list
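The sorted insertion that Figure 10 depicts can be sketched like this (again with illustrative names of my own):

```csharp
using System;

// Insertion into a sorted singly linked list: walk until the first
// element not less than the new value, then splice the new element in.
public class SortedListDemo
{
    public class Element
    {
        public int Data;
        public Element Neighbor;
        public Element(int data) { Data = data; }
    }

    public static Element Insert(Element head, int data)
    {
        Element newElement = new Element(data);
        Element current = head, e = null;   // e: element just before the spot

        while (current != null && current.Data < data)
        {
            e = current;
            current = current.Neighbor;
        }

        newElement.Neighbor = current;      // new element points at e's old neighbor
        if (e == null)
            return newElement;              // new element becomes the head
        e.Neighbor = newElement;
        return head;
    }

    public static void Main()
    {
        Element head = null;
        foreach (int v in new[] { 30, 10, 20 }) head = Insert(head, v);
        for (Element c = head; c != null; c = c.Neighbor)
            Console.Write(c.Data + " ");    // 10 20 30
    }
}
```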

Notice that linked lists do not provide direct access, like an array. That is, if you want to access the ith element of a linked list, you have to start at the front of the list and walk through i links. With an array, though, you can jump straight to the ith element. Given this, along with the fact that linked lists do not offer better search running times than arrays, you might wonder why anyone would want to use a linked list.

The primary benefit of linked lists is that adding or removing items does not involve messy and time-consuming re-dimensioning. Recall that array's have fixed size. If an array needs to have more elements added to it than it has capacity, the array must be re-dimensioned. Granted, the ArrayList class hides the code complexity of this, but re-dimensioning still carries with it a performance penalty. In short, an array is usually a better choice if you have an idea on the upper bound of the amount of data that needs to be stored. If you have no conceivable notion as to how many elements need to be stored, then a link list might be a better choice.

In summary, linked lists are fairly simple to implement. The main challenge comes with the threading or rethreading of the neighbor links with insertions or deletions, but the complexity of adding or removing an element from a linked list pales in comparison to the complexity of balancing an AVL or red-black tree.

Skip Lists: A Linked List with Self-Balancing BST-Like Properties

Back in 1989 William Pugh, a computer science professor at the University of Maryland, was looking at sorted linked lists one day thinking about their running time. Clearly a sorted linked list takes linear time to search because each element may potentially be visited, one right after the other. Pugh thought to himself that if half the elements in a sorted linked list had two neighbor references—one pointing to its immediate neighbor, and another pointing to the neighbor two elements ahead—while the other half had one, then searching a sorted linked list could be done in half the time. Figure 11 illustrates a two-reference sorted linked list.

Figure 11. A skip list

The way such a linked list saves search time is due in part to the fact that the elements are sorted, as well as to the varying heights. To search for, say, Dave, we start at the head element, a dummy element whose height equals the height of the tallest element in the list. The head element does not contain any data. It merely serves as a place to start searching.

We start at the highest link because it lets us skip over lower elements. We begin by following the head element's top link to Bob. At this point we can ask ourselves, does Bob come before or after Dave? If Bob comes before Dave, then we know Dave, if he's in the list, must exist somewhere to the right of Bob. If Bob comes after Dave, then Dave, if he's in the list, must exist somewhere between where we're currently positioned and Bob. In this case, Dave comes after Bob alphabetically, so we can repeat our search again from the Bob element. Notice that by moving on to Bob, we are skipping over Alice. At Bob, we repeat the search at the same level. Following the top-most pointer we reach Dave. Since we have found what we are looking for, we can stop searching.

Now, imagine that we wanted to search for Cal. We'd begin by starting at the head element, and then moving onto Bob. At Bob, we'd start by following the top-most reference to Dave. Since Dave comes after Cal, we know that Cal must exist somewhere between Bob and Dave. Therefore, we move down to the next lower reference level and continue our comparison.

The efficiency of such a linked list arises because we are able to move over two elements at a time instead of just one. This makes the running time on the order of n/2, which, while better than a regular sorted linked list, is still an asymptotically linear running time. Realizing this, Pugh wondered what would happen if, rather than limiting the height of an element to 2, it was instead allowed to go up to log2 n for n elements. That is, if there were 8 elements in the linked list, there would be elements with height up to 3. If there were 16 elements, there would be elements with height up to 4. As Figure 12 shows, by intelligently choosing the heights of each of the elements, the search time is reduced to log2 n.

Figure 12. By increasing the height of each node in a skip list, better searching performance can be gained.

Notice that in Figure 12, every 2^i-th node has a reference 2^i elements ahead. Alice, the 2^0-th element, has a reference 2^0 elements ahead, to Bob. Bob, the 2^1-th element, has a reference 2^1 elements ahead, to Dave. Dave, the 2^2-th element, has a reference 2^2 elements ahead, to Frank. Had there been more elements, Frank, the 2^3-th element, would have a reference 2^3 elements ahead.

The disadvantage of the approach illustrated in Figure 12 is that adding new elements or removing existing ones can wreak havoc on the precise structure. That is, if Dave is deleted, then Ed becomes the 2^2-th element, Gil the 2^3-th element, and so on. This means all of the elements to the right of the deleted element need to have their heights and references readjusted. The same problem crops up with inserts. This redistribution of heights and references would not only complicate the code for this data structure, but would also degrade the insertion and deletion running times to linear.

Pugh noticed that this pattern created 50 percent of the elements at height 1, 25 percent at height 2, 12.5 percent at height 3, and so on. That is, 1/2^i of the elements were at height i. Rather than trying to ensure the correct height for each element with respect to its ordinal index in the list, Pugh decided to randomly pick a height using this ideal distribution: 50 percent at height 1, 25 percent at height 2, and so on. What Pugh discovered was that such a randomized linked list was not only very easy to create in code, but that it also exhibited log2 n running time for insertions, deletions, and lookups. Pugh named his randomized lists skip lists because iterating through the list skips over lower-height elements.

In the remaining sections we'll examine the insertion, deletion, and lookup functions of the skip list, and implement them in a C# class. We'll finish off with an empirical look at the skip list performance and discuss the tradeoffs between skip lists and self-balancing BSTs.

Creating the Node and NodeList Classes

A skip list, like a binary tree, is made up of a collection of elements. Each element in a skip list has some data associated with it, a height, and a collection of element references. For example, in Figure 12 the Bob element has the data Bob, a height of 2, and two element references, one to Dave and one to Cal. Before creating a skip list class, we first need to create a class that represents an element in the skip list. I named this class Node, and its germane code is shown below. (The complete skip list code is available in this article as a code download.)

public class Node
{
   #region Private Member Variables
   private NodeList nodes;
   IComparable myValue;
   #endregion

   #region Constructors
   public Node(IComparable value, int height)
   {
      this.myValue = value;
      this.nodes = new NodeList(height);
   }

   // height-only constructor, used for the dummy head element (no data)
   public Node(int height) : this(null, height) {}
   #endregion

   #region Public Properties
   public int Height
   {
      get { return nodes.Capacity; }
   }

   public IComparable Value
   {
      get { return myValue; }
   }

   public Node this[int index]
   {
      get { return nodes[index]; }
      set { nodes[index] = value; }
   }
   #endregion
}

Notice that the Node class only accepts objects that implement the IComparable interface as data. This is because a skip list is maintained as a sorted list, meaning that its elements are ordered by their data. In order to order the elements, the data of the elements must be comparable. (If you remember back to Part 3, our binary search tree Node class also required that its data implement IComparable for the same reason.)

The Node class uses a NodeList class to store its collection of Node references. The NodeList class, shown below, is a strongly-typed collection of Nodes and is derived from System.Collections.CollectionBase.

public class NodeList : CollectionBase
{
   public NodeList(int height)
   {
      // set the capacity based on the height
      base.InnerList.Capacity = height;

      // create dummy values up to the Capacity
      for (int i = 0; i < height; i++)
         base.InnerList.Add(null);
   }

   // Adds a new Node to the end of the node list
   public void Add(Node n)
   {
      base.InnerList.Add(n);
   }

   // Accesses a particular Node reference in the list
   public Node this[int index]
   {
      get { return (Node) base.InnerList[index]; }
      set { base.InnerList[index] = value; }
   }

   // Returns the capacity of the list
   public int Capacity
   {
      get { return base.InnerList.Capacity; }
   }
}

The NodeList constructor accepts a height input parameter that indicates the number of references that the node needs. It appropriately sets the Capacity of the InnerList to this height and adds null references for each of the height references.

With the Node and NodeList classes created, we're ready to move on to creating the SkipList class. The SkipList class, as we'll see, contains a single reference to the head element. It also provides methods for searching the list, enumerating through the list's elements, adding elements to the list, and removing elements from the list.

Note   For a graphical view of skip lists in action, be sure to check out the skip list applet at http://iamwww.unibe.ch/~wenger/DA/SkipList/. You can add and remove items from a skip list and visually see how the structure and height of the skip list is altered with each operation.

Creating the SkipList Class

The SkipList class provides an abstraction of a skip list. It contains public methods like:

  • Add(IComparable): adds a new item to the skip list.
  • Remove(IComparable): removes an existing item from the skip list.
  • Contains(IComparable): returns true if the item exists in the skip list, false otherwise.

And public properties, such as:

  • Height: the height of the tallest element in the skip list.
  • Count: the total number of elements in the skip list.

The skeletal structure of the class is shown below. Over the next several sections, we'll examine the skip list's operations and fill in the code for its methods.

public class SkipList
{
   #region Private Member Variables
   Node head;
   int count;
   Random rndNum;

   protected const double PROB = 0.5;
   #endregion

   #region Public Properties
   public virtual int Height
   {
      get { return head.Height; }
   }

   public virtual int Count
   {
      get { return count; }
   }
   #endregion

   #region Constructors
   public SkipList() : this(-1) {}
   public SkipList(int randomSeed)
   {
      head = new Node(1);
      count = 0;
      if (randomSeed < 0)
         rndNum = new Random();
      else
         rndNum = new Random(randomSeed);
   }
   #endregion

   protected virtual int chooseRandomHeight(int maxLevel)
   {
      ...
   }

   public virtual bool Contains(IComparable value)
   {
      ...
   }

   public virtual void Add(IComparable value)
   {
      ...
   }

   public virtual void Remove(IComparable value)
   {
      ...
   }
}

We'll fill in the code for the methods in a bit, but for now pay close attention to the class's private member variables, public properties, and constructors. There are three relevant private member variables:

  • head, which is the list's head element. Remember that a skip list has a dummy head element (refer back to Figures 11 and 12 for a graphical depiction of the head element).
  • count, an integer value keeping track of how many elements are in the skip list.
  • rndNum, an instance of the Random class. Since we need to randomly determine the height when adding a new element to the list, we'll use this Random instance to generate the random numbers.

The SkipList class has two read-only public properties, Height and Count. Height returns the height of the tallest skip list element. Since the head is always equal to the tallest skip list element, we can simply return the head element's Height property. The Count property simply returns the current value of the private member variable count. (count, as we'll see, is incremented in the Add() method and decremented in the Remove() method.)

Notice there are two forms of the SkipList constructor. The default constructor merely calls the second, passing in a value of -1. The second form assigns to head a new Node instance with height 1, and sets count equal to 0. It then checks to see if the passed in randomSeed value is less than 0 or not. If it is, then it creates an instance of the Random class using an auto-generated random seed value. Otherwise, it uses the random seed value passed into the constructor.

Note   Computer random-number generators, such as the Random class in the .NET Framework, are referred to as pseudo-random number generators because they don't truly pick random numbers, but instead use a function to generate them. The random-number-generating function starts with some value, called the seed. Based on the seed, a sequence of random numbers is computed. Slight changes in the seed value lead to seemingly random changes in the series of numbers returned.

If you use the Random class's default constructor, the system clock is used to generate a seed. You can optionally specify a seed. The benefit of specifying a seed is that if you use the same seed value, you'll get the same sequence of random numbers. Being able to get the same results is beneficial when testing the correctness and efficiency of a randomized algorithm like the skip list.
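The reproducibility point is easy to demonstrate. The sketch below is illustrative only, with an arbitrarily chosen seed value:

```csharp
using System;

class SeedDemo
{
   static void Main()
   {
      // Two Random instances constructed with the same seed produce
      // the identical sequence of pseudo-random numbers.
      Random a = new Random(42);
      Random b = new Random(42);

      for (int i = 0; i < 5; i++)
         Console.WriteLine(a.Next(100) == b.Next(100));   // True, five times

      // This is what makes a seeded skip list test run repeatable: the
      // same seed yields the same sequence of randomly chosen heights.
   }
}
```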

Searching a skip list

The algorithm for searching a skip list for a particular value is straightforward. Informally, the search process can be described as follows: we start with the head element's top-most reference. Let e be the element referenced by the head's top-most reference. We check to see whether e's value is less than, greater than, or equal to the value for which we are searching. If it equals the value, then we have found the item we're looking for. If it's greater than the value we're looking for, then the value, if it exists in the list, must be to the left of e, meaning it must have a lesser height than e. Therefore, we move down to the second-level head node reference and repeat this process.

If, on the other hand, the value of e is less than the value we're looking for then the value, if it exists in the list, must be on the right hand side of e. Therefore, we repeat these steps for the top-most reference of e. This process continues until we find the value we're searching for, or exhaust all the "levels" without finding the value.

More formally, the algorithm can be spelled out with the following pseudo-code:

Node current = head
for i = skipList.Height downto 1
  while current[i].Value < valueSearchingFor
    current = current[i]  // move to the next node

if current[i].Value == valueSearchingFor then
   return true
else
   return false

Take a moment to trace the algorithm over the skip list shown in Figure 13. The red arrows show the path of checks when searching the skip lists. Skip list (a) shows the results when searching for Ed. Skip list (b) shows the results when searching for Cal. Skip list (c) shows the results when searching for Gus, which does not exist in the skip list. Notice that throughout the algorithm we are moving in a right, downward direction. The algorithm never moves to a node to the left of the current node, and never moves to a higher reference level.

Figure 13. Searching over a skip list.

The code for the Contains(IComparable) method is quite simple, involving a while and a for loop. The for loop iterates down through the reference level layers. The while loop iterates across the skip list's elements.

public virtual bool Contains(IComparable value)
{
   Node current = head;
   int i = 0;

   for (i = head.Height - 1; i >= 0; i--)
   {
      while (current[i] != null)
      {
         int results = current[i].Value.CompareTo(value);
         if (results == 0)
            return true;
         else if (results < 0)
            current = current[i];
         else // results > 0
            break;  // exit while loop
      }
   }

   // if we reach here, we searched to the end of the list without finding the element
   return false;
}
Inserting into a skip list

Inserting a new element into a skip list is akin to adding a new element in a sorted link list, and involves two steps:

  1. Locate where in the skip list the new element belongs. This location is found by using the search algorithm to find the location that comes immediately before the spot where the new element will be added.
  2. Thread the new element into the list by updating the necessary references.

Since skip list elements can have many levels and, therefore, many references, threading a new element into a skip list is not as simple as threading a new element into a simple linked list. Figure 14 shows a diagram of a skip list and the threading process that needs to be done to add the element Gus. For this example, imagine that the randomly determined height for the Gus element was 3. To successfully thread in the Gus element, we'd need to update Frank's level 3 and 2 references, as well as Gil's level 1 reference. Gus's level 1 reference would point to Hank. If there were additional nodes to the right of Hank, Gus's level 2 reference would point to the first element to the right of Hank with height 2 or greater, while Gus's level 3 reference would point to the first element right of Hank with height 3 or greater.

Figure 14. Inserting elements into a skip list

In order to properly rethread the skip list after inserting the new element, we need to keep track of the last element encountered for each height. In Figure 14, Frank was the last element encountered for references at levels 4, 3, and 2, while Gil was the last element encountered for reference level 1. In the insert algorithm below, this record of last elements for each level is maintained by the updates array, which is populated as the search for the location for the new element is performed.

public virtual void Add(IComparable value)
{
   Node [] updates = new Node[head.Height];
   Node current = head;
   int i = 0;

   // first, determine the nodes that need to be updated at each level
   for (i = head.Height - 1; i >= 0; i--)
   {
      while (current[i] != null && current[i].Value.CompareTo(value) < 0)
         current = current[i];

      updates[i] = current;
   }

   // see if a duplicate is being inserted
   if (current[0] != null && current[0].Value.CompareTo(value) == 0)
      // cannot enter a duplicate, handle this case by either just returning or by throwing an exception
      return;

   // create a new node
   Node n = new Node(value, chooseRandomHeight(head.Height + 1));
   count++;   // increment the count of elements in the skip list

   // if the node's level is greater than the head's level, increase the head's level
   if (n.Height > head.Height)
   {
      head.IncrementHeight();
      head[head.Height - 1] = n;
   }

   // splice the new node into the list
   for (i = 0; i < n.Height; i++)
   {
      if (i < updates.Length)
      {
         n[i] = updates[i][i];
         updates[i][i] = n;
      }
   }
}

There are a couple of key portions of the Add(IComparable) method that are important. First, be certain to examine the first for loop. In this loop, not only is the correct location for the new element located, but the updates array is also fully populated. After this loop, a check is done to make sure that the data being entered is not a duplicate. I chose to implement my skip list such that duplicates are not allowed. However, skip lists can handle duplicate values just fine. If you want to allow for duplicates, simply remove this check.

Next, a new Node instance, n, is created. This represents the element to be added to the skip list. Note that the height of the newly created Node is determined by a call to the chooseRandomHeight() method, passing in the current skip list height plus one. We'll examine this method shortly. Another thing to note is that after adding the Node, a check is made to see if the new Node's height is greater than that of the skip list's head element's height. If it is, then the head element's height needs to be incremented, because the head element height should have the same height as the tallest element in the skip list.

The final for loop rethreads the references. It does this by iterating through the updates array, having the newly inserted Node's references point to the Nodes previously pointed to by the Node in the updates array, and then having the updates array Node update its reference to the newly inserted Node. To help clarify things, try running through the Add(IComparable) method code using the skip list in Figure 14, where the added Node's height is 3.

Randomly Determining the Newly Inserted Node's Height

When inserting a new element into the skip list, we need to randomly select a height for the newly added Node. Recall from our earlier discussions of skip lists that when Pugh first envisioned multi-level linked-list elements, he imagined a linked list where each 2^i-th element had a reference to an element 2^i elements away. In such a list, precisely 50 percent of the nodes would have height 1, 25 percent height 2, and so on.

The chooseRandomHeight() method uses a simple technique to compute heights so that the distribution of values matches Pugh's initial vision. This distribution can be achieved by flipping a coin and setting the height to one greater than the number of heads in a row achieved. That is, if upon the first flip you get a tails, the height of the new element will be one. If you get one heads and then a tails, the height will be 2. Two heads followed by a tails indicates a height of three, and so on. Since there is a 50 percent probability that you will get a tails, a 25 percent probability that you will get a heads and then a tails, a 12.5 percent probability that you will get two heads and then a tails, and so on, the distribution works out to be the same as the desired distribution.

The code to compute the random height is given by the following simple code snippet:

const double PROB = 0.5;
protected virtual int chooseRandomHeight()
{
   int level = 1;
   while (rndNum.NextDouble() < PROB)
      level++;

   return level;
}
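We can sanity-check this coin-flip technique empirically. The sketch below is hypothetical demo code, not part of the article's download; it tallies the heights produced over many trials, and the observed fractions should come out near 50 percent at height 1, 25 percent at height 2, and so on:

```csharp
using System;
using System.Collections.Generic;

class HeightDistributionDemo
{
   const double PROB = 0.5;
   static Random rndNum = new Random(1);   // fixed seed for repeatability

   // Same coin-flip technique as the chooseRandomHeight() snippet above
   public static int ChooseRandomHeight()
   {
      int level = 1;
      while (rndNum.NextDouble() < PROB)
         level++;
      return level;
   }

   static void Main()
   {
      const int trials = 100000;
      Dictionary<int, int> counts = new Dictionary<int, int>();

      for (int i = 0; i < trials; i++)
      {
         int h = ChooseRandomHeight();
         counts[h] = counts.ContainsKey(h) ? counts[h] + 1 : 1;
      }

      // The observed fraction at height h should approximate 1/2^h
      for (int h = 1; h <= 4; h++)
         Console.WriteLine("height {0}: {1:F3}", h, counts[h] / (double)trials);
   }
}
```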

One concern with the chooseRandomHeight() method is that the value returned might be extraordinarily large. That is, imagine that we have a skip list with, say, two elements, both with height 1. When adding our third element, we randomly choose the height to be 10. This is an unlikely event, since there is only roughly a 0.1 percent chance of selecting such a height, but it could conceivably happen. The downside is that our skip list would then have an element with height 10, meaning there would be a number of superfluous levels in the list. To put it more bluntly, the references at levels 2 through 10 would go unused. Even as additional elements were added to the list, there would still be only about a 3 percent chance of getting a node with height greater than 5, so we'd likely have many wasted levels.

Pugh suggests a couple of solutions to this problem. One is to simply ignore it. Having superfluous levels doesn't require any change to the data structure's code, nor does it affect the asymptotic running time. The approach I chose is to use "fixed dice" when choosing the random level. That is, you restrict the height of the new element to be at most one greater than the height of the tallest element currently in the skip list. The actual implementation of the chooseRandomHeight() method, which implements this "fixed dice" approach, is shown below. Notice that a maxLevel input parameter is passed in, and the while loop exits prematurely if level reaches this maximum. In the Add(IComparable) method, note that the maxLevel value passed in is the height of the head element plus one. (Recall that the head element's height is the same as the height of the tallest element in the skip list.)

protected virtual int chooseRandomHeight(int maxLevel)
{
   int level = 1;

   while (rndNum.NextDouble() < PROB && level < maxLevel)
      level++;

   return level;
}

The head element should be the same height as the tallest element in the skip list. So, in the Add(IComparable) method, if the newly added Node's height is greater than the head element's height, I call the IncrementHeight() method:

/* - snippet from the Add() method… */
if (n.Height > head.Height)
{
   head.IncrementHeight();
   head[head.Height - 1] = n;
}
/************************************/

IncrementHeight() is a method of the Node class that I left out for brevity. It simply increases the Capacity of the Node's NodeList and adds a null reference at the newly added level. For the method's source code, refer to the article's code sample.

Note   In his paper, "Skip Lists: A Probabilistic Alternative to Balanced Trees," Pugh examines the effects of changing the value of PROB from 0.5 to other values, such as 0.25, 0.125, and others. Lower values of PROB decrease the average number of references per element, but increase the likelihood of the search taking substantially longer than expected. For more details, be sure to read Pugh's paper, which is mentioned in the References section at the end of this article.
Deleting an element from a skip list

Like adding an element to a skip list, removing an element involves a two-step process:

  1. The element to be deleted must be found.
  2. That element needs to be snipped from the list and the references need to be rethreaded.

Figure 15 shows the rethreading that must occur when Dave is removed from the skip list.

Figure 15. Deleting an element from a skip list

As with the Add(IComparable) method, Remove(IComparable) maintains an updates array that keeps track of the elements at each level that appear immediately before the element to be deleted. Once this updates array has been populated, the array is iterated through from the bottom up, and the elements in the array are rethreaded to point to the deleted element's references at the corresponding levels. The Remove(IComparable) method code follows.

public virtual void Remove(IComparable value)
{
   Node [] updates = new Node[head.Height];
   Node current = head;
   int i = 0;

   // first, determine the nodes that need to be updated at each level
   for (i = head.Height - 1; i >= 0; i--)
   {
      while (current[i] != null && current[i].Value.CompareTo(value) < 0)
         current = current[i];

      updates[i] = current;
   }

   current = current[0];
   if (current != null && current.Value.CompareTo(value) == 0)
   {
      count--;

      // We found the data to delete
      for (i = 0; i < head.Height; i++)
      {
         if (updates[i][i] != current)
            break;
         else
            updates[i][i] = current[i];
      }

      // finally, see if we need to trim the height of the list
      if (head[head.Height - 1] == null)
      {
         // we removed the single, tallest item... reduce the list height
         head.DecrementHeight();
      }
   }
   else
   {
      // the data to delete wasn't found.  Either return or throw an exception
      return;
   }
}

The first for loop should look familiar. It's the same code found in Add(IComparable), used to populate the updates array. Once the updates array has been populated, we check to ensure that the element we reached does indeed contain the value to be deleted. If not, the Remove() method simply returns. You might opt to have it throw an exception of some sort, though. Assuming the element reached is the element to be deleted, the count member variable is decremented and the references are rethreaded. Lastly, if we deleted the element with the greatest height, then we should decrement the height of the head element. This is accomplished through a call to the DecrementHeight() method of the Node class.

Analyzing the running time

In "Skip Lists: A Probabilistic Alternative to Balanced Trees," Pugh provides a quick proof showing that the skip list's search, insertion, and deletion running times are asymptotically bounded by log2 n in the average case. A skip list can exhibit linear time in the worst case, but the likelihood of the worst-case scenario occurring is extremely small.

Since the heights of the elements of a skip list are randomly chosen, there is a chance that all, or virtually all, elements in the skip list will end up with the same height. For example, imagine that we had a skip list with 100 elements, all of which happened to have height 1 chosen as their random height. Such a skip list would be, essentially, a normal linked list, not unlike the one shown in Figure 8. As we discussed earlier, the running time for operations on a normal linked list is linear.

While such worst-case scenarios are possible, realize that they are highly improbable. To put things in perspective, the likelihood of having a skip list with 100 height 1 elements is the same likelihood of flipping a coin 100 times and having it come up tails all 100 times. The chances of this happening are precisely 1 in 1,267,650,600,228,229,401,496,703,205,376. Of course with more elements, the probability goes down even further. For more information, be sure to read about Pugh's probabilistic analysis of skip lists in his paper.
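That 1-in-2^100 figure is easy to verify. The following sketch (illustrative code, not from the article's download) computes 2^100 exactly with BigInteger:

```csharp
using System;
using System.Numerics;   // BigInteger lives here

class CoinFlipOddsDemo
{
   static void Main()
   {
      // 100 fair coin flips have 2^100 equally likely outcomes, so the
      // probability of all tails (all height-1 elements) is 1 in 2^100.
      BigInteger outcomes = BigInteger.Pow(2, 100);
      Console.WriteLine(outcomes);   // 1267650600228229401496703205376
   }
}
```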

Examining Some Empirical Results

Included in the article's download is the SkipList class, along with a testing Windows Forms application. With this testing application you can manually add, remove, and inspect the list, and can see the nodes of the list displayed. The testing application also includes a "stress tester," where you can indicate how many operations to perform and an optional random seed value. The stress tester then creates a skip list, adds at least half as many elements as operations requested, and then, with the remaining operations, performs a mix of inserts, deletes, and queries. At the end you can review a log of the operations performed and their results, along with the skip list height, the number of comparisons needed for each operation, and the number of elements in the list.

The graph in Figure 16 shows the average number of comparisons per operation for increasing skip list sizes. Note that as the skip list doubles in size, the average number of comparisons needed per operation increases only by a small amount (one or two more comparisons). To fully appreciate the utility of logarithmic growth, consider how the time for searching an array would fare on this graph. For a 256-element array, on average 128 comparisons are needed to find an element. For a 512-element array, on average 256 comparisons are needed. Compare that to the skip list, which with 256 and 512 elements requires only 9 and 10 comparisons on average!

Figure 16. Viewing the logarithmic growth of comparisons required for an increasing number of skip list elements.
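The contrast between linear and logarithmic growth is easy to tabulate. The sketch below compares the roughly n/2 comparisons of an average linear scan with the roughly log2 n comparisons of an idealized skip list search; these are back-of-the-envelope approximations, not measurements from the article's test application:

```csharp
using System;

class GrowthDemo
{
   // floor(log2 n), computed with exact integer arithmetic to avoid
   // floating-point rounding surprises
   public static int Log2(int n)
   {
      int log = 0;
      while ((1 << (log + 1)) <= n)
         log++;
      return log;
   }

   static void Main()
   {
      foreach (int n in new[] { 256, 512, 1024, 2048 })
         Console.WriteLine("n = {0,4}: linear scan ~ {1,4} comparisons, skip list ~ {2}",
                           n, n / 2, Log2(n));
   }
}
```

Doubling n doubles the linear-scan cost but adds only one comparison to the logarithmic cost, which is exactly the behavior Figure 16 depicts.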

Conclusion

In Part 3 of this article series, we looked at binary trees and binary search trees. BSTs provide an efficient log2 n running time in the average case. However, the running time is sensitive to the topology of the tree, and a tree with a suboptimal ratio of breadth to height can reduce the running time of a BST's operations to linear time.

To remedy this worst-case running time of BSTs, which can arise quite easily since the topology of a BST is directly dependent on the order with which items are added, computer scientists have invented a myriad of self-balancing BSTs, starting with the AVL tree created in the 1960s. While data structures such as the AVL tree, the red-black tree, and numerous other specialized BSTs offer log2 n running time in both the average and worst case, they require especially complex code that can be difficult to create correctly.

An alternative data structure that offers the same asymptotic running time as a self-balancing BST is William Pugh's skip list. The skip list is a specialized, sorted linked list whose elements each have a height associated with them. In this article we constructed a SkipList class and saw how straightforward the skip list's operations were, and how easy they were to implement in code.

This fourth part of the article series is the last proposed part on trees. In the fifth installment, we'll look at graphs: collections of vertices with an arbitrary number of edges connecting them to one another. As we'll see in Part 5, trees are a special form of graph. Graphs have an extraordinary number of applications in real-world problems.

As always, if you have questions, comments, or suggestions for future material to discuss, I invite your comments! I can be reached at mitchell@4guysfromrolla.com.

Happy Programming!

References

  • Cormen, Thomas H., Charles E. Leiserson, and Ronald L. Rivest. "Introduction to Algorithms." MIT Press. 1990.
  • Pugh, William. "Skip Lists: A Probabilistic Alternative to Balanced Trees." Available online at ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf.

Related Books

Scott Mitchell, author of five books and founder of 4GuysFromRolla.com, has been working with Microsoft Web technologies for the past five years. Scott works as an independent consultant, trainer, and writer, and recently completed his Masters degree in Computer Science at the University of California – San Diego. He can be reached at mitchell@4guysfromrolla.com or through his blog at http://ScottOnWriting.NET.

An Extensive Examination of Data Structures
 

Scott Mitchell
4GuysFromRolla.com

March 2004

Summary: A graph, like a tree, is a collection of nodes and edges, but has no rules dictating the connection among the nodes. In this fifth part of the article series, we'll learn all about graphs, one of the most versatile data structures. (26 printed pages)


Download the Graphs.msi sample file.

Contents

Introduction
Examining the Different Classes of Edges
Creating a C# Graph Class
A Look at Some Common Graph Algorithms
Conclusion
Related Books

Introduction

Part 1 and Part 2 of this article series focused on linear data structures—the array, the ArrayList, the Queue, the Stack, and the Hashtable. In Part 3, we began our investigation of trees. Recall that trees consist of a set of nodes, where all of the nodes share some connection to other nodes. These connections are referred to as edges. As we discussed, there are numerous rules as to how these connections can occur. For example, all nodes in a tree except for one—the root—must have precisely one parent node, while all nodes can have an arbitrary number of children. These simple rules ensure that, for any tree, the following three statements will hold true:

  1. Starting from any node, any other node in the tree can be reached. That is, there exists no node that can't be reached through some simple path.
  2. There are no cycles. A cycle exists when, starting from some node v, there is some path that travels through some set of nodes v1, v2, ..., vk that then arrives back at v.
  3. The number of edges in a tree is precisely one less than the number of nodes.

In Part 3 we focused on binary trees, which are a special form of trees. Binary trees are trees whose nodes have at most two children.

In this fifth installment of the article series we're going to examine graphs. Graphs are composed of a set of nodes and edges, just like trees, but with graphs there are no rules for the connections between nodes. With graphs, there is no concept of a root node, nor is there a concept of parents and children. Rather, a graph is a collection of interconnected nodes.

Note   Realize that all trees are graphs. A tree is a special case of a graph in which all nodes are reachable from some starting node and in which there are no cycles.

Figure 1 shows three examples of graphs. Notice that graphs, unlike trees, can have sets of nodes that are disconnected from other sets of nodes. For example, graph (a) has two distinct, unconnected sets of nodes. Graphs can also contain cycles. Graph (b) has several cycles. One cycle is the path from v1 to v2 to v4 and back to v1. Another is from v1 to v2 to v3 to v5 to v4 and back to v1. (There are also cycles in graph (a).) Graph (c) does not have any cycles, as it has one fewer edge than it has nodes, and all nodes are reachable. Therefore, it is a tree.

Figure 1. Three examples of graphs

Many real-world problems can be modeled using graphs. For example, search engines like Google model the Internet as a graph, where Web pages are the nodes in the graph and the links among Web pages are the edges. Programs like Microsoft MapPoint that can generate driving directions from one city to another use graphs, modeling cities as nodes in a graph and the roads connecting the cities as edges.

Examining the Different Classes of Edges

Graphs, in their simplest terms, are a collection of nodes and edges, but there are different kinds of edges:

  • Directed versus undirected edges
  • Weighted versus unweighted edges

When talking about using graphs to model a problem, it is important to indicate the class of graph with which you are working. Is it a graph whose edges are directed and weighted, or one whose edges are undirected and weighted? In the next two sections, we'll discuss the differences between directed and undirected edges and weighted and unweighted edges.

Directed and Undirected Edges

The edges of a graph provide the connections between one node and another. By default, an edge is assumed to be bidirectional. That is, if there exists an edge between nodes v and u, it is assumed that one can travel from v to u and from u to v. Graphs with bidirectional edges are said to be undirected graphs because there is no implicit direction in their edges.

For some problems, though, an edge might imply a one-way connection from one node to another. For example, when modeling the Internet as a graph, a hyperlink from Web page v linking to Web page u would imply that the edge from v to u is unidirectional. That is, one could navigate from v to u, but not from u to v. Graphs that use unidirectional edges are said to be directed graphs.

When drawing a graph, bidirectional edges are drawn as a straight line, as shown in Figure 1. Unidirectional edges are drawn as an arrow, showing the direction of the edge. Figure 2 shows a directed graph where the nodes are Web pages for a particular Web site and a directed edge from u to v indicates that there is a hyperlink from Web page u to Web page v. Notice that because u links to v and v links to u, two arrows are used: one from v to u and another from u to v.

Figure 2. Model of pages making up a website

Weighted and Unweighted Edges

Typically graphs are used to model a collection of "things" and the relationships among those "things." For example, the graph in Figure 2 modeled the pages in a website and their hyperlinks. Sometimes, though, it is important to associate some cost with the connection from one node to another.

A map can be easily modeled as a graph, with the cities as nodes and the roads connecting the cities as edges. If we want to determine the shortest distance and route from one city to another, we first need to assign a cost for traveling from one city to another. The logical solution is to give each edge a weight, such as the number of miles between the two cities it connects.

Figure 3 shows a graph that represents several cities in southern California. The cost of any particular path from one city to another is the sum of the costs of the edges along the path. The shortest path, then, would be the path with the least cost. In Figure 3, for example, a trip from San Diego to Santa Barbara is 210 miles if driving through Riverside, then to Barstow, and then to Santa Barbara. The shortest trip, however, is to drive 100 miles to Los Angeles, and then another 30 miles up to Santa Barbara.

Figure 3. Graph of California cities with edges valued as miles

Realize that the directionality and weight of edges are orthogonal. That is, a graph can have one of four arrangements of edges:

  • Directed, weighted edges
  • Directed, unweighted edges
  • Undirected, weighted edges
  • Undirected, unweighted edges

The graphs in Figure 1 had undirected, unweighted edges. Figure 2 had directed, unweighted edges, and Figure 3 used undirected, weighted edges.

Sparse Graphs and Dense Graphs

While a graph could have zero or a handful of edges, typically a graph will have more edges than it has nodes. What's the maximum number of edges a graph could have, given n nodes? It depends on whether the graph is directed or undirected. If the graph is directed, then each node could have an edge to every other node. That is, each of the n nodes could have n – 1 edges, giving a total of n * (n – 1) edges, which is nearly n².

Note   For this article, I am assuming nodes are not allowed to have edges to themselves. In general, though, graphs allow for an edge to exist from a node v back to node v. If self-edges are allowed, the total number of edges for a directed graph would be n².

If the graph is undirected, then one node, call it v1, could have an edge to each and every other node, or n – 1 edges. The next node, call it v2, could have at most n – 2 edges, because an edge between v1 and v2 already exists. The third node, v3, could have at most n – 3 edges, and so forth. Therefore, for n nodes, there would be at most (n – 1) + (n – 2) + ... + 1 edges. As you might have guessed, this sums to [n * (n – 1)] / 2, or exactly half as many edges as a directed graph.
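These two edge-count formulas can be captured in a short sketch (the class and method names here are my own, illustrative choices, following the article's assumption that self-edges are not allowed):

```csharp
using System;

public class MaxEdges
{
   // Maximum edges for a directed graph with n nodes: each node can
   // have an edge to each of the other n - 1 nodes.
   public static int Directed(int n)   { return n * (n - 1); }

   // An undirected graph has exactly half as many: [n * (n - 1)] / 2.
   public static int Undirected(int n) { return n * (n - 1) / 2; }

   public static void Main()
   {
      Console.WriteLine(MaxEdges.Directed(6));    // → 30
      Console.WriteLine(MaxEdges.Undirected(6));  // → 15
   }
}
```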

If a graph has significantly fewer than n² edges, the graph is said to be sparse. For example, a graph with n nodes and n edges, or even 2n edges, would be said to be sparse. A graph with close to the maximum number of edges is said to be dense.

When using graphs in an algorithm it is important to know the ratio between nodes and edges. As we'll see later in this article, the asymptotic running time of the operations performed on a graph is typically expressed in terms of the number of nodes and edges in the graph.

Creating a C# Graph Class

While graphs are a very common data structure used in a wide array of different problems, there is no built-in graph data structure in the .NET Framework. Part of the reason is because an efficient implementation of a Graph class depends on a number of factors specific to the problem at hand. For example, graphs are typically modeled in one of two ways:

  • Adjacency list
  • Adjacency matrix

These two techniques differ in how the nodes and edges of the graph are maintained internally by the Graph class. Let's examine both of these approaches and weigh the pros and cons of each method.

Representing a Graph Using an Adjacency List

In Part 3 we created a C# class for binary trees, called BinaryTree. Recall that each node in a binary tree was represented by a Node class. The Node class contained three properties:

  • Value, which held the value of the node, an object
  • Left, a reference to the Node's left child
  • Right, a reference to the Node's right child

Clearly the Node class and BinaryTree classes are not sufficient for a graph. First, the Node class for a binary tree allows for only two edges—a left and right child. For a more general graph, though, there could be an arbitrary number of edges emanating from a node. Also, the BinaryTree class contains a reference to a single node, the root. But with a graph, there is no single point of reference. Rather, the graph would need to know about all of its nodes.

One option, then, is to create a Node class that has as one of its properties an array of Node instances, which represent the Node's neighbors. Our Graph class would also have an array of Node instances, with one element for each of the nodes in the graphs. Such a representation is called an adjacency list because each node maintains a list of adjacent nodes. Figure 4 depicts an adjacency list representation in graphical form.

Figure 4. Adjacency list representation in graphical form

Notice that with an undirected graph, an adjacency list representation duplicates the edge information. For example, in adjacency list representation (b) in Figure 4, node a has b in its adjacency list, and node b also has node a in its adjacency list.

Each node has precisely as many Nodes in its adjacency list as it has neighbors. Therefore, an adjacency list is a very space-efficient representation of a graph. You never store more data than needed. Specifically, for a graph with V nodes and E edges, a graph using an adjacency list representation will require V + E Node instances for a directed graph and V + 2E Node instances for an undirected graph.

While Figure 4 does not show it, adjacency lists can also be used to represent weighted graphs. The only addition is that for each Node n's adjacency list, each Node instance in the adjacency list needs to store the cost of the edge from n.

The one downside of an adjacency list is that determining if there is an edge from some node u to v requires that u's adjacency list be searched. For dense graphs, u will likely have many Nodes in its adjacency list. Determining if there is an edge between two nodes, then, takes linear time for dense adjacency list graphs. Fortunately, when using graphs we'll likely not need to determine if there exists an edge between two particular nodes. More often than not, we'll want to simply enumerate all the edges of a particular node.

Representing a Graph Using an Adjacency Matrix

An alternative method for representing a graph is to use an adjacency matrix. For a graph with n nodes, an adjacency matrix is an n x n two-dimensional array. For weighted graphs the array element (u, v) would give the cost of the edge between u and v (or, perhaps -1 if no such edge existed between u and v). For an unweighted graph, the array could be an array of Booleans, where a True at array element (u, v) denotes an edge from u to v and a False denotes a lack of an edge.

Figure 5 depicts an adjacency matrix representation in graphical form.

Figure 5. Adjacency matrix representation in graphical form

Note that undirected graphs display symmetry along the adjacency matrix's diagonal. That is, if there is an edge from u to v in an undirected graph then there will be two corresponding array entries in the adjacency matrix: (u, v) and (v, u).

Since determining if an edge exists between two nodes is simply an array lookup, this can be determined in constant time. The downside of adjacency matrices is that they are space inefficient. An adjacency matrix requires an n2 element array, so for sparse graphs much of the adjacency matrix will be empty. Also, for undirected graphs half of the graph is repeated information.
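A minimal sketch of the unweighted, Boolean form of an adjacency matrix might look like the following (a hypothetical standalone example, not part of the Graph class we build below):

```csharp
using System;

public class AdjacencyMatrixDemo
{
   // An n x n Boolean adjacency matrix for an unweighted graph of 4 nodes.
   static bool[,] edges = new bool[4, 4];

   public static void AddUndirectedEdge(int u, int v)
   {
      // An undirected edge produces two symmetric entries: (u, v) and (v, u).
      edges[u, v] = true;
      edges[v, u] = true;
   }

   // Determining whether an edge exists is a constant-time array lookup.
   public static bool HasEdge(int u, int v) { return edges[u, v]; }

   public static void Main()
   {
      AddUndirectedEdge(0, 2);
      Console.WriteLine(HasEdge(0, 2));   // → True
      Console.WriteLine(HasEdge(1, 3));   // → False
   }
}
```

Note that for a sparse graph most entries of this n x n array stay False, which is exactly the space inefficiency described above.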

While either an adjacency matrix or adjacency list would suffice as an underlying representation of a graph for our Graph class, let's move forward using the adjacency list model. I chose this approach primarily because it is a logical extension from the Node and BinaryTree classes that we've already created together.

Creating the Node Class

The Node class represents a single node in the graph. When working with graphs, the nodes typically represent some entity. Therefore, our Node class contains a Data property of type object that can be used to store any sort of data associated with the node. Furthermore, we'll need some way to easily identify nodes, so let's add a string Key property, which serves as a unique identifier for each Node.

Since we are using the adjacency list technique to represent the graph, each Node instance needs to have a list of its neighbors. If the graph uses weighted edges, the adjacency list also needs to store the weight of each edge. To manage this adjacency list, we'll first need to create an AdjacencyList class.

The AdjacencyList and EdgeToNeighbor classes

A Node contains an AdjacencyList class, which is a collection of edges to the Node's neighbors. Since an AdjacencyList stores a collection of edges, we first need to create a class that represents an edge. Let's call this class EdgeToNeighbor, since it models an edge that extends to a neighboring node. Since we might want to associate a weight with this edge, EdgeToNeighbor needs two properties:

  • Cost, an integer value indicating the weight of the edge
  • Neighbor, a Node reference

The AdjacencyList class, then, is derived from the System.Collections.CollectionBase class, and is simply a strongly-typed collection of EdgeToNeighbor instances. The code for EdgeToNeighbor and AdjacencyList is shown below:

public class EdgeToNeighbor
{
   // private member variables
   private int cost;      
   private Node neighbor;

   public EdgeToNeighbor(Node neighbor) : this(neighbor, 0) {}

   public EdgeToNeighbor(Node neighbor, int cost)
   {
      this.cost = cost;
      this.neighbor = neighbor;
   }

   public virtual int Cost
   {
      get
      {
         return cost;
      }
   }

   public virtual Node Neighbor
   {
      get
      {
         return neighbor;
      }
   }
}


public class AdjacencyList : CollectionBase
{
   protected internal virtual void Add(EdgeToNeighbor e)
   {
      base.InnerList.Add(e);
   }

   public virtual EdgeToNeighbor this[int index]
   {
      get { return (EdgeToNeighbor) base.InnerList[index]; }
      set { base.InnerList[index] = value; }
   }
}

The Node class's Neighbors property exposes the Node's internal AdjacencyList member variable. Notice that the AdjacencyList class's Add() method is marked protected internal so that only classes in the assembly can add an edge to a Node's adjacency list. This is done so that the developer using the Graph class can only modify the graph's structure through the Graph class's members and not indirectly through the Node's Neighbors property.

Adding edges to a node

In addition to its Key, Data, and Neighbors properties, the Node class needs to provide a method that allows the developer manipulating the Graph class to add an edge from the Node to a neighbor. Recall that with the adjacency list approach, if there exists an undirected edge between nodes u and v, then u will have a reference to v in its adjacency list and v will have a reference to u in its adjacency list. Nodes should only be responsible for maintaining their own adjacency lists, and not those of other Nodes in the graph. As we'll see later, the Graph class contains methods to add either directed or undirected edges between two nodes.

To make it easier for the Graph class to add an edge between two Nodes, the Node class contains a method for adding a directed edge from itself to some neighbor. This method, AddDirected(), takes in a Node instance and an optional weight, creates an EdgeToNeighbor instance, and adds it to the Node's adjacency list. The following code highlights this process:

protected internal virtual void AddDirected(Node n)
{
   AddDirected(new EdgeToNeighbor(n));
}

protected internal virtual void AddDirected(Node n, int cost)
{
   AddDirected(new EdgeToNeighbor(n, cost));
}

protected internal virtual void AddDirected(EdgeToNeighbor e)
{
   neighbors.Add(e);
}

Building the Graph Class

Recall that with the adjacency list technique, the graph maintains a list of its nodes. Each node, then, maintains a list of adjacent nodes. So, in creating the Graph class we need to have a list of Nodes. We could opt to use an ArrayList to maintain this list, but a more efficient approach is to use a Hashtable. A Hashtable is the more sensible choice here because in the Graph class's methods for adding an edge, we'll need to make sure that both of the Nodes between which the edge is being added exist in the graph. With an ArrayList we'd have to linearly search through the array to find both Node instances; with a Hashtable we can take advantage of a constant-time lookup. (For more information on Hashtables and their asymptotic running times, read Part 2 of the article series.)

The NodeList class, shown below, contains strongly-typed Add() and Remove() methods for adding and removing Node instances from the graph. It also has a ContainsKey() method, which determines if a particular Node Key value already exists in the graph.

public class NodeList : IEnumerable
{
   // private member variables
   private Hashtable data = new Hashtable();


   // methods
   public virtual void Add(Node n)
   {
      data.Add(n.Key, n);
   }

   public virtual void Remove(Node n)
   {
      data.Remove(n.Key);
   }

   public virtual bool ContainsKey(string key)
   {
      return data.ContainsKey(key);
   }

   public virtual void Clear()
   {
      data.Clear();
   }

   // Properties...
   public virtual Node this[string key]
   {
      get
      {
         return (Node) data[key];
      }
   }
   
   // ... some methods and properties removed for brevity ...
}

The Graph class contains a public property Nodes, which is of type NodeList. Additionally, the Graph class has a number of methods for adding directed or undirected, and weighted or unweighted edges between two existing nodes in the graph. The AddDirectedEdge() method takes in two Nodes and an optional weight, and creates a directed edge from the first Node to the second. Similarly, the AddUndirectedEdge() method takes in two Nodes and an optional weight, adding a directed edge from the first to the second Node, as well as a directed edge from the second back to the first Node.

In addition to its methods for adding edges, the Graph class has a Contains() method that returns a Boolean indicating if a particular Node exists in the graph or not. The germane code for the Graph class is shown below:

public class Graph
{
   // private member variables
   private NodeList nodes;


   public Graph()
   {
      this.nodes = new NodeList();
   }


   public virtual Node AddNode(string key, object data)
   {
      // Make sure the key is unique
      if (!nodes.ContainsKey(key))
      {
         Node n = new Node(key, data);
         nodes.Add(n);
         return n;
      }
      else
         throw new ArgumentException("There already exists a node in the graph with key " + key);
   }

   public virtual void AddNode(Node n)
   {
      // Make sure this node is unique
      if (!nodes.ContainsKey(n.Key))
         nodes.Add(n);
      else
         throw new ArgumentException("There already exists a node in the graph with key " + n.Key);
   }


   public virtual void AddDirectedEdge(string uKey, string vKey)
   {
      AddDirectedEdge(uKey, vKey, 0);
   }

   public virtual void AddDirectedEdge(string uKey, string vKey, int cost)
   {
      // get references to uKey and vKey
      if (nodes.ContainsKey(uKey) && nodes.ContainsKey(vKey))
         AddDirectedEdge(nodes[uKey], nodes[vKey], cost);
      else
         throw new ArgumentException("One or both of the nodes supplied were not members of the graph.");
   }

   public virtual void AddDirectedEdge(Node u, Node v)
   {
      AddDirectedEdge(u, v, 0);
   }

   public virtual void AddDirectedEdge(Node u, Node v, int cost)
   {
      // Make sure u and v are Nodes in this graph
      if (nodes.ContainsKey(u.Key) && nodes.ContainsKey(v.Key))
         // add an edge from u -> v
         u.AddDirected(v, cost);
      else         
         // one or both of the nodes were not found in the graph
         throw new ArgumentException("One or both of the nodes supplied were not members of the graph.");
   }


   public virtual void AddUndirectedEdge(string uKey, string vKey)
   {
      AddUndirectedEdge(uKey, vKey, 0);
   }

   public virtual void AddUndirectedEdge(string uKey, string vKey, int cost)
   {
      // get references to uKey and vKey
      if (nodes.ContainsKey(uKey) && nodes.ContainsKey(vKey))
         AddUndirectedEdge(nodes[uKey], nodes[vKey], cost);
      else
         throw new ArgumentException("One or both of the nodes supplied were not members of the graph.");
   }

   public virtual void AddUndirectedEdge(Node u, Node v)
   {
      AddUndirectedEdge(u, v, 0);
   }

   public virtual void AddUndirectedEdge(Node u, Node v, int cost)
   {
      // Make sure u and v are Nodes in this graph
      if (nodes.ContainsKey(u.Key) && nodes.ContainsKey(v.Key))
      {
         // Add an edge from u -> v and from v -> u
         u.AddDirected(v, cost);
         v.AddDirected(u, cost);
      }
      else         
         // one or both of the nodes were not found in the graph
         throw new ArgumentException("One or both of the nodes supplied were not members of the graph.");
   }


   public virtual bool Contains(Node n)
   {
      return Contains(n.Key);
   }

   public virtual bool Contains(string key)
   {
      return nodes.ContainsKey(key);
   }


   public virtual NodeList Nodes
   {
      get
      {
         return this.nodes;
      }
   }
}

Notice that the AddDirectedEdge() and AddUndirectedEdge() methods check to ensure that the Nodes passed in exist in the graph. If they do not, an ArgumentException is thrown. Also note that these two methods have a number of overloads. You can add two nodes by passing in Node references or the Nodes' Key values.

Using the Graph Class

At this point we have created all of the classes needed for our graph data structure. We'll soon turn our attention to some of the more common graph algorithms, such as constructing a minimum spanning tree and finding the shortest path from a single node to all other nodes. But before we do, let's examine how to use the Graph class in a C# application.

Once we create an instance of the Graph class, the next task is to add the Nodes to the graph. This involves calling the AddNode() method of the Graph class for each node to add to the graph. Let's recreate the graph from Figure 2. We'll need to start by adding six nodes. For each of these nodes let's have the Key be the Web page's filename. We'll leave the Data as null, although this might conceivably contain the contents of the file, or a collection of keywords describing the Web page content.

Graph web = new Graph();
web.AddNode("Privacy.htm", null);
web.AddNode("People.aspx", null);
web.AddNode("About.htm", null);
web.AddNode("Index.htm", null);
web.AddNode("Products.aspx", null);
web.AddNode("Contact.aspx", null);

Next we need to add the edges. Since this is a directed, unweighted graph, we'll use the AddDirectedEdge(u, v) method of the Graph class to add an edge from u to v.

web.AddDirectedEdge("People.aspx", "Privacy.htm");  // People -> Privacy

web.AddDirectedEdge("Privacy.htm", "Index.htm");    // Privacy -> Index
web.AddDirectedEdge("Privacy.htm", "About.htm");    // Privacy -> About

web.AddDirectedEdge("About.htm", "Privacy.htm");    // About -> Privacy
web.AddDirectedEdge("About.htm", "People.aspx");    // About -> People
web.AddDirectedEdge("About.htm", "Contact.aspx");   // About -> Contact

web.AddDirectedEdge("Index.htm", "About.htm");      // Index -> About
web.AddDirectedEdge("Index.htm", "Contact.aspx");   // Index -> Contact
web.AddDirectedEdge("Index.htm", "Products.aspx");  // Index -> Products

web.AddDirectedEdge("Products.aspx", "Index.htm");  // Products -> Index
web.AddDirectedEdge("Products.aspx", "People.aspx");// Products -> People

After these commands, web represents the graph shown in Figure 2. Once we have constructed a graph, we'll want to answer some questions. For example, for the graph we just created, we might want to ask, "What's the least number of links a user must click to reach any Web page when starting from the homepage (Index.htm)?" To answer such questions, we can usually fall back on using existing graph algorithms. In the next section we'll examine two common algorithms for weighted graphs:

  • Constructing a minimum spanning tree
  • Finding the shortest path from one node to all others
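The "fewest clicks" question posed above can be answered with a breadth-first search, which the article does not cover in detail; the following standalone sketch (using generic collections for brevity rather than the article's Graph class) models the same link structure as Figure 2:

```csharp
using System;
using System.Collections.Generic;

public class FewestClicks
{
   // The link structure from Figure 2, as a plain adjacency list.
   static readonly Dictionary<string, string[]> Links = new Dictionary<string, string[]>
   {
      { "People.aspx",   new[] { "Privacy.htm" } },
      { "Privacy.htm",   new[] { "Index.htm", "About.htm" } },
      { "About.htm",     new[] { "Privacy.htm", "People.aspx", "Contact.aspx" } },
      { "Index.htm",     new[] { "About.htm", "Contact.aspx", "Products.aspx" } },
      { "Products.aspx", new[] { "Index.htm", "People.aspx" } },
      { "Contact.aspx",  new string[0] }
   };

   // Breadth-first search: the distance assigned to each reachable page
   // is the fewest clicks needed to get there from the start page.
   public static Dictionary<string, int> ClicksFrom(string start)
   {
      var dist = new Dictionary<string, int> { { start, 0 } };
      var queue = new Queue<string>();
      queue.Enqueue(start);
      while (queue.Count > 0)
      {
         string page = queue.Dequeue();
         foreach (string neighbor in Links[page])
            if (!dist.ContainsKey(neighbor))
            {
               dist[neighbor] = dist[page] + 1;
               queue.Enqueue(neighbor);
            }
      }
      return dist;
   }

   public static void Main()
   {
      foreach (var entry in ClicksFrom("Index.htm"))
         Console.WriteLine(entry.Key + ": " + entry.Value);
   }
}
```

Starting from Index.htm, the pages About.htm, Contact.aspx, and Products.aspx are one click away, while Privacy.htm and People.aspx take two clicks.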

A Look at Some Common Graph Algorithms

Because graphs are a data structure that can be used to model a bevy of real-world problems, a virtually unlimited number of algorithms has been designed to find solutions for common problems. To further our understanding of graphs, let's take a look at two of the most studied applications of graphs.

The Minimum Spanning Tree Problem

Imagine that you work for the phone company and your task is to provide phone lines to a village with 10 houses, each labeled H1 through H10. Specifically, this involves running cable that connects every home. That is, the cable must run through houses H1, H2, and so forth, up through H10. Due to geographic obstacles like hills, trees, rivers, and so on, it is not feasible to run cable directly between certain pairs of houses.

Figure 6 shows this problem depicted as a graph. Each node is a house, and the edges are the means by which one house can be wired up to another. The weights of the edges dictate the distance between the homes. Your task is to wire up all ten houses using the least amount of telephone wiring possible.

Figure 6. Graphical representation of hooking up a 10-home village with phone lines

For a connected, undirected graph, there exists some subset of the edges that connects all the nodes and does not introduce a cycle. Such a subset of edges would form a tree (since it would have one fewer edge than vertices and be acyclic), and is called a spanning tree. There are typically many spanning trees for a given graph. Figure 7 shows two valid spanning trees from the Figure 6 graph. (The edges forming the spanning tree are bolded.)

Figure 7. Spanning tree subsets based on Figure 6

For graphs with weighted edges, different spanning trees have different associated costs, where the cost is the sum of the weights of the edges that comprise the spanning tree. A minimum spanning tree, then, is the spanning tree with a minimum cost.

There are two basic approaches to solving the minimum spanning tree problem. One approach is to build up a spanning tree by choosing the edges with the minimum weight, so long as adding an edge does not create a cycle among the edges chosen thus far. This approach is shown in Figure 8.

Figure 8. Minimum spanning tree that uses the edges with the minimum weight

The other approach builds up the spanning tree by dividing the nodes of the graph into two disjoint sets: the nodes currently in the spanning tree and the nodes not yet added. At each iteration, the least weighted edge connecting a node in the spanning tree to a node not yet in the spanning tree is added to the spanning tree. To start the algorithm, some start node must be selected. Figure 9 illustrates this approach in action, using H1 as the starting node. (In Figure 9 those nodes that are in the spanning tree are shaded light yellow.)

Figure 9. Prim method of finding the minimum spanning tree

Notice that the techniques illustrated in Figure 8 and Figure 9 arrived at the same minimum spanning tree. If there is only one minimum spanning tree for the graph, then both of these approaches will reach the same conclusion. If, however, there are multiple minimum spanning trees, these two approaches might arrive at different results (both results will be correct, naturally).

Note   The first approach we examined was discovered by Joseph Kruskal in 1956 at Bell Labs. The second technique was discovered in 1957 by Robert Prim, also a researcher at Bell Labs. There is a plethora of information on these two algorithms on the Web, including Java applets showing the algorithms in progress graphically ( Kruskal's Algorithm | Prim's Algorithm), as well as source code in a variety of languages.
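Prim's approach, the second technique described above, can be sketched compactly. The following is a hypothetical, standalone example on a small four-house graph (the edge weights are illustrative and are not the Figure 6 values):

```csharp
using System;
using System.Collections.Generic;

public class PrimSketch
{
   // A small undirected, weighted graph; each entry lists (neighbor, weight) pairs.
   static readonly Dictionary<string, (string, int)[]> Edges =
      new Dictionary<string, (string, int)[]>
   {
      { "H1", new[] { ("H2", 4), ("H3", 1) } },
      { "H2", new[] { ("H1", 4), ("H3", 2), ("H4", 5) } },
      { "H3", new[] { ("H1", 1), ("H2", 2), ("H4", 8) } },
      { "H4", new[] { ("H2", 5), ("H3", 8) } }
   };

   public static int MinimumSpanningTreeCost(string start)
   {
      // The two disjoint sets: nodes in the tree, and everything else.
      var inTree = new HashSet<string> { start };
      int totalCost = 0;

      // At each iteration, add the least weighted edge connecting a node
      // in the spanning tree to a node not yet in the tree.
      while (inTree.Count < Edges.Count)
      {
         int bestCost = int.MaxValue;
         string bestNode = null;
         foreach (string u in inTree)
            foreach (var (v, cost) in Edges[u])
               if (!inTree.Contains(v) && cost < bestCost)
               {
                  bestCost = cost;
                  bestNode = v;
               }
         inTree.Add(bestNode);
         totalCost += bestCost;
      }
      return totalCost;
   }

   public static void Main()
   {
      Console.WriteLine(PrimSketch.MinimumSpanningTreeCost("H1"));   // → 8
   }
}
```

Starting from H1, the sketch picks the edges H1-H3 (1), H3-H2 (2), and H2-H4 (5), for a total cost of 8.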

Computing the Shortest Path from a Single Source

When flying from one city to another, part of the headache is finding a route that requires the fewest number of connections. No one likes their flight from New York to Los Angeles to go from New York to Chicago, then Chicago to Denver, and finally Denver to Los Angeles. Most people would rather have a direct flight straight from New York to Los Angeles.

Imagine, however, that you are not one of those people. Instead, you are someone who values money much more than time, and you are most interested in finding the cheapest route, regardless of the number of connections. This might mean flying from New York to Miami, then Miami to Dallas, then Dallas to Phoenix, Phoenix to San Diego, and finally San Diego to Los Angeles.

We can solve this problem by modeling the available flights and their costs as a directed, weighted graph. Figure 10 shows such a graph.

Figure 10. Modeling of available flights based on cost

What we want to know is the shortest path from New York to Los Angeles. By inspecting the graph, we can quickly determine that it's from New York to Chicago to San Francisco and finally down to Los Angeles, but in order to have a computer accomplish this task we need to formulate an algorithm to solve the problem at hand.

Edsger Dijkstra, one of the most noted computer scientists of all time, invented the most commonly used algorithm for finding the shortest path from a source node to all other nodes in a weighted, directed graph. This algorithm, dubbed Dijkstra's Algorithm, works by maintaining two tables, each of which has a record for each node. These two tables are:

  • A distance table, which keeps an up-to-date "best distance" from the source node to every other node.
  • A route table, which, for each node n, indicates what node was used to reach n to get the best distance.

Initially, the distance table has each record set to some high value (like positive infinity) except for the start node, which has a distance to itself of 0. The route table's rows are all set to null. Also, a collection of nodes, Q, that need to be examined is maintained; initially, this collection contains all of the nodes in the graph.

The algorithm proceeds by selecting (and removing) the node from Q that has the lowest value in the distance table. Call this selected node n, and let d be n's value in the distance table. For each of n's edges, a check is made to see if d plus the cost to get from n to that particular neighbor is less than the value for that neighbor in the distance table. If it is, then we've found a better way to reach that neighbor, and the distance and route tables are updated accordingly.

To help clarify this algorithm, let's begin applying it to the graph from Figure 10. Since we want to know the cheapest route from New York to Los Angeles we use New York as our source node. Our initial distance table, then, contains a value of infinity for each of the other cities, and a value of 0 for New York. The route table contains nulls for all entries, and Q contains all nodes (see Figure 11).

Figure 11. Distance table and route table for determining cheapest fare

Now, we start by extracting the city from Q that has the lowest value in the distance table, which is New York. We then examine each of New York's neighbors and check to see if the cost to fly from New York to that neighbor is less than the best cost we know of, namely the cost in the distance table. After this first check, we'd have removed New York from Q and updated the distance and route tables for Chicago, Denver, Miami, and Dallas.

Figure 12. Step 2 in the process of determining the cheapest fare

The next iteration gets the cheapest city out of Q, Chicago, and then checks its neighbors to see if there is a better cost. Specifically, we'll check to see if there's a better route for getting to San Francisco or Denver. Clearly the cost to get to San Francisco through Chicago ($75 + $25) is less than Infinity, so San Francisco's records are updated. Also, note that it is cheaper to fly to Denver through Chicago than directly from New York ($75 + $20 < $100), so Denver is updated as well. Figure 13 shows the values of the tables and Q after Chicago has been processed.

Figure 13. Table status after the third leg of the process is finished

This process continues until there are no more nodes in Q. Figure 14 shows the final values of the tables when Q has been exhausted.

Figure 14. Final results of determining the cheapest fare

At the point of exhausting Q, the distance table contains the lowest cost from New York to each city. To determine the flight path to arrive at L.A., start by examining the L.A. entry in the route table and work back up to New York. That is, the route table entry for L.A. is San Francisco, meaning the last leg of the flight to L.A. leaves from San Francisco. The route table entry for San Francisco is Chicago, meaning you'll get to San Francisco through Chicago. Finally, Chicago's route table entry is New York. Putting this together we see that the flight path is from New York to Chicago to San Francisco to L.A.
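The table-driven process traced above can be sketched compactly in C#. This is a minimal illustration, not the article's Graph class or the implementation in the download; the four-node graph and its fares are made up for the example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical directed, weighted graph: fares[city][neighbor] = cost of that leg.
var fares = new Dictionary<string, Dictionary<string, int>>
{
    ["A"] = new Dictionary<string, int> { ["B"] = 75, ["C"] = 100 },
    ["B"] = new Dictionary<string, int> { ["C"] = 20, ["D"] = 25 },
    ["C"] = new Dictionary<string, int> { ["D"] = 40 },
    ["D"] = new Dictionary<string, int>()
};

(Dictionary<string, int> dist, Dictionary<string, string> route) Dijkstra(string source)
{
    // Distance table: best known cost from the source; route table: previous hop.
    var dist = fares.Keys.ToDictionary(c => c, c => int.MaxValue);
    var route = fares.Keys.ToDictionary(c => c, c => (string)null);
    dist[source] = 0;
    var q = new List<string>(fares.Keys);          // nodes still to be examined

    while (q.Count > 0)
    {
        // Extract the node in Q with the smallest distance-table value.
        string n = q.OrderBy(c => dist[c]).First();
        q.Remove(n);
        if (dist[n] == int.MaxValue) break;        // remaining nodes are unreachable

        foreach (var edge in fares[n])
        {
            // Found a cheaper way to reach this neighbor through n?
            if (dist[n] + edge.Value < dist[edge.Key])
            {
                dist[edge.Key] = dist[n] + edge.Value;
                route[edge.Key] = n;               // best route to neighbor goes via n
            }
        }
    }
    return (dist, route);
}

var (d, r) = Dijkstra("A");
Console.WriteLine(d["D"]);   // 100: A to B ($75) then B to D ($25)
Console.WriteLine(r["D"]);   // B: the last leg departs from B
```

To recover the full path, walk the route table backward as described above: route["D"] is B, route["B"] is A, giving the path A to B to D.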

Note   To see a working implementation of Dijkstra's Algorithm in C#, check out the download for this article, which includes a testing application for the Graph class that determines the shortest distance from one city to another using Dijkstra's Algorithm.

Conclusion

Graphs are a commonly used data structure because they can be used to model many real-world problems. A graph consists of a set of nodes with an arbitrary number of connections, or edges, between the nodes. These edges can be either directed or undirected and weighted or unweighted.

In this article, we examined the basics of graphs and created a Graph class. This class was similar to the BinaryTree class created in Part 3, the difference being that instead of having references to at most two edges, the nodes of the Graph class can have an arbitrary number of references. This similarity is not surprising because trees are a special case of graphs.

In addition to creating a Graph class, we also looked at two common graph problems: finding the minimum spanning tree, and computing the shortest path from some source node to all other nodes in a weighted, directed graph. While we did not examine source code to implement these algorithms, there are plenty of source-code examples available on the Internet. Also, the download included with this article contains a testing application for the Graph class that uses Dijkstra's Algorithm to compute the shortest route between two cities.

In the next installment, Part 6, we'll look at efficiently maintaining disjoint sets. Disjoint sets are a collection of two or more sets that do not share any elements in common. For example, with Prim's Algorithm for finding the minimum spanning tree, the nodes of the graph can be divided into two disjoint sets—the set of nodes that currently constitute the spanning tree and the set of nodes that are not yet in the spanning tree.

Related Books

Scott Mitchell, author of five books and founder of 4GuysFromRolla.com, has been working with Microsoft Web technologies for the past five years. Scott works as an independent consultant, trainer, and writer, and recently completed his Masters degree in Computer Science at the University of California – San Diego. He can be reached at mitchell@4guysfromrolla.com or through his blog at http://ScottOnWriting.NET.


 
An Extensive Examination of Data Structures
 

Scott Mitchell
4GuysFromRolla.com

April 2004

Summary: Scott Mitchell discusses data structures for implementing general and disjoint sets. A set is an unordered collection of unique items that can be enumerated and compared to other sets in a variety of ways. (20 printed pages)


Download the Sets.msi sample file.

Contents

Introduction
The Fundamentals of Sets
Implementing an Efficient Set Data Structure
Maintaining a Collection of Disjoint Sets
References
Related Books

Introduction

One of the most basic mathematical constructs is a set, which is an unordered collection of unique objects. The objects contained within a set are referred to as the set's elements. Formally, a set is denoted as a capital, italic letter, with its elements appearing within curly braces ({...}). Examples of this notation can be seen below:

S = { 1, 3, 5, 7, 9 }
T = { Scott, Jisun, Sam }
U = { -4, 3.14159, Todd, x }

In mathematics, sets are typically comprised of numbers, such as set S above, which contains the odd positive integers less than 10. However, notice that the elements of a set can be anything—numbers, people, strings, letters, variables, and so on. Set T, for example, contains peoples' names, and set U contains a mix of numbers, names, and variables.

In this article, we'll start with a basic introduction of sets, including common notation and the operations that can be performed on sets. Following that, we'll examine how to efficiently implement a set data structure with a defined universe. The article concludes with an examination of disjoint sets, and the best data structures to use.

The Fundamentals of Sets

Recall that a set is simply a collection of elements. The "element of" operator, denoted x ∈ S, implies that x is an element in the set S. For example, if set S contains the odd positive integers less than 10, then 1 ∈ S. When reading such notation, you'd say, "1 is an element of S." In addition to 1 being an element of S, we have 3 ∈ S, 5 ∈ S, 7 ∈ S, and 9 ∈ S. The "not an element of" operator, denoted x ∉ S, means that x is not an element of set S.

The number of unique elements in a set is the set's cardinality. The set {1, 2, 3} has cardinality 3, as does the set {1, 1, 1, 1, 1, 1, 1, 2, 3} (because it only has three unique elements). A set may have no elements in it at all. Such a set is called the empty set, is denoted as {} or ∅, and has a cardinality of 0.

When first learning about sets, many developers assume they are the same as collections, like an ArrayList. However, there are some subtle differences. An ArrayList is an ordered collection of elements—each element in an ArrayList has an associated ordinal index, which implies order. Too, there can be duplicate elements in an ArrayList.

A set, on the other hand, is unordered and contains unique items. Since sets are unordered, the elements of a set may be listed in any order. That is, the sets {1, 2, 3} and {3, 1, 2} are considered equivalent. Also, any duplicates in a set are considered redundant. The set {1, 1, 1, 2, 3} and the set {1, 2, 3} are equivalent. Two sets are equivalent if they have the same elements. (Equivalence is denoted with the = sign; if S and T are equivalent they are written as S = T.)

Note   In mathematics, an ordered collection of elements that allows duplicates is referred to as a list. Two lists, L1 and L2, are considered equal if and only if, for i ranging from 1 to the number of elements in the list, the ith element in L1 equals the ith element in L2.

Typically the elements that can appear in a set are restricted to some universe. The universe is the set of all possible values that can appear in a set. For example, we might only be interested in working with sets whose universe are integers. By restricting the universe to integers, we can't have a set that has a non-integer element, like 8.125, or Sam. (The universe is denoted as the set U.)

Relational Operators of Sets

There are a bevy of relational operators that are commonly used with numbers. Some of the more often used ones, especially in programming languages, include <, <=, =, !=, >, and >=. A relational operator determines if the operand on the left hand side is related to the operand on the right hand side based on criteria defined by the relational operator. Relational operators return a "true" or "false" value, indicating whether or not the relationship holds between the operands. For example, x < y returns true if x is less than y, and false otherwise. (Of course the meaning of "less than" depends on the data type of x and y.)

Relational operators like <, <=, =, !=, >, and >= are typically used with numbers. Sets, as we've seen, use the = relational operator to indicate that two sets are equivalent (and can likewise use != to denote that two sets are not equivalent), but relational operators <, <=, >, and >= are not defined for sets. After all, how is one to determine if the set {1, 2, 3} is less than the set {Scott, 3.14159}?

Instead of notions of < and <=, sets use the relational operators subset and proper subset, denoted ⊆ and ⊂, respectively. (Some older texts use ⊂ for subset and ⊊ for proper subset.) S is a subset of T (denoted S ⊆ T) if every element in S is in T. That is, S is a subset of T if it is contained within T. If S = {1, 2, 3}, and T = {0, 1, 2, 3, 4, 5}, then S ⊆ T since every element in S (1, 2, and 3) is an element in T. S is a proper subset of T (denoted S ⊂ T) if S ⊆ T and S ≠ T. That is, if S = {1, 2, 3} and T = {1, 2, 3}, then S ⊆ T since every element in S is an element in T, but S ⊄ T since S = T. (Notice that there is a similarity between the relational operators <= and < for numbers and the relational operators ⊆ and ⊂ for sets.)

Using the new subset operator, we can more formally define set equality. Given sets S and T, S = T if and only if S ⊆ T and T ⊆ S. In English, S and T are equivalent if and only if every element in S is in T, and every element in T is in S.
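These definitions translate directly into code. The following minimal sketch (the helper names are my own, not part of the article's classes) checks subset over integer arrays and derives set equality from two subset checks:

```csharp
using System;

// S ⊆ T: every element of S is an element of T.
bool IsSubset(int[] s, int[] t)
{
    foreach (int x in s)
        if (Array.IndexOf(t, x) < 0) return false;   // found an element of S not in T
    return true;
}

// S = T if and only if S ⊆ T and T ⊆ S.
bool SetEquals(int[] s, int[] t) => IsSubset(s, t) && IsSubset(t, s);

Console.WriteLine(IsSubset(new[] { 1, 2, 3 }, new[] { 0, 1, 2, 3, 4, 5 }));  // True
Console.WriteLine(SetEquals(new[] { 3, 1, 2 }, new[] { 1, 2, 3 }));          // True: order is irrelevant
```

Note that this equality test also treats duplicates as redundant, matching set semantics: SetEquals on {1, 1, 2} and {1, 2} returns true.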

Note   Since ⊆ is analogous to <=, it would make sense that there exists a set relational operator analogous to >=. This relational operator is called superset, and is denoted ⊇; a proper superset is denoted ⊃. Like with <= and >=, S ⊇ T if and only if T ⊆ S.

Set Operations

As with the relational operators, many operations defined for numbers don't translate well to sets. Common operations on numbers include addition, multiplication, subtraction, exponentiation, and so on. For sets, there are four basic operations:

  1. Union – the union of two sets, denoted S ∪ T, is akin to addition for numbers. The union operator returns a set that contains all of the elements in S and all of the elements in T. For example, {1, 2, 3} ∪ {2, 4, 6} equals {1, 2, 3, 2, 4, 6}. (The duplicate 2 can be removed to provide a more concise answer, yielding {1, 2, 3, 4, 6}.) Formally, S ∪ T = {x : x ∈ S or x ∈ T}. In English, this translates to S union T results in the set that contains an element x if x is in S or in T.
  2. Intersection – the intersection of two sets, denoted S ∩ T, is the set of elements that S and T have in common. For example, {1, 2, 3} ∩ {2, 4, 6} equals {2}, since that's the only element both {1, 2, 3} and {2, 4, 6} share in common. Formally, S ∩ T = {x : x ∈ S and x ∈ T}. In English, this translates to S intersect T results in the set that contains an element x if x is both in S and in T.
  3. Difference – the difference of two sets, denoted S - T, is all of the elements in S that are not in T. For example, {1, 2, 3} - {2, 4, 6} equals {1, 3}, since 1 and 3 are the elements in S that are not in T. Formally, S - T = {x : x ∈ S and x ∉ T}. In English, this translates to S set difference T results in the set that contains an element x if x is in S and not in T.
  4. Complement – Earlier we discussed how typically sets are limited to a known universe of possible values, such as the integers. The complement of a set, denoted S', is U - S. (Recall that U is the universe set.) If our universe is the integers 1 through 10, and S = {1, 4, 9, 10}, then S' = {2, 3, 5, 6, 7, 8}. (Complementing a set is akin to negating a number. Just like negating a number twice will give you the original number back, that is, -(-x) = x, complementing a set twice will give you the original set back: S'' = S.)
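The four operations above can be sketched directly in C#. This is a toy illustration over integer arrays; the helper names and the 1 through 10 universe are my own choices, not part of the article's code:

```csharp
using System;
using System.Linq;

int[] S = { 1, 2, 3 }, T = { 2, 4, 6 };
int[] universe = Enumerable.Range(1, 10).ToArray();   // assumed universe: 1..10

// Each operation returns a sorted, duplicate-free array for readability.
int[] Union(int[] a, int[] b) => a.Concat(b).Distinct().OrderBy(x => x).ToArray();
int[] Intersection(int[] a, int[] b) => a.Where(x => b.Contains(x)).Distinct().OrderBy(x => x).ToArray();
int[] Difference(int[] a, int[] b) => a.Where(x => !b.Contains(x)).Distinct().OrderBy(x => x).ToArray();
int[] Complement(int[] a) => Difference(universe, a);  // S' = U - S

Console.WriteLine(string.Join(", ", Union(S, T)));         // 1, 2, 3, 4, 6
Console.WriteLine(string.Join(", ", Intersection(S, T)));  // 2
Console.WriteLine(string.Join(", ", Difference(S, T)));    // 1, 3
```

Note that applying Complement twice returns the original set, mirroring S'' = S.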

When examining new operations, it is always important to get a solid grasp on the nature of the operations. Some questions to ask yourself when learning about any operation—be it one defined for numbers or one defined for sets—are:

  • Is the operation commutative? An operator op is commutative if x op y is equivalent to y op x. In the realm of numbers, addition is an example of a commutative operator, while division is not commutative.
  • Is the operation associative? That is, does the grouping of operations matter? If an operator op is associative, then x op (y op z) is equivalent to (x op y) op z. Again, in the realm of numbers addition is associative, but division is not.

For sets, the union and intersection operations are both commutative and associative. S ∪ T is equivalent to T ∪ S, and S ∪ (T ∪ V) is equivalent to (S ∪ T) ∪ V. Set difference, however, is neither commutative nor associative. (To see that set difference is not commutative, consider that {1, 2, 3} - {3, 4, 5} = {1, 2}, but {3, 4, 5} - {1, 2, 3} = {4, 5}.)

Finite Sets and Infinite Sets

All of the set examples we've looked at thus far have dealt with finite sets. A finite set is a set that has a finite number of elements. While it may seem counterintuitive at first, a set can contain an infinite number of elements. The set of positive integers, for example, is an infinite set since there are no bounds to the number of elements in the set.

In mathematics, there are a couple infinite sets that are used so often that they are given a special symbol to represent them. These include:

  • N = {0, 1, 2, …}
  • Z = {…, -2, -1, 0, 1, 2, …}
  • Q = {a/b : a ∈ Z, b ∈ Z, and b ≠ 0}
  • R = set of real numbers

N is the set of natural numbers: the integers greater than or equal to 0. Z is the set of integers. Q is the set of rational numbers, which are numbers that can be expressed as a fraction of two integers. Finally, R is the set of real numbers, which are all rational numbers plus the irrational numbers as well (numbers that cannot be expressed as a fraction of two integers, such as pi and the square root of 2).

Infinite sets, of course, can't be written down in their entirety, as you'd never finish jotting down the elements, but instead are expressed more tersely using mathematical notation like so:

S = {x : x ∈ N and x > 100}

Here S would be the set of all natural numbers greater than 100.
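Set-builder notation maps naturally onto a characteristic function: a predicate that decides membership. The sketch below (my own illustration, not the article's) shows that a membership test poses no problem even for an infinite set, though enumerating its elements would:

```csharp
using System;

// S = {x : x ∈ N and x > 100}, represented by its membership predicate.
// The domain is assumed to be the natural numbers, so x > 100 suffices.
Predicate<int> S = x => x > 100;

Console.WriteLine(S(101));  // True: 101 ∈ S
Console.WriteLine(S(7));    // False: 7 ∉ S
```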

In this article we will be looking at data structures for representing finite sets. While infinite sets definitely have their place in mathematics, rarely will we need to work with infinite sets in a computer program. Too, there are unique challenges with representing and operating upon infinite sets, since an infinite set's contents cannot be completely stored in a data structure or enumerated.

Note   Computing the cardinality of finite sets is simple—just count up the number of elements. But how does one compute the cardinality of an infinite set? This discussion is beyond the scope of this article, but realize that there are different types of cardinality for infinite sets. Interestingly, there is the same "number" of positive integers, as there are all integers, but there are more real numbers than there are integers.

Sets in Programming Languages

C++, C#, Visual Basic .NET, and Java don't provide inherent language features for working with sets. If you want to use sets, you need to create your own set class with the appropriate methods, properties, and logic. (We'll do precisely this in the next section!) There have been programming languages in the past, though, that have offered sets as a fundamental building block in the language. Pascal, for example, provides a set construct that can be used to create sets with an explicitly defined universe. To work with sets, Pascal provides the in operator to determine if an element is in a particular set. The operators +, *, and – are used for union, intersection, and set difference. The following Pascal code illustrates the syntax used to work with sets:

(* declares a variable named possibleNumbers, a set whose universe is
   the set of integers between 1 and 100... *)
var
  possibleNumbers : set of 1..100;

...

(* Assigns the set {1, 45, 23, 87, 14} to possibleNumbers *)
possibleNumbers := [1, 45, 23, 87, 14];

(* Sets possibleNumbers to the union of possibleNumbers and {3, 98} *)
possibleNumbers := possibleNumbers + [3, 98];

(* Checks to see if 4 is an element of possibleNumbers... *)
if 4 in possibleNumbers then write('4 is in the set!');

Other previous languages have allowed for more powerful set semantics and syntax. A language called SETL (an acronym for SET Language) was created in the 70s and offered sets as a first-class citizen. Unlike Pascal, when using sets in SETL you are not restricted to specifying the set's universe.

Implementing an Efficient Set Data Structure

In this section we'll look at creating a class that provides the functionality and features of a set. When creating such a data structure, one of the first things we need to decide is how to store the elements of the set. This decision can greatly affect the asymptotic efficiency of the operations performed on the set data structure. (Keep in mind that the operations we'll need to perform on the set data structure include: union, intersection, set difference, subset, and element of.)

To illustrate how storing the set's elements can affect the run time, imagine that we created a set class that used an underlying ArrayList to hold the elements of the set. If we had two sets, S1 and S2, that we wanted to union (where S1 had m elements and S2 had n elements), we'd have to perform the following steps:

  1. Create a new set instance, T, that will hold the union of S1 and S2.
  2. Iterate through the elements of S1, adding each one to T.
  3. Iterate through the elements of S2. If the element does not already exist in T, then add it to T.

How many steps would performing the union take? Step (2) would require m steps through S1's m elements. Step (3) would take n steps, and for each element in S2, we'd have to determine if the element was in T. Using unsorted ArrayLists, to determine if an element is in an ArrayList the entire ArrayList must be enumerated linearly. So, for each of the n elements in S2 we might have to search through the m elements in T. This would lead to a quadratic running time for union of O(m * n).

The reason a union with an ArrayList takes quadratic time is because determining if an element exists within a set takes linear time. That is, to determine if an element exists in a set, the set's ArrayList must be exhaustively searched. If we could reduce the running time for the "element of" operation to a constant, we could improve the union's running time to a linear O(m + n). Recall from Part 2 of this article series that a Hashtable provides constant running time to determine if an item resides within the Hashtable. Hence, a Hashtable would be a better choice for storing the set's elements than an ArrayList.
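As a rough sketch of that idea, the elements can be held as Hashtable keys, making "element of" (ContainsKey) a constant-time check and the whole union linear. The helper name below is my own, not part of the article's classes:

```csharp
using System;
using System.Collections;

// Union in O(m + n): each membership test against T is a constant-time lookup.
Hashtable UnionSets(Hashtable s1, Hashtable s2)
{
    Hashtable t = new Hashtable();
    foreach (object x in s1.Keys) t[x] = true;    // m constant-time inserts
    foreach (object x in s2.Keys)
        if (!t.ContainsKey(x)) t[x] = true;       // n constant-time lookups/inserts
    return t;
}

Hashtable a = new Hashtable { [1] = true, [2] = true, [3] = true };
Hashtable b = new Hashtable { [2] = true, [4] = true, [6] = true };
Console.WriteLine(UnionSets(a, b).Count);   // 5, the cardinality of {1, 2, 3, 4, 6}
```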

If we require that the set's universe be known, we can implement an even more efficient set data structure using a bit array. Assume that the universe consists of elements e1, e2, …, ek. Then we can denote a set with a k-element bit array; if the ith bit is 1, then the element ei is in the set; if, on the other hand, the ith bit is 0, then the element ei is not in the set. Representing sets as a bit array not only provides tremendous space savings, but also enables efficient set operations, as these set-based operations can be performed using simple bit-wise instructions. For example, determining if element ei exists in a set takes constant time since only the ith bit in the bit array needs to be checked. The union of two sets is simply the bit-wise OR of the sets' bit arrays; the intersection of two sets is the bit-wise AND of the sets' bit arrays. Set difference and subset can be reduced down to bit-wise operations as well.

Note   A bit array is a compact array composed of 1s and 0s, typically implemented as an integer array. Since an integer in the Microsoft .NET Framework has 32 bits, a bit array can store 32 bit values in one element of an integer array (rather than requiring 32 array elements).
Bit-wise operations are ones that are performed on the individual bits of an integer. There are both binary bit-wise operators and unary bit-wise operators. The bit-wise AND and bit-wise OR operators are binary, taking in two bits each, and returning a single bit. Bit-wise AND returns 1 only if both inputs are 1, otherwise it returns 0. Bit-wise OR returns 0 only if both inputs are 0, otherwise it returns 1.
For a more in-depth look at bit-wise operations in C# be sure to read: Bit-Wise Operators in C#.
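Before walking through the full class, the core idea can be seen in miniature with a single 32-bit mask. In this sketch (mine, not the article's PascalSet) the universe is assumed to be the integers 0 through 31, with element i mapped to bit i:

```csharp
using System;

uint S = (1u << 1) | (1u << 2) | (1u << 3);   // {1, 2, 3}
uint T = (1u << 2) | (1u << 4) | (1u << 6);   // {2, 4, 6}

uint union = S | T;          // bit-wise OR:  {1, 2, 3, 4, 6}
uint intersection = S & T;   // bit-wise AND: {2}
uint difference = S & ~T;    // in S and not in T: {1, 3}

// "element of" is a constant-time check of the ith bit.
bool Contains(uint set, int i) => (set & (1u << i)) != 0;

Console.WriteLine(Contains(union, 4));         // True
Console.WriteLine(Contains(intersection, 1));  // False
```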

Let's look at how to implement a set class that uses C#'s bit-wise operations.

Creating the PascalSet Class

Understand that to implement a set class that uses the efficient bit-wise operators the set's universe must be known. This is akin to the way Pascal uses sets, so in honor of the Pascal programming language I have decided to name this set class the PascalSet class. PascalSet restricts the universe to a range of integers or characters (just like the Pascal programming language). This range can be specified in the PascalSet's constructor.

public class PascalSet : ICloneable, ICollection
{
   // Private member variables
   private int lowerBound, upperBound;
   private BitArray data;

   public PascalSet(int lowerBound, int upperBound)
   {
      // make sure lowerbound is less than or equal to upperbound
      if (lowerBound > upperBound)
         throw new ArgumentException("The set's lower bound cannot be greater than its upper bound.");

      this.lowerBound = lowerBound;
      this.upperBound = upperBound;

      // Create the BitArray
      data = new BitArray(upperBound - lowerBound + 1);
   }

   ...
}

So, to create a PascalSet whose universe is the set of integers between -100 and 250, the following syntax could be used:

PascalSet mySet = new PascalSet(-100, 250);

Implementing the Set Operations

PascalSet implements the standard set operations—union, intersection, and set difference—as well as the standard relational operators—subset, proper subset, superset, and proper superset. The set operations union, intersection, and set difference, all return a new PascalSet instance, which contains the result of unioning, intersecting, or set differencing. The following code for the Union(PascalSet) method illustrates this behavior:

public virtual PascalSet Union(PascalSet s)
{
   if (!AreSimilar(s))
      throw new ArgumentException("Attempting to union two dissimilar sets. Union can only occur between two sets with the same universe.");
   
   // do a bit-wise OR to union together this.data and s.data
   PascalSet result = new PascalSet(this.lowerBound, this.upperBound);
   result.data = this.data.Or(s.data);

   return result;
}

public static PascalSet operator +(PascalSet s, PascalSet t)
{
   return s.Union(t);
}

The AreSimilar(PascalSet) method determines if the PascalSet passed in has the same lower and upper bounds as the current PascalSet instance. Therefore, union (and intersection and set difference) can only be applied to two sets with the same universe. (You could modify the code here to have the returned PascalSet's universe be the union of the two universe sets, thereby allowing sets with different universes to be unioned.) If the two PascalSets have the same universe, then a new PascalSet, result, is created with the same universe, and its BitArray member variable, data, is assigned the bit-wise OR of the two PascalSets' BitArrays. Notice that the PascalSet class also overloads the + operator for union (just like the Pascal programming language).

Enumerating the PascalSet's Members

Since sets are an unordered collection of elements, it would not make sense to have PascalSet implement IList, as collections that implement IList imply that the list has some ordinal order. Since PascalSet is a collection of elements, though, it makes sense to have it implement ICollection. Since ICollection implements IEnumerable, PascalSet needs to provide a GetEnumerator() method that returns an IEnumerator instance allowing a developer to iterate through the set's elements.

Oftentimes when creating a specialized collection class that uses some other underlying collection class to hold the data, the GetEnumerator() method for the specialized class can simply return the IEnumerator from the underlying collection's GetEnumerator() method. Since the PascalSet uses a BitArray to represent what elements are in the set, it might at first seem plausible to have PascalSet's GetEnumerator() method return the IEnumerator from the internal BitArray's GetEnumerator() method. However, the BitArray's GetEnumerator() returns an IEnumerator that enumerates all of the bits in the BitArray, returning a Boolean value for each bit—true if the bit is 1, false if the bit is 0.

The elements in the PascalSet, however, are only those elements where the BitArray's bit is 1. Therefore we need to create a custom class that implements IEnumerator and that intelligently walks the BitArray, only returning those elements whose corresponding bit in the BitArray is 1. To handle this I created a class within the PascalSet class called PascalSetEnumerator. This class's constructor takes in the current PascalSet instance as a sole input parameter. In the MoveNext() method it steps through the BitArray until it finds a bit with a value of 1.

class PascalSetEnumerator : IEnumerator
{
   private PascalSet pSet;
   private int position;

   public PascalSetEnumerator(PascalSet pSet)
   {
      this.pSet = pSet;
      position = -1;
   }

   ...

   public bool MoveNext()
   {
      // increment position
      position++;

      // see if there is another element greater than position            
      for (int i = position; i < pSet.data.Length; i++)
      {
         if (pSet.data.Get(i)) 
         {
            position = i;
            return true;
         }
      }

      // no element found
      return false;
   }
}

The complete code for the PascalSet class is included as a download with this article. Along with the class, there is an interactive WinForms testing application, SetTester, from which you can create a PascalSet instance and perform various set operations, viewing the resulting set.

Maintaining a Collection of Disjoint Sets

Next time you do a search at Google, notice that each result has a link titled "Similar Pages." If you click this link, Google displays a list of URLs that are related to the item whose "Similar Pages" link you clicked. While I don't know precisely how Google determines which pages are related, one approach would be the following:

  • Let x be the Web page we are interested in finding related pages for.
  • Let S1 be the set of Web pages that x links to.
  • Let S2 be the set of Web pages that the Web pages in S1 link to.
  • Let S3 be the set of Web pages that the Web pages in S2 link to.
  • And so on: let Sk be the set of Web pages that the Web pages in Sk-1 link to.

All of the Web pages in S1, S2, up to Sk are the related pages for x. Rather than compute the related Web pages on demand, we might opt to create the set of related pages for all Web pages once, and to store this relation in a database or some other permanent store. Then, when a user clicks on the "Similar Pages" link for a search result, we simply query the database to get the links related to this page.

Google has some sort of database with all of the Web pages it knows about. Each of these Web pages has a set of links. We can compute the set of related Web pages using the following algorithm:

  1. For each Web page in the database create a set, placing the single Web page in the set. (After this step completes, if we have n Web pages in the database, we'll have n one-element sets.)
  2. For a Web page x in the database, find all of those Web pages it directly links to. Call these linked-to pages S. For each element p in S, union the set containing p with x's set.
  3. Repeat step 2 for all Web pages in the database.

After step 3 completes, the Web pages in the database will be partitioned out into related groups. To view a graphical representation of this algorithm in action, consult Figure 1.

 

Figure 1. A graphical representation of an algorithm for grouping linked web pages.

Examining Figure 1, notice that in the end, there are three related partitions:

  • w0
  • w1, w2, w3, and w4
  • w5 and w6

So, when a user clicks the "Similar Pages" link for w2, they would see links to w1, w3, and w4; clicking the "Similar Pages" link for w6 would show only a link to w5.

Notice that with this particular problem only one set operation is being performed: union. Furthermore, all of the Web pages fall into disjoint sets. Given an arbitrary number of sets, these sets are said to be disjoint if they share no elements in common. {1, 2, 3} and {4, 5, 6} are disjoint, for example, while {1, 2, 3} and {2, 4, 6} are not, since they share the common element 2. In all stages shown in Figure 1, the sets containing Web pages are disjoint. That is, it's never the case that one Web page exists in more than one set at a time.

When working with disjoint sets in this manner, we often need to know what particular disjoint set a given element belongs to. To identify each set we arbitrarily pick a representative. A representative is an element from the disjoint set that uniquely identifies that entire disjoint set. With the notion of a representative, I can determine if two given elements are in the same set by checking to see if they have the same representative.

A disjoint set data structure needs to provide two methods:

  • GetRepresentative(element) – this method accepts an element as an input parameter and returns the element's representative element.
  • Union(element, element) – this method takes in two elements. If the elements are from the same disjoint set, then Union() does nothing. If, however, the two elements are from different disjoint sets, then Union() combines the two disjoint sets into one set.
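
The two-method contract can be written down as a minimal interface. The sketch below is in Python (the series' own code is C#), and the method names simply mirror the prose above:

```python
from abc import ABC, abstractmethod

class DisjointSet(ABC):
    """The two-operation contract a disjoint set structure provides."""

    @abstractmethod
    def get_representative(self, element):
        """Return the representative of element's disjoint set."""

    @abstractmethod
    def union(self, a, b):
        """Merge the sets containing a and b; do nothing if they
        already share a representative."""
```

Both the linked list and forest implementations discussed next are different ways of realizing this same contract.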

The challenge that faces us now is how to efficiently maintain a number of disjoint sets when these sets are repeatedly merged, two at a time, into one. There are two basic data structures that can be used to tackle this problem: one uses a series of linked lists, the other a collection of trees.

Maintaining Disjoint Sets with Linked Lists

In Part 4 of this article series we took a moment to look at a primer on linked lists. Recall that linked lists are a set of nodes that typically have a single reference to their next neighbor. Figure 2 shows a linked list with four elements.

Figure 2. A linked list with four elements

For the disjoint set data structure, a set is represented using a modified linked list. Rather than just having a reference to its neighbor, each node in the disjoint set linked list has a reference to the set's representative. As Figure 3 illustrates, all nodes in the linked list point to the same node as their representative, which is, by convention, the head of the linked list. (Figure 3 shows the linked list representation of the disjoint sets from the final stage of the algorithm dissected in Figure 1. Notice that for each disjoint set there exists a linked list, and the nodes of the linked list contain the elements of that particular disjoint set.)

Figure 3. A linked list representation of the disjoint sets from the final stage of the algorithm dissected in Figure 1.

Since each element in a set has a direct reference back to the set's representative, the GetRepresentative(element) method takes constant time. (To understand why, consider that regardless of how many elements a set has, it will always take one operation to find a given element's representative, since it involves just checking the element's representative reference.)

Using the linked list approach, combining two disjoint sets into one involves adding one linked list to the end of another, and updating the representative reference in each of the appended nodes. The process of joining two disjoint sets is depicted in Figure 4.

Figure 4. The process of joining two disjoint sets
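
This linked-list scheme can be sketched as follows. The Python below is an illustration (the series uses C#) in which each element keeps a direct reference to its representative, and each representative tracks its member list so a union can repoint every appended node:

```python
class LinkedListDisjointSet:
    """Linked-list disjoint sets: constant-time representative
    lookup, linear-time union (in the size of the appended set)."""

    def __init__(self, elements):
        # Make-set: each element is its own representative.
        self.rep = {e: e for e in elements}
        self.members = {e: [e] for e in elements}

    def get_representative(self, element):
        # Constant time: one direct lookup.
        return self.rep[element]

    def union(self, a, b):
        ra, rb = self.rep[a], self.rep[b]
        if ra == rb:
            return  # already in the same set
        # Append b's list onto a's, updating each moved node's
        # representative reference (this is the costly step).
        for e in self.members[rb]:
            self.rep[e] = ra
            self.members[ra].append(e)
        del self.members[rb]

ds = LinkedListDisjointSet(["w1", "w2", "w3", "w4", "w5"])
ds.union("w1", "w2")
ds.union("w3", "w4")
ds.union("w1", "w3")
print(ds.get_representative("w4"))  # w1
```

Note that this naïve version always appends the second set to the first, which is precisely what the worst-case analysis below exploits.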

When unioning two disjoint sets, the correctness of the algorithm is not affected by which of the two sets is appended to the other. However, the running time can be. Imagine that our union algorithm randomly chose one of the two linked lists to append to the other, and that by a stroke of bad luck we always chose to append the longer of the two. This negatively impacts the running time of the union operation, since we have to enumerate all of the nodes in the appended linked list to update their representative references. That is, imagine we make n disjoint sets, S1 to Sn, each with one element. We could then do n - 1 unions, joining all n sets into one big set with n elements. Imagine the first union joins S1 and S2, with S1 as the representative of the two-element unioned set. Since S2 only has one element, only one representative reference needs to be updated. Now, imagine S1—which has two elements—is unioned with S3, and S3 is made the representative. This time two representative references—S1's and S2's—need to be updated. Similarly, when joining that set with S4, if S4 is made the representative of the new set, three representative references need to be updated (S1's, S2's, and S3's). In the (n-1)th union, n-1 representative references need to be updated.

Summing up the representative updates over the entire sequence gives 1 + 2 + … + (n-1) = n(n-1)/2 updates, so the whole sequence of steps (n make-set operations and n-1 unions) takes quadratic time, O(n²).

This worst-case running time can transpire because union may choose the longer set to append to the shorter one, and appending the longer set means more nodes' representative references need to be updated. A better approach is to keep track of the size of each set and, when joining two sets, to append the smaller of the two linked lists. The running time with this improved approach drops to O(n log₂ n). A thorough time analysis is a bit beyond the scope of this article and is omitted for brevity; refer to the readings in the References section for a formal proof.
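
The size-tracking improvement can be sketched as a self-contained class. Again this is an illustrative Python version (the article's implementation language is C#), where union always appends the smaller member list:

```python
class WeightedLinkedListDisjointSet:
    """Linked-list disjoint sets that always append the smaller of
    the two member lists, bounding total update work at O(n log2 n)."""

    def __init__(self, elements):
        self.rep = {e: e for e in elements}
        self.members = {e: [e] for e in elements}

    def get_representative(self, element):
        return self.rep[element]

    def union(self, a, b):
        ra, rb = self.rep[a], self.rep[b]
        if ra == rb:
            return
        # Append the smaller set to the larger one, so each element's
        # representative reference is rewritten at most log2(n) times
        # (its set at least doubles in size on every rewrite).
        if len(self.members[ra]) < len(self.members[rb]):
            ra, rb = rb, ra
        for e in self.members[rb]:
            self.rep[e] = ra
            self.members[ra].append(e)
        del self.members[rb]

ds = WeightedLinkedListDisjointSet(["a", "b", "c", "d"])
ds.union("a", "b")
ds.union("c", "a")  # the singleton {c} is appended, not {a, b}
print(ds.get_representative("c"))  # a
```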

To appreciate the improvement of O(n log₂ n) over O(n²), observe Figure 5, which shows the growth rate of n² in blue and the growth rate of n log₂ n in pink. For small values of n the two are comparable, but once n exceeds 32, n log₂ n grows much more slowly than n². For example, performing 64 unions would require over 4,000 operations using the naïve linked list implementation, while it would take only 384 operations using the optimized linked list implementation. These differences become even more profound as n gets larger.

Figure 5. Growth rates of n² and n log₂ n

Maintaining Disjoint Sets with a Forest

Disjoint sets can also be maintained using a forest. A forest is a set of trees (get it?). Recall that with the linked list implementation, the set's representative was the head of the list. With the forest implementation, each set is implemented as a tree, and the set's representative is the root of the tree. (If you are unfamiliar with what trees are, consider reading Part 3 of this article series, where we discussed trees, binary trees, and binary search trees.)

With the linked list approach, given an element, finding its set's representative was fast since each node had a direct reference to its representative. However, with the linked list approach unioning took longer because it involved appending one linked list to another, which required that the appended nodes' representative references be updated. The forest approach aims at making unions fast, at the expense of finding a set's representative given an element in the set.

The forest approach implements each disjoint set as a tree, with the root as the representative. To union together two sets, one tree is appended as a child of the other. Figure 6 illustrates this concept graphically.

Figure 6. The union of two sets

To union two sets together requires constant time, as only one node needs to have its representative reference updated. (In Figure 6, to union together the w1 and w3 sets, all we had to do was have w3 update its reference to w1—nodes w4 and w5 didn't need any modification.)

Compared to the linked list implementation, the forest approach has improved the time required for unioning two disjoint sets, but has worsened the time for finding the representative for a set. The only way we can determine a set's representative, given an element, is to walk up the set's tree until we find the root. Imagine that we wanted to find the representative for w5 (after sets w1 and w3 had been unioned). We'd walk up the tree until we reached the root—first to w3, and then to w1. Hence, finding the set's representative takes time relative to the depth of the tree, and not constant time as it does with the linked list representation.
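
The unoptimized forest approach might be sketched like this. The Python below is an illustration (the series uses C#): union is a single reference update, while finding the representative walks up to the root:

```python
class ForestDisjointSet:
    """Unoptimized forest: constant-time union, but representative
    lookup takes time proportional to the depth of the tree."""

    def __init__(self, elements):
        # Each element starts as the root of its own one-node tree.
        self.parent = {e: e for e in elements}

    def get_representative(self, element):
        # Walk up the tree until we reach the root (its own parent).
        while self.parent[element] != element:
            element = self.parent[element]
        return element

    def union(self, a, b):
        ra = self.get_representative(a)
        rb = self.get_representative(b)
        if ra != rb:
            # One reference update: rb's whole tree hangs under ra.
            self.parent[rb] = ra

ds = ForestDisjointSet(["w1", "w3", "w4", "w5"])
ds.union("w3", "w4")
ds.union("w3", "w5")
ds.union("w1", "w3")
print(ds.get_representative("w5"))  # w1, found by walking w5 -> w3 -> w1
```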

The forest approach offers two optimizations that, when both employed, yield a linear running time for performing n disjoint set operations, meaning that each single operation has an average constant running time. These two optimizations are called union by rank and path compression. What we are trying to avoid with these two optimizations is having a sequence of unions generate a tall, skinny tree. As discussed in Part 3 of this article series, the ratio of a tree's height to breadth typically impacts its running time. Ideally, a tree is fanned out as much as possible, rather than being tall and narrow.

The Union by Rank Optimization

Union by rank is akin to the linked list's optimization of appending the shorter list to the longer one. Specifically, union by rank maintains a rank for each set's root, which provides an upper bound on the height of the tree. When unioning two sets, the root with the smaller rank is appended as a child of the root with the larger rank. Union by rank helps ensure that our trees will be broad. However, even with union by rank we might still end up with tall, albeit wide, trees. Figure 7 shows a picture of a tree that might be formed by a series of unions that adhere only to the union by rank optimization. The problem is that leaf nodes on the right-hand side still must perform a number of operations to find their set's representative.

Figure 7. A tree that might be formed by a series of unions that adhere only to the union by rank optimization
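
Union by rank might be sketched as follows, again as an illustrative Python version (the series' code is C#), with a rank stored per root exactly as the text describes:

```python
class RankedForestDisjointSet:
    """Forest with the union by rank optimization only."""

    def __init__(self, elements):
        self.parent = {e: e for e in elements}
        self.rank = {e: 0 for e in elements}  # upper bound on tree height

    def get_representative(self, element):
        while self.parent[element] != element:
            element = self.parent[element]
        return element

    def union(self, a, b):
        ra = self.get_representative(a)
        rb = self.get_representative(b)
        if ra == rb:
            return
        # Attach the lower-ranked root under the higher-ranked one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        # Only equal-rank unions can grow the upper bound on height.
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

ds = RankedForestDisjointSet(["a", "b", "c"])
ds.union("a", "b")  # equal ranks: b goes under a, a's rank becomes 1
ds.union("c", "b")  # c's root has rank 0, so it goes under a
print(ds.get_representative("c"))  # a
```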

Note   The forest approach, when implementing just the union by rank optimization, has the same running time as the optimized linked list implementation.
The Path Compression Optimization

Since a tall tree makes finding a set's representative expensive, ideally we'd like our trees to be broad and flat. The path compression optimization works to flatten out a tree. As we discussed earlier, whenever an element is queried for its set's representative, the algorithm walks up the tree to the root. Path compression piggybacks on this walk: each node visited on the way up has its parent reference updated to point directly at the root.

To understand how this flattening works, consider the tree in Figure 7. Now, imagine that we need to find the set representative for w13. The algorithm starts at w13, walks up to w12, then to w8, and finally to w1, returning w1 as the representative. Using path compression, this walk also has the side effect of updating w13's and w12's parents to the root—w1. Figure 8 shows the tree after this path compression has occurred.

Figure 8. A tree after path compression

Path compression pays a slight overhead the first time when finding a representative, but benefits future representative lookups. That is, after this path compression has occurred, finding the set representative for w13 takes one step, since w13 is a child of the root. In Figure 7, prior to path compression, finding the representative for w13 would have taken three steps. The idea here is that you pay for the improvement once, and then benefit from the improvement each time the check is performed in the future.
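
Both optimizations together can be sketched as follows. This is an illustrative Python version (the series itself uses C#); the Figure 7 chain is hand-built by assigning parents directly, purely for demonstration:

```python
class OptimizedDisjointSet:
    """Forest with both union by rank and path compression."""

    def __init__(self, elements):
        self.parent = {e: e for e in elements}
        self.rank = {e: 0 for e in elements}

    def get_representative(self, element):
        # First pass: find the root.
        root = element
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): repoint every visited
        # node directly at the root.
        while self.parent[element] != root:
            self.parent[element], element = root, self.parent[element]
        return root

    def union(self, a, b):
        ra = self.get_representative(a)
        rb = self.get_representative(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

# Hand-build the Figure 7 chain w13 -> w12 -> w8 -> w1 to mirror the
# example above (bypasses rank bookkeeping; for illustration only).
ds = OptimizedDisjointSet(["w1", "w8", "w12", "w13"])
ds.parent["w13"], ds.parent["w12"], ds.parent["w8"] = "w12", "w8", "w1"
print(ds.get_representative("w13"))  # w1; w13 and w12 now point straight at w1
```

After the call, a second lookup for w13 takes a single step, which is exactly the pay-once, benefit-later trade described above.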

When employing both the union by rank and path compression algorithms, the time it takes to perform n operations on disjoint sets is linear. That is, the forest approach, utilizing both optimizations, has a running time of O(n). You'll have to take my word on this, as the formal proof for the time complexity is quite lengthy and involved, and could easily fill several printed pages. If you are interested, though, in reading this multi-page time analysis, refer to the "Introduction to Algorithms" text listed in the references.

References

  • Alur, Rajeev. "Disjoint Sets"
  • Cormen, Thomas H., Charles E. Leiserson, and Ronald L. Rivest. "Introduction to Algorithms." MIT Press. 1990.
  • Devroye, Luc. "Disjoint Set Structures".

Scott Mitchell, author of five books and founder of 4GuysFromRolla.com, has been working with Microsoft Web technologies for the past five years. Scott works as an independent consultant, trainer, and writer, and recently completed his Masters degree in Computer Science at the University of California, San Diego. He can be reached at mitchell@4guysfromrolla.com or through his blog at http://www.ScottOnWriting.NET.
