Effective C# Item 40: Match Your Collection to Your Needs

To the question "Which collection is best?" the answer is "It depends." Different collections have different performance characteristics and are optimized for different actions. The .NET Framework supports many of the familiar collections: lists, arrays, queues, stacks, and others. C# supports multidimensional arrays, which have performance characteristics that differ from those of both single-dimensional arrays and jagged arrays. The framework also includes many specialized collections; look through those before you build your own. You can find all the collections quickly because all collections implement the ICollection interface. The documentation for ICollection lists all classes that implement that interface. Those 20-odd classes are the collections at your disposal.


To pick the right collection for your proposed use, you need to consider the actions you'll most often perform on that collection. To produce a resilient program, you'll rely on the interfaces that are implemented by the collection classes so that you can substitute a different collection when you find that your assumptions about the usage of the collection were incorrect (see Item 19).
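As a minimal sketch of what relying on the interfaces can look like, the method below accepts any IList. The class and method names (InterfaceSketch, CountNonNull) are hypothetical and chosen only for illustration.

```csharp
using System.Collections;

public static class InterfaceSketch
{
    // Depends only on IList, so the caller can pass an array, an ArrayList,
    // or any other IList implementation without changing this method.
    public static int CountNonNull( IList items )
    {
        int count = 0;
        foreach ( object item in items )
        {
            if ( item != null )
                count++;
        }
        return count;
    }
}
```

Because arrays and ArrayList both implement IList, swapping one for the other later would not require touching this code.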


The .NET Framework has three different kinds of collections: arrays, array-like collections, and hash-based containers. Arrays are the simplest and generally the fastest, so let's start there. This is the collection type you'll use most often.


Your first choice should often be the System.Array class or, more correctly, a type-specific array class. The first and most significant reason for choosing the array class is that arrays are type-safe. All other collection classes store System.Object references, until C# 2.0 introduces generics (see Item 49). When you declare any array, the compiler creates a specific System.Array derivative for the type you specify. For example, this declaration creates an array of integers:


private int [] _numbers = new int[100];

The array stores integers, not System.Object. That's significant because you avoid the boxing and unboxing penalty when you add, access, or remove value types from the array (see Item 17). That initialization creates a single-dimensional array with 100 integers stored in it. All the memory occupied by the array has a 0 bit pattern stored in it. Arrays of value types are all 0s, and arrays of reference types are all null. Each item in the array can be accessed by its index:


  int j = _numbers[ 50 ];
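To make the type-safety and boxing point concrete, here is a small sketch that sums the same values from an int[] and from an ArrayList. The names TypeSafetySketch, SumArray, and SumList are hypothetical.

```csharp
using System.Collections;

public static class TypeSafetySketch
{
    // The array holds ints directly: no casts and no boxing.
    public static int SumArray( int[] numbers )
    {
        int sum = 0;
        foreach ( int n in numbers )
            sum += n;
        return sum;
    }

    // ArrayList holds System.Object references, so every int was boxed
    // when it was added and must be unboxed with a cast here. The cast
    // throws InvalidCastException if some element is not an int.
    public static int SumList( ArrayList numbers )
    {
        int sum = 0;
        foreach ( object o in numbers )
            sum += (int) o;
        return sum;
    }
}
```

The compiler rejects an attempt to store a string in an int[], whereas the ArrayList version only fails at runtime.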

In addition to the array access, you can iterate the array using foreach or an enumerator (see Item 11):


  foreach ( int i in _numbers )
    Console.WriteLine( i.ToString( ) );
  // or:
  IEnumerator it = _numbers.GetEnumerator( );
  while( it.MoveNext( ))
  {
    int i = (int) it.Current;
    Console.WriteLine( i.ToString( ) );
  }

If you are storing a single sequence of like objects, you should store them in an array. But often, your data structures are more complicated collections. It's tempting to quickly fall back on the C-style jagged array, an array that contains other arrays. Sometimes, this is exactly what you need. Each element in the outer collection is an array along the inner direction:


  public class MyClass
  {
    // Declare a jagged array:
    private int[] [] _jagged;

    public MyClass()
    {
      // Create the outer array:
      _jagged = new int[5][];

      // Create each inner array:
      _jagged[0] = new int[5];
      _jagged[1] = new int[10];
      _jagged[2] = new int[12];
      _jagged[3] = new int[7];
      _jagged[4] = new int[23];
    }
  }

Each inner single-dimension array can have a different length, independent of the outer array and of the other inner arrays. Use jagged arrays when you need to create differently sized arrays of arrays. The drawback to jagged arrays is that a column-wise traversal is inefficient. Examining the third column in each row of a jagged array requires two lookups for each access. There is no relationship between the locations of the element at row 0, column 3 and the element at row 1, column 3. Only multidimensional arrays can perform column-wise traversals more efficiently. Old-time C and C++ programmers made their own two- (or more) dimensional arrays by mapping them onto a single-dimension array. For old-time C and C++ programmers, this notation is clear:


  double num = MyArray[ i * rowLength + j ];

The rest of the world would prefer this:


  double num = MyArray[ i, j ];

But C and C++ had no convenient support for multidimensional arrays whose dimensions are chosen at runtime. C# does. Use the multidimensional array syntax: it's clearer to both you and the compiler when you mean to create a true multidimensional structure. You create multidimensional arrays using an extension of the familiar single-dimension array notation:


  private int[ , ] _multi = new int[ 10, 10 ];

The previous declaration creates a two-dimensional array, a 10x10 array with 100 elements. The length of each dimension in a multidimensional array is always constant. The compiler utilizes this property to generate more efficient initialization code. Initializing the jagged array requires multiple initialization statements. In my simple example earlier, you need five statements. Larger arrays or arrays with more dimensions require more extensive initialization code. You must write code by hand. However, multidimensional arrays with more dimensions merely require more dimension specifiers in the initialization statement. Furthermore, the multidimensional array initializes all array elements more efficiently. Arrays of value types are initialized to contain a value at each valid index in the array. The contents of the value are all 0. Arrays of reference types contain null at each index in the array. Arrays of arrays contain null inner arrays.
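The difference in initialization effort can be sketched side by side. The class and method names below (ArrayLayoutSketch, MakeMulti, MakeJagged) are hypothetical, and the jagged version assumes every row should have the same length.

```csharp
public static class ArrayLayoutSketch
{
    // One statement allocates the whole block; every element starts at 0.
    public static int[,] MakeMulti( int rows, int cols )
    {
        return new int[ rows, cols ];
    }

    // The outer array initially holds null inner arrays, so each row
    // has to be allocated by hand.
    public static int[][] MakeJagged( int rows, int cols )
    {
        int[][] jagged = new int[ rows ][];
        for ( int i = 0; i < rows; i++ )
            jagged[ i ] = new int[ cols ];
        return jagged;
    }
}
```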


Traversing multidimensional arrays is almost always faster than traversing jagged arrays, especially for by-column or diagonal traversals. The compiler can use pointer arithmetic on any dimension of the array. Jagged arrays require finding each correct value for each single-dimension array.
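A by-column traversal over each layout might look like the following sketch. It illustrates the two lookups per access in the jagged case and makes no claim about measured timings. ColumnTraversalSketch and SumColumn are hypothetical names.

```csharp
public static class ColumnTraversalSketch
{
    // Multidimensional: a single indexed access per element.
    public static int SumColumn( int[,] grid, int column )
    {
        int sum = 0;
        for ( int row = 0; row < grid.GetLength( 0 ); row++ )
            sum += grid[ row, column ];
        return sum;
    }

    // Jagged: first find the inner array, then the element inside it.
    public static int SumColumn( int[][] jagged, int column )
    {
        int sum = 0;
        for ( int row = 0; row < jagged.Length; row++ )
        {
            int[] inner = jagged[ row ];      // first lookup: the row
            if ( column < inner.Length )      // rows may be shorter than the column index
                sum += inner[ column ];       // second lookup: the element
        }
        return sum;
    }
}
```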


Multidimensional arrays can be used like any collection in many ways. Suppose you are building a game based on a checker board. You'd make a board of 64 Squares laid out in a grid:


  private Square[ , ] _theBoard = new Square[ 8, 8 ];

This initialization creates the array storage for the Squares. Assuming that Square is a reference type, the Squares themselves are not yet created, and each array element is null. To initialize the elements, you must look at each dimension in the array:


  for ( int i = 0; i < _theBoard.GetLength( 0 ); i++ )
    for ( int j = 0; j < _theBoard.GetLength( 1 ); j++ )
      _theBoard[ i, j ] = new Square( );

But you have more flexibility in traversing the elements in a multidimensional array. You can get an individual element using the array indexes:


  Square sq = _theBoard[ 4, 4 ];

If you need to iterate the entire collection, you can use an iterator:


  foreach( Square sq in _theBoard )
    sq.PaintSquare( );

Contrast that with what you would write for jagged arrays:


  // Assumes _theBoard is declared as a jagged array (Square[][]) here:
  foreach( Square[] row in _theBoard )
    foreach( Square sq in row )
      sq.PaintSquare( );

Every new dimension in a jagged array introduces another foreach statement. However, with a multidimensional array, a single foreach statement generates all the code necessary to check the bounds of each dimension and get each element of the array. The foreach statement generates specific code to iterate the array by using each array dimension. The foreach loop generates the same code as if you had written this:


  for ( int i = _theBoard.GetLowerBound( 0 );  i <= _theBoard.GetUpperBound( 0 ); i++ )
    for ( int j = _theBoard.GetLowerBound( 1 ); j <= _theBoard.GetUpperBound( 1 ); j++ )
      _theBoard[ i, j ].PaintSquare( );

This looks inefficient, considering all those calls to GetLowerBound and GetUpperBound inside the loop statement, but it's actually the most efficient construct. The JIT compiler knows enough about the array class to cache the boundaries and to recognize that internal bounds checking can be omitted (see Item 11).
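One consequence of that expansion is the visiting order: a single foreach over a multidimensional array walks the rightmost index fastest, exactly as the nested for loops do. The sketch below uses a hypothetical helper (ForeachOrderSketch.Flatten) with int elements so the order is easy to see.

```csharp
using System.Text;

public static class ForeachOrderSketch
{
    // Joins the elements in the order foreach visits them.
    public static string Flatten( int[,] grid )
    {
        StringBuilder sb = new StringBuilder();
        foreach ( int value in grid )
        {
            if ( sb.Length > 0 )
                sb.Append( ' ' );
            sb.Append( value );
        }
        return sb.ToString();
    }
}
```

For a 2x2 array holding { { 1, 2 }, { 3, 4 } }, Flatten returns "1 2 3 4".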


Two major disadvantages to the array class will make you examine the other collection classes in the .NET Framework. The first affects resizing the arrays: Arrays cannot be dynamically resized. If you need to modify the size of any dimension of an array, you must create a new array and copy all the existing elements to it. Resizing takes time: A new array must be allocated, and all the elements must be copied from the existing array into the new array. Although this copying and moving is not as expensive on the managed heap as it was in the C and C++ days, it still costs time. More important, it can result in stale data being used. Consider this code fragment:


  private string [] _cities = new string[ 100 ];

  public void SetDataSources( )
  {
    myListBox.DataSource = _cities;
  }

  public void AddCity( string CityName )
  {
    String[] tmp = new string[ _cities.Length + 1 ];
    _cities.CopyTo( tmp, 0 );
    tmp[ _cities.Length ] = CityName;

    _cities = tmp; // swap the storage.
  }

Even after AddCity is called, the list box uses the old copy of the _cities array for its data source. Your new city never shows up in the list box.


The ArrayList class is a higher-level abstraction built on an array. The ArrayList collection mixes the semantics of a single-dimension array with the semantics of a linked list. You can perform inserts on an ArrayList, and you can resize an ArrayList. The ArrayList delegates almost all of its responsibilities to the contained array, which means that the ArrayList class has very similar performance characteristics to the Array class. The main advantage of ArrayList over Array is that ArrayList is easier to use when you don't know exactly how large your collection will be. ArrayList can grow and shrink over time. You still pay the performance penalty of moving and copying items, but the code for those algorithms has already been written and tested. Because the internal storage for the array is encapsulated in the ArrayList object, the problem of stale data does not exist: Clients point to the ArrayList object instead of the internal array. The ArrayList collection is the .NET Framework's version of the C++ Standard Library vector class.
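A sketch of the city example rewritten around ArrayList shows why the stale-reference problem goes away. CityList and its DataSource property are hypothetical stand-ins for the class and list box binding in the earlier fragment; whether a particular UI control repaints automatically is a separate matter not covered here.

```csharp
using System.Collections;

public class CityList
{
    private ArrayList _cities = new ArrayList();

    // Clients hold this reference. It stays valid as the list grows,
    // because the ArrayList object itself is never replaced.
    public IList DataSource
    {
        get { return _cities; }
    }

    public void AddCity( string cityName )
    {
        // ArrayList reallocates its private array internally when needed.
        _cities.Add( cityName );
    }
}
```

A client that fetched DataSource before AddCity was called sees the new city through the same reference.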


The Queue and the Stack classes provide specialized interfaces on top of the System.Array class. The specific interfaces for those classes build custom interfaces for the first-in, first-out queue and the last-in, first-out stack. Always remember that these collections are built using a single-dimension array as their internal storage. The same performance penalty applies when you resize them.
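The two orderings can be illustrated with a short sketch. QueueStackSketch, FifoOrder, and LifoOrder are hypothetical names.

```csharp
using System.Collections;

public static class QueueStackSketch
{
    // Items come out of a Queue in the order they went in.
    public static object[] FifoOrder( object[] items )
    {
        Queue queue = new Queue();
        foreach ( object item in items )
            queue.Enqueue( item );

        object[] result = new object[ items.Length ];
        for ( int i = 0; i < result.Length; i++ )
            result[ i ] = queue.Dequeue();
        return result;
    }

    // Items come out of a Stack in reverse order.
    public static object[] LifoOrder( object[] items )
    {
        Stack stack = new Stack();
        foreach ( object item in items )
            stack.Push( item );

        object[] result = new object[ items.Length ];
        for ( int i = 0; i < result.Length; i++ )
            result[ i ] = stack.Pop();
        return result;
    }
}
```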


The .NET collections don't contain a linked list structure. The efficiency of the garbage collector minimizes the times when a list structure is really the best choice. If you really need linked list behavior, you have two options. If you are using a list because you expect to add and remove items often, you can use the dictionary classes with null values. Simply store the keys. You can use the ListDictionary class, which implements a singly linked list of key/value pairs. Or, you can use the HybridDictionary class, which uses the ListDictionary for small collections and switches to a Hashtable for larger collections. These collections and a host of others are in the System.Collections.Specialized namespace. However, if you want to use a list structure because of a user-controllable order, you can use the ArrayList collection. The ArrayList can perform inserts at any location, even though it uses an array as its internal storage.
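The "store only the keys" idea could be wrapped up roughly as follows. KeyOnlyList is a hypothetical class name; the sketch holds the storage as IDictionary so that ListDictionary can be swapped for HybridDictionary by changing one line.

```csharp
using System.Collections;
using System.Collections.Specialized;

public class KeyOnlyList
{
    // Keys carry the data; every value is null.
    private readonly IDictionary _items = new ListDictionary();

    public void Add( string key )
    {
        _items[ key ] = null;
    }

    public void Remove( string key )
    {
        _items.Remove( key );
    }

    public bool Contains( string key )
    {
        return _items.Contains( key );
    }

    public int Count
    {
        get { return _items.Count; }
    }
}
```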


Two other classes support dictionary-based collections: SortedList and Hashtable. Both contain key/value pairs. SortedList orders the keys, whereas Hashtable does not. Hashtable performs searches for a given key faster, but SortedList provides an ordered iteration of the elements by key. Hashtable finds keys using the hash value of the key object. A lookup is a constant-time operation, O(1), provided the key's hash function is efficient and distributes keys well. SortedList uses a binary search algorithm to find the keys. This is a logarithmic operation: O(log n).
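The trade-off can be sketched by loading a Hashtable for fast lookups and copying it into a SortedList when ordered iteration is needed. DictionarySketch and OrderedKeys are hypothetical names, and the sketch assumes every key is a string.

```csharp
using System.Collections;

public static class DictionarySketch
{
    // Hashtable enumerates in no particular order; SortedList keeps
    // its keys sorted, so GetKey( i ) walks them in ascending order.
    public static string[] OrderedKeys( Hashtable table )
    {
        SortedList sorted = new SortedList( table );
        string[] keys = new string[ sorted.Count ];
        for ( int i = 0; i < sorted.Count; i++ )
            keys[ i ] = (string) sorted.GetKey( i );
        return keys;
    }
}
```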


Finally, there is the BitArray class. As its name suggests, this holds bit values. The storage for the BitArray is an array of ints. Each int stores 32 binary values. This makes the BitArray class compact, but it can also decrease performance. Each get or set operation in the BitArray performs bit manipulations on the int value that stores the sought value and 31 other bits. BitArray contains methods that apply Boolean operations to many values at once: OR, XOR, AND, and NOT. These methods take a BitArray as a parameter and can be used to quickly mask multiple bits in the BitArray. The BitArray is the optimized container for bit operations; use it when you are storing a collection of bitflags that are often manipulated using masks. Do not use it as a substitute for a general-purpose array of Boolean values.
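Masking with BitArray might be sketched like this. BitMaskSketch.ApplyMask is a hypothetical helper; it copies the flags first because BitArray.And modifies the instance it is called on. Both arrays are assumed to have the same length.

```csharp
using System.Collections;

public static class BitMaskSketch
{
    // Returns flags AND mask, leaving both arguments unchanged.
    public static BitArray ApplyMask( BitArray flags, BitArray mask )
    {
        BitArray result = new BitArray( flags ); // copy, since And works in place
        return result.And( mask );
    }
}
```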


With the exception of the Array class, none of the collection classes in the 1.x release of the .NET Framework is strongly typed. They all store references to Object. C# generics will bring new versions of all these collection types that are built on generics. That will be the best way to create type-safe collections. In the meantime, the current System.Collections namespace contains abstract base classes that you can use to build your own type-safe interfaces on top of the type-unsafe collections: CollectionBase and ReadOnlyCollectionBase provide base classes for a list or vector structure. DictionaryBase provides a base class for key/value pairs. The DictionaryBase class is built using a Hashtable implementation; its performance characteristics are consistent with the Hashtable.
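A minimal type-safe wrapper built on CollectionBase could look like the sketch below. CityCollection is a hypothetical class; only a typed indexer and Add are shown, and a fuller version would add Remove, Contains, and so on in the same style.

```csharp
using System;
using System.Collections;

public class CityCollection : CollectionBase
{
    // Typed indexer: callers never see System.Object.
    public string this[ int index ]
    {
        get { return (string) List[ index ]; }
        set { List[ index ] = value; }
    }

    public int Add( string city )
    {
        return List.Add( city );
    }

    // CollectionBase calls this before storing any element, so it also
    // guards against untyped adds made through the IList interface.
    protected override void OnValidate( object value )
    {
        if ( !( value is string ) )
            throw new ArgumentException( "Only strings may be stored.", "value" );
    }
}
```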


Anytime your classes contain collections, you'll likely want to expose that collection to the users of your class. You do this in two ways: with indexers and the IEnumerable interface. Remember that, early in this item, I showed you that you can directly access items in an array using [] notation, and you can iterate the items in the array using foreach.


You can create indexers for your classes. These are analogous to the overloaded operator [] that you could write in C++. As with arrays in C#, indexers can be multidimensional:


  public int this [ int x, int y ]
  {
    get
    {
      return ComputeValue( x, y );
    }
  }

Adding indexer support usually means that your type contains a collection. That means you should support the IEnumerable interface. IEnumerable provides a standard mechanism for iterating all the elements in your collection:


  public interface IEnumerable
  {
    IEnumerator GetEnumerator( );
  }

The GetEnumerator method returns an object that implements the IEnumerator interface. The IEnumerator interface supports traversing a collection:


  public interface IEnumerator
  {
    object Current
    { get; }

    bool MoveNext( );

    void Reset( );
  }
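Putting the indexer and IEnumerable together, a small collection-like class might be sketched as follows. Grid is a hypothetical class; it delegates enumeration to its internal array, which already implements IEnumerable.

```csharp
using System.Collections;

public class Grid : IEnumerable
{
    private readonly int[,] _cells;

    public Grid( int rows, int columns )
    {
        _cells = new int[ rows, columns ];
    }

    // Two-dimensional indexer for direct element access.
    public int this[ int row, int column ]
    {
        get { return _cells[ row, column ]; }
        set { _cells[ row, column ] = value; }
    }

    // Lets clients write foreach over a Grid.
    public IEnumerator GetEnumerator()
    {
        return _cells.GetEnumerator();
    }
}
```

Clients can then use both g[ 0, 1 ] and foreach ( int v in g ) on a Grid instance g.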

In addition to the IEnumerable interface, you should consider the IList or ICollection interfaces if your type models an array. If your type models a dictionary, you should consider implementing the IDictionary interface. You could create the implementations for these large interfaces yourself, and I could spend several more pages explaining how. But there is an easier solution: Derive your class from CollectionBase or DictionaryBase when you create your own specialized collections.


Let's review what we've covered. The best collection depends on the operations you perform and the overall goals of space and speed for your application. In most situations, the Array provides the most efficient container. The addition of multidimensional arrays in C# means that it is easier to model multidimensional structures clearly without sacrificing performance. When your program needs more flexibility in adding and removing items, use one of the more robust collections. Finally, implement indexers and the IEnumerable interface whenever you create a class that models a collection.






