True Binary Serialization and Compression of DataSets
N.B. There is a newer updated article with up-to-date source code that readers should visit after reading this article, which may be found here.
The DataSet, found in the System.Data namespace, is an amazing animal. DataSets store data in a disconnected cache. The structure of a DataSet is similar to that of a relational database; it exposes a hierarchical object model of tables, rows, and columns. In addition, it can contain constraints and relationships that have been defined for the DataSet .
An ADO.NET DataSet is one view -- a relational view -- of data that can be represented in XML. In Visual Studio and the .NET Framework, XML is the format for storing and transmitting data of all kinds. This relationship between DataSets and XML enables us to take advantage of the following features of DataSets:
- The structure of a DataSet -- its tables, columns, relationships, and constraints -- can be defined in an XML Schema. XML Schemas are a standards-based format of the W3C for defining the structure of XML data. DataSets can read and write schemas that store structured information using the ReadXmlSchema and WriteXmlSchema methods. If no schema is available, the DataSet can be asked to infer one (via its InferXmlSchema method) from data in an XML document, provided that it is structured in a relational way.
- We can generate a DataSet class that incorporates schema information to define its data structures (such as tables and columns) as class members, and we can also generate a strongly typed DataSet.
- We can read an XML document or stream into a DataSet using the DataSet's ReadXML method and write a DataSet out as XML using the DataSet's WriteXML method. Because XML is a standard format for interchange of data between different applications, this means that we can load a DataSet with XML-formatted information sent by other applications. We can even specify an HTTP URL as the source of the DataSet, opening up a whole range of interesting possibilities for ServerControls and so on. Similarly, a DataSet can write out its data as an XML stream or document, to be shared with other applications or just for storage in a standard format.
- We can create an XML view (an XMLDataDocument object) of the contents of a DataSet, and then view and manipulate the data using either relational methods (via the DataSet ) or XML methods. The two views are automatically synchronized as they are changed.
The DataTable is a central object in the ADO.NET library, and contains DataRows, Constraints, and many other properties and methods that enable it to work together with other DataTables and the containing DataSet as a disconnected "in - memory database". There are methods to filter, search and sort data, and even to perform computations on data, all self-contained within these classes. The DataSet and the DataTable are the only ADO.NET objects that can be serialized and remoted. In addition, one can work with the disconnected DataSet and its tables, adding, changing or deleting rows in memory on a completely disconnected basis, call the GetChanges method, and send the modified DataSet in Diffgram Document format back to the database where it can be "hooked up" with a new DataAdapter on the server to automatically perform all the updates, inserts and deletes to the database and its underlying tables.
Both the DataRow and DataTable classes also have AcceptChanges methods. Calling AcceptChanges at the DataTable level causes the AcceptChanges method for each DataRow to be called. Similarly, invoking AcceptChanges on the DataSet causes AcceptChanges to be called on each table within the DataSet. In this manner, we have multiple levels at which the method can be invoked. Calling AcceptChanges on the DataSet enables us to invoke the method on all subordinate objects (for example, tables and rows) with a single call.
When we call AcceptChanges on the DataSet , any DataRow objects that are still in edit-mode successfully end their edits. The RowState property of each DataRow also changes; Added and Modified rows become Unchanged, and Deleted rows are removed. We now have a complete, in-memory representation of our "current database", and we can then persist the DataSet and when we subsequently load it back from our persistence medium, it will act as a kind of "mini database" so that we can always have a disconnected, complete representation of all our current working data. The DataSet and DataTable classes also support an ExtendedProperties collection in which it is possible to store, for example, a timestamp, or even a connection string.
There is one glaring drawback to both the DataSet and the DataTable that Microsoft kind of "left out" of ADO.NET: The .NET Framework has a powerful set of classes to perform serialization of objects. One of these, the BinaryFormatter, produces a very compact byte stream from virtually any .NET class or object. However, when we apply BinaryFormatter serialization to a DataSet, what we get, instead, is a byte stream filled with a huge amount of textual XML that looks just like the XML we would get by calling the WriteXml method on the DataSet. This is not good for remoting, or for storing DataSets in some sort of persistence medium (a file on the hard drive, for example) because these files take up a lot of space and/or bandwidth "over the wire".
The ADO.NET objects that are serializable through formatters -- the DataTable and DataSet classes -- correctly implement the ISerializable interface, which makes them responsible for providing the data to be serialized. The ISerializable interface consists of a single method, GetObjectData, whose output the formatter takes and writes into the output stream.
However, the problem is that the DataTable and DataSet classes were engineered to describe themselves to serializers -- using XML! The binary formatter, which is of course oblivious to this inefficiency, takes this particularly long string of XML and proceeds to append it to the output stream. Unfortunately, if we are looking for a compact binary representation of a DataSet, the ordinary .NET Framework runtime serialization available for ADO.NET objects will therefore prove quite disappointing.
A number of more advanced developers have been acutely aware of this problem, as can be seen from the many newsgroup posts complaining about the lack of binary serialization support in the DataSet. Dino Esposito wrote an article for MSDN in which he shows a way to create a "Ghost Serializer" which will allow Binary Serialization. And even Microsoft, responding to many requests I am sure, published a Knowledge Base article, KB82970 in which Ravinder Vuppula provides what is called a DataSetSurrogate class - basically a wrapper class that accepts a DataSet in the constructor, and goes through every object in it, converting them mostly to ArrayLists (which go through the BinaryFormatter as byte streams, not XML -- just fine, thank you) and when we are done, we have a reasonably compact "DataSetSurrogate" that can be remoted and can be reconstituted by calling the BinaryFormatter's Deserialize method, and then we can call the Surrogate's "ConvertToDataSet" method to get back to "First Base", as it were. There are other approaches as well, not the least of which is creating custom ISurrogateSelector implementations. However, they are not for the faint of heart. In the domain of custom approaches to binary serialization of objects, there are at least two other offerings I have looked at; Richard Lowe's Fast Binary Serialization and Dominic Cooney's Pickle. Each has its benefits and drawbacks in terms of what they are best at, and how easy they are to use.
My approach goes a lot farther than just performing custom binary serialization of the DataSet -- much farther. The BinaryFormatter is great, but Angelo Scotto has developed what he calls the CompactFormatter. This is not BinaryFormatter compatible, but since there is no BinaryFormatter at all in the .NET Compact Framework, it certainly becomes a welcome addition. And of course, it can be compiled and used on the regular .NET Framework. I played and studied with Angelo's marvelous creation quite a bit and helped him spotlight some bugs which, I am pleased to report, he has completely fixed as of Version 11.2. One of the biggest advantages of CompactFormatter, besides its promise to provide Binary Serialization for the Compact Framework, is the fact that it can create an even more compact byte stream from a DataSet than does the Native BinaryFormatter. While there remains a bit more work to get this all to compile and run as a true Compact Framework application, we are not very far off at this point.
Now for the good part -- in addition to creating a compact byte stream from a full DataSet (or DataTable) and being able to deserialize it at the other end back into a fully populated, live DataSet, I've added another dimension to the whole picture. I added in Mike Krueger's ICSharpCode SharpZiplib in-memory Zip compression and decompression to the resultant byte stream. To give us some example of what can be done with this, let's take a test case:
I have a US Zip Code database that consists of all 42,698 current US Zip codes - it has the City name, State abbreviation, 5 digit Zip, and Telephone Area Code. If this is stored in an MS Access Database and freshly compacted to reduce bloat, the file size will be 4,516 KB. If I create a DataSet consisting of all 42,698 rows in the table, and ask the DataSet to render itself as XML, it will produce an XML document of 5,445 KB - ouch!
Now if I run this DataSet through the BinaryFormatter, I'll get a binary file of 6,064 KB. But it will just be a binary file containing a huge amount of textual XML. That's the problem with the DataSet and the BinaryFormatter. However, if I choose instead to run this through Mr. Scotto's Compact Formatter, I'll get a file of only 1,161 KB - an 80% space savings over the BinaryFormatter. Then, if I further compress this resultant byte stream with Zip compression, my entire DataSet will take up just 278,650 bytes! That is a byte array that represents just 4.6 percent of the size of the best serialization available to me previously. With this vastly reduced, compact footprint of my DataSet, it now becomes quite feasible to, for example, store my entire "database" as an embedded resource directly inside my assembly, or to save it to disk to be loaded "on demand" by my application so that no "real" database is ever necessary to run my app, or to send it over the wire through Remoting or via a WebService (ASP.NET WebServices automatically handle byte arrays very nicely, as Base64 encoded strings) to be updated with a DataAdapter as one normally would do with any DataSet into the middle tier.
With my custom DataSet Binary Serialization / Compression scheme, I can now create, for example, a self - contained web - based application where users can add, edit and delete data, and when they press the "Save" button, I simply send my in-memory DataSet through the CompressDataSet method of my class library, and save it to the hard drive, like this:
// Where "comp" is my compressed DataSet byte array:
string thePath =System.Environment.CurrentDirectory + "//" + "ds.dat";
BinaryWriter bw= new BinaryWriter( File.Open(thePath,FileMode.Create));
Of course, loading it and reconstituting the compressed DataSet is as simple as this:
string thePath =System.Environment.CurrentDirectory + "//" + "ds.dat";
FileInfo f = new FileInfo(thePath);
BinaryReader br= new BinaryReader( File.OpenRead(thePath));
byte comp1= br.ReadBytes((int)f.Length );
PAB.CompressDataSet.Wrapper ww = new PAB.CompressDataSet.Wrapper();
DataSet dsl = ww.DecompressDataSet(comp1);
The above process, by the way, using the 42,698 row Zip Code database ("DataSet") described above, takes just about exactly one second on a moderately fast Pentium class machine. Serialization and compression can take a little longer - 1.338 seconds at a compression ratio of 5 (eventual size, 285,648 bytes) up to 3.21 seconds at a compression ratio of 9 (the max), resulting in a file size of 278,650 bytes.
The finished class library weighs in at just 84K, and I am making my CompressDataSet class library available free for non-commercial use at the download below. However, since I have several potential products based on this new technology, I'm not releasing any source code for it at this time. Feel free to download and try it out, and let me know if you have any custom uses for this that you would like me to help on.
Known Issues: Serialization and Compression of very large DataSets (over 20,000 rows) may fail at compression levels below 5. Workaround: use compression levels of 5 or higher. If no compression level is specified, the maximum (9) is assumed by the library. There are no other issues discovered at this time; the library can be considered to be production-ready.