A better way to parse XML documents in .NET
Takeaway: The .NET Framework supports the XML DOM parsing model, but not the SAX model. .NET guru Leonardo Esposito tells you why this is actually an improvement for parsing XML in .NET.
Although XML documents are easy to transfer as flat files, they need a parser to become really useful pieces of information. Parsers, which take care of translating XML documents into platform-specific objects, traditionally have come in two different varieties: tree-based parsers and event-driven parsers.
The .NET Framework improves on this model with the introduction of XML readers, which employ a more practical "pull" model of smart data passing, as opposed to the impractical "push" method of previous parser models.
Let's first take a look at the basic structure of traditional parser models, and then discuss how .NET's approach represents a step forward.
A tree-based parser reads in the entire content of an XML document and creates an in-memory object that represents it. Typically, the object will be an instance of a COM component on a Win32 platform and a Java class on non-Windows platforms. The interface of the object is defined by the W3C through the Document Object Model (DOM) Level 1 and 2 standards.
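To make the tree-based approach concrete, here is a minimal sketch that uses Python's standard xml.dom.minidom module as a stand-in for a DOM implementation. It is not the .NET or COM API; the sample document and the load_titles helper are invented for illustration.

```python
from xml.dom import minidom

# Hypothetical sample document, used only for this illustration.
SAMPLE = (
    "<library>"
    "<book id='1'><title>XML Basics</title></book>"
    "<book id='2'><title>Readers</title></book>"
    "</library>"
)

def load_titles(xml_text):
    # parseString reads the whole document and builds the full tree in memory
    # before the application sees any node.
    document = minidom.parseString(xml_text)
    titles = []
    # Once the tree exists, it can be queried in any order, any number of times.
    for book in document.getElementsByTagName("book"):
        title = book.getElementsByTagName("title")[0]
        titles.append((book.getAttribute("id"), title.firstChild.data))
    return titles

print(load_titles(SAMPLE))
```

The convenience of random access comes at the price of holding the entire document in memory, which is the trade-off that event-driven parsers avoid.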
An event-driven parser follows a radically different logic and pursues other goals. Event-driven parsers don’t even think about creating an in-memory representation of the source document. They simply parse the document and notify client applications about any tag they find along the way. What happens next is the responsibility of the client application. Event-driven parsers don’t cache information and have an enviably small memory footprint. The Simple API for XML (SAX) community sets the ground rules for event-driven parsers.
The XML parser in .NET?
The XML API available in the .NET Framework supports the XML DOM parsing model, but not the SAX model.
SAX is a push model; the parser and the client application are two separate entities. In fact, the application plays a rather passive role and is expected to work on nodes and fragments, not the document as a whole. The application registers with the parser and receives notifications about all the nodes found in the document being processed. While registering, the application can provide some general information about the types of nodes it is interested in.
However, such a filter is static and doesn’t select nodes based on runtime conditions. An application can instruct the parser to return only content nodes and discard all other nodes, such as processing instructions, comments, and entities. To filter out unneeded element nodes, the application's only recourse is to ignore the related events and all the information pushed by the SAX parser.
The .NET Framework offers a more effective way to parse XML documents in a read-only, noncached, forward-only manner. This new generation of parser takes the form of a document reader and follows a pull model, as opposed to SAX's push model. Document readers are a common element in the .NET Framework and cover various areas such as file I/O, database access, and memory management.
An XML reader is a class that reads a source document node after node, proceeding from the root to the rightmost leaf of the XML tree via node-first visiting. The node-first algorithm prescribes that the reader first analyzes the root of a subtree and then recursively moves to its children in the order that they appear. The visiting order of the node-first algorithm coincides with the order in which XML nodes appear in a file on disk.
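The claim that node-first visiting matches the order of nodes in the file can be checked with a short sketch. It uses Python's xml.etree.ElementTree purely to obtain a tree to walk; the node_first function and the sample markup are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: a small tree with two subtrees.
SAMPLE = "<a><b><c/><d/></b><e><f/></e></a>"

def node_first(element):
    # Visit the root of the subtree first...
    yield element.tag
    # ...then recurse into its children in the order they appear.
    for child in element:
        yield from node_first(child)

# Order produced by the node-first (pre-order) walk.
visit_order = list(node_first(ET.fromstring(SAMPLE)))

# Order in which opening tags appear in the raw text.
text_order = re.findall(r"<(\w+)", SAMPLE)

print(visit_order)
print(visit_order == text_order)
```

Because the two orders agree, a forward-only reader can deliver nodes in node-first order without ever looking back in the stream.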
XML reader’s architecture
Unlike the SAX parser, the .NET XML reader accepts direct orders from the application. The application controls the reader component, deciding how and when to read the document. Basically, parsing an XML document is a loop that moves from the beginning to the end of the data stream. In the push model, the parser controls the loop and the application is a registered client of the parser. In the pull model, the application itself controls the loop, and the parser is a helper tool.
What’s the difference? First off, the pull model is easier to set up, more flexible, and results in more readable, programmer-friendly code. Secondly, the pull model is faster because it minimizes data transfer between the application and the parser.
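A loop owned by the application might look like the following sketch. It uses Python's xml.dom.pulldom as a rough analogue of a pull reader, not the .NET XmlReader itself; the first_order_ids function and the sample data are hypothetical. Note that pulldom is layered on a SAX parser internally, so the sketch illustrates the control flow rather than any performance claim.

```python
from xml.dom import pulldom

# Hypothetical sample document.
SAMPLE = "<orders><order id='1'/><order id='2'/><order id='3'/></orders>"

def first_order_ids(xml_text, limit):
    ids = []
    events = pulldom.parseString(xml_text)
    # The application owns this loop: it asks for the next node when it wants one.
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "order":
            ids.append(node.getAttribute("id"))
            if len(ids) == limit:
                # The application decides when to stop; no further nodes are requested.
                break
    return ids

print(first_order_ids(SAMPLE, 2))
```

In a push parser the equivalent early exit usually has to be signaled from inside a callback, which is one reason the pull style tends to read more naturally.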
A SAX parser always passes node information down to the application, irrespective of whether the application has requested it. A .NET XML reader provides the application with direct methods to skip over nodes with no further memory and CPU overhead.
The guts of both approaches
Let’s compare the ways in which SAX parsers and XML readers work. The code snippet in Listing A shows some Visual Basic code that exploits the SAX services provided by the Microsoft MSXML 4.0 COM library.
What happens next depends on the user implementation of the ContentHandlerImpl class, which implements the IContentHandler interface, just one of the interfaces defined by the SAX specification and interaction model. During the parsing process, which you start by using the parseURL method, the interface methods defined in the ContentHandlerImpl class are called back by the parser and allowed to execute their own tasks on the node being processed. To discard a node, the method simply has to return.
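The callback pattern described here can be sketched with Python's standard xml.sax module, which follows the same SAX interaction model. This is an analogue, not the MSXML code; the TitleHandler class and the sample document are invented for the example.

```python
import xml.sax

# Hypothetical sample document.
SAMPLE = (
    "<library>"
    "<book id='1'><title>XML Basics</title></book>"
    "<book id='2'><title>Readers</title></book>"
    "</library>"
)

class TitleHandler(xml.sax.ContentHandler):
    """Content handler whose methods are called back by the parser."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._inside_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name != "title":
            return  # discarding a node is just returning from the callback
        self._inside_title = True
        self._buffer = []

    def characters(self, content):
        # Text is pushed for every element; keep it only inside <title>.
        if self._inside_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._inside_title = False

handler = TitleHandler()
# The parser drives the process and pushes every node to the handler.
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
print(handler.titles)
```

Even in this small case, a class with three methods and some state is needed, and every node is still delivered to the handler before it can be ignored.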
Borrowing from a popular movie title, I’d say that this approach has something good, something bad, and something ugly. The good is that the application logic resides in distinct components. The bad is that a lot of data is needlessly passed between components, in some cases just to be discarded. The ugly is that writing a SAX-based application is boring, as you have to create and instantiate classes even for trivial tasks.
With .NET XML readers, you basically sacrifice the good points of the SAX solution to significantly improve in the areas that are ugly and bad. Let’s see why.
The application code controls the parsing process and directly accesses or selectively skips over nodes, as shown in Listing B.
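Selective skipping in a pull loop can be sketched as follows, again with Python's xml.dom.pulldom standing in for a pull reader. The skip_subtree helper is hypothetical and written for this example; unlike the reader described in the article, it still consumes the intermediate events, so it shows only how the application directs the reader past a subtree.

```python
from xml.dom import pulldom

# Hypothetical sample: the <draft> subtree should be ignored entirely.
SAMPLE = (
    "<catalog>"
    "<book><title>Kept</title></book>"
    "<draft><book><title>Ignored</title></book></draft>"
    "<book><title>Also kept</title></book>"
    "</catalog>"
)

def skip_subtree(events):
    """Advance the reader past the element that was just opened."""
    depth = 1
    for event, _node in events:
        if event == pulldom.START_ELEMENT:
            depth += 1
        elif event == pulldom.END_ELEMENT:
            depth -= 1
            if depth == 0:
                return

def published_titles(xml_text):
    titles = []
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event != pulldom.START_ELEMENT:
            continue
        if node.tagName == "draft":
            # The application tells the reader to move past this subtree.
            skip_subtree(events)
        elif node.tagName == "title":
            # Pull in just this element's content, then read its text.
            events.expandNode(node)
            node.normalize()
            titles.append(node.firstChild.data)
    return titles

print(published_titles(SAMPLE))
```

The decision to skip is made at run time, inside ordinary application code, rather than through a static filter registered with the parser.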
Programming is easier and more natural, and you incur no extra overhead. The application logic is not clearly separated from the parsing module, but creating specialized reader classes is as easy as inheriting from the abstract reader class (XmlReader) or more specific classes such as XmlTextReader.
XML Readers in .NET
XML readers are an innovative, SAX-like type of parser, but also a fundamental building block for all XML classes in the .NET Framework. In fact, XPath, XSLT, and XMLDOM classes use readers to build their own more complex object models.
The .NET Framework's support of readers instead of SAX parsers does not limit your programming power. But if you’re a fan of the SAX model, you can still set up a SAX parser using a .NET reader with little hassle. You have to create a new reader class that exposes events for each node found and that uses a reader to visit a document. This ability stems from the inherently greater flexibility of the pull model. While you can emulate the push model using pull-based components, you just can't build a pull model parser using a SAX parser.
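The idea of a class that visits a document with a reader and exposes an event per node can be sketched as below. The PushAdapter class and its on and parse methods are hypothetical names, and Python's xml.dom.pulldom again stands in for the pull reader; this is one possible shape, not a prescribed design.

```python
from xml.dom import pulldom

class PushAdapter:
    """Hypothetical adapter: drives a pull reader and pushes nodes to listeners."""

    def __init__(self):
        self._listeners = {}

    def on(self, event, callback):
        # Register a callback for one kind of node event.
        self._listeners.setdefault(event, []).append(callback)

    def parse(self, xml_text):
        # The adapter owns the pull loop and re-emits each node as a push event.
        for event, node in pulldom.parseString(xml_text):
            for callback in self._listeners.get(event, []):
                callback(node)

seen = []
adapter = PushAdapter()
adapter.on(pulldom.START_ELEMENT, lambda node: seen.append(("start", node.tagName)))
adapter.on(pulldom.END_ELEMENT, lambda node: seen.append(("end", node.tagName)))
adapter.parse("<a><b/></a>")
print(seen)
```

The reverse direction is harder: once a push parser has started, it keeps calling back until the document ends, so a simple wrapper cannot hand the loop back to the caller one node at a time.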