A better way to parse XML documents in .NET (zz)

Original post, September 14, 2007, 11:34:00

Takeaway: The .NET Framework supports the XML DOM parsing model, but not the SAX model. .NET guru Leonardo Esposito tells you why this is actually an improvement for parsing XML in .NET.


Although always easy to transfer as flat files, XML documents need a parser to become really useful pieces of information. Parsers, which take care of translating XML documents into platform-specific objects, traditionally have come in two different varieties—tree-based parsers and event-driven parsers.

The .NET Framework improves on this model with the introduction of XML readers, which employ a more practical "pull" model of smart data passing, as opposed to the impractical "push" method of previous parser models.

Let's first take a look at the basic structure of traditional parser models, and then discuss how .NET's approach represents a step forward.

Tree-based parser
A tree-based parser reads in the entire content of an XML document and creates an in-memory object that represents it. Typically, the object will be an instance of a COM component on a Win32 platform and a Java class on non-Windows platforms. The interface of the object is defined by the W3C through the Document Object Model (DOM) Level 1 and 2 standards.
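
As a rough illustration of the tree-based approach, here is a minimal Visual Basic .NET sketch that uses the Framework's XmlDocument class. The sample document, the module name, and the CountBooks helper are made up for this example and are not part of the original article.

```vbnet
Imports System
Imports System.Xml

Module DomSketch
    ' Hypothetical sample document, used only for illustration
    Const SampleXml As String = _
        "<books><book id=""1""><title>XML</title></book>" & _
        "<book id=""2""><title>.NET</title></book></books>"

    ' Loads the whole document into memory, then queries the tree
    Function CountBooks(xml As String) As Integer
        Dim doc As New XmlDocument()
        doc.LoadXml(xml)   ' the entire document is parsed before this call returns
        Return doc.SelectNodes("/books/book").Count
    End Function

    Sub Main()
        Console.WriteLine("Books found: " & CountBooks(SampleXml))
    End Sub
End Module
```

Because the whole tree is in memory, the application can revisit any node in any order, at the cost of holding the complete document in memory.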

Event-driven parser
An event-driven parser follows a radically different logic and pursues other goals. Event-driven parsers don’t even think about creating an in-memory representation of the source document. They simply parse the document and notify client applications about any tag they find along the way. What happens next is the responsibility of the client application. Event-driven parsers don’t cache information and have an enviably small memory footprint. The Simple API for XML (SAX) community sets the ground rules for event-driven parsers.

The XML parser in .NET?
The XML API available in the .NET Framework supports the XML DOM parsing model, but not the SAX model.

SAX is a push model; the parser and the client application are two separate entities. In fact, the application plays a rather passive role and is expected to work on nodes and fragments, not the document as a whole. The application registers with the parser and receives notifications about all the nodes found in the document being processed. While registering, the application can provide some general information about the types of nodes it is interested in.

However, such a filter is static and doesn’t select nodes based on runtime conditions. An application can instruct the parser to return only content nodes and discard all other nodes, such as processing instructions, comments, and entities. To filter out unneeded element nodes, the application's only recourse is to ignore the related events and all the information pushed by the SAX parser.

The .NET Framework offers a more effective way to parse XML documents in a read-only, noncached, forward-only manner. This new generation of parser presents a document reader and functions as a pull model, as opposed to SAX's push model. Document readers are a common element in the .NET Framework and cover various areas such as file I/O, database access, and memory management.

An XML reader is a class that reads a source document node after node, proceeding from the root to the rightmost leaf of the XML tree via node-first visiting (a preorder, depth-first traversal). The node-first algorithm prescribes that the reader first analyzes the root of a subtree and then recursively moves to its children in the order that they appear. The visiting order of the node-first algorithm coincides with the order in which XML nodes appear in the file on disk.
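
The node-first order is easy to observe with a small sketch. The code below is an illustration only: the module name, the ElementOrder helper, and the sample document are assumptions introduced here. It records each element name as the reader reaches it, indented by the reader's Depth property.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml

Module VisitOrderSketch
    ' Returns element names in the order the reader visits them,
    ' indented two spaces per nesting level
    Function ElementOrder(xml As String) As List(Of String)
        Dim names As New List(Of String)
        Dim reader As New XmlTextReader(New StringReader(xml))
        While reader.Read()
            If reader.NodeType = XmlNodeType.Element Then
                names.Add(New String(" "c, reader.Depth * 2) & reader.Name)
            End If
        End While
        reader.Close()
        Return names
    End Function

    Sub Main()
        ' Hypothetical sample document
        For Each line As String In ElementOrder("<a><b><c/></b><d/></a>")
            Console.WriteLine(line)
        Next
    End Sub
End Module
```

For the sample document, the elements are visited as a, b, c, d: each subtree root first, then its children, which is also the order of the start tags in the text.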

XML reader’s architecture
Unlike the SAX parser, the .NET XML reader accepts direct orders from the application. The application controls the reader component, deciding how to read it and when. Basically, parsing an XML document is a loop that moves from the beginning to the end of the data stream. In the push model, the parser controls the loop and the application is a registered client of the parser. In the pull model, the application itself controls the loop, and the parser is a helper tool.

What’s the difference? First, the pull model is easier to set up, more flexible, and results in more readable, programmer-friendly code. Second, the pull model is faster because it minimizes data transfer between the application and the parser.

A SAX parser always passes node information down to the application, irrespective of whether the application has requested it. A .NET XML reader provides the application with direct methods to skip over nodes with no further memory and CPU overhead.
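
One way to skip nodes is the reader's Skip method, which jumps over the current element and all of its children. The sketch below is a hedged illustration: the element names (title, notes), the module name, and the ReadTitles helper are invented for this example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml

Module SkipSketch
    ' Collects the text of <title> elements while skipping every <notes> subtree
    Function ReadTitles(xml As String) As List(Of String)
        Dim titles As New List(Of String)
        Dim reader As New XmlTextReader(New StringReader(xml))
        reader.Read()
        While Not reader.EOF
            If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "notes" Then
                ' Skip leaves the reader on the node after </notes>,
                ' so Read is not called again on this pass
                reader.Skip()
            ElseIf reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "title" Then
                ' ReadElementString also advances past the end tag
                titles.Add(reader.ReadElementString())
            Else
                reader.Read()
            End If
        End While
        reader.Close()
        Return titles
    End Function

    Sub Main()
        ' Hypothetical sample document
        Dim xml As String = "<library><book><title>A</title>" & _
                            "<notes><title>hidden</title></notes></book></library>"
        For Each t As String In ReadTitles(xml)
            Console.WriteLine(t)
        Next
    End Sub
End Module
```

The title nested inside notes never reaches the application code, because the whole subtree is passed over in a single call.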

The guts of both approaches
Let’s compare the ways in which SAX parsers and XML readers work. The code snippet in Listing A shows some Visual Basic code that exploits the SAX services provided by the Microsoft MSXML 4.0 COM library.

What happens next depends on the user implementation of the ContentHandlerImpl class, which implements the IContentHandler interface, just one of the interfaces defined by the SAX specification and interaction model. During the parsing process, which you start by using the parseURL method, the interface methods defined in the ContentHandlerImpl class are called back by the parser and allowed to execute their own tasks on the node being processed. To discard a node, the method has simply to return.

Borrowing from a popular movie title, I’d say that this approach has something good, something bad, and something ugly. The good is that the application logic resides in distinct components. The bad is that a lot of data is needlessly passed between components, in some cases just to be discarded. The ugly is that writing a SAX-based application is boring, as you have to create and instantiate classes even for trivial tasks.

With .NET XML readers, you basically sacrifice the good points of the SAX solution to significantly improve in the areas that are ugly and bad. Let’s see why.

The application code controls the parsing process and directly accesses or selectively skips over nodes, as shown in Listing B.

Programming is easier and more natural, and you incur no extra overhead. The application logic is not clearly separated from the parsing module, but creating specialized reader classes is as easy as inheriting from the abstract reader class (XmlReader) or more specific classes such as XmlTextReader.

XML Readers in .NET
XML readers are an innovative, SAX-like type of parser, but also a fundamental building block for all XML classes in the .NET Framework. In fact, XPath, XSLT, and XMLDOM classes use readers to build their own more complex object models.

The .NET Framework's support of readers instead of SAX parsers does not limit your programming power. But if you’re a fan of the SAX model, you can still set up a SAX parser using a .NET reader with little hassle. You have to create a new reader class that exposes events for each node found and that uses a reader to visit a document. This ability stems from the inherently greater flexibility of the pull model. While you can emulate the push model using pull-based components, you just can't build a pull model parser using a SAX parser.
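
The emulation described above might look roughly like the following sketch. The PushParser class, its three events, and the demo module are hypothetical names chosen for this illustration. They are not Framework types, and the sketch covers only elements and text rather than the full set of SAX callbacks.

```vbnet
Imports System
Imports System.IO
Imports System.Xml

' Hypothetical SAX-like wrapper: the reader is pulled internally,
' and each node is pushed to subscribers as an event
Public Class PushParser
    Public Event StartElement(name As String)
    Public Event EndElement(name As String)
    Public Event Characters(text As String)

    Public Sub Parse(source As TextReader)
        Dim reader As New XmlTextReader(source)
        While reader.Read()
            Select Case reader.NodeType
                Case XmlNodeType.Element
                    Dim name As String = reader.Name
                    Dim emptyTag As Boolean = reader.IsEmptyElement
                    RaiseEvent StartElement(name)
                    ' An empty tag such as <c/> produces no EndElement node,
                    ' so the closing event is raised here
                    If emptyTag Then RaiseEvent EndElement(name)
                Case XmlNodeType.EndElement
                    RaiseEvent EndElement(reader.Name)
                Case XmlNodeType.Text, XmlNodeType.CDATA
                    RaiseEvent Characters(reader.Value)
            End Select
        End While
        reader.Close()
    End Sub
End Class

Module PushDemo
    Sub OnStart(name As String)
        Console.WriteLine("start " & name)
    End Sub

    Sub OnEnd(name As String)
        Console.WriteLine("end " & name)
    End Sub

    Sub Main()
        Dim parser As New PushParser()
        AddHandler parser.StartElement, AddressOf OnStart
        AddHandler parser.EndElement, AddressOf OnEnd
        ' Hypothetical sample document
        parser.Parse(New StringReader("<a><b>text</b><c/></a>"))
    End Sub
End Module
```

The loop inside Parse is the same pull loop an application would normally write itself. Wrapping it in a class simply hands control of the loop back to the parser, which is what the push model amounts to.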

 

Listing A

Dim parser As New SaxXMLReader
Dim cntHandler As New ContentHandlerImpl

' Set the contentHandler property with the living
' instance of a class implementing the
' IContentHandler interface
Set parser.contentHandler = cntHandler

' Tell the parser to parse the specified XML document
parser.parseURL (App.Path & "\foo.xml")

Listing B

' Assumes Imports System.Xml at the top of the file
Dim reader As XmlTextReader
reader = New XmlTextReader(xmlFile)

While reader.Read()
    ' Write the start tag
    If reader.NodeType = XmlNodeType.Element Then
        ' Do something
    Else
        ' Do something else
    End If
End While
