Measuring the Parsing Efficiency of libxml2's SAX and DOM Modes

树哥

Article link: https://blog.csdn.net/play_fun_tech/article/details/20064397

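I. Introduction to libxml2

libxml2 is an efficient XML parsing library that supports XML, HTML, XPath and more.

See http://www.xmlsoft.org/ for details.

libxml2 offers two main modes for parsing XML, DOM and SAX. This article compares the efficiency of the two.

For how to use SAX mode, see http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html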

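II. What is parsed

Both modes are used to parse the same file and traverse nodes of the following form:

<PTNOID_ID OID="1" IDDst="1943" IDSrc="0"/>

These nodes are children of the <LastCrossInfo> node, and there are many of them. We traverse them and read the attribute values of each one.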

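III. Using SAX mode

Include the header libxml/parser.h (it declares the SAX parsing functions and pulls in libxml/tree.h), and add libxml2.lib to the project's linker inputs.

1) Define the callback interface class

In SAX mode you first register callback functions and then read the XML file. Each time the parser encounters a node it raises an event that invokes the matching callback, and the node's data is processed inside that callback.

First we define the callback interface class: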

class ICallBack
{
protected:
	// Layout of the attributes array: five consecutive slots per attribute.
	// Declared protected so that subclasses can use them.
	static const int AttributeArrayWidth = 5;
	static const int LocalNameIndex = 0;
	static const int PrefixIndex = 1;
	static const int URIIndex = 2;
	static const int ValueIndex = 3;
	static const int EndIndex = 4;

public:

	 virtual ~ICallBack() {}

	 // Called once for every element start tag.
	 virtual void startElementLocalName(
		 const xmlChar* localname,
		 const xmlChar* prefix,
		 const xmlChar* URI,
		 int nb_namespaces,
		 const xmlChar** namespaces,
		 int nb_attributes,
		 int nb_defaulted,
		 const xmlChar** attributes) = 0;

};

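The constants describe the layout of the attributes array. Each attribute occupies five (AttributeArrayWidth) consecutive entries holding its LocalName, Prefix, URI, Value and End pointers, so the array has nb_attributes * 5 entries in total. Prefix and URI are NULL when the attribute has none. Value points to the first character of the attribute value and End to one past its last character; the value is not NUL-terminated.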
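The layout described above can be sketched without libxml2 at all. The block below is a minimal illustration, not the article's code: the `xmlChar` typedef is a stand-in for the real type from libxml2, and `collectAttributes` is a hypothetical helper introduced only for this example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in so the sketch compiles without libxml2 (assumption: the real
// xmlChar is also an unsigned char).
typedef unsigned char xmlChar;

// Slot offsets inside each attribute's group of five pointers.
enum
{
    LocalNameIndex = 0,
    PrefixIndex = 1,
    URIIndex = 2,
    ValueIndex = 3,
    EndIndex = 4,
    AttributeArrayWidth = 5
};

// Hypothetical helper: copy every attribute into a name -> value map.
std::map<std::string, std::string> collectAttributes(
    int nb_attributes, const xmlChar** attributes)
{
    assert(nb_attributes == 0 || attributes != 0);

    std::map<std::string, std::string> result;
    for (int i = 0; i < nb_attributes; ++i)
    {
        // Start of the i-th group of five slots.
        const xmlChar** slot = attributes + AttributeArrayWidth * i;

        // The local name is a NUL-terminated string.
        std::string name(reinterpret_cast<const char*>(slot[LocalNameIndex]));

        // The value is NOT NUL-terminated; it is the range [Value, End).
        std::string value(slot[ValueIndex], slot[EndIndex]);

        result[name] = value;
    }
    return result;
}
```

Inside a real startElementNs callback, the same function could be called with the `nb_attributes` and `attributes` arguments that libxml2 passes in.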

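2) Define the callback function table

libxml2 already declares the type of this table (xmlSAXHandler). We define our own instance of it so that the parser calls our callback functions.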

static xmlSAXHandler _saxHandlerStruct = {
	NULL,            /* internalSubset */
	NULL,            /* isStandalone   */
	NULL,            /* hasInternalSubset */
	NULL,            /* hasExternalSubset */
	NULL,            /* resolveEntity */
	NULL,            /* getEntity */
	NULL,            /* entityDecl */
	NULL,            /* notationDecl */
	NULL,            /* attributeDecl */
	NULL,            /* elementDecl */
	NULL,            /* unparsedEntityDecl */
	NULL,            /* setDocumentLocator */
	NULL,            /* startDocument */
	NULL,            /* endDocument */
	NULL,            /* startElement*/
	NULL,            /* endElement */
	NULL,            /* reference */
	NULL,            /* characters */
	NULL,            /* ignorableWhitespace */
	NULL,            /* processingInstruction */
	NULL,            /* comment */
	NULL,            /* warning */
	NULL,            /* error */
	NULL,            /* fatalError //: unused error() get all the errors */
	NULL,            /* getParameterEntity */
	NULL,            /* cdataBlock */
	NULL,            /* externalSubset */
	XML_SAX2_MAGIC,  /* initialized */
	NULL,            /* private */
	startElementNsHandler,    /* startElementNs */
	NULL,            /* endElementNs */
	NULL,            /* serror */
};

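As shown above, many callbacks can be registered for different events. Here we register only startElementNsHandler, which handles the element-start event. XML_SAX2_MAGIC in the initialized field tells libxml2 to use the SAX2 callbacks (startElementNs and endElementNs). The handler function must be declared before this table.

The callback is implemented as follows: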

static void startElementNsHandler(
	void* ctx,
	const xmlChar* localname,
	const xmlChar* prefix,
	const xmlChar* URI,
	int nb_namespaces,
	const xmlChar** namespaces,
	int nb_attributes,
	int nb_defaulted,
	const xmlChar** attributes)
{
	// With xmlSAXParseFileWithData, ctx is the parser context and the
	// user data is stored in its _private field.
	xmlParserCtxtPtr pCtxt = static_cast<xmlParserCtxtPtr>(ctx);
	ICallBack* pCallback = static_cast<ICallBack*>(pCtxt->_private);
	if (NULL == pCallback)
	{
		return;
	}

	pCallback->startElementLocalName(localname,
		prefix, URI, nb_namespaces,
		namespaces, nb_attributes,
		nb_defaulted, attributes);
}

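Here ctx is cast to the parser context, whose _private field holds our ICallBack pointer, and the processing function is called through that interface. Any number of ICallBack subclasses can be implemented and passed in this way, so the processing logic can change without touching the registered handler.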
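The dispatch pattern described above (a C-style handler that recovers an interface pointer from an opaque context) can be sketched in isolation. Everything in this block is hypothetical and exists only for illustration: `FakeParserCtxt` stands in for libxml2's parser context, `IElementCallback` is a cut-down version of ICallBack, and `fireEvents` plays the role of the parser.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Stand-in for the parser context; only the user-data slot matters here.
struct FakeParserCtxt
{
    void* _private;
};

// Simplified callback interface (the article's ICallBack takes more parameters).
class IElementCallback
{
public:
    virtual ~IElementCallback() {}
    virtual void onStartElement(const char* localname) = 0;
};

// One implementation: count <PTNOID_ID> elements.
class CountingCallback : public IElementCallback
{
public:
    int count;
    CountingCallback() : count(0) {}
    virtual void onStartElement(const char* localname)
    {
        if (0 == std::strcmp(localname, "PTNOID_ID")) { ++count; }
    }
};

// Another implementation: remember the most recent element name.
class LastNameCallback : public IElementCallback
{
public:
    std::string last;
    virtual void onStartElement(const char* localname) { last = localname; }
};

// The single registered handler: recover the interface pointer and forward.
void startElementTrampoline(void* ctx, const char* localname)
{
    assert(ctx != 0);
    FakeParserCtxt* pCtxt = static_cast<FakeParserCtxt*>(ctx);
    IElementCallback* pCallback = static_cast<IElementCallback*>(pCtxt->_private);
    if (pCallback != 0) { pCallback->onStartElement(localname); }
}

// Hypothetical driver: fires one start-element event per name.
// The pointer is stored as IElementCallback* so the cast back is exact.
void fireEvents(IElementCallback* pCallback, const char* const* names, int count)
{
    FakeParserCtxt ctxt;
    ctxt._private = pCallback;
    for (int i = 0; i < count; ++i) { startElementTrampoline(&ctxt, names[i]); }
}
```

Swapping `CountingCallback` for `LastNameCallback` changes what happens on each event while `startElementTrampoline` stays untouched, which is the point of routing everything through the interface.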

3) How ctx is passed

This is the user's own data, supplied when the parser is created.

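	OIDCallBack* pDO = new OIDCallBack();

	// libxml2 is a C library and does not throw, so no try/catch is needed here.
	// The last argument ends up in the parser context's _private field.
	xmlDocPtr pDoc =
			xmlSAXParseFileWithData(
			&_saxHandlerStruct, "PtnScript2_1.xml", 1, pDO);

	// No tree-building callbacks are registered, so pDoc is expected to be NULL;
	// free it anyway in case a document was produced.
	if (NULL != pDoc)
	{
		xmlFreeDoc(pDoc);
	}
	delete pDO;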

As shown, the OIDCallBack pointer is passed to xmlSAXParseFileWithData, so it is available throughout parsing.

4) The concrete handler class OIDCallBack

class OIDCallBack : public ICallBack
{
public:

	 virtual void startElementLocalName(
		 const xmlChar* localname,
		 const xmlChar* prefix,
		 const xmlChar* URI,
		 int nb_namespaces,
		 const xmlChar** namespaces,
		 int nb_attributes,
		 int nb_defaulted,
		 const xmlChar** attributes)
	 {
		 // Only <PTNOID_ID> elements are of interest (xmlStrcmp returns 0 on a match).
		 if (0 != xmlStrcmp(localname, (const xmlChar *)"PTNOID_ID"))
		 {
			 return;
		 }

		 string sOId, sDId, sSId;
		 try
		 {
			 for (int i = 0; i < nb_attributes; ++i)
			 {
				 // Each attribute occupies AttributeArrayWidth consecutive slots.
				 const xmlChar** pAttr = attributes + AttributeArrayWidth * i;
				 // The value is not NUL-terminated; it spans [Value, End).
				 string sValue(pAttr[ValueIndex], pAttr[EndIndex]);

				 if (0 == xmlStrcmp(pAttr[LocalNameIndex], (const xmlChar *)"OID"))
				 {
					 sOId = sValue;
				 }
				 else if (0 == xmlStrcmp(pAttr[LocalNameIndex], (const xmlChar *)"IDDst"))
				 {
					 sDId = sValue;
				 }
				 else if (0 == xmlStrcmp(pAttr[LocalNameIndex], (const xmlChar *)"IDSrc"))
				 {
					 sSId = sValue;
				 }
				 else
				 {
					 // Unexpected attribute: give up on this element.
					 return;
				 }
			 }
		 }
		 catch (...)
		 {
			 // Never let a C++ exception propagate into libxml2's C code.
			 return;
		 }
	 }

};

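As shown, index offsets into the attributes array are used to read the name (localname) and value of the OID, IDDst and IDSrc attributes.

IV. Using DOM mode

This approach is more direct: all nodes are loaded from the file and built into a tree in memory, and node data is then obtained by querying that tree.

The complete code is: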

bool QueryAttributeString(const xmlNodePtr pNode, 
									  const string &sName, string &sAttribute)
{
	if (sName.empty())
	{
		return false;
	}

	if (NULL == pNode)
	{
		return false;
	}

	// Read the attribute from the node
	xmlChar *pChar = xmlGetProp(pNode, (const xmlChar *)sName.c_str());
	if (NULL == pChar)
	{
		return false;
	}
	sAttribute.assign((const char *)pChar);
	// Free the string returned by xmlGetProp
	xmlFree(pChar);

	return true;
}

void Test_DOMParser()
{
	boost::timer oTimer;

	oTimer.restart();

	int iCount = 0;
	xmlDocPtr pDoc = NULL;

	try
	{
		// Parse the whole file into an in-memory tree
		pDoc = xmlParseFile("PtnScript2_1.xml");
		if (NULL == pDoc)
		{
			return;
		}

		xmlNodePtr pRoot = xmlDocGetRootElement(pDoc);
		if (NULL == pRoot)
		{
			xmlFreeDoc(pDoc);
			return;
		}

		xmlNodePtr pLastCrossInfo = pRoot->xmlChildrenNode;
		// Walk the root's children to find <LastCrossInfo>
		bool bFind = false;
		while (NULL != pLastCrossInfo)
		{
			// Compare the node name
			if ((!xmlStrcmp(pLastCrossInfo->name, (const xmlChar *)"LastCrossInfo")))
			{
				bFind = true;
				break;
			}
			pLastCrossInfo = pLastCrossInfo->next;
		}

		if (!bFind)
		{
			xmlFreeDoc(pDoc);
			return;
		}

		xmlNodePtr pChild = pLastCrossInfo->xmlChildrenNode;
		// Walk the children of <LastCrossInfo>
		while (NULL != pChild)
		{
			// Compare the node name
			if ((!xmlStrcmp(pChild->name, (const xmlChar *)"PTNOID_ID")))
			{
				string sOId, sDId, sSId;
				if (!QueryAttributeString(pChild, "OID", sOId)
					|| !QueryAttributeString(pChild, "IDDst", sDId)
					|| !QueryAttributeString(pChild, "IDSrc", sSId))
				{
					break;
				}

				++iCount;
			}
			pChild = pChild->next;
		}

	}
	catch (...)
	{
		xmlFreeDoc(pDoc);
		return;
	}

	cout << "Test_DOMParser: " <<
		" Node(" << iCount << ") " << oTimer.elapsed() << endl;
	cout << endl;

	// Free the tree after the measurement so the timing matches the original test
	xmlFreeDoc(pDoc);
}

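V. Test summary

A 1 MB XML file containing 5002 PTNOID_ID nodes was traversed. Both programs visited every one of these nodes and read its attributes and their values. The timings were:

SAX: 0.026 s

DOM: 0.375 s

SAX is about 14 times faster than DOM at traversing nodes (0.375 / 0.026 ≈ 14.4), roughly an order of magnitude.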
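For readers who want to repeat the measurement, a timing helper might look like the sketch below. It is an assumption-laden illustration rather than the harness used for the figures above: it swaps the article's boost::timer for C++11 `std::chrono::steady_clock`, and `meanSeconds` and `speedup` are hypothetical names introduced here.

```cpp
#include <cassert>
#include <chrono>

// Run `work` `repeats` times and return the mean wall-clock seconds per run.
// Averaging over several runs reduces noise from caching and scheduling.
template <typename Func>
double meanSeconds(Func work, int repeats)
{
    assert(repeats > 0);
    typedef std::chrono::steady_clock Clock;

    const Clock::time_point start = Clock::now();
    for (int i = 0; i < repeats; ++i)
    {
        work();
    }
    const std::chrono::duration<double> elapsed = Clock::now() - start;
    return elapsed.count() / repeats;
}

// How many times faster the fast variant is than the slow one.
inline double speedup(double slowSeconds, double fastSeconds)
{
    assert(fastSeconds > 0.0);
    return slowSeconds / fastSeconds;
}
```

Each parser run would be wrapped in a callable and passed to `meanSeconds`; applying `speedup` to the article's reported numbers, `speedup(0.375, 0.026)`, gives a ratio between 14 and 15.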
