Source Code Analysis of XML Parsing in Cexer

xj_fox

Article link: https://blog.csdn.net/xj_fox/article/details/3081329

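1. Loading the XML source file

Cexer takes the XML file name as the parameter used to initialize the object. Internally it allocates a buffer and reads the entire XML source file into it; all subsequent parsing operates on this buffer.

The code is as follows: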

001、 _FILEPtr handle = NULL;

002、 #if defined(_MSC_VER) && _MSC_VER>=1400

003、 errno_t error = ::_wfopen_s( &handle,file,L"rb" );

004、 if ( error)

005、 {

006、 return false;

007、 }

008、 #else

009、 handle = ::_wfopen( file,L"rb" );

0010、 #endif

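Here file is the file name string.

Lines 002-0010 select a different system function for opening the file depending on the compiler version macro, and obtain the file handle handle.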

001、 ::fseek( handle,0,SEEK_END );

002、 long length = ::ftell( handle );

003、 ::fseek( handle,0,SEEK_SET );

004、

005、 if ( length <= 3 )

006、 {

007、 return false;

008、 }

009、

0010、 ScopedBuffer<BYTE> buffer( length+2 );

0011、 if ( !buffer )

0012、 {

0013、 return false;

0014、 }

0015、

0016、 buffer[0] = 0;

0017、 buffer[1] = 0;

0018、 if ( 1 != ::fread(buffer,length,1,handle) )

0019、 {

0020、 return false;

0021、 }

0022、 buffer[length] = 0;

0023、 buffer[length+1] = 0;

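Lines 001-002 obtain the length of the source file.

Line 003 resets the file pointer to the beginning of the file.

Lines 0010-0014 request a buffer from the internal buffer manager; note that two extra bytes are requested. Line 0018 calls fread to read the file into the buffer.

Lines 0022-0023 write the terminator at the end of the buffer. Why request one more byte than seems necessary, given that a string terminator takes only one byte? A likely reason is that the buffer may hold UTF-16 text (see Section 2), where the terminating null character occupies two bytes.

All subsequent parsing operates on this buffer pointer.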
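The loading steps described above can be sketched in portable C++. This is only an illustration, not Cexer's code: loadWholeFile is a hypothetical helper, std::fopen stands in for _wfopen/_wfopen_s, and std::vector stands in for ScopedBuffer.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical helper (not part of Cexer): read a whole file into memory and
// append two zero bytes, so the data is terminated both as char text and as
// 2-byte wide-character text.
bool loadWholeFile(const char* path, std::vector<unsigned char>& out)
{
    std::FILE* handle = std::fopen(path, "rb");
    if (!handle)
        return false;

    // Measure the file: seek to the end, read the offset, then rewind.
    long length = -1;
    if (std::fseek(handle, 0, SEEK_END) == 0)
        length = std::ftell(handle);
    if (length <= 0 || std::fseek(handle, 0, SEEK_SET) != 0)
    {
        std::fclose(handle);
        return false;
    }

    // length + 2 bytes, all zero-initialized: the last two bytes already
    // act as the terminator once the file contents are copied in.
    out.assign(static_cast<std::size_t>(length) + 2, 0);
    const bool ok =
        std::fread(out.data(), static_cast<std::size_t>(length), 1, handle) == 1;
    std::fclose(handle);
    return ok;
}
```

Unlike the excerpt, this sketch accepts any non-empty file; the "length <= 3" rejection in the original could be added in the same place as the length check.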

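2. Support for different encodings

Cexer supports several encodings, which it identifies by parsing the encoding field in the XML header. The XML header looks like this: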

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>

Parsing of the XML header is done by the function _parseDeclaration.

001、 PBYTE XmlFile::_parseDeclaration( PBYTE buffer,long length )

002、 {

003、 // A Unicode signature (BOM) is present: use it directly

004、 if ( 0xFF==*buffer && 0xFE==*(buffer+1) )

005、 {

006、 m_encoding = encodingUTF16LE;

007、 m_bom = true;

008、 return buffer + 2;

009、 }

0010、 else if ( isUnicode(buffer,length>100?100:length,IS_TEXT_UINCODE_TEST) )

0011、 {

0012、 m_encoding = encodingUTF16LE;

0013、 return buffer;

0014、 }

0015、

0016、 // A UTF-8 signature (BOM) is present

0017、 if ( 0xEF==*buffer && 0xBB==*(buffer+1) && 0xBF==*(buffer+2) )

0018、 {

0019、 m_encoding = encodingUTF8;

0020、 m_bom = true;

0021、 return buffer + 3;

0022、 }

0023、

0024、 // Find the first '<' in the buffer

0025、 PCSTR ptr = reinterpret_cast<PCSTR>( buffer );

0026、 while ( *ptr && *ptr!='<' )

0027、 {

0028、 ++ptr;

0029、 }

0030、

0031、 if ( !*ptr )

0032、 {

0033、 return NULL;

0034、 }

0035、 // Found the first '<'

0036、 // Skip "<?xml"

0037、 // "<?xml" is required to appear at the start of the file

0038、 if ( ::strncmp(ptr,"<?xml",5) )

0039、 {

0040、 return NULL;

0041、 }

0042、

0043、 ptr += 5;

0044、

0045、 // Search for the "encoding" field, stopping at '?'

0046、 while ( *ptr && ::strncmp(ptr,"encoding",8) && *ptr!='?' )

0047、 {

0048、 ++ptr;

0049、 }

0050、

0051、 if ( !*ptr )

0052、 {

0053、 return NULL;

0054、 }

0055、

0056、 // If '?' is reached first,

0057、 // fall back to the default UTF-8 encoding,

0058、 // i.e. a header without an encoding value is accepted

0059、 if ( *ptr == '?' )

0060、 {

0061、 m_encoding = s_defaultEncoding;

0062、 return buffer;

0063、 }

0064、

0065、 //skip "encoding"

0066、 ptr += 8;

0067、 ptr = xml::skipInvalidChar( ptr );

0068、 if ( !ptr || *ptr!='=' )

0069、 {

0070、 return NULL;

0071、 }

0072、

0073、 //'=' must be next to "encoding"

0074、

0075、 ++ptr;

0076、 ptr = xml::skipInvalidChar( ptr );

0077、 if ( !ptr )

0078、 {

0079、 return NULL;

0080、 }

0081、

0082、 CHAR endQuot = 0;

0083、 if ( *ptr=='\'' )

0084、 {

0085、 endQuot = '\'';

0086、 }

0087、 else if ( *ptr=='"' )

0088、 {

0089、 endQuot= '"';

0090、 }

0091、 else

0092、 {

0093、 return NULL;

0094、 }

0095、

0096、 ++ptr;

0097、 PCSTR first = ptr;

0098、

0099、 while ( *ptr && *ptr!=endQuot )

00100、 {

00101、 ++ptr;

00102、 }

00103、

00104、 //skip the format "encoding = ' '" or "encoding = " " "

00105、

00106、 if ( !*ptr )

00107、 {

00108、 return NULL;

00109、 }

00110、

00111、 if ( ptr == first )

00112、 {

00113、 m_encoding = encodingUTF8;

00114、 return buffer;

00115、 }

00116、

00117、 stringA name;

00118、 // Extract the encoding name string

00119、 name.assign( first,(stringA::size_type)(ptr-first) );

00120、

00121、 m_encoding = theEncodings.find( name.c_str() );

00122、 return buffer;

00123、 }

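Lines 004-0022 check whether the file starts with an encoding signature (a byte order mark). If it does, or if the isUnicode test in lines 0010-0014 recognizes the leading bytes as UTF-16 text, m_encoding is set accordingly and the function returns at once; when a signature was found, the returned pointer is positioned just past it. Otherwise the encoding has to be determined from the encoding field in the header.

Line 0025 first casts the buffer pointer to PCSTR.

Lines 0026-0034 search for the '<' character that starts the XML header.

Lines 0038-0043 match and skip the "<?xml" string in the header. Note that "<?xml" must appear in exactly this form, with no other characters in between.

Lines 0046-0054 search for the "encoding" string, stopping early if the end of the header (marked by '?>') is reached.

Lines 0059-0063 handle a header without an encoding field: the default encoding is used for parsing and the function returns.

Otherwise, lines 0066-0071 skip the "encoding" string.

Lines 0075-0080 skip the '=' and any characters after it that xml::skipInvalidChar ignores (presumably whitespace).

Lines 0082-0094 determine whether the encoding value is quoted with '' or "".

Lines 0099-00102 scan the part between the quotes and record the pointer where it ends.

Lines 00111-00115 handle an encoding field with an empty value, in which case encodingUTF8 is used. Lines 00117-00121 extract the string between the quotes and look it up in theEncodings; the result is assigned to m_encoding.

Outside this function, the buffer is converted according to the detected encoding. This is done by the newUnicodeFromAnsi function, which internally uses the MultiByteToWideChar system function to convert between encodings.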
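The two detection steps walked through above (signature check, then the encoding field) can be condensed into a small portable sketch. The names Bom, detectBom and encodingName are hypothetical and do not come from Cexer; the isUnicode heuristic and the theEncodings lookup are left out, and whitespace handling is an assumption about what xml::skipInvalidChar does.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

enum class Bom { None, Utf8, Utf16LE };

// Check the first bytes of the buffer for a byte order mark.
Bom detectBom(const unsigned char* data, std::size_t length)
{
    if (length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Bom::Utf16LE;
    if (length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Bom::Utf8;
    return Bom::None;
}

// Advance past spaces, tabs and line breaks.
static const char* skipBlanks(const char* p)
{
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
        ++p;
    return p;
}

// Extract the value of the encoding field from a null-terminated declaration.
// Returns an empty string when the field is absent or malformed.
std::string encodingName(const char* text)
{
    if (std::strncmp(text, "<?xml", 5) != 0)
        return "";
    const char* end = std::strstr(text, "?>");      // end of the header
    const char* p = std::strstr(text, "encoding");
    if (!end || !p || p > end)
        return "";                                  // no encoding field in the header

    p = skipBlanks(p + 8);                          // skip "encoding"
    if (*p != '=')
        return "";
    p = skipBlanks(p + 1);

    const char quote = *p;                          // either ' or "
    if (quote != '\'' && quote != '"')
        return "";
    const char* first = ++p;
    while (*p && *p != quote)
        ++p;
    if (!*p)
        return "";                                  // closing quote is missing
    return std::string(first, p);
}
```

A caller would typically try detectBom first and fall back to encodingName only when no signature is found, which mirrors the order of checks in _parseDeclaration.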

3. Parsing the XML header

In Cexer, the XML header is parsed by the _parse method of the XmlDeclaration class.

001、 _parse( PCWSTR ptr )

002、 {

003、 CEXER_ASSERT( ptr != NULL );

004、

005、 static const PCWSTR s_begin = L"<?xml"; //5;

006、 static const PCWSTR s_end = L"?>"; //2;

007、

008、 if ( !ptr || !*ptr || ::wcsncmp(ptr,s_begin,5) )

009、 {

0010、 return NULL;

0011、 }

0012、

0013、 ptr += 5;

0014、 //skip header "<?xml"

0015、 while ( ptr && *ptr )

0016、 {

0017、 // Skip characters that need no parsing; ptr then points just past them

0018、 ptr = xml::skipInvalidChar( ptr );

0019、

0020、 if ( !::wcsncmp(ptr,s_end,2) )

0021、 {

0022、 // The header has been parsed successfully at this point

0023、 // +2 skips the "?>"

0024、 // The returned pointer marks where the XML body starts

0025、 return ptr+2;

0026、 }

0027、

0028、 // Parse the version attribute

0029、 else if ( !::wcsncmp(ptr,L"version",7) )

0030、 {

0031、 XmlAttribute attr;

0032、 ptr = attr._parse( ptr );

0033、 m_version = attr.value();

0034、 }

0035、 else if ( !::wcsncmp(ptr,L"encoding",8) )

0036、 {

0037、 XmlAttribute attr;

0038、 ptr = attr._parse( ptr );

0039、 m_encoding = attr.value();

0040、 }

0041、 else if ( !::wcsncmp(ptr,L"standalone",10) )

0042、 {

0043、 XmlAttribute attr;

0044、 ptr = attr._parse( ptr );

0045、 m_standalone = attr.value();

0046、 }

0047、 else

0048、 {

0049、 return NULL;

0050、 }

0051、 }

0052、 return ptr;

0053、 }

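The _parse function takes a pointer to the start of the XML buffer as its parameter.

Lines 0015-0051 iterate over the header buffer. For a valid XML header, the loop exits at lines 0020-0026.

The remaining branches parse the individual fields of the XML header.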
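The loop structure of XmlDeclaration::_parse can be mirrored in a self-contained sketch that collects the header fields into a map. parseDeclaration is a hypothetical helper, the attribute parsing is written inline instead of using XmlAttribute, and treating the skipped characters as whitespace is an assumption.

```cpp
#include <cstddef>
#include <cwctype>
#include <map>
#include <string>

// Hypothetical helper (not Cexer's API): parse the fields of an XML
// declaration into a map. Returns false if the text is not a well-formed
// declaration or contains a field other than version/encoding/standalone.
bool parseDeclaration(const std::wstring& text,
                      std::map<std::wstring, std::wstring>& attrs)
{
    if (text.compare(0, 5, L"<?xml") != 0)
        return false;

    auto skipSpace = [&text](std::size_t pos) {
        while (pos < text.size() && std::iswspace(text[pos]))
            ++pos;
        return pos;
    };

    std::size_t pos = 5;
    while (pos < text.size())
    {
        pos = skipSpace(pos);
        if (text.compare(pos, 2, L"?>") == 0)
            return true;                     // reached the end of the header

        // Read the field name and accept only the three known fields.
        std::size_t nameEnd = pos;
        while (nameEnd < text.size() && std::iswalpha(text[nameEnd]))
            ++nameEnd;
        const std::wstring name = text.substr(pos, nameEnd - pos);
        if (name != L"version" && name != L"encoding" && name != L"standalone")
            return false;

        // Expect '=' followed by a quoted value.
        pos = skipSpace(nameEnd);
        if (pos >= text.size() || text[pos] != L'=')
            return false;
        pos = skipSpace(pos + 1);
        if (pos >= text.size() || (text[pos] != L'"' && text[pos] != L'\''))
            return false;

        const std::size_t close = text.find(text[pos], pos + 1);
        if (close == std::wstring::npos)
            return false;
        attrs[name] = text.substr(pos + 1, close - pos - 1);
        pos = close + 1;
    }
    return false;                            // input ended before "?>"
}
```

As in the original, reaching "?>" is the only successful exit; any unknown field name ends parsing with an error.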

4. Parsing Element nodes in XML

An Element is recognized as follows:

001、 // <element>..</element>

002、 else if ( ::isalnum( *(ptr+1) ) || *(ptr+1)==L'_' )

003、 {

004、 child = new XmlElement();

005、 }

A '<' character immediately followed by a letter, a digit, or an underscore marks the start of an element.
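As a standalone illustration of this check, the sketch below wraps it in a hypothetical function, looksLikeElementStart, which is not part of Cexer. It uses std::iswalnum because the text is wide-character data; the narrow isalnum used in the excerpt is only defined for values that fit in an unsigned char.

```cpp
#include <cwctype>

// Hypothetical helper: true when ptr points at '<' immediately followed by
// a letter, a digit or an underscore.
bool looksLikeElementStart(const wchar_t* ptr)
{
    if (!ptr || ptr[0] != L'<')
        return false;
    // ptr[1] is safe to read: at worst it is the string terminator.
    const wchar_t next = ptr[1];
    return std::iswalnum(next) != 0 || next == L'_';
}
```

Comments, processing instructions and closing tags all fail this test because their second character is '!', '?' or '/'.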

Parsing of an Element is implemented by the _parse method of the XmlElement class.

001、 PCWSTR XmlElement::_parse( PCWSTR ptr )

002、 {

003、 CEXER_ASSERT( ptr != NULL );

004、

005、 if ( !ptr || *ptr!=L'<' )

006、 {

007、 return NULL;

008、 }

009、

0010、 ++ptr;

0011、

0012、 // Allow skippable characters between '<' and the element name

0013、 ptr = xml::skipInvalidChar( ptr );

0014、 if ( !ptr || !*ptr )

0015、 {

0016、 return NULL;

0017、 }

0018、

0019、 ptr = XmlNamedNode::_parse( ptr );

0020、 if ( !ptr || !*ptr )

0021、 {

0022、 return NULL;

0023、 }

0024、

0025、 while ( ptr && *ptr )

0026、 {

0027、 ptr = xml::skipInvalidChar( ptr );

0028、 if ( *ptr == L'/' )

0029、 {

0030、 ++ptr;

0031、

0032、 if ( *ptr != L'>' )

0033、 {

0034、 return NULL;

0035、 }

0036、

0037、 return ptr+1;

0038、 }

0039、 else if ( *ptr == L'>' )

0040、 {

0041、 // The tag has the form <Element>,

0042、 // so record the matching end tag </Element>

0043、 //skip '>' char

0044、 ++ptr;

0045、

0046、 stringW endTag = L"</";

0047、 endTag += m_name + L">";

0048、 long endTagLen = (long)endTag.length();

0049、

0050、 // Parse the content between <Element> and </Element>

0051、 ptr = _parseContent( ptr );

0052、 if ( !ptr || !*ptr || ::wcsncmp(ptr,endTag.c_str(),endTagLen) )

0053、 {

0054、 return NULL;

0055、 }

0056、

0057、 return ptr + endTagLen;

0058、 }

0059、 else

0060、 {

0061、 ptr = _parseAttributes( ptr );

0062、 }

0063、 }

0064、

0065、 return ptr;

0066、 }

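This function parses recursively.

Line 005 requires an Element to begin with the '<' character.

Line 0013 skips the characters between '<' and the element name that xml::skipInvalidChar ignores.

Line 0019 parses the element name.

Line 0025 starts the while loop, which parses everything within the scope of this element. An Element can take one of two forms.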

The first is <Element Attribute1="Attribute_Value1" Attribute2="Attribute_Value2" />

The second is

<Element>

<Element2> ... </Element2>

</Element>

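In the first case, the loop exits at line 0037.

In the second case, the end condition for the current element is recorded first, in the variable endTag. The content in between is parsed by _parseContent, which in turn calls _parse recursively. _parse is therefore a recursive function that parses the complete content of an Element, mirroring the nested definition of Element content.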
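The recursion described above can be shown with a much smaller parser. This is a sketch under simplifying assumptions, not Cexer's implementation: Element, skipSpaces and parseElement are hypothetical names, attributes are not supported, text content is skipped, and the content loop is folded into the same function instead of a separate _parseContent.

```cpp
#include <cwchar>
#include <cwctype>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Element
{
    std::wstring name;
    std::vector<std::unique_ptr<Element>> children;
};

const wchar_t* skipSpaces(const wchar_t* p)
{
    while (*p && std::iswspace(*p))
        ++p;
    return p;
}

// Parse one element starting at '<'. Returns a pointer just past the element,
// or nullptr on error.
const wchar_t* parseElement(const wchar_t* p, Element& out)
{
    if (!p || *p != L'<')
        return nullptr;
    p = skipSpaces(p + 1);

    // Element name: letters, digits and underscores.
    const wchar_t* first = p;
    while (std::iswalnum(*p) || *p == L'_')
        ++p;
    if (p == first)
        return nullptr;
    out.name.assign(first, p);

    p = skipSpaces(p);
    if (*p == L'/')                              // first form: <Element />
        return p[1] == L'>' ? p + 2 : nullptr;
    if (*p != L'>')
        return nullptr;                          // attributes are not handled here
    ++p;

    // Second form: remember the matching end tag, then parse the content.
    const std::wstring endTag = L"</" + out.name + L">";
    while (*p)
    {
        if (std::wcsncmp(p, endTag.c_str(), endTag.size()) == 0)
            return p + endTag.size();
        if (*p == L'<')
        {
            if (p[1] == L'/')
                return nullptr;                  // end tag of a different element
            std::unique_ptr<Element> child(new Element());
            p = parseElement(p, *child);         // recursive call for a nested element
            if (!p)
                return nullptr;
            out.children.push_back(std::move(child));
        }
        else
        {
            ++p;                                 // plain text is skipped in this sketch
        }
    }
    return nullptr;                              // end tag never found
}
```

Each nested element is handled by a new call to parseElement, so the call stack follows the nesting depth of the document, just as _parse and _parseContent call each other in Cexer.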

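5. How nodes are organized in XML

(1) All nodes at the same level of the XML document are chained into a linked list, with head and tail pointers maintained. This is implemented by the appendChild function.

(2) All attributes within an element are stored together in a vector.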
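The two points above suggest a node layout along the following lines. This is a minimal sketch, not Cexer's class hierarchy: Node and its members are hypothetical names, and only appendChild is taken from the description.

```cpp
#include <string>
#include <utility>
#include <vector>

struct Node
{
    std::wstring name;
    // Attributes of this element as (name, value) pairs, kept in a vector.
    std::vector<std::pair<std::wstring, std::wstring>> attributes;

    Node* parent = nullptr;
    Node* firstChild = nullptr;   // head of the sibling list
    Node* lastChild = nullptr;    // tail of the sibling list
    Node* nextSibling = nullptr;

    explicit Node(const std::wstring& n) : name(n) {}
    Node(const Node&) = delete;
    Node& operator=(const Node&) = delete;

    // A node owns its children and deletes them along with itself.
    ~Node()
    {
        Node* child = firstChild;
        while (child)
        {
            Node* next = child->nextSibling;
            delete child;
            child = next;
        }
    }

    // Take ownership of child and link it at the end of the sibling list.
    // The tail pointer makes this a constant-time operation.
    Node* appendChild(Node* child)
    {
        child->parent = this;
        if (lastChild)
            lastChild->nextSibling = child;
        else
            firstChild = child;
        lastChild = child;
        return child;
    }
};
```

Keeping both a head and a tail pointer means the parser can append children in document order without walking the list each time.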

That wraps up this analysis of the Cexer code. Next I will set about implementing it in C.
