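libxml2 needs no introduction since it is so widely used. Here I only want to share a few problems I ran into while using it.
Setting sensible parser options keeps whitespace-only text from being turned into nodes during XML parsing: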
int opt = XML_PARSE_RECOVER | XML_PARSE_NOBLANKS | XML_PARSE_NOERROR;
Xml2Doc doc = Xml2Doc::tryParseFile(fileName, "", opt);
Most HTML files contain errors, so the parser has to recover from them and suppress error and warning messages:
XML_PARSE_RECOVER
XML_PARSE_NOERROR
XML_PARSE_NOWARNING
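Some pages also do not conform to the XML specification: elements are left unclosed, and XPath then finds nothing at all. This is mainly because node names get lost while the tree is being parsed. A hand-written depth-first recursive search still works, though.
For example, take the following page fragment: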
<div class="SoundBox oh last" data-id="294360">
<div class="SoundContent">
<div class="buttons"><a target="_blank" href="https://www.yespik.com/download-sound_294360_1.html"
class="downBtn">立即下载</a><a href="javascript:;" onclick="listFav(294360,this,21)"
class="fav favBtn" fav-id="294360" action="add"></a> </div>
<div class="SoundTitle"> <a target="_blank"
href="https://www.yespik.com/show-sound_294360.html">震撼大气的年会颁奖片头上场背景音乐</a> </div>
<div class="SoundPlayer">
<div class="SoundDiskBox pr fl SoundDiskBox114 StartPlay" data-id="294360"
data-mp3="preview/sound/00/29/43/515ppt-S294360-6C1F5108.mp3" data-width="164">
<div class="SoundPlayerBg"></div>
<!--<div class="SoundPlayerBtn pa opacity-8 SoundPlayerBtn114"></div>-->
</div>
<audio preload="none" data-time="63">
<source
src="//img-bsy2.yespik.com/sound/00/29/43/60/294360_a4f43b70e64c10553ba7fb8451dcd269.mp3"
type="audio/mpeg">
</audio>
<div class="DurationBox">
<div class="SoundStartTime fl star-time">00:00</div>
<div class="SoundWave fl pr time-bar"> <span class="progressBar"></span> <i
class="move-color"></i>
<p class="timetip"></p>
</div>
<div class="SoundEndTime fl end-time">01:03</div>
</div>
</div>
</div>
</div>
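The hand-written recursive search can be sketched independently of libxml2. This is only a minimal illustration, not the wrapper's actual implementation: `Node`, `matchesAttr`, `deepFind`, `addChild` and `buildSample` are hypothetical names that belong neither to libxml2 nor to the Xml2Doc wrapper used in this post, and an empty attribute name is assumed to mean "no attribute filter".

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parsed element; NOT a libxml2 type.
struct Node {
    std::string name;
    std::vector<std::pair<std::string, std::string>> attrs;
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

// True when attrName is empty (assumed to mean "no filter"),
// or when the node carries an attribute attrName with value attrValue.
bool matchesAttr(const Node& n, const std::string& attrName,
                 const std::string& attrValue) {
    if (attrName.empty()) return true;
    for (const auto& kv : n.attrs)
        if (kv.first == attrName && kv.second == attrValue) return true;
    return false;
}

// Depth-first walk over all descendants of root (root itself is excluded),
// collecting matching elements in document order.
void deepFind(const Node& root, const std::string& name,
              const std::string& attrName, const std::string& attrValue,
              std::vector<const Node*>& out) {
    for (const auto& child : root.children) {
        if (child->name == name && matchesAttr(*child, attrName, attrValue))
            out.push_back(child.get());
        deepFind(*child, name, attrName, attrValue, out);
    }
}

// Appends an empty child element with the given name and returns it.
Node& addChild(Node& parent, const std::string& name) {
    parent.children.push_back(std::make_unique<Node>());
    Node& child = *parent.children.back();
    child.name = name;
    return child;
}

// Builds a reduced version of the SoundContent fragment shown above.
std::unique_ptr<Node> buildSample() {
    auto root = std::make_unique<Node>();
    root->name = "div";
    root->attrs.emplace_back("class", "SoundContent");

    Node& title = addChild(*root, "div");
    title.attrs.emplace_back("class", "SoundTitle");
    Node& link = addChild(title, "a");
    link.text = "sample title";

    Node& player = addChild(*root, "div");
    player.attrs.emplace_back("class", "SoundPlayer");
    Node& audio = addChild(player, "audio");
    Node& source = addChild(audio, "source");
    source.attrs.emplace_back("type", "audio/mpeg");
    return root;
}
```

On this sample tree, `deepFind(*root, "source", "", "", out)` collects the single `source` element nested under `audio`, and `deepFind(*root, "div", "class", "SoundTitle", out)` collects the title `div`. Neither call depends on XPath.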
From the page fragment above I need to extract the useful information, namely the audio source and the title:
// https://www.yespik.com/sound/0-5_0_0_0-0-default/p_1/
void testDeepFind()
{
    std::string fileName = "音效素材下载-音效大全-配乐.htm";
    int opt = XML_PARSE_RECOVER | XML_PARSE_NOBLANKS | XML_PARSE_NOERROR;
    Xml2Doc doc = Xml2Doc::tryParseFile(fileName, "", opt);
    if (doc.isNull())
        return;

    // Recursively find the outer container nodes by attribute name/value pair
    Xml2NodeArray vec = doc.getByName("div", "class", "SoundContent");
    for (auto& it : vec)
    {
        std::cout << it.getName() << "[ ";

        // Starting from this node, search again for the <source> node
        Xml2NodeArray srcList = it.getDeepElementsByName("source", "", "");
        if (srcList.size() > 0)
        {
            printAttrs(srcList[0]);
        }

        // Starting from this node, find the div whose class is SoundTitle,
        // then read the text of its <a> child
        Xml2NodeArray titleList = it.getDeepElementsByName("div", "class", "SoundTitle");
        if (titleList.size() > 0)
        {
            Xml2NodeArray txtList = titleList[0].getChildrenByName("a");
            if (txtList.size() > 0)
            {
                // The node value is UTF-8; convert it to ANSI before printing
                std::string val = StringEncoding::utf8ToAnsi(txtList[0].getValue());
                std::cout << val;
            }
        }
        std::cout << "] \n ";
    }
}
Example output: