【Problem】
In C#, I used HtmlAgilityPack to parse the following text from a page's HTML:
World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini)
and found that the corresponding source is:
World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini
amznJQ.available('jQuery', function() {
(function ($) {
amznJQ.available('popover', function() {
    var content = '
Two Antennas, Better Bandwidth
'+ ''
    $('#kpp-popover-0').amazonPopoverTrigger({
        literalContent: content,
        closeText: 'Close',
        title: ' ',
        width: 550,
        location: 'centered'
    });
});
}(jQuery));
});
)
After parsing it with HtmlAgilityPack, however, the InnerText turned out to be:
World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini\namznJQ.available('jQuery', function() { \n(function ($) {\namznJQ.available('popover', function() {\n\tvar content = '
Two Antennas, Better Bandwidth
' \n\n\t+ ' '\n\t\n\t$('#kpp-popover-0').amazonPopoverTrigger({\n\t\tliteralContent: content,\n\t\tcloseText: 'Close',\n\t\ttitle: ' ',\n\t\twidth: 550,\n\t\tlocation: 'centered'\n\t});\n\n});\n}(jQuery)); \n}); \n\n)
rather than the desired:
World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini)
In other words, the JavaScript has to be removed from InnerText.
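The behavior described above can be reproduced with a much smaller input. The following is only a minimal sketch: the markup in SampleHtml is a hypothetical, simplified stand-in for the real page, and the class and method names are made up for illustration. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class InnerTextDemo
{
    // Hypothetical, simplified markup: visible text with an inline <script> in the middle.
    public const string SampleHtml =
        "<div>faster downloads (compared to the iPad mini" +
        "<script>var content = 'popover';</script>)</div>";

    // Loads the HTML and returns the InnerText of the first <div>.
    public static string GetBulletInnerText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div");
        return div.InnerText;
    }

    public static void Main()
    {
        // In the situation described in this post, the printed text also
        // carries the JavaScript source, not only the visible text.
        Console.WriteLine(GetBulletInnerText(SampleHtml));
    }
}
```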
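【Solution Process】
1. I referred to a post I had read before:
and the corresponding:
Then, after a long debugging session, I ended up with:
// Remove sub nodes from the current HTML node.
// e.g. pass "script" as subNodeToRemove to remove the <script> child nodes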
public HtmlNode removeSubHtmlNode(HtmlNode curHtmlNode, string subNodeToRemove)
{
    // afterRemoved refers to the same node as curHtmlNode: the node is modified in place.
    HtmlNode afterRemoved = curHtmlNode;
    // SelectNodes returns a separate collection (or null when nothing matches),
    // so removing children while looping over it is safe.
    // A relative XPath such as "script" matches direct children only.
    HtmlNodeCollection foundAllSub = curHtmlNode.SelectNodes(subNodeToRemove);
    if ((foundAllSub != null) && (foundAllSub.Count > 0))
    {
        foreach (HtmlNode subNode in foundAllSub)
        {
            curHtmlNode.RemoveChild(subNode);
        }
    }

    // Failed attempt, kept for reference:
    //foreach (var subNode in afterRemoved.Descendants(subNodeToRemove))
    //{
    //    //An unhandled exception of type 'System.InvalidOperationException' occurred in mscorlib.dll
    //    //Additional information: Collection was modified; enumeration operation may not execute.
    //    afterRemoved.RemoveChild(subNode);
    //    curHtmlNode.RemoveChild(subNode);
    //    //subNode.Remove();
    //}

    return afterRemoved;
}
HtmlNode curBulletNode = allBulletNodeList[idx];
HtmlNode noJsNode = crl.removeSubHtmlNode(curBulletNode, "script");
HtmlNode noStyleNode = crl.removeSubHtmlNode(curBulletNode, "style");
string bulletStr = noStyleNode.InnerText;
which solved the problem.
From this we can see:
1. In the example given in that post, finding the child nodes with
htmlDoc.DocumentNode.Descendants("script")
and then deleting them with
script.Remove();
works.
2. But here, doing similar processing on the current HTML node:
foreach (var subNode in afterRemoved.Descendants(subNodeToRemove))
{
    //An unhandled exception of type 'System.InvalidOperationException' occurred in mscorlib.dll
    //Additional information: Collection was modified; enumeration operation may not execute.
    afterRemoved.RemoveChild(subNode);
    curHtmlNode.RemoveChild(subNode);
    //subNode.Remove();
}
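triggers the error shown in the comments:
Additional information: Collection was modified; enumeration operation may not execute.
That is, removing items from a collection while it is being enumerated is not allowed.
That is why I came up with a different way to achieve the same remove effect.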
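The same error can be reproduced without HtmlAgilityPack at all, which suggests it comes from the enumeration rule rather than from anything specific to HTML nodes. The sketch below uses a plain List<string> as a stand-in for the child node collection (the class and method names are hypothetical) and shows that looping over a copy made with ToList() avoids the exception.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EnumerationDemo
{
    // Removes items from the list that is being enumerated.
    // Returns true if InvalidOperationException was thrown.
    public static bool RemoveWhileEnumeratingThrows()
    {
        var tags = new List<string> { "script", "span", "script" };
        try
        {
            foreach (string tag in tags)
            {
                if (tag == "script")
                {
                    tags.Remove(tag); // modifies the list being enumerated
                }
            }
            return false;
        }
        catch (InvalidOperationException)
        {
            // "Collection was modified; enumeration operation may not execute."
            return true;
        }
    }

    // Enumerates a snapshot, so the original list can be modified freely.
    public static List<string> RemoveFromSnapshot()
    {
        var tags = new List<string> { "script", "span", "script" };
        foreach (string tag in tags.ToList()) // ToList() makes a copy
        {
            if (tag == "script")
            {
                tags.Remove(tag);
            }
        }
        return tags;
    }

    public static void Main()
    {
        Console.WriteLine(RemoveWhileEnumeratingThrows());
        Console.WriteLine(string.Join(",", RemoveFromSnapshot()));
    }
}
```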
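【Summary】
Getting this kind of removal to work was truly exhausting...
Removing child nodes under the root node is easy;
removing nodes under some current node is hard. (Later, while debugging, I found that subNode.Remove(); had in fact already removed the node successfully, but the foreach loop then kept going, and that is what raised the error...)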
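Given the observation in parentheses above, one possible alternative is to copy the matching descendants into a list first and only then call Remove() on each of them, so the loop never runs over the live node tree. This is a sketch, not the approach used in this post: NodeCleaner and InnerTextWithout are hypothetical names, the sample markup is made up, and the HtmlAgilityPack NuGet package is assumed. Like removeSubHtmlNode, it modifies the node in place, but it also reaches matches nested at any depth.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class NodeCleaner
{
    // Hypothetical helper, not part of HtmlAgilityPack.
    // Removes every descendant whose tag name is in tagNames, then returns InnerText.
    public static string InnerTextWithout(HtmlNode node, params string[] tagNames)
    {
        // ToList() takes a snapshot, so Remove() does not disturb the enumeration.
        List<HtmlNode> toRemove = node.Descendants()
            .Where(n => tagNames.Contains(n.Name))
            .ToList();

        foreach (HtmlNode n in toRemove)
        {
            n.Remove();
        }
        return node.InnerText;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Hypothetical, simplified markup for illustration only.
        doc.LoadHtml("<div>iPad mini<script>var a = 1;</script>)<style>.x{}</style></div>");
        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div");
        Console.WriteLine(InnerTextWithout(div, "script", "style"));
    }
}
```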