[Solved] When parsing HTML with HtmlAgilityPack in C#, InnerText contains JavaScript, and the JavaScript needs to be removed...

【Problem】

In C#, I used HtmlAgilityPack to parse this text from a page's HTML:

World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini)

The corresponding source turned out to be:

World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini

amznJQ.available(‘jQuery’, function() {

(function ($) {

amznJQ.available(‘popover’, function() {

var content = ‘

Two Antennas, Better Bandwidth

+ ‘782135cb331b93c24967bcd64c9fe6d2.gif

$(‘#kpp-popover-0’).amazonPopoverTrigger({

literalContent: content,

closeText: ‘Close’,

title: ‘ ’,

width: 550,

location: ‘centered’

});

});

}(jQuery));

});

)

After parsing it with HtmlAgilityPack, the resulting InnerText was:

World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini\namznJQ.available(‘jQuery’, function() { \n(function ($) {\namznJQ.available(‘popover’, function() {\n\tvar content = ‘

Two Antennas, Better Bandwidth

’ \n\n\t+ ‘ tate_feature-wifi._V395653267_.gif%5C%22’\n\t\n\t$(‘#kpp-popover-0’).amazonPopoverTrigger({\n\t\tliteralContent: content,\n\t\tcloseText: ‘Close’,\n\t\ttitle: ‘ ’,\n\t\twidth: 550,\n\t\tlocation: ‘centered’\n\t});\n\n});\n}(jQuery)); \n}); \n\n)

rather than the desired:

World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini)

In other words, the JavaScript has to be removed from InnerText.

【Solution Process】

1. I referred to a post I had read before:

and the corresponding:

Then, after a long stretch of debugging, I ended up using:

//remove sub node from current html node

//eg: "script"
//note: the node is modified in place; the return value is the same instance
public HtmlNode removeSubHtmlNode(HtmlNode curHtmlNode, string subNodeToRemove)
{
    HtmlNode afterRemoved = curHtmlNode;

    //subNodeToRemove is evaluated as a relative XPath, so "script" matches direct children only
    //SelectNodes returns null when nothing matches
    HtmlNodeCollection foundAllSub = curHtmlNode.SelectNodes(subNodeToRemove);
    if ((foundAllSub != null) && (foundAllSub.Count > 0))
    {
        foreach (HtmlNode subNode in foundAllSub)
        {
            curHtmlNode.RemoveChild(subNode);
        }
    }

    //foreach (var subNode in afterRemoved.Descendants(subNodeToRemove))
    //{
    //    //An unhandled exception of type 'System.InvalidOperationException' occurred in mscorlib.dll
    //    //Additional information: Collection was modified; enumeration operation may not execute.
    //    afterRemoved.RemoveChild(subNode);
    //    curHtmlNode.RemoveChild(subNode);
    //    //subNode.Remove();
    //}

    return afterRemoved;
}

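HtmlNode curBulletNode = allBulletNodeList[idx];
//removeSubHtmlNode modifies the node in place, so noJsNode and noStyleNode
//refer to the same instance as curBulletNode
HtmlNode noJsNode = crl.removeSubHtmlNode(curBulletNode, "script");
HtmlNode noStyleNode = crl.removeSubHtmlNode(curBulletNode, "style");
string bulletStr = noStyleNode.InnerText;

which solved the problem.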
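One caveat about removeSubHtmlNode: because "script" is evaluated as a relative XPath, it finds only direct children of the current node, which is also why RemoveChild succeeds there. If a script element sat deeper in the subtree, a descendant XPath would be needed. The following is a minimal sketch of that variant, assuming the HtmlAgilityPack package is referenced; NodeCleaner and RemoveDescendants are hypothetical names, not part of the original code.

```csharp
using HtmlAgilityPack;

public static class NodeCleaner
{
    // Hypothetical helper: removes every descendant element with the given
    // tag name from 'node', in place, and returns the same node instance.
    public static HtmlNode RemoveDescendants(HtmlNode node, string tagName)
    {
        // ".//tag" searches the whole subtree under 'node';
        // a bare "tag" would match direct children only.
        HtmlNodeCollection found = node.SelectNodes(".//" + tagName);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (found != null)
        {
            foreach (HtmlNode match in found)
            {
                // Remove() detaches the node from its own parent, so it also
                // works for matches that are not direct children of 'node'.
                match.Remove();
            }
        }
        return node;
    }
}
```

Looping over the collection returned by SelectNodes is safe here because it is a separate list of matches, not the live child list that Remove() modifies.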

A few observations:

1. In the example given in that post, finding the child nodes with

htmlDoc.DocumentNode.Descendants("script")

and then deleting them with

script.Remove();

works.

2. Here, however, doing the same kind of thing on the current HTML node:

foreach (var subNode in afterRemoved.Descendants(subNodeToRemove))
{
    //An unhandled exception of type 'System.InvalidOperationException' occurred in mscorlib.dll
    //Additional information: Collection was modified; enumeration operation may not execute.
    afterRemoved.RemoveChild(subNode);
    curHtmlNode.RemoveChild(subNode);
    //subNode.Remove();
}

raises the error quoted in the comments:

Additional information: Collection was modified; enumeration operation may not execute.


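That is, removing items from a collection while it is being enumerated is not allowed.

That is why I looked for another way to achieve the same remove effect.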
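The same rule can be reproduced without HtmlAgilityPack. The sketch below uses only a plain List<string>; the class and method names are made up for illustration. The first method removes items inside a foreach over the same list and reports whether InvalidOperationException was thrown; the second loops over a copy instead.

```csharp
using System;
using System.Collections.Generic;

public static class EnumerationDemo
{
    // Removes "script" entries while looping over the same list.
    // List<T> invalidates its enumerator when the list changes, so the
    // next loop step after the first removal throws.
    public static bool RemoveWhileEnumeratingThrows()
    {
        var tags = new List<string> { "p", "script", "span", "script" };
        try
        {
            foreach (string tag in tags)
            {
                if (tag == "script")
                {
                    tags.Remove(tag);
                }
            }
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }

    // Loops over a snapshot (ToArray) and removes from the original list,
    // so the enumerated sequence is never modified.
    public static List<string> RemoveViaSnapshot()
    {
        var tags = new List<string> { "p", "script", "span", "script" };
        foreach (string tag in tags.ToArray())
        {
            if (tag == "script")
            {
                tags.Remove(tag);
            }
        }
        return tags;
    }
}
```

Taking a snapshot costs one extra copy of the list, which is negligible for the handful of nodes involved in a case like this.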

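【Summary】

Getting this kind of removal to work was truly exhausting...

Removing child nodes under the root node is easy;

removing nodes under some current node is hard. (Later, while debugging, I found that subNode.Remove(); had in fact already removed the node successfully; the foreach loop then carried on, and that is what caused the error...)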
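The parenthetical note suggests that the failure comes from continuing to enumerate a live sequence after a removal, rather than from where the node sits in the tree. Under that reading (an interpretation, not something the original post verifies), copying the matches into a list before removing anything should work on any node. A minimal sketch, assuming the HtmlAgilityPack package is referenced; SnapshotCleaner and TextWithoutScripts are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SnapshotCleaner
{
    // Hypothetical helper: strips <script> and <style> elements from the
    // subtree under 'node' (in place) and returns the remaining text.
    public static string TextWithoutScripts(HtmlNode node)
    {
        // ToList() finishes enumerating Descendants() before any removal,
        // so the loop below never walks a collection that is being changed.
        List<HtmlNode> unwanted = node.Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList();

        foreach (HtmlNode n in unwanted)
        {
            n.Remove();
        }
        return node.InnerText;
    }
}
```

Called as SnapshotCleaner.TextWithoutScripts(curBulletNode), this would replace the two removeSubHtmlNode calls and the InnerText read, at the cost of modifying curBulletNode in place just as the original helper does.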
