Using C# Regular Expressions to Remove HTML Tags

Latest recommended article published 2021-12-23 11:43:52

纳特帕格

The correct answer is: don't do that. Use the HTML Agility Pack.

Edited to add:

To shamelessly steal from the comment below by jesse, and to avoid being accused of inadequately answering the question after all this time, here's a simple, reliable snippet using the HTML Agility Pack that works with even most imperfectly formed, capricious bits of HTML:

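    using System.Linq;
    using System.Text;
    using System.Web;        // HttpUtility
    using HtmlAgilityPack;   // HtmlDocument

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(Properties.Resources.HtmlContents);

    // SelectNodes returns null when nothing matches, so guard before enumerating.
    var nodes = doc.DocumentNode.SelectNodes("//body//text()");
    var text = nodes == null ? Enumerable.Empty<string>() : nodes.Select(node => node.InnerText);

    StringBuilder output = new StringBuilder();
    foreach (string line in text)
    {
        output.AppendLine(line);
    }

    // Decode entities such as &amp; and &lt; back into plain characters.
    string textOnly = HttpUtility.HtmlDecode(output.ToString());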

There are very few defensible cases for using a regular expression to parse HTML, because HTML can't be parsed correctly without context-awareness (whether you are inside a tag, a quoted attribute value, a comment, or a script block), and that is very painful to provide even in a nontraditional regex engine. You can get part of the way there with a regex, but you will need to verify the results manually.

HTML Agility Pack gives you a robust solution that reduces the need to manually fix up the aberrations that result from naively treating HTML as a regular language.

A regular expression may get you mostly what you want most of the time, but it will fail on very common cases. If you can find a better/faster parser than HTML Agility Pack, go for it, but please don't subject the world to more broken HTML hackery.
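
To make the "very common cases" concrete, here is a minimal sketch of the naive approach. The pattern `<[^>]*>` and the `NaiveTagStripper` helper are assumptions chosen for illustration, not code from the question. It uses only `System.Text.RegularExpressions`, so it runs without the HTML Agility Pack.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveTagStripper
{
    // Hypothetical helper: removes anything that looks like <...>.
    // This is the kind of regex the answer warns against.
    public static string StripTags(string html) =>
        Regex.Replace(html, "<[^>]*>", string.Empty);

    public static void Main()
    {
        // A '>' inside a quoted attribute value ends the match too early,
        // so part of the attribute leaks into the output.
        Console.WriteLine(StripTags("<a title=\"a > b\">link</a>"));
        // prints: " b">link" (with a leading space, without the outer quotes)

        // Script content is not markup, but the regex cannot know that:
        // it treats "< b) run();</script>" as one tag and keeps "if (a ".
        Console.WriteLine(StripTags("<script>if (a < b) run();</script>Hello"));
        // prints: "if (a Hello" (without the quotes)
    }
}
```

Both inputs are ordinary HTML, yet neither result is the visible text a reader would expect ("link" and "Hello"). Patching the pattern for quoted attributes, comments, and script blocks quickly turns into writing a parser by hand, which is the argument for using an existing one.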
