Stripping HTML tags while keeping selected ones

roden

Published 2007-05-24 23:12:00

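Category: .NET. Tags: html, string, c

Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/roden/article/details/1624791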

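News text copied straight from a web page usually carries a large amount of HTML markup. The two functions below strip that markup while selectively keeping a few tags.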

// Requires: using System.Text.RegularExpressions;
private static string html2TextPattern =
    @"(?<script><script[^>]*?>.*?</script>)|(?<style><style[^>]*?>.*?</style>)|(?<comment><!--.*?-->)" +
    @"|(?<html>(?!(<a\b)|<p\s|(<p>)|(<img\b)|(<br\b)|(</)|(<strong\b))" + // prefixes of the tags to keep: <a>, <p>, <img>, <br>, <strong>
    @"<[^>]+>)" + // any other HTML tag
    @"|(?<quot>&(quot|#34);)" + // symbol: "
    @"|(?<amp>&(amp|#38);)" + // symbol: &
    @"|(?<end>(?!(</a\b)|(</strong\b)|(</p>))</[^>]+>)" + // closing tags; </a>, </strong> and </p> are kept
    @"|(?<iexcl>&(iexcl|#161);)" + // symbol: (char)161
    @"|(?<cent>&(cent|#162);)" + // symbol: (char)162
    @"|(?<pound>&(pound|#163);)" + // symbol: (char)163
    @"|(?<copy>&(copy|#169);)" + // symbol: (char)169
    @"|(?<others>&#(\d+);)"; // symbol: any other numeric entity

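/// <summary>Strips HTML tags from a string, keeping a few whitelisted tags.</summary>
/// <param name="html">The HTML string.</param>
public static string Html2Text(string html)
{
    string pattern = html2TextPattern;
    // Matches a line break followed by spaces and replaces it with a <P> tag.
    // Assumption: a paragraph break is a line break plus horizontal whitespace
    // (not directly after a tag), or <br> followed by 2-4 &nbsp; entities or
    // whitespace characters.
    string pattern2 = @"((?<=[^>])\r?\n[^\S\r\n]+)|(<br>(&nbsp;){2,4})|(<br>\s{2,4})";
    RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled;
    string txt = Regex.Replace(html, pattern2, "<P>", options);
    txt = Regex.Replace(txt, pattern, new MatchEvaluator(Html2Text_Match), options);
    return txt;
}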
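The first Regex.Replace call marks paragraph breaks before any tags are stripped. Below is a minimal sketch of that step in isolation, assuming a paragraph break is a line break followed by spaces or tabs. ParagraphDemo and MarkParagraphs are hypothetical names, and the pattern is a simplification for illustration, not the one used in Html2Text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParagraphDemo
{
    // Hypothetical helper: turns "line break + indentation" into a <P> marker.
    // The lookbehind skips breaks that directly follow a tag, so whitespace
    // between two tags is left alone, and no text character is consumed.
    public static string MarkParagraphs(string text)
    {
        return Regex.Replace(text, @"(?<=[^>])\r?\n[ \t]+", "<P>");
    }

    public static void Main()
    {
        // prints: First paragraph.<P>Second paragraph.
        Console.WriteLine(MarkParagraphs("First paragraph.\n  Second paragraph."));
    }
}
```

Using a lookbehind rather than a consuming `[^>]` keeps the last character of the preceding paragraph in the output.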

private static string Html2Text_Match(Match m)
{
    // Decode the entities that have a named group of their own.
    if (m.Groups["quot"].Success)
        return "\"";
    else if (m.Groups["amp"].Success)
        return "&";
    else if (m.Groups["iexcl"].Success)
        return "¡";
    else if (m.Groups["cent"].Success)
        return "¢";
    else if (m.Groups["pound"].Success)
        return "£";
    else if (m.Groups["copy"].Success)
        return "(c)";
    else
        // Scripts, styles, comments, other tags and other entities are dropped.
        return string.Empty;
}
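The evaluator works by checking which named group took part in the match. Here is a minimal sketch of that dispatch on a reduced pattern covering three entities only. EntityDemo and Decode are hypothetical names introduced for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EntityDemo
{
    // One named group per entity; at most one group succeeds for each match.
    private static readonly Regex Entity = new Regex(
        @"(?<quot>&(quot|#34);)|(?<amp>&(amp|#38);)|(?<copy>&(copy|#169);)",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: replaces each matched entity according to its group.
    public static string Decode(string text)
    {
        return Entity.Replace(text, m =>
        {
            if (m.Groups["quot"].Success) return "\"";
            if (m.Groups["amp"].Success) return "&";
            if (m.Groups["copy"].Success) return "(c)";
            return string.Empty;
        });
    }

    public static void Main()
    {
        // prints: "R&D" (c) 2007
        Console.WriteLine(Decode("&quot;R&amp;D&quot; &copy; 2007"));
    }
}
```

Replacement text is not rescanned, so the `&` produced for `&amp;` cannot be mistaken for the start of another entity. If the full entity set is needed, System.Net.WebUtility.HtmlDecode (available from .NET Framework 4.0) may be a simpler choice than listing groups by hand.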

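Calling Html2Text() strips the HTML tags and returns the remaining text. Bold text, hyperlinks, paragraphs, images and line breaks are kept, and text that was split into paragraphs by a line break plus spaces gets a <P> tag in place of each break.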
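The tag whitelist rests on a negative lookahead placed in front of a generic tag pattern. The sketch below shows only that idea, under the assumption that opening and closing tags can share one lookahead. TagWhitelistDemo and StripExcept are hypothetical names, and this is not the full pattern from the post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagWhitelistDemo
{
    // The lookahead rejects whitelisted tag names before "<[^>]+>" can consume
    // the tag, so only tags outside the whitelist are matched and removed.
    private static readonly Regex NonWhitelistedTag = new Regex(
        @"(?!</?(a|p|img|br|strong)\b)<[^>]+>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: removes every tag except <a>, <p>, <img>, <br>, <strong>.
    public static string StripExcept(string html)
    {
        return NonWhitelistedTag.Replace(html, string.Empty);
    }

    public static void Main()
    {
        string input = "<div><p>Hello <b>big</b> <strong>world</strong></p></div>";
        // prints: <p>Hello big <strong>world</strong></p>
        Console.WriteLine(StripExcept(input));
    }
}
```

The `\b` after the tag name matters: without it, a prefix such as `<a` would also protect `<abbr>` or `<address>` from removal.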
