Converting HTML to Text

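When scraping an HTML page, you often need to strip out the HTML markup and keep only the text in the source. A regular expression that matches every tag can do this: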
VB.NET
    ''' <summary>
    ''' Removes all HTML tags.
    ''' </summary>
    ''' <param name="HTML">The HTML source.</param>
    ''' <returns>The input text with all HTML tags removed.</returns>
    ''' <history>
    '''     [Administrator]    2004-9-25    Created
    ''' </history>
    Public Function ParseTags(ByVal HTML As String) As String
        ' Use a regular expression to match and remove every HTML tag,
        ' then return the text that remains.
        ' Regex.Replace is a Shared method, so no Regex instance is needed.
        Return System.Text.RegularExpressions.Regex.Replace(HTML, "<[^>]*>", "")
    End Function

C#
    /// <summary>
    /// Removes all HTML tags.
    /// </summary>
    /// <param name="HTML">The HTML source.</param>
    /// <returns>The input text with all HTML tags removed.</returns>
    public string ParseTags(string HTML)
    {
        // Match every "<...>" tag and replace it with an empty string.
        return System.Text.RegularExpressions.Regex.Replace(HTML, "<[^>]*>", "");
    }

A simple example follows:
VB.NET
    Private Sub Page_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        Dim oStringBuilder As New System.Text.StringBuilder

        ' Build a sample HTML page line by line.
        oStringBuilder.Append(ControlChars.CrLf + "<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.0 Transitional//EN"">")
        oStringBuilder.Append(ControlChars.CrLf + "<HTML>")
        oStringBuilder.Append(ControlChars.CrLf + "    <HEAD>")
        oStringBuilder.Append(ControlChars.CrLf + "        <title>WebForm1</title>")
        oStringBuilder.Append(ControlChars.CrLf + "        <meta name=""GENERATOR"" content=""Microsoft Visual Studio .NET 7.1"">")
        oStringBuilder.Append(ControlChars.CrLf + "        <meta name=""CODE_LANGUAGE"" content=""Visual Basic .NET 7.1"">")
        oStringBuilder.Append(ControlChars.CrLf + "        <meta name=""vs_defaultClientScript"" content=""JavaScript"">")
        oStringBuilder.Append(ControlChars.CrLf + "        <meta name=""vs_targetSchema"" content=""http://schemas.microsoft.com/intellisense/ie5"">")
        oStringBuilder.Append(ControlChars.CrLf + "    </HEAD>")
        oStringBuilder.Append(ControlChars.CrLf + "    <body MS_POSITIONING=""GridLayout"">")
        oStringBuilder.Append(ControlChars.CrLf + "        <form id=""Form1"" method=""post"" runat=""server"">")
        oStringBuilder.Append(ControlChars.CrLf + "            <FONT face=""宋体"">测试</FONT>")
        oStringBuilder.Append(ControlChars.CrLf + "        </form>")
        oStringBuilder.Append(ControlChars.CrLf + "    </body>")
        oStringBuilder.Append(ControlChars.CrLf + "</HTML>")

        ' Strip the tags and write the remaining text to the response.
        Response.Write(ParseTags(oStringBuilder.ToString))
    End Sub

C#
    private void Page_Load(object sender, System.EventArgs e)
    {
        // "\r\n" is the C# equivalent of VB.NET's ControlChars.CrLf,
        // so no reference to Microsoft.VisualBasic is needed.
        System.Text.StringBuilder oStringBuilder = new System.Text.StringBuilder();

        // Build a sample HTML page line by line; inner quotes are escaped as \".
        oStringBuilder.Append("\r\n" + "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\">");
        oStringBuilder.Append("\r\n" + "<HTML>");
        oStringBuilder.Append("\r\n" + "  <HEAD>");
        oStringBuilder.Append("\r\n" + "    <title>WebForm1</title>");
        oStringBuilder.Append("\r\n" + "    <meta name=\"GENERATOR\" content=\"Microsoft Visual Studio .NET 7.1\">");
        oStringBuilder.Append("\r\n" + "    <meta name=\"CODE_LANGUAGE\" content=\"Visual Basic .NET 7.1\">");
        oStringBuilder.Append("\r\n" + "    <meta name=\"vs_defaultClientScript\" content=\"JavaScript\">");
        oStringBuilder.Append("\r\n" + "    <meta name=\"vs_targetSchema\" content=\"http://schemas.microsoft.com/intellisense/ie5\">");
        oStringBuilder.Append("\r\n" + "  </HEAD>");
        oStringBuilder.Append("\r\n" + "  <body MS_POSITIONING=\"GridLayout\">");
        oStringBuilder.Append("\r\n" + "    <form id=\"Form1\" method=\"post\" runat=\"server\">");
        oStringBuilder.Append("\r\n" + "      <FONT face=\"宋体\">测试</FONT>");
        oStringBuilder.Append("\r\n" + "    </form>");
        oStringBuilder.Append("\r\n" + "  </body>");
        oStringBuilder.Append("\r\n" + "</HTML>");

        // Strip the tags and write the remaining text to the response.
        Response.Write(ParseTags(oStringBuilder.ToString()));
    }

The output, as rendered in the browser (which collapses the leftover line breaks and indentation), is:
    WebForm1 测试
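
Note that the pattern above removes only the tags themselves. The contents of script and style elements stay in the result, and entities such as &amp; are left undecoded. The following is a minimal sketch of one way to handle those cases, not part of the original code: the class name HtmlTextSketch and the method ToPlainText are hypothetical, and it assumes System.Net.WebUtility is available (.NET Framework 4.0 or later). Regular expressions remain a rough tool for HTML, so a real parser is safer for malformed markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original post.
public static class HtmlTextSketch
{
    public static string ToPlainText(string html)
    {
        // Drop <script> and <style> elements together with their contents.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Drop HTML comments, which may themselves contain '>' characters.
        text = Regex.Replace(text, @"<!--.*?-->", "", RegexOptions.Singleline);

        // Same pattern as ParseTags, but replace each tag with a space
        // so that words separated only by tags do not run together.
        text = Regex.Replace(text, "<[^>]*>", " ");

        // Decode entities such as &amp; and &lt; into plain characters.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace and trim the ends.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html = "<p>Tom &amp; Jerry</p><script>var x = 1;</script>";
        Console.WriteLine(ToPlainText(html)); // prints "Tom & Jerry"
    }
}
```

Entities are decoded after the tags are stripped, so an encoded sequence such as &lt;b&gt; ends up as the literal text <b> instead of being mistaken for a tag.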