Extracting information from web pages with regular expressions in C#

iteye_4515

Tags: xhtml c# javascript ViewUI

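Hello everyone. Today I'd like to share how to use regular expressions in ASP.NET to extract information from HTML. As we know, web pages often contain useful information such as the page title, text, images, links, and tables. Search engine engineers in particular are likely to care about this kind of data, since they often need to look up keywords, images, and similar items within a page.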

This article shows how to retrieve that information quickly in .NET with regular expressions. We start by creating an empty web application in VS2010.

First, create a source page. This is the page the information will be extracted from, and it contains some basic content: text, scripts, images, and links.

[Download the complete source code for this sample (0 points)] http://download.csdn.net/source/3450356

In this project, every page sets the AutoEventWireup attribute in its Page directive:

<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="SourcePage.aspx.cs" Inherits="CSASPNETStripHtmlCode.SourcePages" %>

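When AutoEventWireup is set to true, the page framework automatically calls the page's event handlers (for example Page_Load) by name. In this sample, without this setting, the second call to the method that retrieves the HTML code fails.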

SourcePage.aspx

<html xmlns="http://www.w3.org/1999/xhtml">
<head id="Head1" runat="server">
    <title></title>
</head>
<script type="text/javascript"></script>
<script type="text/javascript"></script>
<body>
    <form id="form1" runat="server">
    <div>
        Hello everybody:<br />
        <a href="http://www.microsoft.com" type="text/html">www.microsoft.com</a><br />
        <a href="http://www.asp.net">www.asp.net</a><br />
        <input type="text" id="textDisplay" runat="server" /><asp:Button id="Button1" runat="server" Text="Submit" OnClientClick="return click_client()" />
        <input id="Checkbox1" type="checkbox" value="Check" /><br />
    </div>
    <img alt="Image/asp.jpg" src="Image/asp.jpg" />
    <img alt="Image/asp.jpg" src="Image/asp.jpg" width="100"/>
    </form>
</body>
</html>

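Next, add a Default.aspx page. From this page we request SourcePage.aspx and extract the information we need. Its markup consists of a multi-line TextBox and several Buttons; each Button retrieves one kind of information from the source page and puts it in the TextBox. As before, the Page directive at the top of the page sets the AutoEventWireup attribute: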

<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="Default.aspx.cs" Inherits="CSASPNETStripHtmlCode.Default" %>

Default.aspx (HTML):

<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title></title>
</head>
<body>
    <form id="form1" runat="server">
    <div>
        <a href="SourcePage.aspx">View the SourcePage.aspx</a><br />
        <asp:TextBox ID="tbResult" runat="server" Height="416px" Width="534px" TextMode="MultiLine"></asp:TextBox>
        <br />
        <asp:Button ID="btnRetrieveAll" runat="server" Text="Retrieve entire Html" OnClick="btnRetrieveAll_Click" />
        <asp:Button ID="btnRetrievePureText" runat="server" Text="Retrieve pure text" OnClick="btnRetrievePureText_Click" />
        <asp:Button ID="btnRetrieveSriptCode" runat="server" Text="Retrieve script code" OnClick="btnRetrieveSriptCode_Click" />
        <asp:Button ID="btnRetrieveImage" runat="server" Text="Retrieve images" OnClick="btnRetrieveImage_Click" />
        <asp:Button ID="btnRetrievelink" runat="server" Text="Retrieve links" OnClick="btnRetrievelink_Click" />
    </div>
    </form>
</body>
</html>

The last step is to write the methods that use regular expressions to extract content from the HTML.

First we need the HTML of the entire page. We request the source page with the HttpWebRequest and HttpWebResponse classes, read the response with a StreamReader, and return it as a string.

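Then we can parse the HTML and cut out the parts we want. In this sample, btnRetrievePureText retrieves the pure text, btnRetrieveSriptCode retrieves the script code (rarely needed), btnRetrieveImage retrieves the images, and btnRetrievelink retrieves the links. You can of course change the regular expressions and methods to extract other information.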
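The handlers in this sample return whole tags such as <a ...> and <img ...>. If only the attribute values are needed, a capture group can pull them out directly. The following is a minimal console sketch of that idea, not part of the original sample: the names HtmlAttributeExtractor and ExtractAttribute are hypothetical, and it assumes attribute values are quoted with " or '.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original sample.
public static class HtmlAttributeExtractor
{
    // Returns the quoted value of `attribute` for every `tag` element in `html`.
    // Assumption: values are wrapped in single or double quotes.
    public static List<string> ExtractAttribute(string html, string tag, string attribute)
    {
        string pattern = "<" + Regex.Escape(tag) + @"\b[^>]*?\b" + Regex.Escape(attribute)
                       + @"\s*=\s*(?<q>[""'])(?<value>.*?)\k<q>";

        var values = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern,
                     RegexOptions.IgnoreCase | RegexOptions.Singleline))
        {
            // The named group holds the text between the quotes.
            values.Add(m.Groups["value"].Value);
        }
        return values;
    }

    public static void Main()
    {
        string html =
            "<a href=\"http://www.microsoft.com\" type=\"text/html\">www.microsoft.com</a><br />" +
            "<a href='http://www.asp.net'>www.asp.net</a>" +
            "<img alt=\"logo\" src=\"Image/asp.jpg\" width=\"100\"/>";

        foreach (string url in ExtractAttribute(html, "a", "href"))
            Console.WriteLine("link:  " + url);
        foreach (string src in ExtractAttribute(html, "img", "src"))
            Console.WriteLine("image: " + src);
    }
}
```

The \k<q> backreference requires the closing quote to be the same character as the opening one, so a value such as "it's" is captured whole. Regular expressions like this work for simple, well-formed markup; they are not a full HTML parser.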

Here is the complete code:

Default.aspx.cs

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

namespace CSASPNETStripHtmlCode
{
    public partial class Default : System.Web.UI.Page
    {
        private const string MsgPageRetrieveFailed = "Sorry, the web page did not run successfully";
        // Matches any single tag; used to strip markup from a fragment.
        private const string RegexAnyTag = @"<[^>]*>";

        private string strUrl = string.Empty;
        private string strWholeHtml = string.Empty;
        private bool flgPageRetrieved = true;

        protected void Page_Load(object sender, EventArgs e)
        {
            // SourcePage.aspx sits next to Default.aspx, so derive its URL from the current one.
            strUrl = this.Page.Request.Url.ToString().Replace("Default", "SourcePage");
            tbResult.Text = string.Empty;
        }

        protected void btnRetrieveAll_Click(object sender, EventArgs e)
        {
            strWholeHtml = this.GetWholeHtmlCode(strUrl);
            tbResult.Text = flgPageRetrieved ? strWholeHtml : MsgPageRetrieveFailed;
        }

        /// <summary>
        /// Retrieve the entire html code from SourcePage.aspx with WebRequest and
        /// WebResponse. The response is decoded as utf-8.
        /// </summary>
        /// <param name="url">Address of the page to download.</param>
        /// <returns>The html code, or an error message if the request failed.</returns>
        public string GetWholeHtmlCode(string url)
        {
            string strHtml = string.Empty;
            flgPageRetrieved = true;
            try
            {
                // Use the url parameter (the original used the strUrl field by mistake).
                HttpWebRequest wrqContent = (HttpWebRequest)WebRequest.Create(url);
                wrqContent.Timeout = 300000; // milliseconds
                using (HttpWebResponse wrpContent = (HttpWebResponse)wrqContent.GetResponse())
                {
                    if (wrpContent.StatusCode != HttpStatusCode.OK)
                    {
                        flgPageRetrieved = false;
                        strHtml = MsgPageRetrieveFailed;
                    }
                    else
                    {
                        using (StreamReader strReader = new StreamReader(
                            wrpContent.GetResponseStream(), Encoding.UTF8))
                        {
                            strHtml = strReader.ReadToEnd();
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                flgPageRetrieved = false;
                strHtml = ex.Message;
            }
            return strHtml;
        }

        /// <summary>
        /// Retrieve the pure text from html code. Only the content of the
        /// body tag is included.
        /// </summary>
        protected void btnRetrievePureText_Click(object sender, EventArgs e)
        {
            strWholeHtml = this.GetWholeHtmlCode(strUrl);
            if (flgPageRetrieved)
            {
                // [\s\S]*? matches any character, including line breaks, as few times as possible.
                string strRegexBody = @"<body[^>]*>[\s\S]*?</body[^>]*>";
                Match matchText = Regex.Match(strWholeHtml, strRegexBody, RegexOptions.IgnoreCase);
                // Remove every tag and keep only the text between tags.
                tbResult.Text = Regex.Replace(matchText.Value, RegexAnyTag, string.Empty);
            }
            else
            {
                tbResult.Text = MsgPageRetrieveFailed;
            }
        }

        /// <summary>
        /// Retrieve the script code from html code.
        /// </summary>
        protected void btnRetrieveSriptCode_Click(object sender, EventArgs e)
        {
            strWholeHtml = this.GetWholeHtmlCode(strUrl);
            if (flgPageRetrieved)
            {
                string strRegexScript = @"<script[^>]*>[\s\S]*?</script[^>]*>";
                MatchCollection matchList = Regex.Matches(strWholeHtml, strRegexScript, RegexOptions.IgnoreCase);
                StringBuilder strbScriptList = new StringBuilder();
                foreach (Match matchSingleScript in matchList)
                {
                    // Strip the <script> and </script> tags, keep the code between them.
                    string strSingleScriptText = Regex.Replace(matchSingleScript.Value, RegexAnyTag, string.Empty);
                    strbScriptList.Append(strSingleScriptText).Append("\r\n");
                }
                tbResult.Text = strbScriptList.ToString();
            }
            else
            {
                tbResult.Text = MsgPageRetrieveFailed;
            }
        }

        /// <summary>
        /// Retrieve the image information from html code.
        /// </summary>
        protected void btnRetrieveImage_Click(object sender, EventArgs e)
        {
            strWholeHtml = this.GetWholeHtmlCode(strUrl);
            if (flgPageRetrieved)
            {
                string strRegexImg = @"<img.*?>";
                MatchCollection matchList = Regex.Matches(strWholeHtml, strRegexImg,
                    RegexOptions.IgnoreCase | RegexOptions.Singleline);
                StringBuilder strbImageList = new StringBuilder();
                foreach (Match matchSingleImage in matchList)
                {
                    strbImageList.Append(matchSingleImage.Value).Append("\r\n");
                }
                tbResult.Text = strbImageList.ToString();
            }
            else
            {
                tbResult.Text = MsgPageRetrieveFailed;
            }
        }

        /// <summary>
        /// Retrieve the links from html code.
        /// </summary>
        protected void btnRetrievelink_Click(object sender, EventArgs e)
        {
            strWholeHtml = this.GetWholeHtmlCode(strUrl);
            if (flgPageRetrieved)
            {
                string strRegexLink = @"<a\s.*?>";
                MatchCollection matchList = Regex.Matches(strWholeHtml, strRegexLink,
                    RegexOptions.IgnoreCase | RegexOptions.Singleline);
                StringBuilder strbLinkList = new StringBuilder();
                foreach (Match matchSingleLink in matchList)
                {
                    strbLinkList.Append(matchSingleLink.Value).Append("\r\n");
                }
                tbResult.Text = strbLinkList.ToString();
            }
            else
            {
                tbResult.Text = MsgPageRetrieveFailed;
            }
        }
    }
}

Two key points in this sample:

First, it shows how to get the content of a web page with WebRequest.Create() and WebResponse.GetResponseStream(), and how to return the HTML as a string with StreamReader.ReadToEnd().
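As a compact illustration of this first point, the sketch below separates the download from the decoding step and uses using blocks so the response and reader are closed even when an exception is thrown. The names PageFetcher, Fetch, and ReadAll are hypothetical. Main exercises only ReadAll on an in-memory stream, so the demo runs without a network connection.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper, not part of the original sample.
public static class PageFetcher
{
    // Decodes the whole stream as text with the given encoding.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Downloads `url` and returns the body decoded as UTF-8.
    // GetResponse() throws a WebException on timeouts and on 4xx/5xx responses,
    // so callers should be prepared to catch it.
    public static string Fetch(string url, int timeoutMs)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = timeoutMs;
        using (var response = (HttpWebResponse)request.GetResponse())
        using (Stream body = response.GetResponseStream())
        {
            return ReadAll(body, Encoding.UTF8);
        }
    }

    public static void Main()
    {
        // Offline demo: decode bytes from memory instead of a live response.
        byte[] bytes = Encoding.UTF8.GetBytes("<html><body>Hello</body></html>");
        Console.WriteLine(ReadAll(new MemoryStream(bytes), Encoding.UTF8));
    }
}
```

On recent .NET versions WebRequest.Create() is marked obsolete in favor of HttpClient; the sketch keeps the older API only because it is the one this article uses.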

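Second, it uses the basic methods Regex.Match(), Regex.Matches(), and Regex.Replace() to extract specific content. How to write regular expressions is not covered in detail here; there is plenty of material on that online.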
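To make the second point concrete, here is a small console sketch that chains Regex.Match() and Regex.Replace() to reduce a page to its visible text. It is a variation on the sample rather than the sample's own method: unlike btnRetrievePureText_Click, it also drops the contents of script and style blocks and collapses whitespace. The names PureTextExtractor and GetBodyText are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original sample.
public static class PureTextExtractor
{
    public static string GetBodyText(string html)
    {
        // 1. Regex.Match: isolate what lies between <body ...> and </body>.
        Match body = Regex.Match(html, @"<body[^>]*>(?<content>[\s\S]*?)</body\s*>",
                                 RegexOptions.IgnoreCase);
        string content = body.Success ? body.Groups["content"].Value : html;

        // 2. Regex.Replace: remove script and style blocks together with their contents.
        content = Regex.Replace(content, @"<(script|style)\b[^>]*>[\s\S]*?</\1\s*>",
                                string.Empty, RegexOptions.IgnoreCase);

        // 3. Replace each remaining tag with a space so neighbouring words stay apart.
        content = Regex.Replace(content, @"<[^>]+>", " ");

        // 4. Collapse runs of whitespace into single spaces.
        return Regex.Replace(content, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html =
            "<html><head><title>t</title></head><body>" +
            "<script type=\"text/javascript\">var x = 1;</script>" +
            "<div>Hello everybody:<br /><a href=\"http://www.asp.net\">www.asp.net</a></div>" +
            "</body></html>";

        Console.WriteLine(GetBodyText(html)); // prints "Hello everybody: www.asp.net"
    }
}
```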

This is just a simple example of retrieving and parsing HTML. Corrections and additions are welcome.