Website scraping (extracting the required HTML data with regular expressions)

weixin_30249203

Original link: http://www.cnblogs.com/mapleclever/archive/2012/01/31/2333283.html

The HTML source that a website serves for a page can be fetched through the page's URL, as follows:

Required namespaces:

using System;
using System.Collections.Generic;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;
using System.IO;
using System.Net;

/// <summary>
/// Fetches the HTML source of a page.
/// </summary>
/// <param name="url">Page address, e.g. "http://www.xxx.com/"</param>
/// <param name="charset">Page encoding name, e.g. "utf-8"</param>
/// <returns>The page source, or an empty string if the request fails</returns>
public static string GetHtmlSource(string url, string charset)
{
    // Resolve the encoding; fall back to the system default when none is given
    Encoding nowCharset = string.IsNullOrEmpty(charset)
        ? Encoding.Default
        : Encoding.GetEncoding(charset);

    // Fetch and decode the page
    string html = "";
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        // Dispose the response, stream and reader even if reading fails
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, nowCharset))
        {
            html = reader.ReadToEnd();
        }
    }
    catch (Exception)
    {
        // Any failure (bad URL, network error, HTTP error) yields an empty string
    }
    return html;
}
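The string overload above passes the charset name straight to Encoding.GetEncoding, which throws an ArgumentException for names the runtime does not recognize. The following is a minimal sketch of a more forgiving lookup. The class EncodingHelper, the method ResolveEncoding, and the choice of UTF-8 as the fallback are assumptions made for illustration, not part of the original article (which falls back to Encoding.Default for an empty name).

```csharp
using System;
using System.Text;

public static class EncodingHelper
{
    // Hypothetical helper: maps a charset name to an Encoding and falls back
    // to UTF-8 (an assumed default) when the name is empty or unknown.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            return Encoding.GetEncoding(charset.Trim());
        }
        catch (ArgumentException)
        {
            // Thrown for charset names the runtime does not recognize
            return Encoding.UTF8;
        }
    }
}
```

The result can be handed to the Encoding overload, for example GetHtmlSource(url, EncodingHelper.ResolveEncoding("utf-8")).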

/// <summary>
/// Fetches the HTML source of a page.
/// </summary>
/// <param name="url">Page address, e.g. "http://www.xxx.com/"</param>
/// <param name="charset">Page encoding, e.g. Encoding.UTF8</param>
/// <returns>The page source, or an empty string if the request fails</returns>
public static string GetHtmlSource(string url, Encoding charset)
{
    // Fetch and decode the page
    string html = "";
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        // Dispose the response, stream and reader even if reading fails
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, charset))
        {
            html = reader.ReadToEnd();
        }
    }
    catch (Exception)
    {
        // Any failure (bad URL, network error, HTTP error) yields an empty string
    }
    return html;
}
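Because these methods swallow every exception, a caller cannot tell an empty page from a failed request. One possible variation, sketched here under that observation, reports success and an error message separately. The names SafeFetcher and TryGetHtmlSource and the 10-second timeout are hypothetical. The sketch keeps the article's HttpWebRequest API, which newer .NET versions mark as obsolete in favor of HttpClient.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class SafeFetcher
{
    // Hypothetical variant of GetHtmlSource: returns false and an error
    // description instead of silently returning an empty string.
    public static bool TryGetHtmlSource(string url, Encoding charset,
                                        out string html, out string error)
    {
        html = "";
        error = null;
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.Timeout = 10000; // milliseconds; an arbitrary illustrative value
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (Stream stream = response.GetResponseStream())
            using (StreamReader reader = new StreamReader(stream, charset))
            {
                html = reader.ReadToEnd();
            }
            return true;
        }
        catch (Exception ex)
        {
            // Covers malformed URLs, non-HTTP schemes, network and HTTP errors
            error = ex.GetType().Name + ": " + ex.Message;
            return false;
        }
    }
}
```

A caller can then log the error or retry, rather than passing an empty string on to the regular-expression step.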

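/// <summary>
/// Fetches the HTML source of a page.
/// Works well for pages that start with a BOM: StreamReader detects the
/// encoding from the BOM (UTF-8, UTF-16, UTF-32). Pages without a BOM
/// are decoded with Encoding.Default.
/// </summary>
/// <param name="url">Page address, e.g. "http://www.xxx.com/"</param>
/// <returns>The page source, or an empty string if the request fails</returns>
public static string GetHtmlSource(string url)
{
    // Fetch and decode the page
    string html = "";
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        // Dispose the response, stream and reader even if reading fails
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, Encoding.Default))
        {
            html = reader.ReadToEnd();
        }
    }
    catch (Exception)
    {
        // Any failure (bad URL, network error, HTTP error) yields an empty string
    }
    return html;
}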
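The BOM behavior of the parameterless overload can be checked without any network access. The sketch below feeds an in-memory byte array to the same StreamReader(Stream, Encoding) constructor. The class BomDemo and its methods are hypothetical names used only for this illustration.

```csharp
using System.IO;
using System.Linq;
using System.Text;

public static class BomDemo
{
    // Decodes a response body the way GetHtmlSource(url) does:
    // StreamReader(stream, fallback) looks for a BOM first and only
    // uses the fallback encoding when no BOM is present.
    public static string Decode(byte[] body, Encoding fallback)
    {
        using (MemoryStream stream = new MemoryStream(body))
        using (StreamReader reader = new StreamReader(stream, fallback))
        {
            return reader.ReadToEnd();
        }
    }

    // Builds a sample body: the UTF-16LE BOM followed by UTF-16LE text
    public static byte[] Utf16BodyWithBom(string text)
    {
        return Encoding.Unicode.GetPreamble()
            .Concat(Encoding.Unicode.GetBytes(text))
            .ToArray();
    }
}
```

Decoding BomDemo.Utf16BodyWithBom("你好") with a UTF-8 fallback still returns the original text, because the BOM overrides the fallback. The same bytes without the BOM come out garbled, which is why pages lacking a BOM need an explicit charset.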

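Call whichever overload suits the situation (the calls below assume the methods live in a class named Collection), for example: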

string _html = Collection.GetHtmlSource("http://www.luohx.com/a.html", "utf-8");

The URL can also carry query-string parameters, for example:

string _html = Collection.GetHtmlSource("http://www.luohx.com/a.aspx?a=1&b=2", "utf-8");
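When parameter values contain spaces, ampersands or non-ASCII characters, they need to be percent-encoded before being appended to the URL. A minimal sketch, assuming a hypothetical helper named UrlBuilder.BuildUrl that is not part of the original article:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class UrlBuilder
{
    // Hypothetical helper: appends percent-encoded query parameters to a URL
    public static string BuildUrl(string baseUrl,
                                  IEnumerable<KeyValuePair<string, string>> parameters)
    {
        StringBuilder sb = new StringBuilder(baseUrl);
        // Use '&' if the base URL already has a query string, '?' otherwise
        char separator = baseUrl.Contains("?") ? '&' : '?';
        foreach (KeyValuePair<string, string> p in parameters)
        {
            sb.Append(separator)
              .Append(Uri.EscapeDataString(p.Key))
              .Append('=')
              .Append(Uri.EscapeDataString(p.Value));
            separator = '&';
        }
        return sb.ToString();
    }
}
```

For the parameters a=1 and b="x y&z", this produces http://www.luohx.com/a.aspx?a=1&b=x%20y%26z, which can then be passed to GetHtmlSource.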

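Once the page source has been fetched, you will usually find that only part of it is needed, for example the HTML between the tags <div id="xml" class="wrap"> and </div>. The source therefore has to be cut down to that fragment, as follows: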

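#region Get the album page code

public string strHtml(string url, string charset)
{
    string _html = Collection.GetHtmlSource(url, charset); // fetch the page HTML by URL
    string sss = "";

    // Regular expression with a balancing group: every nested <div> pushes
    // onto the "div" stack and every </div> pops it, so the final </div>
    // is the one that closes <div id="xml" class="wrap">
    string pattern = @"(?six)<div\s+id=""xml""\s+class=""wrap"">
        (?'MyCont'
            (?>
                (?!<div\b|</div>).
                |
                <div(?:\s+(?:""[^""]*""|'[^']*'|[^""'>])*)?>(?'div')
                |
                </div>(?'-div')
            )*
            (?(div)(?!))
        )
        </div>";

    foreach (Match m in Regex.Matches(_html, pattern))
    {
        sss = m.Groups["MyCont"].Value; // if several blocks match, the last one wins
    }
    return sss;
}

#endregion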
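To see how the balancing group copes with nested div elements, the pattern can be tried on a literal string instead of a downloaded page. This is only a sketch: the class DivExtractor, the method ExtractInner, and the sample markup are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class DivExtractor
{
    // Same technique as strHtml: nested <div> tags push onto the "div"
    // stack, </div> pops it, and (?(div)(?!)) rejects unbalanced content.
    private const string Pattern = @"(?six)<div\s+id=""xml""\s+class=""wrap"">
        (?'MyCont'
            (?>
                (?!<div\b|</div>).
                |
                <div(?:\s+(?:""[^""]*""|'[^']*'|[^""'>])*)?>(?'div')
                |
                </div>(?'-div')
            )*
            (?(div)(?!))
        )
        </div>";

    // Returns the inner HTML of the first matching block, or "" if none
    public static string ExtractInner(string html)
    {
        Match m = Regex.Match(html, Pattern);
        return m.Success ? m.Groups["MyCont"].Value : "";
    }
}
```

Given the input <body><div id="xml" class="wrap"><p>a</p><div class="inner">b</div>c</div><div>tail</div></body>, ExtractInner returns <p>a</p><div class="inner">b</div>c. The inner div is kept whole and the match stops at the closing tag that belongs to the wrapper, not at the first </div> it meets.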

The pattern here targets the tag <div id="xml" class="wrap"></div>. The reference tag must be unique in the page: if two or more <div id="xml" class="wrap"></div> blocks exist, this tag cannot serve as the reference for locating the content.
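The uniqueness requirement can be checked before extraction by counting how often the opening tag occurs. A small sketch, with AnchorCheck, CountAnchors and IsUniqueAnchor as hypothetical names:

```csharp
using System.Text.RegularExpressions;

public static class AnchorCheck
{
    // Counts occurrences of the opening tag used as the reference point
    public static int CountAnchors(string html)
    {
        return Regex.Matches(
            html,
            @"<div\s+id=""xml""\s+class=""wrap"">",
            RegexOptions.IgnoreCase).Count;
    }

    // True only when the reference tag appears exactly once
    public static bool IsUniqueAnchor(string html)
    {
        return CountAnchors(html) == 1;
    }
}
```

If IsUniqueAnchor returns false, a different and more specific reference tag should be chosen for that page.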

The extracted fragment is still HTML. If you need the content without any HTML elements, strip the tags, for example:

public static string checkStr(string html)
{
    // Non-greedy quantifiers keep each pattern from swallowing the text between separate tags
    Regex regex1 = new Regex(@"<script[\s\S]*?</script\s*>", RegexOptions.IgnoreCase);     // script blocks
    Regex regex2 = new Regex(@" href\s*=\s*[^>]*?script\s*:", RegexOptions.IgnoreCase);    // javascript:/vbscript: links
    Regex regex3 = new Regex(@" on\w+\s*=", RegexOptions.IgnoreCase);                      // inline event handlers (onclick=, ...)
    Regex regex4 = new Regex(@"<iframe[\s\S]*?</iframe\s*>", RegexOptions.IgnoreCase);     // iframes
    Regex regex5 = new Regex(@"<frameset[\s\S]*?</frameset\s*>", RegexOptions.IgnoreCase); // framesets
    Regex regex6 = new Regex(@"<img[^>]+>", RegexOptions.IgnoreCase);                      // images
    Regex regex7 = new Regex(@"</p>", RegexOptions.IgnoreCase);
    Regex regex8 = new Regex(@"<p>", RegexOptions.IgnoreCase);
    Regex regex9 = new Regex(@"<[^>]*>", RegexOptions.IgnoreCase);                         // any remaining tag

    html = regex1.Replace(html, "");
    html = regex2.Replace(html, "");
    html = regex3.Replace(html, " _disibledevent=");
    html = regex4.Replace(html, "");
    html = regex5.Replace(html, "");
    html = regex6.Replace(html, "");
    html = regex7.Replace(html, "");
    html = regex8.Replace(html, "");
    html = regex9.Replace(html, "");
    html = html.Replace("&nbsp;", " "); // non-breaking space entity to a plain space
    html = html.Replace("</strong>", "");
    html = html.Replace("<strong>", "");
    return html;
}

Calling it is simple: string strhtml = checkStr(html); Once you have the data you need, you can store it in a database, display it, and so on.
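checkStr only converts the &nbsp; entity, so other entities such as &amp; or &lt; stay in the text. As a complementary sketch, the remaining entities can be decoded with WebUtility.HtmlDecode from System.Net. The class TextCleaner and the method ToPlainText are hypothetical names, and this is a simplified illustration, not a replacement for checkStr.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Hypothetical helper: strips tags, then decodes HTML entities
    public static string ToPlainText(string html)
    {
        // Remove every tag, as regex9 in checkStr does
        string text = Regex.Replace(html, @"<[^>]*>", "");
        // Decode entities such as &amp; and &nbsp;
        text = WebUtility.HtmlDecode(text);
        // &nbsp; decodes to U+00A0; turn it into an ordinary space
        text = text.Replace('\u00A0', ' ');
        return text.Trim();
    }
}
```

For the input <p>Tom &amp; Jerry&nbsp;<strong>2012</strong></p>, ToPlainText returns Tom & Jerry 2012.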

Reposted from: https://www.cnblogs.com/mapleclever/archive/2012/01/31/2333283.html
