[Tutorial] Fetching a Web Page and Extracting the Required Information: C# Version

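After learning the general workflow for crawling web pages, together with the tools introduced earlier, it should be fairly clear how to use those tools to fetch a page, analyze its source, and get the content you need.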

Below, a practical example shows how to fetch a web page and extract the required content in C#:

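Suppose the requirement is to take my (crifan's) Songtaste page:

http://www.songtaste.com/user/351979/

fetch its HTML source, and then extract my Songtaste user name from it: crifan

The corresponding HTML is:

    <h1 class="h1user">crifan</h1>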

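This task is fairly simple. Here is how to implement it in C#.

Create a new C# project targeting .NET Framework 2.0, and add a few basic controls for displaying the results.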

First, write the code that fetches the HTML:

    using System.Net;
    using System.IO;
    using System.Text;

    //step1: get html from url
    //http://www.songtaste.com/user/351979/
    string urlToCrawl = txbUrlToCrawl.Text;
    //generate http request
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlToCrawl);
    //use GET method to get url's html
    req.Method = "GET";
    //songtaste's html uses charset GB2312; decode with GBK, a superset of GB2312,
    //otherwise the text comes back garbled
    string htmlCharset = "GBK";
    Encoding htmlEncoding = Encoding.GetEncoding(htmlCharset);
    //use request to get response; the using blocks close the response and the reader
    using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
    using (StreamReader sr = new StreamReader(resp.GetResponseStream(), htmlEncoding))
    {
        //read out the returned html
        string respHtml = sr.ReadToEnd();
        rtbExtractedHtml.Text = respHtml;
    }

In the UI, clicking the "Fetch page HTML source" button then retrieves the corresponding HTML.

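Note:

Whether you need to care about the HTML's encoding (charset) depends on your needs.

GBK is used here because songtaste's HTML uses the GB2312 charset, which GBK covers; decoding with the wrong encoding returns garbled text.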
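If you would rather not hard-code the charset, one option is to resolve it from a charset name at run time and fall back to a default when the name is missing or unsupported. The following is a minimal sketch, not part of the original project: `CharsetHelper` and `ResolveEncoding` are hypothetical names, and in a real request the charset name would typically come from the `HttpWebResponse.CharacterSet` property.

```csharp
using System;
using System.Text;

public static class CharsetHelper
{
    // Hypothetical helper: maps a charset name (for example the value of
    // HttpWebResponse.CharacterSet) to an Encoding, and returns the fallback
    // when the name is missing or not supported on this runtime.
    public static Encoding ResolveEncoding(string charsetName, Encoding fallback)
    {
        if (string.IsNullOrEmpty(charsetName))
        {
            return fallback;
        }
        try
        {
            return Encoding.GetEncoding(charsetName);
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for unknown names.
            return fallback;
        }
    }

    public static void Main()
    {
        Encoding known = ResolveEncoding("utf-8", Encoding.ASCII);
        Encoding unknown = ResolveEncoding("no-such-charset", Encoding.UTF8);
        Console.WriteLine(known.WebName);
        Console.WriteLine(unknown.WebName);
    }
}
```

The fallback matters because a server may omit the charset or report a wrong one, which is why the tutorial code simply fixes the encoding to GBK for this particular site.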

Once you have the HTML, use Regex, the regular-expression class in C#, to extract the data we want:

    using System.Text.RegularExpressions;

    //step2: extract expected info
    //<h1 class="h1user">crifan</h1>
    string h1userP = @"<h1\s+class=""h1user"">(?<h1user>.+?)</h1>";
    Match foundH1user = (new Regex(h1userP)).Match(rtbExtractedHtml.Text);
    if (foundH1user.Success)
    {
        //extracted the expected h1user's value
        txbExtractedInfo.Text = foundH1user.Groups["h1user"].Value;
    }
    else
    {
        txbExtractedInfo.Text = "Not found h1 user !";
    }

Clicking "Extract the required info" then extracts the h1user value we want, crifan.
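To try the extraction step without the form or a network connection, the named-group match can be run against a literal string. This is a minimal sketch under the assumption that the target markup is `<h1 class="h1user">crifan</h1>`; the `H1UserExtractor` class, its `Extract` method, and the sample HTML are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class H1UserExtractor
{
    // The named group "h1user" captures the text inside the h1 tag.
    // \s+ tolerates any run of whitespace between "h1" and "class".
    private static readonly Regex H1UserRegex =
        new Regex(@"<h1\s+class=""h1user"">(?<h1user>.+?)</h1>");

    // Returns the captured user name, or null when the pattern does not match.
    public static string Extract(string html)
    {
        Match found = H1UserRegex.Match(html);
        return found.Success ? found.Groups["h1user"].Value : null;
    }

    public static void Main()
    {
        // Sample HTML invented for this sketch.
        string sampleHtml = "<div><h1 class=\"h1user\">crifan</h1></div>";
        Console.WriteLine(Extract(sampleHtml));            // prints "crifan"
        Console.WriteLine(Extract("<p>none</p>") == null); // prints "True"
    }
}
```

The lazy quantifier `.+?` stops at the first `</h1>`, so a later heading on the same line is not swallowed into the capture.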

The complete C# code is:

    using System;
    using System.Collections.Generic;
    using System.ComponentModel;
    using System.Data;
    using System.Drawing;
    using System.Text;
    using System.Windows.Forms;
    using System.Net;
    using System.IO;
    using System.Text.RegularExpressions;

    namespace crawlWebsiteAndExtractInfo
    {
        public partial class frmCrawlWebsite : Form
        {
            public frmCrawlWebsite()
            {
                InitializeComponent();
            }

            private void btnCrawlAndExtract_Click(object sender, EventArgs e)
            {
                //step1: get html from url
                //http://www.songtaste.com/user/351979/
                string urlToCrawl = txbUrlToCrawl.Text;
                //generate http request
                HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlToCrawl);
                //use GET method to get url's html
                req.Method = "GET";
                //songtaste's html uses charset GB2312; decode with GBK, a superset of GB2312,
                //otherwise the text comes back garbled
                string htmlCharset = "GBK";
                Encoding htmlEncoding = Encoding.GetEncoding(htmlCharset);
                //use request to get response; the using blocks close the response and the reader
                using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
                using (StreamReader sr = new StreamReader(resp.GetResponseStream(), htmlEncoding))
                {
                    //read out the returned html
                    string respHtml = sr.ReadToEnd();
                    rtbExtractedHtml.Text = respHtml;
                }
            }

            private void btnExtractInfo_Click(object sender, EventArgs e)
            {
                //step2: extract expected info
                //<h1 class="h1user">crifan</h1>
                string h1userP = @"<h1\s+class=""h1user"">(?<h1user>.+?)</h1>";
                Match foundH1user = (new Regex(h1userP)).Match(rtbExtractedHtml.Text);
                if (foundH1user.Success)
                {
                    //extracted the expected h1user's value
                    txbExtractedInfo.Text = foundH1user.Groups["h1user"].Value;
                }
                else
                {
                    txbExtractedInfo.Text = "Not found h1 user !";
                }
            }

            private void lklTutorialUrl_LinkClicked(object sender, LinkLabelLinkClickedEventArgs e)
            {
                //open this tutorial's page in the default browser
                string tutorialUrl = "https://www.crifan.com/crawl_website_html_and_extract_info_using_csharp";
                System.Diagnostics.Process.Start(tutorialUrl);
            }
        }
    }

The complete VS2010 project can be downloaded here:

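[Summary]

Overall, crawling a site in C# and extracting the required content from the returned HTML source is somewhat more involved than the earlier Python version.

The reason is that you have to handle many HTTP-related details by hand: the request, the response, the stream, the encoding type, and so on.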
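The stream and encoding handling mentioned above can be isolated in a small helper, which also makes it easy to see what a wrong encoding does to the text. This is only a sketch: `HtmlDecoder` and `ReadAll` are hypothetical names, and it reads from an in-memory stream of UTF-8 bytes rather than a real HTTP response so that it runs without a network connection.

```csharp
using System;
using System.IO;
using System.Text;

public static class HtmlDecoder
{
    // Reads the whole stream as text using the given encoding.
    // The using statement disposes the reader and the underlying stream.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // "\u4f60\u597d" is the two-character Chinese greeting "ni hao".
        string original = "<h1>\u4f60\u597d</h1>";
        byte[] bytes = Encoding.UTF8.GetBytes(original);

        // Decoding with the encoding the bytes were written in restores the text.
        string right = ReadAll(new MemoryStream(bytes), Encoding.UTF8);
        // Decoding the same bytes as ISO-8859-1 yields garbled characters.
        string wrong = ReadAll(new MemoryStream(bytes), Encoding.GetEncoding("iso-8859-1"));

        Console.WriteLine(right == original); // prints "True"
        Console.WriteLine(wrong == original); // prints "False"
    }
}
```

With a real response, the stream argument would be the result of `resp.GetResponseStream()` and the encoding would be the one matching the page's charset.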
