A Web Crawler Written in C#

A web crawler written in C#. Author: 西洋樱草. Source: CSDN

 

Introduction
      A web crawler (also called an ant or a spider) is a program that automatically fetches pages from the World Wide Web. Crawlers are generally used to fetch large numbers of pages for later processing by a search engine; the fetched pages are indexed by dedicated programs (such as Lucene or DotLucene) to speed up searching. A crawler can also serve as a link checker or an HTML validator. A newer use is checking e-mail addresses in order to prevent trackback spam.
 
Crawler Overview
In this article I will introduce a simple crawler written in C#. Given a target URL, the program crawls pages starting from that address. Usage is very simple: just enter the address of the site you want to crawl and press "GO".

 
 
      The crawler keeps a queue of URLs waiting to be fetched, the same design used by large search engines. Fetching is multi-threaded: each thread takes a URL from the queue, downloads the page, and stores it in the configured storage area (Storage, as shown in the figure). Web requests are made with the C# Socket library. The links in the page currently being fetched are parsed out and added to the URL queue (the settings include an option for the crawl depth).
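To make that flow concrete, here is a minimal, self-contained sketch of the queue-plus-worker-threads idea. It is not the author's code: it uses WebClient and a Regex instead of the article's raw-socket classes and link parser, and every name in it (CrawlItem, CrawlerSketch, Worker, MaxDepth, and so on) is purely illustrative.

using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;
using System.Threading;

// Each queued entry carries a URL and its crawl depth (parent depth + 1).
class CrawlItem { public string Url; public int Depth; }

class CrawlerSketch
{
    static readonly Queue<CrawlItem> queue = new Queue<CrawlItem>();
    static readonly object queueLock = new object();
    const int MaxDepth = 2;        // the "Navigate through pages to a depth of" setting
    const int ThreadCount = 3;     // the "Thread count" setting

    static void Main()
    {
        queue.Enqueue(new CrawlItem { Url = "http://www.cnn.com/", Depth = 0 });
        for (int i = 0; i < ThreadCount; i++)
            new Thread(Worker) { IsBackground = true, Name = i.ToString() }.Start();
        Thread.Sleep(30000);       // let the demo run for a while, then exit
    }

    static void Worker()
    {
        WebClient client = new WebClient();
        while (true)
        {
            CrawlItem item = null;
            lock (queueLock) { if (queue.Count > 0) item = queue.Dequeue(); }
            if (item == null) { Thread.Sleep(500); continue; }   // "sleep time when refs queue empty"

            try
            {
                string page = client.DownloadString(item.Url);   // fetch (storage step omitted)
                if (item.Depth < MaxDepth)
                    foreach (Match m in Regex.Matches(page, "href\\s*=\\s*[\"']([^\"'>]+)", RegexOptions.IgnoreCase))
                        lock (queueLock)                          // relative links are not resolved here
                            queue.Enqueue(new CrawlItem { Url = m.Groups[1].Value, Depth = item.Depth + 1 });
            }
            catch (WebException) { /* would be recorded in the errors view */ }

            Thread.Sleep(200);     // "sleep time between two connections"
        }
    }
}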
 
Status Views
The program provides three status views:
A list of the crawling threads
Details of each crawling thread
A view of the error messages
Threads View
The thread list view shows all working threads. Each thread takes a URI from the URI queue and connects to it.

Requests View
The requests view shows a list of the most recently downloaded pages, along with the details from the HTTP headers.

 
Each request header shows information similar to the following:
GET / HTTP/1.0
 Host: www.cnn.com
 Connection: Keep-Alive
 
The response header shows information like this:
HTTP/1.0 200 OK
 Date: Sun, 19 Mar 2006 19:39:05 GMT
 Content-Length: 65730
 Content-Type: text/html
 Expires: Sun, 19 Mar 2006 19:40:05 GMT
 Cache-Control: max-age=60, private
 Connection: keep-alive
 Proxy-Connection: keep-alive
 Server: Apache
 Last-Modified: Sun, 19 Mar 2006 19:38:58 GMT
 Vary: Accept-Encoding,User-Agent
 Via: 1.1 webcache (NetCache NetApp/6.0.1P3)
 
The view also lists the references found in each downloaded page:
Parsing page
 Found: 356 ref(s)
 http://www.cnn.com/
 http://www.cnn.com/search/
 http://www.cnn.com/linkto/intl.html
 
Settings
The program provides a number of settings, including:
MIME types
Destination (storage) folder
Maximum number of crawling threads
and so on...
MIME types

      MIME types are the file types the crawler is allowed to download, and the crawler ships with a default set. The user can add, edit, and delete MIME types, or choose to allow all MIME types, as in the following figure.
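A rough sketch of how a response's Content-Type might be checked against the allowed MIME types; the field names and sample entries below are assumptions, not the program's actual settings code (the fragment assumes using System;).

// Illustrative fragment: decide whether a response should be kept, based on its Content-Type.
static readonly string[] allowedMimeTypes = { "text/html", "text/plain" };   // sample entries
static bool allowAllMimeTypes = false;                                        // the "allow all MIME types" option

static bool IsAllowedContentType(string contentType)
{
    if (allowAllMimeTypes) return true;
    if (string.IsNullOrEmpty(contentType)) return false;
    string mime = contentType.Split(';')[0].Trim().ToLowerInvariant();        // "text/html; charset=utf-8" -> "text/html"
    return Array.IndexOf(allowedMimeTypes, mime) >= 0;
}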

 

Output
Output settings include the download folder and the number of requests to keep in the requests view for reviewing request details.


Connections
Connections settings contain the following (a rough sketch of a matching settings class follows the list):
Thread count: number of concurrent working threads in the crawler.
Thread sleep time when refs queue empty: the time each thread sleeps when the refs queue is empty.
Thread sleep time between two connections: the time each thread sleeps after handling a request; this value is important to keep hosts from blocking the crawler because of heavy load.
Connection timeout: the send and receive timeout for all crawler sockets.
Navigate through pages to a depth of: the depth of navigation in the crawling process.
Keep same URL server: limits the crawling process to the same host as the original URL.
Keep connection alive: keeps the socket connection open for subsequent requests to avoid reconnect time.
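As a sketch, these settings might map onto a plain settings class like the one below; apart from the 10-thread default mentioned later in the article, every name and default value here is an assumption.

// Illustrative container for the Connections settings listed above.
class ConnectionSettings
{
    public int ThreadCount = 10;                 // default mentioned later in the article
    public int SleepWhenQueueEmptyMs = 1000;     // assumed default
    public int SleepBetweenConnectionsMs = 200;  // assumed default; protects hosts from heavy load
    public int ConnectionTimeoutMs = 10000;      // send/receive timeout for the sockets (assumed)
    public int NavigationDepth = 2;              // assumed default
    public bool KeepSameUrlServer = false;       // limit crawling to the original host
    public bool KeepConnectionAlive = true;      // reuse sockets for subsequent requests
}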

Advanced
Advanced settings contain the following (a sketch of how the restriction lists might be applied follows the list):
Code page to encode downloaded text pages
A user-defined list of restricted words, to let the user filter out unwanted pages
A user-defined list of restricted host extensions, to avoid being blocked by these hosts
A user-defined list of restricted file extensions, to avoid parsing non-text data
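A sketch of how these restriction lists might be applied; the method names and sample entries are assumptions, not the program's actual code (the fragment assumes using System;).

// Illustrative fragment: filter a candidate URL and a downloaded page against the
// user-defined restriction lists from the Advanced settings.
static readonly string[] restrictedWords = { "example-bad-word" };              // sample entry
static readonly string[] restrictedFileExtensions = { ".exe", ".zip", ".jpg" }; // sample entries
static readonly string[] restrictedHostExtensions = { ".example" };             // sample entry

static bool IsUrlAllowed(Uri uri)
{
    foreach (string ext in restrictedFileExtensions)
        if (uri.AbsolutePath.EndsWith(ext, StringComparison.OrdinalIgnoreCase))
            return false;                       // avoid parsing non-text data
    foreach (string host in restrictedHostExtensions)
        if (uri.Host.EndsWith(host, StringComparison.OrdinalIgnoreCase))
            return false;                       // avoid hosts that might block the crawler
    return true;
}

static bool IsPageAllowed(string pageText)
{
    foreach (string word in restrictedWords)
        if (pageText.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0)
            return false;                       // drop pages containing restricted words
    return true;
}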
 

Points of Interest

Keep Alive Connection:
Keep-Alive is a request from the client to the server to keep the connection open after the response finishes, so it can be reused for subsequent requests. That is done by adding an HTTP header to the request sent to the server, as in the following request:

GET /CNN/Programs/nancy.grace/ HTTP/1.0
 Host: www.cnn.com
 Connection: Keep-Alive

The "Connection: Keep-Alive" tells the server to not close the connection, but the server has the option to keep it opened or close it, but it should reply to the client socket by its decision.

So the server can tell the client that it will keep the connection open by including "Connection: Keep-Alive" in its reply, as follows:

HTTP/1.0 200 OK
 Date: Sun, 19 Mar 2006 19:38:15 GMT
 Content-Length: 29025
 Content-Type: text/html
 Expires: Sun, 19 Mar 2006 19:39:15 GMT
 Cache-Control: max-age=60, private
 Connection: keep-alive
 Proxy-Connection: keep-alive
 Server: Apache
 Vary: Accept-Encoding,User-Agent
 Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
 Via: 1.1 webcache (NetCache NetApp/6.0.1P3)

 

Or it can tell the client that it refuses as follows:

 

HTTP/1.0 200 OK
 Date: Sun, 19 Mar 2006 19:38:15 GMT
 Content-Length: 29025
 Content-Type: text/html
 Expires: Sun, 19 Mar 2006 19:39:15 GMT
 Cache-Control: max-age=60, private
 Connection: Close
 Server: Apache
 Vary: Accept-Encoding,User-Agent
 Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
 Via: 1.1 webcache (NetCache NetApp/6.0.1P3)
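Putting the request and reply examples together, a minimal raw-socket client could send the Keep-Alive request, read the reply headers, and then inspect the server's Connection header to decide whether the socket can be reused. The following is only a sketch of that idea, not the article's MyWebRequest/MyWebResponse implementation.

// Sketch: send an HTTP/1.0 request with Connection: Keep-Alive over a raw socket
// and check whether the server agreed to keep the connection open.
// Assumes: using System.Net.Sockets; using System.Text;
static bool RequestWithKeepAlive(string host, string path)
{
    TcpClient client = new TcpClient(host, 80);
    NetworkStream stream = client.GetStream();

    string request = "GET " + path + " HTTP/1.0\r\n" +
                     "Host: " + host + "\r\n" +
                     "Connection: Keep-Alive\r\n\r\n";
    byte[] requestBytes = Encoding.ASCII.GetBytes(request);
    stream.Write(requestBytes, 0, requestBytes.Length);

    // Read one byte at a time until the end of the headers (the same trick the article uses later).
    StringBuilder header = new StringBuilder();
    int b;
    while ((b = stream.ReadByte()) != -1)
    {
        header.Append((char)b);
        if (header.Length >= 4 && header.ToString().EndsWith("\r\n\r\n"))
            break;
    }

    // The connection can be reused only if the server answered "Connection: keep-alive".
    bool keepAlive = header.ToString().ToLowerInvariant().Contains("connection: keep-alive");
    if (!keepAlive)
        client.Close();      // otherwise the socket could be kept for the next request
    return keepAlive;
}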

 

WebRequest and WebResponse problems:

 

When I started the code for this article I was using the WebRequest and WebResponse classes, as in the following code:

WebRequest request = WebRequest.Create(uri);
WebResponse response = request.GetResponse();
Stream streamIn = response.GetResponseStream();
BinaryReader reader = new BinaryReader(streamIn, TextEncoding);
byte[] RecvBuffer = new byte[10240];
int nBytes, nTotalBytes = 0;
while((nBytes = reader.Read(RecvBuffer, 0, 10240)) > 0)
{
    nTotalBytes += nBytes;
}
reader.Close();
streamIn.Close();
response.Close();

 

This code works well, but it has a serious problem: the WebRequest.GetResponse call blocks all other WebRequest requests until the retrieved response is closed, as in the last line of the previous code. So I noticed that only one thread was ever downloading while the others waited inside GetResponse. To solve this problem I implemented my own two classes, MyWebRequest and MyWebResponse.
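As a side note (my assumption, not something the original author states): in the .NET Framework this behavior is typically caused by the per-host connection limit enforced by ServicePoint, which defaults to two concurrent connections per host, so additional GetResponse calls queue up until earlier responses are closed. Raising that limit is a possible alternative to a custom socket layer:

// Possible alternative (not what the article does): allow more concurrent
// connections per host so that GetResponse calls do not queue up.
System.Net.ServicePointManager.DefaultConnectionLimit = 20;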

MyWebRequest and MyWebResponse use the Socket class to manage connections. They are similar to WebRequest and WebResponse, but they support concurrent responses at the same time. In addition, MyWebRequest has a built-in KeepAlive flag to support Keep-Alive connections.

So my new code looks like this:

request = MyWebRequest.Create(uri, request /* to Keep-Alive */, KeepAlive);
MyWebResponse response = request.GetResponse();
byte[] RecvBuffer = new byte[10240];
int nBytes, nTotalBytes = 0;
while((nBytes = response.socket.Receive(RecvBuffer, 0, 10240, SocketFlags.None)) > 0)
{
    nTotalBytes += nBytes;

    if(response.KeepAlive && nTotalBytes >= response.ContentLength && response.ContentLength > 0)
        break;
}
if(response.KeepAlive == false)
    response.Close();

This just replaces GetResponseStream with direct access to the socket member of the MyWebResponse class. To make the next socket read start right after the reply header, I used a simple trick: read one byte at a time until the header is complete, as in the following code:

/* reading response header */
Header = "";
byte[] bytes = new byte[10];
while(socket.Receive(bytes, 0, 1, SocketFlags.None) > 0)
{
    Header += Encoding.ASCII.GetString(bytes, 0, 1);
    if(bytes[0] == '\n' && Header.EndsWith("\r\n\r\n"))
        break;
}

So the user of the MyWebResponse class simply continues receiving from the first position of the page body.
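The article does not show how MyWebResponse derives its ContentLength and KeepAlive values from that header string; a hedged sketch of such parsing might look like this (the local names below are illustrative only, and the fragment assumes using System;):

// Sketch: pull the content length and keep-alive decision out of the raw Header text read above.
int contentLength = -1;
bool keepAlive = false;
foreach (string line in Header.Split(new string[] { "\r\n" }, StringSplitOptions.None))
{
    if (line.StartsWith("Content-Length:", StringComparison.OrdinalIgnoreCase))
    {
        int parsed;
        if (int.TryParse(line.Substring("Content-Length:".Length).Trim(), out parsed))
            contentLength = parsed;
    }
    else if (line.StartsWith("Connection:", StringComparison.OrdinalIgnoreCase))
        keepAlive = line.ToLowerInvariant().Contains("keep-alive");
}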

Thread Management:

The number of threads in the crawler is user-defined through the settings. Its default value is 10 threads, but it can be changed from the Connections settings tab.

The crawler code handles this change through the ThreadCount property, as in the following code:

private int ThreadCount
{
    get { return nThreadCount; }
    set
    {
        Monitor.Enter(this.listViewThreads);
        for(int nIndex = 0; nIndex < value; nIndex++)
        {
            if(threadsRun[nIndex] == null || threadsRun[nIndex].ThreadState != ThreadState.Suspended)
            {
                threadsRun[nIndex] = new Thread(new ThreadStart(ThreadRunFunction));
                threadsRun[nIndex].Name = nIndex.ToString();
                threadsRun[nIndex].Start();
                if(nIndex == this.listViewThreads.Items.Count)
                {
                    ListViewItem item = this.listViewThreads.Items.Add((nIndex+1).ToString(), 0);
                    string[] subItems = { "", "", "", "0", "0%" };
                    item.SubItems.AddRange(subItems);
                }
            }
            else if(threadsRun[nIndex].ThreadState == ThreadState.Suspended)
            {
                ListViewItem item = this.listViewThreads.Items[nIndex];
                item.ImageIndex = 1;
                item.SubItems[2].Text = "Resume";
                threadsRun[nIndex].Resume();
            }
        }
        nThreadCount = value;
        Monitor.Exit(this.listViewThreads);
    }
}

If ThreadCount is increased by the user, this code creates new threads or resumes suspended ones. Otherwise, suspending the extra working threads is left to the threads themselves, as follows.

Each working thread has a name equal to its index in the thread array. If the thread's name value is greater than ThreadCount, it finishes its current job and then goes into suspend mode.
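That self-suspension check is not shown in the article; inside each worker thread's loop it might look roughly like this sketch:

// Sketch: park this thread if its index (stored in its Name) is no longer within the
// current ThreadCount. Indexes are zero-based, so indexes at or beyond nThreadCount are the extras.
int nIndex = int.Parse(Thread.CurrentThread.Name);
if (nIndex >= nThreadCount)
    Thread.CurrentThread.Suspend();   // pairs with the Resume() call in the property above
                                      // (Suspend/Resume are obsolete in later versions of .NET)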

Crawling Depth:

This is the depth to which the crawler navigates. Each URL gets a depth equal to its parent's depth plus one, with a depth of 0 for the first URL entered by the user. URLs fetched from a page are inserted at the end of the URL queue, i.e. first-in first-out, and every thread can insert into the queue at any time, as in the following code:

void EnqueueUri(MyUri uri)
{
    Monitor.Enter(queueURLS);
    try
    {
        queueURLS.Enqueue(uri);
    }
    catch(Exception)
    {
    }
    Monitor.Exit(queueURLS);
}

 

And each thread can retrieve the first URL in the queue to request it, as in the following code:

MyUri DequeueUri()
{
    Monitor.Enter(queueURLS);
    MyUri uri = null;
    try
    {
        uri = (MyUri)queueURLS.Dequeue();
    }
    catch(Exception)
    {
    }
    Monitor.Exit(queueURLS);
    return uri;
}
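The depth rule described above would then be applied when enqueueing the links found in a page, roughly like the following sketch (it assumes MyUri carries a writable Depth value, which the article implies but does not show):

// Sketch: enqueue a discovered link only while the configured depth has not been reached.
void EnqueueFoundLink(MyUri parent, MyUri link, int navigationDepth)
{
    if (parent.Depth >= navigationDepth)   // "Navigate through pages to a depth of"
        return;
    link.Depth = parent.Depth + 1;         // child depth = parent depth + 1
    EnqueueUri(link);                      // the thread-safe enqueue shown above
}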

This article originally appeared on 飞扬教程. Original link: http://www.51fy.cn/program/CJJ/200705/32994.htm
