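Building a Web Crawler with the ChilkatDotNet Component

Compiled from an English-language website.
The compiled Word document can be downloaded from:
http://files.cnblogs.com/mz121star/演示.rar
ChilkatDotNet is a very powerful .NET component!
Let's take a look at how it can be used to build a web crawler!
Because the article consists mostly of code demonstrations, the examples are kept in their original English; readers who need them can refer to them directly.
After installing ChilkatDotNet, you will find a DLL file in the installation directory. Add a reference to that DLL in your project and you can start building your crawler!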

Getting Started

This is a very simple "getting started" example for spidering a web site. As you'll see in future examples, the Chilkat Spider library can be used to crawl the Web. For now, we'll concentrate on spidering a single site.

Download Chilkat .NET for 2.0 Framework

Download Chilkat .NET for 1.0 / 1.1 Framework

//  The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

//  The spider object crawls a single web site at a time.  As you'll see
//  in later examples, you can collect outbound links and use them to
//  crawl the web.  For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com");

//  Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");

//  Begin crawling the site by calling CrawlNext repeatedly.
for (int i = 0; i < 10; i++) {
    bool success = spider.CrawlNext();
    if (success) {
        //  Show the URL of the page just spidered.
        textBox1.Text += spider.LastUrl + "\r\n";
        //  The HTML is available in the LastHtml property
    }
    else {
        //  Did we get an error or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0) {
            MessageBox.Show("No more URLs to spider");
            //  Nothing is left to crawl, so stop looping.
            break;
        }
        else {
            MessageBox.Show(spider.LastErrorText);
        }
    }

    //  Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000);
}

Extract HTML Title, Description, Keywords

This example expands on the "getting started" example by showing how to access the HTML title, description, and keywords of each page spidered. These come from the TITLE element and from the description and keywords META tags in the HTML header.


// The Chilkat Spider component/library is free.
Chilkat.Spider spider = new Chilkat.Spider();

// The spider object crawls a single web site at a time.  As you'll see
// in later examples, you can collect outbound links and use them to
// crawl the web.  For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com");

// Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/");

// Begin crawling the site by calling CrawlNext repeatedly.
for (int i = 0; i < 10; i++) {
    bool success = spider.CrawlNext();
    if (success) {
        // Show the URL of the page just spidered.
        textBox1.Text += spider.LastUrl + "\r\n";

        // The HTML META keywords, title, and description are available in these properties:
        textBox1.Text += spider.LastHtmlTitle + "\r\n";
        textBox1.Text += spider.LastHtmlDescription + "\r\n";
        textBox1.Text += spider.LastHtmlKeywords + "\r\n";
        textBox1.Refresh();

        // The HTML is available in the LastHtml property
    }
    else {
        // Did we get an error or are there no more URLs to crawl?
        if (spider.NumUnspidered == 0) {
            MessageBox.Show("No more URLs to spider");
            // Nothing is left to crawl, so stop looping.
            break;
        }
        else {
            MessageBox.Show(spider.LastErrorText);
        }
    }

    // Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000);
}

Fetch robots.txt for a Site

The Chilkat Spider library is robots.txt compliant. It automatically fetches a site's robots.txt file and adheres to it. It will not download pages denied by robots.txt. Pages excluded by robots.txt will not appear in the Spider's "unspidered" list. This example shows how to explicitly download and review the robots.txt for a given site.
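
The code for this example did not survive in this copy of the article. A minimal sketch of what it might look like follows, written as a console program rather than a WinForms handler. It assumes the ChilkatDotNet DLL is referenced and that FetchRobotsText returns null when the download fails; the class name is only illustrative.

```csharp
using System;

class FetchRobotsExample
{
    static void Main()
    {
        Chilkat.Spider spider = new Chilkat.Spider();
        spider.Initialize("www.chilkatsoft.com");

        // Explicitly download robots.txt for the initialized site.
        // Assumption: a null return value signals a failed fetch.
        string robotsText = spider.FetchRobotsText();
        if (robotsText == null)
        {
            Console.WriteLine(spider.LastErrorText);
            return;
        }

        Console.WriteLine(robotsText);
    }
}
```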

Avoid URLs Matching Any of a Set of Patterns

Demonstrates how to use "avoid patterns" to prevent spidering any URL that matches a wildcarded pattern. This example avoids URLs containing the substrings "java", "python", or "perl".
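
The original listing is missing here, so the following is only a sketch of how the three avoid patterns described above could be registered. It assumes a console program with the ChilkatDotNet DLL referenced; the class name and the 10-page limit are illustrative.

```csharp
using System;

class AvoidPatternsExample
{
    static void Main()
    {
        Chilkat.Spider spider = new Chilkat.Spider();
        spider.Initialize("www.chilkatsoft.com");

        // "*" is the wildcard. A URL matching any one of these
        // patterns should never be spidered.
        spider.AddAvoidPattern("*java*");
        spider.AddAvoidPattern("*python*");
        spider.AddAvoidPattern("*perl*");

        spider.AddUnspidered("http://www.chilkatsoft.com/");

        for (int i = 0; i < 10; i++)
        {
            if (!spider.CrawlNext())
            {
                break; // error, or no more URLs to crawl
            }
            Console.WriteLine(spider.LastUrl);
            spider.SleepMs(1000);
        }
    }
}
```

None of the URLs printed should contain "java", "python", or "perl".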

Setting a Maximum Response Size

The MaxResponseSize property protects your spider from downloading a page that is too large. By default, MaxResponseSize = 300,000 bytes. Setting it to 0 indicates that there is no maximum. You may set it to a number indicating the maximum number of bytes to download. URLs with response sizes larger than this will be skipped.
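
As a rough illustration (the original listing is not present in this copy), the property might be set like this before crawling. The 100,000-byte limit and the class name are arbitrary choices for the sketch, not values from the source.

```csharp
using System;

class MaxResponseSizeExample
{
    static void Main()
    {
        Chilkat.Spider spider = new Chilkat.Spider();
        spider.Initialize("www.chilkatsoft.com");
        spider.AddUnspidered("http://www.chilkatsoft.com/");

        // Skip any page whose response is larger than 100,000 bytes.
        // Setting the property to 0 would remove the limit.
        spider.MaxResponseSize = 100000;

        for (int i = 0; i < 10; i++)
        {
            if (!spider.CrawlNext())
            {
                break; // error, or no more URLs to crawl
            }
            Console.WriteLine(spider.LastUrl);
            spider.SleepMs(1000);
        }
    }
}
```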

Setting a Maximum URL Length

The MaxUrlLen property prevents the spider from retrieving URLs that grow too long. The default value of MaxUrlLen is 300.
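
A short sketch of setting this property, assuming the same console setup as the other examples. The limit of 200 characters is an arbitrary value chosen for illustration.

```csharp
using System;

class MaxUrlLenExample
{
    static void Main()
    {
        Chilkat.Spider spider = new Chilkat.Spider();
        spider.Initialize("www.chilkatsoft.com");
        spider.AddUnspidered("http://www.chilkatsoft.com/");

        // URLs longer than 200 characters will not be retrieved.
        spider.MaxUrlLen = 200;

        for (int i = 0; i < 10; i++)
        {
            if (!spider.CrawlNext())
            {
                break; // error, or no more URLs to crawl
            }
            Console.WriteLine(spider.LastUrl);
            spider.SleepMs(1000);
        }
    }
}
```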

Using the Disk Cache

The Chilkat Spider component has disk caching capabilities. To set up a disk cache, create a new directory anywhere on your local hard drive and set the CacheDir property to the path. For example, you might create "c:/spiderCache/". The UpdateCache property controls whether downloaded pages are saved to the cache. The FetchFromCache property controls whether the cache is first checked for pages. The LastFromCache property tells whether the last URL fetched came from cache or not.
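
The four properties described above could be combined roughly as follows. This is a sketch, not the original listing: it assumes the directory "c:/spiderCache/" already exists and that the ChilkatDotNet DLL is referenced.

```csharp
using System;

class DiskCacheExample
{
    static void Main()
    {
        Chilkat.Spider spider = new Chilkat.Spider();

        // Assumption: this directory was created beforehand.
        spider.CacheDir = "c:/spiderCache/";
        spider.FetchFromCache = true; // check the cache before downloading
        spider.UpdateCache = true;    // save downloaded pages to the cache

        spider.Initialize("www.chilkatsoft.com");
        spider.AddUnspidered("http://www.chilkatsoft.com/");

        for (int i = 0; i < 10; i++)
        {
            if (!spider.CrawlNext())
            {
                break; // error, or no more URLs to crawl
            }
            // LastFromCache tells whether this page came from the disk cache.
            Console.WriteLine(spider.LastUrl + " (from cache: " + spider.LastFromCache + ")");

            // Only pause when the page was actually downloaded.
            if (!spider.LastFromCache)
            {
                spider.SleepMs(1000);
            }
        }
    }
}
```

Running the program twice is a simple way to see the effect: on the second run, pages saved during the first run should be reported as coming from the cache.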

Crawling the Web

If the Chilkat Spider component only crawls a single site, how do you crawl the Web? The answer is simple: as you crawl a site, the spider collects outbound links and makes them accessible to you. You may then instantiate an instance of the Spider object for each site, and crawl it. The task of keeping track of what sites you've already crawled is left to you (for now). This example retrieves the home page of http://www.joelonsoftware.com/ and displays the outbound links.
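
The listing for this example is missing from this copy. A minimal sketch of the idea, assuming a console program with the ChilkatDotNet DLL referenced, fetches the home page and prints whatever outbound links the spider collected.

```csharp
using System;

class OutboundLinksExample
{
    static void Main()
    {
        Chilkat.Spider spider = new Chilkat.Spider();
        spider.Initialize("www.joelonsoftware.com");
        spider.AddUnspidered("http://www.joelonsoftware.com/");

        // Crawl only the home page.
        if (!spider.CrawlNext())
        {
            Console.WriteLine(spider.LastErrorText);
            return;
        }

        // Outbound links are links that point to other sites.
        for (int i = 0; i < spider.NumOutboundLinks; i++)
        {
            Console.WriteLine(spider.GetOutboundLink(i));
        }
    }
}
```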

Get Referenced Domains

Demonstrates how to accumulate a list of unique domain names referenced from outbound URLs.
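
The original code is not available here, so this sketch swaps in standard .NET types: System.Uri extracts the host name from each outbound link, and a HashSet (which needs .NET 3.5 or later) keeps the list unique. The original example may well have used Chilkat's own helpers instead.

```csharp
using System;
using System.Collections.Generic;

class ReferencedDomainsExample
{
    static void Main()
    {
        Chilkat.Spider spider = new Chilkat.Spider();
        spider.Initialize("www.joelonsoftware.com");
        spider.AddUnspidered("http://www.joelonsoftware.com/");

        if (!spider.CrawlNext())
        {
            Console.WriteLine(spider.LastErrorText);
            return;
        }

        // A set stores each domain name only once, ignoring case.
        HashSet<string> domains = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        for (int i = 0; i < spider.NumOutboundLinks; i++)
        {
            Uri uri;
            // Skip links that are not well-formed absolute URLs.
            if (Uri.TryCreate(spider.GetOutboundLink(i), UriKind.Absolute, out uri))
            {
                domains.Add(uri.Host);
            }
        }

        foreach (string domain in domains)
        {
            Console.WriteLine(domain);
        }
    }
}
```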

Get Base Domains

Demonstrates how to accumulate a list of unique base domain names referenced from outbound URLs.
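
A possible sketch of this example, under the same assumptions as before: System.Uri and HashSet are standard .NET stand-ins for whatever the original listing used, while GetBaseDomain is the spider's own utility method for reducing a host name to its base domain.

```csharp
using System;
using System.Collections.Generic;

class BaseDomainsExample
{
    static void Main()
    {
        Chilkat.Spider spider = new Chilkat.Spider();
        spider.Initialize("www.joelonsoftware.com");
        spider.AddUnspidered("http://www.joelonsoftware.com/");

        if (!spider.CrawlNext())
        {
            Console.WriteLine(spider.LastErrorText);
            return;
        }

        HashSet<string> baseDomains = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        for (int i = 0; i < spider.NumOutboundLinks; i++)
        {
            Uri uri;
            // Skip links that are not well-formed absolute URLs.
            if (!Uri.TryCreate(spider.GetOutboundLink(i), UriKind.Absolute, out uri))
            {
                continue;
            }

            // Reduce the host name to its base domain before storing it.
            string baseDomain = spider.GetBaseDomain(uri.Host);
            if (!string.IsNullOrEmpty(baseDomain))
            {
                baseDomains.Add(baseDomain);
            }
        }

        foreach (string baseDomain in baseDomains)
        {
            Console.WriteLine(baseDomain);
        }
    }
}
```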

GetBaseDomain

The GetBaseDomain method is a utility function that converts a domain into a "domain base", which is useful for grouping URLs. For example: abc.chilkatsoft.com, xyz.chilkatsoft.com, and blog.chilkatsoft.com all have the same base domain: chilkatsoft.com. Things get more complicated when considering country domains (.au, .uk, .se, .cn, etc.) and government, state, and .us domains. Also, domains such as blogspot, tripod, geocities, wordpress, etc, are treated specially so that "xyz.blogspot.com" has a base domain of "xyz.blogspot.com". Note: If you find other domains that should be treated similarly to blogspot.com, send a request to support@chilkatsoft.com.
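
To try the method on the domains mentioned above, something like the following sketch should do. It assumes GetBaseDomain can be called without first initializing the spider for a site.

```csharp
using System;

class GetBaseDomainExample
{
    static void Main()
    {
        Chilkat.Spider spider = new Chilkat.Spider();

        string[] domains =
        {
            "abc.chilkatsoft.com",
            "xyz.chilkatsoft.com",
            "blog.chilkatsoft.com",
            "xyz.blogspot.com"
        };

        // Print each domain next to the base domain computed for it.
        foreach (string domain in domains)
        {
            Console.WriteLine(domain + " -> " + spider.GetBaseDomain(domain));
        }
    }
}
```

According to the description above, the first three entries should share the base domain chilkatsoft.com, while xyz.blogspot.com should be returned unchanged.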

CanonicalizeUrl

The CanonicalizeUrl method is a utility function that canonicalizes a URL into a standard form to avoid duplicates. For example, "http://www.chilkatsoft.com/" and "http://www.chilkatsoft.com/default.asp" are the same URL.
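
A small sketch for experimenting with the method. The first two URLs are the ones from the description; the third, with a fragment, is a hypothetical addition to show another form worth testing. No output is claimed here, since the exact canonical form depends on the library.

```csharp
using System;

class CanonicalizeUrlExample
{
    static void Main()
    {
        Chilkat.Spider spider = new Chilkat.Spider();

        string[] urls =
        {
            "http://www.chilkatsoft.com/",
            "http://www.chilkatsoft.com/default.asp",
            // Hypothetical URL with a fragment, added for illustration.
            "http://www.chilkatsoft.com/default.asp#top"
        };

        // Print each URL next to its canonical form.
        foreach (string url in urls)
        {
            Console.WriteLine(url + " -> " + spider.CanonicalizeUrl(url));
        }
    }
}
```

Comparing canonical forms instead of raw URLs is one way to keep a crawler from fetching the same page twice.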

Avoiding Outbound Links Matching Patterns

The spider accumulates outbound links when crawling. Your program may specify any number of "avoid patterns" to prevent any link matching at least one of the wildcarded patterns from being added.
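
A sketch of how such patterns might be registered with AddAvoidOutboundLinkPattern. The two patterns are hypothetical and chosen only for illustration; the site is borrowed from the earlier outbound-links example.

```csharp
using System;

class AvoidOutboundLinksExample
{
    static void Main()
    {
        Chilkat.Spider spider = new Chilkat.Spider();

        // Hypothetical patterns: ignore outbound links that carry a
        // query string or that point at blogspot.com.
        spider.AddAvoidOutboundLinkPattern("*?*");
        spider.AddAvoidOutboundLinkPattern("*blogspot.com*");

        spider.Initialize("www.joelonsoftware.com");
        spider.AddUnspidered("http://www.joelonsoftware.com/");

        if (!spider.CrawlNext())
        {
            Console.WriteLine(spider.LastErrorText);
            return;
        }

        // Links matching either pattern should be absent from this list.
        for (int i = 0; i < spider.NumOutboundLinks; i++)
        {
            Console.WriteLine(spider.GetOutboundLink(i));
        }
    }
}
```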

Must-Match Patterns

You may restrict the spider to only follow links that match any one of a set of "must-match" wildcard patterns. The AddMustMatchPattern method can be called repeatedly to add must-match patterns.
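
A sketch of the idea, with two hypothetical patterns. The text does not say whether must-match patterns also apply to the starting URL, so the sketch adds the starting URL first as a precaution; that ordering is an assumption, not documented behavior.

```csharp
using System;

class MustMatchExample
{
    static void Main()
    {
        Chilkat.Spider spider = new Chilkat.Spider();
        spider.Initialize("www.chilkatsoft.com");

        // Added before the patterns (see the assumption noted above).
        spider.AddUnspidered("http://www.chilkatsoft.com/");

        // Hypothetical patterns: only follow links that end in ".asp"
        // or that contain "/refdoc/". Matching one pattern is enough.
        spider.AddMustMatchPattern("*.asp");
        spider.AddMustMatchPattern("*/refdoc/*");

        for (int i = 0; i < 10; i++)
        {
            if (!spider.CrawlNext())
            {
                break; // error, or no more URLs to crawl
            }
            Console.WriteLine(spider.LastUrl);
            spider.SleepMs(1000);
        }
    }
}
```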

A Simple Web Crawler

This demonstrates a very simple web crawler using the Chilkat Spider component.
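
The original crawler listing is not included in this copy. The sketch below is one way to assemble the pieces shown earlier, not the original code: a queue of sites still to visit, a set of sites already seen, and a new Spider object per site. The limits (5 sites, 3 pages each) and the use of System.Uri and HashSet (.NET 3.5 or later) are assumptions made for the sketch.

```csharp
using System;
using System.Collections.Generic;

class SimpleCrawler
{
    static void Main()
    {
        const int maxSites = 5;     // illustrative limits
        const int pagesPerSite = 3;

        Queue<string> pending = new Queue<string>();
        HashSet<string> seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        // Seed the crawl with a single site.
        pending.Enqueue("www.chilkatsoft.com");
        seen.Add("www.chilkatsoft.com");

        int sitesCrawled = 0;
        while (pending.Count > 0 && sitesCrawled < maxSites)
        {
            string domain = pending.Dequeue();

            // One Spider object per site, as described earlier.
            Chilkat.Spider spider = new Chilkat.Spider();
            spider.Initialize(domain);
            spider.AddUnspidered("http://" + domain + "/");

            for (int i = 0; i < pagesPerSite; i++)
            {
                if (!spider.CrawlNext())
                {
                    break; // error, or no more URLs on this site
                }
                Console.WriteLine(spider.LastUrl);
                spider.SleepMs(1000);
            }

            // Queue every site referenced by an outbound link, once.
            for (int i = 0; i < spider.NumOutboundLinks; i++)
            {
                Uri uri;
                if (!Uri.TryCreate(spider.GetOutboundLink(i), UriKind.Absolute, out uri))
                {
                    continue; // skip malformed links
                }
                if (seen.Add(uri.Host))
                {
                    pending.Enqueue(uri.Host);
                }
            }

            sitesCrawled++;
        }
    }
}
```

Keeping the "seen" set is what the earlier section means by tracking which sites have already been crawled; without it the queue could revisit the same sites indefinitely.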
