How to Use the Chilkat Spider Component

Latest recommended article published on 2024-09-12 14:03:22

iteye_15968

Tags: .net, Web, Cache, Python, Perl

This is a very simple "getting started" example for spidering a web site. As you'll see in future examples, the Chilkat Spider library can be used to crawl the Web. For now, we'll concentrate on spidering a single site.

Download Chilkat .NET for 2.0 Framework

Download Chilkat .NET for 1.0 / 1.1 Framework

    // The Chilkat Spider component/library is free.
    Chilkat.Spider spider = new Chilkat.Spider();

    // The spider object crawls a single web site at a time. As you'll see
    // in later examples, you can collect outbound links and use them to
    // crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
    spider.Initialize("www.chilkatsoft.com");

    // Add the 1st URL:
    spider.AddUnspidered("http://www.chilkatsoft.com/");

    // Begin crawling the site by calling CrawlNext repeatedly.
    for (int i = 0; i <= 9; i++) {
        bool success = spider.CrawlNext();
        if (success) {
            // Show the URL of the page just spidered.
            textBox1.Text += spider.LastUrl + "\r\n";
            // The HTML is available in the LastHtml property
        }
        else {
            // Did we get an error or are there no more URLs to crawl?
            if (spider.NumUnspidered == 0) {
                MessageBox.Show("No more URLs to spider");
                // Nothing is left to crawl, so stop instead of repeating the message.
                break;
            }
            else {
                MessageBox.Show(spider.LastErrorText);
            }
        }

        // Sleep 1 second before spidering the next URL.
        spider.SleepMs(1000);
    }

Extract HTML Title, Description, Keywords

This example expands on the "getting started" example by showing how to access the HTML title, description, and keywords of each page spidered. These come from the TITLE tag and from the META tags for description and keywords in the HTML head.

    // The Chilkat Spider component/library is free.
    Chilkat.Spider spider = new Chilkat.Spider();

    // The spider object crawls a single web site at a time. As you'll see
    // in later examples, you can collect outbound links and use them to
    // crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
    spider.Initialize("www.chilkatsoft.com");

    // Add the 1st URL:
    spider.AddUnspidered("http://www.chilkatsoft.com/");

    // Begin crawling the site by calling CrawlNext repeatedly.
    for (int i = 0; i <= 9; i++) {
        bool success = spider.CrawlNext();
        if (success) {
            // Show the URL of the page just spidered.
            textBox1.Text += spider.LastUrl + "\r\n";
            textBox1.Refresh();

            // The HTML title and the META description and keywords
            // are available in these properties:
            textBox1.Text += spider.LastHtmlTitle + "\r\n";
            textBox1.Refresh();
            textBox1.Text += spider.LastHtmlDescription + "\r\n";
            textBox1.Refresh();
            textBox1.Text += spider.LastHtmlKeywords + "\r\n";
            textBox1.Refresh();

            // The HTML is available in the LastHtml property
        }
        else {
            // Did we get an error or are there no more URLs to crawl?
            if (spider.NumUnspidered == 0) {
                MessageBox.Show("No more URLs to spider");
                // Nothing is left to crawl, so stop instead of repeating the message.
                break;
            }
            else {
                MessageBox.Show(spider.LastErrorText);
            }
        }

        // Sleep 1 second before spidering the next URL.
        spider.SleepMs(1000);
    }

Fetch robots.txt for a Site

The Chilkat Spider library is robots.txt compliant. It automatically fetches a site's robots.txt file and adheres to it. It will not download pages denied by robots.txt. Pages excluded by robots.txt will not appear in the Spider's "unspidered" list. This example shows how to explicitly download and review the robots.txt for a given site.

    // The Chilkat Spider component/library is free.
    Chilkat.Spider spider = new Chilkat.Spider();

    spider.Initialize("www.chilkatsoft.com");

    string robotsText = spider.FetchRobotsText();

    textBox1.Text += robotsText + "\r\n";
    textBox1.Refresh();
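When reviewing the fetched text, it can help to pull out the rules that apply to every crawler. The helper below is a minimal sketch in plain C# and does not use any Chilkat call; the class and method names (`RobotsTextReader`, `DisallowedForAll`) are hypothetical. It is deliberately simplified: it only collects `Disallow` paths from `User-agent: *` groups and ignores `Allow` rules and wildcards, so it is not a description of how the Spider itself interprets robots.txt.

```csharp
using System;
using System.Collections.Generic;

public static class RobotsTextReader
{
    // Returns the Disallow paths listed under "User-agent: *" groups.
    // Simplified sketch: Allow rules, wildcards, and other agents are ignored.
    public static List<string> DisallowedForAll(string robotsText)
    {
        var paths = new List<string>();
        bool inStarGroup = false;
        bool lastWasAgent = false;

        foreach (string rawLine in robotsText.Split('\n'))
        {
            // Strip trailing comments and surrounding whitespace (including '\r').
            string line = rawLine;
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash);
            line = line.Trim();

            int colon = line.IndexOf(':');
            if (colon < 0) continue;

            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
            {
                // Consecutive User-agent lines belong to the same group.
                if (!lastWasAgent) inStarGroup = false;
                if (value == "*") inStarGroup = true;
                lastWasAgent = true;
            }
            else
            {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field == "disallow" && inStarGroup && value.Length > 0)
                    paths.Add(value);
            }
        }
        return paths;
    }
}
```

In the example above, the string returned by `FetchRobotsText` could be passed to `RobotsTextReader.DisallowedForAll(robotsText)` and the resulting paths appended to `textBox1`.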

Avoid URLs Matching Any of a Set of Patterns

Demonstrates how to use "avoid patterns" to prevent spidering any URL that matches a wildcarded pattern. This example avoids URLs containing the substrings "java", "python", or "perl".

    // The Chilkat Spider component/library is free.
    Chilkat.Spider spider = new Chilkat.Spider();

    // The spider object crawls a single web site at a time. As you'll see
    // in later examples, you can collect outbound links and use them to
    // crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
    spider.Initialize("www.chilkatsoft.com");

    // Add the 1st URL:
    spider.AddUnspidered("http://www.chilkatsoft.com/");

    // Avoid URLs matching these patterns:
    spider.AddAvoidPattern("*java*");
    spider.AddAvoidPattern("*python*");
    spider.AddAvoidPattern("*perl*");

    // Begin crawling the site by calling CrawlNext repeatedly.
    for (int i = 0; i <= 9; i++) {
        bool success = spider.CrawlNext();
        if (success) {
            // Show the URL of the page just spidered.
            textBox1.Text += spider.LastUrl + "\r\n";
            // The HTML is available in the LastHtml property
        }
        else {
            // Did we get an error or are there no more URLs to crawl?
            if (spider.NumUnspidered == 0) {
                MessageBox.Show("No more URLs to spider");
                // Nothing is left to crawl, so stop instead of repeating the message.
                break;
            }
            else {
                MessageBox.Show(spider.LastErrorText);
            }
        }

        // Sleep 1 second before spidering the next URL.
        spider.SleepMs(1000);
    }
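To get a feel for which URLs a wildcarded pattern such as "*java*" would exclude, the matching can be approximated in plain C#. This is only a sketch of the general technique (translating `*` into a regular expression); the `WildcardDemo` class is hypothetical, and the case-insensitive comparison is an assumption, since the exact matching rules of `AddAvoidPattern` are not stated here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WildcardDemo
{
    // Returns true when url matches pattern, where '*' stands for any run
    // of characters. Case-insensitive matching is an assumption of this sketch.
    public static bool Matches(string pattern, string url)
    {
        // Escape regex metacharacters, then turn the escaped '*' back into ".*".
        string regex = "^" + Regex.Escape(pattern).Replace("\\*", ".*") + "$";
        return Regex.IsMatch(url, regex, RegexOptions.IgnoreCase);
    }
}
```

For instance, `WildcardDemo.Matches("*java*", url)` could be run over a list of candidate URLs to preview what the three avoid patterns above would filter out.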

Setting a Maximum Response Size

The MaxResponseSize property protects your spider from downloading a page that is too large. By default, MaxResponseSize = 300,000 bytes. Setting it to 0 indicates that there is no maximum. You may set it to a number indicating the maximum number of bytes to download. URLs with response sizes larger than this will be skipped.

    // The Chilkat Spider component/library is free.
    Chilkat.Spider spider = new Chilkat.Spider();

    spider.Initialize("www.chilkatsoft.com");

    // Add the 1st URL:
    spider.AddUnspidered("http://www.chilkatsoft.com/");

    // This example demonstrates setting the MaxResponseSize property
    // Do not download anything with a response size greater than 100,000 bytes.
    spider.MaxResponseSize = 100000;

Setting a Maximum URL Length

The MaxUrlLen property prevents the spider from retrieving URLs that grow too long. The default value of MaxUrlLen is 300.

    // The Chilkat Spider component/library is free.
    Chilkat.Spider spider = new Chilkat.Spider();

    spider.Initialize("www.chilkatsoft.com");

    // Add the 1st URL:
    spider.AddUnspidered("http://www.chilkatsoft.com/");

    // This example demonstrates setting the MaxUrlLen property
    // Do not add URLs longer than 250 characters to the "unspidered" queue:
    spider.MaxUrlLen = 250;

    // ...

Using the Disk Cache

The Chilkat Spider component has disk caching capabilities. To set up a disk cache, create a new directory anywhere on your local hard drive and set the CacheDir property to its path. For example, you might create "c:/spiderCache/". The UpdateCache property controls whether downloaded pages are saved to the cache. The FetchFromCache property controls whether the cache is checked for a page before it is downloaded. The LastFromCache property tells whether the last URL fetched came from the cache.

    // The Chilkat Spider component/library is free.
    Chilkat.Spider spider = new Chilkat.Spider();

    // Set our cache directory and make sure saving-to-cache and fetching-from-cache
    // are both turned on:
    spider.CacheDir = "c:/spiderCache/";
    spider.FetchFromCache = true;
    spider.UpdateCache = true;

    // If you run this code twice, you'll find that the 2nd run is extremely fast
    // because the pages will be retrieved from cache.

    // The spider object crawls a single web site at a time. As you'll see
    // in later examples, you can collect outbound links and use them to
    // crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
    spider.Initialize("www.chilkatsoft.com");

    // Add the 1st URL:
    spider.AddUnspidered("http://www.chilkatsoft.com/");

    // Begin crawling the site by calling CrawlNext repeatedly.
    for (int i = 0; i <= 9; i++) {
        bool success = spider.CrawlNext();
        if (success) {
            // Show the URL of the page just spidered.
            textBox1.Text += spider.LastUrl + "\r\n";
            // The HTML is available in the LastHtml property
        }
        else {
            // Did we get an error or are there no more URLs to crawl?
            if (spider.NumUnspidered == 0) {
                MessageBox.Show("No more URLs to spider");
                // Nothing is left to crawl, so stop instead of repeating the message.
                break;
            }
            else {
                MessageBox.Show(spider.LastErrorText);
            }
        }

        // Sleep 1 second before spidering the next URL.
        // The reason for waiting a short time before the next fetch is to prevent
        // undue stress on the web server. However, if the last page was retrieved
        // from cache, there is no need to pause.
        if (!spider.LastFromCache) {
            spider.SleepMs(1000);
        }
    }

Crawling the Web

If the Chilkat Spider component only crawls a single site, how do you crawl the Web? The answer is simple: as you crawl a site, the spider collects outbound links and makes them accessible to you. You may then create a Spider object for each linked site and crawl it. The task of keeping track of which sites you've already crawled is left to you (for now). This example retrieves the home page of http://www.joelonsoftware.com/ and displays the outbound links.

    // The Chilkat Spider component/library is free.
    Chilkat.Spider spider = new Chilkat.Spider();

    // The Initialize method may be called with just the domain name,
    // such as "www.joelonsoftware.com" or a full URL. If you pass only
    // the domain name, you must add URLs to the unspidered list by calling
    // AddUnspidered. Otherwise, the URL you pass to Initialize is the 1st
    // URL in the unspidered list.
    spider.Initialize("www.joelonsoftware.com");
    spider.AddUnspidered("http://www.joelonsoftware.com/");

    bool success = spider.CrawlNext();
    if (!success) {
        // The page could not be fetched, so there are no outbound links to show.
        MessageBox.Show(spider.LastErrorText);
    }

    // List the outbound links collected from the page just crawled.
    for (int i = 0; i <= spider.NumOutboundLinks - 1; i++) {
        textBox1.Text += spider.GetOutboundLink(i) + "\r\n";
        textBox1.Refresh();
    }
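Since keeping track of already-crawled sites is left to the caller, one common approach is a queue of pending sites plus a set of sites already seen. The sketch below shows that bookkeeping in plain C#. The `SiteFrontier` class and the `getOutboundDomains` callback are hypothetical names introduced for illustration; the callback stands in for "crawl this site and return the domains of its outbound links", which is not implemented here.

```csharp
using System;
using System.Collections.Generic;

public static class SiteFrontier
{
    // Visits sites breadth-first starting from startDomain, at most maxSites,
    // and returns them in the order visited. getOutboundDomains is a
    // hypothetical callback that crawls one site and returns the domains
    // its outbound links point to.
    public static List<string> CrawlSites(
        string startDomain,
        Func<string, IEnumerable<string>> getOutboundDomains,
        int maxSites)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var pending = new Queue<string>();
        var order = new List<string>();

        seen.Add(startDomain);
        pending.Enqueue(startDomain);

        while (pending.Count > 0 && order.Count < maxSites)
        {
            string domain = pending.Dequeue();
            order.Add(domain);

            foreach (string next in getOutboundDomains(domain))
            {
                // HashSet.Add returns false for a domain that was already seen,
                // so each site is queued at most once.
                if (seen.Add(next)) pending.Enqueue(next);
            }
        }
        return order;
    }
}
```

With the Spider, the callback could create a new `Chilkat.Spider`, call `Initialize` and `AddUnspidered` for the domain, crawl a few pages with `CrawlNext`, and return the domains extracted from `GetOutboundLink(i)`.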

Get Referenced Domains

Demonstrates how to accumulate a list of unique domain names referenced from outbound URLs.
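One way to accumulate such a list, sketched here in plain C# without relying on any Chilkat call, is to parse each outbound URL with `System.Uri` and add its host name to a set. The `DomainCollector` class is a hypothetical helper, and "domain" here means the full host name (for example "www.example.com"), not the registrable base domain.

```csharp
using System;
using System.Collections.Generic;

public static class DomainCollector
{
    // Returns the distinct host names found in the given URLs, sorted.
    // Strings that are not absolute http/https URLs are skipped.
    public static SortedSet<string> UniqueDomains(IEnumerable<string> urls)
    {
        var domains = new SortedSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (string url in urls)
        {
            Uri uri;
            if (Uri.TryCreate(url, UriKind.Absolute, out uri)
                && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
            {
                // Adding a host that is already present leaves the set unchanged.
                domains.Add(uri.Host);
            }
        }
        return domains;
    }
}
```

The URLs returned by `spider.GetOutboundLink(i)` for `i` from 0 to `spider.NumOutboundLinks - 1` could be collected into a list and passed to `DomainCollector.UniqueDomains`.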
