Original English article:
http://www.example-code.com/python/spider_begin.asp
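1. Download: on the page above, use the "Download Chilkat Python Library" link. (Chilkat is the name of a branch of the Tlingit, a Native people of Alaska in North America. The Americans are rather fun with their naming; I wonder when we Chinese will publish a library called something like "Naxi" or "Mongol". Wouldn't that feel cool? Anyway, enough of that, I am drifting off topic.)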
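2. Installation
After unpacking the download you will find a QuickStart page. It says to install Python 2.5, so check your Python version; it must be 2.5.
One more reminder (I am a Python beginner myself, so I feel I should point this out to fellow beginners): the QuickStart page says
"you only need to add a __path__ = ["dir_with_chilkat_pyd"]". Write "dir_with_chilkat_pyd" the same way you would write a path in Java: use "/" rather than "\" as the directory separator. Then add this "__path__" line to __init__.py, and you can use Chilkat.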
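To make the "__path__" step concrete, here is a minimal sketch of the mechanism. It does not use Chilkat at all: it builds a throwaway package named extra in a temporary directory, and a plain module called fake_chilkat (a hypothetical stand-in for the real chilkat.pyd) in a separate directory. Both the directory layout and the stand-in module are assumptions made for illustration. It is written for Python 3 so it can be run as is; the same assignment is what the QuickStart page describes for Python 2.5.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical layout, created in a temp directory for this demo:
#   <root>/extra/__init__.py      sets __path__
#   <root>/libs/fake_chilkat.py   stands in for chilkat.pyd
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "extra")
lib_dir = os.path.join(root, "libs")
os.makedirs(pkg_dir)
os.makedirs(lib_dir)

# Use "/" as the directory separator; this also works on Windows.
lib_dir_fwd = lib_dir.replace("\\", "/")

# The package's __init__.py redirects submodule lookups to lib_dir.
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("__path__ = [%r]\n" % lib_dir_fwd)

# The stand-in module that lives outside the package directory.
with open(os.path.join(lib_dir, "fake_chilkat.py"), "w") as f:
    f.write("NAME = 'stand-in for chilkat'\n")

# Make the package importable, then import the module through it.
sys.path.insert(0, root)
importlib.invalidate_caches()
from extra import fake_chilkat

print(fake_chilkat.NAME)
```

Because the package's __path__ lists the directory that holds the module, "from extra import ..." finds it even though the file is not inside the extra folder itself.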
3. Usage
Source code:
from extra import chilkat

# The Chilkat Spider component/library is free.
spider = chilkat.CkSpider()

# The spider object crawls a single web site at a time. As you'll see
# in later examples, you can collect outbound links and use them to
# crawl the web. For now, we'll simply spider 10 pages of chilkatsoft.com
spider.Initialize("www.chilkatsoft.com")

# Add the 1st URL:
spider.AddUnspidered("http://www.chilkatsoft.com/")

# Begin crawling the site by calling CrawlNext repeatedly.
for i in range(0,10):
    success = spider.CrawlNext()
    if (success == True):
        # Show the URL of the page just spidered.
        print spider.lastUrl()
        # The HTML is available in the LastHtml property
    else:
        # Did we get an error or are there no more URLs to crawl?
        if (spider.get_NumUnspidered() == 0):
            print "No more URLs to spider"
        else:
            print spider.lastErrorText()

    # Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000)
Note: here I placed Chilkat inside a package named extra and import it from there.
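4. Code notes
- spider.AddUnspidered("http://www.chilkatsoft.com/") can be thought of as defining the seed URL.
- The code is simple: a nested pair of if/else blocks that check whether a page was crawled and otherwise print either "No more URLs to spider" or the error text, followed by a call that makes the spider pause between requests. I have not read the library's source yet; my guess is that it is multithreaded internally, but I have not verified this, and the loop shown here calls CrawlNext one page at a time.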
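The seed-URL idea and the CrawlNext loop can be illustrated without Chilkat. The sketch below is only a toy model, not how the library is implemented: ToySpider, FAKE_SITE and the example.test URLs are all made up for illustration, and the "site" is an in-memory dictionary rather than real HTTP. It shows a queue of unspidered URLs that starts with one seed, grows as links are discovered, and makes crawl_next return False once the queue is empty.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> links found on that page.
FAKE_SITE = {
    "http://example.test/": ["http://example.test/a", "http://example.test/b"],
    "http://example.test/a": ["http://example.test/b"],
    "http://example.test/b": [],
}


class ToySpider:
    """Toy stand-in for the crawl loop; not Chilkat's implementation."""

    def __init__(self):
        self.unspidered = deque()  # URLs waiting to be crawled
        self.seen = set()          # URLs already queued or crawled
        self.last_url = None

    def add_unspidered(self, url):
        # Queue a URL once; the first call plays the role of the seed.
        if url not in self.seen:
            self.seen.add(url)
            self.unspidered.append(url)

    def crawl_next(self):
        # Return False when nothing is left, like the "no more URLs" branch.
        if not self.unspidered:
            return False
        self.last_url = self.unspidered.popleft()
        for link in FAKE_SITE.get(self.last_url, []):
            self.add_unspidered(link)
        return True


spider = ToySpider()
spider.add_unspidered("http://example.test/")  # seed URL

crawled = []
for _ in range(10):
    if spider.crawl_next():
        crawled.append(spider.last_url)
    else:
        break  # queue is empty

print(crawled)
```

This toy loop is strictly sequential: each call handles one URL and returns. It says nothing about whether the real library uses threads internally.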
The original site also lists these Spider examples:
- Getting Started Spidering a Site
- Extract HTML Title, Description, Keywords
- Fetch robots.txt for a Site
- Avoid URLs Matching Any of a Set of Patterns
- Setting a Maximum Response Size
- Setting a Maximum URL Length
- Using the Disk Cache
- Crawling the Web
- Get Referenced Domains
- Get Base Domains
- GetBaseDomain
- CanonicalizeUrl
- Avoiding Outbound Links Matching Patterns
- Must-Match Patterns
- A Simple Web Crawler
The full suite of Chilkat components is now available for the Python scripting language. Commercially licensed components include:
- Email (POP3 / SMTP)
- IMAP
- Zip, GZip, and Unix Compress
- Encryption
- RSA
- MIME and S/MIME
- FTP
- HTTP
- HTML-to-XML
- Charset
- Bounce
The 11 components above are paid products.
The 4 below are free.
Freeware components include:
- XML
- Digital Certificates
- Spider
- Upload