Practical Tips on Writing an Effective Web Crawler

http://rmmod.com/effective-web-crawler/

A web crawler is a hard-working bot that gathers information or indexes pages on the Internet. It starts from a set of seed URLs, finds every hyperlink on each page, and then visits those hyperlinks recursively.

1. Choose an Ideal Programming Language

Here is a ranking of popular languages for developing web crawlers (based on the number of matching repositories hosted on GitHub as of February 2013):

[ranking chart]


Python or Ruby is probably a wise choice. The main speed limit of a web crawler is network latency, not CPU, so choosing Python or Ruby to develop your crawler will make life easier. Python provides some very useful standard libraries, such as urllib, httplib, and re, which can handle a lot of the work.

Python also has plenty of valuable third-party libraries worth a try:
scrapy, a web scraping framework.
urllib3, a Python HTTP library with thread-safe connection pooling and file post support (see the short sketch after this list).
greenlet, a lightweight concurrent programming framework.
twisted, an event-driven networking engine.
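
For a quick taste of one of these, here is a minimal sketch using urllib3's PoolManager, which pools and reuses connections per host; the URL is just a placeholder.

import urllib3

# PoolManager keeps a pool of connections per host and reuses them.
http = urllib3.PoolManager()
response = http.request('GET', 'http://www.yahoo.com/')   # placeholder URL
print response.status, len(response.data)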

 

2. Reading Some Simple Open-source Projects

You need to figure out exactly how a crawler works.
Here is a very simple crawler written in Python 2, in 10 lines of code.

import re, urllib
crawled_urls = set()
def crawl(url):
    for new_url in re.findall(r'''href=["']([^"']+)["']''', urllib.urlopen(url).read()):
        if new_url not in crawled_urls:
            print new_url
            crawled_urls.add(new_url)
if __name__ == "__main__":
    url = 'http://www.yahoo.com/'
    crawl(url)

A crawler usually needs to keep track of which URLs still need to be crawled and which URLs have already been crawled (to avoid infinite loops).

Other simple projects:
Python Crawler, a very simple crawler that uses Berkeley DB to store results.
pholcidae, a tiny Python module that lets you write your own crawling spider quickly and easily.

3. Choosing the Right Data Structure

Choosing a proper data structure will make your crawler efficient. A queue or stack is a good choice for storing the URLs that still need to be crawled, while a hash table or red-black tree is well suited for tracking the URLs that have already been crawled, since both offer fast lookups.

Search Time Complexity: Hash table O(1), R-B Tree O(log n)

But what if your crawler needs to deal with so many URLs that they no longer fit in memory? Try storing a checksum of each URL string instead of the full URL; if that is still not enough, you may need to use a cache algorithm (such as LRU) to spill some URLs to disk.
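
Here is a minimal sketch of these ideas, assuming a FIFO frontier queue and a visited set that stores MD5 checksums instead of full URL strings; the function names are just placeholders.

import hashlib
from collections import deque

frontier = deque()       # URLs waiting to be crawled (FIFO gives breadth-first order)
seen = set()             # 16-byte MD5 digests instead of full URL strings

def schedule(url):
    # Remember only the checksum of the URL to keep memory usage down.
    digest = hashlib.md5(url).digest()
    if digest not in seen:
        seen.add(digest)
        frontier.append(url)

def next_url():
    return frontier.popleft() if frontier else None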

4. Multithreading and Asynchronous I/O

If you are crawling sites on different servers, using multithreading or an asynchronous mechanism will save you a lot of time.

Remember to keep your crawler thread-safe: you need a thread-safe queue to share results and a thread controller to manage the workers.
The asynchronous approach is an event-based mechanism in which your crawler sits in an event loop; when an event fires (some resource becomes available), the crawler wakes up to deal with it, usually by executing a callback function. The asynchronous approach can improve both the throughput and the latency of your crawler.
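
As a rough illustration of the multithreaded approach, here is a minimal sketch using Python 2's threading and Queue modules; the worker count and the seed URLs are arbitrary.

import threading, urllib, Queue

url_queue = Queue.Queue()     # thread-safe queue of URLs to fetch
results = Queue.Queue()       # thread-safe queue of (url, html) results

def worker():
    while True:
        url = url_queue.get()
        try:
            results.put((url, urllib.urlopen(url).read()))
        finally:
            url_queue.task_done()    # mark this URL as finished

for _ in range(8):                   # arbitrary number of worker threads
    t = threading.Thread(target=worker)
    t.daemon = True                  # let the program exit even if workers block
    t.start()

for seed in ['http://www.yahoo.com/', 'http://www.bing.com/']:
    url_queue.put(seed)
url_queue.join()                     # block until every queued URL is processed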

Related Resources:

How to write a multi-threaded webcrawler, Andreas Hess

5. HTTP Persistent Connections

Every time you send an HTTP request, you need to open a TCP socket connection, and when the request finishes, the socket is closed. When you crawl many pages on the same server, you end up opening and closing sockets over and over, and that overhead becomes a real problem.

Connection: Keep-Alive

Use this header in your HTTP request to tell the server that your client supports keep-alive. Your code should also be modified accordingly.
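
For example, with Python 2's httplib you can reuse one HTTPConnection for several requests to the same host, so the underlying TCP socket stays open; the host and paths below are placeholders.

import httplib

# One TCP connection reused for several requests to the same server.
conn = httplib.HTTPConnection('www.example.com')
for path in ['/', '/about', '/contact']:
    conn.request('GET', path, headers={'Connection': 'Keep-Alive'})
    response = conn.getresponse()
    body = response.read()    # the body must be fully read before the next request
    print path, response.status, len(body)
conn.close()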

6. Efficient Regular Expressions

You should really figure out how regular expressions work under the hood; a good regex makes a real difference in performance.

When your web crawler parses the information in an HTTP response, the same regex is executed over and over. Compiling a regex costs a little extra time up front, but it runs faster afterwards. Note that Python (and .NET) automatically compiles and caches regexes, but it may still be worthwhile to compile them manually: giving a compiled regex a proper name also makes your code more readable.
If you want parsing to be even faster, you probably need to write a parser yourself.
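
To illustrate the compiling point above, here is a minimal sketch that compiles the href pattern from the earlier example once and reuses it for every page; the function name is just a placeholder.

import re

# Compiled once, reused for every page; the name documents what it matches.
href_pattern = re.compile(r'''href=["']([^"']+)["']''')

def extract_links(html):
    return href_pattern.findall(html)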

Related Resources:
Mastering Regular Expressions, Third Edition by Jeffrey Friedl.
Performance of Greedy vs. Lazy Regex Quantifiers, Steven Levithan
Optimizing regular expressions in Java, Cristian Mocanu

