Practical Tips on Writing an Effective Web Crawler

http://rmmod.com/effective-web-crawler/

A web crawler is a hard-working bot that gathers information or indexes pages on the Internet. It starts from a set of seed URLs, finds every hyperlink on each page, and then visits those hyperlinks recursively.

1. Choose an Ideal Programming Language

Here is a ranking of popular languages for developing web crawlers (based on the number of matching repositories hosted on GitHub as of February 2013):

[ranking chart]


Python or Ruby is probably a wise choice. The main speed limit of a web crawler is network latency, not CPU, so choosing Python or Ruby to develop your crawler will make life easier. Python provides some very useful standard libraries, such as urllib, httplib, and re, which can handle a lot of the work.

Python also has plenty of valuable third-party libraries worth a try:
scrapy, a web scraping framework.
urllib3, a Python HTTP library with thread-safe connection pooling and file post support (see the short sketch after this list).
greenlet, a lightweight concurrent programming framework.
twisted, an event-driven networking engine.
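
For a quick taste of one of these, here is a minimal sketch using urllib3's PoolManager, which pools and reuses connections per host; the URL is just a placeholder.

import urllib3

# PoolManager keeps a pool of connections per host and reuses them.
http = urllib3.PoolManager()
response = http.request('GET', 'http://www.yahoo.com/')   # placeholder URL
print response.status, len(response.data)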

 

2. Reading Some Simple Open-source Projects

You need to figure out exactly how a crawler works.
Here is a very simple crawler written in Python 2, in 10 lines of code.

import re, urllib
crawled_urls = set()
def crawl(url):
    for new_url in re.findall(r'''href=["']([^"']+)["']''', urllib.urlopen(url).read()):
        if new_url not in crawled_urls:
            print new_url
            crawled_urls.add(new_url)
if __name__ == "__main__":
    url = 'http://www.yahoo.com/'
    crawl(url)

A crawler usually needs to keep track of which URLs still need to be crawled and which URLs have already been crawled (to avoid infinite loops).

Other simple projects:
Python Crawler, a very simple crawler that uses Berkeley DB to store results.
pholcidae, a tiny Python module that lets you write your own crawling spider quickly and easily.

3. Choosing the Right Data Structure

Choosing a proper data structure will make your crawler efficient. A queue or stack is a good choice for storing the URLs that still need to be crawled, while a hash table or red-black tree is well suited for tracking the URLs that have already been crawled, since both offer fast lookups.

Search Time Complexity: Hash table O(1), R-B Tree O(log n)

But what if your crawler needs to deal with so many URLs that they no longer fit in memory? Try storing a checksum of each URL string instead of the full URL; if that is still not enough, you may need to use a cache algorithm (such as LRU) to spill some URLs to disk.
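
Here is a minimal sketch of these ideas, assuming a FIFO frontier queue and a visited set that stores MD5 checksums instead of full URL strings; the function names are just placeholders.

import hashlib
from collections import deque

frontier = deque()       # URLs waiting to be crawled (FIFO gives breadth-first order)
seen = set()             # 16-byte MD5 digests instead of full URL strings

def schedule(url):
    # Remember only the checksum of the URL to keep memory usage down.
    digest = hashlib.md5(url).digest()
    if digest not in seen:
        seen.add(digest)
        frontier.append(url)

def next_url():
    return frontier.popleft() if frontier else None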

4. Multithreading and Asynchronous I/O

If you are crawling sites on different servers, using multithreading or an asynchronous mechanism will save you a lot of time.

Remember to keep your crawler thread-safe: you need a thread-safe queue to share results and a thread controller to manage the workers.
The asynchronous approach is an event-based mechanism in which your crawler sits in an event loop; when an event fires (some resource becomes available), the crawler wakes up to deal with it, usually by executing a callback function. The asynchronous approach can improve both the throughput and the latency of your crawler.
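
As a rough illustration of the multithreaded approach, here is a minimal sketch using Python 2's threading and Queue modules; the worker count and the seed URLs are arbitrary.

import threading, urllib, Queue

url_queue = Queue.Queue()     # thread-safe queue of URLs to fetch
results = Queue.Queue()       # thread-safe queue of (url, html) results

def worker():
    while True:
        url = url_queue.get()
        try:
            results.put((url, urllib.urlopen(url).read()))
        finally:
            url_queue.task_done()    # mark this URL as finished

for _ in range(8):                   # arbitrary number of worker threads
    t = threading.Thread(target=worker)
    t.daemon = True                  # let the program exit even if workers block
    t.start()

for seed in ['http://www.yahoo.com/', 'http://www.bing.com/']:
    url_queue.put(seed)
url_queue.join()                     # block until every queued URL is processed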

Related Resources:

How to write a multi-threaded webcrawler, Andreas Hess

5. HTTP Persistent Connections

Every time you send an HTTP request, you need to open a TCP socket connection, and when the request finishes, the socket is closed. When you crawl many pages on the same server, you end up opening and closing sockets over and over, and that overhead becomes a real problem.

Connection: Keep-Alive

Use this header in your HTTP request to tell the server that your client supports keep-alive. Your code should also be modified accordingly.
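
For example, with Python 2's httplib you can reuse one HTTPConnection for several requests to the same host, so the underlying TCP socket stays open; the host and paths below are placeholders.

import httplib

# One TCP connection reused for several requests to the same server.
conn = httplib.HTTPConnection('www.example.com')
for path in ['/', '/about', '/contact']:
    conn.request('GET', path, headers={'Connection': 'Keep-Alive'})
    response = conn.getresponse()
    body = response.read()    # the body must be fully read before the next request
    print path, response.status, len(body)
conn.close()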

6. Efficient Regular Expressions

You should really figure out how regular expressions work under the hood; a good regex makes a real difference in performance.

When your web crawler parses the information in an HTTP response, the same regex is executed over and over. Compiling a regex costs a little extra time up front, but it runs faster afterwards. Note that Python (and .NET) automatically compiles and caches regexes, but it may still be worthwhile to compile them manually: giving a compiled regex a proper name also makes your code more readable.
If you want parsing to be even faster, you probably need to write a parser yourself.
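
To illustrate the compiling point above, here is a minimal sketch that compiles the href pattern from the earlier example once and reuses it for every page; the function name is just a placeholder.

import re

# Compiled once, reused for every page; the name documents what it matches.
href_pattern = re.compile(r'''href=["']([^"']+)["']''')

def extract_links(html):
    return href_pattern.findall(html)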

Related Resources:
Mastering Regular Expressions, Third Edition by Jeffrey Friedl.
Performance of Greedy vs. Lazy Regex Quantifiers, Steven Levithan
Optimizing regular expressions in Java, Cristian Mocanu

