Having fun web crawling with PhantomJS

A couple of weeks ago, a colleague of mine showed me this cool tool called PhantomJS.
It's a headless browser that can run JavaScript to do almost anything you would want from a regular browser, just without rendering anything to the screen.

This could be really useful for tasks like running UI tests on a project you created, or crawling a set of web pages looking for something.

...So, this is exactly what I did!
There's a great site I know of that has a ton of great ebooks ready to download, but the problem is that it shows only two results per page, and the search never finds anything!

Realizing that this site has a very simple URL structure (e.g.: website/page/#), I just created a quick JavaScript file telling PhantomJS to go through the first 50 pages and search for a list of keywords that interest me. If it finds something interesting, it saves the name of the book along with the page link into a text file, so I can download them all later. :)
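Exploiting that URL pattern just means generating the 50 page addresses up front. A minimal sketch of the idea (the base URL is a placeholder, mirroring the hidden site name in this post):

```javascript
// Build the list of page URLs for a site with a simple /page/N structure.
// The base URL here is a made-up placeholder, not the real site.
function buildPageUrls(baseUrl, pageCount) {
  var urls = [];
  for (var i = 1; i <= pageCount; i++) {
    urls.push(baseUrl + '/page/' + i);
  }
  return urls;
}

var urls = buildPageUrls('http://example.com', 50);
console.log(urls[0]);           // → http://example.com/page/1
console.log(urls.length);       // → 50
```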

Here's the script:
var page;
var fs = require('fs');
var pageCount = 0;

scanPage(pageCount);

function scanPage(pageIndex) {
  // dispose of the previous page before moving on
  if (typeof page !== 'undefined')
    page.release();

  // dispose of phantomjs once we've done the first 50 pages
  if (pageIndex >= 50) {
    phantom.exit();
    return;
  }

  pageIndex++;

  // start crawling...
  page = require('webpage').create();
  var currentPage = 'your-favorite-ebook-site-goes-here/page/' + pageIndex;
  page.open(currentPage, function(status) {
    if (status === 'success') {
      window.setTimeout(function() {
        console.log('crawling page ' + pageIndex);

        var booksNames = page.evaluate(function() {
          // there are 2 book titles on each page, just put these in an array
          // (this runs in the page's context and assumes the site loads jQuery)
          return [ $($('h2 a')[0]).attr('title'), $($('h2 a')[1]).attr('title') ];
        });
        checkBookName(booksNames[0], currentPage);
        checkBookName(booksNames[1], currentPage);

        scanPage(pageIndex);
      }, 3000);
    }
    else {
      console.log('error crawling page ' + pageIndex);
      // keep going so one bad page doesn't stall the whole crawl
      scanPage(pageIndex);
    }
  });
}

// checks for interesting keywords in the book title,
// and saves the link for us if necessary
function checkBookName(bookTitle, bookLink) {
  // keywords must be lowercase, since the title is lowercased before matching
  var interestingKeywords = ['c#', 'java', 'nhibernate', 'windsor', 'ioc',
    'dependency injection', 'inversion of control', 'mysql'];
  for (var i = 0; i < interestingKeywords.length; i++) {
    if (bookTitle.toLowerCase().indexOf(interestingKeywords[i]) !== -1) {
      // save the book title and link
      var entry = bookTitle + ' => ' + bookLink + ' ; ';
      fs.write('books.txt', entry, 'a');
      console.log(entry);
      break;
    }
  }
}

And this is what the script looks like when running:

[screenshot of the script's console output]

Just some notes on the script:
  • I added comments to try to make it as clear as possible. Feel free to contact me if anything isn't.
  • I hid the real website name for obvious reasons. This technique could be useful for a variety of things, but you should check the legal issues first.
  • I also added a 3-second interval between page loads, as a precaution against putting too much load on their site.
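The keyword check in checkBookName is plain JavaScript, so it's easy to try outside PhantomJS. A minimal sketch of the matching logic (the keyword list and titles here are made up for illustration):

```javascript
// Return the first keyword found in a book title, or null if none match.
// Same case-insensitive indexOf approach as checkBookName in the script above.
function matchesKeyword(bookTitle, keywords) {
  var lower = bookTitle.toLowerCase();
  for (var i = 0; i < keywords.length; i++) {
    if (lower.indexOf(keywords[i]) !== -1) {
      return keywords[i]; // first interesting keyword in the title
    }
  }
  return null; // nothing interesting in this title
}

var interestingKeywords = ['c#', 'java', 'nhibernate', 'mysql'];
console.log(matchesKeyword('Pro NHibernate in Action', interestingKeywords)); // → nhibernate
console.log(matchesKeyword('Cooking for Beginners', interestingKeywords));    // → null
```

Note that the keywords have to be lowercase, since only the title side of the comparison is lowercased.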

In order to use this script, or something like it, just go to the PhantomJS homepage, download it, and run this at the command line:
C:\your-phantomjs-lib\phantomjs your-script.js

Enjoy! :)