Crawl your website including login form with PhantomJS

Sep 27th, 2013

With PhantomJS, we start a headless WebKit and pilot it with our own scripts. Said differently, we write a script in JavaScript or CoffeeScript which controls a web browser and manipulates the webpage loaded inside. In the past, I’ve used a similar solution called [Selenium]. PhantomJS is much faster, it doesn’t start a graphical browser (that’s what headless stands for), and you can inject your own JavaScript inside the page (I don’t remember being able to do such a thing with Selenium).

PhantomJS is commonly used for testing websites and HTML-based applications whose content is dynamically updated with JavaScript events and Ajax requests. It is also popular for generating screenshots of webpages and building website previews, a usage illustrated below.

The official website presents PhantomJS as:

* Headless Website Testing: Run functional tests with frameworks such as Jasmine, QUnit, Mocha, Capybara, WebDriver, and many others.
* Screen Capture: Programmatically capture web contents, including SVG and Canvas. Create web site screenshots with thumbnail preview.
* Page Automation: Access and manipulate webpages with the standard DOM API, or with usual libraries like jQuery.
* Network Monitoring: Monitor page loading and export as standard HAR files. Automate performance analysis using YSlow and Jenkins.

In my case, I’ve used it to simulate user behavior under high load to create user logs and populate a system like Google Analytics. More specifically, I will introduce a project architecture composed of 3 components:

1. User-written PhantomJS scripts that I later call “actions”. An action simulates user interactions and can be chained with other actions. For example, a first action could log in a user and a second one could update their personal information.
2. A generic PhantomJS script to sequentially run multiple actions passed as arguments.
3. A Node.js script to pilot PhantomJS and simulate concurrent user loads.

To make things more interesting, the user-written scripts will show you how to simulate a user login, or any form submission. Please don’t use it as a basis to log in to your (boy|girl)friend’s Gmail account.

1. The user-written scripts

I will write 2 scripts for illustration purposes. The first logs the user in on a fake website and the second visits two user information pages. Those scripts are written in CoffeeScript and interact with the PhantomJS API, which borrows a lot from the CommonJS specification. Keep in mind that even if it looks a lot like Node.js (it’s JavaScript after all), it runs in a completely different environment.

The login action

webpage = require 'webpage'
module.exports = (callback) ->
  page = webpage.create()
  url = 'https://mywebsite.com/login'
  count = 0
  page.onLoadFinished = ->
    console.log '** login', count
    page.render "login_#{count}.png"
    if count is 0
      page.evaluate ->
        jQuery('#login').val('IDTMAASP15')
        jQuery('#pass').val('azerty1')
        jQuery('[name="loginForm"] [name="submit"]').click()
    else if count is 1
      callback()
    count++
  page.open url, (status) ->
    return callback new Error "Invalid webpage" if status isnt 'success'

The information action

webpage = require 'webpage'
module.exports = (callback) ->
  page = webpage.create()
  count = 0
  page.onLoadFinished = ->
    console.log 'info', count
    page.render "donnees_perso_#{count}.png"
    if count is 0
      page.evaluate ->
        window.location = jQuery('.boxSection [href*=info]')[0].href
    else if count is 1
      page.evaluate ->
        window.location = jQuery('.services [href*=info_perso]')[0].href
    else if count is 2
      page.goBack()
    else if count is 3
      page.evaluate ->
        window.location = jQuery('.services [href*=info_login]')[0].href
    else if count is 4
      callback()
    count++
  page.open 'https://domain/path/to/login', (status) ->
    return callback new Error "Invalid webpage" if status isnt 'success'

A few things in this code are worth commenting on.

The call to page.render generates a screenshot of the webpage at the time of the call. Generating website screen captures is a common use of PhantomJS.

The code runs inside the PhantomJS execution engine, with the exception of the functions passed to page.evaluate, which run inside the loaded webpage. This simplifies the writing of your PhantomJS script but is a little awkward in the sense that you won’t be able to share context between those two sections. It is as if the function passed to page.evaluate were converted to a string with toString and run inside a separate engine.
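
To make this isolation concrete, here is a minimal sketch in plain JavaScript that runs with Node.js rather than PhantomJS. It is not how PhantomJS is implemented: evaluateInSandbox is a hypothetical stand-in which mimics the behavior described above by converting the function to a string and running it in a separate context created with the Node.js vm module. The names pageGlobals, secret and title are made up for the illustration.

```javascript
const vm = require('vm');

// Hypothetical stand-in for page.evaluate (an assumption, not PhantomJS code):
// the function is turned into source text and executed in another context.
function evaluateInSandbox(fn, sandboxGlobals, ...args) {
  const source = `(${fn.toString()}).apply(null, ${JSON.stringify(args)})`;
  const result = vm.runInNewContext(source, sandboxGlobals);
  // Only plain data crosses the boundary, so the result goes through JSON.
  return result === undefined ? undefined : JSON.parse(JSON.stringify(result));
}

// Stands for the globals of the loaded webpage.
const pageGlobals = { title: 'Fake page' };
// Lives in the calling script only.
const secret = 'outer value';

// Works: the value is passed explicitly as an argument.
const withArgument = evaluateInSandbox(function (suffix) {
  return title + suffix;
}, pageGlobals, '!');

// Fails: `secret` belongs to the outer closure, which is lost in the conversion.
let closureError = null;
try {
  evaluateInSandbox(function () { return secret; }, pageGlobals);
} catch (err) {
  closureError = err.name;
}

console.log(withArgument);
console.log(closureError);
```

The first call succeeds because everything it needs is either an argument or a global of the sandbox. The second call throws a ReferenceError because the variable only exists in the calling script. The same rule applies to a real action: anything the page function needs must be defined in the page itself or handed over as serializable data.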

Finally, the page object represents all the pages we will load. It is more appropriate to think of it as a browser tab in which multiple pages are loaded one after the other. The function page.onLoadFinished is called every time a page finishes loading, which is why each action keeps a counter to know which step it has reached.
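
The counter pattern used by both actions can be factored into a small helper. The sketch below is plain JavaScript runnable with Node.js. createLoadHandler and fakePage are hypothetical names which do not come from PhantomJS, and the fake page only imitates the fact that the handler is called once per page load.

```javascript
// Hypothetical helper: build an onLoadFinished handler from an ordered list of
// steps. Step n runs after the n-th page load; the load that follows the last
// step signals the end of the action.
function createLoadHandler(steps, done) {
  let count = 0;
  return function onLoadFinished() {
    const current = count++;
    if (current < steps.length) {
      steps[current]();
    } else if (current === steps.length) {
      done();
    }
    // Any later load is ignored.
  };
}

// Fake page object (an assumption for the demo): we trigger the loads by hand.
const fakePage = { onLoadFinished: null };
const trace = [];

fakePage.onLoadFinished = createLoadHandler(
  [() => trace.push('submit login form')],
  () => trace.push('login done')
);

fakePage.onLoadFinished(); // the login page is loaded
fakePage.onLoadFinished(); // the page reached after the form submission is loaded
fakePage.onLoadFinished(); // an unexpected extra load, ignored

console.log(trace);
```

With this shape, the login action has one step and the information action has four, and neither needs its own chain of if and else if branches.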

2. The action runner

This script is also run inside PhantomJS. Its purpose is to run multiple actions sequentially (one after the other) in a generic manner.

The action runner takes a list of actions provided as arguments, loads the JavaScript scripts named after the actions and runs those scripts sequentially.

# Grab arguments
args = require('system').args
# Convert to an array
args = Array.prototype.slice.call(args, 0)
# Remove the script filename
args.shift()
# Callback when all actions have been run
done = (err) ->
  phantom.exit if err then 1 else 0
# Run the next action
next = (err) ->
  n = args.shift()
  return done err if err or not n
  n = require "./#{n}"
  n next
next()
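
The chaining logic can be tried outside PhantomJS. The following sketch is plain JavaScript for Node.js which reproduces the next and done functions of the runner. As an assumption made to keep it self-contained, runActions receives the actions as functions instead of loading them by name with require, and the three stub actions are invented for the demonstration.

```javascript
// Hypothetical Node.js version of the runner: actions are functions taking a
// callback, like the modules exported by the action scripts.
function runActions(actions, done) {
  const queue = actions.slice(); // copy so the caller's array is untouched
  function next(err) {
    const action = queue.shift();
    // Stop on the first error or when no action is left.
    if (err || !action) return done(err || null);
    action(next);
  }
  next();
}

// Stub actions (made up for the example) which record when they run.
const executed = [];
const login = (callback) => { executed.push('login'); callback(); };
const broken = (callback) => { executed.push('broken'); callback(new Error('boom')); };
const information = (callback) => { executed.push('information'); callback(); };

let exitCode = null;
runActions([login, broken, information], (err) => {
  // The PhantomJS runner would pass this code to phantom.exit.
  exitCode = err ? 1 : 0;
});

console.log(executed, exitCode);
```

Because the second stub reports an error, the third one never runs and the exit code is 1. This is the behavior the pilot relies on to detect a failed run.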

3. The pilot

The pilot is a Node.js application responsible for managing and monitoring PhantomJS. It is able to simulate concurrent load by running multiple instances of PhantomJS in parallel. To achieve concurrency, I used the Node.js each module. The each.prototype.parallel function indicates how many instances of PhantomJS will run at the same time. The each.prototype.repeat function indicates how many times each action will run.

fs = require 'fs'
phantomjs = require 'phantomjs'
each = require 'each'
child = require 'child_process'
cookies = "#{__dirname}/cookies.txt"

# Spawn one PhantomJS process which runs the given actions in sequence
run = (actions, callback) ->
  args = [
    "--ignore-ssl-errors=yes"
    "--cookies-file=#{cookies}"
    "#{__dirname}/run.js"
  ]
  for action in actions then args.push action
  process.stdout.write "\x1b[36m..#{actions.join(' ')} start..\x1b[39m\n"
  web = child.spawn phantomjs.path, args
  # Forward the PhantomJS output, cyan for stdout and magenta for stderr
  web.stdout.on 'data', (data) ->
    process.stdout.write "\x1b[36m#{data.toString()}\x1b[39m"
  web.stderr.on 'data', (data) ->
    process.stdout.write "\x1b[35m#{data.toString()}\x1b[39m"
  web.on 'close', (code) ->
    process.stdout.write "\x1b[36m..#{actions.join(' ')} done..\x1b[39m\n"
    if callback
      err = if code isnt 0 then new Error "Invalid exit code #{code}" else null
      callback err

each([
  ['login','information']
  ['login','another_action'] ])
.parallel(2)
.repeat(20)
.on 'item', (scripts, next) ->
  # Remove any persisted session; the error is ignored if the file is missing
  fs.unlink cookies, (err) ->
    run scripts, next
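
For readers who want to see what the parallel and repeat settings amount to, here is a minimal sketch in plain JavaScript for Node.js. It is not the implementation of the each module: runParallel and its options are hypothetical, and it assumes that the whole list of items is processed repeat times. The worker does not spawn PhantomJS; it only stores its callback so that the example can finish the jobs by hand and observe how many run at once.

```javascript
// Hypothetical scheduler: run `worker` on every item, `repeat` times, with at
// most `parallel` jobs in flight. Stops launching new jobs on the first error.
function runParallel(items, options, worker, done) {
  const jobs = [];
  for (let i = 0; i < options.repeat; i++) jobs.push(...items);
  let running = 0;
  let index = 0;
  let finished = false;
  let firstError = null;
  function launch() {
    while (running < options.parallel && index < jobs.length && !firstError) {
      running++;
      worker(jobs[index++], (err) => {
        running--;
        if (err && !firstError) firstError = err;
        launch();
      });
    }
    const nothingLeft = index >= jobs.length || firstError;
    if (running === 0 && nothingLeft && !finished) {
      finished = true;
      done(firstError);
    }
  }
  launch();
}

// Demo worker (made up): it keeps the callbacks so we decide when a job ends.
const pending = [];
let started = 0;
let peak = 0;
let result = 'not finished';

runParallel(
  [['login', 'information'], ['login', 'another_action']],
  { parallel: 2, repeat: 3 },
  (scripts, next) => {
    started++;
    pending.push(next);
    peak = Math.max(peak, pending.length);
  },
  (err) => { result = err ? 'failed' : 'ok'; }
);

const startedBeforeAnyFinish = started;
// Finish the jobs one by one; each completion lets the scheduler start another.
while (pending.length) pending.shift()();

console.log(startedBeforeAnyFinish, started, peak, result);
```

Only two jobs start before any of them completes, six jobs run in total (2 items, 3 repetitions) and there are never more than two in flight. In the pilot, the worker is the function which deletes the cookies file and spawns PhantomJS.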

Put it all together

In the end, you might create a Node.js project (simply a directory with a package.json file inside), place all the files described above inside the new directory, declare your “phantomjs” and “each” module dependencies (inside the package.json file), install them with npm install and run your “run.js” script with the command node run.js.

Note about PhantomJS cookies

This is a personal section covering my experience using the cookies support. PhantomJS accepts a “cookies-file” argument with a file path as a value. Basically, a PhantomJS command would look like phantomjs --cookies-file=#{cookies} {more_arguments} {script_path} {script_arguments}.

After a few trials, I wasn’t able to use the cookies file efficiently. Running a second script does not honor the persisted session. However, if I don’t exit PhantomJS with phantom.exit() and force quit the application instead, then the cookies file works as expected.

This is one of the two reasons why I came up with an architecture in which I can chain multiple actions. The other reason is speed, since the headless WebKit instance is started fewer times. I don’t blame PhantomJS, it could be something I overlooked in the documentation.

[selenium]

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值