Web Scraping Ajax and Javascript Sites

转自:http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/

Introduction

Most crawling frameworks used for scraping cannot be used for Javascript or Ajax. Their scope is limited to those sites that show their main content without using scripting. One would also be tempted to connect a specific crawler to a Javascript engine but it’s not easy to do. You need a fully functional browser with good DOM support because the browser behavior is too complex for a simple connection between a crawler and a Javascript engine to work. There is a list of resources at the end of this article to explore the alternatives in more depth.

There are several ways to scrape a site that contains Javascript:

  1. Embed a web browser within an application and simulate a normal user.
  2. Remotely connect to a web browser and automate it from a scripting language.
  3. Use special purpose add-ons to automate the browser
  4. Use a framework/library to simulate a complete browser.

Each one of these alternatives has its pros and cons. For  example using a complete browser consumes a lot of resources, especially if we need to scrape websites with a lot of pages.

In this post we’ll give a simple example of how to scrape a web site that uses Javascript. We will use the htmlunit library to simulate a browser. Since htmlunit runs on a JVM we will use Jython, an [excellent] programming language,which is a Python implementation in the JVM. The resulting code is very clear and focuses on solving the problem instead of on the aspects of programming languages.

Setting up the environment

Prerequisites

  1. JRE or JDK.
  2. Download the latest version of Jython fromhttp://www.jython.org/downloads.html.
  3. Run the .jar file and install it in your preferred directory (e.g: /opt/jython).
  4. Download the htmlunit compiled binaries from:http://sourceforge.net/projects/htmlunit/files/.
  5. Unzip the htmlunit to your preferred directory.

Crawling example

We will scrape the Gartner Magic Quadrant pages at:http://www.gartner.com/it/products/mq/mq_ms.jsp . If you look at the list of documents, the links are Javascript code instead of hyperlinks with http urls. This is may be to reduce crawling, or just to open a popup window. It’s a very convenient page to illustrate the solution.

gartner.py

01 import com.gargoylesoftware.htmlunit.WebClient as WebClient
02 import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion
03  
04 def main():
05    webclient = WebClient(BrowserVersion.FIREFOX_3_6) # creating a new webclient object.
07    page = webclient.getPage(url) # getting the url
08    articles = page.getByXPath("//table[@id='mqtable']//tr/td/a") # getting all the hyperlinks
09  
10    for article in articles:
11       print "Clicking on:", article
12       subpage = article.click() # click on the article link
13       title = subpage.getByXPath("//div[@class='title']") # get title
14       summary = subpage.getByXPath("//div[@class='summary']") # get summary
15       if len(title) > 0 and len(summary) > 0:
16          print "Title:", title[0].asText()
17          print "Summary:", summary[0].asText()
18 #     break
19  
20 if __name__ == '__main__':
21    main()

run.sh

1 /opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" gartner.py

Final notes

This article is just a starting point to move ahead of simple crawlers and point the way for further research. As this is a simple page, it is a good choice for a clear example of how Javascript scraping works.You must do your homework to learn to crawl more web pages or add multithreading for better performance. In a demanding crawling scenario a lot of things must be taken into account, but this is a subject for future articles.

If you want to be polite don’t forget to read the robots.txt file before crawling…

If you like this article, you might also be interested in

  1. Distributed Scraping With Multiple Tor Circuits
  2. Precise Scraping with Google Chrome
  3. Running Your Own Anonymous Rotating Proxies
  4. Automated Browserless OAuth Authentication for Twitter

Resources

  1. HtmlUnit
  2. ghost.py is a webkit web client written in python
  3. Crowbar web scraping environment
  4. Google Chrome remote debugging shell from Python
  5. Selenium web application testing system – Watir – Sahi – Windmill Testing Framework
  6. Internet Explorer automation
  7. jSSh Javascript Shell Server for Mozilla
  8. http://trac.webkit.org/wiki/QtWebKit
  9. Embedding Gecko
  10. Opera Dragonfly
  11. PyAuto: Python Interface to Chromum’s automation framework
  12. Related questions on Stack Overflow
  13. Scrapy
  14. EnvJS: Simulated browser environment written in Javascript
  15. Setting up Headless XServer and CutyCapt on Ubuntu
  16. CutyCapt: Capture WebKit’s rendering of a web page.
  17. Google webmaste blog: A spider’s view of Web 2.0
  18. OpenQA
  19. Python Webkit DOM Bindings
  20. Berkelium Browser
  21. uBrowser
  22. Using HtmlUnit on .NET for Headless Browser Automation (using IKVM)
  23. Zombie.js
  24. PhantomJS
  25. PyPhantomJS
  26. CasperJS
  27. Web Inspector Remote
  28. Offscreen/Headless Mozilla Firefox (via @brutuscat)
  29. Web Scraping with Google Spreadsheets and XPath
  30. Web Scraping with YQL and Yahoo Pipes



  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值