Crawling Ajax pages (Cobra)

http://lobobrowser.org/cobra.jsp

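Pages that rely on JavaScript logic are a major obstacle for web crawlers: the DOM tree is only complete once the page's scripts have run, and sometimes the crawler has to parse the DOM as modified by JavaScript. After searching through a lot of material, I found an open-source project, Cobra. Cobra supports JavaScript; its built-in engine is Mozilla's Rhino, and it uses the Rhino API to interpret and execute JavaScript embedded in HTML. Test case: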

js.html

<html>
<title>test javascript</title>
<script language="javascript">
var go = function(){
    document.getElementById("gg").innerHTML="google";
}
</script>
<body onLoad="javascript:go();">
<a id = "gg" onClick="javascript:go();" href="#">baidu</a>
</body>
</html>

Test.java

package net.cooleagle.test.cobra;

import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;

import org.lobobrowser.html.UserAgentContext;
import org.lobobrowser.html.domimpl.HTMLDocumentImpl;
import org.lobobrowser.html.parser.DocumentBuilderImpl;
import org.lobobrowser.html.parser.InputSourceImpl;
import org.lobobrowser.html.test.SimpleUserAgentContext;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class Test {

    // js.html served by a local web server (no trailing space in the URL)
    private static final String TEST_URI = "http://localhost/js.html";

    public static void main(String[] args) throws Exception {
        UserAgentContext uacontext = new SimpleUserAgentContext();
        DocumentBuilderImpl builder = new DocumentBuilderImpl(uacontext);
        URL url = new URL(TEST_URI);
        InputStream in = url.openConnection().getInputStream();
        try {
            Reader reader = new InputStreamReader(in, "ISO-8859-1");
            InputSourceImpl inputSource = new InputSourceImpl(reader, TEST_URI);
            // Cobra builds the DOM here and runs the page's embedded JavaScript
            Document d = builder.parse(inputSource);
            HTMLDocumentImpl document = (HTMLDocumentImpl) d;
            // Look up the link whose text the script rewrites
            Element ele = document.getElementById("gg");
            System.out.println(ele.getTextContent());
        } finally {
            in.close();
        }
    }
}

Output:

google

The test succeeded: the onLoad handler ran and replaced the link text "baidu" with "google".
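
For contrast, here is a minimal sketch of what a parser that does not execute JavaScript sees for the same markup. It uses only the JDK's XML parser and XPath, not Cobra. The class name StaticDomDemo and the helper textById are hypothetical names chosen for this illustration, and the sketch assumes the page is well-formed enough to be parsed as XML, which happens to hold for js.html.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: JDK only, no script is ever executed.
class StaticDomDemo {

    // Same markup as js.html, inlined so that no web server is needed.
    static final String JS_HTML =
        "<html>"
        + "<title>test javascript</title>"
        + "<script language=\"javascript\">"
        + "var go = function(){"
        + " document.getElementById(\"gg\").innerHTML=\"google\";"
        + "}"
        + "</script>"
        + "<body onLoad=\"javascript:go();\">"
        + "<a id = \"gg\" onClick=\"javascript:go();\" href=\"#\">baidu</a>"
        + "</body>"
        + "</html>";

    // Returns the text of the element with the given id, or null if absent.
    // Assumes the id contains no quote characters.
    static String textById(String markup, String id) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // A plain XML DOM has no notion of HTML ids, so match the attribute.
            Node node = (Node) xpath.evaluate(
                "//*[@id='" + id + "']", doc, XPathConstants.NODE);
            return node == null ? null : node.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("markup could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        // The script is stored as text but never run, so the link is unchanged.
        System.out.println(textById(JS_HTML, "gg"));
    }
}
```

Running main prints baidu, because the onLoad handler never fires. The difference from the Cobra output is exactly the DOM change made by the script.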

 

============================================

 

I originally used JRex, a Java wrapper for the Mozilla Gecko layout engine, to render HTML pages. While looking for a better engine for extracting the HTML of rendered pages, I found the Cobra Toolkit, which is part of the Lobo Project. The project consists of the Cobra Toolkit, which renders HTML, and the LoboBrowser, which is built on the toolkit. The code is pure Java.

My initial comparison of JRex and Cobra found the following salient facts:

  • JRex seems to be an abandoned project while the Lobo Project is active. The forums for this project are more active than for JRex.
  • While JRex appears to be abandoned, Gecko is a world-class rendering engine. Cobra still seems to be in development.
  • JRex crashes the Java JVM when loading certain pages, and Cobra does not.
  • Cobra can be run headless while JRex/Gecko cannot. Cobra seems faster since it doesn't have to actually render the HTML page to a graphic context.
  • By default, JRex/Gecko includes a Flash plug-in while Cobra does not. (Since the plug-in mechanism for the LoboBrowser requires Java code, plug-ins for other browsers will not work. Until a Java Flash plug-in is available, Cobra will not handle Flash.) The JavaScript in some pages will cause a modified page to be loaded if Flash isn't present. In some data mining tasks, being able to examine the <OBJECT> and <EMBED> tags is useful and might not be available in Cobra unless a plug-in for Flash is installed.
  • JRex/Gecko seems to cope with poorly formed HTML better than Cobra does. A missing <HTML> or <HEAD> tag can cause Cobra to quit before building the complete DOM. But since the LoboBrowser properly renders one of my test pages that Cobra fails on, perhaps this is less of a problem than I think.