A New Way to Scrape Web Pages with jQuery in a Java Program (Calling a JavaScript Engine from Java)

Latest recommended article published on 2022-08-02 14:47:15

MemRay

Categories: JavaScript, Java. Tags: Rhino, JavaScript parsing

Reposted from: http://www.open-open.com/lib/view/open1331187174202.html

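Almost any information you want already exists somewhere on the Internet; the problem is how to organize it into the form you need. For example, you might scrape the names, phone numbers, and email addresses of all the relevant companies from an industry website and save them to Excel for analysis. Web scraping is becoming more and more useful.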

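For a traditional web page, the web server returns the HTML directly. Such pages are easy to scrape: whatever method you use, you only need to fetch the HTML and then parse the DOM. Pages whose content is generated by JavaScript are not so easy. Zhang Yu (the original author) has not yet found a good way to handle them; readers with experience scraping JavaScript-driven pages are welcome to offer advice.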
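The "fetch the HTML, then parse the DOM" approach for static pages can be sketched with the JDK alone. This is only a minimal sketch under a strong assumption: the markup is well-formed XHTML, which real-world HTML often is not (a lenient HTML parser would be needed there). The class name `StaticPageLinks`, the method `extractLinks`, and the `example.com` URLs are hypothetical and do not come from the original article.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class StaticPageLinks {

    /**
     * Parses a well-formed XHTML string into a DOM tree and returns the href
     * of every <a> element nested inside the element with the given id.
     * Assumption: containerId contains no quote characters.
     */
    public static List<String> extractLinks(String xhtml, String containerId) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // XPath: all href attributes of <a> elements below the container
        String expr = "//*[@id='" + containerId + "']//a/@href";
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);

        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            hrefs.add(nodes.item(i).getNodeValue());
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        // A tiny stand-in for a page that was already downloaded
        String page = "<html><body><div id=\"u\">"
                + "<a href=\"http://example.com/a\">A</a>"
                + "<a href=\"http://example.com/b\">B</a>"
                + "</div><a href=\"http://example.com/c\">C</a></body></html>";
        // Only the two links inside #u are returned
        System.out.println(extractLinks(page, "u"));
    }
}
```

The XPath expression plays the same role as the jQuery selector `#u a` used later in this article, which is part of why a jQuery-based approach looks attractive: the selector syntax is shorter and tolerates messy HTML.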

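So today's topic is still scraping traditional HTML pages. As noted above, there is no real technical difficulty, but is there an even easier way? Anyone who has used jQuery or a similar JS framework probably feels that JavaScript is a natural helper for extracting information from web pages; after all, it was born to work with web pages. It has many more uses now, of course, such as server-side JavaScript with Node.js.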

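Being able to scrape pages with jQuery from within our own applications, such as a Java program, would be exciting. A ready-made solution does exist: all it takes is a JavaScript engine plus an environment that can support jQuery.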

Tools: Java, Rhino, and envJs. Rhino is the open-source JavaScript engine from Mozilla, and envJs simulates a browser environment (window and so on). The code is as follows:

 
package stony.zhang.scrape;

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.lang.reflect.InvocationTargetException;

import org.mozilla.javascript.Context;
import org.mozilla.javascript.ContextFactory;
import org.mozilla.javascript.Scriptable;
import org.mozilla.javascript.ScriptableObject;

/**
 * Runs a scraping script with jQuery inside Rhino, using env.js to emulate the browser.
 *
 * @author MyBeautiful
 * Email: zhangyu0182@sina.com
 * @date Mar 7, 2012
 */
public class RhinoScaper {
    private String url;
    private String jsFile;

    private Context cx;
    private Scriptable scope;

    public String getUrl() {
        return url;
    }

    public String getJsFile() {
        return jsFile;
    }

    public void setUrl(String url) {
        this.url = url;
        // Exposes the URL to scripts as the global variable "url".
        // init() must be called first: scope is null until then.
        putObject("url", url);
    }

    public void setJsFile(String jsFile) {
        this.jsFile = jsFile;
    }

    public void init() {
        cx = ContextFactory.getGlobal().enterContext();
        scope = cx.initStandardObjects(null);
        // Interpreted mode (no bytecode compilation), which env.js needs under Rhino
        cx.setOptimizationLevel(-1);
        cx.setLanguageVersion(Context.VERSION_1_5);

        // Load the browser emulation first, then jQuery on top of it
        String[] file = { "./lib/env.rhino.1.2.js", "./lib/jquery.js" };
        for (String f : file) {
            evaluateJs(f);
        }

        // ExtendUtil is not shown in this article. It must be a Rhino host object
        // whose getClassName() returns "util"; test.js calls its log() function.
        try {
            ScriptableObject.defineClass(scope, ExtendUtil.class);
        } catch (IllegalAccessException e1) {
            e1.printStackTrace();
        } catch (InstantiationException e1) {
            e1.printStackTrace();
        } catch (InvocationTargetException e1) {
            e1.printStackTrace();
        }
        ExtendUtil util = (ExtendUtil) cx.newObject(scope, "util");
        scope.put("util", scope, util);
    }

    protected void evaluateJs(String f) {
        // try-with-resources closes the reader even if evaluation fails
        try (FileReader in = new FileReader(f)) {
            cx.evaluateReader(scope, in, f, 1, null);
        } catch (FileNotFoundException e1) {
            e1.printStackTrace();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }

    // Makes a Java object visible to scripts under the given global name
    public void putObject(String name, Object o) {
        scope.put(name, scope, o);
    }

    // Evaluates the user's scraping script (see setJsFile)
    public void run() {
        evaluateJs(this.jsFile);
    }
}

Test code:

 
package stony.zhang.scrape;

import java.util.HashMap;
import java.util.Map;

import junit.framework.TestCase;

public class RhinoScaperTest extends TestCase {

    public RhinoScaperTest(String name) {
        super(name);
    }

    public void testRun() {
        RhinoScaper rs = new RhinoScaper();
        rs.init();
        rs.setUrl("http://www.baidu.com");
        rs.setJsFile("test.js");
//      Map<String, String> o = new HashMap<String, String>();
//      rs.putObject("result", o);
        rs.run();
//      System.out.println(o.get("imgurl"));
    }

}

The test.js file is as follows:

 
$.ajax({
  url: "http://www.baidu.com",
  context: document.body,
  success: function(data) {
    // util.log(data);

    // Turn the raw HTML string into DOM nodes that jQuery can query
    var result = parseHtml(data);
    var $v = jQuery(result);
    // util.log(result);

    // Log the href of every link inside the element with id "u"
    $v.find('#u a').each(function(index) {
      util.log(index + ': ' + $(this).attr("href"));
      // arr.add($(this).attr("href"));
    });
  }
});

function parseHtml(html) {
  // Create an iFrame object that will be used to render the HTML in order to get the DOM objects
  // created - this is a far quicker way of achieving the HTML to DOM conversion than trying
  // to transform the HTML objects one-by-one
  var oIframe = document.createElement('iframe');

  // Hide the iFrame from view
  oIframe.style.display = 'none';
  if (document.body)
    document.body.appendChild(oIframe);
  else
    document.documentElement.appendChild(oIframe);

  // Open the iFrame DOM object and write in our HTML
  oIframe.contentDocument.open();
  oIframe.contentDocument.write(html);
  oIframe.contentDocument.close();

  // Return the document body object containing the HTML that was just
  // added to the iFrame as DOM objects
  var oBody = oIframe.contentDocument.body;

  // TODO: Remove the iFrame object created to cleanup the DOM

  return oBody;
}

Running the unit test prints to the console the three Baidu links scraped from the page:

0: http://www.baidu.com/gaoji/preferences.html
1: http://passport.baidu.com/?login&tpl=mn
2: https://passport.baidu.com/?reg&tpl=mn

The test passes, which shows that scraping web pages with jQuery from a Java program is feasible.
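To carry scraped values such as these links into a spreadsheet, as the introduction suggests, one simple option is to write them out as CSV, which Excel can open. The following is a minimal sketch, not part of the original article: the class name `LinksToCsv`, its methods, and the `index,href` column layout are all hypothetical, and the links are hard-coded here rather than collected from the Rhino script.

```java
import java.util.Arrays;
import java.util.List;

public class LinksToCsv {

    // Quotes a field when it contains a comma, quote, or line break,
    // doubling any embedded quotes (the usual CSV convention).
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Builds a CSV document with a header row and one row per link.
    public static String toCsv(List<String> links) {
        StringBuilder sb = new StringBuilder("index,href\r\n");
        for (int i = 0; i < links.size(); i++) {
            sb.append(i).append(',').append(escape(links.get(i))).append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The three links from the console output above
        List<String> links = Arrays.asList(
                "http://www.baidu.com/gaoji/preferences.html",
                "http://passport.baidu.com/?login&tpl=mn",
                "https://passport.baidu.com/?reg&tpl=mn");
        System.out.print(toCsv(links));
    }
}
```

In a real run, the list would be filled by the script itself, for example through a Java collection handed to Rhino with `putObject`, as the commented-out lines in the test class hint.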
