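Requirement:
Some sites render their pages with JavaScript, so the crawler needs to fetch the content as it appears after the scripts have run.
Implementation:
Based on HtmlUnit: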
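import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Fetches a JS-rendered page with HtmlUnit 2.9 and prints the resulting DOM as XML.
public static void getAjaxPage() throws Exception {
    WebClient webClient = new WebClient();
    try {
        webClient.setJavaScriptEnabled(true);            // run the page's scripts
        webClient.setCssEnabled(false);                  // skip CSS, it is not needed for scraping
        webClient.setAjaxController(new NicelyResynchronizingAjaxController()); // run AJAX calls made from the main thread synchronously
        webClient.setTimeout(Integer.MAX_VALUE);         // effectively disables the HTTP timeout
        webClient.setThrowExceptionOnScriptError(false); // tolerate broken scripts on the page
        HtmlPage rootPage = webClient.getPage("http://tt.mop.com/read_14304066_1_0.html");
        System.out.println(rootPage.asXml());
    } finally {
        webClient.closeAllWindows();                     // release windows and background JS threads
    }
}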
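Content loaded by background scripts may not be in the DOM yet when getPage returns. One way to handle this is to poll until the expected content shows up. The following is a minimal sketch using only the Java standard library. The class name PollUntil and the method waitFor are hypothetical helpers made up for illustration and are not part of HtmlUnit.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not part of HtmlUnit): polls a condition until it holds or a timeout expires. */
public final class PollUntil {
    private PollUntil() {}

    /**
     * Re-checks the condition every intervalMillis until it returns true
     * or timeoutMillis has elapsed. Returns true if the condition was met,
     * false on timeout or if the waiting thread was interrupted.
     */
    public static boolean waitFor(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long start = System.nanoTime();
        long timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag for the caller
                return false;
            }
        }
    }
}
```

With the page from the example above, a call might look like PollUntil.waitFor(() -> rootPage.asXml().contains("expected-marker"), 10000, 200), where "expected-marker" is a placeholder for some text that only appears once the scripts have finished. HtmlUnit's own webClient.waitForBackgroundJavaScript(timeoutMillis) is an alternative when it is enough to wait for pending background jobs rather than for specific content.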
Maven dependencies:
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit-core-js</artifactId>
    <version>2.9</version>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.9</version>
    <scope>compile</scope>
</dependency>
Notes:
Nutch plugin: nutch-htmlunit is used to replace Nutch's built-in HTTP fetch component.