Scraping web pages and clicking page buttons with HtmlUnit

绝影邪

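Concept

HtmlUnit is a headless (GUI-less) browser for Java. Everything is driven through its API, so a program can visit web pages the way a browser would. That lets you automate in bulk many tasks that would otherwise be done by hand.

Problem:

When scraping web pages, we often find that HTML loaded through JavaScript or AJAX is missing from what we fetch.

HtmlUnit solves this because it behaves like a browser and executes the page's JavaScript; the only thing it lacks is a user interface.

Version:

I tried versions 2.3 through 2.9 and none of them worked for me: every run failed with JavaScript execution errors, possibly because I was using them incorrectly.

The same code ran without problems on version 2.15.

The project uses Maven; the dependencies are: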

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.15</version>
</dependency>

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.3</version>
</dependency>

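Reference code:

Test 1: scraping a page

import java.net.URL;
import java.util.List;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Simulate a browser
WebClient webClient = new WebClient(BrowserVersion.getDefault());
// Do not abort on script errors or failing HTTP status codes
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setActiveXNative(false);
webClient.getOptions().setCssEnabled(false);
// Let HtmlUnit run AJAX calls synchronously where it can
webClient.setAjaxController(new NicelyResynchronizingAjaxController());

HtmlPage page = webClient.getPage(new URL(url));
// Wait up to 10 seconds for background JavaScript; this has to come after the page is loaded
webClient.waitForBackgroundJavaScript(10 * 1000);

String firstPageHtml = page.asXml();
Document doc = Jsoup.parse(firstPageHtml);
Element element = doc.select(".album-page-1").first();
String firstPagePhoto = element.html();
outputAppend(output, firstPagePhoto);

// Collect the pagination links: every <a> under <li class="pagination">.
// Other ways of locating nodes can also be expressed with XPath.
List<?> links = page.getByXPath("//li[@class='pagination']/a");

// Loop over the pagination links to fetch the remaining pages
if (links != null) {
    int size = links.size();
    if (size > 3) {
        int min = 2;
        int max = size - 2;
        for (int i = min; i <= max; i++) {
            // Clicking the buttons in a loop on the same page always returned the first page's data,
            // so each iteration requests the page again and clicks link i on the fresh copy.
            // If anyone knows how to solve this properly, please share.
            outputAppend(output, getNextPageContent(url, i));
        }
    }
}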

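// HtmlPager and WebClientFactory are presumably the author's own helper classes; they are not shown in this post
private String getNextPageContent(String url, int idx) throws IOException {
    HtmlPage newPage = HtmlPager.getPage(url);
    WebClientFactory.getWebClient().waitForBackgroundJavaScript(1000 * 4);
    List<?> newLinks = newPage.getByXPath("//li[@class='pagination']/a");
    HtmlAnchor newLink = (HtmlAnchor) newLinks.get(idx);
    newPage = newLink.click();
    // Assumption: the click triggers an AJAX load, so wait again before reading the DOM
    WebClientFactory.getWebClient().waitForBackgroundJavaScript(1000 * 4);
    // Read the HTML first, then close the windows
    String newHtml = newPage.asXml();
    WebClientFactory.closeAllWindows();
    Document doc = Jsoup.parse(newHtml);
    Element element = doc.select(".album-page-" + idx).first();
    // Return an empty string instead of failing when the expected block is missing
    return element == null ? "" : element.html();
}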
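
The index arithmetic in the loop above (start at 2, stop at size - 2, only when there are more than 3 links) is easy to get wrong by one. Below is a minimal sketch that isolates it in plain Java so it can be checked without a browser. The class and method names are hypothetical, and the reading that link 0 is a "previous" link, link 1 is the already scraped first page, and the last link is "next" is an assumption about the target page, not something the post states.

```java
import java.util.ArrayList;
import java.util.List;

public class PaginationIndexes {

    // Returns the link indexes the loop would click for a given number of
    // pagination links: 2 .. linkCount - 2, and nothing when linkCount <= 3.
    static List<Integer> pageLinkIndexes(int linkCount) {
        List<Integer> indexes = new ArrayList<>();
        if (linkCount > 3) {
            for (int i = 2; i <= linkCount - 2; i++) {
                indexes.add(i);
            }
        }
        return indexes;
    }

    public static void main(String[] args) {
        // Six links in total: indexes 2, 3 and 4 are visited
        System.out.println(pageLinkIndexes(6)); // prints [2, 3, 4]
    }
}
```

With this in place, the scraping loop could iterate over pageLinkIndexes(links.size()) instead of computing min and max inline.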

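That's all. The code in this post may not run as-is, because copying all of it over was tedious, but roughly 90% of the logic is here.

I hope HtmlUnit keeps getting more powerful.

I would also welcome advice from experts on how to click through the links in a loop and get the latest HTML produced by the JavaScript each time.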
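
One possible reason a click appears to return the first page is that the DOM is read before the script has finished updating it. That is an assumption, not something confirmed for this site. Under that assumption, a fixed sleep can be replaced by polling until the expected marker shows up. The sketch below is plain Java and deliberately not tied to HtmlUnit: the page read is passed in as a Supplier, and the names PollUntil and pollUntil are hypothetical.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

public class PollUntil {

    // Calls read.get() repeatedly until ready accepts the value or the
    // timeout elapses, then returns the last value that was read.
    static <T> T pollUntil(Supplier<T> read, Predicate<T> ready,
                           long timeoutMs, long intervalMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        T value = read.get();
        while (!ready.test(value) && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up early
                Thread.currentThread().interrupt();
                return value;
            }
            value = read.get();
        }
        return value;
    }

    public static void main(String[] args) {
        int[] reads = {0};
        // Stand-in for "serialize the current page": the third read shows page 2
        Supplier<String> readHtml = () -> ++reads[0] < 3
                ? "<div class=\"album-page-1\"></div>"
                : "<div class=\"album-page-2\"></div>";
        String html = pollUntil(readHtml, s -> s.contains("album-page-2"), 2000, 10);
        System.out.println(html);
    }
}
```

In the scraper, the supplier would be something like a lambda that returns the clicked page's asXml(), and the predicate would check for ".album-page-" + idx. Whether this fixes the first-page problem depends on how the site loads its pages.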

Test 2: logging in with a CAPTCHA

try {
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    // Do not abort on script errors or failing HTTP status codes
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setActiveXNative(false);
    webClient.getOptions().setCssEnabled(false);
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    HtmlPage page = webClient.getPage("http://www.***.com.cn/login.htm");
    // Wait for background JavaScript; this has to come after the page is loaded
    webClient.waitForBackgroundJavaScript(10 * 1000);

    // Locate the form fields and the CAPTCHA image by id
    HtmlElement username = (HtmlElement) page.getElementById("mobile");
    HtmlElement password = (HtmlElement) page.getElementById("password");
    HtmlElement valiCode = (HtmlElement) page.getElementById("checkCode");
    HtmlImage valiCodeImg = (HtmlImage) page.getElementById("checkCodeImage");
    ImageReader imageReader = valiCodeImg.getImageReader();
    BufferedImage bufferedImage = imageReader.read(0);

    // Show the CAPTCHA image in a window so that a person can read it
    JFrame f2 = new JFrame();
    JLabel l = new JLabel();
    l.setIcon(new ImageIcon(bufferedImage));
    f2.getContentPane().add(l);
    f2.setSize(300, 300);
    f2.setTitle("CAPTCHA");
    f2.setVisible(true);

    // Ask the user to type in the CAPTCHA text
    String valicodeStr = JOptionPane.showInputDialog("Please enter the CAPTCHA:");
    f2.setVisible(false);
    List<?> newLinks = page.getByXPath("//li[@class='login']/a");
    HtmlAnchor newlink = (HtmlAnchor) newLinks.get(0);

    // Fill in the form, then click the login link
    username.click();
    username.type("123456");
    password.click();
    password.type("123456");
    valiCode.click();
    valiCode.type(valicodeStr);
    HtmlPage logingPage = newlink.click();