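I've recently been learning to scrape data with a web crawler. Using HtmlUnit, I can retrieve data that is loaded dynamically. However, some sites require logging in first, and I run into a problem when I try to fetch the data that is only available after login.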
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlButton;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlPasswordInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;
import com.gargoylesoftware.htmlunit.util.Cookie;

public static void TianyaTestByHtmlUnit() {
    // try-with-resources closes the WebClient even if an exception is thrown
    try (WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45)) {
        webClient.getOptions().setUseInsecureSSL(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setCssEnabled(false);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setRedirectEnabled(true);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        // A ScriptException is raised when the page's JavaScript contains an error.
        // Most browsers still manage to interpret such scripts, but HtmlUnit is
        // less tolerant, so script errors are not turned into exceptions here.
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.setJavaScriptTimeout(20000);
        webClient.getCookieManager().setCookiesEnabled(true);

        // Load the login page
        HtmlPage page = webClient.getPage("http://passport.tianya.cn/login.jsp");
        System.out.println("Original page data =" + page.asXml());

        // Fill in the login form
        HtmlTextInput username = (HtmlTextInput) page.getElementById("userName");
        username.type("lms_test_****");
        HtmlPasswordInput password = (HtmlPasswordInput) page.getElementById("password");
        password.click();
        password.type("liu*****");

        // HtmlAnchor submit = page.getAnchorByName("loginBtn");
        HtmlButton submit = (HtmlButton) page.getElementById("loginBtn");
        webClient.waitForBackgroundJavaScript(4000);
        HtmlPage nextPage = (HtmlPage) submit.click();

        // Wait for JavaScript to load the data
        webClient.waitForBackgroundJavaScript(10000);
        Thread.sleep(20000);
        System.out.println("After click login button =" + nextPage.asXml());

        // Collect the cookies held by the client after the login attempt
        Set<Cookie> cookies = webClient.getCookieManager().getCookies();
        Map<String, String> responseCookies = new HashMap<String, String>();
        for (Cookie c : cookies) {
            responseCookies.put(c.getName(), c.getValue());
            System.out.println("cookie name --" + c.getName() + " value:" + c.getValue());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The data returned by nextPage.asXml() is always almost the same as the page before login. Could someone please help me solve this?