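Anyone who has written a crawler in Java has probably used htmlunit at some point. One day you may hit an exception that leaves you baffled: why does a crawler that passes local debugging fail when the very same code runs in someone else's environment? Is it the operating system? The environment? The network?
Let's look at the exception first: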
javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated
at com.sun.net.ssl.internal.ssl.SSLSessionImpl.getPeerCertificates(SSLSessionImpl.java:352)
at org.apache.http.conn.ssl.AbstractVerifier.verify(AbstractVerifier.java:126)
at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:572)
at com.gargoylesoftware.htmlunit.HtmlUnitSSLSocketFactory.connectSocket(HtmlUnitSSLSocketFactory.java:171)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:645)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:480)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at com.gargoylesoftware.htmlunit.HttpWebConnection.getResponse(HttpWebConnection.java:172)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseFromWebConnection(WebClient.java:1486)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseFromWebConnection(WebClient.java:1536)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponse(WebClient.java:1403)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:305)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:374)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:359)
at org.sunvalley.app.service.util.WebClientService.getPage(WebClientService.java:88)
at test.TestUrl.Test4(TestUrl.java:17)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:73)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:46)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:180)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:41)
at org.junit.runners.ParentRunner$1.evaluate(ParentRunner.java:173)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
at org.junit.runners.ParentRunner.run(ParentRunner.java:220)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:46)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
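At first glance this is an SSL certificate problem. That raises questions: htmlunit simulates a browser's behavior, so how would I supply a certificate on top of that? Is it a jar incompatibility, or something I missed? Do I need to rework httpclient? And why does the exception not appear on some systems? Is it a problem with the other person's machine? That widens the search considerably. Of course, two identical environments running the same code will most likely produce the same result, but two different environments running the same code may not.
Here is how I went about solving it:
1. Rework htmlunit. Since htmlunit is itself built on httpclient, I could go after httpclient directly. After some googling: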
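Before touching any code, it can help to compare the SSL-related setup of the two machines. The following is a minimal sketch using only the JDK; the class name `SslEnvironmentReport` and its helper methods are hypothetical names chosen for this illustration. It assumes the difference lies in the JVM itself (version, trust store, enabled protocols), which may or may not be the cause in your case.

```java
import java.security.KeyStore;
import java.util.Arrays;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

class SslEnvironmentReport {

    /** Counts the CA certificates in the JVM's default trust store. */
    static int countTrustedIssuers() throws Exception {
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init((KeyStore) null); // null selects the JVM's default trust store
        int count = 0;
        for (TrustManager tm : tmf.getTrustManagers()) {
            if (tm instanceof X509TrustManager) {
                count += ((X509TrustManager) tm).getAcceptedIssuers().length;
            }
        }
        return count;
    }

    /** Returns the SSL/TLS protocol versions this JVM enables by default. */
    static String[] defaultProtocols() throws Exception {
        return SSLContext.getDefault().getDefaultSSLParameters().getProtocols();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("java.version    = " + System.getProperty("java.version"));
        System.out.println("java.home       = " + System.getProperty("java.home"));
        // Prints "null" when no custom trust store has been configured.
        System.out.println("trustStore prop = " + System.getProperty("javax.net.ssl.trustStore"));
        System.out.println("trusted issuers = " + countTrustedIssuers());
        System.out.println("protocols       = " + Arrays.toString(defaultProtocols()));
    }
}
```

Running this on both machines and diffing the output is a cheap way to see whether the two "identical" setups really share the same JVM and trust store.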
SSLContext sslContext = SSLContext.getInstance("SSL");
// set up a TrustManager that trusts everything
sslContext.init(null, new TrustManager[] { new X509TrustManager() {
    public X509Certificate[] getAcceptedIssuers() {
        System.out.println("getAcceptedIssuers =============");
        return null;
    }

    public void checkClientTrusted(X509Certificate[] certs, String authType) {
        System.out.println("checkClientTrusted =============");
    }

    public void checkServerTrusted(X509Certificate[] certs, String authType) {
        System.out.println("checkServerTrusted =============");
    }
} }, new SecureRandom());
SSLSocketFactory sf = new SSLSocketFactory(sslContext);
Scheme httpsScheme = new Scheme("https", 443, sf);
SchemeRegistry schemeRegistry = new SchemeRegistry();
schemeRegistry.register(httpsScheme);
// apache HttpClient version >4.2 should use BasicClientConnectionManager
ClientConnectionManager cm = new SingleClientConnManager(schemeRegistry);
HttpClient httpClient = new DefaultHttpClient(cm);
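It had no effect... presumably because this standalone HttpClient is never handed to the WebClient: the stack trace shows htmlunit connecting through its own HtmlUnitSSLSocketFactory.
2. More searching on Google and Baidu turned up the following thread:
how to ignore ssl certificate error
The suggested solution: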
For future reference if someone wants to do the same thing its fairly
straight forward.
1)Create a package called
org.apache.commons.httpclient.contrib.ssl
add the two files.
EasySSLProtocolSocketFactory.java
EasyX509TrustManager.java
Compile.
2)Add the following lines of code to your htmlclient before making any
htmlunit calls
Protocol easyhttps = new Protocol("https", new
EasySSLProtocolSocketFactory(), 443);
Protocol.registerProtocol("https", easyhttps);
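The answer refers to the two Java files named above.
After making these changes, it still did not work.
Finally, I noticed this remark from the author: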
FYI, starting with version 1.14 you can use WebClient.setUseInsecureSSL(true) instead, and it will take care of all the HttpClient configuration behind the scenes.
Very exciting...
3. When I tried it, I found that WebClient has no such method; it has since been moved into the WebClientOptions class:
/**
 * If set to <code>true</code>, the client will accept connections to any host, regardless of
 * whether they have valid certificates or not. This is especially useful when you are trying to
 * connect to a server with expired or corrupt certificates.
 * @param useInsecureSSL whether or not to use insecure SSL
 */
public void setUseInsecureSSL(final boolean useInsecureSSL) {
    useInsecureSSL_ = useInsecureSSL;
}
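To make concrete what "accept connections to any host" implies, here is a minimal JDK-only sketch of an SSLContext whose trust manager accepts every certificate chain. This is an illustration of the general technique, not htmlunit's actual implementation; the names `TrustAllSketch`, `TrustAllManager` and `insecureContext` are hypothetical.

```java
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

class TrustAllSketch {

    /** A trust manager that performs no checks at all. */
    static final class TrustAllManager implements X509TrustManager {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {
            // Deliberately empty: every client chain is accepted.
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {
            // Deliberately empty: expired, self-signed or mismatched
            // server certificates all pass.
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            // An empty array rather than null, so callers can iterate safely.
            return new X509Certificate[0];
        }
    }

    /** Builds an SSLContext that trusts every peer. */
    static SSLContext insecureContext() throws Exception {
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, new TrustManager[] { new TrustAllManager() }, new SecureRandom());
        return ctx;
    }

    public static void main(String[] args) throws Exception {
        SSLContext ctx = insecureContext();
        System.out.println("protocol = " + ctx.getProtocol());
    }
}
```

Because nothing is verified, a context like this also accepts a man-in-the-middle, which is why the option carries "insecure" in its name; it fits a crawler hitting hosts with broken certificates, not code handling sensitive data.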
So in init(), that is, when constructing the WebClient, uncommenting the last (commented-out) line resolves this exception:
public void init() throws Exception {
    webclient = new WebClient(BrowserVersion.FIREFOX_17);
    webclient.getOptions().setJavaScriptEnabled(true);
    webclient.getOptions().setCssEnabled(false);
    webclient.getOptions().setRedirectEnabled(true);
    webclient.getOptions().setThrowExceptionOnScriptError(false);
    webclient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webclient.getOptions().setTimeout(600 * 1000);  // milliseconds
    webclient.setJavaScriptTimeout(600 * 1000);     // milliseconds
    webclient.getCookieManager().clearCookies();
    webclient.getCache().clear();
    webclient.setRefreshHandler(new ImmediateRefreshHandler());
    webclient.setAjaxController(new NicelyResynchronizingAjaxController());
    webclient.waitForBackgroundJavaScript(60 * 1000);
    // Uncomment the next line to accept any certificate and avoid the exception above:
    // webclient.getOptions().setUseInsecureSSL(true);
}