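While crawling website data with a crawler built on webmagic, I found that some images could not be fetched. Comparing the failures showed that every image that failed reported the same error: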
Encrypted HTTPS traffic flows through this CONNECT tunnel. HTTPS Decryption is enabled in Fiddler, so decrypted sessions running in this tunnel will be shown in the Web Sessions list. Secure Protocol: Tls12 Cipher: Aes256 256bits Hash Algorithm: Sha384 ?bits Key Exchange: RsaKeyX 2048bits == Server Certificate ==========
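It looks like the transport-layer SSL settings are not quite right. First, a look at the source: webmagic 0.7.3 uses HttpClient 4.5.2. webmagic builds its HttpClient with the following code: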
private CloseableHttpClient generateClient(Site site) {
    HttpClientBuilder httpClientBuilder = HttpClients.custom();
    httpClientBuilder.setConnectionManager(connectionManager);
    if (site.getUserAgent() != null) {
        httpClientBuilder.setUserAgent(site.getUserAgent());
    } else {
        httpClientBuilder.setUserAgent("");
    }
    if (site.isUseGzip()) {
        httpClientBuilder.addInterceptorFirst(new HttpRequestInterceptor() {
            public void process(
                    final HttpRequest request,
                    final HttpContext context) throws HttpException, IOException {
                if (!request.containsHeader("Accept-Encoding")) {
                    request.addHeader("Accept-Encoding", "gzip");
                }
            }
        });
    }
    // Work around the post/redirect/post 302 redirect problem
    httpClientBuilder.setRedirectStrategy(new CustomRedirectStrategy());
    SocketConfig.Builder socketConfigBuilder = SocketConfig.custom();
    socketConfigBuilder.setSoKeepAlive(true).setTcpNoDelay(true);
    socketConfigBuilder.setSoTimeout(site.getTimeOut());
    SocketConfig socketConfig = socketConfigBuilder.build();
    httpClientBuilder.setDefaultSocketConfig(socketConfig);
    connectionManager.setDefaultSocketConfig(socketConfig);
    httpClientBuilder.setRetryHandler(new DefaultHttpRequestRetryHandler(site.getRetryTimes(), true));
    generateCookie(httpClientBuilder, site);
    return httpClientBuilder.build();
}
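It builds a customized HttpClient object through the custom() method, and nothing here sets the security-layer protocol explicitly.
Stepping through with a debugger turned up the security-layer setup. The socket factory is built as follows: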
private SSLConnectionSocketFactory buildSSLConnectionSocketFactory() {
    try {
        return new SSLConnectionSocketFactory(createIgnoreVerifySSL()); // prefer bypassing certificate validation
    } catch (KeyManagementException e) {
        logger.error("ssl connection fail", e);
    } catch (NoSuchAlgorithmException e) {
        logger.error("ssl connection fail", e);
    }
    return SSLConnectionSocketFactory.getSocketFactory();
}
The key part is the createIgnoreVerifySSL method, shown below:
private SSLContext createIgnoreVerifySSL() throws NoSuchAlgorithmException, KeyManagementException {
    // Implement X509TrustManager to bypass verification; the methods are intentionally left empty
    X509TrustManager trustManager = new X509TrustManager() {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) throws CertificateException {
        }
        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) throws CertificateException {
        }
        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return null;
        }
    };
    SSLContext sc = SSLContext.getInstance("SSLv3");
    sc.init(null, new TrustManager[] { trustManager }, null);
    return sc;
}
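Leaving the certificate-bypass logic aside for now, note that the SSLContext here is built for SSLv3, so TLSv1 is presumably not supported.
TLS and SSL are both security protocols layered on top of TCP/IP that provide data confidentiality and integrity; TLS is essentially the better successor to SSL.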
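One way to check the claim about the SSLv3 context is to ask the JDK which protocols a given context name enables by default. The sketch below uses only the standard JSSE API; ProtocolProbe and enabledProtocols are hypothetical names made up for this illustration, and the result depends on the JDK version and its security configuration, so no particular output is assumed.

```java
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.net.ssl.SSLContext;

// Hypothetical helper for illustration; not part of webmagic or HttpClient.
class ProtocolProbe {

    // Returns the protocols that a context with the given name enables by default,
    // or an empty array if the JDK does not know the name or cannot initialize it.
    static String[] enabledProtocols(String contextName) {
        try {
            SSLContext context = SSLContext.getInstance(contextName);
            context.init(null, null, null); // default key managers, trust managers and RNG
            String[] protocols = context.getDefaultSSLParameters().getProtocols();
            return protocols != null ? protocols : new String[0];
        } catch (GeneralSecurityException e) {
            return new String[0];
        }
    }

    public static void main(String[] args) {
        // Compare the context name used by webmagic with the generic "TLS" name.
        for (String name : new String[] { "SSLv3", "TLS" }) {
            System.out.println(name + " -> " + Arrays.toString(enabledProtocols(name)));
        }
    }
}
```

Running it on the same JVM as the crawler shows what the SSLv3 context can actually negotiate there, which is worth comparing with the protocol reported in the Fiddler capture.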
So how do we make HttpClient support TLSv1? Starting from the SSL configuration code in the official documentation and adapting it slightly gives:
// Default SSLContext; the custom trust material from the official example is commented out
SSLContext sslcontext = SSLContexts.custom()
        // .loadTrustMaterial(new File("my.keystore"), "nopassword".toCharArray(),
        //         new TrustSelfSignedStrategy())
        .build();
// Allow TLSv1 protocol only
SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(
        sslcontext,
        new String[] { "TLSv1" },
        null,
        SSLConnectionSocketFactory.getDefaultHostnameVerifier());
CloseableHttpClient httpclient = HttpClients.custom()
        .setSSLSocketFactory(sslsf)
        .build();
try {
    HttpGet httpget = new HttpGet("");
    System.out.println("Executing request " + httpget.getRequestLine());
    CloseableHttpResponse response = httpclient.execute(httpget);
    try {
        HttpEntity entity = response.getEntity();
        System.out.println("----------------------------------------");
        System.out.println(response.getStatusLine());
        // Read the whole response body and save it as a local file
        byte[] bytes = EntityUtils.toByteArray(entity);
        FileUtils.writeByteArrayToFile(new File("E:/test"), bytes);
    } finally {
        response.close();
    }
} finally {
    httpclient.close();
}
Explicitly setting the TLSv1 protocol on the socketFactory works in testing.
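As far as I can tell, the protocol array passed to SSLConnectionSocketFactory ends up as the enabled-protocol list on the underlying SSL socket. The sketch below shows the same idea with the plain JSSE API, using an SSLEngine so that no network connection is needed. ProtocolPinning, pickSupported and pinnedEngine are hypothetical names, and listing "TLSv1.2" next to "TLSv1" is my own assumption based on the protocol shown in the Fiddler capture.

```java
import java.security.GeneralSecurityException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

// Hypothetical helper for illustration; not part of HttpClient or webmagic.
class ProtocolPinning {

    // Keeps only the wanted protocols that this JDK supports,
    // because setEnabledProtocols rejects unsupported names.
    static String[] pickSupported(String[] wanted, String[] supported) {
        List<String> supportedList = Arrays.asList(supported);
        List<String> picked = new ArrayList<>();
        for (String protocol : wanted) {
            if (supportedList.contains(protocol)) {
                picked.add(protocol);
            }
        }
        return picked.toArray(new String[0]);
    }

    // Creates a client-mode engine restricted to the wanted protocols.
    static SSLEngine pinnedEngine(String... wanted) {
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, null, null); // default key managers, trust managers and RNG
            SSLEngine engine = context.createSSLEngine();
            engine.setUseClientMode(true);
            engine.setEnabledProtocols(pickSupported(wanted, engine.getSupportedProtocols()));
            return engine;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("could not set up the TLS context", e);
        }
    }

    public static void main(String[] args) {
        SSLEngine engine = pinnedEngine("TLSv1.2", "TLSv1");
        System.out.println(Arrays.toString(engine.getEnabledProtocols()));
    }
}
```

Note that a protocol can be supported by the JDK and still be refused at handshake time if the JVM's security properties disable it, so the list printed here is an upper bound on what will be negotiated.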
The demo downloads an image and saves it locally; in testing, the downloaded image opens normally.
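Coming back to createIgnoreVerifySSL, a variant that requests the generic "TLS" context instead of "SSLv3" might look like the sketch below. This is not webmagic's code: TrustAllTlsContext is a hypothetical class, and whether and how such a context can be plugged into webmagic depends on the version in use, so treat the wiring as an open assumption. Like the original, it accepts every certificate, which is only reasonable for experiments.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical variant of createIgnoreVerifySSL; not taken from webmagic.
class TrustAllTlsContext {

    static SSLContext create() {
        // Accepts every certificate chain, which disables server authentication.
        X509TrustManager trustAll = new X509TrustManager() {
            @Override
            public void checkClientTrusted(X509Certificate[] chain, String authType) {
            }

            @Override
            public void checkServerTrusted(X509Certificate[] chain, String authType) {
            }

            @Override
            public X509Certificate[] getAcceptedIssuers() {
                return new X509Certificate[0]; // the interface expects a non-null array
            }
        };
        try {
            // "TLS" lets the JDK pick its default set of TLS versions.
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, new TrustManager[] { trustAll }, null);
            return context;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("could not create the TLS context", e);
        }
    }

    public static void main(String[] args) {
        SSLContext context = create();
        System.out.println("Context protocol name: " + context.getProtocol());
    }
}
```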