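The previous post covered the strengths and weaknesses of HtmlUnit for web scraping and small crawlers. This post looks at how HttpClient fares in the same role.
HttpClient started out as a subproject of Apache Jakarta Commons (it now lives in Apache HttpComponents) and provides an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol. Under the hood it relies on the Socket programming everyone already knows, as the connect method of the HttpClientConnectionOperator class shows: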
```java
public void connect(
        final ManagedHttpClientConnection conn,
        final HttpHost host,
        final InetSocketAddress localAddress,
        final int connectTimeout,
        final SocketConfig socketConfig,
        final HttpContext context) throws IOException {
    // First look up the socket factory registered for this scheme, if there is one
    final Lookup<ConnectionSocketFactory> registry = getSocketFactoryRegistry(context);
    final ConnectionSocketFactory sf = registry.lookup(host.getSchemeName());
    if (sf == null) {
        throw new UnsupportedSchemeException(host.getSchemeName() +
                " protocol is not supported");
    }
    // Good old InetAddress shows up here as well
    final InetAddress[] addresses = this.dnsResolver.resolve(host.getHostName());
    final int port = this.schemePortResolver.resolve(host);
    for (int i = 0; i < addresses.length; i++) {
        final InetAddress address = addresses[i];
        final boolean last = i == addresses.length - 1;
        Socket sock = sf.createSocket(context);
        sock.setReuseAddress(socketConfig.isSoReuseAddress());
        conn.bind(sock);
        final InetSocketAddress remoteAddress = new InetSocketAddress(address, port);
        if (this.log.isDebugEnabled()) {
            this.log.debug("Connecting to " + remoteAddress);
        }
        try {
            sock.setSoTimeout(socketConfig.getSoTimeout());
            // This is where the connection is actually opened. Once it is established
            // the stream can be read, and HttpClient consumes it with its own I/O layer.
            sock = sf.connectSocket(
                    connectTimeout, sock, host, remoteAddress, localAddress, context);
            sock.setTcpNoDelay(socketConfig.isTcpNoDelay());
            sock.setKeepAlive(socketConfig.isSoKeepAlive());
            final int linger = socketConfig.getSoLinger();
            if (linger >= 0) {
                sock.setSoLinger(linger > 0, linger);
            }
            conn.bind(sock);
            if (this.log.isDebugEnabled()) {
                this.log.debug("Connection established " + conn);
            }
            return;
        } catch (final SocketTimeoutException ex) {
            if (last) {
                throw new ConnectTimeoutException(ex, host, addresses);
            }
        } catch (final ConnectException ex) {
            if (last) {
                final String msg = ex.getMessage();
                if ("Connection timed out".equals(msg)) {
                    throw new ConnectTimeoutException(ex, host, addresses);
                } else {
                    throw new HttpHostConnectException(ex, host, addresses);
                }
            }
        }
        if (this.log.isDebugEnabled()) {
            this.log.debug("Connect to " + remoteAddress + " timed out. " +
                    "Connection will be retried using another IP address");
        }
    }
}
```
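In other words, HttpClient, like URLConnection, does its network communication over sockets. The JDK class probably has a slight edge in raw efficiency, yet we still pick HttpClient, and the main reason is simply laziness. HttpClient wraps Socket and the HTTP protocol at just the right level: it is not as high-level as HtmlUnit, nor as cumbersome to use as URLConnection, and it is both simple and highly extensible, which is why HtmlUnit itself uses HttpClient underneath.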
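To make the socket-level idea concrete, here is a minimal sketch of the same "try each resolved address in turn" loop using only JDK classes. It is not HttpClient's implementation: the class name `FallbackConnectSketch` and the method `connectToFirstReachable` are made up for illustration, it catches every `IOException` rather than only timeouts and refused connections, and it sets none of the socket options seen above.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of HttpClient: a plain-socket version of the fallback loop.
class FallbackConnectSketch {

    /**
     * Tries each address in order and returns the first socket that connects.
     * If every attempt fails, the last failure is rethrown.
     */
    static Socket connectToFirstReachable(InetAddress[] addresses, int port, int connectTimeoutMs)
            throws IOException {
        IOException lastFailure = null;
        for (InetAddress address : addresses) {
            Socket sock = new Socket();
            try {
                sock.connect(new InetSocketAddress(address, port), connectTimeoutMs);
                return sock; // connected, stop trying further addresses
            } catch (IOException ex) {
                lastFailure = ex; // remember the failure and fall through to the next address
                sock.close();
            }
        }
        if (lastFailure == null) {
            throw new IOException("No addresses to connect to");
        }
        throw lastFailure;
    }

    public static void main(String[] args) throws IOException {
        // "example.com" and port 80 are placeholder values for the demo.
        InetAddress[] addresses = InetAddress.getAllByName("example.com");
        try (Socket sock = connectToFirstReachable(addresses, 80, 3000)) {
            System.out.println("Connected to " + sock.getRemoteSocketAddress());
        }
    }
}
```

Everything HttpClient adds on top of this (scheme lookup, pluggable socket factories, timeouts mapped to its own exception types) is what saves us from writing such loops by hand.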
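Advantages (going by the list in the encyclopedia entry):
1. Implements all HTTP methods (GET, POST, PUT, HEAD and so on): they are indeed all implemented, though I have only ever used GET, POST and PUT myself, and the rest are probably less common. Java Servlet back ends mostly deal in GET and POST anyway, so the other methods rarely matter much.
2. Automatic redirection: this wording is a little misleading. What the code below shows is that responses with a status code below 200 are passed over automatically, and the client keeps reading until the final response arrives. See the doReceiveResponse method of the HttpRequestExecutor class: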
```java
protected HttpResponse doReceiveResponse(
        final HttpRequest request,
        final HttpClientConnection conn,
        final HttpContext context) throws HttpException, IOException {
    // They like to route these argument checks through the Args helper class
    Args.notNull(request, "HTTP request");
    Args.notNull(conn, "Client connection");
    Args.notNull(context, "HTTP context");
    HttpResponse response = null;
    int statusCode = 0;
    // Start receiving the response (the route has already been established and
    // connected by establishRoute in MainClientExec).
    // If the status code is below 200, keep receiving.
    while (response == null || statusCode < HttpStatus.SC_OK) {
        response = conn.receiveResponseHeader();
        if (canResponseHaveBody(request, response)) {
            conn.receiveResponseEntity(response);
        }
        statusCode = response.getStatusLine().getStatusCode();
    } // while intermediate response
    return response;
}
```
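At first I took this to mean following page redirects automatically, as in the example from the previous post. HttpClient 4.x does also follow 3xx redirects for GET and HEAD by default, but that is handled by its redirect strategy rather than by this loop.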
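The loop above can be illustrated without HttpClient at all. The sketch below reads raw response heads from a reader and discards interim ones until a status of 200 or higher appears. The class `InterimResponseSketch` and the method `readFinalStatus` are hypothetical names for this post, and the parsing is deliberately naive: it assumes well-formed status lines and interim responses without bodies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical illustration of "keep reading while the status code is below 200".
class InterimResponseSketch {

    /** Reads response heads until one has a status code of 200 or above and returns that code. */
    static int readFinalStatus(BufferedReader in) throws IOException {
        int statusCode = 0;
        while (statusCode < 200) {
            String statusLine = in.readLine(); // e.g. "HTTP/1.1 100 Continue"
            if (statusLine == null) {
                throw new IOException("Stream ended before a final response arrived");
            }
            statusCode = Integer.parseInt(statusLine.split(" ")[1]);
            // Consume this response's headers up to the blank line that ends the head.
            String header;
            while ((header = in.readLine()) != null && !header.isEmpty()) {
                // Headers are ignored in this sketch.
            }
        }
        return statusCode;
    }

    public static void main(String[] args) throws IOException {
        String wire = "HTTP/1.1 100 Continue\r\n\r\n"
                + "HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n";
        // The interim 100 response is skipped and the final status is printed.
        System.out.println(readFinalStatus(new BufferedReader(new StringReader(wire))));
    }
}
```

The typical case is a server answering `100 Continue` before the real response; the caller never sees the interim message.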
3. HTTPS support: HttpClient's SSL support is fairly comprehensive. The simplest setup:
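```java
// Imports as of HttpClient 4.3 (assumed from the classes used here)
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import org.apache.http.client.HttpClient;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLContextBuilder;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.impl.client.HttpClients;

private static HttpClient getSSLInsecureClient() throws Exception {
    // Build an SSL context that trusts every certificate chain. This is acceptable
    // only because we are testing; trusting everything in production is dangerous.
    SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustStrategy() {
        public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
            return true;
        }
    }).build();
    SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext);
    return HttpClients.custom()
            .setSSLSocketFactory(sslsf)
            // .setProxy(new HttpHost("127.0.0.1", 8888))
            .build();
}
```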
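4. Proxy server support: proxying is genuinely useful. You can point HttpClient's proxy at Fiddler (a very capable web debugging proxy), so that every request HttpClient sends is captured and can be inspected there. This is a very good approach while testing code.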