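1. Introduction to HttpClient
HttpClient started as a subproject of Apache Jakarta Commons and is now maintained as part of Apache HttpComponents. It is an efficient, feature-rich client-side toolkit for the HTTP protocol. Compared with URLConnection and HttpURLConnection from the java.net package, HttpClient is easier to use and more flexible. Java web crawlers commonly use HttpClient to send requests to a server and fetch the response. The official website provides a tutorial for HttpClient.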
2. Downloading the JAR
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.5</version>
</dependency>
HttpClient itself depends on the commons-codec, commons-logging, and httpcore JARs, which Maven resolves transitively.
3. Requesting a URL
3.1 Creating an HttpClient
HttpClient's main job is to execute HTTP request methods and obtain the response. Before executing a request you need an HttpClient instance. The code below shows six ways to create one; the first is deprecated and no longer recommended.
// Ways to instantiate HttpClient
// HttpClients.custom() returns HttpClientBuilder.create()
// HttpClients.createDefault() is equivalent to HttpClients.custom().build()
HttpClient httpClient1 = new DefaultHttpClient(); // deprecated
HttpClient httpClient2 = HttpClients.custom().build();
HttpClient httpClient3 = HttpClientBuilder.create().build();
CloseableHttpClient httpClient4 = HttpClients.createDefault();
HttpClient httpClient5 = HttpClients.createSystem();
HttpClient httpClient6 = HttpClients.createMinimal();
3.2 Creating a request method instance
HttpClient supports the HTTP/1.1 methods GET, POST, HEAD, PUT, DELETE, OPTIONS, and TRACE. Each method has a corresponding class: HttpGet, HttpPost, HttpHead, HttpPut, HttpDelete, HttpOptions, and HttpTrace. Web crawlers mostly use HttpGet and HttpPost. The HttpClient source shows that each of these classes can be instantiated in three ways; the program below uses HttpGet as the example.
With the first way, the request URL has to be set separately after construction; the second way takes a URI object (uniform resource identifier); the third takes the URI as a string.
public static void main(String[] args) throws URISyntaxException {
    // First way: no-argument constructor, then set the URI
    String personalUrl = "Http://www.*****.com/index.html";
    URI uri = new URIBuilder(personalUrl).build();
    HttpGet getMethod = new HttpGet();
    getMethod.setURI(uri);
    System.out.println(getMethod);
    // Second way: pass a URI object
    HttpGet httpGetUri = new HttpGet(uri);
    System.out.println(httpGetUri);
    // Third way: pass the URI as a string
    HttpGet httpGetStr = new HttpGet(personalUrl);
    System.out.println(httpGetStr);
}
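The practical difference between the URI-based and string-based constructors is when and how a malformed URL is reported. The sketch below uses only java.net.URI from the JDK to illustrate that difference; the class name UriParsingSketch, its two helper methods, and the example.com URLs are hypothetical. In HttpClient 4.x, HttpGet(String) is documented to throw IllegalArgumentException for an invalid URI, which matches the unchecked variant here, while URIBuilder(String) declares the checked URISyntaxException.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriParsingSketch {

    // Checked variant: the caller has to handle URISyntaxException.
    static URI parseChecked(String url) throws URISyntaxException {
        return new URI(url);
    }

    // Unchecked variant: URI.create reports the same problem as IllegalArgumentException.
    static URI parseUnchecked(String url) {
        return URI.create(url);
    }

    public static void main(String[] args) throws URISyntaxException {
        URI uri = parseChecked("http://example.com/index.html?page=1");
        System.out.println(uri.getScheme() + " | " + uri.getHost()
                + " | " + uri.getPath() + " | " + uri.getQuery());

        try {
            parseUnchecked("http://example.com/a b"); // a raw space is illegal in a URI
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Parsing the URL once into a URI and reusing it, as the second way does, keeps the validation in one place when the same address is requested repeatedly.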
3.3 Executing the request
With an HttpClient instance, call execute(HttpUriRequest request) to perform the request; it returns an HttpResponse.
public static void main(String[] args) {
    CloseableHttpClient httpClient = HttpClients.createDefault();
    HttpGet httpGet = new HttpGet("http://www.baidu.com");
    // First way: start from a placeholder response, then overwrite it with the real one
    HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1, HttpStatus.SC_OK, "OK");
    try {
        response = httpClient.execute(httpGet);
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Second way: declare an HttpResponse and assign the result
    HttpResponse httpResponse = null;
    try {
        httpResponse = httpClient.execute(httpGet);
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Third way: CloseableHttpResponse extends HttpResponse
    CloseableHttpResponse closeableHttpResponse = null;
    try {
        closeableHttpResponse = httpClient.execute(httpGet);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The HttpClient interface also declares other execute methods; three of them are listed below.
// Method 4: HttpContext is the execution context of the HTTP request
HttpResponse execute(HttpUriRequest request, HttpContext context) throws IOException, ClientProtocolException;
// Method 5: HttpHost is the target host the request is sent to
HttpResponse execute(HttpHost target, HttpRequest request) throws IOException, ClientProtocolException;
// Method 6: target host plus execution context
HttpResponse execute(HttpHost target, HttpRequest request, HttpContext context) throws IOException, ClientProtocolException;
3.4 Reading the response
From the HttpResponse returned by the execute methods above, you can read the status code, protocol version, response headers, response entity, and other information.
public class GetHttpResponseInfoDemo {
    public static void main(String[] args) throws IOException {
        // Create the HttpContext
        HttpContext localContext = new BasicHttpContext();
        String url = "http://www.baidu.com";
        // Create the HttpClient
        HttpClient httpClient = HttpClients.custom().build();
        HttpGet httpGet = new HttpGet(url);
        // Execute the request; an IOException propagates to the caller
        // instead of leaving httpResponse null
        HttpResponse httpResponse = httpClient.execute(httpGet, localContext);
        // Print the response object
        System.out.println("response:" + httpResponse);
        // Status line
        String status = httpResponse.getStatusLine().toString();
        System.out.println("status:" + status);
        // Status code
        int statusCode = httpResponse.getStatusLine().getStatusCode();
        System.out.println("StatusCode:" + statusCode);
        // Protocol version
        ProtocolVersion protocolVersion = httpResponse.getProtocolVersion();
        System.out.println("ProtocolVersion:" + protocolVersion);
        // Reason phrase, for example "OK"
        String phrase = httpResponse.getStatusLine().getReasonPhrase();
        System.out.println("phrase:" + phrase);
        Header[] headers = httpResponse.getAllHeaders();
        System.out.println("Response headers:");
        for (Header header : headers) {
            System.out.println(header);
        }
        System.out.println("End of response headers");
        if (statusCode == HttpStatus.SC_OK) {
            // Read the entity content; set the charset explicitly
            HttpEntity entity = httpResponse.getEntity();
            String entityString = EntityUtils.toString(entity, "UTF-8");
            // Print the content
            System.out.println(entityString);
        }
        // Consume the entity so that its content stream is closed
        EntityUtils.consume(httpResponse.getEntity());
    }
}
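4. The EntityUtils class
The EntityUtils class provides helpers for working with response entities. For example, when the response entity is HTML, any of the following three methods converts it directly to a string.
// The charset can be specified; it is used when the entity does not declare one
public static String toString(final HttpEntity entity, final String defaultCharset)
// The charset can be specified; it is used when the entity does not declare one
public static String toString(final HttpEntity entity, final Charset defaultCharset)
// Falls back to ISO-8859-1 when the entity does not declare a charset
public static String toString(final HttpEntity entity)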
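Why the charset argument matters can be seen without any HTTP traffic. The sketch below decodes the same bytes twice with plain JDK classes: once with the charset the bytes were written in and once with ISO-8859-1, the fallback mentioned above. The class CharsetFallbackSketch and the two-character sample body are hypothetical stand-ins for a real response.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetFallbackSketch {

    // Decode a response body with the given charset.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written with Unicode escapes so the source file stays ASCII.
        String original = "\u4e2d\u6587";
        byte[] body = original.getBytes(StandardCharsets.UTF_8); // 3 bytes per character

        String correct = decode(body, StandardCharsets.UTF_8);
        String garbled = decode(body, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 decoding matches the original: " + correct.equals(original));
        System.out.println("ISO-8859-1 decoding yields " + garbled.length() + " characters");
    }
}
```

ISO-8859-1 maps every byte to exactly one character, so the wrong decoding never fails; it silently produces one garbled character per byte, which is why crawled pages should be decoded with an explicitly chosen charset.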
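EntityUtils also provides a method that converts the response entity to a byte array:
public static byte[] toByteArray(final HttpEntity entity)
For images, PDFs, archives, and similar files, you can first convert the response entity to a byte array and then write it to a file through a buffered stream; the full program is presented in Chapter 6. In addition, to make sure system resources are released, you can call the following method to consume the entity.
public static void consume(final HttpEntity entity)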
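The byte-array route described above can be sketched with JDK classes alone. In the sketch, the content array stands in for the result of EntityUtils.toByteArray(entity); the class ByteArrayToFileSketch, the method save, and the temporary file are hypothetical names chosen for illustration.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class ByteArrayToFileSketch {

    // Write the bytes through a buffered stream; try-with-resources flushes and closes it.
    static void save(byte[] content, Path target) throws IOException {
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            out.write(content);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the bytes of a downloaded image or archive.
        byte[] content = {(byte) 0x89, 'P', 'N', 'G'};
        Path target = Files.createTempFile("download-", ".bin");
        save(content, target);
        System.out.println("wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

Holding the whole body in a byte array is convenient for small files; for large downloads, streaming the entity to the output stream avoids keeping everything in memory.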
5. Setting request headers
5.1 Setting headers individually
public static void main(String[] args) throws IOException {
    HttpClient httpClient = HttpClients.custom().build();
    HttpGet httpget = new HttpGet("http://www.baidu.com");
    // Request headers
    httpget.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
    httpget.setHeader("Accept-Encoding", "gzip, deflate");
    httpget.setHeader("Accept-Language", "zh-CN,zh;q=0.9");
    httpget.setHeader("Cache-Control", "max-age=0");
    httpget.setHeader("Host", "www.baidu.com");
    httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36"); // this header is important
    // Send the GET request
    HttpResponse response = httpClient.execute(httpget);
    // Status code
    int code = response.getStatusLine().getStatusCode();
    // Entity that wraps the page content stream
    HttpEntity httpEntity = response.getEntity();
    // Read it as a string (the charset must be set)
    String entity = EntityUtils.toString(httpEntity, "utf8");
    System.out.println(code + "\n" + entity); // print the fetched content
    EntityUtils.consume(httpEntity); // close the content stream
}
5.2 Wrapping headers in a collection
public static void main(String[] args) throws IOException {
    // Collect the headers in a list
    List<Header> headerList = new ArrayList<Header>();
    headerList.add(new BasicHeader(HttpHeaders.ACCEPT, "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"));
    headerList.add(new BasicHeader(HttpHeaders.USER_AGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36"));
    headerList.add(new BasicHeader(HttpHeaders.ACCEPT_ENCODING, "gzip, deflate"));
    headerList.add(new BasicHeader(HttpHeaders.CACHE_CONTROL, "max-age=0"));
    headerList.add(new BasicHeader(HttpHeaders.CONNECTION, "keep-alive"));
    headerList.add(new BasicHeader(HttpHeaders.ACCEPT_LANGUAGE, "zh-CN,zh;q=0.9"));
    headerList.add(new BasicHeader(HttpHeaders.HOST, "www.********.com.cn"));
    // Build a custom HttpClient that sends these headers with every request
    HttpClient httpClient = HttpClients.custom().setDefaultHeaders(headerList).build();
    // Request method to use
    HttpGet httpget = new HttpGet("http://www.********.com.cn/b.asp");
    // Send the GET request
    HttpResponse response = httpClient.execute(httpget);
    // Status code
    int code = response.getStatusLine().getStatusCode();
    // Entity that wraps the page content stream
    HttpEntity httpEntity = response.getEntity();
    // Read it as a string (the charset must be set)
    String entity = EntityUtils.toString(httpEntity, "gbk");
    System.out.println(code + "\n" + entity); // print the fetched content
    EntityUtils.consume(httpEntity); // close the content stream
}
6. Submitting a form with POST
Web crawlers often need to submit forms, especially to simulate a login. HttpClient provides the entity class UrlEncodedFormEntity to make form submission easy; the following code shows how it is used.
// Build a list of NameValuePair objects holding the parameters to send
List<NameValuePair> nvps = new ArrayList<NameValuePair>();
nvps.add(new BasicNameValuePair("param1", "value1"));
nvps.add(new BasicNameValuePair("param2", "value2"));
UrlEncodedFormEntity entity = new UrlEncodedFormEntity(nvps, Consts.UTF_8);
HttpPost httppost = new HttpPost("http://localhost/handler.do");
httppost.setEntity(entity);
The UrlEncodedFormEntity instance URL-encodes the parameters and produces the following content:
param1=value1&param2=value2
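To see where that body comes from, the following sketch builds an application/x-www-form-urlencoded string with the JDK's URLEncoder. It approximates what a form entity has to produce and is not HttpClient's own implementation; the class FormEncodingSketch and its methods are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class FormEncodingSketch {

    // Join name=value pairs with '&', URL-encoding every name and value.
    static String encode(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("param1", "value1");
        params.put("param2", "value2");
        System.out.println(encode(params)); // prints param1=value1&param2=value2
    }
}
```

Values that contain reserved characters are escaped, so a value such as `a b&c` is sent as `a+b%26c` and cannot be mistaken for a parameter separator.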
public static void main(String[] args) throws IOException {
    // Create the HttpClient; reusing this one instance keeps the login cookies
    HttpClient httpclient = HttpClients.custom().build();
    String renRenLoginURL = "http://www.******.com/ajaxLogin/login?1=1&uniqueTimestamp=2018922138705"; // login URL
    // The login request uses POST
    HttpPost httpost = new HttpPost(renRenLoginURL);
    // Build a list of NameValuePair objects holding the parameters to send
    List<NameValuePair> nvps = new ArrayList<NameValuePair>();
    // Enter your email address
    nvps.add(new BasicNameValuePair("email", ""));
    // Enter your password
    nvps.add(new BasicNameValuePair("password", ""));
    try {
        // Submit the form parameters
        httpost.setEntity(new UrlEncodedFormEntity(nvps, Consts.UTF_8));
        HttpResponse response = httpclient.execute(httpost);
        System.out.println(response.getStatusLine());
        // Read the body before the connection is released; set the charset
        String entityString = EntityUtils.toString(response.getEntity(), "gbk");
        System.out.println(entityString);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        httpost.abort(); // release the connection
    }
    // After logging in, request the page you need; here it is a friend's profile page
    HttpGet httpget = new HttpGet("http://www.******.com/465530468/profile?v=info_timeline");
    // Build the ResponseHandler
    ResponseHandler<String> responseHandler = new BasicResponseHandler();
    String responseBody = "";
    try {
        responseBody = httpclient.execute(httpget, responseHandler);
    } catch (Exception e) {
        e.printStackTrace();
        responseBody = null;
    } finally {
        httpget.abort(); // release the connection
    }
    System.out.println(responseBody); // print the fetched content
}
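7. Timeout settings
HttpClient lets you configure three timeouts: ConnectionRequestTimeout (the time to wait for a connection from the connection manager), ConnectTimeout (the time to establish the connection), and SocketTimeout (the time to wait for data). To configure them, call the custom() method of HttpClient's RequestConfig class, which returns an instance of the inner class Builder.
Builder configures the request-related fields. Besides timeouts, it can also set the proxy (proxy), the cookie specification (cookieSpec), whether HTTP authentication is enabled, and so on.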
public static RequestConfig.Builder custom() {
    return new Builder();
}
The Builder inner class:
private boolean expectContinueEnabled;
private HttpHost proxy;
private InetAddress localAddress;
private boolean staleConnectionCheckEnabled;
private String cookieSpec;
private boolean redirectsEnabled;
private boolean relativeRedirectsAllowed;
private boolean circularRedirectsAllowed;
private int maxRedirects;
private boolean authenticationEnabled;
private Collection<String> targetPreferredAuthSchemes;
private Collection<String> proxyPreferredAuthSchemes;
private int connectionRequestTimeout;
private int connectTimeout;
private int socketTimeout;
private boolean contentCompressionEnabled;
Builder() {
    super();
    this.staleConnectionCheckEnabled = false;
    this.redirectsEnabled = true;
    this.maxRedirects = 50;
    this.relativeRedirectsAllowed = true;
    this.authenticationEnabled = true;
    this.connectionRequestTimeout = -1;
    this.connectTimeout = -1;
    this.socketTimeout = -1;
    this.contentCompressionEnabled = true;
}
The following program uses the RequestConfig class to configure the timeouts:
public static void main(String[] args) throws IOException {
    // Set all three timeouts to 10 seconds
    RequestConfig requestConfig = RequestConfig.custom()
            .setSocketTimeout(10000)
            .setConnectTimeout(10000)
            .setConnectionRequestTimeout(10000)
            .build();
    // Configure the HttpClient
    HttpClient httpClient = HttpClients.custom()
            .setDefaultRequestConfig(requestConfig)
            .build();
    HttpGet httpGet = new HttpGet("http://www.********.com.cn/b.asp");
    // Execute the request; a timeout surfaces as an IOException
    // instead of leaving the response null
    HttpResponse response = httpClient.execute(httpGet);
    String result = EntityUtils.toString(response.getEntity(), "gbk"); // read the HTML result
    System.out.println(result); // print the result
}
The timeouts can also be set on an individual request method instance:
public static void main(String[] args) throws IOException {
    // Create the HttpClient
    HttpClient httpClient = HttpClients.createDefault();
    HttpGet httpGet = new HttpGet("http://www.********.com.cn/b.asp"); // GET request
    RequestConfig requestConfig = RequestConfig.custom()
            .setSocketTimeout(2000)
            .setConnectTimeout(2000)
            .setConnectionRequestTimeout(10000)
            .build(); // set the timeouts
    httpGet.setConfig(requestConfig); // apply the configuration to this request
    // Execute the request; a timeout surfaces as an IOException
    // instead of leaving the response null
    HttpResponse response = httpClient.execute(httpGet);
    String result = EntityUtils.toString(response.getEntity(), "gbk"); // read the HTML result
    System.out.println(result); // print the result
}
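To make the difference between the connect timeout and the socket timeout concrete, here is a JDK-only sketch that does not use HttpClient at all. It connects to a local server socket that never sends any data, so the connect step succeeds quickly and the read step runs into the socket timeout (SO_TIMEOUT, the option that the socket timeout corresponds to). The class SocketTimeoutSketch and the method readTimesOut are hypothetical; the third timeout, ConnectionRequestTimeout, concerns the connection pool and is not modelled here.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class SocketTimeoutSketch {

    // Connects to a local server that never writes, then reports whether the read timed out.
    static boolean readTimesOut(int connectTimeoutMs, int readTimeoutMs) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket()) {
            // Connect timeout: upper bound for establishing the TCP connection.
            client.connect(new InetSocketAddress(loopback, silentServer.getLocalPort()), connectTimeoutMs);
            // Socket timeout: upper bound for each blocking read.
            client.setSoTimeout(readTimeoutMs);
            try {
                client.getInputStream().read(); // blocks because the server sends nothing
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(1000, 200));
    }
}
```

A crawler usually wants both values set: a short connect timeout to give up on unreachable hosts quickly, and a socket timeout sized to the slowest response it is willing to wait for.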
8. Using a proxy server
public static void main(String[] args) throws IOException {
    RequestConfig defaultRequestConfig = RequestConfig.custom()
            .setProxy(new HttpHost("171.221.239.11", 808, null))
            .build(); // add the proxy
    HttpGet httpGet = new HttpGet("http://www.********.com.cn/b.asp");
    HttpClient httpClient = HttpClients.custom()
            .setDefaultRequestConfig(defaultRequestConfig)
            .build(); // configure the HttpClient
    // Execute the request
    HttpResponse httpResponse = httpClient.execute(httpGet);
    if (httpResponse.getStatusLine().getStatusCode() == 200) {
        String result = EntityUtils.toString(httpResponse.getEntity(), "gbk");
        System.out.println(result); // print the result
    }
}
As with timeouts, the proxy server can also be configured on an individual request method instance:
public static void main(String[] args) throws IOException {
    HttpClient httpClient = HttpClients.custom().build(); // create the HttpClient
    // Configure the proxy
    HttpHost proxy = new HttpHost("171.221.239.11", 808, null);
    RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
    HttpGet httpGet = new HttpGet("http://www.********.com.cn/b.asp");
    httpGet.setConfig(config); // apply the proxy to this request only
    HttpResponse httpResponse = httpClient.execute(httpGet);
    if (httpResponse.getStatusLine().getStatusCode() == 200) {
        String result = EntityUtils.toString(httpResponse.getEntity(), "gbk");
        System.out.println(result);
    }
}
9. Downloading files
To download HTML, images, PDFs, archives, and other files, one approach is to convert the response entity to a byte array with EntityUtils and then write it to a file through an output stream. The other is to call the writeTo(OutputStream) method of HttpEntity, which writes the response entity straight to the given output stream; this approach is simple and commonly used.
public static void main(String[] args) throws IOException {
    String url = "https://www-us.******.org/dist//httpd/httpd-2.4.37.tar.gz";
    // Create the HttpClient
    HttpClient httpClient = HttpClients.custom().build();
    HttpGet httpGet = new HttpGet(url);
    // Execute the request; an IOException propagates to the caller
    HttpResponse httpResponse = httpClient.execute(httpGet);
    // A very simple way to download a file; try-with-resources closes the stream
    try (OutputStream out = new FileOutputStream("file/httpd-2.4.37.tar.gz")) {
        httpResponse.getEntity().writeTo(out);
    }
    EntityUtils.consume(httpResponse.getEntity()); // consume the entity
}
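10. HTTPS requests and certificates
As with Jsoup, requesting a URL that starts with https:// directly with HttpClient sometimes fails with an error saying that no valid certificate could be found for the requested target. One workaround is to build an HttpClient that trusts every certificate:
- First, use the inner class SSL509TrustManager to create an X.509 certificate trust manager;
- Next, create an SSLConnectionSocketFactory for the SSL connections and use a Registry to register the http and https socket factories;
- Then, instantiate a PoolingHttpClientConnectionManager as the connection pool manager;
- Finally, build an HttpClient that can execute HTTPS requests from the connection pool manager and the RequestConfig settings.
These steps are implemented in the method HttpClient initSSLClient(String SSLProtocolVersion) of the SSLClient class. The parameter SSLProtocolVersion names the protocol passed to SSLContext.getInstance(), for example "TLSv1.2". "SSLv3" is also accepted as a name, but SSLv3 is insecure and disabled by default in current JDKs, so a TLS version is the better choice. An HttpClient created by this method can request URLs that start with https://.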
public class SSLClient {
    /**
     * Configure an HttpClient for SSL/TLS
     * @param SSLProtocolVersion (SSL, SSLv3, TLS, TLSv1, TLSv1.1, TLSv1.2)
     * @return HttpClient
     */
    public HttpClient initSSLClient(String SSLProtocolVersion) {
        RequestConfig defaultConfig = null;
        PoolingHttpClientConnectionManager pcm = null;
        try {
            X509TrustManager xtm = new SSL509TrustManager(); // create the trust manager
            // Create the SSLContext and initialize it with the trust manager
            SSLContext context = SSLContext.getInstance(SSLProtocolVersion);
            context.init(null, new X509TrustManager[]{xtm}, null);
            /* Build the SSLConnectionSocketFactory from the SSLContext.
               NoopHostnameVerifier.INSTANCE turns hostname verification off,
               so a certificate is accepted whatever host name it was issued for.
             */
            SSLConnectionSocketFactory sslConnectionSocketFactory = new SSLConnectionSocketFactory(context, NoopHostnameVerifier.INSTANCE);
            // Global request configuration, including the cookie specification
            defaultConfig = RequestConfig.custom().setCookieSpec(CookieSpecs.STANDARD_STRICT)
                    .setExpectContinueEnabled(true)
                    .setTargetPreferredAuthSchemes(Arrays.asList(AuthSchemes.NTLM, AuthSchemes.DIGEST))
                    .setProxyPreferredAuthSchemes(Arrays.asList(AuthSchemes.BASIC)).build();
            // Register the http and https socket factories
            Registry<ConnectionSocketFactory> sfr = RegistryBuilder.<ConnectionSocketFactory>create()
                    .register("http", PlainConnectionSocketFactory.INSTANCE)
                    .register("https", sslConnectionSocketFactory).build();
            // Create the connection manager from the registry
            pcm = new PoolingHttpClientConnectionManager(sfr);
        } catch (NoSuchAlgorithmException | KeyManagementException e) {
            e.printStackTrace();
        }
        // Build the HttpClient from the connection manager and the configuration
        HttpClient httpClient = HttpClients.custom()
                .setConnectionManager(pcm).setDefaultRequestConfig(defaultConfig)
                .build();
        return httpClient;
    }
    // Implementation of the X509TrustManager interface
    private static class SSL509TrustManager implements X509TrustManager {
        // Check client certificates
        public void checkClientTrusted(X509Certificate[] x509Certificates, String s) {
            // do nothing: accept any client certificate
        }
        // Check server certificates
        public void checkServerTrusted(X509Certificate[] x509Certificates, String s) {
            // do nothing: accept any server certificate
        }
        // Return the trusted X.509 issuer certificates
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }
    }
}
public class HttpsDemo {
    public static void main(String[] args) throws IOException {
        String url = "https://cn.*******.com";
        SSLClient sslClient = new SSLClient(); // instantiate
        // SSLv3 is disabled by default in current JDKs, so request a TLS version
        HttpClient httpClientSSL = sslClient.initSSLClient("TLSv1.2");
        HttpGet httpGet = new HttpGet(url);
        // Execute the request; an IOException propagates to the caller
        HttpResponse httpResponse = httpClientSSL.execute(httpGet);
        if (httpResponse.getStatusLine().getStatusCode() == HttpStatus.SC_OK) { // 200 means success
            // Read the entity content
            String entity = EntityUtils.toString(httpResponse.getEntity(), "UTF-8");
            // Print the entity content
            System.out.println(entity);
        } else {
            // Consume the entity to close its content stream
            EntityUtils.consume(httpResponse.getEntity());
        }
    }
}
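11. Retrying requests
Requests made with HttpClient sometimes fail with an exception. For non-fatal exceptions, retrying the request can solve the problem.
HttpClient provides a default retry strategy, DefaultHttpRequestRetryHandler. The class implements the HttpRequestRetryHandler interface and overrides its retryRequest() method.
DefaultHttpRequestRetryHandler retries up to 3 times by default; idempotent methods (GET and HEAD, for example) can be retried, and so can a request that failed before it was fully sent. Four kinds of exception are never retried: InterruptedIOException (interrupted I/O), UnknownHostException (unknown host), ConnectException (connection failure, such as connection refused), and SSLException (HTTPS handshake or certificate failure).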
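The rules just described can be modelled in a few lines of plain Java. This is a simplified sketch of the decision logic only, not the library's source: the class RetryPolicySketch and the method shouldRetry are hypothetical, only GET and HEAD are treated as idempotent, and the "request not yet fully sent" rule is left out.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLException;

final class RetryPolicySketch {

    // Exception types that are never retried, as listed in the text above.
    private static final List<Class<? extends IOException>> NON_RETRIABLE = Arrays.asList(
            InterruptedIOException.class,
            UnknownHostException.class,
            ConnectException.class,
            SSLException.class);

    private final int maxRetries;

    RetryPolicySketch(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    // executionCount is the number of attempts made so far (1 after the first failure).
    boolean shouldRetry(IOException failure, int executionCount, String method) {
        if (executionCount > maxRetries) {
            return false; // retry budget used up
        }
        for (Class<? extends IOException> type : NON_RETRIABLE) {
            if (type.isInstance(failure)) {
                return false; // this kind of failure is considered pointless to retry
            }
        }
        return "GET".equals(method) || "HEAD".equals(method); // simplified idempotency check
    }

    public static void main(String[] args) {
        RetryPolicySketch policy = new RetryPolicySketch(3);
        System.out.println(policy.shouldRetry(new IOException("connection reset"), 1, "GET"));
        System.out.println(policy.shouldRetry(new SocketTimeoutException("read timed out"), 1, "GET"));
        System.out.println(policy.shouldRetry(new IOException("connection reset"), 1, "POST"));
    }
}
```

Because the check uses isInstance, subclasses of the listed exceptions are excluded as well; SocketTimeoutException, a subclass of InterruptedIOException, is therefore not retried in this model.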
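When instantiating HttpClient, use the setRetryHandler() method of the HttpClientBuilder class to configure retries. The code below shows two ways: the first uses the default of 3 retries, and the second sets a custom count of 5.
// Default retry count (3)
HttpClient defaultRetryClient = HttpClients.custom()
        .setRetryHandler(new DefaultHttpRequestRetryHandler())
        .build();
// Custom retry count; defaultConfig is a RequestConfig built beforehand
HttpClient customRetryClient = HttpClients.custom()
        .setDefaultRequestConfig(defaultConfig)
        .setRetryHandler(new DefaultHttpRequestRetryHandler(5, true))
        .build();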
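Note that the two timeouts most often hit while crawling, ConnectTimeout (the time to establish the connection) and SocketTimeout (the time to wait for data), raise ConnectTimeoutException and SocketTimeoutException. Both exceptions extend InterruptedIOException, so they count as interrupted I/O and are not retried.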
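12. Executing requests from multiple threads
In HttpClient 4.5, the PoolingHttpClientConnectionManager class manages HTTP connections as a pool. It manages connections per route: each route keeps a limited number of connections (2 by default), and when all connections of a route are leased, further connection requests block until a connection is released back to the pool. The total number of connections that PoolingHttpClientConnectionManager maintains is also capped by MaxTotal (20 by default).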
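The two limits interact in a way that a small model makes easier to see. The sketch below uses JDK semaphores to mimic a per-route limit and a total limit; it is an illustration of the idea and not how PoolingHttpClientConnectionManager is implemented. The class RoutePoolSketch, its methods, and the route names are hypothetical, and tryLease returns false where the real pool would block the caller.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

final class RoutePoolSketch {

    private final int maxPerRoute;
    private final Semaphore total;
    private final Map<String, Semaphore> perRoute = new ConcurrentHashMap<>();

    RoutePoolSketch(int maxTotal, int maxPerRoute) {
        this.total = new Semaphore(maxTotal);
        this.maxPerRoute = maxPerRoute;
    }

    // Try to lease one connection for the route; both limits have to allow it.
    boolean tryLease(String route) {
        Semaphore routePermits = perRoute.computeIfAbsent(route, r -> new Semaphore(maxPerRoute));
        if (!routePermits.tryAcquire()) {
            return false; // the route already uses all of its connections
        }
        if (!total.tryAcquire()) {
            routePermits.release(); // undo the route permit: the pool as a whole is full
            return false;
        }
        return true;
    }

    // Return a previously leased connection of the route to the pool.
    void release(String route) {
        perRoute.get(route).release();
        total.release();
    }

    public static void main(String[] args) {
        RoutePoolSketch pool = new RoutePoolSketch(20, 2); // the defaults named in the text
        System.out.println(pool.tryLease("http://a.example")); // first connection
        System.out.println(pool.tryLease("http://a.example")); // second connection
        System.out.println(pool.tryLease("http://a.example")); // per-route limit reached
        pool.release("http://a.example");
        System.out.println(pool.tryLease("http://a.example")); // a slot is free again
    }
}
```

For a crawler that sends most requests to a single site, the per-route limit is the one that matters, so raising MaxTotal alone does not increase concurrency against that site.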
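When HttpClient is configured with a PoolingHttpClientConnectionManager, it can execute several HTTP requests at the same time, that is, it can be used from multiple threads. Below is a simple example that requests several URLs from multiple threads. The PoolingHttpClientConnectionManager instance is used to set the maximum total number of connections, the maximum number of connections per route, the connection configuration, the socket configuration, and so on. The example implements its worker by extending the Thread class and overriding run(); implementing the Runnable interface would work as well.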
public class PoolDemo {
    public static void main(String[] args) throws FileNotFoundException {
        // Connection parameters
        ConnectionConfig connectionConfig = ConnectionConfig.custom()
                .setMalformedInputAction(CodingErrorAction.IGNORE)
                .setUnmappableInputAction(CodingErrorAction.IGNORE)
                .setCharset(Consts.UTF_8)
                .build();
        // Socket parameters
        SocketConfig socketConfig = SocketConfig.custom()
                .setTcpNoDelay(true)
                .build();
        // Configure the connection pool manager
        PoolingHttpClientConnectionManager pcm = new PoolingHttpClientConnectionManager();
        // Maximum total number of connections
        pcm.setMaxTotal(100);
        // Maximum number of connections per route
        pcm.setDefaultMaxPerRoute(10);
        // Connection configuration
        pcm.setDefaultConnectionConfig(connectionConfig);
        // Socket configuration
        pcm.setDefaultSocketConfig(socketConfig);
        // Global request configuration: cookie specification, HTTP authentication, timeouts
        RequestConfig defaultConfig = RequestConfig.custom()
                .setCookieSpec(CookieSpecs.STANDARD_STRICT)
                .setExpectContinueEnabled(true)
                .setTargetPreferredAuthSchemes(Arrays.asList(AuthSchemes.NTLM, AuthSchemes.DIGEST))
                .setProxyPreferredAuthSchemes(Arrays.asList(AuthSchemes.BASIC))
                .setConnectionRequestTimeout(30 * 1000)
                .setConnectTimeout(30 * 1000)
                .setSocketTimeout(30 * 1000)
                .build();
        CloseableHttpClient httpClient = HttpClients.custom()
                .setConnectionManager(pcm)
                .setDefaultRequestConfig(defaultConfig)
                .build();
        // URLs to request
        String[] urlArr = {
                "http://www.********.com.cn/html/index.asp",
                "http://www.********.com.cn/html/html_basic.asp",
                "http://www.********.com.cn/html/html_elements.asp",
                "http://www.********.com.cn/html/html_attributes.asp",
                "http://www.********.com.cn/html/html_formatting.asp"
        };
        // Create a fixed-size thread pool
        ExecutorService exec = Executors.newFixedThreadPool(3);
        for (int i = 0; i < urlArr.length; i++) {
            // File name for the downloaded HTML
            String filename = urlArr[i].split("html/")[1];
            // Open the output file (the file/ directory must already exist)
            OutputStream out = new FileOutputStream("file/" + filename);
            HttpGet httpget = new HttpGet(urlArr[i]);
            // Submit the task that executes the request
            exec.execute(new DownHtmlFileThread(httpClient, httpget, out));
        }
        // Shut down the thread pool once the submitted tasks have finished
        exec.shutdown();
    }
    static class DownHtmlFileThread extends Thread {
        private final CloseableHttpClient httpClient;
        private final HttpContext context;
        private final HttpGet httpget;
        private final OutputStream out;
        // Constructor parameters: shared client, request, and output stream
        public DownHtmlFileThread(CloseableHttpClient httpClient, HttpGet httpget, OutputStream out) {
            this.httpClient = httpClient;
            this.context = HttpClientContext.create();
            this.httpget = httpget;
            this.out = out;
        }
        @Override
        public void run() {
            System.out.println(Thread.currentThread().getName() + " is requesting URL: " + httpget.getURI());
            try {
                CloseableHttpResponse response = httpClient.execute(httpget, context); // execute the request
                try {
                    // Decode the HTML as GBK and write it back in the same charset
                    // rather than the platform default
                    out.write(EntityUtils.toString(response.getEntity(), "gbk").getBytes("gbk"));
                    // Consume the entity
                    EntityUtils.consume(response.getEntity());
                } finally {
                    response.close(); // close the response
                    out.close(); // close the file even if writing failed
                }
            } catch (ClientProtocolException ex) {
                ex.printStackTrace(); // protocol error
            } catch (IOException ex) {
                ex.printStackTrace(); // I/O error
            }
        }
    }
}