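A crawler I wrote recently requests a proxy website, and after running for a few days it starts failing on every request with HTTP 400 "header too long". Googling turned up no answer, so I read the HttpClient source. The root cause is cookie accumulation (think of a browser whose cookies have not been cleared for a very long time). The investigation is described below; the solutions are at the end of the article.
HttpClient request call chain: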
org.apache.http.impl.client.InternalHttpClient#doExecute
org.apache.http.impl.client.InternalHttpClient#setupContext
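if (context.getAttribute(HttpClientContext.COOKIE_STORE) == null) {
    context.setAttribute(HttpClientContext.COOKIE_STORE, this.cookieStore);
}
If no cookie store is set explicitly on the context, the client falls back to its own cookieStore field. An application usually has only one HttpClient instance, so that cookieStore is effectively a singleton shared by every request.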
org.apache.http.impl.execchain.RedirectExec#execute:108
org.apache.http.impl.execchain.RetryExec#execute:86
org.apache.http.impl.execchain.ProtocolExec#execute
org.apache.http.impl.execchain.MainClientExec#execute
org.apache.http.protocol.HttpRequestExecutor#execute
org.apache.http.protocol.HttpRequestExecutor#doSendRequest
conn.sendRequestHeader(request);
org.apache.http.impl.conn.CPoolProxy#sendRequestHeader
org.apache.http.impl.io.AbstractMessageWriter#write
Sending cookies (every request header, including Cookie, is written out by this loop):
for (final HeaderIterator it = message.headerIterator(); it.hasNext(); ) {
    final Header header = it.nextHeader();
    this.sessionBuffer.writeLine(lineFormatter.formatHeader(this.lineBuf, header));
}
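To get a feel for how quickly accumulated cookies inflate a request, the following stdlib-only sketch estimates the size of a single Cookie header line as the number of stored cookies grows. It does not use HttpClient at all. The cookie naming scheme (SESSION_<n>), the 32-character value, and the 8192-byte limit are assumptions made for illustration; the real limit depends on the server that returned the 400.

```java
import java.nio.charset.StandardCharsets;

class CookieHeaderSizeDemo {

    // Hypothetical header size limit; each real server configures its own.
    static final int ASSUMED_LIMIT_BYTES = 8192;

    // Builds a "Cookie: n1=v1; n2=v2; ..." line for `count` made-up cookies
    // and returns its length in bytes.
    static int cookieHeaderBytes(int count) {
        StringBuilder line = new StringBuilder("Cookie: ");
        for (int i = 0; i < count; i++) {
            if (i > 0) {
                line.append("; ");
            }
            line.append("SESSION_").append(i).append('=')
                .append("0123456789abcdef0123456789abcdef");
        }
        return line.toString().getBytes(StandardCharsets.US_ASCII).length;
    }

    // Returns the smallest cookie count whose header line exceeds the limit.
    static int firstCountExceeding(int limitBytes) {
        int count = 0;
        while (cookieHeaderBytes(count) <= limitBytes) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("1 cookie: " + cookieHeaderBytes(1) + " bytes");
        System.out.println("100 cookies: " + cookieHeaderBytes(100) + " bytes");
        System.out.println("assumed limit exceeded at "
                + firstCountExceeding(ASSUMED_LIMIT_BYTES) + " cookies");
    }
}
```

Under these assumptions each extra cookie adds roughly 45 bytes, so a store that gains a few cookies per hour can plausibly cross a limit of this size within days.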
Saving cookies:
org.apache.http.impl.execchain.ProtocolExec#execute:200
org.apache.http.client.protocol.ResponseProcessCookies#process
org.apache.http.client.protocol.ResponseProcessCookies#processCookies:114
cookieStore.addCookie(cookie);
/**
 * Adds an {@link Cookie HTTP cookie}, replacing any existing equivalent cookies.
 * If the given cookie has already expired it will not be added, but existing
 * values will still be removed.
 *
 * @param cookie the {@link Cookie cookie} to be added
 *
 * @see #addCookies(Cookie[])
 *
 */
public synchronized void addCookie(final Cookie cookie) {
    if (cookie != null) {
        // first remove any old cookie that is equivalent
        cookies.remove(cookie);
        if (!cookie.isExpired(new Date())) {
            cookies.add(cookie);
        }
    }
}
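Note that adding a cookie first removes any equivalent existing cookie, and the new cookie is added only if it has not already expired. On its face this cannot leak, so where does the problem come from? Debugging showed that the name of the session ID cookie issued by our third-party site keeps changing. A cookie with a new name is not equivalent to the old one, so the old cookies are never removed and keep piling up.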
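The effect can be reproduced with a small stdlib-only model. This is a sketch, not HttpClient code: SimpleCookie and SimpleCookieStore are hypothetical stand-ins, and the model assumes that two cookies count as equivalent when name, domain, and path all match, which is consistent with the behavior observed above. The domain example.com and the SESSION_<n> names are made up.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical stand-in for a cookie; NOT the HttpClient Cookie interface.
final class SimpleCookie {
    final String name;
    final String domain;
    final String path;
    final String value;

    SimpleCookie(String name, String domain, String path, String value) {
        this.name = name;
        this.domain = domain;
        this.path = path;
        this.value = value;
    }
}

// Hypothetical store: identity is (name, domain, path); the value is ignored.
final class SimpleCookieStore {
    private final TreeSet<SimpleCookie> cookies = new TreeSet<>(
            Comparator.comparing((SimpleCookie c) -> c.name)
                    .thenComparing(c -> c.domain)
                    .thenComparing(c -> c.path));

    // Same remove-then-add shape as the addCookie method quoted above
    // (expiry check omitted for brevity).
    synchronized void addCookie(SimpleCookie cookie) {
        cookies.remove(cookie);
        cookies.add(cookie);
    }

    synchronized int size() {
        return cookies.size();
    }
}

class CookieAccumulationDemo {

    // Simulates `requests` responses that each set one session cookie.
    static int storeSizeAfter(int requests, boolean rotatingName) {
        SimpleCookieStore store = new SimpleCookieStore();
        for (int i = 0; i < requests; i++) {
            String name = rotatingName ? "SESSION_" + i : "SESSION";
            store.addCookie(new SimpleCookie(name, "example.com", "/", "v" + i));
        }
        return store.size();
    }

    public static void main(String[] args) {
        System.out.println("stable name:   " + storeSizeAfter(1000, false)); // prints 1
        System.out.println("rotating name: " + storeSizeAfter(1000, true));  // prints 1000
    }
}
```

With a stable name the store never holds more than one session cookie; with a rotating name it grows by one entry per response, which is the leak described above.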
Solution ①: disable cookies
CloseableHttpClient httpClient = HttpClientBuilder.create().setConnectionManager(connManager)
        .setRetryHandler(retryHandler).setDefaultRequestConfig(config).disableCookieManagement().build();
The disableCookieManagement() method makes the client stop both sending and storing cookies.
/**
 * Disables state (cookie) management.
 * <p/>
 * Please note this value can be overridden by the {@link #setHttpProcessor(
 * org.apache.http.protocol.HttpProcessor)} method.
 */
public final HttpClientBuilder disableCookieManagement() {
    this.cookieManagementDisabled = true;
    return this;
}
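Once this is set, the response interceptors in org.apache.http.protocol.ImmutableHttpProcessor#responseInterceptors no longer include ResponseProcessCookies, so cookies are no longer stored. Watching subsequent requests confirms that their headers no longer contain a Cookie field.
Interceptor registration code: org.apache.http.impl.client.HttpClientBuilder#build:839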
if (!cookieManagementDisabled) {
    b.add(new RequestAddCookies());
}
if (!cookieManagementDisabled) {
    b.add(new ResponseProcessCookies());
}
Solution ②: use a separate context
HttpClientContext context = HttpClientContext.create();
context.setCookieStore(new BasicCookieStore());
CloseableHttpResponse response = httpClient.execute(httpGet, context);
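Because the context's cookie store is now non-null, the client no longer falls back to its own cookieStore field, and cookies from this request go into the store you supplied instead.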
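The fallback logic and the effect of supplying your own store can be modeled without the library. The sketch below is a toy under stated assumptions: ModelClient, the "cookie-store" key, and the SESSION_<n> names are hypothetical, and a plain Map stands in for the request context. It only mirrors the null check shown at the start of the call chain.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ContextFallbackDemo {

    // Hypothetical attribute key, standing in for HttpClientContext.COOKIE_STORE.
    static final String COOKIE_STORE = "cookie-store";

    // Toy client: one long-lived instance with one long-lived cookie store.
    static final class ModelClient {
        final Set<String> sharedStore = new HashSet<>();

        // Mirrors the null check in setupContext, then "stores" one cookie name.
        void execute(Map<String, Set<String>> context, String cookieName) {
            if (context.get(COOKIE_STORE) == null) {
                context.put(COOKIE_STORE, sharedStore);
            }
            context.get(COOKIE_STORE).add(cookieName);
        }
    }

    // Runs `requests` requests and returns the size of the client's shared store.
    static int sharedStoreSizeAfter(int requests, boolean ownStorePerRequest) {
        ModelClient client = new ModelClient();
        for (int i = 0; i < requests; i++) {
            Map<String, Set<String>> context = new HashMap<>();
            if (ownStorePerRequest) {
                // Equivalent in spirit to context.setCookieStore(new BasicCookieStore()).
                context.put(COOKIE_STORE, new HashSet<>());
            }
            client.execute(context, "SESSION_" + i);
        }
        return client.sharedStore.size();
    }

    public static void main(String[] args) {
        System.out.println("default context:   " + sharedStoreSizeAfter(100, false)); // prints 100
        System.out.println("own store per req: " + sharedStoreSizeAfter(100, true));  // prints 0
    }
}
```

One trade-off to keep in mind: a fresh store per request also means cookies are not carried from one request to the next, so if a site needs a session across several requests, reusing one context for just that session is probably the better fit.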
Choose between the two solutions above according to your own situation.