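HttpClient returns "header too long"

A crawler I wrote recently started failing after running for a few days: every request to a certain proxy site came back with HTTP 400, "header too long". Googling turned up no answer, so I read the source. The root cause is cookie accumulation (think of a browser whose cookies have not been cleared for a long time). The investigation follows; the solutions are at the end of the article.
HttpClient request call chain: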
org.apache.http.impl.client.InternalHttpClient#doExecute
org.apache.http.impl.client.InternalHttpClient#setupContext
if (context.getAttribute(HttpClientContext.COOKIE_STORE) == null) {
    context.setAttribute(HttpClientContext.COOKIE_STORE, this.cookieStore);
}
 
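If no COOKIE_STORE has been set explicitly on the context, the client falls back to its own member field cookieStore (this is the HttpClient). An application usually has only one HttpClient instance, so that cookieStore is effectively a singleton as well.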
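The fallback described here can be illustrated with a small stand-alone model. This is only a sketch using the JDK, not HttpClient itself: ModelClient, resolveStore, SharedStoreDemo and the "cookie-store" key are hypothetical names, and a plain Map stands in for the request context.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an HTTP client that owns one default cookie store.
final class ModelClient {
    // Hypothetical context key; it only plays the role of HttpClientContext.COOKIE_STORE.
    static final String COOKIE_STORE = "cookie-store";

    // Created once per client, so it is shared by every request that uses the fallback.
    private final Object defaultStore = new Object();

    // Same shape as the null check in setupContext: the client's own store
    // is used only when the context does not already carry one.
    Object resolveStore(Map<String, Object> context) {
        if (context.get(COOKIE_STORE) == null) {
            context.put(COOKIE_STORE, defaultStore);
        }
        return context.get(COOKIE_STORE);
    }
}

final class SharedStoreDemo {
    public static void main(String[] args) {
        ModelClient client = new ModelClient();

        // Two requests with empty contexts end up with the same store instance.
        Object first = client.resolveStore(new HashMap<>());
        Object second = client.resolveStore(new HashMap<>());
        System.out.println("default store shared: " + (first == second));

        // A context that brings its own store keeps it.
        Map<String, Object> context = new HashMap<>();
        Object ownStore = new Object();
        context.put(ModelClient.COOKIE_STORE, ownStore);
        System.out.println("own store kept: " + (client.resolveStore(context) == ownStore));
    }
}
```

In this model, every request that leaves the context empty writes into the same store, which is why cookies from all earlier responses can pile up in one place.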
org.apache.http.impl.execchain.RedirectExec#execute:108
org.apache.http.impl.execchain.RetryExec#execute:86
org.apache.http.impl.execchain.ProtocolExec#execute
org.apache.http.impl.execchain.MainClientExec#execute
org.apache.http.protocol.HttpRequestExecutor#execute
org.apache.http.protocol.HttpRequestExecutor#doSendRequest
conn.sendRequestHeader(request);
org.apache.http.impl.conn.CPoolProxy#sendRequestHeader
org.apache.http.impl.io.AbstractMessageWriter#write
Sending cookies (the Cookie header is written out together with all other request headers):
for (final HeaderIterator it = message.headerIterator(); it.hasNext(); ) {
    final Header header = it.nextHeader();
    this.sessionBuffer.writeLine
        (lineFormatter.formatHeader(this.lineBuf, header));
}
 
Storing cookies:
org.apache.http.impl.execchain.ProtocolExec#execute:200
org.apache.http.HttpResponseInterceptor#process
org.apache.http.client.protocol.ResponseProcessCookies#process
org.apache.http.client.protocol.ResponseProcessCookies#processCookies:114
cookieStore.addCookie(cookie);
 
/**
 * Adds an {@link Cookie HTTP cookie}, replacing any existing equivalent cookies.
 * If the given cookie has already expired it will not be added, but existing
 * values will still be removed.
 *
 * @param cookie the {@link Cookie cookie} to be added
 *
 * @see #addCookies(Cookie[])
 *
 */
public synchronized void addCookie(final Cookie cookie) {
    if (cookie != null) {
        // first remove any old cookie that is equivalent
        cookies.remove(cookie);
        if (!cookie.isExpired(new Date())) {
            cookies.add(cookie);
        }
    }
}
 
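Note that adding a cookie first removes any equivalent existing cookie, and the new cookie is added only if it has not already expired. On the face of it nothing can go wrong here, so where does the problem come from? Debugging showed that the name of the session-ID cookie issued by the third-party site keeps changing! A cookie with a new name is not equivalent to the old one, so the old cookie is never removed and the store grows without bound.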
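The effect can be reproduced with a small stand-alone model. This is a simplified sketch using only the JDK, not the real BasicCookieStore: ModelCookie, ModelCookieStore and CookieAccumulationDemo are hypothetical names, cookie identity is assumed to be the (name, domain, path) triple, and expiry handling is left out.

```java
import java.util.Comparator;
import java.util.StringJoiner;
import java.util.TreeSet;

// Simplified stand-in for a cookie; NOT org.apache.http.cookie.Cookie.
final class ModelCookie {
    final String name;
    final String domain;
    final String path;
    final String value;

    ModelCookie(String name, String domain, String path, String value) {
        this.name = name;
        this.domain = domain;
        this.path = path;
        this.value = value;
    }
}

// Simplified stand-in for a cookie store with remove-then-add semantics.
final class ModelCookieStore {
    // Assumption of this model: two cookies are "equivalent" when name, domain and path match.
    private static final Comparator<ModelCookie> IDENTITY =
            Comparator.comparing((ModelCookie c) -> c.name)
                    .thenComparing(c -> c.domain)
                    .thenComparing(c -> c.path);

    private final TreeSet<ModelCookie> cookies = new TreeSet<>(IDENTITY);

    // Same shape as addCookie above: drop the equivalent cookie, then add the new one.
    synchronized void addCookie(ModelCookie cookie) {
        cookies.remove(cookie);
        cookies.add(cookie);
    }

    synchronized int size() {
        return cookies.size();
    }

    // Length of a single Cookie header line carrying every stored cookie.
    synchronized int cookieHeaderLength() {
        StringJoiner header = new StringJoiner("; ", "Cookie: ", "");
        for (ModelCookie c : cookies) {
            header.add(c.name + "=" + c.value);
        }
        return header.toString().length();
    }
}

final class CookieAccumulationDemo {
    // Simulates `responses` responses, each of which sets one session cookie.
    static ModelCookieStore simulate(int responses, boolean nameChanges) {
        ModelCookieStore store = new ModelCookieStore();
        for (int i = 0; i < responses; i++) {
            String name = nameChanges ? "SESSION_" + i : "SESSION";
            store.addCookie(new ModelCookie(name, "example.com", "/", "value" + i));
        }
        return store;
    }

    public static void main(String[] args) {
        ModelCookieStore stable = simulate(1000, false);
        ModelCookieStore changing = simulate(1000, true);
        System.out.println("stable name:   " + stable.size()
                + " cookie(s), header length " + stable.cookieHeaderLength());
        System.out.println("changing name: " + changing.size()
                + " cookie(s), header length " + changing.cookieHeaderLength());
    }
}
```

With a stable name the store never holds more than one session cookie. With a changing name it holds one cookie per response, and the Cookie header grows with it until the server rejects the request.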
Solution 1: disable cookies
CloseableHttpClient httpClient = HttpClientBuilder.create()
        .setConnectionManager(connManager).setRetryHandler(retryHandler).setDefaultRequestConfig(config)
        .disableCookieManagement()
        .build();
The disableCookieManagement() method stops the client from sending and storing cookies.
/**
 * Disables state (cookie) management.
 * <p/>
 * Please note this value can be overridden by the {@link #setHttpProcessor(
 * org.apache.http.protocol.HttpProcessor)} method.
 */
public final HttpClientBuilder disableCookieManagement() {
    this.cookieManagementDisabled = true;
    return this;
}
 
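Once cookie management is disabled, the response interceptors in org.apache.http.protocol.ImmutableHttpProcessor#responseInterceptors no longer include ResponseProcessCookies, so cookies are no longer stored. Subsequent requests also no longer carry a Cookie header.
Interceptor registration code: org.apache.http.impl.client.HttpClientBuilder#build:839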
if (!cookieManagementDisabled) {
    b.add(new RequestAddCookies());
}
if (!cookieManagementDisabled) {
    b.add(new ResponseProcessCookies());
}
 
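Solution 2: use a separate context
// A fresh cookie store per request: cookies from earlier responses are not carried over.
HttpClientContext context = HttpClientContext.create();
context.setCookieStore(new BasicCookieStore());
CloseableHttpResponse response = httpClient.execute(httpGet, context);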
 
Because the context's cookie store is now non-null, the client no longer falls back to its own member cookieStore.
Choose whichever of the two solutions fits your situation.