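Production incident:
During the morning peak, order volume spiked and some calls to the downstream etq interface failed with org.apache.http.NoHttpResponseException: server failed to respond. My first impression was that the downstream service lacked capacity (its server-side connection limit was set too low, or it needed to scale out). Analysis of the production logs showed that the case was not frequent: it occurred 1-2 times per minute, and after roughly 10 exceptions things returned to normal. We kept watching afterwards and no further exceptions were thrown.
Exception details: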
org.apache.http.NoHttpResponseException: xxxxxxx:xxx failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143) ~[httpclient-4.5.jar:4.5]
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57) ~[httpclient-4.5.jar:4.5]
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261) ~[httpcore-4.4.1.jar:4.4.1]
at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:165) ~[httpcore-4.4.1.jar:4.4.1]
at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167) ~[httpclient-4.5.jar:4.5]
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272) ~[httpcore-4.4.1.jar:4.4.1]
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124) ~[httpcore-4.4.1.jar:4.4.1]
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271) ~[httpclient-4.5.jar:4.5]
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[httpclient-4.5.jar:4.5]
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88) ~[httpclient-4.5.jar:4.5]
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.jar:4.5]
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[httpclient-4.5.jar:4.5]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) ~[httpclient-4.5.jar:4.5]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107) ~[httpclient-4.5.jar:4.5]
Explanation from the Apache website: org.apache.http.NoHttpResponseException
In some circumstances, usually when under heavy load, the web server may be able to receive requests but unable to process them. A lack of sufficient resources like worker threads is a good example. This may cause the server to drop the connection to the client without giving any response. HttpClient throws NoHttpResponseException when it encounters such a condition. In most cases it is safe to retry a method that failed with NoHttpResponseException.
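Open question:
Our httpclient is configured with a retry mechanism (3 retries) and sends a traceId in the request header. The nginx access log shows that this traceId reached the server only once, and that request succeeded with a 200. This means the HTTP request that raised the exception never even reached the application layer.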
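To make the "3 retries" setting concrete, here is a minimal stdlib-only sketch of a retry loop. It is not HttpClient's implementation: RetrySketch, callWithRetry and NoResponseException are hypothetical names used for illustration, and retrying only on a no-response failure is an assumption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

class RetrySketch {
    /** Hypothetical stand-in for "the server dropped the connection without replying". */
    static class NoResponseException extends Exception {
        NoResponseException(String msg) { super(msg); }
    }

    /**
     * Runs the call once and retries up to maxRetries more times,
     * but only when the failure is a NoResponseException.
     */
    static <T> T callWithRetry(Callable<T> call, int maxRetries) throws Exception {
        int retriesUsed = 0;
        while (true) {
            try {
                return call.call();
            } catch (NoResponseException e) {
                if (retriesUsed >= maxRetries) {
                    throw e; // retries exhausted: surface the failure to the caller
                }
                retriesUsed++;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger attempts = new AtomicInteger();
        // Simulated call: fails twice with "no response", then succeeds.
        String result = callWithRetry(() -> {
            if (attempts.incrementAndGet() < 3) {
                throw new NoResponseException("failed to respond");
            }
            return "200";
        }, 3);
        System.out.println(result + " after " + attempts.get() + " attempts");
    }
}
```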
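Searching Baidu:
Searching Baidu for NoHttpResponseException turns up quite a few articles saying it is caused by using a stale connection that the server then rejects, and recommending that stale connections be evicted proactively. But our project already runs a background thread that proactively evicts expired and idle connections.
Studying httpclient's connection eviction strategy from the code
Background:
Since HTTP 1.1, persistent connections (keepalive) are on by default, and httpclient uses a connection pool (of persistent connections) to increase connection reuse. This only works if the server also supports persistent connections; the server can configure the maximum number of persistent connections it holds and the keepalive time of each. Suppose the server is configured with keepalive=20: if a connection goes unused for 20s, the server closes it on its own, but the client is not aware of this. The client may still send a request over that connection and be rejected by the server, which results in a NoHttpResponseException.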
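The timing in this scenario can be sketched with a tiny model. This is an illustration only: KeepAliveModel and closedByServer are hypothetical names, the clock is passed in explicitly, and the 20s value is the example figure used above.

```java
class KeepAliveModel {
    /** True if the server would already have closed a connection idle since lastUsedMillis. */
    static boolean closedByServer(long lastUsedMillis, long nowMillis, long serverKeepAliveMillis) {
        return nowMillis - lastUsedMillis >= serverKeepAliveMillis;
    }

    public static void main(String[] args) {
        long keepAlive = 20_000L; // server-side keepalive of 20s (example value from the text)
        long lastUsed = 0L;       // the client last used the pooled connection at t = 0

        // At t = 10s the server still holds the connection, so reuse is fine.
        System.out.println(closedByServer(lastUsed, 10_000L, keepAlive));
        // At t = 25s the server has closed it, yet the client's pool may still hand it out.
        System.out.println(closedByServer(lastUsed, 25_000L, keepAlive));
    }
}
```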
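One of the major shortcomings of the classic blocking I/O model is that a network socket can react to I/O events only while it is blocked in an I/O operation. When a connection is released back to the pool it is still alive, but it cannot monitor the state of the socket and cannot react to any I/O events. If the server closes the connection, the client cannot detect the change in connection state (and react by closing its own end).
So how does HttpClient solve this kind of problem? By running a scheduled task that calls:
1. ClientConnectionManager#closeExpiredConnections(): evicts expired connections (how do we decide whether a connection has expired?)
2. ClientConnectionManager#closeIdleConnections(): evicts idle connections (the scheduled task runs every few seconds and evicts connections that have not been used in the last 30s)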
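The two eviction passes listed above can be sketched with a simplified pool. This is a stdlib-only model and not HttpClient's code: EvictionSketch, Entry, closeExpired and closeIdle are hypothetical names, and the current time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;

class EvictionSketch {
    /** Hypothetical pooled connection: only the timestamps needed for eviction. */
    static class Entry {
        final String id;
        final long lastUsedMillis; // when the connection was last released to the pool
        final long expiryMillis;   // absolute time after which it must not be reused

        Entry(String id, long lastUsedMillis, long expiryMillis) {
            this.id = id;
            this.lastUsedMillis = lastUsedMillis;
            this.expiryMillis = expiryMillis;
        }
    }

    final List<Entry> idle = new ArrayList<>();

    /** Removes entries whose expiry time has passed; returns how many were removed. */
    int closeExpired(long nowMillis) {
        int before = idle.size();
        idle.removeIf(e -> e.expiryMillis <= nowMillis);
        return before - idle.size();
    }

    /** Removes entries unused for at least maxIdleMillis; returns how many were removed. */
    int closeIdle(long maxIdleMillis, long nowMillis) {
        int before = idle.size();
        idle.removeIf(e -> nowMillis - e.lastUsedMillis >= maxIdleMillis);
        return before - idle.size();
    }

    public static void main(String[] args) {
        EvictionSketch pool = new EvictionSketch();
        pool.idle.add(new Entry("a", 0L, 15_000L));              // expires at t = 15s
        pool.idle.add(new Entry("b", 5_000L, Long.MAX_VALUE));   // last used at t = 5s
        pool.idle.add(new Entry("c", 30_000L, Long.MAX_VALUE));  // last used at t = 30s

        long now = 40_000L;
        System.out.println("expired removed: " + pool.closeExpired(now));
        System.out.println("idle removed: " + pool.closeIdle(30_000L, now));
        System.out.println("remaining: " + pool.idle.size());
    }
}
```

In a real setup a background task would invoke both passes every few seconds with the current clock time.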
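Now let's look at a few core classes.
PoolEntry
This class wraps the connection together with its route, creation time, last-used time, expiry time, validity deadline, and the rules for judging validity. It is the core low-level class for deciding whether a connection should be evicted.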
public abstract class PoolEntry<T, C> {
private final String id;
private final T route;
private final C conn; // the wrapped connection
private final long created; // creation time of the connection
private final long validityDeadline; // validity deadline (set by the client)
private long updated; // time of last use
private long expiry; // expiry time
public PoolEntry(final String id, final T route, final C conn,
final long timeToLive, final TimeUnit tunit) {
super();
Args.notNull(route, "Route");
Args.notNull(conn, "Connection");
Args.notNull(tunit, "Time unit");
this.id = id;
this.route = route;
this.conn = conn;
this.created = System.currentTimeMillis();
this.updated = this.created;
if (timeToLive >