1. The Java NIO empty-polling problem on Linux
1.1 Introducing the empty-polling problem
1.1.1 Symptoms
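On Linux, I/O multiplexing generally means epoll, and Java NIO on Linux uses epoll by default as well. The JDK's integration with the underlying epoll has had bugs, however, and the best known is the Java NIO empty-polling problem on Linux. It is an old issue. In short, even though Selector#select() has not found a single I/O channel that is ready for processing, NIO keeps waking up from a Selector#select() call that should have blocked, and CPU usage climbs to 100%. The Javadoc of Selector#select() below shows that, when no I/O channel is ready, the method is supposed to block rather than return 0: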
/**
 * Selects a set of keys whose corresponding channels are ready for I/O
 * operations.
 *
 * <p> This method performs a blocking <a href="#selop">selection
 * operation</a>. It returns only after at least one channel is selected,
 * this selector's {@link #wakeup wakeup} method is invoked, or the current
 * thread is interrupted, whichever comes first. </p>
 *
 * @return The number of keys, possibly zero,
 *         whose ready-operation sets were updated
 *
 * @throws IOException
 *         If an I/O error occurs
 *
 * @throws ClosedSelectorException
 *         If this selector is closed
 */
public abstract int select() throws IOException;
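The blocking contract described in that Javadoc can be observed with a small experiment. The following is a minimal sketch, not part of the JDK or Netty: the class name SelectBlockingDemo and the helper timeEmptySelect are made up for illustration. It calls select(timeout) on a selector that has no channels registered and measures how long the call blocks. On a healthy JVM the elapsed time should be close to the timeout; a return that comes back almost immediately with 0 keys is the symptom discussed in this section.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Selector;

class SelectBlockingDemo {
    /** Calls select(timeoutMillis) on a selector with no channels and returns the blocked time in ms. */
    static long timeEmptySelect(long timeoutMillis) {
        try (Selector selector = Selector.open()) {
            long start = System.nanoTime();
            // Nothing is registered, so no channel can become ready.
            int updatedKeys = selector.select(timeoutMillis);
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
            System.out.println("select returned " + updatedKeys + " after ~" + elapsedMillis + " ms");
            return elapsedMillis;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        timeEmptySelect(200);
    }
}
```

Comparing the elapsed time with the requested timeout is also the basic idea behind the detection logic that Netty uses later in this article.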
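epoll is an efficient I/O multiplexing mechanism on Linux. Compared with select and poll, it is efficient because the monitored fds are stored directly in the kernel in a red-black tree, and the kernel keeps the fds on which events have occurred in a linked list, which forms the ready list. When the application calls epoll_wait, the kernel returns the ready list and the application simply processes those ready fds, instead of walking the full set of fds itself to find the ones that have events.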
If you are not familiar with epoll, select, and poll, see: 【JAVA】IO多路复用之select、poll、epoll详解
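Seen from the Java side, the ready list shows up as Selector#selectedKeys(). The sketch below is only an illustration under assumptions: the class name ReadySetDemo and its helper are hypothetical, and java.nio.channels.Pipe stands in for real sockets. Two channels are registered but only one has data, so the application iterates over a single ready key rather than scanning every registered channel.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

class ReadySetDemo {
    /** Registers two pipes, makes only one readable, and returns the number of selected keys. */
    static int selectWithOneReadyChannel() {
        try {
            Pipe idle = Pipe.open();
            Pipe busy = Pipe.open();
            try (Selector selector = Selector.open()) {
                idle.source().configureBlocking(false);
                busy.source().configureBlocking(false);
                idle.source().register(selector, SelectionKey.OP_READ, "idle");
                busy.source().register(selector, SelectionKey.OP_READ, "busy");

                // Only the "busy" pipe receives data.
                busy.sink().write(ByteBuffer.wrap(new byte[] {42}));

                selector.select(1000);
                // Only channels with pending events appear here.
                for (SelectionKey key : selector.selectedKeys()) {
                    System.out.println("ready: " + key.attachment());
                }
                return selector.selectedKeys().size();
            } finally {
                idle.source().close();
                idle.sink().close();
                busy.source().close();
                busy.sink().close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        selectWithOneReadyChannel();
    }
}
```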
1.1.2 Cause
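JDK-6403933 : (se) Selector doesn’t block on Selector.select(timeout) (lnx) is the official bug report, and it describes the cause of empty polling in detail. In short, when a socket connection is terminated abruptly (RST), epoll sets the events returned to the upper layer for that socket's fd to POLLHUP or POLLERR, so the event set changes. Once the monitored event set changes, the upper-layer Selector#select() call wakes up. The JDK did not handle this error case properly, though: SelectionKey defines no error event type at all, so the low-level error event cannot be mapped to an event the upper layer can handle. As a result, although an event has occurred, the SelectionKey set returned by Selector#selectedKeys() is empty and the error event is never consumed, so Selector#select() keeps being woken up and CPU usage reaches 100%. The operations that SelectionKey defines are: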
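// Excerpt from java.nio.channels.SelectionKey: these four operations are
// the only event types it defines; there is no constant for error or hang-up events.
class SelectionKey {
    public static final int OP_READ = 1 << 0;
    public static final int OP_WRITE = 1 << 2;
    public static final int OP_CONNECT = 1 << 3;
    public static final int OP_ACCEPT = 1 << 4;
}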
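Because these constants are bit flags, a key's interest set and ready set are simply bitwise combinations of them. The short sketch below (the class name SelectionKeyOpsDemo is made up for illustration) ORs all four together to show the complete range of events a SelectionKey can express, which makes it easy to see that there is no bit left over for an error condition.

```java
import java.nio.channels.SelectionKey;

class SelectionKeyOpsDemo {
    /** Bitwise OR of every operation constant that SelectionKey defines. */
    static int allDefinedOps() {
        return SelectionKey.OP_READ
                | SelectionKey.OP_WRITE
                | SelectionKey.OP_CONNECT
                | SelectionKey.OP_ACCEPT;
    }

    public static void main(String[] args) {
        // Every value a ready set can take is a subset of these bits.
        System.out.println("all ops = 0b" + Integer.toBinaryString(allDefinedOps()));
    }
}
```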
1.2 Approaches to handling empty polling
1.2.1 At the JDK level
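Add an error event to SelectionKey
Put simply, add a new error event type to NIO's SelectionKey that covers events such as POLLHUP and POLLERR from Linux epoll. This exposes the low-level error to the upper layer so that the application can handle it. The change is fairly invasive, though, because SelectionKey is defined for every platform, including Linux and Windows.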
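Widen the mapping onto existing SelectionKey events
This approach only touches Java NIO's wrapper around epoll: the POLLHUP and POLLERR events from epoll could be mapped to SelectionKey.OP_READ or SelectionKey.OP_WRITE, provided the upper-layer application still has a way to notice the error condition.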
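The second idea can be sketched as a small translation function. This is a hypothetical illustration, not the JDK's actual code: the class PollEventMappingSketch and the method toReadyOps are invented here, and the POLL* constants are redeclared with the values from Linux <poll.h>. The assumption is that an error or hang-up is reported as readiness for whatever the key is interested in, so that the application's next read or write on the channel surfaces the failure as an exception or end-of-stream.

```java
import java.nio.channels.SelectionKey;

class PollEventMappingSketch {
    // Event bits as defined in Linux <poll.h>; redeclared here for illustration only.
    static final int POLLIN = 0x0001;
    static final int POLLOUT = 0x0004;
    static final int POLLERR = 0x0008;
    static final int POLLHUP = 0x0010;

    /** Hypothetical translation of native poll events into SelectionKey ready ops. */
    static int toReadyOps(int pollEvents, int interestOps) {
        if ((pollEvents & (POLLERR | POLLHUP)) != 0) {
            // Mark every operation of interest as ready so the key is selected
            // and the application's next I/O call runs into the error.
            return interestOps;
        }
        int ready = 0;
        if ((pollEvents & POLLIN) != 0) {
            ready |= SelectionKey.OP_READ;
        }
        if ((pollEvents & POLLOUT) != 0) {
            ready |= SelectionKey.OP_WRITE;
        }
        // Never report an operation the key did not ask for.
        return ready & interestOps;
    }

    public static void main(String[] args) {
        System.out.println(toReadyOps(POLLHUP, SelectionKey.OP_READ));
    }
}
```

With a mapping like this the key always lands in the selected-key set, so the event is consumed instead of waking select() over and over.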
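1.2.2 At the application level
At the application level, a practical fix is to rebuild the Selector. Put simply, when empty polling is detected, the old Selector is discarded and a new one is created, so the connection that hit the error no longer has to be dealt with on the old Selector. Because the rebuild re-registers only the channels behind the still-valid SelectionKeys, the abnormally terminated connection is left behind.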
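A rebuild along these lines might look like the following. This is a minimal single-threaded sketch under assumptions, not Netty's implementation: the class SelectorRebuildSketch and its methods are hypothetical, error handling is reduced to the bare minimum, and the demo uses a java.nio.channels.Pipe instead of a real socket.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

class SelectorRebuildSketch {
    /** Moves every still-valid registration from oldSelector to a fresh Selector. */
    static Selector rebuild(Selector oldSelector) throws IOException {
        Selector newSelector = Selector.open();
        for (SelectionKey key : oldSelector.keys()) {
            if (!key.isValid()) {
                continue; // cancelled key or closed channel: leave it behind
            }
            SelectableChannel channel = key.channel();
            int interestOps = key.interestOps();
            Object attachment = key.attachment();
            key.cancel(); // detach from the old selector
            channel.register(newSelector, interestOps, attachment);
        }
        oldSelector.close();
        return newSelector;
    }

    /** Registers one channel, rebuilds, and returns the key count of the new selector (-1 on failure). */
    static int demo() {
        try {
            Pipe pipe = Pipe.open();
            try {
                pipe.source().configureBlocking(false);
                Selector old = Selector.open();
                pipe.source().register(old, SelectionKey.OP_READ, "conn-1");
                Selector fresh = rebuild(old);
                try {
                    return old.isOpen() ? -1 : fresh.keys().size();
                } finally {
                    fresh.close();
                }
            } finally {
                pipe.source().close();
                pipe.sink().close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("keys on rebuilt selector: " + demo());
    }
}
```

Interest ops and attachments are carried over so that the rest of the application does not notice the swap; a production version would also have to cope with registration failures and with other threads touching the selector.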
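2. Netty's countermeasure
Netty's approach is to keep a counter of how many times empty polling has occurred; when the count reaches the threshold SELECTOR_AUTO_REBUILD_THRESHOLD (512 by default), it rebuilds the Selector.
The logic that triggers the rebuild is in io.netty.channel.nio.NioEventLoop#select: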
private void select(boolean oldWakenUp) throws IOException {
    Selector selector = this.selector;
    try {
        int selectCnt = 0;
        long currentTimeNanos = System.nanoTime();
        long selectDeadLineNanos = currentTimeNanos + delayNanos(currentTimeNanos);
        for (;;) {
            long timeoutMillis = (selectDeadLineNanos - currentTimeNanos + 500000L) / 1000000L;
            if (timeoutMillis <= 0) {
                if (selectCnt == 0) {
                    selector.selectNow();
                    selectCnt = 1;
                }
                break;
            }
            // If a task was submitted when wakenUp value was true, the task didn't get a chance to call
            // Selector#wakeup. So we need to check task queue again before executing select operation.
            // If we don't, the task might be pended until select operation was timed out.
            // It might be pended until idle timeout if IdleStateHandler existed in pipeline.
            if (hasTasks() && wakenUp.compareAndSet(false, true)) {
                selector.selectNow();
                selectCnt = 1;
                break;
            }
            int selectedKeys = selector.select(timeoutMillis);
            selectCnt ++;
            if (selectedKeys != 0 || oldWakenUp || wakenUp.get() || hasTasks() || hasScheduledTasks()) {
                // - Selected something,
                // - waken up by user, or
                // - the task queue has a pending task.
                // - a scheduled task is ready for processing
                break;
            }
            if (Thread.interrupted()) {
                // Thread was interrupted so reset selected keys and break so we not run into a busy loop.
                // As this is most likely a bug in the handler of the user or it's client library we will
                // also log it.
                //
                // See https://github.com/netty/netty/issues/2426
                if (logger.isDebugEnabled()) {
                    logger.debug("Selector.select() returned prematurely because " +
                            "Thread.currentThread().interrupt() was called. Use " +
                            "NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.");
                }
                selectCnt = 1;
                break;
            }
            long time = System.nanoTime();
            if (time - TimeUnit.MILLISECONDS.toNanos(timeoutMillis) >= currentTimeNanos) {
                // timeoutMillis elapsed without anything selected.
                selectCnt = 1;
            } else if (SELECTOR_AUTO_REBUILD_THRESHOLD > 0 && // SELECTOR_AUTO_REBUILD_THRESHOLD defaults to 512
                    selectCnt >= SELECTOR_AUTO_REBUILD_THRESHOLD) {
                // The code exists in an extra method to ensure the method is not too big to inline as this
                // branch is not very likely to get hit very frequently.
                selector = selectRebuildSelector(selectCnt);
                selectCnt = 1;
                break;
            }
            currentTimeNanos = time;
        }
        if (selectCnt > MIN_PREMATURE_SELECTOR_RETURNS) {
            if (logger.isDebugEnabled()) {
                logger.debug("Selector.select() returned prematurely {} times in a row for Selector {}.",
                        selectCnt - 1, selector);
            }
        }
    } catch (CancelledKeyException e) {
        if (logger.isDebugEnabled()) {
            logger.debug(CancelledKeyException.class.getSimpleName() + " raised by a Selector {} - JDK bug?",
                    selector, e);
        }
        // Harmless exception - log anyway
    }
}
private Selector selectRebuildSelector(int selectCnt) throws IOException {
    // The selector returned prematurely many times in a row.
    // Rebuild the selector to work around the problem.
    logger.warn(
            "Selector.select() returned prematurely {} times in a row; rebuilding Selector {}.",
            selectCnt, selector);
    rebuildSelector();
    Selector selector = this.selector;
    // Select again to populate selectedKeys.
    selector.selectNow();
    return selector;
}
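Netty's default threshold (SELECTOR_AUTO_REBUILD_THRESHOLD) is 512. As the source shows, selectCnt is incremented every time selector.select(timeoutMillis) returns. If the full timeout has elapsed (roughly 1 s in the idle case), the wake-up is a normal one and selectCnt is reset to 1; only returns that come back early with nothing selected keep accumulating. Once selectCnt reaches 512, the Selector is rebuilt.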
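The counting rule can be isolated from the event loop and looked at on its own. The class below is a simplified sketch written for this article, not Netty code: PrematureReturnCounter and onEmptySelect are hypothetical names, and the timestamps are passed in as parameters so that the logic can be exercised without a real Selector.

```java
import java.util.concurrent.TimeUnit;

class PrematureReturnCounter {
    private final int threshold;
    private int selectCnt;

    PrematureReturnCounter(int threshold) {
        this.threshold = threshold;
    }

    /**
     * Records one return of select(timeoutMillis) that selected nothing.
     *
     * @return true if the caller should rebuild its Selector
     */
    boolean onEmptySelect(long startNanos, long endNanos, long timeoutMillis) {
        selectCnt++;
        if (endNanos - TimeUnit.MILLISECONDS.toNanos(timeoutMillis) >= startNanos) {
            // The full timeout elapsed: an ordinary idle wake-up, so start counting again.
            selectCnt = 1;
            return false;
        }
        if (threshold > 0 && selectCnt >= threshold) {
            // Too many early returns in a row: ask for a rebuild and reset the counter.
            selectCnt = 1;
            return true;
        }
        return false;
    }

    int count() {
        return selectCnt;
    }

    public static void main(String[] args) {
        PrematureReturnCounter counter = new PrematureReturnCounter(3);
        for (int i = 0; i < 3; i++) {
            // start == end simulates select() returning immediately.
            System.out.println("rebuild? " + counter.onEmptySelect(0L, 0L, 1000L));
        }
    }
}
```

A threshold of 3 is used in main only to keep the output short; a single timed-out select in between is enough to reset the count, which is why ordinary idle periods never trigger a rebuild.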