The I/O Multiplexing Model: select

芒骁

Last modified 2022-07-26 16:45:03

Category: IO  Tags: java

First published 2022-01-11 15:53:11

Original link: https://blog.csdn.net/paradox_1_0/article/details/103211867

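The select system call monitors a set of file descriptors the caller is interested in, for a specified period of time, and reports readable, writable, and exceptional events on them.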

Implementing I/O multiplexing with select

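In terms of control flow, issuing I/O requests through select is not very different from the synchronous blocking model. It even adds extra work, namely registering the sockets to watch and calling select itself, so a single request is handled less efficiently. The big advantage of select is that one thread can serve I/O requests on many sockets at once: the user registers several sockets and then calls select repeatedly to pick up whichever ones have become active. In the synchronous blocking model, the same effect requires multiple threads.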

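{
    /* Sketch only: sock, maxfd, buffer, and process() are placeholders. */
    fd_set watched, ready;
    FD_ZERO(&watched);
    FD_SET(sock, &watched);              /* register the socket to watch */
    while (1) {
        ready = watched;                 /* select() overwrites its arguments */
        if (select(maxfd + 1, &ready, NULL, NULL, NULL) <= 0)
            continue;
        for (int fd = 0; fd <= maxfd; fd++) {
            if (FD_ISSET(fd, &ready)) {  /* this descriptor can be read */
                ssize_t n = read(fd, buffer, sizeof(buffer));
                if (n > 0)
                    process(buffer, n);
            }
        }
    }
}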
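
To make the loop above concrete, here is a minimal sketch in C of a single pass of such a loop. The helper name drain_ready and its return convention are assumptions made for illustration and do not come from the original post. It watches a caller-supplied array of descriptors, blocks in select, and reads from every descriptor that is reported ready.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Hypothetical helper: block until at least one of the nfd descriptors in
 * fds[] is readable, read once from every ready descriptor, and return the
 * total number of bytes read (-1 if select() itself fails).
 */
static long drain_ready(const int *fds, int nfd)
{
    fd_set rset;
    int maxfd = -1;
    long total = 0;
    char buf[256];

    FD_ZERO(&rset);
    for (int i = 0; i < nfd; i++) {
        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);
        FD_SET(fds[i], &rset);           /* register the descriptor */
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    /* A NULL timeout means: wait until something is readable. */
    if (select(maxfd + 1, &rset, NULL, NULL, NULL) < 0)
        return -1;

    for (int i = 0; i < nfd; i++) {
        if (FD_ISSET(fds[i], &rset)) {   /* still set: this one is ready */
            ssize_t n = read(fds[i], buf, sizeof(buf));
            if (n > 0)
                total += n;
        }
    }
    return total;
}
```

A real server would call such a helper inside while (1) and hand the data to its own processing code. With two pipes that each already hold data, one call serves both from a single thread.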

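Summary of the select process

Short version

I/O can be reduced to three kinds of events, in, out, and err, and the idea behind select is to keep three bit sequences, one per event kind, each recording which files are being watched for that event.

Take the in event as an example.
001011001… Each bit stands for one file's in event. A value of 1 means select is watching that file's in event on the caller's behalf, which is what we usually call being interested in the file's in event.

The actual work, however, is done by the kernel. The user passes the sets of interesting events, that is, the three bit sequences in, out, and err, to the kernel, and the kernel does everything else.

Once the kernel has the three bit sequences, it scans them for bits set to 1. Each such bit means there is work to do: the kernel has to check whether that event has arrived on the corresponding file, which it does by querying the socket.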

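Detailed version

First, a file has three kinds of events: in, out, and err.

select uses the fd_set data structure to hold one kind of event for all files.

User space passes three sets of events to the kernel. The arguments are pointers of type fd_set, each pointing to its own memory region. As the definition of fd_set below shows, each region can be viewed as a bit sequence.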

// include/uapi/linux/posix_types.h
#define __FD_SETSIZE    1024
typedef struct {
    unsigned long fds_bits[__FD_SETSIZE / (8 * sizeof(long))];
} __kernel_fd_set;


// Source: include/linux/types.h
typedef __kernel_fd_set     fd_set;

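When the kernel receives the data, it first sets aside memory for six bitmaps, three holding the input arguments and three holding the results. It tracks them with a data structure called fd_set_bits, whose six unsigned long pointers each mark the start of one bitmap.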

// Source: fs/select.c
// Carries select's input events and output results; an extended version of fd_set.
typedef struct {
    unsigned long *in, *out, *ex;
    unsigned long *res_in, *res_out, *res_ex;
} fd_set_bits;
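
As a rough user-space illustration of the six-bitmap layout, the sketch below allocates one zeroed buffer and points six fields into it. The names six_bitmaps and six_bitmaps_alloc are hypothetical, and the sketch counts in whole unsigned long words, whereas the kernel code quoted later in this article offsets by a byte count.

```c
#include <assert.h>
#include <stdlib.h>

/* User-space stand-in for fd_set_bits, with the same member names. */
struct six_bitmaps {
    unsigned long *in, *out, *ex;
    unsigned long *res_in, *res_out, *res_ex;
};

/*
 * Allocate one zeroed buffer of 6 * words longs and carve it into six
 * consecutive bitmaps. Returns the buffer (release it with free()) or
 * NULL if the allocation fails.
 */
static void *six_bitmaps_alloc(struct six_bitmaps *fds, size_t words)
{
    unsigned long *bits;

    assert(fds != NULL && words > 0);
    bits = calloc(6 * words, sizeof(unsigned long));
    if (!bits)
        return NULL;

    fds->in      = bits;               /* three input bitmaps ...  */
    fds->out     = bits + words;
    fds->ex      = bits + 2 * words;
    fds->res_in  = bits + 3 * words;   /* ... three result bitmaps */
    fds->res_out = bits + 4 * words;
    fds->res_ex  = bits + 5 * words;
    return bits;
}
```

Keeping all six bitmaps in a single allocation means one call sets everything up and one call releases it.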

Initializing fds

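With the memory in place, the kernel can process the data passed from user space. The bit sequences are handled in the simplest possible way, by scanning, but the scan needs an upper bound, and that bound is n. The user passes three bit sequences, that is, three fd_set structures, and n is the highest-numbered file descriptor set in any of them, plus 1. The plus 1 is there because file descriptors are numbered from 0: to cover descriptors 0 through maxfd the kernel loops with for (i = 0; i < n; i++), so n must be maxfd + 1.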

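The kernel then reads the values in fds and starts scanning.
Case 1: no event on any file in the current word is of interest, that is, all_bits = in | out | ex is 0, and the whole word is skipped.

Case 2: a bit is set in all_bits, so an fd with an event of interest has been found and is checked.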
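
The two cases can be imitated in user space. The following sketch is an assumption-laden simplification, not the kernel code: the function count_interesting is hypothetical, and it only counts descriptors of interest rather than polling any file. It does show how a word whose combined bits are all zero is skipped in one step, while other words are examined bit by bit.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_WORD ((int)(CHAR_BIT * sizeof(unsigned long)))

/*
 * Count the descriptors below n that are set in at least one of the three
 * bitmaps. If examined is not NULL, it receives the number of individual
 * bits that had to be looked at.
 */
static int count_interesting(const unsigned long *in, const unsigned long *out,
                             const unsigned long *ex, int n, int *examined)
{
    int found = 0, looked = 0;

    assert(n >= 0);
    for (int i = 0; i < n; ) {
        unsigned long all_bits = *in++ | *out++ | *ex++;
        unsigned long bit = 1;

        if (all_bits == 0) {           /* case 1: skip the whole word */
            i += BITS_PER_WORD;
            continue;
        }
        /* case 2: look at each bit of this word that is below n */
        for (int j = 0; j < BITS_PER_WORD && i < n; j++, i++, bit <<= 1) {
            looked++;
            if (all_bits & bit)
                found++;
        }
    }
    if (examined)
        *examined = looked;
    return found;
}
```

With an all-zero first word and two bits set in the second word, only the bits of the second word are examined.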

select source code

static __attribute__((unused))
int select(int nfds, fd_set *rfds, fd_set *wfds, fd_set *efds, struct timeval *timeout)
{
	int ret = sys_select(nfds, rfds, wfds, efds, timeout);
	// Return value: 0 on timeout; -1 on failure; on success, a positive integer giving the number of ready descriptors.
	if (ret < 0) {
		SET_ERRNO(-ret);
		ret = -1;
	}
	return ret;
}

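nfds: the highest-numbered file descriptor in any of the three sets, plus 1. It is not a count of descriptors; the plus 1 is needed because file descriptors are numbered from 0.

rfds, wfds, efds: pointers to the descriptor sets for readable, writable, and exceptional events, respectively.

timeout: sets how long select may wait, that is, how long the kernel waits before giving up. timeout == NULL means wait indefinitely.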
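
As a small illustration of the timeout parameter, the sketch below waits a bounded time for a single descriptor. The helper readable_within and its millisecond argument are assumptions made for this example. A timeout of 0 turns the call into a non-blocking check.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>   /* pipe() and write(), used when trying the helper out */

/*
 * Hypothetical helper: wait up to timeout_ms milliseconds for fd to become
 * readable. Returns 1 if it is readable, 0 on timeout, -1 on error.
 */
static int readable_within(int fd, long timeout_ms)
{
    fd_set rset;
    struct timeval tv;
    int ret;

    assert(fd >= 0 && fd < FD_SETSIZE && timeout_ms >= 0);
    FD_ZERO(&rset);
    FD_SET(fd, &rset);

    tv.tv_sec  = timeout_ms / 1000;            /* whole seconds */
    tv.tv_usec = (timeout_ms % 1000) * 1000;   /* remainder in microseconds */

    ret = select(fd + 1, &rset, NULL, NULL, &tv);
    if (ret < 0)
        return -1;
    return ret > 0 && FD_ISSET(fd, &rset);
}
```

On an empty pipe, readable_within(read_end, 0) returns 0 immediately; once a byte has been written to the pipe, the same call returns 1.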


fd_set operations

The following macros are commonly used together with select:

#include <sys/select.h>
void FD_ZERO(fd_set *fdset);            // clear every bit of an fd_set variable
void FD_CLR(int fd, fd_set *fdset);     // clear the bit for fd
void FD_SET(int fd, fd_set *fdset);     // set the bit for fd
int  FD_ISSET(int fd, fd_set *fdset);   // test whether the bit for fd is set
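
The four macros can be tried out without calling select at all. The sketch below uses a hypothetical helper, fdset_count, that counts the bits set below a given limit by testing each one with FD_ISSET.

```c
#include <assert.h>
#include <sys/select.h>

/* Hypothetical helper: count how many descriptors in [0, nfds) are set. */
static int fdset_count(fd_set *set, int nfds)
{
    int count = 0;

    assert(nfds >= 0 && nfds <= FD_SETSIZE);
    for (int fd = 0; fd < nfds; fd++) {
        if (FD_ISSET(fd, set))
            count++;
    }
    return count;
}
```

After FD_ZERO the count is 0. Setting descriptors 3 and 7 with FD_SET raises it to 2, and FD_CLR on descriptor 3 brings it back to 1.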

Data structures

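As the definitions of fd_set and fd_set_bits given earlier show, fd_set is simply a bitmap, which limits select to polling at most 1024 files. Supporting more requires modifying the source and recompiling.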

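The six members of fd_set_bits are six pointers, each to the start of a different bitmap; see the comments in the core_sys_select source later in this article.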

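select usage example

After declaring a file descriptor set, you must clear all of its bits with FD_ZERO. Then set the bits corresponding to the descriptors of interest, as follows: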

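fd_set rset;
int fd;                          /* assumed to hold an open descriptor, e.g. a socket */
FD_ZERO(&rset);
FD_SET(fd, &rset);
FD_SET(STDIN_FILENO, &rset);     /* stdin is a FILE *; FD_SET needs the descriptor from <unistd.h> */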

/* Here come a few helper functions */

static __attribute__((unused))
void FD_ZERO(fd_set *set)
{
	memset(set, 0, sizeof(*set));
}

static __attribute__((unused))
void FD_SET(int fd, fd_set *set)
{
	if (fd < 0 || fd >= FD_SETSIZE)
		return;
	set->fd32[fd / 32] |= 1 << (fd & 31);
}

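Then call select to block until an event arrives on one of the descriptors. If a timeout is given and it expires first, select stops waiting and execution continues.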

select(fd+1, &rset, NULL, NULL,NULL);

After select returns, use FD_ISSET to test whether a given bit is set:

if (FD_ISSET(fd, &rset)) {
    ... 
    //do something  
}

Understanding the select model in depth

/* commonly an fd_set represents 256 FDs */
#define FD_SETSIZE 256
/* FD_SETSIZE is defined as 256, followed by a struct named fd_set */
typedef struct { uint32_t fd32[FD_SETSIZE/32]; } fd_set;

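The key to understanding the select model is understanding fd_set. For simplicity, assume fd_set is 1 byte long. Each bit of fd_set corresponds to one file descriptor, so a 1-byte fd_set can represent at most 8 fds (0 through 7), with bit k standing for fd k.

(1) After fd_set set; FD_ZERO(&set); the bits of set are 0000,0000.

(2) With fd=5, FD_SET(fd, &set); changes set to 0010,0000 (bit 5, counting from 0 at the right, is set to 1).

(3) Adding fd=2 and fd=1 changes set to 0010,0110.

(4) select(6, &set, 0, 0, 0) blocks and waits.

(5) If readable events occur on both fd=1 and fd=2, select returns and set is now 0000,0110. Note that fd=5, on which no event occurred, has been cleared.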
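
The five steps can be replayed with plain bit operations. The sketch below is a toy model only: toy_fd_set, toy_set, and toy_select are hypothetical names, bit k stands for fd k as in the FD_SET helper shown earlier, and toy_select merely imitates the way select leaves only the ready descriptors set.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 8-descriptor set: bit k stands for fd k (not the real fd_set). */
typedef uint8_t toy_fd_set;

/* Toy counterpart of FD_SET: return s with the bit for fd switched on. */
static toy_fd_set toy_set(toy_fd_set s, int fd)
{
    assert(fd >= 0 && fd < 8);
    return (toy_fd_set)(s | (1u << fd));
}

/* What select leaves behind: watched descriptors that are also ready. */
static toy_fd_set toy_select(toy_fd_set watched, toy_fd_set ready)
{
    return (toy_fd_set)(watched & ready);
}
```

Starting from 0, setting fd 5 gives 0x20 (0010,0000), adding fds 2 and 1 gives 0x26 (0010,0110), and toy_select with fds 1 and 2 ready leaves 0x06 (0000,0110).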

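From the discussion above, the characteristics of the select model follow directly:

(1) The number of file descriptors that can be monitored depends on sizeof(fd_set). On my server sizeof(fd_set) = 512, and each bit represents one file descriptor, so the largest set my server supports is 512 * 8 = 4096 descriptors. This limit is said to be adjustable, though reportedly the upper bound is fixed by a value set when the kernel is compiled.

(2) When adding an fd to the select set, you also need a separate data structure, an array, to record every fd placed in the set. First, after select returns, the array serves as the source list for testing each fd against the fd_set with FD_ISSET. Second, select clears any previously added fd on which no event occurred, so before every call to select you must take the fds from the array and add them again one by one (after FD_ZERO), finding the largest fd, maxfd, during the same pass so that maxfd + 1 can be passed as the first argument of select.

(3) So a program using select must loop to add the fds and find maxfd before each call, and after select returns use FD_ISSET to check whether an event occurred.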
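
The bookkeeping described in point (2) might look like the following sketch. The function name rebuild_set is hypothetical; the idea is to call it immediately before every select call, because select overwrites the set it is given.

```c
#include <assert.h>
#include <sys/select.h>

/*
 * Hypothetical helper: clear the set, add every descriptor recorded in
 * array[], and return the largest descriptor seen (-1 if the array is
 * empty). The caller passes the result plus 1 as select's first argument.
 */
static int rebuild_set(const int *array, int count, fd_set *set)
{
    int maxfd = -1;

    FD_ZERO(set);                       /* FD_ZERO always comes first */
    for (int i = 0; i < count; i++) {
        assert(array[i] >= 0 && array[i] < FD_SETSIZE);
        FD_SET(array[i], set);
        if (array[i] > maxfd)
            maxfd = array[i];
    }
    return maxfd;
}
```

For the array {4, 9, 2} the function returns 9, so the matching call would be select(9 + 1, &set, NULL, NULL, NULL).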

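Summary of select

In essence, select works by setting and checking a data structure that holds per-fd flag bits. This brings the following drawbacks:

The number of fds a single process can monitor with select is limited by the size of fd_set, which is 1024 bits in the kernel header quoted earlier. This is a separate limit from the system-wide maximum number of open files, which depends largely on system memory and can be inspected with cat /proc/sys/fs/file-max.

Sockets are scanned linearly, that is, by polling, which is inefficient: when there are many sockets, every select() call walks through every descriptor below nfds (up to FD_SETSIZE), whether or not it is active, and this wastes a lot of CPU time. If a callback could be registered for each socket and run automatically when the socket becomes active, the polling would be avoided, and that is exactly what epoll and kqueue do.

A data structure holding a large number of fds has to be maintained, and copying it between user space and kernel space on every call is expensive.

select does have an advantage, of course: portability. Both Linux and Windows support select.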

select implementation in detail

Call chain: sys_select > core_sys_select > do_select

// Source: fs/select.c

// select system call definition
SYSCALL_DEFINE5(select, int, n, fd_set __user *, inp, fd_set __user *, outp,
        fd_set __user *, exp, struct timeval __user *, tvp)
{
    struct timespec64 end_time, *to = NULL;
    struct timeval tv;
    int ret;

    if (tvp) {
        if (copy_from_user(&tv, tvp, sizeof(tv)))
            return -EFAULT;

        to = &end_time;
        if (poll_select_set_timeout(to,
                tv.tv_sec + (tv.tv_usec / USEC_PER_SEC),
                (tv.tv_usec % USEC_PER_SEC) * NSEC_PER_USEC))
            return -EINVAL;
    }

    ret = core_sys_select(n, inp, outp, exp, to);
    ret = poll_select_copy_remaining(&end_time, tvp, 1, ret);

    return ret;
}
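
The timeout arithmetic in the system call entry folds any excess microseconds into the seconds field and converts the remainder to nanoseconds. A user-space sketch of just that arithmetic is shown below. The struct norm_time and the function normalize_timeval are hypothetical names, and the two constants are redefined locally with the values their names imply.

```c
#include <assert.h>

#define USEC_PER_SEC  1000000L   /* microseconds per second */
#define NSEC_PER_USEC 1000L      /* nanoseconds per microsecond */

/* Hypothetical result type: whole seconds plus leftover nanoseconds. */
struct norm_time {
    long long sec;
    long nsec;
};

/* Same arithmetic as the two arguments passed to poll_select_set_timeout. */
static struct norm_time normalize_timeval(long long tv_sec, long tv_usec)
{
    struct norm_time t;

    assert(tv_usec >= 0);
    t.sec  = tv_sec + tv_usec / USEC_PER_SEC;            /* carry whole seconds */
    t.nsec = (tv_usec % USEC_PER_SEC) * NSEC_PER_USEC;   /* rest, in nanoseconds */
    return t;
}
```

For example, a timeval of 1 second and 2,500,000 microseconds becomes 3 seconds and 500,000,000 nanoseconds.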

int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
               fd_set __user *exp, struct timespec64 *end_time)
{
    fd_set_bits fds;
    void *bits;
    int ret, max_fds;
    size_t size, alloc_size;
    struct fdtable *fdt;
    /* Allocate small arguments on the stack to save memory and be faster */
    // The trick: use space on the kernel stack first, and fall back to kvmalloc only if it is too small.


    // SELECT_STACK_ALLOC=256
    long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];

    ret = -EINVAL;
    if (n < 0)
        goto out_nofds;

    /* max_fds can increase, so grab it once to avoid race */
    rcu_read_lock();
    fdt = files_fdtable(current->files);
    max_fds = fdt->max_fds;
    rcu_read_unlock();

    // n is capped at max_fds, the size of the process's fd table
    if (n > max_fds)
        n = max_fds;


    // Three input sets plus three result sets: six bitmaps in total
    /*
     * We need 6 bitmaps (in/out/ex for both incoming and outgoing),
     * since we used fdset we need to allocate memory in units of
     * long-words. 
     */
    // Bytes needed to hold n bits, rounded up to whole longs; this is the size of each of the six bitmaps
    size = FDS_BYTES(n);
    bits = stack_fds;

    // sizeof(stack_fds) / 6 splits the on-stack buffer into 6 bitmaps
    if (size > sizeof(stack_fds) / 6 ) {
        /* Not enough space in on-stack array; must use kmalloc */
        // The on-stack buffer is too small, so allocate with kvmalloc
        ret = -ENOMEM;
        if (size > (SIZE_MAX / 6))
            goto out_nofds;
        // alloc_size = 6 * size is the total number of bytes needed
        alloc_size = 6 * size;
        bits = kvmalloc(alloc_size, GFP_KERNEL);
        if (!bits)
            goto out_nofds;
    }

    // Point the members of fds at the start of each bitmap
    fds.in      = bits;
    fds.out     = bits +   size;
    fds.ex      = bits + 2*size;
    fds.res_in  = bits + 3*size;
    fds.res_out = bits + 4*size;
    fds.res_ex  = bits + 5*size;

    // Copy the events of interest from user space into kernel space
    if ((ret = get_fd_set(n, inp, fds.in)) ||
        (ret = get_fd_set(n, outp, fds.out)) ||
        (ret = get_fd_set(n, exp, fds.ex)))
        goto out;

    // Zero the result bitmaps
    zero_fd_set(n, fds.res_in);
    zero_fd_set(n, fds.res_out);
    zero_fd_set(n, fds.res_ex);

    ret = do_select(n, &fds, end_time);

    if (ret < 0)
        goto out;
    if (!ret) {
        ret = -ERESTARTNOHAND;
        if (signal_pending(current))
            goto out;
        ret = 0;
    }

    // Copy the results back to user space (via __copy_to_user)
    if (set_fd_set(n, inp, fds.res_in) ||
        set_fd_set(n, outp, fds.res_out) ||
        set_fd_set(n, exp, fds.res_ex))
        ret = -EFAULT;

out:
    if (bits != stack_fds)
        kvfree(bits);
out_nofds:
    return ret;
}
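
The sizing decision in core_sys_select can be reproduced in user space. In the sketch below, fds_bytes mirrors the rounding that the comments attribute to FDS_BYTES (bits rounded up to whole longs), and fits_on_stack applies the same size > sizeof(stack_fds) / 6 test. Both function names are hypothetical, and the 256 is the SELECT_STACK_ALLOC value quoted in the source.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_ALLOC 256   /* SELECT_STACK_ALLOC, as quoted in the source */

/* Bytes needed for a bitmap of n bits, rounded up to a whole number of longs. */
static size_t fds_bytes(int n)
{
    size_t bits_per_long = 8 * sizeof(long);

    assert(n >= 0);
    return ((size_t)n + bits_per_long - 1) / bits_per_long * sizeof(long);
}

/* 1 if six bitmaps of n bits fit in the on-stack buffer, 0 otherwise. */
static int fits_on_stack(int n)
{
    size_t stack_bytes = STACK_ALLOC / sizeof(long) * sizeof(long);

    return fds_bytes(n) <= stack_bytes / 6;
}
```

Under these assumptions a call with n = 64 stays on the stack, while n = 1024 needs 128 bytes per bitmap and takes the kvmalloc path. On a machine where long is 8 bytes, the stack buffer is enough for up to 320 descriptors.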

The kernel sets aside one memory region for the bitmaps.

How do_select works

do_select is implemented in the kernel.



static int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time)
{
    ktime_t expire, *to = NULL;
    // Build a wait-queue container; it keeps pointers to every wait entry added to the files' wait queues
    struct poll_wqueues table;

    // Prototype for the wait entries, used mainly to pass parameters
    poll_table *wait;
    int retval, i, timed_out = 0;
    u64 slack = 0;
    unsigned int busy_flag = net_busy_loop_on() ? POLL_BUSY_LOOP : 0;
    unsigned long busy_start = 0;

    rcu_read_lock();
    retval = max_select_fd(n, fds);
    rcu_read_unlock();

    if (retval < 0)
        return retval;
    n = retval;

    // Sets wait._qproc = __pollwait
    poll_initwait(&table);
    wait = &table.pt;
    if (end_time && !end_time->tv_sec && !end_time->tv_nsec) {
        wait->_qproc = NULL;
        timed_out = 1;
    }

    if (end_time && !timed_out)
        slack = select_estimate_accuracy(end_time);

    retval = 0;
    for (;;) {
        unsigned long *rinp, *routp, *rexp, *inp, *outp, *exp;
        bool can_busy_loop = false;

        inp = fds->in; outp = fds->out; exp = fds->ex;
        rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex;

        // Scan one word (batch) at a time
        for (i = 0; i < n; ++rinp, ++routp, ++rexp) {
            unsigned long in, out, ex, all_bits, bit = 1, mask, j;
            unsigned long res_in = 0, res_out = 0, res_ex = 0;

            in = *inp++; out = *outp++; ex = *exp++;
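            all_bits = in | out | ex;           // everything of interest in this word
            if (all_bits == 0) {
                // Nothing of interest here; skip BITS_PER_LONG files
                i += BITS_PER_LONG;
                continue;
            }

            // Scan the bits within the word one by one
            for (j = 0; j < BITS_PER_LONG; ++j, ++i, bit <<= 1) {   // bit shifts left so results land on the right file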
                struct fd f;
                if (i >= n)
                    break;
                if (!(bit & all_bits))
                    continue;
                f = fdget(i);
                if (f.file) {
                    // The file exists
                    const struct file_operations *f_op;
                    f_op = f.file->f_op;
                    mask = DEFAULT_POLLMASK;
                    if (f_op->poll) {
                        wait_key_set(wait, in, out,
                                 bit, busy_flag);

                        // Call the file's poll function, which ends up calling __pollwait
                        mask = (*f_op->poll)(f.file, wait);
                    }
                    fdput(f);

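                    // Inside the if blocks below an event has already been detected, so the process does not need to sleep and be woken.
                    // Setting _qproc to NULL keeps files polled later that are not ready from adding the process to their wait queues,
                    // which avoids pointless wakeups.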
                    if ((mask & POLLIN_SET) && (in & bit)) {
                        res_in |= bit;
                        retval++;
                        wait->_qproc = NULL;
                    }
                    if ((mask & POLLOUT_SET) && (out & bit)) {
                        res_out |= bit;
                        retval++;
                        wait->_qproc = NULL;
                    }
                    if ((mask & POLLEX_SET) && (ex & bit)) {
                        res_ex |= bit;
                        retval++;
                        wait->_qproc = NULL;
                    }

                    /* got something, stop busy polling */
                    if (retval) {
                        can_busy_loop = false;
                        busy_flag = 0;

                    /*
                     * only remember a returned
                     * POLL_BUSY_LOOP if we asked for it
                     */
                    } else if (busy_flag & mask)
                        can_busy_loop = true;

                }
            }

            // Word finished; record its results
            if (res_in)
                *rinp = res_in;
            if (res_out)
                *routp = res_out;
            if (res_ex)
                *rexp = res_ex;

            // Let the scheduler run other tasks between words if needed (this is not the sleep that waits for events)
            cond_resched();
        }

        // Every file has been polled once and all needed wait entries have been added; avoid adding them again on the next pass
        wait->_qproc = NULL;

        // An event occurred, or the timeout expired, or a signal is pending
        if (retval || timed_out || signal_pending(current))
            break;
        if (table.error) {
            retval = table.error;
            break;
        }

        /* only if found POLL_BUSY_LOOP sockets && not out of time */
        if (can_busy_loop && !need_resched()) {
            if (!busy_start) {
                busy_start = busy_loop_current_time();
                continue;
            }
            if (!busy_loop_timeout(busy_start))
                continue;
        }
        busy_flag = 0;

        /*
         * If this is the first loop and we have a timeout
         * given, then we convert to ktime_t and set the to
         * pointer to the expiry value.
         */
        if (end_time && !to) {
            expire = timespec64_to_ktime(*end_time);
            to = &expire;
        }

        // Set the task state to TASK_INTERRUPTIBLE and sleep until woken or until the timeout expires
        if (!poll_schedule_timeout(&table, TASK_INTERRUPTIBLE,
                       to, slack))
            timed_out = 1;
    }

    // Free the wait entries; the key step is removing them from the files' wait queues
    poll_freewait(&table);

    return retval;
}