Memory Management: ixy Study Notes

风流网民

Category: DPDK. Tags: DPDK

Article link: https://blog.csdn.net/wq897387/article/details/129296857

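ixy uses explicit memory allocation, similar to DPDK. This differs from netmap, which exposes a replica of the NIC's ring to the application. Packet memory allocation is frequently cited as one of the main reasons netmap is faster than traditional in-kernel drivers, and so netmap leaves the details of memory handling to the application.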

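Many forwarding scenarios can then be implemented by simply swapping packet pointers in the rings. However, more complex scenarios, such as those where packets are not passed straight on to a NIC, do not map well onto this API and require manual packet buffer management on top of it. A ring-based API is also considerably more cumbersome to use than one with explicit memory allocation. It is true that packet memory allocation is a significant cost in the Linux kernel: when forwarding packets with Open vSwitch on Linux, we measured about 100 cycles per packet spent on allocating and freeing packet memory (measured with perf). Almost all of this cost comes from initializing the kernel's sk_buff structure, a large data structure with extensive metadata intended for a general-purpose network stack.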

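Memory allocation in ixy costs only 30 cycles per packet, a price we are willing to pay for a simpler user-facing API. For sending and receiving packets and for managing memory, ixy's API is the same as DPDK's. It is best understood by reading the example programs ixy-fwd.c and ixy-pktgen.c.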

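The transmit-only example ixy-pktgen.c creates a memory pool, which is a fixed-size collection of fixed-size packet buffers, and pre-fills the buffers with packet data.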

static struct mempool* init_mempool() {
        const int NUM_BUFS = 2048;
        struct mempool* mempool = memory_allocate_mempool(NUM_BUFS, 0);
        // pre-fill all our packet buffers with some templates that can be modified later
        // we have to do it like this because sending is async in the hardware; we cannot re-use a buffer immediately
        struct pkt_buf* bufs[NUM_BUFS];
        for (int buf_id = 0; buf_id < NUM_BUFS; buf_id++) {
                struct pkt_buf* buf = pkt_buf_alloc(mempool);
                buf->size = PKT_SIZE;
                memcpy(buf->data, pkt_data, sizeof(pkt_data));
                *(uint16_t*) (buf->data + 24) = calc_ip_checksum(buf->data + 14, 20);
                bufs[buf_id] = buf;
        }
        // return them all to the mempool, all future allocations will return bufs with the data set above
        for (int buf_id = 0; buf_id < NUM_BUFS; buf_id++) {
                pkt_buf_free(bufs[buf_id]);
        }

        return mempool;
}
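The constants in the pre-fill loop follow from the frame layout: 14 is the length of the Ethernet header, 20 is the length of an IPv4 header without options, and 24 = 14 + 10 is the position of the header checksum field. calc_ip_checksum itself is not shown in this post. The following is a minimal sketch of how an RFC 1071 style header checksum can be computed; it is not ixy's implementation, and the names ipv4_header_checksum and ipv4_store_checksum are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper, not ixy's calc_ip_checksum: one's complement sum over
// the header read as big-endian 16-bit words, then inverted.
// The checksum field (bytes 10 and 11) must be zero when computing a new checksum.
uint16_t ipv4_header_checksum(const uint8_t* hdr, size_t len) {
        assert(len % 2 == 0); // IPv4 headers are a multiple of 4 bytes long
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i += 2) {
                sum += ((uint32_t) hdr[i] << 8) | hdr[i + 1];
        }
        // fold the carries back into the low 16 bits
        while (sum >> 16) {
                sum = (sum & 0xFFFF) + (sum >> 16);
        }
        return (uint16_t) ~sum;
}

// Hypothetical helper: zero the checksum field, compute the checksum,
// and store it in network byte order.
void ipv4_store_checksum(uint8_t* hdr, size_t len) {
        hdr[10] = 0;
        hdr[11] = 0;
        uint16_t cs = ipv4_header_checksum(hdr, len);
        hdr[10] = (uint8_t) (cs >> 8);
        hdr[11] = (uint8_t) (cs & 0xFF);
}
```

A receiver can verify a header by running the same sum over all 20 bytes, checksum included: for an intact header the result is 0.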

It then allocates a batch of packets from this pool, writes a sequence number into each packet, and passes the batch to the transmit function.

The transmit function is asynchronous: it places pointers to these packets into the queue, and the NIC then fetches and sends them.

        // tx loop
        while (true) {
                // we cannot immediately recycle packets, we need to allocate new packets every time
                // the old packets might still be used by the NIC: tx is async
                pkt_buf_alloc_batch(mempool, bufs, BATCH_SIZE);
                for (uint32_t i = 0; i < BATCH_SIZE; i++) {
                        // packets can be modified here, make sure to update the checksum when changing the IP header
                        *(uint32_t*)(bufs[i]->data + PKT_SIZE - 4) = seq_num++;
                }
                // the packets could be modified here to generate multiple flows
                ixy_tx_batch_busy_wait(dev, 0, bufs, BATCH_SIZE);

                // don't check time for every packet, this yields +10% performance :)
                if ((counter++ & 0xFFF) == 0) {
                        uint64_t time = monotonic_time();
                        if (time - last_stats_printed > 1000 * 1000 * 1000) {
                                // every second
                                ixy_read_stats(dev, &stats);
                                print_stats_diff(&stats, &stats_old, time - last_stats_printed);
                                stats_old = stats;
                                last_stats_printed = time;
                        }
                }
                // track stats
        }
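The name ixy_tx_batch_busy_wait in the loop above suggests a wrapper that keeps retrying a non-blocking batch transmit until every packet has been accepted. A minimal sketch of that pattern follows, assuming a hypothetical toy_tx_batch that accepts only a few packets per call. Neither function is ixy's code.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical stand-in for a non-blocking transmit: a toy queue that accepts
// at most room_per_call packets per call and reports how many it took.
struct toy_queue {
        uint32_t room_per_call;
        uint32_t calls;    // how often the transmit function was invoked
        uint32_t accepted; // total packets taken so far
};

uint32_t toy_tx_batch(struct toy_queue* q, void* bufs[], uint32_t num_bufs) {
        (void) bufs; // a real driver would write descriptors for these buffers
        uint32_t n = num_bufs < q->room_per_call ? num_bufs : q->room_per_call;
        q->calls++;
        q->accepted += n;
        return n;
}

// Busy-wait wrapper: retry with the unsent tail of the batch until all are queued.
void toy_tx_batch_busy_wait(struct toy_queue* q, void* bufs[], uint32_t num_bufs) {
        assert(q->room_per_call > 0); // otherwise the loop would never finish
        uint32_t sent = 0;
        while (sent < num_bufs) {
                sent += toy_tx_batch(q, bufs + sent, num_bufs - sent);
        }
}
```

Spinning like this suits a packet generator, which has nothing else to do while the queue is full. A forwarder would more likely drop the unsent packets instead of waiting.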

The batch allocation used in the loop above, pkt_buf_alloc_batch, takes buffers from the pool's stack of free entries:

uint32_t pkt_buf_alloc_batch(struct mempool* mempool, struct pkt_buf* bufs[], uint32_t num_bufs) {
        if (mempool->free_stack_top < num_bufs) {
                warn("memory pool %p only has %d free bufs, requested %d", mempool, mempool->free_stack_top, num_bufs);
                num_bufs = mempool->free_stack_top;
        }
        for (uint32_t i = 0; i < num_bufs; i++) {
                uint32_t entry_id = mempool->free_stack[--mempool->free_stack_top];
                bufs[i] = (struct pkt_buf*) (((uint8_t*) mempool->base_addr) + entry_id * mempool->buf_size);
        }
        return num_bufs;
}
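pkt_buf_alloc_batch pops entry ids off a stack of free buffers and turns each id into an address inside one contiguous region. The matching free operation is not shown in this post. A minimal sketch of the full round trip could look like this, using a made-up toy_pool type rather than ixy's struct mempool:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_BUFS 4
#define TOY_BUF_SIZE 64

// Hypothetical toy pool: all buffers live in one array, free ones are tracked by id.
struct toy_pool {
        uint8_t storage[TOY_NUM_BUFS * TOY_BUF_SIZE];
        uint32_t free_stack[TOY_NUM_BUFS];
        uint32_t free_stack_top; // number of ids currently on the stack
};

void toy_pool_init(struct toy_pool* p) {
        for (uint32_t i = 0; i < TOY_NUM_BUFS; i++) {
                p->free_stack[i] = i;
        }
        p->free_stack_top = TOY_NUM_BUFS;
}

// Pop an id and convert it to an address; returns NULL when the pool is empty.
uint8_t* toy_pool_alloc(struct toy_pool* p) {
        if (p->free_stack_top == 0) {
                return NULL;
        }
        uint32_t id = p->free_stack[--p->free_stack_top];
        return p->storage + id * TOY_BUF_SIZE;
}

// Convert the address back to an id and push it onto the free stack.
void toy_pool_free(struct toy_pool* p, uint8_t* buf) {
        assert(p->free_stack_top < TOY_NUM_BUFS); // more frees than allocs would overflow
        uint32_t id = (uint32_t) ((buf - p->storage) / TOY_BUF_SIZE);
        p->free_stack[p->free_stack_top++] = id;
}
```

Freeing does not clear the buffer, so whatever was written before is still there at the next allocation. init_mempool relies on exactly this when it pre-fills every buffer with a packet template and then returns them all to the pool.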

The enqueue step of the transmit path behind ixy_tx_batch_busy_wait looks like this (an excerpt; the code calls it step 2):

        // step 2: send out as many of our packets as possible
        uint32_t sent;
        for (sent = 0; sent < num_bufs; sent++) {
                uint32_t next_index = wrap_ring(queue->tx_index, queue->num_entries);
                // we are full if the next index is the one we are trying to reclaim
                if (clean_index == next_index) {
                        break;
                }
                struct pkt_buf* buf = bufs[sent];
                // remember virtual address to clean it up later
                queue->virtual_addresses[queue->tx_index] = (void*) buf;
                volatile union ixgbe_adv_tx_desc* txd = queue->descriptors + queue->tx_index;
                queue->tx_index = next_index;
                // NIC reads from here
                txd->read.buffer_addr = buf->buf_addr_phy + offsetof(struct pkt_buf, data);
                // always the same flags: one buffer (EOP), advanced data descriptor, CRC offload, data length
                txd->read.cmd_type_len =
                        IXGBE_ADVTXD_DCMD_EOP | IXGBE_ADVTXD_DCMD_RS | IXGBE_ADVTXD_DCMD_IFCS | IXGBE_ADVTXD_DCMD_DEXT | IXGBE_ADVTXD_DTYP_DATA | buf->size;
                // no fancy offloading stuff - only the total payload length
                // implement offloading flags here:
                //      * ip checksum offloading is trivial: just set the offset
                //      * tcp/udp checksum offloading is more annoying, you have to precalculate the pseudo-header checksum
                txd->read.olinfo_status = buf->size << IXGBE_ADVTXD_PAYLEN_SHIFT;
        }
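The full check in the loop above leaves one slot unused, so that a full ring can be told apart from an empty one. A small sketch of that index arithmetic follows, assuming a power-of-two ring size. The wrap_ring helper used above is not shown in this post, so the hypothetical toy_wrap below is only one plausible way to write it.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RING_SIZE 8 // must be a power of two for the mask trick below

// Hypothetical helper: advance an index by one and wrap around at the ring size.
uint16_t toy_wrap(uint16_t index, uint16_t ring_size) {
        assert(ring_size != 0 && (ring_size & (ring_size - 1)) == 0);
        return (uint16_t) ((index + 1) & (ring_size - 1));
}

struct toy_ring {
        void* slots[TOY_RING_SIZE];
        uint16_t clean_index; // oldest slot not yet reclaimed
        uint16_t tx_index;    // next slot to fill
};

// Enqueue as many buffers as fit; stops when the next index would hit clean_index.
uint32_t toy_ring_enqueue(struct toy_ring* r, void* bufs[], uint32_t num_bufs) {
        uint32_t sent;
        for (sent = 0; sent < num_bufs; sent++) {
                uint16_t next = toy_wrap(r->tx_index, TOY_RING_SIZE);
                if (next == r->clean_index) {
                        break; // ring is full
                }
                r->slots[r->tx_index] = bufs[sent];
                r->tx_index = next;
        }
        return sent;
}
```

With this scheme a ring of 8 entries holds at most 7 packets in flight.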

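Previously sent packets are freed asynchronously: the transmit function checks the transmit queue for packets the NIC has finished sending and returns them to the memory pool.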
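The clean-up half of the transmit function is not shown in this post. As a rough sketch of the idea, the code below walks the ring from the oldest outstanding slot and returns every finished buffer to the pool. The done flag is an assumption that stands in for the status the NIC writes back into a descriptor, all toy_ names are hypothetical, and a real driver may well reclaim slots in larger batches rather than one at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TXQ_SIZE 8 // power of two

struct toy_txq {
        void* virtual_addresses[TOY_TXQ_SIZE]; // buffer behind each slot
        bool done[TOY_TXQ_SIZE];               // stand-in for the NIC's write-back status
        uint16_t clean_index;                  // oldest slot not yet reclaimed
        uint16_t tx_index;                     // next slot the driver will fill
        uint32_t returned;                     // buffers handed back to the pool so far
};

// Hypothetical stand-in for returning a buffer to its memory pool.
void toy_return_to_pool(struct toy_txq* q, void* buf) {
        assert(buf != NULL);
        q->returned++;
}

// Reclaim finished slots in order; stop at the first one the NIC still owns.
uint32_t toy_txq_clean(struct toy_txq* q) {
        uint32_t cleaned = 0;
        while (q->clean_index != q->tx_index && q->done[q->clean_index]) {
                toy_return_to_pool(q, q->virtual_addresses[q->clean_index]);
                q->virtual_addresses[q->clean_index] = NULL;
                q->done[q->clean_index] = false;
                q->clean_index = (uint16_t) ((q->clean_index + 1) & (TOY_TXQ_SIZE - 1));
                cleaned++;
        }
        return cleaned;
}
```

Because slots are reclaimed strictly in order, a packet that is still in flight also holds back the finished ones queued after it.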

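This means a packet buffer cannot be reused immediately, which is why the ixy-pktgen example looks quite different from a packet generator built on a typical socket API.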

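The forwarding program ixy-fwd.c avoids explicit memory pool handling: the driver allocates a memory pool for each receive ring and allocates packet buffers automatically. Allocation happens in the packet receive function. Freeing happens either in the transmit function or, when the output link is full, by freeing the packet directly. Exposing the rings directly, as netmap does, could speed up this example considerably, at the cost of usability.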
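The buffer ownership in such a forwarder can be sketched with toy stand-ins: receive allocates, transmit takes over whatever fits into the output ring, and the application frees the rest. None of the toy_ functions below are ixy's API. They only count where buffers go, as an illustration under those assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BATCH 8

// Hypothetical device model: counts how buffers move, nothing more.
struct toy_dev {
        uint32_t rx_pending; // packets waiting in the receive ring
        uint32_t tx_room;    // free slots in the transmit ring
        uint32_t allocated;  // buffers taken from the rx pool by receive
        uint32_t queued;     // buffers handed to the NIC (freed later by tx clean-up)
        uint32_t dropped;    // buffers freed by the application
};

// Receive allocates: every packet returned comes with a buffer from the pool.
uint32_t toy_fwd_rx(struct toy_dev* d, int bufs[], uint32_t max) {
        uint32_t n = d->rx_pending < max ? d->rx_pending : max;
        for (uint32_t i = 0; i < n; i++) {
                bufs[i] = (int) d->allocated++;
        }
        d->rx_pending -= n;
        return n;
}

// Transmit takes ownership of as many buffers as fit into the ring.
uint32_t toy_fwd_tx(struct toy_dev* d, const int bufs[], uint32_t num) {
        (void) bufs;
        uint32_t n = num < d->tx_room ? num : d->tx_room;
        d->tx_room -= n;
        d->queued += n;
        return n;
}

// Forward one batch; whatever the output ring did not accept is dropped.
void toy_forward(struct toy_dev* rx, struct toy_dev* tx) {
        int bufs[TOY_BATCH];
        uint32_t num_rx = toy_fwd_rx(rx, bufs, TOY_BATCH);
        uint32_t num_tx = toy_fwd_tx(tx, bufs, num_rx);
        assert(num_tx <= num_rx);
        for (uint32_t i = num_tx; i < num_rx; i++) {
                tx->dropped++; // stands in for freeing bufs[i] back to the pool
        }
}
```

Every buffer the receive side allocates ends up either queued for the NIC or dropped, so the application never has to touch a memory pool itself.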