[SPDK/NVMe Storage Technology Analysis]

https://www.cnblogs.com/vlhn/p/7727141.html

1. NVMe Overview

NVMe is a high-performance, scalable host controller interface designed for PCIe-based solid-state drives.
Its most distinctive feature is the use of multiple queues to handle I/O commands: a single NVMe device supports up to 64K I/O queues, and each I/O queue can hold up to 64K commands.
When the host issues an I/O command, it places the command into a submission queue (SQ) and then notifies the NVMe device through a doorbell register (DB).
After the NVMe device has processed the I/O command, it writes the result into a completion queue (CQ) and raises an interrupt to notify the host.
NVMe improves interrupt handling performance with MSI/MSI-X and interrupt coalescing.
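
The host side of this handshake can be sketched in a few lines of C. This is a simplified illustration written for this article rather than code from any real driver, and the structure layouts are heavily abbreviated: copy the command into the SQ slot at the current tail, advance the tail, and ring the doorbell.

#include <stdint.h>

/* Simplified sketch of a 64-byte NVMe submission queue entry (most fields omitted). */
struct nvme_sq_entry {
    uint8_t  opcode;      /* command opcode, e.g. read or write */
    uint8_t  flags;
    uint16_t cid;         /* command identifier, echoed back in the completion entry */
    uint8_t  rest[60];    /* namespace ID, data pointers, starting LBA, ... */
};

struct nvme_queue_pair {
    struct nvme_sq_entry *sq;           /* submission queue in host memory */
    volatile uint32_t    *sq_doorbell;  /* MMIO-mapped SQ tail doorbell register */
    uint16_t              sq_tail;
    uint16_t              depth;
};

/* Host-side submission: fill the SQ slot at the tail, advance the tail, then
 * write the new tail to the doorbell so the controller fetches the command. */
static void nvme_submit(struct nvme_queue_pair *qp, const struct nvme_sq_entry *cmd)
{
    qp->sq[qp->sq_tail] = *cmd;
    qp->sq_tail = (uint16_t)((qp->sq_tail + 1) % qp->depth);
    *qp->sq_doorbell = qp->sq_tail;     /* the MMIO write notifies the device */
}

The completion path runs in the opposite direction: the controller posts a 16-byte entry to the CQ, and the host acknowledges consumed entries by writing the CQ head doorbell.
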
2. SPDK Overview

The Storage Performance Development Kit (SPDK) provides a set of tools and libraries for writing high performance, scalable, user-mode storage applications. It achieves high performance by moving all of the necessary drivers into userspace and operating in a polled mode instead of relying on interrupts, which avoids kernel context switches and eliminates interrupt handling overhead.

The bedrock of SPDK is a user space, polled-mode, asynchronous, lockless NVMe driver. This provides zero-copy, highly parallel access directly to an SSD from a user space application. The driver is written as a C library with a single public header. Similarly, SPDK provides a user space driver for the I/OAT DMA engine present on many Intel Xeon-based platforms with all of the same properties as the NVMe driver.

SPDK also provides NVMe-oF and iSCSI servers built on top of these user space drivers that are capable of serving disks over the network. The standard Linux kernel iSCSI and NVMe-oF initiator can be used (or the Windows iSCSI initiator even) to connect clients to the servers. These servers can be up to an order of magnitude more CPU efficient than other implementations.

SPDK is an open source, BSD licensed set of C libraries and executables hosted on GitHub. All new development is done on the master branch and stable releases are created quarterly. Contributors and users are welcome to submit patches, file issues, and ask questions on our mailing list.

3. SPDK/NVMe Driver Overview

The NVMe driver is a C library that may be linked directly into an application that provides direct, zero-copy data transfer to and from NVMe SSDs. It is entirely passive, meaning that it spawns no threads and only performs actions in response to function calls from the application itself. The library controls NVMe devices by directly mapping the PCI BAR into the local process and performing MMIO. I/O is submitted asynchronously via queue pairs and the general flow isn’t entirely dissimilar from Linux’s libaio.
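
As a rough sketch of that flow: the fragment below submits one asynchronous read and then polls for its completion. It assumes a controller and namespace have already been attached through spdk_nvme_probe() (omitted here), error handling is minimal, and the LBA and sizes are only illustrative.

#include <stdbool.h>
#include <stdio.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

/* Completion callback, invoked from spdk_nvme_qpair_process_completions(). */
static void read_done(void *arg, const struct spdk_nvme_cpl *cpl)
{
    bool *done = arg;

    if (spdk_nvme_cpl_is_error(cpl)) {
        fprintf(stderr, "read failed\n");
    }
    *done = true;
}

/* Submit one asynchronous read of LBA 0 and poll until it completes. */
static int read_first_block(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
{
    struct spdk_nvme_qpair *qpair = spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
    void *buf = spdk_dma_zmalloc(spdk_nvme_ns_get_sector_size(ns), 0x1000, NULL);
    bool done = false;

    if (qpair == NULL || buf == NULL) {
        return -1;
    }

    /* Asynchronous submission: returns immediately, read_done() fires later. */
    if (spdk_nvme_ns_cmd_read(ns, qpair, buf, 0 /* LBA */, 1 /* LBA count */,
                              read_done, &done, 0) != 0) {
        return -1;
    }

    /* Polled mode: no interrupt and no sleep; the application drives completions. */
    while (!done) {
        spdk_nvme_qpair_process_completions(qpair, 0);
    }

    spdk_dma_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qpair);
    return 0;
}

The shape of the code mirrors libaio's submit/reap split, except that reaping completions is an ordinary function call into the user-space driver rather than a system call.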

Further details can be found here.

4. Other NVMe Driver Implementations

Linux kernel NVMe driver: download the kernel source from www.kernel.org (or a mirror), then read include/linux/nvme.h and drivers/nvme.
NVMeDirect: an open-source project started in South Korea. It is similar to the SPDK/NVMe driver but relies heavily on the Linux kernel NVMe driver implementation; in other words, NVMeDirect is a user-space I/O framework standing on the shoulders of a giant.

Do not let what you cannot do interfere with what you can do.

[SPDK/NVMe Storage Technology Analysis] 002 - Official Introduction to SPDK
Introduction to the Storage Performance Development Kit (SPDK)

By Jonathan S. (Intel), Updated December 5, 2016
Solid-state storage media is in the process of taking over the data center. Current-generation flash storage enjoys significant advantages in performance, power consumption, and rack density over rotational media. These advantages will continue to grow as next-generation media enter the marketplace.

Customers integrating current solid-state media, such as the Intel® SSD DC P3700 Series Non-Volatile Memory Express* (NVMe*) drive, face a major challenge: because the throughput and latency performance are so much better than that of a spinning disk, the storage software now consumes a larger percentage of the total transaction time. In other words, the performance and efficiency of the storage software stack is increasingly critical to the overall storage system. As storage media continues to evolve, it risks outstripping the software architectures that use it, and in coming years the storage media landscape will continue evolving at an incredible pace.

To help storage OEMs and ISVs integrate this hardware, Intel has created a set of drivers and a complete, end-to-end reference storage architecture called the Storage Performance Development Kit (SPDK). The goal of SPDK is to highlight the outstanding efficiency and performance enabled by using Intel’s networking, processing, and storage technologies together. By running software designed from the silicon up, SPDK has demonstrated that millions of I/Os per second are easily attainable by using a few processor cores and a few NVMe drives for storage with no additional offload hardware. Intel provides the entire Linux* reference architecture source code under the broad and permissive BSD license and is distributed to the community through GitHub*. A blog, mailing list, and additional documentation can be found at spdk.io.

Software Architectural Overview

How does SPDK work? The extremely high performance is achieved by combining two key techniques: running at user level and using Poll Mode Drivers (PMDs). Let’s take a closer look at these two software engineering terms.

First, running our device driver code at user level means that, by definition, driver code does not run in the kernel. Avoiding the kernel context switches and interrupts saves a significant amount of processing overhead, allowing more cycles to be spent doing the actual storing of the data. Regardless of the complexity of the storage algorithms (deduplication, encryption, compression, or plain block storage), fewer wasted cycles means better performance and latency. This is not to say that the kernel is adding unnecessary overhead; rather, the kernel adds overhead relevant to general-purpose computing use cases that may not be applicable to a dedicated storage stack. The guiding principle of SPDK is to provide the lowest latency and highest efficiency by eliminating every source of additional software overhead.

Second, PMDs change the basic model for an I/O. In the traditional I/O model, the application submits a request for a read or a write, and then sleeps while awaiting an interrupt to wake it up once the I/O has been completed. PMDs work differently; an application submits the request for a read or write, and then goes off to do other work, checking back at some interval to see if the I/O has yet been completed. This avoids the latency and overhead of using interrupts and allows the application to improve I/O efficiency. In the era of spinning media (tape and HDDs), the overhead of an interrupt was a small percentage of the overall I/O time, thus was a tremendous efficiency boost to the system. However, as the age of solid-state media continues to introduce lower-latency persistent media, interrupt overhead has become a non-trivial portion of the overall I/O time. This challenge will only become more glaring with lower latency media. Systems are already able to process many millions of I/Os per second, so the elimination of this overhead for millions of transactions compounds quickly into multiple cores being saved. Packets and blocks are dispatched immediately and time spent waiting is minimized, resulting in lower latency, more consistent latency (less jitter), and improved throughput.
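
Reduced to its essentials, the difference between the two models looks like the loop below. This is purely conceptual: submit_request(), poll_for_completion() and do_other_work() are hypothetical placeholders for whatever a real polled-mode driver (such as SPDK's NVMe driver) actually provides.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for a real polled-mode driver. */
static bool request_done = false;
static void submit_request(void)      { /* would fill an SQ entry and ring a doorbell */ }
static bool poll_for_completion(void) { return request_done; }
static void do_other_work(void)       { request_done = true; /* placeholder for useful work */ }

int main(void)
{
    submit_request();                  /* returns immediately; no blocking system call */
    while (!poll_for_completion()) {   /* no interrupt, no sleep, no context switch */
        do_other_work();               /* the core keeps doing useful work meanwhile */
    }
    printf("I/O complete\n");
    return 0;
}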

SPDK is composed of numerous subcomponents, interlinked and sharing the common elements of user-level and poll-mode operation. Each of these components was created to overcome a specific performance bottleneck encountered while creating the end-to-end SPDK architecture. However, each of these components can also be integrated into non-SPDK architectures, allowing customers to leverage the experience and techniques used within SPDK to accelerate their own software.
[Figure: SPDK software architecture]

Starting at the bottom and building up:

Hardware Drivers

NVMe driver: The foundational component for SPDK, this highly optimized, lockless driver provides unparalleled scalability, efficiency, and performance.

Intel® QuickData Technology: Also known as Intel® I/O Acceleration Technology (Intel® IOAT), this is a copy offload engine built into the Intel® Xeon® processor-based platform. By providing user space access, the threshold for DMA data movement is reduced, allowing greater utilization for small-size I/Os or NTB.

Back-End Block Devices

NVMe over Fabrics (NVMe-oF) initiator: From a programmer’s perspective, the local SPDK NVMe driver and the NVMe-oF initiator share a common set of API commands. This means that local/remote replication, for example, is extraordinarily easy to enable.
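
A small sketch of what that shared API looks like: the same spdk_nvme_connect() call attaches either a local PCIe controller or a remote NVMe-oF subsystem, and only the transport ID string changes. The PCI address, IP address and NQN below are made-up examples.

#include <stdio.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

static struct spdk_nvme_ctrlr *connect_any(const char *trid_str)
{
    struct spdk_nvme_transport_id trid = {0};

    if (spdk_nvme_transport_id_parse(&trid, trid_str) != 0) {
        return NULL;
    }
    return spdk_nvme_connect(&trid, NULL, 0);   /* same call for local and fabric targets */
}

int main(void)
{
    struct spdk_env_opts opts;

    spdk_env_opts_init(&opts);
    opts.name = "connect_example";
    if (spdk_env_init(&opts) < 0) {
        return 1;
    }

    /* Local NVMe SSD behind PCIe (example address). */
    struct spdk_nvme_ctrlr *local = connect_any("trtype:PCIe traddr:0000:04:00.0");
    /* Remote namespace exported by an NVMe-oF target over RDMA (example address and NQN). */
    struct spdk_nvme_ctrlr *remote = connect_any(
        "trtype:RDMA adrfam:IPv4 traddr:192.168.0.10 trsvcid:4420 "
        "subnqn:nqn.2016-06.io.spdk:cnode1");

    printf("local=%p remote=%p\n", (void *)local, (void *)remote);
    return 0;
}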

Ceph* RADOS Block Device (RBD): Enables Ceph as a back-end device for SPDK. This might allow Ceph to be used as another storage tier, for example.

Blobstore Block Device: A block device allocated by the SPDK Blobstore, this is a virtual device that VMs or databases could interact with. These devices enjoy the benefits of the SPDK infrastructure, meaning zero locks and incredibly scalable performance.

Linux* Asynchronous I/O (AIO): Allows SPDK to interact with kernel devices like HDDs.

Storage Services

Block device abstraction layer (bdev): This generic block device abstraction is the glue that connects the storage protocols to the various device drivers and block devices. Also provides flexible APIs for additional customer functionality (RAID, compression, dedup, and so on) in the block layer.

Blobstore: Implements a highly streamlined file-like semantic (non-POSIX*) for SPDK. This can provide high-performance underpinnings for databases, containers, virtual machines (VMs), or other workloads that do not depend on much of a POSIX file system’s feature set, such as user access control.

Storage Protocols

iSCSI target: Implementation of the established specification for block traffic over Ethernet; about twice as efficient as kernel LIO. Current version uses the kernel TCP/IP stack by default.

NVMe-oF target: Implements the new NVMe-oF specification. Though it depends on RDMA hardware, the NVMe-oF target can serve up to 40 Gbps of traffic per CPU core.

vhost-scsi target: A feature for KVM/QEMU that utilizes the SPDK NVMe driver, giving guest VMs lower latency access to the storage media and reducing the overall CPU load for I/O intensive workloads.

SPDK does not fit every storage architecture. Here are a few questions that might help you determine whether SPDK components are a good fit for your architecture.

》 Is the storage system based on Linux or FreeBSD*?

SPDK is primarily tested and supported on Linux. The hardware drivers are supported on both FreeBSD and Linux.

》Is the hardware platform for the storage system Intel® architecture?

SPDK is designed to take full advantage of Intel® platform characteristics and is tested and tuned for Intel® chips and systems.

》Does the performance path of the storage system currently run in user mode?

SPDK is able to improve performance and efficiency by running more of the performance path in user space. By combining applications with SPDK features like the NVMe-oF target, initiator, or Blobstore, the entire data path may be able to run in user space, offering substantial efficiencies.

》Can the system architecture incorporate lockless PMDs into its threading model?

Since PMDs continually run on their threads (instead of sleeping or ceding the processor when unused), they have specific thread model requirements.

》Does the system currently use the Data Plane Development Kit (DPDK) to handle network packet workloads?

SPDK shares primitives and programming models with DPDK, so customers currently using DPDK will likely find the close integration with SPDK useful. Similarly, if customers are using SPDK, adding DPDK features for network processing may present a significant opportunity.

》Does the development team have the expertise to understand and troubleshoot problems themselves?

Intel shall have no support obligations for this reference software. While Intel and the open source community around SPDK will use commercially reasonable efforts to investigate potential errata of unmodified released software, under no circumstances will Intel have any obligation to customers with respect to providing any maintenance or support of the software.

If you’d like to find out more about SPDK, please fill out the contact request form or check out SPDK.io for access to the mailing list, documentation, and blogs.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Performance testing configuration:

2S Intel® Xeon® processor E5-2699 v3: 18 C, 2.3 GHz (hyper-threading off)
Note: Single socket was used during performance testing
32 GB DDR4 2133 MT/s
4 Memory Channel per CPU
1x 4GB 2R DIMM per channel
Ubuntu* (Linux) Server 14.10
3.16.0-30-generic kernel
Intel® Ethernet Controller XL710 for 40GbE
8x P3700 NVMe drive for storage

NVMe configuration
    Total 8 PCIe* Gen 3 x 4 NVMes
        4 NVMes coming from first x16 slot
        (bifurcated to 4 x4s in the BIOS)
        Another 4 NVMes coming from second x16 slot (bifurcated to 4 x4s in the BIOS)
    Intel® SSD DC P3700 Series 800 GB
    Firmware: 8DV10102

FIO Benchmark Configuration
    Direct: Yes
    Queue depth
        4KB Random I/O: 32 outstanding I/O
        64KB Seq. I/O: 8 outstanding I/O
    Ramp Time: 30 seconds
    Run Time: 180 seconds
    Norandommap: 1
    I/O Engine: Libaio
    Numjobs: 1

BIOS Configuration
    Speed Step: Disabled
    Turbo Boost: Disabled
    CPU Power and Performance Policy: Performance

For more information go to http://www.intel.com/performance.

In the confrontation between the stream and the rock, the stream always wins, not through strength but by perseverance.

[SPDK/NVMe Storage Technology Analysis] 005 - DPDK Overview
Note: The following article is covered here because SPDK relies heavily on DPDK's implementation.
Introduction to DPDK: Architecture and Principles

Linux network stack performance has become increasingly relevant over the past few years. This is perfectly understandable: the amount of data that can be transferred over a network and the corresponding workload has been growing not by the day, but by the hour.

Not even the widespread use of 10 GE network cards has resolved this issue; this is because a lot of bottlenecks that prevent packets from being quickly processed are found in the Linux kernel itself.

There have been many attempts to circumvent these bottlenecks with techniques called kernel bypasses (a short description can be found here). They let you process packets without involving the Linux network stack and make it so that the application running in the user space communicates directly with networking device. We’d like to discuss one of these solutions, the Intel DPDK (Data Plane Development Kit), in today’s article.

A lot of posts have already been published about the DPDK and in a variety of languages. Although many of these are fairly informative, they don’t answer the most important questions: How does the DPDK process packets and what route does the packet take from the network device to the user?

Finding the answers to these questions was not easy; since we couldn’t find everything we needed in the official documentation, we had to look through a myriad of additional materials and thoroughly review their sources. But first thing’s first: before talking about the DPDK and the issues it can help resolve, we should review how packets are processed in Linux.

Processing Packets in Linux: Main Stages

When a network card first receives a packet, it sends it to a receive queue, or RX. From there, it gets copied to the main memory via the DMA (Direct Memory Access) mechanism.

Afterwards, the system needs to be notified of the new packet and pass the data onto a specially allocated buffer (Linux allocates these buffers for every packet). To do this, Linux uses an interrupt mechanism: an interrupt is generated several times when a new packet enters the system. The packet then needs to be transferred to the user space.

One bottleneck is already apparent: as more packets have to be processed, more resources are consumed, which negatively affects the overall system performance.

As we’ve already said, these packets are saved to specially allocated buffers - more specifically, the sk_buff struct. This struct is allocated for each packet and becomes free when a packet enters the user space. This operation consumes a lot of bus cycles (i.e. cycles that transfer data from the CPU to the main memory).

There is another problem with the sk_buff struct: the Linux network stack was originally designed to be compatible with as many protocols as possible. As such, metadata for all of these protocols is included in the sk_buff struct, but that’s simply not necessary for processing specific packets. Because of this overly complicated struct, processing is slower than it could be.

Another factor that negatively affects performance is context switching. When an application in the user space needs to send or receive a packet, it executes a system call. The context is switched to kernel mode and then back to user mode. This consumes a significant amount of system resources.

To solve some of these problems, all Linux kernels since version 2.6 have included NAPI (New API), which combines interrupts with requests. Let’s take a quick look at how this works.

The network card first works in interrupt mode, but as soon as a packet enters the network interface, it registers itself in a poll queue and disables the interrupt. The system periodically checks the queue for new devices and gathers packets for further processing. As soon as the packets are processed, the card will be deleted from the queue and interrupts are again enabled.

This has been just a cursory description of how packets are processed. A more detailed look at this process can be found in an article series from Private Internet Access. However, even a quick glance is enough to see the problems slowing down packet processing. In the next section, we’ll describe how these problems are solved using DPDK.

DPDK: How It Works

General Features

Let’s look at the following illustration:
[Figure: packet processing without DPDK (left) and with DPDK (right)]

On the left you see the traditional way packets are processed, and on the right - with DPDK. As we can see, the kernel in the second example doesn’t step in at all: interactions with the network card are performed via special drivers and libraries.

If you’ve already read about DPDK or have ever used it, then you know that the ports receiving incoming traffic on network cards need to be unbound from Linux (the kernel driver). This is done using the dpdk_nic_bind (or dpdk-devbind) command, or ./dpdk_nic_bind.py in earlier versions.

How are ports then managed by DPDK? Every driver in Linux has bind and unbind files. That includes network card drivers:

ls /sys/bus/pci/drivers/ixgbe
bind module new_id remove_id uevent unbind
To unbind a device from a driver, the device’s bus number needs to be written to the unbind file. Similarly, to bind a device to another driver, the bus number needs to be written to its bind file. More detailed information about this can be found here.
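
In C, that sysfs dance looks like the sketch below. The PCI address and driver names are examples, it must run as root, and the target driver has to already recognize the device's vendor/device ID (for example via its new_id file); the dpdk-devbind script automates all of this.

#include <stdio.h>

static int write_str(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (f == NULL) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s", value);
    return fclose(f);
}

int main(void)
{
    const char *bdf = "0000:01:00.0";    /* example PCI address of the NIC */

    /* 1. Detach the port from its current kernel driver (here: ixgbe). */
    write_str("/sys/bus/pci/drivers/ixgbe/unbind", bdf);
    /* 2. Attach it to a driver that exposes the device to user space. */
    write_str("/sys/bus/pci/drivers/uio_pci_generic/bind", bdf);
    return 0;
}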

The DPDK installation instructions tell us that our ports need to be managed by the vfio_pci, igb_uio, or uio_pci_generic driver. (We won't be getting into details here, but we suggest interested readers look at the following articles on kernel.org: 1 and 2.)

These drivers make it possible to interact with devices in the user space. Of course they include a kernel module, but that’s just to initialize devices and assign the PCI interface.

All further communication between the application and network card is organized by the DPDK poll mode driver (PMD). DPDK has poll mode drivers for all supported network cards and virtual devices.
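
The receive side of such an application is essentially a loop around rte_eth_rx_burst(). The sketch below assumes port 0 has already been configured and started (rte_eth_dev_configure(), rte_eth_rx_queue_setup() and rte_eth_dev_start() are omitted):

#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Poll RX queue 0 of one port forever; the PMD is polled, no interrupt is involved. */
static void rx_loop(uint16_t port_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Ask the PMD for up to BURST_SIZE packets. */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb_rx; i++) {
            /* ... process the packet here ... */
            rte_pktmbuf_free(bufs[i]);   /* return the mbuf to its mempool */
        }
    }
}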

The DPDK also requires hugepages be configured. This is required for allocating large chunks of memory and writing data to them. We can say that hugepages does the same job in DPDK that DMA does in traditional packet processing.

We’ll discuss all of its nuances in more detail, but for now, let’s go over the main stages of packet processing with the DPDK:

Incoming packets go to a ring buffer (we'll look at its setup in the next section). The application periodically checks this buffer for new packets.
If the buffer contains new packet descriptors, the application will refer to the DPDK packet buffers in the specially allocated memory pool using the pointers in the packet descriptors.
If the ring buffer does not contain any packets, the application will queue the network devices under the DPDK and then refer to the ring again.
Let’s take a closer look at the DPDK’s internal structure.

EAL: Environment Abstraction Layer

The EAL, or Environment Abstraction Layer, is the main concept behind the DPDK.

The EAL is a set of programming tools that let the DPDK work in a specific hardware environment and under a specific operating system. In the official DPDK repository, libraries and drivers that are part of the EAL are saved in the rte_eal directory.

Drivers and libraries for Linux and the BSD system are saved in this directory. It also contains a set of header files for various processor architectures: ARM, x86, TILE64, and PPC64.

We access software in the EAL when we compile the DPDK from the source code:

make config T=x86_64-native-linuxapp-gcc
One can guess that this command will compile DPDK for Linux in an x86_64 architecture.

The EAL is what binds the DPDK to applications. All of the applications that use the DPDK (see here for examples) must include the EAL’s header files.

The most commonly used of these include:

rte_lcore.h – manages processor cores and sockets;
rte_memory.h – manages memory;
rte_pci.h – provides the interface access to PCI address space;
rte_debug.h – provides trace and debug functions (logging, dump_stack, and more);
rte_interrupts.h – processes interrupts.
More details on this structure and EAL functions can be found in the official documentation.
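
Every DPDK application starts by handing its command line to the EAL. The sketch below shows the minimal bootstrap; EAL options such as the core mask and hugepage settings are consumed by rte_eal_init() before the application sees the remaining arguments:

#include <stdio.h>
#include <rte_eal.h>
#include <rte_lcore.h>

int main(int argc, char **argv)
{
    /* Parses EAL options, reserves hugepage memory, launches the lcore threads. */
    if (rte_eal_init(argc, argv) < 0) {
        fprintf(stderr, "EAL initialization failed\n");
        return 1;
    }

    printf("EAL up: %u lcores available, main lcore on socket %u\n",
           rte_lcore_count(), rte_socket_id());

    /* application logic goes here */
    return 0;
}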

Managing Queues: rte_ring

As we’ve already said, packets received by the network card are sent to a ring buffer, which acts as a receiving queue. Packets received in the DPDK are also sent to a queue implemented on the rte_ring library. The library’s description below comes from information gathered from the developer’s guide and comments in the source code.

The rte_ring was developed from the FreeBSD ring buffer. If you look at the source code, you’ll see the following comment: Derived from FreeBSD’s bufring.c.

The queue is a lockless ring buffer built on the FIFO (First In, First Out) principle. The ring buffer is a table of pointers for objects that can be saved to the memory. Pointers can be divided into four categories: prod_tail, prod_head, cons_tail, cons_head.

Prod is short for producer, and cons for consumer. The producer is the process that writes data to the buffer at a given time, and the consumer is the process that removes data from the buffer.

The tail is where writing takes place on the ring buffer. The place the buffer is read from at a given time is called the head.

The idea behind the process for adding and removing elements from the queue is as follows: when a new object is added to the queue, the ring->prod_tail indicator should end up pointing to the location where ring->prod_head previously pointed to.

This is just a brief description; a more detailed account of how the ring buffer scripts work can be found in the developer’s manual on the DPDK site.

This approach has a number of advantages. Firstly, data is written to the buffer extremely quickly. Secondly, when adding or removing a large number of objects from the queue, cache misses occur much less frequently since pointers are saved in a table.

The drawback to DPDK's ring buffer is its fixed size, which cannot be increased on the fly. Additionally, much more memory is spent working with the ring structure than in a linked queue since the buffer always uses the maximum number of pointers.
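
Basic use of the library looks like the sketch below (assuming rte_eal_init() has already run). The size must be a power of two, and the SP/SC flags select the single-producer/single-consumer fast path:

#include <stdio.h>
#include <rte_lcore.h>
#include <rte_ring.h>

static void ring_demo(void)
{
    struct rte_ring *r = rte_ring_create("demo_ring", 1024, rte_socket_id(),
                                         RING_F_SP_ENQ | RING_F_SC_DEQ);
    static int obj = 42;
    void *out = NULL;

    if (r == NULL) {
        return;
    }
    if (rte_ring_enqueue(r, &obj) == 0 &&     /* producer side: moves prod head/tail */
        rte_ring_dequeue(r, &out) == 0) {     /* consumer side: moves cons head/tail */
        printf("dequeued %d\n", *(int *)out);
    }
    rte_ring_free(r);
}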

Memory Management: rte_mempool

We mentioned above that DPDK requires hugepages. The installation instructions recommend creating 2MB hugepages.

These pages are combined in segments, which are then divided into zones. Objects that are created by applications or other libraries, like queues and packet buffers, are placed in these zones.

These objects include memory pools, which are created by the rte_mempool library. These are fixed size object pools that use rte_ring for storing free objects and can be identified by a unique name.

Memory alignment techniques can be implemented to improve performance.

Even though access to free objects is designed on a lockless ring buffer, consumption of system resources may still be very high. As multiple cores have access to the ring, a compare-and-set (CAS) operation usually has to be performed each time it is accessed.

To prevent bottlenecking, every core is given an additional local cache in the memory pool. Thanks to this cache, a core can take and return free objects without touching the shared ring on every access. When the cache is full or entirely empty, the memory pool exchanges data with the ring buffer. This gives the core access to frequently used objects.
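
In packet-processing applications the pool is usually created with the mbuf helper shown below; its third argument is exactly the per-core cache described above. The object count and cache size are illustrative, and rte_eal_init() is assumed to have run already:

#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Each lcore keeps up to 256 free mbufs in its local cache and only touches the
 * shared ring when that cache drains or overflows. */
static struct rte_mempool *make_pktmbuf_pool(void)
{
    return rte_pktmbuf_pool_create("MBUF_POOL",
                                   8191,                       /* objects in the pool */
                                   256,                        /* per-core cache size */
                                   0,                          /* private area per mbuf */
                                   RTE_MBUF_DEFAULT_BUF_SIZE,  /* data room per mbuf */
                                   rte_socket_id());
}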

Buffer Management: rte_mbuf

In the Linux network stack, all network packets are represented by the sk_buff data structure. In DPDK, this is done using the rte_mbuf struct, which is described in the rte_mbuf.h header file.

The buffer management approach in DPDK is reminiscent of the approach used in FreeBSD: instead of one big sk_buff struct, there are many smaller rte_mbuf buffers. The buffers are created before the DPDK application is launched and are saved in memory pools (memory is allocated by rte_mempool).

In addition to its own packet data, each buffer contains metadata (message type, length, data segment starting address). The buffer also contains pointers for the next buffer. This is needed when handling packets with large amounts of data. In cases like these, packets can be combined (as is done in FreeBSD; more detailed information about this can be found here).
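
A short sketch of how an application touches those fields; the pool handle would come from rte_pktmbuf_pool_create(), as shown earlier:

#include <stdio.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

static void inspect_mbuf(struct rte_mempool *pool)
{
    struct rte_mbuf *m = rte_pktmbuf_alloc(pool);

    if (m == NULL) {
        return;
    }

    char *data = rte_pktmbuf_mtod(m, char *);   /* start of the packet data */
    printf("segments=%u data_len=%u pkt_len=%u data=%p next=%p\n",
           (unsigned)m->nb_segs, (unsigned)m->data_len, (unsigned)m->pkt_len,
           (void *)data, (void *)m->next);

    rte_pktmbuf_free(m);                        /* frees the whole chain */
}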

Other Libraries: General Overview

In previous sections, we talked about the most basic DPDK libraries. There’s a great deal of other libraries, but one article isn’t enough to describe them all. Thus, we’ll be limiting ourselves to just a brief overview.

With the LPM library, DPDK runs the Longest Prefix Match (LPM) algorithm, which can be used to forward packets based on their IPv4 address. The primary function of this library is to add and delete IP addresses as well as to search for new addresses using the LPM algorithm.

A similar function can be performed for IPv6 addresses using the LPM6 library.

Other libraries offer similar functionality based on hash functions. With rte_hash, you can search through a large record set using a unique key. This library can be used for classifying and distributing packets, for example.
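
A sketch of that pattern with rte_hash, using an IPv4 address as the key. The sizes and values are illustrative, rte_eal_init() is assumed to have run, and the default hash function is used because none is specified:

#include <stdint.h>
#include <stdio.h>
#include <rte_hash.h>
#include <rte_lcore.h>

static void hash_demo(void)
{
    struct rte_hash_parameters params = {
        .name      = "flow_table",
        .entries   = 1024,                /* maximum number of keys */
        .key_len   = sizeof(uint32_t),    /* e.g. an IPv4 destination address */
        .socket_id = rte_socket_id(),
    };
    struct rte_hash *h = rte_hash_create(&params);
    uint32_t key = 0x0a000001;            /* 10.0.0.1 */

    if (h == NULL) {
        return;
    }
    if (rte_hash_add_key(h, &key) >= 0) {          /* returns the key's index */
        int32_t idx = rte_hash_lookup(h, &key);    /* negative means not found */
        printf("key stored at index %d\n", (int)idx);
    }
    rte_hash_free(h);
}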

The rte_timer library lets you execute functions asynchronously. The timer can run once or periodically.
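
A sketch of a periodic timer: rte_eal_init() is assumed to have run, and the callback fires from rte_timer_manage(), which the application must call regularly on the lcore that armed the timer.

#include <stdio.h>
#include <rte_cycles.h>
#include <rte_lcore.h>
#include <rte_timer.h>

static struct rte_timer tick_timer;

static void on_tick(struct rte_timer *tim, void *arg)
{
    (void)tim;
    (void)arg;
    printf("tick\n");
}

static void timer_demo(void)
{
    rte_timer_subsystem_init();
    rte_timer_init(&tick_timer);

    /* Fire once per second (rte_get_timer_hz() ticks), on this lcore, forever. */
    rte_timer_reset(&tick_timer, rte_get_timer_hz(), PERIODICAL,
                    rte_lcore_id(), on_tick, NULL);

    for (;;) {
        rte_timer_manage();   /* expired timers run their callbacks from here */
    }
}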

Conclusion

In this article we went over the internal device and principles of DPDK. This is far from comprehensive though; the subject is too complex and extensive to fit in one article. So sit tight, we will continue this topic in a future article, where we’ll discuss the practical aspects of using DPDK.

We’d be happy to answer your questions in the comments below. And if you’ve had any experience using DPDK, we’d love to hear your thoughts and impressions.

For anyone interested in learning more, please visit the following links:

http://dpdk.org/doc/guides/prog_guide/ — a detailed (but confusing in some places) description of all the DPDK libraries;
https://www.net.in.tum.de/fileadmin/TUM/NET/NET-2014-08-1/NET-2014-08-1_15.pdf — a brief overview of DPDK’s capabilities and comparison with other frameworks (netmap and PF_RING);
http://www.slideshare.net/garyachy/dpdk-44585840 — an introductory presentation to DPDK for beginners;
http://www.it-sobytie.ru/system/attachments/files/000/001/102/original/LinuxPiter-DPDK-2015.pdf — a presentation explaining the DPDK structure.
Andrej Yemelianov 24 November 2016 Tags: DPDK, linux, network, network stacks, packet processing

Everybody thinks of changing humanity, and nobody thinks of changing himself.
