Introduction to the InfiniBand Core Software

Abstract

InfiniBand support was added to the Linux kernel in 2.6.11. In this paper, we describe the various modules and interfaces of the InfiniBand core software and provide examples of how and when to use them. The core software consists of the host channel adapter (HCA) driver and a mid-layer that abstracts the InfiniBand device implementation specifics and presents a consistent interface to upper level protocols, such as IP over IB, the Sockets Direct Protocol, and the InfiniBand storage protocols. The InfiniBand core software is logically grouped into five major areas: HCA resource management, memory management, connection management, work request and completion event processing, and subnet administration. Physically, the core software is currently contained within six kernel modules: the Mellanox HCA driver, ib_mthca.ko; the core verbs module, ib_core.ko; the connection manager, ib_cm.ko; and the subnet administration support modules, ib_sa.ko, ib_mad.ko, and ib_umad.ko. We will also discuss the additional modules that are under development to export the core software interfaces to userspace and allow safe direct access to InfiniBand hardware from userspace.

1 Introduction

This paper describes the core software components of the InfiniBand support that was included in the Linux 2.6.11 kernel. The reader is referred to the architectural diagram and foils in the slide set that was provided as part of the paper’s presentation at the Ottawa Linux Symposium. It is also assumed that the reader has read at least chapters 3, 10, and 11 of the InfiniBand Architecture Specification [IBTA] and is familiar with the concepts and terminology of the InfiniBand Architecture. The goal of the paper is not to educate people on the InfiniBand Architecture, but rather to introduce the reader to the APIs and code that implement the InfiniBand Architecture support in Linux. Note that the InfiniBand code in the kernel was written to comply with the InfiniBand 1.1 specification, with some 1.2 extensions, but it is not yet completely 1.2 compliant.

The InfiniBand code is located in the kernel tree under linux-2.6.11/drivers/infiniband. The reader is encouraged to read the code and header files in the kernel tree. Several pieces of the InfiniBand stack that are in the kernel contain good examples of how to use the routines of the core software described in this paper. Another good source of information is the www.openib.org website, where the code is developed prior to being submitted to the Linux kernel mailing list (lkml) for kernel inclusion. The site hosts several frequently-asked-questions documents plus email lists, such as <openib-general@openib.org>, where people can ask questions or submit patches to the InfiniBand code.

The remainder of the paper provides a high-level overview of the mid-layer routines and provides some examples of their usage. It is targeted at someone who might want to write a kernel module that uses the mid-layer, or someone interested in how it is used. The paper is divided into several sections that cover driver initialization and exit, resource management, memory management, subnet administration from the viewpoint of an upper level protocol developer, connection management, and work request and completion event processing.

Finally, the paper will present a section on the user-mode infrastructure and how one can safely use InfiniBand resources directly from userspace applications.

2 Driver initialization and exit

Before using InfiniBand resources, kernel clients must register with the mid-layer. Registration also provides the way, via callbacks, for the client to discover the InfiniBand devices that are present in the system.

To register with the InfiniBand mid-layer, a client calls the ib_register_client routine. The routine takes as a parameter a pointer to an ib_client structure, as defined in linux-2.6.11/drivers/infiniband/include/ib_verbs.h. The structure takes a pointer to the client’s name, plus two function pointers to callback routines that are invoked when an InfiniBand device is added to or removed from the system. Below is some sample code that shows how this routine is called:

static void my_add_device(struct ib_device *device);
static void my_remove_device(struct ib_device *device);

static struct ib_client my_client = {
    .name   = "my_name",
    .add    = my_add_device,
    .remove = my_remove_device
};

static int __init my_init(void)
{
    int ret;

    ret = ib_register_client(&my_client);
    if (ret)
        printk(KERN_ERR "my ib_register_client failed\n");
    return ret;
}

static void __exit my_cleanup(void)
{
    ib_unregister_client(&my_client);
}

module_init(my_init);
module_exit(my_cleanup);

3 InfiniBand resource management

3.1 Miscellaneous Query Functions

The mid-layer provides routines that allow a client to query or modify information about the various InfiniBand resources.

ib_query_device

ib_query_port

ib_query_gid

ib_query_pkey

ib_modify_device

ib_modify_port

The ib_query_device routine allows a client to retrieve attributes for a given hardware device. The returned device_attr structure contains device specific capabilities and limitations, such as the maximum sizes for queue pairs, completion queues, scatter gather entries, etc., and is used when configuring queue pairs and establishing connections.

The ib_query_port routine returns information that is needed by the client, such as the state of the port (Active or not), the local identifier (LID) assigned to the port by the subnet manager, the Maximum Transfer Unit (MTU), the LID of the subnet manager, needed for sending SA queries, the partition table length, and the maximum message size.

The ib_query_pkey routine allows the client to retrieve the partition keys for a port. Typically, the subnet manager only sets one pkey for the entire subnet, which is the default pkey.

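
The full-membership default P_Key is 0xFFFF. As a small illustration of scanning a port's partition table for it, here is a hedged sketch; the helper and the cached table are stand-ins for repeated ib_query_pkey calls, not mid-layer API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IB_DEFAULT_PKEY_FULL 0xFFFF  /* full-membership default P_Key */

/* Hypothetical helper: scan a cached P_Key table and return the
 * index of the default key, or -1 if the subnet manager has not
 * configured it on this port. */
static int find_default_pkey(const uint16_t *pkey_table, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (pkey_table[i] == IB_DEFAULT_PKEY_FULL)
            return (int)i;
    return -1;
}
```

A client would typically cache the table once at startup and look up the index it needs when building address vectors.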
The ib_modify_device and ib_modify_port routines allow some of the device or port attributes to be modified. Most ULPs do not need to modify any of the port or device attributes. One exception to this would be the communication manager, which sets a bit in the port capabilities mask to indicate the presence of a CM.

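
The capability-mask update works as a set/clear pair of bit masks. The sketch below illustrates the idea with a simplified stand-in for the port-modify structure; the IsCM bit value follows the IBA PortInfo definition (bit 16), but the structure and helper here are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

#define IB_PORT_CM_SUP (1u << 16)  /* IsCM capability bit (IBA PortInfo) */

/* Simplified stand-in for struct ib_port_modify: bits to set and
 * bits to clear in the port capability mask. */
struct port_modify {
    uint32_t set_port_cap_mask;
    uint32_t clr_port_cap_mask;
};

/* Apply a modify request to a capability mask, as the mid-layer
 * would on behalf of ib_modify_port. */
static uint32_t apply_port_modify(uint32_t cap_mask,
                                  const struct port_modify *pm)
{
    return (cap_mask | pm->set_port_cap_mask) & ~pm->clr_port_cap_mask;
}
```

The CM sets the bit when it starts listening on a port and clears it again when it unregisters.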
Additional query and modify routines are discussed in later sections when a particular resource, such as queue pairs or completion queues, are discussed.

3.2 Protection Domains

Protection domains are a first level of access control provided by InfiniBand. Protection domains are allocated by the client and associated with subsequent InfiniBand resources, such as queue pairs, or memory regions.

Protection domains allow a client to associate multiple resources, such as queue pairs and memory regions, within a domain of trust. The client can then grant access rights for sending/receiving data within the protection domain to others that are on the Infinband fabric.

To allocate a protection domain, clients call the ib_alloc_pd routine. The routine takes a pointer to the device structure that was returned through the client’s add callback after registering with the mid-layer. For example:

my_pd = ib_alloc_pd(device);

Once a PD has been allocated, it is used in subsequent calls to allocate other resources, such as creating address handles or queue pairs.

To free a protection domain, the client calls ib_dealloc_pd, which is normally only done at driver unload time after all of the other resources associated with the PD have been freed.

ib_dealloc_pd(my_pd);

3.3 Types of Communication in InfiniBand

Several types of communication between end-points are defined by the InfiniBand architecture specification [IBTA]. These include reliable-connected, unreliable-connected, reliable-datagram, and unreliable-datagram service.

Most clients today use only unreliable datagrams or reliable-connected communications. An analogy in the IP network stack would be that unreliable datagrams are analogous to UDP packets, while reliable-connected queue pairs provide a connection-oriented type of communication, similar to TCP. InfiniBand communication, however, is packet-based rather than stream-oriented.

3.4 Address handles

When a client wants to communicate via unreliable datagrams, the client needs to create an address handle that contains the information needed to send packets.

To create an address handle the client calls the routine ib_create_ah(). An example code fragment is shown below:

struct ib_ah_attr ah_attr;
struct ib_ah *remote_ah;

memset(&ah_attr, 0, sizeof ah_attr);
ah_attr.dlid     = remote_lid;
ah_attr.sl       = service_level;
ah_attr.port_num = port->port_num;

remote_ah = ib_create_ah(pd, &ah_attr);

In the above example, the pd is the protection domain, the remote_lid and service_level are obtained from an SA path record query, and the port_num was returned in the device structure through the ib_register_client callback. Another way to get the remote_lid and service_level information is from a packet that was received from a remote node.

There are also core verb APIs for destroying the address handles and for retrieving and modifying the address handle attributes.

ib_destroy_ah

ib_query_ah

ib_modify_ah

Some example code that calls ib_create_ah to create an address handle for a multicast group can be found in the IPoIB network driver for InfiniBand, and is located in linux-2.6.11/drivers/infiniband/ulp/ipoib.

3.5 Queue Pairs and Completion Queue Allocation

All data communicated over InfiniBand is done via queue pairs. Queue pairs (QPs) contain a send queue, for sending outbound messages and requesting RDMA and atomic operations, and a receive queue for receiving incoming messages or immediate data. Furthermore, completion queues (CQs) must be allocated and associated with a queue pair, and are used to receive completion notifications and events.

Queue pairs and completion queues are allocated by calling the ib_create_qp and ib_create_cq routines, respectively.

The following sample code allocates separate completion queues to handle send and receive completions, and then allocates a queue pair associated with the two CQs.

send_cq = ib_create_cq(device, my_cq_event_handler,
                       NULL, my_context,
                       my_send_cq_size);

recv_cq = ib_create_cq(device, my_cq_event_handler,
                       NULL, my_context,
                       my_recv_cq_size);

init_attr->cap.max_send_wr  = send_cq_size;
init_attr->cap.max_recv_wr  = recv_cq_size;
init_attr->cap.max_send_sge = LIMIT_SG_SEND;
init_attr->cap.max_recv_sge = LIMIT_SG_RECV;
init_attr->send_cq          = send_cq;
init_attr->recv_cq          = recv_cq;
init_attr->sq_sig_type      = IB_SIGNAL_REQ_WR;
init_attr->qp_type          = IB_QPT_RC;
init_attr->event_handler    = my_qp_event_handler;

my_qp = ib_create_qp(pd, init_attr);

After a queue pair is created, it can be connected to a remote QP to establish a connection. This is done using the QP modify routine and the communication manager helper functions described in a later section.

There are also mid-layer routines that allow destruction and release of QPs and CQs, along with the routines to query and modify the queue pair attributes and states. These additional core QP and CQ support routines are as follows:

ib_modify_qp

ib_query_qp

ib_destroy_qp

ib_destroy_cq

ib_resize_cq

Note that ib_resize_cq is not currently implemented in the mthca driver. An example of kernel code that allocates QPs and CQs for reliable-connected style of communication is the SDP driver [SDP]. It can be found in the subversion tree at openib.org, and will be submitted for kernel inclusion at some point in the near future.

4 InfiniBand memory management

Before a client can transfer data across InfiniBand, it needs to register the corresponding memory buffer with the InfiniBand HCA. The InfiniBand mid-layer assumes that the kernel or ULP has already pinned the pages and has translated the virtual address to a Linux DMA address, i.e., a bus address that can be used by the HCA. For example, the driver could call get_user_pages and then dma_map_sg to get the DMA address.

Memory registration can be done in a couple of different ways. For operations that do not have a scatter/gather list of pages, there is a memory region that can be used that has all of physical memory pre-registered. This can be thought of as getting access to the “Reserved L_key” that is defined in the InfiniBand verbs extensions [IBTA].

To get the memory region structure that has the keys that are needed for data transfers, the client calls the ib_get_dma_mr routine, for example:

mr = ib_get_dma_mr(my_pd, IB_ACCESS_LOCAL_WRITE);

If the client has a list of pages that are not physically contiguous but should be virtually contiguous with respect to the DMA operation, i.e., scatter/gather, the client can call the ib_reg_phys_mr routine. For example:

*iova = &my_buffer1;
buffer_list[0].addr = dma_addr_buffer1;
buffer_list[0].size = buffer1_size;
buffer_list[1].addr = dma_addr_buffer2;
buffer_list[1].size = buffer2_size;

mr = ib_reg_phys_mr(my_pd, buffer_list, 2,
                    IB_ACCESS_LOCAL_WRITE |
                    IB_ACCESS_REMOTE_READ |
                    IB_ACCESS_REMOTE_WRITE,
                    iova);

The mr structure that is returned contains the necessary local and remote keys, lkey and rkey, needed for sending/receiving messages and performing RDMA operations. For example, the combination of the returned iova and the rkey are used by a remote node for RDMA operations.

Once a client has completed all data transfers to a memory region, e.g., the DMA is completed, the client can release the resources back to the HCA using the ib_dereg_mr routine, for example:

ib_dereg_mr(mr);

There is also a verb, ib_rereg_phys_mr, that allows the client to modify the attributes of a given memory region. This is similar to doing a de-register followed by a re-register, but where possible the HCA reuses the same resources rather than deallocating and then reallocating new ones.

status = ib_rereg_phys_mr(mr, mr_rereg_mask, my_pd,
                          buffer_list, num_phys_buf,
                          mr_access_flags, iova_start);

There is also a set of routines that allow a technique called fast memory registration. Fast Memory Registration, or FMR, was implemented to allow the re-use of memory regions and to reduce the overhead involved in registration and deregistration with the HCAs. Using the technique of FMR, the client typically allocates a pool of FMRs during initialization. Then when it needs to register memory with the HCA, the client calls a routine that maps the pages using one of the pre-allocated FMRs.

Once the DMA is complete, the client can unmap the pages from the FMR and recycle the memory region and use it for another DMA operation. The following routines are used to allocate, map, unmap, and deallocate FMRs.

ib_alloc_fmr

ib_unmap_fmr

ib_map_phys_fmr

ib_dealloc_fmr

An example of coding using FMRs can be found in the SDP [SDP] driver available at openib.org.

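
The allocate-once, map/unmap-per-DMA pattern described above can be sketched with a trivial free-list pool. The types and functions below are illustrative stand-ins for the pooling concept, not the ib_fmr API:

```c
#include <assert.h>
#include <stddef.h>

#define FMR_POOL_SIZE 4

/* Illustrative stand-in for a pre-allocated FMR. */
struct fake_fmr {
    int in_use;
};

static struct fake_fmr pool[FMR_POOL_SIZE];

/* Grab a free FMR from the pool (as ib_map_phys_fmr would map
 * pages into one); returns NULL when the pool is exhausted. */
static struct fake_fmr *fmr_map(void)
{
    int i;

    for (i = 0; i < FMR_POOL_SIZE; i++)
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            return &pool[i];
        }
    return NULL;
}

/* Recycle an FMR after the DMA completes (as ib_unmap_fmr would). */
static void fmr_unmap(struct fake_fmr *fmr)
{
    fmr->in_use = 0;
}
```

The point of the pattern is that map and unmap are cheap list operations, while the expensive HCA registration happens only once, at pool-allocation time.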
NOTE: These FMRs are a Mellanox specific implementation and are NOT the same as the FMRs as defined by the 1.2 InfiniBand verbs extensions [IBTA]. The FMRs that are implemented are based on the Mellanox FMRs that predate the 1.2 specification and so the developers deviated slightly from the InfiniBand specification in this area.

InfiniBand also has the concept of memory windows [IBTA]. Memory windows are a way to bind a set of virtual addresses and attributes to a memory region by posting an operation to a send queue. It was thought that people might want this dynamic binding/unbinding intermixed with their work request flow. However, memory windows are currently not used, primarily because of poor hardware performance in the existing HCA, and thus are not implemented in the mthca driver in Linux.

However, there are APIs defined in the mid-layer for memory windows, for when they are implemented in mthca or some future HCA driver. These are as follows:

ib_alloc_mw

ib_dealloc_mw

5 InfiniBand subnet administration

Communication with subnet administration (SA) is often needed to obtain information for establishing communication or setting up multicast groups. This is accomplished by sending management datagram (MAD) packets to the SA through InfiniBand special QP 1 [IBTA]. The low level routines needed to send/receive MADs, along with the critical data structures, are defined in linux-2.6.11/drivers/infiniband/include/ib_mad.h.

Several helper functions have been implemented for obtaining path record information or joining multicast groups. These relieve most clients from having to understand the low level MAD routines. Subnet administration APIs and data structures are located in linux-2.6.11/drivers/infiniband/include/ib_sa.h and the following sections discuss their usage.

5.1 Path Record Queries

To establish connections, certain information is needed, such as the source/destination LIDs, service level, MTU, etc. This information is found in a data structure known as a path record, which contains all relevant information of a path between a source and destination. Path records are managed by the InfiniBand subnet administrator (SA). To obtain a path record, the client can use the helper function:

ib_sa_path_rec_get

This function takes the device structure returned by the registration callback, the local InfiniBand port to use for the query, a timeout value, which is the time to wait before giving up on the query, and two masks, comp_mask and gfp_mask. The comp_mask specifies which components of the ib_sa_path_rec to perform the query with. The gfp_mask is the mask used for internal memory allocations, i.e., the flags passed to kmalloc, such as GFP_KERNEL, GFP_USER, or GFP_ATOMIC. The **query parameter is a returned identifier of the query that can be used to cancel it, if needed. For example, given a source and destination InfiniBand global identifier (sgid/dgid) and the partition key, here is an example query call taken from the SDP [SDP] code.

query_id = ib_sa_path_rec_get(
                info->ca, info->port, &info->path,
                (IB_SA_PATH_REC_DGID |
                 IB_SA_PATH_REC_SGID |
                 IB_SA_PATH_REC_PKEY |
                 IB_SA_PATH_REC_NUMB_PATH),
                info->sa_time, GFP_KERNEL,
                sdp_link_path_rec_done, info,
                &info->query);
if (query_id < 0) {
    sdp_dbg_warn(NULL,
                 "Error <%d> restarting path query",
                 query_id);
}

In the above example, when the query completes, or times-out, the client is called back at the provided callback routine, sdp_link_path_rec_done. If the query succeeds, the path record(s) information requested is returned along with the context value that was provided with the query. If the query times out, the client can retry the request by calling the routine again.

Note that in the above example, the caller must provide the DGID, SGID, and PKEY in the info->path structure. In the SDP example, the info->path.dgid, info->path.sgid, and info->path.pkey fields are set in the SDP routine do_link_path_lookup.

5.2 Cancelling SA Queries

If the client wishes to cancel an SA query, the client uses the returned **query parameter and the query function return value (query id), e.g.,

ib_sa_cancel_query(query_id, query);

5.3 Multicast Groups

Multicast groups are administered by the subnet administrator/subnet manager, which configures InfiniBand switches for the multicast group. To participate in a multicast group, a client sends a message to the subnet administrator to join the group. The APIs used to do this are shown below:

ib_sa_mcmember_rec_set

ib_sa_mcmember_rec_delete

ib_sa_mcmember_rec_query

The ib_sa_mcmember_rec_set routine is used to create and/or join the multicast group and the ib_sa_mcmember_rec_delete routine is used to leave a multicast group.

The ib_sa_mcmember_rec_query routine can be called to get information on available multicast groups. After joining the multicast group, the client must attach a queue pair to the group to allow sending and receiving multicast messages. Attaching/detaching queue pairs from multicast groups can be done using the API shown below:

ib_attach_mcast

ib_detach_mcast

The gid and lid in these routines are the multicast gid(mgid) and multicast lid (mlid) of the group. An example of using the multicast routines can be found in the IP over IB code located in linux-2.6.11/drivers/infiniband/ulp/ipoib.

5.4 MAD routines

Most upper level protocols do not need to send and receive InfiniBand management datagrams (MADs) directly. For the few operations that require communication with the subnet manager/subnet administrator, such as path record queries or joining multicast groups, helper functions are provided, as discussed in an earlier section.

However, for some modules of the mid-layer itself, such as the communications manager, or for developers wanting to implement management agents using the InfiniBand special queue pairs, MADs may need to be sent and received directly. An example might be someone that wanted to tunnel IPMI [IPMI] or SNMP [SNMP] over InfiniBand for remote server management. Another example is handling some vendor-specific MADs that are implemented by a specific InfiniBand vendor. The MAD routines are defined in linux-2.6.11/drivers/infiniband/include/ib_mad.h.

Before being allowed to send or receive MADs, MAD layer clients must register an agent with the MAD layer using the following routines. The ib_register_mad_snoop routine can be used to snoop MADs, which is useful for debugging.

ib_register_mad_agent

ib_unregister_mad_agent

ib_register_mad_snoop

After registering with the MAD layer, the MAD client sends and receives MADs using the following routines.

ib_post_send_mad

ib_coalesce_recv_mad

ib_free_recv_mad

ib_cancel_mad

ib_redirect_mad_qp

ib_process_mad_wc

The ib_post_send_mad routine allows the client to queue a MAD to be sent. After a MAD is received, it is given to a client through their receive handler specified when registering.

When a client is done processing an incoming MAD, it frees the MAD buffer by calling ib_free_recv_mad. As one would expect, the ib_cancel_mad routine is used to cancel an outstanding MAD request.

ib_coalesce_recv_mad is a place-holder routine related to the handling of MAD segmentation and reassembly. It will copy received MAD segments into a single data buffer, and will be implemented once the InfiniBand reliable-multi-packet-protocol (RMPP) support is added.

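
The copy that ib_coalesce_recv_mad is expected to perform can be sketched as a simple gather of received segments into one buffer. The function below is an illustrative stand-in only; real RMPP segments carry headers and variable last-segment lengths that are omitted here:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Coalesce 'nseg' received segments of 'seg_len' bytes each into
 * 'dst'. Returns the total number of payload bytes copied.
 * (Illustrative: RMPP headers/trailers are not modeled.) */
static size_t coalesce_segments(char *dst,
                                const char *const *segs,
                                size_t nseg, size_t seg_len)
{
    size_t i, off = 0;

    for (i = 0; i < nseg; i++) {
        memcpy(dst + off, segs[i], seg_len);
        off += seg_len;
    }
    return off;
}
```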
Similarly, the routine ib_redirect_mad_qp and the routine ib_process_mad_wc are place holders for supporting QP redirection, but are not currently implemented. QP redirection permits a management agent to send and receive MADs on a QP other than the GSI QP (QP 1). As an example, a data-intensive protocol could use QP redirection to send and receive management datagrams on its own QP, avoiding contention with other users of the GSI QP, such as connection management or SA queries. In this case, the client can redirect a particular InfiniBand management class to a dedicated QP using the ib_redirect_mad_qp routine. The ib_process_mad_wc routine would then be used to complete or continue processing a previously started MAD request on the redirected QP.

6 InfiniBand connection management

The mid-layer provides several helper functions to assist with establishing connections. These are defined in the header file linux-2.6.11/drivers/infiniband/include/ib_cm.h. Before initiating a connection request, the client must first register a callback function with the mid-layer for connection events.

ib_create_cm_id

ib_destroy_cm_id

The ib_create_cm_id routine creates a communication id and registers a callback handler for connection events. The ib_destroy_cm_id routine can be used to free the communication id and de-register the communication callback routine after the client is finished using their connections.

The communication manager implements a client/server style of connection establishment, using a three-way handshake between the client and server. To establish a connection, the server side listens for incoming connection requests.

Clients connect to this server by sending a connection request. After receiving the connection request, the server will send a connection response or reject message back to the client. A client completes the connection setup by sending a ready to use (RTU) message back to the server. The following routines are used to accomplish this:

ib_cm_listen

ib_send_cm_req

ib_send_cm_rep

ib_send_cm_rtu

ib_send_cm_rej

ib_send_cm_mra

ib_cm_establish
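
The three-way handshake described above (REQ, REP or REJ, then RTU) can be modeled as a small state machine on the active (client) side. The states and transition function below are a teaching sketch, not the CM's actual state tables:

```c
#include <assert.h>

/* Simplified connection states for the active side of the
 * three-way handshake. */
enum cm_state {
    CM_IDLE,
    CM_REQ_SENT,     /* REQ sent, waiting for REP or REJ */
    CM_REP_RCVD,     /* REP received, RTU not yet sent */
    CM_ESTABLISHED,
    CM_REJECTED
};

enum cm_event {
    CM_EV_REP_RCVD,
    CM_EV_REJ_RCVD,
    CM_EV_RTU_SENT
};

/* Advance the client-side state: after sending a REQ, react to the
 * server's REP or REJ; sending the RTU completes the connection. */
static enum cm_state cm_next(enum cm_state s, enum cm_event ev)
{
    switch (s) {
    case CM_REQ_SENT:
        if (ev == CM_EV_REP_RCVD) return CM_REP_RCVD;
        if (ev == CM_EV_REJ_RCVD) return CM_REJECTED;
        break;
    case CM_REP_RCVD:
        if (ev == CM_EV_RTU_SENT) return CM_ESTABLISHED;
        break;
    default:
        break;
    }
    return s;  /* ignore events that do not apply in this state */
}
```

Walking the happy path, CM_REQ_SENT advances to CM_REP_RCVD on the server's REP, and to CM_ESTABLISHED once the RTU goes out; a REJ instead moves the connection to CM_REJECTED.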

The communication manager is responsible for retrying and timing out connection requests. Clients receiving a connection request may require more time to respond to a request than the timeout used by the sending client. For example, a client tries to connect to a server that provides access to a disk storage array. The server may require several seconds to ready the drives before responding to the client. To prevent the client from timing out its connection request, the server would use the ib_send_cm_mra routine to send a message received acknowledged (MRA) to notify the client that the request was received and that a longer timeout is necessary.

After a client sends the RTU message, it can begin transferring data on the connection. However, since CM messages are unreliable, the RTU may be delayed or lost. In such cases, receiving a message on the connection notifies the server that the connection has been established.

In order for the CM to properly track the connection state, the server calls ib_cm_establish to notify the CM that the connection is now active.

Once a client is finished with a connection, it can disconnect using the disconnect request routine (ib_send_cm_dreq) shown below. The recipient of a disconnect request sends a disconnect reply.

ib_send_cm_dreq

ib_send_cm_drep

There are two routines that support migrating a connection to an alternate path. These are:

ib_send_cm_lap

ib_send_cm_apr

The ib_send_cm_lap routine is used to request that an alternate path be loaded. The ib_send_cm_apr routine sends a response to the alternative path request, indicating if the alternate path was accepted.

6.1 Service ID Queries

InfiniBand provides a mechanism that allows services to register their existence with the subnet administrator. Other nodes can then query the subnet administrator to locate nodes that offer a given service and obtain the information needed to communicate with them. For example, a client can discover whether a node contains a specific UD service. Given the service ID, the client can discover the QP number and QKey of the service on the remote node, and can then use them to send datagrams to the remote service. The communication manager provides the following routines to assist in service ID resolution.

ib_send_cm_sidr_req

ib_send_cm_sidr_rep

7 InfiniBand work request and completion event processing

Once a client has created QPs and CQs, registered memory, and either established a connection or set up the QP for receiving datagrams, it can transfer data using the work request APIs. To send messages, perform RDMA reads or writes, or perform atomic operations, a client posts work request elements (WQEs) to the send queue of the queue pair. The format of the WQEs, along with other critical data structures, is defined in linux-2.6.11/drivers/infiniband/include/ib_verbs.h. To allow data to be received, the client must first post receive WQEs to the receive queue of the QP.

ib_post_send

ib_post_recv

The post routines allow the client to post a chain of WQEs linked together in a list. If the format of a WQE is bad and the post routine detects the error at post time, the routine returns a pointer to the bad WQE.

To process completions, a client typically sets up a completion callback handler when the completion queue (CQ) is created. The client can then call ib_req_notify_cq to request a notification callback on a given CQ. The ib_req_ncomp_notif routine allows the notification to be delivered after n WQEs have completed, rather than after a single one.

ib_req_notify_cq

ib_req_ncomp_notif

The mid-layer also provides routines for polling for completions and peeking to see how many completions are currently pending on the completion queue. These are:

ib_poll_cq

ib_peek_cq

Finally, there is the possibility that the client might receive an asynchronous event from the InfiniBand device. This happens for certain types of errors, or when ports come online or go offline. Readers should refer to section 11.6.3 of the InfiniBand specification [IBTA] for a list of possible asynchronous event types. The mid-layer provides the following routines to register for asynchronous events.

ib_register_event_handler

ib_unregister_event_handler

8 Userspace InfiniBand Access

The InfiniBand architecture is designed so that multiple userspace processes can share a single InfiniBand adapter at the same time, with each process using a private context so that fast path operations can access the adapter hardware directly, without the overhead of a system call or a copy between kernel space and userspace.

Work is currently underway to add this support to the Linux InfiniBand stack. A kernel module, ib_uverbs.ko, implements character special devices that are used for control path operations, such as allocating userspace contexts and pinning userspace memory, as well as creating InfiniBand resources such as queue pairs and completion queues. On the userspace side, a library called libibverbs will provide an API similar to the kernel API described above.

In addition to adding support for accessing the verbs from userspace, a kernel module (ib_umad.ko) allows access to MAD services from userspace.

There also now exists a kernel module to proxy CM services into userspace. The kernel module is called ib_ucm.ko.

As the userspace infrastructure is still under construction, it has not yet been incorporated into the main kernel tree, but it is expected to be submitted to lkml in the near future. People who want early access to the code can download it from the InfiniBand subversion development tree available from openib.org.

9 Acknowledgments

We would like to acknowledge the United States Department of Energy for their funding of InfiniBand open source work and for the computing resources used to host the openib.org web site and subversion database. We would also like to acknowledge the DOE for their work in scale-up testing of the InfiniBand code using their large clusters.

We would also like to acknowledge all of the companies of the openib.org alliance that have applied resources to the openib.org InfiniBand open source project.

Finally, we would like to acknowledge the help of all of the individuals in the Linux community who have submitted patches, provided code reviews, and helped with testing to ensure the InfiniBand code is stable.

10 Availability

The latest stable release of the InfiniBand code is included in Linux kernel releases starting with 2.6.11, available from kernel.org.

ftp://kernel.org/pub

For those who want to track the latest InfiniBand development tree, it is located in a subversion database at openib.org.

svn checkout https://openib.org/svn/gen2

References

[IBTA] The InfiniBand Architecture Specification, Vol. 1, Release 1.2

http://www.ibta.org

[SDP] The SDP driver, developed by Libor Michalek

http://www.openib.org

[IPMI] Intelligent Platform Management Interface

http://developer.intel.com

[SNMP] Simple Network Management Protocol

http://www.ietf.org

[LJ] May 2005 Linux Journal—InfiniBand and Linux

http://www.linuxjournal.com
