【verbs】IBV_WR API(3) Libibverbs Programmer’s Manual


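While reading source code I came across some unfamiliar APIs, such as ibv_wr_send. After searching, I found that they are documented in the libibverbs-dev package.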


Source: ibv_wr_send(3), libibverbs-dev, Debian testing, Debian Manpages

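DESCRIPTION

The verbs work request API (ibv_wr_*) allows efficient posting of work to a send queue using function calls instead of the struct-based ibv_post_send() scheme. This approach is designed to minimize CPU branching and locking during the posting process.

This API is intended to be used to access additional functionality beyond what is provided by ibv_post_send().

Batches of WRs posted with ibv_post_send() and batches posted with this API can be interleaved, as long as they are not posted within each other's critical region. (A critical region in this API is delimited by ibv_wr_start() and ibv_wr_complete()/ibv_wr_abort().)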

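USAGE

To use these APIs the QP must be created using ibv_create_qp_ex(), which allows setting IBV_QP_INIT_ATTR_SEND_OPS_FLAGS in comp_mask. The send_ops_flags should be set to the OR of the work request types that will be posted to the QP.

If the QP does not support all the requested work request types, then QP creation will fail.

Posting work requests to the QP is done within the critical region formed by ibv_wr_start() and ibv_wr_complete()/ibv_wr_abort() (see CONCURRENCY below).

Each work request is created by calling a WR builder function (see the WR builder column in the table below) to start creating the work request, followed by the allowed/required setter functions described below.

The WR builder and setter combination can be called multiple times to efficiently post multiple work requests within a single critical region.

Each WR builder uses the wr_id member of struct ibv_qp_ex to set the value to be returned in the completion. Some operations also use the wr_flags member to influence the operation (see Flags below). These values should be set before invoking the WR builder function.

For example, a simple send could be formed as follows: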

qpx->wr_id = 1;
ibv_wr_send(qpx);
/* addr is a uint64_t, so the pointer is cast explicitly */
ibv_wr_set_sge(qpx, lkey, (uintptr_t)&data, sizeof(data));

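The WORK REQUESTS section describes the various WR builders and setters in detail.

Posting work is completed by calling ibv_wr_complete() or ibv_wr_abort(). No work is executed on the queue until ibv_wr_complete() returns success. ibv_wr_abort() discards all work prepared since ibv_wr_start().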

WORK REQUESTS

Operation             WR builder                 QP Type Supported                 Setters
ATOMIC_CMP_AND_SWP    ibv_wr_atomic_cmp_swp()    RC, XRC_SEND                      DATA, QP
ATOMIC_FETCH_AND_ADD  ibv_wr_atomic_fetch_add()  RC, XRC_SEND                      DATA, QP
BIND_MW               ibv_wr_bind_mw()           UC, RC, XRC_SEND                  NONE
LOCAL_INV             ibv_wr_local_inv()         UC, RC, XRC_SEND                  NONE
RDMA_READ             ibv_wr_rdma_read()         RC, XRC_SEND                      DATA, QP
RDMA_WRITE            ibv_wr_rdma_write()        UC, RC, XRC_SEND                  DATA, QP
RDMA_WRITE_WITH_IMM   ibv_wr_rdma_write_imm()    UC, RC, XRC_SEND                  DATA, QP
SEND                  ibv_wr_send()              UD, UC, RC, XRC_SEND, RAW_PACKET  DATA, QP
SEND_WITH_IMM         ibv_wr_send_imm()          UD, UC, RC, XRC_SEND              DATA, QP
SEND_WITH_INV         ibv_wr_send_inv()          UC, RC, XRC_SEND                  DATA, QP
TSO                   ibv_wr_send_tso()          UD, RAW_PACKET                    DATA, QP

Atomic operations

Atomic operations are only atomic so long as all writes to memory go through the same RDMA hardware. They are not atomic with respect to writes performed by the CPU or by other RDMA hardware in the system.

ibv_wr_atomic_cmp_swp()

If the remote 64 bit memory location specified by rkey and remote_addr equals compare then set it to swap.

ibv_wr_atomic_fetch_add()

Add add to the 64 bit memory location specified by rkey and remote_addr.

Memory Windows

Memory window type 2 operations (See man page for ibv_alloc_mw).

ibv_wr_bind_mw()

Bind a MW type 2 specified by mw, set a new rkey and set its properties by bind_info.

ibv_wr_local_inv()

Invalidate a MW type 2 which is associated with rkey.

RDMA

ibv_wr_rdma_read()

Read from the remote memory location specified by rkey and remote_addr. The number of bytes to read, and the local location to store the data, is determined by the DATA buffers set after this call.

ibv_wr_rdma_write(), ibv_wr_rdma_write_imm()

Write to the remote memory location specified by rkey and remote_addr. The number of bytes to write, and the local location to get the data, is determined by the DATA buffers set after this call.

The _imm version causes the remote side to get an IBV_WC_RECV_RDMA_WITH_IMM containing the 32 bits of immediate data.

Message Send

ibv_wr_send(), ibv_wr_send_imm()

Send a message. The number of bytes to send, and the local location to get the data, is determined by the DATA buffers set after this call.

The _imm version causes the remote side to get an IBV_WC_RECV completion with the IBV_WC_WITH_IMM flag set, containing the 32 bits of immediate data.

ibv_wr_send_inv()

The data transfer is the same as for ibv_wr_send(), however the remote side will invalidate the MR specified by invalidate_rkey before delivering a completion.

ibv_wr_send_tso()

Produce multiple SEND messages using TCP Segmentation Offload. The SGE points to a TCP Stream buffer which will be segmented into MSS size SENDs. The hdr includes the entire network headers up to and including the TCP header and is prefixed before each segment.

QP Specific setters

Certain QP types require each post to be accompanied by additional setters. These setters are mandatory for any operation listing a QP setter in the above table.

UD QPs

ibv_wr_set_ud_addr() must be called to set the destination address of the work.

XRC_SEND QPs

ibv_wr_set_xrc_srqn() must be called to set the destination SRQN field.

DATA transfer setters

For work that transfers data, one of the following setters should be called once after the WR builder:

ibv_wr_set_sge()

Transfer data to/from a single buffer given by the lkey, addr and length. This is equivalent to ibv_wr_set_sge_list() with a single element.

ibv_wr_set_sge_list()

Transfer data to/from a list of buffers, logically concatenated together. Each buffer is specified by an element in an array of struct ibv_sge.

Inline setters copy the send data during the setter call and allow the caller to immediately re-use the buffer. This behavior is identical to the IBV_SEND_INLINE flag. Generally this copy is done in a way that optimizes SEND latency and is suitable for small messages. The provider will limit the amount of data it can support in a single operation. This limit is requested in the max_inline_data member of struct ibv_qp_init_attr. Valid only for SEND and RDMA_WRITE.

ibv_wr_set_inline_data()

Copy send data from a single buffer given by the addr and length. This is equivalent to ibv_wr_set_inline_data_list() with a single element.

ibv_wr_set_inline_data_list()

Copy send data from a list of buffers, logically concatenated together. Each buffer is specified by an element in an array of struct ibv_data_buf.

Flags

A bit mask of flags may be specified in wr_flags to control the behavior of the work request.

IBV_SEND_FENCE

Do not start this work request until prior work has completed.

IBV_SEND_IP_CSUM

Offload the IPv4 and TCP/UDP checksum calculation.

IBV_SEND_SIGNALED

A completion will be generated in the completion queue for the operation.

IBV_SEND_SOLICITED

Set the solicited bit in the RDMA packet. This informs the other side to generate a completion event upon receiving the RDMA operation.

CONCURRENCY

The provider will provide locking to ensure that ibv_wr_start() and ibv_wr_complete()/ibv_wr_abort() form a per-QP critical section where no other threads can enter.

If an ibv_td is provided during QP creation then no locking will be performed and it is up to the caller to ensure that only one thread can be within the critical region at a time.

RETURN VALUE

Applications should use this API in a way that does not create failures. The individual APIs do not return a failure indication to avoid branching.

If a failure is detected during operation, for instance due to an invalid argument, then ibv_wr_complete() will return failure and the entire posting will be aborted.

EXAMPLE

/* create RC QP type and specify the required send opcodes */
qp_init_attr_ex.qp_type = IBV_QPT_RC;
qp_init_attr_ex.comp_mask |= IBV_QP_INIT_ATTR_SEND_OPS_FLAGS;
qp_init_attr_ex.send_ops_flags |= IBV_QP_EX_WITH_RDMA_WRITE;
qp_init_attr_ex.send_ops_flags |= IBV_QP_EX_WITH_RDMA_WRITE_WITH_IMM;

/* ibv_create_qp_ex() takes a pointer to the attribute struct */
struct ibv_qp *qp = ibv_create_qp_ex(ctx, &qp_init_attr_ex);
struct ibv_qp_ex *qpx = ibv_qp_to_qp_ex(qp);

ibv_wr_start(qpx);

/* create 1st WRITE WR entry */
qpx->wr_id = my_wr_id_1;
ibv_wr_rdma_write(qpx, rkey, remote_addr_1);
ibv_wr_set_sge(qpx, lkey, local_addr_1, length_1);

/* create 2nd WRITE_WITH_IMM WR entry */
qpx->wr_id = my_wr_id_2;
qpx->wr_flags = IBV_SEND_SIGNALED;
ibv_wr_rdma_write_imm(qpx, rkey, remote_addr_2, htonl(0x1234));
ibv_wr_set_sge(qpx, lkey, local_addr_2, length_2);

/* Begin processing WRs */
ret = ibv_wr_complete(qpx);

NAME

ibv_wr_abort, ibv_wr_complete, ibv_wr_start - Manage regions allowed to post work
ibv_wr_atomic_cmp_swp, ibv_wr_atomic_fetch_add - Post remote atomic operation work requests
ibv_wr_bind_mw, ibv_wr_local_inv - Post work requests for memory windows
ibv_wr_rdma_read, ibv_wr_rdma_write, ibv_wr_rdma_write_imm - Post RDMA work requests
ibv_wr_send, ibv_wr_send_imm, ibv_wr_send_inv - Post send work requests
ibv_wr_send_tso - Post segmentation offload work requests
ibv_wr_set_inline_data, ibv_wr_set_inline_data_list - Attach inline data to the last work request
ibv_wr_set_sge, ibv_wr_set_sge_list - Attach data to the last work request
ibv_wr_set_ud_addr - Attach UD addressing info to the last work request
ibv_wr_set_xrc_srqn - Attach an XRC SRQN to the last work request

SYNOPSIS

#include <infiniband/verbs.h>

void ibv_wr_abort(struct ibv_qp_ex *qp);
int ibv_wr_complete(struct ibv_qp_ex *qp);
void ibv_wr_start(struct ibv_qp_ex *qp);

void ibv_wr_atomic_cmp_swp(struct ibv_qp_ex *qp, uint32_t rkey,
                           uint64_t remote_addr, uint64_t compare,
                           uint64_t swap);
void ibv_wr_atomic_fetch_add(struct ibv_qp_ex *qp, uint32_t rkey,
                             uint64_t remote_addr, uint64_t add);

void ibv_wr_bind_mw(struct ibv_qp_ex *qp, struct ibv_mw *mw, uint32_t rkey,
                    const struct ibv_mw_bind_info *bind_info);
void ibv_wr_local_inv(struct ibv_qp_ex *qp, uint32_t invalidate_rkey);

void ibv_wr_rdma_read(struct ibv_qp_ex *qp, uint32_t rkey,
                      uint64_t remote_addr);
void ibv_wr_rdma_write(struct ibv_qp_ex *qp, uint32_t rkey,
                       uint64_t remote_addr);
void ibv_wr_rdma_write_imm(struct ibv_qp_ex *qp, uint32_t rkey,
                           uint64_t remote_addr, __be32 imm_data);

void ibv_wr_send(struct ibv_qp_ex *qp);
void ibv_wr_send_imm(struct ibv_qp_ex *qp, __be32 imm_data);
void ibv_wr_send_inv(struct ibv_qp_ex *qp, uint32_t invalidate_rkey);
void ibv_wr_send_tso(struct ibv_qp_ex *qp, void *hdr, uint16_t hdr_sz,
                     uint16_t mss);

void ibv_wr_set_inline_data(struct ibv_qp_ex *qp, void *addr, size_t length);
void ibv_wr_set_inline_data_list(struct ibv_qp_ex *qp, size_t num_buf,
                                 const struct ibv_data_buf *buf_list);
void ibv_wr_set_sge(struct ibv_qp_ex *qp, uint32_t lkey, uint64_t addr,
                    uint32_t length);
void ibv_wr_set_sge_list(struct ibv_qp_ex *qp, size_t num_sge,
                         const struct ibv_sge *sg_list);

void ibv_wr_set_ud_addr(struct ibv_qp_ex *qp, struct ibv_ah *ah,
                        uint32_t remote_qpn, uint32_t remote_qkey);
void ibv_wr_set_xrc_srqn(struct ibv_qp_ex *qp, uint32_t remote_srqn);