[RDMA] RDMA SEND/WRITE Programming Example (IBV Verbs)

Contents

Preface

RDMA programming basics

More examples

Overview of the Verbs-based RDMA RC communication example

Main()

{

print_config()

resources_init()

resources_create()

sock_connect()

connect_qp()

post_send

poll_completion

resources_destroy

}

How to run the example

Code 1 (Send, Receive, RDMA Read, RDMA Write)

Code 2: adds a choice between UC and RC, and a ToS setting

More tutorials

Linux programming examples


Preface

Updated from time to time. Link to this article: https://blog.csdn.net/bandaoyu/article/details/115988785

[RDMA] RDMA learning resources index (recommended RDMA books): https://blog.csdn.net/bandaoyu/article/details/120485737

RDMA programming basics

Introduction to RDMA Programming: https://blog.csdn.net/bandaoyu/article/details/125681856

Storage Master Class | Introduction to RDMA and programming basics: https://zhuanlan.zhihu.com/p/387549948

It is also worth looking at Mellanox's VMA, which apparently lets applications communicate through the plain socket API and is therefore much more convenient: [RDMA] Besides RDMA (verbs), is VMA another way to lower CPU usage? | RDMA programming with sockets? https://blog.csdn.net/bandaoyu/article/details/120726746

More examples:

https://github.com/gpudirect/libibverbs/tree/master/examples

Usage:

Package libibverbs-utils - man pages | ManKier

Overview of the Verbs-based RDMA RC communication example

https://docs.mellanox.com/display/RDMAAwareProgrammingv17/Programming+Examples+Using+IBV+Verbs

The following is an overview of the functions in the programming example, in the order in which they are called.

Main()

{

Parse command line.

The user can set the TCP port for the test, the device name, and the device port. If set, these values override the defaults in config. The last parameter is the server name. If the server name is set, the program runs in client mode and connects to the server it names; otherwise the program runs in server mode.

Call print_config().       # print the run parameters (defaults are used for anything not given)

Call resources_init().   # initialize the structs and other variables

Call resources_create(). # create the socket, context, PD, CQ and QP; allocate and register memory

Call connect_qp().         # move the QP to INIT, then RTR, then RTS

Call post_send()            # uses the IBV_WR_SEND operation (server side)

Call poll_completion().

Note that the server side expects a completion from the SEND request and the client side expects a RECEIVE completion.

If in client mode, show the message received via the RECEIVE operation; otherwise, if in server mode, load the buffer with a new message.

Sync client<->server.

At this point the server goes directly to the next sync. All RDMA operations are done strictly by the client, which performs every step in the client-only block below.

***Client only ***

Call post_send with IBV_WR_RDMA_READ to perform a RDMA read of server’s buffer.

Call poll_completion.

Show the server's message.
Set up the send buffer with new message data.

Call post_send with IBV_WR_RDMA_WRITE to perform a RDMA write of server’s buffer.

Call poll_completion.

*** End client only operations ***

Sync client<->server.

If server mode, show buffer, proving RDMA write worked.

Call resources_destroy.

Free device name string.

Done.

}
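The command-line step at the top of Main() can be sketched with getopt_long. This is only a minimal sketch, not the example's actual main(): the struct cli_config and the function parse_cli are hypothetical names, the long option names are invented for illustration, and the short options are assumptions (-d, -i and -g are taken from the run commands in this article, -p for the TCP port is assumed). It relies on GNU getopt moving non-option arguments to the end, so the server name may appear before or after the options.

```c
#include <getopt.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical mirror of the example's config_t, for illustration only. */
struct cli_config {
    const char *dev_name;    /* -d: IB device name, NULL = first device found */
    const char *server_name; /* positional argument, NULL = server mode */
    unsigned    tcp_port;    /* -p: TCP port of the control connection (assumed flag) */
    int         ib_port;     /* -i: local device port */
    int         gid_idx;     /* -g: GID index */
};

/* Parse argv into cfg. Options override the defaults already stored in cfg.
 * Returns 0 on success, -1 on an unknown option or more than one positional argument. */
static int parse_cli(int argc, char *argv[], struct cli_config *cfg)
{
    /* Long option names are illustrative assumptions. */
    static const struct option long_opts[] = {
        {"port",    required_argument, NULL, 'p'},
        {"ib-dev",  required_argument, NULL, 'd'},
        {"ib-port", required_argument, NULL, 'i'},
        {"gid-idx", required_argument, NULL, 'g'},
        {NULL, 0, NULL, 0}
    };
    int c;

    while ((c = getopt_long(argc, argv, "p:d:i:g:", long_opts, NULL)) != -1) {
        switch (c) {
        case 'p': cfg->tcp_port = (unsigned)strtoul(optarg, NULL, 0); break;
        case 'd': cfg->dev_name = optarg; break;
        case 'i': cfg->ib_port = (int)strtol(optarg, NULL, 0); break;
        case 'g': cfg->gid_idx = (int)strtol(optarg, NULL, 0); break;
        default:  return -1; /* getopt_long has already printed a diagnostic */
        }
    }

    /* One leftover argument is the server name (client mode); none means server mode. */
    if (optind == argc - 1)
        cfg->server_name = argv[optind];
    else if (optind < argc)
        return -1;
    return 0;
}
```

A caller would fill cli_config with the defaults first (for example TCP port 19875 and device port 1, as in the config shown later in this article) and then call parse_cli(argc, argv, &cfg).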

print_config()

{Print out the configuration information.}

resources_init()

{Zero out the resources struct.}

resources_create()

{

  • Call sock_connect

Call sock_connect to connect to the peer over a TCP socket.

Get the devices list, find the device we want, and open it.

Free the devices list.

  • Get the port information.
  • Create a PD. //ibv_alloc_pd
  • Create a CQ. //ibv_create_cq
  • Allocate a buffer, initialize it, and register it. //res->buf = (char *)malloc(size), then memset(res->buf, 0, size), then ibv_reg_mr(res->pd, res->buf, ...)
  • Create a QP. //res->qp = ibv_create_qp(res->pd, &qp_init_attr);

}

sock_connect()

{

If client, resolve the DNS address of the server and initiate a connection to it.
If server, listen for an incoming connection on the indicated port.

}

connect_qp()

{

  • Call modify_qp_to_init.
  • Call post_receive.
  • Call sock_sync_data to exchange information between the server and the client.
  • Call modify_qp_to_rtr.
  • Call modify_qp_to_rts.
  • Call sock_sync_data to synchronize client<->server.

}
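The information exchanged in connect_qp() has to be put into network byte order before it crosses the TCP socket, because the two hosts may differ in endianness. The following is a minimal sketch of that packing step, assuming a struct with the same fields as the cm_con_data_t used in the code later in this article. The names conn_info, hton64, ntoh64, conn_info_to_wire and conn_info_from_wire are hypothetical, and the 64-bit conversion is written with shifts rather than the bswap_64 helper the example itself uses.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Illustrative copy of the data each side sends to its peer. */
struct conn_info {
    uint64_t addr;    /* buffer address */
    uint32_t rkey;    /* remote key */
    uint32_t qp_num;  /* QP number */
    uint16_t lid;     /* LID of the IB port */
    uint8_t  gid[16]; /* GID, already a byte array, so no swapping needed */
} __attribute__((packed));

/* Produce a value whose in-memory bytes are big-endian, on any host. */
static uint64_t hton64(uint64_t host)
{
    uint8_t bytes[8];
    uint64_t wire;
    for (int i = 0; i < 8; i++)
        bytes[i] = (uint8_t)(host >> (56 - 8 * i));
    memcpy(&wire, bytes, sizeof wire);
    return wire;
}

/* Inverse of hton64: interpret the in-memory bytes as big-endian. */
static uint64_t ntoh64(uint64_t wire)
{
    uint8_t bytes[8];
    uint64_t host = 0;
    memcpy(bytes, &wire, sizeof bytes);
    for (int i = 0; i < 8; i++)
        host = (host << 8) | bytes[i];
    return host;
}

/* Convert local values to the representation that is written to the socket. */
static void conn_info_to_wire(const struct conn_info *host, struct conn_info *wire)
{
    wire->addr   = hton64(host->addr);
    wire->rkey   = htonl(host->rkey);
    wire->qp_num = htonl(host->qp_num);
    wire->lid    = htons(host->lid);
    memcpy(wire->gid, host->gid, sizeof wire->gid);
}

/* Convert what was read from the socket back to host byte order. */
static void conn_info_from_wire(const struct conn_info *wire, struct conn_info *host)
{
    host->addr   = ntoh64(wire->addr);
    host->rkey   = ntohl(wire->rkey);
    host->qp_num = ntohl(wire->qp_num);
    host->lid    = ntohs(wire->lid);
    memcpy(host->gid, wire->gid, sizeof host->gid);
}
```

Packing the struct keeps its size at 34 bytes on both sides, so the number of bytes written and read over the socket does not depend on compiler padding.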

  • modify_qp_to_init

Transition the QP to the INIT state.

  • post_receive

Prepare a scatter/gather entry for the receive buffer.

Prepare an RR.

Post the RR.

  • sock_sync_data

Using the TCP socket created with sock_connect, synchronize the given set of data between the client and the server. Since this function is blocking, it is also called with dummy data to synchronize the timing of the client and server.

  • modify_qp_to_rtr

Transition the QP to the RTR state.

  • modify_qp_to_rts

Transition the QP to the RTS state.

post_send

{

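Prepare a scatter/gather entry for the data to be sent (or received, in the RDMA read case).
Create an SR. Note that IBV_SEND_SIGNALED is redundant here, because the QP is created with sq_sig_all = 1.
If this is an RDMA operation, set the address and key.
Post the SR.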

}

poll_completion

{Poll the CQ until a CQE is found or MAX_POLL_CQ_TIMEOUT milliseconds have elapsed.}

resources_destroy

{Release the resources.}

}

How to run the example


Required library: libibverbs
Build command: gcc <filename> -o service -libverbs


How to run:
1. With InfiniBand network support:
Server: ./service
Client: ./service <server IP>

2. Over RoCE:
Server: ./service -g 0
Client: ./service -g 0 <server IP>

Server:

./RDMA_RC_example -g 0 -i 1 -d mlx5_1

Client:

./RDMA_RC_example 172.17.31.53  -g 0 -d mlx5_1   # 172.17.31.53 is the IP of mlx5_1

For problems with the code, please open an issue on GitHub.
Author's GitHub: https://github.com/fruitdish/RDMA-EXAMPLE/tree/master/01

Code 1 (Send, Receive, RDMA Read, RDMA Write)

RDMA_RC_example.c · bruce/RDMA-Tutorial - Gitee.com

(CSDN source download: https://download.csdn.net/download/bandaoyu/25343833)

Note:

struct config_t config =
{
    NULL,  /* dev_name */
    NULL,  /* server_name */
    19875, /* tcp_port */
    1,     /* ib_port */
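    //-1     /* gid_idx: the original source initializes this to -1, which looks like a bug: when -g is not passed at run time, the check if (config.gid_idx >= 0) in connect_qp() is never satisfied */
     0       /* gid_idx */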
};

/*
* BUILD COMMAND:
* gcc -Wall -O0 -g -o RDMA_RC_example RDMA_RC_example.c -libverbs
*server:
*./RDMA_RC_example  -d mlx5_0 -i 1 -g 3
*client:
*./RDMA_RC_example 192.169.31.53 -d mlx5_0 -i 1 -g 3
*/
/******************************************************************************
*
* RDMA Aware Networks Programming Example
*
* This code demonstrates how to perform the following operations using
* the VPI Verbs API:
* Send
* Receive
* RDMA Read
* RDMA Write
*
*****************************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdint.h>
#include <inttypes.h>
#include <endian.h>
#include <byteswap.h>
#include <getopt.h>
#include <sys/time.h>
#include <arpa/inet.h>
#include <infiniband/verbs.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* poll CQ timeout in millisec (2 seconds) */
#define MAX_POLL_CQ_TIMEOUT 2000
#define SRV_MSG "Server's message "
#define RDMAMSGR "RDMA read operation "
#define RDMAMSGW "RDMA write operation"
#define MSG_SIZE 64
#if __BYTE_ORDER == __LITTLE_ENDIAN
static inline uint64_t htonll(uint64_t x)
{
    return bswap_64(x);
}
static inline uint64_t ntohll(uint64_t x)
{
    return bswap_64(x);
}
#elif __BYTE_ORDER == __BIG_ENDIAN
static inline uint64_t htonll(uint64_t x)
{
    return x;
}
static inline uint64_t ntohll(uint64_t x)
{
    return x;
}
#else
#error __BYTE_ORDER is neither __LITTLE_ENDIAN nor __BIG_ENDIAN
#endif

/* structure of test parameters */
struct config_t
{
    const char *dev_name; /* IB device name */
    char *server_name;    /* server host name */
    uint32_t tcp_port;    /* server TCP port */
    int ib_port;          /* local IB port to work with */
    int gid_idx;          /* gid index to use */
};

/* structure to exchange data which is needed to connect the QPs */
struct cm_con_data_t
{
    uint64_t addr;        /* Buffer address */
    uint32_t rkey;        /* Remote key */
    uint32_t qp_num;      /* QP number */
    uint16_t lid;         /* LID of the IB port */
    uint8_t gid[16];      /* gid */
} __attribute__((packed));

/* structure of system resources */
struct resources
{
    struct ibv_device_attr device_attr; /* Device attributes */
    struct ibv_port_attr port_attr;     /* IB port attributes */
    struct cm_con_data_t remote_props;  /* values to connect to remote side */
    struct ibv_context *ib_ctx;         /* device handle */
    struct ibv_pd *pd;                  /* PD handle */
    struct ibv_cq *cq;                  /* CQ handle */
    struct ibv_qp *qp;                  /* QP handle */
    struct ibv_mr *mr;                  /* MR handle for buf */
    char *buf;                          /* memory buffer pointer, used for RDMA and send ops */
    int sock;                           /* TCP socket file descriptor */
};

struct config_t config =
{
    NULL,  /* dev_name */
    NULL,  /* server_name */
    19875, /* tcp_port */
    1,     /* ib_port */
    //-1     /* gid_idx: the original source initializes this to -1, which looks like a bug: the check if (config.gid_idx >= 0) in connect_qp() is never satisfied */
     0       /* gid_idx */
};

/******************************************************************************
Socket operations:
For simplicity, the example program uses TCP sockets to exchange control
information. If a TCP/IP stack/connection is not available, connection manager
(CM) may be used to pass this information. Use of CM is beyond the scope of
this example
******************************************************************************/
/******************************************************************************
* Function: sock_connect
* Input:
* servername: URL of server to connect to (NULL for server mode)
* port: port of service
*
* Output:none
*
* Returns: socket (fd) on success, negative error code on failure
*
* Description:
* Connect a socket. If servername is specified a client connection will be
* initiated to the indicated server and port. Otherwise listen on the
* indicated port for an incoming connection.
*
******************************************************************************/
static int sock_connect(const char *servername, int port)
{
    struct addrinfo *resolved_addr = NULL;
    struct addrinfo *iterator;
    char service[6];
    int sockfd = -1;
    int listenfd = 0;
    int tmp;
    struct addrinfo hints =
    {
        .ai_flags    = AI_PASSIVE,
        .ai_family   = AF_INET,
        .ai_socktype = SOCK_STREAM
    };

    if(sprintf(service, "%d", port) < 0)
    {
        goto sock_connect_exit;
    }

    /* Resolve DNS address, use sockfd as temp storage */
    sockfd = getaddrinfo(servername, service, &hints, &resolved_addr);
    if(sockfd < 0)
    {
        fprintf(stderr, "%s for %s:%d\n", gai_strerror(sockfd), servername, port);
        goto sock_connect_exit;
    }

    /* Search through results and find the one we want */
    for(iterator = resolved_addr; iterator ; iterator = iterator->ai_next)
    {
        sockfd = socket(iterator->ai_family, iterator->ai_socktype, iterator->ai_protocol);
        if(sockfd >= 0)
        {
            if(servername)
			{
                /* Client mode. Initiate connection to remote */
                if((tmp=connect(sockfd, iterator->ai_addr, iterator->ai_addrlen)))
                {
                    fprintf(stdout, "failed connect \n");
                    close(sockfd);
                    sockfd = -1;
                }
			}
            else
            {
                /* Server mode. Set up listening socket and accept a connection */
                listenfd = sockfd;
                sockfd = -1;
                if(bind(listenfd, iterator->ai_addr, iterator->ai_addrlen))
                {
                    goto sock_connect_exit;
                }
                listen(listenfd, 1);
                sockfd = accept(listenfd, NULL, 0);
            }
        }
    }

sock_connect_exit:
    if(listenfd)
    {
        close(listenfd);
    }

    if(resolved_addr)
    {
        freeaddrinfo(resolved_addr);
    }

    if(sockfd < 0)
    {
        if(servername)
        {
            fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port);
        }
        else
        {
            perror("server accept");
            fprintf(stderr, "accept() failed\n");
        }
    }

    return sockfd;
}

/******************************************************************************
* Function: sock_sync_data
* Input:
* sock: socket to transfer data on
* xfer_size: size of data to transfer
* local_data: pointer to data to be sent to remote
*
* Output: remote_data pointer to buffer to receive remote data
*
* Returns: 0 on success, negative error code on failure
*
* Description:
* Sync data across a socket. The indicated local data will be sent to the
* remote. It will then wait for the remote to send its data back. It is
* assumed that the two sides are in sync and call this function in the proper
* order. Chaos will ensue if they are not. :)
*
* Also note this is a blocking function and will wait for the full data to be
* received from the remote.
*
******************************************************************************/
int sock_sync_data(int sock, int xfer_size, char *local_data, char *remote_data)
{
    int rc;
    int read_bytes = 0;
    int total_read_bytes = 0;
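    rc = write(sock, local_data, xfer_size);

    if(rc < xfer_size)
    {
        fprintf(stderr, "Failed writing data during sock_sync_data\n");
        rc = -1; /* report short or failed writes as an error */
    }
    else
    {
        rc = 0;
    }

    while(!rc && total_read_bytes < xfer_size)
    {
        /* append after the bytes already received instead of overwriting them */
        read_bytes = read(sock, remote_data + total_read_bytes, xfer_size - total_read_bytes);
        if(read_bytes > 0)
        {
            total_read_bytes += read_bytes;
        }
        else
        {
            /* 0 means the peer closed the connection, negative means a read error */
            rc = -1;
        }
    }
    return rc;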
}
/******************************************************************************
End of socket operations
******************************************************************************/

/* poll_completion */
/******************************************************************************
* Function: poll_completion
*
* Input:
* res: pointer to resources structure
*
* Output: none
*
* Returns: 0 on success, 1 on failure
*
* Description:
* Poll the completion queue for a single event. This function will continue to
* poll the queue until MAX_POLL_CQ_TIMEOUT milliseconds have passed.
*
******************************************************************************/
static int poll_completion(struct resources *res)
{
    struct ibv_wc wc;
    unsigned long start_time_msec;
    unsigned long cur_time_msec;
    struct timeval cur_time;
    int poll_result;
    int rc = 0;
    /* poll the completion queue for a while before giving up */
    gettimeofday(&cur_time, NULL);
    start_time_msec = (cur_time.tv_sec * 1000) + (cur_time.tv_usec / 1000);
    do
    {
        poll_result = ibv_poll_cq(res->cq, 1, &wc);
        gettimeofday(&cur_time, NULL);
        cur_time_msec = (cur_time.tv_sec * 1000) + (cur_time.tv_usec / 1000);
    }
    while((poll_result == 0) && ((cur_time_msec - start_time_msec) < MAX_POLL_CQ_TIMEOUT));

    if(poll_result < 0)
    {
        /* poll CQ failed */
        fprintf(stderr, "poll CQ failed\n");
        rc = 1;
    }
    else if(poll_result == 0)
    {
        /* the CQ is empty */
        fprintf(stderr, "completion wasn't found in the CQ after timeout\n");
        rc = 1;
    }
    else
    {
        /* CQE found */
        fprintf(stdout, "completion was found in CQ with status 0x%x\n", wc.status);
        /* check the completion status (here we don't care about the completion opcode) */
        if(wc.status != IBV_WC_SUCCESS)
        {
            fprintf(stderr, "got bad completion with status: 0x%x, vendor syndrome: 0x%x\n", 
					wc.status, wc.vendor_err);
            rc = 1;
        }
    }
    return rc;
}

/******************************************************************************
* Function: post_send
*
* Input:
* res: pointer to resources structure
* opcode: IBV_WR_SEND, IBV_WR_RDMA_READ or IBV_WR_RDMA_WRITE
*
* Output: none
*
* Returns: 0 on success, error code on failure
*
* Description: This function will create and post a send work request
******************************************************************************/
static int post_send(struct resources *res, int opcode)
{
    struct ibv_send_wr sr;
    struct ibv_sge sge;
    struct ibv_send_wr *bad_wr = NULL;
    int rc;

    /* prepare the scatter/gather entry */
    memset(&sge, 0, sizeof(sge));
    sge.addr = (uintptr_t)res->buf;
    sge.length = MSG_SIZE;
    sge.lkey = res->mr->lkey;

    /* prepare the send work request */
    memset(&sr, 0, sizeof(sr));
    sr.next = NULL;
    sr.wr_id = 0;
    sr.sg_list = &sge;
    sr.num_sge = 1;
    sr.opcode = opcode;
    sr.send_flags = IBV_SEND_SIGNALED;
    if(opcode != IBV_WR_SEND)
    {
        sr.wr.rdma.remote_addr = res->remote_props.addr;
        sr.wr.rdma.rkey = res->remote_props.rkey;
    }

    /* there is a Receive Request on the responder side, so we won't get into the RNR flow */
    rc = ibv_post_send(res->qp, &sr, &bad_wr);
    if(rc)
    {
        fprintf(stderr, "failed to post SR\n");
    }
    else
    {
        switch(opcode)
        {
        case IBV_WR_SEND:
            fprintf(stdout, "Send Request was posted\n");
            break;
        case IBV_WR_RDMA_READ:
            fprintf(stdout, "RDMA Read Request was posted\n");
            break;
        case IBV_WR_RDMA_WRITE:
            fprintf(stdout, "RDMA Write Request was posted\n");
            break;
        default:
            fprintf(stdout, "Unknown Request was posted\n");
            break;
        }
    }
    return rc;
}

/******************************************************************************
* Function: post_receive
*
* Input:
* res: pointer to resources structure
*
* Output: none
*
* Returns: 0 on success, error code on failure
*
* Description: post RR to be prepared for incoming messages
*
******************************************************************************/
static int post_receive(struct resources *res)
{
    struct ibv_recv_wr rr;
    struct ibv_sge sge;
    struct ibv_recv_wr *bad_wr;
    int rc;

    /* prepare the scatter/gather entry */
    memset(&sge, 0, sizeof(sge));
    sge.addr = (uintptr_t)res->buf;
    sge.length = MSG_SIZE;
    sge.lkey = res->mr->lkey;

    /* prepare the receive work request */
    memset(&rr, 0, sizeof(rr));
    rr.next = NULL;
    rr.wr_id = 0;
    rr.sg_list = &sge;
    rr.num_sge = 1;

    /* post the Receive Request to the RQ */
    rc = ibv_post_recv(res->qp, &rr, &bad_wr);
    if(rc)
    {
        fprintf(stderr, "failed to post RR\n");
    }
    else
    {
        fprintf(stdout, "Receive Request was posted\n");
    }
    return rc;
}

/******************************************************************************
* Function: resources_init
*
* Input:
* res: pointer to resources structure
*
* Output: res is initialized
*
* Returns: none
*
* Description: res is initialized to default values
******************************************************************************/
static void resources_init(struct resources *res)
{
    memset(res, 0, sizeof *res);
    res->sock = -1;
}

/******************************************************************************
* Function: resources_create
*
* Input: res pointer to resources structure to be filled in
*
* Output: res filled in with resources
*
* Returns: 0 on success, 1 on failure
*
* Description:
* This function creates and allocates all necessary system resources. These
* are stored in res.
*****************************************************************************/
static int resources_create(struct resources *res)
{
    struct ibv_device **dev_list = NULL;
    struct ibv_qp_init_attr qp_init_attr;
    struct ibv_device *ib_dev = NULL;
    size_t size;
    int i;
    int mr_flags = 0;
    int cq_size = 0;
    int num_devices;
    int rc = 0;

    /* if client side */
    if(config.server_name)
    {
        res->sock = sock_connect(config.server_name, config.tcp_port);
        if(res->sock < 0)
        {
            fprintf(stderr, "failed to establish TCP connection to server %s, port %d\n",
                    config.server_name, config.tcp_port);
            rc = -1;
            goto resources_create_exit;
        }
    }
    else
    {
        fprintf(stdout, "waiting on port %d for TCP connection\n", config.tcp_port);
        res->sock = sock_connect(NULL, config.tcp_port);
        if(res->sock < 0)
        {
            fprintf(stderr, "failed to establish TCP connection with client on port %d\n",
                    config.tcp_port);
            rc = -1;
            goto resources_create_exit;
        }
    }
    fprintf(stdout, "TCP connection was established\n");
    fprintf(stdout, "searching for IB devices in host\n");

    /* get device names in the system */
    dev_list = ibv_get_device_list(&num_devices);
    if(!dev_list)
    {
        fprintf(stderr, "failed to get IB devices list\n");
        rc = 1;
        goto resources_create_exit;
    }

    /* if there isn't any IB device in host */
    if(!num_devices)
    {
        fprintf(stderr, "found %d device(s)\n", num_devices);
        rc = 1;
        goto resources_create_exit;
    }
    fprintf(stdout, "found %d device(s)\n", num_devices);

    /* search for the specific device we want to work with */
    for(i = 0; i < num_devices; i ++)
    {
        if(!config.dev_name)
        {
            config.dev_name = strdup(ibv_get_device_name(dev_list[i]));
            fprintf(stdout, "device not specified, using first one found: %s\n", config.dev_name);
        }
		/* find the specific device */
        if(!strcmp(ibv_get_device_name(dev_list[i]), config.dev_name))
        {
            ib_dev = dev_list[i];
            break;
        }
    }

    /* if the device wasn't found in host */
    if(!ib_dev)
    {
        fprintf(stderr, "IB device %s wasn't found\n", config.dev_name);
        rc = 1;
        goto resources_create_exit;
    }

    /* get device handle */
    res->ib_ctx = ibv_open_device(ib_dev);
    if(!res->ib_ctx)
    {
        fprintf(stderr, "failed to open device %s\n", config.dev_name);
        rc = 1;
        goto resources_create_exit;
    }

    /* We are now done with device list, free it */
    ibv_free_device_list(dev_list);
    dev_list = NULL;
    ib_dev = NULL;

    /* query port properties */
    if(ibv_query_port(res->ib_ctx, config.ib_port, &res->port_attr))
    {
        fprintf(stderr, "ibv_query_port on port %u failed\n", config.ib_port);
        rc = 1;
        goto resources_create_exit;
    }

    /* allocate Protection Domain */
    res->pd = ibv_alloc_pd(res->ib_ctx);
    if(!res->pd)
    {
        fprintf(stderr, "ibv_alloc_pd failed\n");
        rc = 1;
        goto resources_create_exit;
    }

    /* each side will send only one WR, so Completion Queue with 1 entry is enough */
    cq_size = 1;
    res->cq = ibv_create_cq(res->ib_ctx, cq_size, NULL, NULL, 0);
    if(!res->cq)
    {
        fprintf(stderr, "failed to create CQ with %u entries\n", cq_size);
        rc = 1;
        goto resources_create_exit;
    }

    /* allocate the memory buffer that will hold the data */
    size = MSG_SIZE;
    res->buf = (char *) malloc(size);
    fprintf(stdout, "allocating memory buf\n");
    if(!res->buf)
    {
        fprintf(stderr, "failed to malloc %zu bytes to memory buffer\n", size);
        rc = 1;
        goto resources_create_exit;
    }
    memset(res->buf, 0 , size);

    /* only in the server side put the message in the memory buffer */
    if(!config.server_name)
    {
        strcpy(res->buf,SRV_MSG);
        fprintf(stdout, "put the message: '%s' to buf\n", res->buf);
    }
    else
    {
        memset(res->buf, 0, size);
    }

    /* register the memory buffer */
    mr_flags = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE ;
    res->mr = ibv_reg_mr(res->pd, res->buf, size, mr_flags);
    fprintf(stdout, "registering buf with the PD\n");
    if(!res->mr)
    {
        fprintf(stderr, "ibv_reg_mr failed with mr_flags=0x%x\n", mr_flags);
        rc = 1;
        goto resources_create_exit;
    }
    fprintf(stdout, "MR was registered with addr=%p, lkey=0x%x, rkey=0x%x, flags=0x%x\n",
            res->buf, res->mr->lkey, res->mr->rkey, mr_flags);

    /* create the Queue Pair */
    memset(&qp_init_attr, 0, sizeof(qp_init_attr));
    qp_init_attr.qp_type = IBV_QPT_RC;
    qp_init_attr.sq_sig_all = 1;
    qp_init_attr.send_cq = res->cq;
    qp_init_attr.recv_cq = res->cq;
    qp_init_attr.cap.max_send_wr = 1;
    qp_init_attr.cap.max_recv_wr = 1;
    qp_init_attr.cap.max_send_sge = 1;
    qp_init_attr.cap.max_recv_sge = 1;
    res->qp = ibv_create_qp(res->pd, &qp_init_attr);
    if(!res->qp)
    {
        fprintf(stderr, "failed to create QP\n");
        rc = 1;
        goto resources_create_exit;
    }
    fprintf(stdout, "QP was created, QP number=0x%x\n", res->qp->qp_num);

resources_create_exit:
    if(rc)
    {
        /* Error encountered, cleanup */
        if(res->qp)
        {
            ibv_destroy_qp(res->qp);
            res->qp = NULL;
        }
        if(res->mr)
        {
            ibv_dereg_mr(res->mr);
            res->mr = NULL;
        }
        if(res->buf)
        {
            free(res->buf);
            res->buf = NULL;
        }
        if(res->cq)
        {
            ibv_destroy_cq(res->cq);
            res->cq = NULL;
        }
        if(res->pd)
        {
            ibv_dealloc_pd(res->pd);
            res->pd = NULL;
        }
        if(res->ib_ctx)
        {
            ibv_close_device(res->ib_ctx);
            res->ib_ctx = NULL;
        }
        if(dev_list)
        {
            ibv_free_device_list(dev_list);
            dev_list = NULL;
        }
        if(res->sock >= 0)
        {
            if(close(res->sock))
            {
                fprintf(stderr, "failed to close socket\n");
            }
            res->sock = -1;
        }
    }
    return rc;
}


/******************************************************************************
* Function: modify_qp_to_init
*
* Input:
* qp: QP to transition
*
* Output: none
*
* Returns: 0 on success, ibv_modify_qp failure code on failure
*
* Description: Transition a QP from the RESET to INIT state
******************************************************************************/
static int modify_qp_to_init(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    int flags;
    int rc;
    memset(&attr, 0, sizeof(attr));
    attr.qp_state = IBV_QPS_INIT;
    attr.port_num = config.ib_port;
    attr.pkey_index = 0;
    attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE;
    flags = IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS;
    rc = ibv_modify_qp(qp, &attr, flags);
    if(rc)
    {
        fprintf(stderr, "failed to modify QP state to INIT\n");
    }
    return rc;
}

/******************************************************************************
* Function: modify_qp_to_rtr
*
* Input:
* qp: QP to transition
* remote_qpn: remote QP number
* dlid: destination LID
* dgid: destination GID (mandatory for RoCEE)
*
* Output: none
*
* Returns: 0 on success, ibv_modify_qp failure code on failure
*
* Description: 
* Transition a QP from the INIT to RTR state, using the specified QP number
******************************************************************************/
static int modify_qp_to_rtr(struct ibv_qp *qp, uint32_t remote_qpn, uint16_t dlid, uint8_t *dgid)
{
    struct ibv_qp_attr attr;
    int flags;
    int rc;
    memset(&attr, 0, sizeof(attr));
    attr.qp_state = IBV_QPS_RTR;
    attr.path_mtu = IBV_MTU_256;
    attr.dest_qp_num = remote_qpn;
    attr.rq_psn = 0;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer = 0x12;
    attr.ah_attr.is_global = 0;
    attr.ah_attr.dlid = dlid;
    attr.ah_attr.sl = 0;
    attr.ah_attr.src_path_bits = 0;
    attr.ah_attr.port_num = config.ib_port;
    if(config.gid_idx >= 0)
    {
        attr.ah_attr.is_global = 1;
        attr.ah_attr.port_num = config.ib_port; /* same local port as the QP */
        memcpy(&attr.ah_attr.grh.dgid, dgid, 16);
        attr.ah_attr.grh.flow_label = 0;
        attr.ah_attr.grh.hop_limit = 1;
        attr.ah_attr.grh.sgid_index = config.gid_idx;
        attr.ah_attr.grh.traffic_class = 0;
    }

    flags = IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
            IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER;
    rc = ibv_modify_qp(qp, &attr, flags);
    if(rc)
    {
        fprintf(stderr, "failed to modify QP state to RTR\n");
    }
    return rc;
}

/******************************************************************************
* Function: modify_qp_to_rts
*
* Input:
* qp: QP to transition
*
* Output: none
*
* Returns: 0 on success, ibv_modify_qp failure code on failure
*
* Description: Transition a QP from the RTR to RTS state
******************************************************************************/
static int modify_qp_to_rts(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    int flags;
    int rc;
    memset(&attr, 0, sizeof(attr));
    attr.qp_state = IBV_QPS_RTS;
    attr.timeout = 0x12;
    attr.retry_cnt = 6;
    attr.rnr_retry = 0;
    attr.sq_psn = 0;
    attr.max_rd_atomic = 1;
    flags = IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
            IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC;
    rc = ibv_modify_qp(qp, &attr, flags);
    if(rc)
    {
        fprintf(stderr, "failed to modify QP state to RTS\n");
    }
    return rc;
}

/******************************************************************************
* Function: connect_qp
*
* Input:
* res: pointer to resources structure
*
* Output: none
*
* Returns: 0 on success, error code on failure
*
* Description: 
* Connect the QP. Transition the server side to RTR, sender side to RTS
******************************************************************************/
static int connect_qp(struct resources *res)
{
    struct cm_con_data_t local_con_data;
    struct cm_con_data_t remote_con_data;
    struct cm_con_data_t tmp_con_data;
    int rc = 0;
    char temp_char;
    union ibv_gid my_gid;
    if(config.gid_idx >= 0)
    {
        rc = ibv_query_gid(res->ib_ctx, config.ib_port, config.gid_idx, &my_gid);
        if(rc)
        {
            fprintf(stderr, "could not get gid for port %d, index %d\n", config.ib_port, config.gid_idx);
            return rc;
        }
    }
    else
    {
        memset(&my_gid, 0, sizeof my_gid);
    }

    /* exchange using TCP sockets info required to connect QPs */
    local_con_data.addr = htonll((uintptr_t)res->buf);
    local_con_data.rkey = htonl(res->mr->rkey);
    local_con_data.qp_num = htonl(res->qp->qp_num);
    local_con_data.lid = htons(res->port_attr.lid);
    memcpy(local_con_data.gid, &my_gid, 16);
    fprintf(stdout, "\nLocal LID = 0x%x\n", res->port_attr.lid);
    if(sock_sync_data(res->sock, sizeof(struct cm_con_data_t), (char *) &local_con_data, (char *) &tmp_con_data) < 0)
    {
        fprintf(stderr, "failed to exchange connection data between sides\n");
        rc = 1;
        goto connect_qp_exit;
    }

    remote_con_data.addr = ntohll(tmp_con_data.addr);
    remote_con_data.rkey = ntohl(tmp_con_data.rkey);
    remote_con_data.qp_num = ntohl(tmp_con_data.qp_num);
    remote_con_data.lid = ntohs(tmp_con_data.lid);
    memcpy(remote_con_data.gid, tmp_con_data.gid, 16);

    /* save the remote side attributes, we will need it for the post SR */
    res->remote_props = remote_con_data;
    fprintf(stdout, "Remote address = 0x%"PRIx64"\n", remote_con_data.addr);
    fprintf(stdout, "Remote rkey = 0x%x\n", remote_con_data.rkey);
    fprintf(stdout, "Remote QP number = 0x%x\n", remote_con_data.qp_num);
    fprintf(stdout, "Remote LID = 0x%x\n", remote_con_data.lid);
    if(config.gid_idx >= 0)
    {
        uint8_t *p = remote_con_data.gid;
        fprintf(stdout, "Remote GID = %02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x\n",
				p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7], p[8], p[9], p[10], p[11], p[12], p[13], p[14], p[15]);
    }

    /* modify the QP to init */
    rc = modify_qp_to_init(res->qp);
    if(rc)
    {
        fprintf(stderr, "change QP state to INIT failed\n");
        goto connect_qp_exit;
    }

    /* let the client post RR to be prepared for incoming messages */
    if(config.server_name)
    {
        rc = post_receive(res);
        if(rc)
        {
            fprintf(stderr, "failed to post RR\n");
            goto connect_qp_exit;
        }
    }

    /* modify the QP to RTR */
    rc = modify_qp_to_rtr(res->qp, remote_con_data.qp_num, remote_con_data.lid, remote_con_data.gid);
    if(rc)
    {
        fprintf(stderr, "failed to modify QP state to RTR\n");
        goto connect_qp_exit;
    }

    /* modify the QP to RTS */
    rc = modify_qp_to_rts(res->qp);
    if(rc)
    {
        fprintf(stderr, "failed to modify QP state to RTS\n");
        goto connect_qp_exit;
    }
    fprintf(stdout, "QP state was changed to RTS\n");

    /* sync to make sure that both sides are in states that they can connect to prevent packet loss */
    if(sock_sync_data(res->sock, 1, "Q", &temp_char))  /* just send a dummy char back and forth */
    {
        fprintf(stderr, "sync error after QPs were moved to RTS\n");
        rc = 1;
    }

connect_qp_exit:
    return rc;
}

/******************************************************************************
* Function: resources_destroy
*
* Input:
* res: pointer to resources structure
*
* Output: none
*
* Returns: 0 on success, 1 on failure
*
* Description: Cleanup and deallocate all resources used
******************************************************************************/
static int resources_destroy(struct resources *res)
{
    int rc = 0;
    if(res->qp)
	{
        if(ibv_destroy_qp(res->qp))
        {
            fprintf(stderr, "failed to destroy QP\n");
            rc = 1;
        }
	}

    if(res->mr)
	{
        if(ibv_dereg_mr(res->mr))
        {
            fprintf(stderr, "failed to deregister MR\n");
            rc = 1;
        }
	}

    if(res->buf)
    {
        free(res->buf);
    }

    if(res->cq)
	{
        if(ibv_destroy_cq(res->cq))
        {
            fprintf(stderr, "failed to destroy CQ\n");
            rc = 1;
        }
	}

    if(res->pd)
	{
        if(ibv_dealloc_pd(res->pd))
        {
            fprintf(stderr, "failed to deallocate PD\n");
            rc = 1;
        }
	}

    if(res->ib_ctx)
	{
        if(ibv_close_device(res->ib_ctx))
        {
            fprintf(stderr, "failed to close device context\n");
            rc = 1;
        }
	}

    if(res->sock >= 0)
	{
        if(close(res->sock))
        {
            fprintf(stderr, "failed to close socket\n");
            rc = 1;
        }
	}
    return rc;
}

/******************************************************************************
* Function: print_config
*
* Input: none
*
* Output: none
*
* Returns: none
*
* Description: Print out config information
******************************************************************************/
static void print_config(void)
{
    fprintf(stdout, " ------------------------------------------------\n");
    fprintf(stdout, " Device name : \"%s\"\n", config.dev_name);
    fprintf(stdout, " IB port : %u\n", config.ib_port);
    if(config.server_name)
    {
        fprintf(stdout, " IP : %s\n", config.server_name);
    }
    fprintf(stdout, " TCP port : %u\n", config.tcp_port);
    if(config.gid_idx >= 0)
    {
        fprintf(stdout, " GID index : %u\n", config.gid_idx);
    }
    fprintf(stdout, " ------------------------------------------------\n\n");
}

/******************************************************************************
* Function: usage
*
* Input:
* argv0: command line arguments
*
* Output: none
*
* Returns: none
*
* Description: print a description of command line syntax
******************************************************************************/
static void usage(const char *argv0)
{
    fprintf(stdout, "Usage:\n");
    fprintf(stdout, " %s start a server and wait for connection\n", argv0);
    fprintf(stdout, " %s <host> connect to server at <host>\n", argv0);
    fprintf(stdout, "\n");
    fprintf(stdout, "Options:\n");
    fprintf(stdout, " -p, --port <port> listen on/connect to port <port> (default 18515)\n");
    fprintf(stdout, " -d, --ib-dev <dev> use IB device <dev> (default first device found)\n");
    fprintf(stdout, " -i, --ib-port <port> use port <port> of IB device (default 1)\n");
    fprintf(stdout, " -g, --gid-idx <gid index> gid index to be used in GRH (default not used)\n");
}

/******************************************************************************
* Function: main
*
* Input:
* argc: number of items in argv
* argv: command line parameters
*
* Output: none
*
* Returns: 0 on success, 1 on failure
*
* Description: Main program code
******************************************************************************/
int main(int argc, char *argv[])
{
    struct resources res;
    int rc = 1;
    char temp_char;

    /* parse the command line parameters */
    while(1)
    {
        int c;
		/* Designated Initializer */
        static struct option long_options[] =
        {
            {.name = "port", .has_arg = 1, .val = 'p' },
            {.name = "ib-dev", .has_arg = 1, .val = 'd' },
            {.name = "ib-port", .has_arg = 1, .val = 'i' },
            {.name = "gid-idx", .has_arg = 1, .val = 'g' },
            {.name = NULL, .has_arg = 0, .val = '\0'}
        };

        c = getopt_long(argc, argv, "p:d:i:g:", long_options, NULL);
        if(c == -1)
        {
            break;
        }
        switch(c)
        {
        case 'p':
            config.tcp_port = strtoul(optarg, NULL, 0);
            break;
        case 'd':
            config.dev_name = strdup(optarg);
            break;
        case 'i':
            config.ib_port = strtoul(optarg, NULL, 0);
            if(config.ib_port < 0)
            {
                usage(argv[0]);
                return 1;
            }
            break;
        case 'g':
            config.gid_idx = strtoul(optarg, NULL, 0);
            if(config.gid_idx < 0)
            {
                usage(argv[0]);
                return 1;
            }
            break;
        default:
            usage(argv[0]);
            return 1;
        }
    }

    /* parse the last parameter (if exists) as the server name */
	/* 
	 * a NULL server_name means this node is the server,
	 * otherwise this node is a client that needs to connect to
	 * the given server
	 */
    if(optind == argc - 1)
    {
        config.server_name = argv[optind];
    }
    else if(optind < argc)
    {
        usage(argv[0]);
        return 1;
    }

    /* print the used parameters for info*/
    print_config();
    /* init all of the resources, so cleanup will be easy */
    resources_init(&res);
    /* create resources before using them */
    if(resources_create(&res))
    {
        fprintf(stderr, "failed to create resources\n");
        goto main_exit;
    }
    /* connect the QPs */
    if(connect_qp(&res))
    {
        fprintf(stderr, "failed to connect QPs\n");
        goto main_exit;
    }
    /* let the server post the sr */
    if(!config.server_name)
	{
        if(post_send(&res, IBV_WR_SEND))
        {
            fprintf(stderr, "failed to post sr\n");
            goto main_exit;
        }
	}
    /* in both sides we expect to get a completion */
    if(poll_completion(&res))
    {
        fprintf(stderr, "poll completion failed\n");
        goto main_exit;
    }

    /* after polling the completion we have the message in the client buffer too */
    /* A completion means the SEND has finished, so the data sent by the peer can now be read from the registered buf. */
    if(config.server_name)
    {
        fprintf(stdout, "Message is: '%s'\n", res.buf);
    }
    else
    {
        /* setup server buffer with read message */
        strcpy(res.buf, RDMAMSGR);
    }
    /* Sync so we are sure server side has data ready before client tries to read it */
    if(sock_sync_data(res.sock, 1, "R", &temp_char))  /* just send a dummy char back and forth */
    {
        fprintf(stderr, "sync error before RDMA ops\n");
        rc = 1;
        goto main_exit;
    }

    /* 
	 * Now the client performs an RDMA read and then write on server.
	 * Note that the server has no idea these events have occurred
	 */
    if(config.server_name)
    {
        /* First we read contents of server's buffer */
        if(post_send(&res, IBV_WR_RDMA_READ))
        {
            fprintf(stderr, "failed to post SR 2\n");
            rc = 1;
            goto main_exit;
        }
        if(poll_completion(&res))
        {
            fprintf(stderr, "poll completion failed 2\n");
            rc = 1;
            goto main_exit;
        }
        /* A completion means the READ/WRITE has finished, so the data fetched from the peer by RDMA can now be read from the registered buf. */
        fprintf(stdout, "Contents of server's buffer: '%s'\n", res.buf);

        /* Now we replace what's in the server's buffer */
        strcpy(res.buf, RDMAMSGW);
        fprintf(stdout, "Now replacing it with: '%s'\n", res.buf);
        if(post_send(&res, IBV_WR_RDMA_WRITE))
        {
            fprintf(stderr, "failed to post SR 3\n");
            rc = 1;
            goto main_exit;
        }
        if(poll_completion(&res))
        {
            fprintf(stderr, "poll completion failed 3\n");
            rc = 1;
            goto main_exit;
        }
    }

    /* Sync so server will know that client is done mucking with its memory */
    if(sock_sync_data(res.sock, 1, "W", &temp_char))  /* just send a dummy char back and forth */
    {
        fprintf(stderr, "sync error after RDMA ops\n");
        rc = 1;
        goto main_exit;
    }
    if(!config.server_name)
    {
        fprintf(stdout, "Contents of server buffer: '%s'\n", res.buf);
    }
    rc = 0;

main_exit:
    if(resources_destroy(&res))
    {
        fprintf(stderr, "failed to destroy resources\n");
        rc = 1;
    }
    if(config.dev_name)
    {
        free((char *) config.dev_name);
    }
    fprintf(stdout, "\ntest result is %d\n", rc);
    return rc;
}

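Putting it all together, the overall flow looks roughly like this:

Source: "infiniband概念空间分析" (an analysis of the InfiniBand concept space), Zhihu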

dev = find_a_matched_device(ibv_get_device_list());
ctx = ibv_open_device(dev);
pd = ibv_alloc_pd(ctx);
send_cq = ibv_create_cq(ctx, ...);
recv_cq = ibv_create_cq(ctx, ...);
qp = ibv_create_qp(pd, qp_attr(send_cq, recv_cq, ...));
ibv_modify_qp(qp, attr, attr_mask);

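After that, ibv_post_send() and ibv_post_recv() can be used to send and receive verbs.

What each step does:

First we need a device that provides QPs, called dev here. The candidates are obtained with ibv_get_device_list(),

and the one we need is then picked by device name: dev = find_a_matched_device

Open a context on the device: ctx = ibv_open_device(dev)

Allocate a PD on the context: pd = ibv_alloc_pd(ctx)

Create the CQs on the context, then create the QP from the PD and the CQs: qp = ibv_create_qp(pd, qp_attr(send_cq, recv_cq, ...));

At that point ibv_post_send() and ibv_post_recv() can be used to send and receive verbs.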

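A few new concepts appear here:

ctx: this is roughly equivalent to a client.

pd: protection domain, used to isolate groups of QPs and MRs from one another. (If there is a real communication layer underneath, the relevant port, address (an IB address is called a LID) and so on can be handed to the user-space driver as parameters, through concepts such as the ah (address handle) and the channel, when the cq is created, and the driver responsible for sending deals with them.)

ibv_modify_qp is mainly used to set the communication details. ibv_create_qp only allocates resources; it takes this modify call to bring the hardware into a working state.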

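ibv_post_send and ibv_post_recv are used to post verbs. A verb is carried in a data structure called ibv_send_wr or ibv_recv_wr, which holds the verb type and the details of the MR involved. The verb types include: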

enum ibv_wr_opcode { 
        IBV_WR_RDMA_WRITE,  
        IBV_WR_RDMA_WRITE_WITH_IMM, 
        IBV_WR_SEND,
        IBV_WR_SEND_WITH_IMM,
        IBV_WR_RDMA_READ,
        IBV_WR_ATOMIC_CMP_AND_SWP,
        IBV_WR_ATOMIC_FETCH_AND_ADD, 
        IBV_WR_LOCAL_INV, 
        IBV_WR_BIND_MW,
        IBV_WR_SEND_WITH_INV,
        IBV_WR_TSO,
};

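ULP-layer messages can all be wrapped with IBV_WR_SEND_XXXX; the rest are RDMA operations and atomic operations that can be handled within the IB layer itself.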
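The split between SEND-style and RDMA/atomic opcodes is what post_send() in the listings relies on when it fills in remote_addr and rkey only for opcodes other than IBV_WR_SEND. A minimal sketch of that classification follows. The enum `wr_opcode_model` is a local stand-in for a few ibv_wr_opcode values so that the snippet compiles without libibverbs, and both helper names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for a few values of enum ibv_wr_opcode; the numeric
 * values here are arbitrary and only the names matter. */
enum wr_opcode_model {
    WRM_RDMA_WRITE,
    WRM_RDMA_WRITE_WITH_IMM,
    WRM_SEND,
    WRM_SEND_WITH_IMM,
    WRM_RDMA_READ,
    WRM_ATOMIC_CMP_AND_SWP,
    WRM_ATOMIC_FETCH_AND_ADD
};

/* Operations that consume a receive request posted by the peer. */
static bool wrm_consumes_peer_recv(enum wr_opcode_model op)
{
    return op == WRM_SEND || op == WRM_SEND_WITH_IMM ||
           op == WRM_RDMA_WRITE_WITH_IMM;
}

/* RDMA and atomic operations address peer memory directly, so the work
 * request has to carry the peer's virtual address and rkey. */
static bool wrm_needs_remote_key(enum wr_opcode_model op)
{
    return op != WRM_SEND && op != WRM_SEND_WITH_IMM;
}
```

This is why only the client in the example posts a receive request: the server's SEND needs one, whereas the later RDMA read and write complete without any receive request on the server.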

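【A closer look at the MR namespace】

An MR is created on a PD (which in turn belongs to a ctx), and is shared and updated through verbs. First, here is how an MR is created: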

mr = ibv_reg_mr(pd, addr, length, access);

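Essentially you hand over a virtual address and IB creates a handle for you. The resulting MR contains an lkey and an rkey: the former is used for local access, the latter for remote access.

MRs also come with a few extended concepts:

The first is the fmr (fast MR), used as follows: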

fmr = ib_alloc_fmr(pd, flags, attr);
ib_map_phys_fmr(fmr, page_list, list_len, iova);

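This concept exists only in the kernel. It is essentially a mem_pool flavour of MR, and in my view it does not affect the overall concept space.

The second concept is the sge (scatter/gather element). It is used to designate non-contiguous memory blocks so that they can be regrouped on top of an MR (arguably an advantage: virtual addresses that are not contiguous in the process can all appear contiguous to the device).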
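As an illustration of the gather half of that idea, the sketch below copies several separate buffers into one contiguous message, which is the effect a send with more than one scatter/gather element has on the wire. It is a software model only: `struct sge_model` and `sge_model_gather` are hypothetical names, and the real struct ibv_sge additionally carries the lkey of the MR that covers each range.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for struct ibv_sge, reduced to an address and a length. */
struct sge_model {
    const void *addr;
    size_t length;
};

/* Copy every element of the list, in order, into one contiguous buffer.
 * Returns the total number of bytes gathered, or 0 if the elements do
 * not fit in out_cap bytes (an empty list also yields 0). */
static size_t sge_model_gather(const struct sge_model *list, int num_sge,
                               void *out, size_t out_cap)
{
    size_t total = 0;
    assert(list != NULL && out != NULL);
    for (int i = 0; i < num_sge; i++) {
        if (list[i].length > out_cap - total)
            return 0;
        memcpy((char *)out + total, list[i].addr, list[i].length);
        total += list[i].length;
    }
    return total;
}
```

The listings in this article always use a single element (num_sge = 1) pointing at res->buf, so the gather step there is trivial.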

The third concept is the mw (memory window), used as follows:

mw = ibv_alloc_mw(pd, type);
ibv_bind_mw(qp, mw, mw_bind);

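Here mw_bind specifies the read/write attributes of a region inside an MR; in effect it carves a range out of the MR and assigns permissions to it. I did not find any user of this feature in the current rdma-core code.

Call one side of the communication A and the other B. If A has a piece of memory that it wants B to update, A can send a verb that passes its rkey and address to B. B first makes the change in its own memory, registers that memory as an MR, and then posts a verb such as IBV_WR_RDMA_WRITE that names A's rkey and address, which updates A's MR remotely.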
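In the listings the "A tells B its rkey and address" step is done over the TCP socket instead of a verb: connect_qp() converts the fields to network byte order, exchanges them with sock_sync_data(), and converts the peer's copy back. The sketch below isolates that encode/decode step. `struct con_data` mirrors the layout of cm_con_data_t, while `hton64`, `ntoh64`, `con_data_pack` and `con_data_unpack` are names chosen for this illustration. The 64-bit conversion is written with shifts, so it does not depend on the htonll helper from the listing.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Same field layout as cm_con_data_t in the listing (34 bytes, packed). */
struct con_data {
    uint64_t addr;
    uint32_t rkey;
    uint32_t qp_num;
    uint16_t lid;
    uint8_t gid[16];
} __attribute__((packed));

/* Host order to big-endian for 64-bit values, independent of host endianness. */
static uint64_t hton64(uint64_t host)
{
    uint8_t bytes[8];
    uint64_t wire;
    for (int i = 0; i < 8; i++)
        bytes[i] = (uint8_t)(host >> (56 - 8 * i));
    memcpy(&wire, bytes, sizeof wire);
    return wire;
}

/* Big-endian back to host order. */
static uint64_t ntoh64(uint64_t wire)
{
    uint8_t bytes[8];
    uint64_t host = 0;
    memcpy(bytes, &wire, sizeof bytes);
    for (int i = 0; i < 8; i++)
        host = (host << 8) | bytes[i];
    return host;
}

/* Build the record that one side would write to the TCP socket. */
static struct con_data con_data_pack(const void *buf, uint32_t rkey,
                                     uint32_t qp_num, uint16_t lid,
                                     const uint8_t gid[16])
{
    struct con_data wire;
    assert(gid != NULL);
    wire.addr = hton64((uint64_t)(uintptr_t)buf);
    wire.rkey = htonl(rkey);
    wire.qp_num = htonl(qp_num);
    wire.lid = htons(lid);
    memcpy(wire.gid, gid, sizeof wire.gid);
    return wire;
}

/* Convert a record read from the socket back to host byte order. */
static struct con_data con_data_unpack(const struct con_data *wire)
{
    struct con_data host;
    assert(wire != NULL);
    host.addr = ntoh64(wire->addr);
    host.rkey = ntohl(wire->rkey);
    host.qp_num = ntohl(wire->qp_num);
    host.lid = ntohs(wire->lid);
    memcpy(host.gid, wire->gid, sizeof host.gid);
    return host;
}
```

The unpacked addr and rkey are exactly what post_send() later copies into sr.wr.rdma.remote_addr and sr.wr.rdma.rkey for the RDMA read and write.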

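Code 2: adding a choice between UC and RC and a ToS setting

 The example above, extended with an option to choose between UC and RC and an option to set the ToS.

Server: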

./RDMA_RC_example  -q rc  -t 33 -d mlx5_0 -i 1 -g 3

Client:

./RDMA_RC_example 192.169.31.53 -q rc  -t 33 -d mlx5_0 -i 1 -g 3
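The two new options in these command lines, -q and -t, need a little validation before their values reach config.qp_type and config.traffic_class. The sketch below shows one way to do that, assuming that -q accepts the strings "rc" and "uc" and that -t is a numeric value that must fit in one byte. `qp_kind`, `parse_qp_kind` and `parse_traffic_class` are hypothetical helpers, not functions from the listing that follows.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type; the listing stores IBV_QPT_RC / IBV_QPT_UC
 * in config.qp_type instead. */
enum qp_kind { QP_KIND_INVALID = -1, QP_KIND_RC, QP_KIND_UC };

/* Map the -q argument to a QP kind (assumed spellings: "rc" and "uc"). */
static enum qp_kind parse_qp_kind(const char *arg)
{
    if (arg == NULL)
        return QP_KIND_INVALID;
    if (strcmp(arg, "rc") == 0)
        return QP_KIND_RC;
    if (strcmp(arg, "uc") == 0)
        return QP_KIND_UC;
    return QP_KIND_INVALID;
}

/* Parse the -t argument as a one-byte value (0..255).
 * Returns 0 and stores the value in *out, or -1 on malformed input. */
static int parse_traffic_class(const char *arg, int *out)
{
    char *end = NULL;
    long value;
    assert(out != NULL);
    if (arg == NULL || *arg == '\0')
        return -1;
    errno = 0;
    value = strtol(arg, &end, 0);
    if (errno != 0 || *end != '\0' || value < 0 || value > 255)
        return -1;
    *out = (int)value;
    return 0;
}
```

Unlike assigning the result of strtoul() to an int and then testing for a negative value, checking the end pointer and the range rejects inputs such as "4x" or "256" explicitly.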

/*
* BUILD COMMAND:
* gcc -Wall -O0 -g -o RDMA_RC_example RDMA_RC_example.c -libverbs
*server:
*./RDMA_RC_example  -q rc  -t 33 -d mlx5_0 -i 1 -g 3
*client:
*./RDMA_RC_example 192.169.31.53 -q rc  -t 33 -d mlx5_0 -i 1 -g 3
*/
/******************************************************************************
*
* RDMA Aware Networks Programming Example
*
* This code demonstrates how to perform the following operations using
* the VPI Verbs API:
* Send
* Receive
* RDMA Read
* RDMA Write
*
*****************************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdint.h>
#include <inttypes.h>
#include <endian.h>
#include <byteswap.h>
#include <getopt.h>
#include <sys/time.h>
#include <arpa/inet.h>
#include <infiniband/verbs.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* poll CQ timeout in millisec (2 seconds) */
#define MAX_POLL_CQ_TIMEOUT 2000
#define MSG "msg:AAAAAAABBBBBBCCCCCCDDDDDEEEEEE!"
#define RDMAMSGR "RDMA read operation "
#define RDMAMSGW "RDMA write operation"
#define RDMAMSGS "send:AAAAAAABBBBBBCCCCCCDDDDDEEEEEE!"
#define MSG_SIZE 64
#if __BYTE_ORDER == __LITTLE_ENDIAN
static inline uint64_t htonll(uint64_t x)
{
    return bswap_64(x);
}
static inline uint64_t ntohll(uint64_t x)
{
    return bswap_64(x);
}
#elif __BYTE_ORDER == __BIG_ENDIAN
static inline uint64_t htonll(uint64_t x)
{
    return x;
}
static inline uint64_t ntohll(uint64_t x)
{
    return x;
}
#else
#error __BYTE_ORDER is neither __LITTLE_ENDIAN nor __BIG_ENDIAN
#endif

/* structure of test parameters */
struct config_t
{
    const char *dev_name; /* IB device name */
    char *server_name;    /* server host name */
    uint32_t tcp_port;    /* server TCP port */
    int ib_port;          /* local IB port to work with */
    int gid_idx;          /* gid index to use */
    int traffic_class;
    int qp_type;  /*qp_type*/
};

/* structure to exchange data which is needed to connect the QPs */
struct cm_con_data_t
{
    uint64_t addr;        /* Buffer address */
    uint32_t rkey;        /* Remote key */
    uint32_t qp_num;      /* QP number */
    uint16_t lid;         /* LID of the IB port */
    uint8_t gid[16];      /* gid */
} __attribute__((packed));

/* structure of system resources */
struct resources
{
    struct ibv_device_attr device_attr; /* Device attributes */
    struct ibv_port_attr port_attr;     /* IB port attributes */
    struct cm_con_data_t remote_props;  /* values to connect to remote side */
    struct ibv_context *ib_ctx;         /* device handle */
    struct ibv_pd *pd;                  /* PD handle */
    struct ibv_cq *cq;                  /* CQ handle */
    struct ibv_qp *qp;                  /* QP handle */
    struct ibv_mr *mr;                  /* MR handle for buf */
    char *buf;                          /* memory buffer pointer, used for RDMA and send ops */
    int sock;                           /* TCP socket file descriptor */
};

struct config_t config =
{
    NULL,  /* dev_name */
    NULL,  /* server_name */
    19875, /* tcp_port */
    1,     /* ib_port */
    //-1     /* gid_idx: the original source initialises this to -1, which is probably a bug, because the if (config.gid_idx >= 0) check in connect_qp() is then never satisfied */
    2,       /* gid_idx */
    4,       /*traffic_class*/
    IBV_QPT_RC,       /*qp_type*/
};


/******************************************************************************
Socket operations:
For simplicity, the example program uses TCP sockets to exchange control
information. If a TCP/IP stack/connection is not available, connection manager
(CM) may be used to pass this information. Use of CM is beyond the scope of
this example
******************************************************************************/
/******************************************************************************
* Function: sock_connect
* Input:
* servername: URL of server to connect to (NULL for server mode)
* port: port of service
*
* Output:none
*
* Returns: socket (fd) on success, negative error code on failure
*
* Description:
* Connect a socket. If servername is specified a client connection will be
* initiated to the indicated server and port. Otherwise listen on the
* indicated port for an incoming connection.
*
******************************************************************************/
static int sock_connect(const char *servername, int port)
{
    struct addrinfo *resolved_addr = NULL;
    struct addrinfo *iterator;
    char service[6];
    int sockfd = -1;
    int listenfd = 0;
    int tmp;
    struct addrinfo hints =
    {
        .ai_flags    = AI_PASSIVE,
        .ai_family   = AF_INET,
        .ai_socktype = SOCK_STREAM
    };

    if(sprintf(service, "%d", port) < 0)
    {
        goto sock_connect_exit;
    }

    /* Resolve DNS address, use sockfd as temp storage */
    sockfd = getaddrinfo(servername, service, &hints, &resolved_addr);
    if(sockfd < 0)
    {
        fprintf(stderr, "%s for %s:%d\n", gai_strerror(sockfd), servername, port);
        goto sock_connect_exit;
    }

    /* Search through results and find the one we want */
    for(iterator = resolved_addr; iterator ; iterator = iterator->ai_next)
    {
        sockfd = socket(iterator->ai_family, iterator->ai_socktype, iterator->ai_protocol);
        if(sockfd >= 0)
        {
            if(servername)
            {
                /* Client mode. Initiate connection to remote */
                if((tmp = connect(sockfd, iterator->ai_addr, iterator->ai_addrlen)))
                {
                    fprintf(stdout, "failed connect \n");
                    close(sockfd);
                    sockfd = -1;
                }
            }
            else
            {
                /* Server mode. Set up listening socket and accept a connection */
                listenfd = sockfd;
                sockfd = -1;
                if(bind(listenfd, iterator->ai_addr, iterator->ai_addrlen))
                {
                    goto sock_connect_exit;
                }
                listen(listenfd, 1);
                sockfd = accept(listenfd, NULL, 0);
            }
        }
    }

sock_connect_exit:
    if(listenfd)
    {
        close(listenfd);
    }

    if(resolved_addr)
    {
        freeaddrinfo(resolved_addr);
    }

    if(sockfd < 0)
    {
        if(servername)
        {
            fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port);
        }
        else
        {
            perror("server accept");
            fprintf(stderr, "accept() failed\n");
        }
    }

    return sockfd;
}

/******************************************************************************
* Function: sock_sync_data
* Input:
* sock: socket to transfer data on
* xfer_size: size of data to transfer
* local_data: pointer to data to be sent to remote
*
* Output: remote_data pointer to buffer to receive remote data
*
* Returns: 0 on success, negative error code on failure
*
* Description:
* Sync data across a socket. The indicated local data will be sent to the
* remote. It will then wait for the remote to send its data back. It is
* assumed that the two sides are in sync and call this function in the proper
* order. Chaos will ensue if they are not. :)
*
* Also note this is a blocking function and will wait for the full data to be
* received from the remote.
*
******************************************************************************/
int sock_sync_data(int sock, int xfer_size, char *local_data, char *remote_data)
{
    int rc;
    int read_bytes = 0;
    int total_read_bytes = 0;
    rc = write(sock, local_data, xfer_size);

    if(rc < xfer_size)
    {
        fprintf(stderr, "Failed writing data during sock_sync_data\n");
    }
    else
    {
        rc = 0;
    }

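    while(!rc && total_read_bytes < xfer_size)
    {
        /* continue where the previous partial read stopped */
        read_bytes = read(sock, remote_data + total_read_bytes, xfer_size - total_read_bytes);
        if(read_bytes > 0)
        {
            total_read_bytes += read_bytes;
        }
        else
        {
            /* 0 means the peer closed the socket; report it as an error instead of looping forever */
            rc = (read_bytes < 0) ? read_bytes : -1;
        }
    }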
    return rc;
}
/******************************************************************************
End of socket operations
******************************************************************************/

/* poll_completion */
/******************************************************************************
* Function: poll_completion
*
* Input:
* res: pointer to resources structure
*
* Output: none
*
* Returns: 0 on success, 1 on failure
*
* Description:
* Poll the completion queue for a single event. This function will continue to
* poll the queue until MAX_POLL_CQ_TIMEOUT milliseconds have passed.
*
******************************************************************************/
static int poll_completion(struct resources *res)
{
    struct ibv_wc wc;
    unsigned long start_time_msec;
    unsigned long cur_time_msec;
    struct timeval cur_time;
    int poll_result;
    int rc = 0;
    /* poll the completion for a while before giving up of doing it .. */
    gettimeofday(&cur_time, NULL);
    start_time_msec = (cur_time.tv_sec * 1000) + (cur_time.tv_usec / 1000);
    do
    {
        poll_result = ibv_poll_cq(res->cq, 1, &wc);
        gettimeofday(&cur_time, NULL);
        cur_time_msec = (cur_time.tv_sec * 1000) + (cur_time.tv_usec / 1000);
    }
    while((poll_result == 0) && ((cur_time_msec - start_time_msec) < MAX_POLL_CQ_TIMEOUT));

    if(poll_result < 0)
    {
        /* poll CQ failed */
        fprintf(stderr, "poll CQ failed\n");
        rc = 1;
    }
    else if(poll_result == 0)
    {
        /* the CQ is empty */
        fprintf(stderr, "completion wasn't found in the CQ after timeout\n");
        rc = 1;
    }
    else
    {
        /* CQE found */
        fprintf(stdout, "completion was found in CQ with status 0x%x\n", wc.status);
        /* check the completion status (here we don't care about the completion opcode) */
        if(wc.status != IBV_WC_SUCCESS)
        {
            fprintf(stderr, "got bad completion with status: 0x%x, vendor syndrome: 0x%x\n",
                    wc.status, wc.vendor_err);
            rc = 1;
        }
    }
    return rc;
}

/******************************************************************************
* Function: post_send
*
* Input:
* res: pointer to resources structure
* opcode: IBV_WR_SEND, IBV_WR_RDMA_READ or IBV_WR_RDMA_WRITE
*
* Output: none
*
* Returns: 0 on success, error code on failure
*
* Description: This function will create and post a send work request
******************************************************************************/
static int post_send(struct resources *res, int opcode)
{
    struct ibv_send_wr sr;
    struct ibv_sge sge;
    struct ibv_send_wr *bad_wr = NULL;
    int rc;

    /* prepare the scatter/gather entry */
    memset(&sge, 0, sizeof(sge));
    sge.addr = (uintptr_t)res->buf;
    sge.length = MSG_SIZE;
    sge.lkey = res->mr->lkey;

    /* prepare the send work request */
    memset(&sr, 0, sizeof(sr));
    sr.next = NULL;
    sr.wr_id = 0;
    sr.sg_list = &sge;
    sr.num_sge = 1;
    sr.opcode = opcode;
    sr.send_flags = IBV_SEND_SIGNALED;
    if(opcode != IBV_WR_SEND)
    {
        sr.wr.rdma.remote_addr = res->remote_props.addr;
        sr.wr.rdma.rkey = res->remote_props.rkey;
    }

    /* there is a Receive Request in the responder side, so we won't get into any RNR flow */
    rc = ibv_post_send(res->qp, &sr, &bad_wr);
    if(rc)
    {
        fprintf(stderr, "failed to post SR\n");
    }
    else
    {
        switch(opcode)
        {
        case IBV_WR_SEND:
            fprintf(stdout, "Send Request was posted\n");
            break;
        case IBV_WR_RDMA_READ:
            fprintf(stdout, "RDMA Read Request was posted\n");
            break;
        case IBV_WR_RDMA_WRITE:
            fprintf(stdout, "RDMA Write Request was posted\n");
            break;
        default:
            fprintf(stdout, "Unknown Request was posted\n");
            break;
        }
    }
    return rc;
}

/******************************************************************************
* Function: post_receive
*
* Input:
* res: pointer to resources structure
*
* Output: none
*
* Returns: 0 on success, error code on failure
*
* Description: post RR to be prepared for incoming messages
*
******************************************************************************/
static int post_receive(struct resources *res)
{
    struct ibv_recv_wr rr;
    struct ibv_sge sge;
    struct ibv_recv_wr *bad_wr;
    int rc;

    /* prepare the scatter/gather entry */
    memset(&sge, 0, sizeof(sge));
    sge.addr = (uintptr_t)res->buf;
    sge.length = MSG_SIZE;
    sge.lkey = res->mr->lkey;

    /* prepare the receive work request */
    memset(&rr, 0, sizeof(rr));
    rr.next = NULL;
    rr.wr_id = 0;
    rr.sg_list = &sge;
    rr.num_sge = 1;

    /* post the Receive Request to the RQ */
    rc = ibv_post_recv(res->qp, &rr, &bad_wr);
    if(rc)
    {
        fprintf(stderr, "failed to post RR\n");
    }
    else
    {
        fprintf(stdout, "Receive Request was posted\n");
    }
    return rc;
}

/******************************************************************************
* Function: resources_init
*
* Input:
* res: pointer to resources structure
*
* Output: res is initialized
*
* Returns: none
*
* Description: res is initialized to default values
******************************************************************************/
static void resources_init(struct resources *res)
{
    memset(res, 0, sizeof * res);
    res->sock = -1;
}

/******************************************************************************
* Function: resources_create
*
* Input: res pointer to resources structure to be filled in
*
* Output: res filled in with resources
*
* Returns: 0 on success, 1 on failure
*
* Description:
* This function creates and allocates all necessary system resources. These
* are stored in res.
*****************************************************************************/
static int resources_create(struct resources *res)
{
    struct ibv_device **dev_list = NULL;
    struct ibv_qp_init_attr qp_init_attr;
    struct ibv_device *ib_dev = NULL;
    size_t size;
    int i;
    int mr_flags = 0;
    int cq_size = 0;
    int num_devices;
    int rc = 0;

    /* if client side */
    if(config.server_name)
    {
        res->sock = sock_connect(config.server_name, config.tcp_port);
        if(res->sock < 0)
        {
            fprintf(stderr, "failed to establish TCP connection to server %s, port %d\n",
                    config.server_name, config.tcp_port);
            rc = -1;
            goto resources_create_exit;
        }
    }
    else
    {
        fprintf(stdout, "waiting on port %d for TCP connection\n", config.tcp_port);
        res->sock = sock_connect(NULL, config.tcp_port);
        if(res->sock < 0)
        {
            fprintf(stderr, "failed to establish TCP connection with client on port %d\n",
                    config.tcp_port);
            rc = -1;
            goto resources_create_exit;
        }
    }
    fprintf(stdout, "TCP connection was established\n");
    fprintf(stdout, "searching for IB devices in host\n");

    /* get device names in the system */
    dev_list = ibv_get_device_list(&num_devices);
    if(!dev_list)
    {
        fprintf(stderr, "failed to get IB devices list\n");
        rc = 1;
        goto resources_create_exit;
    }

    /* if there isn't any IB device in host */
    if(!num_devices)
    {
        fprintf(stderr, "found %d device(s)\n", num_devices);
        rc = 1;
        goto resources_create_exit;
    }
    fprintf(stdout, "found %d device(s)\n", num_devices);

    /* search for the specific device we want to work with */
    for(i = 0; i < num_devices; i ++)
    {
        if(!config.dev_name)
        {
            config.dev_name = strdup(ibv_get_device_name(dev_list[i]));
            fprintf(stdout, "device not specified, using first one found: %s\n", config.dev_name);
        }
        /* find the specific device */
        if(!strcmp(ibv_get_device_name(dev_list[i]), config.dev_name))
        {
            ib_dev = dev_list[i];
            break;
        }
    }

    /* if the device wasn't found in host */
    if(!ib_dev)
    {
        fprintf(stderr, "IB device %s wasn't found\n", config.dev_name);
        rc = 1;
        goto resources_create_exit;
    }

    /* get device handle */
    res->ib_ctx = ibv_open_device(ib_dev);
    if(!res->ib_ctx)
    {
        fprintf(stderr, "failed to open device %s\n", config.dev_name);
        rc = 1;
        goto resources_create_exit;
    }

    /* We are now done with device list, free it */
    ibv_free_device_list(dev_list);
    dev_list = NULL;
    ib_dev = NULL;

    /* query port properties */
    if(ibv_query_port(res->ib_ctx, config.ib_port, &res->port_attr))
    {
        fprintf(stderr, "ibv_query_port on port %u failed\n", config.ib_port);
        rc = 1;
        goto resources_create_exit;
    }

    /* allocate Protection Domain */
    res->pd = ibv_alloc_pd(res->ib_ctx);
    if(!res->pd)
    {
        fprintf(stderr, "ibv_alloc_pd failed\n");
        rc = 1;
        goto resources_create_exit;
    }

    /* each side will send only one WR, so Completion Queue with 1 entry is enough */
    cq_size = 1;
    res->cq = ibv_create_cq(res->ib_ctx, cq_size, NULL, NULL, 0);
    if(!res->cq)
    {
        fprintf(stderr, "failed to create CQ with %u entries\n", cq_size);
        rc = 1;
        goto resources_create_exit;
    }

    /* allocate the memory buffer that will hold the data */
    size = MSG_SIZE;
    res->buf = (char *) malloc(size);
    if(!res->buf)
    {
        fprintf(stderr, "failed to malloc %zu bytes to memory buffer\n", size);
        rc = 1;
        goto resources_create_exit;
    }
    memset(res->buf, 0, size);

    /* only in the server side put the message in the memory buffer */
    if(!config.server_name)
    {
        strcpy(res->buf, MSG);
        fprintf(stdout, "going to send the message: '%s'\n", res->buf);
    }
    else
    {
        memset(res->buf, 0, size);
    }

    /* register the memory buffer */
    mr_flags = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE ;
    res->mr = ibv_reg_mr(res->pd, res->buf, size, mr_flags);
    if(!res->mr)
    {
        fprintf(stderr, "ibv_reg_mr failed with mr_flags=0x%x\n", mr_flags);
        rc = 1;
        goto resources_create_exit;
    }
    fprintf(stdout, "MR was registered with addr=%p, lkey=0x%x, rkey=0x%x, flags=0x%x\n",
            res->buf, res->mr->lkey, res->mr->rkey, mr_flags);

    /* create the Queue Pair */
    memset(&qp_init_attr, 0, sizeof(qp_init_attr));
    qp_init_attr.qp_type = config.qp_type;
    qp_init_attr.sq_sig_all = 1;
    qp_init_attr.send_cq = res->cq;
    qp_init_attr.recv_cq = res->cq;
    qp_init_attr.cap.max_send_wr = 1;
    qp_init_attr.cap.max_recv_wr = 1;
    qp_init_attr.cap.max_send_sge = 1;
    qp_init_attr.cap.max_recv_sge = 1;
    res->qp = ibv_create_qp(res->pd, &qp_init_attr);
    if(!res->qp)
    {
        fprintf(stderr, "failed to create QP\n");
        rc = 1;
        goto resources_create_exit;
    }
    fprintf(stdout, "QP was created, QP number=0x%x\n", res->qp->qp_num);

resources_create_exit:
    if(rc)
    {
        /* Error encountered, cleanup */
        if(res->qp)
        {
            ibv_destroy_qp(res->qp);
            res->qp = NULL;
        }
        if(res->mr)
        {
            ibv_dereg_mr(res->mr);
            res->mr = NULL;
        }
        if(res->buf)
        {
            free(res->buf);
            res->buf = NULL;
        }
        if(res->cq)
        {
            ibv_destroy_cq(res->cq);
            res->cq = NULL;
        }
        if(res->pd)
        {
            ibv_dealloc_pd(res->pd);
            res->pd = NULL;
        }
        if(res->ib_ctx)
        {
            ibv_close_device(res->ib_ctx);
            res->ib_ctx = NULL;
        }
        if(dev_list)
        {
            ibv_free_device_list(dev_list);
            dev_list = NULL;
        }
        if(res->sock >= 0)
        {
            if(close(res->sock))
            {
                fprintf(stderr, "failed to close socket\n");
            }
            res->sock = -1;
        }
    }
    return rc;
}


/******************************************************************************
* Function: modify_qp_to_init
*
* Input:
* qp: QP to transition
*
* Output: none
*
* Returns: 0 on success, ibv_modify_qp failure code on failure
*
* Description: Transition a QP from the RESET to INIT state
******************************************************************************/
static int modify_qp_to_init(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    int flags;
    int rc;
    memset(&attr, 0, sizeof(attr));
    attr.qp_state = IBV_QPS_INIT;
    attr.port_num = config.ib_port;
    attr.pkey_index = 0;
    attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE;

    /* RC and UC use the same attribute mask for RESET -> INIT; setting it
     * unconditionally also avoids leaving flags uninitialized */
    flags = IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS;

    rc = ibv_modify_qp(qp, &attr, flags);
    if(rc)
    {
        fprintf(stderr, "failed to modify QP state to INIT\n");
    }
    return rc;
}

/******************************************************************************
* Function: modify_qp_to_rtr
*
* Input:
* qp: QP to transition
* remote_qpn: remote QP number
* dlid: destination LID
* dgid: destination GID (mandatory for RoCEE)
*
* Output: none
*
* Returns: 0 on success, ibv_modify_qp failure code on failure
*
* Description:
* Transition a QP from the INIT to RTR state, using the specified QP number
******************************************************************************/
static int modify_qp_to_rtr(struct ibv_qp *qp, uint32_t remote_qpn, uint16_t dlid, uint8_t *dgid)
{
    struct ibv_qp_attr attr;
    int flags;
    int rc;
    memset(&attr, 0, sizeof(attr));
    attr.qp_state = IBV_QPS_RTR;
    attr.path_mtu = IBV_MTU_1024;
    attr.dest_qp_num = remote_qpn;
    attr.rq_psn = 0;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer = 0x12;
    attr.ah_attr.is_global = 0;
    attr.ah_attr.dlid = dlid;
    attr.ah_attr.sl = 0;
    attr.ah_attr.src_path_bits = 0;
    attr.ah_attr.port_num = config.ib_port;
    if(config.gid_idx >= 0)
    {
        /* fill the GRH only when a GID index was given (mandatory for RoCE) */
        attr.ah_attr.is_global = 1;
        memcpy(&attr.ah_attr.grh.dgid, dgid, 16);
        attr.ah_attr.grh.flow_label = 0;
        attr.ah_attr.grh.hop_limit = 1; /* stays within one subnet; raise it for routed fabrics */
        attr.ah_attr.grh.sgid_index = config.gid_idx;
        attr.ah_attr.grh.traffic_class = config.traffic_class;
    }
    fprintf(stdout, "gid_idx = %d, traffic_class = %d\n", config.gid_idx, config.traffic_class);
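    switch(config.qp_type)
    {
    case IBV_QPT_RC:
        flags = IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
                IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER;
        break;
    case IBV_QPT_UC:
        /* UC has no RDMA read/atomic responder and no RNR NAK, so those attributes are omitted */
        flags = IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
                IBV_QP_RQ_PSN;
        break;
    default:
        fprintf(stderr, "unsupported QP type %d\n", (int)config.qp_type);
        return 1;
    }
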
    rc = ibv_modify_qp(qp, &attr, flags);
    if(rc)
    {
        fprintf(stderr, "failed to modify QP state to RTR,rc = %d,%s\n", rc, __FUNCTION__);
    }
    return rc;
}

/******************************************************************************
* Function: modify_qp_to_rts
*
* Input:
* qp: QP to transition
*
* Output: none
*
* Returns: 0 on success, ibv_modify_qp failure code on failure
*
* Description: Transition a QP from the RTR to RTS state
******************************************************************************/
static int modify_qp_to_rts(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    int flags;
    int rc;
    memset(&attr, 0, sizeof(attr));
    attr.qp_state = IBV_QPS_RTS;
    attr.timeout = 0x12;
    attr.retry_cnt = 6;
    attr.rnr_retry = 0;
    attr.sq_psn = 0;
    attr.max_rd_atomic = 1;

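    switch(config.qp_type)
    {
    case IBV_QPT_RC:
        flags = IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC;
        break;
    case IBV_QPT_UC:
        /* UC does not retransmit, so only the state and the send PSN are set */
        flags = IBV_QP_STATE | IBV_QP_SQ_PSN;
        break;
    default:
        fprintf(stderr, "unsupported QP type %d\n", (int)config.qp_type);
        return 1;
    }
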
    rc = ibv_modify_qp(qp, &attr, flags);
    if(rc)
    {
        fprintf(stderr, "failed to modify QP state to RTS\n");
    }
    return rc;
}

/******************************************************************************
* Function: connect_qp
*
* Input:
* res: pointer to resources structure
*
* Output: none
*
* Returns: 0 on success, error code on failure
*
* Description:
* Connect the QP. Exchange connection data over TCP, then transition the
* local QP through INIT, RTR and RTS (both sides end up in RTS)
******************************************************************************/
static int connect_qp(struct resources *res)
{
    struct cm_con_data_t local_con_data;
    struct cm_con_data_t remote_con_data;
    struct cm_con_data_t tmp_con_data;
    int rc = 0;
    char temp_char;
    union ibv_gid my_gid;
    if(config.gid_idx >= 0)
    {
        rc = ibv_query_gid(res->ib_ctx, config.ib_port, config.gid_idx, &my_gid);
        if(rc)
        {
            fprintf(stderr, "could not get gid for port %d, index %d\n", config.ib_port, config.gid_idx);
            return rc;
        }
    }
    else
    {
        memset(&my_gid, 0, sizeof my_gid);
    }

    /* exchange using TCP sockets info required to connect QPs */
    local_con_data.addr = htonll((uintptr_t)res->buf);
    local_con_data.rkey = htonl(res->mr->rkey);
    local_con_data.qp_num = htonl(res->qp->qp_num);
    local_con_data.lid = htons(res->port_attr.lid);
    memcpy(local_con_data.gid, &my_gid, 16);
    fprintf(stdout, "\nLocal LID = 0x%x\n", res->port_attr.lid);
    if(sock_sync_data(res->sock, sizeof(struct cm_con_data_t), (char *) &local_con_data, (char *) &tmp_con_data) < 0)
    {
        fprintf(stderr, "failed to exchange connection data between sides\n");
        rc = 1;
        goto connect_qp_exit;
    }

    remote_con_data.addr = ntohll(tmp_con_data.addr);
    remote_con_data.rkey = ntohl(tmp_con_data.rkey);
    remote_con_data.qp_num = ntohl(tmp_con_data.qp_num);
    remote_con_data.lid = ntohs(tmp_con_data.lid);
    memcpy(remote_con_data.gid, tmp_con_data.gid, 16);

    /* save the remote side attributes, we will need it for the post SR */
    res->remote_props = remote_con_data;
    fprintf(stdout, "Remote address = 0x%"PRIx64"\n", remote_con_data.addr);
    fprintf(stdout, "Remote rkey = 0x%x\n", remote_con_data.rkey);
    fprintf(stdout, "Remote QP number = 0x%x\n", remote_con_data.qp_num);
    fprintf(stdout, "Remote LID = 0x%x\n", remote_con_data.lid);
    if(config.gid_idx >= 0)
    {
        uint8_t *p = remote_con_data.gid;
        fprintf(stdout, "Remote GID = %02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x\n",
                p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7], p[8], p[9], p[10], p[11], p[12], p[13], p[14], p[15]);
    }

    /* modify the QP to init */
    rc = modify_qp_to_init(res->qp);
    if(rc)
    {
        fprintf(stderr, "change QP state to INIT failed\n");
        goto connect_qp_exit;
    }

    /* let the client post RR to be prepared for incoming messages */
    if(config.server_name)
    {
        rc = post_receive(res);
        if(rc)
        {
            fprintf(stderr, "failed to post RR\n");
            goto connect_qp_exit;
        }
    }

    /* modify the QP to RTR */
    rc = modify_qp_to_rtr(res->qp, remote_con_data.qp_num, remote_con_data.lid, remote_con_data.gid);
    if(rc)
    {
        fprintf(stderr, "failed to modify QP state to RTR,%s\n", __FUNCTION__);
        goto connect_qp_exit;
    }

    /* modify the QP to RTS */
    rc = modify_qp_to_rts(res->qp);
    if(rc)
    {
        fprintf(stderr, "failed to modify QP state to RTS\n");
        goto connect_qp_exit;
    }
    fprintf(stdout, "QP state was changed to RTS\n");

    /* sync to make sure that both sides are in states that they can connect to prevent packet loss */
    if(sock_sync_data(res->sock, 1, "Q", &temp_char))  /* just send a dummy char back and forth */
    {
        fprintf(stderr, "sync error after QPs were moved to RTS\n");
        rc = 1;
    }

connect_qp_exit:
    return rc;
}

/******************************************************************************
* Function: resources_destroy
*
* Input:
* res: pointer to resources structure
*
* Output: none
*
* Returns: 0 on success, 1 on failure
*
* Description: Cleanup and deallocate all resources used
******************************************************************************/
static int resources_destroy(struct resources *res)
{
    int rc = 0;
    if(res->qp)
    {
        if(ibv_destroy_qp(res->qp))
        {
            fprintf(stderr, "failed to destroy QP\n");
            rc = 1;
        }
    }

    if(res->mr)
    {
        if(ibv_dereg_mr(res->mr))
        {
            fprintf(stderr, "failed to deregister MR\n");
            rc = 1;
        }
    }

    if(res->buf)
    {
        free(res->buf);
    }

    if(res->cq)
    {
        if(ibv_destroy_cq(res->cq))
        {
            fprintf(stderr, "failed to destroy CQ\n");
            rc = 1;
        }
    }

    if(res->pd)
    {
        if(ibv_dealloc_pd(res->pd))
        {
            fprintf(stderr, "failed to deallocate PD\n");
            rc = 1;
        }
    }

    if(res->ib_ctx)
    {
        if(ibv_close_device(res->ib_ctx))
        {
            fprintf(stderr, "failed to close device context\n");
            rc = 1;
        }
    }

    if(res->sock >= 0)
    {
        if(close(res->sock))
        {
            fprintf(stderr, "failed to close socket\n");
            rc = 1;
        }
    }
    return rc;
}

/******************************************************************************
* Function: print_config
*
* Input: none
*
* Output: none
*
* Returns: none
*
* Description: Print out config information
******************************************************************************/
static void print_config(void)
{
    fprintf(stdout, " ------------------------------------------------\n");
    fprintf(stdout, " Device name : \"%s\"\n", config.dev_name);
    fprintf(stdout, " IB port : %u\n", config.ib_port);
    if(config.server_name)
    {
        fprintf(stdout, " IP : %s\n", config.server_name);
    }
    fprintf(stdout, " TCP port : %u\n", config.tcp_port);
    if(config.gid_idx >= 0)
    {
        fprintf(stdout, " GID index : %u\n", config.gid_idx);
    }
    fprintf(stdout, " ------------------------------------------------\n\n");
}

/******************************************************************************
* Function: usage
*
* Input:
* argv0: program name (argv[0])
*
* Output: none
*
* Returns: none
*
* Description: print a description of command line syntax
******************************************************************************/
static void usage(const char *argv0)
{
    fprintf(stdout, "Usage:\n");
    fprintf(stdout, " %s start a server and wait for connection\n", argv0);
    fprintf(stdout, " %s <host> connect to server at <host>\n", argv0);
    fprintf(stdout, "\n");
    fprintf(stdout, "Options:\n");
    fprintf(stdout, " -p, --port <port> listen on/connect to port <port> (default 18515)\n");
    fprintf(stdout, " -d, --ib-dev <dev> use IB device <dev> (default first device found)\n");
    fprintf(stdout, " -i, --ib-port <port> use port <port> of IB device (default 1)\n");
    fprintf(stdout, " -g, --gid-idx <gid index> gid index to be used in GRH (default not used)\n");
    fprintf(stdout, " -t, --traffic-class <0-255> traffic class (ToS) to be used in GRH\n");
    fprintf(stdout, " -q, --qp_type <rc|uc> QP transport type\n");
}

/******************************************************************************
* Function: main
*
* Input:
* argc: number of items in argv
* argv: command line parameters
*
* Output: none
*
* Returns: 0 on success, 1 on failure
*
* Description: Main program code
******************************************************************************/
char *qptype = NULL;
int main(int argc, char *argv[])
{
    struct resources res;
    int rc = 1;
    char temp_char;

    /* print the build date and time */
    printf("Version:%s %s\n", __DATE__, __TIME__);

    /* parse the command line parameters */
    while(1)
    {
        int c;
        /* Designated Initializer */
        static struct option long_options[] =
        {
            {.name = "port", .has_arg = 1, .val = 'p' },
            {.name = "ib-dev", .has_arg = 1, .val = 'd' },
            {.name = "ib-port", .has_arg = 1, .val = 'i' },
            {.name = "gid-idx", .has_arg = 1, .val = 'g' },
            {.name = "traffic-class", .has_arg = 1, .val = 't'},
            {.name = "qp_type", .has_arg = 1, .val = 'q'},
            {.name = NULL, .has_arg = 0, .val = '\0'}
        };

        c = getopt_long(argc, argv, "p:d:i:g:t:q:", long_options, NULL);
        if(c == -1)
        {
            break;
        }
        switch(c)
        {
        
        case 'q':
            /* optarg points into argv, so no copy (and no free) is needed */
            qptype = optarg;
            if(0 == strcmp(qptype, "rc"))
            {
                config.qp_type = IBV_QPT_RC;
            }
            else if(0 == strcmp(qptype, "uc"))
            {
                config.qp_type = IBV_QPT_UC;
            }
            else
            {
                fprintf(stderr, "-q usage: \"-q uc\" or \"-q rc\"\n");
                usage(argv[0]);
                return 1;
            }
            printf("QP type is %s\n", qptype);
            break;
        case 't':
            config.traffic_class = strtoul(optarg, NULL, 0);
            break;
        case 'p':
            config.tcp_port = strtoul(optarg, NULL, 0);
            break;
        case 'd':
            config.dev_name = strdup(optarg);
            break;
        case 'i':
            config.ib_port = strtoul(optarg, NULL, 0);
            if(config.ib_port < 0)
            {
                usage(argv[0]);
                return 1;
            }
            break;
        case 'g':
            config.gid_idx = strtoul(optarg, NULL, 0);
            if(config.gid_idx < 0)
            {
                usage(argv[0]);
                return 1;
            }
            break;
        default:
            usage(argv[0]);
            return 1;
        }
    }

    /* parse the last parameter (if exists) as the server name */
    /*
     * server_name is null means this node is a server,
     * otherwise this node is a client which need to connect to
     * the specific server
     */
    if(optind == argc - 1)
    {
        config.server_name = argv[optind];
    }
    else if(optind < argc)
    {
        usage(argv[0]);
        return 1;
    }

    /* print the used parameters for info*/
    print_config();
    /* init all of the resources, so cleanup will be easy */
    resources_init(&res);
    /* create resources before using them */
    if(resources_create(&res))
    {
        fprintf(stderr, "failed to create resources\n");
        goto main_exit;
    }
    /* connect the QPs */
    if(connect_qp(&res))
    {
        fprintf(stderr, "failed to connect QPs\n");
        goto main_exit;
    }
    /* let the server post the sr */
    if(!config.server_name)
    {
        if(post_send(&res, IBV_WR_SEND))
        {
            fprintf(stderr, "failed to post sr\n");
            goto main_exit;
        }
    }
    /* in both sides we expect to get a completion */
    if(poll_completion(&res))
    {
        fprintf(stderr, "poll completion failed\n");
        goto main_exit;
    }
    /* after polling the completion we have the message in the client buffer too */
    if(config.server_name)
    {
        fprintf(stdout, "Message is: '%s'\n", res.buf);
    }
    else
    {
        /* setup server buffer with read message */
        strcpy(res.buf, RDMAMSGR);
    }
    /* Sync so we are sure server side has data ready before client tries to read it */
    if(sock_sync_data(res.sock, 1, "R", &temp_char))  /* just send a dummy char back and forth */
    {
        fprintf(stderr, "sync error before RDMA ops\n");
        rc = 1;
        goto main_exit;
    }

    /*
     * Now the client performs an RDMA read and then write on server.
     * Note that the server has no idea these events have occurred.
     * RDMA read is not supported on UC QPs, so it is only issued for RC.
     */
    if(config.server_name)
    {
        if(config.qp_type == IBV_QPT_RC)
        {
            /* First we read contents of server's buffer */
            if(post_send(&res, IBV_WR_RDMA_READ))
            {
                fprintf(stderr, "failed to post SR 2\n");
                rc = 1;
                goto main_exit;
            }
            if(poll_completion(&res))
            {
                fprintf(stderr, "poll completion failed 2\n");
                rc = 1;
                goto main_exit;
            }
            fprintf(stdout, "Contents of server's buffer: '%s'\n", res.buf);
        }

        /* Now we replace what's in the server's buffer */
        strcpy(res.buf, RDMAMSGW);
        fprintf(stdout, "Now replacing it with: '%s'\n", res.buf);
        if(post_send(&res, IBV_WR_RDMA_WRITE))
        {
            fprintf(stderr, "failed to post SR 3\n");
            rc = 1;
            goto main_exit;
        }
        if(poll_completion(&res))
        {
            fprintf(stderr, "poll completion failed 3\n");
            rc = 1;
            goto main_exit;
        }
    }

    /* Sync so server will know that client is done mucking with its memory */
    if(sock_sync_data(res.sock, 1, "W", &temp_char))  /* just send a dummy char back and forth */
    {
        fprintf(stderr, "sync error after RDMA ops\n");
        rc = 1;
        goto main_exit;
    }
    if(!config.server_name)
    {
        fprintf(stdout, "Contents of server buffer: '%s'\n", res.buf);
    }
    rc = 0;

main_exit:
    if(resources_destroy(&res))
    {
        fprintf(stderr, "failed to destroy resources\n");
        rc = 1;
    }
    if(config.dev_name)
    {
        free((char *) config.dev_name);
    }
    fprintf(stdout, "\ntest result is %d\n", rc);
    return rc;
}
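
The `-q` option in the listing above selects between RC and UC, and the two transports do not offer the same operations: RC supports Send, RDMA Write and RDMA Read, while UC supports Send and RDMA Write but not RDMA Read. The sketch below shows one way to parse the option and check an operation before posting it. The `demo_*` enums and both function names are hypothetical stand-ins so that it compiles without `<infiniband/verbs.h>`; real code would use `IBV_QPT_RC`/`IBV_QPT_UC` and the `IBV_WR_*` opcodes.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the libibverbs enums (IBV_QPT_*, IBV_WR_*). */
enum demo_qp_type { DEMO_QPT_RC, DEMO_QPT_UC, DEMO_QPT_INVALID };
enum demo_opcode  { DEMO_WR_SEND, DEMO_WR_RDMA_WRITE, DEMO_WR_RDMA_READ };

/* Map the -q argument to a QP type; anything other than "rc" or "uc" is rejected. */
static enum demo_qp_type parse_qp_type(const char *arg)
{
    if(arg && strcmp(arg, "rc") == 0)
    {
        return DEMO_QPT_RC;
    }
    if(arg && strcmp(arg, "uc") == 0)
    {
        return DEMO_QPT_UC;
    }
    return DEMO_QPT_INVALID;
}

/* Return true if the opcode may be posted on a QP of the given type. */
static bool op_supported(enum demo_qp_type type, enum demo_opcode op)
{
    switch(type)
    {
    case DEMO_QPT_RC:
        return true;                      /* Send, RDMA Write and RDMA Read */
    case DEMO_QPT_UC:
        return op != DEMO_WR_RDMA_READ;   /* no RDMA Read on UC */
    default:
        return false;
    }
}
```

A caller could run `op_supported(parse_qp_type(optarg), DEMO_WR_RDMA_READ)` before the read step and skip it when the result is false, instead of letting the post fail at run time.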

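The `-t` option is stored with `strtoul` and copied into `grh.traffic_class` without a range check. A minimal sketch of a stricter parser follows, together with helpers that split the byte into DSCP and ECN. This assumes the fabric interprets the value like an IP ToS / Traffic Class byte (upper 6 bits DSCP, lower 2 bits ECN), as is typical for RoCE v2. All function names here are hypothetical and not part of the original program.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a traffic class given as decimal, octal or hex text.
 * Returns the value (0..255) or -1 if the text is not a valid 8-bit number. */
static int parse_traffic_class(const char *arg)
{
    char *end = NULL;
    unsigned long v;

    errno = 0;
    v = strtoul(arg, &end, 0);
    if(errno != 0 || end == arg || *end != '\0' || v > 255)
    {
        return -1;
    }
    return (int)v;
}

/* Upper 6 bits of the byte: the DSCP code point. */
static uint8_t tos_to_dscp(uint8_t tos)
{
    return (uint8_t)(tos >> 2);
}

/* Lower 2 bits of the byte: the ECN field. */
static uint8_t tos_to_ecn(uint8_t tos)
{
    return (uint8_t)(tos & 0x3);
}

/* Rebuild the byte from a DSCP code point and an ECN value. */
static uint8_t dscp_to_tos(uint8_t dscp, uint8_t ecn)
{
    return (uint8_t)(((dscp & 0x3f) << 2) | (ecn & 0x3));
}
```

With these helpers, `-t 96` corresponds to DSCP 24 with ECN 0, which is usually the number a switch QoS configuration refers to.
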
More tutorials

InfiniBand, Verbs, RDMA | https://thegeekinthecorner.wordpress.com/category/infiniband-verbs-rdma/

RDMA read and write with IB verbs | https://thegeekinthecorner.wordpress.com/2010/09/28/rdma-read-and-write-with-ib-verbs/

http://www.hpcadvisorycouncil.com/pdf/building-an-rdma-capable-application-with-ib-verbs.pdf

Linux programming examples

https://community.mellanox.com/s/topic/0TO50000000g1zhGAA/linux-programming?tabset-dea0d=2

RC_pingpong source code walkthrough: https://arxiv.org/pdf/1105.1827.pdf
