onload extensions api

raindayinrain

Article link: https://blog.csdn.net/x13262608581/article/details/122416889

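Overview

Extension API and associated libraries

Source code

Provided with the Onload distribution.

Common components

1. #include <onload/extensions.h>
2. libonload_ext.a, libonload_ext.so
These APIs do not depend on Onload: an application linked with libonload_ext also runs without Onload.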

onload_is_present

// If the application is linked with libonload_ext, 
// but not running with Onload this will return 0. 
// If the application is running with Onload this will return 1.
int onload_is_present (void);

onload_fd_stat

struct onload_stat
{
int32_t stack_id;
char* stack_name;
int32_t endpoint_id;
int32_t endpoint_state;
};

// Retrieves internal details about an accelerated socket.
// Returns:
//   0        socket is not accelerated
//   1        socket is accelerated
//   -ENOMEM  memory cannot be allocated
// stack_name is allocated internally with malloc(), so the caller must free() it afterwards.
extern int onload_fd_stat(int fd, struct onload_stat* stat);

onload_fd_check_feature

// Used to check whether the Onload file descriptor supports a feature or not.
// Returns:
//   0          the feature is supported but not on this fd
//   >0         the feature is supported both by Onload and this fd
//   <0         the feature is not supported:
//   -ENOSYS    onload_fd_check_feature() is not supported
//   -ENOTSUPP  the feature is not supported by Onload
int onload_fd_check_feature (int fd, enum onload_fd_feature feature);
enum onload_fd_feature {
	/* Check whether this fd supports ONLOAD_MSG_WARM or not */
	ONLOAD_FD_FEAT_MSG_WARM,
	/* see Notes for details */
	ONLOAD_FD_FEAT_UDP_TX_TS_HDR
};

onload_thread_set_spin

// For a thread calling this function,
// onload_thread_set_spin() sets the per-thread spinning actions;
// it is not per-stack and not per-socket.
// Returns 0 on success,
// -EINVAL if an unsupported type is specified.
int onload_thread_set_spin(enum onload_spin_type type, unsigned spin);

Parameters

type
	Which operation to change the spin status of. 
	The type must be one of the following:
	enum onload_spin_type 
	{
		ONLOAD_SPIN_ALL, /* enable or disable all spin options */
		ONLOAD_SPIN_UDP_RECV,
		ONLOAD_SPIN_UDP_SEND,
		ONLOAD_SPIN_TCP_RECV,
		ONLOAD_SPIN_TCP_SEND,
		ONLOAD_SPIN_TCP_ACCEPT,
		ONLOAD_SPIN_PIPE_RECV,
		ONLOAD_SPIN_PIPE_SEND,
		ONLOAD_SPIN_SELECT,
		ONLOAD_SPIN_POLL,
		ONLOAD_SPIN_PKT_WAIT,
		ONLOAD_SPIN_EPOLL_WAIT,
		ONLOAD_SPIN_STACK_LOCK,
		ONLOAD_SPIN_SOCK_LOCK,
		ONLOAD_SPIN_SO_BUSY_POLL,
		ONLOAD_SPIN_TCP_CONNECT,
		ONLOAD_SPIN_MIMIC_EF_POLL, 
		/* thread spin configuration which mimics spin settings in EF_POLL_USEC. 
		Note that this has no effect on the usec‐setting part of EF_POLL_USEC.
		This needs to be set separately*/
		ONLOAD_SPIN_MAX /* special value to mark largest valid input */
	};
spin
	A boolean which indicates whether the operation should spin or not.

Additional notes

Notes
Spin time (for all threads) is set using the EF_SPIN_USEC parameter.
Examples
The onload_thread_set_spin API can be used to control spinning on a per‐thread or per‐API basis. 
The existing spin‐related configuration options set the default behavior for threads, 
and the onload_thread_set_spin API overrides the default for the thread calling this function.
Example 1: Disable all sorts of spinning:
onload_thread_set_spin(ONLOAD_SPIN_ALL, 0);
Example 2: Enable all sorts of spinning:
onload_thread_set_spin(ONLOAD_SPIN_ALL, 1);
Example 3: Enable spinning only for certain threads:
1 Set the spin timeout by setting EF_SPIN_USEC, and disable spinning by default
by setting EF_POLL_USEC=0.
2 In each thread that should spin, invoke onload_thread_set_spin().
Example 4: Disable spinning only in certain threads:
1 Enable spinning by setting EF_POLL_USEC=<timeout>.
2 In each thread that should not spin, invoke onload_thread_set_spin().
WARNING: If a thread is set to NOT spin and then blocks, this may invoke an interrupt for the whole stack. Interrupts occurring on moderately busy threads may cause unintended and undesirable consequences.
Example 5: Enable spinning for UDP traffic, but not TCP traffic:
1 Set the spin timeout by setting EF_SPIN_USEC, and disable spinning by default
by setting EF_POLL_USEC=0.
2 In each thread that should spin (UDP only), do:
onload_thread_set_spin(ONLOAD_SPIN_UDP_RECV, 1);
onload_thread_set_spin(ONLOAD_SPIN_UDP_SEND, 1);
Example 6: Enable spinning for TCP traffic, but not UDP traffic:
1 Set the spin timeout by setting EF_SPIN_USEC, and disable spinning by default
by setting EF_POLL_USEC=0.
2 In each thread that should spin (TCP only), do:
onload_thread_set_spin(ONLOAD_SPIN_TCP_RECV, 1);
onload_thread_set_spin(ONLOAD_SPIN_TCP_SEND, 1);
onload_thread_set_spin(ONLOAD_SPIN_TCP_ACCEPT, 1);

Spinning and sockets:
When a thread calls onload_thread_set_spin() it sets the spinning actions applied when the thread accesses any socket ‐ irrespective of whether the socket is created by this thread.
If a socket is created by thread‐A and is accessed by thread‐B, calling onload_thread_set_spin(ONLOAD_SPIN_ALL, 1) only from thread‐B will enable spinning for thread‐B, 
but not for thread‐A. 
In the same scenario, if onload_thread_set_spin(ONLOAD_SPIN_ALL, 1) is called only from thread‐A, 
then spinning is enabled only for thread‐A, but not for thread‐B.
The onload_thread_set_spin() function sets the per‐thread spinning action.

onload_thread_get_spin

// For the current thread, identify which operations should spin.
// Returns 0 on success.
int onload_thread_get_spin(unsigned *state);

Parameters

state
	Location at which to write the spin status as a bitmask. 
	Bit n of the mask is set if spinning has been enabled for spin type n (see onload_thread_set_spin above).

Additional notes

Notes
	Spin time (for all threads) is set using the EF_SPIN_USEC parameter.
Examples
	Determine if spinning is enabled for UDP receive:
	unsigned state;
	onload_thread_get_spin(&state);
	if (state & (1 << ONLOAD_SPIN_UDP_RECV)) 
	{
		// spinning is enabled for UDP receive
	}
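
The bitmask test above can be wrapped in small helpers. The following is a minimal sketch that does not link against Onload: the demo_* enum is a local, shortened copy of the spin types in the order listed earlier, and all demo_* names are hypothetical. Real code would include <onload/extensions.h> and use the ONLOAD_SPIN_* constants directly.

```c
#include <assert.h>

/* Local, shortened copy of the spin types in the order listed above.
 * Assumption for illustration only: real code uses enum onload_spin_type
 * from <onload/extensions.h>. */
enum demo_spin_type {
    DEMO_SPIN_ALL,
    DEMO_SPIN_UDP_RECV,
    DEMO_SPIN_UDP_SEND,
    DEMO_SPIN_TCP_RECV,
    DEMO_SPIN_TCP_SEND,
    DEMO_SPIN_MAX
};

/* Returns non-zero if bit `type` is set in a mask of the kind written
 * by onload_thread_get_spin(). */
static int demo_spin_enabled(unsigned state, enum demo_spin_type type)
{
    assert(type < DEMO_SPIN_MAX);
    return (state & (1u << type)) != 0;
}

/* Counts how many of the individual demo spin types are enabled. */
static int demo_spin_count(unsigned state)
{
    int n = 0;
    int t;
    for (t = DEMO_SPIN_UDP_RECV; t < DEMO_SPIN_MAX; ++t)
        n += demo_spin_enabled(state, (enum demo_spin_type)t);
    return n;
}
```

Using an unsigned shift (1u << type) avoids shifting into the sign bit if the enum ever grows.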

onload_socket_nonaccel

// Create a socket which is not accelerated by Onload. 
// This function is useful 
// when attempting to reserve a port for an ephemeral ef_vi instance without installing Onload filters. 
// It is also possible to use the stackname API to disable acceleration for specific socket(s).
// This function takes arguments and returns values 
// that correspond exactly to the standard socket() function call.
// Returns the file descriptor that refers to the created endpoint,
// or -1 with errno ENOSYS if the Onload extensions library is not in use.
int onload_socket_nonaccel(int domain, int type, int protocol);

onload_socket_unicast_nonaccel

// Create a socket that will only accelerate multicast traffic. 
// If this socket is not able to receive multicast, for example, 
// because it is bound to a unicast local address, 
// or it is a TCP socket, 
// then it will be handed over to the kernel.
// This function is useful for cases 
// where a socket will be used solely for multicast traffic 
// to avoid consuming limited filter table resource. 
// This does not prevent unicast traffic from arriving at the socket, 
// and if appropriate traffic is received, 
// it will still be delivered via the un‐accelerated path. 
// It is most useful for sockets that are bound to INADDR_ANY, 
// because for these Onload must install a filter per IP address 
// that is configured on an accelerated interface, 
// on each accelerated hardware port.
// If a socket is bound to a multicast local address, 
// then no unicast filters will be installed, so there is no need for this function.
// Returns the file descriptor that refers to the created endpoint,
// or -1 with errno ENOSYS if the Onload extensions library is not in use.
int onload_socket_unicast_nonaccel(int domain, int type, int protocol);

Explanation

This function takes arguments and returns values that correspond 
exactly to the standard socket() function call.

Stacks API

Using the Onload Extensions API 
an application can bind selected sockets to specific Onload stacks 
and in this way ensure that time‐critical sockets are not starved of resources by other non‐critical sockets. 
The API allows an application to select sockets 
which are to be accelerated thus reserving Onload resources for performance critical paths. 
This also prevents non‐critical paths from creating jitter for critical paths.

onload_set_stackname

Select the Onload stack that new sockets are placed in. 
A socket can exist only in a single stack. 
A socket can be moved to a different stack ‐ see onload_move_fd() below.
Moving a socket to a different stack does not create a copy of the socket 
in originator and target stacks.
// Returns 0 on success,
// -1 with errno set to ENAMETOOLONG if the name exceeds the permitted length,
// -1 with errno set to EINVAL if other parameters are invalid.
int onload_set_stackname(int who, int scope, const char *name);

Parameters

who
	Must be one of the following:
	‐ ONLOAD_THIS_THREAD ‐ to modify the stack name in which all subsequent sockets are created by this thread.
	‐ ONLOAD_ALL_THREADS ‐ to modify the stack name in which all subsequent sockets are created by all threads in the current process. ONLOAD_THIS_THREAD takes precedence over ONLOAD_ALL_THREADS.
scope
	Must be one of the following:
	- ONLOAD_SCOPE_THREAD - name is scoped with current thread
	- ONLOAD_SCOPE_PROCESS - name is scoped with current process
	- ONLOAD_SCOPE_USER - name is scoped with current user
	- ONLOAD_SCOPE_GLOBAL - name is global across all threads, users and processes.
	- ONLOAD_SCOPE_NOCHANGE - undo effect of a previous call to onload_set_stackname(ONLOAD_THIS_THREAD, …), see the notes on ONLOAD_SCOPE_NOCHANGE below.
name
	One of the following:
	‐ the stack name up to 8 characters.
	‐ an empty string to set no stackname
	‐ the special value ONLOAD_DONT_ACCELERATE to prevent sockets created in this thread, user, process from being accelerated.
	Sockets identified by the options above will belong to the Onload stack until a subsequent call using onload_set_stackname identifies a different stack or the ONLOAD_SCOPE_NOCHANGE option is used.

Explanatory notes

Note 1:
This applies for stacks selected for sockets created by socket() and for pipe(), 
it has no effect on accept(). 
Passively opened sockets created via accept() will always be in the same stack 
as the listening socket that they are linked to, 
this means that the following are functionally identical i.e.
onload_set_stackname(foo)
socket
listen
onload_set_stackname(bar)
accept

and:
onload_set_stackname(foo)
socket
listen
accept
onload_set_stackname(bar)
In both cases the listening socket and the accepted socket will be in stack foo.
Note 2:
Scope defines the namespace in which a stack belongs. 
A stackname of foo in scope user is not the same as a stackname of foo in scope thread. 
Scope restricts the visibility of a stack 
to either the current thread, current process, current user or is unrestricted (global). 
This has the property that with, for example, process based scoping, 
two processes can have the same stackname without sharing a stack 
‐ as the stack for each process has a different namespace.
Note 3:
Scoping can be thought of as adding a suffix to the supplied name e.g.
ONLOAD_SCOPE_THREAD: <stackname>‐t<thread_id>
ONLOAD_SCOPE_PROCESS: <stackname>‐p<process_id>
ONLOAD_SCOPE_USER: <stackname>‐u<user_id>
ONLOAD_SCOPE_GLOBAL: <stackname>
This is an example only and the implementation is free to do something different such as maintaining different lists for different scopes.
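
The suffix scheme can be made concrete with a small formatting helper. This is a sketch of the illustrative naming only, not of how Onload stores stack names; the demo_* names are hypothetical and the real scope constants are the ONLOAD_SCOPE_* values.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical scope tags for this sketch; the real constants are the
 * ONLOAD_SCOPE_* values from <onload/extensions.h>. */
enum demo_scope {
    DEMO_SCOPE_THREAD,
    DEMO_SCOPE_PROCESS,
    DEMO_SCOPE_USER,
    DEMO_SCOPE_GLOBAL
};

/* Builds the illustrative "<stackname>-<tag><id>" key described above.
 * Returns the snprintf() result: the length the key needs, or a
 * negative value on error. */
static int demo_scoped_name(char *out, size_t out_len, const char *name,
                            enum demo_scope scope, long id)
{
    assert(out != NULL && name != NULL);
    switch (scope) {
    case DEMO_SCOPE_THREAD:  return snprintf(out, out_len, "%s-t%ld", name, id);
    case DEMO_SCOPE_PROCESS: return snprintf(out, out_len, "%s-p%ld", name, id);
    case DEMO_SCOPE_USER:    return snprintf(out, out_len, "%s-u%ld", name, id);
    case DEMO_SCOPE_GLOBAL:  return snprintf(out, out_len, "%s", name);
    }
    return -1;
}
```

Two processes asking for "foo" in process scope get different keys (for example foo-p100 and foo-p200), while in global scope both get plain foo and therefore share a stack.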
Note 4:
ONLOAD_SCOPE_NOCHANGE will undo the effect of a previous call to onload_set_stackname(ONLOAD_THIS_THREAD, …).
If you have previously used onload_set_stackname(ONLOAD_THIS_THREAD, …) and want to revert to the behavior of threads that are using the ONLOAD_ALL_THREADS configuration, 
without changing that configuration, 
you can do the following:
onload_set_stackname(ONLOAD_ALL_THREADS, ONLOAD_SCOPE_NOCHANGE, "");

Related environment variables
Related environment variables are:
EF_DONT_ACCELERATE
Default: 0
Minimum: 0
Maximum: 1
Scope: Per‐process
If this environment variable is set then acceleration for ALL sockets is disabled 
and handed off to the kernel stack until the application overrides this state with a call to onload_set_stackname().

EF_STACK_PER_THREAD
Default: 0
Minimum: 0
Maximum: 1
Scope: Per‐process
If this environment variable is set each socket created by the application will be placed in a stack depending on the thread in which it is created. Stacks could, for example, be named using the thread ID of the thread that creates the stack, but this should not be relied upon.
A call to onload_set_stackname overrides this variable. EF_DONT_ACCELERATE takes precedence over this variable.

EF_NAME
Default: none
Minimum: 0 chars
Maximum: 8 chars
Scope: per‐stack
The environment variable EF_NAME will be honored to control Onload stack sharing. However, a call to onload_set_stackname overrides this variable, and EF_DONT_ACCELERATE and EF_STACK_PER_THREAD both take precedence over EF_NAME.
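
The precedence rules for these three variables can be summarized as a decision function. This is a model of the documented ordering only, not Onload's implementation; struct demo_config and the other demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Which setting decides where a new socket is placed (model only). */
enum demo_choice {
    DEMO_API_STACKNAME,     /* onload_set_stackname() has been called */
    DEMO_DONT_ACCELERATE,   /* EF_DONT_ACCELERATE=1 */
    DEMO_STACK_PER_THREAD,  /* EF_STACK_PER_THREAD=1 */
    DEMO_EF_NAME,           /* EF_NAME is set */
    DEMO_DEFAULT            /* none of the above */
};

/* Hypothetical snapshot of the relevant settings. */
struct demo_config {
    int api_called;           /* non-zero once onload_set_stackname() ran */
    int ef_dont_accelerate;   /* value of EF_DONT_ACCELERATE */
    int ef_stack_per_thread;  /* value of EF_STACK_PER_THREAD */
    const char *ef_name;      /* EF_NAME, NULL or "" when unset */
};

/* Applies the ordering stated above: the API call overrides every
 * variable, then EF_DONT_ACCELERATE, then EF_STACK_PER_THREAD, then
 * EF_NAME. */
static enum demo_choice demo_resolve(const struct demo_config *c)
{
    assert(c != NULL);
    if (c->api_called)          return DEMO_API_STACKNAME;
    if (c->ef_dont_accelerate)  return DEMO_DONT_ACCELERATE;
    if (c->ef_stack_per_thread) return DEMO_STACK_PER_THREAD;
    if (c->ef_name != NULL && c->ef_name[0] != '\0') return DEMO_EF_NAME;
    return DEMO_DEFAULT;
}
```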

onload_move_fd

// Move the file descriptor to the current stack. 
// The target stack can be specified with onload_set_stackname(),
// then use onload_move_fd() to put the socket into the target stack.
// A socket can exist only in a single stack. 
// Moving a socket to a different stack does not create a copy of the socket 
// in originator and target stacks. 
// Limited to TCP closed or accepted sockets only.
// fd ‐ the file descriptor to be moved to the current stack.
// 0 on success
// non‐zero otherwise.
int onload_move_fd (int fd);

Notes
•Useful to move fds obtained by accept() 
to a different Onload stack from the listening socket.
•Cannot be used on actively opened connections, 
although it is possible to use onload_set_stackname() 
before calling connect() to achieve the same result.
•The socket must have empty send and retransmit queues 
(i.e. send not called on this socket)
•The socket must have a simple receive queue 
(no loss, reordering, etc)
•The fd is not yet in an epoll set.
•The onload_move_fd function should not be used 
if SO_TIMESTAMPING is set to a non‐zero value for the originating socket.
•Should not be used simultaneously with other I/O multiplex actions 
i.e. poll(), select(), recv() etc on the file descriptor.
•This function is not async‐safe and should never be called 
from any process function handling signals.
•This function cannot be used to hand sockets over to the kernel. 
It is not possible to use onload_set_stackname (ONLOAD_DONT_ACCELERATE) and then onload_move_fd().

NOTE: The onload_move_fd function does not check whether a destination stack has either RX or TX timestamping enabled.

onload_stackname_save

// Save the state of the current Onload stack identified by the previous call to onload_set_stackname().
// Returns 0 on success,
// -ENOMEM when memory cannot be allocated.
int onload_stackname_save (void);

onload_stackname_restore

Restore stack state saved with a previous call to onload_stackname_save(). All updates/changes to state of the current stack will be deleted and all state previously saved will be restored. To avoid unexpected results, the stack should be restored in the same thread as used to call onload_stackname_save().
// Returns 0 on success,
// non-zero if an error occurs.
int onload_stackname_restore (void);

Notes

The API stackname save and restore functions provide flexibility when binding sockets to an Onload stack.
Using a combination of onload_set_stackname(), onload_stackname_save() and onload_stackname_restore(), the user is able to create default stack settings which apply to one or more sockets, save this state and then create changed stack settings which are applied to other sockets. The original default settings can then be restored to apply to subsequent sockets.

Stacks API Usage

Using a combination of the EF_DONT_ACCELERATE environment variable and the function onload_set_stackname(), the user is able to control/select sockets which are to be accelerated and isolate these performance critical sockets and threads from the rest of the system.

onload_stack_opt_set_int

Set/modify per stack options that all subsequently created stacks will use instead of using the existing global stack options.
// name
// Stack option to modify
// value
// New value for the stack option.
// Returns 0 on success;
// errno is set to EINVAL if the requested option is not found, or to ENOMEM.
int onload_stack_opt_set_int(const char* name, int64_t value);

onload_stack_opt_set_int("EF_SCALABLE_FILTERS_ENABLE", 1);

Notes

Cannot be used to modify options on existing stacks ‐ only for new stacks.
Cannot be used to modify process options ‐ only stack options.
Modified options will be used for all newly created stacks until onload_stack_opt_reset() is called.

onload_stack_opt_reset

Revert to using global stack options for newly created stacks.
// Always returns 0.
int onload_stack_opt_reset(void);

Note

Should be called following a call to onload_stack_opt_set_int() to revert to using global stack options for all newly created stacks.

Stacks API ‐ Examples

•This thread will use stack foo, other threads in the stack will continue as before.
onload_set_stackname(ONLOAD_THIS_THREAD, ONLOAD_SCOPE_GLOBAL, "foo")
•All threads in this process will get their own stack called foo. This is equivalent to the EF_STACK_PER_THREAD environment variable.
onload_set_stackname(ONLOAD_ALL_THREADS, ONLOAD_SCOPE_THREAD, "foo")
•All threads in this process will share a stack called foo. If another process did the same function call it will get its own stack.
onload_set_stackname(ONLOAD_ALL_THREADS, ONLOAD_SCOPE_PROCESS, "foo")
•All threads in this process will share a stack called foo. If another process run by the same user did the same, it would share the same stack as the first process. If another process run by a different user did the same it would get its own stack.
onload_set_stackname(ONLOAD_ALL_THREADS, ONLOAD_SCOPE_USER, "foo")

•Equivalent to EF_NAME. All threads will use a stack called foo which is shared by any other process which does the same.
onload_set_stackname(ONLOAD_ALL_THREADS, ONLOAD_SCOPE_GLOBAL, "foo")
•Equivalent to EF_DONT_ACCELERATE. New sockets/pipes will not be accelerated until another call to onload_set_stackname().
onload_set_stackname(ONLOAD_ALL_THREADS, ONLOAD_SCOPE_GLOBAL, ONLOAD_DONT_ACCELERATE)

onload_ordered_epoll_wait

If the epoll set contains accelerated sockets in only one stack 
this function can be used instead of epoll_wait() to return events 
in the order these were recovered from the wire. 
There is no explicit check on sockets, 
so applications must ensure that the rules are applied 
to avoid mis‐ordering of packets.

// A positive value identifies the number of epoll_evs / ordered_evs to process.
// A zero value indicates there are no events which can be processed while maintaining ordering i.e. there may be no data or only unordered data.
// A negative return value identifies an error condition.
int onload_ordered_epoll_wait (
	int epfd,
	struct epoll_event *events,
	struct onload_ordered_epoll_event *oo_events,
	int maxevents,
	int timeout);

Notes

Any file descriptors returned as ready without a valid timestamp, 
i.e. tv_sec = 0, should be considered unordered with respect to the rest of the set. 
This can occur for data received via the kernel 
or data returned without a hardware timestamp, 
i.e. from an interface that does not support hardware timestamping.
The environment variable EF_UL_EPOLL=1 must be set. 
Hardware timestamps are required. 
This feature is only available on the SFN7000, SFN8000 and X2 series adapters.
struct onload_ordered_epoll_event
{
	/* The hardware timestamp of the first readable data */
	struct timespec ts;
	/* Number of bytes that may be read to maintain wire order */
	int bytes;
};
ONLOAD_MSG_ONEPKT and EF_TCP_RCVBUF_STRICT are incompatible 
with the wire order delivery feature. 
Refer to Wire Order Delivery in the Onload User Guide for details.
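
The tv_sec = 0 rule can be applied when walking the returned array. The sketch below does not call Onload: struct demo_ordered_event is a local mirror of the structure listed above, and the demo_* names are hypothetical. It adds up the bytes that may be read in wire order and counts the events that carry no valid timestamp.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Local mirror of struct onload_ordered_epoll_event as listed above.
 * Assumption for illustration: real code takes the definition from
 * <onload/extensions.h>. */
struct demo_ordered_event {
    struct timespec ts;  /* hardware timestamp of the first readable data */
    int bytes;           /* bytes that may be read to maintain wire order */
};

/* An event with tv_sec == 0 carries no valid hardware timestamp. */
static int demo_is_ordered(const struct demo_ordered_event *ev)
{
    return ev->ts.tv_sec != 0;
}

/* Sums the bytes of the timestamped events among evs[0..n-1]. Events
 * without a timestamp are skipped and counted in *unordered. */
static long demo_ordered_bytes(const struct demo_ordered_event *evs, int n,
                               int *unordered)
{
    long total = 0;
    int i;
    assert(unordered != NULL);
    *unordered = 0;
    for (i = 0; i < n; ++i) {
        if (demo_is_ordered(&evs[i]))
            total += evs[i].bytes;
        else
            ++*unordered;
    }
    return total;
}
```

In a real loop, n would be the positive return value of onload_ordered_epoll_wait(), and the unordered descriptors would be handled on a separate path.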

Zero‐Copy API

Zero-copy eliminates intermediate buffers 
when transferring data 
between application and network adapter.
The Onload Extensions Zero‐Copy API supports zero‐copy 
of UDP received packet data and TCP transmit packet data.
The API provides the following components:
•#include <onload/extensions_zc.h>
In addition to the common components, 
an application should include this header file 
which contains all function prototypes and constant values required 
when using the API. 
The header file also includes comprehensive documentation, 
required data structures and function definitions.

Zero‐Copy Data Buffers

To avoid the copy, data is passed to and from the application 
in special buffers described by a struct onload_zc_iovec. 
A message or datagram can consist of multiple iovecs 
using a struct onload_zc_msg. 
A single call to send may involve multiple messages 
using an array of struct onload_zc_mmsg.
/* A zc_iovec describes a single buffer */
struct onload_zc_iovec 
{
	void* iov_base; /* Address within buffer */
	size_t iov_len; /* Length of data */
	onload_zc_handle buf; /* (opaque) buffer handle */
	unsigned iov_flags; /* Not currently used */
};

/* A msg describes array of iovecs that make up datagram */
struct onload_zc_msg 
{
	struct onload_zc_iovec* iov; /* Array of buffers */
	struct msghdr msghdr; /* Message metadata */
};

/* An mmsg describes a message, the socket, and its result */
struct onload_zc_mmsg 
{
	struct onload_zc_msg msg; /* Message */
	int rc; /* Result of send operation */
	int fd; /* socket to send on */
};

Zero‐Copy UDP Receive Overview

Figure 27 illustrates the difference 
between the normal UDP receive mode and the zero‐copy method.
When using the standard POSIX socket calls, 
the adapter delivers packets to an Onload packet buffer 
which is described by a descriptor previously placed in the RX descriptor ring. 
When the application calls recv(), 
Onload copies the data from the packet buffer to an application‐supplied buffer.
Using the zero‐copy UDP receive API 
the application calls the onload_zc_recv() function 
including a callback function which will be called when data is ready. 
The callback can directly access the data inside the Onload packet buffer avoiding a copy.

A single call using onload_zc_recv() function can result in multiple datagrams 
being delivered to the callback function. 
Each time the callback returns to Onload 
the next datagram is delivered. 
Processing stops when the callback instructs Onload to cease delivery 
or there are no further received datagrams.

If the receiving application does not need to look at all data received 
(i.e. is filtering) this can result in a considerable performance advantage 
because this data is not pulled into the processor's cache, 
thereby reducing the application cache footprint.
As a general rule, 
the callback function should avoid calling other system calls 
which attempt to modify or close the current socket.
Zero‐copy UDP Receive is implemented within the Onload Extensions API.

Zero‐Copy UDP Receive

The onload_zc_recv() function specifies a callback to invoke 
for each received UDP datagram. 
The callback is invoked in the context of the call to onload_zc_recv() 
(i.e. it blocks/spins waiting for data).
Before calling, 
the application must set the following in the struct onload_zc_recv_args:
cb
	Set to the callback function pointer.
user_ptr
	Set to point to application state; this is not touched by Onload.
msg.msghdr.msg_control, msg_controllen, msg_name, msg_namelen
	The application should set these to appropriate buffers and lengths
	(if required) as for recvmsg, or to NULL and 0 if not used.
flags
	Set to indicate behavior (e.g. ONLOAD_MSG_DONTWAIT).

typedef enum onload_zc_callback_rc
(*onload_zc_recv_callback)(struct onload_zc_recv_args *args, int flags);

struct onload_zc_recv_args
{
	struct onload_zc_msg msg;
	onload_zc_recv_callback cb;
	void* user_ptr;
	int flags;
};

int onload_zc_recv(int fd, struct onload_zc_recv_args *args);
The callback gets to examine the data, 
and can control what happens next: 
(i) whether or not the buffer(s) are kept by the callback or are immediately freed by Onload; 
and (ii) whether or not onload_zc_recv() will internally loop 
and invoke the callback with the next datagram, 
or immediately return to the application. 
The next action is determined by setting flags in the return code as follows:
ONLOAD_ZC_KEEP
	The callback function can elect to retain ownership of received buffer(s) 
	by returning ONLOAD_ZC_KEEP. 
	Following this, the correct way to release retained buffers 
	is to call onload_zc_release_buffers() 
	to explicitly release the first buffer from each received datagram. 
	Subsequent buffers pertaining to the same datagram 
	will then be automatically released.
ONLOAD_ZC_CONTINUE
	Suggests that Onload should loop and process more datagrams.
ONLOAD_ZC_TERMINATE
	Insists that Onload immediately return from onload_zc_recv().
Flags can also be set by Onload:
ONLOAD_ZC_END_OF_BURST
	Onload sets this flag to indicate that this is the last packet.
ONLOAD_ZC_MSG_SHARED
	Packet buffers are read only.

If there is unaccelerated data on the socket from the kernel’s receive path 
this cannot be handled without copying. 
The application has two choices as follows:
ONLOAD_MSG_RECV_OS_INLINE
	Set this flag when calling onload_zc_recv(). 
	Onload will deal with the kernel data internally 
	and pass it to the callback.
Check the return code
	Check the return code from onload_zc_recv(). 
	If it returns -ENOTEMPTY, 
	the application must call onload_recvmsg_kernel() 
	to retrieve the kernel data.

Zero‐Copy Receive Example #1

struct onload_zc_recv_args args;
struct zc_recv_state state;
int rc;
state.bytes = bytes_to_wait_for;

/* Easy way to set msg_control* and msg_name* to zero */
memset(&args.msg, 0, sizeof(args.msg));
args.cb = &zc_recv_callback;
args.user_ptr = &state;
/* Let Onload pass kernel-path data to the callback as well */
args.flags = ONLOAD_MSG_RECV_OS_INLINE;
rc = onload_zc_recv(fd, &args);
//---
enum onload_zc_callback_rc
zc_recv_callback(struct onload_zc_recv_args *args, int flags)
{
	size_t i;
	struct zc_recv_state *state = args->user_ptr;

	/* Walk every buffer of this datagram and count its bytes */
	for( i = 0; i < args->msg.msghdr.msg_iovlen; ++i ) 
	{
		printf("zc callback iov %zu: %p, %zu\n", i,
			args->msg.iov[i].iov_base,
			args->msg.iov[i].iov_len);
		state->bytes -= args->msg.iov[i].iov_len;
	}

	/* Stop once the requested number of bytes has arrived */
	if( state->bytes <= 0 ) return ONLOAD_ZC_TERMINATE;
	else return ONLOAD_ZC_CONTINUE;
}

Zero‐Copy Receive Example #2

static enum onload_zc_callback_rc
zc_recv_callback(struct onload_zc_recv_args *args, int flag)
{
	struct user_info *zc_info = args->user_ptr;
	size_t i;
	int zc_rc = 0;
	/* Hand each buffer of the datagram to the application */
	for( i = 0; i < args->msg.msghdr.msg_iovlen; ++i ) 
	{
		zc_rc += args->msg.iov[i].iov_len;
		handle_msg(args->msg.iov[i].iov_base,
			args->msg.iov[i].iov_len);
	}
	if( zc_rc == 0 )
		return ONLOAD_ZC_TERMINATE;
	zc_info->zc_rc += zc_rc;
	/* With MSG_WAITALL keep going until the requested size has arrived */
	if( (zc_info->flags & MSG_WAITALL) && (zc_info->zc_rc < zc_info->size) )
		return ONLOAD_ZC_CONTINUE;
	else return ONLOAD_ZC_TERMINATE;
}

struct onload_zc_recv_args zc_args;
ssize_t do_recv_zc(int fd, void* buf, size_t len, int flags)
{
	struct user_info info; int rc;
	init_user_info(&info);
	memset(&zc_args, 0, sizeof(zc_args));
	zc_args.user_ptr = &info;
	zc_args.flags = 0;
	zc_args.cb = &zc_recv_callback;
	if( flags & MSG_DONTWAIT )
		zc_args.flags |= ONLOAD_MSG_DONTWAIT;

	rc = onload_zc_recv(fd, &zc_args);
	if( rc == -ENOTEMPTY) {
		/* Kernel data is pending: fall back to a copying receive
		* into the caller's buffer (assumed intent of buf and len) */
		struct iovec iov = { buf, len };
		struct msghdr msg;
		memset(&msg, 0, sizeof(msg));
		msg.msg_iov = &iov;
		msg.msg_iovlen = 1;
		if( ( rc = onload_recvmsg_kernel(fd, &msg, 0) ) < 0 )
			printf("onload_recvmsg_kernel failed\n");
	}
	else if( rc == 0 ) {
		/* zc_rc gets set by callback to bytes received, so we
		* can return that to appear like standard recv call */
		rc = info.zc_rc;
	}
	return rc;
}

onload_zc_recv() should not be used together with onload_set_recv_filter() 
and only supports accelerated (Onloaded) sockets. 
For example, when bound to a broadcast address the socket fd is handed off to the kernel 
and this function will return ESOCKTNOSUPPORT.

Zero‐Copy TCP Send Overview

Figure 31 illustrates the difference between the normal TCP transmit method 
and the zero-copy method.
When using standard POSIX socket calls, 
the application first creates the payload data in an application allocated buffer 
before calling the send() function. 
Onload will copy the data to an Onload packet buffer in memory 
and post a descriptor to this buffer in the network adapter TX descriptor ring.
Using the zero‐copy TCP transmit API the application calls the onload_zc_alloc_buffers() function 
to request buffers from Onload. 
A pointer to a packet buffer is returned in response. 
The application places the data to send directly into this buffer 
and then calls onload_zc_send() to indicate to Onload that data is available to send.
Onload will post a descriptor for the packet buffer in the network adapter TX descriptor ring 
and ring the TX doorbell. 
The network adapter fetches the data for transmission.

The socket used to allocate zero‐copy buffers must be in the same stack 
as the socket used to send the buffers. 
When using TCP loopback, 
Onload can move a socket from one stack to another. 
Users must ensure that they ALWAYS USE BUFFERS FROM THE CORRECT STACK.
The onload_zc_send function does not currently support the ONLOAD_MSG_MORE 
or TCP_CORK flags.
Zero‐copy TCP transmit is implemented within the Onload Extensions API.

Zero‐Copy TCP Send

The zero‐copy send API supports the sending of multiple messages 
to different sockets in a single call. 
Data buffers must be allocated in advance 
and for best efficiency 
these should be allocated in blocks 
and off the critical path. 
The user should avoid simply moving the copy from Onload into the application, 
but where this is unavoidable, it should also be done off the critical path.

int onload_zc_send(struct onload_zc_mmsg* msgs, int mlen, int flags);
int onload_zc_alloc_buffers(
	int fd,
	struct onload_zc_iovec* iovecs,
	int iovecs_len,
	onload_zc_buffer_type_flags flags);
int onload_zc_release_buffers(
	int fd,
	onload_zc_handle* bufs,
	int bufs_len);
The onload_zc_send() return value identifies 
how many of the onload_zc_mmsg array's rc fields are set. 
Each onload_zc_mmsg.rc 
returns how many bytes were sent for that message, or an error. 
The cases are listed below.
rc = onload_zc_send()
	rc < 0
		Application error calling onload_zc_send(); rc is set to the error code.
	rc == 0
		Should not happen.
	0 < rc <= n_msgs
		rc is set to the number of messages whose status has been set in mmsg[i].rc. 
		rc == n_msgs is the normal case.
rc = mmsg[i].rc
	rc < 0
		Error sending this message; rc is set to the error code.
	rc >= 0
		rc is set to the number of bytes that have been sent in this message. 
		Compare to the message length to establish which buffers were sent.

Sent buffers are owned by Onload. 
Unsent buffers are owned by the application and must be freed or reused to avoid leaking.
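
Working out which buffers were sent from mmsg[i].rc is a running-sum comparison against the buffer lengths. The helper below is a minimal sketch of that arithmetic, independent of Onload; demo_first_unsent is a hypothetical name, and how a partially sent buffer should be treated is not covered by the text above, so the sketch only reports where the fully sent buffers end.

```c
#include <assert.h>
#include <stddef.h>

/* Given the per-buffer lengths of one message and the byte count reported
 * in mmsg.rc, returns the index of the first buffer that was not fully
 * sent. A return value of n means every buffer was sent. */
static int demo_first_unsent(const size_t *iov_len, int n, int sent_bytes)
{
    size_t remaining;
    int i;
    assert(sent_bytes >= 0);  /* a negative rc is an error, handle it separately */
    remaining = (size_t)sent_bytes;
    for (i = 0; i < n; ++i) {
        if (remaining < iov_len[i])
            break;            /* buffer i was sent partially or not at all */
        remaining -= iov_len[i];
    }
    return i;
}
```

Buffers with an index at or above the returned value are the candidates for release or reuse by the application.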

Buffers passed with the ONLOAD_MSG_WARM feature enabled 
are not actually sent; 
ownership remains with the user, who is responsible for freeing these buffers.

Zero‐Copy Send ‐ Single Message, Single Buffer

struct onload_zc_iovec iovec;
struct onload_zc_mmsg mmsg;
int rc;
/* Ask Onload for one packet buffer in the stack of socket fd */
rc = onload_zc_alloc_buffers(fd, &iovec, 1, ONLOAD_ZC_BUFFER_HDR_TCP);
assert(rc == 0);
assert(my_data_len <= iovec.iov_len);
/* Fill the buffer and shrink iov_len to the payload size */
memcpy(iovec.iov_base, my_data, my_data_len);
iovec.iov_len = my_data_len;
mmsg.fd = fd;
mmsg.msg.iov = &iovec;
mmsg.msg.msghdr.msg_iovlen = 1;
rc = onload_zc_send(&mmsg, 1, 0);
if( rc <= 0) 
{
	/* Probably application bug */
	return rc;
} 
else 
{
	/* Only one message, so rc should be 1 */
	assert(rc == 1);
	/* rc == 1 so we can look at the first (only) mmsg.rc */
	if( mmsg.rc < 0 )
		/* Error sending message */
		onload_zc_release_buffers(fd, &iovec.buf, 1);
	else
		/* Message sent, single msg, single iovec so
		* shouldn't worry about partial sends */
		assert(mmsg.rc == my_data_len);
}

The example above demonstrates error code handling. 
Note that it also contains an example of bad practice: 
buffers are allocated and populated on the critical path.

Zero‐Copy Send ‐ Multiple Message, Multiple Buffers

#define N_MSGS 2
#define N_BUFFERS 2 /* buffers per message; example value only */
struct onload_zc_iovec iovec[N_MSGS][N_BUFFERS];
struct onload_zc_mmsg mmsg[N_MSGS];
for( i = 0; i < N_MSGS; ++i ) 
{
	rc = onload_zc_alloc_buffers(
		fd, 
		iovec[i], 
		N_BUFFERS, 
		ONLOAD_ZC_BUFFER_HDR_TCP);
	assert(rc == 0);
	/* TODO store data in iovec[i][j].iov_base,
	* set iovec[i][j].iov_len */
	mmsg[i].fd = fd; /* Could be different for each message */
	mmsg[i].msg.iov = iovec[i];
	mmsg[i].msg.msghdr.msg_iovlen = N_BUFFERS;
}

rc = onload_zc_send(mmsg, N_MSGS, 0);
if( rc <= 0 ) 
{
	/* Probably application bug */
	return rc;
} 
else 
{
	for( i = 0; i < N_MSGS; ++i ) 
	{
		if( i < rc ) 
		{
			/* mmsg[i].rc is set and we can use it */
			if( mmsg[i].rc < 0 ) 
			{
				/* error sending this message ‐ release buffers */
				for( j = 0; j < N_BUFFERS; ++j )
					onload_zc_release_buffers(fd, &iovec[i][j].buf, 1);
			} 
			else if( mmsg[i].rc < sum_over_j(iovec[i][j].iov_len) ) 
			{
				/* partial success; sum_over_j() is shorthand for the
				* total of iov_len over all buffers of this message */
				/* TODO use mmsg[i].rc to determine which buffers in
				* iovec[i] array are sent and which are still
				* owned by application */
			} 
			else 
			{
				/* Whole message sent, buffers now owned by Onload */
			}
		} 
		else 
		{
			/* mmsg[i].rc is not set, this message was not sent */
			for( j = 0; j < N_BUFFERS; ++j )
				onload_zc_release_buffers(fd, &iovec[i][j].buf, 1);
		}
	}
}

The example above demonstrates error code handling 
and contains some examples of bad practice 
where buffers are allocated and populated on the critical path.
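
The `sum_over_j(...)` expression in the example is shorthand rather than a real function. A minimal sketch of how it, and the "which buffers were sent" TODO, could be written is shown below. The struct `demo_zc_iovec` is a stand-in that carries only the `iov_base` and `iov_len` fields used in the examples, and both helper names are hypothetical. How Onload treats a buffer that is only partly covered by the byte count is not stated here, so the second helper counts only buffers that are covered completely.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct onload_zc_iovec; only the fields used in the
 * examples above are reproduced. */
struct demo_zc_iovec {
	void*  iov_base;
	size_t iov_len;
};

/* Total payload length of one message made of n_bufs buffers. */
static size_t total_iov_len(const struct demo_zc_iovec* iov, int n_bufs)
{
	size_t total = 0;
	int j;
	assert(iov != NULL || n_bufs == 0);
	for( j = 0; j < n_bufs; ++j )
		total += iov[j].iov_len;
	return total;
}

/* Number of leading buffers completely covered by `sent` bytes. */
static int buffers_fully_sent(const struct demo_zc_iovec* iov, int n_bufs,
                              size_t sent)
{
	int j;
	for( j = 0; j < n_bufs && sent >= iov[j].iov_len; ++j )
		sent -= iov[j].iov_len;
	return j;
}
```

With these helpers the partial-success branch would compare `mmsg[i].rc` against `total_iov_len(iovec[i], N_BUFFERS)` and treat the buffers from index `buffers_fully_sent(...)` onwards as candidates for release.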

Zero‐Copy Send ‐ Full Example

static struct onload_zc_iovec iovec[NUM_ZC_BUFFERS];
static ssize_t do_send_zc(int fd, const void* buf, size_t len, int flags)
{
	int bytes_done, rc, i, bufs_needed;
	struct onload_zc_mmsg mmsg;
	mmsg.fd = fd;
	mmsg.msg.iov = iovec;
	bytes_done = 0;
	mmsg.msg.msghdr.msg_iovlen = 0;
	while( bytes_done < len ) 
	{
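		if( iovec[mmsg.msg.msghdr.msg_iovlen].iov_len > (len - bytes_done) )
			iovec[mmsg.msg.msghdr.msg_iovlen].iov_len = (len - bytes_done);
		/* copy the next chunk of the caller's data into the current buffer */
		memcpy(iovec[mmsg.msg.msghdr.msg_iovlen].iov_base,
			(const char*)buf + bytes_done,
			iovec[mmsg.msg.msghdr.msg_iovlen].iov_len);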
		bytes_done += iovec[mmsg.msg.msghdr.msg_iovlen].iov_len;
		++mmsg.msg.msghdr.msg_iovlen;
	}

	rc = onload_zc_send(&mmsg, 1, 0);
	if( rc != 1 /* Number of messages we sent */ ) 
	{
		printf("onload_zc_send failed to process msg, %d\n", rc);
		return -1;
	} 
	else 
	{
		if( mmsg.rc < 0 )
			printf("onload_zc_send message error %d\n", mmsg.rc);
		else 
		{
			/* Iterate over the iovecs; 
			any that were sent we must replenish. */
			/* restart the count: bytes_done now tracks bytes confirmed sent */
			i = 0; bufs_needed = 0; bytes_done = 0;
			while( i < mmsg.msg.msghdr.msg_iovlen ) 
			{
				if( bytes_done == mmsg.rc ) 
				{
					printf("onload_zc_send did not send iovec %d\n", i);
					/* In other buffer allocation schemes 
					we would have to release these buffers, 
					but seems pointless 
					as we guarantee at the end of this function 
					to have iovec array full, 
					so do nothing. */
				} 
				else 
				{
					/* Buffer sent, now owned by Onload, 
					so replenish iovec array */
					++bufs_needed;
					bytes_done += iovec[i].iov_len;
				}
				
				++i;
			}

			if( bufs_needed ) /* replenish the iovec array */
				rc = onload_zc_alloc_buffers(
					fd, iovec, bufs_needed, ONLOAD_ZC_BUFFER_HDR_TCP);
		}
	}
	
	/* Set a return code that looks similar enough to send(). 
	NB. we're not setting (and neither does onload_zc_send()) errno */
	if( mmsg.rc < 0 ) return -1;
	else return bytes_done;
}

Receive Filtering API

The Onload Extensions Receive Filtering API 
allows a user‐defined callback to inspect data 
received on a UDP socket 
before it enters the socket receive buffer. 
It provides an alternative to the onload_zc_recv() function described in the previous sections.

An application using the Receive Filtering API can continue to use any POSIX function 
on the socket, e.g. select(), poll(), epoll_wait() or recv(), 
but must not use the onload_zc_recv() function.

Receive Filtering API

The Onload Extensions Receive Filtering API provides the following components:
•#include <onload/extensions_zc.h>
In addition to the common components, 
an application should include this header file 
which contains all function prototypes and constant values required when using the API.
This file includes comprehensive documentation, required data structures and function definitions.
The Receive Filtering API is a variation 
on the zero‐copy receive 
whereby the normal socket methods are used for accessing the data, 
but the application can specify a callback to inspect each datagram before it is received.

typedef enum onload_zc_callback_rc
(*onload_zc_recv_filter_callback)(struct onload_zc_msg *msg, void* arg, int flags);
int onload_set_recv_filter(
	int fd,
	onload_zc_recv_filter_callback filter,
	void* cb_arg,
	int flags);

The onload_set_recv_filter() function returns immediately.
The callback is invoked once per message in the context of subsequent calls to recv(), recvmsg() etc. 	
The cb_arg value is passed to the callback along with the message. 
The flags argument of the callback is set to ONLOAD_ZC_MSG_SHARED 
if the message is shared with other sockets, 
in which case the callback should take care not to modify the contents of the iovec.
The message can be found in msg->iov[], 
and the iovec is of length msg->msghdr.msg_iovlen.
The callback must return ONLOAD_ZC_CONTINUE to allow the message 
to be delivered to the application. 
Other return codes such as ONLOAD_ZC_TERMINATE and ONLOAD_ZC_MODIFIED 
are deprecated and no longer supported.
This function can only be used with accelerated sockets 
(those being handled by Onload). 
If a socket has been handed over to the kernel stack 
(e.g. because it has been bound to an address 
that is not routed over an SFC interface), onload_set_recv_filter() returns -ESOCKTNOSUPPORT.

Receive Filter ‐ Example

static enum onload_zc_callback_rc
zc_recv_filter(struct onload_zc_msg* msg, void* arg, int flags)
{
	return ONLOAD_ZC_CONTINUE;
}

struct zc_recv_state zc_filter_state;
static int do_zc_filter(controller_t* c)
{
	zc_filter_state.c = c;
	zc_filter_state.bytes = 0;
	return onload_set_recv_filter(the_socket, zc_recv_filter, &zc_filter_state, 0);
}

The onload_set_recv_filter() function should not be used together with the onload_zc_recv() function.
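
The callback in the example does nothing except let the datagram through. A slightly more useful filter might gather statistics while leaving the data untouched, which also stays safe when ONLOAD_ZC_MSG_SHARED is set. The sketch below assumes stand-in types (`demo_zc_iovec`, `demo_zc_msg`, `demo_callback_rc`) that mirror only the fields named in the text; a real callback would use the types from <onload/extensions_zc.h> and return ONLOAD_ZC_CONTINUE.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins mirroring only msg->iov[] and msg->msghdr.msg_iovlen. */
struct demo_zc_iovec { void* iov_base; size_t iov_len; };
struct demo_msghdr   { size_t msg_iovlen; };
struct demo_zc_msg {
	struct demo_zc_iovec* iov;
	struct demo_msghdr    msghdr;
};
enum demo_callback_rc { DEMO_ZC_CONTINUE = 0 };

/* Hypothetical per-socket state handed to the filter through cb_arg. */
struct demo_filter_state {
	size_t datagrams;
	size_t bytes;
};

/* Counts datagrams and payload bytes; never modifies the message. */
static enum demo_callback_rc
demo_recv_filter(struct demo_zc_msg* msg, void* arg, int flags)
{
	struct demo_filter_state* st = arg;
	size_t i;
	(void)flags; /* read-only access, so a shared message needs no special care */
	assert(msg != NULL && st != NULL);
	for( i = 0; i < msg->msghdr.msg_iovlen; ++i )
		st->bytes += msg->iov[i].iov_len;
	++st->datagrams;
	return DEMO_ZC_CONTINUE;
}
```

Because the callback runs in the context of recv(), recvmsg() and similar calls, keeping it this short avoids adding latency to the receive path.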

Templated Sends

“Templated sends” is a feature for the SFN7000, SFN8000 and X2 series adapters 
that builds on top of TX PIO 
to provide further transmit latency improvements. 
Refer to the Programmed I/O documentation for details of TX PIO.

Templated sends can be used in applications 
that know the majority of the content of packets 
in advance of when the packet is to be sent. 
For example, 
a market feed handler may publish packets 
that vary only in the specific value of certain fields, 
possibly different symbols and price information, 
but are otherwise identical.
The Onload templated sends feature uses the Onload Extensions API to generate the packet template 	
which is then instantiated on the adapter ready to receive the “missing” data 
before each transmission. Templated sends involve allocating a template of a packet 
on the adapter containing the bulk of the data prior to the time of sending the packet. 
Then, when the packet is to be sent, 
the remaining data is pushed to the adapter to complete and send the packet.
When the socket associated with an allocated template 
is shut down or closed, 
allocated templates are freed and subsequent calls to access these templates will return an error.
The API details are available in the Onload distribution at:
•/src/include/onload/extensions_zc.h

MSG Template

struct oo_msg_template 
{
	/* To verify subsequent templated calls are used with the same socket */
	oo_sp oomt_sock_id;
};

MSG Update

/* An update_iovec describes a single template update */
struct onload_template_msg_update_iovec 
{
	void* otmu_base; /* Pointer to new data */
	size_t otmu_len; /* Length of new data */
	off_t otmu_offset; /* Offset within template to update */
	unsigned otmu_flags; /* For future use. Must be set to 0. */
};

MSG Allocation

The template is populated from an array of iovecs that specifies the initial packet data. 
This function is called once to allocate the packet template 
and populate the template with the bulk of the payload data.
extern int onload_msg_template_alloc(
	int fd,
	struct iovec* initial_msg,
	int iovlen,
	onload_template_handle* handle,
	unsigned flags);

fd
	File descriptor to send on
initial_msg
	Array of iovecs which are the bulk of the payload
iovlen
	Length of the initial_msg array
handle
	Template handle, used to refer to this template
flags
	See notes below. Can also be set to zero

The initial iovec array passed to onload_msg_template_alloc() must have at least one element 
having a valid address and non‐zero length.
If PIO allocation fails, 
then onload_msg_template_alloc() will fail. 
Setting the flags to ONLOAD_TEMPLATE_FLAGS_PIO_RETRY will force allocation 
without PIO while attempting to allocate the PIO 
in later calls to onload_msg_template_update().

MSG Template Update

Takes an array of onload_template_msg_update_iovec to describe changes 
to the base packet populated by the onload_msg_template_alloc() function. 
Each of the update iovecs should describe a single change. 
The update function is used to overwrite existing template content 
or to send the complete template content 
when the ONLOAD_TEMPLATE_FLAGS_SEND_NOW flag is set.
extern int onload_msg_template_update(
	int fd,
	onload_template_handle* handle,
	struct onload_template_msg_update_iovec* updates,
	int ulen,
	unsigned flags);
fd
	File descriptor to send on
handle
	Template handle, returned from the alloc function
updates
	Array of onload_template_msg_update_iovec 
	each of which is a change to the template payload
ulen
	Length of updates array (i.e. the number of changes)
flags
	See notes below. Can also be set to zero

If the ONLOAD_TEMPLATE_FLAGS_SEND_NOW flag is set, 
ownership of the template is passed to Onload.
This function can be called multiple times and changes are cumulative.
Flags:
ONLOAD_TEMPLATE_FLAGS_SEND_NOW
	Perform the template update, 
	send the template contents and pass ownership of the template to Onload.
	To send without updating template contents, set updates=NULL and ulen=0 together with this flag.
ONLOAD_TEMPLATE_FLAGS_DONTWAIT (same as MSG_DONTWAIT) 
	Do not block.
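
To make "changes are cumulative" concrete, the sketch below applies update records to an ordinary byte buffer standing in for the template. It does not call Onload: `demo_update` is a stand-in with the same three data fields as onload_template_msg_update_iovec (using size_t for the offset), `demo_apply_updates` is a hypothetical name, and the bounds check is an assumption about sensible behaviour rather than something the text above specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for onload_template_msg_update_iovec (otmu_base, otmu_len,
 * otmu_offset); otmu_flags is omitted because it must be 0 anyway. */
struct demo_update {
	const void* base;   /* pointer to new data */
	size_t      len;    /* length of new data */
	size_t      offset; /* offset within template to update */
};

/* Apply each update in order. Returns 0 on success, or -1 if any update
 * would write past the end of the template (nothing is written then). */
static int demo_apply_updates(unsigned char* tmpl, size_t tmpl_len,
                              const struct demo_update* updates, int ulen)
{
	int i;
	assert(tmpl != NULL);
	for( i = 0; i < ulen; ++i )
		if( updates[i].offset > tmpl_len ||
		    updates[i].len > tmpl_len - updates[i].offset )
			return -1;
	for( i = 0; i < ulen; ++i )
		memcpy(tmpl + updates[i].offset, updates[i].base, updates[i].len);
	return 0;
}
```

Calling the helper twice with different offsets leaves both changes in the buffer, which mirrors how successive onload_msg_template_update() calls build up the final packet before the send-now flag is used.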

MSG Template Abort

Abort use of the template without sending the template 
and free the template resources including the template handle and PIO region.
extern int onload_msg_template_abort(
	int fd,
	onload_template_handle* handle);
fd
	File descriptor owning the template
handle
	Template handle, used to refer to this template

Delegated Sends API

The delegated send API can lower the latency overhead incurred 
when calling send() on TCP sockets 
by controlling TCP socket creation and management through Onload, 
but allowing TCP sends directly through the Onload layer 2 ef_vi API or other similar API.

An application using the delegated sends API will prepare a packet buffer 
with IP/TCP header data, 
before adding payload data to the packet. 
The packet buffer can be prepared in advance 
and payload added just before the send is required.
After each delegated send, 
the actual data sent (and length of that data) is returned to Onload. 
This allows Onload to update the TCP internal state 
and have the data to hand if retransmissions are required on the socket.
This feature is intended for applications 
that make sporadic TCP sends 
as opposed to large amounts of bi‐directional TCP traffic. 
The API should be used with caution to send small amounts of TCP data. 
Although the packet buffer can be prepared in advance of the send, 
the idea is to complete the delegated send operation (onload_delegated_send_complete()) 
soon after the initial send to maintain the integrity of the TCP internal state 
i.e. so that sequence/acknowledgment numbers are correct.
The user is responsible 
for serialization when using the delegated send API. 
The first call should always be onload_delegated_send_prepare(). 
If a normal send is required following the prepare, 
the user should use onload_delegated_send_cancel().

For a given file descriptor, 
while a delegated send is in progress, 
and until complete has been called, 
the user should NOT attempt any standard send(), write(), sendfile(), close() etc. operations.

For best latency the application should call onload_delegated_send_complete() 
as soon as a delegated send is complete. 
This allows Onload to continue if retransmissions are required.
When a link partner has already acknowledged data 
before complete has been called, 
Onload will not have to copy the sent data to the TCP retransmit queue. 
So delaying the complete call may avoid a data copy but latency may suffer in the event of packet loss.

Standard send vs. Delegated Send

A packet could be delayed 
before sending when the receiver or network is not ready. 
When this occurs using delegated send, 
the onload_delegated_send_prepare() function will return zero values 
in the cong/send window fields of the delegated send state 
and the caller can elect to send with the standard method.

The Onload distribution includes the exchange.c and trader_onload_ds_efvi.c example applications 
to demonstrate the delegated sends API. 
Variable and constant definitions, 
including socket flags and function return codes 
required when using the API, 
can be found in the extensions.h header file.
•openonload‐<version>/src/tests/trade_sim
•openonload‐<version>/build/gnu_x86_64/tests/trade_sim

struct onload_delegated_send

struct onload_delegated_send {
	void* headers;
	int headers_len; /* buffer len on input, headers len on output */
	int mss; /* one packet payload may not exceed this */
	int send_wnd; /* send window */
	int cong_wnd; /* congestion window */
	int user_size; /* the "size" value from send_prepare() call */
	int tcp_seq_offset;
	int ip_len_offset;
	int ip_tcp_hdr_len;
	int reserved[5];
};
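
The field comments suggest that a single packet is bounded by mss, by both windows and by the size reserved in the prepare call. A minimal sketch of that reading follows; treating the minimum of the four values as the limit is an assumption, and `demo_ds_limits` and `demo_next_payload` are hypothetical names that carry only the four relevant fields of struct onload_delegated_send.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in carrying only the four size-related fields shown above. */
struct demo_ds_limits {
	int mss;
	int send_wnd;
	int cong_wnd;
	int user_size;
};

static int demo_min(int a, int b) { return a < b ? a : b; }

/* Largest payload for the next packet, or 0 when a window is closed. */
static int demo_next_payload(const struct demo_ds_limits* ds)
{
	int n;
	assert(ds != NULL);
	n = demo_min(demo_min(ds->mss, ds->user_size),
	             demo_min(ds->send_wnd, ds->cong_wnd));
	return n > 0 ? n : 0;
}
```

A result of 0 corresponds to the case described earlier, where the prepare call reports zero in the send or congestion window fields and the caller may prefer the standard send path.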

onload_delegated_send_rc

The onload_delegated_send_prepare() function can return different return codes identified below.

enum onload_delegated_send_rc {
	ONLOAD_DELEGATED_SEND_RC_OK = 0,
		/* Send successful. */
	ONLOAD_DELEGATED_SEND_RC_BAD_SOCKET,
		/* Socket is non-onloaded, non-TCP, non-connected or shut down for writing. */
	ONLOAD_DELEGATED_SEND_RC_SMALL_HEADER,
		/* headers_len value is too small. */
	ONLOAD_DELEGATED_SEND_RC_SENDQ_BUSY,
		/* Send queue is not empty. */
	ONLOAD_DELEGATED_SEND_RC_NOWIN,
		/* Send window is closed, the peer cannot receive more data. */
	ONLOAD_DELEGATED_SEND_RC_NOARP,
		/* Failed to find the destination MAC address.
		 * See extensions.h for further information. */
	ONLOAD_DELEGATED_SEND_RC_NOCWIN,
		/* Congestion window is closed.
		 * It is a violation of the TCP protocol to send anything.
		 * However, all the headers are filled in
		 * and the caller may use them for sending. */
};
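
One way an application might react to these return codes is sketched below. The policy is purely illustrative and not defined by Onload: `demo_ds_rc` mirrors the order of the enum above so that the sketch compiles without the Onload headers, and `demo_action` is a hypothetical set of application decisions. The conservative choice made here is to fall back to a normal send whenever a window is closed, the send queue is busy or ARP is unresolved.

```c
#include <assert.h>

/* Stand-in mirroring the order of enum onload_delegated_send_rc. */
enum demo_ds_rc {
	DEMO_RC_OK = 0,
	DEMO_RC_BAD_SOCKET,
	DEMO_RC_SMALL_HEADER,
	DEMO_RC_SENDQ_BUSY,
	DEMO_RC_NOWIN,
	DEMO_RC_NOARP,
	DEMO_RC_NOCWIN
};

/* Hypothetical application policy. */
enum demo_action {
	DEMO_SEND_DELEGATED, /* headers are ready, carry on with the delegated send */
	DEMO_SEND_NORMAL,    /* cancel the delegated send, then use send() */
	DEMO_FIX_AND_RETRY,  /* caller error: supply a larger headers buffer */
	DEMO_FAIL            /* socket cannot be used for delegated sends */
};

static enum demo_action demo_action_for(enum demo_ds_rc rc)
{
	assert((int)rc >= DEMO_RC_OK && (int)rc <= DEMO_RC_NOCWIN);
	switch( rc ) {
	case DEMO_RC_OK:           return DEMO_SEND_DELEGATED;
	case DEMO_RC_SMALL_HEADER: return DEMO_FIX_AND_RETRY;
	case DEMO_RC_BAD_SOCKET:   return DEMO_FAIL;
	case DEMO_RC_SENDQ_BUSY:
	case DEMO_RC_NOWIN:
	case DEMO_RC_NOARP:
	case DEMO_RC_NOCWIN:       return DEMO_SEND_NORMAL;
	}
	return DEMO_FAIL; /* unknown code in a build without assertions */
}
```

An application that is willing to use the filled-in headers despite a closed congestion window could map the NOCWIN case differently; the text above notes that doing so violates the TCP protocol.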

onload_delegated_send_prepare

Prepare to send up to size bytes. 
Allocate TCP headers and prepare them with Ethernet IP/TCP header data 
‐ including current sequence number and acknowledgment number.
enum onload_delegated_send_rc onload_delegated_send_prepare (
	int fd,
	int size,
	unsigned flags,
	struct onload_delegated_send* )

fd
	File descriptor to send on
size
	Size of payload data
flags
	See below
struct onload_delegated_send*
	See struct onload_delegated_send

This function can be called speculatively 
so that the packet buffer is prepared in advance, 
headers are added so that the packet payload data can be added immediately 
before the send is required.
This function assumes the packet length is equal to MSS 
in which case there is no need to call onload_delegated_send_tcp_update().
Flags are used for ARP resolution:
•default flags = 0
•ONLOAD_DELEGATED_SEND_FLAG_IGNORE_ARP 
‐ do not do ARP lookup, the caller will provide destination MAC address.
•ONLOAD_DELEGATED_SEND_FLAG_RESOLVE_ARP 
‐ if ARP information is not available, send a speculative TCP_ACK 
to provoke kernel into ARP resolution 
‐ wait up to 1ms for ARP information to appear.
NOTE: TCP send window/congestion windows must be respected 
during delegated sends. See extensions.h for flags and return code values.

onload_delegated_send_tcp_update

This function does not send TCP data, 
but is called to update packet headers with the sequence number 
and flags following successive sends via the onload_delegated_send_tcp_advance() function.
void onload_delegated_send_tcp_update (
	struct onload_delegated_send*,
	int size,
	int flags )
struct onload_delegated_send*	
	See struct onload_delegated_send
size
	Size of payload data
flags
	See below

This function is called when, during a send, 
the payload length is not equal to the MSS value. 
See onload_delegated_send_prepare above.
Flag TCP_FLAG_PSH is expected to be set on the last packet when sending a large data chunk.

onload_delegated_send_tcp_advance

Advance TCP headers after sending a TCP packet. 
This function is good for:
‐ sending a few small packets in rapid succession
‐ sending large data chunk (>MSS) over multiple packets
The sequence number is updated for each outgoing packet. 
When a packet has been sent, 
the application must call onload_delegated_send_tcp_update() to update packet headers 
with the payload length 
‐ thereby ensuring that the sequence number is correct for the next send.
This function does not update the ACK number in outgoing packets. 
The ACK number in successive outgoing packets is the value 
from the last call to the onload_delegated_send_prepare() function.
The advance function is used to send a small number of successive outgoing packets, 
but the application should then call onload_delegated_send_complete() to return control to Onload 
in order to maintain sequence/acknowledgment number integrity 
and allow Onload to remove sent data from the retransmit queue.
void onload_delegated_send_tcp_advance (
	struct onload_delegated_send*,
	int bytes )

struct onload_delegated_send*
	See struct onload_delegated_send
bytes
	Number of bytes sent
When sending a packet using multiple sends, 
the function is called to update the header data with the number of bytes after each send.
The actual data sent is not returned to Onload 
until the function onload_delegated_send_complete() is called.
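
Sending a chunk larger than MSS over several packets comes down to simple arithmetic on the payload length. The sketch below only plans the packet sizes and does not call Onload; `demo_plan_segments` is a hypothetical helper. The call order given in its comment (update before transmitting, advance afterwards) is one reading of the descriptions above and should be checked against extensions.h and the trade_sim examples.

```c
#include <assert.h>
#include <stddef.h>

/* Split `total` payload bytes into packets of at most `mss` bytes.
 * Writes the packet sizes to lens[] and returns the number of packets,
 * or -1 if lens[] (capacity max_lens) is too small.
 *
 * For each planned packet k a real sender would, under the assumptions
 * stated above:
 *   1. call onload_delegated_send_tcp_update() when lens[k] != mss,
 *      setting TCP_FLAG_PSH on the last packet;
 *   2. transmit headers plus payload, e.g. through ef_vi;
 *   3. call onload_delegated_send_tcp_advance() with lens[k].
 * onload_delegated_send_complete() follows once all packets are out. */
static int demo_plan_segments(int total, int mss, int* lens, int max_lens)
{
	int sent = 0, n = 0;
	assert(total >= 0 && mss > 0 && lens != NULL);
	while( sent < total && n < max_lens ) {
		int len = total - sent < mss ? total - sent : mss;
		lens[n++] = len;
		sent += len;
	}
	return sent == total ? n : -1;
}
```

Only the final packet can be shorter than MSS, which is why the update call is needed at most once per chunk when the chunk is sent in MSS-sized pieces.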

onload_delegated_send_complete

Following a delegated send, 
this function is used to return the actual data sent (and length of that data) 
to Onload which will update the internal TCP state 
i.e. 
sequence numbers and remove packets 
from the retransmit queue (when appropriate ACKs are received).
int onload_delegated_send_complete (
	int fd,
	const struct iovec *,
	int iovlen,
	int flags )
fd
	The file descriptor.
struct iovec
	Pointer to the data sent
iovlen
	Number of entries in the iovec array
flags
	MSG_DONTWAIT | MSG_NOSIGNAL
Returns the number of bytes accepted, or -1 with errno set if an error occurs.

Onload is unable to do any retransmit 
until this function has been called.
This function should be called 
even if some (but not all) bytes specified in the prepare function have been sent. 
The user must also call onload_delegated_send_cancel() 
if some of the bytes are not going to be sent 
i.e. reserved‐but‐not‐sent 
‐ see onload_delegated_send_cancel() notes below.
NOTE: 
This function differs from the send() function in its handling 
of a “resource temporarily unavailable” or “operation would block” situation. 
This function returns 0, but the send() function would return -1 with errno set to EAGAIN.
This function can block because of SO_SNDBUF limitation and will ignore the SO_SNDTIMEO value.
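
Code that already handles send() results may find it convenient to normalise the "would block" case. The adapter below is a minimal sketch of that idea; `demo_complete_as_send` is a hypothetical name, and it operates on a return value that the caller has already obtained from onload_delegated_send_complete().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* rc:        value returned by onload_delegated_send_complete()
 * requested: total number of bytes passed in the iovec array
 * Returns rc unchanged, except that "accepted nothing" (rc == 0 while
 * requested > 0) is reported the way send() would report it:
 * -1 with errno set to EAGAIN. */
static long demo_complete_as_send(long rc, size_t requested)
{
	assert(rc < 0 || (size_t)rc <= requested);
	if( rc == 0 && requested > 0 ) {
		errno = EAGAIN;
		return -1;
	}
	return rc;
}
```

A genuine zero-length completion (requested == 0) still returns 0, so the adapter only changes the one case the note above calls out.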

onload_delegated_send_cancel

Tells Onload that no more delegated sends are planned.
Normal send(), shutdown() or close() etc can be called after this call.
int onload_delegated_send_cancel (int fd)
fd
	The file descriptor on which the delegated send was prepared.
	When tcp headers have been allocated 
	with onload_delegated_send_prepare(), 
	but it is subsequently required to do a normal send,
	this function can be used to cancel the delegated send operation 
	and do a normal send.
	There is no need to call this function before calling onload_delegated_send_prepare().
	There is no need to call this function 
	if all the bytes specified in the onload_delegated_send_prepare() function have been sent.
	If some, but not all, bytes have been sent, you must call onload_delegated_send_complete() 
	for the sent bytes THEN call onload_delegated_send_cancel() for the remaining 
	(reserved-but-not-sent) bytes. 
	This applies even if the reason for not sending is that 
	the window limits returned from the prepare function have been reached.
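
The rules above reduce to a comparison between the bytes reserved in the prepare call and the bytes actually sent. The sketch below encodes that decision; the flag names and `demo_followup_calls` are hypothetical, and the function only reports which calls are needed without making them.

```c
#include <assert.h>

/* Bit flags describing which follow-up calls are needed. */
enum {
	DEMO_NEED_COMPLETE = 1, /* onload_delegated_send_complete() for the sent bytes */
	DEMO_NEED_CANCEL   = 2  /* onload_delegated_send_cancel() for the remainder */
};

/* reserved: the size passed to onload_delegated_send_prepare()
 * sent:     bytes actually transmitted by the application */
static int demo_followup_calls(int reserved, int sent)
{
	int calls = 0;
	assert(reserved >= 0 && sent >= 0 && sent <= reserved);
	if( sent > 0 )
		calls |= DEMO_NEED_COMPLETE;
	if( sent < reserved )
		calls |= DEMO_NEED_CANCEL;
	return calls;
}
```

When both flags are set, the complete call has to come first and the cancel call second, as stated above.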