Onload Functionality

Onload Transparency

Onload accelerates connections using a Solarflare interface, which are processed by the Onload stack, 
whilst those not using a Solarflare interface are 
transparently passed to the kernel stack.

Onload Stacks

An Onload 'stack' is an instance of a TCP/IP stack. 
The stack includes transmit and receive buffers, 
open connections 
and the associated port numbers 
and stack options. 
Each stack has associated with it one or more Virtual NICs.

In normal usage, 
each accelerated process will have its own Onload stack 
shared by all connections created by the process.
It is also possible for multiple processes to share a single Onload stack instance
and for a single application to have more than one Onload stack.

Virtual Network Interface (VNIC)

The Solarflare network adapter supports 
1024 transmit queues, 
1024 receive queues, 
1024 event queues 
and 1024 timer resources 
per network port.

A VNIC (virtual network interface) consists 
of one unique instance 
of each of these resources, 
which gives the VNIC client, i.e. the Onload stack, 
an isolated and safe mechanism for sending and receiving network traffic.

An Onload stack allocates one VNIC per Solarflare network port 
so it has a dedicated send and receive channel from user mode.

Functional Overview

Maximum Number of Network Interfaces

A maximum of 32 network interfaces 
can be registered with the Onload driver.

Whitelist and Blacklist Interfaces

The user is able to select 
which Solarflare interfaces are to be used by Onload. 
Onload module options can be specified 
in a user‐created file in the /etc/modprobe.d directory:

options onload intf_white_list=eth4
options onload intf_black_list="eth5 eth6"

The per‐stack environment variables 
EF_INTERFACE_BLACKLIST and EF_INTERFACE_WHITELIST 
are space‐separated lists of interfaces.
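
For example, interface selection can also be applied per process at launch (the application name is hypothetical):

EF_INTERFACE_WHITELIST=eth4 onload ./myapp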

Onloaded PIDs

The onload_fuser command identifies processes accelerated by Onload:

# onload_fuser -v
9886 ping

Only processes that have created an Onload stack are present. 
Processes which are loaded under Onload, 
but have not created any sockets, are not present.

Onload and File Descriptors, Stacks and Sockets

For an Onloaded process 
it is possible to identify 
the file descriptors, 
Onload stacks 
and sockets 
being accelerated by Onload:

# ls -l /proc/9886/fd
total 0
lrwx------ 1 root root 64 May 14 14:09 0 -> /dev/pts/0
lrwx------ 1 root root 64 May 14 14:09 1 -> /dev/pts/0
lrwx------ 1 root root 64 May 14 14:09 2 -> /dev/pts/0
lrwx------ 1 root root 64 May 14 14:09 3 -> onload:[tcp:6:3]
lrwx------ 1 root root 64 May 14 14:09 4 -> /dev/pts/0
lrwx------ 1 root root 64 May 14 14:09 5 -> /dev/onload
lrwx------ 1 root root 64 May 14 14:09 6 -> onload:[udp:6:2]

Accelerated file descriptors are listed 
as symbolic links to /dev/onload. 
Accelerated sockets are described 
in [protocol:stack:socket] format.

Linux Sysctls

The Linux directory /proc/sys/net/ipv4 contains default settings 
which tune different parts of the IPv4 networking stack. 
In many cases Onload takes its default settings from the values 
in this directory.
In some cases the default can be overridden, 
for a specified process or socket, 
using socket options or with Onload environment variables.
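
A minimal sketch of such a per-socket override using the standard sockets API (the buffer size shown is an arbitrary example value):

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int rcvbuf = 1 << 20;  /* request a 1 MiB receive buffer; example value */

    /* Overrides the system default receive buffer size for this socket
       only; all other sockets keep the defaults taken from /proc. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
        perror("setsockopt(SO_RCVBUF)");
    return 0;
}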

Namespaces

Onload includes support for all Linux namespace types.
An Onload stack can exist in only one network namespace.

User‐space Control Plane Server

Starting from the onload‐201710 release, 
Onload deploys a user‐space control plane daemon.
A single onload_cp_server process is created per network namespace 
in which there is an active Onload stack.

SO_BINDTODEVICE

In response to the setsockopt() function call with SO_BINDTODEVICE, 	
sockets identifying non‐Solarflare interfaces 
will be handled by the kernel 
and all sockets identifying Solarflare interfaces 
will be handled by Onload. 
All sends from a socket are sent via the bound interface. 
Only traffic received over the bound interface 
will be delivered to the socket.
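
A minimal sketch of binding a socket to a named interface (the interface name is an example; the call typically requires the CAP_NET_RAW capability):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    const char ifname[] = "eth4";  /* example Solarflare interface name */

    /* After this call, all sends leave via eth4 and only traffic received
       over eth4 is delivered to this socket. */
    if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname)) < 0)
        perror("setsockopt(SO_BINDTODEVICE)");
    return 0;
}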

Multiplexed I/O

The general behavior of the poll(), select() and epoll_wait() functions 
with Onload is as follows:

•If there are operations ready on any file descriptors, 
poll(), select() and epoll_wait() will return immediately. 
Refer to the Poll, Select and Epoll subsections 
for specific behavior details.
•If there are no file descriptors ready 
and spinning is not enabled, 
calls to poll(), select() and epoll_wait() will enter the kernel and block.
•In the cases of poll() and select(), 
when the set contains file descriptors that are not accelerated sockets, 
there is a slight latency overhead 
as Onload must make a system call to determine 
the readiness of these sockets. 
There is no such cost 
when using epoll_wait(), 
where a system call is only needed 
when non‐Onload descriptors become ready.
To ensure that non‐accelerated (kernel) file descriptors are checked 
when there are no events ready on accelerated (Onload) descriptors, 
set all of EF_SELECT_FAST, EF_POLL_FAST, 
EF_POLL_FAST_USEC and EF_SELECT_FAST_USEC to zero 
(see the example after this list).
•If there are no file descriptors ready and spinning is enabled, 
Onload will spin to ensure 
that accelerated sockets are polled a specified number of times 
before unaccelerated sockets are examined. 
This reduces the overhead incurred 
when Onload has to call into the kernel 
and reduces latency on accelerated sockets.
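
For example, the fast-path options described above can be disabled when launching an application under Onload (the application name is hypothetical):

EF_POLL_FAST=0 EF_SELECT_FAST=0 EF_POLL_FAST_USEC=0 EF_SELECT_FAST_USEC=0 onload ./myapp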

The following subsections discuss the use 
of these I/O functions 
and Onload environment variables 
that can be used 
to manipulate behavior of the I/O operation.

Poll, ppoll

The poll(), ppoll() file descriptor set can consist 
of both accelerated and non‐accelerated file descriptors.

The environment variable EF_UL_POLL 
enables/disables acceleration of the poll(), ppoll() function calls. 
Onload supports the following options for the EF_UL_POLL variable:
0    Disable acceleration at user‐level. Calls to poll(), ppoll() are handled by the kernel. Spinning cannot be enabled.
1    Enable acceleration at user‐level. Calls to poll(), ppoll() are processed at user level. Spinning can be enabled and interrupts are avoided until an application blocks.
Additional environment variables can be employed 
to control the poll(), ppoll() functions 
and to give priority to accelerated sockets 
over non‐accelerated sockets and other file descriptors. 
Refer to 
EF_POLL_FAST, 
EF_POLL_FAST_USEC 
and EF_POLL_SPIN in Parameter Reference on page 210.
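
As an illustration, a minimal sketch that polls an accelerated socket alongside a non-accelerated descriptor (stdin); run under onload with EF_UL_POLL=1, the poll() call itself is processed at user level:

#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    struct pollfd fds[2];

    fds[0].fd = socket(AF_INET, SOCK_DGRAM, 0);  /* accelerated when run under onload */
    fds[0].events = POLLIN;
    fds[1].fd = 0;                               /* stdin: a non-accelerated descriptor */
    fds[1].events = POLLIN;

    int n = poll(fds, 2, 1000);  /* wait up to one second for either descriptor */
    printf("%d descriptor(s) ready\n", n);
    return 0;
}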

Select, pselect

The select(), pselect() file descriptor set 
can consist of both accelerated and non‐accelerated file descriptors. 	
The environment variable EF_UL_SELECT 
enables/disables acceleration of the select(), pselect() function calls. 
Onload supports the following options for the EF_UL_SELECT variable:
0    Disable acceleration at user‐level. Calls to select(), pselect() are handled by the kernel. Spinning cannot be enabled.
1    Enable acceleration at user‐level. Calls to select(), pselect() are processed at user level. Spinning can be enabled and interrupts are avoided until an application blocks.
Additional environment variables can be employed 
to control the select(), pselect() functions 
and to give priority to accelerated sockets 
over non‐accelerated sockets and other file descriptors. 
Refer to 
EF_SELECT_FAST 
and EF_SELECT_SPIN 
in Parameter Reference on page 210.

Epoll

The epoll set of functions, 
epoll_create(), epoll_ctl(), epoll_wait(), epoll_pwait(), 
are accelerated in the same way as poll and select. 
The environment variable EF_UL_EPOLL 
enables/disables epoll acceleration. 
Refer to the release change log for enhancements 
and changes to epoll behavior.
With Onload, an epoll set can consist 
of both Onload file descriptors and kernel file descriptors. 
Onload supports the following options 
for the EF_UL_EPOLL environment variable:

0    Accelerated epoll is disabled, and epoll_ctl(), epoll_wait() and epoll_pwait() function calls are processed in the kernel. Other function calls such as send() and recv() are still accelerated. Interrupt avoidance does not function and spinning cannot be enabled. If a socket is handed over to the kernel stack after it has been added to an epoll set, it will be dropped from the epoll set. onload_ordered_epoll_wait() is not supported.

1    Function calls to epoll_ctl(), epoll_wait() and epoll_pwait() are processed at user level. Delivers the best latency, except when the number of accelerated file descriptors in the epoll set is very large. This option also gives the best acceleration of epoll_ctl() calls. Spinning can be enabled and interrupts are avoided until an application blocks. CPU overhead and latency increase with the number of file descriptors in the epoll set. onload_ordered_epoll_wait() is supported.

2    Calls to epoll_ctl(), epoll_wait() and epoll_pwait() are processed in the kernel. Delivers the best performance for large numbers of accelerated file descriptors. Spinning can be enabled and interrupts are avoided until an application blocks. CPU overhead and latency are independent of the number of file descriptors in the epoll set. onload_ordered_epoll_wait() is not supported.

3    Function calls to epoll_ctl(), epoll_wait() and epoll_pwait() are processed at user level. Delivers the best acceleration and latency for epoll_ctl() calls, and scales well when the number of accelerated file descriptors in the epoll set is very large and all sockets are in the same stack. The cost of epoll_wait() is independent of the number of accelerated file descriptors in the set and depends only on the number of descriptors that become ready. The benefits will be less if sockets exist in different Onload stacks:
     •From Onload 201805 onwards, each socket can be in up to four epoll sets at a time, provided that each epoll set is in a different process.
     •Otherwise, each socket can be in at most one epoll set at a time.
     In such cases the recommendation is to use EF_UL_EPOLL=2. EF_UL_EPOLL=3 does not allow monitoring the readiness of the epoll file descriptors from another epoll/poll/select, and cannot support epoll sets which exist across fork(). Spinning can be enabled and interrupts are avoided until an application blocks. onload_ordered_epoll_wait() is supported.

The relative performance of epoll options 1 and 2 
depends on 
the details of application behavior 
as well as 
the number of accelerated file descriptors in the epoll set. 
Behavior may also differ 
between 
earlier and later kernels 
and 
between Linux realtime and non‐realtime kernels. 
Generally the OS will allocate short time slices 
to a user‐level CPU intensive application, 
which may result in performance issues (latency spikes). 
A kernel‐level CPU intensive process is less likely 
to be de‐scheduled, 
resulting in better performance. 
Solarflare recommend that users evaluate options 1 and 2 
for applications 
that manage many file descriptors, 
or try option 3 (onload‐201502 and later) 
when using very large sets where all sockets are in the same stack.

Additional environment variables can be employed 
to control the epoll_ctl(), epoll_wait() and epoll_pwait() functions 
and to give priority to accelerated sockets 
over non‐accelerated sockets and other file descriptors. 
Refer to 
EF_EPOLL_CTL_FAST, 
EF_EPOLL_SPIN 
and EF_EPOLL_MT_SAFE in Parameter Reference on page 210.
Refer to epoll ‐ Known Issues on page 177.
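
As an illustration, a minimal sketch of an epoll set containing both an Onload file descriptor and a kernel file descriptor; the EF_UL_EPOLL setting chosen at launch then determines where the epoll calls are processed:

#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
    int epfd = epoll_create1(0);
    int sock = socket(AF_INET, SOCK_DGRAM, 0);  /* accelerated under onload */
    struct epoll_event ev = { .events = EPOLLIN };

    ev.data.fd = sock;
    epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);  /* Onload descriptor */
    ev.data.fd = 0;
    epoll_ctl(epfd, EPOLL_CTL_ADD, 0, &ev);     /* stdin: kernel descriptor */

    struct epoll_event events[8];
    int n = epoll_wait(epfd, events, 8, 1000);  /* wait up to one second */
    printf("%d descriptor(s) ready\n", n);
    return 0;
}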

Wire Order Delivery

When a TCP or UDP application is working 
with multiple network sockets simultaneously 
it is difficult 
to ensure 
data is delivered to the application 
in the strict order 
it was received 
from the wire 
across these sockets.
The onload_ordered_epoll_wait() API 
is an Onload alternative implementation 
of epoll_wait() 
providing additional data 
allowing a receiving application 
to recover in‐order timestamped data 
from multiple sockets. 
To maintain wire order delivery, 
only a specific number of bytes, 
as identified by the onload_ordered_epoll_event, 
should be recovered from a ready socket.
•Ordering is done on a per‐stack basis 
‐ for TCP and UDP sockets. 
Sockets must be in the same onload stack.
•Only data received from an Onload stack 
with a hardware timestamp will be ordered.
•The environment variable 
EF_RX_TIMESTAMPING must be enabled:
EF_RX_TIMESTAMPING=1
•File descriptors 
where timestamping information is not available may be included in the epoll set, 
but received data will be returned from these unordered.
•The application must use the epoll API 
and the onload_ordered_epoll_wait() function.
•The application must set the per‐process environment variable 
EF_UL_EPOLL=1 or EF_UL_EPOLL=3.
•The EPOLLET and EPOLLONESHOT flags should NOT be used.
•Concurrent use of the ordering data is not safe, 
and so onload_ordered_epoll_wait() must not be called from multiple threads.
•See onload_ordered_epoll_wait on page 297 for further details.

To prevent packet coalescing in the receive queue, 
resulting in multiple packets received with the same hardware timestamp, 
the EF_TCP_RCVBUF_STRICT variable should be disabled (default setting).
Figure 14 demonstrates the Wire Order Delivery feature.
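
A minimal sketch of the receive pattern described above, assuming the onload_ordered_epoll_wait() declaration from onload/extensions.h (see onload_ordered_epoll_wait on page 297); only the byte count reported for each ready descriptor is consumed, to preserve wire order:

#include <sys/epoll.h>
#include <unistd.h>
#include <onload/extensions.h>  /* Onload Extensions API */

void drain_in_order(int epfd)
{
    struct epoll_event events[16];
    struct onload_ordered_epoll_event oo[16];
    char buf[2048];

    int n = onload_ordered_epoll_wait(epfd, events, oo, 16, 1000);
    for (int i = 0; i < n; i++) {
        /* oo[i].bytes is the number of bytes that may be read from this
           descriptor without breaking wire order; oo[i].ts holds the
           hardware timestamp of the first of those bytes. */
        int remaining = oo[i].bytes;
        while (remaining > 0) {
            int chunk = remaining < (int)sizeof(buf) ? remaining : (int)sizeof(buf);
            ssize_t rc = read(events[i].data.fd, buf, chunk);
            if (rc <= 0)
                break;
            remaining -= (int)rc;
        }
    }
}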

Stack Sharing

By default each process 
using Onload 
has its own 'stack'. 
Refer to Onload Stacks for a definition. 
Several processes can be made to share a single stack, 
using the EF_NAME environment variable. 
Processes with the same value for EF_NAME in their environment 
will share a stack.
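
For example, two processes launched with the same EF_NAME value will share one stack (the application names are hypothetical):

EF_NAME=mystack onload ./producer
EF_NAME=mystack onload ./consumer
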
Stack sharing is one supported method to enable multiple processes 
using Onload 
to be accelerated 
when receiving the same multicast stream 
or to allow one application 
to receive a multicast stream 
generated locally by a second application. 
Other methods to achieve this 
are Multicast Replication and Hardware Multicast Loopback.
Stacks may also be shared 
by multiple processes 
in order to preserve 
and control resources within the system. 
Stack sharing can be employed 
by processes handling TCP as well as UDP sockets.

Stack sharing should only be requested 
if there is a trust relationship between the processes. 
If two processes share a stack 
then they are not completely isolated: 
a bug in one process may impact the other, 
or one process can gain access to the other's privileged information 
(i.e. breach security). 
Once the EF_NAME variable is set, 
any process on the local host can set the same value 
and gain access to the stack.
By default Onload stacks can only be shared with processes 
having the same UID. 
The EF_SHARE_WITH environment variable provides additional security 
while allowing a different UID to share a stack. 
Refer to Parameter Reference on page 210 
for a description of the EF_NAME and EF_SHARE_WITH variables.

Processes with different UIDs 
sharing an Onload stack should not use huge pages. 
Onload will issue a warning at startup 
and prevent the allocation of huge pages 
if EF_SHARE_WITH identifies a UID of another process 
or is set to ‐1. 
If a process P1 creates an Onload stack, 
but is not using huge pages 
and another process P2 attempts to share the Onload stack 
by setting EF_NAME, 
the stack options set by P1 will apply, 
and allocation of huge pages in P2 will be prevented.
To suppress these startup warnings 
about turning huge pages off, 
set EF_USE_HUGE_PAGES to 0 
if EF_SHARE_WITH is non‐zero.
An alternative method 
of implementing stack sharing 
is to use the Onload Extensions API 
and the onload_set_stackname() function 
which, through its scope parameter, 
can limit stack access 
to the processes created by a particular user. 
Refer to Onload Extensions API on page 281 for details.
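
A minimal sketch, assuming the onload_set_stackname() declaration and enumerations from onload/extensions.h; the stack name is an example:

#include <onload/extensions.h>  /* Onload Extensions API */

int use_per_user_stack(void)
{
    /* Sockets created by any thread of this process after this call are
       placed in the stack named "mystack"; ONLOAD_SCOPE_USER qualifies the
       name with the current UID, so only this user's processes share it. */
    return onload_set_stackname(ONLOAD_ALL_THREADS, ONLOAD_SCOPE_USER, "mystack");
}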

Application Clustering

An application cluster is the set 
of Onload TCP or UDP stack sockets 
bound to the same port. 
This feature dramatically improves the scaling 
of some applications across multiple CPUs 
(especially those establishing many sockets from a TCP listening socket).
Onload from version 201405 automatically creates a cluster 
using the SO_REUSEPORT socket option. 
TCP or UDP processes 
running on RHEL 6.5 (and later) 
setting this option 
can bind multiple sockets to the same TCP or UDP port.

For TCP, 
clustering allows the established connections 
resulting from a listening socket 
to be spread over a number of Onload stacks. 
Each thread/process creates 
its own listening socket (using SO_REUSEPORT) on the same port, 
with each listening socket 
residing in its own Onload stack. 
Handling of incoming new TCP connections 
is spread via the adapter (using RSS) over the application cluster, 
and therefore over each of the listening sockets, 
resulting in each Onload stack, 
and therefore each thread/process, 
handling a subset of the total traffic, 
as illustrated in Figure 15 below.
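
A minimal sketch of the per-worker listening socket described above (the port number is an arbitrary example):

#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int make_clustered_listener(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    struct sockaddr_in addr;

    /* SO_REUSEPORT lets each worker bind its own listening socket to the
       same port; under Onload each such socket resides in its own stack. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);  /* example port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("bind");
    listen(fd, 128);
    return fd;
}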

For UDP, 
clustering allows UDP unicast traffic 
to be spread over multiple applications 
with each application 
receiving a subset of the total traffic load.
Existing applications 
that do not use SO_REUSEPORT 
can use the application clustering feature 
without the need for re‐compilation, 
by using the Onload 
EF_TCP_FORCE_REUSEPORT 
or EF_UDP_FORCE_REUSEPORT environment variables 
identifying the list of ports 
to which SO_REUSEPORT will be applied.
The size or number of socket members 
of a cluster in Onload 
is controlled with EF_CLUSTER_SIZE. 
To create a cluster 
the application sets the cluster name 
with EF_CLUSTER_NAME. 
A cluster of EF_CLUSTER_SIZE is then created.
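
For example, an unmodified binary can be placed in a named four-stack cluster at launch (the application name, port and cluster values are hypothetical):

EF_TCP_FORCE_REUSEPORT=8080 EF_CLUSTER_NAME=c1 EF_CLUSTER_SIZE=4 onload ./legacy_server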

The spread of received traffic 
between cluster sockets 
employs Receive Side Scaling (RSS):
•for TCP on all adapters, 
and for UDP on SFN7000 series adapters onwards, 
the RSS hash is a function of the src_ip:src_port and dst_ip:dst_port 4‐tuple.
•otherwise the RSS hash is a function of the src_ip and dst_ip only.
The reception of traffic 
within a cluster is dependent on port numbers only. 
If two sockets bind to the same port, 
but different IP addresses, 
a portion of traffic destined for one socket 
can be received (but dropped by Onload) on the other socket. 
For correct behavior, 
all cluster members should bind to the same IP address. 
This limitation has been removed in the Onload‐201509 release 
so that it is possible to create multiple listening sockets 
bound to the same port but to different addresses.

Restarting an application 
that includes cluster socket members 
can fail 
when orphan stacks are still present. 
Use EF_CLUSTER_RESTART to force termination of orphaned stacks 
allowing the creation of the new cluster.
Refer to Limitations on page 168 for details of Application Clustering limitations.

Bonding, Link aggregation and Failover

Teaming
