This chapter outlines configurations
that Onload does not accelerate
and ways in which Onload may change behavior
of the system and applications.
It is a key goal of Onload
to be fully compatible with the behavior
of the regular kernel stack,
but there are some cases where behavior deviates.
Resources
Onload uses certain physical resources on the network adapter.
If these resources are exhausted,
it is not possible to create new Onload stacks,
or to accelerate new sockets or applications.
The onload_stackdump utility should be used
to monitor hardware resources.
Physical resources include:
Virtual NICs
Virtual NICs provide the interface
by which a user level application sends and receives network traffic.
When these are exhausted
it is not possible to create new Onload stacks,
meaning new applications cannot be accelerated.
However,
Solarflare network adapters support large numbers of Virtual NICs,
and this resource is not typically the first to become unavailable.
Endpoints
Onload represents sockets and pipes as structures called endpoints.
The maximum number of accelerated endpoints permitted
by each Onload stack is set with the EF_MAX_ENDPOINTS variable.
The stack limit can be reached sooner than expected
because syn‐receive states (half‐open connections)
also consume endpoint buffers:
four syn‐receive states consume one endpoint.
The maximum number of syn‐receive states can be limited
using the EF_TCP_SYNRECV_MAX variable.
Filters
Filters are used to deliver packets
received from the wire to the appropriate application.
When filters are exhausted
it is not possible to create new accelerated sockets.
The general recommendation is
that an application should not allocate more than 4096 filters
‐ that is, it should not create more than 4096 outgoing connections.
The limit does not apply to inbound connections
to a listening socket.
Buffer Table
The buffer table provides address protection and translation for DMA buffers.
When all buffer resources are exhausted
it is not possible to create new Onload stacks,
and existing stacks are not able to allocate more DMA buffers.
When hardware resources are exhausted,
normal operation of the system should continue,
but it will not be possible to accelerate new sockets or applications.
TX, RX Ring Buffer Size
Onload does not obey the RX and TX ring sizes set in the kernel,
but instead uses the values specified
by EF_RXQ_SIZE and EF_TXQ_SIZE, which both default to 512.
Devices
The efrm driver used by Onload supports a maximum of 64 devices.
Changes to Behavior
Multithreaded Applications Termination
As Onload handles networking
in the context of the calling application's thread
it is recommended
that applications ensure all threads exit cleanly
when the process terminates.
In particular the exit() function causes all threads to exit immediately
‐ even those in critical sections.
This can cause threads currently within the Onload stack
holding the per stack lock to terminate
without releasing this shared lock
‐ this is particularly important for shared stacks
where a process sharing the stack could ‘hang’
when Onload locks are not released.
An unclean exit can prevent the Onload kernel components
from cleanly closing the application's TCP connections.
In this case a message similar to the following will be observed:
[onload] Stack [0] released with lock stuck
and any pending TCP connections will be reset.
To prevent this,
applications should always ensure that all threads exit cleanly.
Thread Cancellation
Unexpected behavior can result
when an accelerated application uses the pthread_cancel() function.
The risk is increased for multi‐threaded applications,
or when a PTHREAD_CANCEL_ASYNCHRONOUS thread
calls a non‐async‐safe function.
Onload users are strongly advised
not to use pthread_cancel() in accelerated applications.
Packet Capture
Packets delivered to an application
via the accelerated path are not visible to the OS kernel.
As a result,
diagnostic tools such as tcpdump and wireshark
do not capture accelerated packets.
The Solarflare supplied onload_tcpdump does support capture
of UDP and TCP packets from Onload stacks
‐ Refer to onload_tcpdump on page 379 for details.
Firewalls
Packets delivered to an application
via the accelerated path are not visible to the OS kernel.
As a result,
these packets are not visible to the kernel firewall (iptables)
and therefore firewall rules will not be applied to accelerated traffic.
The onload_iptables feature can be used
to enforce Linux iptables rules
as hardware filters on the Solarflare adapter,
refer to onload_iptables on page 384.
NOTE: Hardware filtering on the network adapter will ensure
that accelerated applications receive traffic only on ports
to which they are bound.
System Tools ‐ Socket Visibility
With the exception of ‘listening’ sockets,
TCP sockets accelerated by Onload are not visible to the netstat tool.
UDP sockets are visible to netstat.
Accelerated sockets appear in the /proc directory
as symbolic links to /dev/onload.
Tools that rely on /proc will probably not identify
the associated file descriptors as being sockets.
Refer to Onload and File Descriptors,
Stacks and Sockets on page 74 for more details.
Accelerated sockets can be inspected
in detail with the Onload onload_stackdump tool,
which exposes considerably more information
than the regular system tools.
For details of onload_stackdump refer to onload_stackdump on page 324.
Signals
If an application receives a SIGSTOP signal,
it is possible for the processing
of network events to be stalled in an Onload stack
used by the application.
This happens
if the application is holding a lock inside the stack
when it is stopped.
If the application remains stopped for a long time,
this may cause TCP connections to time‐out.
A signal which terminates an application can prevent threads
from exiting cleanly.
Refer to Multithreaded Applications Termination on page 169
for more information.
Undefined content may result
when a signal handler uses the third argument (ucontext)
and the signal is postponed by Onload.
To avoid this,
use the Onload module option safe_signals_and_exit=0
or use EF_SIGNALS_NOPOSTPONE to prevent specific signals
being postponed by Onload.
Onload and IP_MULTICAST_TTL
Onload will act in accordance with RFC 791
when it comes to the IP_MULTICAST_TTL setting.
Using Onload, if IP_MULTICAST_TTL=0,
packets will never be transmitted on the wire.
This differs from the Linux kernel
where the following behavior has been observed:
Kernel ‐ IP_MULTICAST_TTL 0
‐ if there is a local listener, packets will not be transmitted on the wire.
Kernel ‐ IP_MULTICAST_TTL 0
‐ if there is NO local listener, packets will always be transmitted on the wire.
Source/Policy Based Routing
1. OpenOnload 201710 / EnterpriseOnload 6.0 / Cloud Onload 201811
The Onload 201710, EnterpriseOnload 6.0 and Cloud Onload 201811
releases include support
for source based policy routing
for unicast and multicast packets.
The following are supported:
•source IP address
•destination IP address
•outgoing interface (SO_BINDTODEVICE)
•TOS (Type of Service)
Policy rules based on other criteria
are not supported
and will be ignored by Onload.
2. Earlier Onload versions
Earlier Onload versions do not support source based or policy based routing.
Whereas the Linux kernel will select a route and interface
based on routing metrics,
Onload will select any of the valid routes and Onload interfaces
to a destination that are available.
The EF_TCP_LISTEN_REPLIES_BACK environment variable
provides a pseudo source‐based routing solution.
This option forces a reply to an incoming SYN
to ignore routes and reply to the originating network interface.
Enabling this option will allow new TCP connections to be setup,
but does not guarantee
that all replies from an Onloaded application will go
via the receiving Solarflare interface
‐ and some re‐ordering of the routing table may be needed
to guarantee this OR an explicit route (to go via the Solarflare interface)
should be added to the routing table.
Routing Table Metrics
From version 201606,
Onload supports routing table metrics.
Therefore,
if two entries in the routing table would route traffic
to the destination address,
the entry with the best metric is selected,
even if that means routing over a non‐Solarflare interface.
Multipath routes
Onload does not support a multipath route simultaneously
via Onload accelerated and non‐Onload‐accelerated interfaces.
The paths in a multipath route should either all be acceleratable,
or all be non‐acceleratable.
Reverse Path Filtering
Onload does not support Reverse Path Filtering.
When Onload cannot route traffic to a remote endpoint
over a Solarflare interface (no suitable route table entry),
the traffic will be handled via the kernel.
SO_REUSEPORT
Onload vs. kernel behavior is described in Chapter 7 on page 90.
Thread Safe
Onload assumes
that file descriptors are not concurrently modified by different threads,
and omits the corresponding concurrency control.
This differs from kernel behaviour,
where concurrent access is safe,
and users should set EF_FDS_MT_SAFE=0
if the application is not considered thread‐safe.
Similar consideration should be given
when using epoll(), where default concurrency control is disabled in Onload.
Such users should set EF_EPOLL_MT_SAFE=0.
Control of Duplicated Sockets
When a socket has been duplicated, for example using fork(),
the parent fd may be controlled by the kernel
while the child fd is controlled by Onload.
Changes made by the kernel using fcntl() to modify flags
such as O_NONBLOCK will not be reflected in the Onload socket.
UDP sockets shutdown()
When a kernel UDP socket is unconnected,
a shutdown() call will prompt a blocking recv() operation on the socket
to successfully complete.
When an Onload UDP socket is unconnected,
a shutdown() call does not successfully complete a blocking recv() call
and thereafter the socket fd cannot be reused.
When a UDP socket is connected,
kernel and Onload behavior is the same:
a shutdown() call will prompt a blocking recv() operation
to complete successfully.
Limits to Acceleration
IP Fragmentation
Fragmented IP traffic is not accelerated
by Onload on the receive side,
and is instead received transparently via the kernel stack.
IP fragmentation is rarely seen with TCP,
because the TCP/IP stacks segment messages
into MTU‐sized IP datagrams.
With UDP,
datagrams are fragmented by IP
if they are too large for the configured MTU.
Refer to Fragmented UDP on page 130 for a description of Onload behavior.
Broadcast Traffic
Broadcast sends and receives function as normal
but will not be accelerated.
Multicast traffic can be accelerated.
IPv6 Traffic
IPv6 traffic functions as normal but will not be accelerated.
If the kernel also does not support IPv6,
the following error message is output:
sock_create(10, <1 or 2>, 0) failed (‐97)
where:
•‐97 is the error code EAFNOSUPPORT
(Address family not supported by protocol)
•the other numbers indicate an IPv6 TCP or UDP socket.
One possible cause of this error is using Java,
which often creates IPv6 sockets alongside IPv4 ones.
TCP NOP Options
Onload will silently discard packets
that include IP header No Operation (NOP) options.
Such discards will not increment packet drop counters.
Onload will process packets
that include NOP options in the TCP header,
but the options themselves will be ignored.
Raw Sockets
Raw Socket sends and receives function
as normal but will not be accelerated.
Socketpair and UNIX Domain Sockets
Onload will intercept,
but does not accelerate the socketpair() system call.
Sockets created with socketpair() will be handled by the kernel.
Onload also does not accelerate UNIX domain sockets.
UDP sendfile()
The UDP sendfile() method is not currently accelerated by Onload.
When an Onload accelerated application calls sendfile()
this will be handled seamlessly by the kernel.
Statically Linked Applications
Onload will not accelerate statically linked applications.
This is due to the method
in which Onload intercepts libc function calls (using LD_PRELOAD).
Local Port Address
Onload is limited to OOF_LOCAL_ADDR_MAX
local interface addresses.
A local address can identify a physical port
or a VLAN,
and multiple addresses can be assigned to a single interface
where each address contributes to the maximum value.
Users can allocate additional local interface addresses
by increasing the compile time constant OOF_LOCAL_ADDR_MAX
in the /src/lib/efthrm/oof_impl.h file
and rebuilding Onload.
In onload‐201205 OOF_LOCAL_ADDR_MAX was replaced
by the onload module option max_layer2_interfaces.
Bonding, Link aggregation
•Onload will only accelerate traffic over 802.3ad and active‐backup bonds.
•Onload will not accelerate traffic
if a bond contains any slave interfaces
that are not Solarflare network devices.
•Adding a non‐Solarflare network device to a bond
that is currently accelerated by Onload
may result in unexpected results such as connections being reset.
•Acceleration of bonded interfaces in Onload
requires a kernel configured
with sysfs support
and a bonding module version of 3.0.0 or later.
In cases where Onload will not accelerate the traffic
it will continue to work via the OS network stack.
VLANs
•Onload will only accelerate traffic over VLANs
where the master device is either a Solarflare network device,
or over a bonded interface that is accelerated.
i.e. If the VLAN's master is accelerated, then so is the VLAN interface itself.
•Nested VLAN tags are not accelerated, but will function as normal.
•The ifconfig command will return inconsistent statistics
on VLAN interfaces (not master interface).
•When a Solarflare VLAN tagged interface is subsequently placed in a bond,
the interface will continue to be accelerated,
but the bond is not accelerated.
•Using SFN7000, SFN8000 and X2 series adapters
with the low‐latency firmware variant,
the following limitation applies:
Hardware filters installed by Onload on the adapter
will only act on the IP address and port,
but not the VLAN identifier.
Therefore if the same IP address:port combination
exists on different VLAN interfaces,
only the first interface to install the filter will receive the traffic.
This limitation does not apply to SFN7000, SFN8000 and X2 series adapters
using the full‐feature firmware variant.
In cases where Onload will not accelerate the traffic
it will continue to work via the OS network stack.
For more information and details
and configuration options
refer to the Solarflare Server Adapter User Guide section
‘Setting Up VLANs’.
Ethernet Bridge Configuration
Onload does not currently support acceleration of interfaces
added to an Ethernet bridge configured/added with the Linux brctl command.
TCP RTO During Overload Conditions
Using Onload,
under very high load conditions
an increased frequency of TCP retransmission timeouts (RTOs)
might be observed.
This has the potential to occur
when a thread servicing the stack is descheduled
by the CPU
whilst still holding the stack lock
thus preventing another thread from accessing/polling the stack.
A stack not being serviced
means that ACKs are not received in a timely manner for packets sent,
resulting in RTOs for the unacknowledged packets
and increased jitter on the Onload stack.
Enabling the per stack environment variable EF_INT_DRIVEN
can reduce the likelihood of this behavior
and reduce jitter by ensuring the stack is serviced promptly.
TCP with Jumbo Frames
When using jumbo frames with TCP,
Onload will limit the MSS to 2048 bytes
to ensure that segments do not exceed the size
of internal packet buffers.
This should present no problems
unless the remote end of a connection is unable
to negotiate this lower MSS value.
Transmission Path ‐ Packet Loss
Occasionally Onload needs to send a packet,
which would normally be accelerated, via the kernel.
This occurs when there is no destination address entry
in the ARP table
or to prevent an ARP table entry from becoming stale.
By default, the Linux sysctl, unres_qlen,
will enqueue 3 packets per unresolved address
when waiting for an ARP reply,
and on a server subject to a very high UDP or TCP traffic load
this can result in packets being discarded on the transmit path.
The unres_qlen value can be identified
using the following command:
sysctl ‐a | grep unres_qlen
net.ipv4.neigh.eth2.unres_qlen = 3
net.ipv4.neigh.eth0.unres_qlen = 3
net.ipv4.neigh.lo.unres_qlen = 3
net.ipv4.neigh.default.unres_qlen = 3
Changes to the queue lengths can be made permanent
in the /etc/sysctl.conf file.
Solarflare recommend setting the unres_qlen value to at least 50.
If packet discards are suspected,
this extremely rare condition can be indicated
by the cp_defer counter on UDP sockets,
produced by the onload_stackdump lots command,
or by the unresolved_discards counter
in the Linux /proc/net/stat/arp_cache file.
TCP ‐ Unsupported Routing, Timed out Connections
If TCP packets are received over an Onload accelerated interface,
but Onload cannot find a suitable Onload accelerated return route,
no response will be sent
resulting in the connection timing out.
Application Clustering
For details of Application Clustering,
refer to Application Clustering on page 90.
•Onload matches the Linux kernel implementation
such that clustering is not supported
for multicast traffic
and where setting of SO_REUSEPORT has the same effect
as SO_REUSEADDR.
•Calling connect() on a TCP socket
which was previously subject to a bind() call is not currently supported.
This will be supported in a future release.
•An application cluster will not persist over adapter/server/driver reset.
Before restarting the server
or resetting the adapter the Onload applications should be terminated.
•The environment variable EF_CLUSTER_RESTART determines
the behavior of the cluster
when the application process is restarted
‐ refer to EF_CLUSTER_RESTART in Parameter Reference on page 210.
•If the number of sockets in a cluster is less than EF_CLUSTER_SIZE,
a portion of the received traffic will be lost.
•There is little benefit
when clustering involves a TCP loopback listening socket
as connections will not be distributed amongst all threads.
A non‐loopback listening socket
‐ which might occasionally get some loopback connections ‐
can benefit from Application Clustering.
Duplicate IP or MAC addresses
Onload does not support multiple interfaces
with the same IP address or MAC address.
epoll ‐ Known Issues
Onload supports different implementations of epoll
controlled by the EF_UL_EPOLL environment variable
‐ see Multiplexed I/O on page 83 for configuration details.
There are various limitations and differences
in Onload vs. kernel behavior
‐ refer to Chapter 7 on page 83 for details.
•When using EF_UL_EPOLL=1 or 3,
it has been identified
that the behavior of epoll_wait() differs from the kernel
when the EPOLLONESHOT event is requested,
resulting in two ‘wakeups’ being observed,
one from the kernel
and one from Onload.
This behavior is apparent
on all types of accelerated sockets (SOCK_DGRAM and SOCK_STREAM)
for all combinations of EPOLLONESHOT, EPOLLIN and EPOLLOUT events.
•EF_EPOLL_CTL_FAST is enabled by default,
and this modifies the semantics of epoll.
In particular,
it buffers up calls to epoll_ctl()
and only applies them when epoll_wait() is called.
This can break applications
that do epoll_wait() in one thread
and epoll_ctl() in another thread.
The issue only affects EF_UL_EPOLL=2
and the solution is to set EF_EPOLL_CTL_FAST=0 if this is a problem.
The described condition does not occur
if EF_UL_EPOLL=1 or EF_UL_EPOLL=3.
•When EF_EPOLL_CTL_FAST is enabled
and an application tests the readiness of an epoll file descriptor
without actually calling epoll_wait()
‐ for example by nesting the epoll fd inside another epoll() or select() ‐
or if one thread calls select() or epoll_wait()
while another thread calls epoll_ctl(),
then EF_EPOLL_CTL_FAST should be disabled.
This applies when using EF_UL_EPOLL 1, 2 or 3.
If the application monitors the state of the epoll file descriptor indirectly,
e.g. by monitoring the epoll fd with poll(),
then EF_EPOLL_CTL_FAST can cause issues and should be set to zero.
To force Onload to follow the kernel behavior
when using the epoll_wait() call,
the following variables should be set:
EF_UL_EPOLL=2
EF_EPOLL_CTL_FAST=0
EF_EPOLL_CTL_HANDOFF=0 (when using EF_UL_EPOLL=1)
•A socket should be removed
from an epoll set
only when all references to the socket are closed.
With EF_UL_EPOLL=1 (default) or EF_UL_EPOLL=3,
a socket is removed from the epoll set if the file descriptor is closed,
even if other references to the socket exist.
This can cause problems
if file descriptors are duplicated
using dup(), dup2() or fork().
For example:
s = socket(AF_INET, SOCK_STREAM, 0);
s2 = dup(s);
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, s, ...);
close(s);
/* when using Onload, the socket referenced by s is removed from the
   epoll set here, even though s2 still references it */
The workaround is to set EF_UL_EPOLL=2.
•When Onload is unable to accelerate a connected socket,
e.g. because no route to the destination exists
which uses a Solarflare interface,
the socket will be handed off to the kernel
and is removed from the epoll set.
Because the socket is no longer in the epoll set,
attempts to modify the socket with epoll_ctl() will fail
with the ENOENT (descriptor not present) error.
The described condition does not occur
if EF_UL_EPOLL=1 or 3.
•If an epoll file descriptor
is passed to the read() or write() functions,
these will return a different error code than that reported
by the kernel stack. This issue exists for all implementations of epoll.
•When EPOLLET is used and the event is ready,
epoll_wait() is triggered by ANY event on the socket
instead of the requested event. This issue should not affect application
correctness.
•Users should be aware
that if a server is overclocked the epoll_wait() timeout value
will increase as CPU MHz increases
resulting in unexpected timeout values.
This has been observed on Intel based systems
and when the Onload epoll implementation is EF_UL_EPOLL=1 or 3.
Using EF_UL_EPOLL=2 this behavior is not observed.
•On a spinning thread,
if epoll acceleration is disabled by setting EF_UL_EPOLL=0,
sockets on this thread will be handed off to the kernel,
but latency will be worse than expected kernel socket latency.
•To ensure that non‐accelerated file descriptors are checked
in poll and select functions,
disable (set to zero) the following options:
•EF_SELECT_FAST and EF_POLL_FAST
•When using poll() and select() calls,
to ensure that non‐accelerated file descriptors are checked
when there are no events on any accelerated descriptors,
set both EF_POLL_FAST_USEC and EF_SELECT_FAST_USEC to zero.
Nested Epoll Sets
When an epoll set includes accelerated sockets
and is nested inside another epoll set,
the outer set may not always be notified about socket readiness,
or, after a socket becomes ready,
its ready state may not be cleared.
This limitation is known to affect EF_UL_EPOLL=3.
Spinning ‐ Timing Issues
Onload users should consider
that as different software is being run,
timings will be affected
which can result in unexpected scheduling behaviour
and memory use.
Spinning applications, in particular, require a dedicated core per spinning Onload thread.
Configuration Issues
Mixed Adapters Sharing a Broadcast Domain
Onload should not be used
when Solarflare and non‐Solarflare interfaces
in the same network server are configured
in the same broadcast domain,
as depicted by the following diagram.