commit ee122c79d4227f6ec642157834b6a90fcffa4382
Author: Thomas Graf <tgraf@suug.ch>
Date: Tue Jul 21 10:43:58 2015 +0200

vxlan: Flow based tunneling

Allows putting a VXLAN device into a new flow-based mode in which
skbs with a ip_tunnel_info dst metadata attached will be encapsulated
according to the instructions stored in there with the VXLAN device
defaults taken into consideration.

Similar on the receive side, if the VXLAN_F_COLLECT_METADATA flag is
set, the packet processing will populate a ip_tunnel_info struct for
each packet received and attach it to the skb using the new metadata
dst. The metadata structure will contain the outer header and tunnel
header fields which have been stripped off. Layers further up in the
stack such as routing, tc or netfitler can later match on these fields
and perform forwarding. It is the responsibility of upper layers to
ensure that the flag is set if the metadata is needed. The flag limits
the additional cost of metadata collecting based on demand.

This prepares the VXLAN device to be steered by the routing and other
subsystems which allows to support encapsulation for a large number
of tunnel endpoints and tunnel ids through a single net_device which
improves the scalability.

It also allows for OVS to leverage this mode which in turn allows for
the removal of the OVS specific VXLAN code.

Because the skb is currently scrubed in vxlan_rcv(), the attachment of
the new dst metadata is postponed until after scrubing which requires
the temporary addition of a new member to vxlan_metadata. This member
is removed again in a later commit after the indirect VXLAN receive API
has been removed.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Previously, a vxlan device was created with a command like this:
ip link add name vxlan1 type vxlan id $vni dev $link remote $link_remote_ip dstport $vxlan_port
Each VNI needed its own vxlan dev.
With the commit above, a single vxlan dev per host is enough:
ip link add $vx type vxlan dstport $vxlan_port dev $link external udp6zerocsumrx udp6zerocsumtx
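
To make the difference concrete, the two setups can be sketched side by side. The default values for link, link_remote_ip and vxlan_port, the vxlan$vni naming, and the run/legacy_setup/external_setup helpers are placeholders for illustration, not part of the original setup. By default the script only prints the commands; setting DRY_RUN=0 would execute them and needs root.

```shell
#!/bin/sh
# Placeholder values (assumptions, not taken from the original setup).
link=${link:-eth0}
link_remote_ip=${link_remote_ip:-192.168.1.14}
vxlan_port=${vxlan_port:-4789}

# run: print the command in dry-run mode (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Old style: one vxlan device per VNI, each bound to a fixed remote.
legacy_setup() {
    for vni in "$@"; do
        run ip link add name "vxlan$vni" type vxlan id "$vni" dev "$link" \
            remote "$link_remote_ip" dstport "$vxlan_port"
    done
}

# Flow-based style: a single device in external (collect metadata) mode.
# The VNI and remote are no longer device properties.
external_setup() {
    run ip link add "$1" type vxlan dstport "$vxlan_port" dev "$link" \
        external udp6zerocsumrx udp6zerocsumtx
}

legacy_setup 4 5 6     # three devices for three VNIs
external_setup vx0     # one device for all VNIs
```

In external mode the VNI and remote endpoint come from the per-packet tunnel metadata instead, for example from a tc tunnel_key action or from OVS.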
After the PF receives a packet, the call stack is as follows:
gro_cells_receive
b'vxlan_rcv+0x1 [vxlan]'
b'udp_queue_rcv_skb+0x45 [kernel]'
b'udp_unicast_rcv_skb+0x79 [kernel]'
b'__udp4_lib_rcv+0x326 [kernel]'
b'udp_rcv+0x1a [kernel]'
b'ip_protocol_deliver_rcu+0x1d0 [kernel]'
b'ip_local_deliver_finish+0x94 [kernel]'
b'ip_local_deliver+0x82 [kernel]'
b'ip_sublist_rcv_finish+0xd2 [kernel]'
b'ip_list_rcv_finish.constprop.0+0x19f [kernel]'
b'ip_list_rcv+0x15b [kernel]'
b'__netif_receive_skb_list_core+0x28d [kernel]'
b'__netif_receive_skb_list+0x102 [kernel]'
b'netif_receive_skb_list_internal+0x12c [kernel]'
b'napi_complete_done+0x7a [kernel]'
b'mlx5e_napi_poll+0x1bb [mlx5_core]'
b'__napi_poll+0x2f [kernel]'
b'net_rx_action+0x27b [kernel]'
b'__softirqentry_text_start+0xc6 [kernel]'
b'__irq_exit_rcu+0xbf [kernel]'
b'irq_exit_rcu+0xe [kernel]'
b'common_interrupt+0x8d [kernel]'
b'asm_common_interrupt+0x1e [kernel]'
b'cpuidle_enter_state+0x10a [kernel]'
b'cpuidle_enter+0x2e [kernel]'
b'cpuidle_idle_call+0x12d [kernel]'
b'do_idle+0x94 [kernel]'
b'cpu_startup_entry+0x20 [kernel]'
b'start_secondary+0x96 [kernel]'
b'secondary_startup_64_no_verify+0xb0 [kernel]'
Inside gro_cells_receive(), a softirq is triggered for the vxlan device:
b'fl_classify+0x1 [cls_flower]'
b'tcf_classify+0x7a [kernel]'
b'sch_handle_ingress.constprop.0+0x133 [kernel]'
b'__netif_receive_skb_core+0x579 [kernel]'
b'__netif_receive_skb_list_core+0x12a [kernel]'
b'__netif_receive_skb_list+0x102 [kernel]'
b'netif_receive_skb_list_internal+0x12c [kernel]'
b'napi_complete_done+0x7a [kernel]'
b'gro_cell_poll+0x77 [kernel]'
b'__napi_poll+0x2f [kernel]'
b'net_rx_action+0x27b [kernel]'
b'__softirqentry_text_start+0xc6 [kernel]'
b'__irq_exit_rcu+0xbf [kernel]'
b'irq_exit_rcu+0xe [kernel]'
b'common_interrupt+0x8d [kernel]'
b'asm_common_interrupt+0x1e [kernel]'
b'cpuidle_enter_state+0x10a [kernel]'
b'cpuidle_enter+0x2e [kernel]'
b'cpuidle_idle_call+0x12d [kernel]'
b'do_idle+0x94 [kernel]'
b'cpu_startup_entry+0x20 [kernel]'
b'start_secondary+0x96 [kernel]'
b'secondary_startup_64_no_verify+0xb0 [kernel]'
The tc flower filter on the vxlan dev is then invoked:
filter parent ffff: protocol ip pref 1 flower chain 0
filter parent ffff: protocol ip pref 1 flower chain 0 handle 0x1
dst_mac 02:25:d0:13:01:02
src_mac 24:25:d0:e1:00:00
eth_type ipv4
enc_dst_ip 192.168.1.13
enc_src_ip 192.168.1.14
enc_key_id 4
enc_dst_port 4789
skip_hw
not_in_hw
action order 1: tunnel_key unset pipe
index 3 ref 1 bind 1 installed 20085 sec used 1 sec
Action statistics:
Sent 1647408 bytes 19612 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
action order 2: mirred (Egress Redirect to device enp4s0f0_1) stolen
index 3 ref 1 bind 1 installed 20085 sec used 1 sec
Action statistics:
Sent 1647408 bytes 19612 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
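
A filter like the one dumped above could be installed roughly as follows. This is a sketch reconstructed from the dump, not the commands originally used: the device name vx0 and the run/add_decap_filter helpers are placeholders, and the script only prints the commands unless DRY_RUN=0 is set (which needs root).

```shell
#!/bin/sh
# vx is a placeholder name for the external-mode vxlan device.
vx=${vx:-vx0}

# run: print the command in dry-run mode (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

add_decap_filter() {
    # The ingress qdisc provides the ffff: parent seen in the dump.
    run tc qdisc add dev "$vx" ingress
    # Match on the outer (enc_*) fields taken from the tunnel metadata,
    # drop the metadata, then redirect the inner frame to the representor.
    run tc filter add dev "$vx" parent ffff: protocol ip prio 1 flower \
        dst_mac 02:25:d0:13:01:02 src_mac 24:25:d0:e1:00:00 \
        enc_dst_ip 192.168.1.13 enc_src_ip 192.168.1.14 \
        enc_key_id 4 enc_dst_port 4789 skip_hw \
        action tunnel_key unset \
        action mirred egress redirect dev enp4s0f0_1
}

add_decap_filter
```

The enc_* keys are matched against the tunnel metadata attached to the skb on receive, which is why they depend on the device being in external mode.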
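If the vxlan dev is created the old way, that is, without the external keyword, vxlan_collect_metadata() returns false in vxlan_rcv(), so neither udp_tun_rx_dst() nor skb_dst_set() is called and the tunnel information is lost. As a result nothing matches in fl_classify(): fl_mask_lookup() returns NULL and tcf_exts_exec() is never executed.
However, if the vxlan dev is created the old way and the flow is offloaded directly, traffic still gets through, because it is then handled entirely in hardware.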
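
A quick way to tell which of the two cases applies is to look for the external keyword in the detailed link output. The sketch below assumes that the installed iproute2 prints external in `ip -d link show` for a device in collect-metadata mode; the is_external and check_vxlan helpers are hypothetical names used only here.

```shell
#!/bin/sh
# is_external: read `ip -d link show` output on stdin and succeed only if
# the "external" keyword appears as a whole word.
is_external() {
    grep -qw external
}

# check_vxlan DEV: report which software receive behaviour to expect.
check_vxlan() {
    if ip -d link show dev "$1" 2>/dev/null | is_external; then
        echo "$1: external mode, tunnel metadata is attached on receive"
    else
        echo "$1: not external (or not found), enc_* matches will miss in software"
    fi
}
```

For example, `check_vxlan vxlan1` on a device created without external should take the second branch.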