OVS CT (connection tracking)
To OVS, CT is essentially a third-party module that reuses the kernel's conntrack module. The kernel tracks every connection and session on the machine: it intercepts and analyzes each packet passing through, builds a connection database (the conntrack table), and keeps that database up to date. A stateless firewall filters packets on the five-tuple alone; a stateful firewall can additionally filter on connection state.
The "connection" in CT connection tracking is not quite the same concept as the "connection" in TCP/IP's "connection oriented".
In TCP/IP, a connection is a Layer 4 concept.
TCP is connection oriented: every packet sent must be acknowledged (ACK) by the peer, and lost packets are retransmitted.
UDP is connectionless: packets need no acknowledgment from the peer and there is no retransmission.
In CT, a flow defined by a tuple counts as one connection.
Even UDP, and Layer 3 protocols such as ICMP, get connection entries in CT.
Not every protocol is tracked, though; currently only six are supported: TCP, UDP, ICMP, DCCP, SCTP, and GRE.
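Since a CT "connection" is just a tracked tuple, UDP and ICMP entries are keyed exactly like TCP ones. A minimal sketch in C (hypothetical struct and function names, not OVS code):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical conntrack key: a "connection" is identified purely by
 * this tuple, so connectionless protocols such as UDP and ICMP get
 * entries too (for ICMP, id/type stand in for the ports). */
struct ct_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;              /* e.g. 6 = TCP, 17 = UDP, 1 = ICMP */
};

/* Two packets belong to the same tracked flow iff their tuples match;
 * fields are compared individually to avoid struct padding issues. */
static int same_flow(const struct ct_tuple *a, const struct ct_tuple *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip
        && a->src_port == b->src_port && a->dst_port == b->dst_port
        && a->proto == b->proto;
}
```

A UDP DNS exchange thus produces one entry, looked up by this key in both directions.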
conntrack is a state tracking and recording mechanism. It does not filter packets by itself; it only provides the basis for packet filtering. It works in three steps:
1. Intercept (filter) every packet passing through the machine and analyze it.
2. From that information, build the machine's connection database (the conntrack table).
3. Keep updating the database as more packets are intercepted.
rel: a packet related to an existing connection, e.g. the ICMP error returned when an ICMP request hits an unreachable network.
rpl: a packet in the reply direction, e.g. an ICMP reply answering an ICMP request.
CT: stateful vs. stateless
Stateless:
For an internal user to reach a public web service, two rules must be installed, one outbound and one inbound: the outbound rule opens destination port 80, and the inbound rule opens source port 80. But because the client's destination port is randomly chosen, the inbound rule has to open all ports 0-65535, which is very dangerous!
No. | Action | Source | Src port | Destination | Dst port | Direction |
---|---|---|---|---|---|---|
1 | Allow | * | * | * | 80 | Out |
2 | Allow | * | 80 | * | 0-65535 | In |
Match fields:
Source IP | Destination IP | Source port | Destination port | Protocol |
---|---|---|---|---|
Stateful:
For the same scenario, only the outbound rule is defined and the session is tracked. When the initial packet of a connection to the external web service matches the rule, a session is created in the state table, recording source IP, source port, destination IP, destination port, and connection time; for TCP it also records sequence numbers and flags. When subsequent packets arrive, the state engine matches them against the state table; on a hit they are forwarded directly without re-running the rule set, improving efficiency. When the reply comes back, the state engine recognizes it as part of the web session and dynamically opens the port to let it in, closing the port again once the transfer completes. This also effectively mitigates external DoS attacks, and forged ACK packets from outside cannot get in.
No. | Action | Source | Src port | Destination | Dst port | Direction |
---|---|---|---|---|---|---|
1 | Allow | * | * | * | 80 | Out |
Match fields:
Source IP | Destination IP | Source port | Destination port | Protocol | CT state | Sequence number (TCP only) | Timeout | Packet count |
---|---|---|---|---|---|---|---|---|
TCP three-way handshake
①:A -> B -trk
②:A -> B +trk+new
③:B -> A -trk
④:B -> A +trk+est
⑤:A -> B +trk+est
A sends SYN: ① + ②
B replies SYN+ACK: ③ + ④
A replies ACK: ① + ⑤
Data transfer between A and B: A→B: ① + ⑤, B→A: ③ + ④
Four-way teardown
A sends FIN: ① + ⑤
B replies ACK and FIN: ③ + ④
A replies ACK: ① + ⑤
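The ct_state transitions above can be condensed into a tiny state machine. This is an illustrative C sketch of the semantics (the names are invented, not OVS APIs): an untracked packet's first lookup yields +trk+new; once the connection is committed, lookups in both directions yield +trk+est:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative ct_state bits, mirroring -trk/+new/+est above. */
enum { CS_TRACKED = 1 << 0, CS_NEW = 1 << 1, CS_EST = 1 << 2 };

struct conn { bool committed; };

/* One ct() lookup: before commit the flow is seen as new (step 2);
 * after commit both directions are established (steps 4 and 5). */
static int ct_lookup(const struct conn *c)
{
    return c->committed ? (CS_TRACKED | CS_EST) : (CS_TRACKED | CS_NEW);
}

/* ct(commit): persist the entry so later packets match +est. */
static void ct_commit(struct conn *c)
{
    c->committed = true;
}
```

Running the handshake through it reproduces the sequence above: SYN hits +trk+new and is committed, then SYN+ACK, the final ACK, and all data packets hit +trk+est.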
NAT use case:
A NAT gateway is a service that performs IP address translation, offering both SNAT and DNAT, and gives resources inside a private network (VPC) secure, high-performance Internet access.
A NAT gateway lets multiple servers in a VPC that have no public IP initiate connections to the Internet, and can also map a public IP and port to a server's private IP and port, so servers in the VPC can be reached from the Internet.
SNAT lets multiple servers in the VPC share one public IP for outbound Internet access.
DNAT maps an external IP, protocol, and port to the private IP, protocol, and port of a server in the VPC, so services on the server can be reached from outside.
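The two translations can be sketched as tuple rewrites (plain C, hypothetical names): SNAT rewrites the source of outbound packets, DNAT the destination of inbound ones:

```c
#include <assert.h>
#include <stdint.h>

struct tuple { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; };

/* SNAT: many private hosts share one public IP; a per-connection
 * source port keeps their flows distinguishable on the way back. */
static void snat(struct tuple *t, uint32_t public_ip, uint16_t public_port)
{
    t->src_ip = public_ip;
    t->src_port = public_port;
}

/* DNAT: map the public-facing destination to an internal server. */
static void dnat(struct tuple *t, uint32_t private_ip, uint16_t private_port)
{
    t->dst_ip = private_ip;
    t->dst_port = private_port;
}
```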
Building an environment to verify
Core flow tables
# The first packet in either direction is untracked (-trk): send it through the conntrack table lookup (with NAT) and resubmit to table 1
ovs-ofctl add-flow br-int 'table=0,priority=10,ip,ct_state=-trk,action=ct(nat,table=1)'
# Outbound: after the conntrack lookup the state becomes +trk; the first packet must also match +new, and must explicitly not be in the est/inv/rel states before it may be committed to the conntrack table
ovs-ofctl add-flow br-int 'table=1,in_port=1,ip,ct_state=+trk+new-est-inv-rel,action=ct(nat(src=172.93.74.5-172.93.74.15:5000-50000),commit),mod_dl_src:xx:xx:xx:xx:xx:xx,mod_dl_dst:xx:xx:xx:xx:xx:xx,3'
# Outbound: after the conntrack lookup the state is +trk; non-first packets have already been committed, so they match +est and explicitly exclude new/inv/rel
ovs-ofctl add-flow br-int 'table=1,in_port=1,ip,ct_state=+trk+est-new-inv-rel,action=mod_dl_src:xx:xx:xx:xx:xx:xx,mod_dl_dst:xx:xx:xx:xx:xx:xx,3'
# Inbound: after the conntrack lookup the state is +trk; first packet or not, it matches +est and explicitly excludes the new/inv/rel states
ovs-ofctl add-flow br-int 'table=1,in_port=3,ip,ct_state=+trk+est-new-inv-rel,action=mod_dl_src:xx:xx:xx:xx:xx:xx,mod_dl_dst:xx:xx:xx:xx:xx:xx,1'
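The four rules form a small decision pipeline. A C sketch of the dispatch logic (assuming, as above, that port 1 is the VM side and port 3 the uplink; the returned labels are descriptive, not real OVS actions):

```c
#include <assert.h>
#include <string.h>

enum { CS_TRK = 1 << 0, CS_NEW = 1 << 1, CS_EST = 1 << 2 };

/* Mirror of the table-0/table-1 rules above: untracked packets are
 * first sent through conntrack, then dispatched on (port, ct_state). */
static const char *dispatch(int in_port, int state)
{
    if (!(state & CS_TRK))                return "ct(nat),table=1";
    if (in_port == 1 && (state & CS_NEW)) return "snat,commit,out";
    if (in_port == 1 && (state & CS_EST)) return "rewrite,out";
    if (in_port == 3 && (state & CS_EST)) return "unnat,in";
    return "drop";                        /* inv/rel etc. fall through */
}
```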
Environment setup
# 1. Create two namespaces to emulate VMs
ip netns add ns1
ip netns add ns2
# 2. Create two bridges; the namespaces attach to br-int
ovs-vsctl add-br br-int
ovs-vsctl add-br net-br
ovs-vsctl add-port br-int vif_1 -- set Interface vif_1 type=internal
ip link set vif_1 netns ns1
ip netns exec ns1 ip link set dev vif_1 up
ovs-vsctl add-port br-int vif_2 -- set Interface vif_2 type=internal
ip link set vif_2 netns ns2
ip netns exec ns2 ip link set dev vif_2 up
# 3. Assign IPs to the VMs
ip netns exec ns1 ip addr add 192.168.1.102/24 dev vif_1
ip netns exec ns2 ip addr add 192.168.1.1/24 dev vif_2
ip netns exec ns1 ip link set lo up
ip netns exec ns2 ip link set lo up
# 4. Test connectivity between the VMs
ip netns exec ns1 ping -c 1 192.168.1.102
ip netns exec ns1 ping -c 1 192.168.1.1
# 5. Create patch ports between the bridges
ovs-vsctl add-port br-int patch-ovs-0 -- set Interface patch-ovs-0 type=patch options:peer=patch-sw-1
ovs-vsctl add-port net-br patch-sw-1 -- set Interface patch-sw-1 type=patch options:peer=patch-ovs-0
# 6. Add physical NIC eth0 to bridge net-br; eth0's IP and routes disappear, so reconfigure them on net-br
ovs-vsctl add-port net-br eth0
ifconfig net-br 172.93.73.4/16 up
route add default gw 172.93.1.1 dev net-br
route add -net 172.93.0.0/16 dev net-br
ip addr flush dev eth0
# 7. Add the flows
NS1_MAC=$(ip netns exec ns1 ip a | grep ether | awk '{print $2}')
NS2_MAC=$(ip netns exec ns2 ip a | grep ether | awk '{print $2}')
NET_BR_MAC=$(ip a | grep net-br: -A1 | grep ether | awk '{print $2}')
HOST_MAC="00:50:56:95:b0:b2"
NS1_PORT=$(ovs-ofctl show br-int | grep vif_1 | awk -F '(' '{print $1}' | awk '{print $NF}')
NS2_PORT=$(ovs-ofctl show br-int | grep vif_2 | awk -F '(' '{print $1}' | awk '{print $NF}')
PATCH_PORT=$(ovs-ofctl show br-int | grep patch-ovs | awk -F '(' '{print $1}' | awk '{print $NF}')
ovs-ofctl add-flow br-int "table=0,priority=10,ip,ct_state=-trk,action=ct(nat,table=1)"
ovs-ofctl add-flow br-int "table=1,in_port=${NS1_PORT},ip,ct_state=+trk+new,action=ct(nat(src=172.93.73.4-172.93.73.4:5000-50000),commit,exec(set_field:0x2->ct_label)),mod_dl_src:${NET_BR_MAC},mod_dl_dst:${HOST_MAC},${PATCH_PORT}"
ovs-ofctl add-flow br-int "table=1,priority=10,in_port=${NS1_PORT},ip,ct_state=+trk+est,action=mod_dl_src:${NET_BR_MAC},mod_dl_dst:${HOST_MAC},${PATCH_PORT}"
ovs-ofctl add-flow br-int "table=1,priority=1,ct_label=0x2,ip,ct_state=+trk+est,action=mod_dl_src:${NET_BR_MAC},mod_dl_dst:${NS1_MAC},${NS1_PORT}"
# 8. Capture on the ns port
03:57:56.625643 b2:68:6b:f5:f5:bf > 00:50:56:95:59:53, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 41522, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.1.102 > 172.93.74.28: ICMP echo request, id 1355, seq 9, length 64
03:57:56.626220 00:50:56:95:59:53 > b2:68:6b:f5:f5:bf, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 65244, offset 0, flags [none], proto ICMP (1), length 84)
172.93.74.28 > 192.168.1.102: ICMP echo reply, id 1355, seq 9, length 64
# 9. Capture on the physical port
03:57:48.459408 00:50:56:95:59:53 > 00:50:56:95:b0:b2, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 40494, offset 0, flags [DF], proto ICMP (1), length 84)
172.93.73.4 > 172.93.74.28: ICMP echo request, id 15933, seq 1, length 64
03:57:49.461041 00:50:56:95:b0:b2 > 00:50:56:95:59:53, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64042, offset 0, flags [none], proto ICMP (1), length 84)
172.93.74.28 > 172.93.73.4: ICMP echo reply, id 15933, seq 2, length 64
# 10. Inspect the CT table
ovs-appctl dpctl/dump-conntrack -m | grep 172.93.74.28
icmp,orig=(src=192.168.1.102,dst=172.93.74.28,id=2167,type=8,code=0),reply=(src=172.93.74.28,dst=172.93.73.4,id=8934,type=0,code=0),status=SEEN_REPLY|CONFIRMED|SRC_NAT|SRC_NAT_DONE,labels=0x1
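The dumped entry shows that one record stores both directions: an original tuple and a reply tuple whose destination is the SNATed source. A sketch of how a reply is matched back (hypothetical types, IPv4 addresses as 32-bit integers):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the entry above: orig = 192.168.1.102 -> 172.93.74.28,
 * reply = 172.93.74.28 -> 172.93.73.4 (the translated source). */
struct ct_entry {
    uint32_t orig_src, orig_dst;
    uint32_t reply_src, reply_dst;
};

/* An inbound packet is matched by the reply tuple, letting conntrack
 * reverse the NAT without any extra inbound rule. */
static int matches_reply(const struct ct_entry *e, uint32_t src, uint32_t dst)
{
    return e->reply_src == src && e->reply_dst == dst;
}
```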
SNAT+DNAT
UnSnat:table=1, cookie=0, priority=100,ip,nw_dst=116.0.0.1 actions=ct(table=2,zone=10,nat)
Dnat:table=2, cookie=0, priority=100,ip actions=ct(commit,table=3,zone=20,nat(dst=117.0.0.1))
Routing:table=3, cookie=0, priority=100,ip,nw_dst=117.0.0.0/24 actions=set_field:fe:fe:fe:fe:fe:bb->eth_dst,resubmit(,4)
UnDnat:table=4, cookie=0, priority=100,ip,nw_src=117.0.0.1 actions=ct(table=5,zone=20,nat)
Snat: table=5, cookie=0, priority=100,ip actions=ct(commit,table=6,zone=10,nat(src=116.0.0.1))
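The five rules above rely on conntrack zones: an entry is keyed by (zone, tuple), so the same packet carries independent NAT state in zone 10 (SNAT) and zone 20 (DNAT). A minimal sketch of that keying (hypothetical names):

```c
#include <assert.h>
#include <stdint.h>

/* Two lookups of the same tuple in different zones hit different
 * entries, so the UnSnat stage (zone 10) and the Dnat stage
 * (zone 20) never interfere with each other. */
struct zone_key { uint16_t zone; uint32_t src_ip, dst_ip; };

static int key_eq(const struct zone_key *a, const struct zone_key *b)
{
    return a->zone == b->zone && a->src_ip == b->src_ip
        && a->dst_ip == b->dst_ip;
}
```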
CT implementation in the OVS datapath
// The CT action type registered for add-flow parsing
OFPACT(CT, ofpact_conntrack, ofpact, "ct")
// Parse the CT parameters of the action
static char * OVS_WARN_UNUSED_RESULT
parse_CT(char *arg, const struct ofpact_parse_params *pp)
{
const size_t ct_offset = ofpacts_pull(pp->ofpacts);
struct ofpact_conntrack *oc;
char *error = NULL;
char *key, *value;
oc = ofpact_put_CT(pp->ofpacts);
oc->flags = 0;
oc->recirc_table = NX_CT_RECIRC_NONE;
while (ofputil_parse_key_value(&arg, &key, &value)) {
if (!strcmp(key, "commit")) {
oc->flags |= NX_CT_F_COMMIT;
} else if (!strcmp(key, "force")) {
oc->flags |= NX_CT_F_FORCE;
} else if (!strcmp(key, "table")) {
if (!ofputil_table_from_string(value, pp->table_map,
&oc->recirc_table)) {
error = xasprintf("unknown table %s", value);
} else if (oc->recirc_table == NX_CT_RECIRC_NONE) {
error = xasprintf("invalid table %#"PRIx8, oc->recirc_table);
}
} else if (!strcmp(key, "zone")) {
error = str_to_u16(value, "zone", &oc->zone_imm);
if (error) {
free(error);
error = mf_parse_subfield(&oc->zone_src, value);
if (error) {
return error;
}
}
} else if (!strcmp(key, "alg")) {
error = str_to_connhelper(value, &oc->alg);
} else if (!strcmp(key, "nat")) {
const size_t nat_offset = ofpacts_pull(pp->ofpacts);
error = parse_NAT(value, pp);
/* Update CT action pointer and length. */
pp->ofpacts->header = ofpbuf_push_uninit(pp->ofpacts, nat_offset);
oc = pp->ofpacts->header;
} else if (!strcmp(key, "exec")) {
/* Hide existing actions from ofpacts_parse_copy(), so the
* nesting can be handled transparently. */
enum ofputil_protocol usable_protocols2;
const size_t exec_offset = ofpacts_pull(pp->ofpacts);
/* Initializes 'usable_protocol2', fold it back to
* '*usable_protocols' afterwards, so that we do not lose
* restrictions already in there. */
struct ofpact_parse_params pp2 = *pp;
pp2.usable_protocols = &usable_protocols2;
error = ofpacts_parse_copy(value, &pp2, false, OFPACT_CT);
*pp->usable_protocols &= usable_protocols2;
pp->ofpacts->header = ofpbuf_push_uninit(pp->ofpacts, exec_offset);
oc = pp->ofpacts->header;
} else {
error = xasprintf("invalid argument to \"ct\" action: `%s'", key);
}
}
if (!error && oc->flags & NX_CT_F_FORCE && !(oc->flags & NX_CT_F_COMMIT)) {
error = xasprintf("\"force\" flag requires \"commit\" flag.");
}
if (ofpbuf_oversized(pp->ofpacts)) {
free(error);
return xasprintf("input too big");
}
ofpact_finish_CT(pp->ofpacts, &oc);
ofpbuf_push_uninit(pp->ofpacts, ct_offset);
return error;
}
// FAST PATH: flow lookup in the kernel on packet receive, continuing from ovs_execute_actions in the previous post
int ovs_execute_actions(struct datapath *dp, struct sk_buff *skb,
const struct sw_flow_actions *acts,
struct sw_flow_key *key)
{
int err, level;
level = __this_cpu_inc_return(exec_actions_level);
if (unlikely(level > OVS_RECURSION_LIMIT)) {
net_crit_ratelimited("ovs: recursion limit reached on datapath %s, probable configuration error\n",
ovs_dp_name(dp));
kfree_skb(skb);
err = -ENETDOWN;
goto out;
}
OVS_CB(skb)->acts_origlen = acts->orig_len;
// Execute the actions
err = do_execute_actions(dp, skb, key,
acts->actions, acts->actions_len);
if (level == 1)
process_deferred_actions(dp);
out:
__this_cpu_dec(exec_actions_level);
return err;
}
/* Execute a list of actions against 'skb'. */
static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
struct sw_flow_key *key,
const struct nlattr *attr, int len)
{
const struct nlattr *a;
int rem;
for (a = attr, rem = len; rem > 0;
a = nla_next(a, &rem)) {
int err = 0;
switch (nla_type(a)) {
......
case OVS_ACTION_ATTR_CT:
if (!is_flow_key_valid(key)) {
err = ovs_flow_key_update(skb, key);
if (err)
return err;
}
err = ovs_ct_execute(ovs_dp_get_net(dp), skb, key,
nla_data(a));
/* Hide stolen IP fragments from user space. */
if (err)
return err == -EINPROGRESS ? 0 : err;
break;
......
}
......
}
/* Returns 0 on success, -EINPROGRESS if 'skb' is stolen, or other nonzero
* value if 'skb' is freed.
*/
int ovs_ct_execute(struct net *net, struct sk_buff *skb,
struct sw_flow_key *key,
const struct ovs_conntrack_info *info)
{
int nh_ofs;
int err;
/* The conntrack module expects to be working at L3. */
nh_ofs = skb_network_offset(skb);
skb_pull_rcsum(skb, nh_ofs);
err = ovs_skb_network_trim(skb);
if (err)
return err;
if (key->ip.frag != OVS_FRAG_TYPE_NONE) {
err = handle_fragments(net, key, info->zone.id, skb);
if (err)
return err;
}
if (info->commit)
// ct(commit,table=1)
err = ovs_ct_commit(net, key, info, skb);
else
// ct(table=1)
err = ovs_ct_lookup(net, key, info, skb);
skb_push(skb, nh_ofs);
skb_postpush_rcsum(skb, skb->data, nh_ofs);
if (err)
kfree_skb(skb);
return err;
}
// Handles the ct_label, ct_mark, nat, and commit parameters
static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
const struct ovs_conntrack_info *info,
struct sk_buff *skb)
{
enum ip_conntrack_info ctinfo;
struct nf_conn *ct;
int err;
// Performs NAT
err = __ovs_ct_lookup(net, key, info, skb);
if (err)
return err;
/* The connection could be invalid, in which case this is a no-op.*/
ct = nf_ct_get(skb, &ctinfo);
if (!ct)
return 0;
......
// Set ct_mark
if (info->mark.mask) {
err = ovs_ct_set_mark(ct, key, info->mark.value,
info->mark.mask);
if (err)
return err;
}
// Set ct_label
if (!nf_ct_is_confirmed(ct)) {
err = ovs_ct_init_labels(ct, key, &info->labels.value,
&info->labels.mask);
if (err)
return err;
} else if (IS_ENABLED(CONFIG_NF_CONNTRACK_LABELS) &&
labels_nonzero(&info->labels.mask)) {
err = ovs_ct_set_labels(ct, key, &info->labels.value,
&info->labels.mask);
if (err)
return err;
}
/* This will take care of sending queued events even if the connection
* is already confirmed.
*/
if (nf_conntrack_confirm(skb) != NF_ACCEPT)
return -EINVAL;
return 0;
}
static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key,
const struct ovs_conntrack_info *info,
struct sk_buff *skb)
{
bool cached = skb_nfct_cached(net, key, info, skb);
enum ip_conntrack_info ctinfo;
struct nf_conn *ct;
if (!cached) {
struct nf_hook_state state = {
.hook = NF_INET_PRE_ROUTING,
.pf = info->family,
.net = net,
};
struct nf_conn *tmpl = info->ct;
int err;
/* Associate skb with specified zone. */
if (tmpl) {
if (skb_nfct(skb))
nf_conntrack_put(skb_nfct(skb));
nf_conntrack_get(&tmpl->ct_general);
nf_ct_set(skb, tmpl, IP_CT_NEW);
}
err = nf_conntrack_in(skb, &state);
if (err != NF_ACCEPT)
return -ENOENT;
/* Clear CT state NAT flags to mark that we have not yet done
* NAT after the nf_conntrack_in() call. We can actually clear
* the whole state, as it will be re-initialized below.
*/
key->ct_state = 0;
/* Update the key, but keep the NAT flags. */
ovs_ct_update_key(skb, info, key, true, true);
}
ct = nf_ct_get(skb, &ctinfo);
if (ct) {
bool add_helper = false;
// Apply ct(nat)
if (info->nat && !(key->ct_state & OVS_CS_F_NAT_MASK) &&
(nf_ct_is_confirmed(ct) || info->commit) &&
ovs_ct_nat(net, key, info, skb, ct, ctinfo) != NF_ACCEPT) {
return -EINVAL;
}
......
}
static int ovs_ct_nat(struct net *net, struct sw_flow_key *key,
const struct ovs_conntrack_info *info,
struct sk_buff *skb, struct nf_conn *ct,
enum ip_conntrack_info ctinfo)
{
enum nf_nat_manip_type maniptype;
int err;
#ifdef HAVE_NF_CT_IS_UNTRACKED
if (nf_ct_is_untracked(ct)) {
/* A NAT action may only be performed on tracked packets. */
return NF_ACCEPT;
}
#endif /* HAVE_NF_CT_IS_UNTRACKED */
/* Add NAT extension if not confirmed yet. */
if (!nf_ct_is_confirmed(ct) && !nf_ct_nat_ext_add(ct))
return NF_ACCEPT; /* Can't NAT. */
/* Determine NAT type.
* Check if the NAT type can be deduced from the tracked connection.
* Make sure new expected connections (IP_CT_RELATED) are NATted only
* when committing.
*/
if (info->nat & OVS_CT_NAT && ctinfo != IP_CT_NEW &&
ct->status & IPS_NAT_MASK &&
(ctinfo != IP_CT_RELATED || info->commit)) {
/* NAT an established or related connection like before. */
if (CTINFO2DIR(ctinfo) == IP_CT_DIR_REPLY)
/* This is the REPLY direction for a connection
* for which NAT was applied in the forward
* direction. Do the reverse NAT.
*/
maniptype = ct->status & IPS_SRC_NAT
? NF_NAT_MANIP_DST : NF_NAT_MANIP_SRC;
else
maniptype = ct->status & IPS_SRC_NAT
? NF_NAT_MANIP_SRC : NF_NAT_MANIP_DST;
} else if (info->nat & OVS_CT_SRC_NAT) {
maniptype = NF_NAT_MANIP_SRC;
} else if (info->nat & OVS_CT_DST_NAT) {
maniptype = NF_NAT_MANIP_DST;
} else {
return NF_ACCEPT; /* Connection is not NATed. */
}
// Execute the NAT-related action
err = ovs_ct_nat_execute(skb, ct, ctinfo, &info->range, maniptype);
if (err == NF_ACCEPT && ct->status & IPS_DST_NAT) {
if (ct->status & IPS_SRC_NAT) {
if (maniptype == NF_NAT_MANIP_SRC)
maniptype = NF_NAT_MANIP_DST;
else
maniptype = NF_NAT_MANIP_SRC;
err = ovs_ct_nat_execute(skb, ct, ctinfo, &info->range,
maniptype);
} else if (CTINFO2DIR(ctinfo) == IP_CT_DIR_ORIGINAL) {
err = ovs_ct_nat_execute(skb, ct, ctinfo, NULL,
NF_NAT_MANIP_SRC);
}
}
/* Mark NAT done if successful and update the flow key. */
if (err == NF_ACCEPT)
ovs_nat_update_key(key, skb, maniptype);
return err;
}
static int ovs_ct_nat_execute(struct sk_buff *skb, struct nf_conn *ct,
enum ip_conntrack_info ctinfo,
const struct nf_nat_range2 *range,
enum nf_nat_manip_type maniptype)
{
int hooknum, nh_off, err = NF_ACCEPT;
nh_off = skb_network_offset(skb);
skb_pull_rcsum(skb, nh_off);
/* See HOOK2MANIP(). */
if (maniptype == NF_NAT_MANIP_SRC)
hooknum = NF_INET_LOCAL_IN; /* Source NAT */
else
hooknum = NF_INET_LOCAL_OUT; /* Destination NAT */
switch (ctinfo) {
case IP_CT_RELATED:
case IP_CT_RELATED_REPLY:
if (IS_ENABLED(CONFIG_NF_NAT_IPV4) &&
skb->protocol == htons(ETH_P_IP) &&
ip_hdr(skb)->protocol == IPPROTO_ICMP) {
if (!nf_nat_icmp_reply_translation(skb, ct, ctinfo,
hooknum))
err = NF_DROP;
goto push;
} else if (IS_ENABLED(CONFIG_NF_NAT_IPV6) &&
skb->protocol == htons(ETH_P_IPV6)) {
__be16 frag_off;
u8 nexthdr = ipv6_hdr(skb)->nexthdr;
int hdrlen = ipv6_skip_exthdr(skb,
sizeof(struct ipv6hdr),
&nexthdr, &frag_off);
if (hdrlen >= 0 && nexthdr == IPPROTO_ICMPV6) {
if (!nf_nat_icmpv6_reply_translation(skb, ct,
ctinfo,
hooknum,
hdrlen))
err = NF_DROP;
goto push;
}
}
/* Non-ICMP, fall thru to initialize if needed. */
/* fall through */
case IP_CT_NEW:
/* Seen it before? This can happen for loopback, retrans,
* or local packets.
*/
if (!nf_nat_initialized(ct, maniptype)) {
/* Initialize according to the NAT action. */
err = (range && range->flags & NF_NAT_RANGE_MAP_IPS)
/* Action is set up to establish a new
* mapping.
*/
? nf_nat_setup_info(ct, range, maniptype)
: nf_nat_alloc_null_binding(ct, hooknum);
if (err != NF_ACCEPT)
goto push;
}
break;
case IP_CT_ESTABLISHED:
case IP_CT_ESTABLISHED_REPLY:
break;
default:
err = NF_DROP;
goto push;
}
err = nf_nat_packet(ct, ctinfo, hooknum, skb);
push:
skb_push(skb, nh_off);
skb_postpush_rcsum(skb, skb->data, nh_off);
return err;
}
static int ovs_ct_lookup(struct net *net, struct sw_flow_key *key,
const struct ovs_conntrack_info *info,
struct sk_buff *skb)
{
struct nf_conntrack_expect *exp;
exp = ovs_ct_expect_find(net, &info->zone, info->family, skb);
if (exp) {
u8 state;
// An expectation matched: mark the packet +trk, +new and +rel (a related connection)
state = OVS_CS_F_TRACKED | OVS_CS_F_NEW | OVS_CS_F_RELATED;
__ovs_ct_update_key(key, state, &info->zone, exp->master);
} else {
struct nf_conn *ct;
int err;
err = __ovs_ct_lookup(net, key, info, skb);
if (err)
return err;
ct = (struct nf_conn *)skb_nfct(skb);
if (ct)
nf_ct_deliver_cached_events(ct);
}
return 0;
}
// SLOW PATH: on a kernel-datapath miss, an upcall hands the packet to userspace for lookup
static void *udpif_upcall_handler(void *arg)
{
struct handler *handler = arg;
struct udpif *udpif = handler->udpif;
while (!latch_is_set(&handler->udpif->exit_latch)) {
if (recv_upcalls(handler)) {
poll_immediate_wake();
} else {
dpif_recv_wait(udpif->dpif, handler->handler_id);
latch_wait(&udpif->exit_latch);
}
poll_block();
}
return NULL;
}
// 2. Handle upcalls in userspace
static size_t recv_upcalls(struct handler *handler)
{
struct udpif *udpif = handler->udpif;
uint64_t recv_stubs[UPCALL_MAX_BATCH][512 / 8];
struct ofpbuf recv_bufs[UPCALL_MAX_BATCH];
struct dpif_upcall dupcalls[UPCALL_MAX_BATCH];
struct upcall upcalls[UPCALL_MAX_BATCH];
struct flow flows[UPCALL_MAX_BATCH];
size_t n_upcalls, i;
n_upcalls = 0;
while (n_upcalls < UPCALL_MAX_BATCH) {
struct ofpbuf *recv_buf = &recv_bufs[n_upcalls];
struct dpif_upcall *dupcall = &dupcalls[n_upcalls];
struct upcall *upcall = &upcalls[n_upcalls];
struct flow *flow = &flows[n_upcalls];
unsigned int mru = 0;
uint64_t hash = 0;
int error;
ofpbuf_use_stub(recv_buf, recv_stubs[n_upcalls],
sizeof recv_stubs[n_upcalls]);
// 2.1 Receive the upcall request
if (dpif_recv(udpif->dpif, handler->handler_id, dupcall, recv_buf)) {
ofpbuf_uninit(recv_buf);
break;
}
upcall->fitness = odp_flow_key_to_flow(dupcall->key, dupcall->key_len,
flow, NULL);
if (upcall->fitness == ODP_FIT_ERROR) {
goto free_dupcall;
}
if (dupcall->mru) {
mru = nl_attr_get_u16(dupcall->mru);
}
if (dupcall->hash) {
hash = nl_attr_get_u64(dupcall->hash);
}
// 2.2 Classify the data handed up by the kernel
error = upcall_receive(upcall, udpif->backer, &dupcall->packet,
dupcall->type, dupcall->userdata, flow, mru,
&dupcall->ufid, PMD_ID_NULL);
if (error) {
if (error == ENODEV) {
/* Received packet on datapath port for which we couldn't
* associate an ofproto. This can happen if a port is removed
* while traffic is being received. Print a rate-limited
* message in case it happens frequently. */
dpif_flow_put(udpif->dpif, DPIF_FP_CREATE, dupcall->key,
dupcall->key_len, NULL, 0, NULL, 0,
&dupcall->ufid, PMD_ID_NULL, NULL);
VLOG_INFO_RL(&rl, "received packet on unassociated datapath "
"port %"PRIu32, flow->in_port.odp_port);
}
goto free_dupcall;
}
upcall->key = dupcall->key;
upcall->key_len = dupcall->key_len;
upcall->ufid = &dupcall->ufid;
upcall->hash = hash;
upcall->out_tun_key = dupcall->out_tun_key;
upcall->actions = dupcall->actions;
pkt_metadata_from_flow(&dupcall->packet.md, flow);
// 2.3 Extract the flow
flow_extract(&dupcall->packet, flow);
// 2.4 Process the upcall data
error = process_upcall(udpif, upcall,
&upcall->odp_actions, &upcall->wc);
if (error) {
goto cleanup;
}
n_upcalls++;
continue;
}
// 2.4 Call upcall_xlate
static int process_upcall(struct udpif *udpif, struct upcall *upcall,
struct ofpbuf *odp_actions, struct flow_wildcards *wc)
{
const struct dp_packet *packet = upcall->packet;
const struct flow *flow = upcall->flow;
size_t actions_len = 0;
switch (upcall->type) {
case MISS_UPCALL:
case SLOW_PATH_UPCALL:
// 2.4.1 Handle the upcall type
upcall_xlate(udpif, upcall, odp_actions, wc);
return 0;
}
}
// 2.4.1 Handle the upcall type
static void upcall_xlate(struct udpif *udpif, struct upcall *upcall,
struct ofpbuf *odp_actions, struct flow_wildcards *wc)
{
struct dpif_flow_stats stats;
enum xlate_error xerr;
struct xlate_in xin;
struct ds output;
stats.n_packets = 1;
stats.n_bytes = dp_packet_size(upcall->packet);
stats.used = time_msec();
stats.tcp_flags = ntohs(upcall->flow->tcp_flags);
xlate_in_init(&xin, upcall->ofproto,
ofproto_dpif_get_tables_version(upcall->ofproto),
upcall->flow, upcall->ofp_in_port, NULL,
stats.tcp_flags, upcall->packet, wc, odp_actions);
upcall->reval_seq = seq_read(udpif->reval_seq);
// 2.4.1.1 Walk the flow tables looking for a match
xerr = xlate_actions(&xin, &upcall->xout);
/* Translate again and log the ofproto trace for
* these two error types. */
/* This function is also called for slow-pathed flows. As we are only
* going to create new datapath flows for actual datapath misses, there is
* no point in creating a ukey otherwise. */
if (upcall->type == MISS_UPCALL) {
upcall->ukey = ukey_create_from_upcall(upcall, wc);
}
}
// 2.4.1.1 Match against the flow tables
enum xlate_error xlate_actions(struct xlate_in *xin, struct xlate_out *xout)
{
*xout = (struct xlate_out) {
.slow = 0,
.recircs = RECIRC_REFS_EMPTY_INITIALIZER,
};
if (!xin->ofpacts && !ctx.rule) {
// 2.4.1.1 Walk the flow tables looking for a match
ctx.rule = rule_dpif_lookup_from_table(
ctx.xbridge->ofproto, ctx.xin->tables_version, flow, ctx.wc,
ctx.xin->resubmit_stats, &ctx.table_id,
flow->in_port.ofp_port, true, true, ctx.xin->xcache);
}
// 2.4.1.2 Execute the actions
do_xlate_actions(ofpacts, ofpacts_len, &ctx, true, false);
}
// 2.4.1.2 Execute the CT-module action
static void do_xlate_actions(const struct ofpact *ofpacts, size_t ofpacts_len,
struct xlate_ctx *ctx, bool is_last_action,
bool group_bucket_action)
{
......
case OFPACT_CT:
compose_conntrack_action(ctx, ofpact_get_CT(a), last);
break;
case OFPACT_CT_CLEAR:
if (ctx->conntracked) {
compose_ct_clear_action(ctx);
}
break;
case OFPACT_NAT:
/* This will be processed by compose_conntrack_action(). */
ctx->ct_nat_action = ofpact_get_NAT(a);
break;
}
static void compose_conntrack_action(struct xlate_ctx *ctx, struct ofpact_conntrack *ofc,
bool is_last_action)
{
/* Ensure that any prior actions are applied before composing the new
* conntrack action. */
xlate_commit_actions(ctx);
/* Process nested actions first, to populate the key. */
ctx->ct_nat_action = NULL;
ctx->wc->masks.ct_mark = 0;
ctx->wc->masks.ct_label = OVS_U128_ZERO;
do_xlate_actions(ofc->actions, ofpact_ct_get_action_len(ofc), ctx,
is_last_action, false);
ct_offset = nl_msg_start_nested(ctx->odp_actions, OVS_ACTION_ATTR_CT);
if (ofc->flags & NX_CT_F_COMMIT) {
nl_msg_put_flag(ctx->odp_actions, ofc->flags & NX_CT_F_FORCE ?
OVS_CT_ATTR_FORCE_COMMIT : OVS_CT_ATTR_COMMIT);
if (ctx->xbridge->support.ct_eventmask) {
nl_msg_put_u32(ctx->odp_actions, OVS_CT_ATTR_EVENTMASK,
OVS_CT_EVENTMASK_DEFAULT);
}
if (ctx->xbridge->support.ct_timeout) {
put_ct_timeout(ctx->odp_actions, ctx->xbridge->ofproto->backer,
&ctx->xin->flow, ctx->wc, zone);
}
}
nl_msg_put_u16(ctx->odp_actions, OVS_CT_ATTR_ZONE, zone);
put_ct_mark(&ctx->xin->flow, ctx->odp_actions, ctx->wc);
put_ct_label(&ctx->xin->flow, ctx->odp_actions, ctx->wc);
put_ct_helper(ctx, ctx->odp_actions, ofc);
put_ct_nat(ctx);
nl_msg_end_nested(ctx->odp_actions, ct_offset);
ctx->wc->masks.ct_mark = old_ct_mark_mask;
ctx->wc->masks.ct_label = old_ct_label_mask;
if (ofc->recirc_table != NX_CT_RECIRC_NONE) {
ctx->conntracked = true;
// Once userspace finds a matching flow, a rule is installed in the kernel datapath
compose_recirculate_and_fork(ctx, ofc->recirc_table, zone);
}
ctx->ct_nat_action = NULL;
/* The ct_* fields are only available in the scope of the 'recirc_table'
* call chain. */
flow_clear_conntrack(&ctx->xin->flow);
xlate_report(ctx, OFT_DETAIL, "Sets the packet to an untracked state, "
"and clears all the conntrack fields.");
ctx->conntracked = false;
}