Implementing NAT with OVS CT (Connection Tracking)

OVS CT connection tracking

To OVS, CT is like a third-party module that reuses the kernel's conntrack module. The kernel tracks the state of every connection and session on the machine: it intercepts and analyzes each packet that passes through, builds the machine's connection database (the conntrack table), and keeps that database up to date. A stateless firewall filters packets on the 5-tuple alone; a stateful firewall can additionally filter on connection state.

The "connection" in connection tracking is not quite the same concept as the "connection" in TCP/IP's "connection oriented".
In TCP/IP, a connection is a Layer 4 concept:
TCP is connection oriented: every packet sent must be acknowledged (ACK) by the peer, and there is a retransmission mechanism.
UDP is connectionless: packets need no acknowledgment from the peer, and there is no retransmission.

In CT, any flow defined by a tuple counts as a connection.
Even UDP, and Layer 3 protocols such as ICMP, get connection entries in CT.
Not every protocol is tracked, though; currently only six are supported: TCP, UDP, ICMP, DCCP, SCTP, and GRE.
conntrack is a mechanism for tracking and recording state; it cannot filter packets by itself, it only provides the basis for packet filtering. It works in three steps:
1. Intercept (filter) every packet passing through the machine and analyze it.
2. Build the machine's connection database (the conntrack table) from that information.
3. Keep updating the database from further intercepted packets.
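The three steps above can be sketched as a toy conntrack table keyed on the 5-tuple (an illustrative Python sketch, not the kernel implementation; all names and addresses are made up):

```python
# Toy conntrack table: intercept each packet, look up its flow by
# 5-tuple in both directions, and create or refresh the entry.

def five_tuple(pkt):
    """The 5-tuple that identifies a flow."""
    return (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])

def track(conntrack_table, pkt):
    key = five_tuple(pkt)
    # A reply belongs to the same connection: look up the reversed tuple too.
    rkey = (pkt["dst"], pkt["src"], pkt["dport"], pkt["sport"], pkt["proto"])
    if key in conntrack_table:
        conntrack_table[key]["packets"] += 1           # step 3: update entry
        return "established"
    if rkey in conntrack_table:
        conntrack_table[rkey]["packets"] += 1
        conntrack_table[rkey]["seen_reply"] = True     # like SEEN_REPLY status
        return "reply"
    conntrack_table[key] = {"packets": 1, "seen_reply": False}  # step 2: new entry
    return "new"

table = {}
out = {"src": "192.168.1.102", "dst": "1.2.3.4", "sport": 12345, "dport": 80, "proto": "tcp"}
back = {"src": "1.2.3.4", "dst": "192.168.1.102", "sport": 80, "dport": 12345, "proto": "tcp"}
print(track(table, out))   # new
print(track(table, back))  # reply
print(track(table, out))   # established
```

A stateful firewall consults exactly this kind of table (plus timeouts and per-protocol state) before applying its rules.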
(figure: conntrack state transitions)
rel: if, say, an ICMP request triggers a network-unreachable error, that error message is a related packet.
rpl: only an ICMP reply answering an ICMP request counts as a reply packet.

CT: stateful vs. stateless
Stateless:

For internal users to reach a web service on the public network, two rules are needed, one outbound and one inbound. The outbound rule opens destination port 80; the inbound rule opens source port 80, but since the client's port is randomly chosen, it has to open destination ports 0-65535, which is very dangerous!

No. | Action | Src addr | Src port | Dst addr | Dst port | Direction
1   | allow  | *        | *        | *        | 80       | out
2   | allow  | *        | 80       | *        | 0-65535  | in

Match fields:

Src IP | Dst IP | Src port | Dst port | Protocol

Stateful:

For internal users reaching the same web service, only an outbound rule is defined and the session is tracked. When the initial packet of a connection matches the rule for accessing the external web service, a session is created in the state table, recording the connection's source IP, source port, destination IP, destination port, connection time, and so on; for TCP it also records sequence numbers and flag bits. When subsequent packets arrive, the state engine matches them against the state table; on a match they are forwarded directly without going through the rule checks again, improving efficiency. When return traffic arrives, the state engine recognizes that it belongs to the web session and dynamically opens a port to let it in, closing that port again once the transfer completes. This also effectively defends against external DoS attacks: forged ACK packets from outside cannot get in.

No. | Action | Src addr | Src port | Dst addr | Dst port | Direction
1   | allow  | *        | *        | *        | 80       | out

Match fields:

Src IP | Dst IP | Src port | Dst port | Protocol | CT state | Seq no. (TCP only) | Timeout | Packets matched
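The difference can be sketched in a few lines (illustrative Python, not a real firewall; addresses and ports are invented): one outbound rule plus a session table replaces the dangerous 0-65535 inbound rule.

```python
# Stateful filtering sketch: the only configured rule is outbound
# "allow dst port 80"; inbound traffic is admitted solely by the
# session table, never by a static rule.

sessions = set()

def outbound_allowed(src, sport, dst, dport):
    # Rule 1: allow outbound traffic to port 80 and record the session.
    if dport == 80:
        sessions.add((src, sport, dst, dport))
        return True
    return False

def inbound_allowed(src, sport, dst, dport):
    # No inbound rule at all: only return traffic of a recorded session
    # (the reversed tuple) is let in.
    return (dst, dport, src, sport) in sessions

assert outbound_allowed("10.0.0.5", 33333, "93.184.216.34", 80)
assert inbound_allowed("93.184.216.34", 80, "10.0.0.5", 33333)
# A forged inbound packet with no matching session is dropped.
assert not inbound_allowed("6.6.6.6", 80, "10.0.0.5", 44444)
```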
TCP three-way handshake

①: A -> B -trk
②: A -> B +trk+new
③: B -> A -trk
④: B -> A +trk+est
⑤: A -> B +trk+est

A sends SYN: ① + ②
B replies SYN+ACK: ③ + ④
A replies ACK: ① + ⑤
Data transfer between A and B: A->B is ① + ⑤, B->A is ③ + ④
Four-way teardown
A sends FIN: ① + ⑤
B replies ACK then FIN: ③ + ④
A replies ACK: ① + ⑤
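The progression through states ① to ⑤ can be reproduced with a tiny simulation (illustrative Python; it models only the -trk to +trk+new to +trk+est transitions, not real TCP):

```python
# ct_state after the ct() lookup: before the lookup every packet is -trk;
# the initiator's first packet then reads +trk+new, and every later packet
# on the flow, in either direction, reads +trk+est.

def ct_state(direction, seen_flows):
    """Return the post-lookup ct_state for a packet ("A","B") or ("B","A")."""
    if direction not in seen_flows and direction[::-1] not in seen_flows:
        seen_flows.add(direction)
        return "+trk+new"        # step ②: the first SYN creates a new entry
    return "+trk+est"            # steps ④/⑤: everything after is established

flows = set()
print(ct_state(("A", "B"), flows))  # SYN:      +trk+new
print(ct_state(("B", "A"), flows))  # SYN+ACK:  +trk+est
print(ct_state(("A", "B"), flows))  # ACK/data: +trk+est
```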

NAT in practice:

A NAT gateway is a service that performs IP address translation, offering both SNAT and DNAT, and giving resources inside a private network (VPC) secure, high-performance Internet access.
It lets multiple servers in a VPC that have no public IP initiate connections to the Internet, and it can also map a public IP and port to a server's private IP and port so that servers inside the VPC are reachable from the Internet.
SNAT lets multiple servers in a VPC share one public IP for outbound Internet access.
DNAT maps an external IP, protocol, and port to the private IP, protocol, and port of a server in the VPC, exposing the server's service to the Internet.
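SNAT's many-private-addresses-to-one-public-IP mapping can be sketched as follows (illustrative Python; the gateway address 203.0.113.10 is invented, and the port pool echoes the 5000-50000 range used in the flows below):

```python
# SNAT sketch: private (ip, port) pairs share one public IP and are
# told apart by the port allocated from a pool; unsnat reverses the
# mapping for return traffic.
import itertools

PUBLIC_IP = "203.0.113.10"        # assumed NAT gateway address
_ports = itertools.count(5000)    # port pool, like 5000-50000 in the flows
snat_map = {}                     # (priv_ip, priv_port) -> public port

def snat(src_ip, src_port):
    key = (src_ip, src_port)
    if key not in snat_map:                 # first packet: allocate a port
        snat_map[key] = next(_ports)
    return PUBLIC_IP, snat_map[key]

def unsnat(pub_port):
    """Reverse translation for the return packet."""
    for (ip, port), p in snat_map.items():
        if p == pub_port:
            return ip, port
    return None

print(snat("192.168.1.102", 40000))   # ('203.0.113.10', 5000)
print(unsnat(5000))                   # ('192.168.1.102', 40000)
```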

Verifying with a test environment

(figure: test topology)

Core flows
# The first packet of a flow, ingress or egress alike, is still -trk and is sent through the conntrack table for a lookup
ovs-ofctl add-flow br-int 'table=0,priority=10,ip,ct_state=-trk,action=ct(nat,table=1)'

# Egress: after the conntrack lookup the state becomes +trk; the first packet also matches +new, and only with -est-inv-rel explicitly excluded may it be committed to the conntrack table
ovs-ofctl add-flow br-int 'table=1,in_port=1,ip,ct_state=+trk+new-est-inv-rel,action=ct(nat(src=172.93.74.5-172.93.74.15:5000-50000),commit),mod_dl_src:xx:xx:xx:xx:xx:xx,mod_dl_dst:xx:xx:xx:xx:xx:xx,3'
# Egress: after the lookup the state is +trk; subsequent packets have already been committed, so they match +est (with new/inv/rel explicitly excluded)
ovs-ofctl add-flow br-int 'table=1,in_port=1,ip,ct_state=+trk+est-new-inv-rel,action=mod_dl_src:xx:xx:xx:xx:xx:xx,mod_dl_dst:xx:xx:xx:xx:xx:xx,3'
# Ingress: after the lookup the state is +trk; first reply packet or not, it matches +est (new/inv/rel explicitly excluded), since the egress direction already committed the entry
ovs-ofctl add-flow br-int 'table=1,in_port=3,ip,ct_state=+trk+est-new-inv-rel,action=mod_dl_src:xx:xx:xx:xx:xx:xx,mod_dl_dst:xx:xx:xx:xx:xx:xx,1'
Environment setup
#1. Create two network namespaces to emulate VMs
ip netns add ns1
ip netns add ns2

#2. Create two bridges; the namespaces attach to br-int
ovs-vsctl add-br br-int
ovs-vsctl add-br net-br
ovs-vsctl add-port br-int vif_1 -- set Interface vif_1 type=internal
ip link set vif_1 netns ns1
ip netns exec ns1 ip link set dev vif_1 up
ovs-vsctl add-port br-int vif_2 -- set Interface vif_2 type=internal
ip link set vif_2 netns ns2
ip netns exec ns2 ip link set dev vif_2 up

#3. Assign IPs to the VMs
ip netns exec ns1 ip addr add 192.168.1.102/24 dev vif_1
ip netns exec ns2 ip addr add 192.168.1.1/24 dev vif_2
ip netns exec ns1 ip link set lo up
ip netns exec ns2 ip link set lo up

#4. Test connectivity between the VMs
ip netns exec ns1 ping -c 1 192.168.1.102
ip netns exec ns1 ping -c 1 192.168.1.1

#5. Create a patch-port pair between the bridges
ovs-vsctl add-port br-int patch-ovs-0 -- set Interface patch-ovs-0 type=patch options:peer=patch-sw-1
ovs-vsctl add-port net-br patch-sw-1 -- set Interface patch-sw-1 type=patch options:peer=patch-ovs-0

#6. Add the physical NIC eth0 to bridge net-br; eth0's IP and routes all disappear, so reassign them to net-br
ovs-vsctl add-port net-br eth0
ifconfig net-br 172.93.73.4/16 up
route add default gw 172.93.1.1 dev net-br
route add -net 172.93.0.0/16 dev net-br
ip addr flush dev eth0

#7. Install the flows
NS1_MAC=$(ip netns exec ns1 ip a | grep ether | awk '{print $2}')
NS2_MAC=$(ip netns exec ns2 ip a | grep ether | awk '{print $2}')
NET_BR_MAC=$(ip a | grep net-br: -A1 | grep ether | awk '{print $2}')
HOST_MAC="00:50:56:95:b0:b2"
NS1_PORT=$(ovs-ofctl show br-int | grep vif_1 | awk -F '(' '{print $1}' | awk '{print $NF}')
NS2_PORT=$(ovs-ofctl show br-int | grep vif_2 | awk -F '(' '{print $1}' | awk '{print $NF}')
PATCH_PORT=$(ovs-ofctl show br-int | grep patch-ovs | awk -F '(' '{print $1}' | awk '{print $NF}')
ovs-ofctl add-flow br-int "table=0,priority=10,ip,ct_state=-trk,action=ct(nat,table=1)"
ovs-ofctl add-flow br-int "table=1,in_port=${NS1_PORT},ip,ct_state=+trk+new,action=ct(nat(src=172.93.73.4-172.93.73.4:5000-50000),commit,exec(set_field:0x2->ct_label)),mod_dl_src:${NET_BR_MAC},mod_dl_dst:${HOST_MAC},${PATCH_PORT}"
ovs-ofctl add-flow br-int "table=1,priority=10,in_port=${NS1_PORT},ip,ct_state=+trk+est,action=mod_dl_src:${NET_BR_MAC},mod_dl_dst:${HOST_MAC},${PATCH_PORT}"
ovs-ofctl add-flow br-int "table=1,priority=1,ct_label=0x2,ip,ct_state=+trk+est,action=mod_dl_src:${NET_BR_MAC},mod_dl_dst:${NS1_MAC},${NS1_PORT}"

# 8. Capture on the namespace port
03:57:56.625643 b2:68:6b:f5:f5:bf > 00:50:56:95:59:53, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 41522, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.1.102 > 172.93.74.28: ICMP echo request, id 1355, seq 9, length 64
03:57:56.626220 00:50:56:95:59:53 > b2:68:6b:f5:f5:bf, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 65244, offset 0, flags [none], proto ICMP (1), length 84)
172.93.74.28 > 192.168.1.102: ICMP echo reply, id 1355, seq 9, length 64

#9. Capture on the physical port
03:57:48.459408 00:50:56:95:59:53 > 00:50:56:95:b0:b2, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 40494, offset 0, flags [DF], proto ICMP (1), length 84)
172.93.73.4 > 172.93.74.28: ICMP echo request, id 15933, seq 1, length 64
03:57:49.461041 00:50:56:95:b0:b2 > 00:50:56:95:59:53, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64042, offset 0, flags [none], proto ICMP (1), length 84)
172.93.74.28 > 172.93.73.4: ICMP echo reply, id 15933, seq 2, length 64

#10. Dump the CT table
ovs-appctl dpctl/dump-conntrack -m | grep 172.93.74.28
icmp,orig=(src=192.168.1.102,dst=172.93.74.28,id=2167,type=8,code=0),reply=(src=172.93.74.28,dst=172.93.73.4,id=8934,type=0,code=0),status=SEEN_REPLY|CONFIRMED|SRC_NAT|SRC_NAT_DONE,labels=0x1
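The orig/reply tuples in the dump-conntrack line above can be pulled apart to show the SNAT mapping CT recorded (a quick illustrative parser, not an OVS tool):

```python
# Extract the orig and reply tuples from a dump-conntrack entry to show
# that the reply direction targets the SNAT address, not the private IP.
import re

entry = ("icmp,orig=(src=192.168.1.102,dst=172.93.74.28,id=2167,type=8,code=0),"
         "reply=(src=172.93.74.28,dst=172.93.73.4,id=8934,type=0,code=0),"
         "status=SEEN_REPLY|CONFIRMED|SRC_NAT|SRC_NAT_DONE,labels=0x1")

def parse(entry):
    tuples = {}
    for dirn in ("orig", "reply"):
        m = re.search(dirn + r"=\(src=([^,]+),dst=([^,]+)", entry)
        tuples[dirn] = {"src": m.group(1), "dst": m.group(2)}
    return tuples

t = parse(entry)
# The reply is addressed to 172.93.73.4 (the SNAT IP), not 192.168.1.102.
print(t["orig"]["src"])    # 192.168.1.102
print(t["reply"]["dst"])   # 172.93.73.4
```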
SNAT + DNAT

(figure: SNAT + DNAT pipeline)
UnSnat:table=1, cookie=0, priority=100,ip,nw_dst=116.0.0.1 actions=ct(table=2,zone=10,nat)

Dnat:table=2, cookie=0, priority=100,ip actions=ct(commit,table=3,zone=20,nat(dst=117.0.0.1))

Routing:table=3, cookie=0, priority=100,ip,nw_dst=117.0.0.0/24 actions=set_field:fe:fe:fe:fe:fe:bb->eth_dst,resubmit(,4)

UnDnat:table=4, cookie=0, priority=100,ip,nw_src=117.0.0.1 actions=ct(table=5,zone=20,nat)

Snat: table=5, cookie=0, priority=100,ip actions=ct(commit,table=6,zone=10,nat(src=116.0.0.1))
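The forward direction of this five-table pipeline can be sketched as follows (illustrative Python modelling only the DNAT and routing steps, with the addresses taken from the flows above; the per-zone CT state is omitted):

```python
# Forward path through the pipeline: DNAT the public VIP to the backend
# (table=2, zone=20), then route toward the 117.0.0.0/24 next hop (table=3).

def pipeline(pkt):
    trace = []
    # table=2 Dnat: nat(dst=117.0.0.1) rewrites the public VIP.
    if pkt["dst"] == "116.0.0.1":
        pkt["dst"] = "117.0.0.1"
        trace.append("dnat")
    # table=3 Routing: set eth_dst for the 117.0.0.0/24 next hop.
    if pkt["dst"].startswith("117.0.0."):
        pkt["eth_dst"] = "fe:fe:fe:fe:fe:bb"
        trace.append("route")
    return pkt, trace

p, t = pipeline({"src": "10.0.0.9", "dst": "116.0.0.1", "eth_dst": None})
print(p["dst"], t)   # 117.0.0.1 ['dnat', 'route']
```

On the return path the UnDnat and Snat tables undo these rewrites in the corresponding CT zones.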

CT implementation in the OVS datapath
// Matched as the "ct" action type when an add-flow request is parsed
OFPACT(CT,              ofpact_conntrack,   ofpact, "ct")           

// Parse the CT parameters of the action
static char * OVS_WARN_UNUSED_RESULT
parse_CT(char *arg, const struct ofpact_parse_params *pp)
{
    const size_t ct_offset = ofpacts_pull(pp->ofpacts);
    struct ofpact_conntrack *oc;
    char *error = NULL;
    char *key, *value;
    oc = ofpact_put_CT(pp->ofpacts);
    oc->flags = 0;
    oc->recirc_table = NX_CT_RECIRC_NONE;
    while (ofputil_parse_key_value(&arg, &key, &value)) {
        if (!strcmp(key, "commit")) {
            oc->flags |= NX_CT_F_COMMIT;
        } else if (!strcmp(key, "force")) {
            oc->flags |= NX_CT_F_FORCE;
        } else if (!strcmp(key, "table")) {
            if (!ofputil_table_from_string(value, pp->table_map,
                                           &oc->recirc_table)) {
                error = xasprintf("unknown table %s", value);
            } else if (oc->recirc_table == NX_CT_RECIRC_NONE) {
                error = xasprintf("invalid table %#"PRIx8, oc->recirc_table);
            }
        } else if (!strcmp(key, "zone")) {
            error = str_to_u16(value, "zone", &oc->zone_imm);
            if (error) {
                free(error);
                error = mf_parse_subfield(&oc->zone_src, value);
                if (error) {
                    return error;
                }
            }
        } else if (!strcmp(key, "alg")) {
            error = str_to_connhelper(value, &oc->alg);
        } else if (!strcmp(key, "nat")) {
            const size_t nat_offset = ofpacts_pull(pp->ofpacts);
            error = parse_NAT(value, pp);
            /* Update CT action pointer and length. */
            pp->ofpacts->header = ofpbuf_push_uninit(pp->ofpacts, nat_offset);
            oc = pp->ofpacts->header;
        } else if (!strcmp(key, "exec")) {
            /* Hide existing actions from ofpacts_parse_copy(), so the
             * nesting can be handled transparently. */
            enum ofputil_protocol usable_protocols2;
            const size_t exec_offset = ofpacts_pull(pp->ofpacts);
            /* Initializes 'usable_protocol2', fold it back to
             * '*usable_protocols' afterwards, so that we do not lose
             * restrictions already in there. */
            struct ofpact_parse_params pp2 = *pp;
            pp2.usable_protocols = &usable_protocols2;
            error = ofpacts_parse_copy(value, &pp2, false, OFPACT_CT);
            *pp->usable_protocols &= usable_protocols2;
            pp->ofpacts->header = ofpbuf_push_uninit(pp->ofpacts, exec_offset);
            oc = pp->ofpacts->header;
        } else {
            error = xasprintf("invalid argument to \"ct\" action: `%s'", key);
        }
    }
    if (!error && oc->flags & NX_CT_F_FORCE && !(oc->flags & NX_CT_F_COMMIT)) {
        error = xasprintf("\"force\" flag requires \"commit\" flag.");
    }
    if (ofpbuf_oversized(pp->ofpacts)) {
        free(error);
        return xasprintf("input too big");
    }
    ofpact_finish_CT(pp->ofpacts, &oc);
    ofpbuf_push_uninit(pp->ofpacts, ct_offset);
    return error;
}

// Fast path: on packet receive the flow is looked up in the kernel; continues from ovs_execute_actions in the previous post
int ovs_execute_actions(struct datapath *dp, struct sk_buff *skb,
            const struct sw_flow_actions *acts,
            struct sw_flow_key *key)
{
    int err, level;
    level = __this_cpu_inc_return(exec_actions_level);
    if (unlikely(level > OVS_RECURSION_LIMIT)) {
        net_crit_ratelimited("ovs: recursion limit reached on datapath %s, probable configuration error\n",
                     ovs_dp_name(dp));
        kfree_skb(skb);
        err = -ENETDOWN;
        goto out;
    }
    OVS_CB(skb)->acts_origlen = acts->orig_len;
    // Execute the actions
    err = do_execute_actions(dp, skb, key,
                 acts->actions, acts->actions_len);
    if (level == 1)
        process_deferred_actions(dp);
out:
    __this_cpu_dec(exec_actions_level);
    return err;
}

/* Execute a list of actions against 'skb'. */
static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
                  struct sw_flow_key *key,
                  const struct nlattr *attr, int len)
{
    const struct nlattr *a;
    int rem;
    for (a = attr, rem = len; rem > 0;
         a = nla_next(a, &rem)) {
        int err = 0;
        switch (nla_type(a)) {
        ......
        case OVS_ACTION_ATTR_CT:
            if (!is_flow_key_valid(key)) {
                err = ovs_flow_key_update(skb, key);
                if (err)
                    return err;
            }
            err = ovs_ct_execute(ovs_dp_get_net(dp), skb, key,
                         nla_data(a));
            /* Hide stolen IP fragments from user space. */
            if (err)
                return err == -EINPROGRESS ? 0 : err;
            break;
           ......
           } 
......
} 

/* Returns 0 on success, -EINPROGRESS if 'skb' is stolen, or other nonzero
 * value if 'skb' is freed.
 */
int ovs_ct_execute(struct net *net, struct sk_buff *skb,
           struct sw_flow_key *key,
           const struct ovs_conntrack_info *info)
{
    int nh_ofs;
    int err;
    /* The conntrack module expects to be working at L3. */
    nh_ofs = skb_network_offset(skb);
    skb_pull_rcsum(skb, nh_ofs);
    err = ovs_skb_network_trim(skb);
    if (err)
        return err;
    if (key->ip.frag != OVS_FRAG_TYPE_NONE) {
        err = handle_fragments(net, key, info->zone.id, skb);
        if (err)
            return err;
    }
    if (info->commit)
        // ct(commit,table=1)
        err = ovs_ct_commit(net, key, info, skb);
    else
        // ct(table=1)
        err = ovs_ct_lookup(net, key, info, skb);
    skb_push(skb, nh_ofs);
    skb_postpush_rcsum(skb, skb->data, nh_ofs);
    if (err)
        kfree_skb(skb);
    return err;
}
// Handles the ct_label, ct_mark, nat and commit parameters
static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
             const struct ovs_conntrack_info *info,
             struct sk_buff *skb)
{
    enum ip_conntrack_info ctinfo;
    struct nf_conn *ct;
    int err;
    // Lookup; NAT is applied inside
    err = __ovs_ct_lookup(net, key, info, skb);
    if (err)
        return err;
    /* The connection could be invalid, in which case this is a no-op.*/
    ct = nf_ct_get(skb, &ctinfo);
    if (!ct)
        return 0;
    ......
    // Set ct_mark
    if (info->mark.mask) {
        err = ovs_ct_set_mark(ct, key, info->mark.value,
                      info->mark.mask);
        if (err)
            return err;
    }
    // Set ct_label
    if (!nf_ct_is_confirmed(ct)) {
        err = ovs_ct_init_labels(ct, key, &info->labels.value,
                     &info->labels.mask);
        if (err)
            return err;
    } else if (IS_ENABLED(CONFIG_NF_CONNTRACK_LABELS) &&
           labels_nonzero(&info->labels.mask)) {
        err = ovs_ct_set_labels(ct, key, &info->labels.value,
                    &info->labels.mask);
        if (err)
            return err;
    }
    /* This will take care of sending queued events even if the connection
     * is already confirmed.
     */
    if (nf_conntrack_confirm(skb) != NF_ACCEPT)
        return -EINVAL;
    return 0;
}

static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key,
               const struct ovs_conntrack_info *info,
               struct sk_buff *skb)
{
    bool cached = skb_nfct_cached(net, key, info, skb);
    enum ip_conntrack_info ctinfo;
    struct nf_conn *ct;
    if (!cached) {
        struct nf_hook_state state = {
            .hook = NF_INET_PRE_ROUTING,
            .pf = info->family,
            .net = net,
        };
        struct nf_conn *tmpl = info->ct;
        int err;
        /* Associate skb with specified zone. */
        if (tmpl) {
            if (skb_nfct(skb))
                nf_conntrack_put(skb_nfct(skb));
            nf_conntrack_get(&tmpl->ct_general);
            nf_ct_set(skb, tmpl, IP_CT_NEW);
        }
        err = nf_conntrack_in(skb, &state);
        if (err != NF_ACCEPT)
            return -ENOENT;
        /* Clear CT state NAT flags to mark that we have not yet done
         * NAT after the nf_conntrack_in() call.  We can actually clear
         * the whole state, as it will be re-initialized below.
         */
        key->ct_state = 0;
        /* Update the key, but keep the NAT flags. */
        ovs_ct_update_key(skb, info, key, true, true);
    }
    ct = nf_ct_get(skb, &ctinfo);
    if (ct) {
        bool add_helper = false;
        // Apply ct(nat)
        if (info->nat && !(key->ct_state & OVS_CS_F_NAT_MASK) &&
            (nf_ct_is_confirmed(ct) || info->commit) &&
            ovs_ct_nat(net, key, info, skb, ct, ctinfo) != NF_ACCEPT) {
            return -EINVAL;
        }
        ......
}

static int ovs_ct_nat(struct net *net, struct sw_flow_key *key,
              const struct ovs_conntrack_info *info,
              struct sk_buff *skb, struct nf_conn *ct,
              enum ip_conntrack_info ctinfo)
{
    enum nf_nat_manip_type maniptype;
    int err;
#ifdef HAVE_NF_CT_IS_UNTRACKED
    if (nf_ct_is_untracked(ct)) {
        /* A NAT action may only be performed on tracked packets. */
        return NF_ACCEPT;
    }
#endif /* HAVE_NF_CT_IS_UNTRACKED */
    /* Add NAT extension if not confirmed yet. */
    if (!nf_ct_is_confirmed(ct) && !nf_ct_nat_ext_add(ct))
        return NF_ACCEPT;   /* Can't NAT. */
    /* Determine NAT type.
     * Check if the NAT type can be deduced from the tracked connection.
     * Make sure new expected connections (IP_CT_RELATED) are NATted only
     * when committing.
     */
    if (info->nat & OVS_CT_NAT && ctinfo != IP_CT_NEW &&
        ct->status & IPS_NAT_MASK &&
        (ctinfo != IP_CT_RELATED || info->commit)) {
        /* NAT an established or related connection like before. */
        if (CTINFO2DIR(ctinfo) == IP_CT_DIR_REPLY)
            /* This is the REPLY direction for a connection
             * for which NAT was applied in the forward
             * direction.  Do the reverse NAT.
             */
            maniptype = ct->status & IPS_SRC_NAT
                ? NF_NAT_MANIP_DST : NF_NAT_MANIP_SRC;
        else
            maniptype = ct->status & IPS_SRC_NAT
                ? NF_NAT_MANIP_SRC : NF_NAT_MANIP_DST;
    } else if (info->nat & OVS_CT_SRC_NAT) {
        maniptype = NF_NAT_MANIP_SRC;
    } else if (info->nat & OVS_CT_DST_NAT) {
        maniptype = NF_NAT_MANIP_DST;
    } else {
        return NF_ACCEPT; /* Connection is not NATed. */
    }
    // Perform the NAT action proper
    err = ovs_ct_nat_execute(skb, ct, ctinfo, &info->range, maniptype);
    if (err == NF_ACCEPT && ct->status & IPS_DST_NAT) {
        if (ct->status & IPS_SRC_NAT) {
            if (maniptype == NF_NAT_MANIP_SRC)
                maniptype = NF_NAT_MANIP_DST;
            else
                maniptype = NF_NAT_MANIP_SRC;
            err = ovs_ct_nat_execute(skb, ct, ctinfo, &info->range,
                         maniptype);
        } else if (CTINFO2DIR(ctinfo) == IP_CT_DIR_ORIGINAL) {
            err = ovs_ct_nat_execute(skb, ct, ctinfo, NULL,
                         NF_NAT_MANIP_SRC);
        }
    }
    /* Mark NAT done if successful and update the flow key. */
    if (err == NF_ACCEPT)
        ovs_nat_update_key(key, skb, maniptype);
    return err;
}

static int ovs_ct_nat_execute(struct sk_buff *skb, struct nf_conn *ct,
                  enum ip_conntrack_info ctinfo,
                  const struct nf_nat_range2 *range,
                  enum nf_nat_manip_type maniptype)
{
    int hooknum, nh_off, err = NF_ACCEPT;
    nh_off = skb_network_offset(skb);
    skb_pull_rcsum(skb, nh_off);
    /* See HOOK2MANIP(). */
    if (maniptype == NF_NAT_MANIP_SRC)
        hooknum = NF_INET_LOCAL_IN; /* Source NAT */
    else
        hooknum = NF_INET_LOCAL_OUT; /* Destination NAT */
    switch (ctinfo) {
    case IP_CT_RELATED:
    case IP_CT_RELATED_REPLY:
        if (IS_ENABLED(CONFIG_NF_NAT_IPV4) &&
            skb->protocol == htons(ETH_P_IP) &&
            ip_hdr(skb)->protocol == IPPROTO_ICMP) {
            if (!nf_nat_icmp_reply_translation(skb, ct, ctinfo,
                               hooknum))
                err = NF_DROP;
            goto push;
        } else if (IS_ENABLED(CONFIG_NF_NAT_IPV6) &&
               skb->protocol == htons(ETH_P_IPV6)) {
            __be16 frag_off;
            u8 nexthdr = ipv6_hdr(skb)->nexthdr;
            int hdrlen = ipv6_skip_exthdr(skb,
                              sizeof(struct ipv6hdr),
                              &nexthdr, &frag_off);
            if (hdrlen >= 0 && nexthdr == IPPROTO_ICMPV6) {
                if (!nf_nat_icmpv6_reply_translation(skb, ct,
                                     ctinfo,
                                     hooknum,
                                     hdrlen))
                    err = NF_DROP;
                goto push;
            }
        }
        /* Non-ICMP, fall thru to initialize if needed. */
        /* fall through */
    case IP_CT_NEW:
        /* Seen it before?  This can happen for loopback, retrans,
         * or local packets.
         */
        if (!nf_nat_initialized(ct, maniptype)) {
            /* Initialize according to the NAT action. */
            err = (range && range->flags & NF_NAT_RANGE_MAP_IPS)
                /* Action is set up to establish a new
                 * mapping.
                 */
                ? nf_nat_setup_info(ct, range, maniptype)
                : nf_nat_alloc_null_binding(ct, hooknum);
            if (err != NF_ACCEPT)
                goto push;
        }
        break;
    case IP_CT_ESTABLISHED:
    case IP_CT_ESTABLISHED_REPLY:
        break;
    default:
        err = NF_DROP;
        goto push;
    }
    err = nf_nat_packet(ct, ctinfo, hooknum, skb);
push:
    skb_push(skb, nh_off);
    skb_postpush_rcsum(skb, skb->data, nh_off);
    return err;
}

static int ovs_ct_lookup(struct net *net, struct sw_flow_key *key,
             const struct ovs_conntrack_info *info,
             struct sk_buff *skb)
{
    struct nf_conntrack_expect *exp;
    exp = ovs_ct_expect_find(net, &info->zone, info->family, skb);
    if (exp) {
        u8 state;
        // An expectation matched: update ct_state to +trk, +new and +rel
        state = OVS_CS_F_TRACKED | OVS_CS_F_NEW | OVS_CS_F_RELATED;
        __ovs_ct_update_key(key, state, &info->zone, exp->master);
    } else {
        struct nf_conn *ct;
        int err;
        err = __ovs_ct_lookup(net, key, info, skb);
        if (err)
            return err;
        ct = (struct nf_conn *)skb_nfct(skb);
        if (ct)
            nf_ct_deliver_cached_events(ct);
    }
    return 0;
}

// Slow path: on a kernel-datapath miss, an upcall sends the packet to userspace for lookup
static void *udpif_upcall_handler(void *arg)
{
    struct handler *handler = arg;
    struct udpif *udpif = handler->udpif;
    while (!latch_is_set(&handler->udpif->exit_latch)) {
        if (recv_upcalls(handler)) {
            poll_immediate_wake();
        } else {
            dpif_recv_wait(udpif->dpif, handler->handler_id);
            latch_wait(&udpif->exit_latch);
        }
        poll_block();
    }
    return NULL;
}

// 2. Receive a batch of upcalls in userspace
static size_t recv_upcalls(struct handler *handler)
{
    struct udpif *udpif = handler->udpif;
    uint64_t recv_stubs[UPCALL_MAX_BATCH][512 / 8];
    struct ofpbuf recv_bufs[UPCALL_MAX_BATCH];
    struct dpif_upcall dupcalls[UPCALL_MAX_BATCH];
    struct upcall upcalls[UPCALL_MAX_BATCH];
    struct flow flows[UPCALL_MAX_BATCH];
    size_t n_upcalls, i;
    n_upcalls = 0;
    while (n_upcalls < UPCALL_MAX_BATCH) {
        struct ofpbuf *recv_buf = &recv_bufs[n_upcalls];
        struct dpif_upcall *dupcall = &dupcalls[n_upcalls];
        struct upcall *upcall = &upcalls[n_upcalls];
        struct flow *flow = &flows[n_upcalls];
        unsigned int mru = 0;
        uint64_t hash = 0;
        int error;
        ofpbuf_use_stub(recv_buf, recv_stubs[n_upcalls],
                        sizeof recv_stubs[n_upcalls]);
        // 2.1 Receive the upcall request
        if (dpif_recv(udpif->dpif, handler->handler_id, dupcall, recv_buf)) {
            ofpbuf_uninit(recv_buf);
            break;
        }
        upcall->fitness = odp_flow_key_to_flow(dupcall->key, dupcall->key_len,
                                               flow, NULL);
        if (upcall->fitness == ODP_FIT_ERROR) {
            goto free_dupcall;
        }
        if (dupcall->mru) {
            mru = nl_attr_get_u16(dupcall->mru);
        }
        if (dupcall->hash) {
            hash = nl_attr_get_u64(dupcall->hash);
        }
        // 2.2 Classify the data passed up from the kernel
        error = upcall_receive(upcall, udpif->backer, &dupcall->packet,
                               dupcall->type, dupcall->userdata, flow, mru,
                               &dupcall->ufid, PMD_ID_NULL);
        if (error) {
            if (error == ENODEV) {
                /* Received packet on datapath port for which we couldn't
                 * associate an ofproto.  This can happen if a port is removed
                 * while traffic is being received.  Print a rate-limited
                 * message in case it happens frequently. */
                dpif_flow_put(udpif->dpif, DPIF_FP_CREATE, dupcall->key,
                              dupcall->key_len, NULL, 0, NULL, 0,
                              &dupcall->ufid, PMD_ID_NULL, NULL);
                VLOG_INFO_RL(&rl, "received packet on unassociated datapath "
                             "port %"PRIu32, flow->in_port.odp_port);
            }
            goto free_dupcall;
        }
        upcall->key = dupcall->key;
        upcall->key_len = dupcall->key_len;
        upcall->ufid = &dupcall->ufid;
        upcall->hash = hash;
        upcall->out_tun_key = dupcall->out_tun_key;
        upcall->actions = dupcall->actions;
        pkt_metadata_from_flow(&dupcall->packet.md, flow);
        // 2.3 Extract the flow
        flow_extract(&dupcall->packet, flow);
        // 2.4 Process the upcall data
        error = process_upcall(udpif, upcall,
                               &upcall->odp_actions, &upcall->wc);
        if (error) {
            goto cleanup;
        }
        n_upcalls++;
        continue;
}
// 2.4 Calls upcall_xlate
static int process_upcall(struct udpif *udpif, struct upcall *upcall,
               struct ofpbuf *odp_actions, struct flow_wildcards *wc)
{
    const struct dp_packet *packet = upcall->packet;
    const struct flow *flow = upcall->flow;
    size_t actions_len = 0;
    switch (upcall->type) {
    case MISS_UPCALL:
    case SLOW_PATH_UPCALL:
        // 2.4.1 Handle the upcall type
        upcall_xlate(udpif, upcall, odp_actions, wc);
        return 0;
    }
}

// 2.4.1 Handle the upcall type
static void upcall_xlate(struct udpif *udpif, struct upcall *upcall,
             struct ofpbuf *odp_actions, struct flow_wildcards *wc)
{
    struct dpif_flow_stats stats;
    enum xlate_error xerr;
    struct xlate_in xin;
    struct ds output;
    stats.n_packets = 1;
    stats.n_bytes = dp_packet_size(upcall->packet);
    stats.used = time_msec();
    stats.tcp_flags = ntohs(upcall->flow->tcp_flags);
    xlate_in_init(&xin, upcall->ofproto,
                  ofproto_dpif_get_tables_version(upcall->ofproto),
                  upcall->flow, upcall->ofp_in_port, NULL,
                  stats.tcp_flags, upcall->packet, wc, odp_actions);
   
    upcall->reval_seq = seq_read(udpif->reval_seq);
    // 2.4.1.1 Walk the flow tables one by one looking for a match
    xerr = xlate_actions(&xin, &upcall->xout);
    /* Translate again and log the ofproto trace for
     * these two error types. */
    
    /* This function is also called for slow-pathed flows.  As we are only
     * going to create new datapath flows for actual datapath misses, there is
     * no point in creating a ukey otherwise. */
    if (upcall->type == MISS_UPCALL) {
        upcall->ukey = ukey_create_from_upcall(upcall, wc);
    }
}
// 2.4.1.1 Match against the flow tables
enum xlate_error xlate_actions(struct xlate_in *xin, struct xlate_out *xout)
{
    *xout = (struct xlate_out) {
        .slow = 0,
        .recircs = RECIRC_REFS_EMPTY_INITIALIZER,
    };
    if (!xin->ofpacts && !ctx.rule) {
        // 2.4.1.1 Walk the flow tables one by one looking for a match
        ctx.rule = rule_dpif_lookup_from_table(
            ctx.xbridge->ofproto, ctx.xin->tables_version, flow, ctx.wc,
            ctx.xin->resubmit_stats, &ctx.table_id,
            flow->in_port.ofp_port, true, true, ctx.xin->xcache);
    }
    // 2.4.1.2 Execute the actions
     do_xlate_actions(ofpacts, ofpacts_len, &ctx, true, false);
}

// 2.4.1.2 Execute the CT module's action
static void do_xlate_actions(const struct ofpact *ofpacts, size_t ofpacts_len,
                 struct xlate_ctx *ctx, bool is_last_action,
                 bool group_bucket_action)
{
    ......
    case OFPACT_CT:
            compose_conntrack_action(ctx, ofpact_get_CT(a), last);
            break;
        case OFPACT_CT_CLEAR:
            if (ctx->conntracked) {
                compose_ct_clear_action(ctx);
            }
            break;
        case OFPACT_NAT:
            /* This will be processed by compose_conntrack_action(). */
            ctx->ct_nat_action = ofpact_get_NAT(a);
            break;
}

static void compose_conntrack_action(struct xlate_ctx *ctx, struct ofpact_conntrack *ofc,
                         bool is_last_action)
{
  
    /* Ensure that any prior actions are applied before composing the new
     * conntrack action. */
    xlate_commit_actions(ctx);
    /* Process nested actions first, to populate the key. */
    ctx->ct_nat_action = NULL;
    ctx->wc->masks.ct_mark = 0;
    ctx->wc->masks.ct_label = OVS_U128_ZERO;
    do_xlate_actions(ofc->actions, ofpact_ct_get_action_len(ofc), ctx,
                     is_last_action, false);
    ct_offset = nl_msg_start_nested(ctx->odp_actions, OVS_ACTION_ATTR_CT);
    if (ofc->flags & NX_CT_F_COMMIT) {
        nl_msg_put_flag(ctx->odp_actions, ofc->flags & NX_CT_F_FORCE ?
                        OVS_CT_ATTR_FORCE_COMMIT : OVS_CT_ATTR_COMMIT);
        if (ctx->xbridge->support.ct_eventmask) {
            nl_msg_put_u32(ctx->odp_actions, OVS_CT_ATTR_EVENTMASK,
                           OVS_CT_EVENTMASK_DEFAULT);
        }
        if (ctx->xbridge->support.ct_timeout) {
            put_ct_timeout(ctx->odp_actions, ctx->xbridge->ofproto->backer,
                           &ctx->xin->flow, ctx->wc, zone);
        }
    }
    nl_msg_put_u16(ctx->odp_actions, OVS_CT_ATTR_ZONE, zone);
    put_ct_mark(&ctx->xin->flow, ctx->odp_actions, ctx->wc);
    put_ct_label(&ctx->xin->flow, ctx->odp_actions, ctx->wc);
    put_ct_helper(ctx, ctx->odp_actions, ofc);
    put_ct_nat(ctx);
    nl_msg_end_nested(ctx->odp_actions, ct_offset);
    ctx->wc->masks.ct_mark = old_ct_mark_mask;
    ctx->wc->masks.ct_label = old_ct_label_mask;
    if (ofc->recirc_table != NX_CT_RECIRC_NONE) {
        ctx->conntracked = true;
        // Once userspace matches a flow, a rule is installed in the kernel datapath
        compose_recirculate_and_fork(ctx, ofc->recirc_table, zone);
    }
    ctx->ct_nat_action = NULL;
    /* The ct_* fields are only available in the scope of the 'recirc_table'
     * call chain. */
    flow_clear_conntrack(&ctx->xin->flow);
    xlate_report(ctx, OFT_DETAIL, "Sets the packet to an untracked state, "
                 "and clears all the conntrack fields.");
    ctx->conntracked = false;
}