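OVS-DPDK has three tiers of lookup tables/caches. An incoming packet is first matched against the EMC (exact match cache); on a miss it is passed to the dpcls (datapath classifier). The dpcls is implemented with a tuple space search (TSS) algorithm, so it can match on arbitrary bits of the packet headers. If the packet still finds no match in the dpcls, it is sent to the OpenFlow pipeline, i.e. the ofproto classifier, which is controlled by the SDN controller.
After classification, a variety of actions can be applied to a packet, such as forwarding it to a specific port, adding a VLAN tag, dropping it, or sending it to the connection tracking module.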
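The lookup order described above can be sketched as a simple fall-through with write-back to the faster tiers. This is a toy model under loud assumptions, not OVS code: every name (`toy_table`, `toy_find`, `toy_add`, `toy_lookup`) is hypothetical, each tier is a tiny array keyed by one 32-bit value, and wildcard matching in the dpcls is ignored. It only shows which tier answers a lookup and how a slow-path hit populates the tiers above it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS 8   /* Hypothetical capacity of each toy tier. */

enum toy_tier { TOY_EMC, TOY_DPCLS, TOY_OFPROTO, TOY_NONE };

/* Hypothetical stand-in for one lookup tier: key -> action id. */
struct toy_table {
    uint32_t keys[TOY_SLOTS];
    int actions[TOY_SLOTS];
    size_t n;
};

struct toy_datapath {
    struct toy_table emc, dpcls, ofproto;
};

/* Returns the action id stored for 'key', or -1 on a miss. */
static int toy_find(const struct toy_table *t, uint32_t key)
{
    for (size_t i = 0; i < t->n; i++) {
        if (t->keys[i] == key) {
            return t->actions[i];
        }
    }
    return -1;
}

static void toy_add(struct toy_table *t, uint32_t key, int action)
{
    assert(t->n < TOY_SLOTS);
    t->keys[t->n] = key;
    t->actions[t->n] = action;
    t->n++;
}

/* Tries EMC, then dpcls, then ofproto. A hit in a slower tier installs
 * the entry in every faster tier, so the next packet hits earlier. */
static enum toy_tier toy_lookup(struct toy_datapath *dp, uint32_t key,
                                int *action)
{
    int a = toy_find(&dp->emc, key);
    if (a >= 0) {
        *action = a;
        return TOY_EMC;
    }
    a = toy_find(&dp->dpcls, key);
    if (a >= 0) {
        toy_add(&dp->emc, key, a);
        *action = a;
        return TOY_DPCLS;
    }
    a = toy_find(&dp->ofproto, key);
    if (a >= 0) {
        toy_add(&dp->dpcls, key, a);
        toy_add(&dp->emc, key, a);
        *action = a;
        return TOY_OFPROTO;
    }
    return TOY_NONE;
}
```

In this sketch the first packet of a flow is answered by the slowest tier and every later packet of the same flow by the fastest one, which is the effect the three-tier design aims for.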
EMC Call Graph
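After a packet is received, its headers are extracted into a miniflow. A miniflow is a sparse representation of struct flow and has two advantages:
1. It reduces memory use and the number of cache lines touched.
2. struct flow is very large and most of its values are 0, so a miniflow allows fast iteration over only the nonzero values. Each uint64_t in struct flow is represented by 1 bit in miniflow.map.bits.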
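The bitmap-plus-packed-values idea can be sketched as follows. This is a simplified illustration, not the real miniflow layout: `toy_miniflow`, `TOY_FLOW_U64S` and the helper functions are hypothetical, the "full flow" is assumed to be exactly 64 uint64_t units so that one 64-bit map suffices, the values array is sized for the worst case to keep the code short, and `__builtin_popcountll` assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FLOW_U64S 64   /* Hypothetical: units in the full flow. */

struct toy_miniflow {
    uint64_t map;                    /* Bit i set: unit i is nonzero. */
    uint64_t values[TOY_FLOW_U64S];  /* Nonzero units, ascending index. */
};

/* Number of values actually stored. */
static size_t toy_miniflow_n_values(const struct toy_miniflow *mf)
{
    return (size_t) __builtin_popcountll(mf->map);
}

/* Packs the nonzero units of 'flow' into 'mf'. */
static void toy_miniflow_compress(const uint64_t flow[TOY_FLOW_U64S],
                                  struct toy_miniflow *mf)
{
    size_t n = 0;

    mf->map = 0;
    for (size_t i = 0; i < TOY_FLOW_U64S; i++) {
        if (flow[i]) {
            mf->map |= UINT64_C(1) << i;
            mf->values[n++] = flow[i];
        }
    }
}

/* Returns unit 'idx' of the original flow. The position inside
 * 'values' is the number of map bits set below 'idx'. */
static uint64_t toy_miniflow_get(const struct toy_miniflow *mf, size_t idx)
{
    assert(idx < TOY_FLOW_U64S);
    uint64_t bit = UINT64_C(1) << idx;

    if (!(mf->map & bit)) {
        return 0;
    }
    return mf->values[__builtin_popcountll(mf->map & (bit - 1))];
}
```

Only the entries counted by `toy_miniflow_n_values` have to be read when hashing or comparing, which is where the saving in cache lines comes from.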
emc_processing fills in struct netdev_flow_key keys[PKT_ARRAY_SIZE]: on return, keys[i] holds the miniflow of the i-th packet that missed the emc_cache.
In the EMC stage, packets are processed by the following function:
static inline size_t
emc_processing(struct dp_netdev_pmd_thread *pmd, struct dp_packet_batch *packets_,
               struct netdev_flow_key *keys,
               struct packet_batch_per_flow batches[], size_t *n_batches,
               bool md_is_valid, odp_port_t port_no)
{
    struct emc_cache *flow_cache = &pmd->flow_cache;
    struct netdev_flow_key *key = &keys[0];
    size_t i, n_missed = 0, n_dropped = 0;
    struct dp_packet **packets = packets_->packets;
    int cnt = packets_->count;

    /* Process each packet in the dp_packet_batch in turn. */
    for (i = 0; i < cnt; i++) {
        struct dp_netdev_flow *flow;
        struct dp_packet *packet = packets[i];

        /* Drop packets that are shorter than an Ethernet header. */
        if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
            dp_packet_delete(packet);
            n_dropped++;
            continue;
        }

        /* Manual prefetching reduces read latency and so improves
         * performance. */
        if (i != cnt - 1) {
            /* Prefetch next packet data and metadata. */
            OVS_PREFETCH(dp_packet_data(packets[i+1]));
            pkt_metadata_prefetch_init(&packets[i+1]->md);
        }

        /* Initialize the metadata: zero every byte of pkt_metadata that
         * precedes in_port, then set in_port.odp_port to port_no and
         * tunnel.ip_dst to 0, so that the other tunnel fields are never
         * looked at. */
        if (!md_is_valid) {
            pkt_metadata_init(&packet->md, port_no);
        }

        /* Extract the miniflow from the values in pkt_metadata and from
         * dp_packet->mbuf. */
        miniflow_extract(packet, &key->mf);
        key->len = 0; /* Not computed yet. */
        /* Compute the hash stored in the netdev_flow_key that holds this
         * packet's miniflow. emc_lookup uses this hash to find the entry.
         * When RSS is enabled on the NIC the hash is computed on receive;
         * otherwise it comes from miniflow_hash_5tuple. */
        key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);

        /* emc_lookup returns a dp_netdev_flow when three conditions hold:
         * key->hash matches, the emc_entry is alive, and the miniflow
         * matches. */
        flow = emc_lookup(flow_cache, key);
        if (OVS_LIKELY(flow)) {
            /* Classify the packet by its dp_netdev_flow: all packets that
             * map to the same dp_netdev_flow go into the same
             * packet_batch_per_flow. */
            dp_netdev_queue_batches(packet, flow, &key->mf, batches,
                                    n_batches);
        } else {
            /* Exact match cache missed. Group missed packets together at
             * the beginning of the 'packets' array. */
            packets[n_missed] = packet;
            /* 'key[n_missed]' contains the key of the current packet and it
             * must be returned to the caller. The next key should be extracted
             * to 'keys[n_missed + 1]'. */
            key = &keys[++n_missed];
        }
    }

    dp_netdev_count_packet(pmd, DP_STAT_EXACT_HIT, cnt - n_dropped - n_missed);

    return n_missed;
}
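Two ideas in emc_processing can be shown in isolation: the exact-match lookup (hash match, live entry, full key match) and the compaction of missed keys to the front of the array. The code below is a minimal sketch with hypothetical names (`toy_key`, `toy_emc`, `toy_emc_lookup`, `toy_emc_insert`, `toy_emc_process`); it assumes a direct-mapped cache with one candidate slot per hash and a 16-byte stand-in for the miniflow, which is simpler than the real EMC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_EMC_ENTRIES 256                  /* Hypothetical; power of 2. */
#define TOY_EMC_MASK (TOY_EMC_ENTRIES - 1)

struct toy_key {
    uint32_t hash;       /* E.g. a NIC RSS hash or a software hash. */
    uint8_t bytes[16];   /* Stand-in for the miniflow contents. */
};

struct toy_emc_entry {
    struct toy_key key;
    int flow_id;         /* 0 means the entry is empty (not alive). */
};

struct toy_emc {
    struct toy_emc_entry entries[TOY_EMC_ENTRIES];
};

/* Returns the cached flow id, or 0 on a miss. A hit needs a live entry,
 * an equal hash and byte-for-byte equal key contents. */
static int toy_emc_lookup(const struct toy_emc *emc, const struct toy_key *key)
{
    const struct toy_emc_entry *e = &emc->entries[key->hash & TOY_EMC_MASK];

    if (e->flow_id && e->key.hash == key->hash
        && !memcmp(e->key.bytes, key->bytes, sizeof key->bytes)) {
        return e->flow_id;
    }
    return 0;
}

/* Overwrites whatever occupied the slot selected by the hash. */
static void toy_emc_insert(struct toy_emc *emc, const struct toy_key *key,
                           int flow_id)
{
    assert(flow_id != 0);
    struct toy_emc_entry *e = &emc->entries[key->hash & TOY_EMC_MASK];

    e->key = *key;
    e->flow_id = flow_id;
}

/* Looks up every key and moves the keys that missed to keys[0..n_missed),
 * preserving their order. Returns n_missed. */
static size_t toy_emc_process(const struct toy_emc *emc,
                              struct toy_key keys[], size_t cnt)
{
    size_t n_missed = 0;

    for (size_t i = 0; i < cnt; i++) {
        if (!toy_emc_lookup(emc, &keys[i])) {
            keys[n_missed++] = keys[i];
        }
    }
    return n_missed;
}
```

Keeping the misses contiguous at the front means the next stage can iterate over exactly n_missed keys without any per-packet "was this a hit?" check.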
Datapath Classifier Call Graph
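Within each subtable, the miniflow extracted from a packet is combined with the subtable mask to form a search key, which dpcls_lookup uses for matching. The command ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=21.2.10.1/24,actions=output:2 creates a flow in the ofproto classifier. When a packet with source IP "21.2.10.5" arrives for the first time, it matches in neither the EMC nor the dpcls; the learning mechanism then installs entries for this flow in both the dpcls and the EMC. Suppose the resulting rule is Rule #1: Src IP="21.2.10.*"
To store the wildcarded Rule #1, a suitable "Mask #1" must be created first. The mask has 1 in every bit position that must match and 0 elsewhere, so Mask #1 is "0xFF.FF.FF.00". A hash table "HT 1" is then instantiated as a subtable.
HT 1 also stores similar rules, i.e. rules with the same fields and the same mask, such as Rule #1A: Src IP="83.83.83.*". Each subtable therefore holds the rules that share the same fields and the same mask.
When a packet with Src IP=21.2.10.99 is processed, the source IP is ANDed with Mask #1 of subtable HT 1, and the hash of the result is used to look up the entries stored in HT 1.
A dpcls contains multiple subtables, and each subtable contains multiple rules. While looking up the hash values, cmap_find_batch also records the matching rule node for each miniflow.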
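The Rule #1 / Mask #1 example can be worked through in a few lines of code. This is a sketch of tuple space search restricted to a single field (the IPv4 source address). All names (`toy_rule`, `toy_subtable`, `toy_ipv4`, `toy_subtable_insert`, `toy_tss_lookup`) are hypothetical, a linear scan replaces the per-subtable hash table, and output port 3 for Rule #1A is an invented value for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_RULES 8   /* Hypothetical capacity of one subtable. */

struct toy_rule {
    uint32_t masked_src;  /* Source IP with the subtable mask applied. */
    int out_port;
};

struct toy_subtable {
    uint32_t mask;        /* 1 bits must match, 0 bits are wildcarded. */
    struct toy_rule rules[TOY_MAX_RULES];
    size_t n_rules;
};

/* Builds a host-order IPv4 address from its four octets. */
static uint32_t toy_ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint32_t) a << 24 | (uint32_t) b << 16 | (uint32_t) c << 8 | d;
}

/* Stores the rule pre-masked, so lookups only compare masked values. */
static void toy_subtable_insert(struct toy_subtable *st, uint32_t src,
                                int out_port)
{
    assert(st->n_rules < TOY_MAX_RULES);
    st->rules[st->n_rules].masked_src = src & st->mask;
    st->rules[st->n_rules].out_port = out_port;
    st->n_rules++;
}

/* Masks the key once per subtable and searches that subtable only for
 * the masked value. Returns the output port, or -1 if nothing matches. */
static int toy_tss_lookup(const struct toy_subtable *subtables, size_t n,
                          uint32_t src)
{
    for (size_t s = 0; s < n; s++) {
        uint32_t masked = src & subtables[s].mask;

        for (size_t r = 0; r < subtables[s].n_rules; r++) {
            if (subtables[s].rules[r].masked_src == masked) {
                return subtables[s].rules[r].out_port;
            }
        }
    }
    return -1;
}
```

With a subtable whose mask is 0xFFFFFF00 holding 21.2.10.* and 83.83.83.*, the address 21.2.10.99 masks to 21.2.10.0 and matches Rule #1, while 21.2.11.99 masks to 21.2.11.0 and matches nothing. The cost of a lookup grows with the number of distinct masks (subtables), not with the number of rules sharing a mask.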
static bool
dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
             struct dpcls_rule **rules, const size_t cnt,
             int *num_lookups_p)
{
    /* The received 'cnt' miniflows are the search-keys that will be processed
     * to find a matching entry into the available subtables.
     * The number of bits in map_type is equal to NETDEV_MAX_BURST. */
    typedef uint32_t map_type;
#define MAP_BITS (sizeof(map_type) * CHAR_BIT)
    BUILD_ASSERT_DECL(MAP_BITS >= NETDEV_MAX_BURST);

    struct dpcls_subtable *subtable;

    map_type keys_map = TYPE_MAXIMUM(map_type); /* Set all bits. */
    map_type found_map;
    uint32_t hashes[MAP_BITS];
    const struct cmap_node *nodes[MAP_BITS];

    if (cnt != MAP_BITS) {
        /* After the shift, the number of bits set in keys_map equals the
         * number of packets, and bit i corresponds to the i-th packet. */
        keys_map >>= MAP_BITS - cnt; /* Clear extra bits. */
    }
    memset(rules, 0, cnt * sizeof *rules);

    int lookups_match = 0, subtable_pos = 1;

    /* The Datapath classifier - aka dpcls - is composed of subtables.
     * Subtables are dynamically created as needed when new rules are inserted.
     * Each subtable collects rules with matches on a specific subset of packet
     * fields as defined by the subtable's mask. We proceed to process every
     * search-key against each subtable, but when a match is found for a
     * search-key, the search for that key can stop because the rules are
     * non-overlapping. */
    PVECTOR_FOR_EACH (subtable, &cls->subtables) {
        int i;

        /* Compute hashes for the remaining keys. Each search-key is
         * masked with the subtable's mask to avoid hashing the wildcarded
         * bits. */
        ULLONG_FOR_EACH_1(i, keys_map) {
            /* Murmur hash of each packet's miniflow keys[i]. */
            hashes[i] = netdev_flow_key_hash_in_mask(&keys[i],
                                                     &subtable->mask);
        }
        /* Lookup. */
        /* For every bit set in keys_map, look up the corresponding hash in
         * subtable->rules. On a hit, set that bit in found_map and store
         * the pointer to the matching rule node in nodes. */
        found_map = cmap_find_batch(&subtable->rules, keys_map, hashes, nodes);
        /* Check results. When the i-th bit of found_map is set, it means
         * that a set of nodes with a matching hash value was found for the
         * i-th search-key. Due to possible hash collisions we need to check
         * which of the found rules, if any, really matches our masked
         * search-key. */
        ULLONG_FOR_EACH_1(i, found_map) {
            struct dpcls_rule *rule;

            CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
                /* Compare keys[i] masked with rule->mask against
                 * rule->flow. */
                if (OVS_LIKELY(dpcls_rule_matches_key(rule, &keys[i]))) {
                    rules[i] = rule;
                    /* Even at 20 Mpps the 32-bit hit_cnt cannot wrap
                     * within one second optimization interval. */
                    subtable->hit_cnt++;
                    lookups_match += subtable_pos;
                    goto next;
                }
            }
            /* None of the found rules was a match. Reset the i-th bit to
             * keep searching this key in the next subtable. */
            ULLONG_SET0(found_map, i); /* Did not match. */
        next:
            ; /* Keep Sparse happy. */
        }
        keys_map &= ~found_map; /* Clear the found rules. */
        if (!keys_map) {
            if (num_lookups_p) {
                *num_lookups_p = lookups_match;
            }
            return true; /* All found. */
        }
        subtable_pos++;
    }
    if (num_lookups_p) {
        *num_lookups_p = lookups_match;
    }
    return false; /* Some misses. */
}
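The keys_map bookkeeping in dpcls_lookup is easier to follow in isolation. The sketch below uses hypothetical helpers (`toy_initial_keys_map`, `toy_collect_set_bits`, `toy_resolve`) and feeds in ready-made per-subtable found maps instead of doing real lookups; `__builtin_ctz` assumes GCC or Clang with a 32-bit unsigned int. It shows three steps: setting one bit per packet, iterating over the set bits, and clearing the bits of packets that have found a rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAP_BITS 32   /* Bits in the map: the largest batch size. */

/* Returns a map with bits 0..cnt-1 set, one bit per packet. */
static uint32_t toy_initial_keys_map(size_t cnt)
{
    assert(cnt <= TOY_MAP_BITS);
    if (cnt == 0) {
        return 0;   /* Shifting a 32-bit value by 32 is undefined. */
    }
    return UINT32_MAX >> (TOY_MAP_BITS - cnt);
}

/* Writes the indices of the set bits of 'map' to 'out', lowest first,
 * and returns how many there are. */
static size_t toy_collect_set_bits(uint32_t map, int out[TOY_MAP_BITS])
{
    size_t n = 0;

    while (map) {
        out[n++] = __builtin_ctz(map);
        map &= map - 1;   /* Clear the lowest set bit. */
    }
    return n;
}

/* Applies each subtable's found map in turn. Returns true as soon as
 * every key has matched; otherwise stores the unmatched keys in
 * '*remaining' and returns false. */
static bool toy_resolve(uint32_t keys_map, const uint32_t found_maps[],
                        size_t n_subtables, uint32_t *remaining)
{
    for (size_t s = 0; s < n_subtables; s++) {
        keys_map &= ~found_maps[s];   /* Stop searching matched keys. */
        if (!keys_map) {
            *remaining = 0;
            return true;
        }
    }
    *remaining = keys_map;
    return false;
}
```

For a batch of 4 packets the initial map is 0xF. If the first subtable matches packets 0 and 2 (found map 0x5), only packets 1 and 3 (0xA) are hashed against the second subtable, so the work per subtable shrinks as matches accumulate.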
Each dp_packet maps to a dp_netdev_flow (found through its miniflow), and each dp_netdev_flow has its own corresponding dpcls_rule.
static inline void
fast_path_processing(struct dp_netdev_pmd_thread *pmd,
                     struct dp_packet_batch *packets_,
                     struct netdev_flow_key *keys,
                     struct packet_batch_per_flow batches[], size_t *n_batches,
                     odp_port_t in_port,
                     long long now)
{
    int cnt = packets_->count;
#if !defined(__CHECKER__) && !defined(_WIN32)
    const size_t PKT_ARRAY_SIZE = cnt;
#else
    /* Sparse or MSVC doesn't like variable length array. */
    enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
#endif
    struct dp_packet **packets = packets_->packets;
    struct dpcls *cls;
    struct dpcls_rule *rules[PKT_ARRAY_SIZE];
    struct dp_netdev *dp = pmd->dp;
    struct emc_cache *flow_cache = &pmd->flow_cache;
    int miss_cnt = 0, lost_cnt = 0;
    int lookup_cnt = 0, add_lookup_cnt;
    bool any_miss;
    size_t i;

    for (i = 0; i < cnt; i++) {
        /* Key length is needed in all the cases, hash computed on demand. */
        keys[i].len = netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
    }
    /* Get the classifier for the in_port */
    /* A hash is computed from in_port and used to find the dpcls in
     * pmd->classifiers. Each in_port has its own dpcls. */
    cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
    if (OVS_LIKELY(cls)) {
        any_miss = !dpcls_lookup(cls, keys, rules, cnt, &lookup_cnt);
    } else {
        any_miss = true;
        memset(rules, 0, sizeof(rules));
    }
    /* Every packets[i] whose rules[i] is NULL goes through the upcall
     * path. */
    if (OVS_UNLIKELY(any_miss) && !fat_rwlock_tryrdlock(&dp->upcall_rwlock)) {
        uint64_t actions_stub[512 / 8], slow_stub[512 / 8];
        struct ofpbuf actions, put_actions;

        ofpbuf_use_stub(&actions, actions_stub, sizeof actions_stub);
        ofpbuf_use_stub(&put_actions, slow_stub, sizeof slow_stub);

        for (i = 0; i < cnt; i++) {
            struct dp_netdev_flow *netdev_flow;

            if (OVS_LIKELY(rules[i])) {
                continue;
            }

            /* It's possible that an earlier slow path execution installed
             * a rule covering this flow. In this case, it's a lot cheaper
             * to catch it here than execute a miss. */
            /* The in_port is read from the miniflow in keys and used to
             * find the dpcls. If one is found, dpcls_lookup is called
             * once more to search for a rule. */
            netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &keys[i],
                                                    &add_lookup_cnt);
            if (netdev_flow) {
                lookup_cnt += add_lookup_cnt;
                rules[i] = &netdev_flow->cr;
                continue;
            }

            miss_cnt++;
            handle_packet_upcall(pmd, packets[i], &keys[i], &actions,
                                 &put_actions, &lost_cnt, now);
        }

        ofpbuf_uninit(&actions);
        ofpbuf_uninit(&put_actions);
        fat_rwlock_unlock(&dp->upcall_rwlock);
    } else if (OVS_UNLIKELY(any_miss)) {
        for (i = 0; i < cnt; i++) {
            if (OVS_UNLIKELY(!rules[i])) {
                dp_packet_delete(packets[i]);
                lost_cnt++;
                miss_cnt++;
            }
        }
    }

    for (i = 0; i < cnt; i++) {
        struct dp_packet *packet = packets[i];
        struct dp_netdev_flow *flow;

        if (OVS_UNLIKELY(!rules[i])) {
            continue;
        }

        /* Get the dp_netdev_flow that corresponds to this packet's
         * dpcls_rule, insert that flow into the EMC, and queue the packet
         * into the batch for that flow. */
        flow = dp_netdev_flow_cast(rules[i]);

        emc_insert(flow_cache, &keys[i], flow);
        dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
    }

    dp_netdev_count_packet(pmd, DP_STAT_MASKED_HIT, cnt - miss_cnt);
    dp_netdev_count_packet(pmd, DP_STAT_LOOKUP_HIT, lookup_cnt);
    dp_netdev_count_packet(pmd, DP_STAT_MISS, miss_cnt);
    dp_netdev_count_packet(pmd, DP_STAT_LOST, lost_cnt);
}
Action Execution Call Graph
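Packets that belong to the same flow are queued into the same batch, and each batch is processed according to that flow's actions. To improve forwarding performance, the packets in a batch are processed together.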
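Grouping packets by flow can be sketched as below. The names (`toy_batch`, `toy_queue_packet`, `TOY_MAX_BURST`) are hypothetical and flows are reduced to integer ids. The sketch finds the batch for a flow with a linear search, which is an assumption made for brevity and not a description of how dp_netdev_queue_batches locates a batch. The caller is assumed to provide a `batches` array with room for one batch per packet.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_BURST 32   /* Hypothetical maximum packets per burst. */

struct toy_batch {
    int flow_id;                  /* Flow shared by all packets here. */
    int packets[TOY_MAX_BURST];   /* Indices of the queued packets. */
    size_t count;
};

/* Appends 'packet' to the batch of 'flow_id', creating the batch on the
 * first packet of that flow. Batches keep first-seen order and packets
 * keep arrival order within a batch. */
static void toy_queue_packet(struct toy_batch batches[], size_t *n_batches,
                             int flow_id, int packet)
{
    size_t b;

    for (b = 0; b < *n_batches; b++) {
        if (batches[b].flow_id == flow_id) {
            break;
        }
    }
    if (b == *n_batches) {
        assert(*n_batches < TOY_MAX_BURST);
        batches[b].flow_id = flow_id;
        batches[b].count = 0;
        (*n_batches)++;
    }
    assert(batches[b].count < TOY_MAX_BURST);
    batches[b].packets[batches[b].count++] = packet;
}
```

Once the burst has been sorted this way, the action list of a flow is walked once per batch instead of once per packet, which is the performance benefit the paragraph above refers to.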
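The actions are then executed batch by batch. When the action forwards packets to an output port, the call chain is:
netdev_send -> netdev_dpdk_send__ -> netdev_dpdk_eth_tx_burst -> rte_eth_tx_burst, which transmits the packets.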