OVS startup
Startup begins in main in vswitchd/ovs-vswitchd.c: main-->netdev_run-->netdev_initialize-->netdev_dpdk_register-->netdev_register_provider, which registers dpdk_vhost_user_class.
Adding a dpdk port triggers creation of the pmd threads.
dpif_netdev_port_add-->do_add_port-->dp_netdev_set_pmds_on_numa-->pmd_thread_main
If dpdk ports have already been added, startup also triggers creation of the pmd threads.
dpif_netdev_pmd_set-->dp_netdev_reset_pmd_threads-->dp_netdev_set_pmds_on_numa-->pmd_thread_main
dp_netdev_process_rxq_port receives the packets, then calls dp_netdev_input-->dp_netdev_input__ to look up the flow tables, and then calls packet_batch_execute-->dp_netdev_execute_actions to execute the actions.
Poll mode thread
pmd_thread_main
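- pmd_thread_setaffinity_cpu sets the lcore that the thread is bound to.
- An infinite for loop follows.
- Inside it, a for loop walks the ports and calls dp_netdev_process_rxq_port on each one.
- Between passes, the loop reloads the port and queue information when the configuration has changed.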
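The loop structure listed above can be modelled in a few lines. This is only a sketch: toy_queue, toy_pmd, TOY_RELOAD_CHECK and the rule that an idle pass requests a reload are all hypothetical, chosen so that the model terminates; none of it is OVS code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_BURST 32      /* mirrors NETDEV_MAX_BURST from the text */
#define TOY_RELOAD_CHECK 4    /* hypothetical: look at the reload flag every 4 passes */

/* Hypothetical stand-in for one polled rx queue: just a backlog counter. */
struct toy_queue {
    int backlog;              /* packets waiting on this queue */
};

/* Hypothetical stand-in for the per-thread state. */
struct toy_pmd {
    struct toy_queue *poll_list;
    size_t poll_cnt;
    bool reload;              /* set when the polling loop should stop */
    long processed;           /* total packets taken off the queues */
};

/* Take at most one burst from a queue; returns the number of packets taken. */
static int toy_process_rxq(struct toy_pmd *pmd, struct toy_queue *q)
{
    int n = q->backlog < TOY_MAX_BURST ? q->backlog : TOY_MAX_BURST;

    q->backlog -= n;
    pmd->processed += n;
    return n;
}

/* Poll every queue in turn and check the reload flag only every
 * TOY_RELOAD_CHECK passes. To keep the model finite, a pass that finds
 * no packets at all stands in for an external configuration change. */
static long toy_pmd_main(struct toy_pmd *pmd)
{
    unsigned int lc = 0;

    assert(pmd->poll_list != NULL || pmd->poll_cnt == 0);
    for (;;) {
        int got = 0;

        for (size_t i = 0; i < pmd->poll_cnt; i++) {
            got += toy_process_rxq(pmd, &pmd->poll_list[i]);
        }
        if (got == 0) {
            pmd->reload = true;
        }
        if (++lc >= TOY_RELOAD_CHECK) {
            lc = 0;
            if (pmd->reload) {
                break;
            }
        }
    }
    return pmd->processed;
}
```

Checking the flag only every few passes keeps the hot loop free of extra work on most iterations, at the cost of a short delay before a reload is noticed.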
dp_netdev_process_rxq_port
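- Calls netdev_rxq_recv to receive packets; the call is timed before and after.
- Calls dp_netdev_input to match the packets against the flows and send them; this call is also timed before and after.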
- netdev_rxq_recv=>netdev_dpdk_vhost_rxq_recv
- Calls the DPDK interface rte_vhost_dequeue_burst to receive packets.
- Calls netdev_dpdk_vhost_update_rx_counters to update the statistics.
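The "timed before and after" pattern can be sketched as follows. The names toy_stats, toy_count_start, toy_count_end and the manually advanced toy_clock are hypothetical; a real implementation would read a cycle counter instead of a variable.

```c
#include <assert.h>
#include <stdint.h>

enum toy_cycle_type { TOY_CYCLES_POLLING, TOY_CYCLES_PROCESSING, TOY_N_CYCLES };

struct toy_stats {
    uint64_t last;                   /* timestamp taken by toy_count_start */
    uint64_t cycles[TOY_N_CYCLES];   /* accumulated time per category */
};

/* Hypothetical deterministic clock: the code below advances it explicitly. */
static uint64_t toy_clock;

static uint64_t toy_now(void)
{
    return toy_clock;
}

static void toy_count_start(struct toy_stats *s)
{
    s->last = toy_now();
}

static void toy_count_end(struct toy_stats *s, enum toy_cycle_type type)
{
    assert(type < TOY_N_CYCLES);
    s->cycles[type] += toy_now() - s->last;
}

/* One receive pass: rx_cost ticks are charged to polling, and proc_cost
 * ticks are charged to processing only if packets actually arrived. */
static void toy_rx_pass(struct toy_stats *s, int n_pkts,
                        uint64_t rx_cost, uint64_t proc_cost)
{
    toy_count_start(s);
    toy_clock += rx_cost;            /* stand-in for the receive call */
    toy_count_end(s, TOY_CYCLES_POLLING);
    if (n_pkts > 0) {
        toy_count_start(s);
        toy_clock += proc_cost;      /* stand-in for lookup and actions */
        toy_count_end(s, TOY_CYCLES_PROCESSING);
    }
}
```

Keeping the two categories separate makes it possible to tell a thread that spends its time polling empty queues from one that is busy processing packets.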
dp_netdev_input=>dp_netdev_input__
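- emc_processing parses a key from each received packet and looks up the flow in the cache. Matched packets are queued on their flow; the return value is the number of unmatched packets.
- If any packets are unmatched, fast_path_processing searches the full set of flow entries. A flow that is found is inserted into the cache; packets that still do not match are passed up toward the controller.
- Calls packet_batch_execute to process the packets according to their flows.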
emc_processing
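- Calls miniflow_extract to parse the packet into a key.
- Calls emc_lookup to search the hash table and compare keys.
- On a match, calls dp_netdev_queue_batches to add the packet to flow->batches.
- On a miss, moves the unmatched packet to the front of the array.
- Calls dp_netdev_count_packet to count the matched packets.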
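The steps above (exact-match lookup, then compacting the misses to the front) can be sketched like this. The key layout, the hash function, the direct-mapped cache and every toy_ name are assumptions made for illustration; the real EMC is organised differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_EMC_ENTRIES 8    /* hypothetical cache size, must be a power of two */

struct toy_key {
    uint32_t src, dst;
    uint16_t sport, dport;
    uint8_t proto;
};

struct toy_emc_entry {
    int valid;
    struct toy_key key;
    int flow_id;
};

struct toy_emc {
    struct toy_emc_entry e[TOY_EMC_ENTRIES];
};

static void toy_emc_init(struct toy_emc *c)
{
    memset(c, 0, sizeof *c);
}

/* Simple illustrative hash over the key fields. */
static uint32_t toy_hash(const struct toy_key *k)
{
    uint32_t h = k->src * 2654435761u;

    h ^= k->dst * 2246822519u;
    h ^= ((uint32_t) k->sport << 16) | k->dport;
    h ^= k->proto;
    return h;
}

static int toy_key_equal(const struct toy_key *a, const struct toy_key *b)
{
    return a->src == b->src && a->dst == b->dst && a->sport == b->sport
           && a->dport == b->dport && a->proto == b->proto;
}

static void toy_emc_insert(struct toy_emc *c, const struct toy_key *k, int flow_id)
{
    struct toy_emc_entry *e = &c->e[toy_hash(k) & (TOY_EMC_ENTRIES - 1)];

    e->valid = 1;
    e->key = *k;
    e->flow_id = flow_id;
}

/* Hash to a slot, then compare the full key. Returns the flow id or -1. */
static int toy_emc_lookup(const struct toy_emc *c, const struct toy_key *k)
{
    const struct toy_emc_entry *e = &c->e[toy_hash(k) & (TOY_EMC_ENTRIES - 1)];

    return e->valid && toy_key_equal(&e->key, k) ? e->flow_id : -1;
}

/* Hits are recorded in hit_flows[]; misses are moved to the front of keys[]
 * so that a slower lookup can work on keys[0..n_missed-1]. Returns n_missed. */
static size_t toy_emc_processing(const struct toy_emc *c, struct toy_key *keys,
                                 size_t cnt, int *hit_flows, size_t *n_hits)
{
    size_t n_missed = 0;

    assert(cnt == 0 || keys != NULL);
    *n_hits = 0;
    for (size_t i = 0; i < cnt; i++) {
        int flow = toy_emc_lookup(c, &keys[i]);

        if (flow >= 0) {
            hit_flows[(*n_hits)++] = flow;
        } else {
            keys[n_missed++] = keys[i];
        }
    }
    return n_missed;
}
```

Because n_missed never exceeds i, the compaction can safely reuse the input array.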
fast_path_processing
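- dpcls_lookup searches the subtables through the classifier. If every packet finds a matching subtable entry, the flows are inserted into the cache and the packets are added to flow->batches.
- Packets that do not match are passed up toward the controller.
- Counts hits, misses, and lost packets.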
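A subtable lookup with wildcards can be sketched as below: each subtable has one mask, and a packet key is masked before it is compared with the rules of that subtable. The two-field key, the linear scan (a real classifier would hash the masked key) and all toy_ names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_RULES 4

/* Simplified key with only two fields. */
struct toy_wkey {
    uint32_t dst_ip;
    uint16_t dst_port;
};

struct toy_rule {
    struct toy_wkey masked_key;
    int flow_id;
};

/* All rules in one subtable share the same mask. */
struct toy_subtable {
    struct toy_wkey mask;
    struct toy_rule rules[TOY_MAX_RULES];
    size_t n_rules;
};

static struct toy_wkey toy_apply_mask(const struct toy_wkey *k,
                                      const struct toy_wkey *m)
{
    struct toy_wkey r = { k->dst_ip & m->dst_ip,
                          (uint16_t) (k->dst_port & m->dst_port) };
    return r;
}

/* Store the rule in masked form. Returns 0, or -1 if the subtable is full. */
static int toy_subtable_add(struct toy_subtable *st, const struct toy_wkey *k,
                            int flow_id)
{
    if (st->n_rules == TOY_MAX_RULES) {
        return -1;
    }
    st->rules[st->n_rules].masked_key = toy_apply_mask(k, &st->mask);
    st->rules[st->n_rules].flow_id = flow_id;
    st->n_rules++;
    return 0;
}

/* Try each subtable in order; returns the flow id of the first match or -1. */
static int toy_dpcls_lookup(const struct toy_subtable *tables, size_t n_tables,
                            const struct toy_wkey *k)
{
    assert(n_tables == 0 || tables != NULL);
    for (size_t t = 0; t < n_tables; t++) {
        struct toy_wkey mk = toy_apply_mask(k, &tables[t].mask);

        for (size_t r = 0; r < tables[t].n_rules; r++) {
            const struct toy_wkey *rk = &tables[t].rules[r].masked_key;

            if (rk->dst_ip == mk.dst_ip && rk->dst_port == mk.dst_port) {
                return tables[t].rules[r].flow_id;
            }
        }
    }
    return -1;
}
```

A result of -1 corresponds to the miss case in the list above, where the packet has to be handed to the slower path.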
packet_batch_per_flow_execute
- Calls dp_netdev_flow_get_actions to get the actions of the flow.
- Calls dp_netdev_execute_actions to execute those actions.
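Grouping packets per flow and then executing each group once can be sketched as follows. The structures and function names are hypothetical, and "executing" is reduced to bumping a counter. The sketch clears every flow's batch pointer before executing anything, so that a nested lookup triggered during execution would not append to a stale batch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_BURST 32

struct toy_batch;

struct toy_flow {
    int id;
    struct toy_batch *batch;   /* non-NULL while the flow has an open batch */
    long hits;                 /* packets executed against this flow */
};

struct toy_batch {
    struct toy_flow *flow;
    int pkts[TOY_MAX_BURST];   /* packet ids */
    size_t count;
};

/* Queue pkt on the flow's open batch, opening a new one in batches[]
 * if the flow does not have one yet. */
static void toy_queue_batches(int pkt, struct toy_flow *flow,
                              struct toy_batch *batches, size_t *n_batches)
{
    struct toy_batch *b = flow->batch;

    if (!b) {
        assert(*n_batches < TOY_MAX_BURST);
        b = &batches[(*n_batches)++];
        b->flow = flow;
        b->count = 0;
        flow->batch = b;
    }
    assert(b->count < TOY_MAX_BURST);
    b->pkts[b->count++] = pkt;
}

/* Close every batch first, then execute each batch once. */
static void toy_execute_all(struct toy_batch *batches, size_t n_batches)
{
    for (size_t i = 0; i < n_batches; i++) {
        batches[i].flow->batch = NULL;
    }
    for (size_t i = 0; i < n_batches; i++) {
        /* Stand-in for running the flow's actions on the whole batch. */
        batches[i].flow->hits += (long) batches[i].count;
    }
}
```

Fetching the actions once per batch instead of once per packet is the main saving this grouping is meant to provide.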
Actions
dp_netdev_execute_actions=>odp_execute_actions
- For actions that the datapath has to carry out itself, such as the ones listed below, the callback dp_execute_cb is invoked.
dp_execute_cb
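- For OVS_ACTION_ATTR_OUTPUT, calls dp_netdev_lookup_port to find the port, then netdev_send to transmit the packets.
- For OVS_ACTION_ATTR_TUNNEL_PUSH, calls push_tnl_action to add the tunnel encapsulation, then dp_netdev_recirculate-->dp_netdev_input__ to look the packets up again.
- For OVS_ACTION_ATTR_TUNNEL_POP, calls netdev_pop_header to remove the encapsulation, then dp_netdev_recirculate-->dp_netdev_input__ to look the packets up again.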
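The dispatch on action type, including the re-lookup after a tunnel push or pop, can be sketched as a small recursive model. Everything here is hypothetical: the "flow table" is just an array indexed by encapsulation depth, and TOY_MAX_RECIRC_DEPTH is an assumed bound that stops a push/pop loop from recursing forever.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_RECIRC_DEPTH 5   /* hypothetical limit for this sketch */
#define TOY_N_PORTS 4
#define TOY_N_FLOWS 3

enum toy_action_type { TOY_OUTPUT, TOY_TUNNEL_PUSH, TOY_TUNNEL_POP };

struct toy_action {
    enum toy_action_type type;
    int arg;                     /* port number for TOY_OUTPUT, otherwise unused */
};

struct toy_pkt {
    int encap_depth;             /* number of tunnel headers on the packet */
};

struct toy_dp {
    struct toy_action flows[TOY_N_FLOWS];  /* action chosen by encap_depth */
    long tx[TOY_N_PORTS];
    long dropped;
    int recirc_depth;
};

static void toy_dp_init(struct toy_dp *dp, const struct toy_action *flows)
{
    memset(dp, 0, sizeof *dp);
    memcpy(dp->flows, flows, sizeof dp->flows);
}

static void toy_input(struct toy_dp *dp, struct toy_pkt *pkt);

/* Send the packet through the lookup again, unless the depth limit is hit. */
static void toy_recirculate(struct toy_dp *dp, struct toy_pkt *pkt)
{
    if (dp->recirc_depth >= TOY_MAX_RECIRC_DEPTH) {
        dp->dropped++;
        return;
    }
    dp->recirc_depth++;
    toy_input(dp, pkt);
    dp->recirc_depth--;
}

static void toy_execute(struct toy_dp *dp, struct toy_pkt *pkt,
                        const struct toy_action *a)
{
    switch (a->type) {
    case TOY_OUTPUT:
        assert(a->arg >= 0 && a->arg < TOY_N_PORTS);
        dp->tx[a->arg]++;
        break;
    case TOY_TUNNEL_PUSH:
        pkt->encap_depth++;
        toy_recirculate(dp, pkt);
        break;
    case TOY_TUNNEL_POP:
        pkt->encap_depth--;
        toy_recirculate(dp, pkt);
        break;
    }
}

/* Look up the action for the packet and execute it. */
static void toy_input(struct toy_dp *dp, struct toy_pkt *pkt)
{
    if (pkt->encap_depth < 0 || pkt->encap_depth >= TOY_N_FLOWS) {
        dp->dropped++;
        return;
    }
    toy_execute(dp, pkt, &dp->flows[pkt->encap_depth]);
}
```

With a table of push, output, output, a packet is encapsulated once and then sent; with push followed by pop the packet bounces between the two entries until the depth limit drops it.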
netdev_send=>netdev_dpdk_vhost_send=>__netdev_dpdk_vhost_send
- Calls the DPDK interface rte_vhost_enqueue_burst in a loop to send the packets.
- Calls netdev_dpdk_vhost_update_tx_counters to update the statistics.
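The send loop can be sketched as "enqueue what fits, advance, retry a bounded number of times, count the rest as dropped". The toy_ring type, toy_enqueue_burst and the retry budget TOY_TX_RETRIES are hypothetical stand-ins and do not reproduce the DPDK vhost API.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TX_RETRIES 8     /* hypothetical retry budget */

/* Hypothetical guest ring that accepts a limited number of packets. */
struct toy_ring {
    int free_slots;          /* total room left in the ring */
    int per_call;            /* most packets accepted by one enqueue call */
    long enqueued;
};

/* Accept as many packets as the ring allows; returns how many were taken. */
static int toy_enqueue_burst(struct toy_ring *r, const int *pkts, int cnt)
{
    int n = cnt;

    (void) pkts;             /* packet contents do not matter in this model */
    if (n > r->per_call) {
        n = r->per_call;
    }
    if (n > r->free_slots) {
        n = r->free_slots;
    }
    r->free_slots -= n;
    r->enqueued += n;
    return n;
}

/* Keep enqueueing until everything is sent or the retry budget runs out.
 * Unsent packets are added to *tx_dropped; returns the number dropped. */
static int toy_send(struct toy_ring *r, const int *pkts, int cnt,
                    long *tx_dropped)
{
    int retries = 0;

    assert(cnt >= 0);
    while (cnt > 0) {
        int sent = toy_enqueue_burst(r, pkts, cnt);

        if (sent > 0) {
            pkts += sent;
            cnt -= sent;
        } else if (++retries > TOY_TX_RETRIES) {
            break;
        }
    }
    *tx_dropped += cnt;
    return cnt;
}
```

Bounding the retries keeps a stalled receiver from blocking the sending thread indefinitely; the unsent remainder simply shows up in the drop counter.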
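Analysis of pmd_thread_main in the OVS source
A PMD thread continuously polls the input ports in its poll list. On each port it receives at most 32 packets at a time (NETDEV_MAX_BURST), and each received packet is classified against the active flow rules. The purpose of classification is to find a flow so that the packet can be handled appropriately. Packets are grouped by flow, and each group has its specific actions executed.
pmd_thread_main is the entry function of the user-space receive path in OVS, in which pmd threads poll for packets.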
pmd_thread_main(void *f_)
{
    struct dp_netdev_pmd_thread *pmd = f_;
    unsigned int lc = 0;
    struct polled_queue *poll_list;
    bool exiting;
    int poll_cnt;
    int i;

    poll_list = NULL;
    ...
    /* Copy pmd->poll_list into poll_list and return the number of polled queues. */
    poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
reload:
    emc_cache_init(&pmd->flow_cache);
    ...
    for (;;) {
        for (i = 0; i < poll_cnt; i++) {
            dp_netdev_process_rxq_port(pmd, poll_list[i].rx,
                                       poll_list[i].port_no);
        }
        ...
    }
    poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
    /* If pmd->exit_latch is set, the pmd thread terminates. */
    exiting = latch_is_set(&pmd->exit_latch);
    /* Signal here to make sure the pmd finishes
     * reloading the updated configuration. */
    dp_netdev_pmd_reload_done(pmd);
    emc_cache_uninit(&pmd->flow_cache);
    if (!exiting) {
        goto reload;
    }
    free(poll_list);
    pmd_free_cached_ports(pmd);
    return NULL;
}
pmd_thread_main handles packet reception on each netdev by calling dp_netdev_process_rxq_port.
dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
                           struct netdev_rxq *rx,
                           odp_port_t port_no)
{
    struct dp_packet_batch batch;
    int error;

    dp_packet_batch_init(&batch);
    cycles_count_start(pmd);
    /* Receive packets from rx into batch by calling netdev_class->rxq_recv. */
    error = netdev_rxq_recv(rx, &batch);
    cycles_count_end(pmd, PMD_CYCLES_POLLING);
    if (!error) {
        *recirc_depth_get() = 0;
        cycles_count_start(pmd);
        /* Hand the packets in batch to the datapath for processing. */
        dp_netdev_input(pmd, &batch, port_no);
        cycles_count_end(pmd, PMD_CYCLES_PROCESSING);
    }
    ...
}
The netdev_class instances include NETDEV_DPDK_CLASS, NETDEV_DUMMY_CLASS, NETDEV_BSD_CLASS, and NETDEV_LINUX_CLASS.
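The relationship between a generic receive call and the per-class rxq_recv can be sketched as a table of function pointers plus a registry. All names below (toy_netdev_class, toy_register_provider, the "dummy" provider) are hypothetical and only mirror the shape of the mechanism; they are not the OVS definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CLASSES 8

struct toy_rxq;

/* The function table that each port type has to provide. */
struct toy_netdev_class {
    const char *type;
    int (*rxq_recv)(struct toy_rxq *rxq, int *pkts, int max);
};

struct toy_rxq {
    const struct toy_netdev_class *class_;
    int pending;             /* packets waiting, used by the dummy provider */
};

static const struct toy_netdev_class *toy_classes[TOY_MAX_CLASSES];
static size_t toy_n_classes;

static const struct toy_netdev_class *toy_lookup_provider(const char *type)
{
    for (size_t i = 0; i < toy_n_classes; i++) {
        if (!strcmp(toy_classes[i]->type, type)) {
            return toy_classes[i];
        }
    }
    return NULL;
}

/* Returns 0 on success, -1 if the type already exists or the table is full. */
static int toy_register_provider(const struct toy_netdev_class *c)
{
    assert(c && c->type && c->rxq_recv);
    if (toy_lookup_provider(c->type) || toy_n_classes == TOY_MAX_CLASSES) {
        return -1;
    }
    toy_classes[toy_n_classes++] = c;
    return 0;
}

/* Generic entry point: dispatch through the class of the queue. */
static int toy_rxq_recv(struct toy_rxq *rxq, int *pkts, int max)
{
    return rxq->class_->rxq_recv(rxq, pkts, max);
}

/* A dummy provider that hands out up to max pending packets. */
static int toy_dummy_recv(struct toy_rxq *rxq, int *pkts, int max)
{
    int n = rxq->pending < max ? rxq->pending : max;

    for (int i = 0; i < n; i++) {
        pkts[i] = i;
    }
    rxq->pending -= n;
    return n;
}

static const struct toy_netdev_class toy_dummy_class = { "dummy", toy_dummy_recv };
```

The caller only ever sees toy_rxq_recv, so a new port type can be added by registering another function table without touching the polling code.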
netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch)
{
    struct netdev_rxq_dpdk *rx = netdev_rxq_dpdk_cast(rxq);
    struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
    struct ingress_policer *policer = netdev_dpdk_get_ingress_policer(dev);
    int nb_rx;
    int dropped = 0;

    if (OVS_UNLIKELY(!(dev->flags & NETDEV_UP))) {
        return EAGAIN;
    }
    /* Receive packets through the DPDK interface rte_eth_rx_burst,
     * at most 32 packets per call. */
    nb_rx = rte_eth_rx_burst(rx->port_id, rxq->queue_id,
                             (struct rte_mbuf **) batch->packets,
                             NETDEV_MAX_BURST);
    if (!nb_rx) {
        return EAGAIN;
    }
    /* If a policer exists, netdev_dpdk_policer_pkt_handle is called on every
     * dp_packet in the dp_packet_batch; the return value is the number of
     * packets that remain after metering. */
    if (policer) {
        dropped = nb_rx;
        nb_rx = ingress_policer_run(policer,
                                    (struct rte_mbuf **) batch->packets,
                                    nb_rx);
        dropped -= nb_rx;
    }
    /* Update stats to reflect dropped packets */
    if (OVS_UNLIKELY(dropped)) {
        rte_spinlock_lock(&dev->stats_lock);
        dev->stats.rx_dropped += dropped;
        rte_spinlock_unlock(&dev->stats_lock);
    }
    batch->count = nb_rx;
    return 0;
}
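The "dropped = nb_rx, run the policer, dropped -= nb_rx" accounting in the function above can be tried out with a simple token bucket. This is an assumption-laden sketch: the token bucket, its units (bytes and abstract ticks) and all toy_ names are made up for illustration and are not the metering algorithm used by the real policer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical token bucket: tokens are bytes, rate is bytes per tick. */
struct toy_policer {
    uint64_t tokens;
    uint64_t burst;          /* bucket capacity */
    uint64_t rate;
    uint64_t last;           /* tick of the last refill */
};

static void toy_policer_refill(struct toy_policer *p, uint64_t now)
{
    uint64_t filled = p->tokens + (now - p->last) * p->rate;

    p->tokens = filled > p->burst ? p->burst : filled;
    p->last = now;
}

/* Keep the packets that fit into the bucket and compact them to the front
 * of pkt_len[]. Returns the number of packets kept. */
static int toy_policer_run(struct toy_policer *p, uint32_t *pkt_len, int cnt,
                           uint64_t now)
{
    int kept = 0;

    assert(cnt >= 0);
    toy_policer_refill(p, now);
    for (int i = 0; i < cnt; i++) {
        if (pkt_len[i] <= p->tokens) {
            p->tokens -= pkt_len[i];
            pkt_len[kept++] = pkt_len[i];
        }
    }
    return kept;
}

/* Receive-side accounting in the same style as the function above. */
static int toy_rx_with_policer(struct toy_policer *p, uint32_t *pkt_len,
                               int nb_rx, uint64_t now, uint64_t *rx_dropped)
{
    int dropped = nb_rx;

    nb_rx = toy_policer_run(p, pkt_len, nb_rx, now);
    dropped -= nb_rx;
    *rx_dropped += (uint64_t) dropped;
    return nb_rx;
}
```

Only the surviving packets stay at the front of the array, so the caller can simply set the batch count to the returned value.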
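After a packet enters OVS-DPDK from a physical or virtual interface, a unique identifier or hash is derived from its header fields. This identifier is matched against an entry in one of three switching tables: the exact match cache (EMC), the datapath classifier (dpcls), and the ofproto classifier. The packet traverses these three tables in order until an entry matches; it then executes all the actions that the match indicates and is forwarded.
The EMC provides fast processing for a limited number of entries; in the EMC the packet identifier must match an entry exactly on the IP 5-tuple. If the EMC lookup misses, the packet goes to the dpcls. The dpcls holds many more entries in multiple subtables and can match the packet identifier with wildcards. After a packet matches in the dpcls, the flow entry is installed in the EMC, so that later packets with the same identifier are handled quickly by the EMC. If the dpcls lookup also misses, the packet goes to the ofproto classifier, where it is handled according to the OpenFlow controller. If a matching entry is found in the ofproto classifier, that entry is pushed down to the faster switching tables, so that later packets of the same flow are processed quickly. (Translated from: https://software.intel.com/en-us/articles/open-vswitch-with-dpdk-overview)
Note: the EMC is scoped per PMD, so each PMD has its own EMC; the dpcls is scoped per port, so each port has its own dpcls; the ofproto classifier is scoped per bridge, so each bridge has its own ofproto classifier.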
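The fall-through from EMC to dpcls to ofproto classifier, with entries installed into the faster tables on the way back, can be sketched with integer keys. The table layouts are hypothetical: the "EMC" is a direct-mapped array of exact keys, the "dpcls" matches on the key with its low 8 bits wildcarded, and the "ofproto" tier is a single rule on the top byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_EMC_SIZE 4
#define TOY_DPCLS_SIZE 8

enum toy_tier { TOY_HIT_EMC, TOY_HIT_DPCLS, TOY_HIT_OFPROTO, TOY_MISS };

struct toy_tables {
    uint32_t emc_key[TOY_EMC_SIZE];        /* exact keys, direct-mapped */
    int emc_valid[TOY_EMC_SIZE];
    uint32_t dpcls_prefix[TOY_DPCLS_SIZE]; /* key >> 8: low 8 bits wildcarded */
    size_t n_dpcls;
    uint8_t ofproto_top_byte;              /* the only slow-path rule */
};

static void toy_tables_init(struct toy_tables *t, uint8_t top_byte)
{
    assert(t != NULL);
    memset(t, 0, sizeof *t);
    t->ofproto_top_byte = top_byte;
}

static void toy_emc_install(struct toy_tables *t, uint32_t key)
{
    size_t slot = key % TOY_EMC_SIZE;

    t->emc_valid[slot] = 1;
    t->emc_key[slot] = key;
}

/* Try the tiers in order and report which one matched. A hit in a slower
 * tier installs entries in the faster tiers for later packets. */
static enum toy_tier toy_lookup(struct toy_tables *t, uint32_t key)
{
    size_t slot = key % TOY_EMC_SIZE;

    if (t->emc_valid[slot] && t->emc_key[slot] == key) {
        return TOY_HIT_EMC;
    }
    for (size_t i = 0; i < t->n_dpcls; i++) {
        if (t->dpcls_prefix[i] == key >> 8) {
            toy_emc_install(t, key);
            return TOY_HIT_DPCLS;
        }
    }
    if ((key >> 24) == t->ofproto_top_byte) {
        if (t->n_dpcls < TOY_DPCLS_SIZE) {
            t->dpcls_prefix[t->n_dpcls++] = key >> 8;
        }
        toy_emc_install(t, key);
        return TOY_HIT_OFPROTO;
    }
    return TOY_MISS;
}
```

In this model the first packet of a flow pays for the slowest lookup, a second flow that shares the wildcarded prefix stops at the middle tier, and repeated packets of either flow are served from the exact-match tier.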
dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
                  struct dp_packet_batch *packets,
                  bool md_is_valid, odp_port_t port_no)
{
    int cnt = packets->count;
#if !defined(__CHECKER__) && !defined(_WIN32)
    const size_t PKT_ARRAY_SIZE = cnt;
#else
    /* Sparse or MSVC doesn't like variable length array. */
    enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
#endif
    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct netdev_flow_key keys[PKT_ARRAY_SIZE];
    struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
    long long now = time_msec();
    size_t newcnt, n_batches, i;
    odp_port_t in_port;

    n_batches = 0;
    /* Run every packet in the dp_packet_batch through the EMC
     * (pmd->flow_cache). The return value is the number of packets that
     * still have to go through fast_path_processing. If md_is_valid is
     * false, the function also initializes the metadata from port_no. */
    newcnt = emc_processing(pmd, packets, keys, batches, &n_batches,
                            md_is_valid, port_no);
    if (OVS_UNLIKELY(newcnt)) {
        packets->count = newcnt;
        /* Get ingress port from first packet's metadata. */
        in_port = packets->packets[0]->md.in_port.odp_port;
        fast_path_processing(pmd, packets, keys, batches, &n_batches, in_port, now);
    }
    /* All the flow batches need to be reset before any call to
     * packet_batch_per_flow_execute() as it could potentially trigger
     * recirculation. When a packet matching flow 'j' happens to be
     * recirculated, the nested call to dp_netdev_input__() could potentially
     * classify the packet as matching another flow - say 'k'. It could happen
     * that in the previous call to dp_netdev_input__() that same flow 'k' had
     * already its own batches[k] still waiting to be served. So if its
     * 'batch' member is not reset, the recirculated packet would be wrongly
     * appended to batches[k] of the 1st call to dp_netdev_input__(). */
    for (i = 0; i < n_batches; i++) {
        batches[i].flow->batch = NULL;
    }
    for (i = 0; i < n_batches; i++) {
        packet_batch_per_flow_execute(&batches[i], pmd, now);
    }
}