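The idea behind per_cpu is that every CPU gets its own copy of a variable or structure. Because a CPU normally touches only its own copy, reads and writes avoid lock overhead, the context switches that contended locks can cause, and the cache misses that come from cache lines bouncing between CPUs. In general it is best to declare per_cpu data as cache-line aligned, e.g.
struct percpu_stat {
	u64 a;
	u64 b;
} ____cacheline_aligned;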
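A rough userspace sketch of what the alignment buys (this is not kernel code). It assumes a 64-byte cache line, uses C11 `_Alignas` as a stand-in for `____cacheline_aligned`, and the names `CACHE_LINE_SIZE`, `NR_SLOTS`, `stats` and `on_different_lines` are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* assumption; the kernel uses L1_CACHE_BYTES */
#define NR_SLOTS 4         /* hypothetical number of simulated CPUs */

/* Userspace stand-in for a ____cacheline_aligned struct. */
struct percpu_stat {
    _Alignas(CACHE_LINE_SIZE) uint64_t a;
    uint64_t b;
};

/* The alignment also pads the struct to a whole number of cache lines. */
static_assert(sizeof(struct percpu_stat) % CACHE_LINE_SIZE == 0,
              "each slot must fill whole cache lines");

/* One slot per simulated CPU. */
static struct percpu_stat stats[NR_SLOTS];

/* Returns 1 if slots i and j start on different cache lines. */
static int on_different_lines(size_t i, size_t j)
{
    uintptr_t line_i = (uintptr_t)&stats[i] / CACHE_LINE_SIZE;
    uintptr_t line_j = (uintptr_t)&stats[j] / CACHE_LINE_SIZE;
    return line_i != line_j;
}
```

Since no two slots share a cache line, a write to one slot does not invalidate the line that holds a neighbouring slot.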
The first kind is the statically defined, global per_cpu variable. I excerpted the relevant kernel code below:
DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
__get_cpu_var(netdev_rx_stat).received_rps++;
__get_cpu_var(netdev_rx_stat).total++;
__get_cpu_var(netdev_rx_stat).dropped++;
__get_cpu_var(netdev_rx_stat).time_squeeze++;
static struct netif_rx_stats *softnet_get_online(loff_t *pos)
{
	struct netif_rx_stats *rc = NULL;

	while (*pos < nr_cpu_ids)
		if (cpu_online(*pos)) {
			rc = &per_cpu(netdev_rx_stat, *pos);
			break;
		} else
			++*pos;
	return rc;
}
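As the code shows, a per_cpu variable defined with DEFINE_PER_CPU is normally accessed either through __get_cpu_var(var), which yields the current CPU's copy, or through the per_cpu(var, cpu) macro, which yields the copy of a specific CPU. Here var is the name of the variable (not its type) and cpu is the CPU index. Later kernels removed __get_cpu_var in favor of the this_cpu_*() operations such as this_cpu_ptr() and this_cpu_inc().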
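The two access forms can be mimicked in userspace with a plain array. This is only a simulation under stated assumptions: `NR_CPUS`, `rx_stat`, `current_cpu`, `PER_CPU`, `GET_CPU_VAR` and `receive_on` are hypothetical names, and the "current CPU" is just a variable, whereas the kernel would consult smp_processor_id():

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4 /* hypothetical CPU count for the simulation */

struct rx_stats {
    uint64_t total;
    uint64_t dropped;
};

/* Analog of DEFINE_PER_CPU(struct rx_stats, rx_stat): one copy per CPU. */
static struct rx_stats rx_stat[NR_CPUS];

/* Simulated "current CPU". */
static int current_cpu;

/* Analog of per_cpu(var, cpu): the copy belonging to a given CPU. */
#define PER_CPU(var, cpu) ((var)[(cpu)])
/* Analog of __get_cpu_var(var): the copy belonging to the current CPU. */
#define GET_CPU_VAR(var) PER_CPU(var, current_cpu)

/* Count one received packet on the given simulated CPU. */
static void receive_on(int cpu, int drop)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    current_cpu = cpu;
    GET_CPU_VAR(rx_stat).total++;
    if (drop)
        GET_CPU_VAR(rx_stat).dropped++;
}
```

Note that both macros take the variable's name; the type appears only in the definition.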
The other kind is the dynamically allocated per_cpu variable. I excerpted the relevant kernel code from openvswitch below:
vport->percpu_stats = alloc_percpu(struct vport_percpu_stats);
free_percpu(vport->percpu_stats);
struct vport {
	struct rcu_head rcu;
	u16 port_no;
	struct datapath *dp;
	struct list_head node;
	u32 upcall_pid;
	struct hlist_node hash_node;
	const struct vport_ops *ops;
	struct vport_percpu_stats __percpu *percpu_stats;
	spinlock_t stats_lock;
	struct vport_err_stats err_stats;
};
for_each_possible_cpu(i) {
	const struct vport_percpu_stats *percpu_stats;
	struct vport_percpu_stats local_stats;
	unsigned int start;

	percpu_stats = per_cpu_ptr(vport->percpu_stats, i);
	do {
		start = u64_stats_fetch_begin_bh(&percpu_stats->sync);
		local_stats = *percpu_stats;
	} while (u64_stats_fetch_retry_bh(&percpu_stats->sync, start));
	stats->rx_bytes += local_stats.rx_bytes;
	stats->rx_packets += local_stats.rx_packets;
	stats->tx_bytes += local_stats.tx_bytes;
	stats->tx_packets += local_stats.tx_packets;
}
void ovs_vport_receive(struct vport *vport, struct sk_buff *skb)
{
	struct vport_percpu_stats *stats;

	stats = per_cpu_ptr(vport->percpu_stats, smp_processor_id());
	u64_stats_update_begin(&stats->sync);
	stats->rx_packets++;
	stats->rx_bytes += skb->len;
	u64_stats_update_end(&stats->sync);
	ovs_dp_process_received_packet(vport, skb);
}
This kind of per_cpu variable must be created dynamically with alloc_percpu and released with free_percpu. To reference it, use per_cpu_ptr to obtain a pointer to the variable.
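Note that per_cpu_ptr(ptr, cpu) returns only the copy belonging to the given CPU; in ovs_vport_receive above, passing smp_processor_id() selects the local CPU's copy. To get the total across all CPUs, iterate over every CPU (for example with for_each_possible_cpu) and add the copies together.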
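The allocate, update and sum pattern can be sketched in userspace as follows. This is a single-threaded simulation, not the openvswitch code: `alloc_stats`, `stats_ptr`, `account_rx` and `sum_stats` are hypothetical helpers that stand in for alloc_percpu, per_cpu_ptr, the writer and the for_each_possible_cpu loop, and the u64_stats_sync protection is left out:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_CPUS 4 /* hypothetical CPU count for the simulation */

struct port_stats {
    uint64_t rx_packets;
    uint64_t rx_bytes;
};

/* Analog of alloc_percpu(): zeroed storage with one slot per CPU. */
static struct port_stats *alloc_stats(void)
{
    return calloc(NR_CPUS, sizeof(struct port_stats));
}

/* Analog of per_cpu_ptr(ptr, cpu): the slot of one specific CPU. */
static struct port_stats *stats_ptr(struct port_stats *base, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return &base[cpu];
}

/* Writer: touch only the slot of the CPU handling the packet. */
static void account_rx(struct port_stats *base, int cpu, uint64_t len)
{
    struct port_stats *s = stats_ptr(base, cpu);

    s->rx_packets++;
    s->rx_bytes += len;
}

/* Reader: walk every slot and add the copies together. */
static struct port_stats sum_stats(struct port_stats *base)
{
    struct port_stats total = {0, 0};

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        const struct port_stats *s = stats_ptr(base, cpu);

        total.rx_packets += s->rx_packets;
        total.rx_bytes += s->rx_bytes;
    }
    return total;
}
```

A caller would pair `alloc_stats()` with `free()`, just as the kernel code pairs alloc_percpu with free_percpu. The trade-off is that writes stay cheap and local, while a read of the total costs one pass over all slots.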