A Brief Analysis of the Linux Route Cache Implementation: rt_hash_table

Latest recommended article published on 2022-04-26 10:55:01

JDSH0224

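I have had some free time at home recently and set up load balancing across two ADSL lines. Unfortunately, all the traffic kept going out over a single line. To get to the bottom of the problem I read through the Linux route cache subsystem, and these are my notes.
I believe I posted a similar write-up before, but that one covered an older kernel. This post is based on kernel version 2.6.31.

I do not guarantee that everything here is correct; it is intended for discussion and study only.

Please credit the author and the source when reposting.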

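1. What is the route cache
Route lookup is the most important job of the IP layer, and it is also a time-consuming one. To make lookups more efficient, the Linux kernel introduced a route cache that reduces the number of queries against the routing table. In the computing world, caches are everywhere. The Linux route cache (sometimes abbreviated as DST below) is designed as an independent, protocol-agnostic subsystem. A typical route cache looks like this: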

root@kendo-ThinkpadT410:~# route -Cn
Kernel IP routing cache
Source          Destination     Gateway         Flags Metric Ref Use Iface
10.1.1.199      74.125.53.102   10.1.1.254            0      0     3 eth0
10.1.1.199      219.148.35.84   10.1.1.254            0      0     0 eth0
10.1.1.199      118.123.3.237   10.1.1.254            0      0    21 eth0
61.55.167.138   10.1.1.199      10.1.1.199      l     0      0    33 lo
10.1.1.199      203.208.37.22   10.1.1.254            0      0     3 eth0
10.1.1.183      10.1.1.255      10.1.1.255      ibl   0      0     1 lo
10.1.1.199      72.14.213.101   10.1.1.254            0      0     1 eth0
10.1.1.137      10.1.1.255      10.1.1.255      ibl   0      0     0 lo
10.1.1.199      61.139.2.69     10.1.1.254            0      0    53 eth0
10.1.1.199      8.8.8.8         10.1.1.254            0      0    45 eth0
10.1.1.199      220.166.65.249  10.1.1.254            0      0    21 eth0
10.1.1.199      207.46.193.178  10.1.1.254            0      0     3 eth0
219.148.35.84   10.1.1.199      10.1.1.199      l     0      0     2 lo
10.1.1.199      72.14.203.148   10.1.1.254            0      0     0 eth0
8.8.8.8         10.1.1.199      10.1.1.199      l     0      0    22 lo
10.1.1.199      207.46.193.178  10.1.1.254            0      0     1 eth0
10.1.1.199      219.232.243.91  10.1.1.254            0      0     2 eth0
10.1.1.199      118.123.3.236   10.1.1.254            0      0    21 eth0
……

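2. Route cache initialization
2.1 ip_rt_init
The route cache is stored in a hash table. The most important part of initialization is allocating the hash table and the SLAB cache used for its entries, which is done in ip_rt_init: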

int __init ip_rt_init(void)
{
……
/* Create the SLAB cache used to allocate DST entries */
ipv4_dst_ops.kmem_cachep =
kmem_cache_create("ip_dst_cache", sizeof(struct rtable), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
ipv4_dst_blackhole_ops.kmem_cachep = ipv4_dst_ops.kmem_cachep;
/* Allocate the route cache hash table, sized according to system memory */
rt_hash_table = (struct rt_hash_bucket *)
alloc_large_system_hash("IP route cache",
sizeof(struct rt_hash_bucket),
rhash_entries,
(totalram_pages >= 128 * 1024) ?
15 : 17,
0,
&rt_hash_log,
&rt_hash_mask,
rhash_entries ? 0 : 512 * 1024);
/* Zero the hash table */
memset(rt_hash_table, 0, (rt_hash_mask + 1) * sizeof(struct rt_hash_bucket));
rt_hash_lock_init();
/* gc_thresh and ip_rt_max_size are used by garbage collection */
ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
ip_rt_max_size = (rt_hash_mask + 1) * 16;
……

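The pointer rt_hash_table points to the cache hash table. Each bucket of the table is a struct rt_hash_bucket, and the chain hanging off each bucket is made up of struct rtable entries.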

/*
* Route cache.
*/
/* The locking scheme is rather straight forward:
*
* 1) Read-Copy Update protects the buckets of the central route hash.
* 2) Only writers remove entries, and they hold the lock
* as they look at rtable reference counts.
* 3) Only readers acquire references to rtable entries,
* they do so with atomic increments and with the
* lock held.
*/
struct rt_hash_bucket {
struct rtable *chain;
};

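rt_hash_bucket has a single member, a pointer to struct rtable that heads the chain. rtable describes one cache entry (more precisely, it contains a cache entry but is not limited to one):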

struct fib_nh;
struct inet_peer;
struct rtable
{
union
{
struct dst_entry dst;
} u;
/* Cache lookup keys */
struct flowi fl;
struct in_device *idev;
int rt_genid;
unsigned rt_flags;
__u16 rt_type;
__be32 rt_dst; /* Path destination */
__be32 rt_src; /* Path source */
int rt_iif;
/* Info on neighbour */
__be32 rt_gateway;
/* Miscellaneous cached information */
__be32 rt_spec_dst; /* RFC1122 specific destination */
struct inet_peer *peer; /* long-living peer info */
};

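rtable contains some protocol-specific fields plus one protocol-independent member of type struct dst_entry. The most elegant part of the route cache is this separate DST abstraction: it was designed as a protocol-independent structure, which means that IPv4, IPv6, or any other network-layer protocol can use it. Note that the dst member is wrapped in a union placed at the start of rtable, so dst_entry and rtable share the same address and a single pointer can conveniently be cast between the two types.

[Figure: the rt_hash_table bucket array, with each bucket heading a chain of rtable entries]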
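The cast between dst_entry and rtable described above can be illustrated with a small user-space sketch. The types and function names below (demo_dst, demo_rtable, demo_to_dst, demo_to_rtable) are hypothetical stand-ins, not kernel definitions; the only property being shown is that a structure and its first member share one address.

```c
#include <stddef.h>

/* Hypothetical stand-in for the protocol-independent struct dst_entry. */
struct demo_dst {
    int refcnt;
    unsigned long lastuse;
};

/* Hypothetical stand-in for struct rtable. */
struct demo_rtable {
    union {
        struct demo_dst dst;    /* must remain the first member */
    } u;
    unsigned int rt_dst;        /* a protocol-specific field */
};

/* Generic code only sees the protocol-independent part. */
struct demo_dst *demo_to_dst(struct demo_rtable *rt)
{
    return &rt->u.dst;
}

/* Protocol code recovers the full entry from the generic pointer.
 * This is valid only because u.dst sits at offset 0. */
struct demo_rtable *demo_to_rtable(struct demo_dst *dst)
{
    return (struct demo_rtable *)dst;
}
```

If another member were ever placed ahead of the union, the plain cast would silently break, which is presumably why the kernel keeps dst first.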

2.2 Allocating the hash table
The hash table of the DST cache is allocated by calling the kernel API alloc_large_system_hash:

rt_hash_table = (struct rt_hash_bucket *)
alloc_large_system_hash("IP route cache",
sizeof(struct rt_hash_bucket),
rhash_entries,
(totalram_pages >= 128 * 1024) ?
15 : 17,
0,
&rt_hash_log,
&rt_hash_mask,
rhash_entries ? 0 : 512 * 1024);

With the call above in mind, let us walk through the implementation of alloc_large_system_hash:

/*
* allocate a large system hash table from bootmem
* - it is assumed that the hash table must contain an exact power-of-2
* quantity of entries
* - limit is the number of hash buckets, not the total allocation size
*/
void *__init alloc_large_system_hash(const char *tablename, /* name of the hash table */
unsigned long bucketsize, /* size of each bucket in the hash table */
unsigned long numentries, /* total number of entries in the hash table */
int scale,
int flags,
unsigned int *_hash_shift,
unsigned int *_hash_mask,
unsigned long limit)
{

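The first three parameters are self-explanatory; the remaining ones are best understood while reading the code. One interesting point is that the caller does not have to specify the number of buckets at allocation time, even though that number is crucial for a hash table.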

unsigned long long max = limit;
unsigned long log2qty, size;
void *table = NULL;

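If the size of the hash table is not specified explicitly, the number of entries is computed automatically from the amount of system memory. For the DST subsystem, the value passed in is the kernel command-line parameter rhash_entries, which the user can set at boot time: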

/* allow the kernel cmdline to have a say */
if (!numentries) {
/* round applicable memory size up to nearest megabyte */
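/**
* numentries is derived from nr_kernel_pages, the actual number of pages in the DMA and NORMAL zones
*/
numentries = nr_kernel_pages;
/**
* The next steps round numentries up to the nearest whole megabyte, expressed in pages.
* On 32-bit x86 one megabyte holds 1UL << (20 - PAGE_SHIFT) = 256 pages, so 256 is used in the text below.
* For example, 100 is adjusted to 256, 257 to 512, and 1000 to 1024 (assuming 4 KB pages).
*/
/**
* Adding "pages per MB" means rounding up: an original value of 2 becomes 257 (the shifts below then turn it into 256)
* instead of 0 (rounding down). The -1 takes care of boundary values: 0 stays 0 and 256 stays 256
* (rather than being pushed up to 512).
*/
numentries += (1UL << (20 - PAGE_SHIFT)) - 1;
/* The right shift followed by the left shift simply aligns the value to a multiple of 256 */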
numentries >>= 20 - PAGE_SHIFT;
numentries <<= 20 - PAGE_SHIFT;
/* limit to 1 bucket per 2^scale bytes of low memory */
/* scale only matters when numentries was passed in as 0: it scales the automatically computed count to one bucket per 2^scale bytes */
if (scale > PAGE_SHIFT)
numentries >>= (scale - PAGE_SHIFT);
else
numentries <<= (PAGE_SHIFT - scale);
/* Make sure we've got at least a 0-order allocation.. */
if (unlikely((numentries * bucketsize) < PAGE_SIZE))
numentries = PAGE_SIZE / bucketsize;
}
/* Round up to the nearest power of two */
numentries = roundup_pow_of_two(numentries);
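The automatic sizing steps above can be tried out in user space. The following is a minimal sketch, assuming 4 KB pages (DEMO_PAGE_SHIFT = 12); demo_auto_entries and demo_roundup_pow_of_two are hypothetical helpers that mirror the arithmetic of the excerpt, not kernel functions.

```c
#define DEMO_PAGE_SHIFT 12   /* assumption: 4 KB pages */

/* Smallest power of two >= n (n > 0); stand-in for roundup_pow_of_two(). */
unsigned long demo_roundup_pow_of_two(unsigned long n)
{
    unsigned long p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* Mirror of the "numentries == 0" branch: derive an entry count from the
 * number of kernel pages, the scale and the bucket size. */
unsigned long demo_auto_entries(unsigned long kernel_pages, int scale,
                                unsigned long bucketsize)
{
    unsigned long n = kernel_pages;
    unsigned long page_size = 1UL << DEMO_PAGE_SHIFT;

    /* Round up to a whole number of megabytes, expressed in pages. */
    n += (1UL << (20 - DEMO_PAGE_SHIFT)) - 1;
    n >>= 20 - DEMO_PAGE_SHIFT;
    n <<= 20 - DEMO_PAGE_SHIFT;

    /* One bucket per 2^scale bytes of memory. */
    if (scale > DEMO_PAGE_SHIFT)
        n >>= (scale - DEMO_PAGE_SHIFT);
    else
        n <<= (DEMO_PAGE_SHIFT - scale);

    /* Never allocate less than one page worth of buckets. */
    if (n * bucketsize < page_size)
        n = page_size / bucketsize;

    return demo_roundup_pow_of_two(n);
}
```

With 131072 pages (512 MB) and 4-byte buckets, a scale of 15 gives 16384 entries and a scale of 17 gives 4096, which shows how the scale argument chosen in ip_rt_init changes the table size.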

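The last parameter, limit, is the upper bound on the number of hash buckets. If it is not specified (0), a default is computed: the table may occupy at most 1/16 of total memory (a right shift by four bits), and that size is divided by the bucket size to obtain a bucket count: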

/* limit allocation size to 1/16 total memory by default */
if (max == 0) {
max = ((unsigned long long)nr_all_pages << PAGE_SHIFT) >> 4;
do_div(max, bucketsize);
}

If numentries exceeds the limit, clamp it:

if (numentries > max)
numentries = max;

Take the base-2 logarithm of numentries (ilog2: log of base 2 of a 32-bit or a 64-bit unsigned value). The hash table then has 2^log2qty entries in total, which is conveniently written as 1 << log2qty.

log2qty = ilog2(numentries);

The table itself can be allocated in one of three ways, chosen mainly by the flags parameter and the global variable hashdist:

do {
size = bucketsize << log2qty;
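/**
* This is where the fifth parameter (flags) comes in: if HASH_EARLY is set, the table is being allocated
* during early boot and comes from bootmem; otherwise one of the other methods is used.
*/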
if (flags & HASH_EARLY)
table = alloc_bootmem_nopanic(size);
else if (hashdist)
table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
else {
/*
* If bucketsize is not a power-of-two, we may free
* some pages at the end of hash table which
* alloc_pages_exact() automatically does
*/
if (get_order(size) < MAX_ORDER) {
table = alloc_pages_exact(size, GFP_ATOMIC);
kmemleak_alloc(table, size, 1, GFP_ATOMIC);
}
}
} while (!table && size > PAGE_SIZE && --log2qty);

/* Allocation failed */

if (!table)
panic("Failed to allocate %s hash table\n", tablename);

/* The message printed on success shows what the key computed values mean; it can be checked against dmesg */

printk(KERN_INFO "%s hash table entries: %d (order: %d, %lu bytes)\n",
tablename,
(1U << log2qty),
ilog2(size) - PAGE_SHIFT,
size);

/**
* The sixth parameter returns log2qty to the caller; its meaning was explained above.
*/

if (_hash_shift)
*_hash_shift = log2qty;

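/**
* 1U << log2qty is the number of buckets. The seventh parameter, *_hash_mask, returns that number minus one,
* so the bucket count is *_hash_mask + 1. Because the count is a power of two, hash & mask always yields
* a valid array index in the range 0 .. count - 1.
*/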

if (_hash_mask)
*_hash_mask = (1 << log2qty) - 1;
return table;
}
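The relationship between log2qty, the mask and a bucket index can be sketched as follows. demo_ilog2, demo_hash_mask and demo_bucket_index are hypothetical user-space helpers written for illustration; only the arithmetic is taken from the excerpt above.

```c
/* Floor of log2(n) for n > 0; a stand-in for the kernel's ilog2(). */
unsigned int demo_ilog2(unsigned long n)
{
    unsigned int log = 0;

    while (n >>= 1)
        log++;
    return log;
}

/* The value handed back through _hash_mask: bucket count minus one. */
unsigned int demo_hash_mask(unsigned int log2qty)
{
    return (1U << log2qty) - 1;
}

/* Reduce an arbitrary 32-bit hash value to a valid bucket index.
 * For a power-of-two table this equals hash % (mask + 1). */
unsigned int demo_bucket_index(unsigned int hash, unsigned int mask)
{
    return hash & mask;
}
```

A caller would typically keep the mask and use demo_bucket_index(hash, mask) on every lookup, since a bitwise AND is cheaper than a modulo.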

The "dist" in hashdist stands for distribution:

#define HASH_EARLY 0x00000001 /* Allocating during early boot? */
/* Only NUMA needs hash distribution. 64bit NUMA architectures have
* sufficient vmalloc space.
*/
#if defined(CONFIG_NUMA) && defined(CONFIG_64BIT)
#define HASHDIST_DEFAULT 1
#else
#define HASHDIST_DEFAULT 0
#endif
extern int hashdist; /* Distribute hashes across NUMA nodes? */
int hashdist = HASHDIST_DEFAULT;
#ifdef CONFIG_NUMA
static int __init set_hashdist(char *str)
{
if (!str)
return 0;
hashdist = simple_strtoul(str, &str, 0);
return 1;
}
__setup("hashdist=", set_hashdist);
#endif

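So hashdist exists mainly to support NUMA, and the "distribution" presumably corresponds to a property of vmalloc: the memory it returns does not have to be physically contiguous.

Once the role of each parameter of alloc_large_system_hash is understood, the allocation of the DST hash table is fully clear.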

3. Cache lookup
When the IP layer receives a packet, ip_rcv_finish calls ip_route_input to perform the route lookup:

static int ip_rcv_finish(struct sk_buff *skb)
{
const struct iphdr *iph = ip_hdr(skb);
struct rtable *rt;
/*
* Initialise the virtual path cache for the packet. It describes
* how the packet travels inside Linux networking.
*/
if (skb_dst(skb) == NULL) {
int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos,
skb->dev);
if (unlikely(err)) {
if (err == -EHOSTUNREACH)
IP_INC_STATS_BH(dev_net(skb->dev),
IPSTATS_MIB_INADDRERRORS);
else if (err == -ENETUNREACH)
IP_INC_STATS_BH(dev_net(skb->dev),
IPSTATS_MIB_INNOROUTES);
goto drop;
}
}

ip_route_input first tries the cache and falls back to the routing table on a miss. Only the cache lookup is analyzed here:

int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
u8 tos, struct net_device *dev)
{
struct rtable * rth;
unsigned hash;
int iif = dev->ifindex;
struct net *net;
net = dev_net(dev);
if (!rt_caching(net))
goto skip_cache;
tos &= IPTOS_RT_MASK;
hash = rt_hash(daddr, saddr, iif, rt_genid(net));
rcu_read_lock();
for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
rth = rcu_dereference(rth->u.dst.rt_next)) {
if (((rth->fl.fl4_dst ^ daddr) |
(rth->fl.fl4_src ^ saddr) |
(rth->fl.iif ^ iif) |
rth->fl.oif |
(rth->fl.fl4_tos ^ tos)) == 0 &&
rth->fl.mark == skb->mark &&
net_eq(dev_net(rth->u.dst.dev), net) &&
!rt_is_expired(rth)) {
dst_use(&rth->u.dst, jiffies);
RT_CACHE_STAT_INC(in_hit);
rcu_read_unlock();
skb_dst_set(skb, &rth->u.dst);
return 0;
}
RT_CACHE_STAT_INC(in_hlist_search);
}
rcu_read_unlock();

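ip_route_input first calls rt_hash to compute the hash value, which selects the bucket in rt_hash_table. A for loop then walks every entry on that bucket's chain and tries to match it. The match criteria are:
destination address
source address
input interface
output interface (must be 0 for an input route)
ToS
netfilter mark
network namespace of the cached entry's device
whether the cache entry has expired

On a cache hit, dst_use takes a reference on the entry and updates its use counter and timestamp: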

static inline void dst_use(struct dst_entry *dst, unsigned long time)
{
dst_hold(dst);
dst->__use++;
dst->lastuse = time;
}

The RT_CACHE_STAT_INC macro increments the hit counter, and skb_dst_set attaches the dst to the current skb:

static inline void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
{
skb->_skb_dst = (unsigned long)dst;
}

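An important point is that when a cache entry is created, the functions for the next processing step (input and output) are stored in it. Once the skb's dst has been set here, the packet can go on to be processed and forwarded, for example: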

rth->u.dst.input = ip_forward;
rth->u.dst.output = ip_output;

A small change worth noting: compared with older kernels, the matching logic has changed visibly:

for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
rth = rcu_dereference(rth->u.rt_next)) {
if (rth->fl.fl4_dst == daddr &&
rth->fl.fl4_src == saddr &&
rth->fl.iif == iif &&
rth->fl.oif == 0 &&
#ifdef CONFIG_IP_ROUTE_FWMARK
rth->fl.fl4_fwmark == skb->nfmark &&
#endif
rth->fl.fl4_tos == tos) {

The code above is taken from 2.6.12. The newer version adds two comparisons:

net_eq(dev_net(rth->u.dst.dev), net) &&
!rt_is_expired(rth)

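net_eq checks that the cache entry belongs to the same network namespace as the receiving device; the route cache hash table is shared by all namespaces, so this keeps entries from different namespaces apart. rt_is_expired checks whether the cache entry has expired.
The other change is that comparisons of the form (XX == XX) && (YY == YY) became (XX ^ XX) | (YY ^ YY). This works because:
if A == B, then A ^ B = 0
so the OR of all the XOR results is 0 only when every field matches, and a single comparison against 0 replaces a chain of conditional tests.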
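The equivalence of the two matching styles can be checked with a small sketch. struct demo_key and both functions are hypothetical and cover only the five fields compared in the excerpts; the mark, namespace and expiry checks are left out.

```c
#include <stdint.h>

/* Hypothetical subset of the lookup key stored in a cache entry. */
struct demo_key {
    uint32_t dst, src;
    int iif, oif;
    uint8_t tos;
};

/* 2.6.12 style: a chain of equality tests joined by &&. */
int demo_match_and(const struct demo_key *k, uint32_t daddr, uint32_t saddr,
                   int iif, uint8_t tos)
{
    return k->dst == daddr && k->src == saddr &&
           k->iif == iif && k->oif == 0 && k->tos == tos;
}

/* 2.6.31 style: XOR each pair, OR the results, compare once with 0.
 * Any differing bit in any field makes the combined value non-zero. */
int demo_match_xor(const struct demo_key *k, uint32_t daddr, uint32_t saddr,
                   int iif, uint8_t tos)
{
    return ((k->dst ^ daddr) |
            (k->src ^ saddr) |
            (uint32_t)(k->iif ^ iif) |
            (uint32_t)k->oif |
            (uint32_t)(k->tos ^ tos)) == 0;
}
```

Both functions return the same result for every input; the second form simply trades several short-circuit branches for a few bitwise operations.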

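3. Adding cache entries

When the cache lookup misses, the kernel looks up the routing table. If that lookup succeeds, a cache entry is created and inserted into the route cache hash table, so that later packets no longer need to consult the routing table. For example: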

/* Allocate a route cache entry */
rth = dst_alloc(&ipv4_dst_ops);
if (!rth)
goto e_nobufs;
/* Initialize the members of the rtable */
rth->u.dst.output= ip_rt_bug;
rth->rt_genid = rt_genid(net);
atomic_set(&rth->u.dst.__refcnt, 1);
rth->u.dst.flags= DST_HOST;
if (IN_DEV_CONF_GET(in_dev, NOPOLICY))
rth->u.dst.flags |= DST_NOPOLICY;
rth->fl.fl4_dst = daddr;
rth->rt_dst = daddr;
rth->fl.fl4_tos = tos;
rth->fl.mark = skb->mark;
rth->fl.fl4_src = saddr;
rth->rt_src = saddr;
#ifdef CONFIG_NET_CLS_ROUTE
rth->u.dst.tclassid = itag;
#endif
rth->rt_iif =
rth->fl.iif = dev->ifindex;
rth->u.dst.dev = net->loopback_dev;
dev_hold(rth->u.dst.dev);
rth->idev = in_dev_get(rth->u.dst.dev);
rth->rt_gateway = daddr;
rth->rt_spec_dst= spec_dst;
rth->u.dst.input= ip_local_deliver;
rth->rt_flags = flags|RTCF_LOCAL;
if (res.type == RTN_UNREACHABLE) {
rth->u.dst.input= ip_error;
rth->u.dst.error= -err;
rth->rt_flags &= ~RTCF_LOCAL;
}
rth->rt_type = res.type;
/* Compute the hash value */
hash = rt_hash(daddr, saddr, fl.iif, rt_genid(net));
/* Insert the entry into the hash table */
err = rt_intern_hash(hash, rth, NULL, skb);

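Inserting a cache entry involves four steps:
1. Call dst_alloc to allocate a cache entry;
2. Initialize the members of rth (struct rtable). Fully understanding the rtable structure requires analyzing the whole routing subsystem, so it is skipped here;
3. Compute the hash value to find the bucket in the cache table;
4. Call rt_intern_hash to insert the cache entry.

3.1 Allocating a cache entry
To add an entry to the hash table, dst_alloc is first called to allocate a route cache entry. Allocation boils down to taking an object from the SLAB cache. Each allocation first checks whether the number of entries exceeds gc_thresh and, if so, runs garbage collection, which is described in detail later: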

void * dst_alloc(struct dst_ops * ops)
{
struct dst_entry * dst;
/* Garbage collection */
if (ops->gc && atomic_read(&ops->entries) > ops->gc_thresh) {
if (ops->gc(ops))
return NULL;
}
/* Allocate the entry from the SLAB cache */
dst = kmem_cache_zalloc(ops->kmem_cachep, GFP_ATOMIC);
if (!dst)
return NULL;
/* Initialize the members */
atomic_set(&dst->__refcnt, 0);
dst->ops = ops;
dst->lastuse = jiffies;
dst->path = dst;
dst->input = dst->output = dst_discard;
#if RT_CACHE_DEBUG >= 2
atomic_inc(&dst_total);
#endif
atomic_inc(&ops->entries);
return dst;
}

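Each object in the SLAB cache has size sizeof(struct rtable), so what dst_alloc allocates is not just a dst but a whole rtable, which makes the function name slightly misleading. Since dst and rtable pointers can be converted into each other, this is not a problem. The name is not entirely inaccurate either: the function only initializes the dst part, and initializing the rest of the rtable is left to the caller.

3.2 Inserting a cache entry
Cache entries are inserted by rt_intern_hash: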

static int rt_intern_hash(unsigned hash, struct rtable *rt,
struct rtable **rp, struct sk_buff *skb)
{
struct rtable *rth, **rthp;
unsigned long now;
struct rtable *cand, **candp;
u32 min_score;
int chain_length;
int attempts = !in_softirq();
restart:
chain_length = 0;
min_score = ~(u32)0;
cand = NULL;
candp = NULL;
now = jiffies;
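/* The first thing rt_intern_hash does is check whether the entry to be inserted already exists in the cache hash table.
* Normally an insert follows a lookup that missed, so this check looks redundant.
* But the route cache hash table can be used by several CPUs in parallel: an entry may miss on one CPU while another CPU is inserting it.
*/
/* Route caching is disabled for the device's network namespace, so there is nothing to search */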
if (!rt_caching(dev_net(rt->u.dst.dev))) {
/*
* If we're not caching, just tell the caller we
* were successful and don't touch the route. The
* caller hold the sole reference to the cache entry, and
* it will be released when the caller is done with it.
* If we drop it here, the callers have no way to resolve routes
* when we're not caching. Instead, just point *rp at rt, so
* the caller gets a single use out of the route
* Note that we do rt_free on this new route entry, so that
* once its refcount hits zero, we are still able to reap it
* (Thanks Alexey)
* Note also the rt_free uses call_rcu. We don't actually
* need rcu protection here, this is just our path to get
* on the route gc list.
*/
/* For unicast forwarding or locally generated packets, try to bind the route to an ARP neighbour */
if (rt->rt_type == RTN_UNICAST || rt->fl.iif == 0) {
int err = arp_bind_neighbour(&rt->u.dst);
if (err) {
/* Handle the failure */
if (net_ratelimit())
printk(KERN_WARNING
"Neighbour table failure & not caching routes.\n");
rt_drop(rt);
return err;
}
}
rt_free(rt);
goto skip_hashing;
}
/* Get the head of the hash chain. Unlike an ordinary list walk, rthp is a pointer to a pointer */
rthp = &rt_hash_table[hash].chain;
/* Take the chain lock */
spin_lock_bh(rt_hash_lock_addr(hash));
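/* Walk the chain to see whether the entry already exists.
* Note how deletion works: rthp is a pointer to a pointer, which makes list manipulation efficient.
* On each iteration rthp does not point to the next node in the chain; it points to the pointer that points to that node (dst.rt_next):
* rthp = &rth->u.dst.rt_next; To delete a node it is enough to redirect that pointer to the successor of the node being deleted:
* *rthp = rth->u.dst.rt_next; so no separate pointer to the previous node has to be kept.
*/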
while ((rth = *rthp) != NULL) {
/* Remove entries that have expired */
if (rt_is_expired(rth)) {
*rthp = rth->u.dst.rt_next;
rt_free(rth);
continue;
}
/* Compare the keys to see whether the entry to insert already exists */
if (compare_keys(&rth->fl, &rt->fl) && compare_netns(rth, rt)) {
/* On a match, move the entry to the head of the chain: it was used most recently, so it is likely to be looked up again soon */
/* Put it first */
*rthp = rth->u.dst.rt_next;
/*
* Since lookup is lockfree, the deletion
* must be visible to another weakly ordered CPU before
* the insertion at the start of the hash chain.
*/
rcu_assign_pointer(rth->u.dst.rt_next,
rt_hash_table[hash].chain);
/*
* Since lookup is lockfree, the update writes
* must be ordered for consistency on SMP.
*/
rcu_assign_pointer(rt_hash_table[hash].chain, rth);
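/* Take a reference and update the use counter (__use) and the timestamp */
dst_use(&rth->u.dst, now);
/* Unlock */
spin_unlock_bh(rt_hash_lock_addr(hash));
/* The entry already exists, so drop the one we were about to insert */
rt_drop(rt);
/* If the caller supplied rp, return the found entry through it; otherwise attach it to the skb with skb_dst_set */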
if (rp)
*rp = rth;
else
skb_dst_set(skb, &rth->u.dst);
return 0;
}
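/* If nobody holds a reference to this entry, call rt_score to evaluate it as a candidate for eviction.
* rt_score computes a score; the entry with the lowest score is recorded in cand. candp serves the same purpose as rthp:
* it lets cand be unlinked later without a pointer to the previous node.
*/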
if (!atomic_read(&rth->u.dst.__refcnt)) {
u32 score = rt_score(rth);
/* min_score starts at the largest 32-bit value and is lowered whenever a smaller score is found, so it ends up holding the minimum */
if (score <= min_score) {
cand = rth;
candp = rthp;
min_score = score;
}
}
/* Count the chain length; it is compared against the garbage collection thresholds below */
chain_length++;
rthp = &rth->u.dst.rt_next;
}
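/* The chain has been walked and no matching entry was found, so the new entry will be inserted. Before inserting, the best eviction candidate cand (if any) may be removed, to keep the cache from overflowing */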
if (cand) {
/* ip_rt_gc_elasticity used to be average length of chain
* length, when exceeded gc becomes really aggressive.
*
* The second limit is less certain. At the moment it allows
* only 2 entries per bucket. We will see.
*/
/* If a candidate was found and the chain is longer than the garbage collection threshold ip_rt_gc_elasticity, remove the candidate with rt_free */
if (chain_length > ip_rt_gc_elasticity) {
*candp = cand->u.dst.rt_next;
rt_free(cand);
}
} else {
/* No candidate was found, but the chain is longer than the maximum chain length: garbage collection is still needed, this time through rt_emergency_hash_rebuild */
if (chain_length > rt_chain_length_max) {
struct net *net = dev_net(rt->u.dst.dev);
int num = ++net->ipv4.current_rt_cache_rebuild_count;
if (!rt_caching(dev_net(rt->u.dst.dev))) {
printk(KERN_WARNING "%s: %d rebuilds is over limit, route caching disabled\n",
rt->u.dst.dev->name, num);
}
rt_emergency_hash_rebuild(dev_net(rt->u.dst.dev));
}
}
/* Try to bind route to arp only if it is output
route or unicast forwarding path.
*/
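/* For unicast forwarding or locally generated packets, bind the route cache entry to an ARP neighbour. This speeds up transmission, because the layer-2 header can then be built easily */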
if (rt->rt_type == RTN_UNICAST || rt->fl.iif == 0) {
int err = arp_bind_neighbour(&rt->u.dst);
/* Binding failed */
if (err) {
spin_unlock_bh(rt_hash_lock_addr(hash));
/* Any error other than -ENOBUFS: drop the entry and return the error */
if (err != -ENOBUFS) {
rt_drop(rt);
return err;
}
/* Neighbour tables are full and nothing
can be released. Try to shrink route cache,
it is most likely it holds some neighbour records.
*/
/* Otherwise tighten the garbage collection parameters, call rt_garbage_collect to reclaim entries actively, and retry */
if (attempts-- > 0) {
int saved_elasticity = ip_rt_gc_elasticity;
int saved_int = ip_rt_gc_min_interval;
ip_rt_gc_elasticity = 1;
ip_rt_gc_min_interval = 0;
rt_garbage_collect(&ipv4_dst_ops);
ip_rt_gc_min_interval = saved_int;
ip_rt_gc_elasticity = saved_elasticity;
goto restart;
}
/* Retries exhausted and binding still fails */
if (net_ratelimit())
printk(KERN_WARNING "Neighbour table overflow.\n");
rt_drop(rt);
return -ENOBUFS;
}
}
rt->u.dst.rt_next = rt_hash_table[hash].chain;
#if RT_CACHE_DEBUG >= 2
if (rt->u.dst.rt_next) {
struct rtable *trt;
printk(KERN_DEBUG "rt_cache @%02x: %pI4",
hash, &rt->rt_dst);
for (trt = rt->u.dst.rt_next; trt; trt = trt->u.dst.rt_next)
printk(" . %pI4", &trt->rt_dst);
printk("\n");
}
#endif
/*
* Since lookup is lockfree, we must make sure
* previous writes to rt are comitted to memory
* before making rt visible to other CPUS.
*/
/* Insert the entry */
rcu_assign_pointer(rt_hash_table[hash].chain, rt);
/* Unlock */
spin_unlock_bh(rt_hash_lock_addr(hash));
skip_hashing:
if (rp)
*rp = rt;
else
skb_dst_set(skb, &rt->u.dst);
return 0;
}
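The pointer-to-pointer walk and the move-to-front step used by rt_intern_hash can be reproduced on an ordinary singly linked list. This is a simplified sketch with hypothetical names (demo_node, demo_unlink, demo_move_to_front); it has no locking and no RCU barriers, both of which the kernel code needs.

```c
#include <stddef.h>

struct demo_node {
    int key;
    struct demo_node *next;
};

/* Unlink the first node with the given key and return it, or NULL.
 * pp always points at the link that leads to the current node, so no
 * separate "previous node" pointer is required. */
struct demo_node *demo_unlink(struct demo_node **head, int key)
{
    struct demo_node **pp = head;
    struct demo_node *n;

    while ((n = *pp) != NULL) {
        if (n->key == key) {
            *pp = n->next;      /* bypass the node */
            n->next = NULL;
            return n;
        }
        pp = &n->next;
    }
    return NULL;
}

/* On a hit, move the node to the head of the chain; returns 1 on a hit. */
int demo_move_to_front(struct demo_node **head, int key)
{
    struct demo_node *n = demo_unlink(head, key);

    if (n == NULL)
        return 0;
    n->next = *head;
    *head = n;
    return 1;
}
```

The same idiom works unchanged when the node to remove is the first one, because head itself is just another link that pp can point to.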

4. Releasing cache entries
rt_intern_hash calls rt_free to release entries and rt_drop to discard them in several places.
The two functions are very similar:

static inline void rt_free(struct rtable *rt)
{
call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
}

static inline void rt_drop(struct rtable *rt)
{
ip_rt_put(rt);
call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
}

rt_drop has one extra call, ip_rt_put, which decrements the entry's reference counter through dst_release:

static inline void ip_rt_put(struct rtable * rt)
{
if (rt)
dst_release(&rt->u.dst);
}

static inline void dst_rcu_free(struct rcu_head *head)
{
struct dst_entry *dst = container_of(head, struct dst_entry, rcu_head);
dst_free(dst);
}

static inline void dst_free(struct dst_entry * dst)
{
/* obsolete > 1 means the entry has already been processed, so return immediately */
if (dst->obsolete > 1)
return;
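/* If the reference counter is 0, destroy the entry with dst_destroy; otherwise hand it to __dst_free.
* In particular, if dst_destroy returns a child that is still referenced, that child is passed to __dst_free as well.
*/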
if (!atomic_read(&dst->__refcnt)) {
dst = dst_destroy(dst);
if (!dst)
return;
}
__dst_free(dst);
}

void __dst_free(struct dst_entry * dst)
{
spin_lock_bh(&dst_garbage.lock);
___dst_free(dst);
dst->next = dst_garbage.list;
dst_garbage.list = dst;
if (dst_garbage.timer_inc > DST_GC_INC) {
dst_garbage.timer_inc = DST_GC_INC;
dst_garbage.timer_expires = DST_GC_MIN;
cancel_delayed_work(&dst_gc_work);
schedule_delayed_work(&dst_gc_work, dst_garbage.timer_expires);
}
spin_unlock_bh(&dst_garbage.lock);
}

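__dst_free first calls ___dst_free. If the device is missing (NULL) or is not IFF_UP, ___dst_free points the dst's input/output function pointers at dst_discard. In every case it sets obsolete = 2, which marks the entry as DEAD, that is, as already processed by ___dst_free. This matches the check in dst_free shown earlier: when obsolete > 1 it returns immediately, because the entry has already been handled: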

static void ___dst_free(struct dst_entry * dst)
{
/* The first case (dev==NULL) is required, when
protocol module is unloaded.
*/
if (dst->dev == NULL || !(dst->dev->flags&IFF_UP)) {
dst->input = dst->output = dst_discard;
}
dst->obsolete = 2;
}

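The next step puts the dst on the global list dst_garbage.list. This means the entry should be released, but because its reference counter is not 0 it is parked here for now; in effect it sits on death row awaiting execution: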

dst->next = dst_garbage.list;
dst_garbage.list = dst;

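The timing of this deferred execution is implemented with delayed work. If the time increment timer_inc is greater than DST_GC_INC, it is reset and the garbage list is scheduled to be processed after the minimum delay DST_GC_MIN: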

if (dst_garbage.timer_inc > DST_GC_INC) {
dst_garbage.timer_inc = DST_GC_INC;
dst_garbage.timer_expires = DST_GC_MIN;
cancel_delayed_work(&dst_gc_work);
schedule_delayed_work(&dst_gc_work, dst_garbage.timer_expires);
}
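The idea of parking still-referenced entries on a garbage list can be sketched in user space. The worker that later processes dst_garbage.list is not shown in this post, so demo_gc_pass below is only an assumption about what such a pass has to do (unlink entries whose reference count has dropped to 0); all names are hypothetical.

```c
#include <stddef.h>

struct demo_dst {
    int refcnt;
    int obsolete;
    struct demo_dst *next;
};

/* Hypothetical counterpart of dst_garbage.list. */
struct demo_dst *demo_garbage_list;

/* Mark the entry dead and park it until its references are gone. */
void demo_defer_free(struct demo_dst *dst)
{
    dst->obsolete = 2;
    dst->next = demo_garbage_list;
    demo_garbage_list = dst;
}

/* One pass of a delayed worker: unlink every entry that is no longer
 * referenced and return how many were unlinked. A real implementation
 * would free the memory at the marked line. */
int demo_gc_pass(void)
{
    struct demo_dst **pp = &demo_garbage_list;
    struct demo_dst *d;
    int freed = 0;

    while ((d = *pp) != NULL) {
        if (d->refcnt == 0) {
            *pp = d->next;
            d->next = NULL;     /* the entry would be freed here */
            freed++;
        } else {
            pp = &d->next;
        }
    }
    return freed;
}
```

An entry that is still referenced simply survives the pass and is examined again the next time the worker runs.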

The memory of a cache entry is finally released by dst_destroy. In essence, releasing an entry means returning its memory to the SLAB cache:

kmem_cache_free(dst->ops->kmem_cachep, dst);

Because dst is tied to other subsystems, however, the actual procedure is a little more involved:

struct dst_entry *dst_destroy(struct dst_entry * dst)
{
struct dst_entry *child;
struct neighbour *neigh;
struct hh_cache *hh;
smp_rmb();
again:
neigh = dst->neighbour;
hh = dst->hh;
child = dst->child;
/* Release the cached layer-2 header that belongs to this entry */
dst->hh = NULL;
if (hh && atomic_dec_and_test(&hh->hh_refcnt))
kfree(hh);
/* Release the corresponding neighbour (layer-2) entry */
if (neigh) {
dst->neighbour = NULL;
neigh_release(neigh);
}
/* Decrement the total entry counter */
atomic_dec(&dst->ops->entries);
/* Call the protocol's destroy hook, if it defines one */
if (dst->ops->destroy)
dst->ops->destroy(dst);
/* Drop the reference on the device */
if (dst->dev)
dev_put(dst->dev);
#if RT_CACHE_DEBUG >= 2
atomic_dec(&dst_total);
#endif
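/* Free the memory */
kmem_cache_free(dst->ops->kmem_cachep, dst);
/* The child pointer is used by the IPsec code and may head a chain of children. When a dst is released, its child is released too:
* if the child has DST_NOHASH set and is no longer referenced, goto again frees it in turn. If it is still referenced, it is returned,
* and, as noted in the analysis of dst_free, a non-NULL return from dst_destroy is passed on to __dst_free.
*/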
dst = child;
if (dst) {
int nohash = dst->flags & DST_NOHASH;
if (atomic_dec_and_test(&dst->__refcnt)) {
/* We were real parent of this dst, so kill child. */
if (nohash)
goto again;
} else {
/* Child is still referenced, return it for freeing. */
if (nohash)
return dst;
/* Child is still in his hash table */
}
}
return NULL;
}

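5. Garbage collection
There are many reasons to collect garbage, for example a huge number of cache entries consuming too much memory. As seen above, inserting a new cache entry always tries to delete a suitable old one.
That is one example of cache garbage collection.

The cache subsystem uses two garbage collection mechanisms:
Synchronous collection
When a new cache entry is allocated and the total number of entries already exceeds the threshold gc_thresh.
When a new entry is about to be inserted into the hash table and its chain contains an entry that is suitable for deletion, as seen earlier.
When the neighbour subsystem needs memory. dst entries and the layer-2 (neighbour) cache reference each other, as seen in dst_destroy. If the neighbour cache cannot allocate memory, synchronous collection is triggered to release, indirectly, the memory held by neighbour entries.
Asynchronous collection
A timer periodically triggers garbage collection so that the cache size always stays within a reasonable range.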

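5.1 Synchronous collection
dst_alloc triggers collection when it allocates a new entry and finds that the total number of entries already exceeds the threshold gc_thresh:

/* Garbage collection */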
if (ops->gc && atomic_read(&ops->entries) > ops->gc_thresh) {
if (ops->gc(ops))
return NULL;
}

At initialization, ops->gc is set to point to rt_garbage_collect:

static struct dst_ops ipv4_dst_ops = {
.family = AF_INET,
.protocol = cpu_to_be16(ETH_P_IP),
.gc = rt_garbage_collect,
.check = ipv4_dst_check,
.destroy = ipv4_dst_destroy,
.ifdown = ipv4_dst_ifdown,
.negative_advice = ipv4_negative_advice,
.link_failure = ipv4_link_failure,
.update_pmtu = ip_rt_update_pmtu,
.local_out = __ip_local_out,
.entries = ATOMIC_INIT(0),
};

rt_garbage_collect is the core function of synchronous collection. It is long and complicated, so it is analyzed piece by piece:

/*
Short description of GC goals.
We want to build algorithm, which will keep routing cache
at some equilibrium point, when number of aged off entries
is kept approximately equal to newly generated ones.
Current expiration strength is variable "expire".
We try to adjust it dynamically, so that if networking
is idle expires is large enough to keep enough of warm entries,
and when load increases it reduces to limit cache size.
*/
static int rt_garbage_collect(struct dst_ops *ops)
{
static unsigned long expire = RT_GC_TIMEOUT;
static unsigned long last_gc;
static int rover;
static int equilibrium;
struct rtable *rth, **rthp;
unsigned long now = jiffies;
int goal;
/*
* Garbage collection is pretty expensive,
* do not make it too frequently.
*/
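/* Count every collection attempt */
RT_CACHE_STAT_INC(gc_total);
/* Synchronous collection is CPU-intensive, so it must not run too often. If less than ip_rt_gc_min_interval
* has passed since the last run and the number of entries is below ip_rt_max_size, count the call as ignored and return.
*/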
if (now - last_gc < ip_rt_gc_min_interval &&
atomic_read(&ipv4_dst_ops.entries) < ip_rt_max_size) {
RT_CACHE_STAT_INC(gc_ignored);
goto out;
}
/* Calculate number of entries, which we want to expire now. */
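/* rt_garbage_collect first computes goal, the number of entries to collect. When the number of entries exceeds
* ip_rt_gc_elasticity * 2^rt_hash_log, the cache is considered too large and at risk.
* goal starts out as the excess over that value. Two cases follow: goal <= 0 means the load is not yet heavy,
* otherwise the hash table is heavily loaded and a more aggressive collection is needed.
*/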
goal = atomic_read(&ipv4_dst_ops.entries) -
(ip_rt_gc_elasticity << rt_hash_log);
if (goal <= 0) {
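/* gc_thresh is the collection threshold: below that many entries nothing needs to be collected. If equilibrium
* is below gc_thresh, raise it to gc_thresh. equilibrium can be read as what remains of the cache once the goal entries have been collected */
if (equilibrium < ipv4_dst_ops.gc_thresh)
equilibrium = ipv4_dst_ops.gc_thresh;
/* Recompute the number of entries to collect */
goal = atomic_read(&ipv4_dst_ops.entries) - equilibrium;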
if (goal > 0) {
equilibrium += min_t(unsigned int, goal >> 1, rt_hash_mask + 1);
goal = atomic_read(&ipv4_dst_ops.entries) - equilibrium;
}
} else {
/* We are in dangerous area. Try to reduce cache really
* aggressively.
*/
/* The hash table is overloaded and a more aggressive approach is needed */
goal = max_t(unsigned int, goal >> 1, rt_hash_mask + 1);
equilibrium = atomic_read(&ipv4_dst_ops.entries) - goal;
}
/* Update the timestamp of the last garbage collection */
if (now - last_gc >= ip_rt_gc_min_interval)
last_gc = now;
/* Nothing needs to be collected */
if (goal <= 0) {
equilibrium += goal;
goto work_done;
}
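The goal computation above is easier to follow with concrete numbers. The sketch below repeats the same arithmetic in user space; demo_gc_state and demo_gc_goal are hypothetical names, the counters are plain ints instead of atomics, and the time-based checks are omitted.

```c
/* Holds the value that the kernel keeps in the static "equilibrium". */
struct demo_gc_state {
    int equilibrium;
};

/* Returns the number of entries to collect; a value <= 0 means that
 * nothing has to be collected in this round. */
int demo_gc_goal(struct demo_gc_state *st, int entries, int elasticity,
                 unsigned int hash_log, int gc_thresh)
{
    int buckets = 1 << hash_log;
    int goal = entries - (elasticity << hash_log);

    if (goal <= 0) {
        /* Normal load: aim for an equilibrium of at least gc_thresh. */
        if (st->equilibrium < gc_thresh)
            st->equilibrium = gc_thresh;
        goal = entries - st->equilibrium;
        if (goal > 0) {
            int half = goal >> 1;

            st->equilibrium += half < buckets ? half : buckets;
            goal = entries - st->equilibrium;
        }
    } else {
        /* Dangerous area: collect at least one entry per bucket. */
        int half = goal >> 1;

        goal = half > buckets ? half : buckets;
        st->equilibrium = entries - goal;
    }

    if (goal <= 0)
        st->equilibrium += goal;
    return goal;
}
```

For a table of 1024 buckets with elasticity 8 and gc_thresh 1024, 2000 entries yield a goal of 488, while 10000 entries fall into the aggressive branch and yield a goal of 1024.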

The next job is to collect goal cache entries, which the function does with a rather complex three-level loop. Start with the innermost loop:

rthp = &rt_hash_table[k].chain;
spin_lock_bh(rt_hash_lock_addr(k));
while ((rth = *rthp) != NULL) {
if (!rt_is_expired(rth) &&
!rt_may_expire(rth, tmo, expire)) {
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
continue;
}
*rthp = rth->u.dst.rt_next;
rt_free(rth);
goal--;
}
spin_unlock_bh(rt_hash_lock_addr(k));
if (goal <= 0)
break;

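This loop walks hash chain k (k is the loop variable of the second-level loop). It uses rt_is_expired and rt_may_expire to decide whether a cache entry should be deleted. The deletion code was analyzed earlier; the difference here is that goal is decremented, with the aim of bringing it down to 0 or below, at which point the collection target is met. If an entry does not qualify for deletion, tmo (the second argument of rt_may_expire) is halved. With the third argument, expire, unchanged, the lower tmo is, the more likely the following entries are to be deleted. This is a key point of the loop.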

for (i = rt_hash_mask, k = rover; i >= 0; i--) {
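/* tmo starts out as expire, and expire itself is halved in the outermost loop, so the more passes are made
(because goal keeps not being met), the stricter the collection criteria become */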
unsigned long tmo = expire;
k = (k + 1) & rt_hash_mask;
……
spin_unlock_bh(rt_hash_lock_addr(k));
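/* In the second-level loop, once goal entries have been collected, break out of this level. A goto straight out of
* all three levels is not used here because rover has not yet been set; ending the loop is left to the outer level.
*/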
if (goal <= 0)
break;
}
/* Update rover */
rover = k;

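Besides tmo, rover is the other thing to watch in this loop. rover is a static variable, and the reason for introducing it is this:
Suppose the cache table is large. The first collection starts at the beginning and meets goal after covering only part of the table, then exits. If the second collection started from the beginning again, the entries further back might never get a chance to be examined.
For fairness and efficiency, rover remembers the hash chain reached last time, and the next run simply continues from there. That is why k is initialized to rover in the for loop, and why rover has to be updated after the for loop ends.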
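The effect of rover can be demonstrated with a toy table in which each bucket only stores how many reclaimable entries it holds. This is a sketch with hypothetical names (demo_rover, demo_scan) and a fixed table of 8 buckets; like the kernel loop, it finishes the current bucket before checking whether the goal has been met.

```c
#define DEMO_BUCKETS 8                  /* must be a power of two */
#define DEMO_MASK (DEMO_BUCKETS - 1)

/* Persists across calls, like the static rover in the kernel function. */
static int demo_rover;

/* Reclaim entries bucket by bucket, starting after the bucket where the
 * previous call stopped. Returns the number of entries reclaimed. */
int demo_scan(int counts[DEMO_BUCKETS], int goal)
{
    int reclaimed = 0;
    int k = demo_rover;
    int i;

    for (i = DEMO_MASK; i >= 0; i--) {
        k = (k + 1) & DEMO_MASK;        /* wrap around the table */
        reclaimed += counts[k];         /* empty the whole bucket */
        counts[k] = 0;
        if (reclaimed >= goal)
            break;
    }
    demo_rover = k;                     /* remember where we stopped */
    return reclaimed;
}
```

Two consecutive calls therefore touch different buckets, so entries near the end of the table are reached eventually even when every call meets its goal early.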

With the two inner loops understood, the outermost loop is comparatively simple:

do {
……
rover = k;
if (goal <= 0)
goto work_done;
/* Goal is not achieved. We stop process if:
- if expire reduced to zero. Otherwise, expire is halfed.
- if table is not full.
- if we are called from interrupt.
- jiffies check is just fallback/debug loop breaker.
We will not spin here for long time in any case.
*/
RT_CACHE_STAT_INC(gc_goal_miss);
if (expire == 0)
break;
expire >>= 1;
#if RT_CACHE_DEBUG >= 2
printk(KERN_DEBUG "expire>> %u %d %d %d\n", expire,
atomic_read(&ipv4_dst_ops.entries), goal, i);
#endif
if (atomic_read(&ipv4_dst_ops.entries) < ip_rt_max_size)
goto out;
}while (!in_softirq() && time_before_eq(jiffies, now));

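Back in the outermost loop: if goal has been met, the loop ends. Otherwise collection has not gone as planned and a stricter policy is needed (the hash table has to be scanned again, my God!).
Strictness is increased by halving expire; since tmo starts out as expire in the for loop, rt_may_expire in the innermost loop becomes harsher. The scan repeats until goal entries have been collected.
This could of course tie up the CPU for a long time, so the do...while loop has several conditions for stopping early:
1. expire has reached 0 and cannot be halved any further.
2. The number of entries, ipv4_dst_ops.entries, is below ip_rt_max_size. The hash table is heavily loaded but not yet full, which is still barely acceptable.
3. rt_garbage_collect is running in softirq context, or jiffies has advanced since the scan started (the scan has taken more than one jiffy), in which case the CPU should be yielded.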

What follows is some cleanup work:

if (atomic_read(&ipv4_dst_ops.entries) < ip_rt_max_size)
goto out;
/* Every effort has been made but the collection target was still not met: report a cache overflow */
if (net_ratelimit())
printk(KERN_WARNING "dst cache overflow\n");
RT_CACHE_STAT_INC(gc_dst_overflow);
return 1;
work_done:
/* Reset expire */
expire += ip_rt_gc_min_interval;
if (expire > ip_rt_gc_timeout ||
atomic_read(&ipv4_dst_ops.entries) < ipv4_dst_ops.gc_thresh)
expire = ip_rt_gc_timeout;
#if RT_CACHE_DEBUG >= 2
printk(KERN_DEBUG "expire++ %u %d %d %d\n", expire,
atomic_read(&ipv4_dst_ops.entries), goal, rover);
#endif
out: return 0;
}

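5.2 Asynchronous collection
As the analysis above shows, synchronous collection can be very resource-hungry in extreme cases. To avoid this, the kernel also uses asynchronous collection, which means that a timer does the collection work periodically.
The initialization of the routing subsystem sets up a piece of delayed work, expires_work: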

/* All the timers, started at system startup tend
to synchronize. Perturb it a bit.
*/
INIT_DELAYED_WORK_DEFERRABLE(&expires_work, rt_worker_func);
/*
* rt_worker_func() is run in process context.
* we call rt_check_expire() to scan part of the hash table
*/
static void rt_worker_func(struct work_struct *work)
{
rt_check_expire();
schedule_delayed_work(&expires_work, ip_rt_gc_interval);
}

The core function of asynchronous collection is rt_check_expire; afterwards rt_worker_func calls schedule_delayed_work to re-arm the asynchronous collection:

static void rt_check_expire(void)
{
static unsigned int rover;
unsigned int i = rover, goal;
struct rtable *rth, *aux, **rthp;
unsigned long samples = 0;
unsigned long sum = 0, sum2 = 0;
unsigned long delta;
u64 mult;
delta = jiffies - expires_ljiffies;
expires_ljiffies = jiffies;
mult = ((u64)delta) << rt_hash_log;
if (ip_rt_gc_timeout > 1)
do_div(mult, ip_rt_gc_timeout);
goal = (unsigned int)mult;
if (goal > rt_hash_mask)
goal = rt_hash_mask + 1;
for (; goal > 0; goal--) {
unsigned long tmo = ip_rt_gc_timeout;
unsigned long length;
i = (i + 1) & rt_hash_mask;
rthp = &rt_hash_table[i].chain;
if (need_resched())
cond_resched();
samples++;
if (*rthp == NULL)
continue;
length = 0;
spin_lock_bh(rt_hash_lock_addr(i));
while ((rth = *rthp) != NULL) {
prefetch(rth->u.dst.rt_next);
if (rt_is_expired(rth)) {
*rthp = rth->u.dst.rt_next;
rt_free(rth);
continue;
}
if (rth->u.dst.expires) {
/* Entry is expired even if it is in use */
if (time_before_eq(jiffies, rth->u.dst.expires)) {
nofree:
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
/*
* We only count entries on
* a chain with equal hash inputs once
* so that entries for different QOS
* levels, and other non-hash input
* attributes don't unfairly skew
* the length computation
*/
for (aux = rt_hash_table[i].chain;;) {
if (aux == rth) {
length += ONE;
break;
}
if (compare_hash_inputs(&aux->fl, &rth->fl))
break;
aux = aux->u.dst.rt_next;
}
continue;
}
} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout))
goto nofree;
/* Cleanup aged off entries. */
*rthp = rth->u.dst.rt_next;
rt_free(rth);
}
spin_unlock_bh(rt_hash_lock_addr(i));
sum += length;
sum2 += length*length;
}
if (samples) {
unsigned long avg = sum / samples;
unsigned long sd = int_sqrt(sum2 / samples - avg*avg);
rt_chain_length_max = max_t(unsigned long,
ip_rt_gc_elasticity,
(avg + 4*sd) >> FRACT_BITS);
}
rover = i;
}

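As in synchronous collection, the function first computes a goal, which here bounds the outer loop (the number of buckets to scan). The inner loop is also very similar to synchronous collection, and the scan again resumes from the hash chain recorded in rover, for fairness and efficiency: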

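unsigned int i = rover;
i = (i + 1) & rt_hash_mask;
rthp = &rt_hash_table[i].chain;
while ((rth = *rthp) != NULL) {
/* Prefetch the next entry, as an optimization */
prefetch(rth->u.dst.rt_next);
/* Check whether the entry has expired and release it if so */
if (rt_is_expired(rth)) {
*rthp = rth->u.dst.rt_next;
rt_free(rth);
continue;
}
/* This if ... else if means: if the entry's expiry time has passed, or rt_may_expire says it qualifies for deletion, treat it as aged and delete it.
* Otherwise jump to the nofree label, halve tmo, and continue the loop.
*/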
if (rth->u.dst.expires) {
/* Entry is expired even if it is in use */
if (time_before_eq(jiffies, rth->u.dst.expires)) {
nofree:
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
/*
* We only count entries on
* a chain with equal hash inputs once
* so that entries for different QOS
* levels, and other non-hash input
* attributes don't unfairly skew
* the length computation
*/
for (aux = rt_hash_table[i].chain;;) {
if (aux == rth) {
length += ONE;
break;
}
if (compare_hash_inputs(&aux->fl, &rth->fl))
break;
aux = aux->u.dst.rt_next;
}
continue;
}
} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout))
goto nofree;
/* Cleanup aged off entries. */
*rthp = rth->u.dst.rt_next;
rt_free(rth);
}
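How many buckets one asynchronous run scans follows from the first lines of rt_check_expire. The function below is a user-space sketch of that calculation with a hypothetical name (demo_buckets_to_scan); unlike the kernel code it clamps in 64-bit before narrowing the value, which gives the same result for realistic inputs.

```c
#include <stdint.h>

/* Number of buckets to scan after `delta` jiffies. The pace is chosen
 * so that the whole table (2^hash_log buckets) is covered roughly once
 * per gc_timeout jiffies. */
unsigned int demo_buckets_to_scan(unsigned long delta, unsigned int hash_log,
                                  unsigned long gc_timeout)
{
    uint64_t buckets = (uint64_t)1 << hash_log;
    uint64_t mult = (uint64_t)delta << hash_log;

    if (gc_timeout > 1)
        mult /= gc_timeout;
    if (mult > buckets)
        mult = buckets;         /* never more than one full pass */
    return (unsigned int)mult;
}
```

With 1024 buckets and a timeout of 300 jiffies, a run that happens 60 jiffies after the previous one scans 204 buckets, and a run after 300 jiffies or more scans the whole table.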

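Postscript

Analyzing the cache did not solve my problem, because that involves how routing table lookups are handled. Judging from how ip_route_input searches the route cache, though, it has no obvious support for multipath. From the routing table lookup and the way its result is processed, multipath caching falls into two cases:
1. If the kernel is built with multipath cache support, a cache entry is generated for each path at the same time. But because ip_route_input does not support this, lookups always hit the first entry. This is my case: the traffic only goes out one way.

2. If the kernel is built without it, cache entries are generated with a randomized, fair algorithm, so traffic can be shared. The sharing is per flow, though, not per packet.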

Reposted from: http://bbs.chinaunix.net/thread-1919577-1-1.html