ROUTING OUTPUT PACKETS

maimang09

Original link: https://www.wisdomjobs.com/e-university/linux-tutorial-277/routing-output-packets-1112.html

Source: Routing Output Packets in Linux Tutorial (13 April 2022), Wisdom Jobs India, https://www.wisdomjobs.com/e-university/linux-tutorial-277/routing-output-packets-1112.html

In this section, we discuss how IP chooses a route for packets before they are transmitted. If the destination host is directly connected to the sending machine, the destination address can be converted to a link layer address with ARP or another address resolution protocol and the packet can be sent on its way. This seems to be a fairly simple process. However, if the destination is not directly reachable, things are not as simple. The IP protocol must decide how to route the packet, which means that it must choose an output interface if there is more than one network interface device on the sending machine. In addition, it must choose the gateway or next hop that will receive the packet.

Before UDP sends a datagram, it requests a route to the destination address. TCP, however, has already established the route to the destination by the time it sends a packet. TCP requests the route for output packets when the connect socket call is issued by the application. This is because TCP transmission uses the cached route for all segments sent through an open socket once it is active.

Whether TCP or UDP packets are being transmitted, ip_route_connect, defined in file linux/include/net/route.h, is the main function for resolving routes for output packets.

As with the input route resolving functions, the flowi structure, examined earlier, is initialized with the information needed to match a route in the route cache. It is also used to search the FIB if no route can be matched in the cache.

struct flowi fl = { .oif = oif,
.nl_u = { .ip4_u = { .daddr = dst,
.saddr = src,
.tos = tos } } ,
.proto = protocol,
.uli_u = { .ports =
{ .sport = sport,
.dport = dport } } } ;
int err;
if (!dst || !src) {

If either the source or the destination address is missing, we first call the fast path routing function. We then copy the source and destination addresses from the route it returns into the flowi structure and release that route.

err = __ip_route_output_key(rp, &fl);
if (err)
return err;
fl.fl4_dst = (*rp)->rt_dst;
fl.fl4_src = (*rp)->rt_src;
ip_rt_put(*rp);
*rp = NULL;
}

If both the source and destination addresses are defined, we pass in a pointer to the sock in sk.

return ip_route_output_flow(rp, &fl, sk, 0); }
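The two-step logic of ip_route_connect can be pictured with a small user-space sketch. Everything here is hypothetical: toy_flow, toy_route, toy_route_lookup, and the fixed addresses it hands out are stand-ins for illustration, not kernel code. The sketch only shows the shape of the control flow: a preliminary lookup fills in whichever address is missing, and a final lookup uses the completed key.

```c
#include <stdint.h>

struct toy_flow  { uint32_t dst, src; };        /* 0 means "not specified" */
struct toy_route { uint32_t rt_dst, rt_src; };

/* Hypothetical stand-in for a route lookup. It supplies fixed addresses for
 * missing fields and pretends that 0xFFFFFFFE has no route. */
static int toy_route_lookup(const struct toy_flow *fl, struct toy_route *rt)
{
    if (fl->dst == 0xFFFFFFFEu)
        return -1;
    rt->rt_dst = fl->dst ? fl->dst : 0x7F000001u;   /* 127.0.0.1 */
    rt->rt_src = fl->src ? fl->src : 0x0A000001u;   /* 10.0.0.1  */
    return 0;
}

/* First resolve missing addresses, then do the final lookup. */
static int toy_route_connect(struct toy_flow *fl, struct toy_route *rt)
{
    if (!fl->dst || !fl->src) {
        struct toy_route tmp;
        int err = toy_route_lookup(fl, &tmp);
        if (err)
            return err;
        fl->dst = tmp.rt_dst;   /* complete the key from the first result */
        fl->src = tmp.rt_src;
    }
    return toy_route_lookup(fl, rt);
}
```

Calling toy_route_connect with only a destination set leaves the flow's source filled in, which mirrors how the real function updates fl before the second lookup.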

The function ip_route_output_flow can automatically transform the route if the protocol value in the flowi structure is set to NAT or some other nonzero number.

int ip_route_output_flow(struct rtable **rp,
 struct flowi *flp, struct sock
*sk, int flags)
{
int err;

We call the fast route resolving function here.

if ((err = __ip_route_output_key(rp, flp)) != 0) return err;

If we found a route, and the proto info in the flowi was not zero, we try to transform the route.

return flp->proto ? xfrm_lookup ((struct dst_entry**)rp, flp, sk, flags) : 0; }

__ip_route_output_key is the function that does the fast path output routing. First, it tries to find a matching route in the route cache. If it can’t find the route in the cache, it tries to find the route by searching the FIB. This function returns zero if the route is found and a nonzero value if it is not. If the search is successful, a pointer to the route cache entry, rtable, is placed in the parameter, rp.

int __ip_route_output_key(struct rtable **rp, const struct flowi *flp) {

We use a simple 32-bit hash for a first-level search of the hash table. Once a slot in the hash table, rt_hash_table, is identified, we enter an RCU read-side critical section with rcu_read_lock and try for an exact match of the routes chained at that slot.

unsigned hash;
struct rtable *rth;
hash = rt_hash_code(flp->fl4_dst,  flp->fl4_src ^(flp->oif << 5),
flp->fl4_tos);
rcu_read_lock();

This is a second-level search done with the information from the flowi structure pointed to by the argument, flp. In this search, we try to find an exact match for the route. In many cases, there will be only one rtable entry at a hash slot.

for (rth = rt_hash_table[hash].chain; rth; rth = rth->u.rt_next) {
smp_read_barrier_depends();
if (rth->fl.fl4_dst == flp->fl4_dst  &&
rth->fl.fl4_src == flp->fl4_src  &&
rth->fl.iif == 0 &&
rth->fl.oif == flp->oif &&
#ifdef CONFIG_IP_ROUTE_FWMARK
rth->fl.fl4_fwmark == flp->fl4_fwmark  &&
#endif

Let’s look at this part of the if statement closely. The bit definitions of the fl4_tos field in the flowi structure are similar to those of the ToS field of the IP header. Earlier in this chapter, we examined the flowi structure. In the fl4_tos field, RTO_ONLINK is bit zero, which is not defined in the "real" IP packet ToS field. Linux uses it here to identify a route to a directly connected host (reachable via link layer transmission). Generally, when we are called from ARP, tos is set to RTO_ONLINK to request a "route" to a directly connected host. Because each route cache entry is built on a destination cache entry, the search of the route cache returns a destination cache entry pointing to the attached host.

!((rth->fl.fl4_tos ^ flp->fl4_tos) & (IPTOS_RT_MASK | RTO_ONLINK))) {

If the key matches, we set lastuse to the current time to indicate when this destination entry was last used. In addition, the reference count and the use count are incremented. We return a pointer to the matching route cache entry in rp and exit from the function.

rth->u.dst.lastuse = jiffies;
dst_hold(&rth->u.dst);
rth->u.dst.__use++;
RT_CACHE_STAT_INC(out_hit);
rcu_read_unlock();
*rp = rth;
return 0;
}
RT_CACHE_STAT_INC(out_hlist_search);
}

If we didn’t find a match in the route cache, we leave the RCU read-side critical section and call the slow output route resolving function, ip_route_output_slow, with flp to search the FIB.

rcu_read_unlock(); return ip_route_output_slow(rp, flp); }
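The two-level search described above (hash to a slot, then walk the chain for an exact key match) can be sketched in user space as follows. This is a toy: the table size, the toy_ names, and the hash function are assumptions made for illustration and are not the kernel's rt_hash_code or rt_hash_table. Locking and reference counting are left out. The ToS comparison uses the same XOR-and-mask idiom as the if statement above, with 0x1D standing in for IPTOS_RT_MASK | RTO_ONLINK.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_SIZE 16u     /* assumed table size, a power of two */
#define TOY_TOS_MASK  0x1Du   /* stands in for IPTOS_RT_MASK | RTO_ONLINK */

struct toy_key { uint32_t dst, src; int oif; uint8_t tos; };

struct toy_route {
    struct toy_key key;
    unsigned long use;        /* hit counter, in the spirit of __use */
    struct toy_route *next;   /* collision chain within one slot */
};

static struct toy_route *toy_table[TOY_HASH_SIZE];

/* First level: fold the key into a slot index. Only the ToS bits that take
 * part in matching are hashed, so equal keys always land in the same slot. */
static unsigned toy_hash(const struct toy_key *k)
{
    uint32_t h = k->dst ^ k->src ^ ((uint32_t)k->oif << 5)
               ^ (k->tos & TOY_TOS_MASK);
    h ^= h >> 16;
    h ^= h >> 8;
    return h & (TOY_HASH_SIZE - 1);
}

/* Second level: exact comparison; ToS is compared on the masked bits only. */
static int toy_key_match(const struct toy_key *a, const struct toy_key *b)
{
    return a->dst == b->dst && a->src == b->src && a->oif == b->oif &&
           !((a->tos ^ b->tos) & TOY_TOS_MASK);
}

static void toy_cache_insert(struct toy_route *rt)
{
    unsigned slot = toy_hash(&rt->key);
    rt->next = toy_table[slot];
    toy_table[slot] = rt;
}

/* Returns the cached route, or NULL on a miss (the caller would then fall
 * back to the slow path). */
static struct toy_route *toy_cache_lookup(const struct toy_key *k)
{
    for (struct toy_route *rt = toy_table[toy_hash(k)]; rt; rt = rt->next) {
        if (toy_key_match(&rt->key, k)) {
            rt->use++;
            return rt;
        }
    }
    return NULL;
}
```

A key that differs from a cached one only in ToS bits outside the mask still hits, while a different oif misses.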

The Main Output Route Resolving Function

If the fast path routing function can’t match the route in the route cache, it calls the function ip_route_output_slow to search the FIB.

int ip_route_output_slow(struct rtable **rp, const struct flowi *oldflp) {

This variable, tos, is built from the tos field in the flowi structure pointed to by the parameter oldflp. It mostly contains the IP ToS field bits, which are used as part of the criteria to determine the route. However, tos also includes another bit that is not part of the IP header ToS field. When set in the fl4_tos field of flowi, this bit, RTO_ONLINK, defined as bit zero, indicates that the route is to a directly connected host.

u32 tos = oldflp->fl4_tos & (IPTOS_RT_MASK | RTO_ONLINK);
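As a small illustration of how the combined value is later taken apart, the sketch below splits it into a routing ToS and a scope. The numeric values are written from memory of 2.6-era headers (IPTOS_RT_MASK as 0x1C, RTO_ONLINK as 0x01, RT_SCOPE_LINK as 253), so treat them as assumptions and check them against your own tree. The struct tos_scope and split_tos names are hypothetical.

```c
#include <stdint.h>

/* Assumed values; verify against the kernel headers you are reading. */
#define IPTOS_RT_MASK     0x1Cu   /* ToS bits used for routing */
#define RTO_ONLINK        0x01u   /* bit zero: destination is on-link */
#define RT_SCOPE_UNIVERSE 0
#define RT_SCOPE_LINK     253

struct tos_scope {
    uint8_t tos;    /* routing ToS with the on-link bit removed */
    int scope;      /* RT_SCOPE_LINK or RT_SCOPE_UNIVERSE */
};

/* Keep only the bits that matter for routing, then separate the on-link
 * bit from the real ToS bits. */
static struct tos_scope split_tos(uint32_t fl4_tos)
{
    uint32_t t = fl4_tos & (IPTOS_RT_MASK | RTO_ONLINK);
    struct tos_scope r;

    r.tos = (uint8_t)(t & IPTOS_RT_MASK);
    r.scope = (t & RTO_ONLINK) ? RT_SCOPE_LINK : RT_SCOPE_UNIVERSE;
    return r;
}
```

With these values, split_tos(0x11) gives a ToS of 0x10 with link scope, because bit zero is set.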

In this function, we create a new flowi structure from the input flowi pointed to by oldflp. Most of the fields are copied from oldflp, but a few are calculated.

struct flowi fl = { .nl_u = { .ip4_u = { .daddr = oldflp->fl4_dst, .saddr = oldflp->fl4_src,

In the tos field in the IPv4 part of the flowi structure, RTO_ONLINK is bit zero, which is not defined in the "real" IP packet ToS field. Linux uses it to identify a route to a directly connected host (reachable via link layer transmission). Here the bit is masked out of the new tos field and instead selects the scope: RT_SCOPE_LINK if it is set, RT_SCOPE_UNIVERSE otherwise.

.tos = tos & IPTOS_RT_MASK,
.scope = ((tos & RTO_ONLINK) ?
RT_SCOPE_LINK :
RT_SCOPE_UNIVERSE),

The next field, fwmark, is for firewall marks. It is used for traffic shaping if firewall marks are configured into the Linux kernel.

#ifdef CONFIG_IP_ROUTE_FWMARK
.fwmark = oldflp->fl4_fwmark
#endif
} } ,
.iif = loopback_dev.ifindex,
.oif = oldflp->oif } ;

The structure fib_result holds the result of a search of the FIB.

struct fib_result res; unsigned flags = 0; struct rtable *rth;

We will use dev_out as a pointer to the output network interface device associated with the route, and in_dev as a pointer to the in_device structure that holds the IPv4 information for that interface.

struct net_device *dev_out = NULL;
struct in_device *in_dev = NULL;
unsigned hash;
int free_res = 0;
int err;
res.fi = NULL;
#ifdef CONFIG_IP_MULTIPLE_TABLES
res.r = NULL;
#endif

We do some basic checking before looking up the route.

if (oldflp->fl4_src) {

If the source address is specified by the caller, we check to see if it is a Martian address.

err = -EINVAL;
if (MULTICAST(oldflp->fl4_src) ||
BADCLASS(oldflp->fl4_src) ||
ZERONET(oldflp->fl4_src))
goto out;
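The Martian source test above amounts to three prefix comparisons. The sketch below writes them as plain functions on addresses in host byte order, which is an assumption made for readability; the function names are hypothetical and only mirror the intent of the MULTICAST, BADCLASS, and ZERONET macros.

```c
#include <stdbool.h>
#include <stdint.h>

/* Addresses are in host byte order in this sketch. */
static bool is_multicast(uint32_t a) { return (a & 0xF0000000u) == 0xE0000000u; } /* 224.0.0.0/4 */
static bool is_badclass(uint32_t a)  { return (a & 0xF0000000u) == 0xF0000000u; } /* 240.0.0.0/4 */
static bool is_zeronet(uint32_t a)   { return (a & 0xFF000000u) == 0x00000000u; } /* 0.0.0.0/8   */

/* A source address in any of these ranges can never be a valid sender. */
static bool source_is_martian(uint32_t src)
{
    return is_multicast(src) || is_badclass(src) || is_zeronet(src);
}
```

In the function above, the test is only reached when a source address was supplied, so the all-zero address does not need separate handling there.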

In this test, we check to see if the source address is one of our local addresses. If the address is assigned to an interface, ip_dev_find returns the network interface device that has that address. This check is functionally similar to calling inet_addr_type with source address as an argument to see if the address is in the local FIB table.

dev_out = ip_dev_find(oldflp->fl4_src); if (dev_out == NULL) goto out;

Comments in the code say that a check was removed that compared the output interface, oif, with the device returned by ip_dev_find. That check would have been incorrect because the source address may be assigned to a device other than the one from which the packets are sent.

if (oldflp->oif == 0 && (MULTICAST(oldflp->fl4_dst) || oldflp->fl4_dst == 0xFFFFFFFF)) {

Comments in the code say this is a special hack, so applications can send packets to multicast and broadcast addresses without specifying the IP_PKTINFO IP option. This facilitates applications such as VIC and VAT for video and audio conferencing. We need this hack because these applications bind the socket to the loopback address, set the multicast TTL to zero, and send the packets out without doing a join group or specifying the outgoing multicast interface.

fl.oif = dev_out->ifindex;

Now we are ready to go to the label make_route to enter the new route in cache.

goto make_route; }

This is just a cleanup. We decrement the use count of the device we used temporarily, dev_out.

if (dev_out) dev_put(dev_out); dev_out = NULL; }

If the output interface, oif, is specified by the caller, we get the net_device pointer and make sure that it references an in_device structure that contains the AF_INET type address information.

if (oldflp->oif) {
dev_out = dev_get_by_index(oldflp->oif);
err = -ENODEV;
if (dev_out == NULL)
goto out;
if (__in_dev_get(dev_out) == NULL) {
dev_put(dev_out);
goto out;
}

Now we check to see if the destination is a local multicast address or the limited broadcast address, and if so and no source address was given, we get the source address by calling inet_select_addr. This source address will be put in the route information so it can be used later as the source address of outgoing packets sent to the multicast or broadcast address in oldflp->fl4_dst. Next, since there is no need to do a FIB lookup, we go to make_route to enter the new route in the route cache.

if (LOCAL_MCAST(oldflp->fl4_dst) ||  oldflp->fl4_dst ==
0xFFFFFFFF) {
if (!fl.fl4_src)
fl.fl4_src = inet_select_addr(dev_out, 0,
RT_SCOPE_LINK);
goto make_route;
}
if (!fl.fl4_src) {
if (MULTICAST(oldflp->fl4_dst))
fl.fl4_src = inet_select_addr(dev_out, 0,  fl.fl4_scope);
else if (!oldflp->fl4_dst)
fl.fl4_src = inet_select_addr(dev_out, 0,
RT_SCOPE_HOST);
}
}

If the destination address in the key is zero, we assume that we are looking for a local (internal) destination. We set the "output" network device to the loopback device, set the route type to RTN_LOCAL, and set the RTCF_LOCAL flag. Therefore, packets using this route will be looped back; they will be sent back up the IP stack. There is no need to do the FIB lookup for this key, so we go straight to make_route, which enters this local route in the route cache.

if (!fl.fl4_dst) {
fl.fl4_dst = fl.fl4_src;
if (!fl.fl4_dst)
fl.fl4_dst = fl.fl4_src =htonl(INADDR_LOOPBACK);
if (dev_out)
dev_put(dev_out);
dev_out = &loopback_dev;
dev_hold(dev_out);
fl.oif = loopback_dev.ifindex;
res.type = RTN_LOCAL;
flags |= RTCF_LOCAL;
goto make_route;
}

Now, we call fib_lookup to get the route for fl. If it finds a route, it returns a zero and puts the result in res.

if (fib_lookup(&fl, &res)) {

The FIB lookup has failed to find a route.

res.fi = NULL; if (oldflp->oif) {

We are here because the FIB lookup failed even though an output interface was specified in the lookup key, oldflp. A comment in the code states that the routing tables must be wrong if the lookup failed even though an output device was specified. We are allowed to send packets out an interface even when there are no routes specifying the interface and no addresses assigned to the interface. Therefore, we assume that the destination is directly connected to the output interface oif even though there was no route in the FIB. If the output interface, oif, is specified, the route lookup is only for checking to see whether the final destination is directly connected or reachable only through a gateway.

if (fl.fl4_src == 0)

If the source address wasn’t specified, we must pick one and put it in the key. In either case, we then set the route type to unicast and jump to make_route to enter the route into the cache.

fl.fl4_src = inet_select_addr(dev_out, 0, RT_SCOPE_LINK);
res.type = RTN_UNICAST;
goto make_route;
}

We arrived here because we couldn’t find a route, so we set the error to unreachable and get out.

if (dev_out)
dev_put(dev_out);
err = -ENETUNREACH;
goto out;
}
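To make the success and failure cases of the lookup concrete, here is a minimal longest-prefix-match sketch over a flat table. It is only a mental model: the kernel's FIB is organized quite differently, and toy_fib_entry, toy_mask, and toy_fib_lookup are hypothetical names. The return convention follows the text: zero when a route is found, nonzero when none matches. A gateway of zero marks a directly connected destination.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_fib_entry {
    uint32_t prefix;       /* network address, host byte order */
    unsigned prefix_len;   /* 0..32; 0 means the default route */
    uint32_t gateway;      /* 0 means directly connected */
    int oif;               /* output interface index */
};

/* Netmask for a prefix length. A shift by 32 is undefined in C, so a
 * length of 0 is handled separately. Precondition: len <= 32. */
static uint32_t toy_mask(unsigned len)
{
    return len == 0 ? 0 : 0xFFFFFFFFu << (32 - len);
}

/* Returns 0 and copies the best entry to *res, or -1 if nothing matches. */
static int toy_fib_lookup(const struct toy_fib_entry *tbl, size_t n,
                          uint32_t dst, struct toy_fib_entry *res)
{
    const struct toy_fib_entry *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if ((dst & toy_mask(tbl[i].prefix_len)) != tbl[i].prefix)
            continue;                       /* prefix does not cover dst */
        if (!best || tbl[i].prefix_len > best->prefix_len)
            best = &tbl[i];                 /* keep the most specific match */
    }
    if (!best)
        return -1;
    *res = *best;
    return 0;
}
```

With a table holding a default route, 192.168.1.0/24 (connected), and 10.0.0.0/8, a lookup for 192.168.1.66 selects the /24 entry, while 8.8.8.8 falls through to the default route. Without a default route, the second lookup fails.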

This section of the code is executed if we know that the FIB lookup has given us a route. We do some checks of the route type in the type field before we add the route to the cache.

free_res = 1;

Here we check whether the route we got from the FIB is a NAT route. This shouldn’t happen, so we get out.

if (res.type == RTN_NAT) goto e_inval;

We got here because the FIB lookup gave us a “local” route, which means that the destination address in the key was one of our own addresses. We set the output “device” to the loopback device, and the route cache flag, flags, to RTCF_LOCAL indicating that the route is a local route. Next, we go to make_route to enter the route in cache.

if (res.type == RTN_LOCAL) {
if (!fl.fl4_src)
fl.fl4_src = fl.fl4_dst;
if (dev_out)
dev_put(dev_out);
dev_out = &loopback_dev;
dev_hold(dev_out);
fl.oif = dev_out->ifindex;
if (res.fi)
fib_info_put(res.fi);
res.fi = NULL;
flags |= RTCF_LOCAL;
goto make_route;
}

Multipath routing is a kernel option that allows more than one routing path to be defined to the same destination. If the option is configured, we check the fib_nhs field in the fib_info structure to see if it is greater than one. This lets us know that there is more than one "next hop" for the same destination.

#ifdef CONFIG_IP_ROUTE_MULTIPATH
if (res.fi->fib_nhs > 1 && fl.oif == 0)
fib_select_multipath(&fl, &res);
else
#endif

Here, we check to see if we got a fib_result with no netmask and no output device. With neither of these things, we get the default route by calling fib_select_default. The field prefixlen specifies the number of bits to use for the netmask, and if it is zero, there is no netmask. If the output device was specified in the fib_result we would know we are trying to reach a directly connected host, and if the netmask was specified, we would know we have a route to a gateway.

if (!res.prefixlen && res.type ==RTN_UNICAST && !fl.oif)
fib_select_default(&fl, &res);
if (!fl.fl4_src)
fl.fl4_src = FIB_RES_PREFSRC(res);
if (dev_out)
dev_put(dev_out);
dev_out = FIB_RES_DEV(res);
dev_hold(dev_out);
fl.oif = dev_out->ifindex;
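One way to picture choosing among several next hops is a weighted pick. The sketch below is not the algorithm used by fib_select_multipath; it is a simple deterministic stand-in, and toy_nexthop, its weight field, and the ticket argument (any number, for instance a flow hash) are all assumptions introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_nexthop {
    uint32_t gateway;   /* next-hop address, host byte order */
    unsigned weight;    /* relative share of traffic; 0 disables the hop */
};

/* Returns the index of the chosen next hop, or -1 if every weight is 0.
 * Each hop owns a contiguous range of ticket values as wide as its weight. */
static int toy_select_multipath(const struct toy_nexthop *nh, size_t n,
                                unsigned ticket)
{
    unsigned total = 0;

    for (size_t i = 0; i < n; i++)
        total += nh[i].weight;
    if (total == 0)
        return -1;

    ticket %= total;
    for (size_t i = 0; i < n; i++) {
        if (ticket < nh[i].weight)
            return (int)i;
        ticket -= nh[i].weight;
    }
    return -1;  /* not reached when total > 0 */
}
```

With weights 3 and 1, tickets 0 to 2 select the first hop and ticket 3 selects the second, so the first hop carries about three quarters of the flows.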

At this point, we are done with the FIB lookup. We have a route so we enter it in the route cache after making a few checks for broadcast, multicast, and just plain bad destination addresses. It may seem that some of these checks are redundant at this point, but we may have come to this label by bypassing the FIB lookup.

make_route:

if (LOOPBACK(fl.fl4_src)&&!(dev_out->flags&IFF_LOOPBACK))

If the source address is the loopback address, the output device must also be a loopback device.

goto e_inval;

Here we set the route type to RTN_BROADCAST or RTN_MULTICAST depending on the destination address.

if (fl.fl4_dst == 0xFFFFFFFF)
res.type = RTN_BROADCAST;
else if (MULTICAST(fl.fl4_dst))
res.type = RTN_MULTICAST;

If the destination address is Martian, we get out.

else if (BADCLASS(fl.fl4_dst) || ZERONET(fl.fl4_dst))
goto e_inval;

If the output network interface is the loopback device, we must have a local route, so we set the route control flags to RTCF_LOCAL.

if (dev_out->flags & IFF_LOOPBACK) flags |= RTCF_LOCAL;
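The order of the destination checks at make_route matters, which a small sketch makes visible. The enum and function names below are hypothetical, and addresses are assumed to be in host byte order. The limited broadcast address 255.255.255.255 also falls inside 240.0.0.0/4, so it has to be recognized before the bad-class test or it would be rejected.

```c
#include <stdint.h>

enum toy_rtype { TOY_UNICAST, TOY_BROADCAST, TOY_MULTICAST, TOY_INVALID };

/* Classify a destination; from_fib is the type found so far and is kept
 * when none of the special cases applies. */
static enum toy_rtype toy_classify_dst(uint32_t dst, enum toy_rtype from_fib)
{
    if (dst == 0xFFFFFFFFu)                       /* limited broadcast first */
        return TOY_BROADCAST;
    if ((dst & 0xF0000000u) == 0xE0000000u)       /* 224.0.0.0/4 */
        return TOY_MULTICAST;
    if ((dst & 0xF0000000u) == 0xF0000000u ||     /* 240.0.0.0/4 */
        (dst & 0xFF000000u) == 0)                 /* 0.0.0.0/8   */
        return TOY_INVALID;
    return from_fib;
}
```

TOY_INVALID corresponds to the jump to e_inval in the code above.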

Next, we get the in_device structure for the output network interface because it contains the list of multicast addresses for the output device.

in_dev = in_dev_get(dev_out); if (!in_dev) goto e_inval;

We don’t need fib_info for either broadcast or local routes, so fi is set to NULL and fib_info is freed.

if (res.type == RTN_BROADCAST) {
flags |= RTCF_BROADCAST | RTCF_LOCAL;
if (res.fi) {
fib_info_put(res.fi);
res.fi = NULL;
}

There are a few additional things we must do for multicast routes.

} else if(res.type == RTN_MULTICAST) {

First, we initialize the route cache flags to indicate both multicast and local because we have a multicast route. Before entering the route in cache, we decide whether we will loop back packets sent via this multicast route. We do this by checking to see if the destination address is in the list of multicast groups for our output interface. If not, we reset the RTCF_LOCAL so multicast packets sent via this route won’t be looped back.

flags |= RTCF_MULTICAST|RTCF_LOCAL;
read_lock(&inetdev_lock);
if (!ip_check_mc(in_dev, oldflp->fl4_dst, oldflp->fl4_src,
oldflp->proto))
flags &= ~RTCF_LOCAL;
read_unlock(&inetdev_lock);

Here, a comment in the code says that this is a hack. If the multicast route does not exist, use the default multicast route but do not send the packet to a gateway.

if (res.fi && res.prefixlen < 4) {
fib_info_put(res.fi);
res.fi = NULL;
}
}
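The loopback decision for multicast routes can be sketched as a membership test followed by a flag update: start with both flags set, and clear the local flag when the output interface has not joined the destination group. The flag values and the toy_ names are hypothetical, and the group list is a plain array rather than the kernel's per-device multicast list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RTCF_MULTICAST 0x1u   /* hypothetical flag values */
#define TOY_RTCF_LOCAL     0x2u

/* True if the interface has joined the given multicast group. */
static bool toy_is_member(const uint32_t *groups, size_t n, uint32_t group)
{
    for (size_t i = 0; i < n; i++)
        if (groups[i] == group)
            return true;
    return false;
}

/* Compute the route flags for a multicast destination. */
static unsigned toy_mcast_flags(const uint32_t *groups, size_t n, uint32_t dst)
{
    unsigned flags = TOY_RTCF_MULTICAST | TOY_RTCF_LOCAL;

    if (!toy_is_member(groups, n, dst))
        flags &= ~TOY_RTCF_LOCAL;   /* not a member: do not loop back */
    return flags;
}
```

For an interface that has joined 224.0.0.251, a route to that group keeps the local flag, while a route to any other group ends up with only the multicast flag.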

We call dst_alloc to allocate a route cache entry from the generic destination cache. We set the reference count in the cache entry to one. Next, we set fields in the new route cache entry including the flowi values.

rth = dst_alloc(&ipv4_dst_ops);
if (!rth)
goto e_nobufs;
atomic_set(&rth->u.dst.__refcnt, 1);
rth->u.dst.flags= DST_HOST;
if (in_dev->cnf.no_xfrm)
rth->u.dst.flags |= DST_NOXFRM;
if (in_dev->cnf.no_policy)
rth->u.dst.flags |= DST_NOPOLICY;
rth->fl.fl4_dst = oldflp->fl4_dst;
rth->fl.fl4_tos = tos;
rth->fl.fl4_src = oldflp->fl4_src;
rth->fl.oif = oldflp->oif;
#ifdef CONFIG_IP_ROUTE_FWMARK
rth->fl.fl4_fwmark= oldflp->fl4_fwmark;
#endif
rth->rt_dst = fl.fl4_dst;
rth->rt_src = fl.fl4_src;
#ifdef CONFIG_IP_ROUTE_NAT
rth->rt_dst_map = fl.fl4_dst;
rth->rt_src_map = fl.fl4_src;
#endif

We set the input interface for the route. Next, we set the output device specified in the destination cache entry. Later, when packets are transmitted via this route, the output interface device in dev_out can be accessed quickly via the destination cache entry through the dst field of the socket buffer.

rth->rt_iif = oldflp->oif ? :dev_out->ifindex;
rth->u.dst.dev = dev_out;
dev_hold(dev_out);
rth->rt_gateway = fl.fl4_dst;
rth->rt_spec_dst= fl.fl4_src;

Since this is an output route, ip_output is the output function for this route. When a transmit protocol sends the packet, it will call ip_output through the output field of the destination cache entry. We increment the statistics for the number of slow routes.

rth->u.dst.output=ip_output; RT_CACHE_STAT_INC(out_slow_tot);

If the route cache flag indicates that this is a local route, the input function is set. Later, when the IP receive function gets a packet via this route, it will call ip_local_deliver through the destination cache entry’s input field.

if (flags & RTCF_LOCAL) {
rth->u.dst.input = ip_local_deliver;
rth->rt_spec_dst = fl.fl4_dst;
}

If the flags indicate that this is a broadcast or multicast route, we set the route’s specific destination address, rt_spec_dst, to the source address.

if (flags & (RTCF_BROADCAST | RTCF_MULTICAST)) { rth->rt_spec_dst = fl.fl4_src;

If the route is also flagged as local and the output device is not the loopback device, the output function pointer in the destination cache entry is set to point to the multicast output function, ip_mc_output.

if (flags & RTCF_LOCAL &&!(dev_out->flags & IFF_LOOPBACK)) {
rth->u.dst.output = ip_mc_output;
RT_CACHE_STAT_INC(out_slow_mc);
}
#ifdef CONFIG_IP_MROUTE
if (res.type == RTN_MULTICAST) {
if (IN_DEV_MFORWARD(in_dev) &&
!LOCAL_MCAST(oldflp->fl4_dst)) {
rth->u.dst.input = ip_mr_input;
rth->u.dst.output = ip_mc_output;
}
}
#endif
}
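The way the input and output handlers are chosen from the route flags can be sketched with plain function pointers. The structure, the flag values, and the handler bodies below are hypothetical stand-ins; only the selection logic follows the text, and the CONFIG_IP_MROUTE branch is left out. Setting input to NULL for non-local routes is a simplification of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_RTCF_LOCAL     0x1u   /* hypothetical flag values */
#define TOY_RTCF_BROADCAST 0x2u
#define TOY_RTCF_MULTICAST 0x4u

struct toy_dst {
    int (*input)(const char *pkt);
    int (*output)(const char *pkt);
};

/* Stand-in handlers; each returns a distinct code so calls can be told apart. */
static int toy_ip_output(const char *pkt)        { (void)pkt; return 1; }
static int toy_ip_mc_output(const char *pkt)     { (void)pkt; return 2; }
static int toy_ip_local_deliver(const char *pkt) { (void)pkt; return 3; }

static void toy_set_handlers(struct toy_dst *d, unsigned flags,
                             bool dev_is_loopback)
{
    d->input = NULL;
    d->output = toy_ip_output;              /* default for output routes */

    if (flags & TOY_RTCF_LOCAL)
        d->input = toy_ip_local_deliver;    /* packets may come back up */

    if ((flags & (TOY_RTCF_BROADCAST | TOY_RTCF_MULTICAST)) &&
        (flags & TOY_RTCF_LOCAL) && !dev_is_loopback)
        d->output = toy_ip_mc_output;       /* local copy needed as well */
}
```

A sender then only calls d->output(pkt) and never needs to know which kind of route it holds.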

We call rt_set_nexthop to fill in the gateway and related next-hop information in the route cache entry. This information comes from the fib_info structure attached to res. Then, we set the route cache entry’s rt_flags field.

rt_set_nexthop(rth, &res, 0); rth->rt_flags = flags;

Next, we calculate the hash code for the new route cache entry and enter rth into the route cache by calling rt_intern_hash.

hash = rt_hash_code(oldflp->fl4_dst, oldflp->fl4_src ^
(oldflp->oif << 5), tos);
err = rt_intern_hash(hash, rth, rp);
done:
if (free_res)
fib_res_put(&res);
if (dev_out)
dev_put(dev_out);
out: return err;
e_inval:
err = -EINVAL;
goto done;
e_nobufs:
err = -ENOBUFS;
goto done;
}
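The exit labels at the end of the function follow a common C cleanup pattern: every path that holds a reference leaves through done, paths that hold nothing leave through out, and the error labels only set the error code before jumping back. The sketch below reproduces that shape with a toy reference count. The names toy_dev, toy_hold, toy_put, and toy_resolve, and the two boolean inputs that simulate lookup outcomes, are hypothetical.

```c
#include <errno.h>
#include <stddef.h>

struct toy_dev { int refcnt; };

static void toy_hold(struct toy_dev *d) { d->refcnt++; }
static void toy_put(struct toy_dev *d)  { d->refcnt--; }

/* fib_ok and dst_ok simulate the outcome of the lookup and of the
 * destination checks. The device's reference count is the same on return
 * as on entry, whichever path is taken. */
static int toy_resolve(struct toy_dev *dev, int fib_ok, int dst_ok)
{
    struct toy_dev *dev_out = NULL;
    int err;

    if (!dev) {
        err = -ENODEV;
        goto out;               /* nothing held yet */
    }
    dev_out = dev;
    toy_hold(dev_out);

    if (!fib_ok) {
        err = -ENETUNREACH;
        goto done;
    }
    if (!dst_ok)
        goto e_inval;

    err = 0;
done:
    if (dev_out)
        toy_put(dev_out);       /* single place where the reference drops */
out:
    return err;
e_inval:
    err = -EINVAL;
    goto done;
}
```

Keeping the release in one place makes it hard to leak a reference when a new error path is added later.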