http://wangcong.org/blog/archives/1679
For a long time I could never quite tell the tun/tap twins apart. Today I finally sat down and studied the code, and at last it makes sense.
First of all, both are virtual devices created by an ioctl() on /dev/net/tun: one with IFF_TUN, the other with IFF_TAP. There is no better example than the code in vpnc:
int tun_open(char *dev, enum if_mode_enum mode)
{
	struct ifreq ifr;
	int fd, err;

	if ((fd = open("/dev/net/tun", O_RDWR)) < 0) {
		error(0, errno,
		      "can't open /dev/net/tun, check that it is either device char 10 200 or (with DevFS) a symlink to ../misc/net/tun (not misc/net/tun)");
		return -1;
	}

	memset(&ifr, 0, sizeof(ifr));
	ifr.ifr_flags = ((mode == IF_MODE_TUN) ? IFF_TUN : IFF_TAP) | IFF_NO_PI;
	if (*dev)
		strncpy(ifr.ifr_name, dev, IFNAMSIZ);

	if ((err = ioctl(fd, TUNSETIFF, (void *)&ifr)) < 0) {
		close(fd);
		return err;
	}
	strcpy(dev, ifr.ifr_name);
	return fd;
}
Both cases use the very same ioctl command, TUNSETIFF.
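A caller would use tun_open() roughly like this. This is just a sketch reusing the function above (IF_MODE_TAP is vpnc's enum; creating the device needs root or CAP_NET_ADMIN):

#include <net/if.h>    /* IFNAMSIZ */
#include <stdio.h>

int main(void)
{
	char dev[IFNAMSIZ] = "tap%d";         /* "%d" lets the kernel pick a free index */
	int fd = tun_open(dev, IF_MODE_TAP);  /* IF_MODE_TUN would create a tun device instead */

	if (fd < 0) {
		fprintf(stderr, "tun_open failed\n");
		return 1;
	}
	printf("created %s\n", dev);          /* e.g. "tap0" */
	return 0;
}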
But even though they come from the same mother, the two differ in a big way: tun is a point-to-point device, while tap is an ordinary Ethernet device. In other words, a tun device needs no hardware address at all! The packets it sends and receives involve no ARP and carry no link-layer header whatsoever, whereas a tap device has a full hardware address and exchanges complete Ethernet frames.
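The difference is visible directly in what read() hands you. A minimal sketch, assuming a fd opened by the tun_open() above (i.e. with IFF_NO_PI set):

#include <net/ethernet.h>   /* struct ether_header, ETH_HLEN */
#include <netinet/ip.h>     /* struct iphdr */
#include <unistd.h>

/* With IFF_NO_PI, one read() returns exactly one packet.  On a tap fd
 * the buffer starts with a full Ethernet header; on a tun fd there is
 * no link-layer header at all -- the IP header comes first. */
static void inspect_packet(int fd, int is_tap)
{
	unsigned char buf[2048];
	ssize_t n = read(fd, buf, sizeof(buf));

	if (n <= 0)
		return;

	if (is_tap) {
		struct ether_header *eth = (struct ether_header *)buf;
		struct iphdr *ip = (struct iphdr *)(buf + ETH_HLEN);
		(void)eth; (void)ip;   /* dst/src MAC, ethertype, then IP */
	} else {
		struct iphdr *ip = (struct iphdr *)buf;
		(void)ip;              /* raw IP packet, no MAC addresses */
	}
}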
Let's verify this with a real example. Note the Ethernet link encapsulation and hardware address on tap0, versus UNSPEC, POINTOPOINT and NOARP on tun0:
tap0      Link encap:Ethernet  HWaddr 0E:78:39:78:E7:A7
          inet addr:192.168.1.109  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::c78:39ff:fe78:e7a7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:21 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

tun0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:X.X.X.X  P-t-P:X.X.X.X  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1412  Metric:1
          RX packets:6 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:690 (690.0 b)  TX bytes:402 (402.0 b)

% ethtool -i tun0
driver: tun
version: 1.6
firmware-version: N/A
bus-info: tun

% ethtool -i tap0
driver: tun
version: 1.6
firmware-version: N/A
bus-info: tap
Back to the code, still vpnc, this time tunip.c. Here is what it does on the send path:
static int tun_send_ip(struct sa_block *s)
{
	int sent, len;
	uint8_t *start;

	start = s->ipsec.rx.buf;
	len = s->ipsec.rx.buflen;

	if (opt_if_mode == IF_MODE_TAP) {
#ifndef __sun__
		/*
		 * Add ethernet header before s->ipsec.rx.buf where
		 * at least ETH_HLEN bytes should be available.
		 */
		struct ether_header *eth_hdr = (struct ether_header *)(s->ipsec.rx.buf - ETH_HLEN);

		memcpy(eth_hdr->ether_dhost, s->tun_hwaddr, ETH_ALEN);
		memcpy(eth_hdr->ether_shost, s->tun_hwaddr, ETH_ALEN);
		/* Use a different MAC as source */
		eth_hdr->ether_shost[0] ^= 0x80; /* toggle some visible bit */
		eth_hdr->ether_type = htons(ETHERTYPE_IP);

		start = (uint8_t *)eth_hdr;
		len += ETH_HLEN;
#endif
	}

	sent = tun_write(s->tun_fd, start, len);
	if (sent != len)
		syslog(LOG_ERR, "truncated in: %d -> %d\n", len, sent);
	hex_dump("Tx pkt", start, len, NULL);
	return 1;
}
From this code, two things are easy to see:
1. "Sending" is simply a write() to /dev/net/tun; symmetrically, "receiving" is a read() (a note on the IFF_NO_PI flag follows this list).
2. For a tap device, sending additionally requires prepending an Ethernet header.
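The IFF_NO_PI flag set in tun_open() matters here: without it, every packet read from or written to the fd is prefixed by a 4-byte struct tun_pi (flags plus an ethertype), which is exactly what the TUN_NO_PI check in the kernel code below deals with. A sketch, assuming a tun fd opened without IFF_NO_PI:

#include <linux/if_tun.h>   /* struct tun_pi */
#include <netinet/ip.h>     /* struct iphdr */
#include <unistd.h>

/* Without IFF_NO_PI, the payload is preceded by
 * struct tun_pi { __u16 flags; __be16 proto; }. */
static void read_with_pi(int fd)
{
	unsigned char buf[2048];
	ssize_t n = read(fd, buf, sizeof(buf));

	if (n > (ssize_t)sizeof(struct tun_pi)) {
		struct tun_pi *pi = (struct tun_pi *)buf;
		struct iphdr *ip = (struct iphdr *)(buf + sizeof(struct tun_pi));
		(void)pi; (void)ip;   /* pi->proto holds the ethertype, e.g. ETH_P_IP */
	}
}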
Now let's see how the kernel handles the corresponding paths, in tun_get_user() in drivers/net/tun.c:
	switch (tun->flags & TUN_TYPE_MASK) {
	case TUN_TUN_DEV:
		if (tun->flags & TUN_NO_PI) {
			//...
		}

		skb_reset_mac_header(skb);
		skb->protocol = pi.proto;
		skb->dev = tun->dev;
		break;
	case TUN_TAP_DEV:
		skb->protocol = eth_type_trans(skb, tun->dev);
		break;
So for a tun device the kernel does no Ethernet-frame handling at all, while for a tap device it parses the Ethernet header with eth_type_trans(). At this point the whole flow is clear.
But that was only the vpnc example. In practice the real heavy users of tap devices are virtual machines such as KVM, so it is well worth seeing how KVM drives a tap device. For simplicity we will skip qemu-kvm, whose code is rather involved, and look at a simpler implementation, kvm tools.
The main code for this lives in virtio/net.c. virtio_net__tap_init() sets up the tap device when the VM starts; then two threads, virtio_net_rx_thread() and virtio_net_tx_thread(), are started to service the receive and transmit directions, turning the guest's incoming I/O operations into reads and writes on /dev/net/tun. But how do those I/O operations get here in the first place? That is the key question.
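Stripped of the virtio bookkeeping, the tx thread amounts to something like the sketch below. This is not kvm tools' actual code: pop_tx_buffers() is a hypothetical stand-in for its virt_queue helpers, which gather the guest's queued buffers into an iovec.

#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical stand-in for kvm tools' virt_queue helpers: collect the
 * guest's next queued frame into iov, return the number of segments. */
extern int pop_tx_buffers(struct iovec *iov, int max_segs);

/* Each frame the guest queues on the virtio tx ring is written to the
 * tap fd; the kernel then injects it into the network stack. */
static void tx_loop(int tap_fd)
{
	struct iovec iov[16];
	int segs;

	for (;;) {
		segs = pop_tx_buffers(iov, 16);   /* hypothetical */
		if (segs > 0)
			writev(tap_fd, iov, segs);
	}
}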
Following the thread through the code, it is not hard to find that the I/O operations are emulated by KVM. First the I/O instructions executed by the guest CPU are translated; this part lives in the kernel, in arch/x86/kvm/emulate.c::x86_emulate_insn():
	do_io_in:
		c->dst.bytes = min(c->dst.bytes, 4u);
		if (!emulator_io_permited(ctxt, ops, c->src.val, c->dst.bytes)) {
			emulate_gp(ctxt, 0);
			goto done;
		}
		if (!pio_in_emulated(ctxt, ops, c->dst.bytes, c->src.val,
				     &c->dst.val))
			goto done; /* IO is needed */
		break;
pio_in_emulated() calls emulator_pio_in_emulated(), which in turn triggers KVM_EXIT_IO:
static int emulator_pio_in_emulated(int size, unsigned short port, void *val,
				    unsigned int count, struct kvm_vcpu *vcpu)
{
	if (vcpu->arch.pio.count)
		goto data_avail;

	trace_kvm_pio(0, port, size, 1);

	vcpu->arch.pio.port = port;
	vcpu->arch.pio.in = 1;
	vcpu->arch.pio.count = count;
	vcpu->arch.pio.size = size;

	if (!kernel_pio(vcpu, vcpu->arch.pio_data)) {
	data_avail:
		memcpy(val, vcpu->arch.pio_data, size * count);
		vcpu->arch.pio.count = 0;
		return 1;
	}

	vcpu->run->exit_reason = KVM_EXIT_IO;
	vcpu->run->io.direction = KVM_EXIT_IO_IN;
	vcpu->run->io.size = size;
	vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE;
	vcpu->run->io.count = count;
	vcpu->run->io.port = port;

	return 0;
}
That is the end of the kernel part. Over in user space, the vcpu thread catches this event, in kvm-cpu.c::kvm_cpu__start():
	case KVM_EXIT_IO: {
		bool ret;

		ret = kvm__emulate_io(cpu->kvm,
				      cpu->kvm_run->io.port,
				      (u8 *)cpu->kvm_run +
				      cpu->kvm_run->io.data_offset,
				      cpu->kvm_run->io.direction,
				      cpu->kvm_run->io.size,
				      cpu->kvm_run->io.count);

		if (!ret)
			goto panic_kvm;
		break;
	}
kvm__emulate_io() then calls virtio_net_pci_io_in(), which was registered in virtio/net.c, and that is how the data flows to the tap device.