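Libpcap is the packet capture library used on Linux, and it is built on top of sockets. Its essential difference from WinPcap is that WinPcap sits at the same layer as the TCP/IP stack, whereas libpcap is a user-space library that wraps sockets once more on top of the kernel network stack. As a result, a packet coming off the NIC has to be copied several times before it reaches the application, and capture performance is poor on gigabit networks. To improve libpcap's capture performance, libpcap is patched with PF_RING: the patched libpcap receives packets from the NIC into a ring buffer, which is then mapped into the application with mmap, reducing the number of memory copies. To understand libpcap, pfring and libpfring better, I analyze their source code here. pfring is the kernel-side code, and libpfring wraps pfring for applications to call. Packets can be captured with libpfring directly, without libpcap, but since most sniffing tools today are built on libpcap, we keep the libpcap interface and use pfring underneath to change how the socket is implemented.
Put differently, WinPcap is a protocol at the same layer as TCP/IP, while libpcap is an application-layer development library. With the PF_RING patch applied, libpcap becomes somewhat similar to WinPcap: both use a ring-shaped kernel buffer whose size can be configured. Another difference is that WinPcap lets you set mintocopy (see setmintocopy_op below), the amount of data that must accumulate in the kernel buffer before it is copied to the application buffer; libpcap has no such feature. Libpcap passes data up to the upper layers driven mainly by NIC interrupts or polling.
Taking libpcap as the main thread: pcap_open_live first does the initialization, such as opening the NIC and setting up the callbacks used to read packets; after that, packets can be captured with pcap_next, pcap_next_ex, pcap_dispatch or pcap_loop. The main aim of this article is source analysis, working from the application-level libpcap and libpfring down to the kernel PF_RING module, so that we understand PF_RING in depth and see how it improves libpcap's capture performance.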
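To make the ring-buffer idea from the introduction concrete, here is a minimal single-threaded sketch. It is not PF_RING's actual memory layout: the slot count, slot size, structure names and function names are all assumptions made for illustration, and a real kernel/user ring additionally needs synchronization between producer and consumer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 4   /* hypothetical number of slots */
#define SLOT_LEN   64  /* hypothetical capacity of one slot, like a snaplen */

struct slot {
    uint32_t len;                  /* bytes stored in data */
    unsigned char data[SLOT_LEN];
};

struct ring {
    uint32_t insert_idx;  /* next slot the producer fills */
    uint32_t remove_idx;  /* next slot the consumer reads */
    uint32_t used;        /* slots currently holding a packet */
    uint64_t lost;        /* packets dropped because the ring was full */
    struct slot slots[RING_SLOTS];
};

void ring_init(struct ring *r)
{
    memset(r, 0, sizeof(*r));
}

/* Producer side: store one packet. Returns 1 if stored, 0 if dropped. */
int ring_put(struct ring *r, const void *pkt, uint32_t len)
{
    struct slot *s;

    assert(r != NULL && pkt != NULL);
    if (r->used == RING_SLOTS) {
        r->lost++;                 /* ring full: count the drop */
        return 0;
    }
    if (len > SLOT_LEN)
        len = SLOT_LEN;            /* truncate oversized packets */
    s = &r->slots[r->insert_idx];
    s->len = len;
    memcpy(s->data, pkt, len);
    r->insert_idx = (r->insert_idx + 1) % RING_SLOTS;
    r->used++;
    return 1;
}

/* Consumer side: look at the oldest packet in place, without copying it. */
const struct slot *ring_peek(const struct ring *r)
{
    return r->used ? &r->slots[r->remove_idx] : NULL;
}

/* Consumer side: hand the oldest slot back to the producer. */
void ring_release(struct ring *r)
{
    if (r->used) {
        r->remove_idx = (r->remove_idx + 1) % RING_SLOTS;
        r->used--;
    }
}
```

The point of the design is that the consumer reads packets in place through ring_peek instead of receiving a copy; when the ring is shared through mmap, that is where the saved memory copies come from.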
1) pcap_open_live
We start with the application-level libpcap, and the first function to look at is pcap_open_live. It is defined in pcap.c; the source is as follows:
pcap_t *pcap_open_live(const char *source, int snaplen, int promisc, int to_ms, char *errbuf)
{
    pcap_t *p;
    int status;
    p = pcap_create(source, errbuf);
    if (p == NULL)
        return (NULL);
    status = pcap_set_snaplen(p, snaplen);
    if (status < 0)
        goto fail;
    status = pcap_set_promisc(p, promisc);
    if (status < 0)
        goto fail;
    status = pcap_set_timeout(p, to_ms);
    if (status < 0)
        goto fail;
    p->oldstyle = 1;
    status = pcap_activate(p);
    if (status < 0)
        goto fail;
    return (p);
fail:
    if (status == PCAP_ERROR)
        snprintf(errbuf, PCAP_ERRBUF_SIZE, "%s: %s", source,
            p->errbuf);
    else if (status == PCAP_ERROR_NO_SUCH_DEVICE ||
        status == PCAP_ERROR_PERM_DENIED)
        snprintf(errbuf, PCAP_ERRBUF_SIZE, "%s: %s (%s)", source,
            pcap_statustostr(status), p->errbuf);
    else
        snprintf(errbuf, PCAP_ERRBUF_SIZE, "%s: %s", source,
            pcap_statustostr(status));
    pcap_close(p);
    return (NULL);
}
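As the source above shows, pcap_open_live first calls pcap_create, which is analyzed further below. It then calls pcap_set_snaplen to set the maximum number of bytes captured per packet. An Ethernet frame is at most 1518 bytes, and setting snaplen to 65535 is enough to capture every packet in full. Next, pcap_set_promisc sets the capture mode (1 means promiscuous mode), and pcap_set_timeout sets the read timeout in milliseconds: if the application has read no data within that time, the read returns. Then comes pcap_activate, which is also covered below. Between pcap_create and pcap_activate you can additionally call pcap_set_buffer_size to set the size of the kernel buffer; opentest.c shows how it is called, and I come back to it later in this article.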
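On the plain socket path (no PF_RING, no memory mapping), the pcap_activate_linux source quoted later in this article hands a nonzero buffer size to the kernel with setsockopt(SO_RCVBUF). The sketch below exercises that same socket option on an ordinary UDP socket, so it can be tried without root privileges. The helper name request_rcvbuf is hypothetical, and a UDP socket is only a stand-in for the PF_PACKET socket that libpcap opens.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel for a receive buffer of `requested`
 * bytes and return the size the kernel reports afterwards, or -1 on error.
 */
int request_rcvbuf(int requested)
{
    int fd;
    int effective = 0;
    socklen_t len = sizeof(effective);

    assert(requested > 0);
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd == -1)
        return -1;
    /* Same option and level that pcap_activate_linux uses for buffer_size. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof(requested)) == -1 ||
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &effective, &len) == -1) {
        close(fd);
        return -1;
    }
    close(fd);
    return effective;
}
```

On Linux the value read back usually differs from the one requested: the kernel doubles the requested size for bookkeeping overhead and caps it at net.core.rmem_max, so a very large request may silently be reduced.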
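To support many operating systems, the libpcap source is quite tangled. Search for pcap_create and you will find it defined in many places. Since we are analyzing the source on Linux, we look for pcap_create in pcap-linux.c first; the source is as follows: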
pcap_t *pcap_create(const char *device, char *ebuf)
{   // device: name of the network device; ebuf: buffer that receives error messages
    pcap_t *handle;
    /*
     * A null device name is equivalent to the "any" device.
     */
    if (device == NULL)
        device = "any";
#ifdef HAVE_DAG_API
    if (strstr(device, "dag")) {
        return dag_create(device, ebuf);
    }
#endif /* HAVE_DAG_API */
#ifdef HAVE_SEPTEL_API
    if (strstr(device, "septel")) {
        return septel_create(device, ebuf);
    }
#endif /* HAVE_SEPTEL_API */
#ifdef HAVE_SNF_API
    handle = snf_create(device, ebuf);
    if (strstr(device, "snf") || handle != NULL)
        return handle;
#endif /* HAVE_SNF_API */
#ifdef PCAP_SUPPORT_BT
    if (strstr(device, "bluetooth")) {
        return bt_create(device, ebuf);
    }
#endif
#ifdef PCAP_SUPPORT_CAN
    if (strstr(device, "can") || strstr(device, "vcan")) {
        return can_create(device, ebuf);
    }
#endif
#ifdef PCAP_SUPPORT_USB
    if (strstr(device, "usbmon")) {
        return usb_create(device, ebuf);
    }
#endif
    handle = pcap_create_common(device, ebuf);
    if (handle == NULL)
        return NULL;
    // pcap_create_common is the initialization routine: given the device name it returns a pcap_t * handle, and the Linux-specific callbacks are then set on that handle.
    handle->activate_op = pcap_activate_linux;
    handle->can_set_rfmon_op = pcap_can_set_rfmon_linux; // checks whether rfmon (monitor) mode can be set
    return handle;
}
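To support different kinds of devices, pcap_create separates them with #ifdef, so that opening all of them is folded into a single function. In our case the device is an ordinary NIC, so the call that matters is pcap_create_common, which is defined in pcap.c. At first this felt a bit messy to me: why not define it directly in pcap-linux.c, which would be more obvious and would have saved me a trip to pcap.c while tracing the code? The reason is presumably that libpcap has to support other operating systems as well; if the function lived in pcap-linux.c, the other platforms could not easily call it. Seen that way, the libpcap authors' architecture is quite good. pcap_create also installs two callbacks, pcap_activate_linux and pcap_can_set_rfmon_linux. Its return value is a handle of type pcap_t * for the device. Having covered pcap_create, we have to follow it into pcap_create_common and the two callbacks. Here is the source of pcap_create_common: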
pcap_t *pcap_create_common(const char *source, char *ebuf)
{
    pcap_t *p;
    p = malloc(sizeof(*p)); // allocate memory for p
    if (p == NULL) {
        snprintf(ebuf, PCAP_ERRBUF_SIZE, "malloc: %s",
            pcap_strerror(errno));
        return (NULL);
    }
    memset(p, 0, sizeof(*p)); // zero the memory of p
#ifndef WIN32
    p->fd = -1; /* not opened yet */
    p->selectable_fd = -1;
    p->send_fd = -1;
#endif
    p->opt.source = strdup(source); // source is the name of the network device
    if (p->opt.source == NULL) {
        snprintf(ebuf, PCAP_ERRBUF_SIZE, "malloc: %s",
            pcap_strerror(errno));
        free(p);
        return (NULL);
    }
    /*
     * Default to "can't set rfmon mode"; if it's supported by
     * a platform, the create routine that called us can set
     * the op to its routine to check whether a particular
     * device supports it.
     */
    p->can_set_rfmon_op = pcap_cant_set_rfmon;
    initialize_ops(p);
    /* put in some defaults */
    pcap_set_timeout(p, 0);
    pcap_set_snaplen(p, 65535); /* max packet size */
    p->opt.promisc = 0;
    p->opt.buffer_size = 0;
    return (p);
}
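In this function, strdup deserves a note: it copies a string into newly allocated memory and returns a pointer to the copy, which must eventually be released with free. To use it, include <string.h>.
p->can_set_rfmon_op = pcap_cant_set_rfmon; is explained by the comment in the function: by default, rfmon mode cannot be set. initialize_ops(p); installs the initial set of callbacks.
pcap_set_timeout(p, 0);
pcap_set_snaplen(p, 65535); /* max packet size */
p->opt.promisc = 0;
p->opt.buffer_size = 0;
These lines set the initial timeout, set snaplen to 65535, select non-promiscuous mode, and initialize the kernel buffer size to 0. Overall, pcap_create_common is simply an initialization function.
The source of initialize_ops is as follows: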
static void initialize_ops(pcap_t *p)
{
    /*
     * Set operation pointers for operations that only work on
     * an activated pcap_t to point to a routine that returns
     * a "this isn't activated" error.
     */
    p->read_op = (read_op_t)pcap_not_initialized;
    p->inject_op = (inject_op_t)pcap_not_initialized;
    p->setfilter_op = (setfilter_op_t)pcap_not_initialized;
    p->setdirection_op = (setdirection_op_t)pcap_not_initialized;
    p->set_datalink_op = (set_datalink_op_t)pcap_not_initialized;
    p->getnonblock_op = (getnonblock_op_t)pcap_not_initialized;
    p->setnonblock_op = (setnonblock_op_t)pcap_not_initialized;
    p->stats_op = (stats_op_t)pcap_not_initialized;
#ifdef WIN32
    p->setbuff_op = (setbuff_op_t)pcap_not_initialized;
    p->setmode_op = (setmode_op_t)pcap_not_initialized;
    p->setmintocopy_op = (setmintocopy_op_t)pcap_not_initialized;
#endif
    /*
     * Default cleanup operation - implementations can override
     * this, but should call pcap_cleanup_live_common() after
     * doing their own additional cleanup.
     */
    p->cleanup_op = pcap_cleanup_live_common;
    /*
     * In most cases, the standard one-shot callback can
     * be used for pcap_next()/pcap_next_ex().
     */
    p->oneshot_callback = pcap_oneshot;
}
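The pattern used by initialize_ops, where every operation first points at a "not activated" stub and is later replaced by a platform routine, can be sketched in a few lines. All names below are hypothetical and the error code is made up. Note one deliberate difference: libpcap casts a single pcap_not_initialized function to each pointer type, whereas this sketch uses a stub whose signature matches the operation.

```c
#include <assert.h>
#include <stddef.h>

#define ERR_NOT_ACTIVATED (-3)   /* hypothetical error code */

struct capture;
typedef int (*read_op_t)(struct capture *c, int max_packets);

struct capture {
    read_op_t read_op;   /* operation pointer, like p->read_op in pcap_t */
    int activated;
};

/* Default stub: every read attempt before activation reports an error. */
static int read_not_activated(struct capture *c, int max_packets)
{
    (void)c;
    (void)max_packets;
    return ERR_NOT_ACTIVATED;
}

/* Stand-in for a platform read routine such as pcap_read_linux. */
static int read_live(struct capture *c, int max_packets)
{
    (void)c;
    return max_packets > 0 ? 1 : 0;   /* pretend one packet was delivered */
}

/* Counterpart of pcap_create_common + initialize_ops. */
void capture_init(struct capture *c)
{
    c->read_op = read_not_activated;
    c->activated = 0;
}

/* Counterpart of the activate step: install the real operations. */
void capture_activate(struct capture *c)
{
    c->read_op = read_live;
    c->activated = 1;
}

/* Public entry point: always dispatches through the operation pointer. */
int capture_read(struct capture *c, int max_packets)
{
    assert(c != NULL && c->read_op != NULL);
    return c->read_op(c, max_packets);
}
```

Because callers only ever go through capture_read, using a handle before activation fails cleanly with an error code instead of dereferencing a null function pointer.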
That covers pcap_create_common. Next comes the other callback installed by pcap_create, pcap_activate_linux, which I found in pcap-linux.c. I really admire the libpcap authors' architecture: the functions Linux needs are collected in pcap-linux.c, while functions shared by several operating systems, such as pcap_create_common above, live in pcap.c. pcap_create_common gives the pcap_t * p its initial values, much like an initialization routine in C++, such as a constructor or MFC's OnInitDialog. Initialization is only initialization, though; each system then needs its own setup. In pcap_activate_linux, shown next, the callbacks initialized in pcap_create_common are set again with Linux-specific functions, which is exactly why keeping pcap_create_common in pcap.c makes sense.
static int pcap_activate_linux(pcap_t *handle)
{
const char *device;
int status = 0;
device = handle->opt.source; // name of the network device
handle->inject_op= pcap_inject_linux;
handle->setfilter_op= pcap_setfilter_linux;
handle->setdirection_op= pcap_setdirection_linux;
handle->set_datalink_op= NULL; /* can't change data link type */
handle->getnonblock_op= pcap_getnonblock_fd;
handle->setnonblock_op= pcap_setnonblock_fd;
handle->cleanup_op= pcap_cleanup_linux;
handle->read_op= pcap_read_linux;
handle->stats_op= pcap_stats_linux;
/*
 * The "any" device is a special device which causes us not
 * to bind to a particular device and thus to look at all
 * devices.
 */
if (strcmp(device, "any") == 0) {
if (handle->opt.promisc) {
handle->opt.promisc = 0;
/* Just a warning. */
snprintf(handle->errbuf, PCAP_ERRBUF_SIZE,
"Promiscuous mode not supported on the \"any\" device");
status = PCAP_WARNING_PROMISC_NOTSUP;
}
}
handle->md.device = strdup(device);
if (handle->md.device == NULL) {
snprintf(handle->errbuf, PCAP_ERRBUF_SIZE, "strdup: %s",
pcap_strerror(errno) );
return PCAP_ERROR;
}
#ifdef HAVE_PF_RING // is PF_RING support compiled in?
if(!getenv("PCAP_NO_PF_RING")){
/* Code courtesy of Chris Wakelin <c.d.wakelin@reading.ac.uk> */
char *clusterId;
handle->ring = pfring_open((char*)device, handle->opt.promisc, handle->snapshot, 1);
/*
When HAVE_PF_RING is defined, the code in this block is compiled in. It shows that PF_RING provides its own socket type. pfring_open initializes a PF_RING socket and returns a structure of type pfring. Its prototype is:
pfring* pfring_open(char *device_name, u_int8_t promisc, u_int32_t caplen, u_int8_t reentrant);
Purpose: initialize a PF_RING socket and obtain a pfring structure. To open a device in DNA mode, pfring_open_dna() must be called instead.
Parameters:
device_name: symbolic name of the PF_RING-aware device (e.g. eth0);
promisc: whether to enable promiscuous mode (1 = promiscuous);
caplen: maximum capture length (also known as snaplen, the same as the snaplen of pcap_open_live; 65535 is normally enough to capture the largest packet on the network);
reentrant: if nonzero, the device is opened in reentrant mode. This is implemented with semaphore-style locking, so performance is slightly worse; it is mainly meant for multithreaded applications.
Return value: a handle on success, NULL otherwise.
The source of pfring_open is:
pfring* pfring_open(char *device_name, u_int8_t promisc, u_int32_t caplen, u_int8_t _reentrant) {
return(pfring_open_consumer(device_name, promisc, caplen, _reentrant,
0, NULL, 0));
}
pfring_open simply calls pfring_open_consumer, which we analyze later.
*/
if(handle->ring) {
if(clusterId =getenv("PCAP_PF_RING_CLUSTER_ID"))
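/*
getenv is the C library function that reads the current value of an environment variable.
Prototype: char *getenv(const char *name)
Usage: s = getenv("NAME");
with char *s; declared beforehand.
Behavior: returns the value of the given environment variable. Variable names are case-sensitive on Linux. If the variable is not defined in the environment, getenv returns NULL (not an empty string), which is why the result can be tested directly in the if above.
*/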
if(atoi(clusterId) > 0 &&atoi(clusterId) < 255)
if(getenv("PCAP_PF_RING_USE_CLUSTER_PER_FLOW"))
pfring_set_cluster(handle->ring,atoi(clusterId), cluster_per_flow);
else
pfring_set_cluster(handle->ring, atoi(clusterId),cluster_round_robin);
pfring_enable_ring(handle->ring);
} else
handle->ring = NULL;
}else
handle->ring = NULL;
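/*
pfring_set_cluster only sets the cluster id; it does so by calling setsockopt on the PF_RING socket.
The PF_RING documentation describes this function as follows; it is very useful on multi-CPU systems:
This call allows a ring to be added to a cluster that can spawn across address spaces. In a nutshell when two or more sockets are clustered they share incoming packets that are balanced on a per-flow manner. This technique is useful for exploiting multicore systems or for sharing packets in the same address space across multiple threads.
int pfring_set_cluster(pfring *ring, u_int clusterId, cluster_type the_type) {
#ifdef USE_PCAP
return(-1);
#else
if(ring->dna_mapped_device)
return(-1);
else {
struct add_to_cluster cluster;
cluster.clusterId = clusterId, cluster.the_type = the_type;
return(ring ? setsockopt(ring->fd, 0, SO_ADD_TO_CLUSTER,
&cluster, sizeof(cluster)): -1);
}
#endif
}
What setsockopt/getsockopt do:
Description:
They get or set options associated with a socket. Options may exist at multiple protocol levels; they are always present at the uppermost socket level. When manipulating socket options, the level at which the option resides and the name of the option must be specified. To manipulate options at the socket level, level is specified as SOL_SOCKET. To manipulate options at any other level, the protocol number of the protocol controlling the option is supplied. For example, to indicate that an option is to be interpreted by TCP, level should be set to the protocol number of TCP. Usage:
#include <sys/types.h>
#include <sys/socket.h>
int getsockopt(int sock, int level, int optname, void *optval, socklen_t *optlen);
int setsockopt(int sock, int level, int optname, const void *optval, socklen_t optlen);
Parameters:
sock: the socket whose option is set or read.
level: the protocol level at which the option resides.
optname: the name of the option to access. // here: SO_ADD_TO_CLUSTER
optval: for getsockopt(), the buffer that receives the option value; for setsockopt(), the buffer that holds the new value.
optlen: for getsockopt(), on input the maximum length of the option value and on output its actual length; for setsockopt(), the length of the option value.
When PF_RING is defined, the socket is created by calling pfring_open. That concludes this part.
*/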
if(handle->ring!= NULL) {
handle->fd = handle->ring->fd;
handle->bufsize = handle->snapshot;
handle->linktype = DLT_EN10MB;
handle->offset = 2;
/* printf("Open HAVE_PF_RING(%s)\n", device); */
}else {
/* printf("Open HAVE_PF_RING(%s) failed. Fallback to pcap\n", device); */
#endif
/*
 * If we're in promiscuous mode, then we probably want
 * to see when the interface drops packets too, so get an
 * initial count from /proc/net/dev
 */
if(handle->opt.promisc)
handle->md.proc_dropped = linux_if_drops(handle->md.device);
/*
 * Current Linux kernels use the protocol family PF_PACKET to
 * allow direct access to all packets on the network while
 * older kernels had a special socket type SOCK_PACKET to
 * implement this feature.
 * While this old implementation is kind of obsolete we need
 * to be compatible with older kernels for a while so we are
 * trying both methods with the newer method preferred.
 */
// Current kernels use PF_PACKET, while older kernels use SOCK_PACKET
if((status = activate_new(handle)) == 1) {
/*
 * Try to open a packet socket using the new kernel PF_PACKET interface.
 * Returns 1 on success, 0 on an error that means the new interface isn't
 * present (so the old SOCK_PACKET interface should be tried), and a
 * PCAP_ERROR_ value on an error that means that the old mechanism won't
 * work either (so it shouldn't be tried). In other words, when PF_RING is not in use, activate_new creates the socket through the PF_PACKET interface. It returns 1 on success, meaning the socket was created with PF_PACKET, and 0 when the PF_PACKET interface is not available, in which case the SOCK_PACKET interface can be tried; its source is also in pcap-linux.c. The value of status therefore distinguishes three cases: 1 means success with PF_PACKET; on 0, activate_old is called, and if that returns 1 the socket was created with SOCK_PACKET, otherwise the attempt failed; any other value means failure.
 */
/*
* Success.
* Try to use memory-mapped access.
*/
switch (activate_mmap(handle)) {
case 1:
/* we succeeded; nothing more to do */
return 0;
case 0:
/*
 * Kernel doesn't support it - just continue
 * with non-memory-mapped access.
 */
status = 0;
break;
case -1:
/*
 * We failed to set up to use it, or kernel
 * supports it, but we failed to enable it;
 * return an error. handle->errbuf contains
 * an error message.
 */
status = PCAP_ERROR;
goto fail;
}
}
else if (status == 0) {
/* Non-fatal error; try old way */
if ((status = activate_old(handle)) != 1) {
/*
 * Both methods to open the packet socket failed.
 * Tidy up and report our failure (handle->errbuf
 * is expected to be set by the functions above).
 */
goto fail;
}
} else {
/*
 * Fatal error with the new way; just fail.
 * status has the error return; if it's PCAP_ERROR,
 * handle->errbuf has been set appropriately.
 */
goto fail;
}
/*
 * We set up the socket, but not with memory-mapped access.
 */
if (handle->opt.buffer_size != 0) {
/*
 * As I understand it, opt.buffer_size != 0 means the application called pcap_set_buffer_size to choose the kernel buffer size instead of accepting the default, so the size is first passed to the kernel with setsockopt; the user-space packet buffer is then allocated with malloc further down.
 * Set the socket buffer size to the specified value.
 */
if (setsockopt(handle->fd, SOL_SOCKET, SO_RCVBUF,
&handle->opt.buffer_size,
sizeof(handle->opt.buffer_size)) == -1) {
snprintf(handle->errbuf, PCAP_ERRBUF_SIZE,
"SO_RCVBUF: %s", pcap_strerror(errno));
status = PCAP_ERROR;
goto fail;
}
}
#ifdef HAVE_PF_RING
}
#endif
/* Allocate the buffer */
handle->buffer = malloc(handle->bufsize + handle->offset);
if (!handle->buffer) {
snprintf(handle->errbuf, PCAP_ERRBUF_SIZE,
"malloc: %s", pcap_strerror(errno));
status = PCAP_ERROR;
goto fail;
}
/*
 * "handle->fd" is a socket, so "select()" and "poll()"
 * should work on it.
 */
handle->selectable_fd = handle->fd;
return status;
fail:
pcap_cleanup_linux(handle);
return status;
}
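The environment-variable handling inside the HAVE_PF_RING branch above is easy to misread because of the nested if statements without braces. The sketch below restates that decision as a small function. The two variable names and the valid range (greater than 0 and less than 255) come from the source above; the enum, the structure and the function name are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

enum cluster_mode {
    CLUSTER_NONE = 0,        /* do not join a cluster */
    CLUSTER_ROUND_ROBIN,     /* corresponds to cluster_round_robin */
    CLUSTER_PER_FLOW         /* corresponds to cluster_per_flow */
};

struct cluster_cfg {
    int id;                  /* 0 when no cluster is requested */
    enum cluster_mode mode;
};

/* Decide the cluster settings from the environment, as the patch does. */
struct cluster_cfg cluster_cfg_from_env(void)
{
    struct cluster_cfg cfg = { 0, CLUSTER_NONE };
    const char *s = getenv("PCAP_PF_RING_CLUSTER_ID");
    int id;

    if (s == NULL)           /* unset variable: getenv returns NULL */
        return cfg;
    id = atoi(s);
    if (id <= 0 || id >= 255)
        return cfg;          /* out of range: leave clustering off */
    cfg.id = id;
    cfg.mode = getenv("PCAP_PF_RING_USE_CLUSTER_PER_FLOW") != NULL
                   ? CLUSTER_PER_FLOW
                   : CLUSTER_ROUND_ROBIN;
    assert(cfg.id > 0 && cfg.id < 255);
    return cfg;
}
```

Running a capture program with PCAP_PF_RING_CLUSTER_ID=7 set would therefore select round-robin clustering with id 7, and additionally setting PCAP_PF_RING_USE_CLUSTER_PER_FLOW to any value switches to per-flow balancing.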
That completes pcap_activate_linux. My understanding was that PF_RING should replace PF_PACKET or SOCK_PACKET. From a quick read of pcap_activate_linux, though, the socket is first created with pfring_open, and I expected that, with PF_RING defined, the function would return right after pfring_open succeeds instead of going on to activate_new and activate_old. I could not see what the author intended, so I kept tracing: first pfring_open, then activate_new, to see how they are implemented. As noted earlier, pfring_open calls pfring_open_consumer, which lives in pfring.c. Its source is as follows:
pfring* pfring_open_consumer(char *device_name, u_int8_t promisc,
u_int32_t caplen, u_int8_t _reentrant,
u_int8_t consumer_plugin_id,
char* consumer_data, u_int consumer_data_len) {
#ifdef USE_PCAP
char ebuf[256];
pcap_t *pcapPtr = pcap_open_live(device_name,
caplen,
1 /* promiscuous mode */,
1000 /* ms */,
ebuf);
return((pfring*)pcapPtr);
#else
int err = 0;
pfring *ring = (pfring*)malloc(sizeof(pfring)); // allocate memory for one pfring structure
if(ring == NULL)
return(NULL);
else
memset(ring, 0, sizeof(pfring)); // zero the structure
ring->reentrant = _reentrant;
ring->fd = socket(PF_RING, SOCK_RAW, htons(ETH_P_ALL)); // create the PF_RING socket
#ifdef RING_DEBUG
printf("Open RING [fd=%d]\n", ring->fd);
#endif
if(ring->fd > 0) {
int rc;
u_int memSlotsLen;
if(caplen > MAX_CAPLEN) caplen = MAX_CAPLEN;
// MAX_CAPLEN is defined in pfring.h: #define MAX_CAPLEN 16384
setsockopt(ring->fd, 0, SO_RING_BUCKET_LEN, &caplen, sizeof(caplen));
// pass caplen, the per-packet capture length, to the kernel; pfring.h caps it at 16384
/* printf("channel_id=%d\n",channel_id); */
if(device_name == NULL /* any */) {
device_name = "any";
rc = pfring_bind(ring, device_name); // bind the ring to the device
} else if(!strcmp(device_name,"none")) {
/* No binding yet */
rc = 0;
} else
rc = pfring_bind(ring, device_name);
if(rc == 0) {
if(consumer_plugin_id > 0) {
ring->kernel_packet_consumer =consumer_plugin_id;
rc = pfring_set_packet_consumer_mode(ring,consumer_plugin_id,
consumer_data, consumer_data_len);
if(rc < 0) {
free(ring);
return(NULL);
}
} else
ring->kernel_packet_consumer = 0;
ring->buffer = (char *)mmap(NULL,PAGE_SIZE, PROT_READ|PROT_WRITE,
MAP_SHARED, ring->fd, 0);
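// mmap: map the first page of the ring into memory; PAGE_SIZE is 4096 here
/*
The prototype of mmap is: void *mmap(void *start, size_t length, int prot, int flags, int fd, off_t offset);
start: the preferred start address of the mapping. It is usually NULL, which lets the system choose the address; the chosen address is returned on success.
length: how many bytes of the file to map into memory.
prot: the protection of the mapped region, a combination of:
PROT_EXEC (pages may be executed), PROT_READ (pages may be read), PROT_WRITE (pages may be written),
PROT_NONE (pages may not be accessed).
flags: properties of the mapping. One of MAP_SHARED or MAP_PRIVATE must be given when calling mmap().
MAP_FIXED: if the mapping cannot be established at the address given in start, fail instead of adjusting the address. Use of this flag is generally discouraged.
MAP_SHARED: writes to the mapped region are carried through to the file and are visible to other processes that map the same file.
MAP_PRIVATE: writes to the mapped region go to a private copy-on-write copy; no change made to the region is written back to the original file.
MAP_ANONYMOUS: create an anonymous mapping. The fd argument is ignored and no file is involved.
MAP_DENYWRITE: historically, only writes through the mapping were allowed and other direct writes to the file were refused; current Linux kernels ignore this flag.
MAP_LOCKED: lock the mapped region in memory so that it is not swapped out.
fd: the file descriptor to map (here ring->fd, the return value of socket). For an anonymous mapping, that is, with MAP_ANONYMOUS set in flags, fd is set to -1. On systems without anonymous mappings, opening /dev/zero with open and mapping that file achieves the same effect.
offset: offset of the mapping within the file, usually 0 to start at the beginning of the file; it must be a multiple of the page size.
Return value:
On success, the start address of the mapped region; on failure, MAP_FAILED ((void *)-1), with the cause in errno.
*/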
if(ring->buffer == MAP_FAILED) {
printf("mmap() failed: try with a smaller snaplen\n");
free(ring);
return(NULL);
}
ring->slots_info = (FlowSlotInfo *)ring->buffer;
// ring->buffer is the mmap'ed region; ring->slots_info points at its start
if(ring->slots_info->version != RING_FLOWSLOT_VERSION) {
printf("Wrong RING version: "
"kernel is %i, libpfring was compiled with %i\n",
ring->slots_info->version, RING_FLOWSLOT_VERSION);
free(ring);
return(NULL);
}
memSlotsLen = ring->slots_info->tot_mem; // total ring size, read from the header
munmap(ring->buffer, PAGE_SIZE); // remove the one-page mapping
ring->buffer = (char *)mmap(NULL, memSlotsLen,
PROT_READ|PROT_WRITE,
MAP_SHARED, ring->fd, 0);
/*
The first mmap apparently exists only to read memSlotsLen; that mapping is then removed with munmap and the whole ring is mapped again with mmap.
*/
if(ring->buffer == MAP_FAILED) {
printf("mmap() failed");
free(ring);
return(NULL);
}
ring->slots_info = (FlowSlotInfo *)ring->buffer; // pointer to the ring buffer header
ring->slots = (char *)(ring->buffer+sizeof(FlowSlotInfo));
// skip the header structure at the start of the ring; the slots that receive packet data follow it
/* Set defaults */
ring->device_name = strdup(device_name? device_name : "");
#ifdef RING_DEBUG
printf("RING (%s):tot_mem=%u/min_tot_slots=%u/max_slot_len=%u/"
"insert_off=%u/remove_off=%u/dropped=%llu\n",
device_name,
ring->slots_info->tot_mem,
ring->slots_info->tot_slots,
ring->slots_info->slot_len,
ring->slots_info->insert_off,
ring->slots_info->remove_off,
ring->slots_info->tot_lost);
#endif
if(promisc) {
if(set_if_promisc(device_name, 1) == 0)
ring->clear_promisc = 1;
}
#ifdef ENABLE_HW_TIMESTAMP
pfring_enable_hw_timestamp(ring,device_name);
#endif
} else {
close(ring->fd);
err = -1;
}
} else {
err = -1;
free(ring);
}
if(err == 0) {
if(ring->reentrant)
pthread_spin_init(&ring->spinlock,PTHREAD_PROCESS_PRIVATE);
return(ring);
} else
return(NULL);
#endif
}
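The two-step mapping in pfring_open_consumer (map one page, read the total size from the header, unmap, then map the whole region) can be tried outside the kernel with an ordinary file. In this sketch a temporary file plays the role of the PF_RING socket; struct region_hdr, make_region_file and map_whole_region are hypothetical names, and the header layout is an assumption that only mimics the version and tot_mem fields of FlowSlotInfo.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical header stored at the start of the shared region. */
struct region_hdr {
    uint32_t version;
    uint32_t tot_mem;    /* total size of the region in bytes */
};

/* Map fd in two steps. Returns the full mapping or MAP_FAILED. */
void *map_whole_region(int fd, size_t *len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    struct region_hdr *hdr;
    size_t total;
    void *full;

    assert(len != NULL);
    /* Step 1: map a single page, just to read the header. */
    hdr = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
    if (hdr == MAP_FAILED)
        return MAP_FAILED;
    total = hdr->tot_mem;
    munmap(hdr, page);
    if (total < sizeof(struct region_hdr))
        return MAP_FAILED;
    /* Step 2: map the whole region now that its size is known. */
    full = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (full != MAP_FAILED)
        *len = total;
    return full;
}

/* Stand-in for the kernel side: a temp file of tot_mem bytes with a header. */
int make_region_file(uint32_t tot_mem)
{
    char path[] = "/tmp/region-XXXXXX";
    struct region_hdr hdr = { 1, tot_mem };
    int fd = mkstemp(path);

    if (fd == -1)
        return -1;
    unlink(path);        /* the file disappears once fd is closed */
    if (ftruncate(fd, (off_t)tot_mem) == -1 ||
        write(fd, &hdr, sizeof(hdr)) != (ssize_t)sizeof(hdr)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A caller would create the descriptor with make_region_file(3 * 4096), pass it to map_whole_region, and release the result with munmap and close. The first mapping is needed only because the reader does not know the region size until it has seen the header.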
// pfring_bind binds the socket by calling bind: rc = bind(ring->fd, (struct sockaddr *)&sa, sizeof(sa));
int pfring_bind(pfring *ring, char *device_name) {
struct sockaddr sa; // socket address passed to bind
char *at;
int32_t channel_id = -1;
int rc = 0;
if((device_name == NULL) ||(strcmp(device_name, "none") == 0))
return(-1);
at = strchr(device_name, '@');
if(at != NULL) {
char *tok, *pos = NULL;
at[0] = '\0';
/* Syntax
ethX@1,5 channel 1 and 5
ethX@1-5 channel 1,2...5
ethX@1-3,5-7 channel 1,2,3,5,6,7
*/
tok = strtok_r(&at[1], ",",&pos);
channel_id = 0;
while(tok != NULL) {
char *dash = strchr(tok, '-');
int32_t min_val, max_val, i;
if(dash) {
dash[0] = '\0';
min_val = atoi(tok);
max_val = atoi(&dash[1]);
} else
min_val = max_val = atoi(tok);
for(i = min_val; i <= max_val; i++)
channel_id |= 1 << i;
tok = strtok_r(NULL, ",",&pos);
}
}
/* Setup TX */
ring->sock_tx.sll_family = PF_PACKET;
ring->sock_tx.sll_protocol =htons(ETH_P_ALL);
sa.sa_family = PF_RING;
snprintf(sa.sa_data, sizeof(sa.sa_data),"%s", device_name);
rc = bind(ring->fd, (struct sockaddr*)&sa, sizeof(sa));
/*
The bind function:
Headers: #include <sys/types.h> and #include <sys/socket.h>
Prototype: int bind(int sockfd, const struct sockaddr *my_addr, socklen_t addrlen);
Return value: 0 on success, -1 on failure (errno is set).
*/
if(rc == 0) {
if(channel_id != -1) {
int rc = pfring_set_channel_id(ring,channel_id);
if(rc != 0)
printf("pfring_set_channel_id() failed: %d\n", rc);
}
}
return(rc);
}
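That completes pfring_open_consumer, and it matches my understanding: it creates the PF_RING socket, binds it with pfring_bind, and maps the ring buffer into the process with mmap. As I said, my reading of the PF_RING patch is that a new socket type replaces the original PF_PACKET and SOCK_PACKET. But when I first read the source I was puzzled: if the PF_RING socket has been set up, why doesn't pcap_activate_linux return right away? So back to pcap_activate_linux to see what I missed. Start with its parameter, pcap_t *handle, and that data structure; as the saying goes, algorithms + data structures = programs, which shows how much data structures matter. In pcap-int.h I found where ring is declared:
#ifdef HAVE_PF_RING
pfring *ring;
#endif
The next question is: since pfring_open has already created and bound a socket, what is activate_new for? Let's trace activate_new: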
static int activate_new(pcap_t*handle)
{
#ifdef HAVE_PF_PACKET_SOCKETS
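// HAVE_PF_PACKET_SOCKETS: if PF_PACKET sockets are available at build time, the code below is compiled in; otherwise the function effectively just returns 0, and activate_old can then be tried to see whether SOCK_PACKET works.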
const char *device = handle->opt.source;
int is_any_device= (strcmp(device, "any") == 0);
int sock_fd= -1, arptype;
#ifdef HAVE_PACKET_AUXDATA
int val;
#endif
int err= 0;
struct packet_mreq mr;
/*
 * Open a socket with protocol family packet. If the
 * "any" device was specified, we open a SOCK_DGRAM
 * socket for the cooked interface, otherwise we first
 * try a SOCK_RAW socket for the raw interface.
 */
sock_fd = is_any_device ?
socket(PF_PACKET, SOCK_DGRAM, htons(ETH_P_ALL)) :
socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
// socket() only creates the socket; watch for a bind call further down
if (sock_fd == -1) {
snprintf(handle->errbuf,PCAP_ERRBUF_SIZE, "socket: %s",
pcap_strerror(errno) );
return 0; /* try old mechanism */
}
/* It seems the kernel supports the new interface. */
handle->md.sock_packet = 0;
/*
 * Get the interface index of the loopback device.
 * If the attempt fails, don't fail, just set the
 * "md.lo_ifindex" to -1.
 *
 * XXX - can there be more than one device that loops
 * packets back, i.e. devices other than "lo"? If so,
 * we'd need to find them all, and have an array of
 * indices for them, and check all of them in
 * "pcap_read_packet()".
 */
handle->md.lo_ifindex = iface_get_id(sock_fd, "lo", handle->errbuf);
/*
 * Default value for offset to align link-layer payload
 * on a 4-byte boundary.
 */
handle->offset = 0;
/*
 * What kind of frames do we have to deal with? Fall back
 * to cooked mode if we have an unknown interface type
 * or a type we know doesn't work well in raw mode.
 */
if (!is_any_device) {
/* Assume for now we don't need cooked mode. */
handle->md.cooked = 0;
if (handle->opt.rfmon) {
/*
* We were asked to turn on monitor mode.
* Do so before we get the link-layer type,
* because entering monitor mode could change
* the link-layer type.
*/
err =enter_rfmon_mode(handle, sock_fd, device);
if (err < 0) {
/* Hard failure */
close(sock_fd);
return err;
}
if (err == 0) {
/*
* Nothing worked for turning monitor mode
* on.
*/
close(sock_fd);
return PCAP_ERROR_RFMON_NOTSUP;
}
/*
* Either monitor mode has been turned on for
* the device, or we've been given a different
* device to open for monitor mode. If we've
* been given a different device, use it.
*/
if (handle->md.mondevice!= NULL)
device =handle->md.mondevice;
}
arptype = iface_get_arptype(sock_fd, device, handle->errbuf);
if (arptype < 0) {
close(sock_fd);
return arptype;
}
map_arphrd_to_dlt(handle, arptype,1);
if (handle->linktype == -1 ||
handle->linktype == DLT_LINUX_SLL ||
handle->linktype == DLT_LINUX_IRDA ||
handle->linktype == DLT_LINUX_LAPD ||
(handle->linktype == DLT_EN10MB &&
(strncmp("isdn", device, 4) == 0||
strncmp("isdY", device, 4) ==0))) {
if (close(sock_fd) == -1) {
snprintf(handle->errbuf,PCAP_ERRBUF_SIZE,
"close: %s", pcap_strerror(errno));
return PCAP_ERROR;
}
sock_fd = socket(PF_PACKET,SOCK_DGRAM,
htons(ETH_P_ALL));
if (sock_fd == -1) {
snprintf(handle->errbuf,PCAP_ERRBUF_SIZE,
"socket: %s",pcap_strerror(errno));
return PCAP_ERROR;
}
handle->md.cooked = 1;
/*
* Get rid of any link-layer type list
* we allocated - this only supports cooked
* capture.
*/
if (handle->dlt_list !=NULL) {
free(handle->dlt_list);
handle->dlt_list= NULL;
handle->dlt_count= 0;
}
if (handle->linktype ==-1) {
/*
* Warn that we're falling back on
* cooked mode; we may want to
* update "map_arphrd_to_dlt()"
* to handle the new type.
*/
snprintf(handle->errbuf, PCAP_ERRBUF_SIZE,
"arptype %d not "
"supported by libpcap - "
"falling back to cooked "
"socket",
arptype);
}
/*
 * IrDA capture is not a real "cooked" capture,
 * it's IrLAP frames, not IP packets. The
 * same applies to LAPD capture.
 */
if (handle->linktype !=DLT_LINUX_IRDA &&
handle->linktype != DLT_LINUX_LAPD)
handle->linktype= DLT_LINUX_SLL;
}
handle->md.ifindex =iface_get_id(sock_fd, device,
handle->errbuf);
if (handle->md.ifindex == -1) {
close(sock_fd);
return PCAP_ERROR;
}
// The bind step we were looking for finally shows up: iface_bind. I guessed that it calls bind internally, and tracing its code confirms it: iface_bind calls bind to bind the socket to the interface.
if ((err =iface_bind(sock_fd, handle->md.ifindex,
handle->errbuf)) != 1) {
close(sock_fd);
if (err < 0)
return err;
else
return 0; /* try old mechanism */
}
} else {
/*
* The "any" device.
*/
if (handle->opt.rfmon) {
/*
* It doesn't support monitor mode.
*/
return PCAP_ERROR_RFMON_NOTSUP;
}
/*
* It uses cooked mode.
*/
handle->md.cooked = 1;
handle->linktype =DLT_LINUX_SLL;
/*
* We're not bound to a device.
* For now, we're using this as an indication
* that we can't transmit; stop doing that only
* if we figure out how to transmit in cooked
* mode.
*/
handle->md.ifindex = -1;
}
if (!is_any_device &&handle->opt.promisc) {
memset(&mr, 0, sizeof(mr));
mr.mr_ifindex =handle->md.ifindex;
mr.mr_type = PACKET_MR_PROMISC;
if (setsockopt(sock_fd,SOL_PACKET, PACKET_ADD_MEMBERSHIP,
&mr, sizeof(mr)) == -1) {
snprintf(handle->errbuf,PCAP_ERRBUF_SIZE,
"setsockopt:%s", pcap_strerror(errno));
close(sock_fd);
return PCAP_ERROR;
}
}
/* Enable auxiliary data if supported and reserve room for
* reconstructing VLAN headers. */
#ifdef HAVE_PACKET_AUXDATA
val = 1;
if (setsockopt(sock_fd, SOL_PACKET,PACKET_AUXDATA, &val,
sizeof(val)) == -1 && errno !=ENOPROTOOPT) {
snprintf(handle->errbuf,PCAP_ERRBUF_SIZE,
"setsockopt: %s", pcap_strerror(errno));
close(sock_fd);
return PCAP_ERROR;
}
handle->offset += VLAN_TAG_LEN;
#endif /* HAVE_PACKET_AUXDATA */
if (handle->md.cooked) {
if (handle->snapshot <SLL_HDR_LEN + 1)
handle->snapshot =SLL_HDR_LEN + 1;
}
handle->bufsize = handle->snapshot;
/* Save the socket FD in the pcap structure */
handle->fd = sock_fd;
return 1;
#else
// if PF_PACKET sockets are not available at build time, the function simply returns 0
strncpy(ebuf,
"New packet capturing interface not supported by build "
"environment", PCAP_ERRBUF_SIZE);
return 0;
#endif
}
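Reading activate_new did not answer my question either: with PF_RING in use, the other two socket types should not be tried at all. So I went back to pcap_activate_linux once more, read it carefully, and this time I saw it. I had overlooked the handle->ring != NULL test. It cost me a long detour through other code, though I did learn something along the way: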
if(handle->ring != NULL) {
handle->fd = handle->ring->fd;
handle->bufsize = handle->snapshot;
handle->linktype = DLT_EN10MB;
handle->offset = 2;
/* printf("Open HAVE_PF_RING(%s)\n", device); */
}else {
/* printf("Open HAVE_PF_RING(%s) failed. Fallback to pcap\n", device); */
/* ... */
}
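When handle->ring != NULL, activate_new and everything else in the else branch is skipped. In other words, once PF_RING has been set up successfully, the other two socket types are never tried, just as I expected. Now pcap_activate_linux finally makes sense.
Addendum, 2011-4-18. pfring_open does not succeed in every case. In one of my experiments, after patching PF_RING into the kernel I got the error "Wrong RING version: kernel is 10, libpfring was compiled with 13", yet the program still ran correctly. I then found that ring.h defines the kernel-side pf_ring version as:
#define RING_FLOWSLOT_VERSION 10
while pf_ring.h has:
#define RING_FLOWSLOT_VERSION 13
pfring_open_consumer, which implements pfring_open, reports an error and returns NULL when the versions do not match, so pfring_open returns NULL as well. The program keeps running because execution then takes the else branch of the handle->ring != NULL test and falls back to the original libpcap receive path, reading packets through PF_PACKET. That is why everything still works.
Likewise, when the module has not been loaded with insmod pf_ring.ko, pfring_open returns NULL and the program again falls back to libpcap's original PF_PACKET capture.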
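The selection logic that pcap_activate_linux ends up implementing can be summarized as a small decision function. This is only a model of the control flow described above: the struct probe fields stand in for the results of pfring_open, activate_new and activate_old, and every name in the sketch is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum backend {
    BACKEND_NONE = 0,        /* activation failed */
    BACKEND_PF_RING,
    BACKEND_PF_PACKET,
    BACKEND_SOCK_PACKET
};

/* Stand-ins for the outcome of each attempt. */
struct probe {
    int ring_ok;      /* nonzero if pfring_open returned a ring */
    int new_status;   /* activate_new: 1 ok, 0 not available, <0 fatal */
    int old_status;   /* activate_old: 1 ok, anything else is failure */
};

enum backend choose_backend(const struct probe *p)
{
    assert(p != NULL);
    if (p->ring_ok)
        return BACKEND_PF_RING;        /* the socket paths are skipped */
    if (p->new_status == 1)
        return BACKEND_PF_PACKET;
    if (p->new_status == 0 && p->old_status == 1)
        return BACKEND_SOCK_PACKET;    /* tried only after a soft failure */
    return BACKEND_NONE;
}
```

Both situations from the addendum (a version mismatch, or pf_ring.ko not loaded) correspond to ring_ok being 0 with new_status equal to 1, which selects PF_PACKET.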
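One more observation: with PF_RING, CPU usage rose from 37% to 47%. Sending 1514-byte packets at 240 Mbit/s, the original setup lost about 3 packets every 2 minutes; with PF_RING this improved to about 2 packets every 3 minutes.
The callbacks installed by pcap_activate_linux are also worth noting. Here they all are: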
device = handle->opt.source;
handle->inject_op = pcap_inject_linux;
handle->setfilter_op = pcap_setfilter_linux;
handle->setdirection_op = pcap_setdirection_linux;
handle->set_datalink_op = NULL; /* can't change data link type */
handle->getnonblock_op = pcap_getnonblock_fd;
handle->setnonblock_op = pcap_setnonblock_fd;
handle->cleanup_op = pcap_cleanup_linux;
handle->read_op = pcap_read_linux;
handle->stats_op = pcap_stats_linux;
I won't go into the other callbacks. The one to focus on is pcap_read_linux, whose source is:
/*
 * Read at most max_packets from the capture stream and call the callback
 * for each of them. Returns the number of packets handled or -1 if an
 * error occurred.
 */
static int
pcap_read_linux(pcap_t *handle, int max_packets, pcap_handler callback, u_char *user)
{
/*
 * Currently, on Linux only one packet is delivered per read,
 * so we don't loop.
 */
return pcap_read_packet(handle, callback, user);
}
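The body is very short: a single statement that calls pcap_read_packet to read a packet.
pcap_read_packet is a long function, so let's go through it step by step. Having started this analysis, we need to understand this code thoroughly, because this is where we can see why plain libpcap drops packets while libpcap with the PF_RING patch drops fewer. When is this callback invoked? My guess is that it is called when the application reads packets through pcap_next, pcap_next_ex, pcap_dispatch or pcap_loop. That is only a guess for now, since I have not yet analyzed that part of the source. On to pcap_read_packet.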
/*
 * Read a packet from the socket calling the handler provided by
 * the user. Returns the number of packets received or -1 if an
 * error occurred.
 */
static int
pcap_read_packet(pcap_t *handle, pcap_handler callback, u_char *userdata)
{
u_char *bp;
int offset;
#ifdef HAVE_PF_PACKET_SOCKETS
struct sockaddr_ll from;
struct sll_header *hdrp;
#else
struct sockaddr from;
#endif
#if defined(HAVE_PACKET_AUXDATA) && defined(HAVE_LINUX_TPACKET_AUXDATA_TP_VLAN_TCI)
struct iovec iov;
struct msghdr msg;
struct cmsghdr *cmsg;
union {
struct cmsghdr cmsg;
char buf[CMSG_SPACE(sizeof(struct tpacket_auxdata))];
} cmsg_buf;
#else /* defined(HAVE_PACKET_AUXDATA) && defined(HAVE_LINUX_TPACKET_AUXDATA_TP_VLAN_TCI) */
socklen_t fromlen;
#endif /* defined(HAVE_PACKET_AUXDATA) && defined(HAVE_LINUX_TPACKET_AUXDATA_TP_VLAN_TCI) */
int packet_len, caplen;
#ifdef HAVE_PF_RING
struct pfring_pkthdr pcap_header;
#else
struct pcap_pkthdr pcap_header;
#endif
// A note is needed here: when HAVE_PF_RING is defined, pcap_header is a struct pfring_pkthdr. Let's see how it differs from struct pcap_pkthdr. struct pfring_pkthdr is defined as:
//
// struct pfring_pkthdr {
//   /* pcap header */
//   struct timeval ts;    /* time stamp */
//   u_int32_t caplen;     /* length of portion present */
//   u_int32_t len;        /* length this packet (off wire) */
//   struct pfring_extended_pkthdr extended_hdr; /* PF_RING extended header */
// };
//
//
// while struct pcap_pkthdr is defined as:
// struct pcap_pkthdr {
//   struct timeval ts;    /* time stamp */
//   bpf_u_int32 caplen;   /* length of portion present */
//   bpf_u_int32 len;      /* length this packet (off wire) */
// };
//
// Comparing the two, pfring_pkthdr starts with the same three fields and adds one PF_RING extended header at the end.
#ifdef HAVE_PF_RING
if(handle->ring) {
do {
if (handle->break_loop) {
/*
* Yes - clear the flag that indicates that it
* has, and return -2 as an indication that we
* were told to break out of the loop.
*
* Patch courtesy of Michael Stiller <ms@2scale.net>
*/
handle->break_loop = 0;
return -2;
}
packet_len = pfring_recv(handle->ring, (char*)handle->buffer,
handle->bufsize,
&pcap_header,
1 /* wait_for_incoming_packet */);
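/* When PF_RING is in use, the packet is received with pfring_recv, which is discussed later. Otherwise recvmsg or recvfrom is used. The difference between the two is that recvmsg can also return ancillary (control) data through struct msghdr, which the code below sets up with msg_control, while recvfrom returns only the data and the source address.
*/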
if (packet_len > 0) {
bp = handle->buffer;
pcap_header.caplen = min(pcap_header.caplen, handle->bufsize);
caplen = pcap_header.caplen, packet_len = pcap_header.len;
goto pfring_pcap_read_packet;
}
}while (packet_len == -1 && (errno == EINTR || errno == ENETDOWN));
}
#endif
#ifdefHAVE_PF_PACKET_SOCKETS
/*
*If this is a cooked device, leave extra room for a
*fake packet header.
*/
if (handle->md.cooked)
offset = SLL_HDR_LEN;
else
offset = 0;
#else
/*
* This system doesn't have PF_PACKET sockets, so it doesn't
* support cooked devices.
*/
offset = 0;
#endif
/*
* Receive a single packet from the kernel.
* We ignore EINTR, as that might just be due to a signal
* being delivered - if the signal should interrupt the
* loop, the signal handler should call pcap_breakloop()
* to set handle->break_loop (we ignore it on other
* platforms as well).
* We also ignore ENETDOWN, so that we can continue to
* capture traffic if the interface goes down and comes
* back up again; comments in the kernel indicate that
* we'll just block waiting for packets if we try to
* receive from a socket that delivered ENETDOWN, and,
* if we're using a memory-mapped buffer, we won't even
* get notified of "network down" events.
*/
bp = handle->buffer + handle->offset;
#if defined(HAVE_PACKET_AUXDATA) && defined(HAVE_LINUX_TPACKET_AUXDATA_TP_VLAN_TCI)
msg.msg_name = &from;
msg.msg_namelen = sizeof(from);
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
msg.msg_control = &cmsg_buf;
msg.msg_controllen = sizeof(cmsg_buf);
msg.msg_flags = 0;
iov.iov_len = handle->bufsize - offset;
iov.iov_base = bp + offset;
#endif /* defined(HAVE_PACKET_AUXDATA) && defined(HAVE_LINUX_TPACKET_AUXDATA_TP_VLAN_TCI) */
do {
/*
* Has "pcap_breakloop()" been called?
*/
if (handle->break_loop) {
/*
* Yes - clear the flag that indicates that it has,
* and return PCAP_ERROR_BREAK as an indication that
* we were told to break out of the loop.
*/
handle->break_loop = 0;
return PCAP_ERROR_BREAK;
}
#if defined(HAVE_PACKET_AUXDATA) && defined(HAVE_LINUX_TPACKET_AUXDATA_TP_VLAN_TCI)
packet_len = recvmsg(handle->fd, &msg, MSG_TRUNC);
#else /* defined(HAVE_PACKET_AUXDATA) && defined(HAVE_LINUX_TPACKET_AUXDATA_TP_VLAN_TCI) */
fromlen = sizeof(from);
packet_len = recvfrom(
handle->fd, bp + offset,
handle->bufsize - offset, MSG_TRUNC,
(struct sockaddr *)&from, &fromlen);
#endif /* defined(HAVE_PACKET_AUXDATA) && defined(HAVE_LINUX_TPACKET_AUXDATA_TP_VLAN_TCI) */
} while (packet_len == -1 && errno == EINTR);
/* Check if an error occurred */
if (packet_len == -1) {
switch (errno) {
case EAGAIN:
return 0; /* no packet there */
case ENETDOWN:
/*
* The device on which we're capturing went away.
*
* XXX - we should really return
* PCAP_ERROR_IFACE_NOT_UP, but pcap_dispatch()
* etc. aren't defined to return that.
*/
snprintf(handle->errbuf, PCAP_ERRBUF_SIZE,
"The interface went down");
return PCAP_ERROR;
default:
snprintf(handle->errbuf, PCAP_ERRBUF_SIZE,
"recvfrom: %s", pcap_strerror(errno));
return PCAP_ERROR;
}
}
#ifdef HAVE_PF_PACKET_SOCKETS
if (!handle->md.sock_packet) {
/*
* Unfortunately, there is a window between socket() and
* bind() where the kernel may queue packets from any
* interface. If we're bound to a particular interface,
* discard packets not from that interface.
*
* (If socket filters are supported, we could do the
* same thing we do when changing the filter; however,
* that won't handle packet sockets without socket
* filter support, and it's a bit more complicated.
* It would save some instructions per packet, however.)
*/
if (handle->md.ifindex != -1 &&
from.sll_ifindex != handle->md.ifindex)
return 0;
/*
* Do checks based on packet direction.
* We can only do this if we're using PF_PACKET; the
* address returned for SOCK_PACKET is a "sockaddr_pkt"
* which lacks the relevant packet type information.
*/
if (from.sll_pkttype == PACKET_OUTGOING) {
/*
* Outgoing packet.
* If this is from the loopback device, reject it;
* we'll see the packet as an incoming packet as well,
* and we don't want to see it twice.
*/
if (from.sll_ifindex == handle->md.lo_ifindex)
return 0;
/*
* If the user only wants incoming packets, reject it.
*/
if (handle->direction == PCAP_D_IN)
return 0;
} else {
/*
* Incoming packet.
* If the user only wants outgoing packets, reject it.
*/
if (handle->direction == PCAP_D_OUT)
return 0;
}
}
#endif
#ifdef HAVE_PF_PACKET_SOCKETS
/*
* If this is a cooked device, fill in the fake packet header.
*/
if (handle->md.cooked) {
/*
* Add the length of the fake header to the length
* of packet data we read.
*/
packet_len += SLL_HDR_LEN;
hdrp = (struct sll_header *)bp;
hdrp->sll_pkttype = map_packet_type_to_sll_type(from.sll_pkttype);
hdrp->sll_hatype = htons(from.sll_hatype);
hdrp->sll_halen = htons(from.sll_halen);
memcpy(hdrp->sll_addr, from.sll_addr,
(from.sll_halen > SLL_ADDRLEN) ?
SLL_ADDRLEN :
from.sll_halen);
hdrp->sll_protocol = from.sll_protocol;
}
#ifdef HAVE_PF_RING
pfring_pcap_read_packet:
#endif
#if defined(HAVE_PACKET_AUXDATA) && defined(HAVE_LINUX_TPACKET_AUXDATA_TP_VLAN_TCI)
for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
struct tpacket_auxdata *aux;
unsigned int len;
struct vlan_tag *tag;
if (cmsg->cmsg_len < CMSG_LEN(sizeof(struct tpacket_auxdata)) ||
cmsg->cmsg_level != SOL_PACKET ||
cmsg->cmsg_type != PACKET_AUXDATA)
continue;
aux = (struct tpacket_auxdata *)CMSG_DATA(cmsg);
if (aux->tp_vlan_tci == 0)
continue;
len = packet_len > iov.iov_len ? iov.iov_len : packet_len;
if (len < 2 * ETH_ALEN)
break;
bp -= VLAN_TAG_LEN;
memmove(bp, bp + VLAN_TAG_LEN, 2 * ETH_ALEN);
tag = (struct vlan_tag *)(bp + 2 * ETH_ALEN);
tag->vlan_tpid = htons(ETH_P_8021Q);
tag->vlan_tci = htons(aux->tp_vlan_tci);
packet_len += VLAN_TAG_LEN;
}
#endif /* defined(HAVE_PACKET_AUXDATA) && defined(HAVE_LINUX_TPACKET_AUXDATA_TP_VLAN_TCI) */
#endif /* HAVE_PF_PACKET_SOCKETS */
/*
* XXX: According to the kernel source we should get the real
* packet len if calling recvfrom with MSG_TRUNC set. It does
* not seem to work here :(, but it is supported by this code
* anyway.
* To be honest the code RELIES on that feature so this is really
* broken with 2.2.x kernels.
* I spend a day to figure out what's going on and I found out
* that the following is happening:
*
* The packet comes from a random interface and the packet_rcv
* hook is called with a clone of the packet. That code inserts
* the packet into the receive queue of the packet socket.
* If a filter is attached to that socket that filter is run
* first - and there lies the problem. The default filter always
* cuts the packet at the snaplen:
*
* # tcpdump -d
* (000) ret #68
*
* So the packet filter cuts down the packet. The recvfrom call
* says "hey, it's only 68 bytes, it fits into the buffer" with
* the result that we don't get the real packet length. This
* is valid at least until kernel 2.2.17pre6.
*
* We currently handle this by making a copy of the filter
* program, fixing all "ret" instructions with non-zero
* operands to have an operand of 65535 so that the filter
* doesn't truncate the packet, and supplying that modified
* filter to the kernel.
*/
caplen = packet_len;
if (caplen > handle->snapshot)
caplen = handle->snapshot;
/* Run the packet filter if not using kernel filter */
if (!handle->md.use_bpf && handle->fcode.bf_insns) {
if (bpf_filter(handle->fcode.bf_insns, bp,
packet_len, caplen) == 0)
{
/* rejected by filter */
return 0;
}
}
/* Fill in our own header data */
#ifdef HAVE_PF_RING
if (!handle->ring) {
#endif
if (ioctl(handle->fd, SIOCGSTAMP, &pcap_header.ts) == -1) {
snprintf(handle->errbuf, PCAP_ERRBUF_SIZE,
"SIOCGSTAMP: %s", pcap_strerror(errno));
return PCAP_ERROR;
}
pcap_header.caplen = caplen;
pcap_header.len = packet_len;
#ifdef HAVE_PF_RING
}
#endif
/*
* Count the packet.
*
* Arguably, we should count them before we check the filter,
* as on many other platforms "ps_recv" counts packets
* handed to the filter rather than packets that passed
* the filter, but if filtering is done in the kernel, we
* can't get a count of packets that passed the filter,
* and that would mean the meaning of "ps_recv" wouldn't
* be the same on all Linux systems.
*
* XXX - it's not the same on all systems in any case;
* ideally, we should have a "get the statistics" call
* that supplies more counts and indicates which of them
* it supplies, so that we supply a count of packets
* handed to the filter only on platforms where that
* information is available.
*
* We count them here even if we can get the packet count
* from the kernel, as we can only determine at run time
* whether we'll be able to get it from the kernel (if
* HAVE_TPACKET_STATS isn't defined, we can't get it from
* the kernel, but if it is defined, the library might
* have been built with a 2.4 or later kernel, but we
* might be running on a 2.2[.x] kernel without Alexey
* Kuznetzov's turbopacket patches, and thus the kernel
* might not be able to supply those statistics). We
* could, I guess, try, when opening the socket, to get
* the statistics, and if we can not increment the count
* here, but it's not clear that always incrementing
* the count is more expensive than always testing a flag
* in memory.
*
* We keep the count in "md.packets_read", and use that for
* "ps_recv" if we can't get the statistics from the kernel.
* We do that because, if we *can* get the statistics from
* the kernel, we use "md.stat.ps_recv" and "md.stat.ps_drop"
* as running counts, as reading the statistics from the
* kernel resets the kernel statistics, and if we directly
* increment "md.stat.ps_recv" here, that means it will
* count packets *twice* on systems where we can get kernel
* statistics - once here, and once in pcap_stats_linux().
*/
handle->md.packets_read++;
/* Call the user supplied callback function */
#if defined(HAVE_PF_RING)
{
struct myts {
struct timeval ts;
u_int32_t caplen, len;
u_int64_t ns;
};
struct myts myhdr;
myhdr.ts.tv_sec = pcap_header.ts.tv_sec, myhdr.ts.tv_usec = pcap_header.ts.tv_usec;
myhdr.caplen = pcap_header.caplen, myhdr.len = pcap_header.len;
myhdr.ns = pcap_header.extended_hdr.timestamp_ns;
callback(userdata, (struct pcap_pkthdr *)&myhdr, bp);
}
#else
callback(userdata, &pcap_header, bp);
#endif
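/*
* The function is long, but read from top to bottom it is easy enough to
* follow: it receives a packet with whichever call matches the socket type,
* and then, depending on whether HAVE_PF_RING is defined, passes the callback
* a differently built header, as the code above shows.
*/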
return 1;
}
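The cast in the PF_RING branch above works because the local struct myts begins with the same three fields as pcap_pkthdr, so a callback that only knows pcap_pkthdr reads the right values and never looks at the trailing nanosecond field. Here is a minimal sketch of that idea. Every name in it (demo_pkthdr, demo_ext_pkthdr, demo_deliver and so on) is made up for illustration and is not part of libpcap or PF_RING. The timestamp is simplified to two longs, and the sketch uses the stricter variant of embedding the base header as the first member rather than repeating its fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct pcap_pkthdr (timestamp as two longs). */
struct demo_pkthdr {
    long     ts_sec;   /* seconds part of the timestamp */
    long     ts_usec;  /* microseconds part of the timestamp */
    uint32_t caplen;   /* bytes actually captured */
    uint32_t len;      /* bytes on the wire */
};

/* Hypothetical extended header: base header first, extra data after it. */
struct demo_ext_pkthdr {
    struct demo_pkthdr base; /* must stay the first member */
    uint64_t           ns;   /* extra nanosecond timestamp */
};

/* Callback type that only knows about the base header. */
typedef void (*demo_callback)(void *user, const struct demo_pkthdr *h,
                              const unsigned char *bytes);

/* Build an extended header and hand the callback a pointer to its base part. */
void demo_deliver(demo_callback cb, void *user, const unsigned char *bytes,
                  uint32_t wire_len, uint32_t snaplen, uint64_t ns)
{
    struct demo_ext_pkthdr hdr;

    assert(cb != NULL);
    hdr.base.ts_sec  = (long)(ns / 1000000000ULL);
    hdr.base.ts_usec = (long)((ns % 1000000000ULL) / 1000ULL);
    hdr.base.len     = wire_len;
    /* Same clamp as in pcap_read_packet: never report more than snaplen. */
    hdr.base.caplen  = wire_len > snaplen ? snaplen : wire_len;
    hdr.ns           = ns;
    cb(user, &hdr.base, bytes);
}

/* Example callback: copies the base header into caller-provided storage. */
void demo_store_header(void *user, const struct demo_pkthdr *h,
                       const unsigned char *bytes)
{
    (void)bytes;
    *(struct demo_pkthdr *)user = *h;
}

/* An extension-aware consumer can recover the extra field from the base pointer. */
uint64_t demo_ns_from_base(const struct demo_pkthdr *h)
{
    return ((const struct demo_ext_pkthdr *)h)->ns;
}
```

Calling demo_deliver(demo_store_header, &seen, pkt, 1500, 96, 1500000000ULL) with a struct demo_pkthdr seen leaves seen.caplen at 96 and seen.len at 1500. A plain callback keeps working unchanged, while one that knows about the extension can still reach the nanosecond value.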
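After all that, pcap_open_live is still not finished. The dozens of pages so far have covered just one of the functions it calls, namely pcap_create as implemented in pcap-linux.c. libpcap is broad and deep, and adding PF_RING makes it feel deeper still. Since we are not done, let's carry on with the other function pcap_open_live calls: pcap_activate. The code reached through pcap_create creates and binds the socket and installs a number of callbacks. So what does pcap_activate do? Let the source speak. I love Linux, I love open source.
int pcap_activate(pcap_t *p)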
{
int status;
status = p->activate_op(p);
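/*
* What is activate_op? Searching for its declaration shows it is a function
* pointer. Where is it assigned? Searching the source again turns up the
* assignment in pcap-linux.c:
* handle->activate_op = pcap_activate_linux;
* So the activate_op callback that was pointed at pcap_activate_linux on the
* pcap_create path is finally invoked here. pcap_create only assigns the
* callback. The call itself happens at this point, which means everything
* analysed earlier only runs now.
*/
if (status >= 0) /* pcap_activate_linux returns >= 0 on success */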
p->activated = 1;
else {
if (p->errbuf[0] == '\0') {
/*
* No error message supplied by the activate routine;
* for the benefit of programs that don't specially
* handle errors other than PCAP_ERROR, return the
* error message corresponding to the status.
*/
snprintf(p->errbuf, PCAP_ERRBUF_SIZE, "%s",
pcap_statustostr(status));
}
/*
* Undo any operation pointer setting, etc. done by
* the activate operation.
*/
initialize_ops(p);
}
return (status);
}
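The pattern in pcap_activate is a two-step one: a create step stores a function pointer, and a later activate step calls through it, supplies a default error message if the routine left none, and resets the operation pointers on failure. A reduced sketch of that shape follows. The demo_* names, the single read_op slot and the generic error text are assumptions for illustration, not libpcap's real structures or messages.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_ERRBUF_SIZE 256
#define DEMO_ERROR (-1)

struct demo_handle;
typedef int (*demo_op_fn)(struct demo_handle *);

/* Hypothetical, heavily reduced stand-in for pcap_t. */
struct demo_handle {
    demo_op_fn activate_op; /* set at create time, called at activate time */
    demo_op_fn read_op;     /* set by the activate routine */
    int        activated;
    char       errbuf[DEMO_ERRBUF_SIZE];
};

/* Placeholder read operation used before (or after a failed) activation. */
int demo_read_not_initialized(struct demo_handle *h) { (void)h; return DEMO_ERROR; }

/* Stand-in for a real read operation. */
int demo_read_real(struct demo_handle *h) { (void)h; return 1; }

/* Activate routine that succeeds and installs the real read operation. */
int demo_activate_ok(struct demo_handle *h)
{
    h->read_op = demo_read_real;
    return 0;
}

/* Activate routine that fails after partly setting up, leaving no message. */
int demo_activate_fail(struct demo_handle *h)
{
    h->read_op = demo_read_real;
    return DEMO_ERROR;
}

/* "create": only records which routine to call later; nothing is opened here. */
void demo_create(struct demo_handle *h, demo_op_fn activate)
{
    assert(h != NULL && activate != NULL);
    memset(h, 0, sizeof(*h));
    h->activate_op = activate;
    h->read_op = demo_read_not_initialized;
}

/* "activate": calls through the pointer, mirroring the shape of pcap_activate. */
int demo_activate(struct demo_handle *h)
{
    int status = h->activate_op(h);

    if (status >= 0) {
        h->activated = 1;
    } else {
        if (h->errbuf[0] == '\0') /* no message from the routine: add a generic one */
            snprintf(h->errbuf, DEMO_ERRBUF_SIZE,
                     "activation failed (status %d)", status);
        h->read_op = demo_read_not_initialized; /* undo the pointer setup */
    }
    return status;
}
```

With demo_create(&h, demo_activate_fail) followed by demo_activate(&h), the handle ends up not activated, with a generic message in errbuf and read_op back on the placeholder. That is the same cleanup role initialize_ops plays in the listing above.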
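That finally wraps up pcap_open_live, and it is time for me to get dinner. There is still plenty left to analyse, so let's queue it up, starting with pcap_next and its relatives. The socket has been created and bound, so it is time to capture some data. The capture callback is already in place as well: it is pcap_read_linux, which in turn is pcap_read_packet. My guess is that pcap_next and friends will end up calling this callback. We shall see. Dinner first, though, since nobody works well on an empty stomach. See you in a bit.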
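Ahead of that analysis, the return values seen in pcap_read_packet (1 for a delivered packet, 0 for nothing this time, a negative value for an error or a break) already hint at how a caller could drive it. The sketch below is only a guess at that shape and is not libpcap's actual pcap_loop code. The demo_* names and the scripted reader are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ERROR       (-1)
#define DEMO_ERROR_BREAK (-2)

/* Hypothetical reader: returns 1 (packet delivered), 0 (nothing), or < 0. */
typedef int (*demo_read_fn)(void *state);

/*
 * Keep calling read_one until cnt packets were delivered; cnt <= 0 means
 * "keep going until the reader reports an error or a break".
 * Returns 0 once cnt is reached, otherwise the reader's negative status.
 */
int demo_loop(demo_read_fn read_one, void *state, int cnt)
{
    int delivered = 0;

    assert(read_one != NULL);
    for (;;) {
        int n = read_one(state);
        if (n < 0)
            return n;          /* error or break: stop immediately */
        delivered += n;        /* n == 0 simply means "try again" */
        if (cnt > 0 && delivered >= cnt)
            return 0;
    }
}

/* Scripted reader for demonstration: replays a fixed list of return values. */
struct demo_script {
    const int *results;
    size_t     len;
    size_t     pos;
    int        calls;
};

int demo_scripted_read(void *state)
{
    struct demo_script *s = state;

    s->calls++;
    if (s->pos >= s->len)
        return DEMO_ERROR_BREAK; /* script exhausted: behave like a break */
    return s->results[s->pos++];
}
```

Feeding demo_loop a script of 0, 1, 0, 0, 1, 1 with cnt set to 2 makes it return 0 after the fifth read, since the zero results are just retried.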