

Kernel version: Linux 5.2.8

Root filesystem: BusyBox 1.25.0





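  • Link: a physical line from one node to an adjacent node, with no other switching node in between;
  • Local area network (LAN): a network of computers interconnected within a limited area (such as a school or a factory);
  • Wide area network (WAN): a data communication network that spans regions, usually covering a country or region; a WAN in effect joins LANs into a larger network;
  • Internet: the WANs of many countries joined together to form what is, in practice, the largest network;
  • Ethernet: a technical standard for LAN communication, and currently the most widely used LAN technology;
1.1 Concepts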









  • The figure above lists only some typical protocols at each layer of the TCP/IP model, not all of them.
1.2 Physical layer




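  • The physical media that carry the information, such as twisted pair, coaxial cable, optical fiber, and wireless channels, are not part of the physical layer protocols; they sit below the physical layer.
1.3 Data link layer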





1.4 Network layer



1.5 Transport layer


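  • Multiplexing means that several application layer processes can use the services of the transport layer below them at the same time.
  • Demultiplexing means that the transport layer delivers each piece of received data to the corresponding application layer process above it.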


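1.5.1 Transmission Control Protocol (TCP)
  • TCP is a connection-oriented transport layer protocol: an application must establish a TCP connection before using TCP and must release that connection once the data transfer is complete; the three-way handshake and four-way teardown so often asked about in interviews refer to exactly this;
  • Each TCP connection has exactly two endpoints, and an endpoint of a TCP connection is called a socket. As defined in RFC 793, a port number concatenated with an IP address forms a socket;
  • TCP provides reliable delivery: data sent over a TCP connection arrives error-free, without loss, without duplication, and in order;
  • TCP provides full-duplex communication;
  • TCP is byte-stream oriented; the unit of transmission is the segment;
1.5.2 User Datagram Protocol (UDP)
  • UDP is connectionless: no connection is established before data is sent, which reduces overhead and the delay before sending;
  • UDP uses best-effort delivery, so reliable delivery is not guaranteed;
  • UDP is message oriented; the unit of transmission is the user datagram. The sending UDP adds a header to each message handed down by the application and passes it straight to the IP layer; it neither merges nor splits these messages, so message boundaries are preserved;
  • UDP has no congestion control;
  • UDP supports one-to-one, one-to-many, many-to-one, and many-to-many communication;
  • UDP has a small header overhead of only 8 bytes;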
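
To make the 8-byte header and the "neither merged nor split" behaviour concrete, here is a minimal user-space sketch in C. It is not kernel code: `udp_build`, `put_be16`, and `get_be16` are hypothetical helper names introduced only for illustration, and the checksum is deliberately left at 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names; an illustration, not a real UDP implementation. */
enum { UDP_HDR_LEN = 8 };

/* Store a 16-bit value in network byte order (big-endian). */
static void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

/* Read a 16-bit big-endian value. */
static uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Prepend an 8-byte UDP header to exactly one application message.
 * One message becomes one datagram, so the boundary is preserved.
 * The checksum field is left as 0 in this sketch.
 * Returns the datagram length, or 0 if it does not fit.
 */
static size_t udp_build(uint8_t *out, size_t cap,
                        uint16_t sport, uint16_t dport,
                        const void *msg, size_t msg_len)
{
    size_t total = UDP_HDR_LEN + msg_len;

    assert(out != NULL && (msg != NULL || msg_len == 0));
    if (total > cap || total > 65535)
        return 0;
    put_be16(out + 0, sport);            /* source port */
    put_be16(out + 2, dport);            /* destination port */
    put_be16(out + 4, (uint16_t)total);  /* length = header + data */
    put_be16(out + 6, 0);                /* checksum not computed here */
    if (msg_len)
        memcpy(out + UDP_HDR_LEN, msg, msg_len);
    return total;
}
```

Calling `udp_build()` once per message yields one datagram per message, which is the message-oriented behaviour described in the list above.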

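If the difference between byte-stream oriented and message oriented is unclear, see the article "如何理解是 TCP 面向字节流协议" (How to understand TCP as a byte-stream oriented protocol).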

1.6 Application layer


1.7 TCP/IP protocols


1.7.1 TCP segment format




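  • Source port and destination port: 2 bytes each, holding the source and destination port numbers;
  • Sequence number (seq): 4 bytes, with range [0, $2^{32}$-1]. TCP is byte-stream oriented, and every byte of the stream carried by a TCP connection is numbered in order. The initial sequence number of the stream must be set when the connection is established. The sequence number field in the header is the number of the first data byte in this segment. For example, if a segment's sequence number is 301 and it carries 100 bytes of data, the first data byte is numbered 301 and the last is numbered 400, so the data of the next segment (if there is one) starts at 401, that is, the next segment's sequence number field is 401;
  • Acknowledgment number (ack): 4 bytes; the sequence number of the first data byte expected in the next segment from the peer;
  • Header length (data offset): 4 bits; it tells how far the start of the data is from the start of the TCP segment, in units of 4 bytes. Because the header contains an options field of variable length, this field is necessary;
  • URG: whether this segment carries urgent data. URG=1 means it does, and only then is the urgent pointer field valid;
  • ACK: whether the acknowledgment number field is valid. The field is valid only when ACK=1; TCP requires ACK to be 1 once the connection has been established;
  • PSH: whether the receiver should push the data to the upper layer immediately. A value of 1 means the data should be delivered to the upper layer at once instead of being buffered;
  • RST: whether to reset the connection. RST=1 signals a serious error on the TCP connection (such as a host crash); the connection must be released and then re-established;
  • SYN: used during connection establishment to synchronize sequence numbers. SYN=1 with ACK=0 marks a connection request segment; SYN=1 with ACK=1 means the peer accepts the connection. SYN=1 therefore indicates either a connection request or a connection acceptance, and SYN is 1 only in the first two handshakes;
  • FIN: whether the sender has finished sending. FIN=1 means the sender of this segment has no more data to send and requests release of the connection;
  • Window: 2 bytes; an integer in [0, $2^{16}$-1]. It is the receive window of the side that sends this segment;
  • Checksum: 2 bytes. The checksum covers both the header and the data;
  • Urgent pointer: 2 bytes. It is meaningful only when URG=1 and gives the number of bytes of urgent data in this segment;
  • Options: variable length, up to 40 bytes. When no options are used, the TCP header is 20 bytes long;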
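
The field layout above can be sketched as a small user-space decoder in C. This is an illustration under stated assumptions, not kernel code: `struct tcp_view`, `tcp_parse`, and `tcp_next_seq` are hypothetical names, and only the fixed 20-byte part of the header is decoded (options are skipped).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits in byte 13 of the TCP header. */
enum {
    TCP_FIN = 0x01, TCP_SYN = 0x02, TCP_RST = 0x04,
    TCP_PSH = 0x08, TCP_ACK = 0x10, TCP_URG = 0x20
};

/* Hypothetical decoded view of a TCP header, for illustration only. */
struct tcp_view {
    uint16_t sport, dport;
    uint32_t seq, ack;
    unsigned hdr_len;   /* in bytes: 4-bit header length field * 4 */
    uint8_t  flags;
    uint16_t window;
};

static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t rd32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode the fixed part of the header; 0 on success, -1 if malformed. */
static int tcp_parse(const uint8_t *seg, size_t len, struct tcp_view *v)
{
    assert(seg != NULL && v != NULL);
    if (len < 20)
        return -1;
    v->sport   = rd16(seg);
    v->dport   = rd16(seg + 2);
    v->seq     = rd32(seg + 4);
    v->ack     = rd32(seg + 8);
    v->hdr_len = (unsigned)(seg[12] >> 4) * 4;
    v->flags   = seg[13] & 0x3f;
    v->window  = rd16(seg + 14);
    if (v->hdr_len < 20 || v->hdr_len > len)
        return -1;
    return 0;
}

/*
 * Sequence number at which the next segment starts. Arithmetic wraps
 * modulo 2^32; SYN and FIN each consume one sequence number.
 */
static uint32_t tcp_next_seq(const struct tcp_view *v, size_t seg_len)
{
    uint32_t next = v->seq + (uint32_t)(seg_len - v->hdr_len);

    if (v->flags & TCP_SYN)
        next++;
    if (v->flags & TCP_FIN)
        next++;
    return next;
}
```

For a segment with sequence number 301 carrying 100 data bytes, `tcp_next_seq()` gives 401, matching the worked example in the sequence number entry.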
1.7.2 IP datagram format




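  • Version: 4 bits; the version of the IP protocol. Both sides must use the same version. The version in widespread use is 4 (IPv4); version 6 is IPv6;
  • Header length: 4 bits, so the largest value is 15. The unit is 32 bits (4 bytes, the same unit as the header length field in the TCP header). Because the fixed part of the IP header is 20 bytes, the minimum value is 5 (0101). A value of 15 (1111) means a 60-byte header. When the header is not a multiple of 4 bytes long, the trailing padding field must bring it up to a multiple of 4 bytes;
  • Differentiated services (tos): 1 byte, used to obtain better service. Older standards called this field type of service, but it was never really used. It only takes effect when differentiated services are in use, and is normally left unused;
  • Total length (totlen): 2 bytes; the length of header plus data, in bytes, so the maximum is 65535 bytes. The link layer below IP limits the data field of a frame to a maximum length called the maximum transmission unit (MTU). When an IP datagram is encapsulated in a link layer frame, its total length (header plus data) must not exceed the MTU of that link layer; Ethernet specifies an MTU of 1500 bytes. A datagram longer than the link layer MTU must be fragmented;
  • Identification: 2 bytes. The network layer software keeps a counter in memory, increments it for every datagram it produces, and writes the value into this field. This "identification" is not a sequence number like the one in the TCP header: IP is a connectionless service and datagrams are not received in order. When a datagram exceeds the MTU and must be fragmented, the identification value is copied into every fragment, and the shared value lets the fragments be reassembled correctly into the original datagram;
  • Flags (flag): 3 bits, of which only two are currently meaningful. The middle bit is DF (Don't Fragment); fragmentation is allowed only when DF=0. The lowest bit is MF (More Fragments): MF=1 means more fragments follow, and MF=0 means this is the last fragment;
  • Fragment offset (offsetfrag): 13 bits. It gives the position of a fragment within the original datagram, that is, where the fragment starts relative to the beginning of the user data field. The unit is 8 bytes, so the length of every fragment except the last must be a multiple of 8 bytes;
  • Time to live (TTL): 8 bits; the lifetime of the datagram in the network. It is set by the source of the datagram, to keep undeliverable datagrams from circulating in the internet forever. A router decrements TTL by 1 before forwarding a datagram; if TTL reaches zero, the datagram is discarded and not forwarded;
  • Protocol: 8 bits; identifies the protocol carried in the data part, so that the IP layer on the destination host knows which protocol to hand the data to; common values include 1 for ICMP, 6 for TCP, and 17 for UDP;
  • Header checksum (checksum): 16 bits. It covers only the header of the datagram, not the data (unlike the checksums in UDP and TCP). Every router a datagram passes through must recompute the header checksum, because fields such as TTL, flags, and fragment offset may change;

  • Source address: 32 bits (4 bytes);

  • Destination address: 32 bits (4 bytes);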
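
The header checksum and the per-hop TTL update described above can be sketched in user-space C as follows. This is a minimal sketch, not the kernel's implementation: `struct ip_view`, `ip_checksum`, `ip_parse`, and `ip_forward_hop` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of 16-bit big-endian words over the header. */
static uint16_t ip_checksum(const uint8_t *hdr, size_t hdr_len)
{
    uint32_t sum = 0;
    size_t i;

    assert(hdr != NULL && hdr_len % 2 == 0);
    for (i = 0; i < hdr_len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Hypothetical decoded view of an IPv4 header. */
struct ip_view {
    unsigned version, hdr_len, total_len;
    uint16_t id;
    int df, mf;
    unsigned frag_off;      /* in bytes: 13-bit field * 8 */
    uint8_t ttl, protocol;
};

/* Decode and verify an IPv4 header; 0 on success, -1 otherwise. */
static int ip_parse(const uint8_t *p, size_t len, struct ip_view *v)
{
    if (len < 20)
        return -1;
    v->version   = p[0] >> 4;
    v->hdr_len   = (unsigned)(p[0] & 0x0f) * 4;
    v->total_len = (unsigned)((p[2] << 8) | p[3]);
    v->id        = (uint16_t)((p[4] << 8) | p[5]);
    v->df        = (p[6] >> 6) & 1;
    v->mf        = (p[6] >> 5) & 1;
    v->frag_off  = (unsigned)(((p[6] & 0x1f) << 8) | p[7]) * 8;
    v->ttl       = p[8];
    v->protocol  = p[9];
    if (v->version != 4 || v->hdr_len < 20 || v->hdr_len > len)
        return -1;
    /* Summing a header that includes a correct checksum yields 0. */
    return ip_checksum(p, v->hdr_len) == 0 ? 0 : -1;
}

/* Per-hop work of a router: decrement TTL, then recompute the checksum. */
static int ip_forward_hop(uint8_t *p, unsigned hdr_len)
{
    uint16_t c;

    if (p[8] <= 1)
        return -1;          /* TTL would reach zero: drop the datagram */
    p[8]--;
    p[10] = 0;
    p[11] = 0;
    c = ip_checksum(p, hdr_len);
    p[10] = (uint8_t)(c >> 8);
    p[11] = (uint8_t)(c & 0xff);
    return 0;
}
```

Note that only the header bytes are summed, which is why a router can update TTL and the checksum without touching the payload.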


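1.7.3 Ethernet frame

The packets carried on an Ethernet link are called Ethernet frames. A frame begins with a preamble and a start frame delimiter, followed by an Ethernet header that gives the destination and source as MAC addresses. The middle of the frame is its payload: a packet of another protocol, such as IP, including that protocol's own headers.

An Ethernet frame ends with a 32-bit redundancy check code used to detect corruption during transmission. The frame structure is shown in the figure: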


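  • Preamble: lets the receiving adapter quickly adjust its clock to the sender's frequency when a MAC frame arrives. It is 7 bytes of alternating 1s and 0s;
  • Start frame delimiter: marks the start of the frame; 1 byte. The first 6 bits alternate 1 and 0, and the final two consecutive 1s tell the receiving adapter: "frame data is coming, get ready to receive";
  • Destination MAC address: the physical (MAC) address of the adapter that should receive the frame; 6 bytes (48 bits). When a NIC receives a frame, it first checks whether the destination address matches its own physical address; if so it processes the frame further, otherwise it discards the frame;
  • Source MAC address: the physical (MAC) address of the adapter that sent the frame; 6 bytes (48 bits);
  • Type: the type of the upper-layer protocol. Because there are many upper-layer protocols, this field must be set so the receiver knows which protocol to hand the data to. For example, 0x0800 means the data goes to the IP protocol;
  • Data: also called the payload; the data handed to the upper layer. The data field is at least 46 bytes and at most 1500 bytes long; shorter data is padded to the minimum length. The maximum is also known as the maximum transmission unit (MTU);
  • Frame check sequence (FCS): detects errors in the frame; 4 bytes (32 bits). The sender computes the cyclic redundancy check (CRC) of the frame and writes it into the frame. The receiver recomputes the CRC and compares it with the FCS field. If the two differ, data was lost or altered in transit and the frame is discarded; Ethernet itself does not retransmit, so recovery is left to upper-layer protocols;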
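
The padding rule and the FCS check can be sketched in user-space C. This is an illustration only: `frame_build`, `frame_fcs_ok`, and `crc32_ieee` are hypothetical names, the preamble and start frame delimiter are left out because hardware normally adds them, and storing the FCS least significant byte first is an assumption of this sketch about how the frame sits in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { FRAME_HDR = 14, FRAME_MIN_DATA = 46, FRAME_MAX_DATA = 1500, FRAME_FCS = 4 };

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), the CRC used by Ethernet. */
static uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xffffffffu;
    size_t i;
    int b;

    for (i = 0; i < len; i++) {
        crc ^= data[i];
        for (b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xedb88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/*
 * Build destination + source + type + padded data + FCS.
 * Returns the frame length, or 0 if the data is too long or does not fit.
 */
static size_t frame_build(uint8_t *out, size_t cap,
                          const uint8_t dst[6], const uint8_t src[6],
                          uint16_t type, const void *data, size_t n)
{
    size_t body = n < FRAME_MIN_DATA ? FRAME_MIN_DATA : n;
    size_t total = FRAME_HDR + body + FRAME_FCS;
    uint32_t fcs;

    assert(out != NULL && dst != NULL && src != NULL);
    if (n > FRAME_MAX_DATA || total > cap)
        return 0;
    memcpy(out, dst, 6);
    memcpy(out + 6, src, 6);
    out[12] = (uint8_t)(type >> 8);
    out[13] = (uint8_t)(type & 0xff);
    memset(out + FRAME_HDR, 0, body);        /* zero padding up to 46 bytes */
    if (n)
        memcpy(out + FRAME_HDR, data, n);
    fcs = crc32_ieee(out, FRAME_HDR + body);
    out[FRAME_HDR + body + 0] = (uint8_t)(fcs & 0xff);
    out[FRAME_HDR + body + 1] = (uint8_t)((fcs >> 8) & 0xff);
    out[FRAME_HDR + body + 2] = (uint8_t)((fcs >> 16) & 0xff);
    out[FRAME_HDR + body + 3] = (uint8_t)((fcs >> 24) & 0xff);
    return total;
}

/* Receiver side: recompute the CRC and compare it with the stored FCS. */
static int frame_fcs_ok(const uint8_t *f, size_t len)
{
    uint32_t stored;

    if (len < FRAME_HDR + FRAME_MIN_DATA + FRAME_FCS)
        return 0;
    stored = (uint32_t)f[len - 4] | ((uint32_t)f[len - 3] << 8) |
             ((uint32_t)f[len - 2] << 16) | ((uint32_t)f[len - 1] << 24);
    return crc32_ieee(f, len - 4) == stored;
}
```

A 2-byte payload therefore produces a 64-byte frame (14 + 46 + 4), the minimum frame size without preamble.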
1.7.4 ICMP message format






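  • Type: 1 byte; commonly used ICMP message types include 0 (echo reply), 3 (destination unreachable), 8 (echo request), and 11 (time exceeded);
  • Code: 1 byte;
  • Checksum: 2 bytes;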
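
As a small illustration of the three fields above, the following user-space C sketch builds an ICMP echo request (type 8, code 0). The names `icmp_echo_build` and `inet_checksum` are hypothetical; the identifier and sequence number that follow the checksum belong to the echo message format and are not listed in the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { ICMP_TYPE_ECHO_REPLY = 0, ICMP_TYPE_ECHO_REQUEST = 8, ICMP_ECHO_HDR = 8 };

/* Internet checksum over a buffer of any length (odd byte is zero padded). */
static uint16_t inet_checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {
        sum += (uint32_t)((p[0] << 8) | p[1]);
        p += 2;
        len -= 2;
    }
    if (len)
        sum += (uint32_t)(p[0] << 8);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Build an echo request: type, code, checksum, identifier, sequence, data.
 * The checksum covers the whole ICMP message and is computed with the
 * checksum field set to zero. Returns the message length, or 0 on error.
 */
static size_t icmp_echo_build(uint8_t *out, size_t cap,
                              uint16_t id, uint16_t seq,
                              const void *data, size_t n)
{
    uint16_t c;

    assert(out != NULL && (data != NULL || n == 0));
    if (ICMP_ECHO_HDR + n > cap)
        return 0;
    out[0] = ICMP_TYPE_ECHO_REQUEST;    /* type */
    out[1] = 0;                         /* code */
    out[2] = 0;                         /* checksum placeholder */
    out[3] = 0;
    out[4] = (uint8_t)(id >> 8);
    out[5] = (uint8_t)(id & 0xff);
    out[6] = (uint8_t)(seq >> 8);
    out[7] = (uint8_t)(seq & 0xff);
    if (n)
        memcpy(out + ICMP_ECHO_HDR, data, n);
    c = inet_checksum(out, ICMP_ECHO_HDR + n);
    out[2] = (uint8_t)(c >> 8);
    out[3] = (uint8_t)(c & 0xff);
    return ICMP_ECHO_HDR + n;
}
```

A receiver can validate such a message by running `inet_checksum()` over all of it: a result of 0 means the checksum field is consistent with the contents.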


2.1 Overview





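  • Network protocol interface layer: gives network layer protocols a uniform interface for sending and receiving packets. When upper-layer ARP or IP needs to send data, it calls dev_queue_xmit() in this layer to pass the packet down; received packets are passed up through netif_rx(). Both directions use sk_buff as the data carrier;
  • Network device interface layer: describes a network device with the net_device structure, which is the container for the functions of the device driver function layer and presents a uniform interface above different kinds of hardware;
  • Device driver function layer: drives the network hardware to carry out each function; its functions are the concrete members behind the net_device structure of the network device interface layer. The core functions are sending and receiving packets: transmission is started through hard_start_xmit() (in Linux 5.2, the ndo_start_xmit() callback in net_device_ops), and reception is triggered by the device's interrupt;
  • Network device and media layer: the physical entities that actually send and receive packets, namely the network adapter and the transmission medium; the adapter is physically driven by the functions of the device driver function layer. On Linux, both the device and the medium may be virtual.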
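
How the four layers hand packets to each other can be modelled in a few lines of user-space C. Everything prefixed `toy_` is hypothetical and exists only in this sketch; the names merely mirror the roles of sk_buff, net_device, dev_queue_xmit(), and netif_rx() from the list above, and the "driver" is a loopback that receives whatever it sends.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_netdev;

/* Data carrier, playing the role of sk_buff. */
struct toy_skb {
    struct toy_netdev *dev;
    size_t len;
    unsigned char data[64];
};

/* Driver entry points, playing the role of the driver function layer. */
struct toy_netdev_ops {
    int (*start_xmit)(struct toy_skb *skb, struct toy_netdev *dev);
};

/* Device description, playing the role of net_device. */
struct toy_netdev {
    const char *name;
    const struct toy_netdev_ops *ops;
    unsigned long tx_packets, rx_packets;
};

/* Stand-in for the protocol layer: remembers the last packet passed up. */
static struct toy_skb toy_last_rx;

/* Receive path: the driver hands a packet up to the protocol layer. */
static int toy_netif_rx(struct toy_skb *skb)
{
    skb->dev->rx_packets++;
    toy_last_rx = *skb;
    return 0;
}

/* Transmit path: one uniform entry point, whatever the hardware is. */
static int toy_dev_queue_xmit(struct toy_skb *skb)
{
    assert(skb != NULL && skb->dev != NULL);
    assert(skb->dev->ops != NULL && skb->dev->ops->start_xmit != NULL);
    return skb->dev->ops->start_xmit(skb, skb->dev);
}

/* A loopback-style "driver": a sent packet is received right away. */
static int toy_loop_xmit(struct toy_skb *skb, struct toy_netdev *dev)
{
    dev->tx_packets++;
    return toy_netif_rx(skb);
}

static const struct toy_netdev_ops toy_loop_ops = {
    .start_xmit = toy_loop_xmit,
};
```

The protocol side only ever calls `toy_dev_queue_xmit()`; swapping `toy_loop_ops` for another ops table changes the hardware behaviour without touching the caller, which is the point of the device interface layer.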




2.2 Core data structures
2.2.1 struct sk_buff





/**
 *      struct sk_buff - socket buffer
 *      @next: Next buffer in list
 *      @prev: Previous buffer in list
 *      @tstamp: Time we arrived/left
 *      @rbnode: RB tree node, alternative to next/prev for netem/tcp
 *      @sk: Socket we are owned by
 *      @dev: Device we arrived on/are leaving by
 *      @cb: Control buffer. Free for use by every layer. Put private vars here
 *      @_skb_refdst: destination entry (with norefcount bit)
 *      @sp: the security path, used for xfrm
 *      @len: Length of actual data
 *      @data_len: Data length
 *      @mac_len: Length of link layer header
 *      @hdr_len: writable header length of cloned skb
 *      @csum: Checksum (must include start/offset pair)
 *      @csum_start: Offset from skb->head where checksumming should start
 *      @csum_offset: Offset from csum_start where checksum should be stored
 *      @priority: Packet queueing priority
 *      @ignore_df: allow local fragmentation
 *      @cloned: Head may be cloned (check refcnt to be sure)
 *      @ip_summed: Driver fed us an IP checksum
 *      @nohdr: Payload reference only, must not modify header
 *      @pkt_type: Packet class
 *      @fclone: skbuff clone status
 *      @ipvs_property: skbuff is owned by ipvs
 *      @offload_fwd_mark: Packet was L2-forwarded in hardware
 *      @offload_l3_fwd_mark: Packet was L3-forwarded in hardware
 *      @tc_skip_classify: do not classify packet. set by IFB device
 *      @tc_at_ingress: used within tc_classify to distinguish in/egress
 *      @tc_redirected: packet was redirected by a tc action
 *      @tc_from_ingress: if tc_redirected, tc_at_ingress at time of redirect
 *      @peeked: this packet has been seen already, so stats have been
 *              done for it, don't do them again
 *      @nf_trace: netfilter packet trace flag
 *      @protocol: Packet protocol from driver
 *      @destructor: Destruct function
 *      @tcp_tsorted_anchor: list structure for TCP (tp->tsorted_sent_queue)
 *      @_nfct: Associated connection, if any (with nfctinfo bits)
 *      @nf_bridge: Saved data about a bridged frame - see br_netfilter.c
 *      @skb_iif: ifindex of device we arrived on
 *      @tc_index: Traffic control index
 *      @hash: the packet hash
 *      @queue_mapping: Queue mapping for multiqueue devices
 *      @pfmemalloc: skbuff was allocated from PFMEMALLOC reserves
 *      @active_extensions: active extensions (skb_ext_id types)
 *      @ndisc_nodetype: router type (from link layer)
 *      @ooo_okay: allow the mapping of a socket to a queue to be changed
 *      @l4_hash: indicate hash is a canonical 4-tuple hash over transport
 *              ports.
 *      @sw_hash: indicates hash was computed in software stack
 *      @wifi_acked_valid: wifi_acked was set
 *      @wifi_acked: whether frame was acked on wifi or not
 *      @no_fcs:  Request NIC to treat last 4 bytes as Ethernet FCS
 *      @csum_not_inet: use CRC32c to resolve CHECKSUM_PARTIAL
 *      @dst_pending_confirm: need to confirm neighbour
 *      @decrypted: Decrypted SKB
 *      @napi_id: id of the NAPI struct this skb came from
 *      @secmark: security marking
 *      @mark: Generic packet mark
 *      @vlan_proto: vlan encapsulation protocol
 *      @vlan_tci: vlan tag control information
 *      @inner_protocol: Protocol (encapsulation)
 *      @inner_transport_header: Inner transport layer header (encapsulation)
 *      @inner_network_header: Network layer header (encapsulation)
 *      @inner_mac_header: Link layer header (encapsulation)
 *      @transport_header: Transport layer header
 *      @network_header: Network layer header
 *      @mac_header: Link layer header
 *      @tail: Tail pointer
 *      @end: End pointer
 *      @head: Head of buffer
 *      @data: Data head pointer
 *      @truesize: Buffer size
 *      @users: User count - see {datagram,tcp}.c
 *      @extensions: allocated extensions, valid if active_extensions is nonzero
 */
struct sk_buff {
        union {
                struct {
                        /* These two members must be first. */
                        struct sk_buff          *next;
                        struct sk_buff          *prev;

                        union {
                                struct net_device       *dev;
                                /* Some protocols might use this space to store information,
                                 * while device pointer would be NULL.
                                 * UDP receive path is one user.
                                 */
                                unsigned long           dev_scratch;
                        };
                };
                struct rb_node          rbnode; /* used in netem, ip4 defrag, and tcp stack */
                struct list_head        list;
        };

        union {
                struct sock             *sk;
                int                     ip_defrag_offset;
        };

        union {
                ktime_t         tstamp;
                u64             skb_mstamp_ns; /* earliest departure time */
        };
        /*
         * This is the control buffer. It is free to use for every
         * layer. Please put your private variables there. If you
         * want to keep them across layers you have to do a skb_clone()
         * first. This is owned by whoever has the skb queued ATM.
         */
        char                    cb[48] __aligned(8);

        union {
                struct {
                        unsigned long   _skb_refdst;
                        void            (*destructor)(struct sk_buff *skb);
                };
                struct list_head        tcp_tsorted_anchor;
        };
        unsigned long            _nfct;
        unsigned int            len,
                                data_len;
        __u16                   mac_len,
                                hdr_len;

        /* Following fields are _not_ copied in __copy_skb_header()
         * Note that queue_mapping is here mostly to fill a hole.
         */
        __u16                   queue_mapping;

/* if you move cloned around you also must adapt those constants */
#ifdef __BIG_ENDIAN_BITFIELD
#define CLONED_MASK     (1 << 7)
#else
#define CLONED_MASK     1
#endif
#define CLONED_OFFSET()         offsetof(struct sk_buff, __cloned_offset)

        __u8                    __cloned_offset[0];
        __u8                    cloned:1,
                                nohdr:1,
                                fclone:2,
                                peeked:1,
                                head_frag:1,
                                pfmemalloc:1;
        __u8                    active_extensions;
        /* fields enclosed in headers_start/headers_end are copied
         * using a single memcpy() in __copy_skb_header()
         */
        /* private: */
        __u32                   headers_start[0];
        /* public: */

/* if you move pkt_type around you also must adapt those constants */
#ifdef __BIG_ENDIAN_BITFIELD
#define PKT_TYPE_MAX    (7 << 5)
#else
#define PKT_TYPE_MAX    7
#endif
#define PKT_TYPE_OFFSET()       offsetof(struct sk_buff, __pkt_type_offset)

        __u8                    __pkt_type_offset[0];
        __u8                    pkt_type:3;
        __u8                    ignore_df:1;
        __u8                    nf_trace:1;
        __u8                    ip_summed:2;
        __u8                    ooo_okay:1;

        __u8                    l4_hash:1;
        __u8                    sw_hash:1;
        __u8                    wifi_acked_valid:1;
        __u8                    wifi_acked:1;
        __u8                    no_fcs:1;
        /* Indicates the inner headers are valid in the skbuff. */
        __u8                    encapsulation:1;
        __u8                    encap_hdr_csum:1;
        __u8                    csum_valid:1;

#define PKT_VLAN_PRESENT_OFFSET()       offsetof(struct sk_buff, __pkt_vlan_present_offset)
        __u8                    __pkt_vlan_present_offset[0];
        __u8                    vlan_present:1;
        __u8                    csum_complete_sw:1;
        __u8                    csum_level:2;
        __u8                    csum_not_inet:1;
        __u8                    dst_pending_confirm:1;
        __u8                    ndisc_nodetype:2;

        __u8                    ipvs_property:1;
        __u8                    inner_protocol_type:1;
        __u8                    remcsum_offload:1;
        __u8                    offload_fwd_mark:1;
        __u8                    offload_l3_fwd_mark:1;
        __u8                    tc_skip_classify:1;
        __u8                    tc_at_ingress:1;
        __u8                    tc_redirected:1;
        __u8                    tc_from_ingress:1;
        __u8                    decrypted:1;

        __u16                   tc_index;       /* traffic control index */

        union {
                __wsum          csum;
                struct {
                        __u16   csum_start;
                        __u16   csum_offset;
                };
        };
        __u32                   priority;
        int                     skb_iif;
        __u32                   hash;
        __be16                  vlan_proto;
        __u16                   vlan_tci;
#if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS)
        union {
                unsigned int    napi_id;
                unsigned int    sender_cpu;
        };
#endif
        __u32           secmark;
        union {
                __u32           mark;
                __u32           reserved_tailroom;
        };

        union {
                __be16          inner_protocol;
                __u8            inner_ipproto;
        };

        __u16                   inner_transport_header;
        __u16                   inner_network_header;
        __u16                   inner_mac_header;

        __be16                  protocol;
        __u16                   transport_header;
        __u16                   network_header;
        __u16                   mac_header;

        /* private: */
        __u32                   headers_end[0];
        /* public: */

        /* These elements must be at the end, see alloc_skb() for details.  */
        sk_buff_data_t          tail;
        sk_buff_data_t          end;
        unsigned char           *head,
                                *data;
        unsigned int            truesize;
        refcount_t              users;

        /* only useable after checking ->active_extensions != 0 */
        struct skb_ext          *extensions;
};


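  • next: pointer to the next sk_buff in the doubly linked list;
  • prev: pointer to the previous sk_buff in the doubly linked list;
  • len: size of the data in the buffer. It covers the data in the main buffer (pointed to by head) plus the data in any fragments. It changes as the packet moves up or down the protocol stack, because headers are stripped or added;
  • data_len: size of the data held in fragments;
  • mac_len: length of the link layer header;
  • hdr_len: writable header length of a cloned skb;
  • priority: queueing priority of this sk_buff;
  • skb_iif: ifindex of the device the packet arrived on;
  • hash: the packet hash;
  • vlan_proto: VLAN encapsulation protocol;
  • vlan_tci: VLAN tag control information;
  • protocol: the upper-layer protocol type, which can be obtained with eth_type_trans();
  • transport_header: offset of the transport layer header;
  • network_header: offset of the network layer header;
  • mac_header: offset of the link layer header;
  • head: start of the allocated data buffer;
  • end: end of the allocated data buffer;
  • data: start of the actual data;
  • tail: end of the actual data;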
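
The relationship between head, data, tail, and end is easiest to see in a miniature model. The user-space C sketch below is not the kernel implementation: `struct mini_skb` and the `mini_*` helpers are hypothetical, and they only imitate what the kernel helpers skb_reserve(), skb_put(), skb_push(), and skb_pull() do to the pointers and to len.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical miniature of the four sk_buff pointers plus len. */
struct mini_skb {
    unsigned char *head, *data, *tail, *end;
    size_t len;
};

/* Allocate the buffer: initially empty, data == tail == head. */
static int mini_alloc(struct mini_skb *s, size_t size)
{
    s->head = malloc(size);
    if (s->head == NULL)
        return -1;
    s->data = s->tail = s->head;
    s->end = s->head + size;
    s->len = 0;
    return 0;
}

static void mini_free(struct mini_skb *s)
{
    free(s->head);
    s->head = s->data = s->tail = s->end = NULL;
}

/* Like skb_reserve(): leave headroom for headers added later. */
static void mini_reserve(struct mini_skb *s, size_t n)
{
    assert(s->len == 0 && (size_t)(s->end - s->tail) >= n);
    s->data += n;
    s->tail += n;
}

/* Like skb_put(): extend the data area at the tail. */
static unsigned char *mini_put(struct mini_skb *s, size_t n)
{
    unsigned char *old_tail = s->tail;

    assert((size_t)(s->end - s->tail) >= n);
    s->tail += n;
    s->len += n;
    return old_tail;
}

/* Like skb_push(): prepend a header on the way down the stack. */
static unsigned char *mini_push(struct mini_skb *s, size_t n)
{
    assert((size_t)(s->data - s->head) >= n);
    s->data -= n;
    s->len += n;
    return s->data;
}

/* Like skb_pull(): strip a header on the way up the stack. */
static unsigned char *mini_pull(struct mini_skb *s, size_t n)
{
    assert(n <= s->len);
    s->data += n;
    s->len -= n;
    return s->data;
}
```

A sender would typically reserve 54 bytes of headroom (14 Ethernet + 20 IP + 20 TCP), put the payload, and then push each header in turn; a receiver pulls the headers off in the opposite order, which is why len changes as the packet moves through the stack.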


2.2.2 struct net_device




/**
 *      struct net_device - The DEVICE structure.
 *      Actually, this whole structure is a big mistake.  It mixes I/O
 *      data with strictly "high-level" data, and it has to know about
 *      almost every data structure used in the INET module.
 *      @name:  This is the first field of the "visible" part of this structure
 *              (i.e. as seen by users in the "Space.c" file).  It is the name
 *              of the interface.
 *      @name_hlist:    Device name hash chain, please keep it close to name[]
 *      @ifalias:       SNMP alias
 *      @mem_end:       Shared memory end
 *      @mem_start:     Shared memory start
 *      @base_addr:     Device I/O address
 *      @irq:           Device IRQ number
 *      @state:         Generic network queuing layer state, see netdev_state_t
 *      @dev_list:      The global list of network devices
 *      @napi_list:     List entry used for polling NAPI devices
 *      @unreg_list:    List entry  when we are unregistering the
 *                      device; see the function unregister_netdev
 *      @close_list:    List entry used when we are closing the device
 *      @ptype_all:     Device-specific packet handlers for all protocols
 *      @ptype_specific: Device-specific, protocol-specific packet handlers
 *      @adj_list:      Directly linked devices, like slaves for bonding
 *      @features:      Currently active device features
 *      @hw_features:   User-changeable features
 *      @wanted_features:       User-requested features
 *      @vlan_features:         Mask of features inheritable by VLAN devices
 *      @hw_enc_features:       Mask of features inherited by encapsulating devices
 *                              This field indicates what encapsulation
 *                              offloads the hardware is capable of doing,
 *                              and drivers will need to set them appropriately.
 *      @mpls_features: Mask of features inheritable by MPLS
 *      @ifindex:       interface index
 *      @group:         The group the device belongs to
 *      @stats:         Statistics struct, which was left as a legacy, use
 *                      rtnl_link_stats64 instead
 *      @rx_dropped:    Dropped packets by core network,
 *                      do not use this in drivers
 *      @tx_dropped:    Dropped packets by core network,
 *                      do not use this in drivers
 *      @rx_nohandler:  nohandler dropped packets by core network on
 *                      inactive devices, do not use this in drivers
 *      @carrier_up_count:      Number of times the carrier has been up
 *      @carrier_down_count:    Number of times the carrier has been down
 *      @wireless_handlers:     List of functions to handle Wireless Extensions,
 *                              instead of ioctl,
 *                              see <net/iw_handler.h> for details.
 *      @wireless_data: Instance data managed by the core of wireless extensions
 *      @netdev_ops:    Includes several pointers to callbacks,
 *                      if one wants to override the ndo_*() functions
 *      @ethtool_ops:   Management operations
 *      @ndisc_ops:     Includes callbacks for different IPv6 neighbour
 *                      discovery handling. Necessary for e.g. 6LoWPAN.
 *      @header_ops:    Includes callbacks for creating,parsing,caching,etc
 *                      of Layer 2 headers.
 *      @flags:         Interface flags (a la BSD)
 *      @priv_flags:    Like 'flags' but invisible to userspace,
 *                      see if.h for the definitions
 *      @gflags:        Global flags ( kept as legacy )
 *      @padded:        How much padding added by alloc_netdev()
 *      @operstate:     RFC2863 operstate
 *      @link_mode:     Mapping policy to operstate
 *      @if_port:       Selectable AUI, TP, ...
 *      @dma:           DMA channel
 *      @mtu:           Interface MTU value
 *      @min_mtu:       Interface Minimum MTU value
 *      @max_mtu:       Interface Maximum MTU value
 *      @type:          Interface hardware type
 *      @hard_header_len: Maximum hardware header length.
 *      @min_header_len:  Minimum hardware header length
 *      @needed_headroom: Extra headroom the hardware may need, but not in all
 *                        cases can this be guaranteed
 *      @needed_tailroom: Extra tailroom the hardware may need, but not in all
 *                        cases can this be guaranteed. Some cases also use
 *                        LL_MAX_HEADER instead to allocate the skb
 *      interface address info:
 *      @perm_addr:             Permanent hw address
 *      @addr_assign_type:      Hw address assignment type
 *      @addr_len:              Hardware address length
 *      @neigh_priv_len:        Used in neigh_alloc()
 *      @dev_id:                Used to differentiate devices that share
 *                              the same link layer address
 *      @dev_port:              Used to differentiate devices that share
 *                              the same function
 *      @addr_list_lock:        XXX: need comments on this one
 *      @uc_promisc:            Counter that indicates promiscuous mode
 *                              has been enabled due to the need to listen to
 *                              additional unicast addresses in a device that
 *                              does not implement ndo_set_rx_mode()
 *      @uc:                    unicast mac addresses
 *      @mc:                    multicast mac addresses
 *      @dev_addrs:             list of device hw addresses
 *      @queues_kset:           Group of all Kobjects in the Tx and RX queues
 *      @promiscuity:           Number of times the NIC is told to work in
 *                              promiscuous mode; if it becomes 0 the NIC will
 *                              exit promiscuous mode
 *      @allmulti:              Counter, enables or disables allmulticast mode
 *      @vlan_info:     VLAN info
 *      @dsa_ptr:       dsa specific data
 *      @tipc_ptr:      TIPC specific data
 *      @atalk_ptr:     AppleTalk link
 *      @ip_ptr:        IPv4 specific data
 *      @dn_ptr:        DECnet specific data
 *      @ip6_ptr:       IPv6 specific data
 *      @ax25_ptr:      AX.25 specific data
 *      @ieee80211_ptr: IEEE 802.11 specific data, assign before registering
 *      @dev_addr:      Hw address (before bcast,
 *                      because most packets are unicast)
 *      @_rx:                   Array of RX queues
 *      @num_rx_queues:         Number of RX queues
 *                              allocated at register_netdev() time
 *      @real_num_rx_queues:    Number of RX queues currently active in device
 *      @rx_handler:            handler for received packets
 *      @rx_handler_data:       XXX: need comments on this one
 *      @miniq_ingress:         ingress/clsact qdisc specific data for
 *                              ingress processing
 *      @ingress_queue:         XXX: need comments on this one
 *      @broadcast:             hw bcast address
 *      @rx_cpu_rmap:   CPU reverse-mapping for RX completion interrupts,
 *                      indexed by RX queue number. Assigned by driver.
 *                      This must only be set if the ndo_rx_flow_steer
 *                      operation is defined
 *      @index_hlist:           Device index hash chain
 *      @_tx:                   Array of TX queues
 *      @num_tx_queues:         Number of TX queues allocated at alloc_netdev_mq() time
 *      @real_num_tx_queues:    Number of TX queues currently active in device
 *      @qdisc:                 Root qdisc from userspace point of view
 *      @tx_queue_len:          Max frames per queue allowed
 *      @tx_global_lock:        XXX: need comments on this one
 *      @xps_maps:      XXX: need comments on this one
 *      @miniq_egress:          clsact qdisc specific data for
 *                              egress processing
 *      @watchdog_timeo:        Represents the timeout that is used by
 *                              the watchdog (see dev_watchdog())
 *      @watchdog_timer:        List of timers
 *      @pcpu_refcnt:           Number of references to this device
 *      @todo_list:             Delayed register/unregister
 *      @link_watch_list:       XXX: need comments on this one
 *      @reg_state:             Register/unregister state machine
 *      @dismantle:             Device is going to be freed
 *      @rtnl_link_state:       This enum represents the phases of creating
 *                              a new link
 *      @needs_free_netdev:     Should unregister perform free_netdev?
 *      @priv_destructor:       Called from unregister
 *      @npinfo:                XXX: need comments on this one
 *      @nd_net:                Network namespace this network device is inside
 *      @ml_priv:       Mid-layer private
 *      @lstats:        Loopback statistics
 *      @tstats:        Tunnel statistics
 *      @dstats:        Dummy statistics
 *      @vstats:        Virtual ethernet statistics
 *      @garp_port:     GARP
 *      @mrp_port:      MRP
 *      @dev:           Class/net/name entry
 *      @sysfs_groups:  Space for optional device, statistics and wireless
 *                      sysfs groups
 *      @sysfs_rx_queue_group:  Space for optional per-rx queue attributes
 *      @rtnl_link_ops: Rtnl_link_ops
 *      @gso_max_size:  Maximum size of generic segmentation offload
 *      @gso_max_segs:  Maximum number of segments that can be passed to the
 *                      NIC for GSO
 *      @dcbnl_ops:     Data Center Bridging netlink ops
 *      @num_tc:        Number of traffic classes in the net device
 *      @tc_to_txq:     XXX: need comments on this one
 *      @prio_tc_map:   XXX: need comments on this one
 *      @fcoe_ddp_xid:  Max exchange id for FCoE LRO by ddp
 *      @priomap:       XXX: need comments on this one
 *      @phydev:        Physical device may attach itself
 *                      for hardware timestamping
 *      @sfp_bus:       attached &struct sfp_bus structure.
 *      @qdisc_tx_busylock: lockdep class annotating Qdisc->busylock spinlock
 *      @qdisc_running_key: lockdep class annotating Qdisc->running seqcount
 *      @proto_down:    protocol port state information can be sent to the
 *                      switch driver and used to set the phys state of the
 *                      switch port.
 *      @wol_enabled:   Wake-on-LAN is enabled
 *      FIXME: cleanup struct net_device such that network protocol info
 *      moves out.
 */
struct net_device {
        char                    name[IFNAMSIZ];
        struct hlist_node       name_hlist;
        struct dev_ifalias      __rcu *ifalias;
        /*
         *      I/O specific fields
         *      FIXME: Merge these and struct ifmap into one
         */
        unsigned long           mem_end;
        unsigned long           mem_start;
        unsigned long           base_addr;
        int                     irq;

        /*
         *      Some hardware also needs these fields (state,dev_list,
         *      napi_list,unreg_list,close_list) but they are not
         *      part of the usual set specified in Space.c.
         */

        unsigned long           state;

        struct list_head        dev_list;
        struct list_head        napi_list;
        struct list_head        unreg_list;
        struct list_head        close_list;
        struct list_head        ptype_all;
        struct list_head        ptype_specific;

        struct {
                struct list_head upper;
                struct list_head lower;
        } adj_list;

        netdev_features_t       features;
        netdev_features_t       hw_features;
        netdev_features_t       wanted_features;
        netdev_features_t       vlan_features;
        netdev_features_t       hw_enc_features;
        netdev_features_t       mpls_features;
        netdev_features_t       gso_partial_features;

        int                     ifindex;
        int                     group;

        struct net_device_stats stats;

        atomic_long_t           rx_dropped;
        atomic_long_t           tx_dropped;
        atomic_long_t           rx_nohandler;

        /* Stats to monitor link on/off, flapping */
        atomic_t                carrier_up_count;
        atomic_t                carrier_down_count;

        const struct iw_handler_def *wireless_handlers;
        struct iw_public_data   *wireless_data;
        const struct net_device_ops *netdev_ops;
        const struct ethtool_ops *ethtool_ops;
        const struct l3mdev_ops *l3mdev_ops;
        const struct ndisc_ops *ndisc_ops;

        const struct xfrmdev_ops *xfrmdev_ops;

        const struct tlsdev_ops *tlsdev_ops;

        const struct header_ops *header_ops;

        unsigned int            flags;
        unsigned int            priv_flags;

        unsigned short          gflags;
        unsigned short          padded;

        unsigned char           operstate;
        unsigned char           link_mode;

        unsigned char           if_port;
        unsigned char           dma;

        unsigned int            mtu;
        unsigned int            min_mtu;
        unsigned int            max_mtu;
        unsigned short          type;
        unsigned short          hard_header_len;
        unsigned char           min_header_len;

        unsigned short          needed_headroom;
        unsigned short          needed_tailroom;

        /* Interface address info. */
        unsigned char           perm_addr[MAX_ADDR_LEN];
        unsigned char           addr_assign_type;
        unsigned char           addr_len;
        unsigned short          neigh_priv_len;
        unsigned short          dev_id;
        unsigned short          dev_port;
        spinlock_t              addr_list_lock;
        unsigned char           name_assign_type;
        bool                    uc_promisc;
        struct netdev_hw_addr_list      uc;
        struct netdev_hw_addr_list      mc;
        struct netdev_hw_addr_list      dev_addrs;
        struct kset             *queues_kset;
        unsigned int            promiscuity;
        unsigned int            allmulti;

        /* Protocol-specific pointers */

        struct vlan_info __rcu  *vlan_info;
        struct dsa_port         *dsa_ptr;
        struct tipc_bearer __rcu *tipc_ptr;
        void                    *atalk_ptr;
        struct in_device __rcu  *ip_ptr;
        struct dn_dev __rcu     *dn_ptr;
        struct inet6_dev __rcu  *ip6_ptr;
        void                    *ax25_ptr;
        struct wireless_dev     *ieee80211_ptr;
        struct wpan_dev         *ieee802154_ptr;
        struct mpls_dev __rcu   *mpls_ptr;

/*
 * Cache lines mostly used on receive path (including eth_type_trans())
 */
        /* Interface address info used in eth_type_trans() */
        unsigned char           *dev_addr;

        struct netdev_rx_queue  *_rx;
        unsigned int            num_rx_queues;
        unsigned int            real_num_rx_queues;

        struct bpf_prog __rcu   *xdp_prog;
        unsigned long           gro_flush_timeout;
        rx_handler_func_t __rcu *rx_handler;
        void __rcu              *rx_handler_data;

        struct mini_Qdisc __rcu *miniq_ingress;
        struct netdev_queue __rcu *ingress_queue;
        struct nf_hook_entries __rcu *nf_hooks_ingress;

        unsigned char           broadcast[MAX_ADDR_LEN];
        struct cpu_rmap         *rx_cpu_rmap;
        struct hlist_node       index_hlist;

/*
 * Cache lines mostly used on transmit path
 */
        struct netdev_queue     *_tx ____cacheline_aligned_in_smp;
        unsigned int            num_tx_queues;
        unsigned int            real_num_tx_queues;
        struct Qdisc            *qdisc;
        DECLARE_HASHTABLE       (qdisc_hash, 4);
        unsigned int            tx_queue_len;
        spinlock_t              tx_global_lock;
        int                     watchdog_timeo;

        struct xps_dev_maps __rcu *xps_cpus_map;
        struct xps_dev_maps __rcu *xps_rxqs_map;
        struct mini_Qdisc __rcu *miniq_egress;

        /* These may be needed for future network-power-down code. */
        struct timer_list       watchdog_timer;

        int __percpu            *pcpu_refcnt;
        struct list_head        todo_list;

        struct list_head        link_watch_list;

        enum { NETREG_UNINITIALIZED=0,
               NETREG_REGISTERED,       /* completed register_netdevice */
               NETREG_UNREGISTERING,    /* called unregister_netdevice */
               NETREG_UNREGISTERED,     /* completed unregister todo */
               NETREG_RELEASED,         /* called free_netdev */
               NETREG_DUMMY,            /* dummy device for NAPI poll */
        } reg_state:8;

        bool dismantle;

        enum {
                RTNL_LINK_INITIALIZED,
                RTNL_LINK_INITIALIZING,
        } rtnl_link_state:16;

        bool needs_free_netdev;
        void (*priv_destructor)(struct net_device *dev);

        struct netpoll_info __rcu       *npinfo;

        possible_net_t                  nd_net;

        /* mid-layer private */
        union {
                void                                    *ml_priv;
                struct pcpu_lstats __percpu             *lstats;
                struct pcpu_sw_netstats __percpu        *tstats;
                struct pcpu_dstats __percpu             *dstats;
        };

        struct garp_port __rcu  *garp_port;
        struct mrp_port __rcu   *mrp_port;

        struct device           dev;
        const struct attribute_group *sysfs_groups[4];
        const struct attribute_group *sysfs_rx_queue_group;

        const struct rtnl_link_ops *rtnl_link_ops;

        /* for setting kernel sock attribute on TCP connection setup */
#define GSO_MAX_SIZE            65536
        unsigned int            gso_max_size;
#define GSO_MAX_SEGS            65535
        u16                     gso_max_segs;

        const struct dcbnl_rtnl_ops *dcbnl_ops;
        s16                     num_tc;
        struct netdev_tc_txq    tc_to_txq[TC_MAX_QUEUE];
        u8                      prio_tc_map[TC_BITMASK + 1];

        unsigned int            fcoe_ddp_xid;
        struct netprio_map __rcu *priomap;
        struct phy_device       *phydev;
        struct sfp_bus          *sfp_bus;
        struct lock_class_key   *qdisc_tx_busylock;
        struct lock_class_key   *qdisc_running_key;
        bool                    proto_down;
        unsigned                wol_enabled:1;
};


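  • name: the name of the network device;
  • mem_end: the end address of the device memory;
  • mem_start: the start address of the device memory;
  • base_addr: the I/O base address of the device;
  • irq: the interrupt number of the device;
  • if_port: the port type in use on a multi-port device;
  • dma: the DMA channel of the device;
  • get_stats: returns the traffic statistics as a net_device_stats structure; running ifconfig ends up calling it. In Linux 5.2 this hook is not a member of net_device itself; it lives in net_device_ops as ndo_get_stats (or ndo_get_stats64);
  • stats: the net_device_stats structure that holds the statistics;
  • features: the feature flags of the interface;
  • flags: the interface flags, whose names start with IFF_ (Interface Flags). Common values are IFF_UP (set by the kernel once the device has been brought up and can start sending packets), IFF_AUTOMEDIA (the device can switch between several media types), IFF_BROADCAST (broadcast is allowed), IFF_DEBUG (debug mode, which can be used to control how verbose printk calls are), IFF_LOOPBACK (loopback), IFF_MULTICAST (multicast is allowed), IFF_NOARP (the interface cannot perform ARP; a point-to-point interface, for example, has no need for it), and IFF_POINTOPOINT (the interface is attached to a point-to-point link);
  • mtu: the maximum transmission unit, that is, the largest packet (not counting the link-layer header) that the interface can send in one frame;
  • type: the hardware type of the interface;
  • hard_header_len: the length of the hardware frame header; for Ethernet it is normally ETH_HLEN, which is 14;
  • perm_addr: the permanent hardware (MAC) address of the device;
  • dev_addr: the MAC address currently in use;
  • netdev_ops: the set of operation functions of the network device;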
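Since flags is just a bit mask, a small decoder makes the list of IFF_ values above easier to picture. The following is a minimal user-space sketch, not kernel code: the constants are copied by hand from include/uapi/linux/if.h so that it builds outside the kernel tree, and vnet_format_flags() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Values mirror include/uapi/linux/if.h; redefined here so the sketch is standalone. */
#define IFF_UP          0x0001
#define IFF_BROADCAST   0x0002
#define IFF_DEBUG       0x0004
#define IFF_LOOPBACK    0x0008
#define IFF_POINTOPOINT 0x0010
#define IFF_NOARP       0x0080
#define IFF_MULTICAST   0x1000
#define IFF_AUTOMEDIA   0x4000

/* Hypothetical helper: render a flags word as a space-separated list of names. */
static void vnet_format_flags(unsigned int flags, char *buf, size_t size)
{
    static const struct {
        unsigned int bit;
        const char *name;
    } table[] = {
        { IFF_UP,          "UP" },
        { IFF_BROADCAST,   "BROADCAST" },
        { IFF_DEBUG,       "DEBUG" },
        { IFF_LOOPBACK,    "LOOPBACK" },
        { IFF_POINTOPOINT, "POINTOPOINT" },
        { IFF_NOARP,       "NOARP" },
        { IFF_MULTICAST,   "MULTICAST" },
        { IFF_AUTOMEDIA,   "AUTOMEDIA" },
    };
    size_t used = 0;
    size_t i;

    if (size == 0)
        return;
    buf[0] = '\0';
    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        int n;

        if (!(flags & table[i].bit))
            continue;
        n = snprintf(buf + used, size - used, "%s%s",
                     used ? " " : "", table[i].name);
        if (n < 0 || (size_t)n >= size - used)
            break;              /* output would be truncated; stop here */
        used += (size_t)n;
    }
}
```

Setting a flag in a driver is the mirror image of this: `dev->flags |= IFF_NOARP;` turns on one bit and leaves the others untouched.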


2.2.3 struct net_device_ops


 * This structure defines the management hooks for network devices.
 * The following hooks can be defined; unless noted otherwise, they are
 * optional and can be filled with a null pointer.
 * int (*ndo_init)(struct net_device *dev);
 *     This function is called once when a network device is registered.
 *     The network device can use this for any late stage initialization
 *     or semantic validation. It can fail with an error code which will
 *     be propagated back to register_netdev.
 * void (*ndo_uninit)(struct net_device *dev);
 *     This function is called when device is unregistered or when registration
 *     fails. It is not called if init fails.
 * int (*ndo_open)(struct net_device *dev);
 *     This function is called when a network device transitions to the up
 *     state.
 * int (*ndo_stop)(struct net_device *dev);
 *     This function is called when a network device transitions to the down
 *     state.
 * netdev_tx_t (*ndo_start_xmit)(struct sk_buff *skb,
 *                               struct net_device *dev);
 *    Called when a packet needs to be transmitted.
 *    Returns NETDEV_TX_OK.  Can return NETDEV_TX_BUSY, but you should stop
 *    the queue before that can happen; it's for obsolete devices and weird
 *    corner cases, but the stack really does a non-trivial amount
 *    of useless work if you return NETDEV_TX_BUSY.
 *    Required; cannot be NULL.
 * netdev_features_t (*ndo_features_check)(struct sk_buff *skb,
 *                       struct net_device *dev,
 *                       netdev_features_t features);
 *    Called by core transmit path to determine if device is capable of
 *    performing offload operations on a given packet. This is to give
 *    the device an opportunity to implement any restrictions that cannot
 *    be otherwise expressed by feature flags. The check is called with
 *    the set of features that the stack has calculated and it returns
 *    those the driver believes to be appropriate.
 * u16 (*ndo_select_queue)(struct net_device *dev, struct sk_buff *skb,
 *                         struct net_device *sb_dev);
 *    Called to decide which queue to use when device supports multiple
 *    transmit queues.
 * void (*ndo_change_rx_flags)(struct net_device *dev, int flags);
 *    This function is called to allow device receiver to make
 *    changes to configuration when multicast or promiscuous is enabled.
 * void (*ndo_set_rx_mode)(struct net_device *dev);
 *    This function is called when the device changes address list filtering.
 *    If driver handles unicast address filtering, it should set
 *    IFF_UNICAST_FLT in its priv_flags.
 * int (*ndo_set_mac_address)(struct net_device *dev, void *addr);
 *    This function  is called when the Media Access Control address
 *    needs to be changed. If this interface is not defined, the
 *    MAC address can not be changed.
 * int (*ndo_validate_addr)(struct net_device *dev);
 *    Test if Media Access Control address is valid for the device.
 * int (*ndo_do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);
 *    Called when a user requests an ioctl which can't be handled by
 *    the generic interface code. If not defined ioctls return
 *    not supported error code.
 * int (*ndo_set_config)(struct net_device *dev, struct ifmap *map);
 *    Used to set network devices bus interface parameters. This interface
 *    is retained for legacy reasons; new devices should use the bus
 *    interface (PCI) for low level management.
 * int (*ndo_change_mtu)(struct net_device *dev, int new_mtu);
 *    Called when a user wants to change the Maximum Transfer Unit
 *    of a device.
 * void (*ndo_tx_timeout)(struct net_device *dev);
 *    Callback used when the transmitter has not made any progress
 *    for dev->watchdog ticks.
 * void (*ndo_get_stats64)(struct net_device *dev,
 *                         struct rtnl_link_stats64 *storage);
 * struct net_device_stats* (*ndo_get_stats)(struct net_device *dev);
 *    Called when a user wants to get the network device usage
 *    statistics. Drivers must do one of the following:
 *    1. Define @ndo_get_stats64 to fill in a zero-initialised
 *       rtnl_link_stats64 structure passed by the caller.
 *    2. Define @ndo_get_stats to update a net_device_stats structure
 *       (which should normally be dev->stats) and return a pointer to
 *       it. The structure may be changed asynchronously only if each
 *       field is written atomically.
 *    3. Update dev->stats asynchronously and atomically, and define
 *       neither operation.
 * bool (*ndo_has_offload_stats)(const struct net_device *dev, int attr_id)
 *    Return true if this device supports offload stats of this attr_id.
 * int (*ndo_get_offload_stats)(int attr_id, const struct net_device *dev,
 *    void *attr_data)
 *    Get statistics for offload operations by attr_id. Write it into the
 *    attr_data pointer.
 * int (*ndo_vlan_rx_add_vid)(struct net_device *dev, __be16 proto, u16 vid);
 *    If device supports VLAN filtering this function is called when a
 *    VLAN id is registered.
 * int (*ndo_vlan_rx_kill_vid)(struct net_device *dev, __be16 proto, u16 vid);
 *    If device supports VLAN filtering this function is called when a
 *    VLAN id is unregistered.
 * void (*ndo_poll_controller)(struct net_device *dev);
 *    SR-IOV management functions.
 * int (*ndo_set_vf_mac)(struct net_device *dev, int vf, u8* mac);
 * int (*ndo_set_vf_vlan)(struct net_device *dev, int vf, u16 vlan,
 *              u8 qos, __be16 proto);
 * int (*ndo_set_vf_rate)(struct net_device *dev, int vf, int min_tx_rate,
 *              int max_tx_rate);
 * int (*ndo_set_vf_spoofchk)(struct net_device *dev, int vf, bool setting);
 * int (*ndo_set_vf_trust)(struct net_device *dev, int vf, bool setting);
 * int (*ndo_get_vf_config)(struct net_device *dev,
 *                int vf, struct ifla_vf_info *ivf);
 * int (*ndo_set_vf_link_state)(struct net_device *dev, int vf, int link_state);
 * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
 *              struct nlattr *port[]);
 *      Enable or disable the VF ability to query its RSS Redirection Table and
 *      Hash Key. This is needed since on some devices VF share this information
 *      with PF and querying it may introduce a theoretical security risk.
 * int (*ndo_set_vf_rss_query_en)(struct net_device *dev, int vf, bool setting);
 * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
 * int (*ndo_setup_tc)(struct net_device *dev, enum tc_setup_type type,
 *               void *type_data);
 *    Called to setup any 'tc' scheduler, classifier or action on @dev.
 *    This is always called from the stack with the rtnl lock held and netif
 *    tx queues stopped. This allows the netdevice to perform queue
 *    management safely.
 *    Fiber Channel over Ethernet (FCoE) offload functions.
 * int (*ndo_fcoe_enable)(struct net_device *dev);
 *    Called when the FCoE protocol stack wants to start using LLD for FCoE
 *    so the underlying device can perform whatever needed configuration or
 *    initialization to support acceleration of FCoE traffic.
 * int (*ndo_fcoe_disable)(struct net_device *dev);
 *    Called when the FCoE protocol stack wants to stop using LLD for FCoE
 *    so the underlying device can perform whatever needed clean-ups to
 *    stop supporting acceleration of FCoE traffic.
 * int (*ndo_fcoe_ddp_setup)(struct net_device *dev, u16 xid,
 *                 struct scatterlist *sgl, unsigned int sgc);
 *    Called when the FCoE Initiator wants to initialize an I/O that
 *    is a possible candidate for Direct Data Placement (DDP). The LLD can
 *    perform necessary setup and returns 1 to indicate the device is set up
 *    successfully to perform DDP on this I/O, otherwise this returns 0.
 * int (*ndo_fcoe_ddp_done)(struct net_device *dev,  u16 xid);
 *    Called when the FCoE Initiator/Target is done with the DDPed I/O as
 *    indicated by the FC exchange id 'xid', so the underlying device can
 *    clean up and reuse resources for later DDP requests.
 * int (*ndo_fcoe_ddp_target)(struct net_device *dev, u16 xid,
 *                  struct scatterlist *sgl, unsigned int sgc);
 *    Called when the FCoE Target wants to initialize an I/O that
 *    is a possible candidate for Direct Data Placement (DDP). The LLD can
 *    perform necessary setup and returns 1 to indicate the device is set up
 *    successfully to perform DDP on this I/O, otherwise this returns 0.
 * int (*ndo_fcoe_get_hbainfo)(struct net_device *dev,
 *                   struct netdev_fcoe_hbainfo *hbainfo);
 *    Called when the FCoE Protocol stack wants information on the underlying
 *    device. This information is utilized by the FCoE protocol stack to
 *    register attributes with Fiber Channel management service as per the
 *    FC-GS Fabric Device Management Information(FDMI) specification.
 * int (*ndo_fcoe_get_wwn)(struct net_device *dev, u64 *wwn, int type);
 *    Called when the underlying device wants to override default World Wide
 *    Name (WWN) generation mechanism in FCoE protocol stack to pass its own
 *    World Wide Port Name (WWPN) or World Wide Node Name (WWNN) to the FCoE
 *    protocol stack to use.
 *    RFS acceleration.
 * int (*ndo_rx_flow_steer)(struct net_device *dev, const struct sk_buff *skb,
 *                u16 rxq_index, u32 flow_id);
 *    Set hardware filter for RFS.  rxq_index is the target queue index;
 *    flow_id is a flow ID to be passed to rps_may_expire_flow() later.
 *    Return the filter ID on success, or a negative error code.
 *    Slave management functions (for bridge, bonding, etc).
 * int (*ndo_add_slave)(struct net_device *dev, struct net_device *slave_dev);
 *    Called to make another netdev an underling.
 * int (*ndo_del_slave)(struct net_device *dev, struct net_device *slave_dev);
 *    Called to release previously enslaved netdev.
 *      Feature/offload setting functions.
 * netdev_features_t (*ndo_fix_features)(struct net_device *dev,
 *        netdev_features_t features);
 *    Adjusts the requested feature flags according to device-specific
 *    constraints, and returns the resulting flags. Must not modify
 *    the device state.
 * int (*ndo_set_features)(struct net_device *dev, netdev_features_t features);
 *    Called to update device configuration to new features. Passed
 *    feature set might be less than what was returned by ndo_fix_features()).
 *    Must return >0 or -errno if it changed dev->features itself.
 * int (*ndo_fdb_add)(struct ndmsg *ndm, struct nlattr *tb[],
 *              struct net_device *dev,
 *              const unsigned char *addr, u16 vid, u16 flags,
 *              struct netlink_ext_ack *extack);
 *    Adds an FDB entry to dev for addr.
 * int (*ndo_fdb_del)(struct ndmsg *ndm, struct nlattr *tb[],
 *              struct net_device *dev,
 *              const unsigned char *addr, u16 vid)
 *    Deletes the FDB entry from dev corresponding to addr.
 * int (*ndo_fdb_dump)(struct sk_buff *skb, struct netlink_callback *cb,
 *               struct net_device *dev, struct net_device *filter_dev,
 *               int *idx)
 *    Used to add FDB entries to dump requests. Implementers should add
 *    entries to skb and update idx with the number of entries.
 * int (*ndo_bridge_setlink)(struct net_device *dev, struct nlmsghdr *nlh,
 *                 u16 flags, struct netlink_ext_ack *extack)
 * int (*ndo_bridge_getlink)(struct sk_buff *skb, u32 pid, u32 seq,
 *                 struct net_device *dev, u32 filter_mask,
 *                 int nlflags)
 * int (*ndo_bridge_dellink)(struct net_device *dev, struct nlmsghdr *nlh,
 *                 u16 flags);
 * int (*ndo_change_carrier)(struct net_device *dev, bool new_carrier);
 *    Called to change device carrier. Soft-devices (like dummy, team, etc)
 *    which do not represent real hardware may define this to allow their
 *    userspace components to manage their virtual carrier state. Devices
 *    that determine carrier state from physical hardware properties (eg
 *    network cables) or protocol-dependent mechanisms (eg
 *    USB_CDC_NOTIFY_NETWORK_CONNECTION) should NOT implement this function.
 * int (*ndo_get_phys_port_id)(struct net_device *dev,
 *                   struct netdev_phys_item_id *ppid);
 *    Called to get ID of physical port of this device. If driver does
 *    not implement this, it is assumed that the hw is not able to have
 *    multiple net devices on single physical port.
 * int (*ndo_get_port_parent_id)(struct net_device *dev,
 *                 struct netdev_phys_item_id *ppid)
 *    Called to get the parent ID of the physical port of this device.
 * void (*ndo_udp_tunnel_add)(struct net_device *dev,
 *                  struct udp_tunnel_info *ti);
 *    Called by UDP tunnel to notify a driver about the UDP port and socket
 *    address family that a UDP tunnel is listening to. It is called only
 *    when a new port starts listening. The operation is protected by the
 *    RTNL.
 * void (*ndo_udp_tunnel_del)(struct net_device *dev,
 *                  struct udp_tunnel_info *ti);
 *    Called by UDP tunnel to notify the driver about a UDP port and socket
 *    address family that the UDP tunnel is not listening to anymore. The
 *    operation is protected by the RTNL.
 * void* (*ndo_dfwd_add_station)(struct net_device *pdev,
 *                 struct net_device *dev)
 *    Called by upper layer devices to accelerate switching or other
 *    station functionality into hardware. 'pdev' is the lowerdev
 *    to use for the offload and 'dev' is the net device that will
 *    back the offload. Returns a pointer to the private structure
 *    the upper layer will maintain.
 * void (*ndo_dfwd_del_station)(struct net_device *pdev, void *priv)
 *    Called by upper layer device to delete the station created
 *    by 'ndo_dfwd_add_station'. 'pdev' is the net device backing
 *    the station and priv is the structure returned by the add
 *    operation.
 * int (*ndo_set_tx_maxrate)(struct net_device *dev,
 *                 int queue_index, u32 maxrate);
 *    Called when a user wants to set a max-rate limitation of specific
 *    TX queue.
 * int (*ndo_get_iflink)(const struct net_device *dev);
 *    Called to get the iflink value of this device.
 * void (*ndo_change_proto_down)(struct net_device *dev,
 *                 bool proto_down);
 *    This function is used to pass protocol port error state information
 *    to the switch driver. The switch driver can react to the proto_down
 *      by doing a phys down on the associated switch port.
 * int (*ndo_fill_metadata_dst)(struct net_device *dev, struct sk_buff *skb);
 *    This function is used to get egress tunnel information for given skb.
 *    This is useful for retrieving outer tunnel header parameters while
 *    sampling packet.
 * void (*ndo_set_rx_headroom)(struct net_device *dev, int needed_headroom);
 *    This function is used to specify the headroom that the skb must
 *    consider when allocation skb during packet reception. Setting
 *    appropriate rx headroom value allows avoiding skb head copy on
 *    forward. Setting a negative value resets the rx headroom to the
 *    default value.
 * int (*ndo_bpf)(struct net_device *dev, struct netdev_bpf *bpf);
 *    This function is used to set or query state related to XDP on the
 *    netdevice and manage BPF offload. See definition of
 *    enum bpf_netdev_command for details.
 * int (*ndo_xdp_xmit)(struct net_device *dev, int n, struct xdp_frame **xdp,
 *            u32 flags);
 *    This function is used to submit @n XDP packets for transmit on a
 *    netdevice. Returns number of frames successfully transmitted, frames
 *    that got dropped are freed/returned via xdp_return_frame().
 *    Returns negative number, means general error invoking ndo, meaning
 *    no frames were xmit'ed and core-caller will free all frames.
 * struct devlink_port *(*ndo_get_devlink_port)(struct net_device *dev);
 *    Get devlink port instance associated with a given netdev.
 *    Called with a reference on the netdevice and devlink locks only,
 *    rtnl_lock is not held.
struct net_device_ops {
    int            (*ndo_init)(struct net_device *dev);
    void            (*ndo_uninit)(struct net_device *dev);
    int            (*ndo_open)(struct net_device *dev);
    int            (*ndo_stop)(struct net_device *dev);
    netdev_tx_t        (*ndo_start_xmit)(struct sk_buff *skb,
                          struct net_device *dev);
    netdev_features_t    (*ndo_features_check)(struct sk_buff *skb,
                              struct net_device *dev,
                              netdev_features_t features);
    u16            (*ndo_select_queue)(struct net_device *dev,
                            struct sk_buff *skb,
                            struct net_device *sb_dev);
    void            (*ndo_change_rx_flags)(struct net_device *dev,
                               int flags);
    void            (*ndo_set_rx_mode)(struct net_device *dev);
    int            (*ndo_set_mac_address)(struct net_device *dev,
                               void *addr);
    int            (*ndo_validate_addr)(struct net_device *dev);
    int            (*ndo_do_ioctl)(struct net_device *dev,
                            struct ifreq *ifr, int cmd);
    int            (*ndo_set_config)(struct net_device *dev,
                              struct ifmap *map);
    int            (*ndo_change_mtu)(struct net_device *dev,
                          int new_mtu);
    int            (*ndo_neigh_setup)(struct net_device *dev,
                           struct neigh_parms *);
    void            (*ndo_tx_timeout) (struct net_device *dev);

    void            (*ndo_get_stats64)(struct net_device *dev,
                           struct rtnl_link_stats64 *storage);
    bool            (*ndo_has_offload_stats)(const struct net_device *dev, int attr_id);
    int            (*ndo_get_offload_stats)(int attr_id,
                             const struct net_device *dev,
                             void *attr_data);
    struct net_device_stats* (*ndo_get_stats)(struct net_device *dev);

    int            (*ndo_vlan_rx_add_vid)(struct net_device *dev,
                               __be16 proto, u16 vid);
    int            (*ndo_vlan_rx_kill_vid)(struct net_device *dev,
                                __be16 proto, u16 vid);
    void                    (*ndo_poll_controller)(struct net_device *dev);
    int            (*ndo_netpoll_setup)(struct net_device *dev,
                             struct netpoll_info *info);
    void            (*ndo_netpoll_cleanup)(struct net_device *dev);
    int            (*ndo_set_vf_mac)(struct net_device *dev,
                          int queue, u8 *mac);
    int            (*ndo_set_vf_vlan)(struct net_device *dev,
                           int queue, u16 vlan,
                           u8 qos, __be16 proto);
    int            (*ndo_set_vf_rate)(struct net_device *dev,
                           int vf, int min_tx_rate,
                           int max_tx_rate);
    int            (*ndo_set_vf_spoofchk)(struct net_device *dev,
                               int vf, bool setting);
    int            (*ndo_set_vf_trust)(struct net_device *dev,
                            int vf, bool setting);
    int            (*ndo_get_vf_config)(struct net_device *dev,
                             int vf,
                             struct ifla_vf_info *ivf);
    int            (*ndo_set_vf_link_state)(struct net_device *dev,
                             int vf, int link_state);
    int            (*ndo_get_vf_stats)(struct net_device *dev,
                            int vf,
                            struct ifla_vf_stats *vf_stats);
    int            (*ndo_set_vf_port)(struct net_device *dev,
                           int vf,
                           struct nlattr *port[]);
    int            (*ndo_get_vf_port)(struct net_device *dev,
                           int vf, struct sk_buff *skb);
    int            (*ndo_set_vf_guid)(struct net_device *dev,
                           int vf, u64 guid,
                           int guid_type);
    int            (*ndo_set_vf_rss_query_en)(
                           struct net_device *dev,
                           int vf, bool setting);
    int            (*ndo_setup_tc)(struct net_device *dev,
                        enum tc_setup_type type,
                        void *type_data);
    int            (*ndo_fcoe_enable)(struct net_device *dev);
    int            (*ndo_fcoe_disable)(struct net_device *dev);
    int            (*ndo_fcoe_ddp_setup)(struct net_device *dev,
                              u16 xid,
                              struct scatterlist *sgl,
                              unsigned int sgc);
    int            (*ndo_fcoe_ddp_done)(struct net_device *dev,
                             u16 xid);
    int            (*ndo_fcoe_ddp_target)(struct net_device *dev,
                               u16 xid,
                               struct scatterlist *sgl,
                               unsigned int sgc);
    int            (*ndo_fcoe_get_hbainfo)(struct net_device *dev,
                            struct netdev_fcoe_hbainfo *hbainfo);

    int            (*ndo_fcoe_get_wwn)(struct net_device *dev,
                            u64 *wwn, int type);

    int            (*ndo_rx_flow_steer)(struct net_device *dev,
                             const struct sk_buff *skb,
                             u16 rxq_index,
                             u32 flow_id);
    int            (*ndo_add_slave)(struct net_device *dev,
                         struct net_device *slave_dev,
                         struct netlink_ext_ack *extack);
    int            (*ndo_del_slave)(struct net_device *dev,
                         struct net_device *slave_dev);
    netdev_features_t    (*ndo_fix_features)(struct net_device *dev,
                            netdev_features_t features);
    int            (*ndo_set_features)(struct net_device *dev,
                            netdev_features_t features);
    int            (*ndo_neigh_construct)(struct net_device *dev,
                               struct neighbour *n);
    void            (*ndo_neigh_destroy)(struct net_device *dev,
                             struct neighbour *n);

    int            (*ndo_fdb_add)(struct ndmsg *ndm,
                           struct nlattr *tb[],
                           struct net_device *dev,
                           const unsigned char *addr,
                           u16 vid,
                           u16 flags,
                           struct netlink_ext_ack *extack);
    int            (*ndo_fdb_del)(struct ndmsg *ndm,
                           struct nlattr *tb[],
                           struct net_device *dev,
                           const unsigned char *addr,
                           u16 vid);
    int            (*ndo_fdb_dump)(struct sk_buff *skb,
                        struct netlink_callback *cb,
                        struct net_device *dev,
                        struct net_device *filter_dev,
                        int *idx);
    int            (*ndo_fdb_get)(struct sk_buff *skb,
                           struct nlattr *tb[],
                           struct net_device *dev,
                           const unsigned char *addr,
                           u16 vid, u32 portid, u32 seq,
                           struct netlink_ext_ack *extack);
    int            (*ndo_bridge_setlink)(struct net_device *dev,
                              struct nlmsghdr *nlh,
                              u16 flags,
                              struct netlink_ext_ack *extack);
    int            (*ndo_bridge_getlink)(struct sk_buff *skb,
                              u32 pid, u32 seq,
                              struct net_device *dev,
                              u32 filter_mask,
                              int nlflags);
    int            (*ndo_bridge_dellink)(struct net_device *dev,
                              struct nlmsghdr *nlh,
                              u16 flags);
    int            (*ndo_change_carrier)(struct net_device *dev,
                              bool new_carrier);
    int            (*ndo_get_phys_port_id)(struct net_device *dev,
                            struct netdev_phys_item_id *ppid);
    int            (*ndo_get_port_parent_id)(struct net_device *dev,
                              struct netdev_phys_item_id *ppid);
    int            (*ndo_get_phys_port_name)(struct net_device *dev,
                              char *name, size_t len);
    void            (*ndo_udp_tunnel_add)(struct net_device *dev,
                              struct udp_tunnel_info *ti);
    void            (*ndo_udp_tunnel_del)(struct net_device *dev,
                              struct udp_tunnel_info *ti);
    void*            (*ndo_dfwd_add_station)(struct net_device *pdev,
                            struct net_device *dev);
    void            (*ndo_dfwd_del_station)(struct net_device *pdev,
                            void *priv);

    int            (*ndo_get_lock_subclass)(struct net_device *dev);
    int            (*ndo_set_tx_maxrate)(struct net_device *dev,
                              int queue_index,
                              u32 maxrate);
    int            (*ndo_get_iflink)(const struct net_device *dev);
    int            (*ndo_change_proto_down)(struct net_device *dev,
                             bool proto_down);
    int            (*ndo_fill_metadata_dst)(struct net_device *dev,
                               struct sk_buff *skb);
    void            (*ndo_set_rx_headroom)(struct net_device *dev,
                               int needed_headroom);
    int            (*ndo_bpf)(struct net_device *dev,
                       struct netdev_bpf *bpf);
    int            (*ndo_xdp_xmit)(struct net_device *dev, int n,
                        struct xdp_frame **xdp,
                        u32 flags);
    int            (*ndo_xsk_async_xmit)(struct net_device *dev,
                              u32 queue_id);
    struct devlink_port *    (*ndo_get_devlink_port)(struct net_device *dev);
};


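  • ndo_start_xmit: the packet transmit function; sk_buff is the structure that carries packets on both the send and receive paths;
  • ndo_tx_timeout: the handler called when a transmission times out;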
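The way the core reaches a driver through this table can be pictured with a small user-space model. This is only a sketch under simplifying assumptions: the structures are cut-down stand-ins that reuse the kernel names, the statistics live directly in the device instead of dev->stats, and demo_start_xmit(), demo_tx_timeout(), model_xmit() and model_open() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel types; only the fields used here. */
struct sk_buff {
    unsigned int len;
};

struct net_device;

typedef enum {
    NETDEV_TX_OK   = 0x00,  /* packet accepted by the driver */
    NETDEV_TX_BUSY = 0x10   /* driver could not take the packet */
} netdev_tx_t;

struct net_device_ops {
    int         (*ndo_open)(struct net_device *dev);
    netdev_tx_t (*ndo_start_xmit)(struct sk_buff *skb, struct net_device *dev);
    void        (*ndo_tx_timeout)(struct net_device *dev);
};

struct net_device {
    const struct net_device_ops *netdev_ops;
    unsigned long tx_packets;
    unsigned long tx_bytes;
    unsigned long tx_errors;
};

/* Driver side: count the packet and report success. */
static netdev_tx_t demo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    dev->tx_packets++;
    dev->tx_bytes += skb->len;
    return NETDEV_TX_OK;
}

/* Driver side: a stuck transmitter is recorded as an error. */
static void demo_tx_timeout(struct net_device *dev)
{
    dev->tx_errors++;
}

/* Only the hooks the driver needs are filled in; the rest stay NULL. */
static const struct net_device_ops demo_ops = {
    .ndo_start_xmit = demo_start_xmit,
    .ndo_tx_timeout = demo_tx_timeout,
};

/* Core side: ndo_start_xmit is required, so it is called unconditionally. */
static netdev_tx_t model_xmit(struct sk_buff *skb, struct net_device *dev)
{
    return dev->netdev_ops->ndo_start_xmit(skb, dev);
}

/* Core side: optional hooks are checked for NULL before the call. */
static int model_open(struct net_device *dev)
{
    const struct net_device_ops *ops = dev->netdev_ops;

    return ops->ndo_open ? ops->ndo_open(dev) : 0;
}
```

The table is const and shared by every device the driver creates, which is why a driver only has to point netdev_ops at it once during initialization.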
2.3 Core functions
2.3.1 Allocating, freeing, and modifying sk_buff


struct sk_buff *alloc_skb(unsigned int len, gfp_t priority);
struct sk_buff *dev_alloc_skb(unsigned int len);






void kfree_skb(struct sk_buff *skb);
void dev_kfree_skb(struct sk_buff *skb);
void dev_kfree_skb_irq(struct sk_buff *skb);
void dev_kfree_skb_any(struct sk_buff *skb);



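  • dev_kfree_skb() is for use outside interrupt context;
  • dev_kfree_skb_irq() is for use in interrupt context;
  • dev_kfree_skb_any() can be used in either context; it simply checks which context it is running in and then calls dev_kfree_skb_irq() or dev_kfree_skb().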


unsigned char *skb_put(struct sk_buff *skb, unsigned int len);



unsigned char *skb_push(struct sk_buff *skb, unsigned int len);

skb_push() moves skb->data toward the head of the buffer by len bytes (skb->data -= len) and increases skb->len by len (skb->len += len).
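The pointer movements of skb_put() and skb_push() are easier to follow on a toy buffer. The sketch here is a user-space model, not the kernel implementation: struct skb_model and the model_*() functions are hypothetical names, and plain assert() stands in for the panic the kernel raises when a call would run off either end of the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model of the four sk_buff buffer pointers. */
struct skb_model {
    unsigned char *head;  /* start of the allocated buffer */
    unsigned char *data;  /* start of the valid data */
    unsigned char *tail;  /* end of the valid data */
    unsigned char *end;   /* end of the allocated buffer */
    unsigned int len;     /* number of valid bytes: tail - data */
};

static void model_init(struct skb_model *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;
}

/* Like skb_reserve(): open up headroom in a still-empty buffer. */
static void model_reserve(struct skb_model *skb, unsigned int len)
{
    skb->data += len;
    skb->tail += len;
}

/* Like skb_put(): extend the data area at the tail, return the old tail. */
static unsigned char *model_put(struct skb_model *skb, unsigned int len)
{
    unsigned char *old_tail = skb->tail;

    assert(len <= (size_t)(skb->end - skb->tail));  /* not enough tailroom */
    skb->tail += len;
    skb->len += len;
    return old_tail;
}

/* Like skb_push(): extend the data area at the front, return the new data. */
static unsigned char *model_push(struct skb_model *skb, unsigned int len)
{
    assert(len <= (size_t)(skb->data - skb->head)); /* not enough headroom */
    skb->data -= len;
    skb->len += len;
    return skb->data;
}
```

A typical transmit path uses them in exactly this order: reserve headroom first, put the payload, then push each protocol header in front of it as the packet travels down the stack.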


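In the Linux kernel, skb_reserve() reserves space at the head of the data buffer, typically to leave room for protocol headers or to force the data onto a particular alignment boundary.

 static inline void skb_reserve(struct sk_buff *skb, int len)
 {
       /* advance the data pointer by len bytes */
       skb->data += len;
       /* advance the tail pointer by len bytes */
       skb->tail += len;
 }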


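skb_reserve(skb, 2);     // align the IP header on a 16-byte boundary

Copying the 14-byte Ethernet header into the buffer after this call places the IP header, which immediately follows it, at offset 2 + 14 = 16 from the start of the buffer, that is, on a 16-byte boundary.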
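The arithmetic behind the 2-byte reserve can be checked with a few lines of C. This is a sketch that works purely on offsets and assumes the buffer itself starts on a suitably aligned address; ip_header_offset() and offset_is_aligned() are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_HLEN 14  /* length of an Ethernet header, as in the kernel */

/* Offset of the IP header from the buffer start for a given reserved headroom. */
static size_t ip_header_offset(size_t headroom)
{
    return headroom + ETH_HLEN;
}

/* Nonzero if offset is a multiple of boundary. */
static int offset_is_aligned(size_t offset, size_t boundary)
{
    return offset % boundary == 0;
}
```

With no headroom the IP header starts at offset 14, which is not even 4-byte aligned; with 2 bytes reserved it starts at offset 16.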

2.3.2 dev_queue_xmit


int dev_queue_xmit(struct sk_buff *skb);
2.3.3 netif_rx

Received packets are handed to the upper layers in the same way, by passing a pointer to a struct sk_buff to netif_rx(). Its prototype is:

int netif_rx(struct sk_buff *skb);
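Before an Ethernet driver calls netif_rx(), it normally has to tell the stack which protocol the frame carries; in the kernel that is the job of eth_type_trans(), which is mentioned in the net_device comments earlier. The user-space sketch here shows only the core of that step, reading the EtherType field out of the 14-byte header. model_frame_protocol() is a hypothetical name, and the real function does more (it also strips the header and classifies the destination address).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN  6       /* bytes in a MAC address */
#define ETH_HLEN  14      /* destination MAC + source MAC + EtherType */
#define ETH_P_IP  0x0800  /* EtherType for IPv4 */

/* Return the EtherType in host byte order, or 0 if the frame is too short. */
static uint16_t model_frame_protocol(const unsigned char *frame, size_t len)
{
    if (len < ETH_HLEN)
        return 0;
    /* The EtherType follows the two MAC addresses and is big-endian on the wire. */
    return (uint16_t)((frame[2 * ETH_ALEN] << 8) | frame[2 * ETH_ALEN + 1]);
}
```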
2.3.4 Allocating, registering, and unregistering net_device


/* Support for loadable net-drivers */
struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
                                    unsigned char name_assign_type,
                                    void (*setup)(struct net_device *),
                                    unsigned int txqs, unsigned int rxqs);
int dev_get_valid_name(struct net *net, struct net_device *dev,
                       const char *name);

#define alloc_netdev(sizeof_priv, name, name_assign_type, setup) \
        alloc_netdev_mqs(sizeof_priv, name, name_assign_type, setup, 1, 1)




/* interface name assignment types (sysfs name_assign_type attribute) */
#define NET_NAME_UNKNOWN    0    /* unknown origin (not exposed to userspace) */
#define NET_NAME_ENUM        1    /* enumerated by kernel */
#define NET_NAME_PREDICTABLE    2    /* predictably named by the kernel */
#define NET_NAME_USER        3    /* provided by user-space */
#define NET_NAME_RENAMED    4    /* renamed by user-space */

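setup: a pointer to the setup() function for the net_device, usually ether_setup. ether_setup is a callback that fills in some members of the newly allocated net_device with the generic values for Ethernet devices;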
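The name argument may contain a %d placeholder, as in "vnet%d". As far as I understand the 5.2 code, the placeholder is resolved when the device is registered, and the kernel picks the lowest index that is not already taken. The following user-space sketch models that choice; model_alloc_name() is a hypothetical name, and the limit of 1024 candidates is arbitrary rather than the kernel's real bound.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define IFNAMSIZ 16  /* size of the kernel's interface-name buffer */

/*
 * Expand a template such as "vnet%d" to the lowest-numbered name that does
 * not appear in 'taken'. Returns the chosen index, or -1 on failure.
 */
static int model_alloc_name(const char *tmpl, const char *const *taken,
                            size_t ntaken, char *out)
{
    const char *p = strstr(tmpl, "%d");
    int i;

    if (!p)
        return -1;                      /* no placeholder in the template */

    for (i = 0; i < 1024; i++) {
        size_t j;
        int n = snprintf(out, IFNAMSIZ, "%.*s%d%s",
                         (int)(p - tmpl), tmpl, i, p + 2);

        if (n < 0 || n >= IFNAMSIZ)
            return -1;                  /* name would not fit */
        for (j = 0; j < ntaken; j++)
            if (strcmp(out, taken[j]) == 0)
                break;
        if (j == ntaken)
            return i;                   /* first free index */
    }
    return -1;
}
```

This is why loading such a driver on a machine with no other vnet devices yields an interface called vnet0.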



int register_netdev(struct net_device *dev);
int register_netdevice(struct net_device *dev);



void unregister_netdev(struct net_device *dev);
void unregister_netdevice(struct net_device *dev);
2.3.5 netif_stop_queue/netif_wake_queue


/**
 *      netif_stop_queue - stop transmitted packets
 *      @dev: network device
 *
 *      Stop upper layers calling the device hard_start_xmit routine.
 *      Used for flow control when transmit resources are unavailable.
 */
static inline void netif_stop_queue(struct net_device *dev)
{
        netif_tx_stop_queue(netdev_get_tx_queue(dev, 0));
}


 *      netif_wake_queue - restart transmit
 *      @dev: network device
 *      Allow upper layers to call the device hard_start_xmit routine.
 *      Used for flow control when transmit resources are available.
static inline void netif_wake_queue(struct net_device *dev)
        netif_tx_wake_queue(netdev_get_tx_queue(dev, 0));
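
The flow-control idea behind these two helpers can be modelled in user space. The sketch below assumes a hypothetical device with a fixed number of transmit descriptors; `struct mini_txq`, `TX_RING_SIZE`, `mini_start_xmit()` and `mini_tx_complete()` are invented for this illustration and are not kernel APIs. The comments mark where a driver would typically call netif_stop_queue() and netif_wake_queue().

```c
#include <assert.h>
#include <stdbool.h>

#define TX_RING_SIZE 4   /* hypothetical number of hardware TX descriptors */

/* Hypothetical model of one transmit queue. */
struct mini_txq {
    unsigned int in_flight;  /* frames handed to the "hardware" */
    bool stopped;            /* mirrors the stopped state of the queue */
};

/*
 * Runs where a driver's transmit routine would run.
 * Returns false when the queue is stopped, i.e. when the upper
 * layers would not be allowed to call the transmit routine at all.
 */
static bool mini_start_xmit(struct mini_txq *q)
{
    if (q->stopped)
        return false;

    q->in_flight++;
    if (q->in_flight == TX_RING_SIZE)
        q->stopped = true;   /* here a driver would call netif_stop_queue() */
    return true;
}

/* Runs where a transmit-complete interrupt handler would run. */
static void mini_tx_complete(struct mini_txq *q)
{
    assert(q->in_flight > 0);
    q->in_flight--;
    if (q->stopped)
        q->stopped = false;  /* here a driver would call netif_wake_queue() */
}
```

With `TX_RING_SIZE` set to 4, the fifth call to `mini_start_xmit()` is refused until `mini_tx_complete()` has released a descriptor.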





  • Initialize the NIC driver;
  • Send packets;
  • Receive packets;
3.1 Project structure


3.2 NIC driver initialization


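  • Use alloc_netdev() to allocate a net_device structure for the network device;
  • Set up the hardware registers of the NIC (skipped for a virtual NIC);
  • Set the members of the net_device structure:
    • Set the device's operations table; for example, a transmit request from the upper layers eventually ends up in ndo_start_xmit();
    • Set the device's MAC address. Any value can be used here; a real NIC would read the MAC address from the hardware;
    • Set the device's flags. Because this is a virtual NIC that never talks to a real network device, and the packets reported upward are constructed by hand, there is no need to use ARP (Address Resolution Protocol) to look up the peer's MAC address before communicating. If ARP were enabled to resolve the MAC address for a given IP, the lookup would fail;
  • Use register_netdev() to register the network device with the kernel;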
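
Although the MAC address of a virtual NIC can be chosen freely, two bits of the first octet carry a meaning in Ethernet addressing: the least significant bit marks a group (multicast) address and the next bit marks a locally administered address. The helpers below are a small user-space sketch with hypothetical names (`mac_is_group`, `mac_is_local`). Applied to the address used in this article, 89:89:89:89:89:89, they show that the group bit is set; the kernel's is_valid_ether_addr() rejects such addresses, so a real driver would normally pick a unicast address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the I/G bit (bit 0 of the first octet) is set: group address. */
static bool mac_is_group(const uint8_t *mac)
{
    assert(mac != NULL);
    return (mac[0] & 0x01) != 0;
}

/* True if the U/L bit (bit 1 of the first octet) is set: locally administered. */
static bool mac_is_local(const uint8_t *mac)
{
    assert(mac != NULL);
    return (mac[0] & 0x02) != 0;
}
```

For comparison, the eth0 address that appears in the test output, 52:45:18:A0:AC:CA, is a locally administered unicast address: `mac_is_group()` is false and `mac_is_local()` is true for it.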


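/*
 * init entry function
 */
static int vnet_init(void)
{
   int ret;

   /* 1. Allocate a net_device structure */
   vnet_dev = alloc_netdev(0, "vnet%d", NET_NAME_ENUM, ether_setup);
   if (!vnet_dev)
      return -ENOMEM;

   /* 2. Set the device's operations table; a transmit request from the upper layers ends up in ndo_start_xmit() */
   vnet_dev->netdev_ops = &vnet_ops;

   /* 3. Set the MAC address; any value can be used here, a real NIC would read it from the hardware */
   vnet_dev->dev_addr[0] = 0x89;
   vnet_dev->dev_addr[1] = 0x89;
   vnet_dev->dev_addr[2] = 0x89;
   vnet_dev->dev_addr[3] = 0x89;
   vnet_dev->dev_addr[4] = 0x89;
   vnet_dev->dev_addr[5] = 0x89;

   /* 4. Set the flags of the virtual NIC: do not use ARP */
   vnet_dev->flags |= IFF_NOARP;

   /* 5. Register the network device with the kernel */
   ret = register_netdev(vnet_dev);
   if (ret) {
      free_netdev(vnet_dev);
      return ret;
   }

   return 0;
}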
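3.3 Sending packets


  • When transmitting, call netif_stop_queue() to stop the upper layers from passing down new data;
  • Call the receive function with the transmitted sk_buff. It forges a ping reply from that buffer and submits it to the upper layers. Because a matching reply comes back for whatever the upper layers send, the protocol stack believes that this network device can communicate normally with the device at the given IP;
  • Call dev_kfree_skb() to free the transmitted sk_buff;
  • Update the transmit statistics: the total number of packets sent and the total number of bytes sent;
  • Call netif_wake_queue() to wake the blocked upper layers so that the protocol stack can call the device's transmit function again;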


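/*
 * Send a packet
 */
static netdev_tx_t vnet_send_packet(struct sk_buff *skb, struct net_device *dev)
{
   static int cnt = 0;
   unsigned int len = skb->len;   /* saved now because skb is freed in step 3 */
   int i = 0;

   printk("vnet_send_packet: cnt = %d\n", ++cnt);

   // print the length and contents of the transmitted Ethernet frame
   printk("vnet_send_packet: length = %u\n", skb->len);
   for (i = 0; i < skb->len; i++) {
      printk(KERN_CONT "0x%02x ", *(skb->data + i));
   }
   printk(KERN_CONT "\n");

    /* 1. While transmitting, stop the upper layers from passing down new data */
    netif_stop_queue(dev);

    /* 2. Call the receive function with the transmitted sk_buff; it forges a ping reply and submits it to the upper layers */
    emulator_rx_packet(skb, dev);

    /* 3. Free the transmitted sk_buff */
    dev_kfree_skb(skb);

    /* 4. Update the transmit statistics: packets sent and bytes sent */
    dev->stats.tx_packets++;
    dev->stats.tx_bytes += len;

    /* 5. Wake the blocked upper layers so that they can pass down data again */
    netif_wake_queue(dev);

    return NETDEV_TX_OK;
}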
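3.4 Receiving packets

Take the ping command as an example. A ping packet can be captured with Wireshark; consider a 74-byte frame:


  • Swap the source MAC and destination MAC in the transmitted sk_buff;
  • Swap the source IP and destination IP in the transmitted sk_buff;
  • Change the ICMP type of the packet: the outgoing ping has type 0x08 (echo request), which is changed to 0x00 (echo reply) to represent the received ping;
  • Use ip_fast_csum() to recompute the checksum in the iphdr structure;
  • Use dev_alloc_skb() to allocate a new sk_buff;
  • Use skb_reserve(rx_skb, 2) to advance the data pointer of the new sk_buff by 2 bytes, which leaves headroom at the front of the buffer;
  • Use memcpy() to copy the modified sk_buff->data to the address pointed to by the data member of the new sk_buff;
  • Set the net_device of the new sk_buff;
  • Use eth_type_trans() to determine the upper-layer protocol. It moves sk_buff->data past the Ethernet header to the upper-layer protocol data, and its return value is stored in the protocol member of the sk_buff;
  • Update the receive statistics, and finally pass the sk_buff to the upper layers with netif_rx();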
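
The checksum step can be checked by hand in user space. The sketch below implements the standard Internet checksum (RFC 1071) over an IPv4 header; `inet_checksum()` and `ip_header_refresh_checksum()` are hypothetical helper names for this illustration, while the kernel code uses ip_fast_csum(), whose second argument is the header length in 32-bit words (ihl).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over len bytes (len is assumed to be even). */
static uint16_t inet_checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(len % 2 == 0);
    for (i = 0; i < len; i += 2)
        sum += (uint32_t)((p[i] << 8) | p[i + 1]);
    while (sum >> 16)                       /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Recompute the IPv4 header checksum in place: zero the checksum
 * field (bytes 10 and 11), checksum the header, store the result
 * in network byte order. Returns the new checksum.
 */
static uint16_t ip_header_refresh_checksum(uint8_t *iph)
{
    size_t hdr_len = (size_t)(iph[0] & 0x0f) * 4;  /* ihl counts 32-bit words */
    uint16_t csum;

    iph[10] = 0;
    iph[11] = 0;
    csum = inet_checksum(iph, hdr_len);
    iph[10] = (uint8_t)(csum >> 8);
    iph[11] = (uint8_t)(csum & 0xff);
    return csum;
}
```

Because the checksum is a sum of 16-bit words, swapping the source and destination addresses does not change it. This agrees with the frame dump in section 3.6.4, where the header checksum bytes 0x35 0xc2 are identical in the transmitted frame and in the emulated reply.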


/*
 * Emulate receiving a packet
 */
static void emulator_rx_packet(struct sk_buff *skb, struct net_device *dev)
{
    // Ethernet frame header
    struct ethhdr *ethhdr;

    // IP datagram header
    struct iphdr *ih;

    // temporary MAC buffer
    unsigned char mac_addr[ETH_ALEN];

    // temporary IP variables
    __be32 *saddr, *daddr, tmp;

    // ICMP type
    unsigned char *type;

    // receive socket buffer
    struct sk_buff *rx_skb;

    int i = 0;

    /* 1. Swap the source MAC and destination MAC in the transmitted sk_buff */
    ethhdr = (struct ethhdr *)skb->data;
    memcpy(mac_addr, ethhdr->h_dest, ETH_ALEN);
    memcpy(ethhdr->h_dest, ethhdr->h_source, ETH_ALEN);
    memcpy(ethhdr->h_source, mac_addr, ETH_ALEN);

    /* 2. Swap the source IP and destination IP in the transmitted sk_buff */
    ih = (struct iphdr *)(skb->data + sizeof(struct ethhdr));
    saddr = &ih->saddr;
    daddr = &ih->daddr;
    tmp = *saddr;
    *saddr = *daddr;
    *daddr = tmp;

    /* 3. Change the ICMP type from 0x08 (echo request) to 0x00 (echo reply); assumes an IP header without options */
    type = skb->data + sizeof(struct ethhdr) + sizeof(struct iphdr);
    *type = 0x00;

    /* 4. Recompute the checksum in iphdr; it covers only the IP header */
    ih->check = 0x00;
    ih->check = ip_fast_csum((unsigned char *)ih, ih->ihl);

    /* 5. Allocate a new sk_buff */
    rx_skb = dev_alloc_skb(skb->len + 2);  // the data area is empty: rx_skb->data == rx_skb->tail; dev_alloc_skb() has already reserved NET_SKB_PAD bytes of headroom
    if (!rx_skb) {
        dev->stats.rx_dropped++;
        return;
    }

    /* 6. Advance the data and tail pointers by 2 bytes to leave headroom at the front of the buffer */
    skb_reserve(rx_skb, 2);   /* align IP on 16B boundary */

    /* 7. Copy the modified sk_buff->data to the address pointed to by the data member of the new sk_buff */
    memcpy(skb_put(rx_skb, skb->len), skb->data, skb->len);

    /* 8. Set the net_device of the new sk_buff */
    rx_skb->dev = dev;

    /* 9. Set the protocol member of the sk_buff */
    rx_skb->protocol = eth_type_trans(rx_skb, dev);  // afterwards rx_skb->data skips the Ethernet header: rx_skb->len -= ETH_HLEN, rx_skb->data += ETH_HLEN
    rx_skb->ip_summed = CHECKSUM_UNNECESSARY;        // the stack does not verify checksums (the ICMP checksum was not recomputed in step 3)

    /* 10. Update the receive statistics */
    dev->stats.rx_packets++;
    dev->stats.rx_bytes += skb->len;

    // print the length and contents of the emulated received packet
    printk("emulator_rx_packet: length = %u\n", rx_skb->len);
    for (i = 0; i < rx_skb->len; i++) {
      printk(KERN_CONT "0x%02x ", *(rx_skb->data + i));
    }
    printk(KERN_CONT "\n");

    /* 11. Pass the sk_buff to the upper layers */
    netif_rx(rx_skb);
}
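
One detail of the code above is that the ICMP type byte is changed without touching the ICMP checksum; the driver sets ip_summed to CHECKSUM_UNNECESSARY, which is presumably why the reply is still accepted. If a valid checksum were wanted, it could be patched incrementally as described in RFC 1624 instead of being recomputed over the whole message. The following user-space sketch illustrates that technique; `csum_update16()` and `icmp_echo_to_reply()` are hypothetical helpers and are not part of the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ICMP_TYPE_ECHO_REQUEST 8
#define ICMP_TYPE_ECHO_REPLY   0

/*
 * RFC 1624 incremental update of a ones' complement checksum when one
 * 16-bit word changes from old_word to new_word:
 *     HC' = ~(~HC + ~m + m')
 */
static uint16_t csum_update16(uint16_t old_csum, uint16_t old_word,
                              uint16_t new_word)
{
    uint32_t sum = (uint16_t)~old_csum;

    sum += (uint16_t)~old_word;
    sum += new_word;
    while (sum >> 16)                       /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Turn an ICMP echo request into an echo reply in place and keep the
 * checksum valid. icmp points at the first byte of the ICMP header:
 * byte 0 = type, byte 1 = code, bytes 2..3 = checksum (big-endian).
 */
static void icmp_echo_to_reply(uint8_t *icmp)
{
    uint16_t old_word, new_word, old_csum, new_csum;

    assert(icmp != NULL && icmp[0] == ICMP_TYPE_ECHO_REQUEST);

    old_word = (uint16_t)((icmp[0] << 8) | icmp[1]);
    old_csum = (uint16_t)((icmp[2] << 8) | icmp[3]);

    icmp[0] = ICMP_TYPE_ECHO_REPLY;
    new_word = (uint16_t)((icmp[0] << 8) | icmp[1]);

    new_csum = csum_update16(old_csum, old_word, new_word);
    icmp[2] = (uint8_t)(new_csum >> 8);
    icmp[3] = (uint8_t)(new_csum & 0xff);
}
```

For the echo request shown in section 3.6.4 (type 0x08, code 0x00, checksum 0x62 0x67), this update would yield a reply checksum of 0x6a 0x67, whereas the emulated reply in the dump still carries 0x62 0x67.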
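3.5 Complete code
#include <linux/module.h>
#include <linux/types.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/bitops.h>
#include <linux/ip.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/device.h>
#include <linux/skbuff.h>
#include <linux/platform_device.h>

/* virtual network device */
static struct net_device  *vnet_dev;

/*
 * Emulate receiving a packet
 */
static void emulator_rx_packet(struct sk_buff *skb, struct net_device *dev)
{
    // Ethernet frame header
    struct ethhdr *ethhdr;

    // IP datagram header
    struct iphdr *ih;

    // temporary MAC buffer
    unsigned char mac_addr[ETH_ALEN];

    // temporary IP variables
    __be32 *saddr, *daddr, tmp;

    // ICMP type
    unsigned char *type;

    // receive socket buffer
    struct sk_buff *rx_skb;

    int i = 0;

    /* 1. Swap the source MAC and destination MAC in the transmitted sk_buff */
    ethhdr = (struct ethhdr *)skb->data;
    memcpy(mac_addr, ethhdr->h_dest, ETH_ALEN);
    memcpy(ethhdr->h_dest, ethhdr->h_source, ETH_ALEN);
    memcpy(ethhdr->h_source, mac_addr, ETH_ALEN);

    /* 2. Swap the source IP and destination IP in the transmitted sk_buff */
    ih = (struct iphdr *)(skb->data + sizeof(struct ethhdr));
    saddr = &ih->saddr;
    daddr = &ih->daddr;
    tmp = *saddr;
    *saddr = *daddr;
    *daddr = tmp;

    /* 3. Change the ICMP type from 0x08 (echo request) to 0x00 (echo reply); assumes an IP header without options */
    type = skb->data + sizeof(struct ethhdr) + sizeof(struct iphdr);
    *type = 0x00;

    /* 4. Recompute the checksum in iphdr; it covers only the IP header */
    ih->check = 0x00;
    ih->check = ip_fast_csum((unsigned char *)ih, ih->ihl);

    /* 5. Allocate a new sk_buff */
    rx_skb = dev_alloc_skb(skb->len + 2);  // the data area is empty: rx_skb->data == rx_skb->tail; dev_alloc_skb() has already reserved NET_SKB_PAD bytes of headroom
    if (!rx_skb) {
        dev->stats.rx_dropped++;
        return;
    }

    /* 6. Advance the data and tail pointers by 2 bytes to leave headroom at the front of the buffer */
    skb_reserve(rx_skb, 2);   /* align IP on 16B boundary */

    /* 7. Copy the modified sk_buff->data to the address pointed to by the data member of the new sk_buff */
    memcpy(skb_put(rx_skb, skb->len), skb->data, skb->len);

    /* 8. Set the net_device of the new sk_buff */
    rx_skb->dev = dev;

    /* 9. Set the protocol member of the sk_buff */
    rx_skb->protocol = eth_type_trans(rx_skb, dev);  // afterwards rx_skb->data skips the Ethernet header: rx_skb->len -= ETH_HLEN, rx_skb->data += ETH_HLEN
    rx_skb->ip_summed = CHECKSUM_UNNECESSARY;        // the stack does not verify checksums (the ICMP checksum was not recomputed in step 3)

    /* 10. Update the receive statistics */
    dev->stats.rx_packets++;
    dev->stats.rx_bytes += skb->len;

    // print the length and contents of the emulated received packet
    printk("emulator_rx_packet: length = %u\n", rx_skb->len);
    for (i = 0; i < rx_skb->len; i++) {
      printk(KERN_CONT "0x%02x ", *(rx_skb->data + i));
    }
    printk(KERN_CONT "\n");

    /* 11. Pass the sk_buff to the upper layers */
    netif_rx(rx_skb);
}

/*
 * Send a packet
 */
static netdev_tx_t vnet_send_packet(struct sk_buff *skb, struct net_device *dev)
{
   static int cnt = 0;
   unsigned int len = skb->len;   /* saved now because skb is freed in step 3 */
   int i = 0;

   printk("vnet_send_packet: cnt = %d\n", ++cnt);

   // print the length and contents of the transmitted Ethernet frame
   printk("vnet_send_packet: length = %u\n", skb->len);
   for (i = 0; i < skb->len; i++) {
      printk(KERN_CONT "0x%02x ", *(skb->data + i));
   }
   printk(KERN_CONT "\n");

    /* 1. While transmitting, stop the upper layers from passing down new data */
    netif_stop_queue(dev);

    /* 2. Call the receive function with the transmitted sk_buff; it forges a ping reply and submits it to the upper layers */
    emulator_rx_packet(skb, dev);

    /* 3. Free the transmitted sk_buff */
    dev_kfree_skb(skb);

    /* 4. Update the transmit statistics: packets sent and bytes sent */
    dev->stats.tx_packets++;
    dev->stats.tx_bytes += len;

    /* 5. Wake the blocked upper layers so that they can pass down data again */
    netif_wake_queue(dev);

    return NETDEV_TX_OK;
}

/* NIC device operations table */
static const struct net_device_ops vnet_ops = {
    .ndo_start_xmit   = vnet_send_packet,      // send a packet
};

/*
 * init entry function
 */
static int vnet_init(void)
{
   int ret;

   /* 1. Allocate a net_device structure */
   vnet_dev = alloc_netdev(0, "vnet%d", NET_NAME_ENUM, ether_setup);
   if (!vnet_dev)
      return -ENOMEM;

   /* 2. Set the device's operations table; a transmit request from the upper layers ends up in ndo_start_xmit() */
   vnet_dev->netdev_ops = &vnet_ops;

   /* 3. Set the MAC address; any value can be used here, a real NIC would read it from the hardware */
   vnet_dev->dev_addr[0] = 0x89;
   vnet_dev->dev_addr[1] = 0x89;
   vnet_dev->dev_addr[2] = 0x89;
   vnet_dev->dev_addr[3] = 0x89;
   vnet_dev->dev_addr[4] = 0x89;
   vnet_dev->dev_addr[5] = 0x89;

   /* 4. Set the flags of the virtual NIC: do not use ARP */
   vnet_dev->flags |= IFF_NOARP;

   /* 5. Register the network device with the kernel */
   ret = register_netdev(vnet_dev);
   if (ret) {
      free_netdev(vnet_dev);
      return ret;
   }

   return 0;
}

/*
 * exit function
 */
static void vnet_exit(void)
{
   unregister_netdev(vnet_dev);
   free_netdev(vnet_dev);
}

module_init(vnet_init);
module_exit(vnet_exit);
MODULE_LICENSE("GPL");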
3.6 Testing
3.6.1 Building the virtual NIC driver

Build the virtual NIC driver and copy vnet_dev.ko to the NFS root filesystem.

root@zhengyang:/work/sambashare/drivers/19.vnet_dev# cp /work/sambashare/drivers/19.vnet_dev/vnet_dev.ko /work/nfs_root/rootfs


[root@zy:/]# insmod vnet_dev.ko
vnet_dev: loading out-of-tree module taints kernel.


[root@zy:/]# ls /sys/class/net
eth0   lo     sit0   vnet0
3.6.2 Configuring the virtual NIC


[root@zy:/]# ifconfig
eth0      Link encap:Ethernet  HWaddr 52:45:18:A0:AC:CA
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::5045:18ff:fea0:acca/64 Scope:Link
          RX packets:1981 errors:0 dropped:0 overruns:0 frame:0
          TX packets:767 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2653725 (2.5 MiB)  TX bytes:128314 (125.3 KiB)
          Interrupt:55 Base address:0x8300

lo        Link encap:Local Loopback
          inet addr:  Mask:
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

ifconfig -a shows the virtual NIC we added; at this point the interface has not been brought up yet:

[root@zy:/]# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 52:45:18:A0:AC:CA
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::5045:18ff:fea0:acca/64 Scope:Link
          RX packets:2088 errors:0 dropped:0 overruns:0 frame:0
          TX packets:817 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2724795 (2.5 MiB)  TX bytes:137254 (134.0 KiB)
          Interrupt:55 Base address:0x8300

lo        Link encap:Local Loopback
          inet addr:  Mask:
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

sit0      Link encap:IPv6-in-IPv4
          NOARP  MTU:1480  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

vnet0     Link encap:Ethernet  HWaddr 89:89:89:89:89:89  // the newly added virtual NIC has no IP address yet
          BROADCAST NOARP MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)


[root@zy:/]# ifconfig vnet0   // configure the virtual NIC
vnet_send_packet: cnt = 1vnet_send_packet: cnt = 2
[root@zy:/]# ifconfig

eth0      Link encap:Ethernet  HWaddr 52:45:18:A0:AC:CA
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::5045:18ff:fea0:acca/64 Scope:Link
          RX packets:2105 errors:0 dropped:0 overruns:0 frame:0
          TX packets:859 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2728356 (2.6 MiB)  TX bytes:155358 (151.7 KiB)
          Interrupt:55 Base address:0x8300

lo        Link encap:Local Loopback
          inet addr:  Mask:
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

vnet0     Link encap:Ethernet  HWaddr 89:89:89:89:89:89                 // the virtual NIC
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::8b89:89ff:fe89:8989/64 Scope:Link
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:140 (140.0 B)  TX bytes:140 (140.0 B)
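3.6.3 ping test

First ping the interface's own address with the command below. When a host pings its own address, the loopback interface is used and the transmit function of the NIC driver is not called.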

[root@zy:/]# ping
PING ( 56 data bytes
64 bytes from seq=0 ttl=64 time=1.585 ms
64 bytes from seq=1 ttl=64 time=0.907 ms
64 bytes from seq=2 ttl=64 time=0.897 ms
64 bytes from seq=3 ttl=64 time=0.900 ms

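Next ping another address on the same network with the command below. This time the virtual NIC driver we wrote is used, and its transmit function is called.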

[root@zy:/]# ping
PING ( 56 data bytes
vnet_send_packet: cnt = 9
64 bytes from seq=0 ttl=64 time=1.087 ms
vnet_send_packet: cnt = 10
64 bytes from seq=1 ttl=64 time=0.768 ms
vnet_send_packet: cnt = 11
64 bytes from seq=2 ttl=64 time=0.749 ms


[root@zy:/]# ifconfig
eth0      Link encap:Ethernet  HWaddr 52:45:18:A0:AC:CA
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::5045:18ff:fea0:acca/64 Scope:Link
          RX packets:2273 errors:0 dropped:0 overruns:0 frame:0
          TX packets:960 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2807766 (2.6 MiB)  TX bytes:179088 (174.8 KiB)
          Interrupt:55 Base address:0x8300

lo        Link encap:Local Loopback
          inet addr:  Mask:
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:34 errors:0 dropped:0 overruns:0 frame:0
          TX packets:34 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2856 (2.7 KiB)  TX bytes:2856 (2.7 KiB)

vnet0     Link encap:Ethernet  HWaddr 89:89:89:89:89:89
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::8b89:89ff:fe89:8989/64 Scope:Link
          RX packets:12 errors:0 dropped:0 overruns:0 frame:0         // 12 packets received
          TX packets:12 errors:0 dropped:0 overruns:0 carrier:0       // 12 packets sent
          collisions:0 txqueuelen:1000
          RX bytes:968 (968.0 B)  TX bytes:968 (968.0 B)              // bytes received/sent
3.6.4 Printing the Ethernet frame
[root@zy:/]# ping
PING ( 56 data bytes
0x01 0x01 0x89 0x89 0x89 0x89 0x89 0x89
vnet_send_packet: cnt = 2                                              # transmitted frame
vnet_send_packet: length = 98
0x89 0x89 0x89 0x89 0x89 0x89 0x89 0x89 0x89 0x89 0x89 0x89 0x08 0x00 0x45 0x00    # destination/source MAC: 0x89 0x89 0x89 0x89 0x89 0x89
0x00 0x54 0xf8 0xda 0x40 0x00 0x40 0x01 0x35 0xc2 0x03 0x03 0x03 0x03 0x03 0x03    # IP datagram length 0x54 = 84
0x03 0x04 0x08 0x00 0x62 0x67 0x3c 0x00 0x00 0x00 0xe6 0x94 0x73 0x03 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00
emulator_rx_packet: length = 84                                      # emulated received packet
0x45 0x00 0x00 0x54 0xf8 0xda 0x40 0x00 0x40 0x01 0x35 0xc2 0x03 0x03 0x03 0x04    # only the IP datagram, without the Ethernet header: eth_type_trans() has moved skb->data past it
0x03 0x03 0x03 0x03 0x00 0x00 0x62 0x67 0x3c 0x00 0x00 0x00 0xe6 0x94 0x73 0x03
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00
64 bytes from seq=0 ttl=64 time=88.367 ms
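
The dump above can be decoded field by field. The following user-space parser is a rough sketch, assuming an untagged Ethernet II frame that carries IPv4; `struct frame_info` and `parse_frame()` are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN    14
#define ETHERTYPE_IPV4 0x0800

/* Fields extracted from an Ethernet frame carrying IPv4. */
struct frame_info {
    uint16_t ethertype;     /* 0x0800 for IPv4 */
    uint16_t ip_total_len;  /* total length of the IP datagram in bytes */
    uint8_t  ip_protocol;   /* 1 = ICMP */
    uint8_t  first_l4_byte; /* for ICMP this is the type: 8 request, 0 reply */
    uint8_t  src_ip[4];
    uint8_t  dst_ip[4];
};

/* Returns 0 on success, -1 if the frame is too short or not IPv4. */
static int parse_frame(const uint8_t *f, size_t len, struct frame_info *out)
{
    const uint8_t *ip;
    size_t ihl;
    int i;

    assert(f != NULL && out != NULL);
    if (len < ETH_HDR_LEN + 20)
        return -1;

    /* bytes 0..5 destination MAC, 6..11 source MAC, 12..13 EtherType */
    out->ethertype = (uint16_t)((f[12] << 8) | f[13]);
    if (out->ethertype != ETHERTYPE_IPV4)
        return -1;

    ip = f + ETH_HDR_LEN;
    ihl = (size_t)(ip[0] & 0x0f) * 4;       /* header length in bytes */
    if (ihl < 20 || len < ETH_HDR_LEN + ihl + 1)
        return -1;

    out->ip_total_len = (uint16_t)((ip[2] << 8) | ip[3]);
    out->ip_protocol = ip[9];
    for (i = 0; i < 4; i++) {
        out->src_ip[i] = ip[12 + i];
        out->dst_ip[i] = ip[16 + i];
    }
    out->first_l4_byte = ip[ihl];
    return 0;
}
```

Fed with the first bytes of the transmitted frame, the parser reports EtherType 0x0800, an IP total length of 84, protocol 1 (ICMP) and ICMP type 8. The 14-byte Ethernet header plus the 84-byte IP datagram account for the printed frame length of 98.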


Young / s3c2440_project[drivers]







