There are already plenty of books analyzing Linux networking, including 《追踪Linux TCP/IP代码运行》 and 《Linux内核源码剖析——TCP/IP实现》. Here I only follow the basic path a packet takes through the Linux kernel, trying to present the main processing framework.
How does the kernel receive data from the NIC? The traditional process:
1. Data arrives at the NIC;
2. The NIC raises an interrupt to the kernel;
3. The kernel, using I/O instructions, reads the data out of the NIC's I/O region.
This sequence can still be seen in the interrupt handlers of many (quite old) NIC drivers.
This approach has one serious problem: under heavy traffic the NIC generates a flood of interrupts, and the kernel, stuck in interrupt context, wastes a great deal of resources just servicing them. Hence the question "can we do without interrupts?" — which leads to polling, the so-called NAPI technique. There is nothing mysterious about it: the kernel masks the interrupt and every so often asks the NIC, "got any data?"
As that description suggests, when traffic is light, polling in turn burns CPU for nothing — each approach has its strengths. (In practice NAPI is a hybrid: interrupt-driven when idle, polled under load.)
OK, another issue: reading data out of the NIC's I/O region — its I/O registers or I/O memory — is itself done by the CPU, and that too costs CPU cycles: "the CPU reads from the I/O region and puts the data into memory" (meaning system RAM, i.e. main memory, nothing to do with the device's own memory). The natural next step is DMA — let the NIC read and write its I/O data in main memory directly. CPU, none of your business here, go amuse yourself:
1. First, the kernel sets up a circular buffer queue for rx/tx data in main memory (usually called the DMA ring buffer).
2. The kernel DMA-maps this buffer, handing the ring over to the NIC;
3. When the NIC receives data, it writes it straight into the ring — that is, straight into main memory — and then raises an interrupt;
4. On receiving that interrupt, the kernel tears down the DMA mapping and reads the data directly out of main memory.
This process involves much less work than the traditional one: the device puts the data into main memory itself, without the CPU's intervention, so efficiency goes up considerably.
Mapping these four steps to the concrete implementation:
1) Allocate the DMA ring buffer
In the Linux kernel a packet buffer is described by an skb; "allocation" means creating a number of skbs and chaining them together through the e1000_rx_ring descriptor ring.
2) Establish the DMA mapping
The kernel establishes the mapping by calling
dma_map_single(struct device *dev, void *buffer, size_t size, enum dma_data_direction direction)
where:
struct device *dev: the device the buffer is mapped for;
buffer: the address to map for the device — here one particular skb; to map them all, simply iterate over the ring;
size: the buffer size;
direction: which way the data flows — DMA_TO_DEVICE, DMA_FROM_DEVICE, or DMA_BIDIRECTIONAL (a receive buffer is normally mapped DMA_FROM_DEVICE; data as a whole flows both ways between device and memory across the rx and tx rings).
For a PCI device (and NICs generally are PCI), the wrapper function pci_map_single is used instead. With that, the buffer has been handed to the device, which can read from and write to it directly.
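As a sketch (not verbatim e1000e code — `buffer_info`, `rx_desc`, and `rx_buffer_len` are stand-ins for the driver's actual fields), mapping a freshly allocated receive skb for the NIC typically looks like this:

```c
/* Schematic only: map one receive buffer so the NIC can DMA into it.
 * buffer_info->skb was just allocated; skb->data is its payload area. */
buffer_info->dma = pci_map_single(pdev,
                                  skb->data,
                                  adapter->rx_buffer_len,
                                  PCI_DMA_FROMDEVICE);
if (pci_dma_mapping_error(pdev, buffer_info->dma)) {
        /* mapping failed: don't hand this descriptor to the NIC */
        dev_kfree_skb(skb);
        break;
}
/* the bus address goes into the rx descriptor that the NIC reads */
rx_desc->buffer_addr = cpu_to_le64(buffer_info->dma);
```

The key point is that the NIC is given the bus address returned by the mapping, never the kernel virtual address of the skb.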
3) This step is performed by the hardware;
4) Tear down the mapping
This is dma_unmap_single — for PCI, usually via its wrapper pci_unmap_single. Until it is called, ownership of the buffer still belongs to the device; calling it hands control back to the CPU, which is what we want, since the data has arrived and the CPU must now pass it up the network stack. Before unmapping, the driver usually needs to read some status bits and the like, so it typically calls dma_sync_single_for_cpu() first, which lets the CPU safely look at the DMA buffer's contents before the mapping is torn down.
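Again as a hedged sketch with the same stand-in names (`buffer_info`, `rx_desc`, `rx_buffer_len` are illustrative, not quoted from the driver), the receive path's ownership handback could look like:

```c
/* Schematic only: before the CPU touches the DMA'd data, sync ownership
 * back from the device so the descriptor contents are coherent. */
pci_dma_sync_single_for_cpu(pdev, buffer_info->dma,
                            adapter->rx_buffer_len, PCI_DMA_FROMDEVICE);

/* status bits written by the NIC are now safe to read */
if (!(rx_desc->status & E1000_RXD_STAT_DD))
        break;                  /* descriptor not done yet */

/* hand the buffer fully back to the CPU */
pci_unmap_single(pdev, buffer_info->dma,
                 adapter->rx_buffer_len, PCI_DMA_FROMDEVICE);
buffer_info->dma = 0;

/* the skb now belongs entirely to the CPU; pass it up the stack */
netif_receive_skb(skb);
```

The sync-then-unmap ordering mirrors the prose above: peek at the status bits first, reclaim the buffer, then let the network stack have the skb.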
Now to the code. A packet arrives at the NIC as an optical/electrical signal, passes through the NIC driver, is turned into an skb, and enters the link layer — so let's first walk through the NIC driver's flow.
Source location: drivers/net/e1000e/.
The module init function registers the NIC driver following the standard PCI driver model:

```c
static int __init e1000_init_module(void)
{
	int ret;

	printk(KERN_INFO "%s: Intel(R) PRO/1000 Network Driver - %s\n",
	       e1000e_driver_name, e1000e_driver_version);
	printk(KERN_INFO "%s: Copyright (c) 1999-2008 Intel Corporation.\n",
	       e1000e_driver_name);

	/* register the NIC driver, PCI-driver style */
	ret = pci_register_driver(&e1000_driver);
	pm_qos_add_requirement(PM_QOS_CPU_DMA_LATENCY, e1000e_driver_name,
			       PM_QOS_DEFAULT_VALUE);
	return ret;
}
```
Next, look at the driver structure itself (PCI driver development in general is not covered here).
```c
/* PCI Device API Driver */
static struct pci_driver e1000_driver = {
	.name     = e1000e_driver_name,
	.id_table = e1000_pci_tbl,
	.probe    = e1000_probe,
	.remove   = __devexit_p(e1000_remove),
#ifdef CONFIG_PM
	/* Power Management Hooks */
	.suspend  = e1000_suspend,
	.resume   = e1000_resume,
#endif
	.shutdown = e1000_shutdown,
	.err_handler = &e1000_err_handler
};
```
The most important member here is e1000_probe. Its role, as the source comment puts it, is the "Device Initialization Routine" — which should not be hard to understand.
```c
static int __devinit e1000_probe(struct pci_dev *pdev,
				 const struct pci_device_id *ent)
{
	struct net_device *netdev;
	struct e1000_adapter *adapter;
	struct e1000_hw *hw;
	const struct e1000_info *ei = e1000_info_tbl[ent->driver_data];
	resource_size_t mmio_start, mmio_len;
	resource_size_t flash_start, flash_len;
	static int cards_found;
	int i, err, pci_using_dac;
	u16 eeprom_data = 0;
	u16 eeprom_apme_mask = E1000_EEPROM_APME;

	e1000e_disable_l1aspm(pdev);
```
From here on the device is initialized: enabling the PCI device, configuring the DMA masks, claiming its memory regions, and so on.
```c
	err = pci_enable_device_mem(pdev);
	if (err)
		return err;

	pci_using_dac = 0;
	err = pci_set_dma_mask(pdev, DMA_BIT_MASK(64));
	if (!err) {
		err = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64));
		if (!err)
			pci_using_dac = 1;
	} else {
		err = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
		if (err) {
			err = pci_set_consistent_dma_mask(pdev,
							  DMA_BIT_MASK(32));
			if (err) {
				dev_err(&pdev->dev, "No usable DMA "
					"configuration, aborting\n");
				goto err_dma;
			}
		}
	}

	err = pci_request_selected_regions_exclusive(pdev,
				pci_select_bars(pdev, IORESOURCE_MEM),
				e1000e_driver_name);
	if (err)
		goto err_pci_reg;

	/* AER (Advanced Error Reporting) hooks */
	err = pci_enable_pcie_error_reporting(pdev);
	if (err) {
		dev_err(&pdev->dev, "pci_enable_pcie_error_reporting failed "
			"0x%x\n", err);
		/* non-fatal, continue */
	}

	pci_set_master(pdev);
	/* PCI config space info */
	err = pci_save_state(pdev);
	if (err)
		goto err_alloc_etherdev;
```
Next, allocate a "container" for the driver — the net_device plus its private e1000_adapter — since everything that follows hangs off it:

```c
	err = -ENOMEM;
	netdev = alloc_etherdev(sizeof(struct e1000_adapter));
	if (!netdev)
		goto err_alloc_etherdev;

	SET_NETDEV_DEV(netdev, &pdev->dev);

	pci_set_drvdata(pdev, netdev);
	adapter = netdev_priv(netdev);
	hw = &adapter->hw;
	adapter->netdev = netdev;
	adapter->pdev = pdev;
	adapter->ei = ei;
	adapter->pba = ei->pba;
	adapter->flags = ei->flags;
	adapter->flags2 = ei->flags2;
	adapter->hw.adapter = adapter;
	adapter->hw.mac.type = ei->mac;
	adapter->max_hw_frame_size = ei->max_hw_frame_size;
	adapter->msg_enable = (1 << NETIF_MSG_DRV | NETIF_MSG_PROBE) - 1;
```
BAR 0 is the BAR of the device's memory-mapped register region. As I understand it, ioremap() maps that BAR into the kernel's virtual address space; hw_addr is then the address through which the driver accesses the NIC's hardware registers:

```c
	/* BAR 0 is the device's memory-mapped register region */
	mmio_start = pci_resource_start(pdev, 0);
	mmio_len = pci_resource_len(pdev, 0);

	err = -EIO;
	adapter->hw.hw_addr = ioremap(mmio_start, mmio_len);
	if (!adapter->hw.hw_addr)
		goto err_ioremap;

	if ((adapter->flags & FLAG_HAS_FLASH) &&
	    (pci_resource_flags(pdev, 1) & IORESOURCE_MEM)) {
		flash_start = pci_resource_start(pdev, 1);
		flash_len = pci_resource_len(pdev, 1);
		adapter->hw.flash_address = ioremap(flash_start, flash_len);
		if (!adapter->hw.flash_address)
			goto err_flashmap;
	}

	/* construct the net_device struct */
	netdev->netdev_ops = &e1000e_netdev_ops;
	e1000e_set_ethtool_ops(netdev);
	netdev->watchdog_timeo = 5 * HZ;
	netif_napi_add(netdev, &adapter->napi, e1000_clean, 64);
	strncpy(netdev->name, pci_name(pdev), sizeof(netdev->name) - 1);

	netdev->mem_start = mmio_start;
	netdev->mem_end = mmio_start + mmio_len;

	adapter->bd_number = cards_found++;

	e1000e_check_options(adapter);

	/* setup adapter struct */
	err = e1000_sw_init(adapter);
	if (err)
		goto err_sw_init;

	err = -EIO;

	memcpy(&hw->mac.ops, ei->mac_ops, sizeof(hw->mac.ops));
	memcpy(&hw->nvm.ops, ei->nvm_ops, sizeof(hw->nvm.ops));
	memcpy(&hw->phy.ops, ei->phy_ops, sizeof(hw->phy.ops));

	err = ei->get_variants(adapter);
	if (err)
		goto err_hw_init;

	if ((adapter->flags & FLAG_IS_ICH) &&
	    (adapter->flags & FLAG_READ_ONLY_NVM))
		e1000e_write_protect_nvm_ich8lan(&adapter->hw);

	hw->mac.ops.get_bus_info(&adapter->hw);

	adapter->hw.phy.autoneg_wait_to_complete = 0;

	/* Copper options */
	if (adapter->hw.phy.media_type == e1000_media_type_copper) {
		adapter->hw.phy.mdix = AUTO_ALL_MODES;
		adapter->hw.phy.disable_polarity_correction = 0;
		adapter->hw.phy.ms_type = e1000_ms_hw_default;
	}

	if (e1000_check_reset_block(&adapter->hw))
		e_info("PHY reset is blocked due to SOL/IDER session.\n");

	netdev->features = NETIF_F_SG |
			   NETIF_F_HW_CSUM |
			   NETIF_F_HW_VLAN_TX |
			   NETIF_F_HW_VLAN_RX;

	if (adapter->flags & FLAG_HAS_HW_VLAN_FILTER)
		netdev->features |= NETIF_F_HW_VLAN_FILTER;

	netdev->features |= NETIF_F_TSO;
	netdev->features |= NETIF_F_TSO6;

	netdev->vlan_features |= NETIF_F_TSO;
	netdev->vlan_features |= NETIF_F_TSO6;
	netdev->vlan_features |= NETIF_F_HW_CSUM;
	netdev->vlan_features |= NETIF_F_SG;

	if (pci_using_dac)
		netdev->features |= NETIF_F_HIGHDMA;

	if (e1000e_enable_mng_pass_thru(&adapter->hw))
		adapter->flags |= FLAG_MNG_PT_ENABLED;

	/*
	 * before reading the NVM, reset the controller to
	 * put the device in a known good starting state
	 */
	adapter->hw.mac.ops.reset_hw(&adapter->hw);

	/*
	 * systems with ASPM and others may see the checksum fail on the first
	 * attempt. Let's give it a few tries
	 */
	for (i = 0;; i++) {
		if (e1000_validate_nvm_checksum(&adapter->hw) >= 0)
			break;
		if (i == 2) {
			e_err("The NVM Checksum Is Not Valid\n");
			err = -EIO;
			goto err_eeprom;
		}
	}

	e1000_eeprom_checks(adapter);

	/* copy the MAC address out of the NVM */
	if (e1000e_read_mac_addr(&adapter->hw))
		e_err("NVM Read Error while reading MAC address\n");

	memcpy(netdev->dev_addr, adapter->hw.mac.addr, netdev->addr_len);
	memcpy(netdev->perm_addr, adapter->hw.mac.addr, netdev->addr_len);

	if (!is_valid_ether_addr(netdev->perm_addr)) {
		e_err("Invalid MAC Address: %pM\n", netdev->perm_addr);
		err = -EIO;
		goto err_eeprom;
	}

	init_timer(&adapter->watchdog_timer);
	adapter->watchdog_timer.function = &e1000_watchdog;
	adapter->watchdog_timer.data = (unsigned long) adapter;

	init_timer(&adapter->phy_info_timer);
	adapter->phy_info_timer.function = &e1000_update_phy_info;
	adapter->phy_info_timer.data = (unsigned long) adapter;

	INIT_WORK(&adapter->reset_task, e1000_reset_task);
	INIT_WORK(&adapter->watchdog_task, e1000_watchdog_task);
	INIT_WORK(&adapter->downshift_task, e1000e_downshift_workaround);
	INIT_WORK(&adapter->update_phy_task, e1000e_update_phy_task);

	/* Initialize link parameters. User can change them with ethtool */
	adapter->hw.mac.autoneg = 1;
	adapter->fc_autoneg = 1;
	adapter->hw.fc.requested_mode = e1000_fc_default;
	adapter->hw.fc.current_mode = e1000_fc_default;
	adapter->hw.phy.autoneg_advertised = 0x2f;
```
The default receive and transmit ring sizes are 256 descriptors. In practice a single interrupt does not deliver many packets — in my experiments it was usually just one or two. The ring does not hold the skbs permanently; it is only the landing area where DMA deposits data for the kernel in each interrupt round, and once the interrupt has been serviced the skbs are moved on.
```c
	/* ring size defaults */
	adapter->rx_ring->count = 256;
	adapter->tx_ring->count = 256;

	/*
	 * Initial Wake on LAN setting - If APM wake is enabled in
	 * the EEPROM, enable the ACPI Magic Packet filter
	 */
	if (adapter->flags & FLAG_APME_IN_WUC) {
		/* APME bit in EEPROM is mapped to WUC.APME */
		eeprom_data = er32(WUC);
		eeprom_apme_mask = E1000_WUC_APME;
		if (eeprom_data & E1000_WUC_PHY_WAKE)
			adapter->flags2 |= FLAG2_HAS_PHY_WAKEUP;
	} else if (adapter->flags & FLAG_APME_IN_CTRL3) {
		if (adapter->flags & FLAG_APME_CHECK_PORT_B &&
		    (adapter->hw.bus.func == 1))
			e1000_read_nvm(&adapter->hw,
				NVM_INIT_CONTROL3_PORT_B, 1, &eeprom_data);
		else
			e1000_read_nvm(&adapter->hw,
				NVM_INIT_CONTROL3_PORT_A, 1, &eeprom_data);
	}

	/* fetch WoL from EEPROM */
	if (eeprom_data & eeprom_apme_mask)
		adapter->eeprom_wol |= E1000_WUFC_MAG;

	/*
	 * now that we have the eeprom settings, apply the special cases
	 * where the eeprom may be wrong or the board simply won't support
	 * wake on lan on a particular port
	 */
	if (!(adapter->flags & FLAG_HAS_WOL))
		adapter->eeprom_wol = 0;

	/* initialize the wol settings based on the eeprom settings */
	adapter->wol = adapter->eeprom_wol;
	device_set_wakeup_enable(&adapter->pdev->dev, adapter->wol);

	/* save off EEPROM version number */
	e1000_read_nvm(&adapter->hw, 5, 1, &adapter->eeprom_vers);

	/* reset the hardware with the new settings */
	e1000e_reset(adapter);

	/*
	 * If the controller has AMT, do not set DRV_LOAD until the interface
	 * is up. For all other cases, let the f/w know that the h/w is now
	 * under the control of the driver.
	 */
	if (!(adapter->flags & FLAG_HAS_AMT))
		e1000_get_hw_control(adapter);

	strcpy(netdev->name, "eth%d");
	/* register the network device */
	err = register_netdev(netdev);
	if (err)
		goto err_register;

	/* carrier off reporting is important to ethtool even BEFORE open */
	netif_carrier_off(netdev);

	e1000_print_device_info(adapter);

	return 0;

err_register:
	if (!(adapter->flags & FLAG_HAS_AMT))
		e1000_release_hw_control(adapter);
err_eeprom:
	if (!e1000_check_reset_block(&adapter->hw))
		e1000_phy_hw_reset(&adapter->hw);
err_hw_init:
	kfree(adapter->tx_ring);
	kfree(adapter->rx_ring);
err_sw_init:
	if (adapter->hw.flash_address)
		iounmap(adapter->hw.flash_address);
	e1000e_reset_interrupt_capability(adapter);
err_flashmap:
	iounmap(adapter->hw.hw_addr);
err_ioremap:
	free_netdev(netdev);
err_alloc_etherdev:
	pci_release_selected_regions(pdev,
				     pci_select_bars(pdev, IORESOURCE_MEM));
err_pci_reg:
err_dma:
	pci_disable_device(pdev);
	return err;
}
```
With the function above, driver initialization and device registration are complete. What comes next (not shown here) are the operation functions registered for the netdev. Pay particular attention to the last one — the NIC driver's interrupt handler, i.e. the bottom-half processing. After that we would look at how the network device is opened, and at how the receive-ring interrupt handling is set up.