Posted on Jan 21, 2016 by PIA Research

Linux networking stack from the ground up, part 1

part 1 | part 2 | part 3 | part 4 | part 5


This multi-part blog series aims to outline the path of a packet from the wire through the network driver and kernel until it reaches the receive queue for a socket. This information pertains to the Linux kernel, release 3.13.0. Links to source code on GitHub are provided throughout to help with context.

This document will describe code throughout the Linux networking stack as well as some code from the following Ethernet device drivers:

  • e1000e: Intel PRO/1000 Linux driver
  • igb: Intel Gigabit Linux driver
  • ixgbe: Intel 10 Gigabit PCI Express Linux driver
  • tg3: Broadcom Tigon3 ethernet driver
  • be2net: HP Emulex 10 Gigabit PCI Express Linux Driver
  • bnx2: Broadcom NX2 network driver

Other kernels or drivers will likely be similar, but line numbers and detailed inner workings will likely be different.

Data sheets / Programmer’s Reference Manuals

Driver code can be cryptic, especially when trying to understand the meaning of the statistics counters that the driver reads from the device. In many cases, referring to the device's documentation can help clear things up.

WARNING: All of these PDFs are large. You may or may not want to download these on mobile devices.


High level overview of the path of a packet:

  1. Driver is loaded and initialized.
  2. Packet arrives at the NIC from the network.
  3. Packet is copied (via DMA) to a ring buffer in kernel memory.
  4. Hardware interrupt is generated to let the system know a packet is in memory.
  5. Driver calls into NAPI to start a poll loop if one was not running already.
  6. ksoftirqd processes run on each CPU on the system. They are registered at boot time. The ksoftirqd processes pull packets off the ring buffer by calling the NAPI poll function that the device driver registered during initialization.
  7. Memory regions in the ring buffer that have had network data written to them are unmapped.
  8. Data that was DMA’d into memory is passed up the networking layer as an ‘skb’ for more processing.
  9. Packet steering happens to distribute the packet processing load across multiple CPUs (in lieu of a NIC with multiple receive queues), if enabled.
  10. Packets are handed to the protocol layers from the queues.
  11. Protocol layers add them to receive buffers attached to sockets.

Detailed look

Driver loading / PCI

PCI devices identify themselves with a series of registers in the PCI configuration space.
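The first two of those registers are the vendor ID (a 16-bit little-endian value at offset 0x0) and the device ID (at offset 0x2). As a sketch, here is a small helper that decodes them from a raw config-space buffer; on Linux the same bytes are exposed read-only via sysfs under each device's `config` file.

```c
#include <assert.h>
#include <stdint.h>

struct pci_id {
    uint16_t vendor;
    uint16_t device;
};

/* Decode the vendor and device ID registers from the start of a raw
 * PCI configuration-space buffer (little-endian u16s at offsets 0x0
 * and 0x2). */
static struct pci_id pci_decode_id(const uint8_t *cfg)
{
    struct pci_id id;
    id.vendor = (uint16_t)(cfg[0] | (cfg[1] << 8));
    id.device = (uint16_t)(cfg[2] | (cfg[3] << 8));
    return id;
}
```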

When a device driver is compiled, a macro named MODULE_DEVICE_TABLE is used to export a table of PCI device IDs identifying devices that the device driver can control. The kernel uses this table to determine which device driver to load to control the device.
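Conceptually, the match is a walk over the driver's table until a zero sentinel entry is hit. A userspace sketch of that matching, with a made-up table (real tables use `struct pci_device_id` and are exported with `MODULE_DEVICE_TABLE(pci, ...)`):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct pci_device_id. */
struct pci_device_id {
    uint16_t vendor;
    uint16_t device;
};

/* Tables are terminated with an all-zero sentinel entry, as in the
 * kernel. The entries here are illustrative. */
static const struct pci_device_id fake_pci_tbl[] = {
    { 0x8086, 0x10d3 }, /* an e1000e device (82574L) */
    { 0x8086, 0x1533 }, /* an igb device (I210) */
    { 0, 0 }
};

/* Return 1 if (vendor, device) appears in the table, else 0. */
static int pci_table_match(const struct pci_device_id *tbl,
                           uint16_t vendor, uint16_t device)
{
    for (; tbl->vendor != 0; tbl++)
        if (tbl->vendor == vendor && tbl->device == device)
            return 1;
    return 0;
}
```

At module install time, userspace tooling (depmod/modprobe) uses the exported table to map a device's IDs to the module that should be loaded for it.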

When the driver is loaded, a function named pci_register_driver is called in the initialization function.

This function registers a structure of function pointers that the kernel can use to initialize the PCI device.


In the e1000e driver, this structure can be found in drivers/net/ethernet/intel/e1000e/netdev.c around line 7035:

static struct pci_driver e1000_driver = {
  .name     = e1000e_driver_name,
  .id_table = e1000_pci_tbl,
  .probe    = e1000_probe,

  /* more stuff */
};

It is registered in e1000_init_module in the same file around line 7043:

/**
 * e1000_init_module - Driver Registration Routine
 *
 * e1000_init_module is the first routine called when the driver is
 * loaded. All it does is register with the PCI subsystem.
 **/
static int __init e1000_init_module(void)
{
  int ret;
  pr_info("Intel(R) PRO/1000 Network Driver - %s\n",
          e1000e_driver_version);
  pr_info("Copyright(c) 1999 - 2013 Intel Corporation.\n");
  ret = pci_register_driver(&e1000_driver);

  return ret;
}


In the igb driver, this structure can be found in drivers/net/ethernet/intel/igb/igb_main.c around line 238:

static struct pci_driver igb_driver = {
  .name = igb_driver_name,
  .id_table = igb_pci_tbl,
  .probe = igb_probe,
  .remove = igb_remove,
#ifdef CONFIG_PM
  .driver.pm = &igb_pm_ops,
#endif
  .shutdown = igb_shutdown,
  .sriov_configure = igb_pci_sriov_configure,
  .err_handler = &igb_err_handler
};

It is registered in igb_init_module in the same file around line 682:

static int __init igb_init_module(void)
{
  int ret;

  pr_info("%s - version %s\n",
          igb_driver_string, igb_driver_version);
  pr_info("%s\n", igb_copyright);

  ret = pci_register_driver(&igb_driver);
  return ret;
}


In the ixgbe driver, this structure can be found in drivers/net/ethernet/intel/ixgbe/ixgbe_main.c around line 8448:

static struct pci_driver ixgbe_driver = {
  .name = ixgbe_driver_name,
  .id_table = ixgbe_pci_tbl,
  .probe = ixgbe_probe,
  .remove = ixgbe_remove,
#ifdef CONFIG_PM
  .suspend = ixgbe_suspend,
  .resume = ixgbe_resume,
#endif
  .shutdown = ixgbe_shutdown,
  .sriov_configure = ixgbe_pci_sriov_configure,
  .err_handler = &ixgbe_err_handler
};

It is registered in ixgbe_init_module in the same file around line 8468:

static int __init ixgbe_init_module(void)
{
  int ret;

  pr_info("%s - version %s\n", ixgbe_driver_string, ixgbe_driver_version);
  pr_info("%s\n", ixgbe_copyright);

  ret = pci_register_driver(&ixgbe_driver);
  if (ret) {
    return ret;
  }

  return 0;
}


In the tg3 driver, this structure can be found in drivers/net/ethernet/broadcom/tg3.c around line 17999:

static struct pci_driver tg3_driver = {
  .name = DRV_MODULE_NAME,
  .id_table = tg3_pci_tbl,
  .probe = tg3_init_one,
  .remove = tg3_remove_one,
  .err_handler = &tg3_err_handler,
  .driver.pm = &tg3_pm_ops,
  .shutdown = tg3_shutdown,
};

It is registered in the same file, using the module_pci_driver macro (defined in include/linux/pci.h:1104), just below the structure definition:

module_pci_driver(tg3_driver);
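The module_pci_driver macro hides the same init/exit boilerplate that e1000e and igb spell out by hand. As a sketch, this is roughly what it expands to, using mock userspace stand-ins for pci_register_driver and pci_unregister_driver (the mocks and the flag variable are inventions for illustration):

```c
#include <assert.h>

/* Mock stand-ins for the kernel's PCI registration API. */
struct pci_driver { const char *name; };

static int driver_is_registered;

static int pci_register_driver(struct pci_driver *d)
{
    (void)d;
    driver_is_registered = 1;
    return 0;
}

static void pci_unregister_driver(struct pci_driver *d)
{
    (void)d;
    driver_is_registered = 0;
}

static struct pci_driver tg3_driver = { .name = "tg3" };

/* module_pci_driver(tg3_driver) generates (roughly) these two
 * functions, plus the module_init/module_exit hookup for them: */
static int tg3_driver_init(void)
{
    return pci_register_driver(&tg3_driver);
}

static void tg3_driver_exit(void)
{
    pci_unregister_driver(&tg3_driver);
}
```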



In the be2net driver, this structure can be found in drivers/net/ethernet/emulex/benet/be_main.c around line 4819:

static struct pci_driver be_driver = {
  .name = DRV_NAME,
  .id_table = be_dev_ids,
  .probe = be_probe,
  .remove = be_remove,
  .suspend = be_suspend,
  .resume = be_resume,
  .shutdown = be_shutdown,
  .err_handler = &be_eeh_handlers
};

It is registered in be_init_module in the same file around line 4764:

static int __init be_init_module(void)
{
  if (rx_frag_size != 8192 && rx_frag_size != 4096 &&
      rx_frag_size != 2048) {
    printk(KERN_WARNING DRV_NAME
           " : Module param rx_frag_size must be 2048/4096/8192."
           " Using 2048\n");
    rx_frag_size = 2048;
  }

  return pci_register_driver(&be_driver);
}


In the bnx2 driver, this structure can be found in drivers/net/ethernet/broadcom/bnx2.c around line 8788:

static struct pci_driver bnx2_pci_driver = {
  .name = DRV_MODULE_NAME,
  .id_table = bnx2_pci_tbl,
  .probe = bnx2_init_one,
  .remove = bnx2_remove_one,
  .driver.pm = BNX2_PM_OPS,
  .err_handler = &bnx2_err_handler,
  .shutdown = bnx2_shutdown,
};

It is registered in the same file just below the structure definition using the module_pci_driver macro (from include/linux/pci.h):

module_pci_driver(bnx2_pci_driver);


PCI probe

Each driver registers a probe function with the PCI system in the kernel.
The kernel calls this function to do early initialization of the device.
Most drivers have a lot of code that runs to get the device ready for use. The
exact things done vary from driver to driver.

The name of the function registered as the probe function, and a very (very) high level
overview of what it does, are provided below for each of the drivers.

Generally speaking, the drivers are quite similar at this stage; each probe function typically sets up or registers:

  1. The ethtool (described more in the next parts of this series) functions the driver supports
  2. The NAPI poll function (described more in the next parts of this series) for harvesting incoming packets
  3. The NIC’s MAC address
  4. The higher level net_device structure
  5. The hardware IRQ number that will be used by the device when interrupts are (eventually) enabled
  6. Any watchdog tasks needed (for example, e1000e has a watchdog task to check if the hardware is hung)
  7. Other device specific stuff like workarounds or dealing with quirks or similar
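The steps above can be sketched as a skeletal probe function. Everything here is hypothetical (`fake_netdev`, the field names, the IRQ number); each comment maps to an item in the list, and the function returns 0 on success as the kernel expects from a probe:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the state a probe function fills in. */
struct fake_netdev {
    char mac[6];
    int  irq;
    int  ethtool_ops_set;
    int  napi_poll_registered;
    int  watchdog_armed;
};

static int fake_probe(struct fake_netdev *dev)
{
    /* 1. hook up the ethtool operations this driver supports */
    dev->ethtool_ops_set = 1;
    /* 2. register the NAPI poll function for this device */
    dev->napi_poll_registered = 1;
    /* 3. read the MAC address out of device EEPROM/registers
     *    (an illustrative address here) */
    memcpy(dev->mac, "\x00\x1b\x21\x00\x00\x01", 6);
    /* 5. record the hardware IRQ to request when interrupts are
     *    eventually enabled (an illustrative number) */
    dev->irq = 42;
    /* 6. arm any watchdog tasks (e.g. hardware-hang detection) */
    dev->watchdog_armed = 1;
    /* 4/7. allocate and register the net_device, apply device
     *      specific workarounds and quirks, ... */
    return 0; /* 0 = success */
}
```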

To dig deeper into what each driver’s probe function does, read the probe function registered by each of the driver source files listed above.
