Linux networking stack from the ground up, part 2

Posted on Jan 22, 2016 by PIA Research



This post will pick up where part 1 left off, beginning by explaining what ethtool is, how device drivers register code for ethtool, how drivers enable NAPI, and how drivers enable interrupts.

ethtool setup

ethtool is a command line program you can use to get and set driver
information. You can install it on Ubuntu by running apt-get install ethtool.

Some ethtool settings of interest will be described later in this document.

The ethtool program talks to device drivers by using the ioctl system call.
The device drivers register a series of functions that run for the ethtool operations and the kernel provides the glue.

When an ioctl call is made from ethtool, the kernel finds the ethtool structure registered by the appropriate driver and executes the functions registered.
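The dispatch described above can be modeled in user space with a structure of function pointers. This is only a sketch under stated assumptions, not the kernel's code: every demo_* name and the command constants are hypothetical, and the real struct ethtool_ops has many more members.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical command numbers, standing in for the real ETHTOOL_* values. */
enum demo_cmd { DEMO_GET_RING_SIZE = 1, DEMO_SET_RING_SIZE = 2 };

struct demo_netdev;

/* A driver fills in only the operations it supports; the rest stay NULL. */
struct demo_ethtool_ops {
    int (*get_ring_size)(struct demo_netdev *dev);
    int (*set_ring_size)(struct demo_netdev *dev, int size);
};

struct demo_netdev {
    const struct demo_ethtool_ops *ethtool_ops; /* installed at probe time */
    int rx_ring_size;
};

/* "Driver" implementations of the two operations. */
static int demo_get_ring_size(struct demo_netdev *dev)
{
    return dev->rx_ring_size;
}

static int demo_set_ring_size(struct demo_netdev *dev, int size)
{
    if (size <= 0)
        return -EINVAL;
    dev->rx_ring_size = size;
    return 0;
}

static const struct demo_ethtool_ops demo_ops = {
    .get_ring_size = demo_get_ring_size,
    .set_ring_size = demo_set_ring_size,
};

/* What a probe function does: point the device at the driver's table. */
void demo_probe(struct demo_netdev *dev)
{
    assert(dev != NULL);
    dev->ethtool_ops = &demo_ops;
    dev->rx_ring_size = 256;
}

/* The "glue": find the registered table and call the matching function. */
int demo_ethtool_ioctl(struct demo_netdev *dev, enum demo_cmd cmd, int arg)
{
    const struct demo_ethtool_ops *ops = dev->ethtool_ops;

    if (ops == NULL)
        return -EOPNOTSUPP;

    switch (cmd) {
    case DEMO_GET_RING_SIZE:
        return ops->get_ring_size ? ops->get_ring_size(dev) : -EOPNOTSUPP;
    case DEMO_SET_RING_SIZE:
        return ops->set_ring_size ? ops->set_ring_size(dev, arg) : -EOPNOTSUPP;
    default:
        return -EOPNOTSUPP;
    }
}
```

Leaving unsupported operations as NULL is what lets the glue code report "not supported" without the driver writing any extra code.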


The e1000e driver installs its ethtool functions in the PCI probe function (drivers/net/ethernet/intel/e1000e/netdev.c:6627). This registers a structure of function pointers for each of the ethtool functions supported by e1000e (accessing stats, changing ring buffer sizes, etc.), which is defined at drivers/net/ethernet/intel/e1000e/ethtool.c:2316.


The igb driver installs its ethtool functions in the PCI probe function (drivers/net/ethernet/intel/igb/igb_main.c:2091). The structure of function pointers it registers is defined at drivers/net/ethernet/intel/igb/igb_ethtool.c:1905.


The ixgbe driver installs its ethtool functions in the PCI probe function (drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:7883). The structure of function pointers it registers is defined at drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c:7686.


The ethtool functions are installed in the tg3 driver in the PCI probe function (drivers/net/ethernet/broadcom/tg3.c:17456):

dev->ethtool_ops = &tg3_ethtool_ops;


The ethtool functions are installed in the be2net driver in a function called from the PCI probe function (drivers/net/ethernet/emulex/benet/be_main.c:4094):

SET_ETHTOOL_OPS(netdev, &be_ethtool_ops);


The ethtool functions are installed in the bnx2 driver in a function called from the PCI probe function (drivers/net/ethernet/broadcom/bnx2.c:8539):

dev->ethtool_ops = &bnx2_ethtool_ops;

NAPI poll

Prior to the existence of NAPI, NICs would generate an interrupt for each packet received indicating that data is available to be processed by the kernel.

NAPI changes this by allowing a device driver to register a poll function that the NAPI subsystem will call to harvest packets. This method of gathering packets has reduced overhead compared to the older method, as many packets can be consumed at a time instead of processing only a single packet per interrupt.

The device driver implements a poll function and registers it with NAPI using netif_napi_add. When registering a NAPI poll function with netif_napi_add, the driver also specifies the “weight”. Most of the drivers hardcode a value of 64. This value and its meaning will be described in more detail below.
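To make the idea of harvesting many packets per poll concrete, here is a small user-space model. It is a sketch, not driver code: demo_rx_queue, demo_poll and demo_drain are hypothetical names, and the weight is treated simply as an upper bound on packets processed per call.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NAPI_WEIGHT 64 /* the value most of the drivers here hardcode */

/* Hypothetical stand-in for a receive queue: just counters. */
struct demo_rx_queue {
    int pending;   /* packets delivered by the NIC but not yet processed */
    int processed; /* packets harvested so far */
};

/*
 * Simplified poll function: process at most `budget` packets and
 * return how many were actually processed.
 */
int demo_poll(struct demo_rx_queue *q, int budget)
{
    int work_done = 0;

    assert(q != NULL && budget > 0);
    while (work_done < budget && q->pending > 0) {
        q->pending--;
        q->processed++;
        work_done++;
    }
    return work_done;
}

/* Count how many poll calls it takes to drain the queue with a given weight. */
int demo_drain(struct demo_rx_queue *q, int weight)
{
    int calls = 0;

    while (q->pending > 0) {
        demo_poll(q, weight);
        calls++;
    }
    return calls;
}
```

With 150 packets waiting and a weight of DEMO_NAPI_WEIGHT, this model empties the queue in three poll calls, where the pre-NAPI approach would have taken one interrupt per packet.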

Typically, drivers register their NAPI poll functions during driver initialization. In the drivers this document examines, the poll function is registered in the PCI probe function itself or in a helper function called from there.


The e1000e driver registers its NAPI poll function in the e1000_probe function (drivers/net/ethernet/intel/e1000e/netdev.c:6629):

netif_napi_add(netdev, &adapter->napi, e1000e_poll, 64);

e1000e registers a single NAPI poll function because this device supports only a single receive queue. All of the other drivers being examined support multiple receive queues and will call this function to register multiple NAPI poll functions.


The igb driver registers its NAPI poll function in igb_alloc_q_vector (drivers/net/ethernet/intel/igb/igb_main.c:1180):

netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64);

This function is called from igb_alloc_q_vectors. igb_alloc_q_vector is called multiple times to initialize each of the RX and TX queues. igb_alloc_q_vectors is called from igb_init_interrupt_scheme which is called from several locations, one of which is igb_probe.


The ixgbe driver registers its NAPI poll function in ixgbe_alloc_q_vector (drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c:813):

netif_napi_add(adapter->netdev, &q_vector->napi, ixgbe_poll, 64);

Similar to igb, this function is called from ixgbe_alloc_q_vectors and it is called multiple times to initialize each of the RX and TX queues. ixgbe_alloc_q_vectors is called from ixgbe_init_interrupt_scheme which is called from several locations, one of which is ixgbe_probe.


The tg3 driver registers its NAPI poll function in tg3_napi_init (drivers/net/ethernet/broadcom/tg3.c:7366):

netif_napi_add(tp->dev, &tp->napi[i].napi, tg3_poll_msix, 64);

This is called within a loop registering a NAPI poll function for each RX and TX queue. It is called from tg3_start, which is called from tg3_open. Unlike most of the other drivers examined here, tg3 registers its NAPI poll function in ndo_open and not in PCI probe.


The be2net driver registers its NAPI poll function in be_evt_queues_create (drivers/net/ethernet/emulex/benet/be_main.c:2053):

netif_napi_add(adapter->netdev, &eqo->napi, be_poll, BE_NAPI_WEIGHT);

This is called within a loop registering a NAPI poll function for each RX and TX queue. It is called from be_setup_queues, which is called from be_setup, which is called from be_probe.


The bnx2 driver registers its NAPI poll function in bnx2_init_napi (drivers/net/ethernet/broadcom/bnx2.c:6322):

netif_napi_add(bp->dev, &bp->bnx2_napi[i].napi, poll, 64);

This is called within a loop registering a NAPI poll function for each RX and TX queue. Like tg3, bnx2 registers its NAPI poll functions from its ndo_open function, bnx2_open, and not from PCI probe.

Interrupt number

The interrupt number is obtained from the struct pci_dev structure and stored on the net_device structure:

netdev->irq = pdev->irq;

Later in device initialization, the IRQ handlers for this IRQ number will be registered.

Driver initialization

When a network device is brought up (for example, with ifconfig eth0 up), an open function is called in the device driver. A pointer to this function is installed in a net_device_ops structure at a field named ndo_open. In e1000e, this function is called e1000_open (drivers/net/ethernet/intel/e1000e/netdev.c:4241).

The open function will typically do things like:

  1. Allocate RX and TX queue memory
  2. Enable NAPI
  3. Register an interrupt handler
  4. Enable hardware interrupts

And more.

Allocating RX and TX queue memory

For example, the e1000e driver (found in drivers/net/ethernet/intel/e1000e/) allocates its queues in netdev.c around line 4279:

/* allocate transmit descriptors */
err = e1000e_setup_tx_resources(adapter->tx_ring);
if (err)
  goto err_setup_tx;

/* allocate receive descriptors */
err = e1000e_setup_rx_resources(adapter->rx_ring);
if (err)
  goto err_setup_rx;

The e1000e_setup_rx_resources and e1000e_setup_tx_resources functions allocate the receive and transmit queues and initialize the associated data structures. It is important to note that the NIC reads and writes these queues directly via DMA. In other words: when data arrives from the network, the NIC writes it directly to the receive queue via DMA. The queue size defaults to E1000_DEFAULT_RXD (256) and the maximum is E1000_MAX_RXD (4096). These values are driver specific.

If data arrives faster than it can be processed, it will fill the queue. Once the queue is full, additional data that arrives will be dropped.
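The fill-then-drop behavior can be illustrated with a tiny fixed-size ring. This is a user-space sketch only: demo_ring and its functions are hypothetical names, plain integers stand in for packets, and no DMA is involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_RING_SIZE 4 /* tiny on purpose; e1000e defaults to 256 */

struct demo_ring {
    int slots[DEMO_RING_SIZE];
    unsigned int head;     /* next slot the "NIC" writes */
    unsigned int tail;     /* next slot the "kernel" reads */
    unsigned int count;    /* slots currently occupied */
    unsigned long dropped; /* arrivals that found the ring full */
};

/* Called when a packet arrives. Returns false if it had to be dropped. */
bool demo_ring_receive(struct demo_ring *r, int packet)
{
    assert(r != NULL);
    if (r->count == DEMO_RING_SIZE) {
        r->dropped++; /* ring is full: nowhere to put the data */
        return false;
    }
    r->slots[r->head] = packet;
    r->head = (r->head + 1) % DEMO_RING_SIZE;
    r->count++;
    return true;
}

/* Called when the packet processing side catches up. */
bool demo_ring_consume(struct demo_ring *r, int *packet)
{
    assert(r != NULL && packet != NULL);
    if (r->count == 0)
        return false;
    *packet = r->slots[r->tail];
    r->tail = (r->tail + 1) % DEMO_RING_SIZE;
    r->count--;
    return true;
}
```

If six packets arrive before anything is consumed, the first four are stored and the last two are counted as dropped. A larger ring absorbs a longer burst, but it cannot help if the consumer is slower than the arrival rate on average.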

You can determine if drops are happening and increase the queue size by using the command line tool ethtool. ethtool communicates with the device driver by using the ioctl system call.

Most drivers have a file named ethtool.c or *_ethtool.c implementing this interface. Not all drivers implement every possible ethtool method, so you should check the driver code and ethtool output to determine if what you are doing is supported by the driver or not.

You can get stats from ethtool by using the -S flag, for example:

ethtool -S eth0

The names of the stats will differ from driver to driver, so you should read the output carefully and grep for things like “drop”, “miss”, and “error”.
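If you would rather filter stat names in a program than with grep, a substring check along these lines is one way to do it. This is a sketch: demo_stat_is_suspect is a hypothetical helper, and the hint list is just the three words suggested above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if a stat name contains one of the hint substrings. */
bool demo_stat_is_suspect(const char *name)
{
    static const char *const hints[] = { "drop", "miss", "error" };
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(hints) / sizeof(hints[0]); i++) {
        if (strstr(name, hints[i]) != NULL)
            return true;
    }
    return false;
}
```

Note that a filter like this matches rx_missed_errors but not e1000e's rx_no_buffer_count, which is why it is still worth reading the full output for your driver at least once.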

As far as e1000e is concerned:

  • The rx_no_buffer_count statistic (also known as RNBC) indicates that there was nowhere to DMA the packet. Increasing the rx ring (explained below) can help reduce the number of rx_no_buffer_count seen over time.
  • The rx_missed_errors statistic indicates that rx_no_buffer_count happened enough times that packets were dropped. Increasing the rx queue size can help reduce this count.

To increase the rx (or tx) queue size, you can run:

ethtool -G eth0 rx 4096

This increases the rx queue for eth0 to 4096.

Some NICs have multiple RX and TX queues for added performance. We’ll see shortly why having more than one queue for RX can be beneficial.

You can check if your NIC supports multiple queues by using ethtool and the -l flag:

ethtool -l eth0

You can increase the number of queues by using the -L flag:

ethtool -L eth0 rx 8

Note that not all device drivers support this ethtool function, so you may need to consult your device driver source code.

Enable NAPI

e1000e enables NAPI by calling napi_enable (from drivers/net/ethernet/intel/e1000e/netdev.c:4332), a static inline function defined at include/linux/netdevice.h:500.


This simply clears a bit on the state field of the napi_struct.
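A user-space model of that bit may help. This is a sketch under assumptions, not the kernel's definition: the demo_* names and the bit position are hypothetical, the operations here are not atomic (the kernel uses atomic bit operations), and the idea that a disabled instance is represented by holding the "scheduled" bit reflects my reading of kernels from this era.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_STATE_SCHED (1UL << 0) /* hypothetical bit position */

struct demo_napi {
    unsigned long state;
};

/* Model a freshly registered, not yet enabled instance: the bit is held. */
void demo_napi_init_disabled(struct demo_napi *n)
{
    assert(n != NULL);
    n->state = DEMO_STATE_SCHED;
}

/* Enabling clears the bit so the instance can be scheduled. */
void demo_napi_enable(struct demo_napi *n)
{
    assert(n != NULL);
    assert(n->state & DEMO_STATE_SCHED); /* enabling twice would be a bug */
    n->state &= ~DEMO_STATE_SCHED;
}

/* Returns true if the caller may schedule the poll function. */
bool demo_napi_schedule_prep(struct demo_napi *n)
{
    assert(n != NULL);
    if (n->state & DEMO_STATE_SCHED)
        return false; /* disabled, or already scheduled */
    n->state |= DEMO_STATE_SCHED;
    return true;
}
```

In this model, an attempt to schedule polling before demo_napi_enable is refused, and after enabling only the first attempt succeeds until the bit is cleared again.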

Register an interrupt handler

There are different methods a device can use to signal an interrupt: MSI-X, MSI, and legacy interrupts.

The driver must determine which method is supported by the device and register the appropriate handler function that will execute when the interrupt is received.

The e1000e driver tries to register an MSI-X interrupt handler first, falling back to MSI on failure, falling back again to a legacy interrupt handler if MSI handler registration fails.

This logic is abstracted into e1000_request_irq which is called during driver initialization (from drivers/net/ethernet/intel/e1000e/netdev.c:4303) and can be found at drivers/net/ethernet/intel/e1000e/netdev.c:2132.
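The fallback order can be sketched as a chain of attempts. This is a simplified model and not the body of e1000_request_irq: all demo_* names are hypothetical, and each "attempt" just consults a flag instead of talking to hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_int_mode {
    DEMO_INT_MSIX,
    DEMO_INT_MSI,
    DEMO_INT_LEGACY,
    DEMO_INT_FAILED
};

/* Hypothetical device: records which registration attempts would succeed. */
struct demo_device {
    bool msix_works;
    bool msi_works;
    bool legacy_works;
};

/* Stand-ins for the per-mode registration attempts; 0 means success. */
static int demo_request_msix(const struct demo_device *dev)
{
    return dev->msix_works ? 0 : -1;
}

static int demo_request_msi(const struct demo_device *dev)
{
    return dev->msi_works ? 0 : -1;
}

static int demo_request_legacy(const struct demo_device *dev)
{
    return dev->legacy_works ? 0 : -1;
}

/* Try MSI-X first, then MSI, then legacy; report which one was registered. */
enum demo_int_mode demo_request_irq(const struct demo_device *dev)
{
    assert(dev != NULL);
    if (demo_request_msix(dev) == 0)
        return DEMO_INT_MSIX;
    if (demo_request_msi(dev) == 0)
        return DEMO_INT_MSI;
    if (demo_request_legacy(dev) == 0)
        return DEMO_INT_LEGACY;
    return DEMO_INT_FAILED;
}
```

Ordering the attempts from most to least capable means the driver ends up with the best mode the device and platform allow without needing to know in advance which one that is.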

MSI-X interrupts are the preferred method, especially for NICs that support multiple RX and TX queues. This is because each RX and TX queue can have its own hardware interrupt assigned, which can then be handled by a specific CPU (with irqbalance or by modifying /proc/irq/IRQ_NUMBER/smp_affinity). In this way, arriving packets can be processed by separate CPUs from the hardware interrupt level.
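The value written to /proc/irq/IRQ_NUMBER/smp_affinity is a hexadecimal bitmask of CPUs, with bit N standing for CPU N. A small helper for building a single-CPU mask might look like the sketch below; demo_affinity_mask is a hypothetical name, and the helper only covers CPUs 0 through 31 to keep the formatting simple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Write the hex mask that selects a single CPU (0..31) into buf.
 * Returns 0 on success, -1 on bad input or if buf is too small.
 */
int demo_affinity_mask(unsigned int cpu, char *buf, size_t len)
{
    int n;

    if (cpu >= 32 || buf == NULL)
        return -1;

    n = snprintf(buf, len, "%x", 1u << cpu);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return 0;
}
```

For CPU 2 this produces the string 4, and for CPU 5 it produces 20. Writing the mask to the smp_affinity file requires root, and a running irqbalance may later change it.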

If MSI-X is unavailable, MSI still presents advantages over legacy interrupts.

In the e1000e driver, the functions e1000_intr_msix_rx, e1000_intr_msi, and e1000_intr are the interrupt handler methods used for the MSI-X, MSI, and legacy interrupt modes, respectively.

The handler is registered to the IRQ number obtained when the PCI subsystem called the probe function earlier.

For example, here is the registration of the interrupt handler for an MSI interrupt from drivers/net/ethernet/intel/e1000e/netdev.c:2147:

err = request_irq(adapter->pdev->irq, e1000_intr_msi, 0, netdev->name, netdev);

Enable interrupts

Finally, once initialization is complete, interrupts are enabled on the device. Incoming packets will now trigger an interrupt to be raised, causing the function registered above to be executed to handle the incoming data.

Enabling interrupts is device specific, but on e1000e the e1000_irq_enable function is called, which writes a value to a device register to enable interrupts.