Linux networking stack from the ground up, part 2
Overview
This post will pick up where part 1 left off, beginning by explaining what ethtool is, how device drivers register code for ethtool, how drivers enable NAPI, and how drivers enable interrupts.
ethtool setup
ethtool is a command line program you can use to get and set driver information. You can install it on Ubuntu by running apt-get install ethtool.
Some ethtool settings of interest will be described later in this document.
The ethtool program talks to device drivers by using the ioctl system call.
The device drivers register a series of functions that run for the ethtool operations and the kernel provides the glue.
When an ioctl call is made from ethtool, the kernel finds the ethtool structure registered by the appropriate driver and executes the functions registered.
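The registration and dispatch described above can be modeled in plain userspace C as a table of function pointers. Every name prefixed with toy_ is invented for this sketch; the real kernel structures have many more fields and operations.

```c
#include <stddef.h>

struct toy_netdev;

/* Stand-in for the structure of function pointers a driver registers. */
struct toy_ethtool_ops {
    int (*get_ring_size)(struct toy_netdev *dev);
    int (*set_ring_size)(struct toy_netdev *dev, int size);
};

struct toy_netdev {
    const struct toy_ethtool_ops *ethtool_ops;
    int rx_ring_size;
};

enum toy_cmd { TOY_GET_RING, TOY_SET_RING };

/* "Driver" side: the functions behind each operation. */
static int toy_get_ring(struct toy_netdev *dev)
{
    return dev->rx_ring_size;
}

static int toy_set_ring(struct toy_netdev *dev, int size)
{
    dev->rx_ring_size = size;
    return 0;
}

static const struct toy_ethtool_ops toy_ops = {
    .get_ring_size = toy_get_ring,
    .set_ring_size = toy_set_ring,
};

/* "Probe": the driver installs its table on the device. */
void toy_probe(struct toy_netdev *dev)
{
    dev->ethtool_ops = &toy_ops;
    dev->rx_ring_size = 256;
}

/* "Kernel glue": find the registered table and call the right function.
 * Returns -1 when no table or no function is registered. */
int toy_dispatch(struct toy_netdev *dev, enum toy_cmd cmd, int arg)
{
    const struct toy_ethtool_ops *ops = dev->ethtool_ops;

    if (ops == NULL)
        return -1;
    switch (cmd) {
    case TOY_GET_RING:
        return ops->get_ring_size ? ops->get_ring_size(dev) : -1;
    case TOY_SET_RING:
        return ops->set_ring_size ? ops->set_ring_size(dev, arg) : -1;
    }
    return -1;
}
```

A driver that leaves a pointer unset simply does not support that operation, which is why not every ethtool command works on every device.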
e1000e
The ethtool functions are installed in the e1000e driver in the PCI probe function (drivers/net/ethernet/intel/e1000e/netdev.c:6627):
e1000e_set_ethtool_ops(netdev);
This registers a structure of function pointers for each of the ethtool functions supported by e1000e (accessing stats, changing ring buffer sizes, etc.) from drivers/net/ethernet/intel/e1000e/ethtool.c:2316.
igb
The ethtool functions are installed in the igb driver in the PCI probe function (drivers/net/ethernet/intel/igb/igb_main.c:2091):
igb_set_ethtool_ops(netdev);
See drivers/net/ethernet/intel/igb/igb_ethtool.c:1905.
ixgbe
The ethtool functions are installed in the ixgbe driver in the PCI probe function (drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:7883):
ixgbe_set_ethtool_ops(netdev);
See drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c:7686.
tg3
The ethtool functions are installed in the tg3 driver in the PCI probe function (drivers/net/ethernet/broadcom/tg3.c:17456):
dev->ethtool_ops = &tg3_ethtool_ops;
be2net
The ethtool functions are installed in the be2net driver in a function called from the PCI probe function (drivers/net/ethernet/emulex/benet/be_main.c:4094):
SET_ETHTOOL_OPS(netdev, &be_ethtool_ops);
bnx2
The ethtool functions are installed in the bnx2 driver in a function called from the PCI probe function (drivers/net/ethernet/broadcom/bnx2.c:8539):
dev->ethtool_ops = &bnx2_ethtool_ops;
NAPI poll
Prior to the existence of NAPI, NICs would generate an interrupt for each packet received indicating that data is available to be processed by the kernel.
NAPI changes this by allowing a device driver to register a poll function that the NAPI subsystem will call to harvest packets. This method of gathering packets has reduced overhead compared to the older method, as many packets can be consumed at a time instead of processing only a single packet per interrupt.
The device driver implements a poll function and registers it with NAPI using netif_napi_add. When registering a NAPI poll function with netif_napi_add, the driver will also specify the “weight”. Most of the drivers hardcode a value of 64. This value and its meaning will be described in more detail below.
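To make the idea of harvesting many packets per call concrete, here is a userspace model of a poll function. It assumes the weight acts as an upper bound (a budget) on how many packets one poll call may process; the toy_ names and the counter-based "ring" are invented for the sketch and are not taken from any driver.

```c
/* Toy receive ring: just counters, no real packet data. */
struct toy_rx_ring {
    int pending;    /* packets written by the "NIC" and not yet processed */
    int processed;  /* running total handed up to the stack */
};

/* Process at most `budget` packets and report how many were handled.
 * A return value below `budget` means the ring was drained. */
int toy_poll(struct toy_rx_ring *ring, int budget)
{
    int work_done = 0;

    while (work_done < budget && ring->pending > 0) {
        ring->pending--;
        ring->processed++;
        work_done++;
    }
    return work_done;
}
```

With 100 packets pending and a budget of 64, the first call handles 64 packets and a second call handles the remaining 36, so two calls replace what would otherwise be 100 separate interrupts.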
Typically, drivers register their NAPI poll functions during driver initialization. In the drivers this document examines, the poll function is registered in the PCI probe function itself or in a helper function called from there.
e1000e
The e1000e driver registers its NAPI poll function in the e1000_probe function (drivers/net/ethernet/intel/e1000e/netdev.c:6629):
netif_napi_add(netdev, &adapter->napi, e1000e_poll, 64);
e1000e registers a single NAPI poll function because this device supports only a single receive queue. All of the other drivers being examined support multiple receive queues and will call netif_napi_add multiple times to register multiple NAPI poll functions.
igb
The igb driver registers its NAPI poll function in igb_alloc_q_vector (drivers/net/ethernet/intel/igb/igb_main.c:1180):
netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64);
This function is called from igb_alloc_q_vectors. igb_alloc_q_vector is called multiple times to initialize each of the RX and TX queues. igb_alloc_q_vectors is called from igb_init_interrupt_scheme, which is called from several locations, one of which is igb_probe.
ixgbe
The ixgbe driver registers its NAPI poll function in ixgbe_alloc_q_vector (drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c:813):
netif_napi_add(adapter->netdev, &q_vector->napi, ixgbe_poll, 64);
Similar to igb, this function is called from ixgbe_alloc_q_vectors and it is called multiple times to initialize each of the RX and TX queues. ixgbe_alloc_q_vectors is called from ixgbe_init_interrupt_scheme, which is called from several locations, one of which is ixgbe_probe.
tg3
The tg3 driver registers its NAPI poll function in tg3_napi_init (drivers/net/ethernet/broadcom/tg3.c:7366):
netif_napi_add(tp->dev, &tp->napi[i].napi, tg3_poll_msix, 64);
This is called within a loop registering a NAPI poll function for each RX and TX queue. It is called from tg3_start, which is called from tg3_open. Unlike the other drivers, tg3 registers its NAPI poll function in ndo_open and not in PCI probe.
be2net
The be2net driver registers its NAPI poll function in be_evt_queues_create (drivers/net/ethernet/emulex/benet/be_main.c:2053):
netif_napi_add(adapter->netdev, &eqo->napi, be_poll, BE_NAPI_WEIGHT);
This is called within a loop registering a NAPI poll function for each RX and TX queue. It is called from be_setup_queues, which is called from be_setup, which is called from be_probe.
bnx2
The bnx2 driver registers its NAPI poll function in bnx2_init_napi (drivers/net/ethernet/broadcom/bnx2.c:6322):
netif_napi_add(bp->dev, &bp->bnx2_napi[i].napi, poll, 64);
This is called within a loop registering a NAPI poll function for each RX and TX queue. Like tg3, bnx2 allocates queue memory in its ndo_open function bnx2_open and not in PCI probe.
Interrupt number
The interrupt number is obtained from the pci_dev structure and stored on the net_device structure:
netdev->irq = pdev->irq;
Later in device initialization, the IRQ handlers for this IRQ number will be registered.
Driver initialization
When a network device is brought up (for example, with ifconfig eth0 up), an open function is called in the device driver. A pointer to this function is installed in a net_device_ops structure at a field named ndo_open. In e1000e, this function is called e1000_open (drivers/net/ethernet/intel/e1000e/netdev.c:4241).
The open function will typically do things like:
- Allocate RX and TX queue memory
- Enable NAPI
- Register an interrupt handler
- Enable hardware interrupts
And more.
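The steps listed above can be sketched as a userspace model of an open function that undoes earlier steps when a later one fails, using the same goto-based cleanup style that kernel drivers use. The toy_ names, the boolean flags, and the fail_step hook are all invented for illustration, and the step order simply follows the list; a real driver may order things differently.

```c
#include <stdbool.h>

struct toy_adapter {
    bool tx_ready, rx_ready, napi_on, irq_registered, irq_enabled;
    int fail_step;  /* test hook: number of the step that should fail, 0 for none */
};

/* Pretend to run step `n`; fails only when the test hook asks for it. */
static int toy_step(const struct toy_adapter *a, int n)
{
    return a->fail_step == n ? -1 : 0;
}

int toy_open(struct toy_adapter *a)
{
    int err;

    err = toy_step(a, 1);       /* allocate TX queue memory */
    if (err)
        goto err_setup_tx;
    a->tx_ready = true;

    err = toy_step(a, 2);       /* allocate RX queue memory */
    if (err)
        goto err_setup_rx;
    a->rx_ready = true;

    a->napi_on = true;          /* enable NAPI (cannot fail in this model) */

    err = toy_step(a, 3);       /* register an interrupt handler */
    if (err)
        goto err_req_irq;
    a->irq_registered = true;

    a->irq_enabled = true;      /* enable hardware interrupts */
    return 0;

err_req_irq:
    a->napi_on = false;
    a->rx_ready = false;        /* free RX queue memory */
err_setup_rx:
    a->tx_ready = false;        /* free TX queue memory */
err_setup_tx:
    return err;
}
```

Each label releases only what was acquired before the failing step, so a failure at any point leaves the device in the state it was in before open was called.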
Allocating RX and TX queue memory
For example, the e1000e driver (found in drivers/net/ethernet/intel/e1000e/) allocates its queue memory in netdev.c around line 4279:
/* allocate transmit descriptors */
err = e1000e_setup_tx_resources(adapter->tx_ring);
if (err)
	goto err_setup_tx;

/* allocate receive descriptors */
err = e1000e_setup_rx_resources(adapter->rx_ring);
if (err)
	goto err_setup_rx;
The e1000e_setup_rx_resources and e1000e_setup_tx_resources functions allocate the receive and transmit queues and initialize associated data structures. It is important to note that these queues are read and written directly by the NIC via DMA. In other words: when data arrives from the network, the data is written directly to the receive queue by the NIC via DMA. The queue size defaults to E1000_DEFAULT_RXD (256) and the max is E1000_MAX_RXD (4096). These values are driver specific.
If data arrives faster than it can be processed, it will fill the queue. Once the queue is full, additional data that arrives will be dropped.
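The drop behavior is easy to see in a small model. The sketch below treats the receive queue as a fixed number of slots; the toy_ names are invented, and a real ring holds descriptors that point at DMA buffers rather than a simple counter.

```c
struct toy_ring {
    unsigned int size;      /* number of slots, e.g. 256 */
    unsigned int used;      /* slots currently holding unprocessed packets */
    unsigned long dropped;  /* packets that arrived while the ring was full */
};

/* "NIC" side: a packet arrives. Returns 1 if it was stored, 0 if dropped. */
int toy_ring_receive(struct toy_ring *r)
{
    if (r->used == r->size) {
        r->dropped++;
        return 0;
    }
    r->used++;
    return 1;
}

/* "Driver" side: process up to `n` packets and return how many were freed. */
unsigned int toy_ring_consume(struct toy_ring *r, unsigned int n)
{
    if (n > r->used)
        n = r->used;
    r->used -= n;
    return n;
}
```

If 300 packets arrive at a 256-slot ring before the driver consumes anything, 44 of them are dropped. A larger ring absorbs a longer burst, but it cannot help if the driver is slower than the arrival rate on average.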
You can determine if drops are happening and increase the queue size by using the command line tool ethtool. ethtool communicates with the device driver by using the ioctl system call.
Most drivers have a file named ethtool.c or *_ethtool.c implementing this interface. Not all drivers implement every possible ethtool method, so you should check the driver code and ethtool output to determine if what you are doing is supported by the driver or not.
You can get stats from ethtool by using the -S flag, for example:
ethtool -S eth0
The names of the stats will differ from driver to driver, so you should read the output carefully and grep for things like “drop”, “miss”, and “error”.
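If you collect these stats from a program rather than with grep, the same keyword filter might look like the sketch below. The function name is hypothetical and the keyword list is just the three words mentioned above.

```c
#include <string.h>

/* Return 1 if a statistic name contains "drop", "miss", or "error".
 * The match is case-sensitive, like a plain grep. */
int stat_looks_interesting(const char *name)
{
    static const char *const keywords[] = { "drop", "miss", "error" };
    size_t i;

    for (i = 0; i < sizeof(keywords) / sizeof(keywords[0]); i++) {
        if (strstr(name, keywords[i]) != NULL)
            return 1;
    }
    return 0;
}
```

A keyword filter is only a starting point: a name such as rx_no_buffer_count matches none of the three words, which is why reading the full output for your driver still matters.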
As far as e1000e is concerned:
- The rx_no_buffer_count statistic (also known as RNBC) indicates that there was nowhere to DMA the packet. Increasing the rx ring (explained below) can help reduce the number of rx_no_buffer_count seen over time.
- The rx_missed_errors statistic indicates that rx_no_buffer_count happened enough times that packets were dropped. Increasing the rx queue size can help reduce this count.
To increase the rx (or tx) queue size, you can run:
ethtool -G eth0 rx 4096
This increases the rx queue size for eth0 to 4096.
Some NICs have multiple RX and TX queues for added performance. We’ll see shortly why having more than one queue for RX can be beneficial.
You can check if your NIC supports multiple queues by using ethtool and the -l flag:
ethtool -l eth0
You can increase the number of queues by using the -L flag:
ethtool -L eth0 rx 8
Note that not all device drivers support this ethtool function, so you may need to consult your device driver source code.
Enable NAPI
e1000e enables NAPI by calling napi_enable (from drivers/net/ethernet/intel/e1000e/netdev.c:4332), a static inline function defined at include/linux/netdevice.h:500:
napi_enable(&adapter->napi);
This simply clears a bit on the state field of the napi_struct.
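A userspace model shows why clearing one bit is enough to "enable" something. The sketch assumes the bit marks the poll function as already scheduled, so that while the bit is set nothing new can be scheduled. The toy_ names and the use of C11 atomics are choices made for this example and are not the kernel's implementation.

```c
#include <stdatomic.h>

#define TOY_STATE_SCHED (1UL << 0)

struct toy_napi {
    atomic_ulong state;
};

/* Disable: leave the bit set so nothing new can be scheduled. */
void toy_napi_disable(struct toy_napi *n)
{
    atomic_fetch_or(&n->state, TOY_STATE_SCHED);
}

/* Enable: clear the bit, which is all "enabling" means in this model. */
void toy_napi_enable(struct toy_napi *n)
{
    atomic_fetch_and(&n->state, ~TOY_STATE_SCHED);
}

/* Try to schedule a poll: succeeds only if the bit was clear, and sets it.
 * Returns 1 on success and 0 if polling is disabled or already scheduled. */
int toy_napi_try_schedule(struct toy_napi *n)
{
    unsigned long old = atomic_fetch_or(&n->state, TOY_STATE_SCHED);

    return (old & TOY_STATE_SCHED) == 0;
}
```

Under that assumption a disabled instance looks exactly like one that is permanently "already scheduled", so the scheduling path needs no separate enabled flag.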
Register an interrupt handler
There are different methods a device can use to signal an interrupt: MSI-X, MSI, and legacy interrupts.
The driver must determine which method is supported by the device and register the appropriate handler function that will execute when the interrupt is received.
The e1000e driver tries to register an MSI-X interrupt handler first, falling back to MSI on failure, falling back again to a legacy interrupt handler if MSI handler registration fails.
This logic is abstracted into e1000_request_irq, which is called during driver initialization (from drivers/net/ethernet/intel/e1000e/netdev.c:4303) and can be found at drivers/net/ethernet/intel/e1000e/netdev.c:2132.
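The fallback order can be sketched in a few lines. This is a userspace model with invented toy_ names, where each "request" just checks a flag; it is not the body of e1000_request_irq.

```c
enum toy_int_mode { TOY_INT_MSIX, TOY_INT_MSI, TOY_INT_LEGACY, TOY_INT_NONE };

/* Which registration attempts would succeed on this pretend device. */
struct toy_device {
    int msix_ok;
    int msi_ok;
    int legacy_ok;
};

static int toy_request_msix(const struct toy_device *d)
{
    return d->msix_ok ? 0 : -1;
}

static int toy_request_msi(const struct toy_device *d)
{
    return d->msi_ok ? 0 : -1;
}

static int toy_request_legacy(const struct toy_device *d)
{
    return d->legacy_ok ? 0 : -1;
}

/* Try MSI-X first, then MSI, then legacy; report which mode was used. */
enum toy_int_mode toy_request_irq(const struct toy_device *d)
{
    if (toy_request_msix(d) == 0)
        return TOY_INT_MSIX;
    if (toy_request_msi(d) == 0)
        return TOY_INT_MSI;
    if (toy_request_legacy(d) == 0)
        return TOY_INT_LEGACY;
    return TOY_INT_NONE;
}
```

Trying the most capable mode first means the caller never has to know in advance what the hardware supports; it only needs to handle the case where every attempt fails.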
MSI-X interrupts are the preferred method, especially for NICs that support multiple RX and TX queues. This is because each RX and TX queue can have its own hardware interrupt assigned, which can then be handled by a specific CPU (with irqbalance or by modifying /proc/irq/IRQ_NUMBER/smp_affinity). In this way, arriving packets can be processed by separate CPUs from the hardware interrupt level.
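The smp_affinity file takes a hexadecimal bitmask of CPUs, where bit N selects CPU N. As a sketch, the helpers below build the mask for a single CPU and write it to the file for a given IRQ. Both function names are hypothetical, the mask helper only handles CPUs 0 through 31, and the write needs root privileges on a real system.

```c
#include <stdio.h>

/* Format the hex mask selecting one CPU (0..31) into buf.
 * Returns the string length, or -1 on bad input or a too-small buffer. */
int format_affinity_mask(unsigned int cpu, char *buf, size_t len)
{
    int n;

    if (cpu >= 32)
        return -1;
    n = snprintf(buf, len, "%x", 1u << cpu);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return n;
}

/* Pin `irq` to `cpu` by writing the mask to /proc/irq/<irq>/smp_affinity.
 * Returns 0 on success and -1 on any failure. */
int write_irq_affinity(unsigned int irq, unsigned int cpu)
{
    char path[64];
    char mask[16];
    FILE *f;
    int ok;

    if (format_affinity_mask(cpu, mask, sizeof(mask)) < 0)
        return -1;
    snprintf(path, sizeof(path), "/proc/irq/%u/smp_affinity", irq);

    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = fputs(mask, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For example, CPU 2 corresponds to the mask 4 and CPU 4 to the mask 10. If irqbalance is running it may later rewrite the affinity you set by hand.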
If MSI-X is unavailable, MSI still presents advantages over legacy interrupts.
In the e1000e driver, the functions e1000_intr_msix_rx, e1000_intr_msi, and e1000_intr are the interrupt handler methods used for the MSI-X, MSI, and legacy interrupt modes, respectively.
The handler is registered for the IRQ number obtained when the PCI subsystem called the probe function earlier.
For example, the registration of the interrupt handler for an MSI interrupt from drivers/net/ethernet/intel/e1000e/netdev.c:2147:
err = request_irq(adapter->pdev->irq, e1000_intr_msi, 0, netdev->name, netdev);
Enable interrupts
Finally, once initialization is complete, interrupts are enabled on the device. Incoming packets will now trigger an interrupt to be raised, causing the function registered above to be executed to handle the incoming data.
Enabling interrupts is device specific, but on e1000e the e1000_irq_enable function is called, which writes a value to a device register to enable interrupts.