Linux networking stack from the ground up, part 3

Posted on Jan 28, 2016 by PIA Research


This post will pick up where part 2 left off, beginning by describing a packet arriving, examining softirqs, and examining how the e1000e driver passes packets up the network stack.

A packet arrives!

So, at long last, a packet arrives from the network. Assuming the rx ring buffer has enough space, the packet is written into the ring buffer via DMA and the device raises the interrupt assigned to it (or, in the case of MSI-X, the IRQ tied to the rx queue the packet arrived on).

You can find statistics about hardware interrupts by checking the /proc/interrupts file.
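As a rough illustration of reading that file, the sketch below sums the per-CPU counters on a single row. It assumes the usual layout of a label followed by a colon, one decimal counter per CPU, and then free-form text naming the interrupt controller and device. The function name sum_irq_counts is made up for this example, and the parsing stops at the first token that does not start with a digit, so treat it as a starting point rather than a robust parser.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters on one /proc/interrupts row.
 * Returns the total, or -1 if the line has no "label:" prefix
 * (for example the header row that lists CPU0, CPU1, ...). */
long long sum_irq_counts(const char *line)
{
    const char *p = strchr(line, ':');
    long long total = 0;

    if (!p)
        return -1;
    p++;                                  /* skip the colon */

    for (;;) {
        char *end;

        while (*p == ' ' || *p == '\t')
            p++;
        if (!isdigit((unsigned char)*p))
            break;                        /* reached the descriptive text */
        total += strtoll(p, &end, 10);
        p = end;
    }
    return total;
}
```

Feeding it each line read from /proc/interrupts (with fgets, for instance) gives a per-IRQ total that you can sample twice to estimate an interrupt rate.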

In general, the interrupt handler that runs when an interrupt is raised should defer as much processing as possible to happen outside the interrupt context. This is crucial because while an interrupt is being processed, other interrupts may be blocked.

If we examine the e1000_intr_msi function in e1000e, we can see after some device specific code and workarounds for hardware bugs that the interrupt handler signals NAPI (drivers/net/ethernet/intel/e1000e/netdev.c:1777):

if (napi_schedule_prep(&adapter->napi)) {
  adapter->total_tx_bytes = 0;
  adapter->total_tx_packets = 0;
  adapter->total_rx_bytes = 0;
  adapter->total_rx_packets = 0;
  __napi_schedule(&adapter->napi);
}

This code checks whether NAPI is already running; if not, the statistics counters are reset and NAPI is scheduled to run to process the packets.

At a high level, NAPI is scheduled to run from the hardware interrupt handler, but the NAPI code which does the packet processing is run outside of the hardware interrupt context. This is accomplished with softirqs, which will be detailed next.
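The "schedule only if not already scheduled" check can be pictured as an atomic test-and-set on a state bit. The following is a userspace sketch using C11 atomics, not the kernel's implementation: the sketch_* names and the SKETCH_STATE_SCHED flag are stand-ins invented for this example, and the real napi_schedule_prep performs additional checks that are left out here.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_STATE_SCHED 0x1u   /* hypothetical stand-in for a "scheduled" bit */

struct sketch_napi {
    atomic_uint state;
    unsigned long rx_packets;     /* per-poll statistic, reset when scheduling */
};

/* Returns true only for the caller that flips the bit from 0 to 1. */
bool sketch_schedule_prep(struct sketch_napi *n)
{
    unsigned int old = atomic_fetch_or(&n->state, SKETCH_STATE_SCHED);

    return (old & SKETCH_STATE_SCHED) == 0;
}

/* What an interrupt handler would do: schedule once, reset the stats. */
bool sketch_irq(struct sketch_napi *n)
{
    if (!sketch_schedule_prep(n))
        return false;             /* a poll is already pending */
    n->rx_packets = 0;
    return true;                  /* caller would now queue the poll */
}

/* Called when polling finishes, so the next interrupt can schedule again. */
void sketch_complete(struct sketch_napi *n)
{
    atomic_fetch_and(&n->state, ~SKETCH_STATE_SCHED);
}
```

The point of the pattern is that a burst of interrupts results in a single queued poll rather than one per interrupt.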

The __napi_schedule function:

/**
 * __napi_schedule - schedule for receive
 * @n: entry to schedule
 *
 * The entry's receive function will be scheduled to run
 */
void __napi_schedule(struct napi_struct *n)
{
  unsigned long flags;

  local_irq_save(flags);
  ____napi_schedule(&__get_cpu_var(softnet_data), n);
  local_irq_restore(flags);
}

This function schedules the NAPI poll function to be run. It does this by obtaining the current CPU’s softnet_data structure and passing that and the driver-provided napi_struct to ____napi_schedule:

/* Called with irq disabled */
static inline void ____napi_schedule(struct softnet_data *sd,
                                     struct napi_struct *napi)
{
  list_add_tail(&napi->poll_list, &sd->poll_list);
  __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}

This adds the driver-provided NAPI poll structure to the softnet_data list for the current CPU.
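To make the list manipulation concrete, here is a small userspace sketch of an intrusive, circular doubly linked list in the style the kernel uses. All of the sketch_* names are invented for this example; the kernel's own list helpers live in its headers and are not reproduced here.

```c
#include <stddef.h>

/* Minimal circular doubly linked list node, embedded in its owner. */
struct sketch_list {
    struct sketch_list *next, *prev;
};

void sketch_list_init(struct sketch_list *head)
{
    head->next = head->prev = head;
}

/* Insert item just before head, i.e. at the tail of the queue. */
void sketch_list_add_tail(struct sketch_list *item, struct sketch_list *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

struct sketch_napi_entry {
    int id;                        /* hypothetical identifier for the device queue */
    struct sketch_list poll_list;  /* links this entry into a per-CPU list */
};

struct sketch_softnet {
    struct sketch_list poll_list;  /* head of the per-CPU poll list */
};

/* Return the first queued entry, or NULL if the list is empty. */
struct sketch_napi_entry *sketch_first(struct sketch_softnet *sd)
{
    struct sketch_list *node = sd->poll_list.next;

    if (node == &sd->poll_list)
        return NULL;
    return (struct sketch_napi_entry *)
        ((char *)node - offsetof(struct sketch_napi_entry, poll_list));
}
```

Because entries are appended at the tail and consumed from the head, structures are polled in the order they were scheduled.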

Next, the function calls __raise_softirq_irqoff (from kernel/softirq.c) which contains this code:

struct task_struct *tsk = __this_cpu_read(ksoftirqd);
if (tsk && tsk->state != TASK_RUNNING)
  wake_up_process(tsk);

which brings up an important point: the hardware interrupt handler wakes up the NAPI softirq process on the same CPU as the hardware interrupt handler.


softirq is one mechanism for executing code outside of the hardware interrupt handler context. As mentioned above, this is important because only minimal work should be done in a hardware interrupt handler; the heavy-lifting should be left for later processing.

The softirq system is a series of kernel threads, one per CPU, that run handler functions that have been registered for different softirqs.

The softirq threads are started early in the kernel initialization process (kernel/softirq.c:754):

static __init int spawn_ksoftirqd(void)
{
        register_cpu_notifier(&cpu_nfb);
        BUG_ON(smpboot_register_percpu_thread(&softirq_threads));
        return 0;
}

The softirq_threads structure exports a few fields, but the two important ones are thread_should_run and thread_fn, both of which are called from kernel/smpboot.c.

The thread_should_run function checks for any pending softirqs and if there are one or more, the code in kernel/smpboot.c calls thread_fn, which for the softirq system happens to be run_ksoftirqd.

The run_ksoftirqd function runs the registered handler function for each softirq that is pending and increments stats that are found in /proc/softirqs.

NAPI and softirq

Recall from earlier that the device driver calls __napi_schedule, which eventually calls __raise_softirq_irqoff(NET_RX_SOFTIRQ).

This __raise_softirq_irqoff function marks the NET_RX_SOFTIRQ softirq as pending and wakes up the softirq thread on the current CPU to execute the NET_RX_SOFTIRQ handler.

NET_RX_SOFTIRQ and NET_TX_SOFTIRQ softirq handlers

Early in the initialization code of the networking subsystem (net/core/dev.c:7114) we find the following code:

open_softirq(NET_TX_SOFTIRQ, net_tx_action);
open_softirq(NET_RX_SOFTIRQ, net_rx_action);

These function calls register the functions net_tx_action and net_rx_action as softirq handlers that will be run when NET_TX_SOFTIRQ and NET_RX_SOFTIRQ softirqs are pending.
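The registration and "pending" bookkeeping can be modeled in a few lines of userspace C. This is only a sketch of the idea: a table of handlers indexed by softirq number, plus a bitmask of pending softirqs. The sk_* names, the two-entry table, and the counters are assumptions made for this example and do not mirror the kernel's data structures.

```c
#include <stdint.h>

enum { SK_NET_TX = 0, SK_NET_RX = 1, SK_NR_SOFTIRQS = 2 };

typedef void (*sk_softirq_fn)(void);

sk_softirq_fn sk_vec[SK_NR_SOFTIRQS];  /* registered handlers */
uint32_t sk_pending;                   /* one bit per pending softirq */

/* Example handlers that only count how often they ran. */
int sk_tx_runs, sk_rx_runs;
void sk_tx_action(void) { sk_tx_runs++; }
void sk_rx_action(void) { sk_rx_runs++; }

/* Register a handler for a softirq number. */
void sk_open_softirq(int nr, sk_softirq_fn fn)
{
    sk_vec[nr] = fn;
}

/* Mark a softirq as pending; the handler runs later. */
void sk_raise_softirq(int nr)
{
    sk_pending |= 1u << nr;
}

/* Run every pending handler once and return how many ran. */
int sk_run_pending(void)
{
    uint32_t pending = sk_pending;
    int ran = 0;

    sk_pending = 0;                    /* clear before running handlers */
    for (int nr = 0; nr < SK_NR_SOFTIRQS; nr++) {
        if ((pending & (1u << nr)) && sk_vec[nr]) {
            sk_vec[nr]();
            ran++;
        }
    }
    return ran;
}
```

Raising a softirq is cheap (it sets a bit), which is what makes it suitable for use from a hardware interrupt handler; the expensive work happens when the pending handlers are run.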

rx packet processing begins

Once the softirq code determines that a softirq is pending and should be processed, it invokes the net_rx_action function registered for NET_RX_SOFTIRQ, and packet processing begins.

net_rx_action processing loop

net_rx_action begins the processing of packets from the memory the packets were DMA’d into by the device driver.

The function iterates through the list of NAPI structures that are queued for the current CPU, dequeuing each structure one at a time and operating on it.

The processing loop bounds the amount of work and execution time that can be consumed by poll functions. It does this by keeping track of a work “budget” (which can be adjusted) and checking the elapsed time (net/core/dev.c:4366):

/* If softirq window is exhuasted then punt.
 * Allow this to run for 2 jiffies since which will allow
 * an average latency of 1.5/HZ.
 */
if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit)))
  goto softnet_break;

NAPI also uses a “weight” to prevent individual calls to poll from consuming the entire CPU and never terminating. The weight is set on the call to netif_napi_add in the device driver initialization; recall that this was hardcoded to 64 in the driver.

The weight is passed into the poll function from the driver that is registered to NAPI. This amount dictates the maximum work poll can do before it should return. In this case, it can process up to 64 packets.

The poll function returns the number of packets processed, which may be any value less than or equal to the weight. This work done is then subtracted from the budget:

weight = n->weight;

/* This NAPI_STATE_SCHED test is for avoiding a race
 * with netpoll's poll_napi().  Only the entity which
 * obtains the lock and sees NAPI_STATE_SCHED set will
 * actually make the ->poll() call.  Therefore we avoid
 * accidentally calling ->poll() when NAPI is not scheduled.
 */
work = 0;
if (test_bit(NAPI_STATE_SCHED, &n->state)) {
        work = n->poll(n, weight);
        trace_napi_poll(n);
}

WARN_ON_ONCE(work > weight);

budget -= work;


If the work done was equal to the weight, the NAPI structure is moved to the end of the queue and will be examined again. The loop then begins anew, with the budget and time check.
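The interaction between the budget and the weight is easier to see in a toy model. The sketch below is a userspace simulation, not kernel code: each hypothetical queue has a backlog of packets, the poll function consumes at most weight packets per call, and the loop visits queues round-robin until they are all drained or the budget is spent. The bq_* names are invented for this example, the time limit is left out, and every weight is assumed to be greater than zero.

```c
#include <stddef.h>

struct bq_napi {
    int backlog;  /* packets waiting in this hypothetical rx ring */
    int weight;   /* maximum packets one poll call may process (must be > 0) */
};

/* Hypothetical poll: consume up to weight packets, return how many. */
int bq_poll(struct bq_napi *n, int weight)
{
    int work = n->backlog < weight ? n->backlog : weight;

    n->backlog -= work;
    return work;
}

/* Poll queues round-robin. Returns the total number of packets processed
 * and sets *squeezed to 1 if the budget ran out before all queues went idle. */
int bq_rx_action(struct bq_napi *q, size_t nq, int budget, int *squeezed)
{
    int total = 0;
    size_t idle = 0;  /* consecutive queues that used less than their weight */
    size_t i = 0;

    *squeezed = 0;
    while (nq > 0 && idle < nq) {
        if (budget <= 0) {        /* checked before each poll, as in the loop above */
            *squeezed = 1;
            break;
        }
        int work = bq_poll(&q[i], q[i].weight);

        budget -= work;
        total += work;
        /* A full-weight poll suggests more work, so the queue is revisited. */
        idle = (work == q[i].weight) ? 0 : idle + 1;
        i = (i + 1) % nq;
    }
    return total;
}
```

With two queues holding 200 packets each, a weight of 64, and a budget of 300, this model processes 320 packets before stopping: the budget is only checked between polls, so the last poll can overshoot it.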

Thus, you can adjust the number of packets processed during the NAPI poll loop by setting the netdev_budget sysctl:

sysctl net.core.netdev_budget=600

You can obtain detailed statistics of the softirq networking system by examining the file /proc/net/softnet_stat. It reports the number of packets processed, the number of drops, and the time squeeze counter, which tracks the number of times the budget or time limit was consumed while more work was still available.
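A short sketch of reading those counters follows. It assumes each row of the file corresponds to one CPU and that the values are printed as hexadecimal columns, with the first three being the processed, dropped, and time squeeze counters in that order. The struct and function names are invented for this example.

```c
#include <stdio.h>

struct softnet_row {
    unsigned int processed;     /* packets processed */
    unsigned int dropped;       /* packets dropped */
    unsigned int time_squeeze;  /* times the budget or time limit ran out */
};

/* Parse the first three hexadecimal columns of one softnet_stat row.
 * Returns 0 on success, -1 if the line does not match. */
int parse_softnet_row(const char *line, struct softnet_row *out)
{
    if (sscanf(line, "%x %x %x",
               &out->processed, &out->dropped, &out->time_squeeze) != 3)
        return -1;
    return 0;
}
```

A steadily growing time squeeze value is a hint that raising netdev_budget may be worth testing.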

NAPI poll == e1000e_poll

It is up to the device driver to clear packets that were DMA’d into the rx ring buffer. This is accomplished by the poll method called in the code sample above, which in the case of e1000e, is actually a function pointer to the function e1000e_poll (drivers/net/ethernet/intel/e1000e/netdev.c:2638).

The e1000e driver’s e1000e_poll function calls a function via a function pointer named clean_rx. It is provided the weight (which was hardcoded to 64 during driver initialization), and a location to write the amount of work done (drivers/net/ethernet/intel/e1000e/netdev.c:2638):

adapter->clean_rx(adapter->rx_ring, &work_done, weight);

This function pointer is set in e1000_open when the driver is initialized and the device is brought up. It is set to an appropriate function based on the MTU.

For our purposes, this is the e1000_clean_rx_irq function.

e1000_clean_rx_irq – unmap DMA regions and pass data up the stack

The e1000_clean_rx_irq function runs in a loop, and breaks out when the work done reaches the weight passed into the function.

The function unmaps memory regions that the device has DMA’d data to. Those memory regions are unmapped so they cannot be written to by the device.

Stat counters are incremented for total_rx_bytes and total_rx_packets and some additional memory regions for DMA are added back to the rx ring.

Finally, e1000_receive_skb is called to hand the skb up the network stack.
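The overall shape of such a cleaning loop can be sketched in userspace C. This is a simplified model and not the driver's code: the ring is a small fixed array, a boolean done flag stands in for the status the hardware sets once it has written a packet, and the DMA unmapping, buffer refill, and skb handoff are reduced to comments. The rg_* names and the ring size are assumptions for this example.

```c
#include <stdbool.h>

#define RG_RING_SIZE 8

struct rg_desc {
    bool done;            /* set by the (simulated) device when a packet is ready */
    unsigned int length;  /* packet length in bytes */
};

struct rg_ring {
    struct rg_desc desc[RG_RING_SIZE];
    unsigned int next_to_clean;
    unsigned long total_rx_bytes;
    unsigned long total_rx_packets;
};

/* Process completed descriptors until none are left or the weight is reached.
 * Returns true if at least one descriptor was cleaned. */
bool rg_clean_rx(struct rg_ring *ring, int *work_done, int work_to_do)
{
    struct rg_desc *d = &ring->desc[ring->next_to_clean];
    bool cleaned = false;

    while (d->done) {
        if (*work_done >= work_to_do)
            break;                       /* weight reached, stop for now */
        (*work_done)++;

        /* A real driver would unmap the DMA region and pass the skb up here. */
        ring->total_rx_bytes += d->length;
        ring->total_rx_packets++;

        /* Hand the descriptor back so the device can reuse it. */
        d->done = false;
        d->length = 0;

        ring->next_to_clean = (ring->next_to_clean + 1) % RG_RING_SIZE;
        d = &ring->desc[ring->next_to_clean];
        cleaned = true;
    }
    return cleaned;
}
```

If the loop stops because the weight was reached, the remaining descriptors stay marked as done and are picked up by the next poll.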

e1000_receive_skb – pass data up the stack

The function e1000_receive_skb starts a chain of function calls that deal with bookkeeping for things like hardware accelerated vlan tagging and generic receive offloading.

The chain of function calls is:

  • e1000_receive_skb calls napi_gro_receive
  • napi_gro_receive calls napi_skb_finish
  • napi_skb_finish calls netif_receive_skb

And from netif_receive_skb the heavy lifting starts. Before we can examine what happens here, we first need to describe Receive Packet Steering.