Linux networking stack from the ground up, part 4

Posted on Jan 30, 2016 by PIA Research

part 1 | part 2 | part 3 | part 4 | part 5

Overview

This post will pick up where part 3 left off beginning by describing Receive Packet Steering (RPS), what it is and how to configure it, followed by an examination of the network stack describing how packets are dealt with based on RPS settings, the packet backlog queue, the start of the IP protocol layer, and netfilter.

Receive Packet Steering

We saw that device drivers register NAPI poll instances. Each NAPI poll instance is executed in the context of a softirq, of which there is one per CPU (serviced by that CPU's ksoftirqd kernel thread). The softirq for the CPU on which the hardware interrupt handler runs is woken up / scheduled to run from the hardware interrupt handler.

Thus, a single CPU processes the hardware interrupt and polls from the networking layer to process the incoming data.

Some NICs support multiple queues at the hardware level. This means that incoming packets can be DMA'd to separate receive rings, with each receive ring having its own hardware interrupt delivered to indicate that data is available. Each of these hardware interrupts schedules a NAPI poll instance to run on its associated CPU.

This allows multiple CPUs to process hardware interrupts and poll from the networking layer.
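You can check whether your NIC exposes multiple receive queues (and which CPUs are servicing their interrupts) by inspecting /proc/interrupts. The interface name eth0 below is just an example, and the per-queue IRQ names vary from driver to driver:

# Show per-CPU interrupt counts for each of eth0's IRQs.
grep -E 'CPU|eth0' /proc/interrupts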

Receive Packet Steering (RPS) is a software implementation of what hardware multi-queue NICs provide. It allows multiple CPUs to process incoming packets even if the NIC only supports a single receive queue in hardware.

RPS works by generating a hash for incoming data to determine which CPU should process it. The data is then enqueued to that CPU's receive network backlog to be processed. An inter-processor interrupt (IPI) is delivered to the CPU owning the backlog, which kick-starts backlog processing on the remote CPU if it is not already processing packets.

netif_receive_skb will either continue sending network data up the networking stack, or hand it over to RPS for processing on a different CPU.

configure RPS

For RPS to work, it must be enabled in the kernel configuration (it is on Ubuntu for Linux kernel 3.13.0), and a bitmask must be set describing which CPUs should process packets for a given interface and receive queue.

The bit masks to modify are found in /sys/class/net/DEVICE_NAME/queues/QUEUE/rps_cpus.

So, for eth0 and receive queue 0, you would write a hexadecimal number to /sys/class/net/eth0/queues/rx-0/rps_cpus indicating which CPUs should process packets from eth0's receive queue 0.
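For example, to confirm that RPS support is compiled in and then allow CPUs 0-3 to process packets from eth0's receive queue 0, you could write the bitmask f (binary 1111). The config file path and the CPU mask here are illustrative; adjust them for your kernel and hardware:

# Confirm CONFIG_RPS=y (path shown is typical for Ubuntu kernels).
grep CONFIG_RPS= /boot/config-$(uname -r)

# Let CPUs 0-3 process packets arriving on eth0's receive queue 0.
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus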

Back to netif_receive_skb.

netif_receive_skb

As a reminder, the netif_receive_skb function is called from napi_skb_finish, in softirq context, from the NAPI poll function registered by the device driver.

netif_receive_skb will either attempt to use RPS (as described above) or continue sending the data up the network stack.

Let’s first examine the second path: sending the data up the stack if RPS is disabled.

netif_receive_skb without RPS

netif_receive_skb calls __netif_receive_skb, which does some bookkeeping prior to calling __netif_receive_skb_core to move the data up the network stack toward the protocol layers.

__netif_receive_skb_core

This function passes the skb up to the protocol layer in this piece of code (net/core/dev.c:3628):

type = skb->protocol;                                                             
list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {               
  if (ptype->type == type &&                                                
     (ptype->dev == null_or_dev || ptype->dev == skb->dev || ptype->dev == orig_dev)) {                                           
    if (pt_prev)
      ret = deliver_skb(skb, pt_prev, orig_dev);
    pt_prev = ptype;
  }                                                                         
}

We will examine precisely how this code delivers data to the protocol layer below, but first, let’s see what happens when RPS is enabled.

netif_receive_skb with RPS

If RPS is enabled, netif_receive_skb will compute which CPU's backlog the data should be queued to. It does this by using the function get_rps_cpu (defined at net/core/dev.c:2980):

int cpu = get_rps_cpu(skb->dev, skb, &rflow);                             

if (cpu >= 0) {
  ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);           
  rcu_read_unlock();                                                
  return ret;                                                       
}

enqueue_to_backlog

This function begins by getting a pointer to the remote CPU's softnet_data structure, which contains (among other things) the backlog NAPI structure and the input packet queue.

Next, the queue length of the input_pkt_queue of the remote CPU is checked:

qlen = skb_queue_len(&sd->input_pkt_queue);
if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {

The queue length is first compared to netdev_max_backlog. If the queue length is larger than this limit, the data is dropped and the drop is counted against the remote CPU.
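You can watch for these drops in /proc/net/softnet_stat, which has one row of hexadecimal counters per CPU; the second column is the drop count:

# One row per CPU; the second hex column counts packets dropped
# because the backlog queue was full.
cat /proc/net/softnet_stat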

You can prevent drops by increasing the netdev_max_backlog:

sysctl -w net.core.netdev_max_backlog=3000

If the queue length isn’t too large, the code next checks if the flow limit has been reached. By default, flow limits are disabled. In order to enable flow limits, you must specify a bitmap (similar to RPS’ bitmap) in /proc/sys/net/core/flow_limit_cpu_bitmap.

Once you enable flow limits per CPU, you can also adjust the size of the flow limit hash table by modifying the sysctl net.core.flow_limit_table_len.

You can read more about flow limits in the Documentation/networking/scaling.txt file.
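As a rough sketch (the CPU bitmask and table size below are arbitrary example values), enabling flow limits for CPUs 0-3 and enlarging the hash table might look like this:

# Enable flow limits on CPUs 0-3 (bitmask f = binary 1111).
echo f > /proc/sys/net/core/flow_limit_cpu_bitmap

# Optionally enlarge the flow limit hash table.
sysctl -w net.core.flow_limit_table_len=8192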

Assuming that the flow limit has not been reached, enqueue_to_backlog then checks if the backlog queue has data queued to it already.

If so, the data is queued:

if (skb_queue_len(&sd->input_pkt_queue)) {
enqueue:
  __skb_queue_tail(&sd->input_pkt_queue, skb);
  input_queue_tail_incr_save(sd, qtail);
  rps_unlock(sd);
  local_irq_restore(flags);
  return NET_RX_SUCCESS;
}

If the queue is empty, first the NAPI poller for the backlog queue is started:

/* Schedule NAPI for backlog device                                       
 * We can use non atomic operation since we own the queue lock            
 */                                                                       
if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state)) {          
  if (!rps_ipi_queued(sd))                                          
    ____napi_schedule(sd, &sd->backlog);                      
}                                                                         
goto enqueue;

The goto at the bottom brings execution back up to the block of code above, queuing the data to the backlog.

backlog queue NAPI poller

The per-CPU backlog queue plugs into NAPI the same way a device driver does. A poll function is provided that is used to process packets from the softirq context.

This NAPI struct is provided during initialization of the networking system. From net_dev_init in net/core/dev.c:6952:

sd->backlog.poll = process_backlog;
sd->backlog.weight = weight_p;
sd->backlog.gro_list = NULL;
sd->backlog.gro_count = 0;

The backlog NAPI structure differs from the device driver NAPI structure in that the weight parameter is adjustable. The drivers hardcode their values (most hardcode to 64, as seen in e1000e).

To adjust the backlog’s NAPI poller weight, modify /proc/sys/net/core/dev_weight.
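For example, to double the backlog poller's weight from the default of 64 (the value 128 is just an illustration):

sysctl -w net.core.dev_weight=128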

The poll function for the backlog is called process_backlog and, similar to e1000e’s function e1000e_poll, is called from the softirq context.

process_backlog

The process_backlog function (net/core/dev.c:4097) is a loop which runs until its weight (specified in /proc/sys/net/core/dev_weight) has been consumed or no more data remains on the backlog.

Each piece of data is removed from the backlog queue and passed on to __netif_receive_skb. As explained earlier in the no-RPS case, data passed to this function eventually reaches the protocol layers after some bookkeeping.

Similarly to device driver NAPI implementations, the process_backlog code disables its poller if the total weight will not be used. The poller is restarted with the call to ____napi_schedule from enqueue_to_backlog as described above.

The function returns the amount of work done, which net_rx_action (described in a previous part of this series) will subtract from the budget (adjustable via the net.core.netdev_budget sysctl).
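For example, to raise the overall budget shared by all pollers during a single net_rx_action run (600 is just an illustrative value):

sysctl -w net.core.netdev_budget=600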

__netif_receive_skb_core delivers data to protocol layers

__netif_receive_skb_core delivers data to the protocol layers by obtaining the protocol field from the skb and iterating across the list of deliver functions registered for that protocol type.

This happens in this piece of code (as seen above):

type = skb->protocol;                                                             
list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {               
  if (ptype->type == type &&
       (ptype->dev == null_or_dev || ptype->dev == skb->dev || 
        ptype->dev == orig_dev)) {
    if (pt_prev)
      ret = deliver_skb(skb, pt_prev, orig_dev);
    pt_prev = ptype;
  }                                                                         
}

The ptype_base identifier is defined at net/core/dev.c:146 as a hash table of lists:

struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;

Each protocol layer adds a struct packet_type to a list at a given slot in the hash table.

The slot in the hash table is computed by ptype_head:

static inline struct list_head *ptype_head(const struct packet_type *pt)
{
  if (pt->type == htons(ETH_P_ALL))
    return &ptype_all;
  else
    return &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
}

The protocol layers call dev_add_pack to add themselves to the list.

IP protocol layer

The IP protocol layer plugs itself into the ptype_base hash table so that data will be delivered to it from the lower layers.

This happens in the function inet_init from net/ipv4/af_inet.c:1815

dev_add_pack(&ip_packet_type);

This registers the IP packet type structure defined as:

static struct packet_type ip_packet_type __read_mostly = {                                
  .type = cpu_to_be16(ETH_P_IP),                                                    
  .func = ip_rcv,
};

__netif_receive_skb_core calls deliver_skb (as seen in the above section). This function is defined at net/core/dev.c:1712:

static inline int deliver_skb(struct sk_buff *skb,                                        
                              struct packet_type *pt_prev,                                
                              struct net_device *orig_dev)                                
{                                                                                         
  if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
    return -ENOMEM;
  atomic_inc(&skb->users);
  return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);                           
}

In the case of the IP protocol, the ip_rcv function is called.

ip_rcv

The ip_rcv function is pretty straightforward at a high level. There are several integrity checks to ensure the data is valid, and statistics counters are bumped as well.

ip_rcv ends by passing the packet to ip_rcv_finish by way of netfilter. This is done so that any iptables rules that should be matched at the IP protocol layer can take a look at the packet before it continues on (net/ipv4/ip_input.c:453):

return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish);

netfilter

The NF_HOOK call above is a thin wrapper that invokes NF_HOOK_THRESH with the lowest possible threshold. NF_HOOK_THRESH itself is simple enough: it calls down to nf_hook_thresh and, on success, calls okfn, which in our case is ip_rcv_finish (include/linux/netfilter.h:175):

static inline int
NF_HOOK_THRESH(uint8_t pf, unsigned int hook, struct sk_buff *skb,                        
               struct net_device *in, struct net_device *out,                             
               int (*okfn)(struct sk_buff *), int thresh)                                 
{       
        int ret = nf_hook_thresh(pf, hook, skb, in, out, okfn, thresh);                   
        if (ret == 1) 
                ret = okfn(skb);                                                          
        return ret;                                                                       
}

The nf_hook_thresh function continues the descent toward iptables. It begins by determining if there are any netfilter hooks registered for the netfilter protocol family and netfilter chain passed in.

In our example above, the protocol family is NFPROTO_IPV4 and chain type is NF_INET_PRE_ROUTING:

/**
 *      nf_hook_thresh - call a netfilter hook
 *      
 *      Returns 1 if the hook has allowed the packet to pass.  The function
 *      okfn must be invoked by the caller in this case.  Any other return
 *      value indicates the packet has been consumed by the hook.
 */
static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
                                 struct sk_buff *skb,
                                 struct net_device *indev,
                                 struct net_device *outdev,
                                 int (*okfn)(struct sk_buff *), int thresh)
{
  if (nf_hooks_active(pf, hook))
    return nf_hook_slow(pf, hook, skb, indev, outdev, okfn, thresh);
  return 1;
}

This function calls nf_hooks_active, which checks whether any hooks are registered in the nf_hooks table for the given protocol family and hook type (include/linux/netfilter.h:114):

static inline bool nf_hooks_active(u_int8_t pf, unsigned int hook)         
{                                                                          
  return !list_empty(&nf_hooks[pf][hook]);                           
}

And if there is a hook present, nf_hook_slow is called to go deeper into iptables.

nf_hook_slow

nf_hook_slow iterates through the list of hooks in the nf_hooks table for the protocol type and chain type by calling nf_iterate for each entry in the hook list.

nf_iterate in turn calls the hook function associated with an entry on the hook list and returns a “verdict” about the packet.

iptables … tables

iptables registers hook functions for each of the packet matching tables: filter, nat, mangle, raw, and security.

In our example, we’re interested in NF_INET_PRE_ROUTING chains which are found in the nat table.
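You can see the rules attached to that chain from userspace as well; for example, assuming the iptables command is installed:

# List the rules in the PREROUTING chain of the nat table.
iptables -t nat -L PREROUTING -n -v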

Sure enough, the struct with the hook function pointer which is registered with netfilter is found in net/ipv4/netfilter/iptable_nat.c:251:

static struct nf_hook_ops nf_nat_ipv4_ops[] __read_mostly = {
  /* Before packet filtering, change destination */
  {
    .hook           = nf_nat_ipv4_in,
    .owner          = THIS_MODULE,
    .pf             = NFPROTO_IPV4,
    .hooknum        = NF_INET_PRE_ROUTING,
    .priority       = NF_IP_PRI_NAT_DST,
  },

which is registered in iptable_nat_init (net/ipv4/netfilter/iptable_nat.c:316):

err = nf_register_hooks(nf_nat_ipv4_ops, ARRAY_SIZE(nf_nat_ipv4_ops));
if (err < 0)
  goto err2;

In our example above from the IP protocol layer, packets will be passed down to nf_nat_ipv4_in to descend further into iptables, via the nf_hook_slow function described in the previous section.

nf_nat_ipv4_in

nf_nat_ipv4_in passes the packet on to nf_nat_ipv4_fn which starts by obtaining the conntrack information for the packet:

struct nf_conn *ct;
enum ip_conntrack_info ctinfo;

/* slightly abbreviated code sample */

ct = nf_ct_get(skb, &ctinfo);

If the packet being examined is a packet for a new connection, the function nf_nat_rule_find is called (net/ipv4/netfilter/iptable_nat.c:117):

case IP_CT_NEW:
  /* Seen it before?  This can happen for loopback, retrans,
   * or local packets.
   */
  if (!nf_nat_initialized(ct, maniptype)) {
    unsigned int ret;

    ret = nf_nat_rule_find(skb, ops->hooknum, in, out, ct);
    if (ret != NF_ACCEPT)
      return ret;

And, finally, nf_nat_rule_find calls ipt_do_table which enters the iptables subsystem. This is as far as we will go into the netfilter and iptables systems, as they are complex enough to warrant their own multi-page documents.

The return value from the ipt_do_table function will either:

  • not be NF_ACCEPT, in which case it is returned immediately, OR
  • be NF_ACCEPT, in which case nf_nat_ipv4_fn calls nf_nat_packet to do the packet manipulation and returns either NF_ACCEPT or NF_DROP.

Unwinding the return value

In either case, the final value of nf_nat_ipv4_fn is returned backward through all the functions described above until it reaches NF_HOOK_THRESH:

  1. nf_nat_ipv4_fn’s return value is returned back to nf_nat_ipv4_in
  2. which returns back to nf_iterate
  3. which returns back to nf_hook_slow
  4. which returns back to nf_hook_thresh
  5. which returns back to NF_HOOK_THRESH

NF_HOOK_THRESH checks the return value and if it is NF_ACCEPT (1), it calls the function pointed to by okfn.

In our example, the okfn is ip_rcv_finish which will do some processing and pass the packet up to the next protocol layer.
