Linux networking stack from the ground up, part 4
part 1 | part 2 | part 3 | part 4 | part 5
Overview
This post picks up where part 3 left off, beginning with a description of Receive Packet Steering (RPS): what it is and how to configure it. It then continues the walk through the network stack, describing how packets are handled depending on the RPS settings, the packet backlog queue, the start of the IP protocol layer, and netfilter.
Receive Packet Steering
We saw that device drivers register NAPI poll instances. Each NAPI poller instance is executed in the context of a kernel thread called a softirq, of which there is one per CPU. The kernel thread for the CPU on which the hardware interrupt handler runs is woken up / scheduled to run from within that hardware interrupt handler.
Thus, a single CPU processes the hardware interrupt and polls from the networking layer to process the incoming data.
Some NICs support multiple queues at the hardware level. This means that incoming packets can be DMA’d to separate receive rings, with each receive ring having its own hardware interrupt delivered to indicate that data is available. Each of these hardware interrupts would schedule NAPI poll instances to run on the associated CPUs.
This allows multiple CPUs to process hardware interrupts and poll from the networking layer.
Receive Packet Steering (RPS) is a software implementation of what hardware-enabled multi-queue NICs provide: it allows multiple CPUs to process incoming packets even if the NIC only supports a single receive queue in hardware.
RPS works by generating a hash for incoming data to determine which CPU should process it. The data is then enqueued to that CPU’s network receive backlog to be processed, and an inter-processor interrupt (IPI) is delivered to the CPU owning the backlog. This helps to kick-start backlog processing on the remote CPU if it is not currently processing packets.
`netif_receive_skb` will either continue sending network data up the networking stack, or hand it over to RPS for processing on a different CPU.
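Before moving on to configuration, here is a rough standalone model of the steering decision just described: hash the flow and use the hash to pick a CPU from the set configured for the receive queue. The struct and function names below are invented for illustration; the real logic lives in `get_rps_cpu` in net/core/dev.c and uses a per-queue `struct rps_map` plus flow tables, not a simple modulo.

```c
#include <stdio.h>

/* Illustrative model of RPS CPU selection; NOT kernel code.
 * A real flow hash comes from the NIC or from hashing the packet
 * headers (see skb_get_hash / get_rps_cpu in net/core/dev.c).
 */
struct rps_map_model {
	unsigned int len;   /* number of CPUs enabled in rps_cpus */
	int cpus[8];        /* the CPU ids those mask bits correspond to */
};

static int pick_cpu(const struct rps_map_model *map, unsigned int flow_hash)
{
	if (map->len == 0)
		return -1;  /* RPS disabled for this queue: stay on this CPU */

	/* The kernel scales the hash over the map; modulo is the same idea. */
	return map->cpus[flow_hash % map->len];
}

int main(void)
{
	struct rps_map_model map = { .len = 4, .cpus = { 0, 1, 2, 3 } };
	unsigned int sample_hashes[] = { 0x1d4f3a2bu, 0x00c0ffeeu, 0xdeadbeefu };

	for (unsigned int i = 0; i < 3; i++)
		printf("hash 0x%08x -> CPU %d\n",
		       sample_hashes[i], pick_cpu(&map, sample_hashes[i]));
	return 0;
}
```

Packets belonging to the same flow hash to the same value, so they are always steered to the same CPU, which keeps per-flow processing cache-warm and in order.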
configure RPS
For RPS to work, it must be enabled in the kernel configuration (it is on Ubuntu for Linux kernel 3.13.0), and a bit mask must be set describing which CPUs should process packets for a given interface and receive queue.
The bit masks to modify are found in /sys/class/net/DEVICE_NAME/queues/QUEUE/rps_cpus.

So, for eth0 and receive queue 0, you would modify /sys/class/net/eth0/queues/rx-0/rps_cpus with a hexadecimal number indicating which CPUs should process packets from eth0’s receive queue 0.
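As an illustration only (the device eth0 and queue rx-0 are assumptions; adjust the path for your system), the following userspace snippet writes a mask of `f` (binary 1111), allowing CPUs 0–3 to process packets from that queue. It is equivalent to echoing the value into the file as root:

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative only: write an RPS CPU mask for one receive queue.
 * Assumes the device is eth0 and the queue is rx-0; run as root.
 */
int main(void)
{
	const char *path = "/sys/class/net/eth0/queues/rx-0/rps_cpus";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror("fopen rps_cpus");
		return EXIT_FAILURE;
	}

	/* "f" = binary 1111: CPUs 0-3 may process packets steered
	 * from this receive queue.
	 */
	if (fprintf(f, "f\n") < 0 || fclose(f) != 0) {
		perror("write rps_cpus");
		return EXIT_FAILURE;
	}
	return EXIT_SUCCESS;
}
```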
Back to `netif_receive_skb`.
netif_receive_skb
As a reminder, the `netif_receive_skb` function is called from `napi_skb_finish` in the softirq context, from the NAPI poller registered by the device driver.
`netif_receive_skb` will either attempt to use RPS (as described above) or continue sending the data up the network stack.
Let’s first examine the second path: sending the data up the stack if RPS is disabled.
netif_receive_skb without RPS
`netif_receive_skb` calls `__netif_receive_skb`, which does some bookkeeping prior to calling `__netif_receive_skb_core` to move the data along up the network stack toward the protocol layers.
__netif_receive_skb_core
This function passes the skb up to the protocol layer in this piece of code (net/core/dev.c:3628):
```c
type = skb->protocol;
list_for_each_entry_rcu(ptype,
		&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
	if (ptype->type == type &&
	    (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
	     ptype->dev == orig_dev)) {
		if (pt_prev)
			ret = deliver_skb(skb, pt_prev, orig_dev);
		pt_prev = ptype;
	}
}
```
We will examine precisely how this code delivers data to the protocol layer below, but first, let’s see what happens when RPS is enabled.
netif_receive_skb with RPS
If RPS is enabled, `netif_receive_skb` will compute which CPU’s backlog the data should be queued to. It does this by using the function `get_rps_cpu` (defined at net/core/dev.c:2980):
```c
int cpu = get_rps_cpu(skb->dev, skb, &rflow);

if (cpu >= 0) {
	ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
	rcu_read_unlock();
	return ret;
}
```
enqueue_to_backlog
This function begins by getting a pointer to the remote CPU’s `softnet_data` structure, which contains a pointer to a NAPI poller.
Next, the queue length of the `input_pkt_queue` of the remote CPU is checked:
```c
qlen = skb_queue_len(&sd->input_pkt_queue);
if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
	if (skb_queue_len(&sd->input_pkt_queue)) {
```
It is first compared to `netdev_max_backlog`. If the queue length is larger than `netdev_max_backlog`, the data is dropped and the drop is counted against the remote CPU.
You can prevent drops by increasing `netdev_max_backlog`:

```
sysctl -w net.core.netdev_max_backlog=3000
```
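Drops of this kind show up in /proc/net/softnet_stat, which has one row of hexadecimal counters per CPU; on 3.13-era kernels the first column is packets processed and the second is the per-CPU drop counter bumped by `enqueue_to_backlog`. A small, hedged reader (the output format is an assumption tied to that column layout):

```c
#include <stdio.h>

/* Print per-CPU processed/dropped counts from /proc/net/softnet_stat.
 * One line per CPU; columns are hex. First column: packets processed,
 * second column: packets dropped (3.13-era layout).
 */
int main(void)
{
	FILE *f = fopen("/proc/net/softnet_stat", "r");
	char line[256];
	unsigned int processed, dropped;
	int cpu = 0;

	if (!f) {
		perror("fopen /proc/net/softnet_stat");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%x %x", &processed, &dropped) == 2)
			printf("cpu %d: processed=%u dropped=%u\n",
			       cpu, processed, dropped);
		cpu++;
	}
	fclose(f);
	return 0;
}
```

If the dropped column keeps climbing on the CPUs you steer packets to, that is a hint that `netdev_max_backlog` (or the flow limits described next) is being hit.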
If the queue length isn’t too large, the code next checks if the flow limit has been reached. By default, flow limits are disabled. In order to enable flow limits, you must specify a bitmap (similar to RPS’s bitmap) in /proc/sys/net/core/flow_limit_cpu_bitmap.
Once you enable flow limits per CPU, you can also adjust the size of the flow limit hash table by modifying the sysctl `net.core.flow_limit_table_len`.
You can read more about flow limits in the Documentation/networking/scaling.txt file.
Assuming that the flow limit has not been reached, `enqueue_to_backlog` then checks if the backlog queue already has data queued to it.
If so, the data is queued:
```c
if (skb_queue_len(&sd->input_pkt_queue)) {
enqueue:
	__skb_queue_tail(&sd->input_pkt_queue, skb);
	input_queue_tail_incr_save(sd, qtail);
	rps_unlock(sd);
	local_irq_restore(flags);
	return NET_RX_SUCCESS;
}
```
If the queue is empty, first the NAPI poller for the backlog queue is started:
```c
/* Schedule NAPI for backlog device
 * We can use non atomic operation since we own the queue lock
 */
if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state)) {
	if (!rps_ipi_queued(sd))
		____napi_schedule(sd, &sd->backlog);
}
goto enqueue;
```
The `goto` at the bottom brings execution back up to the block of code above, queuing the data to the backlog.
backlog queue NAPI poller
The per-CPU backlog queue plugs into NAPI the same way a device driver does. A poll function is provided that is used to process packets from the softirq context.
This NAPI struct is provided during initialization of the networking system. From `net_dev_init` in net/core/dev.c:6952:
```c
sd->backlog.poll = process_backlog;
sd->backlog.weight = weight_p;
sd->backlog.gro_list = NULL;
sd->backlog.gro_count = 0;
```
The backlog NAPI structure differs from the device driver NAPI structure in that the `weight` parameter is adjustable. The drivers hardcode their values (most hardcode to 64, as seen in e1000e).
To adjust the backlog’s NAPI poller weight, modify /proc/sys/net/core/dev_weight (the default is 64).
The poll function for the backlog is called `process_backlog` and, similar to e1000e’s `e1000e_poll` function, is called from the softirq context.
process_backlog
The `process_backlog` function (net/core/dev.c:4097) is a loop which runs until its weight (specified in `/proc/sys/net/core/dev_weight`) has been consumed or no more data remains on the backlog.
Each piece of data on the backlog queue is removed from the backlog queue and passed on to `__netif_receive_skb`. As explained earlier in the no-RPS case, data passed to this function eventually reaches the protocol layers after some bookkeeping.
Similarly to device driver NAPI implementations, the `process_backlog` code disables its poller if the total weight will not be used. The poller is restarted by the call to `____napi_schedule` from `enqueue_to_backlog`, as described above.
The function returns the amount of work done, which `net_rx_action` will subtract from its budget (the budget is adjusted with the `net.core.netdev_budget` sysctl, as described previously).
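The weight/budget interplay is easier to see in a toy, standalone model (purely illustrative, not the kernel's code): dequeue packets until the queue is empty or the weight is used up, and report how much work was done so the caller can charge it against its overall budget.

```c
#include <stdio.h>

/* Toy model of a weighted poll loop; illustrative only.
 * "Packets" are just integers sitting in a small ring buffer.
 */
struct backlog_model {
	int pkts[64];
	int head, tail;
};

static int backlog_empty(const struct backlog_model *b)
{
	return b->head == b->tail;
}

static int backlog_pop(struct backlog_model *b)
{
	int pkt = b->pkts[b->head];
	b->head = (b->head + 1) % 64;
	return pkt;
}

/* Process at most `weight` packets and return the work actually done,
 * which the caller (net_rx_action in the real kernel) subtracts from
 * its overall budget.
 */
static int poll_model(struct backlog_model *b, int weight)
{
	int work = 0;

	while (work < weight && !backlog_empty(b)) {
		int pkt = backlog_pop(b);
		printf("processing packet %d\n", pkt); /* __netif_receive_skb */
		work++;
	}
	return work;
}

int main(void)
{
	struct backlog_model b = { .head = 0, .tail = 0 };

	for (int i = 0; i < 10; i++)
		b.pkts[b.tail++] = i;

	printf("work done: %d\n", poll_model(&b, 64));
	return 0;
}
```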
__netif_receive_skb_core delivers data to protocol layers
`__netif_receive_skb_core` delivers data to the protocol layers. It does this by obtaining the protocol field from the `skb` and iterating across a list of `deliver` functions registered for that protocol type.
This happens in this piece of code (as seen above):
```c
type = skb->protocol;
list_for_each_entry_rcu(ptype,
		&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
	if (ptype->type == type &&
	    (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
	     ptype->dev == orig_dev)) {
		if (pt_prev)
			ret = deliver_skb(skb, pt_prev, orig_dev);
		pt_prev = ptype;
	}
}
```
The `ptype_base` identifier is defined at net/core/dev.c:146 as a hash table of lists:
```c
struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
```
Each protocol layer adds a `struct packet_type` to a list at a given slot in the hash table.
The slot in the hash table is computed by `ptype_head`:
```c
static inline struct list_head *ptype_head(const struct packet_type *pt)
{
	if (pt->type == htons(ETH_P_ALL))
		return &ptype_all;
	else
		return &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
}
```
The protocol layers call `dev_add_pack` to add themselves to the list.
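To make the registration side concrete, here is a hedged sketch of a minimal kernel module that adds its own `struct packet_type` with `dev_add_pack`, listening on all protocols via `ETH_P_ALL`. The module and handler names are invented for illustration; it is not code from the kernel tree.

```c
#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/if_ether.h>

/* Illustrative handler: log the packet and release our reference.
 * deliver_skb bumps skb->users before calling us (see below), so the
 * handler is expected to consume the skb.
 */
static int example_rcv(struct sk_buff *skb, struct net_device *dev,
		       struct packet_type *pt, struct net_device *orig_dev)
{
	pr_info("example_rcv: %u byte packet on %s\n", skb->len, dev->name);
	kfree_skb(skb);
	return NET_RX_SUCCESS;
}

/* ETH_P_ALL lands on the ptype_all list (see ptype_head above); a
 * specific protocol, such as ETH_P_IP registered by the IP layer in
 * the next section, hashes into ptype_base instead.
 */
static struct packet_type example_packet_type __read_mostly = {
	.type = cpu_to_be16(ETH_P_ALL),
	.func = example_rcv,
};

static int __init example_init(void)
{
	dev_add_pack(&example_packet_type);
	return 0;
}

static void __exit example_exit(void)
{
	dev_remove_pack(&example_packet_type);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");
```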
IP protocol layer
The IP protocol layer plugs itself into the `ptype_base` hash table so that data will be delivered to it from the lower layers.
This happens in the function `inet_init` from net/ipv4/af_inet.c:1815:
```c
dev_add_pack(&ip_packet_type);
```
This registers the IP packet type structure defined as:
```c
static struct packet_type ip_packet_type __read_mostly = {
	.type = cpu_to_be16(ETH_P_IP),
	.func = ip_rcv,
};
```
`__netif_receive_skb_core` calls `deliver_skb` (as seen in the above section). This function is defined at net/core/dev.c:1712:
```c
static inline int deliver_skb(struct sk_buff *skb,
			      struct packet_type *pt_prev,
			      struct net_device *orig_dev)
{
	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
		return -ENOMEM;
	atomic_inc(&skb->users);
	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}
```
In the case of the IP protocol, the `ip_rcv` function is called.
ip_rcv
The `ip_rcv` function is pretty straightforward at a high level. There are several integrity checks to ensure the data is valid, and several statistics counters are bumped as well.
`ip_rcv` ends by passing the packet to `ip_rcv_finish` by way of netfilter. This is done so that any iptables rules that should be matched at the IP protocol layer can take a look at the packet before it continues on (net/ipv4/ip_input.c:453):
```c
return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,
	       ip_rcv_finish);
```
netfilter
The `NF_HOOK_THRESH` function is simple enough. It calls down to `nf_hook_thresh` and, on success, calls `okfn`, which in our case is `ip_rcv_finish` (include/linux/netfilter.h:175):
```c
static inline int NF_HOOK_THRESH(uint8_t pf, unsigned int hook,
				 struct sk_buff *skb,
				 struct net_device *in,
				 struct net_device *out,
				 int (*okfn)(struct sk_buff *), int thresh)
{
	int ret = nf_hook_thresh(pf, hook, skb, in, out, okfn, thresh);
	if (ret == 1)
		ret = okfn(skb);
	return ret;
}
```
The `nf_hook_thresh` function continues down toward iptables. It begins by determining whether there are any netfilter hooks registered for the netfilter protocol family and netfilter chain passed in.
In our example above, the protocol family is `NFPROTO_IPV4` and the chain type is `NF_INET_PRE_ROUTING`:
```c
/**
 *	nf_hook_thresh - call a netfilter hook
 *
 *	Returns 1 if the hook has allowed the packet to pass.  The function
 *	okfn must be invoked by the caller in this case.  Any other return
 *	value indicates the packet has been consumed by the hook.
 */
static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
				 struct sk_buff *skb,
				 struct net_device *indev,
				 struct net_device *outdev,
				 int (*okfn)(struct sk_buff *), int thresh)
{
	if (nf_hooks_active(pf, hook))
		return nf_hook_slow(pf, hook, skb, indev, outdev, okfn, thresh);
	return 1;
}
```
This function calls the `nf_hooks_active` function, which examines a table called `nf_hooks_needed` (include/linux/netfilter.h:114):
```c
static inline bool nf_hooks_active(u_int8_t pf, unsigned int hook)
{
	return !list_empty(&nf_hooks[pf][hook]);
}
```
And if there is a hook present, `nf_hook_slow` is called to go deeper into iptables.
nf_hook_slow
`nf_hook_slow` iterates through the list of hooks in the `nf_hooks` table for the protocol type and chain type by calling `nf_iterate` for each entry in the hook list.
`nf_iterate` in turn calls the hook function associated with an entry on the hook list and returns a “verdict” about the packet.
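A standalone model of that iterate-and-verdict pattern may help; it is simplified to just accept/drop (the real netfilter code also handles verdicts like `NF_STOLEN`, `NF_QUEUE`, and `NF_REPEAT`), and every name in it is invented for illustration:

```c
#include <stdio.h>

/* Toy model of netfilter's iterate-until-verdict loop; illustrative only. */
enum verdict { MODEL_ACCEPT, MODEL_DROP };

struct packet_model {
	unsigned int len;
	unsigned short dst_port;
};

typedef enum verdict (*hook_fn)(struct packet_model *pkt);

static enum verdict accept_all(struct packet_model *pkt)
{
	(void)pkt;
	return MODEL_ACCEPT;
}

static enum verdict drop_port_23(struct packet_model *pkt)
{
	return pkt->dst_port == 23 ? MODEL_DROP : MODEL_ACCEPT;
}

/* Walk the hook list in order; the first hook that does not accept ends
 * processing, mirroring how a non-NF_ACCEPT verdict stops nf_hook_slow
 * before okfn (ip_rcv_finish in our example) is ever called.
 */
static enum verdict run_hooks(hook_fn *hooks, int n, struct packet_model *pkt)
{
	for (int i = 0; i < n; i++) {
		enum verdict v = hooks[i](pkt);
		if (v != MODEL_ACCEPT)
			return v;
	}
	return MODEL_ACCEPT;
}

int main(void)
{
	hook_fn chain[] = { accept_all, drop_port_23 };
	struct packet_model telnet = { .len = 60, .dst_port = 23 };
	struct packet_model http = { .len = 60, .dst_port = 80 };

	printf("telnet: %s\n",
	       run_hooks(chain, 2, &telnet) == MODEL_ACCEPT ? "accept" : "drop");
	printf("http:   %s\n",
	       run_hooks(chain, 2, &http) == MODEL_ACCEPT ? "accept" : "drop");
	return 0;
}
```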
iptables
iptables registers hook functions for each of the packet matching tables: filter, nat, mangle, raw, and security.
In our example, we’re interested in `NF_INET_PRE_ROUTING` chains, which are found in the `nat` table.
Sure enough, the struct with the hook function pointer which is registered with netfilter is found in net/ipv4/netfilter/iptable_nat.c:251:
```c
static struct nf_hook_ops nf_nat_ipv4_ops[] __read_mostly = {
	/* Before packet filtering, change destination */
	{
		.hook		= nf_nat_ipv4_in,
		.owner		= THIS_MODULE,
		.pf		= NFPROTO_IPV4,
		.hooknum	= NF_INET_PRE_ROUTING,
		.priority	= NF_IP_PRI_NAT_DST,
	},
```
which is registered in `iptable_nat_init` (net/ipv4/netfilter/iptable_nat.c:316):
```c
err = nf_register_hooks(nf_nat_ipv4_ops, ARRAY_SIZE(nf_nat_ipv4_ops));
if (err < 0)
	goto err2;
```
In our example above from the IP protocol layer, packets will be passed down to `nf_nat_ipv4_in` to descend further into iptables via the `nf_hook_slow` function described in the previous section.
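For comparison, here is a hedged sketch of how any module can attach itself to this same `NF_INET_PRE_ROUTING` point. The `example_*` names are invented, the hook simply accepts everything, and the `nf_hookfn` signature shown follows the 3.13-era kernel discussed in this series (it differs in newer kernels, where `nf_register_hook` has also been replaced by per-namespace variants):

```c
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/skbuff.h>

/* Illustrative PRE_ROUTING hook that accepts every packet.
 * Signature follows the 3.13-era nf_hookfn typedef.
 */
static unsigned int example_pre_routing(const struct nf_hook_ops *ops,
					struct sk_buff *skb,
					const struct net_device *in,
					const struct net_device *out,
					int (*okfn)(struct sk_buff *))
{
	return NF_ACCEPT;
}

static struct nf_hook_ops example_ops __read_mostly = {
	.hook		= example_pre_routing,
	.owner		= THIS_MODULE,
	.pf		= NFPROTO_IPV4,
	.hooknum	= NF_INET_PRE_ROUTING,
	.priority	= NF_IP_PRI_FIRST,	/* run before the nat hook above */
};

static int __init example_nf_init(void)
{
	return nf_register_hook(&example_ops);
}

static void __exit example_nf_exit(void)
{
	nf_unregister_hook(&example_ops);
}

module_init(example_nf_init);
module_exit(example_nf_exit);
MODULE_LICENSE("GPL");
```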
nf_nat_ipv4_in
`nf_nat_ipv4_in` passes the packet on to `nf_nat_ipv4_fn`, which starts by obtaining the conntrack information for the packet:
```c
struct nf_conn *ct;
enum ip_conntrack_info ctinfo;

/* slightly abbreviated code sample */
ct = nf_ct_get(skb, &ctinfo);
```
If the packet being examined is a packet for a new connection, the function `nf_nat_rule_find` is called (net/ipv4/netfilter/iptable_nat.c:117):
```c
case IP_CT_NEW:
	/* Seen it before?  This can happen for loopback, retrans,
	 * or local packets.
	 */
	if (!nf_nat_initialized(ct, maniptype)) {
		unsigned int ret;

		ret = nf_nat_rule_find(skb, ops->hooknum, in, out, ct);
		if (ret != NF_ACCEPT)
			return ret;
```
And, finally, `nf_nat_rule_find` calls `ipt_do_table`, which enters the iptables subsystem. This is as far as we will go into the netfilter and iptables systems, as they are complex enough to warrant their own multi-page documents.
The return value from the `ipt_do_table` function will either:

- not be `NF_ACCEPT`, in which case it is returned immediately, OR
- be `NF_ACCEPT`, causing `nf_nat_ipv4_fn` to call `nf_nat_packet` to do packet manipulation and return either `NF_ACCEPT` or `NF_DROP`.
Unwinding the return value
In either case of the return value from `ipt_do_table`, the final value of `nf_nat_ipv4_fn` is returned backward through all the functions described above until `NF_HOOK_THRESH`:

- `nf_nat_ipv4_fn`’s return value is returned back to `nf_nat_ipv4_in`
- which returns back to `nf_iterate`
- which returns back to `nf_hook_slow`
- which returns back to `nf_hook_thresh`
- which returns back to `NF_HOOK_THRESH`
`NF_HOOK_THRESH` checks the return value and, if it is `NF_ACCEPT` (1), calls the function pointed to by `okfn`.
In our example, the `okfn` is `ip_rcv_finish`, which will do some processing and pass the packet up to the next protocol layer.