Linux networking stack from the ground up, part 5
part 1 | part 2 | part 3 | part 4 | part 5
Overview
This blog post picks up right where part 4 left off. It examines the final portion of the IP protocol layer, the handoff to the UDP protocol layer, and finally how the data is queued to a socket’s receive queue so it can be read by user programs.
ip_rcv_finish
Once netfilter has had a chance to take a look at the packet and decide what to do with it, ip_rcv_finish is called.

ip_rcv_finish begins with an optimization. In order to deliver the packet to the proper place, a dst_entry from the routing system needs to be in place. To obtain one, the code first attempts to call the early_demux function of the higher level protocol.

The early_demux routine is an optimization which attempts to find the dst_entry needed to deliver the packet by checking whether a dst_entry is already cached on the socket.
Here’s what that looks like (net/ipv4/ip_input.c:317):
if (sysctl_ip_early_demux && !skb_dst(skb) && skb->sk == NULL) {
    const struct net_protocol *ipprot;
    int protocol = iph->protocol;

    ipprot = rcu_dereference(inet_protos[protocol]);
    if (ipprot && ipprot->early_demux) {
        ipprot->early_demux(skb);
        /* must reload iph, skb->head might have changed */
        iph = ip_hdr(skb);
    }
}
If the optimization is disabled or there is no cached entry (because this is the first UDP packet arriving), the packet will be handed off to the routing system in the kernel, where the dst_entry will be computed and assigned.
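For reference, the fallback in ip_rcv_finish that hands the packet to the routing system looks roughly like this in the kernel version discussed here (abridged; some error accounting is trimmed):

    if (!skb_dst(skb)) {
        int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
                                       iph->tos, skb->dev);
        if (unlikely(err))
            goto drop;
    }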
Once the routing layer completes, statistics counters are updated and the function ends by calling dst_input(skb), which in turn calls the input function pointer on the packet’s dst_entry structure that was affixed by the routing system.
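For reference, dst_input is a small inline helper (include/net/dst.h) that just invokes that function pointer:

/* Input packet from network to transport. */
static inline int dst_input(struct sk_buff *skb)
{
    return skb_dst(skb)->input(skb);
}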
If the packet’s final destination is the local system, the routing system will attach the function ip_local_deliver to the input function pointer in the dst_entry structure on the packet.
ip_local_deliver and netfilter
Recall how we saw the following pattern in the IP protocol layer:
- Calls to ip_rcv do some initial bookkeeping.
- The packet is handed off to netfilter for processing, with a pointer to a callback to be executed when processing finishes.
- ip_rcv_finish is that callback; it finishes processing and continues working toward pushing the packet up the networking stack.
ip_local_deliver has the same pattern (net/ipv4/ip_input.c:242):
/*
 * Deliver IP Packets to the higher protocol layers.
 */
int ip_local_deliver(struct sk_buff *skb)
{
    /*
     * Reassemble IP fragments.
     */
    if (ip_is_fragment(ip_hdr(skb))) {
        if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))
            return 0;
    }

    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
                   ip_local_deliver_finish);
}
Except that in this case, the netfilter chain is NF_INET_LOCAL_IN and the okfn to be called on completion is ip_local_deliver_finish.

We briefly examined how packets move through netfilter earlier, so we’ll move on to the completion callback, ip_local_deliver_finish.
ip_local_deliver_finish
ip_local_deliver_finish obtains the protocol from the packet, looks up the net_protocol structure registered for that protocol, and calls the function pointed to by handler in that net_protocol structure. This hands the packet up to the higher level protocol layer.
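Abridged, the dispatch in ip_local_deliver_finish looks roughly like this (simplified; raw socket delivery, xfrm policy checks, and the resubmit path are omitted here):

    int protocol = ip_hdr(skb)->protocol;
    const struct net_protocol *ipprot;

    ipprot = rcu_dereference(inet_protos[protocol]);
    if (ipprot != NULL) {
        int ret;

        /* hand the packet to the registered handler; udp_rcv in our case */
        ret = ipprot->handler(skb);
        /* ... */
    }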
Higher level protocol registration
In our case, we care mostly about UDP, but TCP protocol handlers are registered the same way and at the same time.
At net/ipv4/af_inet.c:1553 we can find the structure definitions which contain the handler functions for connecting the UDP, TCP, and ICMP protocols to the IP protocol layer:
static const struct net_protocol tcp_protocol = {
    .early_demux = tcp_v4_early_demux,
    .handler     = tcp_v4_rcv,
    .err_handler = tcp_v4_err,
    .no_policy   = 1,
    .netns_ok    = 1,
};

static const struct net_protocol udp_protocol = {
    .early_demux = udp_v4_early_demux,
    .handler     = udp_rcv,
    .err_handler = udp_err,
    .no_policy   = 1,
    .netns_ok    = 1,
};

static const struct net_protocol icmp_protocol = {
    .handler     = icmp_rcv,
    .err_handler = icmp_err,
    .no_policy   = 1,
    .netns_ok    = 1,
};
These structures are registered in the initialization code of the inet address family (net/ipv4/af_inet.c:1716):
/*
 * Add all the base protocols.
 */
if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
    pr_crit("%s: Cannot add ICMP protocol\n", __func__);
if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
    pr_crit("%s: Cannot add UDP protocol\n", __func__);
if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
    pr_crit("%s: Cannot add TCP protocol\n", __func__);
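inet_add_protocol itself (net/ipv4/protocol.c) simply stores the net_protocol pointer into the inet_protos array, indexed by IP protocol number; this is the same array that ip_local_deliver_finish consults when it dispatches a packet. It looks roughly like this:

int inet_add_protocol(const struct net_protocol *prot, unsigned char protocol)
{
    if (!prot->netns_ok) {
        pr_err("Protocol %u is not namespace aware, cannot register.\n",
               protocol);
        return -EINVAL;
    }

    /* atomically install the handler, failing if one is already registered */
    return !cmpxchg((const struct net_protocol **)&inet_protos[protocol],
                    NULL, prot) ? 0 : -1;
}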
As noted, we care mostly about UDP, so we’ll examine the UDP handler function that is called from ip_local_deliver_finish. As we see in the structure definition above, this function is udp_rcv.
UDP
The code for the UDP protocol layer can be found in: net/ipv4/udp.c.
udp_rcv
The udp_rcv function (net/ipv4/udp.c:1954) is just a single line which calls directly into __udp4_lib_rcv to handle receiving the packet.
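In the kernel source examined here, it is essentially:

int udp_rcv(struct sk_buff *skb)
{
    return __udp4_lib_rcv(skb, &udp_table, IPPROTO_UDP);
}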
__udp4_lib_rcv
__udp4_lib_rcv (net/ipv4/udp.c:1708) first checks that the packet is valid and obtains the UDP header, the UDP datagram length, and the source and destination addresses. Next come some additional integrity checks and checksum verification.

Recall that earlier, in the IP protocol layer, we saw an optimization performed to attach a dst_entry to the packet before it was handed off to the upper layer protocol (UDP in our case).
If a socket and corresponding dst_entry were found, __udp4_lib_rcv will queue the packet to be received by the socket:
sk = skb_steal_sock(skb);
if (sk) {
    struct dst_entry *dst = skb_dst(skb);
    int ret;

    if (unlikely(sk->sk_rx_dst != dst))
        udp_sk_rx_dst_set(sk, dst);

    ret = udp_queue_rcv_skb(sk, skb);
    sock_put(sk);
    /* a return value > 0 means to resubmit the input, but
     * it wants the return to be -protocol, or 0
     */
    if (ret > 0)
        return -ret;
    return 0;
} else {
If there is no socket attached from the early_demux operation, a receiving socket will now be looked up by calling __udp4_lib_lookup_skb.
In both cases described above, the datagram will be queued to the socket:
ret = udp_queue_rcv_skb(sk, skb);
sock_put(sk);
If no socket was found, the datagram will be dropped:
/* No socket. Drop packet silently, if checksum is wrong */
if (udp_lib_checksum_complete(skb))
    goto csum_error;

UDP_INC_STATS_BH(net, UDP_MIB_NOPORTS, proto == IPPROTO_UDPLITE);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);

/*
 * Hmm. We got an UDP packet to a port to which we
 * don't wanna listen. Ignore it.
 */
kfree_skb(skb);
return 0;
udp_queue_rcv_skb
The initial parts of this function are as follows (an abridged sketch of the first step appears after the list):
- Determine if the socket associated with the datagram is an encapsulation socket. If so, pass the packet up to that layer’s handler function before proceeding.
- Determine if the datagram is a UDP-Lite datagram and do some integrity checks.
- Verify the UDP checksum of the datagram and drop it if the checksum fails.
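Abridged, the encapsulation check at the top of udp_queue_rcv_skb looks roughly like this (simplified from the source; the UDP-Lite and checksum steps described above follow it):

    struct udp_sock *up = udp_sk(sk);

    if (up->encap_type) {
        int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);

        encap_rcv = ACCESS_ONCE(up->encap_rcv);
        if (skb->len > sizeof(struct udphdr) && encap_rcv != NULL) {
            int ret;

            /* if the encapsulation layer consumed the packet, we are done */
            ret = encap_rcv(sk, skb);
            if (ret <= 0)
                return -ret;
        }

        /* fall through: treat it as a plain UDP datagram */
    }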
Finally, we arrive at the receive queue logic (net/ipv4/udp.c:1548) which begins by checking if the receive queue for the socket is full:
if (sk_rcvqueues_full(sk, skb, sk->sk_rcvbuf))
    goto drop;
sk_rcvqueues_full and tuning receive queue memory
The sk_rcvqueues_full function (include/net/sock.h:788) checks the socket’s backlog length and the socket’s sk_rmem_alloc to determine whether their sum is greater than the sk_rcvbuf for the socket (sk->sk_rcvbuf above):
/*
 * Take into account size of receive queue and backlog queue
 * Do not take into account this skb truesize,
 * to allow even a single big packet to come.
 */
static inline bool sk_rcvqueues_full(const struct sock *sk, const struct sk_buff *skb,
                                     unsigned int limit)
{
    unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc);

    return qsize > limit;
}
Tuning these values is a bit tricky as there are many things that can be adjusted.
The sk->sk_rcvbuf value (called limit in the function above) can be increased up to net.core.rmem_max. You can raise that maximum by setting the sysctl: sysctl -w net.core.rmem_max=8388608.
sk->sk_rcvbuf starts at the net.core.rmem_default value, which can also be adjusted by setting the sysctl: sysctl -w net.core.rmem_default=8388608.
You can also set the sk->sk_rcvbuf size by calling setsockopt and passing SO_RCVBUF. The maximum you can set with setsockopt is net.core.rmem_max.
You can override that limit by calling setsockopt and passing SO_RCVBUFFORCE, but the user running the application will need the CAP_NET_ADMIN capability.
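As a concrete (user space) illustration of these two socket options, here is a minimal sketch. The 8 MiB value is arbitrary; note that the kernel stores roughly double the requested value to account for bookkeeping overhead, which is what getsockopt reports back:

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int size = 8 * 1024 * 1024;   /* desired receive buffer size */
    int actual;
    socklen_t len = sizeof(actual);

    /* capped at net.core.rmem_max */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
        perror("setsockopt(SO_RCVBUF)");

    /* ignores the rmem_max cap, but requires CAP_NET_ADMIN */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE, &size, sizeof(size)) < 0)
        perror("setsockopt(SO_RCVBUFFORCE)");

    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);
    printf("sk->sk_rcvbuf is now %d bytes\n", actual);

    close(fd);
    return 0;
}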
The sk->sk_rmem_alloc value is incremented by calls to skb_set_owner_r, which sets the owning socket of a datagram. We’ll see this called later in the UDP layer.
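skb_set_owner_r is a short helper in include/net/sock.h; charging skb->truesize to sk_rmem_alloc is what makes the datagram count against the limit checked above:

static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
    skb_orphan(skb);
    skb->sk = sk;
    skb->destructor = sock_rfree;
    atomic_add(skb->truesize, &sk->sk_rmem_alloc);
    sk_mem_charge(sk, skb->truesize);
}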
The sk->sk_backlog.len value is incremented by calls to sk_add_backlog, which we’ll see next.
Back to udp_queue_rcv_skb
Once we’ve verified that the queue is not full, we can continue toward queuing the datagram:
bh_lock_sock(sk);
if (!sock_owned_by_user(sk))
    rc = __udp_queue_rcv_skb(sk, skb);
else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
    bh_unlock_sock(sk);
    goto drop;
}
bh_unlock_sock(sk);

return rc;
The first step is to determine whether the socket currently has any system calls against it from a userland program. If it does not, the datagram can be added to the receive queue with a call to __udp_queue_rcv_skb. If it does, the datagram is queued to the backlog instead.
The datagrams on the backlog are added to the receive queue when socket system calls release the socket with a call to release_sock.
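Specifically, __release_sock (net/core/sock.c) walks the backlog and feeds each buffered skb back through the socket’s backlog receive handler, which for UDP ends up in __udp_queue_rcv_skb. Roughly (simplified; prefetching and rescheduling details omitted):

    struct sk_buff *skb = sk->sk_backlog.head;

    do {
        sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
        bh_unlock_sock(sk);

        do {
            struct sk_buff *next = skb->next;

            skb->next = NULL;
            sk_backlog_rcv(sk, skb);   /* sk->sk_backlog_rcv(sk, skb) */
            skb = next;
        } while (skb != NULL);

        bh_lock_sock(sk);
    } while ((skb = sk->sk_backlog.head) != NULL);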
__udp_queue_rcv_skb
The __udp_queue_rcv_skb (net/ipv4/udp.c:1422) function adds the datagram to the socket’s receive queue, and bumps statistics counters if it could not be added:
rc = sock_queue_rcv_skb(sk, skb);
if (rc < 0) {
    int is_udplite = IS_UDPLITE(sk);

    /* Note that an ENOMEM error is charged twice */
    if (rc == -ENOMEM)
        UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_RCVBUFERRORS,
                         is_udplite);
    UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
    kfree_skb(skb);
    trace_udp_fail_queue_rcv_skb(rc, sk);
    return -1;
}
To add the datagram to the queue, sock_queue_rcv_skb is called.
sock_queue_rcv_skb
sock_queue_rcv_skb (net/core/sock.c:388) does a few things before adding the datagram to the queue (an abridged sketch follows the list below):
- The socket’s allocated memory is checked to determine whether it already exceeds the receive buffer size. If so, the drop count for the socket is incremented and the datagram is dropped.
- Next, sk_filter is used to process any Berkeley Packet Filter (BPF) filters that have been attached to the socket.
- sk_rmem_schedule is run to ensure sufficient receive buffer space exists to accept this datagram.
- Next, the size of the datagram is charged to the socket with a call to skb_set_owner_r. This increments sk->sk_rmem_alloc.
- The data is added to the queue with a call to __skb_queue_tail.
- Finally, any processes waiting on data to arrive on the socket are notified with a call to the sk_data_ready notification handler function.
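Put together, the steps above correspond roughly to the following abridged sketch of sock_queue_rcv_skb (locking, skb_dst handling, and tracing details trimmed):

    int err;
    int skb_len;

    if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
        (unsigned int)sk->sk_rcvbuf) {
        atomic_inc(&sk->sk_drops);
        return -ENOMEM;
    }

    err = sk_filter(sk, skb);            /* run any attached BPF filters */
    if (err)
        return err;

    if (!sk_rmem_schedule(sk, skb, skb->truesize)) {
        atomic_inc(&sk->sk_drops);
        return -ENOBUFS;
    }

    skb_set_owner_r(skb, sk);            /* charge truesize to sk_rmem_alloc */

    /* cache the length; the skb may be consumed as soon as it is queued */
    skb_len = skb->len;

    __skb_queue_tail(&sk->sk_receive_queue, skb);

    if (!sock_flag(sk, SOCK_DEAD))
        sk->sk_data_ready(sk, skb_len);  /* wake up any waiting readers */

    return 0;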
The End
That is how data arriving from the network ends up on the receive queue of a socket, ready to be read by a user process.