Posted on Feb 3, 2016 by PIA Research

Linux networking stack from the ground up, part 5

part 1 | part 2 | part 3 | part 4 | part 5


This blog post picks up right where part 4 left off, examining the last part of the IP protocol layer, the handoff to the UDP protocol layer, and finally the queuing of data to a socket’s receive queue so it can be read by user programs.


Once netfilter has had a chance to look at the packet and decide what to do with it, ip_rcv_finish is called.

ip_rcv_finish begins with an optimization. In order to deliver the packet to the proper place, a dst_entry from the routing system needs to be in place. To obtain one, the code first attempts to call the early_demux function of the higher-level protocol.

The early_demux routine is an optimization which attempts to find the dst_entry needed to deliver the packet by checking if a dst_entry is cached on the socket.

Here’s what that looks like (net/ipv4/ip_input.c:317):

if (sysctl_ip_early_demux && !skb_dst(skb) && skb->sk == NULL) {
  const struct net_protocol *ipprot;
  int protocol = iph->protocol;

  ipprot = rcu_dereference(inet_protos[protocol]);
  if (ipprot && ipprot->early_demux) {
    ipprot->early_demux(skb);
    /* must reload iph, skb->head might have changed */
    iph = ip_hdr(skb);
  }
}
If the optimization is disabled or there is no cached entry (because this is the first UDP packet arriving), the packet will be handed off to the routing system in the kernel where the dst_entry will be computed and assigned.

Once the routing layer completes, statistics counters are updated and the function ends by calling dst_input(skb) which in turn calls the input function pointer on the packet’s dst_entry structure that was affixed by the routing system.
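The dst_input indirection can be sketched in plain userspace C. This is a hypothetical mimic (the fake_ names and simplified structures are not kernel code): the routing system stores a function pointer on the packet's dst_entry, and dst_input simply invokes it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the dst_input() dispatch: the routing
 * system fills in the input function pointer on the packet's
 * dst_entry; for locally destined packets that pointer would be
 * ip_local_deliver, for forwarded packets ip_forward. */
struct fake_skb;

struct fake_dst_entry {
	int (*input)(struct fake_skb *skb);
};

struct fake_skb {
	struct fake_dst_entry *dst;
};

static int fake_local_deliver(struct fake_skb *skb)
{
	(void)skb;
	return 1; /* stand-in for "delivered locally" */
}

/* mimic of dst_input(): call whatever the routing system attached */
static int fake_dst_input(struct fake_skb *skb)
{
	return skb->dst->input(skb);
}
```

The point of the indirection is that ip_rcv_finish never needs to know the packet's fate; it just calls through the pointer the routing lookup installed.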

If the packet’s final destination is the local system, the routing system will attach the function ip_local_deliver to the input function pointer in the dst_entry structure on the packet.

ip_local_deliver and netfilter

Recall how we saw the following pattern in the IP protocol layer:

  1. Calls to ip_rcv do some initial bookkeeping.
  2. Packet is handed off to netfilter for processing, with a pointer to a callback to be executed when processing finishes.
  3. ip_rcv_finish is the callback which finishes processing and continues pushing the packet up the networking stack.

ip_local_deliver has the same pattern (net/ipv4/ip_input.c:242):

/*
 *      Deliver IP Packets to the higher protocol layers.
 */
int ip_local_deliver(struct sk_buff *skb)
{
  /*
   *    Reassemble IP fragments.
   */
  if (ip_is_fragment(ip_hdr(skb))) {
    if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))
      return 0;
  }

  return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
                 ip_local_deliver_finish);
}

Except that in this case, the netfilter chain is NF_INET_LOCAL_IN and the okfn to be called on completion is ip_local_deliver_finish.

We examined how packets move through netfilter briefly earlier, so we’ll move on to the completion callback, ip_local_deliver_finish.


ip_local_deliver_finish obtains the protocol from the packet, looks up a net_protocol structure registered for that protocol, and calls the function pointed to by handler in the net_protocol structure.

This hands the packet up to the higher level protocol layer.
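The dispatch in ip_local_deliver_finish can be sketched as a small userspace mimic. This is a hypothetical illustration (the fake_ handlers and simplified table are not kernel code): the IP protocol number from the header indexes an array of handler structures, and the registered handler function pointer is called.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mimic of the inet_protos[] handler table: the kernel
 * indexes an array by the IP protocol number and calls the handler
 * function pointer registered in the matching net_protocol struct. */
struct proto_handler {
	int (*handler)(const char *payload);
};

static int fake_tcp_rcv(const char *payload) { (void)payload; return 6; }
static int fake_udp_rcv(const char *payload) { (void)payload; return 17; }

/* indexed by IP protocol number, like inet_protos[] in the kernel */
static struct proto_handler protos[256] = {
	[6]  = { .handler = fake_tcp_rcv }, /* IPPROTO_TCP  */
	[17] = { .handler = fake_udp_rcv }, /* IPPROTO_UDP  */
};

/* mimic of ip_local_deliver_finish's core: look up and call the handler */
static int deliver(int protocol, const char *payload)
{
	const struct proto_handler *p = &protos[protocol & 0xff];

	if (!p->handler)
		return -1; /* no handler registered: packet is dropped */
	return p->handler(payload);
}
```

A protocol with no registered handler falls through to the drop path, which is why the registration step shown below matters.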

Higher level protocol registration

In our case, we care mostly about UDP, but TCP protocol handlers are registered the same way and at the same time.

On net/ipv4/af_inet.c:1553 we can find the structure definitions which contain the handler functions connecting the UDP, TCP, and ICMP protocols to the IP protocol layer:

static const struct net_protocol tcp_protocol = {
  .early_demux    =       tcp_v4_early_demux,
  .handler        =       tcp_v4_rcv,
  .err_handler    =       tcp_v4_err,
  .no_policy      =       1,
  .netns_ok       =       1,
};

static const struct net_protocol udp_protocol = {
  .early_demux =  udp_v4_early_demux,
  .handler =      udp_rcv,
  .err_handler =  udp_err,
  .no_policy =    1,
  .netns_ok =     1,
};

static const struct net_protocol icmp_protocol = {
  .handler =      icmp_rcv,
  .err_handler =  icmp_err,
  .no_policy =    1,
  .netns_ok =     1,
};

These structures are registered in the initialization code of the inet address family (net/ipv4/af_inet.c:1716):

 /*
  *     Add all the base protocols.
  */

 if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
    pr_crit("%s: Cannot add ICMP protocol\n", __func__);
 if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
    pr_crit("%s: Cannot add UDP protocol\n", __func__);
 if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
    pr_crit("%s: Cannot add TCP protocol\n", __func__);

In our case, we care mostly about UDP, so we’ll examine the UDP handler function that is called from ip_local_deliver_finish.

As we see in the structure definition above, this function is called udp_rcv.


The code for the UDP protocol layer can be found in: net/ipv4/udp.c.


The udp_rcv (net/ipv4/udp.c:1954) function is just a single line which calls directly into __udp4_lib_rcv to handle receiving the packet.


The __udp4_lib_rcv function (net/ipv4/udp.c:1708) first checks that the packet is valid and obtains the UDP header, UDP datagram length, source address, and destination address. Next come some additional integrity checks and checksum verification.

Recall that earlier in the IP protocol layer, we saw that an optimization is performed to attach a dst_entry to the packet before it is handed off to the upper layer protocol (UDP in our case).

If a socket and corresponding dst_entry were found, __udp4_lib_rcv will queue the packet to be received by the socket:

sk = skb_steal_sock(skb);
if (sk) {
  struct dst_entry *dst = skb_dst(skb);
  int ret;

  if (unlikely(sk->sk_rx_dst != dst))
    udp_sk_rx_dst_set(sk, dst);

  ret = udp_queue_rcv_skb(sk, skb);

  /* a return value > 0 means to resubmit the input, but
   * it wants the return to be -protocol, or 0
   */
  if (ret > 0)
    return -ret;
  return 0;
} else {

If there is no socket attached from the early_demux operation, a receiving socket will now be looked up by calling __udp4_lib_lookup_skb.

In both cases described above, the datagram will be queued to the socket:

ret = udp_queue_rcv_skb(sk, skb);

If no socket was found, the datagram will be dropped:

/* No socket. Drop packet silently, if checksum is wrong */
if (udp_lib_checksum_complete(skb))
  goto csum_error;

/*
 * Hmm.  We got an UDP packet to a port to which we
 * don't wanna listen.  Ignore it.
 */
return 0;


The initial parts of udp_queue_rcv_skb are as follows:

  1. Determine if the socket associated with the datagram is an encapsulation socket. If so, pass the packet up to that layer’s handler function before proceeding.
  2. Determine if the datagram is a UDP-Lite datagram and do some integrity checks.
  3. Verify the UDP checksum of the datagram and drop it if the checksum fails.

Finally, we arrive at the receive queue logic (net/ipv4/udp.c:1548) which begins by checking if the receive queue for the socket is full:

if (sk_rcvqueues_full(sk, skb, sk->sk_rcvbuf))                                     
  goto drop;

sk_rcvqueues_full and tuning receive queue memory

The sk_rcvqueues_full function (include/net/sock.h:788) checks the socket’s backlog length and the socket’s sk_rmem_alloc to determine if the sum is greater than the sk_rcvbuf for the socket (sk->sk_rcvbuf above):

/*
 * Take into account size of receive queue and backlog queue
 * Do not take into account this skb truesize,
 * to allow even a single big packet to come.
 */
static inline bool sk_rcvqueues_full(const struct sock *sk, const struct sk_buff *skb,
                                     unsigned int limit)
{
  unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc);

  return qsize > limit;
}
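The arithmetic of this check is simple enough to mimic in a standalone sketch (hypothetical function, plain integers in place of the kernel's sock fields):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace mimic of sk_rcvqueues_full(): the backlog
 * length and the allocated receive memory are summed and compared
 * against the limit (sk->sk_rcvbuf). The size of the skb being queued
 * is deliberately NOT counted, so even a single large packet can be
 * accepted when the queues are otherwise within the limit. */
static bool rcvqueues_full(unsigned int backlog_len,
                           unsigned int rmem_alloc,
                           unsigned int limit)
{
	unsigned int qsize = backlog_len + rmem_alloc;

	return qsize > limit;
}
```

Note the strict greater-than: a socket sitting exactly at its limit will still accept one more datagram.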

Tuning these values is a bit tricky as there are many things that can be adjusted.

The sk->sk_rcvbuf (called limit in the function above) value can be increased up to net.core.rmem_max. You can raise that maximum with the sysctl: sysctl -w net.core.rmem_max=8388608.

sk->sk_rcvbuf starts at the net.core.rmem_default value, which can also be adjusted by setting the sysctl: sysctl -w net.core.rmem_default=8388608.

You can also set the sk->sk_rcvbuf size by calling setsockopt and passing SO_RCVBUF. The maximum you can set with setsockopt is net.core.rmem_max.

You can override the SO_RCVBUF limit by calling setsockopt and passing SO_RCVBUFFORCE, but the user running the application will need the CAP_NET_ADMIN capability.
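A minimal sketch of the SO_RCVBUF path from userspace (hypothetical helper function; only the setsockopt/getsockopt calls are the real API):

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Request a larger receive buffer on a UDP socket with SO_RCVBUF,
 * then read back the value the kernel actually granted. The kernel
 * doubles the requested value (to account for its own bookkeeping
 * overhead) and clamps the request to net.core.rmem_max, so the value
 * read back usually differs from the value requested. */
int set_rcvbuf(int requested)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
	               &requested, sizeof(requested)) < 0) {
		close(fd);
		return -1;
	}

	int actual = 0;
	socklen_t len = sizeof(actual);
	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) < 0)
		actual = -1;

	close(fd);
	return actual;
}
```

Comparing the returned value against what you requested is a quick way to see whether net.core.rmem_max is silently capping your buffers.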

The sk->sk_rmem_alloc value is incremented by calls to skb_set_owner_r which set the owner socket of a datagram. We’ll see this called later in the UDP layer.

The sk->sk_backlog.len is incremented by calls to sk_add_backlog, which we’ll see next.

Back to udp_queue_rcv_skb

Once we’ve verified that the queue is not full, we can continue toward queuing the datagram:

if (!sock_owned_by_user(sk))
  rc = __udp_queue_rcv_skb(sk, skb);
else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
  goto drop;
}

return rc;

The first step is to determine whether the socket currently has any system calls against it from a userland program. If it does not, the datagram can be added to the receive queue with a call to __udp_queue_rcv_skb. If it does, the datagram is queued to the backlog.

The datagrams on the backlog are added to the receive queue when socket system calls release the sock with a call to release_sock.
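The hand-off between the backlog and the receive queue can be sketched as a toy model (hypothetical fake_ structures and functions, not kernel code):

```c
#include <assert.h>

/* Hypothetical sketch of the backlog hand-off: while a user system
 * call "owns" the socket, incoming datagrams are parked on a backlog
 * list; when the call releases the socket (release_sock in the
 * kernel), the backlog is drained onto the real receive queue. */
#define QMAX 16

struct fake_sock {
	int owned;                        /* sock_owned_by_user() analogue */
	int backlog[QMAX]; int backlog_len;
	int rcvq[QMAX];    int rcvq_len;
};

static void fake_udp_queue_rcv(struct fake_sock *sk, int pkt)
{
	if (!sk->owned)
		sk->rcvq[sk->rcvq_len++] = pkt;       /* straight to receive queue */
	else
		sk->backlog[sk->backlog_len++] = pkt; /* park on the backlog */
}

static void fake_release_sock(struct fake_sock *sk)
{
	/* drain the backlog onto the receive queue, preserving order */
	for (int i = 0; i < sk->backlog_len; i++)
		sk->rcvq[sk->rcvq_len++] = sk->backlog[i];
	sk->backlog_len = 0;
	sk->owned = 0;
}
```

The backlog exists so the softirq path never has to wait on a socket lock held by a user process; delivery order is still preserved across the drain.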


The __udp_queue_rcv_skb function (net/ipv4/udp.c:1422) adds the datagram to the receive queue, bumping statistics counters if it could not be added:

rc = sock_queue_rcv_skb(sk, skb);
if (rc < 0) {
  int is_udplite = IS_UDPLITE(sk);

  /* Note that an ENOMEM error is charged twice */
  if (rc == -ENOMEM)
    UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_RCVBUFERRORS, is_udplite);

  UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
  trace_udp_fail_queue_rcv_skb(rc, sk);
  return -1;
}

To add the datagram to the queue, sock_queue_rcv_skb is called.


sock_queue_rcv_skb (net/core/sock.c:388) does a few things before adding the datagram to the queue:

  1. The socket’s allocated memory is checked to determine if it has exceeded the receive buffer size. If so, the drop count for the socket is incremented.
  2. Next, sk_filter is used to process any Berkeley Packet Filter (BPF) filters that have been applied to the socket.
  3. sk_rmem_schedule is run to ensure sufficient receive buffer space exists to accept this datagram.
  4. Next, the size of the datagram is charged to the socket with a call to skb_set_owner_r. This increments sk->sk_rmem_alloc.
  5. The data is added to the queue with a call to __skb_queue_tail.
  6. Finally, any processes waiting on data to arrive on the socket are notified with a call to the sk_data_ready notification handler function.
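The steps above can be condensed into a toy sketch (hypothetical tiny_ structures and function; BPF filtering and sk_rmem_schedule are omitted for brevity):

```c
#include <assert.h>

/* Hypothetical mimic of the sock_queue_rcv_skb flow: check the memory
 * limit, charge the datagram's size to the socket, enqueue it, and
 * (conceptually) wake any waiting reader. */
struct tiny_sock {
	unsigned int rmem_alloc; /* sk->sk_rmem_alloc analogue */
	unsigned int rcvbuf;     /* sk->sk_rcvbuf analogue     */
	unsigned int drops;
	unsigned int queued;
};

/* returns 0 on success, -1 when the datagram must be dropped */
static int tiny_queue_rcv(struct tiny_sock *sk, unsigned int truesize)
{
	/* 1. refuse the datagram if it would exceed the receive buffer */
	if (sk->rmem_alloc + truesize > sk->rcvbuf) {
		sk->drops++;
		return -1;
	}

	/* 4. charge the datagram to the socket (skb_set_owner_r analogue) */
	sk->rmem_alloc += truesize;

	/* 5/6. add to the queue; the reader would be woken here
	 * (__skb_queue_tail and sk_data_ready analogues) */
	sk->queued++;
	return 0;
}
```

The charge in step 4 is what makes the sk_rcvqueues_full check seen earlier eventually fire: every queued datagram's truesize accumulates in sk_rmem_alloc until the user process reads the data and the memory is uncharged.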

The End

That is how data that arrives from the network ends up on the receive queue for a socket ready to be read by a user process.
