GalSagie A blog about network virtualization and cloud

DPDK Design Tips (Part 1 - RSS)

In this blog series i am going to address subjects and suggestions thats are valuable when you design an application on top of DPDK libraries. It will provide important tips and tricks for enhancing your data-path/control-path applications performance, and a set of tools that are available for your network processing.

I am mainly addressing DPDK and hence network applications, but some of the subjects i will describe are common performance bottlenecks and can be applied to any other code running on a standard architecture.

This series will include topics ranging from NUMA aware to Flow table implementations, correct use of cache and HW offloading options when using DPDK.

If you dont know what DPDK is, you should visit my NFV acceleration post where i describe what DPDK is, you can find it here.

Or visit the open source DPDK project website.

RSS (Receive Side Scaling)

In this post i am going to describe RSS, how to configure it with DPDK, options for implementing RSS in software and the use cases, and an important aspect of how to configure symmetrical RSS for flow awareness without losing uniform distribution

What is RSS ?

Receive side scaling (RSS) is a network driver technology that enables the efficient distribution of network receive processing across multiple CPUs in multiprocessor systems.

What it does is issue a hash function with a predefined hash key on every packet coming in. The hash function takes the packet IP addresses, the protocol (UDP or TCP) and the ports (5 tuple) as the key and calculate the hash value. (RSS hash function can take only 2,3 or 4 tuples to create the key if configured so).

A number of least significant bits (LSBs) of the hash value are used to index an indirection table. The values in the indirection table are used to assign the received data to a CPU.

The following diagram describe the process done on each incoming packet:

This means a multi process application can use the hardware to distribute and split the traffic between the CPU’s. The use of the indirection table in the process is very use full for implementing dynamic load balancing as the indirection table can be reprogrammed by the driver/application.

Configure RSS with DPDK

DPDK support the configuration of RSS function, the static hash key and the indirection table. RSS is configured per port and the distribution depends on the number of RX queues configured on the port.

What DPDK does is to take the RX queues for the port and start repeatedly writing them in the indirection table.

For example if we have a port configured with RSS and 3 RX queues configured with indexes 0,1 and 2 the indirection table of size 128 will look something like this:

{0,1,2,0,1,2,0……} (indexes 0..127)

The traffic is distributed across those RX queues, its the application responsibility (if it choose so) to poll each one of these queues in a different CPU.

To configure RSS in DPDK you must enable it in the port rte_eth_conf structure.
set rx_mode.mq_mode = ETH_MQ_RX_RSS
edit the RSS configuration structure : rx_adv_conf.rss_conf (change the hash key or leave NULL for the default one and pick the RSS mode)

When RSS is enabled each packet coming in (rte_mbuf) has the RSS hash value result in the meta data structure, it can be accessed in mbuf.hash.rss , This is use full because other applications (hint: flow table) can later use this hash value without recalculating the hash.

The indirection table (called RETA) can be reconfigured at runtime, this means the application can dynamically change which queue the traffic is sent for each indirection table index.

The RETA configuration functions are implemented per poll mode driver, for example for the ixgbe driver look for the following functions:

ixgbe_dev_rss_reta_update and ixgbe_dev_rss_reta_query

Symmetric RSS

It is very important in networking applications to have the same CPU handle two sides of the connection, what is called symmetrical flow. Many networking applications needs to save information about the connection, and you dont want this information shared between two CPU’s, this introduce locking which is bad performance wise.

RSS algorithm is usually using the Toeplitz hash function, this function takes two inputs: the static hash key and the tuples which are extracted from the packet.

The problem is that the default hash key that is used in DPDK (and is the recommended key from Microsoft) does not distribute symmetrical flows to the same CPU.

For example if we have the following packet {src ip=, dst ip=, src port=123, dst port=88} then the symmetrical packet {src ip=, dst ip=, src port=88, dst port=123} might not have the same hash result.

I dont want to get too much into the hash calculation internals, but one can achieve symmetric RSS by changing the hash key (which as i showed earlier can be changed in DPDK configuration) such that the first 32 bits of the key need to be identical to the second 32 bits, and the 16 bits afterwards should be identical to the next 16 bits.

Using that key achieve symmetrical RSS, the problem is that changing this leads to bad distribution of the traffic between the different cores.

But fear not! as there is a solution to this problem. A group of smart people found that there is a specific hash key which gives you both symmetrical flow distribution and a uniform one which is same with the default key)

You can read about it in their published paper.

I can say that i did some tests to check the uniform distribution of this key with random ip traffic and found it to be good (and symmetrical). The hash key (in case you dont want to read the document) is:

static uint8_t hash_key[RSS_HASH_KEY_LENGTH] = { 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, };

You can configure DPDK to use it in the RSS advance configuration structure as shown above.

Software RSS

RSS great advantage is the fact that its done in hardware, of course software implementations can also be done for cases when RSS is not supported or for performing uniform distribution in the TX side as well for example.

Later in this series i am going to describe an implementation of a dynamic load distribution, but in the mean time you can take a look at this software implementation of toeplitz hash as a reference. (Taken from FreeBSD).

Be social and share this post!