Lately, a pastime of mine has been learning and tinkering in Rust. Since Rust is a systems programming language, I decided a load balancer would make a good pet project to hack on. While there are many exciting Layer 7 proxies out there, the available Layer 4 load balancers are all industrial strength and somewhat complex to get up and running. So I set out to create a more general purpose Layer 4 load balancer. These are my notes and takeaways from my load balancer side-project, Convey.
Convey
A goal of this project was to build a load balancer that easily supports Layer 4 Network Load Balancing but is still modern and general purpose. Convey supports a few modes of operation, but some of the features are universal, namely health checking backends for availability and hot reloading of the load balancer configuration. This is probably up for discussion, but imo a modern load balancer should already have features like stats counters, health checking and hot configuration reloading baked in.
Backend configuration can be hot-reloaded, so there is no need to restart or even reload the process. Simply update the configuration file and save it. Convey will notice the change and reload the backend server configuration, which makes adding or removing load balanced servers simple. Right now the only available configuration source is a configuration file, but this could easily be extended to support other sources such as a key value store.
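To make the hot-reload idea concrete, here is a minimal sketch of one way to notice that a configuration file changed: poll its modification time and re-read it when the timestamp moves. This is an illustration using only the standard library, not Convey's actual watcher; the `ConfigWatcher` type and its `poll` method are hypothetical names.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Tracks one config file and reports when its modification time changes.
/// Hypothetical helper for illustration; not taken from Convey.
pub struct ConfigWatcher {
    path: PathBuf,
    last_modified: Option<SystemTime>,
}

impl ConfigWatcher {
    pub fn new(path: impl AsRef<Path>) -> Self {
        ConfigWatcher {
            path: path.as_ref().to_path_buf(),
            last_modified: None,
        }
    }

    /// Returns Ok(Some(contents)) on the first call and whenever the file's
    /// mtime differs from the last one seen; Ok(None) when it is unchanged.
    pub fn poll(&mut self) -> io::Result<Option<String>> {
        let modified = fs::metadata(&self.path)?.modified()?;
        if self.last_modified == Some(modified) {
            return Ok(None);
        }
        self.last_modified = Some(modified);
        fs::read_to_string(&self.path).map(Some)
    }
}
```

A caller would invoke `poll` on a timer and, on `Some(contents)`, parse the new backend list and swap it in. Polling is the simplest option; an OS file-notification mechanism would react faster but needs more setup.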
I wanted to make it simple to run a load balancer as a proxy (the default since it is likely the most common use) or something more advanced such as Passthrough or Direct Server Return (DSR). So all three cases are possible settings at startup.
Proxy Mode
In a proxy setup, the client’s TCP connection is terminated at the load balancer. The load balancer copies the payload and initiates another TCP stream to one of the load balanced backend servers. This connection persists for the length of the TCP session as established by the client.
Convey’s proxy mode is built on the Rust tokio runtime, making the socket processing non-blocking. Admittedly, the Futures abstraction was a new concept for me and took a little bit to wrap my head around. But in the end it makes a lot of sense and turned out to be very powerful. Using the tokio runtime makes the Convey proxy mode similar to existing software Layer 4 proxies such as Nginx and HAProxy which are both event-driven as well.
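The terminate-and-relay behaviour described above can be sketched with the standard library alone. Note the swap: Convey's proxy mode is non-blocking on tokio, while this sketch is a deliberately simple blocking version with one thread per direction, so it shows the data flow rather than the real implementation. The `relay` and `serve` function names are hypothetical.

```rust
use std::io;
use std::net::{Shutdown, TcpListener, TcpStream};
use std::thread;

/// Terminate the client's connection here, open a second TCP stream to the
/// backend, and copy bytes in both directions until both sides are done.
fn relay(client: TcpStream, backend_addr: &str) -> io::Result<()> {
    let backend = TcpStream::connect(backend_addr)?;
    let (mut client_rx, mut client_tx) = (client.try_clone()?, client);
    let (mut backend_rx, mut backend_tx) = (backend.try_clone()?, backend);

    // client -> backend runs on a helper thread.
    let upstream = thread::spawn(move || {
        let _ = io::copy(&mut client_rx, &mut backend_tx);
        let _ = backend_tx.shutdown(Shutdown::Write);
    });

    // backend -> client runs on the current thread.
    let _ = io::copy(&mut backend_rx, &mut client_tx);
    let _ = client_tx.shutdown(Shutdown::Write);
    let _ = upstream.join();
    Ok(())
}

/// Accept clients forever, relaying each one on its own thread.
/// A real proxy would pick `backend_addr` per connection from a backend pool.
pub fn serve(listener: TcpListener, backend_addr: String) -> io::Result<()> {
    for conn in listener.incoming() {
        let client = conn?;
        let addr = backend_addr.clone();
        thread::spawn(move || {
            let _ = relay(client, &addr);
        });
    }
    Ok(())
}
```

Two threads per connection does not scale the way an event loop does, which is the reason an async runtime is attractive for this mode.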
To run Convey in Proxy mode, no specific flags have to be provided since Proxy mode is the default.
sudo RUST_LOG=DEBUG ./target/release/convey --config=config.toml
Passthrough Mode
A Passthrough setup is one specific to Network Load Balancing. At least that’s been my perception. Similar to the proxy, the client tries connecting to the single load balancer address. Unlike Proxy mode, however, in a Passthrough setup the client’s TCP session does not terminate at the load balancer. Instead the packet is processed, manipulated and forwarded on to a backend server. By processed, I mean the necessary connection tracking is put in place or updated so future packets from, or back to, the client go to the right place. And by manipulated, I mainly mean the packet is NAT’ed appropriately. The client should think it’s communicating with the load balancer address the entire time. Ultimately, though, the TCP connection terminates at a load balanced backend server.
The backend server is chosen using consistent hashing and maintained in a connection tracking map to ensure that future packets from a given client are forwarded to the same backend (think sticky sessions in HAProxy).
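As a rough sketch of how consistent hashing and a connection tracking map fit together: backends are placed at several points on a hash ring, a client is hashed onto the ring and served by the next point clockwise, and the choice is remembered per client. Convey relies on an existing crate for consistent hashing, so everything below (`Ring`, `Tracker`, the `replicas` parameter, and the use of `DefaultHasher`) is an assumption made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};
use std::net::SocketAddr;

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

/// A hash ring where each backend occupies `replicas` points (virtual nodes).
pub struct Ring {
    points: BTreeMap<u64, SocketAddr>,
}

impl Ring {
    pub fn new(backends: &[SocketAddr], replicas: u32) -> Self {
        let mut points = BTreeMap::new();
        for b in backends {
            for i in 0..replicas {
                points.insert(hash_of(&(b, i)), *b);
            }
        }
        Ring { points }
    }

    /// First point at or after the client's hash, wrapping to the start.
    pub fn pick(&self, client: &SocketAddr) -> Option<SocketAddr> {
        let h = hash_of(client);
        self.points
            .range(h..)
            .next()
            .or_else(|| self.points.iter().next())
            .map(|(_, backend)| *backend)
    }
}

/// Connection tracking: remembers which backend each client was sent to.
pub struct Tracker {
    ring: Ring,
    conns: HashMap<SocketAddr, SocketAddr>,
}

impl Tracker {
    pub fn new(ring: Ring) -> Self {
        Tracker { ring, conns: HashMap::new() }
    }

    /// Reuse the tracked backend if there is one, otherwise consult the ring.
    pub fn backend_for(&mut self, client: SocketAddr) -> Option<SocketAddr> {
        if let Some(backend) = self.conns.get(&client) {
            return Some(*backend);
        }
        let backend = self.ring.pick(&client)?;
        self.conns.insert(client, backend);
        Some(backend)
    }
}
```

The ring keeps most clients on the same backend when servers are added or removed, while the map pins an established connection to its backend even if the ring would now choose differently.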
Internally, the architecture is a manager-worker model for Passthrough and Direct Server Return (DSR) modes. The low level network package I rely on for these modes (libpnet) is great for building and manipulating packets. It also contains various handy abstractions for listening, sending, and receiving, making it a useful package for network utilities. Some of its operations are blocking, however, and at the IP/TCP layers the abstractions appear to do a fair amount of byte copying. This is nitpicking a bit as I really do find the package very useful and AFAIK there just aren’t many low level network APIs for Rust yet. For fast filtering and traffic shaping in user space, though, these are limitations. I hope in the future there is some low level network package which supports or abstracts Async operations.
For the above reasons an event driven architecture is out of the question here, so I went with the manager-worker model. A Rust thread listens on the target interface and performs some filtering. It then sends the Ethernet frame over a Multi-Producer-Multi-Consumer channel where a worker thread will pick it up. The frame is deconstructed, filtered some more, then eventually processed (provided it’s relevant). The new packet is built and sent on another channel (a regular Multi-Producer-Single-Consumer one this time) to the single transmitting thread.
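The listener, worker, and transmitter pipeline can be sketched with standard library channels. One assumption to flag: `std::sync::mpsc` only allows a single consumer, so this sketch shares the ingress receiver between workers behind an `Arc<Mutex<..>>` instead of using a true multi-consumer channel. The `process` function is a hypothetical stand-in for the real deconstruct, filter, and rebuild step.

```rust
use std::sync::mpsc::{Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

pub type Frame = Vec<u8>;

/// Hypothetical placeholder for "deconstruct, filter, rebuild".
/// Empty frames are treated as irrelevant and dropped.
fn process(frame: Frame) -> Option<Frame> {
    if frame.is_empty() {
        None
    } else {
        Some(frame)
    }
}

/// Spawn `workers` threads that pull frames from `ingress` (fed by the
/// listener thread) and push rebuilt frames to `egress` (drained by the
/// single transmitting thread).
pub fn spawn_workers(
    workers: usize,
    ingress: Receiver<Frame>,
    egress: Sender<Frame>,
) -> Vec<thread::JoinHandle<()>> {
    let ingress = Arc::new(Mutex::new(ingress));
    (0..workers)
        .map(|_| {
            let ingress = Arc::clone(&ingress);
            let egress = egress.clone();
            thread::spawn(move || loop {
                // The lock is held only while receiving, not while processing.
                let frame = match ingress.lock().unwrap().recv() {
                    Ok(frame) => frame,
                    Err(_) => break, // the listener side hung up
                };
                if let Some(out) = process(frame) {
                    if egress.send(out).is_err() {
                        break; // the transmitter side hung up
                    }
                }
            })
        })
        .collect()
}
```

Dropping the ingress sender lets every worker drain the queue and exit, which in turn closes the egress channel for the transmitter.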
The number of workers is configurable in the Convey toml file, and I noticed right away that adjusting it had large effects on performance. Some of that is the locking overhead of shared structures like the connection tracking map. But it turns out Rust uses native threading. This is an incredibly important detail, especially given the packet construction and manipulation operations are all blocking. As far as I can tell, the most idiomatic way to handle these sorts of operations is with an Async runtime. Or maybe the way to go is with synchronous blocking, but with something like netmap (which libpnet apparently supports!). Regardless, with the current implementation’s native threading model it’s important to note the number of cores the load balancer will be running on and tune the “workers” parameter appropriately.
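One way to keep the worker count in line with the hardware is to cap it based on the detected core count. The sketch below is a heuristic of my own, not something Convey does: it assumes two cores are left free for the listening and transmitting threads, and the `worker_count` and `detected_cores` names are hypothetical. The right number for a real deployment should come from measurement.

```rust
use std::thread;

/// Number of cores the OS reports as available, falling back to 1.
pub fn detected_cores() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Choose a worker count for a machine with `cores` cores.
/// Assumption: reserve two cores for the listener and transmitter threads,
/// and never go below one worker. A configured value is honored up to that cap.
pub fn worker_count(configured: Option<usize>, cores: usize) -> usize {
    let max_workers = cores.saturating_sub(2).max(1);
    configured.unwrap_or(max_workers).clamp(1, max_workers)
}
```

For example, `worker_count(None, detected_cores())` picks the cap itself, while `worker_count(Some(8), 4)` is reduced to 2 under this policy.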
I had initially wanted to simply manipulate the ingress IP packet for speed purposes, but ended up building the IP header from scratch every time. Unfortunately, I didn’t find a way around this, since I’m using channels to send ingress packets to the workers and then another channel to send the egress packets back out.
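For a sense of what building the IP header from scratch involves, here is a minimal sketch that fills in a 20-byte IPv4 header (no options) and computes its checksum with the standard Internet checksum from RFC 1071. Convey builds its packets with libpnet, so this hand-rolled version is only an illustration; the function names and the fixed choices (identification 0, Don't Fragment set) are assumptions.

```rust
use std::net::Ipv4Addr;

/// RFC 1071 Internet checksum: one's-complement sum of 16-bit words,
/// computed over a header whose checksum field is currently zero.
pub fn ipv4_checksum(header: &[u8]) -> u16 {
    let mut sum: u32 = 0;
    for pair in header.chunks(2) {
        let low = *pair.get(1).unwrap_or(&0); // pad an odd trailing byte
        sum += u32::from(u16::from_be_bytes([pair[0], low]));
    }
    // Fold the carries back into the low 16 bits.
    while sum >> 16 != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}

/// Build a minimal IPv4 header. `payload_len` must be at most 65515 so the
/// total length fits in 16 bits.
pub fn build_ipv4_header(
    src: Ipv4Addr,
    dst: Ipv4Addr,
    protocol: u8,
    payload_len: u16,
    ttl: u8,
) -> [u8; 20] {
    let mut h = [0u8; 20];
    h[0] = 0x45; // version 4, header length 5 x 32-bit words
    h[2..4].copy_from_slice(&(20 + payload_len).to_be_bytes()); // total length
    h[6] = 0x40; // flags: Don't Fragment, fragment offset 0
    h[8] = ttl;
    h[9] = protocol; // e.g. 6 for TCP, 17 for UDP
    h[12..16].copy_from_slice(&src.octets());
    h[16..20].copy_from_slice(&dst.octets());
    let checksum = ipv4_checksum(&h);
    h[10..12].copy_from_slice(&checksum.to_be_bytes());
    h
}
```

Changing the addresses also invalidates the TCP checksum, because it covers a pseudo-header containing the source and destination IPs, so a NAT path has to recompute that as well.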
And finally, it should also be noted, the really fast Network Load Balancers are doing some fancy kernel bypass or using the kernel even more to their advantage with something like eBPF. The above is all in user space, so it just won’t be as fast as these other projects. That’s a little more ambitious than I wanted to be with this project, although I may still look into BPF filters.
Passthrough Setup
To run Convey in Passthrough mode, we need a couple of iptables rules on the load balancer:
sudo iptables -t raw -A PREROUTING -p tcp --dport <LOAD_BALANCER_PORT> -j DROP
sudo iptables -t raw -A PREROUTING -p tcp --sport <BACKEND_SERVER_PORT> --dport 33768:61000 -j DROP
What’s going on here? Remember, Convey is not terminating any TCP sessions, so it is not binding to any ports in this setup. When the client tries connecting to the load balancer, the underlying OS tries to be helpful and immediately sends back a TCP RST because nothing is listening on that port. Similarly, when the backend server sends its response packets back to the load balancer, the OS will be disruptive again by responding with a TCP RST. Since Convey runs entirely in user space, the above commands are necessary to drop the packets in the raw table, before they even reach OS connection tracking. However, Convey can still see them because it listens directly on the interface.
Then run Convey in Passthrough mode by setting the “--passthrough” flag:
sudo RUST_LOG=DEBUG ./target/release/convey --passthrough --config=config.toml
DSR Mode
With DSR, the client again thinks it’s establishing a connection to the load balancer, but is forwarded on to a backend using the same mechanisms as described in Passthrough mode. The internals of DSR are identical to Passthrough. There is just a flag indicating whether to set the IP/TCP sources to the client (for DSR) or the load balancer (for Passthrough). With this mode there is less connection tracking overhead in the load balancer, since responses no longer pass back through it, so throughput should be higher than in Passthrough.
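The single difference between the two modes, which source address goes on the forwarded packet, can be sketched as a small pure function. The types and names here (`Mode`, `Addrs`, `forward`, `lb_egress`) are hypothetical and only model the address rewrite, not real packet buffers.

```rust
use std::net::SocketAddrV4;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Mode {
    Passthrough,
    Dsr,
}

/// Source and destination (IP and TCP port) of one packet.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Addrs {
    pub src: SocketAddrV4,
    pub dst: SocketAddrV4,
}

/// Rewrite a client -> load balancer packet for delivery to `backend`.
/// `lb_egress` is the address and port the load balancer would use as the
/// source in Passthrough mode (an assumption made for this sketch).
pub fn forward(
    mode: Mode,
    ingress: Addrs,
    lb_egress: SocketAddrV4,
    backend: SocketAddrV4,
) -> Addrs {
    let src = match mode {
        // The backend replies to the load balancer, which NATs the reply back.
        Mode::Passthrough => lb_egress,
        // The backend sees the client as the source and replies to it directly.
        Mode::Dsr => ingress.src,
    };
    Addrs { src, dst: backend }
}
```

In DSR the load balancer never sees the reply, which is why the backend itself has to fix up the source address on the way out.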
Another difference between DSR and Passthrough is that the backend servers must themselves “participate” in this mode of operation. Since the packets the backend servers receive carry the client’s address as their source, the backends will do their thing, then send the response directly back to the client. However, the client still thinks it’s communicating with the load balancer, so we need the response packets to look like they came from the load balancer. Some solutions, such as IPVS, use IPIP tunneling to handle this. Convey doesn’t handle any such encapsulation, so we use an Egress NAT to manipulate the response back to the client. The Egress NAT should alter the source IP to be that of the Convey load balancer, so from the client’s perspective it is still communicating with the load balancer address.
DSR Setup
We need the same rule on the load balancer for ingress packets as for Passthrough mode:
sudo iptables -t raw -A PREROUTING -p tcp --dport <LOAD_BALANCER_PORT> -j DROP
But we also need to handle the Egress NAT-ing somehow. The easiest way I know is to use Traffic Control. On each backend server, set up the Egress NAT using Traffic Control like so (also note the listening port on the load balancer should be the same as that of the load balanced backend servers):
sudo tc qdisc add dev enp0s8 root handle 10: htb
sudo tc filter add dev enp0s8 parent 10: protocol ip prio 1 u32 match ip src <LOCAL_SERVER_IP> match ip sport <LISTEN_PORT> 0xffff match ip dst <LOAD_BALANCER_IP> action ok
sudo tc filter add dev enp0s8 parent 10: protocol ip prio 10 u32 match ip src <LOCAL_SERVER_IP> match ip sport <LISTEN_PORT> 0xffff action nat egress <LOCAL_SERVER_IP> <LOAD_BALANCER_IP>
This will manipulate outbound packets from a given backend server to make them look like they originated from the load balancer. Exactly what we need.
Back on the load balancer, run Convey in DSR mode by setting the “--dsr” flag:
sudo RUST_LOG=DEBUG ./target/release/convey --dsr --config=config.toml
Benchmarks
Here are some basic benchmarks of the Proxy and DSR Convey modes against Nginx and HAProxy. These are very simple; they were performed in a Vagrant environment on my laptop.
I used 4 VMs (1 for load generation with wrk, 2 backend servers serving the Nginx index page, 1 load balancer) for the tests with the following configuration:
- 1 GB RAM
- 2 CPU Cores per server except the load balancer which got 4
- Ubuntu 16.04
First, to set the baseline, HAProxy and Nginx were set up with basic layer 4 proxy configurations. Then I generated load like so:
wrk -t6 -c200 -d120s --latency http://192.168.1.197
And finally the initial baseline results for HAProxy and Nginx:
+---------+----------+------------+-----------+-----------+
| SW | Avg Lat. | Avg Req/s | Total Req | Data Read |
+---------+----------+------------+-----------+-----------+
| Nginx | 9.95ms | 3.42k | 2450490 | 1.96GB |
| HAProxy | 9.43ms | 3.55k | 2544029 | 2.04GB |
+---------+----------+------------+-----------+-----------+
First I ran Convey in Proxy mode. This takes advantage of the asynchronous Tokio runtime, but HAProxy and Nginx are both event-driven themselves, so I didn’t expect a real advantage there. I anticipated Convey Proxy would approach HAProxy and Nginx performance, however...
+--------------+----------+------------+-----------+-----------+
| SW | Avg Lat. | Avg Req/s | Total Req | Data Read |
+--------------+----------+------------+-----------+-----------+
| Nginx | 9.95ms | 3.42k | 2450490 | 1.96GB |
| HAProxy | 9.43ms | 3.55k | 2544029 | 2.04GB |
| Convey Proxy | 7.46ms | 5.81k | 4156170 | 3.32GB |
+--------------+----------+------------+-----------+-----------+
Holy smokes, Convey beat them handily! This is where I should probably again mention the caveats that this isn’t a real test, Nginx and HAProxy maybe aren’t properly tuned, etc. But still!
How would Convey with Direct Server Return do? As discussed above, the architecture is totally different (i.e. not event driven but manager-worker relying on native threads), but in theory it should perform adequately since routing the response directly to the client from the backend server avoids the overhead of the load balancer on the return trip.
+--------------+----------+------------+-----------+-----------+
| SW | Avg Lat. | Avg Req/s | Total Req | Data Read |
+--------------+----------+------------+-----------+-----------+
| Nginx | 9.95ms | 3.42k | 2450490 | 1.96GB |
| HAProxy | 9.43ms | 3.55k | 2544029 | 2.04GB |
| Convey Proxy | 7.46ms | 5.81k | 4156170 | 3.32GB |
| Convey DSR | 16.30ms | 4.98k | 3565295 | 2.85GB |
+--------------+----------+------------+-----------+-----------+
So, interestingly, the average latency is higher, but Convey DSR still outperforms HAProxy and Nginx on throughput. I won’t break down the rest of the wrk statistics, but it appears there is more variance in the DSR latency distribution. Most of the requests are really fast, much faster than HAProxy and Nginx, but some in the 99th percentile are really pretty slow. Something to dig into more in the future perhaps.
Takeaways
- Rust is really fast
- Rust on tokio is really, really fast
- I think there are opportunities for low level networking packages in Rust which focus on throughput for these types of workloads. libpnet is very handy for handling and manipulating packets in user space, but there is no Async support and at the IP/TCP layers much of the packet manipulation is copying bytes around in the background.
- I relied on some useful Rust packages for things like consistent hashing, but in researching what to use it doesn’t feel like some of them are actively maintained. I’m more of a Rust fan now than before, but I hope more people and companies start to use Rust for their projects so some of these packages get more support.
- Some ToDo’s: look into leveraging BPF for filtering ingress in the kernel and re-evaluate pnet for Passthrough and DSR modes (netlink style packet forwarding might be easier/more efficient/faster).
So that’s it! I’m happy with the results, although I’m sure there is room for improvement in the code. In the end I was able to learn more Rust, wrap my head around Futures and Async programming in Rust, and even re-learn or better learn some concepts around load balancing and networking. I also feel Convey is potentially a relevant solution for some workloads. It’s not industrial strength like some of the other options out there, but it does offer a fast proxy and a network load balancer which (imo) is simpler to get up and running than some of the other options like IPVS.