In my previous post, we hit a wall with the kernel's multicast forwarding logic. The core issue was the RPF (Reverse Path Forwarding) check failure for sources on unroutable networks. While digging deeper into this, I found it's a well-documented challenge in multicast literature, often called the "last-hop/first-hop problem". It describes the difficulty of bridging the gap between an isolated multicast source and the main network.

The investigation concluded that a userspace relay was the right approach. This post introduces MCR (Multicast Relay) (https://github.com/acooks/mcr), a specialized userspace relay written in Rust, as a modern, high-performance implementation of that architectural pattern. I built the project myself, with help from the AI models Gemini and Claude. Its goal is to provide a robust solution to the first-hop problem in deployments where throughput and dynamic control are critical.

Why Userspace? A Necessary Trade-off

Before diving into the architecture, it's important to address a fundamental question: why move packet forwarding into userspace at all? For raw throughput, kernel-native forwarding is almost always faster. The decision to build a userspace relay was not driven by a desire for more performance, but by necessity, after concluding that kernel-level solutions were insufficient or impractical for this specific problem.

The core challenge is the kernel's Reverse Path Forwarding (RPF) check. While it's true that this check can be disabled on a per-interface basis (by setting the rp_filter sysctl to 0), this is often an inadequate solution for two key reasons:

  1. It's a Blunt Instrument: Disabling rp_filter turns off a key anti-spoofing security mechanism for all traffic on that interface, not just the multicast stream in question. This is a security trade-off many administrators are unwilling to make.
  2. It Doesn't Solve the Whole Problem: More importantly, simply disabling the RPF check on the first router doesn't "clean" the packet. The packet is still forwarded with its original, unroutable source address, which will cause RPF failures on correctly configured downstream routers. It pushes the problem one hop further away without solving it.

With the rp_filter approach being insufficient, we are left with other kernel-level mechanisms, which also have limitations:

  • No Inbound Multicast SNAT: Our previous investigation showed that Netfilter's conntrack architecture does not support SNAT on inbound multicast traffic, which would otherwise be the most direct way to "clean" the source address.
  • Static Forwarding Limits: Using direct control of the Multicast Forwarding Cache (MFC) works for routable sources, but the kernel has a hard-coded limit of 32 Virtual Interfaces (MAXVIFS), making it unsuitable for high-density scenarios.

Given these constraints, a targeted userspace application becomes the most practical way to solve the problem robustly at the network edge. It allows us to "launder" the packet once, making it fully compliant for the rest of its journey through the network. This trade-off—sacrificing the raw speed of the kernel for the correctness and flexibility of a userspace application—means that performance cannot be taken for granted and must be a central goal of the design.

How it Works: MCR's Architecture

MCR's architecture is designed to solve the RPF problem by bypassing the kernel's IP stack and giving the operator direct, dynamic control over forwarding. It consists of two main components:

  1. The Supervisor & Workers: The main multicast_relay process acts as a supervisor. It spawns high-performance "worker" processes, pinning each to a specific CPU core. These workers form the data plane, doing the actual packet processing.
  2. The Control Plane: MCR is managed dynamically at runtime. A separate control_client tool communicates with the supervisor over a UNIX socket, allowing you to add, remove, and list forwarding rules without ever stopping the service.

This design separates the data plane, which is optimized for performance, from the control plane, which is optimized for flexible management.

Data Plane Design for Performance

At its core, MCR operates on a simple but powerful principle: if the kernel won't forward the packet, MCR will intercept it before the kernel has a chance to drop it. This is achieved through several key architectural decisions:

  • Unified io_uring Event Loop: At the heart of each worker is a single, unified event loop built on Linux's most advanced I/O interface, io_uring. This allows a single thread to manage both ingress and egress asynchronously with minimal syscall overhead, a design which avoids the need for complex and often slower cross-thread communication.
  • AF_PACKET for Raw Sockets: By operating at Layer 2, MCR sidesteps the kernel's entire IP stack on the ingress path. This is the key to bypassing the RPF check and gives MCR full control over packet handling.
  • Efficient In-Memory Fan-Out: When a single input stream must be replicated to multiple outputs, MCR uses a reference-counted pointer (Arc<[u8]>) to share a single packet's memory buffer across multiple send operations. This avoids expensive memory copies in userspace, a critical optimization for high-density replication scenarios.
  • Core-Local Buffer Pools: Each worker pre-allocates and manages its own memory buffers, a strategy aimed at avoiding the performance penalty of dynamic memory allocation in the fast path.

While even higher performance might be achievable with technologies like eBPF/XDP, the AF_PACKET and io_uring approach was chosen deliberately. It offers excellent performance while remaining compatible with widely deployed enterprise Linux kernels, avoiding the tight coupling to specific kernel versions and the complex toolchain dependencies that an eBPF-based solution would entail.

Example Workflow

A typical workflow looks like this: an operator starts the multicast_relay service and then uses the control_client to provision forwarding rules. For example, to relay a stream arriving on eth0 to a new destination via eth1, the command is simple and declarative:

./control_client add-rule \
    --input-interface eth0 \
    --input-group 239.0.0.1 \
    --input-port 5001 \
    --output-interface eth1 \
    --output-group 239.0.0.2 \
    --output-port 6001

This command instructs a worker to listen for multicast traffic for 239.0.0.1:5001 on eth0 and re-transmit any received packets as a new stream to 239.0.0.2:6001 via eth1. The source address of the new stream is the relay server's own, making it fully routable.

What About socat?

For many network tasks, the versatile socat utility is an excellent choice. It can even be used to solve the RPF problem by creating a userspace relay, so it's worth understanding the architectural and operational differences between that approach and MCR's.

Architectural Difference: Layer 2 vs. Layer 4

The most significant difference is how each tool interacts with the network stack.

  • socat operates at Layer 4. It uses standard UDP sockets to receive and send packets. When it receives a packet, that packet has already been fully processed by the kernel's network stack (checksums verified, IP headers parsed, etc.). socat then takes the payload and sends it out via another UDP socket, causing the kernel to build a new packet. This is a simple and robust model.
  • MCR operates at Layer 2. As detailed in the architecture above, it uses AF_PACKET raw sockets to capture entire Ethernet frames directly from the network driver. This allows it to bypass the kernel's IP stack and, crucially, the RPF check.

Both tools solve the RPF problem, but in different ways. socat cleverly uses "local delivery"—the kernel sees a local application is the destination and doesn't apply the forwarding RPF check. MCR sidesteps the check entirely by intercepting the frame at a lower level.

Operational and Feature Differences

The architectural differences lead to different operational models:

  • Process Model: socat is a point-to-point tool. To relay a single multicast stream, you run a single socat process. To relay ten streams, you must run ten separate socat processes, which can be cumbersome to manage at scale. MCR uses a supervisor model where a single daemon can manage hundreds of forwarding rules concurrently across its worker processes.
  • Head-End Replication: A key feature of MCR is its ability to perform "fan-out" or "head-end replication," where a single input stream is replicated to multiple different outputs. socat's "one-to-one" model cannot do this; it can only forward a stream from one point to another.

Comparative Example: 1-to-1 Relay

Let's say we want to relay a stream from 239.1.1.1:5001 on eth0 to 239.10.10.10:6001 on eth1.

With MCR, you would ensure the supervisor is running and then add a rule dynamically:

# Add a rule to a running MCR instance
./control_client add-rule \
    --input-interface eth0 \
    --input-group 239.1.1.1 \
    --input-port 5001 \
    --output-interface eth1 \
    --output-group 239.10.10.10 \
    --output-port 6001

With socat, you would launch a dedicated process for this specific task:

socat -u \
  UDP4-RECV:5001,ip-add-membership=239.1.1.1:eth0 \
  UDP4-SENDTO:239.10.10.10:6001,bind=10.1.1.2

(Note: socat requires binding to the specific IP address of the egress interface, here assumed to be 10.1.1.2 on eth1.)

In summary, socat is an excellent tool for simple, static, one-to-one relay tasks. MCR is designed for more complex scenarios requiring high-density, dynamic rule management, and features like head-end replication.

Measured Performance

To complement the architectural and operational comparison, I conducted a series of benchmarks. The full performance validation reports are available in the project's documentation, but here's a summary of the key findings:

  • Baseline Reliability (Layer 3 Routing): In a standard Layer 3 routing test (e.g., forwarding between veth pairs), both MCR and socat demonstrated 100% packet delivery with 0% loss at moderate loads (e.g., 50,000 packets per second). For simple L3 scenarios, both tools are equally reliable.

  • Scalability Under High Load: Under a sustained load of 400,000 packets per second in the same Layer 3 topology, MCR maintained excellent performance with a negligible 0.2% packet loss, whereas socat dropped over 15.3% of packets. This highlights MCR's architectural advantage with io_uring's batched I/O, significantly reducing syscall overhead at high rates.

  • Architectural Suitability for L2 Bridging: A key use case for MCR is relaying traffic between logically separate Layer 2 domains, such as two different Linux bridges. In this scenario, MCR forwarded traffic with 0% loss, whereas socat, as a Layer 4 tool, is not designed to operate at this level and could not forward the traffic. This doesn't represent a flaw in socat, but rather illustrates a foundational difference in capability: MCR is explicitly designed for L2 operations that are outside the scope of standard UDP socket-based tools.

  • Extreme Fan-out: MCR was successfully validated performing a 1-to-50 head-end replication, where a single input stream was replicated to 50 unique outputs. This demonstrates MCR's ability to bypass the kernel's hard-coded MAXVIFS limit of 32 output interfaces for native multicast forwarding.

These results, gathered in a virtualized test environment, confirm MCR's capabilities for scenarios demanding high throughput and architectural versatility. A comprehensive analysis of performance on physical hardware is planned for a future article.

Why Rust?

Rust was chosen for MCR because its language features align well with the project's goals. Its focus on memory safety and its ownership model help prevent common classes of bugs, like buffer overflows and use-after-free errors, which are particularly critical in networking applications that deal directly with raw memory buffers and low-level system calls. This allows the development focus to remain on the core logic, with the compiler providing strong guarantees against many potential vulnerabilities. Of course, operating at Layer 2 requires powerful capabilities (CAP_NET_RAW), which brings significant security responsibility; future work will explore strategies for dropping these privileges as soon as they are no longer needed.

Conclusion: The Right Tool for the Job

The journey from the "disappearing multicast packet" to MCR highlights a core engineering principle: when you hit a fundamental limitation in one layer of the system, an effective solution is often to move up a layer and build a more specialized tool.

MCR is that tool. It attempts to solve the specific, real-world problem of unroutable multicast sources by providing a manageable and high-performance userspace relay. It's an exploration of how combining a deep understanding of the kernel's boundaries with modern APIs and a language like Rust can create powerful and reliable solutions to complex networking challenges.

This approach of a lightweight, userspace-driven relay that leverages the existing unicast network is not a new idea. It follows the design philosophy of protocols like SLIM (Self-configuring Lightweight Internet Multicast), which argued that the key to multicast adoption was to "nullify the management complexity" by avoiding multicast-specific infrastructure. MCR can be seen as a modern realization of that vision, implementing a similar architectural pattern with the high-performance, asynchronous tools available today.

References

[1] Hjálmtýsson, G., Brynjúlfsson, B., & Helgason, Ó. R. (2003). Overcoming Last-Hop/First-Hop Problems in IP Multicast. Lecture Notes in Computer Science.

[2] Hjálmtýsson, G., Brynjúlfsson, B., & Helgason, Ó. R. (2004). Self-configuring Lightweight Internet Multicast Protocol Specification. IEEE International Conference on Systems, Man and Cybernetics.

