Rapido - rapidly creating test VMs for driver development

Thanks to rapido, it’s become much simpler to test Linux device drivers for real PCIe devices in VMs.

The advantages of this approach are:

  • the host is protected from memory corruption errors caused by buggy kernel drivers
  • the PCI peripheral can be physically installed in a multi-use machine, reducing hardware & lab requirements
  • debugging info is easily available
  • the development cycle is short and simple - rapid even :)

Here’s a screencast showing the workflow for a simple developer test (thanks to termtosvg):

[Animated screencast of the rapido workflow]

Here’s what’s happening:

  1. The test environment is started with one simple command: ./rapido cut net-test-tn40xx
  2. A new initramfs is created.
  3. QEMU boots a new VM, passing a real PCI network device from the host to the VM.
  4. The kernel boots and loads the initramfs containing the test tools.
  5. The in-development tn40xx driver loads.
  6. Ethernet carrier is negotiated and IP addresses are assigned.
  7. The Undefined Behavior Sanitizer (UBSAN) detects a bug in the tn40xx driver.
  8. The iperf3 client performs a simple TCP goodput test.
  9. The VM exits.

Initramfs creation

Rapido takes care of building an initramfs (by calling dracut) that contains the software under test, test tools and scripted test steps.

Rapido uses a ‘cut script’ to define:

  • the name of the autorun test script (reminds me a bit of classic BSD-style init scripts, in a good way).
  • the VM resources (2 cores, 512M RAM)
  • additional drivers to install in the initramfs
  • userspace tools to install in the initramfs

Here is the ‘cut’ script used by rapido for generating the initramfs:

#!/bin/bash
# SPDX-License-Identifier: (LGPL-2.1 OR LGPL-3.0)
# Copyright (C) SUSE LLC 2022, all rights reserved.

RAPIDO_DIR="$(realpath -e ${0%/*})/.."
. "${RAPIDO_DIR}/runtime.vars"

_rt_require_dracut_args "$RAPIDO_DIR/autorun/net_test_tn40xx.sh" "$@"
_rt_require_networking
_rt_cpu_resources_set "2"
_rt_mem_resources_set "512M"

"$DRACUT" \
        --install "hostname lspci ip bridge ethtool iperf3 nc" \
        --add-drivers "tn40xx" \
        --modules "base" \
        "${DRACUT_RAPIDO_ARGS[@]}" \
        "$DRACUT_OUT" || _fail "dracut failed"

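To double-check what ended up in the image, dracut’s lsinitrd tool can list the initramfs contents. A minimal sketch, with the image path as a placeholder (rapido writes the image to whatever $DRACUT_OUT points at):

# List the driver and test tools inside the generated initramfs
# (the image path is illustrative - substitute your $DRACUT_OUT)
lsinitrd /path/to/initramfs.img | grep -E 'tn40xx|iperf3'
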
Virtual Machine definition

Rapido also starts QEMU after creating the initramfs. I’ve configured it to add a virtual IOMMU and pass the physical PCIe NIC to the VM.

Here’s the snippet of the QEMU extra arguments in my rapido.conf file:

QEMU_EXTRA_ARGS="\
 -M q35,accel=kvm,kernel-irqchip=split \
 -cpu host \
 -nographic \
 -device intel-iommu,intremap=on,caching-mode=on,device-iotlb=on \
 -device virtio-rng \
 -device vfio-pci,host=07:00.0 \
"

The passed-through vfio-pci device would be better placed in a cut script, since it doesn’t apply to all VMs. TODO
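
Passthrough also needs some host-side preparation: the NIC must be detached from its host driver and bound to vfio-pci before QEMU can claim it. A minimal sketch of the rebinding steps, assuming the same 07:00.0 PCI address as in the QEMU snippet above:

# On the host: detach the NIC from its current driver and bind it to vfio-pci
modprobe vfio-pci
echo 0000:07:00.0 > /sys/bus/pci/devices/0000:07:00.0/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/0000:07:00.0/driver_override
echo 0000:07:00.0 > /sys/bus/pci/drivers/vfio-pci/bind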

Kernel boots

The kernel running here is mainline 6.6.1. I’ve compiled it with debugging options appropriate for this kind of driver development. The ~1.5 second boot time is pretty good, although the debugging options make the output more verbose and also carry a noticeable performance cost.
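
The exact option set depends on which bug classes you want to catch; as an illustration, this is the kind of .config fragment I mean (UBSAN is the sanitizer that fires in the screencast; the rest are common companions):

# Illustrative kernel .config fragment for driver debugging
CONFIG_DEBUG_KERNEL=y
CONFIG_UBSAN=y
CONFIG_KASAN=y
CONFIG_DYNAMIC_DEBUG=y
CONFIG_PROVE_LOCKING=y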

Driver loads, tests happen

The test script in the screencast is this:

#!/bin/bash
# SPDX-License-Identifier: (LGPL-2.1 OR LGPL-3.0)
# Copyright (C) SUSE LLC 2022, all rights reserved.

_vm_ar_env_check || exit 1
_vm_ar_dyn_debug_enable

now() {
        # /proc/uptime reports seconds with two decimal places; append
        # zeros so it lines up with the kernel's [ssss.uuuuuu] log format
        declare -a _uptime
        readarray -d " " -t _uptime < /proc/uptime
        printf "[%12s][USPACE] " "${_uptime[0]}0000"
}

now; echo "Starting udevd"

/usr/lib/systemd/systemd-udevd &

now; echo "Loading Driver"
modprobe tn40xx || _fatal "Failed to load tn40xx module"

now; echo "Waiting for dynamic IP addresses..."
until [ -n "$(ip -oneline -4 addr show dev eth1 scope global)" ]; do
        sleep 1
done

now; echo "Starting iperf3"
iperf3 -c 10.1.4.153

now; echo "You can now do your manual testing..."

I thought it would be nice to have timestamps on the userspace output as well, so I added a little function for it.
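
For illustration, a line of output from the script then looks something like this (timestamp value invented):

[    2.780000][USPACE] Loading Driver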

There are opportunities for improvement here too:

  • a number of hardcoded system attributes that should be moved out of the test logic, e.g. eth1 (see the sketch after this list);
  • the test results should be stored for later analysis.
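
As a hedged sketch of removing the eth1 hardcoding, something like this could pick the first non-loopback interface instead (assumes a single test NIC):

# Discover the test interface rather than hardcoding eth1
iface="$(ip -oneline link show | awk -F': ' '$2 != "lo" { print $2; exit }')"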

The actual test steps for a network interface card need some thought. Something like this (a command-level sketch follows the list):

  • Cover various ioctls using ethtool
  • Add/remove/change VLANs
  • Change MTU
  • Promiscuous mode
  • Receive multicast groups
  • TSO / GSO
  • Driver unload / module removal
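
Many of these map directly to one-liners; a hedged sketch of what some of the steps could look like (eth1 and the VLAN ID are placeholders):

# Query link settings, offload features (TSO/GSO) and driver statistics
ethtool eth1
ethtool -k eth1
ethtool -S eth1
# Add and remove a VLAN
ip link add link eth1 name eth1.100 type vlan id 100
ip link del eth1.100
# Change the MTU and toggle promiscuous mode
ip link set eth1 mtu 9000
ip link set eth1 promisc on
# Unload the driver
modprobe -r tn40xx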

But this is a start :)

At some point I should also get a proper reference NIC like an Intel X520-DA2 as a benchmark. Update: I got an X550-T2 on loan from a friend. Thanks, Simeon!

Conclusion

An efficient, repeatable test workflow is essential for all development. This is my first meaningful progress in five years towards developing the tn40xx driver, and I can now start thinking about which tests are useful, doing regular releases, and tackling the major rework it needs.

It’s also quite satisfying to make progress on three topics I find interesting in one post:

  • development of the tn40xx driver
  • real hardware device passthrough to VMs, with IOMMU memory protection
  • ephemeral Linux-based micro VMs

Let the healing begin!