The Jitter Cliff: When TCP Goes Chaotic

Part 2: Why throughput collapses and what to do about it

In Part 1, we used “video over TCP” as a stress test for TCP’s behavior—examining how zero-window events, retransmits, and the masking effect reveal what’s happening inside a struggling connection.

But during those experiments, I discovered TCP throughput degraded rapidly as jitter worsened. While I knew that packet loss would destroy TCP throughput, I hadn’t quite expected the jitter-induced cliff.

At a certain jitter threshold, throughput collapses so severely that measurements become unreliable. Single tests can vary by over 100%. This “chaos zone” makes diagnosis treacherous: the same network conditions can produce wildly different results depending on when you measure.

This post explores TCP’s behavior under jitter and loss, comparing CUBIC and BBR. It’s common knowledge that TCP is inappropriate for delay-sensitive streaming data, and this post will try to demonstrate how and why.


Experimental Setup

The findings in this post come from controlled experiments using Linux network namespaces with tc/netem to inject precise network impairments (delay, jitter, loss).

Why these constraints? We wanted to study a single TCP stream carrying streaming data—video, audio, or telemetry—where the application produces data at a relatively consistent rate rather than as fast as possible. This differs from bulk file transfers, where the goal is maximum throughput and applications use large buffers with TCP autotuning (up to 32 MB on Linux). By using a modest receive buffer (256 KB) and a fixed send rate (80 Mbps), we model a constrained streaming scenario where the jitter cliff behavior becomes visible. The absolute throughput numbers in this post reflect this setup; the ratios between algorithms and conditions are what generalize.

Key parameters:

Parameter | Value | Notes
Target send rate | 10 MB/s (80 Mbps) | Sender paced at this rate
Receive buffer | 256 KB | Sized to avoid zero-window events
RTT range | 24-100ms | Controlled via netem delay
Jitter range | 0-24ms | Controlled via netem delay ... distribution
Loss range | 0-5% | Controlled via netem loss
Duration | 10 seconds per test | Multiple iterations for statistical validity
Topology | veth pairs | Between Linux network namespaces
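
For reference, impairments in this range can be injected with tc/netem from a small harness. A minimal sketch, assuming root privileges and a sender namespace named ns_tx with a veth device veth_tx (both names are illustrative, and the normal-distribution tables ship with iproute2):

import subprocess

def apply_netem(netns: str, dev: str, delay_ms: int, jitter_ms: int, loss_pct: float) -> None:
    """Replace the root qdisc on dev inside netns with a netem qdisc."""
    subprocess.run([
        "ip", "netns", "exec", netns,
        "tc", "qdisc", "replace", "dev", dev, "root", "netem",
        "delay", f"{delay_ms}ms", f"{jitter_ms}ms", "distribution", "normal",
        "loss", f"{loss_pct}%",
    ], check=True)

# One point from the sweep: 50ms base delay, ±10ms jitter, 0.5% random loss
apply_netem("ns_tx", "veth_tx", delay_ms=50, jitter_ms=10, loss_pct=0.5)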

The receive buffer was set to 256 KB via SO_RCVBUF. Linux doubles this value internally for bookkeeping overhead (see socket(7))—verified by getsockopt() returning 512 KB. Packet capture analysis showed:

  • Requested: 256 KB (setsockopt)
  • Allocated: 512 KB (getsockopt)
  • Maximum advertised window: 313 KB
  • RTT: 50.2ms (from TCP timestamps)

This gives a theoretical maximum of 313 KB / 0.050s ≈ 50 Mbps. The measured baseline of ~42 Mbps is 85% of this theoretical maximum—the gap reflects TCP overhead from slow start, congestion window probing, and ACK timing. All figures in this post are relative to this ~42 Mbps baseline.

Why baseline throughput is ~42 Mbps, not 80 Mbps: TCP throughput is capped by the receive window and the bandwidth-delay product (BDP)—the amount of data that can be “in flight” (sent but not yet acknowledged) at any moment. With a ~313 KB advertised window and 50ms RTT, the sender can have at most one window of data outstanding per round trip, so throughput cannot exceed window / RTT no matter how fast the application writes. Even in these benign conditions, throughput is significantly below the target rate of 80 Mbps.
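
To make the arithmetic concrete, here is a small sketch that reproduces the buffer and ceiling numbers above. The Linux doubling of SO_RCVBUF is documented in socket(7); the 313 KB maximum window comes from the capture analysis, not from anything this snippet measures:

import socket

RTT_S = 0.050                 # measured RTT from TCP timestamps
REQUESTED = 256 * 1024        # SO_RCVBUF requested by the receiver

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, REQUESTED)
allocated = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"requested {REQUESTED // 1024} KB, kernel reports {allocated // 1024} KB")  # ~512 KB on Linux

# Throughput ceiling: at most one advertised window can be in flight per RTT.
max_window_bytes = 313 * 1024          # maximum window observed in the captures
ceiling_mbps = max_window_bytes * 8 / RTT_S / 1e6
print(f"theoretical ceiling: {ceiling_mbps:.0f} Mbps")  # ~51 Mbps, rounded to ~50 in the text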

The experiments compared CUBIC (Linux default) and BBR congestion control algorithms.

Results were obtained from more than 1,500 experimental runs on Linux kernel 6.12 with BBRv3, sweeping the jitter, delay, loss, and congestion control algorithm parameters.


The Jitter Cliff: RTT-Relative Collapse

A moderate amount of jitter (variation in packet delay) is inevitable on real networks. Routers queue packets, wireless links have variable latency (transmission rates shift as the modulation and coding scheme adapts), and congested paths add unpredictable delays. TCP is designed to handle this, within bounds.

This investigation is about learning where the boundaries are and what happens near them. Part 1 showed a rapid decay in throughput; let’s characterize it. The cliff occurs when jitter reaches roughly 15-30% of the RTT, depending on the congestion control algorithm and conditions.

[Figure: Throughput retention vs. jitter as a percentage of RTT, showing the cliff where throughput collapses]

The plot shows throughput retention (percentage of maximum achievable throughput) versus jitter expressed as a percentage of RTT. Above 50% retention is functional; below is degraded or unusable.

RTT | CUBIC Cliff | BBR Cliff | Notes
24ms | ±8ms (33%) | ±4ms (17%) | BBR more sensitive at low RTT
50ms | ±12ms (24%) | ±8ms (16%) | BBR collapses earlier
100ms | ±16ms (16%) | ±16ms (16%) | Both similar at high RTT

Notably, BBR’s cliff occurs earlier than CUBIC’s at lower RTTs. At 50ms RTT, BBR starts degrading at 16% jitter while CUBIC holds until 24%. BBR’s pacing model is more sensitive to timing disruptions from jitter, even though it handles packet loss better (as we’ll see later).

These thresholds matter because real-world networks operate in this range. Starlink, for example, has baseline jitter of ~7ms with ~27ms RTT—a 26% ratio, right at the CUBIC cliff. We’ll examine this in detail later.
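
The thresholds above translate into a simple rule of thumb. A rough classifier, using the approximate cutoffs from these experiments (they are not universal constants):

def cliff_zone(jitter_ms: float, rtt_ms: float) -> str:
    """Classify a path relative to the jitter cliff, using ~10% and ~30% of RTT as cutoffs."""
    ratio = jitter_ms / rtt_ms
    if ratio < 0.10:
        return f"below the cliff ({ratio:.0%}): stable for both CUBIC and BBR"
    if ratio < 0.30:
        return f"chaos zone ({ratio:.0%}): expect high run-to-run variance, especially with CUBIC"
    return f"past the cliff ({ratio:.0%}): collapsed; BBR degrades more gracefully"

print(cliff_zone(7, 27))    # ~26%: Starlink baseline, right at the CUBIC cliff
print(cliff_zone(12, 50))   # 24%: at the CUBIC cliff
print(cliff_zone(12, 100))  # 12%: low end of the chaos zone; still functional in these tests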

The Counter-Intuitive Implication

This leads to a surprising conclusion: higher RTT means more jitter tolerance.

At ±12ms of jitter:

  • 50ms RTT: CUBIC 2.4 Mbps, BBR 4.7 Mbps (collapsed—jitter is 24% of RTT)
  • 100ms RTT: CUBIC 16 Mbps (functional—jitter is only 12% of RTT)

That’s nearly a 7x difference in CUBIC throughput for the same absolute jitter, despite the higher RTT.

This matters for network paths that include a satellite hop: the satellite segment may contribute more or less to TCP throughput collapse than you might expect. If the roughly-20%-of-RTT rule of thumb holds, a geostationary satellite link with 600ms RTT could theoretically handle ±120ms of jitter while a 50ms LEO hop would collapse at ±10ms. (Note: GEO latencies were not simulated or tested—this is an extrapolation.)

The Chaos Zone: Why Single Measurements Lie

Near the jitter cliff, results become highly variable. Run the same test five times and you might get five dramatically different answers.

[Figure: Coefficient of variation by jitter region, showing the chaos zone where CUBIC CV reaches 66%]

Region | CUBIC CV | BBR CV | Interpretation
Below cliff | 1-5% | 1-5% | Stable, predictable
At cliff (chaos zone) | 21-66% | 14-31% | CUBIC unstable, BBR more predictable
Above cliff | 5-19% | 3-5% | Both collapsed, BBR still more stable

In the chaos zone at 50ms RTT with 10ms jitter, CUBIC showed coefficient of variation up to 66%—meaning throughput varied wildly between runs. BBR’s CV stayed below 20% even in the worst conditions.

Practical implication: If your network operates near the jitter cliff (jitter 10-30% of RTT), don’t trust single measurements. Run at least five tests and look at the distribution.
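
Computing the spread is straightforward. A minimal sketch of the coefficient of variation (standard deviation divided by mean) over a handful of runs, with illustrative numbers rather than real measurements:

from statistics import mean, stdev

def coefficient_of_variation(samples: list[float]) -> float:
    return stdev(samples) / mean(samples)

# Hypothetical chaos-zone CUBIC runs under identical conditions
throughputs_mbps = [28.1, 5.3, 31.0, 9.8, 22.4]
cv = coefficient_of_variation(throughputs_mbps)
print(f"mean {mean(throughputs_mbps):.1f} Mbps, CV {cv:.0%}")

A CV above roughly 20% is a sign you are in the chaos zone and should report the distribution, not a single number.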

Why CUBIC Becomes Unpredictable

What causes CUBIC’s high variance in the chaos zone? The data shows the pattern clearly, even if the mechanism is complex.

[Figure: Box plot of CUBIC vs. BBR throughput distributions across jitter levels]

At 10ms jitter (20% of RTT):

Algorithm | Range | Mean | CV
CUBIC | 2-33 Mbps | 12 Mbps | 66%
BBR | 4-9 Mbps | 7 Mbps | 18%

[Figure: Histogram comparison at 10ms jitter, showing CUBIC’s long tail vs. BBR’s tight cluster]

CUBIC’s distribution has a long right tail—occasional “lucky” runs achieved 3-4x the median throughput. BBR clusters tightly around its mean. This isn’t bimodal behavior; CUBIC seems to be highly sensitive to initial conditions while BBR remains consistent.

Packet capture analysis revealed the mechanism. The key metric is inter-packet gap—how long the sender waits between packets:

Condition | CUBIC Gap | BBR Gap
Low jitter (stable) | ~0.5ms | ~0.5ms
High jitter (collapsed) | 5-8ms | 2-3ms

When jitter crosses the threshold, CUBIC’s gap jumps 10-15x. BBR’s grows only 4-6x. Longer gaps mean lower throughput.
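
The gap measurement itself is easy to reproduce from a capture. A sketch using scapy; the capture filename and the sender address are placeholders:

from statistics import median
from scapy.all import rdpcap, IP, TCP

SENDER_IP = "10.0.0.1"   # sender-side address in the test namespace (placeholder)

pkts = rdpcap("cubic_10ms_jitter.pcap")
send_times = [float(p.time) for p in pkts
              if IP in p and TCP in p
              and p[IP].src == SENDER_IP and len(p[TCP].payload) > 0]
gaps_ms = [(b - a) * 1000 for a, b in zip(send_times, send_times[1:])]
print(f"median inter-packet gap: {median(gaps_ms):.2f} ms")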

The hypothesis: High jitter causes spurious loss detection—packets arriving out of order or delayed beyond the retransmit timeout get mistaken for lost packets. Each “loss” triggers a cwnd reduction. But recovery is slow: during congestion avoidance, cwnd grows by roughly one segment per RTT. Above the threshold, reductions compound faster than recovery, and cwnd cascades toward its minimum. BBR survives better because its pacing-based approach doesn’t reduce sending rate aggressively on loss.

Caveat: We measured inter-packet gaps, not cwnd directly. This mechanism is plausible but not proven.
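
To see why the compounding matters, here is a deliberately crude toy model of the hypothesis (it is not CUBIC’s actual window dynamics): each RTT either adds roughly one segment, or a spurious loss event multiplies the window by CUBIC’s backoff factor of about 0.7.

import random

def cwnd_after(loss_events_per_rtt: float, rtts: int = 2000, start: float = 40.0) -> float:
    cwnd = start
    for _ in range(rtts):
        if random.random() < loss_events_per_rtt:
            cwnd = max(2.0, cwnd * 0.7)   # multiplicative decrease on a (spurious) loss
        else:
            cwnd += 1.0                   # additive increase, ~1 segment per RTT
    return cwnd

for rate in (0.01, 0.05, 0.2):
    print(f"{rate:.0%} of RTTs see a loss event -> cwnd ends around {cwnd_after(rate):.0f} segments")

Once loss events arrive faster than the window can rebuild, the window (and with it throughput) settles at a small fraction of its clean-path value.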


BBR vs CUBIC

Linux defaults to CUBIC for congestion control. BBR (Bottleneck Bandwidth and Round-trip propagation time) is an alternative developed by Google. The question everyone asks: which is better?

The answer: it depends on the amount of jitter and loss you expect on your path. For low amounts of jitter and loss, CUBIC still has a role to play.

[Figure: BBR vs. CUBIC heatmap showing when each algorithm wins, by RTT and jitter]

The heatmap shows the BBR/CUBIC throughput ratio across different RTT and jitter combinations. Green means BBR is better, red means CUBIC is better, and yellow means they’re roughly equal.

When BBR Wins

Condition | BBR Advantage
Jitter > 30% of RTT | 3-5x better
Post-cliff (both algorithms collapsed) | 3-5x better
Any significant packet loss | 2-17x better

After the cliff, CUBIC flatlines at 0.7-0.9 Mbps while BBR maintains 3-5 Mbps. BBR degrades gracefully; CUBIC collapses sharply.

When CUBIC Wins (or Ties)

Condition | Result
Jitter < 10% of RTT | Similar or CUBIC slightly better
Clean network, no loss | Similar performance

The Surprising Middle Ground

In the chaos zone (jitter 10-30% of RTT), BBR can actually be worse than CUBIC in terms of mean throughput. At 50ms RTT with ±8ms jitter (16% of RTT):

  • CUBIC: 29 Mbps mean (but CV = 21%)
  • BBR: 14 Mbps mean (CV = 31%)
  • BBR/CUBIC ratio: 0.49x

BBR achieves roughly half the throughput of CUBIC at this operating point. However, CUBIC’s higher mean masks significant run-to-run variance. In the worst part of the chaos zone (±10ms jitter), CUBIC’s CV reaches 66%—meaning some runs achieve good throughput while others collapse entirely.

This caught me off guard. BBR’s model-based approach sometimes makes suboptimal decisions in conditions where CUBIC’s loss-based reactions happen to work better. However, BBR’s lower throughput comes with far less run-to-run variance, and for video streaming, predictable throughput often matters more than maximum throughput.

Loss Tolerance: Where BBR Dominates

Jitter creates complexity for TCP, but packet loss is where CUBIC truly struggles.

[Figure: Loss tolerance comparison showing BBR maintaining throughput while CUBIC collapses]

Loss Rate | CUBIC | BBR | BBR Advantage
0% | 43 Mbps | 43 Mbps | 1.0x
0.1% | 14 Mbps | 39 Mbps | 2.8x
0.25% | 8 Mbps | 34 Mbps | 4.2x
0.5% | 5 Mbps | 29 Mbps | 6.1x
1% | 3 Mbps | 25 Mbps | 7.8x
2% | 2 Mbps | 21 Mbps | 10.6x
5% | 1 Mbps | 17 Mbps | 17.6x

(Tested at 50ms RTT, 0% jitter, 5 iterations per condition)

Even 0.1% packet loss—one packet in a thousand—causes CUBIC throughput to drop by 67%. BBR maintains 91% efficiency at the same loss rate.

At 5% loss, the difference is staggering: BBR provides 17.6x the throughput of CUBIC.
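
For intuition about why such small loss rates are so damaging, the classic Mathis et al. approximation for Reno-style, loss-driven TCP bounds throughput at roughly (MSS / RTT) * sqrt(3/2) / sqrt(p). CUBIC’s response differs in detail, but the 1/sqrt(p) shape is why loss-based congestion control falls off a cliff as p grows. A quick calculation for this test’s parameters, as a sketch rather than a prediction of CUBIC’s exact numbers:

from math import sqrt

MSS_BYTES = 1448   # typical Ethernet MSS
RTT_S = 0.050

def mathis_ceiling_mbps(loss_prob: float) -> float:
    return (MSS_BYTES * 8 / RTT_S) * sqrt(1.5) / sqrt(loss_prob) / 1e6

for p in (0.001, 0.01, 0.05):
    print(f"p = {p:.3f}: ceiling ~{mathis_ceiling_mbps(p):.1f} Mbps")
# ~9.0, ~2.8, and ~1.3 Mbps: the same order of magnitude as the measured CUBIC column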

Why the Asymmetry?

CUBIC interprets every lost packet as a congestion signal and aggressively backs off. This is appropriate when loss is caused by buffer overflow—it means the network is genuinely oversaturated.

But loss can have other causes: wireless interference, cable faults, router bugs, or even ECN-incapable middleboxes dropping marked packets. In these cases, backing off doesn’t help—the path capacity hasn’t changed, just some packets got corrupted.

BBR uses a model-based pacing approach. It estimates the bottleneck bandwidth and RTT independently of loss, and paces packets accordingly. Random loss doesn’t cause BBR to dramatically reduce its sending rate.

Recommendation

On any network with adverse jitter or packet loss, use BBR. The advantage is significant.


Starlink: A Real-World Test Case

The experiments so far used synthetic conditions. How do these findings apply to real networks? Starlink provides a good test case—it has measurable jitter, occasional packet loss, and millions of users trying to stream video over it.

Based on published measurements:

Metric | Typical Range | Source
RTT | 25.7ms median (US), 30-80ms range | Starlink official, APNIC
Jitter | 6.7ms average, 30-50ms at handover | APNIC
Packet Loss | 0.13% baseline, ~1% overall | WirelessMoves, APNIC
Handover | Every 15 seconds | APNIC

Critically, Starlink’s loss is not congestion-related—it’s caused by satellite handovers and radio impairments. CUBIC’s loss-based congestion control misinterprets these as congestion signals, causing unnecessary throughput reduction.

Now we can connect these characteristics to the jitter cliff thresholds from earlier:

Profile | Jitter | RTT | Jitter/RTT | Cliff Zone
Baseline | ±7ms | 27ms | 26% | At CUBIC cliff
Handover | ±40ms | 60ms | 67% | Past both cliffs
Degraded | ±15ms | 80ms | 19% | Chaos zone

Starlink’s baseline operation—not degraded, not during handover, just normal—sits right at the CUBIC cliff threshold. This explains why TCP performance over Starlink is so sensitive to congestion control algorithm choice.

Experimental Results

[Figure: Starlink profile comparison showing the BBR advantage across different conditions]

Profile | RTT | Jitter | Loss | CUBIC | BBR | BBR Advantage
Baseline | 27ms | ±7ms | 0.125% | 2.3 Mbps | 7.3 Mbps | 3.1x
Handover | 60ms | ±40ms | 1.0% | 0.5 Mbps | 2.2 Mbps | 4.2x
Degraded | 80ms | ±15ms | 1.5% | 1.3 Mbps | 3.7 Mbps | 2.9x

(200 iterations per profile, parameters matched to cited Starlink characteristics)

Even at baseline conditions with only 0.125% loss, BBR provides 3.1x the throughput of CUBIC.

A note on absolute vs. relative throughput: The absolute Mbps values above reflect our constrained test setup (256 KB buffer, single stream). Real-world Starlink achieves higher absolute throughput because applications use larger buffers and speed tests use multiple parallel connections. However, the ratios we measured match real-world observations remarkably well.

Real-World Validation

Independent testing on actual Starlink connections confirms the pattern we observed in simulation:

Source | CUBIC | BBR | BBR Advantage
Our simulation | 2.3 Mbps | 7.3 Mbps | 3.1x
WirelessMoves (2023) | ~20 Mbps | >100 Mbps | ~5x

The WirelessMoves testing used single-connection iperf3 tests over real Starlink hardware. Despite the 10x difference in absolute throughput (due to larger buffers and different conditions), the BBR advantage ratio is consistent: 3-5x better than CUBIC.

This also explains why Starlink users don’t universally complain about poor TCP performance:

  1. Speed tests use multiple parallel connections. Speedtest.net and similar tools aggregate many TCP streams, masking single-connection limitations. A user might see “150 Mbps” on Speedtest while a single video stream struggles at 20 Mbps with CUBIC.

  2. Many major services use BBR. Google, Netflix, and Cloudflare have deployed BBR on their servers. Users streaming from these services get BBR’s benefits without changing anything on their end.

  3. Adaptive bitrate masks the problem. Video services like YouTube and Netflix adjust quality based on available throughput. Users see “480p” instead of “buffering,” which feels like a content choice rather than a network failure.

Video Quality Mapping

What does this mean for actual video quality?

Condition | With CUBIC | With BBR | Recommendation
Baseline (2.3 vs 7.3 Mbps) | 360p choppy | 720p smooth | BBR strongly recommended
Handover (0.5 vs 2.2 Mbps) | Unusable | 360p barely | BBR + buffer for handovers
Degraded (1.3 vs 3.7 Mbps) | 360p choppy | 480p usable | BBR essential

Video quality estimates assume ~2.5 Mbps for 480p, ~5 Mbps for 720p (H.264). These represent relative performance differences between algorithms under simulated Starlink conditions.

The Handover Problem

Starlink satellites hand off every 15 seconds. During handover (per APNIC measurements):

  • RTT spikes by 30-50ms (e.g., 30ms → 80ms)
  • Jitter increases significantly
  • Packet loss spikes occur

These are brief disruptions, not 15-second outages. Video applications need enough buffer to absorb the throughput dip during handover—likely a few seconds, not 15.
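
A back-of-the-envelope sizing, with illustrative numbers (a 5 Mbps stream whose delivery dips to 2 Mbps for 2 seconds around a handover):

def buffer_needed_s(bitrate_mbps: float, dip_rate_mbps: float, dip_s: float) -> float:
    """Seconds of buffered playback needed to ride out a delivery dip without stalling."""
    shortfall_mbit = (bitrate_mbps - dip_rate_mbps) * dip_s
    return shortfall_mbit / bitrate_mbps

print(f"{buffer_needed_s(5.0, 2.0, 2.0):.1f} s of buffered video")   # 1.2 s

A couple of seconds of client buffer covers a single handover dip; it does not help with sustained collapse.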


Practical Recommendations

[Figure: Congestion control decision guide based on network conditions]

Decision Guide

  1. Stable networks (jitter < 10% of RTT, loss < 0.1%): Use CUBIC (Linux default)
  2. Moderate jitter (10-30% of RTT): Test both algorithms—results vary
  3. High jitter (> 30% of RTT): Use BBR
  4. Any significant packet loss (> 0.1%): Use BBR
  5. Satellite links (Starlink, LEO, GEO): Use BBR

How to Switch Congestion Control

# Enable BBR (requires root, kernel 4.9+)
sudo modprobe tcp_bbr
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr

To make permanent, add tcp_bbr to /etc/modules-load.d/bbr.conf and net.ipv4.tcp_congestion_control=bbr to /etc/sysctl.conf.
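
Applications can also opt in per connection instead of changing the system default. On Linux, Python (3.6+) exposes the TCP_CONGESTION socket option; the tcp_bbr module still needs to be loaded as above:

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"bbr")
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16))  # name of the active algorithm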


Implications for JitterTrap

This investigative work was done to support JitterTrap. When users report video stuttering or unexplained throughput problems, the tool needs to help them understand why—not just show that something is wrong. The jitter cliff and chaos zone findings directly inform what JitterTrap should measure and how it should present that information.

Planned improvements based on this work:

  1. Jitter/RTT ratio indicator: Show whether the network is below, at, or above the cliff threshold
  2. Chaos zone warning: Alert when measurements may be unreliable (jitter 10-30% of RTT)
  3. Congestion control guidance: Recommend BBR vs CUBIC based on observed conditions
  4. Stability indicator: Display coefficient of variation to distinguish consistent problems from chaotic ones

The Bigger Picture

The jitter cliff is a real problem for video over TCP. Throughput can collapse by 90% or more, and near the cliff, behavior becomes unpredictable. But understanding why this happens points to a deeper issue.

TCP’s flow control relies on back-pressure: when the network is impaired, TCP slows down and signals the sender to wait. This works for file transfers, database queries, or web requests—applications that can pause. But video can’t pause:

  1. Applications that cannot slow down: A live video encoder produces frames at a fixed rate regardless of network conditions. When TCP’s send buffer fills, frames queue up, latency grows unboundedly, and eventually data drops catastrophically. TCP signals back-pressure, but the encoder has no mechanism to respond—it can’t “skip this frame” or “reduce bitrate” based on socket buffer state.

  2. Disconnected back-pressure: Consider UDP video inside a TCP VPN tunnel. When TCP throughput drops—due to the jitter cliff, loss, or BDP limits—the video source keeps sending at its configured rate while the tunnel delivers at a fraction of that. Latency climbs as data queues. TCP retransmits packets the receiver may no longer care about. The back-pressure never reaches the encoder.

The jitter cliff isn’t a bug in TCP—it’s TCP doing what it was designed to do. The failure is architectural: TCP guarantees “deliver everything, in order, eventually,” but video needs “deliver what you can now, drop what’s stale, and tell me to adapt.”


Summary

  • The jitter cliff: Throughput collapses when jitter exceeds roughly 15-30% of RTT. Higher RTT = more tolerance.
  • The chaos zone: Near the cliff, CUBIC varies 21-66%; BBR stays at 14-31%. Don’t trust single tests.
  • BBR vs CUBIC: BBR wins under loss (up to 17.6x better) and post-cliff. CUBIC can win in the chaos zone but is less predictable.
  • Practical: Use BBR on lossy or satellite networks; CUBIC on stable networks; test both near the cliff.
  • Limitations: Lab simulation (1,500+ runs, Linux 6.12/BBRv3) validated against real-world Starlink measurements showing consistent ratios.

What’s Next

Part 3 will explore SRT (Secure Reliable Transport)—a protocol designed specifically for live video that borrows from TCP but fixes what makes TCP unsuitable:

  • Bounded latency: SRT enforces a maximum latency; packets that arrive too late are dropped, not delivered
  • Sender feedback: The receiver reports packet loss and timing back to the sender, enabling adaptive bitrate
  • Selective retransmission: Only retransmit packets that can still arrive in time
  • Application-layer control: The video encoder can respond to network conditions

SRT asks the right question: “What can I deliver within this latency budget?” rather than TCP’s “How do I eventually deliver everything?”

This research is part of Project Pathological Porcupines—an ongoing systematic exploration into the kinds of issues that delay-sensitive networking applications encounter, and how JitterTrap can help us understand these problems and improve our applications. Both the research and JitterTrap itself are works in progress.