
How we cut p99 latency by 40% across our edge network

A deep dive into our anycast routing rebuild, the dead-ends, and what actually moved the needle.

Daniel Reyes, Principal Engineer, Edge · Apr 22, 2026 · 12 min read

For the past eighteen months we've been quietly rebuilding the routing layer that sits between your visitors and our edge. Last week we finished. The headline number, a 40% drop in p99 latency across our top eight regions, looks clean on a marketing page, but it papers over six aborted approaches and a three-week stretch where nothing worked.

This is the engineering story. What we tried, what failed, and the one change that finally moved the needle.

Where we started

Like most hosts our age, we started with a regional model. A request from Tokyo would hit our Tokyo PoP, terminate TLS there, and forward to the origin in the same region. That gave us ~30ms p50 in-region, but cross-region requests — say, a Bangkok user hitting a Frankfurt origin — would balloon to 280ms p99. The internet's underlay routes are not optimized for our customers.

The first instinct was to add more PoPs. We doubled them, from nine to eighteen. p50 improved, p99 didn't. The long tail of slow requests was concentrated on a small fraction of paths where BGP convergence after a peer flap would briefly route traffic the long way around the planet.

When you're chasing p99, you're chasing the worst case — not the median. Adding capacity rarely helps. Routing intelligence almost always does.
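To make the p50/p99 split concrete, here is a minimal sketch using synthetic numbers. The sample counts, the 2% share of detoured requests, and the helper names are assumptions for illustration, not measurements from our network.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile (0 < p <= 100) of samples.
func percentile(samples []float64, p float64) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// syntheticLatencies builds n samples where slowCount requests take slowMs
// and the rest take fastMs. All values are made up for illustration.
func syntheticLatencies(n, slowCount int, fastMs, slowMs float64) []float64 {
	out := make([]float64, 0, n)
	for i := 0; i < n; i++ {
		if i < slowCount {
			out = append(out, slowMs)
		} else {
			out = append(out, fastMs)
		}
	}
	return out
}

func main() {
	// Hypothetical: 2% of requests take a long detour after a peer flap.
	before := syntheticLatencies(1000, 20, 30, 280)
	// More PoPs: the common path gets faster, the detour does not.
	after := syntheticLatencies(1000, 20, 20, 280)

	fmt.Printf("before: p50=%.0fms p99=%.0fms\n", percentile(before, 50), percentile(before, 99))
	fmt.Printf("after:  p50=%.0fms p99=%.0fms\n", percentile(after, 50), percentile(after, 99))
}
```

In this toy data set the median improves while p99 stays pinned at the detour latency, because the slow 2% still sits above the 99th-percentile cut.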

What didn't work

We tried three things over five months that, in retrospect, were variations on the same wrong idea.

  • Static GeoDNS overrides. We hand-wrote routing tables for the top 200 source ASNs. It worked for a week, then BGP shifted under us and the overrides became actively harmful.
  • Vendored RUM data. We bought real-user-monitoring data from a third party and used it to bias our anycast announcements. The data was a week stale on average. By the time we acted on it, the underlying routes had already moved.
  • Per-customer pinning. We let large customers nail their traffic to a specific PoP. This reduced their variance but pushed the load onto neighboring regions when that PoP got hot.

Each approach made our dashboards look better in the short term. None of them moved the customer-facing p99.

The shift

The breakthrough came from one of our newer SREs, Hannah, who asked a question the rest of us had stopped asking: why are we using DNS to make routing decisions at all?

DNS resolution happens once, gets cached, and then becomes the dominant factor in your routing for the next five minutes. By the time we react to a degraded path, the DNS answer is already locked in for the user. We were trying to bias a system that, by design, couldn't be biased fast enough.

So we moved the routing decision down a layer.

Anycast plus active probing

The new layer announces a single anycast prefix from every PoP. The user's DNS resolver gets the same answer regardless of where they are. The internet's underlay routes them to the nearest PoP — that part is unchanged.

What's new is that every five seconds, every PoP probes every other PoP and every customer origin from its own vantage point. If a path degrades (packet loss creeps up, or RTT shifts by more than 15%), we shift the anycast announcement away from the affected PoP within thirty seconds. The user's routing follows. No DNS cache to wait out.

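```go
// Pseudo-code for the probe loop. probeAllPaths and announceWithdraw
// are placeholders, as is threshold (the packet-loss limit).
for {
	results := probeAllPaths(localPoP, peers)
	for _, r := range results {
		// Act on packet loss above the threshold, or RTT drift above 15%.
		if r.lossPct > threshold || r.driftPct > 15 {
			announceWithdraw(r.prefix, localPoP)
			break
		}
	}
	time.Sleep(5 * time.Second)
}
```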
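The check at the heart of that loop can be sketched as a runnable function. This assumes drift is measured as a percentage of a per-path baseline RTT, in either direction. The 1% loss threshold, the baseline value, and all type and function names are illustrative assumptions. The 15% drift figure is the only number taken from the description above.

```go
package main

import (
	"fmt"
	"math"
)

// pathSample is one probe result for a path. Field names are illustrative.
type pathSample struct {
	lossPct float64 // packet loss seen in this probe round, in percent
	rttMs   float64 // round-trip time seen in this probe round
}

// rttDriftPct returns how far rttMs has moved from baselineMs, in either
// direction, as a percentage of the baseline.
func rttDriftPct(baselineMs, rttMs float64) float64 {
	if baselineMs <= 0 {
		return 0
	}
	return math.Abs(rttMs-baselineMs) / baselineMs * 100
}

// degraded reports whether a sample crosses either threshold.
func degraded(s pathSample, baselineMs, maxLossPct, maxDriftPct float64) bool {
	return s.lossPct > maxLossPct || rttDriftPct(baselineMs, s.rttMs) > maxDriftPct
}

func main() {
	const (
		maxLossPct  = 1.0  // assumed; no loss threshold is stated in the post
		maxDriftPct = 15.0 // RTT shift of more than 15%
	)
	baselineMs := 100.0 // hypothetical baseline for this path

	samples := []pathSample{
		{lossPct: 0.0, rttMs: 110}, // 10% drift: healthy
		{lossPct: 0.0, rttMs: 120}, // 20% drift: degraded
		{lossPct: 2.5, rttMs: 100}, // loss above threshold: degraded
	}
	for _, s := range samples {
		fmt.Printf("%+v degraded=%v\n", s, degraded(s, baselineMs, maxLossPct, maxDriftPct))
	}
}
```

A production version would likely smooth the baseline over a window and require several consecutive bad rounds before acting, to avoid flapping announcements on a single noisy probe.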

The first deployment cut p99 by 28%. We left it alone for a week to make sure nothing was on fire. Then we started tuning.

What got us the rest of the way

The remaining 12 percentage points came from two unglamorous places.

  • TCP Fast Open between PoPs. We were paying for a full handshake on every cross-PoP request. With TFO and a shared cookie scheme, we shaved ~40ms off cross-region paths.
  • Origin keep-alives. A surprising number of slow requests were just the cost of opening a new connection to the origin. We added a connection pool with aggressive keep-alives, and the long tail flattened.

Neither change is novel. Both had been on the backlog for over a year. We didn't get to them until the routing rebuild forced us to look at the connection layer with fresh eyes.

The numbers, region by region

Across our top eight regions, p99 dropped from a median of 312ms to 188ms. Two regions saw less than 20% improvement — both are in places where the internet underlay is structurally bad and we're already running close to the speed of light. Five regions saw 40-50%. One — São Paulo — saw 61%, mostly because we added a new peering relationship the week we shipped.
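As a quick check on how the median figures relate to the headline number, the reduction can be computed directly. `improvementPct` is a hypothetical helper for this sketch.

```go
package main

import "fmt"

// improvementPct returns the percentage reduction from before to after.
func improvementPct(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	// Median p99 across the top eight regions, before and after.
	fmt.Printf("%.1f%%\n", improvementPct(312, 188)) // prints 39.7%
}
```

That 39.7% is what rounds to the 40% in the title.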

We didn't fix latency. We fixed our reaction time to the things that cause latency.

Hannah Cho, SRE

What's next

The probe loop runs every five seconds. We'd like to get it under one. We're also looking at extending the routing decision to the application layer — letting a slow database query, not just a slow path, trigger a re-route to a region with a healthier replica.

If you'd like the raw probe data we collected during the rollout, drop us a note. Some of the numbers are surprising — particularly around how often paths degrade in ways the underlying providers don't acknowledge.

Written by Daniel Reyes, Principal Engineer, Edge

Twelve years in network engineering. Spends his weekends reading RFCs and his weekdays explaining BGP to other engineers.
