SD-WAN Resilience Part 4: BFD and Convergence Tuning

Parts 2 and 3 stood up the routing. Both leaned on the phrases “with BFD” and “with tuned timers” without justifying the numbers. This post does that work.

The default BGP holdtime is 180 seconds. Default OSPF dead-interval is 40 seconds. Default IPsec DPD detection on FortiOS is in the 30–90 second range. None of those are acceptable when “spokes prefer DC1 in steady state and fail to DC2 on a real failure” is supposed to happen before users notice.

The target: end-to-end convergence in the 1–3 second range for any of the failure modes we care about, with a deliberate trade-off against control-plane churn during transient blips.

The convergence chain to optimise

For a hub-side failure (HUB-1 dies, or its DCE peering dies), the chain that matters is:

T0   :  failure happens
T+a  :  hub-side detector fires (BFD on hub-to-DCE, or BFD on hub-to-spoke)
T+b  :  hub withdraws affected prefixes (BGP UPDATE)
T+c  :  spoke installs alternate path (best-path re-selection)
T+d  :  forwarding plane re-pinned (RIB → FIB)

Total convergence is T+d - T0. Each of those steps is a knob.

The dominant term is almost always T+a — failure detection — because the BGP UPDATE / best-path / FIB-update steps in FortiOS run in tens to a few hundred milliseconds once a session goes down. Get the detector right and the rest is in the noise.
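
Because the chain is additive, the budget can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the 1250 ms detection figure is the BFD setting used later in this post, while the other three durations are assumed placeholders picked from the "tens to a few hundred milliseconds" range, not measurements.

```python
# Rough convergence budget for the hub-side failure chain.
# Only the detection figure comes from the BFD settings in this post;
# the other step durations are assumed placeholders, not measurements.
STEPS_MS = {
    "detect (BFD)": 1250,
    "withdraw (BGP UPDATE)": 100,   # assumed
    "best-path re-selection": 100,  # assumed
    "RIB -> FIB install": 50,       # assumed
}


def total_convergence_ms(steps):
    """Sum the per-step durations in the chain."""
    return sum(steps.values())


def dominant_step(steps):
    """Return the name of the step that contributes the most time."""
    return max(steps, key=steps.get)


total = total_convergence_ms(STEPS_MS)
worst = dominant_step(STEPS_MS)
print(f"total: {total} ms, dominant: {worst} ({STEPS_MS[worst] / total:.0%})")
```

With any plausible values for the later steps, detection still accounts for most of the total, which is why the rest of this post concentrates on the detector.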

DPD: useful, but not for our purposes

Dead Peer Detection (DPD) is the IPsec-native liveness check. FortiOS supports three modes:

  • disable — no DPD.
  • on-demand — DPD probes only when there’s outbound traffic but nothing coming back.
  • on-idle — periodic DPD when the tunnel is idle.

DPD does its job: it detects a dead remote peer and tears down the IKE/IPsec SA so a new one can be re-established. But for routing convergence it has two problems:

  1. It’s slow. Default DPD is built around “the tunnel has been quiet for a while, let me check”; the smallest interval/retry combination still puts detection in the 10–30 second range, and the default is much higher.
  2. It detects the wrong thing. DPD knows the IPsec peer is unresponsive. It does not know whether the routing peer on the other side of the tunnel is healthy. With BGP-on-loopback, those are two different things — a hub can have a healthy IPsec tunnel and a dead BGP process and DPD will see no problem.

So DPD stays enabled (set dpd on-idle) as a hygiene measure that cleans up dead SAs, but it is not the failure detector for routing. That’s BFD’s job.

BFD on the tunnel interface

Two flavours of BFD on FortiOS, and we want both:

  1. BFD on the link / tunnel — between the two endpoints of the IPsec tunnel. Detects “the tunnel as a forwarding path is broken.”
  2. BFD on the BGP session — between the two BGP loopbacks (multihop BFD). Detects “the BGP control plane peer is broken.”

You could argue that with BGP-on-loopback the second is the only one that matters. In practice you want both, because BFD on the tunnel can fire before BFD-for-BGP if the underlay is the bit that broke, and the small belt-and-braces overlap is worth the negligible cost.

Configure BFD globally and per-interface:

config system settings
    set bfd enable
    set bfd-required-min-rx 250
    set bfd-desired-min-tx 250
    set bfd-detect-mult 5
end

config system interface
    edit "to-hub1"
        set bfd enable
        set bfd-required-min-rx 250
        set bfd-desired-min-tx 250
        set bfd-detect-mult 5
    next
end

tx 250 ms × multiplier 5 = 1.25 s detection is a conservative starting point over an internet underlay. If your underlay is reliable, drop to tx 200 / mult 3 = 600 ms; if you’re seeing flap-induced churn, back off to tx 500 / mult 5 = 2.5 s.
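
The three profiles above are the same multiplication with different inputs. A minimal sketch, with profile names that are labels of convenience rather than FortiOS terms:

```python
def bfd_detect_ms(tx_ms, mult):
    """Nominal BFD detection time: transmit interval times detect multiplier."""
    return tx_ms * mult


# (tx interval in ms, detect multiplier) for the three profiles in the text.
PROFILES = {
    "reliable underlay": (200, 3),
    "starting point": (250, 5),
    "flap-prone underlay": (500, 5),
}

for name, (tx_ms, mult) in PROFILES.items():
    print(f"{name}: tx {tx_ms} ms x {mult} = {bfd_detect_ms(tx_ms, mult)} ms")
```

This is the nominal figure only. It assumes both ends negotiate the same interval and ignores path latency.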

Fortinet’s SD-WAN Architecture for Enterprise and the BGP Best Practice recipe both recommend not going below 200 ms TX over an IPsec/internet path. Below that, transient ISP jitter can drop BFD packets and trigger a false neighbour-down. A false-positive failover costs as much as a real one, so the timer should be aggressive enough to catch real failures and conservative enough that ISP jitter doesn’t trip it.

BFD for BGP

Enabling BFD on a BGP neighbour piggybacks the failure signal onto the BGP best-path machinery. It’s the part that makes BGP timers irrelevant for failure detection: BFD declares the peer down in 1–2 seconds, BGP tears the session, prefixes are withdrawn.

config router bgp
    config neighbor
        edit "10.255.0.1"
            set bfd enable
        next
    end
end

BFD-for-BGP requires that the underlying multihop BFD session is happy. With BGP peering on loopbacks, the multihop BFD session runs between loopbacks, recursively resolved over the tunnel. Two consequences worth knowing:

  • The BFD session itself uses the same path the BGP traffic does. If the tunnel flaps, BFD goes down with it. Good — that’s the point.
  • If you have ECMP across two underlays (we don’t, in active/standby), a single BFD session only validates one path. Use one BFD session per underlay-distinct neighbour.

Either set per-neighbour BFD (as above) or enable it on the neighbor-group:

config router bgp
    config neighbor-group
        edit "spokes"
            set bfd enable
        next
    end
end

BGP timers: keep them as a backup, not as the detector

With BFD doing the detection work, BGP keepalive/holdtime can be set far below the defaults, though not all the way down to the FortiOS minimum.

config router bgp
    set keepalive-timer 3
    set holdtime-timer 9
end

keepalive 3 / hold 9 is the smallest stable pair. RFC 4271 requires a non-zero hold time to be at least 3 seconds and suggests a keepalive of one third of the hold time, which is the ratio used here. The point of keeping them tight is that if BFD breaks (BFD config drift, BFD daemon crash, multipath confusion), BGP itself will still detect the failure in 9 seconds rather than 180.

Don’t go below keepalive 3 / hold 9 — at 1/3 you’ll get BGP resets from CPU scheduling jitter.
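
The timer rules can be written as a small pre-deployment check. This is a sketch of this post's guidance, not of FortiOS's own validation. The function name is hypothetical, and the 9-second floor is this post's recommendation rather than a protocol limit.

```python
def check_bgp_timers(keepalive_s, hold_s):
    """Return a list of warnings for a keepalive/hold pair (empty if fine)."""
    warnings = []
    # RFC 4271: the hold time must be either zero or at least three seconds.
    if hold_s != 0 and hold_s < 3:
        warnings.append("hold time must be 0 or >= 3 s (RFC 4271)")
    # RFC 4271 suggests a keepalive of one third of the hold time.
    if hold_s != 0 and keepalive_s * 3 > hold_s:
        warnings.append("keepalive is more than one third of the hold time")
    # Recommendation from this post, not a protocol limit.
    if 0 < hold_s < 9:
        warnings.append("below the keepalive 3 / hold 9 floor used in this post")
    return warnings


for pair in [(3, 9), (1, 3), (60, 180)]:
    print(pair, check_bgp_timers(*pair) or "ok")
```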

OSPF timers (for the DC-to-DCE OSPF case)

If you went with Option B in Part 3:

config router ospf
    config ospf-interface
        edit "port2"
            set interface "port2"
            set hello-interval 1
            set dead-interval 4
            set network-type point-to-point
            set bfd enable
        next
    end
end

hello 1 / dead 4 is at the floor of what FortiOS supports without surprise. BFD does the actual detection; OSPF dead is the safety net. Use point-to-point network type if the segment really is point-to-point — DR/BDR election on a /30 between two devices is wasted CPU and adds startup delay.

The Graceful Restart trade-off

Graceful Restart (GR) is the feature where a BGP/OSPF peer that loses control-plane state asks its neighbours to keep forwarding to it for a grace period. It’s a great feature for in-service software upgrades and for HA failovers within a single device pair, where the data plane survives a control-plane restart.

It is a terrible feature for inter-site failover detection.

When GR is enabled, a peer that has gone truly down looks (briefly) the same as one that is restarting. The neighbours hold their RIBs, keep forwarding, and the failure signal you carefully tuned BFD to deliver in 1.25 s is gated by the GR restart-timer (default 120 s).

Two facts make this tolerable on FortiOS:

  1. BFD-down explicitly cancels GR. If BFD says the peer is dead, the neighbour doesn’t wait — it withdraws.
  2. The GR restart-timer is configurable. Set it short.

The recommended compromise:

config router bgp
    set graceful-restart-time 30
end
config router bgp
    config neighbor-group
        edit "spokes"
            set capability-graceful-restart enable
        next
    end
end

Enable GR for HA-failover ergonomics on the FortiGate pair, set the restart timer to 30 s so that a GR window BFD fails to cancel can’t hide a real failure for two minutes, and rely on BFD-cancels-GR for the fast path.
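
The effect of the restart timer is easiest to see as a worst-case calculation. The sketch below is a deliberately simplified model of the behaviour described in this section, not of the FortiOS implementation. It assumes stale routes are kept for the full restart window after detection whenever GR is not cancelled. The function name is hypothetical.

```python
def failure_visible_after_s(detect_s, gr_restart_s, gr_enabled=True,
                            bfd_cancels_gr=True):
    """Seconds until neighbours stop forwarding to a peer that is truly dead.

    Simplified model: if GR is off, or BFD-down cancels it, only the
    detection time matters. Otherwise the stale routes are assumed to
    survive for the whole restart window on top of detection.
    """
    if not gr_enabled or bfd_cancels_gr:
        return detect_s
    return detect_s + gr_restart_s


# Fast path: BFD cancels GR, so the restart timer is irrelevant.
print(failure_visible_after_s(1.25, 120))
# Slow path (BFD not cancelling GR): default timer vs the 30 s setting.
print(failure_visible_after_s(1.25, 120, bfd_cancels_gr=False))
print(failure_visible_after_s(1.25, 30, bfd_cancels_gr=False))
```

The short timer does nothing on the fast path. It only caps the damage when the BFD-cancels-GR path does not fire.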

If you don’t run a clustered FortiGate pair at the hub (single-node hubs), there’s a defensible argument for disabling GR entirely. The simplification is worth something. Run a 30-second GR for now; revisit if you ever build the HA pair.

End-to-end convergence numbers

With the timers above, here’s what each failure scenario produces:

Scenario 1: HUB-1 power-off

  • Spoke’s BFD-for-BGP to HUB-1 detects loss in ~1.25 s.
  • Spoke tears down the session, drops the routes learnt from HUB-1, recomputes best-path, picks HUB-2.
  • FIB update: tens of milliseconds.
  • DCE: HUB-1’s eBGP session to DCE drops (DCE-side BFD or session timeout). DCE withdraws DC1 path. HUB-2 path becomes best. Return traffic flips to DC2.
  • Total: ~1.5–2.5 s.

Scenario 2: HUB-1 alive, DCE peering on HUB-1 fails

  • HUB-1’s BFD-for-BGP to DCE detects loss in ~1.25 s.
  • HUB-1 removes the DCE-learnt prefixes from its RIB.
  • Withdrawal propagates to spokes via the existing BGP-to-spoke session as a normal UPDATE — sub-second.
  • Spoke now only has DCE prefixes from HUB-2; best-path re-selection is automatic.
  • Total: ~1.5 s.

The interesting bit about Scenario 2 is that the spoke’s BGP-to-HUB-1 session is still up — HUB-1 the device is fine. We’re relying on the prefix being withdrawn, not the session being torn down. That’s why the chain in Part 2 (“hub withdraws DCE prefixes when it loses DCE”) matters more than it looks.

Scenario 3: WAN flap on the spoke (DC1 path)

  • Spoke’s BFD declares the to-hub1 tunnel underlay down.
  • While the WAN is down, BGP-to-HUB-1 also drops, because multihop BFD runs over the same path and fails with it. IPsec renegotiates and the tunnel re-establishes once the WAN comes back.
  • Spoke recomputes best-path to HUB-2.
  • WAN comes back, tunnel re-establishes, BGP-to-HUB-1 re-establishes, route learnt with default local-pref 100, beats HUB-2 (50), failback.
  • Total during outage: ~1.5 s. Failback on recovery: a few seconds (driven by IPsec re-key + BGP startup).

Scenario 4: ISP jitter blips a single BFD interval

This is the one to defend against. With tx 250 / mult 5 = 1.25 s, a single dropped BFD packet uses up one of the five intervals, not the whole budget. Even four consecutive drops are still inside the budget, as long as the fifth packet arrives before the detection timer expires. False positives are effectively zero for any normally-behaved underlay.

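If you tighten to tx 100 / mult 3 = 300 ms, a real ISP brownout (which can produce 200 ms of blackholing without packet loss showing up in monitoring) can fire BFD: two lost packets consume the whole tolerance, so any extra delay on the third tips the session over. That’s why we don’t go below tx 200.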
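
The margin argument can be made concrete. The sketch below uses a simplified worst-case model: the last good packet arrives just before the blackhole starts, the next one is sent at most one interval after it ends, and path latency and BFD's transmit jitter are ignored. The function names are hypothetical.

```python
def tolerated_consecutive_drops(mult):
    """Consecutive lost packets a session can absorb before the timer expires."""
    return mult - 1


def max_safe_gap_ms(tx_ms, mult):
    """Longest blackhole the session is certain to survive in this model.

    The receiver's silence is at most gap + tx, and it must stay within
    the detection time mult * tx, so the safe gap is (mult - 1) * tx.
    """
    return (mult - 1) * tx_ms


def margin_ms(tx_ms, mult, gap_ms):
    """Headroom left after a blackhole of gap_ms (zero or less means at risk)."""
    return max_safe_gap_ms(tx_ms, mult) - gap_ms


BROWNOUT_MS = 200  # the brownout length quoted in the text
for tx_ms, mult in [(100, 3), (200, 3), (250, 5)]:
    print(f"tx {tx_ms} / mult {mult}: margin {margin_ms(tx_ms, mult, BROWNOUT_MS)} ms")
```

At tx 100 / mult 3 the margin against a 200 ms brownout is zero, so any added latency or a slightly longer blackhole is enough to drop the session. The other two profiles keep a few hundred milliseconds in hand.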

A sanity-check matrix

Layer                                 Detection time (target)   Mechanism
Spoke ↔ Hub IPsec underlay            1.25 s                    BFD on tunnel interface
Spoke ↔ Hub BGP session               1.25 s                    BFD-for-BGP (multihop)
Hub ↔ DCE BGP session                 1.25 s                    BFD-for-BGP (single-hop)
Hub ↔ DCE OSPF (alt)                  1.25 s                    BFD on interface
BGP holdtime backup                   9 s                       keepalive 3 / hold 9
OSPF dead backup                      4 s                       hello 1 / dead 4
IPsec DPD (cleanup, not detection)    30+ s                     dpd on-idle

Verification

# BFD sessions
get router info bfd neighbor
diagnose sys bfd-session list
diagnose ip bfd statistics

# BGP sessions and timers
get router info bgp summary
get router info bgp neighbors 10.255.0.1

# OSPF, if used
get router info ospf neighbor
get router info ospf interface

get router info bfd neighbor should show every BFD session as up with the negotiated TX/RX matching what you configured. The negotiated interval in each direction is the larger of what the two sides asked for: the sender’s desired-min-tx against the receiver’s required-min-rx. If your hub asked for 250 and the spoke asked for 1000, you’ll get 1000. Verify both ends.
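
The negotiation rule is worth working through once, because a single mistuned end silently slows both. The sketch below follows the asynchronous-mode rules in RFC 5880. The function names are hypothetical, and the hub/spoke values are the mismatch example from the paragraph above.

```python
def negotiated_tx_ms(local_desired_tx, remote_required_rx):
    """Interval at which the local end actually transmits (RFC 5880)."""
    return max(local_desired_tx, remote_required_rx)


def local_detect_ms(local_required_rx, remote_desired_tx, remote_mult):
    """Detection time at the local end: the remote multiplier times the
    agreed receive interval (RFC 5880, asynchronous mode)."""
    return remote_mult * max(local_required_rx, remote_desired_tx)


# Hub configured as in this post; spoke left at a slow 1000 ms (the mismatch).
hub = {"tx": 250, "rx": 250, "mult": 5}
spoke = {"tx": 1000, "rx": 1000, "mult": 5}

print("hub sends every", negotiated_tx_ms(hub["tx"], spoke["rx"]), "ms")
print("hub detects in", local_detect_ms(hub["rx"], spoke["tx"], spoke["mult"]), "ms")
print("spoke detects in", local_detect_ms(spoke["rx"], hub["tx"], hub["mult"]), "ms")
```

In this example both ends end up with a 5 s detection time even though the hub is configured for 1.25 s, which is why the check has to be made on both ends.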

If a BFD session is bouncing, the usual culprits are:

  • BFD enabled on the interface but not in system settings (or vice versa).
  • Multihop BFD on BGP without the underlying static /32 to the loopback (Part 2).
  • A NAT device in the path that doesn’t pass BFD (a nightmare to debug; capture and look for the BFD control packets, UDP/3784 for single-hop and UDP/4784 for multihop).

Where Part 5 picks up

We’ve now got 1–3 second convergence for any failure that takes a tunnel or a BGP session with it. The remaining gap is the failure that doesn’t: HUB-1 is up, BGP is up, the DCE eBGP is up — and yet the application stack is unreachable. That’s the failure mode where end-to-end SD-WAN Performance SLAs earn their keep, and it’s what Part 5 is about.