SD-WAN Resilience Part 5: Performance SLAs and Service Steering

Parts 2, 3, and 4 gave us a routing fabric that converges in 1–3 seconds for any failure that takes a tunnel or a routing session with it. This post is about the failure that doesn’t.

Suppose HUB-1 is up, the IPsec tunnel to HUB-1 is up, the BGP session to HUB-1 is up, and HUB-1’s eBGP to DCE is up, but the application stack is unreachable from inside DC1 because some downstream component (a transit firewall, a load balancer, a peering through to the cloud) has failed. None of the mechanisms from Part 4 will fire. The control plane looks healthy. Routing is fine. Traffic still blackholes.

End-to-end Performance SLAs are how you catch this. They’re also what makes “active/standby with intelligent failover” actually work, because they let the spoke decide based on what it can prove rather than what it has been told.

What SD-WAN buys us on top of BGP

It’s tempting to ask why we need SD-WAN at all on a spoke that already has BGP doing the routing. Two reasons:

  1. Health-check granularity. BGP says “the path is up.” SD-WAN can say “the application target is reachable, with N ms latency, J ms jitter, and L% loss” — and turn that into a routing decision.
  2. Application-aware policy. Different applications have different tolerances. Teams voice copes worse with a packet-loss spike than with a slightly higher RTT; bulk SMB-over-WAN is the opposite. SD-WAN service rules let you match per application and apply different SLA targets.
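
As a rough illustration of both points, here is a minimal Python sketch of how measured probe metrics could turn into a per-application pass/fail decision. The profile names (voice, bulk), the target numbers, and the sample measurements are all illustrative assumptions, not values from this design or any vendor default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    latency_ms: float
    jitter_ms: float
    loss_pct: float


@dataclass(frozen=True)
class SlaTarget:
    max_latency_ms: float
    max_jitter_ms: float
    max_loss_pct: float

    def met_by(self, m: Metrics) -> bool:
        # A path is "in SLA" only if every measured value is within its limit.
        return (m.latency_ms <= self.max_latency_ms
                and m.jitter_ms <= self.max_jitter_ms
                and m.loss_pct <= self.max_loss_pct)


# Hypothetical targets: voice is loss-sensitive, bulk transfer is RTT-sensitive.
TARGETS = {
    "voice": SlaTarget(max_latency_ms=150, max_jitter_ms=30, max_loss_pct=1.0),
    "bulk": SlaTarget(max_latency_ms=60, max_jitter_ms=100, max_loss_pct=5.0),
}

# Hypothetical measurements for two paths.
lossy_path = Metrics(latency_ms=40, jitter_ms=5, loss_pct=3.0)
slow_path = Metrics(latency_ms=90, jitter_ms=5, loss_pct=0.0)

for app, target in TARGETS.items():
    print(app, "lossy:", target.met_by(lossy_path), "slow:", target.met_by(slow_path))
# voice lossy: False slow: True
# bulk lossy: True slow: False
```

Both paths are “up” as far as BGP is concerned; which one is acceptable depends on the application asking.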

The SD-WAN layer sits on top of BGP in this design. BGP supplies reachability (the routes exist); SD-WAN supplies steering (which path a flow takes, given those routes are healthy). The two cooperate: if BGP withdraws a route, SD-WAN can’t pick that member; if SD-WAN’s SLA fires, the route is still there but the service rule routes around it.

Choosing what to probe

Health-check target choice is the single most important design decision in this whole section. Get it wrong and SD-WAN tells you all paths are fine while the application is on fire.

Don’t probe the hub loopback. A successful ping to 10.255.0.1 proves IPsec is up and routing is up. We already know that — BFD told us in 1.25 seconds. The probe needs to prove something BGP can’t.

Don’t probe a public IP outside DCE. Probing 8.8.8.8 proves the spoke’s underlay reaches the internet. It says nothing about the path from spoke to corporate application.

Probe an address inside DCE that is on or near the service path. Options, in rough order of usefulness:

  • A dedicated probe responder VM or container in DCE, accessible to every spoke via the overlay. Reliable, controllable, and aligned to the actual service path.
  • The service VIP itself, if it responds to ICMP (often it doesn’t) or supports an HTTP healthcheck endpoint.
  • The next-hop router on the DCE side of the hub-to-DCE peering. Better than a loopback because it proves the hub-to-DCE link is data-plane-forwarding, but it doesn’t prove anything past that hop.

A reasonable real-world choice is a small probe responder in DCE that exposes both an ICMP target and an HTTP endpoint. It costs almost nothing to run and it’s the only way to be honest about end-to-end health.
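
A probe responder along these lines can be very small. The sketch below is one possible shape, using only the Python standard library; ICMP needs no code because the host’s network stack answers pings. The /healthz path matches the health-check configured later in this post, while the port (8080), the service_is_healthy placeholder, and the make_server helper are assumptions made for illustration.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def service_is_healthy() -> bool:
    # Placeholder (assumption): replace with a real check of the local
    # service path, for example a connection attempt to the application.
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        ok = service_is_healthy()
        body = b"ok\n" if ok else b"unhealthy\n"
        # 200 when healthy, 503 otherwise, so an HTTP probe can tell them apart.
        self.send_response(200 if ok else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        # Every spoke probes this endpoint continuously; keep the log quiet.
        pass


def make_server(host: str = "0.0.0.0", port: int = 8080) -> ThreadingHTTPServer:
    # Build the server without starting it, so the caller decides how to run it.
    return ThreadingHTTPServer((host, port), HealthHandler)
```

Running it is a matter of calling make_server().serve_forever(). The health-check has to target whichever port the responder actually listens on, and the value of the endpoint depends entirely on what service_is_healthy really checks.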

Performance SLA configuration

config system sdwan
    set status enable

    config zone
        edit "overlay"
        next
    end

    config members
        edit 1
            set interface "to-hub1"
            set zone "overlay"
            set priority 10
        next
        edit 2
            set interface "to-hub2"
            set zone "overlay"
            set priority 20
        next
    end

    config health-check
        edit "dce-probe"
            set server "10.100.250.10"
            set protocol ping
            set interval 500
            set probe-timeout 500
            set failtime 5
            set recoverytime 10
            set members 1 2
            set update-static-route enable
            config sla
                edit 1
                    set latency-threshold 80
                    set jitter-threshold 30
                    set packetloss-threshold 1
                next
            end
        next
        edit "dce-http"
            set server "10.100.250.10"
            set protocol http
            set http-get "/healthz"
            set interval 1000
            set failtime 3
            set recoverytime 5
            set members 1 2
            config sla
                edit 1
                    set latency-threshold 200
                    set packetloss-threshold 0
                next
            end
        next
    end
end

A few choices in there worth picking apart:

  • Two health checks, not one. The ICMP check is the fast, lightweight pulse — every 500 ms with 5-fail trigger gives a 2.5 s detection floor that won’t false-positive on a single packet loss. The HTTP check is slower but proves the service responds at L7. They cover different failure modes.
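  • set update-static-route enable on the ICMP check lets the health-check withdraw static routes that point out a member interface once it declares that member dead. Static routes don’t name the health-check themselves; the member interface is the link between the two. We’ll use this for the SLA-tied static failover pattern below.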
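  • priority 10 and priority 20 on the members rank HUB-1 ahead of HUB-2 for routes through the SD-WAN zone; in the SLA-mode service rules below, the ordered priority-members list (1 before 2) expresses the same preference. This is the active/standby preference at the SD-WAN layer, mirroring the BGP local-pref preference.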
  • Thresholds. Latency 80 ms over a continental WAN is realistic; jitter 30 ms is tolerant; loss 1% is “anything sustained gets us off this path”. These are starting numbers — the right values come from baselining your own underlay for two weeks before you trust them.
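
The timing arithmetic behind those numbers is simple enough to write down. The sketch below computes the detection floor from interval and failtime, plus a rough ceiling; the ceiling formula is a simplifying assumption (the last missed probe also has to time out), not a statement about how FortiOS schedules its timers, and the function names are made up for this example.

```python
def consecutive_window_s(interval_ms: int, count: int) -> float:
    """Time spanned by `count` consecutive probes sent every `interval_ms`."""
    return interval_ms * count / 1000


def detection_ceiling_s(interval_ms: int, failtime: int, probe_timeout_ms: int) -> float:
    """Rough upper bound (assumption): the final missed probe must also time out."""
    return (interval_ms * failtime + probe_timeout_ms) / 1000


# Values taken from the dce-probe (ICMP) and dce-http checks above.
print(consecutive_window_s(500, 5))       # 2.5  ICMP detection floor, seconds
print(detection_ceiling_s(500, 5, 500))   # 3.0  ICMP rough ceiling, seconds
print(consecutive_window_s(1000, 3))      # 3.0  HTTP detection floor, seconds
```

Halving interval or failtime halves the floor, at the price of reacting to shorter bursts of loss.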

Service rules — getting traffic on the right member

config system sdwan
    config service
        edit 1
            set name "dce-app-strict"
            set mode sla
            set dst "dce-app-prefix"
            set health-check "dce-http"
            set sla-compare-method order
            config sla
                edit "dce-http"
                    set id 1
                next
            end
            set priority-members 1 2
        next
        edit 2
            set name "dce-services"
            set mode sla
            set dst "dce-services-prefix"
            set health-check "dce-probe"
            set sla-compare-method order
            config sla
                edit "dce-probe"
                    set id 1
                next
            end
            set priority-members 1 2
        next
    end
end

config firewall address
    edit "dce-services-prefix"
        set subnet 10.100.0.0 255.255.0.0
    next
    edit "dce-app-prefix"
        set subnet 10.100.10.0 255.255.255.0
    next
end

What this does: any traffic destined for 10.100.0.0/16 that isn’t claimed by a more specific rule is matched by the dce-services rule. The rule consults the dce-probe health-check, picks the highest-priority passing member (to-hub1 if it’s passing, otherwise to-hub2), and forwards the flow. If both fail, traffic falls through to whatever the routing table has, which, if everything went badly, will at least be “the BGP route via HUB-1 because that’s the local-pref winner.”

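mode sla with an ordered priority-members list is what gives active/standby here: the rule picks the first listed member that meets SLA, with no load-balancing. set sla-compare-method order controls how multiple SLA targets within one rule are compared (its alternative is number), so with a single target per rule it changes nothing. Load-balancing across all passing members is a separate rule setting, and it is exactly what we don’t want here.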

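A separate rule for the stricter subset (dce-app-prefix) uses the HTTP SLA, which is looser on latency (200 ms) but tolerates no loss and proves the service answers at L7. If the L7 probe degrades but ICMP is still happy, only that subset of traffic is steered to DC2, and bulk traffic stays on DC1. SD-WAN rules are evaluated top-down and the first match wins, so this rule only takes effect if it sits above the broader 10.100.0.0/16 rule.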
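
The selection behaviour described in this section can be modelled in a few lines of Python. This is a simplified model of the logic as described here (first matching rule, then first listed member that passes its SLA, otherwise leave the decision to the routing table), not a reproduction of how FortiOS implements it. The model lists the more specific rule first; the last call shows what the opposite order would do.

```python
from ipaddress import ip_address, ip_network

# Rules in evaluation order: (name, destination prefix, health-check name).
RULES = [
    ("dce-app-strict", ip_network("10.100.10.0/24"), "dce-http"),
    ("dce-services", ip_network("10.100.0.0/16"), "dce-probe"),
]
MEMBER_ORDER = ["to-hub1", "to-hub2"]  # same order as priority-members 1 2


def select_member(dst, sla_state, rules=RULES):
    """Return (rule name, member). Member is None when the matched rule has
    no passing member; both are None when no rule matches at all."""
    addr = ip_address(dst)
    for name, prefix, check in rules:
        if addr not in prefix:
            continue
        # First matching rule wins; later rules are never consulted.
        for member in MEMBER_ORDER:
            if sla_state[check].get(member, False):
                return name, member
        return name, None  # nothing passes: the routing table decides
    return None, None


# Scenario: the L7 probe fails via HUB-1 while ICMP is fine on both members.
state = {
    "dce-probe": {"to-hub1": True, "to-hub2": True},
    "dce-http": {"to-hub1": False, "to-hub2": True},
}
print(select_member("10.100.10.5", state))  # ('dce-app-strict', 'to-hub2')
print(select_member("10.100.20.5", state))  # ('dce-services', 'to-hub1')
# With the broad rule first, the specific rule is shadowed:
print(select_member("10.100.10.5", state, rules=RULES[::-1]))  # ('dce-services', 'to-hub1')
```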

SLA-tied static — the belt-and-braces

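For the bluntest possible failover, you can pin a static route whose existence is gated by the health-check. If the check declares the member dead, the static is removed; if you write the static carefully, removing it forces the FIB lookup to fall through to a different next-hop. Two details matter: the route names the member interface directly rather than an SD-WAN zone (a static takes one or the other), and link-monitor-exempt stays at its default of disable, because enabling it exempts the route from exactly the withdrawal we want.

config router static
    edit 200
        set dst 10.100.0.0 255.255.0.0
        set device "to-hub1"
        set priority 10
    next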
    edit 201
        set dst 10.100.0.0 255.255.0.0
        set device "to-hub2"
        set priority 20
    next
end

With the SD-WAN service rules above, you don’t strictly need this — the service rule does the equivalent at policy time. But the SLA-tied static is a useful backstop for traffic that doesn’t cross a firewall policy with SD-WAN routing applied (rare, but possible for management VRFs and out-of-band paths).

When SLA, when routing — and when both?

The honest answer is “use whichever expresses the policy most clearly, and don’t fight yourself by setting it in two places.”

Routing (BGP local-pref / AS-prepend) is the right tool when:

  • The preference is structural — “this spoke prefers DC1” — and rarely changes.
  • The decision can be made by control-plane signals (BGP up, BFD up).
  • You want the policy visible in get router info bgp rather than buried in SD-WAN config.

SD-WAN SLA is the right tool when:

  • The decision needs end-to-end signal (data-plane health, application reachability).
  • The preference is application-specific (voice goes via the lower-jitter path even if both are “up”).
  • You need fast, deterministic data-plane swap on a soft failure that doesn’t kill any sessions.

Don’t do both for the same decision. Setting BGP local-pref and SD-WAN priority for the same prefix means a failure in one mechanism gets papered over by the other and you lose the diagnostic clarity of “which one fired.” Pick the layer the policy belongs at.

In this design, the split lands as:

  • BGP owns the “DC1 vs DC2” baseline (via local-pref on the spoke) and “what prefixes exist at all”.
  • SD-WAN SLA owns “is the path actually healthy end-to-end” and per-application steering.

Failure-mode walkthrough, end-to-end

This is the punchline. Here’s every failure mode the series is designed to cover and what catches it.

| Failure | Detector | Time to converge |
| --- | --- | --- |
| HUB-1 hard down | Spoke BFD-for-BGP to HUB-1 (Part 4) | ~1.5 s |
| HUB-1 alive, DCE eBGP down | HUB-1 BFD-for-BGP to DCE → BGP withdraw → spoke best-path (Parts 3, 4) | ~1.5 s |
| HUB-1 alive, DCE eBGP up, but DCE-side service path broken | SD-WAN ICMP SLA via to-hub1 fails (this post) | 2.5–3 s |
| HUB-1 alive, full network healthy, but application returning errors | SD-WAN HTTP SLA via to-hub1 fails (this post) | 3–5 s |
| Spoke WAN flap | Spoke BFD on tunnel + BFD-for-BGP (Part 4) | ~1.5 s |
| Spoke WAN brown-out (jitter/loss beyond thresholds) | SD-WAN SLA loss/jitter threshold breach (this post) | 2.5–3 s |
| ISP single-packet drops | Below BFD multiplier and SLA failtime; no action | Unaffected |
| FortiManager HA partition (no DCI) | n/a (management-plane only, doesn’t affect data plane) | n/a |

The interesting rows are the three that only SD-WAN SLA catches: the broken DCE-side service path, the application returning errors, and the WAN brown-out. Without SLA, those three failures result in spoke traffic being sent at HUB-1 indefinitely while the path is broken or degraded. That is why the Performance SLA layer is non-negotiable in a real resilient design, even one with BGP doing the heavy routing lifting.

Failback and dampening

After DC1 recovers, the SD-WAN member’s SLA needs to come back to “passing” before the service rule re-prefers it. recoverytime on the health-check controls this:

set recoverytime 10

10 consecutive passing probes at 500 ms = 5 seconds of consistent health before failback. The failtime:recoverytime ratio (5:10 above) is asymmetric for a reason: fail fast, recover slow. A path that just came back is more likely to flap again than one that’s been healthy for hours.
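
The fail-fast, recover-slow behaviour is a small hysteresis state machine. The sketch below models it with consecutive-probe counters using the failtime and recoverytime values from this post; the MemberHealth class is a hypothetical illustration of the counting logic, not how the firewall implements it.

```python
class MemberHealth:
    """Tracks alive/dead state from consecutive probe results."""

    def __init__(self, failtime: int = 5, recoverytime: int = 10):
        self.failtime = failtime
        self.recoverytime = recoverytime
        self.alive = True
        self._streak = 0  # consecutive results that contradict the current state

    def observe(self, probe_ok: bool) -> bool:
        if probe_ok == self.alive:
            # Result agrees with the current state: any pending streak is broken.
            self._streak = 0
            return self.alive
        self._streak += 1
        needed = self.failtime if self.alive else self.recoverytime
        if self._streak >= needed:
            self.alive = not self.alive
            self._streak = 0
        return self.alive


m = MemberHealth()
for _ in range(5):
    m.observe(False)      # five consecutive misses
print(m.alive)            # False: the member is declared dead

for _ in range(9):
    m.observe(True)       # nine good probes are not yet enough
m.observe(False)          # one miss resets the recovery count
for _ in range(9):
    m.observe(True)
print(m.alive)            # False: still waiting for 10 in a row
m.observe(True)
print(m.alive)            # True: failback after 10 consecutive passes
```

A single lost probe during recovery restarts the count, which is what keeps a flapping path from bouncing traffic back and forth.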

For very flap-prone underlays you can extend recoverytime further (30+) or pair with a hold-down — but at that point you’re probably better off addressing the underlay than tuning the SD-WAN around it.

Verification

diagnose sys sdwan health-check
diagnose sys sdwan member
diagnose sys sdwan service
get system sdwan

diagnose sys sdwan health-check shows current latency / jitter / loss per member per check, plus whether each is in SLA. diagnose sys sdwan service shows which member each service rule has selected. If the rule isn’t picking what you expect, this is where it tells you why (no member meets SLA, all members fail health, the priority-members list doesn’t include the configured member, etc.).

Series wrap

What we built across the five posts:

  • Part 1 — Topology and assumptions: HA FortiManager, dual hubs in active/standby, no DCI, separate DCE peering. Defended the unfashionable choices and named the constraints.
  • Part 2 — BGP on loopback, dynamic IPsec on the hub, why we don’t run iBGP between hubs.
  • Part 3 — DC ↔ DCE routing options, with eBGP as the recommendation for the AS boundary.
  • Part 4 — BFD on tunnels and BFD-for-BGP, hold/keepalive math, the Graceful Restart trade-off, end-to-end convergence in 1–3 s.
  • Part 5 — Performance SLAs as the application-aware overlay on top of BGP, and the failure modes only end-to-end probing can catch.

The design isn’t Fortinet’s poster-child SD-WAN — that’s intentional. It’s a real topology with real constraints. The choices are defended against Fortinet’s published Best Practice where the BP fits the constraints, and challenged where it doesn’t.

Where to go from here, if you want to push it further: add a real DCI, switch to dual-active ADVPN, and revisit every choice in the series. Most of the configuration survives — the policy choices change. That’s a different five-part series.