SD-WAN Resilience Part 1: Design and Assumptions
A multi-part deep dive into building a resilient Fortinet SD-WAN. The aim is not to repeat the Cookbook word-for-word, but to walk through the design choices on a real (and slightly unfashionable) topology, justify them against Fortinet’s published Best Practice, and challenge the choices where I think they deserve challenge.
This is Part 1: laying out the topology and the assumptions everything else in the series rests on.
The reference topology
              +---------------------+
              |   DCE (AS 65500)    |
              |  Application Stack  |
              +----+-----------+----+
  eBGP/OSPF/static |           | eBGP/OSPF/static
     (independent) |           | (independent)
          +--------+---+ +-----+------+
          |   HUB-1    | |   HUB-2    |
          |   (DC1)    | |   (DC2)    |
          |  AS 65000  | |  AS 65000  |
          |  FMG-A *   | |   * FMG-B  |
          +-----^------+ +-----^------+
                | IPsec        | IPsec
                | (preferred)  | (standby)
                |              |
          +-----+--------------+------+
          |    SPOKE-N (AS 65100)     |
          +---------------------------+
The moving parts:
- HA FortiManager: FMG-A in DC1, FMG-B in DC2, on private RFC 1918 addressing. Manages all FortiGates via templates. The HA pair has to sync over whatever path is available, which without a DCI means via the DCE. That is itself a discussion to be had later.
- HUB-1 / HUB-2: FortiGate hubs in DC1 and DC2. Same hub AS (65000). Each hub terminates IPsec from spokes and runs an independent routing relationship with the DCE.
- SPOKE-N: branch FortiGates. Two IPsec tunnels each: one to HUB-1 (preferred), one to HUB-2 (standby). Per-spoke ASN from the 65100–65199 block.
- DCE: a separate AS (65500) hosting the application stack. Could be a colo, a cloud landing zone, or the rest of the campus. The point is that DC1 and DC2 each enter it independently. There is no DCI link between DC1 and DC2.
- Active/standby DCs: spokes prefer DC1 in steady state and only fail to DC2 on a hub or path failure.
This is intentionally not Fortinet’s poster-child SD-WAN design. The default reference in the SD-WAN Architecture for Enterprise guide is dual-active ADVPN, both DCs equally weighted, full overlay with shortcuts. This series is about a topology that, for legitimate reasons, can’t run that.
Why this topology, not dual-active ADVPN?
There are two design choices worth defending right now, because the rest of the series depends on them.
1. Active/standby (not dual-active)
The case for dual-active is well-known: better link utilisation, sub-second app-layer failover, and ADVPN shortcuts that take spoke-to-spoke off the hub. So why not?
Three reasons that come up regularly in real builds:
- Source-IP-sensitive backends. If the application stack inspects and pins on source IP — a legacy load balancer with source-IP persistence, an MFA system that geo-locks per session, or a stateful firewall holding flow state — flapping between two DC egress paths blows sessions apart. Active/standby keeps the source path predictable and only flips on a real failure. The cost is half-idle bandwidth in DC2.
- Asymmetric routing risk. With no DCI, return traffic from the DCE has to come back via the same DC the request went out of. The simplest way to enforce that is to make DC1 the only steady-state path. Dual-active without a DCI is achievable, but requires careful prefix advertisement and per-flow symmetry, and falls over the moment a stateful inspector sits in the DCE return path.
- Operational simplicity. Failure domains become smaller and easier to reason about. “DC1 path is broken; we are now using DC2” is a far easier narrative for a 2 a.m. page than “HUB-1 is partially degraded so 30% of flows are pinning to HUB-2 but only for the SaaS class.”
Fortinet’s SD-WAN Architecture for Enterprise guide does call out active/standby as a supported design (see the Single Hub vs Dual Hub section) and explicitly warns against assuming dual-active is always correct. The framing is “use dual-active when your applications can take it; use active/standby when they can’t.” This series sits on the active/standby side of that line.
2. No DCI between DC1 and DC2
This one is harder to defend, and worth challenging openly. Without a DCI:
- Hub-to-hub iBGP has to traverse the DCE (or the spoke overlay, which is worse).
- A DCE outage on one side can leave a hub running but isolated from its peer.
- Spoke-to-DC2 failover only works if the spoke can detect the DC1 path failure end-to-end, not just locally.
The justification, in environments where this actually happens, is usually a mix of: the DCs are operated by different teams, the DCI was never built because each DC was already independently homed to the DCE, and the spend to retrofit a dedicated DCI is hard to justify when the DCE is fast and reliable.
We’re not going to pretend this is the world’s cleanest design. Instead, the rest of the series treats the missing DCI as a constraint and uses BFD, end-to-end SLA probes, and careful prefix advertisement to make sure failures are detected and handled correctly.
If you can build a DCI, you probably should. The series will note at each step where a DCI would simplify things.
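The third point in the list above (the spoke has to detect a DC1 path failure end-to-end, not just locally) is the one the rest of the series leans on most. As a minimal sketch of that decision, here is a small Python model. It is not FortiOS configuration, and `HubPath`, `tunnel_up`, `end_to_end_ok`, and `select_hub` are hypothetical names invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HubPath:
    """A spoke's view of one hub (hypothetical model, not a FortiOS object)."""
    name: str
    tunnel_up: bool       # local view: the IPsec tunnel is established
    end_to_end_ok: bool   # a probe to a target inside the DCE succeeds via this hub


def usable(path: HubPath) -> bool:
    # A path only counts when the local AND the end-to-end view are healthy.
    return path.tunnel_up and path.end_to_end_ok


def select_hub(preferred: HubPath, standby: HubPath) -> Optional[str]:
    """Active/standby: stay on the preferred hub while usable, else fall back."""
    if usable(preferred):
        return preferred.name
    if usable(standby):
        return standby.name
    return None  # neither DC can reach the DCE


# HUB-1's tunnel is up, but its path into the DCE is broken.
hub1 = HubPath("HUB-1", tunnel_up=True, end_to_end_ok=False)
hub2 = HubPath("HUB-2", tunnel_up=True, end_to_end_ok=True)
print(select_hub(hub1, hub2))  # HUB-2
```

With these inputs the model picks HUB-2 even though the HUB-1 tunnel is locally up. That is the behaviour the later parts try to get out of real probes and timers.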
AS plan and addressing
A consistent plan keeps the rest of the configuration honest:
| Function | ASN block | Notes |
|---|---|---|
| Hub FortiGates (HUB-1, HUB-2) | 65000 | Single AS; both hubs are iBGP peers (over the overlay or via DCE — discussed in Part 2) |
| Spoke FortiGates | 65100–65199 | Per-spoke private ASN. eBGP to each hub. |
| DCE | 65500 | The application AS. Static, OSPF, or eBGP from each hub independently (Part 3). |
Loopback addressing — used for BGP peering across the overlay — sits in a dedicated block:
| Device | Loopback (lo0) |
|---|---|
| HUB-1 | 10.255.0.1/32 |
| HUB-2 | 10.255.0.2/32 |
| SPOKE-1 | 10.255.1.1/32 |
| SPOKE-2 | 10.255.1.2/32 |
| … | 10.255.1.N/32 |
The reason for using loopbacks for BGP peering rather than tunnel-interface IPs is the subject of Part 2, but the short version is that tunnel-interface IPs can change when the tunnel re-establishes, whereas loopbacks don’t.
Tunnel-interface IPs, used on the IPsec tunnels between hubs and spokes, sit in a separate /30 per tunnel, for example 10.254.0.0/30 for SPOKE-1↔HUB-1 and 10.254.0.4/30 for SPOKE-1↔HUB-2.
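To sanity-check the plan, the sketch below derives a spoke's ASN, loopback, and tunnel subnets from its index using Python's standard `ipaddress` module. Only the SPOKE-1 and SPOKE-2 values come from the tables and examples above. The rule of one ASN per spoke index and two consecutive /30s per spoke is my assumption about how the pattern extends, and `spoke_plan` is a hypothetical helper.

```python
import ipaddress

TUNNEL_BASE = ipaddress.ip_address("10.254.0.0")
SPOKE_ASN_FIRST, SPOKE_ASN_LAST = 65100, 65199


def spoke_plan(n: int) -> dict:
    """Derive ASN and overlay addressing for SPOKE-n (n starts at 1)."""
    asn = SPOKE_ASN_FIRST + n - 1          # assumed: one ASN per spoke index
    if n < 1 or asn > SPOKE_ASN_LAST:
        raise ValueError(f"spoke index {n} does not fit the 65100-65199 block")
    # Assumed allocation: two consecutive /30s per spoke, HUB-1 tunnel first.
    offset = (n - 1) * 8
    hub1_net = ipaddress.ip_network(f"{TUNNEL_BASE + offset}/30")
    hub2_net = ipaddress.ip_network(f"{TUNNEL_BASE + offset + 4}/30")
    return {
        "asn": asn,
        "loopback": f"10.255.1.{n}/32",
        "tunnel_to_hub1": str(hub1_net),
        "tunnel_to_hub2": str(hub2_net),
    }


print(spoke_plan(1))
# {'asn': 65100, 'loopback': '10.255.1.1/32',
#  'tunnel_to_hub1': '10.254.0.0/30', 'tunnel_to_hub2': '10.254.0.4/30'}
```

Under these assumptions the ASN block is the binding limit: it runs out at 100 spokes, well before the loopback or tunnel ranges do.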
What “resilience” actually means here
The word “resilience” gets thrown at SD-WAN deployments to mean roughly “it doesn’t fall over.” That’s not specific enough to design against. Concretely, resilience here means four distinct things:
- Convergence time on a hub failure. How long between HUB-1 going dark and SPOKE-N actually putting traffic on HUB-2. The default protocol timers will not get this under sixty seconds. Part 4 is mostly about driving this down to single-digit seconds.
- No black-holing. If HUB-1 is up but its DCE peering is down, the spoke must not keep sending traffic at it. The local view of the tunnel says “up”, but the end-to-end view says “broken”. This is where end-to-end SLAs (Part 5) earn their keep.
- No flap-induced asymmetry. If BGP holds the route for 90 seconds after BFD has already declared the path down, we’ll spend 90 seconds with the spoke pointing at HUB-2 while HUB-1 is still being advertised as the best path, so forward and return traffic can take different DCs. BFD-for-BGP and aligned tunnel/route timers fix this; Part 4 covers it.
- Predictable failback. When DC1 recovers, the path returns to steady state without operator intervention and without flapping. AS-path prepending, MED, local-pref, and SLA priority all play a role; we’ll lean on routing primitives where the policy can be expressed in routing, and on SLA priority only where it cannot.
If a design choice in the rest of the series doesn’t measurably improve one of those four, it doesn’t earn its place.
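To put rough numbers on the first and third points, here is a back-of-envelope sketch in Python. It assumes BFD declares a path down after roughly interval × multiplier, and that in the worst case BGP on its own keeps routes until its hold timer expires. The 300 ms × 3 BFD values are illustrative assumptions, not recommendations (Part 4 does the real timer math), and the function names are hypothetical.

```python
def bfd_detection_ms(interval_ms: int, multiplier: int) -> int:
    """Approximate BFD detection time: transmit interval times detect multiplier."""
    return interval_ms * multiplier


def stale_route_window_ms(bgp_hold_s: int, bfd_ms: int, bfd_drives_bgp: bool) -> int:
    """Worst-case time BGP still holds routes after BFD declared the path down.

    If BFD tears the BGP session down directly (BFD-for-BGP), the window is
    modelled as zero; otherwise BGP waits out its own hold timer.
    """
    if bfd_drives_bgp:
        return 0
    return max(0, bgp_hold_s * 1000 - bfd_ms)


bfd_ms = bfd_detection_ms(interval_ms=300, multiplier=3)  # illustrative values
print(bfd_ms)                                                    # 900
print(stale_route_window_ms(90, bfd_ms, bfd_drives_bgp=False))   # 89100
print(stale_route_window_ms(90, bfd_ms, bfd_drives_bgp=True))    # 0
```

With the 90-second hold time from the example above, sub-second BFD detection buys almost nothing unless BGP is also told to act on it. That gap is the misalignment the third point describes.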
What this series is not going to cover
To keep the scope honest:
- No FortiManager template walkthrough. Configs in this series are CLI as applied to the device. Translating to FortiManager templates is a separate exercise.
- No SaaS / Internet steering deep-dive. This series is about the corporate-traffic resilience path: spoke → hub → DCE. Internet break-out has different trade-offs and deserves its own series.
- No FortiSASE. Cloud-delivered hubs change the failure model in interesting ways, but it is a different topology.
Series roadmap
| Part | Topic |
|---|---|
| 1 (this post) | Design and assumptions |
| 2 | BGP on loopback — addressing, hub iBGP without a DCI, spoke eBGP via overlay |
| 3 | DC ↔ DCE integration: static, OSPF, and BGP — pros, cons, and which to pick |
| 4 | BFD and convergence tuning — DPD vs BFD, BFD-for-BGP, timer math |
| 5 | Performance SLAs and service steering — health-check targets, member preference, end-to-end failure walkthrough |
Part 2 will walk through the BGP design end-to-end and lay down the configuration that the rest of the series builds on.