Designing an Arista SD-WAN Spoke with Enhanced HA, Dual DIA, and OSPF
This post walks through a real-world spoke-site design on Arista SD-WAN (formerly VMware/VeloCloud SD-WAN). The goal is a resilient branch with no single point of failure between the LAN and the overlay: two Edges in Enhanced HA, two Direct Internet Access circuits as the underlay, a shared multi-VLAN LAN, and OSPF between the Edge cluster and the LAN core.
I’ll cover the topology, why the WAN side is wired the way it is, how the LAN and OSPF design hangs together, and — most importantly — the caveats that catch people out.
What we’re building
  ISP A (DIA1)             ISP B (DIA2)
       │                        │
       ▼                        ▼
┌──────────────┐         ┌──────────────┐
│    Edge 1    │◄──HA1──►│    Edge 2    │
│ (GE2 = DIA1) │         │ (GE2 = DIA2) │
└──────┬───────┘         └──────┬───────┘
       │    GE3 (LAN trunk)     │
       └───────────┬────────────┘
                   ▼
          ┌──────────────────┐
          │ LAN core (MLAG)  │
          │ multiple VLANs   │
          │ OSPF peer        │
          └──────────────────┘
Two Edges, two ISPs, one LAN environment. The two Edges form an Enhanced HA cluster — each Edge owns its own WAN circuit, and the pair shares a cluster identity on the LAN. OSPF runs between the cluster and the LAN core to exchange routes both ways.
WAN underlay: why each DIA goes to its own Edge
In an Arista SD-WAN HA cluster you have two broad options for the WAN side:
- Standard HA — both Edges share an L2 segment on the WAN side, typically via an upstream switch between the modems and the Edges.
- Enhanced HA (eHA) — each Edge has its own dedicated WAN interfaces. No shared L2 on the WAN side. The active Edge uses its partner’s WAN circuits via the HA link.
For a spoke with two DIA circuits, Enhanced HA is the better default, and the optimal wiring is:
- DIA1 (ISP A handoff) → Edge 1 GE2
- DIA2 (ISP B handoff) → Edge 2 GE2
- HA1 link directly between Edge 1 GE1 ↔ Edge 2 GE1 (a single cross-cable, no switch in between)
Why this is the right shape:
- No upstream WAN switch. A switch between the modems and the Edges would itself be a single point of failure (or you’d need two of them, MLAG’d, just for the WAN side — unnecessary cost and complexity for two circuits). With eHA you simply don’t need it.
- Clean ISP demarcation. Each ISP’s handoff terminates on exactly one device. Faults are unambiguous: a flap on DIA1 is Edge 1’s problem.
- Both circuits are usable from the active Edge. Even though only one Edge is active in steady state, the cluster’s forwarding plane lets it use the standby’s local circuit via the HA link. So “active on Edge 1” doesn’t mean DIA2 is idle: it’s still measured, and it still carries traffic under the overlay’s load-balancing policies.
- Survives an Edge failure with one circuit. If Edge 1 dies, Edge 2 takes over and DIA2 is still local to it. If Edge 2 dies, Edge 1 keeps DIA1 local. Either way, you keep at least one full-rate circuit.
- Survives a circuit failure with no Edge failover. A flap on DIA1 doesn’t trigger an HA failover — the cluster just stops using that path. HA failover is reserved for actual Edge faults.
What I’d avoid:
- Both DIAs on a single Edge — you lose both circuits if that Edge dies, and HA failover gives you a working Edge with no underlay.
- Both DIAs through a shared L2 switch — reintroduces a SPOF and complicates DHCP/PPPoE handoff.
- Crossing the DIAs (DIA1 → Edge 2, DIA2 → Edge 1) — no benefit, harder to reason about.
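To make the failure cases concrete, here is a minimal Python sketch that compares the eHA wiring with the “both DIAs on one Edge” anti-pattern. The model is deliberately simple and is my own simplification, not anything the Orchestrator exposes: a circuit counts as usable if it is up and the Edge it terminates on is alive. The wiring maps and names are hypothetical.

```python
# Hypothetical wiring maps: circuit name -> Edge it terminates on.
EHA_WIRING = {"DIA1": "edge1", "DIA2": "edge2"}
SINGLE_EDGE_WIRING = {"DIA1": "edge1", "DIA2": "edge1"}


def usable_circuits(wiring, failed_edges=(), failed_circuits=()):
    """Return the circuits the cluster can still forward over.

    Simplified model: a circuit is usable if it is up and the Edge it
    terminates on is alive. The active Edge reaches its partner's circuit
    over the HA link, so which Edge is active does not matter here.
    """
    return sorted(
        circuit
        for circuit, edge in wiring.items()
        if circuit not in failed_circuits and edge not in failed_edges
    )


scenarios = {
    "steady state": {},
    "Edge 1 dies": {"failed_edges": ["edge1"]},
    "Edge 2 dies": {"failed_edges": ["edge2"]},
    "DIA1 flaps": {"failed_circuits": ["DIA1"]},
}

for name, failure in scenarios.items():
    eha = usable_circuits(EHA_WIRING, **failure)
    single = usable_circuits(SINGLE_EDGE_WIRING, **failure)
    print(f"{name:13} eHA={eha} single-edge={single}")
```

With the eHA wiring every single-fault scenario leaves at least one circuit; with both DIAs on Edge 1, losing Edge 1 leaves none.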
LAN side: shared environment, multiple VLANs, cluster IPs
Both Edges’ LAN ports (e.g. GE3) connect to the LAN environment as an 802.1Q trunk carrying every VLAN you want the Edges to serve. For real LAN-side resilience, terminate each Edge’s GE3 on a different physical switch in an MLAG pair, one Edge per switch. A single LAN switch is fine for a small site, but be honest with yourself that the switch is now the SPOF.
For each VLAN you want the Edge cluster to participate in, configure a sub-interface on the LAN trunk with three addresses:
- Edge 1 physical IP (unique)
- Edge 2 physical IP (unique)
- Cluster IP (shared, virtual — owned by whichever Edge is active)
Downstream LAN devices use the cluster IP as their default gateway (or as their OSPF neighbour). On failover the cluster IP MAC moves to the new active Edge via gratuitous ARP — that part is sub-second.
You do not need VRRP between the Edges and the LAN. The cluster handles the virtual IP itself; layering VRRP on top is a common mistake and adds nothing.
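If you plan addressing in a script before touching the Orchestrator, the three-address pattern is easy to generate. The sketch below uses Python’s standard ipaddress module. The convention of giving the cluster IP the first usable host and the two Edges the next two is my assumption for illustration, as are the VLAN IDs and subnets.

```python
import ipaddress
from itertools import islice


def vlan_addressing(vlan_id, subnet):
    """Pick three host addresses for one VLAN sub-interface.

    Assumed convention (not a product requirement): first usable host is
    the shared cluster IP, the next two are the per-Edge physical IPs.
    """
    net = ipaddress.ip_network(subnet)
    hosts = list(islice(net.hosts(), 3))
    if len(hosts) < 3:
        raise ValueError(f"{subnet} is too small for two Edges plus a cluster IP")
    cluster, edge1, edge2 = hosts
    return {
        "vlan": vlan_id,
        "cluster_ip": f"{cluster}/{net.prefixlen}",
        "edge1_ip": f"{edge1}/{net.prefixlen}",
        "edge2_ip": f"{edge2}/{net.prefixlen}",
    }


# Made-up VLANs and subnets, purely for illustration.
site_vlans = [(10, "10.20.10.0/24"), (20, "10.20.20.0/24"), (99, "10.20.99.0/29")]

for vlan_id, subnet in site_vlans:
    print(vlan_addressing(vlan_id, subnet))
```

A /30 is rejected because it only has two usable hosts, which is a useful reminder that the transit VLAN needs at least a /29 once the LAN core’s own addresses are counted.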
OSPF design
OSPFv2 runs on the LAN sub-interfaces of the Edge cluster and peers with the LAN core. A few design decisions to make up front:
- Where does OSPF actually run? Pick one transit VLAN to the core for the OSPF adjacency, and treat the user VLANs as static (Edge cluster IP is the gateway, no OSPF). This keeps the routing table predictable and avoids asymmetric paths across VLANs.
- Which Edge speaks OSPF? Only the active Edge. The standby is silent on OSPF until it’s promoted. The Edge sources Hellos from the cluster IP, not its physical IP, so the LAN core sees a single neighbour identity across failovers.
- Network type. If the LAN core presents as one OSPF peer (an MLAG pair acting as a single neighbour), use point-to-point — no DR/BDR election, faster, simpler. If you’ve genuinely got two separate OSPF peers on the segment, use broadcast with explicit priorities.
- Area. Most spokes sit in a stub or totally-stubby area to keep the LSA database small. Area 0 is fine for a small estate; pick once and stick with it across spokes.
- Redistribution. Routes learned from OSPF feed into the SD-WAN overlay (so other sites can reach this branch’s prefixes). Routes learned from the overlay feed back into OSPF as Type-5 (E2) by default. Tag everything you redistribute from overlay → OSPF (e.g. tag 100) and filter that tag on the way back in to prevent loops.
- Authentication. MD5 minimum, SHA-256 if both ends support it. Same key on both Edges and the LAN core.
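The tag-and-filter idea from the redistribution bullet is easier to see as a toy model. This Python sketch is not Edge or Orchestrator configuration; the Route class and both function names are hypothetical, and only the tag value 100 comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional

OVERLAY_TAG = 100  # the example tag value used in the text


@dataclass(frozen=True)
class Route:
    prefix: str
    tag: Optional[int] = None


def overlay_to_ospf(routes):
    """Redistribute overlay routes into OSPF, stamping each with the tag."""
    return [Route(r.prefix, OVERLAY_TAG) for r in routes]


def ospf_to_overlay(routes):
    """Advertise OSPF-learned routes into the overlay, dropping tagged ones."""
    return [r for r in routes if r.tag != OVERLAY_TAG]


overlay_routes = [Route("10.50.0.0/16")]  # learned from a remote site
local_ospf = [Route("10.20.10.0/24")]     # genuinely local LAN prefix

# What the Edge sees in OSPF after redistribution has happened.
ospf_table = local_ospf + overlay_to_ospf(overlay_routes)

# Only the untagged, genuinely local prefix goes back to the overlay.
advertised = ospf_to_overlay(ospf_table)
print([r.prefix for r in advertised])  # ['10.20.10.0/24']
```

The remote prefix carries tag 100 once it has been through OSPF, so the inbound filter stops it being re-advertised into the overlay as if it were local.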
Caveats — the things that bite
This is the part worth dog-earing.
1. OSPF only runs on the active Edge. A failover from Edge 1 to Edge 2 means a fresh OSPF adjacency on Edge 2. With default 10s/40s Hello/Dead timers, expect 30–40 seconds of blackhole on overlay-reachable prefixes during failover. Tune to 5/20 as a sane default, 1/3 if your LAN core is comfortable with it. BFD on top is the right answer where supported.
2. HA failover is fast; routing reconvergence is not. The cluster IP’s L2 move is sub-second, but the control plane on the new active Edge has to rebuild OSPF from scratch and reinstall routes. Don’t conflate the two — most “HA didn’t work” reports are actually “OSPF took 35 seconds”.
3. MTU mismatches wedge adjacency. SD-WAN overlay encap eats into MTU; the LAN side is normally 1500. If the OSPF peer disagrees on interface MTU, adjacency stalls at EXSTART/EXCHANGE. Either align MTU explicitly on both ends, or set ip ospf mtu-ignore on the Edge sub-interface.
4. Passive-interface every VLAN you don’t want OSPF on. If you’ve enabled OSPF globally and forgotten to mark user VLAN sub-interfaces as passive, the Edge will happily try to form adjacencies with whatever’s on those VLANs. Be explicit.
5. Route preference between overlay and OSPF. If the same prefix is reachable via the overlay and via OSPF (e.g. a dual-homed branch with a backdoor MPLS link), the Edge picks based on its route policy, not on classical AD rules. Decide which prefixes belong on the overlay versus OSPF, and write filters that match the decision.
6. Loops via redistribution. OSPF → overlay → another site’s OSPF → back into ours is the classic loop shape. Tag at the boundary, filter on the way back. Don’t rely on SPF cost to save you.
7. HA link sizing. When the active Edge uses its partner’s local DIA, that traffic crosses the HA link. With two 1 Gbps DIAs and a 1 Gbps HA link you’re fine. Above that, size the HA link to at least the larger circuit: either Edge can end up active, and whichever circuit is remote at that point crosses the link. The HA link should be a direct cable, not a path through the LAN switch.
8. LAN switch is the new SPOF. eHA gives you Edge redundancy. It does not protect you against a single LAN switch dying with both Edge GE3 ports plugged into it. Dual LAN switches with MLAG, one Edge per switch, is the proper build.
9. Cluster IP must be the OSPF source. The Edge automatically sources Hellos from the cluster IP when OSPF is enabled on a sub-interface that has one configured — but verify this in show ip ospf neighbour from the LAN core after you bring it up. If the neighbour ID flaps on every failover, you’ve got the physical IP being used and you’ll need to fix it.
10. Stateful inspection and asymmetric routing. If multiple VLANs run OSPF and the LAN has internal paths between them, traffic can ingress on one VLAN and egress on another. The Edge’s stateful inspection drops these flows. The fix is the design choice in the OSPF section: run OSPF on one transit VLAN, treat the rest as static.
11. Hello/Dead timers must match. Both Edges and the LAN core must agree. Mismatch = no adjacency. Obvious, but easy to break by tuning one side and not the other.
12. Don’t add VRRP on the Edge LAN. The cluster IP is already the virtual gateway. Adding VRRP between the Edges and the LAN core fights the cluster and creates ARP races on failover.
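Caveats 3 and 11 are the kind of thing worth checking mechanically before a change window. Below is a small Python sketch that compares the OSPF settings you intend for the Edge cluster against the LAN core. The OspfIntf class and its field names are hypothetical, not a vendor schema, and the MTU rule follows RFC 2328 section 10.6; individual implementations may behave differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OspfIntf:
    """Hypothetical summary of one side's OSPF settings on the transit VLAN."""
    name: str
    hello: int
    dead: int
    mtu: int
    mtu_ignore: bool = False


def adjacency_problems(a, b):
    """List the reasons a and b would fail to reach FULL (caveats 3 and 11)."""
    problems = []
    if (a.hello, a.dead) != (b.hello, b.dead):
        problems.append(
            f"timer mismatch: {a.name} {a.hello}/{a.dead} "
            f"vs {b.name} {b.hello}/{b.dead}"
        )
    if a.mtu != b.mtu:
        # RFC 2328 section 10.6: the receiver rejects a Database Description
        # packet advertising an MTU larger than its own, so the side with
        # the smaller MTU is the one that has to skip the check.
        smaller = a if a.mtu < b.mtu else b
        if not smaller.mtu_ignore:
            problems.append(
                f"MTU mismatch: {a.name}={a.mtu}, {b.name}={b.mtu}, "
                f"and {smaller.name} does not ignore MTU"
            )
    return problems


# Example values are made up: Edge timers tuned, core left at defaults.
cluster = OspfIntf("edge-cluster", hello=5, dead=20, mtu=1500)
core = OspfIntf("lan-core", hello=10, dead=40, mtu=9214)

for problem in adjacency_problems(cluster, core):
    print(problem)
```

An empty list means the two sides agree on the parameters modelled here; it says nothing about area, authentication, or network type, which you would want to add for a real pre-flight check.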
Build order (Orchestrator)
- Stage both Edges in the Orchestrator under the same site, marked as an HA pair.
- Apply the Edge profile: WAN interfaces (GE2 = INET on each Edge, with the corresponding ISP’s handoff config), LAN trunk on GE3, HA on GE1.
- Define each VLAN as a sub-interface on the LAN trunk, with per-Edge physical IPs and a cluster IP. Mark non-transit VLANs passive for OSPF.
- Configure OSPF on the transit VLAN: area, network type (point-to-point if the core is a single neighbour), Hello/Dead (5/20), MD5 auth, mtu-ignore, redistribution policy with tagging.
- Define the route policies: which OSPF-learned prefixes get advertised into the overlay, which overlay prefixes get redistributed back into OSPF, and the inbound filter on the tag.
- Activate both Edges via ZTP over their local DIA. The cluster forms once both are connected.
Validation
Before you call it done:
- Orchestrator shows one Edge Active, the other Standby.
- Cluster IP pings from a host on every VLAN.
- show ip ospf neighbour on the LAN core shows one neighbour, in FULL state, sourced from the cluster IP.
- Traceroute from a LAN client to a remote-site prefix exits via the overlay.
- Pull DIA1 cable: traffic shifts to DIA2, no Edge failover, no OSPF flap.
- Reboot the active Edge: failover completes, OSPF reconverges within your tuned Dead interval, traffic restored.
- Tag-filter test: confirm overlay-redistributed routes don’t loop back into the overlay from another site.
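For the reboot test it helps to put a number on “traffic restored”. One way, sketched here in Python, is to run a continuous probe from a LAN client to a remote-site prefix during the failover, record a timestamp and pass/fail for each probe, and compute the longest gap afterwards. The sample format and the function name are my own assumptions; the samples below are synthetic, not measured results.

```python
def longest_outage(samples):
    """Return the longest outage in seconds.

    samples: (timestamp_seconds, ok) tuples in time order. An outage runs
    from the first failed probe to the next successful one; an outage
    still open at the end is measured up to the last sample.
    """
    longest = 0
    outage_start = None
    last_ts = None
    for ts, ok in samples:
        last_ts = ts
        if not ok and outage_start is None:
            outage_start = ts
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    if outage_start is not None:
        longest = max(longest, last_ts - outage_start)
    return longest


DEAD_INTERVAL = 20  # seconds, matching the 5/20 timers used in this post

# Synthetic one-per-second probes: failures from t=3 through t=21.
samples = [(t, not (3 <= t <= 21)) for t in range(30)]

outage = longest_outage(samples)
print(f"longest outage: {outage}s, within Dead interval: {outage <= DEAD_INTERVAL}")
```

Probe resolution bounds the accuracy: with one probe per second the result is good to about a second, which is plenty when the budget is a 20-second Dead interval.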
Wrap
Enhanced HA with dual DIA, each circuit on its own Edge, is the right default for a resilient SD-WAN spoke. The wins are real — clean ISP demarc, no WAN switch, no shared L2 to argue about — but the design only pays off if you also get the OSPF side right. The two failure modes that actually hurt are slow OSPF reconvergence on Edge failover and asymmetric flows breaking stateful inspection. Tune timers, pick one transit VLAN, tag your redistributions, and you’ll have a spoke that holds up under both link and device faults.
Future posts will cover the hub side of this design and what changes when you swap one DIA for a private circuit (MPLS or P2P), where the underlay assumptions get more interesting.