Arista (VMware) SD-WAN Deep Dive — Part 5: Best Practice, Failure Modes, and a Design Checklist

Series map. Part 5 of five.

  1. Components, Gateways, and the Three Planes
  2. Routing — Overlay, Underlay, BGP, and the Gateway as Route Reflector
  3. The Data Plane — VCMP, DMPO, and Per-Flow Steering
  4. Topology Walkthroughs — MPLS-only meets Internet-only Across Continents
  5. Best Practice, Failure Modes, and a Design Checklist (this post)

Final post. Everything we’ve built in Parts 1–4 — Edges, Cloud Gateways, Partner Gateways, the three planes, the routing model, DMPO, the relay flows — needs to survive contact with real customers, real budgets, and real outages. This post is the design lore.

Three sections:

  • Best practice — the design rules that aren’t optional.
  • Failure modes — what breaks, why, and how to spot it.
  • The checklist — one page, use it on every new deployment.

Best practice

Gateway design

Two Gateways minimum, always. Both Cloud and Partner. There is no scenario in which one Gateway is acceptable for production. The two Gateways are not active/standby — they are active/active route reflectors and active/active data-plane relays. An Edge homes to both and uses both. Lose one and the other carries the load. Lose both and the Edges that depend on Gateway-mediated flows go dark for everything except established Direct tunnels.

Geographic placement matches your traffic concentration, not your geography. If 80% of your traffic ends up on a UK Gateway because all your users are in the UK and all your apps live in UK/EU clouds, two UK Gateways is correct. If 30% of your traffic is in APAC, you need a Gateway in APAC, not “two UK Gateways because we’re a UK company”. The BritNet design in this series (two UK Cloud Gateways) is legitimate if and only if international traffic is small or international sites have Internet breakout and don’t need Gateway-mediated relay. We saw in Part 4 that Chicago doesn’t work at all in that design without a Partner Gateway intervention.
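
A rough way to sanity-check placement is to compute each region’s share of total traffic and flag any region above a chosen share that has no Gateway. This is a minimal sketch, assuming per-region volumes from your own telemetry; the region names, the numbers, and the 30% threshold (borrowed from the APAC example above) are illustrative, not product settings.

```python
def regions_needing_gateway(traffic_gb, gateway_regions, min_share=0.30):
    """Return regions whose share of total traffic is >= min_share
    but which have no Gateway. All inputs are hypothetical telemetry."""
    total = sum(traffic_gb.values())
    if total == 0:
        return []
    return sorted(
        region
        for region, volume in traffic_gb.items()
        if volume / total >= min_share and region not in gateway_regions
    )


# Illustrative numbers only: 30% of traffic is in APAC, both Gateways are in the UK.
traffic = {"UK": 55, "EU": 15, "APAC": 30}
print(regions_needing_gateway(traffic, gateway_regions={"UK"}))  # ['APAC']
```

The threshold is a judgement call; what matters is that the input is traffic share and not the location of head office.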

Partner Gateways are not optional for MPLS-only Edges. If you have any MPLS-only sites, you have two choices: give them Internet underlay, or deploy a Partner Gateway that lives in their MPLS VRF. There is no third option that uses pure Cloud Gateways. Most enterprises pick “deploy Partner Gateway” because it’s cheaper than DIA-at-every-site and keeps MPLS sites on MPLS where their SP contract pays for the latency guarantees.

Two Partner Gateways per MPLS network, in different MPLS PoPs. Same reasoning as Cloud Gateway redundancy, except the stakes are higher: a Partner Gateway usually serves a concentrated population (one geography, one customer) whose MPLS-only Edges have no Cloud Gateway to fall back on, so losing one is more painful per Edge. Pair them.

Don’t co-locate Partner and Cloud Gateways on the same hypervisor / power feed / network plane. The point of having both is to have different failure boundaries.

Underlay design

Underlay diversity has to be real. Two cables from the same handhole share a fate. MPLS + broadband from the same SP often shares last-mile fibre. The Edge will happily build VCMP tunnels over both and run DMPO over both, and they will both go down at the same time when the digger hits the duct. Spec last-mile physical diversity in the contract, not just logical diversity.
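
The shared-fate check is easy to run against a site survey. Here is a minimal sketch, assuming you record a last-mile provider and a duct identifier for each underlay; the field names `last_mile_provider` and `duct_id` and the sample values are hypothetical, not anything the Edge or Orchestrator exposes.

```python
from itertools import combinations


def shared_fate_pairs(underlays):
    """Return (underlay_a, underlay_b, reasons) for pairs that share a
    last-mile provider or a duct.

    `underlays` maps an underlay name to a dict with the hypothetical keys
    'last_mile_provider' and 'duct_id', taken from your own site survey.
    """
    risky = []
    for (name_a, a), (name_b, b) in combinations(sorted(underlays.items()), 2):
        reasons = [
            key
            for key in ("last_mile_provider", "duct_id")
            if a.get(key) is not None and a.get(key) == b.get(key)
        ]
        if reasons:
            risky.append((name_a, name_b, reasons))
    return risky


# Illustrative survey data for one branch.
site = {
    "mpls": {"last_mile_provider": "TelcoA", "duct_id": "D1"},
    "broadband": {"last_mile_provider": "TelcoA", "duct_id": "D1"},
    "5g": {"last_mile_provider": "MobileB", "duct_id": None},
}
print(shared_fate_pairs(site))
```

Any pair this returns is two logical underlays and one physical failure domain.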

4G/5G as a third underlay is cheap insurance. A modest 5G modem as a third interface on a branch Edge gives DMPO a path that genuinely doesn’t share fate with the wired pair. Even if it’s bandwidth-limited (and most enterprise plans are), it carries voice and a couple of priority flows during a broadband-and-MPLS dual fail. Cost is small.

Don’t trust the underlay’s SLA. MPLS providers will tell you their network does <X ms / <Y% loss. They are reporting on the PE-to-PE part. Your underlay reality includes last-mile, CE-PE, and whatever happens between CE and the Edge. DMPO measures end-to-end VCMP tunnel quality, which is the only metric that matters for your flows.

MTU

Set the LAN MTU correctly on the Edge. VCMP encapsulation adds ~40 bytes plus encryption overhead. If your underlay is 1500-byte clean (most are), set the LAN MTU on the Edge to 1400 or thereabouts so that LAN-side packets, once wrapped, fit. PMTUD will mostly cope on its own if you let it, but PMTUD is fragile (ICMP filtering, asymmetric paths, GRE between routers somewhere). An explicit lower MTU on the LAN side avoids the entire class of problem.

TCP MSS clamping on the Edge LAN interface. Belt and braces. Even when PMTUD works, MSS clamping ensures TCP flows never try to send packets that won’t survive the encapsulation. The Edge can do this for you; turn it on.

Jumbo on the underlay is sometimes available (private MPLS, dark fibre, on-prem DC interconnect). If so, use it — every byte of underlay headroom is a byte of overlay headroom.
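
The arithmetic behind those numbers is simple enough to write down. This is a minimal sketch, assuming IPv4 and TCP headers without options (20 bytes each); the 100-byte overlay budget is an assumption chosen to reproduce the 1400 figure above, not a measured VCMP overhead, so substitute the value for your own platform and cipher.

```python
IPV4_HEADER = 20  # bytes, no IP options
TCP_HEADER = 20   # bytes, no TCP options


def lan_mtu(underlay_mtu, overlay_overhead):
    """Largest LAN packet that still fits the underlay once encapsulated."""
    return underlay_mtu - overlay_overhead


def tcp_mss(mtu):
    """MSS to clamp to for IPv4 TCP on a link with the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


# Assumed budget: 100 bytes of total overlay overhead (the ~40 bytes of
# encapsulation from the text plus an assumed 60 for encryption and margin).
mtu = lan_mtu(underlay_mtu=1500, overlay_overhead=100)
print(mtu, tcp_mss(mtu))  # 1400 1360
```

With a jumbo underlay the same functions show the headroom: a 9000-byte underlay with the same assumed budget leaves 8900 bytes for LAN packets.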

Segmentation

Segments are real and worth using. Each segment is a VRF end-to-end through the overlay. Per-segment route tables, per-segment Business Policy, per-segment Gateway distribution. Typical uses: guest Wi-Fi traffic separated from corporate; IoT separated from user devices; M&A integration separated from production while teams reconcile.

Don’t over-segment. Every segment is operational overhead (separate policies, separate firewall integrations, separate route reviews). Three or four segments is plenty for most enterprises. Twenty is somebody’s micro-segmentation theatre.

Partner Gateway segmentation lines up with MPLS VRFs. If your MPLS provider has you in multiple VRFs (some do — corporate, guest, IoT), the Partner Gateway needs a BGP session per VRF, mapped to the corresponding overlay segment. This is finicky but it’s the only way to keep the segments isolated end-to-end across the MPLS boundary.

Security service insertion

Decide where Internet egress happens — and stick to it. Three options:

  • Direct Internet Access at the Edge — flow goes Edge LAN → Edge → underlay Internet directly. Cheapest. Requires per-Edge security (NGFW, IPS) or a SASE forward — typically a tunnel from Edge to a SASE PoP. Most modern designs go this way.
  • Backhauled via the Gateway — flow goes Edge → Gateway → Internet from Gateway. Gateway provides a centralised egress and can chain Cloud Web Security (the integrated CASB/SWG offering). Higher latency, simpler to audit. Becoming less common as SASE has eaten this use case.
  • Backhauled via a Hub VCE in a DC — flow goes Edge → Hub → DC firewall → Internet. Traditional enterprise. Operationally familiar. Suboptimal latency. Use this only when compliance requires it.

SASE forward from the Edge is the sane default in 2025+. iboss, Zscaler, Netskope, etc., all integrate as a forwarding target. The Edge classifies the flow, decides “this needs SASE”, builds a tunnel to the nearest SASE PoP, and forwards. Egress security lives in the SASE; the overlay just delivers the flow there.

Service chaining order. When a flow needs multiple services (e.g., on-Edge DPI → SASE → Cloud Web Security), specify the chain explicitly in policy. Order matters: a flow that gets blocked by SASE shouldn’t have already consumed expensive on-Edge resources.

Edge sizing

CPU is more often the constraint than throughput. VCMP encryption (AES-GCM with NIC offload where available) is fine. What chews CPU is DPI on encrypted flows (lots of inspection, lots of state) and tunnel count (a hub VCE with hundreds of branches has hundreds of VCMP tunnels). Watch CPU on the hub and any large spoke; throughput numbers in the spec sheet assume clear-text or minimal DPI.

Direct-tunnel count scales with topology. A 200-site mesh with Direct-preferred everywhere can build ~199 tunnels per Edge. Most Edges are not specced for this. Use partial mesh — Direct only between sites that have meaningful traffic between them, Gateway-mediated everywhere else. This is a per-profile setting.
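
To see how quickly the numbers grow, here is a minimal sketch that counts one tunnel per Edge pair. That is a simplifying assumption: if your Edges build a tunnel per underlay pair, multiply accordingly. The site names in the partial-mesh example are hypothetical.

```python
def full_mesh_tunnels(sites):
    """Return (tunnels per Edge, tunnels in the whole tenant) for a full mesh."""
    per_edge = sites - 1
    total = sites * (sites - 1) // 2
    return per_edge, total


def partial_mesh_tunnels(peers):
    """Count tunnels when Direct is allowed only between listed peers.

    `peers` maps a site to the set of sites it has meaningful traffic with.
    Returns (tunnels per Edge as a dict, total tunnels).
    """
    pairs = {frozenset((a, b)) for a, nbrs in peers.items() for b in nbrs if a != b}
    per_edge = {}
    for pair in pairs:
        for site in pair:
            per_edge[site] = per_edge.get(site, 0) + 1
    return per_edge, len(pairs)


print(full_mesh_tunnels(200))  # (199, 19900)
print(partial_mesh_tunnels({"london": {"leeds", "paris"}, "leeds": {"london"}, "paris": set()}))
```

The per-Edge figure drives Edge sizing; the tenant-wide total is what the re-key and keepalive load scales with.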

Failure modes

The ones that bite teams the first time, with how to spot them.

“Site is offline” with everything green

Site has Internet underlay, broadband is up, Edge is registered with the Orchestrator. But no overlay flows work.

Most common cause: the Edge can’t reach the Gateway public IP from its current path because of CGNAT / firewall / ISP filtering on UDP/2426. The Edge will show a happy management plane (HTTPS to Orchestrator) but no VCMP tunnel up. Check the Edge’s Gateway-connection status, not its Orchestrator status. They are independent.

Second-most common: Gateway’s public IP changed and the Orchestrator pushed the new address but the Edge hasn’t refreshed. Restart the Edge’s Gateway-discovery process.

Asymmetric flow through two Gateways

Edge has two Gateways (Slough and Manchester). Outbound packets of a flow go via Slough; return packets come back via Manchester. Stateful middleboxes between Edges (firewalls inside the LAN) see only half the flow and drop.

Cause: the destination Edge picked a different Gateway for return. Symptom: TCP often survives, because many modern stateful firewalls tolerate asymmetry; UDP services and any firewall that requires strict symmetry break.

Fix: ensure flow affinity is on (it usually is by default) and that the DMPO state on both Edges agrees on the preferred Gateway. If the two disagree persistently (usually because of a real underlay asymmetry), that asymmetry is the underlying problem and DMPO is just reflecting it. Investigate the underlay first.
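
If you can export per-flow records, spotting the asymmetry is a grouping exercise. This is a minimal sketch, assuming a hypothetical export of `(flow_id, direction, gateway)` tuples; that record format is an illustration, not something the Orchestrator is known to produce in this shape.

```python
from collections import defaultdict


def asymmetric_flows(records):
    """Return {flow_id: gateways} for flows whose forward and return
    directions were seen on different sets of Gateways.

    `records` is an iterable of (flow_id, direction, gateway) tuples in a
    hypothetical export format; direction is 'fwd' or 'rev'.
    """
    seen = defaultdict(lambda: {"fwd": set(), "rev": set()})
    for flow_id, direction, gateway in records:
        seen[flow_id][direction].add(gateway)
    return {
        flow_id: dirs["fwd"] | dirs["rev"]
        for flow_id, dirs in seen.items()
        if dirs["fwd"] and dirs["rev"] and dirs["fwd"] != dirs["rev"]
    }


records = [
    ("voip-1", "fwd", "Slough"),
    ("voip-1", "rev", "Manchester"),
    ("web-7", "fwd", "Slough"),
    ("web-7", "rev", "Slough"),
]
for flow_id, gateways in asymmetric_flows(records).items():
    print(flow_id, sorted(gateways))
```

A handful of asymmetric flows is noise; a persistent pattern between the same two sites points back at the underlay.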

MPLS prefix appears twice in the overlay table

An Edge sees a remote site’s LAN prefix twice: once learned directly over the overlay from the remote Edge on that MPLS VRF, and once via the Partner Gateway, which is redistributing the MPLS route into the overlay.

Both routes are correct. The Edge picks one (per route-preference policy). If it picks the Partner Gateway route, traffic trombones through the Partner Gateway when it could have gone direct. The cost is a latency penalty and a capacity penalty on the Partner Gateway.

Fix: route maps on the Partner Gateway’s MPLS-to-overlay redistribution should filter out prefixes that belong to Edges in the overlay. The Edges advertise their own prefixes directly; the Partner Gateway only needs to redistribute prefixes for non-overlay CEs in the MPLS VRF (typically: nothing, in a pure-SD-WAN customer). In practice you set the filter to permit only the Partner Gateway’s own loopbacks and any non-SD-WAN sites you actually have on MPLS.
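
The permit-list logic can be prototyped offline before you touch the route map. Here is a minimal sketch using Python’s standard `ipaddress` module; every prefix below is hypothetical, and the function only models the permit decision, not the Partner Gateway’s actual route-map syntax.

```python
import ipaddress


def permitted_into_overlay(mpls_routes, allow_list):
    """Keep only MPLS routes that fall inside an explicitly permitted prefix."""
    allowed = [ipaddress.ip_network(p) for p in allow_list]
    kept = []
    for route in mpls_routes:
        net = ipaddress.ip_network(route)
        # subnet_of() raises TypeError across IP versions, so compare versions first.
        if any(net.version == a.version and net.subnet_of(a) for a in allowed):
            kept.append(route)
    return kept


# Hypothetical prefixes: a Partner Gateway loopback, an overlay Edge LAN
# (which must NOT be redistributed), and a non-SD-WAN site on MPLS.
mpls_routes = ["10.255.0.1/32", "10.20.0.0/24", "10.30.0.0/24"]
allow_list = ["10.255.0.0/24", "10.30.0.0/16"]
print(permitted_into_overlay(mpls_routes, allow_list))  # ['10.255.0.1/32', '10.30.0.0/24']
```

A permit-list fails closed: a new Edge LAN that nobody remembered to add to a deny-list never leaks in by accident.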

Tunnel flap caused by ISP filtering or rate-limiting

Edge’s broadband ISP rate-limits or filters UDP/2426 from “consumer” customers. VCMP tunnels flap. DMPO marks the path bad. Flows migrate to the other underlay, costing money or capacity.

Symptom: bursts of “WAN link status changed” on the Edge, no corresponding underlay-layer alarm.

Fix: change the VCMP UDP port (it’s configurable per Edge). Some ISPs aggressively rate-limit non-standard high ports; others rate-limit 2426 specifically because they recognise the SD-WAN traffic. Trial and error. Sometimes the answer is a different ISP.
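
The symptom above (a burst of link-status changes with no underlay alarm) can be turned into a crude detector. This is a minimal sketch, assuming you can pull event timestamps in seconds from your own logging; the 300-second window and the four-event threshold are arbitrary starting points, not vendor guidance.

```python
def max_events_in_window(event_times, window_s):
    """Largest number of events that fall within any window of window_s seconds."""
    times = sorted(event_times)
    best = start = 0
    for end, t in enumerate(times):
        # Slide the window start forward until it spans at most window_s.
        while t - times[start] > window_s:
            start += 1
        best = max(best, end - start + 1)
    return best


def flapping_without_underlay_alarm(link_events, underlay_alarms,
                                    window_s=300, min_events=4):
    """True when link-status changes burst but no underlay alarm was raised."""
    bursty = max_events_in_window(link_events, window_s) >= min_events
    return bursty and not underlay_alarms


# Four changes in three minutes, then one much later; no underlay alarms.
print(flapping_without_underlay_alarm([0, 60, 120, 180, 4000], []))  # True
```

A `True` here does not prove ISP filtering; it only says the overlay saw trouble that the underlay monitoring did not, which is the pattern worth escalating.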

Cloud Web Security latency cliff

You enabled Cloud Web Security on the Gateway. Web latency jumps from 30ms to 200ms because flows now egress in a different country.

Cause: the CWS service backhauls Internet egress through a specific set of PoPs. If your Gateway’s CWS upstream is in a distant region, your users pay the latency.

Fix: use a SASE forward from the Edge instead, where the SASE provider has PoPs near your users. Or accept the latency on Internet-egress only. Or pick a different Gateway in a region with closer CWS.

Partner Gateway BGP misadvertisement to MPLS

You stand up a Partner Gateway. The MPLS network suddenly sees the Partner Gateway advertising every overlay prefix, including prefixes for sites that are also CEs on the same MPLS VRF.

The MPLS network now has two paths to those sites — the legitimate CE BGP path and the Partner Gateway path. BGP best-path picks one, sometimes the wrong one. Now traffic between two MPLS-attached CEs goes via the Partner Gateway because the Partner Gateway’s route looked shorter.

Fix: apply route maps to the Partner Gateway’s overlay-to-MPLS export so that it only advertises prefixes for sites that don’t have native MPLS reachability. AS-path prepending or community tagging works if the SP supports it. Test it before deploying; roll back if you see traffic shift.
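
Before deploying the export filter, it helps to compute what it should let through. This is a minimal sketch using Python’s standard `ipaddress` module; the prefixes are hypothetical and the function models only the decision (suppress anything that overlaps a natively reachable prefix), not the route-map syntax.

```python
import ipaddress


def export_to_mpls(overlay_prefixes, native_mpls_prefixes):
    """Return overlay prefixes that are safe to advertise into MPLS: those
    that do not overlap any prefix already reachable natively via a CE."""
    native = [ipaddress.ip_network(p) for p in native_mpls_prefixes]
    safe = []
    for prefix in overlay_prefixes:
        net = ipaddress.ip_network(prefix)
        if not any(net.version == n.version and net.overlaps(n) for n in native):
            safe.append(prefix)
    return safe


# Hypothetical prefixes: two overlay sites are also CEs on the MPLS VRF,
# one Internet-only site is reachable only through the overlay.
overlay = ["10.20.0.0/24", "10.40.0.0/24", "10.50.0.0/24"]
native = ["10.20.0.0/24", "10.40.0.0/16"]
print(export_to_mpls(overlay, native))  # ['10.50.0.0/24']
```

Comparing this list with what the SP PE actually receives from the Partner Gateway is a cheap pre-deployment test.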

Direct tunnels building to sites you never use

You enabled Direct-preferred everywhere. Now your Edge has a Direct VCMP tunnel to every other Edge in the tenant — most of which you never talk to. Memory and CPU climb. Tunnel re-key churn rises.

Fix: change the tunnel-establishment trigger from “first packet” to “threshold of packets” or “explicit configuration”. Or use a partial-mesh profile that only allows Direct between sites in a defined set.
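
The difference between the two triggers is easiest to see in a toy model. This is a hypothetical sketch of a packet-count trigger, not the product’s implementation: a real trigger may also consider time windows or flow types, which this ignores.

```python
from collections import Counter


class DirectTunnelTrigger:
    """Toy model: build a Direct tunnel to a peer after `threshold` packets.

    threshold=1 behaves like a "first packet" trigger.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.packets = Counter()
        self.tunnels = set()

    def on_packet(self, peer):
        """Count a packet to `peer`; return True if this packet triggers a build."""
        if peer in self.tunnels:
            return False
        self.packets[peer] += 1
        if self.packets[peer] >= self.threshold:
            self.tunnels.add(peer)
            return True
        return False


trigger = DirectTunnelTrigger(threshold=3)
print([trigger.on_packet("leeds") for _ in range(4)])  # [False, False, True, False]
```

With a threshold above one, a single stray packet (a scan, a mistyped address) no longer costs you a tunnel.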

Orchestrator outage during change window

You’re deploying. The Orchestrator goes away. You can’t push config. You can’t see telemetry. You panic.

Reality check: the data plane is fine. Existing flows continue. Existing routes hold. DMPO continues to measure and remediate. Edges continue to forward.

You can’t add a new site or change a policy, and you have no fresh telemetry. Those are the only things you lose. Treat the paused change window as recovery time and resume when the Orchestrator is back.

Partner Gateway sees all the bandwidth

A site that should be Direct-meshing with its peers is sending everything via the Partner Gateway. Bandwidth bill on the Partner Gateway interface is climbing.

Cause: Direct tunnels aren’t being built — usually because the destination Edge’s reachable underlay set, as advertised by the Partner Gateway, doesn’t include an underlay the source Edge can reach. Both Edges fall back to Gateway-mediated.

Fix: confirm Edges have the underlay shape you expect. Check the overlay route advertisements — what reachable-underlay set is each Edge announcing to the Gateway? If the answer is “only MPLS” but you expected “MPLS and broadband”, the broadband interface isn’t being recognised as a public underlay (NAT issue, ISP detection issue, or the Edge config has it as private-only). Fix the underlay status first; Direct tunnels build themselves once the candidates are right.
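
The candidate check is a set intersection. This is a minimal sketch, assuming you have collected the underlay set each Edge announces; the site names and the underlay labels are hypothetical, and real candidate selection involves more than label matching (NAT behaviour, for one).

```python
def direct_tunnel_candidates(announced):
    """Map each Edge pair to the underlays both Edges announce.

    `announced` maps an Edge name to the set of underlay labels it
    advertises. An empty intersection means the pair has no shared
    underlay and falls back to Gateway-mediated forwarding.
    """
    edges = sorted(announced)
    return {
        (a, b): announced[a] & announced[b]
        for i, a in enumerate(edges)
        for b in edges[i + 1:]
    }


announced = {
    "leeds": {"mpls"},  # broadband not recognised as a public underlay
    "bristol": {"mpls", "internet"},
    "chicago": {"internet"},
}
for pair, shared in direct_tunnel_candidates(announced).items():
    print(pair, sorted(shared) or "Gateway-mediated only")
```

If a pair you expected to mesh shows up with an empty intersection, the announced set on one side is wrong, and that is where to start looking.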

The one-page design checklist

For every new Arista SD-WAN deployment, before you cut over a single site:

Topology

  • Two Cloud Gateways minimum, placed by traffic concentration not by HQ location.
  • If any MPLS-only sites exist: two Partner Gateways in the MPLS network, in different PoPs.
  • Hub VCE designated for DC-centric flows (if applicable) — and only those flows.
  • Cloud VPN mode set per profile: Direct-preferred where mesh is sane, Gateway-only where it isn’t.

Underlay

  • Last-mile physical diversity confirmed in writing for every site that claims two underlays.
  • 4G/5G as a third underlay on every business-critical branch.
  • LAN MTU on the Edge reduced to fit VCMP overhead (1400 typical).
  • TCP MSS clamping on the Edge LAN interface.

Routing

  • BGP at every Edge that sits on MPLS — peered with the SP PE inside the customer VRF.
  • BGP at every Partner Gateway — peered with the SP PE for each segment / VRF.
  • Route map on the Partner Gateway: only redistribute non-overlay prefixes from MPLS into overlay, and only non-overlay-reachable prefixes from overlay into MPLS.
  • LAN-side BGP / OSPF redistribution into the overlay scoped to LAN prefixes only — no underlay leakage.

Segmentation

  • Segments mapped 1:1 to MPLS VRFs where applicable.
  • Per-segment Business Policy reviewed.
  • Per-segment Gateway distribution confirmed (Partner Gateway sees the segment, Cloud Gateway sees the segment if relevant).

Security

  • Egress decision documented per segment: DIA at Edge / Gateway breakout / DC backhaul.
  • SASE provider integrated as a forward target if DIA at Edge.
  • On-Edge NGFW or service insertion confirmed where DIA at Edge is used.
  • Cloud Web Security configured on Gateway only if Gateway breakout is the chosen path.

Operations

  • Orchestrator outage scenario walked through with NOC — confirm understanding that data plane continues.
  • DMPO thresholds reviewed per application class — voice, video, transactional, bulk.
  • Tunnel-establishment trigger reviewed (first-packet vs. threshold).
  • Monitoring of Gateway public-IP reachability from every Edge — Gateway-side, not Orchestrator-side.
  • Capacity baseline captured before cutover so post-cutover comparison is possible.
  • Rollback procedure documented and tested for cutting one site back to MPLS-only forwarding.

If every box is ticked, you have a design that survives the first quarter. If not, the missing boxes are your follow-ups for the second quarter.

Closing

Across the five posts we’ve built up the architecture from components to flows to design rules. The pieces fit together cleanly once you separate the planes, give the Gateway credit for the route-reflector role it actually plays, and treat the Partner Gateway as the bridge between underlay regimes that it is.

If any of these posts disagrees with your production reality, that’s the kind of correction I want — much of this is the canonical architecture, and field reality always has the better stories.