Resilient Network Topologies
A resilient network eliminates single points of failure across each layer: physical links, gateway routers, WAN uplinks, and management access. This guide covers design principles and RouterOS configuration patterns for building redundant networks, from dual-homed LAN gateways to multi-ISP routing with out-of-band management.
Design Principles
Three core principles guide resilient network design:
Separate redundancy concerns by layer. VRRP handles first-hop gateway redundancy for clients. Bonding protects physical link failures between devices. Routing distance, ECMP, and recursive routes handle WAN uplink diversity. Mixing these mechanisms creates complexity and failure modes. Design each layer independently.
Avoid single points of failure at every tier. A redundant LAN gateway connected to a single WAN link is only partially resilient. Map out every device, link, and uplink in the path and identify where a single failure would disrupt service.
Plan for management access. Data plane failures can lock you out of the router if management traffic shares the same path. Dedicate a physically separate or logically isolated path for management access before a problem occurs.
Layer Architecture
A fully redundant site uses at least three layers:

            Clients
               |
      LAN (VRRP virtual gateway)
            /     \
          R1       R2      ← Two routers, VRRP between them for gateway redundancy
          |         |
         bond     bond     ← Bonded uplinks to distribution/edge switches
          |         |
         SW1       SW2     ← Redundant upstream switches (LACP/MLAG)
          |         |
         ISP1     ISP2     ← Dual ISP uplinks for WAN diversity

Each layer fails independently. A switch failure does not impact the WAN. A WAN outage does not take down LAN access.
LAN Gateway Redundancy with VRRP
VRRP provides a virtual gateway address that floats between two routers. Clients use the virtual IP as their default gateway and are unaffected by router failover.
Two-Router Active/Standby Gateway
Configure the primary router with a high VRRP priority and the standby with a lower priority:

    # Primary router: priority 150
    /interface vrrp add \
        name=vrrp-lan \
        interface=bridge-lan \
        vrid=10 \
        priority=150 \
        preemption-mode=yes \
        interval=1s

    /ip address
    add address=10.10.10.2/24 interface=bridge-lan
    # virtual gateway
    add address=10.10.10.1/32 interface=vrrp-lan

    # Standby router: priority 100
    /interface vrrp add \
        name=vrrp-lan \
        interface=bridge-lan \
        vrid=10 \
        priority=100 \
        preemption-mode=yes \
        interval=1s

    /ip address
    add address=10.10.10.3/24 interface=bridge-lan
    # same virtual gateway
    add address=10.10.10.1/32 interface=vrrp-lan

Hosts point to 10.10.10.1 as their default gateway. When the primary router fails, the standby assumes Master state and begins handling traffic.
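To make failover events visible, the VRRP interface can run a script on each state transition. The following is a minimal sketch, assuming the `vrrp-lan` interface defined above; the log message text is illustrative and not part of the original configuration.

```routeros
# Sketch only: log VRRP state transitions on each router (message text is illustrative)
/interface vrrp set [find name=vrrp-lan] \
    on-master=":log warning \"vrrp-lan: this router is now Master\"" \
    on-backup=":log info \"vrrp-lan: this router is now Backup\""

# Confirm which router currently holds the virtual gateway
/interface vrrp print detail
```

Applying this on both routers leaves a timestamped record of each failover in the system log, which helps when correlating a gateway change with an upstream event.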
Connection Tracking Synchronization
For networks with stateful firewalls, enable connection tracking sync so established sessions survive failover:

    # Enable on both routers
    /ip/firewall/connection/tracking/set enabled=yes
    /interface vrrp set vrrp-lan sync-connection-tracking=yes

Firewall rules must permit UDP port 8275 from the peer's real IP. The rule below belongs on the primary; on the standby, mirror it with the primary's real address (10.10.10.2) as src-address:

    /ip firewall filter add \
        chain=input protocol=udp dst-port=8275 \
        src-address=10.10.10.3 action=accept \
        comment="VRRP conntrack sync from standby"

Grouped WAN and LAN VRRP
When both routers also have WAN interfaces, group the WAN and LAN VRRP instances so they share a single state machine. This prevents asymmetric routing where the LAN Master sends traffic out a WAN interface on the other router:

    /interface vrrp
    add name=vrrp-wan interface=sfp-sfpplus1 vrid=20 priority=150
    add name=vrrp-lan interface=bridge-lan vrid=10 priority=150

    # Link both instances: vrrp-lan controls the group
    /interface vrrp set [find where name!=vrrp-lan] group-authority=vrrp-lan

Both interfaces transition to Master or Backup together, ensuring the same router handles both LAN client traffic and WAN forwarding.
Link Redundancy with Bonding
Bonding combines multiple physical interfaces into a single logical link. Every mode provides automatic failover when a physical link fails; aggregating modes such as 802.3ad also provide additional bandwidth.
Active-Backup Between Routers
For direct router-to-router links, active-backup provides simple failover without requiring switch LACP support:

    # On both routers
    /interface bonding add \
        name=bond-inter \
        slaves=ether1,ether2 \
        mode=active-backup \
        primary=ether1 \
        link-monitoring=mii \
        mii-interval=100ms

    # First router; use 172.16.0.2/30 on the peer
    /ip address add address=172.16.0.1/30 interface=bond-inter

When ether1 fails, ether2 takes over within roughly one MII polling interval (100 ms here, which is also the default).
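A failover claim like this is worth confirming before relying on it. The following is a rough sketch of a manual test, assuming the `bond-inter` bond and member names used above; it briefly disables the primary member, so it is best run in a maintenance window.

```routeros
# Sketch only: one-shot status of the bond (mode, active and inactive member ports)
/interface bonding monitor bond-inter once

# Simulate a failure of the primary member and re-check which port is active
/interface ethernet disable ether1
/interface bonding monitor bond-inter once

# Restore the primary member; with primary=ether1 it should become active again
/interface ethernet enable ether1
/interface bonding monitor bond-inter once
```

An administrative disable is only an approximation of a cable or optic failure, so pulling the physical link remains the more faithful test where that is practical.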
LACP to Distribution Switches
For uplinks to managed switches, 802.3ad LACP provides standards-compliant link aggregation with automatic member discovery:

    /interface bonding add \
        name=bond-uplink \
        slaves=ether3,ether4 \
        mode=802.3ad \
        lacp-rate=1sec \
        link-monitoring=mii

    /ip address add address=10.1.0.1/30 interface=bond-uplink

Configure the matching LACP port-channel on the switch. lacp-rate=1sec enables fast LACP PDU exchange for faster member failure detection.
Choosing Link Monitoring
| Scenario | Recommended monitoring |
|---|---|
| Direct router-to-router links | mii — detects physical layer failure immediately |
| Through a switch fabric | arp — detects reachability, not just physical link |
| 802.3ad LACP bonds | mii — ARP monitoring can mislead due to hashing |
For ARP monitoring, set the target IP to a reachable peer on the bonded segment:

    /interface bonding set [find name=bond1] \
        link-monitoring=arp \
        arp-ip-targets=10.1.0.2 \
        arp-interval=100ms

WAN Diversity with Dual ISPs
Dual ISP uplinks eliminate single-provider dependency. RouterOS supports several models: active/standby failover, ECMP load sharing, and policy-based steering.
Active/Standby with Distance
The simplest dual-ISP design uses route distance to prefer one provider. When the primary gateway stops answering the check-gateway=ping probes, the primary route becomes inactive and the backup route takes over automatically:

    /ip route
    add dst-address=0.0.0.0/0 gateway=203.0.113.1 check-gateway=ping distance=1 comment=isp1-primary
    add dst-address=0.0.0.0/0 gateway=198.51.100.1 check-gateway=ping distance=10 comment=isp2-backup

However, check-gateway=ping only confirms the directly connected gateway is reachable, not the broader internet path. A failed ISP upstream can leave the gateway pingable while internet traffic blackholes.
Recursive Routing for Reliable Health Checks
Recursive routes with internet health targets provide end-to-end reachability verification:

    # Bind health targets to specific ISP gateways
    /ip route
    add dst-address=9.9.9.9/32 gateway=203.0.113.1 scope=10 comment=isp1-health
    add dst-address=1.1.1.1/32 gateway=198.51.100.1 scope=10 comment=isp2-health

    # Default routes resolved through health targets
    /ip route
    add dst-address=0.0.0.0/0 gateway=9.9.9.9 check-gateway=ping distance=1 comment=isp1-default
    add dst-address=0.0.0.0/0 gateway=1.1.1.1 check-gateway=ping distance=10 comment=isp2-default

When ISP1's upstream fails, 9.9.9.9 stops answering through 203.0.113.1 even though the gateway itself may still respond. The check-gateway=ping probes against 9.9.9.9 time out and the ISP1 default route becomes inactive. Traffic then falls back to ISP2. The switch is not instantaneous: check-gateway probes every 10 seconds and declares the gateway unreachable after two timeouts, so expect a gap of roughly 20 seconds.
ECMP Load Sharing
For networks that want to use both ISP links simultaneously, set equal distances on both routes:

    /ip route
    add dst-address=0.0.0.0/0 gateway=9.9.9.9 check-gateway=ping distance=1 comment=isp1-ecmp
    add dst-address=0.0.0.0/0 gateway=1.1.1.1 check-gateway=ping distance=1 comment=isp2-ecmp

RouterOS distributes new connections across both routes. When one path fails, traffic shifts entirely to the surviving ISP.
Note: ECMP with NAT requires careful policy routing to ensure that return traffic exits through the same ISP the connection was established on. See Multi-WAN Failover with Recursive Routing for connection-marking patterns.
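As a rough illustration of the connection-marking idea, the sketch below pins each connection to the ISP it first used. It assumes RouterOS v7 syntax, the WAN interfaces and LAN subnet used elsewhere in this guide (sfp-sfpplus1, sfp-sfpplus2, 10.10.10.0/24), and hypothetical table and mark names (to-isp1, isp1-conn, and so on); treat it as a starting point rather than a complete policy.

```routeros
# Sketch only: table and mark names are illustrative assumptions.
# One routing table per ISP (RouterOS v7)
/routing table
add name=to-isp1 fib
add name=to-isp2 fib

/ip route
add dst-address=0.0.0.0/0 gateway=203.0.113.1 routing-table=to-isp1
add dst-address=0.0.0.0/0 gateway=198.51.100.1 routing-table=to-isp2

/ip firewall mangle
# Tag each new connection with the WAN interface it first uses, in either direction
add chain=prerouting in-interface=sfp-sfpplus1 connection-mark=no-mark \
    action=mark-connection new-connection-mark=isp1-conn
add chain=prerouting in-interface=sfp-sfpplus2 connection-mark=no-mark \
    action=mark-connection new-connection-mark=isp2-conn
add chain=postrouting out-interface=sfp-sfpplus1 connection-mark=no-mark \
    action=mark-connection new-connection-mark=isp1-conn
add chain=postrouting out-interface=sfp-sfpplus2 connection-mark=no-mark \
    action=mark-connection new-connection-mark=isp2-conn
# Keep later LAN packets of a tagged connection on the same ISP
add chain=prerouting src-address=10.10.10.0/24 connection-mark=isp1-conn \
    action=mark-routing new-routing-mark=to-isp1 passthrough=no
add chain=prerouting src-address=10.10.10.0/24 connection-mark=isp2-conn \
    action=mark-routing new-routing-mark=to-isp2 passthrough=no

# Source NAT per WAN interface
/ip firewall nat
add chain=srcnat out-interface=sfp-sfpplus1 action=masquerade
add chain=srcnat out-interface=sfp-sfpplus2 action=masquerade
```

The first packet of a connection is still placed by the ECMP routes in the main table; the marks only ensure that every later packet of that connection follows the same uplink, so the NAT source address stays consistent for the life of the flow.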
Netwatch Integration
Add Netwatch probes to trigger script-based failover when ECMP is not suitable or when scripts must update VRRP priority simultaneously:

    /tool/netwatch
    add name=isp1-probe host=9.9.9.9 type=icmp interval=10s timeout=1s \
        down-script="/system script run isp1-down" \
        up-script="/system script run isp1-up"

    /system script
    add name=isp1-down source={
        /ip route set [find comment=isp1-default] disabled=yes
        /interface vrrp set [find name=vrrp-lan] priority=90
        :log warning "ISP1 down: route disabled, VRRP priority reduced"
    }
    add name=isp1-up source={
        /ip route set [find comment=isp1-default] disabled=no
        /interface vrrp set [find name=vrrp-lan] priority=150
        :log info "ISP1 up: route restored, VRRP priority restored"
    }

Out-of-Band Management
A data-plane outage becomes a far bigger problem when it also locks the operator out of the router. Design management access independently of the production data path.
Dedicated Management Interface
Assign a separate physical interface or VLAN exclusively for management traffic:

    /interface list
    add name=WAN
    add name=LAN
    add name=MGMT

    /interface list member
    add interface=ether10 list=MGMT
    add interface=ether1 list=WAN
    add interface=bridge-lan list=LAN

    # Restrict management services to MGMT network only
    /ip service
    set winbox address=10.255.0.0/24
    set ssh address=10.255.0.0/24
    set api address=10.255.0.0/24

Management traffic on 10.255.0.0/24 is isolated from customer-facing interfaces. A WAN or LAN outage leaves management accessible.
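The service address restriction filters by source address only. A complementary step, sketched here under the assumption that the MGMT interface list above is in place, is to drop management ports arriving on any other interface and to limit layer-2 management to the same list. The port numbers are the RouterOS defaults for SSH (22), WinBox (8291), and API (8728).

```routeros
# Sketch only: assumes the MGMT interface list defined above and default service ports.
# Drop SSH, WinBox and API connections arriving on any non-management interface
/ip firewall filter add chain=input protocol=tcp dst-port=22,8291,8728 \
    in-interface-list=!MGMT action=drop \
    comment="Management services only via MGMT"

# Limit layer-2 management access (MAC Telnet, MAC WinBox) to the same list
/tool mac-server set allowed-interface-list=MGMT
/tool mac-server mac-winbox set allowed-interface-list=MGMT
```

Rule order matters: a new filter rule is appended to the end of the chain, so move it above any broader accept rule, and keep any explicit accept rules for other intended management paths ahead of it.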
LTE or Cellular OOB Path
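For critical sites, configure a cellular backup as an OOB path:

    # The LTE modem on a USB or dedicated port is detected automatically (lte1 here)
    # and receives its address from the carrier, so no static /ip address entry is needed.
    # "internet" is a placeholder APN; the distance keeps the LTE default route as a last resort.
    /interface lte apn add name=oob-apn apn=internet add-default-route=yes default-route-distance=20
    /interface lte set [find name=lte1] apn-profiles=oob-apn

    # Firewall: accept SSH only from management source on LTE
    /ip firewall filter add \
        chain=input in-interface=lte1 protocol=tcp dst-port=22 \
        src-address=198.51.100.10 action=accept \
        comment="OOB SSH from NOC via LTE"

    /ip firewall filter add \
        chain=input in-interface=lte1 action=drop \
        comment="Drop all other LTE input"

The LTE path remains available even when both primary ISPs are down, enabling remote recovery from routing or configuration errors. If SSH is also restricted with /ip service set ssh address=..., include the NOC source address (198.51.100.10 here) in that list; otherwise the service rejects the session even though the firewall accepts it.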
Management VLAN
In environments without a spare physical port, a dedicated management VLAN provides logical separation:

    /interface vlan add name=mgmt-vlan interface=ether5 vlan-id=99
    /ip address add address=10.255.0.1/24 interface=mgmt-vlan

    /ip service
    set winbox address=10.255.0.0/24
    set ssh address=10.255.0.0/24

Traffic on VLAN 99 is logically isolated from other VLANs even though it shares a physical port.
Complete Resilient Site Example
This example combines VRRP gateway redundancy, bonded uplinks, dual ISPs, and isolated management:

    # ─── Physical uplinks: bonded pairs toward each upstream switch ───
    /interface bonding
    add name=bond-sw1 slaves=ether1,ether2 mode=802.3ad lacp-rate=1sec link-monitoring=mii
    add name=bond-sw2 slaves=ether3,ether4 mode=802.3ad lacp-rate=1sec link-monitoring=mii

    # ─── LAN bridge (add switch ports as needed) ───
    /interface bridge add name=bridge-lan
    /ip address add address=10.10.10.2/24 interface=bridge-lan

    # ─── VRRP virtual gateway for clients ───
    /interface vrrp add \
        name=vrrp-lan interface=bridge-lan vrid=10 priority=150 \
        preemption-mode=yes sync-connection-tracking=yes
    /ip address add address=10.10.10.1/32 interface=vrrp-lan

    # ─── ISP1 and ISP2 WAN interfaces ───
    /ip address
    # ISP1
    add address=203.0.113.2/30 interface=sfp-sfpplus1
    # ISP2
    add address=198.51.100.2/30 interface=sfp-sfpplus2

    # ─── Recursive health targets ───
    /ip route
    add dst-address=9.9.9.9/32 gateway=203.0.113.1 scope=10 comment=isp1-health
    add dst-address=1.1.1.1/32 gateway=198.51.100.1 scope=10 comment=isp2-health

    # ─── Default routes: ISP1 primary, ISP2 backup (active/standby by distance) ───
    /ip route
    add dst-address=0.0.0.0/0 gateway=9.9.9.9 check-gateway=ping distance=1 comment=isp1-default
    add dst-address=0.0.0.0/0 gateway=1.1.1.1 check-gateway=ping distance=10 comment=isp2-backup

    # ─── Management isolation ───
    /interface vlan add name=mgmt vlan-id=99 interface=ether10
    /ip address add address=10.255.0.1/24 interface=mgmt
    /ip service set ssh address=10.255.0.0/24
    /ip service set winbox address=10.255.0.0/24

    # ─── Netwatch + VRRP health automation ───
    /tool/netwatch
    add name=isp1-probe host=9.9.9.9 type=icmp interval=10s timeout=1s \
        down-script="/system script run isp1-down" \
        up-script="/system script run isp1-up"

    /system script
    add name=isp1-down source={
        /interface vrrp set [find name=vrrp-lan] priority=90
        :log warning "ISP1 down: VRRP priority reduced, failover to standby"
    }
    add name=isp1-up source={
        /interface vrrp set [find name=vrrp-lan] priority=150
        :log info "ISP1 up: VRRP priority restored"
    }

On the standby router, use the same configuration with priority=100 on the VRRP interface and its own real IP (10.10.10.3/24) on bridge-lan. Also adjust the priority values in its Netwatch scripts so that isp1-up restores 100 rather than 150; otherwise the standby raises itself to the primary's priority.
Anti-Patterns to Avoid
Using only check-gateway=ping to the directly connected ISP gateway. The gateway can be pingable while the ISP’s upstream is down. Use recursive routing with internet health targets.
Treating bonding as router redundancy. A LACP bond between a single router and a switch provides link redundancy, not node redundancy. If the router fails, all bonded links go down together. VRRP is required for router-level redundancy.
Single shared VLAN for everything including management. A large L2 broadcast domain introduces split-brain risk with VRRP and makes management traffic depend on the same failure domain as production traffic. Separate management into a dedicated VLAN or physical interface.
ECMP without return-path control. Asymmetric routing breaks stateful firewall and NAT. When using ECMP with multiple ISPs, use connection marking to ensure each flow’s return traffic exits through the same ISP it arrived on.
No independent OOB path. A misconfiguration that breaks routing can make the router unreachable through all production interfaces simultaneously. A cellular OOB interface or dedicated management VLAN on a separate physical switch preserves access during data-plane incidents.
Troubleshooting
    # Verify VRRP state on both routers
    /interface vrrp print detail

    # Check bonding member status and active slave
    /interface bonding monitor bond-sw1

    # Verify recursive route resolution
    /ip route print detail where dst-address=0.0.0.0/0

    # Test ISP reachability per path
    /tool ping 9.9.9.9 routing-table=main count=5
    /tool ping 1.1.1.1 routing-table=main count=5

    # View Netwatch probe state
    /tool/netwatch print detail

Related Topics
- VRRP: Virtual Router Redundancy Protocol parameters and state machine
- Bonding: Interface bonding modes and link monitoring
- Failover (WAN Backup): Basic WAN failover with route distance
- Multi-WAN Failover with Recursive Routing: Advanced ECMP and connection marking
- Health Monitoring Scripts: Netwatch and scripted failover automation