Resilient Network Topologies

A resilient network eliminates single points of failure across each layer: physical links, gateway routers, WAN uplinks, and management access. This guide covers design principles and RouterOS configuration patterns for building redundant networks, from dual-homed LAN gateways to multi-ISP routing with out-of-band management.

Design Principles

Three core principles guide resilient network design:

Separate redundancy concerns by layer. VRRP handles first-hop gateway redundancy for clients. Bonding protects against physical link failures between devices. Routing distance, ECMP, and recursive routes handle WAN uplink diversity. Using one mechanism to cover another layer's failures adds complexity and creates failure modes that are hard to diagnose, so design each layer independently.

Avoid single points of failure at every tier. A redundant LAN gateway connected to a single WAN link is only partially resilient. Map out every device, link, and uplink in the path and identify where a single failure would disrupt service.

Plan for management access. Data plane failures can lock you out of the router if management traffic shares the same path. Dedicate a physically separate or logically isolated path for management access before a problem occurs.
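
The mapping exercise in the second principle can be sketched as a small reachability check. This is a minimal illustration, not a planning tool: the node names and the two link inventories are hypothetical, and each device or uplink is modeled as a single node that is removed in turn to see whether clients can still reach the internet.

```python
from collections import deque

def reachable(links, src, dst, removed):
    """Breadth-first search from src to dst, treating one node as failed."""
    if removed in (src, dst):
        return False
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for a, b in links:
            if node in (a, b):
                nxt = b if node == a else a
                if nxt != removed and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False

def single_points_of_failure(links, src, dst):
    """Return every intermediate node whose failure alone disconnects src from dst."""
    nodes = {n for link in links for n in link} - {src, dst}
    return sorted(n for n in nodes if not reachable(links, src, dst, n))

# Hypothetical inventory: redundant routers, but one switch and one ISP
partial = [("clients", "R1"), ("clients", "R2"),
           ("R1", "SW1"), ("R2", "SW1"),
           ("SW1", "ISP1"), ("ISP1", "internet")]

# Hypothetical inventory: every tier duplicated
full = [("clients", "R1"), ("clients", "R2"),
        ("R1", "SW1"), ("R2", "SW2"),
        ("SW1", "ISP1"), ("SW2", "ISP2"),
        ("ISP1", "internet"), ("ISP2", "internet")]

print(single_points_of_failure(partial, "clients", "internet"))  # ['ISP1', 'SW1']
print(single_points_of_failure(full, "clients", "internet"))     # []
```

The first inventory shows why a redundant gateway pair behind a single switch and uplink is only partially resilient.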

Layer Architecture

A fully redundant site uses at least three layers:

          Clients
             |
  LAN (VRRP virtual gateway)
          /     \
        R1       R2      ← Two routers, VRRP between them for gateway redundancy
         |       |
       bond     bond     ← Bonded uplinks to distribution/edge switches
         |       |
        SW1     SW2      ← Redundant upstream switches (LACP/MLAG)
         |       |
       ISP1     ISP2     ← Dual ISP uplinks for WAN diversity

Each layer is designed to fail independently: a switch failure does not impact the WAN, and a WAN outage does not take down LAN access.

LAN Gateway Redundancy with VRRP

VRRP provides a virtual gateway address that floats between two routers. Clients use the virtual IP as their default gateway, so they need no reconfiguration when a router fails over.

Two-Router Active/Standby Gateway

Configure the primary router with a high VRRP priority and the standby with a lower priority:

# Primary router — priority 150
/interface vrrp add \
    name=vrrp-lan \
    interface=bridge-lan \
    vrid=10 \
    priority=150 \
    preemption-mode=yes \
    interval=1s

/ip address
add address=10.10.10.2/24 interface=bridge-lan
add address=10.10.10.1/32 interface=vrrp-lan comment="virtual gateway"

# Standby router — priority 100
/interface vrrp add \
    name=vrrp-lan \
    interface=bridge-lan \
    vrid=10 \
    priority=100 \
    preemption-mode=yes \
    interval=1s

/ip address
add address=10.10.10.3/24 interface=bridge-lan
add address=10.10.10.1/32 interface=vrrp-lan comment="same virtual gateway"

Hosts point to 10.10.10.1 as their default gateway. When the primary router fails, the standby assumes Master state and begins handling traffic.
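
How long the standby waits before taking over can be estimated from the standard VRRP timers. The sketch below uses the master-down interval formula from RFC 5798 and assumes RouterOS follows it; the function name is illustrative and is not part of any RouterOS API.

```python
def master_down_interval(priority, advert_interval=1.0):
    """Seconds a backup waits without advertisements before becoming Master.

    RFC 5798: Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time,
    where Skew_Time = (256 - priority) * Advertisement_Interval / 256.
    """
    skew_time = (256 - priority) * advert_interval / 256
    return 3 * advert_interval + skew_time

# Standby router from the example above: priority=100, interval=1s
print(master_down_interval(100))  # 3.609375
```

With these settings, clients should expect an interruption of roughly 3.6 seconds after the primary stops advertising. A shorter interval reduces that window at the cost of more advertisement traffic.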

Connection Tracking Synchronization

For networks with stateful firewalls, enable connection tracking sync so established sessions survive failover:

# Enable on both routers
/ip/firewall/connection/tracking/set enabled=yes
/interface vrrp set vrrp-lan sync-connection-tracking=yes

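Firewall rules on each router must permit UDP port 8275 from the peer’s real IP. The rule below belongs on the primary; on the standby, use src-address=10.10.10.2 instead: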

/ip firewall filter add \
    chain=input protocol=udp dst-port=8275 \
    src-address=10.10.10.3 action=accept \
    comment="VRRP conntrack sync from standby"

Grouped WAN and LAN VRRP

When both routers also have WAN interfaces, group the WAN and LAN VRRP instances so they share a single state machine. This prevents asymmetric routing where the LAN Master sends traffic out a WAN interface on the other router:

/interface vrrp
add name=vrrp-wan interface=sfp-sfpplus1 vrid=20 priority=150
add name=vrrp-lan interface=bridge-lan  vrid=10 priority=150

# Link both instances — vrrp-lan controls the group
/interface vrrp set [find where name!=vrrp-lan] group-authority=vrrp-lan

Both interfaces transition to Master or Backup together, ensuring the same router handles both LAN client traffic and WAN forwarding.

Link Redundancy with Bonding

Bonding combines multiple physical interfaces into a single logical link. Every mode provides automatic failover when a physical link fails, and aggregating modes such as 802.3ad also provide additional bandwidth.

Active-Backup Between Routers

For direct router-to-router links, active-backup provides simple failover without requiring switch LACP support:

# On both routers
/interface bonding add \
    name=bond-inter \
    slaves=ether1,ether2 \
    mode=active-backup \
    primary=ether1 \
    link-monitoring=mii \
    mii-interval=100ms

/ip address add address=172.16.0.1/30 interface=bond-inter

When ether1 fails, ether2 takes over within one MII interval (100ms by default).

LACP to Distribution Switches

For uplinks to managed switches, 802.3ad LACP provides standards-compliant link aggregation with automatic member discovery:

/interface bonding add \
    name=bond-uplink \
    slaves=ether3,ether4 \
    mode=802.3ad \
    lacp-rate=1sec \
    link-monitoring=mii

/ip address add address=10.1.0.1/30 interface=bond-uplink

Configure the matching LACP port-channel on the switch. lacp-rate=1sec enables fast LACP PDU exchange for faster member failure detection.
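
The effect of the LACP rate on failure detection can be worked out from the IEEE 802.3ad timers, where a partner is expired after three missed LACPDUs. This is a small arithmetic sketch with an illustrative function name. It covers only LACP-level detection, since a physical link loss is still caught sooner by MII monitoring.

```python
LACP_TIMEOUT_MULTIPLIER = 3  # partner information expires after three missed LACPDUs

def lacp_timeout_seconds(pdu_period_seconds):
    """Time until a silent LACP partner is declared expired."""
    return LACP_TIMEOUT_MULTIPLIER * pdu_period_seconds

for label, period in (("1sec", 1), ("30secs", 30)):
    print(label, lacp_timeout_seconds(period))
# 1sec 3
# 30secs 90
```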

Choosing Link Monitoring

Scenario                         Monitoring   Reason
Direct router-to-router links    mii          Detects physical layer failure immediately
Through a switch fabric          arp          Detects reachability, not just physical link
802.3ad LACP bonds               mii          ARP monitoring can mislead due to hashing

For ARP monitoring, set the target IP to a reachable peer on the bonded segment:

/interface bonding set [find name=bond1] \
    link-monitoring=arp \
    arp-ip-targets=10.1.0.2 \
    arp-interval=100ms

WAN Diversity with Dual ISPs

Dual ISP uplinks eliminate single-provider dependency. RouterOS supports several models: active/standby failover, ECMP load sharing, and policy-based steering.

Active/Standby with Distance

The simplest dual-ISP design uses route distance to prefer one provider. With check-gateway=ping set, the primary route becomes inactive when its gateway stops answering, and the backup route takes over automatically:

/ip route
add dst-address=0.0.0.0/0 gateway=203.0.113.1  check-gateway=ping distance=1  comment=isp1-primary
add dst-address=0.0.0.0/0 gateway=198.51.100.1 check-gateway=ping distance=10 comment=isp2-backup

However, check-gateway=ping only confirms the directly connected gateway is reachable, not the broader internet path. A failed ISP upstream can leave the gateway pingable while internet traffic blackholes.

Recursive Routing for Reliable Health Checks

Recursive routes with internet health targets provide end-to-end reachability verification:

# Bind health targets to specific ISP gateways
/ip route
add dst-address=9.9.9.9/32  gateway=203.0.113.1  scope=10 comment=isp1-health
add dst-address=1.1.1.1/32   gateway=198.51.100.1 scope=10 comment=isp2-health

# Default routes resolved through health targets
/ip route
add dst-address=0.0.0.0/0 gateway=9.9.9.9  check-gateway=ping distance=1  comment=isp1-default
add dst-address=0.0.0.0/0 gateway=1.1.1.1  check-gateway=ping distance=10 comment=isp2-default

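When ISP1’s upstream fails, pings to 9.9.9.9 through 203.0.113.1 go unanswered, check-gateway marks that gateway unreachable, and the ISP1 default route becomes inactive. Traffic then falls back to ISP2. Detection is not instantaneous: check-gateway probes every 10 seconds and declares the gateway unreachable after two consecutive timeouts, so expect roughly 20 seconds before the switch.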

ECMP Load Sharing

To use both ISP links simultaneously, set equal distances on both routes:

/ip route
add dst-address=0.0.0.0/0 gateway=9.9.9.9  check-gateway=ping distance=1 comment=isp1-ecmp
add dst-address=0.0.0.0/0 gateway=1.1.1.1  check-gateway=ping distance=1 comment=isp2-ecmp

RouterOS distributes new connections across both routes. When one path fails, traffic shifts entirely to the surviving ISP.
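
Per-connection distribution can be pictured as hashing each flow onto the list of active gateways. The sketch below is a conceptual model only: the hash function and field selection are assumptions chosen for illustration and do not reproduce the algorithm RouterOS uses internally.

```python
import hashlib

def pick_gateway(flow, gateways):
    """Map a flow tuple onto one of the active gateways, deterministically."""
    if not gateways:
        raise ValueError("no active gateways")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(gateways)
    return gateways[index]

# Hypothetical flow: source IP, source port, destination IP, destination port, protocol
flow = ("10.10.10.50", 51514, "192.0.2.80", 443, "tcp")
both_up = ["203.0.113.1", "198.51.100.1"]

# The same flow always maps to the same gateway while the gateway set is unchanged
assert pick_gateway(flow, both_up) == pick_gateway(flow, both_up)

# With ISP1 withdrawn, every flow lands on the surviving gateway
print(pick_gateway(flow, ["198.51.100.1"]))  # 198.51.100.1
```

Because the mapping depends on the set of active gateways, a path failure can move existing flows to the other ISP, which is one reason NAT deployments need return-path control.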

Note: ECMP with NAT requires careful policy routing to ensure that return traffic exits through the same ISP the connection was established on. See Multi-WAN Failover with Recursive Routing for connection-marking patterns.

Netwatch Integration

Add Netwatch probes to trigger script-based failover when ECMP is not suitable or when scripts must update VRRP priority simultaneously:

/tool/netwatch
add name=isp1-probe host=9.9.9.9 type=icmp interval=10s timeout=1s \
    down-script="/system script run isp1-down" \
    up-script="/system script run isp1-up"

/system script
add name=isp1-down source={
  /ip route set [find comment=isp1-default] disabled=yes
  /interface vrrp set [find name=vrrp-lan] priority=90
  :log warning "ISP1 down: route disabled, VRRP priority reduced"
}
add name=isp1-up source={
  /ip route set [find comment=isp1-default] disabled=no
  /interface vrrp set [find name=vrrp-lan] priority=150
  :log info "ISP1 up: route restored, VRRP priority restored"
}
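
The priority values in these scripts are chosen so that the reduced value (90) falls below the standby's priority (100). A minimal model of the election, assuming both routers are up and preemption is enabled as configured earlier, shows the effect. The tie-break on the higher real IP follows the VRRP specification, and the dictionary layout is purely illustrative.

```python
import ipaddress

def elect_master(routers):
    """routers maps a name to (priority, real_ip).

    The highest priority wins; a tie goes to the higher real IP address.
    """
    return max(
        routers,
        key=lambda name: (routers[name][0],
                          int(ipaddress.ip_address(routers[name][1]))),
    )

normal = {"primary": (150, "10.10.10.2"), "standby": (100, "10.10.10.3")}
isp1_down = {"primary": (90, "10.10.10.2"), "standby": (100, "10.10.10.3")}

print(elect_master(normal))     # primary
print(elect_master(isp1_down))  # standby
```

If the reduced priority were still above 100, the primary would remain Master even with its preferred uplink down.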

Out-of-Band Management

A data-plane outage becomes a management crisis when it also locks out the operator. Design management access independently of the production data path.

Dedicated Management Interface

Assign a separate physical interface or VLAN exclusively for management traffic:

/interface list
add name=WAN
add name=LAN
add name=MGMT

/interface list member
add interface=ether10 list=MGMT
add interface=ether1  list=WAN
add interface=bridge-lan list=LAN

# Restrict management services to MGMT network only
/ip service
set winbox address=10.255.0.0/24
set ssh    address=10.255.0.0/24
set api    address=10.255.0.0/24

Management traffic on 10.255.0.0/24 is isolated from customer-facing interfaces. A WAN or LAN outage leaves management accessible.
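
The address restriction on each service acts as a source allow-list. The check it performs can be sketched with Python's ipaddress module. The helper name is hypothetical and the test addresses are examples, not values taken from a real deployment.

```python
import ipaddress

# Management subnet from the /ip service configuration above
MGMT_NET = ipaddress.ip_network("10.255.0.0/24")

def service_allows(source_ip, allowed=MGMT_NET):
    """Return True when the source address falls inside the allowed subnet."""
    return ipaddress.ip_address(source_ip) in allowed

print(service_allows("10.255.0.20"))  # True: management workstation
print(service_allows("10.10.10.50"))  # False: LAN client
```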

LTE or Cellular OOB Path

For critical sites, configure a cellular backup as an OOB path:

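# LTE interfaces are created automatically when the modem is detected
# (typically as lte1); confirm the name before referencing it
/interface lte print

# The carrier assigns the LTE address dynamically once the modem attaches,
# so no static /ip address entry is needed; verify the assigned address
/ip address print where interface=lte1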

# Firewall: accept SSH only from management source on LTE
/ip firewall filter add \
    chain=input in-interface=lte1 protocol=tcp dst-port=22 \
    src-address=198.51.100.10 action=accept \
    comment="OOB SSH from NOC via LTE"

/ip firewall filter add \
    chain=input in-interface=lte1 action=drop \
    comment="Drop all other LTE input"

The LTE path remains available even when both primary ISPs are down, enabling recovery from routing or config errors remotely.
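
The two LTE filter rules depend on their order, because the first matching rule decides the packet's fate. The sketch below models that first-match behavior with simplified rule dictionaries. The field names and the scanner address are illustrative, and real rules match on many more attributes.

```python
def evaluate_input(rules, packet):
    """Return the action of the first rule whose fields all match the packet."""
    for match, action in rules:
        if all(packet.get(field) == value for field, value in match.items()):
            return action
    return "accept"  # a packet that matches no rule is accepted by default

lte_rules = [
    ({"in_interface": "lte1", "protocol": "tcp", "dst_port": 22,
      "src_address": "198.51.100.10"}, "accept"),
    ({"in_interface": "lte1"}, "drop"),
]

noc_ssh = {"in_interface": "lte1", "protocol": "tcp", "dst_port": 22,
           "src_address": "198.51.100.10"}
scanner = {"in_interface": "lte1", "protocol": "tcp", "dst_port": 22,
           "src_address": "192.0.2.99"}

print(evaluate_input(lte_rules, noc_ssh))        # accept
print(evaluate_input(lte_rules, scanner))        # drop
print(evaluate_input(lte_rules[::-1], noc_ssh))  # drop: the catch-all rule now comes first
```

The last line shows why the accept rule must be placed above the catch-all drop.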

Management VLAN

In environments without a spare physical port, a dedicated management VLAN provides logical separation:

/interface vlan add name=mgmt-vlan interface=ether5 vlan-id=99
/ip address add address=10.255.0.1/24 interface=mgmt-vlan

/ip service
set winbox address=10.255.0.0/24
set ssh    address=10.255.0.0/24

Traffic on VLAN 99 is logically isolated from other VLANs even though it shares a physical port.

Complete Resilient Site Example

This example combines VRRP gateway redundancy, bonded uplinks, dual ISPs, and isolated management:

# ─── Physical uplinks: bonded pairs toward each upstream switch ───
/interface bonding
add name=bond-sw1 slaves=ether1,ether2 mode=802.3ad lacp-rate=1sec link-monitoring=mii
add name=bond-sw2 slaves=ether3,ether4 mode=802.3ad lacp-rate=1sec link-monitoring=mii

# ─── LAN bridge (add switch ports as needed) ───
/interface bridge add name=bridge-lan
/ip address add address=10.10.10.2/24 interface=bridge-lan

# ─── VRRP virtual gateway for clients ───
/interface vrrp add \
    name=vrrp-lan interface=bridge-lan vrid=10 priority=150 \
    preemption-mode=yes sync-connection-tracking=yes
/ip address add address=10.10.10.1/32 interface=vrrp-lan

# ─── ISP1 and ISP2 WAN interfaces ───
/ip address
add address=203.0.113.2/30  interface=sfp-sfpplus1 comment=ISP1
add address=198.51.100.2/30 interface=sfp-sfpplus2 comment=ISP2

# ─── Recursive health targets ───
/ip route
add dst-address=9.9.9.9/32  gateway=203.0.113.1  scope=10 comment=isp1-health
add dst-address=1.1.1.1/32   gateway=198.51.100.1 scope=10 comment=isp2-health

# ─── Default routes: ISP1 primary, ISP2 backup (use equal distances for ECMP) ───
/ip route
add dst-address=0.0.0.0/0 gateway=9.9.9.9  check-gateway=ping distance=1  comment=isp1-default
add dst-address=0.0.0.0/0 gateway=1.1.1.1  check-gateway=ping distance=10 comment=isp2-backup

# ─── Management isolation ───
/interface vlan add name=mgmt vlan-id=99 interface=ether10
/ip address add address=10.255.0.1/24 interface=mgmt
/ip service set ssh address=10.255.0.0/24
/ip service set winbox address=10.255.0.0/24

# ─── Netwatch + VRRP health automation ───
/tool/netwatch
add name=isp1-probe host=9.9.9.9 type=icmp interval=10s timeout=1s \
    down-script="/system script run isp1-down" \
    up-script="/system script run isp1-up"

/system script
add name=isp1-down source={
  /interface vrrp set [find name=vrrp-lan] priority=90
  :log warning "ISP1 down: VRRP priority reduced, failover to standby"
}
add name=isp1-up source={
  /interface vrrp set [find name=vrrp-lan] priority=150
  :log info "ISP1 up: VRRP priority restored"
}

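On the standby router, use the same configuration with priority=100 on the VRRP interface and its own real IP (10.10.10.3/24) on bridge-lan. Adjust the Netwatch scripts there as well, so that isp1-up restores the standby's priority to 100 rather than 150 and isp1-down lowers it below that.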

Anti-Patterns to Avoid

Using only check-gateway=ping to the directly connected ISP gateway. The gateway can be pingable while the ISP’s upstream is down. Use recursive routing with internet health targets.

Treating bonding as router redundancy. A LACP bond between a single router and a switch provides link redundancy, not node redundancy. If the router fails, all bonded links go down together. VRRP is required for router-level redundancy.

Single shared VLAN for everything including management. A large L2 broadcast domain introduces split-brain risk with VRRP and makes management traffic depend on the same failure domain as production traffic. Separate management into a dedicated VLAN or physical interface.

ECMP without return-path control. Asymmetric routing breaks stateful firewalling and NAT. When using ECMP with multiple ISPs, use connection marking to ensure each flow’s return traffic exits through the same ISP it arrived on.

No independent OOB path. A misconfiguration that breaks routing can make the router unreachable through all production interfaces simultaneously. A cellular OOB interface or dedicated management VLAN on a separate physical switch preserves access during data-plane incidents.

Troubleshooting

# Verify VRRP state on both routers
/interface vrrp print detail

# Check bonding member status and active slave
/interface bonding monitor bond-sw1

# Verify recursive route resolution
/ip route print detail where dst-address=0.0.0.0/0

# Test ISP reachability per path (each health target is pinned to one ISP by its /32 route)
/tool ping 9.9.9.9 count=5
/tool ping 1.1.1.1 count=5

# View Netwatch probe state
/tool/netwatch print detail

Related Topics

VRRP — Virtual Router Redundancy Protocol parameters and state machine
Bonding — Interface bonding modes and link monitoring
Failover (WAN Backup) — Basic WAN failover with route distance
Multi-WAN Failover with Recursive Routing — Advanced ECMP and connection marking
Health Monitoring Scripts — Netwatch and scripted failover automation
