"Apparently, according to our network services guy, where the outage was happening on the circuit and the fact that it did not affect all URLs/routing did not register with our firewall, so we did not automatically fail over. We had to force the failover and then manually reroute back after the outage was over."
That's a sort-of accurate answer. URLs don't route 🙂 But the gist is correct.
The same problem happened with some of our customers. It was not a failure of the Comcast pipe; it was a complete Comcast routing meltdown.
They lost core routing services and reachability for many upstreams and peers but not all.
The only (sound) way you can deal with this is to have your own ASN and a BGP peering session up with a carrier. And not trust the default route they send you, since that is synthesized on their edge router and probably not any kind of indication of the health of the carrier network to carry valid traffic.
A hack would be to create 5 or so well-known sites or routes to ping, add static routes forcing that traffic out a specific router or firewall, and then measure loss and reachability. Set a threshold, and if reachability drops below a certain percentage, you kill the connection. Sounds simple. It's incredibly hard to do.
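A minimal sketch of that decision logic, assuming static routes already force traffic to the probe targets out the WAN interface under test. The target IPs and threshold here are placeholders (TEST-NET addresses), not a recommended probe list:

```python
import subprocess

# Placeholder probe targets -- in practice these need to be diverse,
# non-anycast destinations that you re-validate regularly.
TARGETS = ["198.51.100.1", "203.0.113.7", "192.0.2.55", "198.51.100.99", "203.0.113.200"]
HEALTHY_RATIO = 0.6  # the link counts as healthy only if at least 60% of targets answer

def reachable(ip: str) -> bool:
    """Send a few ICMP echoes; static routes force this traffic out the circuit under test."""
    result = subprocess.run(
        ["ping", "-c", "3", "-W", "2", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def link_is_healthy() -> bool:
    answered = sum(1 for ip in TARGETS if reachable(ip))
    ratio = answered / len(TARGETS)
    print(f"{answered}/{len(TARGETS)} targets reachable ({ratio:.0%})")
    return ratio >= HEALTHY_RATIO

if __name__ == "__main__":
    if not link_is_healthy():
        # Here you would shut the interface or withdraw the default route.
        print("Reachability below threshold -- fail the circuit over")
```

The hard part is not this loop; it is everything below.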
Most of the "monitoring" points you can think of have one or more of 3 problems:
- They rely on popular websites or IPs that are CDN-hosted (content delivery network). If you are pinging Cloudflare, and Cloudflare is hosted in a CDN node on your provider's network, you are not testing ANY reachability to anything other than that local CDN pod. Those networks were in fact all working on Friday, so if you had pingers set up, they would have passed as OK.
Hint: most junior admins set up 4.2.2.2 and 8.8.8.8 in their firewall and think this is fine. Those IPs were reachable during the outage since they are ... CDN-hosted and replicated (called anycast).
- If you pick network routers you see in traceroutes, these can (a) change without notice at any time as carriers modify their network, and (b) carriers will block or deprioritize ping responses on their control plane, so it does not work.
- They require a subscription service to use, or are free and best-effort only.
So you end up having to build a program that watches the watchers, and constantly (maybe once a month) tweaking the pinger programs and decision config to get rid of stale IPs or sites that block ICMP, i.e. things that stop working.
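A rough sketch of that "watch the watchers" job, run from a known-good path: it re-tests each probe target and prunes the ones that have gone silent for too long. The config file name, JSON layout, and 30-day cutoff are all hypothetical:

```python
import json
import subprocess
from datetime import datetime, timezone

# Hypothetical config file consumed by the pinger above:
# {"198.51.100.1": "2024-05-01T00:00:00+00:00", ...}  (IP -> last time it answered)
CONFIG_PATH = "pinger_targets.json"
MAX_SILENT_DAYS = 30  # prune targets that have not answered in a month

def answers_icmp(ip: str) -> bool:
    return subprocess.run(
        ["ping", "-c", "2", "-W", "2", ip],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def audit_targets() -> None:
    with open(CONFIG_PATH) as f:
        targets = json.load(f)

    now = datetime.now(timezone.utc)
    for ip, last_ok in list(targets.items()):
        if answers_icmp(ip):
            targets[ip] = now.isoformat()
            continue
        silent_days = (now - datetime.fromisoformat(last_ok)).days
        if silent_days > MAX_SILENT_DAYS:
            print(f"pruning {ip}: no ICMP reply for {silent_days} days")
            del targets[ip]

    with open(CONFIG_PATH, "w") as f:
        json.dump(targets, f, indent=2)

if __name__ == "__main__":
    audit_targets()
```

And even with this in place you are still guessing at what a "representative" set of destinations looks like.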
Your best bet is to:
- Get 2 carriers that support BGP
- Join ARIN or RIPE and get an ASN (about $1.2K of annual cost)
- Buy an IP block (/24), about $4.2K of one-time cost
- Set up BGP in front of your firewall cluster for redundant routing
That should take care of all cases EXCEPT when a carrier sends you routes they can't reach. It's rare, but it can happen.
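If the BGP speakers in front of the firewalls run on something scriptable (FRR is an assumption here, not a requirement), you can at least sanity-check that each carrier session is established and actually sending a sane number of prefixes. Exact JSON field names vary by FRR version, so treat this as a sketch; it catches session-level failures, not the "routes they can't reach" case, which still needs data-plane probes like the pinger above:

```python
import json
import subprocess

MIN_PREFIXES = 1000  # a session that is "up" but sends almost nothing is suspect

def bgp_summary() -> dict:
    # FRR's vtysh can emit JSON; the structure below may differ by version.
    out = subprocess.run(
        ["vtysh", "-c", "show ip bgp summary json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)

def check_carriers() -> None:
    peers = bgp_summary().get("ipv4Unicast", {}).get("peers", {})
    for neighbor, info in peers.items():
        state = info.get("state", "unknown")
        prefixes = info.get("pfxRcd", 0)
        if state != "Established" or prefixes < MIN_PREFIXES:
            print(f"carrier {neighbor}: state={state}, prefixes={prefixes} -- investigate")
        else:
            print(f"carrier {neighbor}: OK ({prefixes} prefixes)")

if __name__ == "__main__":
    check_carriers()
```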
Another advantage of this setup is you DON'T have to worry about your sessions dropping each and every time your path changes, because your firewall will NOT be flipping back and forth on the NAT range. It will use your IP block, and no matter which pipe your firewall cluster picks to use, the return packets can safely make it back without NAT or state issues.
Verizon FIOS (and business cable) does not support BGP. If you have picked one of these technologies as your secondary service, you can still make this work.
You have to use a datacenter where you can land your BGP block, and tunnels to route your traffic across the providers that do not support BGP. Works very well, but you of course have to pick a datacenter that can do the BGP for you, or help you get it set up properly. When you pick a datacenter you DON'T have to buy an ASN or an IP block.
Of course, if you decide to leave that datacenter, you lose your rented IPs and have to renumber. So the advantage of your own ASN and IP block is portability to wherever you want to go.
And with tunnels you can use inexpensive bandwidth if you prefer for secondary or even primary failover.