Online outages are serious. Vendors lose money for every minute their users can't reach their web services, and business productivity tanks when employees can't access the web applications they rely on to get their jobs done. People can be convinced to forgive the occasional blip, but full-blown outages reinforce the impression that nothing truly critical should be entrusted to the internet.
A look at some of the outages over the past year reveals a disturbing pattern. While the move to cloud-based architecture and applications has reduced complexity in IT infrastructure, that has come at the cost of resiliency. IT has to regularly balance redundancy -- which improves resiliency -- with complexity, and recent outages show that redundancy keeps getting left behind. Taking the time to assess potential "what if" scenarios and plan for the worst-case scenario could have, if not prevented, at least minimized the effects of these outages.
"IT needs to plan for redundancy on critical services," said Nick Kephart, a senior director at network infrastructure monitoring company ThousandEyes.
Department of Redundancy Department
Redundancy is a basic IT tenet. Whether it's multiple backend servers running the same web applications or setting up disk drives in RAID arrays, IT regularly ensures availability even in the case of a failure. Yet the massive DDoS attack against DNS (Domain Name System) service provider Dyn showed that many organizations failed to think about redundancy on their critical infrastructure.
The attack overwhelmed Dyn's servers with enough junk traffic that legitimate DNS requests were no longer being answered. Web properties that had relied on Dyn to direct traffic to their servers realized too late that not having a backup DNS provider meant they were, for all intents and purposes, cut off from the rest of the internet during that period.
Those that load-balanced their DNS name servers across multiple providers -- such as Amazon, which used both Ultra DNS and Dyn -- were able to switch during the outage and remain unaffected.
The internet usually hums along without any major issues, but the growing intensity and frequency of DDoS attacks prove that DNS needs to be treated as critical internet infrastructure and protected as such. The attack against DNS wasn't an aberration -- cloud-based DNS provider NS1 was hit earlier in the year, and there was also the June attack that targeted all 13 of the DNS root servers. "It was a large-scale attack on the most critical part of the internet infrastructure and resulted in roughly three hours of performance issues," said Archana Kesavan, a manager at network infrastructure monitoring company ThousandEyes.
For many enterprises, Dyn seemed like the logical way to address redundancy for DNS services because Dyn already provides a distributed architecture. IT teams don't want to have multiple DNS providers because doing so adds complexity to the network infrastructure, but DNS outages can and do happen, so IT teams need to double or even triple up on their DNS providers. IT should also lower the time-to-live (TTL) settings on their DNS records so that traffic can be redirected faster to the backup provider in case of an outage at the primary one.
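To make the TTL point concrete, here is a minimal Python sketch of the two ideas in the paragraph above: failing over to a backup DNS provider, and how the TTL bounds the time a resolver keeps sending users to the old answer. It is a simplified model, not a description of how any particular provider or resolver behaves; the `Provider` class, the provider names, and both function names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Provider:
    name: str      # hypothetical label, e.g. "primary-dns"
    healthy: bool  # whether this provider's name servers are answering


def pick_provider(providers: List[Provider]) -> Optional[Provider]:
    """Return the first healthy provider in priority order, or None."""
    for provider in providers:
        if provider.healthy:
            return provider
    return None


def stale_seconds_after_change(cached_at: int, ttl: int, changed_at: int) -> int:
    """Seconds after a record change that a resolver may still serve the old answer.

    Assumes the resolver cached the record at `cached_at` and honors the TTL,
    so the cached copy expires at `cached_at + ttl`.
    """
    return max(0, cached_at + ttl - changed_at)


if __name__ == "__main__":
    providers = [Provider("primary-dns", healthy=False),
                 Provider("backup-dns", healthy=True)]
    chosen = pick_provider(providers)
    print("answering provider:", chosen.name if chosen else "none")

    # Same outage, two TTL choices: one day versus five minutes.
    for ttl in (86400, 300):
        stale = stale_seconds_after_change(cached_at=0, ttl=ttl, changed_at=100)
        print(f"ttl={ttl}s -> up to {stale}s of stale answers")
```

In this model, a record cached just before the change stays in use for almost the full TTL, which is why a shorter TTL shortens the redirect window. The trade-off is that shorter TTLs mean more queries reach the authoritative servers.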
Popularity can hurt, too
Outages aren't just the result of malicious activity or equipment failure. Popularity can be just as damaging in the absence of proper network and capacity planning. There is no such thing as too many visitors, and a hit application everyone is clamoring for is fantastic, at least until the increased traffic melts down the servers and the network collapses under the load. Then everyone loses.
Lack of a CDN (content delivery network) front end can be costly if traffic bursts aren't factored into the network architecture, Kephart said.
January had one of the largest lottery jackpots in recent history, but Powerball couldn't keep up with the frenzy surrounding the mega-million payout. Neither the application nor the network could handle the uptick in traffic, leading to increased packet loss and extended page load times. Powerball avoided complete meltdown by distributing traffic across Verizon's Edgecast CDN network, Microsoft's data center, and the Multi-State Lottery Association data center just before the drawing. "The damage was already done, and user experience to the website was sub-standard," Kesavan said.
Pokemon Go's servers experienced similar outages when the combination of network architecture and overloaded target servers prevented users from playing the game. Apple's servers struggled to handle the much-anticipated launch of Nintendo's Super Mario Run, with sporadic outages affecting all its online stores, including the iOS App Store, Mac App Store, Apple TV, and Apple Music.
Benchmarking and capacity planning are critical, especially before software updates and large-scale events. No matter how well the network architecture is designed, CDNs and anycast servers can support the network and maximize user experience.
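As a rough illustration of the arithmetic behind capacity planning, the Python sketch below estimates how many origin servers a traffic burst would need, with and without a CDN absorbing part of the load. Every number and name here is an assumption made for the example (the peak request rate, the CDN offload fraction, the per-server capacity, and the 30 percent headroom); none of them comes from the outages described above.

```python
import math


def origin_servers_needed(peak_rps: float,
                          cdn_offload: float,
                          per_server_rps: float,
                          headroom: float = 0.3) -> int:
    """Estimate origin servers required to survive a traffic peak.

    peak_rps       -- expected peak requests per second (assumed, from benchmarking)
    cdn_offload    -- fraction of requests served by the CDN, 0.0 to below 1.0
    per_server_rps -- requests per second one server sustains (assumed, benchmarked)
    headroom       -- fraction of each server's capacity kept in reserve
    """
    if not 0.0 <= cdn_offload < 1.0:
        raise ValueError("cdn_offload must be in [0, 1)")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    if per_server_rps <= 0:
        raise ValueError("per_server_rps must be positive")

    origin_rps = peak_rps * (1.0 - cdn_offload)    # traffic that reaches the origin
    usable_rps = per_server_rps * (1.0 - headroom)  # capacity we allow ourselves to use
    return math.ceil(origin_rps / usable_rps)


if __name__ == "__main__":
    # Hypothetical burst: 50,000 requests/s, servers benchmarked at 1,000 requests/s.
    print("no CDN:  ", origin_servers_needed(50_000, 0.0, 1_000))
    print("with CDN:", origin_servers_needed(50_000, 0.9, 1_000))
```

The inputs only mean something if they come from real benchmarks, which is the reason to run them before a launch or a large event rather than during it.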
Did we say redundancy yet?
Don't forget about infrastructure redundancy, either. It's tempting for IT teams to think, "My ISP can handle this, I don't need to do anything else," but even upstream providers can have outages, whether because of a mistaken configuration, hardware failure, or a security incident, Kephart said. Networks by nature will have outages and face security threats, so IT needs to design into the network architecture the flexibility to react when something fails. Enterprises generally do a good job of building redundancy within their own data centers, but they overlook doing the same for third-party infrastructure providers.
Don't rely on a single provider, because that becomes a single point of failure. Distribute dependencies across ISPs, DNS providers, and hosting companies.
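One simple way to act on this advice is to keep an inventory of external dependencies and flag every category that has only one distinct provider behind it. The Python sketch below assumes such an inventory exists as a plain dictionary; the category and provider names are hypothetical placeholders.

```python
from typing import Dict, List


def single_points_of_failure(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return the categories backed by fewer than two distinct providers."""
    return sorted(
        category
        for category, providers in dependencies.items()
        if len(set(providers)) < 2  # duplicates of one provider do not count
    )


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = {
        "isp": ["isp-a"],
        "dns": ["dns-a", "dns-b"],
        "hosting": ["host-a", "host-a"],  # two contracts, same provider
    }
    print(single_points_of_failure(inventory))
```

The check is deliberately crude: it counts provider names and cannot tell whether two providers share the same upstream, so it is a starting point for a review rather than a guarantee of independence.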
It is hard to justify security decisions when the only way to tell whether they worked is to be able to say, "Hey, we didn't get hacked," or, "We didn't have an outage," at the end of the year. Those are great goals, but when there are competing demands, it's hard to justify the extra expenses or added complexity on the possibility that bad things won't happen. But that's the kind of calculus IT needs to be doing every day.