Tuesday’s Global Internet Outage Stemmed From One Obscure Company
And that’s not all! CDNs don’t just store content closer to the devices that crave it. They also help direct it across the internet. “It is like orchestrating traffic flow on a massive road system,” says Ramesh Sitaraman, a computer scientist at the University of Massachusetts at Amherst who helped create the first major CDN as a principal architect at Akamai. “If some link on the internet fails or gets congested, CDN algorithms quickly find an alternate route to the destination.”
So you can start to see how when a CDN goes down, it can take heaping portions of the internet along with it. Although that alone doesn’t quite explain how the impacts on Tuesday were so far-reaching, especially when there are so many redundancies built into these systems. Or at least, there should be.
Again, it’s not clear exactly what happened at Fastly. “We identified a service configuration that triggered disruptions across our POPs globally and have disabled that configuration,” a company spokesperson said in a statement. “Our global network is coming back online.”
“Service configuration” can mean any number of things; the only certainty is that whatever the root cause, it had wide-ranging effects. According to Fastly’s incident report page, every continent other than Antarctica felt the impact. Even after Fastly had fixed the underlying issue, it cautioned that users could still see a lower “cache hit ratio”—how often you can find the content you’re looking for already stored in a nearby server—and “increased origin load,” which refers to the process of going back to the source for items not in the cache. In other words, the cupboards are still fairly bare.
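To make those two terms concrete, here is a minimal sketch in Python of how a cold cache drags down the hit ratio and pushes more requests back to the origin. It is an illustration only, not Fastly's software; the function names `simulate` and `cache_hit_ratio` and the sample request list are assumptions made up for this example.

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of requests answered from the nearby cache."""
    total = hits + misses
    return hits / total if total else 0.0


def simulate(requests, cache=None):
    """Serve requests from a cache (hypothetical model).

    Every miss stands for one trip back to the origin server,
    which is what "increased origin load" refers to.
    """
    cache = set() if cache is None else set(cache)
    hits = misses = 0
    for item in requests:
        if item in cache:
            hits += 1
        else:
            misses += 1       # fetched from the origin
            cache.add(item)   # now stored nearby for next time
    return hits, misses


requests = ["a", "b", "a", "a", "c", "b"]   # made-up traffic
cold = simulate(requests)                    # cache emptied by an outage
warm = simulate(requests, cache={"a", "b", "c"})
```

With an empty cache, half of these six requests have to go back to the origin; with a fully stocked one, none do.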
That an outage occurred is surprising, given that CDNs are typically designed to weather these tempests. “In principle, there is massive redundancy,” says Sitaraman, speaking about CDNs generally. “If a server fails, other servers could take over the load. If an entire data center fails, the load can be moved to other data centers. If things worked perfectly, you could have many network outages, data center problems, and server failures; the CDN’s resiliency mechanisms would ensure that the users never see the degradation.”
When things do go wrong, Sitaraman says, it typically relates to a software bug or configuration error that gets pushed to multiple servers at once.
Even then, the sites and services that employ CDNs typically have their own redundancies in place. Or at least, they should. In fact, you could see hints of how diversified various services are in the speed of their response this morning, says Medina. It took Amazon about 20 minutes to get back up and running, because it could divert traffic to other CDN providers. Anyone who relied solely on Fastly, or who didn’t have automated systems in place to accommodate the disruption, had to wait it out.
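The kind of automated diversion described here can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how Amazon or any other company actually routes traffic: the function `pick_cdn`, the provider labels, and the health check are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def pick_cdn(providers: Sequence[str],
             is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first provider, in order of preference, that passes
    a health check, or None if every provider is down."""
    for name in providers:
        if is_healthy(name):
            return name
    return None


# Hypothetical scenario: the preferred provider is having an outage.
down = {"primary-cdn"}


def check(name: str) -> bool:
    return name not in down


multi = pick_cdn(["primary-cdn", "backup-cdn"], check)   # has a fallback
single = pick_cdn(["primary-cdn"], check)                # relies on one CDN
```

A site with a second provider on its list lands on the backup; a site with only one entry gets nothing back and has to wait for the fix.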
“The outage was the result of monoculture,” says Roland Dobbins, principal engineer of security firm Netscout Arbor. He suggests that every organization with a substantial online presence should have multiple CDN providers to avoid precisely this sort of situation.
Their options, though, are increasingly limited. Just as the cloud has largely been subsumed by Amazon, Google, and Microsoft, three CDN providers—Cloudflare, Akamai, and Fastly—dominate the flow of content online. “There’s a lot of concentration of usage within very few service providers,” Medina says. “Whenever any one of those three providers has an issue, typically it’s not something that lasts a very long time, but it has a major impact across the internet.”
That’s a big part, Medina says, of why these sorts of outages have been more frequent of late, and why they’ll only continue to get worse. Baseball needs a cutoff man; intersections need traffic cops. The fewer of those there are to rely on, the more connections get missed, and the bigger the crashes.
Additional reporting by Lily Hay Newman.
Author Brian Barrett