Tuesday will be remembered as the day the internet broke, before being quickly fixed again. Early in the morning, websites including Amazon, Reddit, Spotify, eBay, Twitch, Pinterest and, sadly, CNET went offline due to a major outage at a service called Fastly. Everywhere you looked there were 503 errors and people complaining about not being able to access key services and media, demonstrating how much the internet relies on this largely unknown cloud service.

After investigating, Fastly published a blog post explaining exactly what happened, and it turns out the whole incident was triggered by a single unnamed Fastly customer.

In mid-May, Fastly released a software deployment containing a bug that, if triggered under specific circumstances, could take down large swathes of its network. The bug remained dormant until June 8, when a Fastly customer inadvertently triggered it with a “valid configuration change”, causing 85% of the network to return errors, the company said.

“We detected the disruption within a minute, then identified and isolated the cause and disabled the configuration,” said Nick Rockwell, senior vice president of engineering and infrastructure at Fastly, in the blog post. “In 49 minutes, 95% of our network was functioning normally. This outage was large and severe, and we are truly sorry for the impact on our customers and all those who rely on them.”

What happened during the Fastly outage?

Around 2:58 a.m. PT, Fastly’s status update page noted an error, saying “we are currently investigating the potential impact on performance with our CDN [content delivery network] services.” Shortly thereafter, reports emerged on Twitter that major news publications, including the BBC, CNN and The New York Times, were offline. Twitter itself was still running, although the server that hosts its emojis went down, resulting in odd-looking tweets.

Rather than isolated incidents affecting individual sites, it turned out to be a massive outage that brought much of the internet to its knees. All over the world, people were receiving Error: 503 messages while trying to access sites, including some essential services, such as the UK government’s gov.uk web properties.

Almost an hour later, at 3:44 a.m. PT (6:44 a.m. ET, at the dawn of the US East Coast workday, and just before noon in the UK), Fastly updated its status page to say the problem had been identified and a fix implemented. At 4:10 a.m. PT, the company tweeted, “We have identified a service configuration that has triggered disruption to our POPs around the world and have disabled that configuration. Our global network is coming back online.”

The same message was sent to CNET as a comment from spokespersons for Fastly.

What is Fastly?

Fastly is a San Francisco-based cloud computing service provider that has been around since 2011. In 2017, it launched an edge cloud platform designed to bring websites closer to the people who use them. In effect, this means that if you visit a website hosted in another country, Fastly stores part of that website closer to you, so there is no need to waste bandwidth fetching all of the site’s content from far away every time you need it.
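To make the idea concrete, here is a minimal sketch in Python of how an edge node might keep a nearby copy of a page. It is only an illustration of caching in general, not Fastly’s actual software: the `EdgeCache` class, the `origin` function and the counter are hypothetical names invented for this example.

```python
from typing import Callable, Dict


class EdgeCache:
    """Toy edge node: serves stored copies and contacts the origin only on a miss."""

    def __init__(self, fetch_from_origin: Callable[[str], str]) -> None:
        # Hypothetical stand-in for a slow request to a far-away origin server.
        self._fetch_from_origin = fetch_from_origin
        self._store: Dict[str, str] = {}
        self.origin_requests = 0  # how many times we had to go the long way

    def get(self, path: str) -> str:
        if path not in self._store:
            # Cache miss: fetch once from the origin and keep a local copy.
            self.origin_requests += 1
            self._store[path] = self._fetch_from_origin(path)
        # Cache hit (or freshly stored copy): answer from nearby storage.
        return self._store[path]


def origin(path: str) -> str:
    """Pretend origin server in another country (assumption for the demo)."""
    return f"content of {path}"


edge = EdgeCache(origin)
first = edge.get("/index.html")   # goes to the origin
second = edge.get("/index.html")  # served from the edge copy
```

In this toy version the second request never reaches the origin, which is the bandwidth saving described above. A real CDN would also expire and refresh its copies, which this sketch leaves out.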

This speeds up website load times and optimizes images, videos and other heavy content so it displays quickly and smoothly when you access a webpage. Among the boasts on the company’s website, it says BuzzFeed pages load 50% faster and that it allowed The New York Times to handle 2 million simultaneous readers on election night. Edge computing also performs vital cybersecurity functions, protecting sites from DDoS attacks and bots, and providing a web application firewall.

Because Fastly sits between a site’s main web servers and the front-end internet as we see it, any error on its part can cause entire websites to go down. The localized nature of the edge cloud platform also means that errors do not necessarily affect all regions equally at the same time (although people around the world reported experiencing issues on Tuesday).

What is a 503 error?

When you see a website displaying a 503 error rather than showing you the page you expected, it means that the server hosting the website is not ready to handle the request. It also indicates that the issue is temporary and will likely be resolved soon.

Usually, this happens when a server is down for maintenance or when a website has been overloaded – for example, if too many people try to access it at the same time.
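Because a 503 signals a temporary problem, software that talks to websites often simply waits and tries again. The sketch below shows one common way to do that, retrying with increasingly long pauses. It is an illustration under stated assumptions rather than a description of how any particular site or browser behaves: `fetch_with_retry`, `send_request` and the delay values are hypothetical, and the request itself is simulated rather than sent over the network.

```python
import time
from typing import Callable, Tuple

SERVICE_UNAVAILABLE = 503  # HTTP status code for "Service Unavailable"


def fetch_with_retry(
    send_request: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send_request until it stops returning 503 or attempts run out."""
    status, body = send_request()
    for attempt in range(1, max_attempts):
        if status != SERVICE_UNAVAILABLE:
            break
        # Wait longer after each failure: 0.5s, 1s, 2s, ...
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = send_request()
    return status, body


# Simulated server (assumption for the demo): fails twice, then recovers.
responses = iter([
    (503, "Service Unavailable"),
    (503, "Service Unavailable"),
    (200, "OK"),
])
delays = []  # records the pauses instead of actually sleeping
result = fetch_with_retry(lambda: next(responses), sleep=delays.append)
```

Here the third attempt succeeds after pauses of half a second and one second. Servers can also send a Retry-After header with a 503 to suggest how long to wait, which a fuller version would respect.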

Why did Fastly fail on Tuesday and will it happen again?

We now know that Tuesday’s internet outage was caused by a service configuration change by one of Fastly’s customers that triggered a hidden bug in Fastly’s network. The bug had been dormant since Fastly deployed a software update on May 12.

Many people have speculated on Twitter that the outage was caused by a cyber attack, but we now know for sure that was not the case. There are many technical reasons why a CDN can fail, and cyber attacks are just one of them.

To make sure the problem does not repeat itself, Fastly said it is taking a number of steps. It is deploying a bug fix across its network and conducting a full post-mortem of the processes and practices it followed during the incident. It is also working to understand why it did not find the bug in its own testing and evaluating ways to improve remediation time.

“Even though there were specific conditions that triggered this outage, we should have anticipated it,” said Rockwell. “We provide mission critical services and we handle any action that may cause service issues with the utmost sensitivity and priority.”

Why have so many websites been affected by the Fastly outage?

Fastly is a service widely used by web publishers, and just how widely became evident on Tuesday when large swathes of the internet became unavailable.

The reason it is so popular is that the services it provides are considered essential by many web properties, yet few companies offer them. As such, a large number of websites depend on a very small group of businesses to keep operating. Similar problems were seen when Cloudflare was hit by an outage last July and when Amazon Web Services went down last November.

As Corinne Cath-Speth, a Ph.D. candidate at the Oxford Internet Institute and the Alan Turing Institute, highlighted on Twitter, this means that “a technical problem in a single company can have huge ramifications”.

“This in turn raises major questions about the dangers of consolidation (of power) in the cloud market and the undisputed influence of these often invisible actors on access to information,” she added.