



Google has revealed the root cause of the outage that disrupted service in its London-based europe-west2-a zone during the recent heat wave.

“One of the data centers hosting zone europe-west2-a was unable to maintain a safe operating temperature due to the simultaneous failure of multiple redundant cooling systems combined with abnormally high outside temperatures,” Google's incident report said.

The report did not explain why the cooling systems failed, but said Google became aware of the problem and launched an investigation.

The company also admitted that it inadvertently changed the traffic routing for an internal service to avoid all three zones in europe-west2.

The Register confirmed the weather records for the day in question. The temperature was 102F/39C in London at 2:20pm, just before Google noticed the cooling issue.

That level of heat is manageable in locations where data center designers know to expect such temperatures. The British capital is not such a place: July 19th was the hottest day on record in London.

Engineers began trying to mitigate the failed cooling systems at 07:02 PT, but their efforts were unsuccessful.

Temperatures in London continued to hover above 95°F/35°C into the evening, and at around 6pm in London, Google engineers said, “We have powered down this part of the zone to prevent further prolonged outages and damage to machinery.”

In other words, they shut down part of the zone to save it from a more severe outage.

The decision to shut down was when the disruption began. Powering down that part of the data center meant “Compute Engine has terminated all VMs in the affected data center, which represents approximately 35% of VMs in the europe-west2-a zone.”

Google also caused further disruption while trying to provide redundancy.

“At the start of the incident, we inadvertently changed our internal service traffic routing to avoid not only the affected europe-west2-a zone, but all three zones in the europe-west2 region.”

So while only part of the europe-west2-a zone was down, Google told itself to ignore working resources in the region's other zones.
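The difference between avoiding one zone and avoiding a whole region can be illustrated with a minimal sketch. Nothing here reflects Google's actual routing systems, which the report does not describe: the `eligible_zones` helper and the `REGION_ZONES` table are hypothetical names used purely for illustration.

```python
# Hypothetical sketch: zone-level versus region-level avoidance.
# REGION_ZONES and eligible_zones are illustrative names, not Google APIs.
REGION_ZONES = {
    "europe-west2": ["europe-west2-a", "europe-west2-b", "europe-west2-c"],
}


def eligible_zones(region: str, avoid: set[str]) -> list[str]:
    """Return the zones in `region` that traffic may still be routed to."""
    return [zone for zone in REGION_ZONES[region] if zone not in avoid]


# What a narrowly scoped change would do: avoid only the affected zone.
intended = eligible_zones("europe-west2", avoid={"europe-west2-a"})

# What the report describes: all three zones in the region are avoided.
actual = eligible_zones("europe-west2", avoid=set(REGION_ZONES["europe-west2"]))

print(intended)  # ['europe-west2-b', 'europe-west2-c']
print(actual)    # []
```

In the first case two healthy zones remain available. In the second, the internal service has nowhere left in the region to send traffic, even though most of the region is working.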

Google and other cloud vendors advise users to spread workloads across multiple zones to improve resilience. Google's error therefore undermined its own advice.

The cooling system was back online at 14:13 Pacific time, just past 10pm in London, when temperatures were still sweltering.
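For readers cross-checking the timeline, the Pacific times in the report can be converted to London time with a short sketch. It assumes fixed summer offsets (Pacific Daylight Time at UTC-7, British Summer Time at UTC+1) and assumes the year was 2022, which the article does not state. The `pacific_to_london` helper is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed summer offsets; real code would use a time zone database.
PDT = timezone(timedelta(hours=-7), "PDT")
BST = timezone(timedelta(hours=1), "BST")


def pacific_to_london(hour: int, minute: int) -> datetime:
    """Convert a Pacific time on July 19 (year 2022 assumed) to London time."""
    pacific = datetime(2022, 7, 19, hour, minute, tzinfo=PDT)
    return pacific.astimezone(BST)


mitigation_start = pacific_to_london(7, 2)
cooling_restored = pacific_to_london(14, 13)

print(mitigation_start.strftime("%H:%M"))  # 15:02
print(cooling_restored.strftime("%H:%M"))  # 22:13
```

On those assumptions, mitigation began just after 3pm in London and cooling was restored at 10:13pm, consistent with the London times given in the text.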

“Google engineers are actively conducting a detailed analysis of the cooling system failure that triggered this incident,” the report said.

The search giant and cloud aspirant also promises:

Investigate and develop more advanced methods to gradually reduce the heat load within a single data center space and reduce the likelihood of requiring a complete shutdown.

Investigate gaps in procedures, tools, and automated recovery systems to significantly improve recovery times in the future.

Audit cooling system equipment and standards across the data centers that house Google Cloud around the world.

The incident report also provides a detailed description of the incident's impact on Google cloud services, showing an outage period of 18 hours and 23 minutes and a “long tail period” of 35 hours and 15 minutes before services returned to normal.

