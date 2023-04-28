



More than 90 Google Cloud services were taken offline in the Paris region after a major data center incident believed to be caused by a water leak in the battery room of a colocation data center.

The incident also briefly caused a complete global outage of the Cloud Console. (Google Cloud reported major impact from 2023-04-25 23:15:30 PDT to 2023-04-26 03:38:40 PDT.)

Google Cloud blamed water intrusion on April 27, when it told customers to expect a lengthy outage of some services following an incident at a data center owned and operated by Global Switch. (Cloud providers, like other customers, often rent racks in colocation data centers operated by third parties.)

Google first reported the incident late on April 25 as an issue affecting multiple cloud services in the europe-west9-a zone. It has since restored some Google Cloud services in the Paris region, but others remain affected.

Global Switch said on Wednesday that firefighters were dispatched and the fire was put out.

Google Cloud Paris outage caused by water intrusion

Global Switch added that the building's fire response system functioned as designed and that no one was injured. A number of customers were temporarily affected, and its on-site team was working to restore service.

A post on the French Network Operators Group said a faulty water pump in the cooling system leaked water into the battery room, which started the fire. (The Stack recognizes that this raises questions about resilience and cooling design, and hopes to see more answers in the future.)

There is also a lively debate about Google Cloud’s resilience, since damage to a single Availability Zone (AZ) left the region generally unavailable. Many customers had assumed that the AZs were completely physically isolated, rather than sitting in a single data center with separate networks and power.

(Google Cloud advises in its availability guidance: A zone is a deployment area for Google Cloud resources within a region. A zone should be viewed as a single fault domain within a region. To deploy fault-tolerant applications with high availability and protection from unexpected failures, deploy them across multiple zones in a region. To protect against the loss of an entire region due to a natural disaster, have a disaster recovery plan and know how to start your application in the unlikely event that the primary region is lost.)
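To make that guidance concrete, here is a minimal sketch in Python of the two ideas it combines: spreading replicas across zones, and falling back to another region when every zone in the primary is lost. It is a toy model, not Google Cloud's API. The function names, the replica count, the choice of europe-west1 as a fallback region and the zone lists are all assumptions made for illustration.

```python
from itertools import cycle

# Illustrative zone lists (assumption: treat these as example data only).
REGIONS = {
    "europe-west9": ["europe-west9-a", "europe-west9-b", "europe-west9-c"],
    "europe-west1": ["europe-west1-b", "europe-west1-c", "europe-west1-d"],
}


def spread_replicas(zones, replica_count):
    """Assign replicas to zones round-robin so they are spread evenly."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(replica_count)]


def surviving_replicas(placement, failed_zones):
    """Count replicas still running after the given zones are lost."""
    return sum(1 for zone in placement if zone not in failed_zones)


def pick_serving_region(regions, failed_zones, primary):
    """Prefer the primary region; fall back to the first region that
    still has at least one healthy zone. Return None if none is left."""
    ordered = [primary] + [name for name in regions if name != primary]
    for name in ordered:
        if any(zone not in failed_zones for zone in regions[name]):
            return name
    return None


if __name__ == "__main__":
    placement = spread_replicas(REGIONS["europe-west9"], 6)

    # Losing one zone leaves replicas running in the other two.
    one_zone_down = {"europe-west9-a"}
    print(surviving_replicas(placement, one_zone_down))
    print(pick_serving_region(REGIONS, one_zone_down, "europe-west9"))

    # Losing every zone in the region requires the disaster recovery path.
    region_down = set(REGIONS["europe-west9"])
    print(surviving_replicas(placement, region_down))
    print(pick_serving_region(REGIONS, region_down, "europe-west9"))
```

The sketch treats zones as independent fault domains, which is the assumption the Paris incident calls into question: if zones share a building, the second scenario (every zone in the region failing together) is the one a recovery plan has to cover.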

When things start to go wrong in a data center, getting the fix right can be harder than many people assume. A post-mortem analysis of an AWS data center incident in Tokyo offers a colorful example of cascading problems…

