



Earlier this month, Google Cloud experienced its biggest-ever blunder when UniSuper, an Australian pension fund managing $135 billion, had its Google Cloud account deleted because of a mistake on Google's side. At the time, UniSuper indicated it had lost everything it had stored with Google, including backups, and its 647,000 members faced two weeks of downtime. The incident prompted a joint statement from the CEOs of Google Cloud and UniSuper, along with many apologies, and it likely left many customers worried that their retirement savings had vanished.

The explanation we received shortly afterward was that “this disruption arose from an unprecedented chain of events that led to an inadvertent misconfiguration during the provisioning of UniSuper's Private Cloud service, ultimately resulting in the deletion of UniSuper's Private Cloud subscription.” Two weeks later, Google Cloud completed its internal investigation into the issue and published a blog post detailing what happened.

Google's post opens with a “TL;DR,” and it sounds like a Google employee got an input wrong:

During the initial deployment of Google Cloud VMware Engine (GCVE) private clouds for customers using our internal tools, a parameter was left blank, causing a Google operator to misconfigure the GCVE service. This had the unintended and subsequently unknown consequence of causing customer GCVE private clouds to default to a fixed duration and then be automatically deleted at the end of that duration. Both the incident trigger and the behavior of our downstream systems have been fixed to ensure this does not happen again.

The most shocking thing about Google's blunder was the sudden and irreversible deletion of a customer account. Shouldn't there be protections, notifications, and confirmations to prevent accidental deletion? Google says it has those safeguards, but the warnings apply only to “customer-initiated deletions” and were not triggered by deletions made through the internal management tool. Google said, “No customer notification was sent because the deletion was caused by a Google operator using an internal tool to blank out a parameter, not a customer request for deletion. Customer-initiated deletions would have been notified in advance.”
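To make that failure mode concrete, here is a minimal Python sketch of how a blank parameter can silently turn into a fixed term with automatic deletion. Google has not published the internal tool's code, so everything here is hypothetical: the names `PrivateCloud`, `provision_unsafe`, `provision_safe`, and `due_for_deletion`, and the 365-day default, are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical placeholder value, chosen only for illustration.
DEFAULT_TERM_DAYS = 365


@dataclass
class PrivateCloud:
    name: str
    created: date
    expires: Optional[date]  # None means "never delete automatically"


def provision_unsafe(name: str, created: date,
                     term_days: Optional[int] = None) -> PrivateCloud:
    """Failure mode: a blank term silently becomes a fixed term."""
    days = term_days if term_days is not None else DEFAULT_TERM_DAYS
    return PrivateCloud(name, created, created + timedelta(days=days))


def provision_safe(name: str, created: date,
                   term_days: Optional[int] = None) -> PrivateCloud:
    """Safer variant: a blank term means no expiry at all."""
    if term_days is None:
        return PrivateCloud(name, created, None)
    if term_days <= 0:
        raise ValueError("term_days must be a positive number of days")
    return PrivateCloud(name, created, created + timedelta(days=term_days))


def due_for_deletion(cloud: PrivateCloud, today: date) -> bool:
    """A downstream cleanup job would only see the expiry date, not how it was set."""
    return cloud.expires is not None and today >= cloud.expires
```

In the unsafe version, nothing downstream can tell that the expiry date came from an omitted field rather than a deliberate choice, which matches Google's description of the consequence as unintended and unknown at the time. Treating a blank value as "no expiry" is only one possible design; rejecting blank input outright would be another.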

In many of its updates during the downtime, UniSuper indicated that it was unable to access its Google Cloud backups and had to turn to a third-party (possibly out-of-date) store to recover. Amid the chaos of the recovery period, UniSuper said: “UniSuper had replicated data in two regions as a protection against outages or loss. However, as UniSuper's private cloud subscription was deleted, data was deleted in these two regions… UniSuper had backups in place with an additional service provider. These backups minimised data loss and significantly improved UniSuper's and Google Cloud's ability to complete restorations.”

In its post-mortem, Google said that “backups of that data stored in Google Cloud Storage in the same region were unaffected by the deletion and, together with third-party backup software, facilitated rapid restoration.” It's hard to reconcile these two statements, especially given the two-week recovery period. The purpose of a backup is to be restored quickly, so either UniSuper's backups were not deleted but were not effective, leading to two weeks of downtime, or they would have been effective had they not been partially or completely erased.

Google emphasized multiple times in the post that the issue affected a single customer, has never happened before, should never happen again, and is not a systemic issue with Google Cloud. Here's the full “Remediation” section of the blog post:

Google Cloud has taken several steps to ensure that this incident does not recur, including:

The internal tool that triggered this chain of events has been deprecated. This aspect is now fully automated and controlled by the customer via the user interface, even if specific capacity management is required.

We scrubbed system databases and manually reviewed all GCVE Private Clouds to ensure no other GCVE deployments were at risk.

We have corrected system behavior that would mark GCVE Private Clouds for deletion during such deployment workflows.

Google said its cloud “still has safeguards in place that include a combination of soft deletion, advance notification, and human intervention when necessary,” and it confirmed that all of these safeguards still work.

