Google Cloud GCVE incident details

What happened? Summary

During the initial deployment of a Google Cloud VMware Engine (GCVE) private cloud for the customer with an internal tool, a Google operator inadvertently misconfigured the GCVE service by leaving a parameter blank. This had the unintended and, at the time, unknown consequence of defaulting the customer's GCVE private cloud to a fixed term, with automatic deletion at the end of that term. Both the incident trigger and the behavior of our downstream systems have been fixed to ensure this does not happen again.

This incident did not affect any Google Cloud services other than this customer's GCVE private cloud. No other customers were affected by this incident.

Digging deeper:

Deployment using exception processing

In early 2023, our operators deployed one of the customer's GCVE private clouds using an internal tool to meet a specific capacity placement need. This internal capacity management tool was retired in Q4 2023 and is no longer needed because the process is now fully automated (i.e., no human intervention is required).

Blank input parameters caused unintended behavior

Google operators followed their internal management protocols, but left one input parameter blank when using the internal tool to provision the customer's private cloud. Because the parameter was blank, the system assigned it a then-unknown default: a fixed term of one year.

The customer's GCVE private cloud was deleted after the system-assigned one-year term expired. The customer received no notification because the deletion was triggered by the parameter a Google operator had left blank in the internal tool, not by a deletion request from the customer. A customer-initiated deletion would have been preceded by a notification to the customer.
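
As an illustration only, the failure mode described above can be sketched in a few lines of Python. This is not Google's internal tooling: the names `PrivateCloud`, `provision_with_silent_default`, `provision_strict`, `term_days`, and `no_expiry` are hypothetical, and the 365-day constant simply mirrors the one-year term mentioned in the text. The sketch contrasts a provisioning call that silently turns a blank term into an expiry date with one that refuses a blank term unless the caller states explicitly that the resource must never expire.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical default, chosen to mirror the one-year term described above.
DEFAULT_TERM_DAYS = 365


@dataclass
class PrivateCloud:
    name: str
    created: date
    expires: Optional[date]  # None means no scheduled deletion


def provision_with_silent_default(name: str, created: date,
                                  term_days: Optional[int] = None) -> PrivateCloud:
    """Failure mode: a blank term quietly becomes a fixed one-year term."""
    days = DEFAULT_TERM_DAYS if term_days is None else term_days
    return PrivateCloud(name, created, created + timedelta(days=days))


def provision_strict(name: str, created: date,
                     term_days: Optional[int] = None,
                     no_expiry: bool = False) -> PrivateCloud:
    """Safer variant: a blank term is an error unless no_expiry is explicit."""
    if no_expiry:
        if term_days is not None:
            raise ValueError("pass either term_days or no_expiry, not both")
        return PrivateCloud(name, created, None)
    if term_days is None:
        raise ValueError("term_days is blank; set a value or no_expiry=True")
    if term_days <= 0:
        raise ValueError("term_days must be positive")
    return PrivateCloud(name, created, created + timedelta(days=term_days))


if __name__ == "__main__":
    risky = provision_with_silent_default("pc-1", date(2023, 1, 1))
    print(risky.expires)  # an expiry date the operator never chose
    safe = provision_strict("pc-1", date(2023, 1, 1), no_expiry=True)
    print(safe.expires)   # None: no scheduled deletion
```

The design point is that a missing value should fail loudly at provisioning time rather than be replaced by a default that schedules a destructive action.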


The customer and our teams worked around the clock over several days to recover the customer's GCVE private cloud, restore its network and security configurations and applications, and recover data so that full operations could resume.

The recovery was helped by the customer's robust and resilient architectural approach to managing the risk of outages and failures.

Data backups stored in Google Cloud Storage in the same region were not affected by the deletion and, together with third-party backup software, were instrumental in the rapid restoration.


Google Cloud has taken several steps to prevent this incident from happening again:

The internal tool that triggered this chain of events has been deprecated; this aspect is now fully automated and controllable by the customer via the user interface, even if specific capacity management is required.

We scrubbed our system databases and manually reviewed all GCVE private clouds to ensure that no other GCVE deployments were at risk.

We have corrected the system behavior that marked GCVE private clouds for deletion in such deployment workflows.


Before this incident, no incident of this nature had occurred within Google Cloud, and this is not a systemic issue.

Google Cloud services have strong safeguards in place that combine soft deletion, advance notification, and human intervention when necessary.

We have confirmed that these safety measures remain in place.
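
To make the three safeguards named above concrete, here is a minimal Python sketch of how soft deletion, advance notification, and human approval can be combined in a single deletion path. It is an assumption-laden illustration, not a description of how Google Cloud implements them: the `Resource` class, the function names, and the 30-day notice and 14-day retention periods are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

NOTICE_DAYS = 30     # hypothetical advance-notice period
RETENTION_DAYS = 14  # hypothetical soft-delete retention window


@dataclass
class Resource:
    name: str
    notified_on: Optional[date] = None
    approved_by: Optional[str] = None
    soft_deleted_on: Optional[date] = None


def notify_owner(res: Resource, today: date) -> None:
    """Record that the owner was told about the pending deletion."""
    res.notified_on = today


def approve_deletion(res: Resource, operator: str) -> None:
    """Record the human who signed off on the deletion."""
    res.approved_by = operator


def soft_delete(res: Resource, today: date) -> None:
    """Mark the resource deleted only after notice and human approval."""
    if res.notified_on is None or today < res.notified_on + timedelta(days=NOTICE_DAYS):
        raise PermissionError("owner has not had the full notice period")
    if res.approved_by is None:
        raise PermissionError("no human approval recorded")
    res.soft_deleted_on = today  # data is retained, not purged


def restore(res: Resource, today: date) -> bool:
    """Undo a soft delete if it is still inside the retention window."""
    if res.soft_deleted_on is None:
        return False
    if today > res.soft_deleted_on + timedelta(days=RETENTION_DAYS):
        return False
    res.soft_deleted_on = None
    return True
```

In this sketch a deletion that skips either the notice period or the approval step raises an error, and a deletion that does go through can still be reversed until the retention window closes.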

Close collaboration with the customer was essential to a rapid recovery. The customer's CIO and technical teams deserve credit for the speed and precision with which they worked with the Google Cloud team around the clock to execute the recovery.

Fail-safe, resilient and robust risk management is essential to recover quickly in the event of an unexpected incident.

Google Cloud continues to deliver the most resilient and stable cloud infrastructure in the world, and despite this one-time incident, Google Cloud's uptime and resiliency have been independently verified to be the best of any major cloud.




