On June 12 of last year, Google Cloud suffered one of its most significant outages in recent years, affecting critical services worldwide. The incident began at 19:51 (Spanish time) and lasted for at least three and a half hours, impacting numerous Google Cloud Platform (GCP) and Google Workspace products, including infrastructure services and applications such as email and storage.
The cause was a faulty automated update to the API management system. It was distributed globally and caused legitimate requests to be rejected en masse, producing 503 errors in services such as Compute Engine, Cloud Storage, BigQuery, and Gmail, among others. Although Google quickly detected the error and applied a temporary mitigation, recovery was uneven, especially in the us-central1 (Iowa) region, where resources took longer to restore because the quota policy database was overloaded.
During the incident, thousands of organizations in Europe, Asia, and the Americas experienced intermittent failures when accessing their services. This created continuity problems for IT teams and severely affected managed data services and artificial intelligence products. Google has acknowledged that the failure 'should not have happened' and has announced measures to prevent future incidents, including improvements to API validation and management.
In Europe, data centers in Madrid, Paris, Berlin, Finland, and other locations reported problems, affecting companies of all sizes, governments, and startups. At 22:49 (Spanish time), Google confirmed that most services had been restored, although some operations in heavily affected regions took a little longer to return to normal.
This incident underscores that, despite the advantages of the cloud, no provider is immune to serious failures. Companies should consider multicloud strategies, keep independent backups, and maintain robust contingency plans to mitigate the effects of similar outages in the future. Google now faces the challenge of restoring its users' trust, and it has promised a detailed technical report explaining the error and the corrective actions taken.
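As an illustration of what such a contingency plan can look like on the client side, the following is a minimal sketch, not anything Google or the affected companies have published. It assumes each provider is wrapped in a plain Python callable and that a 503 response is surfaced as a hypothetical `ServiceUnavailable` exception. The names `call_with_fallback`, `max_attempts`, `base_delay`, and `max_delay` are invented for this example. Each provider is retried with randomized exponential backoff, so that clients do not all retry at the same instant against a backend that is still recovering, before the code moves on to the next provider.

```python
import random
import time


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for an HTTP 503 response from a provider."""


def call_with_fallback(providers, max_attempts=3, base_delay=0.5,
                       max_delay=8.0, sleep=time.sleep):
    """Try each provider in order, retrying 503-style failures with jitter.

    `providers` is a list of zero-argument callables, ordered by preference.
    Returns the first successful result, or raises RuntimeError if every
    provider keeps failing.
    """
    last_error = None
    for fetch in providers:
        for attempt in range(max_attempts):
            try:
                return fetch()
            except ServiceUnavailable as err:
                last_error = err
                if attempt < max_attempts - 1:
                    # Exponential cap (base, 2x base, 4x base, ...) limited
                    # to max_delay, with a random wait below that cap.
                    cap = min(max_delay, base_delay * 2 ** attempt)
                    sleep(random.uniform(0, cap))
    raise RuntimeError("all providers unavailable") from last_error
```

In practice the callables would wrap real client calls, for example a read from the primary cloud's storage bucket followed by a read from an independent backup held with another provider. The retry limits shown are placeholders and would need tuning for each service.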
More information and references in Cloud News.


