The global outage of Amazon Web Services (AWS) knocked out services as diverse as Bizum, Ticketmaster, Canva, Alexa and online video games on Monday morning, October 20. The epicenter was US-EAST-1 (N. Virginia): a DNS resolution problem affecting DynamoDB triggered cascading errors in EC2, Lambda, load balancers and dozens of dependent services. Although everything was restored hours later, the incident once again highlighted two things: we concentrate too much risk in a single region of a single provider, and Europe lacks a Plan B of its own when a hyperscaler fails.
To understand what happened and what can be done, we asked for an assessment from David Carrero, co-founder of Stackscale – Grupo Aire (Spanish cloud and bare-metal infrastructure).
What happened (and why "the Internet went down on us" from Virginia)
- Origin: DNS (Domain Name System) errors resolving DynamoDB in US-EAST-1. As this pillar degraded, instance launches, load balancer health checks and Lambda invocations began to fail.
- Domino effect: many control planes and global services depend on endpoints in that region, which is why Europe saw failed logins, partial loads, latency spikes and growing queues even without running any workloads in Northern Virginia.
- Why Spain noticed it: a large part of the SaaS we use here (payments, ticketing, design, assistants, games) lives on AWS or depends on global services anchored in US-EAST-1.
The expert's reading: HA is not a Plan B
David Carrero (Stackscale): “Many companies in Spain and Europe entrust all of their infrastructure to American providers such as AWS and, on top of that, have no Plan B, not even when their services are truly critical. It's all very well to have high availability (HA), but if everything depends on some common element, the HA will fail.”
“We still lack real multi-region practices: single-region control planes, data centralized in us-east-1 and failover tests that are never rehearsed. The result is always the same: when there is a major outage, we all grind to a halt.”
“In Europe there are many solid options that get underrated due to the pressure to 'be with the big players', the hyperscalers. Stackscale is not the only possible alternative or complement; the European and Spanish ecosystem is broad and professional, with nothing to envy for the vast majority of needs.”
What can we do today (and what should we have ready tomorrow)
For users
- Check the status pages of the apps you use.
- Avoid reinstalling or deleting data if the problem is on the provider's side.
- Retry later: in these incidents, recovery tends to be gradual.
For IT teams
Right now
- Don't touch critical configuration unless you have a clear mitigation route (a toggle to another region that is already prepared).
- Communicate: status, scope, timing of next steps and upcoming updates.
- Record metrics and errors for the post-mortem (what failed, where and for how long).
For next Monday at 9:00
- True multi-region
- Control planes and data decoupled from US-EAST-1 (or any other single region).
- Replication and cut-over tested; gamedays quarterly.
- RTO/RPO per service
- Define realistic objectives: what can be down for X minutes and what cannot; what data you can afford to lose (if any).
- Align architecture and budget with those objectives.
- Global dependencies
- If you use "global" services, verify where they are anchored (IAM, queues, catalogs) and provide alternative routes.
- Avoid "everything to Virginia" out of inertia (cost, catalog or legacy).
- Backups and restoration
- Immutable, offline copies; timed restoration drills (a sketch of such a drill follows this list).
- Operational runbooks, not forgotten PDFs.
- DNS/CDN with failover
- Failover policies in DNS/GTM and alternative origins in the CDN; health checks based on the service, not just on pings (see the sketch after this list).
- Multicloud where it makes sense
- For critical services or for sovereignty reasons, consider a dual-provider setup.
- Keep controls portable (identity, logging, backups) to avoid multiplying complexity.
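To make the timed-restoration point concrete, here is a minimal sketch of a restore drill that measures elapsed time and backup age against RTO/RPO targets. The targets, the `./restore_latest_backup.sh` script and the backup timestamp are hypothetical placeholders for illustration, not part of any tooling mentioned in this article.

```python
import subprocess
from datetime import datetime, timedelta, timezone

# Hypothetical targets for one service; adjust to the RTO/RPO defined above.
RTO = timedelta(minutes=30)   # maximum tolerated downtime
RPO = timedelta(minutes=15)   # maximum tolerated data loss

def restore_drill(restore_cmd: list[str], last_backup_at: datetime) -> None:
    """Run a restore command, time it, and compare the result against RTO/RPO targets."""
    started = datetime.now(timezone.utc)
    subprocess.run(restore_cmd, check=True)       # the restore step from your runbook
    elapsed = datetime.now(timezone.utc) - started
    backup_age = started - last_backup_at         # worst-case data loss window

    print(f"Restore took {elapsed} (RTO {RTO}): {'OK' if elapsed <= RTO else 'MISSED'}")
    print(f"Backup age {backup_age} (RPO {RPO}): {'OK' if backup_age <= RPO else 'MISSED'}")

if __name__ == "__main__":
    # Placeholder command and timestamp; wire these to your real backup tooling.
    restore_drill(["./restore_latest_backup.sh", "--target", "staging"],
                  last_backup_at=datetime.now(timezone.utc) - timedelta(minutes=20))
```

Running a drill like this on a schedule turns "we have backups" into a measured answer to whether the objectives defined above are actually met.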
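For the "health based on the service, not only on pings" idea, a minimal sketch of a dependency-aware health check that decides which origin DNS/GTM should publish. The URLs, origin names and the JSON shape of the health endpoint are assumptions for illustration; how the chosen origin is actually published depends on your DNS or GTM provider.

```python
import json
import urllib.request

# Hypothetical endpoints and origins, for illustration only.
PRIMARY_HEALTH = "https://primary.example.com/healthz/deep"
PRIMARY_ORIGIN = "primary-origin.example.com"
SECONDARY_ORIGIN = "backup-origin.example.net"

def service_is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True only if the dependency-aware health endpoint reports every check as ok."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            checks = json.loads(resp.read().decode("utf-8"))
            # Assumed response shape: {"database": "ok", "queue": "ok", "auth": "ok"}
            return resp.status == 200 and all(v == "ok" for v in checks.values())
    except Exception:
        # Timeouts, DNS errors, non-2xx responses or bad JSON: treat all as unhealthy.
        return False

def choose_origin() -> str:
    """Decide which origin DNS/GTM should point at; publishing the change is provider-specific."""
    return PRIMARY_ORIGIN if service_is_healthy(PRIMARY_HEALTH) else SECONDARY_ORIGIN

if __name__ == "__main__":
    print("Active origin should be:", choose_origin())
```

The point of the design is that the health endpoint exercises real dependencies (database, queues, auth), so failover triggers on what users actually experience rather than on mere network reachability.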
And Europe? Alternatives that already exist
Carrero: “It's not about 'abandoning' the hyperscalers, but about reducing risk concentration and gaining resilience. There are tier-one providers in Europe for private cloud, bare metal, housing, connectivity, backup and managed services that complement the big ones. In most cases, you don't need an AI supercluster to serve your customers: you need continuity.”
Practical ideas:
- Critical data and applications on European infrastructure (private or sovereign), with dedicated links to SaaS/hyperscalers where necessary.
- Layers of continuity (backup, Disaster Recovery, DNS, observability) outside of the same fault domain.
- Local partners for 24/7 support and proximity; it's easier to fix things quickly when the team is close.
Lessons from this incident
- US-EAST-1 cannot be an all-in bet: convenient and cheap, yes; but also a systemic risk.
- Multi-AZ isn't always enough: when a cross-cutting component fails, it drags you down just the same.
- A Plan B has to be rehearsed: if you don't practice the failover, you don't have a failover.
- Transparency: status reports and early communication reduce uncertainty and support load.
Carrero (closing): “Resilience is not a slogan, it's engineering and discipline. If your company lives off its platform, it should be able to keep going even if its main provider goes down. HA is not a Plan B; a Plan B is another complete route to reach the same result.”
In two lines: the AWS outage is not a rarity, it's an operational risk we already know. Spain and Europe need to spread workloads across more regions, diversify providers and activate the local industry as a complement. Once again, the difference between a scare and a crisis will come down to the Plan B.


