On Monday, October 20, 2025, Amazon Web Services went dark for 15 hours. Outages are common in the tech world, and 10-15 minutes of downtime here and there is routine. Fifteen hours of cascading failures is not. It exposed something uncomfortable about the companies building “the future of intelligence”: they don’t have a backup plan.
For several companies, and specifically for three AI companies, this outage exposed a governance failure: we’re building critical AI infrastructure on assumptions that won’t hold in practice or under pressure. The question isn’t whether cloud providers will have outages. They will. The question is whether the AI companies we’re trusting with ever-increasing, business-critical operations across healthcare, education, and financial services have made responsible decisions about what happens when infrastructure fails.
Spoiler: they haven’t.
In this issue, I cover:
What Happened
Revenue Loss Breakdown (Poor Q4 Timing)
What YOU Need To Know
Future Predictions for AI Governance
[Paid Subscribers] How To Leverage This Opportunity
[Paid] The Executive Guide: The Exact Response Executives Need Now
What Happened
The outage originated from an error in a software update to Amazon’s DynamoDB service in Northern Virginia, triggering a cascading chain reaction of service failures across AWS’s infrastructure. Downdetector logged over 50,000 reports at its peak around 7:50am ET, affecting everything from financial services to gaming platforms.
But buried in the chaos was a revealing detail about AI governance: AI companies with billions in funding had no failover strategy.
Perplexity, ChatGPT (OpenAI), and Character.AI were among the AI services disrupted. Perplexity CEO Aravind Srinivas acknowledged on X: “The root cause is an AWS issue. We’re working on resolving it.” This is a public admission that his $500M+ company was entirely dependent on a single cloud provider.
OpenAI fared slightly better. Not all of its systems are hosted on AWS, but its login and authorization services are. Users faced only brief interruptions, but the lack of a failover for this region still resulted in cascading issues throughout the region.
No Azure backup. No GCP failover. Just... waiting for AWS to fix it.
The Q4 Timing Couldn’t Be Worse
Estimated revenue loss: Analysts suggest the outage may have cost businesses over $150 million in lost transactions, downtime, and remediation costs.
Duration: The outage lasted 15 hours, centered in US-EAST-1, one of AWS’s most critical regions.
Outage reports: Over 11 million user reports were logged globally, with 3 million from the U.S. alone.
Fifteen hours of downtime results in significant business impact, especially in Q4, when companies are executing against annual targets and consumer spending is anticipated to peak.
Consider what happens during 15 hours of outage:
Revenue Evaporation:
Ad impressions go unserved (and competitors capture that traffic)
Subscription conversions stall at checkout
Enterprise deals hit pause when demos fail
Holiday shopping momentum shifts to competitors
Operational Chaos:
Engineering teams firefight instead of shipping features
Customer support drowns in “is it just me?” tickets
PR scrambles to manage reputation damage
Leadership diverts attention from strategic priorities to crisis management
Reputation Erosion:
Users who try alternatives during downtime don’t always come back
Enterprise buyers question “mission-critical” claims
Board members ask uncomfortable questions about infrastructure decisions
For AI companies positioning themselves as essential business infrastructure (the backbone of customer service, content generation, and research), this is a credibility crisis.
The Governance Thread: When “Move Fast” Meets Critical Infrastructure
Here’s where this becomes an AI governance story rather than the average tech ops failure:
We’re at an inflection point where AI systems are transitioning from experimental tools to operational dependencies in businesses and in society.
Companies are embedding ChatGPT into customer workflows. Perplexity is becoming the research layer for knowledge workers. Character.AI is where millions go for companionship and creative collaboration.
But the governance frameworks haven’t caught up to the reality of what happens when these services fail.
And here’s the part that should concern everyone: we’ll never know what actually caused this outage.
AD’s Take: AWS will release a post-incident report citing ‘DNS resolution issues’ or ‘configuration errors,’ but the real root cause (especially if it involved agentic systems making decisions without human oversight) will remain behind closed doors. This means we can’t learn from it, can’t prevent a repeat, and therefore can’t hold anyone accountable.
We’ll wonder: Was this a cascading failure triggered by an AI agent making optimization decisions? A software update pushed by automated systems?
We can speculate, but Amazon doesn’t have to tell us. And they won’t.
This opacity is itself a governance failure.
When critical infrastructure fails, transparency isn’t optional, nor should it be. It’s how we learn, adapt, and prevent future incidents. But cloud providers operate under contracts that shield them from meaningful disclosure requirements. They control the narrative, the timeline, and what information gets shared.
Here are the uncomfortable questions:
If AI is “critical infrastructure,” why do its builders lack the redundancy that traditional infrastructure demands?
Who bears the risk when AI services go down—the provider or the businesses depending on them?
What accountability mechanisms exist when billion-dollar companies can shrug and say “AWS issue”?
Why don’t we have the right to know if agentic systems caused cascading failures in infrastructure we depend on?
AD’s Take: AWS holds approximately 30% of the worldwide cloud computing market, making it a concentration risk that extends far beyond any single company’s infrastructure decisions. The incident underscores systemic risks posed by high concentration of digital services within a few dominant providers.
This isn’t about AWS being unreliable (though 15 hours tests that claim). It’s about governance failures at multiple levels. Let’s call them out:
Corporate governance: Boards approving AI strategies without asking “what’s our failover plan?”
Vendor governance: Contracts that offer service credits instead of meaningful accountability.
Industry governance: No standards for what “production-ready AI” actually requires in terms of resilience.
The Three Things Everyone Needs to Know
1. Multi-Cloud Isn’t A Luxury Anymore
This incident proves that it’s existential. The days of “all-in on AWS (or Google, or Azure)” are over for any service claiming to be mission-critical. A system failover to a secondary cloud is now the new bar. Yes, it’s expensive and somewhat complex. But 15 hours of revenue loss makes the ROI calculation pretty simple.
AD’s Take: If your AI partner/provider can’t answer “what happens if your primary cloud provider goes down?”, they’re not ready for enterprise deployment and they can’t provide the partnership your business needs. They’re selling you risk. Not infrastructure. Walk away.
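One way to picture a credible answer to “what happens if your primary cloud provider goes down?” is a request path that tries providers in priority order. The sketch below is a minimal illustration, not a description of how any company named here is built. The names `Provider`, `serve_with_failover`, `primary`, and `secondary` are hypothetical, and the outage is simulated by raising an error.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Provider:
    name: str                   # hypothetical label, e.g. "primary-cloud"
    call: Callable[[str], str]  # hypothetical function that serves a request


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def serve_with_failover(request: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try each provider in priority order; return (provider_name, response)."""
    errors = []
    for provider in providers:
        try:
            return provider.name, provider.call(request)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{provider.name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def primary(request: str) -> str:
    # Simulates the primary region being unavailable.
    raise ConnectionError("primary region unavailable")


def secondary(request: str) -> str:
    # Simulates a healthy secondary provider.
    return f"handled '{request}' on secondary"


providers = [
    Provider("primary-cloud", primary),
    Provider("secondary-cloud", secondary),
]
served_by, response = serve_with_failover("login", providers)
print(served_by, "->", response)
```

Real failover is harder than this: data has to be replicated, DNS and authentication have to move with the traffic, and the secondary path has to be exercised regularly. The point of the sketch is only that the fallback is a design decision made ahead of time, not something improvised during an incident.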
2. Service Credits Are Not Accountability
AWS will offer affected customers service credits, which might amount to 10-30% of monthly spend if customers push hard. This covers approximately 0.01% of actual business losses (revenue, operational costs, reputation damage).
AD’s Take: Contracts need financial penalties that are tied to your business impact, not their service costs. If a vendor won’t negotiate beyond standard terms, that tells you exactly how seriously they will prioritize your risk when incidents occur.
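To see how far a service credit goes against business impact, it can help to run the arithmetic with your own numbers. The sketch below is illustrative only: the function name `credit_coverage` and every figure in it (monthly spend, credit rate, estimated loss) are hypothetical placeholders, not data from this outage or from any AWS contract.

```python
def credit_coverage(monthly_spend: float, credit_rate: float, estimated_loss: float) -> float:
    """Return the fraction of estimated losses covered by a service credit."""
    if estimated_loss <= 0:
        raise ValueError("estimated_loss must be positive")
    credit = monthly_spend * credit_rate
    return credit / estimated_loss


# Hypothetical inputs: replace with your own figures.
monthly_spend = 200_000.0     # monthly cloud bill in dollars (assumed)
credit_rate = 0.10            # 10% credit, the low end of the range above
estimated_loss = 5_000_000.0  # revenue + operations + remediation (assumed)

coverage = credit_coverage(monthly_spend, credit_rate, estimated_loss)
print(f"Service credit: ${monthly_spend * credit_rate:,.0f}")
print(f"Share of estimated loss covered: {coverage:.2%}")
```

The exact ratio depends entirely on the inputs, but the structure of the calculation is the argument: the credit scales with what you pay the vendor, while the loss scales with what the outage costs your business.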
3. The “It’s an AWS Issue” Defense Won’t Age Well
Perplexity’s CEO publicly blamed AWS. But here’s the thing: choosing to depend entirely on AWS was Perplexity’s decision. Architecture is governance. Dependency is a choice. I’m a big fan of Perplexity as a product, and the leadership team seems to run a cohesive ship, so I’m somewhat surprised that they were caught without a secondary cloud.
AD’s Take: As AI becomes regulated (and it will), “our cloud provider failed” won’t be an acceptable defense. The companies building AI systems will be held accountable for ensuring those systems remain available, regardless of the underlying infrastructure.
Future Predictions For AI Governance
Here’s where this gets philosophically interesting (and legally messy):
If a hospital’s diagnostic AI goes down because AWS had an outage, who’s liable? The hospital for choosing that AI vendor? The AI vendor for depending on AWS? AWS for the outage?
Current contracts push all risk to the end customer. AI vendors say “we’re dependent on cloud infrastructure” and cloud providers say “we offer credits, not guarantees.” The customer absorbs all the business risk.
Predictions: we’re heading toward a world where:
Regulatory frameworks will require demonstrated resilience for AI in critical sectors
Liability will flow to whoever made the architectural decisions (not just whose server failed)
Insurance markets will price in infrastructure concentration risk
Enterprise buyers will demand proof of multi-cloud capability, not promises
The companies that recognize this now and adjust accordingly will have a massive competitive advantage when regulation catches up to reality.
Real Leaders Know The Leverage Window Is Now
The following analysis is for paid subscribers who need to act on this information, not just understand it.
Want the Tactical Playbook?
If you’re heading into contract negotiations with AWS or your AI vendors (within 3-6 months) and need specific language, questions, and leverage points, I’ve created a detailed guide to respond to this issue:
Get the guide here → What to Demand from AWS After 15 Hours of Downtime
This guide is for decision makers, VPs, Executives and Directors seeking to gain leverage as an outcome of this outage:
How to calculate your real losses (not what AWS will admit to)
Contract negotiation tactics while you have power
Template email and questions for your AWS review
The multi-cloud architecture conversation your CTO needs to have
Red lines for when to walk away or diversify
Boardrooms might be discussing one or two of these items. You need to ensure you cover everything. Click the image below to get it.
Get the guide here → What to Demand from AWS After 15 Hours of Downtime
What did I miss? How is your organization thinking about infrastructure resilience for AI systems? Reply to this email or comment below. I read and respond to everything.
The Bottom Line
October 20, 2025 wasn’t just another incident. It was a reminder of the trust we put into technological systems each day. Furthermore, it was a stress test of responsible AI governance. Most companies (on AWS) failed.
The ones that passed? They’re the ones who treated infrastructure resilience as a strategic priority, not an operational afterthought. They’re the ones who recognized that moving fast is important, but staying up is essential.
If you’re building AI systems, buying AI services, or governing organizations that depend on AI: this is your wake-up call. Multi-cloud isn’t a nice-to-have. Contractual accountability isn’t negotiable.
You have leverage right now. Use it.
About Me / My Socials etc
Hello and thank you for reading AI Governance, Ethics and Leadership. I’m AD (not an AI). I’ve spent over a decade working in tech and I’ve managed a few of these incidents myself. I’m an engineering leader with an MBA in Information Systems. I’m now a Stanford-certified AI Governance Champion and I enjoy integrating my business, tech, and AI expertise into meaningful insights for you, your community, and your business.
Find me on X (fka Twitter)
Follow the LinkedIn Page
I proved to myself that I can build with vibecoding platforms and still add in key governance features. Check out the AI Environmental Footprint calculator I built.
References
The Register - https://www.theregister.com/2025/10/20/amazon_aws_outage/
CNN Business - https://www.cnn.com/business/live-news/amazon-tech-outage-10-20-25-intl
NPR - https://www.npr.org/2025/10/20/nx-s1-5580312/aws-outage
ThousandEyes - https://www.thousandeyes.com/blog/aws-outage-analysis-october-20-2025



![[PAID] Companion Article for AWS Outage: Real Leaders Know The Leverage Window Is Now](https://substackcdn.com/image/fetch/$s_!3sPO!,w_140,h_140,c_fill,f_auto,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c39c26e-ade6-472d-b296-8c325c4801f7_448x497.png)










