AI Governance, Ethics and Leadership
AI Governance, Ethics and Leadership Podcast
AI Governance in Action: AWS Outage hits Perplexity, Character and ChatGPT and more

How billions in AI funding couldn't buy basic resilience and what it means for governance

On Monday, October 20, 2025, Amazon Web Services went dark for 15 hours. Outages are common in the tech world, and 10-15 minutes of downtime here and there is routine. Fifteen hours of cascading failures is not. The outage exposed something uncomfortable about the companies building “the future of intelligence”: they don’t have a backup plan.

For several companies, and specifically for three AI companies, this outage exposed a governance failure: we’re building critical AI infrastructure on assumptions that won’t hold in practice or under pressure. The question isn’t whether cloud providers will have outages. They will. The question is whether the AI companies we’re trusting (with ever-increasing, business-critical operations across healthcare, education and financial services) have made responsible decisions about what happens when infrastructure fails.

Spoiler: they haven’t.

In this issue, I cover:
  • What Happened

  • Revenue Loss Breakdown (Poor Q4 Timing)

  • What YOU Need To Know

  • Future Predictions for AI Governance

  • [Paid Subscribers] How To Leverage This Opportunity

  • [Paid] The Executive Guide: The Exact Response Executives Need Now

What Happened

The outage originated from an error in a software update to Amazon’s DynamoDB service in Northern Virginia, triggering a cascading chain reaction of service failures across AWS’s infrastructure. Downdetector logged over 50,000 reports at its peak around 7:50am ET, affecting everything from financial services to gaming platforms.

But buried in the chaos was a revealing detail about AI governance: AI companies with billions in funding had no failover strategy.

Perplexity, ChatGPT (OpenAI), and Character.AI were among the AI services disrupted. Perplexity CEO Aravind Srinivas acknowledged on X: “The root cause is an AWS issue. We’re working on resolving it.” This is a public admission that his $500M+ company was entirely dependent on a single cloud provider.

OpenAI fared slightly better. Its entire system is not hosted on AWS, but its login and authorization services are. Users faced only brief interruptions, but the lack of a failover for this region still led to cascading issues across it.

No Azure backup. No GCP failover. Just... waiting for AWS to fix it.
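
For readers who want to picture what a failover strategy means in practice, here is a minimal sketch in Python. It is not how any of these companies build their systems. The provider functions `primary_cloud` and `secondary_cloud` are hypothetical stand-ins I made up for real service calls, and a production setup would also need health checks, data replication and timeouts.

```python
from typing import Callable, List, Tuple


class AllProvidersDown(Exception):
    """Raised when every configured provider fails."""


def call_with_failover(providers: List[Tuple[str, Callable[[], str]]]) -> Tuple[str, str]:
    """Try each provider in priority order; return (provider_name, result)."""
    errors = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:  # a real system would catch only outage-type errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersDown("; ".join(errors))


# Hypothetical stand-ins for real service calls (assumptions, not real APIs).
def primary_cloud() -> str:
    raise ConnectionError("region unavailable")


def secondary_cloud() -> str:
    return "ok"


served_by, result = call_with_failover([
    ("primary", primary_cloud),
    ("secondary", secondary_cloud),
])
print(served_by, result)
```

The point of the sketch is the ordering: the request only fails for the user when every provider on the list has failed, instead of when the first one does.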

The Q4 Timing Couldn’t Be Worse

  • Estimated revenue loss: Analysts suggest the outage may have cost businesses over $150 million in lost transactions, downtime, and remediation costs.

  • Duration: The outage lasted 15 hours and was centered in US-EAST-1, one of AWS’s most critical regions.

  • Outage reports: Over 11 million user reports were logged globally, with 3 million from the U.S. alone.

Fifteen hours of downtime has significant business impact, especially in Q4, when companies are executing against annual targets and consumer spending is expected to peak.

Consider what happens during 15 hours of outage:

Revenue Evaporation:

  • Ad impressions go unserved (and competitors capture that traffic)

  • Subscription conversions stall at checkout

  • Enterprise deals hit pause when demos fail

  • Holiday shopping momentum shifts to competitors

Operational Chaos:

  • Engineering teams firefight instead of shipping features

  • Customer support drowns in “is it just me?” tickets

  • PR scrambles to manage reputation damage

  • Leadership diverts attention from strategic priorities to crisis management

Reputation Erosion:

  • Users who try alternatives during downtime don’t always come back

  • Enterprise buyers question “mission-critical” claims

  • Board members ask uncomfortable questions about infrastructure decisions

For AI companies positioning themselves as essential business infrastructure (the backbone of customer service, content generation and research), this is a credibility crisis.

The Governance Thread: When “Move Fast” Meets Critical Infrastructure

Here’s where this becomes an AI governance story rather than the average tech ops failure:

We’re at an inflection point where AI systems are transitioning from experimental tools to operational dependencies in businesses and in society.

Companies are embedding ChatGPT into customer workflows. Perplexity is becoming the research layer for knowledge workers. Character.AI is where millions go for companionship and creative collaboration.

But the governance frameworks haven’t caught up to the reality of what happens when these services fail.

And here’s the part that should concern everyone: we’ll never know what actually caused this outage.

AD’s Take: AWS will release a post-incident report citing ‘DNS resolution issues’ or ‘configuration errors,’ but the real root cause (especially if it involved agentic systems making decisions without human oversight) will remain behind closed doors. This means we can’t learn from it, can’t prevent it, and therefore can’t hold anyone accountable.

We’ll wonder: Was this a cascading failure triggered by an AI agent making optimization decisions? A software update pushed by automated systems?

We can speculate, but Amazon doesn’t have to tell us. And they won’t.

This opacity is itself a governance failure.

When critical infrastructure fails, transparency isn’t optional, nor should it be. It’s how we learn, adapt, and prevent future incidents. But cloud providers operate under contracts that shield them from meaningful disclosure requirements. They control the narrative, the timeline, and what information gets shared.

Here are the uncomfortable questions:

  • If AI is “critical infrastructure,” why do its builders lack the redundancy that traditional infrastructure demands?

  • Who bears the risk when AI services go down—the provider or the businesses depending on them?

  • What accountability mechanisms exist when billion-dollar companies can shrug and say “AWS issue”?

  • Why don’t we have the right to know if agentic systems caused cascading failures in infrastructure we depend on?

AD’s Take: AWS holds approximately 30% of the worldwide cloud computing market, making it a concentration risk that extends far beyond any single company’s infrastructure decisions. The incident underscores systemic risks posed by high concentration of digital services within a few dominant providers.

This isn’t about AWS being unreliable (though 15 hours tests that claim). It’s about governance failures at multiple levels. Let’s call them out:

  1. Corporate governance: Boards approving AI strategies without asking “what’s our failover plan?”

  2. Vendor governance: Contracts that offer service credits instead of meaningful accountability.

  3. Industry governance: No standards for what “production-ready AI” actually requires in terms of resilience.

The Three Things Everyone Needs to Know

1. Multi-Cloud Isn’t A Luxury Anymore

This incident proves that it’s existential. The days of “all-in on AWS (or Google, or Azure)” are over for any service claiming to be mission-critical. A system failover to a secondary cloud is now the new bar. Yes, it’s expensive and somewhat complex. But 15 hours of revenue loss makes the ROI calculation pretty simple.
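
To make that ROI calculation concrete, here is a rough sketch in Python. Every number in it is an illustrative placeholder I made up, not a figure from this outage, and the function name `breakeven_outage_hours` is my own. It also counts only lost revenue, so it understates the case by ignoring operational and reputation costs.

```python
def breakeven_outage_hours(annual_multicloud_cost: float, revenue_per_hour: float) -> float:
    """Hours of avoided downtime per year at which a secondary cloud pays for itself."""
    if revenue_per_hour <= 0:
        raise ValueError("revenue_per_hour must be positive")
    return annual_multicloud_cost / revenue_per_hour


# Illustrative placeholder inputs only (assumptions, not reported data).
annual_cost = 1_200_000.0   # hypothetical yearly cost of running a standby cloud
hourly_revenue = 100_000.0  # hypothetical revenue lost per hour of downtime

hours = breakeven_outage_hours(annual_cost, hourly_revenue)
pays_off_after_15h_outage = 15 >= hours
print(f"Break-even at {hours:.1f} hours of avoided downtime per year")
print(f"Covered by a single 15-hour outage: {pays_off_after_15h_outage}")
```

Swap in your own cost and revenue figures. If the break-even lands below 15 hours, one outage like this one would have paid for the standby on its own.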

AD’s Take: If your AI partner/provider can’t answer “what happens if your primary cloud provider goes down?”, they’re not ready for enterprise deployment and they can’t provide the partnership your business needs. They’re selling you risk. Not infrastructure. Walk away.

2. Service Credits Are Not Accountability

AWS will offer affected customers service credits, which might amount to 10-30% of monthly spend if they push hard. By my estimate, that covers roughly 0.01% of actual business losses (revenue, operational costs, reputation damage).
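
If you want to run this comparison for your own business, here is a small sketch in Python. The inputs are hypothetical placeholders, not data from any affected customer, and the function name `credit_coverage` is my own. The coverage you get depends entirely on your cloud spend and your actual losses, so treat the printed figure as an illustration of the method rather than a benchmark.

```python
def credit_coverage(monthly_spend: float, credit_rate: float, business_loss: float) -> float:
    """Fraction of the total business loss that a service credit covers."""
    if business_loss <= 0:
        raise ValueError("business_loss must be positive")
    return (monthly_spend * credit_rate) / business_loss


# Hypothetical placeholder inputs (assumptions, not reported figures).
monthly_spend = 50_000.0     # monthly cloud bill
credit_rate = 0.30           # top of the 10-30% range mentioned above
business_loss = 1_500_000.0  # revenue + operational + reputation cost of the outage

coverage = credit_coverage(monthly_spend, credit_rate, business_loss)
print(f"Service credit covers {coverage:.2%} of the loss")
```

The credit is calculated on what you pay the vendor, while the loss is calculated on what the outage cost you. Those two numbers are rarely in the same league.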

AD’s Take: Contracts need financial penalties that are tied to your business impact, not their service costs. If a vendor won’t negotiate beyond standard terms, that tells you exactly how seriously they will prioritize your risk when incidents occur.

3. The “It’s an AWS Issue” Defense Won’t Age Well

Perplexity’s CEO publicly blamed AWS. But here’s the thing: choosing to depend entirely on AWS was Perplexity’s decision. Architecture is governance. Dependency is a choice. I’m such a fan of Perplexity as a product, and the leadership team seems to run a cohesive ship, so I’m somewhat surprised that they were caught without a secondary cloud.

AD’s Take: As AI becomes regulated (and it will), “our cloud provider failed” won’t be an acceptable defense. The companies building AI systems will be held accountable for ensuring those systems remain available, regardless of the underlying infrastructure.

Future Predictions For AI Governance

Here’s where this gets philosophically interesting (and legally messy):

If a hospital’s diagnostic AI goes down because AWS had an outage, who’s liable? The hospital for choosing that AI vendor? The AI vendor for depending on AWS? AWS for the outage?

Current contracts push all risk to the end customer. AI vendors say “we’re dependent on cloud infrastructure” and cloud providers say “we offer credits, not guarantees.” The customer absorbs all the business risk.

Predictions: we’re heading toward a world where:

  • Regulatory frameworks will require demonstrated resilience for AI in critical sectors

  • Liability will flow to whoever made the architectural decisions (not just whose server failed)

  • Insurance markets will price in infrastructure concentration risk

  • Enterprise buyers will demand proof of multi-cloud capability, not promises

The companies that recognize this now and adjust accordingly will have a massive competitive advantage when regulation catches up to reality.

Thanks for reading AI Governance, Ethics and Leadership! This post has critical insights that will spark action-oriented discussions. Feel free to share it.

Real Leaders Know The Leverage Window Is Now

The following analysis is for paid subscribers who need to act on this information, not just understand it.

Want the Tactical Playbook?

If you’re heading into contract negotiations with AWS or your AI vendors (within the next 3-6 months) and need specific language, questions, and leverage points, I’ve created a detailed guide to respond to this issue:

Get the guide here → What to Demand from AWS After 15 Hours of Downtime

This guide is for decision makers, VPs, executives and directors seeking to gain leverage from this outage. It covers:

  • How to calculate your real losses (not what AWS will admit to)

  • Contract negotiation tactics while you have power

  • Template email and questions for your AWS review

  • The multi-cloud architecture conversation your CTO needs to have

  • Red lines for when to walk away or diversify

Boardrooms might be discussing one or two of these items. You need to ensure you cover everything.

What did I miss? How is your organization thinking about infrastructure resilience for AI systems? Reply to this email or comment below. I read and respond to everything.

The Bottom Line

October 20, 2025 wasn’t just another incident. It was a reminder of the trust we put into technological systems each day. Furthermore, it was a stress test of responsible AI governance. Most companies (on AWS) failed.

The ones that passed? They’re the ones who treated infrastructure resilience as a strategic priority, not an operational afterthought. They’re the ones who recognized that moving fast is important, but staying up is essential.

If you’re building AI systems, buying AI services, or governing organizations that depend on AI: this is your wake-up call. Multi-cloud isn’t a nice-to-have. Contractual accountability isn’t negotiable.

You have leverage right now. Use it.

About Me / My Socials etc

Hello and thank you for reading AI Governance, Ethics and Leadership. I’m AD (not an AI). I’ve spent over a decade working in tech and I’ve managed a few of these incidents myself. I’m an engineering leader with an MBA in Information Systems. I’m now a Stanford certified AI Governance Champion and I enjoy integrating my business, tech and AI expertise into meaningful insights for you, your community and your business.

References

  1. Engadget - https://www.engadget.com/big-tech/amazons-aws-outage-has-knocked-services-like-alexa-snapchat-fortnite-venmo-and-more-offline-142935812.html

  2. The Register - https://www.theregister.com/2025/10/20/amazon_aws_outage/

  3. CNN Business - https://www.cnn.com/business/live-news/amazon-tech-outage-10-20-25-intl

  4. LatestLY - https://www.latestly.com/socially/technology/perplexity-down-perplexity-ai-services-not-working-for-users-worldwide-ceo-aravind-srinivas-says-root-cause-is-an-aws-issue-7168367.html

  5. NPR - https://www.npr.org/2025/10/20/nx-s1-5580312/aws-outage

  6. ThousandEyes - https://www.thousandeyes.com/blog/aws-outage-analysis-october-20-2025
