English Upgrade #14: Incidents and Post-Mortems

Thuan: 2 AM. PagerDuty wakes me up. Production is down. My manager pings: “What’s happening?” Slack is exploding. My CEO is in the channel. And I need to write clear English updates while my brain is in full panic mode.

Alex: Incidents are the ultimate communication test. You’re tired, stressed, and every word you write is being read by people who don’t understand the technical details but really care about the business impact. Let’s build your incident communication kit.

Phase 1: During the Incident

The First Status Update (Within 5 Minutes)

Template:

🔴 INCIDENT: [Service/Feature] is [down/degraded]
Impact: [Who is affected] [How many users]
Status: Investigating
Next update: [time]

Example:

🔴 INCIDENT: Payment processing is failing
Impact: All users attempting checkout (est. 500/hour)
Status: Investigating — team is online
Next update: 15 minutes

Thuan: Short. Direct. No explanation of why yet.

Alex: Exactly. The first update answers one question: “Do you know about it?” That’s it. Save the diagnosis for later.
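
If your team posts these updates from a script or a chat bot, the template maps onto a very small formatter. The sketch below is one possible Python version. The `FirstUpdate` class and its field names are hypothetical, chosen only for illustration, and are not part of any incident tool.

```python
from dataclasses import dataclass


@dataclass
class FirstUpdate:
    """Hypothetical container for the four fields of the first update."""
    service: str                    # e.g. "Payment processing"
    state: str                      # e.g. "is failing", "is down", "is degraded"
    impact: str                     # who is affected and roughly how many
    next_update_minutes: int = 15   # when readers can expect the next message

    def render(self) -> str:
        """Return the first status update as four plain-text lines."""
        return "\n".join([
            f"🔴 INCIDENT: {self.service} {self.state}",
            f"Impact: {self.impact}",
            "Status: Investigating",
            f"Next update: {self.next_update_minutes} minutes",
        ])


update = FirstUpdate(
    service="Payment processing",
    state="is failing",
    impact="All users attempting checkout (est. 500/hour)",
)
print(update.render())
```

There is deliberately no field for a root cause here: the first update only confirms that the team knows about the problem.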

Ongoing Status Updates (Every 15-30 Minutes)

Template:

🟡 UPDATE: [Service] — [Still investigating / Identified / Fix in progress]
Root cause: [If known] / Still investigating
What we've tried: [Actions taken]
ETA to resolution: [If known] / Unknown
Next update: [time]

Example:

🟡 UPDATE: Payment processing — root cause identified
Root cause: Database connection pool exhausted by a traffic spike from a marketing campaign
What we've tried: Increased pool size from 20 to 50
ETA to resolution: Fix deployed, monitoring for 10 minutes
Next update: 10 minutes
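
The ongoing template has two fields that are often still unknown, so a formatter can fall back to the template's own placeholder wording. The following is a rough Python sketch under that assumption. The function `render_ongoing_update`, its parameters, and the 02:20 send time are all invented for illustration. It also converts the update interval into a clock time, which makes "Next update at [time]" easy to fill in.

```python
from datetime import datetime, timedelta
from typing import Optional


def render_ongoing_update(
    service: str,
    phase: str,
    tried: str,
    sent_at: datetime,
    interval_minutes: int = 15,
    root_cause: Optional[str] = None,
    eta: Optional[str] = None,
) -> str:
    """Fill in the ongoing-update template, using fallbacks for unknowns."""
    next_update = sent_at + timedelta(minutes=interval_minutes)
    return "\n".join([
        f"🟡 UPDATE: {service} - {phase}",
        # Unknown fields fall back to the template's own placeholder text.
        f"Root cause: {root_cause or 'Still investigating'}",
        f"What we've tried: {tried}",
        f"ETA to resolution: {eta or 'Unknown'}",
        f"Next update: {next_update:%H:%M}",
    ])


# Hypothetical send time: 02:20 on an arbitrary date.
text = render_ongoing_update(
    service="Payment processing",
    phase="still investigating",
    tried="Increased pool size from 20 to 50",
    sent_at=datetime(2024, 1, 1, 2, 20),
)
print(text)
```

Writing "Unknown" explicitly is the point of the fallback: an honest "Unknown" reads better than a missing line.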

Resolution Update

Template:

🟢 RESOLVED: [Service] is fully operational
Duration: [start → end time]
Root cause: [One sentence]
Impact: [Users affected, transactions failed, etc.]
Follow-up: Post-mortem scheduled for [date]

Essential Incident Phrases

Situation	Phrase
Acknowledging	“We’re aware of the issue and investigating.”
Team is working	“The on-call team is online and working on this.”
Asking for time	“We need about [X] minutes to diagnose. Next update at [time].”
Partial fix	“We’ve applied a temporary fix. The system is partially restored. Working on a full resolution.”
Asking for help	“I need [Name] to join the incident channel. We need expertise on [system].”
Delegating	“[Name], can you check the [database/service/logs]? I’ll handle communication.”
Escalating	“Escalating to [manager/VP]. Impact is [scope]. We need a decision on [X].”

What NOT to Say During an Incident

Don’t Say	Why	Say Instead
“I have no idea what’s happening”	Creates panic	“We’re investigating. More info shortly.”
“[Name] caused this”	Blame destroys trust	“We’ve identified the root cause.”
“This should be a quick fix”	Sets wrong expectations	“We’re working on a fix. I’ll update in 15 minutes.”
“It’s not my fault”	No one asked	Focus on the fix, not fault
Nothing (radio silence)	Silence = panic	Always send an update, even if it only says “Still investigating”

Phase 2: The Post-Mortem

Thuan: We do post-mortems, but they’re either blame sessions or “we’ll try harder next time.”

Alex: A good post-mortem is blameless, data-driven, and action-oriented. Here’s the structure:

Post-Mortem Document Template

# Post-Mortem: [Incident Title]

**Date:** [Date]
**Duration:** [Start time] — [End time] ([X] minutes)
**Severity:** [P1/P2/P3]
**Author:** [Name]

## Summary
[One paragraph: what happened, who was impacted, how 
it was resolved]

## Timeline
- [HH:MM] — [Event]
- [HH:MM] — [Event]
- [HH:MM] — [Event]

## Root Cause
[Technical explanation of what failed and why]

## Impact
- [X] users affected
- [Y] transactions failed
- [$Z] estimated revenue impact
- [X] minutes of downtime

## What Went Well
- [Things the team did right]

## What Could Be Improved
- [Process/system gaps — blameless]

## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| [Specific action] | [Name] | [Date] |
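
The Duration and Impact fields are simple arithmetic, but they are easy to get wrong when you are tired. As a rough sketch, assuming you have the start and end timestamps and an hourly rate like the "est. 500/hour" from the first status update, the numbers could be computed as below. The function names and the 02:00 to 02:45 window are hypothetical; the 45-minute length matches the outage used in this post's later examples.

```python
from datetime import datetime


def outage_minutes(start: datetime, end: datetime) -> int:
    """Length of the outage in minutes, rounded to the nearest minute."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    return round((end - start).total_seconds() / 60)


def estimated_affected(rate_per_hour: float, minutes: int) -> int:
    """Rough count of affected users, assuming a constant hourly rate."""
    return round(rate_per_hour * minutes / 60)


# Hypothetical window; only the 500/hour rate comes from the status update.
start = datetime(2024, 1, 1, 2, 0)
end = datetime(2024, 1, 1, 2, 45)

minutes = outage_minutes(start, end)
affected = estimated_affected(500, minutes)
print(f"Duration: {start:%H:%M} to {end:%H:%M} ({minutes} minutes)")
print(f"Impact: about {affected} users affected")
```

The estimate assumes traffic stayed flat during the outage. During a traffic spike the real number would be higher, so label it as an estimate in the document.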

Post-Mortem Meeting Phrases

Moment	Phrase
Opening	“This is a blameless post-mortem. We’re here to learn, not to assign blame.”
Timeline review	“Let me walk through the timeline. At [time], [event happened].”
Identifying gaps	“Where in this timeline could we have detected the issue earlier?”
Discussing improvements	“How can we prevent this class of issue from happening again?”
Assigning actions	“Let’s create specific, actionable follow-ups. Each one needs an owner and a deadline.”
Closing	“Good discussion. We’ve identified [X] action items. Let’s review progress in next week’s team meeting.”

Blameless Language in Post-Mortems

Blaming	Blameless
“The deploy broke production because [name] didn’t test”	“The deploy introduced a regression that wasn’t caught by our test suite”
“Operations team was too slow to respond”	“Our alerting didn’t trigger for 20 minutes, delaying response”
“Nobody was monitoring the dashboard”	“The anomaly wasn’t flagged by our monitoring rules”
“The junior developer pushed bad code”	“Our code review process didn’t catch the edge case”

Key Principle: Replace the person with the system. Ask “what in our process allowed this?” not “who messed up?”

Phase 3: After the Incident

The Internal All-Hands Update

If the incident was significant:

“Last Thursday, we had a 45-minute outage affecting payment processing. Root cause: our database connection pool was exhausted during a traffic spike. We’ve taken three actions: increased the connection pool size, added auto-scaling, and improved our monitoring alerts. The post-mortem is published at [link]. Questions?”

The Client Communication

If clients were affected:

“Subject: Service Interruption — [Date] — Resolution

Dear [Client],

On [date], our payment processing experienced a 45-minute interruption between [time] and [time].

Impact: Some transactions during this window may have failed. All affected transactions have since been retried automatically and processed.

Root Cause: Unexpected traffic volume exceeded our database capacity.

Actions Taken: We’ve implemented capacity improvements and enhanced monitoring to prevent recurrence.

We apologize for the inconvenience. Please reach out if you are still experiencing any issues.

Best regards, Thuan”

10-Minute Self-Practice

The Incident Update Drill (5 min)

1. Imagine your most important service is down right now
2. Write the first status update using the template
3. Write a 15-minute follow-up update
4. Write the resolution update
5. Time check: could you write all three in under 5 minutes? Good, that’s the skill.

The Post-Mortem Writing Practice (5 min)

1. Think of a recent bug or outage (even a small one)
2. Write the root cause in one sentence (blameless)
3. Write one “what went well” and one “what could be improved”
4. Write one SMART action item with owner and date

What’s Next

Incidents are no longer communication nightmares. Next post: Career English — Salary, Promotions, and Job Moves — the English that directly impacts your income.

This is Part 14 of the English Upgrade series. Related: Tech Coffee Break #9: Security — many incidents start as security issues.

Also see: English Upgrade #6: Retros — post-mortems are retros for production incidents.
