Post-mortems without action items are just storytelling. We implemented a strict follow-up process: every post-mortem produces at most three concrete action items, each assigned to a specific person with a deadline. Items that don’t get done within two sprints get escalated or explicitly deprioritized.
Authentication turned out to be the most politically charged decision in the entire project. Every team had opinions about OAuth providers, session management strategies, and token lifetimes. We eventually settled on a pragmatic middle ground that nobody loved but everyone could live with.
Where We Struggled
We built a lightweight internal developer portal that aggregates service ownership, runbook links, API docs, and deployment status. It took one engineer three sprints to build using a static site generator, and it immediately became the first place anyone goes when an incident starts.
Our initial benchmark numbers looked promising in staging but fell apart under production traffic patterns. The difference? Staging used uniform request distributions while real users exhibit bursty, correlated behavior that exposes different bottlenecks entirely.
What worked for us won’t work for everyone. Context matters enormously. But we hope sharing our experience saves someone else from repeating our more expensive mistakes.