Outage of all services

Incident Report for Chili Piper

Postmortem

Action Taken

Escalated to Confluent and engaged with their engineering team throughout the incident.
Began internal preparation of a fallback Kafka cluster independent of Confluent, to be used if needed.
After recovery, performed thorough verification of platform functionality and monitored stability.
Initiated and coordinated an investigation with Confluent.
Requested and received a detailed postmortem from Confluent outlining root cause and contributing factors.

Follow-up Steps

Evaluate feasibility and timeline for decoupling from Confluent to ensure greater control and resilience.
Set up monitoring and alerting specifically for Kafka cluster reachability to detect future issues earlier.
Review and improve vendor migration validation processes.
Review Confluent's root cause analysis and postmortem.
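
As an illustration of the Kafka reachability monitoring item above, the following is a minimal sketch of a TCP-level probe. It is not the monitoring Chili Piper runs. The function names, the "host:port" input format, and the alert threshold are all assumptions made for this example. It only checks that a broker endpoint accepts a TCP connection; it does not verify authentication, TLS, or Kafka protocol health.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_brokers(brokers: list[str], timeout: float = 3.0) -> dict[str, bool]:
    """Probe each "host:port" entry (hypothetical input format) and
    report whether it accepted a connection."""
    results = {}
    for broker in brokers:
        host, _, port = broker.rpartition(":")
        results[broker] = is_reachable(host, int(port), timeout)
    return results


def should_alert(results: dict[str, bool], min_reachable: int = 1) -> bool:
    """Alert when fewer than min_reachable brokers accepted a connection.
    The threshold is an assumption, not a recommended value."""
    return sum(results.values()) < min_reachable
```

A scheduler could call check_brokers with the cluster's bootstrap addresses at a fixed interval and page the on-call engineer whenever should_alert returns True. A fuller check would also attempt an authenticated metadata request with a Kafka client, since a port can accept connections while the cluster is still unusable.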

Posted Jul 28, 2025 - 16:44 UTC

Resolved

Chili Piper experienced a full platform outage affecting all services that depend on Kafka. These services could not send or receive data during the outage, which took all products offline and made them inaccessible.

The outage was detected immediately by our internal alerting system, and our on-call Site Reliability Engineering team began working to correct the issue. We escalated to the Confluent Cloud team through all available channels.

Approximately 1 hour and 50 minutes after the incident began, Confluent confirmed that its internal teams were actively working to resolve the issue and that additional escalation within Confluent had been prioritized.

To mitigate the issue, we created a new subscription through Confluent's system to restore access.

Service was fully restored at 19:12 UTC (3:02pm EDT).

No existing data was lost, but no new data was processed during the outage, which means no routes or meetings could be booked through Chili Piper.

Posted Jul 27, 2025 - 16:00 UTC