ChiliCal Links displaying a "Something Went Wrong" page

Incident Report for Chili Piper

Postmortem

Incident Overview

Some users experienced intermittent failures when accessing ChiliCal scheduling links, while others lost access to every booking link in their organization. Affected users saw a “Something Went Wrong” error screen. The issue did not affect all links; it occurred unpredictably across users and disrupted scheduling and handoff workflows.

After the fix was deployed, we found that service-level caching continued to affect some links. This delayed the rollout of the fix to all users and made the incident last longer than expected.

Timeline

09:00 - 09:30 UTC – Incident reported. Users began experiencing intermittent issues with ChiliCal scheduling links.
09:45 UTC – Engineering team began investigation and confirmed the issue was reproducible.
10:00 UTC – Initial fix for the root cause deployed as a hotfix.
10:30 UTC – Caching identified as the suspected reason the errors persisted after the hotfix.
12:45 UTC – Team forced cache clear across relevant instances to expedite recovery.
13:08 UTC – Systems showed recovery. Monitoring phase initiated.
Post-Incident – Recommended browser cache clear for users still seeing the issue. Ongoing monitoring confirmed full resolution.

Root Cause

On April 28th, we found that our JS framework (NextJS) did not produce deterministic application builds, and we deployed a fix for this as part of an unrelated incident. Although fixes were in place for the framework problem, this incident was also driven by stale or inconsistent application-level caching, which broke the rendering of certain ChiliCal scheduling links. The cache was not invalidated properly after recent backend changes, so some users were served outdated or broken state.

Resolution

The team manually forced a cache clear on affected services. This action immediately restored functionality for most users. Users who were still affected were likely seeing residual errors from their local browser cache, which were resolved by clearing it manually.

Preventative Actions

Completed:

Manual cache purge on affected infrastructure.
Replaced the previous stop-gap measure with a different technique that does not rely on framework-specific functions.
Communication to Customer Love and affected stakeholders with recommended local steps.

Planned:

Implement automated cache invalidation tied to relevant backend deployments.
Improve detection and alerting for cache-related anomalies (e.g., error rate spikes).
Migrate from our current JS library / framework to an alternative with broader support and flexibility.
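
One way the planned automated cache invalidation could work is to scope every cache key to the deployment that wrote it. The TypeScript sketch below is a minimal illustration under stated assumptions, not a description of the actual ChiliCal services: the class DeployScopedCache, the deploy ID strings, and the onDeploy hook are all hypothetical names.

```typescript
// Hypothetical sketch: an in-memory cache whose entries are scoped to a deploy ID.
// None of these names come from the incident report.
class DeployScopedCache<V> {
  private readonly store = new Map<string, V>();
  private deployId: string;

  constructor(initialDeployId: string) {
    this.deployId = initialDeployId;
  }

  // Prefix every key with the current deploy ID so entries written by an
  // older deployment can never be read by a newer one.
  private scopedKey(key: string): string {
    return `${this.deployId}:${key}`;
  }

  get(key: string): V | undefined {
    return this.store.get(this.scopedKey(key));
  }

  set(key: string, value: V): void {
    this.store.set(this.scopedKey(key), value);
  }

  // Intended to be called from a (hypothetical) post-deploy hook.
  // Switches to the new deploy ID and evicts entries from older deployments.
  // Returns the number of evicted entries.
  onDeploy(newDeployId: string): number {
    this.deployId = newDeployId;
    const prefix = `${newDeployId}:`;
    let evicted = 0;
    for (const key of Array.from(this.store.keys())) {
      if (!key.startsWith(prefix)) {
        this.store.delete(key);
        evicted += 1;
      }
    }
    return evicted;
  }
}
```

In this sketch, stale entries become unreachable as soon as the deploy ID changes, even if eviction is slow or fails, so a backend deployment does not depend on a separate manual purge.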

Outcome / Next Steps

Caching mechanisms need tighter integration with deployment workflows to avoid serving stale content.
Partial or intermittent failures are harder to detect and require better instrumentation.
Proactive cache management and user-friendly fallback experiences are key to resilience.
A new and more robust JS framework is needed to avoid this scenario in the first place.
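
The point about intermittent failures needing better instrumentation could be sketched as a sliding-window error-rate check. This is an illustrative TypeScript example only: the class ErrorRateMonitor, the window size, and the threshold are assumptions chosen for the sketch, not values used by the team.

```typescript
// Hypothetical sketch: track the outcome of the most recent requests and
// flag when the failure rate in that window reaches a threshold.
class ErrorRateMonitor {
  private readonly outcomes: boolean[] = []; // true means the request failed
  private readonly windowSize: number;
  private readonly threshold: number;

  constructor(windowSize: number, threshold: number) {
    this.windowSize = windowSize;
    this.threshold = threshold;
  }

  // Record one request outcome, dropping the oldest once the window is full.
  record(failed: boolean): void {
    this.outcomes.push(failed);
    if (this.outcomes.length > this.windowSize) {
      this.outcomes.shift();
    }
  }

  // Fraction of failed requests in the current window (0 when empty).
  errorRate(): number {
    if (this.outcomes.length === 0) {
      return 0;
    }
    const failures = this.outcomes.filter((failed) => failed).length;
    return failures / this.outcomes.length;
  }

  // Only alert on a full window, to avoid firing on the first few requests.
  shouldAlert(): boolean {
    return (
      this.outcomes.length === this.windowSize &&
      this.errorRate() >= this.threshold
    );
  }
}
```

A rate-based check like this can surface a failure that affects only a fraction of links, which a simple up/down health check would likely miss.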

The ChiliCal incident on April 30, 2025, was resolved with no lasting impact, but highlighted important areas for improvement in cache handling and monitoring. Work is underway to strengthen these areas and reduce the likelihood of recurrence.

Posted May 01, 2025 - 15:57 UTC

Resolved

This incident has been resolved.

Posted Apr 30, 2025 - 22:10 UTC

Monitoring

The underlying issue appears to have been caused by caching, which naturally took time to clear. Our team has forced the instances to clear their caches, and service should now be recovered. Local cache may still be a factor in displaying these pages, so please try clearing your browser cache if you still see the issue.

Our team is monitoring from this point forward to ensure it's fully resolved.

Posted Apr 30, 2025 - 13:08 UTC

Investigating

Some users are experiencing issues when accessing ChiliCal scheduling links. In certain cases, the links fail to load and instead display a "Something Went Wrong" error screen. This behavior is not affecting all links, but the issue appears to occur intermittently across multiple users.

Our team is actively investigating the root cause of the issue and working to restore normal functionality as soon as possible. Updates will be provided as more information becomes available.

Posted Apr 30, 2025 - 09:45 UTC

This incident affected: Demand Conversion Platform (Fire) (ChiliCal, ChiliCal Scheduler (Handoff)).