Incident Overview
Some users experienced intermittent failures when opening ChiliCal scheduling links, while others lost access to every booking link in their organization. Affected users saw a “Something Went Wrong” error screen. The issue did not affect all links; failures occurred unpredictably across users and disrupted scheduling and handoff workflows.
After resolution, we found that service-level caching had further affected some links. It delayed the rollout of the fix to all users and made the impact last longer than expected.
Timeline
- 09:00 - 09:30 UTC – Incident reported. Users began experiencing intermittent issues with ChiliCal scheduling links.
- 09:45 UTC – Engineering team began investigation and confirmed the issue was reproducible.
- 10:00 UTC – Initial fixes for the root cause were deployed as a hotfix.
- 10:30 UTC – A suspected caching-related issue was identified as prolonging the incident.
- 12:45 UTC – Team forced cache clear across relevant instances to expedite recovery.
- 13:08 UTC – Systems showed recovery. Monitoring phase initiated.
- Post-Incident – Recommended browser cache clear for users still seeing the issue. Ongoing monitoring confirmed full resolution.
Root Cause
On April 28th we found that our JS framework (Next.js) did not produce deterministic application builds, and we deployed a fix for this in response to a separate incident. Although fixes were in place for the framework issue, this incident was also affected by stale or inconsistent application-level caching, which broke the rendering of certain ChiliCal scheduling links. The cache was not invalidated properly after recent backend changes, so some users were served outdated or broken state.
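One common way to keep a cache from serving entries written by an older deployment is to scope every cache key to a build identifier. The sketch below illustrates that idea only; it is not the ChiliCal implementation. The names `Store`, `scopedKey`, `readCached`, `writeCached`, and `purgeOtherBuilds` are hypothetical, the in-memory `Map` stands in for whatever cache the services actually use, and the build ID is assumed to be a unique per-deployment value (for example a commit SHA) that contains no `:` character.

```typescript
// Hypothetical sketch: an in-memory Map stands in for the real cache layer.
type Store = Map<string, string>;

// Prefix every key with the build ID so entries from different builds
// can never collide. Assumes build IDs do not contain ":".
function scopedKey(buildId: string, key: string): string {
  return `${buildId}:${key}`;
}

function readCached(store: Store, buildId: string, key: string): string | undefined {
  return store.get(scopedKey(buildId, key));
}

function writeCached(store: Store, buildId: string, key: string, value: string): void {
  store.set(scopedKey(buildId, key), value);
}

// Optional cleanup after a deployment: remove entries written by any
// build other than the current one and report how many were dropped.
function purgeOtherBuilds(store: Store, currentBuildId: string): number {
  const prefix = `${currentBuildId}:`;
  let removed = 0;
  for (const key of Array.from(store.keys())) {
    if (!key.startsWith(prefix)) {
      store.delete(key);
      removed++;
    }
  }
  return removed;
}
```

With this scheme, a value written under one build ID is simply a cache miss for the next build, so a new deployment starts from a clean view even before any purge runs. The purge step then only reclaims space rather than acting as the safety mechanism.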
Resolution
The team manually forced a cache clear on affected services. This action immediately restored functionality for most users. Those still affected were likely experiencing residual issues from their local browser cache, which were resolved by clearing it manually.
Preventative Actions
Completed:
- Manual cache purge on affected infrastructure.
- Reimplemented the previous stop-gap measure using a different technique that does not rely on framework-specific functions.
- Communication to Customer Love and affected stakeholders with recommended local steps.
Planned:
- Implement automated cache invalidation tied to relevant backend deployments.
- Improve detection and alerting for cache-related anomalies (e.g., error rate spikes).
- Migrate from our current JS framework to an alternative with broader support and flexibility.
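As an illustration of the planned error-rate alerting, the sketch below compares the error rate in a recent window against a longer baseline window. It is a minimal example under stated assumptions, not a description of our monitoring stack: the names `WindowCounts`, `SpikeOptions`, `errorRate`, and `isErrorSpike` are hypothetical, and the default thresholds are placeholder values that would need tuning against real traffic.

```typescript
// Request and error counts observed over some time window.
interface WindowCounts {
  total: number;
  errors: number;
}

// Hypothetical tuning knobs; the defaults below are placeholders.
interface SpikeOptions {
  minRequests: number; // ignore windows with too little traffic
  minRate: number;     // absolute error-rate floor before alerting
  ratio: number;       // required multiple of the baseline error rate
}

const defaultOptions: SpikeOptions = { minRequests: 50, minRate: 0.02, ratio: 3 };

function errorRate(w: WindowCounts): number {
  return w.total === 0 ? 0 : w.errors / w.total;
}

// Flags a spike when the current window has enough traffic, its error
// rate exceeds the absolute floor, and it is at least `ratio` times the
// baseline rate. A partial outage that hits only some links can still
// trip this check even though most requests succeed.
function isErrorSpike(
  current: WindowCounts,
  baseline: WindowCounts,
  options: SpikeOptions = defaultOptions
): boolean {
  if (current.total < options.minRequests) {
    return false;
  }
  const currentRate = errorRate(current);
  const baselineRate = errorRate(baseline);
  return currentRate >= options.minRate && currentRate >= baselineRate * options.ratio;
}
```

Requiring both an absolute floor and a relative increase is a deliberate choice here: the floor avoids alerts when a near-zero baseline doubles, and the ratio avoids alerts on services whose normal error rate already sits above the floor.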
Outcome / Next Steps
- Caching mechanisms need tighter integration with deployment workflows to avoid serving stale content.
- Partial or intermittent failures are harder to detect and require better instrumentation.
- Proactive cache management and user-friendly fallback experiences are key to resilience.
- A new and more robust JS framework is needed to avoid this scenario in the first place.
The ChiliCal incident on April 29, 2025, was resolved with no lasting impact, but highlighted important areas for improvement in cache handling and monitoring. Work is underway to strengthen these areas and reduce the likelihood of recurrence.