Summary

On April 7th, 2026, a data sync job consumed an excessive amount of memory on one of our servers, exhausting the server's resources and causing two separate but related incidents.

The first incident caused our Redis instances to become unavailable, which blocked critical flows including sign-in and notifications for approximately 49 minutes.

The second incident began after the first was resolved, and was caused by a load balancer controller pushing a bad update to our ingress. This controller was running on the same server whose resources had been exhausted; when the server recovered, the controller acted on stale data and misconfigured the ingress targets for our load balancer. This made Opal's UI and API unavailable for approximately 25 minutes.
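One common safeguard against this class of failure is for a controller to check its snapshot against the source of truth before applying it, and to re-list instead of acting on a stale view. The sketch below illustrates that idea only; the names (Snapshot, apply_targets, generation numbers) are hypothetical and not drawn from our actual controller:

```python
# Hypothetical guard against applying stale state after a node recovers.
# A controller compares its cached snapshot's generation against the
# current generation from the source of truth before acting on it.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Snapshot:
    generation: int                      # version observed when cached
    targets: list = field(default_factory=list)


def apply_targets(snapshot: Snapshot, current_generation: int) -> Optional[list]:
    """Return the targets to apply, or None if the snapshot is stale
    and the controller should re-list from the source of truth."""
    if snapshot.generation < current_generation:
        return None
    return snapshot.targets


# After recovery the controller still holds generation 41, but the
# source of truth has moved on to 42: the stale snapshot is rejected.
stale = apply_targets(Snapshot(41, ["10.0.0.5"]), current_generation=42)
fresh = apply_targets(Snapshot(42, ["10.0.0.7"]), current_generation=42)
```

Under this pattern, a controller that was paused while its node was resource-starved would discard its pre-outage view rather than push it to the load balancer.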

Impact

Severity: Sev 1

This incident is classified as Sev 1 due to a period of complete unavailability of Opal's UI and APIs, affecting all cloud customers.

Root cause analysis

A data sync job ran into a memory inefficiency in our code: when processing a large and deeply nested set of resources, the sync job cached duplicate data for each resource, so memory usage grew disproportionately with the size of the dataset. Combined with an excessively high memory limit on the pod, this allowed the job to quickly consume all available memory on the node it was running on.
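The growth pattern can be sketched in miniature. Assuming the problem was per-resource caching of each resource's entire subtree (the class and function names below are illustrative, not our actual sync code), a chain of nested resources ends up cached once per ancestor, so cache entries grow quadratically rather than linearly:

```python
# Minimal sketch of duplicate subtree caching vs. a deduplicated cache.
# Names are hypothetical; this is not Opal's actual sync code.

class Resource:
    def __init__(self, rid, children=()):
        self.rid = rid
        self.children = list(children)


def sync_duplicating(resource, cache=None):
    """Buggy pattern: every resource caches a copy of its whole subtree,
    so each nested resource is stored once per ancestor."""
    if cache is None:
        cache = []
    subtree, stack = [], [resource]
    while stack:
        node = stack.pop()
        subtree.append(node.rid)        # duplicate copy per ancestor
        stack.extend(node.children)
    cache.append(subtree)
    for child in resource.children:
        sync_duplicating(child, cache)
    return cache


def sync_deduplicated(resource):
    """Fixed pattern: one cache keyed by resource ID stores each
    resource exactly once, so memory grows linearly."""
    cache, stack = {}, [resource]
    while stack:
        node = stack.pop()
        cache[node.rid] = node
        stack.extend(node.children)
    return cache


# A linear chain of 100 nested resources:
root = Resource(0)
node = root
for i in range(1, 100):
    child = Resource(i)
    node.children.append(child)
    node = child

duplicated = sum(len(s) for s in sync_duplicating(root))  # 100+99+...+1 = 5050
shared = len(sync_deduplicated(root))                     # 100
```

With 100 resources the duplicating pattern holds 5,050 cached entries against 100 for the shared cache, which is the "disproportionate" growth described above.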

This node was also hosting several other critical components, including our Redis instances and a load balancer controller. Once the server's memory was exhausted:

- Our Redis instances became unavailable, blocking critical flows including sign-in and notifications (the first incident).
- The load balancer controller was starved of resources; when the node recovered, it acted on stale data and misconfigured the ingress targets for our load balancer (the second incident).

Actions taken

For the first incident: