Summary

On April 7th, 2026, a data sync job consumed an excessive amount of memory on one of our servers, exhausting the server's resources and causing two separate but related incidents.

The first incident caused our Redis instances to become unavailable, which blocked critical flows including sign-in and notifications for approximately 49 minutes.

The second incident began after the first was resolved, and was caused by a load balancer controller pushing a bad update to our ingress. This controller was running on the same server whose resources had been exhausted; when the server recovered, the controller acted on stale data and misconfigured the ingress targets for our load balancer. This made Opal's UI and API unavailable for approximately 25 minutes.
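One common safeguard against this class of failure is for a controller to check its snapshot against the source of truth before applying it, and to re-list instead of acting on a stale view. The sketch below illustrates that idea only; the names (Snapshot, apply_targets, generation numbers) are hypothetical and not drawn from our actual controller:

```python
# Hypothetical guard against applying stale state after a node recovers.
# A controller compares its cached snapshot's generation against the
# current generation from the source of truth before acting on it.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Snapshot:
    generation: int                      # version observed when cached
    targets: list = field(default_factory=list)


def apply_targets(snapshot: Snapshot, current_generation: int) -> Optional[list]:
    """Return the targets to apply, or None if the snapshot is stale
    and the controller should re-list from the source of truth."""
    if snapshot.generation < current_generation:
        return None
    return snapshot.targets


# After recovery the controller still holds generation 41, but the
# source of truth has moved on to 42: the stale snapshot is rejected.
stale = apply_targets(Snapshot(41, ["10.0.0.5"]), current_generation=42)
fresh = apply_targets(Snapshot(42, ["10.0.0.7"]), current_generation=42)
```

Under this pattern, a controller that was paused while its node was resource-starved would discard its pre-outage view rather than push it to the load balancer.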

Impact

Severity: Sev 1

This incident is classified as Sev 1 due to a period of complete unavailability of Opal's UI and APIs, affecting all cloud customers.

Root cause analysis

A data sync job ran into a memory inefficiency in our code: when processing a large and deeply nested set of resources, the sync job cached duplicate data for each resource, so memory usage grew disproportionately with the size of the dataset. Combined with an excessively high memory limit on the pod, this allowed the job to quickly consume all available memory on the node it was running on.
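The growth pattern can be sketched in miniature. Assuming the problem was per-resource caching of each resource's entire subtree (the class and function names below are illustrative, not our actual sync code), a chain of nested resources ends up cached once per ancestor, so cache entries grow quadratically rather than linearly:

```python
# Minimal sketch of duplicate subtree caching vs. a deduplicated cache.
# Names are hypothetical; this is not Opal's actual sync code.

class Resource:
    def __init__(self, rid, children=()):
        self.rid = rid
        self.children = list(children)


def sync_duplicating(resource, cache=None):
    """Buggy pattern: every resource caches a copy of its whole subtree,
    so each nested resource is stored once per ancestor."""
    if cache is None:
        cache = []
    subtree, stack = [], [resource]
    while stack:
        node = stack.pop()
        subtree.append(node.rid)        # duplicate copy per ancestor
        stack.extend(node.children)
    cache.append(subtree)
    for child in resource.children:
        sync_duplicating(child, cache)
    return cache


def sync_deduplicated(resource):
    """Fixed pattern: one cache keyed by resource ID stores each
    resource exactly once, so memory grows linearly."""
    cache, stack = {}, [resource]
    while stack:
        node = stack.pop()
        cache[node.rid] = node
        stack.extend(node.children)
    return cache


# A linear chain of 100 nested resources:
root = Resource(0)
node = root
for i in range(1, 100):
    child = Resource(i)
    node.children.append(child)
    node = child

duplicated = sum(len(s) for s in sync_duplicating(root))  # 100+99+...+1 = 5050
shared = len(sync_deduplicated(root))                     # 100
```

With 100 resources the duplicating pattern holds 5,050 cached entries against 100 for the shared cache, which is the "disproportionate" growth described above.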

This node was also hosting several other critical components, including our Redis instances and a load balancer controller. Once the server's memory was exhausted:

- Our Redis instances became unavailable, blocking critical flows including sign-in and notifications (the first incident).
- The load balancer controller was starved of resources; when the node recovered, it acted on stale data and misconfigured the ingress targets for our load balancer (the second incident).

Actions taken

For the first incident: