On February 3rd 2026, an asynchronous job that deletes an offboarded customer’s data from our cloud environment caused multiple locking issues in our database. This created several performance problems, which manifested most visibly to customers as increased latency and delayed notifications.
Opal has an asynchronous job, run on an ad-hoc basis, that deletes an offboarded customer’s data. The final step in this job deletes the row for the organization itself, which cascades deletes to other tables. An issue in our offboarding process meant that:

- REST API requests were still being accepted from this customer, and each request writes to the table that stores our API tokens.
- Async tasks were still being kicked off on behalf of this customer.
Both of these cases involved writes to tables with rows locked by the ongoing delete operation, so blocked queries began to stack up. This had no direct impact on queries for other customers: the locks covered only data tied to this customer. However, the buildup of blocked queries caused a steady increase in database load and memory usage. In particular, each REST API request triggered a write to the table that stores our API tokens, resulting in a high volume of stuck queries. By ~2pm PST, the increasing load began to noticeably affect database CPU usage and overall latency.
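The mechanism above can be sketched in miniature: one long-running transaction holds a lock while incoming writes for the same customer queue up behind it. This is a minimal, hypothetical simulation using Python threads; the names (`org_lock`, `delete_org_data`, `write_api_token`) are illustrative and not Opal's actual code, and a single `threading.Lock` stands in for the database row locks held by the cascading delete.

```python
import threading
import time

org_lock = threading.Lock()  # stands in for the row locks held by the delete
pending_writes = []          # writes "stuck" waiting on the locked rows
completed_writes = []

def delete_org_data(duration):
    # The offboarding delete: holds the lock for its entire runtime,
    # just as the cascading delete holds row locks until it commits.
    with org_lock:
        time.sleep(duration)

def write_api_token(i):
    # An API request's token write: blocks until the delete releases the lock.
    pending_writes.append(i)
    with org_lock:
        pending_writes.remove(i)
        completed_writes.append(i)

deleter = threading.Thread(target=delete_org_data, args=(0.5,))
deleter.start()
time.sleep(0.1)  # let the delete acquire the lock first

writers = [threading.Thread(target=write_api_token, args=(i,)) for i in range(5)]
for w in writers:
    w.start()

time.sleep(0.1)
backlog_while_locked = len(pending_writes)  # all 5 writes stacked up behind the delete

deleter.join()
for w in writers:
    w.join()
```

Each blocked writer consumes a connection and memory for the duration of the delete, which is how a single slow transaction translates into cluster-wide load.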
The delay in delivering notifications had the same underlying cause (rows locked by the offboarding job) but was more directly a failure of our task system. The async tasks kicked off for this offboarded customer were stuck waiting on locked queries, occupying all of the available workers for this queue, and a bug in our task-worker autoscaling logic meant that we never automatically scaled up the number of workers in this pool to handle the growing backlog of tasks.
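The worker-pool starvation can be sketched the same way: a fixed-size pool whose workers all pick up tasks that block on the held lock, starving every task queued behind them. This is a hypothetical simulation, not Opal's task system; the pool size, task names, and the lock are all illustrative, and the pool never growing mirrors the autoscaling bug.

```python
import queue
import threading
import time

POOL_SIZE = 3
row_lock = threading.Lock()   # the lock held by the offboarding delete
task_queue = queue.Queue()
processed = []

def stuck_task(task_id):
    # A task for the offboarded customer: blocks on the locked rows.
    with row_lock:
        processed.append(task_id)

def notify_task(task_id):
    # A healthy notification task that would complete instantly if run.
    processed.append(task_id)

def worker():
    while True:
        fn, task_id = task_queue.get()
        fn(task_id)
        task_queue.task_done()

row_lock.acquire()  # the delete is holding the lock for the whole demo

# Enough stuck tasks to occupy every worker, then healthy notifications.
for i in range(POOL_SIZE):
    task_queue.put((stuck_task, f"stuck-{i}"))
for i in range(4):
    task_queue.put((notify_task, f"notify-{i}"))

for _ in range(POOL_SIZE):
    threading.Thread(target=worker, daemon=True).start()

time.sleep(0.3)
starved = task_queue.qsize()  # notifications still waiting: the pool never grew

row_lock.release()  # the delete finishing (or being killed) unblocks everything
task_queue.join()
```

With a working autoscaler, the backlog metric would have triggered additional workers; here, as in the incident, the pool stays at its fixed size and the notification tasks simply wait.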