On February 3rd 2026, an asynchronous job that deletes an offboarded customer’s data from our cloud environment caused multiple locking issues in our database. This created several performance problems, which manifested most visibly to customers as increased latency and delayed notifications.
Opal has an asynchronous job, run on an ad-hoc basis, that deletes an offboarded customer’s data. The final step in this job deletes the row for the organization itself, which cascades deletes to other tables. An issue in our offboarding process meant that:

- REST API requests were still being accepted from this customer, and each request writes to the table that stores our API tokens.
- Async tasks were still being kicked off on behalf of this customer.
Both of these cases involved writes to tables with rows locked by the ongoing delete operation, so blocked queries began to stack up. This had no direct impact on queries for other customers: the locks covered only data tied to this customer. However, the buildup of blocked queries caused a steady increase in database load and memory usage. In particular, each REST API request triggered a write to the table that stores our API tokens, resulting in a high volume of stuck queries. By ~2pm PST, the increasing load began to noticeably affect database CPU usage and overall latency.
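The mechanism above can be sketched in miniature: one long-running transaction holds a lock while incoming writes for the same customer queue up behind it. This is a minimal, hypothetical simulation using Python threads; the names (`org_lock`, `delete_org_data`, `write_api_token`) are illustrative and not Opal's actual code, and a single `threading.Lock` stands in for the database row locks held by the cascading delete.

```python
import threading
import time

org_lock = threading.Lock()  # stands in for the row locks held by the delete
pending_writes = []          # writes "stuck" waiting on the locked rows
completed_writes = []

def delete_org_data(duration):
    # The offboarding delete: holds the lock for its entire runtime,
    # just as the cascading delete holds row locks until it commits.
    with org_lock:
        time.sleep(duration)

def write_api_token(i):
    # An API request's token write: blocks until the delete releases the lock.
    pending_writes.append(i)
    with org_lock:
        pending_writes.remove(i)
        completed_writes.append(i)

deleter = threading.Thread(target=delete_org_data, args=(0.5,))
deleter.start()
time.sleep(0.1)  # let the delete acquire the lock first

writers = [threading.Thread(target=write_api_token, args=(i,)) for i in range(5)]
for w in writers:
    w.start()

time.sleep(0.1)
backlog_while_locked = len(pending_writes)  # all 5 writes stacked up behind the delete

deleter.join()
for w in writers:
    w.join()
```

Each blocked writer consumes a connection and memory for the duration of the delete, which is how a single slow transaction translates into cluster-wide load.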
The delay in delivering notifications had the same underlying cause (rows locked by the offboarding job) but was more directly a failure of our task system. The async tasks kicked off for this offboarded customer were stuck waiting on locked queries, occupying all of the available workers for this queue, and a bug in our task-worker autoscaling logic meant that we never automatically scaled up the number of workers in this pool to handle the growing backlog of tasks.
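The worker-pool starvation can be sketched the same way: a fixed-size pool whose workers all pick up tasks that block on the held lock, starving every task queued behind them. This is a hypothetical simulation, not Opal's task system; the pool size, task names, and the lock are all illustrative, and the pool never growing mirrors the autoscaling bug.

```python
import queue
import threading
import time

POOL_SIZE = 3
row_lock = threading.Lock()   # the lock held by the offboarding delete
task_queue = queue.Queue()
processed = []

def stuck_task(task_id):
    # A task for the offboarded customer: blocks on the locked rows.
    with row_lock:
        processed.append(task_id)

def notify_task(task_id):
    # A healthy notification task that would complete instantly if run.
    processed.append(task_id)

def worker():
    while True:
        fn, task_id = task_queue.get()
        fn(task_id)
        task_queue.task_done()

row_lock.acquire()  # the delete is holding the lock for the whole demo

# Enough stuck tasks to occupy every worker, then healthy notifications.
for i in range(POOL_SIZE):
    task_queue.put((stuck_task, f"stuck-{i}"))
for i in range(4):
    task_queue.put((notify_task, f"notify-{i}"))

for _ in range(POOL_SIZE):
    threading.Thread(target=worker, daemon=True).start()

time.sleep(0.3)
starved = task_queue.qsize()  # notifications still waiting: the pool never grew

row_lock.release()  # the delete finishing (or being killed) unblocks everything
task_queue.join()
```

With a working autoscaler, the backlog metric would have triggered additional workers; here, as in the incident, the pool stays at its fixed size and the notification tasks simply wait.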