[EXTERNAL] 2025-06-09 Delays in processing asynchronous workloads

Summary

From 12:24 PM ET to 12:34 PM ET on June 9, 2025, Opal experienced a 10-minute delay in processing asynchronous workloads, primarily access grants and revocations. The delay was caused by a deploy failure that temporarily disabled Opal’s task worker.

Impact

Opal cloud customers experienced a delay of up to 10 minutes in the processing of asynchronous tasks. After 10 minutes, the failed deploy timed out and task processing resumed.

Chart showing the successful tasks processed over time by Opal’s asynchronous task worker during the impacted time

Severity: SEV 2

This incident was classified as SEV 2 in accordance with our severity guidelines, as it represented a significant performance degradation of a core platform feature affecting multiple customers.

Root Cause Analysis

The deploy that triggered the incident failed because a migration timed out, and it was rolled back after 10 minutes. The deployment strategy of Opal’s asynchronous task worker guarantees that at most one instance of the pod is available at any time. The strategy was implemented this way to avoid a race condition on task dequeue, and it incorrectly relied on the general case, in which pod availability is restored within seconds during a deployment.

The new deployment included a blocking check on initialization that depended on the migration having completed. Because the migration did not complete, the deployment remained stuck in this initialization check. This was the root cause of the resulting 10 minutes of unavailability, during which no task worker was processing tasks.
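The interaction described above can be illustrated with a minimal simulation. This is only a sketch under stated assumptions, not Opal's implementation: the function name `simulate_rollout`, the `RolloutResult` type, and the one-second polling step are hypothetical, and the sketch assumes the old pod is stopped before the new pod finishes initializing, which is how it models the single-worker guarantee. The 600-second timeout corresponds to the 10 minutes reported above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RolloutResult:
    new_pod_started: bool        # False means the init check timed out (rollback)
    seconds_without_worker: int  # time during which no worker processed tasks


def simulate_rollout(migration_seconds: Optional[int],
                     init_timeout_seconds: int) -> RolloutResult:
    """Simulate replacing the single task worker pod.

    migration_seconds: seconds until the migration completes, or None if it
    never completes. Assumption: the old pod is already stopped, so every
    second spent in the blocking init check is a second with no worker.
    """
    elapsed = 0
    while elapsed < init_timeout_seconds:
        if migration_seconds is not None and elapsed >= migration_seconds:
            # Init check passes and the new pod starts processing tasks.
            return RolloutResult(True, elapsed)
        elapsed += 1  # new pod is still blocked in its initialization check
    # Init check timed out; a rollback restores the previous pod.
    return RolloutResult(False, elapsed)


# General case: the migration finishes quickly, so the gap is a few seconds.
print(simulate_rollout(migration_seconds=5, init_timeout_seconds=600))
# Incident case: the migration never completes, so the gap equals the timeout.
print(simulate_rollout(migration_seconds=None, init_timeout_seconds=600))
```

In this model the length of the outage is bounded only by the initialization timeout, which is why a deployment that normally causes seconds of unavailability caused 10 minutes here.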

Timeline

12:14 PM ET: Deploy that triggered the incident starts
12:24 PM ET: Incident begins as migration begins on one of Opal’s pods
12:34 PM ET: Migration times out after 10 minutes
12:34 PM ET: Incident ends as scheduled task worker pod is rolled back

Next Steps

Define and publish SLAs for access grant/revocation processing times.
Add alerting on the number of successful tasks processed by Opal’s scheduled task worker.
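As a rough illustration of the alerting item above, the following sketch flags a window in which too few tasks succeeded. It assumes a per-minute count of successful tasks is available. The function name `should_alert`, the 5-minute window, the threshold of 1, and the sample counts are all hypothetical placeholders, not values from Opal's monitoring.

```python
from typing import Sequence


def should_alert(successes_per_minute: Sequence[int],
                 window_minutes: int = 5,
                 min_successes: int = 1) -> bool:
    """Return True when the most recent window has too few successful tasks.

    successes_per_minute: counts of successfully processed tasks, oldest first.
    The window and threshold defaults are illustrative only.
    """
    if len(successes_per_minute) < window_minutes:
        return False  # not enough data to evaluate a full window
    recent = successes_per_minute[-window_minutes:]
    return sum(recent) < min_successes


# Made-up sample data: processing stops for the last five minutes.
print(should_alert([40, 38, 0, 0, 0, 0, 0]))
# Made-up sample data: processing continues normally.
print(should_alert([40, 38, 41, 39, 42, 40, 37]))
```

A check of this shape would fire several minutes into an outage like this one rather than after the full 10-minute timeout, depending on the window chosen.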