Summary

On October 1st, 2025, a change deployed to our cloud environment increased the latency of the authorizer. This caused increased latency across the Opal product, including the Slack integration.

Impact

Severity: Sev 2

Root cause analysis

The Opal team has been working on backend data model migrations to increase stability and performance. The changes merged on September 30th completed the migration but introduced unintended latency spikes in our authorization service.

What happened

We migrated from legacy authorization libraries that used 4 separate queries across different tables (resource-users, group-users, group-groups, group-resources) to a single recursive query against a consolidated role_assignments table.
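As a rough illustration of the new query shape, the sketch below runs a recursive CTE against an in-memory SQLite table. The column names (principal, entity), the sample rows, the depth cap of 10, and the access_depth helper are all assumptions made for this example. They are not the production schema or query.

```python
import sqlite3

# Hypothetical schema: each row grants `principal` (a user or group)
# access to `entity` (a group or resource). Column names are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE role_assignments (principal TEXT, entity TEXT)")
conn.executemany(
    "INSERT INTO role_assignments VALUES (?, ?)",
    [
        ("alice", "wiki"),        # direct grant (depth 1)
        ("bob", "eng"),           # bob belongs to group eng
        ("eng", "platform"),      # eng is nested inside group platform
        ("platform", "prod-db"),  # platform can access prod-db
    ],
)

# Walk outward from the user, one hop per recursion step.
# The depth cap (an assumption) guards against cyclic assignments.
ACCESS_DEPTH_SQL = """
WITH RECURSIVE reachable(entity, depth) AS (
    SELECT entity, 1 FROM role_assignments WHERE principal = :user
    UNION
    SELECT ra.entity, r.depth + 1
    FROM role_assignments ra
    JOIN reachable r ON ra.principal = r.entity
    WHERE r.depth < 10
)
SELECT MIN(depth) FROM reachable WHERE entity = :resource
"""


def access_depth(user, resource):
    """Return the shortest grant depth, or None if there is no access."""
    row = conn.execute(
        ACCESS_DEPTH_SQL, {"user": user, "resource": resource}
    ).fetchone()
    return row[0]


print(access_depth("alice", "wiki"))     # 1 (direct grant)
print(access_depth("bob", "prod-db"))    # 3 (user -> group -> group -> resource)
print(access_depth("alice", "prod-db"))  # None (no access)
```

Note that the final filter on the resource is applied only after the recursion has expanded everything reachable from the user, so a direct grant still pays for the full traversal.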

The root causes of the latency were:

  1. Query Performance: The new recursive CTE performs well for deep access hierarchies (depth > 1) but poorly for direct access grants (depth = 1), which make up the majority of our authorization checks. Unlike the old system, which filtered efficiently in memory, the recursive query must exhaustively search for downstream access even when none exists.
  2. Loss of Precomputed Data: The legacy group_groups closure table provided O(1) lookups. We chose not to build an equivalent closure table for role_assignments because of its scale, and so lost this optimization.
  3. Insufficient Scale Testing: We tested the change extensively before release, but underestimated production conditions, particularly the distribution of shallow vs. deep access patterns and the concurrent query load.
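To make the trade-off in item 2 concrete, here is a minimal sketch of a closure table kept as an in-memory set. The group names and the build_closure and is_nested helpers are hypothetical and do not reflect the legacy group_groups implementation. The sketch only shows why a lookup is a single membership test, and why the stored pairs outnumber the direct edges.

```python
def build_closure(edges):
    """Return every (group, ancestor) pair implied by direct nesting edges."""
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)

    closure = set()
    for start in parents:
        stack = list(parents[start])
        while stack:
            node = stack.pop()
            if (start, node) in closure:
                continue  # already recorded; also stops cycles
            closure.add((start, node))
            stack.extend(parents.get(node, ()))
    return closure


# Hypothetical direct nesting edges: (child group, parent group).
edges = [("eng", "platform"), ("platform", "infra"), ("sre", "infra")]
closure = build_closure(edges)


def is_nested(child, ancestor):
    """Single hash lookup; no traversal at query time."""
    return (child, ancestor) in closure


print(len(edges), len(closure))    # 3 4
print(is_nested("eng", "infra"))   # True (via platform)
print(is_nested("infra", "eng"))   # False
```

Three direct edges already produce four stored pairs here. The precomputed set grows with the number of reachable pairs rather than the number of direct assignments, which is the kind of growth that makes this approach costly for a large consolidated table.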

Resolution

The Opal team identified the bottlenecks through query profiling and deployed optimizations that restored acceptable latency levels.

Actions taken

Timeline