Summary
On October 1st, 2025, a change was deployed to our cloud environment that increased the latency of the authorizer. This led to increased latency across the Opal product, including the Slack integration.
Impact
- Increased latency across multiple parts of the Opal product
- Because the Slack integration enforces a strict 3-second timeout, some Slack requests timed out
Severity: Sev 2
Root cause analysis
The Opal team has been working on backend data model migrations to improve stability and performance. The changes merged on September 30th completed the migration but introduced unintended latency spikes in our authorization service.
What Happened
We migrated from legacy authorization libraries that issued four separate queries across different tables (resource-users, group-users, group-groups, group-resources) to a single recursive query against a consolidated role_assignments table.
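For illustration only, here is a minimal sketch of the shape of the consolidated lookup, assuming hypothetical column names (principal_id, target_id, target_type) and SQLite semantics; the actual schema and query differ.

```python
import sqlite3

# Hypothetical schema: role_assignments(principal_id, target_id, target_type),
# where each row is a user->group, group->group, user->resource, or
# group->resource edge. Column names are illustrative, not the real schema.
RECURSIVE_ACCESS_QUERY = """
WITH RECURSIVE access AS (
    -- Base case: assignments granted directly to the user.
    SELECT target_id, target_type
    FROM role_assignments
    WHERE principal_id = :user_id

    UNION

    -- Recursive case: follow group memberships to nested grants.
    -- UNION (not UNION ALL) deduplicates rows, so cycles terminate.
    SELECT ra.target_id, ra.target_type
    FROM role_assignments ra
    JOIN access a ON ra.principal_id = a.target_id
                 AND a.target_type = 'group'
)
SELECT 1 FROM access
WHERE target_id = :resource_id AND target_type = 'resource'
LIMIT 1
"""

def user_can_access(conn: sqlite3.Connection, user_id: str, resource_id: str) -> bool:
    """Return True if user_id reaches resource_id through any grant path."""
    row = conn.execute(
        RECURSIVE_ACCESS_QUERY,
        {"user_id": user_id, "resource_id": resource_id},
    ).fetchone()
    return row is not None
```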
The root causes of the latency were:
- Query Performance: The new recursive CTE performs well for deep access hierarchies (depth > 1) but poorly for direct access grants (depth = 1), which make up the majority of our authorization checks. Unlike the old system, which filtered efficiently in memory, the recursive query must exhaustively check for downstream access even when none exists (see the fast-path sketch after this list).
- Loss of Precomputed Data: The legacy group_groups closure table provided O(1) lookups. We chose not to implement an equivalent for role_assignments due to scale concerns, which cost us this optimization.
- Insufficient Scale Testing: We tested the migration extensively but underestimated production conditions, particularly the distribution of shallow vs. deep access patterns and concurrent query load.
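The depth = 1 pathology suggests one class of mitigation, sketched below under the same hypothetical schema as above: answer direct grants with a cheap point lookup and fall back to the recursive walk only when no direct grant exists. This is illustrative, and the optimizations actually shipped may differ.

```python
import sqlite3

DIRECT_GRANT_QUERY = """
SELECT 1 FROM role_assignments
WHERE principal_id = :user_id
  AND target_id = :resource_id
  AND target_type = 'resource'
LIMIT 1
"""

def user_can_access_fast_path(conn: sqlite3.Connection, user_id: str, resource_id: str) -> bool:
    # Depth = 1: direct grants cover the majority of checks, so try a
    # point lookup first (fast if (principal_id, target_id) is indexed).
    params = {"user_id": user_id, "resource_id": resource_id}
    if conn.execute(DIRECT_GRANT_QUERY, params).fetchone() is not None:
        return True
    # Depth > 1: pay for the recursive traversal only when needed.
    # Reuses user_can_access from the earlier sketch.
    return user_can_access(conn, user_id, resource_id)
```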
Resolution
The Opal team identified the bottlenecks through query profiling and deployed optimizations that restored latency to acceptable levels.
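For a sense of what that profiling looks like (a sketch in SQLite terms; the production investigation used our database's own tooling), EXPLAIN QUERY PLAN shows whether the recursive CTE is doing full scans or indexed searches:

```python
import sqlite3

def profile_access_query(conn: sqlite3.Connection, user_id: str, resource_id: str) -> None:
    # A SCAN (rather than SEARCH ... USING INDEX) inside the CTE points at
    # a missing index, e.g. on role_assignments(principal_id).
    plan = conn.execute(
        "EXPLAIN QUERY PLAN " + RECURSIVE_ACCESS_QUERY,
        {"user_id": user_id, "resource_id": resource_id},
    ).fetchall()
    for _, _, _, detail in plan:
        print(detail)
```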
Actions taken
- Forward-fixed by merging several PRs that improved authorizer performance
Timeline