Between 15:00 and 17:00 CET, a portion of customers experienced an authentication issue. There was a problem with the SAML endpoint's certificate, and affected users were unable to obtain a new access token.
What went wrong and why?
The automatic certificate renewal jobs were not configured correctly, so the SAML endpoint's certificate was not renewed as expected.
How did we respond?
The error was corrected by manual intervention from the infrastructure team.
15:20 CET on 17-10-2025 – Customer impact began, triggered by the issue described above.
15:30 CET on 17-10-2025 – Investigation started; issue detected on US for SAML users.
15:40 CET on 17-10-2025 – Issue mitigation started.
16:00 CET on 17-10-2025 – Fix applied.
Propagation of the fix takes up to 2 hours.
How are we making incidents like this less likely or less impactful?
We will implement proper automatic certificate renewal and validation jobs, including monitoring and alerting in case of failures.
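As an illustration of the kind of validation job we have in mind, the sketch below checks how many days remain before a TLS certificate expires and raises an alert when it falls under a threshold. The hostname, threshold, and alerting step are placeholders, not our actual SAML endpoint or alerting pipeline.

```python
# Sketch of a scheduled certificate-expiry check; the host and alert hook
# are illustrative placeholders only.
import socket
import ssl
from datetime import datetime, timezone

SAML_HOST = "sso.example.com"   # placeholder endpoint
WARN_DAYS = 14                  # alert well before expiry

def days_until_expiry(host: str, port: int = 443) -> float:
    """Connect over TLS and return the number of days until the certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - datetime.now(timezone.utc)).total_seconds() / 86400

if __name__ == "__main__":
    remaining = days_until_expiry(SAML_HOST)
    if remaining < WARN_DAYS:
        # In the real job this would page the infrastructure team instead of printing.
        print(f"ALERT: certificate for {SAML_HOST} expires in {remaining:.1f} days")
    else:
        print(f"OK: {remaining:.1f} days until expiry")
```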
Oct 17, 16:49 CEST
Between 14:00 and 14:30 CET, a platform issue resulted in service performance degradation. Customers experienced issues with the navigation bar and the admin section.
What went wrong and why?
One instance of the user service, which is responsible for authorization, lost communication with the cache and data layers, resulting in timeouts. Authenticated users could not be authorized, and components that require authorization stopped working.
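To illustrate the failure mode, the sketch below shows an authorization lookup that bounds its cache dependency with short timeouts and falls back to the data layer, so a broken cache connection fails fast rather than hanging requests. This is a sketch only; the function names and the fallback are assumptions, not our actual service code.

```python
# Illustrative sketch: bound the cache call so a lost Redis connection
# fails fast instead of stalling the authorization path.
import redis

cache = redis.Redis(
    host="example.redis.cache.windows.net",  # placeholder Azure Redis host
    port=6380,
    ssl=True,
    socket_connect_timeout=2,  # fail fast if the instance cannot reach Redis
    socket_timeout=2,
)

def check_permissions(user_id: str) -> str | None:
    """Return cached permissions, falling back to the data layer on cache errors."""
    try:
        cached = cache.get(f"perm:{user_id}")
        if cached is not None:
            return cached.decode()
    except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
        pass  # treat the cache as unavailable and fall through to the data layer
    return load_permissions_from_db(user_id)  # hypothetical data-layer call

def load_permissions_from_db(user_id: str) -> str | None:
    # Placeholder for the real data-layer lookup.
    return None
```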
How did we respond?
The error was corrected by removing the faulty instance. Most resources were restored automatically and communication was re-established.
14:00 CET on 06-10-2025 – Customer impact began, triggered by the change described above.
14:00 CET on 06-10-2025 – Issue detected by customers and by monitoring.
14:15 CET on 06-10-2025 – Investigation commenced by our DevOps and Back-End teams.
14:20 CET on 06-10-2025 – We performed steps to revert the code change.
14:22 CET on 06-10-2025 – Code release reverted; issue still present.
14:27 CET on 06-10-2025 – Issue traced to a specific service and its communication with Azure Redis.
14:30 CET on 06-10-2025 – Faulty instance of the service removed; errors dropped.
14:35 CET on 06-10-2025 – Recovery validated for the majority of impacted services.
How are we making incidents like this less likely or less impactful?
We will add more alerts to detect similar issues earlier and will automate the recovery process. We will also investigate ways to improve the recovery time for resources affected by such issues.
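A minimal sketch of the kind of automated check under consideration: probe each instance's health endpoint and flag it for removal after repeated failures. The endpoint path, failure threshold, and removal hook are assumptions for illustration, not the actual orchestration setup.

```python
# Sketch of a periodic health probe that flags a faulty instance for removal
# after repeated failures; all names and thresholds are illustrative.
import urllib.request
import urllib.error

FAILURE_THRESHOLD = 3
failures: dict[str, int] = {}

def probe(instance_url: str) -> bool:
    """Return True if the instance answers its health endpoint in time."""
    try:
        with urllib.request.urlopen(f"{instance_url}/healthz", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def check_instance(instance_url: str) -> None:
    if probe(instance_url):
        failures[instance_url] = 0
        return
    failures[instance_url] = failures.get(instance_url, 0) + 1
    if failures[instance_url] >= FAILURE_THRESHOLD:
        remove_from_rotation(instance_url)  # hypothetical orchestration hook

def remove_from_rotation(instance_url: str) -> None:
    # Placeholder: in practice this would call the load balancer / orchestrator API.
    print(f"removing {instance_url} from rotation")
```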
Oct 6, 14:00 CEST