Resolved -
A few servers are still offline, but FRA1 has been fully functional since Feb 23 at 23:00 CET. We are still awaiting a post-mortem from our cloud provider. This incident is now resolved.
Feb 24, 08:53 UTC
Update -
All event storage cluster servers hosting long-term data are now online, with indexes gradually recovering; the workers responsible for data retention will be restarted shortly. The remaining event storage cluster servers are being restarted via a custom process and should rejoin the cluster shortly.
Feb 23, 17:25 UTC
Update -
This incident is ongoing, with around 80 servers still down. Indexing on the event storage cluster is running with less than 10 minutes of lag. The data center provider is manually rebooting servers that failed to start correctly, prioritizing critical nodes, including message bus nodes. Recovery efforts continue.
Feb 23, 15:58 UTC
Update -
This incident continues to impact approximately 80 servers. The event storage cluster is recovering, with nodes restarting and data becoming progressively available. Indexing has resumed, with some lag still present. Frontend, APIs, automation features, and detection services remain operational. The main remaining issue is that four message bus nodes are down, which disrupts forwarding to the search indexing components. Recovery of infrastructure nodes is ongoing.
We are still in contact with our cloud provider to ensure all servers come back online as soon as possible.
Feb 23, 15:06 UTC
Identified -
More than a hundred servers are down due to an issue impacting network switches in one of the data centers, causing disruptions to the message bus and other infrastructure components. Our teams are actively investigating, and we are coordinating with the data center provider to determine the estimated time for recovery.
Feb 23, 14:18 UTC
Investigating -
We have detected more than a hundred servers going down at once on FRA1. As this affects nearly all clusters, we are currently investigating a cloud provider issue.
The overall impact is not yet known; our team is assessing it.
Feb 23, 14:15 UTC