FRA1 server instability causing workflow slowdowns

Incident Report for SEKOIA FRA1

Resolved

We are pleased to announce that the incident has been fully resolved. Our team has successfully stabilized the servers, resumed operations, and cleared the backlog of data. All "event drop" notifications received during this incident can be disregarded as no events were lost; all events have been processed. We appreciate your understanding and cooperation during this time and will continue to monitor the situation to ensure stable operation. Thank you for your patience.

Posted Jul 29, 2025 - 16:49 UTC

Update

We are glad to report that the incident has been largely resolved. Our team has stabilized the servers and resumed operations. We are currently processing incoming data and making good progress in catching up on the backlog. We want to reassure our clients that no events have been lost.

Any "event drop" notifications you may have received can be ignored; the events are being processed gradually.

We will continue to monitor the situation closely to ensure stable consumption and to completely eliminate any remaining lag. We appreciate your understanding and patience.

Posted Jul 29, 2025 - 13:35 UTC

Monitoring

The team has stabilized the servers and resumed operations. We are currently processing incoming data and catching up on the backlog. We are monitoring the situation closely to ensure stable consumption and to address any remaining lag. Investigation into the root cause, a known memory leak in our ingest pods, is ongoing. We appreciate your understanding and patience as we work to fully resolve this issue.

Posted Jul 29, 2025 - 12:47 UTC

Identified

A number of our servers are currently not operational, which is slowing down our workflows. Our team is actively working on stabilizing the affected servers and managing the high memory usage observed on one tier of nodes. In the process, we have temporarily paused certain operations to allow the system to recover and offsets to be committed. Please note that this may result in some event duplication. We will keep you updated as we progress in resolving the issue. Thank you for your patience.
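
Duplication of this kind typically occurs when a consumer re-reads messages whose offsets had not yet been committed. For clients who post-process exported events on their side, the following is a minimal sketch of filtering such duplicates. It assumes every event carries a unique identifier; the field name `event_id` and the function `deduplicate` are hypothetical and do not describe our actual schema or pipeline.

```python
from typing import Iterable, Iterator


def deduplicate(events: Iterable[dict], key: str = "event_id") -> Iterator[dict]:
    """Yield each event once, keeping the first occurrence.

    `key` names the field holding a unique identifier. "event_id" is an
    assumed, illustrative name, not a documented field.
    """
    seen = set()
    for event in events:
        identifier = event[key]
        if identifier in seen:
            # Already handled: this copy was replayed, so skip it.
            continue
        seen.add(identifier)
        yield event


if __name__ == "__main__":
    replayed = [{"event_id": "a"}, {"event_id": "b"}, {"event_id": "a"}]
    for event in deduplicate(replayed):
        print(event["event_id"])
```

The set of seen identifiers grows with the number of distinct events, so for long-running processing it would need to be bounded, for example by only tracking identifiers from the incident window.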

Posted Jul 29, 2025 - 12:08 UTC

This incident affected: Detection.