After a network recovery event, connectivity to the storage nodes was not successfully restored on certain nodes, which left those infrastructure nodes unresponsive.
To resolve the issue, a restart of the affected nodes was required.
As advised by our upstream supplier, infrastructure changes were implemented to maintain uninterrupted availability of the storage when reconnecting after a network recovery event.
15 - August - 14:30 UTC
Redwood engineers, working with Amazon, have identified a potential performance issue between the Redwood infrastructure and Amazon storage services. We have deployed a fix to our infrastructure, and a restart of the environments is needed for it to take effect. The restart will automatically migrate the environment to the infrastructure that has the fix applied.
Please verify which of your environments will require a restart. The environment names are listed in the MOTD you received and can also be found on the RMJ/RMF Dashboard.
If the option to reschedule a restart is not displayed in the Dashboard, no restart of that environment is needed.
12 - August - 07:55 UTC
Instability was detected during the night, causing a few of the environments, mainly in the Dublin Region, to restart between 22:29 UTC and 00:36 UTC.
11 - August - 21:00 UTC
Redwood engineers are working with AWS engineers to investigate some EFS-related issues that have been impacting stability.
A provisional fix has been determined and an implementation and rollout plan is being developed.
10 - August - 11:44 UTC
Instability was detected during the night, causing a few of the environments to restart. Investigations are in progress.
09 - August - 16:00 UTC
The environments remain stable. We will continue to work on the RCA and expect to share it within the next week.
09 - August - 10:15 UTC
The environments remain stable.
09 - August - 08:25 UTC
All affected environments are in a running state. We will continue to monitor the environments closely.
If you are still experiencing issues, please contact support.
09 - August - 07:23 UTC
We have created this knowledge base article to keep you apprised of the current developments regarding the outage that started on the 9th of August at 00:00 UTC in the Dublin Region.
Our monitoring system immediately identified the ongoing issue, and our Engineering team was notified.
We are actively working on the recovery of the affected environments.
The team's current focus is to bring the affected environments to a running state. Once that is completed and the environments have been verified, the team will work on providing a more detailed RCA.
We will keep posting regular updates in this KBA.