Dear Customer,
We are writing to inform you about a critical incident affecting the dalstorage01 node earlier today.
This morning, we experienced a severe RAID array failure which caused a portion of the virtual machines on the node to enter an inconsistent state. Our infrastructure utilizes RAID60 hardware arrays with SSD cache in RAID1, and we continuously monitor all systems to proactively maintain their health.
Based on our current analysis, we believe the RAID controller entered a faulty state and continued writing data incorrectly to the array, resulting in corruption of multiple KVM disk images.
As an immediate response:
The affected node was taken offline
The RAID controller was replaced with on-site spare hardware
Data recovery procedures were initiated
Because the array is larger than 180 TB, recovery is a time-intensive process and may not be fully successful. While we expect that a significant portion of the data remains intact, many virtual machines are currently in an inconsistent state and may not boot properly.
At this time, we consider the node unsuitable for production use, and we are therefore creating new virtual machines for all affected customers. Because these VMs are used for cold storage and backups, we have determined that our customers will benefit most from getting a working storage virtual machine back as quickly as possible.
We will compensate all affected users with a free 30-day service extension for the inconvenience caused by this issue.