I currently have two servers. One hosts the control panel and a few websites; the other is just for websites.
I had an issue yesterday where the second server went down for an hour or so.
I got a notification that some sites were down, and if I try to log in to the control panel I just get a spinning AJAX loader and nothing ever loads.
I am using OpenLiteSpeed as the web server.
Things I have checked and confirmed:
docker ps showed an uptime of 5 weeks on the main control panel server until I restarted it.
Otherwise the Docker containers seem fine.
Neither server is out of disk space.
htop doesn't show any high usage on either of the servers.
From what I can gather, the control panel server isn't able to find the second one and as such won't start orchd.
From docker logs orchd:
4eeb0bc5d0ed is unreachable: Error { kind: RpcUnavailable, context: None, entity: None, message: Some("failed to connect to all addresses") }
2025-06-01T20:41:23.426468Z WARN ThreadId(27) orchd::scheduler::fetch_service_statuses: Server 7xxxxx5-73aa-xxx-xxxx-4eeb0bc5d0ed is unreachable: Error { kind: RpcUnavailable, context: None, entity: None, message: Some("failed to connect to all addresses") }
2025-06-01T20:42:23.423531Z WARN ThreadId(27) orchd::scheduler::fetch_service_statuses: Server 7xxxxx5-73aa-xx-xxxx-4eeb0bc5d0ed is unreachable: Error { kind: RpcUnavailable, context: None, entity: None, message: Some("failed to connect to all addresses") }
2025-06-01T20:43:23.424128Z WARN ThreadId(27) orchd::scheduler::fetch_service_statuses: Server 7xxxxx5-xxxx-xxx-xxx-4eeb0bc5d0ed is unreachable: Error { kind: RpcUnavailable, context: None, entity: None, message: Some("failed to connect to all addresses") }
2025-06-01T20:45:03.006969Z ERROR ThreadId(28) orchd::scheduler::stat_polls: Failed to collect server stats for 7xxxxx5-xx-xx-xx-4eeb0bc5d0ed :internal: RpcFailure: 4-DEADLINE_EXCEEDED Deadline Exceeded
I have restarted both servers several times.
Server 1 can ping server 2 and see it without a problem.
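Ping only shows that ICMP gets through, though, while the RpcUnavailable error in the log is about a connection that could not be opened. Below is a minimal sketch of a TCP reachability check that could be run from server 1. The host and port are placeholders, since I don't know which address and port orchd actually uses to reach server 2, and the helper name tcp_reachable is just made up for illustration.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # ASSUMPTION: both values are placeholders, not taken from the logs.
    HOST = "203.0.113.10"  # replace with server 2's address
    PORT = 50051           # replace with the port the panel connects to
    state = "reachable" if tcp_reachable(HOST, PORT) else "NOT reachable"
    print(f"{HOST}:{PORT} is {state} over TCP")
```

If this reports the port as not reachable while ping still works, that would point at a firewall rule or at the service on server 2 not listening, rather than at basic networking.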
The only thing I can think of at the moment is that when the server was down it was put into recovery mode, which changes the SSH fingerprint etc., and that at some point server 1 got mixed up or stuck in some way.
I assume I just know too little about the potential problems, and I'm hoping someone can shed some light.
Thanks in advance