I got an update from Adam about my questions:
"Yes, it could have been caused by the MySQL upgrade issue you had during the v12 upgrade. We will be adding some logic that checks/repairs the local MySQL user automatically in future."
So now I wonder why only one server would be affected by the permissions issue. I don't remember having an issue with WIPs after the v12 upgrade, and in fact, when I went through manual DR, there weren't any WIPs in the sites from that server. So I don't see how the permissions issue could be a result of the v12 upgrade when the backups were running fine. The WIPs issue only appeared after DR.
And then I wonder: if there was a permissions issue after the v12 upgrade but before the crash, did it somehow cause the server to kill itself? I still haven't had a chance to dig into IPMI much on the dead server.
I don't think I'm going to get answers to any of these questions, and maybe it's impossible to know at this point. Maybe digging into the dead server will turn up some clues. It's basically the last chance for any concrete answer.
I'm going to continue checking the backup server for WIPs daily and running cpfence's backup check script. I'm also going to run a backup restore test on a website on every server, just to make sure restores work everywhere. I haven't verified that part yet. I've just been assuming it works since the other servers don't have backup issues, but I need to actually verify it.
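For the daily WIP check, something like the sketch below could save some manual poking around. It is only a rough idea, not how the panel or cpfence's script actually does it: I'm assuming leftover WIPs show up as files or directories with "wip" somewhere in the name, and the function name, the marker string, and the 24-hour threshold are all made up for illustration.

```python
import os
import time
from pathlib import Path


def find_stale_wips(backup_root, marker="wip", max_age_hours=24.0, now=None):
    """List entries under backup_root that look like leftover WIPs.

    Assumption (not confirmed by the panel docs): a WIP is any file or
    directory whose name contains `marker`. An entry counts as stale when
    its modification time is older than `max_age_hours`.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    stale = []
    for dirpath, dirnames, filenames in os.walk(backup_root):
        for name in dirnames + filenames:
            if marker in name.lower():
                path = Path(dirpath) / name
                # lstat() so a dangling symlink does not raise
                if path.lstat().st_mtime < cutoff:
                    stale.append(path)
    return sorted(stale)
```

Point it at wherever the backups actually live on the backup server and print whatever comes back. An empty list would mean nothing matching the marker has been sitting around longer than the threshold.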
I'm not feeling confident about DR at the moment, but at least I have backups running well, so worst case scenario they can all be restored manually again. Knowing that creating a blank database on a website will reconnect broken MySQL permissions (that part needs to be bolded for everyone to see lol) is a big win. That alone would make restoring manually immensely quicker, since it should allow manual backup restore to work.