briansmith84 well, that's what having a well-thought-out and well-implemented disaster recovery plan is for. It's extremely important. Some key mistakes my dude made:
Never run major updates on a Friday evening. It's the absolute worst time, as everyone is heading off for the weekend, and even if weekend support is available it's usually the less-skilled part-timers. Major updates should be performed during business hours, when you know you can get support (from Enhance or other sysadmins) if you need it.
Never run a brand new feature against your entire software stack without testing it first. Run it on one site; if that works, run it on 10 sites, then 100 sites, and so on.
Build a disaster recovery plan BEFORE a disaster. My dude was searching through online docs when he should have already had a disaster recovery plan in place and known exactly which steps to take (up to the nuclear option of recommissioning a server from backups).
In my career I've gone through dozens of disaster situations, fully blown-out servers. Of course I still get the tension, the rapid heartbeat, the sweaty palms - BUT I know exactly what steps I'm going to take to resolve it. Depending on the initial diagnosis, some things we'd do include updating our company status page about the outage, posting a tweet, notifying our datacenter staff (if it's hardware related, i.e. the server doesn't respond to pings), and allowing 30 minutes of attempted troubleshooting if it's a software-based issue before going nuclear: decommissioning the server and recommissioning onto the standby server (in this case DNS updates automatically for us, it's just a matter of waiting for backups to transfer to the standby server for redeployment).
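To make that concrete, here's a minimal triage sketch in Python. Everything in it is hypothetical - the hostname, the notify() and diagnosed() helpers, the 30-minute cap wired in as a constant - it just shows the shape of the decision: no ping means hand off to the datacenter, otherwise troubleshoot with a hard time limit before failing over to standby.

```python
#!/usr/bin/env python3
"""Rough incident-triage sketch: decide hardware vs software path and enforce
a troubleshooting time cap before failing over. Hostname, notify(), and
diagnosed() are placeholders, not a real API."""

import subprocess
import time

AFFECTED_HOST = "web01.example.com"   # hypothetical server name
TROUBLESHOOT_LIMIT = 30 * 60          # 30-minute cap for software issues


def host_responds_to_ping(host: str) -> bool:
    """Single ICMP probe; a dead box usually fails this check first."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def notify(message: str) -> None:
    """Placeholder for the status page update / tweet / datacenter ticket."""
    print(f"[{time.strftime('%H:%M:%S')}] {message}")


def diagnosed() -> bool:
    """Placeholder: return True once you actually know what's wrong."""
    return False


def triage() -> None:
    notify(f"Outage declared for {AFFECTED_HOST}; status page updated.")

    if not host_responds_to_ping(AFFECTED_HOST):
        # Looks like hardware: hand off to datacenter staff and start their clock.
        notify("No ping response; opening a datacenter ticket.")
        return

    # Looks like software: troubleshoot, but only up to the time limit.
    notify("Host is up; troubleshooting software, 30-minute limit in effect.")
    deadline = time.monotonic() + TROUBLESHOOT_LIMIT
    while time.monotonic() < deadline:
        if diagnosed():
            notify("Problem identified within the limit; judgment call: fix or redeploy.")
            return
        time.sleep(60)  # actual diagnosis work happens between checks
    notify("Time limit reached with no diagnosis; failing over to the standby server.")


if __name__ == "__main__":
    triage()
```

The point isn't the script itself, it's that the decisions are made ahead of time, so during the incident you're executing, not debating.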
We're running Enhance backups tri-hourly, so they're very fresh. The idea is that we can bring clients back online from a disaster scenario with minimal data loss (3 hours at most). In most situations a few hours of data loss is far preferable to ongoing downtime. Our clients wouldn't stand for hours of downtime, hence quickest recovery is key. By keeping backups super fresh and having a full dedicated server ready to go, sitting idle in "standby", we cut out the time it would take to even provision a new server. The standby server costs hundreds of dollars a month to sit there doing nothing - but in a disaster it could shave 20 minutes or more off recovery, well worth it.
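Fresh backups only help if they actually stay fresh, so it's worth monitoring that independently. Here's a tiny sketch of that idea - the backup directory path is a placeholder, and Enhance's own scheduler does the real backups; this just alerts if the newest backup slips past the 3-hour window.

```python
#!/usr/bin/env python3
"""Freshness check for tri-hourly backups: warn if the newest backup is older
than the 3-hour target. The directory path is a placeholder."""

from pathlib import Path
import sys
import time

BACKUP_DIR = Path("/backups/latest")   # hypothetical location of backup files
MAX_AGE = 3 * 60 * 60                  # tri-hourly target: 3 hours in seconds

files = [p for p in BACKUP_DIR.iterdir() if p.is_file()]
if not files:
    print("WARNING: no backups found at all")
    sys.exit(1)

age = time.time() - max(p.stat().st_mtime for p in files)
if age > MAX_AGE:
    print(f"WARNING: newest backup is {age / 3600:.1f}h old; worst-case data loss now exceeds 3h")
    sys.exit(1)

print(f"OK: newest backup is {age / 3600:.1f}h old (within the 3h window)")
```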
Now, a software issue is unique, because at some point you have to consider the cost/benefit of staying offline to troubleshoot. A lot of the time it's not an easy fix, and even diagnosing what's wrong can take a while. Do you work on it for an hour, two hours? After 3 hours, why not keep trying another few hours? At some point you have to decide that troubleshooting is wasting more time than it would take to run an alternative fix, like decommissioning the server and recommissioning it from backups. Of course, if your backups aren't super fresh then that can hurt customers too - if they get redeployed onto a 24-hour-old backup, they may be pissed to lose a day's worth of work they did on their site, etc.
I set a 30-minute limit for software issues because my partners and I are very well-rounded sysadmins, and if one of us can't at LEAST diagnose what the problem is within 30 minutes, then it's not worth pursuing. If after 30 minutes we know what's wrong, then at that point we also know roughly how long it would take to fix, and it becomes a judgment call whether to fix it or redeploy.
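That judgment call can be roughed out as simple arithmetic. This is purely a back-of-the-envelope sketch with made-up numbers (the 0.5 weighting on backup staleness is arbitrary, not a real policy), but it shows the comparison: estimated fix time versus redeploy time plus the pain of restoring from a stale backup.

```python
def fix_or_redeploy(est_fix_minutes: float,
                    redeploy_minutes: float = 60,
                    backup_age_minutes: float = 180) -> str:
    """Back-of-the-envelope call once the problem is diagnosed: redeploy if the
    estimated fix takes longer than redeploying, after padding the redeploy
    cost to account for how stale the backup is (illustrative weighting only)."""
    redeploy_cost = redeploy_minutes + 0.5 * backup_age_minutes
    return "fix in place" if est_fix_minutes <= redeploy_cost else "redeploy from backup"

# Example: a 4-hour fix vs a ~1-hour redeploy with backups up to 3 hours old.
print(fix_or_redeploy(est_fix_minutes=240))   # -> "redeploy from backup"
```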
Now, for a hardware failure we may give more leeway. If the DC staff confirm the ticket and begin investigating within 15-30 minutes, then we can give them an hour to resolve the issue. If by 60 minutes they haven't communicated an update with some positive news, then redeployment is preferable - again, it's a judgment call at that point. In the past some DCs we've worked with have been crap at hardware support; taking 6-8 hours to replace a blown CPU is unacceptable downtime, so alternative measures would be implemented long before that.
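You can write that escalation timeline down the same way, so nobody has to remember it at 2am. A small sketch, with the milestones and example numbers purely illustrative:

```python
# Hypothetical hardware-ticket milestones (minutes since the ticket was opened),
# mirroring the timeline above: the DC must acknowledge by 30 min and show real
# progress by 60 min, otherwise we switch to redeploying on the standby server.
MILESTONES = [
    (30, "datacenter has acknowledged the ticket and started investigating"),
    (60, "datacenter has posted a substantive, positive progress update"),
]

def next_action(minutes_elapsed: int, milestones_met: int) -> str:
    """Decide what to do given how long the ticket has been open and how many
    milestones the datacenter has actually hit so far."""
    for i, (deadline, description) in enumerate(MILESTONES):
        if minutes_elapsed >= deadline and milestones_met <= i:
            return f"missed milestone ({description}); start redeploying to standby"
    return "keep waiting on the datacenter"

print(next_action(minutes_elapsed=45, milestones_met=1))  # -> keep waiting
print(next_action(minutes_elapsed=70, milestones_met=1))  # -> missed the 60-min update, redeploy
```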
Our main/old servers have a crazy balance of backups and clones to maintain disaster readiness; it's hugely expensive and not great for performance. But we CAN get a dead server brought back within 3 hours, which is remarkable.
The plan with Enhance servers is to reduce recovery time to under an hour, with minimal data loss thanks to the tri-hourly backups.
It's a significant investment to have a strong backup plan and a strong disaster recovery plan. But when the SHTF you'll thank yourself for spending the money... And make no mistake, downtime WILL hit you, as it hits everyone at some point. So get prepared, be vigilant, and keep yourself educated on the subject.