MySQL Self Recovery For Crashes/Shutdowns

twest

In the old Enhance, Docker would monitor services and restart them on failures/crashes. To restore that functionality you can create systemd overrides. They're pretty simple, and this is normal config editing for systemd, nothing crazy. Here's how you can add an override for mysql that will automatically restart mysql after a crash:

First open the override file:
sudo systemctl edit mysql.service

Next, add this to your overrides file:
[Service]
Restart=on-failure
RestartSec=10

Now reload systemd:
sudo systemctl daemon-reload

Last, restart MySQL:
sudo systemctl restart mysql.service

There should now be an override.conf file in /etc/systemd/system/mysql.service.d/. That file should contain just the 3 lines of customization you made.
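
For anyone rolling this out to several DB servers, the same override can be written without the interactive editor. This is a minimal sketch that assumes the drop-in path mentioned above; the function name write_restart_override and its arguments are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: write the restart override as a systemd drop-in.
#   $1 = restart policy (on-failure or always)
#   $2 = delay between restarts, in seconds
#   $3 = drop-in directory (defaults to the mysql.service.d path from this post)
write_restart_override() {
    policy="${1:-on-failure}"
    delay="${2:-10}"
    dir="${3:-/etc/systemd/system/mysql.service.d}"
    mkdir -p "$dir" || return 1
    printf '[Service]\nRestart=%s\nRestartSec=%s\n' "$policy" "$delay" > "$dir/override.conf"
}

# Usage (as root), then reload and confirm systemd sees the new values:
#   write_restart_override on-failure 10
#   systemctl daemon-reload
#   systemctl show mysql.service -p Restart -p RestartUSec
```

The systemctl show call at the end is just a quick way to confirm the override was picked up.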

You can do this on all your DB servers so they'll all restart MySQL if it crashes, helping you prevent extended downtime. I set the restart delay to 10 seconds so that if there's a critical runaway looping error, there's time between restarts where I can stop services or do other troubleshooting. You can increase this if you like, whatever balance between uptime and recovery is best for you.

Important Note
I set my override to "Restart=always". The difference is that systemd will restart MySQL no matter why it shut down (other than being stopped with the systemctl command specifically). This is important for me because, for the first time ever on Enhance, we had downtime yesterday on a server where the DB was offline. The MySQL log shows a "user signal" initiated shutdown (wtf). Having "Restart=always" might have caught that and restarted mysql, whereas "Restart=on-failure" would not have done anything... So decide how important the DB server is in your environment and add the config as needed. For me it's "Restart=always" with a 10 second delay.

JohnB

twest MySQL log shows "user signal" initiated shutdown (wtf).

How did that happen?

cPFence

twest

Be careful of potential data corruption if MySQL gets stuck in a restart loop due to repeated OOM kills. This kind of corruption tends to happen more often when running directly on the host and seems less likely inside Docker (not exactly sure why). The best approach is to monitor the issue and investigate the root cause rather than relying on auto-restarts.

You'll need MySQL optimizations specific to the v12 setup to fine-tune it again. The old settings used inside Docker may no longer be a good fit, so it's best to monitor performance for 24 hours, check key variables, and adjust accordingly. This script might help. It’s a bit of a hassle, but once optimized, you can set it and forget it. Setting max user connections to 1000 is definitely a very bad idea, regardless of how powerful your server is.

For cPFence users, please avoid setting restart=always since the Owl module is already monitoring all essential services (MySQL, orchd, LiteSpeed, appcd, etc.) and this may cause conflicts. The Owl will notify you of any issues and restart services for you wisely as needed, preventing endless restart loops.
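
On the restart-loop concern in general: systemd has its own start rate limiting (StartLimitIntervalSec and StartLimitBurst in the [Unit] section), but it only kicks in if the restarts are packed tightly enough. A rough sketch of that arithmetic, assuming a service that dies right after every start; the function name and the 300/5 numbers are placeholders, not something from this thread.

```shell
#!/bin/sh
# Hypothetical check: can systemd's start rate limit ever trip for a unit that
# dies right after every start? Restarts are spaced at least RestartSec apart,
# so the limit is only reachable when RestartSec * burst < interval.
#   $1 = RestartSec, $2 = StartLimitBurst, $3 = StartLimitIntervalSec (seconds)
start_limit_reachable() {
    [ $(( $1 * $2 )) -lt "$3" ]
}

# systemd's stock defaults are 5 starts within 10s, so with RestartSec=10 the
# limit is never reached and a crash loop can continue indefinitely.
# A drop-in that would give up after 5 failed starts within 5 minutes:
#   [Unit]
#   StartLimitIntervalSec=300
#   StartLimitBurst=5
```

If MySQL takes a while to fail after each start, the real spacing is longer than RestartSec, so treat this as a lower bound.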

twest

JohnB no idea! Server monitor shows all resources were low/normal usage - lots of cpu/ram available. Error log is good/clean, no signs of distress anywhere. No fuggin clue how/why mysql randomly shut down. Luckily we've got lots of uptime monitoring, so it was quickly found and fixed right as it happened... But still, it's unsettling since mysql was rock solid before v12.

I know some people had issues with their db randomly crashing after v12 - but I think those instances were due to cpguard/cpfence conflicts... In any case, I'm hopeful this systemd override does the trick 🙂

Jordan

@twest you might want to look at Netdata. It will alert on the MySQL service being restarted or stopped, as well as capture a ton of observability metrics.

wizardeur I noticed the same thing. I'm trying to investigate, log rotation is one of my suspects. Disabled it and will observe if it happens again.

Log rotation shouldn't restart/reload MySQL. It should just be sending "flush logs;".

twest Error log is good/clean,
Double check that it's still enabled. Default is disabled with Enhance. With the upgrade to v12 it might have been disabled.

twest Welp, journalctl didn't help, it showed no entries. I analyzed the syslog and looking at the time stamps from when the server was last shut down found these entries:

Enable the mysql general log, might get more details, might get the same.

xyzulu

Good thread.. It's going to get lost here due to the way these forums work/don't work. I might add it to the wiki on GitHub here: https://github.com/xyzulu/enhance-related-commands/wiki I just need to think about where would be best.

On v11, the default docker behaviour for mysql/mariadb was restart=always, but like you, we have set Restart=on-failure and have not had any issues with mysql since. We've had it set like this for a number of weeks. This is more of a failsafe step, as memory/resource issues are perhaps the root cause of mysql getting OOM killed or stopping for strange reasons.

Depending on if it's a shared server or a dedicated mysql server, setting max_user_connections is also a good idea for some stability. @cPFence gave that reminder I recall somewhere.

Edit: @twest do you want to add it yourself on the page I created here: https://github.com/xyzulu/enhance-related-commands/wiki/v12-mariadb-mysql-tweaks that way it doesn't look like I just copy and pasted your words 😉

twest

xyzulu yeah my max user conn is set to 1000, and haven't come close to hitting that. This is a big dedi (50cpu/250gb ram) with very low resource usage (15% memory/5-7 load). Really mysterious why mysql shut down, but I'm not too worried as long as those events happen infrequently (only happened to 1 server in the week since my upgrade) and the override catches it.

I noticed another override was in there already for Apache, it was set to Restart=always. I may add one for ELS if it ever has an issue since that's what I use... We might just want to put one for NGINX/OLS as well if you add this to the wiki, I think overrides for webserver/db are critical - maybe someone has ideas for other services that should go in.

twest

cPFence yeah I thought about increasing the restart delay to make sure monitors had time to catch it, but ultimately decided it would be better to just add a monitor for mysql's system status. If its uptime is less than 1 day, it updates the status-uptime-monitoring page, and an uptime monitor checks the page for the target keyword "Warning". Now we get an email whenever mysql restarts and will know to look into any possible issue.

In case anyone is curious, I use this monitoring page to check for webserver and mysql uptime. I plop it as a website on each server, then config my uptime monitoring tool to scan the page (setting it to check for an http-200 status code and for the target keyword "Warning"). This is the updated code (compared to what I posted in my thread about this last year) that incorporates a mysql uptime check + warning:

https://pastebin.com/kSAwcuwi
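
The pastebin has the full page. The MySQL part of the idea can be sketched in a few lines of shell. This is not the pastebin code, just an illustration; the function name and the 86400-second default (the 1 day described above) are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print a line starting with "Warning" when MySQL has been
# up for less than a threshold (default 1 day), otherwise a line starting with
# "OK". An uptime monitor can then scan the output for the keyword "Warning".
#   $1 = MySQL uptime in seconds, $2 = threshold in seconds
mysql_uptime_status() {
    uptime_s="$1"
    threshold="${2:-86400}"
    if [ "$uptime_s" -lt "$threshold" ]; then
        echo "Warning: MySQL up for only ${uptime_s}s"
    else
        echo "OK: MySQL up for ${uptime_s}s"
    fi
}

# Usage, assuming the mysql client can log in (e.g. via ~/.my.cnf):
#   up="$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Uptime'" | awk '{print $2}')"
#   mysql_uptime_status "$up"
```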

wizardeur

I noticed the same thing. I'm trying to investigate, log rotation is one of my suspects. Disabled it and will observe if it happens again.

twest

wizardeur good news for me is the Restart=always worked and brought a server back up after mysql shut down today. Bad news is it's the same server that randomly shut down the other day, so I'll have to investigate more tomorrow. If it shuts down again at the same time tomorrow that will be a big clue about it being a scheduled task, like maybe the log rotation as you said.

xyzulu

twest Have you checked the reason it was killed?

journalctl -u mariadb or journalctl -u mysql ?

twest

xyzulu not yet. It seems to be a once a day thing, so not critical yet. Got some time scheduled for tomorrow to check it more thoroughly. Hopefully find something concrete to put this issue to rest 🥺

xyzulu

That log should tell you the exact reason .. you can even use: --since "06:00" or --since "2 hours ago" etc to make it easier to find.

twest

Welp, journalctl didn't help, it showed no entries. I analyzed the syslog and looking at the time stamps from when the server was last shut down found these entries:

Mar 19 14:27:26 s988 systemd[1]: mysql.service: Scheduled restart job, restart counter is at 1.
Mar 19 14:27:26 s988 systemd[1]: Stopped MySQL Community Server.
Mar 19 14:27:26 s988 systemd[1]: mysql.service: Consumed 1d 1h 59min 7.545s CPU time.
Mar 19 14:27:26 s988 systemd[1]: Starting MySQL Community Server...
Mar 19 14:27:26 s988 kernel: [37918439.152631] audit: type=1400 audit(1742408846.980:35): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/sbin/mysqld" pid=1476683 comm="apparmor_parser"
Mar 19 14:27:31 s988 srs_milter[1339927]: CLOSE
Mar 19 14:27:31 s988 systemd[1]: Started MySQL Community Server.

That appears to indicate there's a scheduled task restarting mysql. I checked systemctl active timers, and crons - nothing strange there.

I tried checking timestamps in syslog to match up with the SHUTDOWN entry in the mysql error log (this was initiated by <via user signal>), but the only entries in syslog around that timestamp are hundreds and hundreds of website screenshot entries...

Speaking of website screenshot entries... I seem to be generating about 1 million entries daily on my syslog, and it's almost entirely website screenshots. The last time I updated the settings in the control panel I set it to update once a week or maybe once a month. I went to check on that setting, but it appears to be gone from the control panel now?

The docs say the screenshot interval is at Settings>Service>App, but it's not there for me. I can't find it anywhere now.
https://enhance.com/docs/application-role/application-settings.html#screenshot-interval

I think I need to disable the screenshot service, it appears to be running non-stop.

xyzulu

twest Scheduled restart job

Anything just before this entry? This entry could be from your auto restart MySQL setting.
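
With syslog flooded by screenshot entries, one way to see what came just before each restart is to drop the noise first and then grep with leading context. A small sketch; the function name is made up, and the noise pattern in the usage line is only a guess based on the screenshot log lines quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: show each "Scheduled restart job" entry for mysql.service
# together with the lines logged just before it.
#   $1 = log file (e.g. /var/log/syslog)
#   $2 = number of lines of leading context (default 20)
#   $3 = optional pattern for noisy lines to drop before searching
restart_context() {
    log="$1"
    before="${2:-20}"
    noise="${3:-}"
    if [ -n "$noise" ]; then
        grep -v -- "$noise" "$log" | grep -B "$before" 'mysql\.service: Scheduled restart job'
    else
        grep -B "$before" 'mysql\.service: Scheduled restart job' "$log"
    fi
}

# Usage:
#   restart_context /var/log/syslog 20 'enhance/screenshots'
```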

MarkD

twest Speaking of website screenshot entries... I seem to be generating about 1 million entries daily on my syslog, and it's almost entirely website screenshots. The last time I updated the settings in the control panel I set it to update once a week or maybe once a month. I went to check on that setting, but it appears to be gone from the control panel now?

We have this problem as well - the logs are full of "Checking "/var/local/enhance/screenshots/..."

cPFence

twest

If Restart=always is set in the MySQL systemd unit file, then any crash or forced kill (like from OOM or a segfault) can trigger a fast restart. Logs about the crash get missed in journalctl -u mysql because the service is started again almost immediately. Systemd marks it as a normal restart if the service comes back up fine, so it doesn't treat it as a failure worth logging at the systemd level. To troubleshoot better, you will need restart=no or maybe try to enable persistent journaling. Hopefully, logs won't get lost.
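
Persistent journaling is commonly enabled with a journald drop-in. A sketch, assuming a stock systemd layout; the function name and the persistent.conf file name are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: write a journald drop-in that keeps logs on disk.
#   $1 = drop-in directory (defaults to the standard journald.conf.d location)
enable_persistent_journal() {
    dir="${1:-/etc/systemd/journald.conf.d}"
    mkdir -p "$dir" || return 1
    printf '[Journal]\nStorage=persistent\n' > "$dir/persistent.conf"
}

# Usage (as root):
#   enable_persistent_journal
#   systemctl restart systemd-journald
# Then check how systemd saw the last exit and how often it restarted the unit:
#   systemctl show mysql.service -p Result -p ExecMainStatus -p NRestarts
#   journalctl -u mysql.service --since "2 hours ago"
```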

twest

xyzulu just nonsense entries about the screenshot service.

twest

Jordan I trialed netdata before, but it was too heavy on cpu for my liking. I'll try the general log if another shutdown happens.

Currently running at 27 hours without a shutdown on the trouble server, guessing it's not a scheduled service crapping it up. Only change I've made so far is killing screenshots on all servers.

Jordan

twest Jordan I trialed netdata before, but it was too heavy on cpu for my liking. I'll try the general log if another shutdown happens.

I haven't had CPU issues, I also make sure to size servers correctly to allow for CPU/Memory for Netdata and backups.

twest Currently running at 27 hours without a shutdown on the trouble server, guessing it's not a scheduled service crapping it up. Only change I've made so far is killing screenshots on all servers.

I'm getting 2-3 days and then mysql stops. I was wrong, you don't need to enable the general log. It will show all queries, which may be helpful but will take up disk space. Instead just enable log_error.

log_error = /var/log/mysql/error.log

cPFence Logs about the crash get missed in journalctl -u mysql because the service is started again almost immediately

Unfortunately, with MySQL, depending on syslog for logging isn't ideal. Enabling log_error should be a default configuration, and journalctl might capture stdout/stderr for the service via systemd.
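
Once log_error is enabled, pulling the shutdown entries out of the error log is a one-liner. A small sketch; the function name is made up, and the case-insensitive match on "shutdown" is an assumption about how the entries are worded.

```shell
#!/bin/sh
# Hypothetical helper: list shutdown-related entries from a MySQL error log.
#   $1 = error log path (the log_error value, e.g. /var/log/mysql/error.log)
mysql_shutdown_entries() {
    grep -i 'shutdown' "$1"
}

# Usage:
#   mysql -N -B -e "SHOW VARIABLES LIKE 'log_error'"   # confirm where the log goes
#   mysql_shutdown_entries /var/log/mysql/error.log
```

The timestamps from those entries can then be matched against syslog or the journal.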