Hey Community
We have been trying to get to the bottom of why MySQL and MariaDB randomly stop/restart across our Enhance servers since upgrading to v12, and this is the closest we have come to getting detailed output.
To try and figure this out we set up a script that runs every 15 seconds and appends the output of "ps aux" to a daily log file.
We also have a second script that runs every minute and logs the output of top and vmstat to another daily log file.
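For reference, the two loggers are nothing fancy; simplified, they boil down to something like this (the real scripts use different paths and handle rotation):

# 15-second ps snapshot loop (runs as a simple root service; log directory must exist)
while true; do
    {
        echo "==== $(date '+%F %T')"
        ps aux
    } >> /var/log/ps-snapshots/$(date +%F).log
    sleep 15
done

# per-minute top/vmstat snapshot, run from root's crontab
{
    echo "==== $(date '+%F %T')"
    /usr/bin/top -cSb -n 1
    echo '!---------------------------------------- vmstat 1 4'
    vmstat 1 4
} >> /var/log/top-snapshots/$(date +%F).log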
Finally, we set up the following audit rules on all servers, which should catch any process executing the kill or systemctl binaries:
-w /usr/bin/systemctl -p x -k systemctl_exec
-w /bin/systemctl -p x -k systemctl_exec
-w /usr/sbin/service -p x -k systemctl_exec
-w /bin/kill -p x -k kill_mysql
-w /usr/bin/kill -p x -k kill_mysql
-w /usr/bin/pkill -p x -k kill_mysql
-w /usr/bin/killall -p x -k kill_mysql
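For anyone wanting to reproduce this, the rules sit in a file under /etc/audit/rules.d/ and are loaded and checked with the standard auditd tooling:

augenrules --load                                   # rebuild and load everything under /etc/audit/rules.d/
auditctl -l | grep -E 'kill_mysql|systemctl_exec'   # confirm the watches are active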
The 15-second ps script is there to give us the process names: if the audit rules catch anything, the audit log only records the process ID, not the name, so we need the ps snapshots to map that PID back to a process name.
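That way, when ausearch does flag a PID, we can cross-reference it against the nearest ps snapshot, along these lines (the PID and log path here are just placeholders):

PID=123456   # taken from the audit event
grep " $PID " /var/log/ps-snapshots/2025-04-10.log | head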
For context, this is the MySQL configuration, which is basically the default with a few minor tweaks:
[mysqld]
skip-log-bin
#ssl-ca=/etc/certs/mysql/ca.pem
#ssl-cert=/etc/mysql/ssl/cert.pem
#ssl-key=/etc/mysql/ssl/key.pem
skip-name-resolve
collation-server = utf8_unicode_ci
character-set-server = utf8
log_error_suppression_list='MY-013360'
default_authentication_plugin = mysql_native_password
skip-host-cache
innodb_buffer_pool_size=512M
log_error=/var/log/mysql/error.log
max_user_connections=25
tmp_table_size = 64M
max_heap_table_size = 64M
max_allowed_packet = 128M
wait_timeout = 300
interactive_timeout = 300
So, this morning at 02:43:49 one of the servers had MySQL killed with SIGKILL (kill -9). We know it was killed because we can see this in the syslog:
2025-04-10T02:43:49.438505+01:00 vps529 systemd[1]: mysql.service: Main process exited, code=killed, status=9/KILL
2025-04-10T02:43:49.438902+01:00 vps529 systemd[1]: mysql.service: Failed with result 'signal'.
2025-04-10T02:43:49.439250+01:00 vps529 systemd[1]: mysql.service: Consumed 2h 28min 8.413s CPU time, 6.0G memory peak, 0B memory swap peak.
2025-04-10T02:43:49.602971+01:00 vps529 systemd[1]: mysql.service: Scheduled restart job, restart counter is at 1.
2025-04-10T02:43:49.611744+01:00 vps529 systemd[1]: Starting mysql.service - MySQL Community Server...
There are no other entries in syslog before these that relate to any problem with MySQL, only IPDB block entries.
And if we look at the MySQL error log, we see this entry:
2025-04-10T02:43:49.231935Z+01:00 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.41).
We searched the audit logs using "ausearch -k kill_mysql" and "ausearch -k systemctl_exec", but nothing was logged for this morning. We also searched the "/var/log/kern.log*" files for the word "kill", but there was nothing in there either.
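For completeness, the searches were along these lines (ausearch's --start option accepts keywords such as today and yesterday):

ausearch -k kill_mysql --start today
ausearch -k systemctl_exec --start today
grep -i kill /var/log/kern.log*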
We can't search our journal logs because they only go back as far as 03:00 this morning. We have no idea why these keep disappearing, but again this happens on all the backup servers and on any server hosting more than around 10 websites.
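One thing we still need to rule out on the journal side is whether journald is simply running with volatile storage or a small size cap; these are standard systemd checks and settings, nothing Enhance-specific:

journalctl --disk-usage                             # how much journal data is actually being kept
grep -E 'Storage|SystemMaxUse' /etc/systemd/journald.conf

# to keep the journal across the gaps we are seeing, in /etc/systemd/journald.conf:
# [Journal]
# Storage=persistent
# SystemMaxUse=1G
# then: systemctl restart systemd-journald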
So, even though nothing was logged, it could look like the OOM killer did it, but as far as we can see from the top snapshots just before and after the event, the server memory was fine, wasn't it?
This particular server has 8 vCPUs and 32 GiB of memory. Here is the output from top at 02:43:01 (just before MySQL was killed) and at 02:44:01 (just after), with the kill itself happening at 02:43:49.
top - 02:43:01 up 3 days, 23:34, 0 user, load average: 0.71, 0.46, 0.51
Tasks: 379 total, 1 running, 378 sleeping, 0 stopped, 0 zombie
%Cpu(s): 6.0 us, 9.5 sy, 0.0 ni, 84.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 32040.5 total, 1058.5 free, 11942.6 used, 24687.4 buff/cache
MiB Swap: 4096.0 total, 4093.7 free, 2.3 used. 20097.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2231089 root 20 0 38660 22256 14100 D 18.2 0.1 0:00.02 cpguard-job-logs::fetchlogs
2231058 root 20 0 12332 5652 3476 R 9.1 0.0 0:00.01 /usr/bin/top -cSb -n 1
1 root 20 0 23868 14624 9256 S 0.0 0.0 241:17.06 /sbin/init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 [kthreadd]
3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [pool_workqueue_release]
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [kworker/R-rcu_gp]
5 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [kworker/R-sync_wq]
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [kworker/R-slub_flushwq]
7 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [kworker/R-netns]
9 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [kworker/0:0H-events_highpri]
12 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [kworker/R-mm_percpu_wq]
13 root 20 0 0 0 0 I 0.0 0.0 0:00.00 [rcu_tasks_rude_kthread]
14 root 20 0 0 0 0 I 0.0 0.0 0:00.00 [rcu_tasks_trace_kthread]
!---------------------------------------- vmstat 1 4
procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st gu
0 1 2352 1074876 1824968 23454972 0 0 95 1825 3276 5 3 1 95 1 0 0
0 0 2352 1076172 1824968 23454936 0 0 0 416 1757 2263 0 0 99 1 0 0
1 0 2352 1066336 1824968 23454932 0 0 0 8 2485 3553 3 1 96 0 0 0
0 0 2352 1085316 1824968 23454964 0 0 0 1684 3055 4422 3 1 96 0 0 0
top - 02:44:01 up 3 days, 23:35, 0 user, load average: 0.40, 0.42, 0.49
Tasks: 392 total, 4 running, 388 sleeping, 0 stopped, 0 zombie
%Cpu(s): 47.3 us, 20.9 sy, 0.0 ni, 31.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 32040.5 total, 2468.3 free, 10248.9 used, 24971.2 buff/cache
MiB Swap: 4096.0 total, 4093.7 free, 2.3 used. 21791.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2231361 xxxxxxx+ 20 0 185172 89096 30532 R 100.0 0.3 0:00.20 php /usr/bin/wp-cli cron event run --due-now --path=public_html
2231363 yyyyyyy+ 20 0 181076 85384 30904 R 100.0 0.3 0:00.20 php /usr/bin/wp-cli cron event run --due-now --path=public_html
2231362 zzzzzzz+ 20 0 154872 81152 30672 R 90.9 0.2 0:00.20 php /usr/bin/wp-cli cron event run --due-now --path=public_html --quiet
3382 xxxxxxx+ 20 0 65336 17272 9784 S 9.1 0.1 4:35.63 /usr/bin/redis-server 127.0.0.1:6379
2229969 root 20 0 0 0 0 D 9.1 0.0 0:00.05 [kworker/u32:0+events_unbound]
2231261 mysql 20 0 2950344 603808 38272 S 9.1 1.8 0:01.09 /usr/sbin/mysqld
2231393 root 20 0 38624 22224 14184 D 9.1 0.1 0:00.01 cpguard-job-logs::fetchlogs
1 root 20 0 23836 14624 9256 S 0.0 0.0 6,29 /sbin/init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.10 [kthreadd]
3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 [pool_workqueue_release]
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [kworker/R-rcu_gp]
5 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [kworker/R-sync_wq]
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 [kworker/R-slub_flushwq]
!---------------------------------------- vmstat 1 4
procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st gu
4 0 2352 2440068 1828244 23743340 0 0 97 1825 3276 5 3 1 95 1 0 0
0 0 2352 2590880 1828248 23743784 0 0 500 1780 6258 5815 17 3 79 2 0 0
0 0 2352 2563720 1828248 23743784 0 0 0 40 1676 2018 2 1 98 0 0 0
0 0 2352 2563468 1828248 23743784 0 0 0 0 1202 1833 0 0 99 0 0 0
So, this server had plenty of available memory (roughly 20 GiB "avail Mem") before MySQL was killed, so what am I missing?
Since this morning I have also added the following audit rule to try and catch low-level kill() syscalls:
-a always,exit -F arch=b32 -S kill -F a1=9 -k kill_signal9
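We realise a b32-only rule will miss callers using the 64-bit syscall ABI (which on these servers is most of them), so we are assuming we also need the b64 variant:

-a always,exit -F arch=b64 -S kill -F a1=9 -k kill_signal9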
I am starting to suspect that something in the cgroups setup is causing the server to "think" it is out of memory when it isn't, but how do I prove or disprove that?
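The first checks we can think of, assuming cgroup v2 and the default systemd slice layout, are the MySQL unit's cgroup limits and OOM counters, plus whether the userspace OOM killer is even running:

cat /sys/fs/cgroup/system.slice/mysql.service/memory.events   # oom / oom_kill counters for the MySQL cgroup
cat /sys/fs/cgroup/system.slice/mysql.service/memory.max      # any hard memory limit on the cgroup
systemctl show mysql.service -p MemoryMax -p MemoryHigh       # what systemd thinks the limits are
systemctl is-active systemd-oomd                              # is systemd-oomd running and able to send SIGKILLs?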
Any thoughts or suggestions are welcome as we are scratching our heads here!