Context switching isn’t really involved here. The issue is not that the CPU is frequently switching between processes, but that when a container exceeds its I/O limits (whether IOPS or bandwidth), the kernel actively throttles its I/O, forcing the processes in that container to wait for their I/O operations to complete.
When throttling kicks in, those processes drive up I/O wait. Even though they are blocked, they remain resident in memory and continue to consume RAM, which can become a problem if memory is also tight. The key point, though, is that the high I/O wait is confined to the container (website) that hit its limit. Since each container is managed by its own cgroup, the high I/O wait and any associated load increase stay isolated to that container.
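For reference, here is a minimal sketch of what such a limit looks like at the cgroup v2 level. The cgroup path and the 8:0 (major:minor) device number are placeholders I made up for illustration; whatever container tooling you use normally manages the equivalent files for you.

```python
# Minimal sketch, assuming cgroup v2 and that each container has its own cgroup.
# The cgroup path and the "8:0" device number are hypothetical placeholders.
from pathlib import Path

CGROUP = Path("/sys/fs/cgroup/sites/example-site")  # hypothetical per-container cgroup

def set_io_limit(riops: int, wiops: int, device: str = "8:0") -> None:
    # Once this rule is written, the kernel throttles the cgroup's reads/writes
    # above the limit, and its processes sit in I/O wait until requests complete.
    (CGROUP / "io.max").write_text(f"{device} riops={riops} wiops={wiops}\n")

def clear_io_limit(device: str = "8:0") -> None:
    # "max" removes the cap, so the container is no longer throttled.
    (CGROUP / "io.max").write_text(f"{device} riops=max wiops=max\n")
```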
I believe that if you inspect resource usage inside the container, say by running top or checking cgroup-specific metrics, you’ll see a spike in I/O wait. On the host, top aggregates data across all processes, so you might only notice a slight increase overall, even though heavy throttling is happening in a single container.
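If you want to check this directly, here is a rough sketch that compares the container’s own I/O pressure (PSI) with the host-wide figure. It assumes cgroup v2 with pressure-stall information enabled, and the cgroup path is the same placeholder as above.

```python
# Minimal sketch, assuming cgroup v2 with pressure-stall information (PSI) available.
# The cgroup path is a hypothetical placeholder.
from pathlib import Path

CGROUP = Path("/sys/fs/cgroup/sites/example-site")

def some_io_pressure(psi_text: str) -> str:
    # The "some" line reports the share of time at least one task was stalled on I/O.
    for line in psi_text.splitlines():
        if line.startswith("some"):
            return line
    return ""

# Per-container stall time vs. the host-wide figure that top effectively averages away.
print("container:", some_io_pressure((CGROUP / "io.pressure").read_text()))
print("host     :", some_io_pressure(Path("/proc/pressure/io").read_text()))
```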
Unless your server is very I/O constrained, it’s often best to leave I/O limits (IOPS and bandwidth) disabled. On enterprise-grade drives, setting these limits can lead to unwanted throttling that causes performance problems (high iowait and increased load) for the container that hits the limit, without providing much real isolation benefit, since the drives are already fast enough to absorb the load.
I recommend focusing on CPU, RAM, and nproc limits instead, as they tend to be more effective. In many cases, limiting only RAM and nproc may be sufficient, since nearly every process uses memory and runaway process creation can be harmful.
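As a rough illustration of those limits at the cgroup v2 level (again, the cgroup path is a placeholder; LXC, Docker, systemd and similar front-ends expose the same knobs under their own names):

```python
# Minimal sketch, assuming cgroup v2. The cgroup path is a hypothetical placeholder.
from pathlib import Path

CGROUP = Path("/sys/fs/cgroup/sites/example-site")

def apply_limits(mem_bytes: int, max_procs: int, cpu_pct: int) -> None:
    (CGROUP / "memory.max").write_text(f"{mem_bytes}\n")   # hard RAM cap
    (CGROUP / "pids.max").write_text(f"{max_procs}\n")     # nproc-style process cap
    period = 100_000                                       # scheduler period in microseconds
    quota = period * cpu_pct // 100                        # CPU time allowed per period
    (CGROUP / "cpu.max").write_text(f"{quota} {period}\n")
    # io.max is deliberately left untouched, i.e. unlimited.

apply_limits(mem_bytes=512 * 1024 * 1024, max_procs=100, cpu_pct=50)
```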