Thursday, July 28, 2016

linux - Can someone explain the "use-cases" for the default munin graphs?



When munin is installed, it activates a default set of plugins (at least on Ubuntu). Alternatively, you can run munin-node-configure to find out which plugins are supported on your system. Most of these plugins plot straightforward data. My question is not about the nature of the data (well... maybe for some), but about what you look for in these graphs.



It is easy to install munin and see fancy graphs. But having the graphs and not being able to "read" them renders them totally useless.



I am going to list the standard plugins that are enabled by default on my system, so it's going to be a long list. For completeness, I am also going to list the plugins I think I understand, with a short explanation of what I think each one is used for. Please correct me if I am wrong about any of them.




So let me split this question into three parts:




  • Plugins where I don't even understand the data

  • Plugins where I understand the data but don't know what I should look out for

  • Plugins which I think I understand



Plugins where I don't even understand the data




These may contain questions that are not necessarily aimed at munin alone. Not understanding the data usually means a gap in fundamental knowledge of operating systems/hardware... ;) Feel free to respond with a "giyf" answer.



These are plugins where I can only guess what's going on, and I would rather not read these graphs by guessing...




  • Disk IOs per device (IOs/second)
    What's an IO? I know it stands for input/output. But that's as far as it goes.

  • Disk latency per device (Average IO wait)
    Not a clue what an "IO wait" is...

  • IO Service Time
    This one is a huge mess, and it's near impossible to see something in the graph at all.




Plugins where I understand the data but don't know what I should look out for




  • IOStat (blocks/second read/written)
    I assume, the thing to look out for in here are spikes? Which would mean that the device is in heavy use?

  • Available entropy (bytes)
    I assume that this is important for random number generation? Why would I graph this? So far the value has always been near constant.

  • VMStat (running/I/O sleep processes)
    What's the difference between this one and the "processes" graph? Both show running/sleeping processes, whereas the "Processes" graph seems to have more details.

  • Disk throughput per device (bytes/second read/written)
    What's the difference between this one and the "IOStat" graph?

  • inode table usage
    What should I look for in this graph?




Plugins which I think I understand



I'll be guessing some things here... correct me if I am wrong.




  • Disk usage in percent (percent)
    How much disk space is used/remaining. As this approaches 100%, you should consider cleaning up or extending the partition. This is extremely important for the root partition.

  • Firewall Throughput (packets/second)
    The number of packets passing through the firewall. If this is spiking for a longer period of time, it could be a sign of a DoS attack (or we are simply receiving a large file). It can also give you an idea about your firewall performance. If it's levelling out and you need more "power", you should consider load balancing. If it's levelling out and you see a correlation with your CPU load, it could also mean that your hardware is not fast enough. Correlations with disk usage could point to excessive LOG targets in your FW config.

  • eth0 errors (packets in/out)
    Network errors. If this value is increasing, it could be a sign of faulty hardware.

  • eth0 traffic (bits/second in/out)
    Raw network traffic. This should correlate with Firewall throughput.

  • number of threads
    An ever-increasing value might point to a process not properly closing threads. Investigate!


  • processes
    Breakdown of active processes (including sleeping). A quick spike in here might point to a fork-bomb. A slowly but steadily increasing value might point to an application spawning sub-processes but not properly closing them. Investigate using ps faux.

  • process priority
    This shows the distribution of process priorities. Having only high-priority processes is not of much use. Consider de-prioritizing some.

  • cpu usage
    Fairly straightforward. If this is spiking, you may have an attack going on, or a process is hogging the CPU. If it's slowly increasing and approaching max in normal operations, you should consider upgrading your hardware (or load-balancing).

  • file table usage
    Number of actively open files. If this is reaching max, you may have a process opening, but not properly releasing files.

  • load average
    Shows a summarized value for the system load. Should correlate with CPU usage. Increasing values can come from a number of sources. Look for correlations with other graphs.

  • memory usage
    A graphical representation of your memory. As long as you have a lot of unused+cache+buffers you are fine.

  • swap in/out
    Shows the activity on your swap partition. This should always be 0. If you see activity on this, you should add more memory to your machine!


Answer





Disk IOs per device (IOs/second)




With traditional hard drives this is a very important number. An I/O operation is a single read or write request to the disk. With rotational drives you can expect anywhere from a few dozen to perhaps 200 I/O operations per second (IOPS), depending on the disk speed and the usage pattern.



That is not all there is to it: modern operating systems have I/O schedulers that try to merge several I/O requests into one and speed things up that way. RAID controllers and similar hardware also do some smart I/O request reordering.
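As a rough illustration of where an IOs/second figure can come from, here is a minimal sketch that derives read and write IOPS from two snapshots in the /proc/diskstats layout. It is not the munin plugin's code; the function names and the sample lines are made up for this example, and it assumes the usual Linux field order (device name in column 3, reads completed in column 4, writes completed in column 8).

```python
def parse_diskstats(text):
    """Return {device: (reads_completed, writes_completed)} from diskstats text."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or malformed lines
        # fields[2] is the device name, fields[3] the reads completed and
        # fields[7] the writes completed (both cumulative since boot).
        counters[fields[2]] = (int(fields[3]), int(fields[7]))
    return counters


def iops(before, after, interval_s):
    """Return {device: (read_iops, write_iops)} between two snapshots."""
    rates = {}
    for device, (reads1, writes1) in after.items():
        if device not in before:
            continue  # device appeared between the two snapshots
        reads0, writes0 = before[device]
        rates[device] = ((reads1 - reads0) / interval_s,
                         (writes1 - writes0) / interval_s)
    return rates


# Made-up sample lines in /proc/diskstats layout, taken 10 seconds apart.
SAMPLE_T0 = "8 0 sda 1000 50 80000 4000 2000 100 160000 12000 0 9000 16000"
SAMPLE_T1 = "8 0 sda 1600 60 128000 7000 2300 120 184000 15000 0 11000 22000"

if __name__ == "__main__":
    print(iops(parse_diskstats(SAMPLE_T0), parse_diskstats(SAMPLE_T1), 10))
```

On a real system you would read /proc/diskstats twice with a known pause in between instead of using the sample strings.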




Disk latency per device (Average IO wait)





How long it took from issuing an I/O request to an individual disk until the data actually came back. If this hovers around a couple of milliseconds, you are OK; if it is dozens of ms, your disk subsystem is starting to sweat; if it is hundreds of ms or more, you are in big trouble, or at least have a very, very slow system.
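To make those rules of thumb concrete, here is a small sketch that computes an average wait per request from two counter readings and sorts it into the three bands described above. The exact cut-offs (10 ms and 100 ms) are my own assumption standing in for "a couple", "dozens" and "hundreds" of milliseconds, and the function names are hypothetical.

```python
def average_wait_ms(ios_before, ms_before, ios_after, ms_after):
    """Average milliseconds spent per completed I/O between two readings.

    ios_* are cumulative completed requests, ms_* the cumulative
    milliseconds spent on them (reads and writes added together).
    """
    completed = ios_after - ios_before
    if completed <= 0:
        return 0.0  # idle disk: nothing completed, so nothing waited
    return (ms_after - ms_before) / completed


def describe_latency(wait_ms, ok_below_ms=10.0, trouble_from_ms=100.0):
    """Map a latency to a rough verdict; the thresholds are assumptions."""
    if wait_ms < ok_below_ms:
        return "ok"
    if wait_ms < trouble_from_ms:
        return "sweating"
    return "trouble"


if __name__ == "__main__":
    wait = average_wait_ms(3000, 16000, 3900, 22000)
    print(round(wait, 2), describe_latency(wait))
```

Adjust the thresholds to your hardware; an SSD and a busy rotational array will have very different "normal" values.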




IO Service Time




How your disk subsystem (possibly containing lots of disks) is performing overall.





IOStat (blocks/second read/written)




How many disk blocks were read/written per second. Look for spikes and also at the average. If the average starts to approach the maximum throughput of your disk subsystem, it's time to plan a performance upgrade. Actually, plan it well before that point.




Available entropy (bytes)




Some applications want "true" random data. The kernel gathers that "true" randomness from several sources, such as keyboard and mouse activity, a hardware random number generator found on many motherboards, or even video or audio input (video-entropyd and audio-entropyd can do that).




If your system runs out of entropy, the applications that want it stall until more becomes available. I have personally seen this happen with the Cyrus IMAP daemon and its POP3 service; it generated a long random string before each login, and on a busy server that consumed the entropy pool very quickly.



One way to get rid of that problem is to switch the applications to use only semi-random data (/dev/urandom), but that is outside the scope of this question.
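If you want to check the current value by hand rather than wait for the graph, a minimal sketch could look like this. It assumes a Linux system exposing /proc/sys/kernel/random/entropy_avail (which, as far as I know, reports bits rather than bytes); the 200-bit warning threshold is an arbitrary value picked for illustration.

```python
ENTROPY_PATH = "/proc/sys/kernel/random/entropy_avail"


def read_entropy_bits(path=ENTROPY_PATH):
    """Read the kernel's available-entropy estimate from the given file."""
    with open(path) as handle:
        return int(handle.read().strip())


def entropy_is_low(bits, threshold_bits=200):
    """True if the pool is below the (assumed) warning threshold."""
    return bits < threshold_bits


if __name__ == "__main__":
    bits = read_entropy_bits()
    print(bits, "LOW" if entropy_is_low(bits) else "fine")
```

A value that sits near the bottom for long stretches while a service is busy is the pattern to look for in the graph.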




VMStat (running/I/O sleep processes)




I have not thought about this one before, but I would guess that it shows whether processes are running or sleeping on I/O, and in particular whether that I/O is blocking them.
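As a hedged illustration of the two numbers involved, the sketch below pulls the "running" and "blocked on I/O" process counts out of text in /proc/stat format. Whether the munin plugin reads them from there or from the vmstat command is not something I have checked; the sample text and the function name are made up.

```python
def running_and_blocked(proc_stat_text):
    """Return (procs_running, procs_blocked) parsed from /proc/stat text."""
    values = {"procs_running": 0, "procs_blocked": 0}
    for line in proc_stat_text.splitlines():
        parts = line.split()
        # The two lines of interest look like "procs_running 3".
        if len(parts) == 2 and parts[0] in values:
            values[parts[0]] = int(parts[1])
    return values["procs_running"], values["procs_blocked"]


# Made-up excerpt in /proc/stat layout.
SAMPLE_PROC_STAT = """cpu  100 0 50 1000 20 0 5 0 0 0
ctxt 123456
processes 4321
procs_running 3
procs_blocked 1
"""

if __name__ == "__main__":
    print(running_and_blocked(SAMPLE_PROC_STAT))
```

A blocked count that stays above zero together with high disk latency would suggest processes are stuck waiting for the disks.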





Disk throughput per device (bytes/second read/written)




This is purely bytes read/written per second, which is usually a more human-readable form than blocks, whose size may vary. The block size can differ because of the disks used, the file system (and its settings), and so on. Sometimes the block size is 512 bytes, other times 4096 bytes, sometimes something else.
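A tiny worked example shows why the same blocks/second figure can mean very different throughput. The function name is hypothetical, and the block sizes are just the two mentioned above.

```python
def bytes_per_second(blocks_delta, block_size_bytes, interval_s):
    """Convert a block count over an interval into bytes per second."""
    return blocks_delta * block_size_bytes / interval_s


if __name__ == "__main__":
    # The same 1000 blocks in one second, with two different block sizes.
    for size in (512, 4096):
        print(size, bytes_per_second(1000, size, 1))
```

With 512-byte blocks that is 512000 bytes/second; with 4096-byte blocks it is eight times as much.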




inode table usage





With file systems that allocate inodes dynamically (such as XFS), nothing. With file systems that have a static inode table (such as ext3), everything. If you have a combination of static inodes, a huge file system and a huge number of directories and small files, you might run into a situation where you cannot create more files on that partition, even though in theory there is lots of free space left. No free inodes == bad.
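To check inode usage for a single mount point by hand, something along these lines should work on a Unix-like system. It is a sketch using Python's os.statvfs; the function names are made up, and treating a reported total of zero inodes as "0% used" is my own assumption for file systems that do not expose a fixed count.

```python
import os


def inode_usage_percent(total_inodes, free_inodes):
    """Percentage of inodes in use, or 0.0 when no total is reported."""
    if total_inodes <= 0:
        return 0.0  # assumption: no fixed inode table to run out of
    return 100.0 * (total_inodes - free_inodes) / total_inodes


def inode_usage_for_path(path="/"):
    """Look up inode usage for the file system containing path."""
    stats = os.statvfs(path)
    # f_files is the total number of inodes, f_ffree the free ones.
    return inode_usage_percent(stats.f_files, stats.f_ffree)


if __name__ == "__main__":
    print(round(inode_usage_for_path("/"), 1), "% of inodes used on /")
```

A value creeping towards 100% while disk space usage stays low is the situation described above.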

