Thursday, June 1, 2017

linux - How do we find out what is causing high cache usage on our server?



Every Friday from about 22:00 our server starts using large amounts of cache and then dies about two hours later. Please look at the following Cacti graph.



[Cacti graph: Memory Usage]



We tried hunting for processes that use large amounts of memory with https://raw.githubusercontent.com/pixelb/ps_mem/master/ps_mem.py, but all it shows is the following.



...
438.0 MiB + 1.1 MiB   = 439.1 MiB  XXXEngine XXX 961f4dbc-3b01-0000-0080-ff115176831d xxx
520.2 MiB + 1.7 MiB   = 521.9 MiB  XXXEngine XXX f2ac330c-3a01-0000-0080-a2adb5561889 xxx
 10.4 GiB + 829.0 KiB =  10.4 GiB  java -server -Xms1G -Xmx5G -Djava.net.preferIPv4Stack=true -cp ../lib/hazelcast-3.2.2.jar:../lib/xxx.cache.jar com.hazelcast.examples.StartServer (2)
---------------------------------
 28.1 GiB
=================================


This is nowhere near the 100G of cache, so we were thinking that Linux may be using that much memory for caching disk I/O, and we used atop to measure it. This is what we get when we run atop -r atop-20140919-230002-062979000.bin -d -D (the second listing below adds -c).



  PID     TID    RDDSK    WRDSK   WCANCL  DSK  CMD        1/405
    1       -   907.9G    17.0T     2.8T  97%  init
 6513       -   175.1G    46.1G     5.9G   1%  crond
 8842       -       8K   110.3G     128K   1%  xxxzmuc0
 6296       -     6.5G    25.1G    15.9G   0%  sshd
 4463       -    4668K    23.2G       0K   0%  kjournald
19681       -    1835K    22.5G    22.4G   0%  xxxtroker
 4469       -    4728K    15.2G       0K   0%  kjournald
 4475       -    4716K    14.9G       0K   0%  kjournald
 2401       -     588K    11.4G       0K   0%  kjournald
 8652       -     7.0G     2.6G     1.3G   0%  k6gagent
26093       -     9.5G       0K       0K   0%  bpbkar
...


And atop with option -c.



  PID   TID  S  DSK  COMMAND-LINE (horizontal scroll with <- and -> keys)        1/405
    1     -  S  97%  init [3]
 6513     -  S   1%  crond
 8842     -  S   1%  xxzmuc0 -m XXX
 6296     -  S   0%  /usr/sbin/sshd
 4463     -  S   0%  kjournald
19681     -  S   0%  xxxtroker XXX
 4469     -  S   0%  kjournald
 4475     -  S   0%  kjournald
 2401     -  S   0%  kjournald
 8652     -  S   0%  /opt/IBM/ITM/lx8266/6g/bin/k6gagent
26093     -  S   0%  bpbkar -r 2678400 -ru root -dt 0 -to 0 -clnt ...
...



So what I can see is that init has written 17 terabytes of data to disk, which seems like a lot, but I have no idea how to find out what is causing it. I was under the impression that Linux uses spare memory as cache to speed up disk operations, gives it back as soon as processes need it, and that it is therefore not possible to kill a server with cache alone.
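To check whether that memory really is reclaimable page cache, or something the kernel cannot give back (for example slab objects such as dentries and inodes), we could look at /proc/meminfo and slabtop. The following is only a rough sketch; the drop_caches write is purely a diagnostic and we would only run it with care on a production box.

# Break down where the memory goes: page cache vs. buffers vs. kernel slab
grep -E 'MemTotal|MemFree|Buffers|^Cached|Slab|Dirty|Writeback' /proc/meminfo

# Show the largest kernel slab caches (dentries, inodes, ...)
slabtop -o | head -20

# Diagnostic only: flush dirty pages, drop clean caches, then re-check.
# If "Cached" barely shrinks, the memory is not ordinary reclaimable cache.
sync
echo 3 > /proc/sys/vm/drop_caches
grep -E 'MemFree|^Cached|Slab' /proc/meminfo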



We are on "Red Hat Enterprise Linux Server release 5.5 (Tikanga)" Linux delirium 2.6.18-194.26.1.el5 #1 SMP Fri Oct 29 14:21:16 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux.



What should we do (next) to find out what's wrong?


Answer



I see that bpbkar is also active; I would look at that process. It is part of Symantec NetBackup, so check whether a backup is running at the time of the problems. Disable it and see whether the problem recurs when no backup is scheduled.
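One way to confirm that correlation (a rough sketch; the script name and log path are just examples) is to capture process and memory snapshots around the Friday 22:00 window and line them up against the Cacti graph afterwards:

#!/bin/sh
# bpbkar-watch.sh - hypothetical helper: log bpbkar activity and memory
# state once a minute so it can be compared with the Cacti graph later.
LOG=/var/tmp/bpbkar-watch.log        # example path, adjust as needed
while true; do
    date >> "$LOG"
    ps -eo pid,rss,args | grep '[b]pbkar' >> "$LOG"
    grep -E 'MemFree|^Cached|Slab|Dirty' /proc/meminfo >> "$LOG"
    sleep 60
done

Started from cron shortly before 22:00 on Friday, this shows whether the cache growth tracks the backup job.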



If bpbkar is the guilty process, enable full logging on it to find out where it is causing the problem. Also make sure the latest updates are installed for it, as they tend to fix a long list of known issues.
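On the client, NetBackup's legacy logging is normally enabled by creating a log directory named after the process and raising the verbosity. The paths and the VERBOSE value below are typical defaults and may differ per NetBackup version, so treat this as a sketch rather than exact instructions.

# Create the legacy log directory for bpbkar on the client; NetBackup only
# writes per-process logs when the matching directory exists.
mkdir -p /usr/openv/netbackup/logs/bpbkar
chmod 755 /usr/openv/netbackup/logs/bpbkar

# Raise client verbosity (5 is the most verbose level)
echo "VERBOSE = 5" >> /usr/openv/netbackup/bp.conf

# After the next backup window, check the dated log files written here
ls -l /usr/openv/netbackup/logs/bpbkar/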


