linux - Cross-platform, human-readable, du on root partition that truly ignores other filesystems

I made this way too complicated before. I believe that these commands are actually the simpler way, while still formatting everything nicely.

    RHEL 5
    du -x / | sort -n |cut -d\/ -f1-2|sort -k2 -k1,1nr|uniq -f1|sort -n|tail -10|cut -f2|xargs du -sxh

    Solaris 10
    du -d / | sort -n |cut -d\/ -f1-2|sort -k2 -k1,1nr|uniq -f1|sort -n|tail -10|cut -f2|xargs du -sdh

RHEL5

du -x /|egrep -v "$(echo $(df|awk '{print $1 "\n" $5 "\n" $6}'|cut -d\/ -f2-5|egrep -v "[0-9]|^$|Filesystem|Use|Available|Mounted|blocks|vol|swap")|sed 's/ /\|/g')"|egrep -v "proc|sys|media|selinux|dev|platform|system|tmp|tmpfs|mnt|kernel"|cut -d\/ -f1-3|sort -k2 -k1,1nr|uniq -f1|sort -k1,1n|cut -f2|xargs du -sxh|egrep "G|[5-9][0-9]M|[1-9][0-9][0-9]M"|sed '$d'

Solaris

du -d /|egrep -v "$(echo $(df|awk '{print $1 "\n" $5 "\n" $6}'|cut -d\/ -f2-5|egrep -v "[0-9]|^$|Filesystem|Use|Available|Mounted|blocks|vol|swap")|sed 's/ /\|/g')"|egrep -v "proc|sys|media|selinux|dev|platform|system|tmp|tmpfs|mnt|kernel"|cut -d\/ -f1-3|sort -k2 -k1,1nr|uniq -f1|sort -k1,1n|cut -f2|xargs du -sdh|egrep "G|[5-9][0-9]M|[1-9][0-9][0-9]M"|sed '$d'

This will return directories over 50mb within "/" file system in ascending, reursive, human readable format, and in a reasonably fast amount of time.

Request: Can you help make this one-liner more effective, faster, or efficient? How about more elegant? If you understand what I did there then please read on.

The problem is that it can be difficult to quickly discern what directories contained under the "/" directory are contributing to "/" filesystem capaciy because all other filesystems fall under root.

This will exclude all non / filesystems when running du on Solaris 10 or Red Hat el5 by basically munging an egrepped df from a second pipe-delimited egrep regex subshell exclusion that is naturally further excluded upon by a third egrep in what I would like to refer to as "the whale." The munge-fest frantically escalates into some xargs du recycling where du -x/-d is actually useful (see bottom comments), and a final, gratuitous egrep spits out a list of relevant, high-capacity directories that are exclusively contained within the "/" filesystem, but not within other mounted filesystems. It is very sloppy.

pwd = /

du *|egrep -v "$(echo $(df|awk '{print $1 "\n" $5 "\n" $6}'|cut -d\/ -f2-5|egrep -v "[0-9]|^$|Filesystem|Use|Available|Mounted|blocks|vol|swap")|sed 's/ /\|/g')"|egrep -v "proc|sys|media|selinux|dev|platform|system|tmp|tmpfs|mnt|kernel"|cut -d\/ -f1-2|sort -k2 -k1,1nr|uniq -f1|sort -k1,1n|cut -f2|xargs du -shx|egrep "G|[5-9][0-9]M|[1-9][0-9][0-9]M"

This is running against these filesystems:

            Linux builtsowell 2.6.18-274.7.1.el5 #1 SMP Mon Oct 17 11:57:14 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

            df -kh


            Filesystem            Size  Used Avail Use% Mounted on
            /dev/mapper/mpath0p2  8.8G  8.7G  90M   99% /
            /dev/mapper/mpath0p6  2.0G   37M  1.9G   2% /tmp
            /dev/mapper/mpath0p3  5.9G  670M  4.9G  12% /var
            /dev/mapper/mpath0p1  494M   86M  384M  19% /boot
            /dev/mapper/mpath0p7  7.3G  187M  6.7G   3% /home
            tmpfs                  48G  6.2G   42G  14% /dev/shm
            /dev/mapper/o10g.bin   25G  7.4G   17G  32% /app/SIP/logs
            /dev/mapper/o11g.bin   25G   11G   14G  43% /o11g
            tmpfs                 4.0K     0  4.0K   0% /dev/vx

            lunmonster1q:/vol/oradb_backup/epmxs1q1
                                  686G  507G  180G  74% /rpmqa/backup
            lunmonster1q:/vol/oradb_redo/bisxs1q1
                                  4.0G  1.6G  2.5G  38% /bisxs1q/rdoctl1
            lunmonster1q:/vol/oradb_backup/bisxs1q1
                                  686G  507G  180G  74% /bisxs1q/backup
            lunmonster1q:/vol/oradb_exp/bisxs1q1
                                  2.0T  1.1T  984G  52% /bisxs1q/exp
            lunmonster2q:/vol/oradb_home/bisxs1q1
                                   10G  174M  9.9G   2% /bisxs1q/home

            lunmonster2q:/vol/oradb_data/bisxs1q1
                                   52G  5.2G   47G  10% /bisxs1q/oradata
            lunmonster1q:/vol/oradb_redo/bisxs1q2
                                  4.0G  1.6G  2.5G  38% /bisxs1q/rdoctl2
            ip-address1:/vol/oradb_home/cspxs1q1
                                   10G  184M  9.9G   2% /cspxs1q/home
            ip-address2:/vol/oradb_backup/cspxs1q1
                                  674G  314G  360G  47% /cspxs1q/backup
            ip-address2:/vol/oradb_redo/cspxs1q1
                                  4.0G  1.5G  2.6G  37% /cspxs1q/rdoctl1

            ip-address2:/vol/oradb_exp/cspxs1q1
                                  4.1T  1.5T  2.6T  37% /cspxs1q/exp
            ip-address2:/vol/oradb_redo/cspxs1q2
                                  4.0G  1.5G  2.6G  37% /cspxs1q/rdoctl2
            ip-address1:/vol/oradb_data/cspxs1q1
                                  160G   23G  138G  15% /cspxs1q/oradata
            lunmonster1q:/vol/oradb_exp/epmxs1q1
                                  2.0T  1.1T  984G  52% /epmxs1q/exp
            lunmonster2q:/vol/oradb_home/epmxs1q1
                                   10G   80M   10G   1% /epmxs1q/home

            lunmonster2q:/vol/oradb_data/epmxs1q1
                                  330G  249G   82G  76% /epmxs1q/oradata
            lunmonster1q:/vol/oradb_redo/epmxs1q2
                                  5.0G  609M  4.5G  12% /epmxs1q/rdoctl2
            lunmonster1q:/vol/oradb_redo/epmxs1q1
                                  5.0G  609M  4.5G  12% /epmxs1q/rdoctl1
            /dev/vx/dsk/slaxs1q/slaxs1q-vol1
                                  183G   17G  157G  10% /slaxs1q/backup
            /dev/vx/dsk/slaxs1q/slaxs1q-vol4
                                  173G   58G  106G  36% /slaxs1q/oradata

            /dev/vx/dsk/slaxs1q/slaxs1q-vol5
                                   75G  952M   71G   2% /slaxs1q/exp
            /dev/vx/dsk/slaxs1q/slaxs1q-vol2
                                  9.8G  381M  8.9G   5% /slaxs1q/home
            /dev/vx/dsk/slaxs1q/slaxs1q-vol6
                                  4.0G  1.6G  2.2G  42% /slaxs1q/rdoctl1
            /dev/vx/dsk/slaxs1q/slaxs1q-vol3
                                  4.0G  1.6G  2.2G  42% /slaxs1q/rdoctl2
            /dev/mapper/appoem     30G  1.3G   27G   5% /app/em

This is the result:

Linux:

            54M     etc/gconf
            61M     opt/quest
            77M     opt
            118M    usr/  ##===\
            149M    etc

            154M    root
            303M    lib/modules
            313M    usr/java  ##====\
            331M    lib
            357M    usr/lib64  ##=====\
            433M    usr/lib  ##========\
            1.1G    usr/share  ##=======\
            3.2G    usr/local  ##========\
            5.4G    usr   ##<=============Ascending order to parent
            94M     app/SIP   ##<==\

            94M     app   ##<=======Were reported as 7gb and then corrected by second du with -x.

=============================================

pwd = /

This is running against these filesystems:

            SunOS solarious 5.10 Generic_147440-19 sun4u sparc SUNW,SPARC-Enterprise

            Filesystem             size   used  avail capacity  Mounted on
             kiddie001Q_rpool/ROOT/s10s_u8wos_08a  8G   7.7G    1.3G    96%    / 
            /devices                 0K     0K     0K     0%    /devices
            ctfs                     0K     0K     0K     0%    /system/contract

            proc                     0K     0K     0K     0%    /proc
            mnttab                   0K     0K     0K     0%    /etc/mnttab
            swap                    15G   1.8M    15G     1%    /etc/svc/volatile
            objfs                    0K     0K     0K     0%    /system/object
            sharefs                  0K     0K     0K     0%    /etc/dfs/sharetab
            fd                       0K     0K     0K     0%    /dev/fd
            kiddie001Q_rpool/ROOT/s10s_u8wos_08a/var    31G   8.3G   6.6G    56%    /var
            swap                   512M   4.6M   507M     1%    /tmp
            swap                    15G    88K    15G     1%    /var/run
            swap                    15G     0K    15G     0%    /dev/vx/dmp

            swap                    15G     0K    15G     0%    /dev/vx/rdmp
            /dev/dsk/c3t4d4s0   3   20G   279G    41G    88%    /fs_storage
            /dev/vx/dsk/oracle/ora10g-vol1  292G   214G    73G    75%     /o10g
            /dev/vx/dsk/oec/oec-vol1    64G    33G    31G    52%    /oec/runway
            /dev/vx/dsk/oracle/ora9i-vol1   64G    33G    31G   59%    /o9i
            /dev/vx/dsk/home     23G    18G   4.7G    80%    /export/home
            /dev/vx/dsk/dbwork/dbwork-vol1 292G   214G    73G    92%    /db03/wk01
            /dev/vx/dsk/oradg/ebusredovol   2.0G   475M   1.5G    24%    /u21
            /dev/vx/dsk/oradg/ebusbckupvol   200G    32G   166G    17%    /u31
            /dev/vx/dsk/oradg/ebuscrtlvol   2.0G   475M   1.5G    24%    /u20

            kiddie001Q_rpool       31G    97K   6.6G     1%    /kiddie001Q_rpool
            monsterfiler002q:/vol/ebiz_patches_nfs/NSA0304   203G   173G    29G    86%    /oracle/patches
            /dev/odm                 0K     0K     0K     0%    /dev/odm

This is the result:

Solaris:

            63M     etc

            490M    bb
            570M    root/cores.ric.20100415
            1.7G    oec/archive
            1.1G    root/packages
            2.2G    root
            1.7G    oec

==============

How could one more effectively deal with "/" aka "root" filesystem full issues across multiple platforms that have a devastating number of mounts?

On Red Hat el5, du -x apparently avoids traversal into other filesystems. While this may be so, it does not appear to do anything if run from the / directory.

On Solaris 10, the equivalent flag is du -d, which apparently packs no surprises.

(I'm hoping I've just been doing it wrong.)

Guess what? It's really slow.

Answer

Your problem, as I understand it, is that du is descending into other filesystems (some of which are network or SAN mounts, and take a long time to count up utilization on).

I respectfully submit that if you're trying to monitor filesystem utilization du is the wrong tool for the job. You want df (which you apparently know about since you included its output).

Parsing the output from df can help you target specific filesystems in which you should be running du to determine which directories are chewing up all your space (or if you're lucky the full filesystem has a specific responsible party who you can tell to figure it out for themselves). In either case at least you will know a filesystem is filling up before it's full (and the output is easier to parse).

In short: Run df first, then if you have to run du on any filesystem df identified as having utilization over (say) 85% to get more specific details.

Moving on into your script, the reason du isn't respecting your -d (or -x) flag is because of the question you're asking:

 # pwd   
 /
 # du * (. . .etc. . .)

You are asking du to run on everything under / -- du -x /bin /home /sbin /usr /tmp /var etc. -- du is then doing exactly what you asked (giving you the usage of each of those things. If one of the arguments happens to be a filesystem root du assumes you know what you're doing and give the usage of that filesystem up to the first sub-mount it finds.

This is critically different from du -x / ("Tell me about / and ignore any sub-mounts").

To fix your script *don't cd into the directory you are analyzing -- instead just run
du /path/to/full/disk | [whatever you want to feed the output through]

This (or any other suggestion you may get) doesn't solve your two core problems:

Your monitoring system is ad-hoc
If you want to catch problems before they bite you in the genitals you really need to deploy a decent monitoring platform. If you're having trouble getting your management team to buy into this remind them that proper monitoring lets you avoid downtime.

Your environment (as you've rightly surmised) is a mess
Not much to be done here except rebuild the thing - It's your job as the SA to stand up and make a very clear, very LOUD business case for why the systems need to be taken down one at a time and rebuilt with a structure that can be managed.

You seem to have a pretty decent handle on what needs to be done, but if you have questions by all means ask them and we'll try to help as much as we can (we can't do your architecture for you, but we can answer conceptual questions or practical "How do I do X with monitoring tool Y?" type stuff...

Blog

Saturday, December 6, 2014

linux - Cross-platform, human-readable, du on root partition that truly ignores other filesystems

No comments:

Post a Comment

linux - How to SSH to ec2 instance in VPC private subnet via NAT server