Tuesday, July 25, 2017

redhat - ext4: Online resize not detected




On a RedHat 6 server, we ran into an issue with online resizing of an ext4 filesystem.



With only /dev/sda in the volume group we had 13GB available, but one logical volume of 36GB needed 20GB more. We added /dev/sdb to the volume group, extended the logical volume (lvextend) and resized the file system (resize2fs) to 56GB.
There were no error messages during the resize, and the OS reported the new size.
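
The steps above could be sketched roughly as follows. The volume group and logical volume names (vg_data, lv_web) are hypothetical, since the post does not name them, and the pvcreate step is an assumption about how the new disk was prepared. With DRY_RUN left at its default of 1, the function only prints the commands.

```shell
#!/bin/sh
# Sketch of the extension steps; vg_data and lv_web are hypothetical names.
# DRY_RUN=1 (the default) only prints each command instead of running it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

extend_lv() {  # usage: extend_lv <disk> <vg> <lv> <extra-size>
    run pvcreate "$1"                    # assumed: initialise the new disk as a physical volume
    run vgextend "$2" "$1"               # add the disk to the volume group
    run lvextend -L "+$4" "/dev/$2/$3"   # grow the logical volume
    run resize2fs "/dev/$2/$3"           # grow the file system to fill the logical volume
}

# Example (dry run): extend_lv /dev/sdb vg_data lv_web 20G
```

Running it once as a dry run first makes it easy to review the exact device paths before anything is changed.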



The logical volume in question hosts an installation of IBM HTTP Server (Apache 2.2), plus config and log files for some 8 different web servers.



This morning the file system usage grew beyond 36GB.
The first thing that happened was that the web servers stopped logging (discovered afterwards), while they kept on running without issues.
2.5 hours later, in connection with log rotation and some other writes to the file system, things started to freeze up.

That is: the web servers stopped taking traffic, although the processes stayed up, and trying to "tail" a log file would hang and could not be interrupted.
The load on the server went from 0.10 to 4000 (yes...), apparently mostly due to iowait.



The solution was to shut down the web servers (kill -9 was the only way) and reboot the server. We then unmounted the file system, ran fsck (no errors), and started things up again.
No issues since.



We can time the point where logging stopped exactly to the moment the disk (LV) usage grew above its previous size of 36GB.



Services on other file systems, among them the operating system, seemed to work fine.




In /var/log/messages we saw, for example:



kernel: INFO: task httpd: blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: httpd D 0000000000000001 0 6889 6865 0x00000080
kernel: ffff88023aa99c88 0000000000000086 0000000000000000 0000000000006102
kernel: ffff88010aebaa80 ffff880105dd0ae0 000000003aa99c08 ffff880105dd0ae0
kernel: ffff880105dd1098 ffff88023aa99fd8 000000000000fb88 ffff880105dd1098
kernel: Call Trace:
kernel: [] __mutex_lock_slowpath+0x13e/0x180
kernel: [] mutex_lock+0x2b/0x50
kernel: [] generic_file_aio_write+0x71/0x100
kernel: [] ext4_file_write+0x61/0x1e0 [ext4]
kernel: [] do_sync_write+0xfa/0x140
kernel: [] ? autoremove_wake_function+0x0/0x40
kernel: [] ? security_file_permission+0x16/0x20
kernel: [] vfs_write+0xb8/0x1a0
kernel: [] sys_write+0x51/0x90
kernel: [] ? __audit_syscall_exit+0x265/0x290
kernel: [] system_call_fastpath+0x16/0x1b
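
To see quickly which processes are stuck, the hung-task messages can be counted per task name. This is a minimal sketch, not something we ran at the time. The function name hung_tasks is made up for illustration, and the pattern accepts both the "task httpd:" form quoted above and the "task name:pid" form.

```shell
#!/bin/sh
# Count "blocked for more than N seconds" kernel messages per task name.
# Prints lines in the format produced by "uniq -c", most frequent first.
hung_tasks() {  # usage: hung_tasks <logfile>
    grep 'blocked for more than' "$1" \
        | sed -n 's/.*INFO: task \([^:]*\):.*blocked for more than.*/\1/p' \
        | sort | uniq -c | sort -rn
}

# Example: hung_tasks /var/log/messages
```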



Versions:



Kernel: 2.6.32-358.2.1.el6.x86_64
lvm2-2.02.98-9.el6.x86_64
e2fsprogs-1.41.12-14.el6.x86_64


No issues were found with the underlying hardware.



Answer



The answer is:
The file system was created with mke2fs.



The default behaviour is then to create an ext2 file system.
However, it was mounted as an ext4 file system, without any error messages, and was later perceived as an ext4 file system.
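
One way to see what was actually created is to look at the feature list that tune2fs -l prints. The sketch below is a rough heuristic only: the function names are made up for illustration, and it checks just a few well-known feature flags. To get ext4 from mke2fs, the type has to be requested explicitly, for example with mke2fs -t ext4 or mkfs.ext4.

```shell
#!/bin/sh
# Print the feature list of a file system (needs read access to the device).
fs_features() {  # usage: fs_features <device>
    tune2fs -l "$1" | sed -n 's/^Filesystem features:[[:space:]]*//p'
}

# Rough guess at the ext version from a space-separated feature list.
# Heuristic: ext4-only features -> ext4, a journal -> ext3, otherwise ext2.
guess_ext_version() {  # usage: guess_ext_version "<feature list>"
    case " $1 " in
        *" extent "*|*" flex_bg "*|*" huge_file "*) echo ext4 ;;
        *" has_journal "*) echo ext3 ;;
        *) echo ext2 ;;
    esac
}

# Example: guess_ext_version "$(fs_features /dev/vg_data/lv_web)"   # hypothetical device
```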



So it is no wonder that online resizing did not work properly, and no wonder the extended portion was recognized after an unmount/mount or reboot.



It took some time to discover, since a long time had passed between the creation and the resizing. It was finally discovered when running blkid, which said "ext2". tune2fs -l also said "not clean".
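
A check like the following could catch this kind of mismatch before a resize. It is a sketch under assumptions: the function names are made up for illustration, blkid usually needs root to read the device, and the device and mount point in the example are hypothetical.

```shell
#!/bin/sh
# Type the kernel is using for a mount point, read from a /proc/mounts-style file.
mounted_type() {  # usage: mounted_type <mountpoint> [mounts-file]
    awk -v mp="$1" '$2 == mp { t = $3 } END { if (t != "") print t }' "${2:-/proc/mounts}"
}

# Type found on disk, as reported by blkid.
ondisk_type() {  # usage: ondisk_type <device>
    blkid -o value -s TYPE "$1"
}

# Compare the two and complain if they differ.
check_fs_type() {  # usage: check_fs_type <device> <mountpoint>
    local disk mounted
    disk="$(ondisk_type "$1")"
    mounted="$(mounted_type "$2")"
    if [ "$disk" != "$mounted" ]; then
        echo "MISMATCH: $1 is $disk on disk but mounted as $mounted"
        return 1
    fi
    echo "OK: $1 is $disk"
}

# Example: check_fs_type /dev/vg_data/lv_web /srv/web
```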


