Sunday, January 3, 2016

debian - NFS client does not pick up server after restart



EDIT:




To summarize the issue, this is a problem with the NFS server changing IP address and the NFS clients not picking up the new address. I can see via tcpdump that the client still tries to contact the old IP address on port 2049.



We have several NFS mount points defined like this in /etc/fstab. As you can see, this is NFS v3.



storage-1:/data/medias/media /var/www/myproject/data/media nfs rsize=32768,wsize=32768,hard,intr,actimeo=300,nfsvers=3,async,noatime,sec=sys 0 0
storage-1:/data/medias/secure /var/www/myproject/web/secure nfs rsize=32768,wsize=32768,hard,intr,actimeo=300,nfsvers=3,async,noatime,sec=sys 0 0
storage-1:/data/tobeprocessed /var/www/myproject/data/tobeprocessed nfs rsize=32768,wsize=32768,hard,intr,actimeo=300,nfsvers=3,async,noatime,sec=sys 0 0
storage-1:/data/ftp /var/ftp nfs rsize=32768,wsize=32768,hard,intr,actimeo=300,nfsvers=3,async,noatime,sec=sys 0 0



When we restart the NFS server, we have to unmount and remount each mount point on the clients; otherwise they are unable to access the server. I have waited up to 5 minutes after the reboot before unmounting and remounting.



After a restart of the NFS server, a simple ls /var/www/myproject/data/media makes the console hang.



I can also see the following messages in /var/log/syslog:



Sep 16 11:24:36 encoder-1 kernel: [69688.160102] nfs: server storage-1 not responding, still trying
Sep 16 11:30:15 encoder-1 kernel: [70027.744042] nfs: server storage-1 not responding, still trying



When I umount and then mount one of the NFS directories on the client, I can access it again, but the others remain inaccessible until I umount and mount them as well.
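The per-mount umount/mount cycle described above can be scripted. A minimal sketch, assuming the NFS v3 entries show up with fstype nfs in /proc/mounts; it only prints the commands (dry run) so they can be reviewed before running them as root:

```shell
#!/bin/sh
# Sketch: emit an "umount -l; mount" pair for every NFS mount listed in
# /proc/mounts (or in an alternative mounts file given as $1, handy for
# testing). Prints the commands rather than executing them.
list_nfs_remounts() {
  mounts_file="${1:-/proc/mounts}"
  # /proc/mounts fields: device mountpoint fstype options dump pass
  awk '$3 == "nfs" { print "umount -l " $2 "; mount " $2 }' "$mounts_file"
}

list_nfs_remounts "$@"
```

Piping the output to "sudo sh" would apply it; keeping the script itself read-only makes it safe to run for inspection first.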



If anyone knows a possible solution for this, I am all ears. Note that rpcinfo shows that the client is able to contact the server, as shown below.



There is one NFS server and 4 NFS clients, for a total of 12 mount points.



The result of rpcinfo -p storage-1 from a client:



[0]root@encoder-1:/var/log # rpcinfo -p storage-1
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp  52115  status
    100024    1   tcp  57907  status
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100227    2   tcp   2049
    100227    3   tcp   2049
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100227    2   udp   2049
    100227    3   udp   2049
    100021    1   udp  59603  nlockmgr
    100021    3   udp  59603  nlockmgr
    100021    4   udp  59603  nlockmgr
    100021    1   tcp  47716  nlockmgr
    100021    3   tcp  47716  nlockmgr
    100021    4   tcp  47716  nlockmgr
    100005    1   udp    892  mountd
    100005    1   tcp    892  mountd
    100005    2   udp    892  mountd
    100005    2   tcp    892  mountd
    100005    3   udp    892  mountd
    100005    3   tcp    892  mountd



When enabling NFS debug traces as explained here, we get the following log messages:



Sep 17 05:35:00 encoder-1 kernel: [135112.160230] nfs: server storage-1 not responding, still trying
Sep 17 05:53:47 encoder-1 kernel: [136240.018538] NFS: nfs_lookup_revalidate(///) is valid
Sep 17 05:53:47 encoder-1 kernel: [136240.018538] NFS: revalidating (0:12/5242881)
Sep 17 05:53:47 encoder-1 kernel: [136240.018538] NFS call getattr
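For reference, NFS client debug traces like the ones above are typically toggled with rpcdebug from nfs-utils, which must be run as root. A minimal sketch that only prints the commands so they can be reviewed first; the choice of modules and flags here is an assumption, adjust to taste:

```shell
# Sketch: print the rpcdebug commands (from nfs-utils) that toggle NFS
# client debug logging. Printing instead of executing avoids needing root
# here; pipe the output to "sudo sh" to apply it.
nfs_debug_cmds() {
  # $1 = on|off
  case "$1" in
    on)  printf '%s\n' "rpcdebug -m nfs -s all" "rpcdebug -m rpc -s all" ;;
    off) printf '%s\n' "rpcdebug -m nfs -c all" "rpcdebug -m rpc -c all" ;;
    *)   echo "usage: nfs_debug_cmds on|off" >&2; return 1 ;;
  esac
}

nfs_debug_cmds on
```

Remember to turn the flags off again afterwards, since full NFS debug logging is very chatty in syslog.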

Answer




I think it may be a problem resolving the hostname. I have noticed that, even if name resolution otherwise seems to work fine on the system and network, the NFS mount processes occasionally appear to have a problem with it. I would change the hostname to the actual IP address and try that out. Let's say the FQDN is storage-1.example.org and it resolves to 192.0.2.11; then do:



192.0.2.11:/data/medias/media /var/www/myproject/data/media nfs bg,rsize=32768,wsize=32768,hard,intr,actimeo=300,nfsvers=3,async,noatime,sec=sys 0 0


Even if that doesn't fix the problem, I personally find using the IP address instead of the hostname or FQDN preferable. But I understand there could be reasons why you wouldn't want to do that.
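If you do stay with the hostname, it is worth verifying what the client actually resolves it to after the server restart. A small sketch using getent; "storage-1" and 192.0.2.11 are the placeholder name and address from above, so substitute your own:

```shell
# Sketch: compare the IPv4 address the resolver returns for the NFS server
# with the address you expect. "storage-1" / 192.0.2.11 are placeholders.
resolve_v4() {
  # First IPv4 address the system resolver returns for a hostname.
  getent ahostsv4 "$1" | awk 'NR == 1 { print $1 }'
}

expected="192.0.2.11"
actual="$(resolve_v4 storage-1)"
if [ "$actual" = "$expected" ]; then
  echo "OK: storage-1 -> $actual"
else
  echo "MISMATCH: storage-1 resolves to '${actual:-nothing}', expected $expected" >&2
fi
```

A mismatch here, while the kernel client still targets the old address on port 2049, would point squarely at the stale-resolution theory.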



Note: I added the bg option, which backgrounds the mount process if it takes too long, in order to speed up booting. It's up to you whether you prefer that. I mention it because when there are a number of NFS mount points and each one takes a long time (or times out) to mount, the boot time can easily exceed an hour.

