Networking is not my specialty. I have done my best to isolate this problem, and I would like some pointers on how to further narrow down what the issue may be.
Server: File Server running Limetech Unraid 5.0.6 (Slackware-based OS with kernel 3.9.11p)
This server has been running reliably for about 2 years, with the only recent hardware change being a RAM upgrade (I have already used a memory tool to verify that the RAM is working perfectly).
Initial Symptoms:
Other computers on the network accessing network shares suffered intermittent disconnection for 10-20 seconds every 3-5 minutes.
Investigations
Using repeated ping tests, I was able to determine that all other devices on the network maintained their connections; it was only the file server that was dropping off for brief periods of time.
Pinging to or from the file server would fail for up to 8 seconds at semi-random intervals, from just 30 seconds apart to more than 15 minutes apart, with an average of around 3-5 minutes.
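One rough way to quantify dropouts like these is to save the ping output to a file and scan it for runs of failed probes. The following is only a minimal sketch, assuming Windows-style ping output like the log shown later in this post; the name find_outages is made up for illustration.

```python
import re

# A successful probe on Windows looks like:
#   Reply from x.x.x.100: bytes=32 time=3ms TTL=64
# "Request timed out." and "Destination host unreachable." lines do not match.
REPLY = re.compile(r"Reply from \S+: bytes=\d+ time[=<]\d+ms TTL=\d+")


def find_outages(lines):
    """Return a list of (start_index, probe_count) for each run of failed probes.

    `lines` is assumed to contain one ping result per entry, with headers
    and blank lines already removed.
    """
    outages = []
    run_start = None
    for i, line in enumerate(lines):
        ok = bool(REPLY.search(line))
        if not ok and run_start is None:
            run_start = i                      # a failure run begins here
        elif ok and run_start is not None:
            outages.append((run_start, i - run_start))
            run_start = None                   # connectivity is back
    if run_start is not None:                  # log ended during an outage
        outages.append((run_start, len(lines) - run_start))
    return outages


sample = [
    "Reply from x.x.x.100: bytes=32 time=3ms TTL=64",
    "Request timed out.",
    "Request timed out.",
    "Reply from x.x.x.100: bytes=32 time=2ms TTL=64",
]
print(find_outages(sample))  # [(1, 2)]
```

Note that the result counts failed probes, not seconds; converting to a duration would need timestamps that plain ping output does not include.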
Rebooting the server seemed to make the problem go away for a few hours.
ifconfig shows small numbers (3-7) of RX packets being dropped at about the same time the connection appears to fail.
syslog doesn't report anything unusual, either on boot or during the failure.
ethtool shows that the link is maintained 100% of the time.
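To check whether the RX drops really line up with the ping failures, the drop counter can be polled and timestamped. This is a minimal sketch, assuming a Linux /proc/net/dev with the usual column layout (the fourth receive field is the drop count); rx_dropped and watch are hypothetical helper names, and Python would have to be available on the server or the file copied elsewhere.

```python
import time


def rx_dropped(proc_net_dev_text, iface):
    """Extract the RX 'drop' counter for `iface` from /proc/net/dev contents."""
    for line in proc_net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            # Receive fields: bytes packets errs drop fifo frame ...
            return int(rest.split()[3])
    raise KeyError(iface)


def watch(iface="eth0", interval=1.0, path="/proc/net/dev"):
    """Print a timestamped line whenever the drop counter changes (runs forever)."""
    last = None
    while True:
        with open(path) as f:
            now = rx_dropped(f.read(), iface)
        if last is not None and now != last:
            stamp = time.strftime("%H:%M:%S")
            print(f"{stamp} {iface} RX dropped +{now - last} (total {now})")
        last = now
        time.sleep(interval)
```

Running watch() in one terminal while a continuous ping runs in another would show whether each timeout coincides with a jump in the counter.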
I'm not a network engineer, but it seems that the issue is specific to this device (other devices connected to the same infrastructure have no issues).
Is this likely to be a hardware issue with the NIC itself, or is it something to do with the OS or network config? Could it be caused by user software?
Any suggestions on how to identify the root cause would be greatly appreciated.
Log/Troubleshooting output:
Ping from Windows laptop to Unraid Server:
Reply from x.x.x.100: bytes=32 time=3ms TTL=64
Reply from x.x.x.100: bytes=32 time=7ms TTL=64
Reply from x.x.x.100: bytes=32 time=3ms TTL=64
Reply from x.x.x.100: bytes=32 time=3ms TTL=64
Reply from x.x.x.100: bytes=32 time=4ms TTL=64
Reply from x.x.x.100: bytes=32 time=3ms TTL=64
Request timed out.
Request timed out.
Request timed out.
Reply from x.x.x.200: Destination host unreachable.
Request timed out.
Reply from x.x.x.100: bytes=32 time=2ms TTL=64
Reply from x.x.x.100: bytes=32 time=1ms TTL=64
Reply from x.x.x.100: bytes=32 time=4ms TTL=64
Reply from x.x.x.100: bytes=32 time=2ms TTL=64
Reply from x.x.x.100: bytes=32 time=2ms TTL=64
ifconfig -a
eth0 Link encap:Ethernet HWaddr 94:de:80:03:2e:3c
inet addr:x.x.x.100 Bcast:x.x.x.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:35866080 errors:0 dropped:2107 overruns:0 frame:0
TX packets:35139719 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1597778360 (1.4 GiB) TX bytes:1548836243 (1.4 GiB)
Interrupt:49
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:1583 errors:0 dropped:0 overruns:0 frame:0
TX packets:1583 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:158298 (154.5 KiB) TX bytes:158298 (154.5 KiB)
netstat -i
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 35866360 0 2107 0 35139884 0 0 0 BMRU
lo 65536 0 1583 0 0 0 1583 0 0 0 LRU
dmesg | grep r8168
r8168 Gigabit Ethernet driver 8.037.00-NAPI loaded
r8168 0000:02:00.0: irq 49 for MSI/MSI-X
r8168: This product is covered by one or more of the following patents: US6,570,884, US6,115,776, and US6,327,625.
r8168 Copyright (C) 2013 Realtek NIC software team
cat /var/log/syslog | grep eth
Jan 10 03:04:03 ServerName logger: /etc/rc.d/rc.inet1: List of interfaces: 'eth0'
Jan 10 03:04:03 ServerName kernel: eth%%d: 0xf8560000, 94:de:80:03:2e:3c, IRQ 49
Jan 10 03:04:03 ServerName logger: /etc/rc.d/rc.inet1: Polling for DHCP server on interface eth0:
Jan 10 03:04:03 ServerName logger: /etc/rc.d/rc.inet1: /sbin/dhcpcd -t 10 -h ServerName -L eth0
Jan 10 03:04:03 ServerName dhcpcd[1203]: eth0: waiting for carrier
Jan 10 03:04:08 ServerName kernel: r8168: eth0: link up
Jan 10 03:04:08 ServerName dhcpcd[1203]: eth0: carrier acquired
Jan 10 03:04:08 ServerName dhcpcd[1203]: eth0: broadcasting for a lease
Jan 10 03:04:11 ServerName dhcpcd[1203]: eth0: offered x.x.x.100 from x.x.x.1
Jan 10 03:04:11 ServerName dhcpcd[1203]: eth0: acknowledged x.x.x.100 from x.x.x.1
Jan 10 03:04:12 ServerName dhcpcd[1203]: eth0: checking for x.x.x.100
Jan 10 03:04:16 ServerName dhcpcd[1203]: eth0: leased x.x.x.100 for 86400 seconds
Jan 10 03:04:31 ServerName logger: # * SSL.Connection objects, wrapping the methods of Python's portable
Jan 10 03:04:36 ServerName avahi-daemon[7491]: Joining mDNS multicast group on interface eth0.IPv4 with address x.x.x.100.
Jan 10 03:04:36 ServerName avahi-daemon[7491]: New relevant interface eth0.IPv4 for mDNS.
Jan 10 03:04:36 ServerName avahi-daemon[7491]: Registering new address record for x.x.x.100 on eth0.IPv4.
Jan 10 03:05:08 ServerName ntpd[1258]: Listen normally on 2 eth0 x.x.x.100 UDP 123
Jan 10 15:04:17 ServerName dhcpcd[1203]: eth0: renewing lease of x.x.x.100
Jan 10 15:04:17 ServerName dhcpcd[1203]: eth0: acknowledged x.x.x.100 from x.x.x.1
Jan 10 15:04:17 ServerName dhcpcd[1203]: eth0: leased x.x.x.100 for 86400 seconds
Jan 11 03:04:18 ServerName dhcpcd[1203]: eth0: renewing lease of x.x.x.100
Jan 11 03:04:18 ServerName dhcpcd[1203]: eth0: acknowledged x.x.x.100 from x.x.x.1
Jan 11 03:04:18 ServerName dhcpcd[1203]: eth0: leased x.x.x.100 for 86400 seconds
Jan 11 15:04:18 ServerName dhcpcd[1203]: eth0: renewing lease of x.x.x.100
Jan 11 15:04:18 ServerName dhcpcd[1203]: eth0: acknowledged x.x.x.100 from x.x.x.1
Jan 11 15:04:18 ServerName dhcpcd[1203]: eth0: leased x.x.x.100 for 86400 seconds
Yes, this is selected output. I have viewed the syslog and dmesg in detail and can provide more if required; however, I am confident that there is nothing in there of value. The reason for holding back? I am paranoid about providing too much info about my network setup on the public internet, and cleaning output is time-consuming.
Answer
Problem solved - it was not the Unraid server but, rather unexpectedly, a surge protector.
In my network setup I have a high-end surge protector that isolates the internal network from the outside world, so in the event of a lightning strike or power surge my expensive equipment is protected.
It's a bit of a long story as to why everything indicated it was the File Server, but the short answer is:
- The file server and internet gateway sat on one side of the protector while everything else was on the other
- Only traffic passing through the protector was affected (so 2 devices pinging each other on the same side of the surge protector didn't appear to show any problems)
- Losing connectivity for a few seconds every few minutes is almost completely undetectable for most network applications - it was only the high data throughput required when using the File Server that made the issue noticeable.
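The reasoning in the list above can be written down as a small check: if every pair of hosts that shows loss sits on opposite sides of some inline device, that device is a suspect. This is only an illustrative sketch; the host names, the side labels and the function loss_follows_boundary are all hypothetical, and the results would come from ping tests run between each pair.

```python
def loss_follows_boundary(side_of, results):
    """Return True if loss was seen, and only between hosts on different sides.

    side_of: dict mapping host name -> side label of the suspected device
    results: iterable of (src, dst, saw_loss) tuples from pairwise ping tests
    """
    lossy = [(src, dst) for src, dst, saw_loss in results if saw_loss]
    return bool(lossy) and all(side_of[s] != side_of[d] for s, d in lossy)


# Hypothetical layout: file server and gateway on side "A", clients on side "B".
side_of = {"fileserver": "A", "gateway": "A", "laptop": "B", "desktop": "B"}
results = [
    ("laptop", "fileserver", True),    # crosses the protector, loss seen
    ("laptop", "desktop", False),      # same side, no loss
    ("fileserver", "gateway", False),  # same side, no loss
]
print(loss_follows_boundary(side_of, results))  # True
```

A True result does not prove the device is at fault, but it is a good reason to bypass it temporarily and repeat the test.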
When I was isolating network hardware to ensure it was the file server that was the problem, I never thought to isolate this particular component (I barely consider it to be network hardware).
I have no idea why it was causing the problem, but it is clear that it was. I have simply removed it from the network for now; it's hardly vital to the system in any case.