Monday, June 12, 2017

XenServer, iSCSI and Dell MD3600i



I have a functional XenServer 6.5 pool with two nodes, backed by an iSCSI share on a Dell MD3600i SAN. This works fine; it was set up before my time.




We've added three more nodes to the pool. However, these three new nodes will not connect to the storage.



Here's one of the original nodes, working fine:



[root@node1 ~]# iscsiadm -m session
tcp: [2] 10.19.3.11:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [3] 10.19.3.14:3260,2 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [4] 10.19.3.12:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [5] 10.19.3.13:3260,2 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)



Here's one of the new nodes. Notice the corruption in the address?



[root@vnode3 ~]# iscsiadm -m session
tcp: [1] []:-1,2 ▒Atcp: [2] 10.19.3.12:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [3] 10.19.3.11:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [4] 10.19.3.14:3260,2 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)



The missing IP address here is .13, but another node is missing .12 instead.
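
For anyone diagnosing something similar, comparing the configured node records against the live sessions is a quick way to spot which portal has no session - a minimal sketch using stock open-iscsi commands, nothing XenServer-specific:

iscsiadm -m node -P 0       # one line per configured target/portal record
iscsiadm -m session -P 1    # per-session detail, including state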



Comments:



I have live production VMs running on the existing nodes and nowhere to move them, so rebooting the SAN is not an option.



Multipathing is disabled on the original nodes, despite the SAN having four interfaces. This seems suboptimal, so I've turned on multipathing on the new nodes.
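
For reference, this is roughly the xe equivalent of ticking the multipathing box in XenCenter (a sketch, assuming XenServer 6.x; the host name is a placeholder and the host has to be in maintenance mode first):

HOST_UUID=$(xe host-list name-label=vnode3 --minimal)   # placeholder name
xe host-disable uuid=$HOST_UUID                         # maintenance mode
xe host-param-set uuid=$HOST_UUID other-config:multipathing=true
xe host-param-set uuid=$HOST_UUID other-config:multipathhandle=dmp
xe host-enable uuid=$HOST_UUID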



The three new nodes have awfully high system loads. The original boxes have a load average of 0.5 to 1, while the three new nodes are sitting at about 11.1 with no VMs running. top shows no high-CPU processes, so it's presumably something kernel-related? There are no processes locked in state D (uninterruptible sleep).
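
In case anyone wants to reproduce that check: load average counts uninterruptible-sleep tasks as well as runnable ones, so something like the sketch below separates the two (stock Linux tools, not XenServer-specific):

ps axo stat,pid,wchan:30,comm | awk '$1 ~ /^D/'   # tasks stuck in state D
vmstat 1 5    # watch the "b" (blocked) and "wa" (iowait) columns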




If I tell XenCenter to "repair" those Storage Repositories, it sits spinning its wheels for hours until I hit cancel. The message is "Plugging PBD for node5".
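
XenCenter's repair is essentially plugging PBDs one host at a time, so doing the same by hand with xe at least produces a real error message instead of a spinner. A sketch - both UUIDs are placeholders, taken from xe sr-list and the pbd-list output:

xe pbd-list sr-uuid=<sr-uuid> params=uuid,host-uuid,currently-attached
xe pbd-plug uuid=<pbd-uuid>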



Question: how do I get my new XenServer pool members to see the pool storage and work as expected?



EDIT: Further information




  • None of the new nodes will do a clean reboot either - they get wedged at "stopping iSCSI" on shutdown and I have to use the DRAC to remotely repower them (a forced-logout workaround is sketched below).

  • XenCenter is adamant that the nodes are in maintenance mode and that they haven't finished booting.
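
One hedged workaround for the shutdown hang: force the initiator to log out of every target before issuing the reboot. If a session is truly wedged this can hang too, but at least it fails in the foreground where you can see it:

iscsiadm -m node --logoutall=all   # log out of all targets first
reboot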




Good pool node:



[root@node1 ~]# multipath -ll
36f01faf000eaf7f90000076255c4a0f3 dm-36 DELL,MD36xxi
size=3.3T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=12 status=enabled
| |- 14:0:0:6 sdg 8:96 active ready running
| `- 15:0:0:6 sdi 8:128 active ready running
`-+- policy='round-robin 0' prio=11 status=enabled
  |- 12:0:0:6 sdc 8:32 active ready running
  `- 13:0:0:6 sdh 8:112 active ready running
36f01faf000eaf6fd0000098155ad077f dm-35 DELL,MD36xxi
size=917G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=14 status=enabled
| |- 12:0:0:5 sdb 8:16 active ready running
| `- 13:0:0:5 sdd 8:48 active ready running
`-+- policy='round-robin 0' prio=9 status=enabled
  |- 14:0:0:5 sde 8:64 active ready running
  `- 15:0:0:5 sdf 8:80 active ready running



Bad node



[root@vnode3 ~]# multipath
Dec 24 02:56:44 | 3614187703d4a1c001e0582691d5d6902: ignoring map
[root@vnode3 ~]# multipath -ll
[root@vnode3 ~]#     (i.e. no output at all; the exit code was 0)



Bad node



[root@vnode3 ~]# iscsiadm -m session
tcp: [1] []:-1,2 ▒Atcp: [2] 10.19.3.12:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [3] 10.19.3.11:3260,1 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)
tcp: [4] 10.19.3.14:3260,2 iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb (non-flash)

[root@vnode3 ~]# iscsiadm -m node --loginall=all
Logging in to [iface: default, target: iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb, portal: 10.19.3.13,3260] (multiple)

^C iscsiadm: caught SIGINT, exiting...


So it tries to log into an IP on the SAN, but spins its wheels for hours until I hit ^C.
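
A sketch for isolating this: log in to the suspect portal alone rather than using --loginall, and optionally cap how long a login attempt may block via iscsid.conf:

iscsiadm -m node \
  -T iqn.1984-05.com.dell:powervault.md3600i.6f01faf000eaf7f900000000531ae9bb \
  -p 10.19.3.13:3260 --login
# Optional, in /etc/iscsi/iscsid.conf (applies to newly created node records):
#   node.conn[0].timeo.login_timeout = 15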


Answer



For closure, there were multiple things wrong.




  1. The hosts were configured for a 1500-byte MTU, whereas the SAN was using a 9216-byte MTU (see the checks sketched after this list).

  2. One of the hosts had a subtly different IQN from the one the SAN expected - the SAN listed the correct IQN as "unassigned" even though it looked identical to the IQN in use (again, see the checks below).


  3. My original two nodes had management IPs configured on their on-board 1 Gbit cards. The three new nodes had their management IPs configured on the bonded interface, in a VLAN. This mismatch was enough to mostly stop the new hosts from coming out of maintenance mode after a boot.
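
Two quick checks for points 1 and 2 above, as a sketch with stock Linux tools (8972 bytes = a 9000-byte frame minus 28 bytes of IP and ICMP headers; the portal IP is from the output above):

ping -M do -s 8972 -c 3 10.19.3.13   # -M do forbids fragmentation, so this
                                     # fails loudly on a 1500-MTU path
cat /etc/iscsi/initiatorname.iscsi   # compare byte-for-byte with the IQN
                                     # the SAN has assigned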



Multipath seemed to have no bearing on the problem at all.



Deleting and fiddling around with files in /var/lib/iscsi/* on the XenServer nodes had no impact on the problem.



I had to use other means to reboot these newer boxes too - they would wedge up trying to stop the iSCSI service.



And finally, the corruption visible in the iscsiadm -m session output has vanished completely. This was possibly related to the MTU mismatch.




For future internet searchers - good luck!

