I'm in the process of staging up some new virtualization servers, and part of that is to get some higher-bandwidth pipes into them. The ultimate goal is to bond 4 GigE ports into a single trunk carrying 802.1q tagged traffic. I can get that far; however, I've run into a strange problem. But first, a diagram.
 ----------         ----------    1GbE trunks
|          | 10GbE |          | -------------  --------
|   SW1    |-------|   SW2    | ------------- |  VM1   |
|          |       |          | -------------  --------
 ----------         ----------
     |                   |   1GbE    -----------
     | 1GbE              |--------| client2 |
     |                               -----------
 ----------
|          | 1GbE    -----------
|   SW3    |------| client1 |
|          |         -----------
 ----------
All the switches are HP ProCurve 2910al switches and are not stacked. Client2 in the above diagram is in the same VLAN as VM1 is. Client1 is in a different VLAN. For the VM machine (CentOS 6) both iptables and SELinux have been disabled.
My problem is that when trunking is involved, two-way network traffic is impossible when talking to either client machine. tcpdump on the clients shows that pings are received by them and ECHO REPLY packets are sent, but the VM host never sees them. At the same time, if I try to ping the VM from a client machine, it also doesn't work. The fact I can't ping client2, which is on the same subnet, suggests something is screwy in the network layer somewhere.
Strangely, from the VM host I can ping the gateway IPs on any of the switches. If I use a single interface everything works fine both with and without VLAN tagging. If I bond just a single interface and turn on VLAN tagging on that bond, I can go anywhere. Build a trunk with more than one port, and I'm limited to the switch fabric.
The type of trunk doesn't seem to matter. Right now they're configured with mode 0 trunks (balance-rr), though using LACP/802.3ad behaves the same way.
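For reference, the numeric bonding modes map to names as listed in the Linux bonding driver documentation. The helper below is only a lookup sketch; the function name bond_mode_name is made up for illustration and is not part of any distribution tooling.

```shell
# Hypothetical helper: translate a numeric Linux bonding mode to its name.
# The number/name pairs come from the kernel bonding driver documentation.
bond_mode_name() {
    case "$1" in
        0) echo balance-rr ;;
        1) echo active-backup ;;
        2) echo balance-xor ;;
        3) echo broadcast ;;
        4) echo 802.3ad ;;       # LACP
        5) echo balance-tlb ;;
        6) echo balance-alb ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}
```

So bond_mode_name 0 prints the round-robin mode name and bond_mode_name 4 prints the LACP one.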
vlan 70
name "Virtualization Subnet"
untagged 35,36,38,40
tagged Trk1-Trk2,Trk5,Trk8
no ip address
jumbo
exit
That's the VLAN config on SW2 up there. SW1's VLAN 70 definition has the "ip address" defined on it. The above snippet is in the fully-untrunked mode. When I'm trunked:
trunk 35-36,38,40 Trk16 trunk
vlan 70
name "Virtualization Subnet"
tagged Trk1-Trk2,Trk5,Trk8,Trk16
no ip address
jumbo
exit
The LACP/802.3ad version trades out the trunk definition for
trunk 35-36,38,40 Trk16 lacp
but as I said, that doesn't change how the problem presents.
Client2 is actually connected to SW1, but putting it there in the chart would have made formatting trickier. In any case, the only thing in its interface stanza is a name directive; it is listed as an untagged port in the vlan 70 stanza for SW1.
What am I missing?
Answer
After a long debate in chat involving MikeyB, Pauska, and ChrisS, the problem ended up being two-fold:
- A possible bug in CentOS 6 was not changing the module options for the bonding module as part of service network restart, so it wasn't tracking my changes between LACP mode (4) and round-robin (0).
- Round-robin mode doesn't like to work with ProCurve switches.
Once I forced the bonded interface to LACP/802.3ad mode through these commands:
ifconfig bond0 down
echo "4" > /sys/class/net/bond0/bonding/mode
ifconfig bond0 up
Both the server and the switch were talking. At that point, starting with only one interface enabled on the switch, traffic started working normally. Enabling a second, third, and finally, the fourth interfaces all kept traffic working.
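One way to watch this while enabling switch ports one at a time is to read the bonding driver's status file, /proc/net/bonding/bond0. The sketch below assumes the usual "Bonding Mode:", "Slave Interface:" and "MII Status:" lines in that file; the function name bond_slaves is hypothetical.

```shell
# Hypothetical helper: summarise a bonding status file.
# $1: path to the status file (defaults to /proc/net/bonding/bond0).
bond_slaves() {
    status_file="${1:-/proc/net/bonding/bond0}"
    awk -F': ' '
        # Mode the driver is actually running in
        /^Bonding Mode:/    { print "mode: " $2 }
        # Remember which slave the following lines describe
        /^Slave Interface:/ { slave = $2 }
        # Report only per-slave link state, not the bond-level one
        /^MII Status:/ && slave != "" { print slave ": " $2; slave = "" }
    ' "$status_file"
}
```

Running bond_slaves with no argument after each port is enabled on the switch should show the reported mode plus one line per slave with its link state.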
Ultimately, LACP mode is what made things work. The clue was that round-robin mode worked when there was only one enabled switch port in the trunk. The server survives a reboot and comes up in the correct mode. However, a service network restart does not cause the MODE="4" part of the ifcfg-bond0 file in /etc/sysconfig/network-scripts/ to take effect. If that mode changes, it'll remain what was set at boot (or more likely, at module-load time of the bonding module).
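Since a network restart can leave the bond in the old mode, it may be worth checking the running mode against the one you expect. This is a minimal sketch, assuming the sysfs mode file holds a name and a number such as "802.3ad 4"; the function name check_bond_mode and its arguments are made up for illustration.

```shell
# Hypothetical check: does the running bond mode match the expected one?
# $1: expected numeric mode (e.g. 4)
# $2: bond name (default bond0)
# $3: sysfs net root (default /sys/class/net, overridable for testing)
check_bond_mode() {
    expected="$1"
    bond="${2:-bond0}"
    root="${3:-/sys/class/net}"
    # The mode file contains "<name> <number>", for example "802.3ad 4"
    read -r name number < "$root/$bond/bonding/mode" || return 2
    if [ "$number" = "$expected" ]; then
        echo "$bond is in mode $number ($name)"
    else
        echo "$bond is in mode $number ($name), expected $expected" >&2
        return 1
    fi
}
```

Calling check_bond_mode 4 after service network restart returns non-zero if the bond is still in the mode it had at module-load time.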