linux - Network traffic doesn't appear to leave the trunk

Tuesday, January 19, 2016

I'm in the process of staging some new virtualization servers, and part of that is getting some higher-bandwidth pipes into them. The ultimate goal is to bind 4 GigE ports into a single trunk carrying 802.1q tagged traffic. I can get that far, but I've run into a strange problem. But first, a diagram.

----------       ----------  1GbE trunks 
|        | 10GbE |        | ------------- --------
|  SW1   |-------|   SW2  | ------------- | VM1  |
|        |       |        | ------------- --------
----------       ----------
     |                |  1GbE  -----------
     | 1GbE           |--------| client2 |
     |                         -----------
----------
|        | 1GbE -----------
|  SW3   |------| client1 |
|        |      -----------
----------

All the switches are HP ProCurve 2910al switches and are not stacked. Client2 in the above diagram is in the same VLAN as VM1 is. Client1 is in a different VLAN. For the VM machine (CentOS 6) both iptables and SELinux have been disabled.

My problem is that when trunking is involved, two-way network traffic is impossible when talking to either client machine. tcpdump shows that the clients receive the pings and send ECHO REPLY packets, but the VM host never sees the replies. At the same time, if I try to ping the VM from a client machine, it also doesn't work. The fact that I can't ping client2, which is on the same subnet, suggests something is screwy in the network layer somewhere.

Strangely, from the VM host I can ping the gateway IPs on any of the switches. If I use a single interface, everything works fine both with and without VLAN tagging. If I bond just a single interface and turn on VLAN tagging on that interface, I can go anywhere. Build a trunk, and I'm limited to the switch fabric.

The type of trunk doesn't seem to matter. Right now they're configured as mode 0 trunks (balance-rr), though using LACP/802.3ad behaves the same way.
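For context, the Linux side of a setup like this on CentOS 6 is typically spread over three kinds of ifcfg files. The sketch below is illustrative only: the slave name eth0, the use of BONDING_OPTS, the miimon value, and the 192.0.2.10 address are assumptions, not values taken from this host.

```sh
# /etc/sysconfig/network-scripts/ifcfg-eth0 (one such file per slave port)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-bond0 (the bond itself carries no address)
DEVICE=bond0
BONDING_OPTS="mode=balance-rr miimon=100"
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-bond0.70 (tagged VLAN 70 on top of the bond)
DEVICE=bond0.70
VLAN=yes
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.0.2.10
NETMASK=255.255.255.0
```

Switching the bond to LACP in a layout like this would mean changing mode=balance-rr to mode=802.3ad (mode 4) in the bond's options.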

vlan 70 
   name "Virtualization Subnet" 
   untagged 35,36,38,40 
   tagged Trk1-Trk2,Trk5,Trk8 
   no ip address 
   jumbo 
   exit

That's the VLAN config on SW2 up there. SW1's VLAN 70 definition has the "ip address" defined on it. The above snippet is in the fully-untrunked mode. When I'm trunked:

trunk 35-36,38,40 Trk16 trunk
vlan 70 
   name "Virtualization Subnet" 
   tagged Trk1-Trk2,Trk5,Trk8,Trk16
   no ip address 
   jumbo 
   exit

The 802.3ad/LACP version swaps the trunk definition for "trunk 35-36,38,40 Trk16 lacp" but, as I said, that doesn't change how the problem presents.

Client2 is actually connected to SW1, but putting it there in the chart would have made formatting trickier. In any case, the only thing in the Interface stanza is a name directive; it is listed as an untagged port in the vlan 70 stanza for SW1.

What am I missing?

Answer

After a long debate in chat involving MikeyB, Pauska, and ChrisS, the problem ended up being two-fold:

A possible bug in CentOS 6: service network restart was not reapplying the module options for the bonding module, so the bond wasn't tracking my changes between LACP mode (4) and round-robin mode (0).

Round-robin mode doesn't work well with ProCurve switches.

Once I forced the bonded interface to LACP/802.3ad mode with these commands:

ifconfig bond0 down
echo "4" > /sys/class/net/bond0/bonding/mode
ifconfig bond0 up

Both the server and the switch were talking. At that point, starting with only one interface enabled on the switch, traffic started working normally. Enabling the second, third, and finally the fourth interface kept traffic working.
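One way to watch the bond while ports are enabled one at a time is to read the driver's status file. The helper below is a minimal sketch, assuming the usual /proc/net/bonding/bond0 layout with "Bonding Mode:", "Slave Interface:", and "MII Status:" lines; the function name bond_summary is made up for illustration.

```sh
#!/bin/sh
# Summarise a bonding status file: print the mode line, then one
# "interface status" pair per slave.
# $1: path to the status file (default: /proc/net/bonding/bond0)
bond_summary() {
    awk '
        # Report the active bonding mode as the driver describes it.
        /^Bonding Mode:/    { sub(/^Bonding Mode: */, ""); print "mode: " $0 }
        # Remember which slave the following MII Status line belongs to.
        /^Slave Interface:/ { slave = $3 }
        # Skip the MII Status of the bond itself (no slave seen yet).
        /^MII Status:/      { if (slave != "") { print slave " " $3; slave = "" } }
    ' "${1:-/proc/net/bonding/bond0}"
}
```

Called with no argument, bond_summary reads /proc/net/bonding/bond0 directly; after each switch port is enabled, the matching slave should be listed as up.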

Ultimately, LACP mode is what made things work. The clue was that round-robin mode worked when there was only one enabled switch port in the trunk. The server survives a reboot and comes up in the correct mode. However, a service network restart does not cause the MODE="4" part of the ifcfg-bond0 file in /etc/sysconfig/network-scripts/ to take effect. If that setting is changed, the live mode remains whatever was set at boot (or, more likely, at module-load time of the bonding module).
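Because the configured mode and the live mode can drift apart like this, it may be worth checking the live value after any network restart. The following is a minimal sketch, assuming the sysfs mode file holds a name and a number such as "802.3ad 4"; the function name check_bond_mode is hypothetical.

```sh
#!/bin/sh
# Compare the bonding mode the driver is actually using with the mode
# number expected from the configuration.
# $1: expected mode number (e.g. 4 for 802.3ad)
# $2: sysfs mode file (default: /sys/class/net/bond0/bonding/mode)
# Returns 0 on a match, 1 on a mismatch, 2 if the file cannot be read.
check_bond_mode() {
    expected=$1
    mode_file=${2:-/sys/class/net/bond0/bonding/mode}
    live_name= live_num=
    # The file holds "<name> <number>", e.g. "802.3ad 4".
    read -r live_name live_num < "$mode_file" || [ -n "$live_num" ] || return 2
    if [ "$live_num" = "$expected" ]; then
        return 0
    fi
    echo "bond mode is $live_name ($live_num), expected $expected" >&2
    return 1
}
```

For example, running check_bond_mode 4 after a service network restart would report a mismatch if the bond were still in round-robin mode.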
