Hello,
I am trying to resolve a networking issue I uncovered in our vSphere environment...
I have an IBM BladeCenter H with 3 HS22 servers running ESX 4.1. I have Nortel (BNT) Layer 2/3 Gigabit Ethernet switch modules in slots 1 and 2. These connect the environment to the rest of the network.
To support VLANs, I have configured a single standard vSwitch on each of the hosts with a few port groups, using Virtual Switch Tagging (VST). The uplinks (vmnic0 and vmnic1) are in an active/active state; vmnic0 connects to internal port 1 on switch module 1 and vmnic1 connects to internal port 1 on switch module 2. Each of the internal switch ports is configured to accept the tagged traffic. On each of the switch modules, I have one external port dedicated to each VLAN (untagged), which connects to one of a pair of upstream switches - switch module 1 connects to upstream switch "a" and switch module 2 connects to upstream switch "b".
I noticed that every so often, when I reboot a guest VM in one of my non-default VLANs, it is no longer accessible on the network. Once I power it off and back on again, it comes back online. Looking a little closer, I noticed that two of the external ports on the 2nd BladeCenter switch module (the ones for VLAN 222 and 333 as shown below) are in a blocking state. However, the external ports for the default VLAN on both BladeCenter switch modules are in a forwarding state, and I do not experience the same symptoms there.
So, I am going up and down through all of my switches, expecting to find a misconfiguration somewhere - something causing a loop that would make these ports go into a blocking state. I have not come across anything yet, though. I should also state that I do not have spanning tree enabled on any of the upstream switches - it is only enabled on the external ports of the BladeCenter switch modules (I am not exactly sure why, though... STP is still a little foreign to me).
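To make the loop hunt concrete, here is a minimal sketch that models the switch-to-switch links for a single VLAN as a graph and checks whether any link closes a loop. The node names are just labels I made up for this post. The link between upstream switches "a" and "b" is an assumption (I have not confirmed how they are interconnected), and the module-to-module link is purely hypothetical, included only to show what a loop would look like. The ESX hosts are left out on the assumption that a standard vSwitch does not forward frames from one uplink to the other.

```python
def has_loop(links):
    """Return True if the undirected link list contains a cycle (union-find)."""
    parent = {}

    def find(node):
        # Walk up to the representative of the node's connected group.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # Both ends were already connected, so this link closes a loop.
            return True
        parent[root_a] = root_b
    return False


# Links as described in the post, for one VLAN (for example VLAN 222).
described_links = [
    ("switch_module_1", "upstream_a"),
    ("switch_module_2", "upstream_b"),
]

# ASSUMPTION: the two upstream switches are linked to each other.
assumed_interlink = [("upstream_a", "upstream_b")]

# HYPOTHETICAL: an extra path between the two switch modules.
hypothetical_extra = [("switch_module_1", "switch_module_2")]

print(has_loop(described_links))
print(has_loop(described_links + assumed_interlink))
print(has_loop(described_links + assumed_interlink + hypothetical_extra))
```

Under these assumptions only the last case reports a loop, which matches what I see on paper: the links I know about form a chain, so a second path would have to exist somewhere for STP to have a reason to block.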
To confuse me even more, I just stood up a new ESX 4.1 server running on a DL360 G6. I configured the vSwitch on it the same way as on the blade server ESX hosts, and cabled it up the same way... and I was able to reproduce the problem there as well. The DL360 G6 is in no way connected to the BladeCenter switch modules, so I am a bit baffled.
I am sure I have something wrong, but I can't see it yet...
Any advice would be greatly appreciated!!
~David