We are having a problem with some of our virtual machines intermittently losing communication with each other, and I’m at a loss as to the source.
We have about 250 VMs running on about 20 HP BL465c blades installed in two HP c7000 chassis, using the HP Virtual Connect interconnect modules. The blade chassis are connected to our core Cisco 6500 switches. The VMware hosts are at version 5.0, and the guest VMs are a mix of Windows 2003, 2008, and 2008 R2.
What’s going on is that everything seems to be OK, but then, out of nowhere, we get communication failures between specific machines. It looks like an ARP issue. Ping works fine in one direction, but we get an “unreachable” error going the other way, unless we first ping from the target back to the source.
For example, take two servers, “A” and “B”. Pinging from A to B fails with “unreachable”. Pinging from B to A works fine. However, after pinging from B to A, we can ping from A to B, at least for a while, until the entry falls out of the ARP cache. If we go onto server A and set a static ARP entry (“arp -s”) for server B, everything works OK. Through all this, both A and B have no issues communicating with any other machines.
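To make the check on server A repeatable, here is a rough sketch of a script that reads the text printed by the Windows "arp -a" command and reports whether there is an entry for server B, and whether it is dynamic or static. The sample output, the IP and MAC addresses, and the function names are all made up for illustration; it also merges entries from every interface into one table, which is an assumption that only suits single-NIC guests.

```python
import re

# Hypothetical sample of Windows `arp -a` output; all addresses are made up.
SAMPLE = """
Interface: 10.0.0.11 --- 0xb
  Internet Address      Physical Address      Type
  10.0.0.1              00-50-56-aa-00-01     dynamic
  10.0.0.12             00-50-56-aa-00-12     static
"""

# One table row: IPv4 address, dash-separated MAC, entry type.
ROW = re.compile(
    r"^\s*(\d+\.\d+\.\d+\.\d+)\s+"
    r"([0-9a-fA-F]{2}(?:-[0-9a-fA-F]{2}){5})\s+(\w+)\s*$"
)

def parse_arp_table(text):
    """Return {ip: (mac, type)} for every entry row found in the text."""
    table = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            ip, mac, kind = match.groups()
            table[ip] = (mac.lower(), kind.lower())
    return table

def arp_state(text, peer_ip):
    """Return 'missing', 'dynamic' or 'static' for peer_ip."""
    entry = parse_arp_table(text).get(peer_ip)
    return "missing" if entry is None else entry[1]

print(arp_state(SAMPLE, "10.0.0.12"))
```

Running this against saved "arp -a" output from A before and after the ping from B would show whether B's entry goes from missing to dynamic, which is what the symptom suggests.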
We tried using vMotion to move the servers to a different host, a different blade chassis, etc. Nothing worked except putting both VMs on the same host; then everything worked OK. When we moved one of the servers to a different host, the problem came back.
It seems like either the ARP broadcast from the one server or the reply back from the target isn't making it through. However, according to our networking group, there are no issues showing up on the Cisco switches.
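If we can get packet captures on both guests at the same time, the two possibilities above can be told apart by noting which ARP frames show up where. The following is only a sketch of that reasoning, not a tool we have run; the function name, its four yes/no inputs, and the result labels are hypothetical.

```python
def locate_arp_loss(request_left_a, request_seen_at_b,
                    reply_left_b, reply_seen_at_a):
    """Classify where an ARP exchange from A to B breaks down.

    Each argument is True or False, taken from captures on A and B:
      request_left_a    - A's capture shows its ARP request going out
      request_seen_at_b - B's capture shows that request arriving
      reply_left_b      - B's capture shows its ARP reply going out
      reply_seen_at_a   - A's capture shows that reply arriving
    """
    if not request_left_a:
        return "no request sent by A"
    if not request_seen_at_b:
        return "broadcast request lost in transit"
    if not reply_left_b:
        return "B did not reply"
    if not reply_seen_at_a:
        return "unicast reply lost in transit"
    return "exchange completed"

# Example: A sends the request, but B never sees it.
print(locate_arp_loss(True, False, False, False))
```

A result of "broadcast request lost in transit" or "unicast reply lost in transit" would point at something between the two hosts rather than at either guest.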
Earlier this year, we had an incident where it happened on about a third of our machines at the same time (it caused significant outages to production systems!). It seemed to be limited to machines on one chassis (but not all of the machines on that chassis). At that time, we opened tickets with VMware and HP. Neither found anything wrong with our configurations, but somewhere in the various server moves, configuration resets, etc., everything started working.
Since that time we’ve seen it very intermittently on a few machines, but then it seems to go away after a few days.
The issue we found today was that the server we’re using as our Microsoft WSUS server hadn’t been receiving updates from a couple of the member servers. We could ping from the WSUS server to the member server, but not back from the member server unless we put a static ARP entry on the member server. The member servers are working fine otherwise, talking to other machines OK, etc. They are in a production environment, so we’re limited in the testing we can do.
Also, when it has happened, it seems to have always been between machines on the same subnet. However, most of our servers are on the same subnet, so it might just be coincidence.
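One thing that may make the same-subnet pattern more than coincidence: a host only sends an ARP request for a peer's own address when that peer is on its subnet; for anything off-subnet it resolves its default gateway instead. A small sketch of that rule, using made-up addresses and a hypothetical function name:

```python
import ipaddress

def arp_target(src_interface, dst_ip, gateway_ip):
    """Return the IP a host would ARP for when sending to dst_ip.

    src_interface is the sender's address with prefix, e.g. "10.0.0.11/24".
    All addresses used here are illustrative, not from our environment.
    """
    iface = ipaddress.ip_interface(src_interface)
    dst = ipaddress.ip_address(dst_ip)
    if dst in iface.network:
        # Same subnet: resolve the peer directly.
        return str(dst)
    # Different subnet: resolve the default gateway instead.
    return str(ipaddress.ip_address(gateway_ip))

print(arp_target("10.0.0.11/24", "10.0.0.12", "10.0.0.1"))
print(arp_target("10.0.0.11/24", "10.0.1.12", "10.0.0.1"))
```

If that holds here, a failure to resolve one specific VM would only ever be visible between machines on the same subnet.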
I’ve done a lot of internet searching and have found some postings describing similar issues, but haven’t found any solution. I don’t know if it’s a VMware, HP, Cisco, or Windows issue.
Any assistance would be appreciated.
Mike O'Donnell