Sunday, August 5, 2012

Provisioning Server error : No ARP Reply

Environment:

XS = 6.0.2 with latest patch
PVS = 6.1 HF1 running on Dell R720
Target = Windows 7 32-bit VM
Hardware = Dell R720, NIC = BCM57712
Core switch: Nexus 5K

Problem description: When the targets were booted, they would contact the PVS server, but ARP would time out while the image was downloading. This happened on only some of the machines, say 3 out of 5.


Troubleshooting:

1.    Broke the bond and tested in both Xen bridge and Open vSwitch modes.
2.    Finally put the virtual PVS and DHCP on the same VLAN as the targets, and all the targets booted perfectly. This made us think that something was wrong with inter-VLAN traffic, which translates into a layer 3 issue. When the target and PVS are on the same VLAN, the layer 3 switch acts as a layer 2 device and just forwards the packets.
3.    We decided to test something else: we gave a non-working VM the MAC address of a working VM, and voilà, it worked. This reinforced our suspicion that something was wrong at layer 3.
4.    The question now was how to take the Nexus 5K core out of the path and build our own layer 3 on a Dell switch to test inter-VLAN traffic. The Dell M8024 blade switch can also act as a layer 3 device, so we created a layer 3 VLAN on it with a different gateway and moved our virtual PVS onto this network. We also had to extend our DB and AD to this PVS, so we added a static route on one leg of the PVS. Streaming traffic was now in one VLAN and the targets in another. Once a target boots, it communicates with the PVS across VLANs, and the PVS in turn fetches the information it needs from the DB/AD over the back-end connection. This way I eliminated the Nexus layer 3 from the path. The above setup can only be replicated on a single host.

Cisco troubleshooting:

1.    The core had vPC configured on its active/standby pair. We decided to shut down one leg to see if that helped. To our surprise, everything worked.
2.    We captured two sets of traces: one with the vPC leg shut down and one without.

Here is what we saw when the VMs worked: when the Cisco core received an ARP request, it returned the MAC address to the target.

But when it did not work, it did not broadcast the MAC.


Basically, depending on the hashing algorithm of the Dell switch, a packet may arrive at either the HSRP active or the HSRP standby. If the packet arrived at the HSRP standby, the standby would forward it over the peer link to the HSRP active, which would then send the ARP reply as a broadcast. For other streams, the Dell switch would send the packet directly to the HSRP active, which would result in a unicast reply, and this works fine. This turned out to be a known bug, and Cisco advised upgrading NX-OS to 5.0(3)N2(2).
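
The behaviour described above can be illustrated with a small toy model. This is only a sketch: the hash function (parity of the last MAC octet) and the MAC addresses are made up for illustration, since the real Dell port-channel hash is not known here, and the NIC is assumed to ignore broadcast ARP replies.

```python
# Toy model of the failure mode. The hash below is a hypothetical stand-in;
# the real hashing algorithm of the Dell switch is not described in this post.

BROADCAST = "ff:ff:ff:ff:ff:ff"


def pick_uplink(src_mac: str) -> str:
    """Hypothetical hash: parity of the last MAC octet selects the uplink."""
    last_octet = int(src_mac.split(":")[-1], 16)
    return "hsrp_active" if last_octet % 2 == 0 else "hsrp_standby"


def arp_reply_destination(src_mac: str) -> str:
    """Destination MAC of the ARP reply under the buggy behaviour."""
    if pick_uplink(src_mac) == "hsrp_standby":
        # Request crosses the peer link; the active answers with a broadcast.
        return BROADCAST
    # Request lands on the active directly; normal unicast reply.
    return src_mac


def target_learns_gateway(src_mac: str) -> bool:
    """Assumes a NIC stack that does not accept broadcast ARP replies."""
    return arp_reply_destination(src_mac) != BROADCAST


# Made-up target MAC addresses, for illustration only.
targets = ["00:16:3e:00:00:%02x" % i for i in range(1, 6)]

if __name__ == "__main__":
    for mac in targets:
        state = "ok" if target_learns_gateway(mac) else "ARP timeout"
        print(mac, pick_uplink(mac), state)
```

In this model, giving a failing VM the MAC of a working VM changes which uplink is chosen, which is consistent with what we saw in troubleshooting step 3, although the real hash inputs are an assumption.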

As per the bug, Symptom:
ARP response from the Nexus 5000 is sent as a broadcast instead of a unicast. Some TCP/IP implementations on network interface cards do not accept a broadcast ARP response and will not install an ARP entry in their ARP database. Such clients will not be able to access network resources.

This occurs when the ARP request is received on the HSRP standby switch and sent over the peer link to the HSRP active switch.
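
When going through a packet trace for this symptom, the thing to check is the Ethernet destination of each ARP reply. Below is a minimal sketch of that check on raw frame bytes. It assumes untagged Ethernet II frames (no 802.1Q header), and the function names and sample addresses are hypothetical, not taken from our capture.

```python
import struct

BROADCAST = b"\xff" * 6
ETHERTYPE_ARP = 0x0806
ARP_OP_REPLY = 2


def classify_arp_frame(frame: bytes) -> str:
    """Classify an untagged Ethernet II frame by how its ARP reply was sent."""
    if len(frame) < 22:
        return "not-arp"
    dst, _src, ethertype = struct.unpack("!6s6sH", frame[:14])
    if ethertype != ETHERTYPE_ARP:
        return "not-arp"
    # ARP opcode sits after htype(2) + ptype(2) + hlen(1) + plen(1).
    (oper,) = struct.unpack("!H", frame[20:22])
    if oper != ARP_OP_REPLY:
        return "arp-other"
    return "broadcast-reply" if dst == BROADCAST else "unicast-reply"


def build_arp_reply(dst_mac: bytes, src_mac: bytes, src_ip: bytes,
                    target_mac: bytes, target_ip: bytes) -> bytes:
    """Build a sample ARP reply frame (for testing the classifier only)."""
    eth = struct.pack("!6s6sH", dst_mac, src_mac, ETHERTYPE_ARP)
    arp = struct.pack("!HHBBH6s4s6s4s", 1, 0x0800, 6, 4, ARP_OP_REPLY,
                      src_mac, src_ip, target_mac, target_ip)
    return eth + arp


# Made-up addresses for illustration.
GATEWAY_MAC = bytes.fromhex("00005e000101")
TARGET_MAC = bytes.fromhex("00163e000001")
GATEWAY_IP = bytes([10, 0, 0, 1])
TARGET_IP = bytes([10, 0, 0, 50])

good = build_arp_reply(TARGET_MAC, GATEWAY_MAC, GATEWAY_IP, TARGET_MAC, TARGET_IP)
bad = build_arp_reply(BROADCAST, GATEWAY_MAC, GATEWAY_IP, TARGET_MAC, TARGET_IP)

if __name__ == "__main__":
    print(classify_arp_frame(good))
    print(classify_arp_frame(bad))
```

A reply classified as a broadcast reply is the kind that, per the bug description, some NIC TCP/IP stacks will not accept.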

Once NX-OS was upgraded, the problem was fixed.
