I was trying to resolve the following error message. It looks straightforward, right?
No, it is not as simple as it looks. So I started troubleshooting:
- I restarted vpx/hostd using the command “service mgmt-vmware restart”. This had no effect at all. I tried a couple of times without any luck.
- We then disconnected the host and tried to reconnect it, but it did not respond at all.
- I was able to ping the host by NetBIOS name and by IP address. The host was up and running.
- I was not able to log in to the host directly, bypassing VC. This seemed to be a bigger problem than anticipated.
- The surprising part was that when I restarted the mgmt-vmware service everything seemed to be OK. Even so, I decided to reinstall the VPX agent. First stop the services:
service mgmt-vmware stop
service vmware-vpxa stop
Get the currently installed vpx version:
rpm -qa | grep vpxa
This should return something like "VMware-vpxa-2.5.0-84767"
Now remove the vpxa agent:
rpm -e VMware-vpxa-2.5.0-84767
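The query and remove steps above can be combined so the version string does not have to be typed by hand. This is only a minimal sketch: `find_vpxa_pkg` is a hypothetical helper name, and it assumes the package name starts with "VMware-vpxa-" as in the sample output above.

```shell
# Hypothetical helper: filter an `rpm -qa` style list on stdin and
# print only the vpxa package line(s).
find_vpxa_pkg() {
    grep '^VMware-vpxa-'
}

# Assumed usage on the ESX service console (left commented out here):
#   pkg=$(rpm -qa | find_vpxa_pkg)
#   echo "$pkg"        # check that exactly one package is listed
#   rpm -e "$pkg"      # remove it only after checking
```

If more than one line comes back, it is safer to stop and remove the packages by hand.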
This can also be done the following way:
- On the vCenter Server, look for the following files in C:\Program Files\VMware\Infrastructure\VirtualCenter Server\upgrade
vpx-upgrade-esx-7-linux-104215
vpx-upgrade-esx-7-linux-104215.sig
- Copy the files to the ESX host and run the following commands:
sh vpx-upgrade-esx-7-linux-104215
service mgmt-vmware restart
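Since both the script and its .sig file are listed above, it may be worth confirming that both actually arrived on the host before running anything. A small sketch of such a check follows; `have_upgrade_pair` is a hypothetical helper, and /tmp as the copy destination is an assumption.

```shell
# Hypothetical helper: succeed only if both NAME and NAME.sig exist in DIR.
have_upgrade_pair() {
    dir=$1
    name=$2
    [ -f "$dir/$name" ] && [ -f "$dir/$name.sig" ]
}

# Assumed usage, with the files copied to /tmp (left commented out here):
#   if have_upgrade_pair /tmp vpx-upgrade-esx-7-linux-104215; then
#       sh /tmp/vpx-upgrade-esx-7-linux-104215
#       service mgmt-vmware restart
#   else
#       echo "upgrade script or .sig file is missing" >&2
#   fi
```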
- Next I thought I would reconnect the host, because reconnecting is when VC pushes the agent back onto the ESX host. But the host did not respond to the reconnect command.
- The host was in a cluster, but I could not put it into maintenance mode because it was disconnected from VC. It now looked like I really had to reboot the host, yet it was running live production VMs.
- I checked the VM status and all were up and running. We decided to check the logs under the VMFS partition and, guess what, we were not able to cd into it. This is a problem because the VMFS partition has to respond to “hostd”, and if something is wrong with VMFS we can expect exactly this kind of problem.
- We tried rescanning the HBA and the rescan hung. At this point we decided to check the health of the filer, and as soon as I logged in I found the following message:
Could this be the problem? No, it was not. The storage admin added some more space and the message went away, but I still could not get the HBA to respond to a rescan.
- I then called my onsite engineer to have a physical look at the HBA, and guess what he told me: the light on the HBA was not blinking. Was my HBA dead? (We have 4 ports but use only one, for reasons not to be revealed.) I asked him to plug it into the next port on the card, and the light started blinking. I still could not get anywhere by rescanning, so I had to reboot the host. I killed all the VMs and then rebooted the host.
This should have been a straightforward issue, but look where it ended up. A nice troubleshooting experience.
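Looking back, the hung cd and the hung rescan could have been probed without tying up the console session. The sketch below runs a command in the background and gives up after a time limit. `run_with_timeout` is a hypothetical helper, the 124 return code is just a convention chosen here, and the adapter name vmhba1 in the usage lines is an assumption.

```shell
# Hypothetical helper: run_with_timeout SECONDS COMMAND [ARGS...]
# Returns the command's own exit status, or 124 if it is still
# running after SECONDS.
run_with_timeout() {
    limit=$1
    shift
    "$@" &
    pid=$!
    elapsed=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$limit" ]; then
            # A process blocked on storage I/O may ignore this signal,
            # so we deliberately do not wait for it afterwards.
            kill "$pid" 2>/dev/null
            return 124
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    wait "$pid"
}

# Assumed usage on the ESX service console (left commented out here):
#   run_with_timeout 10 ls /vmfs/volumes || echo "VMFS is not responding"
#   run_with_timeout 60 esxcfg-rescan vmhba1 || echo "rescan hung or failed"
```

This does not fix anything. It only lets you find out that the storage path is stuck while keeping a usable shell.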