ESXi Hosts Becoming Disconnected from vCenter

Recently I have been experiencing problems with ESXi 4.1 hosts becoming disconnected from vCenter.

Right clicking on the host and selecting Connect does not normally fix the problem. Logging in directly to the host with the vSphere Client usually does not work either; if it does, the connection drops again shortly afterwards.

All of the virtual machines running on the host continue to run; however, without the host being connected to vCenter, and without being able to connect to it directly with the vSphere Client, you are unable to manage those virtual machines. Connecting to the host with PowerCLI does not work either; if it does connect, it drops the connection soon afterwards.

I have used a variety of “tricks” to get the host to reconnect.

  1. Restart the management agents and then reconnect. If this does not work the first time, try again; I have found that it often works on the second attempt. (A command-line alternative is sketched just after this list.)
    1. From the host console press F2 to log in
    2. Enter the root password
    3. Go down to Troubleshooting Options and select it
    4. Select Restart Management Agents
    5. Press F11 to restart the management agents
    6. Once they have been restarted, attempt to reconnect the host by right clicking on it within vCenter and selecting Connect. You will normally get an error message and then be prompted to enter a username and password; enter root and the root password.
  2. If the above fails twice, try removing the host from vCenter and adding it again. This has some impact: you will lose the host's performance statistics, the virtual machines will need to be put back into the correct resource pools if you are using resource pools, and if you are using Site Recovery Manager (SRM) the virtual machine protection will need to be reconfigured. You may want to skip this step, try the ones below first, and keep this as a last resort.
    1. Right click on the host in vCenter and select Remove.
    2. Once it has been removed, right click on the container the host was originally in, e.g. a cluster, and select Add Host
    3. Enter the host name, root for the username and the root password
    4. If the host starts to add and then fails, repeat the steps in 1 above; again, you may have to try them a couple of times.
  3. If you still cannot get the host to reconnect and you are using Fibre Channel storage, rescan the HBAs. As you cannot manage the host with a vSphere Client, you will have to do this at the command line. (If you are unsure of the HBA device names, see the note after this list.)
    1. At the console, if you are not already in the Troubleshooting Options menu, follow steps 1.1 to 1.3 above to get there.
    2. If the menu shows Disable Remote Tech Support Mode, then Remote Tech Support Mode is already enabled; if there is an option for Enable Remote Tech Support Mode, select it.
    3. Use an SSH client such as PuTTY to open an SSH connection to the host
    4. Log in as root
    5. Issue the command esxcfg-rescan for each of the HBAs on the host, where the argument is the vmhba device name, e.g.
      esxcfg-rescan vmhba1
      esxcfg-rescan vmhba2
    6. Now try reconnecting the host by right clicking on it and selecting Connect, as described in step 1.6. Again, if it does not work, follow the steps in 1 above a couple of times.
  4. If you still cannot get the host to connect, check for stale directories in /var/run/vmware/root_0 and for /var/lib/vmware/hostd/stats being full, tidy them up and then attempt to reconnect again. (A quick way to compare the directory count with the number of running virtual machines is sketched after this list.)
    1. If you do not already have an SSH connection to the host, follow steps 3.1 to 3.4 above to open one.
    2. cd /var/run/vmware/root_0
    3. There should be a directory in here for each running virtual machine on the host; issue the following command to get a list of all the directories here.
      ls

      If there are more directories than running virtual machines, use the following command to remove the empty ones. It will attempt to delete the non-empty directories too but will fail on those, so it is safe to run against all of them.
      rmdir *

    4. Issue the following command to check for full filesystems.
      vdf

      The one to check is hostdstats; if it is 100% full, tidy it as follows

      1. cd /var/lib/vmware/hostd/stats
      2. rm hostAgentStats-*.stats
    5. Now restart all of the services with
      services.sh restart
      You can run this while there are running virtual machines on the host without affecting them.
    6. Now attempt to connect the host again, as detailed in step 1.6.
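
If the host console is awkward to get to, the management agent restart in step 1 can also be done over SSH once Remote Tech Support Mode is enabled (see steps 3.1 to 3.4). This is only a rough sketch of what I would try, using the same services.sh script as step 4.5; the vim-cmd call is just a quick check that hostd is answering again before you retry Connect in vCenter.

      services.sh restart                 # restart all of the management agents
      vim-cmd hostsvc/hostsummary         # should return a host summary once hostd is back up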
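
For step 3, if you are not sure which vmhba devices are present on the host, the storage adapters can be listed first; on ESXi 4.1 something like the following should work (a sketch, not an exhaustive reference):

      esxcfg-scsidevs -a                  # list the storage adapters and their vmhba names
      esxcfg-rescan vmhba1                # then rescan each adapter as in step 3.5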
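
For step 4, a quick way to compare the number of directories under /var/run/vmware/root_0 with the number of running virtual machines is sketched below; it assumes the classic esxcli vms namespace that ships with ESXi 4.1.

      ls /var/run/vmware/root_0 | wc -l   # number of directories present
      esxcli vms vm list                  # running virtual machines on this host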

I have found that if a host becomes disconnected and I have had to use the steps above to reconnect it, it becomes disconnected again within the next 24 hours or so unless the host is restarted. Therefore, once you have the host reconnected, I suggest putting it into maintenance mode and restarting it.
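
If the host is back under vCenter management at this point, maintenance mode and the restart are easiest from the vSphere Client, but both can also be done from the same SSH session. This is a rough sketch; maintenance mode will not complete until the running virtual machines have been vMotioned off or shut down.

      vim-cmd hostsvc/maintenance_mode_enter   # request maintenance mode
      reboot                                   # restart the host once it is in maintenance mode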

I think this issue is being caused by NetApp SnapManager for SQL and SnapManager for Exchange. When a host becomes disconnected, SnapDrive on one of the virtual machines on that host is usually in the middle of reconfiguring the virtual machine and rescanning the HBAs to attach or detach RDMs from a NetApp snapshot, in order to verify a backup that has just been run by the SnapManager product on that virtual machine. I do not think this is a fault of the SnapManager products, as they are simply using the VMware APIs to perform the tasks they need to do. All of the hosts I am having this issue with are running an unpatched version of ESXi 4.1 (build 260247). They are also running from IBM-supplied USB keys without the latest IBM customisation. I plan to upgrade the hosts to at least ESXi 4.1 Update 1 (build 348481) or ESXi 4.1 Update 2 (build 502767) to see if this helps the situation. I will also apply IBM customisation 1.0.4, as this fixes the issue of vMotion and Fault Tolerance becoming disabled following a reboot of the host. I will update this post with details of whether these updates helped or not.
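
To confirm which build a host is actually running before and after patching, the version string can be checked from the Tech Support Mode shell:

      vmware -v                           # e.g. VMware ESXi 4.1.0 build-260247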


5 Responses to ESXi Hosts Becoming Disconnected from vCenter

  1. Patrick says:

    Very interesting article. I am having similar issues and have tried to restart the management agents as you suggest, however at that point the host becomes unresponsive and requires a hard reboot.

    I’m going to double check my versions and was curious if you have corrected the issue on your end yet?

    Thanks,
    Patrick

  2. Dan says:

    Thank you very much for this post. I am having the same problem, which appears to be related to Veeam Backup and Replication. If I run an esxcli network connection list, I see numerous connections from my Veeam server with high Send-Q values. I followed these steps and was able to reconnect.

  3. Andrew says:

    Nice article. BIG Help. We had some MDS switch work this morning that caused a path failover and ESXi didn’t like it. Step 3 worked for me.
