Monday, December 21, 2009

VMware ESX Corrupt --redo log--

I was recently faced with a VMware ESX issue, and while researching a solution I found a useful article in the VMware knowledge base.

Issue:

I noticed that one of the servers wasn't functioning correctly, and after further investigation of the logs I found the following error message: "The Redo Log on Srvr01-0000001.vmdk has detected to be corrupt. The virtual machine needs to be powered down. If the problem still persists, you need to discard the redo log".
After several failed attempts to restart the guest OS, I followed the VMware article, which resolved the issue for me.

The article is well written, so the only thing I would like to add is the following steps for creating a local user account on the VMware host. This account is needed to SSH into the host and perform the steps necessary to resolve this issue.
Root login via SSH is disabled by default, which is why these steps are necessary.

Steps to create a local user with shell access on a VMware ESX host:

Step One:


Launch your VMware Infrastructure Client and connect directly to the ESX server that's hosting the problematic guest OS. If you connect to your VirtualCenter server instead, you won't get the option to create a local user. Log in with the root user name and password.

Step Two:

Once you are logged in, click the "Users and Groups" tab and ensure that you are in the "Users" view. Then right-click a user account in the list, select "Add", and fill in the requested information (user name and password).

**Note** Check the box that says "Grant shell access to this user".


You can now SSH to the ESX server using your newly created credentials. Once you log in, issue the "su -" command. You will be prompted for the root password; enter it and you are ready to continue with the rest of the article.

Excerpt from the article:

To terminate the Master World and User Worlds for the virtual machine:

1. Run the following command to list the running virtual machines to determine the virtual machine ID for the affected virtual machine:

# cat /proc/vmware/vm/*/names

The output appears similar to:

vmid=1076 pid=-1 cfgFile="/vmfs/volumes/50823edc-d9110dd9-8994-9ee0ad055a68/VMNAME/VMNAME.vmx" uuid="50 28 4e 99 3d 2b 8d a0-a4 c0 87 c9 8a 60 d2 31" displayName="VMNAME-192.168.1.10"

Note: vmid='1076' is used as an example in this article.
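On a host with many guests, picking the right line out of that output by eye can be tedious. Here is a minimal sketch of a helper that pulls the vmid for a given display name. The function name vmid_for_display_name is my own and not part of the article, and it assumes every line has the same layout as the sample above (vmid= first, displayName= in double quotes).

```shell
#!/bin/sh
# Hypothetical helper (not from the VMware article): print the vmid of each
# line on standard input whose displayName field equals the first argument.
# Assumes lines look like the sample output above, with vmid=<number> first.
vmid_for_display_name() {
    awk -v want="displayName=\"$1\"" '
        index($0, want) {          # literal match, so dots in the name are safe
            sub(/^vmid=/, "", $1)  # strip the "vmid=" prefix from the first field
            print $1
        }'
}

# Sample line copied from the output shown above.
sample='vmid=1076 pid=-1 cfgFile="/vmfs/volumes/50823edc-d9110dd9-8994-9ee0ad055a68/VMNAME/VMNAME.vmx" uuid="50 28 4e 99 3d 2b 8d a0-a4 c0 87 c9 8a 60 d2 31" displayName="VMNAME-192.168.1.10"'

printf '%s\n' "$sample" | vmid_for_display_name "VMNAME-192.168.1.10"   # prints 1076
```

On the host itself you would feed it the real listing instead of the sample, for example by piping the cat command from step 1 into the function.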

2. Run the following command to identify the Master World ID:

# less -S /proc/vmware/vm/1076/cpu/status

Expand the terminal or scroll until you can see the right-most column labeled 'group'. In this column you find a value of the form vm.####, where #### is the Master World ID.

In this example, '1092' is the ID of the Master World.
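If you would rather not scroll sideways in less, the vm.#### value can also be fished out with a small filter. This is only a sketch: the function name master_world_id is hypothetical, and the one-line input below is a made-up stand-in, not real output from the status file (the article does not show its layout). The only thing it relies on is what the article states, that the group column holds a token of the form vm.####.

```shell
#!/bin/sh
# Hypothetical helper: print the digits of the first whitespace-separated
# token of the form vm.<digits> found on standard input.
master_world_id() {
    tr -s ' \t' '\n' |                            # one token per line
        sed -n 's/^vm\.\([0-9][0-9]*\)$/\1/p' |   # keep only vm.<digits>, drop "vm."
        head -n 1                                 # first match is enough
}

# Illustrative stand-in row, NOT actual /proc output.
printf '%s\n' 'other-columns-here   vm.1092' | master_world_id   # prints 1092
```

On the host, the input would come from the status file named in step 2 rather than from printf.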

3. Run the following command to terminate the Master World and the virtual machine running in it:

# /usr/lib/vmware/bin/vmkload_app -k9 1092

4. The virtual machine's User Worlds and the virtual machine's processes are stopped.
If the command is successful, you see output similar to:

# /usr/lib/vmware/bin/vmkload_app --kill 9 1070

Warning: Jul 12 07:24:06.303: Sending signal '9' to world 1070.

If the Master World ID is wrong, you see the error:

# /usr/lib/vmware/bin/vmkload_app --kill 9 1071

Warning: Jul 12 07:21:05.407: Sending signal '9' to world 1071.
Warning: Jul 12 07:21:05.407: Failed to forward signal 9 to cartel 1071: 0xbad0061

The virtual machine is now powered off. Power on the virtual machine. Verify that it is able to boot properly and that the error message no longer occurs.
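Since both the good and the bad case print a "Sending signal" warning, it is easy to miss the failure line. A rough sketch of a check that tells the two sample outputs apart is below. The function name kill_failed is my own invention, and the check assumes the failure always contains the text "Failed to forward signal" as in the excerpt.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when the text on standard
# input contains the failure warning quoted in the article excerpt.
kill_failed() {
    grep -q 'Failed to forward signal'
}

# The two sample outputs from the excerpt above.
ok_output="Warning: Jul 12 07:24:06.303: Sending signal '9' to world 1070."
bad_output="Warning: Jul 12 07:21:05.407: Sending signal '9' to world 1071.
Warning: Jul 12 07:21:05.407: Failed to forward signal 9 to cartel 1071: 0xbad0061"

if printf '%s\n' "$bad_output" | kill_failed; then
    echo "wrong Master World ID, check step 2 again"
fi
```

To use it for real, capture what the vmkload_app command prints (including standard error) and pipe that into the function instead of the sample text.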

Here is the link to the article --> http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1006585.




