Friday, March 26, 2010

Troubleshooting: Set retry timeout for failed TaskMgmt abort for CmdSN

1. We were having an issue with one of the ESX hosts, which had 3 VMs with multiple RDM LUNs attached. The host itself was running fine, but the VMs were getting BSODs with the following error message.

The host was running fine, but the vmkernel log had the following messages:

vmkernel: 0:00:55:23.452 cpu4:1066)LinSCSI: 3201: Abort failed for cmd with serial=0, status=bad0001, retval=bad0001

Mar 25 22:01:57 xxx vmkernel: 0:00:55:23.458 cpu4:1066)WARNING: ScsiPath: 3802: Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, status Failure, path vmhba1:C0:T0:L2

Mar 25 22:02:37 xxxx vmkernel: 0:00:56:03.465 cpu4:1066)LinSCSI: 3201: Abort failed for cmd with serial=0, status=bad0001, retval=bad0001

Mar 25 22:02:37 xxx vmkernel: 0:00:56:03.471 cpu4:1066)WARNING: ScsiPath: 3802: Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, status Failure, path vmhba1:C0:T0:L2

Mar 25 22:02:41 xxxx vmkernel: 0:00:56:06.931 cpu4:1062)VSCSI: 3183: Retry 0 on handle 8202 still in progress after 62 seconds
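The useful piece of information in these messages is the failing path at the end of the WARNING lines. A quick grep pulls it out; this sketch runs against a sample line copied from the log above rather than the live log file:

```shell
# Extract the failing path (vmhbaN:CN:TN:LN) from a vmkernel message.
# The echoed line is a sample copied from the log above; on a live
# host you would grep /var/log/vmkernel instead.
echo "WARNING: ScsiPath: 3802: Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, status Failure, path vmhba1:C0:T0:L2" \
  | grep -o 'vmhba[0-9]*:C[0-9]*:T[0-9]*:L[0-9]*'
```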

2. We then tried to find out which LUN it was, using the esxcfg-vmhbadevs tool. This lists all the VMFS partitions; since all of the LUNs were configured as RDMs, none of them showed up:

[root@xxxx log]# esxcfg-vmhbadevs -m

vmhba0:0:0:3 /dev/cciss/c0d0p3 496a5f4e-dda2c50a-1326-00237d5adda0
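Because `-m` only maps VMFS-backed devices, any device in the full device list that is missing from the `-m` output is a candidate RDM LUN. A sketch of that comparison, using hypothetical sample device names in place of the real command output:

```shell
# Sketch: an RDM LUN shows up in the full device list but not in the
# VMFS mapping (-m). The device names below are sample placeholders
# standing in for real esxcfg-vmhbadevs output.
printf '%s\n' vmhba0:0:0 vmhba3:0:5 vmhba3:0:6 | sort > all_devs.txt
printf '%s\n' vmhba0:0:0 | sort > vmfs_devs.txt
comm -23 all_devs.txt vmfs_devs.txt   # devices with no VMFS volume: RDM candidates
rm -f all_devs.txt vmfs_devs.txt
```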

3. We then listed the managed paths for the LUNs:

Disk vmhba3:0:6 /dev/sdf (25600MB) has 1 paths and policy of Fixed

iScsi 26:1.1 iqn.2000-04.com.qlogic:qle4062c.lfc0908h85049.1<->iqn.1992-08.com.netapp:sn.xxx vmhba3:0:6 On active preferred

Disk vmhba3:0:5 /dev/sde (71687MB) has 1 paths and policy of Fixed

iScsi 26:1.1 iqn.2000-04.com.qlogic:qle4062c.lfc0908h85049.1<->iqn.1992-08.com.netapp:sn.xxx vmhba3:0:5 On active preferred

Disk vmhba3:0:10 /dev/sdj (5120MB) has 1 paths and policy of Fixed

iScsi 26:1.1 iqn.2000-04.com.qlogic:qle4062c.lfc0908h85049.1<->iqn.1992-08.com.netapp:sn.xxx vmhba3:0:10 On active preferred

Disk vmhba3:0:16 /dev/sdp (1399988MB) has 1 paths and policy of Fixed

iScsi 26:1.1 iqn.2000-04.com.qlogic:qle4062c.lfc0908h85049.1<->iqn.1992-08.com.netapp:sn.xxx vmhba3:0:16 On active preferred
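When there are many LUNs, a one-liner can condense this output to one line per disk. This sketch runs against a two-line sample copied from the listing above; on a host you would pipe the multipath listing into it instead:

```shell
# Summarize the path listing above: print just the identifier, device
# node, and size for each disk. The here-doc is sample data copied from
# the output shown above.
awk '/^Disk/ {print $2, $3, $4}' <<'EOF'
Disk vmhba3:0:6 /dev/sdf (25600MB) has 1 paths and policy of Fixed
Disk vmhba3:0:5 /dev/sde (71687MB) has 1 paths and policy of Fixed
EOF
```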

4. We then went to /proc/scsi/ and read the file called scsi.

If you look at the vmkernel error messages above, they mention the vmhba name along with the LUN number. The scsi file helps clarify this further: it shows the NetApp revision that is running and what kind of access each LUN provides:

Host: scsi3 Channel: 00 Id: 00 Lun: 01

Vendor: NETAPP Model: LUN Rev: 7310

Type: Direct-Access ANSI SCSI revision: 04

Host: scsi3 Channel: 00 Id: 00 Lun: 02

Vendor: NETAPP Model: LUN Rev: 7310

Type: Direct-Access ANSI SCSI revision: 04

Host: scsi3 Channel: 00 Id: 00 Lun: 03

Vendor: NETAPP Model: LUN Rev: 7310

Type: Direct-Access ANSI SCSI revision: 04

Host: scsi3 Channel: 00 Id: 00 Lun: 04

Vendor: NETAPP Model: LUN Rev: 7310

Type: Direct-Access ANSI SCSI revision: 04

Host: scsi3 Channel: 00 Id: 00 Lun: 05

Vendor: NETAPP Model: LUN Rev: 7310

Type: Direct-Access ANSI SCSI revision: 04
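A quick sanity check is to count how many NetApp LUNs the host sees in this file. The sketch below runs against a sample excerpt; on the host itself you would run `grep -c NETAPP /proc/scsi/scsi` instead:

```shell
# Count the NETAPP LUNs in /proc/scsi/scsi-style output. The here-doc
# is a sample excerpt standing in for the real file.
grep -c 'Vendor: NETAPP' <<'EOF'
Host: scsi3 Channel: 00 Id: 00 Lun: 01
Vendor: NETAPP Model: LUN Rev: 7310
Host: scsi3 Channel: 00 Id: 00 Lun: 02
Vendor: NETAPP Model: LUN Rev: 7310
EOF
```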

5. We also checked the following location for the HBA firmware version:

cd /proc/scsi/qla4022/

[root@xxx]# ls

1 2 3 4 HbaApiNode

6. We also suspected HP SIM, which had been installed as described in my earlier post, and finally we uninstalled it: go to the hpmgmt folder and run ./installvm811.sh --uninstall
[root@xzxxx qla4022]# esxupdate -l query

Installed software bundles:

------ Name ------ --- Install Date --- --- Summary ---

3.5.0-64607 20:40:59 12/31/08 Full bundle of ESX 3.5.0-64607

ESX350-200802303-SG 20:41:00 12/31/08 util-linux security update

ESX350-200802408-SG 20:41:00 12/31/08 Security Updates to the Python Package.

ESX350-200803212-UG 20:41:00 12/31/08 Update VMware qla4010/qla4022 drivers

ESX350-200803213-UG 20:41:00 12/31/08 Driver Versioning Method Changes

ESX350-200803214-UG 20:41:01 12/31/08 Update to Third Party Code Libraries

ESX350-200804405-BG 20:41:01 12/31/08 Update to VMware-esx-drivers-scsi-megara

ESX350-200805504-SG 20:41:01 12/31/08 Security Update to Cyrus SASL

ESX350-200805505-SG 20:41:01 12/31/08 Security Update to unzip

ESX350-200805506-SG 20:41:01 12/31/08 Security Update to Tcl/Tk

ESX350-200808206-UG 20:41:02 12/31/08 Update to vmware-hwdata
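When comparing patch levels between hosts, it helps to strip this listing down to just the bundle names. A sketch against two sample lines copied from the output above:

```shell
# Pull just the patch-bundle names out of esxupdate-style output, e.g.
# to diff the patch levels of two hosts. The here-doc is sample data
# copied from the listing above.
awk '/^ESX350-/ {print $1}' <<'EOF'
ESX350-200802303-SG 20:41:00 12/31/08 util-linux security update
ESX350-200803212-UG 20:41:00 12/31/08 Update VMware qla4010/qla4022 drivers
EOF
```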

7. We then checked the console of the ESX host and pressed Alt+F12, which displays live vmkernel messages on the console.

After doing all of the above, we decided to swap the HBA cabling. The HBA was plugged directly into the FAS2020 using a QLE4062 dual-port card. We moved the cable to a different HBA, and that seemed to fix the problem. During the course of troubleshooting, VMware told us that we cannot have two dual-port QLE4062 cards, as officially only one is supported. I was surprised when they shared the configuration maximums as well. I told them this might be an honest mistake in the statement. Let's see what VMware has to say.

Create an additional VMFS volume if you have more than one RAID disk

Say you have configured your ESX 3.5 host with 2 disks as RAID 1+0 and the rest as RAID 5 for the VMFS partition. If you installed ESX with a next-next-finish install, you will not see the RAID 5 array as a VMFS partition. In VirtualCenter you can see it mounted as a different target, but when you try to add it as a datastore, you won't be able to.

A colleague of mine had installed ESX like that and was struggling to create an additional VMFS partition on the RAID 5 array. He tried the Add Storage wizard, but nothing showed up there.

How do we then create an additional VMFS partition? We cannot add it as an extent, as discussed in my previous blog. I then found a useful KB article and asked him to follow it step by step. (You never know if VMware will make it paid content, so I'm copying it to the blog.)

To create a new VMFS volume from the command line:

1. Locate the LUN you wish to format. For example, vmhba1:2:0.

2. Log in to the ESX console, either directly or through an SSH client.

3. Rescan the adapter to ensure that ESX is updated with the latest storage information. Run the command:

esxcfg-rescan vmhba<X>

where <X> is the adapter number

4. Locate the SCSI device from the console in order to find the device node for the LUN, and make note of the identifier.

o For versions of ESX earlier than 4.0, run the command:

esxcfg-vmhbadevs -m

Note: For ESX 3.x, the identifier is in the form of vmhba<C>:<T>:<L>:<P>.

o For ESX 4.0 and later, run the command:

esxcfg-scsidevs -c

Note: For ESX 4.0, the identifier is in the form of naa.<NAA>.

5. Choose either the Linux or the VMkernel device name to open with fdisk.

6. Open the device:

o For a Linux device, run the command:

fdisk /dev/sd<X>

where <X> is the device node letter

o For a VMkernel device, run the command:

fdisk /vmfs/devices/disks/<device>

where <device> is the device reported in the output of step 4

7. Type p and then Enter to determine if any VMFS partitions already exist.

Note: VMFS partitions are identified by a partition system ID of fb.

8. Type n and then Enter to create a new partition.

9. Type p and then Enter to create a primary partition.

10. Type 1 and then Enter to create partition number 1.

Note: If partitions already exist but you want to use the free space, type 2, 3 or 4. You cannot have more than 4 primary partitions.

11. Select the defaults to use the complete disk.

12. Type t and then Enter to set the partition's system ID.

13. Type fb and then Enter to set the partition system ID to fb (VMware VMFS volume).

14. Skip to step 16 if the partition you created in step 9 is not the first partition.

15. Type x and then Enter to go into expert mode.

16. Type b and then Enter to adjust the starting block number.

17. Type 1 and then Enter to choose partition 1.

18. Type 128 and then Enter to set the offset to 128.

19. Type w and then Enter to write label and partition information to the disk.

20. Use vmkfstools to format the partition.

o For ESX 3.x, run the command:

# vmkfstools -C vmfs3 -b <Block_Size> -S <VMFS_Name> vmhba<C>:<T>:<L>:<P>

Note: Refer to the applicable identifier in step 4. The last number is the partition number, which must match the partition you created with fdisk.

For example:

# vmkfstools -C vmfs3 -b 8m -S LocalVMFS /vmfs/devices/disks/vmhba1:2:0:1

This creates a new VMFS3 volume named LocalVMFS on the target vmhba1:2:0:1 with an 8 MB block size.

o For ESX 4.x, run the command:

# vmkfstools -C vmfs3 -b <Block_Size> -S <VMFS_Name> naa.<NAA>:<partition>

Note: Please refer to the applicable identifier in step 4. The last number is the partition number, which must match the partition you created with fdisk.

For example:

# vmkfstools -C vmfs3 -b 8m -S LocalVMFS /vmfs/devices/disks/naa.6090a038f0cd6e5165a344460000909b:1

This creates a new VMFS3 volume named LocalVMFS on the target naa.6090a038f0cd6e5165a344460000909b:1 with an 8 MB block size.

21. Rescan the HBAs on all of the ESX hosts to update them with the new information.
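The interactive fdisk keystrokes from steps 8 through 19 above can also be collected into a string and piped into fdisk non-interactively. This is only a sketch: `/dev/sdX` is a placeholder, and because this repartitions the disk, the fdisk line itself is left commented out.

```shell
# Sketch: the fdisk keystrokes from steps 8-19 above (new primary
# partition 1, default extents, type fb, expert mode, start block 128,
# write). /dev/sdX is a placeholder; uncomment only on the right disk.
keys='n
p
1


t
fb
x
b
1
128
w'
printf '%s\n' "$keys"
# printf '%s\n' "$keys" | fdisk /dev/sdX   # DESTRUCTIVE: repartitions the disk
```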

Thursday, March 25, 2010

Using Array Configuration Utility (ACU) on ESX host to extend VMFS volume

We had a situation where a local VMFS volume needed to be expanded, and the condition was to do it without rebooting. I searched the VMTN forums and could not find a way to expand a VMFS volume with additional drives: a reboot is required no matter what we do. This post explains how to use the ACU (Array Configuration Utility).

1. To start with, we need to put the new hard drives into empty slots in the ESX host.

2. We need to install the HP Insight Manager agents and then the ACU utility for Linux.

3. To install HP Insight Manager on ESX 3.5, please follow this link.

4. To install ACU on the ESX host, please download it from this link.

While installing ACU I had an RPM issue: hpsmh-3.0.0-68 was already installed, and the hprsm-8.1.1-29 install was failing because of it:

[root@xxx]# ./installvm811.sh --install
HP Insight Manager Agent 8.1.1-13 Installer for VMware ESX Server
Target System is VMware ESX Server 3.5.0 build-176894
This script will now attempt to install the HP Insight Manager Agents.
Do you wish to continue (y/n) y
Verifying VMware ESX Server version                                      [ OK ]
Verifying RPM packages:
        Verifying hp-OpenIPMI-8.1.1-26.vmware30.i386.rpm                 [ OK ]
        Verifying hpasm-8.1.1-29.vmware30.i386.rpm                       [ OK ]
        Verifying hprsm-8.1.1-29.vmware30.i386.rpm                       [ OK ]
        Verifying hpsmh-2.1.15-210.vmware30.i386.rpm                     [ OK ]
Checking for previously installed agents                                 [FAILED]
Some agents have already been installed. Please remove the previous installation.
Check hpmgmtlog for additional information

I then checked the log and found this
The following packages have already been installed on your system:
hpsmh-3.0.0-68
Please remove the previous installation
Exit 1

So I had to remove this RPM. First, we need to find its exact name:

[root@xxxx 811]# rpm -qa | grep -i hp
VMware-esx-drivers-scsi-hpsa-350.2.4.66.95vmw-153875
hpsmh-3.0.0-68
Now that we have found the RPM, we can uninstall it:
[root@xxxx 811]# rpm -e hpsmh-3.0.0-68
error: Failed dependencies:
   hpsmh is needed by (installed) cpqacuxe-8.25-5
So we have to uninstall cpqacuxe-8.25-5 first:
[root@zxxx  811]# rpm -e cpqacuxe-8.25-5
cpqacuxe still running! Stop it first.
So we stop it first, then remove both packages:
[root@zxxx 811]# cpqacuxe -stop
[root@xxxx 811]# rpm -e cpqacuxe-8.25-5
[root@xxx 811]# rpm -e hpsmh-3.0.0-68
Stopping hpsmhd: [  OK  ]

5. Once the HP Insight Manager agents are installed on the ESX host, install ACU:

[root@xxx  hp_install]# rpm -ivh cpqacuxe-8.25-5.noarch.rpm
Preparing...                ########################################### [100%]
   1:cpqacuxe               ########################################### [100%]

6. Once it is installed, we have to enable remote access:

[root@xxx  hp_install]# cpqacuxe --enable-remote
Array Configuration Utility version 8.25.5.0
Make sure that you have gone through the following checklist:
   1. Change the administrator password to something other than the default.
   2. Only run ACU on servers that are on a local intranet or a secure network.
   3. Secure the management port (port 2301 or 2381)on your network.
Remote connection enabled!

7. Now you should be able to access this using the URL https://<ILO IP>:2381. Log in as root with the root password. Once you are logged in, you will see something like the item squared below:

clip_image002

8. Open it, and that will bring up the ACU screen. You can see the unassigned drives; we will use those drives to create another array. Click on the array and it will list all the options.

clip_image004

9. Select all the drives that you want to be part of the array.

clip_image006

10. Once you create the array, it will appear below the same controller as unused space. We then need to create a logical drive.

clip_image008

11. The newly created logical drive has to be saved, or else it will not be visible. If you missed anything, you can still delete or discard the changes.

clip_image010

12. It will pop up the following message; choose OK.

clip_image012

13. Once it is done, the new storage appears under the ESX host like this after the host is rebooted. I was not able to get it to show up without rebooting the host.

clip_image014

14. Now we extend the existing VMFS volume: select Properties and add the above capacity as an extent.

clip_image016

I will write another post on doing this from the BIOS, since a reboot is required anyway.

Friday, March 19, 2010

My VCP 4.0 Certification


I completed my VCP in November 2009, and from that moment I was following up with the VMware education department for my certificate. I finally received it yesterday. Thanks, VMware! :)

Wednesday, March 17, 2010

VMotion between ESX 3.5 and 4.0 hosts; error: VMotion is not licensed

I was testing whether an HA/DRS cluster can contain both ESX 4.0 and 3.5 hosts. You can have a fully fledged DRS-enabled mixed cluster, but there is a bug.

I forgot to enable VMotion on my vSwitch (a normal manual mistake that people make) while testing VMotion from the 4.0 host to the 3.5 host, and it gave me a weird "VMotion is not licensed" message.

I called VMware licensing support, and the engineer (they put me on mute) went directly to the vSwitch and changed the setting. I am pretty sure they already knew this error message well. If you look at the error, you can see it is completely misleading: this 3.5 host was part of a VC 2.5 HA/DRS cluster, so how could VMotion not be licensed?

Let's see whether VMware already has a fix or accepts it as a bug.