Hi
I have a very frustrating problem: I keep having to hard reboot a newly built ESXi 5.1.0 (kernel build 799733) host machine.
The machine is a 2 x Dual Core AMD Opteron x64 based server with 28GB RAM and 2 x 1TB local HD. ESXi is booting from a USB drive, and each HD has one VM datastore so I have DS1 and DS2 datastores.
The symptom is that after a period of low activity (guests are all up but not being asked to do much work), DS2 freezes. Trying to browse the datastore just says "Searching datastore........" and all VMs on that datastore are unreachable. It only ever affects DS2. DS1 doesn't experience the issue.
Going into the host via SSH at this point, I cannot even list the contents of the /var/log or /vmfs/volumes directories; the command just hangs.
I cannot restart the management agents or reboot from the ESXi console. The only way to bring things back to life is to hard restart the host, at which point everything is fine. VMs start and are responsive.
I have tried the steps in this KB article, but they made no difference:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1030265
I have also disabled all power saving options and IOMMU in the host BIOS.
After the reboot I checked vmkernel.log and can see these disk-related messages logged just before the reboot:
2013-05-02T14:59:18.681Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 2 times
2013-05-02T14:59:18.681Z cpu2:6453)ScsiDeviceIO: 2303: Cmd(0x4124403d5a40) 0x2a, CmdSN 0x80000007 from world 6453 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC182891" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0$
2013-05-02T15:00:01.530Z cpu1:19075)VSCSI: 2370: handle 8200(vscsi0:0):Reset request on FSS handle 198922 (0 outstanding commands)
2013-05-02T15:00:01.530Z cpu1:19075)VSCSI: 2370: handle 8201(vscsi0:1):Reset request on FSS handle 166153 (0 outstanding commands)
2013-05-02T15:00:01.530Z cpu2:4170)VSCSI: 2648: handle 8200(vscsi0:0):Reset [Retries: 0/0]
2013-05-02T15:00:01.530Z cpu2:4170)VSCSI: 2446: handle 8200(vscsi0:0):Completing reset (0 outstanding commands)
2013-05-02T15:00:01.530Z cpu2:4170)VSCSI: 2648: handle 8201(vscsi0:1):Reset [Retries: 0/0]
2013-05-02T15:00:01.530Z cpu2:4170)VSCSI: 2446: handle 8201(vscsi0:1):Completing reset (0 outstanding commands)
2013-05-02T15:29:18.941Z cpu0:5267)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 (driver name: sata_nv) - Message repeated 17 times
2013-05-02T15:29:18.941Z cpu0:5267)ScsiDeviceIO: 2303: Cmd(0x4124003edac0) 0x85, CmdSN 0x18 from world 5267 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC024514" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2013-05-02T15:59:20.236Z cpu3:6569)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 (driver name: sata_nv) - Message repeated 91 times
2013-05-02T15:59:20.236Z cpu3:6569)ScsiDeviceIO: 2303: Cmd(0x4124403f5400) 0x2a, CmdSN 0x80000016 from world 6569 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC024514" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x5 0x20 $
2013-05-02T16:29:20.717Z cpu3:6522)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 7 times
2013-05-02T16:29:20.717Z cpu3:6522)ScsiDeviceIO: 2303: Cmd(0x4124404017c0) 0x2a, CmdSN 0x8000001d from world 6522 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC182891" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0$
2013-05-02T16:59:21.052Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 9 times
2013-05-02T16:59:21.052Z cpu2:6453)ScsiDeviceIO: 2303: Cmd(0x4124403f2400) 0x2a, CmdSN 0x80000044 from world 6453 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC182891" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0$
2013-05-02T17:59:21.822Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 3 times
2013-05-02T17:59:21.822Z cpu2:6453)ScsiDeviceIO: 2303: Cmd(0x4124403da000) 0x2a, CmdSN 0x8000001e from world 6453 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC182891" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0$
2013-05-02T18:59:22.730Z cpu2:6569)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 (driver name: sata_nv) - Message repeated 3 times
2013-05-02T18:59:22.730Z cpu2:6569)ScsiDeviceIO: 2303: Cmd(0x4124403d3d40) 0x2a, CmdSN 0x8000003d from world 6569 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC024514" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0$
2013-05-02T19:29:23.425Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 1 time
2013-05-02T19:29:23.425Z cpu2:6453)ScsiDeviceIO: 2303: Cmd(0x4124403dbf00) 0x2a, CmdSN 0x80000045 from world 6453 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC182891" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0$
2013-05-02T19:29:23.425Z cpu2:6453)ScsiDeviceIO: 2303: Cmd(0x4124403d5340) 0x2a, CmdSN 0x8000005f from world 6453 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC182891" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0$
2013-05-02T19:59:25.512Z cpu3:6529)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 (driver name: sata_nv) - Message repeated 125 times
2013-05-02T19:59:25.512Z cpu3:6529)ScsiDeviceIO: 2303: Cmd(0x4124403d4040) 0x2a, CmdSN 0x800000b1 from world 6529 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC024514" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0$
2013-05-02T19:59:25.672Z cpu2:6529)ScsiDeviceIO: 2303: Cmd(0x4124403da200) 0x2a, CmdSN 0x7164 from world 4100 to dev "t10.ATA_____WDC_WD1002FAEX2D00Z3A0________________________WD2DWCATRC024514" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
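As far as I can tell, the D:0x8 device status in these entries is VMK_SCSI_DEVICE_BUSY = 0x8, which the VMware documentation illustrates and explains as follows: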
vmkernel: 1:02:02:02.206 cpu3:4099)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x410005078e00) to NMP device "naa.6001e4f000105e6b00001f14499bfead" failed on physical path "vmhba1:C0:T0:L100" H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
This status is returned when a LUN cannot accept SCSI commands at the moment. As this should be a temporary condition, the command is tried again.
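To read the H:/D:/P: fields in the ScsiDeviceIO entries without decoding them by hand, a rough Python sketch like the one below might help. The regex, the `decode` function and the dictionary names are my own assumptions for illustration. Only the 0x8 name comes from the text above; the other status and opcode names are the standard SCSI ones.

```python
import re

# Standard SCSI status bytes (the D: field). Only 0x8 is named in this post;
# the remaining names are the generic SCSI ones, added for illustration.
DEVICE_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY (VMK_SCSI_DEVICE_BUSY)",
    0x18: "RESERVATION CONFLICT",
}

# SCSI opcodes that appear in the log entries quoted in this post.
OPCODES = {
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
    0x85: "ATA PASS-THROUGH(16)",
}

# Hypothetical pattern written against the ScsiDeviceIO lines quoted above.
LINE_RE = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<op>0x[0-9a-f]+),.*?'
    r'to dev "(?P<dev>[^"]+)" failed '
    r'H:(?P<h>0x[0-9a-f]+) D:(?P<d>0x[0-9a-f]+) P:(?P<p>0x[0-9a-f]+)'
)


def decode(line):
    """Return the decoded status fields of one ScsiDeviceIO line, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    op = int(m.group("op"), 16)
    dev_status = int(m.group("d"), 16)
    return {
        # Keep only the trailing token of the long t10 name; it is the part
        # that differs between the two disks.
        "device": m.group("dev").rsplit("_", 1)[-1],
        "opcode": OPCODES.get(op, hex(op)),
        "host_status": int(m.group("h"), 16),
        "device_status": DEVICE_STATUS.get(dev_status, hex(dev_status)),
        "plugin_status": int(m.group("p"), 16),
    }
```

Feeding it the vmkernel.log lines one at a time gives a small dictionary per failed command; lines that are not ScsiDeviceIO failures return None and can be skipped.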
In vmkwarning.log I am getting similar messages every 30 minutes:
2013-05-02T14:29:18.420Z cpu0:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 1 time
2013-05-02T14:59:18.681Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 2 times
2013-05-02T15:29:18.941Z cpu0:5267)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 (driver name: sata_nv) - Message repeated 17 times
2013-05-02T15:59:20.236Z cpu3:6569)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 (driver name: sata_nv) - Message repeated 91 times
2013-05-02T16:29:20.717Z cpu3:6522)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 7 times
2013-05-02T16:59:21.052Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 9 times
2013-05-02T17:59:21.822Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 3 times
2013-05-02T18:59:22.730Z cpu2:6569)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 (driver name: sata_nv) - Message repeated 3 times
2013-05-02T19:29:23.425Z cpu2:6453)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba33:0:0:0 (driver name: sata_nv) - Message repeated 1 time
2013-05-02T19:59:25.512Z cpu3:6529)WARNING: LinScsi: SCSILinuxQueueCommand:1193:queuecommand failed with status = 0x1056 Unknown status vmhba1:0:0:0 (driver name: sata_nv) - Message repeated 125 times
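To make the timing and the per-adapter spread of these warnings easier to see, something like the following Python sketch could tally them. The regex and the `summarize` function are my own assumptions based on the line format above, not a VMware tool.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical pattern written against the LinScsi warning lines quoted above.
WARN_RE = re.compile(
    r'^(?P<ts>\S+)Z cpu\d+:\d+\)WARNING: LinScsi: .*?'
    r'(?P<hba>vmhba\d+):\d+:\d+:\d+ .*?Message repeated (?P<n>\d+) times?'
)


def summarize(lines):
    """Return (repeat count per adapter, gaps between warnings in minutes)."""
    repeats = Counter()
    stamps = []
    for line in lines:
        m = WARN_RE.match(line.strip())
        if m is None:
            continue  # not a LinScsi queuecommand warning
        repeats[m.group("hba")] += int(m.group("n"))
        stamps.append(datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f"))
    # Gap between each warning and the next one, rounded to whole minutes.
    gaps = [round((b - a).total_seconds() / 60) for a, b in zip(stamps, stamps[1:])]
    return repeats, gaps
```

It could be run over a copy of vmkwarning.log with `summarize(open("vmkwarning.log"))`, where the file name is just a placeholder for wherever the log was copied to.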
A similar story appears in http://communities.vmware.com/thread/341512 but I can't see anything there to try that I haven't tried already.
Any ideas appreciated.
Thanks