We had our first issue today with our recent VMware vSphere 4 installation. We’re currently up to about 30 virtual machines spread across five BL460c (36GB) blades in an HP 7000 Enclosure. The problem started with a few virtual machines just going south, as if they had lost their minds. It turned out that all of the affected virtual machines lived on the same datastore (LUN).
One of the engineers put the ESX host that was running those VMs into maintenance mode and rebooted it. After the reboot the ESX host was unable to mount the datastore. Everything looked fine from the SAN standpoint and the Fibre Channel switches were healthy. A quick look at /var/log/vmkwarning on the ESX host revealed the following messages:
Sep 1 13:04:35 mdcc01h10r242 vmkernel: 0:00:26:02.384 cpu4:4119)WARNING: ScsiDeviceIO: 1374: I/O failed due to too many reservation conflicts. naa.600508b4000547cc0000b00001540000 (920 0 3)
Sep 1 13:04:40 mdcc01h10r242 vmkernel: 0:00:26:07.400 cpu6:4119)WARNING: ScsiDeviceIO: 1374: I/O failed due to too many reservation conflicts. naa.600508b4000547cc0000b00001540000 (920 0 3)
Sep 1 13:04:40 mdcc01h10r242 vmkernel: 0:00:26:07.400 cpu6:4119)WARNING: Partition: 705: Partition table read from device naa.600508b4000547cc0000b00001540000 failed: SCSI reservation conflict
A quick examination of the other ESX hosts revealed the following:
Sep 1 13:04:26 mdcc01h09r242 vmkernel: 21:22:13:25.727 cpu10:4124)WARNING: FS3: 6509: Reservation error: SCSI reservation conflict
Sep 1 13:04:31 mdcc01h09r242 vmkernel: 21:22:13:30.715 cpu12:4124)WARNING: FS3: 6509: Reservation error: SCSI reservation conflict
Sep 1 13:04:36 mdcc01h09r242 vmkernel: 21:22:13:35.761 cpu9:4124)WARNING: FS3: 6509: Reservation error: SCSI reservation conflict
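If you want to check your own hosts for the same thing, a quick grep from the ESX service console will surface these warnings (a rough sketch; the log paths below are the classic ESX service console defaults):
# Scan the vmkernel warning/message logs for SCSI reservation conflicts
grep -i "reservation conflict" /var/log/vmkwarning /var/log/vmkernel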
We had a SCSI reservation that was locking all of the ESX hosts out of the LUN. The immediate suspect was the VCB host, since it was the only other host being presented the same datastores (LUNs) as the ESX hosts from the SAN (HP EVA 6000).
We rebooted the VCB host and then issued the following command from one of the ESX hosts to reset the LUN:
vmkfstools -L lunreset /vmfs/devices/disks/naa.600508b4000547cc0000b00001540000
After issuing the LUN reset we observed the following message in the log:
Sep 1 13:04:40 mdcc01h10r242 vmkernel: 0:00:26:07.400 cpu9:4209)WARNING: NMP: nmp_DeviceTaskMgmt: Attempt to issue lun reset on device naa.600508b4000547cc0000b00001540000. This will clear any SCSI-2 reservations on the device.
The ESX hosts were almost immediately able to see the datastore and the problem was resolved.
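As an aside, if you’re not sure which naa.* identifier belongs to the troubled datastore, the following should print the VMFS volume to device mappings on an ESX 4 host (a sketch; double-check the flag on your build):
# Map VMFS datastore names to their underlying naa.* device identifiers
esxcfg-scsidevs -m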
We believe the problem occurred when the VCB host tried to back up multiple virtual machines from the same datastore (LUN) at the same time. The VCB host held the SCSI reservation on the LUN for too long, causing the SCSI queues on the ESX hosts to fill up. This is new to us, so we’re still trying to figure it out.
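If you run into the same situation, one option is to serialize the VCB jobs so the proxy only backs up one VM per datastore (LUN) at a time. Here is a rough sketch in bash, assuming vcbMounter is available on the host you run it from (on a Windows VCB proxy the equivalent would be a batch loop); the vCenter name, credentials and VM names are placeholders, not our actual backup script:
# Back up the VMs that live on the affected LUN one at a time,
# so the backup proxy never holds more than one snapshot on that LUN.
for VM in app01 app02 app03; do
  vcbMounter -h vcenter.example.com -u backupsvc -p 'secret' \
    -a name:$VM -r /vcb-backups/$VM -t fullvm
done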
Cheers!
References:
http://kb.vmware.com/kb/1009899
http://www.vmware.com/files/pdf/vcb_best_practices.pdf
Eric K. Miller says
We ran into a similar situation (LUN locking) but in a totally different way, with ESX 3.5 Update 4 in a 2-node cluster and a relatively new Infortrend SAN (S16F-R1840).
Quite often we’re seeing random LUNs stop responding on one of the hosts. It requires a vmkfstools -L lunreset to regain access to the LUN from the host that has lost access to it.
What’s strange is that we don’t have this problem on any of our other SANs, only the Infortrend LUNs.
I’ve notified Infortrend of the issue and have been on hold waiting for VMware to discuss the issue to see if there is a resolution.
Eric
Michael McNamara says
Thanks for the comment Eric!
While I’m not a SAN expert by any means, I understand the problem can arise for a number of different reasons. The research I did pointed to numerous possible contributors; the most frequently mentioned were SAN performance, LUN size, and SAN switch and HBA compatibility. I would venture to guess that software/driver compatibility could also be a concern.
Good Luck on resolving your problem!
Eric K. Miller says
Thanks for the reply! I also suspected it could be related to those factors. In our case, we have a new unit that had a LOT of load on it without any problems, but then we started to see this happen more and more, even after moving VMs off of the SAN (Storage VMotion'ing them). It's barely used now and I've still seen the problem.
Every bit of equipment we have is on the HCL, thankfully, but I have seen other HCL-listed hardware have problems, so it doesn’t mean much to me anymore, other than getting support and having a “better” chance of getting a resolution to problems.
I’m “still” on hold waiting for VMware. :)
I’ll write back if I find a specific issue that caused our problem.
In case this helps, this is the command I modified from a VMware forum to work with ESX 3.5 to find the locked LUNs on a host:
esxcfg-info -s | grep -i -B 30 PendingReservations………………………….1
Eric
Eric K. Miller says
Sorry, that last command got changed by the blog.
It should be:
esxcfg-info -s | grep -i -B 30 Pending\ Reservations.............................1
(that is "Pending", then a backslash and a space, then "Reservations" followed by the dots)
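Quoting the pattern would also avoid the backslash escape and should survive the blog's formatting; since the dots are just regex wildcards matching the padding in the esxcfg-info output, something like this should be roughly equivalent:
esxcfg-info -s | grep -i -B 30 "Pending Reservations.*1"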
David Gibbons says
Michael,
Your post saved me! Thanks for the tip about vmkfstools -L.
Looks like these reservation issues are becoming a bigger and bigger deal with ESXi 4. Apparently the software iSCSI stack is so much more robust now that it challenges some of the target stacks. My issue in particular was (is!) with Openfiler as the target. Here's the link to help others googling for answers:
https://forums.openfiler.com/viewtopic.php?pid=19063#p19063
Thanks again
Dave
Michael McNamara says
Hi Dave,
I’m happy the information helped you out! I have Openfiler running here at home with an ESXi 3.5 host and really like it. At work all our clusters are running ESX 4 on an HP EVA SAN, which is soon to be upgraded to an IBM XIV.
Cheers!
Eric K. Miller says
Hi Michael,
Off topic, but do you know much about the XIV? I know it's a product they purchased from another company. The main knock I've heard is that you can lose the entire array (hundreds of disks) if more than two disks fail in the whole unit. I may be off on the number of disks, but it was something like that: the concept sounds good, but there is some major issue that can take the whole unit down.
We’re looking at NEC currently for our next SAN solution (D3 and D8).
Eric
Michael McNamara says
Hi Eric,
I honestly wasn't very involved in the IBM XIV discussions. I know it wasn't the product of choice, but because of price it somehow made it in the door. Quite a few guys are skeptical that it'll be able to handle the I/O load of 32+ ESX 4 hosts; I believe the ESX 4 hosts will even be booting from SAN.
We’re supposed to be getting one soon, so I’ll keep you posted if you’d like.
Mike
Eric K. Miller says
Hi Mike,
That would be great if you could provide feedback. I suspect its price is tempting, and their marketing is definitely fantastic, but I’m sure there are practical limits. Unfortunately, you can’t easily figure out what these limits are until you actually use it. Thankfully, NEC is going to provide a trial unit for us to push to its limits to see what we might need.
I’ll keep you posted on my progress as well. Feel free to email me at emiller at genesishosting.com if you want to discuss.
Thanks!
Eric
Valerian Crasto says
Hi Eric,
I have a couple of small questions.
1) What is the difference between # esxcfg-info -s | egrep -B16 "s Reserved|Pending" and esxcfg-info -s | grep -i -B 30 Pending\ Reservations.............................1 ?
2) What does the -B 30 or -B 16 mean?
Note: we are running VMware ESX 4.0.
Thanks in advance.
Valerian Crasto.
Michael McNamara says
Hi Valerian,
I’m not sure that Eric will see your post but here are your answers.
1) The pipe tells egrep to look for either term, "Reserved" or "Pending". It's the OR operator; egrep understands the bare pipe, while plain grep needs it escaped with a backslash.
2) The -B option tells egrep/grep to display the given number of lines before each line matching the search pattern, so -B16 shows 16 lines of leading context and -B 30 shows 30.
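For example, run from the ESX service console (the pattern here is only illustrative):
# Print each matching line plus the 16 lines that come before it
esxcfg-info -s | egrep -B 16 "Reserved|Pending"
That will show every line containing "Reserved" or "Pending" along with the 16 lines of context above it, which is the trick the commands above use to show which LUN a given reservation counter belongs to.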
Cheers!
Eric K. Miller says
Hi Michael,
I saw the post! :) Thanks for answering, though.
As a follow-up to my previous messages, there is apparently a firmware update that was released last year for the HP MSA1500cs that supposedly fixes the “table full” issue that caused locked LUNs on that device. Only 7 years after the release of the hardware. :)
Also, our NEC SAN is working fantastically. We had a couple of initial firmware issues, but they have been solved and it's been running flawlessly for 6+ months.
Eric