Michael McNamara – technology, networking, virtualization and IP telephony
https://blog.michaelfmcnamara.com

VMware VeloCloud SD-WAN Orchestrator API and Python – Part 3
https://blog.michaelfmcnamara.com/2021/03/vmware-velocloud-sd-wan-orchestrator-api-and-python-part-3/
Tue, 02 Mar 2021 03:31:11 +0000

It looks like this project is going to be moving forward again… time to dust off the Python code and finish out the last few pieces of the puzzle.

Interestingly enough I ran into a quick problem testing my original code. It looks like something had changed with the “Profile” that we’re using for each Edge. When I run my original Python script I’m getting an HTTP/400 returned along with the following error message: Interface “CELL1” present in Profile but not in Edge. Looking through some of the JSON data it would appear that something has changed with the Profile that I’m using in the configuration. The error I’m getting when calling rest/configuration/updateConfigurationModule likely means that I’m missing some required data in my Jinja templates that the VMware VeloCloud Orchestrator is now expecting.

There is a Chrome extension called VeloCloud Developer Assistant that can help you break down the JSON data and make it a little easier to visually consume and troubleshoot. I personally prefer just going into the Chrome developer tools, copying out the entire JSON data block that’s being posted, and then running that through a JSON formatting tool to clean it up for human consumption. If you go through the steps in the web UI with the Chrome developer tools open, you can go back and extract all the JSON data that is being sent to the VeloCloud Orchestrator – in short, you can easily reverse engineer the calls and the JSON data.

In the end I was able to find the missing CELL1 interface under the routedInterfaces element. I added the missing data elements to the Jinja template and everything started working again. I ended up writing a few other supporting scripts to help with the overall project goal. I wrote a Perl script to poll the existing hardware to gather up all the IP configuration details from each VLAN and interface, which can then be fed into the Python script to build the configuration within the VeloCloud Orchestrator. There’s also a management IP address required, so I used a snippet of Perl code that I wrote back in 2016 to call the Infoblox API to assign the next available IP address in the management subnet.
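
For anyone reproducing the template fix, here’s roughly the shape of it – a hypothetical fragment of the deviceSettings JSON carried as a Jinja2 template and rendered from Python. The field names (routedInterfaces, addressing, etc.) and their nesting should be lifted from the JSON you capture in the Chrome developer tools rather than trusted from this sketch:

from jinja2 import Template

# Hypothetical fragment of the deviceSettings module; the real template is the
# full JSON block captured from the Orchestrator, with the missing CELL1 entry
# added under routedInterfaces.
fragment = Template("""
{
  "routedInterfaces": [
    {
      "name": "{{ cell_if }}",
      "disabled": false,
      "addressing": { "type": "DHCP" }
    }
  ]
}
""")

print(fragment.render(cell_if="CELL1"))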

With the Jinja templates it’s relatively easy to put this code onto a web server and build a simple WebUI around some Python or PHP code to generate new configurations when needed.
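
As a very rough illustration of what I mean, here’s a hypothetical Flask endpoint (all names and paths made up) that renders the Jinja2 template from whatever parameters a simple web form posts, which could then be handed off to the provisioning code:

from flask import Flask, request, jsonify
from jinja2 import Environment, FileSystemLoader

app = Flask(__name__)
env = Environment(loader=FileSystemLoader("templates"))

@app.route("/provision", methods=["POST"])
def provision():
    # store name, VLANs and IPv4 details arrive as JSON from a simple web form
    params = request.get_json()
    config = env.get_template("edge_device_settings.j2").render(**params)
    # hand the rendered JSON off to the VeloCloud provisioning script here
    return jsonify({"status": "rendered", "bytes": len(config)})

if __name__ == "__main__":
    app.run(port=8080)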

Cheers!

VMware VeloCloud SD-WAN Orchestrator API and Python – Part 2
https://blog.michaelfmcnamara.com/2020/08/vmware-velocloud-sd-wan-orchestrator-api-and-python-part-2/
Sun, 02 Aug 2020 14:26:20 +0000

Update: July 2020 – unfortunately COVID-19 halted my VeloCloud roll out just as it was starting. It’s difficult being a retailer when you can’t have your 650 stores open for business.

In my previous post I detailed how I was setting out to programmatically create 650 VeloCloud Edge profiles using VMware VeloCloud’s Orchestrator API.

I had a few hours to dedicate to the quest last weekend and I was able to complete a working Python script. It turns out I was missing an additional parameter in one of my calls, and it took a few hours to track down. The modulesId needs to be determined from getEdgeConfigurationStack, which returns what appears to be a set of device-specific configurations that override the default “Profile” settings.

I now have a working script that will build an edge configuration using a set “Profile” within the Orchestrator and then pass it a Jinja2-rendered template with the device name, VLANs, IPv4 addressing, etc.

Here are the updated steps I’m taking in my script (a rough Python sketch follows the list);

  • Step 1. Login via rest/login/enterpriseLogin (store authentication cookie)
  • Step 2. Call rest/enterprise/getEnterpriseConfigurationsPolicies to get the profileId that we’ll be using for all the devices (this is the equivalent of the Profiles in the web UI)
  • Step 3. Call rest/edge/edgeProvision with template passing device name along with profileId (again this is the Profiles in the web UI), result will be edgeId and activation key
  • Step 4. Call rest/enterprise/getEnterpriseEdges passing edgeId to confirm
  • Step 5. Call rest/edge/getEdgeConfigurationStack to get modulesId of new edge profile (this is the device specific profile for anything that is overridden from the “Profile” set in the device configuration)
  • Step 6. Call rest/configuration/updateConfigurationModule with the template, replacing edgeId, profileId and modulesId along with IPv4 addresses, etc.; parse the result to confirm – THE MAJORITY OF WORK IS ACCOMPLISHED IN THIS STEP
  • Step 7. Logout via /rest/logout
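
To make the list above a little more concrete, here’s a rough Python sketch of the flow. The Orchestrator URL, credentials, payload field names and response shapes are assumptions on my part – capture the real JSON with the Chrome developer tools (or check the API documentation) before relying on any of them:

#!/usr/bin/env python3
# Rough sketch only - the VCO URL, credentials, payload fields and response
# shapes below are assumptions, not confirmed against the API documentation.
import requests

VCO = "https://vco.example.com"          # hypothetical Orchestrator URL
USER, PASSWD = "api-user", "secret"      # hypothetical credentials

session = requests.Session()

def call(method, body):
    """POST a JSON body to a VCO portal API method and return the JSON reply."""
    resp = session.post(f"{VCO}/portal/rest/{method}", json=body)
    resp.raise_for_status()
    return resp.json()

# Step 1 - login; the authentication cookie is kept on the requests.Session
session.post(f"{VCO}/portal/rest/login/enterpriseLogin",
             json={"username": USER, "password": PASSWD}).raise_for_status()

# Step 2 - find the profileId of the "Profile" used for every store
profiles = call("enterprise/getEnterpriseConfigurationsPolicies", {})
profile_id = next(p["id"] for p in profiles if p["name"] == "Store Profile")

# Step 3 - provision the edge; additional fields (modelNumber, etc.) may be required
edge = call("edge/edgeProvision",
            {"name": "STORE-0001", "configurationId": profile_id})
edge_id, activation_key = edge["id"], edge["activationKey"]

# Step 4 - confirm the new edge shows up for the enterprise
edges = call("enterprise/getEnterpriseEdges", {})
assert any(e["id"] == edge_id for e in edges)

# Step 5 - pull the edge's configuration stack to find the device-specific modulesId
stack = call("edge/getEdgeConfigurationStack", {"edgeId": edge_id})
device_settings = next(m for m in stack[0]["modules"] if m["name"] == "deviceSettings")

# Step 6 - push the Jinja2-rendered device settings (VLANs, IPv4 addressing, etc.)
# rendered_data = ...  # JSON produced from the Jinja2 template
# call("configuration/updateConfigurationModule",
#      {"id": device_settings["id"], "_update": {"data": rendered_data}})

# Step 7 - logout
session.post(f"{VCO}/portal/rest/logout")

print(f"edgeId={edge_id} activationKey={activation_key}")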

Those are the major steps… now I need to write some accompanying code to parse a list of stores and ultimately dump all of this into a database or CSV so we can store and track the activation code for each physical device.
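
For the tracking piece, something as simple as Python’s csv module would do – the field names below are just what I’d expect to want, nothing official:

import csv

# results would be accumulated by the provisioning loop: one dict per store
results = [{"store": "STORE-0001", "edgeId": 1234, "activationKey": "XXXX-XXXX-XXXX-XXXX"}]

with open("activations.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["store", "edgeId", "activationKey"])
    writer.writeheader()
    writer.writerows(results)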

I will also work on publishing the code I’m using so others can follow in my footsteps… it really wasn’t that hard; it took a few days to figure out the REST API calls and the relationships between the different ‘modules’, and then to track down the missing pieces to get everything working properly.

If there’s interest in me releasing the code, drop a note below… depending on the interest, I’ll see if I can make time to clean up the code and publish it to GitHub.

Cheers!

How to install and setup Ansible to manage Junos on CentOS
https://blog.michaelfmcnamara.com/2020/07/how-to-install-and-setup-ansible-to-manage-junos-on-centos/
Fri, 03 Jul 2020 12:01:48 +0000

If you Google “Ansible” and “Junos” you’ll find literally hundreds of articles, posts and videos… some covering pre-2.0 Ansible, some covering Ansible 2.5, 2.6 or later, and almost all of them are completely different – and a great many of the instructions no longer work!

I recently wanted to test out the Ansible Junos modules put out by Juniper, but first I had to spend a good hour figuring out all the interdependencies to get everything working on a CentOS 7 server. The Juniper book DAY ONE: AUTOMATING JUNOS WITH ANSIBLE written by Sean Sawtell is a great starting point, but I ran into problems just getting my local environment running. The hundreds if not thousands of posts and videos were extremely confusing and I quickly grew frustrated.

What follows is a quick guide on how to get everything working on a minimal CentOS 7 server. Depending on your requirements, it might be more advisable to look at running a fully prepared Docker container, where all the needed software is ready to run. You just need to provide the Ansible configuration and playbooks.

Here’s what you need to do from root or a root-equivalent account using sudo. Since I built this test VM on a VMware ESXi 6.5 server I wanted to install the open-source VMware tools and perform any updates.

yum install open-vm-tools
yum update
init 6

yum install epel-release
yum install python3 jxmlease

pip3 install ncclient
pip3 install junos-eznc
pip3 install ansible

ansible-galaxy install Juniper.junos

That’s all you need and you are ready to go… if you want to play around with Netmiko or Napalm you only need to use PIP to install those Python modules.

pip3 install netmiko
pip3 install napalm
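
Once everything is installed, a few lines of PyEZ (junos-eznc) are a quick way to confirm the Python side of the stack is healthy before you bother with playbooks – the hostname and credentials here are placeholders:

from jnpr.junos import Device

# placeholder hostname/credentials - point this at a lab Junos device
with Device(host="junos-lab.example.com", user="admin", password="secret") as dev:
    # gathering facts exercises junos-eznc and ncclient over NETCONF (port 830)
    print(dev.facts.get("hostname"), dev.facts.get("version"))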

Cheers!

VMware VeloCloud SD-WAN Orchestrator API and Python
https://blog.michaelfmcnamara.com/2020/03/vmware-velocloud-sd-wan-orchestrator-api-and-python/
Sun, 01 Mar 2020 15:52:11 +0000

This will likely be a multi-part series… let’s just jump right into it and see where we end up.

I’m preparing to potentially roll out VMware’s VeloCloud SD-WAN solution to 650 locations globally. In a recent conversation with VMware I heard how difficult and expensive it was going to be from a professional services standpoint to accomplish all that work, and it was suggested that I needed to work with a partner for all the application coding and development time that would be needed. The work was suggested to potentially cost 5 or even 6 figures. I’m not talking about the physical logistics of such an undertaking, I’m talking about just the steps needed to programmatically build and configure 650 Edge profiles, with names, IPv4 addresses, etc.

I called bullsh*t.

I’m assuming this is another example of a vendor or manufacturer trying to appease their partners and resellers. If they provide the customer and end-user too much information, there might not be a need for any professional services by the resellers or partners. This is completely the wrong approach to take IMHO… the vendor should empower their customers and end-users. There will always be customers that don’t have the skill sets or resources to tackle the job for themselves and that’s where the resellers and partners should fill the role.

I took a few hours this weekend and wrote some quick cURL commands and Python code to understand if the effort was really worth 5 or 6 figures. In short, it wasn’t. I understand resellers and partners want to be compensated, but if you are going to be unreasonable then I’m going to go do it myself, and then I’ll likely open source the code just to prove my point.
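
By “quick” I mean something on the order of the sketch below – just authenticating to the Orchestrator and confirming a session cookie comes back (the URL and credentials are hypothetical, and the exact login payload should be verified against the API documentation):

import requests

VCO = "https://vco.example.com"   # hypothetical Orchestrator URL

s = requests.Session()
resp = s.post(f"{VCO}/portal/rest/login/enterpriseLogin",
              json={"username": "api-user", "password": "secret"})
resp.raise_for_status()

# a successful login sets a session cookie that the Session object will
# replay on every subsequent rest/ call
print(resp.status_code, s.cookies.get_dict())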

Here are the steps I believe we’ll need (yet to be fully tested);

  • Step 1. Login via rest/login/enterpriseLogin (store authentication cookie)
  • Step 2. Call rest/enterprise/getEnterpriseConfigurationsPolicies to get the profileId that we’ll be using for all the devices (this is the Profiles in the UI)
  • Step 3. Call rest/edge/edgeProvision with template passing STORE_NAME along with profileId (again this is the Profiles in the UI), result will be edgeId and activation key
  • Step 4. Call rest/enterprise/getEnterpriseEdges passing edgeId to confirm
  • Step 5. Call rest/edge/getEdgeConfigurationStack to get profileId of new edge profile (this is the device specific profile for anything that is overridden from the “Profile” set in the device configuration)
  • Step 6. Call rest/configuration/updateConfigurationModule with the template, replacing edgeId and profileId along with IPv4 addresses, etc.; parse the result to confirm – THE MAJORITY OF WORK IS ACCOMPLISHED IN THIS STEP
  • Step 7. Logout via /rest/logout

I can’t completely trash VMware because there is a well-documented API available and VMware has made examples available, but when talking face to face with VMware personnel it’s almost as if they are forbidden from talking about any of that. VMware even hosts a community repository (click on Related Code Samples) which is open to the public.

Why not just create the 650 Edge profiles in the WebUI?

  1. Time – I would estimate it would take about 5 minutes to create a single Edge profile in the WebUI: 650 * 5 = 3,250 minutes; 3,250 / 60 ≈ 54 hours
  2. Errors – There would inevitably be a lot of errors from all the typing or cutting-n-pasting required which would create additional deployment issues and delays downstream.
  3. Future – Wouldn’t it be nice to have a proven method of deploying new devices programmatically? Yes it most certainly would be.

It boils down to these two facts: 1) It will take me much less time to write this code now than it would to sit at a keyboard for 54 hours, and 2) at the rate we are rolling these out, I can’t afford any fat finger errors in the configuration – I don’t have the backend resources to support troubleshooting “stupid” fat finger mistakes down the road.

So now I have a general outline of what I need to do. The store names can easily be generated programmatically along with all the IPv4 addressing, so what’s left is to decide what language to use to write this beast.

Perl or Python?

I’ve used Perl since 1997 and I’m really comfortable writing code in Perl, and even more so re-using all the different snippets of code I’ve written in the past. So I’ll probably use Python to help expand my experience writing code in Python, although I likely won’t use the VMware SDK – I’ll just write directly against the REST API.

You have any recommendations?

Cheers!

VMware – CPU hardware should support cluster’s current EVC mode
https://blog.michaelfmcnamara.com/2018/12/vmware-cpu-hardware-should-support-clusters-current-evc-mode/
Sat, 15 Dec 2018 17:34:20 +0000

Another interesting problem… a client was trying to add a new ESXi host into an existing cluster and was getting an error that the CPU hardware didn’t support the cluster’s current EVC (Enhanced vMotion Compatibility) mode.

This wasn’t just the normal CPU compatibility issue… as the Dell M620 server that we were trying to add to the cluster was the exact same model and CPU as the current ESXi hosts that were in the cluster.

I eventually discovered that the existing ESXi hosts were running 6.5 Update 2 whereas the host we were trying to add to the cluster was running 6.5 Update 1. It turns out that there were some CPU microcode updates in 6.5 Update 2 that were intended to mitigate the Spectre v2 vulnerability, and those were tripping up the Enhanced vMotion Compatibility check between the ESXi hosts.

I found the following VMware knowledge base article that details the issue;
https://kb.vmware.com/s/article/52085

I upgraded the new ESXi server to 6.5 Update 2 and was able to add it to the cluster without issues.

Cheers!

It’s the networks fault #18
https://blog.michaelfmcnamara.com/2017/12/its-the-networks-fault-18-2/
Sat, 16 Dec 2017 11:47:20 +0000

Let’s everyone be honest here… working in Information Technology requires certain skills. Probably the most important skill set is what I’d call your ‘Google-Fu‘ – your ability to efficiently search Google using various keywords to find useful information on the problem or issue confronting you. I often find some of the better written but lower-ranked articles by removing the manufacturer from the search results. Here’s an example: if I wanted to exclude any results from the domain cisco.com, I would append the following to the Google search, “-site:cisco.com”. This would show me all search results except for anything from the Cisco website.

Articles

Cisco ASA – Basic LDAP Authentication by Dan – It’s been a while since I configured a Cisco ASA to authenticate VPN users against a Microsoft Windows Active Directory Domain Controller. If you Google ‘Cisco ASA Active Directory Authentication’ you’ll get hundreds of links and articles. I chose to scroll down a bit in the list and picked the link from IN THE WORKS – A tech apprenticeship. Thankfully Dan’s article from 2016 was straightforward and easy to follow. The trick was in reusing the DefaultWEBVPNGroup tunnel-group so users don’t need to select from multiple tunnel-groups in the client.

Authenticate to vCenter from Active Directory credentials by Romain Serre – A customer wanted to authenticate with vSphere using his Active Directory credentials. In this specific case the client was using the vCSA (vCenter Server Appliance) and not a typical Windows Server running vCenter. I initially ran into some DNS issues, thankfully the CLI error gave me the hint I needed as the web UI error was pretty basic.

How to Configure NTP Server on Windows Server 2016 by Stefan – A client was having some significant clock drift issues with one of their servers. I recalled the command was w32tm but couldn’t recall exactly what the commands were to enable NTP. Stefan has an easy to follow post. Stefan, I’m not a big fan of ad banners placed in the middle of the content and I’m sure I’m not alone.

Cheers!

VMware ESXi 6.5 Update 1 – There are no free physical adapters to attach to this virtual switch
https://blog.michaelfmcnamara.com/2017/09/vmware-esxi-6-5-update-1-there-are-no-free-physical-adapters-to-attach-to-this-virtual-switch/
Wed, 06 Sep 2017 01:42:31 +0000

I was working with a Dell M1000E chassis this weekend, installing VMware ESXi 6.5 Update 1 onto a number of Dell M620 server blades, when I ran into an error adding the additional physical NICs to each virtual switch post-install.

“There are no free physical adapters to attach to this virtual switch”.

In the end I had to refer to this VMware knowledge base article for the commands to add the physical NICs to the vSwitch from the command line.

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1008127

I used the following commands on each ESXi host after enabling SSH access;

esxcli network vswitch standard uplink add --uplink-name=vmnic1 --vswitch-name=vSwitch0
esxcli network vswitch standard uplink add --uplink-name=vmnic5 --vswitch-name=vSwitch1
esxcli network vswitch standard uplink add --uplink-name=vmnic3 --vswitch-name=vSwitch2
esxcli network vswitch standard list

I also ran into another problem trying to bind the VMK interfaces to the iSCSI initiator.

Thankfully Dave Davis had already solved that problem in a post titled, vSphere 6.5 iSCSI Binding Bug…Dude, Where’s My Unused Adapters?

I haven’t worked much with VMware ESXi 6.5 but I’m already missing the legacy vSphere client.

Cheers!

Update: Saturday September 9, 2017
I should point out that you can still use the VMware vSphere 6.0 Client to manage a VMware ESXi 6.5 server. You don’t need to use just the new web UI. Thanks to Grant for pointing out that fact!

Update: February 28, 2019
This issue has been resolved in VMware ESXi 6.5 Update 2


Cisco Nexus 1000V Upgrade to 4.2(1)SV1(4)
https://blog.michaelfmcnamara.com/2011/09/cisco-nexus-1000v-upgrade-to-4-21sv14/
Thu, 29 Sep 2011 03:00:18 +0000

I just recently completed an upgrade of our Cisco Nexus 1000V from 4.0(4)SV1(3) to 4.2(1)SV1(4). Prior to proceeding with the Cisco Nexus 1000V upgrade we had to first upgrade vCenter to 4.1, Update Manager to 4.1 Update 1 and lastly the VMware ESX hosts themselves to 4.1. With all that complete we set out to upgrade the Cisco Nexus 1000V but quickly ran into a few problems.

Pre-Upgrade Script

The Pre-Upgrade Script, a TCL script which checks your Cisco Nexus 1000V configuration for any potential issues, immediately flagged our Port Channel configurations in Check 2.

###############################################################################
## COMPANY NAME: Cisco Systems Inc                                           ##
## Copyright © 2011 Cisco Systems, Inc. All rights Reserved.                 ##
##                                                                           ##
## SCRIPT NAME: pre-upgrade-check-4.2(1)SV1(4).tcl                           ##
## VERSION: 1.0                                                              ##
## DESCRIPTION: This script is applicable to all releases prior to           ##
##              4.2(1)SV1(4).                                                ##
##                                                                           ##

...
...
...

=========
 CHECK 2:
=========
Checking for Interface override configurations on Port-channnel Interface ...
###############################################################################
##                           FAIL                                            ##
##                                                                           ##
## List of Port-Channel Interface(s) with interface level configuration(s)   ##

1: port-channel1 has below overrides.
mtu 9000

2: port-channel2 has below overrides.
mtu 9000

3: port-channel3 has below overrides.
mtu 9000

4: port-channel4 has below overrides.
mtu 9000

...
...
...

While originally testing the Cisco Nexus 1000V prior to going into production some 463 days earlier, we had played around with enabling Jumbo Frame support within the VM environment. We had set the MTU on the individual port-channel configurations to 9000. Now the pre-upgrade script was telling us that we needed to clean this up and remove any specific configuration from the port-channel and instead apply it to the port-profile configuration. I added the system mtu 9000 command to the port-profile but got a few surprises when I tried to remove the MTU command. The first surprise, when I issued “inter po1” followed by “no mtu 9000”, was losing connectivity to the VM guests on that host. I had to manually restart the VEM on the ESX host with the “vem restart” command from a CLI prompt. The second surprise was that “mtu 9000” in the configuration had been replaced by “mtu 1500”. That wasn’t going to work, so I had to reach out to Cisco TAC who immediately recognized the issue and provided a workaround;

On the ESX host stop the VEM

vem stop

Then on the Cisco Nexus 1000V delete the port-channels and associated VEM (I’ll use the first server as an example assuming there are two VSMs installed)

config
no inter po1
no inter po2
no vem 3

On the ESX host start the VEM

vem start

And sure enough it worked just as promised… recreating the port-channels without the MTU commands. Obviously we had to put each ESX host into maintenance mode before we could just stop and start the VEM.

With that taken care of we started upgrading the VEMs using Update Manager. Unfortunately Update Manager only made it through about 6 ESX hosts before it got stuck. We had to manually install the update VIB on the remaining 12 ESX hosts ourselves. We utilized FileZilla to copy the VIB up to each server and then utilized PuTTY to SSH into each server and manually update the VEM;

[root@esx-srv1-dc1-pa ~]# esxupdate -b ./cross_cisco-vem-v130-4.2.1.1.4.0.0-2.0.1.vib update
Unpacking cross_cisco-vem-v130-esx_4.2.1.1.. ############################################################# [100%]

Installing cisco-vem-v130-esx                ############################################################# [100%]

Running [rpm -e cisco-vem-v120-esx]...
ok.
Running [/usr/sbin/vmkmod-install.sh]...
ok.

With all the ESX hosts upgraded we had to physically reboot the vCenter server to get the currently running VUM task to fail so we could complete the upgrade from the Cisco Nexus 1000V.

Next we launched the upgrade application and before long we had the standby VSM upgraded to 4.2(1)SV1(4). Here’s where we ran into another small scare. After the upgrade of the standby VSM the upgraded VEMs are supposed to re-register to the newly upgraded VSM. We waited about 5 minutes and none of the VEMs had disconnected from the primary VSM running 4.0(4)SV1(3) and moved over to the standby VSM that was now running 4.2(1)SV1(4). It was only approximately 15-20 minutes later (while searching Google for some hint) that the VEMs just up and started to connect to the newly upgraded standby VSM.

Cheers!

New Data Center – Where have I been?
https://blog.michaelfmcnamara.com/2010/06/new-data-center-where-have-i-been/
Sat, 12 Jun 2010 02:00:24 +0000

I thought I would post a few quick words on where I’ve been for the past 2 months (certainly not writing quality content for this blog). The past 60 days have been very hectic as I’ve started down the final stretch of designing, building and lighting a new data center. Thankfully the team and I are no strangers to moving computer rooms or constructing new buildings so we’re keenly aware of all the technical details needed to be successful in such a large endeavor.

I have so many short stories to share but no time to share them… In any event I’m now getting up to speed with a lot of new equipment, specifically Cisco’s Nexus gear.

What equipment did we use?

  • Cisco Nexus 7010
  • Cisco Nexus 5010
  • Cisco Nexus 2148
  • Cisco Catalyst 3750E
  • Cisco Catalyst 2960G
  • Cisco ASA5520
  • Cisco ACE 4710
  • Cisco AS5300 (yes we still have some dial-up users/vendors)
  • Cisco 7301 Router
  • Cisco 2821 Router

What racks did we use for the network equipment?

  • Liebert Knurr Racks
  • Liebert MPH/MPX PDUs

What equipment did we use for the servers/blades?

  • HP Rack 10000 G2
  • HP Rack PDU (AF503A)
  • HP IP KVM Console (AF601A)
  • HP BladeSystem c7000 Enclosure
  • HP Virtual Connect Flex-10 Interconnect
  • HP SAN 8Gb Interconnect
  • HP BL460c G6
  • HP BL490c G6
  • HP DL380 G6
  • HP DL360 G6

What are we using for storage?

  • IBM XIV System Storage (SAN) (w/4 1Gbps iSCSI replication ports)
  • IBM SAN80B-4 SAN Switch

Additional miscellaneous equipment;

  • MRV LX-4048T (terminal server)
  • Brother P-Touch PT-2100 / Brady ID PRO Plus label makers

As some of you might know we selected Cisco as the network electronics vendor and have implemented their Cisco Nexus 7010 switches as our core, followed by the Nexus 5010 switches as distribution to the Nexus 2148 (FEX) switches in a top of rack configuration. We also utilized Catalyst 2960G switches for our management/out-of-band network given that the Nexus 2148 only supports 1000BaseT, no 10Mbps or 100Mbps connectivity. Of course Cisco is in the process of releasing the Nexus 2248 which supports 100/1000Mbps connectivity to edge devices. We chose to utilize the HP Virtual Connect Flex-10 in our VM enclosures and we’ll utilize the Cisco 3120X in our non-VM enclosures. We’ve also installed and configured the Nexus 1000V in coordination with our VMware vSphere 4 environment. We decided that the CEE/DCE/FCoE revolution wasn’t quite here yet, or perhaps we weren’t quite ready for it, so we stayed with a traditional Fiber Channel infrastructure around two IBM (OEM Brocade) 80-port 8Gbps SAN switches. For SAN replication we’ll be using multiple 1Gbps iSCSI ports over a 10GE WAN. We ended up with an IBM XIV so we’ll have to see if it can keep up with all the traffic that’s going to be thrown its way.

So there should certainly be no shortage of material to talk about with all this new equipment, however, I’m certainly going to be very busy the next six months.

Here are some pictures of the cage (under 800 sq ft), if interested. You’ll notice the chair and the upgrade that we performed on it in the last two pictures.

Cheers!

vSphere SCSI reservation conflict
https://blog.michaelfmcnamara.com/2009/09/vsphere-scsi-reserv-co/
Wed, 02 Sep 2009 00:00:38 +0000

We had our first issue today with our recent VMware vSphere 4 installation. We’re currently up to about 30 virtual machines spread across five BL460c (36GB) blades in an HP 7000 Enclosure. The problem started with a few virtual machines just going south, like they had lost their minds. It was discovered that all the virtual machines that were affected were on the same datastore (LUN). One of the engineers put the ESX host that was running those VMs into maintenance mode and rebooted it. After the reboot the ESX host was unable to mount the datastore. Everything seemed fine from a SAN standpoint and the Fiber Channel switches were working fine. A quick look at /var/log/vmkwarning on the ESX host revealed the following messages;

Sep  1 13:04:35 mdcc01h10r242 vmkernel: 0:00:26:02.384 cpu4:4119)WARNING: ScsiDeviceIO: 1374: I/O failed due to too many reservation conflicts. naa.600508b4000547cc0000b00001540000 (920 0 3)
Sep  1 13:04:40 mdcc01h10r242 vmkernel: 0:00:26:07.400 cpu6:4119)WARNING: ScsiDeviceIO: 1374: I/O failed due to too many reservation conflicts. naa.600508b4000547cc0000b00001540000 (920 0 3)
Sep  1 13:04:40 mdcc01h10r242 vmkernel: 0:00:26:07.400 cpu6:4119)WARNING: Partition: 705: Partition table read from device naa.600508b4000547cc0000b00001540000 failed: SCSI reservation conflict

A quick examination of the other ESX hosts revealed the following;

Sep  1 13:04:26 mdcc01h09r242 vmkernel: 21:22:13:25.727 cpu10:4124)WARNING: FS3: 6509: Reservation error: SCSI reservation conflict
Sep  1 13:04:31 mdcc01h09r242 vmkernel: 21:22:13:30.715 cpu12:4124)WARNING: FS3: 6509: Reservation error: SCSI reservation conflict
Sep  1 13:04:36 mdcc01h09r242 vmkernel: 21:22:13:35.761 cpu9:4124)WARNING: FS3: 6509: Reservation error: SCSI reservation conflict

We had a SCSI reservation issue that was locking out the LUN from any of the ESX hosts. The immediate suspect was the VCB host as it was the only other host that was being presented the same datastores (LUNs) as the ESX hosts from the SAN (HP EVA 6000).

We rebooted the VCB host and then issued the following command to reset the LUN from one of the ESX hosts;

vmkfstools -L lunreset /vmfs/devices/disks/naa.600508b4000547cc0000b00001540000

After issuing the LUN reset we observed the following message in the log;

Sep  1 13:04:40 mdcc01h10r242 vmkernel: 0:00:26:07.400 cpu9:4209)WARNING: NMP: nmp_DeviceTaskMgmt: Attempt to issue lun reset on device naa.600508b4000547cc0000b00001540000. This will clear any SCSI-2 reservations on the device.

The ESX hosts were almost immediately able to see the datastore and the problem was resolved.

We believe the problem occurred when the VCB host tried to back up multiple virtual machines from the same datastore (LUN) at the same time. This created an issue where the VCB host locked the LUN for too long, causing the SCSI queue to fill up on the ESX hosts. This is new to us and to me so we’re still trying to figure it out.

Cheers!

References;
http://kb.vmware.com/kb/1009899
http://www.vmware.com/files/pdf/vcb_best_practices.pdf

HP Virtual Connect & VMware vSphere 4
https://blog.michaelfmcnamara.com/2009/08/hp-virtual-connect-vmware-vsphere-4/
Sun, 02 Aug 2009 14:00:23 +0000

As I’ve mentioned in the past we’re kicking off a very large endeavor to move a significant number of our servers to a virtual environment. Over the past two weeks we built out an HP 7000 enclosure with 4 HP BL460c server blades, 6 HP Virtual Connect 1/10 Gb-F Ethernet interconnects, and 2 HP Virtual Connect 8Gb 24-Port Fiber Channel interconnects. The purpose of this hardware is to provide a temporary staging location as we perform the physical to virtual conversions before moving the virtual machines (across the network) to a new data center where additional VMware vSphere 4 clusters will be waiting.

We had some small issues when we first turned up the enclosure but the biggest hurdle was our unfamiliarity with Virtual Connect and locating the default factory passwords (we had ordered the enclosure as a special build so it came pre-assembled which saved us a lot of time and effort and was well worth the small added cost).

We’re currently using two Nortel Ethernet Routing Switch 5530s in a stack configuration mounted at the top of the rack. We also have a Nortel Redundant Power Supply Unit (RPSU) 15 installed to provide redundant power in the event that we lose one of the room’s UPSs or that we have an issue with an internal power supply in either ERS 5530 switch. We loaded software release 6.1 onto the ERS 5530s and so far haven’t observed any issues. We’re initially connecting the ERS 5530 stack via 2 1000BaseSX (1Gbps) uplinks distributed across both ERS 5530 switches (DMLT) to a pair of Nortel Ethernet Routing Switch 8600s running in a cluster configuration using IST/SMLT(SLT) trunking. As the solution grows we can expand the uplink capacity by adding additional 1Gbps uplinks or by installing 10Gbps XFPs. We’re downlinking from the ERS 5530 stack to multiple HP Virtual Connect 1/10Gb-F modules using LACP. Unfortunately you can’t have a LAG span multiple HP Virtual Connect 1/10Gb-F Ethernet modules at this time. If you do, only the ports on one of the modules will be “Active” while the ports on the other modules will be in “Standby”.

The HP Virtual Connect 1/10 Gb-F Ethernet interconnects provide 16 Internal 1Gb Downlinks, 4 External 10/100/1000BASE-T Uplinks, 2 External 10/100/1000BASE-T SFP Uplinks, 2 External 10Gb XFP Uplinks, 1 External 10Gb CX-4 Uplink, and 1 10Gb Internal Cross Connect. Using the internal 10Gbps cross connect along with the external 10Gb CX-4 uplink you can create a 10Gbps network within the enclosure. You can also link multiple enclosures together to form a 10Gbps network contained entirely within the rack. This could be very beneficial in keeping vMotion and other unneeded traffic off the core uplinks.

In testing we did run into a significant problem that already appears to have been documented by HP, although a solution has yet to be formulated. In testing several failure scenarios (physically removing the HP Virtual Connect Ethernet interconnects or remotely powering them down) we observed a significant problem when the interconnects were restored. The HP Virtual Connect 1/10Gb-F would show no link to the blade server while the VMware vSphere 4 console would indicate that there was link. This problem obviously affected all traffic associated with that port group. The solution was to either reboot the VMware host or reset the NIC using ethtool -r {NIC} from the server console.

Here’s the excerpt from the release notes;

When a VC-Enet module is running VC v1.30 firmware or higher, a link might not be re-established
between the module and the ports of an NC364m mezzanine card under the following conditions:

  • The network mappings are changed on the NIC ports through Virtual Connect.
  • The VC-Enet module is restarted from a power-cycle or reboot, or the module is removed and inserted.

If the server is rebooted, the link is established on all ports on both sides of the connection. Manually
toggling the link from the server should also restore the link.

The jury is still out on HP’s Virtual Connect although I hope to dig deeper in later posts.

Cheers!

References;
http://h18004.www1.hp.com/products/blades/components/c-class-interconnects.html
http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c01730696/c01730696.pdf
