Michael McNamara https://blog.michaelfmcnamara.com technology, networking, virtualization and IP telephony

Ansible Default Forks = 5 https://blog.michaelfmcnamara.com/2022/04/ansible-default-forks-5/ Fri, 15 Apr 2022 14:10:15 +0000 https://blog.michaelfmcnamara.com/?p=7368

We recently started using Ansible to help perform software upgrades on the large number of Juniper EX-4300 and EX-2300 switches in our environment. Like the vast majority of organizations, our downtime windows are extremely short, and unfortunately the chance of human error is considerably higher between 12AM and 6AM. Thankfully Ansible solves most of these issues and is very reliable. Out of the box, Ansible has a configuration default of 5 forks, so it will only upgrade 5 switches at a time. If you are going to be working with any sizable number of devices, you'll need to update that value in the ansible.cfg file.

[defaults]
inventory = inventory
host_key_checking = False
log_path = ~/ansible/ansible.log
forks = 30
timeout = 60

You’ll need to make sure that whatever server or virtual machine is running your Ansible instance can support the number of forks you configure.
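
If you only need a higher fork count for a single run, you can also override it on the command line instead of editing ansible.cfg. A quick sketch – the playbook name and group below are hypothetical:

ansible-playbook upgrade-junos.yml --limit ex_switches --forks 30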

Cheers!

Aruba ClearPass – userPrincipalName and samAccountName https://blog.michaelfmcnamara.com/2020/06/aruba-clearpass-userprincipalname-and-samaccountname/ Sat, 27 Jun 2020 13:53:15 +0000 https://blog.michaelfmcnamara.com/?p=6556 I've recently been standing up a number of virtual Aruba ClearPass appliances to provide 802.1X RADIUS authentication for both wired and wireless clients. If you are using Windows Active Directory as an authentication source, here's a quick trick to allow your users to authenticate using either the userPrincipalName (email address) or their samAccountName (username). In my current environment, we're a multi-brand organization with multiple @brand.com email domains where users are more likely to know their email address than their AD username. In its default configuration Aruba ClearPass will only authenticate against the username (samAccountName).

Log into Aruba ClearPass Policy Manager and go to Configuration -> Authentication -> Sources, then select your Windows Active Directory source – see the example below;

You need to update the filter query on the source as follows.

Original ClearPass Filter Query:
(&(sAMAccountName=%{Authentication:Username})(objectClass=user))
Updated ClearPass Filter Query:
(|(&(objectClass=user)(sAMAccountName=%{Authentication:Username}))(&(objectClass=user)(userPrincipalName=%{Authentication:Username})))
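
If you want to sanity-check the new filter outside of ClearPass, the same query can be run with ldapsearch from any host with the OpenLDAP client tools. This is only a rough sketch – the bind account, base DN and test user below are placeholders you'd substitute with your own:

# hypothetical bind account, base DN and test user
ldapsearch -x -H ldap://dc01.example.com -D "svc-clearpass@example.com" -W \
  -b "DC=example,DC=com" \
  "(|(&(objectClass=user)(sAMAccountName=jsmith))(&(objectClass=user)(userPrincipalName=jsmith@brand.com)))" dn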

And then don't forget to Save the changes – now you should be good to go!

Cheers!

COVID-19 The War Waged by Information Technology Professionals https://blog.michaelfmcnamara.com/2020/03/covid-19-the-war-waged-by-information-technology-professionals/ Fri, 27 Mar 2020 01:29:02 +0000 https://blog.michaelfmcnamara.com/?p=6516 The past few weeks have been extremely exhausting both professionally and personally. Coronavirus (COVID-19) has taken the world by storm and is literally upending people's daily lives and ruining businesses large and small. Let's not forget the large number of people that have lost their lives to this virus. My thoughts and prayers for all those who have lost loved ones. My thanks and admiration to all those medical professionals on the front lines treating the sick.

While very few of us have planned and organized days, these past few weeks have been unlike anything I've ever experienced, running from one fire to another, one disaster to another. Whether it's a power failure in a data center or someone deciding to water the potted plant that they hung over the network switch, there's always some new emergency or problem that requires IT to jump in and save the day. This event was no different, but the scale and duration were a whole new experience for everyone.

We started mobilizing our disaster preparedness plan around the middle of February. The initial request from the leadership team was pretty straightforward: "How do we prepare to have our home office employees and call center agents work remotely?" Like most medium-to-large enterprises we have a couple of hundred people working remotely every day; however, we were talking about going from 200-300 daily remote users to potentially 3,000-4,000 daily remote users in a very short time span. And a significant portion of those users still had desktop devices.

In the span of a week we had ordered, imaged, configured and deployed (shipped or handed out) over 400 laptops to over 400 employees and call center agents. We also spun up a new Virtual Private Network (VPN) solution using Palo Alto Networks' GlobalProtect to help supplement our existing Pulse Secure and Microsoft DirectAccess solutions.

I should note that I reached out to Pulse Secure and they offered us a temporary 60-day license to help us cope with the additional users – kudos to Pulse Secure.

Like everyone else, we're in the middle of our second week and the Internet itself is starting to show its cracks. This past Monday and Tuesday we experienced connectivity issues across 30 stores in and around London, UK for ~45-60 minutes at a time. We later learned that Monday was the first day in the UK with all schools closed and British Telecom (BT) wasn't handling the strain well. I'm sure it's not helping BT that Disney+ just launched in the UK and Ireland on Wednesday.

We’ve had a number of issues with Microsoft, Slack and Zoom over the past two weeks and expect those issues will likely continue as more and more people around the nation and globe transition to working remotely.

Nobody’s really sure what the future holds… hopefully things will start to improve as we work to flatten the curve.

Thanks to all the IT folks that are continuing to carry on the struggle, be it onsite or from the confines of your own home… we know what you're going through and we appreciate your efforts!

If you have a story to share, let us know below.

Stay safe! Cheers!

AT&T Says Failure was Verizon’s Fault https://blog.michaelfmcnamara.com/2019/08/att-says-failure-was-verizons-fault/ https://blog.michaelfmcnamara.com/2019/08/att-says-failure-was-verizons-fault/#comments Sat, 10 Aug 2019 12:13:50 +0000 https://blog.michaelfmcnamara.com/?p=6304

This is yet another example of AT&T failing a large enterprise customer. While this post has nothing to do with all the recent hubbub around AT&T's new "5G E" marketing campaign, it highlights the continued challenges enterprises face in dealing with large telecom carriers that either just don't care or simply don't have the ability to operate and manage large scale networks effectively.

The incident started as most incidents start… a call from the Help Desk alerting that there was a lot of red on the network dashboard and e-mail alerts were flowing in by the hundreds. In this case I got a call from one of my network engineers informing me that the primary AT&T AVPN/MPLS link into our primary Data Center was down and had been down for almost 60 minutes. That was very unwelcome news as it would significantly impact a large portion of our user base and a number of business critical applications.

While AT&T was supposedly "testing" the circuit, my team and I went about re-routing traffic through a secondary Data Center that was still connected to the AT&T AVPN/MPLS network. From our secondary Data Center traffic could flow over a dedicated WAN link to our primary Data Center. Dealing with the BGP and EIGRP route maps and policies took about 2 hours to get the majority of traffic re-routed and working again, although the re-route added about 140ms of round-trip time to every IP packet since it needed to traverse the US West coast instead of the US East coast. We have firewalls all throughout the WAN layer, so asymmetric routing will cause all sorts of issues and problems, and since we also have some DMVPN sprinkled in there, some care and planning was needed to successfully re-route traffic.

At the 3 hour mark AT&T declared that the circuit was good and that there were no issues found on the circuit. The technician explained that we should "verify power". At the 7 hour mark AT&T told us that their last mile provider, Verizon, had de-provisioned the 1Gbps transport, and that was the cause of our outage.

Thankfully Verizon was able to re-provision the circuit within 20 minutes. Although it would take AT&T and Verizon another 9 hours before they could commit that the circuit wouldn’t be “automatically” de-provisioned again the following night.

I truly miss the days of un-managed dark fiber where all I needed to worry about were fiber breaks and my own gear… while we had a number of fiber breaks they were fairly infrequent and in the majority of cases they were quickly remedied within 2 hours – I can't even get a call back from AT&T in under 2 hours, forget about resolution in under 2 hours.

What stories do you have to share about dealing with telecom carriers?

Cheers!

Are you doing your part to thwart DDoS attacks? https://blog.michaelfmcnamara.com/2017/01/are-you-doing-your-part-to-thwart-ddos-attacks/ Sun, 15 Jan 2017 21:36:17 +0000 https://blog.michaelfmcnamara.com/?p=5954 I was recently talking with some colleagues about the increasing threat of DDoS attacks using spoofed IP addresses, and we ended up deep in a discussion concerning BCP 38 / RFC 2827. That Best Current Practice / Request For Comments document outlines how Internet Service Providers should perform ingress filtering at the edge of their networks, accepting only traffic with valid source IP prefixes. While ISPs have a responsibility, I would argue that large and small enterprises have an equal responsibility to filter their egress traffic to ensure that they don't source any traffic onto the Internet that doesn't have valid source IP information. I personally like to do this extra level of filtering on my Internet facing router; this also helps prevent any private IP address leakage due to a poor NAT configuration on the border firewall.

A few years ago I had to work with an ISP that couldn’t understand how my Internet link was being flooded by packets that were being sourced from 127.0.0.1 egressing their network. I had to send them multiple packet traces and tediously explain to them what IP spoofing was and how to block it on ingress into their network backbone.

On Cisco IOS routers we only need to create an extended access-list to identify those packets which don't have valid source IP addresses. We want to permit any packet which has a valid source IP address from our public IANA-assigned address block, but we want to deny/block any IP packet which has a spoofed source IP address from leaving our network. In the example below I'm using the three TEST networks that are designated by the IETF/IANA for use in documentation. You should substitute the three IP address blocks below with your own public ARIN/APNIC/RIPE IP address assignment.

ip access-list extended bcp38_out
 permit ip 203.0.113.0 0.0.0.255 any 
 permit ip 198.51.100.0 0.0.0.255 any
 permit ip 192.0.2.0 0.0.0.255 any
 deny ip any any log

Now that we have our access-list we need to apply it to outbound traffic leaving our Internet facing interface.

interface fa0/1
 ...
 ip access-group bcp38_out out
 ...
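
Once the access-group is applied, the ACL counters are an easy way to confirm the filter is matching traffic and to spot any spoofed packets being dropped. A quick check using the names from the example above:

show ip access-lists bcp38_out
show ip interface fa0/1 | include access list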

And that's it – pretty easy and painless, yet too many folks miss this simple step.

Cheers!

Retail Holiday Peak 2016 https://blog.michaelfmcnamara.com/2016/11/retail-holiday-peak-2016/ Sat, 19 Nov 2016 20:13:26 +0000 https://blog.michaelfmcnamara.com/?p=5902 It's that time of year again… the holidays are just around the corner and every retailer is gearing up for Black Friday and Cyber Monday. My employer kicked off the holiday shopping season last night with one brand having their yearly 4-hour sale. Thankfully there were no surprises and our infrastructure and application stack were able to handle the additional load without issue. I did stumble upon an instrumentation issue between PRTG and a Cisco FirePOWER 4110 firewall – perhaps I'll share more about that problem in another post. It's a challenge every year to try and forecast the potential load and then meet the surge in demand, and let's not forget about all the email marketing campaigns and app push notifications that the brands want to hit their customers with. It can be a very challenging time for many Information Technology teams.

Now we wait for Thanksgiving and the four days to follow… confident that we’ve taken all the correct steps and everything is ready.  Only time will tell the true story.

Cheers!

VLANs and IP Routing on a Cisco Switch and Router https://blog.michaelfmcnamara.com/2016/06/vlans-and-ip-routing-on-an-cisco-switch-and-router/ Thu, 16 Jun 2016 23:15:44 +0000 https://blog.michaelfmcnamara.com/?p=5753 One of the most popular blog posts I've written in the past five years is, VLANs and IP Routing on an Ethernet Routing Switch. It continues to be the top post on my blog so I decided to write a follow-up using Cisco equipment. This has been covered many times on the Internet before, but I'm going to try and add my spin to it here. In this example I'm going to take a Cisco 1921 router and a Cisco 3560-CX switch and show two different examples of how you could design a simple topology.

Example 1 – VLAN Routing on Router (Layer 2 Switching)

In the old days when we only had Layer 2 switching we could create an 802.1Q/ISL trunk between the switch and the router and route on the physical router itself. The switch would have an IP address just for management and was generally incapable of routing traffic. The router would have an IP address for every VLAN and that address would be the default gateway for every device in that specific VLAN. Since the router handles all inter-VLAN routing over a single trunk interface, this topology is often referred to as a router on a stick. The topology might look something like this;

[Diagram: VLAN-IP-Routing-Cisco-2 – Layer 2 switch trunked to the Cisco 1921, which routes between the VLANs]

This was and still is a widely accepted design although it has significant limitations since the legacy Cisco router is generally not capable of wire speed routing. In the case of the Cisco 1921 it can handle anywhere from 68Mbps – 110Mbps depending on packet size and configured features.  Lots of people upgrade their Layer 2 switches to Gigabit only to later figure out that they can’t achieve Gigabit speeds between VLANs because they are routing between VLANs on a legacy software based router.

Example 2 – VLAN Routing on Switch (Layer 3 Switching)

The newer accepted design is to do Layer 3 switching (routing) right on the switch itself, so there's no need to involve the legacy router for traffic between VLANs. The IP interface for each VLAN is moved to the actual switch and traffic between those VLANs doesn't need to leave the physical switch. That topology might look something like this;

[Diagram: VLAN-IP-Routing-Cisco-1 – Layer 3 switch routing between VLANs, with the Cisco 1921 used only for the WAN/Internet]

In this design all the internal routing occurs on the Cisco 3560, which is capable of wire speed switching and routing thanks to the ASIC hardware. We can now achieve near Gigabit transfer speeds between the VLANs; the only real variable is the speed of the actual source and destination devices – servers, laptops, desktops, etc. We isolate the WAN router so it's only required when we need to communicate with the WAN or Internet. We might only have a 50Mbps Internet connection, so the legacy router is capable of handling that amount of traffic and provides additional features such as NAT for hiding our private network from the public Internet.

So what does the configuration look like? It's actually pretty straightforward;

Example 1 – Cisco 3560-CX Switch

enable
config t

username mike privilege 15 secret mypassword
enable secret myenable

vlan 100
 name "VLAN_100"
vlan 200
 name "VLAN_200"

inter vlan 1
 ip address 192.168.1.10 255.255.255.0
 no shut

inter range gig0/1-2
 switchport mode access
 switchport access vlan 1
inter range gig0/3-4
 switchport mode access
 switchport access vlan 100
inter range gig0/5-6
 switchport mode access
 switchport access vlan 200
inter gig0/10
 switchport mode trunk

line vty 0 4
 login local 

Example 1 – Cisco 1921 Router

enable
config t

username mike privilege 15 secret mypassword
enable secret myenable

inter gig0/0
no shut
exit

inter gig0/0.1
encapsulation dot1q 1 native
ip address 192.168.1.1 255.255.255.0
description VLAN_1
exit

inter gig0/0.100
encapsulation dot1q 100
ip address 192.168.100.1 255.255.255.0
description VLAN_100
exit

inter gig0/0.200
encapsulation dot1q 200
ip address 192.168.200.1 255.255.255.0
description VLAN_200
exit

line vty 0 4
 login local
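
If things don't work right away, a few standard show commands will confirm the trunk is carrying all three VLANs and the router subinterfaces are up. A quick sketch using nothing beyond the interfaces from the example:

! on the Cisco 3560-CX switch
show vlan brief
show interfaces trunk

! on the Cisco 1921 router
show ip interface brief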

Let's look at the commands needed for the second example. You'll notice that I added a few IP routes to the configuration: a default route on the Cisco 3560 forwarding traffic to the Cisco 1921, and IP routes on the Cisco 1921 for the IP subnets that we configured on the Cisco 3560. It's not enough that the network knows where to send traffic to the destination; the network also needs to know how to send the replies back to the source devices, so we need routes in both directions.

Example 2 – Cisco 3560-CX Switch

enable
config t

username mike privilege 15 secret mypassword
enable secret myenable

vlan 100
 name "VLAN_100"
vlan 200
 name "VLAN_200"

inter vlan 1
 ip address 192.168.1.1 255.255.255.0
 no shut

inter vlan 100
 ip address 192.168.100.1 255.255.255.0
 no shut

inter vlan 200
 ip address 192.168.200.1 255.255.255.0
 no shut


inter range gig0/1-2
 switchport mode access
 switchport access vlan 1
inter range gig0/3-4
 switchport mode access
 switchport access vlan 100
inter range gig0/5-6
 switchport mode access
 switchport access vlan 200

inter gig0/10
 desc UPLINK_C1921
 no switchport
 ip address 192.168.255.1 255.255.255.252
 no shut

ip routing
ip route 0.0.0.0 0.0.0.0 192.168.255.2

line vty 0 4
 login local 

Example 2 – Cisco 1921 Router

enable
config t

username mike privilege 15 secret mypassword
enable secret myenable

inter gig0/0
 descr UPLINK_C3560
 ip address 192.168.255.2 255.255.255.252
 no shut
 exit

ip route 192.168.1.0 255.255.255.0 192.168.255.1 
ip route 192.168.100.0 255.255.255.0 192.168.255.1 
ip route 192.168.200.0 255.255.255.0 192.168.255.1 

line vty 0 4
 login local
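
Before calling this done, it's worth confirming that the switch is actually routing and that the return routes are in place. A quick sketch using only the addresses from the example:

! on the Cisco 3560-CX switch - connected VLANs plus the default route
show ip route

! on the Cisco 1921 router - the three statics should point at 192.168.255.1
show ip route static
ping 192.168.100.1 source gig0/0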

Cheers!

Note: Thanks to Cisco for providing the equipment I'm using today. It's a small switch and router, but it's really helpful to have real equipment to work on when working through example topologies.

DNS Loops – how to not configure DNS forwarding https://blog.michaelfmcnamara.com/2016/04/dns-loops-how-to-not-configure-dns-forwarding/ https://blog.michaelfmcnamara.com/2016/04/dns-loops-how-to-not-configure-dns-forwarding/#comments Sun, 03 Apr 2016 12:51:31 +0000 https://blog.michaelfmcnamara.com/?p=5679 I'm continually amazed by how much I don't know and by all the little issues and problems that I encounter in my day-to-day tasks. You never know what's going to pop up.

I recently stumbled over a very poorly configured DNS environment and thought I would share how not to configure DNS forwarding.

There was a standalone Microsoft domain with four domain controllers which were set to forward their requests to another pair of Microsoft DNS servers, which eventually forwarded those requests to a fairly new Infoblox DNS environment. Upon looking at the Infoblox reports I noticed a number of non-resolvable hostnames at the top of the reports; they outpaced the next domains in the list by several million requests. Assuming that there was some mis-configured application server continually pounding the DNS environment, I decided to hunt through the logs to see if I could identify the original requestor and get them to clean up their act. I enabled query logging on one of the servers and set out to examine who was making the request. Oddly enough I found that the other three DNS servers were making the request. OK, I went to the next server and repeated the steps, finding that this server also showed the other three servers making the request. I repeated the process on the remaining two servers and found that all the requests for this bad hostname that I could capture weren't coming from a specific client but were instead coming just from the servers themselves. This isn't completely odd, since the servers themselves can be clients at the same time, but the volume of requests was huge and there was nothing running on these servers except Microsoft Domain Controller and Domain Name Services roles for a very small Microsoft environment.

It wasn’t until I opened the Microsoft DNS server configuration that I was able to piece together what was happening.

[Diagram: DNS forwarding loop – each domain controller configured to forward to the other three plus OpenDNS]

The servers were all configured as (multiple) Primary Masters for the internal domains, but they had also all been configured to use each other as forwarders along with OpenDNS. So in short, the configuration was causing a DNS loop for any request that failed to resolve. A query from a client to WEST-02 would be forwarded to the other three DNS servers along with OpenDNS. Those three DNS servers would then forward that query again to each of the other servers, over and over and over. There is no TTL on a DNS query with respect to propagation; a DNS query can be propagated through as many servers as needed. There is a TTL value on the actual DNS record, but that value is used in determining the caching lifecycle.

As I understand it, a single DNS query for a bad DNS name would continually self-propagate in this configuration: the servers would continually try to obtain an answer from their forwarders, and since the forwarders were configured to forward to each other, you'd end up in a loop scenario.

You should never configure a pair of DNS servers to forward to each other.
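
The cleanup in this environment was simply to point each domain controller's forwarders at the upstream pair and nothing else. As a rough sketch using the built-in dnscmd utility – the forwarder addresses below are placeholders:

rem replace the forwarder list so it only points upstream, never at a peer DC
dnscmd WEST-02 /ResetForwarders 10.1.1.1 10.2.2.2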

Cheers!

Image Credit: airfin

CryptoWall CryptoLocker – Here's what you should be doing https://blog.michaelfmcnamara.com/2016/03/cryptowall-cryptolocker-heres-what-you-should-be-doing/ https://blog.michaelfmcnamara.com/2016/03/cryptowall-cryptolocker-heres-what-you-should-be-doing/#comments Sat, 19 Mar 2016 15:42:19 +0000 https://blog.michaelfmcnamara.com/?p=5668 The past two weeks have been insane thanks to CryptoWall and the many variants that are out in the wild. It started early Monday morning with a UK office, hit the US West coast mid-week and then finally came calling to the US East coast by week's end. All told we had to restore some 250,000+ files on multiple Windows File Shares from either Shadow Copies or from backups. The bigger issue for us is the Windows File Shares to which the infected clients have mapped drives. So I've spent the past 2 days researching CryptoWall and trying to figure out how we can stem the tide and help right the ship. The current solution of re-imaging infected devices and restoring files just isn't sustainable – we just don't have the personnel to be playing whack-a-mole every day of the week. So what should we be doing as an enterprise to combat this threat? It turns out you'll need to really step up your game for this one.

  • Windows Software Updates and Patches – I’m hopeful this goes without saying but you never know
  • Java Updates – I would highly recommend you remove Java unless you absolutely need it
  • Adobe Flash Updates – I would highly recommend you remove Adobe Flash unless you absolutely need it
  • Adobe Reader Updates – I'm not sure if Adobe Reader is still the target it used to be but I keep it up-to-date regardless
  • Anti-Virus Updates – very important to keep your anti-virus up to date and functional

In my specific case we're currently using Trend Micro as our anti-virus solution, but I don't believe many of the AV solutions are faring well in protecting their clients due to the obfuscation.

You could be batting .400 with everything on the punch list above and still end up with CryptoWall / CryptoLocker or some variant infecting your laptop or desktop.

It turns out that you can use Software Restriction Policies in Windows to block executables from running outside of the normal C:\Program Files\ or C:\Program Files (x86)\ folders; this helps prevent the actual launch and installation of the ransomware and/or malware. The malware will still be downloaded by the Flash, Java or Javascript code/exploit, but it won't be allowed to execute from %AppData%, %LocalAppData%, %TEMP%, %UserProfile%, etc. There are varying discussions regarding a whitelist or blacklist approach. The National Security Agency even put out a configuration document back in 2010 covering the topic. I will stress that you really need to test thoroughly before deploying SRP across your enterprise network.
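
For reference, a blacklist-style rule set might look something like the sketch below. These are illustrative path rules only – created under Computer Configuration > Windows Settings > Security Settings > Software Restriction Policies > Additional Rules with a security level of Disallowed (the exact console path varies slightly by Windows version) – and your own testing should drive the final list:

%AppData%\*.exe                          Disallowed
%AppData%\*\*.exe                        Disallowed
%LocalAppData%\*.exe                     Disallowed
%LocalAppData%\*\*.exe                   Disallowed
%UserProfile%\AppData\Local\Temp\*.exe   Disallowed
%UserProfile%\AppData\Local\Temp\*\*.exe Disallowed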

I'll be firing up a Group Policy configuration to test Software Restriction Policies in my environment over the next week or so. I'll probably need to do some live testing by intentionally infecting a machine or two in order to validate that it actually works.

What are you doing to combat the threat? How are you making out? I’d love to hear, it might help save my sanity.

Cheers!


Image Credit: Carolin Chan

Black Friday 2015, behind the scenes https://blog.michaelfmcnamara.com/2015/12/black-friday-2015-behind-the-scenes/ Fri, 04 Dec 2015 22:45:30 +0000 http://blog.michaelfmcnamara.com/?p=5437 I was reading a post from Paul Stewart entitled, “Black Friday, Technology Glitches and Revenue Lost“, last night and thought to myself – there’s a blog post in there somewhere.

As someone who works for a large global retailer I have a new-found appreciation for the challenges and issues that present themselves during the holiday season for retailers. It's been a very tiring week for me and a large number of my co-workers as we worked tirelessly from Thursday night (Thanksgiving) to Tuesday morning to keep any issues like what Paul's wife experienced from impacting our brands. In the past two years I've learned that scale brings a whole other dimension to the game. I'm not just talking about bandwidth; that's usually a pretty simple problem to solve. Instead I'm referring to all the inter-dependencies in both the application layer and in the hardware (networking, storage, compute). Let me try and put this into perspective: over the long weekend one of the brands we manage attracted over 7.9 million visitors to their website generating some 234 million page views. In comparison, this blog gets about 1,500 page views a day.

Why is it so hard?

For these few weeks of the year most sites will see significantly more traffic than they see throughout the rest of the year. Discount sales will also drive additional volume to the sites, which may have issues trying to keep up with the number of online users. Often the issue is scale… can the application scale… can the infrastructure scale to meet the demand. And more importantly, can the vendors and third-parties that you rely on also scale to meet the demand. Look at the issues Target experienced on Cyber Monday when they offered 15% off every item. Around 10AM that morning they started having load issues and had to turn on Akamai's Shopper Prioritization Application (SPA) which essentially holds users in a queue within Akamai's network, only allowing a fixed number of users through the door to keep the website from completely collapsing under the load.

In my role I’m generally concerned with the following infrastructure;

  • Internet Service Providers (availability of the websites from multiple peering points)
  • Internet Bandwidth (bandwidth and performance to/from the websites from a global viewpoint)
  • Internet Load Balancers (balancing the external load across multiple external facing web servers)
  • Internet Firewalls (protecting origin servers)
  • Internal Bandwidth (performance within and between data centers)
  • Internal Load Balancers (load balancing internal API services)
  • Internal Firewalls (filtering traffic between web/app/db tiers)
  • Storage Fabric Performance (bandwidth utilization on individual SAN switches)
  • Storage Array Performance (read/write latency per storage pool or LUN)

When we have problems we’ll often see connection counts on our external and internal firewalls spike as the web or app servers try to spin up additional processes and connections to compensate either for the load or for the previously failed query or connection. This is often just a symptom of the problem and not the cause of the problem. The cause might be a third-party API that’s not responding quickly enough and because that API is backing up, now our application is backing up because we’re relying on them. In another example a poorly written stored procedure severely impacted a database, while some folks believed the database was at fault, it wasn’t hard to quickly identify the poorly written SQL code as the culprit. And while this stored procedure worked for 100 users it quickly failed when there were 1,000 users on the website.

Here’s one that people often forget about…

  • Voice Infrastructure – Contact Center 1-800 PRI/T1/SIP channel availability

We have 10 T1 circuits into one of our Call Centers, providing some 240 channels (ability to handle 240 concurrent calls). Throughout the year we rarely ever get above 60 concurrent calls. On Black Friday and Cyber Monday though we ran at 100% utilization (240 calls in queue) for quite a few hours each day, keeping our agents extremely busy.

Isn’t the cloud all about scalability?

Yes, it certainly is but not every retailer is as big as Amazon and has the ability to re-architect their entire application stack around the cloud. We purposely spin up additional web and app server instances just prior to the holiday season and we spend a lot of time running load tests to validate that everything is working properly.  Not every retailer has either the resources or the staff to bulk up for the holiday rush.

The majority of large retailers definitely track cart abandonment, some retailers will even remind you of an item left in your cart and occasionally entice you with an additional discount or coupon code. Go add something to your cart at NewEgg.com and see how long it is before they drop you an email message.

Retailers rely on tools like Pingdom, AlertSite, New Relic, AppDynamics and AppBoy to provide visibility into their website or mobile application performance along with the user experiences (user timings).

So while it might seem like a pretty easy problem to quickly solve, it’s actually a very complex problem.

Cheers!

Image Credit: Philippe Ramakers

ISC BIND 9.10.2-P3 Forwarding Caching Only Nameserver https://blog.michaelfmcnamara.com/2015/08/isc-bind-9-10-2-p3-forwarding-caching-only-nameserver/ Mon, 03 Aug 2015 12:00:16 +0000 http://blog.michaelfmcnamara.com/?p=5373 I recently had to migrate a large DNS environment from about 23 Microsoft Domain Controllers to Infoblox DNS. I could have just deleted all the zones and set up forwarding on the Microsoft DNS servers, but I wanted to leave the Microsoft DNS configuration and data in place to provide a quick backout option in the unlikely event that it was needed (it was needed, but the second time around using the named.conf file below was the charm).

I ended up deploying ISC BIND 9.10.2-P3 across a mix of Windows 2003 and Windows 2008 domain controller servers, some 32-bit and some 64-bit.

As I alluded to above, I originally had issues running BIND, getting error messages such as the following after only a few hours of running the service, with clients failing to get name resolution.

27-Jul-2015 19:15:04.575 general: error: ..\client.c:2108: unexpected error:
27-Jul-2015 19:15:04.575 general: error: failed to get request's destination: failure
27-Jul-2015 19:15:04.981 general: error: ..\client.c:2108: unexpected error:
27-Jul-2015 19:15:04.981 general: error: failed to get request's destination: failure
27-Jul-2015 19:15:20.971 general: error: ..\client.c:2108: unexpected error:
27-Jul-2015 19:15:20.971 general: error: failed to get request's destination: failure

There were also a few other errors that appeared to be related to the anti-DDoS mechanisms built into BIND;

27-Jul-2015 19:50:02.369 resolver: notice: clients-per-query increased to 15

So I went back and re-crafted the named.conf file and came up with the following, which seems to be working well for me now, almost 5 days after the Infoblox DNS migration.

You'll notice that I commented out the localhost zone and the 127.0.0.1 reverse zone as well. I didn't think that BIND would run without them but sure enough it does. I also enabled query logging so I could see what type of abuse the DNS servers were getting. I found a couple of servers that were querying more than 40,000 times a minute for a management platform that had been retired almost 5+ years ago.

options {
  directory "c:\program files\isc bind 9\bin";
 
  // here are the servers we'll send all our queries to
  forwarders {10.1.1.1; 10.2.2.2;};
  forward only;

  auth-nxdomain no;

  // need to include allow-query at a minimum
  allow-recursion { "any"; };
  allow-query { "any"; };
  allow-transfer { "none"; };

  // lets leave IPv6 off for now less to worry about
  listen-on-v6 { "none"; };

  // standard stuff
  version none;
  minimal-responses yes;
 
  // cache positive and negative results for only 5 minutes
  max-cache-ttl 300;
  max-ncache-ttl 300;

  // disable DDoS mechanisms in BIND
  clients-per-query 0;
  max-clients-per-query 0;

};

logging{
   channel example_log{
    file "C:\program files\isc bind 9\log\named.log" versions 3 size 250k;
    severity info;
    print-severity yes;
    print-time yes;
    print-category yes;
  };

  channel queries_file {
    file "c:\program files\isc bind 9\log\queries.log" versions 10 size 10m;
    severity dynamic;
    print-time yes;
  };

  category default{ example_log; };
  category queries { queries_file; };

};

//zone "localhost" in{
//  type master;
//  file "pri.localhost";
//  allow-update{none;};
//};

//zone "0.0.127.in-addr.arpa" in{
//  type master;
//  file "localhost.rev";
//  allow-update{none;};
//};
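
Before restarting the service it's worth letting BIND validate the file. A quick sketch, assuming the default Windows install path used in the config above and that named-checkconf.exe shipped with your BIND package:

rem check named.conf syntax before restarting the ISC BIND service
"c:\program files\isc bind 9\bin\named-checkconf.exe" "c:\program files\isc bind 9\etc\named.conf"

rem then confirm resolution is flowing through the new forwarders
nslookup www.example.com 127.0.0.1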

I set up my first nameserver running BIND 4.x back in 1995, more than 20 years ago, while working at Manhattan College. While I'm pretty familiar with BIND, a lot has changed since then, so I had to do a fair bit of research to arrive at the configuration above.

Hopefully someone else will find it helpful.

Cheers!

IP Routing in Small Networks https://blog.michaelfmcnamara.com/2015/05/ip-routing-in-small-networks/ https://blog.michaelfmcnamara.com/2015/05/ip-routing-in-small-networks/#comments Tue, 05 May 2015 23:37:02 +0000 http://blog.michaelfmcnamara.com/?p=5212 I've seen quite a few issues throughout the years, but one that I run into time and time again revolves around which device to set as the default gateway, in a DHCP scope, when you have both an internal router and an Internet firewall on the same IP network. Should you set the router as the default gateway or should you set the firewall as the default gateway? Let's use the topology below as a pretty simple example;

[Diagram: internal router and Internet firewall sharing the same IP network as the clients (Internet Default Route)]

Theoretically this shouldn’t be an issue but I’ve seen it be an issue time and time again. This configuration usually results in performance issues because ICMP redirects are ignored, or never issued.

Cisco 2921 Router – Default Gateway

If we set the default gateway for the Windows 7 desktop/laptop to the Cisco 2921 Router then all traffic that's not local to this network will be sent to that router. That includes traffic for other corporate networks as well as Internet traffic. In theory the Cisco 2921 Router will forward traffic destined for corporate networks out over the WAN while it will forward traffic for Internet based destinations to the Cisco ASA 5505 firewall. When it forwards traffic to the Internet it will also issue an ICMP redirect to the Windows 7 Desktop/Laptop, informing that device that 10.100.25.2 is a better next hop for traffic destined to the Internet.
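
If you want to see whether redirects are actually in play, both the router and the client will tell you. A quick sketch – the 2921 interface name here is an assumption:

! on the Cisco 2921 - ICMP redirects are enabled per interface by default
show ip interface gig0/1 | include redirects

! on the Windows 7 desktop/laptop - redirect-learned routes show up in the route table
route print -4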

Cisco ASA 5505 – Default Gateway

If we set the default gateway for the desktop/laptop to the Cisco ASA 5505 firewall then all traffic that's not destined for the Internet will still be sent toward the firewall first. Since the Cisco ASA 5505 is a firewall, it won't issue ICMP redirects to the desktop/laptop, so all traffic will need to pass through the Cisco ASA 5505.

What’s the solution?

I'll be honest and say I haven't seen this too much lately, and in smaller networks it doesn't usually amount to much of a problem. In larger networks though, I've seen a lot of performance issues around this scenario. So the easy solution is just not to place the firewall on the same network segment as any of the desktops/laptops.

Have you ever run into this problem?

Cheers!

Response: Holiday retail “freeze” takes hold https://blog.michaelfmcnamara.com/2014/12/response-holiday-retail-freeze-takes-hold/ Thu, 11 Dec 2014 13:00:13 +0000 http://blog.michaelfmcnamara.com/?p=4751 I recently came across the article, Holiday retail “freeze” takes hold by Dan Raywood in which Dan quotes numerous sources concerning the impact of the typical change freeze that occurs in retail organizations over the Christmas holiday from a security perspective. I thought I would add my $0.02 to the discussion here. The typical change freeze lasts about 6 weeks starting just before Thanksgiving until just after New Years Day. It seems that many of Dan’s sources are really focused on those 6 weeks but I would ask, what happened with the other 42 weeks of the year?

In the retail vertical we plan our entire year around the holiday shopping period. In January we’ll kick off numerous projects and infrastructure upgrades that will need to be completed before next year’s holiday rush. We’ll perform our yearly maintenance and upgrades, keeping our systems on the leading edge not the bleeding edge. We’ll still push down Microsoft, Adobe and Apple security updates to desktops and laptops alike throughout the freeze but we’ll hold off on pushing any security updates to our back-end infrastructure unless the risk warrants a change. This year has been the exception to the norm because of a number of high profile late breaking security vulnerabilities that have demanded some unscheduled but approved emergency changes and software upgrades.

It’s all about preparation and planning…

I would compare it to my teenage daughter keeping her room clean. If she keeps her room relatively tidy and clean then it's pretty easy to clean quickly. If she keeps her room like a pig sty then it's going to take a lot of time and effort to clean it up. The same goes for any network infrastructure and your security posture. If you're keeping things relatively tidy and up to date for the other 42 weeks of the year you should be able to survive a 6 week change freeze.

Cheers!
Note: This is a series of posts made under the Network Engineer in Retail 30 Days of Peak; this is post number 17 of 30. All the posts can be viewed from the 30in30 tag.

Cyber Monday – the aftermath https://blog.michaelfmcnamara.com/2014/12/cyber-monday-the-aftermath/ Tue, 02 Dec 2014 20:00:22 +0000 http://blog.michaelfmcnamara.com/?p=4597 I survived my first Cyber Monday and came away with a new appreciation for how difficult it can be working with applications at scale. Thankfully our data centers performed well and held up under the extreme load. We had a few database issues with one of our brands but the others all ran fine without any significant problems. The excitement started around 8AM yesterday, but it really picked up just after 8PM, traditionally a busy hour on Cyber Monday, when we noticed a significant increase in the number of users hitting the site. At that time we started splitting traffic across multiple data centers to try and alleviate the compute load – Internet bandwidth wasn't the bottleneck. As midnight drew closer the load on all our data centers surged as shoppers tried to cash in on the sales and deals. Here's a combined graph of all Internet traffic for yesterday across all our Internet Service Providers. Thankfully our sites are front-ended by a Content Distribution Network so we only see the traffic to/from the origin servers and we're spared the actual edge bandwidth.

[Graph: combined Internet utilization across all Internet Service Providers]

Looking at some of the stats from our CDN provider, they served up 233 million page views on Monday December 1, 2014 totaling some 31TB of data. For comparison, the previous Monday we had only 130 million page views totaling some 14.8TB of data – roughly 1.8x the page views and 2.1x the edge bandwidth. Looking at the graph above you can see that we peaked around 450Mbps outbound. That's still a lot of data when you remember that the majority of caching is going on in the CDN, and that traffic is just the raw HTML and XML data for the category pages and the shopping cart. Mobile Apps were a big hit this year, as projected in the retail industry. They also caused us some significant performance issues: as Apple Push notifications went out to 50K+ users at a time, the site would start to break under the load, only to recover shortly thereafter.

[Photo: Cyber Monday 2014 war room]

While there are still quite a few shopping days left this holiday season, a few of the folks in the war room were already talking about preparing for next year. I was speaking to a few of the developers who are eager to look into utilizing Docker in their staging and development environments. While I've played with Docker a little myself, I need to figure out how it interacts with the network.

All in all it was a rewarding but very tiring experience. It was impressive to work with such a talented and dedicated team, all working tirelessly to make sure that the lights stayed on throughout the past 4 days. I'm excited and look forward to what we'll be able to do together in the future.

Cheers!

Note: This is a series of posts made under the Network Engineer in Retail 30 Days of Peak; this is post number 8 of 30. All the posts can be viewed from the 30in30 tag.

Network is Slow! https://blog.michaelfmcnamara.com/2014/06/network-is-slow/ https://blog.michaelfmcnamara.com/2014/06/network-is-slow/#comments Wed, 18 Jun 2014 00:10:08 +0000 http://blog.michaelfmcnamara.com/?p=4372 Those are the words that almost every network engineer loathes to hear: the "network is slow". And those words are usually spoken by folks who really have no idea how the network works, let alone the understanding to quantify the word "slow". In my past life I had built a fairly large dark fiber metropolitan network where the smallest link, outside of remote VPN offices, was 1Gbps. I spent years training the IT staff around me to understand the difference between an under-performing application, an overloaded server and a congested (slow) network. In that past life I rarely heard the words, the "network is slow". Fast forward to the current day and it seems I have my work cut out for me since I hear those glowing words on average 3-4 times a day.

Add in 100Mbps to the desktop (thanks to the Avaya 4621 IP Phones) and you’ll even find me occasionally grinding my mouse in frustration as I wait for a large Visio diagram or Excel spreadsheet to open.  That’s not even mentioning the added delay if the document has been archived by an electronic vaulting or archiving solution. It’s even more grinding when that solution prompts me to authenticate in order to retrieve the archived document – that’s a great user experience.

Any reasonable user would blame the network because that’s the piece that ties everything together so that’s probably what’s broken. I wouldn’t fault the user for that assumption because they really don’t know any better, in their view it’s just plain slow and occasionally painfully so. So next time you hear the “network is slow”, just remember to smile!

Cheers!

Wireless LAN Vendors https://blog.michaelfmcnamara.com/2008/04/wireless-lan-vendors/ Sat, 12 Apr 2008 03:00:00 +0000 http://maddog.mlhs.org/blog/2008/04/wireless-lan-vendors/ Thanks to everyone that participated in the poll, “What vendor are you using for your wireless LAN?”. It’s only to be expected that more folks responded with Motorola since I have a few articles dedicated to the Motorola Wireless LAN Switches posted on this blog.

Wireless networking has definitely brought its own set of distinct challenges. Channel and power management are among the two big problems with wireless networking. And let's not forget the whole security issue with WEP, WPA and WPA2. Interoperability issues can also create a lot of headaches. And the never ending discussions over which band is better, the 2.4GHz (802.11b/g) or 5GHz (802.11a) band.

What vendor are you using for your wireless LAN?

  • Aruba – 4 (9%)
  • Cisco – 6 (14%)
  • Extreme – 0 (0%)
  • Motorola – 15 (36%)
  • Meru – 3 (7%)
  • Trapeze – 6 (14%)
  • 3Com – 0 (0%)
  • Other – 8 (19%)

Thanks for the feedback!

Cheers!

Power over Ethernet Plus (PoE+) https://blog.michaelfmcnamara.com/2008/03/power-over-ethernet-plus-poe/ Sat, 22 Mar 2008 15:00:00 +0000 http://maddog.mlhs.org/blog/2008/03/power-over-ethernet-plus-poe/ I just recently learned that the majority of 802.11n products in design will likely outpace the current 13-15 watts of power provided by the 802.3af specification. It seems the IEEE is already working on 802.3at, a new specification labeled "PoE+" by some.

What does this mean for the thousands of PoE (802.3af) ports already deployed throughout organizations?

Here’s a good article, A Look at POE Plus, in Network Computing by Peter Morrissey.

There are also some interesting articles over at Network World regarding 802.11n.

I'm not sure about everyone else out there but I won't be rushing to deploy 802.11n or 802.3at gear anytime soon. We've actually standardized on using PoE capable network switches throughout the network going forward. The price difference between a PoE switch and a non-PoE switch is almost negligible when you consider the time and effort required to replace that switch in the future if PoE is required for some new application.

If you’re seriously thinking about deploying 802.11n you’ll need to consider how you’re going to power those devices.

Cheers!
