It’s no surprise that I need to know when our websites are down, but I also need to know why they went down. Often the redundancy kicks in and the website quickly recovers, yet the question remains: why was the website down? Was it a circuit failure, a router failure, a load-balancer failure, a web server failure, an application server failure or a database failure? While you can glean a lot of information from the log data generated by the routers, firewalls, switches, load-balancers and web servers, sometimes there are gaps in that data. A few months ago I put together a quick Bash script that calls a few Nagios plugins to help me gather some data points in the event that I needed to look back in time, after the fact, to determine what had caused an outage or failure. I decided to stand up a few Linode Linux servers spread across a number of Data Centers around the world. While there are dozens if not hundreds of commercial solutions for website monitoring, I wanted something cheap over which I had complete control, and writing this script took all of two hours one afternoon.
The script runs every 60 seconds, makes an HTTP call to the origin web server and validates that it’s returning the proper HTML content. If the server fails to answer the first HTTP call, or the response doesn’t contain the prerequisite content, the script waits and tries a second time. If the second HTTP call fails, the script logs that fact and tries a PING to verify that it can reach the web server. If the PING fails, the script kicks off a traceroute using mtr to try to isolate the location of the problem. A second script performs ICMP pings every 60 seconds against every piece of our public network infrastructure, including the firewalls and load-balancers across our multiple Data Centers, from multiple public Internet points.
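I schedule the script from cron. The exact crontab entry isn’t shown in this post, but assuming the path from the script’s own header, it would look something like this:

# Run the monitor once a minute; the script itself sleeps 15 seconds
# so the checks don't all fire exactly at the top of the minute.
* * * * * /usr/local/monitor/monitor.sh > /dev/null 2>&1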
The combination of the data points from both scripts, run in multiple Data Centers around the world, made it relatively easy to quickly determine what had transpired during an event. In one case we were alerted to a peering issue between NAC and Level3. In another event we observed a complete disconnect between NetworkLayer/SoftLayer and Comcast between 1AM and 2AM one night; I’m guessing that was some type of scheduled maintenance and they didn’t have BGP configured properly. There were a few times when the script would alert that everything was down, but only from a single Data Center; this often indicated a problem with the Internet peers that connected that Data Center to the Internet in general. It wasn’t a foolproof solution by any means, but it gave me the data points I needed and the freedom to adapt as needed.
You can download the entire script from the link below.
#!/bin/bash
#
# Filename: /usr/local/monitor/monitor.sh
#
# Purpose:  Monitor the availability of several websites and report their
#           availability. This script leverages several Nagios plugins to
#           help simplify the collection of data.
#
# Language: Bash Script
#
# Author:   Michael McNamara
#
# Version:  0.9
#
# Date:     Oct 26, 2014
#
# License:
# Copyright (C) 2014 Michael McNamara (mfm@michaelfmcnamara.com)
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>
#
# Changes:
#   Nov 11, 2014  add lock file checking to prevent multiple instances
#   Oct 31, 2014  added code to retry the HTTP_CHECK before alarm
#   Oct 30, 2014  added additional websites to query
#   Oct 27, 2014  cleaned up script/updated documentation
#
# Requirements:
#
#   Nagios check_icmp plugin
#   Nagios check_http plugin
#   Nagios check_dns plugin
#   http://nagiosplugins.org/
#
# Notes:
#   Command Line Reference:
#   ./monitor.sh
#

# Declare Variables
SENDMAIL="/bin/mail"
CHECK_HTTP="/usr/local/monitor/check_http"
CHECK_FPING="/usr/local/monitor/check_fping"
CHECK_DNS="/usr/local/monitor/check_dns"
MTR="/usr/sbin/mtr"
LOG="/usr/local/monitor/monitor.log"
LOCKFILE="/tmp/monitor.tmp"
LOCATION="New York, NY"
MAIL_TO="root"
MAIL_SUBJECT="HTTP: Web Application Status Report ($LOCATION)"

#
### SITE SPECIFIC INFORMATION <<<<< YOU SHOULD EDIT THE LINES BELOW
#
# IPS = List of webservers by FQDN or IP address
IPS=( webserver1.acme.com webserver2.acme.com webserver3.acme.com )

# HOSTS = The FQDN of the web property that resides on the webserver
HOSTS=( www.brandone.com www.brandtwo.com www.brandthree.com )

# URLS = The path to be appended to the FQDN of the web property
URLS=( /brand1/index.jsp /brand2/index.jsp /brand3/index.jsp )

# CONTENTS = A regex containing some text that should be found on
#            the webpage for each brand or web property.
CONTENTS=( "Brand One" "Brand Two" "Brand Three" )
#
# <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

####################################################################
#                 M A I N   P R O G R A M
####################################################################

# LET'S WAIT FOR A LITTLE SO WE'RE NOT FIRING AT THE TOP OF THE MINUTE
sleep 15

# LET'S CHECK TO SEE IF THERE'S ALREADY A COPY RUNNING
if [ -e ${LOCKFILE} ] && kill -0 `cat ${LOCKFILE}`; then
   echo "already running"
   exit
fi

# SET UP A TRAP IN CASE WE EXIT PREMATURELY
trap "rm -f ${LOCKFILE}; exit" INT TERM EXIT

# LET'S CREATE A LOCKFILE HOLDING OUR PID
echo $$ > ${LOCKFILE}

# LET'S ITERATE OVER THE WEBSERVERS ($IPS[])
for (( i = 0; i < ${#IPS[@]}; i++ ))
do

   # LET'S TRY A QUICK HTTP CALL AND SEE WHAT WE GET
   RESULT1="`${CHECK_HTTP} -I ${IPS[$i]} -H ${HOSTS[$i]} -u ${URLS[$i]} -s "${CONTENTS[$i]}"`"

   # LET'S CHECK THE RESULT
   if [[ $RESULT1 =~ "OK" ]]
   then
      # IF THE RESULT WAS OK THEN LOG THE RESULT
      echo "$(date) OK ${IPS[$i]} ${HOSTS[$i]} $RESULT1" >> $LOG
   else
      # IF THE RESULT WAS BAD LET'S DO MORE, FIRST WAIT A LITTLE
      sleep 10
      # LOG THAT WE FAILED THE FIRST CHECK
      echo "$(date) FAIL ${IPS[$i]} ${HOSTS[$i]} $RESULT1" >> $LOG
      # ATTEMPT A SECOND HTTP CALL AND SEE WHAT WE GET
      RESULT2="`${CHECK_HTTP} -I ${IPS[$i]} -H ${HOSTS[$i]} -u ${URLS[$i]} -s "${CONTENTS[$i]}"`"
      # LET'S CHECK THE RESULT
      if [[ $RESULT2 =~ "OK" ]]
      then
         # IF THE RETRY WAS OK THEN LOG THE RESULT
         echo "$(date) RETRY OK ${IPS[$i]} ${HOSTS[$i]} $RESULT2" >> $LOG
      else
         # IF THE RETRY WAS BAD, LOG AND EMAIL
         echo "FAIL RETRY ${IPS[$i]} ${HOSTS[$i]} $RESULT2" | $SENDMAIL -s "$MAIL_SUBJECT" $MAIL_TO
         echo "$(date) RETRY FAIL ${IPS[$i]} ${HOSTS[$i]} $RESULT2" >> $LOG
         # LET'S TEST AN ICMP CALL TO THE WEBSERVER TO VALIDATE IT'S NOT A NETWORK ISSUE
         PING1="`${CHECK_FPING} ${IPS[$i]} -n 3`"
         # LET'S CHECK THE RESULT
         if [[ $PING1 =~ "OK" ]]
         then
            # IF THE PING WAS OK THEN LOG THE RESULT
            echo "$(date) OK ${IPS[$i]} ${HOSTS[$i]} $PING1" >> $LOG
         else
            # IF THE PING FAILED, LOG, EMAIL AND COLLECT A TRACEROUTE
            echo "FAIL RETRY PING ${IPS[$i]} ${HOSTS[$i]} $PING1" | $SENDMAIL -s "$MAIL_SUBJECT" $MAIL_TO
            echo "$(date) FAIL ${IPS[$i]} ${HOSTS[$i]} $PING1" >> $LOG
            ${MTR} --no-dns -rc 5 ${IPS[$i]} >> $LOG
         fi
      fi
   fi
done

# LET'S WAIT FOR A FEW SECONDS
sleep 10

# LET'S CLEAN UP AND REMOVE THE LOCKFILE
rm -f ${LOCKFILE}

# THAT'S ALL FOLKS!
exit 0
Cheers!
Note: This is one of a series of posts made under the Network Engineer in Retail 30 Days of Peak; this is post number 29 of 30. All the posts can be viewed from the 30in30 tag.
Image Credit: michele de notaristefani