It’s no surprise that I need to know when our websites are down, but I also need to know why they went down. Often the redundancy will kick in and the website will quickly recover, but the question remains: why was the website down? Was there a circuit failure, a router failure, a load-balancer failure, a web server failure, an application server failure, a database failure? While you can glean a lot of information from the log data generated by the routers, firewalls, switches, load-balancers and web servers, sometimes there are gaps in that data. A few months ago I put together a quick Bash script that calls a few Nagios plugins to help me gather some data points, so that I could look back in time, after the fact, and determine what had caused an outage or failure. I decided to stand up a few Linode Linux servers spread out across a number of Data Centers around the world. While there are dozens if not hundreds of commercial solutions for website monitoring, I wanted something cheap over which I had complete control, and writing this script took all of two hours one afternoon.
The script runs every 60 seconds; it makes an HTTP call to the origin web server and validates that it’s returning the proper HTML content. If the server fails to answer the first HTTP call, or the response doesn’t contain the prerequisite content, the script waits and tries a second time. If the second HTTP call fails, the script logs that fact and then tries a PING to verify that it can reach the web server. If the PING fails, the script kicks off a traceroute using mtr to try to isolate the location of the problem. A second script performs ICMP pings every 60 seconds against every piece of our public network infrastructure, including the firewalls and load-balancers across our multiple Data Centers, from multiple public Internet points.
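The post doesn’t show how the script is scheduled, but a standard per-minute cron entry would match the behavior described. Something like the line below would do it (the path comes from the script header further down; the sleep at the top of the script keeps the checks from firing exactly at the top of the minute):

* * * * * /usr/local/monitor/monitor.sh > /dev/null 2>&1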
The combination of the data points from both scripts, run in multiple Data Centers around the world, made it relatively easy to quickly determine what had transpired during an event. In one case we were alerted to a peering issue between NAC and Level3. In another event we observed a complete disconnect between NetworkLayer/SoftLayer and Comcast between 1AM and 2AM one night; I’m guessing that was some type of scheduled maintenance and they didn’t have BGP configured properly. There were a few times, though, when the script would alert that everything was down but only from a single Data Center; this often indicated a problem with the Internet peers that connected that Data Center to the Internet in general. It wasn’t a foolproof solution by any means, but it gave me the data points I needed and the freedom to adapt as needed.
You can download the entire script from the link.
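For context, each check in the script boils down to a single call to the Nagios check_http plugin: -I targets a specific web server, -H sets the Host header for the web property, -u supplies the URL path, and -s is the content string to expect in the response. On success the plugin prints a status line beginning with “HTTP OK”, which is the string the script tests for. An illustrative run (the hostnames and numbers here are made up):

./check_http -I webserver1.acme.com -H www.brandone.com -u /brand1/index.jsp -s "Brand One"
HTTP OK: HTTP/1.1 200 OK - 24071 bytes in 0.512 second response time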
#!/bin/bash
#
# Filename: /usr/local/monitor/monitor.sh
#
# Purpose: Monitor the availability of several websites and report their
# availability. This script leverages several Nagios plugins to
# help simplify the collection of data.
#
# Language: Bash Script
#
# Author: Michael McNamara
#
# Version: 0.9
#
# Date: Oct 26, 2014
#
# License:
# Copyright (C) 2014 Michael McNamara (mfm@michaelfmcnamara.com)
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>
#
# Changes:
# Nov 11, 2014 add lock file checking to prevent multiple instances
# Oct 31, 2014 added code to retry the HTTP_CHECK before alarm
# Oct 30, 2014 added additional websites to query
# Oct 27, 2014 cleaned up script/updated documentation
#
# Requirements:
#
# Nagios check_fping plugin
# Nagios check_http plugin
# Nagios check_dns plugin
# http://nagiosplugins.org/
# mtr (for traceroutes)
#
# Notes:
# Command Line Reference;
# ./monitor.sh
#
#
# Declare Variables
SENDMAIL="/bin/mail"
CHECK_HTTP="/usr/local/monitor/check_http"
CHECK_FPING="/usr/local/monitor/check_fping"
CHECK_DNS="/usr/local/monitor/check_dns"
MTR="/usr/sbin/mtr"
LOG="/usr/local/monitor/monitor.log"
LOCKFILE="/tmp/monitor.tmp"
LOCATION="New York, NY"
MAIL_TO="root"
MAIL_SUBJECT="HTTP: Web Application Status Report ($LOCATION)"
#
### SITE SPECIFIC INFORMATION <<<<< YOU SHOULD EDIT THE LINES BELOW
#
# IPS = List of webservers by FQDN or IP address
IPS=( webserver1.acme.com webserver2.acme.com webserver3.acme.com )
# HOSTS = The FQDN of the web property that resides on the webserver
HOSTS=( www.brandone.com www.brandtwo.com www.brandthree.com )
# URLS = The path to be appended to the FQDN of the web property
URLS=( /brand1/index.jsp /brand2/index.jsp /brand3/index.jsp )
# CONTENTS = A regex containing some text that should be found on
# the webpage for each brand or web property.
CONTENTS=( "Brand One" "Brand Two" "Brand Three" )
#
# <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
####################################################################
# M A I N P R O G R A M
####################################################################
# LETS WAIT FOR A LITTLE SO WE'RE NOT FIRING AT THE TOP OF THE MINUTE
sleep 15
# LETS CHECK TO SEE IF THERES ALREADY A COPY RUNNING
if [ -e ${LOCKFILE} ] && kill -0 $(cat ${LOCKFILE}) 2>/dev/null; then
   echo "already running"
   exit
fi
# SETUP A TRAP INCASE WE EXIT PREMATURELY
trap "rm -f ${LOCKFILE}; exit" INT TERM EXIT
# LETS CREATE A LOCKFILE
echo $$ > ${LOCKFILE}
# LETS ITERATE OVER THE WEBSERVERS ($IPS[])
for (( i = 0; i < ${#IPS[@]}; i++ )); do
   # LETS TRY A QUICK HTTP CALL AND SEE WHAT WE GET
   RESULT1="$(${CHECK_HTTP} -I ${IPS[$i]} -H ${HOSTS[$i]} -u ${URLS[$i]} -s "${CONTENTS[$i]}")"
   # LETS CHECK THE RESULT
   if [[ $RESULT1 =~ "OK" ]]; then
      # IF THE RESULT WAS OK THEN LOG THE RESULT
      echo "$(date) OK ${IPS[$i]} ${HOSTS[$i]} $RESULT1" >> $LOG
   else
      # IF THE RESULT WAS BAD LETS DO MORE, FIRST WAIT A LITTLE
      sleep 10
      # LOG THAT WE FAILED THE FIRST CHECK
      echo "$(date) FAIL ${IPS[$i]} ${HOSTS[$i]} $RESULT1" >> $LOG
      # ATTEMPT A SECOND HTTP CALL AND SEE WHAT WE GET
      RESULT2="$(${CHECK_HTTP} -I ${IPS[$i]} -H ${HOSTS[$i]} -u ${URLS[$i]} -s "${CONTENTS[$i]}")"
      # LETS CHECK THE RESULT
      if [[ $RESULT2 =~ "OK" ]]; then
         # IF THE RESULT WAS OK THEN LOG THE RESULT
         echo "$(date) RETRY OK ${IPS[$i]} ${HOSTS[$i]} $RESULT2" >> $LOG
      else
         # IF THE RESULT WAS BAD LETS DO MORE, FIRST LOG AND EMAIL
         echo "FAIL RETRY ${IPS[$i]} ${HOSTS[$i]} $RESULT2" | $SENDMAIL -s "$MAIL_SUBJECT" $MAIL_TO
         echo "$(date) RETRY FAIL ${IPS[$i]} ${HOSTS[$i]} $RESULT2" >> $LOG
         # LETS TEST AN ICMP CALL TO THE WEBSERVER TO VALIDATE ITS NOT A NETWORK ISSUE
         PING1="$(${CHECK_FPING} ${IPS[$i]} -n 3)"
         # LETS CHECK THE RESULT
         if [[ $PING1 =~ "OK" ]]; then
            # IF THE RESULT WAS OK THEN LOG THE RESULT
            echo "$(date) PING OK ${IPS[$i]} ${HOSTS[$i]} $PING1" >> $LOG
         else
            # IF THE RESULT WAS BAD THEN LOG AND EMAIL AND COLLECT A TRACEROUTE
            echo "FAIL RETRY PING ${IPS[$i]} ${HOSTS[$i]} $PING1" | $SENDMAIL -s "$MAIL_SUBJECT" $MAIL_TO
            echo "$(date) PING FAIL ${IPS[$i]} ${HOSTS[$i]} $PING1" >> $LOG
            ${MTR} --no-dns -rc 5 ${IPS[$i]} >> $LOG
         fi
      fi
   fi
done
# LETS WAIT FOR A FEW SECONDS
sleep 10
# LETS CLEAN UP AND REMOVE THE LOCKFILE
rm -f ${LOCKFILE}
# THATS ALL FOLKS!
exit 0
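The second script mentioned above, the one that pings our public infrastructure, isn’t included in this post, but a minimal sketch along the same lines might look like the following. The target hostnames, log path and packet count here are assumptions, not the actual production values:

#!/bin/bash
# SKETCH OF THE COMPANION ICMP MONITOR - PINGS EACH PIECE OF PUBLIC
# INFRASTRUCTURE AND LOGS THE RESULT, MIRRORING THE STYLE OF monitor.sh
CHECK_FPING="/usr/local/monitor/check_fping"
LOG="/usr/local/monitor/icmp-monitor.log"
# TARGETS = FIREWALLS, LOAD-BALANCERS, ROUTERS (PLACEHOLDER NAMES)
TARGETS=( fw1.acme.com fw2.acme.com lb1.acme.com lb2.acme.com )
for TARGET in "${TARGETS[@]}"; do
   RESULT="$(${CHECK_FPING} ${TARGET} -n 3)"
   if [[ $RESULT =~ "OK" ]]; then
      echo "$(date) OK ${TARGET} $RESULT" >> $LOG
   else
      echo "$(date) FAIL ${TARGET} $RESULT" >> $LOG
   fi
done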
Cheers!
Note: This post is part of the Network Engineer in Retail 30 Days of Peak series; it is post number 29 of 30. All the posts can be viewed from the 30in30 tag.
Image Credit: michele de notaristefani