It’s no surprise that I need to know when our websites are down, but I also need to know why they went down. Often the redundancy kicks in and the website quickly recovers, yet the question remains: why was the website down? Was it a circuit failure, a router failure, a load-balancer failure, a web server failure, an application server failure or a database failure? While you can glean a lot of information from the log data generated by the routers, firewalls, switches, load-balancers and web servers, sometimes there are gaps in that data. A few months ago I put together a quick Bash script that calls a few Nagios plugins to help me gather some data points in the event that I needed to look back in time, after the fact, to determine what had caused an outage or failure. I decided to stand up a few Linode Linux servers spread across a number of Data Centers around the world. While there are dozens if not hundreds of commercial solutions for website monitoring, I wanted something cheap over which I had complete control, and writing this script took all of two hours one afternoon.
The script runs every 60 seconds, makes an HTTP call to the origin web server and validates that it’s returning the proper HTML content. If the server fails to answer the first HTTP call, or the response doesn’t contain the prerequisite content, the script waits and tries a second time. If the second HTTP call fails, the script logs that fact and tries a PING to verify that it can reach the web server. If the PING fails, the script kicks off a traceroute using mtr to try to isolate the location of the problem. A second script performs ICMP pings every 60 seconds against every piece of our public network infrastructure, including the firewalls and load-balancers across our multiple Data Centers, from multiple public Internet points.
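I schedule the script from cron. The exact crontab entry isn’t shown in this post, but assuming the path from the script’s own header, it would look something like this:

# Run the monitor once a minute; the script itself sleeps 15 seconds
# so the checks don't all fire exactly at the top of the minute.
* * * * * /usr/local/monitor/monitor.sh > /dev/null 2>&1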
The combination of the data points from both scripts, run in multiple Data Centers around the world, made it relatively easy to quickly determine what had transpired during an event. In one case we were alerted to a peering issue between NAC and Level3. In another event we observed a complete disconnect between NetworkLayer/SoftLayer and Comcast between 1AM and 2AM one night; I’m guessing that was some type of scheduled maintenance and they didn’t have BGP configured properly. There were a few times when the script would alert that everything was down, but only from a single Data Center; this often indicated a problem with the Internet peers that connected that Data Center to the Internet in general. It wasn’t a foolproof solution by any means, but it gave me the data points I needed and the freedom to adapt as needed.
You can download the entire script from the link below.
#!/bin/bash
#
# Filename: /usr/local/monitor/monitor.sh
#
# Purpose:  Monitor the availability of several websites and report their
#           availability. This script leverages several Nagios plugins to
#           help simplify the collection of data.
#
# Language: Bash Script
#
# Author:   Michael McNamara
#
# Version:  0.9
#
# Date:     Oct 26, 2014
#
# License:
# Copyright (C) 2014 Michael McNamara (mfm@michaelfmcnamara.com)
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>
#
# Changes:
#   Nov 11, 2014  add lock file checking to prevent multiple instances
#   Oct 31, 2014  added code to retry the HTTP_CHECK before alarm
#   Oct 30, 2014  added additional websites to query
#   Oct 27, 2014  cleaned up script/updated documentation
#
# Requirements:
#
#   Nagios check_icmp plugin
#   Nagios check_http plugin
#   Nagios check_dns plugin
#   http://nagiosplugins.org/
#
# Notes:
#   Command Line Reference:
#   ./monitor.sh
#

# Declare Variables
SENDMAIL="/bin/mail"
CHECK_HTTP="/usr/local/monitor/check_http"
CHECK_FPING="/usr/local/monitor/check_fping"
CHECK_DNS="/usr/local/monitor/check_dns"
MTR="/usr/sbin/mtr"
LOG="/usr/local/monitor/monitor.log"
LOCKFILE="/tmp/monitor.tmp"
LOCATION="New York, NY"
MAIL_TO="root"
MAIL_SUBJECT="HTTP: Web Application Status Report ($LOCATION)"

#
### SITE SPECIFIC INFORMATION <<<<< YOU SHOULD EDIT THE LINES BELOW
#
# IPS = List of webservers by FQDN or IP address
IPS=( webserver1.acme.com webserver2.acme.com webserver3.acme.com )

# HOSTS = The FQDN of the web property that resides on the webserver
HOSTS=( www.brandone.com www.brandtwo.com www.brandthree.com )

# URLS = The path to be appended to the FQDN of the web property
URLS=( /brand1/index.jsp /brand2/index.jsp /brand3/index.jsp )

# CONTENTS = A regex containing some text that should be found on
#            the webpage for each brand or web property.
CONTENTS=( "Brand One" "Brand Two" "Brand Three" )
#
# <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

####################################################################
#                 M A I N   P R O G R A M
####################################################################

# LET'S WAIT FOR A LITTLE SO WE'RE NOT FIRING AT THE TOP OF THE MINUTE
sleep 15

# LET'S CHECK TO SEE IF THERE'S ALREADY A COPY RUNNING
if [ -e ${LOCKFILE} ] && kill -0 `cat ${LOCKFILE}`; then
   echo "already running"
   exit
fi

# SET UP A TRAP IN CASE WE EXIT PREMATURELY
trap "rm -f ${LOCKFILE}; exit" INT TERM EXIT

# LET'S CREATE A LOCKFILE HOLDING OUR PID
echo $$ > ${LOCKFILE}

# LET'S ITERATE OVER THE WEBSERVERS ($IPS[])
for (( i = 0; i < ${#IPS[@]}; i++ ))
do

   # LET'S TRY A QUICK HTTP CALL AND SEE WHAT WE GET
   RESULT1="`${CHECK_HTTP} -I ${IPS[$i]} -H ${HOSTS[$i]} -u ${URLS[$i]} -s "${CONTENTS[$i]}"`"

   # LET'S CHECK THE RESULT
   if [[ $RESULT1 =~ "OK" ]]
   then
      # IF THE RESULT WAS OK THEN LOG THE RESULT
      echo "$(date) OK ${IPS[$i]} ${HOSTS[$i]} $RESULT1" >> $LOG
   else
      # IF THE RESULT WAS BAD LET'S DO MORE, FIRST WAIT A LITTLE
      sleep 10
      # LOG THAT WE FAILED THE FIRST CHECK
      echo "$(date) FAIL ${IPS[$i]} ${HOSTS[$i]} $RESULT1" >> $LOG
      # ATTEMPT A SECOND HTTP CALL AND SEE WHAT WE GET
      RESULT2="`${CHECK_HTTP} -I ${IPS[$i]} -H ${HOSTS[$i]} -u ${URLS[$i]} -s "${CONTENTS[$i]}"`"
      # LET'S CHECK THE RESULT
      if [[ $RESULT2 =~ "OK" ]]
      then
         # IF THE RETRY WAS OK THEN LOG THE RESULT
         echo "$(date) RETRY OK ${IPS[$i]} ${HOSTS[$i]} $RESULT2" >> $LOG
      else
         # IF THE RETRY WAS BAD, LOG AND EMAIL
         echo "FAIL RETRY ${IPS[$i]} ${HOSTS[$i]} $RESULT2" | $SENDMAIL -s "$MAIL_SUBJECT" $MAIL_TO
         echo "$(date) RETRY FAIL ${IPS[$i]} ${HOSTS[$i]} $RESULT2" >> $LOG
         # LET'S TEST AN ICMP CALL TO THE WEBSERVER TO VALIDATE IT'S NOT A NETWORK ISSUE
         PING1="`${CHECK_FPING} ${IPS[$i]} -n 3`"
         # LET'S CHECK THE RESULT
         if [[ $PING1 =~ "OK" ]]
         then
            # IF THE PING WAS OK THEN LOG THE RESULT
            echo "$(date) OK ${IPS[$i]} ${HOSTS[$i]} $PING1" >> $LOG
         else
            # IF THE PING FAILED, LOG, EMAIL AND COLLECT A TRACEROUTE
            echo "FAIL RETRY PING ${IPS[$i]} ${HOSTS[$i]} $PING1" | $SENDMAIL -s "$MAIL_SUBJECT" $MAIL_TO
            echo "$(date) FAIL ${IPS[$i]} ${HOSTS[$i]} $PING1" >> $LOG
            ${MTR} --no-dns -rc 5 ${IPS[$i]} >> $LOG
         fi
      fi
   fi
done

# LET'S WAIT FOR A FEW SECONDS
sleep 10

# LET'S CLEAN UP AND REMOVE THE LOCKFILE
rm -f ${LOCKFILE}

# THAT'S ALL FOLKS!
exit 0
Cheers!
Note: This is one of a series of posts made under the Network Engineer in Retail 30 Days of Peak; this is post number 29 of 30. All the posts can be viewed from the 30in30 tag.
Image Credit: michele de notaristefani