It’s a wonder how the odd and bizarre problems seem to find me. Straight from the front lines: I had a Motorola WS5100 (v3.3.5.0-002R) falling down at the most inopportune time of the retail calendar. The original problem appeared on December 17, and it returned last night to spoil the weekend.
I'm troubleshooting a @Motorola WS5100 v3.3.5.0-002R that is being plagued with frequent high CPU utilization and generating 25,000+ pps
— Michael McNamara (@mfMcNamara) December 18, 2014
While trying to understand the problem and come up with a solution, I wanted better visibility and alerting when the problem actually occurred. I didn’t want to incur the delay of users calling the help desk and the help desk calling me. Thankfully a SYSLOG message is recorded when an Access Port experiences a watchdog reset, so I had a log message; now I needed a way to alert on it.
That’s where I turned to swatch, a handy little utility that monitors log files for regular expressions and then takes whatever action is configured, such as ringing the console bell or sending an email message. I installed swatch with relative ease thanks to yum and then set out to configure it appropriately.
I created the following configuration file:
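For readers who have not used a log watcher before, the watch-and-act pattern can be approximated in a few lines of Python. This is only a sketch of the idea, not how swatch itself works internally (swatch is written in Perl). The function name `scan`, the numeric timestamps, and the choice to throttle per Access Port MAC address are all assumptions made for illustration. The first two sample messages are the ones quoted in the configuration file later in this post; the third is a made-up variant using the placeholder MAC of the ignored device.

```python
import re

# Hypothetical stand-ins for an "ignore" pattern and a "watch" pattern.
IGNORE = re.compile(r"00-A0-F8-ZZ-ZZ-ZZ")
WATCH = re.compile(r"readoption")
# Assumption: throttle alerts per Access Port MAC address.
AP_MAC = re.compile(r"AP (\S+) readoption")


def scan(lines, window=60):
    """Return (key, text) pairs that would raise an alert.

    `lines` is an iterable of (timestamp_seconds, text). At most one
    alert per key is emitted every `window` seconds, a rough analogue
    of rate-limiting repeated matches.
    """
    window_start = {}
    alerts = []
    for ts, text in lines:
        if IGNORE.search(text):
            continue  # known-bad device, never alert
        if not WATCH.search(text):
            continue  # not an event we care about
        match = AP_MAC.search(text)
        key = match.group(1) if match else text
        start = window_start.get(key)
        if start is not None and ts - start < window:
            continue  # already alerted for this key inside the window
        window_start[key] = ts
        alerts.append((key, text))
    return alerts


# Timestamps are illustrative offsets in seconds, not parsed from the text.
sample = [
    (0, "Dec 20 07:53:07 ACME-WLS1 %CC-6-APREADOPTREASON: AP 00-A0-F8-XX-XX-XX readoption reason: ColdBoot/Watchdog"),
    (18, "Dec 20 07:53:25 ACME-WLS1 %CC-6-APREADOPTREASON: AP 00-A0-F8-XX-XX-XX readoption reason: Link failed"),
    # Made-up line: the second message with the ignored device's MAC substituted.
    (20, "Dec 20 07:53:27 ACME-WLS1 %CC-6-APREADOPTREASON: AP 00-A0-F8-ZZ-ZZ-ZZ readoption reason: Link failed"),
]

alerts = scan(sample)
for key, text in alerts:
    print(f"ALERT [{key}] {text}")
```

Only the first message produces an alert here: the second falls inside the 60-second window for the same Access Port, and the third matches the ignore pattern.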
#/etc/swatchrc
# swatchrc - define regular expressions and generate alerts when matches are found in logs
#            daemon is started from /etc/cron.d/swatch
# Motorola AP300 - malfunctioning AP ignore events from this device
ignore /00-A0-F8-ZZ-ZZ-ZZ/
# Motorola WS5100 Access Port Adoption Errors Reboot/Watchdog events
#Dec 20 07:53:07 ACME-WLS1 %CC-6-APREADOPTREASON: AP 00-A0-F8-XX-XX-XX readoption reason: ColdBoot/Watchdog
#Dec 20 07:53:25 ACME-WLS1 %CC-6-APREADOPTREASON: AP 00-A0-F8-XX-XX-XX readoption reason: Link failed
# Let's look for the phrase readoption and we'll alert on that text
watchfor /readoption/
        exec "echo '$_' | mail swatch -s 'SWATCH: Motorola WS5100 Adoption Issue' "
        threshold track_by=$6,type=limit,count=1,seconds=60
        echo=red
        bell 5
#end
In the swatch configuration I used the mail alias swatch, so I edited the /etc/aliases file to make sure that the entire team would receive the alert:
#
#  Aliases in this file will NOT be expanded in the header from
#  Mail, but WILL be visible over networks or from /bin/mail.
#
#       >>>>>>>>>>      The program "newaliases" must be run after
#       >> NOTE >>      this file is updated for any changes to
#       >>>>>>>>>>      show through to sendmail.
#
# Basic system aliases -- these MUST be present.
mailer-daemon:  postmaster
postmaster:     root
# General redirections for pseudo accounts.
bin:            root
daemon:         root
adm:            root
...
...
...
swatch:         root,mike,john,dan,tom
If the problem is extremely important I’ll usually add the email-to-SMS gateway address for my provider. This way I’ll get both an email message and an SMS text message alerting me to the problem.
# Verizon SMS Text Messaging
123456789@vtext.com
# AT&T SMS Text Messaging
123456789@txt.att.net
# T-Mobile SMS Text Messaging
123456789@tmomail.net
# Sprint SMS Text Messaging
123456789@messaging.sprintpcs.com
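If several people need SMS alerts, the gateway addresses and the resulting alias entry can be generated rather than typed by hand. The following is a small Python sketch under stated assumptions: the gateway domains are copied from the list above, while the carrier keys and the helper names `sms_address` and `alias_line` are hypothetical and exist only for this illustration.

```python
# Gateway domains copied from the list above; the carrier keys are
# hypothetical labels chosen for this sketch.
SMS_GATEWAYS = {
    "verizon": "vtext.com",
    "att": "txt.att.net",
    "tmobile": "tmomail.net",
    "sprint": "messaging.sprintpcs.com",
}


def sms_address(number, carrier):
    """Build an email-to-SMS address from a phone number and carrier key."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if not digits:
        raise ValueError("phone number contains no digits")
    return f"{digits}@{SMS_GATEWAYS[carrier]}"


def alias_line(name, recipients):
    """Format one alias entry in the 'name: a,b,c' style used by the aliases file."""
    return f"{name}: {','.join(recipients)}"


# Placeholder number, same as the one used in the list above.
line = alias_line("swatch", ["root", "mike", sms_address("123-456-789", "verizon")])
print(line)
```

The printed line can then be pasted into the aliases file in place of the existing swatch entry.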
I made sure to recompile the aliases file with the newaliases command and then I set off to run swatch in the foreground of my SSH session.
[root@centos /]# swatch -c /etc/swatchrc -p 'tail -f -n 0 /var/log/loc1fac17.log'
*** swatch version 3.2.3 (pid:30643) started at Sat Dec 20 15:54:14 EST 2014
And I waited for the event.
Now I could go about doing some research and due diligence without worrying that I might inadvertently fail to spot the problem.
I’ll let you know how it turned out!
Cheers!
Note: This is part of a series of posts made under the Network Engineer in Retail 30 Days of Peak; this is post number 26 of 30. All the posts can be viewed from the 30in30 tag.