It’s been a very interesting two days around the test lab we maintain in the organization where I work. Yesterday an issue was reported to me about a server, which happened to be the primary (and only) DNS/DHCP server, that wasn’t communicating with the test lab network. A few coworkers had already been looking at the problem for a few hours before they gave me a call.
It was a Windows 2003 Service Pack 2 server running on an HP DL580. The server had link with the Nortel Ethernet Routing Switch 5510 and had properly auto-negotiated to 1.0 Gbps. The server had an ARP table (arp -a) with numerous entries in it. However, the server was unable to communicate with or ping anything on the network: not the default gateway, not the other servers in the same VLAN, not the desktops in the same VLAN. A cursory review of the system logs on the server and switch didn’t reveal anything of interest. I didn’t believe it was the NIC, the patch cable, or the switch port, but we literally swapped everything, including the server itself, by taking the hard disks out of one box and throwing them into another. One of the engineers even went as far as deleting the NICs within Windows Device Manager and re-installing them after a reboot. Still the problem persisted. After I had looked at the problem for almost 60 minutes, a small horde of folks had assembled, all waiting for word on when the lab would be available. So I decided to take the path of least resistance: I asked the server team to re-image the server, and I would rebuild the DHCP/DNS configuration. And so the problem was solved.
Until I came into work today and was told of yet another server in the test lab acting identically to the server that we had re-imaged yesterday. I immediately became suspicious. Was there some Trojan loose on the test lab network? Was there some Microsoft security patch gone awry? Or was it something more sinister, like McAfee’s ePolicy Orchestrator (ePO) or Symantec’s Altiris? Antivirus sweeps and rootkit checks all came back negative, and by now we had yet another server experiencing the exact same issue. That meant we were up to 3 servers in 2 days, 1 of which had been restored by re-imaging and rebuilding the application configuration. Later in the afternoon one of our senior engineers took to the Microsoft Knowledge Base and Google in search of answers and, after noticing an interesting event in the system log,
Description: The IPSec driver has entered Block mode. IPSec will discard all inbound and outbound TCP/IP network traffic that is not permitted by boot-time IPSec Policy exemptions. User Action: To restore full unsecured TCP/IP connectivity, disable the IPSec services, and then restart the computer. For detailed troubleshooting information, review the events in the Security event log,
came across KB870910 on the Microsoft Support website. We issued the following command:
Click Start, click Run, type regsvr32 polstore.dll, and then click OK.
After a quick reboot we were back up and running again. We suspect that the local C: drive had filled up on the servers, corrupting the policy store, which in turn caused the IPSec service to enter ‘Block’ mode.
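Since a full C: drive is our suspected trigger, it might be worth checking free space on the lab servers from time to time. The following is only a rough sketch in Python, not something we actually ran: the function names and the 10% threshold are made up for illustration, and it assumes a machine with Python 3 available. It would not fix or detect a corrupt policy store; it only flags the low-disk condition we think preceded it.

```python
import shutil


def free_fraction(path):
    """Return the fraction (0.0 to 1.0) of the volume holding `path` that is free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Guard against odd volumes that report no capacity at all.
        return 0.0
    return usage.free / usage.total


def is_low_on_space(path, min_free_fraction=0.10):
    """Return True when free space drops below the (arbitrary) threshold."""
    return free_fraction(path) < min_free_fraction


if __name__ == "__main__":
    import sys

    # Default to the system drive discussed in the post; pass another path to override.
    path = sys.argv[1] if len(sys.argv) > 1 else "C:\\"
    pct = free_fraction(path) * 100
    if is_low_on_space(path):
        print(f"WARNING: {path} has only {pct:.1f}% free")
    else:
        print(f"OK: {path} has {pct:.1f}% free")
```

Run on a schedule, something like this could give a heads-up before a drive fills completely, assuming our disk-full theory is right.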
That was a great find Brian!
Even a blind squirrel finds a nut every once in a while.