Last Thursday, before the long Easter weekend, I was pre-configuring the ShoreGear switches for our client's offices so they could be shipped to their remote locations in the U.K. and be plug-and-play when they go live.
After I had finished setting up the switches (with the exception of one that needed an RMA), I was on my way home to begin my 3-day weekend, until I got a call from my boss.
"Are you doing anything on XYZ client's network right now?" I told him no, as I had been at our other client's site all afternoon. "Well, all the phones just rebooted," he said. I told him to call me back if they went down again. Before I got home, I did a bit of shopping and had not received a call back from my boss. Just to be on the safe side, I gave him a follow-up call, and not only were the phones still down, but the entire network was down too! This is bad, very bad, I thought, and I had a feeling my 3-day weekend was going to have to wait, so I went back on-site to see what was going on. Part of my reasoning for going back was that I had been involved in a change the night before to set up a new VLAN for a sister company that was using the client's existing infrastructure.
My boss lets me in the door, and I see he's already logged into the HP 5406 switch, checking the logs and configuration. After a good hour or so of dead-end troubleshooting, these were our findings:
- Access to the Net - FAILED
- PINGs to inside IP on WAN router from Desktop VLAN - FAILED
- PINGs to outside interface from Internet (tested using 3G connection) - PASS
- PING from Core Switch management VLAN IP to IPs assigned to VLAN interfaces - PASS
- PING from core switch to IP phones - FAILED
- PING from any VLAN to any other VLAN - FAILED
So, basically, a PING from outside the LAN passes, but communication between VLANs fails. This really made us believe there was an inter-VLAN routing issue, and I started to think it was caused by the changes I had made the night before. But let's recap: what happened between last night and today?
- 9PM - Night Before:
- Added the new VLAN; tagged the fiber ports and untagged the edge ports for users
- Next Day:
- Network humming smoothly for over 20 hours with no hiccups
- 4:30PM: Network goes completely KAPUT and stays down
What changed in that time? Why did everything just cease to function at the end of the day? Was it a possible reboot of the switch? If it was, I did a write mem the night before, so the config would have been preserved in NVRAM. Another hour passed, and some of the actual IT guys who work for this client showed up and tried to help, but truthfully we all just scratched our heads together. Eventually, I saw a ShoreTel IP phone on a user's desk displaying their name, with the speaker light lit. It appeared to be working, until I took it off-hook and heard no dial tone, and the LCD displayed "No service for 10.x.x.x"; all the phones did this every minute, even if you didn't take one off-hook. Then the little gears started to turn in my head.
The Desktop VLAN, where workstations and phones reside, is different from the VLAN used for the phone system, so how did the phone get registered with the user's name if communication between VLANs was broken? We took a different approach: we assumed the configuration on the core switch had not changed and started looking for a flapping fiber or other connection that would cause intermittent connectivity, dropping the phones every so often. The four of us began brainstorming possibilities: "bad fiber link", "faulty switch", "bad UPS", "loop", "DHCP overlap with Linksys device", etc. My boss and I headed down to one of the floors to physically inspect a switch in the access layer. When we got there, we knew something was definitely wrong: the switch ports for users were not blinking... every single port was pegged solid green!
Without a second thought, we isolated the floor from the rest of the LAN by unplugging the Gigabit fiber links; we then saw IP phones on the other floors registering and giving dial tone. That told us the root cause was somewhere on this floor, so we started physically checking on top of and under each desk for unsupported network hardware (a hub, a switch from BestBuy, etc.) and any connections that could be causing a switching loop. We put in as much effort as we could, but then figured it would be easier to go back to the riser room and unplug each port until the link LEDs returned to normal. In one of the 5 modules (modules A-E) on the switch, Craig pulled the lucky cable and the LEDs started to blink again; we traced the cable back to the patch panel and recorded the drop (or wall jack) number. To our amazement, all the phones on the floor were alive again, except for one, which was plugged into two wall jacks at once.
Basic Network and VoIP background info
IP phones, whether Cisco, ShoreTel or Avaya, have a mini 2-port network switch built into them.
One port goes to the network, and the other connects to the LAN port on your PC, so the phone gives the PC access to the network. If you instead cable both the network port and the PC port on the back of the phone to wall jacks, you create a loop: broadcast frames get forwarded endlessly, and the MAC address table in the switch becomes unstable as the same source addresses keep showing up on different ports. This is exactly what happened. Root cause: a network loop caused a broadcast storm through the entire network.
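To make the loop behaviour concrete, here is a minimal Python sketch of broadcast flooding with no spanning tree or loop protection. It is a toy model, not how any real switch is implemented: the device names (`core`, `access`, `phone`, `pc`, `pc2`) and both topologies are made up for illustration, every device simply re-sends a broadcast out of every port except the one it arrived on, and each "step" is one hop.

```python
from collections import defaultdict


def flood(links, source, steps):
    """Return the number of broadcast frames in flight after each hop.

    links:  list of (device_a, device_b) tuples, one per cable.
    source: device that sends a single broadcast frame.
    steps:  number of hops to simulate.
    """
    # Track cables by index so two cables between the same pair of
    # devices (the looped phone) stay distinct.
    ports = defaultdict(list)
    for idx, (a, b) in enumerate(links):
        ports[a].append(idx)
        ports[b].append(idx)

    def far_end(idx, device):
        a, b = links[idx]
        return b if device == a else a

    # A frame in flight is (cable index, device it is travelling toward).
    in_flight = [(idx, far_end(idx, source)) for idx in ports[source]]
    history = [len(in_flight)]
    for _ in range(steps):
        next_hop = []
        for idx, device in in_flight:
            for out in ports[device]:
                if out != idx:  # flood out of every port except the ingress
                    next_hop.append((out, far_end(out, device)))
        in_flight = next_hop
        history.append(len(in_flight))
    return history


# Hypothetical healthy floor: the phone sits between the wall jack and the PC.
HEALTHY = [("core", "access"), ("access", "phone"),
           ("phone", "pc"), ("access", "pc2")]

# Hypothetical looped floor: both phone ports are patched into wall jacks.
LOOPED = [("core", "access"), ("access", "phone"),
          ("phone", "access"), ("access", "pc2")]

if __name__ == "__main__":
    print("healthy:", flood(HEALTHY, "pc2", 5))
    print("looped: ", flood(LOOPED, "pc2", 6))
```

In the healthy topology the single broadcast drains to zero frames after a few hops. In the looped one the count never reaches zero, because copies keep circulating through the phone in both directions and get re-flooded to every other port on each pass. On a real network every new broadcast adds to that pile, which is consistent with the solid-green port LEDs we saw.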
Some of you might be thinking, "But you had VLANs; I thought VLANs are supposed to eliminate these broadcast storms." Well, that is not so when you route VLANs in the core and span the same VLAN across ALL your switches, because a VLAN is one broadcast domain everywhere it reaches, and a storm in it hits every switch that carries it. Another very valid point that also crossed my mind: "WHAT ABOUT SPANNING-TREE? WASN'T THAT ENABLED?" The short answer is "No, it wasn't". The long answer is that it was disabled intentionally; it came down to a design decision (not mine) to disable it on all of our clients' switches because it has been known to interfere with the phone system.
But that's not the worst of it: this apparently is not the first time something like this has happened to the same client, at the same time of day; it has apparently happened a few times in the past too. When the night was over (we were there from 8PM-12AM) and we had resolved the problem, my boss decided to treat this incident as sabotage and notified the IT staff so they could open an investigation.
Planning forward, I did some research on how we can prevent this from happening again and came across a feature in newer HP switches called Loop-detection, which can be used independently of Spanning-Tree; we will be looking into this as a solution in the near future.
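As I understand the general idea behind this kind of loop detection, the switch sends a probe frame out of each edge port and shuts a port down if its own probe comes back in. The Python sketch below only models that idea; it is not HP's implementation. The function name, the port labels (`A1`, `A5`, `A7`) and the "shut down the sending port" policy are all assumptions for illustration.

```python
def find_looped_ports(edge_ports, external_bridges):
    """Return the set of edge ports a probe-based check would shut down.

    edge_ports:       port names, probed in the order given.
    external_bridges: pairs of ports joined by something outside the
                      switch, e.g. an IP phone patched into two jacks.
    """
    # For each bridged port, record the port its frames re-enter on.
    returns_on = {}
    for a, b in external_bridges:
        returns_on[a] = b
        returns_on[b] = a

    disabled = set()
    for port in edge_ports:
        rx = returns_on.get(port)
        # The probe only comes back if the far side of the loop is still up.
        if rx is not None and rx not in disabled:
            disabled.add(port)  # assumed policy: shut the sending port
    return disabled


if __name__ == "__main__":
    # Hypothetical floor switch: a phone bridges wall jacks on A5 and A7.
    print(find_looped_ports(["A1", "A5", "A7"], [("A5", "A7")]))
```

In this model only one side of the loop gets shut down, since disabling the first port already breaks the loop, and the rest of the floor stays up. That is the appeal over what we did by hand in the riser room.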