Diary of a Cisco guy: April 2011

Wednesday 27 April 2011

Network Blunders: Sabotage

It's late, so I'm going to try to get down as much as I can remember about the day it happened.

Last Thursday, before the long Easter weekend, I was pre-configuring the ShoreGear switches for our client's offices so they could be shipped to their remote locations in the U.K. and be Plug-N-Play when they go live.
After I had completely set up the switches (with the exception of one that needed an RMA), I was on my way home to begin my 3-day weekend, until I got a call from my boss.
"Are you doing anything on XYZ client's network right now?" I told him no, as I had been at our other client's site all afternoon. "Well, all the phones just did a reboot," he said. I told him to call me back if they went down again. Before I got home, I did a small bit of shopping and had not received a call back from my boss. Just to be on the safe side, I gave him a follow-up call, and not only were the phones still down, but the entire network was down now too! This is bad, very bad, I thought, and I had a feeling my 3-day weekend was going to have to wait, so I went back on-site to see what was going on. Part of my reasoning for going back was that I had been involved in a change the night before to set up a new VLAN for a sister company that was utilizing the client's existing infrastructure.

My boss lets me in the door, and I see he's already logged into the HP 5406 switch checking the logs and configuration. After a good hour or so of dead-end troubleshooting, these were our findings:

Access to the Net - FAILED
PINGs to inside IP on WAN router from Desktop VLAN - FAILED
PINGs to outside interface from Internet (tested using 3G connection) - PASS
PING from Core Switch management VLAN IP to IPs assigned to VLAN interfaces - PASS
PING from core switch to IP phones - FAILED
PING from any VLAN to any other VLAN - FAILED

So, basically, a PING from outside the LAN passed, but communication between VLANs failed. This really started to make us believe there was an inter-VLAN routing issue, and I started to think it was caused by the changes I had made the night before. But let's recap: what happened between last night and today?

9PM - Night Before:
Added the new VLAN, tagged fiber ports and untagged edge ports for users
Next Day:
Network humming smoothly for over 20 hours with no hiccups
4:30PM: Network goes completely KAPUT and stays down

What changed in that time? Why did everything just cease to function at the end of the day? Was it a possible reboot on the switch? If it was, I did a write mem the night before, so the config would have been maintained in NVRAM. Another hour passed, and some of the actual IT guys who work for this client showed up and tried to help, but truthfully we all just scratched our heads together. Eventually, I saw a ShoreTel IP phone on a user's desk displaying their name with the speaker light lit. It appeared to be working, until I took it off-hook, heard no dial tone, and saw the LCD display "No service for 10.x.x.x". All the phones did this every minute, even if you didn't take one off-hook. Then the little gears started to turn in my head.

The Desktop VLAN, where workstations and phones reside, is different from the VLAN used for the phone system, so how did the phone get registered with the user's name if communication between VLANs was broken? We took a different approach: we assumed the configuration had not changed on the core switch and started to look for a flapping fiber or other type of connection that would cause intermittent connectivity, dropping the phones every so often. The four of us began brainstorming possibilities: "bad fiber link", "faulty switch", "bad UPS", "loop", "DHCP overlap with Linksys device", etc. My boss and I headed down to one of the floors to physically inspect a switch in the access layer. When we got there, we knew something was definitely wrong: the switchports for users were not blinking... every single port was pegged at solid green!

Without a second thought, we isolated the floor from the rest of the LAN by unplugging the Gigabit fiber links, and we now saw IP phones on the other floors registering and giving dial tone. That told us the root cause was somewhere on this floor, so we started physically checking on top of and under each desk for unsupported network hardware (a hub, a switch from BestBuy, etc.) and any connections that could be causing a switching loop. We put in as much effort as we could, but then figured it would be easier to go back to the riser room and unplug each port until the link LEDs went back to normal. In one of the 5 modules (modules A-E) on the switch, Craig pulled the lucky cable and the LEDs started to blink again; we traced the cable back to the patch panel and recorded the drop (or wall jack) number. To our amazement, all the phones on the floor were alive again, except for one, which was looped into two jacks on the wall.

Basic Network and VoIP background info

IP phones, whether Cisco, ShoreTel or Avaya, have a mini 2-port network switch built into them.
One port goes to the network, and the other connects to the LAN port on your PC, so the phone gives the PC access to the network. If you instead plug both the network port and the PC port on the back of the phone into wall jacks, you create a loop: broadcast frames get forwarded endlessly, and the MAC address table in the switch keeps getting overwritten as the same source addresses show up on different ports. This is exactly what happened. Root cause: a network loop caused a broadcast storm through the entire network.
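To make the loop concrete, here is a minimal Python sketch of the flooding behaviour described above. It is a toy model under stated assumptions, not a capture of the real incident: the node names ("pc", "access_switch", "phone") are made up, every switch simply floods a broadcast out of every port except the one it arrived on, and Spanning-Tree is absent. Ethernet frames carry no hop limit, so nothing in the model ever discards a circulating copy.

```python
def simulate_broadcast(links, switches, source, max_rounds):
    """Return how many copies of one broadcast are in flight each round.

    links      -- list of (node_a, node_b) cables; a repeated pair is a second cable
    switches   -- nodes that flood broadcasts (everything else is an end host)
    source     -- the host that sends the single broadcast frame
    max_rounds -- stop simulating after this many forwarding rounds
    """
    # Build an adjacency list keyed by node: [(link_id, neighbor), ...]
    adjacency = {}
    for link_id, (a, b) in enumerate(links):
        adjacency.setdefault(a, []).append((link_id, b))
        adjacency.setdefault(b, []).append((link_id, a))

    # The source puts one copy of the frame on each of its cables.
    in_flight = list(adjacency[source])
    history = []
    for _ in range(max_rounds):
        history.append(len(in_flight))
        if not in_flight:
            break
        next_round = []
        for ingress_link, node in in_flight:
            if node not in switches:
                continue  # end hosts consume the frame and do not forward it
            # A switch floods a broadcast out of every port except the ingress.
            for link_id, neighbor in adjacency[node]:
                if link_id != ingress_link:
                    next_round.append((link_id, neighbor))
        in_flight = next_round
    return history


# Hypothetical names: the phone counts as a switch because of its built-in 2-port switch.
SWITCHES = {"access_switch", "phone"}

# Normal cabling: one wall jack feeds the phone.
healthy = [("pc", "access_switch"), ("access_switch", "phone")]

# The incident: both phone ports patched into wall jacks on the same switch.
looped = healthy + [("access_switch", "phone")]

print(simulate_broadcast(healthy, SWITCHES, "pc", 8))
print(simulate_broadcast(looped, SWITCHES, "pc", 8))
```

With the normal cabling the single broadcast drains to zero copies within a couple of rounds. With the second cable in place the count never reaches zero, and that is from one frame sent by one host; on a real network every host keeps adding new broadcasts on top.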

Some of you might be thinking, "But you had VLANs; I thought VLANs are supposed to eliminate these broadcast storms." Well, that is not so when you route VLANs in the core and span the same VLAN across ALL your switches. Another very valid point also crossed my mind: "WHAT ABOUT SPANNING-TREE? WASN'T THAT ENABLED?" The short answer is "No, it wasn't". The long answer is that it was intentionally left disabled; it came down to a design decision (not mine) to disable it on all of our clients' switches because it has been known to interfere with the phone system.

But that's not the worst of it. This apparently is not the first time something like this has happened for the same client, at the same time of day; it has apparently happened a few times in the past too. When the night was over (we were there from 8PM-12AM) and we had resolved the problem, my boss decided to rule this incident as sabotage and notified the IT staff so they could open an investigation.

Planning forward, I did some research on ways to prevent this from happening again and stumbled across a feature in newer HP switches called Loop-detection, which can be used independently of Spanning-Tree; we will be looking into this as a solution in the near future.
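The general idea behind a standalone loop detector can be sketched in a few lines of Python. This is an assumption about how such a feature works in principle, not HP's actual implementation: the switch sends a probe out of each protected port, and if its own probe comes back on another port, it shuts down the port that sent it. The port names (A1, A5, A9) and the `probe_returns_on` mapping are hypothetical.

```python
def ports_to_disable(probe_returns_on, protected_ports):
    """Pick the ports a probe-based loop detector would shut down.

    probe_returns_on -- maps a port to the port its own probe comes back on
                        (ports missing from the mapping never see their probe again)
    protected_ports  -- ports checked in order, one probe each
    """
    disabled = set()
    for port in protected_ports:
        if port in disabled:
            continue
        return_port = probe_returns_on.get(port)
        # A probe can only come back if the port it returns on is still up.
        if return_port is not None and return_port not in disabled:
            disabled.add(port)
    return disabled


# Hypothetical cabling: a phone patched into jacks that land on ports A5 and A9.
looped_phone = {"A5": "A9", "A9": "A5"}
print(ports_to_disable(looped_phone, ["A1", "A5", "A9"]))
```

Shutting down one of the two ports is enough to break the loop, which is why the sketch leaves A9 up once A5 has been disabled.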

Tuesday 26 April 2011

I've been slacking!

Sorry all, I've been swamped with things for weeks! I promise as soon as I have a day to just do nothing but stare at a wall (or screen) all day I will post all the news I have backlogged!

Some things to expect:

Results from the graveyard shift I did a while back
Huge catastrophic network outage before Easter weekend
Integrating ShoreTel with Microsoft's Office Communications Server (OCS) 2007
ShoreTel IP8000 conference phone (SIP)
Configuring an HP switch for VLANs and "VLAN tagging"
Writing a VLAN Access Control List (VACL) on an HP switch - not as easy as ACLs in Cisco!
DHCP Relay - How to hand out DHCP addresses from a single server to multiple VLANs
Example configs of a small subnet I recently set up with a complex VACL!

Hopefully I will have the majority of this written by the weekend. Stay tuned!

Thursday 7 April 2011

Network Blunders: ACL flub

Alright, this was the plan:

Translate connections outside on TCP port 3392 to port 3389 (RDP) on one of the inside hosts
Configure Access List on the WAN interface to only allow this connection from our office

Seemed simple enough. I did something similar a week back for a different client and had no issues, but this time... I broke something.

I went to paste the Access List Entries (ACEs) I had prepared in Notepad. Each entry contained "line" followed by a number, but I realized the "line" keyword was unrecognized on this device. So I figured that since I couldn't use "line", I'd have to modify the ACL the old-fashioned way, by removing the whole ACL with the "no access-list" command and pasting my new ACL without the "line" keyword. This did not go as planned at all. When I pasted the ACL into the config terminal, it stopped right at the beginning. I thought "ooh...fudge." My PuTTY session just hung there and I could no longer PING or access the router.

I knew exactly what had happened: the ACL was still applied to the WAN interface, so the router immediately started denying any traffic that did not match the few lines I had pasted. I tried not to look panicked, but I was freaking out inside. I told my manager that I had lost connection with the site, and he gave me the O.K. to run like the wind and go onsite to fix the mess I had caused. I arrived at the site relieved to find that nobody had noticed the impact; they could still access resources on the LAN and the internet. However, anyone connected remotely by VPN or Terminal Services definitely noticed it. So even though the impact was minimal, we still had to restore the router to its original configuration from before the changes, which meant rebooting it and causing a temporary outage.
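The lockout comes down to how an ACL is evaluated: entries are checked top to bottom, the first match wins, and anything that matches nothing is dropped by the implicit deny at the end. Here is a minimal Python sketch of that rule. The addresses are made-up documentation ranges, and the two-entry list is only a stand-in for the real ACL, which is not shown in this post.

```python
from ipaddress import ip_address, ip_network


def acl_permits(entries, source_ip):
    """Evaluate a source address against ACL entries, first match wins."""
    address = ip_address(source_ip)
    for action, network in entries:
        if address in ip_network(network):
            return action == "permit"
    # Nothing matched: the implicit "deny any" at the end of every ACL applies.
    return False


# Hypothetical ACL as it was meant to look once fully pasted.
INTENDED_ACL = [
    ("permit", "198.51.100.0/24"),   # made-up range for remote VPN users
    ("permit", "203.0.113.25/32"),   # made-up address of the admin workstation
]

# What the router actually had when the paste stopped after the first entry.
HALF_PASTED_ACL = INTENDED_ACL[:1]

print(acl_permits(INTENDED_ACL, "203.0.113.25"))
print(acl_permits(HALF_PASTED_ACL, "203.0.113.25"))
```

The admin workstation is permitted by the full list but falls through to the implicit deny on the half-pasted one, which is the same trap as pasting an ACL line by line over the very connection that the later lines are supposed to permit.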

We were provided an outage window in the afternoon to re-do the change onsite, while everybody was at lunch. Before then, I spoke to one of our senior consultants and he told me what I did wrong. Apparently IOS does support line numbering, but not in the same way as the ASAs.

On the Cisco IOS software, the entries look like this:

1 permit 192.168.1.0, wildcard bits 0.0.0.255

2 permit 192.168.2.0, wildcard bits 0.0.0.255

3 permit 192.168.3.0, wildcard bits 0.0.0.255

4 permit 192.168.4.0, wildcard bits 0.0.0.255

On the Cisco ASA software it looks like this:

access-list mylist line 1 extended permit tcp any any eq http

access-list mylist line 2 extended permit tcp any any eq ftp

access-list mylist line 3 extended permit tcp any any eq telnet

Using this newly acquired knowledge, I performed another ACL change for a client and they remained up and running smoothly.
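The practical difference numbered entries make is that a new entry can be slotted into an existing ACL instead of deleting the whole list and pasting it back. A minimal Python sketch of that idea follows. It models the ACL as a plain dictionary, and the sequence numbers (10, 15, 20) and entries are hypothetical rather than taken from a real router.

```python
def insert_entry(acl, sequence, entry):
    """Return a copy of the ACL with one entry added at the given sequence number."""
    if sequence in acl:
        raise ValueError(f"sequence {sequence} is already in use")
    updated = dict(acl)
    updated[sequence] = entry
    # Entries are evaluated in ascending sequence order, so keep them sorted.
    return dict(sorted(updated.items()))


# Hypothetical existing ACL, keyed by sequence number.
existing = {
    10: "permit 192.168.1.0 0.0.0.255",
    20: "permit 192.168.2.0 0.0.0.255",
}

# Slot a new entry between the two without touching the existing ones.
changed = insert_entry(existing, 15, "permit 192.168.3.0 0.0.0.255")
for sequence, entry in changed.items():
    print(sequence, entry)
```

Because the existing entries are never removed, there is no moment at which the list applied to the interface is empty or half-built.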