Mike Melvill, born in KZN, South Africa, is the ultimate test pilot. He recently became the first commercial astronaut when he flew SpaceShipOne above 100 km on 21 June 2004, reaching the edge of space without government support. Melvill has a dangerous occupation, but he has survived mainly because of his perfectionism in using the pilot's most basic tool, the checklist. The checklist has important parallels in Information Technology.

Another of the ultimate test pilots is Chuck Yeager. "Chuck Yeager is a pilot of unsurpassed skill and determination," said Mike Melvill. "I've met General Yeager several times, and hold him in very high esteem." Yeager’s flying experience before, during and after the war honed his skills to absolute precision. Twelve air victories, including five in one day, were an indication of his piloting ability. But there was also a willingness to push himself to the edge of his limitations while still maintaining coolness under pressure.

These ultimate test pilots are a model for any technologist, and the pilot's checklist is an important tool. Thus, in the spirit of Melvill and Yeager, who are experts in the use of checklists, here is a network troubleshooting checklist:
- Assumptions! What is really wrong? Is the network being blamed for something else? Fully describe and detail the issue. The mere act of writing it down often clarifies matters.
- Kick the tyres. As is the case in the eyeball blog entry, do a visual inspection. I once went to a factory where there was a problem. Upon inspection, the network equipment was covered in pigeon poo! The chassis had rusted and the PCBs were being affected by the stuff. No wonder there was a problem. Another example involved radio links, where alignment errors are difficult to troubleshoot remotely. (I can recall when a heavy storm blew some radio links out of alignment. Until we climbed onto the roof we never realised how strong the wind really was that day!)
- Cabling. Is the cable actually plugged in? Is it plugged into the correct location? Wear and tear on cabling cannot be discounted either. As a minimum, invest in a tester like the LinkRunner or even the NetTool. Check for power cable runs that are in parallel to network cables. Check for dust on fibre optic connectors.
- Check the auto negotiation settings. Many problems are the result of a misconfigured switch or host setting, typically one end hard-set and the other left on auto. Tip: Auto is best, on both ends of the link!
- Check the network drivers. Most of the network drivers that ship with the operating system are not optimal! Visit the NIC (Network Interface Card) manufacturer's website and update.
- Walk through the configuration. Are the IP addresses correct? Are the subnets correct? Is the right VLAN being used? Is the gateway correct?
- Changes. Compare and determine differences. Firewall rule changes are often candidate changes for review. (And don't discount desktop firewalls!)
  - What: conditions, activity, equipment.
  - When: schedule, occurrence, status.
  - Where: local, environment.
  - How: practice, actions, procedures.
  - Who: personnel, supervision.
- Review the network documentation. Is what is written there reflected in reality?
- Power! Refer to the power blog entry. Network equipment often does not start up correctly after a power outage, or is adversely affected by brownouts.
- Refer to those Release Notes. Somewhere in the world someone has had the same problem as you. Download and read the latest release notes for your network equipment. As an example, here are notes about Cisco router crashes.
- Black holes. It is amazing how common black holes really are in networks, and it is usually down to incorrect MTU settings. Use this guide from Microsoft to help locate the issue. I can recall a mad day of scrambling around attempting to troubleshoot network connectivity issues before finally narrowing it down to a WAN compression device from Expand that was messing with the MTU. Be sure to check the MTU on all the appliances and network devices along the communications path. As more tunnelled networks are deployed and MPLS becomes more prevalent, this issue will occur more often. (I was phoned only yesterday by a pal, Doep, who had the issue on one of his customer's networks between some old 3Com kit and a Cisco WAN. Everyone had gone down the wrong path in trying to troubleshoot the issue before I suggested he check the MTU. Voila! Problem solved.)
- Sniff free. Wireshark's powerful features make it the tool of choice for network troubleshooting. Load the software and capture a copy of the packets involved in the problem. This forms the basis of any extended analysis.
- Are the routing tables correct? "show ip route"
- Is the bandwidth being saturated? FTP and email are bandwidth killers and the usual suspects.
- Spanning tree. Spanning tree must be set up in a deterministic fashion and not left at its defaults. Hubs in a switched network are disasters in waiting. Also make sure a techie hasn't left a SPAN port enabled and then reallocated it later.
- QoS settings. Have the correct bandwidth allocations been made, and are they consistent end to end?
- Buffers and peaks. Don't be caught out by the averages. A 20-minute average on a link graph will hide the small 5-second 100% utilization peak that is breaking everything.
- Are the security bunnies up to anything? Those vulnerability scans often cause more trouble than they are trying to prevent. Death by firing squad at dawn is the only punishment for those doing vulnerability scans across a WAN link.
- Service provider finger pointing. Never trust a carrier or service provider when their lips are moving.
- Name resolution. Is name resolution working correctly?
- Complexity. Network engineers often try to show the worth of their big pay packages by designing complex networks. The true worth of a good design shows when it is normalized and taken down to its simplest form. A simple network is less likely to go tits up.
- Broadcasts. In many cases too many nodes are installed into a single VLAN or broadcast domain. Has the LAN been correctly structured and designed?
- Pre-empt the issue. Fundamentally, this requires a good network configuration management tool and continuous reviews. Is this being done proactively?
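
To put numbers on the "Buffers and peaks" item above, here is a minimal Python sketch. The per-second samples are invented for illustration (a link idling at 10% with a single 5-second burst at 100%), and the function name `summarize` is hypothetical rather than taken from any monitoring tool.

```python
def summarize(samples):
    """Return (average, peak) utilization for a list of per-second percentages."""
    average = sum(samples) / len(samples)
    peak = max(samples)
    return average, peak


# Assumed data: 20 minutes (1200 seconds) of a link sitting at 10% utilization...
samples = [10.0] * 1200
# ...with one 5-second burst at 100%, starting at the 10-minute mark.
samples[600:605] = [100.0] * 5

average, peak = summarize(samples)
print(f"20-minute average: {average:.3f}%")
print(f"Peak 1-second sample: {peak:.0f}%")
```

With these assumed numbers the 20-minute average comes out at roughly 10.4%, which looks perfectly healthy on a graph, while the link was in fact saturated for five seconds. Polling at a finer interval, or graphing the maximum alongside the average, is what exposes the burst.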

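The arithmetic behind the "Black holes" item can be sketched in the same way. The overheads below are the standard header sizes for IPv4 without options (20 bytes), an ICMP echo header (8 bytes) and a basic GRE header (4 bytes). The function names and the example tunnel are hypothetical, and a real path may carry other encapsulations (MPLS labels, IPsec, compression appliances) that shrink the usable MTU further.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo request header
GRE_HEADER = 4    # bytes, basic GRE header without key or sequence fields


def effective_mtu(link_mtu, overhead_bytes):
    """MTU left for the inner packet once encapsulation overhead is removed."""
    return link_mtu - overhead_bytes


def largest_ping_payload(path_mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return path_mtu - IPV4_HEADER - ICMP_HEADER


ethernet_mtu = 1500
# Assumed example: a GRE tunnel adds an outer IPv4 header plus the GRE header.
gre_tunnel_mtu = effective_mtu(ethernet_mtu, IPV4_HEADER + GRE_HEADER)

print("Plain Ethernet payload:", largest_ping_payload(ethernet_mtu))
print("GRE tunnel MTU:", gre_tunnel_mtu)
print("GRE tunnel payload:", largest_ping_payload(gre_tunnel_mtu))
```

These payload sizes are what you would feed to a ping with the don't-fragment bit set (`ping -f -l <size>` on Windows, `ping -M do -s <size>` on Linux). If the largest payload that gets through is smaller than the figure you expect for the path, some device along the way is trimming the MTU.
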