
In the mid-nineties I lived in Pretoria, in Hill Street, close to Loftus. I worked for Madge Networks as a Technical Account Manager and was fighting a rearguard action with Token Ring against Ethernet.
The Madge offices in South Africa were located in Rivonia. I would set off before 7 in the morning because in heavy traffic the journey could take up to an hour. (The situation is worse now, with even longer travel times.)
One morning I was driving in to work listening to a CD and switched to Radio 702 just before 7. The headline news was about major delays at Johannesburg International Airport (now called OR Tambo) caused by a computer problem. I had worked at the airport for SAA installing a large number of switches and CAUs and was very familiar with the environment. I was coming up to the Buccleuch interchange and decided to go left to the airport instead of right to the office. Maybe there was something I could do to help.
A few moments later my mobile rang. It was the SAA IT manager, who had heard the same news broadcast and asked me to meet him at the airport. He was as much in the dark about the situation as I was. The two of us arrived at the departure hall at much the same time. The crowd was huge and the queues were extremely long. We walked to the front counters to find out from the staff what the issue was. Many of the passengers threatened us, as they assumed we were jumping the queue, and told us to get back into the line. Having struggled our way to the front, we duly learned from the controller that there was no connectivity to the mainframe. Since my companion had walked over from the headquarters building, he could confirm it was not a general mainframe outage: the systems were all accessible and working from that building.
I had my laptop with me, an IBM Butterfly (BTW: best laptop I have ever owned!). I attached it to the network using my Smart token-ring adapter and loaded a good old DOS application called RingManager. I looked on in dismay as RingManager proceeded to tell me that there were well over 500 nodes on the single ring I had attached to (the maximum should not be more than 200), and then locked up solid. Being a typical techie and not believing it the first time, I did it a second time with the same results. I conferred with the SAA guys and told them I suspected that someone had looped the rings at the airport together. There was only one location where this was possible, and that was the main fibre patching room. We proceeded to the patch room, and at first inspection all was in order. However, when we started checking the patches, they did not match the diagram pasted on the wall next to the cabinet at all.
The security guards confirmed that a cable installer had signed in and worked in the room at 1am, connecting a new voice switch. We deduced that the patch cables were in his way, so he unpatched them all, put in his new fibre cables and then patched the old cables back at random. We started patching the fibres back to their correct positions using the diagram on the wall. The technician who pasted that diagram on the wall was a hero. Many times I have found patch rooms with no diagram at all. When it comes to resolving issues, they are crucial.
The airport network slowly started coming online as we worked through the patches and the departures were able to move from manual boarding to full ticket check-in and boarding.
After the crisis was over we had some strong cups of coffee. I asked the SAA guys why there had been no escalation, as the operators had a network management alert console. No one had the answer, so we decided to go over to the operations area to work out the reason. The operator was sitting in his office, blissfully unaware of the crisis. The network management console had a total of 11 million unacknowledged alerts. When questioned as to why there was no escalation, the guy said that no-one had phoned him. He only did anything when someone phoned him, and since no-one had phoned, there wasn't an issue. I never knew what happened to the poor bloke, but there were some pretty pissed off people ready to strangle him.
I have repeatedly seen change management done in isolation and not in an end-to-end fashion. In some companies each business unit does its own thing and does not consult any other stakeholders. Change management will fail and cause outages unless there is a transparent end-to-end process in which all stakeholders and their vendors within an organization have visibility of the activities.
Too often changes are handled on a need-to-know basis: I need to know, and I will tell you about it if I think you need to know. Murphy usually conspires to bring two changes together at the most inappropriate moment to cause absolute havoc.
Whenever I sit delayed in some airport, I often have the urge to go and ask the airport personnel who messed up the change management process, because that is the most probable reason.

I find it funny that if you ask an ITIL guy for a solution, it's usually "more process". Ask someone in IT and they'll say "fire the idiots" ;)
I suspect the real answer lies somewhere in the middle.
Sean