Sunday 31 January 2010
They Slipped the Surly Bonds of Earth to Touch the Face of God
On 28 January 1986, the Space Shuttle Challenger disaster occurred. The news filtered through to us on television at Stellenbosch University, where I was starting my first year.
Labels: Space
Another excellent RCA blog
Discovered this blog today. It mentions the Airlink problems that have been occurring in South Africa.
The blog by Think Reliability is called Root Cause Analysis Info.
At Think Reliability they have a great tool to help with solving problems. This is an Excel template and very impressive! Definitely five pineapples in the next Blogs and stuff that rock awards.
Labels: RCA
Legacy radio
I downloaded VLC and listened to Springbok Radio Digital. In this neck of the Kalahari we never had TV until the late seventies. (Must remind the Meerkats to award some pineapples for Springbok Radio as one of the best legacy radio stations!) (BTW: Any other good legacy radio stations out there in the digital world?)
The service officially launches on 1 February 2010, but I was listening today!
Labels: Legacy Radio
Monday 25 January 2010
Remembering Madge Networks (Support without limits)
It's not about the technology, it is about the service and support.
Labels: Madge, token-ring, Videos
Sunday 24 January 2010
Blogs and stuff that rock - Early 2010 edition
My trusty clan of Meerkats reminded me that it has been a long time since we dished out pineapples in the Thinking problem management! ratings of blogs and stuff that rock. The ratings are reliant on the non-scientific scavenging of the Meerkats and the attitude of the cobras. Thus without any further fanfare, we call on Wickus Meerkat (recent star of that blockbuster Prawn 9) to open the box of pineapples and dish them out:
- ZA News (because here we can)
- Cisco IOS hints and tricks (still the most interesting Cisco blog around)
- Root Cause Analysis Blog (beats Crash investigations hands down)
- The IT Skeptic - A sceptical view of ITIL, CMDB and whatever catches my eye
- iBurst forum on mybroadband.co.za
- The Internet Economy (with Reuben Goldberg)
- Tomato Firmware (most trusted of software for my little olde Linksys WRT54GL)
- PacketLife.net Blog (excellent source of cheat sheets)
- The Daily WTF (the most humorous of IT blogs)
- DITY Newsletter (the most practical of blogs)
- Bad Astronomy Blog (the best technology blog)
- Mybroadband.co.za (best breaking news for broadband and telecommunications news in ZA)
- Aki Anastasiou's twitter stream (best technology journalist in ZA)
- iTraffic Monitor (very useful traffic metering application)
- AVG Free (Best of the free bunch for anti-virus)
- BraunsBlog (the best problem management blog around)
- RISKS Digest (it has been around a long time and is a great reference)
- The Hot Aisle (one of the better data center blogs around)
- Brad Reese on Cisco (the blog with the best pulse on networking and Cisco)
- DownThemAll! (if you only have one plug-in for Firefox, then this should be it!)
- Adventures in Open Source (always hoping that these guys will kick the Big Four out of their slumber)
- ZA Tech show (has its moments)
- Royal Pingdom (always a good read)
- Farmville (I have always wanted to farm potatoes)
- Afrigator (the best aggregator of ZA blogs)
- The Pineapple Society of Root Cause Analysis (it just happened by accident)
- TweetDeck (by far the best twitter tool)
- Storagebod's twitter stream (no nonsense insights into storage)
- Eyewitness News (better news than the traditional media)
- Richmark Sentinel (a new form of news)
- Keo.co.za (great distraction from IT)
- TECHCENTRAL (has potential)
- LinkedIn (Best business web site around)
- Mikrotik wiki (most useful for the low end)
- Skype (best messaging tool)
- BraunsBlog (a great addition to the problem management landscape)
If you think the Meerkats have been slapgat and want to nominate a blog or other stuff please use the comments tag below and supply a suitable motivation....
Budget exception form template
A template to use when an unbudgeted expenditure is incurred.
Labels: SOPs
Project Request Form
A project request form template that includes mining for business requirements.
Standard Operating Procedure (SOP) for IT Operations
Labels: SOPs
Cactus fifteen forty-nine
Wouldn't it be great if all IT major incidents could have a video time lapse?
Saturday 23 January 2010
Top content on TPM - 2009
The top content on this blog for the year 2009 is:
- Checklist for standby generator
- SOP (Standard Operating Procedure) Template
- Best Practice Network Design
- ITIL implementation checklist
- Lessons from Apollo 13 - working the problem
- ITIL process checklists
- The major incident process - (Magnum MIP)
- Checklist for HVAC
- Henry Ford (My Life and Work) at Project Gutenberg
- Checklist of Active Directory tasks
- Checklist for AHU (Air handling units) installation.
- Kepner-Tregoe: Houston, we have a problem!
- Major incident tsunami
- Checklist for problem solving
- Major Incident Draft Template and Consequence Analysis Report
- Network troubleshooting checklist
- Checklist of IT Metrics
- The Leaky VLANs myth?
- Checklist for network architecture documentation
- Microsoft's checklist for Infrastructure maturity
Grading the resources involved in major incidents
It is recommended that the resources involved in handling a major incident are graded as part of a continuous improvement program. This is a means of doing that grading.
The maximum possible score is 32 and the grading is calculated by totaling up the scores from the eight different areas and representing it as a percentage of the maximum. The eight areas are:
- Identification and business impact – have the resources correctly identified the major incident and described what happened in the correct level of detail? Has the correct impacted service been identified from the service catalogue? Was the business impact obtained or measured?
- Conditions – what were the business, IT or environmental conditions present during the incident, and did the resources describe these to a suitable level of detail?
- Expanded Incident Lifecycle – are all the times in the expanded incident lifecycle recorded and are they realistic? Were these recorded in the incident reference at the service desk?
- Resolution/Workaround – how suitable was the resolution, and was a workaround implemented to reduce the time the service was unavailable?
- Classification – have the resources correctly classified the impact to the company, and was the incident handled with the correct level of prioritization?
- Outage – have the resources recorded and classified the outage times correctly?
- Risk – has a suitable risk assessment of the service, asset and process been conducted?
- Escalations/Communications – did the resources escalate the incident, and was communication during the process suitable?
Each area scores a maximum of 4 points with a minimum of 0.
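The arithmetic above is simple enough to sketch in a few lines of Python. This is only an illustration: the area keys and the function name `grade_resources` are made up for the example, not part of any existing tool.

```python
# Hypothetical keys for the eight grading areas listed above.
AREAS = (
    "identification_and_business_impact",
    "conditions",
    "expanded_incident_lifecycle",
    "resolution_workaround",
    "classification",
    "outage",
    "risk",
    "escalations_communications",
)
MAX_PER_AREA = 4  # each area scores 0..4, so the maximum total is 32


def grade_resources(scores):
    """Return the grading as a percentage of the maximum possible score."""
    missing = [area for area in AREAS if area not in scores]
    if missing:
        raise ValueError(f"missing areas: {missing}")
    for area in AREAS:
        if not 0 <= scores[area] <= MAX_PER_AREA:
            raise ValueError(f"{area} must score between 0 and {MAX_PER_AREA}")
    total = sum(scores[area] for area in AREAS)
    return 100.0 * total / (MAX_PER_AREA * len(AREAS))
```

For instance, a team that scores 3 in every area totals 24 out of 32, which `grade_resources` reports as 75.0.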
Incident User Metric (How big was it really?)
The Incident User Metric (IUM) is a mechanism to measure incidents in an objective manner, which allows problem managers to classify them as minor, normal or major. Most incidents that affect a significant number of IT customers are potential major incidents. What constitutes a major incident and what does not? The key is in the IUM. After a large enough sample pool has been built (> 10 incidents) the average is calculated. A minor incident is one where the IUM is more than 40% below the norm. A major incident is one where the IUM is more than 40% above the norm. A normal incident is one that is within 40% of the norm.
This metric is determined in the following manner:
- What is the opportunity cost to the company of 1 minute's outage based on the effect on productivity? (or put another way, what is the total salary bill of the company for 1 minute?)
- What was the length of the outage?
- What percentage of the IT customer population was impacted?
- Is it a lesser multiplier? (Liability, scrutiny by management, internal process, company’s image).
- Length of outage * population impacted * opportunity cost * (multiplier) = INCIDENT USER METRIC.
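A minimal Python sketch of the formula and the classification follows. Several things here are my assumptions rather than part of the method as stated: the population impacted is passed as a fraction (0.5 for 50%), the multiplier defaults to 1 when no lesser multiplier applies, and "40% of the norm" is read as a band of 40% either side of the average. The function names are hypothetical.

```python
from statistics import mean


def incident_user_metric(outage_minutes, fraction_impacted,
                         cost_per_minute, multiplier=1.0):
    """Length of outage * population impacted * opportunity cost * multiplier."""
    return outage_minutes * fraction_impacted * cost_per_minute * multiplier


def classify_incident(ium, history, band=0.40, min_samples=10):
    """Classify an IUM against the average of previously recorded IUMs.

    Assumption: minor is more than `band` below the norm, major is more
    than `band` above it, and anything in between is normal.
    """
    if len(history) <= min_samples:
        raise ValueError("need more than 10 incidents before averaging")
    norm = mean(history)
    if ium < norm * (1 - band):
        return "minor"
    if ium > norm * (1 + band):
        return "major"
    return "normal"
```

With a history that averages 1000, a one-hour outage hitting half the customers at an opportunity cost of 100 per minute gives an IUM of 3000, which falls well above the band and is classified as major.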
Risk management as taught at Meerkat Manor - (CRAMM Lite - the "R" in "ROC" analysis) - (Step 2 in BMX)
Meerkat Manor is a British television series made by Oxford Scientific Films for Animal Planet. It presents the daily activities of a clan of wild meerkats as part of the Kalahari Meerkat Project. Although the narrative is presented in a style similar to a soap opera, it is documentary, featuring actual events in the lives of the meerkats. Meerkats are one of the more risk aware animals. One or more meerkats stand sentry (lookout) while others are foraging or playing, to warn them of approaching dangers. When a predator is spotted, the meerkat performing as sentry gives a warning bark, and other members of the gang will run and hide in one of the many bolt holes they have spread across their territory. The sentry meerkat is the first to reappear from the burrow and search for predators, constantly barking to keep the others underground. If there is no threat, the sentry meerkat stops signaling and the others feel safe to emerge. Thus in the spirit of the Meerkats I present CRAMM Lite.
CRAMM provides a staged and disciplined approach embracing both technical (e.g. IT hardware and software) and non-technical (e.g. physical and human) aspects of security. In order to assess these components, CRAMM is divided into three stages:
- Asset identification and valuation
- Threat and vulnerability assessment
- Countermeasure selection and recommendation
The full blown CRAMM methodology is too cumbersome to use for ad-hoc assessments such as those encountered in major incident reporting or small projects. CRAMM Lite forms part of a greater impact management framework called ROC (Risk, Outage and Classification). In CRAMM Lite the asset, process or resources involved are measured from a risk perspective. Three areas are assessed. Each area has a maximum score of 4 and the grading is the score of all areas represented as a percentage.
- Impact - CIA(Confidentiality, integrity and availability) are scored.
- Vulnerability - Loss(C), error(I) and failure(A) are scored.
- Counter measures - Countermeasures already in place and those that will be implemented in the future are scored.
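As a rough sketch, the three CRAMM Lite areas can be held in a small Python record that validates the 0 to 4 range and reports the percentage. The class and field names are my own invention for illustration.

```python
from dataclasses import dataclass

MAX_AREA_SCORE = 4  # each of the three areas scores 0..4


@dataclass(frozen=True)
class CrammLiteAssessment:
    impact: int           # confidentiality, integrity and availability
    vulnerability: int    # loss (C), error (I) and failure (A)
    countermeasures: int  # in place now or to be implemented

    def __post_init__(self):
        for name in ("impact", "vulnerability", "countermeasures"):
            value = getattr(self, name)
            if not 0 <= value <= MAX_AREA_SCORE:
                raise ValueError(f"{name} must be between 0 and {MAX_AREA_SCORE}")

    def percentage(self) -> float:
        """Score of all areas as a percentage of the maximum (3 x 4 = 12)."""
        total = self.impact + self.vulnerability + self.countermeasures
        return 100.0 * total / (3 * MAX_AREA_SCORE)
```

An assessment with impact 3, vulnerability 2 and countermeasures 1 totals 6 out of 12, so `percentage()` returns 50.0.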
Example:
- The impact is rated as 4 – Critical – Confidentiality = Secure, Integrity = Very high, Availability = Mandatory.
- The vulnerability is rated as 4 – High loss probability, High error probability, High failure probability.
- The countermeasures are rated as 2 – Service provider due diligence.
- The score is thus 10 out of a max of 12 = 83%.
SOA Lite (the "O" in "ROC" analysis)
An outage analysis is conducted of the service impacted. Two areas are assessed. Each area has a maximum score of 4 and service outage is the score of all areas represented as a percentage.
- Period - The measurement is based on elapsed time.
- Consequence - determined by financial means or business perceptions
Measurement scale
Service period classification
- (4) Critical - App, server, link (network or voice) unavailable for greater than 4 hours or degraded for greater than 1 day – negative business delivery for more than 1 month.
- (3) Major - App, server, link (network or voice) unavailable for greater than 1 hour or degraded for greater than 4 hours - negative business delivery for more than 1 week.
- (2) Moderate - App, server, link (network or voice) unavailable for greater than 30 minutes or degraded for greater than 1 hour - negative business delivery for more than 1 day.
- (1) Minor - App, server, link (network or voice) unavailable greater than 5 minutes or degraded for greater than 30 minutes - negative business delivery for more than 1 hour.
- (0) Low (default) - App, server, link (network or voice) unavailable for less than 5 minutes or degraded for less than 30 minutes - negative business delivery for less than 1 hour.
Service consequence outage classification
- (4) Critical - Financial loss, which puts a business unit in a critical position - greater than $10m or substantial loss of credibility or litigation or prosecution or fatality or disability.
- (3) Major - Financial loss which severely impacts the profitability of a business unit - greater than $1m or serious loss of credibility or sanction or impairment.
- (2) Moderate - Financial loss which impacts the profitability of the business unit, greater than $100k or embarrassment or reported to regulator or hospitalization.
- (1) Minor - Financial loss with a visible impact on profitability but no real effect, greater than $10k or some embarrassment or rule or process breaches or medical treatment.
- (0) Low (default) - Financial loss with no real effect, less than $10k or irritating or no legal or regulatory issue or no medical treatment.
Example:
- The period is rated as 3 - Major - App, server, link (network or voice) unavailable for greater than 1 hour or degraded for greater than 4 hours.
- The consequence is rated as 2 - Moderate - Financial loss which impacts the profitability of the business unit, greater than $100k or embarrassment or reported to regulator or hospitalization.
- The score is thus 5 out of a max of 8 = 63%
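The service period scale lends itself to a lookup, sketched here in Python. This is a simplification under my own assumptions: only the unavailable and degraded durations are considered (the "negative business delivery" part of each level is left out), durations are in minutes, and the function names are hypothetical.

```python
# (score, unavailable for more than N minutes, degraded for more than N minutes)
PERIOD_THRESHOLDS = (
    (4, 4 * 60, 24 * 60),  # Critical: > 4 hours down or > 1 day degraded
    (3, 60, 4 * 60),       # Major: > 1 hour down or > 4 hours degraded
    (2, 30, 60),           # Moderate: > 30 minutes down or > 1 hour degraded
    (1, 5, 30),            # Minor: > 5 minutes down or > 30 minutes degraded
)


def period_score(unavailable_minutes=0, degraded_minutes=0):
    """Return the service period classification score (0..4)."""
    for score, unavailable_limit, degraded_limit in PERIOD_THRESHOLDS:
        if unavailable_minutes > unavailable_limit or degraded_minutes > degraded_limit:
            return score
    return 0  # Low (default)


def soa_lite_score(period, consequence):
    """Service outage as a percentage of the maximum (2 areas x 4 = 8)."""
    for name, value in (("period", period), ("consequence", consequence)):
        if not 0 <= value <= 4:
            raise ValueError(f"{name} must be between 0 and 4")
    return 100.0 * (period + consequence) / 8
```

A 90 minute outage gives a period score of 3, and with a consequence of 2 the result is 62.5, which is the 5 out of 8 from the example rounded to 63%.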
BIA Lite (the "C" in "ROC" analysis)
The resultant impact on the company is measured to determine the perspective of the IT customer. Five areas are assessed. Each area has a maximum score of 4 and the classification is the score of all areas represented as a percentage.
- Scope - Percentage of customers affected.
- Credibility - Internal and external negative consequences in the company.
- Operations - Business interference.
- Urgency - Time planning.
- Prioritization - Resource reaction.
Scope scoring
- (4) More than 50% of customers affected
- (3) More than 25% of customers affected
- (2) Less than 25% of customers affected*
- (1) Less than 1% of users affected
- (0) Single IT customer affected
Credibility scoring
- (4) Areas outside the company will be affected negatively
- (3) Company affected negatively
- (2) Multiple business units affected negatively
- (1) Single business units affected negatively
- (0) No credibility issue*
Operations scoring
- (4) Interferes with core business functions
- (3) Interferes with business activities*
- (2) Significant interference with completion of work
- (1) Some interference with normal completion of work
- (0) No work interference
Urgency scoring
- (4) Underway and could not be stopped
- (3) Caused by unscheduled change or maintenance
- (2) Incident caused by a change
- (1) Incident caused by scheduled maintenance
- (0) Completion time not important*
Prioritization scoring
Reviewing the scope, credibility, operations and urgency please classify the priority of the incident.
- (4) Critical - An immediate and sustained effort using all available resources until resolved. On-call procedures activated, vendor support invoked.
- (3) High - Technicians respond immediately, assess the situation, and may interrupt other staff working low or medium priority jobs for assistance.
- (2) Medium - Respond using standard procedures and operating within normal supervisory management structures.
- (1) Low - Respond using standard operating procedures as time allows. *
- (0) No prioritization
* - default score
Example:
- The scope is rated as 2 – less than 25% of customers affected.
- The credibility is rated as 4 – Areas outside the company will be affected negatively.
- The operations is rated as 2 - Significant interference with completion of work.
- The urgency is rated as 3 - Caused by unscheduled change or maintenance.
- The prioritization is rated as 3 – High - Technicians respond immediately, assess the situation, and may interrupt other staff working low or medium priority jobs for assistance.
- The score is thus 14 out of a max of 20 = 70%.
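Of the five areas, scope is the one that can be derived from a count, so here is a hedged Python sketch of that plus the overall percentage. The boundary handling is my assumption (exactly 25% scores 2, a single customer always scores 0), and the function names are hypothetical.

```python
def scope_score(affected, total):
    """Map the number of affected customers to the scope score (0..4)."""
    if total <= 0 or not 0 <= affected <= total:
        raise ValueError("affected must be between 0 and total, total above 0")
    if affected <= 1:
        return 0  # single IT customer affected
    share = affected / total
    if share > 0.50:
        return 4
    if share > 0.25:
        return 3
    if share >= 0.01:
        return 2  # less than 25% of customers (the default)
    return 1      # less than 1% of users


def bia_lite_score(scope, credibility, operations, urgency, prioritization):
    """Classification as a percentage of the maximum (5 areas x 4 = 20)."""
    scores = (scope, credibility, operations, urgency, prioritization)
    if any(not 0 <= s <= 4 for s in scores):
        raise ValueError("each area must score between 0 and 4")
    return 100.0 * sum(scores) / 20
```

Feeding in the example ratings, `bia_lite_score(2, 4, 2, 3, 3)` gives 70.0, matching the 14 out of 20.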
Meaning
After a large enough sample pool has been built the averages are calculated. Let's assume the average for classification is 70%. Depending on the calculated score for a specific major incident the following statements can be made:
- The incident affected the company less than usual.
- The incident affected the company the same as usual.
- The incident affected the company greater than usual.
ITIL described in under 1 minute
I like the tube analogy. (Note to self: must post my network tube map one day!)
Root cause is not about singular blame
Thus the following cannot be a root cause when you blame:
- the person
- the product
- the location
- the time
Invariably, it is how the product was used at that point in time by that person in that place. It is about the sticky stuff in between, the thorny stuff up top and the dark stuff underneath (the pineapple analogy). Root cause analysis is not linear but dimensional, in that it is typically a sequence of events that unfolds into an incident of negative consequence. Each of these events cascades a set of contributing causes, and addressing these causes might prevent the negative incident from occurring or alternatively trigger a different type of incident (positive or negative!). Heinrich called it the domino theory and it applies to IT as much as to any other discipline.
A root cause described in the singular never is a root cause! When it is claimed to be a people problem, then this is usually unlikely as that implies a singular cause.
Friday 22 January 2010
The major incident process - (Magnum MIP)
There is a close relationship between problem management and the major incident process. Dealing with these processes is much like conducting a private investigation, hence the naming of this methodology as Magnum MIP, after the TV series.
An incident is any event that is not part of the standard operation of a service and that causes an interruption or a reduction in the quality of that service. Incidents are recorded in a standardized system which is used for documenting and tracking outages and disruptions. A Major Incident is an unplanned or temporary interruption of service with severe negative consequences. Examples are outages involving core infrastructure equipment or services that affect a significant customer base, such as the isolation of a company site. Any equipment or service outage that does not meet the criteria necessary to qualify as a Major Incident is a Moderate, Minor or Normal Incident. Major incident reports are escalated to the problem manager for quality assurance.
Incident Pyramid
The scale of incidents follows an Incident Pyramid where most incidents are normal, escalating up to a singular Major Incident.


The major incident process, the Magnum MIP methodology, consists of the following components:
- Impact assessment: Classification, outage analysis and risk management.
- Measurement using the Incident User Metric (IUM).
- Grading the resources involved in a major incident.
Major incidents can be seen in the context of the Magnum MIP quadrant. In this quad major incidents are either failures or misses and result from undesired outcomes. The difference between a miss and a failure is that the former is associated with desired activity while the latter is not. Success is measured as having both desired activity and outcomes. In the context of problem management a lucky break, desired outcome associated with undesired activity, is not optimal and should be investigated.
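The quad boils down to two yes/no questions, which a tiny Python function can capture. The function name and the returned labels are just my shorthand for the four quadrants described above.

```python
def magnum_quadrant(desired_activity: bool, desired_outcome: bool) -> str:
    """Place an event in the Magnum MIP quad from activity and outcome."""
    if desired_outcome:
        # Good outcome: earned through desired activity, or merely lucky.
        return "success" if desired_activity else "lucky break"
    # Undesired outcome: a miss if the activity was desired, else a failure.
    return "miss" if desired_activity else "failure"


def needs_investigation(quadrant: str) -> bool:
    """Misses and failures are major incidents; lucky breaks also warrant a look."""
    return quadrant != "success"
```

So an undesired outcome that followed the desired activity comes back as a miss, and only a success escapes investigation.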

Iceberg
Incidents are a portion of activity in problem management that form the tip of an iceberg. The major incident process deals with the visible portion of the iceberg, while in the greater field of problem management a large number of non-visible issues are lurking.

Documents
An example major incident draft template is available here. An example incident consequence analysis template is available here. This can be used as the Major Incident Summary Report.
The following consolidated worksheet forms the engine room of the process and is used to produce the graphs in the reports.
Monday 04 January 2010
South African robots
In South Africa we call traffic lights, robots. Usually, there are beggars...
Labels: Photos
Sunday 03 January 2010
Social media: WTF!
A great and informative presentation on how to use social media.
What the F**K is Social Media: One Year Later
Friday 01 January 2010
Two for only one cent
For only one cent, you can have two pieces of bubblegum. Ah, the good old days. The best is the "Did you Know?" on the inside of the wrapper!
Listen to the advert in Afrikaans (via Springbok Radio):
Labels: Chappies, South Africa