If there is one thing I really enjoy, it is having fun with acronyms. I did make one and tried to get people to use it but to no avail. I think it was an example of self-WOFTAM (defined later). But as this is not a blog about fun with language, but more about Risk Management and other Random Comments, I'd like to talk about SPOFs.
SPOFs are closely related to that other topic I have covered, the Delegations of Risk Authority, if only because so frequently when someone says "we've accepted that risk", you can bet there is a SPOF in there that they were unable to justify correcting, did not know how to correct, or gave up on after too much push-back.
Too often there are invisible Single Points of Failure (SPOFs) across an environment. These occur naturally through constantly changing configurations, slowly introduced incompatibilities, and sometimes (too often) through short-cuts built into environments to keep the cost of an application within what was thought to be an acceptable level of spend.
The following tale is, in principle and flow, a true tale.
I sat with the Systems Architects and looked at the environment diagram they gave me, and asked them to explain how systems and applications were reached by our users, and how a critical outage could have lasted so long (almost a week). The conversation went something like:
"Well, the application is on this server, and since the users are in this building, we keep that server in the network room in the same building."Weren't these users customer facing?
Good so far.
Okay, where is the alternate server or backup system?
"We have a backup server in the network room in the other building."
What happens if this server room goes down? That's never happened, has it?
Sideways looks from person to person.
"Well, it did go down last year."
"We brought it back up."
Okay. How long was the system down?
"About a week. No, four days. We're pretty proud of that, it was only four days, and the users were able to keep working off-line during that time."
"Only some of them and they were able to take details on paper and ring the customers back. After the system came back."
It turned out that there was not one, but a number of Single Points of Failure (SPOFs) across the architecture, each waiting for that wonderful moment to manifest.
First, the primary server was in one location, and the backup server was in another. Backups were taken every night and transported to the backup site, with tapes cycled between the sites. All pretty standard (and now replaced by a link for online backup to a remote backup device), except that recovery of the application onto the backup server had never been tested.
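Untested restores are a classic hidden SPOF, and the test itself is straightforward to automate. As a minimal sketch (the function names and the scratch-restore approach are mine, not a description of the actual environment), a scheduled job could restore last night's backup into a scratch area and compare it file-by-file against the source:

```python
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_restore(source_dir: Path, restored_dir: Path) -> list:
    """Compare every file in the source tree against the restored copy.

    Returns a list of relative paths that are missing or differ;
    an empty list means the restore matches the source.
    """
    mismatches = []
    for src in source_dir.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = restored_dir / rel
        if not dst.is_file() or checksum(src) != checksum(dst):
            mismatches.append(str(rel))
    return mismatches
```

Even a crude comparison like this, run after each restore test, would have flagged the problem long before the outage did.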
So why wasn't the application just brought back up on the alternate server?
SPOF number two. Well, now it gets interesting. The primary server had a physical fault that required replacement hardware. Meanwhile, the backup copy of the application was loaded on a server that acted as the primary for a different application, with a different operating system release level and patch history, so its operating system configuration was incompatible. The "backup server" had not been kept up to date, and was missing key licenses for underlying software required by the application.
The first attempt to bring the backup server into production failed because the underlying operating system and supporting software were not up to date. Correcting that took a couple of days, including a rebuild of the backup server while ensuring compatibility with the primary application already on it. Meanwhile, the engineers working on the primary server discovered, once it was recovered, that the network connections to the alternative site would not allow users to pass through to it.
The list of SPOFs in this situation continued to grow, each needing to be worked through or around.
SPOF number three. The lack of adequate bandwidth required building a new set of IP pipes that would let authorised users access the backup from their primary location, without opening the application to access from any IP address. Not difficult, but time-consuming.
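The fix itself was network engineering, but the principle — restrict access by source network rather than opening the application to any address — can be sketched. The address ranges below are invented for illustration:

```python
import ipaddress

# Hypothetical allow-list: only the users' primary sites may reach the backup.
ALLOWED_NETWORKS = [ipaddress.ip_network(n)
                    for n in ("10.20.0.0/16", "10.30.4.0/24")]

def is_authorised(source_ip: str) -> bool:
    """True if the source address falls within an allowed site network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)
```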
Further adding to the problem was that, as each SPOF was worked around, the primary objective of recovering the application meant the SPOFs themselves were mentally abandoned. Maybe one day we will go back and look at them again, but for now, the only priority is system recovery.
This was a wake-up call. A string of SPOFs had almost crippled a key element of the business.
The identification of SPOFs is not easy, and is a project in itself. In addition, experience suggests that once all SPOFs have supposedly been identified, probably only 80% have actually been found. For months afterwards, expect someone to come to the project team or lead and say, "Um, I think we might have found another".
The list of SPOFs then needs to be reviewed, ideally by a combination of the technical people, architects and with input from users (for confirmation of criticality), for:
- Cost of remediation, and
- Interdependence (caused by or contributing to another SPOF).
From this, a plan for remediation can be developed, bearing in mind that such plans should start with the most critical and most interdependent SPOFs. Eventually, the break-point of cost against remediation benefit will be reached. However, it is not the role of the technical team to determine that cutoff. The cost of remediation needs to be determined, and a multi-phase, costed project plan developed. Multiple scenarios of levels of remediation at various price points should be provided, so that those who have the authority to approve spend and the authority to accept a level of residual risk have the information they need for decision-making.
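The ordering described above — most critical and most interdependent first, stopping at a spend break-point set by those with the authority to set it — can be sketched. The scoring scheme and the figures are assumptions for illustration, not a prescription:

```python
from dataclasses import dataclass, field

@dataclass
class Spof:
    name: str
    criticality: int   # 1 (low) .. 5 (critical), confirmed with users
    cost: float        # estimated remediation cost
    # Names of other SPOFs this one contributes to (interdependence).
    feeds: list = field(default_factory=list)

def remediation_order(spofs):
    """Most critical first; ties broken by interdependence."""
    return sorted(spofs, key=lambda s: (s.criticality, len(s.feeds)),
                  reverse=True)

def within_budget(ordered, budget):
    """Walk the ordered list until cumulative cost hits the break-point."""
    plan, total = [], 0.0
    for s in ordered:
        if total + s.cost > budget:
            break
        plan.append(s)
        total += s.cost
    return plan
```

The point of the sketch is the separation of duties: the technical team produces the ordered, costed list; the `budget` argument belongs to whoever holds the spend and risk-acceptance authority.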
Too often I've seen the "we've accepted that risk" type of response when considering specific SPOFs or elements of the plan. It is not the role of the technical team to accept those risks, but to communicate the risk and the cost of remediation to those with authority to accept that residual risk.
A note on cost.
It is almost impossible to remove all SPOFs. First, the costs become higher than the potential cost of a resulting event caused by a SPOF. Second, there are many SPOFs that, through analysis, will be seen to be of such low probability that the cost of remediation will probably far outweigh the cost of a mad-scramble to resolve the situation should that SPOF eventuate.
Special consideration should be given to SPOFs whose remediation carries a high cost (for marginal return) or a long time frame. For the higher-cost elements, it might be reasonable to identify human and process workarounds in the event of that specific SPOF failing. For the longer-duration elements, additional consideration should be given beyond a simple cost/benefit analysis.
For example, high capacity network links can take some weeks to be installed, and the lack of such links can turn a simple recovery effort into a Business Continuity and Disaster Recovery exercise.
Finally, the Future.
Even when the SPOF remediation plan has been approved, work has taken place, new equipment has been installed and tested, and the project has reported back to the steering committee that it has accomplished its objectives, within budget and within time (no comment), the management of SPOFs is not done.
As mentioned above, systems and environments evolve. It will not take long for divergences in system configurations to creep in, for levels of installed software to become out of synch between production and backup, or for model office environments to slip out of synch with the production environments that they are meant to replicate.
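Drift of this kind is easy to detect if you can export the installed-software inventory from each server (say, parsed from `dpkg -l` or `rpm -qa` output). A minimal sketch of the comparison, with made-up package names:

```python
def drift(primary: dict, backup: dict) -> dict:
    """Report packages whose versions differ between two servers.

    Each dict maps package name -> installed version. Returns
    name -> (primary_version, backup_version), with None where the
    package is missing on one side. Empty result means no drift.
    """
    report = {}
    for name in primary.keys() | backup.keys():
        p, b = primary.get(name), backup.get(name)
        if p != b:
            report[name] = (p, b)
    return report
```

Run on a schedule and alerted on, a comparison like this turns the annual review from archaeology into confirmation.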
A SPOF review on an annual basis will not be a WOFTAM, and will identify new SPOFs, or may result in a reassessment of the importance of a SPOF that was previously accepted.
(WOFTAM: "Waste of F*** Time And Money" - do feel free to use that, as it is one acronym that I've found myself muttering under my breath for years. Oh, and there is no copyright on it.)