As the dust settles in Wellington after a magnitude 7.8 earthquake, the city finds itself grappling with resilience. Many companies were caught without contingency plans, while those that had them were back in business rapidly. Business continuity planning is simple, and I want to give some tips to the people who will undoubtedly be landed with the task of creating these plans over the next few months.
I’ve worked on resilience with several large agencies over the years, building plans and testing recovery across a spectrum of disciplines. There are two immense challenges to creating a resilient company. First, resilience is seen as a compliance activity that is an overhead to the business. Second, some practitioners overcomplicate it so severely that the organisation becomes confused and simply gives up.
Continuity Planning, even for a large organisation, shouldn’t take more than around twelve weeks of business time. Of course, you must allow for some breaks due to people’s availability over that period.
The process itself is common sense and simple:
- Take an inventory of where you are
- Decide what is important and how fast you need it back
- Match that against your current state
- Put plans and technology in place to bridge the gap
- Exercise it
- Repeat as required
We use something called the Resilient Stack as a foundation for resilience planning.
It can be scaled down to a smaller company footprint as well.
The process looks something like this:
One of the most expensive, onerous, and error-laden parts of the process is business impact analysis.
Traditionally an army of consultants arrives to interview key stakeholders in your business on what they do, how critical those processes are, and how quickly they need to return those processes to normal operation.
Depending on the size of the organisation, it can take months, cost hundreds of thousands of dollars, and still not cover the entire range of your business.
We automated that process using an interactive tool that lets all employees and stakeholders in your company enter that information at their desk or from home.
The advantage is that you can cover the entire staff, so the quality of the result is far higher. You reach every corner of the organisation rather than a handful of key stakeholders. It’s fast: a large organisation can carry out impact analysis, from setup to completion, in about a month. It’s also repeatable, because you can run it again over time and see what has changed.
Because the process is automated, it also produces the draft continuity plans for you, which is otherwise another time-consuming task. The drafts are then validated by the stakeholders, and the job is mostly complete.
Two things happen at this point.
The first is that all the supporting parts of the company are mapped. In other words, as part of that automated process, you can see which business units rely on others to succeed. For example, everyone needs finance and IT. It drives out the critical supply chain internally and externally.
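To make the dependency mapping concrete, here is a minimal sketch in Python. It is not the tool described above; the business unit names and the `depends_on` data are invented for illustration, standing in for whatever the impact analysis collects.

```python
# Hypothetical survey results: each business unit lists the units it relies on.
depends_on = {
    "Claims": ["Finance", "IT"],
    "Customer Service": ["IT", "Claims"],
    "Finance": ["IT"],
    "IT": [],
}


def direct_dependents(depends_on):
    """Invert the map: for each unit, which units rely on it directly?"""
    result = {unit: set() for unit in depends_on}
    for unit, needs in depends_on.items():
        for needed in needs:
            result.setdefault(needed, set()).add(unit)
    return result


def all_dependencies(unit, depends_on):
    """Everything a unit needs to operate, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(unit, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, []))
    return seen


# Rank units by how many others rely on them; ties are broken by name.
ranking = sorted(
    direct_dependents(depends_on).items(),
    key=lambda item: (-len(item[1]), item[0]),
)

for unit, relied_on_by in ranking:
    print(f"{unit}: relied on by {len(relied_on_by)} unit(s)")
```

Units at the top of the ranking are the ones whose recovery the rest of the organisation is waiting on, which is a reasonable place to start the conversation about recovery priorities.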
Second, the IT group has all the requirements it needs to go away and work out whether services are adequately protected, and can come back to the business where that protection needs to be amended.
For IT, this can be a tricky proposition because they often don’t have standard service levels that they offer the business. The government developed something called the “Global Service Levels” about a decade ago (they came out of ACC), and they have slowly spread into more mature organisations as a method of providing very clear, very simple service levels.
The Platinum, Gold, Silver, and Bronze service levels each define a recovery time and the amount of data that can acceptably be lost. That makes it real for business stakeholders, and it also makes it far easier for IT to measure where services fall on that spectrum.
If anyone wants a copy of that template, then please drop me an email at firstname.lastname@example.org
Interestingly, as more and more companies move to Cloud, those service levels automatically increase. The days of building secondary data centres with multiple fail-over systems and engineering are fading. Public cloud is very persistent, so if you can connect to it, even over a home internet connection, you end up with increased resilience by default.
One of the fascinating observations from the Kaikoura quake is that while the phone lines very quickly stopped working due to loading, internet connectivity (at least in Wellington) didn’t. People immediately connected via the various social media tools.
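As a rough illustration of how tiered service levels let IT measure where a service falls, here is a small Python sketch. The hour values are placeholders I have made up for the example, not the figures from the Global Service Levels template, and the function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ServiceLevel:
    name: str
    recovery_time: timedelta  # longest acceptable time to restore the service
    data_loss: timedelta      # largest acceptable window of lost data


# Placeholder values for illustration only (assumed, not from the template).
LEVELS = [
    ServiceLevel("Platinum", timedelta(hours=1), timedelta(0)),
    ServiceLevel("Gold", timedelta(hours=4), timedelta(hours=1)),
    ServiceLevel("Silver", timedelta(hours=24), timedelta(hours=4)),
    ServiceLevel("Bronze", timedelta(hours=72), timedelta(hours=24)),
]


def meets(level, actual_recovery, actual_data_loss):
    """True if the measured capability satisfies both metrics of the level."""
    return (actual_recovery <= level.recovery_time
            and actual_data_loss <= level.data_loss)


def best_level(actual_recovery, actual_data_loss):
    """Highest level the measured capability satisfies, or None if none."""
    for level in LEVELS:  # ordered from strictest to most relaxed
        if meets(level, actual_recovery, actual_data_loss):
            return level.name
    return None


# A service that IT can restore in 3 hours, losing at most 30 minutes of data.
print(best_level(timedelta(hours=3), timedelta(minutes=30)))
```

If the business has asked for Platinum and the measured capability only reaches Gold, that difference is the gap IT brings back to the business.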
Communications during an event are fraught. In a large-scale event like the Kaikoura quake, National Radio kicked into action and then as time went by so did other agencies. Communicating with your staff is paramount.
There are good tools in the market that can help with this. One I know of is Solity. It works like this.
Once you have created your continuity plans, you find very quickly that they are a mere list of actions that staff undertake. The problem is that they need to be kept up to date, accessible, easily readable, and that can have a high overhead.
What Solity does is automate those plans.
Each staff member downloads the application to their phone, and the event co-ordinators have a dashboard they can monitor in real time. Any changes to the plan are pushed out without the staff having to do anything.
When the plan is activated, the staff are notified on their phones. The first question is “Are you ok?” Staff can respond with “yes” or “I’m in danger”. That information allows controllers to direct their efforts to vulnerable staff immediately; it can also be tied to geo-location so you can see where at-risk staff are.
The next question is “Are you available to help?”
Generally, in a large event, people’s first priority is to find friends, family, and pets. The answer gives the controller visibility of who is around.
Then the staff member’s plan is displayed, allowing them to follow simple step-by-step instructions, which they check off. As each step is ticked off, the controller can see progress across the organisation.
If plans need to be changed, or resources redirected, it can be done in real-time.
Even if the phone has patchy connectivity, data is synchronised as soon as the internet connection comes back.
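The activation flow described above can be sketched as a simple data model. To be clear, this is not Solity’s code or API, which I have not seen; every class, field, and name below is hypothetical, and the sketch only shows the kind of information a controller’s dashboard would need.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    NO_RESPONSE = "no response"
    OK = "ok"
    IN_DANGER = "in danger"


@dataclass
class StaffRecord:
    name: str
    steps: list                        # this person's plan steps, in order
    status: Status = Status.NO_RESPONSE
    available: Optional[bool] = None   # answer to "Are you available to help?"
    done: set = field(default_factory=set)

    def check_in(self, status, available=None):
        """Record the answers to the first two questions."""
        self.status = status
        self.available = available

    def tick(self, step_index):
        """Mark one plan step as completed."""
        if not 0 <= step_index < len(self.steps):
            raise IndexError("no such step in this plan")
        self.done.add(step_index)

    def progress(self):
        """Fraction of plan steps completed so far."""
        return len(self.done) / len(self.steps) if self.steps else 1.0


def needs_attention(records):
    """Staff in danger first, then staff who have not responded yet."""
    priority = {Status.IN_DANGER: 0, Status.NO_RESPONSE: 1}
    flagged = [r for r in records if r.status in priority]
    return sorted(flagged, key=lambda r: priority[r.status])


# Made-up staff and plan steps for illustration.
plan = ["Confirm you are safe", "Contact your team lead", "Log in from home"]
staff = [StaffRecord("Aroha", plan), StaffRecord("Ben", plan), StaffRecord("Mere", plan)]

staff[0].check_in(Status.OK, available=True)
staff[0].tick(0)
staff[2].check_in(Status.IN_DANGER)

for record in needs_attention(staff):
    print(record.name, "-", record.status.value)
```

The point of the sketch is how little data is involved: two answers and a checklist per person are enough to tell a controller where to look first.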
It’s an elegant and simple Cloud-based tool that provides a method of plan activation that just has not existed before, and I understand it can be bought on a per-user, software-as-a-service basis.
Resilience does not need to be complicated. It should be simple and based on common sense. The days of carrying around hundred-page continuity plans are over.
Tips on resilient planning:
- If it feels too complicated, it is.
- There are only ever four scenarios you need to plan for: loss of a service, loss of a floor, loss of a building, or loss of a region. Don’t get trapped in planning for multiple events. There are exceptions to this, but they are rare.
- The larger the Cloud service, the more resilience you have. Cloud provides excellent resilience.
- A continuity plan should probably not be more than two pages long for staff.
- Build a work-from-home capability. The whole world does this, yet we still have these old-fashioned notions that it is insecure and lowers productivity. Neither of these is true.
- Common sense trumps all.
- Don’t panic.