What now for IT and change professionals now we are truly ‘working from home’?
Part 2 – Business Continuity Planning
In the first article in this series (see here), we looked in broad terms at the opportunities and challenges of enforced remote working during the Covid-19 lockdown. This article – part two in the series – explores in more detail some of the issues and possible solutions in relation to business continuity (BC).
What is business continuity?
This may seem a facile question, but in order to develop a business continuity strategy that is appropriate for their own organisations, business and IT leaders must ask themselves ‘what do I mean by business continuity?’
In its simplest form, the objective of a BC plan is to enable the business to continue to function and deliver its services to customers, when a significant challenge faces it, or to recover quickly from such a challenge. The most commonly considered threats are where an organisation’s physical premises are no longer usable or severely compromised – a power outage is probably the most common threat considered in the UK, followed closely behind by other material disasters such as fire and flood. Terrorist threats are often considered, particularly for organisations operating in high profile sectors such as banking or defence, or working out of a physical location that might be a target, such as the City of London.
In most plans we have seen or developed, the business seeks to continue to deliver a full set of services when a ‘disaster’ strikes, albeit usually with significant changes to how those services are delivered. So the first thing to do is to determine what those services are, and prioritise them. This is sometimes referred to as Business Impact Analysis.
Defining the services to maintain
So, for a housing provider for example, the key service of repairs needs to continue, as does that of rent collection and arrears management to ensure the organisation has the cashflow to continue. In this context, there will always be a hierarchy of priority, so it is important to define what are the key services and how much interruption in service can be tolerated.
So, responsive repairs and rent collection need to be up and running very quickly following a disaster, but non-core activities – say resident involvement and non-essential staff training – could be resumed after a longer acceptable delay. Grouping them by priority can be a good way to do this whilst keeping it simple – P1 (top priority services such as repairs and rent collection), P2 (medium priority e.g. new business functions for yet to be built schemes) and P3 (e.g. resident involvement).
Challenges such as Covid-19 present an unusual but instructive set of circumstances, because even if an organisation’s BC capabilities are very advanced, many services cannot be delivered as normal because of the constraints of social distancing. Most housing providers we know of are only carrying out emergency repairs at this time, but allowing customers to log other repair requests. So whilst the BC capabilities might enable full repairs fulfilment, the risks outweigh the benefits, and the job of fulfilling this backlog of repairs after the restrictions are lifted will be another task, of which more in a later article.
So a good plan should articulate what is the likely impact of the common threat vectors, and how they will affect what the organisation can and cannot do. For example, a power outage at Head Office means all IT and telephony systems based there will be down. If systems are cloud hosted, then the systems may still be available, but how do staff get access to them? If the disaster is a flood at Head Office, then it may be that visitors cannot be admitted or, if more serious, that access to the building is prevented completely, in which case delivering services will need to be done from alternative location(s).
In this way you can have the basis for generating a nuanced approach to resuming critical services most quickly and non-core activities at a speed that suits the circumstances.
Documenting how services are delivered
Having identified the services that are core, and non-core, the next step is to flesh out how they are delivered now.
The plan will need to consider how to marshal the human resources of the organisation, as well as the IT (a critical component for all companies) and all other necessary resources. Certain organisational functions in many organisations are still ‘paper heavy’ and in this scenario particularly, you will need to think about what additional arrangements will be needed to carry out the business process effectively. If the process has no paper trail, then it should be possible in most cases to construct alternative delivery models using IT solutions. If it is paper heavy, then consider what would happen if you do not have access to your office and how you might continue the process in that situation.
It is helpful to create an inventory of current services. This can be done as a simple matrix which shows, for each service that is to be maintained:
the current physical location(s) of it
the ICT technologies used
the physical assets needed, such as phones (and including the degree of paper dependency)
the staff who deliver it
the process owner
Having verified the information as outlined above for each service, with the process owners and other stakeholders such as HR, IT, Facilities and so on, you will then have a complete picture of how services are delivered at present.
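As a minimal sketch of how such a matrix might be held in a form that can be sorted and queried, the inventory could be modelled as a list of records. The class name, field names, priority labels and every sample value below are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceRecord:
    """One row of the service inventory matrix (field names are assumptions)."""
    name: str
    priority: str               # "P1", "P2" or "P3"
    locations: List[str]        # current physical location(s)
    ict_technologies: List[str]
    physical_assets: List[str]
    paper_dependency: str       # "none", "some" or "heavy"
    staff: List[str]
    process_owner: str


# Illustrative rows only; every value here is invented for the example.
inventory = [
    ServiceRecord("Resident involvement", "P3", ["Head Office"], ["Email"],
                  ["Meeting rooms"], "some", ["Engagement team"],
                  "Head of Communities"),
    ServiceRecord("Responsive repairs", "P1", ["Head Office"],
                  ["Housing management system", "Telephony"], ["Phones"],
                  "none", ["Contact centre", "Repairs planners"],
                  "Head of Repairs"),
    ServiceRecord("Rent collection", "P1", ["Head Office"],
                  ["Housing management system"], ["Phones", "Paper statements"],
                  "heavy", ["Income team"], "Head of Income"),
]


def by_priority(records):
    """Return records ordered P1 first; ties keep their original order."""
    return sorted(records, key=lambda r: r.priority)


def paper_heavy(records):
    """Names of services needing extra arrangements if the office is lost."""
    return [r.name for r in records if r.paper_dependency == "heavy"]
```

Calling by_priority(inventory) lists the P1 services first, and paper_heavy(inventory) picks out the processes that would need most thought if the office became inaccessible.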
Exploring the specific impact of threats on those services
One way to approach this is two-fold: top-down, followed by bottom-up.
It is useful here to return to the possible threat vectors such as pandemic, fire, IT outage, flood, major storm, cyber attack etc. Look at the risk of each of these in turn, assessing the likelihood of their occurring, together with the impact if they do, to arrive at a gross risk for each threat. So, for providers based in the UK, earthquakes have a very low probability of occurring, and whilst the impact might be high, the gross risk is low, so most plans do not consider this to be the chief threat to protect against. On the other hand, if your office is on low ground, and your data centre is in the basement, flood risk may have a much higher gross risk score.
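One common way to combine the two factors, and it is an assumption here rather than the only method, is to score likelihood and impact on a 1 to 5 scale and multiply them. The scores below are invented purely to illustrate the earthquake and flood examples; your own assessment will differ.

```python
# Hypothetical (likelihood, impact) scores on a 1-5 scale; values are
# invented for illustration, not taken from any real assessment.
threats = {
    "flood": (4, 5),          # low-lying office, data centre in basement
    "power outage": (4, 3),
    "fire": (2, 5),
    "cyber attack": (3, 3),
    "earthquake": (1, 5),     # high impact but very unlikely in the UK
}


def gross_risk(likelihood, impact):
    """Gross risk as likelihood multiplied by impact (an assumed formula)."""
    return likelihood * impact


# Threat names ordered from highest gross risk to lowest.
ranked = sorted(threats, key=lambda t: gross_risk(*threats[t]), reverse=True)
```

With these made-up scores, flood comes out on top and earthquake at the bottom, despite both having the maximum impact score.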
So, firstly, think about the top-down, or high level – the effect on the whole business. What if your Head Office, where all your IT systems and paper records reside, burns to the ground? Deep breaths required, I know, but you can’t plan for it effectively if you don’t consider the real possibility of it happening.
In this scenario, you would need a way to re-provision those services – people, IT and the other parts of the processes – from a different location or locations. You would also need to ensure backups of your data are as up to date as possible. And you would need to do this within a timeframe that is acceptable to the business.
Fundamentally that means you need a temporary disaster recovery (DR) location, or to be able to deliver services from another site or office if you have one that is suitable. Traditional BC plans for smaller organisations with no ‘second site’ were generally focussed on paying for a DR location and specialist BC provider that would host the IT systems and key staff for a period of a few months, whilst the organisation finds a new permanent location.
Larger organisations often preferred to host DR sites in another office site and then fail over to systems at that location if the main site became unavailable. One thing to consider as well as availability of locations here is technical skills. Provisioning and maintaining DR infrastructures is fairly complex, and some of the skills required to get systems up and running in a disaster are quite different from the business as usual (BAU) skills needed to manage a live infrastructure. Time is of the essence in DR, and if the ‘DIY’ approach is adopted, regular testing is arguably even more essential than if the DR service is to be provided by a specialist third party provider.
Increasingly however, organisations have some or all of their infrastructure in ‘the cloud’ (but rarely all, in the case of housing providers). If your ICT systems are not located in your offices, then providing business continuity can be easier. However, most housing providers have a number of systems – often including some of their core business systems such as the Housing Management solution – which are often legacy systems that are not suitable for cloud hosting and so are on-premise. So in such a hybrid situation, the BC plan will need to allow for an alternative to access to those on-premise systems too.
One other key area to consider is customer contact channels. Most housing providers still have telephone calls as a major, often the primary, channel. Does your telephony solution allow you to route calls easily to other locations, or is it also tied to hardware on-premise? What if you wish to enable a virtual team setup – can you manage contact centre queues staffed by people working from home in many different locations?
Taking this top-down approach allows you to consider what sort of infrastructure and capabilities will be needed if a major disaster strikes.
Having done this, before you start evaluating solutions, it is sensible to carry out a bottom-up evaluation too. So, considering each process in more detail, and starting with the key ones, what are the steps involved and what would be the implications of each major disaster? For the example of responsive repairs, you will need to address a number of key questions, such as:
How would customers contact you to report repairs?
How would those repairs get communicated to your operatives (if you have a DLO) or your contractors (if you don’t), or both?
How would you carry out activities like pre- and post-inspections?
How would you process invoices and pay your contractors?
This will help to flesh out the activities and spot where a particular capability may be required.
To take another example, approving payments to suppliers, do your senior staff who authorise payments have multifactor authentication devices? If so, are these devices tied to their own endpoints? If so, what if their endpoints have just gone up in flames? As with most problems, there will be multiple solutions – in extremis, you could probably authorise your bank to pay your suppliers what you paid them last week, to keep things going, but this is a short-term solution at best, and you will want to consider how this essential process can be kept going if the period before normal operations resume is prolonged.
Once the top-down and bottom-up analyses have been done, you can start exploring the possible solutions.
In all cases, you will want to consider your Recovery Point Objective (RPO) and Recovery Time Objective (RTO). The RPO measures how much data you’re willing to forgo, e.g. do you intend to recover to a copy of your data from the previous working day’s backup, a snapshot from two hours previously, or do you want an RPO of zero, i.e. no data loss? The RTO measures how long it will take to get back to normal, so an RTO of 8 hours for core / 24 for all others means effectively no business in the first 8 hours, then core services available from then onwards, and everything else 16 hours later.
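To make the two measures concrete, here is a small sketch using the 8 hour / 24 hour example above. The function names, the tier labels and the sample dates are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# RTO per service tier in hours, taken from the 8 / 24 example in the text.
RTO_HOURS = {"core": 8, "other": 24}


def services_available(hours_since_disaster):
    """Tiers that should be back, assuming each RTO is met exactly."""
    return [tier for tier, rto in RTO_HOURS.items()
            if hours_since_disaster >= rto]


def data_loss_window(last_recovery_point, disaster_time):
    """Time span of data lost if we recover to the last backup or snapshot."""
    return disaster_time - last_recovery_point


def meets_rpo(last_recovery_point, disaster_time, rpo):
    """True if the data lost is within the agreed RPO."""
    return data_loss_window(last_recovery_point, disaster_time) <= rpo


# Hypothetical example: backup at 18:00, disaster at 11:00 the next morning.
last_backup = datetime(2020, 4, 1, 18, 0)
disaster = datetime(2020, 4, 2, 11, 0)
```

In this invented example 17 hours of data would be lost, so meets_rpo(last_backup, disaster, timedelta(hours=2)) is False, while a 24 hour RPO would be met.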
For a more traditional DR location approach, the shorter you want the RPO and the RTO, the more the solution will cost. What you choose to pay for is then a business decision based on an assessment of the business cost of an outage, taking into account how long it might last, versus the cost to mitigate against it through a BC solution.
To enable a very low RTO typically requires having systems that are hosted outside your environment and so ‘always on’, or having ‘warm’ copies of systems elsewhere, with up-to-date data, ready to be switched over to manually, or even done automatically.
Some elements of any recovery process will take time – restoring data takes a long time, and improvements in restore speeds lag considerably behind the pace of data growth. This component alone can take 24-48 hours or more and is one of the reasons why traditional ‘backup and restore the data’ approaches are becoming less attractive, as restore speeds in most cases simply do not allow an acceptable RTO when you have 5, 50 or even 500 terabytes (TB) of data.
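A rough back-of-envelope calculation shows why. The sketch below assumes a sustained restore throughput of 500 MB per second, which is an illustrative figure only; real rates depend heavily on the backup media, network and target storage. It also uses decimal units (1 TB = 10^12 bytes).

```python
def restore_hours(data_tb, throughput_mb_per_s):
    """Estimated hours to restore data_tb terabytes at a sustained rate.

    Uses decimal units: 1 TB = 10**12 bytes, 1 MB = 10**6 bytes.
    """
    seconds = (data_tb * 10**12) / (throughput_mb_per_s * 10**6)
    return seconds / 3600


# Assumed sustained restore rate; purely illustrative.
ASSUMED_MB_PER_S = 500

for size_tb in (5, 50, 500):
    hours = restore_hours(size_tb, ASSUMED_MB_PER_S)
    print(f"{size_tb} TB: about {hours:.1f} hours")
```

Under that assumption, 5 TB restores in under three hours, 50 TB takes more than a day, and 500 TB takes well over a week, before any time is spent on verification or reconfiguration.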
Necessary actions such as rerouting telephony systems can also take time – maybe not days, but it may be more than minutes. It will always be necessary to carry out some manual reconfiguration steps e.g. you may want to choose which location(s) to route your inbound telephone calls to, so it is unusual to have a BC solution that is fully automated with no manual intervention. This means an RTO of zero is probably unrealistic for all but those with the deepest pockets.
However, it is important for the business to articulate what is an acceptable period of downtime. Most housing providers I suspect would consider that resuming core systems within 2 hours of a major outage would be pretty good, whereas this would almost certainly be unacceptable for a bank’s trading systems, for example.
As well as considering the options for co-location, cloud hosting, hybrid models and DR locations for your core IT assets, you will also want to consider what tools staff need to access systems.
If staff will be working from another office, do you have enough endpoint devices? If working from home, have you issued staff with corporate devices, are you expecting them to have their own, or will you have a ‘war chest’ supply in a secure location that you can give out to staff that wouldn’t normally have corporate mobile devices such as, say, customer contact centre staff?
And think too about remote access. Many organisations will have suffered in the last 3-4 weeks because of constraints on remote access, whether that be the bandwidth into their IT systems, or the number of licences for VPNs or other methods used for remote access. Most BC plans do not – did not until now – take into account the scenario of 95%-100% of staff working from home for a prolonged period of time. Given the change to working patterns as a result of the lockdown, future plans will need to rethink this.
And if staff who normally work in a corporate office are working outside the traditional IT network perimeter, how does your security work in this model? Multi-factor authentication (MFA) is likely to be necessary to some extent to ensure that you can robustly mitigate against unauthorised access to information assets.
Whilst many models for IT delivered on a consumption basis – Software as a Service, Infrastructure as a Service etc – are by definition priced based on usage in some way – computing power, data storage etc – there are many components of enterprise IT that still use a more traditional ‘bums on seats’ licensing model (even if those bums are going to be on different seats to the ones in your office if disaster strikes). In this case, paying for thousands of VPN licences may seem poor value for money if you only use 200 concurrently in normal business operations.
But it is also important to consider what ‘normal’ will look like after the lockdown is lifted (which is a separate subject) and also how you might ramp up from one level of usage to a much higher one. Can you do it quickly in a disaster, or will it take you two weeks to get those extra licences? Considering models in which consumption can be paid for on a fair usage basis, but scaled up quickly when needed, will be increasingly important.
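The trade-off can be sketched with some simple arithmetic. All of the prices, seat counts and durations below are hypothetical, chosen only to show the shape of the comparison between licensing every seat all year and licensing a baseline with a short-term burst.

```python
def fixed_cost(total_seats, price_per_seat_year):
    """Annual cost of licensing every possible user all year round."""
    return total_seats * price_per_seat_year


def elastic_cost(baseline_seats, price_per_seat_year,
                 burst_seats, burst_price_per_seat_week, burst_weeks):
    """Annual cost of a baseline licence plus a temporary burst."""
    baseline = baseline_seats * price_per_seat_year
    burst = burst_seats * burst_price_per_seat_week * burst_weeks
    return baseline + burst


# Hypothetical figures in arbitrary cost units.
always_on = fixed_cost(total_seats=2000, price_per_seat_year=60)
pay_as_needed = elastic_cost(baseline_seats=200, price_per_seat_year=60,
                             burst_seats=1800, burst_price_per_seat_week=3,
                             burst_weeks=8)
```

With these invented numbers the elastic model costs less than half as much, even with an eight week burst, but the balance shifts if the burst price is high or the disruption is prolonged, and it only helps if the extra capacity can actually be switched on quickly.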
Selecting the right solution for your organisation
Having identified your requirements, you can then explore your own organisational capabilities and the providers and solutions in the marketplace to determine the costs and benefits of the alternative solutions, and the risks inherent in each.
We have experience of a number of different approaches and solutions, and can assist providers in the evaluation and selection process.
It is important too to remember that a BC plan is not just about technology. Absolutely crucial is determining the crisis management command structure, roles and responsibilities, contact details for key staff and suppliers, and so on.
Scenario planning and testing
In the course of implementing a solution, you will of course want to validate that it can achieve the objectives, and do so within the RTO and RPO parameters you have specified.
However, be wary of considering it a done deal at this point. BC plans should be tested, at least annually, and more often if you have the capacity to do so.
And again, remember that BC plans are not just about the tech – although the tech is important. The organisation should carry out regular scenario testing activities, ideally without warning – the BC equivalent of the fire drill. In these activities, you play out a possible scenario – say a flood at Head Office – and walk through how the organisation needs to respond and what each member of the BC team individually needs to do. This is a great way to identify gaps in your solutions – whether process, people or technology – and hone what you will do if it happens for real.
And BC is not a one-step plan, but a process. Organisations that want to minimise risk and continue to be effective when disasters strike regularly review their BC plans and amend them as business demands change, and indeed as technologies evolve.
Getting further information
The main professional organisation for this discipline in the UK is the Business Continuity Institute (https://thebci.org), which is a good source of best practice on the subject.
For IT and business change professionals, UK IT’s professional body the British Computer Society (BCS) also has a number of resources.
And, of course, seeking advice from and comparing notes with your peers in the housing sector is an excellent way to share knowledge and experience.
Having a fit for purpose BC plan which matches an organisation’s appetite for risk and its budget, and enables it to continue to deliver services when things get tough, is essential. Good planning, coupled with a structured approach to exploring possible solutions, and regular testing and review will ensure your organisation is match fit for the challenges that you might face at any time.
If the Covid-19 lockdown has exposed weaknesses in your BC plan, or if you think it needs a fresh look because your business has changed and the plan is no longer fully aligned to it, now is the time for a concerted organisational effort to revisit it.
And it’s important to be aware that there are many emergent technologies and innovative solutions out there which can help organisations to improve their capacity to deliver services in the event of disasters, or to do so more cost effectively than they do now. DtL is working to help introduce a number of these organisations to the housing sector, and we are happy to help providers who are interested in seeing what’s out there. If you don’t know what solutions are available, how can you pick the ones that are right for your business? Please contact DtL on firstname.lastname@example.org if you wish to know more.
In the next article, we’ll explore how to make the most out of working remotely, considering the personal and social factors, as well as the tools available.