Lessons in handling data disasters

The WanaCrypt0r ransomware attack on the NHS and the BA power failure, both in May 2017, show the right and wrong ways to deal with system outages.

Over the UK bank holiday weekend of 28th and 29th May 2017, British Airways flights were cancelled and delayed. News reports highlighted customers waiting around airports, unable to get any information from BA. In some cases the congestion was so great that passengers could not get into the airport to join the queues inside. It was 30th May before BA claimed to be running a full flight schedule.

Earlier in the month, around 12th May 2017, the WanaCrypt0r ransomware affected systems worldwide, including parts of the UK NHS. Appointments were cancelled and there were queues, but the majority of affected NHS sites switched to a paper system and carried on regardless. Those NHS sites that had properly patched their Windows systems were unaffected by the ransomware attack.

As the smoke clears, the NHS comes out as a resilient organisation making the best of a bad job, while BA is seen as cost-cutting and careless. In both cases the issue was data loss, and the results demonstrate how, and how not, to deal with that loss and its impact on continuing operations.

The best way to plan for failure is to start with a solid system design that should not fail. The NHS and BA systems were more likely to have grown with use than to have been designed and maintained as a robust whole. Even where a system has been built with the maximum resilience towards failure, it would be folly not to have a plan to deal with that failure, no matter how remote. Both the NHS and BA will have had a plan; the NHS one worked, and it considered how to deal with its customers as services were restored. The BA software affected was not involved with the actual flying of planes, so there was no immediate danger to life, a luxury that the NHS recovery plan could not count on.

The BA incident was attributed to a power failure. If this were the sole cause there should have been no disruption, as any loss of grid power would be covered by UPS batteries and local generators. A loss of grid power combined with the UPS batteries going down and the local generators being out of action could only be explained by a wanton lack of planning or by some major natural disaster hitting the data processing sites. An event of that magnitude would not have been missed by the news media. Even so, any plan for data loss should consider what to do if the worst possible case occurs, regardless of its actual cause. There was no evidence of such a plan in BA’s reaction to its system failure.
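To illustrate why a simultaneous loss of every power source points to more than bad luck, here is a minimal Python sketch. The failure probabilities are invented for illustration and are not figures from BA or any real site, and the function name is hypothetical. The sketch also assumes the layers fail independently, which is exactly the assumption that poor planning tends to break.

```python
from math import prod

def combined_outage_probability(layer_failure_probs):
    """Chance that every power layer is down at the same moment.

    Assumes the layers fail independently of one another, so the
    individual probabilities can simply be multiplied together.
    """
    return prod(layer_failure_probs)

# Hypothetical, illustrative chances of each layer being unavailable
# at a given moment. These are NOT measured figures for any real site.
layers = {
    "grid": 0.01,
    "ups_batteries": 0.02,
    "local_generators": 0.05,
}

p_all = combined_outage_probability(layers.values())
print(f"Chance of all layers failing together: {p_all:.6f}")
```

With these made-up numbers the combined chance is far smaller than that of any single layer failing. A total loss of power therefore suggests a shared cause, such as a design or maintenance fault that takes out several layers at once, rather than three unrelated failures.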

Flight bookings have long been a textbook example of real-time systems. The exact details of who is booked on what flight can vary by the second. On the other hand, the majority of passengers will have booked their seats days or weeks in advance. If BA had lost their live system but were able to restore from backups, they would have lost only the customer data entered since the last backup. Those individuals would have been inconvenienced, there would have been disruption and compensation would have been paid, but the flight network could have been back up and running.
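The trade-off described above can be sketched in a few lines of Python. This is only an illustration: the booking records, field names, times and the `split_by_backup` helper are all hypothetical and say nothing about how BA's systems actually store data.

```python
from datetime import datetime

def split_by_backup(bookings, last_backup):
    """Separate bookings captured by the last backup from those made after it."""
    recovered = [b for b in bookings if b["booked_at"] <= last_backup]
    lost = [b for b in bookings if b["booked_at"] > last_backup]
    return recovered, lost

# Hypothetical timeline: a nightly backup, then a failure the next morning.
last_backup = datetime(2017, 5, 28, 0, 0)

# Hypothetical bookings: most were made well in advance, one just before the failure.
bookings = [
    {"ref": "A1", "booked_at": datetime(2017, 5, 1, 14, 0)},
    {"ref": "B2", "booked_at": datetime(2017, 5, 14, 9, 30)},
    {"ref": "C3", "booked_at": datetime(2017, 5, 28, 8, 15)},
]

recovered, lost = split_by_backup(bookings, last_backup)
print("Recovered from backup:", [b["ref"] for b in recovered])
print("Need manual handling:", [b["ref"] for b in lost])
```

The shorter the interval between backups, the smaller the group of customers who have to be handled by hand after a restore.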

Few computer systems exist in isolation. BA had problems with its customers’ baggage during the system failure. Baggage handling is usually controlled by the computer systems of individual airports. With the BA provision down, the airport systems must take some blame for not being able to cope well with whatever records BA could generate. Assuming there was an emergency plan and BA did not just muddle through, the effect of that plan on external systems, such as baggage handling, and on customers needed to be worked out before the plan was put into action. A common complaint from passengers was that BA did not provide information on the problem and how their journeys would be affected. By contrast, the effect of the WanaCrypt0r attack was well publicised and, where possible, the NHS kept services running.

Dealing with data loss is not just a case of trying to recover the data but also of dealing with the immediate consequences of losing it. These consequences can be planned for. Any plan is better than no plan at all, and this becomes more evident the larger the organisation affected. Kindus can help your organisation create and test a disaster recovery plan that will have a minimal impact on your data and on your customers.
