The Proposal
 

C17.15. System recovery procedures. Procedures for restoring the system to operation in the event of a system outage, both expected and unexpected. Identify redundant/diverse systems for providing service in the event of an outage and describe the process for recovery from various types of failures, the training of technical staff who will perform these tasks, the availability and backup of software and operating systems needed to restore the system to operation, the availability of the hardware needed to restore and run the system, backup electrical power systems, the projected time for restoring the system, the procedures for testing the process of restoring the system to operation in the event of an outage, the documentation kept on system outages and on potential system problems that could result in outages.

Preamble

Unity Registry understands the serious consequences of a registry failure for its customer base and the Internet community, whether caused by commercial, technical or other factors. Given the seriousness of such a failure, Unity Registry will adopt industry best practice by developing a business continuity plan to re-establish core business functions and technical services following a catastrophic loss of its resources.

In the context of the operation of a global registry serving a top level domain, any event which results in the Registry Function becoming unstable or unavailable must be resolved as rapidly and effectively as possible using all the available resources of the Registry Operator. In the event of a system outage, Unity Registry will restore the Registry Function to full operation using a variety of systems and procedures which follow an established business continuity methodology and are fully documented in the Registry Operator’s business continuity plan.

These procedures are based in part on the successful experience of the Unity Registry partner organisations in developing and implementing business continuity/disaster recovery plans for the .coop gTLD and the .au ccTLD. They build on the investment made and experience gained in the operation of these major registries and will ensure that the .org Registry Function is stable, reliable and fully available under the widest range of circumstances, and that the Registry Function is resumed as rapidly as possible following any catastrophic loss of resources.

The Business Continuity Plan aims to minimize interruptions to the operation of the Registry Function, and to resume critical operations and services within a specified time after a disaster.  Business continuity planning also aims to minimize financial loss within the Registry operator and the registrar community while assuring customers, ICANN, and the wider Internet community that their interests are protected.

The approach to be taken for the .org business operations will be to identify the business critical functions within Unity Registry and to integrate these into existing Business Continuity Planning. This will ensure that the .org specific operations and processes are safeguarded together with the supporting business and physical infrastructure. The necessary Business Insurance will be put in place and will be reviewed quarterly.

The planning process and associated training programme in business continuity measures ensure that management and staff within Unity Registry fully understand the importance of  continuing service provision and are able to work effectively to restore full system functions in the event of an outage.

Planning for system recovery

Business continuity planning requires a study of the operations of a business, identification of areas and facilities which are likely to be affected, and the provision of backup equipment and procedures for re-establishing services in the event of a system outage. 

Business continuity planning is an established process which has evolved to provide a graded approach to the re-establishment of systems and services following failures or disasters. Failures can be brought about by nature (e.g. floods, cyclones, heat-waves, flu epidemics), can be accidental (e.g. fire, building collapse), can be man-made (e.g. bombs, sabotage, viruses, activation of sprinkler systems) or can be due to industrial disputes (e.g. power strikes). While the variations are numerous, they can be broadly categorized as:

·                loss of facilities;

·                loss of information;

·                loss of access;

·                loss of personnel.

Business Impact Analysis

The first stage in the planning process is a Business Impact Analysis, which involves an analysis of all aspects of Unity Registry and the .org Registry Function, including housing, personnel, equipment, communications, procedures and business requirements.

The business impact analysis incorporates:

An audit of business sites, the personnel and equipment located at each site, and the impact of the loss of the sites, personnel and equipment. 

A security assessment of computer and communications equipment within the Registry, to include

·                Physical security, including access control;

·                Tasks performed by personnel;

·                Operating procedures;

·                Backup and recovery procedures;

·                System development and maintenance;

·                Database security;

·                Personal computers.

An audit of possible disaster situations likely to impact on the Registry Function, in particular:

·                Loss of power (e.g. failure or prolonged strike);

·                Loss of environmental controls (e.g. air-conditioning);

·                Breaches of security (e.g. physical, electronic – virus or hack attack);

·                Loss of internal/external communications;

·                System failure (e.g. computer or disk  malfunction);

·                Internet communication failure or interruption;

·                Degraded performance;

·                File corruption or lost files;

·                Unreliable or incorrect results.

Determination of critical resource requirements for business continuity.

Recovery strategies and methods to be applied in the event of disasters, and timelines for partial and full recovery.

Cost/benefit analysis for the various recovery alternatives.

Staffing requirements for the various recovery alternatives.

Formulation of a recommended recovery strategy.

The business impact analysis is usually performed once, and subjected to a relatively minor annual review to assess changes introduced during the year.  Unity Registry will carry out a full business impact analysis once all systems are in place and the Registry is operational.  The business impact analysis in this section is based on the technical proposals outlined throughout Section III (Technical Plan).

Business Continuity Plan

Following the Business Impact Analysis a Business Continuity Plan is drawn up to document the procedures to be followed to recover from facility loss or service interruption.  Copies of the documents are kept off-site with appropriate backup and software files in the event that the primary site is destroyed.  The Business Continuity Plan is written to allow an external organisation or qualified individual to undertake the recovery process. 

The major components of the Business Continuity Plan are:

Organisational details

This includes details of  alternate office locations, contact details and staff trained in the execution of the recovery procedures.

Declaration procedures for instigating business continuity operations.

This defines the procedure for commencing the business continuity process, including a list of organisations and individuals to be notified.

Procedures for activating alternate work-sites.

Arrangements must be made for alternate work sites in the event that the primary work site cannot continue to be used (e.g. destroyed by fire).  This may take the form of an initial temporary arrangement at another site until a new site is found, or it may be part of a multi-site plan within the organisation.

Procedures for recovering vital records and files

Vital records and files must be stored off-site as part of the business continuity procedure.  This section provides a list of such items and their locations.  Procedures are to be established to ensure that the required files are stored off-site as part of the site’s normal operational procedures, and for checking that they are correctly stored and updated.  Procedures will be documented for the recovery of off-site information (software and data).

Definition of recovery teams and responsibilities

Provide a list of individuals assigned to recovery teams and the tasks to be performed by the teams.  This documentation should take the form of a “flowchart” for recovery in any situation.   Arrangements could be made with external organisations or qualified individuals to be used as alternatives to in-house staff in the event of a disaster.  External staff will be trained in recovery procedures as detailed below.

Recovery procedures

This defines the steps involved in the recovery process. The steps should be clearly defined and reviewed during the staff training and testing described below. This is the key area of the Business Continuity Plan.

Relocation procedures

This section relates to the relocation of the Registry Function technical operations,  either temporarily or permanently as the result of a disaster situation.

Resource requirements and procurement

This provides a list of vendors and suppliers who may be required to provide equipment and/or services to assist with the recovery process.  The section should also document any arrangements or contracts with vendors to supply equipment at short notice, e.g. immediate supply of a replacement computer.

In addition the plan requires provision for:

Staff Training

Training is required for both in-house staff and external contractors in the execution of the business recovery plan. This section documents the level of training and provides procedures for documenting staff training levels. Training should include a review of the business continuity plan and participation in testing as described under Testing below.

Testing

This section documents procedures for testing the Business Continuity Plan to ensure that recovery operations function correctly and that staff are adequately trained.  Procedures should be included to evaluate the progress of general staff in following recovery procedures.  Tests should be performed periodically and should be used to refine the recovery process.

Effectiveness Evaluation and Monitoring

This section provides for an annual review of the entire Business Continuity process, conducted and reviewed by senior management.

The business continuity plan is constantly updated to reflect changes in systems, staffing, software and external circumstances. Unity Registry will draw up a comprehensive business continuity plan once all systems are in place and the Registry is operational, and will constantly maintain this plan and make it available for inspection by authorised parties within the limits of commercial confidentiality.

Copies of the Business Impact Analysis and of the Business Continuity Plan will be maintained in both electronic and hardcopy format at all Unity Registry business premises, off-site and in escrow, in case it should be necessary for a third party to take on some or all of the system recovery responsibility.

Risk analysis for .org registry function

The table below identifies key technical risks that the Registry Operator faces. For each risk, the probability of it occurring and the impact of such an event on the business have been assessed as high, medium or low. The risks are ordered within each risk area (i.e. technical, operational and demand) by impact, then probability. Against each risk we have identified how we propose to reduce the likelihood of it occurring. Against each high impact risk we have described our contingency arrangements.

Unity Registry is based in two locations in separate hemispheres and working in different time zones. Any incident which affected both operations would have to be of such a scale that it can reasonably be argued that the continued operation of the .org registry would not be a prime concern of the Internet community. However, even in this event, plans have been made for the transfer of the Registry Function to a third party organisation based in the continental United States.

Major technical risks identified are:

| Description of technical risk | Probability | Impact | Measures to manage risk avoidance | Contingency (for high priority risks) |
| --- | --- | --- | --- | --- |
| Security infringement leading to loss or corruption of data | Medium | High | Security policies defined and implemented. Firewalls and user authentication systems implemented. Audit trails. | Regular data backups. Data held on multiple sites including off-site and in escrow. |
| Security infringement leading to loss of service | Medium | High | Security policies defined and implemented. Firewalls and user authentication systems implemented. Audit trails. | Alternate Network Operations Center implemented and on standby. |
| Critical failure of Registry Operator’s main systems | Low | High | Proactive systems management. No single point of failure (clustering and secondary systems in place). | Alternate Network Operations Center implemented and on standby. Data held on multiple sites including off-site and in escrow. |
| Systems unable to handle level of transactions | Low | High | Demand models determined system choice. Load balancing included in infrastructure. Continually monitor system performance. Implement scalable architecture. | Upgrade systems or increase network capacity. Reduce level of service in short term. |
| Failure of single key system | High | Low | No single point of failure for overall Registry Function. Clustering, secondary systems and two-site operation in place. Spares held on site. Support agreements in place. | |
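
The ordering rule described for the risk table (impact first, then probability) can be illustrated with a short sketch. The numeric weights and the `Risk`, `order_risks` and `needs_contingency` names are assumptions introduced here for illustration only; the proposal itself only states the high/medium/low ratings.

```python
from dataclasses import dataclass

# Numeric weights are an assumption, used only so that ratings can be sorted.
LEVEL = {"Low": 1, "Medium": 2, "High": 3}


@dataclass(frozen=True)
class Risk:
    description: str
    probability: str  # "Low", "Medium" or "High"
    impact: str       # "Low", "Medium" or "High"


def order_risks(risks):
    """Return risks sorted by impact, then probability, highest first."""
    return sorted(
        risks,
        key=lambda r: (LEVEL[r.impact], LEVEL[r.probability]),
        reverse=True,
    )


def needs_contingency(risk):
    """Contingency arrangements are described for high impact risks only."""
    return risk.impact == "High"


# Ratings copied from the risk table, deliberately listed out of order.
RISKS = [
    Risk("Failure of single key system", "High", "Low"),
    Risk("Systems unable to handle level of transactions", "Low", "High"),
    Risk("Security infringement leading to loss of service", "Medium", "High"),
    Risk("Critical failure of Registry Operator's main systems", "Low", "High"),
    Risk("Security infringement leading to loss or corruption of data",
         "Medium", "High"),
]

if __name__ == "__main__":
    for risk in order_risks(RISKS):
        flag = "contingency required" if needs_contingency(risk) else "no contingency"
        print(f"{risk.impact:<6} {risk.probability:<6} {risk.description} ({flag})")
```

Sorting on a tuple keeps the rule explicit: probability is only consulted to break ties between risks of equal impact.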

A full Business Impact Analysis and Business Continuity Plan will be drawn up by Unity Registry as part of the process of commissioning the registry operation. The processes described below will form part of that plan.

In the case of an expected outage, the exact cause of the outage is known, and appropriate steps can be taken beforehand to prepare for it and to make recovery simple. An expected system outage should have minimal impact on registrars, as Unity Registry will be able to put counteracting measures in place before the outage occurs. For example, if network access from one site were to be unavailable because of engineering work by a network supplier, the registry could transfer operations to the secondary site in preparation and then transfer back afterwards. Unity Registry’s registry is based on the use of redundant and diverse systems, as described in Section C17.14.

If the outage is unexpected then the basic steps to be followed are:

  1. diagnose the problem
  2. decide whether the problem can be handled by a recovery process or is severe enough to require a fail over or, at worst, a systems rebuild
  3. in the case of a rebuild, decide which site is to be used for the rebuild; for example, if one of the data centers were destroyed by an earthquake, the rebuild would have to be attempted in the other data center
  4. quickly plan the recovery and assemble the required resources
  5. effect the recovery
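
Steps 2 and 3 above amount to a small decision procedure, which can be sketched as follows. This is only an illustration: the `Response` enum, the function names and the boolean inputs are hypothetical, and in practice the diagnosis in step 1 would be made by technical staff rather than supplied as flags.

```python
from enum import Enum


class Response(Enum):
    RECOVER = "recovery process"
    FAIL_OVER = "fail over"
    REBUILD = "systems rebuild"


def choose_response(recoverable_in_place, standby_available):
    """Step 2: pick the least drastic response that can restore service."""
    if recoverable_in_place:
        return Response.RECOVER
    if standby_available:
        return Response.FAIL_OVER
    return Response.REBUILD


def choose_rebuild_site(site_usable):
    """Step 3: return the first data center still usable for a rebuild.

    `site_usable` maps a site name to True/False; None means no site is usable.
    """
    for site, usable in site_usable.items():
        if usable:
            return site
    return None


if __name__ == "__main__":
    response = choose_response(recoverable_in_place=False, standby_available=False)
    print(response.value)
    if response is Response.REBUILD:
        # Example from the text: one data center destroyed, rebuild in the other.
        print(choose_rebuild_site({"primary": False, "secondary": True}))
```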

In the event that the Registry System based in Salford were to fail, the hot standby registry based in Manchester would be used. If this were to fail, a temporary registry would be established at the Melbourne Network Operations Center. If all three systems were to fail simultaneously, normal Registry activities could not be performed, i.e. new domain names could not be registered and existing domain names could not be changed.

However, name server operation would continue with existing domain names via the redundant Name Servers.
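
The fail-over order described above (Salford, then Manchester, then Melbourne, with name resolution continuing independently) can be sketched as below. The site labels follow the text, while the health flags and function names are hypothetical inputs chosen for illustration.

```python
# Fail-over order taken from the text; health flags are hypothetical inputs.
FAILOVER_ORDER = [
    "Salford (primary registry)",
    "Manchester (hot standby)",
    "Melbourne (temporary registry)",
]


def active_registry(healthy):
    """Return the first healthy site in fail-over order, or None if all failed."""
    for site in FAILOVER_ORDER:
        if healthy.get(site, False):
            return site
    return None


def available_services(healthy, name_servers_up=True):
    """Summarise which services remain for a given set of site failures."""
    site = active_registry(healthy)
    return {
        "registry_site": site,
        # New registrations and changes need a working registry system.
        "registrations_and_changes": site is not None,
        # Existing names keep resolving via the redundant name servers.
        "name_resolution": name_servers_up,
    }


if __name__ == "__main__":
    all_down = {site: False for site in FAILOVER_ORDER}
    print(available_services(all_down))
```

Keeping name resolution as a separate flag reflects the point made in the text: loss of all three registry systems stops registrations and changes, but not the resolution of existing domain names.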

In the event that Unity Registry is unable to restore operation of the Registry Function (either permanent or hot standby) within the time period given in the full business continuity plan, we would instigate the establishment of a temporary Registry facility at AusRegistry’s facility in Melbourne using data and software from backups or, if necessary, from escrow.

Copies of all system software and of installation CDs will be stored at each data center location, together with copies of the backup software and of all other software needed. These will be stored physically at each data center location and will also be available online from secure sites.

A complete image of the application machines will also be kept at each location on spare hard drives, allowing a new application machine to be “disk duped” in a matter of minutes.

Contracts will be put in place with hardware suppliers to ensure that the hardware needed to restore and run the system is available. Most application machines are based on commodity Intel hardware, and similarly configured Intel machines can be obtained with ease, so their availability can be considered almost immediate. The Sun database machine, on the other hand, will require support contracts with Sun. Cisco contracts will also be required for the Cisco networking hardware.

The backup electrical power system is available, as stated in Section C17.1 and Section C17.14, based on UPS backup and “in-flight refuellable” generators at each data center location.

Recovery Timescales

Recovery time depends on the problem:

·                Database fail over: instant;

·                Fail over from a failed application machine: instant;

·                Complete fail over to the other data center: approximately 15 minutes;

·                Rebuild of an application machine (assuming similar hardware): approximately 10 minutes per machine;

·                Database rebuild: one day;

·                Complete system rebuild, once hardware is sourced: 1-2 days.
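
As a rough illustration, the recovery estimates quoted above can be held in a lookup table and used to estimate the time needed to rebuild several application machines. The upper bound chosen for the 1-2 day range, the `parallel` parameter and all names below are assumptions, not figures or procedures from the proposal.

```python
from datetime import timedelta

# Estimates quoted in the text. Where the text gives a range (1-2 days)
# the upper bound is used, which is an assumption.
RECOVERY_ESTIMATES = {
    "database fail over": timedelta(0),
    "application machine fail over": timedelta(0),
    "data center fail over": timedelta(minutes=15),
    "database rebuild": timedelta(days=1),
    "complete system rebuild": timedelta(days=2),
}

# Per-machine rebuild time, assuming similar hardware.
APP_MACHINE_REBUILD = timedelta(minutes=10)


def app_rebuild_time(machines, parallel=1):
    """Estimate time to rebuild application machines, `parallel` at a time."""
    if machines < 0 or parallel < 1:
        raise ValueError("machines must be >= 0 and parallel >= 1")
    batches = -(-machines // parallel)  # ceiling division
    return batches * APP_MACHINE_REBUILD


if __name__ == "__main__":
    print(app_rebuild_time(6))              # six machines, one at a time
    print(app_rebuild_time(6, parallel=3))  # six machines, three at a time
```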

It is estimated that the Temporary Registry could be brought to operational status within one week, or sooner if arrangements are made with equipment suppliers. The Melbourne site should be operational within 24 hours: the Registry Data will have been replicated across the Internet or installed via the NCC electronic vault service, and the infrastructure will already be in place, as this Center will usually function as a development environment. During this period, only existing Domain Names served by surviving .org Name Server facilities will be accessible.

System Testing

The Business Continuity Plan will be tested regularly to ensure that recovery operations are correctly documented and function correctly, and that staff are adequately trained.

The testing process will be used to refine the recovery process. Testing will involve full fail over tests to the Hot Standby, carried out quarterly. A test build of the temporary registry will be carried out every six months.
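
The testing cadence described above (fail over tests every three months, a temporary registry build every six) can be turned into a simple calendar. The start date, the choice of the first day of the month and the function names are assumptions made for this sketch.

```python
from datetime import date


def add_months(d, months):
    """Return the first day of the month that is `months` after d's month."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def build_schedule(start, horizon_months=12):
    """List (date, activity) pairs for the testing cadence in the text."""
    schedule = []
    for offset in range(3, horizon_months + 1, 3):
        when = add_months(start, offset)
        # Full fail over test to the hot standby: every three months.
        schedule.append((when, "full fail over test to hot standby"))
        # Build of the temporary registry: every six months.
        if offset % 6 == 0:
            schedule.append((when, "build of temporary registry"))
    return schedule


if __name__ == "__main__":
    # Hypothetical start date, for illustration only.
    for when, activity in build_schedule(date(2003, 1, 1)):
        print(when.isoformat(), activity)
```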

Testing will be carried out by internal and also by external staff in order to test the quality of the documentation and to obtain feedback from those not involved in the formation of the Recovery procedures.

All technical staff are recruited for their skills and understanding of the systems which underpin the registry function, and continued professional development is a key part of Unity Registry’s approach to personnel development.

AusRegistry will provide Poptel with in-house training on the operation of the registry software, which was developed and is used by AusRegistry. Where the skills involve external products such as Cisco, staff members will be required to hold the relevant recognised qualifications, e.g. Cisco Certified Network Associate (CCNA).

Training in the Business Continuity Plan is also required. All internal and external staff will be fully trained. The level of training required will be analysed, and training will be provided at a level appropriate to the role that each person is required to perform.

These training levels and the required training will be documented. All documentation pertaining to the Business Continuity Plan, including the establishment and operation of the Temporary Registry, will regularly be written to CD so that staff can access it and remain appropriately trained.

Senior staff will regularly review the Business Continuity Plan and participate in its testing.

All known potential problems that could result in outages are maintained in the knowledge base, and all technical support staff are educated in how to recognise them. Periodic in-house training updates ensure that staff are kept up to date with how to recognise these problems in the systems’ statistics and monitoring packages.

Events which could lead to the instigation of Unity Registry’s Business Continuity Operations are as follows:

·                Destruction of Unity Registry’s Business Offices (London, Manchester, Melbourne or Backup) and/or loss of Unity Registry’s senior staff in a disaster situation;

·                Commencement of ICANN’s Business Continuity Plan;

·                Commencement of a Registrar’s Business Continuity Plan;

·                Disaster situations which disrupt business activities in London, Manchester or  Melbourne or cities containing backup facilities for Unity Registry;

·                Major Internet communication failures;

·                Any other event that threatens the on-going operation of the Internet around the world.

Unity Registry will thoroughly investigate and document all aspects of any system outage.  This will cover:

  • the problem
  • cause/s
  • symptoms
  • corrective action required
  • personnel involved
  • other problems detected
  • action taken and by whom
  • time and date of outage
  • extent of outage
  • who the outage impacted
  • analysis of impact on registrars and Internet community
  • analysis of recovery operation
  • ideas/suggestions to prevent the outage occurring again
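
The items listed above map naturally onto a structured outage record. The sketch below is one possible layout, not the format Unity Registry would necessarily use; the `OutageReport` class, its field names and the `missing_sections` helper are all hypothetical.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime


@dataclass
class OutageReport:
    """One record per outage, with a field for each documented item."""
    problem: str
    causes: list
    symptoms: list
    corrective_action: str
    personnel_involved: list
    started_at: datetime          # time and date of outage
    extent: str
    parties_impacted: list
    actions_taken: list = field(default_factory=list)   # (who, what) pairs
    other_problems: list = field(default_factory=list)
    impact_analysis: str = ""     # impact on registrars and Internet community
    recovery_analysis: str = ""
    prevention_ideas: list = field(default_factory=list)

    def missing_sections(self):
        """Names of sections still empty, so incomplete reports can be flagged."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    # Invented example values, for illustration only.
    report = OutageReport(
        problem="Registry database unavailable",
        causes=["disk failure"],
        symptoms=["registrar transactions timing out"],
        corrective_action="fail over to standby database",
        personnel_involved=["duty engineer"],
        started_at=datetime(2003, 1, 1, 9, 0),
        extent="registration services only",
        parties_impacted=["registrars"],
    )
    print(report.missing_sections())
```

Listing the empty sections makes it easy to check that an investigation has covered every item before the report is closed.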