C17.14. System outage prevention. Procedures for problem detection, redundancy of all systems, back-up power supply, facility security, technical security, availability of back-up software, operating system, and hardware, system monitoring, technical maintenance staff, server locations.

Preventing unplanned system outages and keeping planned outages to a minimum both result from robust planning and execution of the system reliability practices noted in previous sections. A robust system architecture and continuous system monitoring help prevent unplanned outages and shorten those that do occur. A modular system design with well-defined interfaces helps to minimize planned outages due to normal system maintenance.

To summarize the prior discussions, the UIA Team's system outage prevention includes the following elements:

  • Redundant secure facilities (discussed in Section C17.1)
  • Redundant facility infrastructure, including power, cooling, fire suppression, etc. (discussed in Section C17.1)
  • Redundant Internet providers and massive bandwidth (discussed in Section C17.1)
  • Redundant servers, either hot spares or load balanced (discussed in Section C17.1)
  • Multiple data center infrastructures within a single facility (discussed in Section C17.1)
  • Redundant servers located in different data centers and different facilities (discussed in Section C17.1)
  • System and network health and intrusion monitoring (discussed below and in Section C17.9 and Section C17.13)

Procedures for Problem Detection

The UIA Team's Registry Command Center (RCC) monitors the .org database at 60-second intervals. Monitoring covers not only server behavior (e.g., up/down, CPU and memory utilization), but also system characteristics such as the number of RRP sessions per registrar and the number of RRP transactions currently being processed. The RCC will be staffed 7x24x365 by trained and qualified engineers. If problems occur, the RCC will be aware of them within 60 seconds. Each site in the global DNS constellation will be monitored at 4-second intervals. Issues that cannot be resolved by RCC engineers will be escalated to an Operations team that is either onsite or on-call 7x24x365. More details of the RRP and database monitoring can be found in Section C17.13. Details of DNS monitoring are discussed in Section C17.10.
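The detection and escalation flow described above can be illustrated with a minimal sketch. Everything named here is hypothetical: the `Sample` fields, the `LIMITS` threshold values, and the `evaluate` and `route` helpers are assumptions made for illustration, not the RCC's actual tooling. Only the 60-second polling interval and the monitored characteristics come from the text.

```python
from dataclasses import dataclass

POLL_INTERVAL_SECONDS = 60  # database polling interval stated in the text


@dataclass
class Sample:
    """One monitoring sample for a database server (hypothetical shape)."""
    server_up: bool
    cpu_pct: float
    mem_pct: float
    rrp_sessions: dict  # registrar id -> number of open RRP sessions
    rrp_in_flight: int  # RRP transactions currently being processed


# Hypothetical alert thresholds; real values would be set by operations staff.
LIMITS = {
    "cpu_pct": 90.0,
    "mem_pct": 90.0,
    "sessions_per_registrar": 15,
    "rrp_in_flight": 500,
}


def evaluate(sample, limits=LIMITS):
    """Return a list of alert strings for one sample (empty if healthy)."""
    alerts = []
    if not sample.server_up:
        alerts.append("server down")
    if sample.cpu_pct > limits["cpu_pct"]:
        alerts.append(f"CPU at {sample.cpu_pct:.0f}%")
    if sample.mem_pct > limits["mem_pct"]:
        alerts.append(f"memory at {sample.mem_pct:.0f}%")
    if sample.rrp_in_flight > limits["rrp_in_flight"]:
        alerts.append(f"{sample.rrp_in_flight} RRP transactions in flight")
    for registrar, count in sorted(sample.rrp_sessions.items()):
        if count > limits["sessions_per_registrar"]:
            alerts.append(f"registrar {registrar}: {count} RRP sessions")
    return alerts


def route(alerts, rcc_can_resolve):
    """Decide who handles the alerts: nobody, the RCC, or Operations."""
    if not alerts:
        return "ok"
    return "rcc" if rcc_can_resolve else "operations"
```

In such a design, a scheduler would call `evaluate` once per `POLL_INTERVAL_SECONDS`, which is what bounds detection time to one polling interval.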

One of the most critical aspects of system outage prevention is the ability to shift operations between facilities. The locations of facilities supporting the .org registry are noted in Section C17.1. The .org database operations are divided between two facilities and can be run entirely from either one in the event of a facility failure. This architecture is discussed in greater detail in Section C17.1.
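As a rough illustration of the two-facility arrangement, the sketch below picks which facility should run database operations given the health of each. The facility names and the `choose_active` helper are hypothetical; the actual failover mechanism is the one described in Section C17.1, not this code.

```python
def choose_active(preferred, facility_healthy):
    """Pick the facility that should run database operations.

    facility_healthy maps a facility name to True/False. The preferred
    facility is used when healthy; otherwise the first healthy alternative
    (in name order) takes over. Returns None if no facility is healthy.
    """
    if facility_healthy.get(preferred, False):
        return preferred
    for name in sorted(facility_healthy):
        if facility_healthy[name]:
            return name
    return None


# Hypothetical facility names, for illustration only.
status = {"facility-a": False, "facility-b": True}
active = choose_active("facility-a", status)  # falls back to "facility-b"
```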

DNS is by its nature extremely robust. The outage of one of the 13 global DNS sites would not be noticed on the Internet, because the other sites would automatically assume the load. In fact, the current constellation has sufficient capacity that roughly one third of the sites (four sites) could handle normal DNS volumes. With the deployment and implementation of the ATLAS platform (discussed in Section C17.4 and Section C17.5), one "super" site can process normal volumes. Even so, a "swing site" is maintained to which the DNS transactions from any global site can be routed in the event of a site failure or planned site maintenance. Another critical element of preventing outages in the DNS is over-provisioning bandwidth. A DoS or DDoS attack against the Internet DNS is extremely difficult to repel; the best option is to absorb it, and the greater the capacity to absorb such an attack, the less impact it has. As discussed in Section C17.10, the global DNS constellation currently has the ability to process more than 400,000 DNS transactions per second, with more than 3 Gbps of combined network bandwidth. With the deployment of ATLAS, this capacity will increase to more than 5 million transactions per second and nearly 10 Gbps of combined network bandwidth.
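The headroom implied by these figures can be worked out with a short calculation. This is a back-of-the-envelope sketch that assumes every site has equal capacity, which the text does not state; the stated capacities are also lower bounds ("more than"), so the results are approximate. The function names are illustrative.

```python
SITES = 13                 # global DNS sites (from the text)
CURRENT_TPS = 400_000      # current constellation capacity, transactions/s
ATLAS_TPS = 5_000_000      # capacity after ATLAS deployment, transactions/s
SITES_FOR_NORMAL_LOAD = 4  # sites said to be able to carry normal volumes


def normal_load_ceiling(total_tps, sites, sites_needed):
    """Upper bound on normal load, assuming equal capacity per site."""
    return total_tps * sites_needed / sites


def headroom(total_tps, load_tps):
    """How many times the given load the constellation can absorb."""
    return total_tps / load_tps


ceiling = normal_load_ceiling(CURRENT_TPS, SITES, SITES_FOR_NORMAL_LOAD)
current_headroom = headroom(CURRENT_TPS, ceiling)  # 13 / 4
atlas_headroom = headroom(ATLAS_TPS, ceiling)
atlas_growth = ATLAS_TPS / CURRENT_TPS

print(f"normal load is at most about {ceiling:,.0f} transactions/s")
print(f"current headroom: {current_headroom:.2f}x")
print(f"headroom with ATLAS: {atlas_headroom:.1f}x")
print(f"capacity growth with ATLAS: {atlas_growth:.1f}x")
```

Under the equal-capacity assumption, normal volume is at most about 123,000 transactions per second, so the current constellation can absorb a little over three times normal load, and the ATLAS deployment raises total capacity by a factor of 12.5.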

 
