Preventing unplanned system outages and keeping planned outages to a minimum
both depend on robust planning and execution of the system reliability
practices noted in previous sections. A robust system architecture and
continuous system monitoring help prevent or minimize unplanned outages. A
modular system design with well-defined interfaces helps minimize planned
outages due to normal system maintenance.
To summarize the prior discussions, the UIA Team's system outage prevention
includes the following elements:
- Redundant secure facilities (discussed in Section C17.1)
- Redundant facility infrastructure, including power, cooling, fire
suppression, etc. (discussed in Section C17.1)
- Redundant Internet providers and massive bandwidth (discussed in Section
C17.1)
- Redundant servers, either hot spares or load balanced (discussed in Section
C17.1)
- Multiple data center infrastructures within a single facility (discussed in
Section C17.1)
- Redundant servers located in different data centers and different
facilities (discussed in Section C17.1)
- System and network health and intrusion monitoring (discussed below and in
Section C17.9 and Section C17.13)
Procedures for Problem Detection
The UIA Team's Registry Command Center (RCC) monitors the .org database at
60-second intervals. Monitoring covers not only server behavior (e.g.,
up/down status, CPU and memory utilization) but also system characteristics
such as the number of RRP sessions per registrar and the number of RRP
transactions currently being processed. The RCC will be staffed 7x24 by trained
and qualified engineers. If problems occur, the RCC will be aware of them within
60 seconds. Each site in the global DNS constellation will be monitored at
4-second intervals. Issues that RCC engineers cannot resolve will be escalated
to an Operations team that is either onsite or on call 7x24x365. More details of
the RRP and database monitoring can be found in Section C17.13. Details of DNS
monitoring are discussed in Section C17.10.
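
The detection and escalation flow described above can be sketched roughly as
follows. This is a minimal illustration, not the RCC's actual tooling: the
Sample fields mirror the metrics named in the text, while every threshold
value and the function names find_problems and route are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

POLL_INTERVAL_SECONDS = 60  # database polling interval stated in the text

# Hypothetical alert thresholds; the source does not give any values.
MAX_CPU_PERCENT = 90.0
MAX_MEMORY_PERCENT = 90.0
MAX_SESSIONS_PER_REGISTRAR = 50
MAX_TRANSACTIONS_IN_FLIGHT = 1000


@dataclass
class Sample:
    """One polling result covering the metrics named in the text."""
    server_up: bool
    cpu_percent: float
    memory_percent: float
    rrp_sessions_by_registrar: dict[str, int]
    rrp_transactions_in_flight: int


def find_problems(sample: Sample) -> list[str]:
    """Return a description of every threshold the sample violates."""
    problems = []
    if not sample.server_up:
        problems.append("server down")
    if sample.cpu_percent > MAX_CPU_PERCENT:
        problems.append("CPU utilization high")
    if sample.memory_percent > MAX_MEMORY_PERCENT:
        problems.append("memory utilization high")
    for registrar, sessions in sorted(sample.rrp_sessions_by_registrar.items()):
        if sessions > MAX_SESSIONS_PER_REGISTRAR:
            problems.append(f"too many RRP sessions: {registrar}")
    if sample.rrp_transactions_in_flight > MAX_TRANSACTIONS_IN_FLIGHT:
        problems.append("RRP transaction backlog")
    return problems


def route(problems: list[str], rcc_can_resolve: bool) -> str:
    """Decide who handles the problems: nobody, the RCC, or Operations."""
    if not problems:
        return "ok"
    return "rcc" if rcc_can_resolve else "operations"
```

In a real deployment a scheduler would call find_problems once per
POLL_INTERVAL_SECONDS; the loop is left out here so the decision logic stays
easy to read and test.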
One of the most critical aspects of system outage prevention is the
ability to shift operations between facilities. The locations of facilities
supporting the .org registry are noted in Section C17.1. The .org database
operations are divided between two facilities and can be run from either one in
the event of a facility failure. This architecture is discussed in greater
detail in Section C17.1.
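
As a rough illustration of shifting operations between two facilities, the
sketch below assigns each workload to its primary facility and moves it to the
surviving facility when the primary is unavailable. The facility names, the
workload names, and the place_workloads function are all hypothetical; the
actual division of operations is the one described in Section C17.1.

```python
from __future__ import annotations

# Hypothetical facility names.
FACILITIES = ("facility_a", "facility_b")

# Hypothetical division of database operations between the two facilities.
PRIMARY = {
    "registrations": "facility_a",
    "reporting": "facility_b",
}


def place_workloads(healthy: set[str]) -> dict[str, str]:
    """Map each workload to a facility, preferring its primary."""
    available = [f for f in FACILITIES if f in healthy]
    if not available:
        raise RuntimeError("no facility available")
    placement = {}
    for workload, primary in PRIMARY.items():
        # Stay on the primary when it is healthy; otherwise shift to the
        # first remaining healthy facility.
        placement[workload] = primary if primary in healthy else available[0]
    return placement
```

With both facilities healthy the placement equals PRIMARY; with only one
healthy, every workload lands on that facility.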
DNS is by its nature extremely robust. The outage of one of the 13 global DNS
sites would not be noticed on the Internet, because the other sites would
automatically assume the load. In fact, the current constellation has sufficient
capacity that roughly one third of the sites (four sites) could handle normal DNS
volumes. With the deployment and implementation of the ATLAS platform (discussed
in Section C17.4 and Section C17.5), one "super" site can process normal
volumes. Even so, a "swing site" is maintained to which the DNS
transactions from any global site can be routed in the event of a site failure
or planned site maintenance. Another critical element of preventing outages in
the DNS is to over-provision bandwidth. A DoS or DDoS attack against the
Internet DNS is extremely difficult to repel; the best option is to absorb it.
The greater the capacity to absorb such an attack, the less impact it has.
As discussed in Section C17.10, the global DNS constellation can currently
process more than 400,000 DNS transactions per second, with more than
3 Gbps of combined network bandwidth. With the deployment of ATLAS, this capacity
will increase to more than 5 million transactions per second and nearly 10 Gbps
of combined network bandwidth.
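
The headroom implied by these figures can be worked through with a short
calculation. It is a back-of-the-envelope sketch only: it assumes the 13 sites
have equal capacity, which the text does not state, and it treats the quoted
"more than" figures as lower bounds. The resulting ceiling on normal volume is
therefore an inference, not a number given in this proposal.

```python
TOTAL_SITES = 13           # global DNS sites (from the text)
SITES_SUFFICIENT = 4       # sites said to handle normal volumes (from the text)
CURRENT_TPS = 400_000      # current capacity, lower bound (from the text)
ATLAS_TPS = 5_000_000      # capacity with ATLAS, lower bound (from the text)


def normal_volume_ceiling() -> float:
    """Upper bound on normal volume, ASSUMING equal capacity per site."""
    per_site_tps = CURRENT_TPS / TOTAL_SITES
    return per_site_tps * SITES_SUFFICIENT


def headroom(capacity_tps: float, load_tps: float) -> float:
    """How many times the given load the capacity could absorb."""
    return capacity_tps / load_tps


ceiling = normal_volume_ceiling()                # about 123,077 per second
current_headroom = headroom(CURRENT_TPS, ceiling)  # 13 / 4 = 3.25
atlas_headroom = headroom(ATLAS_TPS, ceiling)      # 40.625
atlas_scale_up = ATLAS_TPS / CURRENT_TPS           # 12.5
```

Under these assumptions the current constellation can absorb a little over
three times its normal load, and the quoted ATLAS capacity raises that to
roughly forty times.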