II.2.15        Registry Failure Provisions (RFP Section D13.2.15)

The JVTeam, combining the complementary strengths of NeuStar and Melbourne IT, has a unique appreciation for the mission-critical nature of a DNS registry and for the types of real-world operational, technical, business, and legal issues that must be engineered into a total business solution to meet those needs.

For example, through its almost five-year history of deploying, operating, and significantly expanding its mission-critical Number Portability Administration Center (NPAC), NeuStar has direct experience with an industry facility and high-performance database on which every service provider operational support system (OSS) and telephone call in North America relies.  In addition, its role as the North American Numbering Plan Administrator requires the highest professional, technical, and ethical excellence to operate effectively under the extreme political and policy pressures associated with managing this fixed public resource at the intersection of immense industry policy, technical, regulatory, and financial forces.  NeuStar has also been a major contributor to industry standards efforts in both the telecommunications and Internet communities; for example, it has been a leader in facilitating the emerging ENUM standards work, at the intersection of the telephone numbering and DNS domain disciplines.

Likewise, Melbourne IT is one of the few recognized leaders in DNS registry operations and also has extensive experience in registrar operations and systems.  Melbourne IT is consistently the preferred provider of high-volume, high-quality registrar back-office operations services.  Consequently, the JVTeam brings a uniquely specialized combination of focused DNS/registry domain expertise and best-of-breed business, IT, operational, and financial strengths.

Our registry service is engineered and managed to the highest quality standards to ensure continued operation in the face of numerous types of failure.  Rigorous software development lifecycle processes provide strict change management controls for implementing functional or capacity upgrades in the various subsystems.  The JVTeam will employ the same software lifecycle processes that NeuStar has used successfully over the past four years to consistently deliver new software releases on time and to rigorous mission-critical standards.  On the NPAC SMS system alone, NeuStar has deployed more than seven major new software releases on schedule over the past four years, implementing more than 300 functional change orders contracted by the industry.  Each release cycle involves extensive industry testing to validate system interoperability in a captive testbed environment before the release is placed into production.

In addition, the JVTeam will provide XRP lab-to-lab interoperability testing as a service to SRS client system suppliers to ensure compatibility with then-existing and upcoming releases of the XRP protocol.

To provide a mission-critical service of this kind, we employ extensive operational and technical measures of quality.  In our existing services, again taking the NPAC SMS as an example, we report on 29 different service level measures to our customers to provide them with an objective measure of the quality and consistency of our service.

Nameserver Operations

Consistent with the JVTeam's collective experience in developing and operating critical shared support services for the industry, we understand that the stability of the Internet rests on the integrity of DNS nameserver operations for new gTLDs.

In addition, effective usability of the gTLD name space requires that sufficient DNS nameserver capacity be consistently available on a geographically distributed basis to provide networks and end users with the necessary resolution bandwidth.  Otherwise, the new name space will not fulfill its users' needs for effective usability and accessibility.

Consequently, we have engineered our nameserver function to provide at least 99.999% service availability.  From our first-hand experience in providing shared support services at these availability levels, we've developed an extensive set of availability and scalability attributes for our nameserver systems.

First, upon completion of our deployment phases, there will be at least three geographically distributed nameserver sites, each hosting multiple copies of the zone file, and each capable of operating autonomously in the unlikely case of a dual communications network failure.
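
To put the multi-site design in concrete terms, the short calculation below illustrates how independent, autonomous sites compound availability.  It is an illustrative sketch only: the 99.9% single-site figure is an assumed placeholder, not a measured value, and real-world site failures are not perfectly independent.

    # Illustrative only: aggregate availability of N autonomous nameserver sites,
    # assuming each site alone achieves 99.9% availability (a placeholder figure)
    # and that site failures are independent.
    def aggregate_availability(per_site: float, sites: int) -> float:
        """Probability that at least one of `sites` independent sites is reachable."""
        return 1.0 - (1.0 - per_site) ** sites

    if __name__ == "__main__":
        per_site = 0.999                        # assumed single-site availability
        for n in (1, 2, 3):
            print("%d site(s): %.9f" % (n, aggregate_availability(per_site, n)))
        # Three sites yield 0.999999999 under these assumptions, comfortably above
        # the 99.999% nameserver service-availability target.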

Second, each site hosts one or more load distributors and subtending nameserver platforms that together provide the aggregate nameserver capacity for that site.  Failure of any one server gracefully reduces the total capacity at that site, but is detectable in near real time using detection and keep-alive facilities between the load distributors and the servers themselves.  Each nameserver site will be dual-homed over separate service provider networks to each of our SRS sites, providing full diversity of communications access to both the Internet and our internal WAN.  We can easily add nameserver hardware capacity online by adding servers to the redundant site LAN and adding them logically to the load distributor.  Being able to expand and manage capacity while in an on-line operational state is critical to maintaining 99.999% availability, and cannot be done with conventional, large, monolithic nameserver systems.
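
The sketch below illustrates the kind of keep-alive detection described above, in which a load distributor probes each subtending nameserver and routes traffic only to servers that answer.  The addresses, timeout, and use of a minimal DNS query are placeholder assumptions for illustration, not the production mechanism.

    # Illustrative sketch only: a load distributor probing subtending nameservers
    # with a minimal DNS query and keeping only responsive servers in rotation.
    # Addresses, port, and timeout are placeholder assumptions.
    import socket

    SERVERS = ["10.0.1.11", "10.0.1.12", "10.0.1.13"]   # hypothetical server pool
    PROBE_PORT = 53
    TIMEOUT_S = 0.5

    # A fixed query for the root SOA record: 12-byte header (ID=1, RD set, one
    # question) followed by the root name, QTYPE=SOA, QCLASS=IN.
    PROBE_QUERY = bytes.fromhex("000101000001000000000000") + b"\x00\x00\x06\x00\x01"

    def probe(server: str) -> bool:
        """Return True if the server answers the keep-alive query within TIMEOUT_S."""
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
                s.settimeout(TIMEOUT_S)
                s.sendto(PROBE_QUERY, (server, PROBE_PORT))
                s.recvfrom(512)
                return True
        except OSError:
            return False

    def healthy_pool() -> list:
        """Servers that passed the probe; only these receive resolution traffic."""
        return [srv for srv in SERVERS if probe(srv)]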

Each server will maintain a complete in-RAM copy of the zone master file and will process transactional update requests from the site-local update distributor, which broadcasts and manages the processing of zone updates.  Servers that fail to correctly post updates will be logically placed out of service by the load distributor to prevent them from responding with erroneous information.  Each server is a high-performance 64-bit platform that can readily be expanded beyond 4 GB of RAM to ensure sufficient growth capacity should the gTLD grow beyond approximately 23M names (depending on the use of keyed signatures per the DNSSEC extensions).
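
As a simplified illustration of this behavior, the sketch below shows a server applying broadcast update transactions to an in-memory zone table and marking itself unserviceable if an update cannot be posted correctly, so the load distributor stops routing queries to it.  The transaction fields and data shapes are assumptions for illustration only.

    # Simplified illustration only: a nameserver holding its zone in RAM and
    # applying update transactions pushed by the site-local update distributor.
    # The transaction fields ('op', 'name', 'rrtype', 'rdata') are assumed shapes.
    class InMemoryZone:
        def __init__(self) -> None:
            # name -> record type -> list of record data strings
            self.records: dict = {}
            self.serviceable = True     # polled by the load distributor's keep-alive

        def apply(self, txn: dict) -> None:
            """Post one incremental update; mark the server out of service on failure."""
            try:
                if txn["op"] == "add":
                    self.records.setdefault(txn["name"], {}) \
                                .setdefault(txn["rrtype"], []).append(txn["rdata"])
                elif txn["op"] == "delete":
                    self.records[txn["name"]][txn["rrtype"]].remove(txn["rdata"])
                else:
                    raise ValueError("unknown operation: %r" % txn["op"])
            except (KeyError, ValueError):
                # A server that cannot post an update correctly must not keep
                # answering queries with potentially stale or erroneous data.
                self.serviceable = False
                raise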

Third, our registry service employs real-time, event-based transactional updates from the SRS sites, which avoids the time-consuming and error-prone batch process of regularly generating an entirely new zone file.  This allows us to provide near real-time zone updates while simultaneously reducing the possibility of zone file generation errors or delays, since only incremental updates are posted.  The zone updates are themselves posted to a staging database, where application- and database-level logical consistency checks ensure that each zone update transaction is valid before it is propagated to the nameserver infrastructure.  Each nameserver site also maintains a full copy of the master zone file for use in the unlikely case of corruption of the operating copy in one of the servers, or when a new server is deployed.
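
The sketch below illustrates this two-stage update path in miniature: an incremental zone update is first checked against logical-consistency rules in a staging store, and only a valid transaction is handed on for propagation to the nameserver sites.  The specific rules and record shapes shown are illustrative assumptions, not the registry's actual checks.

    # Illustrative sketch of the two-stage validation described above; the rule set
    # and record shapes are assumptions, not the registry's actual checks.
    import re

    LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$", re.IGNORECASE)
    ALLOWED_TYPES = {"NS", "A", "AAAA", "DS"}

    def validate_zone_update(txn: dict, staged_names: set) -> list:
        """Return a list of consistency errors; an empty list means the update is valid."""
        errors = []
        name = str(txn.get("name", "")).rstrip(".")
        if not name or not all(LABEL.match(label) for label in name.split(".")):
            errors.append("malformed domain name: %r" % name)
        if txn.get("op") == "add" and name in staged_names:
            errors.append("duplicate delegation for %r" % name)
        if txn.get("op") == "delete" and name not in staged_names:
            errors.append("delete requested for unknown name %r" % name)
        if txn.get("rrtype") not in ALLOWED_TYPES:
            errors.append("unsupported record type %r" % txn.get("rrtype"))
        return errors

    def stage_and_propagate(txn: dict, staged_names: set, propagate) -> bool:
        """Post to staging, run the checks, then hand the update to the nameservers."""
        if validate_zone_update(txn, staged_names):
            return False                     # rejected before any nameserver sees it
        name = str(txn["name"]).rstrip(".")
        if txn["op"] == "add":
            staged_names.add(name)
        else:
            staged_names.discard(name)
        propagate(txn)                       # e.g., via the site-local update distributor
        return True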

Fourth, the nameserver sites collectively are sized to handle the actual peak world-wide zone resolution load in case of multiple site failures.  This insulates the Internet from rare dual-failure conditions that could otherwise impact nameserver capacity or reachability.  This is also why the nameserver sites are geographically distributed.
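
The small calculation below shows the sizing principle: surviving sites must jointly carry the full peak load after site failures.  The query rate used is a placeholder figure, not a projected registry load.

    # Back-of-the-envelope sizing sketch; the peak query rate is a placeholder
    # figure, not a projected registry load.
    def per_site_min_capacity(peak_qps: float, total_sites: int, failed_sites: int) -> float:
        """Capacity each surviving site needs so survivors still carry the full peak."""
        survivors = total_sites - failed_sites
        if survivors < 1:
            raise ValueError("at least one site must survive")
        return peak_qps / survivors

    # Example: three sites engineered to ride through the loss of two of them means
    # each site must be provisioned for the entire assumed 50,000 queries/second peak.
    print(per_site_min_capacity(50_000, total_sites=3, failed_sites=2))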

The use of event-based transactional updates, two-stage update transaction validation, geographically distributed autonomous nameserver sites, distributed servers within each nameserver site, and diverse communications networks allows us to provide nameserver service availability at the very highest level.

In addition, the JVTeam will generate frequent master zone file (as well as SRS database, and Whois) updates for offsite storage and for escrow with ICANN.

Whois Operations

Similar in design to our nameserver function, the Whois function is deployed with a high-availability architecture engineered to 99.95% availability.  Because of their somewhat lower criticality and load, the Whois servers are deployed in the SRS sites.  The Whois function is implemented in each site as a series of server engines on a redundant LAN, front-ended by a load distributor.  Transactional Whois updates flow from the SRS update distributor in the same manner as transactional zone file updates.  Whois capacity is distributed across the two SRS sites to ensure availability and capacity while running in degraded mode.

SRS Server Operations

The SRS service is engineered to 99.95% availability standards to ensure minimal business impact to registrars and their registrants and end-users.  SRS server capacity is geographically distributed across the two SRS sites, which are interconnected via a fully redundant and diverse set of WAN facilities.

The SRS infrastructure consists of a three-tier architecture: XRP/web front-end processors, SRS application processors, and SRS database processors.  The XRP front-end processors, like the nameserver and Whois systems, are deployed as a distributed series of servers behind multiple load distributors in each site.  XRP sessions are distributed across the servers, which forward binary transactions to the SRS application tier.  There, an SRS-specific application protocol provides for diversity and distribution of binary XRP transactions between these tiers.  The SRS application processors operate in a stateless manner, so that any front-end may forward a binary transaction to any SRS application processor.  All persistent state (other than XRP session state) is maintained in the final tier, the SRS database processors.

Consequently, the entire SRS service is immune to a number of hardware and infrastructure failures.  The load distributors on the front-ends make front-end failures transparent when an SRS client system re-associates with the SRS after such a failure.  SRS front-end processors have diverse routes to multiple SRS application processors, any of which can process a given transaction.  Failure of an SRS application server at most disrupts the transactions currently being processed on that server, which will be restarted on another application processor upon timeout by the initiating front-end server.  Use of the IETF SCTP protocol for reliable multihomed transaction sessions and real-time association failure detection is planned between the SRS front-end servers and the SRS application servers.
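
The sketch below illustrates, with hypothetical names, how the stateless application tier supports this failure behavior: a front-end processor may forward a transaction to any application processor and, if no response arrives before a timeout, simply restarts the same transaction on another processor.  It is a conceptual sketch only, not the planned SCTP-based production mechanism.

    # Conceptual sketch only: a stateless application tier with timeout-based retry.
    # Processor names and the timeout value are placeholders, and `send` stands in
    # for the real inter-tier transport (planned to be SCTP in production).
    from typing import Callable

    APP_PROCESSORS = ["app-a1", "app-a2", "app-b1", "app-b2"]   # hypothetical pool
    TIMEOUT_S = 2.0

    class TransactionTimeout(Exception):
        """Raised when an application processor does not answer within the timeout."""

    def forward_with_retry(txn: bytes,
                           send: Callable[[str, bytes, float], bytes]) -> bytes:
        """Front-end logic: any processor may handle the transaction because the
        application tier is stateless; on timeout, restart it on another processor."""
        last_error = None
        for processor in APP_PROCESSORS:
            try:
                return send(processor, txn, TIMEOUT_S)
            except TransactionTimeout as exc:
                last_error = exc        # skip the unresponsive processor and retry
        raise RuntimeError("all application processors timed out") from last_error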

The SRS database processors, one cluster in each SRS site, employ synchronous transaction database replication between the two SRS sites, over the diverse intersite WAN, to maintain duplicate, co-active versions of the SRS master database.  Query operations (the bulk of the traffic) are individually processed by the database processor local to the application processor.  Update transactions are simultaneously committed to both database copies (one per site) prior to acknowledging the XRP transaction requested by the SRS client system.
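
The sketch below conveys this commit discipline in simplified form: an update is prepared and committed at both site databases before the originating XRP request is acknowledged, while queries read only the site-local copy.  The execute/prepare/commit/rollback calls shown are stand-ins for the ODBMS's actual two-phase-commit interface, not its real API.

    # Simplified sketch of the dual-site synchronous commit described above. The
    # connection objects and their methods are assumptions for illustration.
    def commit_update_to_both_sites(update_stmt: str, site_a_db, site_b_db) -> bool:
        """Phase 1: prepare at both sites. Phase 2: commit only if both prepared."""
        prepared = []
        try:
            for db in (site_a_db, site_b_db):
                db.execute(update_stmt)
                db.prepare()            # phase 1: the site votes "ready to commit"
                prepared.append(db)
            for db in prepared:
                db.commit()             # phase 2: both copies commit together
            return True                 # only now is the XRP request acknowledged
        except Exception:
            for db in prepared:
                db.rollback()           # neither copy keeps a half-applied update
            return False

    def run_query_locally(query_stmt: str, local_db):
        """Queries (the bulk of the traffic) read only the site-local database copy."""
        return local_db.execute(query_stmt)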

The SRS service is likewise immune to an entire facilities failure or to the failure of any one of the clustered SRS database servers.  The ODBMS technology employed provides the SRS applications with a transparent two-phase commit protocol that ensures simultaneous posting of an update in both locations.  In case of failure and restart of one of the SRS database servers, the newly started server automatically initiates a re-sync process with the other database server, operating in the background; at the completion of this process, the restarted server reactivates synchronous replication.  Again, no SRS downtime is incurred during these failures or during related maintenance and upgrade activities.  Neither time- and load-consuming bulk backups nor asynchronous replication (with its unacceptable time lags during peak transaction processing) is relied upon to safeguard the SRS database or ensure SRS service availability.

Our SRS database servers provide online incremental backup capability that maximizes SRS uptime.  Regular, frequent, SRS database backups will also be generated for offsite storage (in addition to our two SRS sites) and for ICANN escrow.

Helpdesk and Other Operations

The registry technical support help desk and other critical operations functions (e.g., real-time network element monitoring, capacity management, security administration, and intrusion detection) are replicated between our two SRS sites.  In addition, should a snowstorm or other impediment prevent staff access to a location, all internal and help desk staff can perform their functions remotely, using strong physical security token authentication.

Registry Operations Transition and Contingencies

In the event that the JVTeam must transition registry services to another entity, whether at the expiration of our term or under other conditions, we will cooperate fully with ICANN and the new registry operator to ensure a smooth transition of services.  These transition activities include:

          Data Escrow:  We will ensure that all registry operational data is preserved frequently (e.g., nightly incremental backups), using our existing network backup facilities and storing the backups on appropriate media (DVD or CD-ROM) or uploading them to an escrow provider's facility for quick reload or upload.  Regular backups and associated documentation of the database schema will be provided to an escrow provider for the benefit of ICANN or its designated new registry operator.

          Management during transition:  JVTeam will assist in the management of the transition period.  We will pre-identify key personnel in the relevant technical and operational areas to ensure adequate transition support.

          Facilities:  We will negotiate with the new registry operator reasonable access to our facilities in order to ensure a smooth transition.

          Registrar Contracts:  We will furnish ICANN with all business and contract documentation between our registry and its registrars.

         Documentation:  We will make appropriate operational and technical documentation available to both ICANN and the new registry operator.