C17.10

C17.10. Peak capacities. Technical capability for handling a larger-than-projected demand for registration. Effects on load on servers, databases, back-up systems, support systems, escrow systems, maintenance, personnel.

Recent experience has demonstrated that all registries have had problems with unexpected larger-than-projected demand for registrations. For the new gTLD registries, these problems were apparent in system failures associated with the start-up period of land-rush registrations. For the UIA Team, this phenomenon became apparent with the high demand associated with desirable domain name registrations in .com, .net and .org being deleted and reentering the registration market. Handling high volumes is a difficult enough challenge by itself, but doing so while satisfying equivalent access requirements is a daunting task. In September 2001, an architecture was implemented that protects the .org registry database from high transaction volumes, while maintaining equivalent access for all registrars. This architecture is shown in Figure C17.10-1.

Figure C17.10-1: Redundant Multi-Pool Architecture

Registration traffic is driven by three distinct categories of users. First and foremost, there are those who use the Internet for legitimate business or private purposes. They have web sites that must be maintained. They register domain names because they use the domain names. When they need to modify their registration data (e.g., their primary and/or secondary nameservers), they need to do it in order to keep their web identity up and running and current. Coincidentally, these users generate the least registration traffic, although it is the most critical to the usefulness and stability of the Internet. Second, there are domain name speculators. For reasons often known only to them, they are constantly attempting to obtain what they believe are desirable and potentially valuable domain name registrations. Their quest to obtain these registrations generates extremely high registration traffic volumes as they vie with each other to be the first to obtain a desirable registration that is either newly available or is being abandoned by someone else. Finally, there are those who would hold domain registrations hostage. These people or entities are often called "cyber-squatters". They make a business of obtaining abandoned registrations (often abandoned unintentionally due to payment confusion with the registrar), directing the domains to pornography web sites, and then demanding a high price from the original owner to have them restored. Although this is a dirty business, it is unfortunately quite prevalent today and is a source of high transaction volumes against registry systems. Registrars who cater to these last two categories of users can monopolize the connections to, and capacity of, a registration system that lacks a workable quality-of-service (QoS) architecture, making it extremely difficult for smaller registrars who focus on the first category to support the most legitimate Internet use.

In September 2001, a multi-pool model for RRP traffic was implemented for the current .org registry database. The UIA Team proposes to use this same multi-pool model for the .org TLD. It will protect the .org registry database during periods of heavy transaction volumes. It will also protect the equivalent access rights of each registrar. Additionally, it will enable the most legitimate category of registrant to conduct the necessary business to protect the usefulness and stability of the Internet. This type of solution is critical for the .org registry operations because of the large number of .org domain names that are deleted each month. Most new registries have not yet had to face this problem, but the new registry operator for .org will have to deal with it beginning on day one.

In the multi-pool architecture, each registrar will be guaranteed a minimum number of connections in the Guarantee Pool. For 75% of registrars, this is sufficient to conduct all of their business. Throughput is fast and access is guaranteed. Registrars that require additional connections to support a larger base of business and private registrants can obtain additional connections from the Overflow Pool. The purpose of the Overflow Pool is to provide equivalent access in an environment where the connection requirements of registrars vary widely depending on the size of their market base. A registrar with 5 million registrations clearly needs more connections to support its customers than a registrar with 500 registrations. Yet the Equivalent Access clause of the registry agreement requires that all registrars be given equivalent access to the SRS. With the Overflow Pool, the additional connections (up to a predetermined maximum) are "there if you need them" for any registrar. The Overflow Pool will be closely monitored to determine if additional connections are needed.
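The interplay between the Guarantee Pool and the Overflow Pool described above can be illustrated with a minimal sketch. It is not the registry's implementation: the class name, the parameter names and the pool sizes are all hypothetical, chosen only to show a guaranteed minimum per registrar plus a shared pool with a per-registrar cap.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionPools:
    """Toy model of the Guarantee Pool and Overflow Pool (all sizes hypothetical)."""
    guaranteed_per_registrar: int      # minimum connections every registrar always gets
    overflow_capacity: int             # total connections in the shared Overflow Pool
    max_overflow_per_registrar: int    # the "predetermined maximum" of extra connections
    in_use: dict = field(default_factory=dict)  # registrar -> open connections

    def overflow_in_use(self) -> int:
        # Connections beyond the guaranteed minimum count against the Overflow Pool.
        return sum(max(0, n - self.guaranteed_per_registrar)
                   for n in self.in_use.values())

    def open_connection(self, registrar: str) -> str:
        current = self.in_use.get(registrar, 0)
        if current < self.guaranteed_per_registrar:
            # Guaranteed connections are never contended for.
            self.in_use[registrar] = current + 1
            return "guarantee"
        extra = current - self.guaranteed_per_registrar
        if extra >= self.max_overflow_per_registrar:
            return "refused"  # registrar has reached its own cap
        if self.overflow_in_use() >= self.overflow_capacity:
            return "refused"  # shared pool is exhausted
        self.in_use[registrar] = current + 1
        return "overflow"
```

In this sketch a small registrar never leaves the guaranteed tier, while a large registrar draws on the shared pool only up to its cap, so no single registrar can exhaust the connections that others are entitled to.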

Figure C17.10-2: Distribution of RRP Transactions in Multiple Pools

Finally, the Automated Batch Pool will support registrars whose business model caters to speculators, cyber-squatters or business partners who generate large volumes of transactions. These registrars and their business partners have made a science of engineering systems that push as many transactions as technically possible at the registry. In the Automated Batch Pool, participating registrars will be collectively guaranteed a predetermined number of connections and bandwidth. However, the total throughput (bandwidth) in this pool will be capped and distributed equally among all who participate. Registrars will receive either their maximum bandwidth (the same for all registrars) or an equal share of the total available bandwidth, whichever is less. Therefore, if two registrars are sending transactions to the Automated Batch Pool, they will each receive their maximum bandwidth. If 50 registrars are sending transactions, they will each get 1/50th of the total available bandwidth. As should be expected, and as is demonstrated in Figure C17.10-2, the traffic volume in the Automated Batch Pool dwarfs the volumes in the other pools.
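The two worked cases in the paragraph above (two registrars each receive their maximum; 50 registrars each receive 1/50th of the total) correspond to taking the smaller of the per-registrar cap and an equal share of the pool. A minimal sketch follows; the function name and the bandwidth figures in the usage note are hypothetical and the units are arbitrary.

```python
def batch_pool_share(total_bandwidth: float,
                     per_registrar_max: float,
                     active_registrars: int) -> float:
    """Bandwidth granted to each registrar active in the Automated Batch Pool.

    Each registrar receives the per-registrar cap unless an equal split of the
    capped pool total would be smaller, in which case it receives the split.
    """
    if active_registrars <= 0:
        return 0.0
    equal_share = total_bandwidth / active_registrars
    return min(per_registrar_max, equal_share)
```

With an assumed pool total of 1,000 units and a per-registrar cap of 100 units, `batch_pool_share(1000, 100, 2)` yields the full cap of 100 for each of two registrars, while `batch_pool_share(1000, 100, 50)` yields 20, which is 1/50th of the total.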

A QoS architecture provides the ability to "tune" the transaction volume permitted into the .org registry. To date, it has handled up to 120,000 transactions per minute, and been tested to 300,000 transactions per minute, without sacrificing SLAs or causing undue system load on application and database servers. Additionally, the .org registry database architecture permits transaction load to be balanced across multiple servers. If additional capacity is required, additional servers can be added behind the load-balancers. Back-end support systems (back-ups, escrow, maintenance, etc.) are all sufficiently resourced to handle the maximum transaction volume permitted into the .org registry database. The current .org registry database has historically demonstrated the ability to quickly scale to meet unforeseen increases in transaction volume. In April 2000, the daily transaction rate increased from 5 million transactions per day to 25 million transactions per day. This increase occurred over a period of 48 hours. Since then, the transaction volume has continued to increase to its current 150 million transactions per day. Handling this increase has been accomplished without the need to increase staff or redesign the system. The current .org database is designed and proven to be not only highly robust and reliable, but also quickly extensible. Extensibility is key to the stability of the .org registry. The current CPU load on the database server is typically less than 10%. Even during peak periods (more than 300,000 transactions per minute), the CPU load remains below 50%.
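For orientation, the daily figures quoted above can be converted to per-minute averages with simple arithmetic. The sketch below assumes load is spread evenly across the day, which real registration traffic is not, so the result is only an average and says nothing about peaks. The function and constant names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def average_per_minute(transactions_per_day: float) -> float:
    """Average per-minute rate, assuming load is spread evenly over the day."""
    return transactions_per_day / MINUTES_PER_DAY


# Figures taken from the paragraph above.
current_avg = average_per_minute(150_000_000)   # current daily volume
growth_factor = 25_000_000 / 5_000_000          # April 2000 increase over 48 hours

print(f"{current_avg:,.0f} transactions/minute on average")
print(f"April 2000 increase: {growth_factor:.0f}x in 48 hours")
```

On these figures the current daily volume averages roughly 104,000 transactions per minute, and the April 2000 event was a fivefold increase.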

The RFP does not require submitters to discuss or address the single most critical element of performance and capacity that can impact Internet stability: the number and throughput of DNS nameservers. If a registration system is offline, new registrations and changes to existing registrations cannot be accommodated. Certainly this is not desirable, but the Internet continues to work. If DNS nameservers are offline, Internet resolution ceases as DNS cache decays. The impact of such an event goes far beyond the stability of a specific TLD and the top-level DNS servers for that TLD. As lower-level DNS cache decays, the stability of the entire Internet can be put at risk. Internet stability is ICANN's primary objective, and therefore the architecture, capacity and proven performance of a candidate registry's DNS constellation should be of paramount concern.

Figure C17.10-3: Internet DNS Architecture

As Figure C17.10-3 indicates, Internet DNS servers fall into three basic levels. At the top level are the 13 Internet root servers. At the middle level are the top-level DNS servers for each of the TLDs. At the bottom level are all the rest of the Internet DNS nameservers owned and operated by ISPs, companies, individuals, etc. Unfortunately, history and direct experience have demonstrated that problems at the third level have a direct impact on the servers in the levels above. Figure C17.10-4 shows an actual example.

In March 2002, the DNS servers of a major Internet portal went offline. When they stopped responding to DNS queries, browsers and DNS resolvers around the world began resubmitting the queries. As lower level DNS caches decayed, these re-queries were directed higher up to the servers at levels one and two. This resulted in a significant increase in DNS queries at the first and second levels. As Figure C17.10-4 shows, the traffic volumes to the DNS constellation serving .com, .net and .org increased significantly.

Figure C17.10-4: March 2002 DNS Incident

Lest there be any thought that such events are rare, in October of 2001, a major ISP had a similar problem. Some, but not all, of its DNS servers went offline. The remaining servers were unable to handle the load. As in the example noted above, transactions at the upper levels were significantly increased. Figure C17.10-5 shows the impact this problem at an ISP had on the global constellation of nameservers for .com, .net and .org.

Figure C17.10-5: October 2001 DNS Incident

These examples raise two questions:

Do other registries see or experience this phenomenon?

How does it present a risk to the stability of the Internet?

The transactions against, and performance of, the DNS servers for .com, .net and .org are monitored with what is believed to be the highest rigor in the Internet community. Whereas some registries, when problems arise, might not see them unless and until their servers crash, a snapshot of each globally deployed .com, .net and .org DNS server is taken every four seconds. Figure C17.10-6 shows a proprietary DNS monitoring Heads-Up-Display. This tool quickly depicts the DNS transaction volume and server performance at each individual site, as well as rolling the data up into a global view.

Figure C17.10-6: gTLD Heads-Up Display

Figure C17.10-7 depicts another proprietary monitoring tool that displays the details of one four-second snapshot of one of the 13 global nameserver sites (one site actually contains multiple load-balanced nameservers).

Figure C17.10-7: Four Second Snapshot of a.gtld-servers.net

This proprietary capability allows staff to see the top-20 DNS transaction generators at each DNS site. Using this tool, it is possible to pinpoint the exact cause of sudden increases in DNS transactions. In the examples cited above, it was possible to determine the cause of the increases and notify the entities that were the source of the problem before those entities were even aware of their own problems.
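The monitoring tools themselves are proprietary and their internals are not described here. Purely as an illustration of the idea of ranking the top transaction generators in one snapshot, the following sketch counts queries per source address. The function name and the input format (one source address per observed query) are assumptions, and the addresses in the usage note are reserved documentation addresses.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_generators(query_sources: Iterable[str], n: int = 20) -> List[Tuple[str, int]]:
    """Return the n busiest query sources in one snapshot, busiest first.

    query_sources holds one source address per DNS query observed during the
    snapshot window (a hypothetical input format).
    """
    return Counter(query_sources).most_common(n)
```

For example, given a snapshot in which `192.0.2.1` sent three queries and `192.0.2.7` sent two, `top_generators(snapshot, n=2)` lists `192.0.2.1` first. A source whose count suddenly dominates the ranking is a natural candidate for the kind of follow-up described above.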

Why does this phenomenon present a potential risk for the stability of the Internet? The answer to this question is most clearly demonstrated by another actual example of a major Internet portal that had problems with its DNS servers in January 2001. All four of its DNS nameservers were on the same network segment. When the network segment failed, DNS resolution for the portal ceased. As lower level DNS cache decayed, DNS transactions on the global .com, .net and .org DNS constellation increased from an average of 60,000 per second to more than 350,000 per second. When the portal fixed its network problem several hours later and attempted to restart its DNS nameservers, there were so many DNS transactions waiting that the nameservers were immediately overwhelmed and crashed. The portal had to stand up an additional eight nameservers just to handle the load and restore its portal services.

These examples are just as applicable for a registry provider as they are for a major ISP or portal. A registry's DNS constellation is subject to significant transaction loads due to problems and issues that are outside of its control. If its DNS nameservers are not designed to handle unforeseen loads, they will either stop answering queries, forcing re-queries at the next level up (the Internet root servers, which are thereby put at risk), or they will crash. If they crash, the registry operator is at risk of having the same experience as in the January 2001 portal example, and may be unable to get its DNS servers back online without adding more emergency capacity. This situation puts not only the entire TLD at risk, but the Internet root servers as well. Therefore, it is absolutely critical to the stability of the Internet that the winning candidate for the .org registry is able to demonstrate beyond any doubt that it has a DNS constellation capable of absorbing massive and sudden increases in transaction volumes.

The UIA Team's existing constellation of DNS nameservers is capable of handling more than 500,000 DNS transactions per second, even though the normal load is less than 100,000 per second. When the ATLAS platform (discussed in detail in Section C17.4) is deployed, each normal site will be able to handle 200,000 transactions per second, with the "super" sites handling more than 1 million transactions per second. This is an aggregate global capability of more than 5 million transactions per second.
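To put these capacities in context, they can be compared with the load figures from the January 2001 portal example earlier in this section (an average of 60,000 transactions per second, rising to more than 350,000). The sketch below is simple arithmetic on those quoted numbers. Because the source states most of them as "more than", the ratios are approximate, and the names used are illustrative only.

```python
def headroom(capacity_per_sec: float, load_per_sec: float) -> float:
    """Ratio of available capacity to offered load; above 1.0 means spare capacity."""
    return capacity_per_sec / load_per_sec


# Figures quoted in this section, all in DNS transactions per second.
EXISTING_CAPACITY = 500_000   # existing constellation ("more than")
ATLAS_CAPACITY = 5_000_000    # aggregate after ATLAS deployment ("more than")
AVERAGE_LOAD = 60_000         # average before the January 2001 incident
INCIDENT_PEAK = 350_000       # peak during the January 2001 incident ("more than")

print(f"Existing vs. average load:  {headroom(EXISTING_CAPACITY, AVERAGE_LOAD):.1f}x")
print(f"Existing vs. incident peak: {headroom(EXISTING_CAPACITY, INCIDENT_PEAK):.1f}x")
print(f"ATLAS vs. incident peak:    {headroom(ATLAS_CAPACITY, INCIDENT_PEAK):.1f}x")
```

On these figures the existing constellation has roughly eight times the average load in reserve and still exceeds the January 2001 peak, while the ATLAS aggregate is more than an order of magnitude above that peak.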
