Recent experience has demonstrated that all registries have had problems with
unexpected, larger-than-projected demand for registrations. For the new gTLD
registries, these problems were apparent in system failures associated with the
start-up period of land-rush registrations. For the UIA Team, this phenomenon
became apparent in the high demand for desirable domain name registrations
in .com, .net and .org that are deleted and reenter the
registration market. Handling high volumes is a difficult enough challenge by
itself, but doing so while satisfying equivalent access requirements is a
daunting task. In September 2001, an architecture was implemented that protects
the .org registry database from high transaction volumes, while maintaining
equivalent access for all registrars. This architecture is shown in Figure
C17.10-1.
Figure C17.10-1: Redundant Multi-Pool Architecture
Registration traffic is driven by three distinct categories of users. First
and foremost, there are those who use the Internet for legitimate business or
private purposes. They have web sites that must be maintained. They register
domain names because they use the domain names. When they need to modify their
registration data (e.g., their primary and/or secondary nameservers), they need
to do it in order to keep their web identity up and running and current.
Coincidentally, these users generate the least amount of registration traffic,
although it is the most critical to the usefulness and stability of the
Internet. Secondly, there are domain name speculators. For reasons often known
only to them, they are constantly attempting to obtain what they believe are
desirable and potentially valuable domain name registrations. Their quest to
obtain these registrations generates extremely high registration traffic volumes
as they vie with each other to be the first to obtain a desirable registration
that is either newly available or is being abandoned by someone else. Finally,
there are those who would hold domain registrations hostage. These people or
entities are often called "cyber-squatters". They make a business of
obtaining abandoned registrations (often abandoned unintentionally due to
payment confusion with the registrar), then directing the domains to pornography
web sites and demanding a high price from the original owner to have them
restored. Although this is a dirty business, it is unfortunately quite prevalent
today and is a source of high transaction volumes against registry systems.
Registrars who cater to these last two categories of users can monopolize the
connections to, and capacity of, a registration system that lacks a workable
quality-of-service (QoS) architecture, making it extremely difficult for smaller
registrars who focus on the first category to support the most legitimate
Internet use.
In September 2001, a multi-pool model for RRP traffic was implemented for the
current .org registry database. The UIA Team proposes to use this same
multi-pool model for the .org TLD. It will protect the .org registry database
during periods of heavy transaction volumes. It will also protect the equivalent
access rights of each registrar. Additionally, it will enable the most
legitimate category of registrant to conduct the necessary business to protect
the usefulness and stability of the Internet. This type of solution is critical
for the .org registry operations because of the large number of .org domain
names that are deleted each month. Most new registries have not yet had to face
this problem, but the new registry operator for .org will have to deal with it
beginning on day one.
In the multi-pool architecture, each registrar will be guaranteed a minimum
number of connections in the Guarantee Pool. For 75% of registrars, this is
sufficient to conduct all of their business. Throughput is fast and access is
guaranteed. Registrars that require additional connections to support a larger
base of business and private registrants can obtain additional connections from
the Overflow Pool. The purpose of the Overflow Pool is to provide equivalent
access in an environment where the connection requirements of registrars vary
widely depending on the size of their market base. A registrar with 5 million
registrations clearly needs more connections to support its customers than a
registrar with 500 registrations. Yet the Equivalent Access clause of the
registry agreement requires that all registrars be given equivalent access to
the SRS. With the Overflow Pool, the additional connections (up to a
predetermined maximum) are "there if you need them" for any registrar.
The Overflow Pool will be closely monitored to determine if additional
connections are needed.
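The interaction between the Guarantee Pool and the Overflow Pool can be illustrated with a minimal sketch. This is not the registry's implementation: the class name, the field names and every pool size below are assumptions chosen for illustration. It shows only the admission order described above, in which a registrar first uses its guaranteed connections and then draws on the shared Overflow Pool up to a predetermined maximum.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionPools:
    """Toy model of the Guarantee and Overflow pools (all sizes are assumed)."""

    guaranteed_per_registrar: int    # minimum connections reserved per registrar
    overflow_capacity: int           # shared connections in the Overflow Pool
    overflow_max_per_registrar: int  # the "predetermined maximum" per registrar
    guaranteed_used: dict = field(default_factory=dict)
    overflow_used: dict = field(default_factory=dict)

    def acquire(self, registrar: str) -> str:
        """Return the pool that served the connection, or 'rejected'."""
        held = self.guaranteed_used.get(registrar, 0)
        if held < self.guaranteed_per_registrar:
            # Guaranteed connections are always available to their owner.
            self.guaranteed_used[registrar] = held + 1
            return "guarantee"

        extra = self.overflow_used.get(registrar, 0)
        overflow_in_use = sum(self.overflow_used.values())
        if (extra < self.overflow_max_per_registrar
                and overflow_in_use < self.overflow_capacity):
            # Overflow connections are shared, but capped per registrar.
            self.overflow_used[registrar] = extra + 1
            return "overflow"

        return "rejected"
```

With two guaranteed connections per registrar, a registrar's third request is served from the Overflow Pool, while a small registrar that never exceeds two connections is unaffected by how busy the Overflow Pool is.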
Figure C17.10-2: Distribution of RRP Transactions in Multiple Pools
Finally, the Automated Batch Pool will support registrars whose business
model caters to speculators, cyber-squatters or business partners who generate
large volumes of transactions. These registrars and their business partners have
made a science of engineering systems that thrust as many transactions as
technically possible to the registry. In the Automated Batch Pool, participating
registrars will be collectively guaranteed a predetermined number of connections
and bandwidth. However, the total throughput (bandwidth) in this pool will be
capped and distributed equally among all who participate. Registrars will
receive either their maximum bandwidth (the same for all registrars) or an equal
share of the total available bandwidth, whichever is less. Therefore, if two
registrars are sending transactions to the Automated Batch Pool, they will each
receive their maximum bandwidth. If 50 registrars are sending transactions, they
will each get 1/50th of the total available bandwidth. As should be expected,
and as is demonstrated in Figure C17.10-2, the traffic volume in the Automated
Batch Pool dwarfs the volumes in the other pools.
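The Automated Batch Pool allocation rule can be written as a one-line calculation. The sketch below is illustrative only: the function name and the bandwidth figures in the usage note are assumptions, not registry parameters. It takes the smaller of the per-registrar maximum and an equal share of the pool total, which is what the two worked cases in the text imply (two registrars each at their maximum, 50 registrars each at 1/50th of the total).

```python
def batch_pool_share(total_bandwidth: float,
                     per_registrar_max: float,
                     active_registrars: int) -> float:
    """Bandwidth available to each registrar active in the Automated Batch Pool.

    Each registrar gets an equal share of the capped pool total, but never
    more than the per-registrar maximum.
    """
    if active_registrars <= 0:
        return 0.0
    equal_share = total_bandwidth / active_registrars
    return min(per_registrar_max, equal_share)
```

For example, with an assumed pool total of 1000 units and a per-registrar maximum of 100 units, two active registrars each receive 100 units, while 50 active registrars each receive 20 units (1/50th of the total).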
A QoS architecture provides the ability to "tune" the transaction
volume permitted into the .org registry. To date, it has handled up to 120,000
transactions per minute, and been tested to 300,000 transactions per minute,
without sacrificing SLAs or causing undue system load on application and
database servers. Additionally, the .org registry database architecture permits
transaction load to be balanced across multiple servers. If additional capacity
is required, additional servers can be added behind the load-balancers. Back-end
support systems (back-ups, escrow, maintenance, etc.) are all sufficiently
resourced to handle the maximum transaction volume permitted into the .org
registry database. The current .org registry database has historically
demonstrated the ability to quickly scale to meet unforeseen increases in
transaction volume. In April 2000, the daily transaction rate increased from 5
million transactions per day to 25 million transactions per day. This increase
occurred over a period of 48 hours. Since then, the transaction volume has
continued to increase to its current 150 million transactions per day. Handling
this increase has been accomplished without the need to increase staff or
redesign the system. The current .org database is designed and proven to be not
only highly robust and reliable, but also quickly extensible. Extensibility is
key to the stability of the .org registry. The current CPU load on the database
server is typically less than 10%. Even during peak periods (more than 300,000
transactions per minute), the CPU load remains below 50%.
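The volume figures above mix per-day and per-minute units, so a short conversion helps when comparing them. The sketch below uses only numbers quoted in the text; the function names are illustrative, and averaging a daily total over 1,440 minutes is a simplifying assumption that hides intraday peaks.

```python
MINUTES_PER_DAY = 24 * 60


def per_minute(transactions_per_day: float) -> float:
    """Average per-minute rate implied by a daily transaction total."""
    return transactions_per_day / MINUTES_PER_DAY


def per_day(transactions_per_minute: float) -> float:
    """Daily total implied by a sustained per-minute rate."""
    return transactions_per_minute * MINUTES_PER_DAY


# Daily volumes quoted in the text.
before_april_2000 = 5_000_000
after_april_2000 = 25_000_000
current = 150_000_000

print(f"April 2000 growth factor: {after_april_2000 / before_april_2000:.0f}x")
print(f"Current average rate:     {per_minute(current):,.0f} per minute")
print(f"Tested rate, sustained:   {per_day(300_000):,.0f} per day")
```

On these figures the current daily volume averages roughly 104,000 transactions per minute, and the tested rate of 300,000 per minute would correspond to 432 million transactions per day if sustained.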
The RFP does not require submitters to discuss or address the single most
critical element of performance and capacity that can impact Internet
stability: the number and throughput of DNS nameservers. If a registration system
is offline, new registrations and changes to existing registrations cannot be
accommodated. Certainly this is not desirable, but the Internet continues to
work. If DNS nameservers are offline, Internet resolution ceases as DNS cache
decays. The impact of such an event goes far beyond the stability of a specific
TLD and the top-level DNS servers for that TLD. As lower-level DNS cache decays,
the stability of the entire Internet can be put at risk. Internet stability is
ICANN's primary objective, and therefore the architecture, capacity and proven
performance of a candidate registry's DNS constellation should be of paramount
concern.
Figure C17.10-3: Internet DNS Architecture
As Figure C17.10-3 indicates, Internet DNS servers fall into three basic
levels. At the top level are the 13 Internet root servers. At the middle level
are the top-level DNS servers for each of the TLDs. At the bottom level are all
the rest of the Internet DNS nameservers owned and operated by ISPs, companies,
individuals, etc. Unfortunately, history and direct experience have demonstrated
that problems at the third level have a direct impact on the servers in the
levels above. Figure C17.10-4 shows an actual example.
In March 2002, the DNS servers of a major Internet portal went offline. When
they stopped responding to DNS queries, browsers and DNS resolvers around the
world began resubmitting the queries. As lower level DNS caches decayed, these
re-queries were directed higher up to the servers at levels one and two. This
resulted in a significant increase in DNS queries at the first and second
levels. As Figure C17.10-4 shows, the traffic volumes to the DNS constellation
serving .com, .net and .org increased significantly.
Figure C17.10-4: March 2002 DNS Incident
Lest there be any thought that such events are rare, in October 2001, a
major ISP had a similar problem. Some, but not all, of their DNS servers went
offline. The remaining servers were unable to handle the load. As in the example
noted above, transactions at the upper levels were significantly increased.
Figure C17.10-5 shows the impact this problem at an ISP had on the global
constellation of nameservers for .com, .net and .org.
Figure C17.10-5: October 2001 DNS Incident
These examples raise two questions:
- Do other registries see or experience this phenomenon?
- How does it present a risk to the stability of the Internet?
The transactions against, and performance of, the DNS servers for .com, .net
and .org are monitored with what is believed to be the highest rigor in the
Internet community. Whereas some registries, when problems arise, might not see
them unless and until their servers crash, a snapshot of each globally deployed
.com, .net and .org DNS server is taken every four seconds. Figure C17.10-6
shows a proprietary DNS monitoring Heads-Up-Display. This tool quickly depicts
the DNS transaction volume and server performance at each individual site, as
well as rolling the data up into a global view.
Figure C17.10-6: gTLD Heads-Up Display
Figure C17.10-7 depicts another proprietary monitoring tool that displays the
details of one four-second snapshot of one of the 13 global nameserver sites
(one site actually contains multiple load-balanced nameservers).
Figure C17.10-7: Four Second Snapshot of a.gtld-servers.net
This proprietary capability allows staff to see the top-20 DNS transaction
generators at each of the DNS sites. Using this tool, it is possible to pinpoint
the exact cause of sudden increases in DNS transactions. In the examples cited
above, it was possible to determine the cause of the increases and notify the
entities that were the source of the problem before those entities were even
aware of their own problems.
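The monitoring tools themselves are proprietary and their internals are not described here, but the idea of ranking the top transaction generators in a snapshot can be sketched in a few lines. The function name and the input format (one source address per query observed in the snapshot window) are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_generators(query_sources: Iterable[str],
                   n: int = 20) -> List[Tuple[str, int]]:
    """Return the n sources that sent the most queries in one snapshot.

    query_sources holds one entry per DNS query seen during the snapshot,
    identifying the resolver that sent it (hypothetical input format).
    """
    return Counter(query_sources).most_common(n)
```

A source that suddenly climbs to the top of this ranking across consecutive snapshots is a natural first candidate when investigating an unexplained jump in query volume.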
Why does this phenomenon present a potential risk for the stability of the
Internet? The answer to this question is most clearly demonstrated by another
actual example of a major Internet portal that had problems with its DNS servers
in January 2001. All four of its DNS nameservers were on the same network
segment. When the network segment failed, DNS resolution for the portal ceased.
As lower level DNS cache decayed, DNS transactions on the global .com, .net and
.org DNS constellation increased from an average of 60,000 per second to more
than 350,000 per second. When the portal fixed its network problem several hours
later and attempted to restart its DNS nameservers, there were so many DNS
transactions waiting that its nameservers were unable to handle the load and
were immediately overwhelmed and crashed. The portal had to stand up an
additional eight nameservers just to handle the load and restore its portal
services.
These examples are just as applicable for a registry provider as they are for
a major ISP or portal. A registry's DNS constellation is subject to significant
transaction loads due to problems and issues that are outside of its control. If
its DNS nameservers are not designed to handle unforeseen loads, they will
either stop answering queries, forcing re-queries at the next level up - the
Internet root servers (thus putting those servers at risk) - or they will crash.
If they crash, the registry operator is at risk of having the same experience as
in the January 2001 portal example, and may be unable to get its DNS servers
back online without adding more emergency capacity. This situation puts not only
the entire TLD at risk, but the Internet root servers as well. Therefore, it is
absolutely critical to the stability of the Internet that the winning candidate
for the .org registry is able to demonstrate beyond any doubt that it has a
DNS constellation capable of absorbing massive and sudden increases in
transaction volumes.
The UIA Team's existing constellation of DNS nameservers is capable of
handling more than 500,000 DNS transactions per second, even though the normal
load is less than 100,000 per second. When the ATLAS platform (discussed in
detail in Section C17.4) is deployed, each normal site will be able to handle
200,000 transactions per second, with the "super" sites handling more
than 1 million transactions per second. This is an aggregate global capability
of more than 5 million transactions per second.
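The capacity figures can be set against the January 2001 incident described earlier with simple arithmetic. The sketch below uses only numbers quoted in the text, each of which is stated as a lower bound ("more than"), so the ratios are approximate; the function name is illustrative.

```python
def ratio(numerator_tps: float, denominator_tps: float) -> float:
    """Ratio of two transactions-per-second figures."""
    return numerator_tps / denominator_tps


# Figures quoted in the text (transactions per second).
jan_2001_average = 60_000     # constellation average before the incident
jan_2001_peak = 350_000       # peak during the incident
existing_capacity = 500_000   # existing DNS constellation
atlas_aggregate = 5_000_000   # projected aggregate with ATLAS

surge = ratio(jan_2001_peak, jan_2001_average)
existing_headroom = ratio(existing_capacity, jan_2001_peak)
atlas_headroom = ratio(atlas_aggregate, jan_2001_peak)

print(f"Surge during incident:      {surge:.2f}x the average load")
print(f"Existing capacity vs. peak: {existing_headroom:.2f}x")
print(f"ATLAS aggregate vs. peak:   {atlas_headroom:.2f}x")
```

On these figures the incident raised load to nearly six times the average, the existing constellation's stated capacity exceeds that peak by a factor of about 1.4, and the projected ATLAS aggregate exceeds it by a factor of about 14.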