Registry Advantage Test Results

Table of Contents

Summary
Purpose
Definitions
  Boundary Testing
  Capacity Testing
  Compatibility Testing
  Fault Testing
  Regression Testing
Validation Criteria
Testing Procedures
  RA-DNS Cluster Capacity
  SRS Cluster Capacity
  Whois Cluster Capacity
  RA-DNS Functionality
  SRS Functionality
  Whois Functionality
  Oracle High Availability Testing
  Network and Data Interconnects High Availability
  Redundancy for Remote Satellites
Conclusion

Summary

The purpose of this paper is to describe the automated and manual tests that were conducted to validate the functionality, performance and scalability of the Registry Advantage infrastructure and applications.

To maximize the validity of the tests, the full .org dataset was loaded into the registry database. In addition, the hardware and network configurations used in the tests were those that are already in place at Registry Advantage’s primary data center and are proposed for operation of the .org registry (for details, see C17.1). The test methodology utilized strict definitions of test criteria and expected results as a priori inputs to each test case. The expected test results were established based on several triangulated inputs wherever possible, e.g., ICANN advisories, SnapNames’ State of the Domain reports, Registry Advantage’s experience as a registry outsourcing provider, and public information supplied by VeriSign.

The SRS tests included queries of all the major types (adds, checks, infos, deletes) combined into test sets that were representative of ‘typical’ loads as well as atypical ‘add storm’ loads.  The DNS tests covered a range of .org query mixes (successful queries, failed queries, malformed packets, etc.) conducted in a random order to determine maximum queries per second (q/s) and round trip times (rtt). The Whois tests were performed using a similar methodology to determine maximum queries per second and round trip times.

The following selected key test results were achieved:

  • SRS Capacity: In a composite test of different query types, Registry Advantage’s SRS achieved transactions-per-second results of 202 adds, 776 checks, 726 changes, 724 deletes and 772 info queries.  Under an ‘add storm’ test configuration, 1084 successful checks per second were performed. This translates to a peak capacity of 3.9 million checks per hour in this test cluster of 3 SRS servers.  Additionally, Registry Advantage normally runs six SRS servers per cluster in a 2N configuration, where N=3. This drives the total transactional capacity of the non-degraded cluster even further above the expected peak requirements (approximately double).
  • DNS Capacity: Maximum server capacity of over 18,737 successful DNS queries per second with an average round trip response (excluding network factors) of 1.54 ms.
  • Whois: Achieved a capacity of 272 q/s per server for successful queries, or over 815 q/s for the cluster of three machines, which exceeded expected peak requirements by 132%. Round trip response time under this load was a mere 0.005 ms. 

In addition, this document provides the results of high availability tests of the hardware, database, network and data interconnects.

Purpose

The purpose of this paper is to describe how the Registry Advantage infrastructure and applications were validated by automated and manual testing.  The definitions and procedures in this document were used for conducting and analyzing the validation process.

Definitions

Boundary Testing

A boundary test is one that approaches a known limit.  As the limit is approached, behavior is well known.  At the limit, and beyond, behavior may be undefined.  This type of test is designed to ensure that a system behaves normally at least until the known limits are reached.  This is sometimes referred to as "limit testing".

Capacity Testing

A capacity test is one that determines the capacity of a system or service by measuring its ability to process data in terms of total volumes in a given time quantum.  An example of this may be the total number of DNS queries a system can process per second.  Resource sizing is also a possible metric.  For example, the total number of records a database can store.

Compatibility Testing

A compatibility test is one that determines whether or not a system is compatible with another system, an existing implementation of the same system, a published API, or a known protocol.

Fault Testing

This type of testing is used to determine the behavior of a system during a fault condition.  These fault conditions may be intentionally introduced hardware failures (or simulated failures) as well as broken interconnects and inoperable services.

Regression Testing

A regression test is conducted in such a way that all functionality for a system is tested.  This differs from "unit testing" or "modular testing" in which only a specific system module, or set of functionality, is tested.  The significance of regression testing should not be underestimated.  A simple change to a system may pass its unit test, but have side effects on other parts of the system that will only be found with regression testing.  The regression testing conducted by Registry Advantage includes the sum of all individual tests and test cases described in this document.

Validation Criteria

For each area of testing, the validation criteria were taken from specific known behavioral patterns, or from industry norms where specific data was not known in advance.  These were used as the expected results documented for that test.  If the actual results did not fall within the accepted results set, the test failed and remedial action was taken.  If the actual results were within the expected results set, the test passed.  In all cases the validation criteria were documented for each test in advance.

Validation for a test suite was given only when all tests for that suite passed successfully in the same testing cycle.

Whois

Data Points:  An e-mail from VeriSign’s Scott Hollenbeck to the ietf-whois mailing list in January 2001, indicated that the VeriSign registry’s Whois servers performed approximately 30 million queries per day. [1]   Dividing the queries equally over the course of a day yields a rate of 347 queries per second.  This rate applies to total queries done for all .com, .net, and .org (CNO) domain names.

Typical Rate:  Registry Advantage presumed that the percentage of queries related to .org domain names would be roughly equivalent to the portion of .org names in the CNO database, or approximately 10%.  This analysis suggests a typical Whois query rate of approximately 35 queries per second for .org names.

Peak Rate:  Based on its experience as a registry operator, as well as on the experience of its parent company, Register.com, which provides Whois service for over two million domain names, Registry Advantage has determined that peak Whois query rates will be as high as ten times the average rate.  Consequently, the peak Whois query rate for .org is estimated to be approximately 350 queries per second.
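
For illustration, the short sketch below simply restates this arithmetic using the figures quoted above (30 million queries per day, an assumed 10% .org share, and an assumed 10x peak-to-average ratio); it is a worked example only.

    # Worked restatement of the Whois rate estimate above (illustrative only).
    QUERIES_PER_DAY = 30_000_000      # VeriSign CNO Whois queries per day [1]
    SECONDS_PER_DAY = 24 * 60 * 60
    ORG_SHARE = 0.10                  # assumed .org portion of the CNO database
    PEAK_MULTIPLIER = 10              # assumed peak-to-average ratio

    cno_rate = QUERIES_PER_DAY / SECONDS_PER_DAY     # ~347 q/s for all of CNO
    org_typical = cno_rate * ORG_SHARE               # ~35 q/s typical for .org
    org_peak = org_typical * PEAK_MULTIPLIER         # ~347 q/s, rounded to ~350 in the text

    print(f"{cno_rate:.0f} {org_typical:.0f} {org_peak:.0f}")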

DNS

Data Points:  A September 2000 ICANN statement on GTLD registry best practices indicated that the “A” root server, which was also acting as a GTLD server at the time, handled approximately 5000 DNS queries per second, with peaks as high as 8000 queries per second. [2] Similarly, a presentation by David Conrad of Nominum to the ITU in January 2001, indicated that moving GTLD data from the root servers to independent name servers resulted in a shift of approximately 5000 queries per second to the GTLD servers. [3]

Typical Rate:  Assuming all 13 root servers experienced loads typical of the “A” root server, Registry Advantage calculated that the total number of DNS queries per second to the root server constellation was approximately 65,000 on average.  Registry Advantage assumed that growth in DNS queries would roughly correlate to growth in the total number of hosts on the Internet.  Telcordia Technologies’ Netsizer tool [4] indicates that in September 2000 there were 91.2 million hosts, growing to 189.8 million hosts in May 2002, a ratio of roughly two to one.  Consequently, to account for growth in the number of DNS queries since September 2000, Registry Advantage doubled this number, yielding a typical requirement of 130,000 queries per second across the entire DNS constellation.  Once again, the assumption was made that approximately 10% of the typical CNO traffic would apply to .org queries, resulting in a system-wide typical requirement of 13,000 queries per second.  This requirement is spread over a total of eight locations, resulting in a typical requirement of 1625 queries per second at each site.

Maximum Rate:  From the ICANN data point, Registry Advantage knew that the maximum rate of transactions experienced by the “A” server was approximately 60% greater than the typical rate, resulting in a value of 2600 queries per second.  However, because it is possible that traffic may not be evenly distributed across all sites, Registry Advantage nearly doubled the expected maximum rate, to 5000 queries per second, anticipating that this would be the peak load at the site with the greatest amount of traffic.
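
For clarity, the same derivation can be expressed as a short calculation; the sketch below is purely illustrative and uses only the figures and assumptions quoted above.

    # Worked restatement of the DNS rate estimate above (illustrative only).
    ROOT_SERVERS = 13
    A_ROOT_TYPICAL = 5000             # q/s at the "A" root server, Sept. 2000 [2]
    A_ROOT_PEAK = 8000                # peak q/s at the "A" root server [2]
    HOST_GROWTH = 2.0                 # host count roughly doubled, Sept. 2000 to May 2002 [4]
    ORG_SHARE = 0.10                  # assumed .org portion of CNO traffic
    SITES = 8                         # Registry Advantage DNS locations

    constellation = ROOT_SERVERS * A_ROOT_TYPICAL * HOST_GROWTH        # 130,000 q/s
    org_total = constellation * ORG_SHARE                              # 13,000 q/s
    per_site_typical = org_total / SITES                               # 1,625 q/s
    per_site_peak = per_site_typical * (A_ROOT_PEAK / A_ROOT_TYPICAL)  # 2,600 q/s
    design_peak = 5000   # nearly doubled to allow for uneven distribution across sites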

SRS

Data Points:  Due to a lack of sufficiently granular data, creating typical data sets for testing purposes, as well as expected results, was challenging.  Registry Advantage correlated numerous data points in order to make an estimate.  The total number of domain names in the .org database, approximately 2,700,000, was used as a baseline against which the number of domain registration events was calculated.  VeriSign’s presentation to the North American Network Operator’s Group (NANOG) in February 2002 [5] provided raw data for the number of CNO failed write and check domain events as of December 2001 at 420 million and 3.6 billion, respectively.  Other data sources were used to cross-check various assumptions and generate the full “typical load” test data set.  These included various SnapNames State of the Domain Reports (SOTD), as well as ICANN’s Second Advisory Concerning Equitable Allocation of Shared Registration System Resources [6] .

Typical Rate:  For the purposes of this testing, add and renew commands were considered to be identical as they have similar impacts on the registry systems.  Registry Advantage derived the typical number of add commands based on the total number of .org domain names.  Although the total number of .org domain names is slowly declining, the rate was considered to be low enough that for the purposes of determining these rates, Registry Advantage assumed that all domains would be either renewed or re-registered in the month that they expired.  An additional assumption was made that the renewal dates of these domains were evenly spread throughout the calendar year.  On this basis, it was determined that approximately 225,000 registrations were re-registered or renewed each month.  Registry Advantage further assumed that registrations would only occur on one of approximately 20 business days per month, and that all registration events would occur within a twelve hour period per day.  These assumptions yield a rate of 938 registrations per hour, or approximately one every four seconds.

To determine the typical rate of check commands, Registry Advantage noted that the “add storm” events that began in approximately June 2001 seemed to be responsible for roughly two-thirds of all check commands, or 2.4 billion of the monthly total, leaving 1.2 billion check commands as part of the typical usage pattern.  Once again, only about 10% of the total CNO usage can be attributed to .org, resulting in a total volume of 120 million checks associated with .org names. According to the ICANN advisory cited above, add storm activity was concentrated within a four hour window each day.  Registry Advantage spread the 120 million check commands across the remaining 20 hours of each day, resulting in an hourly rate of 200,000 check commands, or 56 per second.

In a similar vein, Registry Advantage analyzed the data to determine the expected typical number of changes, deletes and info transactions. Registry Advantage derived the number of change transactions based on its knowledge that the ratio of adds to changes in its ccTLD registries is approximately 2:1. The expected deletes were based on the turnover and renewal rates typical with com/net/org: approximately 50% of adds and renewals. The number of info queries was based on the data from VeriSign’s North American Network Operator’s Group presentation. Registry Advantage distributed info commands in the same manner as the check commands, deriving expected typical rates of 8 per second.

Peak Rates:  Peak rates of domain creation were assumed to occur in situations in which registrars were performing batch processing of registration events.  Consequently, the peak rate is unlikely to be correlated to externally observable trends.  Based on its experience as a registry operator and previous experience of various staff members working at its parent company, Register.com, Registry Advantage estimated that peak rates during these batched events would be unlikely to exceed fifty registration events per second.

Peak check command rates were derived by spreading the 240 million add storm check commands attributable to .org across the remaining four hours of each day, resulting in a rate of 2,000,000 checks per hour, or 560 checks per second.  Registry Advantage believes that this estimate may significantly overstate the peak requirement, as add storm activity is likely to relate disproportionately to .com domains, which are considered more valuable in the secondary market.

The peak expected changes and deletes were considered to be potentially similar to the peak volumes of add transactions. For info transactions, based on the VeriSign North American Network Operator’s Group data, the peak number of info commands is expected to be approximately 25 per second.
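
The add and check rate assumptions above reduce to a short calculation; the following sketch restates that arithmetic for illustration only, using the figures quoted in this section.

    # Worked restatement of the SRS rate assumptions above (illustrative only).
    ORG_NAMES = 2_700_000
    MONTHS = 12
    BUSINESS_DAYS = 20
    REG_HOURS_PER_DAY = 12            # assumed daily registration window

    adds_per_month = ORG_NAMES / MONTHS                                 # ~225,000
    adds_per_hour = adds_per_month / BUSINESS_DAYS / REG_HOURS_PER_DAY  # ~938
    seconds_between_adds = 3600 / adds_per_hour                         # ~4 seconds

    CNO_CHECKS_PER_MONTH = 3_600_000_000   # VeriSign NANOG data, December 2001 [5]
    ADD_STORM_FRACTION = 2 / 3             # share of checks attributed to add storms
    ORG_SHARE = 0.10
    DAYS_PER_MONTH = 30

    typical_org_checks = CNO_CHECKS_PER_MONTH * (1 - ADD_STORM_FRACTION) * ORG_SHARE
    typical_checks_per_sec = typical_org_checks / DAYS_PER_MONTH / 20 / 3600  # ~56/s, 20-hour window

    storm_org_checks = CNO_CHECKS_PER_MONTH * ADD_STORM_FRACTION * ORG_SHARE
    peak_checks_per_sec = storm_org_checks / DAYS_PER_MONTH / 4 / 3600        # ~560/s, 4-hour storm window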

Testing Procedures

Each of these areas was tested with its own test suite:

  • RA-DNS Cluster Capacity
  • SRS Cluster Capacity
  • Whois Cluster Capacity
  • RA-DNS Functionality
  • SRS Functionality
  • Whois Functionality
  • Oracle High Availability Testing
  • Network and Data Interconnects High Availability
  • Redundancy for Remote Satellites

RA-DNS Cluster Capacity

Registry Advantage has developed a proprietary DNS server (RA-DNS) designed specifically for extended capacities.  The testing done against this server included a dataset of 5 ccTLDs, as well as the most recently available .ORG dataset.  The number of queries per second and the average round trip time per query (measured in milliseconds) were used as the metrics for these tests.  The expected results were taken from leading industry service metrics wherever possible; many of these metrics were not available prior to this test, however, and so were extrapolated from the available data we could obtain.

Each test was performed with 400 query clients to generate the load.  The MAX success test was done by having every domain in .ORG randomly queried a total of 4 times each across all 400 clients.  The remaining tests used both .ORG domain data and manually created data to produce desired query mixes (such as failed queries where the domains are not present in the zone, and malformed packets).
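
For illustration, the sketch below shows the general shape of such a query client, measuring completed queries per second and mean round trip time.  It is not the proprietary load harness actually used; it assumes the third-party dnspython library, and the server address and name list in the usage note are placeholders.

    # Illustrative sketch of a DNS load client (not the actual Registry Advantage
    # harness).  Assumes dnspython and a caller-supplied list of .ORG names.
    import random
    import time

    import dns.exception
    import dns.message
    import dns.query

    def run_client(server_ip, names, duration=10.0):
        """Send random A queries for `duration` seconds; return (q/s, mean rtt in ms)."""
        completed = 0
        total_rtt = 0.0
        deadline = time.monotonic() + duration
        while time.monotonic() < deadline:
            query = dns.message.make_query(random.choice(names), "A")
            start = time.monotonic()
            try:
                dns.query.udp(query, server_ip, timeout=1.0)
            except dns.exception.Timeout:
                continue                      # only completed queries are counted
            total_rtt += time.monotonic() - start
            completed += 1
        return completed / duration, (total_rtt / completed * 1000.0) if completed else 0.0

    # In the tests described here, 400 such clients ran in parallel and their
    # per-client results were aggregated, e.g.:
    # qps, rtt_ms = run_client("192.0.2.1", ["example.org.", "icann.org."])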

The following tests were performed:

Test | Expected Peaks | Actual Results | Pass (Y/N)
MAX successful queries for random .ORG names | > 5000 q/s, < 500 ms rtt | 17,513.74 q/s, 1.0463 ms rtt | Y
MAX queries where 20% names did not exist | > 5000 q/s, < 500 ms rtt | 15,564.21 q/s, 2.81 ms rtt | Y
MAX queries with the same name over and over again | > 5000 q/s, < 500 ms rtt | 18,737.08 q/s, 1.54 ms rtt | Y
MAX queries with 80% of the queries from the top 5% of the names and 20% queries from the bottom 95% | > 5000 q/s, < 500 ms rtt | 18,202.17 q/s, 1.94 ms rtt | Y
MAX queries when 1% of the packets are malformed | > 5000 q/s, < 500 ms rtt | 15,284.33 q/s, 2.56 ms rtt | Y
MAX queries when 10% of the packets are malformed | > 5000 q/s, < 500 ms rtt | 15,437.03 q/s, 3.21 ms rtt | Y
MAX queries when 50% of the packets are malformed | > 3000 q/s, < 500 ms rtt | 11,597.04 q/s, 4.08 ms rtt | Y

In the course of our RA-DNS capacity testing, we were unable to fully stress the server, as we did not have sufficient load generator capacity to do so.  These numbers so far exceed the projected requirements, however, that we report them as a client-constrained maximum result set.

SRS Cluster Capacity

The SRS application cluster was tested for capacity in a number of ways.  The metrics used to measure capacity for these tests were taken from publicly available data from ICANN, SnapNames SOTD reports, our own experience as a registrar and registry outsourcing provider, and public information supplied by VeriSign to NANOG.  From NANOG and SOTD data for December 2001, we extrapolated a typical hourly load consisting of approximately 0.03% adds, 10.14% failed creates, 87.15% successful checks, 2.30% infos, 0.12% changes, and 0.06% deletes, and scaled these for today's volumes.  In addition, we considered the impact of an "add storm" as described in the ICANN advisory from August 10, 2001 [7] on the performance of the typical load base case.  The add storm we generated was more than double the volume reported in the ICANN advisory.  All of these figures are in excess of current transaction trends.

We measured the number of successful adds, checks, changes, deletes, and info commands on a test cluster of 3 servers, and observed linear scaling.  We then tested the MAX performance for what is overwhelmingly the most common command – check.  Last, we tested our capacity to support client connections by testing the maximum number of connections per SRS cluster member, with linear scalability.  These expected results are taken from the public information mentioned above.  We also project that even under an add storm of outrageous proportion, due to the linear scaling of our cluster, we would have no problem managing the additional load. 
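
For illustration, a composite load of this kind can be generated by drawing commands according to the extrapolated proportions above; the sketch below is not the actual test harness, and the send_command() helper it mentions is hypothetical.

    # Illustrative generator for the weighted "typical load" command mix above
    # (not the actual test harness; send_command() is a hypothetical helper).
    import random

    COMMAND_MIX = [
        ("add",         0.0003),
        ("failed_add",  0.1014),
        ("check",       0.8715),
        ("info",        0.0230),
        ("change",      0.0012),
        ("delete",      0.0006),
    ]
    COMMANDS, WEIGHTS = zip(*COMMAND_MIX)

    def next_command():
        """Draw the next SRS command type according to the typical-load proportions."""
        return random.choices(COMMANDS, weights=WEIGHTS, k=1)[0]

    # A load client loops for the duration of the test, e.g.:
    # for _ in range(100_000):
    #     send_command(next_command())   # hypothetical transport helper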

These are the tests we performed:

Test | Expected Peak Results (total VeriSign-derived volume requirements) | Actual Results (with 3 SRS servers) | Pass (Y/N)
typical load baseline | 50 add/s, 560 check/s, 50 change/s, 50 delete/s, 25 info/s | 201.9 add/s, 776.31 check/s, 726.19 change/s, 724.46 delete/s, 772.46 info/s | Y
typical load under an add storm | 50 add/s, 560 check/s, 50 change/s, 50 delete/s, 25 info/s | 182.18 add/s, 842.49 check/s, 558.58 change/s, 562.37 delete/s, 579.43 info/s | Y
MAX success check | 560/s | 1083.85/s | Y
MAX failed check | 560/s | 1097.97/s | Y
MAX add storm | 97/s | 1091/s | Y
MAX connections per cluster member | at least 900 SRS connections per box | 1007 per box | Y

These tests were performed against a cluster consisting of three SRS cluster members, with linear scaling in most cases.  The actual complement of SRS cluster members in production is 2N, or six cluster members.  Therefore, in the general case where the cluster is at full strength, performance will be roughly twice the actual results shown.

Whois Cluster Capacity

The Whois application cluster capacity was tested for MAX queries per second under an extreme load, while tracking the round trip times for these queries.  It was then subjected to a similar load where 50% of the queries were for objects not present in the loaded data set.  The expected results for this test come from internal experience and publicly available information on the VeriSign Whois service supporting .COM, .NET, and .ORG (concurrently).  The objective of the test was to achieve at least the peak capacity under atypical peak loads.
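
For illustration, the sketch below shows the general shape of a port 43 Whois load client measuring round trip time per query; it is not the harness actually used, and the server name and query in the usage note are placeholders.

    # Illustrative sketch of a port 43 Whois load client (not the harness used).
    import socket
    import time

    def whois_query(server, name, port=43, timeout=5.0):
        """Send one RFC 954 style query; return (rtt in ms, raw reply bytes)."""
        start = time.monotonic()
        with socket.create_connection((server, port), timeout=timeout) as sock:
            sock.sendall(name.encode("ascii") + b"\r\n")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:                  # server closes after the reply
                    break
                chunks.append(data)
        return (time.monotonic() - start) * 1000.0, b"".join(chunks)

    # A load run drives many such clients in parallel against each cluster member,
    # counting completed queries per second and averaging the round trip time:
    # rtt_ms, reply = whois_query("whois.example.net", "example.org")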

The tests we performed were:

Test | Expected Results (per server) | Actual Results (per server) | Pass (Y/N)
MAX sustained successful query test per cluster member | > 117 q/s, < 500 ms rtt | 189.4 q/s, 0.005 ms rtt | Y
MAX sustained 50% negative response query test per cluster member | > 117 q/s, < 500 ms rtt | 271.57 q/s, 0.004 ms rtt | Y

The Whois cluster members scale nearly linearly in our load balanced configuration.  To support the peak query requirements for the entire Whois service of an estimated 350 queries per second, the standard cluster size of three was found to significantly exceed the performance requirements, with cluster performance exceeding 815 queries per second.  A 2N cluster of six Whois servers with this sizing will be deployed at both the primary and secondary sites to support the peak load requirements.

RA-DNS Functionality

The DNS functionality testing consisted of testing RA-DNS for appropriate response codes and data in all sections of the DNS response packet for various types of requests.  The same test was performed against a BIND server running on identical hardware with identical data as a reference case, in addition to referencing the RFCs.  We also tested RA-DNS with a variety of DNS client resolver platforms to ensure compatibility with each (an illustrative sketch of one malformed-packet probe appears at the end of this subsection).  The tests included:

  • malformed packets with garbage data
  • pointers to beyond the DNS packet
  • cyclical self-referencing pointers
  • other broken compression (e.g. pointers to the middle of a label)
  • inappropriate header bits being set
  • legal DNS labels and FQDNs
  • illegal DNS labels and FQDN (containing illegal characters)
  • maximum DNS packet size
  • exceeding the maximum DNS packet size
  • maximum size of a DNS label
  • exceeding the maximum size of a DNS label
  • maximum FQDN
  • exceeding the maximum FQDN
  • compatibility with generally available resolver platforms
    • Windows (95/98/me/NT/2000/XP)
    • Mac (8.0,8.5,9.0,10.0)
    • Linux (RedHat 5.2/6.1/6.2/7.1/7.2, Suse 6.0, Mandrake 7.0)
    • SunOS 5.5.1/5.6/5.7/5.8
    • FreeBSD 2.2/3.4/4.2/4.3/4.4/4.5
    • NetBSD 1.5.1/1.5.2

In every case the RA-DNS response was correct.  In all of the cases we tested the performance was markedly faster than the reference BIND server under the same fault conditions.
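
For illustration, the sketch below constructs one of the malformed-packet cases listed above, a question name consisting of a compression pointer that refers to itself, and sends it as a raw UDP probe.  It is a minimal sketch rather than part of the automated suite.

    # Illustrative construction of one malformed-packet case from the list above:
    # a question name that is a DNS compression pointer referring to itself.
    # A well-behaved server should reject or drop such a query rather than loop.
    import socket
    import struct

    def self_referencing_query(qid=0x1234):
        # Header: ID, flags (standard query, recursion desired), QDCOUNT=1.
        header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
        # The question starts at offset 12; 0xC00C is a compression pointer to
        # offset 12, i.e. a pointer that points at itself.
        question = b"\xc0\x0c" + struct.pack("!HH", 1, 1)   # QTYPE=A, QCLASS=IN
        return header + question

    def send_probe(server_ip, port=53, timeout=2.0):
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(self_referencing_query(), (server_ip, port))
            try:
                return sock.recvfrom(512)[0]
            except socket.timeout:
                return None                   # a dropped query is also acceptable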

SRS Functionality

Registry Advantage currently supports two SRS protocols: EPP v06/04 and SRP v1.1.  Each of these has a suite of automated tests associated with it that validates proper functionality.

The SRP tests consist of the following:

  • proper blowfish encryption and padding
  • proper dual key blowfish authentication
  • proper IP restricted authentication
  • proper session initiation and termination
  • proper execution of each command
  • proper handling of list formatted parameters
  • proper error responses to malformed commands
  • proper error responses to malformed list parameters
  • maximum size of each command parameter
  • exceeding the maximum size of each command parameter
  • maximum size of each command string
  • exceeding the maximum size of each command string
  • registry policy enforcement
  • proper handling of billing and credit limits
  • proper handling of session limits
  • proper handling of idle session expiration
  • proper handling of lost database connections
  • proper handling of lost client connections

EPP is similar to SRP at a high level, but differs considerably in the details of how it functions.  The following tests were conducted against our EPP implementation (an illustrative sketch of the EPP message framing appears at the end of this subsection):

  • proper SSL certificate validation
  • exception handling of invalid certificates
  • proper client certificate to user name validation
  • exception handling of mis-matched logins and client certificates
  • proper EPP session initiation and termination
  • proper XML validation and exception handling
  • well formed XML recognition and exception handling
  • XML name space and schema function and exception handling
    • data type enforcement
    • data element sizes at the maximum size and beyond
    • partially qualified URIs
    • URIs from a foreign namespace
  • strict EPP standards compliance except where registry policy prohibits
  • proper exception handling of malformed commands
  • proper exception handling of unsupported commands
  • proper exception handling of billing and credit limits
  • proper idle session expiration
  • proper exception handling of lost database connections
  • proper exception handling of lost client connections
  • enforcement of registry policies
  • compatibility with EPP-RTK [8]

These tests are rigorous and very comprehensive.  They are automated and so can be run at will against a target SRS cluster for validation.
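
For illustration, the sketch below shows how a single EPP <check> command might be framed and exchanged over TLS.  It assumes the four-octet, length-prefixed framing of the EPP-over-TCP transport specification; the XML is schematic rather than a literal capture of the EPP v06 schema used in these tests, and the host name and certificate path are placeholders.

    # Illustrative sketch of framing one EPP <check> command (not the test suite).
    import socket
    import ssl
    import struct

    CHECK_XML = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<epp><command><check><domain:check>'
        '<domain:name>example.org</domain:name>'
        '</domain:check></check><clTRID>test-0001</clTRID></command></epp>'
    ).encode("utf-8")

    def send_epp_frame(sock, payload):
        # Each EPP data unit is prefixed with its total length, header included.
        sock.sendall(struct.pack("!I", len(payload) + 4) + payload)

    def _recv_exact(sock, n):
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("connection closed mid-frame")
            buf += chunk
        return buf

    def read_epp_frame(sock):
        (total,) = struct.unpack("!I", _recv_exact(sock, 4))
        return _recv_exact(sock, total - 4)

    # Usage sketch (placeholders throughout):
    # ctx = ssl.create_default_context()
    # ctx.load_cert_chain("client-cert.pem")        # client certificate validation
    # with socket.create_connection(("epp.example.net", 700)) as raw:
    #     with ctx.wrap_socket(raw, server_hostname="epp.example.net") as tls:
    #         greeting = read_epp_frame(tls)
    #         send_epp_frame(tls, CHECK_XML)
    #         response = read_epp_frame(tls)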

Whois Functionality

The Whois protocol functionality testing consisted of validating the service against the RFC 954 [9] definition of port 43 Whois, as well as Registry Advantage's own strict formatting requirements (an illustrative formatting check appears at the end of this subsection).  The following tests were performed:

  • garbage input for each query type
  • correct input (object was found) for each query type
  • incorrect input (object was not found) for each query type
  • proper formatting of response according to formatting template
  • compatibility with common Whois command line clients

As with the other functionality tests, these tests are fully automated and can be re-run at will.
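
For illustration, a formatting check of the kind described above reduces to verifying that each labeled field from the response template appears in the reply; the sketch below uses a hypothetical subset of required fields and is not the actual test suite.

    # Illustrative formatting check (not the actual test suite).  REQUIRED_FIELDS
    # is a hypothetical subset of the registry's response formatting template.
    REQUIRED_FIELDS = ("Domain Name:", "Registrar:", "Name Server:")

    def matches_template(reply_text):
        """A 'found' reply must contain every labeled field from the template."""
        return all(field in reply_text for field in REQUIRED_FIELDS)

    def looks_not_found(reply_text):
        """A 'not found' reply must contain no object data fields at all."""
        return not any(field in reply_text for field in REQUIRED_FIELDS)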

Oracle High Availability Testing

The Oracle Cluster consists of a pair of Sun Enterprise 6500 servers configured with identical hardware and interconnects.  These host the active and standby Oracle instances, synchronized at the application level.  The metrics used to assess the high availability of the Oracle database infrastructure were the amount of time it took to recover from a failure, expressed as our Recovery Time Objective, as well as our ability to recover the database without data loss, expressed as our Recovery Point Objective.
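
The two metrics are computed directly from the timestamps recorded during each fault injection; the sketch below shows that arithmetic with hypothetical timestamps for illustration only.

    # Illustrative RTO/RPO arithmetic from fault-injection timestamps (values hypothetical).
    from datetime import datetime

    fault_injected     = datetime(2002, 6, 1, 10, 0, 0)   # moment the fault was introduced
    service_restored   = datetime(2002, 6, 1, 10, 3, 0)   # first successful transaction afterwards
    last_committed_txn = datetime(2002, 6, 1, 10, 0, 0)   # newest transaction surviving recovery

    rto = service_restored - fault_injected      # recovery time, e.g. 3 minutes
    rpo = fault_injected - last_committed_txn    # data-loss window, e.g. 0 minutes

    print(f"RTO = {rto}, RPO = {rpo}")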

The following faults were introduced to implement this test:

Test | Expected Results | Actual Results | Pass (Y/N)
Failed Process | No interruption of system services | as expected | Y
Single Boot Disk Error | No interruption of system services | as expected | Y
Multiple Boot Disk Error | RTO 0-5 minutes, RPO 0 minutes | RTO 3 minutes, RPO 0 minutes | Y
Single Network Error (GigE) | RTO 0-5 minutes, RPO 0 minutes | RTO < 1 minute, RPO 0 minutes | Y
Multiple Network Error (GigE) | RTO 0-5 minutes, RPO 0 minutes | RTO 3 minutes, RPO 0 minutes | Y
Single FA Failure | No interruption of system services, notification is sent | as expected | Y
Multiple FA Failure | RTO 0-5 minutes, RPO 0 minutes | RTO 5 minutes, RPO 0 minutes | Y
Single FA Switch Failure | No interruption of system services, notification is sent | as expected | Y
Multiple FA Switch Failure | RTO 0-5 minutes, RPO 0 minutes | RTO 5 minutes, RPO 0 minutes | Y
Single Volume Failure | RTO 0-5 minutes, RPO 0-3 minutes | RTO 3 minutes, RPO 0 minutes | Y
Full Symmetrix Failure | RTO 0-5 minutes, RPO 0-3 minutes | RTO 3 minutes, RPO 0 minutes | Y
Component Failure (single board) | RTO 0-5 minutes, RPO 0 minutes | RTO 3 minutes, RPO 0 minutes | Y
Manual controlled fail over | RTO 0-5 minutes, RPO 0 minutes | RTO 3 minutes, RPO 0 minutes | Y


Network and Data Interconnects High Availability

The network and data interconnect testing consisted of introducing faults to determine the behavior of the network and data architecture under failure conditions.  The expected results for each test were taken from either documented behavior for the device or interconnect and from industry standard expectations.

The following tests were performed:

Test | Expected Results | Actual Results | Pass (Y/N)
App Cluster "out" interface failure | load balancer takes the host out of the mix | load balancer takes the host out of the mix | Y
App Cluster "in" interface failure | monitoring must notify Ops, who must notify NetEng, and the host must be taken out of the mix manually | Ops 2.0 detects this and operators escalate manually | Y
Single Connectrix Switch failure | servers fail over to remaining connection, Sym reduces aggregate bandwidth to its remaining connections | PowerPath noticed the link status failure immediately and failed the path over | Y
Double Connectrix switch failure | total loss of data access to the SAN, RRP and WHOIS services unavailable, DNS updates unavailable | Oracle, RRP, and the MASTERs were severely affected; SRP returned internal server errors, DNS responded, Whois responded | Y
Single Symmetrix Interface failure | aggregate bandwidth reduction, but no loss of service | PowerPath failed the path over after 28 seconds | Y
Multiple Symmetrix Interface failures | possible loss of RRP and WHOIS service due to Sym unavailability or performance reduction | multiple failures in different paths failed over in < 30 seconds; total loss of a path made LUNs inaccessible, Oracle was severely affected, and RRP reported internal server errors | Y
Cisco Router 100MB cross connect failure | mesh interruption with no actual loss of connectivity | as expected (6 packets dropped) | Y
BigIP Serial cross connect failure | heartbeat lost, both BigIPs think they are master, everything seizes up at layer 4 on the outside network | none of the VIPs responded, but DNS still responded since it is not load balanced | Y
NetApp Single GigE failure | interface fail over with no loss of service | this is a NetApp trunk, so there was no loss of service at all | Y
NetApp Double GigE failure (same head) | cluster fail over to working head, minor service interruption (less than 3 minutes) | in one instance the NetApp did not fail over; Ops monitoring picked this up and a manual fail over was initiated from the good head in under 3 minutes | Y
Single Summit 48 failure | 50% loss of hosts from all clusters, master fails over to backup | 50% loss of hosts from all clusters, master fails over to backup | Y
Single Summit 5i failure | interface failovers occur on all dual connected hosts, possibly causing momentary interruption in service | 4 second interruption on Summit 48s, but BigIP takes 2 minutes to fail over and DB servers require manual interface fail over | Y
Single BigIP GigE failure | interface failover occurs, possibly causing momentary interruption in service | reboots after 2 minutes | Y
Double BigIP GigE failure | cluster fail over to remaining BigIP, minor service interruption (less than 1 minute) | reboots after 2 minutes | Y
Single Cisco Router GigE failure | interface failover occurs, possibly causing momentary interruption in service | as expected (3 packets dropped) | Y
Single Internet Feed failure | BGP fail over from the outside should be immediate, while HSRP fail over from the inside should take 60-90 seconds | as expected | Y
Single Switch to Switch Interconnect failure | no impact | as expected | Y

Redundancy for Remote Satellites

The application cluster redundancy was tested for site independence and DNS service availability.  Faults were introduced to simulate various types of site failures and the response of the overall collection of satellites was measured for each test.

The following list of tests was performed: 

Test | Expected Results | Actual Results | Pass (Y/N)
Both Summit 48 Switches (or other redundant 100MB switch) fail at master site | RRP, WHOIS, account management are unavailable, DNS is available from remaining satellites sustaining performance minimums | RRP, WHOIS, account management are unavailable, DNS is available from remaining satellites sustaining performance minimums | Y
Both Summit 48 Switches (or other redundant 100MB switch) fail at satellite site | DNS service fails for that satellite, all other DNS satellites remain available sustaining performance minimums, no other service loss | DNS service fails for that satellite, all other DNS satellites remain available sustaining performance minimums, no other service loss | Y
Both Core Switches fail (master site only) | RRP, WHOIS, account management are unavailable, DNS is available from remaining satellites sustaining performance minimums | RRP, WHOIS, account management are unavailable, DNS is available from remaining satellites sustaining performance minimums | Y
Both BigIP Load Balancers fail, or Both Cisco Routers fail, or Both Internet Feeds fail | RRP, WHOIS, account management are unavailable, DNS is available from remaining satellites sustaining performance minimums | RRP, WHOIS, account management are unavailable, DNS is available from remaining satellites sustaining performance minimums | Y
Master Site hits saturation level for BigIP / Router (TBD) | RRP, WHOIS, account management are unavailable, DNS is available from remaining satellites sustaining performance minimums | DNS is unaffected, but RRP, WHOIS, and account management were very slow, with many connection failures and timeouts, and some internal server errors for RRP | Y

Conclusion

The combination of boundary, capacity, compatibility, and fault testing represents the complete set of regression tests run by Registry Advantage to validate its applications and infrastructure.  These tests demonstrate the ability to operate within a broad range of operational circumstances with expected results and a predictable level of performance and stability.



[1] http://www.imc.org/ietf-whois/mail-archive/msg00014.html

[2] http://www.icann.org/tlds/gtld-registry-best-practices-30sep00.htm

[3] http://www.itu.int/osg/spu/enum/workshopjan01/annex2-conrad.ppt

[4] www.netsizer.com

[5] http://www.nanog.org/mtg-0202/ppt/larson.ppt

[6] http://www.icann.org/announcements/icann-pr10aug01.htm

[7] http://www.icann.org/accouncements/icann-pr1aug01.htm

[8] http://sourceforge.net/projects/epp-rtk/

[9] http://www.ietf.org/rfc/rfc0954.txt