C17.13. System reliability. Define, analyze, and quantify quality of service.
System reliability is achieved through clustered, load-balanced systems, a redundant network structure, and constant performance monitoring, ensuring that quality of service is maintained and that all required SLAs are continuously met.
Clustered, Load Balanced Systems
Application systems are divided into load-balanced clusters that handle many requests simultaneously, quickly and efficiently. Each task, such as management of registered domains, is handled transparently by any one of a number of systems. This method of clustering allows distributed load sharing as well as the ability to bring application machines into and out of service without any registry downtime. Load balancing is achieved using Cisco LocalDirector dedicated load-balancing hardware; with these devices it is possible to load balance not only across computers in the same data center but, by connecting them to routers between networks, across sites, so that intercontinental load balancing can be achieved.
Implementing clustered systems yields a number of benefits, including:
- Decentralized systems
- Load distribution
- Easier maintainability
- Totally transparent to all external systems
- No single point of failure (through the use of multiple load balancers and serial "heartbeat" protocols)
- Uninterrupted service during machine maintenance.
Decentralizing servers by using a clustered system is possible even where the cluster is constructed from an array of systems of differing capabilities. One or more load balancers determine the portion of the total load to be distributed to each node in the cluster. This means that as newer, faster machines are introduced during later system upgrades, they can run alongside the old hardware, which is slowly phased out, allowing "incremental upgrades".
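The capacity-proportional distribution described above can be sketched as a weighted round-robin schedule. The node names and weights below are illustrative assumptions, not actual registry hosts; real weights would be configured on the load balancer itself.

```python
import itertools

def weighted_round_robin(nodes):
    """Yield node names in proportion to their capacity weights.

    `nodes` is a list of (name, weight) pairs; the weight is a
    hypothetical capacity rating assigned to each machine.
    """
    # Expand each node into `weight` slots, then cycle forever.
    schedule = [name for name, weight in nodes for _ in range(weight)]
    return itertools.cycle(schedule)

# Example: an older machine (weight 1) running alongside a newer,
# faster one (weight 3) during an incremental upgrade.
balancer = weighted_round_robin([("old-app-1", 1), ("new-app-1", 3)])
first_eight = [next(balancer) for _ in range(8)]
```

Over any eight consecutive requests, the newer machine receives three times as many as the old one, so the old hardware can keep contributing until it is phased out.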
Nodes within a cluster that require maintenance are easily removed from the cluster, or even replaced with another system, while external systems remain entirely unaware of and unaffected by the swap.
Downtime is also eliminated when nodes of the cluster fail; failures are detected immediately by dedicated monitoring applications running on the load balancers. These monitor each node of the cluster, verifying that it is still offering the expected service. As soon as the service is no longer offered (a daemon crashes, or a machine is stopped for maintenance), or the machine is no longer accessible (network failure, power failure, hardware fault), the machine is automatically removed from the cluster configuration. Once the machine is detected to be back in operation and offering the service again, it is added back to the cluster configuration and requests are once again sent to it. All of this happens automatically, without any human intervention.
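The add/remove behavior above can be sketched as a reconciliation loop. The probe callable and node names here are illustrative stand-ins for the load balancer's built-in service checks (e.g. a TCP connect to the service port), not an actual registry implementation.

```python
def reconcile_cluster(nodes, probe, active):
    """Add or remove nodes from the active set based on a health probe.

    `probe(node)` returns True when the node still answers for its
    service; `active` is the set of nodes currently receiving requests.
    """
    for node in nodes:
        if probe(node):
            active.add(node)       # node is up: (re)admit it to the cluster
        else:
            active.discard(node)   # node failed: stop sending requests to it
    return active

# Simulated probe: only app2 is currently answering for its service.
up = {"app2"}
active = reconcile_cluster(["app1", "app2"], lambda n: n in up, {"app1", "app2"})
```

Running the same loop again after app1 recovers would automatically re-admit it, matching the no-human-intervention behavior described above.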
System monitoring ensures high availability of services as well as immediate notification of problems to registry engineers who will be able to diagnose and ultimately solve the problems.
Networking hardware is monitored using SNMP. Monitoring programs running on the management machines use SNMP to verify that these hardware devices continue to operate; as soon as faults are detected, alerts are sent, ranging from emails to SMS/pager alerts and instant messages to the appropriate support personnel, and the incident is logged locally.
Application machines will be monitored in a similar fashion, using simple ping probes as well as service monitoring, in addition to the load balancers' own service monitoring; again, appropriate logs and alerts will be sent to the required personnel when any system faults are detected. Machines also monitor their own services. As soon as a machine detects that it is no longer supplying a service it is meant to supply, it automatically attempts to restart the service; should repeated restart attempts fail, it triggers its own reset. This procedure continues until either the service has been restored or a registry engineer intervenes. This "self-diagnosing and self-repairing" behavior means that any downtime is kept to a minimum; often the problem will be fixed before an engineer has had time to connect to the network. Logging and email notification of every step in the process are essential, especially for diagnosing the cause of the problem.
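The escalation policy just described (retry the service restart, then reset the machine) can be sketched as follows. The `check`, `restart`, and `reboot` callables are hypothetical hooks standing in for real service and host control commands; the simulated run below is purely illustrative.

```python
def restore_service(check, restart, reboot, max_restarts=3):
    """Escalating self-repair: retry a service restart, then reboot.

    `check()` returns True once the service is responding again;
    `restart()` restarts just the daemon; `reboot()` resets the host.
    """
    for _ in range(max_restarts):
        if check():
            return "service up"
        restart()                  # try restarting only the daemon first
    if check():
        return "service up"
    reboot()                       # restart attempts exhausted: reset machine
    return "rebooted"

# Simulated run: the service comes back after two restart attempts.
state = {"restarts": 0}
outcome = restore_service(
    check=lambda: state["restarts"] >= 2,
    restart=lambda: state.update(restarts=state["restarts"] + 1),
    reboot=lambda: state.update(rebooted=True),
)
```

In the simulation the daemon recovers after two restarts, so the reboot path is never taken; each step would also be logged and emailed, as noted above.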
Application machines also make use of the hardware watchdog timers embedded in their chipsets. These watchdogs periodically verify that the machine is still functioning correctly; if a watchdog detects a problem, or gets no response from the CPU, it "soft reboots" the machine. Hardware lock-ups are rare these days given recent advances in computing technology, but when they do occur, simply rebooting the machine fixes the problem in 99% of cases; this also adds to the "self-maintaining" design of the registry.
All aspects of the registry are monitored for performance, not just those needed to meet specific SLAs. This monitoring not only assists in detecting errors; it acts as an early warning system for impending failures, and it allows engineers to identify possible bottlenecks or areas needing improvement to increase overall system throughput.
All networking equipment is monitored through SNMP, and graphs of metrics such as router CPU load and interface traffic are generated and automatically viewable online using MRTG. Extensive statistics are available from the Packeteer product, giving a broad understanding of registrars' usage patterns, from bandwidth utilization to data rates to the current number of connections.
Samples of RRP and EPP performance, as well as DNS resolution times, will also be graphed alongside the applicable SLA limits for verification that SLAs are being met. Uptimes of all systems can be verified from these MRTG graphs. Monitoring and graphing of CPU utilization, memory usage, and disk usage on all application machines allows further analysis of registry performance.
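As an illustration of how sampled response times could be compared against an SLA limit, the sketch below computes a nearest-rank percentile over a batch of samples. The 95th-percentile form, the millisecond values, and the limit are assumptions for illustration; the real metrics and thresholds would come from the applicable registry agreements.

```python
def sla_check(samples_ms, limit_ms, quantile=0.95):
    """Return (observed, met): the response time at `quantile` over the
    samples, and whether it falls within the SLA limit."""
    ordered = sorted(samples_ms)
    # Nearest-rank percentile: smallest value covering `quantile` of samples.
    idx = max(0, int(quantile * len(ordered) + 0.5) - 1)
    observed = ordered[idx]
    return observed, observed <= limit_ms

# Hypothetical EPP round-trip samples in milliseconds.
samples = [120, 95, 110, 130, 105, 98, 101, 115, 125, 99]
observed, met = sla_check(samples, limit_ms=800)
```

The same observed values would feed the MRTG graphs, so a visual check against the plotted SLA line and the automated check agree.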
SNMP monitoring of the database, using Oracle-specific MIBs, allows graphing of database performance as well. All of this automated graphing and performance monitoring assists with SLA verification, detection of system bottlenecks, and monitoring of system performance during periods of high and low demand, and it reveals trends in when peak usage actually occurs throughout the year.
All of these statistics are automatically evaluated against a set of rules; when certain conditions are met (for example, processor utilization on a machine exceeding 80% for more than five minutes), automated warnings are sent by email, SMS, and other channels.
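A rule of that shape, "metric above a threshold for a sustained period", can be sketched as a check over a trailing run of samples. The one-sample-per-minute assumption and the example readings are illustrative, not actual registry data.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """True when `samples` end with at least `min_consecutive`
    consecutive readings above `threshold` -- the point at which
    an automated warning would be sent."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
    return run >= min_consecutive

# With one CPU reading per minute, threshold=80 and min_consecutive=5
# encodes the example rule "above 80% for more than five minutes".
cpu = [40, 85, 90, 83, 88, 91, 86]
alert = sustained_breach(cpu, threshold=80, min_consecutive=5)
```

Requiring a sustained run rather than a single spike keeps transient load peaks from generating noise alerts.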
Certain statistical reports and graphs will be made publicly available on the registry web site so that anyone can verify or check the availability of the registry at any time.
Internal Network Reliability
The internal network will be implemented with the latest high-speed equipment, supplied by the world leader in network equipment. All network hardware essential for full operation will be made fully redundant with additional equipment, eliminating all single points of failure. Spare networking equipment will be located in the rack, ready for use at any time.
Network connectivity will be implemented with failover and redundancy in mind. Each system, or cluster of systems, will have multiple network connections, using multiple network interfaces as well as connectivity through multiple switches and/or routers. Each machine will have more than one path to reach any destination in the network; even between the primary and secondary sites, two separate fiber links will be present.
Communication between redundant load balancers and packet shapers will be implemented over serial cables to ensure that, even during a loss of network operation (a failed network port or faulty interface), communication between these devices can still take place. This avoids relying on network connectivity to maintain the communication channels between primary and backup network equipment.
The packet shaper itself helps enhance quality of service, alongside our other security measures, which ensure the integrity of the network is maintained: only authorized clients are able to connect, and even authorized clients are unable to "flood" the network or the services offered, maintaining a high quality of service for all registrars equally.
Quality of Service
All of the above enhance our ability to quantify and analyze our effective quality of service, as well as to detect potential problems that may affect QoS before they occur. Staff members will be dedicated to monitoring these systems, and they will not perform a passive role; we understand that simply waiting for the registry to "email you" when there is a problem is not sufficient. Support staff will actively monitor the systems through online monitoring pages as well as custom monitoring programs. Appropriate procedures will be put in place to escalate any relevant problems that are detected to registry engineers, and staff will receive full training in how to detect these problems and what to look out for.