Availability - What is it, and what does it mean to me?
An Adrenaline Group White Paper


Availability Terms & Concepts
Availability Applied
Availability and Internet Services
Service Level Agreements
Case History
Conclusion

 

the Adrenaline group, inc.
http://www.adrenaline.com
1445 New York Avenue, NW
4th Floor
Washington, DC 20005
202.628.4438







A Brief Discussion of Availability

What is service availability? Why does it matter? What difference does it make to my Internet service? This paper will tell you what availability and uptime are, what they mean to your website, how to evaluate your availability requirements, and will give you some idea of the cost of achieving these requirements. It introduces terms and concepts, then discusses issues about achieving various levels of availability.

AVAILABILITY TERMS AND CONCEPTS

Service availability is a measure of service "uptime". A service with 100% availability is always up. While this is what most people want, it is not practical to achieve for most services. What follows are some of the terms and nuances of availability.

A service that is supposed to be available all the time is said to be a 24 x 7 service. The Food Line all-night grocery store is a 24 x 7 store, or a continuously available service. Because the Food Line is open 24 x 7, it may aspire to a target availability of 100%, meaning that it will always be open and ready for shoppers.

The Behemoth grocery store closes at midnight and re-opens at 6:00 AM, and is therefore not continuously available. The Behemoth provides a service window of 18 hours per day. That means there are 6 hours of scheduled downtime each day; in other words, the grocery store is unavailable 6 hours per day. The best overall availability the Behemoth could achieve is 18 / 24, or 75%. This scheduled downtime is very useful for tasks like cleaning the store, restocking, taking inventory, and other chores that are harder to complete with customers underfoot. Because of it, the Behemoth is cheaper to operate than the Food Line.

If the Behemoth day manager overslept and didn’t unlock the doors until 9:00 AM, the store would experience 3 hours of unscheduled downtime, or outage, which would be much more annoying to customers than the 6 hours of scheduled downtime, because it violates their expectations. They expect the Behemoth to close at midnight and re-open at 6:00 AM, not 9:00 AM. The Behemoth would have violated the service level agreement (SLA) it has with its customers, which is labeled "Store Hours" and is posted on its automatic door.
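The grocery store arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `window_availability` and its two return values are assumptions made for this example, and the numbers are the Behemoth's 18 hour service window and the 3 hour oversleeping outage described above.

```python
def window_availability(open_hours_per_day, unscheduled_outage_hours=0.0):
    """Return (overall, within_window) availability for one day, as fractions.

    overall       -- uptime measured against the full 24 hour day
    within_window -- uptime measured against the scheduled service window only
    """
    up_hours = open_hours_per_day - unscheduled_outage_hours
    overall = up_hours / 24.0
    within_window = up_hours / open_hours_per_day
    return overall, within_window


# A normal day at the Behemoth: open 18 hours, no unscheduled downtime.
normal_day = window_availability(18)

# The day the manager overslept: 3 hours of unscheduled downtime.
overslept_day = window_availability(18, unscheduled_outage_hours=3)

print(normal_day)     # overall 0.75, within the service window 1.0
print(overslept_day)  # overall 0.625, within the service window about 0.83
```

Measuring against the service window is what separates the two kinds of downtime: the scheduled 6 hours never count against the store, while the 3 hour outage does.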

AVAILABILITY APPLIED TO ELECTRONIC SERVICES

Availability in electronic systems is often measured in terms of "nines", i.e.

99%, 99.9%, and 99.99%

…which are 2, 3 and 4 nines, respectively (and may also be called Class 2, Class 3, or Class 4 availability).

There are also intermediate measures, like 99.5%, or 99.91025347%, or whatever you want to state your availability to be. See the chart below to see what these numbers mean. In short, there are 525,600 minutes in a year. An availability target states what percentage of that total the service will be "up" for customers.

Availability   Class   Outage per year         Outage per month
                       minutes     hours       minutes     hours
90%            1       52560       876.0       4380        73.0
95%                    26280       438.0       2190        36.5
99%            2       5256        87.6        438         7.3
99.5%                  2628        43.8        219         3.7
99.9%          3       526         8.8         44          0.7
99.95%                 263         4.4         22          0.4
99.99%         4       53          0.9         4           0.1
99.999%        5       5           -           -           -
99.9999%       6       0.5         -           -           don’t blink
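The figures in the chart follow directly from the 525,600 minutes in a year. As a rough sketch (the function name `outage_budget_minutes` is made up for this example, and a month is taken to be one twelfth of a year), any availability target can be turned into an outage budget like this:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def outage_budget_minutes(availability_percent):
    """Return the allowed outage as (minutes per year, minutes per month)."""
    per_year = MINUTES_PER_YEAR * (1 - availability_percent / 100.0)
    per_month = per_year / 12  # assumption: a month is 1/12 of a year
    return per_year, per_month


for target in (99.0, 99.9, 99.95, 99.99):
    per_year, per_month = outage_budget_minutes(target)
    print(f"{target}%: {per_year:.0f} min/year, {per_month:.0f} min/month")
```

The same function works for intermediate targets such as 99.5%, which is handy when you are comparing the cost of one step up the chart against the downtime it buys back.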

Availability numbers are often further qualified along various dimensions:

Service outages and partial service outages:

A total service outage affects service for the entire user population. If a service outage affected only, say, 10% of the customers of that service, the total service availability would suffer by only 1/10 of the total outage time, as the rest of the users are not affected. To the 10% of users affected the outage was total, but the service provider may "mark down" only 1/10 of the total outage time against the whole population. In other words, an outage that affects only a portion of the users is less severe than one that affects everyone.
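The "mark down" rule for partial outages is a simple weighting. Here is a minimal sketch, assuming the provider scales outage time by the fraction of users affected; the function name and the 60 minute example outage are illustrative, not taken from any particular provider's practice.

```python
def weighted_outage_minutes(outage_minutes, fraction_of_users_affected):
    """Scale an outage by the share of the user population it touched."""
    return outage_minutes * fraction_of_users_affected


# A 60 minute outage that hit 10% of users counts as about 6 minutes
# against the service-wide availability figure.
partial = weighted_outage_minutes(60, 0.10)

# The same outage hitting everyone counts in full.
total = weighted_outage_minutes(60, 1.0)

print(partial, total)
```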

Slow response times:

For websites, excessively slow response times should be counted as outages. Most sites respond in less than 5 seconds. Beyond this, say at 5 to 15 seconds response time, the site is noticeably slow. Beyond that, the site might as well be down. For most sites, response times greater than 15 or 30 seconds should be counted as outages.
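If you are measuring response times yourself, the thresholds above translate into a small classifier. The sketch below uses the 5 and 15 second figures from the text as defaults; the function name, the labels, and the sample measurements are all invented for illustration.

```python
def classify_response(seconds, slow_after=5.0, outage_after=15.0):
    """Label one response-time measurement as 'ok', 'slow', or 'outage'."""
    if seconds > outage_after:
        return "outage"
    if seconds > slow_after:
        return "slow"
    return "ok"


# Hypothetical response times, in seconds, from a monitoring probe.
samples = [0.8, 2.1, 6.5, 31.0, 1.2]
labels = [classify_response(s) for s in samples]

# Fraction of probes that should be booked as downtime.
outage_fraction = labels.count("outage") / len(labels)

print(labels)
print(outage_fraction)
```

A site that prefers the looser 30 second cutoff would pass `outage_after=30.0` instead.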

Weak link in the chain:

Service availability requires that all parts of the system needed to provide service be functioning. A failure in only one part can affect other parts. This is problematic when multiple service suppliers are providing components of the total service. For example, if the power fails and the Behemoth cannot get their automatic doors open, they still experience an outage, even though they are not ultimately responsible for electrical power.
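One common way to put a number on the weak link effect is to multiply the availabilities of every component the service depends on. This is a sketch under a loud assumption: the components fail independently of one another, which real systems only approximate. The three 99.9% components are hypothetical.

```python
def chain_availability(component_availabilities):
    """Availability of a service that needs every listed component to be up.

    Assumes component failures are independent of each other.
    """
    total = 1.0
    for availability in component_availabilities:
        total *= availability
    return total


# Hypothetical chain: power, network, and server, each at 99.9%.
chain = chain_availability([0.999, 0.999, 0.999])

print(f"{chain:.4%}")  # a little under 99.71%
```

Three components that each meet 99.9% leave the whole chain short of 99.9%, which is why a supplier's figure for its own piece says little about the end-to-end result.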

Short vs. long outages:

A series of short outages may not be equivalent to one long outage. In some situations it may be worse; in others it may be better. If the power company loses power for three days, there will be news stories, lawsuits and investigations by various government agencies. On the other hand, if it loses power for short periods of time, and over the course of ten years these short disruptions aggregate to three days, the impact may be less. In contrast, if the power company were to lose power every day at 5:00 p.m. for one minute, an aggregate of about 6 hours per year, the outcry would be far more than if there were just one six hour outage in the course of a year.

Cost of downtime:

Better availability statistics mean more cost to equip and operate. Assessing the cost to the business of downtime is useful in making these tradeoffs. What is the direct lost revenue for an hour of downtime? What is the indirect lost revenue, e.g. a customer that may not return? What value do you put on intangible costs, e.g. damage to reputation?

For example, an e-commerce site may transact $10,000 per hour in the busy hour. An hour of downtime has a direct cost of $10,000 for this site. However, if that hour caused repeat customers to leave, or new customers to not become repeat customers, the actual cost of downtime could be considerably higher.
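The direct cost in this example is simple multiplication, and the indirect cost can be approximated by a markup on top of it. In the sketch below, `indirect_factor` is a purely hypothetical knob for lost repeat business; the paper gives no value for it, so the 0.5 used here is illustrative only.

```python
def downtime_cost(hours_down, revenue_per_hour, indirect_factor=0.0):
    """Estimate the cost of an outage.

    indirect_factor is a hypothetical markup for indirect losses,
    e.g. 0.5 means indirect losses equal to half the direct loss.
    """
    direct = hours_down * revenue_per_hour
    return direct * (1.0 + indirect_factor)


# Direct cost of one busy hour of downtime at $10,000 per hour.
direct_only = downtime_cost(1, 10_000)

# The same hour with an assumed 50% markup for customers who do not return.
with_indirect = downtime_cost(1, 10_000, indirect_factor=0.5)

print(direct_only, with_indirect)
```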

AVAILABILITY AND INTERNET SERVICES

Availability is a common measure applied to Internet services. Here’s what some of these numbers mean in this space.

[Figure: Typical configuration for achieving 99% availability]

99% Availability
99% is "what you get" if you have good equipment and are reasonably careful in your operations and software practices, but don’t take exceptional steps to improve the situation. You should expect an average of 7 hours of outage per month at these levels. Beware that it can also be much worse if you are not careful about your systems software, application software, network connections, power, and all the other piece parts that make up your server.

The cost of achieving this is whatever you paid for your server(s) and the hosting facility – whether self-hosted or hosted at a co-location center (i.e. Internet data center). For example, a pair of Linux / Apache servers sitting at the end of a T1 line plugged into the wall in your basement might cost you $10,000 fully equipped, plus about $1000 / month for the line. One of the servers could be running the site, with the other available as a standby to be manually put in service if the primary unit fails.

99.9% Availability
99.9% is achievable with a single site (i.e. location) run very carefully. You can't afford many multi-hour outages due to power failure, hardware failure, software failure, or network failure – you only get 8.8 hours to play with all year. This downtime budget means you have to use reliable hardware and software, backed up by solid operations and recovery practices.

  1. To achieve 99.9%, power and network connections have to be redundant, as they take too long to restore if they fail; the mean-time-to-repair (MTTR) is too high for your time budget.

  2. Computer and network hardware have to be redundant. Faults must be detected and corrected very quickly. Fault correction can usually be either automatic or manual at this level of availability; i.e. the failover solution may be hot standby or warm standby, respectively.

  3. The software has to be rock solid; and deployment and changes managed very carefully. New versions have to be introduced cautiously with comprehensive testing. Test and staging systems are necessary to support these practices.
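The role MTTR plays in the time budget can be made concrete with the standard steady-state relationship, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures. The sketch below is illustrative only: MTBF is not a term the discussion above uses, and the 1000 hour, 8 hour, and 1 hour figures are hypothetical.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Long-run availability from mean time between failures and repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component that fails about once every 1000 hours.
slow_repair = steady_state_availability(1000, 8)  # 8 hours to restore
fast_repair = steady_state_availability(1000, 1)  # 1 hour to restore

print(f"{slow_repair:.3%}")
print(f"{fast_repair:.3%}")
```

With the same failure rate, cutting the repair time from 8 hours to 1 hour moves this component from roughly 99.2% to roughly 99.9%, which is the reason redundancy and fast failover matter more than raw hardware quality at this level.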


[Figure: Typical configuration for achieving 99.9% availability]

Content must also be managed as carefully as the applications software. Depending on the nature of the website, a content failure may be perceived as an outage by the user. The same testing and deployment issues that affect software also apply to content.

If you do all that, you have a shot at achieving 99.9%.

The cost of achieving this level is usually multiple, redundant servers, fronted by load balancing equipment, being operated in a first rate co-location center (unless you happen to have a very well equipped data center lying around). You will also need a staging environment, separate from the production environment, on which to test your upgrades.

Our friendly Linux / Apache servers are quite capable of achieving 99.9%, so let’s stick with them, but let’s add another server to production ($5000), and put the three of them behind a server load balancer, costing from zero dollars (Open Source) up to about $40,000 for a redundant pair of hardware units. The load balancers share the load between the servers, and will take a failed unit out of service automatically, giving you an N+1 redundancy scheme. You will have to move out of your basement, and expect to pay a co-location center hosting service about $2000+ per month for bandwidth and floor space. Allow another $10,000 for routers and switching hubs to glue this equipment together, and add in the cost of a separate staging environment, for another $5000 or so.
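To see why N+1 helps, you can estimate the chance that enough servers are up at once. The sketch below rests on assumptions that are not in the discussion above: servers fail independently, any two of the three can carry the full load, each server alone is 99% available, and the load balancer, network, and power are ignored.

```python
from math import comb


def k_of_n_availability(n, k, unit_availability):
    """Probability that at least k of n independent units are up."""
    a = unit_availability
    return sum(
        comb(n, up) * a**up * (1 - a) ** (n - up)
        for up in range(k, n + 1)
    )


# Three servers behind the load balancer, two needed to carry the load.
pool = k_of_n_availability(n=3, k=2, unit_availability=0.99)

print(f"{pool:.4%}")  # roughly 99.97% for the server pool alone
```

The result applies to the server pool only. Everything the servers share, such as the site's power and network, still sits in series with it and pulls the end-to-end number back down.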

Note that this simplified discussion does not address database issues.

[Figure: Typical configuration for achieving 99.95% availability]

99.95% Availability
99.95% is harder. You get 4.4 hours per year. All of the above applies, with the following changes:

  1. You are probably now beyond the reliability of a single location. The usual things that can take a site out of service are power or network related – and are likely to exceed your time budget. Disasters such as fire or flood must also be taken into account.

  2. Your MTTR must be reduced. Monitoring systems must be very tight. Manual failover (warm standby operation) is usually not an option.

The additional costs, over and above the 99.9% case, include doubling the production environment equipment (2 sites, assuming full processing capacity at each), and paying double the hosting services or data center occupancy. Additionally, you may wish to introduce DNS load balancing equipment (aka Wide Area Network Load Balancing or WAN Virtual Resource Management equipment), adding another $30,000 or so to the new total.

Better than 99.95% Availability

The numbers keep going up. So does the expense. Non-stop systems become necessary. So can redundant operations involving multiple vendors, e.g. co-location centers. In the limit, there are always things outside your control, like an Internet wide routing failure, that affect you. 100% availability cannot be achieved; it can only be approached.

At these levels, the base reliability of the core Internet becomes your limiting factor. Careful study of the environments at your Internet geographies is recommended, as it makes no sense to over-engineer your sites if the pipes that feed them are less reliable than your equipment. Separately, the effects of edge failures (i.e. failures of individual access providers or other "islands") may affect significant portions of your population, causing partial outages for your customer base.

There is a practical limit to availability that is probably about 99.99% at the current state of the art. This number would be hotly debated, and some ISPs have advertised numbers far in excess of it, but if you take the whole end-to-end solution, it is probably not possible to go much further than 99.99% using current networks and techniques.

SERVICE LEVEL AGREEMENTS

Most service providers, such as ISPs and co-location centers, use availability figures in their service level agreements. Service level agreements are not always an accurate statement of a service’s availability, for a variety of reasons.

First, often the SLA simply states the thresholds beyond which the service provider will apologize or give money back. In some cases it may be cheaper to give money back than to achieve the availability target.

Second, many SLAs limit the availability guarantee to systems directly under the service provider’s control; the overall availability to the end user may be less than that figure, given that service affecting failures may occur outside the service provider’s service boundary.

Third, the definition of an outage often excludes short service interruptions – say less than 60 seconds. These can still affect you.

Lastly, SLAs do not typically distinguish between the time budget for routine maintenance and the time budget for unexpected outages. If a service provider is guaranteeing 99.9% availability, then they may be saying that the service will be routinely unavailable up to 44 minutes per month. It depends on how they intend to use that time budget: some of it may be reserved for unexpected outages (unscheduled downtime), and some may be budgeted for routine system maintenance, equipment upgrades, and other scheduled uses of downtime.

Many service providers track availability statistics by month. They may give money back or otherwise apologize for missing the target availability in a given month, and then reset the counter. By doing it this way the service providers give themselves a clean slate each month and can worry less about annual availability targets.
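The effect of the monthly reset is easy to see with a little bookkeeping. The sketch below is hypothetical throughout: the function name, the outage figures, and the convention that a month is one twelfth of a 525,600 minute year are all assumptions for illustration, not any provider's actual accounting.

```python
MINUTES_PER_MONTH = 365 * 24 * 60 / 12  # 43,800 minutes


def sla_report(monthly_outage_minutes, target_percent):
    """Return (months that missed the target, availability over the period)."""
    monthly_budget = MINUTES_PER_MONTH * (1 - target_percent / 100.0)
    missed = [
        month
        for month, minutes in enumerate(monthly_outage_minutes, start=1)
        if minutes > monthly_budget
    ]
    total_minutes = MINUTES_PER_MONTH * len(monthly_outage_minutes)
    overall = 100.0 * (1 - sum(monthly_outage_minutes) / total_minutes)
    return missed, overall


# Hypothetical year: 40 minutes of outage in 11 months, 120 in December.
outages = [40] * 11 + [120]
missed_months, annual_percent = sla_report(outages, target_percent=99.9)

print(missed_months)            # only month 12 exceeds the monthly budget
print(f"{annual_percent:.3f}")  # yet the year as a whole is below 99.9%
```

In this made-up year the provider owes an apology for one month out of twelve, even though the customer's annual availability fell short of the advertised figure.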

CASE HISTORY – EBAY, JUNE 10 – 11, 1999

On June 10, 1999 eBay, the Internet auction house, experienced an outage lasting 22 hours. If you want to read more about this rather well publicized event, see:

http://www.news.com/News/Item/0,4,0-37718,00.html

The point to be taken away from this is that a single 22 hour outage represents 99.75% annual availability. Does that sound good to you? Not to eBay.
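The 99.75% figure can be checked with the same arithmetic used throughout this paper. The sketch assumes a 365 day year and no other downtime during that year; the function name is made up for the example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def annual_availability_percent(outage_hours):
    """Annual availability, in percent, given total outage hours in the year."""
    return 100.0 * (1 - outage_hours / HOURS_PER_YEAR)


ebay_1999 = annual_availability_percent(22)

print(round(ebay_1999, 2))  # about 99.75
```

Put another way, that one outage used about two and a half times the 8.8 hours a 99.9% service gets for the entire year.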

CONCLUSION

The lesson in all this is that it's hard and expensive to achieve high availability, and the specific target levels have to be considered seriously. A new service on the Internet is not likely to be better than 99% and may be much worse. 99.9% availability comes only when the system and operations have matured, and connectivity is very good. Higher levels of availability require a different architecture and a significantly greater commitment of resources. For any service, the cost / availability tradeoff needs to be considered carefully.


© ADRENALINE GROUP, INC. 1997-2003 ALL RIGHTS RESERVED
WWW.ADRENALINE.COM
