the Adrenaline group, inc.
http://www.adrenaline.com
1445 New York Avenue, NW
4th Floor
Washington, DC 20005
202.628.4438
A Brief Discussion of Availability
What is service availability?
Why does it matter? What
difference does it make to my
Internet service? This paper
will tell you what availability
and uptime are, what they mean
to your website, how to evaluate
your availability requirements,
and will give you some idea of
the cost of achieving these requirements.
It introduces terms and concepts,
then discusses issues about achieving
various levels of availability.
AVAILABILITY TERMS AND CONCEPTS
Service availability is a measure
of service "uptime". A service
with 100% availability is always
up. While this is what most people
want, for most services, this is
not practical to achieve. What
follows are some of the terms and
nuances of availability. A service
that is supposed to be available all
the time is said to be a 24 x 7
service. The Food Line all-night
grocery store is a 24 x 7 store, or
a continuously available service.
Because the Food Line is open 24 x 7,
it may aspire to a target
availability of 100%,
meaning that it will always be
open and ready for shoppers. The
Behemoth grocery store closes at
midnight and re-opens at 6:00 AM, and is
therefore not continuously available.
The Behemoth provides a service window
of 18 hours per day. That means there
are 6 hours of scheduled downtime each
day, or in other words, the grocery store
is unavailable 6 hours per day. The
Behemoth could achieve at most an
overall availability of 18 / 24, or 75%.
This scheduled downtime is
very useful for tasks like cleaning the
store, restocking, taking inventory, and
other chores that are harder to complete
with customers underfoot. It is cheaper
to operate the Behemoth than the Food
Line because of this effect. If the
Behemoth day manager overslept and
didn’t unlock the doors until 9:00,
the store would experience 3 hours of
unscheduled downtime, or outage, which would
be much more annoying to customers than the
6-hour scheduled downtime, because it
violates their expectations. They expect
the Behemoth to close at midnight and re-open
at 6:00 AM, not 9:00 AM. The Behemoth would have
violated the service level agreement (SLA)
it has with its customers, which
is labeled "Store Hours" and is posted on
its automatic door.
AVAILABILITY APPLIED TO ELECTRONIC SERVICES
Availability in electronic systems
is often measured in terms of
"nines", i.e.
99%,
99.9%, and
99.99%
…which are 2, 3 and 4 nines, respectively
(and may also be called Class 2, Class 3,
or Class 4 availability).
There are also intermediate measures, like
99.5%, or 99.91025347%, or whatever you
want to state your availability to be. The
chart below shows what these numbers
mean. In short, there are 525,600 minutes
in a year. An availability target states what
percentage of that total the service will be
"up" for customers.
| Availability | Class | Outage/year (minutes) | Outage/year (hours) | Outage/month (minutes) | Outage/month (hours) |
|---|---|---|---|---|---|
| 90% | 1 | 52560 | 876.0 | 4380 | 73.0 |
| 95% | | 26280 | 438.0 | 2190 | 36.5 |
| 99% | 2 | 5256 | 87.6 | 438 | 7.3 |
| 99.5% | | 2628 | 43.8 | 219 | 3.7 |
| 99.9% | 3 | 526 | 8.8 | 44 | 0.7 |
| 99.95% | | 263 | 4.4 | 22 | 0.4 |
| 99.99% | 4 | 53 | 0.9 | 4 | 0.1 |
| 99.999% | 5 | 5 | - | - | - |
| 99.9999% | 6 | 0.5 | - | - | don’t blink |
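The figures in the chart follow directly from the 525,600 minutes in a year. As a minimal sketch (the function name `downtime_budget_minutes` is an illustrative choice, not something from the paper), the outage budget for any availability target can be computed like this:

```python
MINUTES_PER_YEAR = 525_600  # 365 days * 24 hours * 60 minutes


def downtime_budget_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the outage allowance, in minutes, for an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_minutes * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    for target in (90.0, 99.0, 99.9, 99.99, 99.999):
        per_year = downtime_budget_minutes(target)
        per_month = per_year / 12  # a month is treated as 1/12 of a year
        print(f"{target}%: {per_year:.1f} min/year, {per_month:.1f} min/month")
```

Dividing the yearly budget by 12 gives the monthly columns; dividing minutes by 60 gives the hour columns.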
Availability numbers are often further qualified along various dimensions:
Service outages and partial service outages:
A total service outage affects the entire user population. If a service outage affected only, say, 10% of the customers of that service, the total service availability would suffer by only 1/10 of the total outage time, as the rest of the users are not affected. To the 10% of users affected the outage was total, but the service provider may "mark down" only 1/10 of the total outage time against the overall availability figure. In other words, an outage that affects only a portion of the users is less severe than one that affects everyone.
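The "mark down" rule described here amounts to weighting each outage by the fraction of users it touched. A rough sketch, assuming each outage is recorded as a (duration, fraction of users affected) pair; the function names and the sample outage log are hypothetical:

```python
MINUTES_PER_YEAR = 525_600


def weighted_outage_minutes(outages):
    """Sum outage time, scaling each outage by the share of users affected.

    `outages` is an iterable of (duration_minutes, fraction_of_users) pairs,
    where the fraction is between 0.0 and 1.0.
    """
    total = 0.0
    for minutes, fraction in outages:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fraction of users must be between 0 and 1")
        total += minutes * fraction
    return total


def weighted_availability(outages, period_minutes=MINUTES_PER_YEAR):
    """Availability percentage after applying the partial-outage weighting."""
    return 100.0 * (1.0 - weighted_outage_minutes(outages) / period_minutes)


# Hypothetical year: one 2-hour total outage, one 5-hour outage hitting 10% of users.
outages = [(120, 1.0), (300, 0.10)]

if __name__ == "__main__":
    print(f"weighted outage minutes: {weighted_outage_minutes(outages):.1f}")
    print(f"weighted availability: {weighted_availability(outages):.4f}%")
```

Note that the weighted figure describes the population as a whole; the affected 10% still experienced the full five hours.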
Slow response times:
For websites, excessively slow response times should be counted as outages. Most sites respond in less than 5 seconds. Between 5 and 15 seconds, the site is noticeably slow. Beyond that, the site might as well be down. For most sites, response times greater than 15 or 30 seconds should be counted as outages.
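One way to apply these thresholds is to classify each measured response time and count the slow-beyond-tolerance ones as downtime. The sketch below uses the 5-second and 15-second figures from the text; the function names, the use of `None` for a probe that got no answer, and the sample probe data are assumptions made for illustration:

```python
from typing import Optional

SLOW_THRESHOLD_S = 5.0     # below this, the site is considered responsive
OUTAGE_THRESHOLD_S = 15.0  # above this, the response is counted as an outage


def classify_response(seconds: Optional[float]) -> str:
    """Classify one probe; None stands for a probe that got no response."""
    if seconds is None or seconds > OUTAGE_THRESHOLD_S:
        return "outage"
    if seconds >= SLOW_THRESHOLD_S:
        return "slow"
    return "ok"


def availability_from_probes(samples) -> float:
    """Percentage of probes that were not classified as outages."""
    if not samples:
        raise ValueError("need at least one probe sample")
    up = sum(1 for s in samples if classify_response(s) != "outage")
    return 100.0 * up / len(samples)


# Hypothetical probe results, in seconds.
probes = [0.8, 1.2, 6.5, 22.0, None, 0.9, 1.1, 0.7, 14.0, 1.0]

if __name__ == "__main__":
    print([classify_response(p) for p in probes])
    print(f"availability: {availability_from_probes(probes):.1f}%")
```

A site that prefers the 30-second cutoff mentioned above would only need to change `OUTAGE_THRESHOLD_S`.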
Weak link in the chain:
Service availability requires that all parts of the system needed to provide service be functioning. A failure in only one part can affect other parts. This is problematic when multiple service suppliers are providing components of the total service. For example, if the power fails and the Behemoth cannot get their automatic doors open, they still experience an outage, even though they are not ultimately responsible for electrical power.
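The weak-link effect can be made concrete with a standard reliability calculation that the paper itself does not spell out: if every component is required and failures are assumed to be independent, the overall availability is the product of the component availabilities. The component names and figures below are hypothetical:

```python
import math


def series_availability(component_percents):
    """Overall availability when every component must be up.

    Assumes component failures are independent, which is a simplification.
    """
    return 100.0 * math.prod(p / 100.0 for p in component_percents)


# Hypothetical suppliers, each delivering 99.9% on its own.
components = {"power": 99.9, "network": 99.9, "servers": 99.9}
overall = series_availability(components.values())

if __name__ == "__main__":
    print(f"overall availability: {overall:.3f}%")
```

Under these assumptions the chain is always less available than its weakest component, which is why a supplier's individual figure overstates what the end user sees.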
Short vs. long outages:
A series of short outages may not be equivalent to one long outage. In some situations it may be worse; in others it may be better. If the power company loses power for three days, there will be news stories, lawsuits and investigations by various government agencies. On the other hand, if it loses power for short periods of time, and over the course of ten years these short disruptions aggregate to three days, the impact may be less. In contrast, if the power company were to lose power every day at 5:00 p.m. for one minute, an aggregate of about 6 hours per year, the outcry would be far more than if there were just one six hour outage in the course of a year.
Cost of downtime:
Better availability costs more to equip and operate. Assessing the cost of downtime to the business is useful in making these tradeoffs. What is the direct lost revenue for an hour of downtime? What is the indirect lost revenue, e.g. a customer that may not return? What value do you put on intangible costs, e.g. damage to reputation?
For example, an e-commerce site may transact $10,000 per hour in the busy hour. An hour of downtime has a direct cost of $10,000 for this site. However, if that hour caused repeat customers to leave, or new customers to not become repeat customers, the actual cost of downtime could be considerably higher.
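As a rough illustration of the direct-cost side only, the sketch below multiplies the yearly outage hours from the chart by the $10,000 busy-hour figure. It assumes every outage hour falls in the busy hour, so it is an upper bound on direct cost, and it ignores the indirect and intangible costs entirely. The function name is illustrative:

```python
BUSY_HOUR_REVENUE = 10_000  # dollars per hour, from the example in the text


def direct_downtime_cost(outage_hours, revenue_per_hour=BUSY_HOUR_REVENUE):
    """Direct lost revenue, assuming all downtime occurs at the given rate."""
    return outage_hours * revenue_per_hour


# Yearly outage hours taken from the availability chart.
YEARLY_OUTAGE_HOURS = {"99%": 87.6, "99.9%": 8.8, "99.99%": 0.9}

if __name__ == "__main__":
    for level, hours in YEARLY_OUTAGE_HOURS.items():
        print(f"{level}: up to ${direct_downtime_cost(hours):,.0f} per year")
```

Comparing these figures with the equipment and hosting costs of each availability level is one way to frame the tradeoff.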
AVAILABILITY AND INTERNET SERVICES
Availability is a common measure applied to Internet services. Here’s what some of these numbers mean in this space.
[Figure: Typical configuration for achieving 99% availability]
99% Availability
99% is "what you get" if you have good equipment and are reasonably careful in your operations and software practices, but don’t take exceptional steps to improve the situation. You should expect an average of 7 hours of outage per month at these levels. Beware that it can also be much worse if you are not careful about your systems software, application software, network connections, power, and all the other piece parts that make up your server.
The cost of achieving this is whatever you paid for your server(s) and the hosting facility – whether self-hosted or hosted at a co-location center (i.e. Internet data center). For example, a pair of Linux / Apache servers sitting at the end of a T1 line plugged into the wall in your basement might cost you $10,000 fully equipped, plus about $1000 / month for the line. One of the servers could be running the site, with the other available as a standby to be manually put in service if the primary unit fails.
99.9% Availability
99.9% is achievable with a single site (i.e. location) run very carefully. You can't afford many multi-hour outages due to power failure, hardware failure, software failure, or network failure; you only get 8.8 hours to play with all year. This downtime budget means you have to use reliable hardware and software, backed up by solid operations and recovery practices.
- To achieve 99.9%, power and network connections have to be redundant, as they take too long to restore if they fail; the mean-time-to-repair (MTTR) is too high for your time budget.
- Computer and network hardware have to be redundant. Faults must be detected and corrected very quickly. Fault correction can usually be either automatic or manual at this level of availability; i.e. the failover solution may be hot standby or warm standby, respectively.
- The software has to be rock solid, and deployment and changes have to be managed very carefully. New versions have to be introduced cautiously with comprehensive testing. Test and staging systems are necessary to support these practices.
[Figure: Typical configuration for achieving 99.9% availability]
Content must also be managed as carefully as the applications software. Depending on the nature of the website, a content failure may be perceived as an outage by the user. The same testing and deployment issues that affect software also apply to content.
If you do all that, you have a shot at achieving 99.9%.
The cost of achieving this level is usually multiple, redundant servers, fronted by load balancing equipment, operated in a first-rate co-location center (unless you happen to have a very well equipped data center lying around). You will also need a staging environment, separate from the production environment, on which to test your upgrades.
Our friendly Linux / Apache servers are quite capable of achieving 99.9%, so let’s stick with them, but let’s add another server to production ($5000) and put the three of them behind a server load balancer, costing anywhere from zero dollars (Open Source) up to about $40,000 for a redundant pair of hardware units. The load balancers share the load between the servers and will take a failed unit out of service automatically, giving you an N+1 redundancy scheme. You will have to move out of your basement, and expect to pay a co-location center hosting service about $2000+ per month for bandwidth and floor space. Allow another $10,000 for routers and switching hubs to glue this equipment together, and add in the cost of a separate staging environment, for another $5000 or so.
Note that this simplified discussion does not address database issues.
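The cost figures quoted for the 99.9% configuration can be tallied as follows. This is only a sketch of the arithmetic: it assumes the original $10,000 server pair from the 99% example is carried forward, and the line-item labels are illustrative. Each entry is a (low, high) range in dollars:

```python
# One-time equipment costs as (low, high) dollar ranges, taken from the text.
ONE_TIME_COSTS = {
    "original server pair": (10_000, 10_000),  # assumed carried over from the 99% setup
    "third production server": (5_000, 5_000),
    "load balancing": (0, 40_000),             # Open Source up to redundant hardware pair
    "routers and switching hubs": (10_000, 10_000),
    "staging environment": (5_000, 5_000),
}
MONTHLY_HOSTING_MIN = 2_000  # "$2000+ per month" for bandwidth and floor space


def total_range(costs):
    """Return the (low, high) totals across all line items."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return low, high


if __name__ == "__main__":
    low, high = total_range(ONE_TIME_COSTS)
    print(f"one-time: ${low:,} to ${high:,}")
    print(f"hosting: at least ${MONTHLY_HOSTING_MIN * 12:,} per year")
```

The spread between the low and high totals comes entirely from the choice of load balancing.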
[Figure: Typical configuration for achieving 99.95% availability]
99.95% Availability
99.95% is harder. You get 4.4 hours per year. All of the above applies, with the following changes:
- You are probably now beyond the reliability of a single location. The usual things that can take a site out of service are power or network related – and are likely to exceed your time budget. Disasters such as fire or flood must also be taken into account.
- Your MTTR must be reduced. Monitoring systems must be very tight. Manual failover (warm standby operation) is usually not an option.
The additional costs, over and above the 99.9% case, include doubling the production environment equipment (2 sites, assuming full processing capacity at each), and paying double the hosting services or data center occupancy. Additionally, you may wish to introduce DNS load balancing equipment (aka Wide Area Network Load Balancing or WAN Virtual Resource Management equipment), adding another $30,000 or so to the new total.
Better than 99.95% Availability
The numbers keep going up. So does the expense. Non-stop systems become necessary. So can redundant operations involving multiple vendors, e.g. co-location centers. In the limit, there are always things outside your control, like an Internet-wide routing failure, that affect you. 100% availability cannot be achieved; it can only be approached.
At these levels, the base reliability of the core Internet becomes your limiting factor. Careful study of the environments at your Internet geographies is recommended, as it makes no sense to over-engineer your sites if the pipes that feed them are less reliable than your equipment. Separately, the effects of edge failures (i.e. failures of individual access providers or other "islands") may affect significant portions of your population, causing partial outages for your customer base.
There is an objective limit to availability that is probably about 99.99% at the current state of the art, although this number would be hotly debated. Some ISPs have advertised numbers far in excess of this, but if you take the whole end-to-end solution, it is probably not possible to go much further than 99.99% using current networks and techniques.
SERVICE LEVEL AGREEMENTS
Most service providers, such as ISPs and co-location centers, use availability figures in their service level agreements. Service level agreements are not always an accurate statement of a service’s availability, for a variety of reasons.
First, often the SLA simply states the thresholds beyond which the service provider will apologize or give money back. In some cases it may be cheaper to give money back than to achieve the availability target.
Second, many SLAs limit the availability guarantee to systems directly under the service provider’s control; the overall availability to the end user may be less than that figure, given that service-affecting failures may occur outside the service provider’s service boundary.
Third, the definition of an outage often excludes short service interruptions, say those lasting less than 60 seconds. These can still affect you.
Lastly, SLAs do not typically distinguish between the time budget for routine maintenance and unexpected outages. If a service provider is guaranteeing 99.9% availability, then they may be saying that the service will be routinely unavailable up to 44 minutes per month. It depends on whether they intend to use that time budget for maintenance or simply reserve it for failures. Some of this time may be reserved for unexpected outages (unscheduled downtime), and some may be budgeted for routine system maintenance, equipment upgrades, and other scheduled uses of downtime.
Many service providers track availability statistics by month. They may give money back or otherwise apologize for missing the target availability, but they reset the counter every month. By doing it this way, service providers give themselves a clean slate each month and can worry less about annual availability targets.
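The gap between an SLA figure and what users experience can be illustrated with a month of outage records. The sketch below computes monthly availability twice: once counting every interruption, and once ignoring interruptions shorter than 60 seconds, as some SLA definitions do. The function name and the outage log are hypothetical:

```python
MINUTES_PER_MONTH = 525_600 / 12  # 43,800 minutes, one twelfth of a year


def monthly_availability(outage_seconds, min_counted_seconds=0.0):
    """Availability for one month, ignoring outages shorter than the cutoff."""
    counted = sum(s for s in outage_seconds if s >= min_counted_seconds)
    return 100.0 * (1.0 - counted / (MINUTES_PER_MONTH * 60))


# Hypothetical month: forty 45-second interruptions plus one 20-minute outage.
month_log = [45] * 40 + [1200]

experienced = monthly_availability(month_log)                      # every interruption counts
reported = monthly_availability(month_log, min_counted_seconds=60)  # short ones excluded

if __name__ == "__main__":
    print(f"experienced by users: {experienced:.3f}%")
    print(f"reported under the SLA definition: {reported:.3f}%")
```

With this log the provider meets a 99.9% target on paper while users see less than 99.9%, and the counter starts fresh the following month either way.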
CASE HISTORY – EBAY, JUNE 10 – 11, 1999
On June 10, 1999, eBay, the Internet auction house, experienced an outage lasting 22 hours. If you want to read more about this rather well publicized event, see:
http://www.news.com/News/Item/0,4,0-37718,00.html
The point to be taken away from this is that a single 22 hour outage represents 99.75% annual availability. Does that sound good to you? Not to eBay.
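The 99.75% figure can be checked by running the chart's calculation in reverse, going from outage hours to an availability percentage. A minimal sketch, assuming a 365-day year with no other outages; the function name is illustrative:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def availability_from_outage_hours(outage_hours, period_hours=HOURS_PER_YEAR):
    """Availability percentage implied by a total amount of outage time."""
    if not 0 <= outage_hours <= period_hours:
        raise ValueError("outage time must fit within the period")
    return 100.0 * (1.0 - outage_hours / period_hours)


if __name__ == "__main__":
    print(f"{availability_from_outage_hours(22):.2f}%")
```

A single 22-hour event uses up more than twice the 8.8-hour yearly budget of a 99.9% service.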
CONCLUSION
The lesson in all this is that it's hard and expensive to achieve high availability, and the specific target levels have to be considered seriously. A new service on the Internet is not likely to be better than 99% and may be much worse. 99.9% availability comes only when the system and operations have matured, and connectivity is very good. Higher levels of availability require a different architecture and a significantly greater commitment of resources. For any service, the cost / availability tradeoff needs to be considered carefully.