We're pleased you're looking at this page. Many people consider an IT provider's internal systems only as an afterthought, basing their decision purely on cost per line item or on the uptime figures in sales material.
Cost is very important, but so is the revenue lost to downtime. Our figures are not marketing or sales numbers, nor are they measured from a High Availability location: the data presented here is taken from real-world running systems, monitored over a regular ADSL connection.
The unavailability shown includes regular, planned unavailability due to maintenance and backups performed out of hours.
The RDS service on a Windows Server 2008 R2 VPS over a 31-day window. The 18 minutes of downtime is unavailability due to snapshotting for backup purposes (backups are taken outside working hours).
For Windows servers we routinely monitor:
A VPS host over a 31-day window. This is purely ICMP. The 9 minutes of downtime was caused by neither the host nor DC connectivity; it was a client-side connectivity issue.
HTTPS availability on a Linux VPS (on the same host as above). The certificate expiry date is checked too.
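As an illustration of what a certificate expiry check involves, here is a minimal Python sketch using only the standard library. It is not the check we run in Nagios; the function names and the 30-day warning threshold are assumptions made for the example.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return whole days between `now` and a certificate's notAfter date.

    `not_after` uses the format returned by getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT". A negative result
    means the certificate has already expired.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return int((expires - now) // 86400)


def fetch_not_after(host, port=443, timeout=10.0):
    """Connect over TLS and return the server certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_certificate(host, warn_days=30):
    """Return (ok, days_left); ok is False once fewer than warn_days remain.

    warn_days=30 is an assumed threshold, not a figure from this page.
    """
    days_left = days_until_expiry(fetch_not_after(host))
    return days_left >= warn_days, days_left
```

Calling `check_certificate("www.example.com")` would return a pair such as `(True, 200)` for a certificate with plenty of time left. A connection or handshake failure raises an exception, which a monitoring wrapper would treat as the service being unavailable.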
For Linux servers we routinely monitor:
We store (using RRDTool) data around
This is graphed every 5 minutes and checked by a human once a day for each server. This data helps spot anomalies and changes in usage patterns that are not picked up by other methods.
An SMTP/IMAP end-to-end relay and delivery check over a 31-day window.
A SIP trunk over a 31-day window. The test checks whether certain SIP peers are currently reachable by the Asterisk server.
As with any best-practice ISP or IT services company, we monitor our own and certain customers' servers and services. To do this we use Nagios, the self-styled but widely accepted "Industry Standard In IT Infrastructure Monitoring".
Our monitoring system watches items across 6 sites from an independent location.
We monitor to ensure we respond quickly to issues and to provide figures for Service Level Agreement purposes.
We commit to 99.5% (or better) availability, i.e. 3.6525 hours (or less) of downtime per 30.4375-day window.
To ensure someone knows what's happening 24x7x365, we send notifications out via 2 separate email infrastructures.
There is a 6 to 7 hour window every day (depending on daylight saving time) in which it is probable that no engineers are awake. If any (predetermined) critical system is noted to have an issue, our monitoring system will place a telephone call to the on-duty engineer to wake them up.
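The arithmetic behind that commitment can be checked with a short Python sketch. The function names are illustrative only. The 30.4375-day window equals 365.25 / 12, which we take to be an average calendar month, and the 18-minute figure is the RDS example quoted earlier on this page.

```python
def downtime_budget_hours(availability, window_days):
    """Hours of downtime allowed in the window at the given availability."""
    return (1.0 - availability) * window_days * 24.0


def measured_availability(downtime_minutes, window_days):
    """Fraction of the window during which the service was reachable."""
    window_minutes = window_days * 24 * 60
    return 1.0 - downtime_minutes / window_minutes


# SLA commitment: 99.5% over a 30.4375-day window.
budget = downtime_budget_hours(0.995, 30.4375)

# RDS example: 18 minutes of planned downtime in a 31-day window.
rds = measured_availability(18, 31)

print(f"SLA downtime budget: {budget:.4f} hours")
print(f"RDS example availability: {rds:.4%}")
```

The same two functions can be applied to any of the other windows shown here, for example the 9 minutes recorded against the VPS host.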
We monitor variables at 5-minute intervals. Once an alert state is determined (or no response is received), the item is checked 3 more times at 1-minute intervals before an alert is issued. The time between the initial failure and notification is thus 3 to 8 minutes.
In contrast, you will find support companies who learn that your disk is full a whole day after your database application has been failing to save changes.
We do not pay someone to watch a traffic light system (although Nagios does include one). Our alerts immediately interrupt the right responder's normal or scheduled work. Most alerts are sent to 2 people.
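The 3 to 8 minute range follows from where in the 5-minute cycle a fault begins. The Python sketch below is a simplified model of that timing, not our Nagios configuration. The function and parameter names are assumptions, and the model ignores the time each check takes to run.

```python
def notification_delay_minutes(failure_offset, check_interval=5,
                               retries=3, retry_interval=1):
    """Minutes from the start of a fault to the alert being issued.

    failure_offset is how many minutes after the last regular check
    the fault began (0 = immediately after a check,
    check_interval = immediately before the next one).
    """
    if not 0 <= failure_offset <= check_interval:
        raise ValueError("failure_offset must lie within one check interval")
    # Time until the next regular check first notices the problem.
    wait_for_detection = check_interval - failure_offset
    # Confirmation re-checks that must also fail before the alert goes out.
    confirmation = retries * retry_interval
    return wait_for_detection + confirmation


best_case = notification_delay_minutes(5)   # fault just before a check
worst_case = notification_delay_minutes(0)  # fault just after a check
print(f"Notification delay: {best_case} to {worst_case} minutes")
```

The re-checks cost 3 minutes in every case. The benefit is that a single dropped response does not wake an engineer.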