Down Time and Data Loss

by Liam McNaughton

The hard drive is the part of your server that holds all your data. You may have two hard drives in your server: one holding the operating system, and one holding your data. Either way, the risk is the same: if you rely on a single hard disk to hold your data, you run a serious risk of sudden and potentially catastrophic downtime and data loss. It is easily the biggest risk you need to manage, and the one that most IT suppliers and installers manage worst.

Anyone who works in IT knows that hard drives fail all the time. The cause may be external, such as sudden impact, electrical surges or environmental conditions (though it is debated whether heat, often considered a factor in hard drive failure, actually makes any difference), or it may be a mechanical failure inside the drive. A gradual failure degrades the drive’s performance and integrity, causing system instability and data loss. A sudden, catastrophic failure makes your system unusable immediately, and requires a rebuild and restore.

What are the chances of suffering a hard drive failure?

Well, it is difficult to get hold of accurate statistics for this. As you can imagine, hard drive manufacturers don’t readily offer this kind of information. They claim a failure rate of one or two per cent over the life of the drive. Some real-world tests put this figure much higher, at three to seven per cent. A recent study by Carnegie Mellon University of 100,000 hard drives showed a replacement rate of between two and four per cent per year, and up to 13 per cent on some systems. Customers expect their computer systems to last at least three to five years, so you can see that putting your trust in a single hard drive can be very risky indeed. Our experience varies: some recent batches of Maxtor hard drives that we used in-house have had failure rates of over 50 per cent.
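To see what a yearly rate means over a three-to-five-year service life, here is a rough sketch in Python. It assumes a constant failure rate and that each year is independent of the last, which real drives do not strictly follow. The function name is purely illustrative.

```python
def cumulative_failure_probability(annual_rate: float, years: int) -> float:
    """Chance that a single drive fails at least once within `years`.

    Assumes a constant, independent annual failure rate (a simplification).
    """
    return 1.0 - (1.0 - annual_rate) ** years


if __name__ == "__main__":
    # Annual replacement rates quoted above: 2%, 4% and 13%.
    for rate in (0.02, 0.04, 0.13):
        for years in (3, 5):
            chance = cumulative_failure_probability(rate, years)
            print(f"{rate:.0%} per year over {years} years: {chance:.1%}")
```

Under these assumptions, a four per cent annual rate works out at roughly an 18 per cent chance of failure over five years, and a 13 per cent annual rate at about 50 per cent.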

When your hard drive fails, you then have to turn to your backup, and hope that:

  1. The person responsible for changing the tapes or other media has actually been doing so
  2. The backup software is actually backing up all the data you need (when did you last check?)

Even if your backup is 100 per cent reliable, you are still going to suffer downtime while the data is restored and systems are reconfigured. And if your hard drive fails at, for example, 4.30pm, you may well lose all of that day’s data, even if last night’s backup is intact.

It is entirely possible that you have never suffered a hard drive failure. You have been lucky, so far. In four years with Dental IT, I have seen hundreds of hard drive failures. Indeed our own business customers have had hard drive failures, but none of them have suffered downtime or data loss as a result. We have not even had to turn their server off. Why not?

All our business customers with servers supplied by Dental IT have servers that support hot-swappable drives in a RAID5 configuration with one hot spare available. This means that the server does not rely on a single hard drive, and is built in anticipation of hard drive failure. The data is shared between the hard drives in such a way that if one of the hard drives fails, no data is lost, and the system stays up. The drive needs to be swapped out, but this can be done 'live', without shutting the machine down or opening it up. The faulty drive is pulled out of the front of the server, and a new drive inserted. Within minutes, the system rebuilds data onto the replacement drive, and your system is back to full strength. Furthermore, because the hot spare takes over automatically when a drive fails, our systems can actually suffer up to two hard drive failures without any downtime or data loss, provided the second failure comes after the rebuild onto the spare has finished.
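For readers curious how data can survive the loss of a drive, the following Python sketch shows the XOR parity idea that RAID5 is built on. It is a toy model, not a description of any particular controller: the block contents are made up, and a real array works on much larger blocks and rotates the parity between drives.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks, one per drive in the same stripe.
data_blocks = [b"PATIENT1", b"XRAY0042", b"INVOICE7"]

# The parity block for this stripe is stored on a fourth drive.
parity = xor_blocks(data_blocks)

# Simulate losing the second drive: rebuild its block from the survivors.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data_blocks[1]
```

Because XOR-ing all the surviving blocks with the parity block recovers the missing one, any single drive in the set can be lost without losing data.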

Are there any disadvantages to this RAID technology?

Yes.

  1. Cost. The cost is considerable. A quality server, properly specified, can cost three to five times as much as a desktop machine used "as" a server. In our opinion, if you value your data and require your systems to be up all the time, this cost is amply justifiable.
  2. Performance. Sharing the data between the drives requires additional processing power. If your RAID system is not properly specified and configured, this can lead to an enormous performance hit: write speeds can drop to a fraction of the speed of writing to a single drive, which in turn means poor performance for your users. Our servers do not suffer this problem.
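One well-known source of the write overhead in parity-based RAID is that changing a single block also means updating the parity for its stripe. The Python sketch below is a simplified model of that small-write case, with made-up block contents and illustrative function names. Real controllers use cache memory and full-stripe writes to reduce the cost.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(old_data: bytes, old_parity: bytes, new_data: bytes):
    """Return (new_parity, io_count) for overwriting one block in a stripe.

    The old data and old parity must be read, then the new data and new
    parity written: four disk operations instead of one.
    """
    new_parity = xor_bytes(xor_bytes(old_parity, old_data), new_data)
    io_count = 2 + 2  # two reads, two writes
    return new_parity, io_count


# Hypothetical stripe: two data blocks plus their parity.
block_a, block_b = b"AAAA", b"BBBB"
parity = xor_bytes(block_a, block_b)

new_parity, io_count = small_write(block_a, parity, b"ZZZZ")

# The shortcut gives the same result as recomputing parity from scratch.
assert new_parity == xor_bytes(b"ZZZZ", block_b)
```

In this model a single logical write turns into four disk operations, which is why an under-specified system can write far more slowly than a single drive.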

RAID on the cheap

Some suppliers have a grasp of the risk associated with hard drive failure, and offer some form of RAID to reassure the customer. Sadly, not all RAID systems are created equal, as we shall see.

A note about RAID1 (aka mirroring) and onboard RAID

Some cheap motherboards in generic PCs can be configured to mirror two drives. This is not RAID5 with an additional hot spare, but simply a live copy of the data from one drive to the other. The idea, of course, is that in the event of hard drive failure, the system will still boot up on the functional drive, and the faulty drive can be replaced. Sounds like a good idea? Well, such systems are better than no RAID at all, but we have worked with and tested them extensively, and have found three major problems:

  1. Corruption. If the hard drive fails suddenly, the mirror works as expected, the faulty drive is marked as failed, and the system can continue to operate. However, hard drives do not always fail suddenly. Sometimes they fail gradually, causing file system corruption or gradual data loss. A basic mirror setup on cheap systems can simply copy this corruption to the other drive, resulting in two degraded hard drives. RAID5 does not suffer this problem, as it protects the data with parity rather than mirroring.
  2. Downtime. Because the drives cannot be swapped live, the system has to be shut down for the repair, which is not the case with hot-swappable RAID5 systems. In addition, it is not always straightforward to identify which of the drives has failed. These systems typically hold the drives in caddies inside the machine, with no external signal to show which drive has failed, so it takes some time to identify and swap out the correct one.
  3. Performance. These systems do not have dedicated RAID controllers with onboard cache memory, so they rely on the host system’s resources to do the extra work. We find that read and write performance can be very poor indeed, creating a bottleneck on the network and dreadful performance for the user. Many IT suppliers do not bother testing their systems’ suitability for purpose, but create a system that offers false reassurance to the customer, while increasing their own profit margins.

The importance of monitoring and maintenance

It is vital with any system, and especially with a RAID system, that the server is monitored, locally or remotely. If a hard drive in a RAID array does fail, it needs to be replaced promptly. If it is not, you will suffer performance degradation, and the array is left exposed if more hard drives fail. Dental IT monitors all our customer networks remotely across broadband. We know within minutes if any of our customers’ servers have a hard drive failure, or indeed if any other hardware on the system fails.
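As an illustration of what an automated check might look like, here is a minimal Python sketch. It assumes a Linux server using software RAID (md), whose status is reported in the text format of /proc/mdstat. That is an assumption made for the example, not a description of the hardware RAID servers or monitoring tools discussed in this article. The sample text and the function name are made up.

```python
import re

ARRAY_LINE = re.compile(r"^(md\w+)\s*:")
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text: str) -> list[str]:
    """Return the names of md arrays that report missing or failed members."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        array = ARRAY_LINE.match(line)
        if array:
            current = array.group(1)
            continue
        status = STATUS.search(line)
        if status and current is not None:
            expected, working = int(status.group(1)), int(status.group(2))
            # "[3/2] [UU_]" means three members expected, only two working.
            if working < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


# Made-up sample in the style of /proc/mdstat (assumption: Linux md RAID).
SAMPLE = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc1[2](F) sdb1[1] sda1[0]
      1953260544 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]

md1 : active raid1 sde1[1] sdd1[0]
      976630464 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    for name in degraded_arrays(SAMPLE):
        print(f"WARNING: array {name} is degraded, replace the failed drive")
```

A check like this would be run on a schedule and wired to an alert, so that a failed drive is noticed within minutes rather than at the next site visit.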

