22 September 2005

If Government got S.M.A.R.T.

Your government cares deeply about your governance experience, and takes a strong line against corrupt politicians who impact on that experience. We looked to the thriving and successful IT industry for best practices to manage this problem, and what better to model this on than how the hard drives that hold your data are managed?

So from now on, every time a politician commits an offence, we will note this in our internal record-keeping system. After the 100th such offence, our management strategy will swoop into action.

The offending politician will then be barred from holding office within your state, and will exit our Witness Protection Program to serve in another state with a new identity and a clean record. After 100 transgressions in that state, the process will repeat, until retirement age is reached.

We see this as proof of our commitment to good governance.

It's the least we can do.

The cover-up

Consider how many layers exist that will cover up for hard drive surface defects:
  • Internal hard drive firmware "fixes" on the fly
  • NTFS driver "fixes" on the fly
  • AutoChk "fixes" automatically
  • Win98+; auto-Scandisk "fixes" automatically, no prompt
  • Win9x Scandisk accepts seconds-long per-sector retry loops as "OK"
But then, we have S.M.A.R.T. to tell us when the hard drive is about to fail... don't we?

Well, maybe and maybe not. Windows XP doesn't show any signs of S.M.A.R.T. awareness (certainly nothing as crass as some UI element you can click to query status, or Help in interpreting the results) although it's noted to fall back from aggressive Ultra DMA modes if "too many errors" are noted. BIOS can query S.M.A.R.T. as the PC boots up, but the CMOS duhfault is to disable this. S.M.A.R.T. had been around for years before XP; not sure what's taking so long there - the cynic in me says a desire to reduce support calls.

Hence the cottage industry in add-on S.M.A.R.T. utilities, either to monitor it in real time, or to query it on demand. The latter typically show full raw detail with no explanation, or a one-line result that is either "OK" or "call your hard drive vendor". Hard drive vendors often offer free diagnostics; guess which type of reporting you get?

How smart is S.M.A.R.T.?

Here's a case in point that prompted me to write this. I have a system in for troubleshooting, as it's been generally unreliable, no pattern involved. Motherboard capacitors are bad, so it goes off for repair; comes back OK and the testing begins with an overnight of RAM checking in MemTest86. That passes, so I Bart up and run HD Tune to look at the hard drive. S.M.A.R.T. says all is well, and the detail looks good; the drive temperature is fine, and doesn't increase alarmingly by the end of an uneventful surface check.

So I proceed on to my nascent "antivirus wizard", which is currently 5 different scans stapled together with log scooping (the "Bart Project" is another story for a looong day's blogging). I leave the system to carry on, and an hour or so later, it's clank-clank-clank. I power off (Bart's nice that way, you can do that if nothing's being written to disk) and proceed to data recovery.

On the nth attempt, the recovery PC starts up without POST dying in a sea of clanking retries, and I BING off C: OK (the data on this PC's already been saved off D: before taking it offsite, so salvaging the installation's the first remaining priority). But it dies a-clanking on the next soft restart on the way to what would have been file-level salvage of the huge E: and small F:

Well, S.M.A.R.T. certainly didn't see that one coming, and I can't see how I can change my SOP so as not to get caught out that way again. Image every PC before powering it up?

Business as usual

What's more alarming is the degree of grossly abnormal mileage that is accepted as "normal", even by tools such as Scandisk that purport to assess such things properly. The best tool to pick this up is DOS Mode Scandisk surface scan, because it runs without any background processes that could cause innocent delays (processor overheating is the only false-positive delay factor) and it maintains a fine-grained cluster count as progress indicator.

When that counter pauses every now and then, or even stops for a second or so at a time, you should consider that hard drive as at-risk and evacuate it before doing anything else (and yes, that includes waiting for Scandisk to finish or stop on an explicit error). This mileage correlates to "every now and then my PC stops responding completely for seconds, with mouse pointer stuck, keystrokes ignored, and HD LED on" in Windows.

The significance is that Scandisk will carry on through these latencies, and even seconds of noisy retries, without reporting any errors at all. When an event that should take a fraction of a second is accepted as normal when it takes seconds to complete, you have to wonder how "awake" S.M.A.R.T. and other such "data sentries" are.
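To make the "watch the counter for pauses" idea concrete, here is a minimal Python sketch of the same test done by stopwatch rather than by eye. It is not how Scandisk or any S.M.A.R.T. tool works internally; the function names, the 64 KB block size and the 500 ms stall threshold are all my own assumptions for illustration. Note too that reads served from the OS cache will look fast no matter how sick the drive is, so treat this as a way to reason about latency, not as a diagnostic.

```python
import time
from typing import BinaryIO, Iterable, List, Tuple


def time_reads(stream: BinaryIO, block_size: int = 64 * 1024) -> List[float]:
    """Read a stream block by block and return each read's latency in ms.

    block_size is an arbitrary (hypothetical) choice, not a Scandisk value.
    """
    latencies_ms = []
    while True:
        start = time.perf_counter()
        chunk = stream.read(block_size)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if not chunk:  # end of stream
            break
        latencies_ms.append(elapsed_ms)
    return latencies_ms


def find_stalls(latencies_ms: Iterable[float],
                stall_ms: float = 500.0) -> List[Tuple[int, float]]:
    """Return (block_index, latency_ms) for every read at or above stall_ms.

    stall_ms is an assumed threshold: a read that should take a fraction
    of a second but takes half a second or more counts as a stall.
    """
    return [(i, ms) for i, ms in enumerate(latencies_ms) if ms >= stall_ms]


def looks_at_risk(latencies_ms: Iterable[float],
                  stall_ms: float = 500.0,
                  max_stalls: int = 0) -> bool:
    """True if more stalls were seen than we are prepared to tolerate."""
    return len(find_stalls(latencies_ms, stall_ms)) > max_stalls
```

Fed a list of latencies such as [5.0, 8.0, 1200.0, 6.0, 2500.0], find_stalls picks out blocks 2 and 4, which is exactly the mileage that a pass/fail surface scan would let through without comment.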

Note these Scandisk limitations:
  • Only for FATxx volumes
  • I'd consider it unsafe beyond 137G
  • Won't check surface until file system logic is "fixed"

Still, at least it prompts interactively on each error, before "fixing" it, unlike ChkDsk.

Understanding S.M.A.R.T. detail

My hunch is that S.M.A.R.T. is something the hard drive industry reluctantly provided as a window into the closed world of internal defect management, as practiced by firmware within the hard drive itself. This may have been in response to OEM or other industry complaints.

Certainly, there's no effort to make S.M.A.R.T. information available to the end user in understandable form. I was pretty much in the dark myself, until I read the Help in one particular free S.M.A.R.T. reporting utility, which I'll link to shortly. It seems that the raw counts are subtracted from an initial "100%" or "255" value until the acceptable threshold is reached, at which point those "easy" tools will finally stop reporting "OK" and suggest you call your hard drive vendor. That threshold could be the 100th bad sector that had to be "fixed".
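As a rough illustration of that countdown, here is a small Python sketch of the simplified model just described: a starting value, a raw count subtracted from it, and a threshold at which the verdict flips. The class and field names are hypothetical, and real drives scale raw counts into these values in vendor-specific ways, so this shows the idea only, not any drive's actual arithmetic.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    """Simplified model of one S.M.A.R.T. attribute (names are hypothetical)."""
    name: str
    initial: int    # starting value, e.g. 100 or 255
    threshold: int  # at or below this, the attribute is considered failed
    raw_count: int  # e.g. number of bad sectors "fixed" so far

    @property
    def current(self) -> int:
        # Assumption: one raw event costs one point; real firmware may scale.
        return max(self.initial - self.raw_count, 0)

    @property
    def status(self) -> str:
        # The "easy" tools collapse everything into one of two answers.
        if self.current > self.threshold:
            return "OK"
        return "call your hard drive vendor"


if __name__ == "__main__":
    for bad_sectors in (0, 50, 99, 100):
        attr = SmartAttribute("Reallocated sectors", 100, 0, bad_sectors)
        print(bad_sectors, attr.current, attr.status)
```

With a starting value of 100 and a threshold of 0, the 99th "fixed" sector still reports "OK" and only the 100th flips the verdict, which is why a one-line result says so little about how far down the slope the drive already is.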

A simple S.M.A.R.T. reporter with Help that actually helps with the detail is here:

http://www.passmark.com/products/diskcheckup.htm

An excellent utility that shows S.M.A.R.T. detail, temperature, surface test and benchmarks is here:

http://www.hdtune.com/

Both of these get full marks in the Bart test, i.e. they operate as plugins from Bart PE CDR without the need to be run from writable disk, have registry stuff carried over, etc. Not only will these show you full detail, unlike many hard drive vendors' free downloads, but they will run regardless of what brand hard drive you happen to be testing.

Where do bad drives go?

Warranty replacement drives were once new, i.e. drawn from new stock; then they were "re-manufactured", or "refurbished", and the current language is "re-certified". I suspect that means blanking out the S.M.A.R.T. counters to fresh values, perhaps doing some testing and re-checking those values, then shipping as "OK". Certainly, I don't see hard drive repair gnomes in a clean room reassembling new platters and heads into old drives as a cost-effective way of "re-manufacturing" mass-produced hard drives.

If what I suspect is the case, then your warranty replacement hard drive could very well be the same drive I returned as defective. Perhaps I'll get your original drive as "re-certified"?

1 comment:

Chris Quirke said...

And today's prize for most convincing bot goes to "Anonymous", who said "I must say that without some of the information you have, would my computer be filled with spyware" before linking off to a site.

Let's see if that site attempts to drop commercial malware and/or punt a few of the 200+ fake anti-spyware scanners that are listed at...

http://www.spywarewarrior.com/rogue_anti-spyware.htm

...yup; NoAdware is punted all over the place, even on links to download good stuff like HiJackThis.

So, sorry "Anonymous"; you smell like a "bad guy" to me.